[Gossipsub] Add optional metrics #2235

AgeManning · 2021-09-17T01:38:49Z

These adds a feature flag which enables a set of metrics for gossipsub which can be helpful in analysing the mesh behaviour and general health of the network.

This doesn't fit into the libp2p metrics framework (which relies on actual behaviour messages) instead we instrument various sections of the gossipsub code to report and update a general metrics struct.

The metrics keep track of various parameters, but importantly tracks events within the mesh, such as peer churn and causes for the churn. We also keep track of IWANT and duplicates being received which are important metrics for tuning gossipsub parameters for specific networks.

For those that don't compile with the new feature gossipsub-metrics Gossipsub should remain unchanged.

* Added message count stuff * Message count maps use MeshIndex instead of PeerId * Added validated message count * Count rejected messages too.. * simplified adding new counts significantly * some cleanup and stop resetting counts * Added more metrics and cleanup * added peer slot to logging on score * added more metrics for prune cases * added churn_topic * added debug logging for churn counts * immutable borrow * fixed log * improved logging even more.. * ensure message counts structure exists * Forgot to add prune case * cleaned up debugging with macro * fixed unsubscribed condition * renamed some things for clarity * Optimized Performance with Vec Instead of BTreeMap * Updated comment for clarity * removed redundant category and optimized * fixed small edge case * Unified slot-related HashMaps into 1 Structure * Convert Fields to Array for Passing to Functions * Added Method for Iterating over Metrics * small fix * ran cargo fmt * ran clippy * One last refactor using enums. Looking great now. * Fixed bug with assert and improved error message * fixed issue with release builds * Make minimal changes to master * Added tracking of assigning slot * added slot_metrics_topics() & extra info to logs * Made slot_metrics_topics() smarter * First draft * Update metric design * Add new metric and cleanup * Cleanup code & Remove Redundant map Lookups (#147) * Simplify & Remove Redundant Map Lookups * Cleanup code/Remove Redundant lookups in heartbeat * Commented and renamed things for clarity * More renaming / reorganizing for clarity * Fixed clippy's unnecessary clone warning * one last renaming.. Co-authored-by: Mark Mackey <ethereumdreamer@gmail.com> Co-authored-by: ethDreamer <37123614+ethDreamer@users.noreply.github.com>

mxinden

Cool. Always in favor of more instrumentation. Sorry for the delay here.

Feature flag

I wonder whether the gossipsub-metrics and the metrics feature are needed. Off the top of my head, I would suggest putting all metrics related logic behind #[cfg(debug_assertions)]. In debug mode, no one would complain about the overhead of metric tracking. In release mode metric tracking would not happen. Noisiness of the metrics output in debug mode can be controlled via log levels.

Comparison to libp2p-metrics

This doesn't fit into the libp2p metrics framework

Correct. If I understand this pull request correctly, your goal is to be able to do ad-hoc debugging and not continuous monitoring, right? For the latter, I would very much like for libp2p-metrics to support gossipsub. Let me know in case you would like to collaborate on that.

As an example what is already possible today:

https://kademlia-exporter.max-inden.de/d/Pfr0Fj6Mk/rust-libp2p?orgId=1&var-data_source=Prometheus&var-instance=kademlia-exporter-ipfs%3A8080&var-instance=kademlia-exporter-kusama%3A8080&var-instance=kademlia-exporter-polkadot%3A8080&from=now-24h&to=now

mxinden · 2021-10-04T10:17:50Z

protocols/gossipsub/src/behaviour/tests.rs

+    /* For debug purposes
+    use env_logger::{Builder, Env};
+      // Add this line to relevant tests.
+      Builder::from_env(Env::default().default_filter_or("info")).init();


Why not just add let _ = env_logger::try_init(); as the first line of each test, like we do in most other crates?

Yeah, i'll modify this in the next iteration, if we decide to go down the libp2p-metrics route

AgeManning · 2021-10-04T22:44:59Z

Correct. If I understand this pull request correctly, your goal is to be able to do ad-hoc debugging and not continuous monitoring, right? For the latter, I would very much like for libp2p-metrics to support gossipsub. Let me know in case you would like to collaborate on that.

This is actually for optional continuous monitoring. I have set up a bunch of prometheus and grafana metrics for it also. When I skimmed the libp2p-metrics crate it looked like it was just attaching to messages sent to by the behaviour. I guess in order to incorporate these metrics, we'd have to add a bunch more enum types (which I guess can't hurt).

After looking at the kademlia metrics you have, I think something like that would be good for gossipsub, however i'm less interested in active connections more how the mesh and scoring is behaving. Some of the metrics added in this PR could probably be generalised to libp2p-metrics.

Anyway, yeah, would be good to collaborate, depending on how much time you have, otherwise, if you give me a very rough overview, i can go figure it out and integrate into libp2p-metrics.

mxinden · 2021-10-06T15:51:39Z

After looking at the kademlia metrics you have, I think something like that would be good for gossipsub, however i'm less interested in active connections more how the mesh and scoring is behaving. Some of the metrics added in this PR could probably be generalised to libp2p-metrics.

Great. I think you know best what information is relevant. It would need to be exposed via GossipsubEvent.

Anyway, yeah, would be good to collaborate, depending on how much time you have, otherwise, if you give me a very rough overview, i can go figure it out and integrate into libp2p-metrics.

I created a first draft at #2277. Do you want to build on top of that?

Happy to help further. Also feel free to reach out with any Prometheus or Open Metrics questions. I have been part of the maintainer team of the Prometheus project in a past life.

divagant-martian · 2021-10-06T17:39:05Z

Looking into the metrics that @AgeManning added I think those that would make sense to expose via a behaviour event are few. Maybe a middle ground would be to port these new internal metrics to open_metrics_client::registry::Registry under a feature flag (as in this PR). Not sure how to expose them to libp2p-metrics, but do you think it would be feasible somehow?

AgeManning · 2021-10-07T01:20:47Z

I've had a bit of a look further into the libp2p-metrics and wanted to raise a few things.

This PR introduces a new concept of slots which helps track how a mesh is behaving. Essentially the mesh contains a fixed number of slots and peers can enter/leave for various reasons. We can track the churn and reasons for why peers are leaving the mesh to get an idea of how the scoring is working and general behaviour of peers in and out of our meshes.

The current design is to internally collect all this data and assemble it to some consumable form, which is currently exposed via gossipsub.metrics().

There's two things to consider I think:

We dont really want all this metric overhead for users who do not care about monitoring the metrics. I went down a feature route because I didn't want to add any extra bloat to the ram in storing any state. However, it might be nicer to have these adjustable beyond compile-time and instead add them as part of the config. The metrics variable will still be initialized but will not be populated, I guess if the config is not set.
Transitioning to the libp2p-metrics design is non-trivial and potentially sub-optimal I think. From what I can tell, we'd have to be firing quite a lot of events out from the behaviour (such as everytime a message is validated, or we receive a duplicate that gets filtered) which happen very often on a network. Then somewhere in the libp2p-metrics crate we'd probably have to aggregate this data and collect it back into a more succinct useable form. I guess its a matter of where the aggregation of data happens, it seems to make more intuitive sense to do it internally in the behaviour. I guess some of the metrics can work well (ones that happen less often, like broken promises). In summary I agree with @divagant-martian that potentially we dont want to expose all the events we are tracking in the new metrics.

Curious about your thoughts on these.

mxinden · 2021-10-11T09:45:31Z

The current design is to internally collect all this data and assemble it to some consumable form, which is currently exposed via gossipsub.metrics().

What is your preferred way of exposing these metrics? Do you call gossipsub.metrics() and then pass the values to corresponding Prometheus metrics, where later on a Prometheus server scrapes some endpoint on the process? Or do you just log the values of gossipsub.metrics() to stderr?

There's two things to consider I think:

We dont really want all this metric overhead for users who do not care about monitoring the metrics. I went down a feature route because I didn't want to add any extra bloat to the ram in storing any state. However, it might be nicer to have these adjustable beyond compile-time and instead add them as part of the config. The metrics variable will still be initialized but will not be populated, I guess if the config is not set.

👍 agreed. Also since Rust features are additive, thus running two gossipsub instances would either requiring metric collection in both or none.

Transitioning to the libp2p-metrics design is non-trivial and potentially sub-optimal I think. From what I can tell, we'd have to be firing quite a lot of events out from the behaviour (such as everytime a message is validated, or we receive a duplicate that gets filtered) which happen very often on a network. Then somewhere in the libp2p-metrics crate we'd probably have to aggregate this data and collect it back into a more succinct useable form. I guess its a matter of where the aggregation of data happens, it seems to make more intuitive sense to do it internally in the behaviour. I guess some of the metrics can work well (ones that happen less often, like broken promises). In summary I agree with @divagant-martian that potentially we dont want to expose all the events we are tracking in the new metrics.

Agreed that bubbling up all the information via events is not feasible. Some middleground as suggested by @divagant-martian sounds like an approach worth exploring.

Looking into the metrics that @AgeManning added I think those that would make sense to expose via a behaviour event are few. Maybe a middle ground would be to port these new internal metrics to open_metrics_client::registry::Registry under a feature flag (as in this PR). Not sure how to expose them to libp2p-metrics, but do you think it would be feasible somehow?

I am in favor of this approach. GossipSub::new could take an Option<&mut Registry>. Users could then do:

let mut metric_registry = Registry::default();
let metrics = Metrics::new(&mut metric_registry);
let gossipsub = GossipSub::new(Some(registry::sub_registry_with_prefix("gossipsub")), [...]);

divagant-martian · 2021-10-13T17:58:33Z

let mut metric_registry = Registry::default();
let metrics = Metrics::new(&mut metric_registry);
let gossipsub = GossipSub::new(Some(registry::sub_registry_with_prefix("gossipsub")), [...]);

This would require to make Gossipsub dependent on a non-static lifetime, making the type itself no longer static, which conflicts with the requirements of NetworkBehaviour: 'static. Working on an alternative

mxinden · 2021-10-13T18:10:30Z

let mut metric_registry = Registry::default();
let metrics = Metrics::new(&mut metric_registry);
let gossipsub = GossipSub::new(Some(registry::sub_registry_with_prefix("gossipsub")), [...]);
This would require to make Gossipsub dependent on a non-static lifetime, making the type itself no longer static, which conflicts with the requirements of NetworkBehaviour: 'static. Working on an alternative

GossipSub::new would not capture the reference to Registry, but instead, only register the corresponding metrics and then drop the reference. Thus, there is no change to the lifetime requirements of the GossipSub struct itself.

Does that make sense @divagant-martian?

divagant-martian · 2021-10-13T18:15:34Z

Yes, noted too late. Thanks 👍

protocols/gossipsub/src/metrics.rs

AgeManning · 2021-12-13T05:34:52Z

Closing in favour of #2346

AgeManning and others added 3 commits September 13, 2021 08:43

Merge latest master

71e01a7

Add to changelog

e5df08a

mxinden reviewed Oct 4, 2021

View reviewed changes

Initial commit towards open metrics

6df8300

divagant-martian reviewed Oct 14, 2021

View reviewed changes

protocols/gossipsub/src/metrics.rs Outdated Show resolved Hide resolved

AgeManning and others added 4 commits October 15, 2021 17:39

Temp commit

d4daca7

merge latest upstream master

50b7bd7

allow creating gossipsub without a metrics param

0a3b256

add metrics config

20add10

divagant-martian mentioned this pull request Oct 22, 2021

Allow family to use constructors that do not coerce to function pointers prometheus/client_rust#21

Merged

divagant-martian added 5 commits October 22, 2021 15:15

add metrics config

b0b1083

tmp use local open-metrics-client

3242012

Merge branch 'master' into gossip-inspect

9ca41ca

fix example

cfd74f5

add notes and expand config

0926879

divagant-martian mentioned this pull request Oct 29, 2021

protocols/gossipsub: Add mesh metrics #2316

Merged

AgeManning closed this Dec 13, 2021

AgeManning deleted the gossip-inspect branch February 15, 2022 05:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Gossipsub] Add optional metrics #2235

[Gossipsub] Add optional metrics #2235

AgeManning commented Sep 17, 2021 •

edited

Loading

mxinden left a comment •

edited

Loading

mxinden Oct 4, 2021

AgeManning Oct 4, 2021

AgeManning commented Oct 4, 2021

mxinden commented Oct 6, 2021

divagant-martian commented Oct 6, 2021 •

edited

Loading

AgeManning commented Oct 7, 2021

mxinden commented Oct 11, 2021

divagant-martian commented Oct 13, 2021

mxinden commented Oct 13, 2021

divagant-martian commented Oct 13, 2021

AgeManning commented Dec 13, 2021

[Gossipsub] Add optional metrics #2235

[Gossipsub] Add optional metrics #2235

Conversation

AgeManning commented Sep 17, 2021 • edited Loading

mxinden left a comment • edited Loading

Choose a reason for hiding this comment

mxinden Oct 4, 2021

Choose a reason for hiding this comment

AgeManning Oct 4, 2021

Choose a reason for hiding this comment

AgeManning commented Oct 4, 2021

mxinden commented Oct 6, 2021

divagant-martian commented Oct 6, 2021 • edited Loading

AgeManning commented Oct 7, 2021

mxinden commented Oct 11, 2021

divagant-martian commented Oct 13, 2021

mxinden commented Oct 13, 2021

divagant-martian commented Oct 13, 2021

AgeManning commented Dec 13, 2021

AgeManning commented Sep 17, 2021 •

edited

Loading

mxinden left a comment •

edited

Loading

divagant-martian commented Oct 6, 2021 •

edited

Loading