
Add validator metric for missed blocks #3414

Closed
ghost opened this issue Aug 2, 2022 · 14 comments
Labels
v4.6.0 ETA Q1 2024

Comments

@ghost

ghost commented Aug 2, 2022

Description

Hello!
Currently, the validator monitoring feature allows monitoring plenty of useful things about validators in the beacon node, gathering information from the API/gossip/blockchain.
We can also monitor new blocks produced by our validators via the validator_monitor_beacon_block_total
and validator_monitor_prev_epoch_beacon_blocks_total metrics, but we can't see missed blocks. I think it would be a very useful metric for everyone: monitoring not only produced blocks but also missed blocks.

@paulhauner
Member

I think this should be a feature of the ValidatorMonitor because it already has a list of monitored validators and is set up to receive the necessary information for determining missed blocks.

The core missed-block detection logic should probably run in ValidatorMonitor::process_valid_state. There are some things we need to consider here:

  1. How do we detect a missed block (a.k.a. a "skipped slot")?
  2. How do we know who the proposer was for the skipped slot?
  3. How do we reduce false-positives?
  4. How do we de-bounce alerts?

1. Detecting missed blocks

For (1), this is fairly straightforward. The state.block_roots() field is a list of block roots for the past 8,192 slots. As a general example, let us assume that ValidatorMonitor::process_valid_state receives a state with state.slot = 32. If we want to know the block root of the block at slot 16, we can call state.get_block_root(Slot::new(16)).

As an example of detecting missed slots/blocks, let us assume there is a skipped-slot (i.e. a missing block) at slot 17. Calling state.get_block_root(Slot::new(17)) will return the exact same block root as our earlier call with Slot::new(16). Therefore, we know that slot N was skipped/missed if get_block_root(N) == get_block_root(N - 1). Here's a visual representation:

block_roots = [
  0xabc,  // Slot N+0: Skipped status unknown.
  0xdef,  // Slot N+1: Not skipped
  0x012,  // Slot N+2: Not skipped
  0x012,  // Slot N+3: Skipped
  0x012,  // Slot N+4: Skipped
  0x345   // Slot N+5: Not skipped
]
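
In code, this check might look something like the following minimal sketch (assuming the types crate's BeaconState, EthSpec and Slot; slot_was_skipped is an illustrative name, not an existing function, and the fall-through behaviour for out-of-range slots is an assumption):

use types::{BeaconState, EthSpec, Slot};

/// Hypothetical helper: returns `true` if no block was proposed at `slot`,
/// judging by the state's recent block-roots history. Assumes `slot > 0` and
/// that both `slot` and `slot - 1` fall inside the 8,192-slot window covered
/// by `state.block_roots()`.
fn slot_was_skipped<E: EthSpec>(state: &BeaconState<E>, slot: Slot) -> bool {
    match (state.get_block_root(slot), state.get_block_root(slot - 1)) {
        // A repeated block root means no block was added at `slot`.
        (Ok(root), Ok(prev_root)) => root == prev_root,
        // If either slot is outside the block-roots window, don't claim a miss.
        _ => false,
    }
}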

2. Discovering the proposer of a missed block

Regarding (2), once we know a slot was skipped, we must determine the proposer for that skipped slot. Knowing the proposer allows us to determine if the validator is a "monitored validator" and is worth creating an alert for.

Let us assume from step (1) we learned that slot N has been skipped. The naive method to find the proposer is to call state.get_beacon_proposer_index(N). However, that method is computationally expensive and should be avoided. Instead, I think we should use the BeaconProposerCache. We can see an example of it being used here:

let cached_proposer = self
    .beacon_proposer_cache
    .lock()
    .get_slot::<T::EthSpec>(shuffling_decision_root, proposal_slot);

There are two complications with using this cache. Firstly, the cache might not contain the value that we want. In this case, I think we simply log a warning and then abort the effort to determine the proposer. The cache is large enough that it shouldn't miss; I expect it will only miss when the chain is unhealthy, and I don't like kicking off expensive computations when the chain is unhealthy.

The second complication is determining the shuffling_decision_block: Hash256 to provide to BeaconProposerCache::get_slot. When state.slot() is in the same epoch as the missed slot N, we can simply call BeaconState::proposer_shuffling_decision_root. However, that function will not work when the missed slot N is in an earlier epoch. This is fairly simple to fix: we just need to re-implement BeaconState::proposer_shuffling_decision_root to use N.epoch() rather than state.current_epoch().
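
A minimal sketch of what that re-implementation could look like, assuming the proposer-shuffling decision slot for epoch E is the last slot of epoch E - 1 (the function name and exact signature are illustrative, not the final API):

use types::{BeaconState, BeaconStateError, Epoch, EthSpec, Hash256};

/// Hypothetical variant of `BeaconState::proposer_shuffling_decision_root`
/// keyed on the epoch of the missed slot rather than `state.current_epoch()`.
fn proposer_shuffling_decision_root_at_epoch<E: EthSpec>(
    state: &BeaconState<E>,
    epoch: Epoch,
    head_block_root: Hash256,
) -> Result<Hash256, BeaconStateError> {
    // The decision slot is the last slot of the previous epoch.
    let decision_slot = epoch.start_slot(E::slots_per_epoch()).saturating_sub(1u64);
    if state.slot() <= decision_slot {
        // The decision block is not yet in the block-roots history, so fall
        // back to the supplied head block root.
        Ok(head_block_root)
    } else {
        state.get_block_root(decision_slot).copied()
    }
}

The resulting root is then what we'd pass as shuffling_decision_root to BeaconProposerCache::get_slot for the missed slot.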

3. Reducing false-positives

Let's consider a simple and common fork on the beacon chain:

     |--B
A <--
     |-------C

The following is true about this chain:

  • There is a block A which is the parent of both block B and C.
  • Block C has "forked-out" block B by building atop A.
  • There are two forks, one where B exists and one where it does not.

Our ValidatorMonitor will call process_valid_state for all of these blocks (A, B and C). When it processes C it will declare there was a skipped slot at B. This is great if it turns out that C is the canonical chain, but it's a false-positive if B is the canonical chain.

This problem is hard to solve. We could completely remove false-positives by only creating alerts for finalized slots. However, finalization takes at least two epochs, so our alerting system would have a two-epoch lag (~13 mins). We are in a trade-off space between a fast alerting system and a correct alerting system.

I propose we find a happy medium: only alert for missed blocks once they are some MISSED_BLOCK_LAG_SLOTS behind state.slot(). I propose we set MISSED_BLOCK_LAG_SLOTS = 4, however I don't feel strongly about that.

This means a fork would need to span 4 slots before we start creating false-positive logs. Such forks are uncommon, and I believe false-positives are acceptable in that case. The four slots introduce a 48-second lag into our alerts, which I think is totally acceptable.

4. Debouncing alerts

When detecting missed blocks in ValidatorMonitor::process_valid_state, I think we should check a range of slots for missed blocks. I propose that we check a total of SLOTS_PER_EPOCH - MISSED_BLOCK_LAG_SLOTS slots each time (that would be 32 - 4 == 28 slots with current values). Ensuring we don't go back more than an epoch means we are less likely to get misses on the BeaconProposerCache.
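
As a rough sketch of that window (constant and function names are illustrative, and the exact boundary handling is an assumption):

/// Proposed lag before we alert on a missed block (see above).
const MISSED_BLOCK_LAG_SLOTS: u64 = 4;

/// Hypothetical helper returning the inclusive range of slots to check on each
/// call to `process_valid_state`. The newest slot trails `state_slot` by
/// `MISSED_BLOCK_LAG_SLOTS`, and the window covers
/// `slots_per_epoch - MISSED_BLOCK_LAG_SLOTS` slots (28 with current values),
/// so we never look back more than an epoch. Assumes
/// `slots_per_epoch > MISSED_BLOCK_LAG_SLOTS`.
fn slots_to_check(state_slot: u64, slots_per_epoch: u64) -> std::ops::RangeInclusive<u64> {
    let newest = state_slot.saturating_sub(MISSED_BLOCK_LAG_SLOTS);
    let oldest = newest.saturating_sub(slots_per_epoch - MISSED_BLOCK_LAG_SLOTS - 1);
    oldest..=newest
}

With the values above, that is slots state.slot() - 31 through state.slot() - 4 inclusive.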

Because we're checking a range of slots each time, it's likely that we'll discover the same missed block more than once. To avoid creating a new alert/Prometheus metric each time, I suggest that we add a HashSet<(Epoch, ValidatorIndex, Slot)> to the ValidatorMonitor that records each time we log an alert, so we can avoid re-logging it in the future.

Whenever we add a cache like this, we need to make sure we prune it; otherwise we'll create a memory leak. I propose that each time we run ValidatorMonitor::process_valid_state we get state.finalized_checkpoint().epoch and remove any entries from the HashSet that have a Slot prior to the finalized epoch.
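
A minimal sketch of the de-bouncing set and its pruning, assuming a plain u64 validator index (struct, field and function names are illustrative):

use std::collections::HashSet;
use types::{Epoch, Slot};

/// Hypothetical de-bouncing state held by the `ValidatorMonitor`.
#[derive(Default)]
struct MissedBlockAlerts {
    seen: HashSet<(Epoch, u64, Slot)>,
}

impl MissedBlockAlerts {
    /// Returns `true` only the first time a given miss is reported, so we
    /// log/alert once per (epoch, validator, slot).
    fn record(&mut self, epoch: Epoch, validator_index: u64, slot: Slot) -> bool {
        self.seen.insert((epoch, validator_index, slot))
    }

    /// Drop entries older than the finalized epoch so the set cannot grow
    /// without bound.
    fn prune(&mut self, finalized_epoch: Epoch, slots_per_epoch: u64) {
        let finalized_start_slot = finalized_epoch.start_slot(slots_per_epoch);
        self.seen.retain(|(_, _, slot)| *slot >= finalized_start_slot);
    }
}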

@v4lproik
Contributor

v4lproik commented Aug 16, 2023

Hi! I am an EPF fellow. Is there anyone working on this issue? :)

@paulhauner
Member

Hey @v4lproik, there's no one working on it yet (that I know of) and I'd love to see you take it on!

@v4lproik
Contributor

Perfect, I'll get on with it tomorrow then. Thanks!

@paulhauner
Member

Hey @v4lproik, how are you travelling with this one? Let us know if you've decided to do something else and we'll start working on this internally ☺️

@paulhauner paulhauner added the v4.5.0 ETA Q4 2023 label Aug 31, 2023
@v4lproik
Contributor

Hey @paulhauner! Sorry, it's holiday time for me, I am coming back at the end of the week and will start working on it if that's okay!

@paulhauner
Member

No problems, thanks for getting back to me. I hope you had a nice holiday!

@v4lproik
Contributor

v4lproik commented Sep 4, 2023

@paulhauner I sent you a DM on Discord last week to talk about the Ethereum Fellowship and the different LH projects I'd like to work on including this one. Hopefully you have some time to look into it. Thanks!

@michaelsproul
Member

@v4lproik Paul is on leave for the next 2 weeks. Please try our #developers channel, or send me a DM (@sproul).

@v4lproik
Contributor

v4lproik commented Sep 5, 2023

Thanks a lot Michael! I just sent you a message on Discord!

@v4lproik
Contributor

I will send a PR for this issue this week; I've made good progress.
Should I also add a chart in the dashboard showing the missed blocks per monitored validator in lighthouse metrics?
Also, do you think it'd be worth adding a CLI option that also allows LH to expose the missed finalized blocks, so we can compare and see if we have lots of false positives? Don't wanna blow up the Prometheus cardinality, though.

@michaelsproul
Member

Should I also add a chart in the dashboard showing the missed blocks per monitored validator in lighthouse metrics?

Yeah this would be great, thanks!

Also, do you think it'd be worth adding a CLI option that also allows LH to expose the missed finalized blocks, so we can compare and see if we have lots of false positives? Don't wanna blow up the Prometheus cardinality, though.

This sounds like a good idea to me if you've got time to implement it and it doesn't bloat the impl too much. Having the correct value as well as the best-effort one would be ideal. I think the metrics blow-up shouldn't be too much, particularly if we only enable the missed block metrics when the validator monitor is enabled. I don't even think we'd need to guard it behind a CLI flag.

@v4lproik
Contributor

Hi team, I would love some pointers regarding the implementation of the proposer cache. I left some comments in the draft PR.
I will add the missed finalized blocks in a second PR, as it might bloat this implementation a little too much, if that's okay. Thanks!

@paulhauner paulhauner added v4.6.0 ETA Q1 2024 and removed v4.5.0 ETA Q4 2023 labels Sep 20, 2023
@jimmygchen
Member

Completed in #4731. Thank you @v4lproik!
