
Add validator metric for missed blocks #3414

Closed
ghost opened this issue Aug 2, 2022 · 14 comments
Labels
v4.6.0 ETA Q1 2024

Comments

@ghost

ghost commented Aug 2, 2022

Description

Hello!
Currently, the validator monitoring feature allows monitoring plenty of useful things about validators in the beacon node, gathering information from the API/gossip/blockchain.
We can also monitor new blocks produced by our validators via the validator_monitor_beacon_block_total
and validator_monitor_prev_epoch_beacon_blocks_total metrics, but we can't see missed blocks. I think it would be a very useful metric for everyone: monitoring not only produced blocks but also missed blocks.

@paulhauner
Member

I think this should be a feature of the ValidatorMonitor because it already has a list of monitored validators and is set up to receive the necessary information for determining missed blocks.

The core missed-block detection logic should probably run in ValidatorMonitor::process_valid_state. There are some things we need to consider here:

  1. How do we detect a missed block (a.k.a. a "skipped slot")?
  2. How do we know who the proposer was for the skipped slot?
  3. How do we reduce false-positives?
  4. How do we de-bounce alerts?

1. Detecting missed blocks

For (1), this is fairly straightforward. The state.block_roots() field is a list of block roots for the past 8,192 slots. As a general example, let us assume that ValidatorMonitor::process_valid_state receives a state with state.slot = 32. If we want to know the block root of the block at slot 16, we can call state.get_block_root(Slot::new(16)).

As an example of detecting missed slots/blocks, let us assume there is a skipped-slot (i.e. a missing block) at slot 17. Calling state.get_block_root(Slot::new(17)) will return the exact same block root as our earlier call with Slot::new(16). Therefore, we know that slot N was skipped/missed if get_block_root(N) == get_block_root(N - 1). Here's a visual representation:

block_roots = [
  0xabc,  // Slot N+0: Skipped status unknown.
  0xdef,  // Slot N+1: Not skipped
  0x012,  // Slot N+2: Not skipped
  0x012,  // Slot N+3: Skipped
  0x012,  // Slot N+4: Skipped
  0x345   // Slot N+5: Not skipped
]
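
In code, this check might look something like the following minimal sketch (assuming the types crate's BeaconState, EthSpec and Slot; slot_was_skipped is an illustrative name, not an existing function, and the fall-through behaviour for out-of-range slots is an assumption):

use types::{BeaconState, EthSpec, Slot};

/// Hypothetical helper: returns `true` if no block was proposed at `slot`,
/// judging by the state's recent block-roots history. Assumes `slot > 0` and
/// that both `slot` and `slot - 1` fall inside the 8,192-slot window covered
/// by `state.block_roots()`.
fn slot_was_skipped<E: EthSpec>(state: &BeaconState<E>, slot: Slot) -> bool {
    match (state.get_block_root(slot), state.get_block_root(slot - 1)) {
        // A repeated block root means no block was added at `slot`.
        (Ok(root), Ok(prev_root)) => root == prev_root,
        // If either slot is outside the block-roots window, don't claim a miss.
        _ => false,
    }
}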

2. Discovering the proposer of a missed block

Regarding (2), once we know a slot was skipped, we must determine the proposer for that skipped slot. Knowing the proposer allows us to determine if the validator is a "monitored validator" and is worth creating an alert for.

Let us assume from step (1) we learned that slot N has been skipped. The naive method to find the proposer is to call state.get_beacon_proposer_index(N). However, that method is computationally expensive and should be avoided. Instead, I think we should use the BeaconProposerCache. We can see an example of it being used here:

let cached_proposer = self
    .beacon_proposer_cache
    .lock()
    .get_slot::<T::EthSpec>(shuffling_decision_root, proposal_slot);

There are two complications with using this cache. Firstly, the cache might not contain the value that we want. In this case, I think we simply log a warning and then abort the effort to determine the proposer. The cache is large enough that it shouldn't miss; I expect it will only miss when the chain is unhealthy, and I don't like kicking off expensive computations when the chain is unhealthy.

The second complication is determining the shuffling_decision_block: Hash256 to provide to BeaconProposerCache::get_slot. When state.slot() is in the same epoch as the missed slot N, we can simply call BeaconState::proposer_shuffling_decision_root. However, that function will not work when the missed slot N is in an earlier epoch. This is fairly simple to fix: we just need to re-implement BeaconState::proposer_shuffling_decision_root to use N.epoch() rather than state.current_epoch().
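
A minimal sketch of what that re-implementation could look like, assuming the proposer-shuffling decision slot for epoch E is the last slot of epoch E - 1 (the function name and exact signature are illustrative, not the final API):

use types::{BeaconState, BeaconStateError, Epoch, EthSpec, Hash256};

/// Hypothetical variant of `BeaconState::proposer_shuffling_decision_root`
/// keyed on the epoch of the missed slot rather than `state.current_epoch()`.
fn proposer_shuffling_decision_root_at_epoch<E: EthSpec>(
    state: &BeaconState<E>,
    epoch: Epoch,
    head_block_root: Hash256,
) -> Result<Hash256, BeaconStateError> {
    // The decision slot is the last slot of the previous epoch.
    let decision_slot = epoch.start_slot(E::slots_per_epoch()).saturating_sub(1u64);
    if state.slot() <= decision_slot {
        // The decision block is not yet in the block-roots history, so fall
        // back to the supplied head block root.
        Ok(head_block_root)
    } else {
        state.get_block_root(decision_slot).copied()
    }
}

The resulting root is then what we'd pass as shuffling_decision_root to BeaconProposerCache::get_slot for the missed slot.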

3. Reducing false-positives

Let's consider a simple and common fork on the beacon chain:

     |--B
A <--
     |-------C

The following is true about this chain:

  • There is a block A which is the parent of both block B and C.
  • Block C has "forked-out" block B by building atop A.
  • There are two forks, one where B exists and one where it does not.

Our ValidatorMonitor will call process_valid_state for all of these blocks (A, B and C). When it processes C it will declare there was a skipped slot at B. This is great if it turns out that C is the canonical chain, but it's a false-positive if B is the canonical chain.

This problem is hard to solve. We could completely remove false-positives by only creating alerts for finalized slots. However, finalization takes at least two epochs, so our alerting system would have a two-epoch lag (~13 mins). We are in a trade-off space between a fast alerting system and a correct alerting system.

I propose we find a happy medium: only alert for missed blocks once they are some MISSED_BLOCK_LAG_SLOTS behind state.slot(). I propose we set MISSED_BLOCK_LAG_SLOTS = 4, however I don't feel strongly about that.

This means a fork would need to span 4 slots before we start creating false-positive logs. Such forks are uncommon, and I believe false-positives are acceptable in that case. The four slots introduce a 48-second lag into our alerts, which I think is totally acceptable.

4. Debouncing alerts

When detecting missed blocks in ValidatorMonitor::process_valid_state, I think we should check a range of slots for missed blocks. I propose that we check a total of SLOTS_PER_EPOCH - MISSED_BLOCK_LAG_SLOTS slots each time (that would be 32 - 4 == 28 slots with current values). Ensuring we don't go back more than an epoch means we are less likely to get misses on the BeaconProposerCache.
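
As a rough sketch of that window (constant and function names are illustrative, and the exact boundary handling is an assumption):

/// Proposed lag before we alert on a missed block (see above).
const MISSED_BLOCK_LAG_SLOTS: u64 = 4;

/// Hypothetical helper returning the inclusive range of slots to check on each
/// call to `process_valid_state`. The newest slot trails `state_slot` by
/// `MISSED_BLOCK_LAG_SLOTS`, and the window covers
/// `slots_per_epoch - MISSED_BLOCK_LAG_SLOTS` slots (28 with current values),
/// so we never look back more than an epoch. Assumes
/// `slots_per_epoch > MISSED_BLOCK_LAG_SLOTS`.
fn slots_to_check(state_slot: u64, slots_per_epoch: u64) -> std::ops::RangeInclusive<u64> {
    let newest = state_slot.saturating_sub(MISSED_BLOCK_LAG_SLOTS);
    let oldest = newest.saturating_sub(slots_per_epoch - MISSED_BLOCK_LAG_SLOTS - 1);
    oldest..=newest
}

With the values above, that is slots state.slot() - 31 through state.slot() - 4 inclusive.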

Because we're checking a range of slots each time, it's likely that we'll discover the same missed block more than once. To avoid creating a new alert/Prometheus metric each time, I suggest that we add a HashSet<(Epoch, ValidatorIndex, Slot)> to the ValidatorMonitor that records each time we log an alert, so we can avoid re-logging it in the future.

Whenever we add a cache like this, we need to make sure we prune it; otherwise we'll create a memory leak. I propose that each time we run ValidatorMonitor::process_valid_state we get state.finalized_checkpoint().epoch and remove any entries from the HashSet that have a Slot prior to the finalized epoch.
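
A minimal sketch of the de-bouncing set and its pruning, assuming a plain u64 validator index (struct, field and function names are illustrative):

use std::collections::HashSet;
use types::{Epoch, Slot};

/// Hypothetical de-bouncing state held by the `ValidatorMonitor`.
#[derive(Default)]
struct MissedBlockAlerts {
    seen: HashSet<(Epoch, u64, Slot)>,
}

impl MissedBlockAlerts {
    /// Returns `true` only the first time a given miss is reported, so we
    /// log/alert once per (epoch, validator, slot).
    fn record(&mut self, epoch: Epoch, validator_index: u64, slot: Slot) -> bool {
        self.seen.insert((epoch, validator_index, slot))
    }

    /// Drop entries older than the finalized epoch so the set cannot grow
    /// without bound.
    fn prune(&mut self, finalized_epoch: Epoch, slots_per_epoch: u64) {
        let finalized_start_slot = finalized_epoch.start_slot(slots_per_epoch);
        self.seen.retain(|(_, _, slot)| *slot >= finalized_start_slot);
    }
}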

@v4lproik
Contributor

v4lproik commented Aug 16, 2023

Hi! I am an EPF fellow. Is there anyone working on this issue? :)

@paulhauner
Member

Hey @v4lproik, there's no one working on it yet (that I know of) and I'd love to see you take it on!

@v4lproik
Contributor

Perfect, I'll get on with it tomorrow then. Thanks!

@paulhauner
Member

Hey @v4lproik, how are you travelling with this one? Let us know if you've decided to do something else and we'll start working on this internally ☺️

@paulhauner paulhauner added the v4.5.0 ETA Q4 2023 label Aug 31, 2023
@v4lproik
Contributor

Hey @paulhauner! Sorry, it's holiday time for me, I am coming back at the end of the week and will start working on it if that's okay!

@paulhauner
Member

No problems, thanks for getting back to me. I hope you had a nice holiday!

@v4lproik
Contributor

v4lproik commented Sep 4, 2023

@paulhauner I sent you a DM on Discord last week to talk about the Ethereum Fellowship and the different LH projects I'd like to work on including this one. Hopefully you have some time to look into it. Thanks!

@michaelsproul
Member

@v4lproik Paul is on leave for the next 2 weeks. Please try our #developers channel, or send me a DM (@sproul).

@v4lproik
Contributor

v4lproik commented Sep 5, 2023

Thanks a lot Michael! I just sent you a message on Discord!

@v4lproik
Contributor

I will send a PR for this issue this week; I've made good progress.
Should I also add a chart in the dashboard showing the missed blocks per monitored validator in lighthouse metrics?
Also, do you think it'd be worth adding a CLI option that also allows LH to expose the missed finalized blocks, so we can compare and see if we have lots of false positives? Don't wanna blow up the Prometheus cardinality, though.

@michaelsproul
Member

Should I also add a chart in the dashboard showing the missed blocks per monitored validator in lighthouse metrics?

Yeah this would be great, thanks!

Also, do you think it'd be worth adding a CLI option that also allows LH to expose the missed finalized blocks, so we can compare and see if we have lots of false positives? Don't wanna blow up the Prometheus cardinality, though.

This sounds like a good idea to me if you've got time to implement it and it doesn't bloat the impl too much. Having the correct value as well as the best-effort one would be ideal. I think the metrics blow-up shouldn't be too much, particularly if we only enable the missed block metrics when the validator monitor is enabled. I don't even think we'd need to guard it behind a CLI flag.

@v4lproik
Contributor

Hi team, I would love some pointers regarding the implementation of the proposer cache. I left some comments in the draft PR.
I will add the missed finalized blocks in a second PR, as it might bloat this implementation a little too much, if that's okay. Thanks!

@paulhauner paulhauner added v4.6.0 ETA Q1 2024 and removed v4.5.0 ETA Q4 2023 labels Sep 20, 2023
@jimmygchen
Member

Completed in #4731. Thank you @v4lproik!
