Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

State reconstruction broken on Holesky due to backfill bug for skipped slots #4817

Closed
michaelsproul opened this issue Oct 9, 2023 · 2 comments
Assignees
Labels
bug Something isn't working database v4.6.0 ETA Q1 2024

Comments

@michaelsproul
Copy link
Member

Description

There's a bug in block backfill here:

// If we've reached genesis, add the genesis block root to the batch and set the
// anchor slot to 0 to indicate completion.
if expected_block_root == self.genesis_block_root {
let genesis_slot = self.spec.genesis_slot;
chunk_writer.set(
genesis_slot.as_usize(),
self.genesis_block_root,
&mut cold_batch,
)?;
prev_block_slot = genesis_slot;
expected_block_root = Hash256::zero();
break;
}

If there are skipped slots between genesis and the first block, then we will never write a block root for these slots. This breaks the forwards block roots iterator, and consequently state reconstruction. This is exactly what happens on Holesky, because slot 1 is skipped.

Version

Lighthouse v4.5.0 (and all previous versions). Observable on v4.5.0 because of Holesky. Tree-states is also broken (it actually panics).

Steps to resolve

  • We should fill in the intermediate block roots when storing the genesis block root.

Workarounds

  • Users that need state reconstruction on Holesky right now should re-sync from genesis (without checkpoint sync).
  • Once a fix is available users can do a re-sync using checkpoint sync. Existing Holesky databases that were backfilled to genesis are corrupt and should be deleted. There is no plan to provide a fix for these.
@michaelsproul
Copy link
Member Author

This is fixed on tree-states as of v4.5.444-exp.

bors bot pushed a commit that referenced this issue Oct 27, 2023
## Issue Addressed

Closes #4817.

## Proposed Changes

- Fill in the linear block roots array between 0 and the slot of the first block (e.g. slots 0 and 1 on Holesky).
- Backport the `--freezer`, `--skip` and `--limit` options for `lighthouse db inspect` from tree-states. This allows us to easily view the database corruption of 4817 using `lighthouse db inspect --network holesky --freezer --column bbr --output values --limit 2`.
- Backport the `iter_column_from` change and `MemoryStore` overhaul from tree-states. These are required to enable `lighthouse db inspect`.
- Rework `freezer_upper_limit` to allow state lookups for slots below the `state_lower_limit`. Currently state lookups will fail until state reconstruction completes entirely.

There is a new regression test for the main bug, but no test for the `freezer_upper_limit` fix because we don't currently support running state reconstruction partially (see #3026). This will be fixed once we merge `tree-states`! In lieu of an automated test, I've tested manually on a Holesky node while it was reconstructing.

## Additional Info

Users who backfilled Holesky to slot 0 (e.g. using `--reconstruct-historic-states`) need to either:

- Re-sync from genesis.
- Re-sync using checkpoint sync and the changes from this PR.

Due to the recency of the Holesky genesis, writing a custom pass to fix up broken databases (which would require its own thorough testing) was deemed unnecessary. This is the primary reason for this PR being marked `backwards-incompat`.

This will create few conflicts with Deneb, which I've already resolved on `tree-states-deneb` and will be happy to backport to Deneb once this PR is merged to unstable.
bors bot pushed a commit that referenced this issue Oct 27, 2023
## Issue Addressed

Closes #4817.

## Proposed Changes

- Fill in the linear block roots array between 0 and the slot of the first block (e.g. slots 0 and 1 on Holesky).
- Backport the `--freezer`, `--skip` and `--limit` options for `lighthouse db inspect` from tree-states. This allows us to easily view the database corruption of 4817 using `lighthouse db inspect --network holesky --freezer --column bbr --output values --limit 2`.
- Backport the `iter_column_from` change and `MemoryStore` overhaul from tree-states. These are required to enable `lighthouse db inspect`.
- Rework `freezer_upper_limit` to allow state lookups for slots below the `state_lower_limit`. Currently state lookups will fail until state reconstruction completes entirely.

There is a new regression test for the main bug, but no test for the `freezer_upper_limit` fix because we don't currently support running state reconstruction partially (see #3026). This will be fixed once we merge `tree-states`! In lieu of an automated test, I've tested manually on a Holesky node while it was reconstructing.

## Additional Info

Users who backfilled Holesky to slot 0 (e.g. using `--reconstruct-historic-states`) need to either:

- Re-sync from genesis.
- Re-sync using checkpoint sync and the changes from this PR.

Due to the recency of the Holesky genesis, writing a custom pass to fix up broken databases (which would require its own thorough testing) was deemed unnecessary. This is the primary reason for this PR being marked `backwards-incompat`.

This will create few conflicts with Deneb, which I've already resolved on `tree-states-deneb` and will be happy to backport to Deneb once this PR is merged to unstable.
@michaelsproul
Copy link
Member Author

Fixed on unstable. Will be in 4.6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working database v4.6.0 ETA Q1 2024
Projects
None yet
Development

No branches or pull requests

1 participant