
Failed block sync might leave some sync data in the main chain database #4866

Closed
SWvheerden opened this issue Oct 28, 2022 · 7 comments
Assignees
Labels
A-base_node: Area - The Tari base node executable and libraries
C-bug: Category - fixes a bug, typically associated with an issue.
P-controversial: Process - This PR or Issue is controversial and/or requires more attention than simpler issues

Comments

@SWvheerden
Collaborator

Currently, syncing is handled by the BN state machine. If the node thinks it is behind, it will try header_sync to download headers.
If it has enough VALID headers, it will commit them as headers on the main chain, and continue to do so for as long as it is provided VALID headers.
If this fails, the node goes to the waiting state.
It should not; it should either try to sync the VALID blocks from a node, or delete the headers and move on.
But in most cases this will return Ok.

The block sync has the same problem, after failing, we need to check and see if we need to reset back to our previous state.
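The recovery behaviour described above can be sketched as a tiny state-transition function. This is a minimal illustration, not the actual Tari state-machine API; the names `SyncState`, `on_sync_failure`, and `headers_committed` are all hypothetical:

```rust
// Hypothetical sketch of the desired fallback on sync failure.
// All names are illustrative, not from the Tari codebase.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SyncState {
    HeaderSync,
    BlockSync,
    Waiting,
}

// On a failed sync, either retry fetching the blocks for headers already
// committed, or fall back to Waiting only when nothing partial was committed.
// The point of the issue: never go to Waiting with dangling headers.
fn on_sync_failure(state: SyncState, headers_committed: bool) -> SyncState {
    match state {
        SyncState::HeaderSync | SyncState::BlockSync if headers_committed => {
            // We have validated headers on the main chain; try to complete
            // them with blocks rather than abandoning the sync.
            SyncState::BlockSync
        }
        _ => {
            // No partial data was committed, so Waiting is safe.
            SyncState::Waiting
        }
    }
}
```

The guard on `headers_committed` is the behaviour the issue asks for: the alternative (rolling the headers back before entering `Waiting`) would be equally valid.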

@SWvheerden
Collaborator Author

SWvheerden commented Oct 28, 2022

Rewind will store all the re-orged blocks in the orphan database.
We should just choose the highest VALID tip and make sure the main chain is that one.
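"Choose the highest valid tip" amounts to picking the candidate chain with the greatest accumulated proof of work. A minimal sketch, with hypothetical `Tip` and `best_valid_tip` names (not the Tari API):

```rust
// Illustrative only: after a rewind, select the tip with the highest
// accumulated PoW among chains the node has full local data for.
#[derive(Debug, PartialEq)]
struct Tip {
    hash: [u8; 32],
    accumulated_pow: u128, // simplified stand-in for accumulated difficulty
}

fn best_valid_tip(candidates: Vec<Tip>) -> Option<Tip> {
    // max_by_key returns None for an empty candidate set.
    candidates.into_iter().max_by_key(|t| t.accumulated_pow)
}
```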

@sdbondi
Member

sdbondi commented Oct 28, 2022

Initial reaction is to disagree. The chain metadata keeps track of the valid full block height. We already handle having a chain of valid headers but no/partial blocks. If we reset on an invalid header, we'll have to download valid headers all over again. Same for blocks. If a synced header/block is invalid after syncing others, the blocks that led up to them are still valid so why should we throw them away?

I don't think the rewinding puts the removed blocks back in the orphan database; it only updates the orphan tip hash after a reorg, which doesn't seem right. Ah, I was mistaken; this happens inside rewind_to_height:

if let Some(block) = removed_blocks.first() {

@SWvheerden
Collaborator Author

I don't think the rewinding puts the removed blocks back in the orphan database; it only updates the orphan tip hash after a reorg, which doesn't seem right.

It happens in rewind, not reorganise

txn.insert_chained_orphan(block.clone());

@SWvheerden
Collaborator Author

Initial reaction is to disagree. The chain metadata keeps track of the valid full block height. We already handle having a chain of valid headers but no/partial blocks. If we reset on an invalid header, we'll have to download valid headers all over again. Same for blocks. If a synced header/block is invalid after syncing others, the blocks that led up to them are still valid so why should we throw them away?

The header sync does this correctly: it only "swaps" if the syncing chain has higher PoW and is valid. It will even do that for a partial chain, which is correct.

But going into a waiting state can cause issues, as the database now has incomplete state while accepting new blocks.
And in LMDB, orphan blocks are referenced by hash, but in the main chain, headers are keyed by height.

So what can happen? A header sync attempts to sync a few blocks, but fails on the connection. Now the BN goes back to the waiting state, meaning it can accept blocks again. It then accepts new blocks for those same heights...
And we have a broken state.
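The height-keyed main chain makes the collision concrete: a propagated block accepted at a height where a stale sync header already sits silently replaces it, while any partial sync data that referenced it stays behind. A toy sketch (not the actual LMDB schema; `insert_header` is a hypothetical name):

```rust
use std::collections::HashMap;

// Illustrative only: the main chain indexes headers by height, so stale
// headers left by a failed sync occupy the same keys that newly propagated
// blocks will write to.
fn insert_header(chain: &mut HashMap<u64, String>, height: u64, id: &str) -> Option<String> {
    // Returns the previous entry at that height, if any. A `Some` here when
    // accepting a propagated block means we just clobbered leftover sync data.
    chain.insert(height, id.to_string())
}
```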

@SWvheerden
Collaborator Author

The same problem applies to block sync. We might download 1000 block headers, all of which are valid.
But if the first block under those headers is invalid, that chain will always be invalid, and we need to revert to our old chain.
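Since every header from the failed block onward depends on it, the fix is to truncate headers at the failure point. A hedged sketch with hypothetical names (the real implementation would also rebuild MMR state):

```rust
// Illustrative only: if block validation fails at `failed_height`, every
// header at that height and above is unusable (the MMRs no longer match),
// so drop them before reverting to the old chain.
fn prune_headers_after_failure(header_heights: &mut Vec<u64>, failed_height: u64) {
    header_heights.retain(|&h| h < failed_height);
}
```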

@sdbondi
Member

sdbondi commented Oct 28, 2022

When propagated blocks are active again and a block is received, it should recognise the block as an orphan and not base anything on the headers. This does fit with the error we were seeing, so there could definitely be a bug there.

Agree on block sync: we need to remove invalid headers up to and after a failed block (the MMRs are invalid). Maybe we need to restore the old state, but since we only switch chains for higher PoW, I don't think that is critical.

@stringhandler stringhandler added the P-controversial Process - This PR or Issue is controversial and/or requires more attention than simpler issues label Oct 31, 2022
@stringhandler
Collaborator

I don't really understand the problem or the solution, but I'm happy for you to leave this issue open and make a PR against it showing the problem.

@stringhandler stringhandler added C-bug Category - fixes a bug, typically associated with an issue. A-base_node Area - The Tari base node executable and libraries labels Oct 31, 2022
@stringhandler stringhandler added this to the Stagenet Freeze milestone Oct 31, 2022
@SWvheerden SWvheerden self-assigned this Nov 14, 2022
stringhandler pushed a commit that referenced this issue Nov 28, 2022
Description
---
If sync fails, reset the chain to the highest-PoW chain for which the node locally has the data.

Motivation and Context
---

See: #4866

How Has This Been Tested?
---
Unit tests