
[Root cause] Finality lag and slow parachain block production immediately after runtime upgrade #5738

Open
4 of 6 tasks
alexggh opened this issue Sep 17, 2024 · 8 comments

@alexggh
Contributor

alexggh commented Sep 17, 2024

What happened

Immediately after the runtime was upgraded to 1.3.0 and the upgrade was enacted, finality started lagging and parachain blocks were not being produced as usual.

The reason was that a significant number of validators were crashing with the following error:

Sep 17 08:03:13 dot1 taskset[688103]: 2024-09-17 08:03:13 Overseer exited with error err=Generated(SubsystemStalled("candidate-backing-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))

Sep 17 08:04:13 dot1 taskset[688103]: 2024-09-17 08:04:13 Task "polkadot-runtime-api-request" (Group: runtime-api) was still running with 3 instances after waiting 60 seconds to finish.
Sep 17 08:04:13 dot1 taskset[688103]: Error:
Sep 17 08:04:13 dot1 taskset[688103]:    0: Other: Essential task failed.
Sep 17 08:04:13 dot1 taskset[688103]: Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Sep 17 08:04:13 dot1 taskset[688103]: Run with RUST_BACKTRACE=full to include source snippets.

After a restart the validators worked as expected, but it took around 40 minutes for finality to catch up:

[Screenshot 2024-09-17 at 16:13:09: finality catching up after the restarts]

Additionally, because parachain collators froze or crashed as well, some of them had to be restarted. For example, on Asset Hub the collators that are still not producing any blocks and need manual intervention can be seen here: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fdot-rpc.stakeworld.io%2Fassethub#/collators

Root cause

All failure modes appear to be the same: the candidate-backing subsystem stops processing signals, and after 64 blocks (the capacity of the subsystem signal channel) the overseer decides that it is stuck, so everything is terminated and restarted.

Overseer exited with error err=Generated(SubsystemStalled("candidate-backing-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))
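
A minimal sketch of that mechanism (the 64-signal capacity matches the behaviour above, but the types, names and the 10-second timeout are illustrative placeholders, not the actual overseer/orchestra code):

```rust
use tokio::sync::mpsc;
use tokio::time::{timeout, Duration};

// One pending signal per unprocessed block; matches the "64 blocks" above.
const SIGNAL_CAPACITY: usize = 64;

#[derive(Debug)]
enum OverseerSignal {
    ActiveLeaves(u32), // "a new relay-chain block was imported"
}

#[derive(Debug)]
struct SubsystemStalled(&'static str);

fn signal_channel() -> (mpsc::Sender<OverseerSignal>, mpsc::Receiver<OverseerSignal>) {
    mpsc::channel(SIGNAL_CAPACITY)
}

// The overseer pushes one signal per imported block into the bounded channel.
// If the subsystem stops draining it (here: blocked on a runtime API call),
// the channel fills up after 64 blocks, the send no longer completes, and the
// overseer declares the subsystem stalled and tears everything down.
async fn notify_subsystem(
    tx: &mpsc::Sender<OverseerSignal>,
    signal: OverseerSignal,
) -> Result<(), SubsystemStalled> {
    timeout(Duration::from_secs(10), tx.send(signal))
        .await
        .map_err(|_| SubsystemStalled("candidate-backing-subsystem"))? // timed out
        .map_err(|_| SubsystemStalled("candidate-backing-subsystem")) // receiver dropped
}
```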

candidate-backing-subsystem seems to be waiting on calls to the runtime-api subsystem, which are taking a very long time:

Sep 17 08:04:13 dot1 taskset[688103]: 2024-09-17 08:04:13 Task "polkadot-runtime-api-request" (Group: runtime-api) was still running with 3 instances after waiting 60 seconds to finish.

This data points to the runtime call fetch_claim_queue, a new call that started being issued after the runtime upgrade because the ParachainHost API was bumped to pub const CLAIM_QUEUE_RUNTIME_REQUIREMENT: u32 = 11.
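
For illustration, a minimal sketch of how such a version-gated call works (CLAIM_QUEUE_RUNTIME_REQUIREMENT is the real constant quoted above; everything else here is a simplified placeholder, not the actual node code):

```rust
// The node checks the ParachainHost API version reported by the runtime and
// only issues the new request when the runtime is recent enough.
const CLAIM_QUEUE_RUNTIME_REQUIREMENT: u32 = 11;

struct ClaimQueue; // stand-in for the per-core claim queue the runtime returns

fn fetch_claim_queue(parachain_host_api_version: u32) -> Option<ClaimQueue> {
    if parachain_host_api_version >= CLAIM_QUEUE_RUNTIME_REQUIREMENT {
        // Only runtimes exposing ParachainHost v11+ have `claim_queue`, which
        // is why this call first started being made right after the upgrade.
        Some(call_claim_queue_runtime_api())
    } else {
        None
    }
}

fn call_claim_queue_runtime_api() -> ClaimQueue {
    ClaimQueue // stand-in for the actual runtime API request
}
```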

A probable explanation for what happened is that we were affected by limitation #64, combined with the fact that the new claim_queue runtime API uses storage data that is created by the parachains_configuration::migration::v12::MigrateToV12 migration included in the same runtime.

So the first time claim_queue was called it operated on garbage data that made it take a very long time, and the candidate-backing subsystem got stuck waiting on the API calls. That also explains why, after a restart, things recovered and worked as expected on the following blocks.
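
A minimal sketch of how reading unmigrated storage with the new type can yield garbage instead of an error (the structs are made-up simplifications, not the real HostConfiguration; the point is only the shifted field boundaries):

```rust
use parity_scale_codec::{Decode, Encode};

// Pre-migration layout, with a field the migration removes (made up here).
#[derive(Encode)]
struct OldConfig {
    group_rotation_frequency: u32,
    removed_field: u64,
    lookahead: u32,
    num_cores: u32,
}

// Post-migration layout expected by the new `claim_queue` code path.
#[derive(Decode, Debug)]
struct NewConfig {
    group_rotation_frequency: u32,
    lookahead: u32,
    num_cores: u32,
}

fn main() {
    let raw = OldConfig {
        group_rotation_frequency: 10,
        removed_field: 0xdead_beef_cafe_f00d,
        lookahead: 3,
        num_cores: 5,
    }
    .encode();

    // Decoding the old bytes with the new layout can succeed anyway: the
    // field boundaries simply shift, so `lookahead` and `num_cores` come out
    // as huge nonsense values, similar to the corrupted HostConfiguration
    // snapshot further down in this thread.
    let garbled = NewConfig::decode(&mut &raw[..]).unwrap();
    println!("{garbled:?}");
}
```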

An additional data point is that the subsystem was declared stalled by the overseer exactly 64 blocks after the block containing the runtime upgrade was imported.

Testing with chopsticks at block 22572435 confirmed that claim_queue takes a very long time to complete because of the ParachainHost state corruption.

Conclusion

The root cause is fairly well understood and has been verified, so no further impact is expected.

Action items

  • Gather more data to confirm this is the root cause.
  • Track the network to make sure that, after everyone has recovered their validators and collators, relay-chain and parachain block production is back to the levels seen before the incident.
  • Why wasn't this a problem on the Kusama runtime upgrade? Corruption happened on Kusama as well at block 24,786,390, but we got lucky/unlucky with the corruption and the claim_queue API returned faster, so there was no impact.
  • Paseo seems to have been impacted by this as well, so we should have had a way to delay the runtime upgrade.
  • Postmortem.
  • Make the subsystems more robust so they don't fail hard on this type of error; if the subsystem had not crashed and had simply moved on to the next block, we would not have seen any impact (a rough sketch of this idea follows this list).
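
A rough sketch of that last item, with placeholder types rather than the real candidate-backing code: treat a failing runtime API response for one leaf as "skip this leaf" instead of a fatal subsystem error.

```rust
#[derive(Debug)]
enum RuntimeApiError {
    Timeout,
    Execution(String),
}

type Hash = [u8; 32]; // placeholder for a relay-chain block hash

fn claim_queue_for(_leaf: Hash) -> Result<Vec<u32>, RuntimeApiError> {
    // Stand-in for the actual runtime API request.
    Err(RuntimeApiError::Timeout)
}

fn process_leaf(leaf: Hash) {
    match claim_queue_for(leaf) {
        Ok(claims) => {
            // Normal path: back candidates for the claimed cores.
            let _ = claims;
        }
        Err(err) => {
            // Instead of escalating to a fatal error (which is what triggered
            // the overseer restart here), log and move on; the next imported
            // block gets a fresh chance once the state is sane again.
            eprintln!("claim_queue failed for leaf {leaf:?}: {err:?}, skipping");
        }
    }
}
```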
@ordian
Member

ordian commented Sep 17, 2024

Make the subsystems more robust so they don't fail hard on this type of error; if the subsystem had not crashed and had simply moved on to the next block, we would not have seen any impact.

Apart from fixing #64, we could add timeouts for runtime APIs that are (slightly) lower than the subsystem stall timeout. Although this sounds brittle, it might be a bit better than letting the node hang and shut down.
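
For illustration, a minimal sketch of that idea (durations and names are placeholders; note also the caveat below that aborting the underlying wasm execution is a separate problem, so this only bounds how long the subsystem waits):

```rust
use std::future::Future;
use std::time::Duration;
use tokio::time::timeout;

// Placeholder durations: the only point is that the per-request timeout sits
// below whatever threshold makes the overseer declare the subsystem stalled.
const SUBSYSTEM_STALL_TIMEOUT: Duration = Duration::from_secs(60);
const RUNTIME_API_TIMEOUT: Duration = Duration::from_secs(50);
const _: () = assert!(RUNTIME_API_TIMEOUT.as_secs() < SUBSYSTEM_STALL_TIMEOUT.as_secs());

#[derive(Debug)]
enum RuntimeApiError {
    TimedOut,
}

// Bound how long a subsystem waits on a runtime API response, so it can
// report an error and keep draining signals instead of being killed.
async fn with_runtime_api_timeout<T>(
    request: impl Future<Output = T>,
) -> Result<T, RuntimeApiError> {
    timeout(RUNTIME_API_TIMEOUT, request)
        .await
        .map_err(|_| RuntimeApiError::TimedOut)
}
```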

@burdges

burdges commented Sep 17, 2024

Aside from issues that caused this, we've a back pressure limitation there too right? If we've 1/3rd bad backers and 1/3rd bad relay chain block producers, then we'd still have 1/9th the candidates of a full chain, plus whatever we consider essential, so that's considerable throughput.

Availability bitfields could be subject to back pressure too maybe? I suppose that's a topic for another thread.

@bkchr
Member

bkchr commented Sep 17, 2024

Although this sounds brittle, it might be a bit better than letting the node hang and shut down.

Generally the runtime is seen as trusted source code. Also, aborting is still not that easy to do with wasmtime.

@alexggh
Contributor Author

alexggh commented Sep 17, 2024

Aside from issues that caused this, we've a back pressure limitation there too right? If we've 1/3rd bad backers and 1/3rd bad relay chain block producers, then we'd still have 1/9th the candidates of a full chain, plus whatever we consider essential, so that's considerable throughput.

In this case I estimate almost all the validators restarted at around the same time; the recovery itself was slower than expected because of two things:

  1. Parachain collators got stuck as well for the same reason, so parachains would not produce blocks.
  2. When all nodes were restarted, all of them had to go through DHT peer discovery (which takes tens of minutes), PVF recompilation, and probably other restart work I'm missing, so it took a while until everything was in place for nodes to finalize blocks.

@kianenigma
Contributor

Thanks for the great writeup!

Q1:

Why wasn't this a problem on the Kusama runtime upgrade? Corruption happened on Kusama as well at block 24,786,390, but we got lucky/unlucky with the corruption and the claim_queue API returned faster, so there was no impact.

I didn't quite get why Kusama survived? The same migration is there as well.

Q2: Why does unmigrated data cause a runtime API to take an arbitrarily long amount of time? If the data is unmigrated, I would expect the API to still return immediately, and perhaps return None or Default or Err or something. But I can't see why it would cause it to just take a long time to respond?

@sandreim
Contributor

Q2: Why does unmigrated data cause a runtime API to take an arbitrarily long amount of time? If the data is unmigrated, I would expect the API to still return immediately, and perhaps return None or Default or Err or something. But I can't see why it would cause it to just take a long time to respond?

The claim_queue runtime API will use the corrupted host configuration in this piece of code:

let n_session_cores = T::AssignmentProvider::session_core_count();

Check out the values used by that code in the HostConfiguration snippet below.

"HostConfiguration"{
   "approval_voting_params":"ApprovalVotingParams"{
      "max_approval_coalesce_count":655360
   },
   "scheduler_params":"SchedulerParams"{
      "group_rotation_frequency":131072,
      "paras_availability_period":83951616,
      "max_validators_per_core":"None",
      "lookahead":100663296,
      "num_cores":1476395008,
      "max_availability_timeouts":50331650,
      "on_demand_queue_max_size":1493172224,
      "on_demand_target_queue_utilization":0%,
      "on_demand_fee_variability":50.331648%,
      "on_demand_base_fee":2658455997759531942378630683232829440,
      "ttl":134217728
   }
}

On Kusama we had different values there which made the API run faster.

@alexggh
Contributor Author

alexggh commented Sep 18, 2024

I didn't quite get why Kusama survived? The same migration is there as well.

Most likely it matters what exactly the garbage values are; on Kusama the API still returns, and you can test it with chopsticks on Kusama.

On Polkadot it gets stuck because of the really high values @sandreim posted here: #5738 (comment); lookahead and num_cores translate into a lot of loops.
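
To make that concrete, a toy sketch (the loop structure is made up, not the actual scheduler code) of why those decoded values make the call effectively never return:

```rust
// Anything that iterates per core and per lookahead step explodes when the
// decoded configuration claims ~1.5 billion cores and a lookahead of ~100
// million, versus a handful of blocks of lookahead over a few hundred cores.
fn claim_queue_work(num_cores: u64, lookahead: u64) -> u64 {
    let mut steps = 0u64;
    for _core in 0..num_cores {
        for _depth in 0..lookahead {
            steps += 1; // stand-in for "schedule one claim"
        }
    }
    steps
}

fn main() {
    // Sane values: finishes instantly.
    println!("{}", claim_queue_work(100, 3));
    // The corrupted values from the snippet above (1_476_395_008 cores,
    // 100_663_296 lookahead) would be ~1.5e17 iterations, i.e. the call
    // effectively never completes, which is what the stuck nodes observed.
    // println!("{}", claim_queue_work(1_476_395_008, 100_663_296));
}
```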

@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/2024-09-17-polkadot-finality-lag-slow-parachain-production-immediately-after-runtime-upgrade-post-mortem/10057/1
