Do not re-prepare PVFs if not needed #4211

s0me0ne-unkn0wn · 2024-04-19T09:19:58Z

Currently, PVFs are re-prepared if any execution environment parameter changes. As we've recently seen on Kusama and Polkadot, that may lead to a severe finality lag because every validator has to re-prepare every PVF. That cannot be avoided altogether; however, we could cease re-preparing PVFs when a change in the execution environment can't lead to a change in the artifact itself. For example, it's clear that changing the execution timeout cannot affect the artifact.

In this PR, I'm introducing a separate hash for the subset of execution environment parameters that changes only if a preparation-related parameter changes. It introduces some minor code duplication, but without that, the scope of changes would be much bigger.
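To make the idea concrete, here is a minimal sketch of the two-hash approach, not the PR's actual code. It assumes a simplified, hypothetical `Param` enum whose variant names mirror the ones mentioned in this PR, and it swaps in the standard library's `DefaultHasher` for whatever hash over the encoded parameters the real implementation uses. The choice of which variants enter the preparation hash follows the diff under review and is itself discussed below.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical, simplified stand-in for the executor parameter enum.
#[derive(Clone, Debug, Hash, PartialEq, Eq)]
pub enum Param {
    StackLogicalMax(u32),
    PvfPrepTimeout(u64), // milliseconds (simplified payload)
    PvfExecTimeout(u64), // milliseconds (simplified payload)
    WasmExtBulkMemory,
}

/// Hash over every parameter: changes whenever any parameter changes.
pub fn full_hash(params: &[Param]) -> u64 {
    let mut hasher = DefaultHasher::new();
    b"full".hash(&mut hasher);
    for param in params {
        param.hash(&mut hasher);
    }
    hasher.finish()
}

/// Hash over the preparation-related subset only: changes only if a
/// parameter that can influence the compiled artifact changes.
pub fn prep_hash(params: &[Param]) -> u64 {
    let mut hasher = DefaultHasher::new();
    b"prep".hash(&mut hasher);
    for param in params {
        if matches!(
            param,
            Param::StackLogicalMax(_) | Param::PvfPrepTimeout(_) | Param::WasmExtBulkMemory
        ) {
            param.hash(&mut hasher);
        }
    }
    hasher.finish()
}
```

With this split, an artifact cache keyed on `prep_hash` keeps its entries when only `PvfExecTimeout` changes, while `full_hash` still distinguishes the two parameter sets.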

TODO:

- [x] Add a test to ensure the artifact is not re-prepared if a non-preparation-related parameter is changed
- [x] Add a test to ensure the artifact is re-prepared if a preparation-related parameter is changed
- [x] Add comments, warnings, and, possibly, a test to ensure that any new parameter added to the executor environment parameters is evaluated by its author with respect to its impact on artifact preparation and added to the new hash preimage if needed.

Closes #4132

sandreim · 2024-04-21T07:15:56Z

polkadot/primitives/src/v7/executor_params.rs

+
+		let mut enc = b"prep".to_vec();
+		for param in &self.0 {
+			if matches!(param, StackLogicalMax(_) | PvfPrepTimeout(..) | WasmExtBulkMemory) {


This looks brittle with respect to adding new parameters. I would rewrite it to use !matches!, so that any newly added parameter is considered preparation-related unless we explicitly name it here as non-preparation-related.
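The difference between the two filters can be sketched as follows. This is an illustration under assumptions, not the code that was merged: the `Param` enum is a simplified stand-in, `FutureParam` is a hypothetical variant added later, and treating only the execution timeout as non-preparation-related is an assumption taken from the PR description.

```rust
/// Hypothetical, simplified stand-in for the executor parameter enum.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Param {
    StackLogicalMax(u32),
    PvfPrepTimeout(u64),
    PvfExecTimeout(u64),
    WasmExtBulkMemory,
    /// Hypothetical parameter added later by someone unaware of the filter.
    FutureParam(u32),
}

/// Allow-list: a new variant is silently treated as NOT preparation-related,
/// so a stale artifact could be reused after it changes.
pub fn affects_prep_allow_list(param: &Param) -> bool {
    matches!(
        param,
        Param::StackLogicalMax(_) | Param::PvfPrepTimeout(_) | Param::WasmExtBulkMemory
    )
}

/// Deny-list: a new variant is treated as preparation-related until someone
/// explicitly names it as irrelevant to the artifact.
pub fn affects_prep_deny_list(param: &Param) -> bool {
    !matches!(param, Param::PvfExecTimeout(_))
}
```

The deny-list fails safe: forgetting to classify `FutureParam` costs an unnecessary re-preparation at worst, whereas the allow-list would silently skip a re-preparation that might be required.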

That's an obvious improvement I overlooked, thank you for the suggestion!

Another question I'm trying to answer: does PvfPrepTimeout actually affect preparation?

That may be a bit counterintuitive, but I'm more inclined to say "no" than "yes". We have a strict pre-checking timeout, and we never re-pre-check, even if executor environment params change. The preparation timeout is lenient, and anyway, those timeouts are non-deterministic. So I don't really see much sense in re-preparing artifacts if this timeout changes (especially given that we're not going to ever decrease them). WDYT?

Let's just try to see what would happen if we say no. Let's first pick the likely more troublesome situation, where we for some reason reduce the timeout:

Artifacts are not recompiled, but a node that restarts will recompile and may now hit a timeout it did not hit before. If I am not mistaken, the node will not dispute but will simply not vote, right? Now, if that happens on just a few nodes, not too much harm is done.

Now, if this is a general problem, every time a node restarts, we have another machine not voting. So we might run into finality issues.

Problems:

Potential finality stall.

The issue will only show up gradually over time and might be hard to debug. Real issues might only appear weeks later.

Now, let's flip it and let's say "yes". Now all validators will recompile immediately, including backers. I am assuming that anything that got backed with the old artifact, will also be approved and disputed with the old artifact (this is true, right?). In that case the situation would be way better, because now we only have a single parachain that will completely cease to make progress, but all other paras and the relay chain won't be affected at all. Only the backers for that para will try to prepare and fail, as long as they are assigned.

If we assume we will only ever increase that timeout, then of course this should be fine, but is this a sound assumption? Why would we change the timeout at all?

Either we are being attacked and someone abuses those long timeouts, in which case we would need to reduce,

or we have legitimate PVFs that run into the limit, in which case we would need to increase or fix otherwise.

I couldn't say which scenario is more likely. That we never re-precheck is more a bug than anything. What we should actually be doing is re-prechecking and only enacting the new parameters once this was successful. That would also avoid the finality issues on changing the parameters.
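The re-prechecking idea above could look roughly like the sketch below. Everything in it is hypothetical: `PvfPrepSample`, `failing_under`, and `try_enact` are illustrative names, and a single measured preparation time per PVF is a strong simplification, since preparation timing is non-deterministic in practice.

```rust
/// Hypothetical measurement: how long one on-chain PVF took to prepare.
pub struct PvfPrepSample {
    pub para_id: u32,
    pub prep_time_ms: u64,
}

/// Returns the para IDs whose PVF would exceed the proposed preparation timeout.
pub fn failing_under(proposed_timeout_ms: u64, samples: &[PvfPrepSample]) -> Vec<u32> {
    samples
        .iter()
        .filter(|sample| sample.prep_time_ms > proposed_timeout_ms)
        .map(|sample| sample.para_id)
        .collect()
}

/// Enacts the proposed timeout only if every existing PVF still passes;
/// otherwise reports which paras block the change.
pub fn try_enact(
    proposed_timeout_ms: u64,
    samples: &[PvfPrepSample],
) -> Result<u64, Vec<u32>> {
    let failing = failing_under(proposed_timeout_ms, samples);
    if failing.is_empty() {
        Ok(proposed_timeout_ms)
    } else {
        Err(failing)
    }
}
```

The point of the gate is that a parameter change that would break an existing PVF is rejected before enactment instead of surfacing later as missing votes.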

> the node will not dispute, but simply don't vote - right?

Yes, I believe so. Although we treat the preparation timeout as a deterministic error (as it hasn't failed during the pre-check), we don't dispute.

> I am assuming that anything that got backed with the old artifact, will also be approved and disputed with the old artifact (this is true, right?).

We should always use execution parameters from the session where the candidate was produced, so yes, they should always be executed with the same artifact.

> we only have a single parachain that will completely cease to make progress

Why do you think it's only a single parachain? The timeout is per-network per-session, so after passing the session boundary where this parameter change is enacted, all the artifacts should be re-prepared, as we saw during the Kusama and Polkadot incidents.

> If we assume we will only ever increase that timeout, then of course this should be fine, but is this a sound assumption?

I remember we talked about decreasing the timeouts as not being safe, as it may make PVFs that are already on-chain fail. At the very least, we should have some tooling that could be used to check all the already-existing PVFs not to fail with the new set of executor parameters on the reference hardware (I think @ordian was working on that one?)

> At the very least, we should have some tooling that could be used to check all the already-existing PVFs not to fail with the new set of executor parameters on the reference hardware (I think @ordian was working on that one?)

We do have the tooling, but currently it's not integrated into the release process. But even if it were used to check the current PVFs at the time of the release, one could imagine an attacker uploading a new PVF (with an on-demand core) right after the release that is on the edge of the old time limit. So it doesn't solve the problem by itself.

> So it doesn't solve the problem by itself.

Rerunning pre-checking before enactment would.

> Why do you think it's only a single parachain

Not necessarily a single one, but most certainly not all of them. Otherwise we would have been really stupid with the parameters.

> I remember we talked about decreasing the timeouts as not being safe, as it may make PVFs that are already on-chain fail.

True, but I think this should be fixed. Without pre-checking we get a finality stall; that alone is reason enough to do it properly and not enact parameters that have not been checked. Assuming we want to change these parameters at all, why would we only ever want to increase? Most likely we need to increase to mitigate some issue, but would actually want to decrease again later, once it is fixed. Machines also get faster and compilers get better, so the chances that we actually want to decrease at some point are likely higher than the chances that we would want to increase.

In other words, by saying "no" we dig ourselves deeper into a solution that is actually not sound.

> We should always use execution parameters from the session where the candidate was produced, so yes, they should always be executed with the same artifact.

But that's not true, is it?

In approval voting, we fetch executor params based on the block in which the candidate is included: https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/core/approval-voting/src/lib.rs#L2967

In backing, we fetch them based on the relay_parent:
https://github.com/paritytech/polkadot-sdk/blob/master/polkadot/node/core/backing/src/lib.rs#L666

I don't think there is anything preventing the two relay chain blocks from being in different sessions at the boundary, especially with async backing.
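The mismatch described here can be illustrated with a toy model. This is only a sketch under loud assumptions: sessions are modelled as fixed-length block ranges, and `Chain`, `Candidate`, `backing_params`, and `approval_params` are hypothetical names that do not correspond to the real subsystem APIs linked above.

```rust
pub type BlockNumber = u32;

/// Hypothetical per-session executor parameters (execution timeout only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ExecutorParams {
    pub exec_timeout_ms: u64,
}

/// Toy chain: fixed-length sessions, one parameter set per session index.
pub struct Chain {
    pub session_length: BlockNumber,
    pub params_by_session: Vec<ExecutorParams>,
}

impl Chain {
    /// Parameters in force in the session that contains `block`.
    pub fn params_at(&self, block: BlockNumber) -> Option<ExecutorParams> {
        let session = (block / self.session_length) as usize;
        self.params_by_session.get(session).copied()
    }
}

pub struct Candidate {
    pub relay_parent: BlockNumber,
    pub included_in: BlockNumber,
}

/// Lookup keyed on the relay parent, as described for backing.
pub fn backing_params(chain: &Chain, candidate: &Candidate) -> Option<ExecutorParams> {
    chain.params_at(candidate.relay_parent)
}

/// Lookup keyed on the including block, as described for approval voting.
pub fn approval_params(chain: &Chain, candidate: &Candidate) -> Option<ExecutorParams> {
    chain.params_at(candidate.included_in)
}
```

In this model, a candidate whose relay parent is the last block of one session and which is included in the first block of the next gets two different parameter sets from the two lookups, which is exactly the boundary case in question.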

> But that's not true, is it?

Hmmm, that may be an important find... If that does not hold, it should be fixed. The very purpose of the executor parameters is to always use the same set of parameters with the same candidate.

AndreiEres

lgtm

s0me0ne-unkn0wn added 4 commits April 19, 2024 10:36

Implement separate hash for preparation-related parameters

e96e9b5

Add tests

8897d8d

Add more tests and comments

065aac2

Merge remote-tracking branch 'origin/master' into s0me0ne/no-pvf-repr…

a4ec27c

…epare-if-non-needed

s0me0ne-unkn0wn added the T8-polkadot label (This PR/Issue is related to/affects the Polkadot network.) Apr 20, 2024

s0me0ne-unkn0wn marked this pull request as ready for review April 20, 2024 14:41

s0me0ne-unkn0wn requested review from koute, eskimor, AndreiEres, alexggh and sandreim April 20, 2024 14:41

s0me0ne-unkn0wn commented Apr 20, 2024

polkadot/primitives/src/v7/executor_params.rs (outdated, resolved)

sandreim reviewed Apr 21, 2024

s0me0ne-unkn0wn added 2 commits April 21, 2024 16:17

Address reviews

b6f5ecb

Add prdoc

9c1778a

sandreim reviewed Apr 22, 2024

polkadot/primitives/src/v7/executor_params.rs (resolved)

sandreim approved these changes Apr 23, 2024

alexggh approved these changes Apr 23, 2024

polkadot/node/core/pvf/src/artifacts.rs (resolved)

AndreiEres approved these changes Apr 23, 2024

polkadot/primitives/src/v7/executor_params.rs (outdated, resolved)

s0me0ne-unkn0wn added 2 commits April 24, 2024 16:49

Address discussions

7ab8ba4

Merge remote-tracking branch 'origin/master' into s0me0ne/no-pvf-repr…

0c6afcd

…epare-if-non-needed

s0me0ne-unkn0wn added this pull request to the merge queue Apr 25, 2024

Merged via the queue into master with commit c26cf3f Apr 25, 2024
141 of 143 checks passed

s0me0ne-unkn0wn deleted the s0me0ne/no-pvf-reprepare-if-non-needed branch April 25, 2024 10:40

s0me0ne-unkn0wn mentioned this pull request Apr 25, 2024

Candidate should always be executed with executor parameters from the session it was produced in #4292

Open

peterwht mentioned this pull request Aug 15, 2024

chore: upgrade to 1.12.0 r0gue-io/pop-node#193

Merged

This was referenced Aug 21, 2024

Update polkadot-sdk from v1.11.0 to stable2407 moondance-labs/tanssi#659

Open

Update polkadot-sdk from v1.11.0 to stable2407 moonbeam-foundation/moonbeam#2912

Open

sandreim Apr 21, 2024

s0me0ne-unkn0wn Apr 21, 2024 (edited)

eskimor Apr 24, 2024

s0me0ne-unkn0wn Apr 24, 2024

ordian Apr 24, 2024 (edited)

eskimor Apr 25, 2024

alexggh Apr 25, 2024

s0me0ne-unkn0wn Apr 25, 2024

AndreiEres left a comment
