Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error handling for threaded verification errors #1748

Merged
merged 7 commits into from
Apr 11, 2024

Conversation

cryptonemo
Copy link
Collaborator

For synth porep, we switch to threaded/parallel proof verification as an optimization. This takes it a step further and now allows errors to be propagated/returned properly.

Note: There is a FIXME in the code that will be removed after it's properly tested via filecoin-ffi testing.

@cryptonemo
Copy link
Collaborator Author

Looks like all the CI failures are 'Infrastructure failures' and I noticed I can no longer restart failed jobs. Everything passes locally.

Copy link

@magik6k magik6k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't comm_d_proof_inner.validate and the comm_r last asserts above fail in a similar way if the trees are corrupted?

@cryptonemo
Copy link
Collaborator Author

Can't comm_d_proof_inner.validate and the comm_r last asserts above fail in a similar way if the trees are corrupted?

Yes, there are other cases. One thing (for testing) at a time. Once wired through FFI and I'm sure it's propagated properly, adding in the others are straight-forward.

@cryptonemo
Copy link
Collaborator Author

Good news is that everything worked wired through locally. I'll look into the other cases.

Copy link
Contributor

@vmx vmx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code reports the most recent error only. A possible change would be to have a vector of errors and pushing them into a vector. Once done print the contents of the whole vector if it's non-empty.

storage-proofs-porep/src/stacked/vanilla/proof.rs Outdated Show resolved Hide resolved
storage-proofs-porep/src/stacked/vanilla/proof.rs Outdated Show resolved Hide resolved
@cryptonemo
Copy link
Collaborator Author

The current code reports the most recent error only. A possible change would be to have a vector of errors and pushing them into a vector. Once done print the contents of the whole vector if it's non-empty.

That's incorrect. Look closer.

@vmx
Copy link
Contributor

vmx commented Apr 8, 2024

The current code reports the most recent error only. A possible change would be to have a vector of errors and pushing them into a vector. Once done print the contents of the whole vector if it's non-empty.

That's incorrect. Look closer.

Oh right. It's on the first error it encounters. That's probably good enough.

@vmx
Copy link
Contributor

vmx commented Apr 8, 2024

The current code reports the most recent error only. A possible change would be to have a vector of errors and pushing them into a vector. Once done print the contents of the whole vector if it's non-empty.

That's incorrect. Look closer.

Oh right. It's on the first error it encounters. That's probably good enough.

I re-read the code again. Now I think it collects all errors. But wow, there must be a way to make all this easier to follow.

@cryptonemo
Copy link
Collaborator Author

Updated the error handling on all 4 of the threaded verifiers added for synth porep proofs. Wasted some cycles looking into it and of the 4 verifiers that were made threaded for consistency, only 1 (or 2) had the most perf impact. I recall this now, but again, the parallelism was added across the board for consistency, and now we consistently propagate errors across them all. There's also still a FIXME in place (on purpose) and some additional testing is needed (via my local api and ffi branches, soon to be PRs).

@cryptonemo cryptonemo changed the title Improve error handling for encoding proof verification errors Improve error handling for threaded verification errors Apr 8, 2024
@@ -140,8 +152,7 @@ impl<'a, Tree: 'static + MerkleTreeTrait, G: 'static + Hasher> StackedDrg<'a, Tr
Challenges::Synth(synth_challenges) => {
// If there are no synthetic vanilla proofs stored on disk yet, generate them.
if pub_inputs.seed.is_none() {
info!("generating synthetic vanilla proofs in a single partition");
assert_eq!(partition_count, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I'm not seeing why this one assert was removed, but I won't let it delay approval.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvm, I'm looking at #1750 now.

Copy link
Collaborator Author

@cryptonemo cryptonemo Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I'm not seeing why this one assert was removed, but I won't let it delay approval.

Adding it was an error. The partition count is only 1 for test sectors in porep, and the logging comment implied the incorrect thing (i.e that the single pass of synth porep proof generation implies that the partition count must be 1, which is wrong; it just meant we were not generating synth porep proofs with regard to challenges being partition specific like in other places).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^Fortunately, it was never released with that assert. It came up quickly in overall testing while looking into this issue though.

@cryptonemo cryptonemo merged commit 3f018b5 into master Apr 11, 2024
32 checks passed
@cryptonemo cryptonemo deleted the propagate-invalid-encoding-proof branch April 11, 2024 17:07
vmx added a commit that referenced this pull request Apr 11, 2024
Instead of using asserts, return proper errors. As this is non-trivial
with yastl, rely solely on rayon for parallelism. The performance
characteristics are the same (it's maybe slightly, negligible faster).

Replaces #1748.
@vmx vmx mentioned this pull request Apr 11, 2024
vmx added a commit that referenced this pull request Apr 15, 2024
Asserts in a yastl thread pool can lead to things hanging, instead of
proper panicking. Spawning the validation was a performance optimization,
but it turns out it's not needed. Hence we can just remove the thread
pool from there. This has also the advantage that we don't have two
different threads pools (from Rayon and yastl) fighting for the same
resources. Also the number of threads for the Rayon thread pool can be
bound with an environment variable (the yastl one not), so there's more
control for running the operations.

There was another instance of using the yastl thread pool for verifying
the data before writing the Synthetic PoReps to disk. It again has the
problem with the assert. With using Rayon instead, we again have the
advantage of a more controllable thread pool and it's also slightly
faster.

Replaces #1748.
vmx added a commit that referenced this pull request Apr 15, 2024
#1748 fixed a problem
where errors (asserts) where not properly propagated upwards. There were
threads just hanging without making any progress.

I usually follow the development paradigm of "Make it work, make it right,
make it fast" [1], where I interpret the "make it right" as "make it simple".
With "simple" I mean things like clear code, easy to follow, no new paradigms,
following the style of the existing code base.

I'd like to be more concrete about that for this change, on why I find this
version "simpler".

The core issue was an assert within a spawned thread in a yastl pool. Someone
new to the code base coould re-introduce such a problem easily, hence it would
be best if we can actually prevent that. With removing the yastl thread pool,
where it's not really needed for better performance, we can easily prevent
that.

We see three difference uses of the yastl thread pool within the `proof.rs`.
One is using it to pipeline operations, another one is for a highly parallizing
an operation. Usually for data parallism we use Rayon in this code base. In
the pipelining use case, there is Rayon used within pipeline. The yastl thread
pool will spawn only a limited set of threads, so this looks alright. For the
highly parallelised case, we only use yastl and not Rayon (although we should
look into just Rayon instead).

The third use, where we mix yastl and Rayon for highly parallel operations is
removed with this PR. This is intentional. Having two thread pools, which by
default use as many threads as cores could easily overprovision a system. This
could potentially lead to unintended slow downs. Another benefit of just using
Rayon is, that the number of threads can be controlled with an environment
variable. This gives more control when several instances of this code is run
in parallel, which is the case for some storage providers.

Switching back from errors to asserts. I don't know the reason, why the
asserts were changed to errors. Hence I'm switching it back to asserts, as
now it's easily possible. If one looks at the diff between the version prior
to #1728 and this, then the diff is pretty minimal an straight forward. Also
one verification again runs only in debug mode and not also in release mode.
Though if errors are desired, they can be easily be introduce by switching the
asserts to anyhow's `ensure!()` macro.

Following the error handling needs less context. With this change, the asserts
are happening in the code right away, it works the way people coding Rust are
used to. Prior to this change, more context is needed. Taking the "invalid
comm_d"-error as an example. It happens on line 456 [2]. Now reading through
the code what it means: first look for the `invalid_comm_d` variable. It
defines an instance of the `InvalidChallengeCoordinate` struct, which we take
a quick look at and see that's a local one, for specifcally error handling.
That instance is wrapped in an Arc which is interesting at a first glance,
as in Rust shared state is usually tried to be avoided where possible. When
looking into the usage of `invalid_comm_d` it becomes clear that we need the
`Arc`, as the might assign a different value to it in the error case. We need
a mutable reference here, but again usually in Rust immutable types are
preferred. So if we can avoid the `Arc` as well some mutability, it's a win
for being more Rust idiomatic, hence easier for people familiar with Rust.

For me that falls under the "code is written once, but read many times"
category. So making it easy to read is a win.

In the lower part of the change, the yastl usage for high data parallelism is
removed in favour of Rayon, see above for some of the reasons. Also using
Rayon here seems to use the thread pool more efficiently, at least on the
machine I've tested it on (with 64 threads).

When looking at the diff between this change and the commit prior to #1748

    git diff 8f5bd86.. -w -- storage-proofs-porep/src/stacked/vanilla/proof.rs

Then the changes are very minimal, which I also count as a sign for being
"simpler".

[1]: https://wiki.c2.com/?MakeItWorkMakeItRightMakeItFast
[2]: https://github.com/filecoin-project/rust-fil-proofs/blob/3f018b51b6327b135830899d237a7ba181942d7e/storage-proofs-porep/src/stacked/vanilla/proof.rs#L456-L457
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants