
Create binaries to run phases independently #1678

Open
vmx opened this issue Mar 9, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@vmx
Contributor

vmx commented Mar 9, 2023

Description

Currently the Filecoin proofs are consumed by Lotus via the FFI as a single library. In addition to the library use case, the idea is to provide separate binaries for each phase (or perhaps something even more fine-grained, but as a start the phases should be fine). This would serve several needs that have come up in the past (a rough sketch of what such binaries could look like follows the list):

  • Easier testing: Let's say you have a bug that only shows up when unsealing previously sealed data (like 64GB Lifecycle Test Failure #1647). The whole process takes many hours. If you could run most of the phases only once, and then re-run parts of the pipeline to narrow down the issue, it could save a lot of time.
  • Benchmarking:
    • Improving current setups: you either want to benchmark a code change or a different hardware setup. Currently you'd run the whole process and then dig through the logs to see how long things took. With separate binaries, you could prepare things up to the specific step you want to benchmark and only iterate on that (also brought up at Benchy cannot run PC1 only #1676 (comment)).
    • Comparing to other implementations: Recently Supranational published a PC2 implementation as a standalone binary. To compare it with the current implementation, it would need to be integrated into the current code base. If there were binaries already, you could run the PC2 binary on the same input data and compare the results.
  • Flexible deployments: You might want to orchestrate the sealing process with your own tools. Currently that is an engineering task that requires Rust knowledge, as you'd need to call directly into the Rust code if you want to run specific pieces or add additional monitoring. With separate binaries it becomes more of a dev-ops kind of problem, where you can build tooling around them.
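
To make this more concrete, here is a minimal sketch of what such a per-phase CLI could look like. Everything in it is hypothetical: the proofs-phase binary name, the flags and the run_* helpers are placeholders, and a real implementation would call into the existing filecoin-proofs sealing APIs instead of printing.

// Hypothetical sketch of a per-phase binary (names and flags are made up).
// A real version would call the filecoin-proofs sealing APIs in each arm.
use anyhow::Result;
use clap::{Parser, Subcommand};
use std::path::PathBuf;

#[derive(Parser)]
#[command(name = "proofs-phase")]
struct Cli {
    #[command(subcommand)]
    phase: Phase,
}

#[derive(Subcommand)]
enum Phase {
    /// AddPiece: write the "unsealed" sector file
    Ap { #[arg(long)] sector: u64, #[arg(long)] out_dir: PathBuf },
    /// PreCommit phase 1: generate the layer files in the cache dir
    Pc1 { #[arg(long)] sector: u64, #[arg(long)] cache_dir: PathBuf },
    /// PreCommit phase 2: build the trees and the sealed replica
    Pc2 { #[arg(long)] sector: u64, #[arg(long)] cache_dir: PathBuf },
    // C1, C2 and Unseal would follow the same pattern
}

fn main() -> Result<()> {
    match Cli::parse().phase {
        Phase::Ap { sector, out_dir } => run_ap(sector, &out_dir),
        Phase::Pc1 { sector, cache_dir } => run_pc1(sector, &cache_dir),
        Phase::Pc2 { sector, cache_dir } => run_pc2(sector, &cache_dir),
    }
}

// Placeholder implementations; the real ones would wrap the library calls.
fn run_ap(sector: u64, out_dir: &PathBuf) -> Result<()> {
    println!("AP for sector {sector}, writing to {}", out_dir.display());
    Ok(())
}
fn run_pc1(sector: u64, cache_dir: &PathBuf) -> Result<()> {
    println!("PC1 for sector {sector}, cache at {}", cache_dir.display());
    Ok(())
}
fn run_pc2(sector: u64, cache_dir: &PathBuf) -> Result<()> {
    println!("PC2 for sector {sector}, cache at {}", cache_dir.display());
    Ok(())
}

The important property is that each subcommand reads its input from disk and leaves its output on disk, so the steps can be chained, re-run or benchmarked in isolation.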

Acceptance criteria

There are binaries that can be run in sequence to do the full lifecycle of sealing and unsealing a sector.

Risks + pitfalls

It may lead to refactorings in case the current internal APIs don't fit. Though I see that as a good thing, as the APIs should already be flexible enough to make this work.

Where to begin

benchy already partly supports running certain phases only, but it's not that flexible and has known issues.

@cryptonemo cryptonemo added the enhancement New feature or request label Mar 9, 2023
@cryptonemo
Collaborator

To be clear, Proofs is a library and will remain that way. Binaries would be an enhancement, using the library.

@vmx
Contributor Author

vmx commented Mar 9, 2023

To be clear, Proofs is a library and will remain that way. Binaries would be an enhancement, using the library.

Thanks for calling this out. I've changed the first paragraph to make this clearer.

@RobQuistNL

This would be an awesome feature to have - it would greatly help with benchmarking separate stages and working on improvements.

It would be very nice to have a way to validate that the result of the benchmark is correct, too. Not sure if that's inherently possible as we're skipping some steps though.

Example would be:

cargo run --bin benchy -- single-step -- ap --sectornumber 123 --size 512MiB --result /mnt/benchfiles # Generates "unsealed" sector file (/mnt/benchfiles/unsealed/123/)
cargo run --bin benchy -- single-step -- pc1 --sectornumber 123 --result /mnt/benchfiles # Uses the "unsealed" sector file from the AP step, generates the layer files in the "cache" folder (/mnt/benchfiles/cache/123/) (if I'm not mistaken, PC1 in lotus-worker stores it there too)
cargo run --bin benchy -- single-step -- pc2 --sectornumber 123 --result /mnt/benchfiles # Uses the layer files from the PC1 step, generates its files in the "cache" folder (/mnt/benchfiles/cache/123/) (if I'm not mistaken, PC2 in lotus-worker stores it there too)

and so on for C1 / C2

@lovel8
Contributor

lovel8 commented Apr 11, 2023

@vmx It is recommended to support the following functional requirements:

  1. For performance testing
    • Add configuration support for the total number of task cycles to execute, to verify the stability of the program run and of the computation efficiency.
    • Add support for configuring the number of concurrent task executions in each stage, e.g. 30 P1s and 4 P2s concurrently, to adapt to real system resources (CPU, GPU, memory limitations) and achieve maximum resource utilization.
    • Add a statistics log of the maximum system resource usage during a run (e.g. CPU, GPU, memory) for analysis and optimization.
  2. For problem diagnosis
    Add support for re-running benchy from the phase where Lotus panicked (e.g. P2), to reproduce and debug the problem.

@vmx
Contributor Author

vmx commented Apr 11, 2023

  1. For performance testing
    • Add configuration support for the total number of task cycles to execute, to verify the stability of the program run and of the computation efficiency.
    • Add support for configuring the number of concurrent task executions in each stage, e.g. 30 P1s and 4 P2s concurrently, to adapt to real system resources (CPU, GPU, memory limitations) and achieve maximum resource utilization.
    • Add a statistics log of the maximum system resource usage during a run (e.g. CPU, GPU, memory) for analysis and optimization.

Those are probably out of scope. The idea is to have binaries, so that you can build those tools on top of them. You could create your own runners that do exactly the testing you need (see the sketch at the end of this comment).

2. For problem diagnosis
Add support for re-running benchy from the phase where Lotus panicked (e.g. P2), to reproduce and debug the problem.

Yes, ideally it should be possible to run just a certain step on the data you already have.
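
For the concurrency part, a custom runner on top of per-phase binaries could be as small as the sketch below. It is only an illustration: proofs-phase and its flags are the hypothetical names from the earlier sketch, not an existing CLI.

// Sketch of a custom runner on top of hypothetical per-phase binaries.
// It runs PC1 for sectors 0..30, with at most 4 child processes at a time.
use std::process::{Child, Command};

fn spawn_pc1(sector: u64) -> std::io::Result<Child> {
    // "proofs-phase" and its flags are placeholders, not an existing CLI.
    Command::new("proofs-phase")
        .arg("pc1")
        .arg("--sector")
        .arg(sector.to_string())
        .arg("--cache-dir")
        .arg("/mnt/benchfiles/cache")
        .spawn()
}

fn main() -> std::io::Result<()> {
    let sectors: Vec<u64> = (0..30).collect();
    for batch in sectors.chunks(4) {
        // Start one child process per sector in the batch ...
        let children = batch
            .iter()
            .map(|s| spawn_pc1(*s))
            .collect::<Result<Vec<Child>, _>>()?;
        // ... and wait for the whole batch before starting the next one.
        for mut child in children {
            let status = child.wait()?;
            if !status.success() {
                eprintln!("a PC1 job failed with {status}");
            }
        }
    }
    Ok(())
}

Cycle counts and resource monitoring could be layered on top in the same way, without any changes to the proofs library itself.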

@vmx
Contributor Author

vmx commented Jun 21, 2023

Some of the requirements re-formulated as user stories:

As a storage provider I'd like to

  • be able to write my own workflow/scheduling/custom solution, so that I can reach better resource utilization.
  • be able to stop and resume jobs, so that I can reach better resource utilization (see the sketch at the end of this comment).
  • have more fine-grained control over which parts of the proving pipeline are run at which point in time, so that I can optimize for different deal priorities.

If anyone has more, please share them here.
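
On the stop/resume point: since every phase leaves its artifacts on disk, a scheduler can decide whether a step still needs to run just by looking at the files. A minimal sketch, assuming a /mnt/benchfiles/cache/<sector> layout and a layer file name that are illustrative only (the real cache layout may differ):

// Decide whether PC1 can be skipped for a sector by checking its cache dir.
// The directory layout and file name below are assumptions for illustration.
use std::path::Path;

fn pc1_already_done(cache_root: &Path, sector: u64) -> bool {
    cache_root
        .join(sector.to_string())
        .join("sc-02-data-layer-1.dat")
        .exists()
}

fn main() {
    let cache_root = Path::new("/mnt/benchfiles/cache");
    for sector in 0..30u64 {
        if pc1_already_done(cache_root, sector) {
            println!("sector {sector}: PC1 output found, resume at PC2");
        } else {
            println!("sector {sector}: PC1 still needs to run");
        }
    }
}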

@RobQuistNL

Yes! :)

Clear documentation (or examples) on how to run the various parts, what data they need and generate, how to pass this data through, etc.

With this in place, the Supranational updates would also be easier to integrate.
