[V1.0] Development Work #60

Merged 175 commits on Dec 20, 2023

@luigifcruz luigifcruz commented Jul 10, 2023

The scope of this pull request is to remove unnecessary complexity from BLADE. This aims to simplify maintenance and facilitate integration with third-party frameworks. The end result should retain performance while reducing the boilerplate code necessary to create pipelines and schedule work. The Python interface will also receive an overhaul to match the C++ API in both performance and features. Higher-level applications such as the CLI will also be rewritten using the Python bindings instead of the C++ API.

It isn't in the scope of this PR to modify the code of existing modules. This will be addressed in future versions of BLADE. Examples of such deferred features include support for multiple ArrayTensor data layouts like ATFP and AFTP, half-precision computation support, and in-place array operations.

  • Remove CUDA Graphs. [Unnecessary Complexity]
  • Remove the CLI. [To Be Rewritten in Python]
  • Remove support for nested Pipelines. [In Favor of Variable Rate Pipelines]
  • Replace pre-made Pipelines with Bundles. [Better Single Pipeline Module Support]
  • Refactor the Runner with buffer management.
  • Modernize the Python API, possibly with nanobind. [Too Low-level Currently]
  • Add variable rate pipeline modules. [Gather, Copy, Permutation]
  • Add an open-source license.
  • Write updated documentation. [Readme]

Closes #65 and #61.

luigifcruz commented Jul 13, 2023

It will be necessary to refactor the Blade::Runner class so that it handles asynchronous host-to-device memory copies locally instead of deferring them to the pipeline execution. This will result in less VRAM usage with no expected impact on performance.

@luigifcruz

To ensure a level of stability for BLADE users, it is essential to keep track of performance regressions. In previous versions, benchmark results were manually saved to a file inside the repository tree. This PR adds the ability to automatically run benchmarks and tests using a standard server. This makes it possible to graph the benchmark results of components from the past and see how they have changed over time.

Automated benchmarks and tests will be run before any pull request is merged into the main branch to ensure quality. Approved work-in-progress pull requests will also have the ability to run these automated tasks to keep track of regressions as they occur.
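
As a rough illustration of the regression tracking described above, the sketch below compares the benchmark runtimes of a candidate commit against the median of previously recorded runs and reports anything that got slower by more than a tolerance. This is not BLADE's actual benchmark tooling: the function name `find_regressions`, the 5% default tolerance, the data layout, and the benchmark names are all hypothetical and chosen only for this example.

```python
from statistics import median


def find_regressions(history, current, tolerance=0.05):
    """Return {benchmark name: fractional slowdown} for regressed benchmarks.

    history:   dict mapping a benchmark name to a list of past runtimes (ms).
    current:   dict mapping a benchmark name to the candidate's runtime (ms).
    tolerance: allowed fractional slowdown before a benchmark is flagged.
    """
    regressions = {}
    for name, runtime in current.items():
        past = history.get(name)
        if not past:
            # New benchmark with no recorded baseline: nothing to compare.
            continue
        baseline = median(past)
        slowdown = runtime / baseline - 1.0
        if slowdown > tolerance:
            regressions[name] = slowdown
    return regressions


if __name__ == "__main__":
    # Hypothetical benchmark names and timings, for illustration only.
    history = {"beamformer": [10.0, 10.2, 9.8], "channelizer": [5.0, 5.0, 5.0]}
    current = {"beamformer": 12.0, "channelizer": 5.1}
    for name, slowdown in find_regressions(history, current).items():
        print(f"{name}: {slowdown:.1%} slower than the historical median")
```

Using the median of past runs as the baseline is one simple way to dampen run-to-run noise; a real setup could just as well compare against the last merged commit or a fixed reference run.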

For consistent results, a self-hosted machine dedicated to this task will serve as the standard server. Its hardware configuration will be representative of the hardware currently used in production. By using the results produced by this server, it will be possible to extrapolate the performance for other hardware configurations. This capability will help better understand how to deploy the pipeline in an environment with heterogeneous server hardware.

@luigifcruz

All tests are passing with no actionable TODO for this PR. Merging!

@luigifcruz luigifcruz merged commit 066482a into main Dec 20, 2023
1 check passed
@luigifcruz luigifcruz deleted the dev branch December 20, 2023 22:37
Successfully merging this pull request may close these issues.

[V1.0] Add a license. [V1.0] Global refactor.