Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge branch-0.39 into branch-0.40 #253

Merged
merged 5 commits into from
Jul 25, 2024

Conversation

jameslamb
Copy link
Member

Getting the changes from recent PRs targeting branch-0.39 onto branch-0.40

Followed the procedure described at https://docs.rapids.ai/maintainers/forward-merger

Notes for Reviewers

This should be non-squash admin-merged.

pentschev and others added 5 commits July 22, 2024 16:41
)

The missing support for specifying memory type to `ucxx::MemoryHandle` prevents usage of non-host memory allocations. With this change it is now possible to create `ucxx::MemoryHandle` with CUDA memory allocation (UCX-managed), as well as pointing the handle to existing CUDA memory allocations.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - James Lamb (https://github.com/jameslamb)
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: rapidsai#246
Contributes to rapidsai/build-planning#31

In short, RAPIDS DLFW builds want to produce wheels with unsuffixed dependencies, e.g. `cudf` depending on `rmm`, not `rmm-cu12`.

This PR is part of a series across all of RAPIDS to try to support that type of build by setting up CUDA-suffixed and CUDA-unsuffixed dependency lists in `dependencies.yaml`.

For more details, see:
* rapidsai/build-planning#31 (comment)
* rapidsai/cudf#16183

## Notes for Reviewers

### Why target 24.08?

This is targeting 24.08 because:

1. it should be very low-risk
2. getting these changes into 24.08 prevents the need to carry around patches for every library in DLFW builds using RAPIDS 24.08

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#251
…i#238)

# Add several fixes primarily targeting worker progress thread

Several changes are added to improve stability of worker progress thread. Details of changes are below.

## Support generic callback cancelation

Once registered it was not possible to cancel generic callbacks, which could cause the callback to be called after the timeout had expired and the owner/resources were not alive anymore. This change allows canceling a generic callback that couldn't be completed successfully, avoiding access to invalid memory of objects that are not alive anymore.

## Remove `ErrorCallbackData`

The `ErrorCallbackData` structure and `ucxx::Endpoint::errorCallback` static method used by `ucxx::Endpoint` were removed in favor of using a friend function `endpointErrorCallback` that receives a pointer to `ucxx::Endpoint` itself, thus simplifying ownership of all resources where only `ucxx::Endpoint` now is responsible to retain ownership of all resources.

## Improvements to `ucxx::InflightRequests`

### Add missing locks

All accesses to any of internal `_trackedRequests` resources are now protected by locks. This has been the source of various issues in the past, thus switching to a safer approach seems wise.

During `merge()` both merger and mergee have resource locks now to prevent races.

### Use regular instances instead of pointer for `TrackedRequests`

To simplify `ucxx::InflightRequests`/`TrackedRequests` instead of using pointers to `InflightRequestsMap` objects, switch to using simple instances.

## Improvements to `ucxx::WorkerProgressThread`

### Replace `std::ref` with `std::shared_ptr`

Using `std::ref` was causing undefined behavior in the state of `_stop`, switching to `std::shared_ptr` has resolved the problem.

### Set progress thread ID before progress loop stats

Set the thread ID with parent before starting the progress loop, so that checking whether the a callback is running in the progress thread is always correct, otherwise the parent may be unable to determine the progress thread has already started.

### Use simple instance instead of pointer

New members, such as `isRunning()` and `stop()`, were introduced so a simple instance of `WorkerProgressThread` may be used instead of smart pointers. With this change the owner of the instance may query whether the thread indeed stopped, as opposed to just letting it go out-of-scope while the thread was not joined.

## Improvement to `CallbackNotifier`

Instead of blocking continuously for longer periods of time, it's now possible to periodically signal the worker again to cause it to wakeup (when in blocking mode). Currently defaults to 100ms.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: rapidsai#238
…pidsai#250)

Add option to allow enabling/disabling endpoint error handling in C++ benchmark, switching the default to disable matching `ucx_perftest`.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: rapidsai#250
@jameslamb jameslamb added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jul 25, 2024
@jameslamb jameslamb requested review from a team as code owners July 25, 2024 15:22
@jameslamb jameslamb requested a review from AyodeAwe July 25, 2024 15:22
@AyodeAwe AyodeAwe merged commit 8cdbcf0 into rapidsai:branch-0.40 Jul 25, 2024
3 checks passed
@jameslamb jameslamb deleted the branch-0.40-merge-0.39 branch July 25, 2024 15:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants