RFC: Enable per-thread default stream in free-threading builds #133

Open
leofang opened this issue Sep 26, 2024 · 2 comments
Labels
P1 RFC Plans and announcements

Comments

@leofang
Member

leofang commented Sep 26, 2024

tl;dr: For the Python 3.13 free-threading build (cp313t), the per-thread default stream is enabled and used by default. Users need to set CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=0 to explicitly opt out and restore the old behavior.

In CUDA, there are two kinds of default streams:

  • Legacy default stream (synchronizing all blocking streams)
    • Unless action is taken as described in the CUDA Programming Guide, this is the default. Most of the time, the null stream (stream 0) is a synonym for the legacy default stream.
  • Per-thread default stream (only synchronizing with the legacy default stream)

Today, CUDA Python offers a way to switch between the legacy and per-thread default streams (at the time of loading driver symbols) via the environment variable CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM. Until now, the default has behaved as if the variable were set to 0, so the legacy default stream is used.
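As a minimal sketch of how the switch can be flipped from inside a program, the snippet below sets the variable before CUDA Python is imported. The helper name `select_default_stream` is hypothetical and only for illustration; the one thing taken from the text above is the variable name and the fact that it is read when driver symbols are loaded, so it has to be set before that point.

```python
import os

ENV_VAR = "CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM"


def select_default_stream(per_thread: bool) -> str:
    """Hypothetical helper: choose the default stream kind via the env var.

    Call this before CUDA Python loads driver symbols; changing the
    variable afterwards is not expected to have any effect.
    """
    value = "1" if per_thread else "0"
    os.environ[ENV_VAR] = value
    return value


# Opt in to the per-thread default stream for this process.
select_default_stream(per_thread=True)

# Import CUDA Python only after this point, for example:
#   from cuda import cuda
```

Exporting the variable in the shell before launching Python achieves the same thing and avoids any import-order concerns.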

However, the legacy default stream is a very common pitfall for performance-seeking applications, whose users find themselves needing to create non-blocking streams explicitly to avoid implicit synchronization. This change would remove the need to create non-blocking streams. It would also give GPU workloads launched from different host threads, without an explicit stream in use, an opportunity to overlap and execute in parallel instead of being serialized on the same (legacy default) stream.

The free-threading build offers a natural opportunity and perfect timing for us to change the default so that it behaves as if the env var were set to 1, that is, to use the per-thread default stream. This also gives NVIDIA a path forward to assess the feasibility of deprecating (and eventually removing!) the legacy default stream, which has been a long-standing goal of ours.

Users of the regular build will not be affected; only those testing the experimental cp313t free-threading build will be.
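To make the proposed rule concrete, here is a rough sketch of the decision logic, not CUDA Python's actual implementation. The function names are hypothetical, and treating exactly "1" as enabled (anything else as disabled) is an assumption about how the value is parsed. Detecting a free-threading build through the `Py_GIL_DISABLED` config variable is the standard CPython mechanism.

```python
import os
import sysconfig

ENV_VAR = "CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM"


def is_free_threaded_build() -> bool:
    # Py_GIL_DISABLED is 1 on free-threading builds (e.g. cp313t),
    # and 0 or None on regular builds.
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def use_per_thread_default_stream(environ=None, free_threaded=None) -> bool:
    """Hypothetical helper mirroring the rule proposed in this RFC."""
    environ = os.environ if environ is None else environ
    if free_threaded is None:
        free_threaded = is_free_threaded_build()

    value = environ.get(ENV_VAR)
    if value is None:
        # Unset: per-thread on free-threading builds, legacy otherwise.
        return free_threaded
    # Assumption: "1" enables the per-thread default stream,
    # any other value selects the legacy default stream.
    return value.strip() == "1"
```

Under this sketch, an explicit setting always wins, so `CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM=0` restores the old behavior on cp313t, and regular builds keep the legacy default stream unless the user opts in.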

@leofang leofang added P0 RFC Plans and announcements and removed triage labels Sep 26, 2024
@leofang leofang pinned this issue Sep 26, 2024
@leofang leofang changed the title from "[RFC] Enable per-thread default stream in free-threading builds" to "RFC: Enable per-thread default stream in free-threading builds" Sep 26, 2024
@leofang leofang added P1 and removed P0 labels Sep 26, 2024
@gmarkall

From the perspective of numba-cuda, I think this is a positive change and a good opportunity to change the default. Numba already supports the per-thread default stream (PTDS) when the user explicitly enables it: https://numba.readthedocs.io/en/stable/reference/envvars.html#envvar-NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM

The user can also explicitly force which kind of default stream they want (https://numba.readthedocs.io/en/stable/cuda-reference/host.html#numba.cuda.per_thread_default_stream and https://numba.readthedocs.io/en/stable/cuda-reference/host.html#numba.cuda.legacy_default_stream), but I think this is orthogonal to what the default is.

@leofang
Member Author

leofang commented Sep 26, 2024

cc @kmaehashi for vis
