
Update conda recipes for Enhanced Compatibility effort #9456

Merged
merged 1 commit into rapidsai:branch-21.12 on Nov 11, 2021

Conversation

ajschmidt8
Member

This PR updates the conda recipe build strings and cudatoolkit version specifications as part of the Enhanced Compatibility efforts.

The build strings have been updated to include only the major CUDA version (e.g. librmm-21.12.00a-cuda11_gc781527_12.tar.bz2), and the cudatoolkit version specifications are now formatted like cudatoolkit >=x,<y.0a0 (e.g. cudatoolkit >=11,<12.0a0).

Moving forward, we'll build the packages with a single CUDA version (e.g. 11.4) and test them in environments with different CUDA versions (e.g. 11.0, 11.2, 11.4, etc.).
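As a rough illustration of the scheme described above, the relevant recipe pieces might look like the following. This is a hypothetical sketch, not the actual cudf recipe: the `cuda_major` variable and the exact Jinja layout are assumptions, while `GIT_DESCRIBE_HASH`/`GIT_DESCRIBE_NUMBER` are standard conda-build Jinja variables.

```yaml
# Hypothetical excerpt from a conda recipe's meta.yaml (not the real cudf recipe).
build:
  number: {{ GIT_DESCRIBE_NUMBER }}
  # Build string carries only the major CUDA version, e.g. cuda11_gc781527_12
  string: cuda{{ cuda_major }}_{{ GIT_DESCRIBE_HASH }}_{{ GIT_DESCRIBE_NUMBER }}

requirements:
  run:
    # Major-version pin: any 11.x cudatoolkit satisfies this constraint
    - cudatoolkit >=11,<12.0a0
```

With this pin, a package built against CUDA 11.4 can be installed into environments carrying any 11.x toolkit, which is what makes the build-once/test-many approach possible.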

@ajschmidt8 ajschmidt8 added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 15, 2021
@github-actions github-actions bot added the conda label Oct 15, 2021
@codecov

codecov bot commented Oct 15, 2021

Codecov Report

Merging #9456 (a7679c6) into branch-21.12 (ab4bfaa) will increase coverage by 0.04%.
The diff coverage is 0.00%.

❗ Current head a7679c6 differs from pull request most recent head ea5aff9. Consider uploading reports for the commit ea5aff9 to get more accurate results

@@               Coverage Diff                @@
##           branch-21.12    #9456      +/-   ##
================================================
+ Coverage         10.79%   10.83%   +0.04%     
================================================
  Files               116      117       +1     
  Lines             18869    19442     +573     
================================================
+ Hits               2036     2106      +70     
- Misses            16833    17336     +503     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_lib/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
... and 60 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b66f14e...ea5aff9.

@ajschmidt8
Member Author

rerun tests

@kkraus14
Collaborator

Moving forward, we'll build the packages with a single CUDA version (e.g. 11.4) and test them in environments with different CUDA versions (e.g. 11.0, 11.2, 11.4, etc.).

It would be good to get confirmation that this is safe to do while including PTX in the libcudf binary. I believe that if we compile with CUDA 11.4, the embedded PTX has a version matching 11.4, and if that PTX needs to be loaded or JIT-compiled on, say, an 11.0 machine, the driver will error. I believe we decided to do the reverse and only compile with CUDA 11.0 in conda-forge for this same reason. cc @jakirkham

@jakirkham
Member

In conda-forge we had to start with 11.2, but yes this was one issue

There are a few conversations happening internally about how exactly CUDA Enhanced Compatibility (CEC) can be used

@ajschmidt8 ajschmidt8 marked this pull request as ready for review November 8, 2021 21:48
@ajschmidt8 ajschmidt8 requested a review from a team as a code owner November 8, 2021 21:48
@ajschmidt8
Member Author

rerun tests

Collaborator

@kkraus14 kkraus14 left a comment


Please share the results of the discussion with regards to handling of PTX before merging

@robertmaynard
Contributor

robertmaynard commented Nov 10, 2021

Please share the results of the discussion with regards to handling of PTX before merging

@kkraus14

From the C++ side, we either emit SASS for all the necessary architectures, so we are not concerned about PTX issues, or we have projects like jitify convert to SASS at runtime and therefore don't run into this problem.
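For context, the SASS-for-all-architectures approach is typically selected at compile time through nvcc's `-gencode` flags. The flags below are an illustrative sketch, not libcudf's actual build configuration; the architecture list is an assumption.

```
# Illustrative nvcc invocation (not libcudf's real build flags).
# code=sm_XX embeds SASS for that architecture; code=compute_XX embeds PTX.
nvcc kernel.cu -c -o kernel.o \
  -gencode arch=compute_60,code=sm_60 \
  -gencode arch=compute_70,code=sm_70 \
  -gencode arch=compute_80,code=sm_80 \
  -gencode arch=compute_80,code=compute_80   # omit this line to ship no PTX at all
```

Dropping the final `code=compute_80` line is what "not shipping PTX" amounts to: devices with embedded SASS work, and anything else fails fast rather than attempting a JIT compile.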

In some cases, when using a new runtime with an older driver, Numba will currently emit PTX not understood by the driver. We are working with the Numba team to provide a fix for this before the 21.12 release finalizes. In the interim, users of nightlies will need to run on a driver new enough to handle the code emitted by their runtime, or temporarily downgrade to an older runtime.
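The failure mode described above reduces to a version comparison: PTX emitted by a toolkit newer than the installed driver cannot be JIT-compiled. A minimal pure-Python sketch of that rule, with hypothetical version tuples (this is not Numba's or the driver's actual logic):

```python
# Simplified model of PTX/driver compatibility (hypothetical, for illustration):
# a driver can JIT PTX only if it is at least as new as the emitting toolkit.

def ptx_is_loadable(runtime_version: tuple, driver_version: tuple) -> bool:
    """Return True if a driver at `driver_version` can load/JIT PTX emitted
    by a toolkit at `runtime_version` (simplified major/minor comparison)."""
    return driver_version >= runtime_version

# A CUDA 11.4 runtime emitting 11.4-era PTX on an 11.0 driver fails:
print(ptx_is_loadable((11, 4), (11, 0)))  # False -> driver errors
# Older PTX on a newer driver is fine:
print(ptx_is_loadable((11, 0), (11, 4)))  # True
```

This is also why building with the oldest supported toolkit (as conda-forge chose to do) sidesteps the problem: the emitted PTX is never newer than the drivers it meets.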

@JohnZed JohnZed dismissed kkraus14’s stale review November 11, 2021 17:57

Sorry to dismiss, but we need this PR to get through in order to proceed with the 11.5 rollout, so we are going to merge it in. Please see Rob's comment above for more info. Numba changes will take a few more days to come out.

@ajschmidt8
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 98f5a28 into rapidsai:branch-21.12 Nov 11, 2021
@ajschmidt8 ajschmidt8 deleted the update-recipe branch November 11, 2021 17:59
@kkraus14
Collaborator

No worries on dismissing the review. I wonder if at a certain point we should just stop shipping PTX in the libcudf binaries because one of three things will happen if the driver needs to interact with the libcudf PTX:

  1. The driver errors from getting a newer PTX version than it supports due to us needing to be aggressive in adopting new CUDA Toolkit versions.
  2. We hit a newer than currently supported architecture and the driver tries to JIT compile the PTX into SASS and it presents itself as a process hang to the user because of the size of libcudf.
  3. We hit an older unsupported architecture and the driver tries to JIT compile the PTX into SASS and it presents itself as a process hang to the user because of the size of libcudf. At some point it errors either during this JIT compilation or during runtime from the use of architecture features not present on the older architecture.

All of these feel like a distinctly worse user experience than just not shipping the PTX and having the driver error from failing to find SASS for the current device architecture.

@JohnZed
Contributor

JohnZed commented Nov 11, 2021

@kkraus14 #2 is an interesting point in particular. That's basically the specific use case that PTX is meant to support, but I agree that the JIT times can be very surprising. I think this is the norm for most shipping CUDA software, though, since when new architectures come out, it's also very unpleasant if there is no way for them to use popular OSS CUDA code at all. I think the ideal would be if we could detect the need to JIT for a new architecture and output a clear message saying "first-time JIT, this will take a while." I don't know of a way to intercept that JIT and issue a message, though.
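Lacking a driver callback, one coarse workaround is to time the first launch and explain a long delay after the fact. A minimal sketch, where `launch_kernel` is a hypothetical user-supplied callable (not a real libcudf or driver API) and the threshold is an assumed value:

```python
import time

JIT_WARN_SECONDS = 5.0  # assumed threshold; tune for the workload

def launch_with_jit_warning(launch_kernel, *args):
    """Time a (possibly JIT-triggering) first call and explain the delay.
    `launch_kernel` is a hypothetical callable standing in for a kernel launch."""
    start = time.monotonic()
    result = launch_kernel(*args)
    elapsed = time.monotonic() - start
    if elapsed > JIT_WARN_SECONDS:
        print(f"First launch took {elapsed:.1f}s; the driver likely JIT-compiled "
              "PTX for this GPU architecture. Subsequent launches should be fast.")
    return result
```

This only reports after the hang has already happened, so it is a weak substitute for the up-front "first-time JIT" message described above.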

@JohnZed
Contributor

JohnZed commented Nov 12, 2021

Asked around, and it sounds like there is not currently an API to intercept when JIT'ing happens and share a message, unfortunately.

ajschmidt8 added a commit to ajschmidt8/cudf that referenced this pull request Nov 19, 2021
The `dask-cudf` recipe was changed after rapidsai#9456 was opened, so the Enhanced Compatibility changes never made it into this recipe. This PR fixes that.
ajschmidt8 added a commit that referenced this pull request Nov 19, 2021
The `dask-cudf` recipe was changed after #9456 was opened, so the Enhanced Compatibility changes never made it into this recipe. This PR fixes that.
ajschmidt8 added a commit to rapidsai/cucim that referenced this pull request Nov 29, 2021
Similar to rapidsai/cudf#9456.

This PR updates the `conda` recipe build strings and `cudatoolkit` version specifications as part of the Enhanced Compatibility efforts.

The build strings have been updated to include only the major CUDA version (e.g. `librmm-21.12.00a-cuda11_gc781527_12.tar.bz2`), and the `cudatoolkit` version specifications are now formatted like `cudatoolkit >=x,<y.0a0` (e.g. `cudatoolkit >=11,<12.0a0`).

Moving forward, we'll build the packages with a single CUDA version (e.g. `11.5`) and test them in environments with different CUDA versions (e.g. `11.0`, `11.2`, `11.4`, `11.5`, etc.).

6 participants