Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[E2E][CUDA] Add barrier before all_of_group in ballot_group_algorithms test. #13661

Closed
wants to merge 3 commits into from

Conversation

JackAKirk
Copy link
Contributor

@JackAKirk JackAKirk commented May 6, 2024

Fixes #12995
failure for cuda 12.4.

all_of_group calls vote.sync.all ptx instruction in the CUDA backend. It seems cuda 12.4 needs to have all members of the non-uniform ballot group in converged control flow to solve this failure.

From my understanding, this change shouldn't be necessary as per the cuda spec for sm_60 and above: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-vote-functions

"For .target sm_6x or below, all threads in membermask must execute the same vote.sync instruction in convergence, and only threads belonging to some membermask can be active when the vote.sync instruction is executed. Otherwise, the behavior is undefined."

I think this is a cuda ptxas bug, but I'm adding a barrier here just so the test passes once we switch to cuda 12.4. This test already passes fine for cuda 12.3 and below. There is no difference in the ptx generated for cuda 12.4, so I think this must be a ptxas/sass issue. Note that strictly speaking we support sm_5x (which would require the barrier addition here anyway) but in reality these "Maxwell" cards are very rarely used because they don't have any data centre cards in this generation. We get asked about "Kepler" support sm_3x sometimes (that we don't officially support because it is below sm_50), but I don't ever remember a sm_5x request/issue.

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk JackAKirk requested a review from a team as a code owner May 6, 2024 15:15
@JackAKirk JackAKirk changed the title [CUDA] Add barrier before all_of_group in ballot_group_algorithms test. [test-e2e][CUDA] Add barrier before all_of_group in ballot_group_algorithms test. May 6, 2024
@JackAKirk JackAKirk changed the title [test-e2e][CUDA] Add barrier before all_of_group in ballot_group_algorithms test. [E2E][CUDA] Add barrier before all_of_group in ballot_group_algorithms test. May 6, 2024
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
@JackAKirk
Copy link
Contributor Author

@intel/llvm-reviewers-runtime
Would it be possible to review this?

Thanks

// Note that this barrier is required for the test to pass for
// cuda 12.4 even for sm_60 and later devices. This appears to be a
// cuda ptxas bug.
sycl::group_barrier(BallotGroup);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we maybe mask it with #ifdef __NVPTX__ so we don't inadvertently ignore problems on other targets?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wonder if it should be the fix in the implementation and not in the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah makes sense. Ta

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wonder if it should be the fix in the implementation and not in the test.

I can create a ticket to confirm that it is a ptx bug and report it to nvidia. They also just released cuda 12.5, so it is possible it is working there. Prior to cuda 12.4 this test passed.

@aelovikov-intel aelovikov-intel requested review from Pennycook and a team May 23, 2024 15:58
@JackAKirk
Copy link
Contributor Author

I'm just going to close this PR. We've confirmed that it is fixed if you use the cuda driver that was released with cuda 12.5. If you use that driver then the cuda 12.4 toolkit can be used and the test passes. We will probably just make sure that we don't use the cuda driver associated with the cuda 12.4 toolkit in the CI when we upgrade it.

@JackAKirk JackAKirk closed this May 30, 2024
aelovikov-intel pushed a commit that referenced this pull request Jun 5, 2024
This upgrades the docker to use the cuda 12.5 image.

I've ran the test-e2e locally using cuda 12.5 and all is well. cuda 12.5
also fixed an issue introduced by the cuda 12.4 driver: see
#13661 (comment)

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[E2E][CUDA] NonUniformGroups/ballot_group_algorithms.cpp failed on CUDA
3 participants