Adding CI scripts #45

Merged: 20 commits into icl-utk-edu:master on Aug 29, 2024
Conversation

G-Ragghianti
Collaborator

No description provided.

@mkstoyanov
Collaborator

Thanks for setting this up, let me know if you run into any issues.

@G-Ragghianti
Collaborator Author

One question so far: the intel-mkl package in spack has been deprecated in favor of the oneAPI version (intel-oneapi-mkl). We were previously using intel-mkl for the MKL heFFTe backend. Is it sufficient to use just the oneAPI backend and deprecate the MKL backend, or do they serve different purposes?

@mkstoyanov
Collaborator

> One question so far: the intel-mkl package in spack has been deprecated in favor of the oneAPI version (intel-oneapi-mkl). We were previously using intel-mkl for the MKL heFFTe backend. Is it sufficient to use just the oneAPI backend and deprecate the MKL backend, or do they serve different purposes?

The MKL backend is CPU-only and doesn't require the Intel SYCL compiler; it is needed on many platforms regardless of the spack deprecation.

However, we don't need a separate test for all possible backend combinations (simply not practical). It is OK to just test MKL + oneAPI in spack, especially since the stand-alone MKL package has been deprecated. We can even deprecate the standalone +mkl variant in spack (though not in C++).

@G-Ragghianti
Collaborator Author

Maybe the +mkl variant in the spack package should be changed to use intel-oneapi-mkl instead of intel-mkl?

@mkstoyanov
Collaborator

> Maybe the +mkl variant in the spack package should be changed to use intel-oneapi-mkl instead of intel-mkl?

That's another possibility; I haven't tested it, but it should work.

@G-Ragghianti
Collaborator Author

I have most of the checks passing now. Do you think there is a good chance that we can get the two failing checks to pass, or would you be OK with implementing the CI the way it is now?

@mkstoyanov
Collaborator

Generally speaking, so long as most of the tests are working, the CI is good.

The Intel test looks like a GPU problem; it is complaining about missing fp64 capabilities. Is it detecting the correct device?

The ROCm failure is also puzzling: it fails to launch a basic scaling kernel, which does just that, scales a vector by a fixed number.
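
For context, the kernel in question is essentially of the following shape. This is a minimal HIP sketch for illustration only, not heFFTe's actual source; the names are made up:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Scale each entry of a device vector by a fixed factor.
__global__ void scale_kernel(double *x, int n, double factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

int main() {
    const int n = 1024;
    std::vector<double> x(n, 1.0);

    double *dx = nullptr;
    hipMalloc(&dx, n * sizeof(double));
    hipMemcpy(dx, x.data(), n * sizeof(double), hipMemcpyHostToDevice);

    // If the environment exposes the wrong device, or the runner lacks
    // permissions (e.g., no access to /dev/kfd), even this trivial launch fails.
    scale_kernel<<<(n + 255) / 256, 256>>>(dx, n, 2.0);
    hipDeviceSynchronize();

    hipMemcpy(x.data(), dx, n * sizeof(double), hipMemcpyDeviceToHost);
    hipFree(dx);
    std::printf("x[0] = %g (expected 2)\n", x[0]);
    return 0;
}
```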

Both of those could be an issue with the environment selecting the correct device and/or with permissions. But this is just a guess; I'm not familiar with the environment.

@G-Ragghianti
Collaborator Author

For the Intel GPU, I think it is actually missing the fp64 capability. The GPU is an Arc A770 (consumer card). We had to disable the double-precision tests in SLATE for this. Is that a possibility here?
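
(A quick way to check what the runtime reports is a device probe along these lines; an illustrative SYCL 2020 sketch, not part of the heFFTe tests:)

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // List every device the SYCL runtime exposes and whether it reports fp64.
    for (const auto &dev : sycl::device::get_devices()) {
        std::cout << dev.get_info<sycl::info::device::name>()
                  << "  fp64: " << (dev.has(sycl::aspect::fp64) ? "yes" : "no")
                  << "\n";
    }
    return 0;
}
```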

@mkstoyanov
Collaborator

> For the Intel GPU, I think it is actually missing the fp64 capability. The GPU is an Arc A770 (consumer card). We had to disable the double-precision tests in SLATE for this. Is that a possibility here?

Unfortunately, heFFTe doesn't have that option at the moment. I could add it in time, but it would be a significant intrusion into the code. It's probably best to either disable the test for now, or not use the GPU and just run SYCL on the CPU (fallback mode). That still tests the GPU kernels, and it works in all precisions.
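
As an illustrative sketch of the fallback mode, assuming a standard SYCL 2020 runtime, the queue can simply be bound to the CPU device, which supports double precision:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // Force the queue onto the CPU device; the SYCL kernels still compile
    // and run there, and the CPU reports full fp64 support.
    sycl::queue q{sycl::cpu_selector_v};
    std::cout << "running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";
    return 0;
}
```

With the oneAPI runtime the same redirection can usually be done without code changes through the device-selector environment variable (e.g., ONEAPI_DEVICE_SELECTOR=opencl:cpu in recent releases).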

@G-Ragghianti
Collaborator Author

Just FYI: I'm keeping this PR as a draft until we finalize the Slurm queue policies at our site. The new CI system runs the jobs through our queues, and I need to make sure the CI jobs don't interfere with normal users.

@mkstoyanov
Collaborator

> Just FYI: I'm keeping this PR as a draft until we finalize the Slurm queue policies at our site. The new CI system runs the jobs through our queues, and I need to make sure the CI jobs don't interfere with normal users.

No worries, I won't merge anything until you say it's ready (at which point you can merge it yourself).

@ax3l
Contributor

ax3l commented Aug 23, 2024

Hi, this is a great PR to add CI - would you like to update it and maybe merge it partially already? :)

(Our interest: We want to start depending on heFFTe in WarpX & ImpactX mainline, and having CI in a dependency is one of our quality criteria :))

@mkstoyanov
Collaborator

The PR is @G-Ragghianti's work, as it is connected to the machines at IC. Given the lack of post-ECP funding, I don't know what the status is here or whether this is even feasible at the moment.

What I am doing now is running a series of automated scripts through Docker, the same as a CI would, except they are not automatically triggered from GitHub. That has always been a challenge due to security and authentication.

The free GitHub runners are a bit limited: no GPU support of any kind, and heFFTe needs at least 12 MPI ranks to exercise a non-symmetric data layout of boxes. I can try to set up one or two free GitHub tests; would that be sufficient, or do I need to find alternatives?

@G-Ragghianti
Collaborator Author

We intend to run these on our site-hosted runners. Those are currently set up as dedicated runners, but this PR was created specifically to use an on-demand runner system that we have yet to fully deploy. I can convert this PR to use the dedicated runners we currently have. We intend to support this despite the end of ECP funding.

@mkstoyanov
Collaborator

@G-Ragghianti it would be of great help to support this, even if the tests don't span the entire range of supported GPUs and build configurations. I don't know if @ax3l has a time-frame in mind.

@G-Ragghianti
Collaborator Author

I've updated the PR to use our currently available self-hosted runners. There is still a small problem with the FFTW module not being compatible with our OpenMPI module. I will fix this, and then the PR can be merged.

@ax3l
Contributor

ax3l commented Aug 29, 2024

Yay! All green :)

@G-Ragghianti merged commit be7be23 into icl-utk-edu:master on Aug 29, 2024
10 checks passed
@G-Ragghianti removed the draft label on Sep 3, 2024