Nightly Trilinos failures in Stokhos, cuda/11.2.2, non-UVM build #1959

Closed
ndellingwood opened this issue Aug 30, 2023 · 6 comments

Comments

@ndellingwood
Contributor

The nightly Trilinos cuda/11.2.2 non-UVM build fails in the following Stokhos tests:

03:33:49 	2164 - Stokhos_KokkosCrsMatrixUQPCEUnitTest_Cuda_MPI_1 (Failed)
# Failure output
03:04:19 (CudaInternal::singleton().cuda_device_synchronize_wrapper()) error( cudaErrorLaunchFailure): unspecified launch failure /home/jenkins/weaver/workspace/KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm/Trilinos/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:154
03:04:19 Backtrace:
03:04:19 3. Kokkos_CrsMatrix_PCE_UQ_PCE_DS_MeanMultiplyRank1_UnitTest ... [0x10228364] 
...
03:33:49 	2175 - Stokhos_KokkosCrsMatrixMPVectorUnitTest_Cuda_MPI_1 (Failed)
# Failure output
03:04:44 4. Kokkos_CrsMatrix_MP_DS_DefaultMultiply_Multiply_4_UnitTest ... [Passed] (0.0355 sec)
03:04:44 (ptr->cuda_stream_synchronize_wrapper(stream)) error( cudaErrorLaunchFailure): unspecified launch failure /home/jenkins/weaver/workspace/KokkosEco_Trilinos_Weaver_CUDA112_opt-no-uvm/Trilinos/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:167
03:04:44 Backtrace:
03:04:44 5. Kokkos_CrsMatrix_MP_DS_KokkosMultiply_Multiply_Default_UnitTest ... [0x1023cfb4] 

The build had been broken until #1937 was merged to resynchronize kokkos-kernels@develop with Trilinos@develop. There is some discrepancy between the changes in #1937 and trilinos/Trilinos#12103 (I think those changes were motivated by failing Stokhos tests). @cwpearson @brian-kelley could that discrepancy between kokkos-kernels and Trilinos cause the Stokhos failures shown here? Or does Stokhos need an additional change to support the exec space instances added to spmv in #1932? Or is it something else?

To pinpoint whether the incompatibility came in with the streams update, I can test the commits from before the bsr changes were merged to both repos, e.g.:

6c06bd0
trilinos/Trilinos@eece9b3

Reproducer (Weaver rhel8 queue):

export ATDM_CONFIG_REGISTER_CUSTOM_CONFIG_DIR=${TRILINOS_DIR}/cmake/std/atdm/contributed/weaver
source /projects/ppc64le-pwr9-rhel8/legacy-env.sh
source ${TRILINOS_DIR}/cmake/std/atdm/load-env.sh weaver-cuda-11.2-opt
export OMPI_CXX=$KOKKOS_DIR/bin/nvcc_wrapper

cmake \
-G"Unix Makefiles" \
-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
-DCMAKE_INSTALL_PREFIX=$TRILINOS_INSTALL_DIR \
-DCMAKE_CXX_STANDARD="17" \
-DFC_FN_UNDERSCORE=UNDER \
-DTPL_ENABLE_CUSPARSE=ON \
-DTrilinos_ENABLE_TESTS=ON \
-DTrilinos_ENABLE_ALL_PACKAGES=ON \
-DTrilinos_ENABLE_Stokhos=ON \
-DKokkos_ENABLE_CUDA_UVM=OFF \
-DKokkos_ARCH_VOLTA70=ON \
-DKokkos_ARCH_POWER9=ON \
-DKokkos_SOURCE_DIR_OVERRIDE:STRING=kokkos \
-DKokkosKernels_SOURCE_DIR_OVERRIDE:STRING=kokkos-kernels \
-DKokkos_CoreUnitTest_CudaTimingBased_MPI_1_DISABLE=ON \
-DKokkos_CoreUnitTest_Default_MPI_1_SET_RUN_SERIAL=ON \
-DIntrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests_MPI_1_SET_RUN_SERIAL=ON \
$TRILINOS_DIR
@brian-kelley
Contributor

I'm trying to reproduce this now

@brian-kelley
Contributor

@ndellingwood I reproduced this and got a backtrace. It seems that the issue is simply that:

  • Stokhos has its own specializations of KokkosSparse::spmv for PCE scalar types.
  • With #1932 (Adding exec space instance to spmv), KokkosSparse::spmv has several new overloads. The versions not taking a space argument (like the one Stokhos calls in this test) are now implemented by passing the default space to a new overload.
  • Stokhos does not have specializations for these new overloads, so the default KK implementation is called and it crashes.
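
To illustrate the kind of dispatch miss described in the list above, here is a minimal, self-contained sketch. It is not the actual KokkosSparse or Stokhos code: the namespaces `lib` and `downstream`, the types `DefaultSpace`, `RankOne` and `PceVector`, and the functions `multiply_impl`, `multiply_before` and `multiply_after` are all hypothetical stand-ins, and the exact call chain inside `KokkosSparse::spmv` is an assumption made only for illustration.

```cpp
#include <cassert>
#include <string>

namespace lib {  // hypothetical stand-in for the kernels library

struct DefaultSpace {};  // stands in for a default execution space instance
struct RankOne {};       // stands in for a dispatch tag

// Generic implementation with the older signature (no space argument).
template <class Vector>
std::string multiply_impl(const Vector&, RankOne) {
  return "generic";
}

// Generic implementation with the newer signature (space argument first).
template <class Space, class Vector>
std::string multiply_impl(const Space&, const Vector&, RankOne) {
  return "generic";
}

// Entry point as it might have looked before the refactor:
// it calls the older signature directly.
template <class Vector>
std::string multiply_before(const Vector& x) {
  return multiply_impl(x, RankOne{});
}

// Entry point after the refactor: it forwards the default space
// to the newer signature.
template <class Vector>
std::string multiply_after(const Vector& x) {
  return multiply_impl(DefaultSpace{}, x, RankOne{});
}

}  // namespace lib

namespace downstream {  // hypothetical stand-in for the downstream package

struct PceVector {};  // stands in for a special scalar/vector type

// The downstream package overloads only the OLDER signature.
inline std::string multiply_impl(const PceVector&, lib::RankOne) {
  return "specialized";
}

}  // namespace downstream

// Small self-check showing which implementation each entry point reaches.
inline void demonstrate() {
  downstream::PceVector v;
  assert(lib::multiply_before(v) == "specialized");
  assert(lib::multiply_after(v) == "generic");
}
```

In this sketch the downstream overload is still found through argument-dependent lookup when the older signature is called, but the forwarding entry point never calls that signature, so the generic implementation is selected without any compile-time error.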

So I think fixing this requires Stokhos changes. But those changes would only be compatible with KK develop, not the 4.1 release in Trilinos now. Should I try to patch the three recent spmv-related PRs (#1932, #1937 and #1953) into Trilinos, as well as the Stokhos update?

@brian-kelley
Contributor

Actually, I may be able to fix this in Stokhos using version ifdefs to work with both develop and master KK.
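
As a rough idea of what a version guard of this kind could look like, here is a hedged sketch. It does not show what any Stokhos PR actually does: the macro `EXAMPLE_LIB_VERSION`, its encoding, the threshold `40100`, and the types and function names are all hypothetical and chosen only so the example compiles on its own.

```cpp
#include <cassert>
#include <string>

// Hypothetical version macro, encoded as major*10000 + minor*100 + patch.
// A real build would take this from the library's configuration header.
#ifndef EXAMPLE_LIB_VERSION
#define EXAMPLE_LIB_VERSION 40199
#endif

struct DefaultSpace {};  // stands in for an execution space instance
struct PceVector {};     // stands in for a special scalar/vector type

// Overload of the older signature: provided for every library version.
inline std::string multiply_impl(const PceVector&) {
  return "specialized";
}

#if EXAMPLE_LIB_VERSION > 40100
// Overload of the newer, space-taking signature: compiled only when the
// library is new enough to call it. It reuses the older overload here.
inline std::string multiply_impl(const DefaultSpace&, const PceVector& x) {
  return multiply_impl(x);
}
#endif

// Reports whether the space-taking overload was compiled in.
inline bool has_space_overload() {
#if EXAMPLE_LIB_VERSION > 40100
  return true;
#else
  return false;
#endif
}
```

With the default value of the hypothetical macro, both overloads exist and both reach the specialized code path; compiling with `-DEXAMPLE_LIB_VERSION=40100` would leave only the older overload.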

@brian-kelley
Contributor

OK, this was actually pretty easy to fix - trilinos/Trilinos#12190 should take care of it and only needed changes in Stokhos. I haven't finished testing it locally so it's a draft.

@ndellingwood
Contributor Author

Thanks for addressing this so quickly @brian-kelley !

@brian-kelley
Contributor

That PR actually didn't fix the issue for kokkos-kernels develop, even though it passed PR testing with kokkos-kernels master. I'm now trying a different way to fix the issue and will open a new PR.
