Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft: Factor build/link/use of VecGeom/CUDA using libraries for simplicity #279

Closed
wants to merge 8 commits into from

Conversation

drbenmorgan
Copy link
Member

Due to VecGeom's legacy CUDA design and the exposure of device functions in public APIs of libraries, ensuring correct singular device code links in final executables is, and continues to be, awkward. This largely means reproducing VecGeom's pattern of shared/static/device linked libraries downstream so that usage can be appropriately tracked and the correct linking done at each stage. A suitable CMake module that packages all of the needed functionality is available from the Celeritas project where it's also been tested across a range of use cases and is therefore now pretty robust (as it can be).

This PR imports this CMake module into AdePT (it's a completely standalone file) and ports the build, link and use of AdePT to use it. The primary changes are:

  • Provide a FindVecGeom wrapper module to add RDC properties to VecGeom targets (to possibly be upstreamed in VecGeom)
  • Replace target_XXX calls with cuda_rdc_XXX equivalents
  • Remove obsolete AdePT_cuda and AdePT_standalone libraries that CudaRdcUtils build and handles automatically locally and in downstream use.
  • Install FindVecGeom and CudaRdcUtils for use by downstream clients

I've tested this locally on a Rocky 9/GCC 11/Cuda 12 system with an underlying VecGeom install with static libs and CUDA enabled. The build and tests all run fine and without problem. I've also been able to take one of the examples, recast it into a standalone project and build and run it against an install of AdePT. However, CI and wider use may show up additional bugs...

Pinging @pcanal and @sethrj for info and any additional comments they want to make.

@phsft-bot
Copy link

Can one of the admins verify this patch?

@pcanal
Copy link

pcanal commented Feb 29, 2024

For simplicity of future update to CudaRdcUtils.cmake, the git commit log should indicate which commit from either the Celeritas repository or the CudaRdc repository the file was copied from.

@@ -8,8 +8,8 @@ macro(build_tests TESTS)
foreach(TEST ${TESTS})
get_filename_component(TARGET_NAME ${TEST} NAME_WE)
add_executable(${TARGET_NAME} ${TEST})
set_target_properties(${TARGET_NAME} PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(${TARGET_NAME} AdePT_cuda AdePT_G4_integration AdePT)
# NB: We do not use PRIVATE for executables due to an apparent limitation in CudaRDCUtils
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the symptoms/issues?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output from the cmake step if adding PRIVATE here:

CMake Error at cmake/CudaRdcUtils.cmake:850 (target_link_libraries):
  The keyword signature for target_link_libraries has already been used with
  the target "test_atomic".  All uses of target_link_libraries with a target
  must be either all-keyword or all-plain.

  The uses of the keyword signature are here:

   * cmake/CudaRdcUtils.cmake:738 (target_link_libraries)

Call Stack (most recent call first):
  test/CMakeLists.txt:12 (cuda_rdc_target_link_libraries)
  test/CMakeLists.txt:76 (build_tests)

The target_link_libraries at 850 is "bare" and could probably use a direct PRIVATE though I do recall there are use cases for PUBLIC linking of executables (plugins?) if only for CPU.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an actual difference on the link line of the executable between the 2 settings? (I naively thought it would only be a setting for dependency propagation in CMake and thus 'useless' for executable).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never observed a difference, though I wondered if there's some connection to symbol visibility/export. We can probably safely ignore any difference for now given the scope in which the RDC linking is required.

<TARGET INCLUDE DIRECTORIES>
${AdePT_INCLUDE_DIRS})
```cmake
include(CudaRdcUtils)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be done transparently in AdePTConfig.cmake.in?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to enforce an explicit include for now for clarity in downstream use It also matches the Celeritas pattern of use. We can of course review (in both) in future.

@agheata
Copy link
Contributor

agheata commented Mar 1, 2024

@drbenmorgan , both me an @JuanGonzalezCaminero are getting locally linking errors with this branch. I have the VecGeom master and cmake 3.28.3

[ 47%] Linking CXX executable ../BuildProducts/bin/test_atomic
/usr/bin/ld: ../BuildProducts/lib/libAdePT.so: undefined reference to `AdePTGeant4Integration::ReturnTracks(std::vector<adeptint::TrackData, std::allocator<adeptint::TrackData> >*, int)'
/usr/bin/ld: ../BuildProducts/lib/libAdePT.so: undefined reference to `AdePTGeant4Integration::ProcessGPUHits(HostScoring&, HostScoring::Stats&)'
/usr/bin/ld: ../BuildProducts/lib/libAdePT.so: undefined reference to `AdePTGeant4Integration::InitScoringData(adeptint::VolAuxData*)'
/usr/bin/ld: ../BuildProducts/lib/libAdePT.so: undefined reference to `AdePTGeant4Integration::GetUniformFieldZ()'
/usr/bin/ld: ../BuildProducts/lib/libAdePT.so: undefined reference to `AdePTGeant4Integration::InitVolAuxData(adeptint::VolAuxData*, G4HepEmState*, bool, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*)'
/usr/bin/ld: ../BuildProducts/lib/libAdePT.so: undefined reference to `AdePTGeant4Integration::CheckGeometry(G4HepEmState*)'

However, I checked and the symbols are defined in libAdePT_G4_integration.so

0000000000071410 T AdePTGeant4Integration::ReturnTracks(std::vector<adeptint::TrackData, std::allocator<adeptint::TrackData> >*, int)

The error is the same if I change manually the link order of libAdePT.so and libAdePT_G4_integration.so
Any idea?

@agheata
Copy link
Contributor

agheata commented Mar 1, 2024

also getting when linking example1:

make[1]: *** [CMakeFiles/Makefile2:1376: test/CMakeFiles/test_atomic.dir/all] Error 2
/usr/bin/ld: /home/agheata/geant_src/VecGeom_install/lib/libvecgeom.a(UnplacedCone.cpp.o): undefined reference to symbol '_ZNK7vecgeom3cxx9DevicePtrINS_4cuda13SUnplacedConeINS2_9ConeTypes13UniversalConeEEEE9ConstructIJdddddddEEEvDpT_'
/usr/bin/ld: /home/agheata/geant_src/VecGeom_install/lib/libvecgeomcuda.so: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status
make[2]: *** [examples/Example1/CMakeFiles/example1.dir/build.make:392: BuildProducts/bin/example1] Error 1

@drbenmorgan
Copy link
Member Author

@agheata, I've been chatting to @JuanGonzalezCaminero on Mattermost about this - I'm in the process of rebuilding my local stack with the newer CUDA and CMake that he has to see if I can reproduce. What I am now spotting is there is a undeclared (to CMake) circular dependence between libAdePT and libAdePT_G4_integration....

@drbenmorgan
Copy link
Member Author

@pcanal may have insight on the

/usr/bin/ld: /home/agheata/geant_src/VecGeom_install/lib/libvecgeomcuda.so: error adding symbols: DSO missing from command line

error. I believe was seen as well in Celeritas and patched?

@drbenmorgan
Copy link
Member Author

One other thing to check: what is the full list of Vecgeom libraries in your installs (e.g. under /home/agheata/geant_src/VecGeom_install/lib)? I've tried locally with both static and shared builds and am simply not able to reproduce. Could you also let me know the versions of VecGeom and Geant4 (down to commit) you are building against, and the exact set of build options used to configure and install them please?

There's some weird difference or specific configuration setup that's causing these problems for you but not me or the CI system.

@pcanal
Copy link

pcanal commented Mar 1, 2024

@pcanal may have insight on the

/usr/bin/ld: /home/agheata/geant_src/VecGeom_install/lib/libvecgeomcuda.so: error adding symbols: DSO missing from command line

error. I believe was seen as well in Celeritas and patched?

Yes we have seem similar but slightly different. The fix was celeritas-project/celeritas@03b449a (#1102). A description of the problem can be found at: https://zhangboyi.gitlab.io/post/2020-09-14-resolve-dso-missing-from-command-line-error/

The bottom-line is that you might need to add explicitly more library to the link library list or .... something is messed up in the installation (eg. installation of both shared and static version of VecGeom in the same area or .... )

@agheata
Copy link
Contributor

agheata commented Mar 4, 2024

there is a undeclared (to CMake) circular dependence between libAdePT and libAdePT_G4_integration....

Indeed, we are looking now into how to break the dependency of lib AdePT on libAdePT_G4_integration

@agheata
Copy link
Contributor

agheata commented Mar 4, 2024

One other thing to check: what is the full list of Vecgeom libraries in your installs (e.g. under /home/agheata/geant_src/VecGeom_install/lib)?

libvecgeom.a libvecgeomcuda.so libvecgeomcuda_static.a libvgdml.a

Could you also let me know the versions of VecGeom and Geant4 (down to commit) you are building against, and the exact set of build options used to configure and install them please?
* 89ec50d4 (HEAD -> master, origin/master, origin/HEAD) VECGEOM-618 : Added new definition of SafetyToOut to fix SafetyToOut issue as mentioned in JIRA issue-618 [Raman Sehgal]

cmake  $GEANT_ROOT/VecGeom \
         -DCMAKE_INSTALL_PREFIX="$INSTALL_FOLDER" \
         -DCMAKE_PREFIX_PATH="$VECCORE_ROOT" \
         -DVECGEOM_GDML=ON \
         -DVECGEOM_ROOT=OFF \
         -DVECGEOM_GEANT4=OFF \
         -DVECGEOM_ENABLE_CUDA=ON \
         -DVECGEOM_BACKEND=Scalar \
         -DVECGEOM_NO_SPECIALIZATION=ON \
         -DVECGEOM_VECTOR=avx2 \
         -DVECGEOM_TEST_BENCHMARK=ON \
         -DVECGEOM_TEST_COVERAGE=OFF \
         -DCMAKE_BUILD_TYPE=$BUILDMODE \
         -DCMAKE_CUDA_ARCHITECTURES=75 \
         -DUSE_CACHED_TRANSFORMATIONS=ON \
         -DCMAKE_CXX_STANDARD=17 \
         -DCMAKE_CUDA_STANDARD=17 \
         -DEXAMPLES=ON

and for AdePT:
Geant4 v11.2.0

cmake -S. -B./adept-build/${VGMODE} \
      -DCMAKE_INSTALL_PREFIX="$INSTALL_FOLDER" \
      -DCMAKE_BUILD_TYPE=$BUILDMODE \
      -DCMAKE_PREFIX_PATH="$VECCORE_ROOT;$VECGEOMROOT/lib/cmake/VecGeom;$HEPMC3_ROOT;$G4INSTALL;$G4HepEm_INSTALL" \
      -DCMAKE_CUDA_ARCHITECTURES=75 \
      -DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
      -DADEPT_USE_SURF=OFF \
      -DBENCHMARK=ON \
      -DBUILD_TESTING=ON

@drbenmorgan
Copy link
Member Author

Thanks for the info @agheata, I think the crucial thing is that we have the same VecGeom library structure here, i.e. only a static libvecgeom. Let's take one step back to the error you reported with the build of test_atomic:

> [ 47%] Linking CXX executable ../BuildProducts/bin/test_atomic
> /usr/bin/ld: ../BuildProducts/lib/libAdePT.so: undefined reference to `AdePTGeant4Integration::ReturnTracks(std::vector<adeptint::TrackData, std::allocator<adeptint::TrackData> >*, int)'
> /usr/bin/ld: ../BuildProducts/lib/libAdePT.so: undefined reference to `AdePTGeant4Integration::ProcessGPUHits(HostScoring&, HostScoring::Stats&)'
> /usr/bin/ld: ../BuildProducts/lib/libAdePT.so: undefined reference to `AdePTGeant4Integration::InitScoringData(adeptint::VolAuxData*)'
> /usr/bin/ld: ../BuildProducts/lib/libAdePT.so: undefined reference to `AdePTGeant4Integration::GetUniformFieldZ()'
> /usr/bin/ld: ../BuildProducts/lib/libAdePT.so: undefined reference to `AdePTGeant4Integration::InitVolAuxData(adeptint::VolAuxData*, G4HepEmState*, bool, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*)'
> /usr/bin/ld: ../BuildProducts/lib/libAdePT.so: undefined reference to `AdePTGeant4Integration::CheckGeometry(G4HepEmState*)'

Could you run:

$ make clean && make VERBOSE=1 test_atomic

and post the compile/link commands for the source file and end executable (should be the last two) please? For comparison, with my local, working build, I get (paths reduced for clarity):

[13/14] /.../nvcc -forward-unknown-to-host-compiler -O2 -g -DNDEBUG "--generate-code=arch=compute_86,code=[compute_86,sm_86]" /.../BuildProducts/lib64/libAdePT_static.a /.../lib64/libvecgeomcuda_static.a -Xnvlink --suppress-stack-size-warning -Xcompiler=-fPIC -Wno-deprecated-gpu-targets -shared -dlink test/CMakeFiles/test_atomic.dir/test_atomic.cu.o -o CMakeFiles/test_atomic.dir/cmake_device_link.o -L/.../lib/stubs  -L/...cuda.../lib /.../libvgdml.a /.../libvecgeom.a -lrt -lpthread -ldl  -lcudadevrt -lcudart

[14/14] : && /usr/bin/c++ -O2 -g -DNDEBUG  test/CMakeFiles/test_atomic.dir/test_atomic.cu.o CMakeFiles/test_atomic.dir/cmake_device_link.o -o BuildProducts/bin/test_atomic -L/...cuda.../stubs   -L/...cuda.../lib -Wl,-rpath,/.../  BuildProducts/lib64/libAdePT_G4_integration.so  BuildProducts/lib64/libAdePT.so  /.../libvecgeomcuda.so  /.../libvgdml.a  /.../libvecgeom.a  -lrt  -lpthread  -ldl  /...geant4, g4hepemlibs.../  -lcudadevrt  -lcudart

@JuanGonzalezCaminero
Copy link
Contributor

JuanGonzalezCaminero commented Mar 5, 2024

This is what I get for test_atomic @drbenmorgan

[100%] Linking CUDA device code CMakeFiles/test_atomic.dir/cmake_device_link.o
cd /home/gjuan/Dev/G4/AdePT_Master/build/test && /usr/bin/cmake -E cmake_link_script CMakeFiles/test_atomic.dir/dlink.txt --verbose=1
/usr/local/cuda-12.3/bin/nvcc -forward-unknown-to-host-compiler -O3 -DNDEBUG "--generate-code=arch=compute_89,code=[compute_89,sm_89]" /.../libAdePT_static.a /.../libvecgeomcuda_static.a -Xnvlink --suppress-stack-size-warning -Xcompiler=-fPIC -Wno-deprecated-gpu-targets -shared -dlink --options-file CMakeFiles/test_atomic.dir/deviceObjects1.rsp -o CMakeFiles/test_atomic.dir/cmake_device_link.o --options-file CMakeFiles/test_atomic.dir/deviceLinkLibs.rsp

[100%] Linking CXX executable ../BuildProducts/bin/test_atomic
cd /home/gjuan/Dev/G4/AdePT_Master/build/test && /usr/bin/cmake -E cmake_link_script CMakeFiles/test_atomic.dir/link.txt --verbose=1
/usr/bin/c++ -O3 -DNDEBUG CMakeFiles/test_atomic.dir/test_atomic.cu.o CMakeFiles/test_atomic.dir/cmake_device_link.o -o ../BuildProducts/bin/test_atomic   -L/usr/local/cuda-12.3/targets/x86_64-linux/lib/stubs  -L/usr/local/cuda-12.3/targets/x86_64-linux/lib  -L/usr/lib/gcc/x86_64-linux-gnu/12  -Wl,-rpath,/.../ ../BuildProducts/lib/libAdePT_G4_integration.so ../BuildProducts/lib/libAdePT.so /.../libvecgeomcuda.so /.../libvgdml.a /.../libvecgeom.a -lrt -lpthread -ldl /.../libG4Tree.so /.../libG4FR.so /.../libG4GMocren.so /.../libG4visHepRep.so /.../libG4RayTracer.so /.../libG4VRML.so /.../libG4ToolsSG.so /.../libG4vis_management.so /.../libG4modeling.so /.../libG4interfaces.so /.../libG4persistency.so /usr/lib/x86_64-linux-gnu/libxerces-c.so /.../libG4error_propagation.so /.../libG4readout.so /.../libG4physicslists.so /.../libG4run.so /.../libG4parmodels.so /.../libg4HepEm.so /.../libG4event.so /.../libG4tracking.so /.../libg4HepEmInit.so /.../libG4processes.so /.../libG4analysis.so /.../libG4digits_hits.so /.../libG4track.so /.../libG4geometry.so /.../libG4graphics_reps.so /.../libG4expat.so /.../libG4particles.so /.../libG4materials.so /.../libG4intercoms.so /.../libG4global.so /.../libG4ptl.so.2.3.3 /.../libG4zlib.so /.../libg4HepEmRun.so /.../libG4clhep.so /.../libg4HepEmData.so -lcudadevrt -lcudart

@drbenmorgan
Copy link
Member Author

drbenmorgan commented Mar 5, 2024

Thanks @JuanGonzalezCaminero, is that failing with the same error as @agheata reported earlier? The only difference I can spot is the use of response files for the device link step. Could you check/post the contents of the files:

test/CMakeFiles/test_atomic.dir/deviceObjects1.rsp
test/CMakeFiles/test_atomic.dir/deviceLinkLibs.rsp

please?

edit: clarified the path to these files from the build directory.

@JuanGonzalezCaminero
Copy link
Contributor

JuanGonzalezCaminero commented Mar 5, 2024

Yes, we were having the same errors, the files have:

deviceObjects1.rsp:

CMakeFiles/test_atomic.dir/test_atomic.cu.o

deviceLinkLibs.rsp

-L"/usr/local/cuda-12.3/targets/x86_64-linux/lib/stubs"  -L"/usr/local/cuda-12.3/targets/x86_64-linux/lib"  -L"/usr/lib/gcc/x86_64-linux-gnu/12"  /home/gjuan/Dev/G4/vecgeom_installation/lib/libvgdml.a /home/gjuan/Dev/G4/vecgeom_installation/lib/libvecgeom.a -lrt -lpthread -ldl  -lcudadevrt -lcudart

@JuanGonzalezCaminero
Copy link
Contributor

Indeed I'm able to build after the last commit @drbenmorgan

@drbenmorgan
Copy link
Member Author

O.k., I have absolutely no idea why things are failing on these specific machines, so the latest commit is a shot in the dark. All it does is add a dummy .cu file to AdePT_G4_integration. This forces it to be treated as an RDC library and should then appear on the device link line for test_atomic. This might resolve the missing symbol errors on the CERN machines, but if not then....

@drbenmorgan
Copy link
Member Author

Thanks for the update @JuanGonzalezCaminero, that's something of a relief! If you could also check that your separate "findpackage_test" example building against an install also works, that'd cover that base as well.

All I can really imagine is that this is something to do with GCC - could you and @agheata double check and confirm which Linux, gcc and associated toolchain is in use please (i.e. gcc, gcc-ar gcc-ranlib etc)?

@JuanGonzalezCaminero
Copy link
Contributor

Ah, I wasn't building the examples, I confirm the commit solved the linking issue with AdePT_G4_Integration, however the vecgeom DSO missing from command line error still appears, so I can't test the external linking yet.

@drbenmorgan
Copy link
Member Author

Right, same exercise as before, so could you post the full link line for example1 please? It'll be the last one if you run

$ make clean && make VERBOSE=1 example1

@JuanGonzalezCaminero
Copy link
Contributor

This is what I get:

cd /home/gjuan/Dev/G4/AdePT_Master/build/examples/Example1 && /usr/bin/cmake -E cmake_link_script CMakeFiles/example1.dir/link.txt --verbose=1
/usr/bin/c++ -O3 -DNDEBUG CMakeFiles/example1.dir/example1.cpp.o CMakeFiles/example1.dir/__/common/src/ParticleGun.cc.o CMakeFiles/example1.dir/__/common/src/ParticleGunMessenger.cc.o CMakeFiles/example1.dir/__/common/src/FTFP_BERT_HepEm.cc.o CMakeFiles/example1.dir/__/common/src/FTFP_BERT_AdePT.cc.o CMakeFiles/example1.dir/src/ActionInitialisation.cc.o CMakeFiles/example1.dir/src/DetectorConstruction.cc.o CMakeFiles/example1.dir/src/DetectorMessenger.cc.o CMakeFiles/example1.dir/src/EventAction.cc.o CMakeFiles/example1.dir/src/EventActionMessenger.cc.o CMakeFiles/example1.dir/src/SteppingAction.cc.o CMakeFiles/example1.dir/src/SimpleHit.cc.o CMakeFiles/example1.dir/src/PrimaryGeneratorAction.cc.o CMakeFiles/example1.dir/src/RunAction.cc.o CMakeFiles/example1.dir/src/SensitiveDetector.cc.o CMakeFiles/example1.dir/src/TrackingAction.cc.o CMakeFiles/example1.dir/__/common/src/HepMC3G4AsciiReader.cc.o CMakeFiles/example1.dir/__/common/src/HepMC3G4AsciiReaderMessenger.cc.o CMakeFiles/example1.dir/__/common/src/HepMC3G4Interface.cc.o -o ../../BuildProducts/bin/example1  -Wl,-rpath,/.../ /.../HepMC3libs ../../BuildProducts/lib/libAdePT_G4_integration_final.so ../../BuildProducts/lib/libAdePT_G4_integration.so ../../BuildProducts/lib/libAdePT.so /.../libvecgeomcuda.so /.../libvgdml.a /.../libvecgeom.a -lrt -lpthread -ldl /.../Geant4libs /.../G4HepEMlibs

@drbenmorgan
Copy link
Member Author

Which looks identical to mine... One last thing to try - could you modify the link command in examples/Example1/CMakeLists.txt to:

cuda_rdc_target_link_libraries(example1
  PRIVATE
    AdePT_G4_integration
    AdePT
    VecGeom::vecgeom
    ${HEPMC3_LIBRARIES} 
    ${HEPMC3_FIO_LIBRARIES}
)

Note the extra VecGeom link. This should fully declare that as needed in the link rather than relying on transitive inclusion. If that still fails, could you also:

  1. Wipe that build directory and start from scratch (I have seen occasional caching issues with RDC builds)
  2. If it fails again, post the link line now used.

@agheata agheata mentioned this pull request Mar 6, 2024
@drbenmorgan
Copy link
Member Author

As noted in #280, we should wait on the fix to the circular dependency presented there to go through with the existing link structure. I can then rebase here so we can at least decouple these issues (if they are not already orthogonal).

As I commented in #280:

FWIW, I can build the changes here without issue on my local system (Alma 9) with either GCC 11.4 or 12.2 and with CUDA > 12.0 or 12.3, plus a standard VecGeom CUDA enabled install.

and also the CI machine is passing. @agheata, @JuanGonzalezCaminero, which machines are you using, and are they accessible for CERN account holders at all? Are you able to reproduce the errors on lxplus-gpu? We can discuss further on Mattermost as required, but there's something very specific to your setups that's exposing this problem.

One additional cross-check to try (if @pcanal and @sethrj don't mind):

  1. Could @agheata and @JuanGonzalezCaminero try building the current Celeritas develop branch on the systems in question using the Geant4/VecGeom/etc setup as for AdePT
  2. Could @pcanal and @sethrj try building this PR in their Celeritas dev environment(s)

This should expose any fundamental system problems.

@JuanGonzalezCaminero
Copy link
Contributor

@drbenmorgan Building the current Celeritas develop branch with the following configuration:

CXX=/usr/bin/gcc-10 cmake -S. -B./build 
    -DCMAKE_INSTALL_PREFIX="../celeritas_installation" 
    -DCMAKE_PREFIX_PATH="../geant4_installation;../vecgeom_installation" 
    -DCMAKE_CUDA_ARCHITECTURES=89

I get the following when it tries to link orange-update

[ 65%] Linking CXX executable ../bin/orange-update
/usr/bin/ld: CMakeFiles/orange-update.dir/orange-update.cc.o: undefined reference to symbol '_ZNKSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE7compareEPKc@@GLIBCXX_3.4.21'
/usr/bin/ld: /lib/x86_64-linux-gnu/libstdc++.so.6: error adding symbols: DSO missing from command line

But in this case it seems unrelated. It seems I can't compile with gcc 12.3, the compiler I used instead is gcc 10.5

@sethrj
Copy link
Contributor

sethrj commented Mar 6, 2024

@JuanGonzalezCaminero that last error message looks like you're mixing incompatible compilers...

@drbenmorgan
Copy link
Member Author

Thanks @sethrj, much appreciated 🙇! That was with GCC8, CUDA 11.5 (which OS?)?

@sethrj
Copy link
Contributor

sethrj commented Mar 6, 2024

linux-rhel8-cascadelake, gcc 8.5.0, CUDA 11.5.

@drbenmorgan
Copy link
Member Author

O.k., on my setup (linux-rocky9-cascadelake, gcc 11.4.1, CUDA 12) the same spack env builds both this PR and celeritas develop perfectly (so even Celeritas with Vecgeom 2.0.0-rc2, which is this tag/tarball

CMakeLists.txt Outdated
src/AdePTPhysics.cc
src/HepEMPhysics.cc
src/AdePTGeant4Integration.cpp
src/adept_g4_dummy.cu
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cuda_rdc_ function are meant to add this when needed. The fact that it isn't means that the detection mechanism in the cuda_rdc_ function that the library and/or executable is depending on more than one independent CUDA rdc library failed (this could indeed be related to the circular depenendy).

@JuanGonzalezCaminero
Copy link
Contributor

@drbenmorgan We just merged the removal of circular dependencies. Could this PR be rebased onto master?

- Taken as of commit fb03216c0cd701c6aff0eb901370dad19e178503
- celeritas-project/celeritas@fb03216
Due to VecGeom's legacy CUDA design and the exposure of device
functions in public APIs of libraries, ensuring correct singular
device links in final executables is awkward. The Celeritas project
has evolved the build/link pattern used in VecGeom into the
"CudaRdcUtils" CMake module that enables projects using VecGeom (or
building libraries that expose device APIs) to be correctly integrated
and used by downstream clients. This is acheived through the use of
custom CMake properties and wrappers around the fundamental target_XXX
interfaces CMake provides, making use as transparent as it can be
given the problem faced.

Use CudaRdcUtils to build, link, install, and use AdePT

- Provide FindVecGeom wrapper module to add RDC properties to
  VecGeom targets
- Replace target_XXX calls with cuda_rdc_XXX equivalents
- Remove obsolete "AdePT_cuda" and "AdePT_standalone" libraries
  that CudaRdcUtils build and handles automatically
- Install and use FindVecGeom and CudaRdcUtils for use by downstream
  clients

Tests and executables build and run successfully on a local Rocky 9,
GCC11, CUDA 12 environment.
Link errors observed on CERN machines report missing symbols that
are in AdePT_G4_integration, and with this library on the link line.
The cause of this is not totally clear given success on other
platforms, though it is known that there is a circular dependency
between the AdePT and AdePT_G4_integration layer that might be an
underlying issue.

To attempt resolution/diagnosis, add a dummy .cu file to
AdePT_G4_integration. This forces it to be built as an RDC library and
thus onto the device link line alongside AdePT, potnetially resolving
the device link errors.
@drbenmorgan
Copy link
Member Author

Latest commits are a very quick and dirty rebase on the latest master, with fixups for the change to header only etc. It works locally for me modulo some device code warnings on flags and that the testField test doesn't pass. However, I want to see what the CI shows first.

@drbenmorgan
Copy link
Member Author

O.k., CI looks good, so @JuanGonzalezCaminero, @agheata could you try building again and see if things are now resolved on your side please?

@JuanGonzalezCaminero
Copy link
Contributor

JuanGonzalezCaminero commented Mar 18, 2024

Still getting one error here:

[100%] Linking CXX executable ../../BuildProducts/bin/example1
/usr/bin/ld: ../../BuildProducts/lib/libAdePT_G4_integration.so: undefined reference to `__cudaRegisterLinkedBinary_5ad212d0_23_AdePTTrackingManager_cu_6f198e90_884112'

Simplified link line:

/usr/bin/c++ -O3 -DNDEBUG CMakeFiles/example1.dir/example1.cpp.o CMakeFiles/example1.dir/__/common/src/ParticleGun.cc.o CMakeFiles/example1.dir/__/common/src/ParticleGunMessenger.cc.o CMakeFiles/example1.dir/__/common/src/FTFP_BERT_HepEm.cc.o CMakeFiles/example1.dir/__/common/src/FTFP_BERT_AdePT.cc.o CMakeFiles/example1.dir/src/ActionInitialisation.cc.o CMakeFiles/example1.dir/src/DetectorConstruction.cc.o CMakeFiles/example1.dir/src/DetectorMessenger.cc.o CMakeFiles/example1.dir/src/EventAction.cc.o CMakeFiles/example1.dir/src/EventActionMessenger.cc.o CMakeFiles/example1.dir/src/SteppingAction.cc.o CMakeFiles/example1.dir/src/SimpleHit.cc.o CMakeFiles/example1.dir/src/PrimaryGeneratorAction.cc.o CMakeFiles/example1.dir/src/RunAction.cc.o CMakeFiles/example1.dir/src/SensitiveDetector.cc.o CMakeFiles/example1.dir/src/TrackingAction.cc.o CMakeFiles/example1.dir/__/common/src/HepMC3G4AsciiReader.cc.o CMakeFiles/example1.dir/__/common/src/HepMC3G4AsciiReaderMessenger.cc.o CMakeFiles/example1.dir/__/common/src/HepMC3G4Interface.cc.o -o ../../BuildProducts/bin/example1  -Wl,-rpath,/.../ HepMC3libs ../../BuildProducts/lib/libAdePT_G4_integration_final.so ../../BuildProducts/lib/libAdePT_G4_integration.so VecGeomlibs -lrt -lpthread -ldl XercesC G4libs G4HepEMlibs

And the same line on Master:

/usr/bin/c++ -O3 -DNDEBUG CMakeFiles/example1.dir/example1.cpp.o CMakeFiles/example1.dir/__/common/src/ParticleGun.cc.o CMakeFiles/example1.dir/__/common/src/ParticleGunMessenger.cc.o CMakeFiles/example1.dir/__/common/src/FTFP_BERT_HepEm.cc.o CMakeFiles/example1.dir/__/common/src/FTFP_BERT_AdePT.cc.o CMakeFiles/example1.dir/src/ActionInitialisation.cc.o CMakeFiles/example1.dir/src/DetectorConstruction.cc.o CMakeFiles/example1.dir/src/DetectorMessenger.cc.o CMakeFiles/example1.dir/src/EventAction.cc.o CMakeFiles/example1.dir/src/EventActionMessenger.cc.o CMakeFiles/example1.dir/src/SteppingAction.cc.o CMakeFiles/example1.dir/src/SimpleHit.cc.o CMakeFiles/example1.dir/src/PrimaryGeneratorAction.cc.o CMakeFiles/example1.dir/src/RunAction.cc.o CMakeFiles/example1.dir/src/SensitiveDetector.cc.o CMakeFiles/example1.dir/src/TrackingAction.cc.o CMakeFiles/example1.dir/__/common/src/HepMC3G4AsciiReader.cc.o CMakeFiles/example1.dir/__/common/src/HepMC3G4AsciiReaderMessenger.cc.o CMakeFiles/example1.dir/__/common/src/HepMC3G4Interface.cc.o -o ../../BuildProducts/bin/example1   -L/usr/local/cuda-12.3/targets/x86_64-linux/lib/stubs  -L/usr/local/cuda-12.3/targets/x86_64-linux/lib  -L/usr/lib/gcc/x86_64-linux-gnu/12  -Wl,-rpath,/.../ ../../BuildProducts/lib/libAdePT_G4_integration.so HepMC3libs VecGeomlibs -lrt -lpthread -ldl XercesC G4libs G4HepEMlibs -lcudadevrt -lcudart_static -lrt -lpthread -ldl 

@drbenmorgan
Copy link
Member Author

Argh... Could you try adding this line:

target_link_options(example1 PRIVATE "-Wl,--no-as-needed")

at line 64 of examples/example1/CMakeLists.txt please? Are you running on an Ubuntu machine by any chance?

@JuanGonzalezCaminero
Copy link
Contributor

That line solves it indeed. I'm running on Debian

@sethrj
Copy link
Contributor

sethrj commented Mar 18, 2024

Yep, I found out quite some time ago that VecGeom requires this on Ubuntu systems: see https://wiki.gentoo.org/wiki/Project:Quality_Assurance/-Wl,-z,defs_and_-Wl,--no-allow-shlib-undefined .

@drbenmorgan
Copy link
Member Author

Thanks both! @JuanGonzalezCaminero, could you try modifying the line to:

target_link_options(example1 PRIVATE "-Wl,-z,defs")

and see what it prints please? Not a fix, but useful to see what it produces. I'll have a think about how to fix this for installs, but in the meantime I think we can use a project wide no-as-needed for our own builds.

@JuanGonzalezCaminero
Copy link
Contributor

As far as I can tell I get the same error and link line when using those options. Is there something specific I should look for?

@drbenmorgan
Copy link
Member Author

If I understand the option correctly, it should report specifics on what's missing, though perhaps it needs to be applied to the libraries itself? Anyway, I'm now able to reproduce the error locally by adding the --as-needed flag to my builds so can at leats check fixes locally (which I'll do).

@agheata
Copy link
Contributor

agheata commented Mar 20, 2024

@drbenmorgan it also works for me when adding target_link_options(integrationBenchmark PRIVATE "-Wl,--no-as-needed"), do you plan to add it globally for now in this PR? Or should we use -Wl,-z,defs during the creation of the libAdePT_G4_integration libs to force the resolution of all dependencies before the examples are built?

@drbenmorgan
Copy link
Member Author

IIUC, -Wl,-z,defs is only for library developers to warn, rather than fix things (but I need to explore this more). The idea for now would be to use --no-as-needed globally in AdePT, and then document that this may be needed by clients to resolve things.

As @WitekPokorski reported that the current AdePT libraries had been successfully linked in Gaussino, I'm first going to go back to the celer-adept joint integration layer and see what happens there when fully linking in and using AdePT before deciding how to progress this PR (I also want to go back and update g4vg and its use as I think that's a higher priority?).

@drbenmorgan drbenmorgan changed the title Factor build/link/use of VecGeom/CUDA using libraries for simplicity Draft: Factor build/link/use of VecGeom/CUDA using libraries for simplicity Mar 20, 2024
@drbenmorgan drbenmorgan marked this pull request as draft March 20, 2024 16:24
@drbenmorgan
Copy link
Member Author

Closing as superseded by #289

@drbenmorgan drbenmorgan closed this May 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants