Device Clusterization Reorganization, main branch (2024.04.17.) #545

Merged

Conversation

krasznaa
Member

Following up on #543, this is an even more ambitious PR: it splits traccc::cuda::clusterization_algorithm and traccc::sycl::clusterization_algorithm in two, (re-)introducing the traccc::cuda::spacepoint_formation_algorithm and traccc::sycl::spacepoint_formation_algorithm algorithms.

The biggest change is in traccc::device::ccl_kernel. Since that function now receives a resizable measurement buffer, it no longer has to figure out the index of the measurement that it should fill; it just uses vecmem's built-in support for this sort of thing.
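(Just to illustrate what I mean - this is only a sketch with made-up type and function names, not the actual kernel code - appending to a resizable vecmem buffer looks roughly like this:)

```cpp
// Sketch only: appending to a resizable vecmem buffer, with hypothetical
// names. The size bookkeeping is done atomically by vecmem itself, so the
// kernel no longer has to compute the output index by hand.
#include <vecmem/containers/data/vector_view.hpp>
#include <vecmem/containers/device_vector.hpp>

struct my_measurement {  // hypothetical measurement payload
    float local0;
    float local1;
};

// In the real code this would be a device function; shown here without
// CUDA/SYCL qualifiers to keep the sketch self-contained.
inline void append_measurement(
    vecmem::data::vector_view<my_measurement> measurements_view, float l0,
    float l1) {

    // Wrap the view of the resizable buffer in a device vector.
    vecmem::device_vector<my_measurement> measurements(measurements_view);

    // push_back() claims the next free slot; no manual index juggling needed.
    measurements.push_back({l0, l1});
}
```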

I also:

  • Got rid of traccc::cuda::experimental::clusterization_algorithm and traccc::sycl::experimental::clusterization_algorithm. The updated algorithms now behave in the way that those "experimental" algorithms were meant to.
    • But I kept traccc::cuda::experimental::spacepoint_formation and traccc::sycl::experimental::spacepoint_formation around for now, since the Detray usage in those is still something that I'll want to steal from. 😉
  • Got rid of some minute code duplication, introducing some new code that the CUDA and SYCL implementations now use from the same place. (A sketch of this idea follows after this list.)
  • Did some general code cleanup here and there.
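(For the code-sharing point above, the idea is as trivial as the following made-up example - not actual traccc code - of a header-only helper that both the CUDA and the SYCL translation units can include:)

```cpp
// Hypothetical sketch of CUDA/SYCL code sharing: a header-only helper,
// written in plain C++ that both device compilers accept, included from
// both back-ends instead of being duplicated.
#pragma once

// Flatten a 2D channel coordinate into a 1D index.
inline unsigned int channel_index(unsigned int channel0, unsigned int channel1,
                                  unsigned int n_channel0) {
    return channel1 * n_channel0 + channel0;
}
```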

With all of this I simplified the device code a bit in a few places, so I wanted to see in a little more detail what happens to the performance of the code with these updates. Unfortunately the answer is a bit complicated... With the multi-threaded throughput tests, using 4 CPU threads, I see the following:

|          | $\mu$ = 40 | $\mu$ = 140 | $\mu$ = 200 | $\mu$ = 300 |
| -------- | ---------- | ----------- | ----------- | ----------- |
| Old CUDA | 2285.54    | 1454.86     | 920.469     | 527.403     |
| New CUDA | 1685.72    | 1038.05     | 742.887     | 424.643     |
| Old SYCL | 1568.65    | 1084.27     | 746.129     | 431.996     |
| New SYCL | 1587.81    | 778.757     | 523.391     | 296.209     |

It indeed looks far from great. 😦 However, when I do the same with the single-threaded throughput tests, I get:

|          | $\mu$ = 40 | $\mu$ = 140 | $\mu$ = 200 | $\mu$ = 300 |
| -------- | ---------- | ----------- | ----------- | ----------- |
| Old CUDA | 1395.47    | 674.289     | 471.894     | 300.83      |
| New CUDA | 1407.7     | 674.373     | 478.964     | 300.309     |
| Old SYCL | 963.808    | 456.35      | 331.508     | 227.2       |
| New SYCL | 971.387    | 390.687     | 283.706     | 187.615     |

The CUDA numbers here all look good, but SYCL is still not great. 😟 And I don't fully understand why.

At first I was convinced that the simplifications I made in traccc::device::ccl_kernel were responsible for the slowdown. But not really: even when I went back to the previous logic for choosing positions in the output container, the throughput numbers didn't change much.

I'll just give up for today... But let's discuss this tomorrow...

@krasznaa added the tests, cuda, sycl and examples labels on Apr 17, 2024
@krasznaa changed the title from "Host Clusterization Reorganization, main branch (2024.04.17.)" to "Device Clusterization Reorganization, main branch (2024.04.17.)" on Apr 17, 2024
@krasznaa
Member Author

I'm still stumped. 😦

According to NSight Systems the new code does not execute kernels or memory copies any more slowly. If anything, it is even faster than the current code. 🤔

  • With the current code I get this overall summary for running 1000 $\mu$=200 events on 4 CPU threads:

[Screenshot: NSight Systems summary, current code]

  • With the new code, for the same setup, I get:

[Screenshot: NSight Systems summary, new code]

Notice how the CCL kernel became faster with the updates. 🤔 Remember that, while I do introduce an additional algorithm with this re-shuffling, the number of kernels being executed remains the same!

So, since I just could not understand how the throughput test would become slower while the kernels seemingly became faster, I ran the applications through VTune as well. And there is indeed a smoking gun there. 🤔 The current code achieves the following CPU occupancy during the throughput test:

[Screenshot: VTune CPU occupancy during the throughput test, current code]

While this PR's code does this:

[Screenshot: VTune CPU occupancy during the throughput test, this PR]

(The big spike at 1 thread is due to the initialization in both cases.)

However, I can't figure out what is causing this. 😦 So if anybody has a good idea, I'm all ears, since the "threading analysis" of VTune is just not revealing anything to me at the moment...

@beomki-yeo
Contributor

beomki-yeo commented Apr 18, 2024

I don't know why you got rid of the later part of clusterization which counts the measurements. (I also don't know why you removed the sorting part. It will need to be rolled back if you encounter weird results from track finding once everything is in a single piece.)
How are you counting the number of measurements used for the spacepoint_formation? Have you checked whether you have redundant spacepoints during the seeding?

@krasznaa force-pushed the DeviceClusterizationReorg-main-20240419 branch from 6e07118 to 35f10fa on April 18, 2024 13:49
@beomki-yeo
Contributor

Out of curiosity have you checked if you are getting the same results from cpu and cuda?

@krasznaa
Member Author

So... fun! I now bumped into the same issue that I believe is giving us grief with clusterizing the ODD cells on a GPU. It seems that the code meant to order cells during CSV reading is not doing its job. 😦

I was getting afraid that there would be nothing left to debug... 😮‍💨
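(For reference, the ordering I mean is - roughly, and with made-up field names; the exact comparator traccc uses may differ - cells grouped by module and sorted within a module, something like this sketch:)

```cpp
// Hypothetical sketch of the cell ordering expected after CSV reading:
// group by module, then sort by (channel1, channel0) within a module.
// Field and type names are made up; not the actual traccc reader code.
#include <algorithm>
#include <cstdint>
#include <vector>

struct csv_cell {
    std::uint64_t module_id;
    unsigned int channel0;
    unsigned int channel1;
    float activation;
};

inline void sort_cells(std::vector<csv_cell>& cells) {
    std::sort(cells.begin(), cells.end(),
              [](const csv_cell& a, const csv_cell& b) {
                  if (a.module_id != b.module_id) {
                      return a.module_id < b.module_id;
                  }
                  if (a.channel1 != b.channel1) {
                      return a.channel1 < b.channel1;
                  }
                  return a.channel0 < b.channel0;
              });
}
```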

@krasznaa force-pushed the DeviceClusterizationReorg-main-20240419 branch from 53e0b50 to ac95998 on April 19, 2024 08:09
@krasznaa
Member Author

Never mind, we don't get two birds with just one stone... 😦 I just didn't set up the new assertion correctly in the code...

@krasznaa
Member Author

On the ODD front, since I'm now trying to see whether I can make the device clusterization work for that in this PR as well, I see the following:

...
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
[New Thread 0x7fffee7ff000 (LWP 560859)]
[New Thread 0x7fffed3ff000 (LWP 560860)]
[Detaching after fork from child process 560861]
[New Thread 0x7fffecbfe000 (LWP 560873)]
[New Thread 0x7fffe5fbd000 (LWP 560874)]

Assertion failed.

Thread 1 "traccc_seq_exam" received signal CUDA_EXCEPTION_12, Warp Assert.
[Switching focus to CUDA kernel 0, grid 1, block (26,0,0), thread (126,0,0), device 0, sm 3, warp 3, lane 30]
0x00007fffbee511a0 in __assert_fail ()
(cuda-gdb) bt
#0  0x00007fffbee511a0 in __assert_fail ()
#1  0x00007fffbee75770 in traccc::device::reduce_problem_cell (cells=..., cid=0, start=1073741824, end=33554451, 
    adjc=<error reading variable: Unknown storage specifier (read) 0x10000>, adjv=0x100000000)
    at /home/krasznaa/ATLAS/projects/traccc/traccc/device/common/include/traccc/clusterization/device/impl/reduce_problem_cell.ipp:70
#2  0x00007fffbee441e0 in traccc::device::ccl_kernel<traccc::cuda::barrier> (threadId=126, blckDim=128, blockId=26, 
    cells_view=..., modules_view=..., max_cells_per_partition=1536, target_cells_per_partition=1024, 
    partition_start=<error reading variable: Unknown storage specifier (read) 0x10000>, 
    partition_end=<error reading variable: Unknown storage specifier (read) 0x10000>, 
    outi=<error reading variable: Unknown storage specifier (read) 0x10000>, f=0x7fffdd000010, gf=0x7fffdd000c10, barrier=..., 
    measurements_view=..., cell_links=...)
    at /home/krasznaa/ATLAS/projects/traccc/traccc/device/common/include/traccc/clusterization/device/impl/ccl_kernel.ipp:226
#3  0x00007fffbee40d90 in traccc::cuda::kernels::ccl_kernel<<<(81,1,1),(128,1,1)>>> (cells_view=..., modules_view=..., 
    max_cells_per_partition=1536, target_cells_per_partition=1024, measurements_view=..., cell_links=...)
    at /home/krasznaa/ATLAS/projects/traccc/traccc/device/cuda/src/clusterization/clusterization_algorithm.cu:40
(cuda-gdb)

(This is one of the reasons why I added the many new assertions to the code: to try to understand this lingering issue.)

Any quick ideas for how we could end up with a cell that has more than 8 neighbours? 😕 (I'll try to figure this out myself as well, but if somebody already has an idea, that would help...)
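(For context, the assertion that fires guards an 8-element adjacency array. Conceptually - with made-up names, not the real traccc::device::reduce_problem_cell code - the counting looks like the sketch below, and a duplicated cell is exactly the kind of thing that can push the count past 8.)

```cpp
// Conceptual sketch of 8-neighbour counting in a CCL-style kernel, with
// hypothetical names (not the actual traccc::device::reduce_problem_cell).
// On a pixel grid a cell can have at most 8 distinct neighbours - unless the
// input contains duplicate cells, in which case the assertion fires.
#include <cassert>
#include <cstdlib>

struct cell {
    int channel0;
    int channel1;
};

inline bool is_adjacent(const cell& a, const cell& b) {
    return std::abs(a.channel0 - b.channel0) <= 1 &&
           std::abs(a.channel1 - b.channel1) <= 1;
}

// Count the neighbours of cells[cid] within the [start, end) range.
inline void count_neighbours(const cell* cells, unsigned int cid,
                             unsigned int start, unsigned int end,
                             unsigned int& adjc, unsigned int adjv[8]) {
    adjc = 0;
    for (unsigned int j = start; j < end; ++j) {
        if (j == cid) {
            continue;
        }
        if (is_adjacent(cells[cid], cells[j])) {
            assert(adjc < 8);  // a duplicated cell can push this past 8
            adjv[adjc++] = j;
        }
    }
}
```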

@stephenswat
Member

Could it be - although it would be weird - that the same cell has multiple activations in the same input?

@stephenswat
Member

Aha!

traccc/data/odd/muon100GeV-geant4 $ cat event000000000-cells.csv | cut -d ',' -f 1,3,4 | sort | uniq -c

...

2 2161728508332653312,219,0

So there are some duplicate cells in the input data!

@stephenswat
Member

In fact, some are repeated up to 12 times:

      7 2089670364539207936,226,0
      7 2089670364539207936,229,0
      7 2089670364539207936,230,0
      7 2089670364539207936,231,0
      7 2089670364539317760,156,0
      7 2089670501978336256,132,0
      8 1224979373522735104,328,703
      8 1224979373522735104,329,704
      8 1729382531788187392,177,42
      8 1729382531788189440,12,27
      8 1729382531788189440,13,27
      8 1729382531788189440,14,27
      8 1729382669227169536,6,87
      8 1729382669227169536,7,87
      8 2089670364539207936,227,0
      8 2089670364539207936,228,0
      8 2089670501977937664,38,0
      8 2089670501977937664,39,0
      8 2089670501977937664,43,0
      8 2089670501977937664,44,0
      8 2089670501978336256,129,0
      9 1729382531788187392,178,42
      9 1729382531788187392,179,42
     10 2089670501977937664,40,0
     11 2089670501977937664,42,0
     12 2089670501977937664,41,0

@krasznaa
Member Author

That was one of my guesses. But the following is now also failing for me. 😕

diff --git a/io/src/read_geometry.cpp b/io/src/read_geometry.cpp
index c88e8fe..f52ce52 100644
--- a/io/src/read_geometry.cpp
+++ b/io/src/read_geometry.cpp
@@ -19,6 +19,7 @@
 #include <vecmem/memory/host_memory_resource.hpp>
 
 // System include(s).
+#include <cassert>
 #include <stdexcept>
 
 namespace {
@@ -50,6 +51,7 @@ read_json_geometry(std::string_view filename) {
         barcode_map = std::make_unique<
             std::map<std::uint64_t, detray::geometry::barcode>>();
         for (const auto& surface : detector.surfaces()) {
+            assert(barcode_map->find(surface.source) == barcode_map->end());
             (*barcode_map)[surface.source] = surface.barcode();
         }
     }

I.e. the same surface identifier shows up multiple times from Detray. Could it be that we end up merging modules in some weird way?

Though this is probably just a red herring. At least for the clusterization...

@krasznaa
Member Author

In fact, some are repeated up to 12 times:

      [... full list of repeated cells quoted from above ...]

Since it seems you have some "shell magic" at the ready, could you remove the duplicates from the files and send me the updated TGZ? I'll upload it as a new version.

@stephenswat
Member

Let me doctor up some deduplication code and produce some deduplicated files.

@stephenswat
Member

It's also worth understanding how this happens, because I can't say it makes much sense to me. Is there any physical reason why we would get a read-out for one pixel twice? Or 12 times?

@stephenswat
Member

Interestingly, the full TML dataset also seems to have some duplicated hits, although the number is much smaller than for the ODD files. It could be that this just succeeded by chance: e.g. there was never a duplicated hit with 8 other hits in its neighbourhood, so there was some buffer in the 8-element array to compensate.

@krasznaa
Member Author

There must be some issue in the Acts "digitization" code. 🤔 This behaviour sounds like a plausible outcome if some mistake is made there. (Remember, simulation just tells us how much energy was deposited exactly where in the detector volumes. We then need to turn those energy deposits into information that the real hardware would've read out as well. I.e. we need to "digitize" the simulated data, as we call it in ATLAS.)

Pinging @asalzburger for info. 😉

Made traccc::cuda::clusterization_algorithm and
traccc::sycl::clusterization_algorithm both output measurement
containers, and introduced traccc::cuda::spacepoint_formation_algorithm
and traccc::sycl::spacepoint_formation_algorithm for turning
those measurements into spacepoints.

At the same time modified the shared clusterization code a little.
Simplifying how traccc::device::ccl_kernel would fill its output
container, and making sure that functions from traccc::core are
re-used wherever possible.

Implemented the setting of unique identifiers on the measurements,
meant for the ambiguity resolution algorithm.
@krasznaa force-pushed the DeviceClusterizationReorg-main-20240419 branch from ac95998 to 4bb7d10 on April 19, 2024 09:16
@stephenswat
Member

Okay, pcadp04:/mnt/ssd1/sswatman/traccc/data/odd has been deduplicated.

@stephenswat
Member

$ build/bin/traccc_seq_example_cuda --input-directory=odd/muon100GeV-geant4/ --detector-file=geometries/odd/odd_geometry_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-events=10

Running Full Tracking Chain Using CUDA

>>> Detector Options <<<
  Detector file       : geometries/odd/odd_geometry_detray.json
  Material file       : 
  Surface grid file   : 
  Use detray::detector: yes
  Digitization file   : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
  Input data format             : csv
  Input directory               : odd/muon100GeV-geant4/
  Number of input events        : 10
  Number of input events to skip: 0
>>> Clusterization Options <<<
  Target cells per partition: 1024
>>> Track Seeding Options <<<
  None
>>> Performance Measurement Options <<<
  Run performance checks: no
>>> Accelerator Options <<<
  Compare with CPU results: no

WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
==> Statistics ... 
- read    792140 cells from 112034 modules
- created (cpu)  0 measurements     
- created (cpu)  0 spacepoints     
- created (cuda) 319346 spacepoints     
- created  (cpu) 0 seeds
- created (cuda) 42343 seeds
==>Elapsed times...
           File reading  (cpu)  1464 ms
         Clusterization (cuda)  42 ms
   Spacepoint formation (cuda)  24 ms
                Seeding (cuda)  47 ms
           Track params (cuda)  3 ms
                     Wall time  1587 ms

😄

@stephenswat
Member

See pcadp04:/mnt/ssd1/sswatman/traccc/data/traccc-data-v6.tar.gz.

Requiring the users to provide the function with vector views instead.
@asalzburger
Contributor

Ouch, I guess we could be seeing a Geant4-versus-Fatras simulation issue: while Fatras is guaranteed to produce one step per module (i.e. the particle only intersects the module once), Geant4 physics can split a segment into several steps when some fancy physics process happens, e.g. delta rays.

If we don't catch them before the clusterizer, we might produce duplicate cells?
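One possible safeguard - just a sketch with made-up types, summing the activations of duplicates, not something traccc does today - would be to deduplicate the cells before they reach the clusterizer:

```cpp
// Hypothetical pre-clusterization deduplication: keep one entry per
// (module, channel0, channel1) and merge the activations of duplicates.
// Sketch only - not existing traccc code.
#include <algorithm>
#include <cstdint>
#include <vector>

struct csv_cell {
    std::uint64_t module_id;
    unsigned int channel0;
    unsigned int channel1;
    float activation;
};

inline void deduplicate_cells(std::vector<csv_cell>& cells) {
    // Bring duplicates next to each other.
    std::sort(cells.begin(), cells.end(),
              [](const csv_cell& a, const csv_cell& b) {
                  if (a.module_id != b.module_id) {
                      return a.module_id < b.module_id;
                  }
                  if (a.channel0 != b.channel0) {
                      return a.channel0 < b.channel0;
                  }
                  return a.channel1 < b.channel1;
              });

    // Merge runs of identical (module, channel0, channel1) entries.
    std::vector<csv_cell> merged;
    for (const csv_cell& c : cells) {
        if (!merged.empty() && merged.back().module_id == c.module_id &&
            merged.back().channel0 == c.channel0 &&
            merged.back().channel1 == c.channel1) {
            merged.back().activation += c.activation;
        } else {
            merged.push_back(c);
        }
    }
    cells.swap(merged);
}
```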

@krasznaa
Member Author

So... Stephen, since I'd like to sort this out so that we can move on to all the rest that we still need to do: how do you feel about the changes?

The very last one, where I made the code switch from using bare pointers to the shared memory blocks to using vecmem vectors, did cause a small performance drop (O(1%)), which I personally am okay with, given the debugging benefits. (In debug mode the code will be more talkative about where it encountered an issue.) But I'm open to a discussion about this.
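(To illustrate the change - a rough sketch with made-up names, not the actual kernel code - the shared-memory arrays now get wrapped like this, so that debug builds can check every access:)

```cuda
// Rough sketch, hypothetical names: wrapping a raw shared-memory block in a
// vecmem view + device vector instead of indexing a bare pointer, so that
// accesses can be assertion-checked in debug builds.
#include <vecmem/containers/data/vector_view.hpp>
#include <vecmem/containers/device_vector.hpp>

__global__ void example_kernel(/* ... */) {

    // Raw shared-memory storage, as before.
    __shared__ unsigned short f_storage[1024];

    // Wrap it; the rest of the kernel uses the vector interface.
    vecmem::data::vector_view<unsigned short> f_view(1024u, f_storage);
    vecmem::device_vector<unsigned short> f(f_view);

    // f[i] can now be used in place of f_storage[i].
}
```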

In any case, I'd want to sort out the addition of the new data file in a separate PR. And then of course I'll want to move on to looking at the full ODD tracking chain with CUDA. 😉

@stephenswat
Member

I think all the issues are resolved; I don't think a 1% performance drop warrants any debate, so we can just get it in, no problem.

@krasznaa merged commit 4665550 into acts-project:main on Apr 19, 2024
17 of 18 checks passed
@krasznaa deleted the DeviceClusterizationReorg-main-20240419 branch on April 19, 2024 11:29