Tracking Throughput Measurements, main branch (2024.05.04.) #572

Merged

Conversation

@krasznaa (Member) commented May 4, 2024

This PR is built on top of #571; we'll need to merge that one first. 😉

With the host and CUDA track finding and fitting more or less working on the ODD ttbar simulations, I spent a bit of time updating the throughput measurement applications to be able to use the ODD files. This is what came out of it. 😉

The code technically works. But boy, will we have some work to do to make it fast... Because right now it's really not. 😦 With about 8 CPU threads I can keep up with what my RTX 3080 is doing. 🤔 And this relationship is pretty similar across all pileup values. (Though I didn't do a thorough measurement yet.)

The "good news" is that nvtop shows that my GPU is not getting too busy throughout the measurements, so there should be some big bottlenecks in the code right now. Also, at $\mu$ = 200 I can not use more than a single CPU thread for the jobs anymore with traccc_throughput_mt_cuda, otherwise I run out of GPU memory.

So... Technically things work, but we have plenty of good work ahead of us to make this all perform well! 😉

P.S. For now the applications continue to work on our old TML files as well; it was not a major pain to keep that alive. But soon we'll probably want to remove the TML functionality from this code. 🤔

@krasznaa added labels on May 4, 2024: build (This relates to the build system), cuda (Changes related to CUDA), improvement (Improve an existing feature), sycl (Changes related to SYCL), cpu (Changes related to CPU code), examples (Changes to the examples)
@krasznaa (Member, Author) commented May 4, 2024

To give some "evidence" of what I see. 😉

  • With 8 threads on my CPU:
[bash][Legolas]:build > ./bin/traccc_throughput_mt --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-directory=odd/geant4_ttbar_mu200/ --input-events=10 --processed-events=100 --cpu-threads=8

Running Multi-threaded host-only throughput tests

>>> Detector Options <<<
  Detector file       : geometries/odd/odd-detray_geometry_detray.json
  Material file       : 
  Surface grid file   : geometries/odd/odd-detray_surface_grids_detray.json
  Use detray::detector: yes
  Digitization file   : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
  Input data format             : csv
  Input directory               : odd/geant4_ttbar_mu200/
  Number of input events        : 10
  Number of input events to skip: 0
>>> Clusterization Options <<<
  Target cells per partition: 1024
>>> Track Seeding Options <<<
  None
>>> Track Finding Options <<<
  Track candidates range   : 3:100
  Minimum step length for the next surface: 0.5 [mm] 
  Maximum step counts for the next surface: 100
  Maximum Chi2             : 30
  Maximum branches per step: 4294967295
  Maximum number of skipped steps per candidates: 3
>>> Track Propagation Options <<<
  Constraint step size  : 3.40282e+38 [mm]
  Overstep tolerance    : -100 [um]
  Minimum mask tolerance: 1e-05 [mm]
  Maximum mask tolerance: 1 [mm]
  Search window         : 0 x 0
  Runge-Kutta tolerance : 0.0001
>>> Throughput Measurement Options <<<
  Cold run event(s) : 10
  Processed event(s): 100
  Log file          : 
>>> Multi-Threading Options <<<
  CPU threads: 8

WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: @traccc::io::csv::read_cells: 19157 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000000-cells.csv
WARNING: @traccc::io::csv::read_cells: 24524 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000001-cells.csv
WARNING: @traccc::io::csv::read_cells: 17547 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000002-cells.csv
WARNING: @traccc::io::csv::read_cells: 20889 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000003-cells.csv
WARNING: @traccc::io::csv::read_cells: 15151 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000004-cells.csv
WARNING: @traccc::io::csv::read_cells: 21299 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000005-cells.csv
WARNING: @traccc::io::csv::read_cells: 20111 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000006-cells.csv
WARNING: @traccc::io::csv::read_cells: 17117 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000007-cells.csv
WARNING: @traccc::io::csv::read_cells: 14836 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000008-cells.csv
WARNING: @traccc::io::csv::read_cells: 14147 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000009-cells.csv
Reconstructed track parameters: 2742001
Time totals:
                  File reading  6800 ms
            Warm-up processing  7137 ms
              Event processing  48152 ms
Throughput:
            Warm-up processing  713.705 ms/event, 1.40114 events/s
              Event processing  481.527 ms/event, 2.07673 events/s
[bash][Legolas]:build >
  • With 1 (CPU) thread, on my GPU:
[bash][Legolas]:build > ./bin/traccc_throughput_mt_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-directory=odd/geant4_ttbar_mu200/ --input-events=10 --processed-events=100 --cpu-threads=1

Running Multi-threaded CUDA GPU throughput tests

>>> Detector Options <<<
  Detector file       : geometries/odd/odd-detray_geometry_detray.json
  Material file       : 
  Surface grid file   : geometries/odd/odd-detray_surface_grids_detray.json
  Use detray::detector: yes
  Digitization file   : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
  Input data format             : csv
  Input directory               : odd/geant4_ttbar_mu200/
  Number of input events        : 10
  Number of input events to skip: 0
>>> Clusterization Options <<<
  Target cells per partition: 1024
>>> Track Seeding Options <<<
  None
>>> Track Finding Options <<<
  Track candidates range   : 3:100
  Minimum step length for the next surface: 0.5 [mm] 
  Maximum step counts for the next surface: 100
  Maximum Chi2             : 30
  Maximum branches per step: 4294967295
  Maximum number of skipped steps per candidates: 3
>>> Track Propagation Options <<<
  Constraint step size  : 3.40282e+38 [mm]
  Overstep tolerance    : -100 [um]
  Minimum mask tolerance: 1e-05 [mm]
  Maximum mask tolerance: 1 [mm]
  Search window         : 0 x 0
  Runge-Kutta tolerance : 0.0001
>>> Throughput Measurement Options <<<
  Cold run event(s) : 10
  Processed event(s): 100
  Log file          : 
>>> Multi-Threading Options <<<
  CPU threads: 1

WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: @traccc::io::csv::read_cells: 19157 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000000-cells.csv
WARNING: @traccc::io::csv::read_cells: 24524 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000001-cells.csv
WARNING: @traccc::io::csv::read_cells: 17547 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000002-cells.csv
WARNING: @traccc::io::csv::read_cells: 20889 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000003-cells.csv
WARNING: @traccc::io::csv::read_cells: 15151 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000004-cells.csv
WARNING: @traccc::io::csv::read_cells: 21299 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000005-cells.csv
WARNING: @traccc::io::csv::read_cells: 20111 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000006-cells.csv
WARNING: @traccc::io::csv::read_cells: 17117 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000007-cells.csv
WARNING: @traccc::io::csv::read_cells: 14836 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000008-cells.csv
WARNING: @traccc::io::csv::read_cells: 14147 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000009-cells.csv
Using CUDA device: NVIDIA GeForce RTX 3080 [id: 0, bus: 1, device: 0]
Using CUDA device: NVIDIA GeForce RTX 3080 [id: 0, bus: 1, device: 0]
Reconstructed track parameters: 2854095
Time totals:
                  File reading  7371 ms
            Warm-up processing  5448 ms
              Event processing  48704 ms
Throughput:
            Warm-up processing  544.806 ms/event, 1.83552 events/s
              Event processing  487.047 ms/event, 2.05319 events/s
[bash][Legolas]:build >

But honestly, VS Code was using more of my GPU in that time window than the throughput test. 🙃

@stephenswat (Member):
Oof, that's grim. But at least it works!

@stephenswat (Member):
Out of curiosity, can you run it through nvprof to see the time distribution of the kernels?

@krasznaa (Member, Author) commented May 4, 2024

Tomorrow, yes, I'll be able to. (I expect huge valleys between the kernels... since even nvtop showed the GPU doing nothing for half the time.)

@beomki-yeo (Contributor):
This is good!
Please don't forget to set the maximum number of branches per seed in the track finding to a lower value (10 or 100).
I just forgot to fix its default value in the previous PR, and it is not good for GPU performance.
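
For illustration, a minimal sketch of that kind of configuration change. The struct and field name below (finding_config, max_num_branches_per_seed) are hypothetical stand-ins, not necessarily traccc's exact identifiers:

#include <limits>

// Hypothetical stand-in for the track finding configuration; the field
// name is illustrative, not necessarily traccc's exact one.
struct finding_config {
    // Assumed to default to "unbounded", like the 4294967295 value
    // visible in the option printouts above.
    unsigned int max_num_branches_per_seed =
        std::numeric_limits<unsigned int>::max();
};

int main() {
    finding_config cfg;
    cfg.max_num_branches_per_seed = 10u;  // the lower value suggested here
    return 0;
}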

@krasznaa (Member, Author) commented May 5, 2024

As I wrote yesterday, there should be some easy wins here. Even if I don't yet know what our current bottleneck is. 🤔

Running 100 $\mu$ = 200 ttbar events through a simple NSys profiling session, I see the following:

[screenshot: NSys timeline of the full event processing]

The kernels are big. traccc::cuda::kernels::propagate_to_next_surface dominates absolutely everything else.

🤔 But for some reason, no kernel is running for most of the time. 😕 That's the thing that we'll need to understand first and foremost, as fixing it should give us almost an order of magnitude speedup right away. 🙏

A review comment was left on the following code:

track_candidates);

// Return the final container, copied back to the host.
return m_result_copy(track_states);
@krasznaa (Member, Author):

Turns out, it is this line that is killing us. 🤔

Right now the throughput measurement applications use a vecmem::cuda::host_memory_resource, wrapped by a vecmem::binary_page_memory_resource, as the "host resource" for the copy back to the host. And I see our caching memory resource using all the CPU time in the world during that copy. 😦

It seems the resulting jagged vector is very big: when I turn off host memory caching, things get even slower, this time with the CUDA memory allocations and de-allocations taking forever...

[screenshot: NSys timeline with host memory caching disabled]
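
For reference, a minimal sketch of the memory resource stacking described above, using only vecmem's public resource types; this illustrates the setup, not the exact traccc application code:

#include <vecmem/memory/binary_page_memory_resource.hpp>
#include <vecmem/memory/cuda/host_memory_resource.hpp>

int main() {
    // Pinned host memory, so that device-to-host copies can run fast /
    // asynchronously.
    vecmem::cuda::host_memory_resource pinned_host_mr;
    // Caching allocator on top of it, to avoid paying for cudaMallocHost /
    // cudaFreeHost on every single event.
    vecmem::binary_page_memory_resource cached_host_mr{pinned_host_mr};
    // cached_host_mr would then serve the host-side allocations of the
    // jagged track state container during the copy back from the device.
    static_cast<void>(cached_host_mr);
    return 0;
}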

If I remove the copy back to the host at the end of the algorithm chain, I get:

[bash][Legolas]:traccc > ./out/build/sycl/bin/traccc_throughput_st_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-directory=odd/geant4_ttbar_mu200/ --input-events=10 --processed-events=100

Running Single-threaded CUDA GPU throughput tests

>>> Detector Options <<<
  Detector file       : geometries/odd/odd-detray_geometry_detray.json
  Material file       : 
  Surface grid file   : geometries/odd/odd-detray_surface_grids_detray.json
  Use detray::detector: yes
  Digitization file   : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
  Input data format             : csv
  Input directory               : odd/geant4_ttbar_mu200/
  Number of input events        : 10
  Number of input events to skip: 0
>>> Clusterization Options <<<
  Target cells per partition: 1024
>>> Track Seeding Options <<<
  None
>>> Track Finding Options <<<
  Track candidates range   : 3:100
  Minimum step length for the next surface: 0.5 [mm] 
  Maximum step counts for the next surface: 100
  Maximum Chi2             : 30
  Maximum branches per step: 4294967295
  Maximum number of skipped steps per candidates: 3
>>> Track Propagation Options <<<
  Constraint step size  : 3.40282e+38 [mm]
  Overstep tolerance    : -100 [um]
  Minimum mask tolerance: 1e-05 [mm]
  Maximum mask tolerance: 1 [mm]
  Search window         : 0 x 0
  Runge-Kutta tolerance : 0.0001
>>> Throughput Measurement Options <<<
  Cold run event(s) : 10
  Processed event(s): 100
  Log file          : 

WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: @traccc::io::csv::read_cells: 19157 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000000-cells.csv
WARNING: @traccc::io::csv::read_cells: 24524 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000001-cells.csv
WARNING: @traccc::io::csv::read_cells: 17547 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000002-cells.csv
WARNING: @traccc::io::csv::read_cells: 20889 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000003-cells.csv
WARNING: @traccc::io::csv::read_cells: 15151 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000004-cells.csv
WARNING: @traccc::io::csv::read_cells: 21299 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000005-cells.csv
WARNING: @traccc::io::csv::read_cells: 20111 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000006-cells.csv
WARNING: @traccc::io::csv::read_cells: 17117 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000007-cells.csv
WARNING: @traccc::io::csv::read_cells: 14836 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000008-cells.csv
WARNING: @traccc::io::csv::read_cells: 14147 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000009-cells.csv
Using CUDA device: NVIDIA GeForce RTX 3080 [id: 0, bus: 1, device: 0]
Reconstructed track parameters: 0
Time totals:
                  File reading  7505 ms
            Warm-up processing  1187 ms
              Event processing  9605 ms
Throughput:
            Warm-up processing  118.77 ms/event, 8.41966 events/s
              Event processing  96.0576 ms/event, 10.4104 events/s
[bash][Legolas]:traccc >

Which is at least semi-competitive with my 64 CPU threads.

[bash][Legolas]:traccc > ./out/build/sycl/bin/traccc_throughput_mt --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-directory=odd/geant4_ttbar_mu200/ --input-events=10 --processed-events=100 --cpu-threads=64

Running Multi-threaded host-only throughput tests

>>> Detector Options <<<
  Detector file       : geometries/odd/odd-detray_geometry_detray.json
  Material file       : 
  Surface grid file   : geometries/odd/odd-detray_surface_grids_detray.json
  Use detray::detector: yes
  Digitization file   : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
  Input data format             : csv
  Input directory               : odd/geant4_ttbar_mu200/
  Number of input events        : 10
  Number of input events to skip: 0
>>> Clusterization Options <<<
  Target cells per partition: 1024
>>> Track Seeding Options <<<
  None
>>> Track Finding Options <<<
  Track candidates range   : 3:100
  Minimum step length for the next surface: 0.5 [mm] 
  Maximum step counts for the next surface: 100
  Maximum Chi2             : 30
  Maximum branches per step: 4294967295
  Maximum number of skipped steps per candidates: 3
>>> Track Propagation Options <<<
  Constraint step size  : 3.40282e+38 [mm]
  Overstep tolerance    : -100 [um]
  Minimum mask tolerance: 1e-05 [mm]
  Maximum mask tolerance: 1 [mm]
  Search window         : 0 x 0
  Runge-Kutta tolerance : 0.0001
>>> Throughput Measurement Options <<<
  Cold run event(s) : 10
  Processed event(s): 100
  Log file          : 
>>> Multi-Threading Options <<<
  CPU threads: 64

WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: @traccc::io::csv::read_cells: 19157 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000000-cells.csv
WARNING: @traccc::io::csv::read_cells: 24524 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000001-cells.csv
WARNING: @traccc::io::csv::read_cells: 17547 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000002-cells.csv
WARNING: @traccc::io::csv::read_cells: 20889 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000003-cells.csv
WARNING: @traccc::io::csv::read_cells: 15151 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000004-cells.csv
WARNING: @traccc::io::csv::read_cells: 21299 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000005-cells.csv
WARNING: @traccc::io::csv::read_cells: 20111 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000006-cells.csv
WARNING: @traccc::io::csv::read_cells: 17117 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000007-cells.csv
WARNING: @traccc::io::csv::read_cells: 14836 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000008-cells.csv
WARNING: @traccc::io::csv::read_cells: 14147 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000009-cells.csv
Reconstructed track parameters: 2831983
Time totals:
                  File reading  6869 ms
            Warm-up processing  4840 ms
              Event processing  13350 ms
Throughput:
            Warm-up processing  484.024 ms/event, 2.06601 events/s
              Event processing  133.508 ms/event, 7.49017 events/s
[bash][Legolas]:traccc >

So: Once again we'll be in the business of EDM / memory management optimizations!

@krasznaa (Member, Author) commented May 5, 2024

With the copy back to the host removed (temporarily), there don't seem to be any other "major" bottlenecks in the application that would prevent kernels from following one another quickly. 🤔

[screenshot: NSys timeline with the host copy removed]

In my book that's not a terrible situation...

@krasznaa force-pushed the TrackingThroughput-main-20240504 branch from 513a4b8 to 5ce8138 on May 5, 2024 08:13. Commit messages from the push (the first one is truncated in the page capture):

  • ...so that the throughput examples could use it for reading in their input data.
  • Taught the host and CUDA algorithms how to perform track finding and fitting when a Detray geometry is used.
  • Updated the common code of the throughput applications to deal with reading in a Detray geometry when appropriate, and to create the processing algorithms with their new interfaces.
  • Updated the SYCL algorithm to mimic the host and CUDA algorithms, even though it will not perform any track finding or fitting for the moment.
@beomki-yeo (Contributor):
Thanks for the profiling.

Some comments:

  1. AFAIK, we are using managed memory for the detray geometry. It would be interesting to try the benchmark with device memory; that is supported anyway. 🤔

  2. Just to explain one of the bottlenecks in the CKF:

[screenshot: profiler timeline of the CKF kernels]

Before the KF, the CKF kernels get smaller and smaller due to the shrinking number of tracks. I observed that only one track is still running in the GPU CKF after ~15 steps (i.e. the 15th surface-to-surface transport), which is a pretty big waste of the GPU. We will need to think about how to improve this. (Not an easy thing to fix, though.)

  3. Not really a comment, but it is time for me to go back to the detray work and optimize the propagation code. There are already a couple of unoptimized things that come to mind.

@beomki-yeo (Contributor) left a review:

Approved. Now we need 10^5 GPUs to meet 1 MHz data rate.

@niermann999 (Contributor) commented May 5, 2024

AFAIK, we are using managed memory for the detray geometry. It would be interesting to try the benchmark with device memory; that is supported anyway. 🤔

It looks to me like the detector is being copied to the device properly.

What I would like to understand is how the navigator is used in propagate_to_next_surface: i.e. is it allowed to hold its state in between runs, or is the navigation state recreated every time, so that the cache has to be rebuilt at every new surface that is reached? The latter option could be costly.
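
To make the two options concrete, here is a minimal, self-contained sketch of the two usage patterns in question. The types and functions (nav_state, make_state, advance) are hypothetical stand-ins, not detray's actual interfaces:

#include <vector>

// Hypothetical stand-ins for the navigator machinery, purely to contrast
// the two usage patterns; not detray's real API.
struct nav_state { std::vector<int> candidate_cache; };

nav_state make_state() {
    // Stands in for the expensive part: querying the surface grid,
    // intersecting up to ~50 candidate surfaces and sorting the cache.
    return nav_state{std::vector<int>(50, 0)};
}

void advance(nav_state&) { /* one surface-to-surface transport */ }

int main() {
    constexpr int n_steps = 15;

    // Option 1: the state is held in between runs; the candidate cache is
    // built once and would only be updated incrementally.
    nav_state persistent = make_state();
    for (int i = 0; i < n_steps; ++i) { advance(persistent); }

    // Option 2: the state is recreated every time, so the cache is rebuilt
    // at every new surface that is reached (the potentially costly option).
    for (int i = 0; i < n_steps; ++i) {
        nav_state fresh = make_state();
        advance(fresh);
    }
    return 0;
}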

@krasznaa (Member, Author) commented May 5, 2024

Approved. Now we need 10^5 GPUs to meet 1 MHz data rate.

[reaction image]

@beomki-yeo (Contributor) commented May 5, 2024

is the navigation state recreated every time so that the cache has to be rebuilt at every new surface that is reached? The latter option could be costly

The latter is the case, unfortunately. Meanwhile, I am not sure whether the major bottleneck comes from rebuilding the navigation state or from the matrix operations in the stepping. (I would bet on the matrix operations, though.)

@krasznaa merged commit 7d96c67 into acts-project:main on May 5, 2024 (17 of 18 checks passed).
@krasznaa deleted the TrackingThroughput-main-20240504 branch on May 5, 2024 10:07.
@niermann999 (Contributor):
is the navigation state recreated every time so that the cache has to be rebuilt at every new surface that is reached? The latter option could be costly

The latter is the case unfortunately. Meanwhile, I am not sure if the major bottleneck is coming from rebuilding the navigation state or matrix operations in stepping

Yes, that would be interesting to know. I think @andiwand just managed to speed up the ACTS stepper by a lot. However, having to query the grid that often, and then doing the intersection of up to ~50 surfaces and the subsequent sorting of the cache, could have a sizeable impact, too.

@andiwand commented May 5, 2024

Yes, that would be interesting to know. I think @andiwand just managed to speed up the ACTS stepper by a lot. However, having to query the grid that often, and then doing the intersection of up to ~50 surfaces and the subsequent sorting of the cache, could have a sizeable impact, too.

I guess you have something similar to the EigenStepper in traccc? In Acts I was able to get a 2x speedup by unrolling the math, using better common subexpressions, and making use of all the zeros in the Jacobians: acts-project/acts#3150

That said, this 2x speedup only results in a 15% speedup of the track finding, because more time is spent in the navigation, the KF math, and a somewhat suboptimal measurement calibration step.
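
As a rough illustration of the "zeros in the Jacobians" point (a toy example with an assumed block structure, not the actual ACTS or traccc transport Jacobian): if a Jacobian is known to be block lower-triangular, composing two of them takes half the block products of a dense blocked multiply.

#include <array>
#include <cstdio>

// Toy 2x2 matrix type; in reality the blocks would be bigger.
using mat2 = std::array<std::array<double, 2>, 2>;

mat2 mul(const mat2& a, const mat2& b) {
    mat2 c{};
    for (int i = 0; i < 2; ++i)
        for (int k = 0; k < 2; ++k)
            for (int j = 0; j < 2; ++j)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

mat2 add(const mat2& a, const mat2& b) {
    mat2 c{};
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            c[i][j] = a[i][j] + b[i][j];
    return c;
}

// Assumed structure: J = [[A, 0], [C, D]], i.e. the upper-right block is
// identically zero, and it stays zero under composition.
struct block_jac { mat2 A, C, D; };

// (J2 * J1) = [[A2*A1, 0], [C2*A1 + D2*C1, D2*D1]]: 4 block products
// instead of the 8 that a dense blocked multiply would need.
block_jac compose(const block_jac& j2, const block_jac& j1) {
    return {mul(j2.A, j1.A),
            add(mul(j2.C, j1.A), mul(j2.D, j1.C)),
            mul(j2.D, j1.D)};
}

int main() {
    const mat2 I{{{1., 0.}, {0., 1.}}};
    const mat2 C{{{0.1, 0.}, {0., 0.1}}};
    block_jac j{I, C, I};
    block_jac jj = compose(j, j);
    std::printf("composed C[0][0] = %.1f\n", jj.C[0][0]);  // prints 0.2
    return 0;
}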
