Parallelize probs with OpenMP #800

Merged · 29 commits into master · Jul 16, 2024
Conversation

@vincentmr (Contributor) commented Jul 16, 2024

Before submitting

Please complete the following checklist when submitting a PR:

  • All new features must include a unit test.
    If you've fixed a bug or added code that should be tested, add a test to the
    `tests` directory!

  • All new functions and code must be clearly commented and documented.
    If you do make documentation changes, make sure that the docs build and
    render correctly by running `make docs`.

  • Ensure that the test suite passes by running `make test`.

  • Add a new entry to the `.github/CHANGELOG.md` file, summarizing the
    change and including a link back to the PR.

  • Ensure that code is properly formatted by running `make format`.

When all the above are checked, delete everything above the dashed
line and fill in the pull request template.


Context:
`probs` is central to circuit-simulation measurements.

Description of the Change:
Parallelize `probs` loops using OpenMP.
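The shape of the change can be sketched as follows. This is a hypothetical illustration, not the actual Lightning kernel: for a set of target wires, each state-vector amplitude contributes its squared modulus to the outcome bin selected by the target bits of its index, and an OpenMP array-section reduction merges per-thread partial bins so threads never race on the shared `probs` array. The function name `probs_targets` is illustrative.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a parallelized probs(targets) loop.
// Each amplitude |state[i]|^2 is accumulated into the bin given by
// gathering the bits of `targets` from index i. The reduction clause
// gives every thread a private copy of the bins and sums them at the end.
std::vector<double>
probs_targets(const std::vector<std::complex<double>> &state,
              const std::vector<std::size_t> &targets) {
    const std::size_t n_out = std::size_t{1} << targets.size();
    std::vector<double> probs(n_out, 0.0);
    double *p = probs.data();
#pragma omp parallel for reduction(+ : p[:n_out])
    for (std::size_t i = 0; i < state.size(); i++) {
        // Gather the bits of the target wires into an outcome index.
        std::size_t outcome = 0;
        for (std::size_t t = 0; t < targets.size(); t++) {
            outcome |= ((i >> targets[t]) & 1U) << t;
        }
        p[outcome] += std::norm(state[i]);
    }
    return probs;
}
```

Without `-fopenmp` the pragma is ignored and the loop runs serially with identical results; the array-section reduction syntax requires OpenMP 4.5 or later.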

Benefits:
Faster execution with several threads.
The following benchmarks are performed on ISAIC's AMD EPYC-Milan processor using several cores/threads. The times are obtained by averaging the computation of `probs(target)` 5 times for various numbers of targets. We use the last release's implementation as a reference. Since #795 brings speed-ups even for a single thread, we observe speed-ups greater than the number of threads.

(Figure: speedup vs. number of threads)

Another view of the data is the strong scaling efficiency. It is almost perfect for 2-4 threads, fairly good for 8 threads, and diminishes significantly for 16 threads.
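For reference, strong scaling efficiency here is the usual ratio of measured speedup to thread count, where $T_1$ is the single-thread time and $T_p$ the time on $p$ threads:

$$E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}.$$

Perfect scaling corresponds to $E(p) = 1$.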

(Figure: strong scaling efficiency vs. number of threads)

Possible Drawbacks:

Related GitHub Issues:

codecov bot commented Jul 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.36%. Comparing base (c164fe5) to head (4b7b794).

❗ There is a different number of reports uploaded between BASE (c164fe5) and HEAD (4b7b794): HEAD has 5 fewer uploads than BASE (7 vs. 12).
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master     #800      +/-   ##
==========================================
- Coverage   98.64%   92.36%   -6.29%
==========================================
  Files         114       73      -41
  Lines       17653    11167    -6486
==========================================
- Hits        17414    10314    -7100
- Misses        239      853     +614
```


@vincentmr vincentmr marked this pull request as ready for review July 16, 2024 14:49
@vincentmr vincentmr requested a review from a team July 16, 2024 15:22
@LuisAlfredoNu (Contributor) commented:
I think that it would be nice to add a scaling efficiency plot. What do you think @vincentmr ?

@LuisAlfredoNu (Contributor) left a review comment:

The implementation has good scaling and achieves an important speedup.

@AmintorDusko (Contributor) left a review comment:

Nice work! Thank you for that!

@vincentmr vincentmr merged commit 4ec49b8 into master Jul 16, 2024
68 of 69 checks passed
@vincentmr vincentmr deleted the probs_omp branch July 16, 2024 17:00
vincentmr added a commit that referenced this pull request Jul 24, 2024
### Before submitting

Please complete the following checklist when submitting a PR:

- [x] All new features must include a unit test.
      If you've fixed a bug or added code that should be tested, add a test to the
      [`tests`](../tests) directory!

- [x] All new functions and code must be clearly commented and documented.
      If you do make documentation changes, make sure that the docs build and
      render correctly by running `make docs`.

- [x] Ensure that the test suite passes, by running `make test`.

- [x] Add a new entry to the `.github/CHANGELOG.md` file, summarizing the
      change, and including a link back to the PR.

- [x] Ensure that code is properly formatted by running `make format`.

When all the above are checked, delete everything above the dashed
line and fill in the pull request template.


------------------------------------------------------------------------------------------------------------

**Context:**
`sample` calls `generate_samples`, which computes the full probabilities
and uses the alias method to generate samples for all wires. This is
wasteful whenever samples are required only for a subset of the wires.

**Description of the Change:**
Move alias method logic to the `discrete_random_variable` class.
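For readers unfamiliar with the technique, the Walker/Vose alias method mentioned above can be sketched as below. This is a hedged illustration of the general algorithm, not the actual `discrete_random_variable` implementation; the class name `AliasSampler` and its layout are assumptions. Setup costs O(n); each subsequent sample costs O(1).

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Illustrative Walker/Vose alias sampler (not the Lightning API).
// Each of the n bins holds an acceptance threshold prob_[i] and a
// fallback outcome alias_[i]; sampling picks a bin uniformly, then
// accepts the bin or falls back to its alias.
class AliasSampler {
  public:
    explicit AliasSampler(const std::vector<double> &probs)
        : prob_(probs.size()), alias_(probs.size()) {
        const std::size_t n = probs.size();
        std::vector<double> scaled(n);
        std::vector<std::size_t> small, large;
        for (std::size_t i = 0; i < n; i++) {
            scaled[i] = probs[i] * static_cast<double>(n);
            (scaled[i] < 1.0 ? small : large).push_back(i);
        }
        // Pair each under-full bin with an over-full one.
        while (!small.empty() && !large.empty()) {
            const std::size_t s = small.back(); small.pop_back();
            const std::size_t l = large.back(); large.pop_back();
            prob_[s] = scaled[s];
            alias_[s] = l;
            scaled[l] = (scaled[l] + scaled[s]) - 1.0;
            (scaled[l] < 1.0 ? small : large).push_back(l);
        }
        // Leftover bins are exactly full up to rounding error.
        for (std::size_t l : large) { prob_[l] = 1.0; }
        for (std::size_t s : small) { prob_[s] = 1.0; }
    }

    // Draw one outcome in O(1): uniform bin, then accept-or-alias.
    template <class RNG> std::size_t sample(RNG &gen) const {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        const std::size_t bin = std::uniform_int_distribution<std::size_t>(
            0, prob_.size() - 1)(gen);
        return (u(gen) < prob_[bin]) ? bin : alias_[bin];
    }

  private:
    std::vector<double> prob_;
    std::vector<std::size_t> alias_;
};
```

Moving this logic into the random-variable class lets the sampler be built once from the (minimal) probabilities of the requested wires and queried repeatedly.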

**Benefits:**
Compute minimal probs and samples.
We benchmark the current changes against `master`, which already
benefits from the good speed-ups introduced in #795 and #800.

We use ISAIC's AMD EPYC-Milan processor with a single core/thread. The
times are obtained using at least 5 experiments and running for at least
250 milliseconds. We begin by comparing `master`'s
`generate_samples(num_samples)` with our `generate_samples({0},
num_samples)`. For 4-12 qubits, overheads dominate the calculation (the
absolute times range from 6 microseconds to 18 milliseconds), which is
not a lot. Already at 12 qubits, however, a trend appears where our
implementation is significantly faster. This is to be expected for two
reasons: `probs(wires)` itself is faster than `probs()` (for enough
qubits), and `sample(wires)` also starts requiring significantly less
work than `sample()`.


![speedup_vs_nthreads_1w](https://github.com/user-attachments/assets/472748e9-d812-489c-a00f-2b2b74c7e682)

Next we turn to comparing `master`'s `generate_samples(num_samples)`
with our `generate_samples({0..num_qubits/2}, num_samples)`. The
situation there is similar, with speed-ups close to 1 for the smaller
qubit counts and (sometimes) beyond 20x for qubit counts above 20.


![speedup_vs_nthreads_hfw](https://github.com/user-attachments/assets/f39e3ccd-8051-4a57-a857-9cd13f547865)

Finally we compare `master`'s `generate_samples(num_samples)` with our
`generate_samples({0..num_qubits-1}, num_samples)` (i.e., computing
samples on all wires). We expect similar performance, since the main
difference comes from the caching mechanism in `master`'s discrete
random variable generator. The data suggest caching samples is
counter-productive compared with calculating the sample values on the
fly.


![speedup_vs_nthreads_fullw](https://github.com/user-attachments/assets/2c70ed21-2236-479e-be3d-6017b42fdc5e)

Turning OMP on, using 16 threads, and comparing `master`'s
`generate_samples(num_samples)` with our `generate_samples({0},
num_samples)`, we get good speed-ups above 12 qubits. Below that, the
overhead of spawning threads isn't repaid, but absolute times remain
low.


![speedup_vs_omp16_1w](https://github.com/user-attachments/assets/e3e90a55-399f-4a5b-b90e-7059a0486228)

**Possible Drawbacks:**

**Related GitHub Issues:**
[sc-65127]

---------

Co-authored-by: ringo-but-quantum <github-ringo-but-quantum@xanadu.ai>
Co-authored-by: Ali Asadi <10773383+maliasadi@users.noreply.github.com>