Add sample(wires) support in LightningQubit (#809)

**Context:**
`sample` calls `generate_samples`, which computes the full probabilities and uses the alias method to generate samples for all wires. This is wasteful whenever samples are required only for a subset of the wires.

**Description of the Change:**
Move alias method logic to the `discrete_random_variable` class.
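To make the refactored logic concrete, here is a minimal Python sketch of the alias (Walker) method as the new `DiscreteRandomVariable` class implements it: build a table of (scaled probability, alias index) pairs, then sample with two uniform draws. The class and variable names here are illustrative only, not the actual C++ or Python API.

```python
import random

class DiscreteRandomVariable:
    """Alias-method sampler over indices 0..N-1 (illustrative sketch)."""

    def __init__(self, probs, rng=random.random):
        self.n = len(probs)
        self.rng = rng
        # partners[i] = [scaled probability, alias index or None]
        partners = [[p * self.n, None] for p in probs]
        under = [i for i, b in enumerate(partners) if b[0] < 1.0]
        over = [i for i, b in enumerate(partners) if b[0] >= 1.0]
        # Pair each underfull bucket with an overfull one, moving the
        # excess mass until every bucket holds exactly one unit.
        while under and over:
            i = over.pop()
            j = under.pop()
            partners[j][1] = i                      # j's alias is i
            partners[i][0] += partners[j][0] - 1.0  # donate excess mass
            (under if partners[i][0] < 1.0 else over).append(i)
        self.partners = partners

    def __call__(self):
        # First draw picks a bucket, second decides bucket vs. alias.
        idx = int(self.rng() * self.n)
        prob, alias = self.partners[idx]
        if self.rng() >= prob and alias is not None:
            return alias
        return idx
```

Sampling is O(1) per draw after the O(N) table build, which is what makes it worthwhile to build the table only over the 2^k outcomes of the requested wires.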

**Benefits:**
Only the minimal probabilities and samples are computed.
We benchmark the current changes against `master`, which already benefits from the speed-ups introduced in #795 and #800.

We use ISAIC's AMD EPYC-Milan processor with a single core/thread. The
times are obtained from at least 5 experiments, each running for at least
250 milliseconds. We begin by comparing `master`'s
`generate_samples(num_samples)` with our `generate_samples({0},
num_samples)`. For 4-12 qubits, overheads dominate the calculation (the
absolute times range from 6 microseconds to 18 milliseconds, which is
not a lot). Already at 12 qubits, however, a trend appears where our
implementation is significantly faster. This is to be expected for two
reasons: `probs(wires)` itself is faster than `probs()` (for enough
qubits), and `sample(wires)` also requires significantly less work than
`sample()`.


![speedup_vs_nthreads_1w](https://github.com/user-attachments/assets/472748e9-d812-489c-a00f-2b2b74c7e682)
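A quick way to see why `probs(wires)` is cheaper is the size of the probability table involved: restricting to k wires means marginalizing down to 2^k entries instead of 2^n, and the alias table is then built over that much smaller vector. A small NumPy sketch of the marginalization (illustrative only, not the Lightning kernels):

```python
import numpy as np

rng = np.random.default_rng(42)
num_qubits = 10

# Random normalized state vector over 2**num_qubits amplitudes.
state = rng.normal(size=2**num_qubits) + 1j * rng.normal(size=2**num_qubits)
state /= np.linalg.norm(state)

full_probs = np.abs(state) ** 2  # 2**10 = 1024 entries

# Marginal over a single wire (most-significant bit in this toy
# indexing convention): sum out the other 9 wires.
marginal = full_probs.reshape(2, -1).sum(axis=1)  # 2 entries

assert np.isclose(full_probs.sum(), 1.0)
assert np.isclose(marginal.sum(), 1.0)
```

Sampling wire 0 then only requires a 2-entry table rather than a 1024-entry one, and the gap widens exponentially with the number of omitted wires.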

Next we turn to comparing `master`'s `generate_samples(num_samples)`
with our `generate_samples({0..num_qubits/2}, num_samples)`. The
situation is similar, with speed-ups close to 1 for the smaller
qubit counts and (sometimes) beyond 20x for qubit counts above 20.


![speedup_vs_nthreads_hfw](https://github.com/user-attachments/assets/f39e3ccd-8051-4a57-a857-9cd13f547865)

Finally we compare `master`'s `generate_samples(num_samples)` with our
`generate_samples({0..num_qubits-1}, num_samples)` (i.e. computing
samples on all wires). We expect similar performance, since the main
difference comes from the caching mechanism in `master`'s discrete
random variable generator. The data suggests caching samples is
counter-productive compared with calculating the sample values on the
fly.


![speedup_vs_nthreads_fullw](https://github.com/user-attachments/assets/2c70ed21-2236-479e-be3d-6017b42fdc5e)
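The "on the fly" calculation referred to above is just unpacking each sampled index into bits and writing it into the row-major (num_samples x n_wires) layout the Python layer expects. A hypothetical pure-Python rendering of that inner loop:

```python
def unpack_samples(indices, n_wires):
    """Unpack sampled indices into a flat row-major bit array.

    Row s holds the bits of indices[s], most-significant wire first,
    matching the linear index s * n_wires + (n_wires - 1 - j), where
    s is the "slow" row index and j the "fast" column index.
    """
    samples = [0] * (len(indices) * n_wires)
    for s, idx in enumerate(indices):
        for j in range(n_wires):
            samples[s * n_wires + (n_wires - 1 - j)] = (idx >> j) & 1
    return samples

print(unpack_samples([5], 3))  # 5 = 0b101 -> [1, 0, 1]
```

Since this is a handful of shifts and masks per sample, it is plausible that recomputing it beats a hash-map cache lookup, consistent with the benchmark above.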

Turning OMP on, using 16 threads, and comparing `master`'s
`generate_samples(num_samples)` with our `generate_samples({0},
num_samples)`, we get good speed-ups above 12 qubits. Below that, the
overhead of spawning threads isn't repaid, but the absolute times remain
low.


![speedup_vs_omp16_1w](https://github.com/user-attachments/assets/e3e90a55-399f-4a5b-b90e-7059a0486228)

**Possible Drawbacks:**

**Related GitHub Issues:**
[sc-65127]

---------

Co-authored-by: ringo-but-quantum <github-ringo-but-quantum@xanadu.ai>
Co-authored-by: Ali Asadi <10773383+maliasadi@users.noreply.github.com>
3 people committed Jul 24, 2024
1 parent 9a83ff4 commit de7beb7
Showing 9 changed files with 180 additions and 148 deletions.
3 changes: 3 additions & 0 deletions .github/CHANGELOG.md
@@ -15,6 +15,9 @@

### Improvements

* Add `generate_samples(wires)` support in LightningQubit.
[(#809)](https://github.com/PennyLaneAI/pennylane-lightning/pull/809)

* Optimize the OpenMP parallelization of Lightning-Qubit's `probs` for all number of targets.
[(#807)](https://github.com/PennyLaneAI/pennylane-lightning/pull/807)

2 changes: 1 addition & 1 deletion pennylane_lightning/core/_version.py
@@ -16,4 +16,4 @@
Version number (major.minor.patch[-label])
"""

__version__ = "0.38.0-dev12"
__version__ = "0.38.0-dev13"
@@ -292,6 +292,27 @@ void registerBackendSpecificMeasurements(PyClass &pyclass) {
static_cast<sparse_index_type>(values.request().size));
},
"Variance of a sparse Hamiltonian.")
.def("generate_samples",
[](Measurements<StateVectorT> &M,
const std::vector<std::size_t> &wires,
const std::size_t num_shots) {
const std::size_t num_wires = wires.size();
auto &&result = M.generate_samples(wires, num_shots);
const std::size_t ndim = 2;
const std::vector<std::size_t> shape{num_shots, num_wires};
constexpr auto sz = sizeof(std::size_t);
const std::vector<std::size_t> strides{sz * num_wires, sz};
// return 2-D NumPy array
return py::array(py::buffer_info(
result.data(), /* data as contiguous array */
sz, /* size of one scalar */
                    py::format_descriptor<std::size_t>::format(), /* data type */
ndim, /* number of dimensions */
shape, /* shape of the matrix */
strides /* strides for each axis */
));
})
.def("generate_mcmc_samples", [](Measurements<StateVectorT> &M,
std::size_t num_wires,
const std::string &kernelname,
@@ -19,6 +19,9 @@
#pragma once

#include <complex>
#include <random>
#include <stack>
#include <utility>
#include <vector>

#include "BitUtil.hpp"
@@ -296,6 +299,86 @@ namespace PUtil = Pennylane::Util;

namespace Pennylane::LightningQubit::Measures {

/**
* @brief Generate samples using the alias method.
*
* @tparam PrecisionT Precision data type
*/
template <typename PrecisionT> class DiscreteRandomVariable {
private:
static constexpr std::size_t default_index =
std::numeric_limits<std::size_t>::max();
std::mt19937 &gen_;
const std::size_t n_probs_;
const std::vector<std::pair<double, std::size_t>> bucket_partners_;
mutable std::uniform_real_distribution<PrecisionT> distribution_{0.0, 1.0};

public:
/**
* @brief Create a DiscreteRandomVariable object.
*
* @param gen Random number generator reference.
* @param probs Probabilities for values 0 up to N - 1, where N =
* probs.size().
*/
DiscreteRandomVariable(std::mt19937 &gen,
const std::vector<PrecisionT> &probs)
: gen_{gen}, n_probs_{probs.size()},
bucket_partners_(init_bucket_partners_(probs)) {}

/**
* @brief Return a discrete random value.
*/
std::size_t operator()() const {
const auto idx =
static_cast<std::size_t>(distribution_(gen_) * n_probs_);
if (distribution_(gen_) >= bucket_partners_[idx].first &&
bucket_partners_[idx].second != default_index) {
return bucket_partners_[idx].second;
}
return idx;
}

private:
/**
* @brief Initialize the probability table of the alias method.
*/
std::vector<std::pair<double, std::size_t>>
init_bucket_partners_(const std::vector<PrecisionT> &probs) {
std::vector<std::pair<double, std::size_t>> bucket_partners(
n_probs_, {0.0, default_index});
std::stack<std::size_t> underfull_bucket_ids;
std::stack<std::size_t> overfull_bucket_ids;

for (std::size_t i = 0; i < n_probs_; i++) {
bucket_partners[i].first = n_probs_ * probs[i];
if (bucket_partners[i].first < 1.0) {
underfull_bucket_ids.push(i);
} else {
overfull_bucket_ids.push(i);
}
}

while (!underfull_bucket_ids.empty() && !overfull_bucket_ids.empty()) {
auto i = overfull_bucket_ids.top();
overfull_bucket_ids.pop();
auto j = underfull_bucket_ids.top();
underfull_bucket_ids.pop();

bucket_partners[j].second = i;
bucket_partners[i].first += bucket_partners[j].first - 1.0;

if (bucket_partners[i].first < 1.0) {
underfull_bucket_ids.push(i);
} else {
overfull_bucket_ids.push(i);
}
}

return bucket_partners;
}
};

/**
* @brief Probabilities for a subset of the full system.
*
@@ -24,7 +24,6 @@
#include <complex>
#include <cstdio>
#include <random>
#include <stack>
#include <type_traits>
#include <unordered_map>
#include <vector>
@@ -124,7 +123,7 @@ class Measurements final

const ComplexT *arr_data = this->_statevector.getData();

// Templated 1-4 wire cases; return probs
// Templated 1-4 wire cases; return probs
PROBS_SPECIAL_CASE(1);
PROBS_SPECIAL_CASE(2);
PROBS_SPECIAL_CASE(3);
@@ -175,7 +174,7 @@
*
* @return Floating point std::vector with probabilities.
*/
auto probs(size_t num_shots) -> std::vector<PrecisionT> {
auto probs(std::size_t num_shots) -> std::vector<PrecisionT> {
return BaseType::probs(num_shots);
}

@@ -258,7 +257,7 @@
const index_type *entries_ptr, const ComplexT *values_ptr,
const index_type numNNZ) -> PrecisionT {
PL_ABORT_IF(
(this->_statevector.getLength() != (size_t(row_map_size) - 1)),
(this->_statevector.getLength() != (std::size_t(row_map_size) - 1)),
"Statevector and Hamiltonian have incompatible sizes.");
auto operator_vector = Util::apply_Sparse_Matrix(
this->_statevector.getData(),
@@ -289,7 +288,7 @@
"The lengths of the list of operations and wires do not match.");
std::vector<PrecisionT> expected_value_list;

for (size_t index = 0; index < operations_list.size(); index++) {
for (std::size_t index = 0; index < operations_list.size(); index++) {
expected_value_list.emplace_back(
expval(operations_list[index], wires_list[index]));
}
@@ -461,7 +460,7 @@

std::vector<PrecisionT> expected_value_list;

for (size_t index = 0; index < operations_list.size(); index++) {
for (std::size_t index = 0; index < operations_list.size(); index++) {
expected_value_list.emplace_back(
var(operations_list[index], wires_list[index]));
}
@@ -488,7 +487,7 @@
std::size_t num_qubits = this->_statevector.getNumQubits();
std::uniform_real_distribution<PrecisionT> distrib(0.0, 1.0);
std::vector<std::size_t> samples(num_samples * num_qubits, 0);
std::unordered_map<size_t, std::size_t> cache;
std::unordered_map<std::size_t, std::size_t> cache;
this->setRandomSeed();

TransitionKernelType transition_kernel = TransitionKernelType::Local;
@@ -502,13 +501,13 @@
std::size_t idx = 0;

// Burn In
for (size_t i = 0; i < num_burnin; i++) {
for (std::size_t i = 0; i < num_burnin; i++) {
idx = metropolis_step(this->_statevector, tk, this->rng, distrib,
idx); // Burn-in.
}

// Sample
for (size_t i = 0; i < num_samples; i++) {
for (std::size_t i = 0; i < num_samples; i++) {
idx = metropolis_step(this->_statevector, tk, this->rng, distrib,
idx);

@@ -521,7 +520,7 @@

// If not cached, compute
else {
for (size_t j = 0; j < num_qubits; j++) {
for (std::size_t j = 0; j < num_qubits; j++) {
samples[i * num_qubits + (num_qubits - 1 - j)] =
(idx >> j) & 1U;
}
@@ -551,7 +550,7 @@
const index_type *entries_ptr, const ComplexT *values_ptr,
const index_type numNNZ) {
PL_ABORT_IF(
(this->_statevector.getLength() != (size_t(row_map_size) - 1)),
(this->_statevector.getLength() != (std::size_t(row_map_size) - 1)),
"Statevector and Hamiltonian have incompatible sizes.");
auto operator_vector = Util::apply_Sparse_Matrix(
this->_statevector.getData(),
@@ -577,79 +576,36 @@
* @return 1-D vector of samples in binary, each sample is
* separated by a stride equal to the number of qubits.
*/
std::vector<std::size_t> generate_samples(size_t num_samples) {
std::vector<std::size_t> generate_samples(const std::size_t num_samples) {
const std::size_t num_qubits = this->_statevector.getNumQubits();
auto &&probabilities = probs();
std::vector<std::size_t> wires(num_qubits);
std::iota(wires.begin(), wires.end(), 0);
return generate_samples(wires, num_samples);
}

std::vector<std::size_t> samples(num_samples * num_qubits, 0);
std::uniform_real_distribution<PrecisionT> distribution(0.0, 1.0);
std::unordered_map<size_t, std::size_t> cache;
/**
* @brief Generate samples.
*
* @param wires Sample are generated for the specified wires.
* @param num_samples The number of samples to generate.
* @return 1-D vector of samples in binary, each sample is
* separated by a stride equal to the number of qubits.
*/
std::vector<std::size_t>
generate_samples(const std::vector<std::size_t> &wires,
const std::size_t num_samples) {
const std::size_t n_wires = wires.size();
std::vector<std::size_t> samples(num_samples * n_wires);
this->setRandomSeed();

const std::size_t N = probabilities.size();
std::vector<double> bucket(N);
std::vector<std::size_t> bucket_partner(N);
std::stack<std::size_t> overfull_bucket_ids;
std::stack<std::size_t> underfull_bucket_ids;

for (size_t i = 0; i < N; i++) {
bucket[i] = N * probabilities[i];
bucket_partner[i] = i;
if (bucket[i] > 1.0) {
overfull_bucket_ids.push(i);
}
if (bucket[i] < 1.0) {
underfull_bucket_ids.push(i);
}
}

// Run alias algorithm
while (!underfull_bucket_ids.empty() && !overfull_bucket_ids.empty()) {
// get an overfull bucket
std::size_t i = overfull_bucket_ids.top();

// get an underfull bucket
std::size_t j = underfull_bucket_ids.top();
underfull_bucket_ids.pop();

// underfull bucket is partned with an overfull bucket
bucket_partner[j] = i;
bucket[i] = bucket[i] + bucket[j] - 1;

// if overfull bucket is now underfull
// put in underfull stack
if (bucket[i] < 1) {
overfull_bucket_ids.pop();
underfull_bucket_ids.push(i);
}

// if overfull bucket is full -> remove
else if (bucket[i] == 1.0) {
overfull_bucket_ids.pop();
}
}

// Pick samples
for (size_t i = 0; i < num_samples; i++) {
PrecisionT pct = distribution(this->rng) * N;
auto idx = static_cast<std::size_t>(pct);
if (pct - idx > bucket[idx]) {
idx = bucket_partner[idx];
}
// If cached, retrieve sample from cache
if (cache.contains(idx)) {
std::size_t cache_id = cache[idx];
auto it_temp = samples.begin() + cache_id * num_qubits;
std::copy(it_temp, it_temp + num_qubits,
samples.begin() + i * num_qubits);
}
// If not cached, compute
else {
for (size_t j = 0; j < num_qubits; j++) {
samples[i * num_qubits + (num_qubits - 1 - j)] =
(idx >> j) & 1U;
}
cache[idx] = i;
DiscreteRandomVariable<PrecisionT> drv{this->rng, probs(wires)};
// The Python layer expects a 2D array with dimensions (n_samples x
// n_wires) and hence the linear index is `s * n_wires + (n_wires - 1 -
// j)` `s` being the "slow" row index and `j` being the "fast" column
// index
for (std::size_t s = 0; s < num_samples; s++) {
const std::size_t idx = drv();
for (std::size_t j = 0; j < n_wires; j++) {
samples[s * n_wires + (n_wires - 1 - j)] = (idx >> j) & 1U;
}
}
return samples;
