Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor histogram reduction using cuco::static_map::insert_or_apply #16485

Draft
wants to merge 12 commits into
base: branch-24.12
Choose a base branch
from

Conversation

srinivasyadav18
Copy link
Contributor

Description

Refactors histogram reduce aggregation and groupby aggregation using cuco's latest static_map insert_or_apply feature.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Aug 2, 2024
@srinivasyadav18 srinivasyadav18 added Performance Performance related issue non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Aug 2, 2024
cpp/benchmarks/reduction/histogram.cpp Outdated Show resolved Hide resolved
Comment on lines 61 to 65
// compute_row_frequencies does not provide stable ordering
thrust::sort_by_key(rmm::exec_policy(stream),
distinct_indices->begin(),
distinct_indices->end(),
distinct_counts->mutable_view().begin<int64_t>());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this sort impact the groupby histogram performance?

Copy link
Contributor Author

@srinivasyadav18 srinivasyadav18 Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is slight regression around 5% on small input ( with 100'000) , but overall improvement hides this extra sort_by_key . The overall speed up is upto 30% as the input size increases, even with this extra sorting step.

@srinivasyadav18
Copy link
Contributor Author

Improves histogram reduce aggregation performance upto 50%.

['./histogram_insert_or_apply.json', './histogram_base.json']
# histogram

#Ref = histogram with insert_or_apply
#Cmp = histogram branch 24.10

## [0] Tesla T4

|  T  |  null_probability  |  cardinality  |  num_rows  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-----|--------------------|---------------|------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 |        0.1         |      100      |   10000    | 180.703 us |      39.94% | 227.669 us |      32.14% |  46.966 us |  25.99% |   PASS   |
| I32 |        0.1         |     1000      |   10000    | 181.682 us |      42.38% | 227.440 us |      33.12% |  45.759 us |  25.19% |   PASS   |
| I32 |        0.1         |     10000     |   10000    | 176.039 us |      31.56% | 219.957 us |      25.44% |  43.918 us |  24.95% |   PASS   |
| I32 |        0.1         |      100      |   100000   | 239.124 us |      24.93% | 258.138 us |      25.81% |  19.014 us |   7.95% |   PASS   |
| I32 |        0.1         |     1000      |   100000   | 231.062 us |      37.80% | 241.066 us |      37.01% |  10.003 us |   4.33% |   PASS   |
| I32 |        0.1         |     10000     |   100000   | 230.437 us |      32.11% | 250.806 us |      29.72% |  20.370 us |   8.84% |   PASS   |
| I32 |        0.1         |    100000     |   100000   | 230.135 us |       1.04% | 261.997 us |      38.21% |  31.863 us |  13.85% |   FAIL   |
| I32 |        0.1         |      100      |  1000000   | 868.713 us |       0.50% | 840.984 us |      10.76% | -27.729 us |  -3.19% |   FAIL   |
| I32 |        0.1         |     1000      |  1000000   | 867.295 us |      14.51% | 866.554 us |       1.60% |  -0.740 us |  -0.09% |   PASS   |
| I32 |        0.1         |     10000     |  1000000   | 858.138 us |       0.38% | 903.184 us |       8.49% |  45.046 us |   5.25% |   FAIL   |
| I32 |        0.1         |    100000     |  1000000   | 974.762 us |       0.50% |   1.393 ms |       0.95% | 417.869 us |  42.87% |   FAIL   |
| I32 |        0.1         |    1000000    |  1000000   |   1.030 ms |       0.33% |   1.418 ms |       0.78% | 388.207 us |  37.69% |   FAIL   |
| I32 |        0.1         |      100      |  10000000  |   6.876 ms |       0.20% |   7.167 ms |       2.38% | 291.010 us |   4.23% |   FAIL   |
| I32 |        0.1         |     1000      |  10000000  |   7.341 ms |       2.03% |   7.483 ms |       1.26% | 141.677 us |   1.93% |   FAIL   |
| I32 |        0.1         |     10000     |  10000000  |   7.087 ms |       0.24% |   7.511 ms |       0.50% | 423.336 us |   5.97% |   FAIL   |
| I32 |        0.1         |    100000     |  10000000  |   8.284 ms |       0.09% |  12.611 ms |       0.82% |   4.327 ms |  52.24% |   FAIL   |
| I32 |        0.1         |    1000000    |  10000000  |   9.455 ms |       0.14% |  14.600 ms |       0.65% |   5.145 ms |  54.42% |   FAIL   |
| I32 |        0.1         |   10000000    |  10000000  |   9.681 ms |       0.09% |  13.840 ms |       0.72% |   4.159 ms |  42.96% |   FAIL   |
| I32 |        0.1         |      100      | 100000000  |  65.411 ms |       0.41% |  71.947 ms |       2.64% |   6.535 ms |   9.99% |   FAIL   |
| I32 |        0.1         |     1000      | 100000000  |  70.149 ms |       0.24% |  75.198 ms |       1.79% |   5.049 ms |   7.20% |   FAIL   |
| I32 |        0.1         |     10000     | 100000000  |  69.318 ms |       0.48% |  74.746 ms |       0.85% |   5.428 ms |   7.83% |   FAIL   |
| I32 |        0.1         |    100000     | 100000000  |  81.846 ms |       0.04% | 126.286 ms |       0.81% |  44.441 ms |  54.30% |   FAIL   |
| I32 |        0.1         |    1000000    | 100000000  |  94.746 ms |       0.04% | 143.983 ms |       0.50% |  49.237 ms |  51.97% |   FAIL   |
| I32 |        0.1         |   10000000    | 100000000  |  99.966 ms |       0.27% | 155.018 ms |       0.43% |  55.052 ms |  55.07% |   FAIL   |
| I32 |        0.1         |   100000000   | 100000000  | 101.119 ms |       0.05% | 141.565 ms |       0.44% |  40.446 ms |  40.00% |   FAIL   |
| I64 |        0.1         |      100      |   10000    | 146.905 us |       2.88% | 189.556 us |       1.30% |  42.652 us |  29.03% |   FAIL   |
| I64 |        0.1         |     1000      |   10000    | 149.762 us |       1.29% | 193.670 us |      37.72% |  43.908 us |  29.32% |   FAIL   |
| I64 |        0.1         |     10000     |   10000    | 149.714 us |       1.35% | 201.940 us |       1.85% |  52.226 us |  34.88% |   FAIL   |
| I64 |        0.1         |      100      |   100000   | 232.627 us |       3.84% | 248.534 us |       2.42% |  15.906 us |   6.84% |   FAIL   |
| I64 |        0.1         |     1000      |   100000   | 233.984 us |       1.53% | 244.597 us |       1.54% |  10.613 us |   4.54% |   FAIL   |
| I64 |        0.1         |     10000     |   100000   | 231.418 us |       1.32% | 259.619 us |       0.95% |  28.201 us |  12.19% |   FAIL   |
| I64 |        0.1         |    100000     |   100000   | 236.997 us |       2.15% | 261.924 us |       0.89% |  24.928 us |  10.52% |   FAIL   |
| I64 |        0.1         |      100      |  1000000   | 847.845 us |       0.54% | 898.336 us |       3.15% |  50.492 us |   5.96% |   FAIL   |
| I64 |        0.1         |     1000      |  1000000   | 870.061 us |       0.64% | 921.853 us |       1.11% |  51.792 us |   5.95% |   FAIL   |
| I64 |        0.1         |     10000     |  1000000   | 875.642 us |       0.50% | 960.936 us |       1.28% |  85.294 us |   9.74% |   FAIL   |
| I64 |        0.1         |    100000     |  1000000   |   1.012 ms |       0.34% |   1.499 ms |       1.02% | 487.233 us |  48.14% |   FAIL   |
| I64 |        0.1         |    1000000    |  1000000   |   1.075 ms |       0.35% |   1.514 ms |       0.79% | 439.029 us |  40.86% |   FAIL   |
| I64 |        0.1         |      100      |  10000000  |   7.088 ms |       0.37% |   7.795 ms |       3.26% | 706.947 us |   9.97% |   FAIL   |
| I64 |        0.1         |     1000      |  10000000  |   7.316 ms |       0.50% |   7.951 ms |       1.75% | 634.953 us |   8.68% |   FAIL   |
| I64 |        0.1         |     10000     |  10000000  |   7.224 ms |       0.27% |   8.105 ms |       1.90% | 880.550 us |  12.19% |   FAIL   |
| I64 |        0.1         |    100000     |  10000000  |   8.595 ms |       0.07% |  13.349 ms |       3.03% |   4.754 ms |  55.32% |   FAIL   |
| I64 |        0.1         |    1000000    |  10000000  |   9.656 ms |       1.34% |  15.252 ms |       0.71% |   5.597 ms |  57.97% |   FAIL   |
| I64 |        0.1         |   10000000    |  10000000  |   9.877 ms |       0.11% |  14.546 ms |       0.61% |   4.669 ms |  47.27% |   FAIL   |
| I64 |        0.1         |      100      | 100000000  |  69.830 ms |       0.46% |  76.834 ms |       2.84% |   7.004 ms |  10.03% |   FAIL   |
| I64 |        0.1         |     1000      | 100000000  |  70.124 ms |       0.42% |  79.255 ms |       1.52% |   9.131 ms |  13.02% |   FAIL   |
| I64 |        0.1         |     10000     | 100000000  |  69.631 ms |       0.36% |  79.553 ms |       0.70% |   9.921 ms |  14.25% |   FAIL   |
| I64 |        0.1         |    100000     | 100000000  |  84.208 ms |       0.08% | 131.957 ms |       0.74% |  47.749 ms |  56.70% |   FAIL   |
| I64 |        0.1         |    1000000    | 100000000  |  96.112 ms |       0.05% | 148.288 ms |       0.47% |  52.176 ms |  54.29% |   FAIL   |
| I64 |        0.1         |   10000000    | 100000000  | 101.066 ms |       0.02% | 159.668 ms |       0.50% |  58.601 ms |  57.98% |   FAIL   |
| I64 |        0.1         |   100000000   | 100000000  | 102.346 ms |       0.07% | 146.586 ms |       0.49% |  44.240 ms |  43.23% |   FAIL   |

@srinivasyadav18
Copy link
Contributor Author

Improves performance on groupby_histogram upto 30% on most cases. (Some minor regressions around 5% on lower inputs due to extra sorting step).

['./histogram_groupby_insert_or_apply.json', './histogram_groupby_base.json']
# groupby_histogram

#Ref = histogram groupby with insert_or_apply
#Cmp = histogram groupby branch 24.10

## [0] Tesla T4

|  T  |  null_probability  |  cardinality  |  num_rows  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-----|--------------------|---------------|------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 |         0          |      100      |    100     | 349.943 us |      30.20% | 330.614 us |      30.28% | -19.330 us |  -5.52% |   PASS   |
| I32 |        0.1         |      100      |    100     | 443.786 us |      28.73% | 433.660 us |      18.68% | -10.127 us |  -2.28% |   PASS   |
| I32 |        0.9         |      100      |    100     | 449.412 us |      27.02% | 433.196 us |      21.95% | -16.216 us |  -3.61% |   PASS   |
| I32 |         0          |      100      |    1000    | 350.858 us |      20.83% | 356.112 us |      29.08% |   5.254 us |   1.50% |   PASS   |
| I32 |        0.1         |      100      |    1000    | 452.332 us |      24.20% | 457.942 us |      25.67% |   5.609 us |   1.24% |   PASS   |
| I32 |        0.9         |      100      |    1000    | 458.041 us |      21.21% | 449.879 us |       2.14% |  -8.162 us |  -1.78% |   PASS   |
| I32 |         0          |     1000      |    1000    | 351.071 us |      30.13% | 346.311 us |       2.20% |  -4.760 us |  -1.36% |   PASS   |
| I32 |        0.1         |     1000      |    1000    | 446.845 us |       2.49% | 455.524 us |       0.98% |   8.679 us |   1.94% |   FAIL   |
| I32 |        0.9         |     1000      |    1000    | 452.394 us |       1.31% | 457.206 us |      22.44% |   4.812 us |   1.06% |   PASS   |
| I32 |         0          |      100      |   10000    | 353.434 us |       1.89% | 361.423 us |       1.11% |   7.989 us |   2.26% |   FAIL   |
| I32 |        0.1         |      100      |   10000    | 443.700 us |       5.04% | 464.247 us |       2.31% |  20.547 us |   4.63% |   FAIL   |
| I32 |        0.9         |      100      |   10000    | 446.813 us |      22.77% | 453.153 us |       2.02% |   6.340 us |   1.42% |   PASS   |
| I32 |         0          |     1000      |   10000    | 352.120 us |       0.83% | 358.861 us |       0.96% |   6.741 us |   1.91% |   FAIL   |
| I32 |        0.1         |     1000      |   10000    | 442.210 us |       1.69% | 464.918 us |       1.82% |  22.708 us |   5.14% |   FAIL   |
| I32 |        0.9         |     1000      |   10000    | 440.401 us |       1.70% | 453.649 us |       5.10% |  13.248 us |   3.01% |   FAIL   |
| I32 |         0          |     10000     |   10000    | 350.092 us |       1.59% | 356.938 us |      29.66% |   6.846 us |   1.96% |   FAIL   |
| I32 |        0.1         |     10000     |   10000    | 444.024 us |      23.38% | 453.530 us |       1.27% |   9.506 us |   2.14% |   FAIL   |
| I32 |        0.9         |     10000     |   10000    | 450.018 us |       1.26% | 460.032 us |       1.25% |  10.014 us |   2.23% |   FAIL   |
| I32 |         0          |      100      |   100000   | 439.603 us |       9.45% | 421.552 us |       9.44% | -18.052 us |  -4.11% |   PASS   |
| I32 |        0.1         |      100      |   100000   | 494.887 us |       0.96% | 480.105 us |       1.57% | -14.782 us |  -2.99% |   FAIL   |
| I32 |        0.9         |      100      |   100000   | 465.652 us |       1.65% | 463.798 us |       1.03% |  -1.854 us |  -0.40% |   PASS   |
| I32 |         0          |     1000      |   100000   | 400.612 us |       1.53% | 366.621 us |       1.64% | -33.991 us |  -8.48% |   FAIL   |
| I32 |        0.1         |     1000      |   100000   | 491.635 us |       1.05% | 469.043 us |      21.77% | -22.591 us |  -4.60% |   FAIL   |
| I32 |        0.9         |     1000      |   100000   | 486.444 us |       1.44% | 468.532 us |       0.97% | -17.912 us |  -3.68% |   FAIL   |
| I32 |         0          |     10000     |   100000   | 403.298 us |       5.14% | 370.975 us |       0.86% | -32.323 us |  -8.01% |   FAIL   |
| I32 |        0.1         |     10000     |   100000   | 495.892 us |      20.67% | 471.791 us |       2.00% | -24.101 us |  -4.86% |   FAIL   |
| I32 |        0.9         |     10000     |   100000   | 490.380 us |       1.69% | 467.581 us |       1.07% | -22.799 us |  -4.65% |   FAIL   |
| I32 |         0          |    100000     |   100000   | 402.927 us |       1.54% | 368.367 us |       1.52% | -34.560 us |  -8.58% |   FAIL   |
| I32 |        0.1         |    100000     |   100000   | 491.585 us |       0.99% | 470.086 us |       1.03% | -21.500 us |  -4.37% |   FAIL   |
| I32 |        0.9         |    100000     |   100000   | 491.760 us |       1.35% | 463.090 us |       1.72% | -28.669 us |  -5.83% |   FAIL   |
| I32 |         0          |      100      |  1000000   |   1.328 ms |       0.33% |   1.523 ms |       6.49% | 194.368 us |  14.63% |   FAIL   |
| I32 |        0.1         |      100      |  1000000   |   1.453 ms |       7.18% |   1.768 ms |       0.84% | 315.084 us |  21.69% |   FAIL   |
| I32 |        0.9         |      100      |  1000000   |   1.431 ms |       0.50% |   1.515 ms |       1.00% |  84.076 us |   5.88% |   FAIL   |
| I32 |         0          |     1000      |  1000000   |   1.324 ms |       0.31% |   1.502 ms |       0.69% | 177.517 us |  13.41% |   FAIL   |
| I32 |        0.1         |     1000      |  1000000   |   1.441 ms |       0.69% |   1.735 ms |       1.10% | 294.264 us |  20.43% |   FAIL   |
| I32 |        0.9         |     1000      |  1000000   |   1.267 ms |       0.49% |   1.461 ms |       6.48% | 194.039 us |  15.31% |   FAIL   |
| I32 |         0          |     10000     |  1000000   |   1.320 ms |       0.37% |   1.490 ms |       0.63% | 169.856 us |  12.86% |   FAIL   |
| I32 |        0.1         |     10000     |  1000000   |   1.449 ms |       0.35% |   1.714 ms |       0.45% | 265.381 us |  18.32% |   FAIL   |
| I32 |        0.9         |     10000     |  1000000   |   1.268 ms |       0.46% |   1.464 ms |       0.50% | 196.597 us |  15.51% |   FAIL   |
| I32 |         0          |    100000     |  1000000   |   1.353 ms |       0.36% |   1.518 ms |       1.04% | 164.147 us |  12.13% |   FAIL   |
| I32 |        0.1         |    100000     |  1000000   |   1.490 ms |       0.39% |   1.743 ms |       0.47% | 252.826 us |  16.96% |   FAIL   |
| I32 |        0.9         |    100000     |  1000000   |   1.361 ms |       0.39% |   1.573 ms |       6.68% | 211.856 us |  15.57% |   FAIL   |
| I32 |         0          |    1000000    |  1000000   |   1.350 ms |       0.35% |   1.524 ms |       0.50% | 173.967 us |  12.88% |   FAIL   |
| I32 |        0.1         |    1000000    |  1000000   |   1.491 ms |       6.92% |   1.737 ms |       0.35% | 246.013 us |  16.50% |   FAIL   |
| I32 |        0.9         |    1000000    |  1000000   |   1.471 ms |       0.36% |   1.644 ms |       0.45% | 172.901 us |  11.75% |   FAIL   |
| I32 |         0          |      100      |  10000000  |  11.762 ms |       0.33% |  14.270 ms |       0.50% |   2.508 ms |  21.32% |   FAIL   |
| I32 |        0.1         |      100      |  10000000  |  12.162 ms |       0.23% |  16.590 ms |       0.37% |   4.429 ms |  36.42% |   FAIL   |
| I32 |        0.9         |      100      |  10000000  |  14.889 ms |       0.50% |  16.906 ms |       1.11% |   2.017 ms |  13.55% |   FAIL   |
| I32 |         0          |     1000      |  10000000  |  11.657 ms |       0.10% |  14.462 ms |       1.26% |   2.805 ms |  24.06% |   FAIL   |
| I32 |        0.1         |     1000      |  10000000  |  12.143 ms |       0.43% |  16.413 ms |       0.59% |   4.270 ms |  35.17% |   FAIL   |
| I32 |        0.9         |     1000      |  10000000  |  11.940 ms |       0.50% |  13.802 ms |       1.37% |   1.862 ms |  15.60% |   FAIL   |
| I32 |         0          |     10000     |  10000000  |  11.541 ms |       0.18% |  14.255 ms |       1.32% |   2.714 ms |  23.51% |   FAIL   |
| I32 |        0.1         |     10000     |  10000000  |  11.893 ms |       0.29% |  16.106 ms |       0.67% |   4.213 ms |  35.42% |   FAIL   |
| I32 |        0.9         |     10000     |  10000000  |   9.906 ms |       0.81% |  12.938 ms |       1.29% |   3.032 ms |  30.60% |   FAIL   |
| I32 |         0          |    100000     |  10000000  |  11.575 ms |       0.10% |  14.236 ms |       0.49% |   2.661 ms |  22.99% |   FAIL   |
| I32 |        0.1         |    100000     |  10000000  |  12.043 ms |       0.17% |  15.909 ms |       0.64% |   3.866 ms |  32.11% |   FAIL   |
| I32 |        0.9         |    100000     |  10000000  |  10.053 ms |       0.36% |  13.098 ms |       0.57% |   3.045 ms |  30.29% |   FAIL   |
| I32 |         0          |    1000000    |  10000000  |  12.091 ms |       0.29% |  14.634 ms |       1.34% |   2.543 ms |  21.03% |   FAIL   |
| I32 |        0.1         |    1000000    |  10000000  |  12.678 ms |       0.14% |  16.426 ms |       0.66% |   3.748 ms |  29.56% |   FAIL   |
| I32 |        0.9         |    1000000    |  10000000  |  11.352 ms |       0.34% |  14.470 ms |       0.68% |   3.118 ms |  27.47% |   FAIL   |
| I32 |         0          |   10000000    |  10000000  |  12.299 ms |       0.15% |  14.913 ms |       0.47% |   2.614 ms |  21.25% |   FAIL   |
| I32 |        0.1         |   10000000    |  10000000  |  12.969 ms |       0.21% |  16.601 ms |       1.19% |   3.632 ms |  28.01% |   FAIL   |
| I32 |        0.9         |   10000000    |  10000000  |  12.875 ms |       2.23% |  15.604 ms |       0.65% |   2.729 ms |  21.19% |   FAIL   |

cpp/src/reductions/histogram.cu Outdated Show resolved Hide resolved
map.insert(pair_iter, pair_iter + input.num_rows(), key_hasher, key_equal, stream.value());
}

auto const key_equal = row_comp.equal_to<false>(has_nulls, null_equality::EQUAL, value_comp);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if has_nested_columns is true?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nested types are not yet supported with histogram. So the current implementation throws an error.

"Nested types are not yet supported in histogram aggregation.",

cpp/src/reductions/histogram.cu Outdated Show resolved Hide resolved
@mhaseeb123 mhaseeb123 self-assigned this Sep 24, 2024
@mhaseeb123 mhaseeb123 changed the base branch from branch-24.10 to branch-24.12 September 24, 2024 21:15
Copy link

copy-pr-bot bot commented Sep 24, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

3 participants