For powers of 10, replace ipow with switch #15353

pmattione-nvidia · 2024-03-20T18:24:19Z

This adds a new runtime calculation of the power-of-10 needed for applying decimal scale factors with a switch statement. This provides the fastest way of applying the scale. Note that the multiply and divide operations are performed within the switch itself, so that the compiler sees the full instruction to optimize assembly code gen. See code comments for details.

This cannot be used within fixed_point (e.g. for comparison operators and rescaling) as it introduced too much register pressure to unrelated benchmarks. It will only be used for the decimal <--> floating conversion, so it has been moved there to be in a new header file where that code will reside (in an upcoming PR). This is part of a larger change to change the algorithm for decimal <--> floating conversion to a more accurate one that is forthcoming soon.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

…itch.

pmattione-nvidia · 2024-03-20T18:29:28Z

/ok to test

vuule · 2024-03-21T20:31:58Z

are there benchmarks that show the performance impact?
Edit: Apologies, missed the last paragraph. Disregard

harrism

Looks good. One suggestion to simplify.

cpp/include/cudf/fixed_point/fixed_point.hpp

bdice · 2024-04-16T18:22:56Z

cpp/include/cudf/fixed_point/fixed_point.hpp

+  // Stop at 10^18 to speed up compilation; literals can be used for smaller powers of 10.
+  static_assert(Exp10 >= 18);
+  if constexpr (Exp10 == 18)
+    return __int128_t(1000000000000000000LL);


Use separators for readability.

Suggested change

return __int128_t(1000000000000000000LL);

return __int128_t(1'000'000'000'000'000'000LL);

The jitify compile cannot handle separators when combined with the LL at the end.

Have you opened a bug report for that?

No, should I open an issue here on the github repo, or a bug through the nvidia bugs system?

I mean an issue on jitify's GitHub repo.

Ah ok, done.

bdice · 2024-04-16T18:24:53Z

cpp/include/cudf/fixed_point/fixed_point.hpp

+CUDF_HOST_DEVICE inline T divide_power10_64bit(T value, int exp10)
+{
+  // See comments in divide_power10_32bit() for discussion.
+  // clang-format off


I'd prefer to leave clang-format on.

Turning it off makes it line-up as:

case 0: return value; case 1: return value / 10; case 2: return value / 100; case 3: return value / 1000; case 4: return value / 10000; case 5: return value / 100000; case 6: return value / 1000000; case 7: return value / 10000000; case 8: return value / 100000000; case 9: return value / 1000000000; case 10: return value / 10000000000LL; case 11: return value / 100000000000LL; case 12: return value / 1000000000000LL; case 13: return value / 10000000000000LL; case 14: return value / 100000000000000LL; case 15: return value / 1000000000000000LL; case 16: return value / 10000000000000000LL; case 17: return value / 100000000000000000LL; case 18: return value / 1000000000000000000LL; case 19: return value / 10000000000000000000ULL; // 10^19 only fits if unsigned! default: return 0;

Which I think is less readable.

I don’t see the problem. Let’s turn it on.

With clang-format off it's much easier to see that everything is increasing by a power of 10 when they're all lined up at the same position. It's not that important to me though, so I'll re-enable clang-format here.

bdice · 2024-04-16T18:27:25Z

cpp/include/cudf/fixed_point/fixed_point.hpp

+  // If you do, prefer calling the bit-size-specific versions
+  if constexpr (sizeof(Rep) <= 4) {
+    return multiply_power10_32bit(value, exp10);
+  } else if constexpr (sizeof(Rep) == 8) {


Why is the above using <= 4 but this is using == 8? Shouldn't both be <= or ==?

You could call it with a type where sizeof is 1 or 2, hence the <= 4. I could change the other to <= 8, but I'm not anticipating us having any types with size in between.

Let’s be consistent and use <=.

…idia/cudf into actual_ipow_improve

…arking issues. 128bit functions can now be inlined.

…d. Code comments updated.

… conversion.

pmattione-nvidia · 2024-04-26T18:29:01Z

When the full suite of benchmarks were run, it turns out that the GroupBy benchmarks were too adversely affected by this change to go in as it was. There was too much register pressure using these inlined functions for fixed-point comparison operations (rescaling) that was blowing up the benchmark. Therefore this code will now only be used for the decimal <--> floating conversion, and has been moved to a new header file dedicated for that (upcoming) code. Since it's been removed from fixed_point.hpp, we no longer need to modify the jitify compilation flags either for 128bit integer types. Finally, since the decimal <--> floating conversion only uses these functions for unsigned integers, the code has been updated for those as well.

mhaseeb123 · 2024-05-20T18:08:31Z

LGTM with all the latest changes!

pmattione-nvidia · 2024-05-21T18:31:52Z

/merge

This PR contains the main algorithm for the new decimal <--> floating conversion code. This algorithm was written to address the precision issues described [here](#14169). ### Summary * The new algorithm is more accurate than the previous code, but it is also far more complex. * It can perform conversions that were not even possible in the old code due to overflow (decimal32/64/128 conversions only worked for scale factors up to 10^9/18/38, respectively). Now the entire floating-point range is convertible, including denormals. * This new algorithm is significantly faster in some parts of the conversion phase-space, and in some parts slightly slower. ### Previous PR's These contain the supporting parts of this work: * [Explicit conversion PR](#15438) * [Benchmarking PR](#15334) * [Powers-of-10 PR](#15353) * [Utilities PR](#15359). These utilities are updated here to support denormals. ### Algorithm Outline We convert floating -> (integer) decimal by: * Extract the floating-point mantissa (converted to integer) and power-of-2 * For float we use a uint64 to contain our data during the below shifting/scaling, for double uint128_t * In this shifting integer, we alternately apply the extracted powers-of-2 (bit-shifts, until they're all used) and scale-factor powers-of-10 (multiply/divide) as needed to reach the desired scale factor. Decimal -> floating is just the reverse operation. ### Supplemental Changes * Testing: Add decimal128, add precise-conversion tests. Remove kludges due to inaccurate conversions. Add test for zeroes. * Benchmarking: Enable regions of conversion phase-space for benchmarking that were not possible in the old algorithm. * Unary: Cleanup by using CUDF_ENABLE_IF. Call new conversion code for base-10 fixed-point. ### Performance for various conversions/input-ranges * Note: F32/F64 is float/double New algorithm is **FASTER** by: * F64 --> decimal64: 60% for E8 --> E15 * F64 --> decimal128: 13% for E-8 --> E-15 * F64 --> decimal128: 22% for E8 --> E15 * F64 --> decimal128: 27% for E31 --> E38 * decimal32 --> F64: 18% for E-3 --> E4 * decimal64 --> F64: 27% for E-14 --> E-7 * decimal64 --> F64: 17% for E-3 --> E4 * decimal128 --> F64: 21% for E-14 --> E-7 * decimal128 --> F64: 11% for E-3 --> E4 * decimal128 --> F64: 13% for E31 --> E38 New algorithm is **SLOWER** by: * F32 --> decimal32: 3% for E-3 --> E4 * F32 --> decimal64: 2% for E-14 --> E14 * F64 --> decimal32: 3% for E-3 --> E4 * decimal32 --> F32: 5% for E-3 --> E4 * decimal128 --> F64: 36% for E-37 --> E-30 Other kernels: * The PYMOD binary-op benchmark is 7% slower. ### Performance discussion * Many conversions have identical speed, indicating these algorithms are often fast and we are instead bottlenecked on overheads such as getting the input to the gpu in the first place. * F64 conversions are often much faster than the old algorithm as the new algorithm completely avoids the FP64 pipeline. Other than the cast to double itself, all of the operations are on integers. Thus we don't have threads competing with each other and taking turns for access to the floating-point cores. * The conversions are slightly slower for floats with powers-of-10 near zero. Presumably this is due to code overhead for e.g., handling a large range of inputs, UB-checks for bit shifts, branches for denormals, etc. * The conversion is slower for decimal128 conversions with very small exponents, which requires several large divisions (128bit divided by 64bit). * The PYMOD kernel is slower due to register pressure from the introduction of the new division routines in the earlier PR. Even though this benchmark does not perform decimal <--> floating conversions, it gets hit because of inlined template code in the kernel increasing the code/register pressure. Authors: - Paul Mattione (https://github.com/pmattione-nvidia) Approvers: - Jason Lowe (https://github.com/jlowe) - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) URL: #15905

When multiplying or dividing by a power of 10, switch from ipow to sw…

62155b2

…itch.

pmattione-nvidia added libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 20, 2024

pmattione-nvidia requested review from a team as code owners March 20, 2024 18:24

pmattione-nvidia requested review from robertmaynard and bdice March 20, 2024 18:24

github-actions bot added the CMake CMake build issue label Mar 20, 2024

harrism requested changes Apr 3, 2024

View reviewed changes

cpp/include/cudf/fixed_point/fixed_point.hpp Outdated Show resolved Hide resolved

cpp/include/cudf/fixed_point/fixed_point.hpp Outdated Show resolved Hide resolved

cpp/include/cudf/fixed_point/fixed_point.hpp Outdated Show resolved Hide resolved

Calculate large powers of 10 recursively at compile time.

97df0c9

harrism approved these changes Apr 8, 2024

View reviewed changes

pmattione-nvidia and others added 3 commits April 10, 2024 11:21

Merge branch 'branch-24.06' into actual_ipow_improve

85cbd89

Fix conversion to unsigned integer, choice of Rep type.

0d672cf

Merge branch 'branch-24.06' into actual_ipow_improve

e363259

bdice reviewed Apr 16, 2024

View reviewed changes

pmattione-nvidia added 6 commits April 17, 2024 11:55

Formatting and consistency changes

2c58b7d

Merge branch 'actual_ipow_improve' of https://github.com/pmattione-nv…

9001b64

…idia/cudf into actual_ipow_improve

fix comments

f9a0e89

Power functions no longer used for fixed_point shifting due to benchm…

0c8247a

…arking issues. 128bit functions can now be inlined.

Switch to unsigned only, since floating conversion only needs unsigne…

6187807

…d. Code comments updated.

Move powers of 10 code to a new header file specifically for floating…

7feb1c1

… conversion.

github-actions bot removed the CMake CMake build issue label Apr 26, 2024

Merge branch 'branch-24.06' into actual_ipow_improve

93956fd

pmattione-nvidia requested a review from bdice May 2, 2024 19:42

mhaseeb123 approved these changes May 20, 2024

View reviewed changes

pmattione-nvidia changed the base branch from branch-24.06 to branch-24.08 May 21, 2024 14:16

Merge branch 'branch-24.08' into actual_ipow_improve

f0c97b8

rapids-bot bot merged commit 333718a into rapidsai:branch-24.08 May 21, 2024
71 checks passed

This was referenced May 30, 2024

[BUG] casting from float32 to Decimal64Dtype is resulting in incorrect values #14169

Closed

New Decimal <--> Floating conversion #15905

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

For powers of 10, replace ipow with switch #15353

For powers of 10, replace ipow with switch #15353

pmattione-nvidia commented Mar 20, 2024 •

edited

Loading

pmattione-nvidia commented Mar 20, 2024

vuule commented Mar 21, 2024 •

edited

Loading

harrism left a comment

bdice Apr 16, 2024

pmattione-nvidia Apr 16, 2024

harrism Apr 16, 2024

pmattione-nvidia Apr 17, 2024

pmattione-nvidia Apr 26, 2024

harrism Apr 29, 2024

pmattione-nvidia May 2, 2024

bdice Apr 16, 2024

pmattione-nvidia Apr 16, 2024

pmattione-nvidia Apr 16, 2024

bdice Apr 16, 2024 •

edited

Loading

pmattione-nvidia Apr 16, 2024

bdice Apr 16, 2024

pmattione-nvidia Apr 16, 2024

bdice Apr 16, 2024

pmattione-nvidia commented Apr 26, 2024

mhaseeb123 commented May 20, 2024

pmattione-nvidia commented May 21, 2024

	return __int128_t(1000000000000000000LL);
	return __int128_t(1'000'000'000'000'000'000LL);

For powers of 10, replace ipow with switch #15353

For powers of 10, replace ipow with switch #15353

Conversation

pmattione-nvidia commented Mar 20, 2024 • edited Loading

Checklist

pmattione-nvidia commented Mar 20, 2024

vuule commented Mar 21, 2024 • edited Loading

harrism left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bdice Apr 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pmattione-nvidia commented Apr 26, 2024

mhaseeb123 commented May 20, 2024

pmattione-nvidia commented May 21, 2024

pmattione-nvidia commented Mar 20, 2024 •

edited

Loading

vuule commented Mar 21, 2024 •

edited

Loading

bdice Apr 16, 2024 •

edited

Loading