Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For powers of 10, replace ipow with switch #15353

Merged
merged 13 commits into from
May 21, 2024

Conversation

pmattione-nvidia
Copy link
Contributor

@pmattione-nvidia pmattione-nvidia commented Mar 20, 2024

This adds a new runtime calculation of the power-of-10 needed for applying decimal scale factors with a switch statement. This provides the fastest way of applying the scale. Note that the multiply and divide operations are performed within the switch itself, so that the compiler sees the full instruction to optimize assembly code gen. See code comments for details.

This cannot be used within fixed_point (e.g. for comparison operators and rescaling) as it introduced too much register pressure to unrelated benchmarks. It will only be used for the decimal <--> floating conversion, so it has been moved there to be in a new header file where that code will reside (in an upcoming PR). This is part of a larger change to change the algorithm for decimal <--> floating conversion to a more accurate one that is forthcoming soon.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@pmattione-nvidia pmattione-nvidia added libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 20, 2024
@pmattione-nvidia pmattione-nvidia requested review from a team as code owners March 20, 2024 18:24
@github-actions github-actions bot added the CMake CMake build issue label Mar 20, 2024
@pmattione-nvidia
Copy link
Contributor Author

/ok to test

@vuule
Copy link
Contributor

vuule commented Mar 21, 2024

are there benchmarks that show the performance impact?
Edit: Apologies, missed the last paragraph. Disregard

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. One suggestion to simplify.

cpp/include/cudf/fixed_point/fixed_point.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/fixed_point/fixed_point.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/fixed_point/fixed_point.hpp Outdated Show resolved Hide resolved
// Stop at 10^18 to speed up compilation; literals can be used for smaller powers of 10.
static_assert(Exp10 >= 18);
if constexpr (Exp10 == 18)
return __int128_t(1000000000000000000LL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use separators for readability.

Suggested change
return __int128_t(1000000000000000000LL);
return __int128_t(1'000'000'000'000'000'000LL);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The jitify compile cannot handle separators when combined with the LL at the end.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you opened a bug report for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, should I open an issue here on the github repo, or a bug through the nvidia bugs system?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean an issue on jitify's GitHub repo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, done.

CUDF_HOST_DEVICE inline T divide_power10_64bit(T value, int exp10)
{
// See comments in divide_power10_32bit() for discussion.
// clang-format off
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to leave clang-format on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turning it off makes it line-up as:

case 0: return value;
case 1: return value / 10;
case 2: return value / 100;
case 3: return value / 1000;
case 4: return value / 10000;
case 5: return value / 100000;
case 6: return value / 1000000;
case 7: return value / 10000000;
case 8: return value / 100000000;
case 9: return value / 1000000000;
case 10: return value / 10000000000LL;
case 11: return value / 100000000000LL;
case 12: return value / 1000000000000LL;
case 13: return value / 10000000000000LL;
case 14: return value / 100000000000000LL;
case 15: return value / 1000000000000000LL;
case 16: return value / 10000000000000000LL;
case 17: return value / 100000000000000000LL;
case 18: return value / 1000000000000000000LL;
case 19: return value / 10000000000000000000ULL;  // 10^19 only fits if unsigned!
default: return 0;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which I think is less readable.

Copy link
Contributor

@bdice bdice Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t see the problem. Let’s turn it on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With clang-format off it's much easier to see that everything is increasing by a power of 10 when they're all lined up at the same position. It's not that important to me though, so I'll re-enable clang-format here.

// If you do, prefer calling the bit-size-specific versions
if constexpr (sizeof(Rep) <= 4) {
return multiply_power10_32bit(value, exp10);
} else if constexpr (sizeof(Rep) == 8) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the above using <= 4 but this is using == 8? Shouldn't both be <= or ==?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could call it with a type where sizeof is 1 or 2, hence the <= 4. I could change the other to <= 8, but I'm not anticipating us having any types with size in between.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s be consistent and use <=.

@github-actions github-actions bot removed the CMake CMake build issue label Apr 26, 2024
@pmattione-nvidia
Copy link
Contributor Author

When the full suite of benchmarks were run, it turns out that the GroupBy benchmarks were too adversely affected by this change to go in as it was. There was too much register pressure using these inlined functions for fixed-point comparison operations (rescaling) that was blowing up the benchmark. Therefore this code will now only be used for the decimal <--> floating conversion, and has been moved to a new header file dedicated for that (upcoming) code. Since it's been removed from fixed_point.hpp, we no longer need to modify the jitify compilation flags either for 128bit integer types. Finally, since the decimal <--> floating conversion only uses these functions for unsigned integers, the code has been updated for those as well.

@mhaseeb123
Copy link
Member

LGTM with all the latest changes!

@pmattione-nvidia pmattione-nvidia changed the base branch from branch-24.06 to branch-24.08 May 21, 2024 14:16
@pmattione-nvidia
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 333718a into rapidsai:branch-24.08 May 21, 2024
71 checks passed
rapids-bot bot pushed a commit that referenced this pull request Jul 11, 2024
This PR contains the main algorithm for the new decimal <--> floating conversion code. This algorithm was written to address the precision issues described [here](#14169). 

### Summary
* The new algorithm is more accurate than the previous code, but it is also far more complex.  
* It can perform conversions that were not even possible in the old code due to overflow (decimal32/64/128 conversions only worked for scale factors up to 10^9/18/38, respectively). Now the entire floating-point range is convertible, including denormals. 
* This new algorithm is significantly faster in some parts of the conversion phase-space, and in some parts slightly slower.  

### Previous PR's 
These contain the supporting parts of this work:  
* [Explicit conversion PR](#15438) 
* [Benchmarking PR](#15334)
* [Powers-of-10 PR](#15353)
* [Utilities PR](#15359). These utilities are updated here to support denormals. 

### Algorithm Outline
We convert floating -> (integer) decimal by:
* Extract the floating-point mantissa (converted to integer) and power-of-2
* For float we use a uint64 to contain our data during the below shifting/scaling, for double uint128_t
* In this shifting integer, we alternately apply the extracted powers-of-2 (bit-shifts, until they're all used) and scale-factor powers-of-10 (multiply/divide) as needed to reach the desired scale factor. 

Decimal -> floating is just the reverse operation.  

### Supplemental Changes
* Testing: Add decimal128, add precise-conversion tests. Remove kludges due to inaccurate conversions. Add test for zeroes. 
* Benchmarking: Enable regions of conversion phase-space for benchmarking that were not possible in the old algorithm. 
* Unary: Cleanup by using CUDF_ENABLE_IF.  Call new conversion code for base-10 fixed-point. 

### Performance for various conversions/input-ranges
* Note: F32/F64 is float/double

New algorithm is **FASTER** by: 
* F64             --> decimal64:   60% for E8    --> E15
* F64             --> decimal128: 13% for E-8  --> E-15
* F64             --> decimal128: 22% for E8    --> E15
* F64             --> decimal128: 27% for E31  --> E38
* decimal32   --> F64:             18% for E-3   --> E4
* decimal64   --> F64:             27% for E-14 --> E-7
* decimal64   --> F64:             17% for E-3   --> E4
* decimal128 --> F64:             21% for E-14 --> E-7
* decimal128 --> F64:             11% for E-3   --> E4
* decimal128 --> F64:             13% for E31   --> E38

New algorithm is **SLOWER** by: 
* F32             --> decimal32:     3% for E-3   --> E4
* F32             --> decimal64:     2% for E-14   --> E14
* F64             --> decimal32:     3% for E-3   --> E4
* decimal32   --> F32:               5% for E-3   --> E4
* decimal128 --> F64:             36% for E-37 --> E-30

Other kernels:
* The PYMOD binary-op benchmark is 7% slower.  

### Performance discussion
* Many conversions have identical speed, indicating these algorithms are often fast and we are instead bottlenecked on overheads such as getting the input to the gpu in the first place. 
* F64 conversions are often much faster than the old algorithm as the new algorithm completely avoids the FP64 pipeline. Other than the cast to double itself, all of the operations are on integers. Thus we don't have threads competing with each other and taking turns for access to the floating-point cores. 
* The conversions are slightly slower for floats with powers-of-10 near zero.  Presumably this is due to code overhead for e.g., handling a large range of inputs, UB-checks for bit shifts, branches for denormals, etc. 
* The conversion is slower for decimal128 conversions with very small exponents, which requires several large divisions (128bit divided by 64bit). 
* The PYMOD kernel is slower due to register pressure from the introduction of the new division routines in the earlier PR. Even though this benchmark does not perform decimal <--> floating conversions, it gets hit because of inlined template code in the kernel increasing the code/register pressure.

Authors:
  - Paul Mattione (https://github.com/pmattione-nvidia)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - Bradley Dice (https://github.com/bdice)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #15905
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

5 participants