Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix inf/NaN comparisons for FLOAT orderby in window functions #13635

Merged
merged 14 commits into from
Jul 5, 2023

Conversation

mythrocks
Copy link
Contributor

Description

This is a follow-up to #13512 (which added support for floating point order-by columns in window functions), and #13606 (which fixed how negative values are handled for floating point order-by).

This commit fixes how NaN and +/- Infinity values are handled for floating point.

Prior to this commit, the calculations for range window extents depended on the behaviour of thrust::less<float> and thrust::greater<float>, as well as addition/subtraction on +/- Infinity. This produced some unexpected results:

  1. thrust::less/greater on NaN does not produce strict ordering.
  2. Addition/Subtraction on the numerical values of Infinity could produce finite values that interfere with window extent calculations. Ideally, the results should have remained infinite.

This commit adds custom comparators with NaN awareness, to better handle columns that contain NaNs. It also fixes range calculations where Infinity is involved.

Tests have been added to cover ASC/DESC order sorting on FLOAT, with NaN and Infinity values.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

This is a follow-up to rapidsai#13512 (which added support for floating point
order-by columns in window functions), and rapidsai#13606 (which fixed how
negative values are handled for floating point order-by).

This commit fixes how `NaN` and `+/- Infinity` values are handled
for floating point.

Prior to this commit, the calculations for range window extents depended
on the behaviour of `thrust::less<float>` and `thrust::greater<float>`,
as well as addition/subtraction on `+/- Infinity`. This produced some
unexpected results:
1. `thrust::less`/`greater` on `NaN` does not produce strict ordering.
2. Addition/Subtraction on the numerical values of `Infinity` could
   produce finite values that interfere with window extent calculations.
   Ideally, the results should have remained infinite.

This commit adds custom comparators with `NaN` awareness, to better
handle columns that contain `NaN`s. It also fixes range calculations
where `Infinity` is involved.

Tests have been added to cover ASC/DESC order sorting on `FLOAT`,
with `NaN` and `Infinity` values.
@mythrocks mythrocks added bug Something isn't working Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Jun 28, 2023
@mythrocks mythrocks requested a review from a team as a code owner June 28, 2023 22:11
@mythrocks mythrocks self-assigned this Jun 28, 2023
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jun 28, 2023
mythrocks added a commit to mythrocks/spark-rapids that referenced this pull request Jun 28, 2023
Depends on rapidsai/cudf#13635.

This commit adds support for floating point order-by columns in
RANGE based window functions.

Prior to this commit, when the `GpuWindowExec` was presented with an
order-by column of floating-point type, the entire window operation
would fall back to CPU execution. This should now execute entirely
on the GPU.
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few small comments. Mostly confused about ordering when one element is NaN.

cpp/src/rolling/grouped_rolling.cu Outdated Show resolved Hide resolved
cpp/src/rolling/grouped_rolling.cu Outdated Show resolved Hide resolved
mythrocks added a commit to mythrocks/spark-rapids that referenced this pull request Jun 29, 2023
Depends on rapidsai/cudf#13635.

This commit adds support for floating point order-by columns in
RANGE based window functions.

Prior to this commit, when the `GpuWindowExec` was presented with an
order-by column of floating-point type, the entire window operation
would fall back to CPU execution. This should now execute entirely
on the GPU.
mythrocks added a commit to mythrocks/spark-rapids that referenced this pull request Jun 29, 2023
Depends on rapidsai/cudf#13635.

This commit adds support for floating point order-by columns in
RANGE based window functions.

Prior to this commit, when the `GpuWindowExec` was presented with an
order-by column of floating-point type, the entire window operation
would fall back to CPU execution. This should now execute entirely
on the GPU.
mythrocks added a commit to mythrocks/spark-rapids that referenced this pull request Jun 29, 2023
Depends on rapidsai/cudf#13635.

This commit adds support for floating point order-by columns in
RANGE based window functions.

Prior to this commit, when the `GpuWindowExec` was presented with an
order-by column of floating-point type, the entire window operation
would fall back to CPU execution. This should now execute entirely
on the GPU.

Example query:
```sql
SELECT COUNT(1) OVER (PARTITION BY grp ORDER BY float_column
                      RANGE BETWEEN 1.023 PRECEDING AND 3.14159 FOLLOWING)
FROM my_float_table;
```
mythrocks added a commit to mythrocks/spark-rapids that referenced this pull request Jun 29, 2023
Depends on rapidsai/cudf#13635.

This commit adds support for floating point order-by columns in
RANGE based window functions.

Prior to this commit, when the `GpuWindowExec` was presented with an
order-by column of floating-point type, the entire window operation
would fall back to CPU execution. This should now execute entirely
on the GPU.

Example query:
```sql
SELECT COUNT(1) OVER (PARTITION BY grp ORDER BY float_column
                      RANGE BETWEEN 1.023 PRECEDING AND 3.14159 FOLLOWING)
FROM my_float_table;
```

Signed-off-by: MithunR <mythrocks@gmail.com>
@github-actions github-actions bot added the CMake CMake build issue label Jun 30, 2023
@mythrocks mythrocks requested a review from vuule July 3, 2023 21:15
@mythrocks
Copy link
Contributor Author

I've added tests for the add/subtract_safe() functions as well. Just in case.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the test coverage, thank you for taking the time to add the unit tests.

cpp/src/rolling/detail/range_comparator_utils.cuh Outdated Show resolved Hide resolved
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice change and the tests added are very welcome.

cpp/tests/rolling/range_comparator_test.cu Show resolved Hide resolved
cpp/tests/rolling/range_comparator_test.cu Show resolved Hide resolved
cpp/src/rolling/detail/range_comparator_utils.cuh Outdated Show resolved Hide resolved
Copy link
Contributor

@nvdbaranec nvdbaranec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. One small typo in comment block.

cpp/src/rolling/detail/range_comparator_utils.cuh Outdated Show resolved Hide resolved
@mythrocks
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit e907d3f into rapidsai:branch-23.08 Jul 5, 2023
53 checks passed
@mythrocks
Copy link
Contributor Author

Thank you for the reviews, all. I have merged this change.

mythrocks added a commit to NVIDIA/spark-rapids that referenced this pull request Jul 10, 2023
#8637)

* Support Float order-by columns for RANGE window functions

Depends on rapidsai/cudf#13635.

This commit adds support for floating point order-by columns in
RANGE based window functions.

Prior to this commit, when the `GpuWindowExec` was presented with an
order-by column of floating-point type, the entire window operation
would fall back to CPU execution. This should now execute entirely
on the GPU.

Example query:
```sql
SELECT COUNT(1) OVER (PARTITION BY grp ORDER BY float_column
                      RANGE BETWEEN 1.023 PRECEDING AND 3.14159 FOLLOWING)
FROM my_float_table;
```

Signed-off-by: MithunR <mythrocks@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants