
[BUG] Flaky test - udf_test.py::test_window_aggregate_udf #740

Closed
tgravescs opened this issue Sep 11, 2020 · 10 comments · Fixed by #1810
Assignees
Labels
bug Something isn't working P1 Nice to have for release

Comments

@tgravescs
Collaborator

Describe the bug

This is failing in the Spark 3.0.1 integration test, and it also failed on a documentation-only change in the premerge build, though the parameters differ slightly between the two failures:

https://ci.ngcc.nvidia.com/job/spark/job/rapids_premerge-github/609/
04:54:30 FAILED ../../src/main/python/udf_test.py::test_group_aggregate_udf[Long][IGNORE_ORDER, ALLOW_NON_GPU(AggregateInPandasExec,PythonUDF,Alias)]

https://ci.ngcc.nvidia.com/job/spark/job/rapids_it-3.0.1-SNAPSHOT-0.2-github/55/
09:08:02 FAILED src/main/python/udf_test.py::test_window_aggregate_udf[Long][IGNORE_ORDER, ALLOW_NON_GPU(WindowInPandasExec,PythonUDF,WindowExpression,Alias,WindowSpecDefinition,SpecifiedWindowFrame,UnboundedPreceding$,UnboundedFollowing$)]

Steps/Code to reproduce bug
Please provide a list of steps or a code sample to reproduce the issue.
Avoid posting private or sensitive data.

Expected behavior
A clear and concise description of what you expected to happen.

Environment details (please complete the following information)

  • Environment location: [Standalone, YARN, Kubernetes, Cloud(specify cloud provider)]
  • Spark configuration settings related to the issue

Additional context
Add any other context about the problem here.

@tgravescs tgravescs added bug Something isn't working ? - Needs Triage Need team to review and classify labels Sep 11, 2020
@sameerz sameerz added P1 Nice to have for release and removed ? - Needs Triage Need team to review and classify labels Sep 15, 2020
@sameerz
Collaborator

sameerz commented Sep 15, 2020

Marking as P2. @GaryShen2008, please confirm whether this test is required for the pandas cuDF functionality.

@firestarman firestarman linked a pull request Sep 18, 2020 that will close this issue
@tgravescs
Collaborator Author

Lots of UDF tests are failing across the builds; I'm not sure whether they all share the same cause. The main error I see is:
05:12:27 E OSError: Invalid IPC stream: negative continuation token

This needs to be fixed, or the tests removed.

@tgravescs tgravescs added P0 Must have for release and removed P1 Nice to have for release labels Dec 8, 2020
@tgravescs tgravescs added this to the Dec 7 - Dec 18 milestone Dec 8, 2020
@firestarman
Collaborator

The original error seems to be related to a long overflow, which is different from the recent IPC error.
I filed PR #1343, which should fix the IPC error.

@GaryShen2008
Collaborator

GaryShen2008 commented Dec 18, 2020

Moving it to the next sprint since it requires more discussion about how to fix it.
It's not caused by WindowInPandasUDF, but might be a long overflow issue in the random test case.

@firestarman
Collaborator

firestarman commented Jan 8, 2021

This is a corner case, so it is low priority. Keeping it open for tracking.

@sameerz sameerz added the ? - Needs Triage Need team to review and classify label Jan 8, 2021
@sameerz
Collaborator

sameerz commented Jan 8, 2021

@firestarman this is marked as a P1; can you explain the corner case and why it is low priority?

@firestarman
Collaborator

It is not 100% reproducible, and it only happens randomly in the long sum case.
It was first marked P2 because some cases might hit a long overflow.
Later it was changed to P1 because of the IPC error, but that error has since been fixed, so it should be OK to change it back to P2 for tracking.

@sameerz
Collaborator

sameerz commented Jan 12, 2021

Since the underlying functionality is disabled by default, lowering the priority to P2 and removing it from the current release.

@sameerz sameerz added P1 Nice to have for release and removed ? - Needs Triage Need team to review and classify P0 Must have for release labels Jan 12, 2021
@sameerz
Collaborator

sameerz commented Jan 12, 2021

Please xfail the test for now with a comment linking to this issue.

@GaryShen2008 GaryShen2008 removed this from the Jan 4 - Jan 15 milestone Jan 16, 2021
@firestarman
Collaborator

firestarman commented Feb 24, 2021

Eventually this turned out to be a floating-point precision issue in the Python computation. The test defines the UDF as below.

    @f.pandas_udf('long')
    def pandas_sum(to_process: pd.Series) -> int:
        return to_process.sum()

The Series instance to_process (originally int64) has already been cast to float64 because it contains null values, and the numbers inside to_process are the same between the CPU and GPU tests but arrive in different orders. The sum function then produces different values, because some of the numbers are big enough to lose precision when cast to float64. Here is an example that reproduces the precision issue.

>>> gs_f = pd.Series([
...     2485679188720175927,
...     8357139477133345268,
...     1,
...     -7730374341177287187,
...     -4737283057172104304,
...     7293465880926600716,
...     -3263989102713806634,
...     7138732477696760926,
...     2366698092255848196,
...     -1207164858762133741,
...     -219832261138823073,
...     -7766870047782081191,
...     4602336914501788898,
...     None])
>>> gs_f.dtype
dtype('float64')
>>> gs_f.sum().astype(int)
7318538362488283136
>>> 
>>> cs_f = pd.Series([2485679188720175927,
...     8357139477133345268,
...     1,
...     -7730374341177287187,
...     -4737283057172104304,
...     7293465880926600716,
...     7138732477696760926,
...     -3263989102713806634,
...     2366698092255848196,
...     -1207164858762133741,
...     -219832261138823073,
...     None,
...     4602336914501788898,
...     -7766870047782081191])
>>> cs_f.dtype
dtype('float64')
>>> cs_f.sum().astype(int)
7318538362488282112

gs_f and cs_f contain the same numbers in different orders, yet their sums differ.
Similarly, in the Pandas UDF above, to_process contains the same numbers on CPU and GPU but in different orders, so different sum values are returned from the UDF, and the tests fail.
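The underlying reason is that float64 addition is not associative: at magnitudes above 2^53 not every integer is representable, so intermediate sums round differently depending on the order of the operands. A minimal sketch with made-up values (not the test data):

```python
# float64 has a 53-bit significand, so at magnitude 1e16 the spacing
# between representable values is 2.0, and an added 1.0 can be lost.
a, b, c = 1e16, 1.0, 1.0

left = (a + b) + c   # each 1.0 is absorbed into 1e16 separately
right = a + (b + c)  # the two 1.0s combine to 2.0, which survives
print(left == right)  # False
```

The same effect at int64-scale magnitudes (around 9.2e18, far above 2^53) is what makes the CPU and GPU sums disagree.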

The precision issue is likely to lead to the failures in issue #757 too.

@revans2 Could you share some suggestions for handling this case?

One possible solution I can think of is to make sure the input has the same order on CPU and GPU by sorting the data before calling sum, such as

    @f.pandas_udf('long')
    def pandas_sum(to_process: pd.Series) -> int:
        return to_process.sort_values().sum()
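Note this only buys determinism, not the exact int64 sum: once both sides sum the floats in the same order, they round identically. A quick sanity check of the idea (standalone snippet, not part of the test suite):

```python
import pandas as pd

# The same values in two different orders produce bit-identical sums
# once both series are sorted first (NaN goes last in both).
vals = [2485679188720175927, 8357139477133345268, None,
        -7730374341177287187, 4602336914501788898]
s1 = pd.Series(vals)        # one ordering (cast to float64 by the None)
s2 = pd.Series(vals[::-1])  # the reverse ordering
print(s1.sum() == s2.sum())                              # may be False
print(s1.sort_values().sum() == s2.sort_values().sum())  # True
```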

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
… cudf (NVIDIA#740)

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
