
[BUG] Flaky test - udf_test.py::test_window_aggregate_udf #740

Closed
tgravescs opened this issue Sep 11, 2020 · 10 comments · Fixed by #1810
Assignees
Labels
bug Something isn't working P1 Nice to have for release

Comments

@tgravescs
Collaborator

Describe the bug

This is failing in the Spark 3.0.1 integration test, and it also failed on a documentation-only change in the premerge build, though the parameters differ slightly between the two failures:

https://ci.ngcc.nvidia.com/job/spark/job/rapids_premerge-github/609/
04:54:30 FAILED ../../src/main/python/udf_test.py::test_group_aggregate_udf[Long][IGNORE_ORDER, ALLOW_NON_GPU(AggregateInPandasExec,PythonUDF,Alias)]

https://ci.ngcc.nvidia.com/job/spark/job/rapids_it-3.0.1-SNAPSHOT-0.2-github/55/
09:08:02 FAILED src/main/python/udf_test.py::test_window_aggregate_udf[Long][IGNORE_ORDER, ALLOW_NON_GPU(WindowInPandasExec,PythonUDF,WindowExpression,Alias,WindowSpecDefinition,SpecifiedWindowFrame,UnboundedPreceding$,UnboundedFollowing$)]

Steps/Code to reproduce bug
Please provide a list of steps or a code sample to reproduce the issue.
Avoid posting private or sensitive data.

Expected behavior
A clear and concise description of what you expected to happen.

Environment details (please complete the following information)

  • Environment location: [Standalone, YARN, Kubernetes, Cloud(specify cloud provider)]
  • Spark configuration settings related to the issue

Additional context
Add any other context about the problem here.

@tgravescs tgravescs added bug Something isn't working ? - Needs Triage Need team to review and classify labels Sep 11, 2020
@sameerz sameerz added P1 Nice to have for release and removed ? - Needs Triage Need team to review and classify labels Sep 15, 2020
@sameerz
Collaborator

sameerz commented Sep 15, 2020

Marking as P2. @GaryShen2008, please confirm whether this test is required for the pandas cuDF functionality.

@firestarman firestarman linked a pull request Sep 18, 2020 that will close this issue
@tgravescs
Collaborator Author

Lots of UDF tests are failing across the builds; I'm not sure whether they all share the same cause. The main error I see is:
05:12:27 E OSError: Invalid IPC stream: negative continuation token

This needs to be fixed, or the tests removed.

@tgravescs tgravescs added P0 Must have for release and removed P1 Nice to have for release labels Dec 8, 2020
@tgravescs tgravescs added this to the Dec 7 - Dec 18 milestone Dec 8, 2020
@firestarman
Collaborator

The original error seems to be related to a long overflow, which is different from the recent IPC error.
I filed PR #1343, which should fix the IPC error.

@GaryShen2008
Collaborator

GaryShen2008 commented Dec 18, 2020

Moving it to the next sprint since it requires more discussion about how to fix it.
It's not caused by WindowInPandasUDF, but might be a long overflow issue in the random test case.

@firestarman
Collaborator

firestarman commented Jan 8, 2021

This is a corner case, so it is low priority. Keeping it open for tracking.

@sameerz sameerz added the ? - Needs Triage Need team to review and classify label Jan 8, 2021
@sameerz
Collaborator

sameerz commented Jan 8, 2021

@firestarman this is marked as a P1; can you explain the corner case and why it is low priority?

@firestarman
Collaborator

It is not 100% reproducible, and it only happens randomly in the long sum case.
It was first marked P2 because some cases might hit a long overflow.
Later it was changed to P1 because of the IPC error, but that error has since been fixed, so it should be OK to change it back to P2 for tracking.

@sameerz
Collaborator

sameerz commented Jan 12, 2021

Since the underlying functionality is disabled by default, lowering the priority to P2 and removing it from the current release.

@sameerz sameerz added P1 Nice to have for release and removed ? - Needs Triage Need team to review and classify P0 Must have for release labels Jan 12, 2021
@sameerz
Collaborator

sameerz commented Jan 12, 2021

Please xfail the test for now with a comment linking to this issue.

@GaryShen2008 GaryShen2008 removed this from the Jan 4 - Jan 15 milestone Jan 16, 2021
@firestarman
Collaborator

firestarman commented Feb 24, 2021

Eventually this turned out to be a floating-point precision issue in the Python computation. The test defines the UDF as below.

    @f.pandas_udf('long')
    def pandas_sum(to_process: pd.Series) -> int:
        return to_process.sum()

The Series instance to_process (originally int64) has already been cast to float64 because it contains null values, and the numbers inside to_process are the same between the CPU and GPU tests but arrive in different orders. The sum function then produces different values, because some of the numbers are big enough to lose precision when cast to float64. Here is an example that reproduces the precision issue.

>>> gs_f = pd.Series([
...     2485679188720175927,
...     8357139477133345268,
...     1,
...     -7730374341177287187,
...     -4737283057172104304,
...     7293465880926600716,
...     -3263989102713806634,
...     7138732477696760926,
...     2366698092255848196,
...     -1207164858762133741,
...     -219832261138823073,
...     -7766870047782081191,
...     4602336914501788898,
...     None])
>>> gs_f.dtype
dtype('float64')
>>> gs_f.sum().astype(int)
7318538362488283136
>>> 
>>> cs_f = pd.Series([2485679188720175927,
...     8357139477133345268,
...     1,
...     -7730374341177287187,
...     -4737283057172104304,
...     7293465880926600716,
...     7138732477696760926,
...     -3263989102713806634,
...     2366698092255848196,
...     -1207164858762133741,
...     -219832261138823073,
...     None,
...     4602336914501788898,
...     -7766870047782081191])
>>> cs_f.dtype
dtype('float64')
>>> cs_f.sum().astype(int)
7318538362488282112

gs_f and cs_f contain the same numbers in different orders, yet their sums differ.
Similarly, in the Pandas UDF above, to_process contains the same numbers on CPU and GPU but in different orders, so different sum values are returned from the UDF, and the tests fail.
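The underlying reason is that float64 addition is not associative: at magnitudes above 2^53 not every integer is representable, so intermediate sums round differently depending on the order of the operands. A minimal sketch with made-up values (not the test data):

```python
# float64 has a 53-bit significand, so at magnitude 1e16 the spacing
# between representable values is 2.0, and an added 1.0 can be lost.
a, b, c = 1e16, 1.0, 1.0

left = (a + b) + c   # each 1.0 is absorbed into 1e16 separately
right = a + (b + c)  # the two 1.0s combine to 2.0, which survives
print(left == right)  # False
```

The same effect at int64-scale magnitudes (around 9.2e18, far above 2^53) is what makes the CPU and GPU sums disagree.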

The precision issue is likely to lead to the failures in issue #757 too.

@revans2 Could you share some suggestions for handling this case?

One possible solution I can think of is to make sure the input has the same order on CPU and GPU by sorting the data before calling sum, such as

    @f.pandas_udf('long')
    def pandas_sum(to_process: pd.Series) -> int:
        return to_process.sort_values().sum()
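Note this only buys determinism, not the exact int64 sum: once both sides sum the floats in the same order, they round identically. A quick sanity check of the idea (standalone snippet, not part of the test suite):

```python
import pandas as pd

# The same values in two different orders produce bit-identical sums
# once both series are sorted first (NaN goes last in both).
vals = [2485679188720175927, 8357139477133345268, None,
        -7730374341177287187, 4602336914501788898]
s1 = pd.Series(vals)        # one ordering (cast to float64 by the None)
s2 = pd.Series(vals[::-1])  # the reverse ordering
print(s1.sum() == s2.sum())                              # may be False
print(s1.sort_values().sum() == s2.sort_values().sum())  # True
```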

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
… cudf (NVIDIA#740)

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
