
[BUG] test_no_fallback_when_ansi_enabled failed in databricks #3611

Closed
abellina opened this issue Sep 22, 2021 · 3 comments · Fixed by #3615
Assignees
Labels
bug Something isn't working P0 Must have for release

Comments

@abellina
Collaborator

abellina commented Sep 22, 2021

@razajafri found this in one of his PRs: the CPU and GPU results do not match for test_no_fallback_when_ansi_enabled, which I added in #3597:

First few rows from the CPU:

 [Row(a=None, first(b)=118, last(b)=507, min(b)=118, max(b)=507), Row(a=1, first(b)=507, last(b)=507, min(b)=507, max(b)=507),  Row(a=2, first(b)=848, last(b)=848, min(b)=848, max(b)=848),

First few rows from the GPU:

[Row(a=1, first(b)=507, last(b)=507, min(b)=507, max(b)=507), Row(a=2, first(b)=848, last(b)=848, min(b)=848, max(b)=848)

Last row on the GPU:

Row(a=None, first(b)=118, last(b)=507, min(b)=118, max(b)=507

I am not entirely sure how this is happening, given that the query does a coalesce(1) and an orderBy on every column.

@abellina abellina added bug Something isn't working ? - Needs Triage Need team to review and classify P0 Must have for release labels Sep 22, 2021
@abellina abellina self-assigned this Sep 22, 2021
@abellina abellina added this to the Sep 13 - Sep 24 milestone Sep 22, 2021
@jlowe
Member

jlowe commented Sep 22, 2021

I am not entirely sure how this is happening, after a coalesce(1) and orderBy(every column)

But the orderBy is not the last thing in the query?

        df = gen_df(spark, [('a', data_gen), ('b', data_gen)], length=100)
        # coalescing because first/last are not deterministic
        df = df.coalesce(1).orderBy("a", "b")
        return df.groupBy('a').agg(f.first("b"), f.last("b"), f.min("b"), f.max("b"))

What's preventing a Spark implementation from hash-aggregating the grouping, making the output row order non-deterministic because it's dumping hash table contents?
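Since the aggregation output order is not guaranteed, the comparison has to ignore row order. A minimal plain-Python sketch (no Spark required) of an order-insensitive comparison with a None-safe sort key, roughly what an ignore_order-style check does; all names here are illustrative, not the plugin's actual helpers:

```python
# Compare two result sets ignoring row order, tolerating None in any column.

def none_safe_key(row):
    # Map each value to (is_not_None, value) so None sorts before any
    # concrete value and None is never compared directly against an int.
    return tuple((v is not None, v) for v in row)

def rows_equal_ignore_order(cpu_rows, gpu_rows):
    # Sort both sides with the same None-safe key, then compare.
    return sorted(cpu_rows, key=none_safe_key) == sorted(gpu_rows, key=none_safe_key)

# Rows mimic (a, first(b), last(b), min(b), max(b)) from the issue:
# same rows on CPU and GPU, but the a=None row comes last on the GPU.
cpu = [(None, 118, 507, 118, 507), (1, 507, 507, 507, 507), (2, 848, 848, 848, 848)]
gpu = [(1, 507, 507, 507, 507), (2, 848, 848, 848, 848), (None, 118, 507, 118, 507)]
```

With this comparison, cpu and gpu above compare equal even though a positional list comparison would fail.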

@abellina
Collaborator Author

OK, @jlowe is absolutely right. Just adding the ignore_order marker here should do it.

@razajafri
Collaborator

Do you want me to do this as part of my #3330? Adding a skip is the same amount of work as adding ignore_order.
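To illustrate what such a marker buys over a positional assert, here is a toy, self-contained sketch of an ignore_order-style decorator: it sorts both result lists with a None-safe key before asserting equality, so order differences between engines don't fail the test. The real spark-rapids marker works at the pytest level; the decorator and test names below are purely illustrative:

```python
import functools

def ignore_order(test_fn):
    # Wrap a test that returns (cpu_rows, gpu_rows) and compare the two
    # result lists after a None-safe sort, ignoring row order.
    @functools.wraps(test_fn)
    def wrapper(*args, **kwargs):
        cpu_rows, gpu_rows = test_fn(*args, **kwargs)
        key = lambda row: tuple((v is not None, v) for v in row)
        assert sorted(cpu_rows, key=key) == sorted(gpu_rows, key=key)
    return wrapper

@ignore_order
def test_groupby_agg():
    # Same rows, different order, as in this issue: should pass.
    cpu = [(None, 118, 507), (1, 507, 507)]
    gpu = [(1, 507, 507), (None, 118, 507)]
    return cpu, gpu

test_groupby_agg()
```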
