Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Workaround null_equality problem of join on StructType #2126

Closed
sperlingxx opened this issue Apr 14, 2021 · 3 comments
Closed

[FEA] Workaround null_equality problem of join on StructType #2126

sperlingxx opened this issue Apr 14, 2021 · 3 comments
Assignees
Labels
task Work required that improves the product but is not user facing

Comments

@sperlingxx
Copy link
Collaborator

sperlingxx commented Apr 14, 2021

Is your feature request related to a problem? Please describe.
As mentioned in cuDF issue #7934, join API of cuDF provided only supports universe null_equality. Meanwhile, join of Spark doesn't treat null values within nested types as null, which means we may need different null_equality options for different join keys if join keys consist of both StructTypes and AtomicTypes.

Describe the solution you'd like
In long-term, the best solution is that libcudf provides join API with column-wise null_eqauality options. But within short-term, I think we can walk around the problem from rapids plugin side.

My proposal is simply switching compareNullsEqual to true for join types other than FullOuterJoin. After switching, we can handle both nullable atomic types and nullable children of nested types correctly under inner join and one-side outer joins.

It is because Spark optimizer rule InferFiltersFromConstraints inserts IsNotNull Constraints for two-sides inputs of inner join and buld-side input of one-side outer joins, which ensure all atomic join columns are non-nullable. That's why we can safely switch on compareNullsEqual without failing join on columns of nullable AtomicTypes. And the rule InferFiltersFromConstraints is enabled by default.

Here is the example for the effect of InferFiltersFromConstraints:

== Parsed Logical Plan ==
Join Inner, (a#0 = a#4)
:- LogicalRDD [a#0, b#1], false
+- LogicalRDD [a#4, b#5], false
== Analyzed Logical Plan ==
a: int, b: int, a: int, b: int
Join Inner, (a#0 = a#4)
:- LogicalRDD [a#0, b#1], false
+- LogicalRDD [a#4, b#5], false
== Optimized Logical Plan ==
Join Inner, (a#0 = a#4)
:- Filter isnotnull(a#0)
:  +- LogicalRDD [a#0, b#1], false
+- Filter isnotnull(a#4)
   +- LogicalRDD [a#4, b#5], false

Of course, this tempoarary solution doesn't work on FullOuterJoin. But the FullOuterJoin is also broked by cuDF issue #7947.

@sperlingxx sperlingxx added feature request New feature or request ? - Needs Triage Need team to review and classify labels Apr 14, 2021
@sperlingxx
Copy link
Collaborator Author

Hi @hyperbolic2346 @revans2 @abellina, what do you think of this walkaround solution?

@revans2
Copy link
Collaborator

revans2 commented Apr 14, 2021

I am OK with the plan, but we will probably need some tweaks to it. The issue is that even though InferFiltersFromConstraints inserts null filters we have seen situations where that does not work 100% of the time. We ran into some real performance problems when a large number of nulls ended up in the join and cudf tried to insert them into the build table, which resulted in a lot of slowness everywhere. We worked around this by filtering the nulls out of the table before we did the join until cudf fixed the problem (mostly).

So if we are joining on a struct, and it is a join we can support, then we should filter out all of the nulls from the keys prior to doing the join. If we are not joining on a struct, then I would want to look at the performance difference between doing the filter first and doing not doing it to decide what we want to do.

As a side note when we put in the null filtering before the join we ended up with some memory problems related to broadcast joins. we never completely tracked down what was happening in these cases and we don't know if it is fragmentation or because we actually have to leak the broadcast table.

@hyperbolic2346
Copy link
Collaborator

I think this workaround seems reasonable.

I spoke with the cudf team and the feeling is that the current implementation is what users of cudf would expect and they have no plans to make anything bespoke for Spark. They did note that they would help in any way needed to add access to features necessary here. I don't think anything is required though.

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Apr 20, 2021
@sameerz sameerz changed the title [FEA] Walkaround null_equality problem of join on StructType [FEA] Workaround null_equality problem of join on StructType May 12, 2021
@sameerz sameerz added task Work required that improves the product but is not user facing and removed feature request New feature or request labels May 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
task Work required that improves the product but is not user facing
Projects
None yet
Development

No branches or pull requests

4 participants