[FEA] Workaround null_equality problem of join on StructType #2126

sperlingxx · 2021-04-14T09:51:14Z

Is your feature request related to a problem? Please describe.
As mentioned in cuDF issue #7934, join API of cuDF provided only supports universe null_equality. Meanwhile, join of Spark doesn't treat null values within nested types as null, which means we may need different null_equality options for different join keys if join keys consist of both StructTypes and AtomicTypes.

Describe the solution you'd like
In long-term, the best solution is that libcudf provides join API with column-wise null_eqauality options. But within short-term, I think we can walk around the problem from rapids plugin side.

My proposal is simply switching compareNullsEqual to true for join types other than FullOuterJoin. After switching, we can handle both nullable atomic types and nullable children of nested types correctly under inner join and one-side outer joins.

It is because Spark optimizer rule InferFiltersFromConstraints inserts IsNotNull Constraints for two-sides inputs of inner join and buld-side input of one-side outer joins, which ensure all atomic join columns are non-nullable. That's why we can safely switch on compareNullsEqual without failing join on columns of nullable AtomicTypes. And the rule InferFiltersFromConstraints is enabled by default.

Here is the example for the effect of InferFiltersFromConstraints:

== Parsed Logical Plan ==
Join Inner, (a#0 = a#4)
:- LogicalRDD [a#0, b#1], false
+- LogicalRDD [a#4, b#5], false
== Analyzed Logical Plan ==
a: int, b: int, a: int, b: int
Join Inner, (a#0 = a#4)
:- LogicalRDD [a#0, b#1], false
+- LogicalRDD [a#4, b#5], false
== Optimized Logical Plan ==
Join Inner, (a#0 = a#4)
:- Filter isnotnull(a#0)
:  +- LogicalRDD [a#0, b#1], false
+- Filter isnotnull(a#4)
   +- LogicalRDD [a#4, b#5], false

Of course, this tempoarary solution doesn't work on FullOuterJoin. But the FullOuterJoin is also broked by cuDF issue #7947.

The text was updated successfully, but these errors were encountered:

sperlingxx · 2021-04-14T09:55:21Z

Hi @hyperbolic2346 @revans2 @abellina, what do you think of this walkaround solution?

revans2 · 2021-04-14T17:28:56Z

I am OK with the plan, but we will probably need some tweaks to it. The issue is that even though InferFiltersFromConstraints inserts null filters we have seen situations where that does not work 100% of the time. We ran into some real performance problems when a large number of nulls ended up in the join and cudf tried to insert them into the build table, which resulted in a lot of slowness everywhere. We worked around this by filtering the nulls out of the table before we did the join until cudf fixed the problem (mostly).

So if we are joining on a struct, and it is a join we can support, then we should filter out all of the nulls from the keys prior to doing the join. If we are not joining on a struct, then I would want to look at the performance difference between doing the filter first and doing not doing it to decide what we want to do.

As a side note when we put in the null filtering before the join we ended up with some memory problems related to broadcast joins. we never completely tracked down what was happening in these cases and we don't know if it is fragmentation or because we actually have to leak the broadcast table.

hyperbolic2346 · 2021-04-19T22:21:55Z

I think this workaround seems reasonable.

I spoke with the cudf team and the feeling is that the current implementation is what users of cudf would expect and they have no plans to make anything bespoke for Spark. They did note that they would help in any way needed to add access to features necessary here. I don't think anything is required though.

sperlingxx added feature request New feature or request ? - Needs Triage Need team to review and classify labels Apr 14, 2021

sameerz removed the ? - Needs Triage Need team to review and classify label Apr 20, 2021

sameerz assigned sperlingxx Apr 20, 2021

sameerz modified the milestones: Apr 12 - Apr 23, Apr 26 - May 7 Apr 20, 2021

sperlingxx mentioned this issue Apr 30, 2021

Support GpuHashJoin on Structs #2173

Merged

sperlingxx closed this as completed May 5, 2021

sameerz changed the title ~~[FEA] Walkaround null_equality problem of join on StructType~~ [FEA] Workaround null_equality problem of join on StructType May 12, 2021

sameerz added task Work required that improves the product but is not user facing and removed feature request New feature or request labels May 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Workaround null_equality problem of join on StructType #2126

[FEA] Workaround null_equality problem of join on StructType #2126

sperlingxx commented Apr 14, 2021 •

edited

Loading

sperlingxx commented Apr 14, 2021

revans2 commented Apr 14, 2021

hyperbolic2346 commented Apr 19, 2021

[FEA] Workaround null_equality problem of join on StructType #2126

[FEA] Workaround null_equality problem of join on StructType #2126

Comments

sperlingxx commented Apr 14, 2021 • edited Loading

sperlingxx commented Apr 14, 2021

revans2 commented Apr 14, 2021

hyperbolic2346 commented Apr 19, 2021

sperlingxx commented Apr 14, 2021 •

edited

Loading