Is your feature request related to a problem? Please describe.
As mentioned in cuDF issue #7934, the join API provided by cuDF supports only a single, universal null_equality setting. Meanwhile, Spark joins don't treat null values within nested types as null, which means we may need different null_equality options for different join keys if the join keys consist of both StructTypes and AtomicTypes.
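To make the mismatch concrete, here is a minimal pure-Python sketch. It is only a model: structs are represented as tuples and nulls as `None`, the helper names `values_equal` and `spark_key_matches` are hypothetical, and it assumes that a single null-equality flag is applied at every nesting level, which is how this issue describes the cuDF behaviour.

```python
from typing import Any


def values_equal(left: Any, right: Any, nulls_equal: bool) -> bool:
    """Compare two key values under ONE global null-equality flag.

    Modelling assumption: structs are tuples, nulls are None, and the flag
    is applied to top-level values and struct children alike.
    """
    if left is None or right is None:
        return nulls_equal and left is None and right is None
    if isinstance(left, tuple) and isinstance(right, tuple):
        return len(left) == len(right) and all(
            values_equal(l, r, nulls_equal) for l, r in zip(left, right)
        )
    return left == right


def spark_key_matches(left: Any, right: Any) -> bool:
    """Spark-like matching as described in this issue: a top-level null key
    never matches, but nulls nested inside a struct compare equal."""
    if left is None or right is None:
        return False
    return values_equal(left, right, nulls_equal=True)


atomic_pair = (None, None)              # nullable atomic key, null on both sides
struct_pair = ((1, None), (1, None))    # struct key with a null child on both sides

for nulls_equal in (False, True):
    atomic_ok = values_equal(*atomic_pair, nulls_equal) == spark_key_matches(*atomic_pair)
    struct_ok = values_equal(*struct_pair, nulls_equal) == spark_key_matches(*struct_pair)
    print(f"nulls_equal={nulls_equal}: atomic agrees={atomic_ok}, struct agrees={struct_ok}")
```

In this model neither setting of the global flag agrees with the Spark-like behaviour for both key kinds at once, which is the problem when a join uses atomic and struct keys together.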
Describe the solution you'd like
In the long term, the best solution is for libcudf to provide a join API with column-wise null_equality options. In the short term, I think we can work around the problem from the rapids plugin side.
My proposal is simply to switch compareNullsEqual to true for join types other than FullOuterJoin. After switching, we can handle both nullable atomic types and nullable children of nested types correctly under inner joins and one-sided outer joins.
This works because the Spark optimizer rule InferFiltersFromConstraints inserts IsNotNull constraints on both inputs of an inner join and on the build-side input of one-sided outer joins, which ensures that all atomic join columns contain no nulls. That is why we can safely switch on compareNullsEqual without breaking joins on columns of nullable AtomicTypes. The rule InferFiltersFromConstraints is enabled by default.
Here is an example of the effect of InferFiltersFromConstraints:
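As a rough model of that effect (not actual Spark plan output), the following pure-Python sketch simulates an inner join on one atomic key and one struct key. Everything in it is an illustrative assumption: rows are tuples, structs are nested tuples, `is_not_null_filter` stands in for the inferred IsNotNull filters (which only look at top-level values), and `inner_join` stands in for a join with one global null-equality flag.

```python
from typing import Any, List, Sequence, Tuple

Row = Tuple[Any, ...]  # (atomic_key, struct_key, payload)


def values_equal(left: Any, right: Any, nulls_equal: bool) -> bool:
    """Compare key values under a single global null-equality flag."""
    if left is None or right is None:
        return nulls_equal and left is None and right is None
    if isinstance(left, tuple) and isinstance(right, tuple):
        return len(left) == len(right) and all(
            values_equal(l, r, nulls_equal) for l, r in zip(left, right)
        )
    return left == right


def is_not_null_filter(rows: List[Row], key_indices: Sequence[int]) -> List[Row]:
    """Stand-in for the inferred IsNotNull filters: drop rows whose join key
    is null at the top level. Null struct children are left untouched."""
    return [row for row in rows if all(row[i] is not None for i in key_indices)]


def inner_join(left: List[Row], right: List[Row], key_indices: Sequence[int],
               nulls_equal: bool) -> List[Tuple[Row, Row]]:
    """Nested-loop inner join with one global null-equality flag."""
    return [
        (l, r) for l in left for r in right
        if all(values_equal(l[i], r[i], nulls_equal) for i in key_indices)
    ]


left = [(1, (10, None), "l0"), (None, (10, None), "l1"), (2, (20, 5), "l2")]
right = [(1, (10, None), "r0"), (None, (10, None), "r1"), (2, (20, None), "r2")]
keys = [0, 1]

# Spark-like reference: top-level null keys never match, nested nulls are equal.
reference = [
    (l, r) for l in left for r in right
    if all(l[i] is not None and r[i] is not None
           and values_equal(l[i], r[i], True) for i in keys)
]

# Workaround: IsNotNull filters on both sides, then nulls compare equal.
filtered = inner_join(is_not_null_filter(left, keys),
                      is_not_null_filter(right, keys), keys, nulls_equal=True)

# Without the filters, the rows with a null atomic key would wrongly match.
unfiltered = inner_join(left, right, keys, nulls_equal=True)

print(filtered == reference, len(unfiltered))
```

In this model the filtered join agrees with the Spark-like reference, while the unfiltered join additionally pairs the two rows whose atomic key is null.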
I am OK with the plan, but we will probably need some tweaks to it. The issue is that even though InferFiltersFromConstraints inserts null filters, we have seen situations where that does not work 100% of the time. We ran into some real performance problems when a large number of nulls ended up in the join and cudf tried to insert them into the build table, which resulted in a lot of slowness everywhere. We worked around this by filtering the nulls out of the table before we did the join, until cudf fixed the problem (mostly).
So if we are joining on a struct, and it is a join we can support, then we should filter out all of the nulls from the keys prior to doing the join. If we are not joining on a struct, then I would want to look at the performance difference between doing the filter first and not doing it to decide what we want to do.
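The decision described above could look roughly like the sketch below. It is not plugin code: the function names, the `join_supported` flag, and the `benchmark_favors_filter` flag (standing in for the outcome of the performance comparison) are all hypothetical, and it assumes that "nulls in the keys" means top-level nulls, so a struct with a null child is kept.

```python
from typing import Any, List, Sequence, Tuple

Row = Tuple[Any, ...]


def should_prefilter(key_is_struct: Sequence[bool], join_supported: bool,
                     benchmark_favors_filter: bool) -> bool:
    """Decide whether to drop null keys before the join.

    - Struct key present: filter whenever the join type is one we support.
    - Atomic keys only: defer to a (hypothetical) benchmark result.
    """
    if any(key_is_struct):
        return join_supported
    return benchmark_favors_filter


def drop_null_keys(rows: List[Row], key_indices: Sequence[int]) -> List[Row]:
    """Remove rows whose join key is null at the top level.

    A struct with a null child is not null itself, so it is kept.
    """
    return [row for row in rows if all(row[i] is not None for i in key_indices)]


# Columns: (atomic key, struct key)
table = [(1, (10, None)), (None, (10, 1)), (2, None), (3, (30, 3))]

if should_prefilter(key_is_struct=[False, True], join_supported=True,
                    benchmark_favors_filter=False):
    table = drop_null_keys(table, key_indices=[0, 1])

print(table)
```

Here the row with a null atomic key and the row with a null struct are dropped, while `(1, (10, None))` survives because only a child of its struct is null.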
As a side note, when we put in the null filtering before the join we ended up with some memory problems related to broadcast joins. We never completely tracked down what was happening in these cases, and we don't know if it is fragmentation or because we actually have to leak the broadcast table.
I spoke with the cudf team and the feeling is that the current implementation is what users of cudf would expect and they have no plans to make anything bespoke for Spark. They did note that they would help in any way needed to add access to features necessary here. I don't think anything is required though.
sameerz changed the title from "[FEA] Walkaround null_equality problem of join on StructType" to "[FEA] Workaround null_equality problem of join on StructType" on May 12, 2021
This temporary workaround of setting compareNullsEqual to true does not work for FullOuterJoin, but FullOuterJoin is also broken by cuDF issue #7947.