[BUG] Null equality of structs doesn't match Spark #7934

Closed
sperlingxx opened this issue Apr 12, 2021 · 14 comments
Labels
bug Something isn't working

Comments

@sperlingxx
Contributor

Describe the bug
For structs with nullable children, those children share the same null_equality as the other join keys, since cuDF flattens all columns. In Spark, the null_equality of atomic types is false, but the null_equality of struct children is true. Here is the ordering comparison strategy of Spark.

So the problem is that there is no universal null_equality option for join keys composed of both structs with nullable children and nullable atomic types.
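
To make the asymmetry concrete, here is a minimal PySpark sketch (mine, not from the report) of the behavior described above; the column and key names are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    rows = [((None, 1),), (None,)]  # one struct with a null child, one null struct
    left = spark.createDataFrame(rows, "k struct<a: int, b: int>")
    right = spark.createDataFrame(rows, "k struct<a: int, b: int>")

    # The {a: null, b: 1} keys match each other: struct children compare
    # with null-equal semantics. The top-level null keys do not match:
    # null_equality of root keys is false outside of a full outer join.
    left.join(right, left["k"] == right["k"], "inner").show()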

P.S. This issue is a corrected version of #7911.

@sperlingxx added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels on Apr 12, 2021
@jrhemstad
Contributor

This was discussed in #7226 (comment). libcudf will not support specifying a per-member null equality or null ordering. To get this in Spark, you can flatten the struct columns yourself and pass in the resulting table with whatever null equality/order you want.
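
As a rough, hypothetical illustration of the "flatten it yourself" idea (using pandas stand-ins rather than the libcudf API):

    import pandas as pd

    # A struct key becomes one column per child after flattening.
    df = pd.DataFrame({"k": [{"a": 1, "b": None}, {"a": None, "b": 2}]})
    flat = pd.json_normalize(df["k"].tolist()).add_prefix("k.")

    # The caller now controls null handling per flattened column (filter,
    # fill, or choose null-equality semantics) before invoking the join,
    # instead of relying on one table-wide setting.
    print(flat)  # columns: "k.a", "k.b"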

@jrhemstad added the not a bug label and removed the Needs Triage (Need team to review and classify) label on Apr 12, 2021
@hyperbolic2346
Contributor

hyperbolic2346 commented Apr 13, 2021

@jrhemstad Is it possible for the flattening code to have Spark-friendly defaults, e.g. for null ordering?

      if (child.type().id() == type_id::STRUCT) {
        flatten_struct_column(structs_column_view{child}, col_order, null_order::BEFORE);
        // default spark behaviour is null_order::BEFORE
      } else {
        flat_columns.push_back(child);
        if (not column_order.empty()) flat_column_order.push_back(col_order);
        if (not null_precedence.empty()) flat_null_precedence.push_back(null_order::BEFORE);
        // default spark behaviour is null_order::BEFORE
      }

Edit: Thinking about this more, I'm not sure it would be that easy, since we don't know the origin of a column when building the hash map.

@jrhemstad
Contributor

@jrhemstad Is it possible for the flattening code to have Spark-friendly defaults, e.g. for null ordering?

As I understand it, the null equality/ordering specified for the top-level struct parent will be applied to all children. Based on #7226, my understanding was that this was the default behavior of Spark.

@sperlingxx
Contributor Author

@jrhemstad Is it possible for the flattening code to have Spark-friendly defaults, e.g. for null ordering?

As I understand it, the null equality/ordering specified for the top-level struct parent will be applied to all children. Based on #7226, my understanding was that this was the default behavior of Spark.

Spark joins skip null records during key matching, unless the join is a FullOuterJoin. And Spark only checks the validity of root keys. For instance, the key {'a': null, 'b': 2, 'c': 3} is regarded as a null key, but the key {'a': {'A': null, 'B': 1}, 'b': 2, 'c': 3} is regarded as a non-null key.
Here is the code link of the sort-merge join.
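
A small hypothetical PySpark snippet to illustrate the two rules above (only root-key validity matters, and full outer joins keep the null keys):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Row 1: null child inside a non-null root key -> still a non-null key.
    # Row 2: null root key -> skipped during key matching.
    df = spark.createDataFrame([((None, 1),), (None,)],
                               "k struct<a: int, b: int>")

    df.alias("l").join(df.alias("r"), "k", "inner").show()       # one matched row
    df.alias("l").join(df.alias("r"), "k", "full_outer").show()  # null keys reappear, unmatched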

@hyperbolic2346
Contributor

I see two approaches to fixing this problem. The first is to pass in per-column nullability information, which seems a non-starter. The second would be for cudf to use different null equality for the flattened columns, possibly as an option to match Spark. I'm unsure whether other frameworks expect the same behavior; I'm asking around to see if this is unique to Spark.

@kkraus14
Collaborator

cc @felipeblazing @williamBlazing what's the behavior you'd want for Blazing in this case?

cc @shwina @brandon-b-miller any thoughts from the Python/Pandas perspective?

@shwina
Contributor

shwina commented Apr 14, 2021

Pandas appears to have these semantics:

Edit: please ignore this; see my next comment.

>>> s1 = pd.Series({"a": [1, 2, pd.NA], "b": pd.NA})
>>> s1 == s1
a     True
b    False
dtype: bool

@jrhemstad
Contributor

Pandas appears to have semantics:

>>> s1 = pd.Series({"a": [1, 2, pd.NA], "b": pd.NA})
>>> s1 == s1
a     True
b    False
dtype: bool

I don't understand the syntax of this example.

@shwina
Contributor

shwina commented Apr 14, 2021

Here is a Pandas series of dictionaries ("structs"). The top-level parent is nullable, and so is the (single) child column. The top-level parent has a non-null element at position 0 and a null element at position 1.

>>> s = pd.Series([{"a": pd.NA}, pd.NA])
>>> print(s)
0    {'a': <NA>}
1           <NA>
dtype: object

Doing an elementwise compare of this Series with itself:

>>> s == s
0     True
1    False
dtype: bool

We see that null equality at the top level is False (giving False at position 1), while for the child column, it is True (giving True at position 0).

@jrhemstad
Contributor

Do the rules for elementwise comparison apply to merge as well? What happens when you do an inner join of s with itself?

@kkraus14
Collaborator

@shwina do we want to follow the Pandas behavior here? It's going to defer to Python dictionary comparison.

@kkraus14
Collaborator

@shwina I also think your example above is potentially broken: pd.NA is a singleton, so all of those will reference the same object, which makes Python short-circuit the normal equality checks when doing the dictionary equality check.
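
A minimal demonstration of that short-circuit (standard pandas/Python behavior, not from the thread):

>>> import pandas as pd
>>> pd.NA == pd.NA                 # comparing NA to itself yields NA, not True
<NA>
>>> [pd.NA] == [pd.NA]             # container equality checks identity first
True
>>> {"a": pd.NA} == {"a": pd.NA}   # dict equality short-circuits the same way
True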

@shwina
Contributor

shwina commented Apr 14, 2021

I agree we may not want the same semantics. It's somewhat serendipitous that Pandas and Spark exhibit the same behaviour here.

@hyperbolic2346
Contributor

Spoke with Keith and Jake about this. The feeling is that this is custom Spark behavior more than something an end user of cudf would expect. As such, cudf will not implement Spark-specific changes like this, but it will certainly expose what is needed to allow the Spark plugin to build what it needs.

In this case, the plugin doesn't need any extra support: removing the top-level null entries and then passing the null equality that the internal structure comparisons expect will work.
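
A hypothetical sketch of that plugin-side recipe, with plain Python tuples standing in for cudf tables (all names are illustrative):

    # Keys are (root_is_valid, value); None marks a null child.
    left  = [(True, ("x", None)), (False, None), (True, ("y", 1))]
    right = [(True, ("x", None)), (False, None)]

    def spark_like_inner_join(left, right):
        # Step 1: drop rows whose root key is null; Spark skips them in
        # any non-full-outer join.
        lv = [k for valid, k in left if valid]
        rv = [k for valid, k in right if valid]
        # Step 2: match the remaining keys with nulls-compare-equal
        # semantics (libcudf's null_equality::EQUAL), which is what struct
        # children need. Python tuple equality treats None == None as True.
        return [(l, r) for l in lv for r in rv if l == r]

    print(spark_like_inner_join(left, right))
    # [(('x', None), ('x', None))]  null children match, null roots dropped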
