[BUG] Null equality of structs doesn't match Spark #7934

Closed
sperlingxx opened this issue Apr 12, 2021 · 14 comments
Labels
bug Something isn't working

Comments

@sperlingxx
Contributor

Describe the bug
For structs with nullable children, those children share the same null_equality as the other join keys, since cuDF flattens all columns. In Spark, the null_equality of atomic types is false, but the null_equality of struct children is true. Here is the ordering comparison strategy of Spark.

So the problem is that there is no universal null_equality option for join keys composed of both structs with nullable children and nullable atomic types.
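
To make the asymmetry concrete, here is a minimal PySpark sketch (mine, not from the report) of the behavior described above; the column and key names are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    rows = [((None, 1),), (None,)]  # one struct with a null child, one null struct
    left = spark.createDataFrame(rows, "k struct<a: int, b: int>")
    right = spark.createDataFrame(rows, "k struct<a: int, b: int>")

    # The {a: null, b: 1} keys match each other: struct children compare
    # with null-equal semantics. The top-level null keys do not match:
    # null_equality of root keys is false outside of a full outer join.
    left.join(right, left["k"] == right["k"], "inner").show()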

P.S. This issue is a corrected version of #7911.

@sperlingxx added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels on Apr 12, 2021
@jrhemstad
Contributor

This was discussed in #7226 (comment). libcudf will not support specifying a per-member null equality or null ordering. To get this in Spark, you can flatten the struct columns yourself and pass in the resulting table with whatever null equality/order you want.
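
As a rough, hypothetical illustration of the "flatten it yourself" idea (using pandas stand-ins rather than the libcudf API):

    import pandas as pd

    # A struct key becomes one column per child after flattening.
    df = pd.DataFrame({"k": [{"a": 1, "b": None}, {"a": None, "b": 2}]})
    flat = pd.json_normalize(df["k"].tolist()).add_prefix("k.")

    # The caller now controls null handling per flattened column (filter,
    # fill, or choose null-equality semantics) before invoking the join,
    # instead of relying on one table-wide setting.
    print(flat)  # columns: "k.a", "k.b"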

@jrhemstad added the not a bug label and removed the Needs Triage (Need team to review and classify) label on Apr 12, 2021
@hyperbolic2346
Contributor

hyperbolic2346 commented Apr 13, 2021

@jrhemstad Is it possible for the flattening code to have Spark-friendly defaults, e.g. for null ordering?

      if (child.type().id() == type_id::STRUCT) {
        flatten_struct_column(structs_column_view{child}, col_order, null_order::BEFORE);
        // default spark behaviour is null_order::BEFORE
      } else {
        flat_columns.push_back(child);
        if (not column_order.empty()) flat_column_order.push_back(col_order);
        if (not null_precedence.empty()) flat_null_precedence.push_back(null_order::BEFORE);
        // default spark behaviour is null_order::BEFORE
      }

Edit: Thinking about this more, I'm not sure it would be that easy, since we don't know the origin of a column when building the hash map.

@jrhemstad
Contributor

@jrhemstad Is it possible for the flattening code to have Spark-friendly defaults, e.g. for null ordering?

As I understand it, the null equality/ordering specified for the top-level struct parent will be applied to all children. Based on #7226, my understanding was that this was the default behavior of Spark.

@sperlingxx
Contributor Author

@jrhemstad Is it possible for the flattening code to have Spark-friendly defaults, e.g. for null ordering?

As I understand it, the null equality/ordering specified for the top-level struct parent will be applied to all children. Based on #7226, my understanding was that this was the default behavior of Spark.

Spark joins skip null records during key matching, unless the join is a FullOuterJoin. And Spark only checks the validity of root keys. For instance, the key {'a': null, 'b': 2, 'c': 3} is regarded as a null key, but the key {'a': {'A': null, 'B': 1}, 'b': 2, 'c': 3} is regarded as a non-null key.
Here is the code link of the sort-merge join.
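
A small hypothetical PySpark snippet to illustrate the two rules above (only root-key validity matters, and full outer joins keep the null keys):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # Row 1: null child inside a non-null root key -> still a non-null key.
    # Row 2: null root key -> skipped during key matching.
    df = spark.createDataFrame([((None, 1),), (None,)],
                               "k struct<a: int, b: int>")

    df.alias("l").join(df.alias("r"), "k", "inner").show()       # one matched row
    df.alias("l").join(df.alias("r"), "k", "full_outer").show()  # null keys reappear, unmatched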

@hyperbolic2346
Contributor

I see two approaches to fixing this problem. The first is to pass in per-column nullability information, which seems a non-starter. The second would be for cudf to use different null equality for the flattened columns, possibly as an option to match Spark. I'm unsure whether other frameworks expect the same behavior; I'm asking around to see if this is unique to Spark.

@kkraus14
Collaborator

cc @felipeblazing @williamBlazing what's the behavior you'd want for Blazing in this case?

cc @shwina @brandon-b-miller any thoughts from the Python/Pandas perspective?

@shwina
Contributor

shwina commented Apr 14, 2021

Pandas appears to have these semantics:

Edit: please ignore this; see my next comment.

>>> s1 = pd.Series({"a": [1, 2, pd.NA], "b": pd.NA})
>>> s1 == s1
a     True
b    False
dtype: bool

@jrhemstad
Contributor

Pandas appears to have semantics:

>>> s1 = pd.Series({"a": [1, 2, pd.NA], "b": pd.NA})
>>> s1 == s1
a     True
b    False
dtype: bool

I don't understand the syntax of this example.

@shwina
Contributor

shwina commented Apr 14, 2021

Here is a Pandas series of dictionaries ("structs"). The top-level parent is nullable, and so is the (single) child column. The top-level parent has a non-null element at position 0 and a null element at position 1.

>>> s = pd.Series([{"a": pd.NA}, pd.NA])
>>> print(s)
0    {'a': <NA>}
1           <NA>
dtype: object

Doing an elementwise compare of this Series with itself:

>>> s == s
0     True
1    False
dtype: bool

We see that null equality at the top level is False (giving False at position 1), while for the child column, it is True (giving True at position 0).

@jrhemstad
Contributor

Do the rules for elementwise comparison apply to merge as well? What happens when you do an inner join of s with itself?

@kkraus14
Collaborator

@shwina do we want to follow the Pandas behavior here? It's going to defer to Python dictionary comparison.

@kkraus14
Collaborator

@shwina I also think your example above is potentially broken: pd.NA is a singleton, so all of those will reference the same object, which makes Python short-circuit the normal equality checks when doing the dictionary equality check.
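
A minimal demonstration of that short-circuit (standard pandas/Python behavior, not from the thread):

>>> import pandas as pd
>>> pd.NA == pd.NA                 # comparing NA to itself yields NA, not True
<NA>
>>> [pd.NA] == [pd.NA]             # container equality checks identity first
True
>>> {"a": pd.NA} == {"a": pd.NA}   # dict equality short-circuits the same way
True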

@shwina
Contributor

shwina commented Apr 14, 2021

I agree we may not want the same semantics. It's somewhat serendipitous that Pandas and Spark exhibit the same behaviour here.

@hyperbolic2346
Contributor

Spoke with Keith and Jake about this. The feeling is that this is custom Spark behavior more than something an end user of cudf would expect. As such, cudf will not implement Spark-specific changes like this, but it will certainly expose what is needed to allow the Spark plugin to build what it needs.

In this case, the plugin doesn't need any extra support: removing the top-level null entries and then passing the null equality that the internal structure comparisons expect will work.
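
A hypothetical sketch of that plugin-side recipe, with plain Python tuples standing in for cudf tables (all names are illustrative):

    # Keys are (root_is_valid, value); None marks a null child.
    left  = [(True, ("x", None)), (False, None), (True, ("y", 1))]
    right = [(True, ("x", None)), (False, None)]

    def spark_like_inner_join(left, right):
        # Step 1: drop rows whose root key is null; Spark skips them in
        # any non-full-outer join.
        lv = [k for valid, k in left if valid]
        rv = [k for valid, k in right if valid]
        # Step 2: match the remaining keys with nulls-compare-equal
        # semantics (libcudf's null_equality::EQUAL), which is what struct
        # children need. Python tuple equality treats None == None as True.
        return [(l, r) for l in lv for r in rv if l == r]

    print(spark_like_inner_join(left, right))
    # [(('x', None), ('x', None))]  null children match, null roots dropped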
