Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support union for nested types #1459

Closed
revans2 opened this issue Jan 6, 2021 · 4 comments · Fixed by #3359
Closed

[FEA] Support union for nested types #1459

revans2 opened this issue Jan 6, 2021 · 4 comments · Fixed by #3359
Labels
feature request New feature or request

Comments

@revans2
Copy link
Collaborator

revans2 commented Jan 6, 2021

Is your feature request related to a problem? Please describe.
Union only works for non-nested types right now. It would be nice to support it for nested types as well. This is mostly blocked on scalar support for nested types. When we do get around to implementing it we should be sure to test with SPARK-32376 where a struct can have null columns added to it to make the two sides match. I am not sure if it will require any code changes on our part though.

@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify labels Jan 6, 2021
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Jan 12, 2021
@kuhushukla
Copy link
Collaborator

Thanks for the feature request!

Did some simple struct tests and adding the nested types to expr checks seems to work and match cpu result. But again, it was a preliminary test.
@revans2

This is mostly blocked on scalar support for nested types

Could you elaborate what do you mean by this? Which test should exercise this limitation and can we do without it? Thank you.

@revans2
Copy link
Collaborator Author

revans2 commented Feb 3, 2021

This is mostly blocked on scalar support for nested types

In Spark 3.1.0 and above when it sees a union of Struct<A: Int, B: Array<String>> and Struct<A:Int, C: Long> the output will look like Struct<A:Int, B: Array<String>, C: Long>. To make this work the first data frame will add a C to the Struct that is always nulls, and the second data frame will add a B to the Struct that is always null. The way that works is to insert a project that picks apart the original struct and puts it back together with the new scalar null value inserted in. This requires us to be able to create a scalar null for an Array<String> and expand that out into a full column.

It is not directly in union that the problem show up, but it is a secondary effect that happens afterwards.

@razajafri
Copy link
Collaborator

razajafri commented Mar 12, 2021

@sameerz we won't be able to support the union of structs in cases where the struct can have a null value. I have added tests for all the cases in #1919 and xfailed the ones that we don't support ATM.

This is still dependent on the cudf support for Scalar Structs

@razajafri razajafri removed their assignment Mar 17, 2021
@sameerz sameerz removed this from the Mar 15 - March 26 milestone Mar 30, 2021
@sameerz
Copy link
Collaborator

sameerz commented Jul 1, 2021

Update: at this point we support union of structs and union of nested structs. We still need to support union of of lists and maps.

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
[auto-merge] bot-auto-merge-branch-23.10 to branch-23.12 [skip ci] [bot]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants