Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support sorting on keys that are structs #7226

Closed
kuhushukla opened this issue Jan 27, 2021 · 23 comments · Fixed by #7422
Closed

[FEA] Support sorting on keys that are structs #7226

kuhushukla opened this issue Jan 27, 2021 · 23 comments · Fixed by #7422
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@kuhushukla
Copy link
Contributor

Is your feature request related to a problem? Please describe.
Allow cudf sort function to take one or more sorting keys that are nested types like lists and/or structs.
Describe the solution you'd like
sort(col(my_struct_col))) should return a column vector that is sorted ascending or descending with nulls first or last.

@kuhushukla kuhushukla added feature request New feature or request Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Jan 27, 2021
@jrhemstad
Copy link
Contributor

Struct columns of non-nested types won't be too bad as it's just like adding extra columns to a multi-column sort.

Arbitrarily nested lists on the other-hand are going to be extremely difficult to the point of being effectively impossible.

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jan 27, 2021
@kuhushukla
Copy link
Contributor Author

kuhushukla commented Jan 28, 2021

I think structs of non-nested types would be a great start, given our user story!

@revans2
Copy link
Contributor

revans2 commented Jan 29, 2021

I think structs of non-nested types would be a great start, given our user story!

To be clear structs of strings would be needed.

@kkraus14 kkraus14 changed the title [FEA] Support sorting on keys that are nested types [FEA] Support sorting on keys that are structs Feb 2, 2021
@kkraus14
Copy link
Collaborator

kkraus14 commented Feb 2, 2021

Updated title to capture that that this is for structs and not lists

@jrhemstad
Copy link
Contributor

The way to make this non-intrusive is to have a struct to table conversion function, something like:

table_view struct_to_table(structs_column_view);

This will flatten any of internal columns (and likewise with nested struct columns) into a table_view. You can then insert the table_view input to sort with the expanded struct column.

@jlowe
Copy link
Member

jlowe commented Feb 3, 2021

This will flatten any of internal columns (and likewise with nested struct columns) into a table_view. You can then insert the table_view input to sort with the expanded struct column.

What about the validity vector of the parent struct column (and nested versions)?

@jrhemstad
Copy link
Contributor

This will flatten any of internal columns (and likewise with nested struct columns) into a table_view. You can then insert the table_view input to sort with the expanded struct column.

What about the validity vector of the parent struct column (and nested versions)?

Since they're just views, you can set the validity mask of the expanded child column to be the same as the parent struct column. I believe that should preserve the desired semantics. In fact, I thought we already pushed the parent null mask down into the children for structs?

@jlowe
Copy link
Member

jlowe commented Feb 3, 2021

In fact, I thought we already pushed the parent null mask down into the children for structs?

We do, but after the sort we still need the parent struct validity to be updated so it is accurately reflected in subsequent operations (e.g.: write out to Parquet). The parent struct may say an entry is valid yet all the children are null, and that's semantically different than a null struct. Pushing the parent validity down to the children is very useful when working with the children in isolation, but we're not doing that here. We're sorting the struct column itself, not the individual children within it. The two are similar but distinct operations.

I don't know how we can reconstruct the parent struct validity from the resulting children validity after the fact. AFAICT either the sort method will have to support this update directly somehow (and therefore take the struct column directly) or we'd have to use the child columns (if they are sort keys) to generate the gather map then use that to update the struct validity vectors.

@jlowe
Copy link
Member

jlowe commented Feb 3, 2021

The parent struct may say an entry is valid yet all the children are null, and that's semantically different than a null struct.

Hmm, I wonder if there's a sort order difference between a null struct vs. a valid struct containing all null children. If so then we can't use the individual child columns to proxy for the parent struct column.

@jrhemstad
Copy link
Contributor

jrhemstad commented Feb 3, 2021

In fact, I thought we already pushed the parent null mask down into the children for structs?

We do, but after the sort we still need the parent struct validity to be updated so it is accurately reflected in subsequent operations (e.g.: write out to Parquet). The parent struct may say an entry is valid yet all the children are null, and that's semantically different than a null struct. Pushing the parent validity down to the children is very useful when working with the children in isolation, but we're not doing that here. We're sorting the struct column itself, not the individual children within it. The two are similar but distinct operations.

I don't know how we can reconstruct the parent struct validity from the resulting children validity after the fact. AFAICT either the sort method will have to support this update directly somehow (and therefore take the struct column directly) or we'd have to use the child columns (if they are sort keys) to generate the gather map then use that to update the struct validity vectors.

Flattening the struct column into a table is just for establishing the lexicographical ordering (i.e., the gather map). We still have the original column and so when we call gather on the original column with the gather map, it'll preserve whatever the original state was of the original input struct column.

@revans2
Copy link
Contributor

revans2 commented Feb 3, 2021

Just to be clear in Spark when sorting structs the order of null first vs null last first applies to the struct itself and then applies to the fields in the struct.
desc_nulls_first

+------------+
|           s|
+------------+
|        null|
|        null|
|   {1, null}|
|   {0, null}|
|   {null, 1}|
|   {null, 0}|
|{null, null}|
|{null, null}|
+------------+

desc_nulls_last

+------------+
|           s|
+------------+
|   {1, null}|
|   {0, null}|
|   {null, 1}|
|   {null, 0}|
|{null, null}|
|{null, null}|
|        null|
|        null|
+------------+

The columns within the struct are sorted in the order that they appear in the struct.
Ascending order

+-----------+
|          s|
+-----------+
|  {0, 4, 1}|
|  {0, 4, 2}|
|  {1, 2, 5}|
|  {1, 3, 3}|
|  {1, 3, 4}|
|  {2, 1, 7}|
|  {2, 1, 8}|
|  {2, 2, 6}|
|  {3, 0, 9}|
| {3, 0, 10}|
| {3, 0, 11}|
|{4, -2, 14}|
|{4, -1, 12}|
|{4, -1, 13}|
|{5, -3, 16}|
|{5, -3, 17}|
|{5, -2, 15}|
|{6, -4, 18}|
|{6, -4, 19}|
+-----------+

Hopefully all of that was intuitive, but I wanted to be sure that the requirement is 100% clear

@jlowe
Copy link
Member

jlowe commented Feb 3, 2021

Just to be clear in Spark when sorting structs the order of null first vs null last first applies to the struct itself and then applies to the fields in the struct.

That means we won't be able to use only the child columns to proxy for the struct. A column to proxy as the struct validity vector will need to be prepended to the child columns when building the sort keys. The struct validity vector can be expanded as a BOOL8 column to meet that requirement.

@revans2
Copy link
Contributor

revans2 commented Feb 3, 2021

The struct validity vector can be expanded as a BOOL8 column to meet that requirement.

Or the comparator when sorting includes the base column of the struct (with the validity in it) and ignores the value or just produces a constant throw away value, and the null handling logic already in place just works.

@jlowe
Copy link
Member

jlowe commented Feb 3, 2021

Or the comparator when sorting includes the base column of the struct (with the validity in it) and ignores the value or just produces a constant throw away value, and the null handling logic already in place just works.

Agreed that would be preferable. I was thinking of ways to accomplish it with the existing comparator functionality.

@jrhemstad
Copy link
Contributor

This example doesn't make any sense to me:

desc_nulls_first

+------------+
|           s|
+------------+
|        null|
|        null|
|   {1, null}|
|   {0, null}|
|   {null, 1}|
|   {null, 0}|
|{null, null}|
|{null, null}|
+------------+

In a descending sort where nulls are considered larger than all other values ("nulls first") then shouldn't it be:

+------------+
|           s|
+------------+
|        null|
|        null|
|{null, null}|
|{null, null}|
|   {null, 1}|
|   {null, 0}|
|   {1, null}|
|   {0, null}|
+------------+

@revans2
Copy link
Contributor

revans2 commented Feb 4, 2021

@jrhemstad you are 100% right. I had to go back and read the code. In Spark the null ordering within a struct is separate from the null ordering on the main columns. It always goes with the default value of nulls first on ascending and nulls last on descending. That is counter intuitive to me, but it is what they do.

@jrhemstad
Copy link
Contributor

jrhemstad commented Feb 4, 2021

@jrhemstad you are 100% right. I had to go back and read the code. In Spark the null ordering within a struct is separate from the null ordering on the main columns. It always goes with the default value of nulls first on ascending and nulls last on descending. That is counter intuitive to me, but it is what they do.

That's not something we'll be able to support natively. The way to accomplish having struct members have different null ordering from the struct column itself is to explode the struct column children into their own columns much like I've already described, but additionally inserting a BOOL8 column that represents that parent column's bitmask (like @jlowe described above). Then you can specify whatever per-column (and therefore per-struct member) null ordering you wish.

If the null ordering is the same, we can support the fact that a null struct is different from a valid struct with all null members, i.e., null is ordered differently from {null, null}, we'll just have to do something different than I originally thought.

@kkraus14
Copy link
Collaborator

kkraus14 commented Feb 4, 2021

If the null ordering is the same, we can support the fact that a null struct is different from a valid struct with all null members, i.e., null is ordered differently from {null, null}, we'll just have to do something different than I originally thought.

This is the behavior we'd want from the Python side.

@revans2
Copy link
Contributor

revans2 commented Feb 4, 2021

That is kind of what I expected.

@kkraus14
Copy link
Collaborator

kkraus14 commented Feb 4, 2021

Is this something that would make sense to push to change the behavior in Spark?

@revans2
Copy link
Contributor

revans2 commented Feb 5, 2021

Is this something that would make sense to push to change the behavior in Spark?

We could try to change the Spark behavior, but my guess is that we would still have to support both ways. At least while the change rolls out as this would be a breaking change and likely have to go into Spark 4.0. Also, in enterprise software, like this, we would likely have to include a flag so users can get the old behavior if they rely on it. Because the default values are self consistent I am hoping that we can cover 99% of the use cases without having to split apart the struct.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@jlowe
Copy link
Member

jlowe commented Mar 22, 2021

This is being actively worked on in #7422.

rapids-bot bot pushed a commit that referenced this issue Mar 26, 2021
closes #7226

Add struct column support to `cudf::sort`, `cudf::sorted_order`
struct is supported by flattening the struct into individual columns in table_view, null mask of struct is converted to boolean column with same null_mask.

Authors:
  - Karthikeyan (@karthikeyann)

Approvers:
  - Gera Shegalov (@gerashegalov)
  - David (@davidwendt)
  - Nghia Truong (@ttnghia)
  - Jake Hemstad (@jrhemstad)
  - Conor Hoekstra (@codereport)

URL: #7422
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants