Build python output schema from udf expressions #1794

Merged
merged 2 commits into from
Feb 24, 2021

@firestarman commented Feb 23, 2021

Build the Python output schema from the Python UDF expressions instead of the plan result attributes, because the result attributes are NOT always equal to the Python output schema.

For example, on Databricks, when only one column is projected from a Python UDF output that contains multiple result columns, the result attributes for the projected output hold a single attribute, while the output schema of the Python UDF still contains multiple columns.
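
The mismatch described above can be sketched without Spark. This is only an illustration under stated assumptions: `Attr`, `PythonUdf`, and `schemaOf` are hypothetical stand-ins for Spark's attribute, Python UDF expression, and `StructType.fromAttributes`, and the column names and types are invented.

```scala
// Hypothetical stand-ins; none of these types come from Spark or spark-rapids.
object PythonOutputSchemaSketch {
  // A named, typed output column (stand-in for a Spark attribute).
  final case class Attr(name: String, dataType: String)

  // A Python UDF expression that knows the column it produces.
  final case class PythonUdf(name: String, resultAttribute: Attr)

  // Stand-in for StructType.fromAttributes: one (name, type) field per attribute.
  def schemaOf(attrs: Seq[Attr]): Seq[(String, String)] =
    attrs.map(a => (a.name, a.dataType))

  // The node evaluates two UDFs, so the Python worker returns two columns.
  val udfs: Seq[PythonUdf] = Seq(
    PythonUdf("udf_a", Attr("a_out", "int")),
    PythonUdf("udf_b", Attr("b_out", "array<int>"))
  )

  // Assumed scenario: the plan's result attributes were pruned to the single
  // projected column, as reported for Databricks.
  val resultAttrs: Seq[Attr] = Seq(Attr("a_out", "int"))

  // Schema built from the pruned result attributes: one field, too narrow.
  val schemaFromResultAttrs: Seq[(String, String)] = schemaOf(resultAttrs)

  // Schema built from the UDF expressions: one field per UDF.
  val schemaFromUdfs: Seq[(String, String)] = schemaOf(udfs.map(_.resultAttribute))
}
```

In this sketch, `schemaFromUdfs` has one field per UDF regardless of what the surrounding plan projects, which is the property the change relies on.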

Closes #1644

Signed-off-by: Firestarman <firestarmanllc@gmail.com>

@firestarman

build

revans2 previously approved these changes Feb 23, 2021
@@ -555,7 +555,7 @@ case class GpuArrowEvalPythonExec(

     // cache in a local to avoid serializing the plan
     val inputSchema = child.output.toStructType
-    val pythonOutputSchema = StructType.fromAttributes(resultAttrs)
+    val pythonOutputSchema = StructType.fromAttributes(udfs.map(_.resultAttribute))

@revans2

nit: Great catch on this, by the way. Could we add a comment here explaining what is happening and why we are not using the resultAttrs that was passed in? If I made this mistake before, a comment will hopefully keep someone from breaking it again, especially since it only shows up on Databricks.

@firestarman

Good suggestion, updated.

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@revans2 revans2 merged commit d2b6bfc into NVIDIA:branch-0.4 Feb 24, 2021
@firestarman firestarman deleted the fix-py-out-schema branch February 25, 2021 01:18
@sameerz sameerz added the bug Something isn't working label Mar 1, 2021
@sameerz sameerz added this to the Feb 16 - Feb 26 milestone Mar 1, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Successfully merging this pull request may close these issues.

[BUG] test_window_aggregate_udf_array_from_python fails on databricks