
Add support for Structs for UnionExec #1919

Merged · 9 commits merged into NVIDIA:branch-0.5 on Mar 23, 2021

Conversation

razajafri
Collaborator

This PR whitelists UnionExec for structs and adds tests for Spark 3.1.1.

Signed-off-by: Raza Jafri rjafri@nvidia.com

@sameerz sameerz added the feature request New feature or request label Mar 11, 2021
@pytest.mark.parametrize('data_gen', all_gen + [all_basic_struct_gen, StructGen([['child0', DecimalGen(7, 2)]])], ids=idfn)
@pytest.mark.skipif(is_before_spark_311(), reason="This is supported only in Spark 3.1.1+")
def test_union_by_missing_col_name(data_gen):
    if (isinstance(data_gen, StructGen)):
Collaborator

In my experience, once activated, xfail sticks around for all further test-case iterations. Some examples of the documented way of registering a parameter-specific xfail are here.
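
For reference, a minimal sketch of a parameter-specific xfail using pytest.param (the xfail reason and the exact split of parameters are illustrative; all_gen, StructGen, DecimalGen, and idfn are the existing helpers from the integration tests):

import pytest

# Only the struct parameters carry the xfail mark; the other parameters run normally.
struct_gens = [
    pytest.param(StructGen([['child0', DecimalGen(7, 2)]]),
                 marks=pytest.mark.xfail(reason='illustrative: structs not supported here yet'))
]

@pytest.mark.parametrize('data_gen', all_gen + struct_gens, ids=idfn)
def test_union_by_missing_col_name(data_gen):
    ...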

Collaborator Author

I have updated it, but the xfail doesn't seem to persist or stick; please correct me if I am wrong. I tested it by placing the struct data_gen before all_gen, and it was not xfailing every test after the struct.

Nevertheless, I have made the change to follow the convention.

@razajafri
Collaborator Author

@gerashegalov PTAL

@razajafri
Collaborator Author

build

Collaborator

@gerashegalov left a comment

LGTM, question about nesting levels

ExecChecks((TypeSig.commonCudfTypes + TypeSig.NULL + TypeSig.DECIMAL + TypeSig.STRUCT
    .withPsNote(TypeEnum.STRUCT, "unionByName will not optionally fill missing columns with " +
      "nulls when the col has structs"))
    .nested(), TypeSig.all),
Collaborator

Do we want to allow this for multi-level nesting right away? In the sort PR we are doing single-level first.

Collaborator Author

Correct me if I am wrong, but won't they both result in the same code?

new TypeSig(initialTypes, nestedTypes ++ nesting.initialTypes, litOnlyTypes, notes)

new TypeSig(initialTypes, initialTypes ++ nestedTypes, litOnlyTypes, notes)

In your case you are basically passing nesting.initialTypes = initialTypes, so how is it different?

Collaborator

The difference is whether STRUCT spills into the nested types; you can observe the difference in the REPL:

scala> val pluginSupportedOrderableSig = (
     |     TypeSig.commonCudfTypes +
     |     TypeSig.NULL +
     |     TypeSig.DECIMAL +
     |     TypeSig.STRUCT.nested(
     |       TypeSig.commonCudfTypes +
     |       TypeSig.NULL +
     |       TypeSig.DECIMAL
     |     ))
pluginSupportedOrderableSig: com.nvidia.spark.rapids.TypeSig = com.nvidia.spark.rapids.TypeSig@468f547e

scala> val typeSigBlanketNested = (TypeSig.commonCudfTypes + TypeSig.NULL + TypeSig.DECIMAL + TypeSig.STRUCT).nested()

scala> val singleNested = new StructType().add("value", IntegerType)
singleNested: org.apache.spark.sql.types.StructType = StructType(StructField(value,IntegerType,true))

scala> val doubleNested = new StructType().add("nestedInt", new StructType().add("value", IntegerType))
doubleNested: org.apache.spark.sql.types.StructType = StructType(StructField(nestedInt,StructType(StructField(value,IntegerType,true)),true))

scala> pluginSupportedOrderableSig.isSupportedByPlugin(singleNested, true)
res8: Boolean = true

scala> pluginSupportedOrderableSig.isSupportedByPlugin(doubleNested, true)
res7: Boolean = false

But with typeSigBlanketNested, double nesting, for example, would be allowed:

scala> typeSigBlanketNested.isSupportedByPlugin(doubleNested, true)
res11: Boolean = true

Collaborator Author

Oh, of course! I missed the important detail that the struct will be part of the initialTypes. Thanks for the explanation.

I have pushed an update. PTAL

@razajafri
Collaborator Author

CI failed due to lack of resources

@razajafri
Collaborator Author

build

@razajafri
Collaborator Author

build

@razajafri
Collaborator Author

@gerashegalov can you please see if I have addressed all your concerns?

gerashegalov previously approved these changes Mar 20, 2021

Collaborator
@gerashegalov left a comment

LGTM

I tried some wording on the doc, but I'm not sure it's accurate.

@razajafri
Collaborator Author

@gerashegalov your approval was clobbered due to the updated docs. Please approve again

gerashegalov previously approved these changes Mar 20, 2021
@sameerz
Collaborator

sameerz commented Mar 22, 2021

build

@razajafri
Collaborator Author

The CI build fails because a line exceeds 100 characters, but the line shows up as within 100 characters in the PR and in my local build:

(exchange, conf, p, r) => new GpuBroadcastMeta(exchange, conf, p, r)),

@razajafri
Collaborator Author

Figured it out: @gerashegalov made a suggestion that I accepted, and it made the line exceed 100 characters. I understand that he made the suggestion in the browser and it's not his fault.

This is why I tend not to accept suggestions directly in the browser and usually make the change manually. @gerashegalov, can you please bless this PR one more time after the build is successful?

@razajafri
Collaborator Author

build

@razajafri
Collaborator Author

build

@razajafri
Collaborator Author

build

@razajafri
Collaborator Author

I had to force-push to get rid of the error. I hope this will pass the tests.

@razajafri
Collaborator Author

build

razajafri and others added 8 commits March 23, 2021 09:38
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
….scala

Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
….scala

Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Collaborator Author

build

def test_union_by_missing_col_name(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : binary_op_df(spark, data_gen).withColumnRenamed("a", "x")
            .unionByName(debug_df(binary_op_df(spark, data_gen).withColumnRenamed("a", "y")), True))
Collaborator

Please remove the debug_df.
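
For clarity, the same assertion with the debug_df wrapper removed would look roughly like this (a sketch based on the snippet above):

def test_union_by_missing_col_name(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: binary_op_df(spark, data_gen).withColumnRenamed("a", "x")
            .unionByName(binary_op_df(spark, data_gen).withColumnRenamed("a", "y"), True))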

StructGen([['child1', IntegerGen()]])), marks=nested_scalar_mark),
(StructGen([['child0', DecimalGen(7, 2)]], nullable=False),
StructGen([['child1', IntegerGen()]], nullable=False))], ids=idfn)
def test_union_struct(data_gen):
Collaborator

I'm kind of confused by the testing here.

We have test_union_struct, which tests that we can do a union of structs with different children so long as the struct is not nullable. I thought that required version 3.1.1 or above to work. Am I wrong? I would also like the name to be clearer, something like test_union_struct_missing_children.

Then we have test_union, which tests that we can do a union of anything so long as the types are the same on both sides.

Finally, we have test_union_by_missing_col_name, which requires Spark 3.1.1 or later and verifies that we can rename some columns and that nulls are inserted in place of the missing columns.

Could we have some comments added explaining the difference between the various tests?
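
For example, comments along these lines would work (a sketch that only restates the descriptions above; the wording should be checked against the actual behavior):

# Unions structs whose children differ on the two sides
# (with the clearer name suggested above).
def test_union_struct_missing_children(data_gen):
    ...

# Unions two DataFrames whose types are the same on both sides.
def test_union(data_gen):
    ...

# Spark 3.1.1+ only: renames a column on each side and checks that
# unionByName with allowMissingColumns=True inserts nulls for the missing columns.
def test_union_by_missing_col_name(data_gen):
    ...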

@@ -19,7 +19,7 @@
 from marks import *
 from pyspark.sql.types import *
 import pyspark.sql.functions as f
-from spark_session import is_before_spark_310
+from spark_session import is_before_spark_311
Collaborator

Why did all of these change to is_before_spark_311? I know that spark310 was never really released, so it probably does not matter too much, but it is confusing.

Collaborator Author

Do you want me to revert these changes? I changed it because 3.1.0 wasn't released.

Collaborator

No need to revert it. It just would have been nice to have it called out in the PR, so I would know it's in there instead of scratching my head about why there are changes that don't look related.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Collaborator Author

build

@razajafri
Collaborator Author

@revans2 can you please take another look?

@razajafri razajafri merged commit 2d0b246 into NVIDIA:branch-0.5 Mar 23, 2021
@razajafri razajafri deleted the union-exec-test branch March 23, 2021 21:47
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* union support for structs

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* added more tests

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* format change

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* added missing all_gen

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* addressed review comments

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* line exceeds limit

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* added comments for tests

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

Co-authored-by: Raza Jafri <rjafri@nvidia.com>
Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021