
Add support for Structs for UnionExec #1919

Merged · 9 commits merged into NVIDIA:branch-0.5 on Mar 23, 2021

Conversation

razajafri
Collaborator

This PR whitelists UnionExec for structs and adds tests for Spark 3.1.1.

Signed-off-by: Raza Jafri rjafri@nvidia.com

@sameerz sameerz added the feature request New feature or request label Mar 11, 2021
@pytest.mark.parametrize('data_gen', all_gen + [all_basic_struct_gen, StructGen([['child0', DecimalGen(7, 2)]])], ids=idfn)
@pytest.mark.skipif(is_before_spark_311(), reason="This is supported only in Spark 3.1.1+")
def test_union_by_missing_col_name(data_gen):
    if (isinstance(data_gen, StructGen)):
Collaborator

In my experience, once activated, xfail sticks around for all further test-case iterations. Some examples of the documented way of registering a parameter-specific xfail are here.
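
For reference, a minimal sketch of a parameter-specific xfail using pytest.param (the xfail reason and the exact split of parameters are illustrative; all_gen, StructGen, DecimalGen, and idfn are the existing helpers from the integration tests):

import pytest

# Only the struct parameters carry the xfail mark; the other parameters run normally.
struct_gens = [
    pytest.param(StructGen([['child0', DecimalGen(7, 2)]]),
                 marks=pytest.mark.xfail(reason='illustrative: structs not supported here yet'))
]

@pytest.mark.parametrize('data_gen', all_gen + struct_gens, ids=idfn)
def test_union_by_missing_col_name(data_gen):
    ...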

Collaborator Author

I have updated it, but the xfail doesn't seem to persist or stick; please correct me if I am wrong. I tested it by placing the struct data_gen before all_gen, and it was not xfailing every test after the struct.

Nevertheless, I have made the change to follow the convention.

@razajafri
Collaborator Author

@gerashegalov PTAL

@razajafri
Collaborator Author

build

Collaborator

@gerashegalov left a comment

LGTM, question about nesting levels

ExecChecks((TypeSig.commonCudfTypes + TypeSig.NULL + TypeSig.DECIMAL + TypeSig.STRUCT
    .withPsNote(TypeEnum.STRUCT, "unionByName will not optionally fill missing columns with " +
      "nulls when the col has structs"))
    .nested(), TypeSig.all),
Collaborator

Do we want to allow this for multi-level nesting right away? In the sort PR we are doing single-level first.

Collaborator Author

Correct me if I am wrong, but won't they both result in the same code?

new TypeSig(initialTypes, nestedTypes ++ nesting.initialTypes, litOnlyTypes, notes)

new TypeSig(initialTypes, initialTypes ++ nestedTypes, litOnlyTypes, notes)

In your case you are basically passing nesting.initialTypes = initialTypes, so how is it different?

Collaborator

The difference is whether STRUCT spills into the nested types; you can observe the difference in the REPL:

scala> val pluginSupportedOrderableSig = (
     |     TypeSig.commonCudfTypes +
     |     TypeSig.NULL +
     |     TypeSig.DECIMAL +
     |     TypeSig.STRUCT.nested(
     |       TypeSig.commonCudfTypes +
     |       TypeSig.NULL +
     |       TypeSig.DECIMAL
     |     ))
pluginSupportedOrderableSig: com.nvidia.spark.rapids.TypeSig = com.nvidia.spark.rapids.TypeSig@468f547e

scala> val typeSigBlanketNested = (TypeSig.commonCudfTypes + TypeSig.NULL + TypeSig.DECIMAL + TypeSig.STRUCT).nested()

scala> val singleNested = new StructType().add("value", IntegerType)
singleNested: org.apache.spark.sql.types.StructType = StructType(StructField(value,IntegerType,true))

scala> val doubleNested = new StructType().add("nestedInt", new StructType().add("value", IntegerType))
doubleNested: org.apache.spark.sql.types.StructType = StructType(StructField(nestedInt,StructType(StructField(value,IntegerType,true)),true))

scala> pluginSupportedOrderableSig.isSupportedByPlugin(singleNested, true)
res8: Boolean = true

scala> pluginSupportedOrderableSig.isSupportedByPlugin(doubleNested, true)
res7: Boolean = false

But with typeSigBlanketNested, double nesting, for example, would be allowed:

scala> typeSigBlanketNested.isSupportedByPlugin(doubleNested, true)
res11: Boolean = true

Collaborator Author

Oh, of course! I missed the important detail that the struct will be part of the initialTypes. Thanks for the explanation.

I have pushed an update. PTAL

@razajafri
Collaborator Author

CI failed due to lack of resources

@razajafri
Collaborator Author

build

@razajafri
Collaborator Author

build

@razajafri
Collaborator Author

@gerashegalov can you please see if I have addressed all your concerns?

gerashegalov previously approved these changes Mar 20, 2021

Collaborator
@gerashegalov left a comment

LGTM

I tried some wording on the doc, but I'm not sure it's accurate.

@razajafri
Collaborator Author

@gerashegalov your approval was clobbered due to the updated docs. Please approve again

gerashegalov previously approved these changes Mar 20, 2021
@sameerz
Collaborator

sameerz commented Mar 22, 2021

build

@razajafri
Collaborator Author

The CI build fails because a line exceeds 100 characters, but the line shows up as within 100 characters in the PR and in my local build:

(exchange, conf, p, r) => new GpuBroadcastMeta(exchange, conf, p, r)),

@razajafri
Collaborator Author

Figured it out: @gerashegalov made a suggestion that I accepted, and it made the line exceed 100 characters. I understand that he made the suggestion in the browser and it's not his fault.

This is why I tend not to accept suggestions directly in the browser and usually make the change manually. @gerashegalov, can you please bless this PR one more time after the build is successful?

@razajafri
Collaborator Author

build

@razajafri
Collaborator Author

build

@razajafri
Collaborator Author

build

@razajafri
Collaborator Author

I had to force-push to get rid of the error. I hope this will pass the tests.

@razajafri
Collaborator Author

build

razajafri and others added 8 commits March 23, 2021 09:38
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
….scala

Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
….scala

Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Collaborator Author

build

def test_union_by_missing_col_name(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : binary_op_df(spark, data_gen).withColumnRenamed("a", "x")
            .unionByName(debug_df(binary_op_df(spark, data_gen).withColumnRenamed("a", "y")), True))
Collaborator

Please remove the debug_df.
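
For clarity, the same assertion with the debug_df wrapper removed would look roughly like this (a sketch based on the snippet above):

def test_union_by_missing_col_name(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: binary_op_df(spark, data_gen).withColumnRenamed("a", "x")
            .unionByName(binary_op_df(spark, data_gen).withColumnRenamed("a", "y"), True))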

StructGen([['child1', IntegerGen()]])), marks=nested_scalar_mark),
(StructGen([['child0', DecimalGen(7, 2)]], nullable=False),
StructGen([['child1', IntegerGen()]], nullable=False))], ids=idfn)
def test_union_struct(data_gen):
Collaborator

I'm kind of confused by the testing here.

We have test_union_struct, which tests that we can do a union of structs with different children so long as the struct is not nullable. I thought that required version 3.1.1 or above to work. Am I wrong? I would also like the name to be clearer, something like test_union_struct_missing_children.

Then we have test_union, which tests that we can do a union of anything so long as the types are the same on both sides.

Finally, we have test_union_by_missing_col_name, which requires Spark 3.1.1 or later and verifies that we can rename some columns and that nulls are inserted in place of the missing columns.

Could we have some comments added explaining the difference between the various tests?
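
For example, comments along these lines would work (a sketch that only restates the descriptions above; the wording should be checked against the actual behavior):

# Unions structs whose children differ on the two sides
# (with the clearer name suggested above).
def test_union_struct_missing_children(data_gen):
    ...

# Unions two DataFrames whose types are the same on both sides.
def test_union(data_gen):
    ...

# Spark 3.1.1+ only: renames a column on each side and checks that
# unionByName with allowMissingColumns=True inserts nulls for the missing columns.
def test_union_by_missing_col_name(data_gen):
    ...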

@@ -19,7 +19,7 @@
 from marks import *
 from pyspark.sql.types import *
 import pyspark.sql.functions as f
-from spark_session import is_before_spark_310
+from spark_session import is_before_spark_311
Collaborator

Why did all of these change to is_before_spark_311? I know that spark310 was never really released, so it probably does not matter too much, but it is confusing.

Collaborator Author

Do you want me to revert these changes? I changed it because 3.1.0 wasn't released.

Collaborator

No need to revert it. It just would have been nice to have it called out in the PR, so I would know it's in there instead of scratching my head about why there are changes that don't look related.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri
Collaborator Author

build

@razajafri
Collaborator Author

@revans2 can you please take another look?

@razajafri razajafri merged commit 2d0b246 into NVIDIA:branch-0.5 Mar 23, 2021
@razajafri razajafri deleted the union-exec-test branch March 23, 2021 21:47
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* union support for structs

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* added more tests

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* format change

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* added missing all_gen

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* addressed review comments

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala

Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* line exceeds limit

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* added comments for tests

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

Co-authored-by: Raza Jafri <rjafri@nvidia.com>
Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021