Add support for Structs for UnionExec #1919
Conversation
@pytest.mark.parametrize('data_gen', all_gen + [all_basic_struct_gen, StructGen([['child0', DecimalGen(7, 2)]])], ids=idfn)
@pytest.mark.skipif(is_before_spark_311(), reason="This is supported only in Spark 3.1.1+")
def test_union_by_missing_col_name(data_gen):
    if (isinstance(data_gen, StructGen)):
In my experience, once activated, xfail would stick around for all further test-case iterations. Some examples of the documented way of registering a parameter-specific xfail are here.
I have updated it, but the xfail doesn't persist or stick; please correct me if I am wrong. I tested it by placing the struct data_gen before all_gen, and it was not xfailing every test after the struct.
Nevertheless, I have made the change to follow the convention.
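For reference, the documented pytest pattern being discussed wraps only the problematic parameter in `pytest.param` with a mark, so the mark travels with that parameter rather than with the test function. This is a minimal sketch; `all_gen`, `struct_gen`, and the reason string are stand-ins for the generators in this PR's test file:

```python
import pytest

# Hypothetical stand-ins for the data generators used in the PR's tests.
all_gen = ['int_gen', 'string_gen']
struct_gen = 'struct_gen'

# Only the struct parameter is marked xfail; the other parametrized
# cases run normally, so the mark cannot "stick" to later iterations.
@pytest.mark.parametrize('data_gen', all_gen + [
    pytest.param(struct_gen,
                 marks=pytest.mark.xfail(reason='structs not supported yet'))
], ids=str)
def test_union(data_gen):
    assert data_gen in all_gen
```

Because the mark is attached to the `ParameterSet` rather than to the function, placing the struct parameter first or last in the list makes no difference.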
@gerashegalov PTAL
build
LGTM, question about nesting levels
ExecChecks((TypeSig.commonCudfTypes + TypeSig.NULL + TypeSig.DECIMAL + TypeSig.STRUCT
  .withPsNote(TypeEnum.STRUCT, "unionByName will not optionally fill missing columns with " +
    "nulls when the col has structs"))
  .nested(), TypeSig.all),
Do we want to allow this for multi-level nesting right away? In the sort PR we are doing single-level first.
Correct me if I am wrong, but won't they both result in the same code?
new TypeSig(initialTypes, nestedTypes ++ nesting.initialTypes, litOnlyTypes, notes)
new TypeSig(initialTypes, initialTypes ++ nestedTypes, litOnlyTypes, notes)
In your case you are basically passing nesting.initialTypes = initialTypes, so how is it different?
The difference is whether STRUCT spills into the nested types. You can observe the difference in the REPL:
scala> val pluginSupportedOrderableSig = (
| TypeSig.commonCudfTypes +
| TypeSig.NULL +
| TypeSig.DECIMAL +
| TypeSig.STRUCT.nested(
| TypeSig.commonCudfTypes +
| TypeSig.NULL +
| TypeSig.DECIMAL
| ))
pluginSupportedOrderableSig: com.nvidia.spark.rapids.TypeSig = com.nvidia.spark.rapids.TypeSig@468f547e
scala> val typeSigBlanketNested = (TypeSig.commonCudfTypes + TypeSig.NULL + TypeSig.DECIMAL + TypeSig.STRUCT).nested()
scala> val singleNested = new StructType().add("value", IntegerType)
singleNested: org.apache.spark.sql.types.StructType = StructType(StructField(value,IntegerType,true))
scala> val doubleNested = new StructType().add("nestedInt", new StructType().add("value", IntegerType))
doubleNested: org.apache.spark.sql.types.StructType = StructType(StructField(nestedInt,StructType(StructField(value,IntegerType,true)),true))
scala> pluginSupportedOrderableSig.isSupportedByPlugin(singleNested, true)
res8: Boolean = true
scala> pluginSupportedOrderableSig.isSupportedByPlugin(doubleNested, true)
res7: Boolean = false
But with typeSigBlanketNested, double nesting would be allowed:
scala> typeSigBlanketNested.isSupportedByPlugin(doubleNested, true)
res11: Boolean = true
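The semantics in the REPL session above can be modeled with two plain sets: one for types allowed at the top level and one for types allowed inside a struct. This is a toy Python sketch of that behavior, not the real TypeSig API; the names `is_supported`, `single_only`, and `blanket` are hypothetical:

```python
# Toy model: `initial` = types allowed at the top level,
# `nested` = types allowed inside a struct. Structs are dicts here.
BASIC = {'int', 'string', 'decimal', 'null'}

def is_supported(dtype, initial, nested, depth=0):
    allowed = initial if depth == 0 else nested
    if isinstance(dtype, dict):          # a struct: {field_name: child_type}
        if 'struct' not in allowed:
            return False
        return all(is_supported(c, initial, nested, depth + 1)
                   for c in dtype.values())
    return dtype in allowed

# Like STRUCT.nested(basic): struct at top level, only basic types inside.
single_only = dict(initial=BASIC | {'struct'}, nested=BASIC)
# Like (basic + STRUCT).nested(): struct allowed at every level.
blanket = dict(initial=BASIC | {'struct'}, nested=BASIC | {'struct'})

single = {'value': 'int'}
double = {'nestedInt': {'value': 'int'}}

print(is_supported(single, **single_only))  # True
print(is_supported(double, **single_only))  # False: no struct in nested set
print(is_supported(double, **blanket))      # True
```

This mirrors the REPL results: `pluginSupportedOrderableSig` rejects `doubleNested` because STRUCT is absent from its nested set, while the blanket `.nested()` signature accepts it.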
Oh, of course! I missed a very important point: the struct will be part of the initialTypes. Thanks for the explanation.
I have pushed an update. PTAL
CI failed due to lack of resources
build
1 similar comment
build
@gerashegalov can you please see if I have addressed all your concerns?
LGTM. I tried some wording on the doc, but I'm not sure it's accurate.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala (outdated, resolved)
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala (outdated, resolved)
@gerashegalov your approval was clobbered due to the updated docs. Please approve again
build
The CI build fails because a line exceeds 100 chars, but the line shows up as within 100 chars in the PR and in my local build: spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala, line 2693 in 4684a32.
Figured it out: @gerashegalov made a suggestion that I accepted, and that made the line exceed 100 chars. He made the suggestion in the browser, so it's not his fault; this is why I tend not to accept suggestions directly from the browser and usually make the change manually. @gerashegalov can you please bless this PR one more time after the build is successful?
build
build
1 similar comment
build
Force-pushed 66ae393 to 3cdad73
I had to force push to get rid of the error. I hope this will pass tests.
build
Force-pushed 44f971d to ba79907
build
def test_union_by_missing_col_name(data_gen):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : binary_op_df(spark, data_gen).withColumnRenamed("a", "x")
            .unionByName(debug_df(binary_op_df(spark, data_gen).withColumnRenamed("a", "y")), True))
please remove the debug_df
    StructGen([['child1', IntegerGen()]])), marks=nested_scalar_mark),
(StructGen([['child0', DecimalGen(7, 2)]], nullable=False),
    StructGen([['child1', IntegerGen()]], nullable=False))], ids=idfn)
def test_union_struct(data_gen):
I'm kind of confused by the testing here.
We have test_union_struct, which tests that we can do a union of structs with different children so long as the struct is not nullable. I thought that this required version 3.1.1 or above to work. Am I wrong? I would also like the name to be clearer, something like test_union_struct_missing_children.
Then we have test_union, which tests that we can do a union of anything so long as the types are the same on both sides.
Finally, we have test_union_by_missing_col_name, which requires Spark 3.1.1 or later and verifies that we can rename some columns and that nulls are inserted in place of the missing columns.
Could we have some comments added explaining the difference between the various tests?
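For context, `unionByName(other, allowMissingColumns=True)` (added in Spark 3.1.1) builds the union of both column sets and fills columns missing from either side with nulls. This is a plain-Python sketch of that semantics on rows-as-dicts, not the Spark API; `union_by_name` is a hypothetical helper:

```python
def union_by_name(left_rows, right_rows):
    """Sketch of unionByName(allowMissingColumns=True) on lists of dicts:
    the output schema is the union of both column sets, and columns
    missing from a row are filled with None (Spark fills with nulls)."""
    columns = sorted(set().union(*(r.keys() for r in left_rows + right_rows)))
    return [{c: row.get(c) for c in columns} for row in left_rows + right_rows]

left = [{'x': 1, 'b': 10}]   # has column 'x' but not 'y'
right = [{'y': 2, 'b': 20}]  # has column 'y' but not 'x'
print(union_by_name(left, right))
# → [{'b': 10, 'x': 1, 'y': None}, {'b': 20, 'x': None, 'y': 2}]
```

This is exactly what test_union_by_missing_col_name exercises on the GPU: after renaming `a` to `x` on one side and `y` on the other, the missing column on each side must come back as nulls.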
@@ -19,7 +19,7 @@
 from marks import *
 from pyspark.sql.types import *
 import pyspark.sql.functions as f
-from spark_session import is_before_spark_310
+from spark_session import is_before_spark_311
Why did all of these change to is_before_spark_311? I know that Spark 3.1.0 was never really released, so it probably does not matter too much, but it is confusing.
Do you want me to revert these changes? I changed it because 3.1.0 wasn't released.
No need to revert it. It just would have been nice to have it called out in the PR, so I know that it is in there instead of scratching my head as to why there are changes in here that don't look related.
build
@revans2 can you please take another look?
* union support for structs
* added more tests
* format change
* added missing all_gen
* addressed review comments
* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala
* line exceeds limit
* added comments for tests

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
This PR whitelists UnionExec for structs and adds tests for Spark 3.1.1.
Signed-off-by: Raza Jafri rjafri@nvidia.com