Support for Decimals with negative scale for Parquet Cached Batch Serializer #2675
Conversation
@revans2 thanks for reviewing. I am still working on the case where a DataFrame is cached on the GPU but read on the CPU; I should have something by mid-day.

@revans2 Can you please review this again? I have added the nested capability when writing Parquet on the GPU, as per your original feedback. I wanted to avoid casting Decimals when writing on the GPU because we are writing them as INTs to begin with, but that would be a problem if a user cached a DataFrame on the GPU and tried to read it on the CPU.
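The "writing them as INTs to begin with" point comes from how Parquet encodes decimals: the physical column holds the unscaled integer, and the physical type is chosen by precision. A small illustrative helper (not from this PR; the mapping follows the Parquet format spec) sketches the rule:

```python
# Illustrative only: per the Parquet format spec, a DECIMAL(precision, scale)
# column stores the unscaled integer value, with the physical type picked by
# the precision.
def parquet_physical_type(precision: int) -> str:
    """Physical type Parquet can use for DECIMAL(precision, scale)."""
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"

print(parquet_physical_type(7))   # INT32
print(parquet_physical_type(18))  # INT64
```

This is why a GPU-written cache read on the CPU needs care: both sides must agree on how the unscaled INT maps back to a logical decimal, including the scale.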
build

1 similar comment

build

build

@gerashegalov @revans2 do you have any more comments on this PR?
enableVectorizedConf = [{"spark.sql.inMemoryColumnarStorage.enableVectorizedReader": "true"},
                        {"spark.sql.inMemoryColumnarStorage.enableVectorizedReader": "false"}]

conf = [{"spark.sql.inMemoryColumnarStorage.enableVectorizedReader": "true",
nit: I don't like this being called `conf`. It makes it too easy to not parametrize something and use this directly. This is very minor.

I personally preferred the old way, where we would have a param for something very specific and then build up the conf, adding in what is needed without modifying anything global. The main reason for that is that the name of the test will not need to include the full config (which this is doing). But that is an even more minor nit.
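The "build up the conf" pattern described here can be sketched as follows. `copy_and_update` is modeled on the helper of that name in the spark-rapids integration tests, shown as a hypothetical standalone function: each test layers only its specific settings on top of a shared base without mutating anything global.

```python
def copy_and_update(base_conf, *updates):
    """Return a new conf dict: a copy of base_conf with each update
    merged in, leaving the shared base untouched."""
    conf = dict(base_conf)
    for update in updates:
        conf.update(update)
    return conf

# Example: a test-specific conf layered on a shared base.
base = {'spark.sql.inMemoryColumnarStorage.enableVectorizedReader': 'true'}
test_conf = copy_and_update(base, {'spark.rapids.sql.enabled': 'true'})
```

Because the base dict is never modified, the test name only needs to reflect the one parameter that varies, not the full config.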
I think I have addressed your concern.
My understanding is that you addressed `enableVectorizedConf` but not the granular parameters. It's minor, but to avoid some repetitiveness, instead of the whole lengthy conf parameters we could have parameters along these lines:
@pytest.mark.parametrize('enable_vectorized_conf', ['VECTORIZED_ON', 'VECTORIZED_OFF'], ...)
def test_passing_gpuExpr_as_Expr(enable_vectorized_conf):
    assert_gpu_and_cpu_are_equal_collect(
        ...,
        conf={'spark.sql.inMemoryColumnarStorage.enableVectorizedReader':
              str(enable_vectorized_conf == 'VECTORIZED_ON').lower()})
PCBS stores the cache in Parquet format, which is good because it compresses well, but the drawback is that we have to handle data that Parquet doesn't support. One example is Decimals with negative scale, which Spark supports as a legacy feature. This PR deals with storing the unscaled long value associated with Decimals so we can support negative scale without doing much. Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Changes made to add support for nested types containing Decimals with scale < 0. Some cleanup was also done. Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Co-authored-by: Gera Shegalov <gshegalov@nvidia.com> Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
* Fixed a bug where the selected attributes can be in a different order from the cached attributes, causing a crash in a distributed environment. Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build
I am still not thrilled about the `ifTrueThenDeepConvertTypeAtoTypeB` API, and I think we can improve it, but I think it is good enough for now.
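The shape of such a deep type conversion can be sketched in Python (illustrative only; the real API works on Spark `DataType`s and cuDF `ColumnView`s, and `deep_convert` is a hypothetical name): recurse through nested types and rewrite only the leaves that match a predicate.

```python
def deep_convert(dtype, predicate, convert):
    """Recursively rewrite leaf types matching `predicate`.
    Arrays are modeled as one-element lists, structs as dicts of
    field name -> type; anything else is a leaf type."""
    if isinstance(dtype, list):
        return [deep_convert(dtype[0], predicate, convert)]
    if isinstance(dtype, dict):
        return {name: deep_convert(t, predicate, convert)
                for name, t in dtype.items()}
    return convert(dtype) if predicate(dtype) else dtype

# Rewrite every decimal leaf to "long", however deeply nested.
schema = {"a": ["decimal(5,-2)"], "b": {"c": "int"}}
converted = deep_convert(schema,
                         lambda t: t.startswith("decimal"),
                         lambda t: "long")
# converted == {"a": ["long"], "b": {"c": "int"}}
```

The awkward part the review calls out is the boolean gate baked into the API name: whether the conversion runs at all is a separate decision from how the recursion works, and splitting the two would make the helper easier to reason about.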
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
I am having trouble testing my code locally. I apologize for the repeated CI failures.

build
I don't fully understand the serializer code yet, so I defer to @revans2 and other reviewers. The coding style nits are minor.
predicate: (DataType, ColumnView) => Boolean,
    toClose: ArrayBuffer[ColumnView]): ColumnView = {
  dataType match {
    case a:ArrayType =>
nit: here and elsewhere in the pattern match, space after :
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@revans2 @gerashegalov

build
PCBS stores the cache in Parquet format, which is good because it compresses well, but the drawback is that we have to handle data that Parquet doesn't support. One example is Decimals with negative scale, which Spark supports as a legacy feature.

This PR deals with storing the unscaled long value associated with Decimals so we can support negative scale. This PR also lays the foundation for possible performance improvements as part of #1143.
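The unscaled-value idea can be illustrated with Python's `decimal` module (a sketch of the concept only, not the PR's Scala code): a decimal with negative scale is just an unscaled integer times a power of ten, so storing the unscaled long lets Parquet hold the value without needing to understand the negative scale.

```python
from decimal import Decimal

d = Decimal("1.2E+2")  # 120; in Spark terms, unscaled 12 with scale -1
sign, digits, exponent = d.as_tuple()  # sign handling omitted for brevity
unscaled = int("".join(map(str, digits)))  # 12 -> fits in a plain long column
spark_scale = -exponent                    # Spark scale is the negated exponent

# Reading back: reapply the scale to the stored unscaled long.
restored = Decimal(unscaled).scaleb(-spark_scale)
assert restored == Decimal(120)
```

The cached column only ever sees the plain integer `12`; the scale `-1` lives in the schema metadata and is reapplied on read, which is why no legacy-decimal handling is needed inside the Parquet layer itself.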
Signed-off-by: Raza Jafri <rjafri@nvidia.com>