
Support for Decimals with negative scale for Parquet Cached Batch Serializer #2675

Merged: 20 commits into NVIDIA:branch-21.08 on Jul 1, 2021

Conversation

razajafri (Collaborator)

PCBS stores the cache in Parquet format, which is good because it compresses well, but the drawback is that we have to handle data that Parquet doesn't. One example is Decimals with negative scale, which Spark supports as a legacy feature.

This PR deals with storing the unscaled long value associated with Decimals so we can support negative scale. This PR also lays the foundation for a possible performance improvement, which is part of #1143.

Signed-off-by: Raza Jafri rjafri@nvidia.com
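
To make that concrete, here is a minimal Python sketch (illustrative only, not the PR's Scala code) of the unscaled-value split described above:

```python
from decimal import Decimal

def unscaled_and_scale(d):
    """Mirror java.math.BigDecimal's unscaledValue()/scale() split:
    d == unscaled * 10**(-scale)."""
    sign, digits, exponent = d.as_tuple()
    unscaled = int("".join(map(str, digits))) * (-1 if sign else 1)
    return unscaled, -exponent  # Java-style scale is the negated exponent

# A decimal with negative scale: 1.23E+4 == 12300 is unscaled 123 with
# scale -2, so persisting the unscaled long plus the scale loses nothing.
print(unscaled_and_scale(Decimal("1.23E+4")))  # (123, -2)
```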

sameerz added the "task" label (Work required that improves the product but is not user facing) on Jun 10, 2021
sameerz added this to the June 7 - June 18 milestone on Jun 10, 2021
razajafri (Collaborator, Author)

@revans2 thanks for reviewing. I am still working on the case where a DataFrame is cached on the GPU but read on the CPU. I should have something by mid-day.

razajafri (Collaborator, Author)

@revans2 Can you please review this again? I have added the nested capability when writing Parquet on the GPU, as per your original feedback. I wanted to avoid casting Decimals when writing on the GPU because we write them as INTs to begin with, but that would be a problem if a user cached a DataFrame on the GPU and tried to read it on the CPU.
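
A hypothetical reproduction of that cross-engine scenario might look like this (the conf names are from Spark and the plugin docs; the session setup and the SQL are assumptions, not code from this PR):

```python
from pyspark.sql import SparkSession

# Assumes the RAPIDS Accelerator jars are on the classpath.
spark = (SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.sql.cache.serializer",
            "com.nvidia.spark.ParquetCachedBatchSerializer")
    .config("spark.sql.legacy.allowNegativeScaleOfDecimal", "true")
    .getOrCreate())

df = spark.sql("SELECT CAST(12300 AS DECIMAL(7, -2)) AS d").cache()
df.count()                                     # materialize the cache on the GPU
spark.conf.set("spark.rapids.sql.enabled", "false")
df.collect()                                   # read the same cached data on the CPU
```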

razajafri requested a review from revans2 on June 14, 2021 at 21:41
razajafri (Collaborator, Author)

build


revans2 changed the title from "Support for Decimals with negative scale" to "Support for Decimals with negative scale for Parquet Cached Batch Serializer" on Jun 18, 2021
razajafri (Collaborator, Author)

build

razajafri (Collaborator, Author)

@gerashegalov @revans2 do you have any more comments on this PR?


```python
enableVectorizedConf = [{"spark.sql.inMemoryColumnarStorage.enableVectorizedReader": "true"},
                        {"spark.sql.inMemoryColumnarStorage.enableVectorizedReader": "false"}]
conf = [{"spark.sql.inMemoryColumnarStorage.enableVectorizedReader": "true",
```
Collaborator

nit: I don't like this being called conf. It makes it too easy to skip parametrizing something and use this directly. This is very minor.

I personally preferred the old way, where we would have a param for something very specific and then build up the conf, adding in what is needed without modifying anything global. The main reason for that is that the name of the test will not need to include the full config (which this is doing). But that is an even more minor nit.
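
A minimal sketch of that pattern, with hypothetical test and conf names (the legacy decimal conf is just one relevant example):

```python
import pytest

# Shared settings stay read-only; each test copies them into its own conf.
_base_conf = {"spark.sql.legacy.allowNegativeScaleOfDecimal": "true"}

@pytest.mark.parametrize("vectorized", ["true", "false"])
def test_cache_decimal_roundtrip(vectorized):
    conf = {**_base_conf,
            "spark.sql.inMemoryColumnarStorage.enableVectorizedReader": vectorized}
    # run the cached-DataFrame round trip with `conf`; nothing global is mutated
    ...
```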

razajafri (Collaborator, Author)

I think I have addressed your concern.

Collaborator

My understanding is that you addressed enableVectorizedConf but not the granular parameters. It's minor, but to avoid some repetitiveness, instead of the whole lengthy conf parameters we could have parameters along these lines:

```python
@pytest.mark.parametrize('enable_vectorized_conf', ['VECTORIZED_ON', 'VECTORIZED_OFF'], ...)
def test_passing_gpuExpr_as_Expr(enable_vectorized_conf):
    assert_gpu_and_cpu_are_equal_collect(
        ...,
        # Spark confs take string values, so map the knob to 'true'/'false'
        conf={'spark.sql.inMemoryColumnarStorage.enableVectorizedReader':
              str(enable_vectorized_conf == 'VECTORIZED_ON').lower()})
```

razajafri and others added 11 commits June 23, 2021 17:40
PCBS stores the cache in Parquet format, which is good because it compresses well, but the drawback is that we have to handle data that Parquet doesn't. One example is Decimals with negative scale, which Spark supports as a legacy feature.

This PR deals with storing the unscaled long value associated with Decimals so we can support negative scale without doing much extra work.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Changes made to add support for nested types containing Decimals with scale < 0. Some cleanup was also done.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
* Fixed a bug where the selected attributes could be in a different order from the cached attributes, causing a crash in a distributed environment.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
razajafri (Collaborator, Author)

build

revans2 previously approved these changes Jun 25, 2021

revans2 (Collaborator) left a comment

I am still not thrilled about the ifTrueThenDeepConvertTypeAtoTypeB API, and I think we can improve it, but I think it is good enough for now.
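
For context, the general shape of such a deep-convert helper (recursively rewriting the leaves of a nested type that match a predicate) can be sketched with PySpark types like this; the names are illustrative only, and the actual implementation works on cudf ColumnViews in Scala:

```python
from pyspark.sql.types import (ArrayType, DataType, DecimalType, LongType,
                               MapType, StructField, StructType)

def deep_convert(dt: DataType, pred, convert) -> DataType:
    """Recursively replace every leaf type matching `pred` with `convert(leaf)`."""
    if isinstance(dt, StructType):
        return StructType([
            StructField(f.name, deep_convert(f.dataType, pred, convert), f.nullable)
            for f in dt.fields])
    if isinstance(dt, ArrayType):
        return ArrayType(deep_convert(dt.elementType, pred, convert), dt.containsNull)
    if isinstance(dt, MapType):
        return MapType(deep_convert(dt.keyType, pred, convert),
                       deep_convert(dt.valueType, pred, convert),
                       dt.valueContainsNull)
    return convert(dt) if pred(dt) else dt

# e.g. swap negative-scale decimals for LongType before writing:
schema = ArrayType(DecimalType(7, -2))
print(deep_convert(schema,
                   lambda t: isinstance(t, DecimalType) and t.scale < 0,
                   lambda t: LongType()))  # ArrayType(LongType(), True)
```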

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
razajafri (Collaborator, Author)

build

revans2 previously approved these changes Jun 28, 2021
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
razajafri (Collaborator, Author)

I am having trouble testing my code locally. I apologize for the repeated CI failures.

razajafri (Collaborator, Author)

build

gerashegalov previously approved these changes Jun 29, 2021

gerashegalov (Collaborator) left a comment

I don't fully understand the serializer code yet, so I defer to @revans2 and the other reviewers. The coding style nits are minor.

```scala
    predicate: (DataType, ColumnView) => Boolean,
    toClose: ArrayBuffer[ColumnView]): ColumnView = {
  dataType match {
    case a:ArrayType =>
```
Collaborator

nit: here and elsewhere in the pattern match, space after :

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
razajafri (Collaborator, Author)

@revans2 @gerashegalov Thanks for reviewing; this PR is now ready. I have replaced the binary_op_df data gen with unary_op_df, which reduced the data size and made the test run better.
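
For context, a hedged sketch of what that swap looks like in the plugin's integration tests (helper names and signatures assumed from data_gen.py, not verified against this PR):

```python
from asserts import assert_gpu_and_cpu_are_equal_collect
from data_gen import unary_op_df, DecimalGen

def test_cache_negative_scale_decimal():
    # unary_op_df generates a single column from the gen where binary_op_df
    # generates two, so the switch halves the data produced per row.
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, DecimalGen(precision=7, scale=-2)).cache())
```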

razajafri (Collaborator, Author)

build

razajafri merged commit 5a379fb into NVIDIA:branch-21.08 on Jul 1, 2021
razajafri deleted the cache_decimal_to_int branch on July 1, 2021 at 04:57