
Support for Decimals with negative scale for Parquet Cached Batch Serializer #2675

Merged: 20 commits into NVIDIA:branch-21.08 on Jul 1, 2021

Conversation

razajafri (Collaborator)

PCBS stores the cache in Parquet format, which is good because it compresses well, but the drawback is that we have to handle data that Parquet doesn't. One example is Decimals with negative scale, which Spark supports as a legacy feature.

This PR deals with storing the unscaled long value associated with Decimals so we can support negative scale. This PR also lays the foundation for a possible performance improvement, which is part of #1143.

Signed-off-by: Raza Jafri rjafri@nvidia.com
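
To make that concrete, here is a minimal Python sketch (illustrative only, not the PR's Scala code) of the unscaled-value split described above:

```python
from decimal import Decimal

def unscaled_and_scale(d):
    """Mirror java.math.BigDecimal's unscaledValue()/scale() split:
    d == unscaled * 10**(-scale)."""
    sign, digits, exponent = d.as_tuple()
    unscaled = int("".join(map(str, digits))) * (-1 if sign else 1)
    return unscaled, -exponent  # Java-style scale is the negated exponent

# A decimal with negative scale: 1.23E+4 == 12300 is unscaled 123 with
# scale -2, so persisting the unscaled long plus the scale loses nothing.
print(unscaled_and_scale(Decimal("1.23E+4")))  # (123, -2)
```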

sameerz added the "task" label (Work required that improves the product but is not user facing) on Jun 10, 2021
sameerz added this to the June 7 - June 18 milestone on Jun 10, 2021
razajafri (Collaborator, Author)

@revans2 thanks for reviewing. I am still working on the case where a DataFrame is cached on the GPU but read on the CPU. I should have something by mid-day.

razajafri (Collaborator, Author)

@revans2 Can you please review this again? I have added the nested capability when writing Parquet on the GPU, as per your original feedback. I wanted to avoid casting Decimals when writing on the GPU because we write them as INTs to begin with, but that would be a problem if a user cached a DataFrame on the GPU and tried to read it on the CPU.
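
A hypothetical reproduction of that cross-engine scenario might look like this (the conf names are from Spark and the plugin docs; the session setup and the SQL are assumptions, not code from this PR):

```python
from pyspark.sql import SparkSession

# Assumes the RAPIDS Accelerator jars are on the classpath.
spark = (SparkSession.builder
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.sql.cache.serializer",
            "com.nvidia.spark.ParquetCachedBatchSerializer")
    .config("spark.sql.legacy.allowNegativeScaleOfDecimal", "true")
    .getOrCreate())

df = spark.sql("SELECT CAST(12300 AS DECIMAL(7, -2)) AS d").cache()
df.count()                                     # materialize the cache on the GPU
spark.conf.set("spark.rapids.sql.enabled", "false")
df.collect()                                   # read the same cached data on the CPU
```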

razajafri requested a review from revans2 on June 14, 2021 at 21:41
razajafri (Collaborator, Author)

build


revans2 changed the title from "Support for Decimals with negative scale" to "Support for Decimals with negative scale for Parquet Cached Batch Serializer" on Jun 18, 2021
razajafri (Collaborator, Author)

build

razajafri (Collaborator, Author)

@gerashegalov @revans2 do you have any more comments on this PR?


```python
enableVectorizedConf = [{"spark.sql.inMemoryColumnarStorage.enableVectorizedReader": "true"},
                        {"spark.sql.inMemoryColumnarStorage.enableVectorizedReader": "false"}]
conf = [{"spark.sql.inMemoryColumnarStorage.enableVectorizedReader": "true",
```
Collaborator

nit: I don't like this being called conf. It makes it too easy to skip parametrizing something and use this directly. This is very minor.

I personally preferred the old way, where we would have a param for something very specific and then build up the conf, adding in what is needed without modifying anything global. The main reason for that is that the name of the test will not need to include the full config (which this is doing). But that is an even more minor nit.
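
A minimal sketch of that pattern, with hypothetical test and conf names (the legacy decimal conf is just one relevant example):

```python
import pytest

# Shared settings stay read-only; each test copies them into its own conf.
_base_conf = {"spark.sql.legacy.allowNegativeScaleOfDecimal": "true"}

@pytest.mark.parametrize("vectorized", ["true", "false"])
def test_cache_decimal_roundtrip(vectorized):
    conf = {**_base_conf,
            "spark.sql.inMemoryColumnarStorage.enableVectorizedReader": vectorized}
    # run the cached-DataFrame round trip with `conf`; nothing global is mutated
    ...
```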

razajafri (Collaborator, Author)

I think I have addressed your concern.

Collaborator

My understanding is that you addressed enableVectorizedConf but not the granular parameters. It's minor, but to avoid some repetitiveness, instead of the whole lengthy conf parameters we could have parameters along these lines:

```python
@pytest.mark.parametrize('enable_vectorized_conf', ['VECTORIZED_ON', 'VECTORIZED_OFF'], ...)
def test_passing_gpuExpr_as_Expr(enable_vectorized_conf):
    assert_gpu_and_cpu_are_equal_collect(
        ...,
        # Spark confs take string values, so map the knob to 'true'/'false'
        conf={'spark.sql.inMemoryColumnarStorage.enableVectorizedReader':
              str(enable_vectorized_conf == 'VECTORIZED_ON').lower()})
```

razajafri and others added 11 commits June 23, 2021 17:40
PCBS stores the cache in Parquet format, which is good because it compresses well, but the drawback is that we have to handle data that Parquet doesn't. One example is Decimals with negative scale, which Spark supports as a legacy feature.

This PR deals with storing the unscaled long value associated with Decimals so we can support negative scale without doing much extra work.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Changes made to add support for nested types containing Decimals with scale < 0. Some cleanup was also done.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Co-authored-by: Gera Shegalov <gshegalov@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
* Fixed a bug where the selected attributes could be in a different order from the cached attributes, causing a crash in a distributed environment.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
razajafri (Collaborator, Author)

build

revans2 previously approved these changes Jun 25, 2021

revans2 (Collaborator) left a comment

I am still not thrilled about the ifTrueThenDeepConvertTypeAtoTypeB API, and I think we can improve it, but I think it is good enough for now.
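
For context, the general shape of such a deep-convert helper (recursively rewriting the leaves of a nested type that match a predicate) can be sketched with PySpark types like this; the names are illustrative only, and the actual implementation works on cudf ColumnViews in Scala:

```python
from pyspark.sql.types import (ArrayType, DataType, DecimalType, LongType,
                               MapType, StructField, StructType)

def deep_convert(dt: DataType, pred, convert) -> DataType:
    """Recursively replace every leaf type matching `pred` with `convert(leaf)`."""
    if isinstance(dt, StructType):
        return StructType([
            StructField(f.name, deep_convert(f.dataType, pred, convert), f.nullable)
            for f in dt.fields])
    if isinstance(dt, ArrayType):
        return ArrayType(deep_convert(dt.elementType, pred, convert), dt.containsNull)
    if isinstance(dt, MapType):
        return MapType(deep_convert(dt.keyType, pred, convert),
                       deep_convert(dt.valueType, pred, convert),
                       dt.valueContainsNull)
    return convert(dt) if pred(dt) else dt

# e.g. swap negative-scale decimals for LongType before writing:
schema = ArrayType(DecimalType(7, -2))
print(deep_convert(schema,
                   lambda t: isinstance(t, DecimalType) and t.scale < 0,
                   lambda t: LongType()))  # ArrayType(LongType(), True)
```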

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
razajafri (Collaborator, Author)

build

revans2 previously approved these changes Jun 28, 2021
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
razajafri (Collaborator, Author)

I am having trouble testing my code locally. I apologize for the repeated CI failures.

razajafri (Collaborator, Author)

build

gerashegalov previously approved these changes Jun 29, 2021

gerashegalov (Collaborator) left a comment

I don't fully understand the serializer code yet, so I defer to @revans2 and the other reviewers. The coding style nits are minor.

```scala
    predicate: (DataType, ColumnView) => Boolean,
    toClose: ArrayBuffer[ColumnView]): ColumnView = {
  dataType match {
    case a:ArrayType =>
```
Collaborator

nit: here and elsewhere in the pattern match, space after :

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
razajafri (Collaborator, Author)

@revans2 @gerashegalov Thanks for reviewing; this PR is now ready. I have replaced the binary_op_df data gen with unary_op_df, which reduced the data size and made the test run better.
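
For context, a hedged sketch of what that swap looks like in the plugin's integration tests (helper names and signatures assumed from data_gen.py, not verified against this PR):

```python
from asserts import assert_gpu_and_cpu_are_equal_collect
from data_gen import unary_op_df, DecimalGen

def test_cache_negative_scale_decimal():
    # unary_op_df generates a single column from the gen where binary_op_df
    # generates two, so the switch halves the data produced per row.
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, DecimalGen(precision=7, scale=-2)).cache())
```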

razajafri (Collaborator, Author)

build

razajafri merged commit 5a379fb into NVIDIA:branch-21.08 on Jul 1, 2021
razajafri deleted the cache_decimal_to_int branch on July 1, 2021 at 04:57