Add Parquet-based cache serializer #638

Merged
razajafri merged 18 commits into NVIDIA:branch-0.2 from cache_plug on Sep 9, 2020


razajafri
Collaborator

This PR covers the cache plugin, which converts ColumnarBatch/InternalRow to ParquetCachedBatch. When the cache is read back, the batch is converted to a ColumnarBatch and left on the GPU when it is performant to do so.

I am currently writing a simple test that will let us quickly check whether the serializer is being picked up by the Spark session. That will come as follow-on work unless this PR is still pending by then.

There is still a lot to be done; this is a first step in the right direction.
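
For readers who want to try this, here is a minimal sketch of how a session might be pointed at the serializer and how the setting can be read back. It assumes Spark 3.1+, where `spark.sql.cache.serializer` is a static configuration, and a plugin jar built against 3.1 on the classpath. The fully qualified class name is an assumption and may differ in your build, so check the generated configs.md. This only confirms the configuration is in effect; it is not the test mentioned above.

```scala
import org.apache.spark.sql.SparkSession

object CacheSerializerCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: the serializer's fully qualified class name; verify it against your plugin build.
    val serializerClass =
      "com.nvidia.spark.rapids.shims.spark310.ParquetCachedBatchSerializer"

    // The cache serializer is a static conf, so it has to be set before the session is created.
    val spark = SparkSession.builder()
      .appName("cache-serializer-check")
      .master("local[*]")
      .config("spark.sql.cache.serializer", serializerClass)
      .getOrCreate()

    // Read the setting back to confirm the session was created with it.
    assert(spark.conf.get("spark.sql.cache.serializer") == serializerClass)

    // Cache a small DataFrame and materialize it so the serializer is exercised.
    val df = spark.range(0, 1000).toDF("id")
    df.cache()
    df.count()

    // Inspect the plan of a query that reads from the cache.
    df.filter("id % 2 = 0").explain()

    spark.stop()
  }
}
```

If a session already exists (for example inside spark-shell), `getOrCreate()` returns it and the static setting is not applied, so the value would need to be passed with `--conf` at launch instead.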

@revans2
Collaborator

revans2 commented Sep 2, 2020

build

@revans2
Collaborator

revans2 commented Sep 2, 2020

I forgot that we also need some updates to the documentation to be able to explain that this exists, but only for 3.1.0, that it is still in preview etc.

@sameerz sameerz added the "feature request" label Sep 2, 2020
@jlowe jlowe changed the title from "Cache plug" to "Add Parquet-based cache serializer" Sep 3, 2020
@sameerz
Collaborator

sameerz commented Sep 8, 2020

> I forgot that we also need some updates to the documentation to be able to explain that this exists, but only for 3.1.0, that it is still in preview etc.

Do we need to update any of our 0.2 documentation?

@revans2
Collaborator

revans2 commented Sep 8, 2020

I would like some docs, but since 3.1 will not be part of the official release jars, users would have to build it themselves, so I'm not sure docs would make much of a difference.

@razajafri
Collaborator Author

@revans2 I think all your concerns have been addressed. PTAL

revans2 previously approved these changes Sep 8, 2020
@revans2
Collaborator

revans2 commented Sep 8, 2020

build

@razajafri
Collaborator Author

@revans2 I fixed the assert. I know you said it was minor, but I went ahead and fixed it anyway. Please approve again.

@revans2
Collaborator

revans2 commented Sep 8, 2020

build

@razajafri razajafri merged commit e39a105 into NVIDIA:branch-0.2 Sep 9, 2020
@razajafri razajafri deleted the cache_plug branch September 9, 2020 00:34
@razajafri razajafri mentioned this pull request Oct 20, 2020
12 tasks
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* upmerged

* Pluggable cache using parquet to compress/decompress

* test change needed for running with the serializer

* Added GpuInMemoryTableScanExec

* add 3.1 dependency

* cache plugin with shims

* Tagged RowToColumnar

* cleaning up the TransitionOverrides

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* cleanup

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* review changes

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* only read necessary columns

* missing configs.md

* regenerated configs.md

* addressed review comments

* fix the assert

Co-authored-by: Raza Jafri <rjafri@nvidia.com>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
…IDIA#638)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>