Create non-shim specific version of ParquetCachedBatchSerializer #3473

Merged
merged 18 commits into from
Sep 14, 2021

Conversation

tgravescs
Collaborator

This builds on #3390 (credit to @razajafri as well) to create a common, user-facing generic class for the ParquetCachedBatchSerializer that loads the shim-specific version of it as necessary.

fixes #3314

The user just has to set:
--conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer

That should handle loading the shim. Right now there is only one version, under the 311+-all directory.

This includes the Spark 3.1.1 version of com.nvidia.spark.ParquetCachedBatchSerializer in the base of the jar so that it can be loaded. After that, the shim version that gets loaded will be the one under the Spark-specific directory (spark312/, spark311/, etc.).
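
To illustrate the general shape of a user-facing class that defers to a version-specific implementation, here is a minimal, self-contained sketch. All names in it (`BatchSerializer`, `Spark311Serializer`, `Spark312Serializer`, `ShimRegistry`, `GenericSerializer`) are hypothetical and are not the plugin's real classes. It also swaps in a plain map of factory functions for the classloader-based shim loading described above, purely to stay runnable without Spark on the classpath.

```scala
// Hypothetical sketch: these names are illustrative, not the plugin's real classes.
trait BatchSerializer {
  def describe: String
}

// Stand-ins for shim-specific implementations (one per supported Spark version).
class Spark311Serializer extends BatchSerializer {
  override def describe: String = "serializer built for Spark 3.1.1"
}

class Spark312Serializer extends BatchSerializer {
  override def describe: String = "serializer built for Spark 3.1.2"
}

object ShimRegistry {
  // Assumption: a map of factories replaces real classloading in this sketch.
  // Factories are functions, so no shim is instantiated until it is requested.
  private val factories: Map[String, () => BatchSerializer] = Map(
    "3.1.1" -> (() => new Spark311Serializer),
    "3.1.2" -> (() => new Spark312Serializer)
  )

  def load(sparkVersion: String): BatchSerializer =
    factories.get(sparkVersion) match {
      case Some(make) => make()
      case None =>
        throw new IllegalArgumentException(s"No shim registered for Spark $sparkVersion")
    }
}

// The single user-facing class: it picks a shim lazily and forwards calls to it.
class GenericSerializer(sparkVersion: String) extends BatchSerializer {
  private lazy val delegate: BatchSerializer = ShimRegistry.load(sparkVersion)
  override def describe: String = delegate.describe
}
```

The point of the design is that users only ever name the generic class in their configuration, while the choice of implementation is made at runtime from the Spark version in use.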

I did not add the function to ShimLoader (like newDriverPlugin) because the code lives in the spark311+-all directory: the CachedBatchSerializer isn't available until Spark 3.1.0.

Fixed the Databricks 8.2 build and ran tests.

Ran tests on Spark 3.1.1, 3.1.2, 3.2.0 and Databricks 8.2.

I was not able to validate the test script updates.

@tgravescs tgravescs added the feature request New feature or request label Sep 14, 2021
@tgravescs tgravescs added this to the Sep 13 - Sep 24 milestone Sep 14, 2021
@tgravescs tgravescs self-assigned this Sep 14, 2021
@tgravescs
Collaborator Author

build

* @param conf the configuration for the job.
* @return an RDD of the input cached batches transformed into the ColumnarBatch format.
*/
override def gpuConvertCachedBatchToColumnarBatch(
Collaborator Author


This is the only function that is not in the Spark CachedBatchSerializer.

@tgravescs
Collaborator Author

build

@tgravescs
Collaborator Author

build

@tgravescs tgravescs merged commit b3f9773 into NVIDIA:branch-21.10 Sep 14, 2021
@tgravescs tgravescs deleted the 3390-tgraves branch September 14, 2021 21:04
gerashegalov added a commit that referenced this pull request Nov 12, 2021
- Upgrade to Scala 2.12.15
- Add `-Xfatal-warnings` to scalac params
- Add `nowarn` annotations to existing warnings

Closes #3473
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Create non-shim specific version of ParquetCachedBatchSerializer
3 participants