Add Parquet-based cache serializer #638
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build
shims/spark300/src/main/scala/com/nvidia/spark/rapids/shims/spark300/Spark300Shims.scala
...310/src/main/scala/com/nvidia/spark/rapids/shims/spark310/ParquetCachedBatchSerializer.scala
...310/src/main/scala/com/nvidia/spark/rapids/shims/spark310/ParquetCachedBatchSerializer.scala
...310/src/main/scala/com/nvidia/spark/rapids/shims/spark310/ParquetCachedBatchSerializer.scala
...310/src/main/scala/com/nvidia/spark/rapids/shims/spark310/ParquetCachedBatchSerializer.scala
...310/src/main/scala/com/nvidia/spark/rapids/shims/spark310/ParquetCachedBatchSerializer.scala
shims/spark310/src/main/scala/com/nvidia/spark/rapids/shims/spark310/Spark310Shims.scala
...park310/src/main/scala/org/apache/spark/sql/rapids/shims/spark310/GpuColumnarToRowExec.scala
...310/src/main/scala/org/apache/spark/sql/rapids/shims/spark310/GpuInMemoryTableScanExec.scala
...310/src/main/scala/org/apache/spark/sql/rapids/shims/spark310/GpuInMemoryTableScanExec.scala
I forgot that we also need documentation updates explaining that this exists, but only for 3.1.0, and that it is still in preview.
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Do we need to update any of our 0.2 documentation?
I would like some docs, but because 3.1 will not be part of the official release jars, I'm not sure they will make much of a difference: users would have to build it themselves.
shims/spark301/src/main/scala/com/nvidia/spark/rapids/shims/spark301/Spark301Shims.scala
...310/src/main/scala/com/nvidia/spark/rapids/shims/spark310/ParquetCachedBatchSerializer.scala
...310/src/main/scala/com/nvidia/spark/rapids/shims/spark310/ParquetCachedBatchSerializer.scala
shims/spark310/src/main/scala/com/nvidia/spark/rapids/shims/spark310/Spark310Shims.scala
@revans2 I think all your concerns have been addressed. PTAL
build
@revans2 I fixed the assert. I know you said it was minor, but I went ahead and fixed it anyway. Please approve again.
build
* upmerged
* Pluggable cache using parquet to compress/decompress
* test change needed for running with the serializer
* Added GpuInMemoryTableScanExec
* add 3.1 dependency
* cache plugin with shims
* Tagged RowToColumnar
* cleaning up the TransitionOverrides

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* cleanup

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* review changes

Signed-off-by: Raza Jafri <rjafri@nvidia.com>

* only read necessary columns
* missing configs.md
* regenerated configs.md
* addressed review comments
* fix the assert

Co-authored-by: Raza Jafri <rjafri@nvidia.com>
…IDIA#638)
Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
This PR covers the cache plugin, which converts `ColumnarBatch`/`InternalRow` to `ParquetCachedBatch`. When the cache is read back, the batch is converted to `ColumnarBatch` and left on the GPU when it's performant to do so.

I am currently writing a simple test that will let us quickly tell whether the serializer is being picked up by the Spark session. That will come as follow-on work, unless this PR is still pending.
There is still a lot to be done; this is a first step in the right direction.
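
For reviewers who want to try this locally, the check described above could look roughly like the sketch below. It is only an illustration, not the test that will be added. It assumes a Spark 3.1 build with the self-built plugin jar on the classpath, and it assumes the serializer's fully qualified class name matches the package visible in the file paths of this PR. `spark.sql.cache.serializer` is the static Spark SQL conf introduced in Spark 3.1 for pluggable cache serializers; the object name `CacheSerializerCheck` is made up for the example. Whether the RAPIDS plugin itself must also be enabled for the serializer to work is not covered here.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, not part of this PR.
object CacheSerializerCheck {
  // Assumed class name, inferred from the file paths reviewed in this PR.
  val SerializerClass =
    "com.nvidia.spark.rapids.shims.spark310.ParquetCachedBatchSerializer"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cache-serializer-check")
      // Static conf: it has to be set before the session is created.
      .config("spark.sql.cache.serializer", SerializerClass)
      .getOrCreate()

    // Reading the conf back shows what the session was started with.
    val configured = spark.conf.get("spark.sql.cache.serializer")
    assert(configured == SerializerClass, s"unexpected serializer: $configured")

    // Caching and materializing a small DataFrame exercises the serializer.
    val df = spark.range(0, 100).toDF("id").cache()
    assert(df.count() == 100)

    spark.stop()
  }
}
```

Because the conf is static, it cannot be changed on a running session, so the sketch sets it on the builder rather than through `spark.conf.set`.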