[BUG] cache of struct does not work on databricks 8.2ML #2856

Closed
viadea opened this issue Jul 1, 2021 · 2 comments · Fixed by #2880

viadea commented Jul 1, 2021

Describe the bug

Cache of struct does not work on Databricks 8.2ML:

  1. When setting spark.sql.cache.serializer to com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer, the query falls back to the CPU (see the conf sketch after the grep output below).

  2. When setting spark.sql.cache.serializer to com.nvidia.spark.rapids.shims.spark311db.ParquetCachedBatchSerializer, it fails with:
    ClassNotFoundException: com.nvidia.spark.rapids.shims.spark311db.ParquetCachedBatchSerializer
    I also checked the shim layers and could not find ParquetCachedBatchSerializer in the spark311db shim:

$ grep -r ParquetCachedBatchSerializer *
spark311/src/main/scala/org/apache/spark/sql/rapids/shims/spark311/GpuInMemoryTableScanExec.scala:import com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer
spark311/src/main/scala/org/apache/spark/sql/rapids/shims/spark311/GpuInMemoryTableScanExec.scala:    relation.cacheBuilder.serializer.asInstanceOf[ParquetCachedBatchSerializer]
spark311/src/main/scala/com/nvidia/spark/rapids/shims/spark311/Spark311Shims.scala:            if (!scan.relation.cacheBuilder.serializer.isInstanceOf[ParquetCachedBatchSerializer]) {
spark311/src/main/scala/com/nvidia/spark/rapids/shims/spark311/Spark311Shims.scala:              willNotWorkOnGpu("ParquetCachedBatchSerializer is not being used")
spark311/src/main/scala/com/nvidia/spark/rapids/shims/spark311/Spark311Shims.scala:    if (serClass == classOf[ParquetCachedBatchSerializer]) {
spark311/src/main/scala/com/nvidia/spark/rapids/shims/spark311/ParquetCachedBatchSerializer.scala:class ParquetCachedBatchSerializer extends CachedBatchSerializer with Arm {
spark311cdh/src/main/scala/com/nvidia/spark/rapids/shims/spark311cdh/Spark311CDHShims.scala:            if (!scan.relation.cacheBuilder.serializer.isInstanceOf[ParquetCachedBatchSerializer]) {
spark311cdh/src/main/scala/com/nvidia/spark/rapids/shims/spark311cdh/Spark311CDHShims.scala:              willNotWorkOnGpu("ParquetCachedBatchSerializer is not being used")
spark311cdh/src/main/scala/com/nvidia/spark/rapids/shims/spark311cdh/Spark311CDHShims.scala:    if (serClass == classOf[ParquetCachedBatchSerializer]) {
spark311cdh/src/main/scala/com/nvidia/spark/rapids/shims/spark311cdh/ParquetCachedBatchSerializer.scala:class ParquetCachedBatchSerializer extends CachedBatchSerializer with Arm {
spark312/src/main/scala/com/nvidia/spark/rapids/shims/spark312/ParquetCachedBatchSerializer.scala:class ParquetCachedBatchSerializer extends shims.spark311.ParquetCachedBatchSerializer {
spark313/src/main/scala/com/nvidia/spark/rapids/shims/spark313/ParquetCachedBatchSerializer.scala:class ParquetCachedBatchSerializer extends shims.spark312.ParquetCachedBatchSerializer {
spark320/src/main/scala/com/nvidia/spark/rapids/shims/spark320/ParquetCachedBatchSerializer.scala:class ParquetCachedBatchSerializer extends shims.spark311.ParquetCachedBatchSerializer {
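
For context, spark.sql.cache.serializer is a static conf, so it has to be in place before the SparkSession starts (on Databricks, via the cluster's Spark config). A minimal sketch of the equivalent in code, using the spark311 shim class from case 1:

import org.apache.spark.sql.SparkSession

// Static conf: must be set before the session is created; setting it on an
// already-running session has no effect.
val spark = SparkSession.builder()
  .config("spark.sql.cache.serializer",
    "com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer")
  .getOrCreate()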

Steps/Code to reproduce bug

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val data = Seq(
    Row(Row("Adam ","","Green"),"1","M",1000),
    Row(Row("Bob ","Middle","Green"),"2","M",2000),
    Row(Row("Cathy ","","Green"),"3","F",3000)
)

val schema = (new StructType()
  .add("name",new StructType()
    .add("firstname",StringType)
    .add("middlename",StringType)
    .add("lastname",StringType)) 
  .add("id",StringType)
  .add("gender",StringType)
  .add("salary",IntegerType))

val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
df.write.format("parquet").mode("overwrite").save("/tmp/testparquet")
val df2 = spark.read.parquet("/tmp/testparquet")
df2.createOrReplaceTempView("df2")
// Cache a query that nests the struct inside another struct
val df3 = spark.sql("select struct(name, struct(name.firstname, name.lastname) as newname) as col from df2").cache
df3.createOrReplaceTempView("df3")

spark.sql("select count(distinct col.name.firstname) from df3").show
spark.sql("select count(distinct col.name.firstname) from df3").explain

Below is the plan that is shown; note the CPU InMemoryTableScan under a GpuRowToColumnar transition instead of a GpuInMemoryTableScan:

== Physical Plan ==
GpuColumnarToRowTransition false
+- GpuHashAggregate(keys=[], functions=[gpucount(distinct _gen_alias_173#173)]), filters=List(None))
   +- GpuShuffleCoalesce 2147483647
      +- GpuColumnarExchange gpusinglepartitioning$(), ENSURE_REQUIREMENTS, [id=#579]
         +- GpuHashAggregate(keys=[], functions=[partial_gpucount(distinct _gen_alias_173#173)]), filters=List(None))
            +- GpuHashAggregate(keys=[_gen_alias_173#173], functions=[]), filters=List())
               +- GpuShuffleCoalesce 2147483647
                  +- GpuColumnarExchange gpuhashpartitioning(_gen_alias_173#173, 200), ENSURE_REQUIREMENTS, [id=#575]
                     +- GpuHashAggregate(keys=[_gen_alias_173#173], functions=[]), filters=List())
                        +- GpuProject [col#98.name.firstname AS _gen_alias_173#173]
                           +- GpuRowToColumnar TargetSize(2147483647)
                              +- InMemoryTableScan [col#98]
                                    +- InMemoryRelation [col#98], StorageLevel(disk, memory, deserialized, 1 replicas)
                                          +- GpuProject [named_struct(name, name#57, newname, named_struct(firstname, name#57.firstname, lastname, name#57.lastname)) AS col#98]
                                             +- GpuFileGpuScan parquet [name#57] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/testparquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:struct<firstname:string,middlename:string,lastname:string>>

Expected behavior

The correct plan should be:

== Physical Plan ==
GpuColumnarToRowTransition false
+- GpuHashAggregate(keys=[], functions=[gpucount(distinct _gen_alias_80#80)]), filters=List(None))
   +- GpuShuffleCoalesce 2147483647
      +- GpuColumnarExchange gpusinglepartitioning$(), ENSURE_REQUIREMENTS, [id=#121]
         +- GpuHashAggregate(keys=[], functions=[partial_gpucount(distinct _gen_alias_80#80)]), filters=List(None))
            +- GpuHashAggregate(keys=[_gen_alias_80#80], functions=[]), filters=List())
               +- GpuShuffleCoalesce 2147483647
                  +- GpuColumnarExchange gpuhashpartitioning(_gen_alias_80#80, 200), ENSURE_REQUIREMENTS, [id=#110]
                     +- GpuHashAggregate(keys=[_gen_alias_80#80], functions=[]), filters=List())
                        +- GpuProject [col#25.name.firstname AS _gen_alias_80#80]
                           +- GpuInMemoryTableScan [col#25]
                                 +- InMemoryRelation [col#25], StorageLevel(disk, memory, deserialized, 1 replicas)
                                       +- GpuProject [named_struct(name, name#16, newname, named_struct(firstname, name#16.firstname, lastname, name#16.lastname)) AS col#25]
                                          +- GpuFileGpuScan parquet [name#16] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testparquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:struct<firstname:string,middlename:string,lastname:string>>
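
A quick way to check which path a build takes is to assert on the executed plan string (a sketch; df3 is the cached DataFrame from the repro above):

// On the GPU path the scan node is GpuInMemoryTableScan; the buggy fallback
// shows InMemoryTableScan under a GpuRowToColumnar transition instead.
val plan = spark.sql("select count(distinct col.name.firstname) from df3")
  .queryExecution.executedPlan.toString
assert(plan.contains("GpuInMemoryTableScan"), s"cache fell back to CPU:\n$plan")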

Environment details

  • Environment location: Cloud (Databricks)
  • Spark configuration settings related to the issue: spark.sql.cache.serializer (see above)

Databricks 8.2ML GPU with Spark 3.1.1


viadea added the bug and ? - Needs Triage labels on Jul 1, 2021

viadea commented Jul 1, 2021

Here are the classes in the 21.06 rapids jar:

$ jar tf rapids-4-spark_2.12-21.06.0.jar | grep ParquetCachedBatchSerializer.class
com/nvidia/spark/rapids/shims/spark311/ParquetCachedBatchSerializer.class
com/nvidia/spark/rapids/shims/spark312/ParquetCachedBatchSerializer.class
com/nvidia/spark/rapids/shims/spark311cdh/ParquetCachedBatchSerializer.class
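
The ClassNotFoundException in case 2 can be reproduced without running a query by probing the classpath directly (a sketch):

// Probe which shim serializer classes the driver can actually load.
// Per the jar listing above, the spark311db variant is missing.
Seq(
  "com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer",
  "com.nvidia.spark.rapids.shims.spark311db.ParquetCachedBatchSerializer"
).foreach { name =>
  val found = scala.util.Try(Class.forName(name)).isSuccess
  println(s"$name -> ${if (found) "found" else "NOT FOUND"}")
}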

razajafri self-assigned this on Jul 2, 2021
Salonijain27 removed the ? - Needs Triage label on Jul 6, 2021
tgravescs commented

Please make sure we have a test added for this on Databricks as well.
