[BUG] Cache of Array using ParquetCachedBatchSerializer failed with "DATA ACCESS MUST BE ON A HOST VECTOR" #2942

Closed
viadea opened this issue Jul 15, 2021 · 0 comments · Fixed by #2970
Assignee: razajafri
Labels: bug (Something isn't working), P0 (Must have for release)

viadea (Collaborator) commented Jul 15, 2021

Describe the bug
Caching a DataFrame that contains an array (ArrayType) column fails with "DATA ACCESS MUST BE ON A HOST VECTOR" when ParquetCachedBatchSerializer is used as the cache serializer.

Steps/Code to reproduce bug
Start a PySpark shell with ParquetCachedBatchSerializer configured as the cache serializer:

pyspark --conf spark.sql.cache.serializer=com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer

Then run:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

# `spark` is the session provided by the pyspark shell started above
data = [("aaa", "123 456 789"), ("bbb", "444 555 666"), ("ccc", "777 888 999")]
columns = ["a", "b"]
df = spark.createDataFrame(data).toDF(*columns)

# split() produces an ArrayType column; caching it triggers the failure
newdf = df.withColumn("newb", split(col("b"), " "))
newdf.persist()
newdf.count()
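
For reference, the same repro can be packaged as a standalone script instead of the interactive shell. This is a minimal sketch, not a verified reduction: the script and app names are made up, and it assumes spark-submit is launched with the RAPIDS jar on the classpath and the same --conf shown above.

# repro_cache_array.py (hypothetical name); submit with, e.g.:
#   spark-submit --conf spark.sql.cache.serializer=com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer repro_cache_array.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

spark = SparkSession.builder.appName("pcbs-array-cache-repro").getOrCreate()

data = [("aaa", "123 456 789"), ("bbb", "444 555 666"), ("ccc", "777 888 999")]
df = spark.createDataFrame(data).toDF("a", "b")

newdf = df.withColumn("newb", split(col("b"), " "))  # ArrayType column
newdf.persist()
newdf.count()  # expected to fail as in the stack trace below

spark.stop()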

Full stack trace (the Parquet write path calls isNullAt on a GPU-backed column vector, which only permits host-side access):

  File "/xxx/xxx/spark/myspark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o177.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 17.0 failed 4 times, most recent failure: Lost task 5.3 in stage 17.0 (TID 60) (xxx.xxx.xxx.xxx executor 0): java.lang.IllegalStateException: DATA ACCESS MUST BE ON A HOST VECTOR
      at com.nvidia.spark.rapids.GpuColumnVectorBase.isNullAt(GpuColumnVectorBase.java:53)
      at org.apache.spark.sql.vectorized.ColumnarBatchRow.isNullAt(ColumnarBatch.java:190)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.writeFields(ParquetWriteSupport.scala:156)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$write$1(ParquetWriteSupport.scala:148)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:464)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:148)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54)
      at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
      at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:182)
      at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:44)
      at com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$InternalRowToCachedBatchIterator.next(ParquetCachedBatchSerializer.scala:1231)
      at com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$InternalRowToCachedBatchIterator.next(ParquetCachedBatchSerializer.scala:1074)
      at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
      at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:222)
      at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
      at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
      at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
      at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
      at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
      at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
      at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
      at org.apache.spark.scheduler.Task.run(Task.scala:131)
      at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
      at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      at java.base/java.lang.Thread.run(Thread.java:834)

Expected behavior
The cached DataFrame should be readable: newdf.count() should return 3 without error.
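
For concreteness, a sketch of the expected output on success, derived from the three-row input above (formatting approximate):

newdf.count()  # expected: 3
newdf.show(truncate=False)
# +---+-----------+---------------+
# |a  |b          |newb           |
# +---+-----------+---------------+
# |aaa|123 456 789|[123, 456, 789]|
# |bbb|444 555 666|[444, 555, 666]|
# |ccc|777 888 999|[777, 888, 999]|
# +---+-----------+---------------+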

Note: if we unpersist the DataFrame, the same count succeeds:

newdf.unpersist()  # drop the cached copy
newdf.count()      # recomputes without the cached batches and succeeds
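
If it helps triage, the active cache serializer can be confirmed from the same session. A minimal sketch; my understanding is that spark.sql.cache.serializer is a static SQL conf, so it can be read but not changed at runtime:

# Sanity check that the non-default serializer is in effect when the failure occurs
print(spark.conf.get("spark.sql.cache.serializer"))
# expected: com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer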

Environment details

  • Environment location: [Standalone, YARN, Kubernetes, Cloud(specify cloud provider)]
  • Spark configuration settings related to the issue:
    • Spark 3.1.1
    • CUDA 11.0
    • RAPIDS plugin 21.08 snapshot

viadea added the bug and "? - Needs Triage" labels on Jul 15, 2021
viadea changed the title to [BUG] Cache of Array using ParqudBatchSerializer failed with "DATA ACCESS MUST BE ON A HOST VECTOR" on Jul 15, 2021
razajafri changed the title to [BUG] Cache of Array using ParquetBatchSerializer failed with "DATA ACCESS MUST BE ON A HOST VECTOR" on Jul 19, 2021
razajafri changed the title to [BUG] Cache of Array using ParquetCachedBatchSerializer failed with "DATA ACCESS MUST BE ON A HOST VECTOR" on Jul 19, 2021
razajafri self-assigned this on Jul 19, 2021
Salonijain27 added the "P0 - Must have for release" label and removed "? - Needs Triage" on Jul 20, 2021
Salonijain27 added this to the July 19 - July 30 milestone on Jul 20, 2021