[BUG] Cache of Array using ParquetCachedBatchSerializer failed with "DATA ACCESS MUST BE ON A HOST VECTOR" #2942

Closed
viadea opened this issue Jul 15, 2021 · 0 comments · Fixed by #2970
Assignee: razajafri
Labels: bug (Something isn't working), P0 (Must have for release)

viadea (Collaborator) commented Jul 15, 2021

Describe the bug
Caching a DataFrame that contains an array (ArrayType) column fails with "DATA ACCESS MUST BE ON A HOST VECTOR" when ParquetCachedBatchSerializer is used as the cache serializer.

Steps/Code to reproduce bug
Start a PySpark shell with ParquetCachedBatchSerializer configured as the cache serializer:

pyspark --conf spark.sql.cache.serializer=com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer

Then run:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

# `spark` is the session provided by the pyspark shell started above
data = [("aaa", "123 456 789"), ("bbb", "444 555 666"), ("ccc", "777 888 999")]
columns = ["a", "b"]
df = spark.createDataFrame(data).toDF(*columns)

# split() produces an ArrayType column; caching it triggers the failure
newdf = df.withColumn("newb", split(col("b"), " "))
newdf.persist()
newdf.count()
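
For reference, the same repro can be packaged as a standalone script instead of the interactive shell. This is a minimal sketch, not a verified reduction: the script and app names are made up, and it assumes spark-submit is launched with the RAPIDS jar on the classpath and the same --conf shown above.

# repro_cache_array.py (hypothetical name); submit with, e.g.:
#   spark-submit --conf spark.sql.cache.serializer=com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer repro_cache_array.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

spark = SparkSession.builder.appName("pcbs-array-cache-repro").getOrCreate()

data = [("aaa", "123 456 789"), ("bbb", "444 555 666"), ("ccc", "777 888 999")]
df = spark.createDataFrame(data).toDF("a", "b")

newdf = df.withColumn("newb", split(col("b"), " "))  # ArrayType column
newdf.persist()
newdf.count()  # expected to fail as in the stack trace below

spark.stop()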

Full stack trace (the Parquet write path calls isNullAt on a GPU-backed column vector, which only permits host-side access):

  File "/xxx/xxx/spark/myspark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o177.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 17.0 failed 4 times, most recent failure: Lost task 5.3 in stage 17.0 (TID 60) (xxx.xxx.xxx.xxx executor 0): java.lang.IllegalStateException: DATA ACCESS MUST BE ON A HOST VECTOR
      at com.nvidia.spark.rapids.GpuColumnVectorBase.isNullAt(GpuColumnVectorBase.java:53)
      at org.apache.spark.sql.vectorized.ColumnarBatchRow.isNullAt(ColumnarBatch.java:190)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.writeFields(ParquetWriteSupport.scala:156)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.$anonfun$write$1(ParquetWriteSupport.scala:148)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.consumeMessage(ParquetWriteSupport.scala:464)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:148)
      at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.write(ParquetWriteSupport.scala:54)
      at org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
      at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:182)
      at org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:44)
      at com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$InternalRowToCachedBatchIterator.next(ParquetCachedBatchSerializer.scala:1231)
      at com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer$CachedBatchIteratorProducer$InternalRowToCachedBatchIterator.next(ParquetCachedBatchSerializer.scala:1074)
      at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
      at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:222)
      at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
      at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1423)
      at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
      at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
      at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
      at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
      at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
      at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
      at org.apache.spark.scheduler.Task.run(Task.scala:131)
      at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
      at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
      at java.base/java.lang.Thread.run(Thread.java:834)

Expected behavior
The cached DataFrame should be readable: newdf.count() should return 3 without error.
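
For concreteness, a sketch of the expected output on success, derived from the three-row input above (formatting approximate):

newdf.count()  # expected: 3
newdf.show(truncate=False)
# +---+-----------+---------------+
# |a  |b          |newb           |
# +---+-----------+---------------+
# |aaa|123 456 789|[123, 456, 789]|
# |bbb|444 555 666|[444, 555, 666]|
# |ccc|777 888 999|[777, 888, 999]|
# +---+-----------+---------------+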

Note: if we unpersist the DataFrame, the same count succeeds:

newdf.unpersist()  # drop the cached copy
newdf.count()      # recomputes without the cached batches and succeeds
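
If it helps triage, the active cache serializer can be confirmed from the same session. A minimal sketch; my understanding is that spark.sql.cache.serializer is a static SQL conf, so it can be read but not changed at runtime:

# Sanity check that the non-default serializer is in effect when the failure occurs
print(spark.conf.get("spark.sql.cache.serializer"))
# expected: com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer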

Environment details

  • Environment location: [Standalone, YARN, Kubernetes, Cloud(specify cloud provider)]
  • Spark configuration settings related to the issue:
    • Spark 3.1.1
    • CUDA 11.0
    • RAPIDS plugin 21.08 snapshot

viadea added the bug and "? - Needs Triage" labels on Jul 15, 2021
viadea changed the title to [BUG] Cache of Array using ParqudBatchSerializer failed with "DATA ACCESS MUST BE ON A HOST VECTOR" on Jul 15, 2021
razajafri changed the title to [BUG] Cache of Array using ParquetBatchSerializer failed with "DATA ACCESS MUST BE ON A HOST VECTOR" on Jul 19, 2021
razajafri changed the title to [BUG] Cache of Array using ParquetCachedBatchSerializer failed with "DATA ACCESS MUST BE ON A HOST VECTOR" on Jul 19, 2021
razajafri self-assigned this on Jul 19, 2021
Salonijain27 added the "P0 - Must have for release" label and removed "? - Needs Triage" on Jul 20, 2021
Salonijain27 added this to the July 19 - July 30 milestone on Jul 20, 2021