[BUG] NPE during serialization for shuffle in array-aggregation-with-limit query #5469

Closed
gerashegalov opened this issue May 12, 2022 · 5 comments
Assignees: jlowe
Labels: bug (Something isn't working), cudf_dependency (An issue or PR with this label depends on a new feature in cudf)

Comments

@gerashegalov
Collaborator

gerashegalov commented May 12, 2022

Describe the bug
Similar to #5140, when the column partition has only null data we hit an NPE with a query like this:

>>> spark.createDataFrame([ [None] ], 'a array<int>').selectExpr('ARRAY_MAX(a)').limit(1).collect()

throws

java.lang.NullPointerException
        at ai.rapids.cudf.JCudfSerialization$DataOutputStreamWriter.copyDataFrom(JCudfSerialization.java:645)
        at ai.rapids.cudf.JCudfSerialization$DataWriter.copyDataFrom(JCudfSerialization.java:592)
        at ai.rapids.cudf.JCudfSerialization.copySlicedAndPad(JCudfSerialization.java:1150)
        at ai.rapids.cudf.JCudfSerialization.sliceBasicData(JCudfSerialization.java:1440)
        at ai.rapids.cudf.JCudfSerialization.writeSliced(JCudfSerialization.java:1520)
        at ai.rapids.cudf.JCudfSerialization.writeSliced(JCudfSerialization.java:1563)
        at ai.rapids.cudf.JCudfSerialization.writeToStream(JCudfSerialization.java:1613)
        at com.nvidia.spark.rapids.GpuColumnarBatchSerializerInstance$$anon$1.writeValue(GpuColumnarBatchSerializer.scala:96)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:283)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:171)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)

If there is a non-null row, the result is correct:

>>> spark.createDataFrame([ [None], [[1]] ], 'a array<int>').selectExpr('ARRAY_MAX(a)').limit(10).collect()
22/05/12 02:58:31 WARN GpuOverrides: 
*Exec <CollectLimitExec> will run on GPU
  *Partitioning <SinglePartition$> will run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> array_max(a#28) AS array_max(a)#30 will run on GPU
      *Expression <ArrayMax> array_max(a#28) will run on GPU
    ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
      @Expression <AttributeReference> a#28 could run on GPU

[Row(array_max(a)=None), Row(array_max(a)=1)]

Steps/Code to reproduce bug
Start a PySpark 3.2.1 REPL with master local[1] and execute the steps above.

{'version': '22.06.0-SNAPSHOT', 'user': 'user', 'url': 'git@github.com:NVIDIA/spark-rapids.git', 'date': '2022-05-11T22:08:32Z', 'revision': '00c0a6c9e8a2314cfd956280cef132fa295f85e7', 'cudf_version': '22.06.0-SNAPSHOT', 'branch': 'branch-22.06'}
>>> spark.sparkContext._jvm.com.nvidia.spark.rapids.RapidsPluginUtils.loadProps('cudf-java-version-info.properties')
{'version': '22.06.0-SNAPSHOT', 'user': '', 'url': 'https://github.com/rapidsai/cudf.git', 'date': '2022-05-11T02:30:46Z', 'revision': '4539e5e60d2bc2c81338a5d646f5a9c3ac5ef7cf', 'branch': 'HEAD'}

Conf:

key value
spark.rapids.driver.user.timezone Z
spark.rapids.memory.gpu.allocFraction 0.96
spark.rapids.memory.gpu.minAllocFraction 0
spark.rapids.sql.enabled true
spark.rapids.sql.exec.CollectLimitExec true
spark.rapids.sql.explain ALL
spark.rapids.sql.test.allowedNonGpu org.apache.spark.sql.execution.LeafExecNode
spark.rapids.sql.test.enabled false
spark.rapids.sql.udfCompiler.enabled true

Expected behavior
The failing query should return NULL for a NULL-valued array-typed row:
[Row(array_max(a)=None)]

Environment details (please complete the following information)

  • anywhere, local dev

Additional context
N/A

gerashegalov added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on May 12, 2022
@jlowe
Member

jlowe commented May 12, 2022

I was unable to reproduce this with a local Spark 3.2.1 on rapids4spark hash e41a6f3 and cudf hash 2b204d0.

That being said, I can see in the cudf JCudfSerialization code how this might happen if a buffer is null in the column view.
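
As an illustration of that failure mode, here is a minimal Python sketch (not cudf or plugin code; the names and layout are made up): an all-null column view can carry a validity mask but no data buffer, so copying the data bytes without a null check fails in the same way the serializer does above.

# Hypothetical, simplified column view: an all-null column where only the
# validity mask was allocated and the data buffer is absent (None).
all_null_column = {"rows": 1, "data": None, "validity": bytes([0x00])}

def copy_data_bytes(view, bytes_per_row=4):
    # A writer that assumes the data buffer always exists dereferences a
    # missing buffer here, analogous to the NPE in copyDataFrom above.
    return bytes(view["data"][: view["rows"] * bytes_per_row])

try:
    copy_data_bytes(all_null_column)
except TypeError as e:
    print("failed as expected:", e)  # the Java serializer hits a NullPointerException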

@gerashegalov
Collaborator Author

gerashegalov commented May 12, 2022

I use https://github.com/gerashegalov/rapids-shell

rapids.sh -m=local[1] -cmd=pyspark

Updated repro details

jlowe self-assigned this on May 13, 2022
@jlowe
Member

jlowe commented May 13, 2022

Dug into this, and the NPE is a result of the bad cudf behavior documented in rapidsai/cudf#10556. The RAPIDS Accelerator evaluates an expression that must produce the same number of output rows as input rows. After the bad segmented max evaluation, the single-row input column becomes a zero-row output, but this is part of an overall projection that expects the number of output rows to equal the number of input rows. The result is a ColumnarBatch with a row count of 1 whose column has a row count of 0, which triggers a non-zero-length copy from a null data buffer.
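
To make the row-count mismatch concrete, here is a small illustrative Python sketch (not plugin code; the classes are hypothetical stand-ins for the batch, column, and serializer): the batch reports 1 row, the column inside holds 0 rows and no data buffer, and the writer trusts the batch's row count.

# Hypothetical stand-ins for the real batch / column / serializer types.
class FakeColumn:
    def __init__(self, rows, data_buffer):
        self.rows = rows
        self.data_buffer = data_buffer  # None when nothing was allocated

class FakeBatch:
    def __init__(self, num_rows, column):
        self.num_rows = num_rows  # what the projection reports
        self.column = column      # what the bad segmented max actually produced

def write_sliced(batch, bytes_per_row=4):
    # The writer slices by the batch's row count, not the column's own row
    # count, so a 1-row batch wrapping a 0-row column forces a non-zero-length
    # copy from a buffer that was never allocated.
    n = batch.num_rows * bytes_per_row
    return bytes(batch.column.data_buffer[:n])

bad_batch = FakeBatch(num_rows=1, column=FakeColumn(rows=0, data_buffer=None))
try:
    write_sliced(bad_batch)
except TypeError as e:
    print("failed as expected:", e)  # the Java side fails with a NullPointerException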

sameerz added the cudf_dependency (An issue or PR with this label depends on a new feature in cudf) label and removed the ? - Needs Triage (Need team to review and classify) label on May 17, 2022
@sameerz
Collaborator

sameerz commented May 17, 2022

No plugin change needed if the cudf dependency is fixed.

@gerashegalov
Collaborator Author

Fixed by rapidsai/cudf#10876.
