[BUG] NPE during serialization for shuffle in array-aggregation-with-limit query #5469

Closed
gerashegalov opened this issue May 12, 2022 · 5 comments
Assignees: jlowe
Labels: bug (Something isn't working), cudf_dependency (An issue or PR with this label depends on a new feature in cudf)

Comments

@gerashegalov
Collaborator

gerashegalov commented May 12, 2022

Describe the bug
Similar to #5140, when the column partition has only null data we hit an NPE with a query like this:

>>> spark.createDataFrame([ [None] ], 'a array<int>').selectExpr('ARRAY_MAX(a)').limit(1).collect()

throws

java.lang.NullPointerException
        at ai.rapids.cudf.JCudfSerialization$DataOutputStreamWriter.copyDataFrom(JCudfSerialization.java:645)
        at ai.rapids.cudf.JCudfSerialization$DataWriter.copyDataFrom(JCudfSerialization.java:592)
        at ai.rapids.cudf.JCudfSerialization.copySlicedAndPad(JCudfSerialization.java:1150)
        at ai.rapids.cudf.JCudfSerialization.sliceBasicData(JCudfSerialization.java:1440)
        at ai.rapids.cudf.JCudfSerialization.writeSliced(JCudfSerialization.java:1520)
        at ai.rapids.cudf.JCudfSerialization.writeSliced(JCudfSerialization.java:1563)
        at ai.rapids.cudf.JCudfSerialization.writeToStream(JCudfSerialization.java:1613)
        at com.nvidia.spark.rapids.GpuColumnarBatchSerializerInstance$$anon$1.writeValue(GpuColumnarBatchSerializer.scala:96)
        at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:283)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:171)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)

If there is a non-null row, the result is correct:

>>> spark.createDataFrame([ [None], [[1]] ], 'a array<int>').selectExpr('ARRAY_MAX(a)').limit(10).collect()
22/05/12 02:58:31 WARN GpuOverrides: 
*Exec <CollectLimitExec> will run on GPU
  *Partitioning <SinglePartition$> will run on GPU
  *Exec <ProjectExec> will run on GPU
    *Expression <Alias> array_max(a#28) AS array_max(a)#30 will run on GPU
      *Expression <ArrayMax> array_max(a#28) will run on GPU
    ! <RDDScanExec> cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.execution.RDDScanExec
      @Expression <AttributeReference> a#28 could run on GPU

[Row(array_max(a)=None), Row(array_max(a)=1)]

Steps/Code to reproduce bug
Start a PySpark 3.2.1 REPL with master local[1] and execute the steps above.

{'version': '22.06.0-SNAPSHOT', 'user': 'user', 'url': 'git@github.com:NVIDIA/spark-rapids.git', 'date': '2022-05-11T22:08:32Z', 'revision': '00c0a6c9e8a2314cfd956280cef132fa295f85e7', 'cudf_version': '22.06.0-SNAPSHOT', 'branch': 'branch-22.06'}
>>> spark.sparkContext._jvm.com.nvidia.spark.rapids.RapidsPluginUtils.loadProps('cudf-java-version-info.properties')
{'version': '22.06.0-SNAPSHOT', 'user': '', 'url': 'https://github.com/rapidsai/cudf.git', 'date': '2022-05-11T02:30:46Z', 'revision': '4539e5e60d2bc2c81338a5d646f5a9c3ac5ef7cf', 'branch': 'HEAD'}

Conf:

key value
spark.rapids.driver.user.timezone Z
spark.rapids.memory.gpu.allocFraction 0.96
spark.rapids.memory.gpu.minAllocFraction 0
spark.rapids.sql.enabled true
spark.rapids.sql.exec.CollectLimitExec true
spark.rapids.sql.explain ALL
spark.rapids.sql.test.allowedNonGpu org.apache.spark.sql.execution.LeafExecNode
spark.rapids.sql.test.enabled false
spark.rapids.sql.udfCompiler.enabled true

Expected behavior
The failing query should return NULL for a NULL-valued array-typed row:
[Row(array_max(a)=None)]

Environment details (please complete the following information)

  • anywhere, local dev

Additional context
N/A

gerashegalov added the bug (Something isn't working) and ? - Needs Triage (Need team to review and classify) labels on May 12, 2022
@jlowe
Member

jlowe commented May 12, 2022

I was unable to reproduce this with a local Spark 3.2.1 on rapids4spark hash e41a6f3 and cudf hash 2b204d0.

That being said, I can see in the cudf JCudfSerialization code how this might happen if a buffer is null in the column view.
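
As an illustration of that failure mode, here is a minimal Python sketch (not cudf or plugin code; the names and layout are made up): an all-null column view can carry a validity mask but no data buffer, so copying the data bytes without a null check fails in the same way the serializer does above.

# Hypothetical, simplified column view: an all-null column where only the
# validity mask was allocated and the data buffer is absent (None).
all_null_column = {"rows": 1, "data": None, "validity": bytes([0x00])}

def copy_data_bytes(view, bytes_per_row=4):
    # A writer that assumes the data buffer always exists dereferences a
    # missing buffer here, analogous to the NPE in copyDataFrom above.
    return bytes(view["data"][: view["rows"] * bytes_per_row])

try:
    copy_data_bytes(all_null_column)
except TypeError as e:
    print("failed as expected:", e)  # the Java serializer hits a NullPointerException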

@gerashegalov
Collaborator Author

gerashegalov commented May 12, 2022

I use https://github.com/gerashegalov/rapids-shell

rapids.sh -m=local[1] -cmd=pyspark

Updated repro details

jlowe self-assigned this on May 13, 2022
@jlowe
Member

jlowe commented May 13, 2022

Dug into this, and the NPE is a result of the bad cudf behavior documented in rapidsai/cudf#10556. The RAPIDS Accelerator evaluates an expression that must produce the same number of output rows as input rows. After the bad segmented max evaluation, the single-row input column becomes a zero-row output, but this is part of an overall projection that expects the number of output rows to equal the number of input rows. The result is a ColumnarBatch with a row count of 1 whose column has a row count of 0, which triggers a non-zero-length copy from a null data buffer.
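
To make the row-count mismatch concrete, here is a small illustrative Python sketch (not plugin code; the classes are hypothetical stand-ins for the batch, column, and serializer): the batch reports 1 row, the column inside holds 0 rows and no data buffer, and the writer trusts the batch's row count.

# Hypothetical stand-ins for the real batch / column / serializer types.
class FakeColumn:
    def __init__(self, rows, data_buffer):
        self.rows = rows
        self.data_buffer = data_buffer  # None when nothing was allocated

class FakeBatch:
    def __init__(self, num_rows, column):
        self.num_rows = num_rows  # what the projection reports
        self.column = column      # what the bad segmented max actually produced

def write_sliced(batch, bytes_per_row=4):
    # The writer slices by the batch's row count, not the column's own row
    # count, so a 1-row batch wrapping a 0-row column forces a non-zero-length
    # copy from a buffer that was never allocated.
    n = batch.num_rows * bytes_per_row
    return bytes(batch.column.data_buffer[:n])

bad_batch = FakeBatch(num_rows=1, column=FakeColumn(rows=0, data_buffer=None))
try:
    write_sliced(bad_batch)
except TypeError as e:
    print("failed as expected:", e)  # the Java side fails with a NullPointerException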

sameerz added the cudf_dependency (An issue or PR with this label depends on a new feature in cudf) label and removed the ? - Needs Triage (Need team to review and classify) label on May 17, 2022
@sameerz
Collaborator

sameerz commented May 17, 2022

No plugin change needed if the cudf dependency is fixed.

@gerashegalov
Collaborator Author

Fixed by rapidsai/cudf#10876.
