
[BUG] inner join fails with Column size cannot be negative #1119

Closed
tgravescs opened this issue Nov 13, 2020 · 3 comments
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

@tgravescs (Collaborator)

Describe the bug

This same join used to work on older 0.3 versions, and it works fine on the CPU side, so it seems something recent broke it. This is running on Databricks. I have not yet tried to come up with a small reproducer.

20/11/13 18:53:17 ERROR Executor: Exception in task 0.2 in stage 21.0 (TID 574)
ai.rapids.cudf.CudfException: cuDF failure at: /ansible-managed/jenkins-slave/slave2/workspace/spark/cudf16_nightly/cpp/src/column/column_view.cpp:41: Column size cannot be negative.
	at ai.rapids.cudf.Table.innerJoin(Native Method)
	at ai.rapids.cudf.Table.access$3500(Table.java:36)
	at ai.rapids.cudf.Table$TableOperation.innerJoin(Table.java:2105)
	at com.nvidia.spark.rapids.shims.spark300db.GpuHashJoin.doJoinLeftRight(GpuHashJoin.scala:310)
	at com.nvidia.spark.rapids.shims.spark300db.GpuHashJoin.com$nvidia$spark$rapids$shims$spark300db$GpuHashJoin$$doJoin(GpuHashJoin.scala:277)
	at com.nvidia.spark.rapids.shims.spark300db.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:226)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at com.nvidia.spark.rapids.GpuHashAggregateExec.$anonfun$doExecuteColumnar$1(aggregate.scala:420)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:844)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:844)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:320)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:320)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
	at org.apache.spark.scheduler.Task.run(Task.scala:117)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
@tgravescs added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and P0 (Must have for release) labels on Nov 13, 2020
@sameerz removed the ? - Needs Triage (Need team to review and classify) label on Nov 17, 2020
@tgravescs (Collaborator, Author)

I went back to the 0.2 release, where this query worked the last time I ran it, and it now fails as well, but with an out-of-memory error. It's possible the data changed. The error from 0.2 is:

java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at: /usr/local/rapids/include/rmm/mr/device/pool_memory_resource.hpp:167: Maximum pool size exceeded
	at ai.rapids.cudf.Table.innerJoin(Native Method)
	at ai.rapids.cudf.Table.access$3200(Table.java:35)
	at ai.rapids.cudf.Table$TableOperation.innerJoin(Table.java:1986)
	at com.nvidia.spark.rapids.shims.spark300db.GpuHashJoin.doJoinLeftRight(GpuHashJoin.scala:290)

@jlowe (Member) commented Nov 19, 2020

The underlying error in cudf originates from a lack of overflow checking in cudf gather. Filed rapidsai/cudf#6801.
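
For illustration only, here is a minimal Java sketch (not the actual cudf implementation, which is C++ gather code) of how an unchecked 32-bit output row-count computation can wrap to a negative value that is then rejected downstream as "Column size cannot be negative"; the input sizes are hypothetical:

// Illustrative sketch only: shows how unchecked 32-bit arithmetic on a
// join/gather output row count can wrap negative. Not cudf code; the
// input sizes below are made up for demonstration.
public class OverflowSketch {
    public static void main(String[] args) {
        int leftRows = 1_500_000_000;   // hypothetical join input row count
        int matchesPerRow = 2;          // hypothetical average matches per row

        // Unchecked arithmetic: the true result (3,000,000,000) exceeds
        // Integer.MAX_VALUE (2,147,483,647) and wraps to a negative int.
        int outputRows = leftRows * matchesPerRow;
        System.out.println("unchecked output rows = " + outputRows); // prints a negative value

        // With overflow checking, the failure is reported at the source instead
        // of surfacing later as an invalid (negative) column size.
        try {
            int checked = Math.multiplyExact(leftRows, matchesPerRow);
            System.out.println("checked output rows = " + checked);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}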

@jlowe removed their assignment on Dec 4, 2020
@sameerz added this to the Jan 4 - Jan 15 milestone on Dec 17, 2020
@tgravescs (Collaborator, Author)

This is essentially just an out-of-memory condition. I reran and confirmed the proper error now comes out:

total size of output strings is too large for a cudf column
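
As a hedged illustration of that limit, the sketch below assumes (as the error message suggests) that the total character data of a single strings column must fit in a signed 32-bit size; the class, method, and byte counts are hypothetical and this is not spark-rapids or cudf code:

// Sketch under the assumption that a single strings column's total
// character data must be addressable by a signed 32-bit size (~2 GiB).
// Hypothetical helper, for illustration only.
public final class StringColumnSizeCheck {
    private static final long MAX_STRING_BYTES = Integer.MAX_VALUE;

    // Returns true if the concatenated output string data would still fit
    // in one column; false corresponds to the "total size of output strings
    // is too large for a cudf column" situation reported above.
    public static boolean fitsInSingleColumn(long[] perChunkStringBytes) {
        long total = 0;
        for (long bytes : perChunkStringBytes) {
            total += bytes;
            if (total > MAX_STRING_BYTES) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A join that duplicates large string rows can exceed the limit quickly:
        // three chunks of ~0.9 GB of string data already total ~2.7 GB.
        long[] chunks = new long[3];
        java.util.Arrays.fill(chunks, 900_000_000L);
        System.out.println(fitsInSingleColumn(chunks)); // false
    }
}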

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023