
[BUG] inner join fails with Column size cannot be negative #1119

Closed
tgravescs opened this issue Nov 13, 2020 · 3 comments
Labels: bug (Something isn't working), P0 (Must have for release)

Comments

@tgravescs (Collaborator)

Describe the bug

This same join used to work on older 0.3 versions, and it works fine on the CPU side, so it seems something recent broke it. This is running on Databricks. I have not yet tried to come up with a small reproducer.

20/11/13 18:53:17 ERROR Executor: Exception in task 0.2 in stage 21.0 (TID 574)
ai.rapids.cudf.CudfException: cuDF failure at: /ansible-managed/jenkins-slave/slave2/workspace/spark/cudf16_nightly/cpp/src/column/column_view.cpp:41: Column size cannot be negative.
	at ai.rapids.cudf.Table.innerJoin(Native Method)
	at ai.rapids.cudf.Table.access$3500(Table.java:36)
	at ai.rapids.cudf.Table$TableOperation.innerJoin(Table.java:2105)
	at com.nvidia.spark.rapids.shims.spark300db.GpuHashJoin.doJoinLeftRight(GpuHashJoin.scala:310)
	at com.nvidia.spark.rapids.shims.spark300db.GpuHashJoin.com$nvidia$spark$rapids$shims$spark300db$GpuHashJoin$$doJoin(GpuHashJoin.scala:277)
	at com.nvidia.spark.rapids.shims.spark300db.GpuHashJoin$$anon$1.hasNext(GpuHashJoin.scala:226)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at com.nvidia.spark.rapids.GpuHashAggregateExec.$anonfun$doExecuteColumnar$1(aggregate.scala:420)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:844)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:844)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:320)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:320)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
	at org.apache.spark.scheduler.Task.run(Task.scala:117)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:639)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1559)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:642)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
@tgravescs added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and P0 (Must have for release) labels on Nov 13, 2020
@sameerz removed the ? - Needs Triage (Need team to review and classify) label on Nov 17, 2020
@tgravescs (Collaborator, Author)

I went back to the 0.2 release, where this query worked the last time I ran it, and it now fails as well, but with an out-of-memory error. It's possible the data changed. The error from 0.2 is:

java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: RMM failure at: /usr/local/rapids/include/rmm/mr/device/pool_memory_resource.hpp:167: Maximum pool size exceeded
	at ai.rapids.cudf.Table.innerJoin(Native Method)
	at ai.rapids.cudf.Table.access$3200(Table.java:35)
	at ai.rapids.cudf.Table$TableOperation.innerJoin(Table.java:1986)
	at com.nvidia.spark.rapids.shims.spark300db.GpuHashJoin.doJoinLeftRight(GpuHashJoin.scala:290)

@jlowe (Member) commented Nov 19, 2020

The underlying error in cudf originates from a lack of overflow checking in cudf gather. Filed rapidsai/cudf#6801.
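
For illustration only, here is a minimal Java sketch (not the actual cudf implementation, which is C++ gather code) of how an unchecked 32-bit output row-count computation can wrap to a negative value that is then rejected downstream as "Column size cannot be negative"; the input sizes are hypothetical:

// Illustrative sketch only: shows how unchecked 32-bit arithmetic on a
// join/gather output row count can wrap negative. Not cudf code; the
// input sizes below are made up for demonstration.
public class OverflowSketch {
    public static void main(String[] args) {
        int leftRows = 1_500_000_000;   // hypothetical join input row count
        int matchesPerRow = 2;          // hypothetical average matches per row

        // Unchecked arithmetic: the true result (3,000,000,000) exceeds
        // Integer.MAX_VALUE (2,147,483,647) and wraps to a negative int.
        int outputRows = leftRows * matchesPerRow;
        System.out.println("unchecked output rows = " + outputRows); // prints a negative value

        // With overflow checking, the failure is reported at the source instead
        // of surfacing later as an invalid (negative) column size.
        try {
            int checked = Math.multiplyExact(leftRows, matchesPerRow);
            System.out.println("checked output rows = " + checked);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected: " + e.getMessage());
        }
    }
}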

@jlowe removed their assignment on Dec 4, 2020
@sameerz added this to the Jan 4 - Jan 15 milestone on Dec 17, 2020
@tgravescs (Collaborator, Author)

This is essentially just an out-of-memory condition. I reran and confirmed the proper error now comes out:

total size of output strings is too large for a cudf column
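
As a hedged illustration of that limit, the sketch below assumes (as the error message suggests) that the total character data of a single strings column must fit in a signed 32-bit size; the class, method, and byte counts are hypothetical and this is not spark-rapids or cudf code:

// Sketch under the assumption that a single strings column's total
// character data must be addressable by a signed 32-bit size (~2 GiB).
// Hypothetical helper, for illustration only.
public final class StringColumnSizeCheck {
    private static final long MAX_STRING_BYTES = Integer.MAX_VALUE;

    // Returns true if the concatenated output string data would still fit
    // in one column; false corresponds to the "total size of output strings
    // is too large for a cudf column" situation reported above.
    public static boolean fitsInSingleColumn(long[] perChunkStringBytes) {
        long total = 0;
        for (long bytes : perChunkStringBytes) {
            total += bytes;
            if (total > MAX_STRING_BYTES) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A join that duplicates large string rows can exceed the limit quickly:
        // three chunks of ~0.9 GB of string data already total ~2.7 GB.
        long[] chunks = new long[3];
        java.util.Arrays.fill(chunks, 900_000_000L);
        System.out.println(fitsInSingleColumn(chunks)); // false
    }
}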

tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023