
[BUG] test_hash_reduction_decimal_overflow_sum[30] failed OOM in integration tests #4315

Closed
pxLi opened this issue Dec 7, 2021 · 5 comments · Fixed by #4317 or #4326
Labels: bug (Something isn't working), test (Only impacts tests)

pxLi (Collaborator) commented Dec 7, 2021

Describe the bug
Related to #4272.

FAILED ../../src/main/python/hash_aggregate_test.py::test_hash_reduction_decimal_overflow_sum[30]

java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-cudf_nightly-dev-github-556-cuda11/cpp/build/_deps/rmm-src/include/rmm/mr/device/arena_memory_resource.hpp:157: Maximum pool size exceeded
[2021-12-07T06:04:57.784Z] 	at ai.rapids.cudf.ColumnView.binaryOpVS(Native Method)
[2021-12-07T06:04:57.784Z] 	at ai.rapids.cudf.ColumnView.binaryOp(ColumnView.java:1205)
[2021-12-07T06:04:57.784Z] 	at ai.rapids.cudf.ColumnView.binaryOp(ColumnView.java:1195)
[2021-12-07T06:04:57.784Z] 	at ai.rapids.cudf.BinaryOperable.div(BinaryOperable.java:171)
[2021-12-07T06:04:57.784Z] 	at org.apache.spark.sql.rapids.GpuDecimalSumHighDigits.$anonfun$columnarEval$10(AggregateFunctions.scala:720)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2021-12-07T06:04:57.784Z] 	at org.apache.spark.sql.rapids.GpuDecimalSumHighDigits.withResource(AggregateFunctions.scala:690)
[2021-12-07T06:04:57.784Z] 	at org.apache.spark.sql.rapids.GpuDecimalSumHighDigits.$anonfun$columnarEval$9(AggregateFunctions.scala:719)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2021-12-07T06:04:57.784Z] 	at org.apache.spark.sql.rapids.GpuDecimalSumHighDigits.withResource(AggregateFunctions.scala:690)
[2021-12-07T06:04:57.784Z] 	at org.apache.spark.sql.rapids.GpuDecimalSumHighDigits.$anonfun$columnarEval$8(AggregateFunctions.scala:718)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2021-12-07T06:04:57.784Z] 	at org.apache.spark.sql.rapids.GpuDecimalSumHighDigits.withResource(AggregateFunctions.scala:690)
[2021-12-07T06:04:57.784Z] 	at org.apache.spark.sql.rapids.GpuDecimalSumHighDigits.columnarEval(AggregateFunctions.scala:713)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:34)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.GpuExpressionsUtils$.columnarEvalToColumn(GpuExpressions.scala:94)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.GpuProjectExec$.projectSingle(basicPhysicalOperators.scala:102)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.GpuProjectExec$.$anonfun$project$1(basicPhysicalOperators.scala:109)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1(implicits.scala:162)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.$anonfun$safeMap$1$adapted(implicits.scala:159)
[2021-12-07T06:04:57.784Z] 	at scala.collection.immutable.List.foreach(List.scala:392)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.RapidsPluginImplicits$MapsSafely.safeMap(implicits.scala:159)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.RapidsPluginImplicits$AutoCloseableProducingSeq.safeMap(implicits.scala:194)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.GpuProjectExec$.project(basicPhysicalOperators.scala:109)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.GpuHashAggregateIterator$AggHelper.preProcess(aggregate.scala:698)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$computeAggregate$1(aggregate.scala:776)
[2021-12-07T06:04:57.784Z] 	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2021-12-07T06:04:57.785Z] 	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2021-12-07T06:04:57.785Z] 	at com.nvidia.spark.rapids.GpuHashAggregateIterator.withResource(aggregate.scala:180)
[2021-12-07T06:04:57.785Z] 	at com.nvidia.spark.rapids.GpuHashAggregateIterator.com$nvidia$spark$rapids$GpuHashAggregateIterator$$computeAggregate(aggregate.scala:773)
[2021-12-07T06:04:57.785Z] 	at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$aggregateInputBatches$1(aggregate.scala:284)
[2021-12-07T06:04:57.785Z] 	at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$aggregateInputBatches$1$adapted(aggregate.scala:282)
[2021-12-07T06:04:57.785Z] 	at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
[2021-12-07T06:04:57.785Z] 	at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
[2021-12-07T06:04:57.785Z] 	at com.nvidia.spark.rapids.GpuHashAggregateIterator.withResource(aggregate.scala:180)
[2021-12-07T06:04:57.785Z] 	at com.nvidia.spark.rapids.GpuHashAggregateIterator.aggregateInputBatches(aggregate.scala:282)
[2021-12-07T06:04:57.785Z] 	at com.nvidia.spark.rapids.GpuHashAggregateIterator.$anonfun$next$2(aggregate.scala:237)
[2021-12-07T06:04:57.785Z] 	at scala.Option.getOrElse(Option.scala:189)
[2021-12-07T06:04:57.785Z] 	at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:234)
[2021-12-07T06:04:57.785Z] 	at com.nvidia.spark.rapids.GpuHashAggregateIterator.next(aggregate.scala:180)
[2021-12-07T06:04:57.785Z] 	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.partNextBatch(GpuShuffleExchangeExecBase.scala:288)
[2021-12-07T06:04:57.785Z] 	at org.apache.spark.sql.rapids.execution.GpuShuffleExchangeExecBase$$anon$1.hasNext(GpuShuffleExchangeExecBase.scala:304)
[2021-12-07T06:04:57.785Z] 	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
[2021-12-07T06:04:57.785Z] 	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
[2021-12-07T06:04:57.785Z] 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
[2021-12-07T06:04:57.785Z] 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
[2021-12-07T06:04:57.785Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:127)
[2021-12-07T06:04:57.785Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
[2021-12-07T06:04:57.785Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
[2021-12-07T06:04:57.785Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
[2021-12-07T06:04:57.785Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2021-12-07T06:04:57.785Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2021-12-07T06:04:57.785Z] 	at java.lang.Thread.run(Thread.java:748)
pxLi added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and test (Only impacts tests) labels, and removed the ? - Needs Triage label, on Dec 7, 2021
pxLi (Collaborator, Author) commented Dec 7, 2021

Going to give it a more reasonable RMM pool size.
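
For context, the pool size is governed by the plugin's GPU memory confs. A minimal sketch of how a smaller pool could be configured, assuming the RAPIDS Accelerator jar is already on the classpath; the fractions below are illustrative, not the values used by the nightly job:

```python
from pyspark.sql import SparkSession

# Illustrative values only; the actual nightly configuration may differ.
spark = (
    SparkSession.builder
    .appName("rapids-integration-tests")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # Fraction of GPU memory the plugin reserves for its RMM pool at startup.
    .config("spark.rapids.memory.gpu.allocFraction", "0.5")
    # Ceiling the pool is allowed to grow to; leaving headroom helps avoid
    # "Maximum pool size exceeded" when several test processes share one GPU.
    .config("spark.rapids.memory.gpu.maxAllocFraction", "0.75")
    .getOrCreate()
)
```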

pxLi (Collaborator, Author) commented Dec 8, 2021

Reopened; still seeing the intermittent OOM. Going to decrease concurrentGpuTasks and do some testing.
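
For reference, a hedged sketch of what lowering that knob looks like, reusing the session from the sketch above; the value 1 is illustrative, not necessarily what CI ended up using:

```python
# Illustrative value: fewer tasks allowed on the GPU at once means less
# concurrent pressure on the shared RMM pool, at the cost of parallelism.
spark.conf.set("spark.rapids.sql.concurrentGpuTasks", "1")
```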

pxLi (Collaborator, Author) commented Dec 8, 2021

Ran dozens of tests and did not see this again after decreasing concurrentGpuTasks. I will keep monitoring this.

jlowe (Member) commented Jan 19, 2022

Saw this in one of the nightly tests again.

revans2 (Collaborator) commented Jan 20, 2022

This feels like there is some other issue happening here, which concerns me.

I removed the test in #4570, so technically we can close this: the test is gone, so it is impossible for this particular failure to happen again. But... I hacked together a quick max-memory-allocated patch for cuDF so we could tell how much memory each test took, and this test took 453 MiB. That is by far the largest amount of memory consumed, but roughly 500 MiB should not be too much when we are shooting for 2 GiB of GPU memory per test task (which does not currently work, but I'll address that in a separate issue), especially since #4336 supposedly split this test off to run with a hard-coded parallelism of 2, which in theory should give us at least 8 GiB of memory.
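
As a rough sanity check of those numbers, here is a back-of-envelope sketch; the 16 GiB total is an assumption about the CI GPU, while the other figures come from the comment above:

```python
# Back-of-envelope headroom check for the figures quoted above.
total_gpu_gib = 16.0   # assumed CI GPU size; not stated in this issue
workers = 2            # hard-coded parallelism this test was split off to use
peak_test_mib = 453    # measured peak allocation for this test

budget_per_worker_gib = total_gpu_gib / workers   # ~8 GiB per worker
peak_gib = peak_test_mib / 1024                   # ~0.44 GiB observed

print(f"budget per worker: {budget_per_worker_gib:.1f} GiB, "
      f"observed peak: {peak_gib:.2f} GiB")
# The observed peak is far below both the ~2 GiB-per-task target and the
# ~8 GiB-per-worker budget, which is why the OOM suggests a different problem.
```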

So I think the next step is to start playing around with the nightly test script to really clean it up and verify that everything is working as expected. But I am going to close this, because this particular test no longer exists.

revans2 closed this as completed on Jan 20, 2022