
Throw again after logging that RMM could not initialize #5243

Merged
merged 2 commits into NVIDIA:branch-22.06 from abellina:fail_outright_when_pool_fails
Apr 14, 2022

Conversation

abellina
Collaborator

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>

Closes: #5242.

This is a small proposed diff to make failure to initialize RMM (which throws a CudfException) fatal, instead of silently continuing with a default-initialized raw CUDA memory resource.

The executor logs will look like this:

22/04/13 14:56:06 ERROR GpuDeviceManager: Could not initialize RMM, exiting!
 ai.rapids.cudf.CudfException: RMM failure at: /home/jenkins/agent/workspace/jenkins-cudf-for-dev-32-cuda11/cpp/build/_deps/rmm-src/include/rmm/mr/device/cuda_async_memory_resource.hpp:67: cudaMallocAsync not supported with this CUDA driver/runtime version
  at ai.rapids.cudf.Rmm.initializeInternal(Native Method)
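
The approach is catch, log a specific message, then rethrow so the failure propagates instead of being swallowed. A minimal runnable sketch in Java (the language of the `ai.rapids.cudf` API above); `InitFailure` and `initRmm` are hypothetical stand-ins for `CudfException` and the real RMM initialization:

```java
// Sketch of the log-then-rethrow handling for a fatal RMM init failure.
// InitFailure stands in for ai.rapids.cudf.CudfException; initRmm() is a
// hypothetical initializer that fails the way an old driver/runtime would.
public class RmmInitSketch {
    static class InitFailure extends RuntimeException {
        InitFailure(String msg) { super(msg); }
    }

    static void initRmm() {
        throw new InitFailure("cudaMallocAsync not supported with this CUDA driver/runtime version");
    }

    public static void main(String[] args) {
        try {
            initRmm();
        } catch (InitFailure e) {
            // Log with enough context to diagnose, then rethrow: we must not
            // fall back silently to a raw cudaMalloc memory resource.
            System.err.println("ERROR GpuDeviceManager: Could not initialize RMM: " + e.getMessage());
            throw e;
        }
    }
}
```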

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
@abellina abellina added this to the Apr 4 - Apr 15 milestone Apr 13, 2022
@abellina abellina added the bug Something isn't working label Apr 13, 2022
@@ -301,7 +301,9 @@ object GpuDeviceManager extends Logging {
      Rmm.initialize(init, logConf, poolAllocation)
      RapidsBufferCatalog.init(conf)
    } catch {
-     case e: Exception => logError("Could not initialize RMM", e)
+     case e: CudfException =>
+       logError("Could not initialize RMM, exiting!", e)
Collaborator

This is typically an anti-pattern:

https://stackoverflow.com/questions/6639963/why-is-log-and-throw-considered-an-anti-pattern

Is there a reason we need the log statement? Should we just let the exception continue up?
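
The linked anti-pattern is that when every layer logs before rethrowing, one failure shows up repeatedly in the logs. A hypothetical Java sketch (class and method names are illustrative, not from the repo):

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates why log-and-throw gets noisy: two layers each log the same
// failure before rethrowing, so the log records it twice.
public class LogAndThrow {
    static final List<String> log = new ArrayList<>();

    static void lowLevel() {
        try {
            throw new RuntimeException("RMM failure");
        } catch (RuntimeException e) {
            log.add("ERROR low level: " + e.getMessage());  // logged once here...
            throw e;
        }
    }

    static void highLevel() {
        try {
            lowLevel();
        } catch (RuntimeException e) {
            log.add("ERROR high level: " + e.getMessage()); // ...and again here
            throw e;
        }
    }

    public static void main(String[] args) {
        try { highLevel(); } catch (RuntimeException ignored) { }
        log.forEach(System.out::println);  // two log lines for one failure
    }
}
```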

Collaborator Author

@abellina abellina Apr 13, 2022


Yes, agreed. I wanted to add something more specific than the error seen here: #5242 (comment); that's the only reason.

Currently the exception I am re-throwing gets caught again in the RapidsExecutorPlugin, where we exit. Perhaps I should change the message here to say we are exiting. Do you have a preference?

22/04/13 14:39:05 ERROR RapidsExecutorPlugin: Exception in the executor plugin

Collaborator


I would rather have the code that exits be the code that says we are exiting; otherwise there is coupling between the code that exits and this code for no good reason.
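
In other words, the inner layer fails loudly but stays silent about exiting, and only the top-level handler that actually terminates announces it. A hypothetical Java sketch of that separation (`pluginInit`, `lastLog`, and the omitted `terminate()` are illustrative stand-ins for the RapidsExecutorPlugin's error handling):

```java
// Only the code that exits says "exiting"; the inner layer just fails.
public class ExitAtTop {
    static String lastLog = null;

    static void init() {
        // Inner layer: fail loudly, but say nothing about exiting --
        // whether we exit is the caller's decision, not ours.
        throw new RuntimeException("Could not initialize RMM");
    }

    static void pluginInit() {
        try {
            init();
        } catch (RuntimeException e) {
            // Top level: this code decides to exit, so it logs the exit.
            lastLog = "Exception in the executor plugin, exiting: " + e.getMessage();
            // terminate() would be called here; omitted to keep the sketch runnable.
        }
    }

    public static void main(String[] args) {
        pluginInit();
        System.out.println(lastLog);
    }
}
```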

Collaborator Author


@revans2 handled here: f6545d3

@abellina
Collaborator Author

build

@abellina abellina merged commit b6eaa50 into NVIDIA:branch-22.06 Apr 14, 2022
@abellina abellina deleted the fail_outright_when_pool_fails branch April 14, 2022 13:17
Successfully merging this pull request may close these issues.

[BUG] Executor falls back to cudaMalloc if the pool can't be initialized