
Throw again after logging that RMM could not initialize #5243

Merged
merged 2 commits into NVIDIA:branch-22.06 from abellina:fail_outright_when_pool_fails
Apr 14, 2022

Conversation

abellina
Collaborator

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>

Closes: #5242.

This is a small proposed diff to make failure to initialize RMM (which throws a CudfException) fatal, instead of silently continuing with a default-initialized raw CUDA memory resource.

The executor logs will look like this:

22/04/13 14:56:06 ERROR GpuDeviceManager: Could not initialize RMM, exiting!
 ai.rapids.cudf.CudfException: RMM failure at: /home/jenkins/agent/workspace/jenkins-cudf-for-dev-32-cuda11/cpp/build/_deps/rmm-src/include/rmm/mr/device/cuda_async_memory_resource.hpp:67: cudaMallocAsync not supported with this CUDA driver/runtime version
  at ai.rapids.cudf.Rmm.initializeInternal(Native Method)
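
The approach is catch, log a specific message, then rethrow so the failure propagates instead of being swallowed. A minimal runnable sketch in Java (the language of the `ai.rapids.cudf` API above); `InitFailure` and `initRmm` are hypothetical stand-ins for `CudfException` and the real RMM initialization:

```java
// Sketch of the log-then-rethrow handling for a fatal RMM init failure.
// InitFailure stands in for ai.rapids.cudf.CudfException; initRmm() is a
// hypothetical initializer that fails the way an old driver/runtime would.
public class RmmInitSketch {
    static class InitFailure extends RuntimeException {
        InitFailure(String msg) { super(msg); }
    }

    static void initRmm() {
        throw new InitFailure("cudaMallocAsync not supported with this CUDA driver/runtime version");
    }

    public static void main(String[] args) {
        try {
            initRmm();
        } catch (InitFailure e) {
            // Log with enough context to diagnose, then rethrow: we must not
            // fall back silently to a raw cudaMalloc memory resource.
            System.err.println("ERROR GpuDeviceManager: Could not initialize RMM: " + e.getMessage());
            throw e;
        }
    }
}
```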

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
@abellina abellina added this to the Apr 4 - Apr 15 milestone Apr 13, 2022
@abellina abellina added the bug Something isn't working label Apr 13, 2022
@@ -301,7 +301,9 @@ object GpuDeviceManager extends Logging {
      Rmm.initialize(init, logConf, poolAllocation)
      RapidsBufferCatalog.init(conf)
    } catch {
-     case e: Exception => logError("Could not initialize RMM", e)
+     case e: CudfException =>
+       logError("Could not initialize RMM, exiting!", e)
Collaborator

This is typically an anti-pattern:

https://stackoverflow.com/questions/6639963/why-is-log-and-throw-considered-an-anti-pattern

Is there a reason we need the log statement? Should we just let the exception continue up?
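
The linked anti-pattern is that when every layer logs before rethrowing, one failure shows up repeatedly in the logs. A hypothetical Java sketch (class and method names are illustrative, not from the repo):

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates why log-and-throw gets noisy: two layers each log the same
// failure before rethrowing, so the log records it twice.
public class LogAndThrow {
    static final List<String> log = new ArrayList<>();

    static void lowLevel() {
        try {
            throw new RuntimeException("RMM failure");
        } catch (RuntimeException e) {
            log.add("ERROR low level: " + e.getMessage());  // logged once here...
            throw e;
        }
    }

    static void highLevel() {
        try {
            lowLevel();
        } catch (RuntimeException e) {
            log.add("ERROR high level: " + e.getMessage()); // ...and again here
            throw e;
        }
    }

    public static void main(String[] args) {
        try { highLevel(); } catch (RuntimeException ignored) { }
        log.forEach(System.out::println);  // two log lines for one failure
    }
}
```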

Collaborator Author

@abellina abellina Apr 13, 2022


Yes, agreed. I wanted to add something more specific than the error seen here: #5242 (comment); that's the only reason.

Currently the exception I am re-throwing gets caught again in the RapidsExecutorPlugin, where we exit. Perhaps I should change the message here to say we are exiting. Do you have a preference?

22/04/13 14:39:05 ERROR RapidsExecutorPlugin: Exception in the executor plugin

Collaborator


I would rather have the code that exits be the code that says we are exiting; otherwise there is coupling between the code that exits and this code for no good reason.
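
In other words, the inner layer fails loudly but stays silent about exiting, and only the top-level handler that actually terminates announces it. A hypothetical Java sketch of that separation (`pluginInit`, `lastLog`, and the omitted `terminate()` are illustrative stand-ins for the RapidsExecutorPlugin's error handling):

```java
// Only the code that exits says "exiting"; the inner layer just fails.
public class ExitAtTop {
    static String lastLog = null;

    static void init() {
        // Inner layer: fail loudly, but say nothing about exiting --
        // whether we exit is the caller's decision, not ours.
        throw new RuntimeException("Could not initialize RMM");
    }

    static void pluginInit() {
        try {
            init();
        } catch (RuntimeException e) {
            // Top level: this code decides to exit, so it logs the exit.
            lastLog = "Exception in the executor plugin, exiting: " + e.getMessage();
            // terminate() would be called here; omitted to keep the sketch runnable.
        }
    }

    public static void main(String[] args) {
        pluginInit();
        System.out.println(lastLog);
    }
}
```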

Collaborator Author


@revans2 handled here: f6545d3

@abellina
Collaborator Author

build

@abellina abellina merged commit b6eaa50 into NVIDIA:branch-22.06 Apr 14, 2022
@abellina abellina deleted the fail_outright_when_pool_fails branch April 14, 2022 13:17
Successfully merging this pull request may close these issues.

[BUG] Executor falls back to cudaMalloc if the pool can't be initialized