
Try using contiguous rank to fix cuda_visible_devices #1926

Merged: 3 commits merged into rapidsai:branch-23.12 on Oct 25, 2023

Conversation

VibhuJawa (Member)

This PR attempts to solve rapidsai/cugraph#3889

@copy-pr-bot (bot) commented Oct 24, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cjnolet added the `improvement` (Improvement / enhancement to an existing function) and `non-breaking` (Non-breaking change) labels on Oct 24, 2023
@cjnolet (Member) commented Oct 24, 2023

/ok to test

@VibhuJawa (Member, Author) commented Oct 25, 2023

I can confirm that with this PR the issue below (rapidsai/cugraph#3889) is resolved:

In [1]: from cugraph.testing.mg_utils import start_dask_client
In [2]: client, cluster = start_dask_client(dask_worker_devices=[1], jit_unspill=False)

@jnke2016

Thanks @VibhuJawa.
This PR looks good to me. I will need to test it on Draco to make sure it doesn't bring back the transient NCCL issue that was resolved a few months ago.

@VibhuJawa (Member, Author)

In [1]: from raft_dask.common import Comms
   ...: from dask_cuda import LocalCUDACluster
   ...: from dask.distributed import Client
   ...: 
   ...: cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="1")
   ...: client = Client(cluster)
   ...: cb = Comms(verbose=True, comms_p2p=True)
   ...: cb.init()

With branch-23.12:

Initializing comms!
2023-10-24 18:44:17,633 - distributed.worker - WARNING - Run Failed
Function: _func_init_all
args:     (b'b\xbbF\xa8\x1d\xa6O0\xba\xd9\xcb\x16\xbd\xfe\xbd\xd4', b'\x9fW\x1bd\xfb\x81Rs\x02\x00\xcd\xab\n!\xe4F\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xe0\x11\x06VI\x7f\x00\x00 \xb6E\xa0E\x7f\x00\x00p\x86\xb4VI\x7f\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x02\x93\x86\xb1U\x00\x00\x00\x00\x00\x00\xce\x90\xd6*`^\xfa\x84\xb1U\x00\x00\xe0\x11\x06VI\x7f\x00\x00\x00\x03\xf6\x9a\x9c\xaa\xa3w \xb9U\xacE\x7f\x00\x00p\x86\xb4VI\x7f\x00', True, {'tcp://127.0.0.1:38629': {'rank': 1, 'port': 47810}}, True, 0)
kwargs:   {'dask_worker': <Worker 'tcp://127.0.0.1:38629', name: 1, status: running, stored: 0, running: 0/1, ready: 0, comm: 0, waiting: 0>}
Traceback (most recent call last):
  File "/datasets/vjawa/miniconda3/envs/cugraph_dev/lib/python3.10/site-packages/distributed/worker.py", line 3174, in run
    result = await function(*args, **kwargs)
  File "/datasets/vjawa/miniconda3/envs/cugraph_dev/lib/python3.10/site-packages/raft_dask/common/comms.py", line 446, in _func_init_all
    _func_init_nccl(sessionId, uniqueId, dask_worker=dask_worker)
  File "/datasets/vjawa/miniconda3/envs/cugraph_dev/lib/python3.10/site-packages/raft_dask/common/comms.py", line 511, in _func_init_nccl
    n.init(nWorkers, uniqueId, wid)
  File "nccl.pyx", line 151, in raft_dask.common.nccl.nccl.init
RuntimeError: NCCL_ERROR: b'invalid argument (run with NCCL_DEBUG=WARN for details)'

With PR:

Initializing comms!
Initialization complete.
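As a minimal sketch of why contiguous ranks fix this (illustrative only; the helper names below are made up and this is not the actual raft_dask code): with a single worker pinned to GPU 1, deriving the NCCL rank from the CUDA device id yields rank 1 for a communicator of size 1, which is out of range, whereas a contiguous rank taken from a deterministic ordering of the workers yields the valid rank 0.

```python
# Illustrative comparison of rank derivation before and after the fix.

def rank_from_device_id(device_id):
    # Buggy scheme: NCCL rank taken from the CUDA device id, so
    # CUDA_VISIBLE_DEVICES="1" produces rank 1 even with one worker.
    return device_id

def contiguous_rank(workers, worker):
    # Fixed scheme: rank is the worker's position in a deterministic
    # (sorted) ordering of all worker addresses.
    return sorted(workers).index(worker)

workers = ["tcp://127.0.0.1:38629"]  # one worker, pinned to GPU 1
nranks = len(workers)

# Rank 1 is out of range for a communicator of size 1 ...
assert not (0 <= rank_from_device_id(1) < nranks)
# ... while the contiguous rank 0 is valid.
assert contiguous_rank(workers, workers[0]) == 0
```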

@cjnolet (Member) commented Oct 25, 2023

/merge

@cjnolet cjnolet marked this pull request as ready for review October 25, 2023 01:51
@cjnolet cjnolet requested a review from a team as a code owner October 25, 2023 01:51
@cjnolet (Member) commented Oct 25, 2023

/merge

@rapids-bot rapids-bot bot merged commit 3b87796 into rapidsai:branch-23.12 Oct 25, 2023
63 checks passed
seberg added a commit to seberg/raft that referenced this pull request Oct 25, 2023
This is a follow up on rapidsaigh-1926, since the rank sorting seemed
a bit hard to understand.
It does modify the logic in the sense that the host is now sorted
by IP as a way to group based on it.  But I don't really think that
host sorting was ever a goal?

If the goal is really about being deterministic, then this should
be more (or at least clearer) deterministic about order of worker
IPs.
OTOH, if the NVML device order doesn't matter, we could just sort
the workers directly.

The original rapidsaigh-1587 mentions:

    NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure,
if such a requirement exists, I would want to do one of:
* Ensure we can guarantee this, but this requires initializing
  workers that are not involved in the operation.
* At least raise an error, because if NCCL will end up raising
  the error it will be very confusing.
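The sort-by-host grouping and the "raise early" option discussed in the commit message can be sketched as follows. This is a toy illustration under assumed data, not the raft code; the function names and worker tuples are invented for the sketch.

```python
def ranks_grouped_by_host(workers):
    """Assign contiguous ranks after sorting workers by (host, port),
    which groups workers on the same host together."""
    return {w: r for r, w in enumerate(sorted(workers))}

def check_nccl_rank_mapping(rank_of, gpu_of, num_gpus_per_node):
    """Raise early if 'rank r lives on GPU r % num_gpus_per_node' does
    not hold, instead of letting NCCL fail later with an opaque
    'invalid argument' error."""
    for worker, rank in rank_of.items():
        expected_gpu = rank % num_gpus_per_node
        if gpu_of[worker] != expected_gpu:
            raise RuntimeError(
                f"{worker}: rank {rank} expected on GPU {expected_gpu}, "
                f"found on GPU {gpu_of[worker]}"
            )

# Two hosts, two workers on the first: sorting groups them by host.
workers = [("10.0.0.2", 40001), ("10.0.0.1", 40002), ("10.0.0.1", 40001)]
rank_of = ranks_grouped_by_host(workers)
assert rank_of[("10.0.0.1", 40001)] == 0
assert rank_of[("10.0.0.1", 40002)] == 1
assert rank_of[("10.0.0.2", 40001)] == 2

# A GPU layout that satisfies the expectation passes the check.
gpu_of = {
    ("10.0.0.1", 40001): 0,
    ("10.0.0.1", 40002): 1,
    ("10.0.0.2", 40001): 0,
}
check_nccl_rank_mapping(rank_of, gpu_of, num_gpus_per_node=2)
```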
rapids-bot bot pushed a commit that referenced this pull request Mar 15, 2024
This PR is based on @seberg's work in #1928.

From the PR:
This is a follow up on #1926, since the rank sorting seemed a bit hard to understand.

It does modify the logic in the sense that the host is now sorted by IP as a way to group based on it. But I don't really think that host sorting was ever a goal? 

If the goal is really about being deterministic, then this should be more (or at least clearer) deterministic about order of worker IPs.

OTOH, if the NVML device order doesn't matter, we could just sort the workers directly. 

The original #1587 mentions:

    NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure. If such a requirement exists, I would want to do one of:

* Ensure we can guarantee this, but this requires initializing workers that are not involved in the operation.
* At least raise an error, because if NCCL will end up raising the error it will be very confusing.

Authors:
  - Vibhu Jawa (https://github.com/VibhuJawa)
  - Sebastian Berg (https://github.com/seberg)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2228
Labels: improvement, non-breaking, python