[BUG]: Segmentation fault when initializing the comms #4635

Open
2 tasks done
jnke2016 opened this issue Aug 27, 2024 · 1 comment
Labels: ? - Needs Triage (Need team to review and classify), bug (Something isn't working)

@jnke2016 (Contributor)

Version

24.10

Which installation method(s) does this occur on?

No response

Describe the bug.

A user reported a segmentation fault when initializing the comms while using one of our latest nightlies. None of our nightly tests currently reproduce this bug.

Minimum reproducible example

Not reproducible yet

Relevant log output

stcomp>():222] - 2024-08-25 11:15:35,660 - distributed.core - INFO - Starting established connection to tcp://10.174.164.228:43037
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:22,812 - distributed.worker - INFO - Run out-of-band function '_get_nvml_device_index'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:22,812 - distributed.worker - INFO - Run out-of-band function '_get_nvml_device_index'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:22,835 - distributed.worker - INFO - Run out-of-band function '_func_ucp_listener_port'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:22,835 - distributed.worker - INFO - Run out-of-band function '_func_ucp_listener_port'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:22,881 - distributed.worker - INFO - Run out-of-band function '_func_init_all'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:22,881 - distributed.worker - INFO - Run out-of-band function '_func_init_all'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:30,551 - distributed.worker - INFO - Run out-of-band function '_subcomm_init'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2024-08-25 11:16:30,585 - distributed.worker - INFO - Run out-of-band function '_subcomm_init'
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - [1724584582.860222] [cgjben2-a6wj25tiqsi5u-w-10:8866 :0] parser.c:2036 UCX WARN unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - [1724584582.860222] [cgjben2-a6wj25tiqsi5u-w-10:8866 :0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - [cgjben2-a6wj25tiqsi5u-w-10:8866 :0:8866] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - ==== backtrace (tid: 8866) ====
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 0 /mnt/1/python_env/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(ucs_handle_error+0x2fd) [0x7f1228e5a06d]
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 1 /mnt/1/python_env/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(+0x2a264) [0x7f1228e5a264]
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 2 /mnt/1/python_env/lib/python3.10/site-packages/raft_dask/common/../../../.././libucs.so.0(+0x2a42a) [0x7f1228e5a42a]
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - 3 /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980) [0x7f129dad7980]
24/08/25 11:16:45 WARN python: [dask_cluster.py:():222] - =================================

Environment details

No response

Other/Misc.

No response

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
jnke2016 added the "? - Needs Triage" and "bug" labels on Aug 27, 2024
jnke2016 self-assigned this on Aug 27, 2024
@jnke2016 (Contributor, Author)

@nv-rliu
