FLAG_RNDV_FRAG assertion failure with cuda transfers within the node #5646

Closed
Akshay-Venkatesh opened this issue Sep 1, 2020 · 2 comments · Fixed by #5675
@Akshay-Venkatesh (Contributor)

Describe the bug

When rc is removed from UCX_TLS and data is moved between GPUs that are not cuda-ipc accessible, the assertion failure in the title shows up. This happened while checking whether #5473 addresses #3249.
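
Not part of the original report: a minimal sketch, assuming the GPU pair selected by CUDA_VISIBLE_DEVICES=0,5 in the reproducer below, for checking whether the two devices are peer (IPC) accessible to each other. If both directions report 0, cuda_ipc cannot be used for the pair and the transfer falls back to the staged rendezvous path that trips this assertion. The file name peer_check.c is hypothetical.

/* peer_check.c (hypothetical name). Build with:  nvcc peer_check.c -o peer_check
 * Run under the same CUDA_VISIBLE_DEVICES=0,5 setting as the reproducer so the
 * two GPUs appear to the process as devices 0 and 1. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) {
        fprintf(stderr, "need at least two visible GPUs\n");
        return 1;
    }

    int fwd = 0, bwd = 0;
    cudaDeviceCanAccessPeer(&fwd, 0, 1);  /* can device 0 reach device 1? */
    cudaDeviceCanAccessPeer(&bwd, 1, 0);  /* can device 1 reach device 0? */
    printf("peer access 0->1: %d, 1->0: %d\n", fwd, bwd);
    return 0;
}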

Steps to Reproduce

$ mpirun -mca btl ^openib -mca pml ucx -np 2 --map-by ppr:1:socket -x UCX_TLS=mm,cuda_copy,cuda_ipc,gdr_copy -x UCX_MEMTYPE_CACHE=n -x MUCX_MAX_RNDV_RAILS=1 \
  -x CUDA_VISIBLE_DEVICES=0,5 -x LD_LIBRARY_PATH ./get_local_ompi_rank_hca mpi/pt2pt/osu_bw -m 1:$((2 ** 22)) D D
local rank 0: using hca mlx5_0:1 openib using mlx5_0
local rank 1: using hca mlx5_2:1 openib using mlx5_2
# OSU MPI-CUDA Bandwidth Test v5.6.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
1                       0.44
2                       0.62
4                       1.23
8                       2.46
16                      7.19
32                      9.88
64                     19.00
128                    37.75
256                    74.91
512                   111.66
1024                  195.81
2048                  282.60
4096                  339.61
8192                  388.52
[prm-dgx-16:37750:0:37750]        rndv.c:1689 Assertion `!(rreq->flags & UCP_REQUEST_FLAG_RNDV_FRAG)' failed
==== backtrace (tid:  37750) ====
 0  $UCX_HOME/lib/libucs.so.0(ucs_handle_error+0x73) [0x7fcf78182f1f]
 1  $UCX_HOME/lib/libucs.so.0(ucs_fatal_error_message+0xdf) [0x7fcf7818030c]
 2  $UCX_HOME/lib/libucs.so.0(+0x2b49a) [0x7fcf7818049a]
 3  $UCX_HOME/lib/libucp.so.0(ucp_rndv_data_handler+0x10b) [0x7fcf7888947e]
 4  $UCX_HOME/lib/libuct.so.0(+0x175ea) [0x7fcf785fa5ea]
 5  $UCX_HOME/lib/libuct.so.0(+0x18024) [0x7fcf785fb024]
 6  $UCX_HOME/lib/libucp.so.0(+0x3753f) [0x7fcf7886553f]
 7  $UCX_HOME/lib/libucp.so.0(ucp_worker_progress+0x137) [0x7fcf7886d96f]
 8  $MPI_HOME/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17) [0x7fcf78b3d027]
 9  $MPI_HOME/lib/libopen-pal.so.40(opal_progress+0x48) [0x7fcfb7c1d1e8]
10  $MPI_HOME/lib/libmpi.so.40(ompi_request_default_wait_all+0x4c9) [0x7fcfbb81efc9]
11  $MPI_HOME/lib/libmpi.so.40(PMPI_Waitall+0x337) [0x7fcfbb9242f7]
12  mpi/pt2pt/osu_bw() [0x4027b7]
=================================
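
For readers unfamiliar with the failing check, here is a simplified stand-in (not the actual UCX source): ucp_rndv_data_handler asserts that the receive request it is handed does not carry the RNDV_FRAG flag, which appears to mark rendezvous pipeline fragment requests. The struct layout, flag value, and function names below are invented for illustration only; compiling and running this aborts on the second call, mirroring the trace above.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Invented stand-ins for illustration; the real flag value and request
 * layout are defined inside UCX. */
#define TOY_FLAG_RNDV_FRAG (1u << 10)

typedef struct {
    uint32_t flags;
} toy_request_t;

static void toy_rndv_data_handler(toy_request_t *rreq)
{
    /* Mirrors the shape of the check at rndv.c:1689: the handler expects a
     * plain receive request, not one flagged as a pipeline fragment. */
    assert(!(rreq->flags & TOY_FLAG_RNDV_FRAG));
    printf("rndv data handled for request %p\n", (void *)rreq);
}

int main(void)
{
    toy_request_t user_req = { .flags = 0 };
    toy_rndv_data_handler(&user_req);   /* passes */

    toy_request_t frag_req = { .flags = TOY_FLAG_RNDV_FRAG };
    toy_rndv_data_handler(&frag_req);   /* aborts, like the trace above */
    return 0;
}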

cc @bureddy

@bureddy (Contributor) commented Sep 2, 2020

@Akshay-Venkatesh will check.

@Akshay-Venkatesh (Contributor, Author) commented Sep 10, 2020

@bureddy thanks for the fix.

#5675 does fix the above assertion failure. I've verified with the OSU micro-benchmarks 5.6.3.
