[BUG] when ucp listener enabled we bind 16 times always #2474

Closed
abellina opened this issue May 21, 2021 · 0 comments · Fixed by #2476
Labels: bug (Something isn't working), P0 (Must have for release), shuffle (things that impact the shuffle plugin)


abellina commented May 21, 2021

I noticed this while looking into more error handling with UCX's UCPListener:

[1621604466.240135] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d00950c0 on cm 0x7f48d00801f0 with fd: 389 listening on 192.168.50.80:36221
[1621604466.240202] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d0095890 on cm 0x7f48d00801f0 with fd: 390 listening on 192.168.50.80:36222
[1621604466.240257] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d007d2b0 on cm 0x7f48d00801f0 with fd: 391 listening on 192.168.50.80:36223
[1621604466.240289] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d007d280 on cm 0x7f48d00801f0 with fd: 392 listening on 192.168.50.80:36224
[1621604466.240320] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d0095ce0 on cm 0x7f48d00801f0 with fd: 393 listening on 192.168.50.80:36225
[1621604466.240351] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d0095e30 on cm 0x7f48d00801f0 with fd: 394 listening on 192.168.50.80:36226
[1621604466.240384] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d0096120 on cm 0x7f48d00801f0 with fd: 395 listening on 192.168.50.80:36227
[1621604466.240415] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d0096270 on cm 0x7f48d00801f0 with fd: 396 listening on 192.168.50.80:36228
[1621604466.240446] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d00963c0 on cm 0x7f48d00801f0 with fd: 397 listening on 192.168.50.80:36229
[1621604466.240482] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d0096510 on cm 0x7f48d00801f0 with fd: 398 listening on 192.168.50.80:36230
[1621604466.240513] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d0096660 on cm 0x7f48d00801f0 with fd: 399 listening on 192.168.50.80:36231
[1621604466.240543] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d00967b0 on cm 0x7f48d00801f0 with fd: 400 listening on 192.168.50.80:36232
[1621604466.240573] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d0096900 on cm 0x7f48d00801f0 with fd: 401 listening on 192.168.50.80:36233
[1621604466.240604] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d0096a50 on cm 0x7f48d00801f0 with fd: 402 listening on 192.168.50.80:36234
[1621604466.240636] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d0096ba0 on cm 0x7f48d00801f0 with fd: 403 listening on 192.168.50.80:36235
[1621604466.240666] [test-server:9155 :0]   tcp_listener.c:134  UCX  DEBUG created a TCP listener 0x7f48d0096cf0 on cm 0x7f48d00801f0 with fd: 404 listening on 192.168.50.80:3623

We bind 16 times because we are not exiting early here: https://github.com/NVIDIA/spark-rapids/blob/branch-21.06/shuffle-plugin/src/main/scala/com/nvidia/spark/rapids/shuffle/ucx/UCX.scala#L209

Fix coming.
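
For illustration only, the early exit could look roughly like the sketch below. This is not the code in UCX.scala and it does not use the UCX API: `BindOnce`, `bindFirstAvailable`, and the `tryBind` callback are hypothetical names, and the default of 16 attempts is assumed from the 16 listeners in the log above. The idea is to walk consecutive ports and stop as soon as one bind succeeds.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not the spark-rapids implementation.
object BindOnce {
  /**
   * Tries ports startPort, startPort + 1, ... and returns the first port
   * (with whatever tryBind produced) that binds successfully.
   * tryBind stands in for the real listener creation.
   */
  def bindFirstAvailable[T](startPort: Int, maxRetries: Int = 16)(
      tryBind: Int => Try[T]): (Int, T) = {
    var attempt = 0
    var result: Option[(Int, T)] = None
    var lastError: Throwable = null
    // `result.isEmpty` is the early exit: without it the loop would go on
    // to bind every remaining port even after a success.
    while (result.isEmpty && attempt < maxRetries) {
      val port = startPort + attempt
      tryBind(port) match {
        case Success(listener) => result = Some((port, listener))
        case Failure(e) => lastError = e
      }
      attempt += 1
    }
    result.getOrElse(
      throw new IllegalStateException(
        s"Could not bind after $maxRetries attempts", lastError))
  }
}
```

With a `tryBind` that succeeds on the first call, the callback runs once and a single listener is created, rather than one per port in the range.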

abellina added the bug, "? - Needs Triage", and P0 labels on May 21, 2021
abellina added this to the May 10 - May 21 milestone on May 21, 2021
abellina added the shuffle label on May 21, 2021
sameerz removed the "? - Needs Triage" label on May 21, 2021