
s2sw_pdlib_debug gnu job hanging on HERA (non-reproducible) #2382

Open
jiandewang opened this issue Jul 31, 2024 · 7 comments
Labels: bug (Something isn't working)

Comments

@jiandewang (Collaborator)

Description

While testing new MOM6 code in UWM I found the s2sw_pdlib_debug gnu job hanging on HERA, but the hang is not reproducible. So I went back to the current UWM version, cloned this morning:
commit b5a1976 (HEAD -> develop, origin/develop, origin/HEAD)
Author: Dusan Jovic 48258889+DusanJovic-NOAA@users.noreply.github.com
Date: Tue Jul 30 07:17:15 2024 -0400

Fix dumpfields=true option by using ESMF_FieldBundleWrite (#2355)

but I found the same situation.
I repeated the job 10 times: 5 succeeded while the other 5 hung and timed out.
I also repeated it 10 times on Hercules, and all of them ran fine.

I have a strong feeling that this issue is related to the machine rather than to the code or resource settings, but I don't know how to prove it. I tested this job many times before today's try of the UWM develop branch. What I found is that the outcome heavily depends on when I submit the job: there were times when all 20 of my tries were OK, and there were also times when more than half of my jobs hung.

To Reproduce:

clone the latest UWM
run the s2sw_pdlib_debug gnu job and repeat it several times; some runs will finish fine but some will time out (see the sketch below)
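
A minimal repro loop, as a sketch only: the rt.sh flags shown here are assumptions that may differ by UWM version, and <account> is a placeholder for a valid HERA allocation.

  # Clone UWM and rerun the failing regression test several times.
  git clone --recursive https://github.com/ufs-community/ufs-weather-model
  cd ufs-weather-model/tests
  for i in $(seq 1 10); do
    ./rt.sh -a <account> -k -n "cpld_debug_pdlib_p8 gnu"   # -k keeps the run directory
  done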

Additional context

Error information:
180: WARNING: Open MPI failed to TCP connect to a peer MPI process. This
180: should not happen.
180:
180: Your Open MPI job may now hang or fail.
180:
180: Local host: h4c43
180: PID: 3159416
180: Message: connect() to 10.184.4.51:1034 failed
180: Error: Resource temporarily unavailable (11)
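
The connect() failure with errno 11 (EAGAIN) shows the ranks falling back to Open MPI's TCP BTL instead of the IB fabric. A possible workaround sketch, assuming this Open MPI build includes UCX support (not confirmed for the HERA install):

  # Steer Open MPI away from the TCP byte-transfer layer. Both MCA
  # parameters are standard Open MPI knobs; UCX availability is an assumption.
  export OMPI_MCA_pml=ucx    # use the UCX point-to-point layer (IB/RDMA)
  export OMPI_MCA_btl=^tcp   # exclude the TCP BTL
  srun ./fv3.exe             # illustrative launch; the RT scripts normally set this up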

Output

see HERA /scratch1/NCEPDEV/stmp2/Jiande.Wang/FV3_RT/rt_324380-HEAD-pdlib/cpld_debug_pdlib_p8_gnu/err

@jiandewang added the bug (Something isn't working) label on Jul 31, 2024
@jiandewang (Collaborator, Author)

I am going to open a HERA ticket to see whether the SAs can give us some clues.

@jiandewang (Collaborator, Author)

From the HERA SA:
Based on the error output, it looks like your job is trying to use TCP versus IB/RDMA. I think this could be part of the problem.

Looking at your job stack, it looks like you are using a custom install of openmpi. I saw from "ompi_info" that it looks like it was built with "--without-verbs". This might be normal, but typically for IB to work you want verbs.

We believe the first step is to look at your openmpi stack, as it does not appear to be set up to use IB.
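
One way to check what the SA describes, as a sketch (the grep patterns are illustrative):

  # Inspect how this Open MPI was configured and which transports it exposes.
  ompi_info | grep -i "configure command"   # shows flags such as --without-verbs
  ompi_info | grep -i btl                   # lists available byte-transfer layers
  ompi_info | grep -i ucx                   # checks for UCX (IB/RDMA) support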

@BrianCurtis-NOAA (Collaborator)

@jkbk2004 @RatkoVasic-NOAA FYI

@jkbk2004 (Collaborator) commented Aug 1, 2024

I think it might be an openmpi issue with gnu. The s2sw_pdlib_debug_gnu/cpld_debug_pdlib_p8 job is hanging.

 25: WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
 25: should not happen.
 25:
 25: Your Open MPI job may now hang or fail.
 25:
 25:   Local host: h1c10
 25:   PID:        145750
 25:   Message:    connect() to 10.184.4.41:1061 failed
 25:   Error:      Resource temporarily unavailable (11)
 25: --------------------------------------------------------------------------
  18: --------------------------------------------------------------------------
 18: WARNING: Open MPI failed to TCP connect to a peer MPI process.  This
 18: should not happen.
 18:
 18: Your Open MPI job may now hang or fail.
 18:
 18:   Local host: h1c10
 18:   PID:        145743
 18:   Message:    connect() to 10.184.4.41:1031 failed
 18:   Error:      Resource temporarily unavailable (11)
 18: --------------------------------------------------------------------------
 32: [h1c10:145757] 10 more processes have sent help message help-mpi-btl-tcp.txt / client connect fail
 32: [h1c10:145757] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
  0: slurmstepd: error: *** STEP 64288223.0 ON h1c10 CANCELLED AT 2024-08-01T04:56:25 DUE TO TIME LIMIT ***
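
Following the hint in the log, a sketch for surfacing every instance of the aggregated warning (the launch line is illustrative):

  # Disable Open MPI help-message aggregation so each failing rank reports separately.
  export OMPI_MCA_orte_base_help_aggregate=0
  srun ./fv3.exe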

@jkbk2004 (Collaborator) commented Aug 1, 2024

@RatkoVasic-NOAA @ulmononian we need a plan to resolve this issue. The problem on Hercules is different, but openmpi used to cause issues on Hercules as well.

@jkbk2004 (Collaborator) commented Aug 1, 2024

Looks like a similar issue was reported: open-mpi/ompi#11508

@jkbk2004 (Collaborator) commented Aug 1, 2024

More information on the TCP/OMPI issue: open-mpi/ompi#10734
