s2sw_pdlib_debug gnu job hanging on HERA (non-reproducible) #2382
Comments
I am going to open a HERA ticket to see whether the SA can give us some clues.
From the HERA SA: Looking at your job stack, it looks like you are using a custom install of openmpi. I saw from "ompi_info" that it looks like it was built with "--without-verbs". This might be normal, but typically for IB to work you want verbs. We believe the first step is to look at your openmpi stack, as it does not appear to be set up to use IB.
I think it might be an openmpi issue with gnu. The s2sw_pdlib_debug_gnu/cpld_debug_pdlib_p8 job is hanging.
@RatkoVasic-NOAA @ulmononian we need a plan to resolve the issue. The problem on Hercules is different, but openmpi used to cause issues on Hercules as well.
It looks like a similar issue has been reported: open-mpi/ompi#11508
More information on the TCP/OMPI issue: open-mpi/ompi#10734
Description
During my testing of new MOM6 code in UWM, I found the s2sw_pdlib_debug gnu job hanging on HERA, but the hang is non-reproducible. So I went back to the current UWM version, cloned this morning,
commit b5a1976 (HEAD -> develop, origin/develop, origin/HEAD)
Author: Dusan Jovic <48258889+DusanJovic-NOAA@users.noreply.github.com>
Date: Tue Jul 30 07:17:15 2024 -0400
but I found the same situation.
I repeated it 10 times: 5 runs succeeded, while the other 5 hung and timed out.
I also repeated it 10 times on Hercules, and all of those runs were fine.
I have a strong feeling that this issue is related to the machine rather than to the code or resource settings, but I don't know how to prove it. I tested this job many times before today's test of the UWM develop branch. What I found is that the outcome depends heavily on when I submit the job. There were times when all 20 of my tries were OK, and times when more than half of my jobs hung.
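One way to put a number on the machine dependence is a Fisher exact test on the run counts reported above (HERA: 5 hung, 5 OK; Hercules: 0 hung, 10 OK). The sketch below uses only the Python standard library; the function name `fisher_exact_two_sided` is made up for this illustration. It is a rough check, not a proof: the sample is small, and since the hang rate seems to vary with submission time, the runs may not be independent.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of every table at least as unlikely as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Rows: machine (HERA, Hercules); columns: (hung, OK). Counts from the description.
p_value = fisher_exact_two_sided(5, 5, 0, 10)
print(f"two-sided p = {p_value:.4f}")
```

A small p-value here only says that the HERA and Hercules hang rates are unlikely to be equal given these 20 runs; it does not say whether the cause is the network, the openmpi build, or node load.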
To Reproduce:
1. Clone the latest UWM.
2. Run the s2sw_pdlib_debug gnu job and repeat it several times. Some runs will finish fine, but some will time out.
Additional context
error information:
180: WARNING: Open MPI failed to TCP connect to a peer MPI process. This
180: should not happen.
180:
180: Your Open MPI job may now hang or fail.
180:
180: Local host: h4c43
180: PID: 3159416
180: Message: connect() to 10.184.4.51:1034 failed
180: Error: Resource temporarily unavailable (11)
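To compare many repeated runs, it may help to pull these warnings out of each `err` file automatically. The sketch below assumes every line is prefixed with the MPI rank and a colon, as in the excerpt above; the function name `parse_tcp_connect_warnings` and the field handling are illustrative assumptions, not part of any UWM or Open MPI tool.

```python
import re

# Rank-prefixed lines look like "180: Local host: h4c43" in the err file.
LINE_RE = re.compile(r"^\s*(\d+):\s?(.*)$")
FIELD_RE = re.compile(r"^(Local host|PID|Message|Error):\s*(.*)$")


def parse_tcp_connect_warnings(text):
    """Return one dict per 'failed to TCP connect' warning found in text."""
    warnings = []
    current = None
    for raw in text.splitlines():
        m = LINE_RE.match(raw)
        if not m:
            continue
        rank, body = int(m.group(1)), m.group(2).strip()
        if "Open MPI failed to TCP connect" in body:
            # Start a new record at each warning banner.
            current = {"rank": rank}
            warnings.append(current)
            continue
        if current is None or rank != current["rank"]:
            continue
        field = FIELD_RE.match(body)
        if field:
            current[field.group(1)] = field.group(2)
    return warnings


# Sample taken from the error excerpt in this issue.
sample = """\
180: WARNING: Open MPI failed to TCP connect to a peer MPI process. This
180: should not happen.
180:
180: Your Open MPI job may now hang or fail.
180:
180: Local host: h4c43
180: PID: 3159416
180: Message: connect() to 10.184.4.51:1034 failed
180: Error: Resource temporarily unavailable (11)
"""

for w in parse_tcp_connect_warnings(sample):
    print(w["rank"], w["Local host"], w["Message"], w["Error"])
```

Running this over the `err` file of each repeated run would show whether the hung runs always report the same local host or peer address, which could help narrow the problem to specific nodes.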
Output
see HERA /scratch1/NCEPDEV/stmp2/Jiande.Wang/FV3_RT/rt_324380-HEAD-pdlib/cpld_debug_pdlib_p8_gnu/err