segmentation faults when combined with @threads with memory allocation #337

Closed
twhitehead opened this issue Jan 10, 2020 · 24 comments · Fixed by #363

@twhitehead

twhitehead commented Jan 10, 2020

One of our users was having problems with their hybrid MPI/threaded Julia code segfaulting on our clusters.

OS: Linux (CentOS 7)
Julia: 1.3.0
OpenMPI: 3.1.2

I simplified their code down to the following demo:

using MPI

function main()
    MPI.Init()

    Threads.@threads for i in 1:100
        A = rand(1000,1000)
        A1 = inv(A)
        oops = A1[1.6]
    end

    MPI.Finalize()
end

main()

  • exceptions sometimes turn into segmentation faults inside @threads for loops
  • a reliable segmentation fault requires a reasonable amount of work in the loop

$ export JULIA_NUM_THREADS=2
$ mpirun -n 2 julia example.jl
[1578695090.632505] [gra797:13151:0]         parser.c:1369 UCX  WARN  unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1578695091.181319] [gra797:13152:0]         parser.c:1369 UCX  WARN  unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[gra797:13151:1:13155] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b6ccc961008)
[gra797:13152:1:13156] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b5877538008)
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x00000000000a9904 maybe_collect()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_threads.h:283
 2 0x00000000000a9904 jl_gc_managed_malloc()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:3116
 3 0x000000000007a160 _new_array_()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:109
 4 0x000000000007db3e jl_array_copy()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:1135
 5 0x000000000005d22c _jl_invoke()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gf.c:2135
 6 0x0000000000078e19 jl_apply()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia.h:1631
===================
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x00000000000a8d74 maybe_collect()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_threads.h:283
 2 0x00000000000a8d74 jl_gc_pool_alloc()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:1096
 3 0x000000000005d22c _jl_invoke()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gf.c:2135
 4 0x0000000000078e19 jl_apply()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia.h:1631
===================
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 13151 on node gra797 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Here is some possibly relevant info from ompi_info as well:

...
  Configure command line: '--prefix=/cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc7.3/openmpi/3.1.2'
                          '--build=x86_64-pc-linux-gnu'
                          '--host=x86_64-pc-linux-gnu' '--enable-shared'
                          '--with-verbs' '--enable-mpirun-prefix-by-default'
                          '--with-hwloc=external' '--without-usnic'
                          '--with-ucx' '--disable-wrapper-runpath'
                          '--disable-wrapper-rpath' '--with-munge'
                          '--with-slurm' '--with-pmi=/opt/software/slurm'
                          '--enable-mpi-cxx' '--with-hcoll'
                          '--disable-show-load-errors-by-default'
                          '--enable-mca-dso=common-libfabric,common-ofi,common-verbs,atomic-mxm,btl-openib,btl-scif,coll-fca,coll-hcoll,ess-tm,fs-lustre,mtl-mxm,mtl-ofi,mtl-psm,mtl-psm2,osc-ucx,oob-ud,plm-tm,pmix-s1,pmix-s2,pml-ucx,pml-yalla,pnet-opa,psec-munge,ras-tm,rml-ofi,scoll-mca,sec-munge,spml-ikrit,'
...
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, ORTE progress: yes, Event lib:
                          yes)
...

EDIT: Removed the exception bit as, as noted below, it isn't required.

@simonbyrne
Member

What version of MPI.jl are you using? You can get it via julia -e 'using Pkg; Pkg.status()'

@simonbyrne
Member

simonbyrne commented Jan 10, 2020

Hmm, I can reproduce on our cluster (with openmpi 3.1.4 and 4.0.1). It might have something to do with openmpi/UCX's funny malloc hooks, which have caused problems before. cc'ing @kpamnany and @vchuravy who might have some ideas?

mpich doesn't seem to have this problem, so I would suggest using that in the meantime?

@simonbyrne simonbyrne added the bug label Jan 10, 2020
@twhitehead
Author

Noticed after I wrote this that the exception part is not critical.

The program simply has to spend enough time (or maybe do enough allocating?) inside the @threads for loop and you get a crash (e.g., I get reliable crashes with 1000x1000 matrices, but when I reduce them to 10x10 it rarely crashes).

using MPI

function main()
    MPI.Init()

    Threads.@threads for i in 1:2
        A = rand(1000,1000)
        A1 = inv(A)
    end

    MPI.Finalize()
end

main()

I'll adjust the title accordingly.

@twhitehead twhitehead changed the title segmentation faults when combined with @threads and exceptions segmentation faults when combined with @threads with sufficient work Jan 10, 2020
@simonbyrne
Member

simonbyrne commented Jan 11, 2020

It's something to do with how MPI and Julia malloc interact. The simplest example (which doesn't require this package) that I can reproduce is:

# threads.jl
function main()
    ccall((:MPI_Init, :libmpi), Nothing, (Ptr{Cint},Ptr{Cint}), C_NULL, C_NULL)
    Threads.@threads for i in 1:100
        A = rand(1000,1000)
    end
end

main()

then calling

JULIA_NUM_THREADS=2 mpiexec -n 2 julia threads.jl

@simonbyrne simonbyrne changed the title segmentation faults when combined with @threads with sufficient work segmentation faults when combined with @threads with memory allocation Jan 11, 2020
@simonbyrne
Member

My guess is that it's due to UCX, which has caused similar problems in the past (#298).

They've fixed quite a few issues on their main branch, but haven't made a new release for quite a while (I think they've made some breaking API changes, so need to coordinate with openmpi). Hopefully this is one of those that will get fixed.

@simonbyrne
Member

I tried this with a master build of UCX + OpenMPI and see the same issue.

@kpamnany
Contributor

Try calling MPI_Init_thread with MPI_THREAD_MULTIPLE instead of MPI_Init? Just to rule that out.

@simonbyrne
Member

Yes, that seems to give the same result.

function main()
    r = Ref{Cint}()
    ccall((:MPI_Init_thread, :libmpi), Cint,
          (Ptr{Cint},Ptr{Cvoid},Cint,Ptr{Cint}),
          C_NULL, C_NULL, Cint(3), r)
    @show r[]
    Threads.@threads for i in 1:100
        A = rand(1000,1000)
    end
end

(MPI_THREAD_MULTIPLE == 3)

@twhitehead
Author

twhitehead commented Jan 12, 2020

It still blows up for me with OpenMPI 3.1.2 when I force it to use just the tcp and self BTLs (this is after allocating 2 tasks with 2 cores each and setting JULIA_NUM_THREADS=2).

$ mpirun --mca pml ob1 --mca btl tcp,self julia example.jl
[gra797:13477:0:13484] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b984f48b008)
[gra797:13478:0:13485] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2af4b14e9008)
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x00000000000a8cf4 maybe_collect()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_threads.h:283
 2 0x00000000000a8cf4 jl_gc_big_alloc()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:837
 3 0x00000000000a9300 jl_gc_alloc_()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_internal.h:238
 4 0x00000000000a9300 jl_gc_alloc_()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_internal.h:240
 5 0x00000000000a9300 jl_gc_alloc()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:2940
 6 0x000000000007ac7e _new_array_()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:100
 7 0x000000000007ac7e _new_array()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:163
 8 0x000000000007ac7e jl_alloc_array_1d()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:424
 9 0x000000000005d22c _jl_invoke()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gf.c:2135
10 0x0000000000078e19 jl_apply()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia.h:1631
===================
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x00000000000a8cf4 maybe_collect()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_threads.h:283
 2 0x00000000000a8cf4 jl_gc_big_alloc()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:837
 3 0x00000000000a9300 jl_gc_alloc_()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_internal.h:238
 4 0x00000000000a9300 jl_gc_alloc_()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_internal.h:240
 5 0x00000000000a9300 jl_gc_alloc()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:2940
 6 0x000000000007ac7e _new_array_()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:100
 7 0x000000000007ac7e _new_array()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:163
 8 0x000000000007ac7e jl_alloc_array_1d()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:424
 9 0x000000000005d22c _jl_invoke()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gf.c:2135
10 0x0000000000078e19 jl_apply()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia.h:1631
===================
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 13477 on node gra797 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@eschnett
Contributor

A few generic comments (that might not apply here, but nevertheless):

  • Your call to MPI_Init_thread requests MPI_THREAD_MULTIPLE, but does not check whether this is actually supported by the installed library. You need to check the provided thread level that is returned.
  • It is also possible that there is some confusion between the MPI headers that are found by Julia, the shared MPI library that is used for linking, the mpiexec that you are using at run time, and the MPI library that is actually used at run time.
  • Finally, do you see the segfaults at run time, or during shutdown? You could add a call to MPI_Barrier when your code is done to find out (see the sketch below).
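
A minimal sketch of the first and third checks, building on the raw-ccall snippets above (it assumes libmpi points at an OpenMPI build; obtaining MPI_COMM_WORLD from the exported ompi_mpi_comm_world symbol is OpenMPI-specific):

# Sketch only: verify the provided thread level, then synchronize before finalizing.
const libmpi = expanduser("~/usr/lib/libmpi.so")  # adjust to your installation

const MPI_THREAD_MULTIPLE = Cint(3)

function main()
    provided = Ref{Cint}()
    ccall((:MPI_Init_thread, libmpi), Cint,
          (Ptr{Cint}, Ptr{Cvoid}, Cint, Ptr{Cint}),
          C_NULL, C_NULL, MPI_THREAD_MULTIPLE, provided)
    provided[] == MPI_THREAD_MULTIPLE ||
        error("MPI library only provides thread level $(provided[])")

    Threads.@threads for i in 1:100
        A = rand(1000, 1000)
    end
    println("done")

    # OpenMPI-specific: MPI_COMM_WORLD is the address of the exported
    # ompi_mpi_comm_world global, so a barrier can be called like this.
    comm = cglobal((:ompi_mpi_comm_world, libmpi))
    ccall((:MPI_Barrier, libmpi), Cint, (Ptr{Cvoid},), comm)

    ccall((:MPI_Finalize, libmpi), Cint, ())
end

main()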

@simonbyrne
Member

* Your call to `MPI_Init_thread` requests `MPI_THREAD_MULTIPLE`, but does not check whether this is actually supported by the installed library. You need to check the provided thread level that is returned.

Yes, I printed the result: it returns the same value.

* It is also possible that there is some confusion between the MPI headers that are found by Julia, the shared MPI library that is used for linking, the `mpiexec` that you are using at run time, and the MPI library that is actually used at run time.

Yes, I tried it with explicitly specified paths for libmpi and mpiexec.

* Finally, do you see the segfaults at run time, or during shutdown? You could add a call to `MPI_Barrier` when your code is done to find out.

They are at runtime: with one rank, and printing between the loop and the finalize call, I get

├ cat threads.jl
const libmpi = expanduser("~/usr/lib/libmpi.so")

function main()
    r = Ref{Cint}()
    ccall((:MPI_Init_thread, libmpi), Cint,
          (Ptr{Cint},Ptr{Cvoid},Cint,Ptr{Cint}),
          C_NULL, C_NULL, Cint(3), r)
    @show r[]
    Threads.@threads for i in 1:100
        A = rand(1000,1000)
    end
    println("done")
    ccall((:MPI_Finalize, libmpi), Cint,
          (),)
end

main()

├ JULIA_NUM_THREADS=2 ~/usr/bin/mpirun -n 1 julia threads.jl
r[] = 3
[hpc-91-09:117009:0:117057] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2aaaaaaf3008)
==== backtrace (tid: 117057) ====
 0  /home/spjbyrne/usr/lib/libucs.so.0(ucs_handle_error+0x19c) [0x2aaaed9253dc]
 1  /home/spjbyrne/usr/lib/libucs.so.0(+0x2570c) [0x2aaaed92570c]
 2  /home/spjbyrne/usr/lib/libucs.so.0(+0x2597b) [0x2aaaed92597b]
 3  /central/software/julia/1.3.0/bin/../lib/libjulia.so.1(jl_gc_managed_malloc+0x74) [0x2aaaaad788c4]
 4  /central/software/julia/1.3.0/bin/../lib/libjulia.so.1(jl_alloc_array_2d+0x24c) [0x2aaaaad4a08c]
 5  [0x2aaac5335404]
 6  [0x2aaac5335609]
 7  [0x2aaac5334aff]
 8  [0x2aaac5334b1d]
 9  /central/software/julia/1.3.0/bin/../lib/libjulia.so.1(jl_apply_generic+0x53c) [0x2aaaaad2c79c]
10  /central/software/julia/1.3.0/bin/../lib/libjulia.so.1(+0x78e79) [0x2aaaaad47e79]
=================================

signal (11): Segmentation fault
in expression starting at /central/home/spjbyrne/src/MPI.jl/threads.jl:17
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 117009 on node hpc-91-09 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@simonbyrne
Member

One way forward would be to see if we can recreate the behavior in C: I guess this would consist of writing a multithreaded C program that internally calls malloc. Unfortunately this is beyond my C skills, so if someone else wanted to take a stab at it I would be grateful.

@simonbyrne
Member

Another datapoint: it works correctly with GC disabled:

function main()
    GC.enable(false)
    r = Ref{Cint}()
    ccall((:MPI_Init_thread, libmpi), Cint,
          (Ptr{Cint},Ptr{Cvoid},Cint,Ptr{Cint}),
          C_NULL, C_NULL, Cint(3), r)
    @show r[]
    Threads.@threads for i in 1:100
        A = rand(1000,1000)
    end
    println("done")
    ccall((:MPI_Finalize, libmpi), Cint,
          (),)
end

@simonbyrne
Member

Ah, running it with UCX_HANDLE_ERRORS=debug gives me the following stacktrace:

(gdb) backtrace
#0  0x00002aaaaad778d1 in _mm_pause ()
    at /usr/local/lib/gcc/x86_64-pc-linux-gnu/7.3.0/include/xmmintrin.h:1267
#1  jl_gc_wait_for_the_world ()
    at /buildworker/worker/package_linux64/build/src/gc.c:205
#2  jl_gc_collect (full=full@entry=0)
    at /buildworker/worker/package_linux64/build/src/gc.c:2897
#3  0x00002aaaaad77d0b in maybe_collect (ptls=0x2aaaaaaff7e0)
    at /buildworker/worker/package_linux64/build/src/gc.c:781
#4  jl_gc_pool_alloc (ptls=ptls@entry=0x2aaaaaaff7e0,
    pool_offset=pool_offset@entry=1496, osize=osize@entry=80)
    at /buildworker/worker/package_linux64/build/src/gc.c:1096
#5  0x00002aaaaad4a0aa in jl_gc_alloc_ (
    ty=0x2aaab7e08310 <jl_system_image_data+5402704>, sz=64,
    ptls=0x2aaaaaaff7e0)
    at /buildworker/worker/package_linux64/build/src/julia_internal.h:233
#6  _new_array_ (elsz=8, isunion=<optimized out>, isunboxed=<optimized out>,
    dims=<synthetic pointer>, ndims=2,
    atype=0x2aaab7e08310 <jl_system_image_data+5402704>)
    at /buildworker/worker/package_linux64/build/src/array.c:112
#7  _new_array (dims=<synthetic pointer>, ndims=2,
    atype=0x2aaab7e08310 <jl_system_image_data+5402704>)
    at /buildworker/worker/package_linux64/build/src/array.c:163
#8  jl_alloc_array_2d (atype=0x2aaab7e08310 <jl_system_image_data+5402704>,
    nr=1000, nc=1000)
    at /buildworker/worker/package_linux64/build/src/array.c:431
#9  0x00002aaac5335404 in ?? ()
#10 0x0000000000000004 in ?? ()
#11 0x0000000000000000 in ?? ()

#1 points to this "FIXME":
https://github.com/JuliaLang/julia/blob/248bc460bba587dd1c4741f434c5305218a1f87e/src/gc.c#L197-L206

@vchuravy
Member

IIUC, Julia uses signals to communicate between threads (e.g. to implement jl_gc_wait_for_the_world, seen in the backtrace above).

So the "error" above is normal; it is a case where you need to tell GDB to ignore that signal (e.g. handle SIGSEGV noprint nostop pass). This is probably the same problem we are encountering with UCX: it seems to capture signal 11 (SIGSEGV) and interpret it as an error.

@vchuravy
Member

@simonbyrne can you try running with UCX_ERROR_SIGNALS="" set as an environment variable?

@simonbyrne
Member

Hmm, that seems to work correctly.

@simonbyrne
Member

more data:

├ ~/usr/bin/mpirun -x JULIA_NUM_THREADS=2 -x UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE" -n 1 julia threads.jl
r[] = 3
done

├ ~/usr/bin/mpirun -x JULIA_NUM_THREADS=2 -x UCX_ERROR_SIGNALS="SIGILL,SIGSEGV,SIGBUS,SIGFPE" -n 1 julia threads.jl
r[] = 3
# segfault

@simonbyrne
Member

simonbyrne commented Jan 25, 2020

@vchuravy and @kpamnany narrowed this down to UCX intercepting signals, and according to the Julia developer docs:

The profiler uses SIGUSR2 for sampling and the garbage collector uses SIGSEGV for threads synchronization.

So basically users need to stop UCX from intercepting the SIGSEGV error signal, which can be done by setting the (undocumented) environment variable UCX_ERROR_SIGNALS to be empty, or to a list which excludes SIGSEGV (the default appears to be "SIGILL,SIGSEGV,SIGBUS,SIGFPE").
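
Concretely, that means either of the following in the launch environment (mirroring the runs above; older UCX releases may not honour this, see below):

export UCX_ERROR_SIGNALS=""                      # drop UCX's error handler entirely
export UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE"  # or keep it for everything except SIGSEGV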

@twhitehead Can you confirm if this fixes your users' problem?

@twhitehead
Author

Our user reports that this (setting UCX_ERROR_SIGNALS="") does not fix their problem. My testing also seems to indicate that, while it changes the nature of the error reported, it does not stop the sample code from crashing.

[tyson@gra797 julia]$ cat example.jl
using MPI

function main()
    MPI.Init()

    Threads.@threads for i in 1:2
        A = rand(10000,10000)
        A1 = inv(A)
    end

    MPI.Finalize()
end

main()

[tyson@gra797 julia]$ UCX_ERROR_SIGNALS= mpirun julia example.jl
[1580249018.518004] [gra799:22756:0]         parser.c:1369 UCX  WARN  unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1580249018.570226] [gra797:11768:0]         parser.c:1369 UCX  WARN  unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 22756 on node gra799 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I think, though, that it may still be UCX related, as has long been suggested. While looking at the shared libraries loaded, I noticed my earlier attempt to avoid UCX through MCA parameters was not quite complete, as UCX also gets pulled in via the OSC layer.

I'm currently getting no crashes when running as follows:

$ mpirun --mca pml ob1 --mca btl tcp,self --mca osc pt2pt julia example.jl

I'll play around more with variants (e.g., adding openib to the BTL layer) to ensure I'm not missing anything, verify with our user whether or not they exported the UCX_ERROR_SIGNALS= setting, and then get back to the ticket.

Thanks everyone for all the work digging into this.

@simonbyrne
Member

simonbyrne commented Jan 29, 2020

Interesting: it looks like it is passing on the node you are launching it from (gra797), but failing on the other one (gra799). Is the environment being passed correctly to the launched processes?

Does it work if you set the environment variable in your script, i.e. add

ENV["UCX_ERROR_SIGNALS"] = ""

to the top of example.jl?

If that works, we can add that to the __init__() function in MPI.jl
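
A minimal sketch of what such a check might look like (not the actual MPI.jl change; the signal list is just the default quoted above with SIGSEGV removed):

# Sketch: only touch UCX_ERROR_SIGNALS if the user hasn't set it themselves,
# leaving SIGSEGV free for Julia's GC to use for thread synchronization.
function __init__()
    if !haskey(ENV, "UCX_ERROR_SIGNALS")
        ENV["UCX_ERROR_SIGNALS"] = "SIGILL,SIGBUS,SIGFPE"
    end
end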

@twhitehead
Author

@simonbyrne that was a very good idea. Unfortunately it doesn't seem to work though (or at least not under OpenMPI 3.1.2 and UCX 1.5.2).

That is, I added a printout line at the top to verify it is set (and checked that it does throw an exception if the environment lookup fails)

print(gethostname(),": UCX_ERROR_SIGNALS=",ENV["UCX_ERROR_SIGNALS"],"\n")

but the program still crashes

[tyson@gra-login1 ~]$ UCX_ERROR_SIGNALS= JULIA_NUM_THREADS=2 salloc --ntasks 2 --nodes 2 --cpus-per-task 2 -t 3:0:0 --mem-per-cpu 2g -A def-tyson-ab
[tyson@gra2 ~]$ cd julia
[tyson@gra2 julia]$ mpirun julia example.jl
gra2: UCX_ERROR_SIGNALS=
gra19: UCX_ERROR_SIGNALS=
[1580332663.359764] [gra2:2357 :0]         parser.c:1369 UCX  WARN  unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1580332663.567514] [gra19:18975:0]         parser.c:1369 UCX  WARN  unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node gra2 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@simonbyrne
Member

Hmm, I see the same thing using the same versions of UCX and OpenMPI, but not with the latest versions, so I think you will need to upgrade.

simonbyrne added a commit that referenced this issue Mar 20, 2020
Adds `MPI.Init_thread` and the `ThreadLevel` enum, along with a threaded test.

Additionally, set the UCX_ERROR_SIGNALS environment variable if not already set to fix #337.
@s-fuerst
Contributor

s-fuerst commented Nov 5, 2021

FYI: I ran into the same problem; however, on our cluster (Lise@HLRN) OpenMPI is built without UCX support. Updating from 3.1.5 to 4.1.1 didn't help either. One potential workaround here is to change the PML from cm to ob1 with openib as the BTL; a second (better) solution is to use Intel MPI instead of OpenMPI.
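
For example, on an InfiniBand system the first workaround might look something like this (a hypothetical launch line; the BTL list has to match the fabric and the OpenMPI build on your cluster):

mpirun --mca pml ob1 --mca btl openib,vader,self -n 2 julia example.jl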
