segmentation faults when combined with @threads with memory allocation #337

Closed
twhitehead opened this issue Jan 10, 2020 · 24 comments · Fixed by #363

@twhitehead

twhitehead commented Jan 10, 2020

One of our users was having problems with their hybrid MPI/threaded Julia code segfaulting on our clusters.

OS: Linux (CentOS 7)
Julia: 1.3.0
OpenMPI: 3.1.2

I simplified their code down to the following demo:

using MPI

function main()
    MPI.Init()

    Threads.@threads for i in 1:100
        A = rand(1000,1000)
        A1 = inv(A)
        oops = A1[1.6]
    end

    MPI.Finalize()
end

main()

  • exceptions sometimes turn into segmentation faults inside @threads for loops
  • a reliable segmentation fault requires a reasonable amount of work in the loop

$ export JULIA_NUM_THREADS=2
$ mpirun -n 2 julia example.jl
[1578695090.632505] [gra797:13151:0]         parser.c:1369 UCX  WARN  unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1578695091.181319] [gra797:13152:0]         parser.c:1369 UCX  WARN  unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[gra797:13151:1:13155] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b6ccc961008)
[gra797:13152:1:13156] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b5877538008)
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x00000000000a9904 maybe_collect()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_threads.h:283
 2 0x00000000000a9904 jl_gc_managed_malloc()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:3116
 3 0x000000000007a160 _new_array_()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:109
 4 0x000000000007db3e jl_array_copy()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:1135
 5 0x000000000005d22c _jl_invoke()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gf.c:2135
 6 0x0000000000078e19 jl_apply()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia.h:1631
===================
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x00000000000a8d74 maybe_collect()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_threads.h:283
 2 0x00000000000a8d74 jl_gc_pool_alloc()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:1096
 3 0x000000000005d22c _jl_invoke()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gf.c:2135
 4 0x0000000000078e19 jl_apply()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia.h:1631
===================
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 13151 on node gra797 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Here is some possibly relevant info from ompi_info as well:

...
  Configure command line: '--prefix=/cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/gcc7.3/openmpi/3.1.2'
                          '--build=x86_64-pc-linux-gnu'
                          '--host=x86_64-pc-linux-gnu' '--enable-shared'
                          '--with-verbs' '--enable-mpirun-prefix-by-default'
                          '--with-hwloc=external' '--without-usnic'
                          '--with-ucx' '--disable-wrapper-runpath'
                          '--disable-wrapper-rpath' '--with-munge'
                          '--with-slurm' '--with-pmi=/opt/software/slurm'
                          '--enable-mpi-cxx' '--with-hcoll'
                          '--disable-show-load-errors-by-default'
                          '--enable-mca-dso=common-libfabric,common-ofi,common-verbs,atomic-mxm,btl-openib,btl-scif,coll-fca,coll-hcoll,ess-tm,fs-lustre,mtl-mxm,mtl-ofi,mtl-psm,mtl-psm2,osc-ucx,oob-ud,plm-tm,pmix-s1,pmix-s2,pml-ucx,pml-yalla,pnet-opa,psec-munge,ras-tm,rml-ofi,scoll-mca,sec-munge,spml-ikrit,'
...
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, ORTE progress: yes, Event lib:
                          yes)
...

EDIT: Removed the exception bit as, as noted below, it isn't required.

@simonbyrne
Member

What version of MPI.jl are you using? You can get it via julia -e 'using Pkg; Pkg.status()'

@simonbyrne
Member

simonbyrne commented Jan 10, 2020

Hmm, I can reproduce on our cluster (with openmpi 3.1.4 and 4.0.1). It might have something to do with openmpi/UCX's funny malloc hooks, which have caused problems before. cc'ing @kpamnany and @vchuravy who might have some ideas?

mpich doesn't seem to have this problem, so I would suggest using that in the meantime?

@simonbyrne simonbyrne added the bug label Jan 10, 2020
@twhitehead
Author

Noticed after I wrote this that the exception part is not critical.

The program simply has to spend enough time (or maybe do enough allocating?) inside the @threads for loop and you get a crash (e.g., I get reliable crashes with 1000x1000 matrices, but when I reduce them to 10x10 it rarely crashes).

using MPI

function main()
    MPI.Init()

    Threads.@threads for i in 1:2
        A = rand(1000,1000)
        A1 = inv(A)
    end

    MPI.Finalize()
end

main()

I'll adjust the title accordingly.

@twhitehead twhitehead changed the title segmentation faults when combined with @threads and exceptions segmentation faults when combined with @threads with sufficient work Jan 10, 2020
@simonbyrne
Member

simonbyrne commented Jan 11, 2020

It's something to do with how MPI and Julia malloc interact. The simplest example (which doesn't require this package) that I can reproduce is:

# threads.jl
function main()
    ccall((:MPI_Init, :libmpi), Nothing, (Ptr{Cint},Ptr{Cint}), C_NULL, C_NULL)
    Threads.@threads for i in 1:100
        A = rand(1000,1000)
    end
end

main()

then calling

JULIA_NUM_THREADS=2 mpiexec -n 2 julia threads.jl

@simonbyrne simonbyrne changed the title segmentation faults when combined with @threads with sufficient work segmentation faults when combined with @threads with memory allocation Jan 11, 2020
@simonbyrne
Member

My guess is that it's due to UCX, which has caused similar problems in the past (#298).

They've fixed quite a few issues on their main branch, but haven't made a new release for quite a while (I think they've made some breaking API changes, so need to coordinate with openmpi). Hopefully this is one of those that will get fixed.

@simonbyrne
Member

I tried this with a master build of UCX + OpenMPI and see the same issue.

@kpamnany
Contributor

Try calling MPI_Init_thread with MPI_THREAD_MULTIPLE instead of MPI_Init? Just to rule that out.

@simonbyrne
Member

Yes, that seems to give the same result.

function main()
    r = Ref{Cint}()
    ccall((:MPI_Init_thread, :libmpi), Cint,
          (Ptr{Cint},Ptr{Cvoid},Cint,Ptr{Cint}),
          C_NULL, C_NULL, Cint(3), r)
    @show r[]
    Threads.@threads for i in 1:100
        A = rand(1000,1000)
    end
end

(MPI_THREAD_MULTIPLE == 3)

@twhitehead
Author

twhitehead commented Jan 12, 2020

It still blows up for me with OpenMPI 3.1.2 when I force it to use just the tcp and self BTLs (this is after allocating 2 tasks with 2 cores each and setting JULIA_NUM_THREADS=2).

$ mpirun --mca pml ob1 --mca btl tcp,self julia example.jl
[gra797:13477:0:13484] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2b984f48b008)
[gra797:13478:0:13485] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2af4b14e9008)
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x00000000000a8cf4 maybe_collect()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_threads.h:283
 2 0x00000000000a8cf4 jl_gc_big_alloc()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:837
 3 0x00000000000a9300 jl_gc_alloc_()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_internal.h:238
 4 0x00000000000a9300 jl_gc_alloc_()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_internal.h:240
 5 0x00000000000a9300 jl_gc_alloc()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:2940
 6 0x000000000007ac7e _new_array_()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:100
 7 0x000000000007ac7e _new_array()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:163
 8 0x000000000007ac7e jl_alloc_array_1d()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:424
 9 0x000000000005d22c _jl_invoke()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gf.c:2135
10 0x0000000000078e19 jl_apply()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia.h:1631
===================
==== backtrace ====
 0 0x0000000000010e90 __funlockfile()  ???:0
 1 0x00000000000a8cf4 maybe_collect()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_threads.h:283
 2 0x00000000000a8cf4 jl_gc_big_alloc()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:837
 3 0x00000000000a9300 jl_gc_alloc_()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_internal.h:238
 4 0x00000000000a9300 jl_gc_alloc_()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia_internal.h:240
 5 0x00000000000a9300 jl_gc_alloc()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gc.c:2940
 6 0x000000000007ac7e _new_array_()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:100
 7 0x000000000007ac7e _new_array()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:163
 8 0x000000000007ac7e jl_alloc_array_1d()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/array.c:424
 9 0x000000000005d22c _jl_invoke()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/gf.c:2135
10 0x0000000000078e19 jl_apply()  /dev/shm/ebuser/avx2/Julia/1.3.0/gmkl-2018.3/julia-1.3.0/src/julia.h:1631
===================
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 13477 on node gra797 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@eschnett
Contributor

A few generic comments (that might not apply here, but nevertheless):

  • Your call to MPI_Init_thread requests MPI_THREAD_MULTIPLE, but does not check whether this is actually supported by the installed library. You need to check the provided thread level that is returned.
  • It is also possible that there is some confusion between the MPI headers that are found by Julia, the shared MPI library that is used for linking, the mpiexec that you are using at run time, and the MPI library that is actually used at run time.
  • Finally, do you see the segfaults at run time, or during shutdown? You could add a call to MPI_Barrier when your code is done to find out (see the sketch below).
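
A minimal sketch of the first and third checks, building on the raw-ccall snippets above (it assumes libmpi points at an OpenMPI build; obtaining MPI_COMM_WORLD from the exported ompi_mpi_comm_world symbol is OpenMPI-specific):

# Sketch only: verify the provided thread level, then synchronize before finalizing.
const libmpi = expanduser("~/usr/lib/libmpi.so")  # adjust to your installation

const MPI_THREAD_MULTIPLE = Cint(3)

function main()
    provided = Ref{Cint}()
    ccall((:MPI_Init_thread, libmpi), Cint,
          (Ptr{Cint}, Ptr{Cvoid}, Cint, Ptr{Cint}),
          C_NULL, C_NULL, MPI_THREAD_MULTIPLE, provided)
    provided[] == MPI_THREAD_MULTIPLE ||
        error("MPI library only provides thread level $(provided[])")

    Threads.@threads for i in 1:100
        A = rand(1000, 1000)
    end
    println("done")

    # OpenMPI-specific: MPI_COMM_WORLD is the address of the exported
    # ompi_mpi_comm_world global, so a barrier can be called like this.
    comm = cglobal((:ompi_mpi_comm_world, libmpi))
    ccall((:MPI_Barrier, libmpi), Cint, (Ptr{Cvoid},), comm)

    ccall((:MPI_Finalize, libmpi), Cint, ())
end

main()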

@simonbyrne
Member

* Your call to `MPI_Init_thread` requests `MPI_THREAD_MULTIPLE`, but does not check whether this is actually supported by the installed library. You need to check the provided thread level that is returned.

Yes, I printed the result: it returns the same value.

* It is also possible that there is some confusion between the MPI headers that are found by Julia, the shared MPI library that is used for linking, the `mpiexec` that you are using at run time, and the MPI library that is actually used at run time.

Yes, I tried it with explicitly specified paths for libmpi and mpiexec.

* Finally, do you see the segfaults at run time, or during shutdown? You could add a call to `MPI_Barrier` when your code is done to find out.

They are at runtime: with one rank, and printing between the loop and the finalize call, I get

├ cat threads.jl
const libmpi = expanduser("~/usr/lib/libmpi.so")

function main()
    r = Ref{Cint}()
    ccall((:MPI_Init_thread, libmpi), Cint,
          (Ptr{Cint},Ptr{Cvoid},Cint,Ptr{Cint}),
          C_NULL, C_NULL, Cint(3), r)
    @show r[]
    Threads.@threads for i in 1:100
        A = rand(1000,1000)
    end
    println("done")
    ccall((:MPI_Finalize, libmpi), Cint,
          (),)
end

main()

├ JULIA_NUM_THREADS=2 ~/usr/bin/mpirun -n 1 julia threads.jl
r[] = 3
[hpc-91-09:117009:0:117057] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2aaaaaaf3008)
==== backtrace (tid: 117057) ====
 0  /home/spjbyrne/usr/lib/libucs.so.0(ucs_handle_error+0x19c) [0x2aaaed9253dc]
 1  /home/spjbyrne/usr/lib/libucs.so.0(+0x2570c) [0x2aaaed92570c]
 2  /home/spjbyrne/usr/lib/libucs.so.0(+0x2597b) [0x2aaaed92597b]
 3  /central/software/julia/1.3.0/bin/../lib/libjulia.so.1(jl_gc_managed_malloc+0x74) [0x2aaaaad788c4]
 4  /central/software/julia/1.3.0/bin/../lib/libjulia.so.1(jl_alloc_array_2d+0x24c) [0x2aaaaad4a08c]
 5  [0x2aaac5335404]
 6  [0x2aaac5335609]
 7  [0x2aaac5334aff]
 8  [0x2aaac5334b1d]
 9  /central/software/julia/1.3.0/bin/../lib/libjulia.so.1(jl_apply_generic+0x53c) [0x2aaaaad2c79c]
10  /central/software/julia/1.3.0/bin/../lib/libjulia.so.1(+0x78e79) [0x2aaaaad47e79]
=================================

signal (11): Segmentation fault
in expression starting at /central/home/spjbyrne/src/MPI.jl/threads.jl:17
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 117009 on node hpc-91-09 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@simonbyrne
Member

One way forward would be to see if we can recreate the behavior in C: I guess this would consist of writing a multithreaded C program that internally calls malloc. Unfortunately this is beyond my C skills, so if someone else wanted to take a stab at it I would be grateful.

@simonbyrne
Member

Another datapoint: it works correctly with GC disabled:

function main()
    GC.enable(false)
    r = Ref{Cint}()
    ccall((:MPI_Init_thread, libmpi), Cint,
          (Ptr{Cint},Ptr{Cvoid},Cint,Ptr{Cint}),
          C_NULL, C_NULL, Cint(3), r)
    @show r[]
    Threads.@threads for i in 1:100
        A = rand(1000,1000)
    end
    println("done")
    ccall((:MPI_Finalize, libmpi), Cint,
          (),)
end

@simonbyrne
Member

Ah, running it with UCX_HANDLE_ERRORS=debug gives me the following stacktrace:

(gdb) backtrace
#0  0x00002aaaaad778d1 in _mm_pause ()
    at /usr/local/lib/gcc/x86_64-pc-linux-gnu/7.3.0/include/xmmintrin.h:1267
#1  jl_gc_wait_for_the_world ()
    at /buildworker/worker/package_linux64/build/src/gc.c:205
#2  jl_gc_collect (full=full@entry=0)
    at /buildworker/worker/package_linux64/build/src/gc.c:2897
#3  0x00002aaaaad77d0b in maybe_collect (ptls=0x2aaaaaaff7e0)
    at /buildworker/worker/package_linux64/build/src/gc.c:781
#4  jl_gc_pool_alloc (ptls=ptls@entry=0x2aaaaaaff7e0,
    pool_offset=pool_offset@entry=1496, osize=osize@entry=80)
    at /buildworker/worker/package_linux64/build/src/gc.c:1096
#5  0x00002aaaaad4a0aa in jl_gc_alloc_ (
    ty=0x2aaab7e08310 <jl_system_image_data+5402704>, sz=64,
    ptls=0x2aaaaaaff7e0)
    at /buildworker/worker/package_linux64/build/src/julia_internal.h:233
#6  _new_array_ (elsz=8, isunion=<optimized out>, isunboxed=<optimized out>,
    dims=<synthetic pointer>, ndims=2,
    atype=0x2aaab7e08310 <jl_system_image_data+5402704>)
    at /buildworker/worker/package_linux64/build/src/array.c:112
#7  _new_array (dims=<synthetic pointer>, ndims=2,
    atype=0x2aaab7e08310 <jl_system_image_data+5402704>)
    at /buildworker/worker/package_linux64/build/src/array.c:163
#8  jl_alloc_array_2d (atype=0x2aaab7e08310 <jl_system_image_data+5402704>,
    nr=1000, nc=1000)
    at /buildworker/worker/package_linux64/build/src/array.c:431
#9  0x00002aaac5335404 in ?? ()
#10 0x0000000000000004 in ?? ()
#11 0x0000000000000000 in ?? ()

#1 points to this "FIXME":
https://github.com/JuliaLang/julia/blob/248bc460bba587dd1c4741f434c5305218a1f87e/src/gc.c#L197-L206

@vchuravy
Member

IIUC, Julia uses signals to communicate between threads (e.g. to implement jl_gc_wait_for_the_world, seen in the backtrace above).

So the "error" above is normal; it is a case where you need to tell GDB to ignore that signal (e.g. handle SIGSEGV noprint nostop pass). This is probably the same problem we are encountering with UCX: it seems to capture signal 11 (SIGSEGV) and interpret it as an error.

@vchuravy
Member

@simonbyrne can you try running with UCX_ERROR_SIGNALS="" set as an environment variable?

@simonbyrne
Member

Hmm, that seems to work correctly.

@simonbyrne
Member

more data:

├ ~/usr/bin/mpirun -x JULIA_NUM_THREADS=2 -x UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE" -n 1 julia threads.jl
r[] = 3
done

├ ~/usr/bin/mpirun -x JULIA_NUM_THREADS=2 -x UCX_ERROR_SIGNALS="SIGILL,SIGSEGV,SIGBUS,SIGFPE" -n 1 julia threads.jl
r[] = 3
# segfault

@simonbyrne
Member

simonbyrne commented Jan 25, 2020

@vchuravy and @kpamnany narrowed this down to UCX intercepting signals, and according to the Julia developer docs:

The profiler uses SIGUSR2 for sampling and the garbage collector uses SIGSEGV for threads synchronization.

So basically users need to stop UCX from intercepting the SIGSEGV error signal, which can be done by setting the (undocumented) environment variable UCX_ERROR_SIGNALS to be empty, or to a list which excludes SIGSEGV (the default appears to be "SIGILL,SIGSEGV,SIGBUS,SIGFPE").
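
Concretely, that means either of the following in the launch environment (mirroring the runs above; older UCX releases may not honour this, see below):

export UCX_ERROR_SIGNALS=""                      # drop UCX's error handler entirely
export UCX_ERROR_SIGNALS="SIGILL,SIGBUS,SIGFPE"  # or keep it for everything except SIGSEGV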

@twhitehead Can you confirm if this fixes your users' problem?

@twhitehead
Author

Our user reports that this (setting UCX_ERROR_SIGNALS="") does not fix their problem. My testing also seems to indicate that, while it changes the nature of the error reported, it does not stop the sample code from crashing.

[tyson@gra797 julia]$ cat example.jl
using MPI

function main()
    MPI.Init()

    Threads.@threads for i in 1:2
        A = rand(10000,10000)
        A1 = inv(A)
    end

    MPI.Finalize()
end

main()

[tyson@gra797 julia]$ UCX_ERROR_SIGNALS= mpirun julia example.jl
[1580249018.518004] [gra799:22756:0]         parser.c:1369 UCX  WARN  unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1580249018.570226] [gra797:11768:0]         parser.c:1369 UCX  WARN  unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 22756 on node gra799 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I think, though, that it may still be UCX related, as has long been suggested. While looking at the shared libraries loaded, I noticed my earlier attempt to avoid UCX through MCA parameters was not quite complete, as UCX also gets pulled in via the OSC layer.

I'm currently getting no crashes when running as follows:

$ mpirun --mca pml ob1 --mca btl tcp,self --mca osc pt2pt julia example.jl

I'll play around more with variants (e.g., adding openib to the BTL layer) to ensure I'm not missing anything, verify with our user whether or not they exported the UCX_ERROR_SIGNALS= setting, and then get back to the ticket.

Thanks everyone for all the work digging into this.

@simonbyrne
Member

simonbyrne commented Jan 29, 2020

Interesting: it looks like it is passing on the node you are launching it from (gra797), but failing on the other one (gra799). Is the environment being passed correctly to the launched processes?

Does it work if you set the environment variable in your script, i.e. add

ENV["UCX_ERROR_SIGNALS"] = ""

to the top of example.jl?

If that works, we can add that to the __init__() function in MPI.jl
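
A minimal sketch of what such a check might look like (not the actual MPI.jl change; the signal list is just the default quoted above with SIGSEGV removed):

# Sketch: only touch UCX_ERROR_SIGNALS if the user hasn't set it themselves,
# leaving SIGSEGV free for Julia's GC to use for thread synchronization.
function __init__()
    if !haskey(ENV, "UCX_ERROR_SIGNALS")
        ENV["UCX_ERROR_SIGNALS"] = "SIGILL,SIGBUS,SIGFPE"
    end
end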

@twhitehead
Author

@simonbyrne that was a very good idea. Unfortunately it doesn't seem to work though (or at least not under OpenMPI 3.1.2 and UCX 1.5.2).

That is, I added a printout line at the top to verify it is set (and checked that it does throw an exception if the environment lookup fails)

print(gethostname(),": UCX_ERROR_SIGNALS=",ENV["UCX_ERROR_SIGNALS"],"\n")

but the program still crashes

[tyson@gra-login1 ~]$ UCX_ERROR_SIGNALS= JULIA_NUM_THREADS=2 salloc --ntasks 2 --nodes 2 --cpus-per-task 2 -t 3:0:0 --mem-per-cpu 2g -A def-tyson-ab
[tyson@gra2 ~]$ cd julia
[tyson@gra2 julia]$ mpirun julia example.jl
gra2: UCX_ERROR_SIGNALS=
gra19: UCX_ERROR_SIGNALS=
[1580332663.359764] [gra2:2357 :0]         parser.c:1369 UCX  WARN  unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1580332663.567514] [gra19:18975:0]         parser.c:1369 UCX  WARN  unused env variable: UCX_MEM_MMAP_RELOC (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node gra2 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

@simonbyrne
Member

Hmm, I see the same thing using the same versions of UCX and OpenMPI, but not with the latest versions, so I think you will need to upgrade.

simonbyrne added a commit that referenced this issue Mar 20, 2020
Adds `MPI.Init_thread` and the `ThreadLevel` enum, along with a threaded test.

Additionally, set the UCX_ERROR_SIGNALS environment variable if not already set to fix #337.
@s-fuerst
Contributor

s-fuerst commented Nov 5, 2021

FYI: I ran into the same problem; however, on our cluster (Lise@HLRN) OpenMPI is built without UCX support. Updating from 3.1.5 to 4.1.1 didn't help either. One potential workaround here is to change the PML from cm to ob1 with openib as the BTL; a second (better) solution is to use Intel MPI instead of OpenMPI.
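
For example, on an InfiniBand system the first workaround might look something like this (a hypothetical launch line; the BTL list has to match the fabric and the OpenMPI build on your cluster):

mpirun --mca pml ob1 --mca btl openib,vader,self -n 2 julia example.jl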
