Intermittent segfault in libdl test on linux64 buildbots #13719

Closed
tkelman opened this issue Oct 22, 2015 · 26 comments · Fixed by #22828
Labels
system:linux Affects only Linux

Comments

@tkelman
Contributor

tkelman commented Oct 22, 2015

See recent failures at http://buildbot.e.ip.saba.us:8010/builders/build_ubuntu12.04-x64?numbuilds=250 and http://buildbot.e.ip.saba.us:8010/builders/build_centos7.1-x64?numbuilds=250

    From worker 3:       * enums                 in   1.33 seconds, maxrss 1165.46 MB
    From worker 2:       * misc                  in  24.35 seconds, maxrss 1406.89 MB
    From worker 2:       * i18n                  in   0.03 seconds, maxrss 1406.89 MB
    From worker 2:       * workspace             in   0.35 seconds, maxrss 1406.89 MB

signal (11): Segmentation fault
_IO_feof at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x7f8cf8b107dd)
anonymous at essentials.jl:116
cd at file.jl:22
jl_apply_generic at /home/centos/buildbot/slave/build_centos7_1-x64/build/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 0x7f8cf8b0d7a3)
unknown function (ip: 0x7f8cf8b0cbe1)
unknown function (ip: 0x7f8cf8b0e038)
unknown function (ip: 0x7f8cf8b0e1ee)
unknown function (ip: 0x7f8cf8b219df)
unknown function (ip: 0x7f8cf8b222a9)
jl_load_file_string at /home/centos/buildbot/slave/build_centos7_1-x64/build/usr/bin/../lib/libjulia.so (unknown line)
include_string at essentials.jl:116
jl_apply_generic at /home/centos/buildbot/slave/build_centos7_1-x64/build/usr/bin/../lib/libjulia.so (unknown line)
include_from_node1 at ./loading.jl:387
jl_apply_generic at /home/centos/buildbot/slave/build_centos7_1-x64/build/usr/bin/../lib/libjulia.so (unknown line)
runtests at util.jl:179
jlcall_runtests_22758 at  (unknown line)
jl_apply_generic at /home/centos/buildbot/slave/build_centos7_1-x64/build/usr/bin/../lib/libjulia.so (unknown line)
anonymous at /home/centos/buildbot/slave/build_centos7_1-x64/build/test/runtests.jl:36
jl_f_apply at /home/centos/buildbot/slave/build_centos7_1-x64/build/usr/bin/../lib/libjulia.so (unknown line)
anonymous at multi.jl:898
run_work_thunk at multi.jl:651
jlcall_run_work_thunk_22669 at  (unknown line)
jl_apply_generic at /home/centos/buildbot/slave/build_centos7_1-x64/build/usr/bin/../lib/libjulia.so (unknown line)
anonymous at multi.jl:898
unknown function (ip: 0x7f8cf8b13dc8)
unknown function (ip: (nil))
    From worker 2:       * libdl                Worker 2 terminated.
ERROR (unhandled task failure): EOFError: read end of file
    From worker 4:       * int                   in   1.33 seconds, maxrss  156.14 MB
    From worker 4:       * intset                in   0.75 seconds, maxrss  156.14 MB
    From worker 4:       * floatfuncs            in   1.66 seconds, maxrss  156.14 MB
    From worker 3:       * cmdlineargs           in  28.07 seconds, maxrss 1175.49 MB
    From worker 3:       * fft                   in  24.96 seconds, maxrss 1226.80 MB
    From worker 4:       * parallel             
    From worker 4:   in  35.81 seconds, maxrss  156.14 MB
    From worker 3:       * dsp                   in  15.16 seconds, maxrss 1229.05 MB
    From worker 4:       * examples              in  22.73 seconds, maxrss  191.95 MB
Exception running test libdl :
ProcessExitedException()
ERROR: LoadError: Some tests exited with errors.
 [inlined code] from error.jl:21
 in anonymous at /home/centos/buildbot/slave/build_centos7_1-x64/build/test/runtests.jl:64
 in cd at file.jl:22
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:384
 [inlined code] from ./operators.jl:313
 in process_options at ./client.jl:277
 in _start at ./client.jl:377
while loading /home/centos/buildbot/slave/build_centos7_1-x64/build/test/runtests.jl, in expression starting on line 13
tkelman added the system:linux and test labels (system:32-bit and system:arm were added and then removed) Oct 22, 2015
@kshyatt
Contributor

kshyatt commented Nov 25, 2015

Is this still happening, or can we close?

@tkelman
Contributor Author

tkelman commented Nov 25, 2015

Still happening. I've gotten it to reproduce reliably on a local 64-bit Linux machine (using LLVM 3.7.0, which may or may not be related since the buildbots all use 3.3), and I have been delta-debugging it for several days to reduce the number of tests needed to reproduce it.

@kshyatt
Contributor

kshyatt commented Nov 25, 2015

Is this something where you would be helped by more eyes/hands?

@tkelman
Contributor Author

tkelman commented Nov 25, 2015

If anyone else can reproduce reliably, it will likely need someone who knows better than I do what to look for to fix it. It's most repeatable when running all tests in a single process via make testall1.

@tkelman
Contributor Author

tkelman commented Dec 1, 2015

I've reduced the list of tests a bit here locally, but it's still a pretty long list where removing any one of them causes the failure to go away. Anyone else see this locally, on a make testall1 or Base.runtests("all", 1), or otherwise?

@tkelman
Contributor Author

tkelman commented Dec 4, 2015

Whoa, we had a whole string of these: http://buildbot.e.ip.saba.us:8010/builders/build_centos7.1-x64?numbuilds=100

Really, no one but me has seen this locally?

edit: this may be a consequence of "too many people use ubuntu" http://buildbot.e.ip.saba.us:8010/builders/build_ubuntu14.04-x64?numbuilds=100

@tkelman
Contributor Author

tkelman commented Dec 8, 2015

9 days and counting since successful nightlies. http://build.julialang.org:8010/builders/build_centos7.1-x64?numbuilds=250

Someone who understands how dlopen works should probably try a docker container or vm of centos 7 to reproduce this.

@vtjnash
Sponsor Member

vtjnash commented Dec 21, 2015

tentatively closing as 5a66fba seems to have fixed the buildbot

vtjnash added the backport pending 0.4 label and removed the test label Dec 21, 2015
@tkelman
Contributor Author

tkelman commented Dec 21, 2015

It looks like this does not occur on the problematic buildbot for release-0.4, but it might not hurt to backport anyway?

vtjnash reopened this Dec 23, 2015
@vtjnash
Sponsor Member

vtjnash commented Dec 23, 2015

I've decided to reopen this, since the commit was more of a band-aid and doesn't address the underlying problem: in short, Julia has probably allocated too much memory.

In particular, we are probably running into the behavior noted in the following thread, where fork (unlike mmap) does not allow any memory overcommit, and > 50% of physical memory is already allocated (swap is disabled): https://lkml.org/lkml/2009/2/11/319

@carnaval do you see any issue with Julia marking its memory pool as MADV_DONTFORK? That should also help fork performance.
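
For readers unfamiliar with the flag, here is a minimal sketch of what MADV_DONTFORK does to a mapping across fork. It is written in Python for brevity rather than the C that Julia's runtime would use, it assumes Linux and Python 3.8 or later, and the function name and the anonymous mapping standing in for a "memory pool" are illustrative assumptions, not anything from Julia's source.

```python
import mmap
import os


def child_sees_mapping(dontfork, size=1 << 20):
    """Fork once and report whether the child still has the parent's mapping.

    The anonymous private mapping is only a stand-in for a runtime's memory
    pool (an assumption for illustration).
    """
    pool = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE)
    pool[:4] = b"data"  # touch a page so the mapping is actually in use
    if dontfork:
        # Ask the kernel not to carry this range over into forked children.
        pool.madvise(mmap.MADV_DONTFORK)

    pid = os.fork()
    if pid == 0:
        # Child: probe the range with a harmless madvise call. The kernel
        # reports ENOMEM (raised as OSError) if the range is not mapped here.
        try:
            pool.madvise(mmap.MADV_NORMAL)
            os._exit(0)
        except OSError:
            os._exit(1)

    _, status = os.waitpid(pid, 0)
    pool.close()
    return os.WEXITSTATUS(status) == 0
```

Calling `child_sees_mapping(False)` and `child_sees_mapping(True)` side by side shows the difference: without the flag the child inherits the range copy-on-write, and with it the range is simply absent in the child, which is why anything the child needs from that memory has to be copied elsewhere first.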

@yuyichao
Contributor

> marking its memory pool as MADV_DONTFORK

Will this fail if the argument to exec is a pointer to julia string?

@yuyichao
Contributor

And does vfork or posix_spawn have the same issue?

@Keno
Member

Keno commented Dec 23, 2015

Isn't the whole point of fork that it's supposed to use COW pages? Why does it run out of memory?

@vtjnash
Sponsor Member

vtjnash commented Dec 23, 2015

COW pages are still counted against the (new) process, and there isn't enough physical memory in the system to permit that without overcommitting (which the linux kernel implementation of fork will refuse to do)

this doesn't affect vfork, while the behavior of posix_spawn in this regard is implementation defined, but probably follows the behavior of fork
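
To check how close a machine is to this situation, the kernel's commit accounting can be read from /proc/meminfo. The sketch below (Python, with a hypothetical helper name) parses the real `CommitLimit` and `Committed_AS` fields. How strictly the kernel enforces the limit depends on the `vm.overcommit_memory` setting, which this thread does not state for the buildbots, so treat the result as a diagnostic hint only. With swap disabled and the default `vm.overcommit_ratio` of 50, `CommitLimit` works out to half of physical memory, which is consistent with the "> 50%" figure mentioned earlier.

```python
def commit_headroom_kb(meminfo_text):
    """Return CommitLimit minus Committed_AS, in kB, from /proc/meminfo text.

    A negative result means more memory is already committed than the
    limit allows.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in ("CommitLimit", "Committed_AS"):
            fields[name] = int(rest.split()[0])  # values are reported in kB
    return fields["CommitLimit"] - fields["Committed_AS"]
```

On a live Linux system the input would come from `open("/proc/meminfo").read()`.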

> Will this fail if the argument to exec is a pointer to julia string?

yes (although a stack copy would be trivial). more generally, it would also mean you can't simply fork julia to run two copies (although i doubt that'll work very well anyways due to the file descriptors all being shared)
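
The copy mentioned here can be sketched as follows, again in Python as a stand-in for C. The argument is copied out of a MADV_DONTFORK region before forking, so the child can still pass it to exec. The function name, the pipe plumbing, and the use of `sys.executable` as the program to exec are assumptions made only so the example is self-contained. In C the copy would typically live on the stack, whereas here it is an ordinary `bytes` object.

```python
import mmap
import os
import sys

# Tiny program for the child to exec: it prints its first argument.
ECHO = "import sys; sys.stdout.write(sys.argv[1])"


def exec_with_copied_arg(message):
    """Store `message` in a DONTFORK mapping, copy it out, then fork and exec."""
    pool = mmap.mmap(-1, mmap.PAGESIZE, flags=mmap.MAP_PRIVATE)
    pool[:len(message)] = message
    pool.madvise(mmap.MADV_DONTFORK)  # the child will not have this range

    arg = bytes(pool[:len(message)])  # copy made before fork, outside the pool

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.dup2(write_fd, 1)  # send the child's stdout into the pipe
            os.close(read_fd)
            os.close(write_fd)
            os.execv(sys.executable, [sys.executable, "-c", ECHO, arg])
        finally:
            os._exit(127)  # only reached if exec failed

    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as reader:
        received = reader.read()
    os.waitpid(pid, 0)
    pool.close()
    return received
```

The child never touches `pool`. It only uses `arg`, which lives in memory that fork does carry over.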

@yuyichao
Contributor

yuyichao commented Jan 9, 2016

Is the issue here the virtual address space size or the memory we are actually using? And is it affected by MADV_DONTNEED?

I'm wondering if we can just specify MADV_DONTNEED when we free the pages and set it back when we allocate them again.

@vtjnash
Sponsor Member

vtjnash commented Jan 9, 2016

There appears to be no mechanism to reduce the commit size: MADV_DONTNEED does not change it, only MADV_DONTFORK does.

What counts is the number of pages that have actually been committed; reserved pages don't count.

@afbarnard

I am currently experiencing this problem as well. I thought maybe running the tests single-threaded would avoid the problem, but no luck. So, both make testall1 and make testall crash with a segfault in the libdl test. It looks like there is a memory leak, so maybe that is the source of the memory pressure.

Details:

  • Linux 4.2.8-200.fc22.x86_64 (Fedora 22)
  • libc-2.21
  • Julia commit e3ad75f (release-0.4)

Here is the output I get from make testall1. Notice the monotonically increasing maxrss.

[julia-0.4]$ nice make testall1
    JULIA test/all
     * linalg/triangular     in 100.93 seconds, maxrss  427.25 MB
     * linalg/qr             in  28.52 seconds, maxrss  498.93 MB
     * linalg/dense          in  24.70 seconds, maxrss  547.99 MB
     * linalg/matmul         in   8.20 seconds, maxrss  574.98 MB
     * linalg/schur          in   5.26 seconds, maxrss  577.95 MB
     * linalg/special        in   3.07 seconds, maxrss  582.96 MB
     * linalg/eigen          in   1.52 seconds, maxrss  593.65 MB
     * linalg/bunchkaufman   in   0.71 seconds, maxrss  594.37 MB
     * linalg/svd            in   3.99 seconds, maxrss  596.94 MB
     * linalg/lapack         in  12.77 seconds, maxrss  604.21 MB
     * linalg/tridiag        in   5.43 seconds, maxrss  614.03 MB
     * linalg/bidiag         in   7.67 seconds, maxrss  641.57 MB
     * linalg/diagonal       in   9.35 seconds, maxrss  659.53 MB
     * linalg/pinv           in   4.97 seconds, maxrss  742.71 MB
     * linalg/givens         in   1.48 seconds, maxrss  742.71 MB
     * linalg/cholesky       in   6.11 seconds, maxrss  742.71 MB
     * linalg/lu             in  10.45 seconds, maxrss  742.71 MB
     * linalg/symmetric      in   5.57 seconds, maxrss  742.71 MB
     * linalg/generic        in   2.13 seconds, maxrss  742.71 MB
     * linalg/uniformscaling in   1.10 seconds, maxrss  742.71 MB
     * linalg/arnoldi        in   8.54 seconds, maxrss  742.71 MB
     * core                  in  15.01 seconds, maxrss  742.71 MB
     * keywordargs           in   1.19 seconds, maxrss  751.81 MB
     * numbers               in  64.03 seconds, maxrss  829.46 MB
     * printf                in   3.20 seconds, maxrss  831.95 MB
     * char                  in   0.53 seconds, maxrss  831.99 MB
     * string                in  16.49 seconds, maxrss  914.32 MB
     * triplequote           in   0.19 seconds, maxrss  914.32 MB
     * unicode               in  11.08 seconds, maxrss  914.81 MB
     * dates                 in  53.19 seconds, maxrss  915.14 MB
     * dict                  in  10.32 seconds, maxrss  915.14 MB
     * hashing               in   5.15 seconds, maxrss  915.14 MB
     * remote                in   0.41 seconds, maxrss  915.14 MB
     * iobuffer              in   1.21 seconds, maxrss  915.14 MB
     * staged                in   0.68 seconds, maxrss  915.14 MB
     * arrayops              in  33.07 seconds, maxrss 1056.38 MB
     * tuple                 in   0.95 seconds, maxrss 1056.38 MB
     * subarray              in 517.67 seconds, maxrss 1325.88 MB
     * reduce                in   4.13 seconds, maxrss 1344.77 MB
     * reducedim             in  18.41 seconds, maxrss 1345.05 MB
     * random                in  13.91 seconds, maxrss 1378.79 MB
     * abstractarray         in  34.60 seconds, maxrss 1418.34 MB
     * intfuncs              in   0.60 seconds, maxrss 1419.29 MB
     * simdloop              in   1.04 seconds, maxrss 1420.99 MB
     * blas                  in   4.97 seconds, maxrss 1431.13 MB
     * sparse                in  81.20 seconds, maxrss 1489.93 MB
     * bitarray              in  73.80 seconds, maxrss 1558.29 MB
     * copy                  in   1.98 seconds, maxrss 1560.54 MB
     * math                  in  10.05 seconds, maxrss 1601.85 MB
     * fastmath              in   3.18 seconds, maxrss 1614.45 MB
     * functional            in   1.54 seconds, maxrss 1614.95 MB
     * operators             in   0.52 seconds, maxrss 1614.95 MB
     * path                  in   3.72 seconds, maxrss 1615.43 MB
     * ccall                 in   2.53 seconds, maxrss 1615.87 MB
     * parse                 in   1.28 seconds, maxrss 1616.67 MB
     * loading               in   0.05 seconds, maxrss 1616.67 MB
     * bigint                in   2.28 seconds, maxrss 1619.98 MB
     * sorting               in  24.20 seconds, maxrss 1687.94 MB
     * statistics            in   5.75 seconds, maxrss 1687.94 MB
     * spawn                       [stdio passthrough ok]
 in   9.88 seconds, maxrss 1694.80 MB
     * backtrace             in   0.26 seconds, maxrss 1694.80 MB
     * priorityqueue         in   1.29 seconds, maxrss 1701.26 MB
     * file                  in  35.45 seconds, maxrss 1718.39 MB
     * mmap                  in  10.43 seconds, maxrss 1718.39 MB
     * version               in   2.18 seconds, maxrss 1718.39 MB
     * resolve               in   6.54 seconds, maxrss 1718.39 MB
     * pollfd                in   3.60 seconds, maxrss 1718.39 MB
     * mpfr                  in   3.66 seconds, maxrss 1718.39 MB
     * broadcast             in  10.70 seconds, maxrss 1718.39 MB
     * complex               in   3.12 seconds, maxrss 1718.39 MB
     * socket                in   1.76 seconds, maxrss 1718.39 MB
     * floatapprox           in   0.30 seconds, maxrss 1718.39 MB
     * readdlm               in  20.41 seconds, maxrss 1738.92 MB
     * reflection           The following 'Returned code...' warnings indicate normal behavior:
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
WARNING: Returned code may not match what actually runs.
 in   1.07 seconds, maxrss 1740.77 MB
     * regex                 in   1.18 seconds, maxrss 1742.20 MB
     * float16               in   1.18 seconds, maxrss 1743.88 MB
     * combinatorics         in   2.54 seconds, maxrss 1748.37 MB
     * sysinfo               in   1.14 seconds, maxrss 1759.16 MB
     * rounding              in   0.77 seconds, maxrss 1759.16 MB
     * ranges                in  45.62 seconds, maxrss 1770.81 MB
     * mod2pi                in   0.12 seconds, maxrss 1771.06 MB
     * euler                 in   0.68 seconds, maxrss 1782.77 MB
     * show                  in   6.46 seconds, maxrss 1786.22 MB
     * lineedit              in   3.61 seconds, maxrss 1787.22 MB
     * replcompletions       in   3.82 seconds, maxrss 1787.46 MB
     * repl                  in   4.62 seconds, maxrss 1789.85 MB
     * replutil              in   1.58 seconds, maxrss 1793.34 MB
     * sets                  in   2.45 seconds, maxrss 1795.55 MB
     * test                  in   0.34 seconds, maxrss 1795.55 MB
     * goto                  in   0.04 seconds, maxrss 1795.55 MB
     * llvmcall              in   0.10 seconds, maxrss 1795.55 MB
     * grisu                 in   5.09 seconds, maxrss 1800.99 MB
     * nullable              in   2.04 seconds, maxrss 1802.31 MB
     * meta                  in   0.25 seconds, maxrss 1802.76 MB
     * profile               in   5.63 seconds, maxrss 1855.52 MB
     * libgit2               in   0.02 seconds, maxrss 1855.52 MB
     * docs                  in   4.15 seconds, maxrss 1855.52 MB
     * markdown              in   5.85 seconds, maxrss 1855.52 MB
     * base64                in   0.46 seconds, maxrss 1855.52 MB
     * serialize             in   3.31 seconds, maxrss 1855.52 MB
     * functors              in   0.51 seconds, maxrss 1855.52 MB
     * misc                  in  36.33 seconds, maxrss 1938.10 MB
     * enums                 in   1.67 seconds, maxrss 1938.10 MB
     * cmdlineargs           in  23.29 seconds, maxrss 1938.10 MB
     * i18n                  in   0.02 seconds, maxrss 1938.10 MB
     * workspace             in   0.23 seconds, maxrss 1938.10 MB
     * libdl                
signal (11): Segmentation fault
_IO_feof at /lib64/libc.so.6 (unknown line)
jl_lookup_soname at /home/barnard/.local/julia-0.4/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 0x7ff065f64cc8)
dlopen_e at libdl.jl:42
anonymous at /home/barnard/.local/julia-0.4/test/libdl.jl:73
cd at ./file.jl:22
jl_apply_generic at /home/barnard/.local/julia-0.4/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 0x7ff065f616f3)
unknown function (ip: 0x7ff065f60a89)
unknown function (ip: 0x7ff065f61eed)
unknown function (ip: 0x7ff065f621ee)
unknown function (ip: 0x7ff065f76307)
unknown function (ip: 0x7ff065f76bcc)
jl_load at /home/barnard/.local/julia-0.4/usr/bin/../lib/libjulia.so (unknown line)
include at ./boot.jl:261
jl_apply_generic at /home/barnard/.local/julia-0.4/usr/bin/../lib/libjulia.so (unknown line)
include_from_node1 at ./loading.jl:304
jl_apply_generic at /home/barnard/.local/julia-0.4/usr/bin/../lib/libjulia.so (unknown line)
runtests at util.jl:179
jlcall_runtests_21212 at  (unknown line)
jl_apply_generic at /home/barnard/.local/julia-0.4/usr/bin/../lib/libjulia.so (unknown line)
anonymous at /home/barnard/.local/julia-0.4/test/runtests.jl:36
jl_f_apply at /home/barnard/.local/julia-0.4/usr/bin/../lib/libjulia.so (unknown line)
anonymous at multi.jl:690
run_work_thunk at multi.jl:651
remotecall_fetch at multi.jl:724
jl_apply_generic at /home/barnard/.local/julia-0.4/usr/bin/../lib/libjulia.so (unknown line)
jl_f_apply at /home/barnard/.local/julia-0.4/usr/bin/../lib/libjulia.so (unknown line)
remotecall_fetch at multi.jl:740
jl_apply_generic at /home/barnard/.local/julia-0.4/usr/bin/../lib/libjulia.so (unknown line)
anonymous at /home/barnard/.local/julia-0.4/test/runtests.jl:36
unknown function (ip: 0x7ff065f67964)
unknown function (ip: (nil))
/bin/sh: line 1:  2134 Segmentation fault      (core dumped) /home/barnard/.local/julia-0.4/usr/bin/julia --check-bounds=yes --startup-file=no ./runtests.jl all
Makefile:9: recipe for target 'all' failed
make[1]: *** [all] Error 139
Makefile:540: recipe for target 'testall1' failed
make: *** [testall1] Error 2
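
To see which tests account for the growth in a log like the one above, a small sketch such as the following can tabulate the increase in maxrss from one test to the next. It is Python, not part of Julia's test tooling, and the function name is made up for illustration. As far as I know, maxrss reflects the process's peak resident set size, so in a single-process run it can only stay flat or rise; a steadily climbing number is therefore not proof of a leak on its own.

```python
import re

# Matches lines such as
# "     * subarray              in 517.67 seconds, maxrss 1325.88 MB"
LINE = re.compile(r"\*\s+(\S+)\s+in\s+([\d.]+) seconds, maxrss\s+([\d.]+) MB")


def maxrss_growth(log_text):
    """Return (test, growth in MB) pairs, largest growth first.

    Tests whose output wrapped onto another line (for example `spawn` in the
    log above) do not match the pattern and are skipped.
    """
    growth = []
    previous = None
    for name, _seconds, rss in LINE.findall(log_text):
        rss = float(rss)
        if previous is not None:
            growth.append((name, round(rss - previous, 2)))
        previous = rss
    return sorted(growth, key=lambda item: item[1], reverse=True)
```

Fed the `arrayops`, `tuple`, and `subarray` lines from the log, it attributes roughly 270 MB of growth to `subarray` and none to `tuple`.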

@yuyichao
Contributor

Did this commit miss the backport?

@tkelman
Contributor Author

tkelman commented Jan 18, 2016

Whoops, guess so. Wasn't sure if it was master only. Now we have an example on release. @afbarnard how much memory does your system have?

@afbarnard

The above log was for a laptop with 4GB. I just tested on my workstation (8GB) and the increasing memory usage issue exists but no crash.

Workstation details:

  • Linux 4.3.3-300.fc23.x86_64 (Fedora 23)
  • libc-2.22
  • Julia commit e3ad75f (release-0.4)

Snipped output from make testall1:

    JULIA test/all
     * linalg/triangular     in 130.38 seconds, maxrss  435.04 MB
     * linalg/qr             in  36.83 seconds, maxrss  498.81 MB
     * linalg/dense          in  31.75 seconds, maxrss  548.69 MB
...
     * libdl                 in   0.99 seconds, maxrss 1942.04 MB
...
     * fft                   in  34.08 seconds, maxrss 1965.41 MB
     * dsp                   in  27.58 seconds, maxrss 1969.39 MB
     * examples              in  73.37 seconds, maxrss 1969.39 MB
     * compile              
...
 in  21.39 seconds, maxrss 1969.39 MB
    SUCCESS

tkelman pushed a commit that referenced this issue Jan 18, 2016
(cherry picked from commit 5a66fba)
ref #13719
@tkelman
Contributor Author

tkelman commented Jan 18, 2016

@afbarnard thanks. Can you try again on your laptop with a85c3a0? It's really valuable to have someone who can locally, reliably reproduce a bug like this, so thanks for testing things out!

@afbarnard

So make clean && make && make testall1 after updating to a85c3a0 (release-0.4) runs without crashing on my laptop. (The tests complete with SUCCESS.)

There still appears to be a memory leak issue, however. (Final maxrss 1986 MB!) Is something being done about that?

@tkelman
Contributor Author

tkelman commented Jan 19, 2016

That's just how much memory the test suite uses.

@tkelman
Contributor Author

tkelman commented Mar 17, 2017

@vtjnash do you consider this still not completely fixed?

@brevans

brevans commented Oct 3, 2017

Hello,
I have users who are interested in running Julia on our clusters, but we are seeing issues similar to those described here. All our nodes run RHEL 7.3. With the 0.6.0 binary release we see segfaults in several tests. A build of 0.6.0 from a git clone shows the same growth in maxrss that others in this thread report. Attached are the results from the tests, run as follows:

#downloaded from https://julialang-s3.julialang.org/bin/linux/x64/0.6/julia-0.6.0-linux-x86_64.tar.gz
julia -E "Base.runtests()" > julia-0.6.0_binary_testall.txt 2>&1

#git cloned and checked out v 0.6.0
#gcc v4.9.3 and CMake v3.6.2
make clean && make && make testall1 > ~/julia-0.6.0_src_testall.txt 2>&1

julia-0.6.0_binary_testall.txt
julia-0.6.0_src_testall.txt

I'd be happy to provide any additional tests or info!

@vtjnash
Sponsor Member

vtjnash commented Dec 26, 2017

@brevans That's a different issue. It seems that dlopen("libjulia") is not getting a handle to the existing libjulia.so, as the test is expecting. You should open a separate issue for that.

7 participants