Error on 1.10 ubuntu long #3184

Open
thofma opened this issue Jan 12, 2024 · 22 comments · Fixed by #3368
Labels
bug: crash bug Something isn't working

Comments

@thofma
Collaborator

thofma commented Jan 12, 2024

If one looks at https://github.com/oscar-system/Oscar.jl/commits/master/, one sees that often "Run tests / test (~1.10.0-0, long, ubuntu-latest) (push)" fails. The error looks scary, e.g. in https://github.com/oscar-system/Oscar.jl/actions/runs/7364038493/job/20044077891#step:7:4952 and https://github.com/oscar-system/Oscar.jl/actions/runs/7364038493/job/20044077891#step:7:26094:

!!! ERROR in jl_ -- ABORTING !!!

Does anyone have an idea where that might be coming from? I have not tried to reproduce it locally. It does not look like #2441.

CC: @lgoettgens @benlorenz

@thofma thofma added the bug Something isn't working label Jan 12, 2024
@lgoettgens
Member

No idea.

@benlorenz
Member

This looks like some weird GC corruption that occurs while the Serialization/IPC tests are running. It seems related to Julia tasks, but I haven't been able to reproduce it locally. I have the long test set running in a loop under rr to trigger and capture it (currently at about 100 iterations).

So far I have seen only one other crash, and that one was in the test group elliptic_surfaces.jl, which runs before the IPC stuff:

[4832] signal (11.1): Segmentation fault
in expression starting at /home/datastore/lorenz/software/julia/Oscar.jl/test/AlgebraicGeometry/Schemes/elliptic_surface.jl:1
jl_object_id__cold at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/builtins.c:455
type_hash at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1575
typekey_hash at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1605
jl_precompute_memoized_dt at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:1685
inst_datatype_inner at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:2081
jl_inst_arg_tuple_type at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/jltypes.c:2176
arg_type_tuple at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2232 [inlined]
jl_lookup_generic_ at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3020 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3072
iterate at ./generator.jl:47 [inlined]
collect at ./array.jl:834
unknown function (ip: 0x1522095c16a5)
convert_return at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:216
unknown function (ip: 0x1522095c11c9)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
#197 at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:92
unknown function (ip: 0x1522095c141c)
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
convert_normal_value at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:152
unknown function (ip: 0x1522095c1336)    
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
convert_return at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:223
unknown function (ip: 0x1522095c11c9)    
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
low_level_caller_rng at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/caller.jl:378
minAssGTZ at /home/datastore/lorenz/software/julia/depot/packages/Singular/tZxbi/src/Meta.jl:45
unknown function (ip: 0x1522095c0389)    
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
#minimal_primes#335 at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:830
minimal_primes at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:818 [inlined]
__compute_is_prime__ at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:1255
#356 at /home/datastore/lorenz/software/julia/depot/packages/AbstractAlgebra/R29qD/src/Attributes.jl:357 [inlined]
get! at ./dict.jl:479    
get_attribute! at /home/datastore/lorenz/software/julia/depot/packages/AbstractAlgebra/R29qD/src/Attributes.jl:230 [inlined]
is_prime at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpoly-ideals.jl:1254
unknown function (ip: 0x1521620272f5)    
_jl_invoke at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci4-6/julialang/julia-release-1-dot-10/src/gf.c:3076
__compute_is_prime__ at /home/datastore/lorenz/software/julia/Oscar.jl/src/Rings/mpolyquo-localizations.jl:1853
#914 at /home/datastore/lorenz/software/julia/depot/packages/AbstractAlgebra/R29qD/src/Attributes.jl:357 [inlined]
get! at ./dict.jl:479                    
unknown function (ip: 0x152162026da0)    

@jankoboehm
Contributor

typeinf_local and deserialize both occur, so perhaps this is related to type inference during deserialization, such as the compiler getting an unexpected type. I can imagine something like this happening in deserialization, but why only in this test?

@ThomasBreuer
Member

The same crash as described by @benlorenz happened also in the corresponding test run for #3018 after the changes that were pushed yesterday.

@fingolfin
Member

The second backtrace reported here by @benlorenz involves Singular.jl and the primdec library function minAssGTZ -- specifically the code in Singular.jl which converts its return value to Julia. Maybe there is a GC.@preserve missing there, or some other bug. Perhaps it causes memory corruption and then triggers the other crash, too... even if it does not, that needs to be solved.
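
For readers unfamiliar with the pattern, here is a minimal sketch of the kind of protection GC.@preserve provides. It makes no claim about the actual Singular.jl code; the function name `sum_via_pointer` is hypothetical. The point is only that a raw pointer stays valid just as long as the object it points into is kept alive, so the loads are wrapped in GC.@preserve:

```julia
# Hypothetical illustration; not taken from Singular.jl.
# Sums a vector by reading through a raw pointer.
function sum_via_pointer(v::Vector{Int})
    total = 0
    # Keep `v` rooted while `p` is in use. Without this, the compiler is
    # free to treat `v` as dead once `pointer(v)` has been taken.
    GC.@preserve v begin
        p = pointer(v)
        for i in 1:length(v)
            total += unsafe_load(p, i)  # unsafe_load uses 1-based indexing
        end
    end
    return total
end
```

A missing protection of this kind typically shows up only rarely, because the crash requires a collection to run at exactly the wrong moment.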

@lgoettgens
Member

After digging into the first backtrace again, this is a GC corruption error (https://github.com/oscar-system/Oscar.jl/actions/runs/7364038493/job/20044077891#step:7:4949), so this could be due to the same issue.

@benlorenz
Member

I have a preliminary fix for the crash I reported (jl_object_id__cold) here: oscar-system/Singular.jl#749. This adds a missing GC protection in the libsingular_julia code for passing data from a sleftv back to Julia. I want to do some further testing now. Unfortunately (for me at least ...) these crashes are rather rare.

@benlorenz
Member

The original error (GC error (probable corruption)) also happens on macOS, observed during my flint 2.9 backport testing:
https://github.com/benlorenz/Oscar.jl/actions/runs/7583365592/job/20655088309#step:9:4981
(but even less often than on ubuntu)

@joschmitt
Member

@fingolfin
Member

In both recent occurrences, the crash happened shortly after we see

Testing test/AlgebraicGeometry/Schemes/elliptic_surface.jl [...]

which I think means it is probably in the middle of testing test/Serialization/IPC.jl? (There is no message "Starting tests for ..." before that, perhaps we could add such a message?)

@fingolfin
Member

Specifically, if we add a "Starting tests..." message before loading IPC.jl, and also force a full GC before that message, then perhaps we can get a better idea as to whether the corruption happens before IPC.jl, or during it?
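
A sketch of what this could look like in the test runner. The helper name `announce_test_file`, the `io` keyword, and the exact message text are assumptions for illustration, not existing Oscar.jl code:

```julia
# Hypothetical helper; not part of Oscar.jl.
# Forces a full garbage collection, then prints a marker line, so that a
# later crash can be placed before or after the marker in the CI log.
function announce_test_file(path::AbstractString; io::IO = stdout)
    GC.gc(true)  # `true` requests a full (non-incremental) collection
    println(io, "Starting tests for ", path)
    return nothing
end
```

Calling it right before the IPC.jl test file is loaded would mean that corruption already present at that point has a chance to surface during the forced collection, before the marker line is printed.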

@benlorenz
Member

I can add the message, but I would like to hold off a bit on adding something like an explicit GC for now, since we just started running the tests with libsingular_julia 0.40.11, which is the first version including my sleftv fix. (At least until we see another error with that version...)

@benlorenz
Member

benlorenz commented Jan 24, 2024

It still happens with the new libsingular, and even with the explicit GC call it happens within the IPC.jl tests: https://github.com/oscar-system/Oscar.jl/actions/runs/7638814195/job/20810486432?pr=3229#step:8:4959
Unfortunately I haven't been able to reproduce this crash outside of GitHub Actions. I have two jobs running the long test suite, with 300 successful iterations so far.

@fingolfin
Member

It also happened here: https://github.com/oscar-system/Oscar.jl/actions/runs/7638161558/job/20808482695?pr=3226

Could it be that it again can only be reproduced on a memory-starved machine, with 7-8 GB RAM?

@benlorenz
Member

The workers should be less memory-starved now; they were recently upgraded to have 4 CPUs and 16 GB of memory.

@ThomasBreuer
Member

@benlorenz
Member

I have opened a PR to disable the IPC test for now while I try to debug this further: #3246

@fingolfin
Member

And here is an instance of the crash with Julia 1.9: https://github.com/oscar-system/Oscar.jl/actions/runs/7665378425/job/20891166477?pr=3247

@benlorenz
Member

Thanks for noticing. That is interesting: it turns out that doing GC.gc() before the IPC.jl tests seems to increase the rate at which the error occurs. (But still only on GitHub Actions so far ...)
Maybe that helped trigger this on 1.9 as well.

@benlorenz
Member

Our CI looks a lot better now without the IPC.jl tests, which should help with development. But I am continuing to look into this.
Please post any further errors you notice in the CI.

I just found this one during QuadFormAndIsom, unfortunately without any backtrace:

Sat, 27 Jan 2024 14:58:45 GMT GC: pause 27.39ms. collected 39.011118MB. incr 
Sat, 27 Jan 2024 14:58:45 GMT corrupted double-linked list
Sat, 27 Jan 2024 14:58:45 GMT
Sat, 27 Jan 2024 14:58:45 GMT [1921] signal (6.-6): Aborted
Sat, 27 Jan 2024 14:58:45 GMT in expression starting at /home/runner/work/Oscar.jl/Oscar.jl/experimental/QuadFormAndIsom/test/runtests.jl:269
Sat, 27 Jan 2024 17:09:25 GMT Error: The operation was canceled.

from https://github.com/oscar-system/Oscar.jl/actions/runs/7679187557/job/20929824694?pr=3212#step:8:1790

benlorenz added a commit that referenced this issue Jan 29, 2024
@benlorenz
Member

After some more debugging I found that the error will almost certainly be gone once 1.10.1 is released; it is fixed via JuliaLang/julia@8a04df0 (#52755). I don't really know why this happens so much more often on 1.10, but it is probably due to the more aggressive GC.

In this workflow I have about 150 successful runs of the long group including the IPC.jl tests, with an intermediate Julia build from the backports-release-1.10 branch.

So once that is released I will try to reactivate these tests and hopefully close this ticket.

benlorenz added a commit that referenced this issue Feb 14, 2024
benlorenz added a commit that referenced this issue Feb 15, 2024
#3368)

* Revert "Serialization: disable IPC test until #3184 is solved (#3246)"

This reverts commit 67ccc93.

* tests: remove GC.gc() before IPC tests
@thofma
Collaborator Author

thofma commented Apr 4, 2024

This is back: https://github.com/Nemocas/Nemo.jl/actions/runs/8546742962/job/23417708965?pr=1700

(This downstream test run only checks Oscar.)

@thofma thofma reopened this Apr 4, 2024
7 participants