Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"died due to signal 11" in atomic::static_init libcore test on Android #49775

Closed
SimonSapin opened this issue Apr 8, 2018 · 11 comments
Closed
Labels
A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason) O-android Operating system: Android T-release Relevant to the release subteam, which will review and decide on the PR/issue.

Comments

@SimonSapin
Copy link
Contributor

The same test has failed on Travis in multiple PRs, including at least:

I can reproduce on some branches (but not master) by running RUST_TEST_THREADS=1 src/ci/docker/run.sh arm-android. The end of the output looks like:

test atomic::int_nand ... ok
test atomic::int_or ... ok
test atomic::int_xor ... ok
test atomic::static_init ... died due to signal 11
error: test failed, to rerun pass '--test coretests'


command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "test" "--target" "arm-linux-androideabi" "-j" "8" "--release" "--locked" "--features" "panic-unwind jemalloc backtrace" "--manifest-path" "/checkout/src/libstd/Cargo.toml" "-p" "core" "--"
expected success, got: exit code: 3


failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test --target arm-linux-androideabi

Which means the test process dies with “SIGSEGV Invalid memory reference” while executing this test:

static S_FALSE: AtomicBool = AtomicBool::new(false);
static S_TRUE: AtomicBool = AtomicBool::new(true);
static S_INT: AtomicIsize  = AtomicIsize::new(0);
static S_UINT: AtomicUsize = AtomicUsize::new(0);

#[test]
fn static_init() {
    assert!(!S_FALSE.load(SeqCst));
    assert!(S_TRUE.load(SeqCst));
    assert!(S_INT.load(SeqCst) == 0);
    assert!(S_UINT.load(SeqCst) == 0);
}

I haven’t figured out how to use gdb inside the Android emulator inside Docker to debug this further.

CC @alexcrichton, @rust-lang/release

@SimonSapin SimonSapin added O-android Operating system: Android A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason) T-release Relevant to the release subteam, which will review and decide on the PR/issue. labels Apr 8, 2018
@kennytm
Copy link
Member

kennytm commented Apr 8, 2018

Assume the error is first discovered in #49675 (based on 74abffe). The final PRs involved in that rollup are:

It was reported that the error still appears locally, so one of the above would be the cause.

From #49669 (comment), we knew the cause should be already contained by d919eb671b5c6d3c68221831a8d1186f492ea4d4 (i.e. 48fa6f9 on master). 74abffe...48fa6f9 includes the following PRs:

*   48fa6f9631 Auto merge of #49696 - alexcrichton:rollup, r=alexcrichton
|\  
| * cd615e9863 Rollup merge of #49705 - alexcrichton:less-manifest-docs, r=kennytm
| * 4d239ab14e Rollup merge of #49697 - kennytm:name-every-builder, r=aturon
| * 83669ecc1f Rollup merge of #49621 - Nemo157:impl-unpin-for-pin, r=withoutboats
| * 71bf15c6e8 Rollup merge of #49686 - memoryleak47:typo, r=alexcrichton
| * e6947ecf4d Rollup merge of #49597 - alexcrichton:proc-macro-v2, r=petrochenkov
| * 72ac3ebf68 Rollup merge of #49497 - scalexm:hrtb, r=nikomatsakis
| * 46492ffabd Rollup merge of #49350 - abonander:macros-in-extern, r=petrochenkov
| * b0bd9a771e Rollup merge of #49045 - Zoxc:tls, r=michaelwoerister
* | 7222241e7c Auto merge of #49045 - Zoxc:tls, r=michaelwoerister
|/  
*   56714acc5e Auto merge of #49684 - kennytm:rollup, r=kennytm
|\  
| * f4511e2437 Rollup merge of #49674 - alexcrichton:no-incremental-rustc, r=michaelwoerister
| * d05009bfa1 Rollup merge of #49667 - Manishearth:preview-features, r=nikomatsakis
| * 84f6440d79 Rollup merge of #49654 - davidtwco:issue-29893, r=alexcrichton
| * b146e33518 Rollup merge of #49563 - japaric:std-thumb, r=alexcrichton
| * 23689cc8e9 Rollup merge of #49496 - glandium:master, r=sfackler
| * 19c69082f5 Rollup merge of #49432 - nabijaczleweli:master, r=michaelwoerister
| * a70f844012 Rollup merge of #49345 - davidtwco:issue-44109, r=nikomatsakis
| * 46d0befb8e Rollup merge of #49253 - chmanchester:probing_fix, r=alexcrichton
| * 920249abdd Rollup merge of #48658 - llogiq:no-more-cas, r=kennytm
* | 01d0be9925 Auto merge of #48851 - petrochenkov:genparattr, r=nikomatsakis
|/  
* 4bf76d6745 Auto merge of #48709 - tinaun:issue48703, r=nikomatsakis
* 3b1fa867f2 Auto merge of #49587 - Bobo1239:master, r=nrc

PRs in both lists are:

@alexcrichton
Copy link
Member

Yes I've been able to reproduce this locally and it seems to be unrelated to changes that we're landing and otherwise just related to how libcore or libstd itself is optimized. I've attempted to debug this to get an actual cause but to no avail so far.

@alexcrichton
Copy link
Member

I'm pretty certain though that this isn't spurious in the sense that @bors: retry would do much, it seems deterministic based on changes in libstd/libcore. The changes, however, don't feel at all related to the bug still...

@alexcrichton
Copy link
Member

Ok I think I've found the cause of this. I've got no idea how this ever worked before! (as is par for the course on these kinds of bugs)

So a program like this:

use std::sync::atomic::*;
use std::sync::atomic::Ordering::SeqCst;

static FOO: AtomicIsize  = AtomicIsize::new(0);

fn main() {
    FOO.load(SeqCst);
}

generates roughly this IR:

@FOO = internal unnamed_addr global <{ [4 x i8] }> zeroinitializer, align 4

define void @main() {
  %a = load atomic i32, i32* bitcast (<{ [4 x i8] }>* @FOO to i32*) seq_cst, align 4
  ret void
}

Now the crucial thing here is that we don't actually modify @FOO in LLVM, so when run through optimizations it determines:

@FOO = internal unnamed_addr constant <{ [4 x i8] }> zeroinitializer, align 4

Which sounds reasonable! As a result, our constant was promoted from BSS to read-only data. We can see this:

$ nm -a without-optimizations | grep INT
00064124 b _ZN3lib5S_INT17hb5c694457a35af85E
$ nm -a with-optimizations | grep INT
00058600 r _ZN3lib5S_INT17hb5c694457a35af85E

Here the b means bss and r means rodata.

Now the tricky part comes later during the actual codegen. Here if you disassemble the function we get:

00008554 <main>:
    8554:       e92d4800        push    {fp, lr}
    8558:       e59f000c        ldr     r0, [pc, #12]   ; 856c <main+0x18>
    855c:       e3a01000        mov     r1, #0
    8560:       e3a02000        mov     r2, #0
    8564:       eb0002ef        bl      9128 <__sync_val_compare_and_swap_4>
    8568:       e8bd8800        pop     {fp, pc}
    856c:       0000aadc        .word   0x0000aadc

So it turns out we're not using atomic instructions at all, but rather we're using the __sync_val_compare_and_swap_4 intrinsic function. On Android this is loaded from libgcc, and the definition bottoms out in the __kuser_cmpxchg function provided by the kernel. This kernel helper presumably does syscall stuff if necessary, but on the emulator the kernel apparently discovered that the hardware supports atomic instructions so it's using atomic instructions! Notably gdb says that the faulting instruction is:

(gdb) disas $pc-12,$pc+12
Dump of assembler code from 0xffff0fc0 to 0xffff0fd4:
   0xffff0fc0:  ldrex   r3, [r2]
   0xffff0fc4:  subs    r3, r3, r0
=> 0xffff0fc8:  strexeq r3, r1, [r2]
   0xffff0fcc:  teqeq   r3, #1
   0xffff0fd0:  beq     0xffff0fc0
End of assembler dump.

Turns out r2 is the same as the address to our static which is in rodata. Now that we're in rodata this ends up failing with a fault because we can't actually modify rodata.


So it looks like tl;dr; LLVM will aggressively flag unmodified internal globals as constant (yay!) which allocates them in rodata instead of bss. In this case, however, the intrinsic operation that we lowered down to ended up actually doing a write which LLVM didn't know about making that an invalid optimization.

Given that we don't actually have many instances of atomics without modifying them I'm just gonna fix the test and call it a day...

@alexcrichton
Copy link
Member

As to how this ever worked I'm sort of msytified. I'm just gonna assume that all rodata doesn't actually end up being read-only for various reasons in the loader and we got lucky before where our constants here just happened to wind up in a read/write section of memory

@alexcrichton
Copy link
Member

I've posted a workaround at #49811

@cuviper
Copy link
Member

cuviper commented Apr 9, 2018

In this case, however, the intrinsic operation that we lowered down to ended up actually doing a write which LLVM didn't know about making that an invalid optimization.

This sounds like something that still needs to be fixed!

@BatmanAoD
Copy link
Member

On Android this is loaded from libgcc, and the definition bottoms out in the __kuser_cmpxchg function provided by the kernel. This kernel helper presumably does syscall stuff if necessary, but on the emulator the kernel apparently discovered that the hardware supports atomic instructions so it's using atomic instructions!

Since we have no control over Android's libgcc, is it possible that the implementation on an actual Android device might be updated in the future to use real atomic instructions, thus triggering the bug on that device as well as in the emulator?

alexcrichton added a commit to alexcrichton/rust that referenced this issue Apr 9, 2018
See rust-lang#49775 for some more information but it looks like this is working around an
LLVM bug for the time being.

Closes rust-lang#49775
@alexcrichton
Copy link
Member

@cuviper indeed! I've opened https://bugs.llvm.org/show_bug.cgi?id=37061

@alexcrichton
Copy link
Member

@BatmanAoD hm I'm not sure I quite follow? It looks like our vanilla arm-linux-androideabi is targeting such an old version of ARM/Android that it requires usage of intrinsics for atomic operations. The implementation of the intrinsics are behaving differently than what LLVM assumes, aka loading from an atomic is actually doing a cmpxchg to learn the value. The actual write happens deep within the intrinsic inside of the kernel-provided __kuser_cmpxchg function, and I'm sort of just assuming that the kernel is doing runtime detection to see if atomics are actually available and filling in an appropriate implementaiton. If a kernel implementation of the cmpxchg intrinsic were used I'm not sure if it'd also deliver a fault.

If the code is recompiled with support for atomic instructions then I don't think any fault will happen as LLVM's assumption that the static is never modified will actually be true (as it's generating the atomic instructions).

Does that answer your question?

bors added a commit that referenced this issue Apr 9, 2018
std: Be sure to modify atomics in tests

See #49775 for some more information but it looks like this is working around an
LLVM bug for the time being.

Closes #49775
@BatmanAoD
Copy link
Member

@alexcrichton Yes; I misunderstood or misread and somehow got the impression that libgcc was responsible for making the determination of whether to invoke the strexeq instruction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason) O-android Operating system: Android T-release Relevant to the release subteam, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants