Stack unwinding cannot go beneath IRQ frames #304

brenns10 · 2023-06-14T19:01:18Z

I've regularly seen this with customer vmcores, and I decided to reproduce this with the vmtest kernels so that it would be super easy for anyone to reproduce and debug. I brought up this issue in #206, but I believe that was a different (possibly related) issue, so I thought it would be cleaner to keep this one separate.

Suppose that during an interrupt handler, the kernel calls panic(). One common instance of this would be the soft lockup detector. Unfortunately, soft lockup detector is not enabled on the vmtest kernels. So instead, I implemented a simple kernel module that would call panic() in IRQ context, by way of using an IPI:

/*
 * ipipanic.c: Send an IPI to another CPU which calls panic()
 */
#include <linux/module.h>
#include <linux/smp.h>
#include <linux/version.h>

/* f39650de687e3 ("kernel.h: split out panic and oops helpers") */
#if LINUX_VERSION_CODE < KERNEL_VERSION(5,14,0)
#include <linux/kernel.h>
#else
#include <linux/panic.h>
#endif

static void do_ipipanic(void *unused)
{
	panic("ipipanic: trigger panic");
}

static int __init init_ipipanic(void)
{
	int cpu = get_cpu();
	int dst;

	for_each_online_cpu(dst) {
		if (dst != cpu) {
			smp_call_function_single(dst, do_ipipanic, NULL, true);
		}
	}
	put_cpu();
	return 0;
}

module_init(init_ipipanic);

MODULE_AUTHOR("Stephen Brennan <stephen@brennan.io>");
MODULE_DESCRIPTION("Send an IPI to another CPU which calls panic()");
MODULE_LICENSE("GPL");

I then configured kdump inside the vmtest kernel and loaded this module, triggering a panic. In the kexec kernel, I used makedumpfile /proc/vmcore /io/vmcore to write the vmcore.

The kernel's dmesg for the panic included the following stack trace:

CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Tainted: G           O      5.4.246-vmtest21.1default #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
Call Trace:
 <IRQ>
 dump_stack+0x50/0x63
 panic+0x10b/0x2e9
 ? 0xffffffffc0357000
 do_ipipanic+0xc/0x1000 [drgn_test]
 flush_smp_call_function_queue+0xa1/0xd1
 smp_call_function_single_interrupt+0x5a/0xa3
 call_function_single_interrupt+0xf/0x20
 </IRQ>
 ? default_idle+0x18/0x27
 ? default_idle+0x16/0x27
 ? do_idle+0xb3/0x1b8
 ? cpu_startup_entry+0x18/0x1a
 ? start_kernel+0x406/0x427
 ? secondary_startup_64+0xa4/0xb0

So there are two segments: first, the IRQ stack where the IPI was handled and called panic(). But beneath that is the idle task's stack. In other situations (e.g. the soft lockup case), the task which was interrupted may not be the idle task, it may be some very interesting task. We should be able to see this stack, so long as the task was executing in the kernel -- obviously userspace stacks aren't necessarily in-scope when debugging a kernel.

However, drgn (0.0.22, and main) cannot unwind past the IRQ frame:

>>> prog.crashed_thread().stack_trace()
#0  crash_setup_regs (./arch/x86/include/asm/kexec.h:100:3)
#1  __crash_kexec (./kernel/kexec_core.c:952:4)
#2  panic (./kernel/panic.c:305:3)
#3  do_ipipanic (./drgn_test/drgn_test.c:17:2)
#4  flush_smp_call_function_queue (./kernel/smp.c:244:4)
#5  smp_call_function_single_interrupt (./arch/x86/kernel/smp.c:262:2)
#6  call_function_single_interrupt+0xf/0x14 (./arch/x86/entry/entry_64.S:856)

For comparison, crash is normally able to unwind past the IRQ frames, e.g:

crash> bt
PID: 0        TASK: ffffffffbb811780  CPU: 0    COMMAND: "swapper/0"
 #0 [ffffb38540003e30] machine_kexec at ffffffffbaa368cc
 #1 [ffffb38540003e78] __crash_kexec at ffffffffbaaab894
 #2 [ffffb38540003f40] panic at ffffffffbaddeb20
 #3 [ffffb38540003fc0] do_ipipanic at ffffffffc035700c [drgn_test]
 #4 [ffffb38540003fc8] flush_smp_call_function_queue at ffffffffbaaa5417
 #5 [ffffb38540003fe0] smp_call_function_single_interrupt at ffffffffbae01d15
 #6 [ffffb38540003ff0] call_function_single_interrupt at ffffffffbae0165f
--- <IRQ stack> ---
 #7 [ffffffffbb803e18] call_function_single_interrupt at ffffffffbae0165f
    [exception RIP: default_idle+24]
    RIP: ffffffffbade6490  RSP: ffffffffbb803ec8  RFLAGS: 00000242
    RAX: 0000000000000001  RBX: 0000000000000000  RCX: 0000000000005d92
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000001
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000400  R12: ffffffffbb811780
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffff04  CS: 0010  SS: 0018
 #8 [ffffffffbb803ec0] default_idle at ffffffffbade648e
 #9 [ffffffffbb803ec8] do_idle at ffffffffbaa6b5ef
#10 [ffffffffbb803f08] cpu_startup_entry at ffffffffbaa6b82e
#11 [ffffffffbb803f18] start_kernel at ffffffffbba3be84
#12 [ffffffffbb803f50] secondary_startup_64 at ffffffffbaa000d4

I say normally, because in more recent kernels, crash actually fails to go past the IRQ frame, with a bt: WARNING: possibly bogus exception frame. I've tested the following on crash 8.0.3 and drgn main:

Kernel Version	Drgn (main)	Drgn+ORC (main)	Crash (8.0.3)
`6.3.7-vmtest21.1default`	no task stack (has bogus `???` frame)	no task stack	no task stack (warns about "bogus" frame)
`5.15.116-vmtest21.1default`	no task stack (has bogus `???` frame)	no task stack	no task stack (warns about "bogus" frame)
`5.14.21-vmtest21.1default`	no task stack (has bogus `???` frame)	no task stack	no task stack (warns about "bogus" frame)
`5.13.19-vmtest21.1default`	no task stack (has bogus `???` frame)	no task stack	no task stack (warns about "bogus" frame)
`5.12.19-vmtest21.1default`	no task stack (has bogus `???` frame)	no task stack	no task stack (warns about "bogus" frame)
`5.11.22-vmtest21.1default`	has task stack	has task stack	partial task stack (warns about "bogus" frame)
`5.10.183-vmtest21.1default`	has task stack	has task stack	partial task stack (warns about "bogus" frame)
`5.4.246-vmtest21.1default`	no task stack	no task stack	has task stack
`4.19.285-vmtest21.1default`	no task stack (has bogus `???` frame)	no task stack (has bogus `???` frame)	fails to load vmcore...
`4.14.317-vmtest21.1default`	no task stack (has bogus `???` frame)	no task stack (has bogus `???` frame)	has task stack

Reproducer steps:

Checkout my reproducer branch. It has some tweaks to the vmtest code as well as the above kernel module replacing the drgn_test.ko module.
Ensure your host system has makedumpfile installed (from kexec-tools)
python -m vmtest.kmod -k $KVER to build the module
Create a directory to hold the vmcores: mkdir io
python -m vmtest.vm -k $KVER -w $(pwd)/io to run the VM
Within the VM, python3 -m vmtest.enter_kdump -n to configure the kexec kernel (but skip panicking)
Within the VM, insmod build/vmtest/x86_64/drgn_test-$KVER.ko to trigger the panic
When the kexec kernel boots and drops you into a shell, you can then use makedumpfile /proc/vmcore /io/vmcore-$KVER to create the dump. I have encountered some low-memory issues, but none yet have prevented saving the vmcore. I assume increasing the crashkernel reservation would resolve the issues.

The text was updated successfully, but these errors were encountered:

osandov · 2023-06-14T19:08:06Z

I've seen this in production, too, but hadn't gotten around to reproducing it, so this is super helpful! I'm still under the impression that ORC is supposed to help us unwind through an IRQ, but maybe I'm wrong or maybe there's a bug in the ORC unwinding code.

brenns10 · 2023-06-14T20:21:04Z

I'm not so sure that we can rely on the ORC or DWARF to get us through the exception, but I likely haven't spent as much time looking at it as you have!

Looking at how crash does it, it does seem like the transition from the exception stack into the normal task stack has a bit of manual effort involved... But then again, I don't know that crash's way is necessarily the way we should emulate, it's just another data point.

In the case of my particular reproducer, it does seem as simple as "keep unwinding, assuming that the pt_regs location got pushed to the stack":

>>> regs = Object(prog, "struct pt_regs *", address=prog.crashed_thread().stack_trace()[-1].sp)
>>> regs[0]
(struct pt_regs){
        .r15 = (unsigned long)0,
        .r14 = (unsigned long)0,
        .r13 = (unsigned long)0,
        .r12 = (unsigned long)18446744072560383872,
        .bp = (unsigned long)0,
        .bx = (unsigned long)0,
        .r11 = (unsigned long)1024,
        .r10 = (unsigned long)0,
        .r9 = (unsigned long)0,
        .r8 = (unsigned long)0,
        .ax = (unsigned long)1,
        .cx = (unsigned long)23954,
        .dx = (unsigned long)0,
        .si = (unsigned long)0,
        .di = (unsigned long)1,
        .orig_ax = (unsigned long)18446744073709551364,
        .ip = (unsigned long)18446744072549721232,
        .cs = (unsigned long)16,
        .flags = (unsigned long)578,
        .sp = (unsigned long)18446744072560328392,
        .ss = (unsigned long)24,
}
>>> prog.stack_trace(regs)
#0  default_idle (./arch/x86/kernel/process.c:573:2)
#1  cpuidle_idle_call (./kernel/sched/idle.c:154:3)
#2  do_idle (./kernel/sched/idle.c:264:4)
#3  cpu_startup_entry (./kernel/sched/idle.c:356:3)
#4  start_kernel (./init/main.c:782:2)
#5  secondary_startup_64+0xa4/0xa4 (./arch/x86/kernel/head_64.S:241)

edit: and this approach works on the customer vmcore which finally caused me to go down the rabbit hole, which is pretty nice! The only difference is that sometimes, there's a "bogus" frame at the end of the stack trace, e.g.:

#24 smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1100)
#25 apic_timer_interrupt+0x1c6/0x1cb0xffffffff9da052c6 (arch/x86/entry/entry_64.S:808)
#26 0xffff993d3d5a74

And so rather than using the last frame's sp, you need to use the second-to-last frame's sp.

brenns10 · 2023-06-14T20:44:30Z

This gets curiouser and curiouser :)

Drgn unwinds through the IRQ back into the task stack with no problems on 5.10... but crash can't handle that kernel. I'm updating the original comment with a table summarizing all of the results. Super bizarre stuff. The original comment also now has my reproducer branch linked with instructions.

brenns10 · 2023-06-14T21:26:23Z

Ok, table is now updated with every vmtest kernel version for x86_64/default. I also ran the test with DRGN_PREFER_ORC_UNWINDER=1 on each kernel version and included the results in a separate column. Finally, I'm attaching a log of each kernel version, every stack trace, so we can refer to details later on.
UNWINDING.txt

brenns10 · 2023-06-14T21:50:50Z

With the full table, I can at least start compiling some possible kernel commits to "blame". Not that the bug is necessarily the kernel's fault, but merely that there's something different between kernel versions which may point us to the part of the unwind that is correct or incorrect.

torvalds/linux@951c2a5 "x86/irq/64: Adjust the per CPU irq stack pointer by 8" feels quite relevant. This was committed in the 5.12 cycle, and so it could explain why the 5.12+ kernels have different behavior (both for drgn, and for crash). A second look shows this... seems unrelated.

If we can find a suspect change, it would be good to revert it and see if that makes allows drgn to unwind past the execption frame again.

osandov · 2023-06-15T19:04:08Z

Thanks again for the reproducer! I got it running locally and investigated a bit. I found a nasty typo in the ORC unwinder:

diff --git a/libdrgn/arch_x86_64.c b/libdrgn/arch_x86_64.c
index 57ebcd4..d65b181 100644
--- a/libdrgn/arch_x86_64.c
+++ b/libdrgn/arch_x86_64.c
@@ -76,7 +76,7 @@ orc_to_cfi_x86_64(const struct drgn_orc_entry *orc,
 		break;
 	case ORC_REG_SP_INDIRECT:
 		rule.kind = DRGN_CFI_RULE_AT_REGISTER_ADD_OFFSET;
-		rule.regno = DRGN_REGISTER_NUMBER(rbp);
+		rule.regno = DRGN_REGISTER_NUMBER(rsp);
 		rule.offset = orc->sp_offset;
 		break;
 	case ORC_REG_BP_INDIRECT:

With that fixed, 6.2.16-vmtest21.1default works with DRGN_PREFER_ORC_UNWINDER=1:

>>> prog.crashed_thread().stack_trace()
#0  crash_setup_regs (./arch/x86/include/asm/kexec.h:95:3)
#1  __crash_kexec (./kernel/kexec_core.c:968:4)
#2  panic (./kernel/panic.c:359:3)
#3  do_ipipanic (./drgn_test/drgn_test.c:17:2)
#4  __flush_smp_call_function_queue (./kernel/smp.c:630:4)
#5  __sysvec_call_function_single (./arch/x86/kernel/smp.c:248:2)
#6  sysvec_call_function_single (./arch/x86/kernel/smp.c:243:1)
#7  asm_sysvec_call_function_single+0x1a/0x1f (./arch/x86/include/asm/idtentry.h:657)
#8  default_idle (./arch/x86/kernel/process.c:731:1)
#9  default_idle_call (./kernel/sched/idle.c:109:3)
#10 cpuidle_idle_call (./kernel/sched/idle.c:191:3)
#11 do_idle (./kernel/sched/idle.c:303:4)
#12 cpu_startup_entry (./kernel/sched/idle.c:400:3)
#13 rest_init (./init/main.c:732:2)
#14 arch_call_rest_init (./init/main.c:894:2)
#15 start_kernel (./init/main.c:1148:2)
#16 secondary_startup_64+0xe5/0xe5 (./arch/x86/kernel/head_64.S:358)

But not without it:

>>> prog.crashed_thread().stack_trace()
#0  crash_setup_regs (./arch/x86/include/asm/kexec.h:95:3)
#1  __crash_kexec (./kernel/kexec_core.c:968:4)
#2  panic (./kernel/panic.c:359:3)
#3  do_ipipanic (./drgn_test/drgn_test.c:17:2)
#4  __flush_smp_call_function_queue (./kernel/smp.c:630:4)
#5  __sysvec_call_function_single (./arch/x86/kernel/smp.c:248:2)
#6  sysvec_call_function_single (./arch/x86/kernel/smp.c:243:1)
#7  ???

This means that sysvec_call_function_single has inaccurate DWARF CFI but accurate ORC. This doesn't play well with drgn's current fallback behavior. (We prefer DWARF CFI because that recovers more registers to use for locals.) This will require more thought.

ORC_REG_SP_INDIRECT is supposed to be an indirect access via rsp, but we have a typo and are using rbp instead. This is a partial fix for #304. Fixes: 630d39e ("libdrgn: add ORC unwinder") Signed-off-by: Omar Sandoval <osandov@osandov.com>

brenns10 · 2023-06-15T20:25:09Z

I applied your ORC patch and re-ran prog.crashed_thread().stack_trace() on each of the vmcores I had previously collected. First with the default, second with DRGN_PREFER_ORC_UNWINDER=1. Here's a summary of the new results.

Kernel Version	Drgn Default	Drgn `DRGN_PREFER_ORC_UNWINDER=1`
`6.3.7-vmtest21.1default`	Missing task stack (has extra `???` frame)	Has task stack
`5.15.116-vmtest21.1default`	Missing task stack (has extra `???` frame)	Has task stack
`5.14.21-vmtest21.1default`	Missing task stack (has extra `???` frame)	Has task stack
`5.13.19-vmtest21.1default`	Missing task stack (has extra `???` frame)	Has task stack
`5.12.19-vmtest21.1default`	Missing task stack (has extra `???` frame)	Has task stack
`5.11.22-vmtest21.1default`	Has task stack	Has task stack
`5.10.183-vmtest21.1default`	Has task stack	Has task stack
`5.4.246-vmtest21.1default`	Missing task stack	Missing task stack
`4.19.285-vmtest21.1default`	Has task stack	Has task stack
`4.14.317-vmtest21.1default`	Has task stack	Has task stack

brenns10 · 2023-06-15T20:47:51Z

So that was a pretty awesome fix, great find there!

For the remainder of the matrix, it seems likely that there was some change in the 5.12 release cycle which broke the DWARF CFI unwinding. The commit I linked above was definitely not the commit that caused it, but it was part of a series that reworked the x86 IRQ entry code. It seems likely that the change is in there. Comparing the ORC unwind on 5.12 and 5.11, we can see that the stack traces are in fact different:

5.12:

#0  crash_setup_regs (./arch/x86/include/asm/kexec.h:95:3)
#1  __crash_kexec (./kernel/kexec_core.c:957:4)
#2  panic (./kernel/panic.c:251:3)
#3  do_ipipanic (./drgn_test/drgn_test.c:17:2)
#4  flush_smp_call_function_queue (./kernel/smp.c:395:4)
#5  __sysvec_call_function_single (./arch/x86/kernel/smp.c:248:2)
#6  sysvec_call_function_single (./arch/x86/kernel/smp.c:243:1)
#7  asm_sysvec_call_function_single+0x12/0x17 (./arch/x86/include/asm/idtentry.h:646)
#8  default_idle (./arch/x86/kernel/process.c:689:1)
#9  default_idle_call (./kernel/sched/idle.c:112:3)
#10 cpuidle_idle_call (./kernel/sched/idle.c:194:3)
#11 do_idle (./kernel/sched/idle.c:300:4)
#12 cpu_startup_entry (./kernel/sched/idle.c:397:3)
#13 start_kernel (./init/main.c:1060:2)
#14 secondary_startup_64+0xc7/0xc7 (./arch/x86/kernel/head_64.S:283)

5.11:

#0  crash_setup_regs (./arch/x86/include/asm/kexec.h:95:3)
#1  __crash_kexec (./kernel/kexec_core.c:957:4)
#2  panic (./kernel/panic.c:251:3)
#3  do_ipipanic (./drgn_test/drgn_test.c:17:2)
#4  flush_smp_call_function_queue (./kernel/smp.c:395:4)
#5  __sysvec_call_function_single (./arch/x86/kernel/smp.c:248:2)
#6  asm_call_on_stack (./arch/x86/entry/entry_64.S:788)
#7  __run_sysvec_on_irqstack (./arch/x86/include/asm/irq_stack.h:37:2)
#8  run_sysvec_on_irqstack_cond (./arch/x86/include/asm/irq_stack.h:89:3)
#9  sysvec_call_function_single (./arch/x86/kernel/smp.c:243:1)
#10 asm_sysvec_call_function_single+0x12/0x17 (./arch/x86/include/asm/idtentry.h:652)
#11 default_idle (./arch/x86/kernel/process.c:689:1)
#12 default_idle_call (./kernel/sched/idle.c:112:3)
#13 cpuidle_idle_call (./kernel/sched/idle.c:194:3)
#14 do_idle (./kernel/sched/idle.c:300:4)
#15 cpu_startup_entry (./kernel/sched/idle.c:397:3)
#16 secondary_startup_64+0xc7/0xc7 (./arch/x86/kernel/head_64.S:283)

In 5.11 we have the __run_sysvec_on_irqstack, run_sysvec_on_irqstack_cond, and asm_call_on_stack functions. They are removed in 5.12, seemingly converted to macros here. I know that these frames are all inline so they don't have separate CFAs on the stack. But maybe the conversion from inline to macro caused the compiler to be unable to generate correct DWARF CFI?

osandov · 2023-07-02T06:07:58Z

The patch series you linked to is indeed the problem: since that series, the stack switch happens in inline assembly via the macros run_sysvec_on_irqstack_cond -> call_on_irqstack_cond -> call_on_irqstack -> call_on_stack. This inline assembly ends up in the sysvec_call_function_single function. The compiler is not aware of inline assembly when generating DWARF, so the DWARF ends up not reflecting the stack switch.

This type of stack switch seems to be represented by ORC_REG_SP_INDIRECT. So, we can handle this by preferring ORC over DWARF if the ORC entry for the instruction pointer uses ORC_REG_SP_INDIRECT. This will require some reworking, but nothing too crazy.

brenns10 · 2024-02-23T18:40:21Z

Just an interesting note regarding unwinding through exception frames... There's a reason that I've probably noticed more issues with this than your average drgn user. The Oracle UEK 5 & 6 kernels that I tend to do a lot of support on (4.14 and 5.4 based) do not have ORC enabled! They do have frame pointers, and I typically have DWARF available, but there's no ORC to fall back on. Contrast that to UEK7 (5.15 based) which has ORC enabled, and no frame pointers.

osandov added the bug Something isn't working label Jul 3, 2023

osandov self-assigned this Jul 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stack unwinding cannot go beneath IRQ frames #304

Stack unwinding cannot go beneath IRQ frames #304

brenns10 commented Jun 14, 2023 •

edited

Loading

osandov commented Jun 14, 2023

brenns10 commented Jun 14, 2023 •

edited

Loading

brenns10 commented Jun 14, 2023

brenns10 commented Jun 14, 2023

brenns10 commented Jun 14, 2023 •

edited

Loading

osandov commented Jun 15, 2023

brenns10 commented Jun 15, 2023

brenns10 commented Jun 15, 2023

osandov commented Jul 2, 2023

brenns10 commented Feb 23, 2024

Stack unwinding cannot go beneath IRQ frames #304

Stack unwinding cannot go beneath IRQ frames #304

Comments

brenns10 commented Jun 14, 2023 • edited Loading

osandov commented Jun 14, 2023

brenns10 commented Jun 14, 2023 • edited Loading

brenns10 commented Jun 14, 2023

brenns10 commented Jun 14, 2023

brenns10 commented Jun 14, 2023 • edited Loading

osandov commented Jun 15, 2023

brenns10 commented Jun 15, 2023

brenns10 commented Jun 15, 2023

osandov commented Jul 2, 2023

brenns10 commented Feb 23, 2024

brenns10 commented Jun 14, 2023 •

edited

Loading

brenns10 commented Jun 14, 2023 •

edited

Loading

brenns10 commented Jun 14, 2023 •

edited

Loading