arch: introduce `arch_stack_walk()` and add implementation for RISCV #73587

ycsin · 2024-05-31T14:26:56Z

While reviewing #72890, I think the stacktrace implementation in RISCV can probably be factored out so that it can be useful for #72890.

It seems Linux already has an API for this use case, so I guess we can probably just borrow it.

Added an implementation for RISC-V; tested and working on a modified version of #72890.

Added a shell command to unwind a stack:

uart:~$ kernel threads
Scheduler: 11 since last call
Threads:
*0x80017138 shell_uart
        options: 0x0, priority: 14 timeout: 0
        state: queued, entry: 0x800029ac
        stack size 3072, unused 1316, usage 1756 / 3072 (57 %)

 0x80017ca8 sysworkq
        options: 0x1, priority: -1 timeout: 0
        state: pending, entry: 0x80006842
        stack size 1024, unused 644, usage 380 / 1024 (37 %)

 0x800177e0 idle
        options: 0x1, priority: 15 timeout: 0
        state: , entry: 0x800065ae
        stack size 512, unused 180, usage 332 / 512 (64 %)

 0x80017950 main
        options: 0x1, priority: 0 timeout: 13
        state: suspended, entry: 0x80006326
        stack size 4096, unused 3604, usage 492 / 4096 (12 %)

uart:~$ kernel unwind 0x80017ca8
Unwinding 0x80017ca8 sysworkq
ra: 0x80007114 [z_swap+0x58]
ra: 0x80007ae8 [z_sched_wait+0x10]
ra: 0x8000689a [work_queue_main+0x58]
ra: 0x800006de [z_thread_entry+0x2e]
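
As a rough illustration of the callback style used by Linux's `arch_stack_walk()` (a consume callback plus an opaque cookie), the sketch below feeds a fixed list of return addresses to a callback that formats each one like the `kernel unwind` output above. Everything in it is hypothetical: `walk_cb_t`, `mock_stack_walk()`, `print_frame()` and the symbol table are made up for the example, and the symbol start addresses are simply the logged `ra` values minus their offsets. It is not the implementation from this PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical callback type: invoked once per frame, return false to stop. */
typedef bool (*walk_cb_t)(void *cookie, unsigned long addr);

/* Hypothetical symbol table; start = logged ra minus logged offset. */
struct sym {
	unsigned long start;
	const char *name;
};

static const struct sym symtab[] = {
	{0x800006b0UL, "z_thread_entry"},
	{0x80006842UL, "work_queue_main"},
	{0x800070bcUL, "z_swap"},
	{0x80007ad8UL, "z_sched_wait"},
};

/* Find the symbol with the highest start address that is <= addr. */
static const struct sym *lookup(unsigned long addr)
{
	const struct sym *best = NULL;

	for (size_t i = 0; i < sizeof(symtab) / sizeof(symtab[0]); i++) {
		if (symtab[i].start <= addr &&
		    (best == NULL || symtab[i].start > best->start)) {
			best = &symtab[i];
		}
	}
	return best;
}

/* Stand-in for an arch walker: replays the return addresses from the log. */
static void mock_stack_walk(walk_cb_t cb, void *cookie)
{
	static const unsigned long ras[] = {
		0x80007114UL, 0x80007ae8UL, 0x8000689aUL, 0x800006deUL,
	};

	assert(cb != NULL);
	for (size_t i = 0; i < sizeof(ras) / sizeof(ras[0]); i++) {
		if (!cb(cookie, ras[i])) {
			break; /* the consumer asked to stop early */
		}
	}
}

/* Consumer: print one line per frame and count frames through the cookie. */
static bool print_frame(void *cookie, unsigned long addr)
{
	const struct sym *s = lookup(addr);
	int *count = cookie;

	if (s != NULL) {
		printf("ra: 0x%08lx [%s+0x%lx]\n", addr, s->name, addr - s->start);
	} else {
		printf("ra: 0x%08lx\n", addr);
	}
	(*count)++;
	return true;
}
```

Calling `mock_stack_walk(print_frame, &count)` with an `int count = 0` visits the four frames in order. Keeping the walker separate from the consumer is what would let the same walk serve a shell command, a fatal-error handler, or a profiler like the one in #72890.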

include/zephyr/arch/arch_interface.h

arch/Kconfig

andyross

Some notes. Overall this looks clean to me. Note that there is also a ton of per-arch stack walking code that e.g. does LOG_ERR() or printk() to hand format stack traces that we should look at replacing with a portable one based on this.

andyross · 2024-05-31T17:00:56Z

include/zephyr/arch/arch_interface.h

@@ -1240,6 +1242,35 @@ bool arch_pcie_msi_vector_connect(msi_vector_t *vector,
 */
 void arch_spin_relax(void);

+#ifdef CONFIG_ARCH_STACKWALK


Do we need this as a Kconfig? Seems like a feature that, if unused, would be dropped from the link. The only other reason to make it tunable would be to optimize build time for uncalled code, but again this is tiny and IMHO we don't want to be playing games like that at the arch level.

ycsin · Jun 5, 2024

Removed CONFIG_ARCH_STACKWALK, but I retained CONFIG_ARCH_HAS_STACKWALK, which is used to guard CONFIG_ARCH_STACKWALK_MAX_FRAMES. CONFIG_ARCH_STACKWALK_MAX_FRAMES (like CONFIG_EXCEPTION_STACK_TRACE_MAX_FRAMES) helps make sure that we don't get stuck in an infinite loop. LMK what you think.
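
To illustrate why a frame cap matters, here is a minimal sketch of a frame-pointer walk over a corrupted chain that loops back on itself. `MAX_FRAMES`, `struct frame` and `bounded_walk()` are hypothetical stand-ins for the example (with `MAX_FRAMES` playing the role of CONFIG_ARCH_STACKWALK_MAX_FRAMES); this is not the RISC-V code from the PR, and a real walker would typically also check each frame pointer against the stack bounds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for CONFIG_ARCH_STACKWALK_MAX_FRAMES. */
#define MAX_FRAMES 8

/* Simplified frame record: the caller's record and the return address. */
struct frame {
	const struct frame *fp; /* NULL at the outermost frame */
	uintptr_t ra;
};

/*
 * Follow the frame chain, storing return addresses in out[].
 * The walk stops at the end of the chain, at MAX_FRAMES, or when
 * out[] is full, whichever comes first. Returns the frame count.
 */
static size_t bounded_walk(const struct frame *fp, uintptr_t *out, size_t out_len)
{
	size_t n = 0;

	assert(out != NULL);
	while (fp != NULL && n < MAX_FRAMES && n < out_len) {
		out[n++] = fp->ra;
		fp = fp->fp;
	}
	return n;
}
```

Without the `n < MAX_FRAMES` check, two frame records that point at each other would keep the loop running forever; with it, the walk gives up after a fixed number of frames.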

include/zephyr/arch/arch_interface.h

ycsin · 2024-06-11T15:10:58Z

Removed the implementation for x86 & arm64 as they probably do not work properly outside of the fatal use case.

ycsin · 2024-06-11T15:19:15Z

cc @npitre

ycsin · 2024-06-11T15:46:30Z

arch_ interfaces are not something meant to be implemented out-of-tree

Q: Does this apply to the OOT archs? Do we actually support OOT archs?

npitre · 2024-06-11T18:17:15Z

Your commit log says:

The fp-based implementation of stack-unwinding when the esf is
NULL came from a5d86b8b077466384f2c2ea6771498a1e265bb95.

But...

$ git log a5d86b8b077466384f2c2ea6771498a1e265bb95
fatal: bad object a5d86b8b077466384f2c2ea6771498a1e265bb95

You may refer to a git hash only if it is merged upstream. Pull requests are
applied with a "rebase" not with a "merge" so any hash from any private tree
will be different once applied.

Then, in the code:

       if ((esf == NULL) && (csf == NULL)) {
               return;
       } else if (esf != NULL) {
               fp = esf->s0;
               ra = esf->mepc;
       } else if ((csf == NULL) || csf == &_current->callee_saved) {
               esf = *((struct arch_esf **)(((uintptr_t)_current_cpu->irq_stack) - 16));
               fp = esf->s0;
               ra = (uintptr_t)walk_stackframe;
       } else {
               fp = csf->s0;
               ra = csf->ra;
       }

This would require some comments. Way too much magic is going on here.
For instance, what is that irq_stack - 16 all about? I probably can guess
only because I'm familiar with the IRQ stack switch as I wrote it. But even
so I'm not completely sure. Now imagine anybody else.

Also, commenting on how each of these esf and csf combinations may
come to be would help as well.

Furthermore, if both esf and csf are NULL then the first if is true.
If the second if is false, that means esf is null and therefore csf
must be non null... meaning the csf == NULL condition in the 3rd if will
never be true.

So please don't hesitate to comment such tricky code. Otherwise a year from
now even you won't remember why you wrote it.

ycsin · 2024-06-12T10:10:13Z

Thanks for the review

You may refer to a git hash only if it is merged upstream. Pull requests are
applied with a "rebase" not with a "merge" so any hash from any private tree
will be different once applied.

I've updated the commit to point at the PR instead

Furthermore, if both esf and csf are NULL then the first if is true.
If the second if is false, that means esf is null and therefore csf
must be non null... meaning the csf == NULL condition in the 3rd if will
never be true.

Good catch, I've reordered the if conditionals and added some comments to it; could you please take another look?
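
For readers following the thread, one ordering that avoids the dead `csf == NULL` check could look like the sketch below. It is only an illustration: `struct esf`, `struct csf`, `enum walk_start` and `pick_start()` are trimmed-down, hypothetical stand-ins, and it assumes that a NULL esf together with a NULL csf is meant to select the current thread rather than return early. The code actually pushed to the PR may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for the real structures. */
struct esf { unsigned long s0, mepc; };
struct csf { unsigned long s0, ra; };

enum walk_start {
	START_FROM_ESF,     /* unwind the supplied exception frame */
	START_FROM_CURRENT, /* unwind the running thread */
	START_FROM_THREAD,  /* unwind another thread's saved context */
};

/* Decide where the walk starts; each branch is reachable. */
static enum walk_start pick_start(const struct esf *esf, const struct csf *csf,
				  const struct csf *current_csf)
{
	assert(current_csf != NULL);

	if (esf != NULL) {
		/* An exception frame takes priority: start at the faulting context. */
		return START_FROM_ESF;
	}
	if (csf == NULL || csf == current_csf) {
		/* Nothing supplied, or the caller asked for the running thread. */
		return START_FROM_CURRENT;
	}
	/* A different thread: start from its switched-out registers. */
	return START_FROM_THREAD;
}
```

Testing esf first means the second condition is only evaluated when esf is NULL, so `csf == NULL` is a live case there instead of being swallowed by an earlier branch.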

An architecture can indicate that it has an implementation for the `arch_stack_walk()` function by selecting `ARCH_HAS_STACKWALK`. Set the default value of `EXCEPTION_STACK_TRACE_MAX_FRAMES` to `ARCH_STACKWALK_MAX_FRAMES` if the latter is available. Signed-off-by: Yong Cong Sin <ycsin@meta.com>

Created the `arch_stack_walk()` function from the original `z_riscv_unwind_stack()`; it's been updated to support unwinding any thread. Updated the stack_unwind test case accordingly. Increased the delay in `test_fatal_on_smp` to wait for the fatal thread to be terminated, as stacktrace can take a bit more time. Doubled the kernel/smp testcase timeout from 60 (default) to 120s, as some of the tests can take a little bit more than 60s to finish. Signed-off-by: Yong Cong Sin <ycsin@meta.com>

Now that the unwind starts from mepc already, the symbol name at the mepc reg is kinda redundant, so just remove it. Signed-off-by: Yong Cong Sin <ycsin@meta.com>

Add a shell command to unwind a thread using its thread id.

uart:~$ kernel threads
Scheduler: 11 since last call
Threads:
*0x80017138 shell_uart
        options: 0x0, priority: 14 timeout: 0
        state: queued, entry: 0x800029ac
        stack size 3072, unused 1316, usage 1756 / 3072 (57 %)

 0x80017ca8 sysworkq
        options: 0x1, priority: -1 timeout: 0
        state: pending, entry: 0x80006842
        stack size 1024, unused 644, usage 380 / 1024 (37 %)

 0x800177e0 idle
        options: 0x1, priority: 15 timeout: 0
        state: , entry: 0x800065ae
        stack size 512, unused 180, usage 332 / 512 (64 %)

 0x80017950 main
        options: 0x1, priority: 0 timeout: 13
        state: suspended, entry: 0x80006326
        stack size 4096, unused 3604, usage 492 / 4096 (12 %)

uart:~$ kernel unwind 0x80017ca8
Unwinding 0x80017ca8 sysworkq
ra: 0x80007114 [z_swap+0x58]
ra: 0x80007ae8 [z_sched_wait+0x10]
ra: 0x8000689a [work_queue_main+0x58]
ra: 0x800006de [z_thread_entry+0x2e]

Signed-off-by: Yong Cong Sin <ycsin@meta.com>

ycsin · 2024-06-13T03:06:58Z

Increased some of the timeouts in tests, as the stacktrace requires a little bit more time than before, and rebased on main.

ycsin · 2024-06-13T03:12:00Z

I'm not sure if it's just that my laptop's performance is bad, but test_inc_concurrency alone takes 100+ seconds to finish:

START - test_inc_concurrency
type 0: cnt 60000, spend 1299 ms
type 1: cnt 60000, spend 53075 ms
type 2: cnt 60000, spend 54327 ms
 PASS - test_inc_concurrency in 108.705 seconds

Not sure if 120s is enough in upstream

ycsin · 2024-06-13T16:55:32Z

ping @andyross @dcpleung, could you please take another look? Thanks

ycsin changed the title ~~Pr/arch stack walk~~ arch: introduce arch_stack_walk() May 31, 2024

ycsin force-pushed the pr/arch_stack_walk branch from 0c6c067 to 44b19d7 Compare May 31, 2024 15:15

ycsin requested review from fkokosinski, tejlmand, andyross, carlocaione and cfriedt and removed request for tejlmand May 31, 2024 15:17

ycsin commented May 31, 2024

include/zephyr/arch/arch_interface.h (outdated)

ycsin mentioned this pull request May 31, 2024

subsys: samples: doc: arch: Make perf subsystem #72890

Merged

ycsin added area: Kernel area: Debugging area: RISCV RISCV Architecture (32-bit & 64-bit) labels May 31, 2024

ycsin force-pushed the pr/arch_stack_walk branch from 44b19d7 to 2ae2c15 Compare May 31, 2024 15:58

ycsin mentioned this pull request May 31, 2024

arch: deprecate z_arch_esf_t with struct arch_esf, introduce an arch-agnostic exception.h for it #73593

Merged

ycsin commented May 31, 2024

arch/Kconfig (outdated)

ycsin marked this pull request as ready for review May 31, 2024 16:43

zephyrbot added the area: Architectures label May 31, 2024

zephyrbot requested review from dcpleung, edersondisouza, katsuster, kgugala, mgielda, nashif, npitre and tgorochowik May 31, 2024 16:43

zephyrbot assigned fkokosinski May 31, 2024

andyross reviewed May 31, 2024

ycsin force-pushed the pr/arch_stack_walk branch from 2ae2c15 to fc78d4e Compare June 1, 2024 01:19

ycsin commented Jun 1, 2024

include/zephyr/arch/arch_interface.h (outdated)

ycsin force-pushed the pr/arch_stack_walk branch from c2f111c to 9fa2dd3 Compare June 11, 2024 15:07

ycsin added this to the v3.7.0 milestone Jun 11, 2024

ycsin force-pushed the pr/arch_stack_walk branch from 9fa2dd3 to b1b05eb Compare June 12, 2024 10:06

ycsin force-pushed the pr/arch_stack_walk branch 2 times, most recently from 12574fe to bc53115 Compare June 12, 2024 16:37

npitre previously approved these changes Jun 12, 2024

nashif previously approved these changes Jun 12, 2024

ycsin dismissed stale reviews from nashif and npitre via 51af6ec June 13, 2024 01:12

ycsin force-pushed the pr/arch_stack_walk branch from bc53115 to 51af6ec Compare June 13, 2024 01:12

ycsin added 4 commits June 13, 2024 10:50

arch: riscv: stop printing symbol name at mepc

c919034

ycsin force-pushed the pr/arch_stack_walk branch from 51af6ec to 21a0e8a Compare June 13, 2024 02:55

npitre approved these changes Jun 13, 2024

ycsin requested a review from nashif June 13, 2024 04:07

nashif approved these changes Jun 13, 2024

cfriedt approved these changes Jun 13, 2024

dcpleung approved these changes Jun 13, 2024

nashif merged commit b98a607 into zephyrproject-rtos:main Jun 13, 2024
31 checks passed

ycsin deleted the pr/arch_stack_walk branch June 14, 2024 00:52
