Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arch: introduce arch_stack_walk() and add implementation for RISCV #73587

Merged
merged 4 commits into from
Jun 13, 2024

Conversation

ycsin
Copy link
Member

@ycsin ycsin commented May 31, 2024

While reviewing #72890, I think the stacktrace inplementation in RISCV can probably be factored out so that it can be useful for #72890.

And seems like Linux already has an API for such use case, I guess we can probably just steal it over..

Added an implementation for RISC-V, tested and works on a modified version of #72890
perf_sp_1

Added a shell command to unwind a stack:

uart:~$ kernel threads
Scheduler: 11 since last call
Threads:
*0x80017138 shell_uart
        options: 0x0, priority: 14 timeout: 0
        state: queued, entry: 0x800029ac
        stack size 3072, unused 1316, usage 1756 / 3072 (57 %)

 0x80017ca8 sysworkq
        options: 0x1, priority: -1 timeout: 0
        state: pending, entry: 0x80006842
        stack size 1024, unused 644, usage 380 / 1024 (37 %)

 0x800177e0 idle
        options: 0x1, priority: 15 timeout: 0
        state: , entry: 0x800065ae
        stack size 512, unused 180, usage 332 / 512 (64 %)

 0x80017950 main
        options: 0x1, priority: 0 timeout: 13
        state: suspended, entry: 0x80006326
        stack size 4096, unused 3604, usage 492 / 4096 (12 %)

uart:~$ kernel unwind 0x80017ca8
Unwinding 0x80017ca8 sysworkq
ra: 0x80007114 [z_swap+0x58]
ra: 0x80007ae8 [z_sched_wait+0x10]
ra: 0x8000689a [work_queue_main+0x58]
ra: 0x800006de [z_thread_entry+0x2e]

@ycsin ycsin changed the title Pr/arch stack walk arch: introduce arch_stack_walk() May 31, 2024
@ycsin ycsin requested review from fkokosinski, tejlmand, andyross, carlocaione and cfriedt and removed request for tejlmand May 31, 2024 15:17
arch/Kconfig Outdated Show resolved Hide resolved
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes. Overall this looks clean to me. Note that there is also a ton of per-arch stack walking code that e.g. does LOG_ERR() or printk() to hand format stack traces that we should look at replacing with a portable one based on this.

include/zephyr/arch/arch_interface.h Outdated Show resolved Hide resolved
@@ -1240,6 +1242,35 @@ bool arch_pcie_msi_vector_connect(msi_vector_t *vector,
*/
void arch_spin_relax(void);

#ifdef CONFIG_ARCH_STACKWALK
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this as a Kconfig? Seems like a feature that, if unused, would be dropped from the link. The only other reason to make it tunable would be to optimize build time for uncalled code, but again this is tiny and IMHO we don't want to be playing games like that at the arch level.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed CONFIG_ARCH_STACKWALK but I retained the CONFIG_ARCH_HAS_STACKWALK, which is used to guard the CONFIG_ARCH_STACKWALK_MAX_FRAMES. CONFIG_ARCH_STACKWALK_MAX_FRAMES (like the CONFIG_EXCEPTION_STACK_TRACE_MAX_FRAMES) will be helpful to make sure that we dont stuck in an infinite loop, LMK what you think

include/zephyr/arch/arch_interface.h Outdated Show resolved Hide resolved
@ycsin
Copy link
Member Author

ycsin commented Jun 11, 2024

Removed the implementation for x86 & arm64 as they probably do not work properly outside of the fatal use case.

@ycsin
Copy link
Member Author

ycsin commented Jun 11, 2024

cc @npitre

@ycsin ycsin added this to the v3.7.0 milestone Jun 11, 2024
@ycsin
Copy link
Member Author

ycsin commented Jun 11, 2024

arch_ interfaces are not something meant to be implemented out-of-tree

Q: Does this apply to the OOT archs? Do we actually support OOT archs?

@npitre
Copy link
Collaborator

npitre commented Jun 11, 2024

Your commit log says:

The fp-based implementation of stack-unwinding when the esf is
NULL came from a5d86b8b077466384f2c2ea6771498a1e265bb95.

But...

$ git log a5d86b8b077466384f2c2ea6771498a1e265bb95
fatal: bad object a5d86b8b077466384f2c2ea6771498a1e265bb95

You may refer to a git hash only if it is merged upstream. Pull requests are
applied with a "rebase" not with a "merge" so any hash from any private tree
will be different once applied.

Then, in the code:

       if ((esf == NULL) && (csf == NULL)) {
               return;
       } else if (esf != NULL) {
               fp = esf->s0;
               ra = esf->mepc;
       } else if ((csf == NULL) || csf == &_current->callee_saved) {
               esf = *((struct arch_esf **)(((uintptr_t)_current_cpu->irq_stack) - 16));
               fp = esf->s0;
               ra = (uintptr_t)walk_stackframe;
       } else {
               fp = csf->s0;
               ra = csf->ra;
       }

This would require some comments. Way too much magic is going on here.
For instance, what is that irq_stack - 16 all about? I probably can guess
only because I'm familiar with the IRQ stack switch as I wrote it. But even
so I'm not completely sure. Now imagine anybody else.

Also, commenting how each of these esfand csf combinations may
come to be would help as well.

Furthermore, if both esf and csf are NULL then the first if is true.
If the second if is false, that means esf is null and therefore csf
must be non null... meaning the csf == NULL condition in the 3rd if will
never be true.

So please don't hesitate to comment such tricky code. Otherwise a year from
now even you won't remember why you wrote it.

@ycsin
Copy link
Member Author

ycsin commented Jun 12, 2024

Thanks for the review

You may refer to a git hash only if it is merged upstream. Pull requests are
applied with a "rebase" not with a "merge" so any hash from any private tree
will be different once applied.

I've updated the commit to point at the PR instead

Furthermore, if both esf and csf are NULL then the first if is true.
If the second if is false, that means esf is null and therefore csf
must be non null... meaning the csf == NULL condition in the 3rd if will
never be true.

Good catch, I've reoredered the if conditionals and added some comments to it, could you please take another look?

@ycsin ycsin force-pushed the pr/arch_stack_walk branch 2 times, most recently from 12574fe to bc53115 Compare June 12, 2024 16:37
npitre
npitre previously approved these changes Jun 12, 2024
nashif
nashif previously approved these changes Jun 12, 2024
@ycsin ycsin dismissed stale reviews from nashif and npitre via 51af6ec June 13, 2024 01:12
An architecture can indicate that it has an implementation for
the `arch_stack_walk()` function by selecting
`ARCH_HAS_STACKWALK`.

Set the default value of `EXCEPTION_STACK_TRACE_MAX_FRAMES` to
`ARCH_STACKWALK_MAX_FRAMES` if the latter is available.

Signed-off-by: Yong Cong Sin <ycsin@meta.com>
Created the `arch_stack_walk()` function out from the original
`z_riscv_unwind_stack()`, it's been updated to support
unwinding any thread.

Updated the stack_unwind test case accordingly.

Increased the delay in `test_fatal_on_smp`, to wait
for the the fatal thread to be terminated, as stacktrace can
take a bit more time.

Doubled the kernel/smp testcase timeout from 60 (default) to
120s, as some of the tests can take a little bit more than 60s
to finish.

Signed-off-by: Yong Cong Sin <ycsin@meta.com>
Now that the unwind starts from mepc already, the symbol
name at the mepc reg is kinda redundant, so just remove it.

Signed-off-by: Yong Cong Sin <ycsin@meta.com>
Add a shell command to unwind a thread using its thread id.

uart:~$ kernel threads
Scheduler: 11 since last call
Threads:
*0x80017138 shell_uart
        options: 0x0, priority: 14 timeout: 0
        state: queued, entry: 0x800029ac
        stack size 3072, unused 1316, usage 1756 / 3072 (57 %)

 0x80017ca8 sysworkq
        options: 0x1, priority: -1 timeout: 0
        state: pending, entry: 0x80006842
        stack size 1024, unused 644, usage 380 / 1024 (37 %)

 0x800177e0 idle
        options: 0x1, priority: 15 timeout: 0
        state: , entry: 0x800065ae
        stack size 512, unused 180, usage 332 / 512 (64 %)

 0x80017950 main
        options: 0x1, priority: 0 timeout: 13
        state: suspended, entry: 0x80006326
        stack size 4096, unused 3604, usage 492 / 4096 (12 %)

uart:~$ kernel unwind 0x80017ca8
Unwinding 0x80017ca8 sysworkq
ra: 0x80007114 [z_swap+0x58]
ra: 0x80007ae8 [z_sched_wait+0x10]
ra: 0x8000689a [work_queue_main+0x58]
ra: 0x800006de [z_thread_entry+0x2e]

Signed-off-by: Yong Cong Sin <ycsin@meta.com>
@ycsin
Copy link
Member Author

ycsin commented Jun 13, 2024

Increase some of the timeouts in tests as the stacktrace requires a little bit more time than before, and rebased on main

@ycsin
Copy link
Member Author

ycsin commented Jun 13, 2024

I'm not sure if it's just my laptop's performance is bad, the test_inc_concurrency alone takes 100+ seconds to finish

START - test_inc_concurrency
type 0: cnt 60000, spend 1299 ms
type 1: cnt 60000, spend 53075 ms
type 2: cnt 60000, spend 54327 ms
 PASS - test_inc_concurrency in 108.705 seconds

Not sure if 120s is enough in upstream

@ycsin ycsin requested a review from nashif June 13, 2024 04:07
@ycsin
Copy link
Member Author

ycsin commented Jun 13, 2024

ping @andyross @dcpleung, could you please take another look? Thanks

@nashif nashif merged commit b98a607 into zephyrproject-rtos:main Jun 13, 2024
31 checks passed
@ycsin ycsin deleted the pr/arch_stack_walk branch June 14, 2024 00:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants