Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Demotion reloaded #2

Merged
merged 13 commits into from
Jul 28, 2017

Conversation

lucabe72
Copy link

Demotion reloaded, without migration

balsini and others added 9 commits June 20, 2017 10:03
To implement a new throttling policy for RT cgroups, the already existing
mechanism is removed from rt.c.

Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
The runtime of RT tasks controlled by CGroups are enforced by the
SCHED_DEADLINE scheduling class, based on the runtime and period (the
deadline is set equal to the period) parameters.

sched_dl_entity may also represent a group of RT tasks, providing a rt_rq.

Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Alessio Balsini <a.balsini@sssup.it>
Add pointer to rt_rq used when a demoted task was demoted
(before any migrations of the demoted task). This allows
locking the correct rq for lists manipulation of the rt_se's
cfs_throttle_task.

Signed-off-by: Andres Oportus <andresoportus@google.com>
@jlelli jlelli self-requested a review June 20, 2017 12:52
@jlelli jlelli self-assigned this Jun 20, 2017
/*
if (running)
put_prev_task(rq, p);
*/
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this commented out?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replied to the email, but I do not see replies here... So, here it is again:
this function is invoked by cfs_throttle_rt_tasks(), that is invoked by update_curr_rt().
Invoking put_prev_task() would result in another invocation of update_curr_rt(), potentially causing some issues.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Also, considering that put_prev_task_rt() would only enqueue the task in the pushable list (and we don't want that). It seems save to remove it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I remember: I removed it due to some crash I was seeing, that I decided to be caused by infinite recursion (update_curr_rt -> cfs_throttle_rt_tasks -> __setprio_fifo -> put_prev_task_rt -> update_curr_rt -> cfs_throttle_rt_tasks -> ...)

enqueue_task(cpu_rq(cpu), p, ENQUEUE_REPLENISH | ENQUEUE_MOVE | ENQUEUE_RESTORE);

check_class_changed(cpu_rq(cpu), p, prev_class, oldprio);
out:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This label doesn't seem to be used.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhmm... Right. I suspect is a leftover from some previous change; I am going to check

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I checked:
your original patch contained an

  •   if (p->sched_class == &rt_sched_class)
    
  •           goto out;
    

near the beginning of __setprio_fifo().
Since I think that entering __setprio_fifo() with sched_class == rt_sched_class, I changed this to

  •   BUG_ON(p->sched_class == &rt_sched_class);
    

but I forgot to remove the "out:" label.

@lucabe72
Copy link
Author

lucabe72 commented Jun 22, 2017 via email

@jlelli
Copy link
Owner

jlelli commented Jun 22, 2017

So, periodic and periodic1 doesn't seem to have problems (anymore).
But, periodic2 generates the following (when run for over 100 sec):

[  147.659662] Unable to handle kernel NULL pointer dereference at virtual address 00000038
[  147.667862] pgd = ffffff800a7ac000
[  147.671300] [00000038] *pgd=000000007bffe003, *pud=000000007bffe003, *pmd=0000000000000000
[  147.679683] Internal error: Oops: 96000006 [#1] PREEMPT SMP
[  147.685326] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.43-HCBS-Demotion-05354-g332859bfbe08-dirty #4
[  147.694823] Hardware name: HiKey Development Board (DT)
[  147.700106] task: ffffffc035156100 ti: ffffffc035158000 task.ti: ffffffc035158000
[  147.707681] PC is at set_next_entity+0x2c/0x10a0
[  147.712351] LR is at pick_next_task_fair+0xb0/0xd10
[  147.717281] pc : [<ffffff800810a3d8>] lr : [<ffffff800811908c>] pstate: 600001c5
[  147.724759] sp : ffffffc03515bd50
[  147.728107] x29: ffffffc03515bd50 x28: ffffff8008d60428
[  147.733491] x27: ffffff8008d60000 x26: ffffffc0794a6f80
[  147.738873] x25: ffffffc035156700 x24: 0000000000000000
[  147.744254] x23: ffffff8009854000 x22: ffffffc0794a6f98 [  147.749209] CPU0: update max cpu_capacity 1024

[  147.753970]
[  147.755650] x21: ffffffc0794a7038 x20: ffffffc0794a6f80
[  147.761028] x19: 0000000000000000 x18: 0000000000000000
[  147.766405] x17: 0000000000000000 x16: 0000000000000000
[  147.771783] x15: 0000000000000000 x14: 0000000000000000
[  147.777160] x13: 0000000000000000 x12: 0000000034d5d91d
[  147.782537] x11: ffffff8008d60420 x10: 0000000000000005
[  147.787914] x9 : ffffff80098f7000 x8 : 0000000000000004
[  147.793292] x7 : ffffff8008d40980 x6 : 0000000000000000
[  147.798668] x5 : 0000000000000080 x4 : ffffff8008118fdc
[  147.804044] x3 : 0000000000000001 x2 : ffffff80081069ec
[  147.809421] x1 : 0000000000000000 x0 : ffffff800811908c
[  147.814801]
[  147.814801] SP: 0xffffffc03515bcd0:
[  147.819817] bcd0  794a6f98 ffffffc0 09854000 ffffff80 00000000 00000000 35156700 ffffffc0
[  147.828134] bcf0  794a6f80 ffffffc0 08d60000 ffffff80 08d60428 ffffff80 3515bd50 ffffffc0
[  147.836448] bd10  0811908c ffffff80 3515bd50 ffffffc0 0810a3d8 ffffff80 600001c5 00000000
[  147.844763] bd30  3515bd80 ffffffc0 081316d4 ffffff80 ffffffff ffffffff 35158000 ffffffc0
[  147.853079] bd50  3515bde0 ffffffc0 0811908c ffffff80 00000000 00000000 794a6f80 ffffffc0
[  147.861393] bd70  794a7038 ffffffc0 794a6f98 ffffffc0 09854000 ffffff80 00000000 00000000
[  147.869708] bd90  35156700 ffffffc0 794a6f80 ffffffc0 08d60000 ffffff80 08d60428 ffffff80
[  147.878024] bdb0  794a7038 ffffffc0 794a6f80 ffffffc0 794a7038 ffffffc0 794a6f98 ffffffc0
[  147.886347]
[  147.886347] X20: 0xffffffc0794a6f00:
[  147.891451] 6f00  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  147.899767] 6f20  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  147.908082] 6f40  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  147.916397] 6f60  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  147.924713] 6f80  f009f008 dead4ead 00000003 00000000 35156100 ffffffc0 099b0f60 ffffff80
[  147.933028] 6fa0  09bcd430 ffffff80 09bdbac0 ffffff80 0900fe08 ffffff80 00000003 00000000
[  147.941343] 6fc0  080fc5e8 ffffff80 00000001 00000000 00000000 00000000 00000000 00000000
[  147.949658] 6fe0  00000004 00000000 00000010 00000000 0000001b 00000000 ffff6b0f 00000000
[  147.957974]
[  147.957974] X21: 0xffffffc0794a6fb8:
[  147.963078] 6fb8  00000003 00000000 080fc5e8 ffffff80 00000001 00000000 00000000 00000000
[  147.971394] 6fd8  00000000 00000000 00000004 00000000 00000010 00000000 0000001b 00000000
[  147.979709] 6ff8  ffff6b0f 00000000 00000000 00000000 00000000 00000000 00000001 00000000
[  147.988025] 7018  000000ce 00000000 00000000 00000000 000043b8 00000000 00007b88 00000000
[  147.996339] 7038  000000ce 00000000 00000000 00000000 00000001 00000001 7db4db90 00000008
[  148.004654] 7058  18552f7c 00000034 44a07210 ffffffc0 00000000 00000000 00000000 00000000
[  148.012969] 7078  00000000 00000000 00000000 00000000 00000000 00000000 0000001e 00000000
[  148.021284] 7098  6132d986 00000022 002356fe 00000000 00b376ae 00000052 00000030 00000000
[  148.029600]
[  148.029600] X22: 0xffffffc0794a6f18:
[  148.034704] 6f18  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  148.043018] 6f38  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  148.051334] 6f58  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  148.059649] 6f78  00000000 00000000 f009f008 dead4ead 00000003 00000000 35156100 ffffffc0
[  148.067964] 6f98  099b0f60 ffffff80 09bcd430 ffffff80 09bdbac0 ffffff80 0900fe08 ffffff80
[  148.076279] 6fb8  00000003 00000000 080fc5e8 ffffff80 00000001 00000000 00000000 00000000
[  148.084594] 6fd8  00000000 00000000 00000004 00000000 00000010 00000000 0000001b 00000000
[  148.092909] 6ff8  ffff6b0f 00000000 00000000 00000000 00000000 00000000 00000001 00000000
[  148.101226]
[  148.101226] X25: 0xffffffc035156680:
[  148.106330] 6680  00000001 00000000 00000000 00000000 00000001 00000000 00000000 00000000
[  148.114645] 66a0  00000000 00000000 00000000 00000000 00000000 dead4ead ffffffff 00000000
[  148.122960] 66c0  ffffffff ffffffff 099ad6e8 ffffff80 00000000 00000000 00000000 00000000
[  148.131276] 66e0  0900a3b0 ffffff80 00000000 00000000 00000000 00000000 00000000 00000000
[  148.139592] 6700  00002b64 00000000 03938700 00000000 03938700 00000000 00000000 00000000
[  148.147907] 6720  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  148.156223] 6740  35156740 ffffffc0 35156740 ffffffc0 35156750 ffffffc0 35156750 ffffffc0
[  148.164539] 6760  35156760 ffffffc0 35156760 ffffffc0 00000000 00000000 3d922040 ffffffc0
[  148.172857]
[  148.172857] X26: 0xffffffc0794a6f00:
[  148.177960] 6f00  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  148.186275] 6f20  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  148.194590] 6f40  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  148.202905] 6f60  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  148.211220] 6f80  f009f008 dead4ead 00000003 00000000 35156100 ffffffc0 099b0f60 ffffff80
[  148.219535] 6fa0  09bcd430 ffffff80 09bdbac0 ffffff80 0900fe08 ffffff80 00000003 00000000
[  148.227850] 6fc0  080fc5e8 ffffff80 00000001 00000000 00000000 00000000 00000000 00000000
[  148.236165] 6fe0  00000004 00000000 00000010 00000000 0000001b 00000000 ffff6b0f 00000000
[  148.244482]
[  148.244482] X29: 0xffffffc03515bcd0:
[  148.249585] bcd0  794a6f98 ffffffc0 09854000 ffffff80 00000000 00000000 35156700 ffffffc0
[  148.257900] bcf0  794a6f80 ffffffc0 08d60000 ffffff80 08d60428 ffffff80 3515bd50 ffffffc0
[  148.266215] bd10  0811908c ffffff80 3515bd50 ffffffc0 0810a3d8 ffffff80 600001c5 00000000
[  148.274531] bd30  3515bd80 ffffffc0 081316d4 ffffff80 ffffffff ffffffff 35158000 ffffffc0
[  148.282846] bd50  3515bde0 ffffffc0 0811908c ffffff80 00000000 00000000 794a6f80 ffffffc0
[  148.291161] bd70  794a7038 ffffffc0 794a6f98 ffffffc0 09854000 ffffff80 00000000 00000000
[  148.299477] bd90  35156700 ffffffc0 794a6f80 ffffffc0 08d60000 ffffff80 08d60428 ffffff80
[  148.307792] bdb0  794a7038 ffffffc0 794a6f80 ffffffc0 794a7038 ffffffc0 794a6f98 ffffffc0
[  148.316107]
[  148.317611] Process swapper/3 (pid: 0, stack limit = 0xffffffc035158020)
[  148.324384] Stack: (0xffffffc03515bd50 to 0xffffffc03515c000)
[  148.330191] bd40:                                   ffffffc03515bde0 ffffff800811908c
[  148.338107] bd60: 0000000000000000 ffffffc0794a6f80 ffffffc0794a7038 ffffffc0794a6f98
[  148.346022] bd80: ffffff8009854000 0000000000000000 ffffffc035156700 ffffffc0794a6f80
[  148.353937] bda0: ffffff8008d60000 ffffff8008d60428 ffffffc0794a7038 ffffffc0794a6f80
[  148.361852] bdc0: ffffffc0794a7038 ffffffc0794a6f98 ffffff8009854000 ffffff80081316d4
[  148.369767] bde0: ffffffc03515be90 ffffff8008d40cb4 ffffffc0794a6f80 ffffffc035156100
[  148.377683] be00: 0000000000000000 ffffffc0794a6f98 ffffff8009854000 0000000000000000
[  148.385598] be20: ffffffc035156700 ffffffc0794a6f80 ffffff8008d60000 ffffff8008d60428
[  148.393513] be40: ffffff80093def80 ffffffc035156100 ffffffc0794a7038 ffffff80098f7cd0
[  148.401428] be60: ffffffc0794a7038 ffffff8008d605f0 ffffffc000000000 ffffffc035156100
[  148.409343] be80: ffffffc0794a6f80 ffffffc035156100 ffffffc03515bf20 ffffff8008d41574
[  148.417258] bea0: ffffffc035158000 ffffff8008d5f000 ffffff80099a0000 ffffffc071d94400
[  148.425173] bec0: ffffff8009946ab8 ffffff8009218cc0 ffffff80093ddc50 ffffffc035158000
[  148.433088] bee0: ffffff800999e000 ffffff8009852000 ffffffc03515bf20 ffffff8008d4156c
[  148.441004] bf00: ffffffc035158000 ffffff8008d5f000 ffffff80099a0000 ffffff8008d41574
[  148.448919] bf20: ffffffc03515bf40 ffffff8008d415f0 ffffff8009852000 ffffff8008d5f000
[  148.456834] bf40: ffffffc03515bf50 ffffff8008121754 ffffffc03515bfc0 ffffff8008090e64
[  148.464749] bf60: 0000000000000003 ffffff800989e080 ffffffc035158000 0000000000000000
[  148.472663] bf80: 0000000000000000 0000000000000000 00000000027a9000 00000000027ac000
[  148.480579] bfa0: ffffff80080828d0 0000000000000000 00000000ffffffff ffffffc035158000
[  148.488494] bfc0: 0000000000000000 0000000000d4d03c 0000000034d5d91d 0000000000000e12
[  148.496409] bfe0: 0000000000000000 0000000000000000 00ee003e00e900a5 e9db62ffd3fb42ff
[  148.504322] Call trace:
[  148.506793] Exception stack(0xffffffc03515bb80 to 0xffffffc03515bcb0)
[  148.513303] bb80: 0000000000000000 0000008000000000 ffffffc03515bd50 ffffff800810a3d8
[  148.521218] bba0: 0000000000000055 0000000000000114 ffffffc03515bcd0 ffffff8008136434
[  148.529134] bbc0: ffffffc035158000 ffffff800a6f0288 0000000000000000 0000000000000000
[  148.537048] bbe0: 0000000000000002 0000000000000001 0000000000000000 ffffff800816cae8
[  148.544963] bc00: 00000000000001c0 ffffff80099a0468 0000000000000000 0000000000000000
[  148.552878] bc20: ffffff800811908c 0000000000000000 ffffff80081069ec 0000000000000001
[  148.560793] bc40: ffffff8008118fdc 0000000000000080 0000000000000000 ffffff8008d40980
[  148.568708] bc60: 0000000000000004 ffffff80098f7000 0000000000000005 ffffff8008d60420
[  148.576622] bc80: 0000000034d5d91d 0000000000000000 0000000000000000 0000000000000000
[  148.584536] bca0: 0000000000000000 0000000000000000
[  148.589468] [<ffffff800810a3d8>] set_next_entity+0x2c/0x10a0
[  148.595189] [<ffffff800811908c>] pick_next_task_fair+0xb0/0xd10
[  148.601176] [<ffffff8008d40cb4>] __schedule+0x420/0xc10
[  148.606458] [<ffffff8008d41574>] schedule+0x40/0xa0
[  148.611389] [<ffffff8008d415f0>] schedule_preempt_disabled+0x1c/0x2c
[  148.617815] [<ffffff8008121754>] cpu_startup_entry+0x13c/0x464
[  148.623713] [<ffffff8008090e64>] secondary_start_kernel+0x164/0x1b4
[  148.630046] [<0000000000d4d03c>] 0xd4d03c
[  148.634099] Code: aa0103f3 aa0003f5 aa1e03e0 d503201f (b9403a60)
[  148.749686] BUG: spinlock lockup suspected on CPU#0, kworker/0:1/578
[  148.756117]  lock: 0xffffffc0794a6f80, .magic: dead4ead, .owner: swapper/3/0, .owner_cpu: 3
[  148.764563] CPU: 0 PID: 578 Comm: kworker/0:1 Tainted: G      D         4.4.43-HCBS-Demotion-05354-g332859bfbe08-dirty #4
[  148.775637] Hardware name: HiKey Development Board (DT)
[  148.780924] Workqueue: events_freezable thermal_zone_device_check
[  148.787086] Call trace:
[  148.789559] [<ffffff800808ae98>] dump_backtrace+0x0/0x1e0
[  148.795016] [<ffffff800808b098>] show_stack+0x20/0x28
[  148.800125] [<ffffff8008553374>] dump_stack+0xa8/0xe0
[  148.805231] [<ffffff800813a4d4>] spin_dump+0x78/0x9c
[  148.810248] [<ffffff800813a7c8>] do_raw_spin_lock+0x180/0x1b4
[  148.816057] [<ffffff8008d46fb4>] _raw_spin_lock_irqsave+0x78/0x98
[  148.822217] [<ffffff8008123a60>] cpufreq_notifier_trans+0x128/0x14c
[  148.828552] [<ffffff80080ef154>] notifier_call_chain+0x64/0x9c
[  148.834449] [<ffffff80080efbdc>] __srcu_notifier_call_chain+0xa0/0xf0
[  148.840958] [<ffffff80080efc64>] srcu_notifier_call_chain+0x38/0x44
[  148.847296] [<ffffff80088f5644>] cpufreq_notify_transition+0xfc/0x2e0
[  148.853807] [<ffffff80088f7bec>] cpufreq_freq_transition_end+0x3c/0xb0
[  148.860405] [<ffffff80088f84a0>] __cpufreq_driver_target+0x1dc/0x320
[  148.866829] [<ffffff80088fa460>] cpufreq_governor_performance+0x50/0x60
[  148.873516] [<ffffff80088f6034>] __cpufreq_governor+0xb8/0x1ec
[  148.879411] [<ffffff80088f6994>] cpufreq_set_policy+0x2ac/0x3f0
[  148.885394] [<ffffff80088f9164>] cpufreq_update_policy+0x84/0x114
[  148.891555] [<ffffff80088da4ec>] cpufreq_set_cur_state+0x64/0x94
[  148.897626] [<ffffff80088d4ca4>] thermal_cdev_update.part.26+0x9c/0x22c
[  148.904312] [<ffffff80088d5b48>] power_actor_set_power+0x70/0x9c
[  148.910384] [<ffffff80088d9bc0>] power_allocator_throttle+0x4c8/0xad8
[  148.916893] [<ffffff80088d4e9c>] handle_thermal_trip.part.21+0x68/0x334
[  148.923579] [<ffffff80088d56e4>] thermal_zone_device_update+0xb8/0x280
[  148.930177] [<ffffff80088d58cc>] thermal_zone_device_check+0x20/0x2c
[  148.936601] [<ffffff80080e55a8>] process_one_work+0x1f8/0x70c
[  148.942408] [<ffffff80080e5bf8>] worker_thread+0x13c/0x4a4
[  148.947953] [<ffffff80080ed5cc>] kthread+0xe8/0xfc
[  148.952796] [<ffffff8008085ed0>] ret_from_fork+0x10/0x40
[  149.166404] BUG: spinlock lockup suspected on CPU#4, periodic2.sh/2858
[  149.173013]  lock: 0xffffffc0794a6f80, .magic: dead4ead, .owner: swapper/3/0, .owner_cpu: 3
[  149.181457] CPU: 4 PID: 2858 Comm: periodic2.sh Tainted: G      D         4.4.43-HCBS-Demotion-05354-g332859bfbe08-dirty #4
[  149.192707] Hardware name: HiKey Development Board (DT)
[  149.197985] Call trace:
[  149.200458] [<ffffff800808ae98>] dump_backtrace+0x0/0x1e0
[  149.205915] [<ffffff800808b098>] show_stack+0x20/0x28
[  149.211021] [<ffffff8008553374>] dump_stack+0xa8/0xe0
[  149.216126] [<ffffff800813a4d4>] spin_dump+0x78/0x9c
[  149.221144] [<ffffff800813a7c8>] do_raw_spin_lock+0x180/0x1b4
[  149.226952] [<ffffff8008d46f20>] _raw_spin_lock+0x6c/0x88
[  149.232411] [<ffffff80080fae64>] __task_rq_lock+0x58/0xdc
[  149.237868] [<ffffff80080ffb10>] wake_up_new_task+0xdc/0x318
[  149.243588] [<ffffff80080c5f14>] _do_fork+0xfc/0x6f0
[  149.248606] [<ffffff80080c6658>] SyS_clone+0x44/0x50
[  149.253623] [<ffffff8008085f30>] el0_svc_naked+0x24/0x28
[  149.640303] BUG: spinlock lockup suspected on CPU#3, swapper/3/0
[  149.646376]  lock: 0xffffffc0794a6f80, .magic: dead4ead, .owner: swapper/3/0, .owner_cpu: 3
[  149.654819] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G      D         4.4.43-HCBS-Demotion-05354-g332859bfbe08-dirty #4
[  149.665542] Hardware name: HiKey Development Board (DT)
[  149.670820] Call trace:
[  149.673291] [<ffffff800808ae98>] dump_backtrace+0x0/0x1e0
[  149.678748] [<ffffff800808b098>] show_stack+0x20/0x28
[  149.683853] [<ffffff8008553374>] dump_stack+0xa8/0xe0
[  149.688959] [<ffffff800813a4d4>] spin_dump+0x78/0x9c
[  149.693976] [<ffffff800813a7c8>] do_raw_spin_lock+0x180/0x1b4
[  149.699783] [<ffffff8008d46f20>] _raw_spin_lock+0x6c/0x88
[  149.705240] [<ffffff80080fc5e8>] scheduler_tick+0x50/0x2ac
[  149.710785] [<ffffff800815cf78>] update_process_times+0x58/0x70
[  149.716769] [<ffffff80081703a4>] tick_sched_timer+0x7c/0xfc
[  149.722401] [<ffffff800815d5b8>] __hrtimer_run_queues+0x164/0x624
[  149.728560] [<ffffff800815eb74>] hrtimer_interrupt+0xb0/0x1f4
[  149.734369] [<ffffff8008933150>] arch_timer_handler_phys+0x3c/0x48
[  149.740618] [<ffffff8008149e90>] handle_percpu_devid_irq+0xe8/0x3d0
[  149.746954] [<ffffff8008145104>] generic_handle_irq+0x34/0x4c
[  149.752761] [<ffffff80081451ac>] __handle_domain_irq+0x90/0xf8
[  149.758656] [<ffffff8008082544>] gic_handle_irq+0x64/0xc4
[  149.764113] Exception stack(0xffffffc0792e0050 to 0xffffffc0792e0180)
[  149.770622] 0040:                                   ffffffc03515b900 0000008000000000
[  149.778537] 0060: ffffffc03515ba30 ffffff8008d47230 0000000060000145 ffffffc035156100
[  149.786452] 0080: ffffffc03515ba30 ffffffc03515b900 0000000000000000 0000000000000000
[  149.794366] 00a0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  149.802281] 00c0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  149.810196] 00e0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  149.818111] 0100: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  149.826026] 0120: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  149.833940] 0140: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  149.841855] 0160: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  149.849770] [<ffffff80080857b8>] el1_irq+0xb8/0x130
[  149.854699] [<ffffff8008d47230>] _raw_spin_unlock_irq+0x3c/0x74
[  149.860682] [<ffffff800808b168>] die+0xc8/0x1b4
[  149.865262] [<ffffff800809b36c>] __do_kernel_fault.part.6+0x7c/0x90
[  149.871601] [<ffffff80080986a0>] do_translation_fault+0x0/0xec
[  149.877498] [<ffffff8008098768>] do_translation_fault+0xc8/0xec
[  149.883481] [<ffffff80080822e0>] do_mem_abort+0x54/0xb4
[  149.888761] Exception stack(0xffffffc03515bb80 to 0xffffffc03515bcb0)
[  149.895272] bb80: 0000000000000000 0000008000000000 ffffffc03515bd50 ffffff800810a3d8
[  149.903187] bba0: 0000000000000055 0000000000000114 ffffffc03515bcd0 ffffff8008136434
[  149.911102] bbc0: ffffffc035158000 ffffff800a6f0288 0000000000000000 0000000000000000
[  149.919017] bbe0: 0000000000000002 0000000000000001 0000000000000000 ffffff800816cae8
[  149.926932] bc00: 00000000000001c0 ffffff80099a0468 0000000000000000 0000000000000000
[  149.934847] bc20: ffffff800811908c 0000000000000000 ffffff80081069ec 0000000000000001
[  149.942762] bc40: ffffff8008118fdc 0000000000000080 0000000000000000 ffffff8008d40980
[  149.950676] bc60: 0000000000000004 ffffff80098f7000 0000000000000005 ffffff8008d60420
[  149.958591] bc80: 0000000034d5d91d 0000000000000000 0000000000000000 0000000000000000
[  149.966505] bca0: 0000000000000000 0000000000000000
[  149.971434] [<ffffff80080855c8>] el1_da+0x18/0x78
[  149.976189] [<ffffff800811908c>] pick_next_task_fair+0xb0/0xd10
[  149.982174] [<ffffff8008d40cb4>] __schedule+0x420/0xc10
[  149.987456] [<ffffff8008d41574>] schedule+0x40/0xa0
[  149.992387] [<ffffff8008d415f0>] schedule_preempt_disabled+0x1c/0x2c
[  149.998809] [<ffffff8008121754>] cpu_startup_entry+0x13c/0x464
[  150.004705] [<ffffff8008090e64>] secondary_start_kernel+0x164/0x1b4
[  150.011038] [<0000000000d4d03c>] 0xd4d03c

@lucabe72
Copy link
Author

lucabe72 commented Jun 22, 2017 via email

@lucabe72
Copy link
Author

lucabe72 commented Jun 22, 2017 via email

@jlelli
Copy link
Owner

jlelli commented Jun 22, 2017

While switching between the two classes yes. Exactly when I'm not sure without adding some debug output.

@lucabe72
Copy link
Author

lucabe72 commented Jun 22, 2017 via email

@jlelli
Copy link
Owner

jlelli commented Jun 22, 2017

No sorry I wasn't clear. I extended the test to 200s and crash seems to happens after 100s (but this varies).

@lucabe72
Copy link
Author

lucabe72 commented Jun 23, 2017 via email

Luca Abeni added 4 commits July 6, 2017 08:25
In some cases, the scheduler invokes set_curr_task() before enqueueing the
task. But if the task is enqueued as RT, it can be demoted during enqueue...
In this case, set_curr_task() is called for the RT scheduling class, but
the task ends up being enqueued in the CFS rq... And set_curr_task() is
not invoked for the CFS scheduling class!

Fix this by explicitly invoking the CFS set_curr_task() (if needed) in
case of demotion, before enqueueing in the CFS rq.
The demotion mechanism currently still has some bugs, that can be
triggered by using the "periodic1.sh" or "periodic2.sh" scripts.
After some experiments, it turned out that these changes improve the
stability of the patchset (with this patch, the demotion mechanism can
survive to 33mins of "periodic1.sh" or "periodic2.sh".
The bugs are probably still there, though.
@jlelli jlelli merged commit af1ac98 into jlelli:wip/deadline-4.4-for-retis Jul 28, 2017
jlelli pushed a commit that referenced this pull request Dec 4, 2017
…l calls

Provide a different lockdep key for rxrpc_call::user_mutex when the call is
made on a kernel socket, such as by the AFS filesystem.

The problem is that lockdep registers a false positive between userspace
calling the sendmsg syscall on a user socket where call->user_mutex is held
whilst userspace memory is accessed whereas the AFS filesystem may perform
operations with mmap_sem held by the caller.

In such a case, the following warning is produced.

======================================================
WARNING: possible circular locking dependency detected
4.14.0-fscache+ torvalds#243 Tainted: G            E
------------------------------------------------------
modpost/16701 is trying to acquire lock:
 (&vnode->io_lock){+.+.}, at: [<ffffffffa000fc40>] afs_begin_vnode_operation+0x33/0x77 [kafs]

but task is already holding lock:
 (&mm->mmap_sem){++++}, at: [<ffffffff8104376a>] __do_page_fault+0x1ef/0x486

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #3 (&mm->mmap_sem){++++}:
       __might_fault+0x61/0x89
       _copy_from_iter_full+0x40/0x1fa
       rxrpc_send_data+0x8dc/0xff3
       rxrpc_do_sendmsg+0x62f/0x6a1
       rxrpc_sendmsg+0x166/0x1b7
       sock_sendmsg+0x2d/0x39
       ___sys_sendmsg+0x1ad/0x22b
       __sys_sendmsg+0x41/0x62
       do_syscall_64+0x89/0x1be
       return_from_SYSCALL_64+0x0/0x75

-> #2 (&call->user_mutex){+.+.}:
       __mutex_lock+0x86/0x7d2
       rxrpc_new_client_call+0x378/0x80e
       rxrpc_kernel_begin_call+0xf3/0x154
       afs_make_call+0x195/0x454 [kafs]
       afs_vl_get_capabilities+0x193/0x198 [kafs]
       afs_vl_lookup_vldb+0x5f/0x151 [kafs]
       afs_create_volume+0x2e/0x2f4 [kafs]
       afs_mount+0x56a/0x8d7 [kafs]
       mount_fs+0x6a/0x109
       vfs_kern_mount+0x67/0x135
       do_mount+0x90b/0xb57
       SyS_mount+0x72/0x98
       do_syscall_64+0x89/0x1be
       return_from_SYSCALL_64+0x0/0x75

-> #1 (k-sk_lock-AF_RXRPC){+.+.}:
       lock_sock_nested+0x74/0x8a
       rxrpc_kernel_begin_call+0x8a/0x154
       afs_make_call+0x195/0x454 [kafs]
       afs_fs_get_capabilities+0x17a/0x17f [kafs]
       afs_probe_fileserver+0xf7/0x2f0 [kafs]
       afs_select_fileserver+0x83f/0x903 [kafs]
       afs_fetch_status+0x89/0x11d [kafs]
       afs_iget+0x16f/0x4f8 [kafs]
       afs_mount+0x6c6/0x8d7 [kafs]
       mount_fs+0x6a/0x109
       vfs_kern_mount+0x67/0x135
       do_mount+0x90b/0xb57
       SyS_mount+0x72/0x98
       do_syscall_64+0x89/0x1be
       return_from_SYSCALL_64+0x0/0x75

-> #0 (&vnode->io_lock){+.+.}:
       lock_acquire+0x174/0x19f
       __mutex_lock+0x86/0x7d2
       afs_begin_vnode_operation+0x33/0x77 [kafs]
       afs_fetch_data+0x80/0x12a [kafs]
       afs_readpages+0x314/0x405 [kafs]
       __do_page_cache_readahead+0x203/0x2ba
       filemap_fault+0x179/0x54d
       __do_fault+0x17/0x60
       __handle_mm_fault+0x6d7/0x95c
       handle_mm_fault+0x24e/0x2a3
       __do_page_fault+0x301/0x486
       do_page_fault+0x236/0x259
       page_fault+0x22/0x30
       __clear_user+0x3d/0x60
       padzero+0x1c/0x2b
       load_elf_binary+0x785/0xdc7
       search_binary_handler+0x81/0x1ff
       do_execveat_common.isra.14+0x600/0x888
       do_execve+0x1f/0x21
       SyS_execve+0x28/0x2f
       do_syscall_64+0x89/0x1be
       return_from_SYSCALL_64+0x0/0x75

other info that might help us debug this:

Chain exists of:
  &vnode->io_lock --> &call->user_mutex --> &mm->mmap_sem

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&mm->mmap_sem);
                               lock(&call->user_mutex);
                               lock(&mm->mmap_sem);
  lock(&vnode->io_lock);

 *** DEADLOCK ***

1 lock held by modpost/16701:
 #0:  (&mm->mmap_sem){++++}, at: [<ffffffff8104376a>] __do_page_fault+0x1ef/0x486

stack backtrace:
CPU: 0 PID: 16701 Comm: modpost Tainted: G            E   4.14.0-fscache+ torvalds#243
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
 dump_stack+0x67/0x8e
 print_circular_bug+0x341/0x34f
 check_prev_add+0x11f/0x5d4
 ? add_lock_to_list.isra.12+0x8b/0x8b
 ? add_lock_to_list.isra.12+0x8b/0x8b
 ? __lock_acquire+0xf77/0x10b4
 __lock_acquire+0xf77/0x10b4
 lock_acquire+0x174/0x19f
 ? afs_begin_vnode_operation+0x33/0x77 [kafs]
 __mutex_lock+0x86/0x7d2
 ? afs_begin_vnode_operation+0x33/0x77 [kafs]
 ? afs_begin_vnode_operation+0x33/0x77 [kafs]
 ? afs_begin_vnode_operation+0x33/0x77 [kafs]
 afs_begin_vnode_operation+0x33/0x77 [kafs]
 afs_fetch_data+0x80/0x12a [kafs]
 afs_readpages+0x314/0x405 [kafs]
 __do_page_cache_readahead+0x203/0x2ba
 ? filemap_fault+0x179/0x54d
 filemap_fault+0x179/0x54d
 __do_fault+0x17/0x60
 __handle_mm_fault+0x6d7/0x95c
 handle_mm_fault+0x24e/0x2a3
 __do_page_fault+0x301/0x486
 do_page_fault+0x236/0x259
 page_fault+0x22/0x30
RIP: 0010:__clear_user+0x3d/0x60
RSP: 0018:ffff880071e93da0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 000000000000011c RCX: 000000000000011c
RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000060f720
RBP: 000000000060f720 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000001 R11: ffff8800b5459b68 R12: ffff8800ce150e00
R13: 000000000060f720 R14: 00000000006127a8 R15: 0000000000000000
 padzero+0x1c/0x2b
 load_elf_binary+0x785/0xdc7
 search_binary_handler+0x81/0x1ff
 do_execveat_common.isra.14+0x600/0x888
 do_execve+0x1f/0x21
 SyS_execve+0x28/0x2f
 do_syscall_64+0x89/0x1be
 entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7fdb6009ee07
RSP: 002b:00007fff566d9728 EFLAGS: 00000246 ORIG_RAX: 000000000000003b
RAX: ffffffffffffffda RBX: 000055ba57280900 RCX: 00007fdb6009ee07
RDX: 000055ba5727f270 RSI: 000055ba5727cac0 RDI: 000055ba57280900
RBP: 000055ba57280900 R08: 00007fff566d9700 R09: 0000000000000000
R10: 000055ba5727cac0 R11: 0000000000000246 R12: 0000000000000000
R13: 000055ba5727cac0 R14: 000055ba5727f270 R15: 0000000000000000

Signed-off-by: David Howells <dhowells@redhat.com>
jlelli pushed a commit that referenced this pull request Dec 4, 2017
Jiri Pirko says:

====================
mlxsw: GRE offloading fixes

Petr says:

This patchset fixes a couple bugs in offloading GRE tunnels in mlxsw
driver.

Patch #1 fixes a problem that local routes pointing at a GRE tunnel
device are offloaded even if that netdevice is down.

Patch #2 detects that as a result of moving a GRE netdevice to a
different VRF, two tunnels now have a conflict of local addresses,
something that the mlxsw driver can't offload.

Patch #3 fixes a FIB abort caused by forming a route pointing at a
GRE tunnel that is eligible for offloading but already onloaded.

Patch #4 fixes a problem that next hops migrated to a new RIF kept the
old RIF reference, which went dangling shortly afterwards.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
jlelli pushed a commit that referenced this pull request Jan 25, 2018
In the function brcmf_sdio_firmware_callback() the driver is
unbound from the sdio function devices in the error path.
However, the order in which it is done resulted in a use-after-free
issue (see brcmf_ops_sdio_remove() in bcmsdh.c). Hence change
the order and first unbind sdio function #2 device and then
unbind sdio function #1 device.

Cc: stable@vger.kernel.org # v4.12.x
Fixes: 7a51461 ("brcmfmac: unbind all devices upon failure in firmware callback")
Reported-by: Stefan Wahren <stefan.wahren@i2se.com>
Reviewed-by: Hante Meuleman <hante.meuleman@broadcom.com>
Reviewed-by: Pieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
Reviewed-by: Franky Lin <franky.lin@broadcom.com>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
jlelli pushed a commit that referenced this pull request Jan 25, 2018
Default value of pcc_subspace_idx is -1.
Make sure to check pcc_subspace_idx before using the same as array index.
This will avoid following KASAN warnings too.

[   15.113449] ==================================================================
[   15.116983] BUG: KASAN: global-out-of-bounds in cppc_get_perf_caps+0xf3/0x3b0
[   15.116983] Read of size 8 at addr ffffffffb9a5c0d8 by task swapper/0/1
[   15.116983] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 4.15.0-rc2+ #2
[   15.116983] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.2.8 01/26/2016
[   15.116983] Call Trace:
[   15.116983]  dump_stack+0x7c/0xbb
[   15.116983]  print_address_description+0x1df/0x290
[   15.116983]  kasan_report+0x28a/0x370
[   15.116983]  ? cppc_get_perf_caps+0xf3/0x3b0
[   15.116983]  cppc_get_perf_caps+0xf3/0x3b0
[   15.116983]  ? cpc_read+0x210/0x210
[   15.116983]  ? __rdmsr_on_cpu+0x90/0x90
[   15.116983]  ? rdmsrl_on_cpu+0xa9/0xe0
[   15.116983]  ? rdmsr_on_cpu+0x100/0x100
[   15.116983]  ? wrmsrl_on_cpu+0x9c/0xd0
[   15.116983]  ? wrmsrl_on_cpu+0x9c/0xd0
[   15.116983]  ? wrmsr_on_cpu+0xe0/0xe0
[   15.116983]  __intel_pstate_cpu_init.part.16+0x3a2/0x530
[   15.116983]  ? intel_pstate_init_cpu+0x197/0x390
[   15.116983]  ? show_no_turbo+0xe0/0xe0
[   15.116983]  ? __lockdep_init_map+0xa0/0x290
[   15.116983]  intel_pstate_cpu_init+0x30/0x60
[   15.116983]  cpufreq_online+0x155/0xac0
[   15.116983]  cpufreq_add_dev+0x9b/0xb0
[   15.116983]  subsys_interface_register+0x1ae/0x290
[   15.116983]  ? bus_unregister_notifier+0x40/0x40
[   15.116983]  ? mark_held_locks+0x83/0xb0
[   15.116983]  ? _raw_write_unlock_irqrestore+0x32/0x60
[   15.116983]  ? intel_pstate_setup+0xc/0x104
[   15.116983]  ? intel_pstate_setup+0xc/0x104
[   15.116983]  ? cpufreq_register_driver+0x1ce/0x2b0
[   15.116983]  cpufreq_register_driver+0x1ce/0x2b0
[   15.116983]  ? intel_pstate_setup+0x104/0x104
[   15.116983]  intel_pstate_register_driver+0x3a/0xa0
[   15.116983]  intel_pstate_init+0x3c4/0x434
[   15.116983]  ? intel_pstate_setup+0x104/0x104
[   15.116983]  ? intel_pstate_setup+0x104/0x104
[   15.116983]  do_one_initcall+0x9c/0x206
[   15.116983]  ? parameq+0xa0/0xa0
[   15.116983]  ? initcall_blacklisted+0x150/0x150
[   15.116983]  ? lock_downgrade+0x2c0/0x2c0
[   15.116983]  kernel_init_freeable+0x327/0x3f0
[   15.116983]  ? start_kernel+0x612/0x612
[   15.116983]  ? _raw_spin_unlock_irq+0x29/0x40
[   15.116983]  ? finish_task_switch+0xdd/0x320
[   15.116983]  ? finish_task_switch+0x8e/0x320
[   15.116983]  ? rest_init+0xd0/0xd0
[   15.116983]  kernel_init+0xf/0x11a
[   15.116983]  ? rest_init+0xd0/0xd0
[   15.116983]  ret_from_fork+0x24/0x30

[   15.116983] The buggy address belongs to the variable:
[   15.116983]  __key.36299+0x38/0x40

[   15.116983] Memory state around the buggy address:
[   15.116983]  ffffffffb9a5bf80: fa fa fa fa 00 fa fa fa fa fa fa fa 00 fa fa fa
[   15.116983]  ffffffffb9a5c000: fa fa fa fa 00 fa fa fa fa fa fa fa 00 fa fa fa
[   15.116983] >ffffffffb9a5c080: fa fa fa fa 00 fa fa fa fa fa fa fa 00 00 00 00
[   15.116983]                                                     ^
[   15.116983]  ffffffffb9a5c100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   15.116983]  ffffffffb9a5c180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   15.116983] ==================================================================

Fixes: 85b1407 (ACPI / CPPC: Make CPPC ACPI driver aware of PCC subspace IDs)
Reported-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: George Cherian <george.cherian@cavium.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
jlelli pushed a commit that referenced this pull request Nov 28, 2019
…kernel/git/kvmarm/kvmarm into HEAD

KVM/arm fixes for 5.4, take #2

Special PMU edition:

- Fix cycle counter truncation
- Fix cycle counter overflow limit on pure 64bit system
- Allow chained events to be actually functional
- Correct sample period after overflow
jlelli pushed a commit that referenced this pull request Nov 28, 2019
All bonding device has same lockdep key and subclass is initialized with
nest_level.
But actual nest_level value can be changed when a lower device is attached.
And at this moment, the subclass should be updated but it seems to be
unsafe.
So this patch makes bonding use dynamic lockdep key instead of the
subclass.

Test commands:
    ip link add bond0 type bond

    for i in {1..5}
    do
	    let A=$i-1
	    ip link add bond$i type bond
	    ip link set bond$i master bond$A
    done
    ip link set bond5 master bond0

Splat looks like:
[  307.992912] WARNING: possible recursive locking detected
[  307.993656] 5.4.0-rc3+ torvalds#96 Tainted: G        W
[  307.994367] --------------------------------------------
[  307.995092] ip/761 is trying to acquire lock:
[  307.995710] ffff8880513aac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[  307.997045]
	       but task is already holding lock:
[  307.997923] ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[  307.999215]
	       other info that might help us debug this:
[  308.000251]  Possible unsafe locking scenario:

[  308.001137]        CPU0
[  308.001533]        ----
[  308.001915]   lock(&(&bond->stats_lock)->rlock#2/2);
[  308.002609]   lock(&(&bond->stats_lock)->rlock#2/2);
[  308.003302]
		*** DEADLOCK ***

[  308.004310]  May be due to missing lock nesting notation

[  308.005319] 3 locks held by ip/761:
[  308.005830]  #0: ffffffff9fcc42b0 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
[  308.006894]  #1: ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[  308.008243]  #2: ffffffff9f9219c0 (rcu_read_lock){....}, at: bond_get_stats+0x9f/0x500 [bonding]
[  308.009422]
	       stack backtrace:
[  308.010124] CPU: 0 PID: 761 Comm: ip Tainted: G        W         5.4.0-rc3+ torvalds#96
[  308.011097] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  308.012179] Call Trace:
[  308.012601]  dump_stack+0x7c/0xbb
[  308.013089]  __lock_acquire+0x269d/0x3de0
[  308.013669]  ? register_lock_class+0x14d0/0x14d0
[  308.014318]  lock_acquire+0x164/0x3b0
[  308.014858]  ? bond_get_stats+0xb8/0x500 [bonding]
[  308.015520]  _raw_spin_lock_nested+0x2e/0x60
[  308.016129]  ? bond_get_stats+0xb8/0x500 [bonding]
[  308.017215]  bond_get_stats+0xb8/0x500 [bonding]
[  308.018454]  ? bond_arp_rcv+0xf10/0xf10 [bonding]
[  308.019710]  ? rcu_read_lock_held+0x90/0xa0
[  308.020605]  ? rcu_read_lock_sched_held+0xc0/0xc0
[  308.021286]  ? bond_get_stats+0x9f/0x500 [bonding]
[  308.021953]  dev_get_stats+0x1ec/0x270
[  308.022508]  bond_get_stats+0x1d1/0x500 [bonding]

Fixes: d3fff6c ("net: add netdev_lockdep_set_classes() helper")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
jlelli pushed a commit that referenced this pull request Nov 28, 2019
When the extent tree is modified, it should be protected by inode
cluster lock and ip_alloc_sem.

The extent tree is accessed and modified in the
ocfs2_prepare_inode_for_write, but isn't protected by ip_alloc_sem.

The following is a case.  The function ocfs2_fiemap is accessing the
extent tree, which is modified at the same time.

  kernel BUG at fs/ocfs2/extent_map.c:475!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: tun ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue [...]
  CPU: 16 PID: 14047 Comm: o2info Not tainted 4.1.12-124.23.1.el6uek.x86_64 #2
  Hardware name: Oracle Corporation ORACLE SERVER X7-2L/ASM, MB MECH, X7-2L, BIOS 42040600 10/19/2018
  task: ffff88019487e200 ti: ffff88003daa4000 task.ti: ffff88003daa4000
  RIP: ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
  Call Trace:
    ocfs2_fiemap+0x1e3/0x430 [ocfs2]
    do_vfs_ioctl+0x155/0x510
    SyS_ioctl+0x81/0xa0
    system_call_fastpath+0x18/0xd8
  Code: 18 48 c7 c6 60 7f 65 a0 31 c0 bb e2 ff ff ff 48 8b 4a 40 48 8b 7a 28 48 c7 c2 78 2d 66 a0 e8 38 4f 05 00 e9 28 fe ff ff 0f 1f 00 <0f> 0b 66 0f 1f 44 00 00 bb 86 ff ff ff e9 13 fe ff ff 66 0f 1f
  RIP  ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
  ---[ end trace c8aa0c8180e869dc ]---
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled

This issue can be reproduced every week in a production environment.

This issue is related to the usage mode.  If others use ocfs2 in this
mode, the kernel will panic frequently.

[akpm@linux-foundation.org: coding style fixes]
[Fix new warning due to unused function by removing said function - Linus ]
Link: http://lkml.kernel.org/r/1568772175-2906-2-git-send-email-sunny.s.zhang@oracle.com
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
jlelli pushed a commit that referenced this pull request May 5, 2020
FuzzUSB (a variant of syzkaller) found a free-while-still-in-use bug
in the USB scatter-gather library:

BUG: KASAN: use-after-free in atomic_read
include/asm-generic/atomic-instrumented.h:26 [inline]
BUG: KASAN: use-after-free in usb_hcd_unlink_urb+0x5f/0x170
drivers/usb/core/hcd.c:1607
Read of size 4 at addr ffff888065379610 by task kworker/u4:1/27

CPU: 1 PID: 27 Comm: kworker/u4:1 Not tainted 5.5.11 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.10.2-1ubuntu1 04/01/2014
Workqueue: scsi_tmf_2 scmd_eh_abort_handler
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0xce/0x128 lib/dump_stack.c:118
 print_address_description.constprop.4+0x21/0x3c0 mm/kasan/report.c:374
 __kasan_report+0x153/0x1cb mm/kasan/report.c:506
 kasan_report+0x12/0x20 mm/kasan/common.c:639
 check_memory_region_inline mm/kasan/generic.c:185 [inline]
 check_memory_region+0x152/0x1b0 mm/kasan/generic.c:192
 __kasan_check_read+0x11/0x20 mm/kasan/common.c:95
 atomic_read include/asm-generic/atomic-instrumented.h:26 [inline]
 usb_hcd_unlink_urb+0x5f/0x170 drivers/usb/core/hcd.c:1607
 usb_unlink_urb+0x72/0xb0 drivers/usb/core/urb.c:657
 usb_sg_cancel+0x14e/0x290 drivers/usb/core/message.c:602
 usb_stor_stop_transport+0x5e/0xa0 drivers/usb/storage/transport.c:937

This bug occurs when cancellation of the S-G transfer races with
transfer completion.  When that happens, usb_sg_cancel() may continue
to access the transfer's URBs after usb_sg_wait() has freed them.

The bug is caused by the fact that usb_sg_cancel() does not take any
sort of reference to the transfer, and so there is nothing to prevent
the URBs from being deallocated while the routine is trying to use
them.  The fix is to take such a reference by incrementing the
transfer's io->count field while the cancellation is in progres and
decrementing it afterward.  The transfer's URBs are not deallocated
until io->complete is triggered, which happens when io->count reaches
zero.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Reported-and-tested-by: Kyungtae Kim <kt0755@gmail.com>
CC: <stable@vger.kernel.org>

Link: https://lore.kernel.org/r/Pine.LNX.4.44L0.2003281615140.14837-100000@netrider.rowland.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
jlelli pushed a commit that referenced this pull request May 19, 2020
…f fs_info::journal_info

[BUG]
One run of btrfs/063 triggered the following lockdep warning:
  ============================================
  WARNING: possible recursive locking detected
  5.6.0-rc7-custom+ torvalds#48 Not tainted
  --------------------------------------------
  kworker/u24:0/7 is trying to acquire lock:
  ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]

  but task is already holding lock:
  ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(sb_internal#2);
    lock(sb_internal#2);

   *** DEADLOCK ***

   May be due to missing lock nesting notation

  4 locks held by kworker/u24:0/7:
   #0: ffff88817b495948 ((wq_completion)btrfs-endio-write){+.+.}, at: process_one_work+0x557/0xb80
   #1: ffff888189ea7db8 ((work_completion)(&work->normal_work)){+.+.}, at: process_one_work+0x557/0xb80
   #2: ffff88817d3a46e0 (sb_internal#2){.+.+}, at: start_transaction+0x66c/0x890 [btrfs]
   #3: ffff888174ca4da8 (&fs_info->reloc_mutex){+.+.}, at: btrfs_record_root_in_trans+0x83/0xd0 [btrfs]

  stack backtrace:
  CPU: 0 PID: 7 Comm: kworker/u24:0 Not tainted 5.6.0-rc7-custom+ torvalds#48
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
  Call Trace:
   dump_stack+0xc2/0x11a
   __lock_acquire.cold+0xce/0x214
   lock_acquire+0xe6/0x210
   __sb_start_write+0x14e/0x290
   start_transaction+0x66c/0x890 [btrfs]
   btrfs_join_transaction+0x1d/0x20 [btrfs]
   find_free_extent+0x1504/0x1a50 [btrfs]
   btrfs_reserve_extent+0xd5/0x1f0 [btrfs]
   btrfs_alloc_tree_block+0x1ac/0x570 [btrfs]
   btrfs_copy_root+0x213/0x580 [btrfs]
   create_reloc_root+0x3bd/0x470 [btrfs]
   btrfs_init_reloc_root+0x2d2/0x310 [btrfs]
   record_root_in_trans+0x191/0x1d0 [btrfs]
   btrfs_record_root_in_trans+0x90/0xd0 [btrfs]
   start_transaction+0x16e/0x890 [btrfs]
   btrfs_join_transaction+0x1d/0x20 [btrfs]
   btrfs_finish_ordered_io+0x55d/0xcd0 [btrfs]
   finish_ordered_fn+0x15/0x20 [btrfs]
   btrfs_work_helper+0x116/0x9a0 [btrfs]
   process_one_work+0x632/0xb80
   worker_thread+0x80/0x690
   kthread+0x1a3/0x1f0
   ret_from_fork+0x27/0x50

It's pretty hard to reproduce, only one hit so far.

[CAUSE]
This is because we're calling btrfs_join_transaction() without re-using
the current running one:

btrfs_finish_ordered_io()
|- btrfs_join_transaction()		<<< Call #1
   |- btrfs_record_root_in_trans()
      |- btrfs_reserve_extent()
	 |- btrfs_join_transaction()	<<< Call #2

Normally such btrfs_join_transaction() call should re-use the existing
one, without trying to re-start a transaction.

But the problem is, in btrfs_join_transaction() call #1, we call
btrfs_record_root_in_trans() before initializing current::journal_info.

And in btrfs_join_transaction() call #2, we're relying on
current::journal_info to avoid such deadlock.

[FIX]
Call btrfs_record_root_in_trans() after we have initialized
current::journal_info.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
jlelli pushed a commit that referenced this pull request May 19, 2020
…kernel/git/kvmarm/kvmarm into kvm-master

KVM/arm fixes for Linux 5.7, take #2

- Fix compilation with Clang
- Correctly initialize GICv4.1 in the absence of a virtual ITS
- Move SP_EL0 save/restore to the guest entry/exit code
- Handle PC wrap around on 32bit guests, and narrow all 32bit
  registers on userspace access
jlelli pushed a commit that referenced this pull request May 19, 2020
abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup
is and controls the activation of use_delay mechanism. Once a cgroup goes
over budget from forced IOs, it has to pay it back with its future budget.
The progress guarantee on debt paying comes from the iocg being active -
active iocgs are processed by the periodic timer, which ensures that as time
passes the debts dissipate and the iocg returns to normal operation.

However, both iocg activation and vdebt handling are asynchronous and a
sequence like the following may happen.

1. The iocg is in the process of being deactivated by the periodic timer.

2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns
   without anything because it still sees that the iocg is already active.

3. The iocg is deactivated.

4. The bio from #2 is over budget but needs to be forced. It increases
   abs_vdebt and goes over the threshold and enables use_delay.

5. IO control is enabled for the iocg's subtree and now IOs are attributed
   to the descendant cgroups and the iocg itself no longer issues IOs.

This leaves the iocg with stuck abs_vdebt - it has debt but inactive and no
further IOs which can activate it. This can end up unduly punishing all the
descendants cgroups.

The usual throttling path has the same issue - the iocg must be active while
throttled to ensure that future event will wake it up - and solves the
problem by synchronizing the throttling path with a spinlock. abs_vdebt
handling is another form of overage handling and shares a lot of
characteristics including the fact that it isn't in the hottest path.

This patch fixes the above and other possible races by strictly
synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Vlad Dmitriev <vvd@fb.com>
Cc: stable@vger.kernel.org # v5.4+
Fixes: e1518f6 ("blk-iocost: Don't let merges push vtime into the future")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
jlelli pushed a commit that referenced this pull request May 19, 2020
Since 5.7-rc1, on btrfs we have a percpu counter initialization for
which we always pass a GFP_KERNEL gfp_t argument (this happens since
commit 2992df7 ("btrfs: Implement DREW lock")).

That is safe in some contextes but not on others where allowing fs
reclaim could lead to a deadlock because we are either holding some
btrfs lock needed for a transaction commit or holding a btrfs
transaction handle open.  Because of that we surround the call to the
function that initializes the percpu counter with a NOFS context using
memalloc_nofs_save() (this is done at btrfs_init_fs_root()).

However it turns out that this is not enough to prevent a possible
deadlock because percpu_alloc() determines if it is in an atomic context
by looking exclusively at the gfp flags passed to it (GFP_KERNEL in this
case) and it is not aware that a NOFS context is set.

Because percpu_alloc() thinks it is in a non atomic context it locks the
pcpu_alloc_mutex.  This can result in a btrfs deadlock when
pcpu_balance_workfn() is running, has acquired that mutex and is waiting
for reclaim, while the btrfs task that called percpu_counter_init() (and
therefore percpu_alloc()) is holding either the btrfs commit_root
semaphore or a transaction handle (done fs/btrfs/backref.c:
iterate_extent_inodes()), which prevents reclaim from finishing as an
attempt to commit the current btrfs transaction will deadlock.

Lockdep reports this issue with the following trace:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.6.0-rc7-btrfs-next-77 #1 Not tainted
  ------------------------------------------------------
  kswapd0/91 is trying to acquire lock:
  ffff8938a3b3fdc8 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]

  but task is already holding lock:
  ffffffffb4f0dbc0 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #4 (fs_reclaim){+.+.}:
         fs_reclaim_acquire.part.0+0x25/0x30
         __kmalloc+0x5f/0x3a0
         pcpu_create_chunk+0x19/0x230
         pcpu_balance_workfn+0x56a/0x680
         process_one_work+0x235/0x5f0
         worker_thread+0x50/0x3b0
         kthread+0x120/0x140
         ret_from_fork+0x3a/0x50

  -> #3 (pcpu_alloc_mutex){+.+.}:
         __mutex_lock+0xa9/0xaf0
         pcpu_alloc+0x480/0x7c0
         __percpu_counter_init+0x50/0xd0
         btrfs_drew_lock_init+0x22/0x70 [btrfs]
         btrfs_get_fs_root+0x29c/0x5c0 [btrfs]
         resolve_indirect_refs+0x120/0xa30 [btrfs]
         find_parent_nodes+0x50b/0xf30 [btrfs]
         btrfs_find_all_leafs+0x60/0xb0 [btrfs]
         iterate_extent_inodes+0x139/0x2f0 [btrfs]
         iterate_inodes_from_logical+0xa1/0xe0 [btrfs]
         btrfs_ioctl_logical_to_ino+0xb4/0x190 [btrfs]
         btrfs_ioctl+0x165a/0x3130 [btrfs]
         ksys_ioctl+0x87/0xc0
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x5c/0x260
         entry_SYSCALL_64_after_hwframe+0x49/0xbe

  -> #2 (&fs_info->commit_root_sem){++++}:
         down_write+0x38/0x70
         btrfs_cache_block_group+0x2ec/0x500 [btrfs]
         find_free_extent+0xc6a/0x1600 [btrfs]
         btrfs_reserve_extent+0x9b/0x180 [btrfs]
         btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
         alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
         __btrfs_cow_block+0x122/0x5a0 [btrfs]
         btrfs_cow_block+0x106/0x240 [btrfs]
         commit_cowonly_roots+0x55/0x310 [btrfs]
         btrfs_commit_transaction+0x509/0xb20 [btrfs]
         sync_filesystem+0x74/0x90
         generic_shutdown_super+0x22/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x93/0xc0
         exit_to_usermode_loop+0xf9/0x100
         do_syscall_64+0x20d/0x260
         entry_SYSCALL_64_after_hwframe+0x49/0xbe

  -> #1 (&space_info->groups_sem){++++}:
         down_read+0x3c/0x140
         find_free_extent+0xef6/0x1600 [btrfs]
         btrfs_reserve_extent+0x9b/0x180 [btrfs]
         btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
         alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
         __btrfs_cow_block+0x122/0x5a0 [btrfs]
         btrfs_cow_block+0x106/0x240 [btrfs]
         btrfs_search_slot+0x50c/0xd60 [btrfs]
         btrfs_lookup_inode+0x3a/0xc0 [btrfs]
         __btrfs_update_delayed_inode+0x90/0x280 [btrfs]
         __btrfs_commit_inode_delayed_items+0x81f/0x870 [btrfs]
         __btrfs_run_delayed_items+0x8e/0x180 [btrfs]
         btrfs_commit_transaction+0x31b/0xb20 [btrfs]
         iterate_supers+0x87/0xf0
         ksys_sync+0x60/0xb0
         __ia32_sys_sync+0xa/0x10
         do_syscall_64+0x5c/0x260
         entry_SYSCALL_64_after_hwframe+0x49/0xbe

  -> #0 (&delayed_node->mutex){+.+.}:
         __lock_acquire+0xef0/0x1c80
         lock_acquire+0xa2/0x1d0
         __mutex_lock+0xa9/0xaf0
         __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         btrfs_evict_inode+0x40d/0x560 [btrfs]
         evict+0xd9/0x1c0
         dispose_list+0x48/0x70
         prune_icache_sb+0x54/0x80
         super_cache_scan+0x124/0x1a0
         do_shrink_slab+0x176/0x440
         shrink_slab+0x23a/0x2c0
         shrink_node+0x188/0x6e0
         balance_pgdat+0x31d/0x7f0
         kswapd+0x238/0x550
         kthread+0x120/0x140
         ret_from_fork+0x3a/0x50

  other info that might help us debug this:

  Chain exists of:
    &delayed_node->mutex --> pcpu_alloc_mutex --> fs_reclaim

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(fs_reclaim);
                                 lock(pcpu_alloc_mutex);
                                 lock(fs_reclaim);
    lock(&delayed_node->mutex);

   *** DEADLOCK ***

  3 locks held by kswapd0/91:
   #0: (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
   #1: (shrinker_rwsem){++++}, at: shrink_slab+0x12f/0x2c0
   #2: (&type->s_umount_key#43){++++}, at: trylock_super+0x16/0x50

  stack backtrace:
  CPU: 1 PID: 91 Comm: kswapd0 Not tainted 5.6.0-rc7-btrfs-next-77 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x8f/0xd0
   check_noncircular+0x170/0x190
   __lock_acquire+0xef0/0x1c80
   lock_acquire+0xa2/0x1d0
   __mutex_lock+0xa9/0xaf0
   __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   btrfs_evict_inode+0x40d/0x560 [btrfs]
   evict+0xd9/0x1c0
   dispose_list+0x48/0x70
   prune_icache_sb+0x54/0x80
   super_cache_scan+0x124/0x1a0
   do_shrink_slab+0x176/0x440
   shrink_slab+0x23a/0x2c0
   shrink_node+0x188/0x6e0
   balance_pgdat+0x31d/0x7f0
   kswapd+0x238/0x550
   kthread+0x120/0x140
   ret_from_fork+0x3a/0x50

This could be fixed by making btrfs pass GFP_NOFS instead of GFP_KERNEL
to percpu_counter_init() in contextes where it is not reclaim safe,
however that type of approach is discouraged since
memalloc_[nofs|noio]_save() were introduced.  Therefore this change
makes pcpu_alloc() look up into an existing nofs/noio context before
deciding whether it is in an atomic context or not.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Link: http://lkml.kernel.org/r/20200430164356.15543-1-fdmanana@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
jlelli pushed a commit that referenced this pull request Jan 11, 2021
With SO_RCVLOWAT, under memory pressure,
it is possible to enter a state where:

1. We have not received enough bytes to satisfy SO_RCVLOWAT.
2. We have not entered buffer pressure (see tcp_rmem_pressure()).
3. But, we do not have enough buffer space to accept more packets.

In this case, we advertise 0 rwnd (due to #3) but the application does
not drain the receive queue (no wakeup because of #1 and #2) so the
flow stalls.

Modify the heuristic for SO_RCVLOWAT so that, if we are advertising
rwnd<=rcv_mss, force a wakeup to prevent a stall.

Without this patch, setting tcp_rmem to 6143 and disabling TCP
autotune causes a stalled flow. With this patch, no stall occurs. This
is with RPC-style traffic with large messages.

Fixes: 03f45c8 ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201023184709.217614-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
jlelli pushed a commit that referenced this pull request Jan 11, 2021
Dave reported a problem with my rwsem conversion patch where we got the
following lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0-default+ #1297 Not tainted
  ------------------------------------------------------
  kswapd0/76 is trying to acquire lock:
  ffff9d5d25df2530 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]

  but task is already holding lock:
  ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #4 (fs_reclaim){+.+.}-{0:0}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 fs_reclaim_acquire.part.0+0x25/0x30
	 kmem_cache_alloc+0x30/0x9c0
	 alloc_inode+0x81/0x90
	 iget_locked+0xcd/0x1a0
	 kernfs_get_inode+0x1b/0x130
	 kernfs_get_tree+0x136/0x210
	 sysfs_get_tree+0x1a/0x50
	 vfs_get_tree+0x1d/0xb0
	 path_mount+0x70f/0xa80
	 do_mount+0x75/0x90
	 __x64_sys_mount+0x8e/0xd0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #3 (kernfs_mutex){+.+.}-{3:3}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 __mutex_lock+0xa0/0xaf0
	 kernfs_add_one+0x23/0x150
	 kernfs_create_dir_ns+0x58/0x80
	 sysfs_create_dir_ns+0x70/0xd0
	 kobject_add_internal+0xbb/0x2d0
	 kobject_add+0x7a/0xd0
	 btrfs_sysfs_add_block_group_type+0x141/0x1d0 [btrfs]
	 btrfs_read_block_groups+0x1f1/0x8c0 [btrfs]
	 open_ctree+0x981/0x1108 [btrfs]
	 btrfs_mount_root.cold+0xe/0xb0 [btrfs]
	 legacy_get_tree+0x2d/0x60
	 vfs_get_tree+0x1d/0xb0
	 fc_mount+0xe/0x40
	 vfs_kern_mount.part.0+0x71/0x90
	 btrfs_mount+0x13b/0x3e0 [btrfs]
	 legacy_get_tree+0x2d/0x60
	 vfs_get_tree+0x1d/0xb0
	 path_mount+0x70f/0xa80
	 do_mount+0x75/0x90
	 __x64_sys_mount+0x8e/0xd0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #2 (btrfs-extent-00){++++}-{3:3}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 down_read_nested+0x45/0x220
	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
	 check_committed_ref+0x69/0x200 [btrfs]
	 btrfs_cross_ref_exist+0x65/0xb0 [btrfs]
	 run_delalloc_nocow+0x446/0x9b0 [btrfs]
	 btrfs_run_delalloc_range+0x61/0x6a0 [btrfs]
	 writepage_delalloc+0xae/0x160 [btrfs]
	 __extent_writepage+0x262/0x420 [btrfs]
	 extent_write_cache_pages+0x2b6/0x510 [btrfs]
	 extent_writepages+0x43/0x90 [btrfs]
	 do_writepages+0x40/0xe0
	 __writeback_single_inode+0x62/0x610
	 writeback_sb_inodes+0x20f/0x500
	 wb_writeback+0xef/0x4a0
	 wb_do_writeback+0x49/0x2e0
	 wb_workfn+0x81/0x340
	 process_one_work+0x233/0x5d0
	 worker_thread+0x50/0x3b0
	 kthread+0x137/0x150
	 ret_from_fork+0x1f/0x30

  -> #1 (btrfs-fs-00){++++}-{3:3}:
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 down_read_nested+0x45/0x220
	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
	 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
	 __btrfs_update_delayed_inode+0x93/0x2c0 [btrfs]
	 __btrfs_commit_inode_delayed_items+0x7de/0x850 [btrfs]
	 __btrfs_run_delayed_items+0x8e/0x140 [btrfs]
	 btrfs_commit_transaction+0x367/0xbc0 [btrfs]
	 btrfs_mksubvol+0x2db/0x470 [btrfs]
	 btrfs_mksnapshot+0x7b/0xb0 [btrfs]
	 __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
	 btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
	 btrfs_ioctl+0xd0b/0x2690 [btrfs]
	 __x64_sys_ioctl+0x6f/0xa0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
	 check_prev_add+0x91/0xc60
	 validate_chain+0xa6e/0x2a20
	 __lock_acquire+0x582/0xac0
	 lock_acquire+0xca/0x430
	 __mutex_lock+0xa0/0xaf0
	 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
	 btrfs_evict_inode+0x3cc/0x560 [btrfs]
	 evict+0xd6/0x1c0
	 dispose_list+0x48/0x70
	 prune_icache_sb+0x54/0x80
	 super_cache_scan+0x121/0x1a0
	 do_shrink_slab+0x16d/0x3b0
	 shrink_slab+0xb1/0x2e0
	 shrink_node+0x230/0x6a0
	 balance_pgdat+0x325/0x750
	 kswapd+0x206/0x4d0
	 kthread+0x137/0x150
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    &delayed_node->mutex --> kernfs_mutex --> fs_reclaim

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(fs_reclaim);
				 lock(kernfs_mutex);
				 lock(fs_reclaim);
    lock(&delayed_node->mutex);

   *** DEADLOCK ***

  3 locks held by kswapd0/76:
   #0: ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
   #1: ffffffffa40b8b58 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
   #2: ffff9d5d322390e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50

  stack backtrace:
  CPU: 2 PID: 76 Comm: kswapd0 Not tainted 5.9.0-default+ #1297
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
  Call Trace:
   dump_stack+0x77/0x97
   check_noncircular+0xff/0x110
   ? save_trace+0x50/0x470
   check_prev_add+0x91/0xc60
   validate_chain+0xa6e/0x2a20
   ? save_trace+0x50/0x470
   __lock_acquire+0x582/0xac0
   lock_acquire+0xca/0x430
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   __mutex_lock+0xa0/0xaf0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   ? __lock_acquire+0x582/0xac0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   ? btrfs_evict_inode+0x30b/0x560 [btrfs]
   ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
   btrfs_evict_inode+0x3cc/0x560 [btrfs]
   evict+0xd6/0x1c0
   dispose_list+0x48/0x70
   prune_icache_sb+0x54/0x80
   super_cache_scan+0x121/0x1a0
   do_shrink_slab+0x16d/0x3b0
   shrink_slab+0xb1/0x2e0
   shrink_node+0x230/0x6a0
   balance_pgdat+0x325/0x750
   kswapd+0x206/0x4d0
   ? finish_wait+0x90/0x90
   ? balance_pgdat+0x750/0x750
   kthread+0x137/0x150
   ? kthread_mod_delayed_work+0xc0/0xc0
   ret_from_fork+0x1f/0x30

This happens because we are still holding the path open when we start
adding the sysfs files for the block groups, which creates a dependency
on fs_reclaim via the tree lock.  Fix this by dropping the path before
we start doing anything with sysfs.

Reported-by: David Sterba <dsterba@suse.com>
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
jlelli pushed a commit that referenced this pull request Jan 11, 2021
Very sporadically I had test case btrfs/069 from fstests hanging (for
years, it is not a recent regression), with the following traces in
dmesg/syslog:

  [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started
  [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
  [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished
  [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds.
  [162513.514318]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.514747] task:btrfs-transacti state:D stack:    0 pid:1356167 ppid:     2 flags:0x00004000
  [162513.514751] Call Trace:
  [162513.514761]  __schedule+0x5ce/0xd00
  [162513.514765]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.514771]  schedule+0x46/0xf0
  [162513.514844]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.514850]  ? finish_wait+0x90/0x90
  [162513.514864]  start_transaction+0x37c/0x5f0 [btrfs]
  [162513.514879]  transaction_kthread+0xa4/0x170 [btrfs]
  [162513.514891]  ? btrfs_cleanup_transaction+0x660/0x660 [btrfs]
  [162513.514894]  kthread+0x153/0x170
  [162513.514897]  ? kthread_stop+0x2c0/0x2c0
  [162513.514902]  ret_from_fork+0x22/0x30
  [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds.
  [162513.515192]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.515680] task:fsstress        state:D stack:    0 pid:1356184 ppid:1356177 flags:0x00004000
  [162513.515682] Call Trace:
  [162513.515688]  __schedule+0x5ce/0xd00
  [162513.515691]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.515697]  schedule+0x46/0xf0
  [162513.515712]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.515716]  ? finish_wait+0x90/0x90
  [162513.515729]  start_transaction+0x37c/0x5f0 [btrfs]
  [162513.515743]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
  [162513.515753]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
  [162513.515758]  ? __ia32_sys_fdatasync+0x20/0x20
  [162513.515761]  iterate_supers+0x87/0xf0
  [162513.515765]  ksys_sync+0x60/0xb0
  [162513.515768]  __do_sys_sync+0xa/0x10
  [162513.515771]  do_syscall_64+0x33/0x80
  [162513.515774]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.515781] RIP: 0033:0x7f5238f50bd7
  [162513.515782] Code: Bad RIP value.
  [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
  [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
  [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a
  [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0
  [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a
  [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
  [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds.
  [162513.516064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.516617] task:fsstress        state:D stack:    0 pid:1356185 ppid:1356177 flags:0x00000000
  [162513.516620] Call Trace:
  [162513.516625]  __schedule+0x5ce/0xd00
  [162513.516628]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.516634]  schedule+0x46/0xf0
  [162513.516647]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.516650]  ? finish_wait+0x90/0x90
  [162513.516662]  start_transaction+0x4d7/0x5f0 [btrfs]
  [162513.516679]  btrfs_setxattr_trans+0x3c/0x100 [btrfs]
  [162513.516686]  __vfs_setxattr+0x66/0x80
  [162513.516691]  __vfs_setxattr_noperm+0x70/0x200
  [162513.516697]  vfs_setxattr+0x6b/0x120
  [162513.516703]  setxattr+0x125/0x240
  [162513.516709]  ? lock_acquire+0xb1/0x480
  [162513.516712]  ? mnt_want_write+0x20/0x50
  [162513.516721]  ? rcu_read_lock_any_held+0x8e/0xb0
  [162513.516723]  ? preempt_count_add+0x49/0xa0
  [162513.516725]  ? __sb_start_write+0x19b/0x290
  [162513.516727]  ? preempt_count_add+0x49/0xa0
  [162513.516732]  path_setxattr+0xba/0xd0
  [162513.516739]  __x64_sys_setxattr+0x27/0x30
  [162513.516741]  do_syscall_64+0x33/0x80
  [162513.516743]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.516745] RIP: 0033:0x7f5238f56d5a
  [162513.516746] Code: Bad RIP value.
  [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc
  [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a
  [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470
  [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700
  [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004
  [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0
  [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds.
  [162513.517064]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.517763] task:fsstress        state:D stack:    0 pid:1356196 ppid:1356177 flags:0x00004000
  [162513.517780] Call Trace:
  [162513.517786]  __schedule+0x5ce/0xd00
  [162513.517789]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.517796]  schedule+0x46/0xf0
  [162513.517810]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.517814]  ? finish_wait+0x90/0x90
  [162513.517829]  start_transaction+0x37c/0x5f0 [btrfs]
  [162513.517845]  btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs]
  [162513.517857]  btrfs_sync_fs+0x61/0x1c0 [btrfs]
  [162513.517862]  ? __ia32_sys_fdatasync+0x20/0x20
  [162513.517865]  iterate_supers+0x87/0xf0
  [162513.517869]  ksys_sync+0x60/0xb0
  [162513.517872]  __do_sys_sync+0xa/0x10
  [162513.517875]  do_syscall_64+0x33/0x80
  [162513.517878]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.517881] RIP: 0033:0x7f5238f50bd7
  [162513.517883] Code: Bad RIP value.
  [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2
  [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7
  [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053
  [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0
  [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053
  [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340
  [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds.
  [162513.518298]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.519157] task:fsstress        state:D stack:    0 pid:1356197 ppid:1356177 flags:0x00000000
  [162513.519160] Call Trace:
  [162513.519165]  __schedule+0x5ce/0xd00
  [162513.519168]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.519174]  schedule+0x46/0xf0
  [162513.519190]  wait_current_trans+0xde/0x140 [btrfs]
  [162513.519193]  ? finish_wait+0x90/0x90
  [162513.519206]  start_transaction+0x4d7/0x5f0 [btrfs]
  [162513.519222]  btrfs_create+0x57/0x200 [btrfs]
  [162513.519230]  lookup_open+0x522/0x650
  [162513.519246]  path_openat+0x2b8/0xa50
  [162513.519270]  do_filp_open+0x91/0x100
  [162513.519275]  ? find_held_lock+0x32/0x90
  [162513.519280]  ? lock_acquired+0x33b/0x470
  [162513.519285]  ? do_raw_spin_unlock+0x4b/0xc0
  [162513.519287]  ? _raw_spin_unlock+0x29/0x40
  [162513.519295]  do_sys_openat2+0x20d/0x2d0
  [162513.519300]  do_sys_open+0x44/0x80
  [162513.519304]  do_syscall_64+0x33/0x80
  [162513.519307]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.519309] RIP: 0033:0x7f5238f4a903
  [162513.519310] Code: Bad RIP value.
  [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
  [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903
  [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470
  [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002
  [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013
  [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620
  [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds.
  [162513.519727]       Not tainted 5.9.0-rc6-btrfs-next-69 #1
  [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [162513.520508] task:btrfs           state:D stack:    0 pid:1356211 ppid:1356178 flags:0x00004002
  [162513.520511] Call Trace:
  [162513.520516]  __schedule+0x5ce/0xd00
  [162513.520519]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
  [162513.520525]  schedule+0x46/0xf0
  [162513.520544]  btrfs_scrub_pause+0x11f/0x180 [btrfs]
  [162513.520548]  ? finish_wait+0x90/0x90
  [162513.520562]  btrfs_commit_transaction+0x45a/0xc30 [btrfs]
  [162513.520574]  ? start_transaction+0xe0/0x5f0 [btrfs]
  [162513.520596]  btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs]
  [162513.520619]  btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs]
  [162513.520639]  btrfs_ioctl+0x2a25/0x36f0 [btrfs]
  [162513.520643]  ? do_sigaction+0xf3/0x240
  [162513.520645]  ? find_held_lock+0x32/0x90
  [162513.520648]  ? do_sigaction+0xf3/0x240
  [162513.520651]  ? lock_acquired+0x33b/0x470
  [162513.520655]  ? _raw_spin_unlock_irq+0x24/0x50
  [162513.520657]  ? lockdep_hardirqs_on+0x7d/0x100
  [162513.520660]  ? _raw_spin_unlock_irq+0x35/0x50
  [162513.520662]  ? do_sigaction+0xf3/0x240
  [162513.520671]  ? __x64_sys_ioctl+0x83/0xb0
  [162513.520672]  __x64_sys_ioctl+0x83/0xb0
  [162513.520677]  do_syscall_64+0x33/0x80
  [162513.520679]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [162513.520681] RIP: 0033:0x7fc3cd307d87
  [162513.520682] Code: Bad RIP value.
  [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
  [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87
  [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003
  [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003
  [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001
  [162513.520703]
		  Showing all locks held in the system:
  [162513.520712] 1 lock held by khungtaskd/54:
  [162513.520713]  #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197
  [162513.520728] 1 lock held by in:imklog/596:
  [162513.520729]  #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60
  [162513.520782] 1 lock held by btrfs-transacti/1356167:
  [162513.520784]  #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs]
  [162513.520798] 1 lock held by btrfs/1356190:
  [162513.520800]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60
  [162513.520805] 1 lock held by fsstress/1356184:
  [162513.520806]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
  [162513.520811] 3 locks held by fsstress/1356185:
  [162513.520812]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
  [162513.520815]  #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120
  [162513.520820]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
  [162513.520833] 1 lock held by fsstress/1356196:
  [162513.520834]  #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0
  [162513.520838] 3 locks held by fsstress/1356197:
  [162513.520839]  #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50
  [162513.520843]  #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50
  [162513.520846]  #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]
  [162513.520858] 2 locks held by btrfs/1356211:
  [162513.520859]  #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs]
  [162513.520877]  #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs]

This was weird because the stack traces show that a transaction commit,
triggered by a device replace operation, is blocking trying to pause any
running scrubs but there are no stack traces of blocked tasks doing a
scrub.

After poking around with drgn, I noticed there was a scrub task that was
constantly running and blocking for shorts periods of time:

  >>> t = find_task(prog, 1356190)
  >>> prog.stack_trace(t)
  #0  __schedule+0x5ce/0xcfc
  #1  schedule+0x46/0xe4
  #2  schedule_timeout+0x1df/0x475
  #3  btrfs_reada_wait+0xda/0x132
  #4  scrub_stripe+0x2a8/0x112f
  #5  scrub_chunk+0xcd/0x134
  #6  scrub_enumerate_chunks+0x29e/0x5ee
  #7  btrfs_scrub_dev+0x2d5/0x91b
  #8  btrfs_ioctl+0x7f5/0x36e7
  #9  __x64_sys_ioctl+0x83/0xb0
  torvalds#10 do_syscall_64+0x33/0x77
  torvalds#11 entry_SYSCALL_64+0x7c/0x156

Which corresponds to:

int btrfs_reada_wait(void *handle)
{
    struct reada_control *rc = handle;
    struct btrfs_fs_info *fs_info = rc->fs_info;

    while (atomic_read(&rc->elems)) {
        if (!atomic_read(&fs_info->reada_works_cnt))
            reada_start_machine(fs_info);
        wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0,
                          (HZ + 9) / 10);
    }
(...)

So the counter "rc->elems" was set to 1 and never decreased to 0, causing
the scrub task to loop forever in that function. Then I used the following
script for drgn to check the readahead requests:

  $ cat dump_reada.py
  import sys
  import drgn
  from drgn import NULL, Object, cast, container_of, execscript, \
      reinterpret, sizeof
  from drgn.helpers.linux import *

  mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

  mnt = None
  for mnt in for_each_mount(prog, dst = mnt_path):
      pass

  if mnt is None:
      sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
      sys.exit(1)

  fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

  def dump_re(re):
      nzones = re.nzones.value_()
      print(f're at {hex(re.value_())}')
      print(f'\t logical {re.logical.value_()}')
      print(f'\t refcnt {re.refcnt.value_()}')
      print(f'\t nzones {nzones}')
      for i in range(nzones):
          dev = re.zones[i].device
          name = dev.name.str.string_()
          print(f'\t\t dev id {dev.devid.value_()} name {name}')
      print()

  for _, e in radix_tree_for_each(fs_info.reada_tree):
      re = cast('struct reada_extent *', e)
      dump_re(re)

  $ drgn dump_reada.py
  re at 0xffff8f3da9d25ad8
          logical 38928384
          refcnt 1
          nzones 1
                 dev id 0 name b'/dev/sdd'
  $

So there was one readahead extent with a single zone corresponding to the
source device of that last device replace operation logged in dmesg/syslog.
Also the ID of that zone's device was 0 which is a special value set in
the source device of a device replace operation when the operation finishes
(constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()),
confirming again that device /dev/sdd was the source of a device replace
operation.

Normally there should be as many zones in the readahead extent as there are
devices, and I wasn't expecting the extent to be in a block group with a
'single' profile, so I went and confirmed with the following drgn script
that there weren't any single profile block groups:

  $ cat dump_block_groups.py
  import sys
  import drgn
  from drgn import NULL, Object, cast, container_of, execscript, \
      reinterpret, sizeof
  from drgn.helpers.linux import *

  mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1"

  mnt = None
  for mnt in for_each_mount(prog, dst = mnt_path):
      pass

  if mnt is None:
      sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
      sys.exit(1)

  fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info)

  BTRFS_BLOCK_GROUP_DATA = (1 << 0)
  BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1)
  BTRFS_BLOCK_GROUP_METADATA = (1 << 2)
  BTRFS_BLOCK_GROUP_RAID0 = (1 << 3)
  BTRFS_BLOCK_GROUP_RAID1 = (1 << 4)
  BTRFS_BLOCK_GROUP_DUP = (1 << 5)
  BTRFS_BLOCK_GROUP_RAID10 = (1 << 6)
  BTRFS_BLOCK_GROUP_RAID5 = (1 << 7)
  BTRFS_BLOCK_GROUP_RAID6 = (1 << 8)
  BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9)
  BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10)

  def bg_flags_string(bg):
      flags = bg.flags.value_()
      ret = ''
      if flags & BTRFS_BLOCK_GROUP_DATA:
          ret = 'data'
      if flags & BTRFS_BLOCK_GROUP_METADATA:
          if len(ret) > 0:
              ret += '|'
          ret += 'meta'
      if flags & BTRFS_BLOCK_GROUP_SYSTEM:
          if len(ret) > 0:
              ret += '|'
          ret += 'system'
      if flags & BTRFS_BLOCK_GROUP_RAID0:
          ret += ' raid0'
      elif flags & BTRFS_BLOCK_GROUP_RAID1:
          ret += ' raid1'
      elif flags & BTRFS_BLOCK_GROUP_DUP:
          ret += ' dup'
      elif flags & BTRFS_BLOCK_GROUP_RAID10:
          ret += ' raid10'
      elif flags & BTRFS_BLOCK_GROUP_RAID5:
          ret += ' raid5'
      elif flags & BTRFS_BLOCK_GROUP_RAID6:
          ret += ' raid6'
      elif flags & BTRFS_BLOCK_GROUP_RAID1C3:
          ret += ' raid1c3'
      elif flags & BTRFS_BLOCK_GROUP_RAID1C4:
          ret += ' raid1c4'
      else:
          ret += ' single'

      return ret

  def dump_bg(bg):
      print()
      print(f'block group at {hex(bg.value_())}')
      print(f'\t start {bg.start.value_()} length {bg.length.value_()}')
      print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}')

  bg_root = fs_info.block_group_cache_tree.address_of_()
  for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'):
      dump_bg(bg)

  $ drgn dump_block_groups.py

  block group at 0xffff8f3d673b0400
         start 22020096 length 16777216
         flags 258 - system raid6

  block group at 0xffff8f3d53ddb400
         start 38797312 length 536870912
         flags 260 - meta raid6

  block group at 0xffff8f3d5f4d9c00
         start 575668224 length 2147483648
         flags 257 - data raid6

  block group at 0xffff8f3d08189000
         start 2723151872 length 67108864
         flags 258 - system raid6

  block group at 0xffff8f3db70ff000
         start 2790260736 length 1073741824
         flags 260 - meta raid6

  block group at 0xffff8f3d5f4dd800
         start 3864002560 length 67108864
         flags 258 - system raid6

  block group at 0xffff8f3d67037000
         start 3931111424 length 2147483648
         flags 257 - data raid6
  $

So there were only 2 reasons left for having a readahead extent with a
single zone: reada_find_zone(), called when creating a readahead extent,
returned NULL either because we failed to find the corresponding block
group or because a memory allocation failed. With some additional and
custom tracing I figured out that on every further ocurrence of the
problem the block group had just been deleted when we were looping to
create the zones for the readahead extent (at reada_find_extent()), so we
ended up with only one zone in the readahead extent, corresponding to a
device that ends up getting replaced.

So after figuring that out it became obvious why the hang happens:

1) Task A starts a scrub on any device of the filesystem, except for
   device /dev/sdd;

2) Task B starts a device replace with /dev/sdd as the source device;

3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently
   starting to scrub a stripe from block group X. This call to
   btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add()
   calls reada_add_block(), it passes the logical address of the extent
   tree's root node as its 'logical' argument - a value of 38928384;

4) Task A then enters reada_find_extent(), called from reada_add_block().
   It finds there isn't any existing readahead extent for the logical
   address 38928384, so it proceeds to the path of creating a new one.

   It calls btrfs_map_block() to find out which stripes exist for the block
   group X. On the first iteration of the for loop that iterates over the
   stripes, it finds the stripe for device /dev/sdd, so it creates one
   zone for that device and adds it to the readahead extent. Before getting
   into the second iteration of the loop, the cleanup kthread deletes block
   group X because it was empty. So in the iterations for the remaining
   stripes it does not add more zones to the readahead extent, because the
   calls to reada_find_zone() returned NULL because they couldn't find
   block group X anymore.

   As a result the new readahead extent has a single zone, corresponding to
   the device /dev/sdd;

4) Before task A returns to btrfs_reada_add() and queues the readahead job
   for the readahead work queue, task B finishes the device replace and at
   btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new
   device /dev/sdg;

5) Task A returns to reada_add_block(), which increments the counter
   "->elems" of the reada_control structure allocated at btrfs_reada_add().

   Then it returns back to btrfs_reada_add() and calls
   reada_start_machine(). This queues a job in the readahead work queue to
   run the function reada_start_machine_worker(), which calls
   __reada_start_machine().

   At __reada_start_machine() we take the device list mutex and for each
   device found in the current device list, we call
   reada_start_machine_dev() to start the readahead work. However at this
   point the device /dev/sdd was already freed and is not in the device
   list anymore.

   This means the corresponding readahead for the extent at 38928384 is
   never started, and therefore the "->elems" counter of the reada_control
   structure allocated at btrfs_reada_add() never goes down to 0, causing
   the call to btrfs_reada_wait(), done by the scrub task, to wait forever.

Note that the readahead request can be made either after the device replace
started or before it started, however in pratice it is very unlikely that a
device replace is able to start after a readahead request is made and is
able to complete before the readahead request completes - maybe only on a
very small and nearly empty filesystem.

This hang however is not the only problem we can have with readahead and
device removals. When the readahead extent has other zones other than the
one corresponding to the device that is being removed (either by a device
replace or a device remove operation), we risk having a use-after-free on
the device when dropping the last reference of the readahead extent.

For example if we create a readahead extent with two zones, one for the
device /dev/sdd and one for the device /dev/sde:

1) Before the readahead worker starts, the device /dev/sdd is removed,
   and the corresponding btrfs_device structure is freed. However the
   readahead extent still has the zone pointing to the device structure;

2) When the readahead worker starts, it only finds device /dev/sde in the
   current device list of the filesystem;

3) It starts the readahead work, at reada_start_machine_dev(), using the
   device /dev/sde;

4) Then when it finishes reading the extent from device /dev/sde, it calls
   __readahead_hook() which ends up dropping the last reference on the
   readahead extent through the last call to reada_extent_put();

5) At reada_extent_put() it iterates over each zone of the readahead extent
   and attempts to delete an element from the device's 'reada_extents'
   radix tree, resulting in a use-after-free, as the device pointer of the
   zone for /dev/sdd is now stale. We can also access the device after
   dropping the last reference of a zone, through reada_zone_release(),
   also called by reada_extent_put().

And a device remove suffers the same problem, however since it shrinks the
device size down to zero before removing the device, it is very unlikely to
still have readahead requests not completed by the time we free the device,
the only possibility is if the device has a very little space allocated.

While the hang problem is exclusive to scrub, since it is currently the
only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free
problem affects any path that triggers readhead, which includes
btree_readahead_hook() and __readahead_hook() (a readahead worker can
trigger readahed for the children of a node) for example - any path that
ends up calling reada_add_block() can trigger the use-after-free after a
device is removed.

So fix this by waiting for any readahead requests for a device to complete
before removing a device, ensuring that while waiting for existing ones no
new ones can be made.

This problem has been around for a very long time - the readahead code was
added in 2011, device remove exists since 2008 and device replace was
introduced in 2013, hard to pick a specific commit for a git Fixes tag.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
jlelli pushed a commit that referenced this pull request Jan 11, 2021
When enabling qgroups we walk the tree_root and then add a qgroup item
for every root that we have.  This creates a lock dependency on the
tree_root and qgroup_root, which results in the following lockdep splat
(with tree locks using rwsem), eg. in tests btrfs/017 or btrfs/022:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0-default+ #1299 Not tainted
  ------------------------------------------------------
  btrfs/24552 is trying to acquire lock:
  ffff9142dfc5f630 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]

  but task is already holding lock:
  ffff9142dfc5d0b0 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (btrfs-root-00){++++}-{3:3}:
	 __lock_acquire+0x3fb/0x730
	 lock_acquire.part.0+0x6a/0x130
	 down_read_nested+0x46/0x130
	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
	 btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
	 btrfs_search_slot+0xc3/0x9f0 [btrfs]
	 btrfs_insert_item+0x6e/0x140 [btrfs]
	 btrfs_create_tree+0x1cb/0x240 [btrfs]
	 btrfs_quota_enable+0xcd/0x790 [btrfs]
	 btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
	 __x64_sys_ioctl+0x83/0xa0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (btrfs-quota-00){++++}-{3:3}:
	 check_prev_add+0x91/0xc30
	 validate_chain+0x491/0x750
	 __lock_acquire+0x3fb/0x730
	 lock_acquire.part.0+0x6a/0x130
	 down_read_nested+0x46/0x130
	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
	 btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
	 btrfs_search_slot+0xc3/0x9f0 [btrfs]
	 btrfs_insert_empty_items+0x58/0xa0 [btrfs]
	 add_qgroup_item.part.0+0x72/0x210 [btrfs]
	 btrfs_quota_enable+0x3bb/0x790 [btrfs]
	 btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
	 __x64_sys_ioctl+0x83/0xa0
	 do_syscall_64+0x2d/0x70
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(btrfs-root-00);
				 lock(btrfs-quota-00);
				 lock(btrfs-root-00);
    lock(btrfs-quota-00);

   *** DEADLOCK ***

  5 locks held by btrfs/24552:
   #0: ffff9142df431478 (sb_writers#10){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0xa0
   #1: ffff9142f9b10cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl_quota_ctl+0x7b/0xe0 [btrfs]
   #2: ffff9142f9b11a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0x790 [btrfs]
   #3: ffff9142df431698 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x406/0x510 [btrfs]
   #4: ffff9142dfc5d0b0 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]

  stack backtrace:
  CPU: 1 PID: 24552 Comm: btrfs Not tainted 5.9.0-default+ #1299
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
  Call Trace:
   dump_stack+0x77/0x97
   check_noncircular+0xf3/0x110
   check_prev_add+0x91/0xc30
   validate_chain+0x491/0x750
   __lock_acquire+0x3fb/0x730
   lock_acquire.part.0+0x6a/0x130
   ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
   ? lock_acquire+0xc4/0x140
   ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
   down_read_nested+0x46/0x130
   ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
   __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
   ? btrfs_root_node+0xd9/0x200 [btrfs]
   __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
   btrfs_search_slot_get_root+0x11d/0x290 [btrfs]
   btrfs_search_slot+0xc3/0x9f0 [btrfs]
   btrfs_insert_empty_items+0x58/0xa0 [btrfs]
   add_qgroup_item.part.0+0x72/0x210 [btrfs]
   btrfs_quota_enable+0x3bb/0x790 [btrfs]
   btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs]
   __x64_sys_ioctl+0x83/0xa0
   do_syscall_64+0x2d/0x70
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by dropping the path whenever we find a root item, add the
qgroup item, and then re-lookup the root item we found and continue
processing roots.

Reported-by: David Sterba <dsterba@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
jlelli pushed a commit that referenced this pull request Jan 11, 2021
I got the following lockdep splat with tree locks converted to rwsem
patches on btrfs/104:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.9.0+ torvalds#102 Not tainted
  ------------------------------------------------------
  btrfs-cleaner/903 is trying to acquire lock:
  ffff8e7fab6ffe30 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170

  but task is already holding lock:
  ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (&fs_info->commit_root_sem){++++}-{3:3}:
	 down_read+0x40/0x130
	 caching_thread+0x53/0x5a0
	 btrfs_work_helper+0xfa/0x520
	 process_one_work+0x238/0x540
	 worker_thread+0x55/0x3c0
	 kthread+0x13a/0x150
	 ret_from_fork+0x1f/0x30

  -> #2 (&caching_ctl->mutex){+.+.}-{3:3}:
	 __mutex_lock+0x7e/0x7b0
	 btrfs_cache_block_group+0x1e0/0x510
	 find_free_extent+0xb6e/0x12f0
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb1/0x330
	 alloc_tree_block_no_bg_flush+0x4f/0x60
	 __btrfs_cow_block+0x11d/0x580
	 btrfs_cow_block+0x10c/0x220
	 commit_cowonly_roots+0x47/0x2e0
	 btrfs_commit_transaction+0x595/0xbd0
	 sync_filesystem+0x74/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0x14/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x36/0xa0
	 cleanup_mnt+0x12d/0x190
	 task_work_run+0x5c/0xa0
	 exit_to_user_mode_prepare+0x1df/0x200
	 syscall_exit_to_user_mode+0x54/0x280
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (&space_info->groups_sem){++++}-{3:3}:
	 down_read+0x40/0x130
	 find_free_extent+0x2ed/0x12f0
	 btrfs_reserve_extent+0xb3/0x1b0
	 btrfs_alloc_tree_block+0xb1/0x330
	 alloc_tree_block_no_bg_flush+0x4f/0x60
	 __btrfs_cow_block+0x11d/0x580
	 btrfs_cow_block+0x10c/0x220
	 commit_cowonly_roots+0x47/0x2e0
	 btrfs_commit_transaction+0x595/0xbd0
	 sync_filesystem+0x74/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0x14/0x30
	 btrfs_kill_super+0x12/0x20
	 deactivate_locked_super+0x36/0xa0
	 cleanup_mnt+0x12d/0x190
	 task_work_run+0x5c/0xa0
	 exit_to_user_mode_prepare+0x1df/0x200
	 syscall_exit_to_user_mode+0x54/0x280
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (btrfs-root-00){++++}-{3:3}:
	 __lock_acquire+0x1167/0x2150
	 lock_acquire+0xb9/0x3d0
	 down_read_nested+0x43/0x130
	 __btrfs_tree_read_lock+0x32/0x170
	 __btrfs_read_lock_root_node+0x3a/0x50
	 btrfs_search_slot+0x614/0x9d0
	 btrfs_find_root+0x35/0x1b0
	 btrfs_read_tree_root+0x61/0x120
	 btrfs_get_root_ref+0x14b/0x600
	 find_parent_nodes+0x3e6/0x1b30
	 btrfs_find_all_roots_safe+0xb4/0x130
	 btrfs_find_all_roots+0x60/0x80
	 btrfs_qgroup_trace_extent_post+0x27/0x40
	 btrfs_add_delayed_data_ref+0x3fd/0x460
	 btrfs_free_extent+0x42/0x100
	 __btrfs_mod_ref+0x1d7/0x2f0
	 walk_up_proc+0x11c/0x400
	 walk_up_tree+0xf0/0x180
	 btrfs_drop_snapshot+0x1c7/0x780
	 btrfs_clean_one_deleted_snapshot+0xfb/0x110
	 cleaner_kthread+0xd4/0x140
	 kthread+0x13a/0x150
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    btrfs-root-00 --> &caching_ctl->mutex --> &fs_info->commit_root_sem

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&fs_info->commit_root_sem);
				 lock(&caching_ctl->mutex);
				 lock(&fs_info->commit_root_sem);
    lock(btrfs-root-00);

   *** DEADLOCK ***

  3 locks held by btrfs-cleaner/903:
   #0: ffff8e7fab628838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
   #1: ffff8e7faadac640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
   #2: ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80

  stack backtrace:
  CPU: 0 PID: 903 Comm: btrfs-cleaner Not tainted 5.9.0+ torvalds#102
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
  Call Trace:
   dump_stack+0x8b/0xb0
   check_noncircular+0xcf/0xf0
   __lock_acquire+0x1167/0x2150
   ? __bfs+0x42/0x210
   lock_acquire+0xb9/0x3d0
   ? __btrfs_tree_read_lock+0x32/0x170
   down_read_nested+0x43/0x130
   ? __btrfs_tree_read_lock+0x32/0x170
   __btrfs_tree_read_lock+0x32/0x170
   __btrfs_read_lock_root_node+0x3a/0x50
   btrfs_search_slot+0x614/0x9d0
   ? find_held_lock+0x2b/0x80
   btrfs_find_root+0x35/0x1b0
   ? do_raw_spin_unlock+0x4b/0xa0
   btrfs_read_tree_root+0x61/0x120
   btrfs_get_root_ref+0x14b/0x600
   find_parent_nodes+0x3e6/0x1b30
   btrfs_find_all_roots_safe+0xb4/0x130
   btrfs_find_all_roots+0x60/0x80
   btrfs_qgroup_trace_extent_post+0x27/0x40
   btrfs_add_delayed_data_ref+0x3fd/0x460
   btrfs_free_extent+0x42/0x100
   __btrfs_mod_ref+0x1d7/0x2f0
   walk_up_proc+0x11c/0x400
   walk_up_tree+0xf0/0x180
   btrfs_drop_snapshot+0x1c7/0x780
   ? btrfs_clean_one_deleted_snapshot+0x73/0x110
   btrfs_clean_one_deleted_snapshot+0xfb/0x110
   cleaner_kthread+0xd4/0x140
   ? btrfs_alloc_root+0x50/0x50
   kthread+0x13a/0x150
   ? kthread_create_worker_on_cpu+0x40/0x40
   ret_from_fork+0x1f/0x30
  BTRFS info (device sdb): disk space caching is enabled
  BTRFS info (device sdb): has skinny extents

This happens because qgroups does a backref lookup when we create a
delayed ref.  From here it may have to look up a root from an indirect
ref, which does a normal lookup on the tree_root, which takes the read
lock on the tree_root nodes.

To fix this we need to add a variant for looking up roots that searches
the commit root of the tree_root.  Then when we do the backref search
using the commit root we are sure to not take any locks on the tree_root
nodes.  This gets rid of the lockdep splat when running btrfs/104.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
jlelli pushed a commit that referenced this pull request Jan 11, 2021
Ido Schimmel says:

====================
mlxsw: Various fixes

This patch set contains various fixes for mlxsw.

Patch #1 ensures that only link modes that are supported by both the
device and the driver are advertised. When a link mode that is not
supported by the driver is negotiated by the device, it will be
presented as an unknown speed by ethtool, causing the bond driver to
wrongly assume that the link is down.

Patch #2 fixes a trivial memory leak upon module removal.

Patch #3 fixes a use-after-free that syzkaller was able to trigger once
on a slow emulator after a few months of fuzzing.
====================

Link: https://lore.kernel.org/r/20201024133733.2107509-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
jlelli pushed a commit that referenced this pull request Jan 11, 2021
…/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for v5.10, take #2

- Fix compilation error when PMD and PUD are folded
- Fix regresssion of the RAZ behaviour of ID_AA64ZFR0_EL1
jlelli pushed a commit that referenced this pull request Jan 11, 2021
While doing memory hot-unplug operation on a PowerPC VM running 1024 CPUs
with 11TB of ram, I hit the following panic:

    BUG: Kernel NULL pointer dereference on read at 0x00000007
    Faulting instruction address: 0xc000000000456048
    Oops: Kernel access of bad area, sig: 11 [#2]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS= 2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp
    CPU: 160 PID: 1 Comm: systemd Tainted: G      D           5.9.0 #1
    NIP:  c000000000456048 LR: c000000000455fd4 CTR: c00000000047b350
    REGS: c00006028d1b77a0 TRAP: 0300   Tainted: G      D            (5.9.0)
    MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24004228  XER: 00000000
    CFAR: c00000000000f1b0 DAR: 0000000000000007 DSISR: 40000000 IRQMASK: 0
    GPR00: c000000000455fd4 c00006028d1b7a30 c000000001bec800 0000000000000000
    GPR04: 0000000000000dc0 0000000000000000 00000000000374ef c00007c53df99320
    GPR08: 000007c53c980000 0000000000000000 000007c53c980000 0000000000000000
    GPR12: 0000000000004400 c00000001e8e4400 0000000000000000 0000000000000f6a
    GPR16: 0000000000000000 c000000001c25930 c000000001d62528 00000000000000c1
    GPR20: c000000001d62538 c00006be469e9000 0000000fffffffe0 c0000000003c0ff8
    GPR24: 0000000000000018 0000000000000000 0000000000000dc0 0000000000000000
    GPR28: c00007c513755700 c000000001c236a4 c00007bc4001f800 0000000000000001
    NIP [c000000000456048] __kmalloc_node+0x108/0x790
    LR [c000000000455fd4] __kmalloc_node+0x94/0x790
    Call Trace:
      kvmalloc_node+0x58/0x110
      mem_cgroup_css_online+0x10c/0x270
      online_css+0x48/0xd0
      cgroup_apply_control_enable+0x2c4/0x470
      cgroup_mkdir+0x408/0x5f0
      kernfs_iop_mkdir+0x90/0x100
      vfs_mkdir+0x138/0x250
      do_mkdirat+0x154/0x1c0
      system_call_exception+0xf8/0x200
      system_call_common+0xf0/0x27c
    Instruction dump:
    e93e0000 e90d0030 39290008 7cc9402a e94d0030 e93e0000 7ce95214 7f89502a
    2fbc0000 419e0018 41920230 e9270010 <89290007> 7f994800 419e0220 7ee6bb78

This pointing to the following code:

    mm/slub.c:2851
            if (unlikely(!object || !node_match(page, node))) {
    c000000000456038:       00 00 bc 2f     cmpdi   cr7,r28,0
    c00000000045603c:       18 00 9e 41     beq     cr7,c000000000456054 <__kmalloc_node+0x114>
    node_match():
    mm/slub.c:2491
            if (node != NUMA_NO_NODE && page_to_nid(page) != node)
    c000000000456040:       30 02 92 41     beq     cr4,c000000000456270 <__kmalloc_node+0x330>
    page_to_nid():
    include/linux/mm.h:1294
    c000000000456044:       10 00 27 e9     ld      r9,16(r7)
    c000000000456048:       07 00 29 89     lbz     r9,7(r9)	<<<< r9 = NULL
    node_match():
    mm/slub.c:2491
    c00000000045604c:       00 48 99 7f     cmpw    cr7,r25,r9
    c000000000456050:       20 02 9e 41     beq     cr7,c000000000456270 <__kmalloc_node+0x330>

The panic occurred in slab_alloc_node() when checking for the page's node:

	object = c->freelist;
	page = c->page;
	if (unlikely(!object || !node_match(page, node))) {
		object = __slab_alloc(s, gfpflags, node, addr, c);
		stat(s, ALLOC_SLOWPATH);

The issue is that object is not NULL while page is NULL which is odd but
may happen if the cache flush happened after loading object but before
loading page.  Thus checking for the page pointer is required too.

The cache flush is done through an inter processor interrupt when a
piece of memory is off-lined.  That interrupt is triggered when a memory
hot-unplug operation is initiated and offline_pages() is calling the
slub's MEM_GOING_OFFLINE callback slab_mem_going_offline_callback()
which is calling flush_cpu_slab().  If that interrupt is caught between
the reading of c->freelist and the reading of c->page, this could lead
to such a situation.  That situation is expected and the later call to
this_cpu_cmpxchg_double() will detect the change to c->freelist and redo
the whole operation.

In commit 6159d0f ("mm/slub.c: page is always non-NULL in
node_match()") check on the page pointer has been removed assuming that
page is always valid when it is called.  It happens that this is not
true in that particular case, so check for page before calling
node_match() here.

Fixes: 6159d0f ("mm/slub.c: page is always non-NULL in node_match()")
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201027190406.33283-1-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
jlelli pushed a commit that referenced this pull request Jan 11, 2021
This fix is for a failure that occurred in the DWARF unwind perf test.

Stack unwinders may probe memory when looking for frames.

Memory sanitizer will poison and track uninitialized memory on the
stack, and on the heap if the value is copied to the heap.

This can lead to false memory sanitizer failures for the use of an
uninitialized value.

Avoid this problem by removing the poison on the copied stack.

The full msan failure with track origins looks like:

==2168==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x559ceb10755b in handle_cfi elfutils/libdwfl/frame_unwind.c:648:8
    #1 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4
    #2 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7
    #3 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10
    #4 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17
    #5 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17
    #6 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14
    #7 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10
    #8 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8
    #9 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8
    torvalds#10 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26
    torvalds#11 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0)
    torvalds#12 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2
    torvalds#13 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9
    torvalds#14 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9
    torvalds#15 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8
    torvalds#16 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9
    torvalds#17 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9
    torvalds#18 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4
    torvalds#19 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9
    torvalds#20 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11
    torvalds#21 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8
    torvalds#22 0x559cea95fbce in run_argv tools/perf/perf.c:409:2
    torvalds#23 0x559cea95fbce in main tools/perf/perf.c:539:3

  Uninitialized value was stored to memory at
    #0 0x559ceb106acf in __libdwfl_frame_reg_set elfutils/libdwfl/frame_unwind.c:77:22
    #1 0x559ceb106acf in handle_cfi elfutils/libdwfl/frame_unwind.c:627:13
    #2 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4
    #3 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7
    #4 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10
    #5 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17
    #6 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17
    #7 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14
    #8 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10
    #9 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8
    torvalds#10 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8
    torvalds#11 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26
    torvalds#12 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0)
    torvalds#13 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2
    torvalds#14 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9
    torvalds#15 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9
    torvalds#16 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8
    torvalds#17 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9
    torvalds#18 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9
    torvalds#19 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4
    torvalds#20 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9
    torvalds#21 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11
    torvalds#22 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8
    torvalds#23 0x559cea95fbce in run_argv tools/perf/perf.c:409:2
    torvalds#24 0x559cea95fbce in main tools/perf/perf.c:539:3

  Uninitialized value was stored to memory at
    #0 0x559ceb106a54 in handle_cfi elfutils/libdwfl/frame_unwind.c:613:9
    #1 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4
    #2 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7
    #3 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10
    #4 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17
    #5 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17
    #6 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14
    #7 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10
    #8 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8
    #9 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8
    torvalds#10 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26
    torvalds#11 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0)
    torvalds#12 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2
    torvalds#13 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9
    torvalds#14 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9
    torvalds#15 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8
    torvalds#16 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9
    torvalds#17 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9
    torvalds#18 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4
    torvalds#19 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9
    torvalds#20 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11
    torvalds#21 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8
    torvalds#22 0x559cea95fbce in run_argv tools/perf/perf.c:409:2
    torvalds#23 0x559cea95fbce in main tools/perf/perf.c:539:3

  Uninitialized value was stored to memory at
    #0 0x559ceaff8800 in memory_read tools/perf/util/unwind-libdw.c:156:10
    #1 0x559ceb10f053 in expr_eval elfutils/libdwfl/frame_unwind.c:501:13
    #2 0x559ceb1060cc in handle_cfi elfutils/libdwfl/frame_unwind.c:603:18
    #3 0x559ceb105448 in __libdwfl_frame_unwind elfutils/libdwfl/frame_unwind.c:741:4
    #4 0x559ceb0ece90 in dwfl_thread_getframes elfutils/libdwfl/dwfl_frame.c:435:7
    #5 0x559ceb0ec6b7 in get_one_thread_frames_cb elfutils/libdwfl/dwfl_frame.c:379:10
    #6 0x559ceb0ec6b7 in get_one_thread_cb elfutils/libdwfl/dwfl_frame.c:308:17
    #7 0x559ceb0ec6b7 in dwfl_getthreads elfutils/libdwfl/dwfl_frame.c:283:17
    #8 0x559ceb0ec6b7 in getthread elfutils/libdwfl/dwfl_frame.c:354:14
    #9 0x559ceb0ec6b7 in dwfl_getthread_frames elfutils/libdwfl/dwfl_frame.c:388:10
    torvalds#10 0x559ceaff6ae6 in unwind__get_entries tools/perf/util/unwind-libdw.c:236:8
    torvalds#11 0x559ceabc9dbc in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:111:8
    torvalds#12 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26
    torvalds#13 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0)
    torvalds#14 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2
    torvalds#15 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9
    torvalds#16 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9
    torvalds#17 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8
    torvalds#18 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9
    torvalds#19 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9
    torvalds#20 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4
    torvalds#21 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9
    torvalds#22 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11
    torvalds#23 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8
    torvalds#24 0x559cea95fbce in run_argv tools/perf/perf.c:409:2
    torvalds#25 0x559cea95fbce in main tools/perf/perf.c:539:3

  Uninitialized value was stored to memory at
    #0 0x559cea9027d9 in __msan_memcpy llvm/llvm-project/compiler-rt/lib/msan/msan_interceptors.cpp:1558:3
    #1 0x559cea9d2185 in sample_ustack tools/perf/arch/x86/tests/dwarf-unwind.c:41:2
    #2 0x559cea9d202c in test__arch_unwind_sample tools/perf/arch/x86/tests/dwarf-unwind.c:72:9
    #3 0x559ceabc9cbd in test_dwarf_unwind__thread tools/perf/tests/dwarf-unwind.c:106:6
    #4 0x559ceabca5cf in test_dwarf_unwind__compare tools/perf/tests/dwarf-unwind.c:138:26
    #5 0x7f812a6865b0 in bsearch (libc.so.6+0x4e5b0)
    #6 0x559ceabca871 in test_dwarf_unwind__krava_3 tools/perf/tests/dwarf-unwind.c:162:2
    #7 0x559ceabca926 in test_dwarf_unwind__krava_2 tools/perf/tests/dwarf-unwind.c:169:9
    #8 0x559ceabca946 in test_dwarf_unwind__krava_1 tools/perf/tests/dwarf-unwind.c:174:9
    #9 0x559ceabcae12 in test__dwarf_unwind tools/perf/tests/dwarf-unwind.c:211:8
    torvalds#10 0x559ceabbc4ab in run_test tools/perf/tests/builtin-test.c:418:9
    torvalds#11 0x559ceabbc4ab in test_and_print tools/perf/tests/builtin-test.c:448:9
    torvalds#12 0x559ceabbac70 in __cmd_test tools/perf/tests/builtin-test.c:669:4
    torvalds#13 0x559ceabbac70 in cmd_test tools/perf/tests/builtin-test.c:815:9
    torvalds#14 0x559cea960e30 in run_builtin tools/perf/perf.c:313:11
    torvalds#15 0x559cea95fbce in handle_internal_command tools/perf/perf.c:365:8
    torvalds#16 0x559cea95fbce in run_argv tools/perf/perf.c:409:2
    torvalds#17 0x559cea95fbce in main tools/perf/perf.c:539:3

  Uninitialized value was created by an allocation of 'bf' in the stack frame of function 'perf_event__synthesize_mmap_events'
    #0 0x559ceafc5f60 in perf_event__synthesize_mmap_events tools/perf/util/synthetic-events.c:445

SUMMARY: MemorySanitizer: use-of-uninitialized-value elfutils/libdwfl/frame_unwind.c:648:8 in handle_cfi
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: clang-built-linux@googlegroups.com
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sandeep Dasgupta <sdasgup@google.com>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lore.kernel.org/lkml/20201113182053.754625-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
jlelli pushed a commit that referenced this pull request Jan 11, 2021
Actually, burst size is equal to '1 << desc->rqcfg.brst_size'.
we should use burst size, not desc->rqcfg.brst_size.

dma memcpy performance on Rockchip RV1126
@ 1512MHz A7, 1056MHz LPDDR3, 200MHz DMA:

dmatest:

/# echo dma0chan0 > /sys/module/dmatest/parameters/channel
/# echo 4194304 > /sys/module/dmatest/parameters/test_buf_size
/# echo 8 > /sys/module/dmatest/parameters/iterations
/# echo y > /sys/module/dmatest/parameters/norandom
/# echo y > /sys/module/dmatest/parameters/verbose
/# echo 1 > /sys/module/dmatest/parameters/run

dmatest: dma0chan0-copy0: result #1: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #2: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #3: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #4: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #5: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #6: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #7: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000
dmatest: dma0chan0-copy0: result #8: 'test passed' with src_off=0x0 dst_off=0x0 len=0x400000

Before:

  dmatest: dma0chan0-copy0: summary 8 tests, 0 failures 48 iops 200338 KB/s (0)

After this patch:

  dmatest: dma0chan0-copy0: summary 8 tests, 0 failures 179 iops 734873 KB/s (0)

After this patch and increase dma clk to 400MHz:

  dmatest: dma0chan0-copy0: summary 8 tests, 0 failures 259 iops 1062929 KB/s (0)

Signed-off-by: Sugar Zhang <sugar.zhang@rock-chips.com>
Link: https://lore.kernel.org/r/1605326106-55681-1-git-send-email-sugar.zhang@rock-chips.com
Signed-off-by: Vinod Koul <vkoul@kernel.org>
jlelli pushed a commit that referenced this pull request Jan 11, 2021
Ido Schimmel says:

====================
mlxsw: Couple of fixes

Patch #1 fixes firmware flashing when CONFIG_MLXSW_CORE=y and
CONFIG_MLXFW=m.

Patch #2 prevents EMAD transactions from needlessly failing when the
system is under heavy load by using exponential backoff.

Please consider patch #2 for stable.
====================

Link: https://lore.kernel.org/r/20201117173352.288491-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
jlelli pushed a commit that referenced this pull request Aug 2, 2021
…asid()

While digesting the XSAVE-related horrors which got introduced with
the supervisor/user split, the recent addition of ENQCMD-related
functionality got on the radar and turned out to be similarly broken.

update_pasid(), which is only required when X86_FEATURE_ENQCMD is
available, is invoked from two places:

 1) From switch_to() for the incoming task

 2) Via a SMP function call from the IOMMU/SMV code

#1 is half-ways correct as it hacks around the brokenness of get_xsave_addr()
   by enforcing the state to be 'present', but all the conditionals in that
   code are completely pointless for that.

   Also the invocation is just useless overhead because at that point
   it's guaranteed that TIF_NEED_FPU_LOAD is set on the incoming task
   and all of this can be handled at return to user space.

#2 is broken beyond repair. The comment in the code claims that it is safe
   to invoke this in an IPI, but that's just wishful thinking.

   FPU state of a running task is protected by fregs_lock() which is
   nothing else than a local_bh_disable(). As BH-disabled regions run
   usually with interrupts enabled the IPI can hit a code section which
   modifies FPU state and there is absolutely no guarantee that any of the
   assumptions which are made for the IPI case is true.

   Also the IPI is sent to all CPUs in mm_cpumask(mm), but the IPI is
   invoked with a NULL pointer argument, so it can hit a completely
   unrelated task and unconditionally force an update for nothing.
   Worse, it can hit a kernel thread which operates on a user space
   address space and set a random PASID for it.

The offending commit does not cleanly revert, but it's sufficient to
force disable X86_FEATURE_ENQCMD and to remove the broken update_pasid()
code to make this dysfunctional all over the place. Anything more
complex would require more surgery and none of the related functions
outside of the x86 core code are blatantly wrong, so removing those
would be overkill.

As nothing enables the PASID bit in the IA32_XSS MSR yet, which is
required to make this actually work, this cannot result in a regression
except for related out of tree train-wrecks, but they are broken already
today.

Fixes: 20f0afd ("x86/mmu: Allocate/free a PASID")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/87mtsd6gr9.ffs@nanos.tec.linutronix.de
jlelli pushed a commit that referenced this pull request Aug 2, 2021
ASan reported a memory leak caused by info_linear not being deallocated.

The info_linear was allocated during in perf_event__synthesize_one_bpf_prog().

This patch adds the corresponding free() when bpf_prog_info_node
is freed in perf_env__purge_bpf().

  $ sudo ./perf record -- sleep 5
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.025 MB perf.data (8 samples) ]

  =================================================================
  ==297735==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 7688 byte(s) in 19 object(s) allocated from:
      #0 0x4f420f in malloc (/home/user/linux/tools/perf/perf+0x4f420f)
      #1 0xc06a74 in bpf_program__get_prog_info_linear /home/user/linux/tools/lib/bpf/libbpf.c:11113:16
      #2 0xb426fe in perf_event__synthesize_one_bpf_prog /home/user/linux/tools/perf/util/bpf-event.c:191:16
      #3 0xb42008 in perf_event__synthesize_bpf_events /home/user/linux/tools/perf/util/bpf-event.c:410:9
      #4 0x594596 in record__synthesize /home/user/linux/tools/perf/builtin-record.c:1490:8
      #5 0x58c9ac in __cmd_record /home/user/linux/tools/perf/builtin-record.c:1798:8
      #6 0x58990b in cmd_record /home/user/linux/tools/perf/builtin-record.c:2901:8
      #7 0x7b2a20 in run_builtin /home/user/linux/tools/perf/perf.c:313:11
      #8 0x7b12ff in handle_internal_command /home/user/linux/tools/perf/perf.c:365:8
      #9 0x7b2583 in run_argv /home/user/linux/tools/perf/perf.c:409:2
      torvalds#10 0x7b0d79 in main /home/user/linux/tools/perf/perf.c:539:3
      torvalds#11 0x7fa357ef6b74 in __libc_start_main /usr/src/debug/glibc-2.33-8.fc34.x86_64/csu/../csu/libc-start.c:332:16

Signed-off-by: Riccardo Mancini <rickyman7@gmail.com>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Link: http://lore.kernel.org/lkml/20210602224024.300485-1-rickyman7@gmail.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
jlelli pushed a commit that referenced this pull request Aug 9, 2021
Add the following Telit FD980 composition 0x1056:

Cfg #1: mass storage
Cfg #2: rndis, tty, adb, tty, tty, tty, tty

Signed-off-by: Daniele Palmas <dnlplm@gmail.com>
Link: https://lore.kernel.org/r/20210803194711.3036-1-dnlplm@gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Johan Hovold <johan@kernel.org>
jlelli pushed a commit that referenced this pull request Nov 23, 2021
Often some test cases like btrfs/161 trigger lockdep splats that complain
about possible unsafe lock scenario due to the fact that during mount,
when reading the chunk tree we end up calling blkdev_get_by_path() while
holding a read lock on a leaf of the chunk tree. That produces a lockdep
splat like the following:

[ 3653.683975] ======================================================
[ 3653.685148] WARNING: possible circular locking dependency detected
[ 3653.686301] 5.15.0-rc7-btrfs-next-103 #1 Not tainted
[ 3653.687239] ------------------------------------------------------
[ 3653.688400] mount/447465 is trying to acquire lock:
[ 3653.689320] ffff8c6b0c76e528 (&disk->open_mutex){+.+.}-{3:3}, at: blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.691054]
               but task is already holding lock:
[ 3653.692155] ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 3653.693978]
               which lock already depends on the new lock.

[ 3653.695510]
               the existing dependency chain (in reverse order) is:
[ 3653.696915]
               -> #3 (btrfs-chunk-00){++++}-{3:3}:
[ 3653.698053]        down_read_nested+0x4b/0x140
[ 3653.698893]        __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 3653.699988]        btrfs_read_lock_root_node+0x31/0x40 [btrfs]
[ 3653.701205]        btrfs_search_slot+0x537/0xc00 [btrfs]
[ 3653.702234]        btrfs_insert_empty_items+0x32/0x70 [btrfs]
[ 3653.703332]        btrfs_init_new_device+0x563/0x15b0 [btrfs]
[ 3653.704439]        btrfs_ioctl+0x2110/0x3530 [btrfs]
[ 3653.705405]        __x64_sys_ioctl+0x83/0xb0
[ 3653.706215]        do_syscall_64+0x3b/0xc0
[ 3653.706990]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3653.708040]
               -> #2 (sb_internal#2){.+.+}-{0:0}:
[ 3653.708994]        lock_release+0x13d/0x4a0
[ 3653.709533]        up_write+0x18/0x160
[ 3653.710017]        btrfs_sync_file+0x3f3/0x5b0 [btrfs]
[ 3653.710699]        __loop_update_dio+0xbd/0x170 [loop]
[ 3653.711360]        lo_ioctl+0x3b1/0x8a0 [loop]
[ 3653.711929]        block_ioctl+0x48/0x50
[ 3653.712442]        __x64_sys_ioctl+0x83/0xb0
[ 3653.712991]        do_syscall_64+0x3b/0xc0
[ 3653.713519]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3653.714233]
               -> #1 (&lo->lo_mutex){+.+.}-{3:3}:
[ 3653.715026]        __mutex_lock+0x92/0x900
[ 3653.715648]        lo_open+0x28/0x60 [loop]
[ 3653.716275]        blkdev_get_whole+0x28/0x90
[ 3653.716867]        blkdev_get_by_dev.part.0+0x142/0x320
[ 3653.717537]        blkdev_open+0x5e/0xa0
[ 3653.718043]        do_dentry_open+0x163/0x390
[ 3653.718604]        path_openat+0x3f0/0xa80
[ 3653.719128]        do_filp_open+0xa9/0x150
[ 3653.719652]        do_sys_openat2+0x97/0x160
[ 3653.720197]        __x64_sys_openat+0x54/0x90
[ 3653.720766]        do_syscall_64+0x3b/0xc0
[ 3653.721285]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3653.721986]
               -> #0 (&disk->open_mutex){+.+.}-{3:3}:
[ 3653.722775]        __lock_acquire+0x130e/0x2210
[ 3653.723348]        lock_acquire+0xd7/0x310
[ 3653.723867]        __mutex_lock+0x92/0x900
[ 3653.724394]        blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.725041]        blkdev_get_by_path+0xb8/0xd0
[ 3653.725614]        btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
[ 3653.726332]        open_fs_devices+0xd7/0x2c0 [btrfs]
[ 3653.726999]        btrfs_read_chunk_tree+0x3ad/0x870 [btrfs]
[ 3653.727739]        open_ctree+0xb8e/0x17bf [btrfs]
[ 3653.728384]        btrfs_mount_root.cold+0x12/0xde [btrfs]
[ 3653.729130]        legacy_get_tree+0x30/0x50
[ 3653.729676]        vfs_get_tree+0x28/0xc0
[ 3653.730192]        vfs_kern_mount.part.0+0x71/0xb0
[ 3653.730800]        btrfs_mount+0x11d/0x3a0 [btrfs]
[ 3653.731427]        legacy_get_tree+0x30/0x50
[ 3653.731970]        vfs_get_tree+0x28/0xc0
[ 3653.732486]        path_mount+0x2d4/0xbe0
[ 3653.732997]        __x64_sys_mount+0x103/0x140
[ 3653.733560]        do_syscall_64+0x3b/0xc0
[ 3653.734080]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3653.734782]
               other info that might help us debug this:

[ 3653.735784] Chain exists of:
                 &disk->open_mutex --> sb_internal#2 --> btrfs-chunk-00

[ 3653.737123]  Possible unsafe locking scenario:

[ 3653.737865]        CPU0                    CPU1
[ 3653.738435]        ----                    ----
[ 3653.739007]   lock(btrfs-chunk-00);
[ 3653.739449]                                lock(sb_internal#2);
[ 3653.740193]                                lock(btrfs-chunk-00);
[ 3653.740955]   lock(&disk->open_mutex);
[ 3653.741431]
                *** DEADLOCK ***

[ 3653.742176] 3 locks held by mount/447465:
[ 3653.742739]  #0: ffff8c6acf85c0e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xd5/0x3b0
[ 3653.744114]  #1: ffffffffc0b28f70 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x59/0x870 [btrfs]
[ 3653.745563]  #2: ffff8c6b0a9f39e0 (btrfs-chunk-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 3653.747066]
               stack backtrace:
[ 3653.747723] CPU: 4 PID: 447465 Comm: mount Not tainted 5.15.0-rc7-btrfs-next-103 #1
[ 3653.748873] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 3653.750592] Call Trace:
[ 3653.750967]  dump_stack_lvl+0x57/0x72
[ 3653.751526]  check_noncircular+0xf3/0x110
[ 3653.752136]  ? stack_trace_save+0x4b/0x70
[ 3653.752748]  __lock_acquire+0x130e/0x2210
[ 3653.753356]  lock_acquire+0xd7/0x310
[ 3653.753898]  ? blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.754596]  ? lock_is_held_type+0xe8/0x140
[ 3653.755125]  ? blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.755729]  ? blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.756338]  __mutex_lock+0x92/0x900
[ 3653.756794]  ? blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.757400]  ? do_raw_spin_unlock+0x4b/0xa0
[ 3653.757930]  ? _raw_spin_unlock+0x29/0x40
[ 3653.758437]  ? bd_prepare_to_claim+0x129/0x150
[ 3653.758999]  ? trace_module_get+0x2b/0xd0
[ 3653.759508]  ? try_module_get.part.0+0x50/0x80
[ 3653.760072]  blkdev_get_by_dev.part.0+0xe7/0x320
[ 3653.760661]  ? devcgroup_check_permission+0xc1/0x1f0
[ 3653.761288]  blkdev_get_by_path+0xb8/0xd0
[ 3653.761797]  btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
[ 3653.762454]  open_fs_devices+0xd7/0x2c0 [btrfs]
[ 3653.763055]  ? clone_fs_devices+0x8f/0x170 [btrfs]
[ 3653.763689]  btrfs_read_chunk_tree+0x3ad/0x870 [btrfs]
[ 3653.764370]  ? kvm_sched_clock_read+0x14/0x40
[ 3653.764922]  open_ctree+0xb8e/0x17bf [btrfs]
[ 3653.765493]  ? super_setup_bdi_name+0x79/0xd0
[ 3653.766043]  btrfs_mount_root.cold+0x12/0xde [btrfs]
[ 3653.766780]  ? rcu_read_lock_sched_held+0x3f/0x80
[ 3653.767488]  ? kfree+0x1f2/0x3c0
[ 3653.767979]  legacy_get_tree+0x30/0x50
[ 3653.768548]  vfs_get_tree+0x28/0xc0
[ 3653.769076]  vfs_kern_mount.part.0+0x71/0xb0
[ 3653.769718]  btrfs_mount+0x11d/0x3a0 [btrfs]
[ 3653.770381]  ? rcu_read_lock_sched_held+0x3f/0x80
[ 3653.771086]  ? kfree+0x1f2/0x3c0
[ 3653.771574]  legacy_get_tree+0x30/0x50
[ 3653.772136]  vfs_get_tree+0x28/0xc0
[ 3653.772673]  path_mount+0x2d4/0xbe0
[ 3653.773201]  __x64_sys_mount+0x103/0x140
[ 3653.773793]  do_syscall_64+0x3b/0xc0
[ 3653.774333]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3653.775094] RIP: 0033:0x7f648bc45aaa

This happens because through btrfs_read_chunk_tree(), which is called only
during mount, ends up acquiring the mutex open_mutex of a block device
while holding a read lock on a leaf of the chunk tree while other paths
need to acquire other locks before locking extent buffers of the chunk
tree.

Since at mount time when we call btrfs_read_chunk_tree() we know that
we don't have other tasks running in parallel and modifying the chunk
tree, we can simply skip locking of chunk tree extent buffers. So do
that and move the assertion that checks the fs is not yet mounted to the
top block of btrfs_read_chunk_tree(), with a comment before doing it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
jlelli pushed a commit that referenced this pull request May 8, 2023
In thread__comm_len(),strlen() is called outside of the
thread->comm_lock critical section,which may cause a UAF
problems if comm__free() is called by the process_thread
concurrently.

backtrace of the core file is as follows:

    (gdb) bt
    #0  __strlen_evex () at ../sysdeps/x86_64/multiarch/strlen-evex.S:77
    #1  0x000055ad15d31de5 in thread__comm_len (thread=0x7f627d20e300) at util/thread.c:320
    #2  0x000055ad15d4fade in hists__calc_col_len (h=0x7f627d295940, hists=0x55ad1772bfe0)
        at util/hist.c:103
    #3  hists__calc_col_len (hists=0x55ad1772bfe0, h=0x7f627d295940) at util/hist.c:79
    #4  0x000055ad15d52c8c in output_resort (hists=hists@entry=0x55ad1772bfe0, prog=0x0,
        use_callchain=false, cb=cb@entry=0x0, cb_arg=0x0) at util/hist.c:1926
    #5  0x000055ad15d530a4 in evsel__output_resort_cb (evsel=evsel@entry=0x55ad1772bde0,
        prog=prog@entry=0x0, cb=cb@entry=0x0, cb_arg=cb_arg@entry=0x0) at util/hist.c:1945
    #6  0x000055ad15d53110 in evsel__output_resort (evsel=evsel@entry=0x55ad1772bde0,
        prog=prog@entry=0x0) at util/hist.c:1950
    #7  0x000055ad15c6ae9a in perf_top__resort_hists (t=t@entry=0x7ffcd9cbf4f0) at builtin-top.c:311
    #8  0x000055ad15c6cc6d in perf_top__print_sym_table (top=0x7ffcd9cbf4f0) at builtin-top.c:346
    #9  display_thread (arg=0x7ffcd9cbf4f0) at builtin-top.c:700
    torvalds#10 0x00007f6282fab4fa in start_thread (arg=<optimized out>) at pthread_create.c:443
    torvalds#11 0x00007f628302e200 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

The reason is that strlen() get a pointer to a memory that has been freed.

The string pointer is stored in the structure comm_str, which corresponds
to a rb_tree node,when the node is erased, the memory of the string is also freed.

In thread__comm_len(),it gets the pointer within the thread->comm_lock critical section,
but passed to strlen() outside of the thread->comm_lock critical section, and the perf
process_thread may called comm__free() concurrently, cause this segfault problem.

The process is as follows:

display_thread                                  process_thread
--------------                                  --------------

thread__comm_len
  -> thread__comm_str
       # held the comm read lock
    -> __thread__comm_str(thread)
       # release the comm read lock
                                                thread__delete
                                                     # held the comm write lock
                                                  -> comm__free
                                                    -> comm_str__put(comm->comm_str)
                                                      -> zfree(&cs->str)
                                                     # release the comm write lock
      # The memory of the string pointed
        to by comm has been free.
    -> thread->comm_len = strlen(comm);

This patch expand the critical section range of thread->comm_lock in thread__comm_len(),
to make strlen() called safe.

Signed-off-by: Wenyu Liu <liuwenyu7@huawei.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Feilong Lin <linfeilong@huawei.com>
Cc: Hewenliang <hewenliang4@huawei.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Yunfeng Ye <yeyunfeng@huawei.com>
Link: https://lore.kernel.org/r/322bfb49-840b-f3b6-9ef1-f9ec3435b07e@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
jlelli pushed a commit that referenced this pull request May 8, 2023
I got a report of a msan failure like below:

  $ sudo perf lock con -ab -- sleep 1
  ...
  ==224416==WARNING: MemorySanitizer: use-of-uninitialized-value
      #0 0x5651160d6c96 in lock_contention_read  util/bpf_lock_contention.c:290:8
      #1 0x565115f90870 in __cmd_contention  builtin-lock.c:1919:3
      #2 0x565115f90870 in cmd_lock  builtin-lock.c:2385:8
      #3 0x565115f03a83 in run_builtin  perf.c:330:11
      #4 0x565115f03756 in handle_internal_command  perf.c:384:8
      #5 0x565115f02d53 in run_argv  perf.c:428:2
      #6 0x565115f02d53 in main  perf.c:562:3
      #7 0x7f43553bc632 in __libc_start_main
      #8 0x565115e865a9 in _start

It was because the 'key' variable is not initialized.  Actually it'd be set
by bpf_map_get_next_key() but msan didn't seem to understand it.  Let's make
msan happy by initializing the variable.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230324001922.937634-1-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
jlelli pushed a commit that referenced this pull request May 8, 2023
Seen in "perf stat --bpf-counters --for-each-cgroup test" running in a
container:

  libbpf: Failed to bump RLIMIT_MEMLOCK (err = -1), you might need to do it explicitly!
  libbpf: Error in bpf_object__probe_loading():Operation not permitted(1). Couldn't load trivial BPF program. Make sure your kernel supports BPF (CONFIG_BPF_SYSCALL=y) and/or that RLIMIT_MEMLOCK is set to big enough value.
  libbpf: failed to load object 'bperf_cgroup_bpf'
  libbpf: failed to load BPF skeleton 'bperf_cgroup_bpf': -1
  Failed to load cgroup skeleton

    #0 0x55f28a650981 in list_empty tools/include/linux/list.h:189
    #1 0x55f28a6593b4 in evsel__exit util/evsel.c:1518
    #2 0x55f28a6596af in evsel__delete util/evsel.c:1544
    #3 0x55f28a89d166 in bperf_cgrp__destroy util/bpf_counter_cgroup.c:283
    #4 0x55f28a899e9a in bpf_counter__destroy util/bpf_counter.c:816
    #5 0x55f28a659455 in evsel__exit util/evsel.c:1520
    #6 0x55f28a6596af in evsel__delete util/evsel.c:1544
    #7 0x55f28a640d4d in evlist__purge util/evlist.c:148
    #8 0x55f28a640ea6 in evlist__delete util/evlist.c:169
    #9 0x55f28a4efbf2 in cmd_stat tools/perf/builtin-stat.c:2598
    torvalds#10 0x55f28a6050c2 in run_builtin tools/perf/perf.c:330
    torvalds#11 0x55f28a605633 in handle_internal_command tools/perf/perf.c:384
    torvalds#12 0x55f28a6059fb in run_argv tools/perf/perf.c:428
    torvalds#13 0x55f28a6061d3 in main tools/perf/perf.c:562

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Florian Fischer <florian.fischer@muhq.space>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230410205659.3131608-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
jlelli pushed a commit that referenced this pull request May 8, 2023
…us union field

If bperf (perf tools that use BPF skels) sets evsel->leader_skel or
evsel->follower_skel then it appears that evsel->bpf_skel is set and can
trigger the following use-after-free:

==13575==ERROR: AddressSanitizer: heap-use-after-free on address 0x60c000014080 at pc 0x55684b939880 bp 0x7ffdfcf30d70 sp 0x7ffdfcf30d68
READ of size 8 at 0x60c000014080 thread T0
     #0 0x55684b93987f in sample_filter_bpf__destroy tools/perf/bpf_skel/sample_filter.skel.h:44:11
     #1 0x55684b93987f in perf_bpf_filter__destroy tools/perf/util/bpf-filter.c:155:2
     #2 0x55684b98f71e in evsel__exit tools/perf/util/evsel.c:1521:2
     #3 0x55684b98a352 in evsel__delete tools/perf/util/evsel.c:1547:2
     #4 0x55684b981918 in evlist__purge tools/perf/util/evlist.c:148:3
     #5 0x55684b981918 in evlist__delete tools/perf/util/evlist.c:169:2
     #6 0x55684b887d60 in cmd_stat tools/perf/builtin-stat.c:2598:2
..
0x60c000014080 is located 0 bytes inside of 128-byte region [0x60c000014080,0x60c000014100)
freed by thread T0 here:
     #0 0x55684b780e86 in free compiler-rt/lib/asan/asan_malloc_linux.cpp:52:3
     #1 0x55684b9462da in bperf_cgroup_bpf__destroy tools/perf/bpf_skel/bperf_cgroup.skel.h:61:2
     #2 0x55684b9462da in bperf_cgrp__destroy tools/perf/util/bpf_counter_cgroup.c:282:2
     #3 0x55684b944c75 in bpf_counter__destroy tools/perf/util/bpf_counter.c:819:2
     #4 0x55684b98f716 in evsel__exit tools/perf/util/evsel.c:1520:2
     #5 0x55684b98a352 in evsel__delete tools/perf/util/evsel.c:1547:2
     #6 0x55684b981918 in evlist__purge tools/perf/util/evlist.c:148:3
     #7 0x55684b981918 in evlist__delete tools/perf/util/evlist.c:169:2
     #8 0x55684b887d60 in cmd_stat tools/perf/builtin-stat.c:2598:2
...
previously allocated by thread T0 here:
     #0 0x55684b781338 in calloc compiler-rt/lib/asan/asan_malloc_linux.cpp:77:3
     #1 0x55684b944e25 in bperf_cgroup_bpf__open_opts tools/perf/bpf_skel/bperf_cgroup.skel.h:73:35
     #2 0x55684b944e25 in bperf_cgroup_bpf__open tools/perf/bpf_skel/bperf_cgroup.skel.h:97:9
     #3 0x55684b944e25 in bperf_load_program tools/perf/util/bpf_counter_cgroup.c:55:9
     #4 0x55684b944e25 in bperf_cgrp__load tools/perf/util/bpf_counter_cgroup.c:178:23
     #5 0x55684b889289 in __run_perf_stat tools/perf/builtin-stat.c:713:7
     #6 0x55684b889289 in run_perf_stat tools/perf/builtin-stat.c:949:8
     #7 0x55684b888029 in cmd_stat tools/perf/builtin-stat.c:2537:12

Resolve by clearing 'evsel->bpf_skel' as part of bpf_counter__destroy().

Suggested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bpf@vger.kernel.org
Link: http://lore.kernel.org/lkml/20230411051718.267228-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
jlelli pushed a commit that referenced this pull request May 8, 2023
gpi_ch_init() doesn't lock the ctrl_lock mutex, so there is no need to
unlock it too. Instead the mutex is handled by the function
gpi_alloc_chan_resources(), which properly locks and unlocks the mutex.

=====================================
WARNING: bad unlock balance detected!
6.3.0-rc5-00253-g99792582ded1-dirty torvalds#15 Not tainted
-------------------------------------
kworker/u16:0/9 is trying to release lock (&gpii->ctrl_lock) at:
[<ffffb99d04e1284c>] gpi_alloc_chan_resources+0x108/0x5bc
but there are no more locks to release!

other info that might help us debug this:
6 locks held by kworker/u16:0/9:
 #0: ffff575740010938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x220/0x594
 #1: ffff80000809bdd0 (deferred_probe_work){+.+.}-{0:0}, at: process_one_work+0x220/0x594
 #2: ffff575740f2a0f8 (&dev->mutex){....}-{3:3}, at: __device_attach+0x38/0x188
 #3: ffff57574b5570f8 (&dev->mutex){....}-{3:3}, at: __device_attach+0x38/0x188
 #4: ffffb99d06a2f180 (of_dma_lock){+.+.}-{3:3}, at: of_dma_request_slave_channel+0x138/0x280
 #5: ffffb99d06a2ee20 (dma_list_mutex){+.+.}-{3:3}, at: dma_get_slave_channel+0x28/0x10c

stack backtrace:
CPU: 7 PID: 9 Comm: kworker/u16:0 Not tainted 6.3.0-rc5-00253-g99792582ded1-dirty torvalds#15
Hardware name: Google Pixel 3 (DT)
Workqueue: events_unbound deferred_probe_work_func
Call trace:
 dump_backtrace+0xa0/0xfc
 show_stack+0x18/0x24
 dump_stack_lvl+0x60/0xac
 dump_stack+0x18/0x24
 print_unlock_imbalance_bug+0x130/0x148
 lock_release+0x270/0x300
 __mutex_unlock_slowpath+0x48/0x2cc
 mutex_unlock+0x20/0x2c
 gpi_alloc_chan_resources+0x108/0x5bc
 dma_chan_get+0x84/0x188
 dma_get_slave_channel+0x5c/0x10c
 gpi_of_dma_xlate+0x110/0x1a0
 of_dma_request_slave_channel+0x174/0x280
 dma_request_chan+0x3c/0x2d4
 geni_i2c_probe+0x544/0x63c
 platform_probe+0x68/0xc4
 really_probe+0x148/0x2ac
 __driver_probe_device+0x78/0xe0
 driver_probe_device+0x3c/0x160
 __device_attach_driver+0xb8/0x138
 bus_for_each_drv+0x84/0xe0
 __device_attach+0x9c/0x188
 device_initial_probe+0x14/0x20
 bus_probe_device+0xac/0xb0
 device_add+0x60c/0x7d8
 of_device_add+0x44/0x60
 of_platform_device_create_pdata+0x90/0x124
 of_platform_bus_create+0x15c/0x3c8
 of_platform_populate+0x58/0xf8
 devm_of_platform_populate+0x58/0xbc
 geni_se_probe+0xf0/0x164
 platform_probe+0x68/0xc4
 really_probe+0x148/0x2ac
 __driver_probe_device+0x78/0xe0
 driver_probe_device+0x3c/0x160
 __device_attach_driver+0xb8/0x138
 bus_for_each_drv+0x84/0xe0
 __device_attach+0x9c/0x188
 device_initial_probe+0x14/0x20
 bus_probe_device+0xac/0xb0
 deferred_probe_work_func+0x8c/0xc8
 process_one_work+0x2bc/0x594
 worker_thread+0x228/0x438
 kthread+0x108/0x10c
 ret_from_fork+0x10/0x20

Fixes: 5d0c353 ("dmaengine: qcom: Add GPI dma driver")
Signed-off-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Link: https://lore.kernel.org/r/20230409233355.453741-1-dmitry.baryshkov@linaro.org
Signed-off-by: Vinod Koul <vkoul@kernel.org>
jlelli pushed a commit that referenced this pull request May 8, 2023
Hayes Wang says:

====================
r8152: fix 2.5G devices

v3:
For patch #2, modify the comment.

v2:
For patch #1, Remove inline for fc_pause_on_auto() and fc_pause_off_auto(),
and update the commit message.

For patch #2, define the magic value for OCP register 0xa424.

v1:
These patches are used to fix some issues of RTL8156.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
jlelli pushed a commit that referenced this pull request May 8, 2023
Sai Krishna says:

====================
octeontx2: Miscellaneous fixes

This patchset includes following fixes.

Patch #1 Fix for the race condition while updating APR table

Patch #2 Fix end bit position in NPC scan config

Patch #3 Fix depth of CAM, MEM table entries

Patch #4 Fix in increase the size of DMAC filter flows

Patch #5 Fix driver crash resulting from invalid interface type
information retrieved from firmware

Patch #6 Fix incorrect mask used while installing filters involving
fragmented packets

Patch #7 Fixes for NPC field hash extract w.r.t IPV6 hash reduction,
         IPV6 filed hash configuration.

Patch #8 Fix for NPC hardware parser configuration destination
         address hash, IPV6 endianness issues.

Patch #9 Fix for skipping mbox initialization for PFs disabled by firmware.

Patch torvalds#10 Fix disabling packet I/O in case of mailbox timeout.

Patch torvalds#11 Fix detaching LF resources in case of VF probe fail.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants