
ZVOL ops hanging #6888

Closed
sempervictus opened this issue Nov 20, 2017 · 20 comments

@sempervictus
Contributor

sempervictus commented Nov 20, 2017

System information

Type Version/Name
Distribution Name Arch
Distribution Version N/A
Linux Kernel 4.9.62
Architecture x86-64
ZFS Version master + all patches and changes
SPL Version UNKNOWN

Describe the problem you're observing

While running dd if=ssd of=raidz2/zvol bs=64M, I'm getting consistent crashes and hangs that look like:

[ 1107.239763] INFO: task zvol:514 blocked for more than 120 seconds.
[ 1107.261785]       Tainted: P           OE   4.9.62 #1
[ 1107.274064] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1107.299938] zvol            D    0   514      2 0x00000000
[ 1107.299947]  ffffffff810a88e0 0000000000000000 ffff8817ec930000 ffff880ff76e0000
[ 1107.299956]  ffff880fffc96ac0 ffffc9000d73bae8 ffffffff81806d4a ffffffff813d2e03
[ 1107.299963]  ffff880fffc96ac0 ffffc9000d73bb10 ffff880ff76e0000 ffff881ff75f7a40
[ 1107.299975] Call Trace:
[ 1107.299990]  [<ffffffff810a88e0>] ? switched_to_idle+0x20/0x20
[ 1107.299999]  [<ffffffff81806d4a>] ? __schedule+0x24a/0x6e0
[ 1107.300004]  [<ffffffff813d2e03>] ? __list_add+0x33/0x60
[ 1107.300010]  [<ffffffff81807224>] schedule+0x44/0x90
[ 1107.300024]  [<ffffffffa01f8fd4>] cv_wait_common+0x144/0x160 [spl]
[ 1107.300030]  [<ffffffff810bc910>] ? prepare_to_wait_event+0x110/0x110
[ 1107.300043]  [<ffffffffa01f900f>] __cv_wait+0x1f/0x30 [spl]
[ 1107.300126]  [<ffffffffa07b0533>] txg_wait_open+0xb3/0x100 [zfs]
[ 1107.300185]  [<ffffffffa075b640>] dmu_tx_wait+0x380/0x390 [zfs]
[ 1107.300244]  [<ffffffffa075b833>] dmu_tx_assign+0x93/0x4b0 [zfs]
[ 1107.300294]  [<ffffffffa0819fdd>] zvol_write+0x14d/0x560 [zfs]
[ 1107.300303]  [<ffffffffa01f125c>] ? spl_kmem_free+0x2c/0x40 [spl]
[ 1107.300312]  [<ffffffffa01f4ca1>] taskq_thread+0x2b1/0x4d0 [spl]
[ 1107.300324]  [<ffffffff810a01e0>] ? wake_up_q+0x90/0x90
[ 1107.300333]  [<ffffffffa01f49f0>] ? task_done+0xa0/0xa0 [spl]
[ 1107.300340]  [<ffffffff81095e4d>] kthread+0xfd/0x120
[ 1107.300346]  [<ffffffff81095d50>] ? kthread_parkme+0x40/0x40
[ 1107.300352]  [<ffffffff8180c667>] ret_from_fork+0x37/0x50

This prevents the operation from completing and requires a full system reboot. I pulled all patches and changes, built straight from master, and am still seeing the issue.
This is a very common stack trace (or similar to many others I've seen over the years), and I'd love to never see it again... :)
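Roughly how I'm reproducing and watching it (a sketch with placeholder device and volume names, not my actual paths):

# Create a test zvol on the raidz2 pool and stream a raw SSD onto it:
zfs create -V 1T raidz2/testvol
dd if=/dev/sdX of=/dev/zvol/raidz2/testvol bs=64M status=progress &
# Watch the kernel log for hung-task warnings while the copy runs:
dmesg -w | grep -i 'blocked for more than'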

@sempervictus
Contributor Author

With #6294, this is still happening, albeit with a new stack trace:

[Mon Nov 20 22:04:02 2017] INFO: task txg_sync:8485 blocked for more than 120 seconds.
[Mon Nov 20 22:04:02 2017]       Tainted: P           OE   4.9.62 #1
[Mon Nov 20 22:04:02 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Nov 20 22:04:02 2017] txg_sync        D    0  8485      2 0x00000080
[Mon Nov 20 22:04:02 2017]  ffffffff810a88e0 0000000000000000 ffff880ff9308000 ffff881ff1e4c200
[Mon Nov 20 22:04:02 2017]  ffff8817ffc56ac0 ffffc9004601ba98 ffffffff81806d4a ffffffffa081616b
[Mon Nov 20 22:04:02 2017]  ffff8817ffc56ac0 ffff8817e9916708 ffff881ff1e4c200 ffff8817ffc56ac0
[Mon Nov 20 22:04:02 2017] Call Trace:
[Mon Nov 20 22:04:02 2017]  [<ffffffff810a88e0>] ? switched_to_idle+0x20/0x20
[Mon Nov 20 22:04:02 2017]  [<ffffffff81806d4a>] ? __schedule+0x24a/0x6e0
[Mon Nov 20 22:04:02 2017]  [<ffffffffa081616b>] ? zil_sync+0x2eb/0x480 [zfs]
[Mon Nov 20 22:04:02 2017]  [<ffffffff81807224>] schedule+0x44/0x90
[Mon Nov 20 22:04:02 2017]  [<ffffffff8180aa98>] schedule_timeout+0x268/0x480
[Mon Nov 20 22:04:02 2017]  [<ffffffffa0819394>] ? zio_issue_async+0x14/0x30 [zfs]
[Mon Nov 20 22:04:02 2017]  [<ffffffffa081eb83>] ? zio_nowait+0xb3/0x160 [zfs]
[Mon Nov 20 22:04:02 2017]  [<ffffffff810ecdc1>] ? ktime_get+0x71/0x150
[Mon Nov 20 22:04:02 2017]  [<ffffffff81806a84>] io_schedule_timeout+0xb4/0x130
[Mon Nov 20 22:04:02 2017]  [<ffffffffa024b091>] cv_wait_common+0xd1/0x170 [spl]
[Mon Nov 20 22:04:02 2017]  [<ffffffff810bc910>] ? prepare_to_wait_event+0x110/0x110
[Mon Nov 20 22:04:02 2017]  [<ffffffffa024b1b2>] __cv_wait_io+0x22/0x30 [spl]
[Mon Nov 20 22:04:02 2017]  [<ffffffffa081e448>] zio_wait+0x108/0x1d0 [zfs]
[Mon Nov 20 22:04:02 2017]  [<ffffffffa0790e2d>] dsl_pool_sync+0xbd/0x410 [zfs]
[Mon Nov 20 22:04:02 2017]  [<ffffffffa07b15fb>] spa_sync+0x43b/0xda0 [zfs]
[Mon Nov 20 22:04:02 2017]  [<ffffffff810bc95e>] ? autoremove_wake_function+0x4e/0x70
[Mon Nov 20 22:04:02 2017]  [<ffffffff810bc155>] ? __wake_up+0x55/0x70
[Mon Nov 20 22:04:02 2017]  [<ffffffffa07c50d3>] txg_sync_thread+0x2e3/0x4b0 [zfs]
[Mon Nov 20 22:04:02 2017]  [<ffffffffa07c4df0>] ? txg_delay+0x190/0x190 [zfs]
[Mon Nov 20 22:04:02 2017]  [<ffffffffa0245ae0>] ? __thread_exit+0x20/0x20 [spl]
[Mon Nov 20 22:04:02 2017]  [<ffffffffa0245b5a>] thread_generic_wrapper+0x7a/0x90 [spl]
[Mon Nov 20 22:04:02 2017]  [<ffffffff81095e4d>] kthread+0xfd/0x120
[Mon Nov 20 22:04:02 2017]  [<ffffffff81095d50>] ? kthread_parkme+0x40/0x40
[Mon Nov 20 22:04:02 2017]  [<ffffffff8180c667>] ret_from_fork+0x37/0x50

I'm pulling SPL dynamic threads (setting the parameter to 0) and deploying another from-master build. Annoying. Storage virtualization with block devices is such a useful function, and it's been a rolling nightmare on ZoL since 0.6.4.
I will note, though, that under this crash condition the IOs continue; I'm just wondering whether the ones in the hung tasks ever completed. That's a useful thing to know when dd'ing a 1T disk containing IDS data. With oflag=direct it doesn't crash, but it crawls at 10MB/s, which isn't viable either.
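For reference, roughly what I mean by pulling the dynamic threads (a sketch, assuming spl_taskq_thread_dynamic is the knob in question):

# Persistent setting, applied when the spl module loads:
echo "options spl spl_taskq_thread_dynamic=0" > /etc/modprobe.d/spl.conf
# Check the currently loaded value:
cat /sys/module/spl/parameters/spl_taskq_thread_dynamic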

@sempervictus
Contributor Author

Adding insult to injury, ZFS appears able to perform this same operation to a file in the ZPL. ZVOLs are definitely broken if the ZPL, a more complex construct with attributes and structures that don't exist in the bitwise blockdev world, can execute this operation just fine.

Ping @behlendorf: are there any plans to fix the zvol nightmare? It's been in varying states of "broken -> super-broken" for well over a year. IIRC @ryao was working on this for an employer, but I've not seen hide nor hair of him since that discussion, which makes me think that code won't be implemented, or won't be published if it is. Maybe it makes more sense to write a Linux-specific implementation from the ground up, given the threading and block-access differences, instead of the 'adapt Illumos code as well as we can' approach. We used to get >1GB/s on iSCSI before 0.6.4.8 or thereabouts (the zvol rewrite that removed half of the blockdev pipe); now we can't even use a zvol locally as a dd target...
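To illustrate the comparison (placeholder names again): the same copy into a ZPL file completes, while the copy onto a zvol hangs as shown above:

# File on a ZFS filesystem (ZPL) - completes:
dd if=/dev/sdX of=/raidz2/fs/test.img bs=64M status=progress
# Same data onto a zvol - hangs in dmu_tx_assign/txg_wait_open per the trace above:
dd if=/dev/sdX of=/dev/zvol/raidz2/testvol bs=64M status=progress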

@sempervictus
Contributor Author

Searching for "zvol" in the issue tracker produces pages of results - this has clearly been a major problem area for some time. Just a few recent examples of what looks like the same issue (ops blocking for over 2 minutes): #6890, #6736, #6330, #6265, and pages more going back to 2013. I figure the LLNL use case is Lustre, which accesses the DMU directly, so there's no direct priority for this issue. However, any company providing services on ZoL is not likely to want to publish the code that gives them their operational differentiator for the service they provide (which is the sort of thing I believe ryao was working on when last we spoke). As a result, we might be a bit stuck unless someone like Datto comes around, who seem more than willing to open-source their internal magic for interop and sanity in the ecosystem, or one of us finds the time to learn the Linux block IO subsystem well enough to implement these virtual block devices in an optimal/safe way for this OS.

@ryao
Contributor

ryao commented Nov 21, 2017

@sempervictus I am still around, although very backlogged. I have not hit this particular issue at work. Is it possible to get stack traces from all threads when this happens?

@h1z1

h1z1 commented Nov 21, 2017

Interesting - are you sure it's not hardware related? I've hit this, but only after a disk failed and after tweaking NCQ; increasing the queue depth would wake it up. I assumed it was a bug in the kernel somewhere.

@sempervictus
Contributor Author

sempervictus commented Nov 25, 2017

@ryao: so this is still happening using your 0.7.3 tree:

[Sat Nov 25 04:00:44 2017] INFO: task z_zvol:18882 blocked for more than 600 seconds.
[Sat Nov 25 04:00:44 2017]       Tainted: P           OE   4.9.64 #1
[Sat Nov 25 04:00:44 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sat Nov 25 04:00:44 2017] z_zvol          D    0 18882      2 0x00000080
[Sat Nov 25 04:00:44 2017]  ffffffff810a8890 0000000000000000 ffff8807e162a400 ffff8807eae44d00
[Sat Nov 25 04:00:44 2017]  ffff88201fd16ac0 ffffc90037c5bc48 ffffffff81806c3a 0000000000000282
[Sat Nov 25 04:00:44 2017]  ffff88201fd16ac0 0000000000000000 ffff8807eae44d00 000000000000006b
[Sat Nov 25 04:00:44 2017] Call Trace:
[Sat Nov 25 04:00:44 2017]  [<ffffffff810a8890>] ? switched_to_idle+0x20/0x20
[Sat Nov 25 04:00:44 2017]  [<ffffffff81806c3a>] ? __schedule+0x24a/0x6e0
[Sat Nov 25 04:00:44 2017]  [<ffffffff81807114>] schedule+0x44/0x90
[Sat Nov 25 04:00:44 2017]  [<ffffffffa0253edd>] taskq_wait_outstanding+0x8d/0xd0 [spl]
[Sat Nov 25 04:00:44 2017]  [<ffffffff810bc8c0>] ? prepare_to_wait_event+0x110/0x110
[Sat Nov 25 04:00:44 2017]  [<ffffffffa0e8bcb1>] zvol_task_cb+0x201/0x5d0 [zfs]
[Sat Nov 25 04:00:44 2017]  [<ffffffff810a8890>] ? switched_to_idle+0x20/0x20
[Sat Nov 25 04:00:44 2017]  [<ffffffffa0254ca1>] taskq_thread+0x2b1/0x4d0 [spl]
[Sat Nov 25 04:00:44 2017]  [<ffffffff810a0190>] ? wake_up_q+0x90/0x90
[Sat Nov 25 04:00:44 2017]  [<ffffffffa02549f0>] ? task_done+0xa0/0xa0 [spl]
[Sat Nov 25 04:00:44 2017]  [<ffffffff81095dfd>] kthread+0xfd/0x120
[Sat Nov 25 04:00:44 2017]  [<ffffffff81095d00>] ? kthread_parkme+0x40/0x40
[Sat Nov 25 04:00:44 2017]  [<ffffffff8180c567>] ret_from_fork+0x37/0x50

^^ Pages upon pages of that when sending from one pool to another via zfs send -Rvec and zfs recv -vFu. I see this all over the place when sending anything with zvols. Afterwards ZFS starts "hiccuping": transfer rates drop from 200MB/s to 50-100MB/s every few seconds, with nothing in between.
How would I pull stack traces from all threads? If it requires reading /dev/kmem, these are all grsec 4.9 systems (though I've seen very similar behavior at places running vanilla Ubuntu). Please pardon my ignorance; I should know this by now.
Cancelling the send results in some FUBAR state where the pool seems to want to sync TXGs forever, maxing out the underlying disks. This continues after a pool export and import, with the import taking ~10m. There's a fair hint of Murphy in the air...
The destination pool was created using your branch, completely clean, and took the send after a create, export, import cycle. Hell, even if I can buffer consumer IO via LIO/SCST loopback somehow, the breakage from send/recv is somewhat untenable.
I am also seeing that two of the underlying disks are lagging considerably - iostat and zpool iostat both show write latencies significantly above their peers (raidz2, so I can swap the pair if need be). I'm wondering if there's a bad HBA cable or something, as I've replaced all the members of the pool. Smartctl tests pass... Also, wouldn't that be the DMU telling me to screw off, not the ZVOLs logically atop it?
Thanks as always for any help; hoping your return to daylight heralds an actual solution to this ungodly mess.
EDIT: oh, also I created the pool with volmode=none, but it seems the recv inherits state from the sender, so that's ignored. I could try a system-wide option, but the OS is on ZFS, and that gets complicated if I have to set it every time we send/recv.
EDIT2: setting the volmode parameter systemwide has no effect; recv of zvols still causes this (destroyed and rebuilt the pool, on separate SATA links).
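On my own stack-trace question: if magic SysRq works on these boxes (grsec may restrict it, but it avoids /dev/kmem entirely), I'd guess something like this dumps every task's kernel stack - a sketch, not verified here:

# Enable SysRq if the kernel allows it:
echo 1 > /proc/sys/kernel/sysrq
# Dump all task stacks into the kernel ring buffer:
echo t > /proc/sysrq-trigger
# Save the result for the issue:
dmesg > /tmp/all-thread-stacks.txt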

@loli10K
Contributor

loli10K commented Nov 27, 2017

EDIT2: setting the volmode parameter systemwide has no effect; recv of zvols still causes this (destroyed and rebuilt the pool, on separate SATA links).

@sempervictus by "setting the volmode parameter systemwide has no effect" did you mean "it has no effect in this context" (i.e. it doesn't fix this particular issue) or "the volmode parameter doesn't work as it's supposed to"?

Assuming the "system-wide" option is the zvol_volmode module parameter: if you are receiving volmode from the sender, that setting is not going to take effect unless volmode=default (which is the only case where zvol_volmode is consulted, IIRC). You should be able to work around this by receiving with -o volmode=none.
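Something like this (a sketch with made-up pool names, assuming a 0.7.x userland that supports zfs receive -o):

# Override the received property so no device nodes are created on the destination:
zfs send -Rvec srcpool/vols@snap | zfs recv -vFu -o volmode=none dstpool/vols

# The module parameter is only consulted for datasets with volmode=default
# (3 should mean "none", if I recall the mapping correctly):
echo 3 > /sys/module/zfs/parameters/zvol_volmode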

If you found a bug using the volmode property we should probably open a separate issue.

@behlendorf
Contributor

behlendorf commented Nov 27, 2017

@sempervictus could you post the contents of /proc/spl/kstat/zfs/dmu_tx, /proc/spl/kstat/zfs/<pool>/dmu_tx_assign, and /proc/spl/taskq from a system which has hit this problem? That may help provide some insight into the issue. Are you able to reproduce this consistently just by running dd if=ssd of=raidz2/zvol bs=64M?
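Something along these lines should capture everything in one go (replace <pool> with the actual pool name):

# Collect the requested kstats into one file for attaching here:
{
  echo '=== dmu_tx ===';        cat /proc/spl/kstat/zfs/dmu_tx
  echo '=== dmu_tx_assign ==='; cat /proc/spl/kstat/zfs/<pool>/dmu_tx_assign
  echo '=== taskq ===';         cat /proc/spl/taskq
} > /tmp/zfs-hang-kstats.txt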

@behlendorf
Contributor

@sempervictus the stack from #6888 (comment) shows that, for some reason, there are IOs which aren't getting completed. When this happens to IOs issued by the txg_sync thread, things will stall exactly as you've described. In most cases this is due to a problem with flaky hardware, the drivers, or the SCSI layer. What I think we should do is add some additional statistics and logging to make it easier to determine exactly why the IO isn't completing.

@dweeezil
Contributor

dweeezil commented Nov 27, 2017

I've been loosely following this and other related threads, and I suppose it might be time to mention the little patch in dweeezil/zfs@b8f3110fb. That patch, along with a much shorter setting for zfs_deadman_synctime_ms, was a stopgap solution I developed to deal with "lost" completions. The salient point is that if a completion event is never delivered, the whole ZFS pipeline will eventually grind to a halt. My suggestion to anyone trying out this patch is to keep a close eye on zpool events, looking for pairs of "delay" events.
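For anyone trying it, roughly what I mean (a sketch; the 60-second deadman value here is just an illustration, pick what suits your setup):

# Shorten the deadman timer so stalled I/Os get flagged sooner (the default is much larger):
echo 60000 > /sys/module/zfs/parameters/zfs_deadman_synctime_ms
# Then watch the event stream for paired "delay" events while reproducing:
zpool events -f | grep -i delay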

@sempervictus
Contributor Author

@behlendorf: the faulty hardware thing is a bit tough to swallow, since the same ops work just fine on file targets. There's something in the zvol IO pipeline causing this.
@dweeezil: adding it to the test stack, thank you.

@sempervictus
Contributor Author

@dweeezil: no dice. I built from master with that commit (it needed a minor merge fix, since the line above the delta has changed since you made it) and got:

[Sat Dec 16 22:32:52 2017] INFO: task txg_sync:4191 blocked for more than 120 seconds.
[Sat Dec 16 22:32:52 2017]       Tainted: P           OE   4.9.68 #1
[Sat Dec 16 22:32:52 2017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sat Dec 16 22:32:52 2017] txg_sync        D    0  4191      2 0x00080080
[Sat Dec 16 22:32:52 2017]  ffffffff810a8b00 0000000000000000 ffff880c11a2d000 ffff880bfe468000
[Sat Dec 16 22:32:52 2017]  ffff880617d96b00 ffffc9002da77a98 ffffffff81834dcd ffff880bfd16b808
[Sat Dec 16 22:32:52 2017]  ffff880617d96b00 ffff88055b00f0d8 ffff880bfe468000 ffff880617d96b00
[Sat Dec 16 22:32:52 2017] Call Trace:
[Sat Dec 16 22:32:52 2017]  [<ffffffff810a8b00>] ? switched_to_idle+0x20/0x20
[Sat Dec 16 22:32:52 2017]  [<ffffffff81834dcd>] ? __schedule+0x24d/0x6e0
[Sat Dec 16 22:32:52 2017]  [<ffffffff818352a4>] schedule+0x44/0x90
[Sat Dec 16 22:32:52 2017]  [<ffffffff81838b18>] schedule_timeout+0x268/0x480
[Sat Dec 16 22:32:52 2017]  [<ffffffffa06d9ce4>] ? zio_issue_async+0x14/0x30 [zfs]
[Sat Dec 16 22:32:52 2017]  [<ffffffffa06df5b3>] ? zio_nowait+0xb3/0x160 [zfs]
[Sat Dec 16 22:32:52 2017]  [<ffffffff810ee041>] ? ktime_get+0x71/0x150
[Sat Dec 16 22:32:52 2017]  [<ffffffff81834b04>] io_schedule_timeout+0xb4/0x130
[Sat Dec 16 22:32:52 2017]  [<ffffffffa02de091>] cv_wait_common+0xd1/0x170 [spl]
[Sat Dec 16 22:32:52 2017]  [<ffffffff810bcba0>] ? prepare_to_wait_event+0x110/0x110
[Sat Dec 16 22:32:52 2017]  [<ffffffffa02de1b2>] __cv_wait_io+0x22/0x30 [spl]
[Sat Dec 16 22:32:52 2017]  [<ffffffffa06dee78>] zio_wait+0x108/0x1d0 [zfs]
[Sat Dec 16 22:32:52 2017]  [<ffffffffa064d55d>] dsl_pool_sync+0xbd/0x410 [zfs]
[Sat Dec 16 22:32:52 2017]  [<ffffffffa067105b>] spa_sync+0x43b/0xda0 [zfs]
[Sat Dec 16 22:32:52 2017]  [<ffffffff810bcbee>] ? autoremove_wake_function+0x4e/0x70
[Sat Dec 16 22:32:52 2017]  [<ffffffff81837364>] ? __mutex_unlock_slowpath+0x74/0xa0
[Sat Dec 16 22:32:52 2017]  [<ffffffffa0684b83>] txg_sync_thread+0x2e3/0x4b0 [zfs]
[Sat Dec 16 22:32:52 2017]  [<ffffffffa06848a0>] ? txg_delay+0x190/0x190 [zfs]
[Sat Dec 16 22:32:52 2017]  [<ffffffffa02d8ae0>] ? __thread_exit+0x20/0x20 [spl]
[Sat Dec 16 22:32:52 2017]  [<ffffffffa02d8b5a>] thread_generic_wrapper+0x7a/0x90 [spl]
[Sat Dec 16 22:32:52 2017]  [<ffffffff81095e0d>] kthread+0xfd/0x120
[Sat Dec 16 22:32:52 2017]  [<ffffffff81095d10>] ? kthread_parkme+0x40/0x40
[Sat Dec 16 22:32:52 2017]  [<ffffffff8183a6e7>] ret_from_fork+0x37/0x50
[Sat Dec 16 22:34:55 2017] INFO: task txg_sync:4191 blocked for more than 120 seconds.

while doing a dd to a zvol on a raidz2 with an SSD SLOG mirror.
No events were registered in zpool events -f.

@sempervictus
Contributor Author

sempervictus commented Dec 17, 2017 via email

@behlendorf behlendorf added this to the 0.8.0 milestone Dec 18, 2017
@behlendorf
Contributor

the faulty hardware thing is a bit tough to swallow

Yeah, I don't think I buy it either. Let me see about putting together a patch to properly extend the deadman logic so we can get some proper debugging regarding the hung I/Os.

@shodanshok
Contributor

@sempervictus does the problem persist when using direct I/O on the target zvol? For example: dd if=ssd of=raidz2/zvol bs=64M oflag=direct

@sempervictus
Contributor Author

sempervictus commented Jan 2, 2018 via email

@behlendorf
Contributor

Is this still an issue?

@sempervictus
Contributor Author

I haven't seen it in about six months, but we greatly reduced snapshot frequency across the board and moved to 4.14 as our base kernel.

@behlendorf
Contributor

Alright, well then let's close it out for now. If we're able to reproduce it we can always open a new issue.
