Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zfs dataset becomes stuck #8321

Closed
Machske opened this issue Jan 21, 2019 · 9 comments
Closed

zfs dataset becomes stuck #8321

Machske opened this issue Jan 21, 2019 · 9 comments
Labels
Status: Stale No recent activity for issue Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@Machske
Copy link

Machske commented Jan 21, 2019

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 16.04
Linux Kernel 4.15.0-43-generic
Architecture x86_64
ZFS Version 0.7.12-0york3~16.04
SPL Version 0.7.12-0york3~16.04
Cpu Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
Memory 128GB
Swap 12GB available - 0k used - not on zfs

Describe the problem you're observing

This server has 303 datasets in 1 zpool

# zpool status -v
  pool: zpool1
 state: ONLINE
  scan: scrub repaired 0B in 1h28m with 0 errors on Sun Jan 13 01:52:17 2019
config:

        NAME           STATE     READ WRITE CKSUM
        zpool1         ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            nvme0n1p4  ONLINE       0     0     0
            nvme1n1p4  ONLINE       0     0     0

One specific dataset became "stuck". The command "ls" hangs on that specific dataset, though other datasets appear to function correctly.

This issue appeared to have started after running a npm process (nodejs) which writes huge number of files.

Describe how to reproduce the problem

There is no specific way that we can do to reproduce. Running the same command after a reboot runs succesfully on the same dataset.

Include any warning/errors/backtraces from the system logs

Jan 21 13:12:42 server kernel: [1478288.311651]       Tainted: P           OE    4.15.0-43-generic #46~16.04.1-Ubuntu
Jan 21 13:12:42 server kernel: [1478288.311710] txg_quiesce     D    0  1224      2 0x80000000
Jan 21 13:12:42 server kernel: [1478288.311718]  __schedule+0x3d6/0x8b0
Jan 21 13:12:42 server kernel: [1478288.311727]  cv_wait_common+0x101/0x140 [spl]
Jan 21 13:12:42 server kernel: [1478288.311734]  __cv_wait+0x15/0x20 [spl]
Jan 21 13:12:42 server kernel: [1478288.311798]  ? txg_delay+0x170/0x170 [zfs]
Jan 21 13:12:42 server kernel: [1478288.311805]  kthread+0x105/0x140
Jan 21 13:12:42 server kernel: [1478288.311810]  ? kthread_destroy_worker+0x50/0x50
Jan 21 13:12:42 server kernel: [1478288.312428] INFO: task node:5142 blocked for more than 120 seconds.
Jan 21 13:12:42 server kernel: [1478288.312481] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 21 13:12:42 server kernel: [1478288.312511] Call Trace:
Jan 21 13:12:42 server kernel: [1478288.312515]  schedule+0x36/0x80
Jan 21 13:12:42 server kernel: [1478288.312519]  call_rwsem_down_read_failed+0x18/0x30
Jan 21 13:12:42 server kernel: [1478288.312522]  down_read+0x20/0x40
Jan 21 13:12:42 server kernel: [1478288.312526]  ? lookup_fast+0x1ab/0x300
Jan 21 13:12:42 server kernel: [1478288.312529]  ? path_init+0x1d9/0x310
Jan 21 13:12:42 server kernel: [1478288.312532]  filename_lookup+0xb8/0x1a0
Jan 21 13:12:42 server kernel: [1478288.312539]  ? strncpy_from_user+0x4d/0x170
Jan 21 13:12:42 server kernel: [1478288.312542]  ? user_path_at_empty+0x36/0x40
Jan 21 13:12:42 server kernel: [1478288.312546]  SYSC_newlstat+0x3d/0x70
Jan 21 13:12:42 server kernel: [1478288.312551]  ? syscall_trace_enter+0xca/0x2f0
Jan 21 13:12:42 server kernel: [1478288.312554]  do_syscall_64+0x73/0x130
Jan 21 13:12:42 server kernel: [1478288.312557] RIP: 0033:0x7fcdde6e70e5
Jan 21 13:12:42 server kernel: [1478288.312560] RAX: ffffffffffffffda RBX: 00007fcdceffd630 RCX: 00007fcdde6e70e5
Jan 21 13:12:42 server kernel: [1478288.312561] RBP: 00007fcdceffceb0 R08: 0000000000000000 R09: 0000000000004070
Jan 21 13:12:42 server kernel: [1478288.312563] R13: 0000000000000000 R14: 00007fcdceffd9c0 R15: 00000000040cbe80
Jan 21 13:12:42 server kernel: [1478288.312588]       Tainted: P           OE    4.15.0-43-generic #46~16.04.1-Ubuntu
Jan 21 13:12:42 server kernel: [1478288.312643] node            D    0  5143   1611 0x00000120
Jan 21 13:12:42 server kernel: [1478288.312646]  __schedule+0x3d6/0x8b0
Jan 21 13:12:42 server kernel: [1478288.312649]  schedule+0x36/0x80
Jan 21 13:12:42 server kernel: [1478288.312653]  call_rwsem_down_read_failed+0x18/0x30
Jan 21 13:12:42 server kernel: [1478288.312656]  down_read+0x20/0x40
Jan 21 13:12:42 server kernel: [1478288.312658]  ? lookup_fast+0x1ab/0x300
Jan 21 13:12:42 server kernel: [1478288.312661]  ? path_init+0x1d9/0x310
Jan 21 13:12:42 server kernel: [1478288.312664]  filename_lookup+0xb8/0x1a0
Jan 21 13:12:42 server kernel: [1478288.312668]  ? strncpy_from_user+0x4d/0x170
Jan 21 13:12:42 server kernel: [1478288.312671]  ? user_path_at_empty+0x36/0x40
Jan 21 13:12:42 server kernel: [1478288.312676]  ? dput+0x67/0x1f0
Jan 21 13:12:42 server kernel: [1478288.312679]  ? __secure_computing+0x3b/0xf0
Jan 21 13:12:42 server kernel: [1478288.312683]  SyS_newlstat+0xe/0x10
Jan 21 13:12:42 server kernel: [1478288.312686]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Jan 21 13:12:42 server kernel: [1478288.312687] RSP: 002b:00007fcdce7fbd88 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Jan 21 13:12:42 server kernel: [1478288.312689] RDX: 00007fcdce7fbdf0 RSI: 00007fcdce7fbdf0 RDI: 00000000049783b0
Jan 21 13:12:42 server kernel: [1478288.312691] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000007
Jan 21 13:12:42 server kernel: [1478288.312695] INFO: task node:5125 blocked for more than 120 seconds.
Jan 21 13:12:42 server kernel: [1478288.312747] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 21 13:12:42 server kernel: [1478288.312777] Call Trace:
Jan 21 13:12:42 server kernel: [1478288.312780]  schedule+0x36/0x80
Jan 21 13:12:42 server kernel: [1478288.312784]  call_rwsem_down_read_failed+0x18/0x30
Jan 21 13:12:42 server kernel: [1478288.312787]  down_read+0x20/0x40
Jan 21 13:12:42 server kernel: [1478288.312789]  ? lookup_fast+0x1ab/0x300
Jan 21 13:12:42 server kernel: [1478288.312792]  ? path_init+0x1d9/0x310
Jan 21 13:12:42 server kernel: [1478288.312796]  ? ___slab_alloc+0x341/0x4e0
Jan 21 13:12:42 server kernel: [1478288.312799]  filename_lookup+0xb8/0x1a0
Jan 21 13:12:42 server kernel: [1478288.312803]  ? strncpy_from_user+0x4d/0x170
Jan 21 13:12:42 server kernel: [1478288.312806]  ? user_path_at_empty+0x36/0x40
Jan 21 13:12:42 server kernel: [1478288.312809]  SYSC_newlstat+0x3d/0x70
Jan 21 13:12:42 server kernel: [1478288.312812]  ? syscall_trace_enter+0xca/0x2f0
Jan 21 13:12:42 server kernel: [1478288.312815]  do_syscall_64+0x73/0x130
Jan 21 13:12:42 server kernel: [1478288.312818] RIP: 0033:0x7f8b1c6bc0e5
Jan 21 13:12:42 server kernel: [1478288.312820] RAX: ffffffffffffffda RBX: 00007f8b1a5cc630 RCX: 00007f8b1c6bc0e5
Jan 21 13:12:42 server kernel: [1478288.312821] RBP: 00007f8b1a5cbeb0 R08: 0000000000000000 R09: 000000000000326d
Jan 21 13:12:42 server kernel: [1478288.312823] R13: 0000000000000000 R14: 00007f8b1a5cc9c0 R15: 0000000002f66ef0
Jan 21 13:12:42 server kernel: [1478288.312848]       Tainted: P           OE    4.15.0-43-generic #46~16.04.1-Ubuntu
Jan 21 13:12:42 server kernel: [1478288.312904] node            D    0  5128   1611 0x00000120
Jan 21 13:12:42 server kernel: [1478288.312908]  __schedule+0x3d6/0x8b0
Jan 21 13:12:42 server kernel: [1478288.312911]  rwsem_down_read_failed+0x10a/0x170
Jan 21 13:12:42 server kernel: [1478288.312914]  ? call_rwsem_down_read_failed+0x18/0x30
Jan 21 13:12:42 server kernel: [1478288.312917]  lookup_slow+0x60/0x170
Jan 21 13:12:42 server kernel: [1478288.312920]  walk_component+0x1c5/0x360
Jan 21 13:12:42 server kernel: [1478288.312922]  path_lookupat+0x76/0x230
Jan 21 13:12:42 server kernel: [1478288.312926]  ? __check_object_size+0xfc/0x1a0
Jan 21 13:12:42 server kernel: [1478288.312929]  user_path_at_empty+0x36/0x40
Jan 21 13:12:42 server kernel: [1478288.312932]  vfs_statx+0x77/0xe0
Jan 21 13:12:42 server kernel: [1478288.312935]  ? __secure_computing+0x3b/0xf0
Jan 21 13:12:42 server kernel: [1478288.312939]  SyS_newlstat+0xe/0x10
Jan 21 13:12:42 server kernel: [1478288.312942]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Jan 21 13:12:42 server kernel: [1478288.312943] RSP: 002b:00007f8b195c9d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Jan 21 13:12:42 server kernel: [1478288.312945] RDX: 00007f8b195c9df0 RSI: 00007f8b195c9df0 RDI: 0000000002efcda0
Jan 21 13:12:42 server kernel: [1478288.312946] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000007
Jan 21 13:12:42 server kernel: [1478288.312950] INFO: task node:5148 blocked for more than 120 seconds.
Jan 21 13:12:42 server kernel: [1478288.312973]       Tainted: P           OE    4.15.0-43-generic #46~16.04.1-Ubuntu
Jan 21 13:12:42 server kernel: [1478288.313000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 21 13:12:42 server kernel: [1478288.313027] node            D    0  5148   1611 0x00000120
Jan 21 13:12:42 server kernel: [1478288.313029] Call Trace:
Jan 21 13:12:42 server kernel: [1478288.313031]  __schedule+0x3d6/0x8b0
Jan 21 13:12:42 server kernel: [1478288.313033]  schedule+0x36/0x80
Jan 21 13:12:42 server kernel: [1478288.313034]  rwsem_down_read_failed+0x10a/0x170
Jan 21 13:12:42 server kernel: [1478288.313036]  call_rwsem_down_read_failed+0x18/0x30
Jan 21 13:12:42 server kernel: [1478288.313038]  ? call_rwsem_down_read_failed+0x18/0x30
Jan 21 13:12:42 server kernel: [1478288.313039]  down_read+0x20/0x40
Jan 21 13:12:42 server kernel: [1478288.313040]  lookup_slow+0x60/0x170
Jan 21 13:12:42 server kernel: [1478288.313043]  walk_component+0x1c5/0x360
Jan 21 13:12:42 server kernel: [1478288.313046]  path_lookupat+0x76/0x230
Jan 21 13:12:42 server kernel: [1478288.313049]  ? __check_object_size+0xfc/0x1a0
Jan 21 13:12:42 server kernel: [1478288.313053]  user_path_at_empty+0x36/0x40
Jan 21 13:12:42 server kernel: [1478288.313056]  vfs_statx+0x77/0xe0
Jan 21 13:12:42 server kernel: [1478288.313059]  ? __secure_computing+0x3b/0xf0
Jan 21 13:12:42 server kernel: [1478288.313062]  SyS_newlstat+0xe/0x10
Jan 21 13:12:42 server kernel: [1478288.313065]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Jan 21 13:12:42 server kernel: [1478288.313066] RSP: 002b:00007f0581ba4d88 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Jan 21 13:12:42 server kernel: [1478288.313068] RDX: 00007f0581ba4df0 RSI: 00007f0581ba4df0 RDI: 00000000041af5d0
Jan 21 13:12:42 server kernel: [1478288.313070] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000007
Jan 21 13:12:42 server kernel: [1478288.313073] INFO: task node:5153 blocked for more than 120 seconds.
Jan 21 13:12:42 server kernel: [1478288.313122] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 21 13:12:42 server kernel: [1478288.313152] Call Trace:
Jan 21 13:12:42 server kernel: [1478288.313155]  schedule+0x36/0x80
Jan 21 13:12:42 server kernel: [1478288.313159]  call_rwsem_down_read_failed+0x18/0x30
Jan 21 13:12:42 server kernel: [1478288.313161]  down_read+0x20/0x40
Jan 21 13:12:42 server kernel: [1478288.313164]  ? lookup_fast+0x1ab/0x300
Jan 21 13:12:42 server kernel: [1478288.313167]  ? path_init+0x1d9/0x310
Jan 21 13:12:42 server kernel: [1478288.313170]  filename_lookup+0xb8/0x1a0
Jan 21 13:12:42 server kernel: [1478288.313174]  ? strncpy_from_user+0x4d/0x170
Jan 21 13:12:42 server kernel: [1478288.313176]  ? user_path_at_empty+0x36/0x40
Jan 21 13:12:42 server kernel: [1478288.313180]  SYSC_newlstat+0x3d/0x70
Jan 21 13:12:42 server kernel: [1478288.313182]  ? syscall_trace_enter+0xca/0x2f0
Jan 21 13:12:42 server kernel: [1478288.313185]  do_syscall_64+0x73/0x130
Jan 21 13:12:42 server kernel: [1478288.313188] RIP: 0033:0x7f05922e50e5
Jan 21 13:12:42 server kernel: [1478288.313190] RAX: ffffffffffffffda RBX: 00007f05837fe630 RCX: 00007f05922e50e5
Jan 21 13:12:42 server kernel: [1478288.313191] RBP: 00007f05837fdeb0 R08: 0000000000000000 R09: 00000000000036cb
Jan 21 13:12:42 server kernel: [1478288.313193] R13: 0000000000000000 R14: 00007f05837fe9c0 R15: 000000000361fe20
Jan 21 13:12:42 server kernel: [1478288.313218]       Tainted: P           OE    4.15.0-43-generic #46~16.04.1-Ubuntu
Jan 21 13:12:42 server kernel: [1478288.313273] node            D    0  5154   1611 0x00000120
Jan 21 13:12:42 server kernel: [1478288.313276]  __schedule+0x3d6/0x8b0
Jan 21 13:12:42 server kernel: [1478288.313278]  ? mutex_lock+0x12/0x40
Jan 21 13:12:42 server kernel: [1478288.313281]  rwsem_down_read_failed+0x10a/0x170
Jan 21 13:12:42 server kernel: [1478288.313285]  ? call_rwsem_down_read_failed+0x18/0x30
Jan 21 13:12:42 server kernel: [1478288.313288]  lookup_slow+0x60/0x170
Jan 21 13:12:42 server kernel: [1478288.313290]  walk_component+0x1c5/0x360
Jan 21 13:12:42 server kernel: [1478288.313293]  path_lookupat+0x76/0x230
Jan 21 13:12:42 server kernel: [1478288.313297]  ? __check_object_size+0xfc/0x1a0
Jan 21 13:12:42 server kernel: [1478288.313300]  user_path_at_empty+0x36/0x40
Jan 21 13:12:42 server kernel: [1478288.313303]  vfs_statx+0x77/0xe0
Jan 21 13:12:42 server kernel: [1478288.313306]  ? __secure_computing+0x3b/0xf0
Jan 21 13:12:42 server kernel: [1478288.313309]  SyS_newlstat+0xe/0x10
Jan 21 13:12:42 server kernel: [1478288.313312]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Jan 21 13:12:42 server kernel: [1478288.313313] RSP: 002b:00007f0582ffcd88 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Jan 21 13:12:42 server kernel: [1478288.313315] RDX: 00007f0582ffcdf0 RSI: 00007f0582ffcdf0 RDI: 00000000036bcbc0
Jan 21 13:12:42 server kernel: [1478288.313317] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000007
Jan 21 13:12:42 server kernel: [1478288.313413]  rwsem_down_read_failed+0x10a/0x170
Jan 21 13:12:42 server kernel: [1478288.313416]  call_rwsem_down_read_failed+0x18/0x30
Jan 21 13:12:42 server kernel: [1478288.313417]  ? call_rwsem_down_read_failed+0x18/0x30
Jan 21 13:12:42 server kernel: [1478288.313418]  down_read+0x20/0x40
Jan 21 13:12:42 server kernel: [1478288.313420]  lookup_slow+0x60/0x170
Jan 21 13:12:42 server kernel: [1478288.313421]  ? lookup_fast+0x1ab/0x300
Jan 21 13:12:42 server kernel: [1478288.313423]  walk_component+0x1c5/0x360
Jan 21 13:12:42 server kernel: [1478288.313424]  ? path_init+0x1d9/0x310
Jan 21 13:12:42 server kernel: [1478288.313425]  path_lookupat+0x76/0x230
Jan 21 13:12:42 server kernel: [1478288.313427]  filename_lookup+0xb8/0x1a0
Jan 21 13:12:42 server kernel: [1478288.313429]  ? __check_object_size+0xfc/0x1a0
Jan 21 13:12:42 server kernel: [1478288.313431]  ? strncpy_from_user+0x4d/0x170
Jan 21 13:12:42 server kernel: [1478288.313432]  user_path_at_empty+0x36/0x40
Jan 21 13:12:42 server kernel: [1478288.313434]  ? user_path_at_empty+0x36/0x40
Jan 21 13:12:42 server kernel: [1478288.313435]  vfs_statx+0x77/0xe0
Jan 21 13:12:42 server kernel: [1478288.313437]  SYSC_newlstat+0x3d/0x70
Jan 21 13:12:42 server kernel: [1478288.313438]  ? __secure_computing+0x3b/0xf0
Jan 21 13:12:42 server kernel: [1478288.313440]  ? syscall_trace_enter+0xca/0x2f0
Jan 21 13:12:42 server kernel: [1478288.313441]  SyS_newlstat+0xe/0x10
Jan 21 13:12:42 server kernel: [1478288.313443]  do_syscall_64+0x73/0x130
Jan 21 13:12:42 server kernel: [1478288.313445]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
Jan 21 13:12:42 server kernel: [1478288.313446] RIP: 0033:0x7fdb23fdf0e5
Jan 21 13:12:42 server kernel: [1478288.313446] RSP: 002b:00007fdb21eeed88 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Jan 21 13:12:42 server kernel: [1478288.313447] RAX: ffffffffffffffda RBX: 00007fdb21eef630 RCX: 00007fdb23fdf0e5
Jan 21 13:12:42 server kernel: [1478288.313448] RDX: 00007fdb21eeedf0 RSI: 00007fdb21eeedf0 RDI: 0000000003d07320
Jan 21 13:12:42 server kernel: [1478288.313449] RBP: 00007fdb21eeeeb0 R08: 0000000000000000 R09: 000000000000323c
Jan 21 13:12:42 server kernel: [1478288.313450] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000007
Jan 21 13:12:42 server kernel: [1478288.313450] R13: 0000000000000000 R14: 00007fdb21eef9c0 R15: 0000000003cc0b80
Jan 21 13:12:42 server kernel: [1478288.313452] INFO: task node:5180 blocked for more than 120 seconds.
Jan 21 13:12:42 server kernel: [1478288.313476]       Tainted: P           OE    4.15.0-43-generic #46~16.04.1-Ubuntu
Jan 21 13:12:42 server kernel: [1478288.313503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 21 13:12:42 server kernel: [1478288.313531] node            D    0  5180   1611 0x00000120
Jan 21 13:12:42 server kernel: [1478288.313532] Call Trace:
Jan 21 13:12:42 server kernel: [1478288.313534]  __schedule+0x3d6/0x8b0
Jan 21 13:12:42 server kernel: [1478288.313535]  schedule+0x36/0x80
Jan 21 13:12:42 server kernel: [1478288.313536]  rwsem_down_read_failed+0x10a/0x170
Jan 21 13:12:42 server kernel: [1478288.313538]  call_rwsem_down_read_failed+0x18/0x30
Jan 21 13:12:42 server kernel: [1478288.313540]  ? call_rwsem_down_read_failed+0x18/0x30
Jan 21 13:12:42 server kernel: [1478288.313541]  down_read+0x20/0x40
Jan 21 13:12:42 server kernel: [1478288.313543]  lookup_slow+0x60/0x170
Jan 21 13:12:42 server kernel: [1478288.313544]  ? lookup_fast+0x1ab/0x300
Jan 21 13:12:42 server kernel: [1478288.313545]  walk_component+0x1c5/0x360
Jan 21 13:12:42 server kernel: [1478288.313546]  ? path_init+0x1d9/0x310
Jan 21 13:12:42 server kernel: [1478288.313548]  path_lookupat+0x76/0x230
Jan 21 13:12:42 server kernel: [1478288.313549]  filename_lookup+0xb8/0x1a0
Jan 21 13:12:42 server kernel: [1478288.313551]  ? __check_object_size+0xfc/0x1a0
Jan 21 13:12:42 server kernel: [1478288.313553]  ? strncpy_from_user+0x4d/0x170
Jan 21 13:12:42 server kernel: [1478288.313555]  user_path_at_empty+0x36/0x40
Jan 21 13:12:42 server kernel: [1478288.313556]  ? user_path_at_empty+0x36/0x40
Jan 21 13:12:42 server kernel: [1478288.313558]  vfs_statx+0x77/0xe0
Jan 21 13:12:42 server kernel: [1478288.313559]  SYSC_newlstat+0x3d/0x70
Jan 21 13:12:42 server kernel: [1478288.313561]  ? __secure_computing+0x3b/0xf0
Jan 21 13:12:42 server kernel: [1478288.313562]  ? syscall_trace_enter+0xca/0x2f0
Jan 21 13:12:42 server kernel: [1478288.313564]  SyS_newlstat+0xe/0x10
Jan 21 13:12:42 server kernel: [1478288.313565]  do_syscall_64+0x73/0x130
Jan 21 13:12:42 server kernel: [1478288.313568] RIP: 0033:0x7fdb23fdf0e5
Jan 21 13:12:42 server kernel: [1478288.313570] RAX: ffffffffffffffda RBX: 00007fdb216ee630 RCX: 00007fdb23fdf0e5
Jan 21 13:12:42 server kernel: [1478288.313571] RBP: 00007fdb216edeb0 R08: 0000000000000000 R09: 0000000000003249
Jan 21 13:12:42 server kernel: [1478288.313572] R13: 0000000000000000 R14: 00007fdb216ee9c0 R15: 000000000415d1d0
@justenwalker
Copy link

justenwalker commented Mar 26, 2019

I think I'm getting the same issue; We see it sporadically happen when running Spark inside Docker containers. We don't have a way to reproduce it intentionally either; it just seems to happen, and then writes to the zpool stop or come to a trickle.

Update 1

For some added information. It seems like it might have a correlation with Docker + ZFS. (I'm running Docker CE 18.09.2). I was running ioping -W on the disk to see what the read/write latency was and it was hanging (Ctrl+C/Ctrl+Z did nothing). I SSH'd onto the machine from a different window and ran a docker kill on the only running container and the hung processes started working. iopings after the docker container was killed resumed.

Here' is a zpool history of the operations it performed

2019-03-26.20:19:09 zfs clone -p -o mountpoint=legacy datadrive/docker/fac1563632f26f74bc2d12a67ffba679565037b04b1b99a48b6b035d55124cbd@573872026 datadrive/docker/25a7aeaae73e2d888cace5823473f1a07f68004445a1c914d9f2aec59236ef92-init
2019-03-26.20:19:10 zfs destroy -d datadrive/docker/fac1563632f26f74bc2d12a67ffba679565037b04b1b99a48b6b035d55124cbd@573872026
2019-03-26.20:19:10 zfs snapshot datadrive/docker/25a7aeaae73e2d888cace5823473f1a07f68004445a1c914d9f2aec59236ef92-init@254035669
2019-03-26.20:19:11 zfs clone -p -o mountpoint=legacy datadrive/docker/25a7aeaae73e2d888cace5823473f1a07f68004445a1c914d9f2aec59236ef92-init@254035669 datadrive/docker/25a7aeaae73e2d888cace5823473f1a07f68004445a1c914d9f2aec59236ef92
2019-03-26.20:19:11 zfs destroy -d datadrive/docker/25a7aeaae73e2d888cace5823473f1a07f68004445a1c914d9f2aec59236ef92-init@254035669
2019-03-26.20:47:59 zfs destroy -r datadrive/docker/25a7aeaae73e2d888cace5823473f1a07f68004445a1c914d9f2aec59236ef92
2019-03-26.20:48:04 zfs destroy -r datadrive/docker/25a7aeaae73e2d888cace5823473f1a07f68004445a1c914d9f2aec59236ef92-init
2019-03-26.20:48:17 zfs snapshot datadrive/docker/fac1563632f26f74bc2d12a67ffba679565037b04b1b99a48b6b035d55124cbd@254127386
2019-03-26.20:48:18 zfs clone -p -o mountpoint=legacy datadrive/docker/fac1563632f26f74bc2d12a67ffba679565037b04b1b99a48b6b035d55124cbd@254127386 datadrive/docker/383a0bfe7cd5e3a15a1d760bb001fb83a486ba218faee59beacbbbd596166cba-init
2019-03-26.20:48:18 zfs destroy -d datadrive/docker/fac1563632f26f74bc2d12a67ffba679565037b04b1b99a48b6b035d55124cbd@254127386
2019-03-26.20:48:19 zfs snapshot datadrive/docker/383a0bfe7cd5e3a15a1d760bb001fb83a486ba218faee59beacbbbd596166cba-init@68293172
2019-03-26.20:48:19 zfs clone -p -o mountpoint=legacy datadrive/docker/383a0bfe7cd5e3a15a1d760bb001fb83a486ba218faee59beacbbbd596166cba-init@68293172 datadrive/docker/383a0bfe7cd5e3a15a1d760bb001fb83a486ba218faee59beacbbbd596166cba
2019-03-26.20:48:22 zfs destroy -d datadrive/docker/383a0bfe7cd5e3a15a1d760bb001fb83a486ba218faee59beacbbbd596166cba-init@68293172
2019-03-26.20:48:40 zfs destroy -r datadrive/docker/383a0bfe7cd5e3a15a1d760bb001fb83a486ba218faee59beacbbbd596166cba
2019-03-26.20:48:45 zfs destroy -r datadrive/docker/383a0bfe7cd5e3a15a1d760bb001fb83a486ba218faee59beacbbbd596166cba-init
2019-03-26.20:51:41 zfs destroy -r datadrive/docker/fac1563632f26f74bc2d12a67ffba679565037b04b1b99a48b6b035d55124cbd
2019-03-26.20:51:41 zfs destroy -r datadrive/docker/b67fa0121422cf5c8562f589c48ec9677fbdfaf7e67f2d73604a45fc0d366e7b
2019-03-26.20:51:42 zfs destroy -r datadrive/docker/297c07b5f3cfd39638e3a921757793bb5723fce9234e23dbfbb196f53e3cc9c2
2019-03-26.20:51:47 zfs destroy -r datadrive/docker/91f78d3bfdc9a966729428a83b4b52491001cf4c000324e724533405179b3791

@Machske Are you running nodejs inside of a docker container?

Logs and other Info

Type Version/Name
Distribution Name Ubuntu
Distribution Version 16.04.5 LTS
Linux Kernel 4.15.0-1037-azure
Architecture x86_64
ZFS Version 0.7.5-1ubuntu16.4
SPL Version 0.7.5-1ubuntu1
Cpu Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz (16 cores)
Memory 64GB
Swap 0k
admin@server:~$ sudo zpool status -v                                                                                                                           [1/332]
  pool: datadrive
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        datadrive   ONLINE       0     0     0
          sdy       ONLINE       0     0     0
          sdf       ONLINE       0     0     0
          sdab      ONLINE       0     0     0
          sdo       ONLINE       0     0     0
          sdw       ONLINE       0     0     0
          sdd       ONLINE       0     0     0
          sdm       ONLINE       0     0     0
          sdu       ONLINE       0     0     0
          sdk       ONLINE       0     0     0
          sds       ONLINE       0     0     0
          sdag      ONLINE       0     0     0
          sdi       ONLINE       0     0     0
          sdq       ONLINE       0     0     0
          sdae      ONLINE       0     0     0
          sdz       ONLINE       0     0     0
          sdg       ONLINE       0     0     0
          sdac      ONLINE       0     0     0
          sdx       ONLINE       0     0     0
          sde       ONLINE       0     0     0
          sdaa      ONLINE       0     0     0
          sdn       ONLINE       0     0     0
          sdv       ONLINE       0     0     0
          sdc       ONLINE       0     0     0
          sdl       ONLINE       0     0     0
          sdt       ONLINE       0     0     0
          sdah      ONLINE       0     0     0
          sdj       ONLINE       0     0     0
          sdr       ONLINE       0     0     0
          sdaf      ONLINE       0     0     0
          sdh       ONLINE       0     0     0
          sdp       ONLINE       0     0     0
          sdad      ONLINE       0     0     0
        cache
          sdb       ONLINE       0     0     0

errors: No known data errors
Mar 26 16:35:32 server kernel: [492512.220074] INFO: task txg_quiesce:4971 blocked for more than 120 seconds.
Mar 26 16:35:32 server kernel: [492512.227926]       Tainted: P           O     4.15.0-1037-azure #39~16.04.1-Ubuntu
Mar 26 16:35:32 server kernel: [492512.236109] ""echo 0 > /proc/sys/kernel/hung_task_timeout_secs"" disables this message.
Mar 26 16:35:32 server kernel: [492512.243629] txg_quiesce     D    0  4971      2 0x80000000
Mar 26 16:35:32 server kernel: [492512.243632] Call Trace:
Mar 26 16:35:32 server kernel: [492512.243641]  __schedule+0x3d6/0x8b0
Mar 26 16:35:32 server kernel: [492512.243643]  schedule+0x36/0x80
Mar 26 16:35:32 server kernel: [492512.243653]  cv_wait_common+0x101/0x140 [spl]
Mar 26 16:35:32 server kernel: [492512.243660]  ? wait_woken+0x80/0x80
Mar 26 16:35:32 server kernel: [492512.243669]  __cv_wait+0x15/0x20 [spl]
Mar 26 16:35:32 server kernel: [492512.243721]  txg_quiesce_thread+0x2dd/0x3f0 [zfs]
Mar 26 16:35:32 server kernel: [492512.243765]  ? txg_delay+0x170/0x170 [zfs]
Mar 26 16:35:32 server kernel: [492512.243773]  thread_generic_wrapper+0x74/0x90 [spl]
Mar 26 16:35:32 server kernel: [492512.243778]  kthread+0x105/0x140
Mar 26 16:35:32 server kernel: [492512.243783]  ? __thread_exit+0x20/0x20 [spl]
Mar 26 16:35:32 server kernel: [492512.243786]  ? kthread_destroy_worker+0x50/0x50
Mar 26 16:35:32 server kernel: [492512.243788]  ? kthread_destroy_worker+0x50/0x50
Mar 26 16:35:32 server kernel: [492512.243792]  ret_from_fork+0x35/0x40
Mar 26 16:35:32 server kernel: [492512.244178] INFO: task java:22483 blocked for more than 120 seconds.
Mar 26 16:35:32 server kernel: [492512.251063]       Tainted: P           O     4.15.0-1037-azure #39~16.04.1-Ubuntu
Mar 26 16:35:32 server kernel: [492512.262305] ""echo 0 > /proc/sys/kernel/hung_task_timeout_secs"" disables this message.
Mar 26 16:35:32 server kernel: [492512.268551] java            D    0 22483  22333 0x00000100
Mar 26 16:35:32 server kernel: [492512.268554] Call Trace:
Mar 26 16:35:32 server kernel: [492512.268558]  __schedule+0x3d6/0x8b0
Mar 26 16:35:32 server kernel: [492512.268562]  ? page_add_file_rmap+0x133/0x170
Mar 26 16:35:32 server kernel: [492512.268564]  schedule+0x36/0x80
Mar 26 16:35:32 server kernel: [492512.268566]  rwsem_down_read_failed+0x10a/0x170
Mar 26 16:35:32 server kernel: [492512.268570]  ? __handle_mm_fault+0xd5c/0x11e0
Mar 26 16:35:32 server kernel: [492512.268591]  call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.268593]  ? call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.268595]  down_read+0x20/0x40
Mar 26 16:35:32 server kernel: [492512.268622]  __do_page_fault+0x465/0x500
Mar 26 16:35:32 server kernel: [492512.268629]  ? SyS_futex+0x7f/0x180
Mar 26 16:35:32 server kernel: [492512.268634]  do_page_fault+0x2e/0xf0
Mar 26 16:35:32 server kernel: [492512.268639]  ? page_fault+0x2f/0x50
Mar 26 16:35:32 server kernel: [492512.268643]  page_fault+0x45/0x50
Mar 26 16:35:32 server kernel: [492512.268649] RIP: 0033:0x7fd0623456b0
Mar 26 16:35:32 server kernel: [492512.268651] RSP: 002b:00007fd033efbbe0 EFLAGS: 00010202
Mar 26 16:35:32 server kernel: [492512.268655] RAX: 0000000013eaf3ac RBX: 0000000000000001 RCX: 00007fd062b144c4
Mar 26 16:35:32 server kernel: [492512.268657] RDX: 0000000000000085 RSI: 0000000000000085 RDI: 00007fd04d56cc4e
Mar 26 16:35:32 server kernel: [492512.268659] RBP: 00007fd033efbc00 R08: 0000000000000000 R09: 0000000000000080
Mar 26 16:35:32 server kernel: [492512.268662] R10: 00007fd04d56cc20 R11: 00007fcfcc510434 R12: 00007fd033efbcd0
Mar 26 16:35:32 server kernel: [492512.268663] R13: 00007fd033efbd40 R14: 00007fd033efd7b8 R15: 00007fd033efbd98
Mar 26 16:35:32 server kernel: [492512.268666] INFO: task java:22486 blocked for more than 120 seconds.
Mar 26 16:35:32 server kernel: [492512.273841]       Tainted: P           O     4.15.0-1037-azure #39~16.04.1-Ubuntu
Mar 26 16:35:32 server kernel: [492512.279427] ""echo 0 > /proc/sys/kernel/hung_task_timeout_secs"" disables this message.
Mar 26 16:35:32 server kernel: [492512.285524] java            D    0 22486  22333 0x00000100
Mar 26 16:35:32 server kernel: [492512.285526] Call Trace:
Mar 26 16:35:32 server kernel: [492512.285529]  __schedule+0x3d6/0x8b0
Mar 26 16:35:32 server kernel: [492512.285531]  ? page_add_file_rmap+0x133/0x170
Mar 26 16:35:32 server kernel: [492512.285532]  schedule+0x36/0x80
Mar 26 16:35:32 server kernel: [492512.285534]  rwsem_down_read_failed+0x10a/0x170
Mar 26 16:35:32 server kernel: [492512.285536]  ? __handle_mm_fault+0xd5c/0x11e0
Mar 26 16:35:32 server kernel: [492512.285538]  call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.285540]  ? call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.285541]  down_read+0x20/0x40
Mar 26 16:35:32 server kernel: [492512.285543]  __do_page_fault+0x465/0x500
Mar 26 16:35:32 server kernel: [492512.285545]  do_page_fault+0x2e/0xf0
Mar 26 16:35:32 server kernel: [492512.285547]  ? page_fault+0x2f/0x50
Mar 26 16:35:32 server kernel: [492512.285549]  page_fault+0x45/0x50
Mar 26 16:35:32 server kernel: [492512.285550] RIP: 0033:0x7fd062a9a293
Mar 26 16:35:32 server kernel: [492512.285551] RSP: 002b:00007fd033bf8210 EFLAGS: 00010297
Mar 26 16:35:32 server kernel: [492512.285552] RAX: 00007fcfe001c0a0 RBX: 00007fcf3000e450 RCX: 00007fcfcc461f18
Mar 26 16:35:32 server kernel: [492512.285553] RDX: 00007fd062b69b58 RSI: 0000000000000012 RDI: 00007fcfe001c0a0
Mar 26 16:35:32 server kernel: [492512.285554] RBP: 00007fd033bf8220 R08: 00007fcf3000e710 R09: 00007fcfc8385638
Mar 26 16:35:32 server kernel: [492512.285555] R10: 00007fd033bfa0d8 R11: 00007fd033bfa0d8 R12: 00007fcf3000ea78
Mar 26 16:35:32 server kernel: [492512.285555] R13: 00007fd033bf8510 R14: 00007fcfcc461f18 R15: 00007fcfe001c0a0
Mar 26 16:35:32 server kernel: [492512.285557] INFO: task java:22488 blocked for more than 120 seconds.
Mar 26 16:35:32 server kernel: [492512.290424]       Tainted: P           O     4.15.0-1037-azure #39~16.04.1-Ubuntu
Mar 26 16:35:32 server kernel: [492512.296544] ""echo 0 > /proc/sys/kernel/hung_task_timeout_secs"" disables this message.
Mar 26 16:35:32 server kernel: [492512.302508] java            D    0 22488  22333 0x00000100
Mar 26 16:35:32 server kernel: [492512.302511] Call Trace:
Mar 26 16:35:32 server kernel: [492512.302514]  __schedule+0x3d6/0x8b0
Mar 26 16:35:32 server kernel: [492512.302536]  ? page_add_file_rmap+0x133/0x170
Mar 26 16:35:32 server kernel: [492512.302540]  schedule+0x36/0x80
Mar 26 16:35:32 server kernel: [492512.302544]  rwsem_down_read_failed+0x10a/0x170
Mar 26 16:35:32 server kernel: [492512.302546]  ? __handle_mm_fault+0xd5c/0x11e0
Mar 26 16:35:32 server kernel: [492512.302549]  call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.302551]  ? call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.302553]  down_read+0x20/0x40
Mar 26 16:35:32 server kernel: [492512.302556]  __do_page_fault+0x465/0x500
Mar 26 16:35:32 server kernel: [492512.302558]  ? SyS_futex+0x7f/0x180
Mar 26 16:35:32 server kernel: [492512.302583]  do_page_fault+0x2e/0xf0
Mar 26 16:35:32 server kernel: [492512.302608]  ? page_fault+0x2f/0x50
Mar 26 16:35:32 server kernel: [492512.302611]  page_fault+0x45/0x50
Mar 26 16:35:32 server kernel: [492512.302614] RIP: 0033:0x7fd062482ffd
Mar 26 16:35:32 server kernel: [492512.302615] RSP: 002b:00007fd0339f7600 EFLAGS: 00010202
Mar 26 16:35:32 server kernel: [492512.302619] RAX: 0000000000000002 RBX: 00007fd05c29f800 RCX: 00000000000000b7
Mar 26 16:35:32 server kernel: [492512.302621] RDX: 00007fcfc001c741 RSI: 0000000000000001 RDI: 0000000000000010
Mar 26 16:35:32 server kernel: [492512.302623] RBP: 00007fd0339f7690 R08: 0000000000000000 R09: 00007fd05c012ba0
Mar 26 16:35:32 server kernel: [492512.302625] R10: 0000000000000000 R11: 00007fcfc001c780 R12: 00007fd0339f7760
Mar 26 16:35:32 server kernel: [492512.302626] R13: 00007fd062f775c8 R14: 00000000000000b7 R15: 00007fd04b3660b0
Mar 26 16:35:32 server kernel: [492512.302631] INFO: task java:22490 blocked for more than 120 seconds.
Mar 26 16:35:32 server kernel: [492512.307529]       Tainted: P           O     4.15.0-1037-azure #39~16.04.1-Ubuntu
Mar 26 16:35:32 server kernel: [492512.312858] ""echo 0 > /proc/sys/kernel/hung_task_timeout_secs"" disables this message.
Mar 26 16:35:32 server kernel: [492512.318418] java            D    0 22490  22333 0x00000100
Mar 26 16:35:32 server kernel: [492512.318421] Call Trace:
Mar 26 16:35:32 server kernel: [492512.318424]  __schedule+0x3d6/0x8b0
Mar 26 16:35:32 server kernel: [492512.318426]  ? page_add_file_rmap+0x133/0x170
Mar 26 16:35:32 server kernel: [492512.318428]  schedule+0x36/0x80
Mar 26 16:35:32 server kernel: [492512.318430]  rwsem_down_read_failed+0x10a/0x170
Mar 26 16:35:32 server kernel: [492512.318432]  ? __handle_mm_fault+0xd5c/0x11e0
Mar 26 16:35:32 server kernel: [492512.318435]  call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.318436]  ? call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.318438]  down_read+0x20/0x40
Mar 26 16:35:32 server kernel: [492512.318440]  __do_page_fault+0x465/0x500
Mar 26 16:35:32 server kernel: [492512.318442]  ? SyS_futex+0x7f/0x180
Mar 26 16:35:32 server kernel: [492512.318444]  do_page_fault+0x2e/0xf0
Mar 26 16:35:32 server kernel: [492512.318447]  ? page_fault+0x2f/0x50
Mar 26 16:35:32 server kernel: [492512.318449]  page_fault+0x45/0x50
Mar 26 16:35:32 server kernel: [492512.318450] RIP: 0033:0x7fd0627f8360
Mar 26 16:35:32 server kernel: [492512.318451] RSP: 002b:00007fd0337f5c78 EFLAGS: 00010202
Mar 26 16:35:32 server kernel: [492512.318453] RAX: 0000000000000006 RBX: 00007fd0337f5d30 RCX: 00007fd05c2a4000
Mar 26 16:35:32 server kernel: [492512.318454] RDX: 00007fd0337f5cb0 RSI: 0000000100002548 RDI: 00007fd0337f5d50
Mar 26 16:35:32 server kernel: [492512.318455] RBP: 00007fd0337f5d00 R08: 00007fd05c2a4000 R09: 00007fd062b6cbd3
Mar 26 16:35:32 server kernel: [492512.318456] R10: 0000000100002548 R11: 0000000000000098 R12: 00007fd0337f5ca0
Mar 26 16:35:32 server kernel: [492512.318457] R13: 00007fd0337f5cb0 R14: 00007fd05c2a4000 R15: 00000001007e2ce0
Mar 26 16:35:32 server kernel: [492512.318460] INFO: task java:22493 blocked for more than 120 seconds.
Mar 26 16:35:32 server kernel: [492512.323228]       Tainted: P           O     4.15.0-1037-azure #39~16.04.1-Ubuntu
Mar 26 16:35:32 server kernel: [492512.328731] ""echo 0 > /proc/sys/kernel/hung_task_timeout_secs"" disables this message.
Mar 26 16:35:32 server kernel: [492512.334430] java            D    0 22493  22333 0x00000100
Mar 26 16:35:32 server kernel: [492512.334432] Call Trace:
Mar 26 16:35:32 server kernel: [492512.334435]  __schedule+0x3d6/0x8b0
Mar 26 16:35:32 server kernel: [492512.334437]  schedule+0x36/0x80
Mar 26 16:35:32 server kernel: [492512.334438]  rwsem_down_read_failed+0x10a/0x170
Mar 26 16:35:32 server kernel: [492512.334440]  call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.334442]  ? call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.334444]  down_read+0x20/0x40
Mar 26 16:35:32 server kernel: [492512.334446]  __do_page_fault+0x465/0x500
Mar 26 16:35:32 server kernel: [492512.334447]  ? SyS_futex+0x7f/0x180
Mar 26 16:35:32 server kernel: [492512.334449]  do_page_fault+0x2e/0xf0
Mar 26 16:35:32 server kernel: [492512.334451]  ? page_fault+0x2f/0x50
Mar 26 16:35:32 server kernel: [492512.334453]  page_fault+0x45/0x50
Mar 26 16:35:32 server kernel: [492512.334454] RIP: 0033:0x7fd062fc9709
Mar 26 16:35:32 server kernel: [492512.334454] RSP: 002b:00007fd0334f3c90 EFLAGS: 00010202
Mar 26 16:35:32 server kernel: [492512.334456] RAX: ffffffffffffff92 RBX: 0000000000078364 RCX: 00007fd062fc9709
Mar 26 16:35:32 server kernel: [492512.334457] RDX: 0000000000000001 RSI: 0000000000000089 RDI: 00007fd05c2ae054
Mar 26 16:35:32 server kernel: [492512.334457] RBP: 00007fd0334f3d50 R08: 00007fd05c2ae028 R09: 00000000ffffffff
Mar 26 16:35:32 server kernel: [492512.334458] R10: 00007fd0334f3cf0 R11: 0000000000000202 R12: 0000000000000001
Mar 26 16:35:32 server kernel: [492512.334459] R13: 00007fd0334f3cf0 R14: ffffffffffffff92 R15: 00007fd0334f3c00
Mar 26 16:35:32 server kernel: [492512.334461] INFO: task java:22523 blocked for more than 120 seconds.
Mar 26 16:35:32 server kernel: [492512.339352]       Tainted: P           O     4.15.0-1037-azure #39~16.04.1-Ubuntu
Mar 26 16:35:32 server kernel: [492512.344978] ""echo 0 > /proc/sys/kernel/hung_task_timeout_secs"" disables this message.
Mar 26 16:35:32 server kernel: [492512.351830] java            D    0 22523  22333 0x00000100
Mar 26 16:35:32 server kernel: [492512.351833] Call Trace:
Mar 26 16:35:32 server kernel: [492512.351837]  __schedule+0x3d6/0x8b0
Mar 26 16:35:32 server kernel: [492512.351839]  schedule+0x36/0x80
Mar 26 16:35:32 server kernel: [492512.351841]  rwsem_down_read_failed+0x10a/0x170
Mar 26 16:35:32 server kernel: [492512.351844]  call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.351846]  ? call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.351850]  down_read+0x20/0x40
Mar 26 16:35:32 server kernel: [492512.351858]  __do_page_fault+0x465/0x500
Mar 26 16:35:32 server kernel: [492512.351862]  ? SyS_futex+0x7f/0x180
Mar 26 16:35:32 server kernel: [492512.351864]  do_page_fault+0x2e/0xf0
Mar 26 16:35:32 server kernel: [492512.351867]  ? page_fault+0x2f/0x50
Mar 26 16:35:32 server kernel: [492512.351869]  page_fault+0x45/0x50
Mar 26 16:35:32 server kernel: [492512.351871] RIP: 0033:0x7fd062fc9709
Mar 26 16:35:32 server kernel: [492512.351872] RSP: 002b:00007fd0301d46d0 EFLAGS: 00010202
Mar 26 16:35:32 server kernel: [492512.351890] RAX: ffffffffffffff92 RBX: 0000000000078365 RCX: 00007fd062fc9709
Mar 26 16:35:32 server kernel: [492512.351897] RDX: 0000000000000001 RSI: 0000000000000089 RDI: 00007fcf54013a54
Mar 26 16:35:32 server kernel: [492512.351925] RBP: 00007fd0301d4790 R08: 00007fcf54013a28 R09: 00000000ffffffff
Mar 26 16:35:32 server kernel: [492512.351926] R10: 00007fd0301d4730 R11: 0000000000000202 R12: 0000000000000001
Mar 26 16:35:32 server kernel: [492512.351928] R13: 00007fd0301d4730 R14: ffffffffffffff92 R15: 00007fd0301d4700
Mar 26 16:35:32 server kernel: [492512.351936] INFO: task java:22550 blocked for more than 120 seconds.
Mar 26 16:35:32 server kernel: [492512.367072]       Tainted: P           O     4.15.0-1037-azure #39~16.04.1-Ubuntu
Mar 26 16:35:32 server kernel: [492512.375253] ""echo 0 > /proc/sys/kernel/hung_task_timeout_secs"" disables this message.
Mar 26 16:35:32 server kernel: [492512.383629] java            D    0 22550  22333 0x00000100
Mar 26 16:35:32 server kernel: [492512.383632] Call Trace:
Mar 26 16:35:32 server kernel: [492512.383637]  __schedule+0x3d6/0x8b0
Mar 26 16:35:32 server kernel: [492512.383640]  schedule+0x36/0x80
Mar 26 16:35:32 server kernel: [492512.383643]  io_schedule+0x16/0x40
Mar 26 16:35:32 server kernel: [492512.383646]  wait_on_page_bit+0xf9/0x140
Mar 26 16:35:32 server kernel: [492512.383650]  ? page_cache_tree_insert+0xe0/0xe0
Mar 26 16:35:32 server kernel: [492512.383654]  shrink_page_list+0x736/0xbe0
Mar 26 16:35:32 server kernel: [492512.383658]  shrink_inactive_list+0x2af/0x5f0
Mar 26 16:35:32 server kernel: [492512.383663]  shrink_node_memcg+0x367/0x7d0
Mar 26 16:35:32 server kernel: [492512.383671]  shrink_node+0xe1/0x310
Mar 26 16:35:32 server kernel: [492512.383673]  ? shrink_node+0xe1/0x310
Mar 26 16:35:32 server kernel: [492512.383676]  do_try_to_free_pages+0xee/0x360
Mar 26 16:35:32 server kernel: [492512.383678]  try_to_free_mem_cgroup_pages+0x10a/0x200
Mar 26 16:35:32 server kernel: [492512.383684]  try_charge+0x151/0x6d0
Mar 26 16:35:32 server kernel: [492512.383687]  ? _cond_resched+0x1a/0x50
Mar 26 16:35:32 server kernel: [492512.383691]  mem_cgroup_try_charge+0x8b/0x1b0
Mar 26 16:35:32 server kernel: [492512.383693]  __add_to_page_cache_locked+0x67/0x240
Mar 26 16:35:32 server kernel: [492512.383696]  add_to_page_cache_lru+0x4e/0xe0
Mar 26 16:35:32 server kernel: [492512.383703]  filemap_fault+0x35c/0x750
Mar 26 16:35:32 server kernel: [492512.383708]  ? filemap_map_pages+0x297/0x3f0
Mar 26 16:35:32 server kernel: [492512.383712]  __do_fault+0x24/0xe9
Mar 26 16:35:32 server kernel: [492512.383715]  __handle_mm_fault+0xd3c/0x11e0
Mar 26 16:35:32 server kernel: [492512.383722]  handle_mm_fault+0xcc/0x1c0
Mar 26 16:35:32 server kernel: [492512.383726]  __do_page_fault+0x260/0x500
Mar 26 16:35:32 server kernel: [492512.383730]  do_page_fault+0x2e/0xf0
Mar 26 16:35:32 server kernel: [492512.383734]  ? page_fault+0x2f/0x50
Mar 26 16:35:32 server kernel: [492512.383737]  page_fault+0x45/0x50
Mar 26 16:35:32 server kernel: [492512.383740] RIP: 0033:0x7fd0626d50a0
Mar 26 16:35:32 server kernel: [492512.383742] RSP: 002b:00007fd0288f5df8 EFLAGS: 00010203
Mar 26 16:35:32 server kernel: [492512.383746] RAX: 00000000000d0029 RBX: 00007fcf40001000 RCX: 0000000000000000
Mar 26 16:35:32 server kernel: [492512.383748] RDX: 0000000000000001 RSI: 00000000810cec28 RDI: 00007fd0288f6b30
Mar 26 16:35:32 server kernel: [492512.383750] RBP: 00007fd0288f5e40 R08: 0000000000000000 R09: 0000000000000000
Mar 26 16:35:32 server kernel: [492512.383752] R10: 00007fd05d172c10 R11: 00007fd062a58acb R12: 00007fcf38000a30
Mar 26 16:35:32 server kernel: [492512.383753] R13: 00007fcf38000a30 R14: 00007fcf400018f8 R15: 0000000000000000
Mar 26 16:35:32 server kernel: [492512.383757] INFO: task java:22553 blocked for more than 120 seconds.
Mar 26 16:35:32 server kernel: [492512.390706]       Tainted: P           O     4.15.0-1037-azure #39~16.04.1-Ubuntu
Mar 26 16:35:32 server kernel: [492512.402105] ""echo 0 > /proc/sys/kernel/hung_task_timeout_secs"" disables this message.
Mar 26 16:35:32 server kernel: [492512.407922] java            D    0 22553  22333 0x00000100
Mar 26 16:35:32 server kernel: [492512.407924] Call Trace:
Mar 26 16:35:32 server kernel: [492512.407928]  __schedule+0x3d6/0x8b0
Mar 26 16:35:32 server kernel: [492512.407931]  ? page_add_file_rmap+0x133/0x170
Mar 26 16:35:32 server kernel: [492512.407933]  schedule+0x36/0x80
Mar 26 16:35:32 server kernel: [492512.407935]  rwsem_down_read_failed+0x10a/0x170
Mar 26 16:35:32 server kernel: [492512.407937]  ? __handle_mm_fault+0xd5c/0x11e0
Mar 26 16:35:32 server kernel: [492512.407941]  call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.407943]  ? call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.407944]  down_read+0x20/0x40
Mar 26 16:35:32 server kernel: [492512.407947]  __do_page_fault+0x465/0x500
Mar 26 16:35:32 server kernel: [492512.407949]  do_page_fault+0x2e/0xf0
Mar 26 16:35:32 server kernel: [492512.407951]  ? page_fault+0x2f/0x50
Mar 26 16:35:32 server kernel: [492512.407953]  page_fault+0x45/0x50
Mar 26 16:35:32 server kernel: [492512.407954] RIP: 0033:0x7fd0624916f0
Mar 26 16:35:32 server kernel: [492512.407955] RSP: 002b:00007fd0282f1eb8 EFLAGS: 00010202
Mar 26 16:35:32 server kernel: [492512.407956] RAX: 0000000000000000 RBX: 00007fd0282f2510 RCX: 0000000000000011
Mar 26 16:35:32 server kernel: [492512.407957] RDX: 00007fcf88165000 RSI: 0000000000000011 RDI: 00007fd0282f2330
Mar 26 16:35:32 server kernel: [492512.407958] RBP: 00007fd0282f1f40 R08: 0000000000000000 R09: 00007fcf88165000
Mar 26 16:35:32 server kernel: [492512.407959] R10: 00007fce88437274 R11: 0000000000000002 R12: 00007fd0282f2330
Mar 26 16:35:32 server kernel: [492512.407960] R13: 0000000000000002 R14: 00007fce88437274 R15: 00007fcf88165000
Mar 26 16:35:32 server kernel: [492512.407965] INFO: task java:22555 blocked for more than 120 seconds.
Mar 26 16:35:32 server kernel: [492512.412849]       Tainted: P           O     4.15.0-1037-azure #39~16.04.1-Ubuntu
Mar 26 16:35:32 server kernel: [492512.419329] ""echo 0 > /proc/sys/kernel/hung_task_timeout_secs"" disables this message.
Mar 26 16:35:32 server kernel: [492512.425743] java            D    0 22555  22333 0x00000100
Mar 26 16:35:32 server kernel: [492512.425745] Call Trace:
Mar 26 16:35:32 server kernel: [492512.425748]  __schedule+0x3d6/0x8b0
Mar 26 16:35:32 server kernel: [492512.425750]  schedule+0x36/0x80
Mar 26 16:35:32 server kernel: [492512.425752]  rwsem_down_read_failed+0x10a/0x170
Mar 26 16:35:32 server kernel: [492512.425754]  call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.425756]  ? call_rwsem_down_read_failed+0x18/0x30
Mar 26 16:35:32 server kernel: [492512.425758]  down_read+0x20/0x40
Mar 26 16:35:32 server kernel: [492512.425760]  __do_page_fault+0x465/0x500
Mar 26 16:35:32 server kernel: [492512.425762]  do_page_fault+0x2e/0xf0
Mar 26 16:35:32 server kernel: [492512.425764]  ? page_fault+0x2f/0x50
Mar 26 16:35:32 server kernel: [492512.425766]  page_fault+0x45/0x50
Mar 26 16:35:32 server kernel: [492512.425767] RIP: 0033:0x7fd0306e5a02
Mar 26 16:35:32 server kernel: [492512.425770] RSP: 002b:00007fd021ffc5b0 EFLAGS: 00010217
Mar 26 16:35:32 server kernel: [492512.425773] RAX: 0000000000000000 RBX: 00000000000003e8 RCX: 00007fd0636fea13
Mar 26 16:35:32 server kernel: [492512.425774] RDX: 0000000000000000 RSI: 00007fcf24029000 RDI: 0000000000000000
Mar 26 16:35:32 server kernel: [492512.425775] RBP: 00007fd021ffc620 R08: 0000000000000002 R09: 00007fd021ffc5d0
Mar 26 16:35:32 server kernel: [492512.425775] R10: 00000000000003e8 R11: 0000000000000000 R12: 00007fcf24029000
Mar 26 16:35:32 server kernel: [492512.425778] R13: 0000000000002000 R14: 0000000000000133 R15: 00007fcf24042000

@behlendorf behlendorf added the Type: Defect Incorrect behavior (e.g. crash, hang) label Mar 26, 2019
@behlendorf
Copy link
Contributor

Thanks for commenting, from the posted stacks it looks like there may be some issue reclaiming memory. But there's unfortunately not enough information to really understand why that's the case. If you do encounter this again, it would be great if you could dump all of the blocked kernel stack traces for analysis. You can do this by running sysrq-w and grabbing the console log.

https://en.wikipedia.org/wiki/Magic_SysRq_key

@justenwalker
Copy link

justenwalker commented Mar 27, 2019

Thanks for the quick response @behlendorf. I've attached the stack traces here:

@Machske
Copy link
Author

Machske commented Mar 27, 2019

@justenwalker In regards to your question about nodejs in a container. Yes we are running nodejs procs in 1 or more containers.

@behlendorf
Copy link
Contributor

@justenwalker thanks. It looks to me like you're hitting issue #7939 which is resulting in the deadlock. This was fixed in in the master branch with 779a6c0 and that fix was backported for the 0.7.13 release. I'd suggest updating to 0.7.13 release if you can.

@justenwalker
Copy link

justenwalker commented Apr 3, 2019

Thanks @behlendorf - We use Ubuntu 16.04, so it looks like this package will not be available through official channels. Not sure we'd move to unofficial channels in order to get upgrade to 0.7.13 (as this isn't even in the latest ubuntu release)

Is there any way that we can work-around this bug beyond updating to the 0.7.13?

@dweeezil
Copy link
Contributor

dweeezil commented Apr 3, 2019

@justenwalker I'd like to point out both for your own information and for others that you can install the Ubuntu "LTS Enablement Stack" on a 16.04 system which mainly just installs a new kernel (on servers) and will give you the same version of ZoL that's used on 18.04. It's generally the very latest 0.7.X.

@justenwalker
Copy link

justenwalker commented Apr 3, 2019

Thanks @dweeezil

We're on 4.15.0-1037-azure kernel; which appears to be a kernel build having some Azure-specific features. I'm not sure what the effect would be of switching to the hwe variant; Will have to test this out. Gut feeling says there might be other unintended side-effects.

Update: tried installing the hwe kernel for 16.04. While it didn't seem to break anything, it also didn't install any newer version of ZFS.

@stale
Copy link

stale bot commented Aug 24, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 24, 2020
@stale stale bot closed this as completed Nov 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Stale No recent activity for issue Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests

5 participants
@behlendorf @justenwalker @dweeezil @Machske and others