Kernel BUG while copying large amount of files #4608

Closed
baughj opened this issue May 7, 2016 · 8 comments

Comments

@baughj

baughj commented May 7, 2016

I am running Ubuntu 14.04 with ZFS 0.6.5.6 (0.6.5.6-1~trusty from the PPA) on kernel 3.13 (3.13.0-85). The machine has 16GB of RAM. The pool consists of eight 6TB drives in four mirrors, along with an SSD mirror for the ZIL and two SSDs for cache.

About halfway into a 4TB copy to this new pool, I hit the following bug:

May  7 00:44:47 alexandria kernel: [19412.636117] BUG: unable to handle kernel paging request at 00000000aaca2410
May  7 00:44:47 alexandria kernel: [19412.636566] IP: [<ffffffffa01b952c>] buf_hash_insert+0xbc/0x190 [zfs]
May  7 00:44:47 alexandria kernel: [19412.637044] PGD 0 
May  7 00:44:47 alexandria kernel: [19412.637503] Oops: 0000 [#1] SMP 
May  7 00:44:47 alexandria kernel: [19412.637936] Modules linked in: snd_hda_codec_realtek snd_hda_codec_hdmi eeepc_wmi asus_wmi ppdev sparse_keymap x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd serio_raw dm_multipath scsi_dh lpc_ich snd_hda_intel snd_hda_codec i915 snd_hwdep drm_kms_helper parport_pc mac_hid snd_pcm wmi snd_page_alloc shpchp video mei_me mei lp drm parport i2c_algo_bit snd_timer snd soundcore zfs(POX) zunicode(POX) zcommon(POX) znvpair(POX) spl(OX) zavl(POX) hid_generic usbhid hid raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 psmouse raid0 r8169 ahci mii libahci multipath linear
May  7 00:44:47 alexandria kernel: [19412.639617] CPU: 2 PID: 3636 Comm: z_wr_int_6 Tainted: P           OX 3.13.0-85-generic #129-Ubuntu
May  7 00:44:47 alexandria kernel: [19412.640186] Hardware name: ASUS All Series/H87M-PLUS, BIOS 0306 04/07/2013
May  7 00:44:47 alexandria kernel: [19412.640760] task: ffff8803f34dc800 ti: ffff8803f3484000 task.ti: ffff8803f3484000
May  7 00:44:47 alexandria kernel: [19412.641343] RIP: 0010:[<ffffffffa01b952c>]  [<ffffffffa01b952c>] buf_hash_insert+0xbc/0x190 [zfs]
May  7 00:44:47 alexandria kernel: [19412.641951] RSP: 0018:ffff8803f3485c20  EFLAGS: 00010202
May  7 00:44:47 alexandria kernel: [19412.642555] RAX: 00000000aaca2410 RBX: ffff88000f242a40 RCX: 0000000100000100
May  7 00:44:47 alexandria kernel: [19412.643174] RDX: 0000000000000002 RSI: ffff8802301a5000 RDI: ffffc900074af1d8
May  7 00:44:47 alexandria kernel: [19412.643776] RBP: ffff8803f3485c38 R08: ffff880404e8d800 R09: 0000000000000001
May  7 00:44:47 alexandria kernel: [19412.644380] R10: ffffea00019e6800 R11: ffffffffa010c75a R12: 0000000000017a3b
May  7 00:44:47 alexandria kernel: [19412.644996] R13: 0000000000068ed0 R14: ffff880158cc54e0 R15: 0000000000000000
May  7 00:44:47 alexandria kernel: [19412.645618] FS:  0000000000000000(0000) GS:ffff88041fb00000(0000) knlGS:0000000000000000
May  7 00:44:47 alexandria kernel: [19412.646232] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May  7 00:44:47 alexandria kernel: [19412.646849] CR2: 00000000aaca2410 CR3: 0000000001c0e000 CR4: 00000000001407e0
May  7 00:44:47 alexandria kernel: [19412.647468] Stack:
May  7 00:44:47 alexandria kernel: [19412.648114]  ffff880005559a40 ffff88000f242a40 ffff880172172240 ffff8803f3485c78
May  7 00:44:47 alexandria kernel: [19412.648764]  ffffffffa01bcb11 ffffffffa0316050 ffff880005559a40 0000000000000000
May  7 00:44:47 alexandria kernel: [19412.649449]  0000000000000000 0000000000200000 0000000000000000 ffff8803f3485cf0
May  7 00:44:47 alexandria kernel: [19412.650132] Call Trace:
May  7 00:44:47 alexandria kernel: [19412.650857]  [<ffffffffa01bcb11>] arc_write_done+0x91/0x3e0 [zfs]
May  7 00:44:47 alexandria kernel: [19412.651562]  [<ffffffffa026ea69>] zio_done+0x2c9/0xe70 [zfs]
May  7 00:44:47 alexandria kernel: [19412.652271]  [<ffffffff811a5d6d>] ? kfree+0x11d/0x160
May  7 00:44:47 alexandria kernel: [19412.653025]  [<ffffffffa010c75a>] ? spl_kmem_free+0x2a/0x40 [spl]
May  7 00:44:47 alexandria kernel: [19412.653787]  [<ffffffffa022df20>] ? vdev_mirror_map_free+0x20/0x30 [zfs]
May  7 00:44:47 alexandria kernel: [19412.654522]  [<ffffffffa026f03a>] zio_done+0x89a/0xe70 [zfs]
May  7 00:44:47 alexandria kernel: [19412.655321]  [<ffffffff811a5d6d>] ? kfree+0x11d/0x160
May  7 00:44:47 alexandria kernel: [19412.656045]  [<ffffffffa026f03a>] zio_done+0x89a/0xe70 [zfs]
May  7 00:44:47 alexandria kernel: [19412.656763]  [<ffffffffa026a978>] zio_execute+0xc8/0x180 [zfs]
May  7 00:44:47 alexandria kernel: [19412.657496]  [<ffffffffa010fe8d>] taskq_thread+0x20d/0x430 [spl]
May  7 00:44:47 alexandria kernel: [19412.658213]  [<ffffffff8109d570>] ? wake_up_state+0x20/0x20
May  7 00:44:47 alexandria kernel: [19412.658951]  [<ffffffffa010fc80>] ? taskq_cancel_id+0x120/0x120 [spl]
May  7 00:44:47 alexandria kernel: [19412.659672]  [<ffffffff8108deb2>] kthread+0xd2/0xf0
May  7 00:44:47 alexandria kernel: [19412.660456]  [<ffffffff8108dde0>] ? kthread_create_on_node+0x1c0/0x1c0
May  7 00:44:47 alexandria kernel: [19412.661215]  [<ffffffff8173c2e8>] ret_from_fork+0x58/0x90
May  7 00:44:47 alexandria kernel: [19412.662017]  [<ffffffff8108dde0>] ? kthread_create_on_node+0x1c0/0x1c0
May  7 00:44:47 alexandria kernel: [19412.662760] Code: 05 82 3c 0f 00 4a 8d 3c e0 48 8b 37 48 85 f6 0f 84 d2 00 00 00 48 8b 0b 48 89 f0 31 d2 eb 0c 48 8b 40 20 83 c2 01 48 85 c0 74 34 <48> 39 08 75 ef 4c 8b 43 08 4c 39 40 08 75 e5 4c 8b 4b 10 4c 39 
May  7 00:44:47 alexandria kernel: [19412.663606] RIP  [<ffffffffa01b952c>] buf_hash_insert+0xbc/0x190 [zfs]
May  7 00:44:47 alexandria kernel: [19412.664450]  RSP <ffff8803f3485c20>
May  7 00:44:47 alexandria kernel: [19412.665241] CR2: 00000000aaca2410
May  7 00:44:47 alexandria kernel: [19412.764046] ---[ end trace f7ff5ed6da7298b7 ]---

After this bug, I had to reboot to use the filesystem.

Please let me know if there is any additional information I can provide.
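
For triage, the key fields of the oops above can be pulled out mechanically. The following is a minimal sketch, not an official tool: the helper names (`parse_oops`, `is_kernel_pointer`) are hypothetical, the sample text is four lines copied from the log with the syslog prefix trimmed, and the kernel-address check assumes the usual x86-64 layout in which kernel virtual addresses sit in the upper canonical half.

```python
import re

# Four lines copied from the oops above (syslog prefix trimmed).
OOPS = """\
[19412.636117] BUG: unable to handle kernel paging request at 00000000aaca2410
[19412.641951] RIP: 0010:[<ffffffffa01b952c>]  [<ffffffffa01b952c>] buf_hash_insert+0xbc/0x190 [zfs]
[19412.642555] RAX: 00000000aaca2410 RBX: ffff88000f242a40 RCX: 0000000100000100
[19412.643174] RDX: 0000000000000002 RSI: ffff8802301a5000 RDI: ffffc900074af1d8
"""


def parse_oops(text):
    """Return the faulting address, the faulting symbol+offset, and the registers."""
    fault = re.search(r"paging request at ([0-9a-f]+)", text)
    rip = re.search(r"RIP: \S+\s+\[<[0-9a-f]+>\]\s+(\w+)\+(0x[0-9a-f]+)/", text)
    # Register dumps look like "RAX: 00000000aaca2410" (16 hex digits).
    regs = {
        name: int(value, 16)
        for name, value in re.findall(r"\b(R[A-Z0-9]{1,2}): ([0-9a-f]{16})", text)
    }
    return {
        "fault_addr": int(fault.group(1), 16),
        "symbol": rip.group(1),
        "offset": int(rip.group(2), 16),
        "regs": regs,
    }


def is_kernel_pointer(addr):
    """Assumption: x86-64 kernel addresses are in the upper canonical half."""
    return addr >= 0xFFFF800000000000


info = parse_oops(OOPS)
# Registers whose value equals the faulting address are the likely bad pointer.
suspects = [name for name, value in info["regs"].items() if value == info["fault_addr"]]

print("faulting function:", info["symbol"], "offset", hex(info["offset"]))
print("faulting address: ", hex(info["fault_addr"]))
print("registers holding it:", suspects)
print("looks like a kernel pointer:", is_kernel_pointer(info["fault_addr"]))
```

For this report, the only register equal to the faulting address is RAX, and that value is not in the kernel half of the address space, which suggests the code dereferenced a pointer that had already been corrupted rather than a stale but once-valid one.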

@behlendorf
Contributor

@baughj thanks for filing this. Was this a one-time event, or is it reproducible?

@baughj
Author

baughj commented May 9, 2016

It is not reproducible so far. I am also having some other hardware issues, which I suspect are controller-related, and they have prevented me from really trying again.

@dweeezil
Contributor

dweeezil commented May 9, 2016

I took a look at this after it was posted, and it seemed that about the only way for the indexing into buf_hash_table.ht_table[] to fail would be for the value of buf_hash_table.ht_mask to be corrupted in memory. It's one of those interesting values which is set at initialization and never changed afterwards, and it's another example of the many places where we could add a VERIFY to harden the system. However, access to it is in such a hot path that it would be a shame to add any more instructions than necessary.

@baughj Does your system have ECC memory?
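
To illustrate the trade-off described in this comment, here is a toy model of a power-of-two hash table that is indexed with a mask computed once at initialization. It is a sketch under stated assumptions, not the ZFS code: the class `HashTable` and its methods are hypothetical, and the "hardened" variant only stands in for the kind of check a VERIFY would perform.

```python
class HashTable:
    """Toy power-of-two hash table indexed with a precomputed mask."""

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.buckets = [None] * size
        self.mask = size - 1  # set once here, never changed in normal operation

    def index(self, hash_value):
        # Hot path: a single AND and no bounds check.
        return hash_value & self.mask

    def checked_index(self, hash_value):
        # Hardened variant: one extra comparison before every lookup.
        if self.mask != len(self.buckets) - 1:
            raise AssertionError("mask does not match table size")
        return hash_value & self.mask


table = HashTable(8)
print(table.index(0x2B))          # in range while the mask is intact

table.mask = 0xFF                 # simulate in-memory corruption of the mask
bad = table.index(0x2B)
print(bad, bad >= len(table.buckets))   # the unchecked index now overshoots the table

try:
    table.checked_index(0x2B)
except AssertionError as exc:
    print("caught:", exc)
```

The unchecked path trusts the mask completely, so a corrupted mask yields an index past the end of the table, while the checked path catches the mismatch at the cost of an extra comparison on every access.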

@tuxoko
Contributor

tuxoko commented May 9, 2016

@dweeezil
The failure is not in indexing into ht_table. It happens while walking the hash list, when the walk reaches an invalid address.
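
The failure mode described here can be sketched with a toy model of a singly linked hash chain. Everything in it is an assumption made for illustration: memory is a dictionary from address to node, `PageFault`, `make_chain`, and `walk_chain` are hypothetical names, and the node layout is not the real ARC header.

```python
class PageFault(Exception):
    """Raised when the model dereferences an address that is not mapped."""


def make_chain():
    """Three-node chain at made-up addresses; next == 0 plays the role of NULL."""
    return {
        0x1000: {"key": "a", "next": 0x2000},
        0x2000: {"key": "b", "next": 0x3000},
        0x3000: {"key": "c", "next": 0},
    }


def walk_chain(memory, head, key):
    """Follow next pointers from head; return the matching node's address or None."""
    addr = head
    steps = 0
    while addr != 0:
        if addr not in memory:
            # In a kernel this dereference would be the paging-request oops.
            raise PageFault(f"bad pointer {addr:#x} after {steps} steps")
        node = memory[addr]
        if node["key"] == key:
            return addr
        addr = node["next"]
        steps += 1
    return None


memory = make_chain()
print(hex(walk_chain(memory, 0x1000, "c")))   # healthy chain: the key is found

memory[0x2000]["next"] = 0xAACA2410           # simulate a corrupted next pointer
try:
    walk_chain(memory, 0x1000, "c")
except PageFault as exc:
    print("fault:", exc)
```

The bucket index is valid throughout; the walk only fails once it loads a next pointer that no longer refers to a real node, which is consistent with a fault address that does not look like a kernel pointer.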

@dweeezil
Contributor

dweeezil commented May 9, 2016

When I first looked at the failing offset, it appeared to be in the index into ht_table[] (based on running gdb on a module I had lying around). It makes a lot more sense that the failure is while traversing one of the lists.

@baughj
Author

baughj commented May 11, 2016

Not sure if it makes sense to keep this ticket open, as I have basically replaced most of the machine at this point, and I am no longer receiving the error on the same dataset. Sorry for the noise.

@behlendorf
Contributor

Closing, can no longer be reproduced.

@flokli

flokli commented Jan 24, 2022

I ran into the same issue today, on a box with ECC enabled:

opened #13005

5 participants