Kernel Panic, CentOS7 not syncing: bad overwrite z_wr_int_7 #4610

Closed
azuretek opened this issue May 7, 2016 · 8 comments
Labels
Status: Inactive Not being actively updated Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@azuretek

azuretek commented May 7, 2016

My system keeps kernel panicking while I transfer new data to the zpool. I'm looking at the crash vmcore, but I'm not sure where to go from here or what I can do to fix it. My Google skills are lacking on this one.

My system has 32GB of non-ECC RAM (tested with memtest86+, no issues found), an i5 4670K, and an ASRock Z97 Extreme4 motherboard. Twelve 4TB disks are connected to an IBM M1015 and a RES2SV240 SAS expander.

I've saved the crash files. Does anyone have any idea what is causing this problem? Is there any more information I can provide to help troubleshoot it?

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-327.13.1.el7.x86_64/vmlinux
    DUMPFILE: ./vmcore  [PARTIAL DUMP]
        CPUS: 4
        DATE: Sat May  7 02:51:44 2016
      UPTIME: 00:41:09
LOAD AVERAGE: 1.30, 1.73, 1.55
       TASKS: 659
    NODENAME: azureserve1.azuredomain.local
     RELEASE: 3.10.0-327.13.1.el7.x86_64
     VERSION: #1 SMP Thu Mar 31 16:04:38 UTC 2016
     MACHINE: x86_64  (3399 Mhz)
      MEMORY: 31.7 GB
       PANIC: "Kernel panic - not syncing: bad overwrite, hdr=ffff88060d720520 exists=ffff880679d37858"
         PID: 1722
     COMMAND: "z_wr_int_7"
        TASK: ffff8807e4d85080  [THREAD_INFO: ffff8807e4da8000]
         CPU: 2
       STATE: TASK_RUNNING (PANIC)
@behlendorf
Contributor

@azuretek if you could post the backtraces from the crash dump we may be able to identify the issue.

@azuretek
Author

azuretek commented May 10, 2016

Here it is. I'm not sure if anything else is necessary.

crash> bt
PID: 1722   TASK: ffff8807e4d85080  CPU: 2   COMMAND: "z_wr_int_7"
 #0 [ffff8807e4dab9a0] machine_kexec at ffffffff81051beb
 #1 [ffff8807e4daba00] crash_kexec at ffffffff810f2662
 #2 [ffff8807e4dabad0] panic at ffffffff8162ef9e
 #3 [ffff8807e4dabb50] arc_write_done at ffffffffa04cad54 [zfs]
 #4 [ffff8807e4dabb98] zio_done at ffffffffa0580990 [zfs]
 #5 [ffff8807e4dabc10] zio_done at ffffffffa0581692 [zfs]
 #6 [ffff8807e4dabc28] zio_done at ffffffffa058115c [zfs]
 #7 [ffff8807e4dabca0] zio_done at ffffffffa0581692 [zfs]
 #8 [ffff8807e4dabcb8] zio_done at ffffffffa058115c [zfs]
 #9 [ffff8807e4dabd30] zio_done at ffffffffa0581692 [zfs]
#10 [ffff8807e4dabd48] zio_done at ffffffffa058115c [zfs]
#11 [ffff8807e4dabdc0] zio_done at ffffffffa0581692 [zfs]
#12 [ffff8807e4dabdd8] zio_execute at ffffffffa057c8c8 [zfs]
#13 [ffff8807e4dabe20] taskq_thread at ffffffffa04496de [spl]
#14 [ffff8807e4dabec8] kthread at ffffffff810a5aef
#15 [ffff8807e4dabf50] ret_from_fork at ffffffff81645e18
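
To make a trace like this easier to compare against the one in another report, the frames can be reduced to function and module names with repeated frames collapsed. The following is only a minimal sketch: the names `FRAME_RE`, `parse_frames`, and `collapse` are hypothetical helpers, and the regular expression assumes the `bt` line layout shown above. The sample is a shortened excerpt of this trace.

```python
import re
from itertools import groupby

# Assumed layout of a crash `bt` frame line, e.g.
#  #3 [ffff8807e4dabb50] arc_write_done at ffffffffa04cad54 [zfs]
FRAME_RE = re.compile(
    r"^\s*#(?P<index>\d+)\s+\[(?P<sp>[0-9a-f]+)\]\s+(?P<func>\S+)\s+at\s+"
    r"(?P<addr>[0-9a-f]+)(?:\s+\[(?P<module>[^\]]+)\])?\s*$"
)

def parse_frames(bt_text):
    """Return (function, module) pairs in printed order (innermost first)."""
    frames = []
    for line in bt_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            # Frames without a [module] suffix belong to the core kernel.
            frames.append((m.group("func"), m.group("module") or "kernel"))
    return frames

def collapse(frames):
    """Merge consecutive identical frames into (function, module, count)."""
    return [(func, mod, len(list(group))) for (func, mod), group in groupby(frames)]

# Shortened excerpt of the z_wr_int_7 trace from this issue.
SAMPLE = """\
PID: 1722   TASK: ffff8807e4d85080  CPU: 2   COMMAND: "z_wr_int_7"
 #2 [ffff8807e4dabad0] panic at ffffffff8162ef9e
 #3 [ffff8807e4dabb50] arc_write_done at ffffffffa04cad54 [zfs]
 #4 [ffff8807e4dabb98] zio_done at ffffffffa0580990 [zfs]
 #5 [ffff8807e4dabc10] zio_done at ffffffffa0581692 [zfs]
#12 [ffff8807e4dabdd8] zio_execute at ffffffffa057c8c8 [zfs]
#13 [ffff8807e4dabe20] taskq_thread at ffffffffa04496de [spl]
"""

if __name__ == "__main__":
    for func, mod, count in collapse(parse_frames(SAMPLE)):
        print(f"{func} [{mod}] x{count}")
```

Collapsing the repeated `zio_done` frames leaves a short signature that can be diffed against a trace from a different dump.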

@azuretek
Author

I've rebuilt my system and tried different kernels and distros, but I see no difference. I think it's likely a hardware issue, but I've already replaced the RAM and run memtest86+, and I've moved hard drives between bays. The only things I haven't done are replacing the mobo, CPU, and controller/expander. Does anyone know how I can narrow down where the problem might be?

@dweeezil
Contributor

I find the similarity between this stack and the one in #4608 to be rather interesting.

@azuretek
Author

Seems like it could be the same problem, or at least a similar one. I wish there were more of a resolution, or at least some way to narrow down what the problem is. For now I've decided to RMA some of the hardware to rule it out.

@azuretek
Author

Replaced the RAID controller, still no change.

@azuretek
Author

The specific panic error keeps changing, though it's always related to ZFS.

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-327.13.1.el7.x86_64/vmlinux
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 4
        DATE: Wed May 18 16:50:03 2016
      UPTIME: 01:25:40
LOAD AVERAGE: 1.36, 1.71, 1.70
       TASKS: 565
    NODENAME: azureserve1.azuredomain.local
     RELEASE: 3.10.0-327.13.1.el7.x86_64
     VERSION: #1 SMP Thu Mar 31 16:04:38 UTC 2016
     MACHINE: x86_64  (3399 Mhz)
      MEMORY: 31.7 GB
       PANIC: "general protection fault: 0000 [#1] SMP "
         PID: 1797
     COMMAND: "z_fr_iss_6"
        TASK: ffff8807e5fca280  [THREAD_INFO: ffff8807e4010000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)
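
Since the panic differs from dump to dump, it can help to line up the summary fields of two dumps and see which ones changed. This is a rough sketch, not a crash-utility feature: `parse_summary` and `changed_fields` are hypothetical helpers, and the parsing assumes the `KEY: value` layout of the summaries shown in this issue. The samples are shortened excerpts of the two summaries posted here.

```python
def parse_summary(text):
    """Parse `KEY: value` lines from a crash summary into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only; values such as DATE contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields

def changed_fields(first, second, keys=("RELEASE", "PANIC", "COMMAND", "CPU")):
    """Return {key: (first_value, second_value)} for keys whose values differ."""
    return {
        key: (first.get(key), second.get(key))
        for key in keys
        if first.get(key) != second.get(key)
    }

# Shortened excerpts of the two summaries from this issue.
FIRST = """\
     RELEASE: 3.10.0-327.13.1.el7.x86_64
       PANIC: "Kernel panic - not syncing: bad overwrite, hdr=ffff88060d720520 exists=ffff880679d37858"
     COMMAND: "z_wr_int_7"
         CPU: 2
"""

SECOND = """\
     RELEASE: 3.10.0-327.13.1.el7.x86_64
       PANIC: "general protection fault: 0000 [#1] SMP "
     COMMAND: "z_fr_iss_6"
         CPU: 1
"""

if __name__ == "__main__":
    diff = changed_fields(parse_summary(FIRST), parse_summary(SECOND))
    for key, (old, new) in diff.items():
        print(f"{key}: {old} -> {new}")
```

Here the kernel release is identical while the panic string, the task, and the CPU all differ, which matches the observation that the failure moves around.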

@dweeezil
Contributor

@azuretek What's the stack trace for that one (the crash in z_fr_iss_6) look like? Is it also zio_done->arc_write_done?
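
One way to answer that kind of question mechanically is to check whether a caller/callee pair appears as adjacent frames in a `bt` listing. The sketch below is illustrative only: `call_order` and `has_call_chain` are hypothetical helpers, the regular expression assumes the `bt` line layout seen earlier in this issue, and the sample is an excerpt of the z_wr_int_7 trace because no trace for z_fr_iss_6 has been posted.

```python
import re

# Assumed frame layout: "#N [stack-address] function at address [module]"
FUNC_RE = re.compile(r"^\s*#\d+\s+\[[0-9a-f]+\]\s+(\S+)\s+at\s+[0-9a-f]+")

def call_order(bt_text):
    """Function names from a crash `bt` listing, outermost caller first."""
    matches = map(FUNC_RE.match, bt_text.splitlines())
    funcs = [m.group(1) for m in matches if m]
    return funcs[::-1]  # bt prints the innermost frame first

def has_call_chain(bt_text, chain):
    """True if `chain` (caller first, callee last) appears as adjacent frames."""
    order = call_order(bt_text)
    wanted = list(chain)
    n = len(wanted)
    return any(order[i:i + n] == wanted for i in range(len(order) - n + 1))

# Excerpt of the z_wr_int_7 trace posted earlier in this issue.
SAMPLE = """\
 #3 [ffff8807e4dabb50] arc_write_done at ffffffffa04cad54 [zfs]
 #4 [ffff8807e4dabb98] zio_done at ffffffffa0580990 [zfs]
#12 [ffff8807e4dabdd8] zio_execute at ffffffffa057c8c8 [zfs]
"""

if __name__ == "__main__":
    print(has_call_chain(SAMPLE, ["zio_done", "arc_write_done"]))
```

Running the same check on the `bt` output of the second dump would show whether it shares the zio_done->arc_write_done path.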
