Importing corrupted pool causes PANIC: zfs: adding existent segment to range tree #13483
I'd probably try using the tunable |
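For reference: on Linux, an OpenZFS tunable like this is set through the module parameters. A minimal sketch, assuming the zfs_recover parameter that later comments name explicitly:

echo 1 > /sys/module/zfs/parameters/zfs_recover                # set at runtime, before the import
echo "options zfs zfs_recover=1" >> /etc/modprobe.d/zfs.conf   # or persistently, applied at module load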
Thanks, I tried it with the system from the hrmpf rescue CD, but unfortunately got the same error.
|
I'm having this issue as well on Arch Linux with 5.18.9 kernel with zfs built from git @ 74230a5. I was able to mount the disk as read-write with:
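The exact commands were lost from this comment; a plausible reconstruction, assuming the zfs_recover and zil_replay_disable module parameters discussed elsewhere in this thread, with mypool as a placeholder pool name:

echo 1 > /sys/module/zfs/parameters/zfs_recover          # tolerate the inconsistent space map
echo 1 > /sys/module/zfs/parameters/zil_replay_disable   # skip ZIL replay on import
zpool import -f mypool                                   # then import read-write as usual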
Scrubbing now. |
Just to keep this thread fresh: I have encountered a problem generally matching this description on FreeBSD 12.3-RELEASE-p5. The issue appears to have begun with a kernel panic which occurred in the middle of my standard everyday work on the system (it serves as an NFS host for my workstation's home directory). That occurred at approximately 2022-08-02 14:30, but it took me a while to figure out what was going on and this is the first in a series of kernel panics documented by my system and related to the issue:
Subsequently, any attempt to import the zpool would result in the same sort of panic described in this issue:
This is from the next attempt to import; I'm providing it since the stacktrace path seems perhaps sufficiently different to assist in identifying the cause:
After the above attempts, I determined that I was able to successfully import the zpool in readonly mode. I sequestered away the important changes I had made to the data since the latest backup operation the previous night and I exported the zpool. I then attempted to import using a previous TXG as the original author of this issue mentioned above, but while that did cause the import operation to attempt for about 8 hours (rather than immediately panic), it ultimately resulted in the same kernel panic. Sadly, I don't have any stack trace output for that one, just this:
I then attempted upgrading from FreeBSD 12.2-RELEASE to 13.1-RELEASE, but sadly, that did not resolve the issue. Finally, I used the following tunables to successfully import my zpool in read/write mode:
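The tunable list itself was lost from this comment; judging by the names referenced here and in later comments, a likely FreeBSD 13.x set is the following (the exact sysctl names are an assumption):

sysctl vfs.zfs.recover=1                    # tolerate some on-disk inconsistencies
sysctl vfs.zfs.zil.replay_disable=1         # skip ZIL replay on import
sysctl vfs.zfs.spa.load_verify_metadata=0   # skip metadata traversal during import
sysctl vfs.zfs.spa.load_verify_data=0       # skip data traversal during import
zpool import -f mypool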
Perhaps interestingly: without the zil.replay_disable tunable being set (but the other three tunables being set as shown), the import operation continues to fail with a panic featuring the same "adding existent segment to range tree" issue. If the zil.replay_disable tunable is set, I am able to import the pool although I receive the following warnings:
Once I got my pool imported in read/write mode, I executed a full scrub which found zero data errors and repaired 0B. I thought this might put me in the clear, but the warnings described above continue to appear, so I would appreciate any advice anyone might have on this front. I have ordered a third external backup drive to which I am going to make a third replica of my pool before I try something as painful and potentially messy as a full restoration from backup.

Oh, incidentally, I did also observe that I could cause ZFS to throw those warnings by listing snapshots for a particular dataset within my pool (the one in which I was active at the time of the initial kernel panic). I thought I was clever and tried to remove just the latest snapshot for that dataset, but it did not resolve the problem. |
It seems that my current situation is: I have some incorrect extent allocations in my metaslabs. Given the successful scrub, it appears that, aside from the problematic (duplicate?) entries in the metaslabs, the non-problematic entries in the metaslabs correctly correspond to the data on the disk. This may be instructive:
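The zdb output was not captured here, but the kind of invocation that shows metaslab and space-map contents looks like this (mypool is a placeholder; repeating -m is meant to increase verbosity):

zdb -m mypool    # offset, space map, and free space of each metaslab
zdb -mm mypool   # additionally dump the space-map entries themselves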
It's unclear to me at the moment why the zdb command issues warnings to the effect that it is removing "nonexistent segments" from the range tree, whereas the import declares seemingly the opposite problem, but I am sure that makes perfect sense to someone who better understands ZFS internals. The offset and size values described here certainly seem to match up nicely with those reported as problems during the import operation, so I'm pretty sure I'm on the right path. I have been searching for any utilities that I might invoke to correct metaslab entries, but have come up empty-handed thus far. |
It looks like these are principally double-free issues from zdb output:
Looks like txg 21134971 is the culprit. |
As an update: After reading about metaslabs, spacemaps, and their maintenance, I came to believe that the following things are true:
Lo and behold, that appears to have occurred. At the end of the day, I executed the zdb command to view free range allocations and found that all of the duplicated segments had been subsumed in consolidated descriptions. Expecting this to mean my problem had disappeared (and kinda confirming that by the fact that the zdb command threw no errors when executed), I exported the pool, restarted my machine, and imported it without any tunable adjustments (as described previously above). And it imported perfectly. A subsequent scrub shows zero bytes repaired and zero data errors. Hopefully this helps someone else if he or she should encounter the same issue. Please, by all means, someone correct me if I have erred in my assessment, here. |
I am having this exact same issue on TrueNAS Scale. I am able to import the pool using the tunables mentioned above. Here is what dmesg shows:
|
same issue as @power-max on TrueNAS Scale. No idea how to get the system up and running again now since it craps out on startup :( |
@da-anda did you try disconnecting the drives before boot? After boot, set the variables (vfs.zfs.recover=1, …). You may export the pool before connecting the drives and then import, or go for the import while it says the pool is offline |
@CaCTuCaTu4ECKuu thanks, I have recovered it in the meantime by booting into a live distro (Ubuntu IIRC) and fixing the pool there. And disconnecting the affected drives before boot did not work, since it was the boot pool that was affected (TrueNAS is also using ZFS for the boot drive for some reason) |
Now that you mention it, I should probably make a config backup and schedule it somehow for the boot-pool; I didn't even think the same error could happen to the boot pool. Reinstalling would be easier and faster, I guess |
same problem on Ubuntu 22 |
And how will that scrubbing help you? Once you reboot, the panic comes back again. |
You have to leave it running for several days with those options enabled. Then export, disable the parameters, and restart; if the system starts, it's probably fine, otherwise repeat the cycle and leave it running longer. |
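A sketch of that recover-then-normalize cycle with the Linux module-parameter names used earlier in the thread (mypool is a placeholder; the idea, per the FreeBSD comment above, is that running with the recovery tunables gives ZFS time to condense the damaged space maps):

echo 1 > /sys/module/zfs/parameters/zfs_recover
echo 1 > /sys/module/zfs/parameters/zil_replay_disable
zpool import -f mypool
# ...leave the pool running long enough for the space maps to be rewritten...
zpool export mypool
echo 0 > /sys/module/zfs/parameters/zfs_recover
echo 0 > /sys/module/zfs/parameters/zil_replay_disable
zpool import mypool   # normal import; repeat the cycle if the panic returns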
I am having a similar problem,
|
I have this same problem now, and none of the tunables are working for me. |
So I had to take an extra step. One of my encrypted datasets got badly corrupted by a send/receive. Since I already suspected the dataset to be badly damaged, I tried to export and re-import the pool with the tunables. I got my pool imported successfully without the keys, nuked the suspicious dataset, and unlocked the remaining ones. I'm back online with my pool. Now I just need to clean it up. Am I right to assume that running
zdb is readonly, it will never fix anything, and if it does, it's a bug. |
I also had this error on booting. I followed @cyberpower678's steps and was able to import the pool. In my case, I had attempted to delete a dataset that had an (unlocked) encrypted child dataset via the TrueNAS web GUI; this process ran for over 8 hours without completing, and at that point I rebooted the system. |
Ok, thanks. Is there something I can run that actually does some basic scrub-like repairing?
|
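For reference, a scrub is the only built-in repair pass; it rewrites blocks that fail checksums where redundancy exists, but it does not correct the space-map damage behind this panic (mypool is a placeholder):

zpool scrub mypool
zpool status -v mypool   # progress, plus any files with unrecoverable errors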
Just wanted to also say that I have been experiencing this with my backup ZFS server, which is the target of many zfs send/recv operations; I believe different datasets have caused different corruption, and I would get this panic. I had tried importing from a previous uberblock, but no bueno, and I have had to nuke the system twice now, restarting the 100TiB transfer over WAN |
Have you tried importing while setting the 4 tunables mentioned further up? If you have an encrypted dataset, have you tried importing without unlocking the datasets while the tunables are set? |
Hi @da-anda! I am having the exact same issue that you were having in TrueNAS Scale with the boot-pool. Could you please guide me on how you fixed it from the Live CD? Just
sorry, don't recall which commands I used. Probably simply export and import followed by a scrub. |
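A sketch of that live-distro sequence, assuming the live environment ships OpenZFS and using boot-pool, the TrueNAS boot pool name mentioned above:

zpool import -f boot-pool
zpool scrub boot-pool    # wait for completion: zpool status boot-pool
zpool export boot-pool   # then reboot into the installed system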
I can add to this bug. I was running TrueNAS Scale Cobia 23.10.0.1. Hardware is an i5-12400, 64 GB PC-3200 RAM (no ECC), on an ASRock Z690 Pro RS, using firmware from after June of this year (don't remember the exact number but it was recent) and updated again to 17.03, and still had the issue. Hard drives are 3x 4TB IronWolf Pros in a Z1 config, lz4 compression, no encryption, configured to act as a Windows file server. A ZIL/SLOG was also set up using a single 32 GB Optane M.2 NVMe drive. Everything was set to sync writes.

I do not know what led to this issue, but it was absolutely tied to a zvol hosting a Windows VM disk. When I updated from TrueNAS 22.13.3 to 23.10.0.1, I had upgraded the feature flags of the pool the zvol was part of (I put it under a "VM_disks" dataset). A few days later, these issues started occurring; it was very random, and left NO loggable trace. A scrub failed with multiple checksum errors (but they didn't seem tied to any specific files that I could find), but after I cleared the warning and reran the scrub, it came up with no issues.

Today, while I was in the VM and updating it, it would crash. I then ran filesystem checking tools within the VM, and it would crash. I then changed the compression from lz4 to no compression as a test on the zvol. I then tried an in-place reinstallation of the Windows VM, and halfway through, it again crashed... and then the host system was stuck in a boot loop of kernel panics displaying issues like this:

The only way I could escape this issue was to pull the drives and mount them as read only so I could start copying data off. I am in the process of blowing away and rebuilding the vdev and am going to put the VM zvols on a different vdev. |
@Anticept I am not sure your TrueNAS crash issue is related to the zvol or ZFS. I am experiencing the very same constant crashes with a Win10 VM running on TrueNAS 23.10, while I had no issues at all on 22.x, and in my case the VM has an NVMe passed through, so it is not running off of any zvol. In case you would like to add additional info to my bug report over at ix-systems, here is the link: https://ixsystems.atlassian.net/browse/NAS-124949 . Since I am not running ECC memory (like you), they basically refused to look into the issue. But if more people experience sudden crashes with VMs on 23.10, maybe they will investigate. |
I am not running ECC memory either. But yes, after upgrading to cobia and feature flags, crashes started. |
I seem to observe that this tends to happen when upgrading ZFS to a newer version while importing datasets created with a significantly older version of ZFS. Something doesn't seem to migrate properly, but it doesn't initially break the current version of ZFS until ZFS gets a few more updates. Have any of you tried booting with my solution in mind? |
if you are referring to my recent crashes after updating to TrueNAS 23.10 along with the new ZFS feature flags, then no, my pools are not encrypted. The only "special" thing I have enabled is zstd compression (no dedupe, etc). I could boot up the system with both of my pools disconnected though and test the VM then, if that is what you meant. edit: but if the crashes were related to ZFS, they should also happen when no VM is running, and that is not the case for me. No VM and no apps, and the system seems to run just fine. But ZFS access is ofc also lower if no VM/app is running |
no, my VM has an NVMe passthrough as its boot drive and does not depend on any dataset, zvol, or anything ZFS related. I don't even know why it crashes, as there is nothing in the logs (at least I couldn't spot anything). But just to rule out a ZFS issue, I will boot with all drives/pools detached and see what happens |
For what it's worth, I had this same problem on my striped NVMe rpool, which began when I was stressing the system while trying out a new 14TB USB zpool for backups. I worked my way through it thanks mainly to @bitwise0perator. When I finally found this thread I was able to boot a proxmox install and apply the fix described above. I rebooted that VM and was stoked that the panic was gone. Now I had to clean up the rest of the permanent errors. There were some other adventures I had from mistakes, but I won't get into those.
I hope this helps someone. |
FWIW, my issue persists, even without imported zpools. So I don't think my current issue is related to ZFS, unless the boot-pool were corrupted, but it's a clean install of TrueNAS 23.10 on a brand new SSD. I'll try downgrading TrueNAS, but that likely won't work due to the applied ZFS feature flags (albeit those features are not in use atm) |
So you’ve disconnected all of the disks that have your data pools, leaving only the disk(s) with the boot-pool visible to the system. Sounds like your boot-pool may be effed. Try reinstalling TrueNAS from scratch and reload your configuration from your config backup. I'd say only your boot-pool may be the issue, so you might just be able to reconnect your disks and let the configuration restore process deal with the rest. |
Today I tried to move our GitLab instance to a ZFS filesystem and I get the same error:
Ubuntu 22.04 on aws ec2 c5.2xlarge
zpool and dataset info:
|
@Anticept With your system having non-ECC RAM, check that it is not memory corruption; run a good, long memory test before it scrambles your pool beyond recovery. It already looks like you have space map corruption, and your only hope is likely to import the pool read-only and evacuate the data while you can. |
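The read-only import matches the syntax in the original report below; the send/receive line is only an illustration with placeholder names (on a read-only pool you can send existing snapshots but not create new ones):

zpool import -o readonly=on -f mypool
zfs send -R mypool/data@lastsnap | zfs receive backuppool/data   # or copy files out with rsync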
@amotin It's not a memory issue either. Already been down that road.

PS: I also use an Optane 32 GB NVMe as a SLOG device, in case that's relevant. Had no issues for over a year, then as soon as I grabbed TrueNAS Cobia, everything went haywire. Lost a VM image (inconsequential, it just had a jump server and that's it), got stuck in a boot loop, etc. Ran tests on the hardware, grabbed manufacturer tools to examine the disks, etc. I mounted it read only and exported the data.

It was really strange, because after upgrading to Cobia, I ran tests and it all seemed stable at first. Before moving back to Bluefin, I created a fresh Cobia install, wiped the array and made it fresh, and copied everything back to it, and it continued to remain stable even though I was pounding it with file copy operations to fill the disks back up as fast as it would all go. But once the next day came around and we started having 10+ people hitting it through the SMB service, that's when it got real unstable, crashing as often as every half hour.

I said I'm done with this and went back to Bluefin 2.4.2. It's been 100% issue free. Got another system with ECC RAM coming; we're going to go to two file servers, one as a standby, as we can't afford to have it going offline like that. |
After updating to 2.2.1 I tried with zfs_dmu_offset_next_sync=0
scrub doesn't help |
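For reference, that parameter is set like the other Linux module parameters; a runtime sketch:

echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync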
System information
Tried on another system
Describe the problem you're observing
I have a (no redundancy) pool which has been corrupted by a hardware failure. Trying to import it causes a PANIC, and the zpool import process hangs in "D uninterruptible sleep (usually IO)" state:
Maybe issue #13445 is related, there is a similar backtrace there.
The pool can be imported with zpool import -o readonly=true -f rpool or with zpool import -f -T 2676127 rpool.
Describe how to reproduce the problem
I have a dump / image of the pool in the corrupted state on which I can repeatedly reproduce this with both systems / ZFS versions.
I can (and am willing to) try out possible solutions, too. (The original pool has been recovered with the -T txg method.)
Unfortunately I cannot share the whole image as it contains personal information; some short hexdump may be possible.
Include any warning/errors/backtraces from the system logs
The results have been reproduced running on qemu/kvm (version 5.2.0) using the image as a virtual disk; the original hypervisor was an ESXi.
The original system (Debian, zfs-2.0.3-9)
The hrmpf rescue BootCD (zfs-2.1.2-1)
Thank you!