
vdev_config_sync can't guarantee the state on disk is still transactionally consistent #4162

Closed
liaoyuxiangqin opened this issue Jan 4, 2016 · 15 comments
Labels
Status: Inactive Not being actively updated Status: Stale No recent activity for issue Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@liaoyuxiangqin
Contributor

When the even labels (L0, L2) are being synced out, if the system dies in the middle of this process, every even label that made it to disk will be newer than any uberblock and the odd labels (L1, L3).

Then the system is powered on and the pool is imported. ZFS finds pools by searching the given device directories with zpool_read_label(), but that function returns an nvlist describing the configuration from the first label it reads successfully, so the already-updated even labels (L0, L2) may be selected and added to the list of known devices through add_config().

get_configs() then picks the best config for each top-level vdev based on the config with the latest transaction group, so the configs read from the even labels (L0, L2) may be chosen because the even labels were synced out with the latest txg. The top-level vdevs are then assembled into a full config for the pool and loaded, but the full config is newer than any uberblock, so spa_load_impl() can fail validation because the full config and the best uberblock describe inconsistent content.

For example: suppose a hot spare replacement occurs on the pool and the sync thread needs to sync the uberblock and the changes to the vdev configuration. If the system dies while the even labels (L0, L2) are being synced out, the vdev tree in every even label that made it to disk will be newer than any uberblock and the odd labels (L1, L3). Because the configs read from the even labels carry the latest txg, those configs are chosen for each top-level vdev and assembled into the full config used to load the spa. spa_load_impl() then fails because rvd->vdev_guid_sum != ub->ub_guid_sum.

The root cause is that we simply return an nvlist from the first label read successfully, rather than selecting, out of the 4 labels, the best label that is consistent with the uberblock.
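
To make the failure window concrete, here is a minimal, self-contained sketch (a toy model, not ZFS code; the txg values are invented) of the on-disk state after a crash that happens after the even labels are written but before the uberblock and odd labels are updated:

    /*
     * Toy model only: the four label txgs and the best uberblock txg as
     * they would look after a crash between the even-label stage and the
     * uberblock/odd-label stages of vdev_config_sync().
     */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long ub_txg = 100;        /* best uberblock on disk */
        unsigned long long label_txg[4] = { 101, 100, 101, 100 };

        for (int l = 0; l < 4; l++)
            printf("L%d txg=%llu%s\n", l, label_txg[l],
                label_txg[l] > ub_txg ? "  <-- newer than the best uberblock" : "");

        /*
         * A "first intact label wins" policy returns L0, whose config
         * describes txg 101, a txg the uberblock never reached.
         */
        printf("first-label policy picks L0 (txg %llu), ub_txg=%llu\n",
            label_txg[0], ub_txg);
        return (0);
    }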

@richardelling
Contributor

Are you suggesting that the latest uberblock is to be considered invalid? If so, you're guaranteeing silent data loss. The current behaviour fails the spa load and requires manual intervention to accept the data loss.

A better approach is to fix the code that updates the spare activation.

@liaoyuxiangqin
Contributor Author

Dear Richard,
I think that when importing the pool in the scenario described above, we may pick the nvlist config from the even labels (L0, L2), which is newer than any uberblock: for example, the config txg is greater than the latest uberblock txg, or a vdev DTL object referenced by the config cannot be found through the latest uberblock. In that case spa_load_impl() fails to load the vdev DTL and marks the vdev as unopenable, and ultimately root vdev state validation fails and the spa load fails.

@richardelling
Contributor

Have you been able to demonstrate this occurring on hardware that supports the cache flush command?

@liaoyuxiangqin
Contributor Author

I have not done such a test, but I think the problem can still occur regardless of whether the hardware supports the cache flush command.
I think the problem is that vdev_config_sync() uses a two-stage sync for the uberblock and the changes to the vdev configuration, and that when importing pools we simply take the nvlist config from the first label read successfully, rather than selecting, out of the 4 labels, the best label that is consistent with the uberblock.

@richardelling
Contributor

The two-stage sync ensures that at least one label is written and synced to persistent media. The second stage is not expected to succeed if the first failed. Since the second stage only writes uberblocks, the on-disk data should be correct: including updated MOS and at least one current label. Do you have evidence to the contrary?

@behlendorf
Contributor

@liaoyuxiangqin the case you describe is absolutely something which can happen. However, there's already code to handle it in the kernel when importing the pool. See vdev_uberblock_load()->vdev_label_read_config().

        /*
         * It's possible that the best uberblock was discovered on a label
         * that has a configuration which was written in a future txg.
         * Search all labels on this vdev to find the configuration that
         * matches the txg for our uberblock.
         */
        if (cb.ubl_vd != NULL)
                *config = vdev_label_read_config(cb.ubl_vd, ub->ub_txg);

The root cause is that we simply return an nvlist from the first label read successfully, rather than selecting, out of the 4 labels, the best label that is consistent with the uberblock.

Exactly right. It looks like the user space zpool_read_label() function needs to be more discriminating. It shouldn't return labels which were written after the latest uberblock. That appears to be possible for both Illumos and Linux even though the algorithms differ slightly.

Illumos behaves as you describe and returns just the first intact label it discovers, without regard for txg. For Linux, zpool_read_label() was updated slightly in commit 7d90f56. It still returns the first label found, but in addition it counts the number of intact labels on the device. That information is passed along to add_config() so it can prefer devices with more intact labels.

Both implementations seem to have the same flaw. @liaoyuxiangqin could you propose a patch for review and testing?
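
A rough user-space illustration of that more discriminating selection (the select_label() helper and label_info struct are hypothetical, not the real zpool_read_label() interface): scan all four labels and prefer the newest intact one whose config txg does not exceed the best uberblock txg, analogous to what vdev_label_read_config() does in the kernel.

    /*
     * Hypothetical sketch, not the real zpool_read_label(): choose the
     * newest intact label whose config txg does not exceed the txg of
     * the best uberblock, instead of the first intact label found.
     */
    #include <stdio.h>

    #define VDEV_LABELS 4

    struct label_info {
        int intact;                 /* label checksummed and unpacked OK */
        unsigned long long txg;     /* pool txg recorded in the label's config */
    };

    /* Returns the index of the best usable label, or -1 if none qualifies. */
    static int
    select_label(const struct label_info *labels, unsigned long long ub_txg)
    {
        int best = -1;

        for (int l = 0; l < VDEV_LABELS; l++) {
            if (!labels[l].intact)
                continue;
            if (labels[l].txg > ub_txg)
                continue;           /* written in a "future" txg; skip it */
            if (best == -1 || labels[l].txg > labels[best].txg)
                best = l;           /* newest label not past the uberblock */
        }
        return (best);
    }

    int main(void)
    {
        /* State after a crash between the even-label and uberblock stages. */
        struct label_info labels[VDEV_LABELS] = {
            { 1, 101 }, { 1, 100 }, { 1, 101 }, { 1, 100 },
        };

        printf("selected label: %d\n", select_label(labels, 100));  /* prints 1 */
        return (0);
    }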

@richardelling
Contributor

Upon further review, I concur that the logic in zpool_read_label() should do what vdev_label_read_config() does and compare against the latest or target uberblock txg.

Arguably this fix should also apply to illumos, however the use of zpool_read_label() in illumos is different than ZoL, and thus illumos seems to be not affected in the same way.

A clever approach might be to reverse the search order: begin at label 3.

Real-world testing will be somewhat tricky because the events that lead to this condition are uncommon: requiring a failure in the midst of a set of small writes and cache flushes. However, it is easy to contrive a blown label test case.
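
One way to picture the reverse-order idea (a sketch with a hypothetical label_intact() stub, not proposed ZoL code): starting the scan at label 3 means the odd labels, which are written in the later stage of vdev_config_sync(), are considered before even labels that may describe a future txg.

    /* Sketch only: scan the labels in reverse, beginning at label 3. */
    #include <stdio.h>

    #define VDEV_LABELS 4

    /* Hypothetical stand-in for "label checksummed and unpacked OK". */
    static int label_intact(int l) { (void)l; return (1); }

    int main(void)
    {
        for (int l = VDEV_LABELS - 1; l >= 0; l--) {
            if (label_intact(l)) {
                printf("using label %d\n", l);  /* label 3 wins the scan */
                break;
            }
        }
        return (0);
    }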

@ilovezfs
Contributor

Arguably this fix should also apply to illumos, however the use of zpool_read_label() in illumos is different than ZoL, and thus illumos seems to be not affected in the same way.

Or maybe ZoL should put it back the way it was?

@liaoyuxiangqin
Contributor Author

@richardelling thank you for your response. I agree with your suggestion that zpool_read_label() should do what vdev_label_read_config() does and compare against the latest or target uberblock txg.

Arguably this fix should also apply to illumos, however the use of zpool_read_label() in illumos is
different than ZoL, and thus illumos seems to be not affected in the same way.

I think both the illumos and zfsonlinux implementations need this improvement. The approach of reversing the search order and beginning at label 3 is worth looking into, and I need to consider more failure cases of the two-stage label sync.

@liaoyuxiangqin
Contributor Author

@behlendorf thank you for your response. My preliminary plan is to optimize zpool_read_label() to traverse all 4 labels, find the best uberblock on the device, and then pick the best config for each top-level vdev based on the config with the latest transaction group that does not go past the best uberblock txg. I still need to consider more failure cases of the two-stage label sync and do self-testing.

After that, I will propose code for review and testing.

@behlendorf
Contributor

Arguably this fix should also apply to illumos, however the use of zpool_read_label() in illumos is different than ZoL, and thus illumos seems to be not affected in the same way.

How are things so different in ZoL? We don't return a different label, just a little more information along with that label. The only major difference here is we depend more heavily on the label scanning behaviour by default and less on a cache file. I don't see any reason why Illumos wouldn't see the same issues if there were more real world systems stressing these call paths in interesting ways.

@richardelling
Contributor

I think you'll find in practice that large illumos implementations always use the cachefile. Consider that illumos can have 20+ partitions/slices per drive and when there are more than a few hundred drives, this is way too much work. So once the system architecture is set to use cachefiles, there is no going back.

@behlendorf
Contributor

That certainly makes sense for illumos but under Linux it's somewhat less useful. During boot all the block devices are already probed in parallel and brought online asynchronously. As part of this process udev automatically probes and identifies any filesystems and constructs a convenient cache. So by the time we're at the point where we could import the pool we already know where everything is. What we really need to do under Linux is just integrate more tightly with libblkid.

That said, having a cache file around as a recovery mechanism is awesome!

@liaoyuxiangqin
Contributor Author

@kpande Sorry, I only submitted the issue and did not mention a PR for it, thanks.

@stale

stale bot commented Aug 24, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Aug 24, 2020
@stale stale bot closed this as completed Nov 25, 2020