Implement sequential (two-phase) resilvering #3625

Closed
Deewiant opened this issue Jul 24, 2015 · 23 comments
Labels
Type: Feature Feature request or new feature

Comments

@Deewiant

https://blogs.oracle.com/roch/entry/sequential_resilvering describes a two-phase resilvering process that avoids random I/O and can dramatically speed up resilvering, especially on HDDs.

As far as I know, ZoL doesn't do anything like this, so I created this issue to keep track of the situation.

@behlendorf behlendorf added the Type: Feature Feature request or new feature label Jul 24, 2015
@behlendorf
Contributor

@Deewiant thanks for filing this. Yes, this is something we've talked about implementing for some time; it would be great for someone to pick it up when they have the time.

@adilger
Contributor

adilger commented Dec 3, 2015

Resilvering would also benefit greatly from the metadata allocation class work in issue #3779, which allows metadata to be placed on a dedicated SSD device.

Another option that was discussed in the past for mirror devices (I'm not sure if it was ever implemented) is to do a full linear, "dd"-style copy of the working device to the replacement, and then fall back to a scrub to verify the data was written correctly. That restores data redundancy quickly, using large streaming I/O requests to the disks, and the scrub can then be done at a lower priority. There is still the possibility that the source device has latent sector errors, so the failing drive shouldn't be taken offline until after the scrub, so that it can be used, if possible, to read any blocks with bad checksums.
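
A minimal sketch of the linear-copy phase of that idea, in Python for illustration only (the device paths, chunk size, and the follow-up scrub are assumptions, not existing ZFS code):

```python
import os

CHUNK = 16 * 1024 * 1024  # large sequential chunks keep both disks streaming

def linear_copy(src_path, dst_path):
    """Phase 1 of the idea above: dd-style whole-device copy.
    Phase 2 (not shown) is an ordinary scrub to verify checksums."""
    with open(src_path, "rb", buffering=0) as src, \
         open(dst_path, "r+b", buffering=0) as dst:
        while True:
            buf = src.read(CHUNK)
            if not buf:
                break
            dst.write(buf)
        os.fsync(dst.fileno())
```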

@jumbi77
Contributor

jumbi77 commented Feb 26, 2016

I guess another approach to speeding up resilvering is the parity-declustered RAIDz/mirror work in #3497?

Another idea that hasn't been mentioned yet is Oracle's "RAID-Z/mirror hybrid allocator". I'm not sure whether it also accelerates resilvering, but I guess it is a nice performance boost in general. As far as I understand, metadata is then mirrored within the raidz. Is it planned to implement this, or is it obsolete because of #3779?

@thegreatgazoo

@jumbi77 Parity declustered RAID is a new type of VDEV, called dRAID, which will offer scalable rebuild performance. It will not affect how RAIDz resilver works.

@jumbi77
Contributor

jumbi77 commented May 15, 2016

Is anybody working on this feature, or planning to implement it in the future? Just curious.

@nwf
Contributor

nwf commented Jul 6, 2016

I'd like to suggest a RAM-only queue mode for sequential resilvering, along the lines of rsync's asynchronous recursor. Rather than traverse all the metadata blocks at once and sort all the data blocks, which is likely to be an enormous collection that must itself be serialized to disk, it might be nice for the system to use a standard in-RAM producer/consumer queue, with the producer (the metadata recursor) stalling if the queue fills. The queue would of course be sorted by address on device (with multiple VDEVs intermixed), so that it acts as an enormous elevator queue. While no longer strictly sequential (the recursor would find blocks out of order and have only a limited ability to sort while blocks are in the queue), the collection of data block pointers no longer needs to be persisted to disk, and there should be plenty of opportunities for streaming reads.

I suppose the other downside of such a scheme is that it seems difficult to persist enough data to allow scrubs to pick up where they left off across exports or reboots, but I'm not convinced that capability is all that useful.
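
A toy sketch of that bounded elevator queue, assuming hypothetical producer/consumer threads that call put() and get(); this is illustration only, not ZoL code:

```python
import heapq
import threading

MAX_QUEUED = 1 << 20   # cap on buffered block pointers; the producer stalls when full

class ElevatorQueue:
    """Bounded in-RAM queue of (offset, size) block pointers, drained in
    ascending on-disk offset so the consumer can issue mostly-streaming reads."""

    def __init__(self):
        self._heap = []
        self._lock = threading.Lock()
        self._not_full = threading.Condition(self._lock)
        self._not_empty = threading.Condition(self._lock)
        self._done = False

    def put(self, offset, size):
        # Producer side: the metadata recursor blocks here if the queue is full.
        with self._not_full:
            while len(self._heap) >= MAX_QUEUED:
                self._not_full.wait()
            heapq.heappush(self._heap, (offset, size))
            self._not_empty.notify()

    def get(self):
        # Consumer side: always pops the lowest device offset currently queued.
        with self._not_empty:
            while not self._heap and not self._done:
                self._not_empty.wait()
            if not self._heap:
                return None          # metadata walk finished and queue drained
            item = heapq.heappop(self._heap)
            self._not_full.notify()
            return item

    def finish(self):
        # Called by the producer once the metadata walk is complete.
        with self._lock:
            self._done = True
            self._not_empty.notify_all()
```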

@ironMann
Contributor

ironMann commented Jul 6, 2016

I've just started looking into this. Initially I had the same idea of a RAM-only solution (and it might be my first prototype), but I don't think it will be enough for larger pools. As the design document in #1277 suggests, there are a few benefits to persisting the resilver log. I'm thinking about a solution along the lines of async_destroy, but I still have a lot to learn about ZFS internals.

If somebody would like to provide mentorship for this project, feel free to contact me.

@nwf
Contributor

nwf commented Jul 6, 2016

I think a strictly read-only scrub might be worthwhile, too. Maybe make persistence optional (treat it as an unbounded queue so that it never blocks)?

Alternatively, doing multiple metadata scans and selecting the next consecutive chunk of DVAs, again without persistence, might be simpler. In this design, one would walk the metadata in full and collect, say, the lowest 16M data DVAs (in sorted order, so they can be traversed with big streaming reads). By remembering what the 16Mth DVA was, the next walk of the metadata could collect the next 16M DVAs. This is an easily resumable (just remember which bin of DVAs was being scrubbed), bounded-memory algorithm that should be easy to implement. (Credit, I think, is due to HAMMER2.)
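
A bounded-memory sketch of that binned approach, with walk_metadata() and issue_read() as hypothetical stand-ins that yield and consume DVA offsets (illustration only):

```python
import heapq

BIN_SIZE = 16 * 1024 * 1024   # DVAs collected per metadata pass (the "16M" above)

def scrub_pass(walk_metadata, issue_read, cursor):
    """One pass: gather the BIN_SIZE lowest DVA offsets above `cursor`,
    issue them in sorted order, then return the new cursor (None when done)."""
    bin_heap = []                                 # max-heap (negated) of the smallest offsets seen
    for dva in walk_metadata():
        if dva <= cursor:
            continue                              # already scrubbed in an earlier pass
        if len(bin_heap) < BIN_SIZE:
            heapq.heappush(bin_heap, -dva)
        elif dva < -bin_heap[0]:
            heapq.heapreplace(bin_heap, -dva)     # evict the current largest offset
    if not bin_heap:
        return None                               # nothing left above the cursor
    batch = sorted(-d for d in bin_heap)          # ascending on-disk order
    for dva in batch:
        issue_read(dva)                           # big, mostly-streaming reads
    return batch[-1]                              # the resume point is a single offset

def scrub(walk_metadata, issue_read):
    cursor = -1
    while cursor is not None:
        cursor = scrub_pass(walk_metadata, issue_read, cursor)
```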

@thewacokid

thewacokid commented Jul 18, 2016

Perhaps this is not the correct way to push this, but SMR drives would absolutely love even a slightly sequential workload for resilvers. The current code degrades to 1-5 IOPS with SMR drives over time, which makes rebuilds take an eternity, especially with the queue depth stuck at 1 on the drive being replaced (is that a bug, or expected? I haven't had time to dig into the source).

Just to clarify, this is the SMR drive being resilvered to after a few hours (filtering out idle drives):
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 1.00 0.00 0.00 99.00

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sdgt 0.00 0.00 0.00 6.00 0.00 568.00 189.33 1.00 165.67 166.67 100.00

@ironMann
Contributor

@thewacokid try boosting scrub I/O with the parameters suggested in #4825 (comment).
I've started a discussion on the openzfs-developer mailing list about this feature, and it seems work on it has already started.

@thewacokid

thewacokid commented Jul 19, 2016

@ironMann Those parameters help massively with normal drives; however, SMR drives eventually (within an hour or so) degrade to a handful of IOPS as they shuffle data out to the shingles. Perhaps higher queue depths would help, or async rebuild I/O, or something easier than a full sequential resilver patch? I'm not sure why there's only ever one pending I/O to the target drive.

@scineram

@thewacokid There will be a talk on this topic next month to watch out for, probably the work @ironMann mentioned.
http://open-zfs.org/wiki/Scrub/Resilver_Performance

@mailinglists35

mailinglists35 commented Sep 27, 2016

From the OpenZFS conference recording I understand Nexenta may be able to do it. Can't wait to see this in ZoL!
http://livestream.com/accounts/15501788/events/6340478/videos/137014181 (scroll to minute 15)

@mailinglists35

What is the status of this? The link I pasted above is no longer working.

@mailinglists35

Hm, PR #5153 mentions a new PR, #5841, which intends to solve #3497 and appears to provide faster resilvering.

@nwf
Contributor

nwf commented Mar 9, 2017

@mailinglists35: the dRAID stuff is different, though it happens to have similar effects.

@skiselkov has written all the code to do this; it's in review at skiselkov/illumos-gate#2 and https://github.com/skiselkov/illumos-gate/commits/better_resilver_illumos. I have ported the code over to ZoL and have been testing it with delightful success (it was very straightforward, doubtless in part because ZoL strives to minimize divergence from upstream). I assume a pull request to ZoL is forthcoming once the review is done and the code gets put back to illumos.

ETA: @skiselkov's implementation is purely in-RAM and achieves persistence by periodically draining the reorder buffer, thereby bringing the metadata recursor's state and the set of blocks actually scrubbed into sync. This is a really neat design and keeps the on-disk persistence structure fully backwards compatible with the existing records. He deserves immense praise for the work. :)
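
A toy illustration of that drain-to-checkpoint idea (traverse(), issue_io(), and save_bookmark() are hypothetical stand-ins; this is not the actual illumos/ZoL code):

```python
import heapq

DRAIN_THRESHOLD = 65536   # drain the reorder buffer after this many queued blocks

def scan(traverse, issue_io, save_bookmark):
    """Blocks are found in logical (bookmark) order but issued in on-disk order.
    Draining the whole buffer lets the persisted bookmark advance safely, because
    at that point every block found before the bookmark really has been scrubbed."""
    buffer = []                                    # device offsets, kept as a min-heap
    last_bookmark = None
    for bookmark, offset in traverse():            # the metadata recursor
        heapq.heappush(buffer, offset)
        last_bookmark = bookmark
        if len(buffer) >= DRAIN_THRESHOLD:
            while buffer:                          # near-sequential issue phase
                issue_io(heapq.heappop(buffer))
            save_bookmark(last_bookmark)           # persistent, backwards-compatible cursor
    while buffer:
        issue_io(heapq.heappop(buffer))
    if last_bookmark is not None:
        save_bookmark(last_bookmark)
```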

@skiselkov
Contributor

@nwf Just an FYI, the resilver work isn't quite complete yet. I have a number of changes queued up that build in some more suggestions from Matt Ahrens from the design/early review phase. Notably, a lot of the range_tree code is going to change, as well as some of the vdev queue taskq handling. Nothing too dramatic; I just don't want you to put a lot of work into porting only to have it blown away by further changes.

@nwf
Contributor

nwf commented Mar 9, 2017

@skiselkov: No worries! I'm happy to follow along and start over if needed. :)

@thegreatgazoo

Just to clarify:

  • Resilver, and any optimization of it, works with any type of vdev, including the new dRAID vdev.
  • Rebuild, a completely new mechanism added by dRAID, works only with dRAID and mirror.

@mailinglists35

thank you all!
@behlendorf could you add a milestone for this?

@behlendorf behlendorf added this to the 0.8.0 milestone Mar 9, 2017
@jumbi77
Contributor

jumbi77 commented Jun 26, 2017

Referencing #6256.

@interduo

interduo commented Nov 17, 2017

Thanks for this. This was a big problem for me in one location.

Will this be in the 0.7.4 release?

@behlendorf
Contributor

You're welcome. This feature will be part of 0.8.

Nasf-Fan pushed a commit to Nasf-Fan/zfs that referenced this issue Jan 29, 2018
Currently, scrubs and resilvers can take an extremely
long time to complete. This is largely due to the fact
that zfs scans process pools in logical order, as
determined by each block's bookmark. This makes sense
from a simplicity perspective, but blocks in zfs are
often scattered randomly across disks, particularly
due to zfs's copy-on-write mechanisms.

This patch improves performance by splitting scrubs
and resilvers into a metadata scanning phase and an IO
issuing phase. The metadata scan reads through the
structure of the pool and gathers an in-memory queue
of I/Os, sorted by size and offset on disk. The issuing
phase will then issue the scrub I/Os as sequentially as
possible, greatly improving performance.

This patch also updates and cleans up some of the scan
code which has not been updated in several years.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Authored-by: Alek Pinchuk <apinchuk@datto.com>
Authored-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes openzfs#3625
Closes openzfs#6256
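
A highly simplified sketch of the two-phase split described in that commit message; scan_metadata() and issue_scrub_io() are hypothetical stand-ins, not the actual dsl_scan code:

```python
def sequential_scrub(scan_metadata, issue_scrub_io, mem_limit=1 << 20):
    """Phase 1: walk pool metadata (logical/bookmark order) and queue block pointers.
    Phase 2: issue the queued I/Os sorted by on-disk offset, flushing early when the
    in-memory queue grows past a memory cap."""
    queue = []
    for offset, size in scan_metadata():
        queue.append((offset, size))
        if len(queue) >= mem_limit:
            _issue_sorted(queue, issue_scrub_io)   # flush: stay within the memory budget
            queue.clear()
    _issue_sorted(queue, issue_scrub_io)

def _issue_sorted(queue, issue_scrub_io):
    for offset, size in sorted(queue):             # near-sequential on disk
        issue_scrub_io(offset, size)
```
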
Nasf-Fan pushed a commit to Nasf-Fan/zfs that referenced this issue Feb 13, 2018
FransUrbo pushed a commit to FransUrbo/zfs that referenced this issue Apr 28, 2019