
Sequential SPA-level rebuild for mirror and dRAID #5153

Closed

Conversation

@thegreatgazoo commented Sep 22, 2016

This patch implements SPA-level rebuild for mirror and the upcoming dRAID (#3497) vdevs. It works by scanning the metaslabs and rebuilding each used segment of DVA. The redundancy layout (i.e. where the copies, data/parity blocks are) must be determined by the DVA only. That's why it does not work with raidz. Since the scan happens at the SPA/metaslab level, it doesn't traverse the block pointer tree. The downside is that block checksums cannot be verified during the scan. The upside is that the I/O is perfectly sequential and the I/O size is often larger than the block size. Note that this is different from #3625, which is still resilver.
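As a rough illustration of the approach (a standalone sketch with made-up segments and helper names, not code from this patch): the rebuild walks each metaslab's allocated ranges in offset order and repairs whole ranges, so a single repair I/O can cover many blocks instead of one record at a time.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* An allocated range within a metaslab (illustrative). */
struct segment {
	uint64_t start;
	uint64_t size;
};

/* Stand-in for issuing the mirror read/write repair I/O for one range. */
static void
rebuild_segment(uint64_t offset, uint64_t size)
{
	printf("repair [%llu, %llu) %llu bytes\n",
	    (unsigned long long)offset,
	    (unsigned long long)(offset + size),
	    (unsigned long long)size);
}

int
main(void)
{
	/* Hypothetical allocated segments, already sorted by offset. */
	struct segment segs[] = {
		{ 0,             1ULL << 20 },	/* 1M of contiguous blocks */
		{ 4ULL << 20,    3ULL << 20 },	/* 3M */
		{ 16ULL << 20, 128ULL << 10 },	/* a lone 128K block */
	};
	size_t i;

	for (i = 0; i < sizeof (segs) / sizeof (segs[0]); i++)
		rebuild_segment(segs[i].start, segs[i].size);

	return (0);
}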

The following tests compared rebuild and resilver on the same hardware populated with exactly the same data:

# modprobe zfs zfs_vdev_scrub_min_active=8 zfs_vdev_scrub_max_active=10 zfs_vdev_async_write_min_active=10
# zpool create -f tank mirror sdc sde
# for i in `seq 1 20`; do cat ubuntu-14.04.4-desktop-amd64.iso > /tank/$i.iso; done
# tar xf linux-4.7.3.tar.xz -C /tank/
# tar xf linux-4.5.tar.xz -C /tank/
# df -h /tank/
Filesystem      Size  Used Avail Use% Mounted on
tank            1.8T   22G  1.8T   2% /tank
# zpool attach -f tank sdc sdg
# zpool status
  pool: tank
 state: ONLINE
  scan: rebuilt 21.5G in 0h1m46s with 0 errors on Thu Sep 22 19:57:46 2016
config: 

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdg     ONLINE       0     0     0

errors: No known data errors
# zpool export tank
# zpool create -f tank mirror sdc sde
# for i in `seq 1 20`; do cat ubuntu-14.04.4-desktop-amd64.iso > /tank/$i.iso; done
# tar xf linux-4.7.3.tar.xz -C /tank/
# tar xf linux-4.5.tar.xz -C /tank/
# df -h /tank/
Filesystem      Size  Used Avail Use% Mounted on
tank            1.8T   22G  1.8T   2% /tank
# echo 0 > /sys/module/zfs/parameters/vdev_draid_max_rebuild # Disable Rebuild
# zpool attach -f tank sdc sdg
# zpool status
  pool: tank
 state: ONLINE
  scan: resilvered 21.5G in 0h3m0s with 0 errors on Thu Sep 22 20:05:18 2016
config: 

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdg     ONLINE       0     0     0

errors: No known data errors

The rebuild took 106 seconds while the resilver took 180 seconds:

  • scan: rebuilt 21.5G in 0h1m46s with 0 errors on Thu Sep 22 19:57:46 2016
  • scan: resilvered 21.5G in 0h3m0s with 0 errors on Thu Sep 22 20:05:18 2016

The resilver took about 70% more time to complete. Note that this is close to an ideal workload for resilver: a new and nearly empty pool with mostly large files. Rebuild, by contrast, is sequential by design and far less sensitive to pool aging, fragmentation, or file size, so this test actually favors resilver.

The following two pictures show graphs of I/O data during the rebuild and the resilver:
[graph: rebuild]
[graph: resilver]
In each picture, there are 4 graphs:

  • The X axis shows wall-clock time; each graph covers 120 seconds of I/O data.
  • The 1st graph shows read bandwidth in MB/s for the individual drives.
  • The 2nd graph shows write bandwidth in MB/s for the individual drives.
  • The 3rd and 4th graphs show the average read and write request sizes in KB, respectively.

The most interesting part is the 3rd and 4th graphs, which show the average I/O request sizes. For rebuild, the read/write request sizes averaged about 1M, while for resilver they were around 128K, which is the recordsize. This is because rebuild scans metaslabs and repairs used segments, each of which may contain many blocks.

Comments and suggestions are very welcome. The code at this point is rough and hacky. It may panic and/or destroy your data. That said, I was able to test rebuilds successfully, and scrub found no errors.

How to use:

  1. Apply this patch.
  2. All attach and replace operations on a mirror vdev will then use the sequential rebuild mechanism. To use resilver instead: echo 0 > /sys/module/zfs/parameters/vdev_draid_max_rebuild

Signed-off-by: Isaac Huang <he.huang@intel.com>
@thegreatgazoo (Author)

Graphs of I/O for replacing a failed drive in a 3-way mirror; the results are nearly identical to attaching a new drive to a 2-way mirror:
[graph: replace]

BTW there were some cstyle issues in the patch, e.g. some lines too long. I'll fix them later.

"%llu errors on %s");
} else if (ps->pss_func == POOL_SCAN_REBUILD) {
fmt = "rebuilt %s in %lluh%um%us with "

Member

This needs to use gettext().
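A sketch of what the suggested change could look like (illustrative, not from the patch):

fmt = gettext("rebuilt %s in %lluh%um%us with "
    "%llu errors on %s");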

@@ -5473,6 +5478,9 @@ print_scan_status(pool_scan_stat_t *ps)
} else if (ps->pss_func == POOL_SCAN_RESILVER) {
(void) printf(gettext("resilver canceled on %s"),
ctime(&end));
} else if (ps->pss_func == POOL_SCAN_REBUILD) {
(void) printf("rebuild canceled on %s",

Member

This needs to use gettext().

@@ -5488,6 +5496,9 @@ print_scan_status(pool_scan_stat_t *ps)
} else if (ps->pss_func == POOL_SCAN_RESILVER) {
(void) printf(gettext("resilver in progress since %s"),
ctime(&start));
} else if (ps->pss_func == POOL_SCAN_REBUILD) {
(void) printf("rebuild in progress since %s",

Member

This needs to use gettext().

@@ -5526,6 +5537,9 @@ print_scan_status(pool_scan_stat_t *ps)
} else if (ps->pss_func == POOL_SCAN_SCRUB) {
(void) printf(gettext("\t%s repaired, %.2f%% done\n"),
processed_buf, 100 * fraction_done);
} else if (ps->pss_func == POOL_SCAN_REBUILD) {
(void) printf(" %s rebuilt, %.2f%% done\n",

Member

This needs to use gettext(). Also, the whitespace at the front should be "\t".
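A sketch of the suggested fix, mirroring the scrub branch above (illustrative only):

(void) printf(gettext("\t%s rebuilt, %.2f%% done\n"),
    processed_buf, 100 * fraction_done);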

@@ -338,7 +339,8 @@ dsl_scan_done(dsl_scan_t *scn, boolean_t complete, dmu_tx_t *tx)
spa_history_log_internal(spa, "scan done", tx,
"errors=%llu", spa_get_errlog_size(spa));

if (DSL_SCAN_IS_SCRUB_RESILVER(scn)) {
if (DSL_SCAN_IS_SCRUB_RESILVER(scn) ||
scn->scn_phys.scn_func == POOL_SCAN_REBUILD) {

Member

This should probably be a DSL_SCAN_IS_REBUILD define for consistency.
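For instance, such a define could mirror DSL_SCAN_IS_SCRUB_RESILVER (an illustrative sketch, not from the patch):

#define	DSL_SCAN_IS_REBUILD(scn) \
	((scn)->scn_phys.scn_func == POOL_SCAN_REBUILD)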

@@ -1550,6 +1588,18 @@ dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx)
if (!scn->scn_async_stalled && !dsl_scan_active(scn))
return;

if (scn->scn_phys.scn_func == POOL_SCAN_REBUILD) {
if (scn->scn_visited_this_txg == 19890604) {

Member

Can you explain this magic number?


Author

It's just a historical date that happened to pop up. Anything else would work, like the usual deadbeef babyface stuff.

@@ -3053,7 +3053,7 @@ vdev_stat_update(zio_t *zio, uint64_t psize)

if (type == ZIO_TYPE_WRITE && txg != 0 &&
(!(flags & ZIO_FLAG_IO_REPAIR) ||
(flags & ZIO_FLAG_SCAN_THREAD) ||
((flags & ZIO_FLAG_SCAN_THREAD) && !spa->spa_dsl_pool->dp_scan->scn_is_sequential) ||

Member

You might want to wrap this line.
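For example, the condition could be wrapped like this (a formatting sketch only):

((flags & ZIO_FLAG_SCAN_THREAD) &&
    !spa->spa_dsl_pool->dp_scan->scn_is_sequential) ||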


DVA_SET_VDEV(&dva[0], vd->vdev_id);
DVA_SET_OFFSET(&dva[0], offset);
DVA_SET_GANG(&dva[0], !!0);

Member

Why is this !!0 and not just 0?


Author

Yes, it should be just 0. The !!0 was there because the code was originally DVA_SET_GANG(&dva[0], !!(flags & ZDB_FLAG_GBH));

@@ -4550,6 +4761,8 @@ spa_vdev_attach(spa_t *spa, uint64_t guid, nvlist_t *nvroot, int replacing)
return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));

pvd = oldvd->vdev_parent;
if (pvd->vdev_ops == &vdev_mirror_ops && vdev_draid_max_rebuild > 0)

Member

Since rebuild is not specific to draid, it should not have draid in the name for this parameter.

@@ -6911,4 +7136,7 @@ module_param(zio_taskq_batch_pct, uint, 0444);
MODULE_PARM_DESC(zio_taskq_batch_pct,
"Percentage of CPUs to run an IO worker thread");

module_param(vdev_draid_max_rebuild, int, 0644);

Member

Since rebuild is not specific to draid, it should not have draid in the name for this parameter.
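For example (the name below is only a placeholder to show the idea, not a proposed final name):

module_param(vdev_max_rebuild, int, 0644);	/* hypothetical name with "draid" dropped */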

@behlendorf added the "Status: Work in Progress" (Not yet ready for general review) label Sep 30, 2016
@thegreatgazoo (Author)

Thanks @rlaager, I've updated the patch based on your feedback. I'm not going to update this PR because I'll push the dRAID code soon, which includes the mirror rebuild support.

@ahrens (Member) commented Jan 18, 2017

One terminology note:

It works by scanning the metaslabs and rebuilding each used segment of DVA. The redundancy layout (i.e. where the copies, data/parity blocks are) must be determined by the DVA only.

Instead of "DVA", I think you mean "segment of contiguously-allocated space", or something to that effect. If we actually had all the DVA's (blk_dva[]), we would know the data layout on RAID-Z as well.

@ahrens (Member) commented Jan 18, 2017

Something seems to be wrong with how GitHub is rendering the diffs (most of the changes seem to be missing). I think it would be worthwhile to get this integrated before dRAID, because it's useful without dRAID (on mirrors), and I imagine it's considerably simpler than dRAID, so it should be easier to review and test.

What about doing a second pass to verify the checksums of the copied data (by reading all the block pointers), after the fast pass rebuild? Will that always happen, never happen, or be user-selectable?

@mailinglists35 commented Mar 8, 2017

@ahrens What about doing a second pass to verify the checksums of the copied data (by reading all the block pointers), after the fast pass rebuild? Will that always happen, never happen, or be user-selectable?

As an end user, I would prefer to have this feature turned off by default, to be able to enable it on the fly without having to export the pool or unload kernel modules, and, after syncing the DVAs, to be able to optionally run the new sequential scrub like Oracle describes (that should be the default), where blocks are traversed in disk order.

@behlendorf changed the title from "[RFC] [WIP] Sequential SPA-level rebuild for mirror and dRAID" to "Sequential SPA-level rebuild for mirror and dRAID" May 10, 2017
@behlendorf added the "Type: Feature" (Feature request or new feature) label May 10, 2017
@behlendorf (Contributor)

I'm not going to update this PR because I'll push the dRAID code soon, which includes the mirror rebuild support.

Closing PR. The feedback above was folded into the dRAID patch #7078.

@behlendorf closed this Feb 7, 2018