
Sequential SPA-level rebuild for mirror and dRAID #5153

Closed

Conversation

@thegreatgazoo commented Sep 22, 2016

This patch implements SPA-level rebuild for mirror and the upcoming dRAID (#3497) vdevs. It works by scanning the metaslabs and rebuilding each used segment of DVA. The redundancy layout (i.e. where the copies, data/parity blocks are) must be determined by the DVA only. That's why it does not work with raidz. Since the scan happens at the SPA/metaslab level, it doesn't traverse the block pointer tree. The downside is that block checksums cannot be verified during the scan. The upside is that the I/O is perfectly sequential and the I/O size is often larger than the block size. Note that this is different from #3625, which is still resilver.
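As a rough illustration of the approach (a standalone sketch with made-up segments and helper names, not code from this patch): the rebuild walks each metaslab's allocated ranges in offset order and repairs whole ranges, so a single repair I/O can cover many blocks instead of one record at a time.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* An allocated range within a metaslab (illustrative). */
struct segment {
	uint64_t start;
	uint64_t size;
};

/* Stand-in for issuing the mirror read/write repair I/O for one range. */
static void
rebuild_segment(uint64_t offset, uint64_t size)
{
	printf("repair [%llu, %llu) %llu bytes\n",
	    (unsigned long long)offset,
	    (unsigned long long)(offset + size),
	    (unsigned long long)size);
}

int
main(void)
{
	/* Hypothetical allocated segments, already sorted by offset. */
	struct segment segs[] = {
		{ 0,             1ULL << 20 },	/* 1M of contiguous blocks */
		{ 4ULL << 20,    3ULL << 20 },	/* 3M */
		{ 16ULL << 20, 128ULL << 10 },	/* a lone 128K block */
	};
	size_t i;

	for (i = 0; i < sizeof (segs) / sizeof (segs[0]); i++)
		rebuild_segment(segs[i].start, segs[i].size);

	return (0);
}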

The following tests compared rebuild and resilver on the same hardware populated with exactly the same data:

# modprobe zfs zfs_vdev_scrub_min_active=8 zfs_vdev_scrub_max_active=10 zfs_vdev_async_write_min_active=10
# zpool create -f tank mirror sdc sde
# for i in `seq 1 20`; do cat ubuntu-14.04.4-desktop-amd64.iso > /tank/$i.iso; done
# tar xf linux-4.7.3.tar.xz -C /tank/
# tar xf linux-4.5.tar.xz -C /tank/
# df -h /tank/
Filesystem      Size  Used Avail Use% Mounted on
tank            1.8T   22G  1.8T   2% /tank
# zpool attach -f tank sdc sdg
# zpool status
  pool: tank
 state: ONLINE
  scan: rebuilt 21.5G in 0h1m46s with 0 errors on Thu Sep 22 19:57:46 2016
config: 

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdg     ONLINE       0     0     0

errors: No known data errors
# zpool export tank
# zpool create -f tank mirror sdc sde
# for i in `seq 1 20`; do cat ubuntu-14.04.4-desktop-amd64.iso > /tank/$i.iso; done
# tar xf linux-4.7.3.tar.xz -C /tank/
# tar xf linux-4.5.tar.xz -C /tank/
# df -h /tank/
Filesystem      Size  Used Avail Use% Mounted on
tank            1.8T   22G  1.8T   2% /tank
# echo 0 > /sys/module/zfs/parameters/vdev_draid_max_rebuild # Disable Rebuild
# zpool attach -f tank sdc sdg
# zpool status
  pool: tank
 state: ONLINE
  scan: resilvered 21.5G in 0h3m0s with 0 errors on Thu Sep 22 20:05:18 2016
config: 

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdg     ONLINE       0     0     0

errors: No known data errors

The rebuild took 106 seconds while the resilver took 180 seconds:

  • scan: rebuilt 21.5G in 0h1m46s with 0 errors on Thu Sep 22 19:57:46 2016
  • scan: resilvered 21.5G in 0h3m0s with 0 errors on Thu Sep 22 20:05:18 2016

The resilver took about 70% more time to complete. Note that this is close to an ideal workload for resilver: a new and nearly empty pool with mostly large files. Rebuild, by contrast, is sequential by design and far less sensitive to pool aging, fragmentation, or file size, so this test actually favors resilver.

The following two pictures show graphs of I/O data during the rebuild and the resilver:
[graph: rebuild]
[graph: resilver]
In each picture, there are 4 graphs:

  • The X axis shows wall-clock time; each graph covers 120 seconds of I/O data.
  • The 1st graph shows read bandwidth in MB/s for the individual drives.
  • The 2nd graph shows write bandwidth in MB/s for the individual drives.
  • The 3rd and 4th graphs show the average read and write request sizes in KB, respectively.

The most interesting part is the 3rd and 4th graphs, which show the average I/O request sizes. For rebuild, the read/write request sizes averaged about 1M, while for resilver they were around 128K, which is the recordsize. This is because rebuild scans metaslabs and repairs used segments, each of which may contain many blocks.

Comments and suggestions are very welcome. The code at this point is rough and hacky. It may panic and/or destroy your data. That said, I was able to test rebuilds successfully, and scrub found no errors.

How to use:

  1. Apply this patch.
  2. All attach and replace operations on a mirror vdev will then use the sequential rebuild mechanism. To use resilver instead: echo 0 > /sys/module/zfs/parameters/vdev_draid_max_rebuild

Signed-off-by: Isaac Huang <he.huang@intel.com>
@thegreatgazoo (Author)

Graphs of I/O for replacing a failed drive in a 3-way mirror; the results are nearly identical to attaching a new drive to a 2-way mirror:
[graph: replace]

BTW there were some cstyle issues in the patch, e.g. some lines too long. I'll fix them later.

"%llu errors on %s");
} else if (ps->pss_func == POOL_SCAN_REBUILD) {
fmt = "rebuilt %s in %lluh%um%us with "

Member

This needs to use gettext().
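A sketch of what the suggested change could look like (illustrative, not from the patch):

fmt = gettext("rebuilt %s in %lluh%um%us with "
    "%llu errors on %s");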

@@ -5473,6 +5478,9 @@ print_scan_status(pool_scan_stat_t *ps)
} else if (ps->pss_func == POOL_SCAN_RESILVER) {
(void) printf(gettext("resilver canceled on %s"),
ctime(&end));
} else if (ps->pss_func == POOL_SCAN_REBUILD) {
(void) printf("rebuild canceled on %s",

Member

This needs to use gettext().

@@ -5488,6 +5496,9 @@ print_scan_status(pool_scan_stat_t *ps)
} else if (ps->pss_func == POOL_SCAN_RESILVER) {
(void) printf(gettext("resilver in progress since %s"),
ctime(&start));
} else if (ps->pss_func == POOL_SCAN_REBUILD) {
(void) printf("rebuild in progress since %s",

Member

This needs to use gettext().

@@ -5526,6 +5537,9 @@ print_scan_status(pool_scan_stat_t *ps)
} else if (ps->pss_func == POOL_SCAN_SCRUB) {
(void) printf(gettext("\t%s repaired, %.2f%% done\n"),
processed_buf, 100 * fraction_done);
} else if (ps->pss_func == POOL_SCAN_REBUILD) {
(void) printf(" %s rebuilt, %.2f%% done\n",

Member

This needs to use gettext(). Also, the whitespace at the front should be "\t".
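A sketch of the suggested fix, mirroring the scrub branch above (illustrative only):

(void) printf(gettext("\t%s rebuilt, %.2f%% done\n"),
    processed_buf, 100 * fraction_done);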

@@ -338,7 +339,8 @@ dsl_scan_done(dsl_scan_t *scn, boolean_t complete, dmu_tx_t *tx)
spa_history_log_internal(spa, "scan done", tx,
"errors=%llu", spa_get_errlog_size(spa));

if (DSL_SCAN_IS_SCRUB_RESILVER(scn)) {
if (DSL_SCAN_IS_SCRUB_RESILVER(scn) ||
scn->scn_phys.scn_func == POOL_SCAN_REBUILD) {

Member

This should probably be a DSL_SCAN_IS_REBUILD define for consistency.
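For instance, such a define could mirror DSL_SCAN_IS_SCRUB_RESILVER (an illustrative sketch, not from the patch):

#define	DSL_SCAN_IS_REBUILD(scn) \
	((scn)->scn_phys.scn_func == POOL_SCAN_REBUILD)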

@@ -1550,6 +1588,18 @@ dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx)
if (!scn->scn_async_stalled && !dsl_scan_active(scn))
return;

if (scn->scn_phys.scn_func == POOL_SCAN_REBUILD) {
if (scn->scn_visited_this_txg == 19890604) {

Member

Can you explain this magic number?


Author

It's just a historical date that happened to pop up. Anything else would work, like the usual deadbeef babyface stuff.

@@ -3053,7 +3053,7 @@ vdev_stat_update(zio_t *zio, uint64_t psize)

if (type == ZIO_TYPE_WRITE && txg != 0 &&
(!(flags & ZIO_FLAG_IO_REPAIR) ||
(flags & ZIO_FLAG_SCAN_THREAD) ||
((flags & ZIO_FLAG_SCAN_THREAD) && !spa->spa_dsl_pool->dp_scan->scn_is_sequential) ||

Member

You might want to wrap this line.
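For example, the condition could be wrapped like this (a formatting sketch only):

((flags & ZIO_FLAG_SCAN_THREAD) &&
    !spa->spa_dsl_pool->dp_scan->scn_is_sequential) ||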


DVA_SET_VDEV(&dva[0], vd->vdev_id);
DVA_SET_OFFSET(&dva[0], offset);
DVA_SET_GANG(&dva[0], !!0);

Member

Why is this !!0 and not just 0?


Author

Yes, it should be just 0. The !!0 was there because the code was originally DVA_SET_GANG(&dva[0], !!(flags & ZDB_FLAG_GBH));

@@ -4550,6 +4761,8 @@ spa_vdev_attach(spa_t *spa, uint64_t guid, nvlist_t *nvroot, int replacing)
return (spa_vdev_exit(spa, NULL, txg, ENOTSUP));

pvd = oldvd->vdev_parent;
if (pvd->vdev_ops == &vdev_mirror_ops && vdev_draid_max_rebuild > 0)

Member

Since rebuild is not specific to draid, it should not have draid in the name for this parameter.

@@ -6911,4 +7136,7 @@ module_param(zio_taskq_batch_pct, uint, 0444);
MODULE_PARM_DESC(zio_taskq_batch_pct,
"Percentage of CPUs to run an IO worker thread");

module_param(vdev_draid_max_rebuild, int, 0644);

Member

Since rebuild is not specific to draid, it should not have draid in the name for this parameter.
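For example (the name below is only a placeholder to show the idea, not a proposed final name):

module_param(vdev_max_rebuild, int, 0644);	/* hypothetical name with "draid" dropped */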

@behlendorf added the "Status: Work in Progress" (Not yet ready for general review) label Sep 30, 2016
@thegreatgazoo (Author)

Thanks @rlaager, I've updated the patch based on your feedback. I'm not going to update this PR because I'll push the dRAID code soon, which includes the mirror rebuild support.

@ahrens (Member) commented Jan 18, 2017

One terminology note:

It works by scanning the metaslabs and rebuilding each used segment of DVA. The redundancy layout (i.e. where the copies, data/parity blocks are) must be determined by the DVA only.

Instead of "DVA", I think you mean "segment of contiguously-allocated space", or something to that effect. If we actually had all the DVA's (blk_dva[]), we would know the data layout on RAID-Z as well.

@ahrens (Member) commented Jan 18, 2017

Something seems to be wrong with how GitHub is rendering the diffs (most of the changes seem to be missing). I think it would be worthwhile to get this integrated before dRAID, because it's useful without dRAID (on mirrors), and I imagine it's considerably simpler than dRAID, so it should be easier to review and test.

What about doing a second pass to verify the checksums of the copied data (by reading all the block pointers), after the fast pass rebuild? Will that always happen, never happen, or be user-selectable?

@mailinglists35 commented Mar 8, 2017

@ahrens What about doing a second pass to verify the checksums of the copied data (by reading all the block pointers), after the fast pass rebuild? Will that always happen, never happen, or be user-selectable?

As an end user, I would prefer to have this feature turned off by default, to be able to enable it on the fly without having to export the pool or unload kernel modules, and, after syncing the DVAs, to be able to optionally run the new sequential scrub like Oracle describes (that should be the default), where blocks are traversed in disk order.

@behlendorf changed the title from "[RFC] [WIP] Sequential SPA-level rebuild for mirror and dRAID" to "Sequential SPA-level rebuild for mirror and dRAID" May 10, 2017
@behlendorf added the "Type: Feature" (Feature request or new feature) label May 10, 2017
@behlendorf (Contributor)

I'm not going to update this PR because I'll push the dRAID code soon, which includes the mirror rebuild support.

Closing PR. The feedback above was folded into the dRAID patch #7078.

@behlendorf closed this Feb 7, 2018