-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sequential SPA-level rebuild for mirror and dRAID #5153
Conversation
Signed-off-by: Isaac Huang <he.huang@intel.com>
"%llu errors on %s"); | ||
} else if (ps->pss_func == POOL_SCAN_REBUILD) { | ||
fmt = "rebuilt %s in %lluh%um%us with " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to use gettext().
@@ -5473,6 +5478,9 @@ print_scan_status(pool_scan_stat_t *ps) | |||
} else if (ps->pss_func == POOL_SCAN_RESILVER) { | |||
(void) printf(gettext("resilver canceled on %s"), | |||
ctime(&end)); | |||
} else if (ps->pss_func == POOL_SCAN_REBUILD) { | |||
(void) printf("rebuild canceled on %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to use gettext().
@@ -5488,6 +5496,9 @@ print_scan_status(pool_scan_stat_t *ps) | |||
} else if (ps->pss_func == POOL_SCAN_RESILVER) { | |||
(void) printf(gettext("resilver in progress since %s"), | |||
ctime(&start)); | |||
} else if (ps->pss_func == POOL_SCAN_REBUILD) { | |||
(void) printf("rebuild in progress since %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to use gettext().
@@ -5526,6 +5537,9 @@ print_scan_status(pool_scan_stat_t *ps) | |||
} else if (ps->pss_func == POOL_SCAN_SCRUB) { | |||
(void) printf(gettext("\t%s repaired, %.2f%% done\n"), | |||
processed_buf, 100 * fraction_done); | |||
} else if (ps->pss_func == POOL_SCAN_REBUILD) { | |||
(void) printf(" %s rebuilt, %.2f%% done\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to use gettext(). Also, the whitespace at the front should be "\t".
@@ -338,7 +339,8 @@ dsl_scan_done(dsl_scan_t *scn, boolean_t complete, dmu_tx_t *tx) | |||
spa_history_log_internal(spa, "scan done", tx, | |||
"errors=%llu", spa_get_errlog_size(spa)); | |||
|
|||
if (DSL_SCAN_IS_SCRUB_RESILVER(scn)) { | |||
if (DSL_SCAN_IS_SCRUB_RESILVER(scn) || | |||
scn->scn_phys.scn_func == POOL_SCAN_REBUILD) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be a DSL_SCAN_IS_REBUILD define for consistency.
@@ -1550,6 +1588,18 @@ dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx) | |||
if (!scn->scn_async_stalled && !dsl_scan_active(scn)) | |||
return; | |||
|
|||
if (scn->scn_phys.scn_func == POOL_SCAN_REBUILD) { | |||
if (scn->scn_visited_this_txg == 19890604) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this magic number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just a historical date that happened to pop up. Anything else would work, like the usual deadbeef babyface stuff.
@@ -3053,7 +3053,7 @@ vdev_stat_update(zio_t *zio, uint64_t psize) | |||
|
|||
if (type == ZIO_TYPE_WRITE && txg != 0 && | |||
(!(flags & ZIO_FLAG_IO_REPAIR) || | |||
(flags & ZIO_FLAG_SCAN_THREAD) || | |||
((flags & ZIO_FLAG_SCAN_THREAD) && !spa->spa_dsl_pool->dp_scan->scn_is_sequential) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to wrap this line.
|
||
DVA_SET_VDEV(&dva[0], vd->vdev_id); | ||
DVA_SET_OFFSET(&dva[0], offset); | ||
DVA_SET_GANG(&dva[0], !!0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this !!0 and not just 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes should be just 0. The !!0 was there because the code was originally DVA_SET_GANG(&dva[0], !!(flags & ZDB_FLAG_GBH));
@@ -4550,6 +4761,8 @@ spa_vdev_attach(spa_t *spa, uint64_t guid, nvlist_t *nvroot, int replacing) | |||
return (spa_vdev_exit(spa, NULL, txg, ENOTSUP)); | |||
|
|||
pvd = oldvd->vdev_parent; | |||
if (pvd->vdev_ops == &vdev_mirror_ops && vdev_draid_max_rebuild > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since rebuild is not specific to draid, it should not have draid in the name for this parameter.
@@ -6911,4 +7136,7 @@ module_param(zio_taskq_batch_pct, uint, 0444); | |||
MODULE_PARM_DESC(zio_taskq_batch_pct, | |||
"Percentage of CPUs to run an IO worker thread"); | |||
|
|||
module_param(vdev_draid_max_rebuild, int, 0644); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since rebuild is not specific to draid, it should not have draid in the name for this parameter.
Thanks @rlaager I've updated the patch from your feedback. I'm not going to update this PR because I'll push the dRAID code soon, which includes the mirror rebuild support. |
One terminology note:
Instead of "DVA", I think you mean "segment of contiguously-allocated space", or something to that effect. If we actually had all the DVA's ( |
Something seems to be wrong with how github is rendering the diffs (most of the changes seem to be missing). I think it could be worthwhile to get this integrated before DRAID, because it's useful without DRAID (on mirrors), and I imagine it's considerably simpler than DRAID so easier to review/test. What about doing a second pass to verify the checksums of the copied data (by reading all the block pointers), after the fast pass rebuild? Will that always happen, never happen, or be user-selectable? |
As an enduser I would prefer to have this feature turned off by default, to be able to enable it on the fly without having to export pool or unload kernel modules, and after syncing the DVAs to be able to optionally run the new sequential scrub like oracle describes (that should be default) when blocks are traversed in disk order |
Closing PR. The feedback above was folded in to the dRAID patch #7078. |
This patch implements SPA-level rebuild for mirror and the upcoming dRAID (#3497) vdev's. It works by scanning the metaslabs and rebuilding each used segment of DVA. The redundancy layout (i.e. where the copies, data/parity blocks are) must be determined by the DVA only. That's why it does not work with raidz. Since the scan happens at SPA/metaslab level, it doesn't traverse the block pointer tree. The downside is that block checksum cannot be verified during the scan. The upside is that the IO is perfectly sequential and IO size is often larger than block size. Note that this is different from #3625, which is still resilver.
The following tests compared rebuild and resilver on the same hardware populated with exactly the same data:
The rebuild took 106 seconds while the resilver took 180 seconds:
The resilver took about 70% more time to complete. Note that this is a very ideal workload for resilver: new and nearly empty pool, mostly large files. But rebuild is sequential by design and is not that allergic to aging, fragmentation, or file sizes. So this test actually favors resilver.
The following two pictures shows graphics of IO data during the rebuild and the resilver:
In each picture, there are 4 graphs:
The most interesting part is the 3rd and 4th graphs showing average IO request sizes. For rebuild, the r/w request sizes averaged at about 1M, but for resilver the sizes were at around 128K, which is the recordsize. This is because rebuild scans metaslabs, and fixes used segments each of which may contain many blocks.
Comments and suggestions are very welcome. The code at this point is rough and hacky. It may panic and/or destroy your data. That said, I was able to test rebuilds successfully, and scrub found no errors.
How to use:
echo 0 > /sys/module/zfs/parameters/vdev_draid_max_rebuild