
Sequential Scrubs and Resilvers #6256

Merged 1 commit into openzfs:master on Nov 16, 2017

Conversation

@tcaputi (Contributor) commented Jun 21, 2017

Motivation and Context

Currently, scrubs and resilvers can take an extremely long time to complete. This is largely due to the fact that zfs scans process pools in logical order, as determined by each block's bookmark. This makes sense from a simplicity perspective, but blocks in zfs are often scattered randomly across disks, particularly due to zfs's copy-on-write mechanisms.

At the 2016 ZFS developer summit @skiselkov presented some experimental work he had done to greatly improve performance by breaking scrubs into 2 parts: metadata scanning and block scrubbing. The metadata scanning would sort blocks into sequential chunks which could be issued to disk much more efficiently by the scrubbing code. This patch is mostly his (WIP) work, ported to Linux and with a few small improvements.

This PR is currently being made so that we can use buildbot to detect bugs.

Description

This patch essentially consists of 4 pieces (at the moment).

  1. @skiselkov 's sequential scrubbing algorithm as described here and here
  2. improvements to the scrubbing algorithm so that its effects on txg_sync time are more consistent.
  3. performance improvements to the existing prefetching algorithm
  4. some code reorganization (since much of the existing scrub code is very old at this point)
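
As a rough illustration only (not part of the patch), here is how the two phases show up from userspace; the status wording and the zfs_scan_mem_lim_fact tunable are taken from later in this thread, and the exact names and paths may change before merge:

    # start a scrub and watch the two phases in the status output
    zpool scrub mypool
    # during the metadata-scanning phase "scanned" grows much faster than
    # "verified"; once the sorted queues are issued, "verified" catches up
    watch -n 5 zpool status mypool
    # the sorted I/O queues are bounded to a fraction of physical memory,
    # controlled by a module parameter, e.g.:
    cat /sys/module/zfs/parameters/zfs_scan_mem_lim_fact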

How Has This Been Tested?

Initial performance tests show scrubs completing 5-6x faster with this patch on a 47TB pool consisting of data mimicking our production environment. On zfs version 0.6.8, the scrub took a little over 125 hours, while it took only 22.5 hours with this patch. @skiselkov 's presentation at the developer summit cited a 16x performance improvement for a worst case scenario.

As this patch solidifies we will add more formalized automated tests.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

Special thanks to @alek-p and @skiselkov for doing most of the work.

@nwf (Contributor) commented Jun 21, 2017

I had done this port and was holding off on making a PR from it because @skiselkov indicated that he still wanted to make sweeping changes to the rangetree structure. If that's no longer true, then it's probably time to get this merged. :)

In my testing (which is what prompted #6209), I found it much better to raise scn_ddt_class_max to DDT_CLASS_UNIQUE and to disable prefetching entirely. You'll note that ddt_class_contains, in module/zfs/ddt.c, is particularly simple when we're scanning the entire DDT, avoiding any need to pull DDT ZAPs into the ARC.

/*
* Given a set of I/O parameters as discovered by the metadata traversal
* process, attempts to place the I/O into the sorted queues (if allowed),
* or immediately executes the I/O. The dummy flag can be set to

Contributor:
Set to... what?

Contributor Author (@tcaputi):
Sorry. Rebase bug...

/*
 * Given a set of I/O parameters as discovered by the metadata traversal
 * process, attempts to place the I/O into the sorted queues (if allowed),
 * or immediately executes the I/O. The dummy flag can be set to indicate
 * this IO has already been done and a placeholder should be used instead.
 */

@@ -1104,7 +1212,7 @@ dsl_scan_zil_block(zilog_t *zilog, blkptr_t *bp, void *arg, uint64_t claim_txg)
SET_BOOKMARK(&zb, zh->zh_log.blk_cksum.zc_word[ZIL_ZC_OBJSET],
ZB_ZIL_OBJECT, ZB_ZIL_LEVEL, bp->blk_cksum.zc_word[ZIL_ZC_SEQ]);

VERIFY(0 == scan_funcs[scn->scn_phys.scn_func](dp, bp, &zb));
scan_funcs[scn->scn_phys.scn_func](dp, bp, &zb, B_FALSE);

Contributor:
Why not VERIFY?

Contributor Author (@tcaputi):
there is currently only one scan_func_t: dsl_scan_scrub_cb() which has always (as far as I can tell) returned zero. The only info I could find about its purpose is a comment indicating that returning zero means /* do not relocate this block */ which doesn't seem to make sense anymore. Many of the callers didn't check the return code anyway, and the only ones that did simply did a VERIFY(0 ==...) as you see here.

As part of the cleanup I mentioned, I changed the function to return void.

Contributor:
Hah. "did not relocate this block" sounds like some of bprewrite slipping in. OK.

@tcaputi force-pushed the better_scrub branch 3 times, most recently from 6f49f97 to 59a74df on June 22, 2017 21:07
@tonyhutter (Contributor):

Just did an initial test on our 80-drive HDD JBOD. The test pool was set up as 8x 10-disk raidz2 vdevs:

	NAME        STATE     READ WRITE CKSUM
	mypool      ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    U0      ONLINE       0     0     0
	    U1      ONLINE       0     0     0
	    U14     ONLINE       0     0     0
	    U15     ONLINE       0     0     0
	    U28     ONLINE       0     0     0
	    U29     ONLINE       0     0     0
	    U42     ONLINE       0     0     0
	    U43     ONLINE       0     0     0
	    U56     ONLINE       0     0     0
	    U70     ONLINE       0     0     0
	  raidz2-1  ONLINE       0     0     0
...
	  raidz2-7  ONLINE       0     0     0
...

$ zpool list
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
mypool   580T   439G   580T         -     0%     0%  1.00x  ONLINE  -

The dataset was various copies of /usr, /boot, /bin, /opt, and /lib.

Before

  pool: mypool
 state: ONLINE
  scan: scrub in progress since Thu Jun 22 17:17:26 2017
	437G scanned out of 439G at 559M/s, 0h0m to go
	0B repaired, 99.56% done
config:

...

  scan: scrub repaired 0B in 0h13m with 0 errors on Thu Jun 22 17:30:49 2017

After

  pool: mypool
 state: ONLINE
  scan: scrub in progress since Thu Jun 22 17:33:55 2017
    439G scanned at 1.60G/s, 438G verified out of 439G at 1.60G/s, 99.85% done
    0B repaired, 0h0m to go
config:

...

  scan: scrub repaired 0B in 0h4m with 0 errors on Thu Jun 22 17:38:30 2017

So looks like a 3.25x speed-up. Very nice!

@tcaputi (Contributor Author) commented Jun 23, 2017

@kpande I'll fix the wording soon. It's trying to say too many things using the old text ATM. Right now I just wanted to see how happy buildbot is with it.

@angstymeat:

Well, I can say that it took the scrub time on one of our systems down from 20 hours 25 minutes to 8 hours 41 minutes.

Two other systems had their scrub times remain roughly the same, but they hadn't been scanned in a while and their space usage had significantly increased since the last scrub, so the rates at which they were scrubbed increased.

I know this doesn't have anything to do with this patch, but it would be nice if the amount of data that was scrubbed was reported when running zpool status, in addition to just how much was repaired.

@tcaputi (Contributor Author) commented Jun 27, 2017

@angstymeat
Can I ask what kind of data you have in your pool? (number of snapshots / datasets, size of files, recordsize, etc.). I'm kind of surprised that you and @tonyhutter aren't seeing more dramatic speed increases.

I know this doesn't have anything to do with this patch, but it would be nice if the amount of data that was scrubbed was reported when running zpool status, in addition to just how much was repaired.

I'm not 100% sure what you mean by this. During the scrub, zpool status does report how much data has currently been scanned and verified. Afterwards, the amount scanned should just be the amount of data in your pool.

@angstymeat:

I thought that was a pretty good speed increase.

We handle real-time seismic data from around the world. This particular system receives data from a number of reference stations and devices that are being tested. We can get thousands of files a day between 56k and 8m from each station, and we have data going back 10 years.

We don't need speed for the most part, and the average daily data rate isn't high, but we need to hold onto the data for a long time (currently around 10 years), and the data has to be kept safe. The system is built more for redundancy and safety than it is for speed, including multiple backups to on-site and off-site storage each night.

The system itself is running under VMWare vSphere 5.0, and its vm has 6 CPUs with 24GB of RAM.

The pool is set up as follows, with ashift=12:

  pool: storage
 state: ONLINE
  scan: scrub repaired 0B in 8h41m with 0 errors on Tue Jun 27 00:22:11 2017
config:
	NAME                                        STATE     READ WRITE CKSUM
	storage                                     ONLINE       0     0     0
	  mirror-0                                  ONLINE       0     0     0
	    scsi-3600144f0e632cf000000519eb372002b  ONLINE       0     0     0
	    scsi-3600144f0967d8e00000051a5380b0007  ONLINE       0     0     0
	  mirror-1                                  ONLINE       0     0     0
	    scsi-3600144f0e632cf000000519eb3fc002c  ONLINE       0     0     0
	    scsi-3600144f0967d8e00000051a538100008  ONLINE       0     0     0
	  mirror-2                                  ONLINE       0     0     0
	    scsi-3600144f0e632cf000000519eb400002d  ONLINE       0     0     0
	    scsi-3600144f0967d8e00000051a538140009  ONLINE       0     0     0

errors: No known data errors

and

NAME      SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
storage  5.44T  4.43T  1.01T         -    36%    81%  1.00x  ONLINE  -

The drives are some older 2TB Seagate enterprise drives running off a Sun X4500 that is configured with OpenIndiana's comstar software to provide individual disks remotely as block storage over a 4gb fibre channel connection. The connection itself is aggregated over 4 links across two cards for increased performance and redundancy.

The vm is mounting the disks as raw devices (no vmware filesystem is involved). We run another 50 virtual machines across 2 servers, each accessing disks from 1 of 3 sets of JBODs. Some are 4Gbps and others are 8Gbps.

It's not blazing fast, but it's not terribly slow either. Tests a few years ago showed that I could sustain 6Gb/s read speeds off a ZFS pool if I set it up for it.

The disks are under a constant write load. Nothing too high, maybe a few megabytes a second, but it is also used to examine the data files which can cause a lot of them to be loaded at once, so it can have large bursts of data reads.

I'm using LZ4, which gives us an overall 1.39 compression ratio.

There are 4 file systems:

NAME              USED  AVAIL  REFER  MOUNTPOINT
storage          4.43T   863G   160K  none
storage/asl      8.96G   863G  8.96G  /export/data/asl
storage/data     6.88G   863G  6.88G  /export/data/data
storage/newpier  19.9M   863G  19.9M  /export/home/newpier
storage/rt       4.41T   863G  4.29T  /export/home/rt

I'm using zfs-auto-snapshot to make snapshots. I keep 36 hourly snapshots, 14 daily, 10 weekly, and 6 monthly. There are currently 264 snapshots.

We have a couple of systems set up this way. This one is used for reference and testing. The others collect more data from more stations and perform processing before they are sent off to be archived.

This system can be taken down for updates, testing, or maintenance more easily than the others.

As for this:

I know this doesn't have anything to do with this patch, but it would be nice if the amount of data that was scrubbed was reported when running zpool status, in addition to just how much was repaired

It will show you how much is being scrubbed while it is running:

  scan: scrub in progress since Mon Jun 26 15:41:09 2017
    606G scanned at 1.43G/s, 2.57M verified out of 4.42T at 6.18K/s, 0.00% done
    0B repaired, (scan is slow, no estimated time)

but once it is completed you only see how much was repaired.

  scan: scrub repaired 0B in 8h41m with 0 errors on Tue Jun 27 00:22:11 2017

It has nothing to do with your patch. I should probably just open a feature request someday.

@tcaputi (Contributor Author) commented Jun 28, 2017

I just pushed a commit for buildbot to test. I would not try this commit out, since it is very WIP. I will clean everything up and rebase once it looks like everything is working.

@@ -3449,8 +3446,7 @@ scan_io_queue_vacate_cb(range_tree_t *rt, void *arg)
{
dsl_scan_io_queue_t *queue = arg;
void *cookie = NULL;
while (avl_destroy_nodes(&queue->q_exts_by_size, &cookie) != NULL)
;
while (avl_destroy_nodes(&queue->q_exts_by_size, &cookie) != NULL);

Contributor:
Oof. Is that really the preferred style?

Contributor Author (@tcaputi):
It's my preferred style, but I'm not looking at style right now :). I will rebase everything and clean it up once I'm sure the new range tree isn't causing any more problems.

@tcaputi force-pushed the better_scrub branch 10 times, most recently from 0789f28 to 248573a on June 29, 2017 21:14
@angstymeat:

Just to make it clear before I get into the error I just found: I'm using the version of this patch from 3 days ago (before e121c36, the one before you posted that you were making some updates to it and not to use that version).

Zpool scrub is reporting errors in files when running with this patch. I've tried it twice, now, and on one of our machines I get errors on both drives in the same mirror. I don't get these errors and the scrub completes without error when I use the current git master.

  pool: kea
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: scrub in progress since Thu Jun 29 15:13:18 2017
    618G scanned at 182M/s, 527G verified out of 3.91T at 155M/s, 13.18% done
    0B repaired, 6h22m to go
config:

	NAME                                        STATE     READ WRITE CKSUM
	kea                                         DEGRADED     0     0   276
	  mirror-0                                  ONLINE       0     0     4
	    scsi-3600144f0967d8e00000051b65e610001  ONLINE       0     0     4
	    scsi-3600144f0e632cf00000051b65fae0001  ONLINE       0     0     4
	  mirror-1                                  DEGRADED     0     0   270
	    scsi-3600144f0967d8e00000051b65e730002  DEGRADED     0     0   270  too many errors
	    scsi-3600144f0e632cf00000051b65fb30002  DEGRADED     0     0   270  too many errors
	  mirror-2                                  ONLINE       0     0     0
	    scsi-3600144f0967d8e00000051b65e780003  ONLINE       0     0     0
	    scsi-3600144f0e632cf0000005445398e0003  ONLINE       0     0     0
	  mirror-3                                  ONLINE       0     0     2
	    scsi-3600144f0967d8e00000051b65e7c0004  ONLINE       0     0     2
	    scsi-3600144f0e632cf00000051b65fb80004  ONLINE       0     0     2
	logs
	  mirror-4                                  ONLINE       0     0     0
	    scsi-3600c0ff000130d30d6552c5301000000  ONLINE       0     0     0
	    scsi-3600c0ff000130e6fb15e2c5301000000  ONLINE       0     0     0

@tcaputi (Contributor Author) commented Jun 29, 2017

@angstymeat Thanks for bringing this up. The bug I am trying to nail down currently may be contributing to this (although it sounds unlikely from your description). Can you give me any more info about this? What kind of workload is running during the scrub? What kind of data is in the pool? If you could find a consistent way to reproduce it that would be wonderful (although that might be asking for a bit much).

@angstymeat:

It's the same kind of data that's in the system I was previously testing on. Seismic waveform data, filesystem using lz4 compression. The data doesn't really compress much, but a lot of the output we generate from it does compress.

It's under a constant write load, but not much. iostat is showing around 450k/s being written.

I'll say that this pool has a bit of a history; it was one of the first I used ZFS on back under Solaris 10 when it first came out, and was migrated to ZoL once we dropped Sun hardware. It's been running under ZoL since the first month it was able to mount file systems, and has been updated as ZoL has updated.

The pool has been recreated several times, but the data was transferred via zfs send/receive since the Solaris days.

One thing is that this pool has always been slow. Even now it takes 31+ hours to scrub, even though it is running the same kinds of disks from the same storage servers the previous system is using. The configuration is the same, other than it uses 5 sets of mirrors instead of 4, and the scrub time is almost 50% longer.

It's happening every time I scrub with patch #6256, and it is the same set of files each time. I haven't let it scrub through to completion, though. The errors go away when I scrub it with the current master.

I'm not sure what else I can provide. I can give you some zdb output, if you can think of anything that would be helpful. I'm currently running master, but I can switch back to the patched version if you need me to.

I've been testing this on 5 systems (2 personal), and this is the only one I've seen a scrub issue on.

range_tree_ops_t *rt_ops;

/* rt_avl_compare should only be set it rt_arg is an AVL tree */

Member:
it ->if ?

.ad
.RS 12n
Minimum time "long-lived" blocks are locked in the ARC, specified in jiffies.
A value of 0 will default to 6 seconds.

Member:
can we specify this in milliseconds instead, please?

Contributor Author (@tcaputi):
Brian and I talked about this. I think we decided it would be best to make a pass through the code and clean up all the tunables that use jiffies in a later PR. This one simply works the same as zfs_arc_min_prefetch_lifespan for the moment.

@behlendorf (Contributor) commented Nov 9, 2017:
@tcaputi you've already fixed this in the code, but zfs-module-parameters.5 wasn't updated. There are two tunables: arc_min_prefetch_ms, which was previously called zfs_arc_min_prefetch_lifespan, and arc_min_prescient_prefetch_ms, which is what the "long-lived" tunable was renamed to.

.RS 12n
Strategy used to determine how data verification should be done during scrubs.
If set to 1, I/Os will always be issued sequentially. This may help performance
if the pool's data is very disperate. If set to 2, I/Os will be always be issued

Member:
It would be nice if we never needed to change the default behavior. But assuming that we need to document this, can I suggest an alternate wording:

Determines the order that data will be verified while scrubbing or resilvering. If set to 1, data will be verified as sequentially as possible, given the amount of memory reserved for scrubbing (see zfs_scan_mem_lim_fact). This may improve scrub performance if the pool's data is very fragmented. If set to 2, the largest mostly-contiguous chunk of found data will be verified first. By deferring scrubbing of small segments, we may later find adjacent data to coalesce and increase the segment size.

Contributor Author (@tcaputi):
I will take your wording for now. I found this tunable very useful for playing with performance on different pools, so I'd feel bad removing it.
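
For anyone who wants to experiment with it, a hedged usage sketch follows; the hunk above doesn't show the parameter's name, so I'm assuming it is exported as zfs_scan_issue_strategy alongside the other scan tunables:

    # per the wording above: 1 = verify data as sequentially as possible,
    # 2 = verify the largest mostly-contiguous chunk of found data first
    cat /sys/module/zfs/parameters/zfs_scan_issue_strategy
    echo 2 > /sys/module/zfs/parameters/zfs_scan_issue_strategy
    zpool scrub mypool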

@behlendorf (Contributor) commented Nov 9, 2017

Testing results from a real world worst case pool containing ~97G of small files.

TL;DR: performance can be improved 3x by changing the default values for zfs_scan_top_maxinflight to 1024 and zfs_vdev_aggregation_limit to 1048576. Without tuning, this PR degrades performance compared to master for my test pool.

The test pool is a copy of a 3 month old filesystem containing a few million small files. The workload for those 3 months involved frequently creating and unlinking small files resulting in a fragmented (37%) but relatively empty (6% full) pool.

$ zpool list -v
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
jet1      1.45T  97.4G  1.36T     13.1T    37%     6%  1.00x  ONLINE  -
  mirror   744G  48.7G   695G     6.55T    37%     6%
    L1      -      -      -         -      -      -
    L2      -      -      -         -      -      -
  mirror   744G  48.7G   695G     6.55T    38%     6%
    L3      -      -      -         -      -      -
    L4      -      -      -         -      -      -

The test case was to measure the efficiency of resilvering both mirrors and verify that sequential resilver does behave as designed. This included measuring the total resilver time and checking for the expected IO pattern.

$ zpool status -v
...
	NAME             STATE     READ WRITE CKSUM
	jet1             DEGRADED     0     0     0
	  mirror-0       DEGRADED     0     0     0
	    replacing-0  DEGRADED     0     0     0
	      L1        FAULTED      0     0     0  external device fault
	      L5        ONLINE       0     0     0  (resilvering)
	    L2          ONLINE       0     0     0
	  mirror-1       DEGRADED     0     0     0
	    replacing-0  DEGRADED     0     0     0
	      L3        FAULTED      0     0     0  external device fault
	      L6        ONLINE       0     0     0  (resilvering)
	    L4  

Surprisingly, when comparing this PR (5d6e310) with master, resilver time increased by 14%, as shown in the first two columns below. Not good, so what's going on?

[image: resilver time comparison, master vs. this PR]

Profiling with zpool iostat -r to get a request size histogram showed this was due to zfs_scan_top_maxinflight being set to 32 by default. The IO aggregation code depends on there being pending zios in the queue which can be aggregated. By trickling them out slowly we're effectively defeating the aggregation logic, which results in the resilver issuing lots of small, but sequential, IOs. Increasing zfs_scan_top_maxinflight to 256 resolves the aggregation issue; as you can see, fewer, larger IOs are now being issued. This change won't result in the devices being overwhelmed, as cautioned in the comment above zfs_scan_top_maxinflight, since zfs_vdev_scrub_max_active further limits the number of inflight IOs.

[image: zpool iostat -r request size histogram with zfs_scan_top_maxinflight=256]

Additional performance gains are possible by increasing zfs_vdev_aggregation_limit from 128K to 1M. Since we know scrub IOs are going to be sequential and thus easily aggregated, we should allow them to be merged into larger IOs. As shown in the stacked bar graph below, by increasing both zfs_vdev_aggregation_limit and zfs_scan_top_maxinflight we can rebuild 60% of this fragmented pool using only 128K or larger disk IOs, compared with <1% with the default tunings and ~40% with zfs_scan_top_maxinflight=256.

[image: stacked bar graph of disk IO sizes for the three tunings]

According to my testing the sweet spot is to update the default values for these two tunings as follows.

zfs_scan_top_maxinflight=1024
zfs_vdev_aggregation_limit=1048576

The above observations also suggest a potential future optimization. Since we know scrub IOs will be largely sequential it would be reasonable to submit them from the start as delegated IOs. This would remove the need for the low level aggregation logic to detect that they're sequential and perform the aggregation. This might further improve performance but obviously we'd want detailed performance data to confirm it actually helps.
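
For anyone reproducing these results, a sketch of applying the suggested values at runtime (assuming both stay exposed as ordinary writable module parameters; note that zfs_scan_top_maxinflight was reworked later in the review):

    # suggested defaults from the testing above
    echo 1024    > /sys/module/zfs/parameters/zfs_scan_top_maxinflight
    echo 1048576 > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit
    # request-size histogram; 128K+ aggregated scrub/resilver I/Os should
    # now dominate while a scrub or resilver is running
    zpool iostat -r 5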

@tcaputi force-pushed the better_scrub branch 2 times, most recently from aab6610 to fe52369 on November 10, 2017 04:26
@tcaputi (Contributor Author) commented Nov 10, 2017

@behlendorf Haha, it seems that we've made scans so fast with your tunables that we've broken 3 of the tests (zpool_scrub_002_pos, zpool_reopen_003_pos, zpool_reopen_004_pos) that expect to be able to run a command or 2 before a scrub or resilver completes. I'll try increasing the amount of data in the pool to see if that helps.

@angstymeat commented Nov 10, 2017

Just to throw in some more performance numbers...

My home system has a 3-disk raidz1 array that is 3TB in size.

It started out at over 11 hours to scrub before this patch (#6256).

With the patch applied back in August, it took around 4 hours to scrub with roughly 1.1TB of free space left.

After the latest patch I applied yesterday, it took 5 hours and 2 minutes, with 478GB free (it took longer, but it was scrubbing more data than in August).

After I adjusted the tunables above, my scrub time on the same amount of data dropped to 3.5 hours.

@@ -1780,18 +1798,81 @@ Default value: \fB75\fR.
.sp
.ne 2
.na
\fBzfs_top_maxinflight\fR (int)
\fBzfs_scan_top_maxinflight\fR (int)

Contributor:
Changed to zfs_scan_top_bytes, needs to be updated. Testing shows that we should default to 4M.

.RE

.sp
.ne 2
.na
\fBzfs_scan_idle\fR (int)
\fBzfs_scan_mem_lim_fact\fR (int)

Contributor:
Can you also make sure all the zfs_scan_* entries are listed together in alphabetical order.

To preserve progress across reboots the sequential scan algorithm periodically
needs to stop metadata scanning and issue all the verifications I/Os to disk.
The frequency of this flushing is determined by the
\fBfBzfs_scan_checkpoint_intval\fR tunable.

Contributor:
Extra fB here needs to be removed.
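
Unrelated to the typo, a minimal sketch of checking the checkpoint interval at runtime, assuming zfs_scan_checkpoint_intval is exposed like the other zfs_scan_* module parameters (see zfs-module-parameters(5) for its units and default):

    # how often an in-progress scan pauses metadata scanning and issues its
    # queued verification I/Os so that progress survives a reboot
    cat /sys/module/zfs/parameters/zfs_scan_checkpoint_intval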

a non-scrub or non-resilver I/O operation has occurred within the past
\fBzfs_scan_idle\fR ticks.
To preserve progress across reboots the sequential scan algorithm periodically
needs to stop metadata scanning and issue all the verifications I/Os to disk.

Contributor:
s/verifications/verification

@tcaputi force-pushed the better_scrub branch 13 times, most recently from d735014 to 756f774 on November 14, 2017 17:16
Currently, scrubs and resilvers can take an extremely
long time to complete. This is largely due to the fact
that zfs scans process pools in logical order, as
determined by each block's bookmark. This makes sense
from a simplicity perspective, but blocks in zfs are
often scattered randomly across disks, particularly
due to zfs's copy-on-write mechanisms.

This patch improves performance by splitting scrubs
and resilvers into a metadata scanning phase and an IO
issuing phase. The metadata scan reads through the
structure of the pool and gathers an in-memory queue
of I/Os, sorted by size and offset on disk. The issuing
phase will then issue the scrub I/Os as sequentially as
possible, greatly improving performance.

This patch also updates and cleans up some of the scan
code which has not been updated in several years.

Authored-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Authored-by: Alek Pinchuk <apinchuk@datto.com>
Authored-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf merged commit d4a72f2 into openzfs:master on Nov 16, 2017
@pruiz commented Jan 4, 2018

@behlendorf is this scheduled for 0.8?

@mailinglists35:

@angstymeat when you say 'tunables above', do you refer to what Brian said earlier?

"Testing results from a real world worst case pool containing ~97G of small files. TLDR, performance can be improved 3x by changing the default values for zfs_scan_top_maxinflight to 1024 and zfs_vdev_aggregation_limit to 1048576."

@behlendorf (Contributor):

@pruiz yes.

Nasf-Fan pushed a commit to Nasf-Fan/zfs that referenced this pull request Jan 29, 2018
Nasf-Fan pushed a commit to Nasf-Fan/zfs that referenced this pull request Feb 13, 2018
FransUrbo pushed a commit to FransUrbo/zfs that referenced this pull request Apr 28, 2019