0.6.5-1 fsync performance regression #3780

Closed · hhhappe opened this issue Sep 15, 2015 · 30 comments
Labels: Type: Performance (performance improvement or performance problem)

@hhhappe commented Sep 15, 2015

After upgrading to 0.6.5-1, fsyncs are really slow. For example:

dd if=/dev/zero of=/zfs/t0 bs=1M count=128 conv=fsync
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 11.0608 s, 12.1 MB/s

With 0.6.4-2 this was ~800 MB/s. This is an 18-disk raidz2.

I've tried creating smaller raidz2, raidz, mirror, and no-redundancy pools. Only single-disk and mirrored filesystems give more than 100 MB/s.

Config:

CentOS 6.7
kernel: 2.6.32-573.3.1.el6.x86_64
1x E5-2640v3
128GB memory
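
A reproduction along these lines should exercise the same path (a sketch only: the pool name and device names are placeholders, not taken from the report):

zpool create -f tank raidz2 sdb sdc sdd sde sdf sdg sdh sdi sdj \
    sdk sdl sdm sdn sdo sdp sdq sdr sds
dd if=/dev/zero of=/tank/t0 bs=1M count=128 conv=fsync
zpool destroy tank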

@Bronek commented Sep 15, 2015

Which kernel version?

@hhhappe (Author) commented Sep 15, 2015

See the last part of my message.

@ryao (Contributor) commented Sep 15, 2015

This is likely a regression caused by b39c22b. It improved latencies, but possibly at the cost of reduced throughput. We should probably revisit this.

That said, I am surprised to hear that fsync was so fast with 0.6.4 in this setup. It sounds like you are using 18 mechanical disks in a single raidz2 vdev. Unless sync=disabled is set, I would definitely not expect to see 800 MB/sec from that dd command. The internal rate is probably higher given how things work under the hood, but in either case that level of performance should only be possible with a SLOG device or flash storage, because 800 MB/sec of 1 MB synchronous writes implies 800 IOPS. With a single 18-disk raidz2 vdev of mechanical disks you should not be able to get more than 200 IOPS even from the best drives. 800 IOPS far exceeds what such disks can typically do, although it may not be impossible if the writes are laid out sequentially on the disks.

What kinds of disks are these and what kind of controller(s) are in use? Also, were these tests done with anything other than compression=none and sync=standard in either case?
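
The settings ryao asks about can be confirmed per dataset (a sketch; the pool name 'tank' is a placeholder):

zfs get compression,sync tank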

@ryao (Contributor) commented Sep 15, 2015

Just as a point of reference for others, here are figures from my workstation, which has a mirrored pair of Intel 730 SSDs, uses Linux 4.1 and everything is on ZFS with compression=lz4, dedup=on and sufficient memory that the entire DDT should be in RAM:

desktop ~ # dd if=/dev/zero of=t0 bs=1M count=128 conv=fsync
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 0.0401037 s, 3.3 GB/s
desktop ~ # rm t0 
desktop ~ # dd if=/dev/urandom of=t0 bs=1M count=128 conv=fsync
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 8.82549 s, 15.2 MB/s

There is definitely room for improvement (15.2 MB/sec is too slow), but in the /dev/zero case compression's zero detection and deduplication meant that no data blocks actually had to be written, so things became incredibly fast (i.e. 3300 IOPS). Either compression or deduplication alone would be sufficient for this.

@ryao (Contributor) commented Sep 15, 2015

I just did a flame graph analysis of the urandom case. It is CPU bound by the random number generator. If I adjust for that by copying the random output to a tmpfs, writes become bandwidth limited and I see realistic throughput numbers from dd:

desktop ~ # mv t0 /tmp/
desktop ~ # dd if=/tmp/t0 of=t0 bs=1M count=128 conv=fsync
128+0 records in
128+0 records out
134217728 bytes (134 MB) copied, 0.510636 s, 263 MB/s
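
The flame graph step itself isn't shown above; one common way to produce it, assuming Brendan Gregg's FlameGraph scripts are cloned into ./FlameGraph, is:

perf record -F 99 -a -g -- dd if=/dev/urandom of=t0 bs=1M count=128 conv=fsync
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > fsync.svg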

@hhhappe (Author) commented Sep 15, 2015

@ryao: You just had me question my understanding of conv=fsync, so I did a quick strace. It only does one fsync, as I expected.

This is plain LSI SAS hardware (Dell MD3060e and 9207-e8) and Seagate ST6000NM0034 disks.

To get to 800 MB/s I need to raise zfs_vdev_sync_write_max_active. Without fsync these 18 disks can at times reach 3 GB/s.
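
The exact value used isn't stated; the tunable is exposed under /sys/module/zfs/parameters, so raising it looks like this (the value 32 is illustrative, the default is 10, and writing it requires root):

echo 32 > /sys/module/zfs/parameters/zfs_vdev_sync_write_max_active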

@ryao (Contributor) commented Sep 16, 2015

@hhhappe I had not realized that conv=fsync only did one fsync() call. Thanks for the correction.

To what value are you setting zfs_vdev_sync_write_max_active?

@hhhappe (Author) commented Sep 16, 2015

Hmm, the zfs_vdev_sync_write_max_active increase helped for 0.6.3, but 0.6.4.2 performs well without it. It helped for file writes ending with an fsync. I need to revisit that.

The 0.6.5 tests were done with default parameters.

@behlendorf (Contributor)

@hhhappe did the strace show all the time being spent in the call to fsync?

@hhhappe (Author) commented Sep 17, 2015

The strace shows almost all of the time being spent in fsync. One 128x1M run gave ~200us per write() and 12s in fsync().
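
A per-call timing like this can be captured with strace's -T flag, which appends the time spent in each syscall (paths as in the original test):

strace -T -e trace=write,fsync dd if=/dev/zero of=/zfs/t0 bs=1M count=128 conv=fsync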

@alexanderhaensch

I can confirm this on my system too:

with lz4 compression:
  with fsync:    134217728 bytes (134 MB) copied, 0.175775 s, 764 MB/s
  without fsync: 134217728 bytes (134 MB) copied, 0.072671 s, 1.8 GB/s

without compression:
  with fsync:    134217728 bytes (134 MB) copied, 4.12551 s, 32.5 MB/s
  without fsync: 134217728 bytes (134 MB) copied, 0.0699792 s, 1.9 GB/s

With a larger payload and without compression:
  with fsync:    10485760000 bytes (10 GB) copied, 27.7684 s, 378 MB/s
  without fsync: 10485760000 bytes (10 GB) copied, 7.2256 s, 1.5 GB/s

The large file is faster than the small one!

behlendorf added this to the 0.7.0 milestone Sep 22, 2015
behlendorf added the "Type: Performance" and "Bug - Point Release" labels Sep 22, 2015
@behlendorf (Contributor)

@hhhappe unfortunately I haven't been able to reproduce this problem. fsync(2) performance appears unchanged in my testing between 0.6.4.2 and 0.6.5.1 when using a single-disk pool under CentOS 6.7. It's certainly possible this problem only manifests itself for larger configurations, but I haven't had a chance to test those yet. Where is the effect more pronounced in your testing, large raidz2 configurations?

@alexanderhaensch do you have any data comparing performance with 0.6.4.2 vs 0.6.5.1? We'd expect the fsync tests to be slower than the non-fsync ones; the question is how the test results compare across releases.

@chjohnst

I also ran tests to see if there is a diff; 0.6.5 actually performed slightly faster (could be noise). Are you certain this isn't a hardware-related issue?

@alexanderhaensch

I ran some tests on different systems. I found one system out of 5 that seems unaffected, but all the others do show the problem. The hardware is very different between the systems, so I don't see a common thread here.

I will only show the conv=fsync results on uncompressed filesystems because they are the most significant.

First system: LSI SAS2308, two disks, no redundancy.

Was:
uname -a: Linux iris 3.13.0-52-generic #86-Ubuntu SMP
ZFS: Loaded module v0.6.4.2-1trusty, ZFS pool version 5000, ZFS filesystem version 5
191 MB/s

Then:
ZFS: Loaded module v0.6.5.1-1trusty, ZFS pool version 5000, ZFS filesystem version 5
37.7 MB/s

-> That's a difference.

AMD Phenom X2 with onboard storage controller (AMD SB8x0), raidz1, 3 disks:

ZFS: Loaded module v0.6.4.2-r0-gentoo, ZFS pool version 5000, ZFS filesystem version 5
76.8 MB/s

ZFS: Loaded module v0.6.5-r1-gentoo, ZFS pool version 5000, ZFS filesystem version 5
70.5 MB/s

-> No difference here.

Then I have an HP ProLiant MicroServer. That's also an AMD system with a very similar controller and disks of the same type as in the system described before. It has 4 disks and a smaller CPU. Unfortunately I only have the result for 0.6.5.1, because the system has auto-updates and I do not know how to downgrade ZFS on Ubuntu.
The transfer value is 4.2 MB/s!
-> That's too low, I guess.

@hhhappe (Author) commented Sep 23, 2015

Here are some 0.6.5.1 numbers for striped and raidz2 pools at the regular disk counts (each result line is prefixed "x:", where x is the disk count). I noticed a difference in the first run right after pool creation, so I included it as the first result for each disk count. For the very common 10- and 18-disk raidz2 layouts the first run is faster, but not fast enough.

I will go and downgrade now and get the numbers for 0.6.4.2.

(stripe)

1: 134217728 bytes (134 MB) copied, 10.1796 s, 13.2 MB/s
1: 134217728 bytes (134 MB) copied, 0.786758 s, 171 MB/s

2: 134217728 bytes (134 MB) copied, 5.54385 s, 24.2 MB/s
2: 134217728 bytes (134 MB) copied, 0.999162 s, 134 MB/s

4: 134217728 bytes (134 MB) copied, 1.8845 s, 71.2 MB/s
4: 134217728 bytes (134 MB) copied, 1.54637 s, 86.8 MB/s

8: 134217728 bytes (134 MB) copied, 1.26512 s, 106 MB/s
8: 134217728 bytes (134 MB) copied, 1.28457 s, 104 MB/s

16: 134217728 bytes (134 MB) copied, 1.24824 s, 108 MB/s
16: 134217728 bytes (134 MB) copied, 0.901596 s, 149 MB/s

(raidz2)

3: 134217728 bytes (134 MB) copied, 5.89388 s, 22.8 MB/s
3: 134217728 bytes (134 MB) copied, 6.27775 s, 21.4 MB/s

4: 134217728 bytes (134 MB) copied, 5.22175 s, 25.7 MB/s
4: 134217728 bytes (134 MB) copied, 4.76767 s, 28.2 MB/s

6: 134217728 bytes (134 MB) copied, 4.02013 s, 33.4 MB/s
6: 134217728 bytes (134 MB) copied, 3.85458 s, 34.8 MB/s

10: 134217728 bytes (134 MB) copied, 2.59336 s, 51.8 MB/s
10: 134217728 bytes (134 MB) copied, 7.77536 s, 17.3 MB/s

18: 134217728 bytes (134 MB) copied, 2.89049 s, 46.4 MB/s
18: 134217728 bytes (134 MB) copied, 10.7906 s, 12.4 MB/s

@behlendorf (Contributor)

@alexanderhaensch the most likely cause of the performance regression is commit b39c22b. Would it be possible for you to revert this change from 0.6.5.1 and see if performance improves? Also, are you using an external SSD-based log device?

Here's what I think is happening. Commit b39c22b was designed to ensure that synchronous I/Os, like those written to the ZIL, get immediately sent to disk. They're not allowed to queue in memory, which would increase their latency. This is great if you assume your log is on an SSD with low latency. However, if your log is stored on the primary pool this is going to be terrible. In that case you want to aggregate as much as you can, since the write is going to be expensive! We're going to need to distinguish these cases.
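
For anyone wanting to test this hypothesis, reverting the commit for a test build would look roughly like this (tag name assumed from the release under test; rebuild and reinstall the modules afterwards as usual for the distribution):

git clone https://github.com/zfsonlinux/zfs.git
cd zfs
git checkout zfs-0.6.5.1
git revert b39c22b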

@alexanderhaensch

I have a system without a dedicated log device where I can test this. My patchset on 0.6.5.1 keeps growing... :)
The system with the log device is on 0.6.4.2 because it needs to stay in production.

@alexanderhaensch

In my case b39c22b is not the culprit. The system without b39c22b performs exactly the same.

@behlendorf (Contributor)

@alexanderhaensch in that case we may need to git bisect this to determine where the regression was introduced.

@alexanderhaensch

I have to take my statement back. The system I tested is slow on both 0.6.4.2 and 0.6.5.1.

@hhhappe (Author) commented Sep 24, 2015

Juggling too many things at the moment, I don't have a complete set of results for 0.6.4.2. Also, I noticed that the disks' write cache was not on. Apparently that makes a huge difference on this test system (single MD3060 enclosure). I need to dig deeper here.
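
For SAS disks like these, the volatile write cache can be inspected and toggled with sdparm (a sketch; /dev/sdX is a placeholder):

sdparm --get=WCE /dev/sdX    # WCE: 1 means the write cache is enabled
sdparm --set=WCE /dev/sdX    # turn it on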

We have another system with 3x MD3060e enclosures connected to two servers. This is close to production mode, so I will not mess with the pool setup (18-disk raidz2), but I will upgrade one server and see what happens.

Anyway, here are the 0.6.4.2 results for 18 disk raidz2 with 128MB and 1GB file size (disk write cache off).

Single MD3060e (same as before):

134217728 bytes (134 MB) copied, 1.23128 s, 109 MB/s
1073741824 bytes (1.1 GB) copied, 8.57962 s, 125 MB/s

So about 10 times better than 0.6.5.1, and increasing the file size does not help much.

3x MD3060e:

134217728 bytes (134 MB) copied, 0.227799 s, 589 MB/s
1073741824 bytes (1.1 GB) copied, 1.22117 s, 879 MB/s

A lot better.

@behlendorf (Contributor)

@hhhappe thanks for getting this data. Have you done the experiment of reverting b39c22b? Did it help?

behlendorf added this to the 0.6.5.3 milestone Sep 24, 2015
behlendorf removed this from the 0.6.5.2 milestone Sep 24, 2015
behlendorf added a commit to behlendorf/zfs that referenced this issue Sep 24, 2015
Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio
based on the ZIO_PRIORITY_* flag passed in.  This had the unnoticed
side-effect of making the vdev_disk_io_start() synchronous for
certain I/Os.

This in turn resulted in vdev_disk_io_start() being able to
re-dispatch zio's which could result in RCU stalls when a disk
was removed from the system.  Additionally, this could negatively
impact performance and may explain the performance regressions
reported in both openzfs#3829 and openzfs#3780.

This patch resolves the issue by making the blocking behavior
dependent on a 'wait' flag being passed rather than overloading
the passed bio flags.

Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to
non-rotational devices where there is no benefit to queuing to
aggregate the I/O.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#3780
Issue openzfs#3829
Issue openzfs#3652
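
The rotational/non-rotational distinction the patch relies on is the kernel's per-device queue flag, visible in sysfs (device name is a placeholder):

cat /sys/block/sdX/queue/rotational    # 1 = spinning disk, 0 = SSD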
@hhhappe (Author) commented Sep 25, 2015

I haven't tried reverting b39c22b yet. Just tried 0.6.5.1 on the 3x MD3060e setup and it is even worse for the 18-disk raidz2 pool:

134217728 bytes (134 MB) copied, 26.0719 s, 5.1 MB/s
1073741824 bytes (1.1 GB) copied, 194.238 s, 5.5 MB/s

@hhhappe (Author) commented Sep 25, 2015

Okay, reverting b39c22b on the single MD3060e system did the trick. In fact, it is a lot faster than 0.6.4.2:

134217728 bytes (134 MB) copied, 0.205837 s, 652 MB/s
1073741824 bytes (1.1 GB) copied, 0.757666 s, 1.4 GB/s

Now, would that be a fix?

@alexanderhaensch

@hhhappe is your latest report on the same configuration as two days ago? Is sync=disabled set? Back then you reported between 12.4 MB/s and 46.4 MB/s.
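
For reference, the property can be checked per dataset (pool name is a placeholder):

zfs get sync tank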

@alexanderhaensch

@behlendorf Success! Your commit 393ee23 solves the problem for me!
134217728 bytes (134 MB) copied, 0.533725 s, 251 MB/s
That's now on average 10 times faster than 0.6.5.1 and 0.6.4.2. I am not sure what's wrong with 0.6.4.2.
Unfortunately I can only confirm this on Gentoo, as I am still unable to patch on Ubuntu.

@hhhappe (Author) commented Sep 25, 2015

@alexanderhaensch sync was at the default.

@behlendorf (Contributor)

@hhhappe @alexanderhaensch great news. If reverting b39c22b fixed the issue then the patch in #3833 will address it as well. It'll be in the next point release, which shouldn't take too long to finalize.

behlendorf modified the milestones: 0.6.5.3, 0.6.5.2 Sep 25, 2015
behlendorf added a commit that referenced this issue Sep 25, 2015
Commit b39c22b set the READ_SYNC and WRITE_SYNC flags for a bio
based on the ZIO_PRIORITY_* flag passed in.  This had the unnoticed
side-effect of making the vdev_disk_io_start() synchronous for
certain I/Os.

This in turn resulted in vdev_disk_io_start() being able to
re-dispatch zio's which could result in RCU stalls when a disk
was removed from the system.  Additionally, this could negatively
impact performance and explains the performance regressions reported
in both #3829 and #3780.

This patch resolves the issue by making the blocking behavior
dependent on a 'wait' flag being passed rather than overloading
the passed bio flags.

Finally, the WRITE_SYNC and READ_SYNC behavior is restricted to
non-rotational devices where there is no benefit to queuing to
aggregate the I/O.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3652
Issue #3780
Issue #3785
Issue #3817
Issue #3821
Issue #3829
Issue #3832
Issue #3870
@behlendorf (Contributor)

Resolved by 5592404, which will be cherry-picked into the 0.6.5.2 release.
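
Once the point release is out, the loaded module version and the original test can be rechecked with:

modinfo zfs | grep -iw version
dd if=/dev/zero of=/zfs/t0 bs=1M count=128 conv=fsync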

@kernelOfTruth (Contributor)

@behlendorf I've been wondering for the last few weeks why updating the database of the recoll full-text desktop search (http://www.lesbonscomptes.com/recoll/) was taking literally hours and struggling to complete its "closing" operation (fsync, merging, sorting, etc.). I thought it was "normal" to have significantly worse performance compared to btrfs or ext4.

Now it completes within a few minutes; it appears this was a significant regression.

Thanks for the fix 👍

behlendorf added a commit that referenced this issue Sep 30, 2015
MorpheusTeam pushed a commit to Xyratex/lustre-stable that referenced this issue Oct 17, 2015
ZFS/SPL 0.6.5.2

Bug Fixes
* Init script fixes openzfs/zfs#3816
* Fix uioskip crash when skip to end openzfs/zfs#3806
  openzfs/zfs#3850
* Userspace can trigger an assertion openzfs/zfs#3792
* Fix quota userused underflow bug openzfs/zfs#3789
* Fix performance regression from unwanted synchronous I/O
  openzfs/zfs#3780
* Fix deadlock during ARC reclaim openzfs/zfs#3808
  openzfs/zfs#3834
* Fix deadlock with zfs receive and clamscan openzfs/zfs#3719
* Allow NFS activity to defer snapshot unmounts openzfs/zfs#3794
* Linux 4.3 compatibility openzfs/zfs#3799
* Zed reload fixes openzfs/zfs#3773
* Fix PAX Patch/Grsec SLAB_USERCOPY panic openzfs/zfs#3796
* Always remove during dkms uninstall/update openzfs/spl#476

ZFS/SPL 0.6.5.1

Bug Fixes

* Fix zvol corruption with TRIM/discard openzfs/zfs#3798
* Fix NULL as mount(2) syscall data parameter openzfs/zfs#3804
* Fix xattr=sa dataset property not honored openzfs/zfs#3787

ZFS/SPL 0.6.5

Supported Kernels

* Compatible with 2.6.32 - 4.2 Linux kernels.

New Functionality

* Support for temporary mount options.
* Support for accessing the .zfs/snapshot over NFS.
* Support for estimating send stream size when source is a bookmark.
* Administrative commands are allowed to use reserved space improving
  robustness.
* New notify ZEDLETs support email and pushbullet notifications.
* New keyword 'slot' for vdev_id.conf to control what is used for the
  slot number.
* New zpool export -a option unmounts and exports all imported pools.
* New zpool iostat -y omits the first report with statistics since
  boot.
* New zdb can now open the root dataset.
* New zdb can print the numbers of ganged blocks.
* New zdb -ddddd can print details of block pointer objects.
* New zdb -b performance improved.
* New zstreamdump -d prints contents of blocks.

New Feature Flags

* large_blocks - This feature allows the record size on a dataset to
be set larger than 128KB. We currently support block sizes from 512
bytes to 16MB. The benefits of larger blocks, and thus larger IO, need
to be weighed against the cost of COWing a giant block to modify one
byte. Additionally, very large blocks can have an impact on I/O
latency, and also potentially on the memory allocator. Therefore, we
do not allow the record size to be set larger than zfs_max_recordsize
(default 1MB). Larger blocks can be created by changing this tuning;
pools with larger blocks can always be imported and used, regardless
of this setting.

* filesystem_limits - This feature enables filesystem and snapshot
limits. These limits can be used to control how many filesystems
and/or snapshots can be created at the point in the tree on which the
limits are set.

*Performance*

* Improved zvol performance on all kernels (>50% higher throughput,
  >20% lower latency)
* Improved zil performance on Linux 2.6.39 and earlier kernels (10x
  lower latency)
* Improved allocation behavior on mostly full SSD/file pools (5% to
  10% improvement on 90% full pools)
* Improved performance when removing large files.
* Caching improvements (ARC):
** Better cached read performance due to reduced lock contention.
** Smarter heuristics for managing the total size of the cache and the
   distribution of data/metadata.
** Faster release of cached buffers due to unexpected memory pressure.

*Changes in Behavior*

* Default reserved space was increased from 1.6% to 3.3% of total pool
capacity. This default percentage can be controlled through the new
spa_slop_shift module option, setting it to 6 will restore the
previous percentage.

* Loading of the ZFS module stack is now handled by systemd or the
sysv init scripts. Invoking the zfs/zpool commands will not cause the
modules to be automatically loaded. The previous behavior can be
restored by setting the ZFS_MODULE_LOADING=yes environment variable
but this functionality will be removed in a future release.

* Unified SYSV and Gentoo OpenRC initialization scripts. The previous
functionality has been split in to zfs-import, zfs-mount, zfs-share,
and zfs-zed scripts. This allows for independent control of the
services and is consistent with the unit files provided for a systemd
based system. Complete details of the functionality provided by the
updated scripts can be found here.

* Task queues are now dynamic and worker threads will be created and
destroyed as needed. This allows the system to automatically tune
itself to ensure the optimal number of threads are used for the active
workload which can result in a performance improvement.

* Task queue thread priorities were correctly aligned with the default
Linux file system thread priorities. This allows ZFS to compete fairly
with other active Linux file systems when the system is under heavy
load.

* When compression=on the default compression algorithm will be lz4 as
long as the feature is enabled. Otherwise the default remains lzjb.
Similarly lz4 is now the preferred method for compressing meta data
when available.

* The use of mkdir/rmdir/mv in the .zfs/snapshot directory has been
disabled by default both locally and via NFS clients. The
zfs_admin_snapshot module option can be used to re-enable this
functionality.

* LBA weighting is automatically disabled on files and SSDs ensuring
the entire device is used fairly.
* iostat accounting on zvols running on kernels older than Linux 3.19
is no longer supported.

* The known issues preventing swap on zvols for Linux 3.9 and newer
kernels have been resolved. However, deadlocks are still possible for
older kernels.

Module Options

* Changed zfs_arc_c_min default from 4M to 32M to accommodate large
  blocks.
* Added metaslab_aliquot to control how many bytes are written to a
  top-level vdev before moving on to the next one. Increasing this may
  be helpful when using blocks larger than 1M.
* Added spa_slop_shift, see 'reserved space' comment in the 'changes
  to behavior' section.
* Added zfs_admin_snapshot, enable/disable the use of mkdir/rmdir/mv
  in .zfs/snapshot directory.
* Added zfs_arc_lotsfree_percent, throttle I/O when free system
  memory drops below this percentage.
* Added zfs_arc_num_sublists_per_state, used to allow more
  fine-grained locking.
* Added zfs_arc_p_min_shift, used to set a floor on arc_p.
* Added zfs_arc_sys_free, the target number of bytes the ARC should
  leave as free.
* Added zfs_dbgmsg_enable, used to enable the 'dbgmsg' kstat.
* Added zfs_dbgmsg_maxsize, sets the maximum size of the dbgmsg
  buffer.
* Added zfs_max_recordsize, used to control the maximum allowed
  record size.
* Added zfs_arc_meta_strategy, used to select the preferred ARC
  reclaim strategy.
* Removed metaslab_min_alloc_size, it was unused internally due to
  prior changes.
* Removed zfs_arc_memory_throttle_disable, replaced by
  zfs_arc_lotsfree_percent.
* Removed zvol_threads, zvols no longer require a dedicated task
  queue.
* See zfs-module-parameters(5) for complete details on available
  module options.

Bug Fixes

* Improved documentation with many updates, corrections, and
  additions.
* Improved sysv, systemd, initramfs, and dracut support.
* Improved block pointer validation before issuing IO.
* Improved scrub pause heuristics.
* Improved test coverage.
* Improved heuristics for automatic repair when zfs_recover=1 module
  option is set.
* Improved debugging infrastructure via 'dbgmsg' kstat.
* Improved zpool import performance.
* Fixed deadlocks in direct memory reclaim.
* Fixed deadlock on db_mtx and dn_holds.
* Fixed deadlock in dmu_objset_find_dp().
* Fixed deadlock during zfs rollback.
* Fixed kernel panic due to tsd_exit() in ZFS_EXIT.
* Fixed kernel panic when adding a duplicate dbuf to dn_dbufs.
* Fixed kernel panic due to security / ACL creation failure.
* Fixed kernel panic on unmount due to iput taskq.
* Fixed panic due to corrupt nvlist when running utilities.
* Fixed panic on unmount due to not waiting for all znodes to be
  released.
* Fixed panic with zfs clone from different source and target pools.
* Fixed NULL pointer dereference in dsl_prop_get_ds().
* Fixed NULL pointer dereference in dsl_prop_notify_all_cb().
* Fixed NULL pointer dereference in zfsdev_getminor().
* Fixed I/Os are now aggregated across ZIO priority classes.
* Fixed .zfs/snapshot auto-mounting for all supported kernels.
* Fixed 3-digit octal escapes by changing to 4-digit, which
  disambiguates the output.
* Fixed hard lockup due to infinite loop in zfs_zget().
* Fixed misreported 'alloc' value for cache devices.
* Fixed spurious hung task watchdog stack traces.
* Fixed direct memory reclaim deadlocks.
* Fixed module loading in zfs import systemd service.
* Fixed intermittent libzfs_init() failure to open /dev/zfs.
* Fixed hot-disk sparing for disk vdevs
* Fixed system spinning during ARC reclaim.
* Fixed formatting errors in zfs(8).
* Fixed zio pipeline stall by having callers invoke next stage.
* Fixed assertion failed in zrl_tryenter().
* Fixed memory leak in make_root_vdev().
* Fixed memory leak in zpool_in_use().
* Fixed memory leak in libzfs when doing rollback.
* Fixed hold leak in dmu_recv_end_check().
* Fixed refcount leak in bpobj_iterate_impl().
* Fixed misuse of input argument in traverse_visitbp().
* Fixed missing mutex_destroy() calls.
* Fixed integer overflows in dmu_read/dmu_write.
* Fixed verify() failure in zio_done().
* Fixed zio_checksum_error() to only include info for ECKSUM errors.
* Fixed -ESTALE to force lookup on missing NFS file handles.
* Fixed spurious failures from dsl_dataset_hold_obj().
* Fixed zfs compressratio when using with 4k sector size.
* Fixed spurious watchdog warnings in prefetch thread.
* Fixed unfair disk space allocation when vdevs are of unequal size.
* Fixed ashift accounting error writing to cache devices.
* Fixed zdb -d has false positive warning when
  feature@large_blocks=disabled.
* Fixed zdb -h | -i seg fault.
* Fixed force-received full stream into a dataset if it has a
  snapshot.
* Fixed snapshot error handling.
* Fixed 'hangs' while deleting large files.
* Fixed lock contention (rrw_exit) while running a read only load.
* Fixed error message when creating a pool to include all problematic
  devices.
* Fixed Xen virtual block device detection, partitions are now
  created.
* Fixed missing E2BIG error handling in zfs_setprop_error().
* Fixed zpool import assertion in libzfs_import.c.
* Fixed zfs send -nv output to stderr.
* Fixed idle pool potentially running itself out of space.
* Fixed narrow race which allowed read(2) to access beyond fstat(2)'s
  reported end-of-file.
* Fixed support for VPATH builds.
* Fixed double counting of HDR_L2ONLY_SIZE in ARC.
* Fixed 'BUG: Bad page state' warning from kernel due to writeback
  flag.
* Fixed arc_available_memory() to check freemem.
* Fixed arc_memory_throttle() to check pageout.
* Fixed 'zpool create' warning when using zvols in debug builds.
* Fixed loop devices layered on ZFS with 4.1 kernels.
* Fixed zvol contribution to kernel entropy pool.
* Fixed handling of compression flags in arc header.
* Substantial changes to realign code base with illumos.
* Many additional bug fixes.

Signed-off-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Change-Id: I87c012aec9ec581b10a417d699dafc7d415abf63
Reviewed-on: http://review.whamcloud.com/16399
Tested-by: Jenkins
Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Tested-by: Maloo <hpdd-maloo@intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>