Improve dirty page writeout performance #2250

Closed
wants to merge 1 commit

Conversation

ryao
Contributor

@ryao ryao commented Apr 9, 2014

This refactors ->writepages and ->writepage to improve performance.

->writepages has been refactored to call zfs_range_lock() once per call.
Each zfs_range_lock() does its own memory allocation and various tree
operations, so making it occur only once reduces the computational
complexity of writepages, by at least a factor of log(N) where N is the
number of pages. This avoids significant virtual memory contention that
is prevalent in Linux's vmalloc(). In addition, we now pass WB_SYNC_NONE
to write_cache_pages. This moves responsibility for ensuring synchronous
semantics entirely to zil_commit(), where it should have been in the
first place. It also eliminates an opportunity for concurrent threads to
block each other on writeback.
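
For reference, here is a minimal sketch of the flow described above, written against the existing ZoL names (zpl_writepages(), zpl_putpage(), zfs_range_lock(), zil_commit()). It is illustrative rather than the actual diff: the range chosen for the lock and the exact point at which it is dropped relative to zil_commit() are simplifications, and the latter is debated further down this thread.

static int
zpl_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
        znode_t *zp = ITOZ(mapping->host);
        enum writeback_sync_modes sync_mode = wbc->sync_mode;
        rl_t *rl;
        int error;

        /* One range lock per ->writepages call instead of one per page. */
        rl = zfs_range_lock(zp, 0, UINT64_MAX, RL_WRITER);

        /*
         * Always request asynchronous writeback; synchronous semantics are
         * provided by zil_commit() below rather than by blocking on each
         * page's writeback bit.
         */
        wbc->sync_mode = WB_SYNC_NONE;
        error = write_cache_pages(mapping, wbc, zpl_putpage, mapping);
        zfs_range_unlock(rl);

        if (sync_mode == WB_SYNC_ALL)
                zil_commit(ITOZSB(mapping->host)->z_log, zp->z_id);

        wbc->sync_mode = sync_mode;
        return (error);
}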

->writepage has been refactored to permit the changes to ->writepages.
Situations where it is told to write out an invalidated page are now
handled properly by ignoring the page. This avoids a theoretical NULL
pointer dereference.
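
A hedged sketch of how ->writepage can ignore an invalidated page; the NULL-mapping check is a standard Linux page cache pattern, but the function body here is illustrative rather than the patch itself.

static int
zpl_writepage(struct page *pp, struct writeback_control *wbc)
{
        /*
         * A page invalidated while queued for writeout has had its mapping
         * cleared; writing it would dereference stale state, so simply
         * unlock it and report success.
         */
        if (pp->mapping == NULL) {
                unlock_page(pp);
                return (0);
        }

        return (zpl_putpage(pp, wbc, pp->mapping));
}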

These changes substantially improve 32-thread 4K randomwrite IO on a RAM
disk in filebench's randomwrite benchmark. This has been tested with the
following commands on a boot that passed ramdisk_size=8388604 to the
kernel commandline:

modprobe loop
losetup loop0 /dev/ram0
zpool create -o ashift=12 -O recordsize=4K test /dev/loop0
zpool export test
losetup -D
zpool import -d /dev test

filebench << END
load randomwrite
set $dir=/test
set $nthreads=32
set $iosize=4096
run 60
END

zpool destroy test

Previous tests on Gentoo Linux using a kernel compiled from the CentOS
2.6.32-431.11.2.el6.x86_64 kernel with the default .config showed single
threaded performance at roughly 10k to 11k IOPS with 32-thread
performance at 5k-6k in this benchmark. Tests with this patch show a
performance improvement to 15k to 17k in the 32-thread case. These tests
were conducted on a single processor Xeon E5-2620 with 64GB of RAM.

Signed-off-by: Richard Yao ryao@gentoo.org

@sempervictus
Contributor

Been using this in conjunction with #2129 and here's how they stack up against current HEAD:
3.13.6 with BFQ IO scheduling:

,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1600 MBs |    8.6 s | 185.861 MB/s | 253.6 %  | 1717.8 % |
| Random Write   31 MBs |    1.3 s |  23.149 MB/s | 123.9 %  | 116.9 % |
| Read         1600 MBs |    0.5 s | 3283.142 MB/s | 5017.3 %  | 975.9 % |
| Random Read    31 MBs |    0.0 s | 3111.310 MB/s | 5256.9 %  | 716.8 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.044 ms |        2.884 ms |  0.00000 |   0.00000 |
| Random Write |        0.732 ms |      631.622 ms |  0.00000 |   0.00000 |
| Read         |        0.005 ms |        0.098 ms |  0.00000 |   0.00000 |
| Random Read  |        0.005 ms |        0.090 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.031 ms |      631.622 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

With both patches:

Tiotest results for 8 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1600 MBs |    5.1 s | 316.313 MB/s | 267.5 %  | 2903.1 % |
| Random Write   31 MBs |    0.2 s | 195.877 MB/s | 315.9 %  | 1767.6 % |
| Read         1600 MBs |    0.6 s | 2890.857 MB/s | 5369.4 %  | 845.0 % |
| Random Read    31 MBs |    0.0 s | 3161.036 MB/s | 5098.1 %  | 1122.8 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.044 ms |        6.067 ms |  0.00000 |   0.00000 |
| Random Write |        0.044 ms |        0.134 ms |  0.00000 |   0.00000 |
| Read         |        0.006 ms |        0.399 ms |  0.00000 |   0.00000 |
| Random Read  |        0.006 ms |        0.050 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.025 ms |        6.067 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

Benchmarks are pretty pointless in real world terms though.
The system is a Xeon 2680 v2 with 32G of RAM and a 1T 840 EVO SSD.
The zvol is ashift=12 on a LUKS/dm-crypt container using aes-xts-plain64. The system has been getting thrashed for a week or so running root on ZFS this way, migrating hundreds of gigs between datasets while testing options (the PPA version makes the system unusable, by the way; HEAD is functional but laggy; these patches make it feel almost like using EXT4).

One thing to note, and I've no idea where the blame lies here, is that I did experience a kernel panic with these patches applied. It ended up hurting the ext2 /boot partition. I've since added openzfs/spl#328 to the mix, and so far no issues in 24 hours.

Just completed a pretty robust kernel build of 3.14.2-pf, AUFS (and a bunch of other patches, plus just about the whole device tree) in 10m flat... I will move to that for testing now that I'm building DKMS debs from the GitHub sources instead of per-kernel modules.

@sempervictus
Contributor

3.14-pf is a no-go: immediate crashes on SPL HEAD and the patched version. It's kind of difficult to export the stack traces since they happen during boot, or they hard-lock the host almost immediately on login to lightdm (it's a laptop, to boot).

Additionally, either this patch or #2129, or both, is still causing kernel panics, though not as quickly as the 3.14 case. No stack traces to show for it so far.
Are there any known problems with running UKSM and ARC/ZFS?

@ryao
Contributor Author

ryao commented Apr 29, 2014

@sempervictus This pull request contains two patches. One is the performance patch and another is a fix for an ancient bug that predated OpenSolaris. I wrote the latter after it was found that the former had a tendency to trigger an assertion in AVL trees. Are you using both patches?

Also, some early versions of the latter patch were buggy and consequently, different versions have been pushed. The buggy versions had been pushed for Brian's regression test systems because they were the only ones able to reproduce problems with the former patch.

Additionally, the patches in #2129 are causing issues in regression tests. I do not advise using them in production.

As for UKSM, you are the first person I know of to use it with ZFS, so you are in uncharted territory there.

@sempervictus
Contributor

Thanks for the clarification.
I was using 2129 with the latest sha256 patch, but I will pull it out of the patchset for now.

@ryao
Contributor Author

ryao commented Apr 29, 2014

Let me know if it works without 2129.


@kernelOfTruth
Contributor

@sempervictus please also post some test results if time allows

I'm also using sha256 checksums and use ZFS heavily on a desktop, so I'm interested in the latency results from this patch alone (and/or the slab patchset, which I'm also using).

Thanks!

@sempervictus
Contributor

4 days up, no kernel messages, no crashes, very usable.
I've been abusing it a bit while getting the system ironed out on a LUKS 1T SSD, and I have been sending a gz9+dedup ZFS stream to an encrypted 1T USB3 disk since 0520 EST this morning (dedup is ridiculously slow on volumes over 250G when limiting ARC to 10G).
I won't get any accurate benchmarks until I have a full zvol backup on the new disk, though.

One thing to note is that I had to kill all of my auto-snapshots and create a new recursive snapshot from the pool on down. When I tried sending the mess to auto-snaps I kept getting errors:

receiving incremental stream of rpool/root@zfs_local into zbac00/rpool/root@zfs_local
cannot receive incremental stream: invalid backup stream
warning: cannot send 'rpool/root@zfs-auto-snap': Broken pipe

@behlendorf
Contributor

@sempervictus thanks for the feedback; we're getting closer to being able to merge this and things are looking good. However, a few remaining issues have been discovered. For example, I'm able to consistently cause the following crash on CentOS 6.5 (sorry about the abbreviated stack).

Kernel panic - not syncing: avl_find() succeeded inside avl_add()
avl_update
zfs_range_lock
zfs_read
zpl_read_common
zpl_read

@behlendorf behlendorf added this to the 0.6.3 milestone May 2, 2014
@behlendorf behlendorf added Feature and removed Feature labels May 2, 2014
@sempervictus
Contributor

What precedes this behavior, and has it been observed on 6.4?
If it's the only thing holding you up, we have some decent DevOps capabilities to spin up test environments and iron this out for you, but we need proper instrumentation (tools/instructions) and workflow direction for running your debug jobs.

@ryao
Contributor Author

ryao commented May 2, 2014

I think I learned what I need to know to fix it yesterday. Due to my schedule, I will not be able to try it for another few hours.


@behlendorf
Contributor

@ryao Here's a full backtrace and some additional debugging from gdb. This was from xfstests 074; time permitting, I'll dig into this too.

centos-6.5-x86_64-try-buildslave login:
Kernel panic - not syncing: avl_find() succeeded inside avl_add()
Pid: 31538, comm: fstest Tainted: P           ---------------    2.6.32-431.11.2.el6.x86_64 #1
Call Trace:
 [<ffffffff81527823>] ? panic+0xa7/0x16f
 [<ffffffffa0349451>] ? kmem_alloc_debug+0x251/0x520 [spl]
 [<ffffffffa007c870>] ? avl_update+0x0/0x80 [zavl]
 [<ffffffffa0949a4d>] ? zfs_range_lock+0x45d/0x760 [zfs]
 [<ffffffffa0957820>] ? zfs_read+0x110/0x510 [zfs]
 [<ffffffff8119bc0a>] ? do_filp_open+0x6ea/0xd20
 [<ffffffffa08e1ad6>] ? refcount_remove+0x16/0x20 [zfs]
 [<ffffffffa0979b12>] ? zpl_read_common+0x52/0x80 [zfs]
 [<ffffffffa0979ba8>] ? zpl_read+0x68/0xa0 [zfs]
 [<ffffffff81226496>] ? security_file_permission+0x16/0x20
 [<ffffffff81189775>] ? vfs_read+0xb5/0x1a0
 [<ffffffff81189aa2>] ? sys_pread64+0x82/0xa0
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b

@kernelOfTruth
Contributor

@behlendorf does this only occur on older kernels, or also on newer ones?

Any risk of data loss?

If not, and only a potential hard lock is in store, I'd be willing to give it a try with 3.14.3.

@ryao any news on the fix, or did life call back? =)

@behlendorf
Contributor

I've only been able to reproduce it with 2.6.32 kernels under a fairly abusive test case. There shouldn't be any risk of data loss here, just potentially a lockup. Any additional testing you can offer would be welcome! I'll try to look at this too, time permitting.

@ryao
Contributor Author

ryao commented May 7, 2014

any news on the fix, or did life call back? =)

I am in the process of constructing a private lab to let me better reproduce these kinds of issues. It is taking more time to set up than I thought, but I am making a push to bring it online as part of my work on this. I hope to have physical hardware dedicated to reproducing such issues online tomorrow.

@kernelOfTruth
Contributor

Maybe it's too early, but I seem to be observing the following errors:

status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P

after using the patched ZFS & ZFS-KMOD parts with this patchset.

There are two drives in total (mirror mode):

the master drive (/home) and the mirror. The checksum errors always occur on the mirror (2-5 checksum errors, so the number is not fixed).

RAM is fine, and the hard drives should be fine too (no smartctl or badblocks errors, no strange noises).

So far I've narrowed it down to the following:

when shutting down the system via shutdown -h now, the errors occur during the next import;

when logging out of X and running "zpool export" on the pool/drives before shutting down or rebooting the system, the errors don't occur during the next import.

The following extra settings, besides copies=. , are used:

zfs set atime=off
zfs set xattr=off

zfs set aclinherit=discard

zfs set compression=lz4
zfs set checksum=sha256

zfs set sync=standard

zfs set logbias=latency

@kernelOfTruth
Contributor

Okay, so it's not related to this patchset but to the fact that the mirrored pool which is my /home "partition" doesn't seem to get exported or unmounted during shutdown anymore - weird :/

@behlendorf behlendorf modified the milestones: 0.6.4, 0.6.3 May 29, 2014
@kernelOfTruth
Contributor

FYI:

still using this patch(set) and no related issues so far

thanks =)

@behlendorf
Contributor

Thanks for the update.

@mattaw

mattaw commented Jun 12, 2014

Linux MythMaster 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Just tried running with 2250, 2129 and spl_kmem_slab at 16384.

Got a fairly unresponsive system after a paging failure under light load in about an hour. ZFS is serving one iSCSI block device and a standard ZFS filesystem shared via Samba, all with lz4 compression on. I was able to shut down safely. The log message is listed below, along with the system configuration and hardware.

Jun 10 23:59:24 mythmaster kernel: [89387.840120] BUG: unable to handle kernel paging request at ffffffff81203e40
Jun 10 23:59:24 mythmaster kernel: [89387.840135] IP: [<ffffffff813669d7>] rb_insert_color+0x87/0x170
Jun 10 23:59:24 mythmaster kernel: [89387.840150] PGD 1c11067 PUD 1c12063 PMD 12001e1
Jun 10 23:59:24 mythmaster kernel: [89387.840160] Oops: 0003 [#1] SMP
Jun 10 23:59:24 mythmaster kernel: [89387.840168] Modules linked in: ib_srpt tcm_qla2xxx tcm_loop tcm_fc iscsi_target_mod target_core_pscsi target_core_file target_core_iblock target_core_mod zfs(POF) zunicode(POF) zavl(POF) zcommon(POF) znvpair(POF) spl(OF) ib_cm ib_sa ib_mad ib_core ib_addr qla2xxx libfc scsi_transport_fc scsi_tgt configfs rfcomm bnep bluetooth snd_hda_codec_hdmi ir_lirc_codec lirc_dev ir_jvc_decoder ir_sanyo_decoder ir_mce_kbd_decoder ir_sony_decoder ir_rc6_decoder ir_rc5_decoder ir_nec_decoder rc_rc6_mce joydev mceusb rc_core nv_tco snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device kvm_amd nvidia(POF) snd_timer kvm snd serio_raw amd64_edac_mod drm edac_core edac_mce_amd k10temp soundcore shpchp i2c_nforce2 mac_hid parport_pc ppdev 8021q garp stp mrp llc w83793 w83627hf hwmon_vid lp parport hid_generic usbhid hid usb_storage raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor psmouse raid6_pq s2io raid1 pata_acpi sata_mv raid0 multipath forcedeth sata_nv linear [last unloaded: spl]
Jun 10 23:59:24 mythmaster kernel: [89387.840346] CPU: 2 PID: 2570 Comm: named Tainted: PF          O 3.13.0-29-generic #53-Ubuntu
Jun 10 23:59:24 mythmaster kernel: [89387.840353] Hardware name: Supermicro H8DM8-2/H8DM8-2, BIOS 080014  10/22/2009
Jun 10 23:59:24 mythmaster kernel: [89387.840359] task: ffff8806685f5fc0 ti: ffff880668d56000 task.ti: ffff880668d56000
Jun 10 23:59:24 mythmaster kernel: [89387.840364] RIP: 0010:[<ffffffff813669d7>]  [<ffffffff813669d7>] rb_insert_color+0x87/0x170
Jun 10 23:59:24 mythmaster kernel: [89387.840376] RSP: 0018:ffff880668d57ec8  EFLAGS: 00010286
Jun 10 23:59:24 mythmaster kernel: [89387.840380] RAX: ffff880359d8a980 RBX: ffff880036ad7d00 RCX: ffffffff81203e40
Jun 10 23:59:24 mythmaster kernel: [89387.840385] RDX: ffff88030ea58780 RSI: ffff88066bfc9970 RDI: ffff88030ea58781
Jun 10 23:59:24 mythmaster kernel: [89387.840390] RBP: ffff880668d57ec8 R08: ffff88030ea58780 R09: ffffffff812041dd
Jun 10 23:59:24 mythmaster kernel: [89387.840395] R10: 00007f792d7cbe00 R11: 0000000000000246 R12: ffff88066bccfa00
Jun 10 23:59:24 mythmaster kernel: [89387.840400] R13: ffff8803580e8780 R14: ffff880036ad7d30 R15: ffff88066bfc9970
Jun 10 23:59:24 mythmaster kernel: [89387.840406] FS:  00007f792d7cc700(0000) GS:ffff880373c80000(0000) knlGS:0000000000000000
Jun 10 23:59:24 mythmaster kernel: [89387.840411] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 10 23:59:24 mythmaster kernel: [89387.840416] CR2: ffffffff81203e40 CR3: 0000000359e3a000 CR4: 00000000000007e0
Jun 10 23:59:24 mythmaster kernel: [89387.840420] Stack:
Jun 10 23:59:24 mythmaster kernel: [89387.840424]  ffff880668d57f78 ffffffff81205a58 ffff880668d57ef8 000000198109dd94
Jun 10 23:59:24 mythmaster kernel: [89387.840436]  ffff8803580e87d8 ffff8803580e8798 0000000000000000 ffff88066bfc9908
Jun 10 23:59:24 mythmaster kernel: [89387.840446]  0000000000000304 ffff88066bfc9900 0000000100000001 0000001981020d25
Jun 10 23:59:24 mythmaster kernel: [89387.840455] Call Trace:
Jun 10 23:59:24 mythmaster kernel: [89387.840468]  [<ffffffff81205a58>] SyS_epoll_ctl+0x6e8/0x9e0
Jun 10 23:59:24 mythmaster kernel: [89387.840478]  [<ffffffff812041b0>] ? ep_poll_wakeup_proc+0x30/0x30
Jun 10 23:59:24 mythmaster kernel: [89387.840488]  [<ffffffff8172adff>] tracesys+0xe1/0xe6
Jun 10 23:59:24 mythmaster kernel: [89387.840492] Code: 48 89 08 48 83 e1 fc 48 89 3a 0f 84 84 00 00 00 48 3b 51 10 0f 84 b2 00 00 00 48 89 41 08 5d c3 0f 1f 40 00 48 89 d7 48 83 cf 01 <48> 89 39 48 89 38 48 8b 02 48 83 e0 fc 48 85 c0 48 89 02 0f 84
Jun 10 23:59:24 mythmaster kernel: [89387.840578] RIP  [<ffffffff813669d7>] rb_insert_color+0x87/0x170
Jun 10 23:59:24 mythmaster kernel: [89387.840587]  RSP <ffff880668d57ec8>
Jun 10 23:59:24 mythmaster kernel: [89387.840591] CR2: ffffffff81203e40
Jun 10 23:59:24 mythmaster kernel: [89387.840597] ---[ end trace 4f950476f5436d37 ]---

ZPool configuration:

  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 2h10m with 0 errors on Mon Jun  9 11:43:47 2014
config:

        NAME                                          STATE     READ WRITE CKSUM
        tank                                          ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA2585130  ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA2900089  ONLINE       0     0     0
          mirror-1                                    ONLINE       0     0     0
            ata-WDC_WD20EFRX-68EUZN0_WD-WMC4M1549840  ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA2915755  ONLINE       0     0     0
          mirror-2                                    ONLINE       0     0     0
            ata-ST2000DM001-1CH164_Z340EKTA           ONLINE       0     0     0
            ata-ST2000DM001-1CH164_Z340EL4X           ONLINE       0     0     0
        logs
          ata-GIGABYTE_i-RAM_CC9C135B2613DDB4D072     ONLINE       0     0     0

errors: No known data errors

Hardware:
Two quad-core Opteron 2389 CPUs.
24G of ECC RAM.
H8DME-2 Motherboard

@behlendorf
Contributor

@ryao I've spent all day looking at this, trying to find a clean way to refactor the code along the lines of what you're suggesting in this patch. Ideally we would be able to take the writer range lock over the call to write_cache_pages(), but in the WB_SYNC_ALL case that may deadlock. To avoid the deadlock we must call write_cache_pages() with WB_SYNC_NONE, but that means we must implement our own data-integrity logic. That may sound simple enough on the surface, but getting it right isn't as simple as calling zil_commit(). That won't handle the case where a page in writeback was re-dirtied and that new page needs to be written. Calling zil_commit() will only ensure that the previous version which was in writeback was written. This issue likely won't cause any test failures, but it's just functionally wrong.

I took a crack at implementing this data-integrity logic in #2413 by tagging all the dirty pages for WB_SYNC_ALL callers. However, the logic is a little overly broad and could result in writing out dirty pages multiple times. That said, I believe it's functionally correct. I've also rolled some other minor cleanup into the #2413 patch.
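
As a rough illustration of the page-tagging idea (not necessarily how #2413 implements it): on kernels new enough to have wbc->tagged_writepages, write_cache_pages() tags the pages that are dirty at call time and restricts its walk to them, which lets a hypothetical ->writepages avoid blocking per page while still asking the ZIL for stability afterwards.

static int
zpl_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
        int for_sync = (wbc->sync_mode == WB_SYNC_ALL);
        int error;

        if (for_sync) {
                /*
                 * Tag the pages that are dirty right now and restrict the
                 * walk to them, while WB_SYNC_NONE keeps
                 * write_cache_pages() from blocking on each page's
                 * writeback bit.
                 */
                wbc->tagged_writepages = 1;
                wbc->sync_mode = WB_SYNC_NONE;
        }

        error = write_cache_pages(mapping, wbc, zpl_putpage, mapping);

        if (for_sync) {
                wbc->sync_mode = WB_SYNC_ALL;
                /* Ask the ZIL, rather than page writeback, to make the
                 * writes stable. */
                zil_commit(ITOZSB(mapping->host)->z_log,
                    ITOZ(mapping->host)->z_id);
        }

        return (error);
}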

Next I ran some benchmarks using the stock 0.6.3 code, your patched version, and my patched version. I used fio to generate a workload of small 4k IOs from multiple threads and ran it against a ramdisk. To make things more realistic I also mixed in calls to msync() every 32 writes. Here's the test script.

[global]
bs=4k
ioengine=mmap
iodepth=1
size=1g
direct=0
runtime=60
directory=/tank/fio
filename=mmap.test.file
numjobs=32
sync_file_range=write:32

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall
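
If the job file above is saved as, say, zfs-mmap.fio (filename hypothetical), it can be run directly once /tank/fio exists:

fio zfs-mmap.fio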

My results don't show any significant performance win for the new code. For any version of this patch to get merged we're going to need to show the following.

  • It's functionally correct.
  • We have a benchmark which shows it improves performance for a specific workload.
  • It doesn't negatively impact the performance of other workloads.

@ryao
Contributor Author

ryao commented Jun 21, 2014

@behlendorf If I understand correctly, you are concerned about the following scenario:

  1. Page A in File X is dirtied.
  2. ->writepages() is called on a range in file X that includes page A.
  3. Page A is redirtied before zil_commit() in 2 finishes.
  4. ->writepages() is called on a range in file X that includes page A and is called before zil_commit() in 2 finishes, but after 3.
  5. Is the redirtied page on stable storage?

If we drop the rangelock before zil_commit(), then writepages can skip over the redirtied page by running before the callback from the DMU. If we drop the rangelock after zil_commit(), then writepages will not skip over the redirtied page. We can avoid the problem of having redirtied pages to write out while in writeback by holding the range lock in ->writepage{,s}() during the critical section in which pages are in writeback. The SUSv2 only requires us to ensure that everything present at the time of the syscall entry is written out, which holding the rangelock until zil_commit() returns will do:

http://pubs.opengroup.org/onlinepubs/7908799/xsh/fsync.html

I will try to get some numbers for our revised patches later this weekend (where mine will be changed to drop the rangelock after zil_commit()). Doing precisely 1 rangelock per ->writepages will make our dirty page writeback more like that of Illumos, which does only 1 rangelock per VOP_PUTPAGE call. I suspect that making us consistent with that will yield an improvement solely from the reduction in the asymptotic complexity of ->writepages() from O(N*log(N)) to O(N), where taking a rangelock is O(log(N)).
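
For clarity, the revised ordering being proposed here, sketched with the same illustrative names as earlier (note behlendorf's reply below about the deadlock risk of holding the writer lock across zil_commit()):

/*
 * Proposed ordering (illustrative):
 *
 *   rl = zfs_range_lock(zp, start, len, RL_WRITER);
 *   error = write_cache_pages(mapping, wbc, zpl_putpage, mapping);   [WB_SYNC_NONE]
 *   if (sync_mode == WB_SYNC_ALL)
 *           zil_commit(ITOZSB(mapping->host)->z_log, zp->z_id);
 *   zfs_range_unlock(rl);   [only after zil_commit() returns]
 */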

That being said, I suspect that our testing methodologies are different. I used filebench to test this on a ramdisk (using the Linux brd driver). How are you testing this?

@ryao
Contributor Author

ryao commented Jun 21, 2014

@behlendorf I probably should elaborate that the key idea in this patch is to change locking such that each ->writepages() call does precisely 1 rangelock. My main mistake here was in locking too little by dropping the lock before zil_commit(). That is wrong and it is fortunate that you caught it.

A nice thing about doing precisely 1 rangelock per ->writepage{s,} call (provided that it is dropped after zil_commit() returns) is that it allows us to eliminate a possible ping-pong effect between a thread doing zil_commit() (call it thread B) and a thread blocked on WB_SYNC_ALL writeback (thread A). In that situation, thread B wakes thread A upon each clear of a writeback bit; thread A takes the CPU and then blocks again until B wakes it at the next page.

@behlendorf
Contributor

@ryao Yes, that's exactly the situation I'm worried about.

We can avoid the problem of having redirtied pages to write out while in writeback by holding the range lock in ->writepage{,s}() during the critical section in which pages are in writeback.

Unfortunately, we can't. Extending the range lock as you suggest will introduce a deadlock. The zil_commit() function calls zfs_get_data(), which may need to acquire the range lock. If it happens to need to take a read lock on the range we're holding a write lock for, the system will deadlock.

We could perhaps extend the range lock interfaces to handle this case by indicating we already have the lock. But that's going to take us farther away from the upstream code so I'd like to avoid that if possible.

The cleanest solution, I think, is to preserve what was done in the previous code: we call write_cache_pages() a second time with the correct sync mode. I'm going to try this out today and refresh the patch.
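
A minimal sketch of that fallback, assuming the same illustrative zpl_writepages()/zpl_putpage() names as above; the second pass uses the caller's original sync mode so it waits on, and rewrites, anything the first asynchronous pass skipped or that was redirtied in the meantime.

static int
zpl_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
        enum writeback_sync_modes sync_mode = wbc->sync_mode;
        int error;

        /* First pass: write back as much as possible without blocking. */
        wbc->sync_mode = WB_SYNC_NONE;
        error = write_cache_pages(mapping, wbc, zpl_putpage, mapping);

        /*
         * Second pass with the caller's sync mode: wait on pages still
         * under writeback and pick up anything skipped or redirtied
         * during the first pass.
         */
        if (sync_mode == WB_SYNC_ALL) {
                wbc->sync_mode = sync_mode;
                error = write_cache_pages(mapping, wbc, zpl_putpage, mapping);
        }

        return (error);
}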

I probably should elaborate that the key idea in this patch is to change locking such that each ->writepages() call does precisely 1 rangelock.

I agree with this entirely. Pulling the range lock up a level, so there's just one for the entire writepages() call, should be a good thing. For certain workloads I expect it will significantly decrease the amount of contention.

The thing is, before we merge a change like this we need to quantify how much it helps. So what I'd like to see is at least one specific test case where it's shown to improve performance, decrease CPU utilization, or improve some other metric.

I've been using an mmap()-based fio workload, described in the commit message, against a ramdisk-based zpool in a VM. So far my testing has been inconclusive, so I'll need to refine my process a bit. I'm all for testing other workloads; I noticed you used the filebench randomwrite test for your testing. I looked at it briefly, but it wasn't clear to me that it used mmap for its IO, so I opted for something else.

@ryao
Contributor Author

ryao commented Jun 24, 2014

I am withdrawing this pull request because benchmark results that I posted in #2413 show that 0.6.3 has the performance characteristics that my past benchmarks had attributed to this patch and a change such as this is ineffective.

@ryao ryao closed this Jun 24, 2014
@lukemarsden

0.6.3 has the performance characteristics that my past benchmarks had attributed to this patch and a change such as this is ineffective.

Just to clarify, this means that 0.6.3 performs as well as the previous code did with this patch?

@ryao
Contributor Author

ryao commented Jul 1, 2014

0.6.3 performs as well as the previous code did with whatever happened to be in my local tree. Some patches that I wrote alongside this to fix bugs were merged, which might explain the discrepancy.

Labels
Type: Performance (performance improvement or performance problem)