ZVOL write IO merging not sufficient #8472
ZVOL currently does not even support noop.
The default value of nomerges is 2; I will try setting it to 0, re-test the cases, and report back soon. Today I can confirm that setting nomerges to 0 has no actual effect.
Can somebody who is familiar with the ZFS DMU code investigate the IO merging logic inside the DMU a bit? Perhaps one can find a better solution there. I just wonder why IO merging at the DMU is not working in this simple case (a single thread of 4 KB consecutive writes).
@samuelxhu @kpande From how I understand it, the problem is reproducible even without a zvol: if you overwrite a large-recordsize (i.e. 128k) file with 4k writes, you will encounter heavy read/modify/write. The problem does not seem related to the aggregator not doing its work; rather, it depends on the fact that on a partial-recordsize write, the entire record must be copied in memory.
So, the r/m/w behavior really seems intrinsically tied to the ARC/checksumming, rather than depending on the aggregator not doing its work.

However, in older ZFS versions (<= 0.6.4), zvols were somewhat immune from this problem. This stems from the fact that, unless doing direct I/O, zvols did not bypass the standard Linux pagecache. On ZFS >= 0.6.5, the zvol code was changed to skip some of the previous Linux "canned" block layer code, simplifying the I/O stack and bypassing the I/O scheduler entirely.

For what it is worth, I feel the current behavior is the right one: in my opinion, zvols should not behave too much differently from datasets. That said, this precludes a possible optimization (i.e. using the pagecache as a sort of "first stage" buffer where merging can be done before sending anything to ZFS).
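The read amplification implied by this can be sketched with a back-of-the-envelope model (hypothetical numbers, not ZFS code):

```python
# Worst-case model: every sub-recordsize write forces the full record to be
# read into memory before it can be checksummed and rewritten.
def rmw_read_bytes(write_size, record_size, total_bytes):
    if write_size >= record_size:
        return 0  # full-record writes need no read
    records_touched = total_bytes // record_size
    return records_touched * record_size

# Overwriting 1 MiB of a 128k-recordsize file with 4k writes:
reads = rmw_read_bytes(4096, 131072, 1 << 20)
print(reads)  # 1048576 -- a full 1 MiB of reads for 1 MiB of writes, if uncached
```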
Sorry, I disagree that the current behavior of ZVOL is the right one. There are many use cases for a zvol to behave like a normal block device, e.g. as backend storage for FC and iSCSI, or hosting VMs. In those use cases, a scheduler such as deadline/noop can merge smaller requests into bigger ones, thereby reducing the likelihood of RMWs. And using a scheduler to merge requests does not impose a big burden on memory usage!
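As a rough illustration of what elevator-style merging buys here (a simplified model of front/back merging, not the actual kernel elevator code):

```python
# Coalesce contiguous small writes so fewer requests straddle partial
# volblocksize records. requests: list of (offset, length) tuples.
def merge_contiguous(requests, max_size=131072):
    merged = []
    for off, length in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == off \
                and merged[-1][1] + length <= max_size:
            # back-merge into the previous request
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((off, length))
    return merged

# Eight contiguous 4k writes collapse into a single 32k request,
# which a 32k-volblocksize zvol can absorb without any RMW.
reqs = [(i * 4096, 4096) for i in range(8)]
print(merge_contiguous(reqs))  # [(0, 32768)]
```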
@kpande Just to confirm: setting /sys/devices/virtual/block/zdXXX/queue/nomerges to 0 does not cause contiguous IO requests to merge. It seems all kinds of IO merging are, unfortunately, disabled by the current implementation. Ryao's original intention was to avoid double merging and let the DMU do the IO merging. It is mysterious that the DMU does not do the correct merging either.
@samuelxhu I think the rationale for the current behavior is that you should avoid double caching by using direct I/O to the zvols; in this case, the additional merging done by the pagecache is skipped anyway, so it is better to also skip any additional processing done by the I/O scheduler. Anyway, @ryao can surely answer you in more detailed/correct form. The key point is that it is not the DMU failing to merge requests; it actually is doing I/O merging. You are asking for an additional buffer to "pre-merge" multiple write requests before passing them to the "real" ZFS code, in order to avoid read amplification. While understanding your request, I think this is currently out of scope, and quite different from how ZFS is expected to work.
@shodanshok ZVOL has been widely used as a block device since its beginning, e.g. as a backend for FC and iSCSI, for hosting VMs, and even for stacking with md, drbd, flashcache, and rbd block devices. Therefore it is extremely important to keep ZVOL behaving like a "normal" block device, supporting schedulers such as noop/deadline to merge incoming IO requests. By the way, having standard scheduler behavior has nothing to do with double caching.
@kpande Using a smaller volblocksize, such as volblocksize=4k, may help reduce RMWs without the need for IO request merging; however, this is far from ideal: with 4 KB-sector disks, it effectively prevents ZFS compression and the use of ZFS RAIDZ. Furthermore, using an extremely small volblocksize has a negative impact on throughput. It is widely reported that, for hosting VMs, a volblocksize of 32 KB is a better choice in practice.
…On Mon, Mar 4, 2019 at 3:54 AM kpande ***@***.***> wrote:
just use a smaller volblocksize and be aware of raidz overhead
considerations if you are not using mirrors. using native 512b storage
(some NVMe, some datacentre HDD up to 4TB) and ashift=9 will allow
compression to work with volblocksize=4k.
@samuelxhu But they are normal block devices; only the scheduler code was bypassed, to improve performance in the common case. I have no problem understanding what you say and why, but please be aware you are describing a pretty narrow use case/optimization: contiguous, non-direct 4k writes to a zvol are the only case where pagecache merging would be useful. If random I/Os are issued, merging is not useful. If direct I/O is used, merging is again not useful. So, while I am not against the change you suggest, please be aware of its narrow scope in real-world workloads.
@kpande I have over 20 ZFS storage boxes serving as FC/iSCSI backends, which use a 32 KB volblocksize. We run different workloads on them, and found that a 32 KB volblocksize strikes the best balance between IOPS and throughput. I have several friends running ZVOLs for VMware who recommend 32 KB as well.
Let me describe another ZVOL use case which requires normal block device behavior with a valid scheduler: one or multiple application servers use an FC or iSCSI LUN backed by a ZVOL; the servers use a server-side SSD cache, such as Flashcache or bcache, to reduce latency and accelerate application IO. Either flashcache or bcache will issue small but contiguous 4 KB IO requests to the backend, anticipating that the backend block device will sort and merge those contiguous IO requests. In the above case, any other block device, including HDD, SSD, RAID, or virtual block devices, will have no performance issues. But with a zvol in its current implementation, one will see significant performance degradation due to excessive and unnecessary RMWs.
In general, it is unlikely that merging will benefit overall performance. However, concurrency is important and has changed during the 0.7 evolution. Unfortunately, AFAIK, there is no comprehensive study on how to tune the concurrency. See https://github.com/zfsonlinux/zfs/wiki/ZFS-on-Linux-Module-Parameters#zvol_threads Also, there are discussions in #7834 regarding the performance changes over time, especially with the introduction of the write and DVA throttles. If you have data to add, please add it there.
Why is using ZVOLs as the backend block device for an iSCSI/FC LUN not a common use case? Don't be narrow-minded; it is very common. This is the typical use case in which ZVOL should have its own scheduler, for at least two purposes: 1) to stay compatible with the Linux block device model (extremely important for block device stacking), as applications anticipate that the backend ZVOL will do IO merging and sorting; 2) to reduce the chance of the notorious RMWs, in particular for non-4KB ZVOL volblocksizes.

I do not really understand why ZVOL should be different from a normal block device. For those who use ZVOLs with a 4 KB volblocksize only, setting the scheduler to noop/deadline costs only a few CPU cycles, but IO merging has great potential to reduce the chance of RMWs for non-4KB volblocksizes.

Pity on me: I run more than a hundred FC/iSCSI ZFS ZVOL storage boxes with a volblocksize of 32 KB or even bigger, for sensible reasons. The missing scheduler in 0.7.X causes us pain through excessive RMWs and thus performance degradation, preventing us from upgrading (from 0.6.4.2) to any later version. We would like to sponsor a fund to support somebody who can make a patch restoring the scheduler feature for ZVOL in 0.7.X. Anyone who is interested, please contact me at samuel.xhu@gmail.com. The patch may or may not be accepted by the ZFS authorities, but we would like to pay for the work.
@kpande Thanks a lot for pointing out the related previous commits; I will have a careful look at them and try to find a temporary remedy for excessive RMWs. I notice that previous zvol performance testing focused primarily on 4 KB or 8 KB ZVOLs; perhaps that is what rendered the RMW issue less visible, and thus RMWs were ignored by many eyes.

Let me explain a bit why a larger-blocksize ZVOL still makes sense and should not be ignored: 1) it enables the use of LZ4 compression together with RAIDZ(1/2/3) to gain storage space efficiency; 2) it strikes a balance between IOPS and throughput, and 32 KB seems good for VM workloads since it is not-so-big and not-so-small either; 3) we have server-side flash caches (flashcache, bcache, enhanceIO, etc.) implemented on all application servers, which absorb random 4 KB writes and then issue contiguous (semi-sequential) IO requests of 4 KB or other small sizes, anticipating that the backend block devices (iSCSI/FC ZVOLs) will do the IO merging/sorting.

In my humble opinion, eliminating the scheduler code from ZVOL really causes RMW pain for non-4KB ZVOLs; perhaps not for everyone, but at least for some ZFS fans.
@kpande It is interesting to notice that some people are complaining about performance degradation due to commit 37f9dac as well, in #4512. Maybe it is just a coincidence, maybe not. Commit 37f9dac may perform well for zvols with direct I/O, but there are many other use cases which are suffering performance degradation due to the missing scheduler behavior (merging and sorting IO requests).
It seems #361 basically covers the problem explained here. Rather than using the pagecache (with its double caching and increased memory pressure on the ARC), I would suggest creating a small (~1M) front "write buffer" to coalesce writes before sending them to the ARC. @behlendorf @ryao any chance to implement something similar?
@shodanshok Good finding! Indeed, #361 deals with essentially the same RMW issue as here. It came out in 2011, at which time ZFS practitioners could at least use the deadline/noop scheduler (before 0.6.5.X) to alleviate the chance of RMWs. In #4512, a few ZFS users complained about significant write amplification just after the scheduler was removed, but for unknown reasons the RMWs were not paid attention to. Given so much evidence, it seems to be the right time to take serious efforts to solve this RMW issue for ZVOL. We volunteer to take on the responsibility of testing, and if needed, a funding sponsorship of up to 5K USD (from Horeb Data AG, Switzerland) is possible for the code developer (if multiple developers are involved, @behlendorf please divide).
@kpande Only for database workloads do we have aligned IO for ZVOLs, and unfortunately I do not observe a significant performance improvement after 0.6.5.x. The reason might be that I universally have ZFS boxes with high-end CPUs and plenty of DRAM (256 GB or above), so saving a few CPU cycles does not have a material impact on IO performance (the bottleneck is definitely the HDDs, not CPU cycles or memory bandwidth). Most of our workloads are un-aligned IOs, such as hosting VMs and FC/iSCSI backed by ZVOLs, where the frontend applications generate mixed workloads of all kinds. Our engineering team currently focuses on fighting RMWs, and I think either #361 or #4512 should already show sufficient evidence of the issue. Until ZVOLs have an effective IO merging facility, we plan to write a shim-layer block device sitting in front of ZFS to enable IO request sorting and merging to reduce the occurrence of RMWs.
@samuelxhu One thing I'd suggest trying first is to increase the dbuf cache size. This small cache sits in front of the compressed ARC and contains an LRU of the most recently used uncompressed buffers. By increasing its size you may be able to mitigate some of the RMW penalty you're seeing. You'll need to increase the relevant module parameter.
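For reference, a hedged sketch of the tuning being suggested (assuming the tunable meant is dbuf_cache_max_bytes, the dbuf cache sizing parameter in current OpenZFS; the size shown is purely illustrative):

```shell
# Assumption: the parameter referred to is dbuf_cache_max_bytes (in bytes).
# Inspect the current value, then raise it, e.g. to 1 GiB:
cat /sys/module/zfs/parameters/dbuf_cache_max_bytes
echo $((1024 * 1024 * 1024)) > /sys/module/zfs/parameters/dbuf_cache_max_bytes
```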
You might find you can use one of Linux's many existing dm devices for this layer. Improving the performance of volumes across a wide variety of workloads is something we're interested in, but haven't had the time to work on. If you're interested, rather than implementing your own shim layer I'd be happy to discuss a design for doing the merging in the zvol implementation. As mentioned above, the current code depends on the DMU to do the heavy lifting regarding merging. However, for volumes there's nothing preventing us from doing our own merging. Even just front/back merging, or being aware of the volume's internal alignment, might yield significant gains.
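A minimal sketch of what volblocksize-aware front/back merging could look like (an illustration of the idea only, not proposed zvol code; all names are hypothetical):

```python
# Merge a new request into the pending one only while the combined extent
# stays inside a single volblocksize-aligned record.
def try_merge(pending, off, length, volblocksize=32768):
    """pending: (offset, length) or None. Returns (merged?, new_pending)."""
    if pending is None:
        return False, (off, length)
    p_off, p_len = pending
    # does the combined extent start and end inside the same record?
    same_record = (p_off // volblocksize) == ((off + length - 1) // volblocksize)
    if p_off + p_len == off and same_record:      # back merge
        return True, (p_off, p_len + length)
    if off + length == p_off and same_record:     # front merge
        return True, (off, length + p_len)
    return False, (off, length)

merged, pend = try_merge((0, 4096), 4096, 4096)
print(merged, pend)  # True (0, 8192)
```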
In order to merge you need two queues: active and waiting. With the request-based scheme there is one queue with depth=zvol_threads. In other words, we'd have to pause I/Os before they become active. This is another reason why I believe merging is not the solution to the observed problem. |
@richardelling From my tests, it seems that DMU merging at writeout time is working properly. Hence my idea of a "front buffer" which accepts small writes as they are (irrespective of the underlying recordsize) and, after having accumulated/merged some data (say, 1 MB), writes them out via the normal ARC buffering/flushing scheme. This would emulate what the pagecache does for regular block devices, without the added memory pressure of a real pagecache (which cannot be limited in any way, if I remember correctly). I have no idea whether this can be implemented without lowering ZFS's excellent resilience, or how difficult doing it would be, of course.
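The proposed front buffer could be modeled roughly like this (a toy sketch under the stated assumptions; none of these names exist in ZFS):

```python
# Accept small writes as-is and flush downstream only once ~1 MiB has
# accumulated, emulating what the pagecache does for ordinary block devices.
class FrontWriteBuffer:
    def __init__(self, flush_fn, threshold=1 << 20):
        self.flush_fn = flush_fn      # downstream sink (stand-in for the ARC)
        self.threshold = threshold
        self.chunks = []
        self.size = 0

    def write(self, offset, data):
        self.chunks.append((offset, data))
        self.size += len(data)
        if self.size >= self.threshold:
            self.flush()

    def flush(self):
        if self.chunks:
            self.flush_fn(self.chunks)          # hand the merged batch down
            self.chunks, self.size = [], 0

flushed = []
buf = FrontWriteBuffer(flushed.append, threshold=16384)
for i in range(4):
    buf.write(i * 4096, b"x" * 4096)  # four 4k writes trigger one flush
print(len(flushed))  # 1
```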
@behlendorf Thanks a lot for the suggestions. It looks like front merging can be easily turned on by reverting commit 5731140, but extensive IO sorting/merging inside ZVOL/DMU may take more effort. I may not be capable of coding much myself, but I would like to contribute in testing or other ways as much as possible.
Just to chime in: we use ZFS heavily with VM workloads, and there is a huge tradeoff between using a 128 KiB volblocksize and a smaller one. Larger volblocksizes actually perform much better up to the point where throughput is saturated, while smaller volblocksizes almost always perform worse but don't cause throughput problems.

I found it quite difficult to actually predict/benchmark this behaviour, because it works very differently on new unfragmented pools, on new ZVOLs (no overwrites), with different layers of caching (I am absolutely certain that the Linux pagecache still does something with ZFS, as I'm seeing misses that never hit the drives), and with various caching problems (ZFS doesn't seem to cache everything it should or could in ARC). This all makes it very hard to compare the performance of ZFS/ZVOLs to any other block device, hard to tune, and extremely hard to compete with "dumb" solutions like mdraid when performance is all over the place.

If there is any possibility to improve merging to avoid throughput saturation, then please investigate it. The other solution (to the problems I am seeing in my environment) is to fix the performance issues with smaller volblocksizes, but I guess that will be much more difficult, and I have seen it already discussed elsewhere multiple times (like ZFS not being able to use vdev queues efficiently when those vdevs are fast, like NVMe, where I have rarely seen a queue size >1).
We did a lot of experimentation with ZVOLs here and I'd like to offer a few suggestions.

1. RMW can come from above you as well as from within ZFS. Depending on what parameters you're using on your filesystem and what you set for your block device, you can end up with either the VM subsystem or user land thinking that you have a large minimum IO size, and they will try to pull in data before they write out. With zvols, always always always blktrace them as you're setting up to see what is going on. We found that some filesystem options (large XFS allocsize=) could provoke RMW from the pager when things were being flushed out. If you blktrace and see reads for a block coming in before the writes do, you are in this situation.

2. Proper setup is essential, and "proper" is a matter of perspective. Usually it's best to configure a volume as though it was on a RAID stripe either the size of the volblocksize, or half that size. The reason you might choose a smaller size is if you are on a pool with no SLOG and you want all writes to the zvol to go to ZIL blocks instead of indirect sync, as large-block zvols do with full-block writes. Or, you may want to refactor your data into larger chunks for efficiency or synchronization purposes.

3. Poor inbound IO merge: it's best to configure a filesystem on a zvol to expose a large preferred IO size to applications, allowing FIO to come through in big chunks.

4. Always use primarycache=all.

5. If you use XFS on zvols, use a separate 4K-volblocksize ZVOL for the XFS filesystem journal. This can be small; 100 MB is more than enough. This keeps the constant flushing that XFS does out of your primary ZVOL, and allows things to aggregate much more effectively.

Here's an example:

zfs create -V 1g -o volblocksize=128k tank/xfs
zfs create -V 100m -o volblocksize=4k tank/xfsjournal
mkfs.xfs -s size=4096 -d sw=1,su=131072 -m crc=0 -l logdev=/dev/zvol/tank/xfsjournal /dev/zvol/tank/xfs
mount -o largeio,discard,noatime,logbsize=256K,logbufs=8 /dev/zvol/tank/xfs /somewhere

largeio + large stripe unit + separate XFS journal has been the winning combination for us. Hope this helps.

Very good points. Thanks a lot
Samuel
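For the blktrace advice above, one possible invocation to spot reads arriving before writes (the device name is a placeholder; exact blkparse output fields vary slightly between versions):

```shell
# Trace a zvol and filter for read events hitting it:
blktrace -d /dev/zd0 -o - | blkparse -i - | grep ' R '
# Reads for a block showing up just before writes to the same offsets
# indicate RMW being induced from above ZFS.
```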
A little gem I came up with that I haven't seen elsewhere... Large zvols cause more TxG commit activity. The big danger from this is RMW reads, which can stomp on other IO that's going around.

Measure TxG commit speed. Open the ZIO throttle. Then, set zfs_sync_taskq_batch_pct=1 and do a TxG commit. Raise it slowly until TxG commit speed is a little slower than it was before the test. This will rate-limit the TxG commit and the RMW reads that come off of it, and can also help I/O aggregation. I came up with this approach when I developed a remote backup system that went to block devices on the far side of a WAN.

With this you can run long intervals between commits and carry plenty of dirty data, which helps reduce RMW. Once you set the sync taskq, turn the ZIO throttle on and adjust it to just before where it starts to have an effect. This will match these two parameters to the natural flow of the system. At this point you can usually turn aggregation way up and drop the number of async writers some. Oh, and make sure your dirty data write throttle is calibrated correctly and has enough room to work; ndirty should stabilize in the middle of its range during high-throughput workloads.

We mostly use 128K-256K zvols. They work very well and beat out ZPL mounts for MongoDB performance for us. Performance is more consistent than ZPL mounts provided you're good to them (don't do indirect sync writes with a small-to-moderate block size zvol unless you don't care about read performance).
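A hedged sketch of the knobs described above (the parameter names are standard OpenZFS module parameters; the values are starting points for the tuning loop, not recommendations):

```shell
# Start the sync taskq at its minimum, then raise it while timing commits:
echo 1 > /sys/module/zfs/parameters/zfs_sync_taskq_batch_pct
zpool sync tank        # force a TxG commit and time it (ZFS >= 0.8)
# ...raise zfs_sync_taskq_batch_pct until commits are slightly slower than before...
# Widen the interval between commits so more partial-record writes coalesce:
echo 60 > /sys/module/zfs/parameters/zfs_txg_timeout
# Check the dirty data ceiling has room for writes to accumulate:
cat /sys/module/zfs/parameters/zfs_dirty_data_max
```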
I realized there are a lot of comments here that are coming from the wrong place on RMW reads and how ZFS handles data going into the DMU. Unless in the midst of a TxG commit, ZFS will not issue RMW reads for partial-blocksize writes unless they are indirect sync writes, and you can't get a partial-block indirect sync write on a ZVOL due to how zvol_immediate_write_size is handled. Normally the TxG commit handles all RMW reads when necessary at the start of the commit, and none happen between commits.

The RMW reads people are bothered by are actually coming from the Linux kernel, in fs/buffer.c. With a 4k superblock inode size you can run a ZVOL with a huge volblocksize, commit a TxG once a minute, and handle tiny writes without problem: zero RMW if all the pieces of the block show up before the TxG commit. Hope this helps.
@janetcampbell While I agree that a reasonably sized recordsize is key to extracting good read performance, especially from rotating media, I think you are missing the fact that RMW can and will happen very early in the write process, as early as accepting the write buffer into the DMU. Let me do a practical example:
Please note how, on the first 4k write test, RMW (with synchronous reads) happens as soon as the write buffers are accepted into the DMU. The second identical write test, which is done without dropping the caches, avoids the RMW (especially its synchronous read part) and shows much higher write performance. Again, merging at write time is working correctly.

This is, in my opinion, the key reason why people say ZFS needs tons of memory to perform well: with the R part of RMW being so penalizing, reducing it with a very large ARC can be extremely important. It should be noted that L2ARC works very well in this scenario, and this is the main reason why I often use a cache device even on workloads with a low L2ARC hit rate.
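A reproduction along the lines described (the pool name and fio job names are illustrative; note that drop_caches alone does not empty the ARC, so a pool export/import gives a cleaner first pass):

```shell
zfs create -o recordsize=128k tank/rmwtest
fio --name=prefill --directory=/tank/rmwtest --size=1g --bs=1m --rw=write
zpool export tank && zpool import tank   # empty the ARC before the first pass
fio --name=pass1 --directory=/tank/rmwtest --size=1g --bs=4k --rw=write  # heavy r/m/w
fio --name=pass2 --directory=/tank/rmwtest --size=1g --bs=4k --rw=write  # records cached, much faster
# Watch "zpool iostat -v tank 1" during pass1 to see the synchronous reads.
```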
@DemiMarie Based on my tests, no: the overlying device mapper will not expose any IO scheduler, negating early IO merging. That said, the real performance killer is the synchronous read IO needed for a partial record update. To somewhat mitigate that, you can use ZVOLs while avoiding O_DIRECT file IO (i.e. using the Linux pagecache as an upper, coalescing buffer); however, this means double caching and possibly some bad (performance-wise) interaction between the pagecache and the ARC.
The use-case I am interested in is using ZFS in QubesOS, which means that the zvols are being exposed over the Xen PV disk protocol. Not sure what the best answer is there. Is @sempervictus’s patch a solution? |
@DemiMarie I might be entirely wrong about this, but in your use case:
Is this done? Where is the PR? |
I would encourage all zvol users to test drive my block multi-queue PR here: #12664 . You could see pretty big performance improvements with it, depending on your workload. |
lazy person asking: how far in time is that PR from being merged into main? |
@mailinglists35 - hard to tell; even otherwise-complete PRs sometimes hang out in the queue for a while as other things are implemented in master. It's in the testing phase though, so closer to it than otherwise :).
Would it be possible to use the kernel’s write IO merging layer? |
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions. |
Still an issue |
Long story short:
or
See my "zvol performance" slides: |
@tonyhutter - thanks for the slide deck. How do those graphs look after a ZVOL has taken enough writes to have allocated all blocks at least once before freeing? The performance degradation over use is a very unpleasant effect in real world scenarios which benchmarks don't capture. |
Will #13148 be merged into 2.2?
@sempervictus unfortunately I didn't test that. All tests were on relatively new zvols. @MichaelHierweck that's the plan, but I'm still undecided on if blk-mq should be enabled by default or not. It's not a clear performance winner in all use cases. Currently blk-mq is not enabled by default in the master branch. |
@tonyhutter Thanks for the slides. However, I am not sure using O_DIRECT would be enough to avoid the performance degradation described by @sempervictus and detailed here: #8472 (comment) (where I am not using O_DIRECT but, writing to a plain ZFS file, I still avoid the Linux pagecache). In these cases the real performance killer is the r/m/w caused by reading the to-be-overwritten block. I think the solution is the (non-merged) async DMU patch, or am I missing something?
The async-DMU patch isn't just unmerged, it's rather unfinished. Getting it across the finish line likely requires a fair chunk of dedicated time from one of the senior folks familiar with a wide swath of the code, or a lot of time from a skilled C developer who would need to figure out all of the structures, functions, and locking semantics that the effort impacts.
How much time do you think it would take @sempervictus? |
@DemiMarie - a professional C/C++ developer colleague took a cursory pass, and since he's not of ZFS-land, the complexity of handling async dispatch in C was compounded by the complexity of ZFS itself, which anyone from outside the core team would need to learn along the way. He found a few bugs in the async pieces just on that first pass - he's quite good 😄, but that likely indicates that we're not at first down with 20 yards to go... We figured at least several weeks of professional effort (not cheap), but had no way to get an upper bound for that (the consideration at the time was to hire him to do this, but folks like that aren't exactly dripping with spare time). So I can't answer that question to anyone's satisfaction for budgeting/hiring/etc.
I think that the best chance we have at assessing the level of effort for the async DMU piece is to ask one of the heavy-hitting OpenZFS members to slot a review aimed at ascertaining requirements into their workflow over the next X months. I've fallen off somewhat in my awareness of the goings-on around here in the last year or so (not that I don't love y'all, just that infosec is somewhat of an all-consuming function these days); but off the top of my head, I figure that @ryao, @behlendorf, or @tonyhutter have the breadth of code awareness required to be able to partition and delegate efforts for such a review. Personally, I wouldn't be surprised if (given appropriate effort) this becomes the new "ABD patch set" during R&D 😄
@tuxoko or @dweeezil or @pcd1193182 might be able to take a look? (sorry for volunteering you :) ) |
I wonder if stackless coroutines could help. Those can be implemented in C with some macro abuse. If this were part of the Linux kernel I would suggest using Rust, which supports async programming natively. |
I mostly live in Rust these days, but until the GCC front-end is done, I'm not a big fan of mixing ABIs like that.
@DemiMarie The main point of the async DMU patch should be to avoid the synchronous read of the to-be-overwritten records, rather than issuing an async callback. In other words, I understand it as a means of deferring some writes to avoid useless reads. Or am I wrong?
I honestly am not sure what the async DMU patch actually does, but async C code is a known pain-point in general. |
System information
Describe the problem you're observing
Before 0.6.5.X, e.g. in 0.6.3-1.3 or 0.6.4.2, ZoL had the standard Linux block device layer for ZVOL, so one could use a scheduler (deadline or others) to merge incoming IO requests. Even with the simplest noop scheduler, contiguous IO requests could still be merged if they were sequential.
Things changed from 0.6.5.X on: Ryao rewrote the block layer of ZVOL and disabled request merging at the ZVOL layer, claiming that the DMU does IO merging. However, it seems that DMU IO merging either does not work properly or is not sufficient from a performance point of view.
The problem is as follows. ZVOL has a volblocksize setting, and in many cases, e.g. for hosting VMs, it is set to 32 KB or so. When IO requests have a request size less than the volblocksize, read-modify-writes (RMW) occur, leading to performance degradation. A scheduler, such as deadline, is capable of sorting and merging IO requests, thus reducing the chance of RMW.
Describe how to reproduce the problem
Create a not-so-big ZVOL with a volblocksize of 32 KB and use fio to issue a single sequential 4 KB write workload. After a while (once the ZVOL is filled with some data), using either "iostat -mx 1 10" or "zpool iostat 1 10", one can see a lot of read-modify-writes. Note that at the beginning of the writes there will be few or no RMWs, because the ZVOL is almost empty and ZFS can intelligently skip reading zeros.
In contrast, if fio issues a sequential write workload of size 32 KB, 64 KB, or larger, there is no RMW no matter how long the workload runs.
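The steps above can be sketched concretely (pool and device names are placeholders):

```shell
zfs create -V 10g -o volblocksize=32k tank/testvol
fio --name=seq4k --filename=/dev/zvol/tank/testvol --bs=4k --rw=write \
    --ioengine=libaio --direct=1 --size=10g
# In another terminal, watch for reads appearing under a write-only workload:
zpool iostat -v tank 1
# Re-run with --bs=32k (matching the volblocksize): the reads should disappear.
```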
Apparently the IO merging logic for ZVOLs is not working properly. Either re-enabling the block device scheduler choice of deadline or noop, or fixing the broken IO merging logic in the DMU, should fix this performance issue.
Include any warning/errors/backtraces from the system logs