Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce write throttle smoothing #12868

Closed
wants to merge 1 commit into from

Conversation

allanjude
Copy link
Contributor

@allanjude allanjude commented Dec 16, 2021

To avoid stalls and uneven performance during heavy write workloads,
continue to apply the write throttling even when the amount of dirty data
has temporarily dipped below zfs_delay_min_dirty_percent for up to
zfs_write_smoothing seconds.

Sponsored by: Zededa Inc.

Motivation and Context

Consider a case where new data is written to the ZFS dirty buffers at the rate that the backing storage cannot keep up with. ZFS allows writing data at full speed until the amount of dirty data in memory reach the threshold (zfs_delay_min_dirty_percent, default is 60%). When this point is reached, ZFS starts to apply artificial delays. The applied delays scales based on the amount of dirty data and is described by the formula:

delay_ns = zfs_delay_scale * (amount_of_dirty_data - delay_min_bytes) / (zfs_dirty_data_max - amount_of_dirty_data)

While data is added at a somewhat consistent rate, data is drained from the dirty buffers in large swaths as each transaction group is flushed to disk. This can result in the amount of dirty data dropping below the 60% threshold again, resulting in no delay being applied, and ZFS accepts data at full speed.

This causes applications to experience significant changes in write throughput depending on the amount of dirty data, especially on very heavy write workloads. The behavior is noticeable because the delay is applied only when the threshold is exceeded.

Description

The idea is to continue to apply the write throttle (adding delay to every write) for a short time even after the amount of dirty data has fallen below the threshold. The smoothing mechanism continues to be applied for zfs_write_smoothing seconds after the last time when the amount of dirty data was above the threshold. Thanks to this, applications will not be able to burst writes again if the backing storage is still possibly unable to keep up with the level of incoming writes.
The additional delay will be applied only when the backing storage has been overloaded within the last few seconds.

The original delaying algorithm works as before.

The formula used for the smoothing algorithm is as follow:

delay_ns = zfs_delay_scale * (amount_of_dirty_data) / ((zfs_dirty_data_max - amount_of_dirty_data));

The characteristic of this formula is shown in the Figure below in yellow color. It’s more aggressive than the previous one because when the dirty data exceeds the threshold, we apply the previous formula.

smoothed_write_throttle

How Has This Been Tested?

We performed several tests using the fio utility. We observed the amount of dirty data during these tests and applied delays. Both those observation was done using eBPF.
Example of eBPF to observe the change in the amount of data buffers:

bpftrace -e ‘kprobe:dsl_pool_dirty_delta {@val[tid] = (int8 *)arg0 + 0x248;} kretprobe:dsl_pool_dirty_delta {printf(“%lld %lld\n”, nsecs, *(uint64 *)@val[tid]); delete(@val[tid]); }’

To observe the applied delay:

bpftrace -e ‘kprobe:dmu_tx_assign { @time[tid] = 0 } kprobe:dmu_tx_wait { @tmp[tid] = nsecs } kretprobe:dmu_tx_wait /@tmp[tid]/ { @time[tid] += nsecs - @tmp
[tid] } kretprobe:dmu_tx_assign { @ns = hist(@time[tid]); delete(@time[tid]) } END { clear(@time); clear(@tmp); }’

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Dec 17, 2021
@allanjude allanjude force-pushed the smoothing branch 2 times, most recently from 1384257 to bb2e82b Compare December 24, 2021 15:21
man/man4/zfs.4 Outdated Show resolved Hide resolved
}
if (last_smooth > now) {
smoothing_time = zfs_smoothing_scale *
(dirty >> 1) / ((delay_min_bytes - dirty) << 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the magic with the shifts? Wouldn't decreasing zfs_smoothing_scale 4 times do the same?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If dirty == delay_min_bytes here should be division by zero. If it goes higher, then smoothing time will be zero due to negative overflow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, the magic with the number is over complicated.
I will simplify that, thank you!
Dividing by zero is a nice catch - It showed us that we made a mistake when putting this commit together. We have used different formulas in the descriptiona and final commit.


if (dirty <= delay_min_bytes)
now = gethrtime();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional gethrtime() per I/O may be not huge, but not free also.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one actually isn't per IO - this one is in one that we "almost know" that we have to delay.
The one per IO would be in dsl_pool_need_dirty_delay.
But in general I agree that this isn't free.

@amotin
Copy link
Member

amotin commented Dec 27, 2021

The problem of throughput fluctuation with txg commits is known for a long time. But IIRC Matt Ahrens already tried to fix it by decrementing ammount of dirty data on write ZIOs completion. If you still observe significant correlation between the dirty data fluctuation and TXG commits, then I guess it may be some issue with that write completion accounting (there are no leaks by design, since on each TXG commit all that left is cleared any way), or on opposite side there could be periods of time at the end of TXG commit where little data written, but much time spent, including on cache syncs. Or there may be some other sources of fluctuation, such as dedup tables, which update may take significant time at each TXG commit.

man/man4/zfs.4 Outdated Show resolved Hide resolved
Copy link
Contributor

@oshogbo oshogbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review!

man/man4/zfs.4 Outdated Show resolved Hide resolved
man/man4/zfs.4 Outdated Show resolved Hide resolved

if (dirty <= delay_min_bytes)
now = gethrtime();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one actually isn't per IO - this one is in one that we "almost know" that we have to delay.
The one per IO would be in dsl_pool_need_dirty_delay.
But in general I agree that this isn't free.

}
if (last_smooth > now) {
smoothing_time = zfs_smoothing_scale *
(dirty >> 1) / ((delay_min_bytes - dirty) << 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, the magic with the number is over complicated.
I will simplify that, thank you!
Dividing by zero is a nice catch - It showed us that we made a mistake when putting this commit together. We have used different formulas in the descriptiona and final commit.

@@ -103,6 +103,7 @@ unsigned long zfs_dirty_data_max = 0;
unsigned long zfs_dirty_data_max_max = 0;
int zfs_dirty_data_max_percent = 10;
int zfs_dirty_data_max_max_percent = 25;
unsigned long zfs_smoothing_write = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like when users are required to tune things for getting good results. If this patch is so good, why is it disabled by default?

And as I've written in my comment, it would be good to understand the actual cause of fluctuation. Why added write accounting did not make it smooth enough? Unavoidable fluctuations due to metadata updates and cache flushes during commit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alexander, thank you for raising your concerns.

In general, the existing mechanism should be enough in most cases.
We can tune it to slow down writes from some threshold, and the delay will increase over time.
Our test in our configuration means applying the delay very early and with more significant delays, which causes a drop in overlay performance.
But this ofc would make the writes more stable.

As you mention, the fluctuation with txg commits has been known for a long time, and this is a small
an implementation which in some cases may help users.
The issue is that the current mechanism is not adaptive. You apply delays without looking
at the current overload. If there is a big chunk of data and then nothing, do we
want to use the delay? And on the other hand, if we just flushed a big piece of data
and there is another chunk of data and another, should we apply the delays?
The current algorithm looks only at one variable, the current usage of the dirty buffers.

This patch is non-ideal - we agree. But we try to address this issue somehow and give at least some mechanism to test.
We can also look at this as the historical data to determine if we should or should not apply pressure.

Why is this disabled by default? We successfully tested it on our installation, but we are unsure how all configurations will react. We prefer to start small and see how the broader community will respond to this approach. We are more than happy to enable it by default at some point. On the other hand, if there will be a better approach or it seems unpracticed on most installations, we are more than happy to remove it. Fortunately, the mechanism isn't very intrusive. It is not an on-disk format change, so if we decide to adapt it, or replace it with another mechanism in the future, we won't have any problems doing that.

If you have another idea of dealing with fluctuation, we are more than happy to try it.

To avoid stalls and uneven performance during heavy write workloads,
continue to apply write throttling even when the amount of dirty data
has temporarily dipped below zfs_delay_min_dirty_percent for up to
zfs_write_smoothing seconds.

Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com>
@@ -15,7 +15,7 @@
.\" own identifying information:
.\" Portions Copyright [yyyy] [name of copyright owner]
.\"
.Dd June 1, 2021
.Dd January 8, 2021
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like unnecessary change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hym, but when we change the man page shouldn't we bump the .Dd?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, should be more specific - January 8, 2021 is older than June 1, 2021. Or maybe it's a typo and it should be 2022?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep it should be 2022. Thanks!

@@ -1089,10 +1103,11 @@ dmu_tx_wait(dmu_tx_t *tx)
DMU_TX_STAT_BUMP(dmu_tx_dirty_over_max);
while (dp->dp_dirty_total >= zfs_dirty_data_max)
cv_wait(&dp->dp_spaceavail_cv, &dp->dp_lock);
last_smooth = dp->dp_last_smooth;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the read of dp_last_smooth not being done in dmu_tx_delay, like the update?

Is it because we need to hold the mutex, and don't want to pay the cost of re-acquiring in dmu_tx_delay just to read the value?
Does it actually make a perf difference in practice?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, its because we would need to take again a mutex.
This method is equivalent to dp_dirty_total which also requires mutex.

@@ -115,6 +117,7 @@ typedef struct dsl_pool {
kcondvar_t dp_spaceavail_cv;
uint64_t dp_dirty_pertxg[TXG_SIZE];
uint64_t dp_dirty_total;
hrtime_t dp_last_smooth;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc comment suggestion, if I understand the code correctly:

The point in time time after which the write smoothing mechanism has no effect.

If that interpretation of the variable's role is correct, I'd prefer a different name.
For example, dp_stop_write_smooth_after.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a variable described in doc.
This is variable to store a time of last smoothing.

if (dirty <= delay_min_bytes)
now = gethrtime();

if (dirty <= delay_min_bytes && last_smooth <= now)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can avoid this early exit if we can take the "performance hit" of the two zero initializations in line 799 and 800, and the MIN(MAX(...)) code in line 811.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't it make code more readable?


if (dirty > delay_min_bytes) {
delay_time = zfs_delay_scale *
(dirty - delay_min_bytes) / (zfs_dirty_data_max - dirty);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably use the occasion to READ_ONCE(zfs_dirty_data_max) at the top of the function, into a const value.
Then make all use sites of that variable in this function use the const variable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can try that.

Copy link
Contributor

@oshogbo oshogbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Address comments from @problame

@@ -115,6 +117,7 @@ typedef struct dsl_pool {
kcondvar_t dp_spaceavail_cv;
uint64_t dp_dirty_pertxg[TXG_SIZE];
uint64_t dp_dirty_total;
hrtime_t dp_last_smooth;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a variable described in doc.
This is variable to store a time of last smoothing.

if (dirty <= delay_min_bytes)
now = gethrtime();

if (dirty <= delay_min_bytes && last_smooth <= now)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't it make code more readable?


if (dirty > delay_min_bytes) {
delay_time = zfs_delay_scale *
(dirty - delay_min_bytes) / (zfs_dirty_data_max - dirty);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can try that.

@@ -1089,10 +1103,11 @@ dmu_tx_wait(dmu_tx_t *tx)
DMU_TX_STAT_BUMP(dmu_tx_dirty_over_max);
while (dp->dp_dirty_total >= zfs_dirty_data_max)
cv_wait(&dp->dp_spaceavail_cv, &dp->dp_lock);
last_smooth = dp->dp_last_smooth;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, its because we would need to take again a mutex.
This method is equivalent to dp_dirty_total which also requires mutex.

@amotin
Copy link
Member

amotin commented May 17, 2022

I suspect that problem "fixed" in this PR was introduced by #12284 . If workload repeatedly overwrites the same data over and over again, log size grows faster than amount of dirty data in ARC, and instead of smooth write throttling in dmu_tx_wait() the mentioned PR triggers txg_wait_synced(dp, spa_last_synced_txg(spa) + 1), that blocks any further user writes until all dirty data is written and TXG synced, exactly as we could see on the graph provided during the previous monthly call. I am now working on amending the mentioned PR to make it also throttle writes smoothly.

@amotin amotin mentioned this pull request May 18, 2022
13 tasks
@amotin
Copy link
Member

amotin commented May 18, 2022

I think #13476 should solve the problem better and this PR should be closed, unless you know some other pathological scenario still not covered.

@allanjude
Copy link
Contributor Author

This pull request has been overcome in favour of:
#13838
and
#13839

@allanjude allanjude closed this Sep 4, 2022
snajpa added a commit to vpsfreecz/zfs that referenced this pull request Sep 21, 2022
not our work, just trying it out

openzfs#12868

Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants