Memory reclaim and cache interaction #6035
Conversation
@dbavatar, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @grwilson and @prakashsurya to be potential reviewers.
Are you running 0.6.5.y and tried this on top, or are you running 0.7.0-rc3 and tried this on top? There have been substantial changes to memory behavior for 0.7.0, so before evaluating these changes, I would like to know what branch was used for testing them.

Edit: I am pleasantly surprised by how small these changes are. It will take some time to study each change, but there is nothing here that looks unreasonable at a glance. :)
@ryao 6.5.x, but all the changes still apply; the things it's fixing are still quite broken in 7.x. I've made the changes quite small, so they should be readable. I can do additional testing in the lab running on 7.x.
As I said, I need some time to review these changes, but there is nothing unreasonable that I see at a glance. There are some minor style issues that need to be fixed though.
module/zfs/arc.c
Outdated
	return (SHRINK_STOP);
	if (mutex_tryenter(&arc_reclaim_lock) == 0) {
		ARCSTAT_INCR(arcstat_need_free, ptob(sc->nr_to_scan));
		return 0;
This is a style violation. The return value should be surrounded by parenthesis. Use ./scripts/cstyle.pl on arc.c to catch these.
Yes I saw the failures, I will fix/repush the whole branch when I get a chance.
Little things like this cause issues for my pull requests more often than I am willing to admit (although it is public record, so anyone can go look). I look forward to the repush. :)
@dbavatar I agree that the things you changed are common to both. So far, I am not seeing anything I dislike in these changes, although I am still reading them. I probably won't finish my own review until much later in the day. In the interim, it would be great if you could fix the style issues.
Without making any particular comments on this patch stack yet, there's an argument to be made that many of the ZoL-specific hacks in the ARC code are ripe for review. For example, the "balanced" metadata adjustment strategy, which is frequently the cause of high CPU loading, may very well no longer be needed, or should at the very least be tuned in some way. Unfortunately, the ARC code is one place where there's a lot of divergence from the upstream code and it's had a lot of ZoL-specific changes, reversions, etc. over time.
Force-pushed b0c338c to 8132dbc
@dweeezil Not only that, but the Linux kernel memory behavior is substantially different from Illumos thanks to direct reclaim and other decisions. We cannot use the exact same code. :/
@ryao Indeed. This is an area where ZoL will always have plenty of divergence.
@dweeezil Yes, that is quite a heavy symbol, but I think you'll find the behavior is much improved with this branch. I believe it was overworking by at least a factor of 6-10x before.
Ghost meta/data buffers are not actually allocated

AKAMAI: zfs: CR 3695072
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Force-pushed a5858ac to bc95471
Okay, as promised, I have dug into these changes more. They look good to me. However, would you explain the rationale behind the scaling-in-proportion-to-page-cache patch? Given that it is off by default, I am not sure I understand the use case where this knob helps. Also, we need to figure out why the buildbot was unhappy before this can be merged.

It is worth noting that the lack of atomics is a Linux-specific issue, because we altered this code to run in a direct reclaim context, where on other systems it runs on a dedicated arc_reclaim thread. I would say that this is fixing a regression caused by that change, if the interactions of the old approach and Linux's direct reclaim were not so terrible that even these regressions were a step forward back then.
The theoretical background to why this is necessary: the pagecache in the Linux kernel is one giant LRU page list, which for the most part consists of all "free" memory. Now imagine a workload where there are frequently used files, and also file scanning going on. Every time a frequently used page is used, it gets moved to the tail of the LRU. Since the LRU is likely very deep, the probability of reclaiming your frequently used file is low, as measured by the distance to the head of the list.

Now, imagine that you split this LRU in half, fitting 2 into the same fixed memory space. That immediately means the probability of reclaiming your frequently used page is now much higher, measured as the distance between the head of the LRU and your page. So we already start at a disadvantage in the ideal scenario, where we've just split 50/50. But this is unlikely on a real workload. What's more likely to happen is that one of those LRUs has more pressure on it than the other, one of them becomes very small, and you keep dropping frequent pages from it, leading to huge latencies. The ZFS MRU/MFU is actually trying to prevent this exact thing from happening inside the ARC, but when you have (MRU+MFU)+LRU in the same memory space, this no longer functions properly. This is exactly what we ran into.

Although the rest of this branch already helps, since it stops excessive reclaims from hitting the ARC, there is still the possibility of scanning going on with pagecached filesystems, the memory pressure of which will shrink the ARC down to arc_min. This is too small to fit our frequently used files, leading to huge latencies which were never expected by the original app author, since pagecache served effectively as an infinite cache machine for small files. It is also not acceptable to keep increasing arc_min until the problem goes away, because that memory effectively becomes unusable; we need ZFS to have an effective overhead of near 0.
We want it to behave in the same way as pagecache and all other kernel reclaimable memory: under high pressure the memory can be reclaimed, but otherwise it utilizes all "free" memory. This is exactly what zfs_arc_pc_frac enables. Until there is pressure/reclaim, everything works as normal; under pressure, we ensure that we both allow reclaiming of all our memory and also withstand scanning pressure from pagecache.

It's off by default because we can't estimate a user's workload, nor whether they care about balancing the ARC with pagecache. Even if a user has converted 100% to ZFS, there are still other uses for pagecache, like mounted FUSE filesystems; scanning on a FUSE filesystem could easily obliterate the ARC. For our workloads I've set the scaling factor at 15%, but I can imagine a scenario where someone might want to set it at 500%, for example.

Another possible solution for Linux would be to remove the ARC and use pagecache to cache everything. That is obviously a very invasive change, but at least we'd be using the same LRU as everything else.
Seems like false positives, how can we kick off another run without updating the commits?
Not quite sure what you're getting at here. Although, I don't think the arc_reclaim thread is necessary at all on Linux; the kernel doesn't even export all the feedback mechanisms needed to get it to behave properly. I have it effectively disabled (arc_sys_free lower than the kernel min watermark).
git commit --amend the latest commit to change the ID, and force-push the branch - that should most likely be enough
@dbavatar Removing the ARC would be a major performance regression for workloads that benefit from it. It is rather unfortunate that the ARC doesn't play nicely with the page cache of other filesystems. Having heard that, I definitely see the use case for this tunable now. You might want to clarify in the documentation that the tunable exists to make ZFS play more nicely with the page cache LRU.

As for my arc_reclaim remarks: we had gutted it and left the remaining functions in a thread that was called arc_adapt. At some point that I had not realized when I wrote that comment, we changed the name back to arc_reclaim. See 302f753 for the original name change. Anyway, my point was that on other systems, reclamation from ARC is done entirely in arc_reclaim. There is no direct reclaim and therefore no need for atomics, because only one thread will ever be doing reclaim. The need for atomics is one that the project manufactured when it ported ZFS to Linux and adapted it to direct reclaim.
Force-pushed bc95471 to f897ff4
All build checks are now passing. I also did some smoke testing with 7.x; no issues.
This looks good to me.
In the commit comment, s/arc_needs_free/arc_need_free/
Ensures proper accounting of bytes we requested to free

AKAMAI: zfs: CR 3695072
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Force-pushed f897ff4 to 319e357
I'd like to suggest some wording for the commit log of the form:
In the arc shrinker, only increment arcstat_need_free (arc_need_free) when reclaim is in progress.
And then maybe a bit of explanation about the "doubling".
This is actually not what is happening. That mutex isn't held during reclaim; it's only protecting some variables, which is why I suggested that perhaps this should be a spinlock instead, but that would require some more work. The problem being solved here is that if the mutex is held, we cannot sleep here, so we accumulate what we should have reclaimed this round, instead of just ignoring it. The doubling is from both doing the reclaim and also scheduling the same amount of data for background reclaim. I can add more to the commit if that's still not clear.
Suggested wording for the commit log comment:
Lock contention, by itself, shouldn't indicate a stop condition to the kernel's slab shrinker. Doing so can cause stalls when the kernel is trying to free large parts of the cache, such as is done by drop_caches.
Some commit log comment suggestions:
s/evictabe/evictable/
In the second example, how about "..., arc_kmem_reap_now still consumes excessive CPU".
Force-pushed 319e357 to 6e01d6e
Move arcstat_need_free increment from all direct calls to when arc_reclaim_lock is busy and we exit without doing anything. Data will be reclaimed in the reclaim thread. The previous location meant that we both reclaim the memory in this thread, and also schedule the same amount of memory for reclaim in arc_reclaim, effectively doubling the requested reclaim.

AKAMAI: zfs: CR 3695072
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Lock contention, by itself, shouldn't indicate a stop condition to the kernel's slab shrinker. Doing so can cause stalls when the kernel is trying to free large parts of the cache, such as is done by drop_caches. Also, perhaps arc_reclaim_lock should be a spinlock, and this code eliminated.

AKAMAI: zfs: CR 3593801
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
AKAMAI: zfs: CR 3695072
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Calling it when nothing is evictable will cause extra kswapd CPU. Also, if we didn't shrink, it's unlikely to have memory to reap, because we likely just called it microseconds ago. The exception is if we are in direct reclaim. You can see how hard this is being hit in kswapd with a light test workload:

   34.95%  [zfs]     [k] arc_kmem_reap_now
    5.40%  [spl]     [k] spl_kmem_cache_reap_now
    3.79%  [kernel]  [k] _raw_spin_lock
    2.86%  [spl]     [k] __spl_kmem_cache_generic_shrinker.isra.7
    2.70%  [kernel]  [k] shrink_slab.part.37
    1.93%  [kernel]  [k] isolate_lru_pages.isra.43
    1.55%  [kernel]  [k] __wake_up_bit
    1.20%  [kernel]  [k] super_cache_count
    1.20%  [kernel]  [k] __radix_tree_lookup

With ZFS just mounted but only ext4/pagecache memory pressure, arc_kmem_reap_now still consumes excessive CPU:

   12.69%  [kernel]  [k] isolate_lru_pages.isra.43
   10.76%  [kernel]  [k] free_pcppages_bulk
    7.98%  [kernel]  [k] drop_buffers
    7.31%  [kernel]  [k] shrink_page_list
    6.44%  [zfs]     [k] arc_kmem_reap_now
    4.19%  [kernel]  [k] free_hot_cold_page
    4.00%  [kernel]  [k] __slab_free
    3.95%  [kernel]  [k] __isolate_lru_page
    3.09%  [kernel]  [k] __radix_tree_lookup

Same pagecache-only workload as above with this patch series:

   11.58%  [kernel]  [k] isolate_lru_pages.isra.43
   11.20%  [kernel]  [k] drop_buffers
    9.67%  [kernel]  [k] free_pcppages_bulk
    8.44%  [kernel]  [k] shrink_page_list
    4.86%  [kernel]  [k] __isolate_lru_page
    4.43%  [kernel]  [k] free_hot_cold_page
    4.00%  [kernel]  [k] __slab_free
    3.44%  [kernel]  [k] __radix_tree_lookup

(arc_kmem_reap_now has 0 samples in perf)

AKAMAI: zfs: CR 3695042
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Could return the wrong pages value

AKAMAI: zfs: CR 3695072
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Force-pushed 6e01d6e to f05bc2b
This patch refers to the pagecache, generically; however, it's only paying attention to the NR_FILE_PAGES category.

Although I certainly understand what this is trying to accomplish w.r.t. interaction with pagecache-using filesystems (ext4), what about regular user programs, which would have their memory accounted for in anonymous pages? Would it make sense to have a separate tunable to set ratios against other classes? To the sums of some classes? Maybe a tunable to act as a mask and select the types of pagecache against which the ratio would be enforced?

Even with this patch, it seems a memory-hogging userland program could cause the same problem this one is trying to avoid.

If this patch stays as-is, I think at least the documentation should mention that it's somewhat specific to memory allocated by non-ZFS filesystems (and the handful of other cases which allocate NR_FILE_PAGES).
This whole stack looks OK except for some documentation and comment clarifications. Also, the f05bc2b commit seems to be specific to a very particular, albeit very common and likely very important for many users, case. I still think there's a case to be made for increasing
@dbavatar Thanks for the clarification about the doubling; that wasn't clear to me, and was why I was asking for an expanded comment.
Regular user-allocated (anonymous) memory does not matter; it is not reclaimable memory. It can be swappable, but the objective is to use free RAM, not to make user programs swap out to get even more RAM. However, the point is taken that NR_FILE_PAGES is not necessarily the only measure you might want to scale against. One could argue that it should be all of reclaimable memory, but I think that's not quite right, because we aren't competing with them for file-based resources, i.e. the split-LRU problem described above. They behave the same way whether pagecache is competing against them, or the ARC is, so I think it's best not to scale against them. I can think of corner cases to this, like what about NR_FILE_MAPPED pages, but it becomes problematic trying to separate mapped pages that are represented in NR_FILE_PAGES from those that are not. Some things we might want to measure against aren't even exported from Linux, if available at all, like swapcache. However, since zfs_arc_pc_frac is an empirically derived tunable meant to support a given workload, you can turn the knob either way to compensate, up to a point.
If a user program allocates lots of memory, we must assume it is because it must do so to complete its task, and we should give up our reclaimable memory as requested by the kernel. It is quite likely for this scenario to coexist with the one that is usually balancing caches, which is also why we don't want to keep increasing zfs_arc_min. If you increase zfs_arc_min, users will likely complain that ZFS is a memory hog compared to every other in-tree fs.
Pushing update now.
Force-pushed f05bc2b to 3caa00c
.sp
Default value: \fB0\fR (disabled).
.RE
@dbavatar Wouldn't it be more intuitive to express zfs_arc_pc_frac as a percentage of the size of the pagecache, and then maybe call it zfs_arc_pc_pct? It would be more congruent with other (percentage-based) tunables and could be used as:
uint64_t min = ptob(global_page_state(NR_FILE_PAGES)) * zfs_arc_pc_pct / 100;
And documented as "percentage of the pagecache size to which the ARC may be reduced".
We could do that; I was just hoping to avoid the divide, having been concerned about kswapd CPU from the beginning.
module/zfs/arc.c
Outdated
 * Scale reported evictable memory in proportion to page cache, cap
 * at specified min/max.
 */
Don't need a blank line here.
Force-pushed e9d4f18 to aabd0f0
When multiple filesystems are in use, memory pressure causes arc_cache to collapse to a minimum. Allow arc_cache to maintain a proportional size even when hit rates are disproportionate. We do this only via evictable size from the kernel shrinker, thus it's only in effect under memory pressure.

AKAMAI: zfs: CR 3695072
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Force-pushed aabd0f0 to efd766c
@dbavatar thanks for tackling this issue and breaking it into small individual commits to make it easier to review. The latest version of the proposed patch stack LGTM.
I agree we should merge this with zfs_arc_pc_percent=0 initially to be conservative. But at some point we'll want to enable this functionality by default. Do you have a good sense from your testing of what a reasonable default value would be for most systems?
@behlendorf This is a difficult question. We could try 100%, i.e. under pressure we would do a 50/50 split. In the desktop case that would prevent scanning on other filesystems from obliterating the ARC. But is that what we are aiming for, a desktop ZFS user? Anyone using ZFS for mission-critical stuff is going to want to change this. We could as well introduce a static number, i.e. limit pagecache to 1GB. Tangentially, we could potentially make this work better if we could export more variables and/or hooks from the upstream kernel, but these would basically be only to support ZFS; does anyone know if they've been receptive to changes like that?
Ideally we want ZFS to work well for both desktop and data center workloads. If we do need to prefer one configuration over the other, we'll want to opt for the large-scale, mission-critical stuff which is ZFS's core strength. This is ready to go, so I'll get it merged today. That'll make it easier to run some benchmarks so we can settle on a reasonable default.
It's been my experience that the upstream kernel community hasn't been willing to expose existing interfaces for external consumers, let alone add new ones. As a general rule they understandably want at least one in-tree consumer, so I suspect they wouldn't be particularly receptive to this approach. I don't think that means we're out of luck at all. We can make use of the existing filesystem / mm interfaces, and it wouldn't surprise me if the kernel already provides everything we need. Integrating the ARC more closely with the page cache has always been something we wanted to investigate. Between the callbacks provided by the
Ghost meta/data buffers are not actually allocated

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Ensures proper accounting of bytes we requested to free

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Move arcstat_need_free increment from all direct calls to when arc_reclaim_lock is busy and we exit without doing anything. Data will be reclaimed in the reclaim thread. The previous location meant that we both reclaim the memory in this thread, and also schedule the same amount of memory for reclaim in arc_reclaim, effectively doubling the requested reclaim.

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Lock contention, by itself, shouldn't indicate a stop condition to the kernel's slab shrinker. Doing so can cause stalls when the kernel is trying to free large parts of the cache, such as is done by drop_caches. Also, perhaps arc_reclaim_lock should be a spinlock, and this code eliminated.

AKAMAI: zfs: CR 3593801
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Calling it when nothing is evictable will cause extra kswapd CPU. Also, if we didn't shrink, it's unlikely to have memory to reap, because we likely just called it microseconds ago. The exception is if we are in direct reclaim. You can see how hard this is being hit in kswapd with a light test workload:

   34.95%  [zfs]     [k] arc_kmem_reap_now
    5.40%  [spl]     [k] spl_kmem_cache_reap_now
    3.79%  [kernel]  [k] _raw_spin_lock
    2.86%  [spl]     [k] __spl_kmem_cache_generic_shrinker.isra.7
    2.70%  [kernel]  [k] shrink_slab.part.37
    1.93%  [kernel]  [k] isolate_lru_pages.isra.43
    1.55%  [kernel]  [k] __wake_up_bit
    1.20%  [kernel]  [k] super_cache_count
    1.20%  [kernel]  [k] __radix_tree_lookup

With ZFS just mounted but only ext4/pagecache memory pressure, arc_kmem_reap_now still consumes excessive CPU:

   12.69%  [kernel]  [k] isolate_lru_pages.isra.43
   10.76%  [kernel]  [k] free_pcppages_bulk
    7.98%  [kernel]  [k] drop_buffers
    7.31%  [kernel]  [k] shrink_page_list
    6.44%  [zfs]     [k] arc_kmem_reap_now
    4.19%  [kernel]  [k] free_hot_cold_page
    4.00%  [kernel]  [k] __slab_free
    3.95%  [kernel]  [k] __isolate_lru_page
    3.09%  [kernel]  [k] __radix_tree_lookup

Same pagecache-only workload as above with this patch series:

   11.58%  [kernel]  [k] isolate_lru_pages.isra.43
   11.20%  [kernel]  [k] drop_buffers
    9.67%  [kernel]  [k] free_pcppages_bulk
    8.44%  [kernel]  [k] shrink_page_list
    4.86%  [kernel]  [k] __isolate_lru_page
    4.43%  [kernel]  [k] free_hot_cold_page
    4.00%  [kernel]  [k] __slab_free
    3.44%  [kernel]  [k] __radix_tree_lookup

(arc_kmem_reap_now has 0 samples in perf)

AKAMAI: zfs: CR 3695042
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Could return the wrong pages value

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Description
Fix memory reclaim behavior and add a scaling factor to balance multiple caches (default disabled)
Motivation and Context
We deployed ZFS to several thousand machines and immediately started noticing some bad performance characteristics. Our machines still have traditional filesystems utilizing pagecache, which compete with ZFS. I identified that one of the main problems was that the arc cache would collapse to its minimum, regardless of several GB of memory being free or used by pagecache and other reclaimable objects. This set of patches corrects the reclaim behavior, and also adds an option that allows balancing of multiple caches via a scaling factor, which only activates on reclaim. This means that under memory pressure a user-defined scaling factor will balance pagecache and the arc cache, and when not under pressure everything works normally. This is disabled by default.

In addition, this patchset means the zfs_arc_sys_free algorithm is no longer necessary. Although some users may want to set it for other reasons, the zfs module neither gets the appropriate watermarks from the kernel, nor does it need to, as the feedback mechanism is the reclaim callback, which is based on the true watermarks. Performance should be improved by disabling this by default; I can add a patch later for this.
Other fixes:
How Has This Been Tested?
Tested with production traffic on 100 machines. Verified that the arc cache does not collapse to its minimum size, and that we do not incur a performance hit from heavy MFU cache misses. Expanding to 100k+ machines over several weeks.