Memory reclaim and cache interaction #6035
Conversation
@dbavatar, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @grwilson and @prakashsurya to be potential reviewers.
Are you running 0.6.5.y and tried this on top, or are you running 0.7.0-rc3 and tried this on top? There have been substantial changes to memory behavior for 0.7.0, so before evaluating these changes, I would like to know what branch was used for testing them.

Edit: I am pleasantly surprised by how small these changes are. It will take some time to study each change, but there is nothing here that looks unreasonable at a glance. :)
@ryao 6.5.x, but all the changes still apply; the things it's fixing are still quite broken in 7.x. I've made the changes quite small, so they should be readable. I can do additional testing in the lab running on 7.x.
As I said, I need some time to review these changes, but there is nothing unreasonable that I see at a glance. There are some minor style issues that need to be fixed though.
module/zfs/arc.c
Outdated
	return (SHRINK_STOP);
	if (mutex_tryenter(&arc_reclaim_lock) == 0) {
		ARCSTAT_INCR(arcstat_need_free, ptob(sc->nr_to_scan));
		return 0;
This is a style violation. The return value should be surrounded by parenthesis. Use ./scripts/cstyle.pl on arc.c to catch these.
Yes I saw the failures, I will fix/repush the whole branch when I get a chance.
Little things like this cause issues for my pull requests more often than I am willing to admit (although it is public record, so anyone can go look). I look forward to the repush. :)
@dbavatar I agree that the things you changed are common to both. So far, I am not seeing anything I dislike in these changes, although I am still reading them. I probably won't finish my own review until much later in the day. In the interim, it would be great if you could fix the style issues.
Without making any particular comments on this patch stack yet, there's an argument to be made that many of the ZoL-specific hacks in the ARC code are ripe for review. For example, the "balanced" metadata adjustment strategy, which is frequently the cause of high CPU loading, may very well no longer be needed, or should at the very least be tuned in some way. Unfortunately, the ARC code is one place where there's a lot of divergence from the upstream code and it's had a lot of ZoL-specific changes, reversions, etc. over time.
Force-pushed b0c338c to 8132dbc
@dweeezil Not only that, but the Linux kernel memory behavior is substantially different from Illumos thanks to direct reclaim and other decisions. We cannot use the exact same code. :/
@ryao Indeed. This is an area where ZoL will always have plenty of divergence.
@dweeezil Yes, that is quite a heavy symbol, but I think you'll find the behavior is much improved with this branch. I believe it was overworking by at least a factor of 6-10x before.
Ghost meta/data buffers are not actually allocated

AKAMAI: zfs: CR 3695072
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Force-pushed a5858ac to bc95471
Okay, as promised, I have dug into these changes more. They look good to me. However, would you explain the rationale behind the scaling-in-proportion-to-page-cache patch? Given that it is off by default, I am not sure I understand the use case where this knob helps. Also, we need to figure out why the buildbot was unhappy before this can be merged.

It is worth noting that the lack of atomics is a Linux-specific issue, because we altered this code to run in a direct reclaim context, where on other systems it runs on a dedicated arc_reclaim thread. I would say that this is fixing a regression caused by that change, if the interactions of the old approach and Linux's direct reclaim were not so terrible that even these regressions were a step forward back then.
The theoretical background to why this is necessary: the pagecache in the Linux kernel is one giant LRU page list, which for the most part consists of all "free" memory. Now imagine a workload where there are frequently used files, and also file scanning going on. Every time a frequently used page is used, it gets moved to the tail of the LRU. Since the LRU is likely very deep, the probability of reclaiming your frequently used file is low, as measured by the distance to the head of the list.

Now, imagine that you split this LRU in half, fitting 2 into the same fixed memory space. That immediately means the probability of reclaiming your frequently used page is now much higher, measured as the distance between the head of the LRU and your page. So we already start at a disadvantage in the ideal scenario, where we've just split 50/50. But this is unlikely on a real workload. What's more likely to happen is that one of those LRUs has more pressure on it than the other, one of them becomes very small, and you keep dropping frequent pages from it, leading to huge latencies. The ZFS MRU/MFU is actually trying to prevent this exact thing from happening inside the ARC, but when you have (MRU+MFU)+LRU in the same memory space, this no longer functions properly. This is exactly what we ran into.

Although the rest of this branch already helps, since it stops excessive reclaims from hitting the ARC, there is still the possibility of scanning going on with pagecached filesystems, the memory pressure of which will shrink the ARC down to arc_min. This is too small to fit our frequently used files, leading to huge latencies which were never expected by the original app author, since pagecache served effectively as an infinite cache machine for small files. It is also not acceptable to keep increasing arc_min until the problem goes away, because that memory effectively becomes unusable; we need ZFS to have an effective overhead of near 0.
We want it to behave in the same way as pagecache and all other kernel reclaimable memory: under high pressure the memory can be reclaimed, but otherwise it utilizes all "free" memory. This is exactly what zfs_arc_pc_frac enables. Until there is pressure/reclaim, everything works as normal; under pressure, we ensure that we both allow reclaiming of all our memory and also withstand scanning pressure from pagecache.

It's off by default because we can't estimate a user's workload, nor whether they care about balancing the ARC with pagecache. Even if a user has converted 100% to ZFS, there are still other uses for pagecache, like mounted FUSE filesystems; scanning on a FUSE filesystem could easily obliterate the ARC. For our workloads I've set the scaling factor at 15%, but I can imagine a scenario where someone might want to set it at 500%, for example.

Another possible solution for Linux would be to remove the ARC and use pagecache to cache everything. That is obviously a very invasive change, but at least we'd be using the same LRU as everything else.
Seems like false positives, how can we kick off another run without updating the commits?
Not quite sure what you're getting at here. Although, I don't think the arc_reclaim thread is necessary at all on Linux; the kernel doesn't even export all the feedback mechanisms needed to get it to behave properly. I have it effectively disabled (arc_sys_free lower than the kernel min watermark).
git commit --amend the latest commit to change the ID, and force-push the branch - that should most likely be enough
@dbavatar Removing the ARC would be a major performance regression for workloads that benefit from it. It is rather unfortunate that the ARC doesn't play nicely with the page cache of other filesystems. Having heard that, I definitely see the use case for this tunable now. You might want to clarify in the documentation that the tunable exists to make ZFS play more nicely with the page cache LRU.

As for my arc_reclaim remarks: we had gutted it and left the remaining functions in a thread that was called arc_adapt. At some point that I had not realized when I wrote that comment, we changed the name back to arc_reclaim. See 302f753 for the original name change. Anyway, my point was that on other systems, reclamation from ARC is done entirely in arc_reclaim. There is no direct reclaim and therefore no need for atomics, because only one thread will ever be doing reclaim. The need for atomics is one that the project manufactured when it ported ZFS to Linux and adapted it to direct reclaim.
Force-pushed bc95471 to f897ff4
All build checks are now passing. I also did some smoke testing with 7.x; no issues.
This looks good to me.
In the commit comment, s/arc_needs_free/arc_need_free/
Ensures proper accounting of bytes we requested to free

AKAMAI: zfs: CR 3695072
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Force-pushed f897ff4 to 319e357
I'd like to suggest some wording for the commit log of the form:
In the arc shrinker, only increment arcstat_need_free (arc_need_free) when reclaim is in progress.
And then maybe a bit of explanation about the "doubling".
This is actually not what is happening. That mutex isn't held during reclaim; it's only protecting some variables, which is why I suggested that perhaps this should be a spinlock instead, but that would require some more work. The problem being solved here is that if the mutex is held, we cannot sleep here, so we accumulate what we should have reclaimed this round, instead of just ignoring it. The doubling is from both doing the reclaim and also scheduling the same amount of data for background reclaim. I can add more to the commit if that's still not clear.
Suggested wording for the commit log comment:
Lock contention, by itself, shouldn't indicate a stop condition to the kernel's slab shrinker. Doing so can cause stalls when the kernel is trying to free large parts of the cache, such as is done by drop_caches.
Some commit log comment suggestions:
s/evictabe/evictable/
In the second example, how about "..., arc_kmem_reap_now still consumes excessive CPU".
Force-pushed 319e357 to 6e01d6e
Move arcstat_need_free increment from all direct calls to when arc_reclaim_lock is busy and we exit without doing anything. Data will be reclaimed in the reclaim thread. The previous location meant that we both reclaim the memory in this thread, and also schedule the same amount of memory for reclaim in arc_reclaim, effectively doubling the requested reclaim.

AKAMAI: zfs: CR 3695072
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Lock contention, by itself, shouldn't indicate a stop condition to the kernel's slab shrinker. Doing so can cause stalls when the kernel is trying to free large parts of the cache, such as is done by drop_caches. Also, perhaps arc_reclaim_lock should be a spinlock, and this code eliminated.

AKAMAI: zfs: CR 3593801
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
AKAMAI: zfs: CR 3695072
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Calling it when nothing is evictable will cause extra kswapd CPU. Also, if we didn't shrink, it's unlikely to have memory to reap, because we likely just called it microseconds ago. The exception is if we are in direct reclaim. You can see how hard this is being hit in kswapd with a light test workload:

   34.95%  [zfs]     [k] arc_kmem_reap_now
    5.40%  [spl]     [k] spl_kmem_cache_reap_now
    3.79%  [kernel]  [k] _raw_spin_lock
    2.86%  [spl]     [k] __spl_kmem_cache_generic_shrinker.isra.7
    2.70%  [kernel]  [k] shrink_slab.part.37
    1.93%  [kernel]  [k] isolate_lru_pages.isra.43
    1.55%  [kernel]  [k] __wake_up_bit
    1.20%  [kernel]  [k] super_cache_count
    1.20%  [kernel]  [k] __radix_tree_lookup

With ZFS just mounted but only ext4/pagecache memory pressure, arc_kmem_reap_now still consumes excessive CPU:

   12.69%  [kernel]  [k] isolate_lru_pages.isra.43
   10.76%  [kernel]  [k] free_pcppages_bulk
    7.98%  [kernel]  [k] drop_buffers
    7.31%  [kernel]  [k] shrink_page_list
    6.44%  [zfs]     [k] arc_kmem_reap_now
    4.19%  [kernel]  [k] free_hot_cold_page
    4.00%  [kernel]  [k] __slab_free
    3.95%  [kernel]  [k] __isolate_lru_page
    3.09%  [kernel]  [k] __radix_tree_lookup

Same pagecache-only workload as above with this patch series:

   11.58%  [kernel]  [k] isolate_lru_pages.isra.43
   11.20%  [kernel]  [k] drop_buffers
    9.67%  [kernel]  [k] free_pcppages_bulk
    8.44%  [kernel]  [k] shrink_page_list
    4.86%  [kernel]  [k] __isolate_lru_page
    4.43%  [kernel]  [k] free_hot_cold_page
    4.00%  [kernel]  [k] __slab_free
    3.44%  [kernel]  [k] __radix_tree_lookup

(arc_kmem_reap_now has 0 samples in perf)

AKAMAI: zfs: CR 3695042
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Could return the wrong pages value

AKAMAI: zfs: CR 3695072
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Force-pushed 6e01d6e to f05bc2b
This patch refers to the pagecache, generically; however, it's only paying attention to the NR_FILE_PAGES category.

Although I certainly understand what this is trying to accomplish w.r.t. interaction with pagecache-using filesystems (ext4), what about regular user programs, which would have their memory accounted for in anonymous pages? Would it make sense to have a separate tunable to set ratios against other classes? To the sums of some classes? Maybe a tunable to act as a mask and select the types of pagecache against which the ratio would be enforced?

Even with this patch, it seems a memory-hogging userland program could cause the same problem this one is trying to avoid.

If this patch stays as-is, I think at least the documentation should mention that it's somewhat specific to memory allocated by non-ZFS filesystems (and the handful of other cases which allocate NR_FILE_PAGES).
This whole stack looks OK except for some documentation and comment clarifications. Also, the f05bc2b commit seems to be specific to a very particular, albeit very common and likely very important for many users, case. I still think there's a case to be made for increasing
@dbavatar Thanks for the clarification about the doubling; that wasn't clear to me, and was why I was asking for an expanded comment.
Regular user-allocated (anonymous) memory does not matter; it is not reclaimable memory. It can be swappable, but the objective is to use free RAM, not to make user programs swap out to get even more RAM. However, the point is taken that NR_FILE_PAGES is not necessarily the only measure you might want to scale against. One could argue that it should be all of reclaimable memory, but I think that's not quite right, because we aren't competing with them for file-based resources, i.e. the split-LRU problem described above. They behave the same way whether pagecache is competing against them, or the ARC is, so I think it's best not to scale against them. I can think of corner cases to this, like what about NR_FILE_MAPPED pages, but it becomes problematic trying to separate mapped pages that are represented in NR_FILE_PAGES from those that are not. Some things we might want to measure against aren't even exported from Linux, if available at all, like swapcache. However, since zfs_arc_pc_frac is an empirically derived tunable meant to support a given workload, you can turn the knob either way to compensate, up to a point.
If a user program allocates lots of memory, we must assume it is because it must do so to complete its task, and we should give up our reclaimable memory as requested by the kernel. It is quite likely for this scenario to coexist with the one that is usually balancing caches, which is also why we don't want to keep increasing zfs_arc_min. If you increase zfs_arc_min, users will likely complain that ZFS is a memory hog compared to every other in-tree fs.
Pushing update now.
Force-pushed f05bc2b to 3caa00c
.sp
Default value: \fB0\fR (disabled).
.RE
@dbavatar Wouldn't it be more intuitive to express zfs_arc_pc_frac as a percentage of the size of the pagecache, and then maybe call it zfs_arc_pc_pct? It would be more congruent with other (percentage-based) tunables and could be used as:
uint64_t min = ptob(global_page_state(NR_FILE_PAGES)) * zfs_arc_pc_pct / 100;
And documented as "percentage of the pagecache size to which the ARC may be reduced".
We could do that; I was just hoping to avoid the divide, having been concerned about kswapd CPU from the beginning.
module/zfs/arc.c
Outdated
 * Scale reported evictable memory in proportion to page cache, cap
 * at specified min/max.
 */
Don't need a blank line here.
Force-pushed e9d4f18 to aabd0f0
When multiple filesystems are in use, memory pressure causes arc_cache to collapse to a minimum. Allow arc_cache to maintain a proportional size even when hit rates are disproportionate. We do this only via evictable size from the kernel shrinker, thus it's only in effect under memory pressure.

AKAMAI: zfs: CR 3695072
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Force-pushed aabd0f0 to efd766c
@dbavatar thanks for tackling this issue and breaking it into small individual commits to make it easier to review. The latest version of the proposed patch stack LGTM.
I agree we should merge this with zfs_arc_pc_percent=0 initially to be conservative. But at some point we'll want to enable this functionality by default. Do you have a good sense from your testing of what a reasonable default value would be for most systems?
@behlendorf This is a difficult question. We could try 100%, i.e. under pressure we would do a 50/50 split. In the desktop case that would prevent scanning on other filesystems from obliterating the ARC. But is that what we are aiming for, a desktop ZFS user? Anyone using ZFS for mission-critical stuff is going to want to change this. We could as well introduce a static number, i.e. limit pagecache to 1GB. Tangentially, we could potentially make this work better if we could export more variables and/or hooks from the upstream kernel, but these would basically be only to support ZFS; does anyone know if they've been receptive to changes like that?
Ideally we want ZFS to work well for both desktop and data center workloads. If we do need to prefer one configuration over the other, we'll want to opt for the large-scale, mission-critical stuff which is ZFS's core strength. This is ready to go, so I'll get it merged today. That'll make it easier to run some benchmarks so we can settle on a reasonable default.
It's been my experience that the upstream kernel community hasn't been willing to expose existing interfaces for external consumers, let alone add new ones. As a general rule they understandably want at least one in-tree consumer, so I suspect they wouldn't be particularly receptive to this approach. I don't think that means we're out of luck at all. We can make use of the existing filesystem / mm interfaces, and it wouldn't surprise me if the kernel already provides everything we need. Integrating the ARC more closely with the page cache has always been something we wanted to investigate. Between the callbacks provided by the
Ghost meta/data buffers are not actually allocated

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Ensures proper accounting of bytes we requested to free

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Move arcstat_need_free increment from all direct calls to when arc_reclaim_lock is busy and we exit without doing anything. Data will be reclaimed in the reclaim thread. The previous location meant that we both reclaim the memory in this thread, and also schedule the same amount of memory for reclaim in arc_reclaim, effectively doubling the requested reclaim.

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Lock contention, by itself, shouldn't indicate a stop condition to the kernel's slab shrinker. Doing so can cause stalls when the kernel is trying to free large parts of the cache, such as is done by drop_caches. Also, perhaps arc_reclaim_lock should be a spinlock, and this code eliminated.

AKAMAI: zfs: CR 3593801
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Calling it when nothing is evictable will cause extra kswapd CPU. Also, if we didn't shrink, it's unlikely to have memory to reap, because we likely just called it microseconds ago. The exception is if we are in direct reclaim. You can see how hard this is being hit in kswapd with a light test workload:

   34.95%  [zfs]     [k] arc_kmem_reap_now
    5.40%  [spl]     [k] spl_kmem_cache_reap_now
    3.79%  [kernel]  [k] _raw_spin_lock
    2.86%  [spl]     [k] __spl_kmem_cache_generic_shrinker.isra.7
    2.70%  [kernel]  [k] shrink_slab.part.37
    1.93%  [kernel]  [k] isolate_lru_pages.isra.43
    1.55%  [kernel]  [k] __wake_up_bit
    1.20%  [kernel]  [k] super_cache_count
    1.20%  [kernel]  [k] __radix_tree_lookup

With ZFS just mounted but only ext4/pagecache memory pressure, arc_kmem_reap_now still consumes excessive CPU:

   12.69%  [kernel]  [k] isolate_lru_pages.isra.43
   10.76%  [kernel]  [k] free_pcppages_bulk
    7.98%  [kernel]  [k] drop_buffers
    7.31%  [kernel]  [k] shrink_page_list
    6.44%  [zfs]     [k] arc_kmem_reap_now
    4.19%  [kernel]  [k] free_hot_cold_page
    4.00%  [kernel]  [k] __slab_free
    3.95%  [kernel]  [k] __isolate_lru_page
    3.09%  [kernel]  [k] __radix_tree_lookup

Same pagecache-only workload as above with this patch series:

   11.58%  [kernel]  [k] isolate_lru_pages.isra.43
   11.20%  [kernel]  [k] drop_buffers
    9.67%  [kernel]  [k] free_pcppages_bulk
    8.44%  [kernel]  [k] shrink_page_list
    4.86%  [kernel]  [k] __isolate_lru_page
    4.43%  [kernel]  [k] free_hot_cold_page
    4.00%  [kernel]  [k] __slab_free
    3.44%  [kernel]  [k] __radix_tree_lookup

(arc_kmem_reap_now has 0 samples in perf)

AKAMAI: zfs: CR 3695042
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Could return the wrong pages value

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Description
Fix memory reclaim behavior and add a scaling factor to balance multiple caches (default disabled)
Motivation and Context
We deployed ZFS to several thousand machines and immediately started noticing some bad performance characteristics. Our machines still have traditional filesystems utilizing pagecache, which compete with ZFS. I identified that one of the main problems was that the arc cache would collapse to its minimum, regardless of several GB of memory being free or used by pagecache and other reclaimable objects. This set of patches corrects the reclaim behavior, and also adds an option that allows balancing of multiple caches via a scaling factor, which only activates on reclaim. This means that under memory pressure a user-defined scaling factor will balance pagecache and the arc cache, and when not under pressure everything works normally. This is disabled by default.

In addition, this patchset means the zfs_arc_sys_free algorithm is no longer necessary. Although some users may want to set it for other reasons, the zfs module neither gets the appropriate watermarks from the kernel, nor does it need to, as the feedback mechanism is the reclaim callback, which is based on the true watermarks. Performance should be improved by disabling this by default; I can add a patch later for this.
Other fixes:
How Has This Been Tested?
Tested with production traffic on 100 machines. Verified that the arc cache does not collapse to its minimum size, and that we do not incur a performance hit from heavy MFU cache misses. Expanding to 100k+ machines over several weeks.