
Memory reclaim and cache interaction #6035

Closed
wants to merge 8 commits

Conversation

dbavatar
Contributor

Description

Fix memory reclaim behavior and add scaling factor to balance multiple caches (default disabled)

Motivation and Context

We deployed ZFS to several thousand machines and immediately noticed some bad performance characteristics. Our machines still have traditional filesystems using the pagecache, which compete with ZFS. I identified one of the main problems: the ARC would collapse to its minimum size, regardless of several GB of memory being free or used by the pagecache and other reclaimable objects. This set of patches corrects the reclaim behavior and also adds an option to balance multiple caches via a scaling factor, which only activates on reclaim. This means that under memory pressure a user-defined scaling factor will balance the pagecache and the ARC, while under normal conditions behavior is unchanged. This is disabled by default. In addition, this patchset makes the zfs_arc_sys_free algorithm unnecessary. Some users may still want to set it for other reasons, but the zfs module neither gets the appropriate watermarks from the kernel, nor does it need to: the feedback mechanism is the reclaim callback, which is based on the true watermarks. Performance should improve by disabling it by default; I can add a patch for that later.

Other fixes:

  1. Reclaim behavior was fixed with regard to drop_caches. Previously, with a significant cache size, it would take many calls to drop_caches to actually reclaim all memory; a single run should do it now. This was problematic for several reasons, for example when trying to free memory before a huge page allocation.
  2. ZFS routines now consume significantly less CPU on the kswapd stack, for multiple reasons.
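For context, the drop_caches interface referred to in item 1 is the standard procfs knob:

```shell
# Free reclaimable memory before e.g. a huge page allocation (as root).
sync
echo 3 > /proc/sys/vm/drop_caches   # 1 = pagecache, 2 = dentries/inodes, 3 = both
```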

How Has This Been Tested?

Tested with production traffic on 100 machines. Verified that the ARC does not collapse to its minimum size and that we do not incur a performance hit from heavy MFU cache misses. Expanding to 100k+ machines over several weeks.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@mention-bot

@dbavatar, thanks for your PR! By analyzing the history of the files in this pull request, we identified @behlendorf, @grwilson and @prakashsurya to be potential reviewers.

@ryao
Contributor

ryao commented Apr 17, 2017

Are you running 0.6.5.y and tried this on top or are you running 0.7.0-rc3 and tried this on top?

There have been substantial changes to memory behavior for 0.7.0 so before evaluating these changes, I would like to know what branch was used for testing them.

Edit: I am pleasantly surprised by how small these changes are. It will take some time to study each change, but there is nothing here that looks unreasonable at a glance. :)

@dbavatar
Contributor Author

@ryao 0.6.5.x, but all the changes still apply; the things this fixes are still quite broken in 0.7.x. I've kept the changes quite small, so they should be readable. I can do additional testing in the lab running 0.7.x.

@ryao ryao left a comment


As I said, I need some time to review these changes, but there is nothing unreasonable that I see at a glance. There are some minor style issues that need to be fixed though.

module/zfs/arc.c Outdated
return (SHRINK_STOP);
if (mutex_tryenter(&arc_reclaim_lock) == 0) {
ARCSTAT_INCR(arcstat_need_free, ptob(sc->nr_to_scan));
return 0;
Contributor

This is a style violation. The return value should be surrounded by parentheses. Use ./scripts/cstyle.pl on arc.c to catch these.

Contributor Author

Yes, I saw the failures; I will fix and repush the whole branch when I get a chance.

Contributor

Little things like this cause issues for my pull requests more often than I am willing to admit (although it is public record, so anyone can go look). I look forward to the repush. :)

@ryao
Contributor

ryao commented Apr 17, 2017

@dbavatar I agree that the things you changed are common to both. So far, I am not seeing anything I dislike in these changes, although I am still reading them. I probably won't finish my own review until much later in the day. In the interim, it would be great if you could fix the style issues.

@dweeezil
Contributor

Without making any particular comments on this patch stack yet, there's an argument to be made that many of the ZoL-specific hacks in the ARC code are ripe for review. For example, the "balanced" metadata adjustment strategy, which is frequently the cause of high CPU loading, may very well no longer be needed, or should at the very least be tuned in some way. Unfortunately, the ARC code is one place where there's a lot of divergence from the upstream code and it's had a lot of ZoL-specific changes, reversions, etc. over time.

@ryao
Contributor

ryao commented Apr 17, 2017

@dweeezil Not only that, but the Linux kernel memory behavior is substantially different from Illumos thanks to direct reclaim and other decisions. We cannot use the exact same code. :/

@dweeezil
Contributor

@ryao Indeed. This is an area where ZoL will always have plenty of divergence.

@dbavatar
Contributor Author

@dweeezil Yes that is quite a heavy symbol, but I think you'll find the behavior is much improved with this branch. I believe it was overworking by at least a factor of 6-10x before.

Ghost meta/data buffers are not actually allocated

AKAMAI: zfs: CR 3695072

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
@dbavatar dbavatar force-pushed the dbavatar/cache_reclaim branch 3 times, most recently from a5858ac to bc95471 on April 17, 2017 at 22:29
@ryao
Contributor

ryao commented Apr 18, 2017

Okay, as promised, I have dug into these changes more. They look good to me. However, would you explain the rationale behind the scaling in proportion to page cache patch? Given that it is off by default, I am not sure if I understand the use case where this knob helps.

Also, we need to figure out why the buildbot was unhappy before this can be merged.

It is worth noting that the lack of atomics is a Linux-specific issue: we altered this code to run in a direct reclaim context, whereas on other systems it runs on a dedicated arc_reclaim thread. I would call this a fix for a regression caused by that change, except that the interactions between the old approach and Linux's direct reclaim were so terrible that even these regressions were a step forward back then.

@dbavatar
Contributor Author

They look good to me. However, would you explain the rationale behind the scaling in proportion to page cache patch? Given that it is off by default, I am not sure if I understand the use case where this knob helps.

The theoretical background for why this is necessary: the pagecache in the Linux kernel is one giant LRU page list, which for the most part consists of all "free" memory. Now imagine a workload with frequently used files and also file scanning going on. Every time a frequently used page is accessed, it gets moved to the tail of the LRU. Since the LRU is likely very deep, the probability of reclaiming your frequently used page is low, as measured by its distance from the head of the list. Now imagine splitting this LRU in half, fitting two lists into the same fixed memory space. The probability of reclaiming your frequently used page immediately becomes much higher, since its distance from the head of its LRU has halved. So even in the ideal scenario, a 50/50 split, we already start at a disadvantage. But that is unlikely in a real workload. What's more likely is that one of those LRUs has more pressure on it than the other, one of them becomes very small, and you keep dropping frequently used pages from it, leading to huge latencies. The ZFS MRU/MFU design is actually trying to prevent this exact thing from happening inside the ARC, but when you have (MRU+MFU)+LRU in the same memory space, this no longer functions properly.

This is exactly what we ran into. Although the rest of this branch already helps, since it stops excessive reclaims from hitting the ARC, scanning can still occur on pagecache-backed filesystems, and that memory pressure will shrink the ARC down to arc_min. This is too small to fit our frequently used files, leading to huge latencies that were never expected by the original app author, since the pagecache had effectively served as an infinite cache for small files. At the same time, it is not acceptable to keep increasing arc_min until the problem goes away, because that memory effectively becomes unusable; we need ZFS to have an effective overhead of near zero. We want it to behave the same way as the pagecache and all other kernel reclaimable memory: under high pressure the memory can be reclaimed, but otherwise it utilizes all of "free" memory. This is exactly what zfs_arc_pc_frac enables. Until there is pressure/reclaim, everything works as normal. Under pressure, we can ensure both that all of our memory remains reclaimable and that we withstand scanning pressure from the pagecache.

It's off by default because we can't estimate a user's workload, nor whether they care about balancing the ARC with the pagecache. Even if a user has converted 100% to ZFS, there are still other uses for the pagecache, such as mounted FUSE filesystems. Scanning on a FUSE filesystem could easily obliterate the ARC. For our workloads I've set the scaling factor at 15%, but I can imagine a scenario where someone might want to set it at 500%, for example.
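Assuming the module-parameter name from this iteration of the patch (it was later renamed during review), applying the 15% setting described above would look something like:

```shell
# Hypothetical usage sketch; the parameter name follows this PR's iteration
# and the path assumes the standard module-parameter location.
echo 15 > /sys/module/zfs/parameters/zfs_arc_pc_frac
```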

Another possible solution for Linux would be to remove the ARC and use the pagecache to cache everything. That is obviously a very invasive change, but at least we'd be using the same LRU as everything else.

Also, we need to figure out why the buildbot was unhappy before this can be merged.

Seems like false positives; how can we kick off another run without updating the commits?

It is worth noting that the lack of atomics is a Linux specific issue because we altered this code to run in a direct reclaim context where on other systems, it runs on a dedicated arc_reclaim thread. I would say that is fixing a regression caused by that change if the interactions of old approach and Linux's direct reclaim were not so terrible that even these regressions were a step forward back then.

Not quite sure what you're getting at here. That said, I don't think the arc_reclaim thread is necessary at all on Linux; the kernel doesn't even export all the feedback mechanisms needed to get it to behave properly. I have it effectively disabled (arc_sys_free lower than the kernel's min watermark).

@kernelOfTruth
Contributor

Seems like false positives; how can we kick off another run without updating the commits?

git commit --amend the latest commit to change the ID,

and force-push the branch - that should most likely be enough
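The amend-and-force-push sequence suggested above would look like this (branch name taken from this PR):

```shell
git commit --amend --no-edit                # same content, new commit hash
git push --force-with-lease origin dbavatar/cache_reclaim
```

`--force-with-lease` is a safer alternative to `--force` here, since it refuses to overwrite the remote branch if someone else has pushed to it in the meantime.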

@ryao
Contributor

ryao commented Apr 18, 2017

@dbavatar Removing ARC would be a major performance regression for workloads that benefit from ARC. It is rather unfortunate that ARC doesn't play nicely with the page cache of other filesystems. Having heard that, I definitely see the use case for this tunable now. You might want to clarify that the tunable exists to make ZFS play more nicely with LRU in the documentation.

As for my arc_reclaim remarks, we had gutted it and left the remaining functions in a thread that was called arc_adapt. At some point that I had not realized when I wrote that comment, we changed the name back to arc_reclaim. See 302f753 for the original name change. Anyway, my point was that on other systems, reclamation from ARC is done entirely in arc_reclaim. There is no direct reclaim and therefore no need for atomics because only 1 thread will ever be doing reclaim. The need for atomics is one that the project manufactured when it ported ZFS to Linux and adapted it to direct reclaim.

@dbavatar
Contributor Author

All build checks are now passing. I also did some smoketesting with 7.x, no issues.

@ryao ryao left a comment

This looks good to me.

@dweeezil dweeezil left a comment

In the commit comment, s/arc_needs_free/arc_need_free/

Ensures proper accounting of bytes we requested to free

AKAMAI: zfs: CR 3695072

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
@dweeezil dweeezil left a comment

I'd like to suggest some wording for the commit log of the form:

In the arc shrinker, only increment arcstat_need_free (arc_need_free) when reclaim is in progress.

And then maybe a bit of explanation about the "doubling".

@dbavatar
Contributor Author

In the arc shrinker, only increment arcstat_need_free (arc_need_free) when reclaim is in progress.

This is actually not what is happening. That mutex isn't held during reclaim; it's only protecting some variables, which is why I suggested that perhaps this should be a spinlock instead (that would require some more work). The problem being solved here is that if the mutex is held, we cannot sleep here, so we accumulate what we should have reclaimed this round instead of just ignoring it.

The doubling is from both doing the reclaim and also scheduling the same amount of data for background reclaim. I can add more to the commit if that's still not clear.

@dweeezil dweeezil left a comment

Suggested wording for the commit log comment:

Lock contention, by itself, shouldn't indicate a stop condition to the kernel's slab shrinker. Doing so can cause stalls when the kernel is trying to free large parts of the cache such as is done by drop_caches.

@dweeezil dweeezil left a comment

Some commit log comment suggestions:

s/evictabe/evictable/

In the second example, how about "..., arc_kmem_reap_now still consumes excessive CPU".

Move arcstat_need_free increment from all direct calls to when
arc_reclaim_lock is busy and we exit without doing anything. Data will
be reclaimed in reclaim thread. The previous location meant that we
both reclaim the memory in this thread, and also schedule the same
amount of memory for reclaim in arc_reclaim, effectively doubling the
requested reclaim.

AKAMAI: zfs: CR 3695072

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Lock contention, by itself, shouldn't indicate a stop condition to the
kernel's slab shrinker. Doing so can cause stalls when the kernel is
trying to free large parts of the cache such as is done by drop_caches

Also, perhaps arc_reclaim_lock should be a spinlock, and this code
eliminated.

AKAMAI: zfs: CR 3593801

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
AKAMAI: zfs: CR 3695072

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Calling it when nothing is evictable will cause extra kswapd cpu. Also
if we didn't shrink it's unlikely to have memory to reap because we
likely just called it microseconds ago. The exception is if we are in
direct reclaim.

You can see how hard this is being hit in kswapd with a light test
workload:

  34.95%  [zfs]             [k] arc_kmem_reap_now
   5.40%  [spl]             [k] spl_kmem_cache_reap_now
   3.79%  [kernel]          [k] _raw_spin_lock
   2.86%  [spl]             [k] __spl_kmem_cache_generic_shrinker.isra.7
   2.70%  [kernel]          [k] shrink_slab.part.37
   1.93%  [kernel]          [k] isolate_lru_pages.isra.43
   1.55%  [kernel]          [k] __wake_up_bit
   1.20%  [kernel]          [k] super_cache_count
   1.20%  [kernel]          [k] __radix_tree_lookup

With ZFS just mounted but only ext4/pagecache memory pressure
arc_kmem_reap_now still consumes excessive CPU:

  12.69%  [kernel]  [k] isolate_lru_pages.isra.43
  10.76%  [kernel]  [k] free_pcppages_bulk
   7.98%  [kernel]  [k] drop_buffers
   7.31%  [kernel]  [k] shrink_page_list
   6.44%  [zfs]     [k] arc_kmem_reap_now
   4.19%  [kernel]  [k] free_hot_cold_page
   4.00%  [kernel]  [k] __slab_free
   3.95%  [kernel]  [k] __isolate_lru_page
   3.09%  [kernel]  [k] __radix_tree_lookup

Same pagecache only workload as above with this patch series:

  11.58%  [kernel]  [k] isolate_lru_pages.isra.43
  11.20%  [kernel]  [k] drop_buffers
   9.67%  [kernel]  [k] free_pcppages_bulk
   8.44%  [kernel]  [k] shrink_page_list
   4.86%  [kernel]  [k] __isolate_lru_page
   4.43%  [kernel]  [k] free_hot_cold_page
   4.00%  [kernel]  [k] __slab_free
   3.44%  [kernel]  [k] __radix_tree_lookup

   (arc_kmem_reap_now has 0 samples in perf)

AKAMAI: zfs: CR 3695042

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Could return the wrong pages value

AKAMAI: zfs: CR 3695072

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
@dweeezil dweeezil left a comment

This patch refers to the pagecache generically; however, it only pays attention to what falls into the NR_FILE_PAGES category.

Although I certainly understand what this is trying to accomplish w.r.t. interaction with pagecache-using filesystems (ext4), what about regular user programs which would have their memory accounted for in anonymous pages? Would it make sense to have a separate tunable to set ratios against other classes? To the sums of some classes? Maybe a tunable to act as a mask and select the types of pagecache against which the ratio would be enforced?

Even with this patch, it seems a memory-hogging userland program could cause the same problem this one is trying to avoid.

If this patch stays as-is, I think at least the documentation should mention that it's somewhat specific to memory allocated by non-ZFS filesystems (and the handful of other cases which allocate NR_FILE_PAGES).

@dweeezil
Contributor

This whole stack looks OK except for some documentation and comment clarifications. Also, the f05bc2b commit seems to be specific to a very particular case, albeit a very common one that is likely very important for many users. I still think there's a case to be made for increasing zfs_arc_min (or, at least, a floor on metadata in the ARC) on any system with a decent amount of RAM on which large pools are being used and there are other large non-zfs consumers of the pagecache.

@dweeezil
Contributor

@dbavatar Thanks for the clarification about the doubling, that wasn't clear to me and was why I was asking for an expanded comment.

@dbavatar
Contributor Author

dbavatar commented Apr 24, 2017

This patch refers to the pagecache, generically, however, it's only paying attention to that which is the NR_FILE_PAGES category.

Although I certainly understand what this is trying to accomplish w.r.t. interaction with pagecache-using filesystems (ext4), what about regular user programs which would have their memory accounted for in anonymous pages? Would it make sense to have a separate tunable to set ratios against other classes? To the sums of some classes? Maybe a tunable to act as a mask and select the types of pagecache against which the ratio would be enforced?

Regular user-allocated (anonymous) memory does not matter; it is not reclaimable memory. It can be swappable, but the objective is to use free RAM, not to make user programs swap out to get even more RAM. However, the point is taken that NR_FILE_PAGES is not necessarily the only measure you might want to scale against. One could argue that it should be all reclaimable memory, but I think that's not quite right, because we aren't competing with those consumers for file-based resources, i.e. the split-LRU problem described above. They behave the same way whether the pagecache is competing against them or the ARC is, so I think it's best not to scale against them.

I can think of corner cases, like NR_FILE_MAPPED pages, but it becomes problematic trying to separate mapped pages that are represented in NR_FILE_PAGES from those that are not. Some things we might want to measure against aren't even exported from Linux, if available at all, like the swapcache. However, since zfs_arc_pc_frac is an empirically derived tunable meant to support a given workload, you can turn the knob either way to compensate, up to a point.

Even with this patch, it seems a memory-hogging userland program could cause the same problem this one is trying to avoid.

I still think there's a case to be made for increasing zfs_arc_min (or, at least, a floor on metadata in the ARC) on any system with a decent amount of RAM on which large pools are being used and there are other large non-zfs consumers of the pagecache.

If a user allocates lots of memory, we must assume it is because it must do so to complete its task, and we should give up our reclaimable memory as requested by the kernel. It is quite likely for this scenario to coexist with the one that is usually balancing caches, which is also why we don't want to keep increasing zfs_arc_min. If you increase zfs_arc_min, users will likely complain that ZFS is a memory hog compared to every other in-tree filesystem.

If this patch stays as-is, I think at least the documentation should mention that it's somewhat specific to memory allocated by non-ZFS filesystems (and the handful of other cases which allocate NR_FILE_PAGES).

Pushing update now.

.sp
Default value: \fB0\fR (disabled).
.RE

Contributor

@dbavatar Wouldn't it be more intuitive to express zfs_arc_pc_frac as a percentage of the size of the pagecache, and then maybe call it zfs_arc_pc_pct? It would be more congruent with other (percentage-based) tunables and could be used as:

uint64_t min = ptob(global_page_state(NR_FILE_PAGES)) * 100 / zfs_arc_pc_pct;

And documented as "percentage of the pagecache size to which the ARC may be reduced".

Contributor Author

We could do that; I was just hoping to avoid the divide, having been concerned about kswapd CPU from the beginning.

module/zfs/arc.c Outdated
* Scale reported evictable memory in proportion to page cache, cap
* at specified min/max.
*/

Contributor

Don't need a blank line here.

@dweeezil
Contributor

Tangentially related to this issue: it would be neat to de-couple zfs_arc_meta_min from zfs_arc_min, and to allow either to be lower than 32MiB to help with really small systems (see 9141582 and 121b3ca). Large blocks have complicated this issue.

@ryao ryao added the Component: Memory Management and Type: Performance labels Apr 26, 2017
@dbavatar dbavatar force-pushed the dbavatar/cache_reclaim branch 2 times, most recently from e9d4f18 to aabd0f0 on April 28, 2017 at 13:47
When multiple filesystems are in use, memory pressure causes arc_cache
to collapse to a minimum. Allow arc_cache to maintain proportional size
even when hit rates are disproportionate. We do this only via evictable
size from the kernel shrinker, thus it's only in effect under memory
pressure.

AKAMAI: zfs: CR 3695072

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
@behlendorf behlendorf left a comment

@dbavatar thanks for tackling this issue and breaking it in to small individual commits to make it easier to review. The latest version of the proposed patch stack LGTM.

I agree we should merge this with zfs_arc_pc_percent=0 initially to be conservative. But at some point we'll want to enable this functionality by default. Do you have a good sense from your testing what a reasonable default value would be for most systems?

@dbavatar
Contributor Author

dbavatar commented May 2, 2017

@behlendorf This is a difficult question. We could try 100%, i.e. under pressure we would do a 50/50 split. In the desktop case that would prevent scanning on other filesystems from obliterating the ARC. But is that what we are aiming for, a desktop ZFS user? Anyone using ZFS for mission-critical stuff is going to want to change this. We could also introduce a static number, i.e. limit the pagecache to 1GB.

Tangentially, we could potentially make this work better if we could export more variables and/or hooks from the upstream kernel, but these would basically exist only to support ZFS. Does anyone know whether they've been receptive to changes like that?

@behlendorf
Contributor

Ideally we want ZFS to work well for both desktop and data center workloads. If we do need to prefer one configuration over the other, we'll want to opt for the large-scale, mission-critical use cases that are ZFS's core strength. This is ready to go, so I'll get it merged today. That will make it easier to run some benchmarks so we can settle on a reasonable default.

Tangentially, we could potentially make this work better if we could export more variables and/or hooks from the upstream kernel, but these would basically exist only to support ZFS. Does anyone know whether they've been receptive to changes like that?

It's been my experience that the upstream kernel community hasn't been willing to expose existing interfaces for external consumers, let alone add new ones. As a general rule they understandably want at least one in-tree consumer. So I suspect they wouldn't be particularly receptive to this approach.

I don't think that means we're out of luck at all. We can make use of the existing filesystem / mm interfaces and it wouldn't surprise me if the kernel already provides everything we need. Integrating the ARC more closely with the page cache has always been something we wanted to investigate. Between the callbacks provided by the address_space_operations and vm_operations_struct we have considerable flexibility. Additionally, the kernel provides per-superblock shrinkers which we could make better use of and a handful of other useful interfaces. There are quite a few existing avenues we can pursue in order to further improve things.

behlendorf pushed a commit that referenced this pull request May 2, 2017
Ghost meta/data buffers are not actually allocated

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
behlendorf pushed a commit that referenced this pull request May 2, 2017
Ensures proper accounting of bytes we requested to free

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
behlendorf pushed a commit that referenced this pull request May 2, 2017
Move arcstat_need_free increment from all direct calls to when
arc_reclaim_lock is busy and we exit without doing anything. Data will
be reclaimed in reclaim thread. The previous location meant that we
both reclaim the memory in this thread, and also schedule the same
amount of memory for reclaim in arc_reclaim, effectively doubling the
requested reclaim.

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
behlendorf pushed a commit that referenced this pull request May 2, 2017
Lock contention, by itself, shouldn't indicate a stop condition to the
kernel's slab shrinker. Doing so can cause stalls when the kernel is
trying to free large parts of the cache such as is done by drop_caches

Also, perhaps arc_reclaim_lock should be a spinlock, and this code
eliminated.

AKAMAI: zfs: CR 3593801
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
behlendorf pushed a commit that referenced this pull request May 2, 2017
AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
behlendorf pushed a commit that referenced this pull request May 2, 2017
Calling it when nothing is evictable will cause extra kswapd cpu. Also
if we didn't shrink it's unlikely to have memory to reap because we
likely just called it microseconds ago. The exception is if we are in
direct reclaim.

You can see how hard this is being hit in kswapd with a light test
workload:

  34.95%  [zfs]             [k] arc_kmem_reap_now
   5.40%  [spl]             [k] spl_kmem_cache_reap_now
   3.79%  [kernel]          [k] _raw_spin_lock
   2.86%  [spl]             [k] __spl_kmem_cache_generic_shrinker.isra.7
   2.70%  [kernel]          [k] shrink_slab.part.37
   1.93%  [kernel]          [k] isolate_lru_pages.isra.43
   1.55%  [kernel]          [k] __wake_up_bit
   1.20%  [kernel]          [k] super_cache_count
   1.20%  [kernel]          [k] __radix_tree_lookup

With ZFS just mounted but only ext4/pagecache memory pressure
arc_kmem_reap_now still consumes excessive CPU:

  12.69%  [kernel]  [k] isolate_lru_pages.isra.43
  10.76%  [kernel]  [k] free_pcppages_bulk
   7.98%  [kernel]  [k] drop_buffers
   7.31%  [kernel]  [k] shrink_page_list
   6.44%  [zfs]     [k] arc_kmem_reap_now
   4.19%  [kernel]  [k] free_hot_cold_page
   4.00%  [kernel]  [k] __slab_free
   3.95%  [kernel]  [k] __isolate_lru_page
   3.09%  [kernel]  [k] __radix_tree_lookup

Same pagecache only workload as above with this patch series:

  11.58%  [kernel]  [k] isolate_lru_pages.isra.43
  11.20%  [kernel]  [k] drop_buffers
   9.67%  [kernel]  [k] free_pcppages_bulk
   8.44%  [kernel]  [k] shrink_page_list
   4.86%  [kernel]  [k] __isolate_lru_page
   4.43%  [kernel]  [k] free_hot_cold_page
   4.00%  [kernel]  [k] __slab_free
   3.44%  [kernel]  [k] __radix_tree_lookup

   (arc_kmem_reap_now has 0 samples in perf)

AKAMAI: zfs: CR 3695042
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
behlendorf pushed a commit that referenced this pull request May 2, 2017
Could return the wrong pages value

AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
@behlendorf behlendorf closed this in 03b60ee May 2, 2017