ARC causing slow system performance on any version higher than v2.1.0 (v2.1.6+) #127

Open
Haravikk opened this issue Mar 24, 2024 · 5 comments

Comments

@Haravikk

Haravikk commented Mar 24, 2024

System information

Type | Version/Name
Distribution Name | macOS
Distribution Version | 10.15.7
Linux Kernel | Darwin Kernel Version 19.6.0
Architecture | x86_64
ZFS Version | zfs-macOS-2.2.3-rc4

Describe the problem you're observing

Updating to any macOS ZFS version above v2.1.0 results in extremely poor system performance while datasets are mounted and in use.

The issue appears to be with ARC, as setting primarycache=none and secondarycache=none significantly improves system performance/responsiveness, minus the cost of loading more data from disk.

Describe how to reproduce the problem

  1. Install v2.1.0 on a known affected system (2018 Mac Mini, 2008 Mac Pro, 2010 Mac Pro running Catalina are confirmed)
  2. Create/mount datasets and verify performance is as expected. Encryption, compression etc. should all perform within reasonable margins.
  3. Upgrade to v2.1.6 or later (including v2.2.3rc4).
  4. Observe performance is significantly worse.
  5. Run the commands zfs set primarycache=none <pool> and zfs set secondarycache=none <pool> (you may need to set these on additional datasets if the values are not inherited).
  6. After a short delay, performance should significantly improve (a restart may be required).
  7. Revert to v2.1.0 and set primarycache and secondarycache back to previous values (default is =all).
  8. Performance should be much better with ARC functioning as normal.
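For anyone following steps 5 and 7, here is a minimal sketch of the workaround as shell helpers. The function names set_cache_mode and list_local_overrides and the pool name tank are hypothetical, not part of ZFS; only the zfs set and zfs get invocations are real commands. By default the first helper only prints what it would run.

```shell
#!/bin/sh
# Sketch only: set_cache_mode and list_local_overrides are hypothetical helpers.

# Print (default) or run the two `zfs set` commands from step 5.
# Usage: set_cache_mode <dataset> <none|metadata|all>
# Set DRY_RUN=0 in the environment to actually execute them.
set_cache_mode() {
    dataset="$1"
    mode="$2"
    for prop in primarycache secondarycache; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "zfs set ${prop}=${mode} ${dataset}"
        else
            zfs set "${prop}=${mode}" "${dataset}"
        fi
    done
}

# Filter the tab-separated output of
#   zfs get -r -H -o name,property,value,source primarycache,secondarycache <pool>
# down to entries whose value is set locally, i.e. datasets that will not
# pick up a change made on the pool through inheritance.
list_local_overrides() {
    awk -F '\t' '$4 == "local" { print $1 " " $2 "=" $3 }'
}
```

Something like set_cache_mode tank none prints the two commands for review, and piping the zfs get output shown in the comment into list_local_overrides lists the datasets that need the property set individually. Reverting in step 7 would be the same call with all instead of none.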

Attachments

The following spindumps were all generated under v2.2.3rc4 with ARC configured as normal (in use), causing many programs to run extremely slowly as they spend large amounts of time waiting for data.

spindumps-v2.2.3rc4.zip

Unfortunately due to the slow system responsiveness it was difficult to generate spindumps at the moments of worst performance, though I tried. Of note, spindump.6.txt was taken while attempting to decrypt a dataset into a new (unencrypted) dataset for testing, so may give useful stack traces.

Additional Notes

I'm not aware of any specific changes to ARC that are likely to have caused this drastic change in performance, but the fact that setting primarycache=none and secondarycache=none results in such an improvement in system responsiveness makes it clear that the issue is most likely either related to the ARC, or to something it relies upon.

I would assume that if this issue also affected Linux there would have been a lot more issues reported about it, so either Linux is unaffected, or macOS is affected differently (more severely), resulting in a much more noticeable drop in performance.

Many, many thanks to armdn for discovering the workaround for this issue on the forum topic originally created for it. You can view the topic here for many more spindumps and sysctl output.

As pointed out by cgiard, since the issue has occurred since at least the v2.1.6 macOS release, this makes the persistent L2ARC fixes a possible area to look at, though removing L2ARC does not appear to make a difference.

Another thread by ranvel, which you can see here, proposes that the issue is write operations causing user space to freeze. My own affected systems aren't especially write-intensive though, so I think the interaction may be more complex.

@armdn

armdn commented Mar 26, 2024

The problem is reproducible on a wide range of systems. I tested it on 10.13, 10.14 and 10.15, and even a CLEAN installation with a brand new pool and settings is affected. The tests were performed on two different computers.

@Haravikk
Author

Haravikk commented Mar 26, 2024

Thanks so much for all the work you've put into tracking this down; I've also linked in the other thread you've found which does sound like the same issue.

Ranvel suggests in that thread it might be writes causing userspace to lock up, which is as good a theory as any, though my own case isn't especially write intensive. Updating ZFS often triggers mds/Spotlight for me, but while it involves a lot of reads the write activity isn't what I would call especially high, so I'm not so sure it's the writing that's the problem. But if it's the write interaction with ARC rather than the read interaction then that might help narrow down the problem?

In addition to my affected main machine (2018 Mac Mini), I have two older systems, a mid 2010 Mac Mini and a late 2009 iMac (so very similar CPUs), both running High Sierra (10.13), and they each have a much more basic ZFS setup – a single compressed and encrypted zvol for hosting Time Machine backups. While ZFS performance for them isn't amazing (they have very little hardware acceleration support), neither seems to be experiencing the same issue with system slowdowns. They're both operating with much smaller ARC sizes (around 256 MB) as neither has a lot of RAM, and ARC isn't as important for what they're doing.

But it makes me wonder if the issue could be related to ARC size, or perhaps it gets worse the larger it gets and these older systems just aren't at a point where it becomes noticeable?

@Haravikk
Author

Haravikk commented Mar 27, 2024

As per Arne's post to the forum topic, another possible workaround is to disable compressed ARC like so (set in /etc/zfs/zsysctl.conf to persist):

sysctl -w kstat.zfs.darwin.tunable.zfs.compressed_arc_enabled=0

This may allow primarycache and secondarycache to remain enabled without the same performance degradation. The cost is less efficient ARC memory utilisation, since uncompressed records take up more space, but that should still be a big improvement over running without any caching.
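To persist the setting, here is a rough sketch of an idempotent helper that writes it into a config file. The name persist_tunable is hypothetical, and the one "name=value" per line file format is an assumption on my part, so check your existing /etc/zfs/zsysctl.conf before relying on it.

```shell
#!/bin/sh
# Sketch only: persist_tunable is a hypothetical helper, and the
# one-"name=value"-per-line file format is an assumption.
# Usage: persist_tunable <conf-file> <sysctl-name> <value>
persist_tunable() {
    conf="$1"
    key="$2"
    value="$3"
    tmp="$(mktemp)" || return 1
    # Keep every existing line except an earlier setting of the same key.
    if [ -f "$conf" ]; then
        awk -F '=' -v k="$key" '$1 != k' "$conf" > "$tmp"
    fi
    # Append the new setting, then overwrite the file contents in place
    # so the original file's ownership and permissions are kept.
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}
```

Run as root, this would be something like persist_tunable /etc/zfs/zsysctl.conf kstat.zfs.darwin.tunable.zfs.compressed_arc_enabled 0. Running it twice leaves a single line for the key rather than a duplicate.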

I'll try to verify this on my system(s) over the weekend.

Update: Unfortunately this didn't "fix" the problem, it only made it less severe; see below.

@Haravikk
Author

Haravikk commented Mar 31, 2024

Unfortunately, while disabling compressed ARC did improve performance overall, it didn't solve this problem – the system was much more responsive with fewer entries in ARC and relatively low write activity, but as soon as I started writing large quantities of data and ARC reached around 4 GB, performance took a nose dive as usual.

For anyone else looking to test with compressed ARC disabled, make sure to disable it before importing your pool(s), otherwise you need to export then re-import them after; in my case the difference was only noticeable after primary ARC was emptied.
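If the pools are already imported, the ordering described above might look like the following sketch. The helper name reimport_plan and the pool names are hypothetical; it only prints the commands so they can be reviewed before being run by hand.

```shell
#!/bin/sh
# Sketch only: reimport_plan is a hypothetical helper that prints the
# command sequence instead of running it.
# Usage: reimport_plan <pool> [<pool> ...]
reimport_plan() {
    # 1. Export every pool so nothing is imported while the tunable changes.
    for pool in "$@"; do
        echo "zpool export ${pool}"
    done
    # 2. Disable compressed ARC (the sysctl name is the one quoted in this thread).
    echo "sysctl -w kstat.zfs.darwin.tunable.zfs.compressed_arc_enabled=0"
    # 3. Re-import the pools with the new setting already in place.
    for pool in "$@"; do
        echo "zpool import ${pool}"
    done
}
```

For example, reimport_plan tank backup prints two exports, the sysctl, then two imports.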

I also tested removing my L2ARC device from my main working pool, but this made no discernible difference. The only thing that seems to help is bypassing primary ARC using primarycache=none on all active datasets, but this is of course not acceptable for a working pool as the lack of caching also hurts performance.

The issue is definitely proportional to the amount of primary ARC being utilised; when primary ARC was under 1 GB my system was still generally responsive, but as this amount climbed to 4 GB and beyond it became more and more unusable, and at around the 10 GB mark windowserver locked up and presumably crashed, causing all of my user accounts to be logged out. I'm not convinced that limiting the ARC maximum is a viable option, as even when my ARC usage was around 1 GB I was seeing some misbehaving processes (opendirectoryd was the main one this time, despite having no active directory users).
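For anyone wanting to correlate their own slowdowns with ARC size, a small sketch that maps a byte count onto the rough bands I saw. The helper name arc_band is hypothetical, the bands are just the approximate figures from my one machine (treated here as GiB), and the sysctl name in the usage comment is an assumption about how the ARC size kstat is exposed.

```shell
#!/bin/sh
# Sketch only: arc_band is a hypothetical helper; the bands are the rough
# figures reported for one machine in this thread, not general limits.
# Usage: arc_band <arc-size-in-bytes>
arc_band() {
    bytes="$1"
    gib=$((1024 * 1024 * 1024))
    if [ "$bytes" -lt "$gib" ]; then
        echo "under 1 GiB: generally responsive"
    elif [ "$bytes" -lt $((4 * gib)) ]; then
        echo "1-4 GiB: worsening"
    elif [ "$bytes" -lt $((10 * gib)) ]; then
        echo "4-10 GiB: increasingly unusable"
    else
        echo "10 GiB or more: windowserver lockup reported"
    fi
}

# Example (assumes the ARC size kstat is exposed under this sysctl name):
#   arc_band "$(sysctl -n kstat.zfs.misc.arcstats.size)"
```

Logging the output of this alongside a timestamp while reproducing the problem would make it easier to say at what ARC size a given system starts to struggle.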

Before windowserver crashed I was able to capture another spindump while copying (via rsync) between two datasets; I thought this might be useful to compare a spindump with ARC compression disabled. You can jump to the receiving side process by searching for "rsync [9056]".
spindump.v2.2.3rc4.compresse_arc_disabled.zip

My conclusions from all of this are:

  • The problem is at its most severe with larger amounts of write activity passing through primary ARC.
  • While it's possible read activity is also affected, the issue is far less severe.
  • Disabling compressed ARC improved performance but didn't fix the problem. My theory is that decompressing compressed ARC entries somehow has a multiplicative effect on the performance problem (slow operations became even slower).
  • L2ARC appears to have no particular effect – removing L2ARC device(s) made no noticeable difference.
  • The problem gets worse the larger the primary ARC gets; it's hard to tell if this is because of increased cache hits, or the issue is related to ARC structures that have grown larger. By the time I get to this point it becomes very difficult to run useful tests without causing windowserver to crash.

@Haravikk
Author

After upgrading to macOS Sonoma, I've been able to confirm that this extreme performance degradation does not occur there. I was able to upgrade to 2.2.3rc4 without any issues, and even once my ARC has filled I see no performance loss, with no need to turn off ARC or any of its features.

Whatever the problem is, it appears to be specific to Catalina, and possibly all pre-ARM/M1 versions of macOS, though I can't confirm this as I skipped Big Sur, Monterey and Ventura to upgrade directly to Sonoma.

2 participants