
feat: Make cluster cache target the watch cache instead of etcd #617

Open
wants to merge 1 commit into base: master
Conversation

@tosi3k commented Jul 25, 2024

Problem statement in argoproj/argo-cd#18838.

This PR addresses the problem partially, starting with delegating the K8s List calls to the watch cache and offloading etcd.

@tosi3k changed the title from "feat: make cluster cache target the watch cache instead of etcd" to "feat: Make cluster cache target the watch cache instead of etcd" on Jul 25, 2024
@tosi3k (Author) commented Jul 25, 2024

CC @crenshaw-dev

Signed-off-by: Antoni Zawodny <zawodny@google.com>

@crenshaw-dev (Member) commented:

@tosi3k just reading around about the watch cache, it looks like maybe the watch cache doesn't support pagination? kubernetes/kubernetes#102672

If that's still the case, are we going to end up falling back to etcd anyway since we set the page size? And do we run the risk of causing massive kube-apiserver memory load by using the watch cache instead of etcd?

@tosi3k (Author) commented Jul 26, 2024

Watch cache not having support for pagination (i.e. disregarding the limit from the List API call's URI query string) is a known issue - the issue you linked provides a nice discussion about the tradeoffs we face here.

If that's still the case, are we going to end up falling back to etcd anyway since we set the page size?

Not really - it's the limit parameter that is disregarded when set in conjunction with resourceVersion=0 in a List API call, not the resourceVersion param.

And do we run the risk of causing massive kube-apiserver memory load by using the watch cache instead of etcd?

I'd rather call it trading off some amount of memory for a large amount of CPU, with the latter being the scarcer resource at the scale of large Kubernetes clusters in general :). In conjunction with #616, that would mean a short memory usage spike in kube-apiserver just on cache initialization rather than every 10 minutes.

In general, this will eventually be addressed by kubernetes/enhancements#3142 - that enhancement's enablement was unfortunately reverted last-minute in K8s 1.31, but it will be available in 1.32.
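For illustration, here is a minimal client-go sketch of the semantics described above; the dynamic client setup, GVR, and kubeconfig handling are illustrative assumptions, not code from this PR.

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := dynamic.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    gvr := schema.GroupVersionResource{Version: "v1", Resource: "configmaps"}

    // resourceVersion="0" lets kube-apiserver serve the list from its watch
    // cache ("any" semantics); with it set, the limit below is ignored and the
    // full, unpaginated result comes back in one response.
    list, err := client.Resource(gvr).Namespace("default").List(context.TODO(), metav1.ListOptions{
        ResourceVersion: "0",
        Limit:           500, // honored only if the request is served from etcd
    })
    if err != nil {
        panic(err)
    }
    fmt.Printf("got %d items at resourceVersion %s\n", len(list.Items), list.GetResourceVersion())
}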

@crenshaw-dev (Member) commented:

Not really - it's the limit parameter that is disregarded when set in conjunction with resourceVersion=0 in a List API call, not the resourceVersion param.

Should we explicitly remove the limit parameter to avoid confusion, or does it end up being used in the case of a fallback to etcd?

In general, this will eventually be addressed by kubernetes/enhancements#3142 - that enhancement's enablement was unfortunately reverted last-minute in K8s 1.31, but it will be available in 1.32.

Makes sense, I'll just keep an eye on CPU use while testing, and we'll want to pay attention during Argo CD RCs for any complaints.

@tosi3k (Author) commented Jul 26, 2024

Should we explicitly remove the limit parameter to avoid confusion, or does it end up being used in the case of a fallback to etcd?

Good question - we don't want to drop it, since there's a possibility someone disabled the watch cache explicitly (I think some K8s distros do so) by setting the apiserver's --watch-cache flag to false: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/.
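In other words, both parameters can stay set. Below is a rough sketch of such a paginated list using client-go's pager package; the package name, function name, and page-size handling are illustrative assumptions, not the actual gitops-engine code.

package clustercache // illustrative package name

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/pager"
)

// listWithWatchCache keeps the page size set so that pagination still applies
// when the watch cache is disabled (--watch-cache=false) and the request is
// served from etcd. When the watch cache is enabled, resourceVersion="0" makes
// the server answer from the cache and silently ignore the limit.
func listWithWatchCache(ctx context.Context, resClient dynamic.ResourceInterface, pageSize int64) (runtime.Object, error) {
    listPager := pager.New(pager.SimplePageFunc(func(opts metav1.ListOptions) (runtime.Object, error) {
        return resClient.List(ctx, opts)
    }))
    listPager.PageSize = pageSize

    obj, _, err := listPager.List(ctx, metav1.ListOptions{ResourceVersion: "0"})
    return obj, err
}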

@crenshaw-dev (Member) commented:

Cool! Okay next step is for me to do some testing and gather data. I can't make any promises on how soon that'll be, but I want to get this change into 2.13.

Comment on lines +556 to +559
// We set the resource version to 0 below to proactively prevent the
// list API call from reaching etcd and make the server fetch the data
// from the watch cache instead.
opts.ResourceVersion = "0"

This is unclear to me.
When RV=0 is set, kube-apiserver now ignores the Limit option, so the whole pagination won't be effective anyway.

When is this function called? Is it only on initialization, or is it also used later?

@tosi3k (Author) Jul 31, 2024:

With #616 submitted together with this PR, this would be called only on initialization, on "resource version too old", or in pathological scenarios like failing to extract the RV from a watch event, which would require relisting anyway.

Without #616, the list would still be executed every ~10 min by default (same as now).

IMHO it would make more sense to have both changes in a single PR, but I was asked to split them to facilitate some testing that @crenshaw-dev wants to do internally.


I'm assuming #616 will merge. With that assumption, I'm not even convinced we want this one to happen.

(Member):

I'm not even convinced we want this one to happen.

Why not? Doesn't using the watch cache have its own benefits independent of eliminating the 10-minute re-list operations?


It has, but it comes with the cost of reads being potentially stale. I don't know ArgoCD well enough to tell what consequences that may have.
At the same time, new features are landing in k8s 1.31 and more will land in 1.32 and beyond (ConsistentListFromCache, WatchList/streaming list), after which this will no longer matter.

So let's start with the other PR for now and revisit this one in case of complaints (we know that the other PR on its own would effectively address the issues that we observed).

(Member):

reads being potentially stale

What's the actual negative impact of this? If after doing a watch cache list we immediately handle all events that have happened since the returned resourceVersion, what risk have we introduced that didn't already exist?

let's start with the other PR for now

The other PR reduces ongoing etcd load and the likelihood of hitting etcd compaction errors during long list operations, but it doesn't eliminate the initial load or the possibility of compaction errors.

The other PR also eliminates the 10min re-list completely. My understanding is that the re-list helps ensure data integrity (for example, when events get missed) and is a pattern used in the Kubernetes informer code (so a general best practice). I'm up for an experiment, but I'm a little wary of completely eliminating the re-list or deviating too far from a pattern used by the k8s code we're trying to emulate.
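For context, the compaction errors mentioned above surface as 410 Gone / "resource expired" responses when a paginated list's continue token outlives an etcd compaction. A typical client-side handling sketch (all names are illustrative, not code from this repo):

package listutil // illustrative package name

import (
    "context"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/client-go/dynamic"
)

// listAllPages pages through a resource against etcd (resourceVersion left
// empty). If the continue token expires because etcd compacted the revision it
// refers to, the API server returns 410 Gone and the list is restarted from
// the beginning.
func listAllPages(ctx context.Context, resClient dynamic.ResourceInterface) ([]unstructured.Unstructured, error) {
    var items []unstructured.Unstructured
    opts := metav1.ListOptions{Limit: 500}
    for {
        page, err := resClient.List(ctx, opts)
        if apierrors.IsResourceExpired(err) {
            // Continue token compacted away: drop partial results and relist.
            items = nil
            opts.Continue = ""
            continue
        }
        if err != nil {
            return nil, err
        }
        items = append(items, page.Items...)
        if page.GetContinue() == "" {
            return items, nil
        }
        opts.Continue = page.GetContinue()
    }
}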


What's the actual negative impact of this? If after doing a watch cache list we immediately handle all events that have happened since the returned resourceVersion, what risk have we introduced that didn't already exist?

In the majority of cases, not a big risk, but if the cache is visibly lagging (we've had an outage where we observed it lagging by ~1h or so), then it might be a bigger problem.

The other PR also eliminates the 10min re-list completely. My understanding is that the re-list helps ensure data integrity (for example, when events get missed) and is a pattern used in the Kubernetes informer code (so a general best practice). I'm up for an experiment, but I'm a little wary of completely eliminating the re-list or deviating too far from a pattern used by the k8s code we're trying to emulate.

Nope - core k8s controllers generally don't relist at all.
You might be mixing that up with the "resync" concept in the informer framework. A resync is not a relist - it calls a handler for all objects, but it iterates over them from the in-memory cache.
Historically (5+ years ago) we were indeed relisting periodically due to bugs that were still happening, but our machinery has stabilized since then and we're not doing that anymore for core components.
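To illustrate the distinction, here is a sketch against client-go's informer framework; the clientset setup and the pod handler are illustrative, and the 10-minute value mirrors the default resync period mentioned earlier in the thread.

package main

import (
    "fmt"
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(cfg)

    // The 10-minute resync period does NOT relist from the API server: every
    // 10 minutes the informer re-delivers the objects already sitting in its
    // in-memory store to UpdateFunc. A relist only happens on the initial sync
    // or after a watch error such as "resource version too old".
    factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
    podInformer := factory.Core().V1().Pods().Informer()
    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        UpdateFunc: func(oldObj, newObj interface{}) {
            pod := newObj.(*corev1.Pod)
            fmt.Println("update (real change or resync):", pod.Name)
        },
    })

    stop := make(chan struct{})
    defer close(stop)
    factory.Start(stop)
    factory.WaitForCacheSync(stop)
    <-stop // block forever; handlers keep firing on changes and resyncs
}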

@crenshaw-dev (Member) Jul 31, 2024:

we observed lagging by ~1h or so

Damn okay, that makes sense. I assume we'd expect to have to process a ton of events in that case.

But #616 wouldn't really solve that, it would just make it less likely by reducing the number of list operations, if I understand correctly.

core k8s controllers generally don't relist at all. You might be mixing that with the "resync" concept in the informer framework

I absolutely am confusing those two things, good catch.

In that case I'm much less worried about #616, and I do see why you might find it preferable to just do #616 and skip the watch cache change until ~1.32. I would like to test the changes independently and see where we land.

I'm not sure when I'll have time to test, and I'm very uncomfortable merging any performance changes without testing. So if anyone else has time / a good test environment, I'd appreciate the help.


But #616 wouldn't really solve that, it would just make it less likely by reducing the number of list operations, if I understand correctly.

#616 indeed isn't a fix for that - what it does is prevent us from going back in time across component restarts.
Lagging is one thing; going back in time is a more serious thing to face.
[Effectively an instance of https://github.com/kubernetes/kubernetes/issues/59848]

From a scalability POV you will certainly see an improvement with this change - but I'd prefer to stay on the safe side first and potentially consider it later, only if really needed.

Comment on lines +556 to +559
// We set the resource version to 0 below to proactively prevent the
// list API call from reaching etcd and make the server fetch the data
// from the watch cache instead.
opts.ResourceVersion = "0"
@jessesuen (Member) Aug 16, 2024:

I don't think it's ever a good idea to list with resourceVersion=0. Even though this allows the API server to use its object cache, it causes a worse problem: it does not allow limit/pagination/chunking to be used at all, so all objects of that group/kind will be returned in the response, resulting in OOM conditions in the client dealing with the huge response.

We've been wrestling with this problem in Workflows over the years, and the solution we landed on is to never allow resourceVersion=0 requests, even if it means hitting etcd (resourceVersion="").

argoproj/argo-workflows#13466 (review)
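A minimal sketch of that rule expressed as a guard on list options (the package and function names are illustrative, not code from Argo Workflows):

package listopts // illustrative package name

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// sanitizeListOptions sketches the guard described above: when a limit is
// requested, avoid resourceVersion="0" so that the API server honors
// pagination, at the cost of the read being served from etcd.
func sanitizeListOptions(opts metav1.ListOptions) metav1.ListOptions {
    if opts.Limit > 0 && opts.ResourceVersion == "0" {
        opts.ResourceVersion = "" // quorum read from etcd, paginated
    }
    return opts
}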

(Member):

And do we run the risk of causing massive kube-apiserver memory load by using the watch cache instead of etcd?

I'd rather call it trading off some amount of memory for a large amount of CPU, with the latter being the scarcer resource at the scale of large Kubernetes clusters in general :).

The tradeoff is not really about an increase in kube-apiserver CPU/memory. This will cause the application-controller to have to load all objects of every group/kind in memory when dealing with the non-paginated responses.

@tosi3k (Author) Aug 20, 2024:

The tradeoff is not really about an increase in kube-apiserver CPU/memory. This will cause the application-controller to have to load all objects of every group/kind in memory when dealing with the non-paginated responses.

The tradeoff I mentioned was from the server's POV, not the client's, but I agree there's a risk of the client OOMing if the number of resources it tracks is large and the client isn't assigned enough memory. Transforming informers could be a possible way to alleviate the problem you describe, since they make it possible to reduce the memory footprint of the objects stored in the client's cache, but migrating to them here would require considerably more work to validate, if feasible at all.
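For reference, a rough sketch of what such a transforming informer looks like with client-go's SetTransform; the dynamic informer setup is illustrative, and stripping managedFields is just one example of reducing per-object memory.

package main

import (
    "time"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/dynamic/dynamicinformer"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := dynamic.NewForConfigOrDie(cfg)

    factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)
    gvr := schema.GroupVersionResource{Version: "v1", Resource: "configmaps"}
    informer := factory.ForResource(gvr).Informer()

    // The transform runs before objects are stored in the informer's cache, so
    // anything dropped here (managedFields in this example) never occupies
    // client memory. SetTransform must be called before the informer starts.
    if err := informer.SetTransform(func(obj interface{}) (interface{}, error) {
        if u, ok := obj.(*unstructured.Unstructured); ok {
            u.SetManagedFields(nil)
        }
        return obj, nil
    }); err != nil {
        panic(err)
    }

    stop := make(chan struct{})
    defer close(stop)
    factory.Start(stop)
    factory.WaitForCacheSync(stop)
    <-stop // block forever; the informer keeps its slimmed-down cache in sync
}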
