
feat: Make cluster cache target the watch cache instead of etcd #617

Open
wants to merge 1 commit into base: master
Conversation

@tosi3k commented Jul 25, 2024

Problem statement in argoproj/argo-cd#18838.

This PR addresses the problem partially, starting with delegating the K8s List calls to the watch cache and offloading etcd.

@tosi3k changed the title from "feat: make cluster cache target the watch cache instead of etcd" to "feat: Make cluster cache target the watch cache instead of etcd" on Jul 25, 2024
@tosi3k (Author) commented Jul 25, 2024

CC @crenshaw-dev

Signed-off-by: Antoni Zawodny <zawodny@google.com>

@crenshaw-dev (Member) commented:

@tosi3k just reading around about the watch cache, it looks like maybe the watch cache doesn't support pagination? kubernetes/kubernetes#102672

If that's still the case, are we going to end up falling back to etcd anyway since we set the page size? And do we run the risk of causing massive kube-apiserver memory load by using the watch cache instead of etcd?

@tosi3k (Author) commented Jul 26, 2024

Watch cache not having support for pagination (i.e. disregarding the limit from the List API call's URI query string) is a known issue - the issue you linked provides a nice discussion about the tradeoffs we face here.

If that's still the case, are we going to end up falling back to etcd anyway since we set the page size?

Not really - it's the limit parameter that is disregarded when set in conjunction with resourceVersion=0 in a List API call, not the resourceVersion param.

And do we run the risk of causing massive kube-apiserver memory load by using the watch cache instead of etcd?

I'd rather call it trading off some amount of memory for a large amount of CPU, with the latter being the scarcer resource at the scale of large Kubernetes clusters in general :). In conjunction with #616, that would mean a short memory usage spike in kube-apiserver just on cache initialization rather than every 10 minutes.

In general, this will eventually be addressed by kubernetes/enhancements#3142 - that enhancement's enablement was unfortunately reverted last-minute in K8s 1.31, but it will be available in 1.32.
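For illustration, here is a minimal client-go sketch of the semantics described above; the dynamic client setup, GVR, and kubeconfig handling are illustrative assumptions, not code from this PR.

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client, err := dynamic.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    gvr := schema.GroupVersionResource{Version: "v1", Resource: "configmaps"}

    // resourceVersion="0" lets kube-apiserver serve the list from its watch
    // cache ("any" semantics); with it set, the limit below is ignored and the
    // full, unpaginated result comes back in one response.
    list, err := client.Resource(gvr).Namespace("default").List(context.TODO(), metav1.ListOptions{
        ResourceVersion: "0",
        Limit:           500, // honored only if the request is served from etcd
    })
    if err != nil {
        panic(err)
    }
    fmt.Printf("got %d items at resourceVersion %s\n", len(list.Items), list.GetResourceVersion())
}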

@crenshaw-dev (Member) commented:

Not really - it's the limit parameter that is disregarded when set in conjunction with resourceVersion=0 in a List API call, not the resourceVersion param.

Should we explicitly remove the limit parameter to avoid confusion, or does it end up being used in the case of a fallback to etcd?

In general, this will eventually be addressed by kubernetes/enhancements#3142 - that enhancement's enablement was unfortunately reverted last-minute in K8s 1.31, but it will be available in 1.32.

Makes sense, I'll just keep an eye on CPU use while testing, and we'll want to pay attention during Argo CD RCs for any complaints.

@tosi3k (Author) commented Jul 26, 2024

Should we explicitly remove the limit parameter to avoid confusion, or does it end up being used in the case of a fallback to etcd?

Good question - we don't want to drop it, since there's a possibility someone disabled the watch cache explicitly (I think some K8s distros do so) by setting the apiserver's --watch-cache flag to false: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/.
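In other words, both parameters can stay set. Below is a rough sketch of such a paginated list using client-go's pager package; the package name, function name, and page-size handling are illustrative assumptions, not the actual gitops-engine code.

package clustercache // illustrative package name

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/pager"
)

// listWithWatchCache keeps the page size set so that pagination still applies
// when the watch cache is disabled (--watch-cache=false) and the request is
// served from etcd. When the watch cache is enabled, resourceVersion="0" makes
// the server answer from the cache and silently ignore the limit.
func listWithWatchCache(ctx context.Context, resClient dynamic.ResourceInterface, pageSize int64) (runtime.Object, error) {
    listPager := pager.New(pager.SimplePageFunc(func(opts metav1.ListOptions) (runtime.Object, error) {
        return resClient.List(ctx, opts)
    }))
    listPager.PageSize = pageSize

    obj, _, err := listPager.List(ctx, metav1.ListOptions{ResourceVersion: "0"})
    return obj, err
}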

@crenshaw-dev (Member) commented:

Cool! Okay next step is for me to do some testing and gather data. I can't make any promises on how soon that'll be, but I want to get this change into 2.13.

Comment on lines +556 to +559
// We set the resource version to 0 below to proactively prevent the
// list API call from reaching etcd and make the server fetch the data
// from the watch cache instead.
opts.ResourceVersion = "0"

This is unclear to me.
When RV=0 is set, kube-apiserver now ignores the Limit option, so the whole pagination won't be effective anyway.

When is this function called? Is it only on initialization, or is it also used later?

@tosi3k (Author) Jul 31, 2024:

With #616 submitted together with this PR, this would be called only on initialization, on "resource version too old", or in pathological scenarios like failing to extract the RV from a watch event, which would require relisting anyway.

Without #616, the list would still be executed every ~10 min by default (same as now).

IMHO it would make more sense to have both changes in a single PR, but I was asked to split them to facilitate some testing that @crenshaw-dev wants to do internally.


I'm assuming #616 will merge. With that assumption, I'm not even convinced we want this one to happen.

(Member):

I'm not even convinced we want this one to happen.

Why not? Doesn't using the watch cache have its own benefits independent of eliminating the 10-minute re-list operations?


It has, but it comes with the cost of reads being potentially stale. I don't know ArgoCD well enough to tell what consequences that may have.
At the same time, new features are landing in k8s 1.31 and more will land in 1.32 and beyond (ConsistentListFromCache, WatchList/streaming list), after which this will no longer matter.

So let's start with the other PR for now and revisit this one in case of complaints (we know that the other PR on its own would effectively address the issues that we observed).

(Member):

reads being potentially stale

What's the actual negative impact of this? If after doing a watch cache list we immediately handle all events that have happened since the returned resourceVersion, what risk have we introduced that didn't already exist?

let's start with the other PR for now

The other PR reduces ongoing etcd load and the likelihood of hitting etcd compaction errors during long list operations, but it doesn't eliminate the initial load or the possibility of compaction errors.

The other PR also eliminates the 10min re-list completely. My understanding is that the re-list helps ensure data integrity (for example, when events get missed) and is a pattern used in the Kubernetes informer code (so a general best practice). I'm up for an experiment, but I'm a little wary of completely eliminating the re-list or deviating too far from a pattern used by the k8s code we're trying to emulate.
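For context, the compaction errors mentioned above surface as 410 Gone / "resource expired" responses when a paginated list's continue token outlives an etcd compaction. A typical client-side handling sketch (all names are illustrative, not code from this repo):

package listutil // illustrative package name

import (
    "context"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/client-go/dynamic"
)

// listAllPages pages through a resource against etcd (resourceVersion left
// empty). If the continue token expires because etcd compacted the revision it
// refers to, the API server returns 410 Gone and the list is restarted from
// the beginning.
func listAllPages(ctx context.Context, resClient dynamic.ResourceInterface) ([]unstructured.Unstructured, error) {
    var items []unstructured.Unstructured
    opts := metav1.ListOptions{Limit: 500}
    for {
        page, err := resClient.List(ctx, opts)
        if apierrors.IsResourceExpired(err) {
            // Continue token compacted away: drop partial results and relist.
            items = nil
            opts.Continue = ""
            continue
        }
        if err != nil {
            return nil, err
        }
        items = append(items, page.Items...)
        if page.GetContinue() == "" {
            return items, nil
        }
        opts.Continue = page.GetContinue()
    }
}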


What's the actual negative impact of this? If after doing a watch cache list we immediately handle all events that have happened since the returned resourceVersion, what risk have we introduced that didn't already exist?

In the majority of cases, not a big risk, but if the cache is visibly lagging (we've had an outage where we observed it lagging by ~1h or so), then it might be a bigger problem.

The other PR also eliminates the 10min re-list completely. My understanding is that the re-list helps ensure data integrity (for example, when events get missed) and is a pattern used in the Kubernetes informer code (so a general best practice). I'm up for an experiment, but I'm a little wary of completely eliminating the re-list or deviating too far from a pattern used by the k8s code we're trying to emulate.

Nope - core k8s controllers generally don't relist at all.
You might be mixing that up with the "resync" concept in the informer framework. A resync is not a relist - it calls a handler for all objects, but it iterates over them from the in-memory cache.
Historically (5+ years ago) we were indeed relisting periodically due to bugs that were still happening, but our machinery has stabilized since then and we're not doing that anymore for core components.
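To illustrate the distinction, here is a sketch against client-go's informer framework; the clientset setup and the pod handler are illustrative, and the 10-minute value mirrors the default resync period mentioned earlier in the thread.

package main

import (
    "fmt"
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(cfg)

    // The 10-minute resync period does NOT relist from the API server: every
    // 10 minutes the informer re-delivers the objects already sitting in its
    // in-memory store to UpdateFunc. A relist only happens on the initial sync
    // or after a watch error such as "resource version too old".
    factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
    podInformer := factory.Core().V1().Pods().Informer()
    podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        UpdateFunc: func(oldObj, newObj interface{}) {
            pod := newObj.(*corev1.Pod)
            fmt.Println("update (real change or resync):", pod.Name)
        },
    })

    stop := make(chan struct{})
    defer close(stop)
    factory.Start(stop)
    factory.WaitForCacheSync(stop)
    <-stop // block forever; handlers keep firing on changes and resyncs
}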

@crenshaw-dev (Member) Jul 31, 2024:

we observed lagging by ~1h or so

Damn okay, that makes sense. I assume we'd expect to have to process a ton of events in that case.

But #616 wouldn't really solve that, it would just make it less likely by reducing the number of list operations, if I understand correctly.

core k8s controllers generally don't relist at all. You might be mixing that with the "resync" concept in the informer framework

I absolutely am confusing those two things, good catch.

In that case I'm much less worried about #616, and I do see why you might find it preferable to just do #616 and skip the watch cache change until ~1.32. I would like to test the changes independently and see where we land.

I'm not sure when I'll have time to test, and I'm very uncomfortable merging any performance changes without testing. So if anyone else has time / a good test environment, I'd appreciate the help.


But #616 wouldn't really solve that, it would just make it less likely by reducing the number of list operations, if I understand correctly.

#616 indeed isn't a fix for that - what it does is prevent us from going back in time across component restarts.
Lagging is one thing; going back in time is a more serious thing to face.
[Effectively an instance of https://github.com/kubernetes/kubernetes/issues/59848]

From a scalability POV you will certainly see an improvement with this change - but I'd prefer to stay on the safe side first and potentially consider it later, only if really needed.

Comment on lines +556 to +559
// We set the resource version to 0 below to proactively prevent the
// list API call from reaching etcd and make the server fetch the data
// from the watch cache instead.
opts.ResourceVersion = "0"
@jessesuen (Member) Aug 16, 2024:

I don't think it's ever a good idea to list with resourceVersion=0. Even though this allows the API server to use its object cache, it causes a worse problem: it does not allow limit/pagination/chunking to be used at all, so all objects of that group/kind will be returned in the response, resulting in OOM conditions in the client dealing with the huge response.

We've been wrestling with this problem in Workflows over the years, and the solution we landed on is to never allow resourceVersion=0 requests, even if it means hitting etcd (resourceVersion="").

argoproj/argo-workflows#13466 (review)
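A minimal sketch of that rule expressed as a guard on list options (the package and function names are illustrative, not code from Argo Workflows):

package listopts // illustrative package name

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// sanitizeListOptions sketches the guard described above: when a limit is
// requested, avoid resourceVersion="0" so that the API server honors
// pagination, at the cost of the read being served from etcd.
func sanitizeListOptions(opts metav1.ListOptions) metav1.ListOptions {
    if opts.Limit > 0 && opts.ResourceVersion == "0" {
        opts.ResourceVersion = "" // quorum read from etcd, paginated
    }
    return opts
}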

(Member):

And do we run the risk of causing massive kube-apiserver memory load by using the watch cache instead of etcd?

I'd rather call it trading off some amount of memory for a large amount of CPU, with the latter being the scarcer resource at the scale of large Kubernetes clusters in general :).

The tradeoff is not really about an increase in kube-apiserver CPU/memory. This will cause the application-controller to have to load all objects of every group/kind in memory when dealing with the non-paginated responses.

@tosi3k (Author) Aug 20, 2024:

The tradeoff is not really about an increase in kube-apiserver CPU/memory. This will cause the application-controller to have to load all objects of every group/kind in memory when dealing with the non-paginated responses.

The tradeoff I mentioned was from the server's POV, not the client's, but I agree there's a risk of the client OOMing if the number of resources it tracks is large and the client isn't assigned enough memory. Transforming informers could be a possible way to alleviate the problem you describe, since they make it possible to reduce the memory footprint of the objects stored in the client's cache, but migrating to them here would require considerably more work to validate, if feasible at all.
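For reference, a rough sketch of what such a transforming informer looks like with client-go's SetTransform; the dynamic informer setup is illustrative, and stripping managedFields is just one example of reducing per-object memory.

package main

import (
    "time"

    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/dynamic/dynamicinformer"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := dynamic.NewForConfigOrDie(cfg)

    factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)
    gvr := schema.GroupVersionResource{Version: "v1", Resource: "configmaps"}
    informer := factory.ForResource(gvr).Informer()

    // The transform runs before objects are stored in the informer's cache, so
    // anything dropped here (managedFields in this example) never occupies
    // client memory. SetTransform must be called before the informer starts.
    if err := informer.SetTransform(func(obj interface{}) (interface{}, error) {
        if u, ok := obj.(*unstructured.Unstructured); ok {
            u.SetManagedFields(nil)
        }
        return obj, nil
    }); err != nil {
        panic(err)
    }

    stop := make(chan struct{})
    defer close(stop)
    factory.Start(stop)
    factory.WaitForCacheSync(stop)
    <-stop // block forever; the informer keeps its slimmed-down cache in sync
}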
