
a number of Nexus RPWs operate on instances in a predictable order #6565

Open

hawkw opened this issue Sep 12, 2024 · 5 comments
Labels: nexus (Related to nexus)

hawkw commented Sep 12, 2024

Something that occurred to me whilst working on #6503 is that a number of Nexus' RPWs that operate on instances (instance_watcher, instance_updater, and the upcoming instance_resurrection task added in #6503) currently iterate over those instances with the same ordering for all Nexus replicas.

This is due to the use of pagination in the database queries that feed these tasks' operations: paginating requires that the query be ordered by the column being paginated on. Many of these background tasks currently perform work on instances/VMMs in the order they are returned from the database, because they are written to avoid holding all of the instances/VMMs that match their queries in memory at the same time. Instead, they read a page of results from the database, operate on the returned entries, and then move on to the next page. This means that when these tasks activate on multiple Nexuses, they will perform operations like instance health checking in the same order on every Nexus.
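
To make the ordering concrete, here is a minimal sketch of marker-based pagination ordered by ID (illustrative types only, not the real Nexus `Paginator` API). Every replica running this loop visits instances in the same ascending order:

```rust
// Hypothetical stand-in for an instance row; the "query" below models
// `SELECT ... WHERE id > $marker ORDER BY id LIMIT $limit`.
#[derive(Debug)]
struct Instance {
    id: u64,
}

/// Returns the next page: instances with IDs greater than `marker`.
/// Assumes `table` is sorted by `id`, mirroring the ORDER BY clause.
fn fetch_page(table: &[Instance], marker: Option<u64>, limit: usize) -> &[Instance] {
    let start = match marker {
        Some(m) => table.partition_point(|i| i.id <= m),
        None => 0,
    };
    &table[start..(start + limit).min(table.len())]
}

fn main() {
    let table: Vec<Instance> = (0..10).map(|id| Instance { id }).collect();
    let mut marker = None;
    loop {
        let page = fetch_page(&table, marker, 3);
        let Some(last) = page.last() else { break };
        for inst in page {
            // Every replica performs this work in the same order.
            println!("operating on instance {}", inst.id);
        }
        marker = Some(last.id);
    }
}
```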

If these tasks operate on instances/VMMs in the same order on every Nexus replica, and the Nexuses activate their background tasks at around the same time,1 the likelihood of contention between the sagas (instance-update, in the case of the instance_watcher and instance_updater RPWs, and instance-start, in instance_resurrection) is higher. This results in duplicate work, like launching sagas that will then quickly fail because another saga has already performed the same operations. In addition, if a particular instance or VMM is in some way problematic, or operations on it take a really long time, it's possible for it to gum up every Nexus' RPW at the same time.

I'm not sure if any of this is a particularly urgent issue, but I wonder if it's worth trying to change these RPWs so that each Nexus replica is operating on instances/VMMs in a different order. Maybe there's some way for each Nexus' database queries to have a random start position in the table, but I'm not sure how that would interact with the existing pagination code, or even if there's a way to do that with CRDB at all. Another potential solution could be to change these tasks to instead read all matching instances/VMMs into a HashMap with a hasher using a random seed unique to that Nexus, and then operate on them in the order of iterating over that hashmap. This would nicely distribute work so that it's much less likely for two Nexuses to be operating in the same place, but it would also mean having to store all matching records in memory, rather than just the ones currently being operated on.2
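
A minimal sketch of that second idea: Rust's default `HashMap` hasher (`RandomState`) is seeded randomly in each process, so each Nexus replica would iterate over the same set of records in its own order. The record type here is a hypothetical stand-in for the real query results:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a fetched instance/VMM record.
#[derive(Debug)]
struct InstanceRecord {
    name: String,
}

fn main() {
    // Pretend these are all the records matching the task's query.
    let records: Vec<(u64, InstanceRecord)> = (0..8)
        .map(|id| (id, InstanceRecord { name: format!("instance-{id}") }))
        .collect();

    // `HashMap::new()` uses `RandomState`, seeded randomly per process, so
    // two Nexus replicas iterate in different orders with high probability.
    // The trade-off: every matching record is held in memory at once.
    let by_id: HashMap<u64, InstanceRecord> = records.into_iter().collect();
    for (id, record) in &by_id {
        println!("operating on {} ({})", id, record.name);
    }
}
```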

Footnotes

  1. Fortunately, it's pretty unlikely that they will be perfectly in sync, as the Nexuses probably won't all start at precisely the same time, and non-deterministic scheduling on the part of both Tokio and the OS scheduler adds additional noise.

  2. In some cases, that might be fine --- the instance_watcher task, for instance, already constructs a hashmap of all VMMs it health checked in order to record their metrics, so it already has O(instances) memory use.

hawkw added the nexus label Sep 12, 2024
hawkw self-assigned this Sep 12, 2024

hawkw commented Sep 12, 2024

I'd love to hear what @gjcolombo, @davepacheco, and maybe @smklein think about this, when y'all have a moment --- is this worth trying to change?


hawkw commented Sep 12, 2024

It's possible that there are other Nexus RPWs that operate on other entities in a predictable order as well --- I'm personally only familiar with the instance-management subsystem, but I wonder if we're also operating on other things (e.g., disks) in the same order on every Nexus?

davepacheco commented:

I agree this could be a problem, but I would probably wait to see it be a problem before doing anything about it. The impact seems likely to be small, and in deployed systems it seems unlikely to me that the Nexus instances' start times will be that aligned.

If this were a problem, I could see starting pagination at a random point and wrapping around. That should be fine as far as CockroachDB is concerned. It might require type-specific code in Paginator but it seems doable to me.
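
A hedged sketch of that wrap-around idea, reusing the same hypothetical marker-based `fetch_page` as the earlier sketch (the `rand` crate is assumed as a dependency): pick a random starting key, paginate to the end of the table, then wrap around from the beginning up to the starting key.

```rust
use rand::Rng;

#[derive(Debug)]
struct Instance {
    id: u64,
}

// Same hypothetical paginated query as before: IDs greater than `marker`,
// in ascending order, at most `limit` rows. Assumes `table` is sorted by id.
fn fetch_page(table: &[Instance], marker: Option<u64>, limit: usize) -> &[Instance] {
    let start = match marker {
        Some(m) => table.partition_point(|i| i.id <= m),
        None => 0,
    };
    &table[start..(start + limit).min(table.len())]
}

fn main() {
    let table: Vec<Instance> = (0..10).map(|id| Instance { id }).collect();
    let limit = 3;

    // Each replica picks its own random starting key.
    let start_key = rand::thread_rng().gen_range(0..10u64);

    // Pass 1: from just past the random key to the end of the table.
    let mut marker = Some(start_key);
    loop {
        let page = fetch_page(&table, marker, limit);
        let Some(last) = page.last() else { break };
        for inst in page {
            println!("operating on instance {}", inst.id);
        }
        marker = Some(last.id);
    }

    // Pass 2 (wrap-around): from the beginning up to the starting key,
    // so each instance is visited exactly once per scan.
    let mut marker = None;
    loop {
        let page: Vec<&Instance> = fetch_page(&table, marker, limit)
            .iter()
            .take_while(|i| i.id <= start_key)
            .collect();
        let Some(last) = page.last() else { break };
        for inst in &page {
            println!("operating on instance {}", inst.id);
        }
        marker = Some(last.id);
    }
}
```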

I could also see staggering task activation times after startup. I think we probably want to activate everything once at startup, but each subsequent activation could be triggered at a random point within the activation window.
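
A small sketch of the staggering idea (the function name and the use of `rand` are illustrative, not the real task-scheduling API):

```rust
use rand::Rng;
use std::time::Duration;

// Given the configured activation period, choose the delay until the next
// activation uniformly at random within that window, so replicas that
// started at the same time quickly drift apart.
fn next_activation_delay(period: Duration) -> Duration {
    let period_ms = period.as_millis() as u64;
    let jitter_ms = rand::thread_rng().gen_range(0..=period_ms);
    Duration::from_millis(jitter_ms)
}

fn main() {
    let period = Duration::from_secs(30);
    println!("next activation in {:?}", next_activation_delay(period));
}
```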

gjcolombo commented:

I could very well be missing something, but I wonder if tasks like this need to paginate at all. At least for the instance tasks, it seems like the idea is to (1) find some set of instances in an aberrant state (e.g. not heartbeated in a while, needs an update, is in the Failed state) and (2) do a thing to fix them. Once you've successfully done the thing, the instance isn't in the bad state anymore, so it will no longer appear in queries that list the instances in bad states. If all that is so, could the tasks just do a SELECT * FROM instance WHERE <bad state> ORDER BY random() LIMIT n to get a set of instances to act on (and then rerun that query after doing their work)? That would make separate tasks much more likely to select distinct instances, but you'd still be safe from having a task work on the same instance repeatedly, since it would be removed from the query pool after the first time it was selected.
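
A self-contained sketch of this approach, modeling the table as an in-memory set (CockroachDB does support `ORDER BY random()`; the names here are illustrative and the `rand` crate is assumed):

```rust
use rand::seq::SliceRandom;

// Models `SELECT id FROM instance WHERE <bad state>
// ORDER BY random() LIMIT n` against CockroachDB.
fn select_random_batch(bad: &[u64], n: usize) -> Vec<u64> {
    let mut rng = rand::thread_rng();
    bad.choose_multiple(&mut rng, n).copied().collect()
}

fn main() {
    // Stand-in for the set of instances currently matching the "bad state"
    // query; fixing an instance removes it from this set.
    let mut bad_state: Vec<u64> = (0..100).collect();
    while !bad_state.is_empty() {
        // Each activation grabs a fresh random batch, so replicas running
        // the same loop are likely to pick disjoint batches.
        let batch = select_random_batch(&bad_state, 16);
        for id in batch {
            // "Fix" the instance: it leaves the bad state and won't be
            // selected by any future run of the query.
            bad_state.retain(|&x| x != id);
        }
    }
    println!("all instances repaired");
}
```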


hawkw commented Sep 12, 2024

@gjcolombo that's a good point: at least for instance_updater, instance_reincarnation, and abandoned_vmm_reaper, we don't actually need pagination at all, because successfully performing the operation on those instances makes them no longer match the query. That should be a pretty straightforward change.

That doesn't apply to the instance_watcher task, though, since it always attempts to health check every instance that Nexus knows about. I think we do still need pagination there.

hawkw added a commit that referenced this issue Sep 13, 2024
This commit reworks the instance-reincarnation task substantially to use
a randomized query order rather than pagination, as @gjcolombo described
in [this comment][1]. In addition, we now enforce a real limit on the
number of concurrently-running instance start sagas, which is presently
16, a number I picked completely arbitrarily.

This branch now depends on PR #6574, which --- it turns out --- is
actually a requirement for any kind of saga concurrency limiting in
background tasks that start a bunch of sagas. Whoopsie.

[1]: #6565 (comment)
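
For illustration, here is one hedged way to enforce such a saga concurrency limit using a `tokio` semaphore; this is a sketch of the general technique, not the actual mechanism from the referenced PRs:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Illustrative only: cap the number of concurrently-running start sagas.
// `start_saga_for` is a hypothetical stand-in for kicking off a saga.
const MAX_CONCURRENT_SAGAS: usize = 16;

async fn start_saga_for(instance_id: u64) {
    println!("saga for instance {instance_id} running");
}

#[tokio::main]
async fn main() {
    let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_SAGAS));
    let mut tasks = Vec::new();
    for instance_id in 0..100u64 {
        // Acquire an owned permit before spawning, so at most 16 sagas
        // are in flight at any one time.
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        tasks.push(tokio::spawn(async move {
            start_saga_for(instance_id).await;
            drop(permit); // release the slot when the saga completes
        }));
    }
    for t in tasks {
        t.await.unwrap();
    }
}
```
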
hawkw added a commit that referenced this issue Sep 16, 2024 (same commit message as above)