a number of Nexus RPWs operate on instances in a predictable order #6565
I'd love to hear what @gjcolombo, @davepacheco, and maybe @smklein think about this, when y'all have a moment --- is this worth trying to change?
It's possible that there are other Nexus RPWs that operate on other entities in a predictable order as well --- I'm personally only familiar with the instance-management subsystem, but I wonder if we're also operating on other things (e.g. disks) in the same order on every Nexus?
I agree this could be a problem, but I would probably wait to see it be a problem before doing anything about it. The impact seems likely to be small, and in deployed systems it seems unlikely to me that the Nexus instances' start times will be that aligned.

If this were a problem, I could see starting pagination at a random point and wrapping around. That should be fine as far as CockroachDB is concerned. It might require type-specific code in

I could also see staggering task activation times after startup. I think we probably want to activate everything once at startup, but for the next activation we could trigger it at a more random point in the window.
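The staggered-activation idea could look roughly like the following Rust sketch. Everything here is an assumption on my part, not code from Nexus: the name `jittered_delay`, the choice of window bounds, and the use of the system clock's sub-second nanoseconds as a cheap stand-in for a real RNG.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical sketch: instead of activating exactly `period` from now,
/// pick the next activation at a random-ish point in the second half of
/// the window, so replicas drift apart after their first activation.
fn jittered_delay(period: Duration) -> Duration {
    // Cheap per-call noise source; a real implementation would use an RNG.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos() as u64;
    let half = (period.as_millis() as u64) / 2;
    // Stay within [period/2, period) so tasks still run roughly on schedule.
    Duration::from_millis(half + nanos % half.max(1))
}
```

Each replica computes a different delay, so even replicas that started in lockstep stop activating at the same instant after the first round.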
I could very well be missing something, but I wonder if tasks like this need to paginate at all. At least for the instance tasks it seems like the idea is to (1) find some set of instances in an aberrant state (e.g. not heartbeated in a while, needs an update, is in the Failed state) and (2) do a thing to fix them. Once you've successfully done the thing, the instance isn't in the bad state anymore, so it will no longer appear in queries that list the instances in bad states. If all that is so, could the tasks just do a
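The find-and-fix loop described in this comment could be sketched as follows. This is a minimal in-memory illustration, not the actual Nexus code: the `Vec` stands in for the database, and `fix_all` and `batch` are names I made up.

```rust
/// Sketch of the "no pagination needed" idea: repeatedly fetch up to
/// `batch` items still in the bad state and fix them. Fixed items drop
/// out of the query's result set, so no cursor is required.
fn fix_all(bad: &mut Vec<u32>, batch: usize) -> usize {
    let mut fixed = 0;
    loop {
        // Stand-in for `SELECT id ... WHERE state = 'bad' LIMIT batch`.
        let page: Vec<u32> = bad.iter().copied().take(batch).collect();
        if page.is_empty() {
            break;
        }
        for id in page {
            // "Fixing" the instance removes it from the bad set, so it
            // won't appear in the next round's query.
            bad.retain(|&x| x != id);
            fixed += 1;
        }
    }
    fixed
}
```

The loop terminates precisely because the fix removes the entity from the query's predicate, which is the property the comment is pointing at.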
@gjcolombo that's a good point, at least for the

That doesn't apply for the
This commit reworks the instance-reincarnation task substantially to use a randomized query order rather than pagination, as @gjcolombo described in [this comment][1]. In addition, we now enforce a real limit on the number of concurrently-running instance start sagas, which is presently 16, a number I picked completely arbitrarily.

This branch now depends on PR #6574, which --- it turns out --- is actually a requirement for any kind of saga concurrency limiting in background tasks that start a bunch of sagas. Whoopsie.

[1]: #6565 (comment)
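A concurrency cap like the 16-saga limit mentioned above could be approximated as in the sketch below. This is my own wave-based illustration using plain threads, not the mechanism the branch (or PR #6574) actually uses, and `run_with_limit` is a hypothetical name; a real implementation would more likely use a semaphore so a slow saga doesn't hold up the whole wave.

```rust
use std::thread;

/// Illustrative cap on concurrent "sagas": run them in waves of at most
/// `limit` threads, joining each wave before starting the next one.
fn run_with_limit(ids: Vec<u32>, limit: usize) -> usize {
    let mut completed = 0;
    for wave in ids.chunks(limit.max(1)) {
        // Spawn at most `limit` workers at once.
        let handles: Vec<_> = wave
            .iter()
            .copied()
            .map(|id| thread::spawn(move || id)) // stand-in for a start saga
            .collect();
        // Wait for the whole wave before admitting more work.
        for h in handles {
            h.join().unwrap();
            completed += 1;
        }
    }
    completed
}
```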
Something that occurred to me whilst working on #6503 is that a number of Nexus' RPWs that operate on instances (`instance_watcher`, `instance_updater`, and the upcoming `instance_resurrection` task added in #6503) currently iterate over those instances with the same ordering for all Nexus replicas. This is due to the use of pagination in the database queries for instances matching certain states which feed those tasks' operations, which requires that the query be ordered by the column to paginate on.

Many of these background tasks currently perform work on instances/VMMs in the order that they are returned from the database, because they are written in a way that tries to avoid holding all of the instances/VMMs that match their queries in memory at the same time. Instead, they read paginated pages of data from the database, operate on the returned entries, and then move on to the next page. This means that when these tasks activate on multiple Nexuses, they will perform operations like instance health checking in the same order for every Nexus.
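The paginated scan described above can be illustrated with a small Rust sketch. This is an in-memory stand-in for the real database query, with hypothetical names; the point is that ordering by the paginated column makes every replica walk the same rows in the same order.

```rust
/// Sketch of a keyset-paginated scan: each page is "rows with id >=
/// cursor, ordered by id, limited to page_size". Because the order is
/// fixed by the pagination column, every caller visits rows identically.
fn paginated_scan(ids: &[u32], page_size: usize) -> Vec<u32> {
    let mut sorted = ids.to_vec();
    sorted.sort_unstable(); // pagination requires ordering by this column
    let mut visited = Vec::new();
    let mut cursor = 0u32;
    loop {
        // Stand-in for `SELECT ... WHERE id >= cursor ORDER BY id LIMIT page_size`.
        let page: Vec<u32> = sorted
            .iter()
            .copied()
            .filter(|&id| id >= cursor)
            .take(page_size)
            .collect();
        if page.is_empty() {
            break;
        }
        cursor = page.last().copied().unwrap() + 1;
        visited.extend(page);
    }
    visited
}
```

However the input is arranged, every replica running this scan visits the same ids in the same ascending order, which is exactly the predictability the issue is about.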
If these tasks operate on instances/VMMs in the same order on every Nexus replica, and the Nexuses activate their background tasks at around the same time,[^1] the likelihood of contention between the sagas (`instance-update`, in the case of the `instance_watcher` and `instance_updater` RPWs, and `instance-start`, in `instance_resurrection`) is higher. This results in duplicate work, like launching sagas that will then quickly fail because another saga has already performed the same operations. In addition, if a particular instance or VMM is in some way problematic, or operations on it take a really long time, it's possible for it to gum up every Nexus' RPW at the same time.

I'm not sure if any of this is a particularly urgent issue, but I wonder if it's worth trying to change these RPWs so that each Nexus replica is operating on instances/VMMs in a different order. Maybe there's some way for each Nexus' database queries to have a random start position in the table, but I'm not sure how that would interact with the existing pagination code, or even if there's a way to do that with CRDB at all. Another potential solution could be to change these tasks to instead read all matching instances/VMMs into a `HashMap` with a hasher using a random seed unique to that Nexus, and then operate on them in the order of iterating over that hashmap. This would nicely distribute work so that it's much less likely for two Nexuses to be operating in the same place, but it would also mean having to store all matching records in memory, rather than just the ones currently being operated on.[^2]
[^1]: Fortunately, it's pretty unlikely that they will be perfectly in sync, as the Nexuses probably won't all start at precisely the same time, and non-deterministic scheduling on the part of both Tokio and the OS scheduler adds additional noise.
[^2]: In some cases, that might be fine --- the `instance_watcher` task, for instance, already constructs a hashmap of all VMMs it health checked in order to record their metrics, so it already has O(instances) memory use.
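The `HashMap`-with-a-random-seed idea from the issue body can be sketched as follows. `randomized_order` is a hypothetical helper; it relies on the fact that the standard library's default `RandomState` hasher is seeded randomly per process, so two Nexus processes building the same map would iterate it in different orders without any extra seeding code.

```rust
use std::collections::HashMap;

/// Sketch: load the matching ids into a `HashMap` and take its iteration
/// order. The default `RandomState` hasher gets a fresh random seed per
/// process, so each Nexus visits the same set of ids in a different order.
fn randomized_order(ids: &[u32]) -> Vec<u32> {
    let map: HashMap<u32, ()> = ids.iter().map(|&id| (id, ())).collect();
    map.into_keys().collect()
}
```

The trade-off noted in the issue applies here too: the whole matching set is materialized in memory before any work starts, rather than one page at a time.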