a number of Nexus RPWs operate on instances in a predictable order #6565
I'd love to hear what @gjcolombo, @davepacheco, and maybe @smklein think about this, when y'all have a moment --- is this worth trying to change?
It's possible that there are other Nexus RPWs that operate on other entities in a predictable order as well --- I'm personally only familiar with the instance-management subsystem, but I wonder if we're also operating on other things (e.g. disks) in the same order on every Nexus?
I agree this could be a problem, but I would probably wait to see it be a problem before doing anything about it. The impact seems likely to be small, and in deployed systems it seems unlikely to me that the Nexus instances' start times will be that aligned.

If this were a problem, I could see starting pagination at a random point and wrapping around. That should be fine as far as CockroachDB is concerned. It might require type-specific code in

I could also see staggering task activation times after startup. I think we probably want to activate everything once at startup, but for the next activation we could trigger it at a more random point in the window.
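The staggered-activation idea could look roughly like the following Rust sketch. Everything here is an assumption on my part, not code from Nexus: the name `jittered_delay`, the choice of window bounds, and the use of the system clock's sub-second nanoseconds as a cheap stand-in for a real RNG.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical sketch: instead of activating exactly `period` from now,
/// pick the next activation at a random-ish point in the second half of
/// the window, so replicas drift apart after their first activation.
fn jittered_delay(period: Duration) -> Duration {
    // Cheap per-call noise source; a real implementation would use an RNG.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos() as u64;
    let half = (period.as_millis() as u64) / 2;
    // Stay within [period/2, period) so tasks still run roughly on schedule.
    Duration::from_millis(half + nanos % half.max(1))
}
```

Each replica computes a different delay, so even replicas that started in lockstep stop activating at the same instant after the first round.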
I could very well be missing something, but I wonder if tasks like this need to paginate at all. At least for the instance tasks it seems like the idea is to (1) find some set of instances in an aberrant state (e.g. not heartbeated in a while, needs an update, is in the Failed state) and (2) do a thing to fix them. Once you've successfully done the thing, the instance isn't in the bad state anymore, so it will no longer appear in queries that list the instances in bad states. If all that is so, could the tasks just do a
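The find-and-fix loop described in this comment could be sketched as follows. This is a minimal in-memory illustration, not the actual Nexus code: the `Vec` stands in for the database, and `fix_all` and `batch` are names I made up.

```rust
/// Sketch of the "no pagination needed" idea: repeatedly fetch up to
/// `batch` items still in the bad state and fix them. Fixed items drop
/// out of the query's result set, so no cursor is required.
fn fix_all(bad: &mut Vec<u32>, batch: usize) -> usize {
    let mut fixed = 0;
    loop {
        // Stand-in for `SELECT id ... WHERE state = 'bad' LIMIT batch`.
        let page: Vec<u32> = bad.iter().copied().take(batch).collect();
        if page.is_empty() {
            break;
        }
        for id in page {
            // "Fixing" the instance removes it from the bad set, so it
            // won't appear in the next round's query.
            bad.retain(|&x| x != id);
            fixed += 1;
        }
    }
    fixed
}
```

The loop terminates precisely because the fix removes the entity from the query's predicate, which is the property the comment is pointing at.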
@gjcolombo that's a good point, at least for the

That doesn't apply for the
This commit reworks the instance-reincarnation task substantially to use a randomized query order rather than pagination, as @gjcolombo described in [this comment][1]. In addition, we now enforce a real limit on the number of concurrently-running instance start sagas, which is presently 16, a number I picked completely arbitrarily.

This branch now depends on PR #6574, which --- it turns out --- is actually a requirement for any kind of saga concurrency limiting in background tasks that start a bunch of sagas. Whoopsie.

[1]: #6565 (comment)
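A concurrency cap like the 16-saga limit mentioned above could be approximated as in the sketch below. This is my own wave-based illustration using plain threads, not the mechanism the branch (or PR #6574) actually uses, and `run_with_limit` is a hypothetical name; a real implementation would more likely use a semaphore so a slow saga doesn't hold up the whole wave.

```rust
use std::thread;

/// Illustrative cap on concurrent "sagas": run them in waves of at most
/// `limit` threads, joining each wave before starting the next one.
fn run_with_limit(ids: Vec<u32>, limit: usize) -> usize {
    let mut completed = 0;
    for wave in ids.chunks(limit.max(1)) {
        // Spawn at most `limit` workers at once.
        let handles: Vec<_> = wave
            .iter()
            .copied()
            .map(|id| thread::spawn(move || id)) // stand-in for a start saga
            .collect();
        // Wait for the whole wave before admitting more work.
        for h in handles {
            h.join().unwrap();
            completed += 1;
        }
    }
    completed
}
```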
Something that occurred to me whilst working on #6503 is that a number of Nexus' RPWs that operate on instances (`instance_watcher`, `instance_updater`, and the upcoming `instance_resurrection` task added in #6503) currently iterate over those instances with the same ordering for all Nexus replicas. This is due to the use of pagination in the database queries for instances matching certain states which feed those tasks' operations, which requires that the query be ordered by the column to paginate on.

Many of these background tasks currently perform work on instances/VMMs in the order that they are returned from the database, because they are written in a way that tries to avoid holding all of the instances/VMMs that match their queries in memory at the same time. Instead, they read paginated pages of data from the database, operate on the returned entries, and then move on to the next page. This means that when these tasks activate on multiple Nexuses, they will perform operations like instance health checking in the same order for every Nexus.
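The paginated scan described above can be illustrated with a small Rust sketch. This is an in-memory stand-in for the real database query, with hypothetical names; the point is that ordering by the paginated column makes every replica walk the same rows in the same order.

```rust
/// Sketch of a keyset-paginated scan: each page is "rows with id >=
/// cursor, ordered by id, limited to page_size". Because the order is
/// fixed by the pagination column, every caller visits rows identically.
fn paginated_scan(ids: &[u32], page_size: usize) -> Vec<u32> {
    let mut sorted = ids.to_vec();
    sorted.sort_unstable(); // pagination requires ordering by this column
    let mut visited = Vec::new();
    let mut cursor = 0u32;
    loop {
        // Stand-in for `SELECT ... WHERE id >= cursor ORDER BY id LIMIT page_size`.
        let page: Vec<u32> = sorted
            .iter()
            .copied()
            .filter(|&id| id >= cursor)
            .take(page_size)
            .collect();
        if page.is_empty() {
            break;
        }
        cursor = page.last().copied().unwrap() + 1;
        visited.extend(page);
    }
    visited
}
```

However the input is arranged, every replica running this scan visits the same ids in the same ascending order, which is exactly the predictability the issue is about.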
If these tasks operate on instances/VMMs in the same order on every Nexus replica, and the Nexuses activate their background tasks at around the same time,[^1] the likelihood of contention between the sagas (`instance-update`, in the case of the `instance_watcher` and `instance_updater` RPWs, and `instance-start`, in `instance_resurrection`) is higher. This results in duplicate work, like launching sagas that will then quickly fail because another saga has already performed the same operations. In addition, if a particular instance or VMM is in some way problematic, or operations on it take a really long time, it's possible for it to gum up every Nexus' RPW at the same time.

I'm not sure if any of this is a particularly urgent issue, but I wonder if it's worth trying to change these RPWs so that each Nexus replica is operating on instances/VMMs in a different order. Maybe there's some way for each Nexus' database queries to have a random start position in the table, but I'm not sure how that would interact with the existing pagination code, or even if there's a way to do that with CRDB at all. Another potential solution could be to change these tasks to instead read all matching instances/VMMs into a `HashMap` with a hasher using a random seed unique to that Nexus, and then operate on them in the order of iterating over that hashmap. This would nicely distribute work so that it's much less likely for two Nexuses to be operating in the same place, but it would also mean having to store all matching records in memory, rather than just the ones currently being operated on.[^2]
[^1]: Fortunately, it's pretty unlikely that they will be perfectly in sync, as the Nexuses probably won't all start at precisely the same time, and non-deterministic scheduling on the part of both Tokio and the OS scheduler adds additional noise.
[^2]: In some cases, that might be fine --- the `instance_watcher` task, for instance, already constructs a hashmap of all VMMs it health checked in order to record their metrics, so it already has O(instances) memory use.
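The `HashMap`-with-a-random-seed idea from the issue body can be sketched as follows. `randomized_order` is a hypothetical helper; it relies on the fact that the standard library's default `RandomState` hasher is seeded randomly per process, so two Nexus processes building the same map would iterate it in different orders without any extra seeding code.

```rust
use std::collections::HashMap;

/// Sketch: load the matching ids into a `HashMap` and take its iteration
/// order. The default `RandomState` hasher gets a fresh random seed per
/// process, so each Nexus visits the same set of ids in a different order.
fn randomized_order(ids: &[u32]) -> Vec<u32> {
    let map: HashMap<u32, ()> = ids.iter().map(|&id| (id, ())).collect();
    map.into_keys().collect()
}
```

The trade-off noted in the issue applies here too: the whole matching set is materialized in memory before any work starts, rather than one page at a time.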