crdb memory budget exceeded error when creating a large number of instances concurrently #5904
The omicron commit deployed in the environment I used (rack2) was …. The last time I ran the same terraform test was on commit ….
I suspect this subquery is the (or part of the) problem:

```sql
(
SELECT
$40 + "shift" AS "mac"
FROM
(
SELECT
generate_series(0, $41) AS "index",
generate_series(0, $42) AS "shift"
UNION ALL
SELECT
generate_series($43, $44) AS "index",
generate_series($45, -1) AS "shift"
)
LEFT OUTER JOIN "network_interface" ON (
"vpc_id", "mac", "time_deleted" IS NULL
) = ($46, $47 + "shift", TRUE)
WHERE
"mac" IS NULL
ORDER BY
"index"
LIMIT
1
) AS "mac", After substituting bind parameters, this becomes: SELECT
184993468409456 + "shift" AS "mac"
FROM
(
SELECT
generate_series(0, 432527) AS "index",
generate_series(0, 432527) AS "shift"
UNION ALL
SELECT
generate_series(432528, 983039) AS "index",
generate_series(-550512, -1) AS "shift"
)
LEFT OUTER JOIN "network_interface" ON (
"vpc_id", "mac", "time_deleted" IS NULL
) = ('91a91cab-2fbd-4c1e-a91f-4f2bcae5705b', 184993468409456 + "shift", TRUE)
WHERE
"mac" IS NULL
ORDER BY
"index"
LIMIT
1
```

which has some pretty large …
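For scale: the `UNION ALL` above materializes every candidate MAC before the join even runs -- 432,528 rows from the first branch plus 550,512 from the second, or 983,040 in total. A quick way to confirm that row count (plain SQL, mirroring the alias-free subquery style CRDB accepts in the query above):

```sql
-- Counts the candidate rows the next-MAC subquery buffers in memory.
-- 432,528 + 550,512 = 983,040 rows, all materialized before the join.
SELECT count(*)
FROM (
    SELECT generate_series(0, 432527) AS "shift"
    UNION ALL
    SELECT generate_series(-550512, -1)
);
```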
Note the actual memory usage was 25 MiB and the estimated max memory was 50 MiB. Our total query memory budget is only ~128 MiB IIRC, so a few of these running concurrently would max us out -- three such queries, each reserving its 50 MiB estimated maximum, would already exceed the budget. The …
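(For anyone reproducing these numbers: CockroachDB's `EXPLAIN ANALYZE` reports a "maximum memory usage" figure per statement. A minimal sketch against a series of the same size -- not the exact statement above:)

```sql
-- EXPLAIN ANALYZE prints execution statistics, including maximum memory
-- usage, for the statement it wraps. Here it shows the cost of buffering
-- a ~983k-row series like the one in the next-MAC subquery.
EXPLAIN ANALYZE
SELECT count(*) FROM (SELECT generate_series(0, 983039));
```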
@jgallagher @bnaecker - Thank you for digging into this. As the logic for locating an available MAC address hasn't changed lately (or for a long time), it is possible that, because we have a lot more background processes running and making database queries, the expensive subquery has become more prone to hitting the memory budget limit. I wonder if we need to re-evaluate the memory budget allocation besides tuning this particular subquery. cc @smklein
I've been mulling on this for a bit. The next-item style queries all work the same way: join the existing entries with all possible entries, and take the first row where the existing entry is NULL. That is, select the first possible entry that's not already in the table. This is fine, but obviously expensive, because we form the full set of all possible values. For things like MACs this is expensive, but for IPv6 addresses it will be literally impossible. In theory it should be possible to do this with a self-join instead, between the current set of values and the "next one", e.g., …
I am trying to do another round of concurrent VM provisioning testing on the colo with R10, but I am hitting this issue much more frequently. I'm not sure if that is related to the release itself or if it is due to having a lot more records with …. In previous R9 testing I was able to reach about 250 VM creations (with a concurrency of 25), but currently I am not able to get past 75, outside a single run where I reduced concurrency to 10. Though that may have just been luck? Further attempts have failed. Is there anything we want to test around the size of the …
Anecdotal follow-up, but I am starting to see this when provisioning a single VM from the web console now as well. I'm not sure if it is exactly related to that subquery, though.
OK, looking into this more, I think it is a different query, as this looks related to the saga executor state: …
It is still an exhaustion during instance create, though.
Mentioning this because I misread that output at first: what you're seeing there are SQL INSERTs that are storing errors that happened during some previous query's execution. The errors didn't (necessarily) come from that INSERT. The errors would have come from some query that was executed during the execution of that saga node (e.g., saga id e6b2c0c3-e4ca-4623-b112-1ab6af432e03, node 5 -- it looks like node 5 for all of them, so I'm guessing it's the same kind of saga). It looks like they're trying to allocate ~4 MiB, but there's already ~130 MiB allocated out of a pool of about that size. If I recall correctly, sometimes it's not the query that produces the error that's actually the problem. 4 MiB is larger than I'd expect, but it's also not that much memory. I expect the CockroachDB docs have more information about tracking this down, but I haven't had a chance to look.
Thanks! That certainly helps me start sorting out some confusion I had around this. Let me cast a wider net on the queries I'm looking at. Is there any reason we can think of for there to be an affinity between memory exhaustion and specific silos? I have a test silo that I am operating in, and while I have no conclusive evidence, once I trigger the exhaustion, that silo seems to continually trigger it for minutes at a time. Meanwhile the …
There's an overview of managing this problem in the CockroachDB docs: …. There's also a bunch more information here: …. I imagine we'll want to characterize how the memory is being used when this happens, to decide whether there are problematic queries or we just need to give CockroachDB more memory.
The only thing that immediately comes to mind is if the Silo has different IP pools or other resources whose queries are more expensive for some reason? It's conceivable.
This could be #5651?
It is likely related, I think. I was testing against two different IP pools, and the results were similar. One pool was a /20 and the other a /22 (i.e., 4,096 vs. 1,024 addresses, so the candidate set the next-address query has to build scales accordingly). Going back to the affinity thing: I need to do more testing, but I suspect it may be project/VPC/subnet related. I was not able to drill down past that. I'll add instructions to run, but https://github.com/oxidecomputer/up-down is what I was using for testing.
The VPC subnet IP range also matters. I run into the CRDB memory budget error consistently, approx. 5 out of 120 instances provisioned with Terraform on the default …. To be clear, this can't be a workaround for customers; I'm just identifying the "offenders".
I'm pretty sure the queries allocating the next available IP in the VPC Subnet are to blame. Those don't have a hard size limit other than the size of the subnet itself. I wrote a bit about this earlier in the thread, and can take it on in earnest next week. It seems like this is high priority, and blocking further work.
I'm starting to look at this in earnest, and there seem to be two approaches we could take to reducing the memory consumption here: a self-join, or a limit-and-retry scheme. Details on each below.
Note that the self-join thing would have to be done for both the MAC and IP addresses, since both use the ….

Self-join

The query for a self-join would look like this: …
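(The query text was lost here; what follows is a sketch of the self-join idea, not necessarily the exact query from this comment. The vpc_id is the one from the bind-parameter dump earlier in the thread; handling of the empty-table case and the range boundaries is deliberately omitted.)

```sql
-- Sketch of a self-joined next-item query: for each allocated MAC, probe
-- whether mac + 1 is also allocated; the smallest unallocated successor wins.
-- A real version must also handle an empty table (start of the range) and
-- the end of the MAC range.
SELECT nic."mac" + 1 AS "next_mac"
FROM "network_interface" AS nic
LEFT OUTER JOIN "network_interface" AS next_nic
    ON next_nic."vpc_id" = nic."vpc_id"
    AND next_nic."mac" = nic."mac" + 1
    AND next_nic."time_deleted" IS NULL
WHERE nic."vpc_id" = '91a91cab-2fbd-4c1e-a91f-4f2bcae5705b'
    AND nic."time_deleted" IS NULL
    AND next_nic."mac" IS NULL
ORDER BY "next_mac"
LIMIT 1;
```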
That takes the next MAC (…). In this particular case, the query consumes far less memory than the one using …:
So that's 50 KiB, vs. around 25 MiB for the one using …. This does change the next-item query a lot, and so is probably a good bit of work. I'm also less sure of the memory consumption profile.

Limit and retry

In this approach, we'd add some mechanism for creating an iterator over chunks of the search space, and then loop the query until we either (1) find an entry or (2) exhaust the search space. This is how we do the existing VNI allocation, using the … (omicron/nexus/db-queries/src/db/queries/vpc.rs, lines 324 to 340 at ba49693).
We've successfully used this approach already, which is kind of nice. But it does involve more machinery, since we need to add an iterator for both the MAC and IP addresses, and then figure out which one to advance based on how the query fails. It also requires picking an arbitrary size for these ranges, which isn't needed for the self-join approach. A sketch of a single chunk query is below.
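(A sketch of what one iteration might look like, reusing the shape of the query from earlier in the thread; the 1024-value chunk size is an arbitrary placeholder:)

```sql
-- One iteration of limit-and-retry: scan a fixed 1024-value chunk of the
-- MAC space rather than the full ~983k-row series. The caller loops,
-- advancing the base offset by the chunk size, until a row is returned
-- or the whole space has been searched.
SELECT 184993468409456 + "shift" AS "mac"
FROM (SELECT generate_series(0, 1023) AS "shift")
LEFT OUTER JOIN "network_interface" ON (
    "vpc_id", "mac", "time_deleted" IS NULL
) = ('91a91cab-2fbd-4c1e-a91f-4f2bcae5705b', 184993468409456 + "shift", TRUE)
WHERE "mac" IS NULL
ORDER BY "shift"
LIMIT 1;
```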
Is it easier to write this in SQL than by modifying the next-item query? If so, maybe use the escape hatch.
The next-item query does already dip into raw SQL, though using Diesel's …
- Adds the `NextItemSelfJoined` query, which implements the next-item query as a self-join. The current implementation joins the target table with the series of all possible values for the target item. Because CRDB eagerly evaluates subqueries, this entire list must be buffered in memory. The self-join version instead joins the existing table with the literal next item (item + 1), and finds the lowest one that's not allocated yet. Some initial testing suggests this can consume much less memory and run much faster than the previous query implementations. As a tradeoff, this new query cannot be used to randomly select _any_ item in a range -- only the next, lowest value can be selected. Note that this changes the way we allocate MAC addresses, which were previously random.
- Fixes #5904
While creating 70 instances concurrently using the same Terraform plan I had used many times before without any problem, I'm now hitting a number of 500 errors in every run, consistently. The errors in the nexus log all look like the one below: …
Using this D script against the nexus process running in sled 8, I was able to capture the database error in question: …
It is, however, an insert statement, and didn't appear to be the culprit. I had a look at 3 different occurrences of the 500 errors and tried to pinpoint the more expensive queries running around the same time. Here is one of those (latency = 346542 us): …
(A grep of this query pattern shows the latency was as high as 1523657 us during the window when I was running `terraform apply` and tracing the database queries.) The complete nexus log and db query tracing output are located in catacomb:/staff/dogfood/omicron-5904.