Memory budget exceeded error when attempting to create a new vpc #4283
This is a CockroachDB error, indicating that it's at risk of crashing due to OOM.
I haven't seen this before but we haven't done much tuning of CockroachDB and I don't remember this coming up in the stress testing we did back in 2020. (But those were relatively simple queries. It sounds like memory usage here depends on the complexity of the queries.) I found some information here:
@askfongjojo I assume you're referring to this quote from that second link:
I think that's a Linux-ism and does not apply to us. (Helios avoids overcommitting memory rather than terminating processes when memory gets tight.) It's quite possible that their default value for --max-sql-memory of "25%" [of what? what CockroachDB thinks the machine has?] is too low for our deployment. In calculating a default, they may be assuming lots of overhead for other parts of the system. Linux's behavior makes it particularly risky for them not to do this, since getting OOM-killed has a major impact. But if CockroachDB in our system is looking at the memory available to the zone, and that's only the slice of the system we've carved off for CockroachDB, then we should probably tune this way up.

The big question is really: are we spending too much memory, or have we just tuned it too tightly? To answer that, I'm wondering:
(not asking anybody to dig into these -- though that would be helpful! -- just writing these down)
Thanks dap for taking a look. I've been trying to look for any useful log messages on the CRDB side but haven't figured out the exact query triggering the error. That would help answer the question of whether it's reasonable for it to require so much memory. Meanwhile, I was only able to get some heap usage reports from the database health logs to get a sense of the consumption. I'll keep digging through the logs a bit more.

Sled 9
Sled 10
Sled 14
Sled 16
Sled 17
If you are able to generate an API request which triggers the error, then it should be possible to use DTrace to extract the SQL queries that result from it. There will be many SQL queries from that one API call, but the last one will probably be the relevant one, though I'm not positive about that. I can also help narrow down the queries further if that's useful.
I have not traced the running system, but based on the error message, it seems we were in the VNI allocation query.
An alternative would be to restructure the query, although it doesn't look like the underlying CRDB behavior of storing an entire subquery in memory has been improved. Another alternative is to implement the "freelist"-style query that @davepacheco and I have talked about for a long time. It's not clear how that would work exactly, but the gist is: store free ranges in some table; select the next item from one of those ranges, reducing its size by 1; use the now-allocated value in the desired insertion query.
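The freelist idea hasn't been designed in detail, but here's a minimal in-memory sketch of the gist (plain Rust, with hypothetical names, standing in for what would really be a table of free ranges in the database):

```rust
// Hypothetical sketch of the "freelist" allocation idea: free values are
// stored as ranges, and allocating shrinks one range by a single element.
// In the real system these ranges would live in a database table and the
// shrink would happen inside the insertion query's transaction.

/// A half-open range [start, end) of free values.
#[derive(Debug, Clone, PartialEq)]
struct FreeRange {
    start: u32,
    end: u32,
}

/// Allocate one value from the freelist, shrinking (or removing) the
/// range it came from. Returns `None` when every range is exhausted.
fn allocate(freelist: &mut Vec<FreeRange>) -> Option<u32> {
    let range = freelist.last_mut()?;
    let value = range.start;
    range.start += 1;
    if range.start == range.end {
        freelist.pop();
    }
    Some(value)
}

fn main() {
    let mut freelist = vec![FreeRange { start: 100, end: 102 }];
    assert_eq!(allocate(&mut freelist), Some(100));
    assert_eq!(allocate(&mut freelist), Some(101));
    // Both values consumed; the freelist is now empty.
    assert_eq!(allocate(&mut freelist), None);
    println!("freelist exhausted as expected");
}
```

The appeal is that allocation cost is proportional to the number of free ranges, not the size of the value space, so nothing like the full 24-bit VNI range ever needs to be materialized in memory.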
I'm tracing the connection probes we have inside Nexus, and trying to create a project. I get what looks like the same 500 error described here, and the probes show this:
So it does appear that the VNI allocation query is the one exceeding the memory budget.
Running the query by hand:
That takes many seconds to complete, and then spews to the console. What we're really doing here is generating a sequence of all possible VNIs and then filtering out those that are already allocated.
So the first VNI we chose (randomly by the application) is actually available. I think we should probably implement the quick fix of trying only a small range, with a few application-level retries.
For some more context, here are the VNIs we've currently allocated. There ain't much there, so the query that just picked the random VNI directly would have succeeded fine:
VNIs are 24-bit numbers, so we've allocated basically none of the available range. For completeness though, let's consider retrying at the application level, using a limited search range.
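As a sketch of that quick fix (hypothetical names; in the real system the window search runs as a database query, not an in-memory scan):

```rust
// Sketch of application-level retries for VNI allocation. Each attempt
// searches only a small window of candidates, rather than materializing
// the entire 24-bit VNI space in one query.

const MAX_VNI: u32 = (1 << 24) - 1;
const SEARCH_WINDOW: u32 = 2048; // small limit on each attempt (assumed size)
const MAX_RETRIES: usize = 3;

/// Stand-in for the database query: find the first VNI in
/// [start, start + SEARCH_WINDOW) that is not already allocated.
fn find_free_in_window(allocated: &[u32], start: u32) -> Option<u32> {
    (start..start.saturating_add(SEARCH_WINDOW).min(MAX_VNI + 1))
        .find(|vni| !allocated.contains(vni))
}

/// Try a few randomly-chosen window starts, giving up after MAX_RETRIES.
fn allocate_vni(allocated: &[u32], starts: impl Iterator<Item = u32>) -> Option<u32> {
    starts
        .take(MAX_RETRIES)
        .find_map(|s| find_free_in_window(allocated, s))
}

fn main() {
    // The first window [0, 2048) is fully allocated, so the first attempt
    // fails; the retry at a fresh start succeeds.
    let allocated: Vec<u32> = (0..SEARCH_WINDOW).collect();
    let vni = allocate_vni(&allocated, [0, SEARCH_WINDOW].into_iter());
    assert_eq!(vni, Some(SEARCH_WINDOW));
    println!("allocated VNI {:?}", vni);
}
```

The trade-off is that with a bounded number of retries there is a small chance of reporting exhaustion when free VNIs still exist elsewhere in the range.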
- Fixes #4283.
- Adds a relatively small limit to the `NextItem` query used for finding a free VNI during VPC creation. This limits the memory consumption to something very reasonable, but is big enough that we should be extremely unlikely to find _no_ available VNIs in the range.
- Adds an application-level retry loop when inserting _customer_ VPCs, which catches the unlikely event that there really are no VNIs available, and retries a few times.
- Adds tests for the computation of the limited search range.
- Adds tests for the actual exhaustion-detection and retry behavior.
Review feedback:
- Add const-assert that VNI search range is valid
- Rename `Vpc::with_random_vni()`

Search entire VNI range in chunks:
- Remove `IncompleteVpc::with_random_vni()`
- Add iterator to search ranges of VNIs sequentially, yielding all range starts in the valid guest VNI range from a random starting VNI.
- Instead of a limited retry loop when creating VPCs, search until the iterator is exhausted and we've searched the whole range.

Throw in some DTrace probes

fmt
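The chunked search described above could be sketched like this (names and constants are assumptions, including the lower bound of the guest VNI range; this is an illustration, not the actual implementation):

```rust
// Sketch of an iterator yielding the start of each fixed-size chunk of
// the guest VNI space, beginning at the chunk containing an arbitrary
// starting VNI and wrapping around, so every chunk is visited exactly
// once before the iterator is exhausted.

const MIN_GUEST_VNI: u32 = 1024;    // assumption: low VNIs are reserved
const MAX_VNI: u32 = (1 << 24) - 1; // VNIs are 24-bit numbers
const CHUNK: u32 = 2048;            // assumed per-query search limit

/// Yields chunk-aligned starting points covering [MIN_GUEST_VNI, MAX_VNI],
/// beginning at the chunk containing `first` and wrapping around.
fn search_starts(first: u32) -> impl Iterator<Item = u32> {
    let span = MAX_VNI - MIN_GUEST_VNI + 1;
    let n_chunks = (span + CHUNK - 1) / CHUNK; // ceiling division
    let first_chunk = (first.clamp(MIN_GUEST_VNI, MAX_VNI) - MIN_GUEST_VNI) / CHUNK;
    (0..n_chunks).map(move |i| MIN_GUEST_VNI + ((first_chunk + i) % n_chunks) * CHUNK)
}

fn main() {
    let starts: Vec<u32> = search_starts(MIN_GUEST_VNI + 3 * CHUNK).collect();
    // Begins at the chunk containing the requested starting VNI...
    assert_eq!(starts[0], MIN_GUEST_VNI + 3 * CHUNK);
    // ...and visits every chunk exactly once before stopping.
    let n_chunks = ((MAX_VNI - MIN_GUEST_VNI + 1 + CHUNK - 1) / CHUNK) as usize;
    assert_eq!(starts.len(), n_chunks);
    println!("searched {} chunks", starts.len());
}
```

Driving the per-chunk `NextItem` query with an iterator like this keeps each query's memory bounded while still guaranteeing that the whole range is searched before reporting exhaustion.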
The error was encountered on rack2 when I tried to create a new project:
Since the error was related to VPCs, I then tried to create a new VPC in an existing project and hit the same error.