feat(205/go-app): use automaxprocs to autoscale to CPU limit of container #278
I've found a performance issue that occurs when the GOMAXPROCS value doesn't match the number of CPU cores actually available to the process.
GOMAXPROCS is the number of threads allowed to run Go code at the same time, so by default the Go runtime sets it to the available number of CPU cores.
If GOMAXPROCS is less than the number of available cores, you lose parallelism, since Go won't schedule work on the extra cores. Setting it too high is also a problem: the Go runtime can switch between goroutines much faster than the OS can switch between threads, so running more threads than cores just adds OS-level context switching, including migration across cores.
I suspect this could happen during tests in the cloud: we set a CPU limit of 2 cores, but machines like m6a.2xlarge have 8 vCPUs, and I'm not sure whether the Kubernetes CPU limit affects the number of cores the Go runtime sees when it starts up. Running with GODEBUG=gctrace=1 set will print GC stats on every collection; the final field is the current number of Ps (procs).
Solving it is simple: Uber made a library, automaxprocs, that looks at the container's CPU limit and sets GOMAXPROCS accordingly. If you set the GOMAXPROCS environment variable manually, the library still respects it, which leaves the option to override it by hand, or to fall back to it when the limit can't be detected at runtime.
Notably, in my heavy-load wrk tests, having GOMAXPROCS be 2 when there are only 2 CPUs is a bigger performance win than letting the GC use 3x as much memory.
Using `wrk -d30s -t12 -c200` and limiting the server to 2 CPUs with `docker run --cpus 2`: the better scheduling / reduced contention from matching the core count cuts latency by 66% compared to improving GC performance by giving it more RAM. Tuning GOGC still helps a noticeable amount.