
feat(205/go-app): use automaxprocs to autoscale to CPU limit of container #278

Merged
1 commit merged into antonputra:main on Sep 18, 2024

Conversation

cookieo9 (Contributor)

I have detected an issue when the GOMAXPROCS value doesn't match the number of available CPU cores.

GOMAXPROCS is the number of OS threads allowed to run Go code at the same time; by default the Go runtime sets it to the number of available CPU cores.

If GOMAXPROCS is lower than the number of available cores, you lose parallelism, since Go won't use all of them. Setting it too high is also a problem: the Go runtime can switch goroutines much faster than the OS can switch threads, so the extra threads just add OS-level context switching, including shuffling work across cores.

I suspect this could be happening during the tests in the cloud: we set a CPU limit of 2 cores, but machines like m6a.2xlarge have 8 vCPUs, and I'm not sure whether the Kubernetes limit affects the number of cores the Go runtime sees when it starts up. Running with GODEBUG=gctrace=1 set will print GC stats, the final field being the current number of procs.
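
As a quick sanity check (a minimal sketch, not code from this PR), the same values can be printed directly from the runtime at startup:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// NumCPU reports the CPUs the process can see; GOMAXPROCS(0)
	// reads the current setting without modifying it.
	fmt.Println("NumCPU:    ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```

On a node with 8 vCPUs and a 2-core limit, seeing GOMAXPROCS report 8 would confirm the problem described above.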

Solving it is simple: Uber made a library, go.uber.org/automaxprocs, that looks at the container's CPU limit and sets GOMAXPROCS accordingly. If you set the environment variable manually, the library still respects it, leaving the option to change it by hand if that is desired, or if the library can't figure the limit out at runtime.
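
For reference, the usual way to wire the library in is a blank import in the main package (a generic sketch of the documented usage, not necessarily the exact change in this PR):

```go
package main

import (
	"fmt"
	"runtime"

	// The blank import runs the package's init(), which reads the
	// container's cgroup CPU quota and adjusts GOMAXPROCS to match.
	// A manually set GOMAXPROCS environment variable is left untouched.
	_ "go.uber.org/automaxprocs"
)

func main() {
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```

If more control is needed (custom logging, for example), the go.uber.org/automaxprocs/maxprocs subpackage exposes the same behavior as an explicit maxprocs.Set() call.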

Notably, setting GOMAXPROCS to 2 when there are only 2 CPUs gives a bigger performance win than letting the GC use 3x as much memory (GOGC=300) in my heavy-load wrk tests.

Using `wrk -d30s -t12 -c200` and limiting the server to 2 CPUs with `docker run --cpus 2`:

```
Thread Stats   Avg      Stdev     Max   +/- Stdev

  CORES=2 GOMAXPROCS=8 GOGC=100
    Latency    20.72ms   25.56ms 105.36ms   79.27%
    Req/Sec     4.13k   609.55    19.03k    82.90%

  CORES=2 GOMAXPROCS=8 GOGC=300
    Latency    19.58ms   24.58ms  97.19ms   79.73%
    Req/Sec     4.51k   617.96    17.13k    79.74%

  CORES=2 GOMAXPROCS=2 GOGC=100
    Latency     6.53ms    2.37ms  24.92ms   70.40%
    Req/Sec     5.08k   690.78    43.98k    99.53%

  CORES=2 GOMAXPROCS=2 GOGC=300
    Latency     6.14ms    2.05ms  19.86ms   69.17%
    Req/Sec     5.40k   817.94    50.31k    99.89%
```

The better scheduling and reduced contention from matching the core count cuts latency by about 66% (20.7 ms down to 6.5 ms average), a far bigger win than improving GC performance by giving it more RAM. Raising GOGC still helps a noticeable amount on top of that.

cookieo9 (Contributor, Author)

Just to be clear, I'm worried that on a VM with 8 vCPUs, Go is using the default GOMAXPROCS value of 8 (running 8 threads) even when Kubernetes limits the container to 2 cores. Running with GODEBUG=gctrace=1 set will confirm this (as it did in my local Docker tests).

Using automaxprocs is easy and should fix this if it's happening; there definitely seems to be a win from running with an appropriate number of OS threads.

Also of note: other languages/frameworks that use green/user threads may need similar tuning. If they think they have 8 cores to play with, they should be told not to use that many threads, otherwise they may constantly shuffle work between cores at the OS level.

antonputra (Owner)

@cookieo9 Thanks. When I was trying to run Actix in Rust on an m7a.large instance, I discovered that instead of setting the number of threads to match the number of cores, it actually matches the number of physical processors. In the case of m7a.large, 2 vCPUs correspond to 1 physical processor (1 processor, 2 cores) - https://browser.geekbench.com/v5/cpu/22475411.

antonputra merged commit 0841aa2 into antonputra:main on Sep 18, 2024
cookieo9 deleted the go-automaxprocs branch on Sep 19, 2024