I've run into an issue that occurs when the GOMAXPROCS value doesn't match the number of available CPU cores.
GOMAXPROCS is the maximum number of OS threads allowed to execute Go code simultaneously; by default the Go runtime sets it to the number of available CPU cores.
If GOMAXPROCS is lower than the number of CPU cores, you lose parallelism, since Go won't use all the available cores. Setting it too high is also a problem: the runtime creates more runnable threads than there are cores, and the OS must context-switch between them, which is much slower than the Go scheduler swapping goroutines onto an existing thread.
I suspect this happens during tests in the cloud: we set a CPU limit of 2 cores, but machines like m6a.2xlarge have 8 vCPUs, and I'm not sure whether the Kubernetes limit affects the number of cores the Go runtime sees when it starts up. Running with GODEBUG=gctrace=1 set prints GC stats after each collection, the final field being the current number of Ps.
Solving it is simple: Uber made a library that reads the container's CPU limit and sets GOMAXPROCS accordingly. If you set the GOMAXPROCS environment variable yourself, the library respects it, so you keep the option of overriding it manually.
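The library is go.uber.org/automaxprocs, and importing it for its side effect (`import _ "go.uber.org/automaxprocs"` in `main`) is all it takes. The underlying idea can be sketched for cgroup v2 as follows; `parseCPUMax` and the single file path are simplified assumptions of mine, while the real library also handles cgroup v1, fractional quotas, and minimum values:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// parseCPUMax converts a cgroup v2 cpu.max value ("<quota> <period>",
// e.g. "200000 100000" for a 2-CPU limit) into a whole number of CPUs.
// A quota of "max" means the container is not CPU-limited.
func parseCPUMax(s string) (int, bool) {
	fields := strings.Fields(strings.TrimSpace(s))
	if len(fields) != 2 || fields[0] == "max" {
		return 0, false
	}
	quota, err1 := strconv.Atoi(fields[0])
	period, err2 := strconv.Atoi(fields[1])
	if err1 != nil || err2 != nil || period == 0 {
		return 0, false
	}
	return quota / period, true
}

func main() {
	// Respect a manually set GOMAXPROCS, like the real library does.
	if os.Getenv("GOMAXPROCS") != "" {
		return
	}
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		return // not under cgroup v2; keep the runtime default
	}
	if n, ok := parseCPUMax(string(data)); ok && n >= 1 {
		runtime.GOMAXPROCS(n)
		fmt.Println("GOMAXPROCS set to", n)
	}
}
```

In practice, just use the library rather than a hand-rolled version: the cgroup layout varies across container runtimes and kernel versions.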
Notably, in my heavy-load wrk tests, setting GOMAXPROCS to 2 when only 2 CPUs are available is a bigger performance win than letting the GC use 3x as much memory.
Using `wrk -d30s -t12 -c200` and limiting the server to 2 CPUs with `docker run --cpus 2`:
```
Thread Stats     Avg      Stdev      Max    +/- Stdev
CORES=2 GOMAXPROCS=8 GOGC=100
  Latency      20.72ms   25.56ms  105.36ms    79.27%
  Req/Sec       4.13k    609.55    19.03k     82.90%
CORES=2 GOMAXPROCS=8 GOGC=300
  Latency      19.58ms   24.58ms   97.19ms    79.73%
  Req/Sec       4.51k    617.96    17.13k     79.74%
CORES=2 GOMAXPROCS=2 GOGC=100
  Latency       6.53ms    2.37ms   24.92ms    70.40%
  Req/Sec       5.08k    690.78    43.98k     99.53%
CORES=2 GOMAXPROCS=2 GOGC=300
  Latency       6.14ms    2.05ms   19.86ms    69.17%
  Req/Sec       5.40k    817.94    50.31k     99.89%
```
The reduced scheduling contention from matching GOMAXPROCS to the core count cuts average latency by about 66% (19.58ms to 6.53ms), a far bigger improvement than giving the GC 3x the memory ever produced.