
memory leak: looks like KeyValue Unmarshal or txReaderBuffer cause #12256

Closed
yangxuanjia opened this issue Aug 25, 2020 · 19 comments

@yangxuanjia
Contributor

yangxuanjia commented Aug 25, 2020

When k8s ranges over all data on many concurrent connections, memory rises rapidly and never declines; it may be a memory leak.
We have seen memory climb to 110G or 240G on a physical machine and never come back down.
We stopped all the k8s servers and restarted k8s, but it made no difference: the high memory usage stays in etcd.

CentOS 8.1
etcd master branch

Screenshot from 2020-08-25 15-07-10
Screenshot from 2020-08-25 15-44-50

@tangcong
Contributor

Can you provide more info, such as the etcd version, whether you are using the v2 or v3 API, etcd logs, and etcd metrics?

@yangxuanjia changed the title from "memory leak: looks like KeyValue Unmarshal cause" to "memory leak: looks like KeyValue Unmarshal or txReaderBuffer cause" on Aug 25, 2020
@yangxuanjia
Contributor Author

pprof.zip

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 25, 2020

Hi tangcong,

etcd master branch, newer than 3.4.10.
Using the v3 API.
No etcd logs.
I can provide pprof heap metrics.

@tangcong
Contributor

tangcong commented Aug 25, 2020

It seems to be caused by the concurrent readTx buffer and large query responses. Can you provide the db_open_read_transactions metric value?

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 26, 2020

db_open_read_transactions does not look too high,
but it does seem correlated with the memory growth.

Screenshot from 2020-08-26 09-55-35

@tangcong
Contributor

How many keys does your cluster have? Why is there no etcd log? Slow queries print a warning message. When you restarted the apiserver, did process_open_fds drop significantly?

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 26, 2020

This is a test etcd cluster for 10k k8s nodes, not in production, so we did not collect the etcd logs.
We benchmarked 1 billion proposals; I don't have an exact key count.
We restarted the apiserver after the memory rose because I wanted to know whether memory would decline once the connections were interrupted;
I guessed the txn read-buffer copies would be released when the connections were dropped, but it looks like they were not.

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 26, 2020

Maybe I have caught the bug; it really looks like the txn read buffer is causing the memory leak.
I will try to resolve it and then open a PR for this bug.
PS: if I can resolve it. ( >_<!!!)

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 27, 2020

I ran a test and watched the memory use. I found that etcd holds large amounts of memory for the txn read buffer,
and when the txn read buffer is done, the memory moves to [HeapReleased] and then back to [HeapIdle].
The golang docs say [HeapIdle] memory can go back to the OS, but I only see it go back to [HeapSys],
not back to the real OS.

If there are other read operations, memory is allocated from [HeapSys] into [HeapInuse]/[HeapAlloc],
and top on the OS shows memory does not rise again; it looks like etcd holds the allocated memory as a big memory cache.

I have 2 questions:

  1. How can I release etcd's [HeapSys] memory back to the real OS? (A small sketch of forcing the release is shown after the stats below.)
  2. Concurrent txn reads will cause OOM when ranging all keys (300k keys and 100 concurrent requests easily occupy 100G-200G);
    memory rises rapidly.

runtime.MemStats.HeapInuse, value, 1.68GB
runtime.MemStats.HeapAlloc, value, 1.68GB
runtime.MemStats.HeapReleased, value, 57.24GB
runtime.MemStats.HeapIdle, value, 57.94GB
runtime.MemStats.HeapSys, value, 59.62GB
runtime.MemStats.TotalAlloc, value, 393.16GB

runtime.MemStats.HeapObjects, value, 17.17MB
debug.GCStats.NumGC, value, 333.00B
runtime.MemStats.StackInuse, value, 2.47MB
runtime.MemStats.MSpanInuse, value, 5.81MB
runtime.MemStats.HeapReleased, value, 57.24GB
runtime.MemStats.GCCPUFraction, value, 0.00B
runtime.NumGoroutine, value, 6.00B
debug.GCStats.LastGC, value, 1.39EB
runtime.MemStats.HeapAlloc, value, 1.43GB
runtime.MemStats.DebugGC, value, 0.00B
runtime.MemStats.MSpanSys, value, 49.14MB
runtime.MemStats.MCacheSys, value, 128.00KB
runtime.MemStats.StackSys, value, 2.47MB
runtime.MemStats.HeapIdle, value, 58.19GB
runtime.MemStats.Lookups, value, 0.00B
debug.ReadGCStats, mean.rate, 0.00B
debug.ReadGCStats, count, 94.00B
debug.ReadGCStats, stddev, 4.03MB
debug.ReadGCStats, 95%, 551.29KB
debug.ReadGCStats, 5m.rate, 0.00B
debug.ReadGCStats, 1m.rate, 0.00B
debug.ReadGCStats, median, 24.76KB
debug.ReadGCStats, 75%, 40.43KB
debug.ReadGCStats, 99%, 39.27MB
debug.ReadGCStats, 99.9%, 39.27MB
debug.ReadGCStats, 15m.rate, 0.00B
debug.ReadGCStats, min, 3.61KB
debug.ReadGCStats, max, 39.27MB
debug.ReadGCStats, mean, 528.17KB
runtime.MemStats.PauseTotalNs, value, 213.50MB
runtime.MemStats.NextGC, value, 2.25GB
runtime.MemStats.Alloc, value, 1.43GB
runtime.MemStats.NumGC, value, 4.00B
debug.GCStats.Pause, min, 109.23KB
debug.GCStats.Pause, max, 13.42MB
debug.GCStats.Pause, mean, 965.72KB
debug.GCStats.Pause, stddev, 2.09MB
debug.GCStats.Pause, median, 569.84KB
debug.GCStats.Pause, 99%, 13.42MB
debug.GCStats.Pause, 99.9%, 13.42MB
debug.GCStats.Pause, count, 87.00B
debug.GCStats.Pause, 75%, 710.93KB
debug.GCStats.Pause, 95%, 1.07MB
runtime.MemStats.BuckHashSys, value, 1.38MB
runtime.MemStats.PauseNs, max, 13.42MB
runtime.MemStats.PauseNs, mean, 656.54KB
runtime.MemStats.PauseNs, stddev, 1.10MB
runtime.MemStats.PauseNs, 75%, 699.76KB
runtime.MemStats.PauseNs, 99%, 7.60MB
runtime.MemStats.PauseNs, count, 333.00B
runtime.MemStats.PauseNs, min, 43.23KB
runtime.MemStats.PauseNs, 99.9%, 13.42MB
runtime.MemStats.PauseNs, median, 553.80KB
runtime.MemStats.PauseNs, 95%, 992.23KB
runtime.MemStats.MCacheInuse, value, 108.50KB
runtime.ReadMemStats, min, 78.55KB
runtime.ReadMemStats, 99%, 5.95GB
runtime.ReadMemStats, 1m.rate, 0.00B
runtime.ReadMemStats, mean.rate, 0.00B
runtime.ReadMemStats, count, 94.00B
runtime.ReadMemStats, 75%, 622.42KB
runtime.ReadMemStats, 99.9%, 5.95GB
runtime.ReadMemStats, 15m.rate, 0.00B
runtime.ReadMemStats, max, 5.95GB
runtime.ReadMemStats, mean, 144.82MB
runtime.ReadMemStats, stddev, 871.26MB
runtime.ReadMemStats, median, 320.45KB
runtime.ReadMemStats, 95%, 347.13MB
runtime.ReadMemStats, 5m.rate, 0.00B
runtime.MemStats.Mallocs, value, 42.91KB
runtime.MemStats.HeapInuse, value, 1.43GB
runtime.MemStats.LastGC, value, 1.39EB
runtime.NumThread, value, 70.00B
runtime.MemStats.HeapSys, value, 59.62GB
runtime.NumCgoCall, value, 0.00B
runtime.MemStats.EnableGC, value, 1.00B
runtime.MemStats.Frees, value, 41.86KB
runtime.MemStats.Sys, value, 61.64GB
runtime.MemStats.TotalAlloc, value, 393.16GB
debug.GCStats.PauseTotal, value, 213.50MB
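
On question 1: since Go 1.12 the runtime releases freed heap to the OS with MADV_FREE on Linux, so RSS in top stays high until the kernel actually reclaims the pages; running the process with GODEBUG=madvdontneed=1, or calling debug.FreeOSMemory(), makes the release visible immediately. A minimal standalone sketch (not etcd code) that forces the release and prints the heap stats:

    // Minimal sketch: force the runtime to return idle heap to the OS and
    // print the relevant heap stats before and after.
    package main

    import (
        "fmt"
        "runtime"
        "runtime/debug"
    )

    func printHeap(tag string) {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        fmt.Printf("%s: HeapSys=%dMB HeapIdle=%dMB HeapReleased=%dMB\n",
            tag, m.HeapSys>>20, m.HeapIdle>>20, m.HeapReleased>>20)
    }

    func main() {
        // Allocate and drop ~256MB so there is idle heap to hand back.
        big := make([]byte, 1<<28)
        _ = big
        big = nil

        printHeap("before")
        debug.FreeOSMemory() // forces a GC and returns as much memory as possible to the OS
        printHeap("after")
    }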

@yangxuanjia
Contributor Author

golang/go#14521

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 27, 2020

Maybe the fix is on the k8s side.
k8s shouldn't range all pods concurrently; it can range the pods under a single namespace to avoid the txn read buffer allocating large amounts of memory concurrently.
Or it could use a serializable range for the "all pods" API, or cache on the k8s apiserver side when using etcd 3.4.10+. (A client-side sketch of these suggestions follows.)
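
For illustration, a minimal clientv3 sketch of those suggestions (namespace-scoped prefix, serializable read, and a page limit instead of ranging everything); the endpoint, key prefix, and limit are assumptions, and the import path is the 3.4-era client:

    // Minimal client-side sketch of the suggestions above; not production code.
    package main

    import (
        "context"
        "fmt"
        "time"

        "go.etcd.io/etcd/clientv3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"127.0.0.1:2379"}, // assumption: local test endpoint
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        // Serializable read is served by the local member (no quorum read),
        // the prefix limits the range to one hypothetical namespace, and the
        // limit pages the result instead of pulling every pod at once.
        resp, err := cli.Get(ctx, "/registry/pods/my-namespace/",
            clientv3.WithPrefix(),
            clientv3.WithSerializable(),
            clientv3.WithLimit(500))
        if err != nil {
            panic(err)
        }
        fmt.Printf("got %d keys, more=%v\n", len(resp.Kvs), resp.More)
    }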

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 28, 2020

Hi tangcong,

I have some questions and hope you can clear them up for me.
Suppose etcd has 10k nodes and 300k pods of data on disk.

  1. Does the backend txn read buffer hold all the data?
    I see that the backend txn read buffer is only written to on write-back,
    and that the buffer is reset on defrag and on batchTxBuffered commit.
    So how much data is in the txn read buffer: all data, or only recently written data?
  2. If the backend txn read buffer holds a lot of data,
    does every concurrent txn read need to copy the backend txn read buffer?
    2.1 If one txn read only gets one key (short read), do I need to copy the whole txn read buffer?
    2.2 If one txn read only ranges 100 keys (short read), do I need to copy the whole txn read buffer?
    2.3 If one txn read ranges a large number of keys (long read), do I need to copy the whole txn read buffer?
    I think the goal of concurrent reads is to resolve the lock conflict between long reads and writes, while the conflict between short reads and writes is very small.
    So I have a new idea: split short reads and long reads, and only let long reads copy the backend txn read buffer,
    while short reads just take an RLock on the backend txn read buffer. Maybe that can reduce the rapid memory growth and still give decent performance. (But if more than 100+ long range requests arrive concurrently, etcd will still consume a lot of memory, so it is not a beautiful solution.) A rough sketch of this split is shown below.
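
A rough, simplified sketch of the split proposed above, assuming a toy backend with one shared read buffer guarded by a sync.RWMutex; the types and field names here are made up for illustration and are not etcd's:

    // Toy model of the proposal: short reads share the buffer under an RLock,
    // long reads take a private copy so they do not block writers for long.
    package main

    import "sync"

    type readBuffer struct {
        kvs map[string]string // simplified stand-in for etcd's bucket buffers
    }

    func (b *readBuffer) copy() *readBuffer {
        c := &readBuffer{kvs: make(map[string]string, len(b.kvs))}
        for k, v := range b.kvs {
            c.kvs[k] = v
        }
        return c
    }

    type backend struct {
        mu  sync.RWMutex
        buf *readBuffer
    }

    // shortRead holds the RLock for the whole (short) read; no copy is made.
    func (be *backend) shortRead(key string) (string, bool) {
        be.mu.RLock()
        defer be.mu.RUnlock()
        v, ok := be.buf.kvs[key]
        return v, ok
    }

    // longRead copies the buffer under a brief RLock, then releases the lock so
    // writers are not blocked while the long range scan runs on the copy.
    func (be *backend) longRead(visit func(k, v string)) {
        be.mu.RLock()
        snapshot := be.buf.copy()
        be.mu.RUnlock()
        for k, v := range snapshot.kvs {
            visit(k, v)
        }
    }

    func main() {
        be := &backend{buf: &readBuffer{kvs: map[string]string{"a": "1", "b": "2"}}}
        if v, ok := be.shortRead("a"); ok {
            println("short read:", v)
        }
        be.longRead(func(k, v string) { println("long read:", k, v) })
    }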

@tangcong
Contributor

The txn read buffer only holds recently written data. The backend commits data every 100ms (by default) and resets the read buffer, so as long as data is committed normally and the db_open_read_transactions value is small, the buffer does not contain much data.

The key issue is that you have to avoid ranging over all the data, because that costs too much memory. Please follow the best practices and see the k8s SLA:
https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
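
To make that lifecycle concrete, here is a toy sketch (not etcd's actual code; the fields and interval handling are illustrative) of a buffered batch transaction whose pending writes are flushed and whose buffer is reset on a 100ms commit tick, which is why the read buffer normally stays small:

    // Toy model of the write-back buffer lifecycle; fields are illustrative.
    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type batchTxBuffered struct {
        mu      sync.Mutex
        pending map[string]string // writes waiting for the next commit
    }

    // put records a write; in etcd the write is also written back into the
    // read buffer so readers can see it before the next commit.
    func (t *batchTxBuffered) put(k, v string) {
        t.mu.Lock()
        defer t.mu.Unlock()
        t.pending[k] = v
    }

    // commitLoop flushes pending writes on every tick (100ms by default in
    // etcd) and resets the buffer, which is why it normally stays small.
    func (t *batchTxBuffered) commitLoop(interval time.Duration, stop <-chan struct{}) {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                t.mu.Lock()
                fmt.Printf("commit: flushing %d buffered writes, resetting buffer\n", len(t.pending))
                t.pending = make(map[string]string)
                t.mu.Unlock()
            case <-stop:
                return
            }
        }
    }

    func main() {
        tx := &batchTxBuffered{pending: make(map[string]string)}
        stop := make(chan struct{})
        go tx.commitLoop(100*time.Millisecond, stop)
        tx.put("foo", "bar")
        time.Sleep(250 * time.Millisecond)
        close(stop)
    }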

@yangxuanjia
Contributor Author

Hi tangcong,

If the read buffer holds only a little cached data, why does pprof show that most of the memory is allocated through func (bb *bucketBuffer) Copy()???
Even if I have 100+ connections ranging all data concurrently, I would expect most of the memory to be spent in Add(), not Copy().
What do you think?

@tangcong
Contributor

tangcong commented Aug 28, 2020

Can you reproduce it reliably? We can add some debug logs to see whether commit is blocked and to print the buffer size.

	if t.backend.readTx.tx != nil {
		// wait all store read transactions using the current boltdb tx to finish,
		// then close the boltdb tx
		go func(tx *bolt.Tx, wg *sync.WaitGroup) {
			wg.Wait()
			if err := tx.Rollback(); err != nil {
				t.backend.lg.Fatal("failed to rollback tx", zap.Error(err))
			}
		}(t.backend.readTx.tx, t.backend.readTx.txWg)
		t.backend.readTx.reset()
	}

	t.batchTx.commit(stop)

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 28, 2020

Hi tangcong,

My thinking was wrong; most of the memory is truly used by storeTxnRead.rangeKeys()->KeyValue.Unmarshal().
This is shown by heap_inuse_space.svg in pprof.zip. So 100+ concurrent requests ranging over 300k keys account for most of the memory.

txReadBuffer.unsafeCopy()->bucketBuffer.Copy() only shows up in heap_alloc_space; it is an accumulated allocation cost, not live memory.

So you are right that "The key issue is that you have to avoid ranging all data and it costs too much memory."

The key point is not the txn ReadBuffer Copy(); it is that ranging over all the data makes storeTxnRead.rangeKeys() cost too much memory.

@tangcong
Contributor

Good, got it. Rate-limiting is necessary for expensive reads and can avoid spending too much memory.
I will add rate-limiting for expensive reads in future work. Thanks.
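
For illustration only, such rate-limiting could look like the sketch below: a counting semaphore built from a buffered channel that caps how many large-range reads run at once and rejects the rest. The threshold and limit values are made up, and this is not etcd's actual implementation:

    // Sketch of server-side rate-limiting for expensive (large-range) reads.
    package main

    import (
        "errors"
        "fmt"
    )

    // expensiveReadLimiter caps how many expensive reads may run concurrently;
    // a buffered channel acts as a counting semaphore.
    type expensiveReadLimiter struct {
        slots chan struct{}
    }

    func newExpensiveReadLimiter(max int) *expensiveReadLimiter {
        return &expensiveReadLimiter{slots: make(chan struct{}, max)}
    }

    var errTooManyExpensiveReads = errors.New("too many expensive reads in flight")

    // run executes fn if a slot is free, otherwise rejects immediately so the
    // server sheds load instead of buffering unbounded range responses in memory.
    func (l *expensiveReadLimiter) run(fn func()) error {
        select {
        case l.slots <- struct{}{}:
            defer func() { <-l.slots }()
            fn()
            return nil
        default:
            return errTooManyExpensiveReads
        }
    }

    func main() {
        // Hypothetical policy: at most 4 expensive ranges may run at once.
        limiter := newExpensiveReadLimiter(4)
        err := limiter.run(func() { fmt.Println("serving one expensive range") })
        fmt.Println("err:", err)
    }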

@chinazj

chinazj commented Sep 9, 2020

Hi yangxuanjia,

I ran into a problem: I cannot use -tls_ca=ca.crt -tls_cert=server.crt -tls_key=server.key with go tool pprof:

go tool pprof -tls_ca=ca.crt -tls_cert=server.crt -tls_key=server.key  https://10.55.58.31:2380/debug/pprof/profile
Fetching profile over HTTP from https://10.55.58.31:2380/debug/pprof/profile
https://10.55.58.31:2380/debug/pprof/profile: Get "https://10.55.58.31:2380/debug/pprof/profile": remote error: tls: bad certificate
failed to fetch any source profiles

Please help me.

Thanks

@yangxuanjia
Contributor Author

You can fetch the heap profile with curl using the client certificate, then analyze the dump locally:

    curl -k --cert client.pem --key client-key.pem 'https://172.28.200.99:2379/debug/pprof/heap' -o ppppp1.dump
    go tool pprof --svg -inuse_space -output profile001_inuse_space.svg ppppp1.dump
