
memory leak: looks like KeyValue Unmarshal or txReaderBuffer cause #12256

Closed
yangxuanjia opened this issue Aug 25, 2020 · 19 comments

@yangxuanjia
Contributor

yangxuanjia commented Aug 25, 2020

When k8s ranges over all data on many concurrent connections, memory rises rapidly and never declines; it may be a memory leak.
We have seen memory climb to 110G or 240G on a physical machine and never come back down.
We stopped all the k8s servers and restarted k8s, but it made no difference: the high memory usage stays in etcd.

CentOS 8.1
etcd master branch

Screenshot from 2020-08-25 15-07-10
Screenshot from 2020-08-25 15-44-50

@tangcong
Contributor

Can you provide more info, such as the etcd version, whether you are using the v2 or v3 API, etcd logs, and etcd metrics?

@yangxuanjia changed the title from "memory leak: looks like KeyValue Unmarshal cause" to "memory leak: looks like KeyValue Unmarshal or txReaderBuffer cause" on Aug 25, 2020
@yangxuanjia
Contributor Author

pprof.zip

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 25, 2020

Hi tangcong,

etcd master branch, newer than 3.4.10.
Using the v3 API.
No etcd logs.
I can provide pprof heap metrics.

@tangcong
Contributor

tangcong commented Aug 25, 2020

It seems to be caused by the concurrent readTx buffer and large query responses. Can you provide the db_open_read_transactions metric value?

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 26, 2020

db_open_read_transactions does not look too high,
but it does seem correlated with the memory growth.

Screenshot from 2020-08-26 09-55-35

@tangcong
Contributor

How many keys does your cluster have? Why is there no etcd log? Slow queries print a warning message. When you restarted the apiserver, did process_open_fds drop significantly?

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 26, 2020

This is a test etcd cluster for 10k k8s nodes, not in production, so we did not collect the etcd logs.
We benchmarked 1 billion proposals; I don't have an exact key count.
We restarted the apiserver after the memory rose because I wanted to know whether memory would decline once the connections were interrupted;
I guessed the txn read-buffer copies would be released when the connections were dropped, but it looks like they were not.

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 26, 2020

Maybe I have caught the bug; it really looks like the txn read buffer is causing the memory leak.
I will try to resolve it and then open a PR for this bug.
PS: if I can resolve it. ( >_<!!!)

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 27, 2020

I ran a test and watched the memory use. I found that etcd holds large amounts of memory for the txn read buffer,
and when the txn read buffer is done, the memory moves to [HeapReleased] and then back to [HeapIdle].
The golang docs say [HeapIdle] memory can go back to the OS, but I only see it go back to [HeapSys],
not back to the real OS.

If there are other read operations, memory is allocated from [HeapSys] into [HeapInuse]/[HeapAlloc],
and top on the OS shows memory does not rise again; it looks like etcd holds the allocated memory as a big memory cache.

I have 2 questions:

  1. How can I release etcd's [HeapSys] memory back to the real OS? (A small sketch of forcing the release is shown after the stats below.)
  2. Concurrent txn reads will cause OOM when ranging all keys (300k keys and 100 concurrent requests easily occupy 100G-200G);
    memory rises rapidly.

runtime.MemStats.HeapInuse, value, 1.68GB
runtime.MemStats.HeapAlloc, value, 1.68GB
runtime.MemStats.HeapReleased, value, 57.24GB
runtime.MemStats.HeapIdle, value, 57.94GB
runtime.MemStats.HeapSys, value, 59.62GB
runtime.MemStats.TotalAlloc, value, 393.16GB

runtime.MemStats.HeapObjects, value, 17.17MB
debug.GCStats.NumGC, value, 333.00B
runtime.MemStats.StackInuse, value, 2.47MB
runtime.MemStats.MSpanInuse, value, 5.81MB
runtime.MemStats.HeapReleased, value, 57.24GB
runtime.MemStats.GCCPUFraction, value, 0.00B
runtime.NumGoroutine, value, 6.00B
debug.GCStats.LastGC, value, 1.39EB
runtime.MemStats.HeapAlloc, value, 1.43GB
runtime.MemStats.DebugGC, value, 0.00B
runtime.MemStats.MSpanSys, value, 49.14MB
runtime.MemStats.MCacheSys, value, 128.00KB
runtime.MemStats.StackSys, value, 2.47MB
runtime.MemStats.HeapIdle, value, 58.19GB
runtime.MemStats.Lookups, value, 0.00B
debug.ReadGCStats, mean.rate, 0.00B
debug.ReadGCStats, count, 94.00B
debug.ReadGCStats, stddev, 4.03MB
debug.ReadGCStats, 95%, 551.29KB
debug.ReadGCStats, 5m.rate, 0.00B
debug.ReadGCStats, 1m.rate, 0.00B
debug.ReadGCStats, median, 24.76KB
debug.ReadGCStats, 75%, 40.43KB
debug.ReadGCStats, 99%, 39.27MB
debug.ReadGCStats, 99.9%, 39.27MB
debug.ReadGCStats, 15m.rate, 0.00B
debug.ReadGCStats, min, 3.61KB
debug.ReadGCStats, max, 39.27MB
debug.ReadGCStats, mean, 528.17KB
runtime.MemStats.PauseTotalNs, value, 213.50MB
runtime.MemStats.NextGC, value, 2.25GB
runtime.MemStats.Alloc, value, 1.43GB
runtime.MemStats.NumGC, value, 4.00B
debug.GCStats.Pause, min, 109.23KB
debug.GCStats.Pause, max, 13.42MB
debug.GCStats.Pause, mean, 965.72KB
debug.GCStats.Pause, stddev, 2.09MB
debug.GCStats.Pause, median, 569.84KB
debug.GCStats.Pause, 99%, 13.42MB
debug.GCStats.Pause, 99.9%, 13.42MB
debug.GCStats.Pause, count, 87.00B
debug.GCStats.Pause, 75%, 710.93KB
debug.GCStats.Pause, 95%, 1.07MB
runtime.MemStats.BuckHashSys, value, 1.38MB
runtime.MemStats.PauseNs, max, 13.42MB
runtime.MemStats.PauseNs, mean, 656.54KB
runtime.MemStats.PauseNs, stddev, 1.10MB
runtime.MemStats.PauseNs, 75%, 699.76KB
runtime.MemStats.PauseNs, 99%, 7.60MB
runtime.MemStats.PauseNs, count, 333.00B
runtime.MemStats.PauseNs, min, 43.23KB
runtime.MemStats.PauseNs, 99.9%, 13.42MB
runtime.MemStats.PauseNs, median, 553.80KB
runtime.MemStats.PauseNs, 95%, 992.23KB
runtime.MemStats.MCacheInuse, value, 108.50KB
runtime.ReadMemStats, min, 78.55KB
runtime.ReadMemStats, 99%, 5.95GB
runtime.ReadMemStats, 1m.rate, 0.00B
runtime.ReadMemStats, mean.rate, 0.00B
runtime.ReadMemStats, count, 94.00B
runtime.ReadMemStats, 75%, 622.42KB
runtime.ReadMemStats, 99.9%, 5.95GB
runtime.ReadMemStats, 15m.rate, 0.00B
runtime.ReadMemStats, max, 5.95GB
runtime.ReadMemStats, mean, 144.82MB
runtime.ReadMemStats, stddev, 871.26MB
runtime.ReadMemStats, median, 320.45KB
runtime.ReadMemStats, 95%, 347.13MB
runtime.ReadMemStats, 5m.rate, 0.00B
runtime.MemStats.Mallocs, value, 42.91KB
runtime.MemStats.HeapInuse, value, 1.43GB
runtime.MemStats.LastGC, value, 1.39EB
runtime.NumThread, value, 70.00B
runtime.MemStats.HeapSys, value, 59.62GB
runtime.NumCgoCall, value, 0.00B
runtime.MemStats.EnableGC, value, 1.00B
runtime.MemStats.Frees, value, 41.86KB
runtime.MemStats.Sys, value, 61.64GB
runtime.MemStats.TotalAlloc, value, 393.16GB
debug.GCStats.PauseTotal, value, 213.50MB
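
On question 1: since Go 1.12 the runtime releases freed heap to the OS with MADV_FREE on Linux, so RSS in top stays high until the kernel actually reclaims the pages; running the process with GODEBUG=madvdontneed=1, or calling debug.FreeOSMemory(), makes the release visible immediately. A minimal standalone sketch (not etcd code) that forces the release and prints the heap stats:

    // Minimal sketch: force the runtime to return idle heap to the OS and
    // print the relevant heap stats before and after.
    package main

    import (
        "fmt"
        "runtime"
        "runtime/debug"
    )

    func printHeap(tag string) {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        fmt.Printf("%s: HeapSys=%dMB HeapIdle=%dMB HeapReleased=%dMB\n",
            tag, m.HeapSys>>20, m.HeapIdle>>20, m.HeapReleased>>20)
    }

    func main() {
        // Allocate and drop ~256MB so there is idle heap to hand back.
        big := make([]byte, 1<<28)
        _ = big
        big = nil

        printHeap("before")
        debug.FreeOSMemory() // forces a GC and returns as much memory as possible to the OS
        printHeap("after")
    }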

@yangxuanjia
Contributor Author

golang/go#14521

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 27, 2020

Maybe the fix is on the k8s side.
k8s shouldn't range all pods concurrently; it can range the pods under a single namespace to avoid the txn read buffer allocating large amounts of memory concurrently.
Or it could use a serializable range for the "all pods" API, or cache on the k8s apiserver side when using etcd 3.4.10+. (A client-side sketch of these suggestions follows.)
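
For illustration, a minimal clientv3 sketch of those suggestions (namespace-scoped prefix, serializable read, and a page limit instead of ranging everything); the endpoint, key prefix, and limit are assumptions, and the import path is the 3.4-era client:

    // Minimal client-side sketch of the suggestions above; not production code.
    package main

    import (
        "context"
        "fmt"
        "time"

        "go.etcd.io/etcd/clientv3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"127.0.0.1:2379"}, // assumption: local test endpoint
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        // Serializable read is served by the local member (no quorum read),
        // the prefix limits the range to one hypothetical namespace, and the
        // limit pages the result instead of pulling every pod at once.
        resp, err := cli.Get(ctx, "/registry/pods/my-namespace/",
            clientv3.WithPrefix(),
            clientv3.WithSerializable(),
            clientv3.WithLimit(500))
        if err != nil {
            panic(err)
        }
        fmt.Printf("got %d keys, more=%v\n", len(resp.Kvs), resp.More)
    }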

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 28, 2020

Hi tangcong,

I have some questions and hope you can clear them up for me.
Suppose etcd has 10k nodes and 300k pods of data on disk.

  1. Does the backend txn read buffer hold all the data?
    I see that the backend txn read buffer is only written to on write-back,
    and that the buffer is reset on defrag and on batchTxBuffered commit.
    So how much data is in the txn read buffer: all data, or only recently written data?
  2. If the backend txn read buffer holds a lot of data,
    does every concurrent txn read need to copy the backend txn read buffer?
    2.1 If one txn read only gets one key (short read), do I need to copy the whole txn read buffer?
    2.2 If one txn read only ranges 100 keys (short read), do I need to copy the whole txn read buffer?
    2.3 If one txn read ranges a large number of keys (long read), do I need to copy the whole txn read buffer?
    I think the goal of concurrent reads is to resolve the lock conflict between long reads and writes, while the conflict between short reads and writes is very small.
    So I have a new idea: split short reads and long reads, and only let long reads copy the backend txn read buffer,
    while short reads just take an RLock on the backend txn read buffer. Maybe that can reduce the rapid memory growth and still give decent performance. (But if more than 100+ long range requests arrive concurrently, etcd will still consume a lot of memory, so it is not a beautiful solution.) A rough sketch of this split is shown below.
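
A rough, simplified sketch of the split proposed above, assuming a toy backend with one shared read buffer guarded by a sync.RWMutex; the types and field names here are made up for illustration and are not etcd's:

    // Toy model of the proposal: short reads share the buffer under an RLock,
    // long reads take a private copy so they do not block writers for long.
    package main

    import "sync"

    type readBuffer struct {
        kvs map[string]string // simplified stand-in for etcd's bucket buffers
    }

    func (b *readBuffer) copy() *readBuffer {
        c := &readBuffer{kvs: make(map[string]string, len(b.kvs))}
        for k, v := range b.kvs {
            c.kvs[k] = v
        }
        return c
    }

    type backend struct {
        mu  sync.RWMutex
        buf *readBuffer
    }

    // shortRead holds the RLock for the whole (short) read; no copy is made.
    func (be *backend) shortRead(key string) (string, bool) {
        be.mu.RLock()
        defer be.mu.RUnlock()
        v, ok := be.buf.kvs[key]
        return v, ok
    }

    // longRead copies the buffer under a brief RLock, then releases the lock so
    // writers are not blocked while the long range scan runs on the copy.
    func (be *backend) longRead(visit func(k, v string)) {
        be.mu.RLock()
        snapshot := be.buf.copy()
        be.mu.RUnlock()
        for k, v := range snapshot.kvs {
            visit(k, v)
        }
    }

    func main() {
        be := &backend{buf: &readBuffer{kvs: map[string]string{"a": "1", "b": "2"}}}
        if v, ok := be.shortRead("a"); ok {
            println("short read:", v)
        }
        be.longRead(func(k, v string) { println("long read:", k, v) })
    }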

@tangcong
Contributor

The txn read buffer only holds recently written data. The backend commits data every 100ms (by default) and resets the read buffer, so as long as data is committed normally and the db_open_read_transactions value is small, the buffer does not contain much data.

The key issue is that you have to avoid ranging over all the data, because that costs too much memory. Please follow the best practices and see the k8s SLA:
https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
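
To make that lifecycle concrete, here is a toy sketch (not etcd's actual code; the fields and interval handling are illustrative) of a buffered batch transaction whose pending writes are flushed and whose buffer is reset on a 100ms commit tick, which is why the read buffer normally stays small:

    // Toy model of the write-back buffer lifecycle; fields are illustrative.
    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    type batchTxBuffered struct {
        mu      sync.Mutex
        pending map[string]string // writes waiting for the next commit
    }

    // put records a write; in etcd the write is also written back into the
    // read buffer so readers can see it before the next commit.
    func (t *batchTxBuffered) put(k, v string) {
        t.mu.Lock()
        defer t.mu.Unlock()
        t.pending[k] = v
    }

    // commitLoop flushes pending writes on every tick (100ms by default in
    // etcd) and resets the buffer, which is why it normally stays small.
    func (t *batchTxBuffered) commitLoop(interval time.Duration, stop <-chan struct{}) {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                t.mu.Lock()
                fmt.Printf("commit: flushing %d buffered writes, resetting buffer\n", len(t.pending))
                t.pending = make(map[string]string)
                t.mu.Unlock()
            case <-stop:
                return
            }
        }
    }

    func main() {
        tx := &batchTxBuffered{pending: make(map[string]string)}
        stop := make(chan struct{})
        go tx.commitLoop(100*time.Millisecond, stop)
        tx.put("foo", "bar")
        time.Sleep(250 * time.Millisecond)
        close(stop)
    }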

@yangxuanjia
Contributor Author

Hi tangcong,

If the read buffer holds only a little cached data, why does pprof show that most of the memory is allocated through func (bb *bucketBuffer) Copy()???
Even if I have 100+ connections ranging all data concurrently, I would expect most of the memory to be spent in Add(), not Copy().
What do you think?

@tangcong
Contributor

tangcong commented Aug 28, 2020

Can you reproduce it reliably? We can add some debug logs to see whether commit is blocked and to print the buffer size.

	if t.backend.readTx.tx != nil {
		// wait all store read transactions using the current boltdb tx to finish,
		// then close the boltdb tx
		go func(tx *bolt.Tx, wg *sync.WaitGroup) {
			wg.Wait()
			if err := tx.Rollback(); err != nil {
				t.backend.lg.Fatal("failed to rollback tx", zap.Error(err))
			}
		}(t.backend.readTx.tx, t.backend.readTx.txWg)
		t.backend.readTx.reset()
	}

	t.batchTx.commit(stop)

@yangxuanjia
Contributor Author

yangxuanjia commented Aug 28, 2020

Hi tangcong,

My thinking was wrong; most of the memory is truly used by storeTxnRead.rangeKeys()->KeyValue.Unmarshal().
This is shown by heap_inuse_space.svg in pprof.zip. So 100+ concurrent requests ranging over 300k keys account for most of the memory.

txReadBuffer.unsafeCopy()->bucketBuffer.Copy() only shows up in heap_alloc_space; it is an accumulated allocation cost, not live memory.

So you are right that "The key issue is that you have to avoid ranging all data and it costs too much memory."

The key point is not the txn ReadBuffer Copy(); it is that ranging over all the data makes storeTxnRead.rangeKeys() cost too much memory.

@tangcong
Contributor

Good, got it. Rate-limiting is necessary for expensive reads and can avoid spending too much memory.
I will add rate-limiting for expensive reads in future work. Thanks.
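
For illustration only, such rate-limiting could look like the sketch below: a counting semaphore built from a buffered channel that caps how many large-range reads run at once and rejects the rest. The threshold and limit values are made up, and this is not etcd's actual implementation:

    // Sketch of server-side rate-limiting for expensive (large-range) reads.
    package main

    import (
        "errors"
        "fmt"
    )

    // expensiveReadLimiter caps how many expensive reads may run concurrently;
    // a buffered channel acts as a counting semaphore.
    type expensiveReadLimiter struct {
        slots chan struct{}
    }

    func newExpensiveReadLimiter(max int) *expensiveReadLimiter {
        return &expensiveReadLimiter{slots: make(chan struct{}, max)}
    }

    var errTooManyExpensiveReads = errors.New("too many expensive reads in flight")

    // run executes fn if a slot is free, otherwise rejects immediately so the
    // server sheds load instead of buffering unbounded range responses in memory.
    func (l *expensiveReadLimiter) run(fn func()) error {
        select {
        case l.slots <- struct{}{}:
            defer func() { <-l.slots }()
            fn()
            return nil
        default:
            return errTooManyExpensiveReads
        }
    }

    func main() {
        // Hypothetical policy: at most 4 expensive ranges may run at once.
        limiter := newExpensiveReadLimiter(4)
        err := limiter.run(func() { fmt.Println("serving one expensive range") })
        fmt.Println("err:", err)
    }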

@chinazj

chinazj commented Sep 9, 2020

Hi yangxuanjia,

I ran into a problem: I cannot use -tls_ca=ca.crt -tls_cert=server.crt -tls_key=server.key with go tool pprof:

go tool pprof -tls_ca=ca.crt -tls_cert=server.crt -tls_key=server.key  https://10.55.58.31:2380/debug/pprof/profile
Fetching profile over HTTP from https://10.55.58.31:2380/debug/pprof/profile
https://10.55.58.31:2380/debug/pprof/profile: Get "https://10.55.58.31:2380/debug/pprof/profile": remote error: tls: bad certificate
failed to fetch any source profiles

Please help me.

Thanks

@yangxuanjia
Contributor Author

You can fetch the heap profile with curl using the client certificate, then analyze the dump locally:

    curl -k --cert client.pem --key client-key.pem 'https://172.28.200.99:2379/debug/pprof/heap' -o ppppp1.dump
    go tool pprof --svg -inuse_space -output profile001_inuse_space.svg ppppp1.dump
