
Inconsistent revision and data occurs #13766

Closed
liuycsd opened this issue Mar 8, 2022 · 45 comments · Fixed by #13854 or #13878
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. type/bug

Comments

@liuycsd

liuycsd commented Mar 8, 2022

I'm running a 3-member etcd cluster in a testing environment,
and a k8s cluster with each kube-apiserver connecting to one etcd server via localhost.

etcd cluster: created and running with v3.5.1
k8s cluster: v1.22

I found a data inconsistency several days ago:
Some keys exist on nodes x and y but not on node z, and some on z but not on the other two, e.g. different pod lists are returned from different servers.
Some keys have different values on z than on the others, and can be updated to yet another value via the corresponding etcd endpoint, e.g. the kube-system/kube-controller-manager lease points to different pods on different servers and both pods can successfully update the lease via their corresponding local api-server and etcd.
Other keys, including some newly created ones, are consistent.

Checking with etcdctl endpoint status, raft_term, leader, raftIndex and raftAppliedIndex are the same, but revision and dbSize are not: the revision on node z is smaller than on the others (~700000 difference) and both revisions keep increasing.

Checking with etcd-dump-logs, it seems all 3 nodes are receiving the same raft logs.
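
For reference, a minimal sketch of those two checks (endpoints, cert paths and the data-dir path are placeholders, not the actual values from this cluster):

# compare revision, raft term/index and db size across all members
etcdctl --endpoints=https://x:2379,https://y:2379,https://z:2379 \
  --cacert=ca.crt --cert=client.crt --key=client.key \
  endpoint status -w table

# dump the raft log entries from one member's data dir for comparison with the others
etcd-dump-logs /var/lib/etcd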

I'm not sure how and when this happened.
But the nodes might be under load and may run out of memory or disk space sometimes.
Checking and comparing db keys with bbolt keys SNAPSHOT key, and searching the node operating system logs for the revisions near where the difference starts, I found some slow-read logs mentioning those revisions and dozens of leader elections and switches during those several hours. Besides, the disk space (including the etcd data dir) of node z may have been full, and the operating system logs were cleared, so I don't know what happened exactly and I'm not sure whether this is related to the issue or not.
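
A sketch of that bbolt comparison, assuming SNAPSHOT is a copy of each member's db file and "key" is the bucket etcd stores MVCC revisions in (paths are placeholders):

# list the revision keys stored in the "key" bucket of each member's copied db file
bbolt keys /backup/node-x-snapshot.db key > node-x.keys
bbolt keys /backup/node-z-snapshot.db key > node-z.keys
# diff the outputs to find the revision where the members start to diverge
diff node-x.keys node-z.keys | head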

@serathius serathius added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 8, 2022
@serathius
Member

cc @ptabor

@ptabor
Contributor

ptabor commented Mar 8, 2022

Was this cluster originally created as 3.5.1, or was it migrated from 3.4.x or 3.3.x?

@serathius
Member

Hey @liuycsd, can you provide the etcd configuration you are running? Specifically, it would be useful to know whether you use --experimental-corrupt-check-time, as it would allow early detection.

@liuycsd
Author

liuycsd commented Mar 8, 2022

Was this cluster originally created as 3.5.1, or was it migrated from 3.4.x or 3.3.x?

IIRC, it was originally created as 3.5.1.

@liuycsd
Author

liuycsd commented Mar 8, 2022

Hey @liuycsd, can you provide the etcd configuration you are running? Specifically, it would be useful to know whether you use --experimental-corrupt-check-time, as it would allow early detection.

I didn't set --experimental-corrupt-check-time.

@liuycsd
Author

liuycsd commented Mar 9, 2022

Hey @liuycsd, can you provide the etcd configuration you are running? Specifically, it would be useful to know whether you use --experimental-corrupt-check-time, as it would allow early detection.

Checking the running server, the current config is:
HTTPS serving with client and peer cert auth enabled
quota-backend-bytes
heartbeat-interval
election-timeout
auto-compaction-retention
All set via environment variables; no command-line arguments.
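
For illustration, such a configuration expressed through etcd's ETCD_* environment variables would look roughly like this (names follow etcd's flag-to-environment-variable mapping; the paths and values here are made up, not the actual ones used):

ETCD_CERT_FILE=/etc/etcd/pki/server.crt
ETCD_KEY_FILE=/etc/etcd/pki/server.key
ETCD_TRUSTED_CA_FILE=/etc/etcd/pki/ca.crt
ETCD_CLIENT_CERT_AUTH=true
ETCD_PEER_CERT_FILE=/etc/etcd/pki/peer.crt
ETCD_PEER_KEY_FILE=/etc/etcd/pki/peer.key
ETCD_PEER_TRUSTED_CA_FILE=/etc/etcd/pki/peer-ca.crt
ETCD_PEER_CLIENT_CERT_AUTH=true
ETCD_QUOTA_BACKEND_BYTES=8589934592
ETCD_HEARTBEAT_INTERVAL=100
ETCD_ELECTION_TIMEOUT=1000
ETCD_AUTO_COMPACTION_RETENTION=1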

@chaochn47
Member

chaochn47 commented Mar 11, 2022

Hi @liuycsd, thanks for reporting this issue.

But the nodes might be under load and may run out of memory or disk space sometimes.

I found some slow-read logs mentioning those revisions and dozens of leader elections and switches during those several hours. Besides, the disk space (including the etcd data dir) of node z may have been full, and the operating system logs were cleared, so I don't know what happened exactly and I'm not sure whether this is related to the issue or not.

Looks like the data inconsistency triggers are

  • OOM
  • disk space full
  • lots of leader elections and switches during those several hours

Can you help us better understand how your cluster is set up?

  • I am wondering whether you have a monitoring system in place for this etcd cluster, so that you concluded there were likely out-of-memory or full-disk issues? If so, can you provide that information?
  • Do you have a periodic backup (https://etcd.io/docs/v3.5/op-guide/recovery/#snapshotting-the-keyspace)? It can have two benefits (a sketch follows this list):
    • Compare the db files from before and after the incident across the etcd cluster to troubleshoot further
    • Recover to a point-in-time good state where data was consistent
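
A sketch of such a periodic backup, with placeholder endpoint, cert paths and backup location:

# take a point-in-time snapshot from one member
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=ca.crt --cert=client.crt --key=client.key \
  snapshot save /backup/etcd-$(date +%Y%m%d-%H%M).db

# inspect a saved snapshot (revision, total keys, size)
etcdctl snapshot status /backup/etcd-20220308-0100.db -w table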

I found some slow-read logs mentioning those revisions and dozens of leader elections and switches during those several hours.

  • Any logs will be useful for us to help troubleshoot.

Kindly suggest:

  • Please use --experimental-corrupt-check-time, as it allows early detection, as @ptabor and @serathius said.
  • Use the latest etcd release, v3.5.2, as it includes an important data-inconsistency bug fix (55c16df); compare v3.5.1...v3.5.2.

@wilsonwang371
Contributor

I suggest you try 3.5.2 and see if this problem still happens. There was a data corruption issue fixed recently.

@liuycsd
Author

liuycsd commented Mar 12, 2022

Thanks.

That's just a simple environment for testing, with limited resources, and no monitoring or backups set up.
Someone deployed a heavy system on that k8s cluster and found it quite slow; after logging into the nodes they found they were out of resources, did some manual cleanup, and kubectl get pods showed that it seemed to work again.
But when I used that cluster several days later I found the inconsistency and took a snapshot AFTER that. No previous backup exists.

I tried to reproduce it in another 3.5.1 cluster, but it didn't occur. I set a low CPU and disk quota to simulate an overloaded system, so some leader elections were triggered randomly, and at the same time kept running several etcdctl compare/put txns (see the sketch below), but the revision inconsistency didn't occur.
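
For context, the compare/put transactions in that attempt were along these lines (a sketch with made-up keys and values, using etcdctl's non-interactive txn format on stdin: compares, then success requests, then failure requests, separated by blank lines):

etcdctl txn <<'EOF'
value("/test/key") = "old"

put /test/key new

put /test/conflict-marker detected
EOF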

I may try v3.5.2 and set up some basic monitoring and backups later to see whether or not it occurs again.

Please use --experimental-corrupt-check-time, as it allows early detection, as @ptabor and @serathius said.

Are there any suggestions about what duration I should use for the corrupt-check-time, and how can I know when it detects corruption? I didn't find detailed documentation about this.

@liuycsd liuycsd closed this as completed Mar 12, 2022
@liuycsd
Author

liuycsd commented Mar 12, 2022

Sorry for the accidental close.

@liuycsd liuycsd reopened this Mar 12, 2022
@liuycsd
Author

liuycsd commented Mar 14, 2022

Thanks @chaochn47 @wilsonwang371 and everyone else.

After checking the logs again, I found that a panic did occur on node y, and another panic occurred on node x two minutes later. From the restart logs after the panics, the revision loaded from the db file was near (~10000 later than) where the revision difference starts, based on comparing the dbs with bbolt keys. I don't know whether it panicked on node z because those logs were cleared.
It panicked at kvsToEvents in mvcc.syncWatchers.

@serathius
Member

Can you share the stacktraces?

@liuycsd
Author

liuycsd commented Mar 14, 2022

Can you share the stacktraces?

I don't have access to the cluster right now; I may paste them when it's available later.
IIRC, the stacktraces are like those in #12845.

@wilsonwang371
Contributor

can you check if #13505 is the root cause?

@liuycsd
Author

liuycsd commented Mar 16, 2022

can you check if #13505 is the root cause?

How could I verify that?

The stacktraces are similar.
On node y:

{"level":"panic","ts":"...","caller":"mvcc/watchable_store.go:414","msg":"failed to unmarshal mvccpb.KeyValue","error":"proto: KeyValue: illegal tag 0 (wire type 7)","stacktrace":"go.etcd.io/etcd/server/v3/mvcc.kvsToEvents\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222"}
panic: failed to unmarshal mvccpb.KeyValue
goroutine 173 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00f971800, 0xc01238bb00, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000614870, 0x12268a2, 0x23, 0xc01238bb00, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/mvcc.kvsToEvents(0xc000614870, 0xc0000fc720, 0xc00eb85000, 0x13c, 0x200, 0xc00eb82000, 0x13c, 0x200, 0x12, 0x7fffffffffffffff, ...)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414 +0x296
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers(0xc0000fc6e0, 0x0)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359 +0x3e8
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop(0xc0000fc6e0)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222 +0x1e8
created by go.etcd.io/etcd/server/v3/mvcc.newWatchableStore
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:95 +0x3a7
etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

On node x:

{"level":"panic","ts":"...","caller":"mvcc/watchable_store.go:414","msg":"failed to unmarshal mvccpb.KeyValue","error":"proto: KeyValue: illegal tag 0 (wire type 0)","stacktrace":"go.etcd.io/etcd/server/v3/mvcc.kvsToEvents\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359\ngo.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop\n\t/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222"}
panic: failed to unmarshal mvccpb.KeyValue
goroutine 209 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00015d5c0, 0xc00841ffc0, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc0005e2d20, 0x12268a2, 0x23, 0xc00841ffc0, 0x1, 0x1)
	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/mvcc.kvsToEvents(0xc0005e2d20, 0xc00042c360, 0xc0070cf000, 0x14c, 0x200, 0xc0070cc000, 0x14c, 0x200, 0x12, 0x7fffffffffffffff, ...)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:414 +0x296
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchers(0xc00042c320, 0x0)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:359 +0x3e8
go.etcd.io/etcd/server/v3/mvcc.(*watchableStore).syncWatchersLoop(0xc00042c320)
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:222 +0x1e8
created by go.etcd.io/etcd/server/v3/mvcc.newWatchableStore
	/tmp/etcd-release-3.5.1/etcd/release/etcd/server/mvcc/watchable_store.go:95 +0x3a7
etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

@wilsonwang371
Contributor

wilsonwang371 commented Mar 16, 2022

You can use 3.5.2 and see if you can reproduce this issue. From the stack trace, it seems like #13505 fixed this issue and it is available in 3.5.2. @liuycsd

@wilsonwang371
Contributor

(quoting the stacktraces from the previous comment)

do you have any update on this?

@serathius
Member

Please use --experimental-corrupt-check-time, as it allows early detection, as @ptabor and @serathius said.

Are there any suggestions about what duration I should use for the corrupt-check-time, and how can I know when it detects corruption? I didn't find detailed documentation about this.

There is some performance overhead expected from running it; I would recommend running it every couple of hours.
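
For example, something along these lines (the 4h period is just one illustration of "every couple of hours", added on top of the existing configuration):

etcd --experimental-corrupt-check-time=4h \
     ...   # the rest of the existing flags/environment

For reference, when the periodic check finds a hash mismatch etcd raises a CORRUPT alarm, so etcdctl alarm list is one place to watch for it.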

@serathius
Member

I'm still investigating the data corruption issues that were reported for v3.5 and have not been reproduced. Based on this issue I have been looking into whether #13505 could be the cause of them. I would like to confirm that we can reproduce data corruption on v3.5.0 by triggering the case from #13505, to then verify that v3.5.2 has fixed the data corruption.

However, based on my discussion with @ptabor, we don't see how a mis-referenced bbolt page could cause the corruption. It could still be caused by a generic segfault; however, that makes it even harder to reproduce.

As discussed at the last community meeting, this problem was found by Bytedance. @wilsonwang371, could you shed some light on how this issue was detected and whether it was accompanied by data corruption?

@wilsonwang371
Contributor

I'm still investigating the data corruption issues that were reported for v3.5 and have not been reproduced. Based on this issue I have been looking into whether #13505 could be the cause of them. I would like to confirm that we can reproduce data corruption on v3.5.0 by triggering the case from #13505, to then verify that v3.5.2 has fixed the data corruption.

However, based on my discussion with @ptabor, we don't see how a mis-referenced bbolt page could cause the corruption. It could still be caused by a generic segfault; however, that makes it even harder to reproduce.

As discussed at the last community meeting, this problem was found by Bytedance. @wilsonwang371, could you shed some light on how this issue was detected and whether it was accompanied by data corruption?

We found data inconsistency in our internal etcd clusters. This was found in 3.4, but it is happening more often after we started using 3.5, which has better performance.

When this happens, a solution is to take down the erroring node for some time and bring it back later. This forces a data resync inside the cluster, and the problem is gone.
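
One common way to force such a full resync (a general sketch, not necessarily the exact procedure used here; member names, IDs and paths are placeholders):

# on a healthy member: drop the inconsistent member
etcdctl member remove <MEMBER_ID>
# on the affected node: wipe its local state
rm -rf /var/lib/etcd/member
# re-add it, then start etcd on that node with --initial-cluster-state=existing
etcdctl member add node-z --peer-urls=https://z:2380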

However, when this happens, no alert or special message is given, which makes it hard for the SRE team to detect. Luckily, they were able to dump the logs, and I found a panic within the time range in which we found the issue. Other than that, no other error messages were discovered. The content of the panic is very similar to the one described in this thread: basically, somewhere in kvsToEvents(), a panic was triggered.

With some investigation, we believe the panic is an error either from reading boltdb or from reading memory. From my point of view, it is data corruption inside boltdb.

Later, in #13505, we found that boltdb was accessed without lock protection. Without the proper lock held, boltdb data can be corrupted, and this can definitely trigger data corruption in etcd. That's why I believe that patch should be able to fix this data corruption issue.

@PaulFurtado

We encountered this corruption in one kube etcd cluster running etcd 3.5.1 yesterday morning and then upgraded all of our clusters to 3.5.2 today because we were going to be doing a rolling-restart of the kube api servers, which causes dangerously high memory spikes on etcd.

We faced etcd OOMs in two of our clusters when performing this rolling update. Both of those etcd clusters were running 3.5.2 and both experienced this corruption. The kube symptom of this is that querying various objects returned different results, probably depending on which etcd server was hit. This resulted in errors when listing objects like:

Error from server: etcdserver: mvcc: required revision is a future revision

and cases where querying the same object multiple times in a row alternated between a recent state and one from 10+ minutes before.

On the etcd side, the most obvious symptom was the wildly different database sizes between members.

  • On the first cluster, which hit this issue yesterday while still running 3.5.1, we attempted a defrag on the single member that had a different DB size; the defrag repeatedly failed, so we shut the member down and the kube cluster immediately returned to a happy state.
  • On the two clusters that hit this issue today while running 3.5.2, most of the members had different db sizes, e.g. one showing 1.7GB, two showing 1.6GB, two showing 1.8GB. Defrags succeeded without problems but did not solve anything. We attempted to use the new --experimental-corrupt-check-time flag on all of them and restart, but it didn't find any corruption, and further defrags continued to succeed but did not solve the issue. The only way to solve it was to walk through each member and shut them down long enough that they did a full sync on startup.

So, I can confirm that corruption problems definitely still exist in 3.5.2, and that the experimental corruption check flag may need some more work to detect it.

The servers in the first two cases were i3en.2xlarge ec2 instances (8 CPU, 64GiB memory, high iops nvme SSDs). The servers in the final case were i3en.6xlarge (24CPU, 192GiB memory). For an idea of load, every member in that i3en.6xlarge cluster pretty consistently does 8-12gbps of etcd traffic, and the 2xlarge clusters around 3-5gbps. It seems like our kube workload reproduces it after just a couple of etcd OOMs, so a reproducer could possibly just be issuing kill -9 to the etcd members repeatedly while performing a stress test until the db sizes vary widely, although the system behavior under memory pressure could also be relevant (heavily thrashing io causing the server to come to a crawl and leader elections occurring just prior to OOM).
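
A rough sketch of that kind of reproducer (placeholder endpoints; the write load itself would come from the kube workload or a benchmark tool, and the restart is assumed to be handled by systemd or a similar supervisor):

# repeatedly SIGKILL the local etcd member while the cluster is under heavy write load,
# then compare revisions and db sizes across members to spot divergence
while true; do
  kill -9 $(pgrep -x etcd)       # simulate an OOM kill of this member
  sleep 60                       # give the supervisor time to restart it and let it catch up
  etcdctl --endpoints=https://x:2379,https://y:2379,https://z:2379 \
    endpoint status -w table     # revisions and db sizes should stay consistent
done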

Please let me know if any additional information would be valuable for you here.

@serathius
Member

serathius commented Mar 24, 2022

Thanks for sharing, @PaulFurtado, this is hugely informative. There are 3 important pieces of information:

  • Corruption issues are still present in v3.5.2
  • The corruption check cannot detect them
  • We know under what circumstances the issue can be reproduced.

I will look into reproducing this myself. A couple of questions about the corruption check: it is possible that the corruption check ran but didn't report an issue due to high WAL latency of the followers. To get more details about why it isn't working, can you:

  • Provide some of the corruption check logs. They can be recognized by their source file location being corrupt.go.
  • Try enabling -experimental-initial-corrupt-check. It enables a corruption check when a member joins an existing cluster; this should always check the hash difference between members.

It would also be good to know whether you have seen the same problem with v3.4.x. If you have an environment for reproducing the issue, it would be good to confirm that the problem only affects v3.5. It would also be useful to rule out that the problem is with any optional v3.5 feature; can you provide the etcd configuration you use?

@PaulFurtado

@serathius sorry, I had mentioned the --experimental-corrupt-check-time flag, but I actually meant the --experimental-initial-corrupt-check flag. When we believed it to be corrupt, we enabled the flag and this is what was logged upon restart:

{"level":"info","ts":"2022-03-23T21:49:02.865Z","caller":"etcdserver/corrupt.go:46","msg":"starting initial corruption check","local-member-id":"d9ccd33e39ffddb","timeout":"7s"}
{"level":"info","ts":"2022-03-23T21:49:06.242Z","caller":"etcdserver/corrupt.go:116","msg":"initial corruption checking passed; no corruption","local-member-id":"d9ccd33e39ffddb"}

We have never experienced this issue with 3.4.x (or if we did, it must have been extremely subtle and never caused us pain). Under 3.4.x we experienced OOM kills constantly and faced so many outages due to OOMs that I would expect we definitely would have noticed it on 3.4.x. We have been running on 3.5.x for a bit over a month now, and the first time we hit an OOM condition we immediately hit this corruption, so it definitely seems like a 3.5 issue.

These are the exact flags from one of the i3en.6xlarge servers:

--logger zap
--log-outputs stderr
--log-level info
--listen-peer-urls https://0.0.0.0:2380
--listen-client-urls https://0.0.0.0:2379
--advertise-client-urls https://172.16.60.166:2379
--data-dir /usr/share/hubspot/etcd/data
--peer-trusted-ca-file /etc/pki/tls/certs/etcd-peer-ca.crt
--peer-cert-file /etc/pki/tls/certs/etcd-peer.crt
--peer-key-file /etc/pki/tls/private/etcd-peer.key
--peer-client-cert-auth
--trusted-ca-file /etc/pki/tls/certs/etcd-client-ca.crt
--client-cert-auth
--cert-file /etc/pki/tls/certs/etcd-client.crt
--key-file /etc/pki/tls/private/etcd-client.key
--auto-compaction-retention 1h
--heartbeat-interval 100
--election-timeout 15000
--quota-backend-bytes 17179869184
--snapshot-count 100000
--max-request-bytes 15728640
--enable-v2
--experimental-peer-skip-client-san-verification

I also doubt that it's relevant, but we run with these env vars as well:

ETCD_NAME=na1-brave-street
ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_INITIAL_CLUSTER=default=http://localhost:2380
ETCD_INITIAL_CLUSTER_TOKEN=kube-iad02
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://172.16.60.166:2380
GODEBUG=x509ignoreCN=0

I will also mention that we are running kubernetes 1.21 and we are aware that it is intended to talk to an etcd 3.4.x cluster. However, we upgraded to etcd 3.5 because it is impossible to run etcd with stable latency and memory usage on 3.4.x with kube clusters of our size. Is it possible that something about the 3.4.x client could be corrupting the 3.5.x server?

@chaochn47
Copy link
Member

chaochn47 commented Mar 25, 2022

I am not able to reproduce it with a 3-node distributed etcd cluster backing a kubernetes 1.22 cluster.

The reproduction setup generates load from a standby node using benchmark put, range and txn-mixed. The 3 nodes keep getting oom_reaped and recover from the local disk, and the revision is monotonically increasing across the cluster. The revision divergence is minimal and slow nodes eventually catch up.
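
For reference, the load described here looks roughly like this with etcd's benchmark tool (endpoints and flag values are illustrative, not the ones actually used):

# sustained writes
benchmark put --endpoints=https://x:2379,https://y:2379,https://z:2379 \
  --clients=100 --conns=100 --key-size=256 --val-size=1024 --total=1000000

# sustained reads of a key
benchmark range /registry/health --endpoints=https://x:2379,https://y:2379,https://z:2379 \
  --clients=100 --conns=100 --total=1000000

# plus mixed read/write transactions via the txn-mixed subcommand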

@PaulFurtado

  • Would you mind checking one more time whether creating a 1.21 cluster with a fresh v3.5.2 etcd cluster can reproduce the issue?
  • Does my reproduction mentioned above miss anything from your observations?

Thanks!

On my end, I will try to rerun the reproduction process with the etcd cluster upgrading/downgrading between 3.5 and 3.4.

@serathius
Member

serathius commented Mar 25, 2022

I think I reproduced the issue; it was surprisingly easy once I knew where to look. Based on @PaulFurtado's comment, I looked into simulating a highly stressed etcd and sending a SIGKILL signal to members one by one. When looking into our functional tests, I found it very strange that we don't already have such a test. We have tests with SIGTERM_* and SIGQUIT_AND_REMOVE_FOLLOWER, but we don't simply test whether the database is correctly restored after an unrecoverable error.

I have added new tests (SIGKILL_FOLLOWER, SIGKILL_LEADER) and increased the stress qps. This was enough to cause data inconsistency. As the functional tests run with --experimental-initial-corrupt-check, the killed member fails to join with the message checkInitialHashKV failed. This doesn't answer the question of how this check was not triggered in @PaulFurtado's case, but it should be enough to show there is a problem.

To make the results repeatable, I have modified the functional tests to inject the failure repeatedly for some time. I managed to get a 100% chance of reproduction for both test scenarios with 8000 qps within 1 minute of running. The issue seems to happen only in higher-qps scenarios, with a lower chance of reproduction at 4000 qps and no reproductions at 2000 qps.

With that, I looked into using the same method to test the v3.4.18 release. I didn't manage to get any corruption even when running for 10 minutes at 8000 qps. I didn't test with higher qps as this is the limit of my workstation; however, this should be enough to confirm that the issue only affects v3.5.x.

I'm sharing the code I used for reproduction in #13838. I will look into root-causing the data inconsistency first, and later into redesigning our functional tests, as they do not seem to be fulfilling their function.

TeddyAndrieux added a commit to scality/metalk8s that referenced this issue Jul 27, 2022
NOTE: This new version contains the fix
for etcd-io/etcd#13766 so we no longer need
the `--experimental-initial-corrupt-check=true` flag
serathius added commits to serathius/etcd that referenced this issue Jan 13–17, 2023