Removed etcd member failed to stop on stuck disk #14338

Open · chaochn47 opened this issue Aug 12, 2022 · 11 comments

chaochn47 (Member) commented Aug 12, 2022

What happened?

etcd failed to stop and was stuck in the stopping state after it was removed from the cluster membership. It became unresponsive to any requests sent to it.

What did you expect to happen?

I expect etcd to be able to gracefully terminate itself.

How can we reproduce it (as minimally and precisely as possible)?

This was observed during an availability zone outage. It can be reproduced roughly as follows:

  1. Provision a 3-member cluster.
  2. Inject a delay into disk fsync operations (in our setup, disk operations went through a special network link); see the sketch after this list.
  3. Remove this member right after.

Here is a similar reproduction, #13527, but it does not include the member-removal fault injection.
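
For reference, a rough Go sketch of steps 2-3, following the gofail-based approach used later in this thread rather than the original network-link fault. It assumes a failpoints-enabled build (FAILPOINTS=1) with the failpoint HTTP endpoint on 127.0.0.1:1234; the endpoints and member ID below are placeholders.

// Sketch: inject a stuck "disk" via the raftAfterSave failpoint, then remove the member.
package main

import (
	"context"
	"fmt"
	"net/http"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Step 2: make the raft loop sleep right after the WAL save, simulating a stuck fsync.
	req, err := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:1234/etcdserver/raftAfterSave",
		strings.NewReader("sleep(600000)"))
	if err != nil {
		panic(err)
	}
	fpResp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	fpResp.Body.Close()

	// Step 3: remove the stuck member right after, via the two healthy members.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:22379", "http://127.0.0.1:32379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	mresp, err := cli.MemberRemove(ctx, 0x8211f1d0f64f3269) // placeholder member ID
	if err != nil {
		panic(err)
	}
	fmt.Println("removed member; cluster now has", len(mresp.Members), "members")
}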

Anything else we need to know?

{
    "level": "info",
    "ts": "2022-07-10T17:33:16.789Z",
    "caller": "raft/raft.go:859",
    "msg": "f858bb9b873f65a1 [term: 4] received a MsgApp message with higher term from fb4a15d865627b24 [term: 5]"
}

{
    "level": "info",
    "ts": "2022-07-10T17:33:16.789Z",
    "caller": "raft/raft.go:700",
    "msg": "f858bb9b873f65a1 became follower at term 5"
}

{
    "level": "info",
    "ts": "2022-07-10T17:33:16.789Z",
    "caller": "raft/node.go:327",
    "msg": "raft.node: f858bb9b873f65a1 changed leader from f858bb9b873f65a1 to fb4a15d865627b24 at term 5"
}

{
    "level": "warn",
    "ts": "2022-07-10T17:34:08.643Z",
    "caller": "etcdserver/server.go:1101",
    "msg": "server error",
    "error": "the member has been permanently removed from the cluster"
}

{
    "level": "warn",
    "ts": "2022-07-10T17:34:08.643Z",
    "caller": "etcdserver/server.go:1102",
    "msg": "data-dir used by this member must be removed"
}

etcd: /health error; QGET failed etcdserver: server stopped (status code 503)

{
    "level": "info",
    "ts": "2022-07-10T17:34:16.007Z",
    "caller": "traceutil/trace.go:145",
    "msg": "trace[1220767039] range",
    "detail": "{range_begin:/registry/health; range_end:; }",
    "duration": "2.000149186s",
    "start": "2022-07-10T17:34:14.006Z",
    "end": "2022-07-10T17:34:16.007Z",
    "steps": [
        "trace[1220767039] 'agreement among raft nodes before linearized reading'  (duration: 2.000062825s)"
    ]
}

Many more "apply request took too long" warnings with "error":"context canceled" followed, and they continued for almost 2 hours.

{
    "level": "warn",
    "ts": "2022-07-10T19:22:09.843Z",
    "caller": "wal/wal.go:808",
    "msg": "slow fdatasync",
    "took": "1h48m54.55408532s",
    "expected-duration": "1s"
}

 
The rafthttp pipelines then terminated:

{
    "level": "info",
    "ts": "2022-07-10T19:22:09.843Z",
    "caller": "rafthttp/pipeline.go:86",
    "msg": "stopped HTTP pipelining with remote peer",
    "local-member-id": "f858bb9b873f65a1",
    "remote-peer-id": "772211084901eb47"
}
{
    "level": "info",
    "ts": "2022-07-10T19:22:09.843Z",
    "caller": "rafthttp/pipeline.go:86",
    "msg": "stopped HTTP pipelining with remote peer",
    "local-member-id": "f858bb9b873f65a1",
    "remote-peer-id": "b9f9f861d890475a"
}

...

etcd server stopped with exit code 0

Etcd version (please run commands below)

$ etcd --version
v3.4.18

$ etcdctl version
v3.4.18

Etcd configuration (command line flags or environment variables)

{
    "level": "info",
    "ts": "2022-06-17T10:05:05.264Z",
    "caller": "embed/etcd.go:303",
    "msg": "starting an etcd server",
    "etcd-version": "3.4.18",
    "git-sha": "c1c4e2d91",
    "go-version": "go1.15.15",
    "go-os": "linux",
    "go-arch": "amd64",
    "max-cpu-set": 2,
    "max-cpu-available": 2,
    "member-initialized": false,
    "name": "zombie-node-name",
    "data-dir": "/mnt/xvdb/datadir",
    "wal-dir": "/mnt/xvdb/datadir/member/wal",
    "wal-dir-dedicated": "/mnt/xvdb/datadir/member/wal",
    "member-dir": "/mnt/xvdb/datadir/member",
    "force-new-cluster": false,
    "heartbeat-interval": "100ms",
    "election-timeout": "1s",
    "initial-election-tick-advance": true,
    "snapshot-count": 10000,
    "snapshot-catchup-entries": 5000,
    "initial-advertise-peer-urls": [
        "http://zombie-node-ip:2380"
    ],
    "listen-peer-urls": [
        "http://0.0.0.0:2380"
    ],
    "advertise-client-urls": [
        "http://zombie-node-ip:2379"
    ],
    "listen-client-urls": [
        "http://0.0.0.0:2379"
    ],
    "listen-metrics-urls": [],
    "cors": [
        "*"
    ],
    "host-whitelist": [
        "*"
    ],
    "initial-cluster": "node-a=http://node-a-ip:2380,node-b=http://node-b-ip:2380,zombie-node-name=http://zombie-node-ip:2380",
    "initial-cluster-state": "existing",
    "initial-cluster-token": "cd5b97cb-7577-4b42-8804-673b57d20f93",
    "quota-size-bytes": 8388608000,
    "pre-vote": true,
    "initial-corrupt-check": false,
    "corrupt-check-time-interval": "0s",
    "auto-compaction-mode": "periodic",
    "auto-compaction-retention": "0s",
    "auto-compaction-interval": "0s",
    "discovery-url": "",
    "discovery-proxy": ""
}

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

chaochn47 (Member Author) commented Aug 12, 2022

From reading the code, it seems the etcd server was stuck in

etcd/etcdserver/server.go

Lines 1020 to 1025 in 72d3e38

sched.Stop()
// wait for gouroutines before closing raft so wal stays open
s.wg.Wait()
s.SyncTicker.Stop()

It was stuck there. The v2 health check handler returned etcdserver: server stopped, which is a consequence of close(s.stopping) having already been called.

It looks like fsync was stuck in the middle. Working backwards, 2022-07-10T19:22:09.843Z minus 1h48m54.55408532s is around 2022-07-10T17:33:15.289Z, right around the leadership change logged above.

I expect that, despite disk failures/latency, close(s.done) will be called and the process will exit.

Any insights are much appreciated, thanks!

@chaochn47 chaochn47 changed the title etcd failed to stop on partial network outage Removed etcd member failed to stop on partial network outage Aug 15, 2022
@chaochn47 chaochn47 changed the title Removed etcd member failed to stop on partial network outage Removed etcd member failed to stop on stuck disk Aug 15, 2022
chaochn47 (Member Author) commented Aug 15, 2022

Okay. I got a repro working by injecting a sleep at the raftAfterSave failpoint, followed by member removal.

The fifo scheduler's Stop is the culprit; it gets stuck here:

func (f *fifo) Stop() {
	f.mu.Lock()
	f.cancel()
	f.cancel = nil
	f.mu.Unlock()
	<-f.donec
}

Will dig a little deeper to understand why.
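
For illustration, a minimal self-contained sketch (not the actual etcd scheduler) of how a Stop like this can hang: Stop cancels the context and then waits on donec, but donec is only closed when the in-flight job returns, and here the job is blocked on something that ignores the context (a stand-in for the stuck fdatasync).

package main

import (
	"context"
	"fmt"
	"time"
)

type fifo struct {
	cancel context.CancelFunc
	donec  chan struct{}
}

func newFIFO(job func(ctx context.Context)) *fifo {
	ctx, cancel := context.WithCancel(context.Background())
	f := &fifo{cancel: cancel, donec: make(chan struct{})}
	go func() {
		defer close(f.donec)
		job(ctx) // the "apply" job
	}()
	return f
}

func (f *fifo) Stop() {
	f.cancel()
	<-f.donec // hangs if the in-flight job never observes the cancellation
}

func main() {
	stuckDisk := make(chan struct{}) // never closed: stands in for a stuck fdatasync

	f := newFIFO(func(ctx context.Context) {
		<-stuckDisk // ignores ctx, just like the unconditional wait on apply.notifyc
	})

	done := make(chan struct{})
	go func() {
		f.Stop()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("stopped")
	case <-time.After(2 * time.Second):
		fmt.Println("Stop() is still blocked after 2s, as in the issue")
	}
}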

chaochn47 added a commit to chaochn47/etcd that referenced this issue Aug 15, 2022
chaochn47 (Member Author) commented:

The raft loop was stuck in the middle, and the apply routine is waiting for it:

etcd/etcdserver/server.go

Lines 1119 to 1122 in 72d3e38

// wait for the raft routine to finish the disk writes before triggering a
// snapshot. or applied index might be greater than the last index in raft
// storage, since the raft routine might be slower than apply routine.
<-apply.notifyc

I think the proper fix is to cancel this apply during shutdown, even if the disk write is stuck. Right now, the context is ignored:

etcd/etcdserver/server.go

Lines 1059 to 1060 in 72d3e38

f := func(context.Context) { s.applyAll(&ep, &ap) }
sched.Schedule(f)
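
To make that concrete, here is a standalone sketch of the proposed pattern (an illustration, not a tested patch): let the scheduled apply job honor its context, so shutdown can cancel the wait on the raft routine instead of blocking behind a stuck disk write. notifyc stands in for apply.notifyc, and the cancellation stands in for sched.Stop().

package main

import (
	"context"
	"fmt"
	"time"
)

// waitForRaftOrShutdown stands in for the unconditional `<-apply.notifyc` wait
// quoted above, made cancellable by the scheduler's context.
func waitForRaftOrShutdown(ctx context.Context, notifyc <-chan struct{}) error {
	select {
	case <-notifyc:
		return nil // raft routine finished its disk writes; safe to continue
	case <-ctx.Done():
		return ctx.Err() // shutdown requested while the disk write is stuck
	}
}

func main() {
	notifyc := make(chan struct{}) // never signalled: the raft routine is stuck on fsync

	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(500 * time.Millisecond)
		cancel() // stands in for sched.Stop() cancelling the scheduler context
	}()

	if err := waitForRaftOrShutdown(ctx, notifyc); err != nil {
		fmt.Println("apply cancelled during shutdown:", err)
	}
}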

Any thoughts? @ahrtr @serathius @spzala

ahrtr (Member) commented Aug 15, 2022

Just a quick question before I take a deep dive: can you reproduce this on release-3.5 or main?

chaochn47 (Member Author) commented Aug 15, 2022

I think I can, but I haven't tried it yet. Will report in a few minutes on release-3.5 or main with the reproduction script shared.

Here is one for v3.4.18: reproduce.txt

chaochn47 (Member Author) commented Aug 16, 2022

Here is one for v3.5.4: reproduce-3.5.txt

A side effect I observed:

Two leaders are reported at the same time, but there is actually only one leader!

$ etcdctl endpoint status --cluster -w table
+------------------------+------------------+---------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT        |        ID        | VERSION | STORAGE VERSION | DB SIZE | DB SIZE IN USE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
|  http://127.0.0.1:2379 | 8211f1d0f64f3269 |   3.5.4 |                 |   20 kB |          16 kB |     false |      false |         2 |          8 |                  8 |        |
| http://127.0.0.1:22379 | 91bc3c398fb3c146 |   3.5.4 |                 |   20 kB |          16 kB |      true |      false |         2 |          8 |                  8 |        |
| http://127.0.0.1:32379 | fd422379fda50e48 |   3.5.4 |                 |   20 kB |          16 kB |     false |      false |         2 |          8 |                  8 |        |
+------------------------+------------------+---------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
 
$ curl http://127.0.0.1:1234/etcdserver/raftAfterSave -XPUT -d'sleep(600000)'
 
$ etcdctl --endpoints http://127.0.0.1:22379,http://127.0.0.1:32379 member remove 8211f1d0f64f3269
Member 8211f1d0f64f3269 removed from cluster ef37ad9dc622a7c4

$ etcdctl endpoint status -w table --endpoints http://127.0.0.1:2379,http://127.0.0.1:22379,http://127.0.0.1:32379
+------------------------+------------------+---------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT        |        ID        | VERSION | STORAGE VERSION | DB SIZE | DB SIZE IN USE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------+------------------+---------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+
|  http://127.0.0.1:2379 | 8211f1d0f64f3269 |   3.5.4 |                 |   20 kB |          16 kB |      true |      false |         2 |          8 |                  8 |        |
| http://127.0.0.1:22379 | 91bc3c398fb3c146 |   3.5.4 |                 |   20 kB |          16 kB |     false |      false |         3 |         10 |                 10 |        |
| http://127.0.0.1:32379 | fd422379fda50e48 |   3.5.4 |                 |   20 kB |          16 kB |      true |      false |         3 |         10 |                 10 |        |
+------------------------+------------------+---------+-----------------+---------+----------------+-----------+------------+-----------+------------+--------------------+--------+

To elaborate, the symptom we observed is that a stale watch connection was not cleaned up on member removal (as it is supposed to be), so the client cache was always outdated...


Note: FAILPOINTS=1 ./build needs to be updated to FAILPOINTS=1 ./build.sh on release-3.5.

ahrtr (Member) commented Aug 18, 2022

It might not be safe to forcibly terminate the applying workflow.

The most feasible solution for now is to log repeatedly at server.go#L930 and server.go#L978, so as to provide more visibility into the issue.
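
A minimal sketch of that repeated-logging idea (an illustration, not the actual etcd change): wrap the long blocking wait so a warning is emitted at a fixed interval until it completes; the interval and message below are assumptions.

package main

import (
	"log"
	"time"
)

// waitWithWarning blocks until done is closed, logging msg at every interval so
// a stuck shutdown is visible in the logs instead of silently hanging.
func waitWithWarning(done <-chan struct{}, interval time.Duration, msg string) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	start := time.Now()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			log.Printf("%s (waited %v so far)", msg, time.Since(start).Round(time.Second))
		}
	}
}

func main() {
	stuck := make(chan struct{}) // stands in for goroutines blocked on a stuck disk
	go func() {
		time.Sleep(3500 * time.Millisecond)
		close(stuck)
	}()
	waitWithWarning(stuck, time.Second, "still waiting for apply/raft goroutines to exit")
}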

ahrtr (Member) commented Aug 19, 2022

Please note that etcdctl endpoint status isn't an atomic operation; it gets each member's status one by one. If there is a leader change in between, you may end up seeing two leaders in the output, so it can be expected behavior. But you should only see this with very low probability, and it should be a temporary "issue".
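
For illustration, a small sketch of that mechanism using the standard clientv3 Status API (the endpoints are placeholders): each endpoint is queried one at a time, so every row reflects that member's own view at a slightly different moment, and the rows can disagree about who the leader is.

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoints := []string{"http://127.0.0.1:2379", "http://127.0.0.1:22379", "http://127.0.0.1:32379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		// Each Status call hits one endpoint; a leader change (or a member stuck
		// on an old term, as in this issue) between calls yields rows that disagree.
		st, err := cli.Status(ctx, ep)
		cancel()
		if err != nil {
			fmt.Printf("%s: %v\n", ep, err)
			continue
		}
		fmt.Printf("%s: leader=%x term=%d\n", ep, st.Leader, st.RaftTerm)
	}
}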

chaochn47 (Member Author) commented Aug 22, 2022

But you should only see this with very low probability, and it should be a temporary "issue"

Yeah, it only happens when disk I/O is stuck in the middle, which is usually caused by a data center outage.

FYI, we are deploying a fix to our local monitoring agent to forcibly stop the server once it has been removed from the membership.

However, IMHO this could have been done in etcd itself.
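
For reference, a rough sketch of the kind of agent-side watchdog described above (deployment tooling outside etcd; the endpoints, member ID, and PID below are placeholders): poll the member list from the healthy members and forcibly kill the local etcd process once its ID is gone.

package main

import (
	"context"
	"log"
	"syscall"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func removedFromCluster(ctx context.Context, cli *clientv3.Client, localID uint64) (bool, error) {
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return false, err
	}
	for _, m := range resp.Members {
		if m.ID == localID {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	const localID = 0x8211f1d0f64f3269 // placeholder: the local member's ID
	const etcdPID = 12345              // placeholder: PID of the local etcd process

	cli, err := clientv3.New(clientv3.Config{
		// Ask the other members; the local one is unresponsive.
		Endpoints:   []string{"http://127.0.0.1:22379", "http://127.0.0.1:32379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for range time.Tick(10 * time.Second) {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		removed, err := removedFromCluster(ctx, cli, localID)
		cancel()
		if err != nil {
			log.Printf("member list failed: %v", err)
			continue
		}
		if removed {
			log.Printf("local member removed from cluster; sending SIGKILL to etcd (pid %d)", etcdPID)
			_ = syscall.Kill(etcdPID, syscall.SIGKILL)
			return
		}
	}
}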

serathius (Member) commented:

This looks more like a feature request to force etcd to shut down after being removed from the cluster. My first thought is that this should be part of the admin operation to kill etcd on disk failure.

HuangQAQ commented:

@chaochn47 Has this issue been resolved? I also encountered abnormal fluctuations in etcd, similar to your situation.
