disk write failed and network partitioned leader was not able to step down to follower #13527

chaochn47 · 2021-12-08T23:40:50Z

Hi etcd community:

One of our production 3-member etcd clusters had two leaders at the same time. During this event, the droplet (the physical host on which the EC2 instances run) behind the old leader (A) was unavailable, and the associated EBS volume failed to serve WAL fsync.

Ideally, the MsgCheckQuorum raft message should be raised once the election timeout elapses, and RecentActive should be false for every peer from which no recent heartbeat response or MsgAppResp has reached the leader.

etcd/raft/raft.go

Lines 1000 to 1020 in 161bf7e

    
    case pb.MsgCheckQuorum:
        // The leader should always see itself as active. As a precaution, handle
        // the case in which the leader isn't in the configuration any more (for
        // example if it just removed itself).
        //
        // TODO(tbg): I added a TODO in removeNode, it doesn't seem that the
        // leader steps down when removing itself. I might be missing something.
        if pr := r.prs.Progress[r.id]; pr != nil {
            pr.RecentActive = true
        }
        if !r.prs.QuorumActive() {
            r.logger.Warningf("%x stepped down to follower since quorum is not active", r.id)
            r.becomeFollower(r.Term, None)
        }
        // Mark everyone (but ourselves) as inactive in preparation for the next
        // CheckQuorum.
        r.prs.Visit(func(id uint64, pr *tracker.Progress) {
            if id != r.id {
                pr.RecentActive = false
            }
        })
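
To make the expected behavior concrete, here is a minimal sketch of the same three steps on a plain map. It is not the raft library's code: `quorumActive` and `checkQuorum` are hypothetical helpers, every member is assumed to be a voter, and the majority rule is simplified.

```go
package main

import "fmt"

// quorumActive reports whether a strict majority of members is marked as
// recently active. Hypothetical simplification: all members are voters.
func quorumActive(recentActive map[uint64]bool) bool {
	active := 0
	for _, a := range recentActive {
		if a {
			active++
		}
	}
	return active >= len(recentActive)/2+1
}

// checkQuorum mirrors the steps of the snippet above: mark the leader itself
// as active, decide whether to step down, then reset all peers to inactive in
// preparation for the next check.
func checkQuorum(self uint64, recentActive map[uint64]bool) (stepDown bool) {
	if _, ok := recentActive[self]; ok {
		recentActive[self] = true
	}
	stepDown = !quorumActive(recentActive)
	for id := range recentActive {
		if id != self {
			recentActive[id] = false
		}
	}
	return stepDown
}

func main() {
	// Leader 1 heard from peer 2 but not from peer 3 in the last window.
	members := map[uint64]bool{1: true, 2: true, 3: false}
	fmt.Println(checkQuorum(1, members)) // false: 2 of 3 members are active

	// Network partition: no peer responds before the next check.
	fmt.Println(checkQuorum(1, members)) // true: only the leader is active
}
```

In this model a partitioned leader steps down at the second check, which is the outcome expected here, provided the check is actually run.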

However, r.tick() is never called, because the disk write stalls inside the raft Ready handling branch of the same select loop.

etcd/etcdserver/raft.go

Lines 170 to 173 in 161bf7e

    
    select {
    case <-r.ticker.C:
        r.tick()
    case rd := <-r.Ready():
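
The starvation can be illustrated with a small, deterministic model. This is a sketch under stated assumptions, not etcd code: `loopModel`, `clockTick`, and `step` are hypothetical names, the tick channel has capacity 1 and drops ticks for a slow receiver (as `time.Ticker` is documented to do), and the blocked WAL write is modeled by a `save` callback during which the clock keeps ticking.

```go
package main

import "fmt"

// loopModel is a hypothetical, simplified stand-in for the select loop above.
type loopModel struct {
	tickC   chan struct{} // capacity 1: unread ticks are dropped, not queued
	readyC  chan struct{}
	elapsed int // ticks produced by the clock
	handled int // ticks the loop actually processed
}

func newLoopModel() *loopModel {
	return &loopModel{
		tickC:  make(chan struct{}, 1),
		readyC: make(chan struct{}, 1),
	}
}

// clockTick models the ticker firing: the tick is dropped if the previous one
// has not been consumed yet.
func (m *loopModel) clockTick() {
	m.elapsed++
	select {
	case m.tickC <- struct{}{}:
	default:
	}
}

// step runs one iteration of the loop. save models the blocking disk write in
// the Ready branch. It returns false when there is nothing to do.
func (m *loopModel) step(save func()) bool {
	select {
	case <-m.tickC:
		m.handled++
	case <-m.readyC:
		save()
	default:
		return false
	}
	return true
}

func main() {
	m := newLoopModel()
	m.readyC <- struct{}{}

	// The Ready branch is stuck in save while the clock fires 10 times.
	m.step(func() {
		for i := 0; i < 10; i++ {
			m.clockTick()
		}
	})
	// Once save returns, drain whatever is left.
	for m.step(func() {}) {
	}
	fmt.Printf("clock ticks: %d, ticks handled: %d\n", m.elapsed, m.handled)
	// prints "clock ticks: 10, ticks handled: 1"
}
```

If the election timeout were 10 ticks, the loop in this model would have counted only one of them, so the quorum check would not have been reached.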

This can be easily reproduced as follows:

Isolated old Leader A is still the leader from its point of view

# run A with the failpoint enabled
[root@ip-10-0-61-148 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | a320313566418492 |  3.4.18 |   20 kB |      true |      false |        35 |    1741005 |            1741005 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@ip-10-0-61-148 bin]# curl http://127.0.0.1:1234/go.etcd.io/etcd/etcdserver/raftAfterSave -XPUT -d'sleep(600000)'
[root@ip-10-0-61-148 bin]# iptables -A INPUT -s 10.0.123.82 -j DROP && iptables -A INPUT -s 10.0.171.218 -j DROP
[root@ip-10-0-61-148 bin]# curl -sL http://localhost:2379/metrics | grep "is_leader"
# HELP etcd_server_is_leader Whether or not this member is a leader. 1 if is, 0 otherwise.
# TYPE etcd_server_is_leader gauge
etcd_server_is_leader 1
[root@ip-10-0-61-148 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | a320313566418492 |  3.4.18 |   20 kB |      true |      false |        35 |    1741050 |            1741050 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Follower B (after the network partition is injected, it becomes the leader)

[root@ip-10-0-171-218 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 6e05b88806758f58 |  3.4.17 |   20 kB |     false |      false |        35 |    1741005 |            1741005 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@ip-10-0-171-218 bin]# iptables -A INPUT -s 10.0.61.148 -j DROP
[root@ip-10-0-171-218 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 6e05b88806758f58 |  3.4.17 |   20 kB |      true |      false |        36 |    1741062 |            1741062 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Follower C

etcdctl -w table endpoint status
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |       ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 70ddedef0fd6218 |  3.4.17 |   20 kB |     false |      false |        35 |    1741005 |            1741005 |        |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@ip-10-0-123-82 bin]# iptables -A INPUT -s 10.0.61.148 -j DROP
[root@ip-10-0-123-82 bin]# etcdctl -w table endpoint status
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |       ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 70ddedef0fd6218 |  3.4.17 |   20 kB |     false |      false |        36 |    1741058 |            1741058 |        |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Questions:

1. Do r.tick() and rd := <-r.Ready() have to be handled mutually exclusively? If not, we could move the rd := <-r.Ready() handling into its own infinite loop running as a background goroutine.
2. Is the above behavior expected from a raft design perspective?
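
For discussion, the separation proposed in the first question could look roughly like the sketch below. It is only a model: `runSeparated`, `onTick`, and `save` are hypothetical names, and it deliberately ignores whether the real tick and Ready paths share state that would need synchronization, which is exactly what the question asks.

```go
package main

import (
	"fmt"
	"sync"
)

// runSeparated handles ticks and Ready batches in two independent goroutines,
// so a save that blocks does not delay tick processing. Each goroutine exits
// when its channel is closed.
func runSeparated(ticks, ready <-chan struct{}, onTick, save func()) *sync.WaitGroup {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for range ticks {
			onTick()
		}
	}()
	go func() {
		defer wg.Done()
		for range ready {
			save()
		}
	}()
	return &wg
}

func main() {
	ticks := make(chan struct{})
	ready := make(chan struct{})
	release := make(chan struct{}) // closed to unblock the stuck save

	handled := 0 // written only by the tick goroutine, read after wg.Wait
	wg := runSeparated(ticks, ready,
		func() { handled++ },
		func() { <-release }, // models a stalled WAL fsync
	)

	ready <- struct{}{} // the Ready goroutine is now stuck in save
	for i := 0; i < 10; i++ {
		ticks <- struct{}{} // still accepted while save is blocked
	}

	close(release)
	close(ticks)
	close(ready)
	wg.Wait()
	fmt.Println("ticks handled while save was blocked:", handled)
	// prints "ticks handled while save was blocked: 10"
}
```

In this model all 10 ticks are processed even though the Ready handler never returns until released, so a tick-driven quorum check would still have a chance to run.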

@gyuho @ptabor @serathius @hexfusion @wilsonwang371 PTAL, thx!


stale · 2022-03-12T23:15:12Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale · 2022-06-12T23:11:03Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

chaochn47 · 2023-07-28T23:14:01Z

xref.

chaochn47 changed the title ~~network partitioned leader was not able to step down to follower~~ disk write failed and network partitioned leader was not able to step down to follower Dec 8, 2021

stale bot added the stale label Mar 12, 2022

serathius removed the stale label Mar 14, 2022

stale bot added the stale label Jun 12, 2022

serathius added type/bug and removed stale labels Jun 13, 2022

chaochn47 mentioned this issue Aug 12, 2022

Removed etcd member failed to stop on stuck disk #14338

Open

ahrtr added the stage/tracked label Aug 15, 2022

serathius added the release/v3.4 label Nov 24, 2022

chaochn47 mentioned this issue Feb 13, 2023

All leases are revoked when the etcd leader is stuck in handling raft Ready due to slow fdatasync or high CPU. #15247

Closed

This was referenced Aug 2, 2023

Livez/Readyz #16007

Open

Introduce grpc health check in etcd client #16276

Open

chaochn47 mentioned this issue Oct 29, 2023

etcdserver/raft.go: separate raft tick and ready #16847

Closed
