disk write failed and network partitioned leader was not able to step down to follower #13527

Open
chaochn47 opened this issue Dec 8, 2021 · 3 comments

chaochn47 commented Dec 8, 2021

Hi etcd community:

One of our production 3-member etcd clusters had two leaders at the same time. During this event, the droplet (the physical host on which the EC2 instances run) behind the old leader (A) was unavailable, and the associated EBS volume failed to serve WAL fsync.

Ideally, the MsgCheckQuorum raft message should be raised once the election timeout elapses, and RecentActive should be false for every peer from which no recent heartbeat response or MsgAppResp has reached the leader, so that the leader steps down.

etcd/raft/raft.go

Lines 1000 to 1020 in 161bf7e

case pb.MsgCheckQuorum:
	// The leader should always see itself as active. As a precaution, handle
	// the case in which the leader isn't in the configuration any more (for
	// example if it just removed itself).
	//
	// TODO(tbg): I added a TODO in removeNode, it doesn't seem that the
	// leader steps down when removing itself. I might be missing something.
	if pr := r.prs.Progress[r.id]; pr != nil {
		pr.RecentActive = true
	}
	if !r.prs.QuorumActive() {
		r.logger.Warningf("%x stepped down to follower since quorum is not active", r.id)
		r.becomeFollower(r.Term, None)
	}
	// Mark everyone (but ourselves) as inactive in preparation for the next
	// CheckQuorum.
	r.prs.Visit(func(id uint64, pr *tracker.Progress) {
		if id != r.id {
			pr.RecentActive = false
		}
	})
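
To make the step-down condition concrete, here is a minimal, self-contained sketch of a majority check over RecentActive flags. The `progress` type and `quorumActive` function are simplified stand-ins written for illustration, not the actual `tracker` package code, which also has to deal with learners and joint configurations.

```go
package main

import "fmt"

// progress is a simplified, hypothetical stand-in for tracker.Progress.
// Only the RecentActive flag matters for this illustration.
type progress struct {
	RecentActive bool
}

// quorumActive reports whether a strict majority of members is marked
// as recently active.
func quorumActive(prs map[uint64]*progress) bool {
	active := 0
	for _, pr := range prs {
		if pr.RecentActive {
			active++
		}
	}
	return active >= len(prs)/2+1
}

func main() {
	// Leader 1 always counts itself as active; peers 2 and 3 are silent.
	prs := map[uint64]*progress{
		1: {RecentActive: true},
		2: {RecentActive: false},
		3: {RecentActive: false},
	}
	fmt.Println(quorumActive(prs)) // false: 1 of 3 is not a majority

	// A single peer response is enough to restore the quorum.
	prs[2].RecentActive = true
	fmt.Println(quorumActive(prs)) // true: 2 of 3 is a majority
}
```

In the scenario described here, a partitioned leader in a 3-member cluster would see only itself as active, so this check would fail the next time MsgCheckQuorum is processed.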

However, the tick is never processed: the stalled disk write blocks the raft Ready handling logic, which runs in the same select loop as the ticker.

etcd/etcdserver/raft.go

Lines 170 to 173 in 161bf7e

select {
case <-r.ticker.C:
	r.tick()
case rd := <-r.Ready():

This can be easily reproduced as follows:

Isolated old Leader A is still the leader from its point of view

# run A with the failpoint enabled
[root@ip-10-0-61-148 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | a320313566418492 |  3.4.18 |   20 kB |      true |      false |        35 |    1741005 |            1741005 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@ip-10-0-61-148 bin]# curl http://127.0.0.1:1234/go.etcd.io/etcd/etcdserver/raftAfterSave -XPUT -d'sleep(600000)'
[root@ip-10-0-61-148 bin]# iptables -A INPUT -s 10.0.123.82 -j DROP && iptables -A INPUT -s 10.0.171.218 -j DROP
[root@ip-10-0-61-148 bin]# curl -sL http://localhost:2379/metrics | grep "is_leader"
# HELP etcd_server_is_leader Whether or not this member is a leader. 1 if is, 0 otherwise.
# TYPE etcd_server_is_leader gauge
etcd_server_is_leader 1
[root@ip-10-0-61-148 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | a320313566418492 |  3.4.18 |   20 kB |      true |      false |        35 |    1741050 |            1741050 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Follower B (becomes the leader after the network partition is injected)

[root@ip-10-0-171-218 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 6e05b88806758f58 |  3.4.17 |   20 kB |     false |      false |        35 |    1741005 |            1741005 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@ip-10-0-171-218 bin]# iptables -A INPUT -s 10.0.61.148 -j DROP
[root@ip-10-0-171-218 bin]# etcdctl -w table endpoint status
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 6e05b88806758f58 |  3.4.17 |   20 kB |      true |      false |        36 |    1741062 |            1741062 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Follower C

etcdctl -w table endpoint status
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |       ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 70ddedef0fd6218 |  3.4.17 |   20 kB |     false |      false |        35 |    1741005 |            1741005 |        |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@ip-10-0-123-82 bin]# iptables -A INPUT -s 10.0.61.148 -j DROP
[root@ip-10-0-123-82 bin]# etcdctl -w table endpoint status
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |       ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 70ddedef0fd6218 |  3.4.17 |   20 kB |     false |      false |        36 |    1741058 |            1741058 |        |
+----------------+-----------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Questions:

  • Must r.tick() and rd := <-r.Ready() be executed mutually exclusively? If not, we could move the rd := <-r.Ready() handling into its own infinite loop running as a background goroutine.
  • Is the above behavior expected from a raft design perspective?

@gyuho @ptabor @serathius @hexfusion @wilsonwang371 PTAL, thx!

@chaochn47 chaochn47 changed the title from "network partitioned leader was not able to step down to follower" to "disk write failed and network partitioned leader was not able to step down to follower" Dec 8, 2021

stale bot commented Mar 12, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 12, 2022
@serathius serathius removed the stale label Mar 14, 2022

stale bot commented Jun 12, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
