
raft: making check quorum behaves more like a leader lease #5451

Closed
wants to merge 2 commits

Conversation

swingbach
Contributor

According to section 6.4 of the thesis, there is a way to serve read-only queries efficiently without going through the raft log, but it still needs to communicate with the quorum, which is not efficient enough. Introducing a leader lease mechanism would make it more efficient.

However, a leader lease breaks the leaderTransfer case. I assume whoever sends the leaderTransfer message knows exactly what they are doing, so the leader lease shouldn't break leaderTransfer. I'm going to fix that in the next PR.
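
For context, here is a minimal sketch of the clock-based lease idea from section 6.4.1 (the `leaseLeader` type and its methods are invented for illustration; they are not part of this PR or of etcd/raft):

```go
// Package leasesketch illustrates the lease idea from section 6.4.1 of the thesis.
package leasesketch

import "time"

// leaseLeader is a toy model of a leader that extends a clock-based lease
// whenever a quorum acknowledges its heartbeats.
type leaseLeader struct {
	electionTimeout time.Duration
	leaseUntil      time.Time // local reads are allowed before this instant
}

// onQuorumHeartbeatAck is called once a majority has acknowledged a heartbeat
// that was broadcast at sentAt. No other server can become leader until
// roughly one election timeout after that broadcast, modulo clock drift.
func (l *leaseLeader) onQuorumHeartbeatAck(sentAt time.Time) {
	if until := sentAt.Add(l.electionTimeout); until.After(l.leaseUntil) {
		l.leaseUntil = until
	}
}

// canServeLocalRead reports whether a read-only query may be answered from
// the leader's local state machine without another round to the quorum.
func (l *leaseLeader) canServeLocalRead(now time.Time) bool {
	return now.Before(l.leaseUntil)
}
```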

@@ -565,6 +565,12 @@ func (r *raft) Step(m pb.Message) error {
	case m.Term > r.Term:
		lead := m.From
		if m.Type == pb.MsgVote {
			if r.state == StateFollower && r.checkQuorum && r.electionElapsed < r.electionTimeout {

Contributor

I think to correctly implement this, we also need pre-vote support.

Otherwise, a peer with a higher term will keep sending vote requests. The leader cannot settle that peer down, since the peer will always reject the leader's append requests because it has the higher term.

Pre-vote helps here because a peer will not increase its term unless it knows there has been no leader within the last election timeout.

Another issue is peer startup. If a peer gets restarted, it resets its election timer and cannot vote for a few seconds or so. If the entire cluster is restarted or is bootstrapped for the first time, there will be a few seconds of unavailability.
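
A rough sketch of the pre-vote idea mentioned above (all types and names are invented for illustration; this is not the etcd/raft implementation):

```go
// Package prevotesketch illustrates the Pre-Vote idea: a node only bumps its
// term after a majority indicates it could win an election.
package prevotesketch

// peer answers a pre-vote hypothetically: it grants it only if it has not
// heard from a leader within the last election timeout and the candidate's
// log is at least as up to date as its own. It does not update its own term
// or remember the answer.
type peer interface {
	wouldGrantPreVote(term, lastLogTerm, lastLogIndex uint64) bool
}

type node struct {
	term, lastLogTerm, lastLogIndex uint64
	peers                           []peer // the other members of the cluster
	startElection                   func() // runs an ordinary RequestVote round
}

// maybeCampaign is called when the node's election timer fires. Because the
// term is only incremented after the pre-vote succeeds, a partitioned node
// cannot inflate its term and later disrupt a stable leader.
func (n *node) maybeCampaign() {
	granted := 1 // the node always pre-votes for itself
	for _, p := range n.peers {
		if p.wouldGrantPreVote(n.term+1, n.lastLogTerm, n.lastLogIndex) {
			granted++
		}
	}
	if granted > (len(n.peers)+1)/2 {
		n.term++ // the real term bump happens only now
		n.startElection()
	}
}
```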

Contributor

Why is this limited to followers? As long as checkQuorum is enabled, any node in StateLeader is guaranteed to have been in contact with a majority of nodes within an election timeout, so it shouldn't increase its term either.

> If the entire cluster is restarted or is bootstrapped for the first time, there will be a few seconds of unavailability.

This is already true: when the entire cluster is restarted, the first election cannot happen until an election timeout (plus randomization) has passed.
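
For concreteness, one way the relaxed condition could look, with no StateFollower restriction (a sketch only; `hasLeader` stands in for `r.lead != None`, the other field names mirror the diff above, and the final code may differ):

```go
// Package leasecheck sketches the relaxed in-lease test: no StateFollower
// restriction, just checkQuorum plus recent contact with (or being) a leader.
package leasecheck

// raftView mirrors the handful of raft fields the check needs.
type raftView struct {
	checkQuorum     bool
	hasLeader       bool // stands in for r.lead != None
	electionElapsed int  // ticks since the last evidence that a leader exists
	electionTimeout int
}

// inLeaderLease reports whether an incoming MsgVote carrying a higher term
// should be ignored because a current leader is presumed to still be alive.
// A leader that keeps passing its quorum check satisfies this too, so it
// would not bump its term either.
func (r raftView) inLeaderLease() bool {
	return r.checkQuorum && r.hasLeader && r.electionElapsed < r.electionTimeout
}
```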

Contributor

> This is already true:

Right... Ignore my last point.


xiang90 commented May 25, 2016

> According to section 6.4 of the thesis, there is a way to serve read-only queries efficiently without going through the raft log, but it still needs to communicate with the quorum, which is not efficient enough.

The leader lease approach @swingbach mentioned is actually explained in section 6.4.1. As the thesis correctly notes, it might suffer from clock drift.

@ongardie Actually, there is one thing I am not clear about. The thesis says:

> Once the leader’s heartbeats were acknowledged by a majority of the cluster, the leader would assume that no other server will become leader for about an election timeout,

Does this assume that pre-vote is implemented, so that a follower would not increase its term when it times out, and would reject a vote request if it has recently heard from the leader?

It seems that without pre-vote or a similar mechanism, we cannot assume that no server will become leader for about an election timeout.

/cc @bdarnell

@bdarnell

You don't need the full pre-vote RPC, but you do need this tweak from the end of section 4.2.3, which is basically what this commit does:

> We modify the RequestVote RPC to achieve this: if a server receives a RequestVote request within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote.

This is also discussed in https://groups.google.com/d/msg/raft-dev/5Lxxr2UwzQc/UCKzsx0qL1sJ


xiang90 commented May 25, 2016

@bdarnell I understand that part. However, my concern was that:

> A peer with a higher term will keep sending vote requests. The leader cannot settle that peer down, since the peer will always reject the leader's append requests because it has the higher term.

@bdarnell

Ah, OK. I was responding to your last comment, "It seems that without pre-vote or a similar mechanism, we cannot assume that no server will become leader for about an election timeout." I think we can make that assumption, but if we make this change without pre-vote then a node may get stuck with a term number that is too high and be unable to proceed.


xiang90 commented May 25, 2016

> a node may get stuck with a term number that is too high and be unable to proceed.

Yeah... So I think we need to implement pre-vote to make all of this work correctly? A stuck node seems unacceptable.


xiang90 commented May 25, 2016

@bdarnell Seems like here is the answer from Diego:

> The leader (A or B) will send a heartbeat to C (the one with the higher term), C will reply to that heartbeat with its newer term, and the leader will step down and adopt that newer term.

Right now we ignore messages with a lower term. We need to tweak that so that C can reply with a response carrying its higher term, to disrupt the leader and free itself.
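
To make that concrete, here is a sketch of the tweak (the message and type names below are stand-ins for illustration; this is not this PR's code and not necessarily how etcd/raft will implement it):

```go
// Package freestuck sketches how the node with the too-high term (C) can free
// itself: instead of dropping lower-term leader traffic, it replies, and the
// reply carries C's higher term, which makes the stale leader step down.
package freestuck

type msgType int

const (
	msgApp msgType = iota
	msgHeartbeat
	msgAppResp
)

type message struct {
	typ      msgType
	from, to uint64
	term     uint64
}

type raftNode struct {
	id, term    uint64
	checkQuorum bool
	send        func(message) // transport to the other peers
}

// stepLowerTerm handles a message whose term is below the local term. Without
// this tweak such messages are simply ignored, and C stays stuck forever.
func (r *raftNode) stepLowerTerm(m message) {
	if r.checkQuorum && (m.typ == msgHeartbeat || m.typ == msgApp) {
		// The response is stamped with r.term, which is higher than the
		// leader's term, so the leader steps down and adopts the newer term.
		r.send(message{typ: msgAppResp, from: r.id, to: m.from, term: r.term})
	}
	// Otherwise the lower-term message is dropped, as before.
}
```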


xiang90 commented May 27, 2016

closing this in favor of #5468.
