
Fix rare leadership transfer failures when writes happen during transfer #581

Merged
merged 7 commits into main on Dec 4, 2023

Conversation

ncabatoff
Contributor

@ncabatoff ncabatoff commented Nov 23, 2023

The problem I'm trying to fix: after we send the TimeoutNow during a leader transfer, we remain the leader for a little while. During that time we allow writes, which can result in the upcoming election being lost by our chosen target, if it doesn't have the highest index at the time when it's asking for votes.

The fix: wait for up to ElectionTimeout after the TimeoutNow before we allow writes to proceed.

… happening during the transfer. Fix the problem partially: it can still occur if we dispatchLogs after the transfer message has been sent to the target, before it wins the election.
… changes: we weren't properly respecting stopCh.
…tionTimeout before resuming applies. Fix is in scare quotes because I'm not all sure this is acceptable.
…fail before responding to the transfer request.
…, which is racy in that in between calling pollState and setting up event monitoring, the cluster could've elected a leader. When that happens, Leader() errors returning 0 leaders.
@ncabatoff ncabatoff changed the title Reproduce leadership transfer failures when writes happen during transfer Fix rare leadership transfer failures when writes happen during transfer Nov 24, 2023
@ncabatoff ncabatoff marked this pull request as ready for review November 24, 2023 20:18
@ncabatoff ncabatoff requested a review from a team as a code owner November 24, 2023 20:18
@ncabatoff ncabatoff requested review from DanStough and removed request for a team November 24, 2023 20:18
Member

@banks banks left a comment


I think this looks great and don't see any issues.

I did have one inline question about whether an alternative would be easier to read and reason about, but honestly I'm not convinced either way, and I think both are effectively equivalent, so not blocking!

Comment on lines +697 to +706
// Wait for up to ElectionTimeout before flagging the
// leadership transfer as done and unblocking applies in
// the leaderLoop.
select {
case <-time.After(r.config().ElectionTimeout):
err := fmt.Errorf("leadership transfer timeout")
r.logger.Debug(err.Error())
future.respond(err)
case <-leftLeaderLoop:
r.logger.Debug("lost leadership during transfer (expected)")
Member


What do you think about just staying in the loop here instead of starting a new ElectionTimeout ticker and having to duplicate the leftLeaderLoop code path too?

I was originally expecting that we'd not wait for another ElectionTimeout, but rather keep waiting for up to the original one we started for the transfer to complete. I don't think it's terrible to do it this way, but it strikes me that it's less code and maybe simpler to reason about if the behaviour is just: block until the leadership transfer works, or until one election timeout from handling the request...

The rationale for using ElectionTimeout here, presumably, is that LeadershipTransfer is only an optimization to avoid ungracefully stopping and having to wait for an election timeout... so letting the leader transfer take longer than a whole election timeout somewhat defeats the point, and we should probably return fast and let the old leader just shut down if it wants to.

I think that rationale gets a bit more subtle in cases like Autopilot upgrade where we rely on Leadership Transfer working before we move on to the next phase rather than just shutting down anyway.

tl;dr, I don't think the behaviour here is bad or wrong, and it beats the bug we'd have without this wait. If duplicating the code and making the behaviour more explicit, as you have it here, turns out to be easier to follow, I'm OK with it. Just curious whether you tried simply replacing this else branch with continue, which I think is also a correct fix?

Contributor Author


Yes, I tried that. It got messy due to the other cases wanting to read from doneCh. In the end I decided this was clearer.

@@ -2337,17 +2337,71 @@ func TestRaft_LeadershipTransferWithOneNode(t *testing.T) {
}
}

func TestRaft_LeadershipTransferWithWrites(t *testing.T) {
Member


Do you have anecdata about roughly how frequently this reproduced a bug before your fix locally? I assume it's still non-deterministic?

Contributor Author


I couldn't give you numbers, but it didn't take many tries to provoke a failure. It can still fail if I raise the concurrency enough, e.g. if I run 16 parallel -race instances. But I suspect the same is true for many other tests, as the failures are things like timeouts caused by heartbeats not being processed quickly enough.

@ncabatoff ncabatoff merged commit 62eaa1c into main Dec 4, 2023
6 checks passed
@ncabatoff ncabatoff deleted the repro-leader-transfer-failure branch December 4, 2023 14:38
ncabatoff added a commit to hashicorp/vault that referenced this pull request Feb 13, 2024
ncabatoff added a commit to hashicorp/vault that referenced this pull request Feb 16, 2024
2 participants