raft: postpone MsgReadIndex until first commit in the term #12762

Merged
1 commit merged into master on Mar 23, 2021

Conversation

@wpedrak (Contributor) commented Mar 11, 2021

This PR tries to address #12680. Instead of dropping requests while the leader has not yet committed the first log entry in its term, it records them and processes them after that commit (1).
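For readers skimming the thread, here is a minimal, self-contained sketch of idea (1). The names and structure are illustrative and heavily simplified, not the actual diff in this PR:

package main

import "fmt"

// Message is a simplified stand-in for raftpb.Message in this sketch.
type Message struct {
	Type string
	Ctx  []byte
}

// leader is a highly simplified stand-in for the raft leader state.
type leader struct {
	term                     uint64
	committedEntryInTerm     bool      // has this leader committed an entry in its own term yet?
	pendingReadIndexMessages []Message // MsgReadIndex buffered until the first commit in the term
}

// stepReadIndex buffers MsgReadIndex while the leader has not yet committed
// an entry in its own term, instead of silently dropping the request.
func (l *leader) stepReadIndex(m Message) {
	if !l.committedEntryInTerm {
		l.pendingReadIndexMessages = append(l.pendingReadIndexMessages, m)
		return
	}
	l.processReadIndex(m)
}

// onFirstCommitInTerm replays everything that was buffered while the leader
// was not yet able to serve linearizable reads.
func (l *leader) onFirstCommitInTerm() {
	l.committedEntryInTerm = true
	for _, m := range l.pendingReadIndexMessages {
		l.processReadIndex(m)
	}
	l.pendingReadIndexMessages = nil
}

func (l *leader) processReadIndex(m Message) {
	fmt.Printf("serving read index for ctx %q at term %d\n", m.Ctx, l.term)
}

func main() {
	l := &leader{term: 2}
	l.stepReadIndex(Message{Type: "MsgReadIndex", Ctx: []byte("req-1")}) // buffered, not dropped
	l.onFirstCommitInTerm()                                              // replayed here
}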

#12680 could also be fixed by another approach: a retry mechanism in the following code (2):

for !timeout && !done {
	select {
	case rs = <-s.r.readStateC:
		done = bytes.Equal(rs.RequestCtx, ctxToSend)
		if !done {
			// a previous request might time out. now we should ignore the response of it and
			// continue waiting for the response of the current requests.
			id2 := uint64(0)
			if len(rs.RequestCtx) == 8 {
				id2 = binary.BigEndian.Uint64(rs.RequestCtx)
			}
			lg.Warn(
				"ignored out-of-date read index response; local node read indexes queueing up and waiting to be in sync with leader",
				zap.Uint64("sent-request-id", id1),
				zap.Uint64("received-request-id", id2),
			)
			slowReadIndex.Inc()
		}
	case <-leaderChangedNotifier:
		timeout = true
		readIndexFailed.Inc()
		// return a retryable error.
		nr.notify(ErrLeaderChanged)
	case <-time.After(s.Cfg.ReqTimeout()):
		lg.Warn("timed out waiting for read index response (local node might have slow network)", zap.Duration("timeout", s.Cfg.ReqTimeout()))
		nr.notify(ErrTimeout)
		timeout = true
		slowReadIndex.Inc()
	case <-s.stopping:
		return
	}
}

While this PR is a proposal for (1), I'd like to introduce a PR for (2) as well. In my opinion, implementing (1) alone, (2) alone, or both seems reasonable, and I'd love to hear what you think.

Edit:
Implementation of (2) can be found in #12780
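For comparison, here is a rough, hypothetical sketch of what a retry for (2) could look like on the etcd server side. The helper names and error values below are placeholders, not the actual change in #12780:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("timed out waiting for read index response")

// requestReadIndex stands in for one pass of the wait loop shown above
// (one attempt at receiving the matching read state from readStateC).
func requestReadIndex(ctx context.Context) (uint64, error) {
	select {
	case <-time.After(20 * time.Millisecond): // pretend the leader answers eventually
		return 42, nil
	case <-ctx.Done():
		return 0, errTimeout
	}
}

// readIndexWithRetry retries the wait a bounded number of times, so a request
// that arrives right after a leader election is not surfaced as an error.
func readIndexWithRetry(ctx context.Context, attempts int, perAttempt time.Duration) (uint64, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		idx, err := requestReadIndex(attemptCtx)
		cancel()
		if err == nil {
			return idx, nil
		}
		lastErr = err
	}
	return 0, lastErr
}

func main() {
	idx, err := readIndexWithRetry(context.Background(), 3, 100*time.Millisecond)
	fmt.Println(idx, err)
}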

@hexfusion (Contributor) commented:

/cc @xiang90

@ptabor (Contributor) left a comment

Thank you. That looks good to me.

Could you please also update the following comment:

etcd/raft/node.go

Lines 182 to 186 in b9226d0

// ReadIndex request a read state. The read state will be set in the ready.
// Read state has a read index. Once the application advances further than the read
// index, any linearizable read requests issued before the read request can be
// processed safely. The read state will have the same rctx attached.
ReadIndex(ctx context.Context, rctx []byte) error

It should state:

Note that the request can be lost without notice, therefore it is the user's job to ensure read index retries.
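Concretely, the updated doc comment could read something like this (wording approximate; the exact text to be settled in the PR):

// ReadIndex requests a read state. The read state will be set in the ready.
// Read state has a read index. Once the application advances further than the read
// index, any linearizable read requests issued before the read request can be
// processed safely. The read state will have the same rctx attached.
// Note that the request can be dropped without notice, so it is the caller's
// responsibility to retry ReadIndex if no corresponding read state is returned.
ReadIndex(ctx context.Context, rctx []byte) error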

I think we should eventually audit all etcd calls to 'Propose' and 'ReadIndex' for retries, but retries would still add unnecessary latency. I think we should merge this PR as well.

@wpedrak (Contributor, Author) commented Mar 12, 2021

One thing that might be worth considering is that in the current setup there is no limit on the size of pendingReadIndexMessages, so in a situation where we are unable to commit after an election and are spammed with ReadIndex requests, we could potentially run out of memory.
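If this ever turned out to be a real problem, one hypothetical mitigation (not part of this PR) could be to cap the buffer and shed the oldest requests, which would then simply time out and be retried on the etcd side. A rough sketch:

// msg is a stand-in for a buffered MsgReadIndex in this sketch.
type msg struct{ ctx []byte }

// maxPendingReadIndexMessages is a hypothetical cap, not something this PR adds.
const maxPendingReadIndexMessages = 4096

// appendPendingReadIndex buffers a read-index message, dropping the oldest one
// once the cap is reached so the buffer cannot grow without bound.
func appendPendingReadIndex(pending []msg, m msg) []msg {
	if len(pending) >= maxPendingReadIndexMessages {
		pending = pending[1:]
	}
	return append(pending, m)
}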

wpedrak added a commit to wpedrak/etcd that referenced this pull request Mar 16, 2021
@wpedrak mentioned this pull request Mar 16, 2021
wpedrak added a commit to wpedrak/etcd that referenced this pull request Mar 16, 2021
wpedrak added a commit to wpedrak/etcd that referenced this pull request Mar 16, 2021
@ptabor (Contributor) commented Mar 17, 2021

there is no limit on the size of pendingReadIndexMessages, so in a situation where we are unable to commit after an election and are spammed with ReadIndex requests, we could potentially run out of memory.

That would mean we have a leader that:

  • is not committing anything, so it is blocking all writes
  • is not propagating ReadIndex, so it is blocking all reads

I think an OOM is better for such a leader than staying in that state for a long time, if it ever happens.

@ptabor (Contributor) commented Mar 17, 2021

@tangcong @tbg @bdarnell - can you please take a look from a Raft perspective?

@ptabor requested review from bdarnell and tbg on March 17, 2021 07:58
@bdarnell (Contributor) commented:

I've never used ReadIndex so I'm not an expert here, but allowing a leader to OOM seems like a pretty bad outcome (consider a process that is running multiple instances of raft, or maintaining a lot of state in addition to raft). I'd prefer to avoid this amount of buffering in the raft object.

We have some similar use cases in CockroachDB where we just retry if the leader isn't ready. In order to avoid waiting for the full timeout (500ms or whatever), we retry immediately when we see the leader commit the first empty entry of its term.

@gyuho (Contributor) commented Mar 17, 2021

Is this something we can prevent at the etcd server layer, and just expose the progress from raft? Just like we do for "error too many requests"? And let etcd handle the retry? I agree with @bdarnell that we should never OOM the etcd server.

@ptabor (Contributor) commented Mar 17, 2021

I was not precise: I don't think this change actually increases the probability of OOM.
In particular, the raft state already queues the read-only index requests in its internal structures:

readIndexQueue []string

ro.readIndexQueue = append(ro.readIndexQueue, s)

and the size of that queue is unbounded.

As long as the leader is operating and able to broadcast messages, the queue keeps getting truncated.
The first commit by a new leader should not be any different.
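For reference, the queueing referred to here lives in raft/read_only.go and looks roughly like this (abridged and paraphrased; see the actual file for the real code):

type readOnly struct {
	option           ReadOnlyOption
	pendingReadIndex map[string]*readIndexStatus
	readIndexQueue   []string // unbounded; drained as the leader confirms its quorum
}

// addRequest queues a read-only request keyed by its request context. Entries
// are removed again in advance(), which is driven by heartbeat responses, so a
// functioning leader keeps the queue short.
func (ro *readOnly) addRequest(index uint64, m pb.Message) {
	s := string(m.Entries[0].Data)
	if _, ok := ro.pendingReadIndex[s]; ok {
		return
	}
	ro.pendingReadIndex[s] = &readIndexStatus{index: index, req: m, acks: make(map[uint64]bool)}
	ro.readIndexQueue = append(ro.readIndexQueue, s)
}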

BTW: interesting idea about resending all ReadIndex requests on leadership change as well (although it adds more 'internal dependencies' on the etcd side).

@wpedrak (Contributor, Author) commented Mar 22, 2021

I don't think this change actually increases the probability of OOM.

Indeed. If we trace the execution of ReadIndex, we find that all messages appended to pendingReadIndexMessages (the slice introduced in this PR) are eventually appended to the readIndexQueue mentioned above. In case the leader has already committed an entry in its term, all those messages go to readIndexQueue directly.

BTW: interesting idea about resending all ReadIndex requests on leadership change as well (although it adds more 'internal dependencies' on the etcd side).

@ptabor I believe you meant retrying on the "first commit in the current term", not on "leadership change", as we already retry on leadership change in

case <-leaderChangedNotifier:
	continue

and

case <-leaderChangedNotifier:
	timeout = true
	readIndexFailed.Inc()
	// return a retryable error.
	nr.notify(ErrLeaderChanged)

@ptabor (Contributor) commented Mar 22, 2021

BTW: interesting idea about resending all ReadIndex requests on leadership change as well (although it adds more 'internal dependencies' on the etcd side).

@ptabor I believe you meant retrying on the "first commit in the current term", not on "leadership change", as we already retry on leadership change in

Correct. So potentially the trigger could come from:

if len(e.Data) == 0 {
	select {
	case s.forceVersionC <- struct{}{}:
	default:
	}
	// promote lessor when the local member is leader and finished
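A hypothetical sketch of such a trigger on the etcd side: a small notifier that fires once the first (empty) entry of the new leader's term has been applied, which read-index waiters could select on next to leaderChangedNotifier. The names below are illustrative and not an existing etcd API:

package main

import (
	"fmt"
	"sync"
)

// firstCommitNotifier closes a channel once per term, when the first (empty)
// entry proposed by the new leader has been applied.
type firstCommitNotifier struct {
	mu   sync.Mutex
	term uint64
	ch   chan struct{}
}

func newFirstCommitNotifier() *firstCommitNotifier {
	return &firstCommitNotifier{ch: make(chan struct{})}
}

// wait returns a channel that is closed on the first commit of the current term.
func (n *firstCommitNotifier) wait() <-chan struct{} {
	n.mu.Lock()
	defer n.mu.Unlock()
	return n.ch
}

// onApply is called for every applied entry; an empty entry marks the leader's
// first commit in its term, at which point pending waiters are released.
func (n *firstCommitNotifier) onApply(term uint64, data []byte) {
	if len(data) != 0 {
		return
	}
	n.mu.Lock()
	defer n.mu.Unlock()
	if term > n.term {
		n.term = term
		close(n.ch)
		n.ch = make(chan struct{}) // fresh channel for the next term
	}
}

func main() {
	n := newFirstCommitNotifier()
	released := n.wait()
	go n.onApply(3, nil) // the empty entry of term 3 gets applied
	<-released
	fmt.Println("first commit of the term observed; retry any pending ReadIndex requests")
}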

@wpedrak (Contributor, Author) commented Mar 22, 2021

Makes sense. I definitely like the idea of being proactive here; however, I see this as a separate PR, as it wouldn't share any code with the proposal here.

@bdarnell @gyuho could you elaborate on your concerns in light of the comments from @ptabor and @wpedrak above?

@bdarnell (Contributor) commented:

If it's already going into a queue, I'm not as worried. A second queue feels inelegant, but I'm not sure how to restructure it, so LGTM.

@gyuho (Contributor) left a comment

// Reject read only request when this leader has not committed any log entry at its term

I assume this time window in practice is pretty small. So, LGTM. /cc @xiang90

Requesting some clarification and docs. Otherwise LGTM.

@codecov-io commented:

Codecov Report

Merging #12762 (758ff01) into master (456e129) will decrease coverage by 14.52%.
The diff coverage is 0.00%.


@@             Coverage Diff             @@
##           master   #12762       +/-   ##
===========================================
- Coverage   67.68%   53.15%   -14.53%     
===========================================
  Files         421      389       -32     
  Lines       32973    31379     -1594     
===========================================
- Hits        22318    16681     -5637     
- Misses       8602    12965     +4363     
+ Partials     2053     1733      -320     
Flag    Coverage Δ
all     53.15% <0.00%> (-14.53%) ⬇️


Impacted Files                              Coverage Δ
raft/node.go                                50.75% <ø> (-36.69%) ⬇️
raft/raft.go                                25.70% <0.00%> (-64.68%) ⬇️
client/v3/utils.go                          0.00% <0.00%> (-100.00%) ⬇️
raft/quorum/quorum.go                       0.00% <0.00%> (-100.00%) ⬇️
raft/tracker/state.go                       0.00% <0.00%> (-100.00%) ⬇️
client/v3/compact_op.go                     0.00% <0.00%> (-100.00%) ⬇️
client/v3/ordering/util.go                  0.00% <0.00%> (-100.00%) ⬇️
pkg/tlsutil/cipher_suites.go                0.00% <0.00%> (-100.00%) ⬇️
pkg/transport/sockopt_unix.go               0.00% <0.00%> (-100.00%) ⬇️
client/v3/naming/endpoints/endpoints.go     0.00% <0.00%> (-100.00%) ⬇️
... and 275 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ptabor (Contributor) commented Mar 23, 2021

Thank you for the change. It's a big step towards seamless leadership change.


@ahrtr mentioned this pull request Jul 19, 2022
ahrtr added a commit to ahrtr/etcd that referenced this pull request Jul 22, 2022
Backport etcd-io#12762 to 3.4

Signed-off-by: Benjamin Wang <wachao@vmware.com>
ahrtr added a commit to ahrtr/etcd that referenced this pull request Jul 22, 2022
Backport etcd-io#12762 to 3.4

Signed-off-by: Benjamin Wang <wachao@vmware.com>
tjungblu pushed a commit to tjungblu/etcd that referenced this pull request Sep 8, 2022
Backport etcd-io#12762 to 3.4

Signed-off-by: Benjamin Wang <wachao@vmware.com>