
[FIXED] Added jitter in the reconnect logic #564

Merged · 2 commits merged from fix_563 into master on May 13, 2020
Conversation

@kozlovic (Member) commented May 1, 2020

Where the library previously waited for the full reconnect wait interval, it now waits
for a random time bounded by that value.
This prevents the thundering-herd reconnect issue, especially with TLS.

Resolves #563

Signed-off-by: Ivan Kozlovic ivan@synadia.com

@kozlovic (Member, Author) commented May 1, 2020

I added a random delay capped at the reconnect wait. I also made sure that we kick out the doReconnect go routine when the connection is closed (it could previously keep waiting for reconnectWait even after the connection was closed).

Should we have a minimum wait? For instance, use a random value, but in the range [reconnectWait/2, reconnectWait]?
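A minimal sketch of the two variants being discussed; the helper name and shape are illustrative, not the library's actual code (it assumes the standard math/rand and time packages):

// Illustrative only, not nats.go internals.
// randomDelay returns a random delay in [min, max).
// With min == 0 it matches the initial proposal (random, bounded by reconnectWait);
// with min == reconnectWait/2 it matches the [reconnectWait/2, reconnectWait] variant.
func randomDelay(min, max time.Duration) time.Duration {
	if max <= min {
		return min
	}
	return min + time.Duration(rand.Int63n(int64(max-min)))
}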

@kozlovic (Member, Author) commented May 1, 2020

Some tests will now sometimes fail because they expect the connection to be reconnecting based on the ReconnectWait duration. Waiting for a decision on the wait time before fixing those. For instance: TestReconnectAllowedFlags: reconnect_test.go:104: Expected to be in a reconnecting state.

go.mod Outdated
github.com/nats-io/jwt v0.3.2
github.com/nats-io/nats-server/v2 v2.1.6
Member:

I know that the go mod tooling added this, but should we continue to remove it to avoid depending on the server?

Member Author:

Argh... it's likely my local update of staticcheck that did that and I forgot to remove it. Will update.

go.mod Outdated
)

go 1.13
Member:

Not completely sure of the impact of adding this one yet either...

Member Author:

I will reset the go.mod/go.sum changes, but it will be easier for me to do a git push --force, so be aware.

@coveralls commented May 1, 2020

Coverage Status

Coverage increased (+0.06%) to 91.671% when pulling b2be4bb on fix_563 into c48e770 on master.

nats.go Outdated
@@ -1844,7 +1848,11 @@ func (nc *Conn) doReconnect(err error) {
 	if sleepTime <= 0 {
 		runtime.Gosched()
 	} else {
-		time.Sleep(time.Duration(sleepTime))
+		rt.Reset(time.Duration(rand.Int63n(sleepTime)))
Member:

Reconnect wait time should still hold the same semantics: do not attempt to reconnect for some period of time. What I think we need to add is a Jitter option with a sane default, say 2 secs or so, that is added onto reconnectWait.

Member Author:

So a random value in [0, 2sec] added to the base; ok, got it. Will update.
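A sketch of the behavior agreed on here: always wait for the base reconnect wait, and add a random jitter bounded by a maximum. The function and parameter names are assumptions for illustration (assumes math/rand and time):

// Illustrative only, not nats.go internals.
// reconnectSleep waits at least reconnectWait and adds a random jitter in [0, jitterMax).
func reconnectSleep(reconnectWait, jitterMax time.Duration) time.Duration {
	sleep := reconnectWait
	if jitterMax > 0 {
		sleep += time.Duration(rand.Int63n(int64(jitterMax)))
	}
	return sleep
}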

Member Author:

Argh... quite a few tests seem to expect no more than reconnectWait, so we may get some flappers popping up; I will have to fix those along the way... unless you think that the added jitter should be an option?

Member:

I don't think it should be an option, but the default should be low. For tests we can set it to 0, which should restore the old behavior.

Member Author:

Except that some of those tests are in the /test package, so I cannot play that trick without exposing an option or function to set that value to zero.

Member:

Jitter was always meant to be a public option; I just wanted a sane default. For the case we were dealing with today they would probably set it to 5-10 seconds. I think that is a lot for default cases where you have live backup servers and want a quick reconnect. Maybe 1 second is plenty in that case, so [0, 1] second is the default.

Member Author:

OK, so we do make it an Option in that it is settable. Still, it means that all tests in /test that relied on waiting no more than reconnectWait will have to be altered, but that's OK. It is just that I may not find them all at once.

Member:

Yes they will need to set Jitter to 0.. ;)

@ripienaar (Contributor):

In my ideal world I want this to be an exponential backoff, or at least the ability to pass in a delay function to achieve that.

I.e., jitter by default, but make it possible to do exponential backoffs via a user-supplied sleeper. Pass the sleeper a reconnect counter.

At a large site I had NATS using all CPU for several minutes on restarts due to combinations of slow TLS handshakes, timeouts, and synchronized reconnects.

@kozlovic force-pushed the fix_563 branch 2 times, most recently from eaa6848 to b0b0bd3 on May 1, 2020 22:42
@derekcollison (Member):

@ripienaar Yes, we were discussing how the servers and clients interact. This is mostly to avoid a thundering herd, but if an org controls both ends they can configure their clients to flatten the curve of reconnect rates, especially for demanding TLS handshakes, like 4K RSA keys ;)

@ripienaar (Contributor):

Sure, I am saying I want more control than this PR provides.

The high CPU and literal outage for minutes happened even with the fastest ciphers.

@derekcollison (Member):

ok gotcha..

@kozlovic (Member, Author) commented May 4, 2020

@ripienaar @derekcollison I can certainly add a CustomReconnectJitter() option. Here is what I propose; let me know if it matches your thoughts:

  • A CustomReconnectJitter option whose callback is given the URL the library is going to try to connect to, along with the number of reconnect attempts for this URL. The user returns the additional jitter.
  • If such an option is specified, the default jitter is ignored.

Question: should the value returned by the invocation of the user's custom reconnect jitter callback be used as-is (in addition to the reconnect wait), or should the library still use a random value between 0 and that value?

@derekcollison (Member):

I think jitter should be handled as we discussed. What sounds like something RI might need is a reconnect-failed handler so that he can possibly change the reconnect time.

What I think he wants is a simple option (or default behavior) that just does exponential backoff.

@ripienaar is this more for the client app's benefit or the server's?

@ripienaar (Contributor) commented May 4, 2020

Yeah, no URL - this is for me to use in a Go client. If I know I am working on a system with >50k clients, I will do the extra work to write a custom jitter.

Using the stateless backoff described here: https://blog.gopheracademy.com/advent-2014/backoff/

CustomReconnectDelay(func(n int) time.Duration {
  return backoff.Default.Duration(n)
})

And that's it: if I specify this you ignore the default you are proposing, and people with complex needs can implement whatever they like.

When you're doing TLS at scale with NATS you have to be VERY careful - that's fine, it's normal, and it's not a NATS problem - but if I am at scale this is a key knob to have.

@derekcollison (Member):

Yes, definitely needed for large-scale TLS (not even that large, TBH; the user we are working with was dealing with 1200).

Do we have a callback that exists now that says "I tried everyone and no one picked up"?

Feels like this callback would be where we reset the reconnect time (jitter should probably not be tweaked here, IMO).

Then, on connect again, we can reset it back to normal, etc.

@ripienaar (Contributor):

Yeah, this is why my example takes an n input - it's the number of reconnects in this cycle. So you don't need to remember anything other than the reconnect count. The jitter provider gives you the right sleep based on the reconnect count. Very little to track on the client library side.

@ColinSullivan1 (Member):

IMO, this can be more general than jitter; maybe label it ReconnectAttemptFailedCB. It could be very useful in debugging. Couldn't attempts be tracked via closure? Along with the custom one, we could provide a few predefined callbacks that users could choose from - jitter only, backoff with jitter, etc.
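A hypothetical sketch of the closure idea mentioned above: the user-supplied callback keeps its own attempt count, so the library does not have to track it. The callback shape shown is an assumption, not the final API (assumes the time package):

// Hypothetical, for illustration only.
// makeDelayFunc returns a callback that counts how many times it has been
// invoked via the captured attempts variable (not goroutine-safe).
func makeDelayFunc(base time.Duration) func() time.Duration {
	attempts := 0
	return func() time.Duration {
		attempts++
		return time.Duration(attempts) * base // linear backoff, just as an example
	}
}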

nats.go Outdated
// It can return a jitter value based on those metrics.
// The library will take a random value between 0 and the returned value
// and add that to the ReconnectWait duration.
// Note that any jitter is capped with ReconnectJitterMax.
@ripienaar (Contributor) commented May 5, 2020:

So with this design, library users still can't use their own backoff logic; we can only influence the jitter, which is not what I was suggesting earlier. It's not nothing and it will help; I am just pointing out that it's not what I was suggesting at all.

The desired effect of a backoff is that the minimal reconnect interval increases over time, plus jitter. This implementation still means the minimal reconnect time is ~2 seconds, which is not good.

Here are 30 reconnect attempts by grpc; note it is capped at 2 minutes and then jitters around that, but the minimum reconnect time approaches the max over time, not min+jitter:

0: 1s
1: 1.599317579s
2: 2.56393758s
3: 3.990555513s
4: 5.962072093s
5: 10.763383515s
6: 18.872196205s
7: 29.559588104s
8: 37.147186587s
9: 1m10.383560759s
10: 1m37.91169291s
11: 2m18.715950662s
12: 1m45.303404133s
13: 1m59.446929074s
14: 1m51.271171221s
15: 2m13.320486513s
16: 1m53.109727095s
17: 2m22.762489936s
18: 1m54.080148652s
19: 2m1.71590643s
20: 2m1.300598788s
21: 1m49.431812662s
22: 2m4.949590579s
23: 1m46.34926616s
24: 1m42.787212095s
25: 2m6.799644605s
26: 1m48.488322855s
27: 2m13.048933797s
28: 2m14.636264763s
29: 1m48.410590953s
30: 1m43.727174768s

Member:

So you want to control both then, yes? The most important thing is to spread out the jitter. The minimum wait stops the attempts from happening quickly, but changing that and not the jitter (unless you already have randomness in how folks got disconnected) will not help smooth the reconnects. So for the grpc example above, what are they achieving besides the client just not trying as often?

Contributor:

It's a true backoff (see the sketch after this list):

  • not as often
  • over time less often
  • capped at a max duration
  • with jitter always
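A minimal sketch of a backoff with those four properties; it is illustrative, not the grpc or NATS implementation, and assumes the math/rand and time packages:

// Illustrative only.
// backoffDelay grows exponentially with the attempt number, is capped at max,
// and always adds a random jitter of up to +/-10% so clients do not retry in lockstep.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	jitter := time.Duration(rand.Int63n(int64(d)/5+1)) - d/10
	return d + jitter
}

Over repeated attempts this reproduces the shape of the grpc trace above: early retries are short, later ones cluster around the cap with jitter.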

Member:

The point I was trying to make is that jitter is more important than the delay. I understand we could have both.

Member:

In the example you provided above, can you influence the jitter?

Member:

That is fair; use some library for wait and jitter combined, and just provide sane defaults for most scenarios.

In your case, is your algorithm TLS-aware?

Contributor:

No, but your defaults could be, of course, and that would be fine for most.

Choria can't be run without TLS unless you go and recompile it, at which point I don't care :)

Member Author:

I vote for that simple function. If set, it supersedes any ReconnectWait option. So we keep the current jitter options (non-TLS and TLS) for cases where the custom option is not set. The doc for CustomReconnectDelay() will clearly state that the library will sleep for the given delay and that it is recommended that the function return a value that includes some jitter.

Member Author:

I am ready to submit updates. Do you guys prefer that I squash first, or do you want to see just the difference (and I will squash only prior to merge)?

Member:

Squash for me.

nats.go Outdated
// Note that any jitter is capped with ReconnectJitterMax.
ReconnectJitterTLS time.Duration

// ReconnectJitterMax caps the maximum jitter.
Contributor:

Worth pointing out here that setting this to 0 will disable the capping.

// user the current number of attempts. This function returns the
// amount of time the library will sleep before attempting to reconnect
// again. It is strongly recommended that this value contains some
// jitter to prevent all connections to attempt reconnecting at the same time.
Contributor:

sweet 👍

@derekcollison self-requested a review May 5, 2020 16:51
go.mod Outdated
@@ -1,5 +1,7 @@
module github.com/nats-io/nats.go

go 1.14
Member:

Do we want this here?

Member Author:

No... but I updated to Go 1.14 and that touched this file... you don't know how mad this thing makes me.
I just discovered that VSCode now updates go.mod/go.sum anytime I save, because it uses go list, etc., which changes those files. Will revert.

nats.go Outdated
RequestChanLen = 8
DefaultDrainTimeout = 30 * time.Second
LangString = "go"
Version = "1.9.2"
Member:

Probably bump to 1.9.3

Member Author:

ok

nats.go Outdated
DefaultPort = 4222
DefaultMaxReconnect = 60
DefaultReconnectWait = 2 * time.Second
DefaultReconnectJitterNonTLS = 100 * time.Millisecond
Member:

Just DefaultReconnectJitter, IMO.

Member Author:

Got it. If TLS is not in the name, it means non-TLS.

Member:

Just going to say that this is uber complex - two jitter values, and a function that tells them that it failed when the close handler should do that. My question is: if that is the case, why don't we just have one jitter that is a random value added to the reconnect wait, and also an exponential backoff which starts at reconnect wait + jitter?

Member:

Also, with the advent of host:port, knowing the type of server is more difficult - and in most cases the client doesn't care, as the server will be whatever they have deployed and will likely be either TLS or not.

Member Author:

The callback is for users that want to implement their own backoff, in which case the lib will use whatever value is returned. For the default jitters, it is true that TLS connections usually have more issues reconnecting due to the cost of the TLS handshake, which is why there are two different values. Note that in both cases (plain and TLS) the lib still takes a random value between 0 and that value, which is added to the reconnect wait.
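A sketch of how the two default jitter values described here combine with the reconnect wait; the function and parameter names are assumptions about the internals, not the actual code (assumes math/rand and time):

// Illustrative only, not nats.go internals.
// effectiveWait adds a random jitter in [0, jitter) to the base reconnect wait,
// choosing the larger TLS jitter when the connection is secured.
func effectiveWait(reconnectWait, jitter, jitterTLS time.Duration, usingTLS bool) time.Duration {
	j := jitter // e.g. the 100ms non-TLS default
	if usingTLS {
		j = jitterTLS // e.g. the 1s TLS default
	}
	if j <= 0 {
		return reconnectWait
	}
	return reconnectWait + time.Duration(rand.Int63n(int64(j)))
}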

Member:

One more - I think the correct thing would be to have a maxReconnectWait that is analogous to reconnectWait, but where the wait is random between 0 and maxReconnectWait. That would give a bigger jitter window without the complexity of the multipliers.

Member:

> The callback is for users that want to implement their own backoff, in which case the lib will use whatever value is returned. For the default jitters, it is true that TLS connections usually have more issues reconnecting due to the cost of the TLS handshake, which is why there are two different values. Note that in both cases (plain and TLS) the lib still takes a random value between 0 and that value, which is added to the reconnect wait.

So why not just have jitter as a boolean that randomizes reconnectTimeWait? The jitter callback can be a dynamic thing - in the case of node the property can be a boolean or a function, and it's done.

Member:

We had long discussions on all of this; the base connect wait and jitter are separate things in almost all other systems. We tried to simplify and had quite a few back-and-forths.

- Add ReconnectJitter option setter allowing the user to set jitter for non-TLS and TLS connections (possibly different values)
- Default for non-TLS will be 100ms, and for TLS 1sec
- Add a CustomReconnectDelay handler that is passed the number of full URL-list attempts. The user callback will then return a duration that is going to be used for the wait. It is the responsibility of the user to return a delay with some jitter if desired.

Resolves #563

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
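A usage sketch based on the option names in the commit message above (ReconnectJitter and CustomReconnectDelay); the exact signatures, the server URL, and the delay values are assumptions, not taken verbatim from the released API:

// Usage sketch; option signatures are assumed from the commit message above.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222", // assumed local server URL
		nats.ReconnectWait(2*time.Second),
		// Random jitter added to the wait: up to 100ms for plain connections,
		// up to 1s for TLS connections (the defaults listed above).
		nats.ReconnectJitter(100*time.Millisecond, time.Second),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// To take full control of the delay instead, the commit message describes a
	// CustomReconnectDelay handler that receives the number of full URL-list attempts:
	//
	//   nats.CustomReconnectDelay(func(attempts int) time.Duration {
	//       return time.Duration(attempts*250) * time.Millisecond
	//   }),
}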
@kozlovic (Member, Author):
@derekcollison @ripienaar Please consider the changes we have made here. Since we now sleep only at the end of the list, it changes the fact that we used to check lastAttempt for each server we were trying to reconnect to. We have changed that (and I have removed the lastAttempt variable that was no longer used in b2be4bb). Let's make sure this is correct before others port it to their respective client libraries.
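A rough sketch of the loop shape being described (sleep only after the whole server list has been tried, not before each individual server); the identifiers are illustrative, not the actual doReconnect code (assumes the time package):

// Illustrative only.
func reconnectLoop(servers []string, delayFor func(attempts int) time.Duration, tryConnect func(url string) bool) {
	attempts := 0
	for {
		for _, url := range servers {
			if tryConnect(url) {
				return // reconnected
			}
		}
		attempts++ // one full pass over the URL list has failed
		time.Sleep(delayFor(attempts))
	}
}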

@derekcollison (Member) left a comment:

LGTM

i = 0
var st time.Duration
if crd != nil {
wlf++
Member:

Should we count wlf regardless of crd? It's a no-op now, but we may use it.

Member Author:

I think we will move it out of the condition when we actually need it; it will be easy to do.

Linked issue: Add jitter during reconnect (#563)
7 participants