
[FIXED] Added jitter in the reconnect logic #564

Merged · 2 commits merged from fix_563 into master on May 13, 2020
Conversation

@kozlovic (Member) commented May 1, 2020

Where the library previously waited for the full reconnect wait interval, it now waits
for a random time bounded by that value.
This prevents the thundering-herd reconnect issue, especially with TLS.

Resolves #563

Signed-off-by: Ivan Kozlovic ivan@synadia.com

@kozlovic (Member, Author) commented May 1, 2020

I added a random delay capped at the reconnect wait. I also made sure that we kick out the doReconnect go routine when the connection is closed (it could previously keep waiting for reconnectWait even after the connection was closed).

Should we have a minimum wait? For instance, use a random value, but in the range [reconnectWait/2, reconnectWait]?
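A minimal sketch of the two variants being discussed; the helper name and shape are illustrative, not the library's actual code (it assumes the standard math/rand and time packages):

// Illustrative only, not nats.go internals.
// randomDelay returns a random delay in [min, max).
// With min == 0 it matches the initial proposal (random, bounded by reconnectWait);
// with min == reconnectWait/2 it matches the [reconnectWait/2, reconnectWait] variant.
func randomDelay(min, max time.Duration) time.Duration {
	if max <= min {
		return min
	}
	return min + time.Duration(rand.Int63n(int64(max-min)))
}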

@kozlovic (Member, Author) commented May 1, 2020

Some tests will now sometimes fail because they expect the connection to be reconnecting based on the ReconnectWait duration. Waiting for a decision on the wait time before fixing those. For instance: TestReconnectAllowedFlags: reconnect_test.go:104: Expected to be in a reconnecting state.

go.mod Outdated
github.com/nats-io/jwt v0.3.2
github.com/nats-io/nats-server/v2 v2.1.6
Member:

I know that the go mod tooling added this, but should we continue to remove it to avoid depending on the server?

Member Author:

Argh... it's likely my local update of staticcheck that did that and I forgot to remove it. Will update.

go.mod Outdated
)

go 1.13
Member:

Not completely sure of the impact of adding this one yet either...

Member Author:

I will reset the go.mod/go.sum changes, but it will be easier for me to do a git push --force, so be aware.

@coveralls commented May 1, 2020

Coverage Status

Coverage increased (+0.06%) to 91.671% when pulling b2be4bb on fix_563 into c48e770 on master.

nats.go Outdated
@@ -1844,7 +1848,11 @@ func (nc *Conn) doReconnect(err error) {
 	if sleepTime <= 0 {
 		runtime.Gosched()
 	} else {
-		time.Sleep(time.Duration(sleepTime))
+		rt.Reset(time.Duration(rand.Int63n(sleepTime)))
Member:

Reconnect wait time should still hold the same semantics: do not attempt to reconnect for some period of time. What I think we need to add is a Jitter option with a sane default, say 2 secs or so, that is added onto reconnectWait.

Member Author:

So a random value in [0, 2sec] added to the base; ok, got it. Will update.
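A sketch of the behavior agreed on here: always wait for the base reconnect wait, and add a random jitter bounded by a maximum. The function and parameter names are assumptions for illustration (assumes math/rand and time):

// Illustrative only, not nats.go internals.
// reconnectSleep waits at least reconnectWait and adds a random jitter in [0, jitterMax).
func reconnectSleep(reconnectWait, jitterMax time.Duration) time.Duration {
	sleep := reconnectWait
	if jitterMax > 0 {
		sleep += time.Duration(rand.Int63n(int64(jitterMax)))
	}
	return sleep
}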

Member Author:

Argh... quite a few tests seem to expect no more than reconnectWait, so we may get some flappers popping up; I will have to fix those along the way... unless you think that the added jitter should be an option?

Member:

I don't think it should be an option, but the default should be low. For tests we can set it to 0, which should restore the old behavior.

Member Author:

Except that some of those tests are in the /test package, so I cannot play that trick without exposing an option or function to set that value to zero.

Member:

Jitter was always meant to be a public option; I just wanted a sane default. For the case we were dealing with today they would probably set it to 5-10 seconds. I think that is a lot for default cases where you have live backup servers and want a quick reconnect. Maybe 1 second is plenty in that case, so [0, 1] second is the default.

Member Author:

OK, so we do make it an Option in that it is settable. Still, it means that all tests in /test that relied on waiting no more than reconnectWait will have to be altered, but that's OK. It is just that I may not find them all at once.

Member:

Yes they will need to set Jitter to 0.. ;)

@ripienaar (Contributor):

In my ideal world I want this to be an exponential backoff, or at least the ability to pass in a delay function to achieve that.

I.e., jitter by default, but make it possible to do exponential backoffs via a user-supplied sleeper. Pass the sleeper a reconnect counter.

At a large site I had NATS using all CPU for several minutes on restarts due to combinations of slow TLS handshakes, timeouts, and synchronized reconnects.

@kozlovic force-pushed the fix_563 branch 2 times, most recently from eaa6848 to b0b0bd3 on May 1, 2020 22:42
@derekcollison (Member):

@ripienaar Yes, we were discussing how the servers and clients interact. This is mostly to avoid a thundering herd, but if an org controls both ends they can configure their clients to flatten the curve of reconnect rates, especially for demanding TLS handshakes, like 4K RSA keys ;)

@ripienaar (Contributor):

Sure, I am saying I want more control than this PR provides.

The high CPU and literal outage for minutes happened even with the fastest ciphers.

@derekcollison (Member):

ok gotcha..

@kozlovic (Member, Author) commented May 4, 2020

@ripienaar @derekcollison I can certainly add a CustomReconnectJitter() option. Here is what I propose; let me know if it matches your thoughts:

  • A CustomReconnectJitter option whose callback is given the URL the library is going to try to connect to, along with the number of reconnect attempts for this URL. The user returns the additional jitter.
  • If such an option is specified, the default jitter is ignored.

Question: should the value returned by the invocation of the user's custom reconnect jitter callback be used as-is (in addition to the reconnect wait), or should the library still use a random value between 0 and that value?

@derekcollison (Member):

I think jitter should be handled as we discussed. What sounds like something RI might need is a reconnect-failed handler so that he can possibly change the reconnect time.

What I think he wants is a simple option (or default behavior) that just does exponential backoff.

@ripienaar is this more for the client app's benefit or the server's?

@ripienaar (Contributor) commented May 4, 2020

Yeah, no URL - this is for me to use in a Go client. If I know I am working on a system with >50k clients, I will do the extra work to write a custom jitter.

Using the stateless backoff described here: https://blog.gopheracademy.com/advent-2014/backoff/

CustomReconnectDelay(func(n int) time.Duration {
  return backoff.Default.Duration(n)
})

And that's it: if I specify this you ignore the default you are proposing, and people with complex needs can implement whatever they like.

When you're doing TLS at scale with NATS you have to be VERY careful - that's fine, it's normal, and it's not a NATS problem - but if I am at scale this is a key knob to have.

@derekcollison (Member):

Yes, definitely needed for large-scale TLS (not even that large, TBH; the user we are working with was dealing with 1200).

Do we have a callback that exists now that says "I tried everyone and no one picked up"?

Feels like this callback would be where we reset the reconnect time (jitter should probably not be tweaked here, IMO).

Then, on connect again, we can reset it back to normal, etc.

@ripienaar (Contributor):

Yeah, this is why my example takes an n input - it's the number of reconnects in this cycle. So you don't need to remember anything other than the reconnect count. The jitter provider gives you the right sleep based on the reconnect count. Very little to track on the client library side.

@ColinSullivan1 (Member):

IMO, this can be more general than jitter; maybe label it ReconnectAttemptFailedCB. It could be very useful in debugging. Couldn't attempts be tracked via closure? Along with the custom one, we could provide a few predefined callbacks that users could choose from - jitter only, backoff with jitter, etc.
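A hypothetical sketch of the closure idea mentioned above: the user-supplied callback keeps its own attempt count, so the library does not have to track it. The callback shape shown is an assumption, not the final API (assumes the time package):

// Hypothetical, for illustration only.
// makeDelayFunc returns a callback that counts how many times it has been
// invoked via the captured attempts variable (not goroutine-safe).
func makeDelayFunc(base time.Duration) func() time.Duration {
	attempts := 0
	return func() time.Duration {
		attempts++
		return time.Duration(attempts) * base // linear backoff, just as an example
	}
}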

nats.go Outdated
// It can return a jitter value based on those metrics.
// The library will take a random value between 0 and the returned value
// and add that to the ReconnectWait duration.
// Note that any jitter is capped with ReconnectJitterMax.
@ripienaar (Contributor) commented May 5, 2020:

So with this design, library users still can't use their own backoff logic; we can only influence the jitter, which is not what I was suggesting earlier. It's not nothing and it will help; I am just pointing out that it's not what I was suggesting at all.

The desired effect of a backoff is that the minimal reconnect interval increases over time, plus jitter. This implementation still means the minimal reconnect time is ~2 seconds, which is not good.

Here are 30 reconnect attempts by grpc; note it is capped at 2 minutes and then jitters around that, but the minimum reconnect time approaches the max over time, not min+jitter:

0: 1s
1: 1.599317579s
2: 2.56393758s
3: 3.990555513s
4: 5.962072093s
5: 10.763383515s
6: 18.872196205s
7: 29.559588104s
8: 37.147186587s
9: 1m10.383560759s
10: 1m37.91169291s
11: 2m18.715950662s
12: 1m45.303404133s
13: 1m59.446929074s
14: 1m51.271171221s
15: 2m13.320486513s
16: 1m53.109727095s
17: 2m22.762489936s
18: 1m54.080148652s
19: 2m1.71590643s
20: 2m1.300598788s
21: 1m49.431812662s
22: 2m4.949590579s
23: 1m46.34926616s
24: 1m42.787212095s
25: 2m6.799644605s
26: 1m48.488322855s
27: 2m13.048933797s
28: 2m14.636264763s
29: 1m48.410590953s
30: 1m43.727174768s

Member:

So you want to control both then, yes? The most important thing is to spread out the jitter. The minimum wait stops the attempts from happening quickly, but changing that and not the jitter (unless you already have randomness in how folks got disconnected) will not help smooth the reconnects. So for the grpc example above, what are they achieving besides the client just not trying as often?

Contributor:

It's a true backoff (see the sketch after this list):

  • not as often
  • over time less often
  • capped at a max duration
  • with jitter always
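A minimal sketch of a backoff with those four properties; it is illustrative, not the grpc or NATS implementation, and assumes the math/rand and time packages:

// Illustrative only.
// backoffDelay grows exponentially with the attempt number, is capped at max,
// and always adds a random jitter of up to +/-10% so clients do not retry in lockstep.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	jitter := time.Duration(rand.Int63n(int64(d)/5+1)) - d/10
	return d + jitter
}

Over repeated attempts this reproduces the shape of the grpc trace above: early retries are short, later ones cluster around the cap with jitter.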

Member:

The point I was trying to make is that jitter is more important than the delay. I understand we could have both.

Member:

In the example you provided above, can you influence the jitter?

Member:

That is fair; use some library for wait and jitter combined, and just provide sane defaults for most scenarios.

In your case, is your algorithm TLS-aware?

Contributor:

No, but your defaults could be, of course, and that would be fine for most.

Choria can't be run without TLS unless you go and recompile it, at which point I don't care :)

Member Author:

I vote for that simple function. If set, it supersedes any ReconnectWait option. So we keep the current jitter options (non-TLS and TLS) for cases where the custom option is not set. The doc for CustomReconnectDelay() will clearly state that the library will sleep for the given delay and that it is recommended that the function return a value that includes some jitter.

Member Author:

I am ready to submit updates. Do you guys prefer that I squash first, or do you want to see just the difference (and I will squash only prior to merge)?

Member:

Squash for me.

nats.go Outdated
// Note that any jitter is capped with ReconnectJitterMax.
ReconnectJitterTLS time.Duration

// ReconnectJitterMax caps the maximum jitter.
Contributor:

Worth pointing out here that setting this to 0 will disable the capping.

// user the current number of attempts. This function returns the
// amount of time the library will sleep before attempting to reconnect
// again. It is strongly recommended that this value contains some
// jitter to prevent all connections to attempt reconnecting at the same time.
Contributor:

sweet 👍

@derekcollison self-requested a review May 5, 2020 16:51
go.mod Outdated
@@ -1,5 +1,7 @@
module github.com/nats-io/nats.go

go 1.14
Member:

Do we want this here?

Member Author:

No... but I updated to Go 1.14 and that touched this file... you don't know how mad this thing makes me.
I just discovered that VSCode now updates go.mod/go.sum anytime I save, because it uses go list, etc., which changes those files. Will revert.

nats.go Outdated
RequestChanLen = 8
DefaultDrainTimeout = 30 * time.Second
LangString = "go"
Version = "1.9.2"
Member:

Probably bump to 1.9.3

Member Author:

ok

nats.go Outdated
DefaultPort = 4222
DefaultMaxReconnect = 60
DefaultReconnectWait = 2 * time.Second
DefaultReconnectJitterNonTLS = 100 * time.Millisecond
Member:

Just DefaultReconnectJitter, IMO.

Member Author:

Got it. If TLS is not in the name, it means non-TLS.

Member:

Just going to say that this is uber complex - two jitter values, and a function that tells them that it failed when the close handler should do that. My question is: if that is the case, why don't we just have one jitter that is a random value added to the reconnect wait, and also an exponential backoff which starts at reconnect wait + jitter?

Member:

Also, with the advent of host:port, knowing the type of server is more difficult - and in most cases the client doesn't care, as the server will be whatever they have deployed and will likely be either TLS or not.

Member Author:

The callback is for users that want to implement their own backoff, in which case the lib will use whatever value is returned. For the default jitters, it is true that TLS connections usually have more issues reconnecting due to the cost of the TLS handshake, which is why there are two different values. Note that in both cases (plain and TLS) the lib still takes a random value between 0 and that value, which is added to the reconnect wait.
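A sketch of how the two default jitter values described here combine with the reconnect wait; the function and parameter names are assumptions about the internals, not the actual code (assumes math/rand and time):

// Illustrative only, not nats.go internals.
// effectiveWait adds a random jitter in [0, jitter) to the base reconnect wait,
// choosing the larger TLS jitter when the connection is secured.
func effectiveWait(reconnectWait, jitter, jitterTLS time.Duration, usingTLS bool) time.Duration {
	j := jitter // e.g. the 100ms non-TLS default
	if usingTLS {
		j = jitterTLS // e.g. the 1s TLS default
	}
	if j <= 0 {
		return reconnectWait
	}
	return reconnectWait + time.Duration(rand.Int63n(int64(j)))
}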

Member:

One more - I think the correct thing would be to have a maxReconnectWait that is analogous to reconnectWait, but where the wait is random between 0 and maxReconnectWait. That would give a bigger jitter window without the complexity of the multipliers.

Member:

> The callback is for users that want to implement their own backoff, in which case the lib will use whatever value is returned. For the default jitters, it is true that TLS connections usually have more issues reconnecting due to the cost of the TLS handshake, which is why there are two different values. Note that in both cases (plain and TLS) the lib still takes a random value between 0 and that value, which is added to the reconnect wait.

So why not just have jitter as a boolean that randomizes reconnectTimeWait? The jitter callback can be a dynamic thing - in the case of node the property can be a boolean or a function, and it's done.

Member:

We had long discussions on all of this; the base connect wait and jitter are separate things in almost all other systems. We tried to simplify and had quite a few back-and-forths.

- Add ReconnectJitter option setter allowing the user to set jitter for non-TLS and TLS connections (possibly different values)
- Default for non-TLS will be 100ms, and for TLS 1sec
- Add a CustomReconnectDelay handler that is passed the number of full URL-list attempts. The user callback will then return a duration that is going to be used for the wait. It is the responsibility of the user to return a delay with some jitter if desired.

Resolves #563

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
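A usage sketch based on the option names in the commit message above (ReconnectJitter and CustomReconnectDelay); the exact signatures, the server URL, and the delay values are assumptions, not taken verbatim from the released API:

// Usage sketch; option signatures are assumed from the commit message above.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://127.0.0.1:4222", // assumed local server URL
		nats.ReconnectWait(2*time.Second),
		// Random jitter added to the wait: up to 100ms for plain connections,
		// up to 1s for TLS connections (the defaults listed above).
		nats.ReconnectJitter(100*time.Millisecond, time.Second),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// To take full control of the delay instead, the commit message describes a
	// CustomReconnectDelay handler that receives the number of full URL-list attempts:
	//
	//   nats.CustomReconnectDelay(func(attempts int) time.Duration {
	//       return time.Duration(attempts*250) * time.Millisecond
	//   }),
}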
@kozlovic (Member, Author):
@derekcollison @ripienaar Please consider the changes we have made here. Since we now sleep only at the end of the list, it changes the fact that we used to check lastAttempt for each server we were trying to reconnect to. We have changed that (and I have removed the lastAttempt variable that was no longer used in b2be4bb). Let's make sure this is correct before others port it to their respective client libraries.
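A rough sketch of the loop shape being described (sleep only after the whole server list has been tried, not before each individual server); the identifiers are illustrative, not the actual doReconnect code (assumes the time package):

// Illustrative only.
func reconnectLoop(servers []string, delayFor func(attempts int) time.Duration, tryConnect func(url string) bool) {
	attempts := 0
	for {
		for _, url := range servers {
			if tryConnect(url) {
				return // reconnected
			}
		}
		attempts++ // one full pass over the URL list has failed
		time.Sleep(delayFor(attempts))
	}
}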

@derekcollison (Member) left a comment:

LGTM

i = 0
var st time.Duration
if crd != nil {
wlf++
Member:

Should we count wlf regardless of crd? It's a no-op now, but we may use it.

Member Author:

I think we will move it out of the condition when we actually need it; it will be easy to do.

Linked issue: Add jitter during reconnect (#563)
7 participants