Hinted Failover Feature? #668

Plasma · 2017-07-25T23:18:28Z

Hi,

I wanted to see whether there was any interest in extending SE.Redis library to support what I deem a "hinted failover".

At least on Azure Redis Cache, they reboot the master/slave nodes for updates every month or two, which yields Redis socket errors during the failover - SE.Redis rightly complains that its socket has been disconnected, and any in-flight requests have now failed or were partially committed (assuming no transaction). This disrupts the app using Redis cache.

My suggestion to them was that their platform could, say, PUBLISH on a pre-defined gossip channel to announce that a failover is about to occur on the node being subscribed to, so that any supported client (e.g. SE.Redis) can take advantage of this notice, drain its current connection of in-flight commands, and prepare for a reconnect.

At a very basic level, perhaps something like:

// Indicate a failover is happening in 3 seconds (3000ms)
PUBLISH FailoverHints "Failover In: 3000"

SE.Redis could, in theory, be subscribed to this channel; when it receives this message, it realises that in 3 seconds it's going to need to reconnect the socket because a failover is about to happen. In Azure Redis Cache's case, the same DNS name can be used, as it's just going to point to a different node in the backend, so a reconnect is sufficient.
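As a rough illustration of the client side, here is a minimal Python sketch of parsing that hint payload. The `Failover In: <milliseconds>` format is only the example from this post, not an agreed protocol, and `parse_failover_hint` is a hypothetical helper name, not part of SE.Redis.

```python
import re
from typing import Optional

# Assumed payload format, taken from the example above: "Failover In: <milliseconds>"
_HINT_PATTERN = re.compile(r"^Failover In:\s*(\d+)$")


def parse_failover_hint(message: str) -> Optional[float]:
    """Return the announced delay in seconds, or None if the message is not a hint."""
    match = _HINT_PATTERN.match(message.strip())
    if match is None:
        return None
    # The hint carries milliseconds; convert to seconds for timers.
    return int(match.group(1)) / 1000.0
```

Messages that do not match are ignored rather than raising, so unrelated traffic on the channel would not disturb the client.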

SE.Redis could do something like:

Stop sending new Redis commands immediately, and add them to an internal buffer instead
Drain the existing socket so that any in-flight commands complete
Reconnect after the delay indicated by the "Failover hint" on the pub/sub channel, knowing we should now be connected to the "new" server
Flush the buffered commands, as we should now be on the new server

The objective is to provide a more graceful failover that is less disruptive to application code in the typical case. Instead of in-flight requests failing or there being uncertainty, the app just buffers commands for a few seconds (so app requests are delayed) but then they all get flushed without failure.
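To make the four steps above concrete, here is a toy Python sketch of the buffer-then-flush behaviour. Everything in it is hypothetical: `HintedFailoverClient`, the `connect` factory and the method names are made up for illustration and are not how SE.Redis is implemented. Sends are synchronous here, so there is never anything in flight to drain; a real client would wait for pending replies before reconnecting, and would schedule the reconnect after the hinted delay.

```python
from collections import deque
from typing import Callable, Deque, List


class HintedFailoverClient:
    """Toy client that buffers commands while a hinted failover is in progress."""

    def __init__(self, connect: Callable[[], Callable[[str], str]]):
        # `connect` is a hypothetical factory returning a send(command) -> reply function.
        self._connect = connect
        self._send = connect()
        self._paused = False
        self._buffer: Deque[str] = deque()
        self.replies: List[str] = []

    def execute(self, command: str) -> None:
        if self._paused:
            # Step 1: hold new commands instead of writing to the old socket.
            self._buffer.append(command)
        else:
            self.replies.append(self._send(command))

    def on_failover_hint(self) -> None:
        # Called when the hint arrives. Step 2 (draining) is a no-op in this
        # synchronous model because every earlier send has already returned.
        self._paused = True

    def on_failover_complete(self) -> None:
        # Step 3: reconnect once the hinted delay has elapsed.
        self._send = self._connect()
        self._paused = False
        # Step 4: flush buffered commands in their original order.
        while self._buffer:
            self.replies.append(self._send(self._buffer.popleft()))
```

From the application's point of view, commands issued between the two callbacks are delayed rather than failed, which is the behaviour described above.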

Thoughts?


NickCraver · 2017-09-01T19:20:07Z

As it happens, we do this internally already, that's how we gracefully handle failovers at Stack Overflow. We'd need to coordinate on a general way to do this (hopefully across all providers) though.

@JonCole do you have thoughts on this?

JonCole · 2017-09-05T21:27:24Z

Yes, we (Azure Redis) have been considering something very similar to what is described above, although we may want to tweak the payload/syntax. We would be supportive of such a feature and would be willing to do the backend work to enable it, assuming we can come to a consensus on the implementation details.

NickCraver · 2017-09-05T22:21:51Z

@JonCole are you thinking of the same pub/sub design that Sentinel already has and that clients are built for, to maximize compatibility for free if we can get it to work? The Sentinel docs are at https://redis.io/topics/sentinel. We'll be adding this to StackExchange.Redis shortly, and would much prefer that to any custom implementation for a single service provider.

JonCole · 2017-09-05T23:24:44Z

@NickCraver we are currently most interested in a pub/sub solution. Sentinel does support pub/sub notification, but from the docs it sounds like the client connects to the Sentinel endpoint for that pub/sub. Would StackExchange.Redis allow subscribing to a configured endpoint/channel (e.g. Redis itself) instead of a Sentinel endpoint?

NickCraver · 2017-09-05T23:26:14Z

@JonCole I could absolutely see that. If we get the pub/sub structure the same, the endpoint is of less consequence. Our docs could just instruct a user to point at the specified Azure endpoint for the pub/sub in some way, and the rest "just works" with the rest of the protocol matching, hopefully.

Plasma · 2017-09-17T08:12:26Z

Just chiming in to say thanks for considering this feature. There have been Azure Redis rolling reboots over the last few days, and since I have several clusters, I get pain over a multi-day period due to abrupt socket disconnects.

@NickCraver out of interest, how are you handling this internally at SO? Does it work smoothly enough for you in reality?

As an aside, it seems simpler to publish directly to a pub/sub channel on the Redis server being failed over, rather than have (yet another) endpoint. Could you just pick a namespaced channel name (e.g. like you have for the BookSleeve tie-breaker)?

Looking forward to this feature being made available.

NickCraver · 2017-09-17T12:16:05Z

@Plasma If you use the library to initiate the promotion, changes in replication, etc., it already pub/subs for all clients to re-check their configuration and handle it. We're simply using StackExchange.Redis inside Opserver when we click to do the failover. We have a dashboard that lets us control this, just some UI on top of the Redis commands.

We pub/sub to both members being affected, to ensure that everyone gets it. In the master/slave chain scenario publishing to any master (or about-to-be master) will replicate the publish down their chains as well, so all clients receive it. Indeed we do not connect to another service, currently. But we're looking into Sentinel.

I'd love to merge Sentinel here, but I need integration tests for that PR and I haven't had the time yet. I was hoping the community would help out there, but no progress so far... so it has to wait on my limited time, unfortunately :(

NickCraver · 2018-05-27T21:56:57Z

@JonCole any changes here on your side? We've been swamped for a while, trying to catch up on things.

NickCraver · 2021-05-20T01:44:30Z

@deepakverma heads up: existing issue for the Azure maintenance publishes

NickCraver · 2022-01-08T13:50:55Z

We added this for Azure (and in a way that other providers can help us add support for them as well) in #1876. Closing this out to clean up, but if you grab the latest, the library is now aware of incoming maintenance (and post-maintenance) happenings so it can reconnect ASAP.

NickCraver added ⚙️ area:connection ☁️ platform:Azure labels Sep 2, 2017

NickCraver closed this as completed Jan 8, 2022
