Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hinted Failover Feature? #668

Closed
Plasma opened this issue Jul 25, 2017 · 10 comments
Closed

Hinted Failover Feature? #668

Plasma opened this issue Jul 25, 2017 · 10 comments

Comments

@Plasma
Copy link

Plasma commented Jul 25, 2017

Hi,

I wanted to see whether there was any interest in extending SE.Redis library to support what I deem a "hinted failover".

At least on Azure Redis Cache, they reboot the master/slave nodes for updates every month or two, which yields Redis socket errors during the failover - SE.Redis rightly complains that its socket has been disconnected, and any in-flight requests have now failed or were partially committed (assuming no transaction). This disrupts the app using Redis cache.

My suggestion to them was whether their platform could, say, use a PUBLISH on a pre-defined gossip channel that a failover on the node being subscribed to is about to occur, and that any supported clients (eg SE.Redis) can take advantage of this notice and drain its current connection of in-flight commands and prepare for a reconnect.

At a very basic level, perhaps something like:

// Indicate a failover is happening in 3 seconds (3000ms)
PUBLISH FailoverHints "Failover In: 3000"

SE.Redis in theory could be subscribed to this channel and when it receives this message, it realises in 3 seconds its going to need to reconnect to the socket as a failover is about to happen. In Azure Redis Cache's instance, the same DNS can be used, as its going to just point to the different node in the backend, so a reconnect is sufficient.

SE.Redis could do something like:

  1. Stop sending more Redis commands right at this moment, but add to internal buffer
  2. Complete/drain the existing socket of any in-flight commands, so they complete
  3. Reconnect after the indicated "Failover hint" delay from the pubsub channel, knowing we should now be connected to the "new" server
  4. Flush the buffered commands as we should now be on the new server

The objective is to provide a more graceful failover that is less disruptive to application code in the typical case. Instead of in-flight requests failing or there being uncertainty, the app just buffers commands for a few seconds (so app requests are delayed) but then they all get flushed without failure.

Thoughts?

@NickCraver
Copy link
Collaborator

As it happens, we do this internally already, that's how we gracefully handle failovers at Stack Overflow. We'd need to coordinate on a general way to do this (hopefully across all providers) though.

@JonCole do you have thoughts on this?

@JonCole
Copy link
Contributor

JonCole commented Sep 5, 2017

Yes, we (Azure Redis) have been considering something very similar to what is described above, although we may want to tweak the payload/syntax. We would be supportive of such a feature and would be willing to do the backend work to enable it, assuming we can come to a consensus on the implementation details.

@NickCraver
Copy link
Collaborator

@JonCole are you thinking of the same pub/sub design as Sentinel already has and clients are built for, to maximize compatibility for free if we can get it to work? Docs here on Sentinel: https://redis.io/topics/sentinel We'll be adding this to StackExchange.Redis shortly, and would much prefer that to any custom implementation for a single service provider.

@JonCole
Copy link
Contributor

JonCole commented Sep 5, 2017

@NickCraver we are currently most interested in a pub/sub solution. Sentinel does support pub/sub notification, but from the docs it sounds like the client connects to the Sentinel endpoint for that pub/sub. Would StackExchange.Redis allow subscribing to a configured endpoint/channel (e.g. Redis itself) instead of a Sentinel endpoint?

@NickCraver
Copy link
Collaborator

@JonCole I could absolutely see that, if we get the pub/sub structure the same the endpoint is of less consequence. Our docs could just instruct a user to point at the specified Azure endpoint for the pub/sub in some way and the rest "just works" with the rest of the protocol matching, hopefully.

@Plasma
Copy link
Author

Plasma commented Sep 17, 2017

Just chiming in to say thanks for considering this feature. There's been Azure Redis rolling reboots over the last few days, I have several clusters, so I get pain over a multi-day period due to abrupt socket disconnects.

@NickCraver out of interest, how are you handling this internally at SO? Does it work smooth enough for you in reality?

As a side, it seems simpler to publish direct to the pub/sub channel in the redis server being failed-over itself, rather than have a (yet another) endpoint, just pick a namespaced channel name (eg like you have for Booksleave Tie Breaker)?

Looking forward to this feature being made available.

@NickCraver
Copy link
Collaborator

@Plasma If you use the library to initiate the promotion, changes in replication, etc. it already pub/subs for all clients to re-check their configuration and handle it. We're simply using StackExchange.Redis inside Opserver when we click to do the failover. We have a dashboard that lets us control this, just some UI on top of the redis commands:
screen shot 2017-09-17 at 08 12 05

We pub/sub to both members being affected, to ensure that everyone gets it. In the master/slave chain scenario publishing to any master (or about-to-be master) will replicate the publish down their chains as well, so all clients receive it. Indeed we do not connect to another service, currently. But we're looking into Sentinel.

I'd love to merge Sentinel here but need integration tests for that PR and I haven't had the time yet. Was hoping the community would help out there but no progress so far...so has to wait for limited time unfortunately :(

@NickCraver
Copy link
Collaborator

@JonCole any changes here on your side? We've been swamped for a while, trying to catch up on things.

@NickCraver
Copy link
Collaborator

@deepakverma heads up: existing issue for the Azure maintenance publishes

@NickCraver
Copy link
Collaborator

We added this for Azure (and in a way other providers can help us add support for them as well) in #1876 - closing this out to cleanup but if you grab latest the library is now aware of incoming maintenance (and post-maintenance) happenings to reconnect ASAP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants