
Pub/Sub fixes for subscribe/re-subscribe #1947

Merged: 69 commits, Feb 4, 2022

Conversation

NickCraver (Collaborator) commented Jan 10, 2022:

We're working on pub/sub here, breaking it out explicitly from #1912. This relates to several issues and, in general, to handling resubscriptions on reconnect.

Issues: #1110, #1586, #1830, #1835

There are a few things in play we're investigating:

  • Subscription heartbeat not going over the subscription connection (due to PING and GetBridge)
  • Subscriptions not reconnecting at all (or potentially doing so and then unsubscribing, according to some issues)
  • Subscriptions always going to a single cluster node (due to default(RedisKey))

Overall this set of changes:

  • Completely restructures how RedisSubscriber works
    • No more PendingSubscriptionState (Subscription has the needed bits to reconnect)
    • Cleaner method topology (in RedisSubscriber, rather than Subscriber, RedisSubscriber, and ConnectionMultiplexer)
      • By placing these on RedisSubscriber, we can cleanly use ExecuteSync/Async bits, get proper profiling, etc.
    • Proper sync/async split (rather than Wait() in sync paths)
  • Changes how subscriptions work
    • The Subscription object is added to the ConnectionMultiplexer tracking immediately, but the command itself goes to the server and back (unless FireAndForget) before returning, for proper ordering like other commands.
    • No more Task.Run() loop - we now ensure reconnects as part of the handshake
    • Subscriptions are marked as not having a server the moment a disconnect is fired
      • Question: Should we have a throttle around this for massive numbers of connections, or async it?
  • Changes how connecting works
    • The connection completion handler now fires when the second bridge/connection completes. This means we won't have the interactive connection up while the subscription connection is in an unknown state: both are connected before we fire the handler, so the moment we come back from connect, subscriptions are in business.
  • Moves subscription tracking to a ConcurrentDictionary, since we only need limited locking around it and there's only one per multiplexer (see the sketch after this list).
    • TODO: this needs eyes; we could shift it - the implementation changed along the way, so this isn't a critical detail.
  • Fixes the TrackSubscriptionsProcessor - it was never setting the result, but nobody noticed for 8 years because downstream code never cared.
    • Note: each Subscription has a processor instance (with minimal state), because when the subscription command comes back we need to decide whether it successfully registered (if it didn't, we need to record that it has no successful server).
  • ConnectionMultiplexer grew a DefaultSubscriber for running some commands without lots of method duplication, e.g. ensuring servers are connected.
  • Overrides GetHashSlot on CommandChannelBase with the new RedisChannel-based methods so that it operates correctly
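
As a rough sketch of the tracking shape described above (illustrative names only - not the actual StackExchange.Redis internals):

    // Illustrative sketch: one ConcurrentDictionary per multiplexer maps
    // channel -> Subscription; each Subscription carries enough state to
    // re-subscribe after a disconnect.
    using System;
    using System.Collections.Concurrent;

    public sealed class Subscription
    {
        // The server this subscription is registered on; null after a
        // disconnect, which is what flags it for re-subscription.
        public object CurrentServer { get; private set; }
        public bool IsConnected => CurrentServer != null;

        public void SetCurrentServer(object server) => CurrentServer = server;
    }

    public sealed class SubscriberState
    {
        private readonly ConcurrentDictionary<string, Subscription> _byChannel = new();

        // Added to tracking immediately; the SUBSCRIBE round trip happens after.
        public Subscription GetOrAdd(string channel) =>
            _byChannel.GetOrAdd(channel, _ => new Subscription());

        // On disconnect, clear the server so the handshake path knows to
        // re-issue SUBSCRIBE for the affected subscriptions on reconnect.
        public void OnDisconnected(object server)
        {
            foreach (var sub in _byChannel.Values)
                if (ReferenceEquals(sub.CurrentServer, server))
                    sub.SetCurrentServer(null);
        }
    }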

Not directly related changes which helped here:

  • Better profiler helpers for tests and profiler logging in them
  • Re-enables a few PubSub tests that were unreliable before... but correctly so, since they were surfacing real bugs.

TODO: I'd like to add a few more test scenarios here:

  • Simple Subscribe/Publish/await Until/check pattern to ensure back-to-back subscribe/publish works well (sketched after this list)
  • Cluster connection failure and subscriptions moving to another node
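
The first scenario could look roughly like this as an xUnit test; the connection string, channel name, and TaskCompletionSource-based wait are placeholders rather than the repo's actual test helpers (Task.WaitAsync assumes .NET 6+):

    using System;
    using System.Threading.Tasks;
    using StackExchange.Redis;
    using Xunit;

    public class PubSubRoundTripTests
    {
        [Fact]
        public async Task SubscribeThenPublishIsObserved()
        {
            using var muxer = await ConnectionMultiplexer.ConnectAsync("localhost:6379");
            var sub = muxer.GetSubscriber();
            var received = new TaskCompletionSource<string>(
                TaskCreationOptions.RunContinuationsAsynchronously);

            // With this PR, SubscribeAsync completes after the server round
            // trip, so a publish issued immediately afterwards is delivered.
            await sub.SubscribeAsync("test-channel",
                (_, message) => received.TrySetResult(message));

            await sub.PublishAsync("test-channel", "hello");

            var observed = await received.Task.WaitAsync(TimeSpan.FromSeconds(5));
            Assert.Equal("hello", observed);
        }
    }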

To consider:

  • Subscription await loop from EnsureSubscriptionsAsync and connection impact on large reconnect situations
    • In a reconnect case, this is background and only the nodes affected have any latency...but still.
  • TODOs in code around variadic commands, e.g. re-subscribing with far fewer commands by using SUBSCRIBE <key1> <key2>... (see the batching sketch after this list)
    • In cluster, we'd have to batch per slot... or just go for the first available node
    • ...but if we go for the first available node, the semantics of IsConnected are slightly off in the not-connected case (CurrentServer is null), because we'd report being connected to wherever the command would go, even though that's non-deterministic without hash-slot batching. I think this is really minor and shouldn't affect our decision.
  • ConcurrentDictionary vs. returning to locks around a Dictionary
    • ...but if we have to lock on firing consumption of handlers anyway, concurrency overhead is probably a wash.
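
For the variadic idea, a hedged sketch of the per-slot batching; GetHashSlot stands in for the channel-aware slot computation this PR adds, and SendSubscribe for issuing one variadic command over the right bridge:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class Resubscriber
    {
        // One "SUBSCRIBE ch1 ch2 ..." per hash slot keeps each batch routed
        // to a single cluster node.
        public static void ResubscribeBatched(
            IEnumerable<string> channels,
            Func<string, int> getHashSlot,
            Action<IReadOnlyList<string>> sendSubscribe)
        {
            foreach (var slotGroup in channels.GroupBy(getHashSlot))
                sendSubscribe(slotGroup.ToList());
        }
    }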

NickCraver added a commit that referenced this pull request on Jan 18, 2022:
This won't be fully accurate until #1947 fixes the PING routing, but getting a test fix into main ahead of time.
Want to yank some of this into another PR ahead of time, getting files in.
…nd cleanup

In prep for changes to how we handle subscriptions internally, this does several things:
- Upgrades default Redis server assumption to 3.x
- Routes PING on Subscription keepalives over the subscription bridge appropriately
- Fixes cluster sharding: moves from default(RedisKey) to shared logic that handles RedisChannel as well (both in byte[] form) - see the hash-slot sketch below
- General code cleanup in the area (getting a lot of DEBUG/VERBOSE noise into isolated files)
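
For reference, the slot computation that shared logic has to apply to a channel's byte[] form is the standard Redis cluster CRC16 with hash-tag handling - sketched here, not the library's actual code:

    // Standard Redis cluster slot: CRC16-CCITT (XMODEM) of the bytes,
    // mod 16384, hashing only the {hashtag} contents when one is present.
    static int GetHashSlot(byte[] channel)
    {
        int start = Array.IndexOf(channel, (byte)'{');
        if (start >= 0)
        {
            int end = Array.IndexOf(channel, (byte)'}', start + 1);
            if (end > start + 1) // non-empty tag: hash only its contents
                return Crc16(channel, start + 1, end - start - 1) % 16384;
        }
        return Crc16(channel, 0, channel.Length) % 16384;
    }

    static int Crc16(byte[] data, int offset, int count)
    {
        int crc = 0;
        for (int i = offset; i < offset + count; i++)
        {
            crc ^= data[i] << 8;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 0x8000) != 0
                    ? ((crc << 1) ^ 0x1021) & 0xFFFF
                    : (crc << 1) & 0xFFFF;
        }
        return crc;
    }
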
NickCraver changed the base branch from main to craver/pub-sub-prep on January 20, 2022.
NickCraver (Collaborator, Author) commented:
@andre-ss6 Of course! I think there may be a confusing aspect of PublishAsync here though: in Redis, that's how many subscribers were on that server, not on that cluster. So, for example, you can get a 0 back but still get the subscription from the other node - it's going Publisher -> Node A -> Node B -> Subscriber. This tripped me up in tests too, and is probably worthy of a remark in the XML docs for this.
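
Concretely, a small usage sketch (assuming a cluster where the subscriber registered via a different node than the one the publish lands on):

    var sub = muxer.GetSubscriber();
    await sub.SubscribeAsync("news", (_, msg) => Console.WriteLine($"got: {msg}"));

    long receivers = await sub.PublishAsync("news", "hello");
    // 'receivers' counts clients on the destination server only, so it can
    // legitimately be 0 here even though the handler above still fires:
    // the message travels Publisher -> Node A -> Node B -> Subscriber.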

NickCraver (Collaborator, Author) commented:
@andre-ss6 Check out the ClusterNodeSubscriptionFailover test in the changeset here for an example of what I mean - in that test we're checking that we got the message, rather than the publisher count, which on a cluster turns out to be... kinda meaningless :(

andre-ss6 commented:
@NickCraver Oh I see! Thanks a lot for the clarification! I was actually going to test against the callback, instead of the return of PublishAsync, but was too lazy to do it 😅.

I will re-do the tests later then. Thanks again!

NickCraver (Collaborator, Author) commented:
@andre-ss6 Thanks for the poke here, I do appreciate the eyes :) I'm changing all the <returns> on these methods now to better indicate exactly what we both hit:

        /// <returns>
        /// The number of clients that received the message *on the destination server*,
        /// note that this doesn't mean much in a cluster as clients can get the message through other nodes.
        /// </returns>

This is something that tripped me and at least 1 user up - let's save some pain.
This is effectively the behavior we had before (but we now ensure we're connected first). We just want the count here and to start the subscriptions, so let's do that, pipelining the `n` commands as we did before instead of awaiting each; we also don't need the `Task` overhead. This makes moving to a variadic command set for this less urgent.
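
A sketch of the pipelining pattern that commit describes (channels and handler are illustrative):

    // Fire-and-forget lets all n SUBSCRIBE commands pipeline on the
    // connection instead of paying a full round trip per channel.
    foreach (var channel in channels)
        sub.Subscribe(channel, handler, CommandFlags.FireAndForget);

    // The serialized alternative awaits a round trip per channel:
    // foreach (var channel in channels)
    //     await sub.SubscribeAsync(channel, handler);
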
NickCraver (Collaborator, Author) commented:
See 169a173 (#1947) for what I'm thinking around large numbers of subs - moving ReconfigureAsync to FireAndForget for sub re-establishment.

andre-ss6 commented:
Hey @NickCraver, thanks a lot for the help and for your time! I re-did the tests and they worked! However, not how I thought they would 😅, as they also worked on an old version of the library (1.2.6). Since it worked, though, I guess this is now a bit off-topic for the PR, so I sent you a message on Gitter, if you have time to clarify some questions. But in any case, you already helped a lot. Thanks again!

@@ -1,4 +1,4 @@
-FROM redis:5
+FROM redis:6.2.6
Collaborator commented:
+1; this gives us a few other unrelated things we can use

mgravell (Collaborator) left a review comment:

this looks great and very well considered; unable to find anything other than pettiness, so: nice job sir

NickCraver (Collaborator, Author) commented:
@mgravell If there are little things to tweak as well, happy to! (Same for all the PRs.) I think we're about in a good state, so more than happy to polish anywhere.
