
Pub/Sub fixes for subscribe/re-subscribe #1947

Merged: 69 commits, Feb 4, 2022

Conversation

NickCraver (Collaborator) commented Jan 10, 2022:

We're working on pub/sub here, breaking it out explicitly from #1912. This relates to several issues and, in general, to handling resubscriptions on reconnect.

Issues: #1110, #1586, #1830, #1835

There are a few things in play we're investigating:

  • Subscription heartbeat not going over the subscription connection (due to PING and GetBridge)
  • Subscriptions not reconnecting at all (or potentially doing so and then unsubscribing, according to some issues)
  • Subscriptions always going to a single cluster node (due to default(RedisKey))

Overall this set of changes:

  • Completely restructures how RedisSubscriber works
    • No more PendingSubscriptionState (Subscription has the needed bits to reconnect)
    • Cleaner method topology (in RedisSubscriber, rather than Subscriber, RedisSubscriber, and ConnectionMultiplexer)
      • By placing these on RedisSubscriber, we can cleanly use ExecuteSync/Async bits, get proper profiling, etc.
    • Proper sync/async split (rather than Wait() in sync paths)
  • Changes how subscriptions work
    • The Subscription object is added to the ConnectionMultiplexer tracking immediately, but the command itself goes to the server and back (unless FireAndForget) before returning, for proper ordering like other commands.
    • No more Task.Run() loop - we now ensure reconnects as part of the handshake
    • Subscriptions are marked as not having a server the moment a disconnect is fired
      • Question: Should we have a throttle around this for massive numbers of connections, or async it?
  • Changes how connecting works
    • The connection completion handler now fires when the second bridge/connection completes. This means we won't have the interactive connection up while the subscription connection is in an unknown state: both are connected before we fire the handler, so the moment we come back from connect, subscriptions are in business.
  • Moves subscription tracking to a ConcurrentDictionary, since we only need limited locking around it and there's only one per multiplexer (see the sketch after this list).
    • TODO: this needs eyes; we could shift it - the implementation changed along the way, so this isn't a critical detail.
  • Fixes the TrackSubscriptionsProcessor - it was never setting the result, but nobody noticed for 8 years because downstream code never cared.
    • Note: each Subscription has a processor instance (with minimal state), because when the subscription command comes back we need to decide whether it successfully registered (if it didn't, we need to record that it has no successful server).
  • ConnectionMultiplexer grew a DefaultSubscriber for running some commands without lots of method duplication, e.g. ensuring servers are connected.
  • Overrides GetHashSlot on CommandChannelBase with the new RedisChannel-based methods so that it operates correctly
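
As a rough sketch of the tracking shape described above (illustrative names only - not the actual StackExchange.Redis internals):

    // Illustrative sketch: one ConcurrentDictionary per multiplexer maps
    // channel -> Subscription; each Subscription carries enough state to
    // re-subscribe after a disconnect.
    using System;
    using System.Collections.Concurrent;

    public sealed class Subscription
    {
        // The server this subscription is registered on; null after a
        // disconnect, which is what flags it for re-subscription.
        public object CurrentServer { get; private set; }
        public bool IsConnected => CurrentServer != null;

        public void SetCurrentServer(object server) => CurrentServer = server;
    }

    public sealed class SubscriberState
    {
        private readonly ConcurrentDictionary<string, Subscription> _byChannel = new();

        // Added to tracking immediately; the SUBSCRIBE round trip happens after.
        public Subscription GetOrAdd(string channel) =>
            _byChannel.GetOrAdd(channel, _ => new Subscription());

        // On disconnect, clear the server so the handshake path knows to
        // re-issue SUBSCRIBE for the affected subscriptions on reconnect.
        public void OnDisconnected(object server)
        {
            foreach (var sub in _byChannel.Values)
                if (ReferenceEquals(sub.CurrentServer, server))
                    sub.SetCurrentServer(null);
        }
    }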

Not directly related changes which helped here:

  • Better profiler helpers for tests and profiler logging in them
  • Re-enables a few PubSub tests that were unreliable before... but correctly so, since they were surfacing real bugs.

TODO: I'd like to add a few more test scenarios here:

  • Simple Subscribe/Publish/await Until/check pattern to ensure back-to-back subscribe/publish works well (sketched after this list)
  • Cluster connection failure and subscriptions moving to another node
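
The first scenario could look roughly like this as an xUnit test; the connection string, channel name, and TaskCompletionSource-based wait are placeholders rather than the repo's actual test helpers (Task.WaitAsync assumes .NET 6+):

    using System;
    using System.Threading.Tasks;
    using StackExchange.Redis;
    using Xunit;

    public class PubSubRoundTripTests
    {
        [Fact]
        public async Task SubscribeThenPublishIsObserved()
        {
            using var muxer = await ConnectionMultiplexer.ConnectAsync("localhost:6379");
            var sub = muxer.GetSubscriber();
            var received = new TaskCompletionSource<string>(
                TaskCreationOptions.RunContinuationsAsynchronously);

            // With this PR, SubscribeAsync completes after the server round
            // trip, so a publish issued immediately afterwards is delivered.
            await sub.SubscribeAsync("test-channel",
                (_, message) => received.TrySetResult(message));

            await sub.PublishAsync("test-channel", "hello");

            var observed = await received.Task.WaitAsync(TimeSpan.FromSeconds(5));
            Assert.Equal("hello", observed);
        }
    }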

To consider:

  • Subscription await loop from EnsureSubscriptionsAsync and connection impact on large reconnect situations
    • In a reconnect case, this is background and only the nodes affected have any latency...but still.
  • TODOs in code around variadic commands, e.g. re-subscribing with far fewer commands by using SUBSCRIBE <key1> <key2>... (see the batching sketch after this list)
    • In cluster, we'd have to batch per slot... or just go for the first available node
    • ...but if we go for the first available node, the semantics of IsConnected are slightly off in the not-connected case (CurrentServer is null), because we'd report being connected to wherever the command would go, even though that's non-deterministic without hash-slot batching. I think this is really minor and shouldn't affect our decision.
  • ConcurrentDictionary vs. returning to locks around a Dictionary
    • ...but if we have to lock on firing consumption of handlers anyway, concurrency overhead is probably a wash.
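
For the variadic idea, a hedged sketch of the per-slot batching; GetHashSlot stands in for the channel-aware slot computation this PR adds, and SendSubscribe for issuing one variadic command over the right bridge:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class Resubscriber
    {
        // One "SUBSCRIBE ch1 ch2 ..." per hash slot keeps each batch routed
        // to a single cluster node.
        public static void ResubscribeBatched(
            IEnumerable<string> channels,
            Func<string, int> getHashSlot,
            Action<IReadOnlyList<string>> sendSubscribe)
        {
            foreach (var slotGroup in channels.GroupBy(getHashSlot))
                sendSubscribe(slotGroup.ToList());
        }
    }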

NickCraver added a commit that referenced this pull request on Jan 18, 2022:
This won't be fully accurate until #1947 fixes the PING routing, but getting a test fix into main ahead of time.
Want to yank some of this into another PR ahead of time, getting files in.
…nd cleanup

In prep for changes to how we handle subscriptions internally, this does several things:
- Upgrades default Redis server assumption to 3.x
- Routes PING on Subscription keepalives over the subscription bridge appropriately
- Fixes cluster sharding: moves from default(RedisKey) to shared logic that handles RedisChannel as well (both in byte[] form) - see the hash-slot sketch below
- General code cleanup in the area (getting a lot of DEBUG/VERBOSE noise into isolated files)
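
For reference, the slot computation that shared logic has to apply to a channel's byte[] form is the standard Redis cluster CRC16 with hash-tag handling - sketched here, not the library's actual code:

    // Standard Redis cluster slot: CRC16-CCITT (XMODEM) of the bytes,
    // mod 16384, hashing only the {hashtag} contents when one is present.
    static int GetHashSlot(byte[] channel)
    {
        int start = Array.IndexOf(channel, (byte)'{');
        if (start >= 0)
        {
            int end = Array.IndexOf(channel, (byte)'}', start + 1);
            if (end > start + 1) // non-empty tag: hash only its contents
                return Crc16(channel, start + 1, end - start - 1) % 16384;
        }
        return Crc16(channel, 0, channel.Length) % 16384;
    }

    static int Crc16(byte[] data, int offset, int count)
    {
        int crc = 0;
        for (int i = offset; i < offset + count; i++)
        {
            crc ^= data[i] << 8;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 0x8000) != 0
                    ? ((crc << 1) ^ 0x1021) & 0xFFFF
                    : (crc << 1) & 0xFFFF;
        }
        return crc;
    }
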
NickCraver changed the base branch from main to craver/pub-sub-prep on January 20, 2022.
NickCraver (Collaborator, Author) commented:
@andre-ss6 Of course! I think there may be a confusing aspect of PublishAsync here though: in Redis, that's how many subscribers were on that server, not on that cluster. So, for example, you can get a 0 back but still get the subscription from the other node - it's going Publisher -> Node A -> Node B -> Subscriber. This tripped me up in tests too, and is probably worthy of a remark in the XML docs for this.
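
Concretely, a small usage sketch (assuming a cluster where the subscriber registered via a different node than the one the publish lands on):

    var sub = muxer.GetSubscriber();
    await sub.SubscribeAsync("news", (_, msg) => Console.WriteLine($"got: {msg}"));

    long receivers = await sub.PublishAsync("news", "hello");
    // 'receivers' counts clients on the destination server only, so it can
    // legitimately be 0 here even though the handler above still fires:
    // the message travels Publisher -> Node A -> Node B -> Subscriber.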

NickCraver (Collaborator, Author) commented:
@andre-ss6 Check out the ClusterNodeSubscriptionFailover test in the changeset here for an example of what I mean - in that test we're checking that we got the message, rather than the publisher count, which on a cluster turns out to be... kinda meaningless :(

andre-ss6 commented:
@NickCraver Oh I see! Thanks a lot for the clarification! I was actually going to test against the callback, instead of the return of PublishAsync, but was too lazy to do it 😅.

I will re-do the tests later then. Thanks again!

NickCraver (Collaborator, Author) commented:
@andre-ss6 Thanks for the poke here, I do appreciate the eyes :) I'm changing all the <returns> on these methods now to better indicate exactly what we both hit:

        /// <returns>
        /// The number of clients that received the message *on the destination server*,
        /// note that this doesn't mean much in a cluster as clients can get the message through other nodes.
        /// </returns>

This is something that tripped me and at least 1 user up - let's save some pain.
This is effectively the behavior we had before (but we now ensure we're connected first). We just want the count here and to start the subscriptions, so let's do that, pipelining the `n` commands as we did before instead of awaiting each; we also don't need the `Task` overhead. This makes moving to a variadic command set for this less urgent.
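
A sketch of the pipelining pattern that commit describes (channels and handler are illustrative):

    // Fire-and-forget lets all n SUBSCRIBE commands pipeline on the
    // connection instead of paying a full round trip per channel.
    foreach (var channel in channels)
        sub.Subscribe(channel, handler, CommandFlags.FireAndForget);

    // The serialized alternative awaits a round trip per channel:
    // foreach (var channel in channels)
    //     await sub.SubscribeAsync(channel, handler);
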
NickCraver (Collaborator, Author) commented:
See 169a173 (#1947) for what I'm thinking around large numbers of subs - moving ReconfigureAsync to FireAndForget for sub re-establishment.

andre-ss6 commented:
Hey @NickCraver, thanks a lot for the help and for your time! I re-did the tests and they worked! However, not how I thought they would 😅, as they also worked on an old version of the library (1.2.6). Since it worked, though, I guess this is now a bit off-topic for the PR, so I sent you a message on Gitter, if you have time to clarify some questions. But in any case, you already helped a lot. Thanks again!

@@ -1,4 +1,4 @@
-FROM redis:5
+FROM redis:6.2.6
Collaborator commented:
+1; this gives us a few other unrelated things we can use

mgravell (Collaborator) left a review comment:

this looks great and very well considered; unable to find anything other than pettiness, so: nice job sir

NickCraver (Collaborator, Author) commented:
@mgravell If there are little things to tweak as well, happy to! (Same for all the PRs.) I think we're about in a good state, so more than happy to polish anywhere.
