
Pub/sub in cluster mode, not seeing the replicas servicing traffic #1835

Closed
greyhoundgames opened this issue Aug 17, 2021 · 2 comments · Fixed by #1947


@greyhoundgames

I had just figured out that my main Redis cluster (AWS) was not using its replicas because I had failed to use
CommandFlags.PreferReplica during reads. After adding this, the metrics showed the traffic split nicely between the replicas. Great!
I went to my other Redis cluster, which backs a chat app, and did the same thing there. However, in this case it was pub/sub, whereas the first cluster was doing string gets.
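
For context, the read-side change on the first cluster looked roughly like this (a minimal sketch; `RedisHelpers.Connection` and the key name are hypothetical stand-ins, not from the issue):

```csharp
using StackExchange.Redis;

// Minimal sketch (hypothetical connection/key names): routing a read to a replica
// by passing CommandFlags.PreferReplica, as described above for the first cluster.
IDatabase db = RedisHelpers.Connection.GetDatabase();
RedisValue value = await db.StringGetAsync("player:1234", CommandFlags.PreferReplica);
```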

I added PreferReplica to the subscribe call. However, here the traffic does not seem to be spread out evenly. Is there a problem with replicas when using pub/sub?

Code used to subscribe:

```csharp
ISubscriber subscriber = RedisHelpers.ConnectionChat.GetSubscriber();
subscriber.Subscribe(user.room + ChatConstants.MESSAGES_SUB, chatCallback, CommandFlags.PreferReplica);
```

SDK version: NuGet 2.2.62
AWS Redis engine version compatibility: 6.0.5
Cluster config: 1 shard, 3 nodes, 2 of which are replicas.

@NickCraver (Collaborator)

Indeed there is! We found this recently while following up on issues. Specifically, there's a bug in sharding around a default(RedisKey) path that @mgravell is looking into as part of #1912. Once a fix goes in that PR (or is broken out into another), we'll update here.
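
To illustrate the effect (an illustrative sketch only, not the library's internals; the endpoint below is a placeholder): pub/sub commands were effectively hashed via `default(RedisKey)`, so the computed slot - and therefore the chosen node - never varied, whereas hashing the channel name itself would spread channels across slots.

```csharp
using System;
using StackExchange.Redis;

// Illustrative sketch only (placeholder endpoint): with default(RedisKey), every
// subscription computes the same slot, so all SUBSCRIBE traffic lands on one node.
var muxer = await ConnectionMultiplexer.ConnectAsync("localhost:7000");

RedisKey defaultKey = default;
Console.WriteLine(muxer.HashSlot(defaultKey));                  // same value every time

// Hashing the channel name itself would vary per channel, spreading slots/nodes:
Console.WriteLine(muxer.HashSlot((RedisKey)"room:1/messages"));
Console.WriteLine(muxer.HashSlot((RedisKey)"room:2/messages"));
```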

NickCraver added a commit that referenced this issue Feb 4, 2022
We're working on pub/sub - breaking it out explicitly from #1912. This relates to several issues and, in general, to handling resubscriptions on reconnect.

Issues: #1110, #1586, #1830, #1835

There are a few things in play we're investigating:
- [x] Subscription heartbeat not going over the subscription connection (due to `PING` and `GetBridge`)
- [x] Subscriptions not reconnecting at all (or potentially doing so and then unsubscribing, according to some issues)
- [x] Subscriptions always going to a single cluster node (due to `default(RedisKey)`)

Overall this set of changes:
- Completely restructures how RedisSubscriber works
  - No more `PendingSubscriptionState` (`Subscription` has the needed bits to reconnect)
  - Cleaner method topology (in `RedisSubscriber`, rather than `Subscriber`, `RedisSubscriber`, and `ConnectionMultiplexer`)
    - By placing these on `RedisSubscriber`, we can cleanly use `ExecuteSync/Async` bits, get proper profiling, etc.
  - Proper sync/async split (rather than `Wait()` in sync paths)
- Changes how subscriptions work
  - The `Subscription` object is added to the `ConnectionMultiplexer` tracking immediately, but the command itself goes to the server and back (unless FireAndForget) before returning, for proper ordering like other commands.
  - No more `Task.Run()` loop - we now ensure reconnects as part of the handshake
  - Subscriptions are marked as not having a server the moment a disconnect is fired
    - Question: Should we have a throttle around this for massive numbers of connections, or async it?
- Changes how connecting works
  - The connection completion handler will now fire when the _second_ bridge/connection completes. This means we won't have `interactive` connected but `subscription` in an unknown state: both are connected before we fire the handler, so the moment we come back from connect, subscriptions are in business.
- Moves to a `ConcurrentDictionary` since we only need limited locking around this and there's only one per multiplexer (sketched below, after this list).
  - TODO: This needs eyes; we could shift it - the implementation changed along the way, so this isn't a critical detail
- Fixes the `TrackSubscriptionsProcessor` - this was never setting the result, but no one noticed for 8 years because downstream code never cared.
  - Note: each `Subscription` has a processor instance (with minimal state) because when the subscription command comes back, _then_ we need to decide if it successfully registered (if it didn't, we need to record that it has no successful server)
- `ConnectionMultiplexer` grew a `DefaultSubscriber` for running some commands without lots of method duplication, e.g. ensuring servers are connected.
- Overrides `GetHashSlot` on `CommandChannelBase` with the new `RedisChannel`-based methods so that it operates correctly
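
As a rough illustration of the tracking shape described in the list above (a hypothetical sketch; none of these type or member names are the library's actual internals):

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using StackExchange.Redis;

// Hypothetical sketch, not the library's internals: per-multiplexer subscription state
// kept in a ConcurrentDictionary keyed by channel, needing only limited locking.
sealed class Subscription
{
    public Action<RedisChannel, RedisValue>? Handler { get; set; }
    public EndPoint? CurrentServer { get; set; } // null until (re)subscribed on a node
}

sealed class SubscriptionTracker
{
    private readonly ConcurrentDictionary<RedisChannel, Subscription> _subscriptions = new();

    public Subscription GetOrAdd(RedisChannel channel) =>
        _subscriptions.GetOrAdd(channel, _ => new Subscription());

    // The moment a disconnect fires, mark every subscription as having no server,
    // so the next handshake knows it has to re-subscribe them.
    public void OnDisconnected()
    {
        foreach (var pair in _subscriptions)
        {
            pair.Value.CurrentServer = null;
        }
    }
}
```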

Not directly related changes which helped here:
- Better profiler helpers for tests and profiler logging in them
- Re-enables a few `PubSub` tests that were unreliable before...but correctly so.

TODO: I'd like to add a few more test scenarios here:
- [x] Simple Subscribe/Publish/await Until/check pattern to ensure back-to-back subscribe/publish works well
- [x] Cluster connection failure and subscriptions moving to another node

To consider:
- [x] Subscription await loop from EnsureSubscriptionsAsync and connection impact on large reconnect situations
   - In a reconnect case, this happens in the background and only the affected nodes see any latency...but still.
- [ ] TODOs in code around variadic commands, e.g. re-subscribing with far fewer commands by using `SUBSCRIBE <key1> <key2>...` (a rough sketch of the per-slot grouping follows this list)
   - In cluster, we'd have to batch per slot...or just go for the first available node
   - ...but if we go for the first available node, the semantics of `IsConnected` are slightly off in the not connected (`CurrentServer is null`) case, because we'd say we're connected to where it _would_ go even though that'd be non-deterministic without hashslot batching. I think this is really minor and shouldn't affect our decision.
- [x] `ConcurrentDictionary` vs. returning to locks around a `Dictionary`
   - ...but if we have to lock on firing consumption of handlers anyway, concurrency overhead is probably a wash.
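
A rough sketch of the per-slot grouping mentioned in the variadic-commands item above (hypothetical helper names, not a library API):

```csharp
using System.Collections.Generic;
using System.Linq;
using StackExchange.Redis;

// Hypothetical sketch: group channels by hash slot so each slot-aligned batch could be
// sent as a single variadic SUBSCRIBE <ch1> <ch2> ... to the owning cluster node.
static class ResubscribeBatching
{
    public static IEnumerable<IGrouping<int, RedisChannel>> BatchBySlot(
        IConnectionMultiplexer muxer, IEnumerable<RedisChannel> channels) =>
        channels.GroupBy(channel => muxer.HashSlot((byte[])channel));
}
```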
@NickCraver (Collaborator)

This is drastically improved in #1947 and will be in the 2.5 release - thank you for the detail! It'll still prefer the first replica, but this should be an improvement over the current behavior. That preference is now also kept across reconnects.

NickCraver linked a pull request on Feb 6, 2022 that will close this issue.