Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect HashSlot calculation for XREAD and XREADGROUP commands #2093

Merged
merged 5 commits into from
Jun 14, 2022

Conversation

nielsderdaele
Copy link
Contributor

@nielsderdaele nielsderdaele commented Apr 14, 2022

I have created a fix for the incorrect HashSlot calculation for the XREAD and XREADGROUP commands (#2086) Since these differ a lot from other messages I have created some new CommandMessage classes:

  • SingleStreamReadGroupCommandMessage
  • MultiStreamReadGroupCommandMessage
  • SingleStreamReadCommandMessage
  • MultiStreamReadCommandMessage

The single CommandMessage classes could be removed by only using the multi command messages, but then we have to create a new StreamPostion array and a new StreamPostion object. I wasn't sure which option too choose, so feel free to give input on this (GC vs duplicate code).

There is also some code duplication between the Read and ReadGroup Command messages as the ReadGroup command is very similar to the XREAD but has some additional values in the front of the command (group, consumer, noack). A base class could be created as well to reduce some of the duplication (calculating the hashslot in case of multiple streams and writing the streams part of the commands).

The method CommandAndKey is also not overriden for these Messages as I wasn't which argument of the command to include in the string.

@NickCraver
Copy link
Collaborator

@nielsderdaele I think this looks good at a glance (needs some dedicated eyes), it'd be good if we could ensure this is working in tests though. For example we can attach a profiler and see if any messages were re-sent against cluster (also proving it's broken today). I'll try and grok this and propose some tests to ensure this doesn't regress and have a pattern where we can assess this anywhere else by using that.

@nielsderdaele
Copy link
Contributor Author

@NickCraver Sounds good. I was also thinking about writing some test for the problem but didn't get to it yet. I was considering listening to the HashSlotMoved event on the ConnectionMultiplexer. This shouldn't be raised if the HashSlots are calculated correctly. In an ideal situation these tests should be implemented for each command.

For now the code has only been tested manually with a debugger and by watching statistics of redis (rejected_calls) while running some software with the modified code. We have some software components which read high volume of streams using the StreamReadGroup commands (and acknowledges the messages after processing). This is how we stumbled upon the problem as we were seeing a huge number of rejected_calls in redis statistiscs.

Note that I have only looked at the Stream commands but the same problem exists for multiple commands (eg. SortedSetCombine, SetIntersectionLength). Most of these commands are easier to fix as they have static key positions (eg. multiple keys start at position 1 -> RedisValueKeysValue(s) Commands).

@NickCraver
Copy link
Collaborator

@nielsderdaele Good idea! I'm on docs at the moment just playing (mostly away this week), but like we have expected errors in the test base class, we could set an expected number of moves to globally find and fail all these cases perhaps. If a class is intentionally getting a move, then it could declare so, SetExpectedAmbientFailureCount style.

Thoughts?

@NickCraver
Copy link
Collaborator

hmmm, I don't think the event handler is going to work because it doesn't fire in all cases. I'll get some counters in so we can observe this in tests. I need to explore this anyway because we're like to trigger a reconfigure after any wave of MOVED responses to solve the topology change case some are seeing on cluster resizing anyhow.

Copy link
Collaborator

@mgravell mgravell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me; not necessarily for this PR, but I wonder if our test code has access to the internals here sufficiently that we could new the message and validate the slot calculation, which would have shown the original problem... just a thought

@NickCraver NickCraver merged commit 8c9cb1b into StackExchange:main Jun 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants