
(Another) performance issue after migrating from v1 to v2 #2000

Closed
andreminelli opened this issue Feb 18, 2022 · 29 comments

@andreminelli

So we have been using v1.2.6 for at least the last 4 years, and we have tried (again) to update to v2.2.88 (the latest at this time).

This time we have more metrics in place to share, hoping this can help us all understand why the average "execution" time (we measure full 'module' execution time; each module calls a few Redis operations as part of its own inner logic) is about 2 times slower with v2 than with v1.
Note: these measurements were obtained by updating a single 'worker' instance to V2 (in theory this was the only change - we just updated the package, with no other code changes except to fix broken/changed method signatures) and running it side by side with the existing V1 'worker' instances. The volume of requests handled by the updated worker can be considered stable during this period.

Besides this 'slowness', we observed several internal performance metrics getting worse (the worker was updated on the morning of February 17th).
Our 'reading' of those metrics at peak hours, in a nutshell, is that V2 is allocating more memory (we see many more collections in all generations), making the GC work harder (time in GC is consistently close to 15% and suspension above 8%). Thread usage is worse too, but it's not clear to me whether this is a symptom of the GC pressure or a cause...

Any thoughts?

[Screenshots: module execution time, GC collections / time in GC / suspension, and thread usage metrics for the V2 worker vs. the V1 workers]

@andre-ss6

I think it's worth noting that this is the 4th time we have tried to upgrade since 2.0 launched, and every time we have had performance regressions, mainly higher latency.

@NickCraver
Collaborator

Can y'all provide any thread dumps or memory comparisons from those peak loads? Any information on where the time's being spent would be very helpful. Some constructs are very similar and some paths are completely rewritten, so knowing which features/paths you're using matters a lot. It may also be that we can suggest alternative setups, for example tweaks to the socket manager thread pool for burst loads, or generally moving off it.
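
As a minimal sketch of that last suggestion - assuming the 2.x API, where ConfigurationOptions.SocketManager and the built-in SocketManager.ThreadPool instance are available, and with a placeholder connection string - moving the multiplexer off the dedicated socket-manager pool looks roughly like this:

```csharp
using StackExchange.Redis;

// Sketch: route socket I/O through the regular .NET thread pool instead of
// the library's dedicated socket-manager pool ("my-redis:6379" is a placeholder).
var options = ConfigurationOptions.Parse("my-redis:6379");
options.SocketManager = SocketManager.ThreadPool;
var muxer = ConnectionMultiplexer.Connect(options);
```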

@andre-ss6

andre-ss6 commented Feb 18, 2022

Sure! Will do! We already have a process memory dump from two machines (one on v2 of the library, another on v1); we will soon also get a report of CPU usage by method. We can then send you all of that plus any other info we manage to collect :)

@NickCraver
Collaborator

Excellent - thank you! I'd like to improve perf however we can, so looking forward to any more info.

@andre-ss6

andre-ss6 commented Feb 23, 2022

Hey @NickCraver. These are some of the data that we've collected so far:

After we upgraded StackExchange.Redis from 1.2.6 to 2.2.88, we experienced a slowdown in some parts of our code that use Redis extensively, sometimes reaching a 2x slowdown. The following graphs show the average processing time of some of our modules that use Redis, where the green line represents the instance where the library has been upgraded (all the others are on 1.2.6):

[Graph: average processing time of the Redis-using modules; green line = upgraded instance]

This is a small sample size, but it is consistent over every timespan we analyzed: the instance where the library is upgraded is always slower by roughly that same margin.

For comparison, these are the average processing times of our modules that do not use Redis, where, again, the green line represents the upgraded instance:

[Graph: average processing time of modules that do not use Redis; green line = upgraded instance]

We can observe a difference between the instances here too - the green one is even higher than a few of them. However, the difference is not only much smaller, it also doesn't correlate with the metrics presented later in this document when we compare other instances (including the orange and purple ones).

We also have client-side code in place to measure the latency of Redis requests. It is simply a decorator on StackExchange.Redis.IDatabase, and it looks like this:

[Screenshot: the IDatabase decorator that times each call]

Do note that all methods in the interface are decorated in that same way, not only the ones shown here.

The tracer referenced there comes from a Dynatrace library.
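
Since the screenshot doesn't reproduce here, a minimal sketch of that decorator pattern (not the exact code from the screenshot; only one method is shown, and the recordLatency callback is a stand-in for the Dynatrace tracer) looks roughly like:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using StackExchange.Redis;

// Sketch only: the real decorator wraps every member of IDatabase in the
// same way; a single method is shown here for brevity.
public sealed class TimedDatabase
{
    private readonly IDatabase _inner;
    private readonly Action<string, TimeSpan> _recordLatency; // stand-in for the tracer

    public TimedDatabase(IDatabase inner, Action<string, TimeSpan> recordLatency)
    {
        _inner = inner;
        _recordLatency = recordLatency;
    }

    public async Task<RedisValue> StringGetAsync(RedisKey key, CommandFlags flags = CommandFlags.None)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            return await _inner.StringGetAsync(key, flags).ConfigureAwait(false);
        }
        finally
        {
            _recordLatency(nameof(StringGetAsync), sw.Elapsed);
        }
    }
}
```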

These are the results from Dynatrace:

Response Times

[Screenshots: average, median, and 99th percentile response times for 2.2.88 vs. 1.2.6 (Instance 1) and 1.2.6 (Instance 2)]

GC and Thread pool differences

We also noticed a moderate increase in GC suspension time and time spent in GC, as well as in both the number of active thread pool worker threads and the thread pool queue length. The following graphs show these metrics for some of our instances, one running the upgraded version of the library while the rest are using 1.2.6.

[Screenshots: GC suspension, time in GC, and thread pool worker/queue metrics for 2.2.88 vs. 1.2.6 (Instances 1-3)]
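
For reference, the same kind of counters can be sampled in-process as a cross-check of the APM graphs; a rough sketch, assuming .NET Core 3.0+ (where ThreadPool.ThreadCount and ThreadPool.PendingWorkItemCount exist):

```csharp
using System;
using System.Threading;

// Sketch: snapshot of GC and thread pool counters comparable to the graphs above.
static class RuntimeCounters
{
    public static void PrintSnapshot()
    {
        Console.WriteLine($"Gen0/1/2 collections: {GC.CollectionCount(0)}/{GC.CollectionCount(1)}/{GC.CollectionCount(2)}");
        Console.WriteLine($"Total allocated bytes: {GC.GetTotalAllocatedBytes()}");
        Console.WriteLine($"Thread pool threads:   {ThreadPool.ThreadCount}");
        Console.WriteLine($"Thread pool queue:     {ThreadPool.PendingWorkItemCount}");
    }
}
```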

@andre-ss6

@NickCraver Also, do you know any way we could anonymize (basically remove strings from) our process memory dump in order to send it to you?

@NickCraver
Collaborator

@andre-ss6 Thanks for these! I formatted the comment just a bit for easier comparison than scrolling. I wouldn't trust any tool to scrub data, so let me see about allocations between these 2 scenarios overall. Is this under a lot of concurrent load per instance? (I'm assuming yes)

@andre-ss6

andre-ss6 commented Feb 23, 2022

> @andre-ss6 Thanks for these! I formatted the comment just a bit for easier comparison than scrolling.

Thanks!

> I wouldn't trust any tool to scrub data, so let me see about allocations between these 2 scenarios overall.

Aight. You want screenshots as well?

> Is this under a lot of concurrent load per instance? (I'm assuming yes)

Yes. We do use a singleton IConnectionMultiplexer per instance, though. By the way, we have also tried setting the Client Workers count to 20 (based on a comment in another issue) using a custom SocketManager in the upgraded version; however, it seems not to have had any effect at all.
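
For context, the singleton-multiplexer setup mentioned above is typically just a lazily created shared connection; a minimal sketch (the connection string is a placeholder):

```csharp
using System;
using StackExchange.Redis;

// Sketch: one IConnectionMultiplexer shared per process.
public static class RedisConnection
{
    private static readonly Lazy<ConnectionMultiplexer> Lazy =
        new Lazy<ConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect("my-redis:6379")); // placeholder

    public static IConnectionMultiplexer Multiplexer => Lazy.Value;

    public static IDatabase GetDatabase() => Multiplexer.GetDatabase();
}
```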

@andreminelli
Author

> By the way, we have also tried setting the Client Workers count to 20 (based on a comment in another issue) using a custom SocketManager in the upgraded version

For the sake of completeness, the issue mentioned by @andre-ss6 is in fact a PR: #1115

@NickCraver
Collaborator

@andre-ss6 @andreminelli pssssst, you're going to want to peek at #2008. I'll aim to get it in with a prerelease up on NuGet ASAP - any chance you could give that a go in your scenario when available? It'll be on MyGet right after the PR's merged as well.

@andre-ss6

@NickCraver #2009 is taking a while to show up on MyGet. Does anything need to be done manually?

@andre-ss6

@NickCraver Ah, I just noticed some tests failed :/

@NickCraver
Collaborator

@andre-ss6 I kicked that build, and I'm making that new Envoy test (which isn't present on AppVeyor, because Windows) more resilient.

@NickCraver
Collaborator

@andre-ss6 You'll want to grab the latest off MyGet for testing what 2.5.x will have. I just ran a lot of old scenarios from issues that were performance gaps vs. 1.2.6 but now look way, way better. Very curious what performance y'all see, especially with such great monitoring in place!

@andre-ss6

> @andre-ss6 You'll want to grab the latest off MyGet for testing what 2.5.x will have. I just ran a lot of old scenarios from issues that were performance gaps vs. 1.2.6 but now look way, way better. Very curious what performance y'all see, especially with such great monitoring in place!

I'm also very curious! I'll be working on this for the next few days. Thanks a lot for your effort!

@andreminelli
Author

@NickCraver, is there an ETA for a 2.5 release?

@NickCraver
Collaborator

@andreminelli There's pretty much the final build up on NuGet now as https://www.nuget.org/packages/StackExchange.Redis/2.5.34-prerelease, but I'm hoping to fold in #configuration and release 2.5 tomorrow if time allows; I may get to it tonight, but it's a busy day so I'm unsure.

@andreminelli
Author

> There's pretty much the final build up on NuGet now as https://www.nuget.org/packages/StackExchange.Redis/2.5.34-prerelease, but I'm hoping to fold in #configuration and release 2.5 tomorrow if time allows; I may get to it tonight, but it's a busy day so I'm unsure.

Wow, this soon? 😃
I must say I was expecting to get this only after a few weeks or so.
Thanks!

@NickCraver
Collaborator

@andreminelli You're welcome! If y'all are able to give that 2.5.34 a test, I'd really appreciate it - it would give us full confirmation that the changes in 2.5 make an impact in your scenario (we've been testing with partner teams elsewhere and are seeing good things).

@andre-ss6

> @andreminelli There's pretty much the final build up on NuGet now as https://www.nuget.org/packages/StackExchange.Redis/2.5.34-prerelease, but I'm hoping to fold in #configuration and release 2.5 tomorrow if time allows; I may get to it tonight, but it's a busy day so I'm unsure.

Does this one include the follow-up PR? I ask since the latest one on MyGet is 2.5.39.

@NickCraver
Collaborator

2.5.39 has all of the performance changes; only the config bits are left to follow up on, but those shouldn't have any impact. I may add some connection logging as well, but in any case: any performance assessments you have of 2.5.39 would be super helpful!

@NickCraver
Collaborator

@andre-ss6 @andreminelli 2.5.43 is up on NuGet now :)

@andre-ss6

andre-ss6 commented Mar 5, 2022

> @andre-ss6 @andreminelli 2.5.43 is up on NuGet now :)

Nice, thanks! We are in the process of acceptance testing (homologation) of 2.5.39; it's behaving well so far. We will try deploying it to canary next Monday.

@andre-ss6

andre-ss6 commented Mar 8, 2022

@NickCraver ... I got some good news!

The green line is still the canary instance, with 2.5.39 now!

[Graphs: module processing times with the canary instance (green line) on 2.5.39]

We will probably deploy 2.5.43 to production next week!

@NickCraver
Collaborator

Wooo, thanks for the update @andre-ss6!

@andre-ss6

Thank you for your continued contributions, your fast responses, and your efforts! This was very important for us!

@andreminelli
Author

andreminelli commented Mar 9, 2022

> Thank you for your continued contributions, your fast responses, and your efforts! This was very important for us!

I couldn't agree more with @andre-ss6.
Thank you so much, @NickCraver.

We should close this issue right after deploying the 2.5.43 release, OK?

@NickCraver
Collaborator

I appreciate all the diagnostics and testing - it's a help to us too <3

I'll leave this be; when y'all are deployed and good on 2.5.43, you can close it at will, or let me know if something unexpected happens :)

@andreminelli
Author

We have deployed successfully using version 2.5.43, with nearly no performance difference from 1.2.6.
Thanks again, @NickCraver 👏👏
