(Another) performance issue after migrating from v1 to v2 #2000
I think it's worth noting that this is the 4th time we've tried to upgrade since 2.0 launched, and every time we've had performance regressions, mainly higher latency.
Can y'all provide any thread dumps or memory comparisons from those peak loads? Any information on where the time's being spent would be very helpful. Some constructs are very similar and some paths are completely rewritten, so knowing which features/paths you're using would help a lot. It may also be that we can suggest alternative setups, for example tweaks to the socket manager thread pool for burst loads, or moving off it entirely.
Sure! Will do! We already have process memory dumps from two machines (one on v2, the other on v1 of the library); we will soon also get a report on CPU usage by method. We can then send you all of those, plus any other info we manage to collect :)
Excellent - thank you! I'd like to improve perf however we can, so looking forward to any more info.
Hey @NickCraver. Here is some of the data we've collected so far.

After upgrading StackExchange.Redis from 1.2.6 to 2.2.88, we experienced a slowdown in some parts of our code that use Redis extensively, sometimes reaching a 2x slowdown. The following graphs show the average processing time of some of our modules that use Redis, where the green line represents the instance where the library has been upgraded (all the others are on 1.2.6). This is a small sample size, but it is consistent in every timespan we analyzed: the upgraded instance is always slower by roughly that same margin.

For comparison, these are the average processing times of our modules that do not use Redis, where, again, the green line represents the upgraded instance. There is a difference between instances here too - the green one is even higher than a few - but the difference is not only much smaller, it also does not correlate with the metrics we analyzed later (and present in this document) when comparing other instances (including the orange and purple ones).

We also have client-side code in place to measure the latency of Redis requests. It is simply a decorator on StackExchange.Redis.IDatabase, and it looks like this: [decorator screenshot]. Note that all methods in the interface are decorated in that same way, not only the ones shown here. The tracer referenced there is from a Dynatrace library.

These are the results from Dynatrace: [graphs: Response Times]
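The decorator screenshot did not survive the copy. A minimal sketch of what such a timing decorator over `StackExchange.Redis.IDatabase` might look like is below - this is a reconstruction under assumptions, not the actual code: the names `TracedDatabase`, `ITracer`, and `RecordLatency` are hypothetical stand-ins for the Dynatrace tracer abstraction mentioned above, and only one of the interface's many methods is shown.

```csharp
using System;
using System.Diagnostics;
using StackExchange.Redis;

// Hypothetical tracer abstraction standing in for the Dynatrace library.
public interface ITracer
{
    void RecordLatency(string operation, TimeSpan elapsed);
}

// Sketch of the decorator pattern described above. The real class would
// implement IDatabase fully, wrapping every member the same way.
public class TracedDatabase // : IDatabase (remaining members omitted)
{
    private readonly IDatabase _inner;
    private readonly ITracer _tracer;

    public TracedDatabase(IDatabase inner, ITracer tracer)
    {
        _inner = inner;
        _tracer = tracer;
    }

    public RedisValue StringGet(RedisKey key, CommandFlags flags = CommandFlags.None)
    {
        var sw = Stopwatch.StartNew();
        try
        {
            // Delegate to the real IDatabase implementation.
            return _inner.StringGet(key, flags);
        }
        finally
        {
            // Record wall-clock latency regardless of success or failure.
            sw.Stop();
            _tracer.RecordLatency(nameof(StringGet), sw.Elapsed);
        }
    }
}
```

One design note: measuring in a `finally` block means failed calls are timed too, which keeps latency percentiles honest under error conditions.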
GC and thread pool differences

We also noticed a moderate increase in GC suspension time and time spent in GC, and in both the number of active thread pool worker threads and the thread pool queue length. The following graphs show these metrics for some of our instances, one running the upgraded version of the library while the rest use 1.2.6.
@NickCraver Also, do you know of any way we could anonymize (basically remove strings from) our process memory dump so we can send it to you?
@andre-ss6 Thanks for these! I formatted the comment just a bit for easier comparison without scrolling. I wouldn't trust any tool to scrub data, so let me see about allocations between these 2 scenarios overall. Is this under a lot of concurrent load per instance? (I'm assuming yes)
Thanks!
Aight. You want screenshots as well?
Yes. We do use a singleton
For the sake of completeness, the issue mentioned by @andre-ss6 is in fact a PR: #1115
@andre-ss6 @andreminelli pssssst, you're going to want to peek at #2008. I'll aim to get it in with a prerelease up on NuGet ASAP - any chance you could give that a go in your scenario when available? It'll be on MyGet right after the PR's merged as well.
@NickCraver #2009 is taking a while to show up on MyGet. Does anything need to be done manually?
@NickCraver Ah, I just noticed some tests failed :/
@andre-ss6 Kicked that build, and I'm making that new Envoy test (which isn't present on AppVeyor, because Windows) more resilient.
@andre-ss6 You'll want to grab the latest off MyGet for testing what 2.5.x will have. I just ran a lot of old scenarios from issues that were performance gaps vs. 1.2.6, and they now look way, way better. Very curious what performance y'all see, especially with such great monitoring in place!
I'm also very curious! I'll be working on this for the next few days. Thanks a lot for your effort!
@NickCraver, is there an ETA for a 2.5 release?
@andreminelli What's up on NuGet now as https://www.nuget.org/packages/StackExchange.Redis/2.5.34-prerelease is pretty much final, but I'm hoping to fold in #configuration and release 2.5 tomorrow if time allows. I may get to it tonight, but it's a busy day so I'm unsure.
Wow, this soon? 😃
@andreminelli Welcome! If y'all are able to give that 2.5.34 a test I'd really appreciate it, to have complete confirmation that these changes in 2.5 make an impact in your scenario (we've been testing with partner teams elsewhere and see good things).
Does this one include the follow-up PR? I ask since the latest one on MyGet is 2.5.39.
2.5.39 has all of the performance changes; only the config bits are left to follow up on, but those shouldn't have any impact. I may add some connection logging as well, but in any case: any performance assessments you have of 2.5.39 would be super helpful!
@andre-ss6 @andreminelli 2.5.43 is up on NuGet now :)
Nice, thanks! We are in the process of acceptance testing ("homologation") of 2.5.39, and it's behaving well so far. We will try deploying to canary next Monday.
@NickCraver ... I've got some good news! The green line is still the canary instance, now with 2.5.39! We will probably deploy 2.5.43 to production next week!
Wooo, thanks for the update @andre-ss6!
Thank you for your continued contribution, your fast responses, and your efforts! This was very important for us!
I couldn't agree more with @andre-ss6. We should close this issue right after deploying with the 2.5.43 release, ok?
I appreciate all the diagnostics and testing - it helps us too <3 I'll leave this be; when y'all are deployed and good on 2.5.43, you can close at will, or let me know if something unexpected happens :)
We have deployed successfully using version 2.5.43, with nearly no performance difference from 1.2.6.
So we had been using v1.2.6 for at least the last 4 years, and we have tried (again) to update to v2.2.88 (the latest at this time).
This time we have more metrics in place to share, hoping this can help us all understand why the average "execution" time (we measure full 'module' execution time, where each module calls a few Redis operations among its own inner logic) is about 2 times slower with v2 compared to v1.
Note: these measurements were obtained by updating a single 'worker' instance to v2 (in theory this was the only change - we just updated the package, with no other code changes except fixing broken/new method signatures) and running it side by side with existing v1 'worker' instances. The volume of requests handled by the updated worker can be considered stable during this period.
Besides this 'slowness', we observed several internal performance metrics getting worse (the worker was updated on the morning of February 17th).
Our reading of those metrics at peak hours, in a nutshell, is that v2 is allocating more memory (we see many more collections across all generations), making the GC work harder (time in GC is consistently close to 15%, and suspension is above 8%). Thread usage is worse too, but it's not clear to me whether this is a symptom of the GC struggling or a cause...
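The GC and thread pool counters described above can also be watched live, independently of an APM product, with the `dotnet-counters` tool (this assumes a .NET Core 3.0+ runtime; the counter names are from the built-in `System.Runtime` provider, and `<PID>` is a placeholder for the worker's process id):

```shell
# Install the diagnostics tool once as a global dotnet tool
dotnet tool install --global dotnet-counters

# Live-monitor GC and thread pool counters for a running worker process
dotnet-counters monitor -p <PID> \
  "System.Runtime[gc-heap-size,gen-0-gc-count,gen-1-gc-count,gen-2-gc-count,time-in-gc,threadpool-thread-count,threadpool-queue-length]"
```

Comparing these numbers side by side on a v1 instance and the upgraded v2 instance under the same load would line up directly with the "time in GC close to 15%" observation above.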
Any thoughts?