
[PROF-8864] Dynamic allocation sampling #3395

Merged: AlexJF merged 17 commits into master from alexjf/prof-8864-dynamic-allocation-sampling2 on Jan 31, 2024

Conversation

@AlexJF (Contributor) commented on Jan 23, 2024

What does this PR do?

  • Introduces a dynamic sampler for discrete events that is governed by an overhead target. The concept is similar to that of the existing dynamic sampler, except that the existing one operates under the assumption that we can sample "at will".
    • Every second, sampler settings are readjusted based on the perceived volume and overhead of the recent past (via exponential moving averages). A rough sketch of this adjustment loop follows below.
  • Removes the allocation sample rate configuration, plugging the aforementioned sampler into the allocation sampling logic instead.
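
For readers unfamiliar with the approach, here is a minimal, self-contained sketch of the kind of adjustment loop described above. All names and constants in it (`ToySampler`, `TARGET_OVERHEAD`, `readjust!`, `EMA_ALPHA`, ...) are assumptions for this sketch, not the actual implementation introduced by this PR:

```ruby
# Illustration only: a toy overhead-target-driven sampler.
class ToySampler
  EMA_ALPHA = 0.5          # smoothing factor for the exponential moving averages
  WINDOW_SECONDS = 1.0     # readjust once per second
  TARGET_OVERHEAD = 0.01   # aim to spend ~1% of wall time sampling

  def initialize
    @probability = 1.0
    @events_ema = 0.0                # smoothed events seen per window
    @sampling_time_ema = 0.0         # smoothed seconds spent sampling per window
    @events_this_window = 0
    @sampling_time_this_window = 0.0
  end

  # Called on every allocation; returns true if this event should be sampled.
  def sample?
    @events_this_window += 1
    rand < @probability
  end

  # Called by the sampling code to report how long taking a sample actually took.
  def record_sample_duration(seconds)
    @sampling_time_this_window += seconds
  end

  # Called once per adjustment window (i.e. every second).
  def readjust!
    @events_ema = ema(@events_ema, @events_this_window)
    @sampling_time_ema = ema(@sampling_time_ema, @sampling_time_this_window)

    # Recent average cost of taking one sample.
    sampled_per_window = [@events_ema * @probability, 1.0].max
    avg_sample_cost = @sampling_time_ema / sampled_per_window

    # How many samples per window can we afford without blowing the target?
    affordable = (TARGET_OVERHEAD * WINDOW_SECONDS) / [avg_sample_cost, 1e-9].max
    @probability = [affordable / [@events_ema, 1.0].max, 1.0].min

    @events_this_window = 0
    @sampling_time_this_window = 0.0
  end

  private

  def ema(previous, current)
    EMA_ALPHA * current + (1 - EMA_ALPHA) * previous
  end
end
```

The real sampler also has to decide when to take occasional "probing" samples after periods with no sampling data at all; that behaviour is what the spec snippet quoted later in this conversation exercises, and the sketch above glosses over it.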

Motivation:
Hardcoded/fixed sample rate configurations cannot be a good fit for all applications and their usage patterns.

Take the previous default of sampling 1 allocation out of every 50:

  • On an app that does 1 allocation every second, we'd essentially sample only one allocation per profile, with a negligible impact on the application.
  • On an app that does 500 allocations every second, we'd now be sampling 600 allocations per profile, which looks a bit more meaningful, but it's now hard to quantify/understand that impact (the arithmetic behind these numbers is sketched right after this list).
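
To make the arithmetic behind those two bullets concrete (the 60-second profile period used here is an assumption, matching the profile period mentioned later in the review discussion):

```ruby
# Back-of-the-envelope arithmetic for the old fixed 1-in-50 sample rate,
# assuming a 60-second profile period.
profile_seconds = 60
sample_rate = 1.0 / 50

[1, 500].each do |allocations_per_second|
  samples = allocations_per_second * profile_seconds * sample_rate
  puts "#{allocations_per_second} alloc/s => ~#{samples.round} samples per profile"
end
# 1 alloc/s   => ~1 sample per profile
# 500 alloc/s => ~600 samples per profile
```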

By using a dynamic sampler that attempts to adhere to a specific target of CPU overhead, it becomes much easier to reason about the impact of the allocation profiler, and sampling automatically adapts to usage patterns (an illustrative calculation follows after this list):

  • When mostly idle, we can afford to sample all allocations that come in.
  • On heavy usage/traffic, we'll automatically sample less and less to keep overhead under control.
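
As a rough illustration of that adaptation (the overhead target value, per-sample cost, and the simple probability formula below are assumptions for this sketch, not the PR's actual math):

```ruby
# Illustration only: how a fixed overhead budget turns into an effective
# sampling probability at different event rates.
target_overhead_per_second = 0.01 # seconds of sampling time allowed per second (assumed)
avg_sample_cost_seconds = 0.0001  # assumed cost of taking one sample

affordable_samples_per_second = target_overhead_per_second / avg_sample_cost_seconds # 100.0

[10, 10_000].each do |events_per_second|
  probability = [affordable_samples_per_second / events_per_second, 1.0].min
  puts format("%6d events/s => sampling probability %.3f", events_per_second, probability)
end
# A mostly idle app (10 events/s) keeps probability 1.0 and samples everything;
# a busy app (10,000 events/s) drops to 0.01 to stay within the same budget.
```

The point is that the effective sampling rate falls out of the overhead budget and the observed event rate, instead of being a fixed configuration value.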

Additional Notes:
The heap profiler still uses a fixed "track every x allocation sample" logic, but the plan is to try to add some dynamic sampling there as well in a follow-up PR.

Nevertheless, this PR already has a positive effect on heap tracking overhead, since part of that work currently occurs in the context of allocation sampling.

CPU usage, P50, and P95 behaviour are improved in all allocation+heap flavours. However, P100 for allocation-sampling-only becomes slightly noisier, as the dynamic nature of sampling can lead to a certain amount of oversampling around spikes (NOTE: left side is this PR, right side is current master):

[four benchmark comparison screenshots]

We can study extra strategies to reduce the P100 noise in follow-ups:

  • Smaller adjustment windows.
  • Event-count-based readjustment: readjust after whichever comes first, Y seconds or X events since the last readjustment.
  • Replacing the exponential moving averages with PID controllers to increase sensitivity to big slope changes (a generic sketch of such a controller follows below).
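
For reference, a textbook PID controller (the alternative mentioned in the last bullet) looks roughly like the sketch below. This is a generic formulation with placeholder gains and setpoint; it is not the PID code that was imported in #3190 nor anything shipped by this PR:

```ruby
# Generic PID controller sketch; kp/ki/kd gains and the setpoint are placeholders.
class PidController
  def initialize(kp:, ki:, kd:, setpoint:)
    @kp = kp
    @ki = ki
    @kd = kd
    @setpoint = setpoint # e.g. the target overhead ratio
    @integral = 0.0
    @previous_error = 0.0
  end

  # measurement: observed overhead during the last adjustment window.
  # dt: length of that window, in seconds.
  # Returns a correction (positive: we can sample more; negative: back off).
  def update(measurement, dt)
    error = @setpoint - measurement
    @integral += error * dt
    derivative = (error - @previous_error) / dt
    @previous_error = error
    (@kp * error) + (@ki * @integral) + (@kd * derivative)
  end
end
```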

How to test the change?

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

@github-actions bot added the core (Involves Datadog core libraries) and profiling (Involves Datadog profiling) labels on Jan 23, 2024
@AlexJF force-pushed the alexjf/prof-8864-dynamic-allocation-sampling2 branch from 23e2850 to 5a072aa on January 23, 2024 18:19
@AlexJF force-pushed the alexjf/prof-8864-dynamic-allocation-sampling2 branch from 5a072aa to 423c4ed on January 23, 2024 18:38
@AlexJF marked this pull request as ready for review on January 24, 2024 11:54
@AlexJF requested review from a team as code owners on January 24, 2024 11:54
@ivoanjo (Member) left a comment:

This is soopah exciting!

Intuitively the approach seems to make sense. I'm curious to look at the benchmarking numbers.

I also wonder how the bias compares to the other options we've used for the other profilers. I seem to remember both Felix and Christophe working at some point on some simulators to compare alternatives; it may be worth seeing if we can plug this approach into the simulator and do a comparison?

@ivoanjo (Member) left a comment:

Did another pass! I think this is getting in pretty good shape -- I guess we need to get together and decide the next steps towards declaring allocation (and possibly heap) as beta?

Comment on lines 217 to 224
# Question is: how long do we have to wait for probing samples? Intuitively, we need to build enough budget over
# time for us to be able to take that probing hit assuming things remain the same. Each adjustment window
# with no sampling activity earns us 0.02 seconds of budget. Therefore we need 100 of these to go by before
# we see the next probing sample.
stats = simulate_load(duration_seconds: 99, events_per_second: 4, sampling_seconds: 2)
expect(stats[:num_samples]).to eq(0)
stats = simulate_load(duration_seconds: 1, events_per_second: 4, sampling_seconds: 2)
expect(stats[:num_samples]).to eq(1)
Member:

This seems... a bit too much? E.g. even on an extremely busy application, should we aim to have at least 1 sample per second or something like that?

Contributor Author:

Hmm yeah, I guess I went with a rather extreme situation to force the sampler's hand (sampling a single event taking 2 seconds, which is already 2 adjustment windows), and that is what is leading to the extreme probing interval. I'll write a less extreme version of this.

Member:

In the latest version, the test thresholds are adjusted, but I was actually thinking of something a bit different.

It's more: should we tweak the sampler so that there's some minimum rate it will never fall below, at least during a 60-second profile period? E.g. it seems very extreme to decide not to take a single sample during a profiled period.

I don't think this is blocking for this PR, especially since we are planning to tweak the thresholds for the new sampler anyway :)

Contributor Author:

It's kind of a meta question of how hard/soft our target should be: any forced minimum risks making the overhead target a mere suggestion. Bugs aside, the only way for us to arrive at a situation where we never sample in a 60-second period is if we consistently get single-event sampling times that are individually above our overhead target (say the default 0.01 seconds). If that happens and you enforce a minimum of 1 sample/second, then you'd be going 60x over the defined threshold.

An alternative is to simply shut down the allocation sampler if we detect that, due to adjustments, we are never reaching a minimum target (i.e. do not force a minimum, but disable the sampler if a minimum cannot be met, since data quality would be lousy).

Member:

I'm thinking this effect can happen due to two main causes:

  1. There was a hiccup somewhere. The machine is very busy, the kernel descheduled the profiler while it was sampling, the machine was suspended/live-migrated in a VM/snapshotted. That kind of thing.

  2. We truly took a huge time to sample, and every sample we take will be as costly as this one.

I think the challenge is -- how do we take 1) in stride, and still be conservative enough to do the right thing in 2)?

Furthermore, I kinda suspect (without data) that 1) may be more common than 2), which is why I think it's worth trying to take it into consideration, if possible.

@AlexJF (Contributor Author), Jan 31, 2024:

Yes, you raise a very good point. I have some ideas on it that I'm planning to research separately: https://datadoghq.atlassian.net/browse/PROF-9044

@AlexJF (Contributor Author) commented Jan 31, 2024

[benchmark screenshot]

Validated that the earlier observed behaviour remains with the latest round of changes. Going to experiment with smaller adjustment windows and/or event-count triggers, but those can come as new PRs so as not to block this one.

@ivoanjo (Member) left a comment:

👍 This looks great to me!

I left one last reply ( #3395 (comment) ) regarding the sampler possibly being a bit too strict in extreme cases, but I don't think this is blocking at all and this is so much better than what we had before anyway :)

@codecov-commenter commented Jan 31, 2024

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (8453ee4) 98.25% compared to head (c7164b8) 98.26%.
Report is 35 commits behind head on master.

Files                                          Patch %   Missing Lines
lib/datadog/tracing/contrib/redis/patcher.rb   82.35%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3395      +/-   ##
==========================================
+ Coverage   98.25%   98.26%   +0.01%     
==========================================
  Files        1262     1266       +4     
  Lines       73606    74298     +692     
  Branches     3445     3476      +31     
==========================================
+ Hits        72320    73011     +691     
- Misses       1286     1287       +1     

☔ View full report in Codecov by Sentry.

@AlexJF merged commit ae79470 into master on Jan 31, 2024 (219 checks passed)
@AlexJF deleted the alexjf/prof-8864-dynamic-allocation-sampling2 branch on January 31, 2024 18:58
@github-actions bot added this to the 1.20.0 milestone on Jan 31, 2024
@TonyCTHsu mentioned this pull request on Feb 5, 2024
ivoanjo added a commit that referenced this pull request Feb 9, 2024
**What does this PR do?**

This PR removes the unused profiler PID controller implementation.

**Motivation:**

In #3190 we imported the PID controller, planning to use it to control
the allocation profiler. In the end we went with a different solution
(#3395), so let's remove the PID controller for now.

We can always revert this commit if we need it again.

**Additional Notes:**

N/A

**How to test the change?**

Validate that CI is still green :)
Labels: core (Involves Datadog core libraries), profiling (Involves Datadog profiling)
3 participants