
Allow the dynamic sampling rate overhead target to be set #3310

Merged

Conversation

KJTsanaktsidis
Contributor

@KJTsanaktsidis KJTsanaktsidis commented Dec 7, 2023

What does this PR do?
This PR adds an option, `overhead_target_percentage` (originally proposed as `dynamic_sampling_rate_overhead_target_percentage`), which allows the dynamic sampling rate in the profiler to be controlled.

If this is used to reduce the sampling rate, the flushing interval is increased to compensate, so that the size of the flushed pprof files is roughly the same.
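As a minimal sketch, the new option would be set via `Datadog.configure`. The `profiling.advanced` namespace below is an assumption based on this PR's commit notes about shortening the user-visible setting; check the released documentation for the exact path:

```ruby
# Illustrative configuration sketch; the setting namespace is an
# assumption based on this PR's commit notes, not verified docs.
require 'ddtrace'

Datadog.configure do |c|
  c.profiling.enabled = true
  # Lower the dynamic sampling overhead target from the 2% default to 1%.
  c.profiling.advanced.overhead_target_percentage = 1.0
end
```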

Motivation:
The profiler has a sophisticated dynamic sampling rate mechanism to tune the sample collection rate in order to target a desired level of wall-clock overhead time. Unfortunately, the default value of 2% is too much for our application; we find a noticeable and unfortunately not acceptable impact on our application's p95/p99 response times.

Additional Notes:

How to test the change?

(Edit by @ivoanjo): Setting DD_PROFILING_UPLOAD_PERIOD to > 60 will emit fewer profiles, and setting DD_PROFILING_OVERHEAD_TARGET_PERCENTAGE to a different value will make the profiler sample less/more than usual.
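The steps above can be exercised with environment variables along these lines; `bundle exec ruby app.rb` stands in for whatever command launches your application:

```shell
# Illustrative: exercise the change via environment variables.
export DD_PROFILING_ENABLED=true
export DD_PROFILING_UPLOAD_PERIOD=120                # > 60: fewer profiles emitted
export DD_PROFILING_OVERHEAD_TARGET_PERCENTAGE=1.0   # sample less than the 2% default
bundle exec ruby app.rb
```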

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

@KJTsanaktsidis KJTsanaktsidis requested review from a team as code owners December 7, 2023 05:56
@github-actions github-actions bot added core Involves Datadog core libraries profiling Involves Datadog profiling labels Dec 7, 2023
@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (f3358c0) 98.23% compared to head (3a5fb7e) 98.22%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3310      +/-   ##
==========================================
- Coverage   98.23%   98.22%   -0.01%     
==========================================
  Files        1253     1253              
  Lines       72690    72711      +21     
  Branches     3415     3415              
==========================================
+ Hits        71405    71421      +16     
- Misses       1285     1290       +5     


@ivoanjo ivoanjo self-assigned this Dec 7, 2023
@ivoanjo
Member

ivoanjo commented Dec 7, 2023

Nice! I've sent a few notes privately, but I think this is in great shape :D

KJTsanaktsidis and others added 7 commits December 11, 2023 16:23
The profiler has a sophisticated dynamic sampling rate mechanism to tune
the sample collection rate in order to target a desired level of
wall-clock overhead time. Unfortunately, the default value of 2% is too
much for our application; we find a noticeable and unfortunately not
acceptable impact on our application's p95/p99 response times.

This commit adds a configuration option
dynamic_sampling_rate_overhead_target_percentage which allows this to be
controlled. The tradeoff for reducing the overhead target is that the
profiles will have fewer samples in them and hence be less accurate.
This way, if we are targeting a lower overhead, the time between flushes
increases, so each pprof file should have ~the same number of samples as
before.
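The proportional scaling described in this commit could be sketched as below; the names and defaults are illustrative, not the library's actual code (and note that, per the review discussion, this flush-interval approach was later changed):

```ruby
# Illustrative sketch of the flush-interval scaling idea described above.
# Constants and the 60s/2% defaults are assumptions for illustration.
DEFAULT_OVERHEAD_TARGET = 2.0   # percent: the profiler's default target
DEFAULT_FLUSH_INTERVAL  = 60.0  # seconds between flushes

# Halving the overhead target roughly halves the sampling rate, so the
# flush interval is doubled to keep samples-per-pprof roughly constant.
def scaled_flush_interval(overhead_target_percentage)
  DEFAULT_FLUSH_INTERVAL * (DEFAULT_OVERHEAD_TARGET / overhead_target_percentage)
end

scaled_flush_interval(1.0) # => 120.0
```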
This way we can rely on a sane setting set up at initialization of the
component, rather than needing to delay init in the
CpuAndWallTimeWorker.
Shorten the user-visible setting (and env var) + remove the definitions
in ext (make the settings be the owner of the default).
This setting now always comes from `Profiling::Component`, and so I've
removed the old default to avoid confusion (since it would never be
used in practice).
@KJTsanaktsidis KJTsanaktsidis force-pushed the ktsanaktsidis/overhead_percentage_dynamic branch from f39ae30 to a17c5e9 Compare December 11, 2023 05:23
Apparently on JRuby 9.2 the same float can be represented by a different
object.
Contributor

@AlexJF AlexJF left a comment


LGTM. Was a bit worried by the

If this is used to reduce the sampling rate, the flushing interval is increased to compensate, so that the size of the flushed pprof files is roughly the same.

Part of the PR description (worth removing given it's no longer true I think!) but I see this approach was changed after (what I assume to be) the initial interactions with @ivoanjo.

It's not a big problem to change the timeouts but it makes it extremely hard at the moment to have good metric plots if you don't have a fixed interval that you can rollup over (you'd potentially get a lot of see-saw action in any graphs coming from profile metrics).

@ivoanjo
Member

ivoanjo commented Dec 11, 2023

It's not a big problem to change the timeouts but it makes it extremely hard at the moment to have good metric plots if you don't have a fixed interval that you can rollup over (you'd potentially get a lot of see-saw action in any graphs coming from profile metrics).

Good point -- I guess this is another good reason for why this setting is not quite recommended unless you have a very specific use-case.

@ivoanjo
Member

ivoanjo commented Dec 11, 2023

Merging away! 🎉

@ivoanjo ivoanjo merged commit 0fa4a67 into DataDog:master Dec 11, 2023
55 checks passed
@github-actions github-actions bot added this to the 1.19.0 milestone Dec 11, 2023
@ekump
Contributor

ekump commented Jan 10, 2024

@KJTsanaktsidis This has been included in our latest release. Thank you for your contribution!
