Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-9044] Handle timing hiccups #3440

Merged
merged 31 commits into from
Mar 5, 2024

Conversation

AlexJF
Copy link
Contributor

@AlexJF AlexJF commented Feb 6, 2024

What does this PR do?
Introduce a mechanism so that point-in-time hiccups in timing information cannot effectively lead to disabling of the sampler for whole minutes.

Motivation:
VM suspension, system overload and other noise may occasionally lead to timing hiccups where time where the application was suspended is misattributed to time spent sampling (e.g. VM suspension occurs in between calls to should_sample and after_sample).

This (hopefully) rare occurrence can skew our sampling statistics so much that the sampler would be effectively disabled for whole minutes. We want to limit this impact and try to have the sampler ignore sampling times that are deemed too big (our reference target is that, if forced to sample, we want to sample at least 1 event per second).

Additional Notes:
This PR tells a story:

  1. We start by adding a test describing a scenario where due to some weird circumstance (e.g. suspension), the sampling time for a single event is reported as being 1000 seconds. With existing code the test fails with:
    1) Datadog::Profiling::Collectors::DiscreteDynamicSampler a weird hiccup shouldn't be able to disable sampling for whole minutes
     Failure/Error: expect(stats[:num_samples]).to be >= 60
     
       expected: >= 60
            got:    0
     # ./spec/datadog/profiling/collectors/discrete_dynamic_sampler_spec.rb:324:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:230:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:115:in `block (2 levels) in <top (required)>'
    
  2. We then introduce a mechanism that limits the maximum amount of sampling time we're willing to consider. This maximum amount is chosen so as to represent a practical minimum sampling rate of 1 sample per adjustment window (ergo 1 sample per second). With this change the test passes:
      a weird hiccup shouldn't be able to disable sampling for whole minutes
    
     Finished in 0.08911 seconds (files took 1.03 seconds to load)
     1 example, 0 failures
    

How to test the change?

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

@github-actions github-actions bot added the profiling Involves Datadog profiling label Feb 6, 2024
@AlexJF AlexJF force-pushed the alexjf/prof-9044-handle-timing-hiccups branch from 95cdf19 to a07ac12 Compare February 7, 2024 11:16
@AlexJF AlexJF marked this pull request as ready for review February 7, 2024 11:17
@AlexJF AlexJF requested a review from a team as a code owner February 7, 2024 11:17
Copy link
Member

@ivoanjo ivoanjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM

The macOS CI failures are unrelated -- #3444 .

Copy link
Member

@ivoanjo ivoanjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM

…c.rb

Co-authored-by: Ivo Anjo <ivo.anjo@datadoghq.com>
Base automatically changed from alexjf/prof-9035-handle-clock-errors-allocation-sampler to master March 5, 2024 11:31
@AlexJF AlexJF merged commit f38f080 into master Mar 5, 2024
219 checks passed
@AlexJF AlexJF deleted the alexjf/prof-9044-handle-timing-hiccups branch March 5, 2024 15:14
@github-actions github-actions bot added this to the 2.0 milestone Mar 5, 2024
@ivoanjo ivoanjo modified the milestones: 2.0, 1.21.0 Mar 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants