Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry TraceIdRatioBased sampler requirements following OTEP 235 #4166

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

jmacd
Copy link
Contributor

@jmacd jmacd commented Jul 29, 2024

Fixes #1413.

Changes

Updates Trace SDK and TraceState handling specifications with OTEP 235 sampling thresholds. This PR depends on #4162 to introduce the concept of Trace Randomness. This PR is the second part of two, it focuses on thresholds.

  • Revise TraceIdRatioBased algorithm section. The existing TODO implies this is not a breaking change.
  • Change text about TraceIdRatioBased construction
  • Move text about TraceIdRatioBased description (leave unmodified).

The content of OTEP 235 was revised for clarity by @kalyanaj in open-telemetry/oteps#261. I've heavily copied from the final text in that still-unmerged OTEP. I introduced new content explaining how to compute thresholds from probabilities with use of variable precision, referring to the OTel Collector-Contrib pkg/sampling reference implementation. The new (Golang) demonstration code is validated here, https://go.dev/play/p/7eLM6FkuoA5.

A proof of concept for this specification along with #4162 can be found in open-telemetry/opentelemetry-go#5645.

Part of #3602.

Product of the Sampling SIG members @kentquirk @kalyanaj @oertl @PeterF778 and myself.

@jmacd
Copy link
Contributor Author

jmacd commented Jul 30, 2024

Feedback from the OTel Spec SIG meeting discussion cc/ @jsuereth:

  • Please add a migration guide to explain how transitioning samplers will work; in particular, it's not safe to begin using non-root independent sampling until TraceIdRatioBased samplers are replaced everywhere in a trace. Until then, only safe to continue using ParentBased sampling w/ root TraceIdRatioBased decision.

Update: 68fa270

Copy link

github-actions bot commented Aug 7, 2024

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot removed the Stale label Aug 8, 2024
jmacd added a commit that referenced this pull request Aug 15, 2024
This reduces the number of lines of diff in PR 4166, which replaces the
entire `tracestate-probability-sampling.md` file with new contents.

Part of #4166.

## Changes

Move a file, place a link to it and explain that a change is in
progress.
@jmacd
Copy link
Contributor Author

jmacd commented Aug 15, 2024

@kalyanaj @PeterF778 @oertl @kentquirk Please take another look at this PR, especially the file tracestate-probability-sampling.md which now reads as a new file, not as a major rewrite. The contents are derived from open-telemetry/oteps#261.

@jmacd
Copy link
Contributor Author

jmacd commented Aug 15, 2024

@open-telemetry/specs-trace-approvers @open-telemetry/specs-approvers @open-telemetry/technical-committee this PR has reached consensus in the Sampling SIG, we have multiple prototypes implemented, and we are looking for final approvals.

specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/tracestate-handling.md Outdated Show resolved Hide resolved
specification/trace/tracestate-handling.md Outdated Show resolved Hide resolved
specification/trace/sdk.md Show resolved Hide resolved
specification/trace/sdk.md Show resolved Hide resolved
specification/trace/sdk.md Show resolved Hide resolved
@jpkrohling jpkrohling self-requested a review September 18, 2024 07:50
Copy link
Member

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review, will try to complete by tomorrow.

@@ -87,6 +87,7 @@ formats is required. Implementing more than one format is optional.
| [Built-in `SpanProcessor`s implement `ForceFlush` spec](specification/trace/sdk.md#forceflush-1) | | | + | | + | + | + | + | + | + | + | |
| [Attribute Limits](specification/common/README.md#attribute-limits) | X | | + | | + | + | + | + | | | | |
| Fetch InstrumentationScope from ReadableSpan | | | + | | + | | | + | | | | |
| TraceIdRatioBased implements OpenTelemetry tracestate `th` field | | | | | | | | | | | | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as the other PR: if this is required, shouldn't there be a couple of implementations lined up before the spec change is merged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a shared my draft, open-telemetry/opentelemetry-go#5645, and @oertl has already merged an equivalent sampler in the Java contrib repository. (I would add that the OTel-Collector-Contrib probabilistic sampler processor acts as a near-prototype.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The connection with probabilistic sampler is detailed in #4243 and has been described as an interoperability specification.

specification/trace/sdk.md Show resolved Hide resolved
specification/trace/sdk.md Show resolved Hide resolved
specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/sdk.md Show resolved Hide resolved
specification/trace/tracestate-probability-sampling.md Outdated Show resolved Hide resolved
specification/trace/tracestate-probability-sampling.md Outdated Show resolved Hide resolved

This proposal supports two sources of randomness:

- **A custom source of randomness**: This proposal allows for a *random* (or pseudo-random) 56-bit value. We refer to this as `rv`. This can be generated and propagated through the `tracestate` header and the tracestate attribute in each span.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I commented this elsewhere, but when should I, as a user, should consider having a custom source of randomness?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is meant to be part of #4162 which focuses on randomness. It writes "To enable sampling in this and other situations where TraceIDs lack sufficient randomness,"

However, I tried to stay away from the advanced use-cases some might mention. If you have a reason to use independent trace IDs and still want them to sample consistently, this is what you'd choose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing the connection with #4162. See this commit, which hopefully answers this question (thread).

Copy link
Member

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than my previous comments, LGTM!

specification/trace/tracestate-probability-sampling.md Outdated Show resolved Hide resolved

The original TraceIdRatioBased sampler specification gave a workaround for the underspecified behavior, that it was safe to use for root spans: "It is recommended to use this sampler algorithm only for root spans (in combination with [`ParentBased`](./sdk.md#parentbased)) because different language SDKs or even different versions of the same language SDKs may produce inconsistent results for the same input."

To avoid inconsistency during this transition, users SHOULD follow this guidance until all TraceIdRatioBased samplers used in a system have been upgraded to the modern `TraceIdRatioBased` specification based on W3C Trace Context Level 2 randomness. After all `TraceIdRatioBased` samplers have been upgraded, it is safe to use `TraceIdRatioBased` sampler without also using the `ParentBased` sampler.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can users assess that they reached this? Should we keep a table, showing from which versions which SDKs support the new spec?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way they can do this is to wait for all spans to have the W3C trace Random flag set across a system. How does that sound?

specification/trace/tracestate-probability-sampling.md Outdated Show resolved Hide resolved
specification/trace/tracestate-probability-sampling.md Outdated Show resolved Hide resolved
jmacd and others added 3 commits September 25, 2024 15:28
Co-authored-by: Juraci Paixão Kröhling <juraci.github@kroehling.de>
@jmacd jmacd requested review from a team as code owners September 25, 2024 22:29
specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/tracestate-probability-sampling.md Outdated Show resolved Hide resolved
specification/trace/tracestate-probability-sampling.md Outdated Show resolved Hide resolved
specification/trace/tracestate-probability-sampling.md Outdated Show resolved Hide resolved
}
// Raise precision by the number of leading 0s or Fs
_, expF := math.Frexp(probability)
_, expR := math.Frexp(1 - probability)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Increasing the precision by the number of leading 0s only affects probabilities close to 1. Is the difference between a sampling probability of e.g. 0.99999 and 0.999999 relevant or could this part be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say it can be removed, yes. We believe fine-resolution probabilities close to 0 are important, but not close to 1.

I put this in for symmetry--so that rounding behavior near 0 and 1 would be the same, but I can easily be convinced to remove this line. It would mean that probabilities near 1 round up to exactly 1, subject to precision.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of removing it to keep the relative loss of precision consistent across multiples of 16; it feels somewhat arbitrary to increase the precision for values above 0.9375. Increasing the precision by the number of leading 0s would change the first row to 0cccd / 2.0077358064973794E-7:

Probability Threshold w/ precision=4 abs(1 - AdjustedCount * Probability)
0.95 / 16 ** 0 0ccd 3.212386963991065E-6
0.95 / 16 ** 1 f0ccd 3.212386963991065E-6
0.95 / 16 ** 6 ffffff0ccd 3.212386963991065E-6
0.90 / 16 ** 0 199a 6.78173001933402E-6
0.90 / 16 ** 1 f199a 6.78173001933402E-6
0.90 / 16 ** 6 ffffff199a 6.78173001933402E-6

@@ -386,40 +389,41 @@ The default sampler is `ParentBased(root=AlwaysOn)`.

#### TraceIdRatioBased
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a confusion that we need to help the users with.

At first glance, TraceIdRatioBased seems to imply that the sampling is based on some ratio related to trace ids. There are quite some nuance in fact. It is only truly related to some ratio of Trace Ids in the W3C L2 trace context with the randomness flag checked. Otherwise, it is actually based on a generated randomness value.

This probably worth explicitly calling out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope 6e29b0e addresses some of the higher-level detail that is missing.

Here is a statement to add the nuance you refer to: 77b51f8 😀


Note that the "ratio-based" part of this Sampler's name implies that
it makes a probability decision directly from the TraceID, even though
it was not not originally specified in an exact way. In the present
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
it was not not originally specified in an exact way. In the present
it was not originally specified in an exact way. In the present

Note that the "ratio-based" part of this Sampler's name implies that
it makes a probability decision directly from the TraceID, even though
it was not not originally specified in an exact way. In the present
specification,the Sampler decision is more nuanced: only a portion of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
specification,the Sampler decision is more nuanced: only a portion of
specification, the Sampler decision is more nuanced: only a portion of

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Review approach & specify algorithm for TraceIdRatioBasedSampler (ProbabilitySampler)
10 participants