Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify Span ID creation with sampling. #998

Closed

Conversation

Oberon00
Copy link
Member

@Oberon00 Oberon00 commented Sep 24, 2020

Fixes #1060.

Partial fix for #864 (remainder would require an API change to not pass a trace ID to the sampler if there is none).

Changes

  • Clarifies how span IDs are generated for DROP and RECORD_ONLY.

@Oberon00 Oberon00 added area:sampling Related to trace sampling area:sdk Related to the SDK spec:trace Related to the specification/trace directory labels Sep 24, 2020
note: this must be done before calling `ShouldSample`, because it expects
a valid trace Id as input).
2. Query the `Sampler`'s [`ShouldSample`](#shouldsample) method.
3. If the decision is `DROP`, use the parent span ID or all-zero if there is none.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means: If there is no parent and the decision is drop, we will have a span with the new trace ID but an invalid span ID. The SpanContext as a whole is thus invalid. We could also specify that the generated trace ID should be discarded in that case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this may be worth re-considering: Generating trace+span ID for an unsampled root has implications if the span is injected. If we generate an invalid spancontext (with zero span ID), we will inject nothing as we have an invalid traceparent header as per W3C (cf. also #753). This causes downstream components to potentially create traces / run sampling logic that they would not if they got an incoming unsampled traceparent header.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also specify that the generated trace ID should be discarded in that case.

I'd be fine with this.

we will inject nothing as we have an invalid traceparent header as per W3C (cf. also #753). This causes downstream components to potentially create traces

Could you elaborate?

Copy link
Member Author

@Oberon00 Oberon00 Oct 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carlosalberto We never inject an invalid SpanContext, at least using W3C. If we would generate a new ID instead of using zero, we would inject a valid spancontext with unset sampled flag. This will likely cause the receiver to suppress the trace. On the other hand, if we use zero, the downstream component will not receive any tracestate and will likely start a new trace.

Now that I write, this, I realize that this even applies to local child spans: Normally if the root span is not sampled, local child spans would be unsampled too, but with this they would become root spans.

This is bad. I think we will have to create Span and Trace ID even for "dropped" root spans after all. It seems there is a dependency of #864 "Option to allow "default" IDs for unsampled traces" on #753 "Current "Invalid" SpanContext definition precludes TraceState-only SpanContext". We would have to have some sentinel SpanContext that is "valid, unsampled, but without any IDs". That's out of scope here of course.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 4006816. Maybe this demonstrates how easy it is to mess up here and that it is thus important to have this specified clearly? 😃

a valid trace Id as input).
2. Query the `Sampler`'s [`ShouldSample`](#shouldsample) method.
3. If the decision is `DROP`, use the parent span ID or all-zero if there is none.
Otherwise (if the decision is not `DROP`) a new span ID MUST be generated.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means: RECORD_ONLY spans get a new span ID, always. I don't know what the purpose of RECORD_ONLY is (should we remove it?), but having different spans with the same trace+span ID seems worse than breaking the trace in case any children of the RECORD_ONLY span become RECORD_AND_SAMPLE .

@andrewhsu andrewhsu added the release:after-ga Not required before GA release, and not going to work on before GA label Sep 25, 2020
@andrewhsu
Copy link
Member

from the issue triage mtg today, labelling this as after-ga since the corresponding i up.ssue is after-ga.

however, if this is just a editorial change that does not affect the trace freeze, please speak

@Oberon00
Copy link
Member Author

Oberon00 commented Sep 25, 2020

@andrewhsu As stated in the issue comment, this PR is exactly the part of the issue that is not doable after-ga.

@tigrannajaryan
Copy link
Member

@Oberon00 will you be able get the approvals today? Other options are not doing this in 1.0 at all or asking for an exception from spec freeze.

@Oberon00
Copy link
Member Author

@tigrannajaryan I won't. In that case, it seems we will have to look around after GA and document all the paths taken by any SDKs as allowed and maybe introduce some options.

@github-actions
Copy link

github-actions bot commented Oct 3, 2020

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Oct 3, 2020
@Oberon00
Copy link
Member Author

Oberon00 commented Oct 6, 2020

I changed the corresponding issue to a new issue #1060, so we can re-evaluate what to do here for GA.

@Oberon00 Oberon00 added release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs and removed release:after-ga Not required before GA release, and not going to work on before GA labels Oct 6, 2020
@Oberon00
Copy link
Member Author

Oberon00 commented Oct 6, 2020

Relabeled to allowed-for-ga to match #1060.

bogdandrutu
bogdandrutu previously approved these changes Oct 6, 2020
@andrewhsu andrewhsu removed Stale release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs labels Oct 27, 2020
specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/sdk.md Outdated Show resolved Hide resolved
Comment on lines 71 to 73
3. If the decision is `DROP` and there is a valid parent span ID, use it.
Otherwise (if the decision is not `DROP` or there was no valid parent span ID)
a new span ID MUST be generated.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is tricky, but I like @Oberon00's proposal that DROP sampling decisions at the root of a trace should generate a new trace ID and new span ID without the sampled bit set. Am I understanding?

@carlosalberto
Copy link
Contributor

@open-telemetry/specs-approvers Please review this one. We bumped its related issue as "Required-for-GA" so we definitely needs reviews ;)

(note: this must be done before calling `ShouldSample`, because it expects
a valid trace ID as input).
2. Query the `Sampler`'s [`ShouldSample`](#shouldsample) method.
3. If the decision is `DROP` and there is a valid parent span ID, reuse it as the new `Span`'s span ID.
Copy link
Contributor

@jkwatson jkwatson Oct 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this mean two spans with the same id? I can imagine this could become extremely confusing to someone trying to debug things. As a user, I think I'd be super confused if I saw a new Span created with the same ID as the parent, even if that span might not end up going anywhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's two spans with the same ID the same way as calling extract and then inject creates a copy of the injected span (with the same ID). It is more of a hollow shell of a span than an actual span.

even if that span might not end up going anywhere

This span cannot end up going anywhere. A dropped span is always non-recording, so there is nothing that could end up anywhere beside the span+trace ID itself (not even a parent span ID). What is more, SpanProcessors are not invoked with dropped spans.

RECORD_ONLY spans are a different story, and they do get a new ID, see #998 (comment)

All that being said, I think I could agree that this may make the mental model of sampling a bit more complicated than just always creating a new ID. But I hope that the benefits, namely ability to resume sampling in a child and having the closes sampled span as parent, and a slight(?) performance improvement would be worth it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think I'd prefer attacking this problem via a smarter SpanContext that allowed propagation of the relevant bits, rather than generating weird internal spans for the sake of propagating the SpanContext.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, better yet, move the TraceState and TraceFlags out of the span altogether and propagate them separately in the Context, rather than hanging them off a span.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should startSpan return for a dropped span then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regeneration looks like a safer choice that will not break SDK going forward

I think changing from any of them to the other will break some uses.

and is easier to explain

Fair enough.

the idea is that instrumented library needs a reliable unique ID of the operation

If you have an ID that is not associated with any trace on the tracing backend, will that be more helpful than the "inexact" ID of the parent operation? Note, you will have a valid span ID in any case.

I would say we go GA with this, and discuss how to improve on it using things like adding a GetUniqueOperationId function to Span that returns the span ID if it was generated for that span or generates a new ID that is cached and propagated to children if it is not. I recon propagating this ID cross-process may be difficult, but the same goes for propagating the ID of the last sampled span, and it seems unlikely that you call GetUniqueOperationId on an extracted span directly (it would not be very unique, since the parent's process would have the same ID if it was sampled at least).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regeneration looks like a safer choice that will not break SDK going forward

I think changing from any of them to the other will break some uses.

and is easier to explain

Fair enough.

the idea is that instrumented library needs a reliable unique ID of the operation

If you have an ID that is not associated with any trace on the tracing backend, will that be more helpful than the "inexact" ID of the parent operation? Note, you will have a valid span ID in any case.

I would say we go GA with this, and discuss how to improve on it using things like adding a GetUniqueOperationId function to Span that returns the span ID if it was generated for that span or generates a new ID that is cached and propagated to children if it is not. I recon propagating this ID cross-process may be difficult, but the same goes for propagating the ID of the last sampled span, and it seems unlikely that you call GetUniqueOperationId on an extracted span directly (it would not be very unique, since the parent's process would have the same ID if it was sampled at least).

The assumption that there is a trace backend may not be true for everybody. And having trace backend of your caller having a record of a parent span Id is not helpful to solve your problem. So the decision we made at @open-telemetry/technical-committee is to go with span ID regeneration. This creates a nice guarantee of local uniqueness of a spanID that various telemetry signals may take a dependency on.

How strongly do you disagree?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not very strongly. I do think my proposed solution is the better trade-off but I concede that always generating a new ID is also a reasonable option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Oberon00 Thanks for the answer. I think we will go with the TC decision for now then - if you have time, would you mind updating this PR to match the proposal? Else, I can help by creating a spin-off of this PR. Let me know ;)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be best to create a spin-off PR. That way we clearly document that this option was rejected.

@bogdandrutu bogdandrutu added the release:required-for-ga Must be resolved before GA release, or nice to have before GA label Nov 10, 2020
@carlosalberto
Copy link
Contributor

Ping @yurishkuro

@bogdandrutu
Copy link
Member

@SergeyKanzhelev you mentioned during the specification sig that you will document the decision that was discussed during the TC meeting.

@bogdandrutu
Copy link
Member

@Oberon00 please rebase.

Copy link
Member

@SergeyKanzhelev SergeyKanzhelev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SergeyKanzhelev you mentioned during the specification sig that you will document the decision that was discussed during the TC meeting.

sorry for delay. This comment summarizes it: #998 (comment) but I was thinking to expand a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:sampling Related to trace sampling area:sdk Related to the SDK release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:trace Related to the specification/trace directory
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing spec for Span ID creation for dropped/record-only spans