Clarify that schema transformations SHOULD overwrite input data #3505

tigrannajaryan · 2023-05-15T17:30:38Z

Resolves: #3497

Alternates Considered

Instead of requiring overwriting the input data we could say that the transformation SHOULD be aborted if a conflict is detected or that transformation SHOULD be ignored if a conflict is detected.

The chosen approach (overwriting) seems to be best balance between maintaining reasonable outcome (keep the more important data, discarding the less important data) while also having minimal complexity of implementation (aborting and reporting errors is more complicated and more costly).

Why "SHOULD" and not "MUST"?

There are significant tradeoffs involved in what happens in the schema transform and we believe there may be valid use cases where after careful consideration a different approach may be taken (e.g. reject and surface the problem to the end user). We do not want to prohibit these other approaches. The goal of this change is to merely have a reasonable default interpretation of the schema transformation rules.

specification/schemas/file_format_v1.1.0.md

pyohannes · 2023-05-15T18:40:26Z

specification/schemas/file_format_v1.1.0.md

+Some schema changes may describe a transformation that results in the conflicts with
+the input data. In situations like this **the transformations described by Schema File
+SHOULD take precedence** over the conflicting data present in the input.


The chosen approach (overwriting) seems to be best balance between maintaining reasonable outcome (keep the more important data, discarding the less important data)

What are the reasons for classifying some data to be more important than other?

In the example given below, an instrumentation library might emit host according to semantic conventions, and a user might add a custom attribute host.name. When transforming this data to conform to schema 1.1.0 according to the conflict handling rule proposed here, the data explicitly specified by the customer in host.name will be lost. In my opinion it cannot be generally said that the one is more important than the other, but would depend on the particular use case.

Probably this is a rare edge case, however, I consider it problematic to drop data during schema conversions.

What are the reasons for classifying some data to be more important than other?

An attribute defined in official semantic conventions and subject to schema file is deemed more important than a custom attribute. Arguable, I know.

Note that custom attributes should not use Otel namespaces. So this is someone doing a bad job at following Otel recommendations, polluting data with custom attributes with bad names that are later used by Otel for a likely different purpose.

I'd say a custom attribute set by the application explicitly is much more important than any automation.

From "the least surprise" principle, it'd be confusing to override it and hard to investigate who and why overridden custom data.

It's also unclear why the instrumentation of V-1 supports attributes from version V, so this telemetry is inconsistent and broken already. Fixing it does not seem beneficial.

It seems to be best to warn (once) and not touch this data.

See my comment at #3497 for why "not touch" is hard to implement.

Resolves: open-telemetry#3497 Alternates Considered ===================== Instead of requiring overwriting the input data we could say that the transformation SHOULD be aborted if a conflict is detected or that transformation SHOULD be ignored if a conflict is detected. The chosen approach (overwriting) seems to be best balance between maintaining reasonable outcome (keep the more important data, discarding the less important data) while also having minimal complexity of implementation (aborting and reporting errors is more complicated and more costly). Why "SHOULD" and not "MUST"? ============================ There are significant tradeoffs involved in what happens in the schema transform and we believe there may be valid use cases where after careful consideration a different approach may be taken (e.g. reject and surface the problem to the end user). We do not want to prohibit these other approaches. The goal of this change is to merely have a reasonable default interpretation of the schema transformation rules.

tigrannajaryan · 2023-05-15T19:28:25Z

It is probably worth discussing the approaches in the issue #3497 (comment) before discussing this PR (which implements one of the suggested approaches).

github-actions · 2023-05-23T03:16:37Z

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions · 2023-05-31T03:17:24Z

Closed as inactive. Feel free to reopen if this PR is still being worked on.

tigrannajaryan · 2023-05-31T12:51:35Z

Anyone interested in this topic see #3497 for discussion.

tigrannajaryan requested review from a team May 15, 2023 17:30

github-actions bot assigned jsuereth May 15, 2023

tigrannajaryan requested review from MovieStoreGuy and mx-psi May 15, 2023 17:31

pyohannes reviewed May 15, 2023

View reviewed changes

tigrannajaryan force-pushed the feature/tigran/clarifyschemaconficts branch from f1a1d40 to aab0407 Compare May 15, 2023 19:08

tigrannajaryan marked this pull request as draft May 15, 2023 19:28

github-actions bot added the Stale label May 23, 2023

github-actions bot closed this May 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify that schema transformations SHOULD overwrite input data #3505

Clarify that schema transformations SHOULD overwrite input data #3505

tigrannajaryan commented May 15, 2023

pyohannes May 15, 2023 •

edited

Loading

tigrannajaryan May 15, 2023

tigrannajaryan May 15, 2023 •

edited

Loading

lmolkova May 15, 2023

tigrannajaryan May 15, 2023

tigrannajaryan commented May 15, 2023

github-actions bot commented May 23, 2023

github-actions bot commented May 31, 2023

tigrannajaryan commented May 31, 2023

Clarify that schema transformations SHOULD overwrite input data #3505

Clarify that schema transformations SHOULD overwrite input data #3505

Conversation

tigrannajaryan commented May 15, 2023

Alternates Considered

Why "SHOULD" and not "MUST"?

pyohannes May 15, 2023 • edited Loading

Choose a reason for hiding this comment

tigrannajaryan May 15, 2023

Choose a reason for hiding this comment

tigrannajaryan May 15, 2023 • edited Loading

Choose a reason for hiding this comment

lmolkova May 15, 2023

Choose a reason for hiding this comment

tigrannajaryan May 15, 2023

Choose a reason for hiding this comment

tigrannajaryan commented May 15, 2023

github-actions bot commented May 23, 2023

github-actions bot commented May 31, 2023

tigrannajaryan commented May 31, 2023

pyohannes May 15, 2023 •

edited

Loading

tigrannajaryan May 15, 2023 •

edited

Loading