
Data Classifications for resources and attributes #187

Conversation

MovieStoreGuy
Contributor

The idea behind this is that not all telemetry data is the same with regard to importance, how it is processed, and what is supported by downstream vendors.

This approach would allow for efficient checking of data, so that whatever configuration your organisation requires can then be applied.

Some sections are raw and left as dot points, pending discussion to help expand upon them.
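
For illustration, here is a minimal sketch of what such a classification API could look like in Go. All names here are hypothetical, not part of any existing SDK:

package classification

// Classification is a bit set; each bit marks one class of data.
// The concrete values would be defined by the semantic conventions.
type Classification uint64

const (
	PII             Classification = 1 << iota // personally identifiable information
	UGC                                        // user-generated content
	HighCardinality                            // attributes with unbounded value sets
)

// None marks data with no special classification.
const None Classification = 0

// KeyValue pairs an attribute with its classification.
type KeyValue struct {
	Key   string
	Value string
	Class Classification
}

// Classify tags an attribute value with one or more classes, e.g.
// Classify("net.peer.ip", "203.0.113.7", PII|HighCardinality).
func Classify(key, value string, class Classification) KeyValue {
	return KeyValue{Key: key, Value: value, Class: class}
}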

@MovieStoreGuy MovieStoreGuy requested a review from a team November 11, 2021 05:24
@anuraaga

This seems related to this spec issue open-telemetry/opentelemetry-specification#2114

At a glance, classifications like sensitive and high-cardinality do seem more useful for creating attribute views than just optional and required.

/cc @lmolkova

Member

@yurishkuro yurishkuro left a comment

A few high level thoughts:

  • the word "resource" has specific meaning in OTEL, you're overloading it by saying "resource classification". You are classifying attributes, not resources.
  • I think the classifications can be part of the schema files that we already support, I don't see why std attributes need to be classified in user code.
  • I'm fairly certain there will be big pushback on making this an SDK feature, why not do it in the collector where it can be implemented once?
  • before an OTEP like this is accepted there needs to be a working prototype with benchmarks

@MovieStoreGuy
Contributor Author

MovieStoreGuy commented Nov 11, 2021

the word "resource" has specific meaning in OTEL, you're overloading it by saying "resource classification". You are classifying attributes, not resources.

Both are being classified here: attributes have their own classification, along with the resource sending them. The reason is that if resources containing PD or UGC must not be exported to a vendor, you can drop the resource or remove the attributes that contain PD or UGC.

Classifying the resource itself is important because it avoids searching and validating the attributes or other fields of the resource.
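
To make that concrete, here is a hedged sketch (continuing the hypothetical types above) of how a resource-level check could avoid scanning attributes entirely, assuming the resource carries a pre-aggregated classification of everything it contains:

// Resource carries the union of its attributes' classifications,
// so a filter can reject it without inspecting individual fields.
type Resource struct {
	Attributes []KeyValue
	Class      Classification
}

// Exportable reports whether the resource carries none of the
// disallowed classes, e.g. PII|UGC for a vendor that accepts neither.
func (r Resource) Exportable(disallowed Classification) bool {
	return r.Class&disallowed == 0
}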

I think the classifications can be part of the schema files that we already support, I don't see why std attributes need to be classified in user code.

I don't see an easy pathway to success by doing this, and maintaining it within the schema (semantic convention) alone would create a lot of friction when trying to release changes to it.

I'm fairly certain there will be big pushback on making this an SDK feature, why not do it in the collector where it can be implemented once?

Some organisations cannot allow PD or UGC to leave the application, so implementing this only in the collector would already prevent those users from adopting it.

before an OTEP like this is accepted there needs to be a working prototype with benchmarks

I currently have a rather stripped-down version that I used to test my assumptions; the scenarios I replicated were:

  • Regex filter (removing attributes that matched a pattern)
  • Lookup-based filter (matching exactly against a list of known strings)
  • Classification filtering (searching for attributes that are either PD or UGC)
  • Resource filtering (removing resources if they contain PD or UGC)

Using Go's benchmarking tooling, I got the following results:

BenchmarkBaselineFilter/Collection-Size-100-12       	   40220	     28404 ns/op
BenchmarkLookupFilter/Collection-Size-100-12         	   65653	     18251 ns/op
BenchmarkRegxFilter/Collection-Size-100-12           	     835	   1408659 ns/op
BenchmarkClassificationAttributeFilter/Collection-Size-100-12      	   50774	     23447 ns/op
BenchmarkClassificationResourceFilter/Collection-Size-100-12     	1000000000	         0.0000052 ns/op

Simply ranging over an attribute collection of size 100 (the baseline) took 28,404 ns/op; the lookup filter with exact string matches took 18,251 ns/op, and classification attribute filtering took 23,447 ns/op, roughly midway between the two.

However, using resource classification filtering, resources can be filtered out in sub-nanosecond time.
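
As a side note for anyone reproducing this, here is a sketch of how the resource-filter benchmark could look with Go's testing package (hypothetical code, not the author's prototype). A near-zero ns/op figure in Go usually indicates the compiler reduced the measured loop to almost nothing, so the resource-filter result is best read as "effectively free" rather than as a literal timing:

package classification

import "testing"

// BenchmarkResourceFilter measures the single bitmask check from the
// earlier sketch; hypothetical setup, not the author's prototype.
func BenchmarkResourceFilter(b *testing.B) {
	r := Resource{Class: PII | HighCardinality}
	var kept bool
	for i := 0; i < b.N; i++ {
		kept = r.Exportable(PII | UGC)
	}
	_ = kept // keep the result live so the check isn't trivially removed
}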

@MovieStoreGuy
Contributor Author

By the end of the day, I'll publish my mini prototype so others can play with it themselves.

@tsloughter
Member

I'm fairly certain there will be big pushback on making this an SDK feature, why not do it in the collector where it can be implemented once?

Are you sure? Not only does not everyone use the collector, but that would still mean sending PII data to the collector.

@tigrannajaryan
Member

  • I think the classifications can be part of the schema files that we already support, I don't see why std attributes need to be classified in user code.

+1 to this. I think it is worth entertaining this alternative to see what the tradeoffs are. The SDKs already contain pre-built dictionaries of standard attributes (in the Go SDK it is in the semconv package). We can include the classification in these dictionaries and avoid the need to:

  • Call Classify() every time the attribute is added to a span/log/metric.
  • Send classification information over the network.

For non-standard attributes it may still be necessary to do what this OTEP suggests. I don't know how important that is, though; perhaps standard attributes cover the majority of cases where classification matters (maybe not, I may be wrong).
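
One way to picture this alternative, continuing the earlier hypothetical Go sketch: the generated semconv dictionaries could carry a classification next to each standard attribute key, so instrumentation never calls Classify() for them. The AttributeSpec type and the default classes below are assumptions, not the real semconv package:

// Hypothetical shape of a generated semconv entry.
type AttributeSpec struct {
	Key   string
	Class Classification
}

// Hypothetical generated entries with baked-in classifications.
var (
	NetPeerIP  = AttributeSpec{Key: "net.peer.ip", Class: HighCardinality}
	HTTPMethod = AttributeSpec{Key: "http.method", Class: None}
)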

I don't see an easy pathway to success by doing this, and maintaining it within the schema (semantic convention) alone would create a lot of friction when trying to release changes to it.

On the other hand, by NOT making it part of the semantic conventions there is a big risk of inconsistent behaviour across SDKs, where each instrumentation author makes their own classification decisions.

@jamesmoessis

I think the classifications can be part of the schema files that we already support, I don't see why std attributes need to be classified in user code.

Maybe there could be sensible defaults which can be overridden. Neither the semantic convention nor the instrumentation author can know which attributes will be sensitive, because it changes between applications. For example, net.peer.ip may or may not be PII depending on whether the peer is a user's device or another server. Only the application owner can know that information.
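
A sketch of that "defaults plus overrides" idea, again with the hypothetical types from earlier (the Registry API is an assumption for illustration):

// Registry holds classification defaults keyed by attribute name,
// which the application owner can override at SDK setup time.
type Registry struct {
	classes map[string]Classification
}

func NewRegistry() *Registry {
	return &Registry{classes: map[string]Classification{
		// semconv-supplied default: high cardinality, but not PII.
		"net.peer.ip": HighCardinality,
	}}
}

// Override reclassifies an attribute, e.g. an edge service marking
// net.peer.ip as PII too: r.Override("net.peer.ip", PII|HighCardinality)
func (r *Registry) Override(key string, class Classification) {
	r.classes[key] = class
}

// ClassOf returns the effective classification for an attribute key;
// unknown keys fall back to None.
func (r *Registry) ClassOf(key string) Classification {
	return r.classes[key]
}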

@MovieStoreGuy
Contributor Author

As I mentioned earlier, here is the prototype: https://github.com/MovieStoreGuy/data-classifier. It is rather stripped down, preserving only attributes and resources (without any of the values or nested fields).

+1 to this. I think it is worth entertaining this alternative to see what the tradeoffs are. The SDKs already contain pre-built dictionaries of standard attributes (in the Go SDK it is in the semconv package). We can include the classification in these dictionaries and avoid the need to:

Right, I think you're referring to the Classification Values here?
I am happy for them to be defined as part of the semantic convention, since the values themselves should remain rather static over the longevity of the project (high-cardinality, PII, or UGC data will not stop being a problem).

To your point of:

Send classification information over the network.

Consider a mixed-mode deployment of compute nodes, lambdas, and users running experiments or tests locally. Not sending the classification would mean that the internal gateway / proxy collector forwarding the data cannot filter out known classifications that are unsupported by the system being exported to. And if application owners ignored classification filtering in their exporters, resources and attributes could not be classified further downstream without impacting processing performance.

Reactive classification is extremely computationally expensive and carries a lot of operational overhead; proactive classification removes the need to categorise the data in order to filter it.

Moreover, I mention within the OTEP that half of the requested space (64 bits) can be reserved for vendors to enable advanced features like smaller retention windows or improved resource catalogs when filtering telemetry (SLO resources would be more important than high-cardinality resources, for example). The vendor can extend their exporter to take additional actions when these classifications are set.
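
Sketching that layout with the earlier hypothetical Go types, assuming a 128-bit space with the upper half vendor-reserved as the comment suggests (the vendor bits below are illustrative assumptions; the OTEP text is authoritative on the exact layout):

// FullClassification splits the proposed space in half: a standard
// part defined by the semantic conventions and a vendor part that
// exporters are free to interpret.
type FullClassification struct {
	Standard Classification // semconv-defined classes
	Vendor   uint64         // vendor-reserved bits
}

// Hypothetical vendor-defined bits.
const (
	VendorShortRetention uint64 = 1 << iota // apply a smaller retention window
	VendorSLOResource                       // prioritise in resource catalogs
)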

Classifications could be limited to the application context and not included as part of the wire datagram; however, I see a lot of value in including them, to help build a central means of ensuring resource data isn't sent to the wrong place.

I don't know how important that is, though; perhaps standard attributes cover the majority of cases where classification matters (maybe not, I may be wrong).

Our current adoption of OpenTelemetry at Atlassian requires heavy stream processing to filter out the standard attributes that are high cardinality. We also want to extend the semantic convention to support real user monitoring (RUM), which is currently not defined there; being able to add classifications to those attributes and resources means we can properly experiment with the idea without adding processing requirements to our internal telemetry streaming service.

On the other hand, by NOT making it part of the semantic conventions there is a big risk of inconsistent behaviour across SDKs, where each instrumentation author makes their own classification decisions.

I agree, the values of the classifications have to be consistent across languages, and the semantic convention is the best place to define them.

@MovieStoreGuy
Contributor Author

To @jamesmoessis's point, classifications are coupled to the application deployment.

Meaning that an application sitting on the network edge would define net.peer.ip as PII; an internal service behind the firewall would not consider it PII, but both may consider it high cardinality.

@yurishkuro
Member

Some organisations cannot allow PD or UGC to leave the application, so implementing this only in the collector would already prevent those users from adopting it.

Are those organizations prepared to commit significant engineering capacity to developing and maintaining this non-trivial capability in every language supported by OTel? On the other hand, a deployment of the main app with the collector as a sidecar communicating over the local network (in practice, over a local unix socket) - why would that be considered "leaving the application"?

@MovieStoreGuy
Contributor Author

Hey @yurishkuro ,

Are those organizations prepared to commit significant engineering capacity to developing and maintaining this non-trivial capability in every language supported by OTel?

I am not sure of the intent behind this comment and want to understand more :)
I fully understand that a lot of the language projects are struggling to reach feature-completeness, and adding this would add more to those backlogs; however, the intent of this OTEP is to discuss its merits.

I wanted to understand: what part of this OTEP is non-trivial?
As I understand the problem, adding classification support to OTel means adding fewer than 10 required methods to the SDK, each with a small implementation, plus adding the classification values to the semantic convention, which is mostly auto-generated stubs per language.

I can see that managing the classifications within the semantic convention, and how the convention would adopt them, is non-trivial, and that can be expanded upon further here :)

Per your commitment comment: if the change is as simple as I think, it shouldn't require a significant timeline. I am happy to take on building a collector extension to support this idea, as the collector has been the main project I contribute to.

On the other hand, a deployment of the main app with the collector as a sidecar communicating over the local network (in practice, over a local unix socket)

I am not certain that communicating via a unix socket is supported by the collector; let me check that :)

Why would that be considered "leaving the application"?

Once the data has been sent by the application, the application no longer controls it, and that is where the concern lies.

@MovieStoreGuy MovieStoreGuy changed the title Data Classifications for resources Data Classifications for resources and attributes Nov 15, 2021
@pirgeo
Member

pirgeo commented Nov 22, 2021

I like the idea. I am, however, a little worried about metric aggregations, since attributes are part of the metric identity. Dropping them can require re-aggregation of data (e.g. open-telemetry/opentelemetry-specification#1297). I am not sure how easy or hard that is to add to existing metrics SDKs, or whether it would require a lot of additional processing on the collector side (@jmacd, @reyang, @jsuereth, do you have thoughts on this?)

@MovieStoreGuy
Contributor Author

@pirgeo, I think you can get away with this without needing to re-aggregate.

You can follow a similar principle to MapReduce, meaning:

  • Filter the data (resources or attributes)
  • Update any fields or resources (in the event of rewriting attributes or resources)
  • Aggregate and combine classifications

In short, I think this could simply be a decorator on the exporter, or considered an optional extension.
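
A sketch of the decorator idea in Go, reusing the earlier hypothetical Resource and Classification types (the Exporter interface is a simplified stand-in, not a real SDK interface):

// Exporter is a simplified stand-in for an SDK exporter.
type Exporter interface {
	Export(batch []Resource) error
}

// ClassificationFilter decorates an exporter, dropping resources
// that carry disallowed classes before they reach the wire.
type ClassificationFilter struct {
	Next       Exporter
	Disallowed Classification
}

func (f *ClassificationFilter) Export(batch []Resource) error {
	kept := batch[:0] // filter in place, reusing the backing array
	for _, r := range batch {
		if r.Exportable(f.Disallowed) {
			kept = append(kept, r)
		}
	}
	return f.Next.Export(kept)
}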

@MovieStoreGuy
Contributor Author

It looks like a simplified version of this has been started in #100, where this proposal encapsulates all data classifications.

Perhaps we should join these efforts?

- this was not considered heavily since it misses users (developers) who are using the SDK natively and would then need to manage their own semantic convention, which would also need to be understood by the downstream processors and vendors
- Does not account for version drift when a breaking change is involved
- Does a processor need to know all versions?
- Does that mean values cannot be changed in the semantic convention?
Member

Another downside of doing it in semantic conventions goes to your point above: different organizations will potentially have different classifications. Reaching a common agreement on classifications for the semantic conventions would require lengthy discussion, with a good chance that many attributes never reach a consensus classification.

Member

@jpkrohling jpkrohling left a comment

Despite this being one more thing people will end up learning in order to do instrumentation, I like this proposal, especially as it's optional for current and future users.

MovieStoreGuy and others added 5 commits December 1, 2022 10:33
Co-authored-by: Juraci Paixão Kröhling <juraci.github@kroehling.de>
MovieStoreGuy and others added 2 commits December 1, 2022 10:35
Co-authored-by: Juraci Paixão Kröhling <juraci.github@kroehling.de>
@tedsuo tedsuo added the stale This issue or PR is stale and will be closed soon unless it is resurrected by the author. label Jul 24, 2023
@tedsuo
Contributor

tedsuo commented Jul 24, 2023

@MovieStoreGuy we are marking OTEPs which are no longer being updated as stale. We will close this PR in one week. If you wish to resurrect this PR, that is fine.

@MovieStoreGuy
Contributor Author

No worries, I don't have much bandwidth to work on this now, but I would love to pick it up at some point.
