
Data Classifications for resources and attributes #187

Conversation

MovieStoreGuy
Contributor

The idea behind this is that not all telemetry data is the same with regard to importance, how it is processed, and what is supported by downstream vendors.

This approach would allow for efficient checking of data, so that whatever configuration your organisation requires can then be applied.

Some sections are raw and left as dot points, pending discussion to help expand upon them.
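
For illustration, here is a minimal sketch of what such a classification API could look like in Go. All names here are hypothetical, not part of any existing SDK:

package classification

// Classification is a bit set; each bit marks one class of data.
// The concrete values would be defined by the semantic conventions.
type Classification uint64

const (
	PII             Classification = 1 << iota // personally identifiable information
	UGC                                        // user-generated content
	HighCardinality                            // attributes with unbounded value sets
)

// None marks data with no special classification.
const None Classification = 0

// KeyValue pairs an attribute with its classification.
type KeyValue struct {
	Key   string
	Value string
	Class Classification
}

// Classify tags an attribute value with one or more classes, e.g.
// Classify("net.peer.ip", "203.0.113.7", PII|HighCardinality).
func Classify(key, value string, class Classification) KeyValue {
	return KeyValue{Key: key, Value: value, Class: class}
}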

@MovieStoreGuy MovieStoreGuy requested a review from a team November 11, 2021 05:24
@anuraaga

This seems related to this spec issue open-telemetry/opentelemetry-specification#2114

At a glance, classifications like sensitive and high-cardinality do seem more useful for creating attribute views than just optional and required.

/cc @lmolkova

Member

@yurishkuro yurishkuro left a comment

A few high level thoughts:

  • the word "resource" has specific meaning in OTEL, you're overloading it by saying "resource classification". You are classifying attributes, not resources.
  • I think the classifications can be part of the schema files that we already support, I don't see why std attributes need to be classified in user code.
  • I'm fairly certain there will be big pushback on making this an SDK feature, why not do it in the collector where it can be implemented once?
  • before an OTEP like this is accepted there needs to be a working prototype with benchmarks

@MovieStoreGuy
Contributor Author

MovieStoreGuy commented Nov 11, 2021

the word "resource" has specific meaning in OTEL, you're overloading it by saying "resource classification". You are classifying attributes, not resources.

Both are being classified here: attributes have their own classification, along with the resource sending them. The reason is that if resources containing PD or UGC must not be exported to a vendor, you can drop the resource or remove the attributes that contain PD or UGC.

Classifying the resource itself is important because it avoids searching and validating the attributes or other fields of the resource.
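
To make that concrete, here is a hedged sketch (continuing the hypothetical types above) of how a resource-level check could avoid scanning attributes entirely, assuming the resource carries a pre-aggregated classification of everything it contains:

// Resource carries the union of its attributes' classifications,
// so a filter can reject it without inspecting individual fields.
type Resource struct {
	Attributes []KeyValue
	Class      Classification
}

// Exportable reports whether the resource carries none of the
// disallowed classes, e.g. PII|UGC for a vendor that accepts neither.
func (r Resource) Exportable(disallowed Classification) bool {
	return r.Class&disallowed == 0
}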

I think the classifications can be part of the schema files that we already support, I don't see why std attributes need to be classified in user code.

I don't see an easy pathway to success by doing this, and maintaining it within the schema (semantic convention) alone would create a lot of friction when trying to release changes to it.

I'm fairly certain there will be big pushback on making this an SDK feature, why not do it in the collector where it can be implemented once?

Some organisations cannot allow PD or UGC to leave the application, so implementing this only in the collector would already prevent those users from adopting it.

before an OTEP like this is accepted there needs to be a working prototype with benchmarks

I currently have a rather stripped-down version that I used to test my assumptions; the scenarios I replicated were:

  • Regex filter (removing attributes that matched a pattern)
  • Lookup-based filter (matching exactly against a list of known strings)
  • Classification filtering (searching for attributes that are either PD or UGC)
  • Resource filtering (removing resources if they contain PD or UGC)

Using Go's benchmarking tooling, I got the following results:

BenchmarkBaselineFilter/Collection-Size-100-12       	   40220	     28404 ns/op
BenchmarkLookupFilter/Collection-Size-100-12         	   65653	     18251 ns/op
BenchmarkRegxFilter/Collection-Size-100-12           	     835	   1408659 ns/op
BenchmarkClassificationAttributeFilter/Collection-Size-100-12      	   50774	     23447 ns/op
BenchmarkClassificationResourceFilter/Collection-Size-100-12     	1000000000	         0.0000052 ns/op

Simply ranging over an attribute collection of size 100 (the baseline) took 28,404 ns/op; the lookup filter with exact string matches took 18,251 ns/op, and classification attribute filtering took 23,447 ns/op, roughly midway between the two.

However, using resource classification filtering, resources can be filtered out in sub-nanosecond time.
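
As a side note for anyone reproducing this, here is a sketch of how the resource-filter benchmark could look with Go's testing package (hypothetical code, not the author's prototype). A near-zero ns/op figure in Go usually indicates the compiler reduced the measured loop to almost nothing, so the resource-filter result is best read as "effectively free" rather than as a literal timing:

package classification

import "testing"

// BenchmarkResourceFilter measures the single bitmask check from the
// earlier sketch; hypothetical setup, not the author's prototype.
func BenchmarkResourceFilter(b *testing.B) {
	r := Resource{Class: PII | HighCardinality}
	var kept bool
	for i := 0; i < b.N; i++ {
		kept = r.Exportable(PII | UGC)
	}
	_ = kept // keep the result live so the check isn't trivially removed
}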

@MovieStoreGuy
Contributor Author

By the end of the day, I'll publish my mini prototype so others can play with it themselves.

@tsloughter
Member

I'm fairly certain there will be big pushback on making this an SDK feature, why not do it in the collector where it can be implemented once?

Are you sure? Not only does not everyone use the collector, but that would still mean sending PII data to the collector.

@tigrannajaryan
Member

  • I think the classifications can be part of the schema files that we already support, I don't see why std attributes need to be classified in user code.

+1 to this. I think it is worth entertaining this alternative to see what the tradeoffs are. The SDKs already contain pre-built dictionaries of standard attributes (in the Go SDK it is in the semconv package). We can include the classification in these dictionaries and avoid the need to:

  • Call Classify() every time the attribute is added to a span/log/metric.
  • Send classification information over the network.

For non-standard attributes it may still be necessary to do what this OTEP suggests. I don't know how important that is, though; perhaps standard attributes cover the majority of cases where classification matters (maybe not, I may be wrong).
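
One way to picture this alternative, continuing the earlier hypothetical Go sketch: the generated semconv dictionaries could carry a classification next to each standard attribute key, so instrumentation never calls Classify() for them. The AttributeSpec type and the default classes below are assumptions, not the real semconv package:

// Hypothetical shape of a generated semconv entry.
type AttributeSpec struct {
	Key   string
	Class Classification
}

// Hypothetical generated entries with baked-in classifications.
var (
	NetPeerIP  = AttributeSpec{Key: "net.peer.ip", Class: HighCardinality}
	HTTPMethod = AttributeSpec{Key: "http.method", Class: None}
)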

I don't see an easy pathway to success by doing this, and maintaining it within the schema (semantic convention) alone would create a lot of friction when trying to release changes to it.

On the other hand, by NOT making it part of the semantic conventions there is a big risk of inconsistent behaviour across SDKs, where each instrumentation author makes their own classification decisions.

@jamesmoessis

I think the classifications can be part of the schema files that we already support, I don't see why std attributes need to be classified in user code.

Maybe there could be sensible defaults which can be overridden. Neither the semantic convention nor the instrumentation author can know which attributes will be sensitive, because it changes between applications. For example, net.peer.ip may or may not be PII depending on whether the peer is a user's device or another server. Only the application owner can know that information.
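
A sketch of that "defaults plus overrides" idea, again with the hypothetical types from earlier (the Registry API is an assumption for illustration):

// Registry holds classification defaults keyed by attribute name,
// which the application owner can override at SDK setup time.
type Registry struct {
	classes map[string]Classification
}

func NewRegistry() *Registry {
	return &Registry{classes: map[string]Classification{
		// semconv-supplied default: high cardinality, but not PII.
		"net.peer.ip": HighCardinality,
	}}
}

// Override reclassifies an attribute, e.g. an edge service marking
// net.peer.ip as PII too: r.Override("net.peer.ip", PII|HighCardinality)
func (r *Registry) Override(key string, class Classification) {
	r.classes[key] = class
}

// ClassOf returns the effective classification for an attribute key;
// unknown keys fall back to None.
func (r *Registry) ClassOf(key string) Classification {
	return r.classes[key]
}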

@MovieStoreGuy
Contributor Author

As I mentioned earlier, here is the prototype: https://github.com/MovieStoreGuy/data-classifier. It is rather stripped down, preserving only attributes and resources (without any of the values or nested fields).

+1 to this. I think it is worth entertaining this alternative to see what the tradeoffs are. The SDKs already contain pre-built dictionaries of standard attributes (in the Go SDK it is in the semconv package). We can include the classification in these dictionaries and avoid the need to:

Right, I think you're referring to the Classification Values here?
I am happy for them to be defined as part of the semantic convention, since the values themselves should remain rather static over the longevity of the project (high-cardinality, PII, or UGC data will not stop being a problem).

To your point of:

Send classification information over the network.

Consider a mixed-mode deployment of compute nodes, lambdas, and users running experiments or tests locally. Not sending the classification would mean that the internal gateway / proxy collector forwarding the data cannot filter out known classifications that are unsupported by the system being exported to. And if application owners ignored classification filtering in their exporters, resources and attributes could not be classified further downstream without impacting processing performance.

Reactive classification is extremely computationally expensive and carries a lot of operational overhead; proactive classification removes the need to categorise the data in order to filter it.

Moreover, I mention within the OTEP that half of the requested space (64 bits) can be reserved for vendors to enable advanced features like smaller retention windows or improved resource catalogs when filtering telemetry (SLO resources would be more important than high-cardinality resources, for example). The vendor can extend their exporter to take additional actions when these classifications are set.
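
Sketching that layout with the earlier hypothetical Go types, assuming a 128-bit space with the upper half vendor-reserved as the comment suggests (the vendor bits below are illustrative assumptions; the OTEP text is authoritative on the exact layout):

// FullClassification splits the proposed space in half: a standard
// part defined by the semantic conventions and a vendor part that
// exporters are free to interpret.
type FullClassification struct {
	Standard Classification // semconv-defined classes
	Vendor   uint64         // vendor-reserved bits
}

// Hypothetical vendor-defined bits.
const (
	VendorShortRetention uint64 = 1 << iota // apply a smaller retention window
	VendorSLOResource                       // prioritise in resource catalogs
)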

Classifications could be limited to the application context and not included as part of the wire datagram; however, I see a lot of value in including them, to help build a central means of ensuring resource data isn't sent to the wrong place.

I don't know how important that is, though; perhaps standard attributes cover the majority of cases where classification matters (maybe not, I may be wrong).

Our current adoption of OpenTelemetry at Atlassian requires heavy stream processing to filter out the standard attributes that are high cardinality. We also want to extend the semantic convention to support real user monitoring (RUM), which is currently not defined there; being able to add classifications to those attributes and resources means we can properly experiment with the idea without adding processing requirements to our internal telemetry streaming service.

On the other hand, by NOT making it part of the semantic conventions there is a big risk of inconsistent behaviour across SDKs, where each instrumentation author makes their own classification decisions.

I agree, the values of the classifications have to be consistent across languages, and the semantic convention is the best place to define them.

@MovieStoreGuy
Contributor Author

To @jamesmoessis's point, classifications are coupled to the application deployment.

Meaning that an application sitting on the network edge would define net.peer.ip as PII; an internal service behind the firewall would not consider it PII, but both may consider it high cardinality.

@yurishkuro
Member

Some organisations cannot allow PD or UGC to leave the application, so implementing this only in the collector would already prevent those users from adopting it.

Are those organizations prepared to commit significant engineering capacity to developing and maintaining this non-trivial capability in every language supported by OTel? On the other hand, a deployment of the main app with the collector as a sidecar communicating over the local network (in practice, over a local unix socket) - why would that be considered "leaving the application"?

@MovieStoreGuy
Contributor Author

Hey @yurishkuro ,

Are those organizations prepared to commit significant engineering capacity to developing and maintaining this non-trivial capability in every language supported by OTel?

I am not sure of the intent behind this comment and want to understand more :)
I fully understand that a lot of the language projects are struggling to reach feature-completeness, and adding this would add more to those backlogs; however, the intent of this OTEP is to discuss its merits.

I wanted to understand: what part of this OTEP is non-trivial?
As I understand the problem, adding classification support to OTel means adding fewer than 10 required methods to the SDK, each with a small implementation, plus adding the classification values to the semantic convention, which is mostly auto-generated stubs per language.

I can see that managing the classifications within the semantic convention, and how the convention would adopt them, is non-trivial, and that can be expanded upon further here :)

Per your commitment comment: if the change is as simple as I think, it shouldn't require a significant timeline. I am happy to take on building a collector extension to support this idea, as the collector has been the main project I contribute to.

On the other hand, a deployment of the main app with the collector as a sidecar communicating over the local network (in practice, over a local unix socket)

I am not certain that communicating via a unix socket is supported by the collector; let me check that :)

Why would that be considered "leaving the application"?

Once the data has been sent by the application, the application no longer controls it, and that is where the concern lies.

@MovieStoreGuy MovieStoreGuy changed the title Data Classifications for resources Data Classifications for resources and attributes Nov 15, 2021
@pirgeo
Member

pirgeo commented Nov 22, 2021

I like the idea. I am, however, a little worried about metric aggregations, since attributes are part of the metric identity. Dropping them can require re-aggregation of data (e.g. open-telemetry/opentelemetry-specification#1297). I am not sure how easy or hard that is to add to existing metrics SDKs, or whether it would require a lot of additional processing on the collector side (@jmacd, @reyang, @jsuereth, do you have thoughts on this?)

@MovieStoreGuy
Contributor Author

@pirgeo, I think you can get away with this without needing to re-aggregate.

You can follow a similar principle to MapReduce, meaning:

  • Filter the data (resources or attributes)
  • Update any fields or resources (in the event of rewriting attributes or resources)
  • Aggregate and combine classifications

In short, I think this could simply be a decorator on the exporter, or considered an optional extension.
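
A sketch of the decorator idea in Go, reusing the earlier hypothetical Resource and Classification types (the Exporter interface is a simplified stand-in, not a real SDK interface):

// Exporter is a simplified stand-in for an SDK exporter.
type Exporter interface {
	Export(batch []Resource) error
}

// ClassificationFilter decorates an exporter, dropping resources
// that carry disallowed classes before they reach the wire.
type ClassificationFilter struct {
	Next       Exporter
	Disallowed Classification
}

func (f *ClassificationFilter) Export(batch []Resource) error {
	kept := batch[:0] // filter in place, reusing the backing array
	for _, r := range batch {
		if r.Exportable(f.Disallowed) {
			kept = append(kept, r)
		}
	}
	return f.Next.Export(kept)
}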

@MovieStoreGuy
Contributor Author

It looks like a simplified version of this has been started in #100, where this proposal encapsulates all data classifications.

Perhaps we should join these efforts?

- this was not considered heavily since it misses users (developers) who are using the SDK natively and would then need to manage their own semantic convention, which would also need to be understood by the downstream processors and vendors
- Does not account for version drift when a breaking change is involved
- Does a processor need to know all versions?
- Does that mean values cannot be changed in the semantic convention?
Member

Another downside of doing it in semantic conventions goes to your point above: different organizations will potentially have different classifications. Reaching a common agreement on classifications for the semantic conventions would require lengthy discussion, with a good chance that many attributes never reach a consensus classification.

Member

@jpkrohling jpkrohling left a comment

Despite this being one more thing people will end up learning in order to do instrumentation, I like this proposal, especially as it's optional for current and future users.

MovieStoreGuy and others added 5 commits December 1, 2022 10:33
Co-authored-by: Juraci Paixão Kröhling <juraci.github@kroehling.de>
MovieStoreGuy and others added 2 commits December 1, 2022 10:35
Co-authored-by: Juraci Paixão Kröhling <juraci.github@kroehling.de>
@tedsuo tedsuo added the stale This issue or PR is stale and will be closed soon unless it is resurrected by the author. label Jul 24, 2023
@tedsuo
Contributor

tedsuo commented Jul 24, 2023

@MovieStoreGuy we are marking OTEPs which are no longer being updated as stale. We will close this PR in one week. If you wish to resurrect this PR, that is fine.

@MovieStoreGuy
Contributor Author

No worries, I don't have much bandwidth to work on this now, but I would love to pick it up at some point.
