
Proposal: Metrics Label Processing Solutions #1892

Closed · amanbrar1999 opened this issue Oct 2, 2020 · 18 comments
Labels: enhancement (New feature or request)

amanbrar1999 (Contributor) commented Oct 2, 2020

Problem

We are looking to add the ability to attach data point labels, specified in the Collector config file, to all metrics that pass through the Collector. This functionality does not exist at present. We request feedback from maintainers on the proposed solutions described below.

Use Case

Our need for this functionality stems from Cortex. We want to run multiple collectors concurrently, all feeding data to Cortex, so that if any collector instance fails for any reason, the others keep collecting metrics and no data is lost. This naturally means that many duplicate metrics will be exported to Cortex; however, Cortex supports deduplication via “HA labels”. This requires each metric to carry two labels: “cluster” and “__replica__”. Crucially, each individual metric source (in this case, each collector instance) must have a different “__replica__” value. This is feasible because the Collector config file supports environment variables, so a unique value such as a pod name can be used. Within each cluster, Cortex chooses which “__replica__” to accept metrics from, which solves the duplicate-metrics issue.

More information can be found here: https://cortexmetrics.io/docs/production/ha-pair-handling/
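As an illustration of the environment-variable substitution mentioned above, a minimal hypothetical fragment (the labels block is only a placeholder for whichever component ends up attaching the labels, and POD_NAME is assumed to be injected into the collector's environment, e.g. via the Kubernetes downward API):

    # Hypothetical config fragment - not an existing component.
    labels:
      cluster: my-cluster        # identical across all collector replicas
      __replica__: ${POD_NAME}   # unique per collector instance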

To be more specific, in OTLP format if a metric looks like this:

[
    {
        "resource": {
            "attributes": [
                {
                    "key": "service.name",
                    "value": {
                        "Value": {
                            "string_value": "otel-collector"
                        }
                    }
                }
            ]
        },
        "instrumentation_library_metrics": [
            {
                "metrics": [
                    {
                        "name": "otelcol_process_cpu_seconds",
                        "description": "Total CPU user and system time in seconds",
                        "unit": "s",
                        "Data": {
                            "double_gauge": {
                                "data_points": [
                                    {
                                        "labels": [
                                            {
                                                "key": "service_instance_id",
                                                "value": "xxxx"
                                            }
                                        ],
                                        "time_unix_nano": 1601479169981000000
                                    }
                                ]
                            }
                        }
                    }
                ]
            }
        ]
    }
]

We want it to look like this after modifications:

[
    {
        "resource": {
            "attributes": [
                {
                    "key": "service.name",
                    "value": {
                        "Value": {
                            "string_value": "otel-collector"
                        }
                    }
                }
            ]
        },
        "instrumentation_library_metrics": [
            {
                "metrics": [
                    {
                        "name": "otelcol_process_cpu_seconds",
                        "description": "Total CPU user and system time in seconds",
                        "unit": "s",
                        "Data": {
                            "double_gauge": {
                                "data_points": [
                                    {
                                        "labels": [
                                            {
                                                "key": "service_instance_id",
                                                "value": "xxxx"
                                            },
                                            {
                                                "key": "cluster",
                                                "value": "some value"
                                            },
                                            {
                                                "key": "__replica__",
                                                "value": "some unique value"
                                            }
                                        ],
                                        "time_unix_nano": 1601479169981000000
                                    }
                                ]
                            }
                        }
                    }
                ]
            }
        ]
    }
]

Note that the only difference between the two is the addition of the “cluster” and “__replica__” data point labels.

Potential Solutions

(Recommended) Solution 1: New ‘labelprocessor’

A possible solution is a new processor that reads labels specified in the Collector config file and attaches them to the data point labels of all metrics.
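For illustration, a rough sketch of how such a processor could be configured (the ‘labelprocessor’ name and its labels option are hypothetical, since the component does not exist yet; the receiver and exporter names are just an example pipeline):

processors:
  labelprocessor:
    # Hypothetical: every key/value below would be appended to the
    # data point labels of all metrics passing through the pipeline.
    labels:
      cluster: my-cluster
      __replica__: ${POD_NAME}

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [labelprocessor]
      exporters: [prometheusremotewrite]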

Pros:

  • Flexibility: labels can be added to any metrics for whatever purposes customers deem fit
  • Having the functionality here conforms with the definition of a processor

Cons:

  • This will require more engineering effort than the other proposed solutions, which only change already existing components

Solution 2: Modify ‘resourceprocessor’ to allow changes to data point labels

The ‘resourceprocessor’ performs a similar transformation, however the key idea to note here is that within OTLP metrics, resource attributes are not the same as data point labels. ‘resourceprocessor’ can add resource attributes, however the desired functionality is the ability to add data point labels.

Here is an example metric from logging exporter that shows the difference:

Resource labels:
 -> service.name: STRING(otel-collector)
 -> host.hostname: STRING(0.0.0.0)
 -> port: STRING(8888)
 -> scheme: STRING(http)
InstrumentationLibraryMetrics #0
Metric #0
Descriptor:
 -> Name: otelcol_process_cpu_seconds
 -> Description: Total CPU user and system time in seconds
 -> Unit: s
 -> DataType: DoubleGauge
DoubleDataPoints #0
Data point labels:
 -> service_instance_id: xxxxx
StartTime: 0
Timestamp: 1601479909978000000
Value: 0.000000

Pros:

  • Similar in effect to a new ‘labelprocessor’, but modifying an existing processor requires less development than writing a new one entirely

Cons:

  • This may be out of scope for what should be expected of a ‘resourceprocessor’, given that we want to make changes within metrics (data point labels) as opposed to resource attributes

Solution 3: Add functionality to Prometheus Remote Write Exporter

What this solution entails is (1) using the ‘resourceprocessor’ to add our labels as resource attributes, and (2) ‘Prometheus remote write exporter’ taking these resource attributes and converting them to valid Prometheus labels.

The functionality for step (2) of this solution does not exist yet, so a possible solution is to add it.
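A sketch of how this could look end to end; the ‘resource’ processor config follows its documented attributes/action style, while the exporter flag for step (2) is hypothetical (it does not exist yet):

processors:
  resource:
    attributes:
      - key: cluster
        value: my-cluster
        action: upsert
      - key: __replica__
        value: ${POD_NAME}
        action: upsert

exporters:
  prometheusremotewrite:
    endpoint: "https://cortex.example.com/api/prom/push"
    # Hypothetical option for step (2): copy resource attributes onto
    # each exported time series as Prometheus labels.
    convert_resource_attributes: true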

Pros:

  • No new processor is required, hence there is less engineering effort compared to a new processor

Cons:

  • Although this satisfies our own use case, the scope is limited to Prometheus, whereas a new processor is more flexible and can be used to add labels to any metric
  • It is not ideal to always add all resource attributes as labels, so some coordination is required for the ‘Prometheus remote write exporter’ to know which resource attributes were added from the configuration file
    • This could mean a change to ‘resourceprocessor’, for example a per-attribute flag that specifies whether the attribute should be added as a metric label. This may once again not conform with the idea of what a ‘resource’ is.

Solution 4: Add batch label editing to metricstransformprocessor

metricstransformprocessor is a processor in the opentelemetry-collector-contrib repo that almost does what we want; however, it can only add labels to metrics that are specified by name, whereas we wish to add labels to all metrics. A possible solution is therefore to add the ability to apply transformations to all metrics in this processor.
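For reference, a sketch of roughly how the processor adds a label today, based on my reading of its documented configuration shape (metric and label values are illustrative); note that metric_name must target one specific metric:

processors:
  metricstransform:
    transforms:
      - metric_name: otelcol_process_cpu_seconds   # applies to this metric only
        action: update
        operations:
          - action: add_label
            new_label: cluster
            new_value: my-cluster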

Pros:

  • Much of the required logic already exists, so these changes should take less engineering effort than writing a new processor

Cons:

  • May be out of the intended scope of this processor since it very clearly states it is not intended to be used for batch metrics changes

Additional Context

Prometheus Receiver

The Prometheus receiver has the functionality to add labels, however this does not suit our use case because Prometheus removes all labels starting with “__” during the scraping process. This means that the “__replica__” label cannot be set here.

Conclusion

The recommended solution for this issue is to make a new ‘labelprocessor’ that can be used to add labels to any metrics that are passing through the collector. Although our scope is currently limited to Prometheus and Cortex, this processor can be used for a more general purpose. We request approval from the Collector maintainers to proceed with our recommended solution, and we are open to discussion on other potential solutions.

cc - @bogdandrutu , @tigrannajaryan , @pjanotti , @alolita

jrcamp (Contributor) commented Oct 6, 2020

https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/master/processor/metricstransformprocessor has the ability to add, remove, update metric labels. Would that work?

amanbrar1999 (Contributor, Author) commented:

As mentioned in the post, metricstransformprocessor was considered; however, it does not have the ability to add labels to a batch of metrics, which is our desired functionality. It requires individual metrics to be specified.

tigrannajaryan self-assigned this Oct 14, 2020
jrcamp (Contributor) commented Oct 14, 2020

One other option:

Maybe we could add metrics support to attributesprocessor under the assumption that attributes and labels will eventually be unified in OT spec. In the meantime maybe the attributesprocessor can handle both with the same or similar configuration semantics.

I don't strongly feel that it must be done this way. I think the more conservative approach is probably to do the dedicated processor, and when the spec is settled, labelsprocessor can be deprecated and attributesprocessor can become the "grand unified" one.

alolita (Member) commented Oct 14, 2020

Hi @tigrannajaryan can you please assign this issue to me @alolita and @amanbrar1999 ty! We plan to add PRs for this code.

tigrannajaryan (Member) commented:

@amanbrar1999 @alolita looking at metricstransformprocessor, it seems trivial to make the metric_name config option optional, and then you will get the functionality that you need. I don't see the need for a new processor.

May be out of the intended scope of this processor since it very clearly states it is not intended to be used for batch metrics changes

I don't think so. It says it is not intended for aggregation. What you need is not aggregation, and I would not call it batching either. It is bulk addition of labels to all metrics regardless of the name. This looks like a subset of what metricstransformprocessor is intended for; we just need to make the metric name check optional.
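A sketch of what the config could look like if metric_name were made optional (hypothetical until the change is actually made):

processors:
  metricstransform:
    transforms:
      # No metric_name: the operations would apply to every metric.
      - action: update
        operations:
          - action: add_label
            new_label: cluster
            new_value: my-cluster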

Would this work?

amanbrar1999 (Contributor, Author) commented:

@tigrannajaryan This was indeed one of the proposed solutions; however, it looks like I misunderstood the intended scope of metricstransformprocessor. I thought bulk label addition was not a desired functionality within it, but if it is indeed a subset of what metricstransformprocessor is intended for, then I believe this solution works.

tigrannajaryan (Member) commented:

@amanbrar1999 I think it is fine to add bulk updates to metricstransformprocessor. @james-bebbington @jrcamp @bogdandrutu any objections (see my proposal above)?

james-bebbington (Member) commented Oct 20, 2020

Yes, what I would actually like to do is update the metrics transform processor to support filter sets for bulk updating so that we can do transforms like the following:

Also blocked by related proposal:

That would then support this use case as well. I was intending to start working on this soon, but if @alolita and @amanbrar1999 want to take that over I'd be more than happy to review instead :)

amanbrar1999 (Contributor, Author) commented Oct 20, 2020

My understanding from reading these linked issues is that we want to (1) update the filtering configuration so that it is less verbose, (2) make those configurations external, and then (3) embed filtering into metricstransformprocessor.

I am trying to understand the scope of the first change: if it requires changes in the internal filtering libraries (https://github.com/open-telemetry/opentelemetry-collector/tree/master/internal/processor), then it would be a breaking change for everything that imports those files, right?

amanbrar1999 (Contributor, Author) commented Oct 20, 2020

Upon further inspection I think (1) is another issue on its own, and I can instead proceed to try to make filtersets external and use them within metricstransformprocessor to support bulk updating

james-bebbington (Member) commented:

Upon further inspection I think (1) is another issue on its own, and I can instead proceed to try to make filtersets external and use them within metricstransformprocessor to support bulk updating

The context for #1081 is that no one is particularly happy with the current structure of the filterset component. It's a little complicated to use. @bogdandrutu wanted us to simplify this component before we make it external and start using it more broadly. Once we do make filterset external, we won't make any further breaking changes to it, so ideally we will clean it up first.

I think the proposal in #1081 (comment) would be relatively straightforward to implement, but it would be good to get sign off from @bogdandrutu and/or @tigrannajaryan first, as it requires breaking changes to the config as well as the code.

tigrannajaryan (Member) commented:

@james-bebbington #1081 (comment) proposes breaking changes to the configuration. I would not want to do that, and the barrier to such changes is quite high. I would like to look carefully into that and understand whether we really want it.

In the meantime, I think there is a much simpler solution to this particular issue which does not require any breaking changes and simply requires a small change to the metricstransformprocessor to allow bulk updates. I would prefer that we do not make this change dependent on the more complicated decision on #1081.

amanbrar1999 (Contributor, Author) commented Oct 22, 2020

Update from the most recent Collector SIG meeting:

Solution 3 is likely the best option to proceed with in this issue, due to how labels are handled in the Prometheus Remote Write Exporter. The metricstransformprocessor changes proposed by James are still very useful and can be added; however, I will be shifting my attention to the resourceprocessor (which should not require any changes, as functionality to add resource attributes already exists) and to the Prometheus remote write exporter's handling of resource attributes.

Having talked with @jmacd we believe resource attributes can be combined with the idea of "external labels" in Prometheus, which is what Cortex expects us to use in order to add these labels.

Furthermore according to @tigrannajaryan the conversion of resource attributes to labels is an expected functionality of any exporter and is currently missing in the Prometheus Remote Write Exporter, hence information is being lost when metrics are exported. https://github.com/open-telemetry/opentelemetry-collector/blob/master/exporter/prometheusremotewriteexporter/exporter.go#L96 This line in the code stands as even further proof that this is intended functionality.

I have made a post in the gitter channel about adding functionality in Prometheus Remote Write Exporter to convert resource attributes to labels to get feedback from users of this exporter and ensure this does not break anything for them, and barring any objections I will be proceeding with adding this functionality to the Prometheus Remote Write Exporter.

tigrannajaryan (Member) commented:

Furthermore according to @tigrannajaryan the conversion of resource attributes to labels is an expected functionality of any exporter and is currently missing in the Prometheus Remote Write Exporter, hence information is being lost when metrics are exported. https://github.com/open-telemetry/opentelemetry-collector/blob/master/exporter/prometheusremotewriteexporter/exporter.go#L96 This line in the code stands as even further proof that this is intended functionality.

I have made a post in the gitter channel about adding functionality in Prometheus Remote Write Exporter to convert resource attributes to labels to get feedback from users of this exporter and ensure this does not break anything for them, and barring any objections I will be proceeding with adding this functionality to the Prometheus Remote Write Exporter.

@jmacd and @bogdandrutu what do you think? We also have the exact same issue when exporting from OTel metric SDKs to Prometheus, so the solution preferably needs to be the same in the SDKs and in the Collector.

amanbrar1999 (Contributor, Author) commented Oct 29, 2020

Another update after more discussion in the Metrics SIG:

A generic component is being designed for conversion of resource attributes to labels in exporters: open-telemetry/opentelemetry-collector-contrib#1359

This still does not solve my use case. The core issue is that the Prometheus Remote Write Exporter takes labels starting with "__" and prepends "key" in front in order to "preserve" the label. This is not compatible with Cortex, because Cortex expects users to be able to add the "__replica__" label, which is doable in a regular Prometheus setup using "external labels". With this component, resource attributes will become labels and get sanitized, which will modify "__replica__".

I was working on a custom resource-attribute-to-label conversion within the Prometheus Remote Write Exporter; however, it appears this is no longer a desired solution because of the component mentioned above.

Discussion with @jmacd has resulted in 2 possible solutions:
Solution 1: have whatever logic erases "__" labels include special cases, e.g., a lookup table of allowed double-underscore labels
Solution 2: have these double-underscore labels injected through exporter configuration, after all the other pipeline stages that may drop double-underscores are evaluated

Solution 1 here would be blocked by the generic component mentioned above, as this issue still requires the ability to add labels.

That being said, I am now proceeding with Solution 2, which is to allow label addition within the Prometheus Remote Write Exporter that is not subject to the "__" restriction that other labels have. These labels would be functionally equivalent to Prometheus "external labels".
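A minimal sketch of what such exporter-level configuration could look like (option name and endpoint are illustrative, not final):

exporters:
  prometheusremotewrite:
    endpoint: "https://cortex.example.com/api/prom/push"
    # Added verbatim to every exported time series after all processor
    # stages, so double-underscore names are not sanitized away.
    external_labels:
      cluster: my-cluster
      __replica__: ${POD_NAME}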

jmacd (Contributor) commented Oct 30, 2020

Solution 2 above sounds good to me, as a dead-simple solution to the problem at hand. All the other processor discussions above look like more-complicated ways of accomplishing this very simple thing needed for Prometheus remote-write.

alolita (Member) commented Oct 31, 2020

@jmacd we're implementing Solution 2: have these double-underscore labels injected through exporter configuration, after all the other pipeline stages that may drop double-underscores are evaluated.

@amanbrar1999 will file a PR shortly.

amanbrar1999 (Contributor, Author) commented:

External labels config has been added to the Prometheus remote write exporter, and hence this issue is now resolved.

andrewhsu added the enhancement (New feature or request) label Jan 6, 2021