
Add helper functions for metric conversion [awsecscontainermetricsreceiver] #1089

Merged: 6 commits into open-telemetry:master, Sep 25, 2020

Conversation

hossain-rayhan (Contributor)

Description:
This change adds helper functions for converting ECS resources metrics to OT metrics.

Link to tracking Issue:
#457

Testing:
Unit test added.

Documentation:
README.md
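
For reference, a minimal sketch of the kind of helper this PR introduces: wrapping a raw ECS stat in an OpenCensus metric proto. The function name, metric name, and unit below are illustrative assumptions, not the receiver's final API.

```go
package awsecscontainermetrics

import (
	metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
	"google.golang.org/protobuf/types/known/timestamppb"
)

// intGauge wraps a single int64 value in an OpenCensus gauge metric with one
// data point. Helpers along these lines are what the conversion code builds on.
func intGauge(name, unit string, value int64, ts *timestamppb.Timestamp) *metricspb.Metric {
	return &metricspb.Metric{
		MetricDescriptor: &metricspb.MetricDescriptor{
			Name: name,
			Unit: unit,
			Type: metricspb.MetricDescriptor_GAUGE_INT64,
		},
		Timeseries: []*metricspb.TimeSeries{{
			Points: []*metricspb.Point{{
				Timestamp: ts,
				Value:     &metricspb.Point_Int64Value{Int64Value: value},
			}},
		}},
	}
}
```

A container-level memory metric could then be produced as, for example, intGauge("container.memory.usage", "Bytes", usageBytes, ts).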

@hossain-rayhan hossain-rayhan requested a review from a team September 22, 2020 05:25
codecov bot commented Sep 22, 2020

Codecov Report

Merging #1089 into master will increase coverage by 0.15%.
The diff coverage is 99.11%.


@@            Coverage Diff             @@
##           master    #1089      +/-   ##
==========================================
+ Coverage   88.84%   89.00%   +0.15%     
==========================================
  Files         251      254       +3     
  Lines       11979    12161     +182     
==========================================
+ Hits        10643    10824     +181     
+ Misses        992      991       -1     
- Partials      344      346       +2     
Flag Coverage Δ
#integration 75.42% <ø> (-0.11%) ⬇️
#unit 88.04% <99.11%> (+0.17%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...sreceiver/awsecscontainermetrics/metrics_helper.go 97.36% <97.36%> (-2.64%) ⬇️
...ricsreceiver/awsecscontainermetrics/accumulator.go 100.00% <100.00%> (ø)
...metricsreceiver/awsecscontainermetrics/resource.go 100.00% <100.00%> (ø)
...tricsreceiver/awsecscontainermetrics/translator.go 100.00% <100.00%> (ø)
...nermetricsreceiver/awsecscontainermetrics/utils.go 100.00% <100.00%> (ø)
receiver/k8sclusterreceiver/watcher.go 95.29% <0.00%> (-2.36%) ⬇️
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6d6d534...91dac3d.

)

// GenerateDummyMetrics generates two dummy metrics
func GenerateDummyMetrics() consumerdata.MetricsData {
Member

Can this method be moved to a test file? Same with createGaugeIntMetric below.

Contributor Author (@hossain-rayhan), Sep 22, 2020

That's a good idea. Planning to remove it entirely in our next PR, when we add our original metrics generation code. Keeping it for now since it's still used by existing code. Added a TODO note on top of it.

ContainerPrefix = "container."
ResourceAttributeServiceNameKey = "service.name"
ResourceAttributeServiceNameValue = "awsecscontainermetrics"
MetricResourceType = "aoc.ecs"
Member

Just curious, what does aoc stand for?

Contributor Author

AOC stands for AWS Observability Collector, the Amazon distribution of OpenTelemetry.

Contributor

FWIW, resource type doesn't exist in the OTel protocol but is there right now for metrics since it still seems to use OpenCensus, so this value will generally be dropped.

Contributor Author

Noted. Even if it does get utilized, we want to use "aoc.ecs" to differentiate our OT metrics from ECS backend metrics.

Contributor

I point it out because it will go away, so if you do have an expectation of having it, it won't be there :P But I think if we put the receiver name in something like telemetry.sdk, then the information should still be preserved, in a way that matches in some sense what our apps send.

containerMetrics.MemoryReserved = *containerMetadata.Limits.Memory
containerMetrics.CPUReserved = *containerMetadata.Limits.CPU

taskMemLimit += containerMetrics.MemoryReserved
Member

Looks like these values aren't being used anywhere?

Contributor Author

Removed.

}

func (acc *metricDataAccumulator) accumulate(
startTime *timestamp.Timestamp,
Member

From usages of the accumulate method, I think the first parameter being passed in is the timestamp of a reporting interval? This should instead be the time when a cumulative metric was reset to 0. See the comment here for details.

Contributor Author

Yeah. I looked into the dockerstatsreceiver and I think it's not setting this value either. We do have the PullStartedAt timestamp and I think we can utilize it, but that needs more thought. For now, I'm not setting this value, so we get the default behavior. Will send a separate PR for this.
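
If PullStartedAt does end up being used as the start time for cumulative metrics, a rough sketch could look like the following. This assumes the task metadata exposes PullStartedAt as an RFC 3339 string; it is not what the PR currently does, which is to leave the start time unset.

```go
package awsecscontainermetrics

import (
	"time"

	"google.golang.org/protobuf/types/known/timestamppb"
)

// startTimestampFromPullStartedAt parses the task metadata's PullStartedAt
// value and returns it as a protobuf timestamp, or nil (default behavior)
// when it is missing or unparsable.
func startTimestampFromPullStartedAt(pullStartedAt string) *timestamppb.Timestamp {
	t, err := time.Parse(time.RFC3339, pullStartedAt)
	if err != nil {
		return nil
	}
	return &timestamppb.Timestamp{Seconds: t.Unix(), Nanos: int32(t.Nanosecond())}
}
```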

resourceAttributes[ResourceAttributeServiceNameKey] = ResourceAttributeServiceNameValue

r := &resourcepb.Resource{
Type: MetricResourceType,
Member

Looks like this same type is being used for all resources from which metrics are being collected, both containers and tasks. Container metrics should be associated with a container resource and similarly a task resource for task metrics.

Contributor Author

As @anuraaga mentioned, this will be dropped eventually. However, if it does get utilized, we prefer to use aoc.ecs for all metrics we are receiving from this receiver.

taskMemLimit += containerMetrics.MemoryReserved
taskCPULimit += containerMetrics.CPUReserved

labelKeys, labelValues := containerLabelKeysAndValues(containerMetadata)
Member

Labels/Attributes that describe a resource (container/task) should be collected as attributes on the resource object. Same for labels collected by taskLabelKeysAndValues.

Contributor Author

Hi @asuresh4, I am a little bit confused here. Can you explain a bit more? These labels describe the properties that differentiate the metrics, so when/what should I set as metric labels? Also, these labels are supposed to be converted into metric dimensions.

Member

The kubelet_stats receiver should be a good example for understanding the concept. It collects metrics from different types of resources (containers, pods, nodes). The properties of resources are added as labels on the resource. An exporter exporting the metric would treat labels from the resource and the metric as dimensions.

Contributor Author

Thanks. Added those as Resource Attributes.
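
For illustration, "labels as resource attributes" here means attaching the identifying metadata to the OpenCensus resource rather than to each metric. A minimal sketch follows; the attribute keys and the resource type literal are chosen only as examples, not the receiver's final set.

```go
package awsecscontainermetrics

import (
	resourcepb "github.com/census-instrumentation/opencensus-proto/gen-go/resource/v1"
)

// containerResource carries container-identifying metadata as resource
// attributes; exporters then treat these as dimensions alongside any metric
// labels. The keys below are illustrative.
func containerResource(containerName, taskARN, clusterName string) *resourcepb.Resource {
	return &resourcepb.Resource{
		Type: "aoc.ecs", // mirrors MetricResourceType discussed above
		Labels: map[string]string{
			"container.name": containerName,
			"ecs.task-arn":   taskARN,
			"ecs.cluster":    clusterName,
		},
	}
}
```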

github.com/stretchr/testify v1.6.1
go.opentelemetry.io/collector v0.10.1-0.20200917170114-639b9a80ed46
go.uber.org/zap v1.16.0
google.golang.org/protobuf v1.25.0
Member

Looks like these are from importing github.com/golang/protobuf/ptypes/timestamp and google.golang.org/protobuf/types/known/timestamppb. Do you need both? Could probably just use timestamppb in both places.

Contributor Author

Updated.
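
After that change only google.golang.org/protobuf/types/known/timestamppb should be needed. A small sketch of a timestampProto-style helper (a helper with this name appears later in the PR; the exact body here is an assumption):

```go
package awsecscontainermetrics

import (
	"time"

	"google.golang.org/protobuf/types/known/timestamppb"
)

// timestampProto converts a time.Time to a protobuf Timestamp using only the
// timestamppb package, avoiding the legacy github.com/golang/protobuf/ptypes import.
func timestampProto(t time.Time) *timestamppb.Timestamp {
	return &timestamppb.Timestamp{Seconds: t.Unix(), Nanos: int32(t.Nanosecond())}
}
```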

@anuraaga (Contributor) left a comment

Left some flyby comments, but basically LGTM with @asuresh4's comments, thanks.

"time"

metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
resourcepb "github.com/census-instrumentation/opencensus-proto/gen-go/resource/v1"
Contributor

@asuresh4 @bogdandrutu Are metrics receivers still using opencensus proto?

Member

No, we changed most of the core to use OTLP and the internal structs. I completely recommend that new components avoid OC.

Member

@hossain-rayhan you need to start using pdata.Metrics

Contributor Author

Hi @bogdandrutu, before sending the data to the next consumer I am using internaldata.OCToMetrics(md) to convert our metrics to pdata.Metrics. Wondering, isn't that enough, like other receivers in the repo, or should we strictly get rid of it now?
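
For reference, a sketch of converting only at the boundary, right before handing data to the next consumer. The import paths and interface names are assumed to match the collector version pinned in this PR and should be treated as assumptions:

```go
package awsecscontainermetricsreceiver

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/consumer/consumerdata"
	"go.opentelemetry.io/collector/translator/internaldata"
)

// pushMetrics keeps the receiver's internal representation in OpenCensus form
// and converts to pdata.Metrics at the last moment before passing downstream.
func pushMetrics(ctx context.Context, next consumer.MetricsConsumer, md consumerdata.MetricsData) error {
	return next.ConsumeMetrics(ctx, internaldata.OCToMetrics(md))
}
```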

Member

That is a temporary solution to make progress and not have to change all components at once. We decided to use it for some old components that we did not have time to change.

Contributor

@hossain-rayhan Yeah, so basically you should only be converting at the last moment when passing down, but here in this sort of receiver-specific logic we want to be using pdata, the OTel format. Or we just have to rewrite it right away. We're also having data-model issues because of using the old format (Resource type, for example), and we want to make sure the model is right.

Contributor Author (@hossain-rayhan), Sep 24, 2020

Thanks @bogdandrutu and @anuraaga. I understand we need to use pdata to convert everything to OTel format eventually. I was planning to move forward with this to meet our internal deadline (9/30/2020). We can send a different PR after October 15th I guess. How do you guys feel about it?

Member

As long as you create an issue and assign it to yourself and @anuraaga, I am fine. I trust that you will fix it. I will let @anuraaga make the final call here.

Contributor Author

Issue created: #1122


TaskPrefix = "ecs.task."
ContainerPrefix = "container."
ResourceAttributeServiceNameKey = "service.name"
ResourceAttributeServiceNameValue = "awsecscontainermetrics"
Contributor

The service name corresponds to an application, not a backend, so for example AuthService, SearchFrontend, etc. We could fill this in with the ECS service name; otherwise we shouldn't fill it, since this isn't the correct semantics.

Contributor Author

Yeah, got it. But for all of our AWS receivers, we are using the receiver name as service.name, because this field will be utilized by our CloudWatch EMF exporter to generate different rules for different receivers, especially for Container Insights.

Contributor

What do you mean by AWS receivers? I think the only one we have is xray, which doesn't do this, and definitely shouldn't since we need to make sure the app's service name is used.

I'm not sure what you mean exactly by the rules, but anyways we can't just fill a semantic convention attribute with something that doesn't follow the spec. If anything, the telemetry.sdk matches closer to what this sort of receiver is doing. @bogdandrutu @tigrannajaryan any suggestion on that?

Also @hossain-rayhan it's important to take a step back and remember what this receiver is here for: it's to translate the container metrics data into the OpenTelemetry format/specification. This is because this data seems useful to users regardless of whether they use CloudWatch or not. While we may need some, but hopefully not much, consideration for specific vendors like CloudWatch, that's not the intent here. If you haven't yet, you should go through in detail at least the Resource and Metrics semantic conventions of the OTel spec before proceeding and make sure you are aligned with them: https://github.com/open-telemetry/opentelemetry-specification/tree/master/specification/resource. That doesn't mean we want to block data that's needed, but it's important to follow the spec as much as possible.

Member (@mxiamxia), Sep 24, 2020

This receiver generates ECS container metrics itself rather than receiving any metrics from outside the OTel Collector. For the metrics generated inside the receiver, the idea is to put the receiver name in the service.name attribute on these metrics. It's similar to how the Prometheus receiver uses job_name as service.name for metrics it scrapes.

Member

I'm not sure what you mean exactly by the rules, but anyways we can't just fill a semantic convention attribute with something that doesn't follow the spec. If anything, the telemetry.sdk matches closer to what this sort of receiver is doing. @bogdandrutu @tigrannajaryan any suggestion on that?

+1. We should not use "service.name" for the receiver name. That is not the purpose of "service.name". "service.name" is supposed to describe the source that emits the metrics. The Collector is just collecting the metric; it is an intermediary, it is not the source. Nor is "telemetry.sdk" intended for that.

The source that emits the metrics is the container here. If we know the name of the service that runs in the container we should set that. If we don't know we should not record it at all.

I do not know why we want to record the receiver name, perhaps you can clarify the use case. This can then be added as a semantic convention for OpenTelemetry as a whole or just for the Collector and will possibly end up in the "otel" namespace.

Contributor

@hossain-rayhan If you can give some more detail about this usage that would be great. I think filling in wrong information is blocking this PR, so the easiest way to proceed would be to just remove setting the service name for now and we can figure out a way to handle what you need in a separate PR.

Contributor Author

Thanks @anuraaga. I am removing it for now, as this should not block the receiver. This is more related to the exporter logic, as it's being utilized to support special customer use cases. If needed, I can support it in a separate PR after further discussion.

"google.golang.org/protobuf/types/known/timestamppb"
)

func convertToOTMetrics(prefix string, m ECSMetrics, labelKeys []*metricspb.LabelKey, labelValues []*metricspb.LabelValue, timestamp *timestamppb.Timestamp) []*metricspb.Metric {
Contributor

Suggested change:
- func convertToOTMetrics(prefix string, m ECSMetrics, labelKeys []*metricspb.LabelKey, labelValues []*metricspb.LabelValue, timestamp *timestamppb.Timestamp) []*metricspb.Metric {
+ func convertToOCMetrics(prefix string, m ECSMetrics, labelKeys []*metricspb.LabelKey, labelValues []*metricspb.LabelValue, timestamp *timestamppb.Timestamp) []*metricspb.Metric {

Contributor

We're actually converting to OpenCensus metrics here.

Contributor Author

Updated.

@hossain-rayhan (Contributor Author)

Thanks @asuresh4 and @anuraaga for your review. I pushed an update based on it. Would appreciate another look.

"time"

metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
resourcepb "github.com/census-instrumentation/opencensus-proto/gen-go/resource/v1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hossain-rayhan Yeah so basically you should only be converting at the last moment when passing down, but here in this sort of receiver-specific logic we want to be using pdata, the OTel format. Or we just have to rewrite it right away. We're also having data-model issues because of using the old format (Resource type for example) and we want to make sure the model is right


TaskPrefix = "ecs.task."
ContainerPrefix = "container."
ResourceAttributeServiceNameKey = "service.name"
Contributor

How about this? conventions.AttributeServiceName
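
That would mean using the collector's shared constant instead of a locally defined key string, roughly like the sketch below (the conventions import path is assumed to match the collector version used here):

```go
package awsecscontainermetrics

import "go.opentelemetry.io/collector/translator/conventions"

// serviceNameAttribute builds the resource attribute using the shared
// semantic-convention constant rather than a hand-written "service.name" key.
func serviceNameAttribute(serviceName string) map[string]string {
	return map[string]string{
		conventions.AttributeServiceName: serviceName,
	}
}
```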

BytesInMiB = 1024 * 1024

TaskPrefix = "ecs.task."
ContainerPrefix = "container."
Contributor

Why do we need to prefix the metrics with container? If they have a container label, they're container metrics, right?

Member

I'm not sure if this is part of the OTel convention, but given that other receivers follow this approach, I think we should do the same here for consistency.

Contributor Author

Yeah, I found some other receivers doing the same, like kubeletstatsreceiver and dockerstatsreceiver.
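
For illustration, the prefixes just namespace the same stats by the level they describe; the metric name in the sketch below is an example, not the receiver's full list:

```go
package awsecscontainermetrics

const (
	TaskPrefix      = "ecs.task."
	ContainerPrefix = "container."
)

// metricNames shows how one stat, e.g. "memory.usage", becomes
// "ecs.task.memory.usage" at the task level and "container.memory.usage"
// at the container level.
func metricNames(stat string) (taskName, containerName string) {
	return TaskPrefix + stat, ContainerPrefix + stat
}
```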


@anuraaga (Contributor) left a comment

Thanks!

@hossain-rayhan (Contributor Author)

Hi @bogdandrutu @tigrannajaryan @asuresh4 can we get this merged?


taskMetrics := ECSMetrics{}
timestamp := timestampProto(time.Now())
taskResources := taskResources(metadata)
Member

nit: taskResource would be more accurate. Same with containerResources (-> containerResource)

Contributor Author

Updated.


acc.accumulate(
taskResources,
convertToOCMetrics(TaskPrefix, taskMetrics, nil, nil, timestamp),
Member

If the 3rd and 4th parameters to this method are always nil, I would remove those parameters.

Contributor Author

I also thought about it while writing this piece of code. Here, I kept the skeleton ready so the same method can be utilized to set metric labels. In our next PRs, we can just pass the LabelKeys and LabelValues and we are done. If we really don't end up utilizing them, I will remove them.
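
When label keys/values are eventually passed instead of nil, they could be built along these lines; the specific label shown is a placeholder, not the receiver's final label set:

```go
package awsecscontainermetrics

import (
	metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
)

// taskLabels returns label keys/values that could later replace the nil
// arguments in the convertToOCMetrics call shown above.
func taskLabels(taskFamily string) ([]*metricspb.LabelKey, []*metricspb.LabelValue) {
	keys := []*metricspb.LabelKey{{Key: "ecs.task-definition-family"}}
	values := []*metricspb.LabelValue{{Value: taskFamily, HasValue: true}}
	return keys, values
}
```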

Member

👍 SGTM

@asuresh4 (Member) left a comment

LGTM, apart from a couple of minor comments.

@tigrannajaryan (Member)

I'll merge once the last comments from @asuresh4 are addressed.

@tigrannajaryan tigrannajaryan merged commit ca0feff into open-telemetry:master Sep 25, 2020
dyladan referenced this pull request in dynatrace-oss-contrib/opentelemetry-collector-contrib Jan 29, 2021
* Split out processor READMEs

* Split out exporter READMEs

* Split out extension READMEs

* Split out receiver READMEs

* Add new line at end of READMEs
ljmsc referenced this pull request in ljmsc/opentelemetry-collector-contrib Feb 21, 2022
* Prepare for releasing v0.11.0

* Update CHANGELOG.md to reflect scope of v0.11.0 release

* Update CHANGELOG.md

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
codeboten pushed a commit that referenced this pull request Nov 23, 2022
* Use a shorter timeout for AWS EC2 metadata requests

Fix #1088 

According to the docs, the value for `timeout` is in seconds: https://docs.python.org/3/library/urllib.request.html#urllib.request.urlopen. 1000 seconds seems slow and in some cases can block the startup of the program being instrumented (see #1088 as an example), because the request will hang indefinitely in non-AWS environments. Using a much shorter 1 second timeout seems like a reasonable workaround for this.

* add changelog entry for timeout change

* use 5s timeout for ECS and EKS, update changelog

Co-authored-by: Srikanth Chekuri <srikanth.chekuri92@gmail.com>