Add helper functions for metric conversion [awsecscontainermetricsreceiver] #1089

Merged
merged 6 commits into from
Sep 25, 2020
Changes from 4 commits
@@ -0,0 +1,93 @@
// Copyright 2020, OpenTelemetry Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package awsecscontainermetrics

import (
	"time"

	metricspb "github.com/census-instrumentation/opencensus-proto/gen-go/metrics/v1"
	resourcepb "github.com/census-instrumentation/opencensus-proto/gen-go/resource/v1"
Contributor

@asuresh4 @bogdandrutu Are metrics receivers still using opencensus proto?

Member

No, we changed most of the core to use OTLP and the internal structs. I strongly recommend that new components avoid OpenCensus (oc).

Member

@hossain-rayhan you need to start using pdata.Metrics

Contributor Author

Hi @bogdandrutu, before sending the data to the next consumer I am using internaldata.OCToMetrics(md) to convert our metrics to pdata.Metrics. Isn't that enough, as in other receivers in the repo, or should we strictly get rid of it now?

Member

That is a temporary solution to make progress without having to change all components at once. We decided to use it for some old components that we did not have time to change.

Contributor

@hossain-rayhan Yeah, so basically you should only be converting at the last moment when passing data down. Here, in this sort of receiver-specific logic, we want to be using pdata, the OTel format. Otherwise we will just have to rewrite it right away. We're also having data-model issues because of using the old format (the Resource type, for example) and we want to make sure the model is right.

Contributor Author (@hossain-rayhan, Sep 24, 2020)

Thanks @bogdandrutu and @anuraaga. I understand we need to use pdata to convert everything to OTel format eventually. I was planning to move forward with this to meet our internal deadline (9/30/2020). We can send a different PR after October 15th I guess. How do you guys feel about it?

Member

As long as you create an issue and assign it to yourself and @anuraaga, I am fine. I trust that you will fix it. I will let @anuraaga make the final call here.

Contributor Author

Issue created: #1122

	"go.opentelemetry.io/collector/consumer/consumerdata"
)

// metricDataAccumulator defines the accumulator
type metricDataAccumulator struct {
	md []*consumerdata.MetricsData
}

// getMetricsData generates OT Metrics data from task metadata and docker stats
func (acc *metricDataAccumulator) getMetricsData(containerStatsMap map[string]ContainerStats, metadata TaskMetadata) {

	taskMetrics := ECSMetrics{}
	timestamp := timestampProto(time.Now())
	taskResources := taskResources(metadata)
Member

nit: taskResource would be more accurate. Same with containerResources (-> containerResource)

Contributor Author

Updated.


	for _, containerMetadata := range metadata.Containers {
		stats := containerStatsMap[containerMetadata.DockerID]
		containerMetrics := getContainerMetrics(stats)
		// Limits are pointers and may be nil (the task-level code below checks
		// the same type), so guard before dereferencing.
		if containerMetadata.Limits.Memory != nil {
			containerMetrics.MemoryReserved = *containerMetadata.Limits.Memory
		}
		if containerMetadata.Limits.CPU != nil {
			containerMetrics.CPUReserved = *containerMetadata.Limits.CPU
		}

		containerResources := containerResources(containerMetadata)
		for k, v := range taskResources.Labels {
			containerResources.Labels[k] = v
		}

		acc.accumulate(
			containerResources,
			convertToOCMetrics(ContainerPrefix, containerMetrics, nil, nil, timestamp),
		)

		aggregateTaskMetrics(&taskMetrics, containerMetrics)
	}

	// Overwrite Memory limit with task level limit
	if metadata.Limits.Memory != nil {
		taskMetrics.MemoryReserved = *metadata.Limits.Memory
	}

	taskMetrics.CPUReserved = taskMetrics.CPUReserved / CPUsInVCpu

	// Overwrite CPU limit with task level limit
	if metadata.Limits.CPU != nil {
		taskMetrics.CPUReserved = *metadata.Limits.CPU
	}

	acc.accumulate(
		taskResources,
		convertToOCMetrics(TaskPrefix, taskMetrics, nil, nil, timestamp),
Member

If the 3rd and 4th parameters to this method are always nil, I would remove those parameters.

Contributor Author

I also thought about it while writing this piece of code. I kept the skeleton ready so that the same method can be used to set metric labels. In our next PRs, we can just pass the LabelKeys and LabelValues and we are done. If we end up not using them, I will remove them.

Member

👍 SGTM
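
One way to reconcile the two positions in this thread, keeping call sites free of always-nil arguments while leaving room for labels later, is a variadic option. The sketch below is only an illustration with stand-in types: `metric`, `option`, `withLabels`, and `convertMetrics` are hypothetical names, not the receiver's actual `convertToOCMetrics` signature.

```go
package main

import "fmt"

// metric is a hypothetical stand-in for the real metric type.
type metric struct {
	Name        string
	LabelKeys   []string
	LabelValues []string
}

// option mutates a metric after it has been built.
type option func(*metric)

// withLabels attaches label keys and values. Callers that need no labels
// simply omit it instead of passing nil, nil.
func withLabels(keys, values []string) option {
	return func(m *metric) {
		m.LabelKeys = keys
		m.LabelValues = values
	}
}

// convertMetrics builds one metric per name and applies any options.
func convertMetrics(prefix string, names []string, opts ...option) []*metric {
	out := make([]*metric, 0, len(names))
	for _, n := range names {
		m := &metric{Name: prefix + n}
		for _, o := range opts {
			o(m)
		}
		out = append(out, m)
	}
	return out
}

func main() {
	plain := convertMetrics("ecs.task.", []string{"memory.usage"})
	labeled := convertMetrics("container.", []string{"cpu.cores"},
		withLabels([]string{"example.key"}, []string{"example-value"}))
	fmt.Println(plain[0].Name, len(plain[0].LabelKeys))
	fmt.Println(labeled[0].Name, labeled[0].LabelKeys, labeled[0].LabelValues)
}
```

With this shape, adding labels in a later PR would change only the call sites that need them.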

	)
}

// accumulate flattens the given metric slices, drops nil entries, and appends
// the result together with its resource to acc.md.
func (acc *metricDataAccumulator) accumulate(
	r *resourcepb.Resource,
	m ...[]*metricspb.Metric,
) {
	var resourceMetrics []*metricspb.Metric
	for _, metrics := range m {
		for _, metric := range metrics {
			if metric != nil {
				resourceMetrics = append(resourceMetrics, metric)
			}
		}
	}

	r.Labels[ResourceAttributeServiceNameKey] = ResourceAttributeServiceNameValue

	acc.md = append(acc.md, &consumerdata.MetricsData{
		Metrics:  resourceMetrics,
		Resource: r,
	})
}
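
The task-level CPU reservation in `getMetricsData` can be followed with a small worked example. This is a sketch under assumptions: it assumes `aggregateTaskMetrics` (not shown in this diff) sums the per-container values, that container limits are expressed in CPU units, and that the task-level limit is already in vCPUs. The function name `taskCPUReserved` is hypothetical.

```go
package main

import "fmt"

// cpusInVCpu mirrors the CPUsInVCpu constant from the diff: 1024 CPU units per vCPU.
const cpusInVCpu = 1024

// taskCPUReserved sums per-container CPU units, converts the total to vCPUs,
// and lets an explicit task-level limit (assumed to be in vCPUs) take precedence.
func taskCPUReserved(containerCPUUnits []float64, taskLimit *float64) float64 {
	var total float64
	for _, units := range containerCPUUnits {
		total += units
	}
	reserved := total / cpusInVCpu
	if taskLimit != nil {
		reserved = *taskLimit
	}
	return reserved
}

func main() {
	units := []float64{256, 512}
	fmt.Println(taskCPUReserved(units, nil)) // 0.75

	limit := 2.0
	fmt.Println(taskCPUReserved(units, &limit)) // 2
}
```

Two containers reserving 256 and 512 CPU units add up to 768 units, or 0.75 vCPU. When the task declares its own limit, that value replaces the computed one.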
@@ -0,0 +1,111 @@
// Copyright 2020, OpenTelemetry Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package awsecscontainermetrics

import (
	"testing"

	"github.com/stretchr/testify/require"
	"go.opentelemetry.io/collector/consumer/consumerdata"
)

func TestGetMetricsData(t *testing.T) {
	v := uint64(1)
	f := float64(1.0)

	memStats := make(map[string]uint64)
	memStats["cache"] = v

	mem := MemoryStats{
		Usage:          &v,
		MaxUsage:       &v,
		Limit:          &v,
		MemoryReserved: &v,
		MemoryUtilized: &v,
		Stats:          memStats,
	}

	disk := DiskStats{
		IoServiceBytesRecursives: []IoServiceBytesRecursive{
			{Op: "Read", Value: &v},
			{Op: "Write", Value: &v},
			{Op: "Total", Value: &v},
		},
	}

	net := make(map[string]NetworkStats)
	net["eth0"] = NetworkStats{
		RxBytes:   &v,
		RxPackets: &v,
		RxErrors:  &v,
		RxDropped: &v,
		TxBytes:   &v,
		TxPackets: &v,
		TxErrors:  &v,
		TxDropped: &v,
	}

	netRate := NetworkRateStats{
		RxBytesPerSecond: &f,
		TxBytesPerSecond: &f,
	}

	percpu := []*uint64{&v, &v}
	cpuUsage := CPUUsage{
		TotalUsage:        &v,
		UsageInKernelmode: &v,
		UsageInUserMode:   &v,
		PerCPUUsage:       percpu,
	}

	cpuStats := CPUStats{
		CPUUsage:       cpuUsage,
		OnlineCpus:     &v,
		SystemCPUUsage: &v,
		CPUUtilized:    &v,
		CPUReserved:    &v,
	}
	containerStats := ContainerStats{
		Name:        "test",
		ID:          "001",
		Memory:      mem,
		Disk:        disk,
		Network:     net,
		NetworkRate: netRate,
		CPU:         cpuStats,
	}

	tm := TaskMetadata{
		Cluster:  "cluster-1",
		TaskARN:  "arn:aws:some-value/001",
		Family:   "task-def-family-1",
		Revision: "task-def-version",
		Containers: []ContainerMetadata{
			{ContainerName: "container-1", DockerID: "001", DockerName: "docker-container-1", Limits: Limit{CPU: &f, Memory: &v}},
		},
		Limits: Limit{CPU: &f, Memory: &v},
	}

	cstats := make(map[string]ContainerStats)
	cstats["001"] = containerStats

	var mds []*consumerdata.MetricsData
	acc := metricDataAccumulator{
		md: mds,
	}

	acc.getMetricsData(cstats, tm)
	require.Less(t, 0, len(acc.md))
}
@@ -23,6 +23,49 @@ const (
	AttributeECSTaskRevesion = "ecs.task-definition-version"
	AttributeECSServiceName  = "ecs.service"

	ContainerMetricsLabelLen = 3
	TaskMetricsLabelLen      = 6
	CPUsInVCpu               = 1024
	BytesInMiB               = 1024 * 1024

	TaskPrefix      = "ecs.task."
	ContainerPrefix = "container."
Contributor

Why do we need to prefix the metrics with container? If they have container label, they're container metrics right?

Member

I'm not sure if this is part of the OTel convention, but given that other receivers follow this approach, I think we should do the same here for consistency.

Contributor Author

Yes, I found that some other receivers, such as kubeletstatsreceiver and dockerstatsreceiver, do the same.
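
To make the naming question in this thread concrete, here is a minimal sketch of what the two prefixes produce for the same measurement. It assumes the prefix is simply concatenated with the attribute name, which this diff does not show; `metricName` is a hypothetical helper, and the lower-case constants only mirror values from the diff.

```go
package main

import "fmt"

const (
	taskPrefix      = "ecs.task."    // mirrors TaskPrefix
	containerPrefix = "container."   // mirrors ContainerPrefix
	memoryUsage     = "memory.usage" // mirrors AttributeMemoryUsage
)

// metricName joins a scope prefix with a metric attribute name.
func metricName(prefix, attr string) string {
	return prefix + attr
}

func main() {
	fmt.Println(metricName(taskPrefix, memoryUsage))
	fmt.Println(metricName(containerPrefix, memoryUsage))
}
```

Under that assumption, the prefix distinguishes task-level from container-level series by name alone, independent of any resource labels attached to them.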

ResourceAttributeServiceNameKey = "service.name"
Contributor

How about this? conventions.AttributeServiceName

ResourceAttributeServiceNameValue = "awsecscontainermetrics"
Contributor

The service name corresponds with an application, not a backend, so for example AuthService, SearchFrontend, etc. We could fill this in with the ECS service name, or otherwise we shouldn't fill it since this isn't the correct semantics

Contributor Author

Yes, got it. But for all of our AWS receivers we are using the receiver name as service.name, because this field will be used by our CW EMFExporter to generate different rules for different receivers, especially for Container Insights.

Contributor

What do you mean by AWS receivers? I think the only one we have is xray, which doesn't do this, and definitely shouldn't since we need to make sure the app's service name is used.

I'm not sure what you mean exactly by the rules, but anyways we can't just fill a semantic convention attribute with something that doesn't follow the spec. If anything, the telemetry.sdk matches closer to what this sort of receiver is doing. @bogdandrutu @tigrannajaryan any suggestion on that?

Also @hossain-rayhan it's important to take a step back and remember what this receiver is here for - it's to translate the container metrics data into the OpenTelemetry format / specification. This is because this data seems useful to users regardless of if they use cloudwatch or not. While we may need some, but hopefully not much, consideration for specific vendors like cloudwatch, that's not the intent here. If you haven't yet, you should go through in detail at least the Resource and Metrics semantics conventions of OTel spec before proceeding and make sure you are aligned with it https://github.com/open-telemetry/opentelemetry-specification/tree/master/specification/resource. That doesn't mean we want to block data that's needed, but it's important to follow the spec as much as possible.

Member (@mxiamxia, Sep 24, 2020)

This receiver generates the ECS container metrics itself rather than receiving metrics from outside the OTel Collector. For metrics generated inside the receiver, the idea is to put the receiver name in the service.name attribute. This is similar to how the Prometheus receiver uses job_name as service.name for the metrics it scrapes.

Member

> I'm not sure what you mean exactly by the rules, but anyways we can't just fill a semantic convention attribute with something that doesn't follow the spec. If anything, the telemetry.sdk matches closer to what this sort of receiver is doing. @bogdandrutu @tigrannajaryan any suggestion on that?

+1. We should not use "service.name" for receiver name. That is not the purpose of "service.name". "service.name" is supposed to describe the source that emits the metrics. Collector is just collecting the metric, it is an intermediary, it is not the source. Nor is "telemetry.sdk" intended for that.

The source that emits the metrics is the container here. If we know the name of the service that runs in the container we should set that. If we don't know we should not record it at all.

I do not know why we want to record the receiver name, perhaps you can clarify the use case. This can then be added as a semantic convention for OpenTelemetry as a whole or just for the Collector and will possibly end up in the "otel" namespace.

Contributor

@hossain-rayhan If you can give some more detail about this usage that would be great. I think filling in wrong information is blocking this PR, so the easiest way to proceed would be to just remove setting the service name for now and we can figure out a way to handle what you need in a separate PR.

Contributor Author

Thanks @anuraaga . I am removing it for now as this should not block the receiver. This is more related to the exporter logic as it's being utilized for supporting special customer use cases. If needed I can support it in a separate PR after further discussion.
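
The rule the reviewers converge on (record `service.name` only when the service running in the container is known, otherwise omit it) can be sketched as follows. This is an illustration with a plain map rather than the receiver's resource type; `resourceLabels` and its parameters are hypothetical, and the `container.name` key is used here only as an example label.

```go
package main

import "fmt"

// serviceNameKey is the resource attribute key discussed in this thread.
const serviceNameKey = "service.name"

// resourceLabels builds resource labels for a container. The service name is
// recorded only when it is actually known; otherwise the key is left out
// rather than filled with the receiver name.
func resourceLabels(containerName, ecsServiceName string) map[string]string {
	labels := map[string]string{"container.name": containerName}
	if ecsServiceName != "" {
		labels[serviceNameKey] = ecsServiceName
	}
	return labels
}

func main() {
	fmt.Println(resourceLabels("container-1", "search-frontend"))
	fmt.Println(resourceLabels("container-1", ""))
}
```

Leaving the key absent lets downstream consumers tell an unknown service from a real one.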

MetricResourceType = "aoc.ecs"
Member

Just curious, what does aoc stand for?

Contributor Author

AWS Observability Collector -> Amazon distribution of OpenTelemetry.

Contributor

FWIW, resource type doesn't exist in OTel protocol but is there right now for metrics since it still seems to use opencensus. So this value will generally be dropped

Contributor Author

Noted. Even if it gets utilized we want to use "aoc.ecs" to differentiate our OT metrics from ECS backend metrics.

Contributor

I point it out because it will go away, so if you do have an expectation of having it, it won't be there :P But I think if we put the receiver name in something like telemetry.sdk then the information should still be preserved, in a way that matches in some sense what our apps send.


	AttributeMemoryUsage    = "memory.usage"
	AttributeMemoryMaxUsage = "memory.usage.max"
	AttributeMemoryLimit    = "memory.usage.limit"
	AttributeMemoryReserved = "memory.reserved"
	AttributeMemoryUtilized = "memory.utilized"

	AttributeCPUTotalUsage      = "cpu.usage.total"
	AttributeCPUKernelModeUsage = "cpu.usage.kernelmode"
	AttributeCPUUserModeUsage   = "cpu.usage.usermode"
	AttributeCPUSystemUsage     = "cpu.usage.system"
	AttributeCPUCores           = "cpu.cores"
	AttributeCPUOnlines         = "cpu.onlines"
	AttributeCPUReserved        = "cpu.reserved"
	AttributeCPUUtilized        = "cpu.utilized"

	AttributeNetworkRateRx = "network.rate.rx"
	AttributeNetworkRateTx = "network.rate.tx"

	AttributeNetworkRxBytes   = "network.io.usage.rx_bytes"
	AttributeNetworkRxPackets = "network.io.usage.rx_packets"
	AttributeNetworkRxErrors  = "network.io.usage.rx_errors"
	AttributeNetworkRxDropped = "network.io.usage.rx_dropped"
	AttributeNetworkTxBytes   = "network.io.usage.tx_bytes"
	AttributeNetworkTxPackets = "network.io.usage.tx_packets"
	AttributeNetworkTxErrors  = "network.io.usage.tx_errors"
	AttributeNetworkTxDropped = "network.io.usage.tx_dropped"

	AttributeStorageRead  = "storage.read_bytes"
	AttributeStorageWrite = "storage.write_bytes"

	UnitBytes       = "Bytes"
	UnitMegaBytes   = "MB"
	UnitNanoSecond  = "NS"
	UnitBytesPerSec = "Bytes/Sec"
	UnitCount       = "Count"
	UnitVCpu        = "vCPU"
)
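
As a quick illustration of how `BytesInMiB` relates to the `UnitMegaBytes` unit string, the sketch below converts a raw byte count into mebibytes. It assumes integer division, which this diff does not confirm; `toMiB` is a hypothetical helper, not a function from the receiver.

```go
package main

import "fmt"

// bytesInMiB mirrors the BytesInMiB constant from the diff.
const bytesInMiB = 1024 * 1024

// toMiB converts a byte count to whole mebibytes, truncating any remainder.
func toMiB(b uint64) uint64 {
	return b / bytesInMiB
}

func main() {
	fmt.Println(toMiB(536870912)) // 512
}
```

Note that the divisor is a mebibyte (1024 * 1024 bytes) while the unit string is "MB", so values reported under that unit would be binary megabytes if the conversion works as sketched here.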

This file was deleted.
