Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metric data exported repeatedly #832

Closed
bvwells opened this issue Apr 15, 2024 · 8 comments
Closed

Metric data exported repeatedly #832

bvwells opened this issue Apr 15, 2024 · 8 comments
Assignees
Labels
priority: p3 question Further information is requested

Comments

@bvwells
Copy link

bvwells commented Apr 15, 2024

We recently performed OpenTelemetry and opentelemetry-operations-go module upgrades and found an issue with the export of metric data (it may be that we have done something wrong with the upgrade so apologies if this is the case).

I've tried to illustrate the issue here:

https://github.com/bvwells/opentelemetry-issue

In this git repository I have two examples which generate a singe histogram metric with varying versions of OpenTelemetry modules. The legacy folder contains an application running:

github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/metric v0.32.7
go.opentelemetry.io/otel v1.15.1
go.opentelemetry.io/otel/metric v0.34.0
go.opentelemetry.io/otel/sdk v1.15.1
go.opentelemetry.io/otel/sdk/metric v0.31.0

The latest folder contains an application running:

github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/metric v0.46.0
go.opentelemetry.io/otel v1.25.0
go.opentelemetry.io/otel/metric v1.25.0
go.opentelemetry.io/otel/sdk/metric v1.25.0

There are more details in the README here https://github.com/bvwells/opentelemetry-issue/blob/main/README.md.

The behaviour when exporting to Google Cloud operations is different between the versions. The older version of the exporter only exports time-series data to the Google Cloud operations API once, whereas the new version of the exporter is exporting the same time-series data every time the exporter runs (every minute). This means that the ingested bytes into Google Cloud operations increases over time irrespective of whether any new metric data has been recorded.

When running in our application we see its memory footprint increase over time as metric data is kept in memory and the ingested bytes into Google Cloud operations increases drastically.

Please let me know if you require any further information on this issue.

@dashpole dashpole self-assigned this Apr 15, 2024
@dashpole dashpole added bug Something isn't working priority: p1 labels Apr 15, 2024
@dashpole
Copy link
Contributor

Where do the print statements come from?

Histograms in cloud monitoring are cumulative, so exporting the current state of the histogram each export interval is expected. I'm not sure why the older version was not doing this.

When running in our application we see its memory footprint increase over time as metric data is kept in memory and the ingested bytes into Google Cloud operations increases drastically.

This sounds like a bug. We should not be accumulating increasing amounts of metric data over time.

@dashpole
Copy link
Contributor

v0.31.0 of the metrics SDK was before it was completely re-written to comply with the specification (see https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md#0320-revised-metric-sdk-alpha---2022-09-18). I suspect the behavior changed with the re-write in open-telemetry/opentelemetry-go#3175.

The exporter (new and old) should be stateless. It should simply translate from the OpenTelemetry types to the Google Cloud Monitoring types. You can verify that by using the stdout exporter, or by adding a print statement inside the ForEach loop of the older exporter: https://github.com/bvwells/opentelemetry-issue/blob/3dea93c18e76f50197296c90b440b958579db127/legacy/vendor/github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/metric/metric.go#L264.

I suspect the pre-rewrite metrics SDK did not properly implement cumulative histogram export, and that it was fixed in the stable version of the SDK.

What makes you think it is leaking memory?

@bvwells
Copy link
Author

bvwells commented Apr 15, 2024

Thanks for looking into the issue.

The increase in memory is caused due to us recording some metrics with quite high cardinality. Previously the data was only kept in memory for an export cycle, however now it keeps all data in memory for all time. I don't think it is a memory leak.

It has now become unfeasible to record this type of metric data in Google Cloud Operations due to this as the ingested bytes has exploded in magnitude and with it the cost (as the pricing model is based on ingested bytes). We had been able to ingest this type of metric into Google Cloud Operations using OpenCensus and previously versions OpenTelemetry. The cost for lower cardinality metrics has also increased significantly.

Technically it feels a little excessive that all time-series data is pushed at every export cycle rather than only time-series which has changed, but it sounds like this would be a large design change....

@dashpole
Copy link
Contributor

Got it. It definitely isn't recommended to use deltas with GCM (IIRC, it negatively impacts query performance), but if you've been happy with the existing experience, I think it is possible to work around this to get the SDK to send deltas. GCM doesn't really support Deltas, but we can send cumulatives that reset every interval.

You currently have:

	exporter, err := metric.New(
		metric.WithProjectID(*projectID),
		metric.WithMetricDescriptorTypeFormatter(func(m metricdata.Metrics) string {
			return fmt.Sprintf("custom.googleapis.com/opentelemetry/%s", m.Name)
		}))
	if err != nil {
		log.Printf("failed to create the exporter: %v", err)
		os.Exit(-1)
	}

You would add something like:

// deltaExporter overrides the Temporality of the Exporter with delta.
type deltaExporter struct {
        otelsdkmetric.Exporter
}

// Override the exporter's temporality function
func (d *deltaExporter) Temporality(ik metric.InstrumentKind) metricdata.Temporality {
        // Use delta temporality for all instruments
        return metricdata.DeltaTemporality
}  

Then, after your setup code above

        // Current setup code ...

        // Wrap the exporter to make the SDK give it deltas instead
        exporter = &deltaExporter{Exporter: exporter}

I'm not 100% sure it will work properly, but it sounds like this was the behavior in the older version of the OTel SDK.

@dashpole
Copy link
Contributor

dashpole commented Apr 15, 2024

The only other thing i'll add is that if cost is a concern, it may be worth looking into Google Managed service for Prometheus (GMP). It is much, much less expensive. You can use the Prometheus exporter from OTel-go with managed GMP collection, or use the googlemanagedprometheus exporter on the collector, but we don't have one that exports directly from the application.

@dashpole dashpole added question Further information is requested priority: p3 and removed bug Something isn't working priority: p1 labels Apr 15, 2024
@bvwells
Copy link
Author

bvwells commented Apr 15, 2024

Thanks for the suggestions and for your analysis. I'll do some playing! :-)

We already use Prometheus in some of the cloud environments we deploy our application to. I suspect that this may push us to fully adopt it in all environments.

@dashpole
Copy link
Contributor

I'm going to close this, but feel free to reopen if you have any other questions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p3 question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants