
Trace events lib changes #8816

Closed

Conversation

@rishabhmaurya (Contributor) commented Jul 21, 2023

Description

As part of the tracing framework (#7648), we added tracing capabilities to OpenSearch. This proposal adds diagnostic metrics that can be associated with spans, providing additional insight into resource consumption across different parts of the system. When enabled, resource consumption is recorded between the start and end of a span, and the net consumption is published. It accounts for resources consumed by all threads involved in the lifecycle of a span. For example, assuming segment merge is instrumented with the tracing framework, enabling diagnostics on segment-merge traces will emit the resources used by merges, such as heap allocation and CPU usage. Such diagnostic metrics can pinpoint bottlenecks in the system and exact scaling requirements.

Tenets
  1. Minimal latency overhead; no impact when disabled.
  2. Failure to compute diagnosis metrics should not fail the request execution.
  3. Should be behind a feature flag controlled by a dynamic cluster setting.
Trace Event Listener

To add diagnostic metrics associated with a span, trace event listeners are required; they are called on span start and span end. To handle thread context switches and record the resource consumption of newly spawned threads, Runnables associated with a span can be wrapped in the RunnableWrapper provided by this framework. A wrapped Runnable generates start and end events so that resource consumption can be measured accurately. TraceEventListener is generic and, if needed, can be used for purposes other than diagnosis.
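
A minimal sketch of how such a listener and wrapper could fit together (only TraceEventListener, RunnableWrapper, Span, and onSpanComplete appear in this PR; the remaining method names are illustrative assumptions):

```java
// Hypothetical sketch; method names other than onSpanComplete are assumptions.
public interface TraceEventListener {
    void onSpanStart(Span span, Thread t);        // span started
    void onSpanComplete(Span span, Thread t);     // span ended (appears in the diff below)
    void onRunnableStart(Span span, Thread t);    // a thread begins work attributable to the span
    void onRunnableComplete(Span span, Thread t); // that thread finishes its slice of work
}

// Wrapping a Runnable so that thread switches still produce start/end events:
public final class RunnableWrapper implements Runnable {
    private final Runnable delegate;
    private final Span span;
    private final TraceEventListener listener;

    public RunnableWrapper(Runnable delegate, Span span, TraceEventListener listener) {
        this.delegate = delegate;
        this.span = span;
        this.listener = listener;
    }

    @Override
    public void run() {
        Thread current = Thread.currentThread();
        listener.onRunnableStart(span, current);
        try {
            delegate.run();
        } finally {
            // Emit the end event even if the task throws (tenet 2: diagnostics must not fail the request).
            listener.onRunnableComplete(span, current);
        }
    }
}
```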
Context Propagation

The figure above illustrates the expected lifecycle of a span that is started in the middle of a task executed by Thread 1 (of ThreadPool 1). The task performs an I/O operation; on completion, Thread 2 (of ThreadPool 1) resumes it. Thread 2 also performs an I/O operation, and the task is later resumed by Thread 3 (of ThreadPool 1) on I/O completion. Thread 3 performs three different sub-tasks, executed in parallel by spawning threads in different executor services. Because it is non-blocking, Thread 3 continues working on the rest of the task while the sub-tasks are delegated to other threads. It does wait for sub-task completion, without blocking the thread, and on the response to the last sub-task, the task is resumed by Thread 2 (of ThreadPool 1) to completion; the span is closed somewhere in between. ThreadPool 3 does not have its Runnables wrapped, so it has no span or registered event-listener information, and no span events are generated for it.

Diagnostic Trace Event Listener

This is a generic implementation of a trace event listener that records resource consumption using the provided ThreadResourceRecorder and emits the corresponding metrics using the provided MetricEmitter.

Thread Resource Recorder

The thread resource recorder exposes start- and end-recording methods, which DiagnosticTraceEventListener uses to record resource consumption. On start, it observes the current usage of a resource using the provided resource observer. On end, it observes the current reading again, computes the difference from the start reading, and emits it. The metric is of delta temporality.
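
A rough sketch of the recorder contract described above (DiagnosticSpan is the span wrapper mentioned later in this PR; the storage methods on it and the exact signatures are assumptions):

```java
// Hypothetical sketch; putStartReading/removeStartReading are assumed stand-ins for
// however DiagnosticSpan persists the start reading.
public class ThreadResourceRecorder {
    private final ThreadResourceObserver observer;

    public ThreadResourceRecorder(ThreadResourceObserver observer) {
        this.observer = observer;
    }

    // Called by DiagnosticTraceEventListener on span/Runnable start events.
    public void startRecording(DiagnosticSpan span, Thread t) {
        span.putStartReading(t.getId(), observer.observe(t)); // stash the start reading, keyed by thread
    }

    // Called on span/Runnable end events; returns the net consumption since startRecording.
    public long endRecording(DiagnosticSpan span, Thread t) {
        long start = span.removeStartReading(t.getId());
        return observer.observe(t) - start; // delta temporality: consumption over the interval
    }
}
```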

Thread Resource Observer

It simply exposes an observe() method that returns the current reading associated with a thread.
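
For instance, a CPU-time observer could be backed by ThreadMXBean, which the discussion below names as the initial source of such metrics (the ThreadResourceObserver interface shape is an assumption):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical CPU-time observer sketch.
public class CpuTimeObserver implements ThreadResourceObserver {
    private final ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();

    @Override
    public long observe(Thread t) {
        // CPU time in nanoseconds consumed by the thread so far;
        // -1 if the JVM does not support (or has disabled) per-thread CPU measurement.
        return threadMXBean.getThreadCpuTime(t.getId());
    }
}
```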

Related Tasks

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


```
@@ -10,6 +10,9 @@
 */

dependencies {
  // logging
  api "org.apache.logging.log4j:log4j-api:${versions.log4j}"
```
Reviewer (Contributor) commented:

We should check once if we can add the logger dependency in the telemetry lib. Only the core lib has this dependency, which is a special lib. Though I don't see a problem, we can re-confirm.

@rishabhmaurya (Author) replied:

I was a bit hesitant too initially; I will check with others on it. Thanks.

Reviewer (Contributor) replied:

The build fails if we add other dependencies. Please verify.

```java
 * @return an unmodifiable map of measurement names to measurement objects
 */
public Map<String, Measurement<Number>> getMeasurements() {
    return measurements;
}
```
Reviewer (Contributor) commented:

defensive copy?

```java
 * @param span the span that has completed
 * @param t the thread associated with the span
 */
void onSpanComplete(Span span, Thread t);
```
Reviewer (Contributor) commented:

Nit: onSpanEnd


@Gaganjuneja (Contributor) commented:

@rishabhmaurya, thanks a lot for putting this up. I have a few questions.

  1. I want to understand the motivation behind combining the span and diagnostic information. Do we want to capture the diagnostic info for a span/trace, or are we creating the span in order to capture the diagnostic info?
  2. As of now these metrics are collated at the span level but emitted independently. Can we keep these two pieces separate and link them with exemplars through spanContext?
  3. How are you thinking of linking these metrics and spans later for analysis? I see you are passing the attributes. Do attributes always have a traceID?
  4. Maybe too early, but how do these metrics work in the case of sampling? Will they also be sampled along with the span or not?
  5. I have concerns about exposing the Span publicly. If you just need the SpanType and hasEnded attributes, can we expose these properties through either SpanScope or a minimal version of Span?

@rishabhmaurya (Contributor, Author) commented Aug 2, 2023:

> I want to understand the motivation behind combining the span and diagnostic information. Do we want to capture the diagnostic info for a span/trace, or are we creating the span in order to capture the diagnostic info?

The motivation is to capture additional metrics around resource usage associated with a span. This is an optional capability that can be enabled on a per-span basis; the default behavior is diagnostics disabled. It is helpful for long-running jobs, where resource consumption can be recorded for a span such as segment merges, peer recovery, or index engine start; for such long-running jobs the overhead is minimal compared to their runtime.
Diagnosis shouldn't be enabled for operations with a very tight SLA, on the order of 10 ms latency.
We do something similar with the resource tracking framework at the task level; this is a generic way to solve the problem and is built in an extensible way.

> As of now these metrics are collated at the span level but emitted independently. Can we keep these two pieces separate and link them with exemplars through spanContext?

I didn't quite get this part, but you're right that collection and emission are independent. This framework leaves the emission logic open-ended. For example, this PR uses OpenTelemetry for metric emission: https://github.com/rishabhmaurya/OpenSearch/pull/44/commits. If we want to use something else, such as the stats API or a logging exporter, it can be integrated too.

> How are you thinking of linking these metrics and spans later for analysis? I see you are passing the attributes. Do attributes always have a traceID?

traceID would be a regular field associated with the metric, which can be used for filtering when visualizing; it will not be a dimension attribute, to avoid a high-cardinality dimension.

> Maybe too early, but how do these metrics work in the case of sampling? Will they also be sampled along with the span or not?

It is the right time to discuss it. It depends on where we are sampling: if it is at the collector level, then all of these diagnosis metrics will be available to the collector. Metric sampling works differently at the collector; metrics can be aggregated or sampled there, and a lot depends on the aggregation function. I will add more details around it.

> I have concerns about exposing the Span publicly. If you just need the SpanType and hasEnded attributes, can we expose these properties through either SpanScope or a minimal version of Span?

Adding points we discussed offline around this a few days ago:
SpanScope is getting bloated with a lot of redundant methods, the same as Span. We wanted to avoid exposing Span because of the endSpan() method, which we only want exposed to framework-internal classes like SpanScope, which can call it when the SpanScope is closed. Is there any other reason why we don't want to expose Span publicly?


@opensearch-trigger-bot commented:

Compatibility status:

```
> Task :checkCompatibility
Checking compatibility for: https://github.com/opensearch-project/reporting.git with ref: main
Incompatible components: [https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git]
Compatible components: [https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git]

BUILD SUCCESSFUL in 33m 10s
```


@opensearch-trigger-bot commented:

Compatibility status:

```
> Task :checkCompatibility
Incompatible components: [https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer.git]
Compatible components: [https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/reporting.git]

BUILD SUCCESSFUL in 36m 35s
```

@Gaganjuneja (Contributor) commented:

> The motivation is to capture additional metrics around resource usage associated with a span. This is an optional capability that can be enabled on a per-span basis; the default behavior is diagnostics disabled. It is helpful for long-running jobs, where resource consumption can be recorded for a span such as segment merges, peer recovery, or index engine start; for such long-running jobs the overhead is minimal compared to their runtime. Diagnosis shouldn't be enabled for operations with a very tight SLA, on the order of 10 ms latency. We do something similar with the resource tracking framework at the task level; this is a generic way to solve the problem and is built in an extensible way.

Makes sense. I was just thinking that if we keep metric and span separate and link them with traceID, then metrics can be supported throughout, as most of the work is already done by you.

> I didn't quite get this part, but you're right that collection and emission are independent. This framework leaves the emission logic open-ended. For example, this PR uses OpenTelemetry for metric emission: https://github.com/rishabhmaurya/OpenSearch/pull/44/commits. If we want to use something else, such as the stats API or a logging exporter, it can be integrated too.

My point is that we are not doing any aggregation at the span level, so the metrics captured for a thread are independent and can be emitted independently, without clubbing them with the span logic.

> traceID would be a regular field associated with the metric, which can be used for filtering when visualizing; it will not be a dimension attribute, to avoid a high-cardinality dimension.

How is this being done? Can you please point me to the code path?

> It is the right time to discuss it. It depends on where we are sampling: if it is at the collector level, then all of these diagnosis metrics will be available to the collector. Metric sampling works differently at the collector; metrics can be aggregated or sampled there, and a lot depends on the aggregation function. I will add more details around it.

I am talking about head sampling.

> Adding points we discussed offline around this a few days ago: SpanScope is getting bloated with a lot of redundant methods, the same as Span. We wanted to avoid exposing Span because of the endSpan() method, which we only want exposed to framework-internal classes like SpanScope, which can call it when the SpanScope is closed. Is there any other reason why we don't want to expose Span publicly?

No, this is the main concern. Anyway, this is being discussed here: https://github.com/opensearch-project/OpenSearch/pull/8831/files/7bac1e33dea887e9c35e08e54b8e7d5d2bc53bcf#r1281897313


@opensearch-trigger-bot commented:

Compatibility status:

```
> Task :checkCompatibility
Incompatible components: [https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/security-analytics.git]
Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

BUILD SUCCESSFUL in 26m 48s
```


@opensearch-trigger-bot commented:

Compatibility status:

```
> Task :checkCompatibility
Incompatible components: [https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer.git]
Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/performance-analyzer-rca.git]

BUILD SUCCESSFUL in 27m 58s
```


@opensearch-trigger-bot commented:

Compatibility status:

```
> Task :checkCompatibility
Incompatible components: [https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/security-analytics.git]
Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

BUILD SUCCESSFUL in 26m 36s
```

@rishabhmaurya (Contributor, Author) commented:

cc @reta, looking forward to your comments. Retagging in case you missed it earlier, thanks.

> @rishabhmaurya could you please rebase? there are conflicts, thank you

There were some disruptive changes with the last rebase, but it's rebased now. Thank you.


@reta (Collaborator) commented Aug 11, 2023:

@rishabhmaurya I am not sure what the state of this pull request is, but here are my general observations:

  • the telemetry support is split across 2 pieces (right now): tracing and metrics
  • the telemetry may provide both or just one
  • the metrics and tracing should be separated (they are very much coupled now)

In the case of OTel (our plugin):

  • the metrics, if available, could be integrated with tracing using SpanProcessor
  • the tracing should be using semantic conventions (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/span-general.md#general-thread-attributes) to facilitate context passing (like thread ids, etc)

Please bring this pull request into a state where at least the checks are passing, so it can be reviewed knowing nothing is missed, thank you.

@rishabhmaurya (Contributor, Author) commented:

My bad on not getting to this one in a timely manner; I got sidetracked into other work and didn't get a chance to take it forward.

> the metrics and tracing should be separated (they are very much coupled now)

If you look at the decoupling — for example, all the wrappers and the span event listener interfaces and implementations — they are very much decoupled.

> the metrics, if available, could be integrated with tracing using SpanProcessor

I like the idea of using SpanProcessor for the OTel implementation. We would still need SpanProcessor-like interfaces defined in the OpenSearch telemetry packages, which would be similar to SpanEventListener; I will rename it and try to make it compatible with SpanProcessor.

> the tracing should be using semantic conventions (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/span-general.md#general-thread-attributes) to facilitate context passing (like thread ids, etc)

Sure, I will change them in the next iteration.

@rishabhmaurya (Contributor, Author) commented:

@Gaganjuneja @reta
I was thinking about using SpanProcessor; however, it can only be tied to the Span start/end events. The reason for using TraceEventListener was to accurately measure the resources consumed by the different threads involved in the lifecycle of the span, so just the start and end events of the span were not sufficient; thus I associated Runnable start and end events as well.
The framework implementation has changed quite a bit, and if my understanding is correct, SpanScope is something that can now start and end multiple times for a given span, depending on when a thread starts/ends its execution.
Does it make sense to associate these events with SpanScope instead of Span start/end and Runnable start/end?

@Gaganjuneja (Contributor) commented:

> Does it make sense to associate these events with SpanScope instead of Span start/end and Runnable start/end?

@rishabhmaurya, I think we have 2 uber points here.

  1. How can we evaluate the metric in the scope of a span — through the EventListener and Runnables, or through SpanScope?
  2. How can we integrate the metrics and traces?

As @reta also mentioned, metrics and traces are two distinct components and shouldn't be coupled, since in a running prod environment they may be independently enabled/disabled or even coexist. Traces would get sampled as well, but we may not want to sample metrics, as they are aggregate in nature. So if we need to integrate them, it should be done at the framework level once both are generated.

But if I understood your use case, you just want to aggregate a metric over all the threads used in the scope of a span. We need to think of a way to do that, such as using callbacks/logic with the span state, or maybe adding the spanId as an attribute to the metric, etc.

@rishabhmaurya (Contributor, Author) commented Sep 21, 2023:

This PR shouldn't be confused with adding metrics support in the framework. It has the following components:

  1. Event listeners around spans to precisely track when a thread starts/ends working on a span, for all threads involved in a given span (and subspans).
  2. Thread resource recorder and observer interfaces (ThreadResourceObserver & ThreadResourceRecorder) to track the resource utilization of threads involved in a span (aka DiagnosticSpan).
  3. Emission of the resource utilization computed in 2 using the MetricEmitter interface.

If you look at the interfaces, all 3 components are very much decoupled. MetricEmitter, when implemented with OpenTelemetry, uses OpenTelemetry meters, which are part of another PR: rishabhmaurya#47

The motivation of this PR is that when diagnosis is enabled in the tracing framework, it will start recording resource usage for all threads within span boundaries. The metrics would later have attributes associated with them, which can be used to filter/aggregate.
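
To make the emission seam concrete, a minimal sketch of MetricEmitter (only the MetricEmitter name, the emitMetric() call mentioned below, and the idea of span attributes come from this PR; the signature is an assumption):

```java
import java.util.Map;

// Hypothetical sketch; implementations could be backed by OpenTelemetry meters,
// a stats API, or a logging exporter, as discussed in this thread.
public interface MetricEmitter {
    // Emits one delta measurement along with the span attributes
    // (e.g. index, shard, source_node) used later for filtering/aggregation.
    void emitMetric(String name, long deltaValue, Map<String, String> attributes);
}
```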

@Gaganjuneja (Contributor) commented:

I think we should not digress towards the Metrics Framework (anyway, for that I have put up an RFC: #10141). I think we need the answers to the following fundamental questions, and then we should proceed.

  1. How are we going to use these metrics?
  2. What is the impact on the metrics if spans are being sampled?
  3. What is the expected state when either metrics or tracing or both are disabled?
  4. If this is so coupled with the scope of a Span, then why can't we just make it an attribute of a Span?

@rishabhmaurya (Contributor, Author) commented:

> How are we going to use these metrics?

The use of the metrics is open-ended in the framework lib. If we use OpenTelemetry for emitMetric(), then the OpenTelemetry meters would be used to emit these metrics, and the metrics would contain the span attributes, which can be used to filter/aggregate metrics later on.
For instance, assuming the PeerRecovery code is wrapped in spans, we can measure the resources consumed by the threads involved in PeerRecovery. We can visualize the overall impact and drill down on the basis of attributes such as source_node, destination_node, index, shard, or which phase consumed the most resources (send_files, translog_recovery), as these attributes would be associated with the metrics during emission.
Currently it is not adding trace-id as an attribute; however, if added, consumption could also be filtered by individual traces if needed.

> What is the impact on the metrics if spans are being sampled?

As long as the span context is propagated, span events will be generated, resource recording will happen, and the metric will be emitted.

> What is the expected state when either metrics or tracing or both are disabled?

When metrics are disabled, resource usage would be computed but not emitted.
When tracing is completely disabled, resource usage won't be computed, as there are no spans for which the resources need to be computed, and thus no metrics would be generated.
If both are disabled, nothing would be computed or emitted.

> If this is so coupled with the scope of a Span, then why can't we just make it an attribute of a Span?

The only reason it is coupled with Span is that it records the usage of a span; it is not coupled to Span directly but depends on the span's events.

Did you mean to ask why we can't emit these metrics as span attributes or as a span event?
If yes: we want them as metrics, not attributes. Metrics can be stored in a time-series database and can be visualized, aggregated, and filtered, which isn't possible with raw span attributes/events.

@reta (Collaborator) commented Sep 22, 2023:

@rishabhmaurya let's try to get back to basics.

  1. We have two separate components: traces and metrics (see please [RFC] Metrics Framework #10141)
  2. By and large, those components are independent and could function together or individually

Now, let's try to understand the relations between tracing and metrics:

  • tracing is about execution flow: how an individual request travels through the system
  • metrics are about the operational state of the system, expressed in terms of quantifiable measures

What are the lifetimes of tracing data and metrics in a functioning (up and running) system? Most metrics live as long as the system is running (I am discarding the persistence layer as it is not relevant at the moment); a system like OpenSearch may have thousands of metrics exposed (and it really has). Most tracing data (like spans) lives for the duration of a request (again, discarding the persistence layer); a system like OpenSearch may produce thousands of spans every single second.

Spans have a mechanism to capture additional details (like attributes and events), but I would argue that spans should not mess with metrics in any way imaginable. Why is that:

> When enabled, resource consumption is recorded between the start and end of a span, and the net consumption is published. It accounts for resources consumed by all threads involved in the lifecycle of a span.

First of all, we already have a tracking mechanism — see please TaskResourceTrackingService and the related APIs — we definitely don't need over-tracking here. Secondly, could you please explain the practical reasons to track CPU time for each and every span in the system? I would suspect that >90% of them would have 0 time spent, but the overhead of calling ThreadMXBean would be huge.

@rishabhmaurya (Contributor, Author) commented Sep 24, 2023:

Thanks @reta for the feedback. You have raised all valid concerns, and it's a privilege to discuss them with folks here who understand this part of the system well and to get their feedback. I would like to trace back my thought process on how I arrived at this solution, and then we can decide whether we want to move ahead with this solution or a different one.

It is mostly derived from managed-service use cases (and should be helpful for the whole community), where we don't have enough insight into system-level metrics for background jobs like peer recovery, merges, and index management jobs like ultrawarm-migrations/transforms/rollover. These jobs can have an impact on search/indexing latency or on the resiliency of the system, and it becomes very hard to troubleshoot such problems, both for live and past events, and to pinpoint the specific job/event.

My plan was to add the capability to emit the metrics available in the performance analyzer plugin - https://opensearch.org/docs/latest/monitoring-your-cluster/pa/reference/ - only for background jobs, and not for each search/indexing request. The number of events generated by background tasks is most of the time directly proportional to the rate of growth of data, the various indexing policies, and cluster events, whereas the number of events generated by requests is directly proportional to the number of requests. The majority of events of any background task are long-running (> 1 s), unlike requests, where the majority of events are short-lived (< 10 ms). These are important invariants in the strategy for designing impact measurement for background tasks, and for this reason the proposal here cannot be used for requests.

Your concern is valid, and it's crucial to consider the trade-offs, including the potential overhead, and to ensure that the data collected serves a meaningful purpose in optimizing system performance or diagnosing issues. There is no one-size-fits-all answer, and this approach should be selective in nature, where the operator can choose for which background job this should be enabled and which metrics to emit. For instance, if we want insight into segment merges, we may want to enable metrics like CPU time, heap allocation rate, IO read/write throughput, disk metrics, and page faults (major/minor). Similarly, for peer recovery there could be contention between threads, especially during translog recovery, so we may additionally want to monitor thread block_time.

Once the impact is quantified, these metrics can be used for various purposes: (1) improving resiliency by detecting issues faster, understanding the bottlenecks caused by a background task, and devising remediation by changing configuration at various layers (OpenSearch shards/index or cluster, JVM) or by using the right hardware or scaling the machines; (2) building automation to throttle background tasks at the most granular level, such as the shards and nodes on which they run, which would contribute to improving the availability of the system; (3) improving the quality of software releases by using these metrics to create baselines of background-task impact for the different workloads defined in the benchmark.

Why am I trying to club span events with metrics?
If we have spans around these background jobs, then we don't have to write any additional code to get these system-level metrics around them. It would just be configuration to enable, choosing the available metrics to be computed/emitted for each background task. The spans generated for background jobs would be far fewer than the spans for search/indexing requests, so scaling should not be an issue; the overhead of calling ThreadMXBean or any metric computation logic would be limited too. I agree we should not enable such metrics for regular search/indexing requests.

@msfroh (Collaborator) commented Sep 25, 2023:

@rishabhmaurya -- I've read through your comment above, and I'm still not entirely sure what the purpose of this PR should be -- which I think is more on me than on you -> I lack sufficient context, but I would like to get it. Essentially, I would like to review your PR, but I don't know where to start.

Can you provide a tl;dr explanation in three sentences that conveys the following:

  1. Who needs to care about this PR -- essentially, who is the target audience?
  2. What problem does this PR solve?
  3. How does the PR solve the problem?

@rishabhmaurya (Contributor, Author) commented:

> Who needs to care about this PR -- essentially, who is the target audience?

The target audience is cluster administrators: to get insight into scaling requirements, to schedule (if possible) background tasks at an appropriate time, and to get insight into increased latency caused by contention with background tasks.
A background job could be segment merges, snapshots, shard peer recovery, index management jobs, etc.

> What problem does this PR solve?

  • It assumes that once a background job is wrapped in a span, the span's events can be used to trigger the computation of the resources consumed by the threads involved. It computes the delta of resources consumed by the threads from the start to the end of the span and publishes it in the form of metrics.
  • The initial plan is to support metrics that can be computed using ThreadMXBean, and ultimately a subset of the metrics from https://opensearch.org/docs/latest/monitoring-your-cluster/pa/reference/, like Disk IO - readBytes, writeBytes, readSysCallRate, writeSysCallRate, page faults, etc.
  • Admins should be able to control dynamically for which background jobs this diagnosis is enabled. They should also have control over which metrics are enabled for a given background job.

> How does the PR solve the problem?

  • It exposes trace event listeners and runnable event listeners (a Runnable wrapper to listen for events when a Runnable starts/stops executing a task).
  • It adds a DiagnosisSpan, which is a wrapper over a Span. DiagnosticsEventListener, which is a TraceEventListener, has access to the DiagnosisSpan to persist the start reading and other attributes, and uses it to compute the delta at the end event and publish the metrics with the right attributes.
  • This PR adds ThreadResourceObserver and ThreadResourceRecorder. ThreadResourceRecorder starts and ends recording of what a ThreadResourceObserver is observing, and is invoked by DiagnosticsEventListener.
  • It is a pluggable architecture, so one can write a plugin to add logic to observe and record the usage of some other resource. (A rough wiring sketch follows below.)
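
A rough wiring sketch of how these pieces could compose, under the assumptions above (the concrete class names OTelMetricEmitter and CpuTimeObserver, and the registration call, are hypothetical):

```java
// Hypothetical wiring; only the interface names come from this PR.
ThreadResourceObserver cpuObserver = new CpuTimeObserver();                 // per-thread CPU time
ThreadResourceRecorder recorder = new ThreadResourceRecorder(cpuObserver); // start/end delta computation
MetricEmitter emitter = new OTelMetricEmitter();                            // e.g. backed by OpenTelemetry meters

// DiagnosticTraceEventListener records via the recorder on span/Runnable
// start/end events and publishes the deltas via the emitter.
TraceEventListener diagnostics = new DiagnosticTraceEventListener(recorder, emitter);
tracer.addTraceEventListener(diagnostics); // registration point is an assumption
```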

Thanks @msfroh for taking a look.

@reta (Collaborator) commented Sep 25, 2023:

> Thanks @reta for the feedback. You have raised all valid concerns, and it's a privilege to discuss them with folks here who understand this part of the system well and to get their feedback. I would like to trace back my thought process on how I arrived at this solution, and then we can decide whether we want to move ahead with this solution or a different one.

Thanks for the context @rishabhmaurya. It is interesting that you have mentioned merges a few times; we have been thinking about them a lot, but it seems like a lot of prerequisite work has to be done in order to even get these processes instrumented somehow (merge schedulers, etc.). Right now Apache Lucene uses internal pools and threads to spawn merges, and that has to be changed.

I would like to suggest the following plan of action:

  • kick off [RFC] Metrics Framework #10141 - that will give us the foundation for metrics (as the holistic APIs)
  • look into the existing TaskResourceTrackingService; it is supposed to be tracking the most important flows in the system (which are usually executed as tasks)
  • specifically for merges, look into instrumenting the Apache Lucene merge process

With respect to this particular pull request, I would suggest putting it on hold (draft mode?) and getting back to it when we a) have all the pieces together and b) conclude that this is the way to implement resource tracking or the like.

@rishabhmaurya (Contributor, Author) commented:

@reta my understanding for merge was that since we use the custom OpenSearchConcurrentMergeScheduler, wrapping doMerge() with a span should be sufficient, since it's a single-threaded task.

@reta (Collaborator) commented Sep 25, 2023:

> @reta my understanding for merge was that since we use the custom OpenSearchConcurrentMergeScheduler, wrapping doMerge() with a span should be sufficient, since it's a single-threaded task.

@rishabhmaurya hm ... doMerge() is called in MergeThread, whereas the merge thread is created by getMergeThread; wrapping doMerge() won't be sufficient, since we have to propagate the context to the MergeThread first, unless I am missing something.

@rishabhmaurya (Contributor, Author) commented:

> @rishabhmaurya hm ... doMerge() is called in MergeThread, whereas the merge thread is created by getMergeThread; wrapping doMerge() won't be sufficient, since we have to propagate the context to the MergeThread first, unless I am missing something.

@reta since the merge thread is single-threaded for a single merge task, do we have to take care of context propagation? It sounds like a bit of a hack, but it should work if we pass the tracer to the OpenSearchConcurrentMergeScheduler constructor and start and end the span in doMerge().

@reta (Collaborator) commented Sep 26, 2023:

> @reta since the merge thread is single-threaded for a single merge task, do we have to take care of context propagation? It sounds like a bit of a hack, but it should work if we pass the tracer to the OpenSearchConcurrentMergeScheduler constructor and start and end the span in doMerge().

@rishabhmaurya no, we need to see not only the merge thread(s), which would appear as isolated spans, but also how the merge was triggered: background task, force merge, or other means. For this purpose we need context propagation.

@rishabhmaurya (Contributor, Author) commented:

@reta how about propagating the context to the merge thread in getMergeThread()?

```java
protected MergeThread getMergeThread(MergeSource mergeSource, MergePolicy.OneMerge merge) throws IOException {
```

So if the thread executing the IndexWriter merge functions has the context, it should work?
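
Something along these lines, as a rough sketch (only getMergeThread and OpenSearchConcurrentMergeScheduler are from the actual code; the tracer/SpanContext API and the hand-off are assumptions):

```java
// Hypothetical sketch inside OpenSearchConcurrentMergeScheduler.
@Override
protected MergeThread getMergeThread(MergeSource mergeSource, MergePolicy.OneMerge merge) throws IOException {
    // Capture the span context of the thread that triggers the merge
    // (background merge, force merge, etc.) before the merge thread is created.
    final SpanContext parent = tracer.getCurrentSpanContext(); // assumed API
    final MergeThread mergeThread = super.getMergeThread(mergeSource, merge);
    // Hand the captured context to the merge thread so the span started around
    // doMerge() becomes a child of the triggering flow instead of an isolated span.
    contextByThread.put(mergeThread.getId(), parent); // hypothetical hand-off
    return mergeThread;
}
```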

@reta (Collaborator) commented Sep 26, 2023:

> So if the thread executing the IndexWriter merge functions has the context, it should work?

Yeah, that is the instrumentation that I mentioned and that needs to be in place; however, we also need to make sure the flow where OpenSearchConcurrentMergeScheduler is running is also instrumented. If you'd like to take that, please create an issue and submit the pull request, thank you.

@rishabhmaurya (Contributor, Author) commented:

@reta created #10244; feel free to assign it to me.

@opensearch-trigger-bot commented:

This PR is stalled because it has been open for 30 days with no activity.

(@opensearch-trigger-bot added the "stalled" label on Jan 13, 2024.)
@kotwanikunal (Member) commented:

@rishabhmaurya Are you still working on this one?

(@opensearch-trigger-bot removed the "stalled" label on Feb 23, 2024.)
@kotwanikunal (Member) commented:

Please reopen if you are still planning to contribute this.
