[Feature] Java agent self-observability #12595

wu-sheng · 2024-09-05T01:40:52Z

Search before asking

I have searched the issues and found no similar feature request.

Description

This has been on my mind for years. I am writing it down to see whether anyone is interested in implementing it.
The Java agent runs as code bundled inside the target process, so its runtime performance is hard to measure with traditional tools. I therefore propose an in-kernel self-observability implementation to measure tracing performance.

We should measure the agent performance using the following metrics:

created_tracing_context_counter - Counter. The number of created tracing contexts. It should carry the label created_by (values: sampler, propagated). created_by=propagated means the agent created the context because a downstream service added the sw8 header, which forces sampling. created_by=sampler means the agent created the context through the local sampler, whichever policy it uses.
finished_tracing_context_counter - Counter. The number of finished tracing contexts. The gap between finished_tracing_context_counter and created_tracing_context_counter should stay relatively stable; otherwise, memory cost will grow.
created_ignored_context_counter and finished_ignored_context_counter - Counters. Same concept as the *_tracing_context_counter pair, applied to ignored contexts.
interceptor_error_counter - Counter. The number of errors raised in interceptor logic, with the labels plugin_name and inter_type (values: constructor, inst, static). Interceptor names are deliberately not used as labels, to avoid OOM: there are only dozens of plugins, which is predictable, but there are hundreds of interceptors.
possible_leaked_context_counter - Counter. The number of detected leaked contexts. It should carry the label source (values: tracing, ignore). source=tracing corresponds to today's shadow tracing context, which would now be measured.
tracing_context_performance - Histogram. For each successfully finished tracing context, it measures every interceptor's time cost (recorded in nanoseconds). The histogram buckets should be {0.01, 0.1, 0.5, 1, 3, 5, 10, 50, 100, 200, 500, 1000} ms. This shows the performance behavior of tracing operations.
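
To make the metric list above concrete, here is a minimal sketch in plain JDK Java. It is not the agent's real meter API: the class `AgentSelfObservabilitySketch` and all of its methods are hypothetical names chosen for illustration. It also assumes that each histogram bucket value is an inclusive upper bound and that a failed interceptor call is counted but not timed. The proposal fixes neither point.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative only: class and method names are hypothetical, not SkyWalking APIs.
final class AgentSelfObservabilitySketch {
    // Bucket bounds from the proposal (0.01 ms .. 1000 ms), expressed in nanoseconds.
    // Assumption: each bound is treated as an inclusive upper limit.
    static final long[] BUCKET_BOUNDS_NANOS = {
        10_000L, 100_000L, 500_000L, 1_000_000L, 3_000_000L, 5_000_000L,
        10_000_000L, 50_000_000L, 100_000_000L, 200_000_000L, 500_000_000L, 1_000_000_000L
    };

    // One slot per bucket, plus a last slot for anything slower than 1000 ms.
    private final LongAdder[] buckets = new LongAdder[BUCKET_BOUNDS_NANOS.length + 1];
    // Counters keyed by "name{labels}"; label values come from small, fixed sets.
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    AgentSelfObservabilitySketch() {
        for (int i = 0; i < buckets.length; i++) {
            buckets[i] = new LongAdder();
        }
    }

    void increment(String name, String labels) {
        counters.computeIfAbsent(name + "{" + labels + "}", k -> new LongAdder()).increment();
    }

    long count(String name, String labels) {
        LongAdder adder = counters.get(name + "{" + labels + "}");
        return adder == null ? 0L : adder.sum();
    }

    // Index of the first bucket whose bound is >= the measured cost.
    static int bucketIndex(long nanos) {
        for (int i = 0; i < BUCKET_BOUNDS_NANOS.length; i++) {
            if (nanos <= BUCKET_BOUNDS_NANOS[i]) {
                return i;
            }
        }
        return BUCKET_BOUNDS_NANOS.length;
    }

    void recordCost(long nanos) {
        buckets[bucketIndex(nanos)].increment();
    }

    long bucketCount(int index) {
        return buckets[index].sum();
    }

    // Times one interceptor call; a failure only bumps interceptor_error_counter.
    void runInterceptor(String pluginName, String interType, Runnable interceptor) {
        long start = System.nanoTime();
        try {
            interceptor.run();
            recordCost(System.nanoTime() - start);
        } catch (RuntimeException e) {
            increment("interceptor_error_counter",
                "plugin_name=" + pluginName + ",inter_type=" + interType);
        }
    }

    public static void main(String[] args) {
        AgentSelfObservabilitySketch meters = new AgentSelfObservabilitySketch();
        meters.increment("created_tracing_context_counter", "created_by=sampler");
        meters.increment("created_tracing_context_counter", "created_by=propagated");
        meters.increment("finished_tracing_context_counter", "");
        long created = meters.count("created_tracing_context_counter", "created_by=sampler")
            + meters.count("created_tracing_context_counter", "created_by=propagated");
        long finished = meters.count("finished_tracing_context_counter", "");
        System.out.println("contexts still open: " + (created - finished));
        meters.runInterceptor("demo-plugin", "inst", () -> { });
        System.out.println("bucket index for 2 ms: " + bucketIndex(2_000_000L));
    }
}
```

Keeping the bucket bounds in nanoseconds avoids floating-point conversion on the hot path, and `LongAdder` keeps contention low when many application threads record at once. A real implementation would also need to report these values to the OAP, which this sketch leaves out.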

Use case

SkyWalking OAP should accept these meters through native protocols, and build a new self-observability dashboard for the Java agent.

Also, I hope this provides some inspiration for other agent maintainers/contributors to add similar concepts agent by agent. cc @apache/skywalking-committers

Related issues

No response

Are you willing to submit a pull request to implement this on your own?

Yes I am willing to submit a pull request on my own!

Code of Conduct

I agree to follow this project's Code of Conduct

weixiang1862 · 2024-09-05T02:02:06Z

Please assign this to me and @CzyerChen; we will collaborate to implement this feature.

wu-sheng added the core feature, agent, feature, and java labels Sep 5, 2024

wu-sheng self-assigned this Sep 5, 2024

wu-sheng assigned weixiang1862 and CzyerChen Sep 5, 2024

weixiang1862 mentioned this issue Sep 13, 2024

Add agent self-observability. apache/skywalking-java#716

Merged

6 tasks

CzyerChen mentioned this issue Sep 14, 2024

Add SkyWalking Java Agent self observability dashboard #12622

Merged

4 tasks

wu-sheng closed this as completed Sep 16, 2024
