[Feature] Java agent self-observability #12595
Labels: agent, core feature, feature, java
Description
This has been on my mind for years. I am writing it down to see if someone would be interested in implementing it.
The Java agent runs in-process, bundled with the application code, so its runtime performance is hard to measure with traditional tools. I therefore want to propose an in-kernel self-observability implementation to measure tracing performance.
We should measure the agent performance using the following metrics:
- created_tracing_context_counter - Counter. The number of created tracing contexts. It should include the label created_by (values: sampler, propagated). created_by=propagated means the agent created the context because a downstream service added the sw8 header to trigger forced sampling. created_by=sampler means the agent created the context through the local sampler, no matter which policy it uses.
- finished_tracing_context_counter - Counter. The number of finished tracing contexts. The gap between finished_tracing_context_counter and created_tracing_context_counter should be relatively stable; otherwise, the memory cost would increase.
- created_ignored_context_counter and finished_ignored_context_counter - Counters. Same concepts as *_tracing_context_counter.
- interceptor_error_counter - Counter. The number of errors that happened in the interceptor logic, with the labels plugin_name and inter_type (constructor, inst, static). We don't add interceptor names to the labels, to avoid OOM. The number of plugins is only dozens and is predictable, but the number of interceptors will be in the hundreds.
- possible_leaked_context_counter - Counter. The number of detected leaked contexts. It should include the label source (values: tracing, ignore). When source=tracing, it is today's shadow tracing context, except that it is now measured.
- tracing_context_performance - Histogram. For each successfully finished tracing context, it measures every interceptor's time cost (in nanoseconds). The buckets of the histogram should be {0.01, 0.1, 0.5, 1, 3, 5, 10, 50, 100, 200, 500, 1000} ms. This shows the performance behavior of the tracing operations.
Use case
SkyWalking OAP should accept these meters through its native protocols, and we should build a new self-observability dashboard for the Java agent.
Also, I hope this gives other agent maintainers/contributors some inspiration to add similar concepts, agent by agent. cc @apache/skywalking-committers
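To make the metrics proposed in the description more concrete, here is a rough sketch of how labeled counters and the latency histogram could be kept in-process. Everything in it is hypothetical: the class names (AgentSelfObservabilitySketch, LabeledCounters, DurationHistogram) and the "name{labels}" key format do not exist in the agent, and treating the proposed buckets as inclusive upper bounds is my assumption, since the proposal does not say how bucket boundaries are interpreted. It uses only JDK classes, not the agent's meter API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch only; none of these types exist in the SkyWalking Java agent. */
public class AgentSelfObservabilitySketch {

    /** Counters keyed by "name{labels}", e.g. created_tracing_context_counter{created_by=sampler}. */
    static final class LabeledCounters {
        private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

        void increment(String name, String labels) {
            counters.computeIfAbsent(name + "{" + labels + "}", k -> new LongAdder()).increment();
        }

        long get(String name, String labels) {
            LongAdder adder = counters.get(name + "{" + labels + "}");
            return adder == null ? 0L : adder.sum();
        }

        /** Sums one metric across all of its label combinations. */
        long sum(String name) {
            String prefix = name + "{";
            long total = 0L;
            for (Map.Entry<String, LongAdder> e : counters.entrySet()) {
                if (e.getKey().startsWith(prefix)) {
                    total += e.getValue().sum();
                }
            }
            return total;
        }
    }

    /** Histogram over the proposed buckets (0.01 ms .. 1000 ms), stored in nanoseconds. */
    static final class DurationHistogram {
        // Assumption: each value is an inclusive upper bound of its bucket.
        static final long[] BOUNDS_NANOS = {
            10_000L, 100_000L, 500_000L, 1_000_000L, 3_000_000L, 5_000_000L,
            10_000_000L, 50_000_000L, 100_000_000L, 200_000_000L, 500_000_000L, 1_000_000_000L
        };
        // One extra slot collects durations above the largest bound.
        private final AtomicLongArray counts = new AtomicLongArray(BOUNDS_NANOS.length + 1);

        void recordNanos(long nanos) {
            int i = 0;
            while (i < BOUNDS_NANOS.length && nanos > BOUNDS_NANOS[i]) {
                i++;
            }
            counts.incrementAndGet(i);
        }

        long countInBucket(int index) {
            return counts.get(index);
        }
    }

    public static void main(String[] args) {
        LabeledCounters counters = new LabeledCounters();
        counters.increment("created_tracing_context_counter", "created_by=sampler");
        counters.increment("created_tracing_context_counter", "created_by=propagated");
        counters.increment("finished_tracing_context_counter", "");

        // The created/finished gap that the proposal says should stay relatively stable.
        long gap = counters.sum("created_tracing_context_counter")
                 - counters.sum("finished_tracing_context_counter");
        System.out.println("unfinished tracing contexts: " + gap);

        DurationHistogram histogram = new DurationHistogram();
        long start = System.nanoTime();
        // ... the interceptor logic being measured would run here ...
        histogram.recordNanos(System.nanoTime() - start);
    }
}
```

LongAdder and AtomicLongArray keep the hot path lock-free, which matters because these meters would be updated inside every interceptor call. Keeping the label set small (created_by, source, plugin_name, inter_type) bounds the number of map entries, which is the same OOM concern raised above for interceptor names.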
Related issues
No response