
Kafka binder without JMX #1835

Merged: 39 commits, Mar 10, 2020
Conversation

@jeqo (Contributor) commented Jan 31, 2020

Support for Clients (Consumer/Producer/Admin APIs) and Streams via the Metrics API, instead of JMX-based metrics.

There are 3 different interfaces to gather metrics from Kafka applications:

  • JMX exported metrics
  • Metrics Reporter
  • Metrics API

JMX-exported metrics were the first approach used to gather metrics, but there are some downsides to consider:

  • The instrumentation needs to know the set of metrics to gather beforehand, so maintenance becomes very cumbersome. The set of metrics changes across releases, making this even more difficult.

Metrics Reporter is another option, which requires loading an implementation of MetricsReporter when configuring Kafka clients. Possible downsides:

  • It requires adding a class name to the Kafka configuration, which differs from Micrometer's current user experience (compared with the Metrics API).

Metrics API (proposed by this PR) uses the KafkaClient#metrics() method to gather the metrics currently available in a client instance.

Resolves #1095
Resolves #1096
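To illustrate the Metrics API approach, here is a minimal self-contained sketch. KafkaMetric and clientMetrics are hypothetical stand-ins for org.apache.kafka.common.Metric and the map returned by KafkaClient#metrics(); the real binder registers Micrometer meters whose value functions read Metric#metricValue() on demand:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.DoubleSupplier;

public class MetricsApiSketch {
    // Hypothetical stand-in for org.apache.kafka.common.Metric:
    // metricValue() returns the current value each time it is read.
    interface KafkaMetric {
        double metricValue();
    }

    // Stand-in for KafkaClient#metrics(): a live map of the metrics
    // currently registered in the client instance. New metrics can be
    // discovered at runtime, unlike a fixed, hand-maintained JMX list.
    static Map<String, KafkaMetric> clientMetrics(DoubleSupplier sendRate) {
        Map<String, KafkaMetric> metrics = new HashMap<>();
        metrics.put("record-send-rate", sendRate::getAsDouble);
        return metrics;
    }

    // A gauge's value function would simply read the Kafka metric on demand.
    static double read(Map<String, KafkaMetric> metrics, String name) {
        KafkaMetric metric = metrics.get(name);
        return metric == null ? Double.NaN : metric.metricValue();
    }
}
```

Because the map is read at collection time, the binder does not need a hard-coded list of metric names that must be kept in sync with each Kafka release.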

@shakuzen shakuzen added the enhancement A general enhancement label Feb 3, 2020
@jeqo jeqo marked this pull request as ready for review February 3, 2020 18:18
@jeqo jeqo requested a review from shakuzen February 8, 2020 13:46
@shakuzen (Member) left a comment

Thanks a lot for the pull request. This generally seems like the right direction.
I left some comments - please take a look.

@jeqo (Contributor, Author) commented Feb 11, 2020

I'd like to add that I agree an approach like #1173 could be complementary, to export metrics when there are off-the-shelf components like Kafka Connectors and we can pass this instrumentation via config.

Similar to Zipkin's Brave instrumentation for Streams and clients, and the Zipkin Kafka interceptors.

@jeqo (Contributor, Author) commented Feb 11, 2020

I've removed the TimeGauge for rates, as the API doesn't allow getting the TimeUnit.

@jeqo jeqo force-pushed the kafka-binder branch 2 times, most recently from 002a4d1 to 039d6ae Compare February 18, 2020 00:50
@shakuzen (Member) left a comment

I left a few comments. I really want to move away from running the logic every time the Kafka metric values are read. Maybe a task scheduler that runs the logic once per step per registered MeterRegistry would be better. It would need to be cleaned up when the MeterBinder is closed, like the notification listeners in some JMX-based MeterBinders.
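The scheduler idea could be sketched as follows; this is a hypothetical simplification, not the merged implementation, and the method names are illustrative:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class StepPollerSketch {
    // Run the metric-checking logic once per step instead of on every
    // read of a Kafka metric value.
    static ScheduledExecutorService startPolling(Runnable checkAndBind, long stepMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(checkAndBind, 0, stepMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }

    // Called from the MeterBinder's close(), analogous to removing the
    // notification listeners in the JMX-based MeterBinders.
    static void stopPolling(ScheduledExecutorService scheduler) {
        scheduler.shutdownNow();
    }
}
```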

@shakuzen (Member) left a comment

I think we're just about ready to ship this. Could you rebase this on the latest master? I did some ad-hoc testing myself, but I don't have any realistic apps using Kafka. Have you tried this out with some Kafka apps to make sure everything looks as expected? I think the plan will be to release this in 1.4 (non-LTS) as an Incubating feature, gather feedback from users on it, and, with enough feedback, hopefully remove the Incubating status in time for 1.5 (LTS).

@jeqo (Contributor, Author) commented Mar 10, 2020

Sounds great! I've tested it with this sample application https://github.com/jeqo/tracing-kafka-apps/compare/metrics and it works as expected on all components (producer/consumer/streams).

@jeqo (Contributor, Author) commented Mar 10, 2020

e.g. producer metrics in Prometheus format:

# HELP kafka_producer_topic_record_send_rate The average number of records sent per second for a topic.
# TYPE kafka_producer_topic_record_send_rate gauge
kafka_producer_topic_record_send_rate{client_id="producer-1",kafka_version="2.4.0",topic="events-v1",} 0.0
# HELP kafka_producer_batch_split_rate The average number of batch splits per second
# TYPE kafka_producer_batch_split_rate gauge
kafka_producer_batch_split_rate{client_id="producer-1",kafka_version="2.4.0",} 0.0
# HELP kafka_producer_incoming_byte_rate The number of bytes read off all sockets per second
# TYPE kafka_producer_incoming_byte_rate gauge
kafka_producer_incoming_byte_rate{client_id="producer-1",kafka_version="2.4.0",} 2.696326255476913
# HELP kafka_producer_request_total The total number of requests sent
# TYPE kafka_producer_request_total counter
kafka_producer_request_total{client_id="producer-1",kafka_version="2.4.0",} 452.0
# HELP kafka_producer_batch_split_total The total number of batch splits
# TYPE kafka_producer_batch_split_total counter
kafka_producer_batch_split_total{client_id="producer-1",kafka_version="2.4.0",} 0.0
# HELP kafka_producer_request_latency_avg The average request latency in ms
# TYPE kafka_producer_request_latency_avg gauge
kafka_producer_request_latency_avg{client_id="producer-1",kafka_version="2.4.0",} NaN
# HELP kafka_producer_node_request_latency_max 
# TYPE kafka_producer_node_request_latency_max gauge
kafka_producer_node_request_latency_max{client_id="producer-1",kafka_version="2.4.0",node_id="node--1",} NaN

@saltos commented Mar 10, 2020

Will you skip the kafka-metrics-count group? This group contains only one counter: the number of registered metrics. I am sure kafka-metrics-count.count is the only counter that has the same name for a producer and a consumer but a different number of tags. Maybe you can remove the double-check for new metrics completely. I've tested your code without the double-check and it works for me.

@jeqo (Contributor, Author) commented Mar 10, 2020

@saltos thanks for your review. I'm OK with removing kafka-metrics-count, as it could be of little or no use (e.g. comparing the number of metrics between versions, maybe?), but I don't think this removes the need for double-checking: metrics with different tags appear across client state transitions (e.g. a producer will have a metric without/with the topic tag before/after sending messages to a topic). wdyt?

@saltos commented Mar 10, 2020

@jeqo thank you for the reply. I've added some extra tests for my code... You are right.
My producer has no metrics with the topic tag before sending any message, so there is no conflict with an existing metric with a different number of tags, but a consumer has several metrics with different numbers of tags. There is a metric with the client-id tag and one with the client-id, topic, and partition tags at the same time... So your code will be truly useful.

@jeqo (Contributor, Author) commented Mar 10, 2020

@shakuzen do you have an opinion about removing the kafka-metrics-count group? Currently the meter name is kafka.kafka.count.count, which seems useless.

I'm torn between adding a special path to name this counter properly (e.g. kafka.metrics.count) and filtering out the group altogether (it has only 1 counter), which is the easiest path.

Testing consumer/producer (IT) with the same registry leads to 2 meters with the same tags, client-id and kafka-version, which looks OK. If eventually there is a tag disparity between consumers/producers, only the one with more tags will persist. @saltos could you confirm whether this is the case on your side? If it is, then removing the group would be more consistent than exposing the counter for only one client.

@shakuzen (Member) commented Mar 10, 2020

> @shakuzen do you have an opinion about removing the kafka-metrics-count group? Currently the meter name is kafka.kafka.count.count, which seems useless.

I would be inclined to filter it out unless a Kafka user can make a case for why they would want that metric specifically. It would not break anyone to filter it out initially and undo that decision later, but it might be breaking for someone if we decide to filter it out later:

  • with the current naming, this group yields kafka.kafka.count.count, which is useless
  • if users request this counter, we will have to add an alternative naming for it
@shakuzen shakuzen added this to the 1.4.0 milestone Mar 10, 2020
@shakuzen (Member) commented
@jeqo Thank you so much for your perseverance in getting this to the state it is now, and for testing. Supporting Kafka producer metrics and Streams metrics has been a highly popular feature request, and with this we will finally have them. Great work!

@shakuzen shakuzen added the release notes Noteworthy change to call out in the release notes label Mar 10, 2020
@shakuzen shakuzen merged commit abc0b30 into micrometer-metrics:master Mar 10, 2020
tchiotludo added a commit to kestra-io/kestra that referenced this pull request Mar 10, 2020
Waiting for micrometer-metrics/micrometer#1835 release
I simply copy the implementation and add collectors on all Kafka client
@saltos commented Mar 10, 2020

@saltos could you confirm if this is the case on your side?

Yes; as a result I have only one metric, with the client-id and listener tags (from a consumer).

@saltos commented Mar 11, 2020

I found some odd behaviour with consumer metrics.
I have 3 listeners but only one bytes_consumed_total metric.

kafka_consumer_fetch_manager_bytes_consumed_total{client_id="prototype-0",group="kafka-spring",kafka_version="2.3.1",listener="foo-dlt-listener",topic="foo_DLT",} 0.0

The checkAndBindMetrics function contains code that can lead to the loss of some metrics when a new metric has the same tags with different values.

Collection<Meter> meters = registry.find(metricName(metric)).meters();
for (Meter meter : meters) {
    if (meter.getId().getTags().size() < (metricTags(metric).size() + extraTagsSize))
        registry.remove(meter);
    // Check if already exists
    else if (meter.getId().getTags().equals(metricTags(metric))) return;
    else hasLessTags = true;
}

The condition meter.getId().getTags().equals(metricTags(metric)) will be false because metricTags(metric) doesn't contain the extra tags.
I changed this condition to:

for (Meter meter : meters) {
    List<Tag> meterTags = meter.getId().getTags();
    List<Tag> metricTags = metricTags(metric);
    metricTags.addAll(tags);

    if (meterTags.size() < metricTags.size()) {
        registry.remove(meter);
    } else if (meterTags.size() == metricTags.size()) {
        if (meter.getId().getTags().equals(metricTags)) {
            // do not register the metric with the same tags
            return;
        } else {
            // create the new metric with different values of the tags
            break;
        }
    } else {
        hasLessTags = true;
    }
}

and got all 3 metrics.

kafka_consumer_fetch_manager_bytes_consumed_total{client_id="prototype-0",group="kafka-spring",kafka_version="2.3.1",listener="bar-listener",topic="bar",} 1113.0
kafka_consumer_fetch_manager_bytes_consumed_total{client_id="prototype-0",group="kafka-spring",kafka_version="2.3.1",listener="foo-listener",topic="foo",} 2268.0
kafka_consumer_fetch_manager_bytes_consumed_total{client_id="prototype-0",group="kafka-spring",kafka_version="2.3.1",listener="foo-dlt-listener",topic="foo_DLT",} 0.0
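The tag comparison behind this fix can be reduced to a self-contained sketch. Tags are modeled here as plain "key=value" strings and the enum names are illustrative, not from the binder; the key change is that an equal tag count with different values now registers a new meter instead of being dropped:

```java
import java.util.List;

public class TagCheckSketch {
    enum Action { REPLACE_OLD_METER, SKIP_DUPLICATE, REGISTER_NEW_METER }

    // Decide what to do with an incoming Kafka metric, given an existing
    // meter registered under the same name.
    static Action compare(List<String> meterTags, List<String> incomingTags) {
        if (meterTags.size() < incomingTags.size()) {
            return Action.REPLACE_OLD_METER;     // old meter is less specific: remove it
        }
        if (meterTags.size() == incomingTags.size()) {
            return meterTags.equals(incomingTags)
                    ? Action.SKIP_DUPLICATE      // same tags: already bound
                    : Action.REGISTER_NEW_METER; // e.g. different listener/topic values
        }
        return Action.REGISTER_NEW_METER;        // incoming has fewer tags ("hasLessTags" path)
    }
}
```

With the original condition, the "same size, different values" case fell through and the new metric was never registered, which is why only one of the three listeners' bytes_consumed_total series survived.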

@jeqo jeqo deleted the kafka-binder branch March 11, 2020 06:15
@shakuzen (Member) commented

@saltos Thank you for testing this out and reporting back your findings - it's great to find these kinds of things before the release. Would you please make a new issue to track this? If you would be interested, also feel free to make a pull request with your mentioned changes. Ideally we would want to add a test case that covers the scenario you found. Thanks again for the help!

@saltos commented Mar 11, 2020

@shakuzen OK, I will create an issue and try to make a pull request with the test; it would be my pleasure. You are doing a really remarkable job.

@jeqo jeqo mentioned this pull request Mar 11, 2020
@jeqo (Contributor, Author) commented Mar 11, 2020

Just realized I was creating the PR while this discussion was ongoing; sorry if I've interrupted your work on this, @saltos.

#1896 should solve this, while removing some boilerplate code; please take a look.

@saltos commented Mar 11, 2020

@jeqo Awesome, great job! This solves the problem.
