
Telemetry & Monitoring: Kibana Monitoring & BulkUploader #68998

Closed
3 of 4 tasks
afharo opened this issue Jun 12, 2020 · 13 comments · Fixed by #82638
Labels
Feature:Stack Monitoring · Feature:Telemetry · Meta · Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc.) · Team:Monitoring (Stack Monitoring team)

Comments

@afharo
Member

afharo commented Jun 12, 2020

Lately we've noticed that when adding some cluster-level stats (#68603 and #64935), those collectors are not registered as Usage or Stats collectors in Kibana, because they are not supposed to be reported as part of the stack_stats.kibana.plugins payload.

This results in some information being missed when Monitoring is enabled. As far as I can tell from the code in x-pack/plugins/monitoring, the information reported to the Monitoring cluster is collected via the code in x-pack/plugins/monitoring/server/kibana_monitoring, more specifically in the bulk_uploader.js file.
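For context, a collector that does show up under stack_stats.kibana.plugins is normally registered through the usageCollection plugin, roughly as in the sketch below (the plugin name my_plugin, the helper registerMyUsageCollector, and the single object_count field are made up for illustration; the cluster-level stats mentioned above deliberately skip this registration):

// Hedged sketch of how a usage collector is normally registered so that its
// output ends up under stack_stats.kibana.plugins.<type>. Names flagged as
// hypothetical in the lead-in are not from this issue or the Kibana codebase.
import type { UsageCollectionSetup } from 'src/plugins/usage_collection/server';

interface MyClusterStats {
  object_count: number;
}

export function registerMyUsageCollector(usageCollection: UsageCollectionSetup) {
  const collector = usageCollection.makeUsageCollector<MyClusterStats>({
    type: 'my_plugin', // reported as stack_stats.kibana.plugins.my_plugin
    isReady: () => true,
    schema: {
      object_count: { type: 'long' },
    },
    // Real collectors typically query Elasticsearch or saved objects here.
    fetch: async () => ({ object_count: 0 }),
  });

  usageCollection.registerCollector(collector);
}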

I'm creating this issue to review the logic in BulkUploader to:

  • Do not start collecting until Self-Monitoring is enabled and fully started (see the sketch after this list).
  • Provide a way to report Kibana-collected cluster stats so they can be used later when sending telemetry from the Monitoring cluster.
  • When Kibana belongs to a monitoring cluster, but that cluster is not monitoring itself (it only holds monitoring info from other clusters), it should not report any telemetry about itself.
  • Are there any differences when Metricbeat is used vs. the legacy collector mechanism? Does this API have anything to do with it?
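As a rough illustration of the first item, a minimal sketch of "do not start until Self-Monitoring is enabled" could look like the following (isSelfMonitoringEnabled and startCollectionLoop are hypothetical helpers, not the actual BulkUploader API):

// Minimal sketch, under the assumptions stated above: poll until self-monitoring
// reports as enabled, and only then start the bulk-upload collection loop.
async function startBulkUploaderWhenReady(
  isSelfMonitoringEnabled: () => Promise<boolean>,
  startCollectionLoop: () => void,
  retryMs = 30_000
): Promise<void> {
  while (!(await isSelfMonitoringEnabled())) {
    // Self-monitoring is not (yet) enabled: wait and check again instead of collecting.
    await new Promise((resolve) => setTimeout(resolve, retryMs));
  }
  startCollectionLoop();
}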
@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

@afharo
Member Author

afharo commented Jun 12, 2020

Pinging @elastic/pulse (Team:KibanaTelemetry)

@TinaHeiligers
Contributor

When monitoring is enabled, cluster_stats usage data is read from the .monitoring-es-* indices. This means that any usage data not collected by Kibana itself needs to be added to those indices (pushed by Elasticsearch and Beats) in order to retain parity between local and monitoring collection. We need to decide whether this is appropriate and, if not, determine the best path forward for monitoring-shipped usage data.
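To make that concrete, here is a hedged sketch of the kind of query a collector on the monitoring cluster could run against those indices; the index pattern, the type field, and the timestamp field follow the legacy monitoring document layout as I understand it, so treat the specifics as assumptions rather than the actual implementation:

import { Client } from '@elastic/elasticsearch';

// Hedged sketch: read the most recent cluster_stats document shipped into the
// monitoring indices. Index pattern and field names are assumptions based on
// the legacy monitoring document layout, not verified against this codebase.
const client = new Client({ node: 'http://localhost:9200' });

async function fetchLatestClusterStatsFromMonitoring() {
  const { body } = await client.search({
    index: '.monitoring-es-*',
    size: 1,
    body: {
      query: { term: { type: 'cluster_stats' } },
      sort: [{ timestamp: { order: 'desc' } }],
    },
  });
  return body.hits.hits[0]?._source;
}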

@TinaHeiligers
Contributor

TinaHeiligers commented Jun 29, 2020

@Bamieh
I ran a terms agg to see the % of usage data that's reported through local and monitoring collection:

{
  "aggs": {
    "telemetry_collection": {
      "terms": { "field": "stack_stats.xpack.monitoring.collection_enabled" }
    }
  }
}

Roughly 25% of the reported data comes in through monitoring.

@afharo
Member Author

afharo commented Jun 30, 2020

And could it happen that those monitored clusters are also reporting local telemetry themselves? 😅

@TinaHeiligers
Contributor

@afharo maybe, maybe not. We can't tell, since we don't currently combine local collection and collection through monitoring.

@afharo
Member Author

afharo commented Jun 30, 2020

I mean: the monitored cluster reporting local telemetry itself + the monitoring cluster reporting on its behalf.
It would be nice to know that ratio because if, for instance, 90% of the clusters reported via monitoring also report locally-collected telemetry themselves, then disabling telemetry from monitoring would affect even fewer clusters: we would only lose the remaining 10% of that 25%, i.e. roughly 2.5% of the clusters.

@chrisronline
Contributor

Are there any differences when Metricbeat is used vs. the legacy collector mechanism? Does this API have anything to do with it?

There should not be a difference here. The bulk uploader (which the monitoring plugin uses when collecting monitoring data via legacy collection) and the stats API (which Metricbeat uses when collecting monitoring data via Metricbeat collection) should use the exact same code, or at the very least return the same output. There is a ticket to better consolidate this, but it hasn't been worked on yet. It's worth noting that we have a collection of parity tests that ensure Metricbeat-collected monitoring documents are identical to documents collected through legacy collection.

For future-proofing: the bulk uploader is going away in 8.0. We are currently deprecating that behavior in 7.x and will completely remove it in 8.0, so it might not be worth investing much in that area of the code.

We still want to be sure we understand the telemetry story here, but I'm not sure I'm entirely up to date on it. Happy to help any way I can, though.
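For reference, the parity Chris describes hinges on Metricbeat pulling its Kibana document from Kibana's stats API, so both paths should produce the same payload. A minimal sketch of hitting that endpoint (assuming the 7.x /api/stats?extended=true behaviour; treat the exact path and parameter as assumptions):

// Minimal sketch: fetch the same stats payload Metricbeat collects. The endpoint
// and query parameter reflect my understanding of the 7.x stats API and are
// assumptions here, not a verified contract. Requires a runtime with global fetch.
async function fetchKibanaStats(kibanaUrl = 'http://localhost:5601'): Promise<unknown> {
  const response = await fetch(`${kibanaUrl}/api/stats?extended=true`);
  if (!response.ok) {
    throw new Error(`Failed to fetch Kibana stats: ${response.status}`);
  }
  return response.json();
}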

@alexfrancoeur

alexfrancoeur commented Aug 5, 2020

I see this is targeted for 7.10. Are we still on track for that release? Many production clusters have monitoring enabled and we'll want to start receiving additional telemetry for them as soon as possible. Let me know if there is anything I can do to help expedite.

@afharo
Member Author

afharo commented Aug 10, 2020

AFAIK, we are discussing an RFC to possibly remove the Kibana-related telemetry from the monitoring collection entirely. If that happens, I think we can close or repurpose this issue to make that happen 🙂

@alexfrancoeur

++ I believe we capture data from multiple Kibana instances today when monitoring is enabled, so we'd have to understand the impact of removing it completely.

I do think that not having telemetry from monitoring clusters will become more visible soon, as we begin to trust and use the data. If 25% of clusters really have monitoring enabled, and most production clusters have monitoring enabled (assumption), then we're really only capturing a small subset of production clusters. Should we have a sync specifically to discuss the RFC?

@afharo
Member Author

afharo commented Nov 6, 2020

After the discussions in the RFC and the changes in #82638, I think we can close this issue.

There will be one outstanding item:

  • Do not start collecting until Self-Monitoring is enabled and fully started.

But since bulk_uploader is going to be removed in 8.0, maybe we can let it be for now?

@afharo afharo linked a pull request Nov 6, 2020 that will close this issue
@lukeelmers lukeelmers added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Oct 1, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)
