Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve metric query performance #95776

Closed
6 of 7 tasks
martijnvg opened this issue May 3, 2023 · 3 comments
Closed
6 of 7 tasks

Improve metric query performance #95776

martijnvg opened this issue May 3, 2023 · 3 comments
Assignees
Labels
Meta :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@martijnvg
Copy link
Member

martijnvg commented May 3, 2023

There are a number of performance issues that have been found in production cluster for metric solutions that need to be addressed in order to have competitive query latency in the metric space. This is part of the tsdb effort as it aim is to make Elasticsearch better at storing and querying metric data. Tasks mentioned here are improvements that significantly reduce query time of many metric query workloads or specific ones.

Our current observations indicate that the poor performance is caused by the default refresh behaviour. Shards by default go search-idle after 30 seconds of search inactivity. When a shard is queries that is search idle then a refresh is performed as part of the search and then search execution continues. This adds a significant amount of latency to the query time. Especially because the refresh isn't triggered, but awaits until the scheduled refresh kicks in (which means often for 1 second nothing happens).

Additionally we observed that any search with a percentile aggregation is slow. Under the hood the percentile aggregation uses avl t-digest to compute the percentiles. This shows up as significant hotspot when profiling.

@martijnvg martijnvg added Meta :StorageEngine/TSDB You know, for Metrics labels May 3, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 3, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@StephanErb
Copy link

StephanErb commented May 8, 2023

Our current observations indicate that the poor performance is caused by the default refresh behaviour. Shards by default go search-idle after 30 seconds of search inactivity.

Have you considered to move a way from the default (unset) refresh_interval interval for TSDB indices? As per my understanding the default setting mostly optimizes for bulk indexing. However, for a vast number of TSDB usecases, I would expect:

  • individual series on an index receiving new values in roughly equidistant points in time
  • highly frequent querying by alert rules (xx/s)
  • rather predictable ingest load

This means we can probably have a heuristic which refreshes an index every 0.5 * min(metricset.period). For a typical monitoring interval in a distributed system of 10s - 60s, we'd now be looking at a rather reasonable refresh interval of 5s - 30s without any significant loss in functionality.

There might be some exceptions to the rule (like there Kubernetes Events metricset where you'd want much higher refresh intervals for troubleshooting purposes), but for many other observability usecases such a heuristic might work well.

What do you think?

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Jun 18, 2023
… rewrite without a SearchExecutionContext.

With this change, both query builders can rewrite without using a search context, because QueryRewriteContext often has all the mapping and other index metadata available.

The `TermQueryBuilder` can with this resolve to a `MatchAllQueryBuilder` with needing a `SearchExecutionContext`, which during the can_match phase means that no searcher needs to be acquired and therefor avoiding making a shard search active / potentially refresh.

The `AbstractQueryBuilder#doRewrite(...)` method is altered to by default attempt a coordination rewrite, then fall back to attempt a search rewrite, then finally fall back to do an index metadata aware rewrite.

This was forgotten as part of elastic#96161 and is needed to complete elastic#95776.
martijnvg added a commit that referenced this issue Jun 19, 2023
…t a SearchExecutionContext. (#96905)

With this change, both query builders can rewrite without using a search context, because QueryRewriteContext often has all the mapping and other index metadata available.

The TermQueryBuilder can with this change resolve to a MatchNoneQueryBuilder without needing a SearchExecutionContext, which during the can_match phase means that no searcher needs to be acquired and therefor avoid making a shard search active and doing a potentially refresh.

The AbstractQueryBuilder#doRewrite(...) method is altered to by default attempt a coordination rewrite, then fall back to attempt a search rewrite, then finally fall back to do an index metadata aware rewrite. This is inline with what was discussed here: #96161 (comment)

This change was forgotten as part of #96161 and is needed to complete #95776.
@martijnvg
Copy link
Member Author

Better detect when execution hint map or global_ordinals should be used.

I expermented with this via #101619, but it didn't yield the performance gains I was hoping for. In some cases the performance was worse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Meta :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

No branches or pull requests

3 participants