Improve metric query performance #95776

martijnvg · 2023-05-03T10:28:38Z

A number of performance issues have been found in production clusters running metric solutions, and they need to be addressed in order to have competitive query latency in the metric space. This is part of the tsdb effort, as its aim is to make Elasticsearch better at storing and querying metric data. The tasks mentioned here are improvements that significantly reduce query time, either for many metric query workloads or for specific ones.

Our current observations indicate that the poor performance is caused by the default refresh behaviour. Shards by default go search-idle after 30 seconds of search inactivity. When a search-idle shard is queried, a refresh is performed as part of the search, and only then does search execution continue. This adds a significant amount of latency to the query time, especially because the search doesn't trigger the refresh itself but waits until the scheduled refresh kicks in (which often means that nothing happens for 1 second).
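The wait described above can be illustrated with a toy model. This is only a sketch, not Elasticsearch code: the class name `ToyShard`, its fields, and the assumption that scheduled refreshes fire on exact multiples of the refresh interval are all made up for illustration, and the cost of the refresh itself is ignored.

```python
import math
from dataclasses import dataclass


@dataclass
class ToyShard:
    """Toy model of one shard (hypothetical, not Elasticsearch code)."""

    idle_after: float = 30.0        # seconds without searches before going search-idle
    refresh_interval: float = 1.0   # period of the scheduled refresh (assumed aligned to multiples)
    last_search: float = 0.0        # timestamp of the most recent search

    def is_search_idle(self, now: float) -> bool:
        return now - self.last_search >= self.idle_after

    def extra_latency(self, now: float, trigger_refresh: bool = False) -> float:
        """Seconds a search arriving at `now` spends waiting for a refresh to start."""
        idle = self.is_search_idle(now)
        self.last_search = now
        if not idle:
            return 0.0  # shard is search-active: no refresh on the search path
        if trigger_refresh:
            return 0.0  # refresh is started immediately instead of being awaited
        # Otherwise the search waits for the next scheduled refresh tick.
        next_tick = math.floor(now / self.refresh_interval + 1) * self.refresh_interval
        return next_tick - now


shard = ToyShard()
print(shard.extra_latency(10.0))    # shard still search-active
print(shard.extra_latency(45.25))   # shard went idle, waits for the next tick
```

In this model a search hitting an idle shard waits up to one full `refresh_interval`, while passing `trigger_refresh=True` mimics the idea of triggering the refresh as soon as the shard becomes search-active.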

Additionally, we observed that any search with a percentile aggregation is slow. Under the hood, the percentile aggregation uses the AVL-based t-digest to compute the percentiles. This shows up as a significant hotspot when profiling.

Build a new Rally track that measures performance when shards go search-idle. Add new track for tsdb based on k8s integration rally-tracks#373
Trigger a refresh when a shard becomes search active instead of waiting for it. #95544
Avoid refreshing search-idle shards that don't yield results after query rewrite #95541
Improve the performance of the percentile aggregation by switching to the merging-based t-digest implementation. The current AVL-based implementation performs slowly in production with metric data sets of any reasonable size. This work consists of forking the t-digest library (Fork tdigest library #95903) and then changing the implementation to merging t-digest (Feature/replace avl digest with merging digest #35182).
Improve cardinality aggregation performance on low cardinality fields (Add support for dynamic pruning to cardinality aggregations on low-cardinality keyword fields. #92060).
Better detect when execution hint map or global_ordinals should be used.
[Search] Async search backing off strategy makes dashboards slow when searches take less than 1s to be served kibana#157837
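For the execution hint task in the list above, the hint can already be set by hand on a terms aggregation. The following is a minimal sketch that only builds the request body; the helper name `terms_agg_body`, the aggregation name `by_field`, and the field `kubernetes.pod.name` are hypothetical.

```python
def terms_agg_body(field: str, execution_hint: str) -> dict:
    """Build a search body with a terms aggregation and an explicit execution hint."""
    if execution_hint not in ("map", "global_ordinals"):
        raise ValueError(f"unsupported execution hint: {execution_hint}")
    return {
        "size": 0,  # only the aggregation result is of interest
        "aggs": {
            "by_field": {  # hypothetical aggregation name
                "terms": {"field": field, "execution_hint": execution_hint}
            }
        },
    }


# Hypothetical field name, for illustration only.
body = terms_agg_body("kubernetes.pod.name", "map")
print(body)
```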


elasticsearchmachine · 2023-05-03T10:29:02Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

StephanErb · 2023-05-08T19:38:52Z

Our current observations indicate that the poor performance is caused by the default refresh behaviour. Shards by default go search-idle after 30 seconds of search inactivity.

Have you considered moving away from the default (unset) refresh_interval for TSDB indices? As per my understanding, the default setting mostly optimizes for bulk indexing. However, for a vast number of TSDB use cases, I would expect:

individual series on an index receiving new values in roughly equidistant points in time
highly frequent querying by alert rules (xx/s)
rather predictable ingest load

This means we can probably have a heuristic which refreshes an index every 0.5 * min(metricset.period). For a typical monitoring interval in a distributed system of 10s - 60s, we'd now be looking at a rather reasonable refresh interval of 5s - 30s without any significant loss in functionality.
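The heuristic above could be sketched roughly as follows. The function names are hypothetical, the periods are assumed to be given in seconds, and the result is rendered in milliseconds so that the `index.refresh_interval` value stays a whole number.

```python
def heuristic_refresh_interval(periods_s, factor=0.5):
    """Return factor * min(period), in seconds, for the given metricset periods."""
    if not periods_s:
        raise ValueError("at least one metricset period is required")
    return factor * min(periods_s)


def refresh_interval_setting(periods_s):
    """Render the heuristic as an index settings body (value in whole milliseconds)."""
    millis = int(heuristic_refresh_interval(periods_s) * 1000)
    return {"index": {"refresh_interval": f"{millis}ms"}}


print(heuristic_refresh_interval([10, 60]))  # 5.0 seconds for a 10s minimum period
print(refresh_interval_setting([60]))        # 30000ms for a 60s period
```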

There might be some exceptions to the rule (like the Kubernetes Events metricset, where you'd want much higher refresh intervals for troubleshooting purposes), but for many other observability use cases such a heuristic might work well.

What do you think?

… rewrite without a SearchExecutionContext. With this change, both query builders can rewrite without using a search context, because QueryRewriteContext often has all the mapping and other index metadata available. With this change, the `TermQueryBuilder` can resolve to a `MatchAllQueryBuilder` without needing a `SearchExecutionContext`, which during the can_match phase means that no searcher needs to be acquired, therefore avoiding making a shard search-active and potentially refreshing it. The `AbstractQueryBuilder#doRewrite(...)` method is altered to attempt a coordination rewrite by default, then fall back to a search rewrite, and finally fall back to an index-metadata-aware rewrite. This was forgotten as part of elastic#96161 and is needed to complete elastic#95776.

…t a SearchExecutionContext. (#96905) With this change, both query builders can rewrite without using a search context, because QueryRewriteContext often has all the mapping and other index metadata available. With this change, the TermQueryBuilder can resolve to a MatchNoneQueryBuilder without needing a SearchExecutionContext, which during the can_match phase means that no searcher needs to be acquired, therefore avoiding making a shard search-active and potentially refreshing it. The AbstractQueryBuilder#doRewrite(...) method is altered to attempt a coordination rewrite by default, then fall back to a search rewrite, and finally fall back to an index-metadata-aware rewrite. This is in line with what was discussed here: #96161 (comment) This change was forgotten as part of #96161 and is needed to complete #95776.
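The fallback order described in these commit messages can be pictured with a small toy model. This is a Python sketch and not the actual Java implementation: `ToyContext`, `rewrite_term`, the stage labels, and the field name are all hypothetical, and the constant-field lookup merely stands in for "index metadata available without a searcher".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyContext:
    """Hypothetical stand-in for a rewrite context."""

    can_coordinator_rewrite: bool = False
    has_searcher: bool = False
    constant_fields: Optional[dict] = None  # field -> constant value known from index metadata


def rewrite_term(field, value, ctx):
    """Return (stage, query): try coordinator, then search, then index metadata."""
    query = {"term": {field: value}}
    if ctx.can_coordinator_rewrite:
        return "coordinator", query
    if ctx.has_searcher:
        return "search", query
    if ctx.constant_fields is not None:
        if field in ctx.constant_fields:
            # The field has one known value, so the term either matches everything or nothing.
            if ctx.constant_fields[field] == value:
                return "index_metadata", {"match_all": {}}
            return "index_metadata", {"match_none": {}}
        return "index_metadata", query
    return "none", query


# Hypothetical field and values; no searcher is needed to decide this shard cannot match.
ctx = ToyContext(constant_fields={"data_stream.dataset": "k8s.pod"})
print(rewrite_term("data_stream.dataset", "k8s.node", ctx))
```

In this toy, a term that cannot match is resolved from metadata alone, which corresponds to the case where the can_match phase can skip a shard without making it search-active.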

martijnvg · 2024-02-12T11:03:23Z

Better detect when execution hint map or global_ordinals should be used.

I experimented with this via #101619, but it didn't yield the performance gains I was hoping for. In some cases the performance was worse.

martijnvg added Meta :StorageEngine/TSDB You know, for Metrics labels May 3, 2023

elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 3, 2023

wchaparro assigned martijnvg May 4, 2023

martijnvg mentioned this issue Jun 18, 2023

Update MatchPhrase- and TermQueryBuilder to be able to rewrite without a SearchExecutionContext. #96905

Merged

StephanErb mentioned this issue Aug 16, 2023

Is the default refresh_interval a sensible default for Observability data? #78776

Open

martijnvg closed this as completed Feb 12, 2024
