
[Search] Async search backoff strategy makes dashboards slow when searches take less than 1s to be served #157837

Closed
Tracked by #95776
dej611 opened this issue May 16, 2023 · 5 comments · Fixed by #164957
Assignees: dej611
Labels: enhancement, Feature:Search Sessions, Feature:Search, impact:medium, Team:DataDiscovery, Team:Visualizations

Comments

@dej611 (Contributor) commented May 16, 2023

Describe the bug:

While debugging a slow dashboard I realised that most of the waiting time was due to the async search backoff polling strategy.
Each ES search query completed within 1s, many around 500ms, but often the first request was served back within 150ms: that triggered the 1000ms delay before the next polling request, so the final time reported by Kibana can be more than 2x the ES response time.

In this example the final timing is the sum of the 2 requests plus the 1s delay, taking more than 1.5s while ES reports less than 300ms to serve the full response:

(Screenshot of network timings, 2023-05-16 10:01:25)
```
1684224058942 - 1684224058484 = 458ms (ES time 105ms) // initial async search
1684224060231 - 1684224059958 = 273ms (ES time 247ms) // second polling request

458 + 273 + 1000 = 1731ms // <= the final reported time
```
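To make the mechanics concrete, here's a minimal sketch of a backoff polling loop, with hypothetical names and delays (this is not Kibana's actual implementation), showing why a search that finishes just after the first response still pays the full first delay:

```ts
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical backoff schedule: wait 1s before the first poll, then longer.
const POLL_DELAYS_MS = [1000, 2000, 4000];

// `submit` starts the async search; `poll` checks whether it has finished.
// Returns the total wall-clock time as seen by the caller.
async function pollUntilComplete(
  submit: () => Promise<{ id: string; isRunning: boolean }>,
  poll: (id: string) => Promise<{ isRunning: boolean }>
): Promise<number> {
  const start = Date.now();
  let { id, isRunning } = await submit(); // e.g. returns in ~150ms, search still running
  let attempt = 0;
  while (isRunning) {
    // Even if ES finishes a few ms after the submit response returns,
    // we still sit out the whole delay before asking again.
    await sleep(POLL_DELAYS_MS[Math.min(attempt++, POLL_DELAYS_MS.length - 1)]);
    ({ isRunning } = await poll(id));
  }
  // e.g. 458ms (submit) + 1000ms (delay) + 273ms (poll) ≈ 1731ms reported,
  // even though ES itself was busy for well under a second.
  return Date.now() - start;
}
```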

Maybe we could improve this aspect of the polling strategy?
One idea would be to add an extra step to the schedule: an initial polling delay of 500ms for the first second, as sketched below.
On the other hand, discussing with @martijnvg offline, ES could take a little longer (from 100ms to ~200ms) to serve the first request in order to catch all "faster" searches and reduce the polling for them. Perhaps the two could be adopted together in some form?
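In terms of the hypothetical schedule above, the first idea would amount to inserting a shorter initial step:

```ts
// Hypothetical: poll once after 500ms before falling back to the 1s+ backoff,
// so searches completing within the first second are picked up sooner.
const PROPOSED_POLL_DELAYS_MS = [500, 1000, 2000, 4000];
```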

cc @ppisljar @lukasolson

@dej611 added the enhancement, Team:Visualizations, Team:DataDiscovery, Feature:Search Sessions, and :Search/Search labels on May 16, 2023
@elasticmachine (Contributor) commented:
Pinging @elastic/kibana-visualizations @elastic/kibana-visualizations-external (Team:Visualizations)

@elasticmachine (Contributor) commented:
Pinging @elastic/kibana-data-discovery (Team:DataDiscovery)

@lukasolson added the impact:low label on May 25, 2023
@stratoula added the impact:medium label and removed the impact:low label on Jun 19, 2023
@henrikno (Contributor) commented:

Related: #143277

@dej611 self-assigned this on Jul 20, 2023
@dej611 (Contributor, Author) commented Jul 28, 2023

I had a go at investigating this, but couldn't find the real offender behind the performance problem.
First of all, a diagram of the architecture in place:

```mermaid
flowchart TB
  SearchInterceptor[Search Interceptor w/ Polling] --"Network Request"--> bFetch.onBatchItem
  subgraph CLIENT
    subgraph DATA PLUGIN
      SearchInterceptor[Search Interceptor w/ Polling]
    end
    Expression --> SearchInterceptor
    subgraph DASHBOARD
      subgraph EMBEDDABLE
        subgraph LENS
            Chart --> Expression
        end
      end
    end
  end
  ESESearch --"Network Request"--> Elasticsearch
  subgraph KIBANA [SERVER KIBANA]
    subgraph DATA PLUGIN
      bFetch.onBatchItem --> ESESearch[ESE Search Strategy w/ Polling]
      ESESearch <--> UiSettings
    end
  end
  subgraph ES [SERVER ES]
    Elasticsearch
  end
```

While at the beginning I thought uiSettingsClient was a bottleneck, I found that it is not, as long as the get calls don't happen concurrently (i.e. not await Promise.all([ .... ])): there's a caching wrapper around the client which leverages the fact that the get calls happen in series.
The Observable structure makes it a bit hard to debug/profile the exact timings and order of each function, as there's a lot of noise in the profile stacktrace.
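To illustrate the serial-vs-concurrent point, here's a sketch of such a wrapper (hypothetical code, not the actual uiSettings client): it caches resolved values only, so it deduplicates serial calls, while concurrent calls all miss the cache because nothing is stored until the first call resolves:

```ts
// Hypothetical value-caching wrapper around an async settings getter.
function cacheGets(get: (key: string) => Promise<unknown>) {
  const cache = new Map<string, unknown>();
  return async (key: string): Promise<unknown> => {
    if (cache.has(key)) return cache.get(key);
    const value = await get(key);
    cache.set(key, value);
    return value;
  };
}

// In series: the second call hits the cache (one fetch total).
//   await cachedGet('search:timeout'); await cachedGet('search:timeout');
// Concurrent: both calls start before the cache is populated (two fetches).
//   await Promise.all([cachedGet('search:timeout'), cachedGet('search:timeout')]);
```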

It is still not clear to me where most of the time is lost. Measuring timings on both the Kibana client and server side, it looks like at least 1s of overhead is spent somewhere other than the actual request handling, and I assume the network delay to be minimal since both were running locally (the ES cluster was remote, to simulate a slow request).
So perhaps more investigation is due here to better understand where the time is actually spent.
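One way to narrow this down (a sketch; `runSearch` is a stand-in for the real search call, not an existing helper) is to bracket the call with timestamps and compare against the time ES itself reports:

```ts
// Hypothetical instrumentation: compare wall-clock time around a search call
// with the `took` value reported by ES, to isolate non-ES overhead.
async function timedSearch<T extends { took: number }>(runSearch: () => Promise<T>): Promise<T> {
  const start = performance.now();
  const response = await runSearch();
  const total = performance.now() - start;
  console.log(
    `total: ${total.toFixed(0)}ms, ES took: ${response.took}ms, ` +
      `overhead: ${(total - response.took).toFixed(0)}ms`
  );
  return response;
}
```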

@lukasolson (Member) commented:
Would this be solved by increasing the wait_for_completion_timeout to something like 1s?
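For context, `wait_for_completion_timeout` is the async search parameter controlling how long ES blocks before returning a (possibly partial) first response; searches that finish within that window never need a follow-up poll. A sketch with the Elasticsearch JS client (index and query are placeholders):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Block up to 1s before returning: sub-second searches come back complete
// in the first response, so the client-side polling backoff never kicks in.
const response = await client.asyncSearch.submit({
  index: 'my-index', // placeholder
  wait_for_completion_timeout: '1s',
  query: { match_all: {} },
});

console.log(response.is_running, response.response?.took);
```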
