
[Fleet] Improve performance of Fleet setup API / package installation #110500

Closed
joshdover opened this issue Aug 30, 2021 · 6 comments · Fixed by #131627
Labels
performance · Team:Fleet · technical debt

Comments

@joshdover
Contributor

We need to improve and continue to monitor the performance of the /api/fleet/setup endpoint. This API can take upwards of 40s, even in a local environment. More details TBD.

Related to #109072

@joshdover added the performance and Team:Fleet labels on Aug 30, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@jen-huang added the technical debt label on Aug 30, 2021
@joshdover
Contributor Author

I spent some time investigating this yesterday and, unfortunately, I've come to a preliminary conclusion that there aren't any major opportunities for optimization on the Kibana side.

The longest-running operations during setup (and the general package install process) are creating the ingest assets in Elasticsearch: ingest pipelines, index templates, component templates, and data streams.

We're already parallelizing these Elasticsearch API calls about as much as we realistically can; however, it seems that these requests are getting queued on the Elasticsearch side. I discovered this by looking at the APM spans for the Elasticsearch calls and noticing that many of these operations take 5-7s to return a response. However, if I time a single one of these API calls in isolation, it returns very quickly (in the 200-500 ms range). This leads me to believe that Elasticsearch is queueing these requests, I suspect because each of these assets is stored in the cluster state, which needs to be consistent across the cluster: a write lock must be acquired and replication across nodes must complete before the asset is confirmed as committed.

I identified additional evidence of this behavior in Elasticsearch by changing these API calls to be executed serially, rather than in parallel from the Kibana side. In this scenario, each API call completed quite quickly (again in the 200-500 ms range), while the overall setup process took the same amount of time (39s locally).
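
For reference, the parallel-vs-serial comparison can be reproduced outside of Fleet with a small script against the Elasticsearch client. This is only an illustrative sketch, not the actual installer code: the pipeline names, count, and processor body are made up, and it assumes a 7.x-style @elastic/elasticsearch client (request bodies nested under `body`).

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Hypothetical pipeline names purely for illustration.
const pipelineIds = Array.from({ length: 20 }, (_, i) => `perf-test-pipeline-${i}`);

async function putPipeline(id: string): Promise<number> {
  const start = Date.now();
  await client.ingest.putPipeline({
    id,
    body: { processors: [{ set: { field: 'installed_by', value: 'perf-test' } }] },
  });
  return Date.now() - start; // per-request latency in ms
}

async function main() {
  // Parallel: individual requests report multi-second latencies,
  // consistent with the APM spans described above.
  let start = Date.now();
  const parallel = await Promise.all(pipelineIds.map(putPipeline));
  console.log('parallel total (ms):', Date.now() - start, 'per request:', parallel);

  // Serial: each request returns quickly, but the overall total is about the same.
  start = Date.now();
  const serial: number[] = [];
  for (const id of pipelineIds) {
    serial.push(await putPipeline(id));
  }
  console.log('serial total (ms):', Date.now() - start, 'per request:', serial);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```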

I also confirmed that these requests are not being queued on the Kibana side by increasing the agent.maxSockets option on the Elasticsearch client from the default of 256 to 1024. There are about 300 total ES requests during this process, so I don't believe we're hitting this connection pool maximum. Further investigation could be done to prove this.
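
For context, the maxSockets experiment looks roughly like the following. Kibana constructs its Elasticsearch client internally, so this standalone snippet only illustrates the underlying client option, not the Fleet code path:

```ts
import { Client } from '@elastic/elasticsearch';

// The default HTTP agent allows 256 concurrent sockets; raising it removes the
// client-side connection pool as a potential source of request queueing.
const client = new Client({
  node: 'http://localhost:9200',
  agent: { maxSockets: 1024, keepAlive: true },
});
```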

I believe the next steps here would be connecting with the Elasticsearch team to:

  1. Confirm the root cause of the behavior I am seeing; and if I'm correct
  2. Determine if there is some way we could optimize this in Elasticsearch. Ideas include:
    • Providing a bulk API to minimize the number of separate cluster state writes we need
    • Providing a way to write component templates and index templates in a single API call. Today we have to serialize these calls to make sure the component templates are created before the index templates that reference them (see the sketch after this list).
    • Optimizing cluster state writes in general
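
To illustrate the ordering constraint mentioned above, here is a minimal sketch (with hypothetical template names) of why these calls must currently be serialized: the index template composes the component templates, so they have to exist before the index template PUT can succeed. Again, this assumes a 7.x-style client.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function installTemplates() {
  // 1. Component templates first — each PUT is a separate cluster state update.
  await client.cluster.putComponentTemplate({
    name: 'metrics-example@settings',
    body: { template: { settings: { number_of_shards: 1 } } },
  });
  await client.cluster.putComponentTemplate({
    name: 'metrics-example@mappings',
    body: {
      template: { mappings: { properties: { '@timestamp': { type: 'date' } } } },
    },
  });

  // 2. Only then the index template that composes them — referencing a
  //    component template that doesn't exist yet fails the request.
  await client.indices.putIndexTemplate({
    name: 'metrics-example',
    body: {
      index_patterns: ['metrics-example-*'],
      composed_of: ['metrics-example@settings', 'metrics-example@mappings'],
    },
  });
}

installTemplates().catch(console.error);
```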

@joshdover
Contributor Author

Here's an example of what I'm seeing. In this first APM screenshot, you'll see that creating each ingest pipeline is taking ~9s to receive a response from ES when we're running many things in parallel.

[APM screenshot: parallel installation — each ingest pipeline PUT takes ~9s to receive a response]

In this second screenshot, where I changed assets to install serially, we see each ingest pipeline take ~500ms to receive a response. In both cases, the overall setup process takes 39s. This behavior is very consistent across runs as well.

[APM screenshot: serial installation — each ingest pipeline PUT takes ~500ms to receive a response]

@joshdover
Contributor Author

@elastic/es-distributed @colings86 I believe you all would be best placed to confirm whether my suspicion is correct that these requests are being blocked on cluster state updates. Is this analysis accurate, and if so, do you have any ideas for how we could improve this?

I'm also happy to open a new issue in the ES repo to discuss this further.

@joshdover changed the title from "Improve performance of Fleet setup API" to "Improve performance of Fleet setup API / package installation" on Sep 2, 2021
@colings86
Contributor

@henningandersen Could you or someone in the distributed team help @joshdover investigate the behaviour he is seeing here?

Something feels off here because unless the cluster is under extremely heavy load, adding 5 ingest pipelines should not be taking ~9 seconds even with requests sent in parallel.

@joshdover can you confirm the ES version you are using and the specs of the instance? Also, is this test done on an otherwise idle cluster?

@joshdover
Contributor Author

> can you confirm the ES version you are using and the specs of the instance? Also, is this test done on an otherwise idle cluster?

This was tested against an 8.0.0-SNAPSHOT build from last week on a local single-node cluster on my laptop with no other load at all. We're seeing similar performance on production instances of 7.14.x.

> Something feels off here because unless the cluster is under extremely heavy load, adding 5 ingest pipelines should not be taking ~9 seconds even with requests sent in parallel.

It's worth noting that other operations are happening at the same time outside this screenshot that also likely require cluster state updates (PUTs on component templates and index templates). These exhibit the same behavior: each individual request is quite fast when run on its own but very slow when we try to parallelize them. The total runtime is about the same whether we run the requests in parallel or serially.
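
One way to check the cluster-state-queueing hypothesis directly would be to poll Elasticsearch's pending cluster tasks API while setup runs. This is just a diagnostic sketch (not something Fleet does today), again assuming a 7.x-style client:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Poll the master's pending cluster state update queue while Fleet setup runs.
// Pipeline/template PUTs waiting on cluster state updates should show up here.
setInterval(async () => {
  try {
    const { body } = await client.cluster.pendingTasks();
    for (const task of body.tasks) {
      console.log(`${task.source} — queued for ${task.time_in_queue}`);
    }
  } catch (err) {
    console.error(err);
  }
}, 500);
```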
