
[Fleet] Improve performance of Fleet setup API / package installation #110500

Closed
joshdover opened this issue Aug 30, 2021 · 6 comments · Fixed by #131627
Labels
performance · Team:Fleet · technical debt

Comments

@joshdover
Contributor

We need to improve and continue to monitor the performance of the /api/fleet/setup endpoint. This API can take upwards of 40s, even in a local environment. More details TBD.

Related to #109072

@joshdover added the performance and Team:Fleet labels on Aug 30, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@jen-huang added the technical debt label on Aug 30, 2021
@joshdover
Contributor Author

I spent some time investigating this yesterday and, unfortunately, I've come to a preliminary conclusion that there aren't any major opportunities for optimization on the Kibana side.

The longest-running operations during setup (and the general package install process) are creating the ingest assets in Elasticsearch: ingest pipelines, index templates, component templates, and data streams.

We're already parallelizing these Elasticsearch API calls about as much as we realistically can; however, it seems that these requests are getting queued on the Elasticsearch side. I discovered this by looking at the APM spans for the Elasticsearch calls and noticing that many of these operations take 5-7s to return a response. However, if I time a single one of these API calls in isolation, it returns very quickly (in the 200-500 ms range). This leads me to believe that Elasticsearch is queueing these requests, I suspect because each of these assets is stored in the cluster state, which needs to be consistent across the cluster: a write lock must be acquired and replication across nodes must complete before the asset is confirmed as committed.

I identified additional evidence of this behavior in Elasticsearch by changing these API calls to be executed serially, rather than in parallel from the Kibana side. In this scenario, each API call completed quite quickly (again in the 200-500 ms range), while the overall setup process took the same amount of time (39s locally).
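
For reference, the parallel-vs-serial comparison can be reproduced outside of Fleet with a small script against the Elasticsearch client. This is only an illustrative sketch, not the actual installer code: the pipeline names, count, and processor body are made up, and it assumes a 7.x-style @elastic/elasticsearch client (request bodies nested under `body`).

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Hypothetical pipeline names purely for illustration.
const pipelineIds = Array.from({ length: 20 }, (_, i) => `perf-test-pipeline-${i}`);

async function putPipeline(id: string): Promise<number> {
  const start = Date.now();
  await client.ingest.putPipeline({
    id,
    body: { processors: [{ set: { field: 'installed_by', value: 'perf-test' } }] },
  });
  return Date.now() - start; // per-request latency in ms
}

async function main() {
  // Parallel: individual requests report multi-second latencies,
  // consistent with the APM spans described above.
  let start = Date.now();
  const parallel = await Promise.all(pipelineIds.map(putPipeline));
  console.log('parallel total (ms):', Date.now() - start, 'per request:', parallel);

  // Serial: each request returns quickly, but the overall total is about the same.
  start = Date.now();
  const serial: number[] = [];
  for (const id of pipelineIds) {
    serial.push(await putPipeline(id));
  }
  console.log('serial total (ms):', Date.now() - start, 'per request:', serial);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```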

I also confirmed that these requests are not being queued on the Kibana side by increasing the agent.maxSockets option on the Elasticsearch client from the default of 256 to 1024. There are about 300 total ES requests during this process, so I don't believe we're hitting this connection pool maximum. Further investigation could be done to prove this.
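
For context, the maxSockets experiment looks roughly like the following. Kibana constructs its Elasticsearch client internally, so this standalone snippet only illustrates the underlying client option, not the Fleet code path:

```ts
import { Client } from '@elastic/elasticsearch';

// The default HTTP agent allows 256 concurrent sockets; raising it removes the
// client-side connection pool as a potential source of request queueing.
const client = new Client({
  node: 'http://localhost:9200',
  agent: { maxSockets: 1024, keepAlive: true },
});
```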

I believe the next steps here would be connecting with the Elasticsearch team to:

  1. Confirm the root cause of the behavior I am seeing; and if I'm correct
  2. Determine if there is some way we could optimize this in Elasticsearch. Ideas include:
    • Providing a bulk API to minimize the number of separate cluster state writes we need
    • Providing a way to write component templates and index templates in a single API call. Today we have to serialize these calls to make sure the component templates are created before the index templates that reference them (see the sketch after this list).
    • Optimizing cluster state writes in general
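
To illustrate the ordering constraint mentioned above, here is a minimal sketch (with hypothetical template names) of why these calls must currently be serialized: the index template composes the component templates, so they have to exist before the index template PUT can succeed. Again, this assumes a 7.x-style client.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function installTemplates() {
  // 1. Component templates first — each PUT is a separate cluster state update.
  await client.cluster.putComponentTemplate({
    name: 'metrics-example@settings',
    body: { template: { settings: { number_of_shards: 1 } } },
  });
  await client.cluster.putComponentTemplate({
    name: 'metrics-example@mappings',
    body: {
      template: { mappings: { properties: { '@timestamp': { type: 'date' } } } },
    },
  });

  // 2. Only then the index template that composes them — referencing a
  //    component template that doesn't exist yet fails the request.
  await client.indices.putIndexTemplate({
    name: 'metrics-example',
    body: {
      index_patterns: ['metrics-example-*'],
      composed_of: ['metrics-example@settings', 'metrics-example@mappings'],
    },
  });
}

installTemplates().catch(console.error);
```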

@joshdover
Contributor Author

Here's an example of what I'm seeing. In this first APM screenshot, you'll see that creating each ingest pipeline is taking ~9s to receive a response from ES when we're running many things in parallel.

[APM screenshot: parallel installation — each ingest pipeline PUT takes ~9s to receive a response]

In this second screenshot, where I changed assets to install serially, we see each ingest pipeline take ~500ms to receive a response. In both cases, the overall setup process takes 39s. This behavior is very consistent across runs as well.

[APM screenshot: serial installation — each ingest pipeline PUT takes ~500ms to receive a response]

@joshdover
Contributor Author

@elastic/es-distributed @colings86 I believe you all would be best placed to confirm whether my suspicion is correct that these requests are being blocked on cluster state updates. Is this analysis accurate, and if so, do you have any ideas for how we could improve this?

I'm also happy to open a new issue in the ES repo to discuss this further.

@joshdover changed the title from "Improve performance of Fleet setup API" to "Improve performance of Fleet setup API / package installation" on Sep 2, 2021
@colings86
Contributor

@henningandersen Could you or someone in the distributed team help @joshdover investigate the behaviour he is seeing here?

Something feels off here because unless the cluster is under extremely heavy load, adding 5 ingest pipelines should not be taking ~9 seconds even with requests sent in parallel.

@joshdover can you confirm the ES version you are using and the specs of the instance? Also, is this test done on an otherwise idle cluster?

@joshdover
Contributor Author

> can you confirm the ES version you are using and the specs of the instance? Also, is this test done on an otherwise idle cluster?

This was tested against an 8.0.0-SNAPSHOT build from last week on a local single-node cluster on my laptop with no other load at all. We're seeing similar performance on production instances of 7.14.x.

> Something feels off here because unless the cluster is under extremely heavy load, adding 5 ingest pipelines should not be taking ~9 seconds even with requests sent in parallel.

It's worth noting that other operations are happening at the same time outside this screenshot that also likely require cluster state updates (PUTs on component templates and index templates). These exhibit the same behavior: each individual request is quite fast when run on its own but very slow when we try to parallelize them. The total runtime is about the same whether we run the requests in parallel or serially.
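
One way to check the cluster-state-queueing hypothesis directly would be to poll Elasticsearch's pending cluster tasks API while setup runs. This is just a diagnostic sketch (not something Fleet does today), again assuming a 7.x-style client:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Poll the master's pending cluster state update queue while Fleet setup runs.
// Pipeline/template PUTs waiting on cluster state updates should show up here.
setInterval(async () => {
  try {
    const { body } = await client.cluster.pendingTasks();
    for (const task of body.tasks) {
      console.log(`${task.source} — queued for ${task.time_in_queue}`);
    }
  } catch (err) {
    console.error(err);
  }
}, 500);
```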
