[Fleet] Fix Fleet Setup to handle concurrent calls across nodes in HA Kibana deployment #118423

Closed · 9 tasks done · Tracked by #120616 · Fixed by #122349
kpollich opened this issue Nov 11, 2021 · 8 comments
Labels: bug (Fixes for quality problems that affect the customer experience), Team:Fleet (Team label for Observability Data Collection Fleet team), v7.16.1, v8.0.0

Comments

@kpollich (Member) commented Nov 11, 2021

Currently, Fleet tracks the status of the setup process in memory.

Because of this, it's possible for multiple concurrent calls to setup to occur in environments with multiple Kibana instances. This introduces several issues, such as:

  1. Potential for duplicated data from preconfiguration like our default policies
  2. Unintended conflicts/collisions when handling setup/preconfiguration that cause Kibana to error
  3. General performance issues that come with incurring all of our setup calls on every node

In #111858 (comment), we discussed how we could solve this problem by making all of the setup operations idempotent. Here's the summary of what we concluded in that thread:

  • Anything that creates a new Saved Object needs to use a deterministic ID with the overwrite option enabled (see the sketch after this list)
  • Elasticsearch operations need to be idempotent as well
    • Creating index templates, component templates, ILM policies, transforms, and ingest pipelines are all idempotent
    • [Fleet] Rolling over data streams is not idempotent #120946
      • This may not be a problem since we only do a rollover if the mappings cannot be updated on the data stream's write index. If one node has already done the rollover, the next node that tries to update the mappings should succeed and not need to do a rollover.
      • Punting on a root-cause fix for now and discussing further in the linked issue
    • [Fleet] Prevent installation of packages containing ML models during preconfiguration/setup #120903
      • Should we add an assertion that blocks packages that contain ML models from being installed during setup to avoid this problem in the future should a managed package add an ML model?
      • Clarified this with @alvarezmelissa87 and we shouldn't have any concerns around idempotency for ML assets, so marked this as done
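
To make the saved object and Elasticsearch points above concrete, here is a minimal sketch of what idempotent setup writes could look like. The type names, IDs, attributes, and import path are illustrative, not Fleet's actual objects:

```ts
import type { SavedObjectsClientContract, ElasticsearchClient } from 'src/core/server';

// A deterministic ID plus `overwrite: true` makes this call idempotent: if two
// Kibana nodes race, the second write replaces the first with identical content
// instead of creating a duplicate object under a random ID.
async function ensureDefaultPolicy(soClient: SavedObjectsClientContract) {
  return soClient.create(
    'example-agent-policy', // hypothetical saved object type
    { name: 'Default policy', is_default: true },
    { id: 'default-agent-policy', overwrite: true }
  );
}

// Elasticsearch PUT-style APIs (index templates, component templates, ILM
// policies, ingest pipelines) are naturally idempotent: putting the same
// resource twice leaves the cluster in the same state.
async function ensureIndexTemplate(esClient: ElasticsearchClient) {
  await esClient.indices.putIndexTemplate({
    name: 'example-template', // illustrative name
    index_patterns: ['logs-example-*'],
    template: { settings: { number_of_shards: 1 } },
  });
}
```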

Original description (see the discussion below for why this may not work in all cases):

Our architecture today looks like this:

[Diagram: current architecture, with Fleet setup status tracked in memory on each Kibana node]

We should consider our options for moving the status of Fleet's setup process into some kind of persistent state in order to avoid these issues. It's likely that these issues have been exacerbated by #111858, in which we've moved Fleet's setup process to Kibana boot.

If we store the status of Fleet setup in Elasticsearch, we'd have an architecture more like this:

[Diagram: proposed architecture, with Fleet setup status stored in Elasticsearch]

It would be helpful to stand up an environment with multiple Kibana instances and report findings on boot in this issue to further crystallize these issues.

@kpollich kpollich added technical debt Improvement of the software architecture and operational architecture Team:Fleet Team label for Observability Data Collection Fleet team labels Nov 11, 2021
@elasticmachine (Contributor)

Pinging @elastic/fleet (Team:Fleet)

@joshdover (Contributor) commented Nov 18, 2021

Dec 1: moved to description

@joshdover (Contributor)

> We should consider our options for moving the status of Fleet's setup process into some kind of persistent state in order to avoid these issues. It's likely that these issues have been exacerbated by #111858, in which we've moved Fleet's setup process to Kibana boot.

This could be another option instead of the idempotent route. We could have an SO that is only used for this purpose. My concern is that this type of approach isn't guaranteed to always work, though if we need a quick solution it's better than what we have today.

Where this approach doesn't work:

  • Node A writes to shared doc that it's starting setup
  • Node B reads shared doc and waits for Node A to write to doc that it's completed setup
  • Node A has network issues for 60s
  • After 30s of waiting, Node B assumes that Node A died or failed in some way and claims the shared doc to start setup
  • Node A's network issues resolve and it assumes it still has the "lock" and continues setup

This results in both Node A and B running setup operations simultaneously.
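
For illustration, a rough sketch of that naive-lock flow, assuming a hypothetical lock saved object; every name below is made up for the example:

```ts
import type { SavedObjectsClientContract } from 'src/core/server';

const LOCK_TYPE = 'fleet-setup-lock'; // hypothetical saved object type
const LOCK_ID = 'fleet-setup-lock';
const LOCK_TIMEOUT_MS = 30_000;

async function runSetupWithNaiveLock(soClient: SavedObjectsClientContract) {
  const lock = await soClient
    .get<{ started_at: string }>(LOCK_TYPE, LOCK_ID)
    .catch(() => undefined);

  if (lock && Date.now() - Date.parse(lock.attributes.started_at) < LOCK_TIMEOUT_MS) {
    // Another node appears to be mid-setup; wait for it to finish.
    return waitForSetupToComplete(soClient);
  }

  // Claim (or steal) the lock. This is the weak point: nothing stops the
  // original holder -- which may only be suffering a transient network issue --
  // from resuming later, still believing it holds the lock, and running setup
  // concurrently with us. Without something like a fencing token re-checked on
  // every subsequent write, a timeout-based lock cannot guarantee mutual exclusion.
  await soClient.create(
    LOCK_TYPE,
    { started_at: new Date().toISOString() },
    { id: LOCK_ID, overwrite: true }
  );

  await runSetupOperations();
}

async function waitForSetupToComplete(soClient: SavedObjectsClientContract) {
  // Poll the lock document until a completed_at field appears (omitted).
}

async function runSetupOperations() {
  // The actual setup side effects (omitted).
}
```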

@joshdover joshdover changed the title [Fleet] Refactor Fleet Setup Status to Handle Concurrent Calls Across Nodes [Fleet] Refactor Fleet Setup to Handle Concurrent Calls Across Nodes Nov 30, 2021
@joshdover joshdover changed the title [Fleet] Refactor Fleet Setup to Handle Concurrent Calls Across Nodes [Fleet] Fix Fleet Setup to handle concurrent calls across nodes in HA Kibana deployment Dec 2, 2021
@joshdover joshdover added v8.0.0 bug Fixes for quality problems that affect the customer experience and removed v8.1.0 technical debt Improvement of the software architecture and operational architecture labels Dec 2, 2021
@joshdover (Contributor)

@kpollich FYI I think we can / should backport all of these fixes to 7.16.1 if they're clean backports. That should also help customers in the cases where we're already seeing this issue come up.

@kpollich (Member, Author)

@joshdover Everything we initially tracked here has been addressed. Do we want to start thinking about how to test this in an HA environment?

@joshdover (Contributor)

> Do we want to start thinking about how to test this in an HA environment?

Good thinking. I'm thinking we should try running several instances of setupFleet (async function createSetupSideEffects) in parallel in a Jest integration test, and then verifying that exactly the same objects are created, no less, no more.
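
A rough sketch of what that test could look like; the helper names, import path, and the exact signature of createSetupSideEffects are assumptions here, not the real API:

```ts
import { createSetupSideEffects } from './setup'; // path illustrative

describe('Fleet setup concurrency', () => {
  it('creates exactly the same objects when setup runs on several nodes at once', async () => {
    // startTestServers is a hypothetical helper that boots ES + Kibana core
    // and returns scoped clients for the test.
    const { soClient, esClient } = await startTestServers();

    // Simulate several Kibana nodes booting and running Fleet setup at once.
    await Promise.all(
      Array.from({ length: 5 }, () => createSetupSideEffects(soClient, esClient))
    );

    // Expect no duplicates: exactly one default agent policy, no less, no more.
    const { saved_objects: policies } = await soClient.find<{ is_default?: boolean }>({
      type: 'ingest-agent-policies', // illustrative type name
    });
    expect(policies.filter((p) => p.attributes.is_default)).toHaveLength(1);
  });
});
```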

However, I know we've had some challenges with the Jest integration tests in #118797. I can't remember if the issues we were encountering there would affect this test as well or not.

Any other ideas?

@kpollich (Member, Author)

I think that approach is definitely sound, but as I recall we weren't able to actually get Kibana to boot up when running Jest integration tests against the x-pack directory. I probably won't have time this week to actually start addressing this, but I do think documenting that approach is a good start.

@joshdover joshdover self-assigned this Dec 14, 2021
@joshdover (Contributor)

@kpollich Let me give it a try later this week and see if I can't get something working.
