[Fleet] Fix Fleet Setup to handle concurrent calls across nodes in HA Kibana deployment #118423

Closed · 9 tasks done · Tracked by #120616 · Fixed by #122349
kpollich opened this issue Nov 11, 2021 · 8 comments
Labels: bug (Fixes for quality problems that affect the customer experience), Team:Fleet (Team label for Observability Data Collection Fleet team), v7.16.1, v8.0.0

Comments

@kpollich (Member) commented Nov 11, 2021

Currently, Fleet tracks the status of the setup process in memory.

Because of this, it's possible for multiple concurrent calls to setup to occur in environments with multiple Kibana instances. This introduces several issues, such as:

  1. Potential for duplicated data from preconfiguration like our default policies
  2. Unintended conflicts/collisions when handling setup/preconfiguration that cause Kibana to error
  3. General performance issues that come with incurring all of our setup calls on every node

In #111858 (comment), we discussed how we could solve this problem by making all of the setup operations idempotent. Here's the summary of what we concluded in that thread:

  • Anything that creates a new Saved Object needs to use a deterministic ID with the overwrite option enabled (see the sketch after this list)
  • Elasticsearch operations need to be idempotent as well
    • Creating index templates, component templates, ILM policies, transforms, and ingest pipelines are all idempotent
    • [Fleet] Rolling over data streams is not idempotent #120946
      • This may not be a problem since we only do a rollover if the mappings cannot be updated on the data stream's write index. If one node has already done the rollover, the next node that tries to update the mappings should succeed and not need to do a rollover.
      • Punting on a root-cause fix for now and discussing further in the linked issue
    • [Fleet] Prevent installation of packages containing ML models during preconfiguration/setup #120903
      • Should we add an assertion that blocks packages that contain ML models from being installed during setup to avoid this problem in the future should a managed package add an ML model?
      • Clarified this with @alvarezmelissa87 and we shouldn't have any concerns around idempotency for ML assets, so marked this as done
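
To make the saved object and Elasticsearch points above concrete, here is a minimal sketch of what idempotent setup writes could look like. The type names, IDs, attributes, and import path are illustrative, not Fleet's actual objects:

```ts
import type { SavedObjectsClientContract, ElasticsearchClient } from 'src/core/server';

// A deterministic ID plus `overwrite: true` makes this call idempotent: if two
// Kibana nodes race, the second write replaces the first with identical content
// instead of creating a duplicate object under a random ID.
async function ensureDefaultPolicy(soClient: SavedObjectsClientContract) {
  return soClient.create(
    'example-agent-policy', // hypothetical saved object type
    { name: 'Default policy', is_default: true },
    { id: 'default-agent-policy', overwrite: true }
  );
}

// Elasticsearch PUT-style APIs (index templates, component templates, ILM
// policies, ingest pipelines) are naturally idempotent: putting the same
// resource twice leaves the cluster in the same state.
async function ensureIndexTemplate(esClient: ElasticsearchClient) {
  await esClient.indices.putIndexTemplate({
    name: 'example-template', // illustrative name
    index_patterns: ['logs-example-*'],
    template: { settings: { number_of_shards: 1 } },
  });
}
```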

Original description (see the discussion below for why this may not work in all cases):

Our architecture today looks like this:

[Diagram: current architecture, with Fleet setup status tracked in memory on each Kibana node]

We should consider our options for moving the status of Fleet's setup process into some kind of persistent state in order to avoid these issues. It's likely that these issues have been exacerbated by #111858, in which we've moved Fleet's setup process to Kibana boot.

If we store the status of Fleet setup in Elasticsearch, we'd have an architecture more like this:

[Diagram: proposed architecture, with Fleet setup status stored in Elasticsearch]

It would be helpful to stand up an environment with multiple Kibana instances and report findings on boot in this issue to further crystallize these issues.

@kpollich kpollich added technical debt Improvement of the software architecture and operational architecture Team:Fleet Team label for Observability Data Collection Fleet team labels Nov 11, 2021
@elasticmachine (Contributor)

Pinging @elastic/fleet (Team:Fleet)

@joshdover (Contributor) commented Nov 18, 2021

Dec 1: moved to description

@joshdover (Contributor)

> We should consider our options for moving the status of Fleet's setup process into some kind of persistent state in order to avoid these issues. It's likely that these issues have been exacerbated by #111858, in which we've moved Fleet's setup process to Kibana boot.

This could be another option instead of the idempotent route. We could have an SO that is only used for this purpose. My concern is that this type of approach isn't guaranteed to always work, though if we need a quick solution it's better than what we have today.

Where this approach doesn't work:

  • Node A writes to shared doc that it's starting setup
  • Node B reads shared doc and waits for Node A to write to doc that it's completed setup
  • Node A has network issues for 60s
  • After 30s of waiting, Node B assumes that Node A died or failed in some way and claims the shared doc to start setup
  • Node A's network issues resolve and it assumes it still has the "lock" and continues setup

This results in both Node A and B running setup operations simultaneously.
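
For illustration, a rough sketch of that naive-lock flow, assuming a hypothetical lock saved object; every name below is made up for the example:

```ts
import type { SavedObjectsClientContract } from 'src/core/server';

const LOCK_TYPE = 'fleet-setup-lock'; // hypothetical saved object type
const LOCK_ID = 'fleet-setup-lock';
const LOCK_TIMEOUT_MS = 30_000;

async function runSetupWithNaiveLock(soClient: SavedObjectsClientContract) {
  const lock = await soClient
    .get<{ started_at: string }>(LOCK_TYPE, LOCK_ID)
    .catch(() => undefined);

  if (lock && Date.now() - Date.parse(lock.attributes.started_at) < LOCK_TIMEOUT_MS) {
    // Another node appears to be mid-setup; wait for it to finish.
    return waitForSetupToComplete(soClient);
  }

  // Claim (or steal) the lock. This is the weak point: nothing stops the
  // original holder -- which may only be suffering a transient network issue --
  // from resuming later, still believing it holds the lock, and running setup
  // concurrently with us. Without something like a fencing token re-checked on
  // every subsequent write, a timeout-based lock cannot guarantee mutual exclusion.
  await soClient.create(
    LOCK_TYPE,
    { started_at: new Date().toISOString() },
    { id: LOCK_ID, overwrite: true }
  );

  await runSetupOperations();
}

async function waitForSetupToComplete(soClient: SavedObjectsClientContract) {
  // Poll the lock document until a completed_at field appears (omitted).
}

async function runSetupOperations() {
  // The actual setup side effects (omitted).
}
```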

@joshdover joshdover changed the title [Fleet] Refactor Fleet Setup Status to Handle Concurrent Calls Across Nodes [Fleet] Refactor Fleet Setup to Handle Concurrent Calls Across Nodes Nov 30, 2021
@joshdover joshdover changed the title [Fleet] Refactor Fleet Setup to Handle Concurrent Calls Across Nodes [Fleet] Fix Fleet Setup to handle concurrent calls across nodes in HA Kibana deployment Dec 2, 2021
@joshdover joshdover added v8.0.0 bug Fixes for quality problems that affect the customer experience and removed v8.1.0 technical debt Improvement of the software architecture and operational architecture labels Dec 2, 2021
@joshdover (Contributor)

@kpollich FYI I think we can / should backport all of these fixes to 7.16.1 if they're clean backports. That should also help customers in the cases where we're already seeing this issue come up.

@kpollich (Member, Author)

@joshdover Everything we initially tracked here has been addressed. Do we want to start thinking about how to test this in an HA environment?

@joshdover (Contributor)

> Do we want to start thinking about how to test this in an HA environment?

Good thinking. I'm thinking we should try running several instances of setupFleet (async function createSetupSideEffects) in parallel in a Jest integration test, and then verifying that exactly the same objects are created, no less, no more.
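
A rough sketch of what that test could look like; the helper names, import path, and the exact signature of createSetupSideEffects are assumptions here, not the real API:

```ts
import { createSetupSideEffects } from './setup'; // path illustrative

describe('Fleet setup concurrency', () => {
  it('creates exactly the same objects when setup runs on several nodes at once', async () => {
    // startTestServers is a hypothetical helper that boots ES + Kibana core
    // and returns scoped clients for the test.
    const { soClient, esClient } = await startTestServers();

    // Simulate several Kibana nodes booting and running Fleet setup at once.
    await Promise.all(
      Array.from({ length: 5 }, () => createSetupSideEffects(soClient, esClient))
    );

    // Expect no duplicates: exactly one default agent policy, no less, no more.
    const { saved_objects: policies } = await soClient.find<{ is_default?: boolean }>({
      type: 'ingest-agent-policies', // illustrative type name
    });
    expect(policies.filter((p) => p.attributes.is_default)).toHaveLength(1);
  });
});
```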

However, I know we've had some challenges with the Jest integration tests in #118797. I can't remember if the issues we were encountering there would affect this test as well or not.

Any other ideas?

@kpollich (Member, Author)

I think that approach is definitely sound, but as I recall we weren't able to actually get Kibana to boot up when running Jest integration tests against the x-pack directory. I probably won't have time this week to actually start addressing this, but I do think documenting that approach is a good start.

@joshdover joshdover self-assigned this Dec 14, 2021
@joshdover (Contributor)

@kpollich Let me give it a try later this week and see if I can't get something working.
