
[Fleet] Improve performance of auto package policy upgrades #121639

Closed
Tracked by #120616
joshdover opened this issue Dec 20, 2021 · 9 comments · Fixed by #125909
Labels: Feature:EPM, performance, Team:Fleet, v8.2.0

Comments

@joshdover
Contributor

We suspect that some customers are running into performance problems on the /api/fleet/setup endpoint caused by the "keep policies up-to-date" logic in Fleet setup. There is room for optimization here that would minimize the overhead of running this logic during Fleet setup.

Current implementation

  • Fetch the ids for all package policies
    ```ts
    // Handle automatic package policy upgrades for managed packages and packages with
    // the `keep_policies_up_to_date` setting enabled
    const allPackagePolicyIds = await packagePolicyService.listIds(soClient, {
      page: 1,
      perPage: SO_SEARCH_LIMIT,
    });
    const packagePolicyUpgradeResults = await upgradeManagedPackagePolicies(
      soClient,
      esClient,
      allPackagePolicyIds.items
    );
    ```
  • For each package policy id:
    • Fetch the full package policy: `const packagePolicy = await packagePolicyService.get(soClient, packagePolicyId);`
    • Fetch the package info
    • Fetch the installation
    • If the package has `keepPoliciesUpToDate == false` or is already up to date, skip this package policy
    • Run the upgrade dry run for the package policy (which fetches the package policy a 2nd time)
    • If the dry run succeeds, run the actual upgrade (which fetches the package policy a 3rd time)
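Taken together, the loop above is an N+1 fetch pattern. A minimal sketch of its cost (the `fetch` parameter is a hypothetical stand-in for the saved-object and registry calls, not Fleet's actual services):

```typescript
// Sketch of the current per-policy flow, showing why it is slow: each policy
// id triggers several round trips, and the dry run + upgrade each re-fetch
// the same policy document. Service calls are simplified stand-ins.
type Fetch = (what: string) => Promise<void>;

export async function currentFlow(policyIds: string[], fetch: Fetch): Promise<void> {
  for (const id of policyIds) {
    await fetch(`packagePolicy:${id}`); // full package policy
    await fetch(`packageInfo:${id}`); // package info
    await fetch(`installation:${id}`); // installation record
    // keepPoliciesUpToDate / up-to-date checks happen here; assume no skip
    await fetch(`packagePolicy:${id}`); // dry run fetches the policy a 2nd time
    await fetch(`packagePolicy:${id}`); // actual upgrade fetches it a 3rd time
  }
}
```

With n package policies this issues 5n round trips, three of them for the same package policy document.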

Proposed implementation

Most of the decisions we need to make could be made with far fewer fetches by looking at the packages first and then using that information to filter for the package policies that need to be upgraded.

  • Fetch the installed packages, filtering for packages with keepPoliciesUpToDate = true and name in [managed package names]
    • Maybe we already have these fetched in the previous steps of Fleet setup and could just filter in memory?
  • For each package that matches the query above:
    • Fetch all package policies where packageName = installedPackage.name and version != installedPackage.version
    • For each package policy that matches query above:
      • Attempt the dry run using the installed package and package policy that have already been fetched
      • If the dry run succeeds, run the actual upgrade using the installed package and package policy that have already been fetched
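The proposed flow could be sketched as follows. The service interfaces here are simplified stand-ins for Fleet's `packagePolicyService` and installation queries, and the function names are hypothetical:

```typescript
// Sketch of the proposed batched upgrade flow: filter installed packages
// first, then fetch only the outdated policies per package, and reuse the
// objects already in hand for the dry run and the upgrade.
interface InstalledPackage {
  name: string;
  version: string;
  keepPoliciesUpToDate: boolean;
}

interface PackagePolicy {
  id: string;
  packageName: string;
  packageVersion: string;
}

export async function upgradeOutdatedPolicies(
  getInstalledPackages: () => Promise<InstalledPackage[]>,
  getPoliciesForPackage: (name: string) => Promise<PackagePolicy[]>,
  dryRun: (pkg: InstalledPackage, policy: PackagePolicy) => Promise<boolean>,
  upgrade: (pkg: InstalledPackage, policy: PackagePolicy) => Promise<void>
): Promise<string[]> {
  const upgraded: string[] = [];
  // One pass over installed packages, keeping only those opted in to upgrades
  const packages = (await getInstalledPackages()).filter((p) => p.keepPoliciesUpToDate);
  for (const pkg of packages) {
    // One filtered fetch per package instead of one fetch per policy id
    const outdated = (await getPoliciesForPackage(pkg.name)).filter(
      (policy) => policy.packageVersion !== pkg.version
    );
    for (const policy of outdated) {
      if (await dryRun(pkg, policy)) {
        await upgrade(pkg, policy);
        upgraded.push(policy.id);
      }
    }
  }
  return upgraded;
}
```

The dry run and upgrade receive the already-fetched objects, so each outdated policy costs only the dry-run and upgrade work itself rather than three extra fetches.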
@joshdover joshdover added performance Feature:EPM Fleet team's Elastic Package Manager (aka Integrations) project Team:Fleet Team label for Observability Data Collection Fleet team v7.17.0 labels Dec 20, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Feature:EPM)

@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@joshdover
Contributor Author

Before we do the implementation work above, we should validate with a large data set + APM instrumentation that this is indeed the area that is causing the slowdown.

@juliaElastic
Contributor

juliaElastic commented Feb 16, 2022

@joshdover I started working on this, and had a few questions:

  • Fetch the installed packages, filtering for packages with keepPoliciesUpToDate = true and name in [managed package names]
  • Isn't it enough to filter on keepPoliciesUpToDate = true? Since managed packages come with keepPoliciesUpToDate: false by default and have to be enabled by users.

  • upgradeManagedPackagePolicies logic is now part of ensurePreconfiguredPackagesAndPolicies, which is called from two places: setup and PUT /setup/preconfiguration API.
    Do you think it is right to upgrade policies when updating preconfig? I don't find it very obvious that they are related.

Before we do the implementation work above, we should validate with a large data set + APM instrumentation that this is indeed the area that is causing the slowdown.

Is there a guide on how to use APM instrumentation to check slowness in Kibana? I haven't used APM before.

@joshdover
Contributor Author

  • Isn't it enough to filter on keepPoliciesUpToDate = true? Since managed packages come with keepPoliciesUpToDate: false by default and have to be enabled by users.

Makes sense to me 👍

  • upgradeManagedPackagePolicies logic is now part of ensurePreconfiguredPackagesAndPolicies, which is called from two places: setup and PUT /setup/preconfiguration API.
    Do you think it is right to upgrade policies when updating preconfig? I don't find it very obvious that they are related.

Agreed, having this API carry this side effect is probably a confusing workflow. cc @nchaulet and @kpollich for any input here.

Is there a guide on how to use APM instrumentation to check slowness in Kibana? I haven't used APM before.

See https://www.elastic.co/guide/en/kibana/current/kibana-debugging.html#_instrumenting_with_elastic_apm


I also want to flag that this PR may have a slight impact on this work: #125788

@juliaElastic
Contributor

Thanks, yes I noticed that PR, will rebase once merged :)

@kpollich
Member

Do you think it is right to upgrade policies when updating preconfig? I don't find it very obvious that they are related.

Hmm, thinking about this a bit I don't think this is ideal either. It makes more sense to me to have upgrades happen only during setup rather than during preconfiguration updates. I'm not 100% sure where we call preconfiguration update rather than setup currently, though, so I might be missing a use case.

@juliaElastic
Contributor

juliaElastic commented Feb 17, 2022

I raised a draft PR and will continue with tests.

I tried to use APM instrumentation (following the guide), but it is currently not working for 8+ versions because of the default policies change; a fix is in progress: elastic/apm-integration-testing#1435
I will get back to this when it's fixed.

Apart from that, I did some manual time measurements of upgradeManagedPackagePolicies and noticed that the getPackageInfo call is quite slow (about 100ms per call on average); previously it was called once for each package policy. This added up with many package policies, e.g. 5 policies - 500ms, 10 policies - 1000ms.
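For reference, manual measurements like these can be taken with a small async timing wrapper (illustrative only; not necessarily the code used for the numbers above):

```typescript
// Minimal async timing helper: runs the wrapped function, logs its wall-clock
// duration, and passes the result (or error) through unchanged.
export async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    console.log(`${label}: ${(performance.now() - start).toFixed(1)}ms`);
  }
}
```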

Measuring with this improvement merged by @joshdover yesterday, which removed the getPackageInfo calls, I only see the upgradeManagedPackagePolicies logic taking 10-20ms.
The improvements from this issue don't seem to make much difference (still 10-20ms). These times were measured with 3-11 package policies across 4 managed packages (apm, system, fleet_server, elastic_agent), without an actual upgrade needed.

The actual policy upgrade takes more time, as expected; so far the times were around 4,000-5,000ms for the system package.

I think the refactor still makes sense, to make the logic simpler and more readable.

@joshdover
Contributor Author

I tried to use APM instrumentation (following the guide), but it is currently not working for 8+ versions because of the default policies change; a fix is in progress: elastic/apm-integration-testing#1435 I will get back to this when it's fixed.

You can also use your own Cloud deployment if you'd like; this is how I do it personally. It's nice to have a test cluster that you use over the course of several months, with some real data to play around with, where you can make whatever Fleet changes you'd like without affecting other employees.

Measuring with this improvement merged by @joshdover yesterday, which removed the getPackageInfo calls, I only see the upgradeManagedPackagePolicies logic taking 10-20ms. The improvements from this issue don't seem to make much difference (still 10-20ms). These times were measured with 3-11 package policies across 4 managed packages (apm, system, fleet_server, elastic_agent), without an actual upgrade needed.

Well that's a nice side-effect of that bug fix. Thanks for taking the time to measure it. I also suspected this was the main issue as it seemed to be the thing we were querying repeatedly unnecessarily.

I think the refactor still makes sense, to make the logic simpler and more readable.

+1, thanks for continuing forward with this.

@joshdover joshdover changed the title [Fleet] Improve performance of auto package upgrades [Fleet] Improve performance of auto package policy upgrades Mar 29, 2022