[Ingest Manager] Reinstalling packages when Kibana crashes #74792

Closed
neptunian opened this issue Aug 11, 2020 · 5 comments
Assignees
Labels
Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@neptunian
Contributor

neptunian commented Aug 11, 2020

As part of #64213, we should handle cases where the package is stuck in an install state, likely because Kibana crashed. My approach is to add the following attributes to each epm-packages saved object:

[Screenshot: proposed attributes on the epm-packages saved object]

install_status could be installed, installing, or later uninstalling (and uninstalled, if we decide to keep the SOs around in the future to track when a package was uninstalled).
install_version is the version currently being installed (this could be an upgrade or downgrade to another version of the package).
install_started_at can be compared to the current time to see how long the package has been installing, if the status is installing.
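
A rough sketch of what these attributes could look like as a type. Only the three install_* fields come from the proposal; the `name` and `version` fields, the ISO 8601 timestamp format, and the `markInstalling` helper are assumptions made for illustration, not the actual Ingest Manager code.

```typescript
// Sketch only: the three install_* fields come from the proposal above;
// `name`, `version`, and the helper below are illustrative assumptions.
type InstallStatus = "installed" | "installing" | "uninstalling" | "uninstalled";

interface EpmPackageAttributes {
  name: string; // assumed identifier for the package
  version: string; // assumed: version whose install last completed
  install_status: InstallStatus;
  install_version: string; // version currently being installed
  install_started_at: string; // assumed to be an ISO 8601 timestamp
}

// Returns a copy of the attributes marked as installing the target version.
// The input object is left untouched.
function markInstalling(
  attrs: EpmPackageAttributes,
  targetVersion: string,
  now: Date
): EpmPackageAttributes {
  return {
    ...attrs,
    install_status: "installing",
    install_version: targetVersion,
    install_started_at: now.toISOString(),
  };
}
```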

This would cover the case where something went wrong, like Kibana crashing, so that we can try reinstalling the packages, perhaps when the elapsed time is greater than 30 seconds. Since the epm-packages saved object is created at the beginning of the install and updated as Kibana and Elasticsearch assets are installed, this is a way to make sure the installation/update was completed.
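
The time check could be sketched roughly as follows. The function name, the `InstallProgress` shape, and the ISO 8601 timestamp format are assumptions; the 30-second threshold is just the value floated above, not a settled number.

```typescript
// Assumption: install_started_at is stored as an ISO 8601 string.
interface InstallProgress {
  install_status: string;
  install_started_at: string;
}

// 30 seconds is the threshold floated in the proposal, not a settled value.
const DEFAULT_INSTALL_TIMEOUT_MS = 30 * 1000;

// Returns true when a package has been in the "installing" state for longer
// than the timeout, which suggests the install was interrupted.
function isInstallStuck(
  pkg: InstallProgress,
  nowMs: number,
  timeoutMs: number = DEFAULT_INSTALL_TIMEOUT_MS
): boolean {
  if (pkg.install_status !== "installing") return false;
  const startedMs = Date.parse(pkg.install_started_at);
  // An unparseable timestamp is treated as stuck so it is not silently ignored.
  if (Number.isNaN(startedMs)) return true;
  return nowMs - startedMs > timeoutMs;
}
```

Passing the current time in as a parameter, rather than calling `Date.now()` inside, keeps the check easy to test.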

We could automatically try to reinstall when the user visits Ingest Manager (filter for installing packages and check the time) or, if we want to avoid automatically trying to reinstall these packages (which I believe @ph preferred), let the user know in some way through the UI that the packages weren't installed correctly and that they should try to reinstall. If the latter, perhaps @hbharding has some ideas on how we want to handle this scenario.

@neptunian neptunian added Team:Fleet Team label for Observability Data Collection Fleet team Ingest Management:beta2 labels Aug 11, 2020
@elasticmachine
Contributor

Pinging @elastic/ingest-management (Team:Ingest Management)

@ruflin
Member

ruflin commented Aug 12, 2020

A few questions:

  • Does this only cover if a package is installed the first time or also updating a package?
  • With the above, we know which package is currently installing (and perhaps failed in the middle). Do we also still have info around on what the currently installed package is, which might still have some assets around?

For the "auto reinstall", it probably depends on the failure message. If Kibana crashed or restarted in the middle, we should try to install it again. If for some reason we got permission errors, a retry will not work. One of the cases where I would retry is if we get a 429 (too many requests). I think we need to define the different cases for when we retry and when we don't.
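
One way to write down those cases, as a sketch only: the `InstallFailure` type and function name are hypothetical, mapping "permission errors" to HTTP 401/403 is an assumption, and defaulting every other status to "no retry" is a conservative guess rather than anything agreed here.

```typescript
// Hypothetical classification of why an install did not complete.
type InstallFailure =
  | { kind: "kibana_restart" } // Kibana crashed or restarted mid-install
  | { kind: "http_error"; statusCode: number };

// Decides whether an automatic retry is worth attempting.
function shouldRetryInstall(failure: InstallFailure): boolean {
  if (failure.kind === "kibana_restart") return true;
  switch (failure.statusCode) {
    case 429: // too many requests: likely transient
      return true;
    case 401: // assumed to represent permission errors;
    case 403: // retrying will not help
      return false;
    default:
      // Undefined cases: do not retry until they have been discussed.
      return false;
  }
}
```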

@neptunian
Contributor Author

neptunian commented Aug 12, 2020

@ruflin

Does this only cover if a package is installed the first time or also updating a package?

Both

With the above, we know which package is currently installing (and perhaps failed in the middle). Do we also still have info around on what the currently installed package is, which might still have some assets around?

Yes. The version attribute doesn't get updated until the install is complete if it's an update. If the package is being installed for the first time, it's added initially.
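
To illustrate, an interrupted update could be told apart from a first-time install by comparing the two version fields. This is a sketch under the assumptions above; the `VersionedPackage` shape and the function name are made up for the example.

```typescript
// Assumed subset of the saved object attributes.
interface VersionedPackage {
  version: string; // only updated once an update completes
  install_status: string;
  install_version: string; // version currently being installed
}

// Returns the installed and target versions when an update is in progress,
// or null when nothing is installing or the versions match (a first-time
// install, or a reinstall of the same version).
function interruptedUpdate(
  pkg: VersionedPackage
): { installedVersion: string; targetVersion: string } | null {
  if (pkg.install_status !== "installing") return null;
  if (pkg.version === pkg.install_version) return null;
  return { installedVersion: pkg.version, targetVersion: pkg.install_version };
}
```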

I think this case is different: a 429 or other errors can be handled because we know what the error is, whereas the above changes are specifically for the case where we don't know what happened because Kibana crashed for some reason. The 429 or 408 responses can be handled with the other errors in the API handler and either retried automatically or left to the user to decide what to do. Currently, for an error we aren't handling specifically, it will remove the assets it has installed thus far and respond with a 500 error, so they can try again if they want. I'm sure there is more improvement that can be done as part of #66688.

@neptunian neptunian self-assigned this Aug 12, 2020
@neptunian
Contributor Author

Henry and I had a discussion and decided that for this edge case we could leave out notifying the user to take some action. So I'm going forward with automatically installing during setup when a package is found not to have completed installation.
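
Roughly, the setup step could look like the sketch below. Everything here is hypothetical: the `TrackedPackage` shape, the `reinstall` callback (a stand-in for whatever actually installs a package), and the 30-second default carried over from the original proposal.

```typescript
// Assumed subset of the saved object attributes.
interface TrackedPackage {
  name: string;
  install_status: string;
  install_version: string;
  install_started_at: string; // assumed ISO 8601 timestamp
}

// Hypothetical callback standing in for the real install routine.
type Reinstall = (name: string, version: string) => Promise<void>;

// Selects packages that have been "installing" for longer than the timeout.
function findIncompleteInstalls(
  packages: TrackedPackage[],
  nowMs: number,
  timeoutMs: number
): TrackedPackage[] {
  return packages.filter(
    (pkg) =>
      pkg.install_status === "installing" &&
      nowMs - Date.parse(pkg.install_started_at) > timeoutMs
  );
}

// Reinstalls each incomplete package in turn and returns the names retried.
async function reinstallIncomplete(
  packages: TrackedPackage[],
  reinstall: Reinstall,
  nowMs: number,
  timeoutMs: number = 30 * 1000
): Promise<string[]> {
  const retried: string[] = [];
  for (const pkg of findIncompleteInstalls(packages, nowMs, timeoutMs)) {
    await reinstall(pkg.name, pkg.install_version);
    retried.push(pkg.name);
  }
  return retried;
}
```

Reinstalling one package at a time keeps the sketch simple; whether installs can safely run in parallel is not something this thread settles.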

@ph
Contributor

ph commented Aug 18, 2020

I had a chat with @neptunian concerning this and I agree it's an edge case, +1 for your proposal.
