Skip to content

Commit

Permalink
NO-ISSUE: Soft install timeout enhancement (#6694)
Browse files Browse the repository at this point in the history
* MGMT-8115: Soft install timeout

This patch adds an enhacement proposal about replacing hard timeouts
with soft timeouts, so that users can manually fix issues and
installation can resume.

Signed-off-by: Juan Hernandez <juan.hernandez@redhat.com>

* Fix typos

Signed-off-by: Juan Hernandez <juan.hernandez@redhat.com>

* MGMT-8115: Add more details about the current implementation, and suggested implementation

---------

Signed-off-by: Juan Hernandez <juan.hernandez@redhat.com>
Co-authored-by: Juan Hernandez <juan.hernandez@redhat.com>
  • Loading branch information
ori-amizur and jhernand authored Aug 22, 2024
1 parent c016547 commit 17927d8
Show file tree
Hide file tree
Showing 2 changed files with 163 additions and 0 deletions.
163 changes: 163 additions & 0 deletions docs/enhancements/soft-install-timeout.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,163 @@
---
title: soft-install-timeout
authors:
- "@jhernand"
- "@oamizur"
creation-date: 2023-10-04
last-updated: 2023-11-23
---

# Soft install timeout

## Summary

Currently cluster installation by assisted installer may fail due to timeout expiration.
Assisted installer maintains many timeouts with different values. A timeout limits the time
period to run an installation stage or to perform a specific installation operation.
We want to change these timeouts to be soft. Soft timeout expiration will cause warning that the
installation is taking longer than expected and the installation will continue.

## Motivation

This is important because users may be able to fix the issues that delay the
installation instead of having to start it over. This is not true for ZTP installations. Users using ZTP
will not try to fix these installations in case of a failure.

### Goals

### Non-Goals

It is not a goal to change the global installation timeout configured via the
`INSTALLATION_TIMEOUT` environment variable. That is set by default to `24h`
and at that point the assisted service stops monitoring the cluster.

## Proposal

### User Stories

#### Allow installation in slow network environments

A user having installations in slow network environment that may take long enough that they can triggered the timeout.
I want my installations to continue and succeed even if they are very slow.

#### Allow manually fixing a SaaS installation

As a user I tried to create a SNO cluster in the SaaS environment. An issue
prevented one of the cluster operators from reporting success. After one hour
the timeout expired and the installation was marked as failed. When I found the
failed install, I was able to easily resolve the issue and get the cluster
operator to report success. But the service no longer cared; as far as it was
concerned, the installation failed and was permanently marked as such. It also
would not give me the kubeadmin password, which is an important feature of the
install experience and tricky to obtain otherwise. I would like the service
inform me about the exceeded timeout, but it should give me the kubeadmin
password (if available at that point) and after my fixes it should continue
and eventually mark the installation as successful.

### Implementation Details/Notes/Constraints

### Risks and Mitigations

## Design Details

### Existing timeouts and handling

There are several timeout types in assisted installer. Each timeout type is handled differently:


| Timeout | Entity | Type | Managed by | Environment variable | Default | Action | Description |
|----------------------------|:-------:|---------------------------------------:|--------------------:|---------------------------------------------:|--------------------------:|--------------------------------------------------------------:|------------:|
| Prepare for installation | cluster | Status | Assisted service | PREPARE_FOR_INSTALLATION_TIMEOUT | 10m | Move to ready |
| Installation | cluster | Status | Assisted service | INSTALLATION_TIMEOUT | 24h | Move to error |
| Finalizing | cluster | Status | Assisted service | FINALIZING_TIMEOUT | 5h | Move to error |
| Installation in progress | host | Stage (general) | Assisted service | Hard coded | 60m | Move to error |
| Starting installation | host | stage | Assisted service | HOST_STAGE_STARTING_INSTALLATION_TIMEOUT | 30m | Move to error |
| Installing | host | stage | Assisted service | HOST_STAGE_INSTALLING_TIMEOUT | 60m | Move to error |
| Waiting for control plane | host | stage | Assisted service | HOST_STAGE_WAITING_FOR_CONTROL_PLANE_TIMEOUT | 60m | Move to error |
| Waiting for controller | host| stage | Assisted service | HOST_STAGE_WAITING_FOR_CONTROLLER_TIMEOUT | 60m | Move to error |
| Waiting for bootkube | host| stage | Assisted service | HOST_STAGE_WAITING_FOR_BOOTKUBE_TIMEOUT | 60m | Move to error |
| Joined | host| stage | Assisted service | HOST_STAGE_JOINED_TIMEOUT | 60m | Move to error |
| Writing image to disk | host| stage | Assisted service | HOST_STAGE_WRITING_IMAGE_TO_DISK_TIMEOUT | 30m | Move to error |
| Configuring | host| stage | Assisted service | HOST_STAGE_CONFIGURING_TIMEOUT | 60m | Move to error |
| Waiting for ignition | host| stage | Assisted service | HOST_STAGE_WAITING_FOR_IGNITION_TIMEOUT | 24h | Move to error |
| Rebooting | host| stage | Assisted service | HOST_STAGE_REBOOTING_TIMEOUT | 40m | Move pending user action |
| Wait for nodes | installation (cluster)| controller | Assisted controller | hard coded | 10h | Abort waiting for nodes |
| Wait for finalizing | installation (cluster)| controller | Assisted controller | hard coded | 10h | Don't perform post install + don't send complete installation |
| Wait for cluster operators | installation (cluster)| controller | Assisted controller | hard coded | 10h | Don't perform rest of controller operations | Only CVO and console |
| Add router CA | installation (cluster)| controller | Assisted controller | hard coded | 70m | Don't perform rest of controller operations |
| Wait for OLM operators | installation (cluster)| controller | Assisted controller || calculated from operators | | Don't perform rest of controller operations |
| Apply manifests | installation (cluster)| controller | Assisted controller | hard coded | 10m | Don't perform rest of controller operations |
| Wait for OLM operators CSV | installation (cluster)| controller | Assisted controller | | calculated from operators | Don't perform rest of controller operations |
| Send complete installation | installation (cluster)| controller | Assisted controller | hard coded | 30m | Don't perform rest of controller operations |


There are 2 flows for completing installation. One is managed by controller, and one by assisted service.

In the flow managed by the assisted service, the following steps must be completed:

- kubeconfig must be uploaded
- cluster operators are successful
- Monitored operators are either successful or failed

In the flow managed by the controller all steps must be completed (in the table above). The last step is a
a notification to complete installation.

### Suggested changes
- Only assisted service should notify installation completion
- All steps by controller (as specified in the above table) will be known as stages in the assisted service. This will enable the service to manage
them in a similar way that host stages are managed.
- Controller will not stop activity due to timeout.
- Assisted service should be able to terminate the controller.
- All existing timeouts except installation timeout and rebooting timeout should be treated as soft timeouts (i.e cluster or host events)
- Since OLM operators are not mandatory for successful installation, in case of timeout on these operators, the installation will proceed and marked as successul.
- The suggested functionality should be optional. For SaaS it will be enabled globally by default and will be disabled by default
for ZTP. In addition, to use this feature it has to be enabled at organizational level.
- The 24h global installation timeout will be kept. It will be considered as hard timeout.

### Open Questions

- Should we collect must gather logs when a soft timeout expires or after a cluster with expired soft timeout
is cancelled , even if the cluster is not marked as failed?

### UI Impact

The UI will need to explicitly show to the user that the cluster installation
is taking longer than expected, and give suggestions on how to proceed. For
example, we could present a warning message with this text:

> Cluster installation is taking too long
>
> Most installations complete in approximately 45 minutes, but it took 23 hours
> and 52 minutes already. Check the logs to find out why or reset the cluster
> to start over.
The progress bars and other UI elements used to indicate progress should also
explicitly indicate that the installation is taking longer than expected, for
example using warning icons or specific colors. For example:

![UI example](./soft-install-timeout/ui-example.png)

### Test Plan

We will need the following test cases, for the preparation, installation and
finalizing phases:

- Prepare a cluster that exceeds the timeout and verify that fixing the issue
manually allows the service to continue with the installation.

- Verify that the UI explicitly shows the information about the expired
timeout, and that it recovers when the issue is eventually fixed and the
service continues with the installation.

These tests may require introducing a mechanism to artificially delay the
installation in the agent or the installer.

## Drawbacks

Most of our cluster installation failures are due to these timeouts. If we just
disable them then we will have a large amount of installations that have failed
but will not be accounted as such.

## Alternatives

None.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit 17927d8

Please sign in to comment.