NO-ISSUE: Soft install timeout enhancement (#6694)
* MGMT-8115: Soft install timeout

  This patch adds an enhancement proposal about replacing hard timeouts with soft timeouts, so that users can manually fix issues and installation can resume.

  Signed-off-by: Juan Hernandez <juan.hernandez@redhat.com>

* Fix typos

  Signed-off-by: Juan Hernandez <juan.hernandez@redhat.com>

* MGMT-8115: Add more details about the current implementation, and suggested implementation

---------

Signed-off-by: Juan Hernandez <juan.hernandez@redhat.com>
Co-authored-by: Juan Hernandez <juan.hernandez@redhat.com>
1 parent c016547, commit 17927d8
Showing 2 changed files with 163 additions and 0 deletions.
@@ -0,0 +1,163 @@ | ||
--- | ||
title: soft-install-timeout | ||
authors: | ||
- "@jhernand" | ||
- "@oamizur" | ||
creation-date: 2023-10-04 | ||
last-updated: 2023-11-23 | ||
--- | ||
|
||
# Soft install timeout | ||
|
||
## Summary | ||
|
||
Currently, a cluster installation by the assisted installer may fail because a timeout expires.
The assisted installer maintains many timeouts with different values. Each timeout limits the time
allowed for an installation stage or for a specific installation operation.
We want to make these timeouts soft: when a soft timeout expires, the service warns that the
installation is taking longer than expected, and the installation continues.
|
||
## Motivation | ||
|
||
This is important because users may be able to fix the issues that delay the
installation instead of having to start it over. This does not apply to ZTP installations:
ZTP users will not try to fix an installation manually after a failure.
|
||
### Goals | ||
|
||
### Non-Goals | ||
|
||
It is not a goal to change the global installation timeout configured via the | ||
`INSTALLATION_TIMEOUT` environment variable. That is set by default to `24h` | ||
and at that point the assisted service stops monitoring the cluster. | ||
|
||
## Proposal | ||
|
||
### User Stories | ||
|
||
#### Allow installation in slow network environments | ||
|
||
As a user, I run installations in a slow network environment, where they may take long enough to trigger a timeout.
I want my installations to continue and succeed even if they are very slow.
|
||
#### Allow manually fixing a SaaS installation | ||
|
||
As a user I tried to create a SNO cluster in the SaaS environment. An issue | ||
prevented one of the cluster operators from reporting success. After one hour | ||
the timeout expired and the installation was marked as failed. When I found the | ||
failed install, I was able to easily resolve the issue and get the cluster | ||
operator to report success. But the service no longer cared; as far as it was | ||
concerned, the installation failed and was permanently marked as such. It also | ||
would not give me the kubeadmin password, which is an important feature of the | ||
install experience and tricky to obtain otherwise. I would like the service
to inform me about the exceeded timeout, but it should give me the kubeadmin
password (if available at that point) and after my fixes it should continue | ||
and eventually mark the installation as successful. | ||
|
||
### Implementation Details/Notes/Constraints | ||
|
||
### Risks and Mitigations | ||
|
||
## Design Details | ||
|
||
### Existing timeouts and handling | ||
|
||
There are several timeout types in assisted installer. Each timeout type is handled differently: | ||
|
||
|
||
| Timeout | Entity | Type | Managed by | Environment variable | Default | Action | Description |
|---|:---:|---:|---:|---:|---:|---:|---:|
| Prepare for installation | cluster | Status | Assisted service | PREPARE_FOR_INSTALLATION_TIMEOUT | 10m | Move to ready | |
| Installation | cluster | Status | Assisted service | INSTALLATION_TIMEOUT | 24h | Move to error | |
| Finalizing | cluster | Status | Assisted service | FINALIZING_TIMEOUT | 5h | Move to error | |
| Installation in progress | host | Stage (general) | Assisted service | Hard coded | 60m | Move to error | |
| Starting installation | host | Stage | Assisted service | HOST_STAGE_STARTING_INSTALLATION_TIMEOUT | 30m | Move to error | |
| Installing | host | Stage | Assisted service | HOST_STAGE_INSTALLING_TIMEOUT | 60m | Move to error | |
| Waiting for control plane | host | Stage | Assisted service | HOST_STAGE_WAITING_FOR_CONTROL_PLANE_TIMEOUT | 60m | Move to error | |
| Waiting for controller | host | Stage | Assisted service | HOST_STAGE_WAITING_FOR_CONTROLLER_TIMEOUT | 60m | Move to error | |
| Waiting for bootkube | host | Stage | Assisted service | HOST_STAGE_WAITING_FOR_BOOTKUBE_TIMEOUT | 60m | Move to error | |
| Joined | host | Stage | Assisted service | HOST_STAGE_JOINED_TIMEOUT | 60m | Move to error | |
| Writing image to disk | host | Stage | Assisted service | HOST_STAGE_WRITING_IMAGE_TO_DISK_TIMEOUT | 30m | Move to error | |
| Configuring | host | Stage | Assisted service | HOST_STAGE_CONFIGURING_TIMEOUT | 60m | Move to error | |
| Waiting for ignition | host | Stage | Assisted service | HOST_STAGE_WAITING_FOR_IGNITION_TIMEOUT | 24h | Move to error | |
| Rebooting | host | Stage | Assisted service | HOST_STAGE_REBOOTING_TIMEOUT | 40m | Move pending user action | |
| Wait for nodes | installation (cluster) | Controller | Assisted controller | Hard coded | 10h | Abort waiting for nodes | |
| Wait for finalizing | installation (cluster) | Controller | Assisted controller | Hard coded | 10h | Don't perform post install + don't send complete installation | |
| Wait for cluster operators | installation (cluster) | Controller | Assisted controller | Hard coded | 10h | Don't perform rest of controller operations | Only CVO and console |
| Add router CA | installation (cluster) | Controller | Assisted controller | Hard coded | 70m | Don't perform rest of controller operations | |
| Wait for OLM operators | installation (cluster) | Controller | Assisted controller | | Calculated from operators | Don't perform rest of controller operations | |
| Apply manifests | installation (cluster) | Controller | Assisted controller | Hard coded | 10m | Don't perform rest of controller operations | |
| Wait for OLM operators CSV | installation (cluster) | Controller | Assisted controller | | Calculated from operators | Don't perform rest of controller operations | |
| Send complete installation | installation (cluster) | Controller | Assisted controller | Hard coded | 30m | Don't perform rest of controller operations | |
|
||
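To illustrate how an environment variable from the table could override a default, here is a minimal Python sketch. It is not the assisted service's code: `DEFAULT_TIMEOUTS`, `parse_duration` and `resolve_timeout` are hypothetical names, only a few rows of the table are copied, and the parser assumes single-unit values such as `10m` or `24h`.

```python
import os
import re
from datetime import timedelta

# Defaults copied from a few rows of the table above (illustration only).
DEFAULT_TIMEOUTS = {
    "PREPARE_FOR_INSTALLATION_TIMEOUT": "10m",
    "INSTALLATION_TIMEOUT": "24h",
    "FINALIZING_TIMEOUT": "5h",
    "HOST_STAGE_REBOOTING_TIMEOUT": "40m",
}

# Assumption: values use a single unit suffix (s, m or h).
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text: str) -> timedelta:
    """Parse a simple duration such as '10m' or '24h'."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def resolve_timeout(name: str, environ=None) -> timedelta:
    """Return the timeout for `name`, preferring the environment over the default."""
    environ = os.environ if environ is None else environ
    return parse_duration(environ.get(name, DEFAULT_TIMEOUTS[name]))
```

For example, `resolve_timeout("FINALIZING_TIMEOUT", {"FINALIZING_TIMEOUT": "90m"})` yields a 90 minute timeout, while an empty environment falls back to the `5h` default.
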
|
||
There are two flows for completing an installation: one managed by the controller and one managed by the assisted service.

In the flow managed by the assisted service, the following conditions must be met:

- The kubeconfig has been uploaded.
- The cluster operators are successful.
- The monitored operators are either successful or failed.

In the flow managed by the controller, all the controller steps in the table above must be completed. The last step is
a notification to complete the installation.
|
||
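The conditions of the service-managed flow can be read as a single predicate. The Python sketch below is only an illustration: `ClusterState`, `OperatorStatus` and `can_complete_installation` are hypothetical names, and keeping cluster operators and monitored operators in two separate maps is an assumption made for clarity.

```python
from dataclasses import dataclass, field
from enum import Enum


class OperatorStatus(Enum):
    # Hypothetical status values, chosen for this sketch.
    PROGRESSING = "progressing"
    AVAILABLE = "available"
    FAILED = "failed"


@dataclass
class ClusterState:
    kubeconfig_uploaded: bool = False
    cluster_operators: dict = field(default_factory=dict)
    monitored_operators: dict = field(default_factory=dict)


def can_complete_installation(state: ClusterState) -> bool:
    """Check the three conditions of the service-managed completion flow."""
    if not state.kubeconfig_uploaded:
        return False
    # Every cluster operator must be successful.
    if any(s is not OperatorStatus.AVAILABLE for s in state.cluster_operators.values()):
        return False
    # Monitored operators only need a final result: successful or failed.
    final = (OperatorStatus.AVAILABLE, OperatorStatus.FAILED)
    return all(s in final for s in state.monitored_operators.values())
```

Note that a failed monitored operator does not block completion in this sketch, while one that is still progressing does.
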
### Suggested changes | ||
- Only the assisted service should notify installation completion.
- All controller steps (as listed in the table above) will be known as stages in the assisted service. This will enable the service to manage
them in a way similar to how host stages are managed.
- The controller will not stop activity due to a timeout.
- The assisted service should be able to terminate the controller.
- All existing timeouts except the installation timeout and the rebooting timeout should be treated as soft timeouts (i.e. cluster or host events).
- Since OLM operators are not mandatory for a successful installation, if a timeout expires for these operators the installation will proceed and be marked as successful.
- The suggested functionality should be optional. It will be enabled globally by default for SaaS and disabled by default
for ZTP. In addition, the feature has to be enabled at the organizational level before it can be used.
- The 24h global installation timeout will be kept. It will be considered a hard timeout.
|
||
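The soft versus hard distinction in the suggested changes above can be sketched as a small decision function. This is a Python illustration under stated assumptions, not the proposed implementation: `TimeoutCheck`, `Outcome`, `evaluate_timeout` and the action names are hypothetical, and the feature switch is reduced to a single boolean.

```python
from dataclasses import dataclass
from datetime import timedelta

# Per the proposal, only these two timeouts keep their existing (hard) behavior.
HARD_TIMEOUTS = {"installation", "rebooting"}


@dataclass
class TimeoutCheck:
    name: str             # hypothetical identifier of the timeout
    limit: timedelta      # configured timeout value
    elapsed: timedelta    # time spent so far in the stage or status


@dataclass
class Outcome:
    action: str           # "continue", "warn" or "apply_existing_action"
    event: str = ""       # text of the cluster or host event, if any


def evaluate_timeout(check: TimeoutCheck, soft_timeouts_enabled: bool) -> Outcome:
    """Decide what to do with one timeout, given whether the feature is enabled."""
    if check.elapsed <= check.limit:
        return Outcome("continue")
    if soft_timeouts_enabled and check.name not in HARD_TIMEOUTS:
        # Soft timeout: emit a warning event and keep the installation running.
        return Outcome("warn", f"{check.name} is taking longer than expected")
    # Hard timeout: fall back to the action listed in the table of existing timeouts.
    return Outcome("apply_existing_action", f"{check.name} timed out")
```

With the feature disabled, as proposed by default for ZTP, every expired timeout falls through to its existing action.
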
### Open Questions | ||
|
||
- Should we collect must-gather logs when a soft timeout expires, or after a cluster with an expired soft timeout
is cancelled, even if the cluster is not marked as failed?
|
||
### UI Impact | ||
|
||
The UI will need to explicitly show to the user that the cluster installation | ||
is taking longer than expected, and give suggestions on how to proceed. For | ||
example, we could present a warning message with this text: | ||
|
||
> Cluster installation is taking too long | ||
> | ||
> Most installations complete in approximately 45 minutes, but it took 23 hours | ||
> and 52 minutes already. Check the logs to find out why or reset the cluster | ||
> to start over.

The progress bars and other UI elements used to indicate progress should also
explicitly indicate that the installation is taking longer than expected, for
example by using warning icons or specific colors:
|
||
![UI example](./soft-install-timeout/ui-example.png) | ||
|
||
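As a rough illustration of how the elapsed time in the example warning text could be rendered, here is a short Python sketch. The function names are hypothetical, the 45 minute figure is taken from the example message, and the real UI is not assumed to be implemented this way.

```python
from datetime import timedelta


def format_elapsed(elapsed: timedelta) -> str:
    """Render a duration as hours and minutes, e.g. '23 hours and 52 minutes'."""
    total_minutes = int(elapsed.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    parts = []
    if hours:
        parts.append(f"{hours} hour{'s' if hours != 1 else ''}")
    if minutes or not parts:
        parts.append(f"{minutes} minute{'s' if minutes != 1 else ''}")
    return " and ".join(parts)


def slow_install_warning(elapsed: timedelta, typical: timedelta = timedelta(minutes=45)) -> str:
    """Build the body of the example warning for a slow installation."""
    return (
        f"Most installations complete in approximately {format_elapsed(typical)}, "
        f"but it took {format_elapsed(elapsed)} already."
    )
```
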
### Test Plan | ||
|
||
We will need the following test cases, for the preparation, installation and | ||
finalizing phases: | ||
|
||
- Prepare a cluster that exceeds the timeout and verify that fixing the issue | ||
manually allows the service to continue with the installation. | ||
|
||
- Verify that the UI explicitly shows the information about the expired | ||
timeout, and that it recovers when the issue is eventually fixed and the | ||
service continues with the installation. | ||
|
||
These tests may require introducing a mechanism to artificially delay the | ||
installation in the agent or the installer. | ||
|
||
## Drawbacks | ||
|
||
Most of our cluster installation failures are due to these timeouts. If we just
disable them, we will have a large number of installations that have failed
but will not be counted as such.
|
||
## Alternatives | ||
|
||
None. |