Add timeout for backup/restore expose #6472

Merged

Conversation

Lyndon-Li
Contributor

@Lyndon-Li Lyndon-Li commented Jul 7, 2023

For backup expose, the exposer waits for the snapshot to be ready, creates a volume from the snapshot and a pod to consume it, and then the Velero data mover waits for the pod to reach Running status.
For restore expose, the exposer dynamically provisions a volume and creates a pod to consume it, and then the Velero data mover waits for the pod to reach Running status.

An unsatisfied condition can cause volume creation from the snapshot, or dynamic volume provisioning, to hang, so the pod never reaches Running status.
For example, if the information in the storage class is wrong, dynamic volume provisioning never finishes.

For both backup expose and restore expose, if this happens, the DataUpload/DataDownload hangs until a 4-hour timeout expires.

This PR adds a mechanism that tracks how long the backup/restore expose takes and enforces a timeout. If the timeout is reached, the DataUpload/DataDownload is marked as failed and any intermediate resources are cleaned up.
At present, the timeout defaults to 30 minutes and is configurable through a node-agent server parameter.
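
As a rough illustration of the timeout check described above (not the code in this PR), the Go sketch below uses a simplified, hypothetical model: `exposeRequest`, the phase names, `checkExposeTimeout` and the `cleanup` callback are all assumptions made for the example, and only the 30-minute default comes from the description.

```go
package main

import (
	"fmt"
	"time"
)

// defaultExposeTimeout mirrors the 30-minute default mentioned in the description.
const defaultExposeTimeout = 30 * time.Minute

// exposePhase is a hypothetical stand-in for the DataUpload/DataDownload phase.
type exposePhase string

const (
	phaseAccepted exposePhase = "Accepted" // expose started, pod not yet running (assumed name)
	phaseFailed   exposePhase = "Failed"
)

// exposeRequest is a hypothetical, simplified view of a DataUpload/DataDownload.
type exposeRequest struct {
	Name       string
	Phase      exposePhase
	AcceptedAt time.Time // when the expose started
	Message    string
}

// checkExposeTimeout fails the request and invokes cleanup once the expose has
// been pending for at least timeout. It reports whether the timeout fired.
func checkExposeTimeout(req *exposeRequest, now time.Time, timeout time.Duration, cleanup func(name string)) bool {
	if req.Phase != phaseAccepted || now.Sub(req.AcceptedAt) < timeout {
		return false
	}
	req.Phase = phaseFailed
	req.Message = fmt.Sprintf("expose not ready within %s", timeout)
	cleanup(req.Name) // remove intermediate resources, e.g. the pod and volume
	return true
}
```

A controller could call such a check each time it reconciles a request that is still waiting for its pod, passing the configured timeout instead of the default.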

This PR also fixes a problem in the node-agent restart scenario. If node-agent restarts while a backup exposer is waiting for the snapshot to be ready, then after the restart it does not know which DataUploads are affected and therefore cannot cancel them.
The timeout mechanism serves as a backstop for the node-agent server in this case: any orphaned DataUploads that the node-agent server cannot cancel will eventually hit the timeout.
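
One way to picture the restart case is a sweep that relies only on state stored with each object rather than on anything held in node-agent memory. The Go sketch below assumes, purely for illustration, that each record carries a persisted start time. `dataUploadRecord`, `findOrphans`, `sweepTimeout` and the phase strings are hypothetical names, not Velero's API.

```go
package main

import (
	"sort"
	"time"
)

// sweepTimeout is the assumed expose timeout used by the sweep.
const sweepTimeout = 30 * time.Minute

// dataUploadRecord is a hypothetical, simplified DataUpload. StartedAt is
// assumed to be persisted with the object, so it survives a node-agent restart.
type dataUploadRecord struct {
	Name      string
	Phase     string
	StartedAt time.Time
}

// findOrphans returns the names of records that are still waiting for expose
// ("Accepted", an assumed phase name) and have waited for at least timeout.
// It needs no in-memory tracking, so it works after a restart.
func findOrphans(all []dataUploadRecord, now time.Time, timeout time.Duration) []string {
	var orphans []string
	for _, du := range all {
		if du.Phase == "Accepted" && now.Sub(du.StartedAt) >= timeout {
			orphans = append(orphans, du.Name)
		}
	}
	sort.Strings(orphans) // stable order for logging and tests
	return orphans
}
```

Because the decision depends only on the stored phase and start time, a freshly restarted process reaches the same conclusion as one that had been running all along.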

@Lyndon-Li Lyndon-Li added the kind/changelog-not-required PR does not require a user changelog. Often for docs, website, or build changes label Jul 7, 2023
@Lyndon-Li Lyndon-Li force-pushed the add-wait-timeout-for-expose-prepare branch from 76a9936 to 7e445f1 Compare July 7, 2023 11:42
@codecov-commenter

codecov-commenter commented Jul 7, 2023

Codecov Report

Merging #6472 (bf2a981) into main (7deae4c) will increase coverage by 0.12%.
The diff coverage is 75.18%.

@@            Coverage Diff             @@
##             main    #6472      +/-   ##
==========================================
+ Coverage   60.18%   60.31%   +0.12%     
==========================================
  Files         229      229              
  Lines       24219    24319     +100     
==========================================
+ Hits        14577    14667      +90     
- Misses       8634     8647      +13     
+ Partials     1008     1005       -3     
Impacted Files                                 Coverage Δ
pkg/cmd/cli/nodeagent/server.go                11.20% <0.00%> (-0.10%) ⬇️
pkg/controller/data_upload_controller.go       69.12% <77.14%> (+2.98%) ⬆️
pkg/controller/data_download_controller.go     79.08% <80.32%> (+1.84%) ⬆️

@Lyndon-Li Lyndon-Li force-pushed the add-wait-timeout-for-expose-prepare branch from 7e445f1 to a002b7f Compare July 7, 2023 11:52
@Lyndon-Li Lyndon-Li changed the title Add wait timeout for expose prepare Add timeout for backup/restore expose Jul 7, 2023
@Lyndon-Li Lyndon-Li force-pushed the add-wait-timeout-for-expose-prepare branch 2 times, most recently from aaa2105 to bf2a981 Compare July 7, 2023 16:19
@Lyndon-Li Lyndon-Li marked this pull request as ready for review July 10, 2023 01:23
Signed-off-by: Lyndon-Li <lyonghui@vmware.com>
@Lyndon-Li Lyndon-Li force-pushed the add-wait-timeout-for-expose-prepare branch from bf2a981 to 9f5162e Compare July 10, 2023 09:32
@github-actions github-actions bot added the Dependencies Pull requests that update a dependency file label Jul 10, 2023
@Lyndon-Li Lyndon-Li merged commit 0945879 into vmware-tanzu:main Jul 11, 2023
22 checks passed
@Lyndon-Li Lyndon-Li deleted the add-wait-timeout-for-expose-prepare branch July 11, 2023 01:58
Labels
Dependencies Pull requests that update a dependency file has-unit-tests kind/changelog-not-required PR does not require a user changelog. Often for docs, website, or build changes

4 participants