
Error: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline #7991

Open · Shashank1306s opened this issue on Jul 9, 2024 · 13 comments
Labels: Area/CSI (Related to Container Storage Interface support)

@Shashank1306s (Contributor):

What steps did you take and what happened:

There are a bunch of clusters where the backup of snapshots is failing with the following error:
[screenshot of the error attached]

**What did you expect to happen:**
The cluster seems to be healthy; we are not seeing any pod resource crunch or API server issues in the timeframe when this error occurred.

The following information will help us better understand what's going on:
Found a related GitHub issue: https://github.com/helm/helm/issues/12154
The error is thrown at this code path: https://github.com/kubernetes/client-go/blob/354ed1bc9f1f48c820fdc5e84b566acfd716cf42/rest/request.go#L619
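
For context on that code path, here is a minimal sketch (not Velero or client-go source; just the underlying golang.org/x/time/rate limiter that client-go's throttle wraps) showing how this exact error text is produced when the remaining context deadline is shorter than the wait the limiter would need:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// A very low-rate limiter: 1 token every 10 seconds, burst of 1.
	limiter := rate.NewLimiter(rate.Limit(0.1), 1)
	limiter.Allow() // consume the only available token

	// A context whose deadline expires before the next token would be available.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Wait refuses immediately because waiting for a token would exceed the deadline.
	if err := limiter.Wait(ctx); err != nil {
		fmt.Println(err) // rate: Wait(n=1) would exceed context deadline
	}
}
```

client-go then wraps that error as "client rate limiter Wait returned an error: ...", which is the message seen in the backup logs.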

If you are using velero v1.7.0+:
Please use `velero debug --backup <backupname> --restore <restorename>` to generate the support bundle and attach it to this issue; for more options, refer to `velero debug --help`.

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:

  • Velero version (use velero version):
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li (Contributor) commented Jul 9, 2024

Please share more details:

  1. Where do you see the error "failed to get volumesnapshot..."
  2. Please share the entire debug bundle generated by velero debug
  3. Which Velero version are you using

@Shashank1306s (Contributor, Author):

> Please share more details:
>
>   1. Where do you see the error "failed to get volumesnapshot..."
>   2. Please share the entire debug bundle generated by velero debug
>   3. Which Velero version are you using

We are on velero 1.13.

Attaching the velero log file.
clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109-logs.gz

Lyndon-Li added the Area/CSI (Related to Container Storage Interface support) label on Jul 10, 2024
@Lyndon-Li (Contributor) commented Jul 10, 2024

It looks like the wait-retry mechanism hitting a timeout causes client-go to generate an unexpected error (see here):

{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:259","msg":"Waiting for volumesnapshotcontents snapcontent-186c144e-fabc-4354-bea3-ea7c5e19df7f to have snapshot handle. Retrying in 5s","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:54:50Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:259","msg":"Waiting for volumesnapshotcontents snapcontent-186c144e-fabc-4354-bea3-ea7c5e19df7f to have snapshot handle. Retrying in 5s","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:54:55Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:259","msg":"Waiting for volumesnapshotcontents snapcontent-186c144e-fabc-4354-bea3-ea7c5e19df7f to have snapshot handle. Retrying in 5s","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:55:00Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:407","msg":"Deleting Volumesnapshot hrweb/velero-testpvc-nvvgd","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:426","msg":"Deleted volumesnapshot with volumesnapshotContent hrweb/velero-testpvc-nvvgd","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"debug","logSource":"/__w/1/s/velero/pkg/backup/item_backupper.go:237","msg":"Executing post hooks","name":"velero-testpvc-nvvgd","namespace":"hrweb","resource":"volumesnapshots.snapshot.storage.k8s.io","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"debug","logSource":"/__w/1/s/velero/pkg/backup/item_backupper.go:237","msg":"Executing post hooks","name":"testpvc","namespace":"hrweb","resource":"persistentvolumeclaims","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"debug","logSource":"/__w/1/s/velero/pkg/backup/item_backupper.go:237","msg":"Executing post hooks","name":"webserver-b94d78ff-c8clp","namespace":"hrweb","resource":"pods","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"info","logSource":"/__w/1/s/velero/pkg/backup/backup.go:457","msg":"1 errors encountered backup up item","name":"webserver-b94d78ff-c8clp","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","error.message":"error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=hrweb, name=velero-testpvc-nvvgd): rpc error: code = Unknown desc = failed to get volumesnapshot hrweb/velero-testpvc-nvvgd: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline","level":"error","logSource":"/__w/1/s/velero/pkg/backup/backup.go:461","msg":"Error backing up item","name":"webserver-b94d78ff-c8clp","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"info","logSource":"/__w/1/s/velero/pkg/backup/backup.go:405","msg":"Backed up 4 items out of an estimated total of 1299 (estimate will change throughout the backup)","name":"webserver-b94d78ff-c8clp","namespace":"hrweb","progress":"","resource":"pods","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"info","logSource":"/__w/1/s/velero/pkg/backup/backup.go:365","msg":"Processing item","name":"webserver-b94

@Shashank1306s (Contributor, Author):

Hi @Lyndon-Li, we have the default timeout of 10 min with polling at an interval of 5 sec. From the logs we can see that the polling succeeded after 2-3 retries, so I don't think we hit the timeout here.

@blackpiglet (Contributor):

This issue may be related to #7978.

@anshulahuja98 (Collaborator):

@blackpiglet can you help explain why you think they might be correlated?
I am still not clear on what this error message means.

@blackpiglet (Contributor):

Error: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline

IMO, this error means the client failed to get the VolumeSnapshot within client-go's time limit.
One possible reason is the VolumeSnapshotContent re-put caused by #7978, which puts a significant load on the kube-apiserver.
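
As a side note on mitigation, a minimal sketch (not Velero's code) of the knobs involved: the "client rate limiter" in the error is client-go's client-side throttle, configured through rest.Config QPS/Burst. Raising those values (or the velero server's --client-qps/--client-burst flags, if the version in use exposes them) reduces how long a request can queue behind the limiter when the kube-apiserver is slow or overloaded:

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newLessThrottledClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	// A bare rest.Config defaults to QPS=5 and Burst=10; higher values mean
	// requests spend less time waiting on the client-side limiter.
	cfg.QPS = 100
	cfg.Burst = 200
	return kubernetes.NewForConfig(cfg)
}
```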

@curtis-baillie:

Hi, any update on this issue? I seem to be dealing with the same issue on v1.14.0:

Warnings:  <error getting warnings: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline>

Errors:  <error getting errors: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline>

@veerendra2:

In our case, we were hit by this issue because of Azure/AKS#4555.

@blackpiglet (Contributor):

@curtis-baillie
Could you try v1.14.1?

@monotek commented Sep 30, 2024

We already do.

@blackpiglet (Contributor):

@monotek
Could you collect the debug bundle to help investigate?

@monotek commented Oct 8, 2024

Sorry, we have already removed Velero and now do our volume snapshots with our own controller.
