
Error: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline #7991

Open · Shashank1306s opened this issue on Jul 9, 2024 · 13 comments
Labels: Area/CSI (Related to Container Storage Interface support)

@Shashank1306s (Contributor):

What steps did you take and what happened:

There are a bunch of clusters where the backup of snapshots is failing with the following error:
[screenshot of the error attached]

**What did you expect to happen:**
The cluster seems to be healthy; we are not seeing any pod resource crunch or API server issues in the timeframe when this error occurred.

The following information will help us better understand what's going on:
Found a related GitHub issue: https://github.com/helm/helm/issues/12154
The error is thrown at this code path: https://github.com/kubernetes/client-go/blob/354ed1bc9f1f48c820fdc5e84b566acfd716cf42/rest/request.go#L619
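
For context on that code path, here is a minimal sketch (not Velero or client-go source; just the underlying golang.org/x/time/rate limiter that client-go's throttle wraps) showing how this exact error text is produced when the remaining context deadline is shorter than the wait the limiter would need:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// A very low-rate limiter: 1 token every 10 seconds, burst of 1.
	limiter := rate.NewLimiter(rate.Limit(0.1), 1)
	limiter.Allow() // consume the only available token

	// A context whose deadline expires before the next token would be available.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Wait refuses immediately because waiting for a token would exceed the deadline.
	if err := limiter.Wait(ctx); err != nil {
		fmt.Println(err) // rate: Wait(n=1) would exceed context deadline
	}
}
```

client-go then wraps that error as "client rate limiter Wait returned an error: ...", which is the message seen in the backup logs.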

If you are using velero v1.7.0+:
Please use `velero debug --backup <backupname> --restore <restorename>` to generate the support bundle and attach it to this issue; for more options, refer to `velero debug --help`.

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:

  • Velero version (use velero version):
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li (Contributor) commented Jul 9, 2024

Please share more details:

  1. Where do you see the error "failed to get volumesnapshot..."
  2. Please share the entire debug bundle generated by velero debug
  3. Which Velero version are you using

@Shashank1306s (Contributor, Author):

> Please share more details:
>
>   1. Where do you see the error "failed to get volumesnapshot..."
>   2. Please share the entire debug bundle generated by velero debug
>   3. Which Velero version are you using

We are on velero 1.13.

Attaching the velero log file.
clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109-logs.gz

Lyndon-Li added the Area/CSI (Related to Container Storage Interface support) label on Jul 10, 2024
@Lyndon-Li (Contributor) commented Jul 10, 2024

It looks like the wait-retry mechanism hitting a timeout causes client-go to generate an unexpected error (see here):

{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:259","msg":"Waiting for volumesnapshotcontents snapcontent-186c144e-fabc-4354-bea3-ea7c5e19df7f to have snapshot handle. Retrying in 5s","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:54:50Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:259","msg":"Waiting for volumesnapshotcontents snapcontent-186c144e-fabc-4354-bea3-ea7c5e19df7f to have snapshot handle. Retrying in 5s","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:54:55Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:259","msg":"Waiting for volumesnapshotcontents snapcontent-186c144e-fabc-4354-bea3-ea7c5e19df7f to have snapshot handle. Retrying in 5s","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:55:00Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:407","msg":"Deleting Volumesnapshot hrweb/velero-testpvc-nvvgd","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/__w/1/s/velero-plugin-for-csi/internal/util/util.go:426","msg":"Deleted volumesnapshot with volumesnapshotContent hrweb/velero-testpvc-nvvgd","pluginName":"velero-plugin-for-csi","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"debug","logSource":"/__w/1/s/velero/pkg/backup/item_backupper.go:237","msg":"Executing post hooks","name":"velero-testpvc-nvvgd","namespace":"hrweb","resource":"volumesnapshots.snapshot.storage.k8s.io","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"debug","logSource":"/__w/1/s/velero/pkg/backup/item_backupper.go:237","msg":"Executing post hooks","name":"testpvc","namespace":"hrweb","resource":"persistentvolumeclaims","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"debug","logSource":"/__w/1/s/velero/pkg/backup/item_backupper.go:237","msg":"Executing post hooks","name":"webserver-b94d78ff-c8clp","namespace":"hrweb","resource":"pods","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"info","logSource":"/__w/1/s/velero/pkg/backup/backup.go:457","msg":"1 errors encountered backup up item","name":"webserver-b94d78ff-c8clp","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","error.message":"error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=hrweb, name=velero-testpvc-nvvgd): rpc error: code = Unknown desc = failed to get volumesnapshot hrweb/velero-testpvc-nvvgd: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline","level":"error","logSource":"/__w/1/s/velero/pkg/backup/backup.go:461","msg":"Error backing up item","name":"webserver-b94d78ff-c8clp","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"info","logSource":"/__w/1/s/velero/pkg/backup/backup.go:405","msg":"Backed up 4 items out of an estimated total of 1299 (estimate will change throughout the backup)","name":"webserver-b94d78ff-c8clp","namespace":"hrweb","progress":"","resource":"pods","time":"2024-07-02T21:55:05Z"}
{"backup":"dataprotection-microsoft/clusterbackup-dataprotection-microsoft-backup-a11d83da-6e15-4d8c-8bca-ee6b3aba7109","level":"info","logSource":"/__w/1/s/velero/pkg/backup/backup.go:365","msg":"Processing item","name":"webserver-b94

@Shashank1306s (Contributor, Author):

Hi @Lyndon-Li, we have the default timeout of 10 min with polling at an interval of 5 sec. From the logs we can see that the polling succeeded after 2-3 retries, so I don't think we hit the timeout here.

@blackpiglet (Contributor):

This issue may be related to #7978.

@anshulahuja98 (Collaborator):

@blackpiglet can you help explain why you think they might be correlated?
I am still not clear on what this error message means.

@blackpiglet (Contributor):

Error: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline

IMO, this error means the client failed to get the VolumeSnapshot within client-go's time limit.
One possible reason is the VolumeSnapshotContent re-put caused by #7978, which puts a significant load on the kube-apiserver.
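
As a side note on mitigation, a minimal sketch (not Velero's code) of the knobs involved: the "client rate limiter" in the error is client-go's client-side throttle, configured through rest.Config QPS/Burst. Raising those values (or the velero server's --client-qps/--client-burst flags, if the version in use exposes them) reduces how long a request can queue behind the limiter when the kube-apiserver is slow or overloaded:

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newLessThrottledClient() (*kubernetes.Clientset, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	// A bare rest.Config defaults to QPS=5 and Burst=10; higher values mean
	// requests spend less time waiting on the client-side limiter.
	cfg.QPS = 100
	cfg.Burst = 200
	return kubernetes.NewForConfig(cfg)
}
```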

@curtis-baillie:

Hi, any update on this issue? I seem to be dealing with the same issue on v1.14.0:

Warnings:  <error getting warnings: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline>

Errors:  <error getting errors: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline>

@veerendra2:

In our case, we were hit by this issue because of Azure/AKS#4555.

@blackpiglet (Contributor):

@curtis-baillie
Could you try v1.14.1?

@monotek commented Sep 30, 2024

We already do.

@blackpiglet (Contributor):

@monotek
Could you collect the debug bundle to help investigate?

@monotek commented Oct 8, 2024

Sorry, we have already removed Velero and now do our volume snapshots with our own controller.
