
CreateVolume API continues to be requested even after related objects are deleted on v1.2.0 #1022

Closed
imbohyun1 opened this issue Aug 13, 2021 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments


imbohyun1 commented Aug 13, 2021

/kind bug

What happened?

I tried to create a PV encrypted with a KMS key via a PVC, and it failed because of a permission issue. After the first CreateVolume attempt failed, I could see the controller retrying to create the volume and failing over and over again.

From the second try onward, CreateVolume failed with InternalError. I checked the internal log and found that the error occurred because the client token is associated with a resource that is already deleted.

"eventSource": "ec2.amazonaws.com",
"eventName": "CreateVolume",
"awsRegion": "ap-northeast-2",
"userAgent": "aws-sdk-go/1.40.4 (go1.16.6; linux; amd64) exec-env/aws-ebs-csi-driver-v1.2.0",
"errorCode": "Server.InternalError",
"errorMessage": "An internal error has occurred",
....

So I deleted the Pod and PVC via kubectl (and confirmed from the kube-apiserver audit log that the PVC was deleted successfully); however, the CreateVolume API continues to be requested even after the related objects are deleted.

. controller log - first/second attempts

W0813 06:50:26.040984       1 controller.go:958] Retrying syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db", failure 0
E0813 06:50:26.041005       1 controller.go:981] error syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db": failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = Internal desc = Could not create volume "pvc-1bef4ab2-2ee9-4a25-89af-c8d85675d6db": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-0a3bca4957543370d' does not exist.
	status code: 400, request id: b6d97405-784b-4b7c-b4b1-72b75a4833f5
I0813 06:50:26.041033       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"ebs-claim-enc", UID:"1bef4ab2-2ee9-4a25-89af-c8d85675d6db", APIVersion:"v1", ResourceVersion:"10979326", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = Internal desc = Could not create volume "pvc-1bef4ab2-2ee9-4a25-89af-c8d85675d6db": failed to get an available volume in EC2: InvalidVolume.NotFound: The volume 'vol-0a3bca4957543370d' does not exist.

I0813 06:50:37.041621       1 controller.go:1106] Temporary error received, adding PVC 1bef4ab2-2ee9-4a25-89af-c8d85675d6db to claims in progress
W0813 06:50:37.041656       1 controller.go:958] Retrying syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db", failure 1
E0813 06:50:37.041677       1 controller.go:981] error syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db": failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0813 06:50:37.041836       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"ebs-claim-enc", UID:"1bef4ab2-2ee9-4a25-89af-c8d85675d6db", APIVersion:"v1", ResourceVersion:"10979326", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0813 06:50:39.041937       1 controller.go:1332] provision "default/ebs-claim-enc" class "ebs-sc-enc": started
I0813 06:50:39.042344       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"ebs-claim-enc", UID:"1bef4ab2-2ee9-4a25-89af-c8d85675d6db", APIVersion:"v1", ResourceVersion:"10979326", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/ebs-claim-enc"

. Delete the pod and pvc - the PVC deletion request shows up in the audit log

{
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "RequestResponse",
    "auditID": "1f4cca8d-6014-4700-ae3f-875fc6dcd75a",
    "stage": "ResponseComplete",
    "requestURI": "/api/v1/namespaces/default/persistentvolumeclaims/ebs-claim-enc",
    ...
    "objectRef": {
        "resource": "persistentvolumeclaims",
        "namespace": "default",
        "name": "ebs-claim-enc",
        "apiVersion": "v1"
    },
    "responseStatus": {
        "metadata": {},
        "code": 200
    },
    "requestObject": {
        "kind": "DeleteOptions",
        "apiVersion": "v1",
        "propagationPolicy": "Background"
    },
    ...
    "responseObject": {
        "kind": "PersistentVolumeClaim",
        "apiVersion": "v1",
        "metadata": {
            "name": "ebs-claim-enc",
            "namespace": "default",
            "uid": "1bef4ab2-2ee9-4a25-89af-c8d85675d6db",
            "resourceVersion": "10979451",
            "creationTimestamp": "2021-08-13T06:50:13Z",
            "deletionTimestamp": "2021-08-13T06:50:51Z",
...

Still seeing Retrying syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db"

I0813 07:04:35.170445       1 controller.go:1332] provision "default/ebs-claim-enc" class "ebs-sc-enc": started
I0813 07:04:35.170900       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"ebs-claim-enc", UID:"1bef4ab2-2ee9-4a25-89af-c8d85675d6db", APIVersion:"v1", ResourceVersion:"10979451", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/ebs-claim-enc"
I0813 07:04:45.170878       1 controller.go:1106] Temporary error received, adding PVC 1bef4ab2-2ee9-4a25-89af-c8d85675d6db to claims in progress
W0813 07:04:45.170906       1 controller.go:958] Retrying syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db", failure 10
E0813 07:04:45.170935       1 controller.go:981] error syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db": failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0813 07:04:45.171717       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"ebs-claim-enc", UID:"1bef4ab2-2ee9-4a25-89af-c8d85675d6db", APIVersion:"v1", ResourceVersion:"10979451", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0813 07:09:45.171173       1 controller.go:1332] provision "default/ebs-claim-enc" class "ebs-sc-enc": started
I0813 07:09:45.171513       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"ebs-claim-enc", UID:"1bef4ab2-2ee9-4a25-89af-c8d85675d6db", APIVersion:"v1", ResourceVersion:"10979451", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/ebs-claim-enc"
I0813 07:09:55.171555       1 controller.go:1106] Temporary error received, adding PVC 1bef4ab2-2ee9-4a25-89af-c8d85675d6db to claims in progress
W0813 07:09:55.171588       1 controller.go:958] Retrying syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db", failure 11
E0813 07:09:55.171609       1 controller.go:981] error syncing claim "1bef4ab2-2ee9-4a25-89af-c8d85675d6db": failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = DeadlineExceeded desc = context deadline exceeded
I0813 07:09:55.172164       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"ebs-claim-enc", UID:"1bef4ab2-2ee9-4a25-89af-c8d85675d6db", APIVersion:"v1", ResourceVersion:"10979451", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "ebs-sc-enc": rpc error: code = DeadlineExceeded desc = context deadline exceeded

I was able to stop this by restarting the ebs-csi-controller.

I also tested with v1.1.3/v1.1.2, and the issue did not happen. It looks like this issue appeared after the feature 'utilize latest go sdk to ensure createVolume idempotency' (#982) was added.
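
For reference, my understanding of that feature is that the driver now passes a client token on each CreateVolume call so that a retry with the same token returns the original volume. Below is a minimal aws-sdk-go sketch of that pattern; it is not the driver's actual code, and the function, token value, and volume parameters are just illustrations.

    package main

    import (
        "context"
        "fmt"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/ec2"
    )

    func main() {
        sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("ap-northeast-2")))
        svc := ec2.New(sess)

        // The client token is what makes CreateVolume idempotent: a retry that
        // reuses the same token is supposed to return the original volume
        // instead of creating a new one. If that original volume has since
        // been deleted, EC2 rejects the retry, which matches the "client token
        // is associated with a resource that is already deleted" message above.
        out, err := svc.CreateVolumeWithContext(context.TODO(), &ec2.CreateVolumeInput{
            AvailabilityZone: aws.String("ap-northeast-2a"),
            Size:             aws.Int64(4),
            VolumeType:       aws.String("gp3"),
            Encrypted:        aws.Bool(true),
            KmsKeyId:         aws.String("arn:aws:kms:ap-northeast-2:xxxxxxxxx:key/xxxxxxxxxx"),
            ClientToken:      aws.String("pvc-1bef4ab2-2ee9-4a25-89af-c8d85675d6db"), // illustrative token only
        })
        if err != nil {
            fmt.Println("CreateVolume failed:", err)
            return
        }
        fmt.Println("created volume:", aws.StringValue(out.VolumeId))
    }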

What you expected to happen?

CreateVolume should not be requested after the PVC is deleted.

How to reproduce it (as minimally and precisely as possible)?

  1. Create an SC with a kmsKeyId that the node role has not been granted enough permission to use

    encrypted: 'true'
    kmsKeyId: arn:aws:kms:ap-northeast-2:xxxxxxxxx:key/xxxxxxxxxx
    volumeBindingMode: WaitForFirstConsumer

  2. Create the PVC/Pod -> PV creation fails, the PVC stays Pending

  3. Delete PVC/Pod

  4. You can see the ebs-csi-controller keep retrying.

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version): 1.17, 1.20
  • Driver version: 1.2.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 13, 2021
@vdhanan vdhanan self-assigned this Aug 16, 2021

wongma7 (Contributor) commented Aug 19, 2021

Did you mean to link #982?

The logs are coming from the external provisioner controller library. However, the version of the external-provisioner/csi-provisioner sidecar is the same (v2.1.1) for driver versions 1.1-1.2.
The library intentionally ignores PVC deletion events (https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/master/controller/controller.go#L680). Instead, it expects the PVC cache to stay up to date, so that once a PVC has been deleted it is removed from the cache, then removed from the queue, and therefore no longer retried (https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/master/controller/controller.go#L1015-L1018). But somehow the PVC is still in its cache, so the controller keeps retrying anyway.
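
To make that expected flow concrete, here is a paraphrased sketch of the dequeue-on-deletion behavior. It is not the actual library code; the type and field names (sketchController, claimLister, claimQueue) are illustrative assumptions, and only the overall shape matches what the linked lines are supposed to do.

    package sketch

    import (
        "context"

        v1 "k8s.io/api/core/v1"
        apierrors "k8s.io/apimachinery/pkg/api/errors"
        corelisters "k8s.io/client-go/listers/core/v1"
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/util/workqueue"
    )

    // sketchController is an illustrative stand-in for the provisioning
    // controller, not the real type from the library.
    type sketchController struct {
        claimLister corelisters.PersistentVolumeClaimLister
        claimQueue  workqueue.RateLimitingInterface
        provision   func(ctx context.Context, claim *v1.PersistentVolumeClaim) error
    }

    // syncClaim paraphrases the expected behavior: a claim key that no longer
    // resolves in the informer cache should be forgotten rather than retried.
    func (c *sketchController) syncClaim(ctx context.Context, key string) error {
        namespace, name, err := cache.SplitMetaNamespaceKey(key)
        if err != nil {
            return err
        }
        claim, err := c.claimLister.PersistentVolumeClaims(namespace).Get(name)
        if apierrors.IsNotFound(err) {
            // The PVC is gone from the cache: drop the key and stop retrying.
            c.claimQueue.Forget(key)
            return nil
        }
        if err != nil {
            return err
        }
        // The PVC still appears to exist: attempt provisioning; a temporary
        // error puts the key back on the queue with backoff, which is the
        // retry loop visible in the logs above.
        return c.provision(ctx, claim)
    }

In this issue the PVC has been deleted from the API server, so the IsNotFound branch should fire on the next sync, yet the retries continue, which is why a stale cache is the suspect.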

For the time the driver is printing these errors, is it provisioning volumes for any other PVCs? (trying to understand if the cache is stale somehow)

Also, based on my understanding of the error, the driver is not actually sending CreateVolume API requests, since the context deadline is exceeded; but it is sending some pointless Kubernetes API requests, plus of course being noisy.

imbohyun1 (Author) commented

@wongma7

For the time the driver is printing these errors, is it provisioning volumes for any other PVCs? (trying to understand if the cache is stale somehow)

Yes, I could see new PVCs being provisioned while the driver was printing this error, so as the test progressed, the number of CreateVolume requests for new PVCs increased, and the deleted PVCs continued to trigger CreateVolume as well.
It's not only sending Kubernetes API requests; it was actually sending CreateVolume API calls, too. I was able to confirm this with CloudTrail.

By the way, I tried to reproduce this issue again today to check the verbose log, but it did not reproduce. Looking at the CloudTrail events, when CreateVolume was called and failed because the client token is associated with a resource that is already deleted, the error code was Client.IdempotentParameterMismatch, not Server.InternalError.

When the issue was reproduced, it seems like the cache data related to the PVC was handled differently because the error code returned by the CreateVolume API was not an expected error. I'll share the detailed information via Slack.
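
For context, this is roughly how the two error codes could be told apart on the SDK side. It is only an illustration, not the driver's actual handling, and the exact code strings the SDK reports (with or without the Client./Server. prefix that CloudTrail shows) are an assumption on my part.

    package sketch

    import (
        "github.com/aws/aws-sdk-go/aws/awserr"
    )

    // classifyCreateVolumeError illustrates distinguishing the two error codes
    // seen in CloudTrail; it is not the driver's actual logic, and the code
    // strings matched here are assumptions.
    func classifyCreateVolumeError(err error) string {
        if aerr, ok := err.(awserr.Error); ok {
            switch aerr.Code() {
            case "IdempotentParameterMismatch", "Client.IdempotentParameterMismatch":
                // The client token refers to a volume that no longer exists (or
                // was reused with different parameters), so retrying with the
                // same token is pointless.
                return "idempotency-conflict"
            case "InternalError", "Server.InternalError":
                // Opaque server-side failure: nothing indicates the token is
                // stale, so a caller would keep retrying, as seen in this issue.
                return "retryable"
            }
        }
        return "unknown"
    }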

k8s-triage-robot commented

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 21, 2021
@vdhanan vdhanan removed their assignment Dec 16, 2021
k8s-triage-robot commented

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 15, 2022
k8s-triage-robot commented

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot (Contributor) commented

@k8s-triage-robot: Closing this issue.

In response to this:


/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
