
VolumeDelete may fail if Snapshots for volume not deleted first #252

Closed

saad-ali opened this issue Mar 14, 2019 · 25 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@saad-ali
Member

Per container-storage-interface/spec#347, CSI permits a storage system to not delete a volume if it has outstanding snapshots and the storage system does not treat snapshots as independent entities (this is not recommended, but the spec permits it).

We should figure out how we handle this. Is there some programmatic recovery we should implement? If that is not possible, let's make sure the correct errors are surfaced to users to allow them to handle this, and document it somewhere.
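For concreteness, here is a minimal sketch (in Go, against the container-storage-interface spec library) of the behavior the spec permits: DeleteVolume refusing while snapshots still exist. The Driver type and the listSnapshotsForVolume/deleteBackendVolume helpers are hypothetical stand-ins for a real backend; only the csi and grpc packages are real.

```go
package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// DeleteVolume refuses to delete a volume that still has dependent
// snapshots, returning FAILED_PRECONDITION as the spec permits.
func (d *Driver) DeleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest) (*csi.DeleteVolumeResponse, error) {
	snaps, err := d.listSnapshotsForVolume(ctx, req.GetVolumeId()) // hypothetical backend call
	if err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}
	if len(snaps) > 0 {
		// Surface an actionable message so the user knows what is
		// blocking the delete and how to resolve it.
		return nil, status.Errorf(codes.FailedPrecondition,
			"volume %s has %d dependent snapshot(s); delete them first",
			req.GetVolumeId(), len(snaps))
	}
	if err := d.deleteBackendVolume(ctx, req.GetVolumeId()); err != nil { // hypothetical backend call
		return nil, status.Error(codes.Internal, err.Error())
	}
	return &csi.DeleteVolumeResponse{}, nil
}
```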

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 12, 2019
@msau42
Collaborator

msau42 commented Jun 20, 2019

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 20, 2019
@dsupure

dsupure commented Mar 6, 2020

We implemented the spec behavior by returning the FailedPrecondition error code when there are snapshots associated with the volume (PVC). However, when we delete the PVC, the external-provisioner retries the delete several times and eventually lets k8s delete the PVC anyway. Is that the expected behavior? We would expect the PVC to remain, with a warning or error in its events prompting the user to take further action (i.e. delete the snapshots).

It seems the current spec definition leaves dangling volumes and snapshots, and lets the state of k8s and the storage system fall out of sync.

We are using k8s 1.17 + external-provisioner 1.5.

Here is an example log showing this behavior:

time="2020-03-05T23:07:19Z" level=error msg="Cannot delete the volume because there are snapshot(s) associated with it." snapList="[0xc00019e6c0]" span="{}" volID="&{k8s pvc-b8f31a2a-50c1-4883-8ac3-056cb227198a}"
Mar 05 23:07:19 k8s-dsu-98d72-2 sha256:2130c4e026a538046333e953b06743a8a9a74e1b5fec5c5cdde0e1ddc03bd4a4/k8s_csi-provisioner_pure-provisioner-0_default_439823e9-d66f-4a3a-b03c-5e82185fa77b_0/907419a05a18[24269]: E0305 23:07:19.408099       1 controller.go:1334] delete "pvc-b8f31a2a-50c1-4883-8ac3-056cb227198a": volume deletion failed: rpc error: code = FailedPrecondition desc = Cannot delete the volume because there are snapshot(s) associated with it.
Mar 05 23:07:19 k8s-dsu-98d72-2 sha256:2130c4e026a538046333e953b06743a8a9a74e1b5fec5c5cdde0e1ddc03bd4a4/k8s_csi-provisioner_pure-provisioner-0_default_439823e9-d66f-4a3a-b03c-5e82185fa77b_0/907419a05a18[24269]: W0305 23:07:19.408860       1 controller.go:936] Retrying syncing volume "pvc-b8f31a2a-50c1-4883-8ac3-056cb227198a", failure 0
Mar 05 23:07:20 k8s-dsu-98d72-2 sha256:2130c4e026a538046333e953b06743a8a9a74e1b5fec5c5cdde0e1ddc03bd4a4/k8s_csi-provisioner_pure-provisioner-0_default_439823e9-d66f-4a3a-b03c-5e82185fa77b_0/907419a05a18[24269]: E0305 23:07:20.452400       1 controller.go:1334] delete "pvc-b8f31a2a-50c1-4883-8ac3-056cb227198a": volume deletion failed: rpc error: code = FailedPrecondition desc = Cannot delete the volume because there are snapshot(s) associated with it.
Mar 05 23:07:20 k8s-dsu-98d72-2 sha256:2130c4e026a538046333e953b06743a8a9a74e1b5fec5c5cdde0e1ddc03bd4a4/k8s_csi-provisioner_pure-provisioner-0_default_439823e9-d66f-4a3a-b03c-5e82185fa77b_0/907419a05a18[24269]: W0305 23:07:20.452595       1 controller.go:936] Retrying syncing volume "pvc-b8f31a2a-50c1-4883-8ac3-056cb227198a", failure 1
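The log above shows the provisioner retrying a FAILED_PRECONDITION error as if it were transient. A caller can distinguish this terminal condition from retryable failures by inspecting the gRPC status code; a minimal sketch using the standard grpc status package:

```go
package main

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isDependentSnapshotError reports whether a DeleteVolume error is the
// spec's FAILED_PRECONDITION, i.e. retrying without user intervention
// (deleting the snapshots first) cannot succeed.
func isDependentSnapshotError(err error) bool {
	st, ok := status.FromError(err)
	return ok && st.Code() == codes.FailedPrecondition
}
```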

@sdodsley

sdodsley commented Mar 6, 2020

The spec seems to be at odds with the implementation.
If the spec allows a driver to return FailedPrecondition, that error shouldn't just be ignored; what's the point of having that definition in the spec otherwise?
It seems like there should be a finalizer on the PV(C) that either deletes all associated dependent snapshots or holds the PV deletion until all associated snapshots are removed.
Independent snapshots obviously wouldn't be held to this restriction.

@xing-yang
Contributor

I wonder if we should add a new SNAPSHOTS_DEPEND_ON_VOLUME controller capability. If it is true, the snapshot controller can add a finalizer on the source PVC when taking a snapshot and remove it after the last snapshot is deleted. There's already a finalizer protecting the source PVC today, but it is only added while the snapshot is being created and removed once the snapshot is ready to use. Maybe we can extend the life of that finalizer if SNAPSHOTS_DEPEND_ON_VOLUME is true.
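As a rough illustration of that idea (a sketch only: the finalizer name below is hypothetical, not one the snapshot controller uses today), adding the finalizer with client-go would look roughly like this; removal after the last snapshot is deleted would be the mirror image:

```go
package snapshotcontroller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// pvcSnapshotFinalizer is a hypothetical finalizer name for the
// proposed SNAPSHOTS_DEPEND_ON_VOLUME behavior.
const pvcSnapshotFinalizer = "snapshot.storage.kubernetes.io/pvc-has-snapshots"

// addPVCFinalizer blocks deletion of the source PVC until the
// controller removes the finalizer after the last snapshot is gone.
func addPVCFinalizer(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	pvc, err := c.CoreV1().PersistentVolumeClaims(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, f := range pvc.Finalizers {
		if f == pvcSnapshotFinalizer {
			return nil // already present
		}
	}
	pvc.Finalizers = append(pvc.Finalizers, pvcSnapshotFinalizer)
	_, err = c.CoreV1().PersistentVolumeClaims(ns).Update(ctx, pvc, metav1.UpdateOptions{})
	return err
}
```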

@xing-yang
Contributor

/remove-lifecycle frozen

@dsupure

dsupure commented Mar 9, 2020

I think SNAPSHOTS_DEPEND_ON_VOLUME is a good idea: we can use a finalizer to prevent the PVC from being deleted. It seems we also need a refCount indicating the number of snapshots associated with it, and we can only remove the finalizer once the refCount is zero. However, who should own this refCount: etcd, the external-snapshotter, or the SP? Note that the refCount either needs to be persisted in etcd or fetched from the SP.
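One way to sidestep the ownership question is to not persist a refCount at all and recompute it on demand: the API server already stores every VolumeSnapshot, so counting the ones whose source is the PVC gives the reference count. A sketch, assuming the typed clientset from external-snapshotter (shown against the v1 API; older clusters would use v1beta1):

```go
package snapshotcontroller

import (
	"context"

	snapclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// snapshotCountForPVC counts the VolumeSnapshots in a namespace whose
// source is the given PVC; a finalizer could be removed when it hits 0.
func snapshotCountForPVC(ctx context.Context, c snapclient.Interface, ns, pvcName string) (int, error) {
	list, err := c.SnapshotV1().VolumeSnapshots(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return 0, err
	}
	n := 0
	for _, s := range list.Items {
		if src := s.Spec.Source.PersistentVolumeClaimName; src != nil && *src == pvcName {
			n++
		}
	}
	return n, nil
}
```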

@ShyamsundarR
Contributor

> I wonder if we should add a new SNAPSHOTS_DEPEND_ON_VOLUME controller capability. If it is true, the snapshot controller can add a finalizer on the source PVC when taking a snapshot and remove it after the last snapshot is deleted. There's already a finalizer protecting the source PVC today, but it is only added while the snapshot is being created and removed once the snapshot is ready to use. Maybe we can extend the life of that finalizer if SNAPSHOTS_DEPEND_ON_VOLUME is true.

Correct me if I have misunderstood the following.

The way to recover (roll back) a PVC to one of its snapshots would be to delete the PVC and recreate it with a DataSource pointing to the snapshot to recover from.

If we add the said finalizers, the PVC cannot be deleted until all snapshots against it are deleted. That leaves no snapshot to revert to in the above workflow, and no way to recreate the PVC, since it is not deleted until the snapshots are.

Also, in my testing the PVC is deleted but the PV remains; it is the PV deletion that is retried (with exponential backoff) until it succeeds. This currently frees up re-creating the PVC with the same name and namespace but pointing to a snapshot as the DataSource.

If we want to handle this gracefully, the said finalizers could instead be placed on the PV, with PV deletion not attempted until they are removed. That would eliminate the unnecessary PV delete calls and is possibly what is needed to address this issue.
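For reference, a minimal sketch of the replacement PVC in the delete-and-recreate workflow described above: same name and namespace, with a DataSource pointing at the snapshot to restore from. Built here with the core/v1 Go types of the era; all names and the size are placeholders, and on recent client-go the Resources field is corev1.VolumeResourceRequirements instead.

```go
package restore

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pvcFromSnapshot builds a PVC that restores from a VolumeSnapshot,
// e.g. pvcFromSnapshot("default", "mypvc", "mysnap", resource.MustParse("10Gi")).
func pvcFromSnapshot(ns, name, snapName string, size resource.Quantity) *corev1.PersistentVolumeClaim {
	apiGroup := "snapshot.storage.k8s.io"
	return &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Namespace: ns, Name: name},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			DataSource: &corev1.TypedLocalObjectReference{
				APIGroup: &apiGroup,
				Kind:     "VolumeSnapshot",
				Name:     snapName,
			},
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{corev1.ResourceStorage: size},
			},
		},
	}
}
```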

@sdodsley

@ShyamsundarR the way most enterprise-class external storage arrays work with snapshots gives them the ability to restore a snapshot directly over the existing volume without having to delete the volume first.
This means that the PVC should not need to be deleted and recreated. The existing PV and PVC are retained.

@ShyamsundarR
Contributor

> @ShyamsundarR the way most enterprise-class external storage arrays work with snapshots gives them the ability to restore a snapshot directly over the existing volume without having to delete the volume first.
> This means that the PVC should not need to be deleted and recreated. The existing PV and PVC are retained.

Fair enough: so a rollback of the volume to a snapshot. I work with Ceph, which has similar rollback options for block-device (rbd) based volumes/images, so I understand the concept.

But how can this rollback be instrumented via CSI and kubernetes?

Snapshot creation is instrumented via kubernetes and CSI; the rollback, IMHO, should be instrumented via similar means in the orchestration plane. That would also help ensure the volume being rolled back is not in active use by any workload, so a recovery (or rollback) can be initiated safely.

There is currently no scheme for this, afaict, unless it is done using the storage provider's external interfaces. Is the future direction changing in this regard?

Also, there will be storage providers that do not support such a rollback; in those cases the scheme should cover deleting and recreating the volume from a snapshot, IMO. So we may need to ensure the solution caters to both use cases.

@humblec
Contributor

humblec commented Mar 12, 2020

> I wonder if we should add a new SNAPSHOTS_DEPEND_ON_VOLUME controller capability. If it is true, the snapshot controller can add a finalizer on the source PVC when taking a snapshot and remove it after the last snapshot is deleted. There's already a finalizer protecting the source PVC today, but it is only added while the snapshot is being created and removed once the snapshot is ready to use. Maybe we can extend the life of that finalizer if SNAPSHOTS_DEPEND_ON_VOLUME is true.

@xing-yang I think this is a good idea! It gives storage vendors the flexibility to operate in the mode they want. As I understand it, snapshot/clone/promotion behavior depends on the storage backend in place, and each backend implements it in its own way: some keep references, some untangle the snapshot completely from the source volume, etc. Giving the driver a way to advertise its relation to the parent volume opens this up, and imo that's a better take than going ahead with some restricted path.
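To make the proposal concrete: drivers already advertise what they support via ControllerGetCapabilities, and the proposed flag would slot in alongside the existing RPC capabilities. In this sketch only the two listed capabilities are real spec constants; SNAPSHOTS_DEPEND_ON_VOLUME does not exist in the spec and is shown only as a comment where it would go.

```go
package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// ControllerGetCapabilities advertises the controller RPCs this driver supports.
func (d *Driver) ControllerGetCapabilities(ctx context.Context, req *csi.ControllerGetCapabilitiesRequest) (*csi.ControllerGetCapabilitiesResponse, error) {
	rpcCap := func(t csi.ControllerServiceCapability_RPC_Type) *csi.ControllerServiceCapability {
		return &csi.ControllerServiceCapability{
			Type: &csi.ControllerServiceCapability_Rpc{
				Rpc: &csi.ControllerServiceCapability_RPC{Type: t},
			},
		}
	}
	return &csi.ControllerGetCapabilitiesResponse{
		Capabilities: []*csi.ControllerServiceCapability{
			rpcCap(csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME),
			rpcCap(csi.ControllerServiceCapability_RPC_CREATE_DELETE_SNAPSHOT),
			// rpcCap(csi.ControllerServiceCapability_RPC_SNAPSHOTS_DEPEND_ON_VOLUME),
			// ^ hypothetical: would tell sidecars that snapshots pin the source volume.
		},
	}, nil
}
```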

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 11, 2020
@sdodsley

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 11, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 9, 2020
@xing-yang
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 19, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 17, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 16, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

> Rotten issues close after 30d of inactivity.
> Reopen the issue with /reopen.
> Mark the issue as fresh with /remove-lifecycle rotten.
>
> Send feedback to sig-contributor-experience at kubernetes/community.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@xing-yang
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 3, 2021
@xing-yang xing-yang reopened this Dec 3, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 3, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
>
>   • After 90d of inactivity, lifecycle/stale is applied
>   • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
>   • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
>
> You can:
>
>   • Reopen this issue or PR with /reopen
>   • Mark this issue or PR as fresh with /remove-lifecycle rotten
>   • Offer to help out with Issue Triage
>
> Please send feedback to sig-contributor-experience at kubernetes/community.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kbsonlong pushed a commit to kbsonlong/external-provisioner that referenced this issue Dec 29, 2023
jsafrane added a commit to jsafrane/external-provisioner that referenced this issue May 13, 2024
adb3af9df Merge pull request kubernetes-csi#252 from bells17/update-go-version
b82ee3888 Merge pull request kubernetes-csi#253 from bells17/fix-typo
c31745621 Fix typo
0a7850561 Bump to Go 1.22.3
edd89ad58 Merge pull request kubernetes-csi#251 from jsafrane/add-logcheck
043fd0991 Add test-logcheck target
d7535ae0c Merge pull request kubernetes-csi#250 from jsafrane/go-1.22
b52e7ad35 Update go to 1.22.2
14fdb6f66 Merge pull request kubernetes-csi#247 from msau42/prow
9b4352e9f Update release playbook
c7bb972cc Fix release notes script to use fixed tags
463a0e9f5 Add script to update specific go modules

git-subtree-dir: release-tools
git-subtree-split: adb3af9dfa3ed4d1a922cd839bb48e0b73918617
hime added a commit to hime/external-provisioner that referenced this issue Jun 13, 2024
f40f0ccd4 Merge pull request kubernetes-csi#256 from solumath/master
cfa92106c Instruction update
379a1bb9b Merge pull request kubernetes-csi#255 from humblec/sidecar-md
a5667bbbb fix typo in sidecar release process
49676850e Merge pull request kubernetes-csi#254 from bells17/add-github-actions
d9bd160c2 Update skip list in codespell GitHub Action
adb3af9df Merge pull request kubernetes-csi#252 from bells17/update-go-version
f5aebfc9f Add GitHub Actions workflows
b82ee3888 Merge pull request kubernetes-csi#253 from bells17/fix-typo
c31745621 Fix typo
0a7850561 Bump to Go 1.22.3
edd89ad58 Merge pull request kubernetes-csi#251 from jsafrane/add-logcheck
043fd0991 Add test-logcheck target
d7535ae0c Merge pull request kubernetes-csi#250 from jsafrane/go-1.22
b52e7ad35 Update go to 1.22.2
14fdb6f66 Merge pull request kubernetes-csi#247 from msau42/prow
9b4352e9f Update release playbook
c7bb972cc Fix release notes script to use fixed tags
463a0e9f5 Add script to update specific go modules

git-subtree-dir: release-tools
git-subtree-split: f40f0ccd458f2d4555e3ca98d69b5a984bae0f14