
VolumeDelete may fail if Snapshots for volume not deleted first #252

Closed

saad-ali opened this issue Mar 14, 2019 · 25 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@saad-ali
Member

Per container-storage-interface/spec#347, CSI permits a storage system to not delete a volume if it has outstanding snapshots and the storage system does not treat snapshots as independent entities (this is not recommended, but the spec permits it).

We should figure out how we handle this. Is there some programmatic recovery we should implement? If that is not possible, let's make sure the correct errors are surfaced to users to allow them to handle this, and document it somewhere.
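For concreteness, here is a minimal sketch (in Go, against the container-storage-interface spec library) of the behavior the spec permits: DeleteVolume refusing while snapshots still exist. The Driver type and the listSnapshotsForVolume/deleteBackendVolume helpers are hypothetical stand-ins for a real backend; only the csi and grpc packages are real.

```go
package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// DeleteVolume refuses to delete a volume that still has dependent
// snapshots, returning FAILED_PRECONDITION as the spec permits.
func (d *Driver) DeleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest) (*csi.DeleteVolumeResponse, error) {
	snaps, err := d.listSnapshotsForVolume(ctx, req.GetVolumeId()) // hypothetical backend call
	if err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}
	if len(snaps) > 0 {
		// Surface an actionable message so the user knows what is
		// blocking the delete and how to resolve it.
		return nil, status.Errorf(codes.FailedPrecondition,
			"volume %s has %d dependent snapshot(s); delete them first",
			req.GetVolumeId(), len(snaps))
	}
	if err := d.deleteBackendVolume(ctx, req.GetVolumeId()); err != nil { // hypothetical backend call
		return nil, status.Error(codes.Internal, err.Error())
	}
	return &csi.DeleteVolumeResponse{}, nil
}
```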

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 12, 2019
@msau42
Collaborator

msau42 commented Jun 20, 2019

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 20, 2019
@dsupure

dsupure commented Mar 6, 2020

We implemented the spec behavior by returning the FailedPrecondition error code when there are snapshots associated with the volume (PVC). However, when we delete the PVC, the external-provisioner retries the delete several times and eventually lets k8s delete the PVC anyway. Is that the expected behavior? We would expect the PVC to remain, with a warning or error in its events prompting the user to take further action (i.e. delete the snapshots).

It seems the current spec definition leaves dangling volumes and snapshots, and lets the state of k8s and the storage system fall out of sync.

We are using k8s 1.17 + external-provisioner 1.5.

Here is an example log showing this behavior:

time="2020-03-05T23:07:19Z" level=error msg="Cannot delete the volume because there are snapshot(s) associated with it." snapList="[0xc00019e6c0]" span="{}" volID="&{k8s pvc-b8f31a2a-50c1-4883-8ac3-056cb227198a}"
Mar 05 23:07:19 k8s-dsu-98d72-2 sha256:2130c4e026a538046333e953b06743a8a9a74e1b5fec5c5cdde0e1ddc03bd4a4/k8s_csi-provisioner_pure-provisioner-0_default_439823e9-d66f-4a3a-b03c-5e82185fa77b_0/907419a05a18[24269]: E0305 23:07:19.408099       1 controller.go:1334] delete "pvc-b8f31a2a-50c1-4883-8ac3-056cb227198a": volume deletion failed: rpc error: code = FailedPrecondition desc = Cannot delete the volume because there are snapshot(s) associated with it.
Mar 05 23:07:19 k8s-dsu-98d72-2 sha256:2130c4e026a538046333e953b06743a8a9a74e1b5fec5c5cdde0e1ddc03bd4a4/k8s_csi-provisioner_pure-provisioner-0_default_439823e9-d66f-4a3a-b03c-5e82185fa77b_0/907419a05a18[24269]: W0305 23:07:19.408860       1 controller.go:936] Retrying syncing volume "pvc-b8f31a2a-50c1-4883-8ac3-056cb227198a", failure 0
Mar 05 23:07:20 k8s-dsu-98d72-2 sha256:2130c4e026a538046333e953b06743a8a9a74e1b5fec5c5cdde0e1ddc03bd4a4/k8s_csi-provisioner_pure-provisioner-0_default_439823e9-d66f-4a3a-b03c-5e82185fa77b_0/907419a05a18[24269]: E0305 23:07:20.452400       1 controller.go:1334] delete "pvc-b8f31a2a-50c1-4883-8ac3-056cb227198a": volume deletion failed: rpc error: code = FailedPrecondition desc = Cannot delete the volume because there are snapshot(s) associated with it.
Mar 05 23:07:20 k8s-dsu-98d72-2 sha256:2130c4e026a538046333e953b06743a8a9a74e1b5fec5c5cdde0e1ddc03bd4a4/k8s_csi-provisioner_pure-provisioner-0_default_439823e9-d66f-4a3a-b03c-5e82185fa77b_0/907419a05a18[24269]: W0305 23:07:20.452595       1 controller.go:936] Retrying syncing volume "pvc-b8f31a2a-50c1-4883-8ac3-056cb227198a", failure 1
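The log above shows the provisioner retrying a FAILED_PRECONDITION error as if it were transient. A caller can distinguish this terminal condition from retryable failures by inspecting the gRPC status code; a minimal sketch using the standard grpc status package:

```go
package main

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isDependentSnapshotError reports whether a DeleteVolume error is the
// spec's FAILED_PRECONDITION, i.e. retrying without user intervention
// (deleting the snapshots first) cannot succeed.
func isDependentSnapshotError(err error) bool {
	st, ok := status.FromError(err)
	return ok && st.Code() == codes.FailedPrecondition
}
```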

@sdodsley

sdodsley commented Mar 6, 2020

The spec seems to be at odds with the implementation.
If the spec allows a driver to return FailedPrecondition, that error shouldn't just be ignored; what's the point of having that definition in the spec otherwise?
It seems like there should be a finalizer on the PV(C) that either deletes all associated dependent snapshots or holds the PV deletion until all associated snapshots are removed.
Independent snapshots obviously wouldn't be held to this restriction.

@xing-yang
Contributor

I wonder if we should add a new SNAPSHOTS_DEPEND_ON_VOLUME controller capability. If it is true, the snapshot controller can add a finalizer on the source PVC when taking a snapshot and remove it after the last snapshot is deleted. There's already a finalizer protecting the source PVC today, but it is only added while the snapshot is being created and removed once the snapshot is ready to use. Maybe we can extend the life of that finalizer if SNAPSHOTS_DEPEND_ON_VOLUME is true.
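As a rough illustration of that idea (a sketch only: the finalizer name below is hypothetical, not one the snapshot controller uses today), adding the finalizer with client-go would look roughly like this; removal after the last snapshot is deleted would be the mirror image:

```go
package snapshotcontroller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// pvcSnapshotFinalizer is a hypothetical finalizer name for the
// proposed SNAPSHOTS_DEPEND_ON_VOLUME behavior.
const pvcSnapshotFinalizer = "snapshot.storage.kubernetes.io/pvc-has-snapshots"

// addPVCFinalizer blocks deletion of the source PVC until the
// controller removes the finalizer after the last snapshot is gone.
func addPVCFinalizer(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	pvc, err := c.CoreV1().PersistentVolumeClaims(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, f := range pvc.Finalizers {
		if f == pvcSnapshotFinalizer {
			return nil // already present
		}
	}
	pvc.Finalizers = append(pvc.Finalizers, pvcSnapshotFinalizer)
	_, err = c.CoreV1().PersistentVolumeClaims(ns).Update(ctx, pvc, metav1.UpdateOptions{})
	return err
}
```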

@xing-yang
Contributor

/remove-lifecycle frozen

@dsupure

dsupure commented Mar 9, 2020

I think SNAPSHOTS_DEPEND_ON_VOLUME is a good idea: we can use a finalizer to prevent the PVC from being deleted. It seems we also need a refCount indicating the number of snapshots associated with it, and we can only remove the finalizer once the refCount is zero. However, who should own this refCount: etcd, the external-snapshotter, or the SP? Note that the refCount either needs to be persisted in etcd or fetched from the SP.
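One way to sidestep the ownership question is to not persist a refCount at all and recompute it on demand: the API server already stores every VolumeSnapshot, so counting the ones whose source is the PVC gives the reference count. A sketch, assuming the typed clientset from external-snapshotter (shown against the v1 API; older clusters would use v1beta1):

```go
package snapshotcontroller

import (
	"context"

	snapclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// snapshotCountForPVC counts the VolumeSnapshots in a namespace whose
// source is the given PVC; a finalizer could be removed when it hits 0.
func snapshotCountForPVC(ctx context.Context, c snapclient.Interface, ns, pvcName string) (int, error) {
	list, err := c.SnapshotV1().VolumeSnapshots(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return 0, err
	}
	n := 0
	for _, s := range list.Items {
		if src := s.Spec.Source.PersistentVolumeClaimName; src != nil && *src == pvcName {
			n++
		}
	}
	return n, nil
}
```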

@ShyamsundarR
Contributor

> I wonder if we should add a new SNAPSHOTS_DEPEND_ON_VOLUME controller capability. If it is true, the snapshot controller can add a finalizer on the source PVC when taking a snapshot and remove it after the last snapshot is deleted. There's already a finalizer protecting the source PVC today, but it is only added while the snapshot is being created and removed once the snapshot is ready to use. Maybe we can extend the life of that finalizer if SNAPSHOTS_DEPEND_ON_VOLUME is true.

Correct me if I have misunderstood the following.

The way to recover (roll back) a PVC to one of its snapshots would be to delete the PVC and recreate it with a DataSource pointing to the snapshot to recover from.

If we add the said finalizers, the PVC cannot be deleted until all snapshots against it are deleted. That leaves no snapshot to revert to in the above workflow, and no way to recreate the PVC, since it is not deleted until the snapshots are.

Also, in my testing the PVC is deleted but the PV remains; it is the PV deletion that is retried (with exponential backoff) until it succeeds. This currently frees up re-creating the PVC with the same name and namespace but pointing to a snapshot as the DataSource.

If we want to handle this gracefully, the said finalizers could instead be placed on the PV, with PV deletion not attempted until they are removed. That would eliminate the unnecessary PV delete calls and is possibly what is needed to address this issue.
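For reference, a minimal sketch of the replacement PVC in the delete-and-recreate workflow described above: same name and namespace, with a DataSource pointing at the snapshot to restore from. Built here with the core/v1 Go types of the era; all names and the size are placeholders, and on recent client-go the Resources field is corev1.VolumeResourceRequirements instead.

```go
package restore

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pvcFromSnapshot builds a PVC that restores from a VolumeSnapshot,
// e.g. pvcFromSnapshot("default", "mypvc", "mysnap", resource.MustParse("10Gi")).
func pvcFromSnapshot(ns, name, snapName string, size resource.Quantity) *corev1.PersistentVolumeClaim {
	apiGroup := "snapshot.storage.k8s.io"
	return &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Namespace: ns, Name: name},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			DataSource: &corev1.TypedLocalObjectReference{
				APIGroup: &apiGroup,
				Kind:     "VolumeSnapshot",
				Name:     snapName,
			},
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{corev1.ResourceStorage: size},
			},
		},
	}
}
```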

@sdodsley

@ShyamsundarR the way most enterprise-class external storage arrays work with snapshots gives them the ability to restore a snapshot directly over the existing volume without having to delete the volume first.
This means that the PVC should not need to be deleted and recreated. The existing PV and PVC are retained.

@ShyamsundarR
Contributor

> @ShyamsundarR the way most enterprise-class external storage arrays work with snapshots gives them the ability to restore a snapshot directly over the existing volume without having to delete the volume first.
> This means that the PVC should not need to be deleted and recreated. The existing PV and PVC are retained.

Fair enough: so a rollback of the volume to a snapshot. I work with Ceph, which has similar rollback options for block-device (rbd) based volumes/images, so I understand the concept.

But how can this rollback be instrumented via CSI and kubernetes?

Snapshot creation is instrumented via kubernetes and CSI; the rollback, IMHO, should be instrumented via similar means in the orchestration plane. That would also help ensure the volume being rolled back is not in active use by any workload, so a recovery (or rollback) can be initiated safely.

There is currently no scheme for this, afaict, unless it is done using the storage provider's external interfaces. Is the future direction changing in this regard?

Also, there will be storage providers that do not support such a rollback; in those cases the scheme should cover deleting and recreating the volume from a snapshot, IMO. So we may need to ensure the solution caters to both use cases.

@humblec
Contributor

humblec commented Mar 12, 2020

> I wonder if we should add a new SNAPSHOTS_DEPEND_ON_VOLUME controller capability. If it is true, the snapshot controller can add a finalizer on the source PVC when taking a snapshot and remove it after the last snapshot is deleted. There's already a finalizer protecting the source PVC today, but it is only added while the snapshot is being created and removed once the snapshot is ready to use. Maybe we can extend the life of that finalizer if SNAPSHOTS_DEPEND_ON_VOLUME is true.

@xing-yang I think this is a good idea! It gives storage vendors the flexibility to operate in the mode they want. As I understand it, snapshot/clone/promotion behavior depends on the storage backend in place, and each backend implements it in its own way: some keep references, some untangle the snapshot completely from the source volume, etc. Giving the driver a way to advertise its relation to the parent volume opens this up, and imo that's a better take than going ahead with some restricted path.
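To make the proposal concrete: drivers already advertise what they support via ControllerGetCapabilities, and the proposed flag would slot in alongside the existing RPC capabilities. In this sketch only the two listed capabilities are real spec constants; SNAPSHOTS_DEPEND_ON_VOLUME does not exist in the spec and is shown only as a comment where it would go.

```go
package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// ControllerGetCapabilities advertises the controller RPCs this driver supports.
func (d *Driver) ControllerGetCapabilities(ctx context.Context, req *csi.ControllerGetCapabilitiesRequest) (*csi.ControllerGetCapabilitiesResponse, error) {
	rpcCap := func(t csi.ControllerServiceCapability_RPC_Type) *csi.ControllerServiceCapability {
		return &csi.ControllerServiceCapability{
			Type: &csi.ControllerServiceCapability_Rpc{
				Rpc: &csi.ControllerServiceCapability_RPC{Type: t},
			},
		}
	}
	return &csi.ControllerGetCapabilitiesResponse{
		Capabilities: []*csi.ControllerServiceCapability{
			rpcCap(csi.ControllerServiceCapability_RPC_CREATE_DELETE_VOLUME),
			rpcCap(csi.ControllerServiceCapability_RPC_CREATE_DELETE_SNAPSHOT),
			// rpcCap(csi.ControllerServiceCapability_RPC_SNAPSHOTS_DEPEND_ON_VOLUME),
			// ^ hypothetical: would tell sidecars that snapshots pin the source volume.
		},
	}, nil
}
```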

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 11, 2020
@sdodsley

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 11, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 9, 2020
@xing-yang
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 19, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 17, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 16, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

> Rotten issues close after 30d of inactivity.
> Reopen the issue with /reopen.
> Mark the issue as fresh with /remove-lifecycle rotten.
>
> Send feedback to sig-contributor-experience at kubernetes/community.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@xing-yang
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 3, 2021
@xing-yang xing-yang reopened this Dec 3, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 3, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 2, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
>
>   • After 90d of inactivity, lifecycle/stale is applied
>   • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
>   • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
>
> You can:
>
>   • Reopen this issue or PR with /reopen
>   • Mark this issue or PR as fresh with /remove-lifecycle rotten
>   • Offer to help out with Issue Triage
>
> Please send feedback to sig-contributor-experience at kubernetes/community.
>
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kbsonlong pushed a commit to kbsonlong/external-provisioner that referenced this issue Dec 29, 2023
jsafrane added a commit to jsafrane/external-provisioner that referenced this issue May 13, 2024
adb3af9df Merge pull request kubernetes-csi#252 from bells17/update-go-version
b82ee3888 Merge pull request kubernetes-csi#253 from bells17/fix-typo
c31745621 Fix typo
0a7850561 Bump to Go 1.22.3
edd89ad58 Merge pull request kubernetes-csi#251 from jsafrane/add-logcheck
043fd0991 Add test-logcheck target
d7535ae0c Merge pull request kubernetes-csi#250 from jsafrane/go-1.22
b52e7ad35 Update go to 1.22.2
14fdb6f66 Merge pull request kubernetes-csi#247 from msau42/prow
9b4352e9f Update release playbook
c7bb972cc Fix release notes script to use fixed tags
463a0e9f5 Add script to update specific go modules

git-subtree-dir: release-tools
git-subtree-split: adb3af9dfa3ed4d1a922cd839bb48e0b73918617
hime added a commit to hime/external-provisioner that referenced this issue Jun 13, 2024
f40f0ccd4 Merge pull request kubernetes-csi#256 from solumath/master
cfa92106c Instruction update
379a1bb9b Merge pull request kubernetes-csi#255 from humblec/sidecar-md
a5667bbbb fix typo in sidecar release process
49676850e Merge pull request kubernetes-csi#254 from bells17/add-github-actions
d9bd160c2 Update skip list in codespell GitHub Action
adb3af9df Merge pull request kubernetes-csi#252 from bells17/update-go-version
f5aebfc9f Add GitHub Actions workflows
b82ee3888 Merge pull request kubernetes-csi#253 from bells17/fix-typo
c31745621 Fix typo
0a7850561 Bump to Go 1.22.3
edd89ad58 Merge pull request kubernetes-csi#251 from jsafrane/add-logcheck
043fd0991 Add test-logcheck target
d7535ae0c Merge pull request kubernetes-csi#250 from jsafrane/go-1.22
b52e7ad35 Update go to 1.22.2
14fdb6f66 Merge pull request kubernetes-csi#247 from msau42/prow
9b4352e9f Update release playbook
c7bb972cc Fix release notes script to use fixed tags
463a0e9f5 Add script to update specific go modules

git-subtree-dir: release-tools
git-subtree-split: f40f0ccd458f2d4555e3ca98d69b5a984bae0f14