
Remove createSnapshotContentRetryCount and createSnapshotContentInterval #211

Merged: 4 commits into kubernetes-csi:master on Dec 20, 2019

Conversation

@xing-yang (Collaborator) commented Dec 17, 2019

What type of PR is this?
/kind cleanup

What this PR does / why we need it:
Removes createSnapshotContentRetryCount and createSnapshotContentInterval
from command line options.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
During review of the README PR (#206), it was suggested that these parameters should just be internal variables, not command line options.

Does this PR introduce a user-facing change?:

Removes createSnapshotContentRetryCount and createSnapshotContentInterval
from command line options.

@k8s-ci-robot added the release-note, kind/cleanup, and cncf-cla: yes labels on Dec 17, 2019
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved and size/S labels on Dec 17, 2019
@xing-yang (Collaborator, Author)

/assign @yuxiangqian

@@ -42,6 +42,12 @@ import (
"k8s.io/kubernetes/pkg/util/goroutinemap"
)

// Number of retries when we create a snapshot content object for a snapshot.
const createSnapshotContentRetryCount = 5
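
For context, this constant drives a bounded retry loop inside a single sync pass, along the lines of the minimal sketch below. This is an illustration only, not the project's actual code; createWithRetry and the 10-second interval are assumptions.

package main

import (
	"errors"
	"fmt"
	"time"
)

// Internal equivalents of the removed command line options; the interval
// value is an assumption for illustration.
const (
	createSnapshotContentRetryCount = 5
	createSnapshotContentInterval   = 10 * time.Second
)

// createWithRetry (hypothetical) retries an API create a fixed number of
// times inside one sync pass, sleeping between attempts.
func createWithRetry(create func() error) error {
	var err error
	for i := 0; i < createSnapshotContentRetryCount; i++ {
		if err = create(); err == nil {
			return nil
		}
		time.Sleep(createSnapshotContentInterval)
	}
	return fmt.Errorf("after %d attempts: %v", createSnapshotContentRetryCount, err)
}

func main() {
	fmt.Println(createWithRetry(func() error { return errors.New("apiserver unavailable") }))
}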
Collaborator:

My higher-level question is: why do we need to retry within the workqueue func? Can we just re-add the snapshot to the queue for retry whenever there's a failure, and let the workqueue properly throttle the retries?

Collaborator Author:

I thought this was the recommended practice for API updates? The PV controller also retries when a PV object fails to be created. I remember when I first started working on snapshots in the external-storage repo, I was asked to add this type of retry as a bug fix.

Collaborator:

I believe the PV controller does it because of the possibility of leaking the volume. I don't think we have the same issue here, since we are creating the content object first before calling CSI. Cc @jsafrane

Contributor:

> I don't think we have the same issue here since we are creating the content object first before calling CSI

+1. In the PV controller, a volume has already been created on the storage backend by the time the controller saves a PV, and it really needs to save it; otherwise the volume may be leaked.

In snapshots, nothing is leaked if SnapshotContent is not created, right?

Collaborator Author:

Yes, in the beta version we changed the behavior to create the VolumeSnapshotContent first, before creating the physical snapshot resource.

We also retry when updating a snapshot fails. In general, is it recommended practice to retry when an API object update fails?

Collaborator:

The general pattern is to requeue it on any failure and let workqueue schedule the retry.
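
Sketched minimally with client-go's workqueue, assuming a stand-in syncSnapshot handler in place of the controller's real sync logic, that pattern looks like this:

package main

import "k8s.io/client-go/util/workqueue"

// syncSnapshot stands in for the controller's real per-key sync handler.
func syncSnapshot(key string) error { return nil }

// processNextWorkItem requeues on any failure and lets the queue's rate
// limiter schedule the retry; on success it resets the key's backoff.
func processNextWorkItem(queue workqueue.RateLimitingInterface) bool {
	key, quit := queue.Get()
	if quit {
		return false
	}
	defer queue.Done(key)

	if err := syncSnapshot(key.(string)); err != nil {
		queue.AddRateLimited(key) // retry later, with backoff
		return true
	}
	queue.Forget(key) // success: reset this key's failure count
	return true
}

func main() {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	defer queue.ShutDown()
	queue.Add("default/snap-1")
	processNextWorkItem(queue)
}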

Collaborator Author:

Removed the RetryCount.

@k8s-ci-robot added the size/L label and removed the size/S label on Dec 18, 2019
if err != nil {
	klog.V(4).Infof("failed to update snapshot %s status: %v", utils.SnapshotKey(snapshot), err)
	// update snapshot status failed
	ctrl.updateSnapshotErrorStatusWithEvent(snapshot, v1.EventTypeWarning, "SnapshotStatusUpdateFailed", fmt.Sprintf("Snapshot status update failed, %v", err))
	return err
Collaborator:

Looking at the caller, I see that we don't requeue the object and instead rely on the resync period to retry. I think we should change the behavior to requeue, but we should also make sure the workqueue has proper throttling/backoff as well.

Collaborator Author:

Use queue.AddRateLimited(key)?

Collaborator Author:

Can you take a look at this? 10017c2

Collaborator:

Yes, I like this direction. Also cc @yuxiangqian and @jsafrane for more eyes. This will cause the snapshotter to potentially retry more frequently than it has in the past, although it should be rate limited.

Collaborator Author:

I wonder if we should delay this change. It changes snapshot controller behavior a lot. In the past we haven't retried when creating a snapshot fails, because snapshots are time sensitive (other than a few retries when creating/updating API objects).

Can we revisit this after the 2.0 release? We have a pending fix for the create snapshot timeout issue as well that we should go back to after 2.0.

Contributor:

I'm a little concerned about removing the retry logic for status updates and snapshot content creation, from the application perspective. Do we have a rough idea how often a retry would resolve API server failures? An application could end up relying on the existence of the status field of a snapshot to decide whether or not to unquiesce. With a rate-limited requeue, the waiting time could be significantly longer?
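
For a rough sense of that waiting time: assuming client-go's default controller rate limiter, whose per-item exponential backoff starts at 5 ms, doubles per failure, and is capped at 1000 s, the delays grow as in the illustrative sketch below, so even the tenth consecutive failure of one item waits only about 2.5 s.

package main

import (
	"fmt"
	"time"
)

// Prints the per-item delay schedule of an exponential failure rate
// limiter with client-go's default parameters (assumed here: 5ms base,
// 1000s cap), for illustration only.
func main() {
	delay, maxDelay := 5*time.Millisecond, 1000*time.Second
	for i := 1; i <= 12; i++ {
		fmt.Printf("failure %2d: wait %v\n", i, delay)
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
}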

Collaborator Author:

Removed snapshotQueue.AddRateLimited().

What is the conclusion here? Should I just add back createSnapshotContentRetryCount as an internal const, or is the current PR okay? @msau42

Collaborator:

I don't like having retries within retries. It makes timing hard to reason about (metrics look strange), and we can end up overloading the API server, since there are two different throttling behaviors that need to be controlled. We previously had issues in external-provisioner due to multiple retries within retries, and I'd like to avoid that here too. We can tune the rate limiting as necessary.

But I agree this is a large, fundamental change to how the snapshot controller has operated, so I'm OK with deferring it to the next release.
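
If the requeue direction is picked up later, the rate limiting could be tuned where the queue is constructed. A sketch under assumed limits (the numbers here are illustrative, not the project's settings):

package main

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The queue waits for whichever limiter is slower.
	limiter := workqueue.NewMaxOfRateLimiter(
		// Per item: 1s after the first failure, doubling up to 5 minutes.
		workqueue.NewItemExponentialFailureRateLimiter(time.Second, 5*time.Minute),
		// Overall: at most 10 requeues/sec with bursts of up to 100.
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
	queue := workqueue.NewNamedRateLimitingQueue(limiter, "snapshots")
	defer queue.ShutDown()
}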

Collaborator:

Yes, let's add the retry count back for now, so the only change is not making the retry count configurable.

Collaborator Author:

Sure.

@@ -365,8 +362,10 @@ func (ctrl *csiSnapshotCommonController) updateSnapshot(snapshot *crdv1.VolumeSn
	klog.V(3).Infof("could not sync claim %q: %+v", utils.SnapshotKey(snapshot), err)
Collaborator:

We don't want to return err in this case either?

@k8s-ci-robot added the size/S label and removed the size/L label on Dec 20, 2019
@msau42 (Collaborator) commented Dec 20, 2019

/lgtm

@k8s-ci-robot added the lgtm label on Dec 20, 2019
@k8s-ci-robot merged commit bbbadbc into kubernetes-csi:master on Dec 20, 2019