Fix context handling in WaitForVolumeAttachment & add in-flight checks to attachment/detachment operations #1621

Merged
merged 1 commit into kubernetes-sigs:master from dv-context
Jun 15, 2023

torredil
Member

@torredil torredil commented Jun 2, 2023

What is this PR about? / Why do we need it?

cloud.go changes:

This PR fixes incorrect handling of the context in WaitForAttachmentState, which uses exponential backoff to poll the DescribeVolumesWithContext API until an attachment reaches the expected state.

The context carries deadlines and cancellation signals across API boundaries. While polling for volume state, it's possible for the context to be cancelled, for example because its deadline expired. If that happens while we're waiting for the volume to reach the expected state, we should stop waiting and immediately return an error to the caller. With this change, the retry loop stops as soon as the context is cancelled or reaches its deadline, instead of continuing to retry after the result is no longer needed.


controller.go changes:

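Added in-flight checks to both ControllerPublishVolume and ControllerUnpublishVolume. The in-flight check is a synchronization mechanism: it provides mutual exclusion so that duplicate concurrent requests for the same operation are not processed at the same time, which helps keep these calls idempotent.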


Testing changes:

Added tests that cover the code paths and logic for attaching and detaching volumes. Also refactored the existing tests toward a table-driven style, which improves their readability and maintainability.

What testing is done?

  • Manual
  • CI

With this patch:

I0602 17:16:29.032300       1 controller.go:370] "ControllerPublishVolume: attached" volumeID="vol-0163b0b755696dcdc" nodeID="i-07532540ce852721b" devicePath="/dev/xvdal"
I0602 17:16:47.545514       1 controller.go:362] "ControllerPublishVolume: attaching" volumeID="vol-0bf9e9d3205eb9fe7" nodeID="i-07532540ce852721b"
I0602 17:16:49.352063       1 controller.go:370] "ControllerPublishVolume: attached" volumeID="vol-0bf9e9d3205eb9fe7" nodeID="i-07532540ce852721b" devicePath="/dev/xvdap"
I0602 17:16:49.358931       1 controller.go:362] "ControllerPublishVolume: attaching" volumeID="vol-0bf9e9d3205eb9fe7" nodeID="i-07532540ce852721b"
I0602 17:16:49.519855       1 controller.go:370] "ControllerPublishVolume: attached" volumeID="vol-0bf9e9d3205eb9fe7" nodeID="i-07532540ce852721b" devicePath="/dev/xvdap"

Without this patch:

I0524 20:56:35.690733       1 inflight.go:74] "Node Service: volume operation finished" key="snapshot-299d9d27-a925-4954-93df-80ccd2105e85"
I0524 20:56:35.711511       1 cloud.go:713] "Waiting for volume state" volumeID="vol-05ee7370b8091e868" actual="detaching" desired="detached"
I0524 20:56:36.174965       1 inflight.go:74] "Node Service: volume operation finished" key="pvc-2a9b6b41-dc69-42ff-aa57-ec6bfcbd73e0"
I0524 20:56:36.176699       1 cloud.go:713] "Waiting for volume state" volumeID="vol-08a1528c35dfeb1ff" actual="detaching" desired="detached"
I0524 20:56:36.205657       1 cloud.go:654] "Ignoring error from describe volume, will retry" volumeID="vol-0aecdf207857b8263" err=<
	RequestCanceled: request context canceled
	caused by: context deadline exceeded
 >
I0524 20:57:31.4134426       1 cloud.go:654] "Ignoring error from describe volume, will retry" volumeID="vol-0aecdf207857b8263" err=<
	RequestCanceled: request context canceled
	caused by: context deadline exceeded
 >
...

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 2, 2023
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 2, 2023
@torredil torredil marked this pull request as draft June 2, 2023 15:36
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 2, 2023
@torredil
Member Author

torredil commented Jun 2, 2023

/retest

@torredil torredil marked this pull request as ready for review June 2, 2023 17:21
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 2, 2023
@k8s-ci-robot k8s-ci-robot requested a review from rdpsin June 2, 2023 17:22
@torredil torredil force-pushed the dv-context branch 2 times, most recently from 2ebbb80 to 62c5510 Compare June 6, 2023 22:27
Contributor

@ConnorJC3 ConnorJC3 left a comment

lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 14, 2023
…s to attachment/detachment operations

Signed-off-by: Eddie Torres <torredil@amazon.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2023
@hanyuel
Contributor

hanyuel commented Jun 15, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2023
@ConnorJC3
Contributor

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ConnorJC3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 15, 2023
@k8s-ci-robot k8s-ci-robot merged commit 663e614 into kubernetes-sigs:master Jun 15, 2023
4 participants