Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PartiallyFailed phase for backups, don't exit on first error #1386

Merged
merged 1 commit into from
Apr 26, 2019

Conversation

skriss
Copy link
Contributor

@skriss skriss commented Apr 19, 2019

rel #1371

This PR adds the PartiallyFailed phase for backups. It modifies the backup execution logic so that rather than immediately exiting when the first error is encountered, the error is logged and execution continues. At the end, the number of errors logged is counted, and a >0 result moves the backup to PartiallyFailed.

Still a WIP, but the basic functionality is working. An easy way to see what it looks like is to create a cluster role binding for a cluster role that doesn't exist, to a service account in the namespace you're backing up -- this will create a "partial failure" error.

@nrb @carlisia appreciate any input at this point!

@skriss skriss requested review from carlisia and nrb April 19, 2019 15:02
@skriss
Copy link
Contributor Author

skriss commented Apr 19, 2019

output of velero backup get:

NAME             STATUS            CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
partial-fail     PartiallyFailed   2019-04-19 08:40:53 -0600 MDT   29d       default            <none>
partial-fail-2   PartiallyFailed   2019-04-19 08:52:02 -0600 MDT   29d       default            <none>

output of velero backup logs partial-fail-2 | grep error:

time="2019-04-19T14:52:16Z" level=info msg="1 errors encountered backup up item" backup=velero/partial-fail-2 group=v1 logSource="pkg/backup/resource_backupper.go:265" name=velero namespace=velero resource=serviceaccounts
time="2019-04-19T14:52:16Z" level=error msg="Error backing up item" backup=velero/partial-fail-2 error="clusterroles.rbac.authorization.k8s.io \"non-existent\" not found" error.file="/go/src/github.com/heptio/velero/pkg/backup/item_backupper.go:328" error.function="github.com/heptio/velero/pkg/backup.(*defaultItemBackupper).executeActions" group=v1 logSource="pkg/backup/resource_backupper.go:267" name=velero namespace=velero resource=serviceaccounts

Copy link
Contributor

@nrb nrb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is looking good!

pkg/controller/backup_controller.go Outdated Show resolved Hide resolved
@skriss skriss marked this pull request as ready for review April 22, 2019 16:45
@skriss
Copy link
Contributor Author

skriss commented Apr 22, 2019

Made a few more updates - this is ready for review. Still testing.

Copy link
Contributor

@carlisia carlisia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super minor suggestions. Besides those, I didn't see in this code any logging of warning. I suppose you are displaying warnings that we were already adding.

pkg/backup/backup.go Show resolved Hide resolved
pkg/backup/backup.go Outdated Show resolved Hide resolved
pkg/backup/backup_test.go Show resolved Hide resolved
pkg/backup/resource_backupper.go Outdated Show resolved Hide resolved
pkg/controller/backup_controller.go Outdated Show resolved Hide resolved
@skriss
Copy link
Contributor Author

skriss commented Apr 23, 2019

addressed comments in latest commits

Copy link
Contributor

@carlisia carlisia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just needs rebasing.

Signed-off-by: Steve Kriss <krisss@vmware.com>
@skriss
Copy link
Contributor Author

skriss commented Apr 25, 2019

rebased/squashed/changelogged

@carlisia carlisia merged commit 8392e6d into vmware-tanzu:master Apr 26, 2019
@skriss skriss deleted the backup-error-count branch April 26, 2019 16:22
jessestuart pushed a commit to jessestuart/velero that referenced this pull request May 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants