Convey snapshot information #658

Merged — merged 8 commits from use-snapshot-information-46 into master on Sep 8, 2014

Conversation

@exarkun (Contributor) commented Sep 4, 2014

This partially addresses #46 - but does not completely resolve it.

Added here is a new flocker-volume snapshots command and a new method on IRemoteVolumeManager to invoke it. This allows the pushing side of an interaction to learn what snapshots exist on the receiving side of that interaction. A follow-up branch will use this information to generate incremental streams (using the functionality implemented in #657).
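For orientation, here is a minimal sketch of what such an interface method could look like, assuming a zope.interface-style declaration; the method name, signature, and docstring are illustrative guesses, not the actual code in this branch:

```python
from zope.interface import Interface


class IRemoteVolumeManager(Interface):
    """
    Operations that can be performed against a volume manager running on
    another node (illustrative subset only).
    """

    def snapshots(volume):
        """
        Return the names of the snapshots that exist on the remote node for
        the given volume's filesystem, e.g. by invoking
        ``flocker-volume snapshots`` over SSH and parsing its output.
        """
```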

This branch is based on #657. Review that first, then recompute the diff to review this one.

@wallrj (Contributor) commented Sep 5, 2014

Thanks @exarkun

I've only glanced at the code. It all looks good. What wasn't clear to me (yet) was what triggers a snapshot to be taken after the initial push... oh, I see that the snapshot occurs implicitly on every call to f.v.zfs.FileSystem.reader (sketched below).
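(For context, a rough sketch of that implicit-snapshot behaviour, assuming a context-manager style reader; the names and the use of ``zfs send`` here are illustrative, not Flocker's actual implementation:)

```python
from contextlib import contextmanager
from subprocess import PIPE, Popen, check_call


@contextmanager
def reader(filesystem, snapshot_name):
    """
    Take a new snapshot of ``filesystem`` and yield a file-like object
    streaming its contents (a ``zfs send`` stream).
    """
    snapshot = filesystem + b"@" + snapshot_name
    # The implicit snapshot being discussed: every call creates one
    # before the send begins.
    check_call([b"zfs", b"snapshot", snapshot])
    process = Popen([b"zfs", b"send", snapshot], stdout=PIPE)
    try:
        yield process.stdout
    finally:
        process.stdout.close()
        process.wait()
```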

What happens:

  • if the check_call([b"zfs", b"snapshot", snapshot]) raises CalledProcessError?
  • if the check_output([b"zfs"] + _list_snapshots_command(self)) fails?
  • if RemoteVolumeManager._destination.get_output([b"flocker-volume", ..."snapshots"...) fails?

Will sufficient information be logged (stderr) to allow us to debug the failures?
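One possible way to get that information into the logs (a sketch only, not what this branch does; the run_zfs helper is hypothetical): capture stderr when running the command and log it before re-raising.

```python
import logging
from subprocess import PIPE, CalledProcessError, Popen

logger = logging.getLogger(__name__)


def run_zfs(arguments):
    """
    Run a zfs subcommand, logging its stderr if it exits non-zero.
    """
    command = [b"zfs"] + arguments
    process = Popen(command, stdout=PIPE, stderr=PIPE)
    stdout, stderr = process.communicate()
    if process.returncode:
        logger.error(
            "zfs command %r failed with exit code %d: %s",
            command, process.returncode, stderr)
        raise CalledProcessError(process.returncode, command, output=stdout)
    return stdout
```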

And more generally, I was trying to imagine some failure scenarios, e.g. what would happen if a user, having modified their deployment.yml config, runs flocker-deploy to move an application but realises they've made a mistake: they wanted to move the application to a different node. So they kill flocker-deploy, correct the deployment config, and re-run it. Meanwhile the original phase 1 push:

  • may have completed.
  • may still be in progress.
  • may have failed.

I can't remember how a node knows that it is authoritative for a particular volume. Is it possible that two nodes might identify themselves as authoritative for a volume and attempt to send their data to the new recipient?

But I suspect those general problems (if they are problems) are addressed by the handoff in phase 2 (after stopping the application).

Anyway, that's all I can think of. Please address or answer the points above as you see fit and then merge so that you can continue with the final integration.

@exarkun (Contributor, Author) commented Sep 5, 2014

> The coverage of the new code seems mostly complete except http://build.clusterhq.com/results/use-snapshot-information-46/flocker-1884/complete/flocker_volume_script.html, but the other subcommand options aren't completely unit tested either, so that's probably OK.

I think maybe we have better coverage than the report suggests here. Some of our tests run the CLI tools in child processes. Lines executed in those child processes aren't measured and reported by the coverage tool.
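If it helps, coverage.py has documented support for measuring child processes: set parallel = True in the coverage config, point the COVERAGE_PROCESS_START environment variable at that config when spawning the children, and arrange for each child to call coverage.process_startup() early, for example from a sitecustomize.py (or .pth file) on its path; the per-process data files can then be merged with coverage combine. A minimal sketch of the hook module (nothing here is Flocker-specific):

```python
# sitecustomize.py -- importable by the child processes.
# When COVERAGE_PROCESS_START names a coverage config with
# "parallel = True", each child writes its own .coverage.* data file;
# "coverage combine" merges them into the main report afterwards.
import coverage
coverage.process_startup()
```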

@tomprince (Contributor) commented:

I've filed ClusterHQ/build.clusterhq.com#21 for collecting that coverage information.

@exarkun (Contributor, Author) commented Sep 8, 2014

> Will sufficient information be logged (stderr) to allow us to debug the failures?

Probably not. In general we're doing a really bad job of zfs error reporting right now. My story for now is that this will improve when we switch to libzfs_core bindings. Hopefully we'll do that real soon.

exarkun added a commit that referenced this pull request Sep 8, 2014
Expand the remote volume manager interface so it can convey information about the snapshots that exist for the filesystems it manages.
@exarkun exarkun merged commit e38b3d2 into master Sep 8, 2014
@exarkun exarkun deleted the use-snapshot-information-46 branch September 8, 2014 12:43
@exarkun exarkun removed the accepted label Sep 8, 2014