Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

checkpoint: set default work-dir to image-path #3029

Merged
merged 1 commit into from
Jun 25, 2021

Conversation

liusdu
Copy link

@liusdu liusdu commented Jun 15, 2021

Sump/restrore logs are very useful when process fails.
So put them to image path defaultly,otherwise these files
will be deleted when container exits.

Signed-off-by: Liu Hua weldonliu@tencent.com'

Proposed changelog entry

  • checkpoint/restore: if --work-path is not specified, it defaults to the value of --image-path (same as criu does).

@kolyshkin
Copy link
Contributor

What prevents you from using --work-path instead of --image-path to achieve the same results? If --image-path is not set, it defaults to the value of --work-path, see https://criu.org/Directories).

Ahh I see -- standalone criu binary works as described above, but runc c/r doesn't.

I think we need to modify runc to NOT set WorkDirectory (and CriuOpts.WorkDirFd) if --work-path is not specified. This way runc checkpoint and runc restore will work the same way as criu, which I think is a sane way.

@liusdu can you implement the above?

@liusdu
Copy link
Author

liusdu commented Jun 15, 2021

What prevents you from using --work-path instead of --image-path to achieve the same results? If --image-path is not set, it defaults to the value of --work-path, see https://criu.org/Directories).

Hi @kolyshkin, thanks to you reply. if --images-dir is not set, it defaults to the value of "$(pwd)", not --work-dir. The description is as following:

This is the one specified with the -D|--images-dir $path option. If not given CRIU uses current working directory.

And if --work-dir is not set, it defaults to the values to --images-dir see https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/crtools.c#L133

Ahh I see -- standalone criu binary works as described above, but runc c/r doesn't.

I think we need to modify runc to NOT set WorkDirectory (and CriuOpts.WorkDirFd) if --work-path is not specified. This way runc checkpoint and runc restore will work the same way as criu, which I think is a sane way.

@liusdu can you implement the above?

Yes, It can fix it this way~

@liusdu liusdu force-pushed the work branch 2 times, most recently from de742d4 to 0710bbb Compare June 15, 2021 08:40
@liusdu
Copy link
Author

liusdu commented Jun 15, 2021

@kolyshkin done, please take a look.

@kolyshkin
Copy link
Contributor

@liusdu needs a rebase

@liusdu
Copy link
Author

liusdu commented Jun 15, 2021

@kolyshkin sorry for this, done~

@@ -1012,6 +997,19 @@ func (c *linuxContainer) Checkpoint(criuOpts *CriuOpts) error {
LazyPages: proto.Bool(criuOpts.LazyPages),
}

// if criuOpts.WorkDirectory is not set. let criu sets it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// if criuOpts.WorkDirectory is not set. let criu sets it.
// If criuOpts.WorkDirectory is not set, criu default is used
// (which is to reuse ImagesDirectory as WorkDirectory).

@kolyshkin
Copy link
Contributor

I think you need to amend the commit message to explain how it works (i.e. by not setting the WorkDirectory we let criu use its default, which is to reuse ImagesDirectory, or some such).

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few typos in the commit message:

Dump/restrore logs are very useful when process fails.

restore

New runc puts these files in c.root, which will be deleted

I think you mean "Now".

when container exits. So if checkpinting/restoring failed,

checkpointing

Now runc puts dump/restore logs in c.root defaultly, which will be deleted
when container exits. So if checkpinting/restoring failed, we can not get
these logs and analyze why.

This patch lets criu use its default if --work-path is not set:
 - Use WorkDirectory found in criu's configfile.
 - Use ImageDirectory.

Signed-off-by: Liu Hua <weldonliu@tencent.com>
Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@adrianreber PTAL

@adrianreber
Copy link
Contributor

The only thing I am worried about is if Docker or Podman are relying on this somehow. Would be good to verify that neither of them break with this change.

@liusdu
Copy link
Author

liusdu commented Jun 17, 2021

@adrianreber Podman and Docker/Containerd all set WorkDirectory. So this change does not affect them.
And I think behavior of containers is not good enough, because we can not get restore log when restoring failed.

  1. Podman:
  • Code
    Podman set WorkDirectory to ctr.bundlePath()
       // workPath will be used to store dump.log and stats-dump
        workPath := ctr.bundlePath()
        logrus.Debugf("Writing checkpoint to %s", imagePath)
        logrus.Debugf("Writing checkpoint logs to %s", workPath)
        logrus.Debugf("Pre-dump the container %t", options.PreCheckPoint)
        args := []string{}
        args = append(args, r.runtimeFlags...)
        args = append(args, "checkpoint")
        args = append(args, "--image-path")
        args = append(args, imagePath)
        args = append(args, "--work-path")
        args = append(args, workPath)
  • behavior:
# restore command
podman  container restore  06fd1a24a1f0992c9349e7de45796b80e253a10fc1b7adcfc89052cd460a4288
# output
Error: OCI runtime error: criu failed: type NOTIFY errno 0
log file: /var/lib/containers/storage/overlay-containers/06fd1a24a1f0992c9349e7de45796b80e253a10fc1b7adcfc89052cd460a4288/userdata/restore.log
# restore.log is perisist on disk
ls -l /var/lib/containers/storage/overlay-containers/06fd1a24a1f0992c9349e7de45796b80e253a10fc1b7adcfc89052cd460a4288/userdata/restore.log 
-rw------- 1 root root 25921 6月  17 11:03 /var/lib/containers/storage/overlay-containers/06fd1a24a1f0992c9349e7de45796b80e253a10fc1b7adcfc89052cd460a4288/userdata/restore.log
  1. Containerd:
  • code
if p.CriuWorkPath == "" {
              // if criu work path not set, use container WorkDir
              p.CriuWorkPath = p.WorkDir
      }
  • behavior
# restore 
docker start --checkpoint checkpoint1 c1b1a73a91
# restore output
Error response from daemon: failed to retrieve OCI runtime container pid: open /var/run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/c1b1a73a91b8bbbf61318f7bfa391a8ce6ad42952aba10076372e60e3a34b7ae/init.pid: no such file or directory: unknown
#  I can not found restore.log
find  /var  -name restore.log

@adrianreber
Copy link
Contributor

adrianreber commented Jun 18, 2021

Thanks for looking into it. So Docker and Podman both seem to always set WorkDirectory and ImagesDirectory.

This PR should not lead to different behaviour on the container engine level.

Patch LGTM

@kolyshkin kolyshkin added this to the 1.1.0 milestone Jun 22, 2021
Copy link
Contributor

@mrunalp mrunalp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mrunalp mrunalp merged commit 245fe2b into opencontainers:master Jun 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants