
Another nail to mountinfo coffin #2599

Merged

Conversation

@kolyshkin (Contributor) commented Sep 23, 2020

This is a set of patches for libcontainer/cgroups that

  • implements mountinfo cache for cgroup filesystem and reuses it;
  • drops two custom mountinfo parsers (one in favor of moby/sys/mountinfo, another to reuse cgroups/utils);
  • removes mountinfo dependency in cgroups/fs/cpuset code.

The idea is roughly the same as #2265 but the implementation is way simpler.
The end result is smaller code (98 insertions(+), 147 deletions(-)), fewer bugs (I hope), and overall higher performance.

Testing with strace shows that the number of times runc run opens /proc/self/mountinfo has decreased from 19 to 6 (note it was 108 half a year ago). Of those, 5 opens are presumably outside of the cgroups code (I haven't checked), so for cgroups it's a 14 -> 1 improvement. In absolute terms, this can save up to 0.023*14 ~= 0.3 seconds per container start (on a system very busy with mounts).
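
For reference, the cache boils down to something like this minimal sketch (illustrative, not the PR's verbatim code; the variable and function names here are assumptions) using moby/sys/mountinfo's GetMounts and FSTypeFilter:

```go
package cgroups

import (
	"sync"

	"github.com/moby/sys/mountinfo"
)

var (
	cgroupMountinfoOnce sync.Once
	cgroupMountinfo     []*mountinfo.Info
	cgroupMountinfoErr  error
)

// readCgroupMountinfo returns the cgroup v1 mounts (i.e. mountinfo entries
// with fstype of "cgroup") for the current process. mountinfo is read only
// once and the result is cached, assuming no new cgroup mounts appear later.
func readCgroupMountinfo() ([]*mountinfo.Info, error) {
	cgroupMountinfoOnce.Do(func() {
		cgroupMountinfo, cgroupMountinfoErr = mountinfo.GetMounts(
			mountinfo.FSTypeFilter("cgroup"),
		)
	})
	return cgroupMountinfo, cgroupMountinfoErr
}
```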

v2 changes:

  • unexport readCgroupMountinfo (by removing its two external users, see below);
  • switch cgroups/fs/getCgroupRoot to use cgroups.GetCgroupMounts;
  • don't rely on mountinfo in cgroups/fs/cpuset code (switch to statfs);
  • improve commit messages.

NOTE this currently depends on (and includes) #2689 -- will rebase once that one is merged.

@kolyshkin force-pushed the another-nail-to-mountinfo-coffin branch from 529dd48 to aaba1e9 on September 24, 2020 01:57
@kolyshkin force-pushed the another-nail-to-mountinfo-coffin branch from 3c4c97d to 47563e3 on October 6, 2020 00:18
@kolyshkin (Contributor Author)

Rebased; @AkihiroSuda @mrunalp PTAL

@kolyshkin force-pushed the another-nail-to-mountinfo-coffin branch 3 times, most recently from 498e533 to 0df9654 on October 6, 2020 01:12
@kolyshkin (Contributor Author)

@AkihiroSuda @mrunalp PTAL

AkihiroSuda previously approved these changes Oct 26, 2020
@AkihiroSuda added this to the 1.0.0-rc93 milestone Oct 26, 2020
@AkihiroSuda (Member)

@cyphar PTAL

@kolyshkin (Contributor Author)

Rebased on top of #2666 just in case

@AkihiroSuda (Member)

@cyphar WDYT?

AkihiroSuda previously approved these changes Nov 13, 2020

@AkihiroSuda (Member) left a comment

Ping @cyphar

@thaJeztah (Member) left a comment

Left a suggestion (let me know what you think)

libcontainer/cgroups/utils_test.go (review thread resolved)
@@ -91,6 +95,16 @@ func tryDefaultPath(cgroupPath, subsystem string) string {
return path
}

func ReadCgroupMountinfo() ([]*mountinfo.Info, error) {
Member

Suggested change
func ReadCgroupMountinfo() ([]*mountinfo.Info, error) {
// ReadCgroupMountinfo returns a list of cgroup mounts for the current running
// process. ReadCgroupMountinfo caches the results in memory, and assumes that
// no new cgroup mounts appear after it was first called.
func ReadCgroupMountinfo() ([]*mountinfo.Info, error) {

(perhaps "a list of cgroup v1 mounts"?)

Member

Actually, the name ReadCgroupMountinfo is now a bit ambiguous; Read... would indicate it actually reads (rather than caches) mountinfo.

Looking a bit further: instead of introducing a new, exported function;

  • This same file has a getCgroupMountsV1(), which is called by GetCgroupMounts()
  • getCgroupMountsV1() performs additional conversion (mapping mountinfo.Info to a libcontainer/cgroups.Mount using getCgroupMountsHelper())
    • this conversion will be done every time GetCgroupMounts() is called
  • side-note: I see `` and getCgroupMountsV1() have an `all` argument, which always seems to be `false` (perhaps it's no longer needed)
  • All current uses of ReadCgroupMountinfo() outside of this package (libcontainer/cgroups/fs/cpuset.go and libcontainer/cgroups/fs/fs.go) are only interested in Mountpoint and Root, both of which are also preserved in libcontainer/cgroups.Mount (returned by getCgroupMountsV1())

Perhaps:

  • un-export ReadCgroupMountinfo()
  • remove the sync.Once() from ReadCgroupMountinfo()
  • export getCgroupMountsV1 (perhaps rename it to GetCgroupV1Mounts())
    • This makes it a nice parallel to GetCgroupMounts() (but only returning v1 mounts, and thus clearer on what it does)
  • remove the all argument (if it's indeed no longer needed)
  • add the sync.Once to GetCgroupV1Mounts(), so that the conversions are only done once (if no new cgroups are expected to be added, there should be no need to perform the conversion each time either?)

This also makes the interface somewhat cleaner, as we're not returning an "external" type (mountinfo.Info), but instead a type that's defined in this package (cgroups.Mount).
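
A rough sketch of that proposal (hypothetical: GetCgroupV1Mounts is the name suggested above, not existing API, and getCgroupMountsV1 is assumed to have lost its all argument):

```go
var (
	v1MountsOnce sync.Once
	v1Mounts     []Mount
	v1MountsErr  error
)

// GetCgroupV1Mounts returns cgroup v1 mounts for the current process,
// already converted to this package's Mount type, so callers never see
// the external mountinfo.Info type. The conversion is done once and
// cached, assuming no new cgroup mounts appear afterwards.
func GetCgroupV1Mounts() ([]Mount, error) {
	v1MountsOnce.Do(func() {
		v1Mounts, v1MountsErr = getCgroupMountsV1()
	})
	return v1Mounts, v1MountsErr
}
```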

Contributor Author

  1. all was introduced by "Add flag to allow getting all mounts for cgroups subsystems" (#1049), and it very vividly demonstrates the major problem with runc: it's a tool as well as a set of packages with lots of users. I have changed the runc/libcontainer public API in the past; in some cases the users had to modify their code, and in some cases I needed to revert.

I will look into cadvisor code wrt using all argument.

  2. The idea of reusing GetCgroupMounts (and adding sync.Once to it) looks interesting. The problem is, it's somewhat harder to do with the all argument as it is now. I tried it in the past, and the result was lots of code. What I like about this PR is that it is simple: it removes lots of code and makes things faster at the same time.

  3. I don't like yet another exported function either. I think the solution to that is to use internal.

Contributor Author

> I will look into cadvisor code wrt using all argument.

google/cadvisor#1461
google/cadvisor#1476

I read all that and thought for some time and I am still not sure how to handle this :(

Contributor Author

So, I have added a comment describing ReadCgroupMountinfo as per your earlier suggestion, and I'll take a look at internalizing it later.

Member

I agree with doing this API cleanup later.

Contributor Author

OK, I took a fresh look and found a way to unexport ReadCgroupMountinfo by not using it in the other two commits. The core of this PR stays the same, but the last two commits (the getCgroupRoot and cpuset ones) changed a lot. PR description updated with "v2 changes"; please re-review @thaJeztah @cyphar

libcontainer/cgroups/v1_utils.go (review thread resolved)
Comment on lines 184 to 186
if numFound >= len(ss) {
break
}
Member

Is this new check necessary? It's also not correct if all is true.

Contributor Author

I haven't changed the original logic here, which was

for scanner.Scan() && numFound < len(ss) {

but it's good that you asked, as now I see that this function is not tested with all set to true.

Contributor Author

Ah, OK, the all logic was added by f557996 (done for google/cadvisor#1476) and it correctly avoided incrementing numFound in case all == true....

.... until 5ee0648 (PR #1817) broke it (since v1.0.0-rc6 and until now).

I am not sure why e.g. google/cadvisor#2309 (which bumps runc to a version with this bug) hasn't reintroduced google/cadvisor#1461. There is still an open issue that looks very similar to it, though: google/cadvisor#1843.

Anyway, I am fixing this in #2689, and rebasing this PR on top of that one.
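
For context, a simplified reconstruction of the numFound logic (hypothetical shape -- filterMounts is made up for this example; the real code scans mountinfo lines): when all is true, the early-exit bookkeeping must be skipped so that every matching entry is returned:

```go
// filterMounts is a simplified, hypothetical reconstruction of the
// numFound logic discussed above. wanted maps each requested cgroup
// subsystem to true; when all is true, every matching entry must be
// collected, so neither the counter nor the early exit may be used.
func filterMounts(mounts []string, wanted map[string]bool, all bool) []string {
	var res []string
	numFound := 0
	for _, m := range mounts {
		if !wanted[m] {
			continue
		}
		res = append(res, m)
		if !all {
			wanted[m] = false // count each subsystem only once
			numFound++
			if numFound >= len(wanted) {
				break // every requested subsystem has been found
			}
		}
	}
	return res
}
```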

@cyphar (Member) commented Nov 18, 2020

Aside from my nit, this LGTM.

@kolyshkin force-pushed the another-nail-to-mountinfo-coffin branch 2 times, most recently from a0e0e62 to 75a9015 on November 24, 2020 20:32
@kolyshkin force-pushed the another-nail-to-mountinfo-coffin branch from 75a9015 to 08caa9b on November 25, 2020 01:29
@AkihiroSuda (Member)

CI failing

@kolyshkin force-pushed the another-nail-to-mountinfo-coffin branch from 08caa9b to dc17476 on November 27, 2020 10:22
@kolyshkin (Contributor Author) commented Nov 27, 2020

Hmm, failure on rootless + idmap on Fedora 33:

not ok 90 runc exec [tty consolesize]
# (in test file /vagrant/tests/integration/tty.bats, line 183)
#   `timeout 1 tail --pid="$(head -n 1 pid.txt)" -f /dev/null' failed with status 124
# runc spec --rootless (status=0):
#
# runc run -d --console-socket /tmp/console.sock test_busybox (status=0):
#
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.0.2-dev",
#   "id": "test_busybox",
#   "pid": 72435,
#   "status": "running",
#   "bundle": "/tmp/busyboxtest",
#   "rootfs": "/tmp/busyboxtest/rootfs",
#   "created": "2020-11-27T11:03:43.589028179Z",
#   "owner": ""
# }
# runc exec -t --pid-file pid.txt -d --console-socket /tmp/console.sock -p /dev/fd/63 test_busybox (status=0):
#

This is the timeout being hit. Does not look related, and I haven't ever seen this before.

Maybe CI is exceptionally slow today? :)

update: #2692

@AkihiroSuda (Member)

CI seems to have got stuck, could you push to retrigger CI?

@kolyshkin (Contributor Author)

> CI seems to have got stuck, could you push to retrigger CI?

Yes, basically all old PRs (that don't have #2690) need to be rebased :-\

Alternatively, I can unmark those checks as required.

@kolyshkin force-pushed the another-nail-to-mountinfo-coffin branch from dc17476 to 2c1eacd on December 4, 2020 01:18
@kolyshkin (Contributor Author)

Rebased to include #2690

@AkihiroSuda (Member)

Needs rebase again, sorry.

Apparently it is inevitable that mountinfo is needed multiple times
when dealing with cgroup v1. It seems, though, that we can read it
just once and reuse the data, without major modifications to the code.

This commit does a few things.

1. Drop our custom mountinfo parser implementation in favor of
   moby/sys/mountinfo. While the custom parser is faster
   (about 2x according to benchmark) for this particular case,
   the one from the package is more correct and future-proof.

2. Read mountinfo only once, caching all the entries with fstype of
   cgroup.  With this, there's no need to worry about performance
   degradation introduced above.

3. Drop "isSubsystemAvailable" optimization (introduced by commit
   2a1a6cd) because now with the cache it is probably slowing
   things down.

4. Modify the tests accordingly.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Drop the custom mountinfo parser, reuse the cgroups.GetCgroupMounts()
to get the cgroup root. Faster, and much less code.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
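
The idea, roughly (an illustrative sketch, not the commit's exact code; the import paths and the "no mounts" error are assumptions):

```go
import (
	"errors"
	"path/filepath"

	"github.com/opencontainers/runc/libcontainer/cgroups"
)

// getCgroupRoot derives the cgroup v1 root from the parent directory of
// any cgroup mountpoint, e.g. /sys/fs/cgroup/cpu -> /sys/fs/cgroup.
// Illustrative sketch only; the real function keeps extra fallbacks.
func getCgroupRoot() (string, error) {
	mounts, err := cgroups.GetCgroupMounts(false)
	if err != nil {
		return "", err
	}
	if len(mounts) == 0 {
		return "", errors.New("no cgroup mounts found")
	}
	return filepath.Dir(mounts[0].Mountpoint), nil
}
```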
In cgroup v1, the parent of a cgroup mount is on tmpfs
(or maybe any other fs but definitely not cgroupfs).

Use this to traverse up the tree until we reach non-cgroupfs.

This should work in any setup (nested container, non-standard
cgroup mounts), so issue [1] won't happen.

Theoretically, the only problematic case is when someone mounts
cpuset cgroupfs into a directory on another cgroupfs. Let's
assume people don't do that -- if they do, they will get a
different error (e.g. an inability to read the cpuset.cpus file).

[1] opencontainers#1367

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
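
A sketch of that traversal (illustrative; isCgroupFS and findCpusetMount are made-up names for this example, not the commit's exact code), using statfs and the cgroupfs magic number instead of mountinfo:

```go
import (
	"path/filepath"

	"golang.org/x/sys/unix"
)

// isCgroupFS reports whether path lives on a cgroup v1 filesystem.
func isCgroupFS(path string) (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(path, &st); err != nil {
		return false, err
	}
	return st.Type == unix.CGROUP_SUPER_MAGIC, nil
}

// findCpusetMount walks up from dir (a cpuset cgroup directory) until the
// parent is no longer on cgroupfs; at that point dir is the mountpoint.
func findCpusetMount(dir string) (string, error) {
	for {
		parent := filepath.Dir(dir)
		if parent == dir {
			return dir, nil // reached the filesystem root
		}
		onCgroupFS, err := isCgroupFS(parent)
		if err != nil {
			return "", err
		}
		if !onCgroupFS {
			return dir, nil
		}
		dir = parent
	}
}
```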
Here, the parent directory is always created first, so using
MkdirAll is redundant. Use Mkdir instead.

One difference between MkdirAll and Mkdir is that the former ignores
EEXIST, and since we sometimes try to create a directory that
already exists, we need to ignore that error explicitly.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
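
The resulting pattern looks like this (a sketch; mkdirIgnoreExist is a hypothetical helper name, not the commit's code):

```go
import "os"

// mkdirIgnoreExist creates dir with Mkdir (the parent is known to exist)
// and explicitly tolerates EEXIST, which MkdirAll used to swallow for us.
func mkdirIgnoreExist(dir string) error {
	if err := os.Mkdir(dir, 0o755); err != nil && !os.IsExist(err) {
		return err
	}
	return nil
}
```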
@kolyshkin force-pushed the another-nail-to-mountinfo-coffin branch from 2c1eacd to 9f2153c on January 6, 2021 22:54
@kolyshkin (Contributor Author)

rebased (for new CI)

@thaJeztah (Member) left a comment

LGTM

@thaJeztah (Member)

@cyphar still LGTY?

@AkihiroSuda (Member)

@cyphar PTAL

@cyphar (Member) left a comment

LGTM, thanks Kir!

@cyphar closed this in 6eed6e5 on Feb 1, 2021
@cyphar merged commit 6eed6e5 into opencontainers:master on Feb 1, 2021