Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propose rollout for Docker shared PID namespace #207

Merged
merged 4 commits into from
Jan 26, 2017
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -86,7 +86,7 @@ container setup that are not currently trackable as Pod constraints, e.g.,
filesystem setup, container image pulling, etc.*

A container in a PodSandbox maps to an application in the Pod Spec. For Linux
containers, they are expected to share at least network and IPC namespaces,
containers, they are expected to share at least network, PID and IPC namespaces,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think we should change an old design proposal. This should be documented somewhere else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the change should be reflected somewhere, and SIG Node pointed me here.

I guess the broader question is whether proposals are living documents, but that's out of scope for this PR. If you want me to make the change somewhere else please suggest a place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do update old proposals to reflect new features or changes to design originally proposed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's impossible to keep all the proposals up-to-date though. Adding a separate spec for CRI anyway has to be done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, since CRI v1alpha1 is not out yet, and shared pid namespace is what we agreed for v1alpha1, updating the proposal might be the least effort we could do?

with sharing more namespaces discussed in [#1615](https://issues.k8s.io/1615).


Expand Down
78 changes: 78 additions & 0 deletions contributors/design-proposals/pod-pid-namespace.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
# Shared PID Namespace

Pods share namespaces where possible, but a requirement for sharing the PID
namespace has not been defined due to lack of support in Docker. Docker began
supporting a shared PID namespace in 1.12, and other Kubernetes runtimes (rkt,
cri-o, hyper) have already implemented a shared PID namespace.

This proposal defines a shared PID namespace as a requirement of the Container
Runtime Interface and links its rollout in Docker to that of the CRI.

## Motivation

Sharing a PID namespace is discussed in [#1615](https://issues.k8s.io/1615),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add "sharing in a pod".

and enables:

1. signaling between containers, which is useful for side cars (e.g. for
signaling a daemon process after rotating logs).
2. easier troubleshooting of pods.
3. addressing [Docker's zombie problem][1] by reaping orphaned zombies in the
infra container.

## Goals and Non-Goals

Goals include:
- Changing default behavior in the Docker runtime as implemented by the CRI
- Making Docker behavior compatible with the other Kubernetes runtimes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this goal?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I had questions about the goals. The first one sounds more like an approach to achieve a goal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My primary goal is to enable the shared PID namespace in the docker runtime. That's what I want to happen.

The secondary goal is to make this consistent across runtimes (where feasible). A shared PID namespace is consistent with the pod abstraction. That's the way I think things should be, but it wasn't my original goal. It was suggested by a reviewer.


Non-goals include:
- Creating an init solution that works for all runtimes
- Supporting isolated PID namespace indefinitely

## Modification to the Docker Runtime

We will modify the Docker implementation of the CRI to use a shared PID
namespace when running with a version of Docker >= 1.12. The legacy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't v1.12 have issues with reparenting? Should we wait for v1.13?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, 1.12 reparents orphans incorrectly and they are reaped outside of the container. That's a no-op in our use case since we don't really care what it is that reaps the zombies. Regular zombies are unaffected.

Choosing whatever version > 1.12 fixes this bug would be a reasonable policy choice, but it's not technically necessary. Maybe we could target the next version of docker SIG Node wants to certify for Kubernetes, which I expect won't be until after the CRI because default. If we know that's 1.13, then I can add that here, otherwise we could use a placeholder until it's decided.

`dockertools` implementation will not be changed.

Linking this change to the CRI means that Kubernetes users who care to test such
changes can test the combined changes at once. Users who do not care to test
such changes will be insulated by Kubernetes not recommending Docker >= 1.12
until after switching to the CRI.

Other changes that must be made to support this change:

1. Ensure all containers restart if the infra container responsible for the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what this means. kubelet already restarts all containers if the sandbox/infra container is dead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we should verify this behavior and maybe add a test. I added this item because a comment in another PR says differently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a change in the code itself. Suggest making it clear by rewording it as "Adding a test to verify"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to "add a test", but maybe we can drop this entirely if it's inherent in the concept of the sandbox.

PodSandbox dies. (Note: With Docker 1.12 if the source of the PID namespace
dies all containers sharing that namespace are killed as well.)
2. Modify the Infra container used by the Docker runtime to reap orphaned
zombies ([#36853](https://pr.k8s.io/36853)).

## Rollout Plan

SIG Node is planning to switch to the CRI as a default in 1.6, at which point
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We agreed to have shared pid namespace by default for CRI v1alpha1 interface. In this case, if the user use docker 1.12 or above, enabling shared pid namespace means early exposure to the end users is good thing for us to catch any potential issues.

On another side, I think there is a real concern that this is a behavior change for Kubernetes container runtime in production. Even other runtime implementations are having shared pid namespace by default today, but in production, the majority of Kubernetes users are not using such features, and we are not sure if this might cause any user issues. If any problem involving with shared pid namespace resulted in a CRI disabling might be too dramatic.

Can we introduce a flag to disable the shared pid namespace instead of disabling entire CRI?

cc/ @yujuhong @verb

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM, I'll update the proposal.

users with Docker >= 1.12 will be able to test Shared namespaces. Switching
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Shared namespaces/shared PID namespaces

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks.

back to isolated PID namespaces will require disabling the CRI.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slightly concerned about bundling the PID namespace with the CRI rollout, i.e., there are no way to disable shared PID namespaces in CRI. The majority of the CRI implementation should not change the behavior of the applications, but this (the shared pid namespace) does not fall into this category. This may affect the adoption of CRI in 1.6.

@dchen1107, can we discuss more whether we want to bundle the two, as this is currently not included in the CRI rollout plan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's very safe to assume SIG Node will discuss this more. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire proposal revolves around the idea of bundling rollout, hence it deserves more discussion.


At some point, say 1.7, SIG Node will remove support for disabling the CRI.
After this point users must roll back to a previous version of Kubernetes or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why rolling back? It seems to me that the users won't even choose to upgrade their kubernetes cluster if this is the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose not to upgrade in order to avoid the additional shared namespace?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I didn't get why users would upgrade and then roll back for the shared pid namespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they would roll back if they found their container images were incompatible with a new version of Kubernetes.

Docker to achieve PID namespace isolation. This is acceptable because:

* No one has been able to identify a concrete use case requiring isolated PID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm....I am not sure what has been done to reach a broader audience. Could you elaborate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No attempt has been made to reach a broader audience. I was referring to reviewers of this proposal.

namespaces.
* The lack of use cases means we can't justify the complexity required to make
PID namespace type configurable.
* Users will already be looking for issues due to the major version upgrade and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "major version upgrade" mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.X -> 1.Y

prepared for a rollback to the previous release.

Alternatively, we could create a flag in the kublet to disable shared PID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/kublet/kubelet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks.

namespace, but this wouldn't be especially useful to users of a hosted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? The maintainer of the hosted clusters can still choose to disable this feature....

Copy link
Contributor Author

@verb verb Jan 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There would need a mechanism for user's to affect flags of the running kubelet. I'd be surprised if this was common.

Or perhaps you mean they would choose to disable shared PID namespaces for all users on principle, which seems a stretch for a feature that so far no one opposes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having an escape hatch is better. The other option is to make a new patch release which can incur a lot of latency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There would need a mechanism for user's to affect flags of the running kubelet. I'd be surprised if this was common.

I can't comment on behalf of all the provider of hosted kubernetes clusters...

Or perhaps you mean they would choose to disable shared PID namespaces for all users on principle, which seems a stretch for a feature that so far no one opposes.

From past experiences, it's hard to gather feedback from users. Erring on the side of caution might be good. Is making shared pid namespace configurable for the transitional period really such a big hurdle?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, it's not a big hurdle. It can be done with an annotation and adding an option to the CRI.

Kubernetes cluster.


[1]: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/pod-pid-namespace.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->