---
title: pod-security-admission-autolabeling
authors:
reviewers:
approvers:
api-approvers:
creation-date: 2022-02-04
last-updated: 2022-02-04
tracking-link:
see-also:
replaces:
superseded-by:
---
This enhancement expands on "PodSecurity admission in OpenShift". It introduces an opt-in mechanism that allows users to keep their workloads running when the Pod Security Admission plugin gets turned on.
We want to adhere to the upstream pod security standards for our workloads, but we also want to provide our users with access to the Security Context Constraints (hereinafter SCCs) API that they are already used to. However, each of these admission plugins works a bit differently, and so there must be a middle-man that synchronizes the privileges SCCs provide into privileges that Pod Security admission (hereinafter PSa) understands.
- OpenShift clusters can run with the restricted Pod Security admission level by default without breaking pod controller workloads
- Privileged users can opt-in/opt-out a namespace to and from the Pod Security admission
- Make sure that bare pods (pods w/o pod controllers) continue working
- Both admission systems are triggered on pod creation, possibly preventing the API server from persisting the pod when admission is denied. It is therefore impossible to take the Pod as an input without touching the code of both admission controllers, and such a code change would very likely end up being unreasonably invasive.
- Add the ability to tune the cluster-wide PSa configuration
- I would like my workloads to be admitted based on the current pod security standards with minimal effort or no effort at all
This section explains how pod security admission works and later elaborates on how to use the pod security labeling in order to synchronize SCC permissions into pod security admission restriction levels.
The SCC permissions to PSa levels transformation gets explained at the end of this section.
Pod Security admission validates pods' security according to the upstream pod security standards and distinguishes three different security levels:
- privileged - most privileged mode, anything is allowed
- baseline - minimally restrictive policy which prevents known privilege escalations
- restricted - heavily restricted policy, following current Pod hardening best practices
By default, there is a cluster-global configuration which enforces the configured policies on pods and known workloads.
It is possible to override the cluster-global enforce configuration on a per-namespace basis by using the `pod-security.kubernetes.io/enforce` label on given namespaces. It is also possible to exempt certain users, namespaces and runtime classes from the admission completely.
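For illustration, overriding the cluster-global enforce level for a single namespace looks like this (the namespace name is hypothetical; the label is the upstream pod security label):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app          # hypothetical namespace
  labels:
    # enforce the baseline policy here regardless of the cluster-global level
    pod-security.kubernetes.io/enforce: baseline
```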
This section describes a controller that is going to be a part of the cluster-policy-controller.
In order to keep SCCs working alongside restricted cluster-wide PSa level, there must be a mechanism that would synchronize the PSa policy level for a namespace that workload runs in, so that the workload is allowed by both SCC and PSa admissions. For that purpose, a PSa Label Synchronization Controller (hereinafter the Controller) shall be built.
The Controller watches changes of:
- namespaces
- manual changes to the interesting labels of the namespaces that were opted-in to label synchronization should be reverted
- roles and rolebindings (and their cluster variants)
- to be able to properly assess SA capabilities in controlled namespaces
- SCCs
- to properly assess restrictive PSa levels per affected namespaces
- to be able to assess SA capabilities in controlled namespaces along with the information retrieved from roles and rolebindings (using the SCC `users` and `groups` fields)
The generic control loop of the Controller performs the following actions:
- list all controlled (opted-in by a label) namespaces
- for each namespace:
- create SCC <-> []SA associations for all SAs present in the namespace
- order observed SCCs most-privileged first (according to PSa privilege levels they map to)
- evaluate the namespace's PSa privilege level based on the most privileged SCC that any of its SAs is allowed to use
- modify the namespace's PSa labels based on the enforce level observed in the previous step
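The per-namespace evaluation above can be sketched roughly as follows (a simplified Python illustration, not the actual controller code; the SCC-to-level mapping and the SA-to-SCCs associations are assumed to be precomputed inputs):

```python
# PSa levels ordered most-privileged first; a lower index means more privileged.
LEVELS = ["privileged", "baseline", "restricted"]

def namespace_enforce_level(scc_to_level, sa_to_sccs):
    """Pick the PSa level of the most privileged SCC that any SA in the
    namespace is allowed to use; fall back to the strictest level."""
    levels = [
        scc_to_level[scc]
        for sccs in sa_to_sccs.values()
        for scc in sccs
        if scc in scc_to_level
    ]
    if not levels:
        # no SCC access at all: the strictest level is safe
        return "restricted"
    # most privileged = lowest index in LEVELS
    return min(levels, key=LEVELS.index)
```

For example, if one SA may use a `baseline`-mapped SCC while all other SAs only use `restricted`-mapped ones, the namespace's enforce label would become `baseline`.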
There is another control loop, reacting only to changes of namespaces, which performs all the per-namespace steps from point 2 above.
The Controller needs to be able to map SCCs to a given SA in order to be able to later use this information to determine the restrictive level for a given SA.
To map SCCs to SAs, the Controller lists all roles and all SCCs, and locally attempts to match the SCCs to the roles based on their rules. It does that using the upstream RBAC rules evaluation code.
The local evaluation was chosen over sending SARs to the API server in order to prevent periodic surges of requests in clusters with a larger number of namespaces, roles and SCCs.
For mapping the roles to SAs, an index should be introduced on top of the rolebinding informer cache. The index will map rolebindings to SAs, SA groups, and the `system:authenticated` group.
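The index can key each rolebinding by every subject it can grant to; a minimal sketch (subject shapes simplified, key format hypothetical):

```python
def rolebinding_index_keys(subjects):
    """Return cache index keys for a rolebinding so it can be looked up
    by ServiceAccount or by group (incl. system:authenticated)."""
    keys = []
    for s in subjects:
        if s["kind"] == "ServiceAccount":
            # one key per SA subject
            keys.append("sa/" + s["namespace"] + "/" + s["name"])
        elif s["kind"] == "Group":
            # covers SA groups like system:serviceaccounts:<ns>
            # as well as system:authenticated
            keys.append("group/" + s["name"])
    return keys
```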
To get SCCs that match a given SA, the process is as follows:
1. list all SCCs
2. determine SA access to SCCs based on the SCC `users` and `groups` fields
3. get all roles for an SA and its groups from the indexed informer cache
4. add SCCs for the roles found in 3. to the list of SCCs found in 2.
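The direct (users/groups) matches and the role-derived SCCs combine into a single set per SA; a minimal sketch (the role-derived SCC names are assumed to come from the RBAC evaluation described above):

```python
def sccs_for_sa(all_sccs, sa_user, sa_groups, sccs_via_roles):
    """Union of SCCs granted directly via the SCC users/groups fields
    and SCCs usable through RBAC roles."""
    matched = set()
    for scc in all_sccs:
        direct_user = sa_user in scc.get("users", [])
        direct_group = bool(set(scc.get("groups", [])) & set(sa_groups))
        if direct_user or direct_group:
            matched.add(scc["name"])
    # merge in the SCCs reachable through role/rolebinding grants
    return matched | set(sccs_via_roles)
```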
The Controller described in PSa Label Synchronization Controller works on an opt-in basis, meaning that users need to specifically label the namespaces whose PSa labels they want synchronized.
Namespaces that specifically want to have their PSa labels synchronized should carry the `security.openshift.io/scc.podSecurityLabelSync` label with the value set to `"true"`. On the other hand, to specifically request no PSa label synchronization, the namespace should set the label's value to `"false"`.
By default, namespaces without the synchronization label will still be considered for label synchronization. These namespaces are considered "no-opinion", and the label synchronization behavior for them may change in any future release.
Having the "no-opinion" path helps keep older workloads running during upgrades. It also gives us space to decide whether we want the label synchronization to be done by default or whether users should have to specifically opt into it in the future.
As a consequence of the opt-in/opt-out being dependent on Namespace labelling, only privileged users can determine which admission mechanism may run on a namespace; non-privileged users depend on the platform defaults.
The automatic PSa label synchronization requires us to map SCCs directly to a PSa level.
In order to be able to map SCCs to PSa level automatically, it is necessary to introspect each field of such an SCC.
SCC fields each tell the SCC authorizer how to either:
- default the value of a given pod's `securityContext` field if unspecified
- restrict the domain of values for a given pod's `securityContext` field
PSa is based on a number of checks, each of which validates a certain `securityContext` field per given PSa restrictiveness level.
As per the above, by introspecting the SCC type fields and their possible values, and by introspecting the PSa check functions for the pod `securityContext` fields affected by those SCC fields, it is possible to create mappings to the different PSa levels.
Given the need for human interaction with the PSa check functions when creating the SCC->PSa mapping, and the complexity of this task, each OpenShift version should only ever consider the "latest" PSa checks.
The SCC fields perform different tasks for fields of a security context:
1. restrict the values to a certain value domain
2. modify the field value in case the field is unset
3. a combination of the above
4. restrict the values of an already allowed set (e.g. `AllowedFlexVolumes` only further restricts flex volumes allowed by `Volumes`, but upstream makes no difference there)
For the conversion between SCC and PSa levels, we are only interested in fields that do 1. and 3. The conversion itself needs to make sure that the values allowed by SCCs are a subset of the values allowed by the given PSa level in order for the SCC to match the PSa level.
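The subset requirement can be expressed as: an SCC field matches the strictest PSa level whose check allows every value the SCC allows. A sketch (the per-level allowed value sets are hypothetical placeholders, not the real PSa check tables):

```python
# PSa levels ordered strictest first
PSA_LEVELS = ["restricted", "baseline", "privileged"]

def level_for_field(scc_allowed, allowed_per_level):
    """Return the strictest PSa level whose allowed value set is a
    superset of the values the SCC field permits."""
    for level in PSA_LEVELS:
        if set(scc_allowed) <= set(allowed_per_level[level]):
            return level
    # privileged allows everything by definition
    return "privileged"
```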
Caveats and Gotchas Found During SCC Mapping Investigation
1. Currently, there exists no default SCC supplied with the platform that would match any PSa level more restricted than `baseline`
   - the `restricted` (and any dependent) SCC needs to be modified to match the pod security standards of today:
     - `RequiredDropCapabilities` should now contain just one value - `ALL`
     - `AllowPrivilegeEscalation` should now be `false`
     - `AllowedCapabilities` should include `NET_BIND_SERVICE`, which allows the application to bind privileged ports (numbers lower than 1024)
     - `SeccompProfiles` should equal `runtime/default`, which only explicitly defaults this value for pods but it is already implicitly used
2. The SCC to PSa conversion depends on a namespace for the conversion
   - the `MustRunAs` strategy of `RunAsUser` allows setting an empty range, and its evaluation for a given pod retrieves the range from the pod's namespace annotations
3. The upstream checks can be updated at any time, and there is nothing that would allow a simple machine conversion between SCC and PSa levels; this is all human driven
4. PSa is designed in a way that allows improving security with time if such a need arises
   - this could mean that our SCCs might slip to less restrictive PSa levels in time
   - we will need to have a process to tighten SCC permissions to match the PSa capabilities in this regard
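For reference, the field changes from point 1. would look roughly like this in the `restricted-v2` SCC manifest (a sketch of just the differing fields, not a complete SCC):

```yaml
# restricted-v2: fields that differ from the legacy restricted SCC
requiredDropCapabilities:
- ALL                      # drop everything instead of a selected list
allowPrivilegeEscalation: false
allowedCapabilities:
- NET_BIND_SERVICE         # allow binding privileged ports (< 1024)
seccompProfiles:
- runtime/default          # make the implicit default explicit
```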
From the above points, we can safely skip 2. as that is just an interesting observation; we are evaluating the PSa levels at the namespace level anyway when we are checking the SCC-related permissions for each of the SAs, which are namespaced by nature.
The above changes required to make 1. work might cause workloads to break if applied directly to the `restricted` SCC. It is impossible to tell which workloads would be broken by this step because most platform users very likely just went with the platform defaults and did not take further steps to tighten their workloads to their actual needs. To solve this issue, new `*-v2` (e.g. `restricted-v2`, `nonroot-v2`) SCCs with tighter security policies should be introduced for the `restricted` SCC and the SCCs derived from it. The `restricted` SCC will also drop `system:authenticated` from its `Groups` field in favour of adding it to the `restricted-v2` SCC. This effectively causes older clusters to keep the looser permissions of the `restricted` SCC for their workloads but tightens up the policies for workloads of newly created clusters.
In case the changes above still break any user workloads in upgraded clusters (`restricted-v2` has a better restrictive score during evaluation and so it's chosen first for field defaulting), the users will need to fix their workloads' `securityContext` accordingly, but they should still be able to do so since they would still have access to the `restricted` SCC. The documentation should include information on how to fix such a workload, and how to remove broad user audience access to the legacy `restricted` SCC.
For 3., a process needs to be introduced that allows simply checking the changes to the `k8s.io/pod-security-admission/policy` folder during k8s.io rebases. The repository containing the SCC conversion should include the commit hash of the latest checked version of the `k8s.io/pod-security-admission/policy` folder that the current mapping corresponds to, and developers will need to manually check that no significant changes happened to the folder between the two rebases that would cause updates to checks. If there were, the conversion needs to be updated. In both cases, the commit hash should be updated to match the latest checked version of the folder.
The process for 4. should be drawn in a follow-up enhancement.
A new `Namespace` label is being introduced - `security.openshift.io/scc.podSecurityLabelSync`. When set to `"true"`, the labeled `Namespace` gets picked up by the PSa label synchronization controller, and the `Namespace` and its workloads will be reconciled as described in the PSa Label Synchronization Controller section.
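Opting a namespace in to synchronization is then a single label (the namespace name is hypothetical):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: legacy-workloads   # hypothetical namespace
  labels:
    security.openshift.io/scc.podSecurityLabelSync: "true"
```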
Described in the sections above.
Turning on PSa enforcement at the restricted level, even with the namespace label synchronization described in this enhancement, brings the risk of breaking direct pod creation for pods that don't rely on SA permissions, i.e. pods created directly by users with SCC permissions higher than those of the SA for the given pod.
The problem above could be mitigated by giving the SA running the pod the same SCC-related permissions as the user has, or at least permission to use the least privileged SCC required to run the pod.
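Granting the SA `use` permission on an SCC is standard RBAC; for example (role and namespace names hypothetical, the `nonroot` SCC used purely for illustration):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: use-nonroot-scc    # hypothetical role name
  namespace: my-app        # hypothetical namespace
rules:
- apiGroups: ["security.openshift.io"]
  resources: ["securitycontextconstraints"]
  resourceNames: ["nonroot"]
  verbs: ["use"]
```

A corresponding RoleBinding to the pod's service account completes the grant.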
It is also likely that pod controllers might fail pod creation before the labels are properly synchronized on their namespace. There is no way to mitigate that; the pods will eventually be created successfully if the service account running them has the correct permissions.
The `restricted` SCC and its variations (e.g. `nonroot`) will need a duplicate `<name>-v2` variant that matches the updated pod security standards (see Deriving PSa Level from SCC fields for details of the proposed changes).
The `restricted` SCC changes may cause some workloads to break. Such failures should be rather rare, as modern restricted workloads should not depend on the setuid bits and should tolerate dropped capabilities. Nevertheless, the documentation should state how to fix such workloads by modifying the `dropCapabilities` to match the previous `restricted` SCC.
Everything was already described elsewhere.
- Should we have a way to completely disable SCC admission per namespace to allow only PSa?
The tests shall include these scenarios:
- a namespace labelled for label synchronization gets a more privileged label when it contains a workload controller whose SA can run the pods with matching SCC
- a namespace labelled for label synchronization does not get a more privileged label when it contains a workload controller whose SA cannot run the pods with matching SCC
- a namespace not labeled for label synchronization does not get its PSa labels synchronized
Irrelevant.
Irrelevant.
Irrelevant.
Irrelevant.
The upgrade strategy is discussed in the Label Synchronization Opting-in section.
Irrelevant.
One can derive all of this from the description in the Proposal section.
N/A
Major milestones in the life cycle of a proposal should be tracked in `Implementation History`.
Empty.