
deploy: fix the owner reference kind to be rc #13996

Merged · 9 commits into openshift:master · May 22, 2017

Conversation

@mfojtik (Contributor, Author) commented May 2, 2017

This PR will:

  • Fix the "kind" field in the ownerRef of the deployer pod to point to the RC instead of the DC (see the sketch after this list)
  • Automatically set the ownerRef on lifecycle hook pods created by deployer pods to the RC (so we clean them up when the RC is removed)
  • Fix infra policy to use the deployment SA instead of the deploymentConfig SA in the deployment controller
  • Fix permissions on pods for the deploymentConfig SA (it doesn't need patch, update, delete, or create)
  • Use the SA client in the deploymentConfig controller instead of the privileged client
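A minimal sketch of the corrected owner reference (the helper name and the 2017 vendored import paths are assumptions; the field values mirror the diffs below):

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kapi "k8s.io/kubernetes/pkg/api"
)

// setDeployerOwnerRef is a hypothetical helper showing the shape of the fix:
// the deployer pod's ownerRef already carried the RC's name and UID, so its
// Kind must say "ReplicationController" too, or the three fields contradict
// each other.
func setDeployerOwnerRef(pod *kapi.Pod, rc *kapi.ReplicationController) {
	pod.OwnerReferences = append(pod.OwnerReferences, metav1.OwnerReference{
		APIVersion: "v1",
		Kind:       kapi.Kind("ReplicationController").Kind, // was deployapi.Kind("DeploymentConfig").Kind
		Name:       rc.Name,
		UID:        rc.UID,
	})
}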

@mfojtik (Author) commented May 2, 2017

[test]

@@ -435,7 +436,9 @@ func (c *DeploymentConfigController) cleanupOldDeployments(existingDeployments [
 			continue
 		}

-		err := c.rn.ReplicationControllers(deployment.Namespace).Delete(deployment.Name, nil)
+		err := c.rn.ReplicationControllers(deployment.Namespace).Delete(deployment.Name, &metav1.DeleteOptions{
mfojtik (Author): @Kargakis check.

 	APIVersion: "v1",
-	Kind:       deployapi.Kind("DeploymentConfig").Kind,
+	Kind:       kapi.Kind("ReplicationController").Kind,
Contributor: @ncdc any chance this is the source of your problem? The names wouldn't match, right?

@0xmichalis (Contributor)

Needs a test that verifies GC of deployers works.

Contributor (on the same diff): Actually, @mfojtik this pod is more related to the DC than the RC and should be cleaned up when the DC is deleted, right? That would mean the kind was correct and the name and UID were wrong.

mfojtik (Author): It will be cleaned up because deleting the DC triggers deletion of the RC, and that removes the deployer pod. Similarly, when you set revisionHistory this automatically removes failed deployer pods (successful deployer pods are removed automatically) as we delete old RCs. Correct @Kargakis ?

Contributor: @mfojtik correct, this pod is more related to the RC than the DC.

Contributor: We also need to set finalizers in the RCs to enable cascading deletion from a DC to reach pods, but that should be part of https://trello.com/c/B10xLQ6L/927-set-owner-refs-in-rcs-owned-by-dcs.

@0xmichalis (Contributor)

> Needs a test that verifies GC of deployers works.

@mfojtik I think you can re-use the existing test for cleaning up older RCs.

@ncdc (Contributor) commented May 2, 2017

What I'd really like to see is a test that can reliably reproduce the transient deletion we saw in Jenkins that is then fixed by this PR (in whatever form it ends up in).

@openshift-bot added the needs-rebase label May 3, 2017
@mfojtik (Author) commented May 3, 2017

@ncdc @Kargakis @deads2k does kube GC even check the "kind"? It seems like the UID is correct (it points to the RC). Also I'm not able to reproduce the failure Andy saw... I tried:

  1. Make a pre-hook with sleep 120 (in theory the deployer pod should be erased prematurely by GC?)
  2. Make a pre-hook with abort policy and exit 1, which causes the deployment controller to keep the pod around (not removed by the deployment controller)... it works as expected...
  3. Same as 2) but with the revisionHistory limit set to 2; I see all the failed deployer pods and hook pods sitting around and they are never garbage collected (with this broken, I would expect them to vanish, as they point to a non-existent DC (the UID is the RC's)).

... unless the GC only checks the UID and not the Kind. In that case the cause of the problem is somewhere else and this is just a cosmetic issue.

EDIT: @Kargakis do we have logic in place so that when I delete the RC, the failed deployer pod is deleted automatically as well? (because that is what I see)... which also reminds me that we should probably set the owner reference on the hook pods too...

@openshift-bot removed the needs-rebase label May 3, 2017
@mfojtik (Author) commented May 3, 2017

OK, so it seems like the GC honors the UID from the owner reference, which means this does not fix #13995, but we should fix it anyway...

I'll add owner references to the hook pods in this PR.

@0xmichalis (Contributor) commented May 3, 2017

Make sure that when we clean up RCs based on revisionHistoryLimit, we delete with propagationPolicy set to Background.
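A minimal sketch of the requested call shape, assuming the 2017 apimachinery API and the controller's c.rn ReplicationControllers getter (this is what the final diff at the bottom of the page ends up doing):

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Background propagation deletes the RC right away and lets the garbage
// collector remove its dependents (deployer and hook pods) afterwards.
policy := metav1.DeletePropagationBackground
err := c.rn.ReplicationControllers(deployment.Namespace).Delete(deployment.Name, &metav1.DeleteOptions{
	PropagationPolicy: &policy,
})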

@ncdc (Contributor) commented May 3, 2017

The failure that I originally found was from a CI run for #13886. It didn't look like that PR would cause the problem. Either way, I still haven't reasoned through how this could have happened. That's why I want a test case that can "slow down time", be prescriptive about what happens, and see if we can reproduce it.

@mfojtik (Author) commented May 3, 2017

@ncdc sure, I will work on that test; I think this PR is not related to that issue at this point.

@deads2k @liggitt @enj will you guys have any objections to granting the "deployer" SA the ability to delete pods in the project? When I add an ownerRef to a hook pod created by the deployer I'm hitting this error:

 cannot set an ownerRef on a resource you can't delete: User "system:serviceaccount:test:deployer" cannot delete pods in project

There are two options we discussed with @Kargakis:

  1. Grant the deployer delete
  2. Update the failed hook pods (which are subject to GC when the RC is removed) with the ownerRef from the controller (basically when the deployer controller sees a failed hook pod, it automatically patches it with the ownerRef).

I like 1) because it is cleaner, though less secure, but I'll take your advice :)

EDIT: so I tried the controller approach and got

I0502 21:06:11.518767    6460 deployment_controller.go:217] error saetting owner references: User "system:serviceaccount:openshift-infra:deploymentconfig-controller" cannot "patch" "pods" with name "nginx-15-hook-pre" in project "test"

but I guess making the controller more privileged is better than making the SA more privileged?

@@ -998,7 +998,7 @@ func (c *MasterConfig) DeploymentConfigInstantiateClients() (*osclient.Client, k

 // DeploymentControllerClients returns the deployment controller client objects
 func (c *MasterConfig) DeploymentControllerClients() (*osclient.Client, kclientsetinternal.Interface, kclientsetexternal.Interface) {
-	_, osClient, internalKubeClientset, externalKubeClientset, err := c.GetServiceAccountClients(bootstrappolicy.InfraDeploymentConfigControllerServiceAccountName)
+	_, osClient, internalKubeClientset, externalKubeClientset, err := c.GetServiceAccountClients(bootstrappolicy.InfraDeploymentControllerServiceAccountName)
mfojtik (Author): @deads2k fyi

@@ -208,7 +208,7 @@ func init() {
 		},
 		// DeploymentController.podClient
 		{
-			Verbs: sets.NewString("get", "list", "create", "watch", "delete", "update"),
+			Verbs: sets.NewString("get", "list", "watch"),
mfojtik (Author): @deads2k down here we have:

 				{
 					Verbs:     sets.NewString("create", "update", "patch"),
 					Resources: sets.NewString("events"),
 				},

and this is copied in other policies... do we really need PATCH and UPDATE for events?

Contributor: Maybe patching/updating events has to do with event compression?
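For reference, the tightened rules under discussion would read roughly as follows (a sketch in the bootstrap-policy style of the diff; the Resources lines are assumptions inferred from the surrounding context):

// DeploymentController.podClient: read-only access now suffices, since
// deployer and hook pods are cleaned up through GC instead of being
// deleted directly by the controller.
{
	Verbs:     sets.NewString("get", "list", "watch"),
	Resources: sets.NewString("pods"),
},
// Event recording keeps its write verbs; whether "update"/"patch" are truly
// needed (e.g. for event compression) is the open question above.
{
	Verbs:     sets.NewString("create", "update", "patch"),
	Resources: sets.NewString("events"),
},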

@enj (Contributor) commented May 5, 2017

> will you guys have any objections to granting the "deployer" SA the ability to delete pods in the project

@mfojtik I am missing some context on this PR but shouldn't you use a custom controller SA for this? @deads2k's #14033 is working towards that. Then you could grant that SA whatever permissions it needed. Maybe there are other issues I am missing with regard to this being inside a project...

I cannot comment on whether you actually have other viable solutions to the problem as a whole.

@0xmichalis (Contributor)

> @mfojtik I am missing some context on this PR but shouldn't you use a custom controller SA for this? @deads2k's #14033 is working towards that. Then you could grant that SA whatever permissions it needed. Maybe there are other issues I am missing with regard to this being inside a project...
>
> I cannot comment on whether you actually have other viable solutions to the problem as a whole.

We are already using a custom SA for the deployer; the question is "do we want to give pod deletion power to the deployer?"

}

encoder := kapi.Codecs.LegacyCodec(kapi.Registry.EnabledVersions()...)
errors := []error{}
Contributor: var errors []error

}
newPod, ok := objCopy.(*kapi.Pod)
if !ok {
errors = append(errors, fmt.Errorf("object %#+v is not a pod", objCopy))
Contributor: It's unlikely that this error will ever be hit, but if you want it, that's fine.

mfojtik (Author): @Kargakis for the sake of somebody who will copy&paste this code somewhere else :-)

@enj (Contributor) commented May 5, 2017

> We are already using a custom SA for the deployer; the question is "do we want to give pod deletion power to the deployer?"

@Kargakis Ah OK. Not really sure on that one.

@mfojtik (Author) commented May 9, 2017

@deads2k @Kargakis do we want to merge this or hold? (this is a bugfix)

oldPodBytes, err := runtime.Encode(encoder, pod)
if err != nil {
errors = append(errors, err)
}
Contributor: Missing a continue?

mfojtik (Author): good catch, fixed.
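A sketch of the corrected loop after the fix (names beyond those visible in the diff hunks are assumptions; kapi.Scheme.DeepCopy follows the 2017 internal API): each failure appends an error and continues to the next pod instead of falling through with bad data.

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	kapi "k8s.io/kubernetes/pkg/api"
)

// setHookPodOwnerRefs illustrates the control flow under review: encode the
// old pod, deep-copy it, type-assert the copy, and collect errors without
// aborting the whole loop on a single bad pod.
func setHookPodOwnerRefs(encoder runtime.Encoder, pods []*kapi.Pod) []error {
	var errors []error
	for _, pod := range pods {
		oldPodBytes, err := runtime.Encode(encoder, pod)
		if err != nil {
			errors = append(errors, err)
			continue // the missing continue caught in review
		}
		objCopy, err := kapi.Scheme.DeepCopy(pod)
		if err != nil {
			errors = append(errors, err)
			continue
		}
		newPod, ok := objCopy.(*kapi.Pod)
		if !ok {
			errors = append(errors, fmt.Errorf("object %#+v is not a pod", objCopy))
			continue
		}
		// ... set the RC ownerRef on newPod, encode it, and patch it against
		// oldPodBytes ...
		_, _ = oldPodBytes, newPod
	}
	return errors
}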

@0xmichalis (Contributor)

> @deads2k @Kargakis do we want to merge this or hold? (this is a bugfix)

I am fine with merging.

@mfojtik (Author) commented May 19, 2017

@deads2k rolled back the changes to the ICT controller; will open a separate PR for that once this merges.

@openshift-bot added the needs-rebase label May 19, 2017
@mfojtik (Author) commented May 22, 2017

@bparees @coreydaley pls check, I made some tweaks to accommodate the cleanup policy (I think the build controller now needs the ability to delete builds and create build configs?)

@openshift-bot removed the needs-rebase label May 22, 2017
@openshift-bot (Contributor)

Evaluated for origin test up to c3d4af7

@openshift-bot (Contributor)

Evaluated for origin merge up to c3d4af7

@openshift-bot commented May 22, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/744/) (Base Commit: a666b7c) (Image: devenv-rhel7_6247)

@openshift-bot (Contributor)

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/1605/) (Base Commit: a666b7c)

@openshift-bot merged commit dc33c36 into openshift:master May 22, 2017
@bparees (Contributor) commented May 22, 2017

> @bparees @coreydaley pls check, I made some tweaks to accommodate the cleanup policy (I think the build controller now needs the ability to delete builds and create build configs?)

@mfojtik looks like this merged before we had a chance to look at it... why would the build controller need the ability to create buildconfigs?

@0xmichalis (Contributor)

> @mfojtik looks like this merged before we had a chance to look at it... why would the build controller need the ability to create buildconfigs?

If you delete something with an owner ref, you have to be able to create owners. I think.

@0xmichalis (Contributor)

> If you delete something with an owner ref, you have to be able to create owners. I think.

No, I confused this with another rule: adding owner refs to things means you can also delete them. Not sure about this one.

@mfojtik (Author) commented May 22, 2017

sorry, it was not create but delete build and get bc (https://github.com/openshift/origin/blob/master/pkg/cmd/server/origin/controller/build.go#L59)

@bparees (Contributor) commented May 22, 2017

ok yeah those make sense.

@@ -428,7 +429,10 @@ func (c *DeploymentConfigController) cleanupOldDeployments(existingDeployments [
 			continue
 		}

-		err := c.rn.ReplicationControllers(deployment.Namespace).Delete(deployment.Name, nil)
+		policy := metav1.DeletePropagationBackground
+		err := c.rn.ReplicationControllers(deployment.Namespace).Delete(deployment.Name, &metav1.DeleteOptions{
Contributor: @tnozicka fyi this is related to DeploymentConfig GC
