
Make the trigger controllers cooperate instead of race #8746

Merged · 5 commits · May 20, 2016

Conversation

@0xmichalis (Contributor) commented May 4, 2016

The plan here is to remove the generator from the trigger controllers and make the image change controller update only the spec of the deploymentConfig, deferring to the config change controller for incrementing latestVersion.

Fixes #8937
Fixes #8696
Fixes #6934

Trello card: https://trello.com/c/XRA3RF5W/714-5-remove-shared-dependency-from-trigger-controllers

@ironcladlou @smarterclayton @deads2k @liggitt @pweil- @mfojtik
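
For illustration, a minimal sketch of the intended split, using hypothetical simplified types rather than the actual OpenShift API: the image change controller only rewrites the spec, and the config change controller is the only thing that moves latestVersion, by diffing the spec against what was last deployed.

```go
// Minimal sketch of the intended split, using simplified, hypothetical types
// (not the actual OpenShift API). The image change controller only mutates
// spec; the config change controller is the only place latestVersion moves.
package main

import "fmt"

type DeploymentConfig struct {
	Spec   Spec
	Status Status
}

type Spec struct {
	Template     string // stand-in for the pod template
	TriggerImage string // image resolved from the image stream tag
}

type Status struct {
	LatestVersion    int
	DeployedTemplate string // template of the most recent deployment
}

// imageChangeSync only rewrites spec when a new image lands in the tag.
func imageChangeSync(dc *DeploymentConfig, resolvedImage string) {
	if dc.Spec.TriggerImage != resolvedImage {
		dc.Spec.TriggerImage = resolvedImage
		dc.Spec.Template = "template-with-" + resolvedImage
	}
}

// configChangeSync is the only controller that bumps latestVersion, and it
// does so purely by diffing spec against what was last deployed.
func configChangeSync(dc *DeploymentConfig) {
	if dc.Spec.Template != dc.Status.DeployedTemplate {
		dc.Status.LatestVersion++
		dc.Status.DeployedTemplate = dc.Spec.Template
	}
}

func main() {
	dc := &DeploymentConfig{}
	imageChangeSync(dc, "registry.example/app:abc123")
	configChangeSync(dc)
	fmt.Println(dc.Status.LatestVersion) // 1: a single, well-ordered bump
}
```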

@deads2k (Contributor) commented May 4, 2016

I think that if we're going to the effort of a significant refactor, we should invest the time to switch to an Informer and workqueue pattern. Those controllers are easy to parallelize, easy to de-dupe, easy to handle retries, and easy to share common caches.

@ironcladlou (Contributor):

> I think that if we're going to the effort of a significant refactor, we should invest the time to switch to an Informer and workqueue pattern. Those controllers are easy to parallelize, easy to de-dupe, easy to handle retries, and easy to share common caches.

Would you be okay if @Kargakis continued to chisel the overall shape of the system into something more closely resembling upstream before doing an informer/workqueue refactor? I can envision an informer refactor helping us get down to a single controller once these remaining peripheral refactors are addressed.

@deads2k (Contributor) commented May 4, 2016

> Would you be okay if @Kargakis continued to chisel the overall shape of the system into something more closely resembling upstream before doing an informer/workqueue refactor? I can envision an informer refactor helping us get down to a single controller once these remaining peripheral refactors are addressed.

I'm not close to the area, but I would have expected the effort to look more like creating a new controller using an informer and workqueue, then picking pieces like this into it. Doing it like this seems like it would be going through the effort twice.

Is it somehow easier to get there after combining our controllers using a different mechanism first?
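
For reference, a minimal self-contained sketch of the workqueue side of that pattern. The real implementation would use the shared informer and workqueue packages from Kubernetes; the queue below is a simplified stand-in. Events collapse into keys, duplicate keys de-dupe while queued, parallel workers drain the queue, and a failed key is retried by re-queueing it.

```go
// Simplified stand-in for the informer/workqueue pattern: keys de-dupe while
// queued, several workers drain the queue, failed keys are re-queued.
package main

import (
	"fmt"
	"sync"
)

type workQueue struct {
	mu      sync.Mutex
	cond    *sync.Cond
	queued  map[string]bool // de-dupes keys that are already waiting
	pending []string
}

func newWorkQueue() *workQueue {
	q := &workQueue{queued: map[string]bool{}}
	q.cond = sync.NewCond(&q.mu)
	return q
}

func (q *workQueue) Add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.queued[key] {
		return // duplicate event for the same object: collapse it
	}
	q.queued[key] = true
	q.pending = append(q.pending, key)
	q.cond.Signal()
}

func (q *workQueue) Get() string {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.pending) == 0 {
		q.cond.Wait()
	}
	key := q.pending[0]
	q.pending = q.pending[1:]
	delete(q.queued, key)
	return key
}

// syncDC is a hypothetical reconcile function for the object behind the key.
func syncDC(key string) error {
	fmt.Println("syncing", key)
	return nil
}

func main() {
	q := newWorkQueue()
	// An informer's event handlers would call Add with "namespace/name" keys.
	q.Add("myproject/frontend")
	q.Add("myproject/frontend") // collapsed: only one sync will run
	q.Add("myproject/database")

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // parallel workers draining the queue
		wg.Add(1)
		go func() {
			defer wg.Done()
			key := q.Get()
			if err := syncDC(key); err != nil {
				q.Add(key) // naive retry: re-queue the key
			}
		}()
	}
	wg.Wait()
}
```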

@deads2k (Contributor) commented May 4, 2016

Perhaps a quick sketch will help. Here's what I would expect our DC controllers to look like:

  1. DC config change: watches all DCs. On any spec change, checks for a config change trigger; if it exists, bump the spec generation.
  2. istag change: watches all istags and all DCs. Maintains a reverse map of istag to DC with a lock factory for each istag to ensure strict happens-before semantics and avoid drift as DCs race with istag updates. If the istag is new, bump the DC spec generation.
  3. dc spec generation: watches all DCs. On any spec or status change, look for mismatched generations. Check for an in-progress change (a status field, maybe) to drop any in progress.
  4. deployment pod: when it's done working, marks the in-progress status field.

Each controller is small, none overlap, clean hand-offs, most races stopped due to pessimistic updates.

Is this what we're shooting for, or are we looking for something else?
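
A minimal sketch of the generation hand-off described in the numbered list above, with hypothetical simplified fields standing in for the real objects:

```go
// Hypothetical, simplified fields; the real objects carry much more state.
package main

import "fmt"

type DC struct {
	SpecGeneration     int64 // bumped by the trigger controllers (items 1 and 2)
	ObservedGeneration int64 // recorded once a deployment for that spec exists
	InProgress         bool  // set while a deployment pod is still working
}

// onTrigger is all a trigger controller does: bump, nothing else.
func onTrigger(dc *DC) {
	dc.SpecGeneration++
}

// reconcile (item 3) is the single place a generation mismatch becomes work.
func reconcile(dc *DC) {
	if dc.InProgress {
		return // drop anything while a change is in flight
	}
	if dc.SpecGeneration == dc.ObservedGeneration {
		return // nothing to do
	}
	fmt.Printf("deploying generation %d\n", dc.SpecGeneration)
	dc.InProgress = true
	dc.ObservedGeneration = dc.SpecGeneration
}

func main() {
	dc := &DC{}
	onTrigger(dc)         // config change
	onTrigger(dc)         // istag change arriving before the first deploy starts
	reconcile(dc)         // one deployment covers both bumps
	dc.InProgress = false // item 4: the deployment pod marks completion
	reconcile(dc)         // no-op: generations already match
}
```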

@deads2k (Contributor) commented May 5, 2016

I could probably see a fifth to handle cancellation that would promote a spec request into a concrete one: revert to exactly this config as the next deployment. In trying to think it all through, I'd probably make a really specific status field that tracked my destination ("make this"), and I could see a more specific request mechanism that allowed a detailed "I'd like you to do this" in cases more complicated than a straight "deploy again".

I'm not sure that I see the solution as trying to combine all the function into a single controller. Is there a consensus reached elsewhere that the multi-controller system is harder to deal with? What is the hand-off between our controllers today?

@smarterclayton (Contributor):

Also, I want to be crystal clear: our #1 priority is to get alignment with upstream without breaking existing deployments. Refactoring the controllers should be a secondary priority underneath that. Refactors not in pursuit of that delivery need a really high bar, and "because it's cleaner" doesn't clear it. I will be merciless with reviews until we can show that upgrade path MVP.

@smarterclayton (Contributor):

Higher than "because it's cleaner". "Because it's faster" is better, but not enough.

@smarterclayton (Contributor):

On May 4, 2016, at 7:58 PM, David Eads wrote:

> Perhaps a quick sketch will help. Here's what I would expect our DC controllers to look like:
>
> 1. DC config change: watches all DCs. On any spec change, checks for a config change trigger; if it exists, bump the spec generation.
> 2. istag change: watches all istags and all DCs. Maintains a reverse map of istag to DC with a lock factory for each istag to ensure strict happens-before semantics and avoid drift as DCs race with istag updates. If the istag is new, bump the DC spec generation.

Moving to a different controller. Do this last.

> 3. dc spec generation: watches all DCs. On any spec or status change, look for mismatched generations. Check for an in-progress change (a status field, maybe) to drop any in progress.
> 4. deployment pod: when it's done working, marks the in-progress status field.
>
> Each controller is small, none overlap, clean hand-offs, most races stopped due to pessimistic updates.
>
> Is this what we're shooting for, or are we looking for something else?

Don't underestimate DC complexity. I don't want us chasing subtle regressions until we have deployments round tripping.

@deads2k (Contributor) commented May 5, 2016

> Moving to a different controller. Do this last.

It wasn't meant as a TODO list, more of an appeal for help understanding what the alternate vision is.

@0xmichalis (Contributor, Author) commented May 5, 2016

@ironcladlou after thinking about this more, we could possibly continue using the config change controller and just trash the generator. Unfortunately, the semantics of latestVersion are wrong. Think about this for a sec: the main controller never updates latestVersion, it just acts based on it. Unfortunately oc (and probably the console, cc @jwforres) are latestVersion-addicts, meaning they always expect latestVersion to be synced, so we need to continue supporting it without allowing them to update it directly (#7149). @deads2k I think that's the main reason shared caches are not a top priority at the moment: we want latestVersion in etcd as soon as we can. Races between the trigger controllers will not be a thing once we remove the shared dependency (#8696).

I think the hardest part is the image change controller, as it needs to update both spec and status. @liggitt suggested having the image change controller update just the spec and the config change controller detect the template diff and update status. Then we can special-case the image change controller changes in the main controller instead of the registry (what I am doing in #7149 currently) and have the main controller set the status details. Thoughts?
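
A rough sketch of that suggestion, with hypothetical simplified types (not the real API): the image change controller only touches the template in the spec, and one place diffs the template against the last deployed one to decide both whether to bump latestVersion and which cause to record in the status details.

```go
// Hypothetical simplification: classify the pending change by diffing the
// current template against the one captured at the last deployment.
package main

import "fmt"

type Template struct {
	Image string // image injected by the image change controller
	Env   string // stand-in for everything else in the pod template
}

type Config struct {
	Template      Template
	LatestVersion int
	LastDeployed  Template // template captured when latestVersion last moved
}

// detectCause reports whether a new deployment is needed and what caused it,
// so the status details can record an accurate cause.
func detectCause(c *Config) (changed bool, cause string) {
	switch {
	case c.Template == c.LastDeployed:
		return false, ""
	case c.Template.Env == c.LastDeployed.Env:
		return true, "ImageChange" // only the resolved image moved
	default:
		return true, "ConfigChange"
	}
}

func main() {
	c := &Config{Template: Template{Image: "app:v1", Env: "A"}}
	if changed, cause := detectCause(c); changed {
		c.LatestVersion++
		c.LastDeployed = c.Template
		fmt.Printf("version %d caused by %s\n", c.LatestVersion, cause)
	}
}
```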

@0xmichalis (Contributor, Author) commented May 5, 2016

@smarterclayton this is not driven by readability or performance concerns. We need to transition to separate status updates and reduce the scope of what each controller is updating. Currently the trigger controllers are a mess because of the generator, and our client tools are a mess because of latestVersion.

@smarterclayton (Contributor):

I'm ok with the generator refactor - I'm just concerned about much larger refactors.


@smarterclayton (Contributor):

I'm fine with the suggestion on the image change controller.


@0xmichalis changed the title from "[WIP] Stop using the configchange controller" to "[WIP] Make the trigger controllers cooperate instead of race" on May 9, 2016
@openshift-bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on May 9, 2016
@0xmichalis (Contributor, Author):

@ironcladlou @smarterclayton PTAL

No more generator and fewer hacks in the server 🎉 I didn't touch the generator code, just unplugged it. Since we expose it as an endpoint, I guess we will need to keep it around.

@0xmichalis added the component/apps label and removed the needs-rebase label on May 11, 2016
@0xmichalis (Contributor, Author):

@liggitt this is what you expected

// change controller did its job. The second return argument helps in separating between config
// change and image change causes.
//
// TODO: Handle multiple image change triggers
Contributor:

Kind of an important todo

Contributor:

Unless you just mean record them, in which case it's not that important

Contributor Author:

Well, the only thing remaining here is the annotation. Only one image is currently stored as lastTriggered. I am thinking of saving all images as part of the annotation, i.e. deployment.openshift.io/last-triggered-image: fullspec1,fullspec2,.. for multiple triggers. Thoughts?

Contributor:

Serialize as a JSON array if you have to; using commas is pretty gross. Or add multiple keys. For the generic trigger controller we're going to have to use multiple annotations.

Contributor Author:

I think I will add multiple keys for now. What's the generic trigger controller? Config change?
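
A hypothetical illustration of the multiple-keys option: one last-triggered-image annotation per image change trigger instead of a comma-joined value. The key and value shapes here are made up for the example; annotation keys still have to pass the qualified-name validation discussed below.

```go
// Illustrative only: one annotation key per image change trigger. The prefix
// and key layout are assumptions, not the actual OpenShift annotation names.
package main

import "fmt"

func lastTriggeredAnnotations(triggers map[string]string) map[string]string {
	ann := map[string]string{}
	for tag, image := range triggers {
		// One key per trigger, e.g. "...last-triggered-image.myproject.frontend.latest"
		// mapping to the full image spec, instead of commas in a single value.
		ann["deployment.openshift.io/last-triggered-image."+tag] = image
	}
	return ann
}

func main() {
	ann := lastTriggeredAnnotations(map[string]string{
		"myproject.frontend.latest": "registry.example/frontend@sha256:abc",
		"myproject.backend.latest":  "registry.example/backend@sha256:def",
	})
	for k, v := range ann {
		fmt.Println(k, "=", v)
	}
}
```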

@0xmichalis added this to the 1.3.0 milestone on May 12, 2016
@0xmichalis (Contributor, Author):

@smarterclayton do you have any idea why annotation keys need to be qualified names?
https://github.com/kubernetes/kubernetes/blob/4591aa0f3b9bf003d0244cc389e020a64378907e/pkg/api/validation/validation.go#L97

I want to use namespace/name:tag for the annotation key but it is rejected by validation

@0xmichalis (Contributor, Author):

Switched to namespace.name.tag for now. I think this is ready for reviews; I just need to get to tests.

@smarterclayton (Contributor):

Are we sure that the allowed tag values overlap with the allowed annotation keys?


@0xmichalis (Contributor, Author):

> Are we sure that the allowed tag values overlap with the allowed annotation keys?

Hm, not on the deployment side, but we do validate imagestreamtag names, albeit with looser validation.

@0xmichalis (Contributor, Author):

@mfojtik fyi I am also fixing #6934 here

cc: @bparees

```go
		continue
	}

	// Check if the image repo matches the trigger
	if !triggerMatchesImage(config, params, imageRepo) {
		// Resolve images for all initial deployments.
```
Contributor:

Comment seems inaccurate? Isn't it "Only detect changes when the trigger's enabled" or something along those lines?

Contributor Author:

It's not inaccurate - maybe ambiguous? All initial deployments will have their images resolved, even those with automatic==false (fixes #6934). Otherwise (latestVersion != 0), we are going to resolve only automatic==true triggers. I will update the comment to clarify this.
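
A small sketch of that rule with hypothetical simplified types: on an initial deployment (latestVersion == 0) every image change trigger gets its image resolved, even with automatic==false; after that, only automatic triggers cause resolution.

```go
// Hypothetical simplification of the rule being clarified here (fixes #6934).
package main

import "fmt"

type ImageChangeTrigger struct {
	Automatic     bool
	ResolvedImage string
}

func resolveTriggers(latestVersion int64, triggers []*ImageChangeTrigger, lookup func() string) {
	initial := latestVersion == 0
	for _, t := range triggers {
		if !initial && !t.Automatic {
			continue // existing configs: respect automatic=false
		}
		t.ResolvedImage = lookup() // resolve the istag into a concrete image
	}
}

func main() {
	triggers := []*ImageChangeTrigger{{Automatic: false}, {Automatic: true}}
	resolveTriggers(0, triggers, func() string { return "registry.example/app@sha256:abc" })
	fmt.Println(triggers[0].ResolvedImage != "") // true: resolved even though automatic=false
}
```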

@0xmichalis (Contributor, Author) commented May 19, 2016

> So it seems to me like the config change controller doesn't even need to exist anymore - all it does can be wrapped up into the sync function in the deployment config controller. Then there are even less places to deal with updates.

Actually, my first take on this was to do exactly that, but it would break a lot of people. The reason is that the deploymentconfig controller is syncing based on latestVersion.

Say we remove the cc controller: when the deploymentconfig controller receives a latestVersion update, it will have to update the deploymentconfig, exit the reconciliation (otherwise clients that expect a specific latestVersion are broken), then wait for the next etcd event triggered by this update, and resync the deploymentconfig as it does now, right?

@ironcladlou (Contributor):

> Say we remove the cc controller: when the deploymentconfig controller receives a latestVersion update, it will have to update the deploymentconfig, exit the reconciliation (otherwise clients that expect a specific latestVersion are broken), then wait for the next etcd event triggered by this update, and resync the deploymentconfig as it does now, right?

My thought was that the config controller would update the latestVersion and then create the new deployment during the same handle call. The DC update would cause the DC to be re-queued and would no-op the 2nd pass through handle.

Anyway, I agree (based on IRC convo) that it's not worth doing that sort of change right now.

@ironcladlou (Contributor):

This LGTM as long as all the integration tests still pass.

@0xmichalis (Contributor, Author) commented May 19, 2016

@smarterclayton @ironcladlou I just realized that we don't need to store any new annotation that serializes anything, we can just check the serialized config of the latest deployment (which is already deserialized for comparing templates). Last commit, PTAL

```go
if err != nil {
	return false, fatalError(fmt.Sprintf("error decoding deployment config from deployment %q for deployment config %s: %v", deployutil.LabelForDeployment(deployment), deployutil.LabelForDeploymentConfig(config), err))
}
return c.decodeConfig(deployment)
```
Contributor:

Note for future ref: I think this decoder abstraction has proven to be annoying/unnecessary and we can probably just replace it with a simple stateless utility.

@ironcladlou (Contributor):

@Kargakis

> I just realized that we don't need to store any new annotation that serializes anything, we can just check the serialized config of the latest deployment (which is already deserialized for comparing templates). Last commit, PTAL

I think this works, good idea.
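
A rough sketch of the idea, with hypothetical names and a made-up annotation key (the real controller has its own key and codec): decode the config that was encoded onto the latest deployment and compare pod templates to decide whether the change has already been rolled out.

```go
// Hypothetical helpers and annotation key; JSON stands in for the real codec.
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// encodedConfigAnnotation is a stand-in name; the real controller stores the
// encoded config on the deployment under its own annotation key.
const encodedConfigAnnotation = "openshift.io/encoded-deployment-config"

type PodTemplate struct {
	Image string `json:"image"`
}

type Config struct {
	Template PodTemplate `json:"template"`
}

type Deployment struct {
	Annotations map[string]string
}

// alreadyDeployed reports whether the current config's template matches the
// one decoded from the latest deployment, i.e. no new rollout is needed.
func alreadyDeployed(current *Config, latest *Deployment) (bool, error) {
	var decoded Config
	if err := json.Unmarshal([]byte(latest.Annotations[encodedConfigAnnotation]), &decoded); err != nil {
		return false, fmt.Errorf("error decoding deployment config: %v", err)
	}
	return reflect.DeepEqual(decoded.Template, current.Template), nil
}

func main() {
	latest := &Deployment{Annotations: map[string]string{
		encodedConfigAnnotation: `{"template":{"image":"app:v1"}}`,
	}}
	current := &Config{Template: PodTemplate{Image: "app:v2"}}
	ok, _ := alreadyDeployed(current, latest)
	fmt.Println(ok) // false: the new image has not been deployed yet
}
```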

@0xmichalis (Contributor, Author):

one more [test]

@0xmichalis (Contributor, Author):

known flakes [test]

@0xmichalis (Contributor, Author):

new flake #8946

[test]

@0xmichalis (Contributor, Author):

another new: #8947

gotta deflake em all!

[test]

@0xmichalis (Contributor, Author):

@ironcladlou @smarterclayton no other concerns from me. Can we merge this?

@0xmichalis (Contributor, Author):

Hold on, I need to cleanup

@0xmichalis (Contributor, Author):

Ready

@smarterclayton (Contributor):

I have no other concerns [merge]


@openshift-bot (Contributor) commented May 19, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5954/) (Image: devenv-rhel7_4245)

@openshift-bot (Contributor):

Evaluated for origin merge up to e3510a7

@openshift-bot (Contributor):

Evaluated for origin test up to e3510a7

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/3939/)
