
Make the trigger controllers cooperate instead of race #8746

Merged · 5 commits · May 20, 2016

Conversation

@0xmichalis (Contributor) commented May 4, 2016

The plan here is to remove the generator from the trigger controllers and make the image change controller update only the spec of the deploymentConfig, deferring to the config change controller for incrementing latestVersion.

Fixes #8937
Fixes #8696
Fixes #6934

Trello card: https://trello.com/c/XRA3RF5W/714-5-remove-shared-dependency-from-trigger-controllers

@ironcladlou @smarterclayton @deads2k @liggitt @pweil- @mfojtik
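
For illustration, a minimal sketch of the intended split, using hypothetical simplified types rather than the actual OpenShift API: the image change controller only rewrites the spec, and the config change controller is the only thing that moves latestVersion, by diffing the spec against what was last deployed.

```go
// Minimal sketch of the intended split, using simplified, hypothetical types
// (not the actual OpenShift API). The image change controller only mutates
// spec; the config change controller is the only place latestVersion moves.
package main

import "fmt"

type DeploymentConfig struct {
	Spec   Spec
	Status Status
}

type Spec struct {
	Template     string // stand-in for the pod template
	TriggerImage string // image resolved from the image stream tag
}

type Status struct {
	LatestVersion    int
	DeployedTemplate string // template of the most recent deployment
}

// imageChangeSync only rewrites spec when a new image lands in the tag.
func imageChangeSync(dc *DeploymentConfig, resolvedImage string) {
	if dc.Spec.TriggerImage != resolvedImage {
		dc.Spec.TriggerImage = resolvedImage
		dc.Spec.Template = "template-with-" + resolvedImage
	}
}

// configChangeSync is the only controller that bumps latestVersion, and it
// does so purely by diffing spec against what was last deployed.
func configChangeSync(dc *DeploymentConfig) {
	if dc.Spec.Template != dc.Status.DeployedTemplate {
		dc.Status.LatestVersion++
		dc.Status.DeployedTemplate = dc.Spec.Template
	}
}

func main() {
	dc := &DeploymentConfig{}
	imageChangeSync(dc, "registry.example/app:abc123")
	configChangeSync(dc)
	fmt.Println(dc.Status.LatestVersion) // 1: a single, well-ordered bump
}
```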

@deads2k (Contributor) commented May 4, 2016

I think that if we're going to the effort of a significant refactor, we should invest the time to switch to an Informer and workqueue pattern. Those controllers are easy to parallelize, easy to de-dupe, easy to handle retries, and easy to share common caches.

@ironcladlou (Contributor):

> I think that if we're going to the effort of a significant refactor, we should invest the time to switch to an Informer and workqueue pattern. Those controllers are easy to parallelize, easy to de-dupe, easy to handle retries, and easy to share common caches.

Would you be okay if @Kargakis continued to chisel the overall shape of the system into something more closely resembling upstream before doing an informer/workqueue refactor? I can envision an informer refactor helping us get down to a single controller once these remaining peripheral refactors are addressed.

@deads2k (Contributor) commented May 4, 2016

> Would you be okay if @Kargakis continued to chisel the overall shape of the system into something more closely resembling upstream before doing an informer/workqueue refactor? I can envision an informer refactor helping us get down to a single controller once these remaining peripheral refactors are addressed.

I'm not close to the area, but I would have expected the effort to look more like creating a new controller using an informer and workqueue, then picking pieces like this into it. Doing it like this seems like it would be going through the effort twice.

Is it somehow easier to get there after combining our controllers using a different mechanism first?
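
For reference, a minimal self-contained sketch of the workqueue side of that pattern. The real implementation would use the shared informer and workqueue packages from Kubernetes; the queue below is a simplified stand-in. Events collapse into keys, duplicate keys de-dupe while queued, parallel workers drain the queue, and a failed key is retried by re-queueing it.

```go
// Simplified stand-in for the informer/workqueue pattern: keys de-dupe while
// queued, several workers drain the queue, failed keys are re-queued.
package main

import (
	"fmt"
	"sync"
)

type workQueue struct {
	mu      sync.Mutex
	cond    *sync.Cond
	queued  map[string]bool // de-dupes keys that are already waiting
	pending []string
}

func newWorkQueue() *workQueue {
	q := &workQueue{queued: map[string]bool{}}
	q.cond = sync.NewCond(&q.mu)
	return q
}

func (q *workQueue) Add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.queued[key] {
		return // duplicate event for the same object: collapse it
	}
	q.queued[key] = true
	q.pending = append(q.pending, key)
	q.cond.Signal()
}

func (q *workQueue) Get() string {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.pending) == 0 {
		q.cond.Wait()
	}
	key := q.pending[0]
	q.pending = q.pending[1:]
	delete(q.queued, key)
	return key
}

// syncDC is a hypothetical reconcile function for the object behind the key.
func syncDC(key string) error {
	fmt.Println("syncing", key)
	return nil
}

func main() {
	q := newWorkQueue()
	// An informer's event handlers would call Add with "namespace/name" keys.
	q.Add("myproject/frontend")
	q.Add("myproject/frontend") // collapsed: only one sync will run
	q.Add("myproject/database")

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // parallel workers draining the queue
		wg.Add(1)
		go func() {
			defer wg.Done()
			key := q.Get()
			if err := syncDC(key); err != nil {
				q.Add(key) // naive retry: re-queue the key
			}
		}()
	}
	wg.Wait()
}
```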

@deads2k (Contributor) commented May 4, 2016

Perhaps a quick sketch will help. Here's what I would expect our DC controllers to look like:

  1. DC config change: watches all DCs. On any spec change, checks for a config change trigger; if it exists, bump the spec generation.
  2. istag change: watches all istags and all DCs. Maintains a reverse map of istag to DC with a lock factory for each istag to ensure strict happens-before semantics and avoid drift as DCs race with istag updates. If the istag is new, bump the DC spec generation.
  3. dc spec generation: watches all DCs. On any spec or status change, look for mismatched generations. Check for an in-progress change (a status field, maybe) to drop any in progress.
  4. deployment pod: when it's done working, marks the in-progress status field.

Each controller is small, none overlap, clean hand-offs, most races stopped due to pessimistic updates.

Is this what we're shooting for, or are we looking for something else?
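
A minimal sketch of the generation hand-off described in the numbered list above, with hypothetical simplified fields standing in for the real objects:

```go
// Hypothetical, simplified fields; the real objects carry much more state.
package main

import "fmt"

type DC struct {
	SpecGeneration     int64 // bumped by the trigger controllers (items 1 and 2)
	ObservedGeneration int64 // recorded once a deployment for that spec exists
	InProgress         bool  // set while a deployment pod is still working
}

// onTrigger is all a trigger controller does: bump, nothing else.
func onTrigger(dc *DC) {
	dc.SpecGeneration++
}

// reconcile (item 3) is the single place a generation mismatch becomes work.
func reconcile(dc *DC) {
	if dc.InProgress {
		return // drop anything while a change is in flight
	}
	if dc.SpecGeneration == dc.ObservedGeneration {
		return // nothing to do
	}
	fmt.Printf("deploying generation %d\n", dc.SpecGeneration)
	dc.InProgress = true
	dc.ObservedGeneration = dc.SpecGeneration
}

func main() {
	dc := &DC{}
	onTrigger(dc)         // config change
	onTrigger(dc)         // istag change arriving before the first deploy starts
	reconcile(dc)         // one deployment covers both bumps
	dc.InProgress = false // item 4: the deployment pod marks completion
	reconcile(dc)         // no-op: generations already match
}
```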

@deads2k (Contributor) commented May 5, 2016

I could probably see a fifth to handle cancellation that would promote a spec request into a concrete one: revert to exactly this config as the next deployment. In trying to think it all through, I'd probably make a really specific status field that tracked my destination ("make this"), and I could see a more specific request mechanism that allowed a detailed "I'd like you to do this" in cases more complicated than a straight "deploy again".

I'm not sure that I see the solution as trying to combine all the function into a single controller. Is there a consensus reached elsewhere that the multi-controller system is harder to deal with? What is the hand-off between our controllers today?

@smarterclayton (Contributor):

Also, I want to be crystal clear: our #1 priority is to get alignment with upstream without breaking existing deployments. Refactoring the controllers should be a secondary priority underneath that. Refactors not in pursuit of that delivery need a really high bar, and "because it's cleaner" doesn't clear it. I will be merciless with reviews until we can show that upgrade path MVP.

@smarterclayton (Contributor):

Higher than "because it's cleaner". "Because it's faster" is better, but not enough.

@smarterclayton (Contributor):

On May 4, 2016, at 7:58 PM, David Eads wrote:

> Perhaps a quick sketch will help. Here's what I would expect our DC controllers to look like:
>
> 1. DC config change: watches all DCs. On any spec change, checks for a config change trigger; if it exists, bump the spec generation.
> 2. istag change: watches all istags and all DCs. Maintains a reverse map of istag to DC with a lock factory for each istag to ensure strict happens-before semantics and avoid drift as DCs race with istag updates. If the istag is new, bump the DC spec generation.

Moving to a different controller. Do this last.

> 3. dc spec generation: watches all DCs. On any spec or status change, look for mismatched generations. Check for an in-progress change (a status field, maybe) to drop any in progress.
> 4. deployment pod: when it's done working, marks the in-progress status field.
>
> Each controller is small, none overlap, clean hand-offs, most races stopped due to pessimistic updates.
>
> Is this what we're shooting for, or are we looking for something else?

Don't underestimate DC complexity. I don't want us chasing subtle regressions until we have deployments round tripping.

@deads2k (Contributor) commented May 5, 2016

> Moving to a different controller. Do this last.

It wasn't meant as a TODO list, more of an appeal for help understanding what the alternate vision is.

@0xmichalis (Contributor, Author) commented May 5, 2016

@ironcladlou after thinking about this more, we could possibly continue using the config change controller and just trash the generator. Unfortunately, the semantics of latestVersion are wrong. Think about this for a sec: the main controller never updates latestVersion, it just acts based on it. Unfortunately oc (and probably the console, cc @jwforres) are latestVersion-addicts, meaning they always expect latestVersion to be synced, so we need to continue supporting it without allowing them to update it directly (#7149). @deads2k I think that's the main reason shared caches are not a top priority at the moment: we want latestVersion in etcd as soon as we can. Races between the trigger controllers will not be a thing once we remove the shared dependency (#8696).

I think the hardest part is the image change controller, as it needs to update both spec and status. @liggitt suggested having the image change controller update just the spec and the config change controller detect the template diff and update status. Then we can special-case the image change controller changes in the main controller instead of the registry (what I am doing in #7149 currently) and have the main controller set the status details. Thoughts?
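
A rough sketch of that suggestion, with hypothetical simplified types (not the real API): the image change controller only touches the template in the spec, and one place diffs the template against the last deployed one to decide both whether to bump latestVersion and which cause to record in the status details.

```go
// Hypothetical simplification: classify the pending change by diffing the
// current template against the one captured at the last deployment.
package main

import "fmt"

type Template struct {
	Image string // image injected by the image change controller
	Env   string // stand-in for everything else in the pod template
}

type Config struct {
	Template      Template
	LatestVersion int
	LastDeployed  Template // template captured when latestVersion last moved
}

// detectCause reports whether a new deployment is needed and what caused it,
// so the status details can record an accurate cause.
func detectCause(c *Config) (changed bool, cause string) {
	switch {
	case c.Template == c.LastDeployed:
		return false, ""
	case c.Template.Env == c.LastDeployed.Env:
		return true, "ImageChange" // only the resolved image moved
	default:
		return true, "ConfigChange"
	}
}

func main() {
	c := &Config{Template: Template{Image: "app:v1", Env: "A"}}
	if changed, cause := detectCause(c); changed {
		c.LatestVersion++
		c.LastDeployed = c.Template
		fmt.Printf("version %d caused by %s\n", c.LatestVersion, cause)
	}
}
```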

@0xmichalis (Contributor, Author) commented May 5, 2016

@smarterclayton this is not driven by readability or performance concerns. We need to transition to separate status updates and reduce the scope of what each controller is updating. Currently the trigger controllers are a mess because of the generator, and our client tools are a mess because of latestVersion.

@smarterclayton (Contributor):

I'm ok with the generator refactor - I'm just concerned about much larger refactors.


@smarterclayton (Contributor):

I'm fine with the suggestion on the image change controller.


@0xmichalis changed the title from "[WIP] Stop using the configchange controller" to "[WIP] Make the trigger controllers cooperate instead of race" on May 9, 2016
@openshift-bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on May 9, 2016
@0xmichalis (Contributor, Author):

@ironcladlou @smarterclayton PTAL

No more generator and fewer hacks in the server 🎉 I didn't touch the generator code, just unplugged it. Since we expose it as an endpoint, I guess we will need to keep it around.

@0xmichalis added the component/apps label and removed the needs-rebase label on May 11, 2016
@0xmichalis (Contributor, Author):

@liggitt this is what you expected

// change controller did its job. The second return argument helps in separating between config
// change and image change causes.
//
// TODO: Handle multiple image change triggers
Contributor:

Kind of an important todo

Contributor:

Unless you just mean record them, in which case it's not that important

Contributor Author:

Well, the only thing remaining here is the annotation. Only one image is currently stored as lastTriggered. I am thinking of saving all images as part of the annotation, i.e. deployment.openshift.io/last-triggered-image: fullspec1,fullspec2,.. for multiple triggers. Thoughts?

Contributor:

Serialize as a JSON array if you have to; using commas is pretty gross. Or add multiple keys. For the generic trigger controller we're going to have to use multiple annotations.

Contributor Author:

I think I will add multiple keys for now. What's the generic trigger controller? Config change?
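
A hypothetical illustration of the multiple-keys option: one last-triggered-image annotation per image change trigger instead of a comma-joined value. The key and value shapes here are made up for the example; annotation keys still have to pass the qualified-name validation discussed below.

```go
// Illustrative only: one annotation key per image change trigger. The prefix
// and key layout are assumptions, not the actual OpenShift annotation names.
package main

import "fmt"

func lastTriggeredAnnotations(triggers map[string]string) map[string]string {
	ann := map[string]string{}
	for tag, image := range triggers {
		// One key per trigger, e.g. "...last-triggered-image.myproject.frontend.latest"
		// mapping to the full image spec, instead of commas in a single value.
		ann["deployment.openshift.io/last-triggered-image."+tag] = image
	}
	return ann
}

func main() {
	ann := lastTriggeredAnnotations(map[string]string{
		"myproject.frontend.latest": "registry.example/frontend@sha256:abc",
		"myproject.backend.latest":  "registry.example/backend@sha256:def",
	})
	for k, v := range ann {
		fmt.Println(k, "=", v)
	}
}
```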

@0xmichalis added this to the 1.3.0 milestone on May 12, 2016
@0xmichalis (Contributor, Author):

@smarterclayton do you have any idea why annotation keys need to be qualified names?
https://github.com/kubernetes/kubernetes/blob/4591aa0f3b9bf003d0244cc389e020a64378907e/pkg/api/validation/validation.go#L97

I want to use namespace/name:tag for the annotation key but it is rejected by validation

@0xmichalis (Contributor, Author):

Switched to namespace.name.tag for now. I think this is ready for reviews; I just need to get to tests.

@smarterclayton (Contributor):

Are we sure that the allowed tag values overlap with the allowed annotation keys?


@0xmichalis (Contributor, Author):

> Are we sure that the allowed tag values overlap with the allowed annotation keys?

Hm, not on the deployment side, but we do validate imagestreamtag names, albeit with looser validation.

@0xmichalis (Contributor, Author):

@mfojtik fyi I am also fixing #6934 here

cc: @bparees

```go
		continue
	}

	// Check if the image repo matches the trigger
	if !triggerMatchesImage(config, params, imageRepo) {
		// Resolve images for all initial deployments.
```
Contributor:

Comment seems inaccurate? Isn't it "Only detect changes when the trigger's enabled" or something along those lines?

Contributor Author:

It's not inaccurate - maybe ambiguous? All initial deployments will have their images resolved, even those with automatic==false (fixes #6934). Otherwise (latestVersion != 0), we are going to resolve only automatic==true triggers. I will update the comment to clarify this.
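
A small sketch of that rule with hypothetical simplified types: on an initial deployment (latestVersion == 0) every image change trigger gets its image resolved, even with automatic==false; after that, only automatic triggers cause resolution.

```go
// Hypothetical simplification of the rule being clarified here (fixes #6934).
package main

import "fmt"

type ImageChangeTrigger struct {
	Automatic     bool
	ResolvedImage string
}

func resolveTriggers(latestVersion int64, triggers []*ImageChangeTrigger, lookup func() string) {
	initial := latestVersion == 0
	for _, t := range triggers {
		if !initial && !t.Automatic {
			continue // existing configs: respect automatic=false
		}
		t.ResolvedImage = lookup() // resolve the istag into a concrete image
	}
}

func main() {
	triggers := []*ImageChangeTrigger{{Automatic: false}, {Automatic: true}}
	resolveTriggers(0, triggers, func() string { return "registry.example/app@sha256:abc" })
	fmt.Println(triggers[0].ResolvedImage != "") // true: resolved even though automatic=false
}
```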

@0xmichalis (Contributor, Author) commented May 19, 2016

> So it seems to me like the config change controller doesn't even need to exist anymore - all it does can be wrapped up into the sync function in the deployment config controller. Then there are even less places to deal with updates.

Actually, my first take on this was to do exactly that, but it would break a lot of people. The reason is that the deploymentconfig controller is syncing based on latestVersion.

Say we remove the cc controller: when the deploymentconfig controller receives a latestVersion update, it will have to update the deploymentconfig, exit the reconciliation (otherwise clients that expect a specific latestVersion are broken), then wait for the next etcd event triggered by this update, and resync the deploymentconfig as it does now, right?

@ironcladlou (Contributor):

> Say we remove the cc controller: when the deploymentconfig controller receives a latestVersion update, it will have to update the deploymentconfig, exit the reconciliation (otherwise clients that expect a specific latestVersion are broken), then wait for the next etcd event triggered by this update, and resync the deploymentconfig as it does now, right?

My thought was that the config controller would update the latestVersion and then create the new deployment during the same handle call. The DC update would cause the DC to be re-queued and would no-op the 2nd pass through handle.

Anyway, I agree (based on IRC convo) that it's not worth doing that sort of change right now.

@ironcladlou (Contributor):

This LGTM as long as all the integration tests still pass.

@0xmichalis (Contributor, Author) commented May 19, 2016

@smarterclayton @ironcladlou I just realized that we don't need to store any new annotation that serializes anything, we can just check the serialized config of the latest deployment (which is already deserialized for comparing templates). Last commit, PTAL

```go
if err != nil {
	return false, fatalError(fmt.Sprintf("error decoding deployment config from deployment %q for deployment config %s: %v", deployutil.LabelForDeployment(deployment), deployutil.LabelForDeploymentConfig(config), err))
}
return c.decodeConfig(deployment)
```
Contributor:

Note for future ref: I think this decoder abstraction has proven to be annoying/unnecessary and we can probably just replace it with a simple stateless utility.

@ironcladlou (Contributor):

@Kargakis

> I just realized that we don't need to store any new annotation that serializes anything, we can just check the serialized config of the latest deployment (which is already deserialized for comparing templates). Last commit, PTAL

I think this works, good idea.
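
A rough sketch of the idea, with hypothetical names and a made-up annotation key (the real controller has its own key and codec): decode the config that was encoded onto the latest deployment and compare pod templates to decide whether the change has already been rolled out.

```go
// Hypothetical helpers and annotation key; JSON stands in for the real codec.
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// encodedConfigAnnotation is a stand-in name; the real controller stores the
// encoded config on the deployment under its own annotation key.
const encodedConfigAnnotation = "openshift.io/encoded-deployment-config"

type PodTemplate struct {
	Image string `json:"image"`
}

type Config struct {
	Template PodTemplate `json:"template"`
}

type Deployment struct {
	Annotations map[string]string
}

// alreadyDeployed reports whether the current config's template matches the
// one decoded from the latest deployment, i.e. no new rollout is needed.
func alreadyDeployed(current *Config, latest *Deployment) (bool, error) {
	var decoded Config
	if err := json.Unmarshal([]byte(latest.Annotations[encodedConfigAnnotation]), &decoded); err != nil {
		return false, fmt.Errorf("error decoding deployment config: %v", err)
	}
	return reflect.DeepEqual(decoded.Template, current.Template), nil
}

func main() {
	latest := &Deployment{Annotations: map[string]string{
		encodedConfigAnnotation: `{"template":{"image":"app:v1"}}`,
	}}
	current := &Config{Template: PodTemplate{Image: "app:v2"}}
	ok, _ := alreadyDeployed(current, latest)
	fmt.Println(ok) // false: the new image has not been deployed yet
}
```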

@0xmichalis (Contributor, Author):

one more [test]

@0xmichalis (Contributor, Author):

known flakes [test]

@0xmichalis (Contributor, Author):

new flake #8946

[test]

@0xmichalis (Contributor, Author):

another new: #8947

gotta deflake em all!

[test]

@0xmichalis (Contributor, Author):

@ironcladlou @smarterclayton no other concerns from me. Can we merge this?

@0xmichalis (Contributor, Author):

Hold on, I need to cleanup

@0xmichalis (Contributor, Author):

Ready

@smarterclayton (Contributor):

I have no other concerns [merge]


@openshift-bot (Contributor) commented May 19, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/5954/) (Image: devenv-rhel7_4245)

@openshift-bot (Contributor):

Evaluated for origin merge up to e3510a7

@openshift-bot (Contributor):

Evaluated for origin test up to e3510a7

@openshift-bot (Contributor):

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/3939/)
