
make login, project, and discovery work against kube with RBAC enabled #11340

Merged (1 commit) on Nov 23, 2016

Conversation

@deads2k (Contributor) commented Oct 12, 2016

When RBAC is enabled (see kubernetes/kubernetes#34619), we have to tolerate 403s in addition to 404s.

@pweil- you'll probably need this if you want the tooling to work nicely (and it makes a difference)
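
As an illustration of what the tooling sees, a quick probe of the discovery path (a minimal sketch, assuming a reachable API server URL in SERVER and a bearer token in TOKEN, neither of which is part of this PR):

  # Probe the OpenShift discovery path on a plain Kubernetes API server.
  # With RBAC enabled, authorization can reject the request with 403 before
  # routing ever gets the chance to answer 404, so the client has to tolerate both.
  curl -sk -o /dev/null -w '%{http_code}\n' \
    -H "Authorization: Bearer ${TOKEN}" \
    "${SERVER}/oapi"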

@deads2k (Contributor, author) commented Oct 13, 2016

@fabianofranz @juanvallejo ping.

@juanvallejo (Contributor)

@fabianofranz LGTM

@fabianofranz (Member)

LGTM, @deads2k merge at will.

@deads2k (Contributor, author) commented Oct 13, 2016

[merge]

@openshift-bot (Contributor)

[Test]ing while waiting on the merge queue

@deads2k (Contributor, author) commented Oct 17, 2016

yum re[test] re[merge]

@deads2k (Contributor, author) commented Oct 17, 2016

re[test]

@deads2k (Contributor, author) commented Oct 24, 2016

@stevekuznetsov does networking ever work? I keep getting #11315

@stevekuznetsov (Contributor)

Unclear -- I don't have a good answer for you.

@deads2k (Contributor, author) commented Oct 25, 2016

@mfojtik @smarterclayton do we want to force this one into 1.4?

@smarterclayton (Contributor)

[merge]

@smarterclayton (Contributor)

[merge] unless DinD is horribly broken.

On Tue, Oct 25, 2016 at 5:53 PM, OpenShift Bot notifications@github.com wrote:

continuous-integration/openshift-jenkins/merge FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/10645/) (Base Commit: 8b8e813)


@deads2k (Contributor, author) commented Oct 28, 2016

@openshift/networking I'm going to guess that somehow this broke the networking test. Can you help me figure out how?

@danwinship (Contributor)

tagging @marun since this is dind-related

If I check out that branch and do make clean; ./test/extended/networking.sh, I see the same deployment failure.

Doing . dind-nettest.sh; oc get nodes from another shell while it is waiting shows that the nodes are created and ready. But stracing the dind script shows that when it runs "oc get nodes ...", it gets back:

the server doesn't have a resource type "nodes"

???

@deads2k deads2k added this to the 1.4.0 milestone Oct 31, 2016
pkg/cmd/util/clientcmd/negotiate.go
@@ -36,7 +36,7 @@ func negotiateVersion(client *kclient.Client, config *restclient.Config, request
   // Get server versions
   serverGVs, err := serverAPIVersions(client, "/oapi")
   if err != nil {
-    if errors.IsNotFound(err) {
+    if errors.IsNotFound(err) || errors.IsForbidden(err) {
       glog.V(4).Infof("Server path /oapi was not found, returning the requested group version %v", preferredGV)
       return preferredGV, nil
Contributor:

This is the change that breaks the tests... Is there some other reason we could be getting a 403 here in some circumstances?

Contributor Author:

This is the change that breaks the tests... Is there some other reason we could be getting a 403 here in some circumstances?

Are you hitting an openshift server or a kubernetes server? We allow all users (authenticated and unauthenticated) to hit our discovery endpoints. The only way I can think of to fail is a race with initial cache priming, but that's a little crazy. You could wait for a zero exit code from oc get --raw /oapi.
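
A minimal sketch of that wait, assuming oc is on the PATH and already pointed at the master under test:

  # Retry discovery until the server answers /oapi successfully; a non-zero exit
  # from `oc get --raw /oapi` means the endpoint is not being served yet.
  until oc get --raw /oapi >/dev/null 2>&1; do
    sleep 1
  done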

Contributor:

Actually, it might not be that; the failure doesn't seem to be 100% reliable, so it might just be luck that it passed without that change

Contributor:

And this is against an openshift server

@smarterclayton (Contributor)

Why is a race with initial cache priming crazy?


@marun (Contributor) commented Oct 31, 2016

Has this patch been tested against a non-dind multinode cluster?

@marun (Contributor) commented Oct 31, 2016

I'm seeing the failure consistently after rebasing the PR, which is how it is being tested in CI.

After rm -rf ~/.kube, I am no longer able to replicate.

@marun (Contributor) commented Oct 31, 2016

Assuming a stale cache in ~/.kube is the cause of the CI failure, consider updating test/extended/networking.sh to clear it and see if that allows the job to pass.
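
Something along these lines early in the deployment path would rule the stale cache out (a sketch only; removing all of ~/.kube is the blunt approach used above to stop replicating, and it also discards any kubeconfig stored there):

  # Blow away any client-side cache left over from a previous cluster so oc
  # cannot reuse a stale (or empty) discovery document.
  rm -rf "${HOME}/.kube"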

@liggitt (Contributor) commented Oct 31, 2016

consider updating test/extended/networking.sh to clear it and see if that allows the job to pass

this confuses me... the cache shouldn't be storing anything in error cases, right?

@deads2k (Contributor, author) commented Oct 31, 2016

this confuses me... the cache shouldn't be storing anything in error cases, right?

Riddle me this: how does this pull produce this?

https://paste.fedoraproject.org/466900/14779326/

@deads2k (Contributor, author) commented Oct 31, 2016

It almost looks like networking is starting an openshift server that does not run the kubernetes API server, but instead tries to proxy it and somehow the proxy isn't proxying the discovery doc. That's insane though. Running an openshift API server like that won't work and I don't think this code will react differently in that case than the old code would have.

@danwinship (Contributor)

It almost looks like networking is starting an openshift server that does not run the kubernetes API server

dind runs openshift in several different modes while setting up the cluster. eg, first it runs openshift admin ca create-master-certs ..., then openshift start master --write-config ..., etc. (see images/dind/master/). And the "oc get nodes" loop to test if the cluster is ready starts before those config steps finish. So if one of those returns bad data, and oc caches it, then...

@deads2k (Contributor, author) commented Oct 31, 2016

dind runs openshift in several different modes while setting up the cluster. eg, first it runs openshift admin ca create-master-certs ..., then openshift start master --write-config ..., etc. (see images/dind/master/). And the "oc get nodes" loop to test if the cluster is ready starts before those config steps finish. So if one of those returns bad data, and oc caches it, then...

Thing is, none of these are server-side changes, so the discovery information being served is identical, and whatever command eventually saved the empty discovery doc got that back from the server it queried. But every other component that starts a master is serving "normal" discovery information from the discovery endpoints.

@danwinship (Contributor)

I meant that they run before the master does; maybe they're binding to port 8443 and serving bogus data. But it looks like they don't.

Is there any chance the master itself could be returning bad answers briefly at startup? Like, does the startup code do something like:

kube.StartMasterHTTPServer()
openshift.OhBTWHandleOpenShiftURLsToo()

@deads2k (Contributor, author) commented Oct 31, 2016

I meant that they run before the master does; maybe they're binding to port 8443 and serving bogus data. But it looks like they don't.

Is there any chance the master itself could be returning bad answers briefly at startup? Like, does the startup code do something like:

Even so, this doesn't change that behavior one way or the other. Are you guys running an oc login --token or oc project command somewhere on a loop? That might start succeeding sooner than usual.

@stevekuznetsov (Contributor)

@danwinship there is one handler chain for the server that serves everything, described in MasterConfig.Run()

@deads2k (Contributor, author) commented Oct 31, 2016

Is there any chance the master itself could be returning bad answers briefly at startup? Like, does the startup code do something like:

@danwinship That's a good theory. Can you link me to where the oc get nodes is done so I can switch it to a health check first?

@danwinship (Contributor)

wait-for-cluster in hack/dind-cluster.sh

@marun (Contributor) commented Oct 31, 2016

I don't see how the master could be returning bad answers briefly at startup, because if I deploy a dind cluster after having removed ~/.kube, everything works fine. oc get nodes works just fine, too. It's only if I wait (like 5m) and then try running oc get nodes that I see the failure indicated by fpaste.

@marun (Contributor) commented Oct 31, 2016

I don't think this issue is specific to the dind deployment, and I think merging should wait until the networking job is passing.

@deads2k (Contributor, author) commented Oct 31, 2016

I don't think this issue is specific to the dind deployment, and I think merging should wait until the networking job is passing.

I haven't proposed forcing it.

I'm not familiar with the job, though. It would be nice to have a smaller, faster reproducer to help track this down. Is there a flag to prevent teardown, and then scripts to debug using the various kubeconfigs? The differences in this job make it harder to jump in and debug.

@marun (Contributor) commented Nov 1, 2016

@deads2k hack/dind-cluster.sh start -r reproduces the issue for me. Once the dind images have been built, deploying a cluster takes 20-30s. The only requirement is Linux running a recent (> 1.10) Docker.

My comment about merge was in response to the bot trying to merge after your push. I realize now that the consistent job failure means it can't succeed.

@deads2k (Contributor, author) commented Nov 22, 2016

@deads2k hack/dind-cluster.sh start -r reproduces the issue for me. Once the dind images have been built, deploying a cluster takes 20-30s. The only requirement is Linux running a recent (> 1.10) Docker.

It doesn't reproduce it for me. That command always seems to work.

@deads2k (Contributor, author) commented Nov 22, 2016

@deads2k hack/dind-cluster.sh start -r reproduces the issue for me. Once the dind images have been built, deploying a cluster takes 20-30s. The only requirement is Linux running a recent (> 1.10) Docker.
It doesn't reproduce it for me. That command always seems to work.

@marun can you reproduce this reliably on an AWS instance I could use to try to diagnose it? The pastebin you made doesn't seem to appear in the jenkins test run (as near as I can tell), and every other master we test it against works. This really looks like it's specific to the networking test provisioning somehow; I can't seem to make it fail locally, and I can't figure out where the failure happens and is logged so I can instrument it while running jenkins.

@@ -221,7 +221,7 @@ function wait-for-cluster() {
oc="$(os::build::find-binary oc)"

# wait for healthz to report ok before trying to get nodes
os::util::wait-for-condition "ok" "${oc} get --config=\"${kubeconfig}\" --raw=/healthz" "120"
Contributor:

The function in question claims it uses eval but in fact does this:

  while ! $(${condition}); do

This means that yes, these quotes were incorrect and the actual value of --config that is passed to oc get is "${kubeconfig}", literal quotes and all. @marun this type of thing is why I feel so strongly about not having the provision_util.sh file, since we spent a lot of time and effort getting it right the first time in os::cmd.
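
A small sketch of the pitfall (the condition string below is illustrative): plain word splitting keeps embedded quotes as literal characters, while eval re-parses the string and strips them as intended.

  condition='echo --config="/tmp/admin.kubeconfig"'

  # Word splitting only, as the helper does today: echo receives the argument
  # --config="/tmp/admin.kubeconfig" with the quote characters kept literally.
  ${condition}

  # eval re-parses the string, so the quotes are removed before echo runs and
  # the argument becomes --config=/tmp/admin.kubeconfig
  eval "${condition}"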

Contributor:

provision_util.sh isn't used here. That's legacy and only used by vagrant. I'm assuming you meant images/dind/node/openshift-dind-lib.sh, which needs to be a separate file so it can be distributed in the dind image. I'm happy to have os::cmd take over responsibility for this use case, but it would have to be copied into the image file for distribution regardless.

Contributor:

Ah, whatever checkout of Origin I'm in right now has the os::util::wait-for-condition function in provision-util.sh.

@@ -249,7 +249,7 @@ EOF
   # Remove formatting before use
   template="$(echo "${template}" | tr -d '\n' | sed -e 's/} \+/}/g')"
   local count
-  count="$("${oc}" --config="${kubeconfig}" get nodes \
+  count="$("${oc}" --config=${kubeconfig} get nodes \
Contributor:

I can't get this one to fail for me, and these quotes look fine. Revert this change, please.

@@ -220,6 +220,9 @@ function wait-for-cluster() {
   local oc
   oc="$(os::build::find-binary oc)"

+  # wait for healthz to report ok before trying to get nodes
Contributor:

I don't see why this change would be necessary. Consider removing.

Contributor Author:

I don't see why this change would be necessary. Consider removing.

I think it's reasonable to wait until the API server is ready before making real requests to it. This mirrors what other e2e tests do.
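
For reference, a minimal helper in that spirit (an illustrative sketch, not the actual os::util::wait-for-condition implementation): poll a probe command until its output matches the expected value or a timeout expires.

  # wait_for EXPECTED PROBE [TIMEOUT_SECONDS]
  # Runs PROBE via eval until its output equals EXPECTED or the timeout passes.
  function wait_for() {
    local expected=$1 probe=$2 timeout=${3:-120}
    local deadline=$(( SECONDS + timeout ))
    until [[ "$(eval "${probe}" 2>/dev/null)" == "${expected}" ]]; do
      if (( SECONDS >= deadline )); then
        echo "timed out waiting for: ${probe}" >&2
        return 1
      fi
      sleep 1
    done
  }

  # Usage mirroring the change above:
  #   wait_for "ok" "oc get --config=/path/to/admin.kubeconfig --raw=/healthz" 120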

@deads2k (Contributor, author) commented Nov 22, 2016

And it worked. I'll squash down and eliminate the "extra" unquoting.

@openshift-bot (Contributor)

Evaluated for origin test up to 6c6ec1a

@openshift-bot (Contributor)

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/11643/) (Base Commit: fdd204a)

@ncdc (Contributor) commented Nov 23, 2016

@deads2k congrats, tests passed

@openshift-bot (Contributor)

Evaluated for origin merge up to 6c6ec1a

@openshift-bot (Contributor) commented Nov 23, 2016

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/11653/) (Base Commit: 00ec8d2) (Image: devenv-rhel7_5407)
