CI Failures: machine-config-server is not ready #414

Closed
dgoodwin opened this issue Feb 12, 2019 · 14 comments

@dgoodwin
Contributor

In Hive we're seeing failures launching a cluster that appear related to machine config:

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_hive/221/pull-ci-openshift-hive-master-e2e/40/

Installer log (https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_hive/221/pull-ci-openshift-hive-master-e2e/40/artifacts/e2e/installer/.openshift_install.log) shows:

time="2019-02-12T18:39:57Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator machine-config is reporting a failure: Failed when progressing towards 3.11.0-605-g56569135-dirty because: error syncing: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-server is not ready. status: (desired: 3, updated: 3, ready: 2, unavailable: 2)"
time="2019-02-12T18:59:07Z" level=fatal msg="failed to initialize the cluster: timed out waiting for the condition"

In the pod logs I found:

I0212 18:29:07.506788 1 render_controller.go:456] Generated machineconfig master-701b1a947a6b7a021d9adb12cbfbe2ab from 3 configs: [{MachineConfig 00-master machineconfiguration.openshift.io/v1 } {MachineConfig 00-master-ssh machineconfiguration.openshift.io/v1 } {MachineConfig 01-master-kubelet machineconfiguration.openshift.io/v1 }]
I0212 18:29:07.606512 1 node_controller.go:345] Error syncing machineconfigpool worker: Empty Current MachineConfig
I0212 18:29:07.606548 1 node_controller.go:345] Error syncing machineconfigpool master: Empty Current MachineConfig

More info available in the artifacts linked above.
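For reference, a DaemonSet rollout wait like the one in that error can be sketched with client-go roughly as below. This is only an illustration of the readiness condition the log reports (desired/updated/ready/unavailable), not the MCO's actual waitForDaemonsetRollout; the package name, polling interval, and arguments are assumptions.

```go
package rolloutcheck

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDaemonSetReady polls until the DaemonSet reports every scheduled pod
// as updated and ready, which is the condition the installer error says was
// never reached (it stalled at desired: 3, updated: 3, ready: 2, unavailable: 2).
func waitForDaemonSetReady(client kubernetes.Interface, namespace, name string, timeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		ds, err := client.AppsV1().DaemonSets(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			// Treat transient API errors as "not ready yet" and keep polling.
			return false, nil
		}
		fmt.Printf("daemonset %s/%s: desired %d, updated %d, ready %d, unavailable %d\n",
			namespace, name,
			ds.Status.DesiredNumberScheduled, ds.Status.UpdatedNumberScheduled,
			ds.Status.NumberReady, ds.Status.NumberUnavailable)
		return ds.Status.DesiredNumberScheduled == ds.Status.UpdatedNumberScheduled &&
			ds.Status.DesiredNumberScheduled == ds.Status.NumberReady &&
			ds.Status.NumberUnavailable == 0, nil
	})
}
```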

@kikisdeliveryservice
Contributor

@dgoodwin I believe those errors are expected to show up a few times during an install and aren't what caused your failure.

I'm not familiar with Hive, but I looked through the MCS logs and see:

I0212 18:55:23.859972       1 start.go:37] Version: 3.11.0-605-g56569135-dirty
I0212 18:55:23.860683       1 api.go:51] launching server
I0212 18:55:23.860705       1 api.go:51] launching server
I0212 18:55:23.860773       1 api.go:51] launching server
I0212 18:55:23.860801       1 api.go:51] launching server
F0212 18:55:23.861301       1 api.go:59] Machine Config Server exited with error: listen tcp :49500: bind: address already in use

Are you using port 49500 elsewhere? Due to issue #166, I recently put in a fix (#368) to transition from port 49500 to 22623. Might that solve the problem? You'd have to pull master to get it, though, since it just merged this afternoon.

see also: openshift/installer#1180
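For anyone hitting that bind error, here's a minimal, self-contained Go sketch of how "address already in use" arises, just to illustrate the failure mode in the MCS log. Port 49500 matches the log, but the rest is illustrative and not the MCS code.

```go
package main

import (
	"log"
	"net"
)

func main() {
	// First listener grabs the port, as the MCS normally would.
	first, err := net.Listen("tcp", ":49500")
	if err != nil {
		log.Fatalf("first listen failed: %v", err)
	}
	defer first.Close()

	// A second bind of the same port fails immediately with the same error
	// text seen in the MCS log: "listen tcp :49500: bind: address already in use".
	if _, err := net.Listen("tcp", ":49500"); err != nil {
		log.Printf("second listen failed as expected: %v", err)
	}
}
```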

@kikisdeliveryservice
Contributor

Also, @dgoodwin, can you add a bit of info about what you're running (installer version, any modifications, etc.)? I'm not familiar with Hive.

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Feb 13, 2019

The MCC logs seem reasonably normal; the machineconfigs were generated as expected:

I0212 18:29:00.106929       1 template_controller.go:111] Starting MachineConfigController-TemplateController
I0212 18:29:00.307373       1 kubelet_config_controller.go:138] Starting MachineConfigController-KubeletConfigController
I0212 18:29:00.406491       1 render_controller.go:112] Starting MachineConfigController-RenderController
I0212 18:29:00.508037       1 node_controller.go:121] Starting MachineConfigController-NodeController
I0212 18:29:06.306587       1 node_controller.go:345] Error syncing machineconfigpool master: Empty Current MachineConfig
I0212 18:29:06.506896       1 node_controller.go:345] Error syncing machineconfigpool worker: Empty Current MachineConfig
I0212 18:29:06.906978       1 node_controller.go:345] Error syncing machineconfigpool master: Empty Current MachineConfig
I0212 18:29:07.006638       1 node_controller.go:345] Error syncing machineconfigpool worker: Empty Current MachineConfig
I0212 18:29:07.116877       1 node_controller.go:345] Error syncing machineconfigpool master: Empty Current MachineConfig
I0212 18:29:07.116971       1 node_controller.go:345] Error syncing machineconfigpool worker: Empty Current MachineConfig
I0212 18:29:07.206546       1 node_controller.go:345] Error syncing machineconfigpool master: Empty Current MachineConfig
I0212 18:29:07.206565       1 node_controller.go:345] Error syncing machineconfigpool worker: Empty Current MachineConfig
I0212 18:29:07.306574       1 node_controller.go:345] Error syncing machineconfigpool worker: Empty Current MachineConfig
I0212 18:29:07.306572       1 node_controller.go:345] Error syncing machineconfigpool master: Empty Current MachineConfig
I0212 18:29:07.406844       1 node_controller.go:345] Error syncing machineconfigpool worker: Empty Current MachineConfig
I0212 18:29:07.406937       1 node_controller.go:345] Error syncing machineconfigpool master: Empty Current MachineConfig
I0212 18:29:07.407092       1 render_controller.go:456] Generated machineconfig worker-7b1afa90723f2df82b21a03c03142cf0 from 3 configs: [{MachineConfig  00-worker  machineconfiguration.openshift.io/v1  } {MachineConfig  00-worker-ssh  machineconfiguration.openshift.io/v1  } {MachineConfig  01-worker-kubelet  machineconfiguration.openshift.io/v1  }]
I0212 18:29:07.506788       1 render_controller.go:456] Generated machineconfig master-701b1a947a6b7a021d9adb12cbfbe2ab from 3 configs: [{MachineConfig  00-master  machineconfiguration.openshift.io/v1  } {MachineConfig  00-master-ssh  machineconfiguration.openshift.io/v1  } {MachineConfig  01-master-kubelet  machineconfiguration.openshift.io/v1  }]
I0212 18:29:07.606512       1 node_controller.go:345] Error syncing machineconfigpool worker: Empty Current MachineConfig
I0212 18:29:07.606548       1 node_controller.go:345] Error syncing machineconfigpool master: Empty Current MachineConfig

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Feb 13, 2019

MCO logs seem ok as well:

I0212 18:39:06.256443       1 sync.go:48] [init mode] synced mcs in 10m0.116155332s
I0212 18:39:11.349882       1 sync.go:48] [init mode] synced mcd in 5.079689389s
I0212 18:39:14.358941       1 sync.go:48] [init mode] synced required-pools in 3.000310177s
I0212 18:39:14.447846       1 sync.go:48] [init mode] synced pools in 8.438724ms
I0212 18:39:16.491769       1 sync.go:48] [init mode] synced mcc in 2.035433546s
I0212 18:49:16.563495       1 sync.go:48] [init mode] synced mcs in 10m0.054134196s
I0212 18:49:17.634668       1 sync.go:48] [init mode] synced mcd in 1.053824898s
I0212 18:49:18.648130       1 sync.go:48] [init mode] synced required-pools in 1.000246201s
I0212 18:49:18.731312       1 sync.go:48] [init mode] synced pools in 11.629855ms
I0212 18:49:20.776755       1 sync.go:48] [init mode] synced mcc in 2.036854879s
I0212 18:59:20.883647       1 sync.go:48] [init mode] synced mcs in 10m0.096227779s
I0212 18:59:21.955005       1 sync.go:48] [init mode] synced mcd in 1.055063822s
I0212 18:59:22.964661       1 sync.go:48] [init mode] synced required-pools in 1.000234888s
I0212 18:59:23.042570       1 sync.go:48] [init mode] synced pools in 7.961962ms
I0212 18:59:25.089284       1 sync.go:48] [init mode] synced mcc in 2.038535905s

I'm also not seeing any errors in the MCD logs either, but I'll spare everyone more pasting :)

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Feb 13, 2019

Summary:
Can confirm that I see these errors in the installer logs:

timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-server is not ready. status: (desired: 3, updated: 3, ready: 2, unavailable: 2)
time="2019-02-12T18:59:07Z" level=fatal msg="failed to initialize the cluster: timed out waiting for the condition"

And after looking through all of the MCx logs, I can confirm that the only error seems to be in the MCS: listen tcp :49500: bind: address already in use (the logs with that error are also the newest of the 3 sets of server logs in the archive).

The installer change only went in a few hours ago, but master should be using 22623 now instead of 49500.

@dgoodwin
Contributor Author

This was a CI job flake, so it's basically just running the installer from master with its floating release image, and it would have occurred around the timestamp on the files in the links. There's nothing we're doing with that port that I can think of. It's possible I missed the merge, and I do know we got a PR through later in the day.

@kikisdeliveryservice
Contributor

Thanks for the update. I'll leave this open for a bit to see if it recurs, though AWS was having a lot of issues yesterday.

@kikisdeliveryservice
Contributor

Closing for now, feel free to reopen if this comes up again.

@kikisdeliveryservice changed the title from "CI Installation Failures With Empty MachineConfig" to "CI Failures: machine-config-server is not ready" on Feb 14, 2019

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Feb 14, 2019

Saw the exact same error as the author above:
level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is reporting a failure: Failed when progressing towards 3.11.0-646-g875bc74e-dirty because: error syncing: timed out waiting for the condition during waitForDaemonsetRollout: Daemonset machine-config-server is not ready. status: (desired: 3, updated: 3, ready: 2, unavailable: 2)"

and saw in MCS logs:
F0214 22:21:58.887072 1 api.go:59] Machine Config Server exited with error: listen tcp :49500: bind: address already in use

@kikisdeliveryservice
Contributor

kikisdeliveryservice commented Feb 14, 2019

Is the installer that's running in e2e-aws from master (since I put in that PR that no longer uses 49500)? Will #423 take care of this once and for all, since it completely removes 49500, or is the problem coming from some other issue like the API server breakdowns?

The PR in question had passed e2e-aws (with no code changes) before it failed with this. Not sure if that's why this is happening.

@jlebon
Member

jlebon commented Feb 15, 2019

Yeah, should be fixed now that #423 is merged.

@ashcrow
Member

ashcrow commented Feb 15, 2019

@dgoodwin let us know if #423 doesn't fix the issue.

@runcom
Member

runcom commented Mar 18, 2019

fixed by #423

@runcom closed this as completed on Mar 18, 2019