
Multiple machine-config server restarts after 'http: TLS handshake error from 10.0.29.128:17205: EOF' #233

Closed
wking opened this issue Dec 14, 2018 · 5 comments


@wking (Member) commented Dec 14, 2018

In a recent CI run, I saw:

Dec 14 05:53:23.635: INFO: Pod status openshift-machine-config-operator/machine-config-server-c9dr5:
{
  "phase": "Running",
  "conditions": [
    {
      "type": "Initialized",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2018-12-14T05:41:23Z"
    },
    {
      "type": "Ready",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2018-12-14T05:47:34Z"
    },
    {
      "type": "ContainersReady",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": null
    },
    {
      "type": "PodScheduled",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2018-12-14T05:41:23Z"
    }
  ],
  "message": "container machine-config-server has restarted more than 5 times",
  "hostIP": "10.0.2.151",
  "podIP": "10.0.2.151",
  "startTime": "2018-12-14T05:41:23Z",
  "containerStatuses": [
    {
      "name": "machine-config-server",
      "state": {
        "running": {
          "startedAt": "2018-12-14T05:47:33Z"
        }
      },
      "lastState": {
        "terminated": {
          "exitCode": 1,
          "reason": "Error",
          "startedAt": "2018-12-14T05:44:50Z",
          "finishedAt": "2018-12-14T05:44:50Z",
          "containerID": "cri-o://35e3004e72b35a273ab4b0e2e75e082f0840464c55a13f5716d3b796be241e8a"
        }
      },
      "ready": true,
      "restartCount": 6,
      "image": "registry.svc.ci.openshift.org/ci-op-4xwzpczq/stable@sha256:7f2cd078c139f2ed319d16d68e7a5d05f9c60012fd4eeafddc66b1d24a78abf8",
      "imageID": "registry.svc.ci.openshift.org/ci-op-4xwzpczq/stable@sha256:7f2cd078c139f2ed319d16d68e7a5d05f9c60012fd4eeafddc66b1d24a78abf8",
      "containerID": "cri-o://c60a6df780ea4d8d9679309a9037c057002a96f7db1fb62772e2a0b5bb00eaa3"
    }
  ],
  "qosClass": "BestEffort"
}
Dec 14 05:53:23.639: INFO: Running AfterSuite actions on all node
Dec 14 05:53:23.639: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/operators/cluster.go:109]: Expected
    <[]string | len:1, cap:1>: [
        "Pod openshift-machine-config-operator/machine-config-server-c9dr5 is not healthy: container machine-config-server has restarted more than 5 times",
    ]
to be empty
...
Dec 14 05:51:09.642 W ns=openshift-monitoring pod=prometheus-adapter-bdc5f58cb-5l4jt MountVolume.SetUp failed for volume "prometheus-adapter-tls" : secrets "prometheus-adapter-tls" not found
Dec 14 05:51:16.557 E kube-apiserver Kube API started failing: Get https://ci-op-4xwzpczq-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system?timeout=3s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Dec 14 05:51:16.557 I openshift-apiserver OpenShift API started failing: Get https://ci-op-4xwzpczq-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=3s: context deadline exceeded
Dec 14 05:51:18.547 E kube-apiserver Kube API is not responding to GET requests
Dec 14 05:51:18.547 E openshift-apiserver OpenShift API is not responding to GET requests
Dec 14 05:51:20.645 I openshift-apiserver OpenShift API started responding to GET requests
Dec 14 05:51:20.742 I kube-apiserver Kube API started responding to GET requests
...
failed: (2m18s) 2018-12-14T05:53:23 "[Feature:Platform][Suite:openshift/smoke-4] Managed cluster should have no crashlooping pods in core namespaces over two minutes [Suite:openshift/conformance/parallel]"

From the logs for one of those server pods:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/905/pull-ci-openshift-installer-master-e2e-aws/2291/artifacts/e2e-aws/pods/openshift-machine-config-operator_machine-config-server-77dc7_machine-config-server.log.gz | zcat
I1214 05:41:50.478893       1 start.go:37] Version: 3.11.0-352-g0cfc4183-dirty
I1214 05:41:50.480250       1 api.go:54] launching server
I1214 05:41:50.480380       1 api.go:54] launching server
2018/12/14 05:41:51 http: TLS handshake error from 10.0.29.128:17205: EOF
2018/12/14 05:41:52 http: TLS handshake error from 10.0.0.231:28579: EOF
2018/12/14 05:41:52 http: TLS handshake error from 10.0.72.138:31458: EOF
...
2018/12/14 06:10:01 http: TLS handshake error from 10.0.72.138:38099: EOF
2018/12/14 06:10:02 http: TLS handshake error from 10.0.29.128:59541: EOF
2018/12/14 06:10:02 http: TLS handshake error from 10.0.45.28:9790: EOF

This is possibly related to #199, which also involved TLS handshake errors (although in that case they were bad-certificate errors). Are these errors from something connecting to the MCS and then immediately hanging up? Who would do that? And is there information about the restart reason somewhere that I can dig up?
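
For what it's worth, Go's net/http server logs exactly this "http: TLS handshake error from <addr>: EOF" message whenever a client opens a TCP connection and closes it before completing the TLS handshake, which is what a plain TCP health-check probe (from a load balancer, for instance) looks like to the server. A minimal sketch of such a client, using a hypothetical address assembled from the host IP and port seen in this thread to stand in for the MCS endpoint:

// Minimal sketch: open a TCP connection to a TLS endpoint and hang up
// without ever sending a ClientHello. The Go http.Server on the other
// end sees EOF mid-handshake and logs
//   http: TLS handshake error from <addr>: EOF
package main

import (
	"log"
	"net"
)

func main() {
	// Hypothetical address; substitute the machine-config-server endpoint.
	conn, err := net.Dial("tcp", "10.0.2.151:49501")
	if err != nil {
		log.Fatal(err)
	}
	conn.Close() // hang up before the TLS handshake starts
}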

Also, only one of the three machine-config-server containers seems to have had a restart issue:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/905/pull-ci-openshift-installer-master-e2e-aws/2291/artifacts/e2e-aws/pods.json | jq '.items[] | .status.containerStatuses[] | select(.restartCount > 0) | {name, restartCount}'
{
  "name": "operator",
  "restartCount": 1
}
{
  "name": "operator",
  "restartCount": 1
}
{
  "name": "csi-operator",
  "restartCount": 1
}
{
  "name": "machine-config-server",
  "restartCount": 6
}
{
  "name": "prometheus",
  "restartCount": 1
}
{
  "name": "prometheus",
  "restartCount": 1
}
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/905/pull-ci-openshift-installer-master-e2e-aws/2291/artifacts/e2e-aws/pods.json | jq '.items[] | .status.containerStatuses[] | select(.name == "machine-config-server") | {name, restartCount}'
{
  "name": "machine-config-server",
  "restartCount": 0
}
{
  "name": "machine-config-server",
  "restartCount": 6
}
{
  "name": "machine-config-server",
  "restartCount": 0
}
@abhinavdahiya (Contributor) commented Dec 15, 2018

also saw here https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/711/pull-ci-openshift-installer-master-e2e-aws/2331

The reason might be:

I1215 01:17:04.446950       1 start.go:37] Version: 3.11.0-354-g542d610c-dirty
I1215 01:17:04.449017       1 api.go:54] launching server
I1215 01:17:04.449191       1 api.go:54] launching server
F1215 01:17:04.449246       1 api.go:58] Machine Config Server exited with error: listen tcp :49501: bind: address already in use

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/711/pull-ci-openshift-installer-master-e2e-aws/2331/artifacts/e2e-aws/pods/openshift-machine-config-operator_machine-config-server-sjlbx_machine-config-server.log.gz
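
If both of those "launching server" goroutines ended up configured with the same port (an assumption here; compare #166), the second bind would fail exactly like that. A minimal sketch of the failure mode, not the actual MCS code:

// Minimal sketch: two listeners on the same port. The second Listen
// fails with "listen tcp :49501: bind: address already in use",
// matching the fatal log above.
package main

import (
	"log"
	"net"
)

func main() {
	first, err := net.Listen("tcp", ":49501")
	if err != nil {
		log.Fatal(err)
	}
	defer first.Close()

	if _, err := net.Listen("tcp", ":49501"); err != nil {
		log.Fatalf("Machine Config Server exited with error: %v", err)
	}
}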

@kikisdeliveryservice (Contributor) commented:

I also ran into some TLS errors today; when I re-run, I'll try to capture better logs of what happened.

@cgwalters (Member) commented:

F1215 01:17:04.449246 1 api.go:58] Machine Config Server exited with error: listen tcp :49501: bind: address already in use

That'd make this a dup of #166, right?

@kikisdeliveryservice (Contributor) commented Jan 11, 2019

bin/openshift-install v0.9.1 running on aws

I'm seeing these same errors repeating in a non-stop loop in the logs of my machine-config-servers:
$ oc logs -f -n openshift-machine-config-operator machine-config-server-42qmm

then infinitely repeating:
2019/01/11 01:29:15 http: TLS handshake error from 10.0.8.225:40761: EOF

I haven't seen any performance problems in my mco/mcc/mcd work, but wanted to add the data point because it's kind of disconcerting to see the infinite error scroll in the logs.

For the record, the machine-config-servers are listing 0 restarts.

@wking (Member, Author) commented Jan 11, 2019

I'm going to mark these EOF errors as fixed by openshift/installer#924. If anyone can reproduce with a cluster built from an installer with that commit included (it will be in the next release), please comment and we can re-open.

@wking closed this as completed Jan 11, 2019