This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

Problem with etcd in hostNetwork #115

Closed
caseydavenport opened this issue Sep 15, 2016 · 21 comments
@caseydavenport
Contributor

Ran into this today while trying to test the changes from #112

NAME                                READY     STATUS    RESTARTS   AGE
po/etcd-cluster-0000                0/1       Error     0          1m
2016-09-15 00:49:12.196100 I | etcdmain: etcd Version: 3.0.8
2016-09-15 00:49:12.196134 I | etcdmain: Git SHA: d40982f
2016-09-15 00:49:12.196174 I | etcdmain: Go Version: go1.6.3
2016-09-15 00:49:12.196204 I | etcdmain: Go OS/Arch: linux/amd64
2016-09-15 00:49:12.196232 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2016-09-15 00:49:12.196591 I | etcdmain: listening for peers on http://0.0.0.0:2380
2016-09-15 00:49:12.196725 I | etcdmain: listening for client requests on 0.0.0.0:2379
2016-09-15 00:49:12.213419 E | netutil: could not resolve host etcd-cluster-0000:2380
2016-09-15 00:49:12.236475 I | etcdmain: stopping listening for client requests on 0.0.0.0:2379
2016-09-15 00:49:12.236515 I | etcdmain: stopping listening for peers on http://0.0.0.0:2380
2016-09-15 00:49:12.236541 I | etcdmain: --initial-cluster must include etcd-cluster-0000=http://etcd-cluster-0000:2380 given --initial-advertise-peer-urls=http://etcd-cluster-0000:2380

Looks like it's failing to resolve etcd-cluster-0000. I'll investigate.
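For context on the log above: before starting, etcd checks that the `--initial-cluster` flag contains an entry for this member matching its `--initial-advertise-peer-urls`, and it resolves any hostnames in those URLs so equivalent addresses compare equal. A rough Python sketch of that kind of check (simplified, not etcd's actual netutil code) shows why an unresolvable hostname makes the whole check fail:

```python
# Simplified sketch (NOT etcd's actual implementation) of the
# --initial-cluster consistency check: compare the advertised peer URLs
# against the entry declared for this member, resolving hostnames so
# that equivalent addresses match. If DNS resolution fails, etcd logs
# "could not resolve host ..." and the check fails.
import socket
from urllib.parse import urlparse

def resolve_url(url):
    """Resolve a URL's host to an IP; raises socket.gaierror on DNS failure."""
    u = urlparse(url)
    ip = socket.gethostbyname(u.hostname)
    return (u.scheme, ip, u.port)

def cluster_check(name, initial_cluster, advertise_peer_urls):
    """Return True if --initial-cluster declares name=<each advertised URL>."""
    declared = [v for k, v in (e.split("=", 1) for e in initial_cluster.split(","))
                if k == name]
    try:
        declared_resolved = {resolve_url(u) for u in declared}
        return all(resolve_url(u) in declared_resolved for u in advertise_peer_urls)
    except socket.gaierror:
        return False  # unresolvable host -> check fails, etcd exits

# With a resolvable host the check passes:
print(cluster_check("m0", "m0=http://localhost:2380", ["http://localhost:2380"]))  # → True
```

Under hostNetwork the pod can't resolve the `etcd-cluster-0000` service name at all, so the equivalent check in etcd fails even though the flags are textually consistent.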

@hongchaodeng
Member

@caseydavenport
It's an etcd thing that we just resolved: etcd-io/etcd#6336
Let me push an urgent fix for this.

@hongchaodeng
Member

Fixed in #116

@caseydavenport
Contributor Author

Hm, I think I'm still seeing this even with #116.

@hongchaodeng hongchaodeng reopened this Sep 15, 2016
@hongchaodeng
Member

@caseydavenport
It works for me.
Please paste your steps to reproduce.

@caseydavenport
Contributor Author

I've been using the provided examples, though I've modified example-etcd-cluster.yaml to include hostNetwork: true.

Here are the steps I'm running and the output from each:

  1. Create the controller
macncheese:kube-etcd-controller casey$ kubectl create -f example/etcd-controller.yaml
pod "kubeetcdctrl" created
macncheese:kube-etcd-controller casey$ kubectl get rc,deployments,svc,replicaset,pod,thirdpartyresource --show-all
NAME             CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
svc/kubernetes   10.0.0.1     <none>        443/TCP   18s
NAME              READY     STATUS    RESTARTS   AGE
po/kubeetcdctrl   1/1       Running   0          7s
NAME                                          DESCRIPTION             VERSION(S)
thirdpartyresources/etcd-cluster.coreos.com   Managed etcd clusters   v1

This appears to create successfully.

Then, I create the desired cluster. The manifest looks like this:

macncheese:kube-etcd-controller casey$ cat example/example-etcd-cluster.yaml
apiVersion: "coreos.com/v1"
kind: "EtcdCluster"
metadata:
  name: "etcd-cluster"
spec:
  size: 3
  hostNetwork: true
  backup:
    # short snapshot interval for testing, do not use this in production!
    snapshotIntervalInSecond: 30
    maxSnapshot: 5
    volumeSizeInMB: 512

I see this occur:

macncheese:kube-etcd-controller casey$ kubectl create -f example/example-etcd-cluster.yaml
etcdcluster "etcd-cluster" created
macncheese:kube-etcd-controller casey$ kubectl get rc,deployments,svc,replicaset,pod,thirdpartyresource --show-all
NAME                           CLUSTER-IP    EXTERNAL-IP   PORT(S)             AGE
svc/etcd-cluster-0000          10.0.252.49   <none>        2380/TCP,2379/TCP   12s
svc/etcd-cluster-backup-tool   10.0.35.118   <none>        19999/TCP           11s
svc/kubernetes                 10.0.0.1      <none>        443/TCP             4m
NAME                          DESIRED   CURRENT   READY     AGE
rs/etcd-cluster-backup-tool   1         1         1         11s
NAME                                READY     STATUS    RESTARTS   AGE
po/etcd-cluster-0000                0/1       Error     0          12s
po/etcd-cluster-backup-tool-duijz   1/1       Running   0          11s
po/kubeetcdctrl                     1/1       Running   0          4m
NAME                                          DESCRIPTION             VERSION(S)
thirdpartyresources/etcd-cluster.coreos.com   Managed etcd clusters   v1

And I can get the logs:

macncheese:kube-etcd-controller casey$ kubectl logs etcd-cluster-0000
2016-09-15 20:07:56.438104 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2379_TCP=tcp://10.0.252.49:2379
2016-09-15 20:07:56.438182 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_SERVICE_HOST=10.0.252.49
2016-09-15 20:07:56.438186 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_SERVICE_PORT_CLIENT=2379
2016-09-15 20:07:56.438189 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2380_TCP_PROTO=tcp
2016-09-15 20:07:56.438192 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2380_TCP_PORT=2380
2016-09-15 20:07:56.438195 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2380_TCP_ADDR=10.0.252.49
2016-09-15 20:07:56.438198 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2379_TCP_ADDR=10.0.252.49
2016-09-15 20:07:56.438201 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_SERVICE_PORT=2380
2016-09-15 20:07:56.438204 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2380_TCP=tcp://10.0.252.49:2380
2016-09-15 20:07:56.438207 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2379_TCP_PORT=2379
2016-09-15 20:07:56.438212 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_SERVICE_PORT_SERVER=2380
2016-09-15 20:07:56.438215 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT=tcp://10.0.252.49:2380
2016-09-15 20:07:56.438217 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2379_TCP_PROTO=tcp
2016-09-15 20:07:56.438252 I | etcdmain: etcd Version: 3.0.0+git
2016-09-15 20:07:56.438255 I | etcdmain: Git SHA: 9913e00
2016-09-15 20:07:56.438258 I | etcdmain: Go Version: go1.7.1
2016-09-15 20:07:56.438260 I | etcdmain: Go OS/Arch: linux/amd64
2016-09-15 20:07:56.438264 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2016-09-15 20:07:56.438372 I | embed: listening for peers on http://0.0.0.0:2380
2016-09-15 20:07:56.438416 I | embed: listening for client requests on 0.0.0.0:2379
2016-09-15 20:07:56.519617 E | netutil: could not resolve host etcd-cluster-0000:2380
2016-09-15 20:07:56.545915 I | etcdmain: --initial-cluster must include etcd-cluster-0000=http://etcd-cluster-0000:2380 given --initial-advertise-peer-urls=http://etcd-cluster-0000:2380
2016-09-15 20:07:56.545947 I | etcdmain: forgot to set --initial-cluster flag?
2016-09-15 20:07:56.545955 I | etcdmain: if you want to use discovery service, please set --discovery flag.

"kubectl describe etcd-cluster-0000" shows me the container image and command:

Containers:
  etcd-cluster-0000:
    Container ID:       docker://23268da1700e364cb14b83646e0f593a1453b5b7c9d0f2276129ca54dbe53030
    Image:              gcr.io/coreos-k8s-scale-testing/etcd-amd64:latest
    Image ID:           docker://sha256:1e3488a197dd60328276742ee4f93cc3277bc2773f5579e60b8e81491cc28aba
    Port:               2380/TCP
    Command:
      /usr/local/bin/etcd
      --data-dir
      /var/etcd/data
      --name
      etcd-cluster-0000
      --initial-advertise-peer-urls
      http://etcd-cluster-0000:2380
      --listen-peer-urls
      http://0.0.0.0:2380
      --listen-client-urls
      http://0.0.0.0:2379
      --advertise-client-urls
      http://etcd-cluster-0000:2379
      --initial-cluster
      etcd-cluster-0000=http://etcd-cluster-0000:2380
      --initial-cluster-state
      new
      --initial-cluster-token
      d029eb69-417f-49f8-a9a7-ff887469fb6

@hongchaodeng
Member

I have reproduced and verified it.
It looks like Kubernetes DNS isn't working for host-network pods.
This is a problem with Kubernetes, not the controller.

@caseydavenport Can you try to figure out how to make it work?

@caseydavenport
Contributor Author

Yeah, looks like we're running up against this: kubernetes/kubernetes#17406

Essentially, Kubernetes doesn't expose cluster DNS to host-networked pods :(

The issue points out some potential workarounds, but none seem that great. I'll have to dig a little deeper to see what we can do here.

@caseydavenport
Contributor Author

caseydavenport commented Sep 20, 2016

So, I don't think we can rely on DNS for this use case. It's not supported for Kubernetes host-networked pods, and even if it were, it wouldn't meet the Calico / Canal use case: we need etcd to come up before any other pods (including DNS), and we don't want etcd to be subject to DNS pod failures.

I think the solution is to rely on service clusterIPs rather than DNS names (for host networked clusters). I'm going to prototype this and see if it can get us what we need.
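To make the proposal concrete, here is an illustrative sketch of the flag change: for host-networked clusters, build each member's etcd flags from the member service's clusterIP instead of its DNS name, so no cluster-DNS lookup is ever needed. The function name and structure are hypothetical, not the controller's actual code (which is Go):

```python
# Illustrative sketch only: build etcd member flags from either the
# per-member service DNS name (default) or its clusterIP (hostNetwork).
# member_flags is a hypothetical helper, not the controller's real API.
def member_flags(name, host, cluster_ip=None):
    """Return etcd flags; if cluster_ip is given, advertise the IP instead."""
    peer_host = cluster_ip or host
    return [
        "--name", name,
        "--initial-advertise-peer-urls", f"http://{peer_host}:2380",
        "--advertise-client-urls", f"http://{peer_host}:2379",
        "--initial-cluster", f"{name}=http://{peer_host}:2380",
    ]

# DNS-based flags (fail under hostNetwork, since cluster DNS is unavailable):
print(member_flags("etcd-cluster-0000", "etcd-cluster-0000"))

# clusterIP-based flags (using the service IP from `kubectl get svc`,
# e.g. 10.0.252.49 in the output above); nothing needs to resolve:
print(member_flags("etcd-cluster-0000", "etcd-cluster-0000",
                   cluster_ip="10.0.252.49"))
```

The trade-off is that the advertised IPs are only stable for the lifetime of the services, but that holds regardless of whether pods use host networking.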

@hongchaodeng
Member

@caseydavenport
Great to hear.
I'm still dealing with some Kubernetes test failures and haven't been able to get to the image parameter work..
You can grep for gcr.io/coreos-k8s-scale-testing and replace the references easily. Hopefully.
Also ping me on Slack if you have any questions.

@ethernetdan

A requirement for self-hosted clusters (#128) is that we cannot depend on any Kubernetes capabilities for client or peer networking.

@hongchaodeng @xiang90 what would be required to enable true host networking for clusters managed by the controller?

@xiang90
Collaborator

xiang90 commented Oct 6, 2016

A requirement for self-hosted clusters (#128) is that we cannot depend on any Kubernetes capabilities for client or peer networking.

Can you explain why?

@ethernetdan

etcd is a requirement for Kubernetes networking to be functional, so in recovery scenarios we are unable to rely on Kubernetes networking.

@xiang90
Collaborator

xiang90 commented Oct 6, 2016

@ethernetdan What are the recovery scenarios?

@xiang90
Collaborator

xiang90 commented Oct 6, 2016

Resolved offline with @ethernetdan. Recovery from a disaster case (a majority of the etcd cluster used by k8s is down) should work similarly to bringing up a new cluster, except that the seed etcd member is recovered from a backup.

@philips
Contributor

philips commented Oct 7, 2016

I really think we need etcd to support host networking and node IPs here. I am going to propose this layered thinking upstream for self-hosted cluster bring-up:

To do this, we divide the Kubernetes cluster into a series of layers, beginning with the kubelet (level 0) and going up to the add-ons and required networking (level 3). A cluster can self-host all of levels 0-3, or be only partially self-hosted (levels 2-3). We will give nicknames to some of the common types.

  • Kubelet : (level 0)
  • etcd: (level 1)
  • Control Plane: (level 2)
  • Add-ons and Networking: (level 3)

Given this layering, I really don't want a circular dependency between level 1 and level 3, as it would make it really difficult for administrators to recover from cluster failure. It also makes the system harder to reason about overall.

@xiang90
Collaborator

xiang90 commented Oct 7, 2016

@ethernetdan @philips I am fine with adding host network support. It definitely can help decouple the dependency mess. I just wanted to make sure we understand everything, since in theory everything should still work even without host networking if we are careful about ordering.

@ethernetdan

ethernetdan commented Oct 12, 2016

Anything I can do to help speed up this work?

@xiang90
Collaborator

xiang90 commented Nov 16, 2016

Fixed by #366

@xiang90 xiang90 closed this as completed Nov 16, 2016
@bamb00

bamb00 commented May 18, 2017

Hi,

Is there a fix or workaround for hostNetwork=true? I'm running etcd 3.1.6. What workaround can I apply on the Kubernetes side?

Has this been fixed on the etcd side, and is a Kubernetes fix still needed to support host-networked pods?

Thanks in Advance.

@hongchaodeng
Member

@bamb00
What's the context? What are you trying to do?

@bamb00

bamb00 commented May 19, 2017

The issue is referenced in @caseydavenport's comment, with the link to kubernetes/kubernetes#17406. I'm just wondering whether there is a workaround for this issue.
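For reference: later Kubernetes releases (1.6+) added a pod `dnsPolicy` value, `ClusterFirstWithHostNet`, that lets host-network pods use cluster DNS, which addresses the original resolution failure. A minimal sketch of the relevant pod spec fields (pod name and image are placeholders, not from this repo's examples):

```yaml
# Minimal sketch: with hostNetwork, the default dnsPolicy falls back to the
# node's resolv.conf, so cluster DNS names don't resolve. Setting dnsPolicy
# to ClusterFirstWithHostNet opts the pod back into cluster DNS.
apiVersion: v1
kind: Pod
metadata:
  name: etcd-host-net-example   # hypothetical name
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  containers:
    - name: etcd
      image: quay.io/coreos/etcd:v3.1.6
```

Note this only restores DNS resolution; the bootstrap-ordering concern discussed above (etcd must come up before the DNS pods) still applies.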
