This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

Problem with etcd in hostNetwork #115

Closed
caseydavenport opened this issue Sep 15, 2016 · 21 comments
@caseydavenport
Contributor

Ran into this today while trying to test the changes from #112

NAME                                READY     STATUS    RESTARTS   AGE
po/etcd-cluster-0000                0/1       Error     0          1m
2016-09-15 00:49:12.196100 I | etcdmain: etcd Version: 3.0.8
2016-09-15 00:49:12.196134 I | etcdmain: Git SHA: d40982f
2016-09-15 00:49:12.196174 I | etcdmain: Go Version: go1.6.3
2016-09-15 00:49:12.196204 I | etcdmain: Go OS/Arch: linux/amd64
2016-09-15 00:49:12.196232 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2016-09-15 00:49:12.196591 I | etcdmain: listening for peers on http://0.0.0.0:2380
2016-09-15 00:49:12.196725 I | etcdmain: listening for client requests on 0.0.0.0:2379
2016-09-15 00:49:12.213419 E | netutil: could not resolve host etcd-cluster-0000:2380
2016-09-15 00:49:12.236475 I | etcdmain: stopping listening for client requests on 0.0.0.0:2379
2016-09-15 00:49:12.236515 I | etcdmain: stopping listening for peers on http://0.0.0.0:2380
2016-09-15 00:49:12.236541 I | etcdmain: --initial-cluster must include etcd-cluster-0000=http://etcd-cluster-0000:2380 given --initial-advertise-peer-urls=http://etcd-cluster-0000:2380

Looks like it's failing to resolve etcd-cluster-0000. I'll investigate.
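For context on the log above: before starting, etcd checks that the `--initial-cluster` flag contains an entry for this member matching its `--initial-advertise-peer-urls`, and it resolves any hostnames in those URLs so equivalent addresses compare equal. A rough Python sketch of that kind of check (simplified, not etcd's actual netutil code) shows why an unresolvable hostname makes the whole check fail:

```python
# Simplified sketch (NOT etcd's actual implementation) of the
# --initial-cluster consistency check: compare the advertised peer URLs
# against the entry declared for this member, resolving hostnames so
# that equivalent addresses match. If DNS resolution fails, etcd logs
# "could not resolve host ..." and the check fails.
import socket
from urllib.parse import urlparse

def resolve_url(url):
    """Resolve a URL's host to an IP; raises socket.gaierror on DNS failure."""
    u = urlparse(url)
    ip = socket.gethostbyname(u.hostname)
    return (u.scheme, ip, u.port)

def cluster_check(name, initial_cluster, advertise_peer_urls):
    """Return True if --initial-cluster declares name=<each advertised URL>."""
    declared = [v for k, v in (e.split("=", 1) for e in initial_cluster.split(","))
                if k == name]
    try:
        declared_resolved = {resolve_url(u) for u in declared}
        return all(resolve_url(u) in declared_resolved for u in advertise_peer_urls)
    except socket.gaierror:
        return False  # unresolvable host -> check fails, etcd exits

# With a resolvable host the check passes:
print(cluster_check("m0", "m0=http://localhost:2380", ["http://localhost:2380"]))  # → True
```

Under hostNetwork the pod can't resolve the `etcd-cluster-0000` service name at all, so the equivalent check in etcd fails even though the flags are textually consistent.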

@hongchaodeng
Member

@caseydavenport
It's an etcd thing that we just resolved: etcd-io/etcd#6336
Let me push an urgent fix for this.

@hongchaodeng
Member

Fixed in #116

@caseydavenport
Contributor Author

Hm, I think I'm still seeing this even with #116.

@hongchaodeng hongchaodeng reopened this Sep 15, 2016
@hongchaodeng
Member

@caseydavenport
It works for me.
Please paste your steps to reproduce.

@caseydavenport
Contributor Author

I've been using the provided examples, though I've modified example-etcd-cluster.yaml to include hostNetwork: true.

Here are the steps I'm running and the output from each:

  1. Create the controller
macncheese:kube-etcd-controller casey$ kubectl create -f example/etcd-controller.yaml
pod "kubeetcdctrl" created
macncheese:kube-etcd-controller casey$ kubectl get rc,deployments,svc,replicaset,pod,thirdpartyresource --show-all
NAME             CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
svc/kubernetes   10.0.0.1     <none>        443/TCP   18s
NAME              READY     STATUS    RESTARTS   AGE
po/kubeetcdctrl   1/1       Running   0          7s
NAME                                          DESCRIPTION             VERSION(S)
thirdpartyresources/etcd-cluster.coreos.com   Managed etcd clusters   v1

This appears to create successfully.

Then, I create the desired cluster. The manifest looks like this:

macncheese:kube-etcd-controller casey$ cat example/example-etcd-cluster.yaml
apiVersion: "coreos.com/v1"
kind: "EtcdCluster"
metadata:
  name: "etcd-cluster"
spec:
  size: 3
  hostNetwork: true
  backup:
    # short snapshot interval for testing, do not use this in production!
    snapshotIntervalInSecond: 30
    maxSnapshot: 5
    volumeSizeInMB: 512

I see this occur:

macncheese:kube-etcd-controller casey$ kubectl create -f example/example-etcd-cluster.yaml
etcdcluster "etcd-cluster" created
macncheese:kube-etcd-controller casey$ kubectl get rc,deployments,svc,replicaset,pod,thirdpartyresource --show-all
NAME                           CLUSTER-IP    EXTERNAL-IP   PORT(S)             AGE
svc/etcd-cluster-0000          10.0.252.49   <none>        2380/TCP,2379/TCP   12s
svc/etcd-cluster-backup-tool   10.0.35.118   <none>        19999/TCP           11s
svc/kubernetes                 10.0.0.1      <none>        443/TCP             4m
NAME                          DESIRED   CURRENT   READY     AGE
rs/etcd-cluster-backup-tool   1         1         1         11s
NAME                                READY     STATUS    RESTARTS   AGE
po/etcd-cluster-0000                0/1       Error     0          12s
po/etcd-cluster-backup-tool-duijz   1/1       Running   0          11s
po/kubeetcdctrl                     1/1       Running   0          4m
NAME                                          DESCRIPTION             VERSION(S)
thirdpartyresources/etcd-cluster.coreos.com   Managed etcd clusters   v1

And I can get the logs:

macncheese:kube-etcd-controller casey$ kubectl logs etcd-cluster-0000
2016-09-15 20:07:56.438104 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2379_TCP=tcp://10.0.252.49:2379
2016-09-15 20:07:56.438182 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_SERVICE_HOST=10.0.252.49
2016-09-15 20:07:56.438186 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_SERVICE_PORT_CLIENT=2379
2016-09-15 20:07:56.438189 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2380_TCP_PROTO=tcp
2016-09-15 20:07:56.438192 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2380_TCP_PORT=2380
2016-09-15 20:07:56.438195 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2380_TCP_ADDR=10.0.252.49
2016-09-15 20:07:56.438198 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2379_TCP_ADDR=10.0.252.49
2016-09-15 20:07:56.438201 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_SERVICE_PORT=2380
2016-09-15 20:07:56.438204 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2380_TCP=tcp://10.0.252.49:2380
2016-09-15 20:07:56.438207 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2379_TCP_PORT=2379
2016-09-15 20:07:56.438212 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_SERVICE_PORT_SERVER=2380
2016-09-15 20:07:56.438215 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT=tcp://10.0.252.49:2380
2016-09-15 20:07:56.438217 W | flags: unrecognized environment variable ETCD_CLUSTER_0000_PORT_2379_TCP_PROTO=tcp
2016-09-15 20:07:56.438252 I | etcdmain: etcd Version: 3.0.0+git
2016-09-15 20:07:56.438255 I | etcdmain: Git SHA: 9913e00
2016-09-15 20:07:56.438258 I | etcdmain: Go Version: go1.7.1
2016-09-15 20:07:56.438260 I | etcdmain: Go OS/Arch: linux/amd64
2016-09-15 20:07:56.438264 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2016-09-15 20:07:56.438372 I | embed: listening for peers on http://0.0.0.0:2380
2016-09-15 20:07:56.438416 I | embed: listening for client requests on 0.0.0.0:2379
2016-09-15 20:07:56.519617 E | netutil: could not resolve host etcd-cluster-0000:2380
2016-09-15 20:07:56.545915 I | etcdmain: --initial-cluster must include etcd-cluster-0000=http://etcd-cluster-0000:2380 given --initial-advertise-peer-urls=http://etcd-cluster-0000:2380
2016-09-15 20:07:56.545947 I | etcdmain: forgot to set --initial-cluster flag?
2016-09-15 20:07:56.545955 I | etcdmain: if you want to use discovery service, please set --discovery flag.

"kubectl describe etcd-cluster-0000" shows me the container image and command:

Containers:
  etcd-cluster-0000:
    Container ID:       docker://23268da1700e364cb14b83646e0f593a1453b5b7c9d0f2276129ca54dbe53030
    Image:              gcr.io/coreos-k8s-scale-testing/etcd-amd64:latest
    Image ID:           docker://sha256:1e3488a197dd60328276742ee4f93cc3277bc2773f5579e60b8e81491cc28aba
    Port:               2380/TCP
    Command:
      /usr/local/bin/etcd
      --data-dir
      /var/etcd/data
      --name
      etcd-cluster-0000
      --initial-advertise-peer-urls
      http://etcd-cluster-0000:2380
      --listen-peer-urls
      http://0.0.0.0:2380
      --listen-client-urls
      http://0.0.0.0:2379
      --advertise-client-urls
      http://etcd-cluster-0000:2379
      --initial-cluster
      etcd-cluster-0000=http://etcd-cluster-0000:2380
      --initial-cluster-state
      new
      --initial-cluster-token
      d029eb69-417f-49f8-a9a7-ff887469fb6

@hongchaodeng
Member

I have reproduced and verified it.
It looks like Kubernetes DNS isn't working for host-network pods.
This is a problem with Kubernetes, not the controller.

@caseydavenport Can you try to figure out how to make it work?

@caseydavenport
Contributor Author

Yeah, looks like we're running up against this: kubernetes/kubernetes#17406

Essentially, Kubernetes doesn't expose cluster DNS to host-networked pods :(

The issue points out some potential workarounds, but none seem that great. I'll have to dig a little deeper to see what we can do here.

@caseydavenport
Contributor Author

caseydavenport commented Sep 20, 2016

So, I don't think we can rely on DNS for this use case. It's not supported for Kubernetes host-networked pods, and even if it were, it wouldn't meet the Calico / Canal use case: we need etcd to come up before any other pods (including DNS), and we don't want etcd to be subject to DNS pod failures.

I think the solution is to rely on service clusterIPs rather than DNS names (for host networked clusters). I'm going to prototype this and see if it can get us what we need.
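To make the proposal concrete, here is an illustrative sketch of the flag change: for host-networked clusters, build each member's etcd flags from the member service's clusterIP instead of its DNS name, so no cluster-DNS lookup is ever needed. The function name and structure are hypothetical, not the controller's actual code (which is Go):

```python
# Illustrative sketch only: build etcd member flags from either the
# per-member service DNS name (default) or its clusterIP (hostNetwork).
# member_flags is a hypothetical helper, not the controller's real API.
def member_flags(name, host, cluster_ip=None):
    """Return etcd flags; if cluster_ip is given, advertise the IP instead."""
    peer_host = cluster_ip or host
    return [
        "--name", name,
        "--initial-advertise-peer-urls", f"http://{peer_host}:2380",
        "--advertise-client-urls", f"http://{peer_host}:2379",
        "--initial-cluster", f"{name}=http://{peer_host}:2380",
    ]

# DNS-based flags (fail under hostNetwork, since cluster DNS is unavailable):
print(member_flags("etcd-cluster-0000", "etcd-cluster-0000"))

# clusterIP-based flags (using the service IP from `kubectl get svc`,
# e.g. 10.0.252.49 in the output above); nothing needs to resolve:
print(member_flags("etcd-cluster-0000", "etcd-cluster-0000",
                   cluster_ip="10.0.252.49"))
```

The trade-off is that the advertised IPs are only stable for the lifetime of the services, but that holds regardless of whether pods use host networking.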

@hongchaodeng
Member

@caseydavenport
Great to hear.
I'm still dealing with some Kubernetes test failures and haven't been able to get to the image parameter work..
You can grep for gcr.io/coreos-k8s-scale-testing and replace the references easily. Hopefully.
Also ping me on Slack if you have any questions.

@ethernetdan

A requirement for self-hosted clusters (#128) is that we cannot depend on any Kubernetes capabilities for client or peer networking.

@hongchaodeng @xiang90 what would be required to enable true host networking for clusters managed by the controller?

@xiang90
Collaborator

xiang90 commented Oct 6, 2016

A requirement for self-hosted clusters (#128) is that we cannot depend on any Kubernetes capabilities for client or peer networking.

Can you explain why?

@ethernetdan

etcd is a requirement for Kubernetes networking to be functional, so in recovery scenarios we are unable to rely on Kubernetes networking.

@xiang90
Collaborator

xiang90 commented Oct 6, 2016

@ethernetdan What are the recovery scenarios?

@xiang90
Collaborator

xiang90 commented Oct 6, 2016

Resolved offline with @ethernetdan. Recovery from a disaster case (a majority of the etcd cluster used by k8s is down) should work similarly to bringing up a new cluster, except that the seed etcd member is recovered from a backup.

@philips
Contributor

philips commented Oct 7, 2016

I really think we need etcd to support host networking and node IPs here. I am going to propose this layered thinking upstream for self-hosted cluster bring-up:

To do this, we divide the Kubernetes cluster into a series of layers, beginning with the kubelet (level 0) and going up to the add-ons and required networking (level 3). A cluster can self-host all of levels 0-3, or be only partially self-hosted (levels 2-3). We will give nicknames to some of the common types.

  • Kubelet : (level 0)
  • etcd: (level 1)
  • Control Plane: (level 2)
  • Add-ons and Networking: (level 3)

Given this layering, I really don't want a circular dependency between level 1 and level 3, as it would make it really difficult for administrators to recover from cluster failure. It also makes the system harder to reason about overall.

@xiang90
Collaborator

xiang90 commented Oct 7, 2016

@ethernetdan @philips I am fine with adding host network support. It definitely can help decouple the dependency mess. I just wanted to make sure we understand everything, since in theory everything should still work even without host networking if we are careful about ordering.

@ethernetdan

ethernetdan commented Oct 12, 2016

Anything I can do to help speed up this work?

@xiang90
Collaborator

xiang90 commented Nov 16, 2016

Fixed by #366

@xiang90 xiang90 closed this as completed Nov 16, 2016
@bamb00

bamb00 commented May 18, 2017

Hi,

Is there a fix or workaround for hostNetwork=true? I'm running etcd 3.1.6. What workaround can I apply on the Kubernetes side?

Has this been fixed on the etcd side, and is a Kubernetes fix still needed to support host-networked pods?

Thanks in Advance.

@hongchaodeng
Member

@bamb00
What's the context? What are you trying to do?

@bamb00

bamb00 commented May 19, 2017

The issue is referenced in @caseydavenport's comment, with the link to kubernetes/kubernetes#17406. I'm just wondering whether there is a workaround for this issue.
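For reference: later Kubernetes releases (1.6+) added a pod `dnsPolicy` value, `ClusterFirstWithHostNet`, that lets host-network pods use cluster DNS, which addresses the original resolution failure. A minimal sketch of the relevant pod spec fields (pod name and image are placeholders, not from this repo's examples):

```yaml
# Minimal sketch: with hostNetwork, the default dnsPolicy falls back to the
# node's resolv.conf, so cluster DNS names don't resolve. Setting dnsPolicy
# to ClusterFirstWithHostNet opts the pod back into cluster DNS.
apiVersion: v1
kind: Pod
metadata:
  name: etcd-host-net-example   # hypothetical name
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  containers:
    - name: etcd
      image: quay.io/coreos/etcd:v3.1.6
```

Note this only restores DNS resolution; the bootstrap-ordering concern discussed above (etcd must come up before the DNS pods) still applies.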
