High glbc CPU usage #25
From @jmn on January 27, 2017 1:53
I noticed in Stackdriver monitoring that glbc is using about 30% CPU constantly in my cluster. @thockin said on Slack that this might be related to a misconfigured GCE Ingress, but I looked through all my Ingresses and they are all class nginx. Does anyone have a clue or a suggestion on how to troubleshoot further?
One thing I noticed is that my Ingress "kube-lego", which is configured automatically by kube-lego, has a "status" section listing two IP addresses. The second IP address is my nginx load balancer; the first IP address is unknown to me, and I currently do not know where it is from.
Copied from original issue: kubernetes/ingress-nginx#183
From @thockin on January 27, 2017 1:57 Is that 130. IP a GCE load balancer? Are all the Services behind that Ingress of type NodePort or LoadBalancer?
From @jmn on January 27, 2017 2:00 I only have Services of type NodePort and ClusterIP; the 130. IP is unknown to me.
From @thockin on January 27, 2017 2:02 Look for it in your cloud console; maybe it is a load balancer or a VM IP. If you put a ClusterIP Service behind an Ingress you could trigger this.
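For anyone hunting down an unknown IP like this, a hedged example of gcloud commands that list a project's load balancer frontends and reserved addresses (assuming the Cloud SDK is configured for the affected project):

```sh
# An external IP attached to an Ingress usually shows up in one of these lists.
gcloud compute forwarding-rules list
gcloud compute addresses list
```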
From @jmn on January 27, 2017 2:18 I had two ClusterIP Services behind the Ingress; I changed those Services to NodePort.
From @bprashanth on January 27, 2017 2:29 I think your load balancer controllers are fighting. The fact that you have multiple IPs in the status list indicates that you are using both the nginx and the GCE ingress controllers; the 130. IP is a Google IP. You need to specify an ingress.class annotation to stop the fighting:
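A minimal sketch of that annotation (the Ingress name and backend Service are placeholders):

```yaml
apiVersion: extensions/v1beta1        # the Ingress API version in use at the time
kind: Ingress
metadata:
  name: my-ingress                    # placeholder name
  annotations:
    # "nginx" tells glbc (the GCE controller) to ignore this Ingress,
    # so only the nginx controller acts on it
    kubernetes.io/ingress.class: "nginx"
spec:
  backend:
    serviceName: my-service           # placeholder; should be a NodePort Service
    servicePort: 80
```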
From @bprashanth on January 27, 2017 2:30 Oh sorry, you did mention that they're all class nginx. Can you please post the Ingress that has multiple IPs?
From @bprashanth on January 27, 2017 2:32 GLBC runs on the master; are you also running it on your nodes?
From @jmn on January 27, 2017 2:35 My nodes are running GCI; I have not added GLBC.
From @jmn on January 27, 2017 2:48
From @jmn on January 27, 2017 2:50 The nginx annotation value is not quoted; I don't know if that is significant.
From @jmn on January 27, 2017 3:01 I basically run the https://github.com/kubernetes/ingress/blob/master/examples/deployment/nginx/nginx-ingress-controller.yaml deployment. It doesn't come with a Service like the kube-lego example does. I have not spent very long configuring this or studying the kube-lego documentation, so I might have missed something essential.
From @bprashanth on January 27, 2017 3:07 So Stackdriver is showing the CPU usage of what at 30%? GLBC running on your GKE master? AFAIK it won't show master CPU usage, and the master is managed by Google anyway. If you're running a GCE cluster, can you please SSH into your master and post the output of tail on /var/log/glbc?
From @jmn on January 27, 2017 3:13 It's GKE; I'm not sure I can SSH to the master. Can I?
From @jmn on January 27, 2017 3:19 It's the Stackdriver GKE container metric "CPU Usage"; I'm not sure how to narrow it down beyond this in Stackdriver.
From @bprashanth on January 27, 2017 3:23 Hmm, I'll try it and report back.
From @bprashanth on January 27, 2017 18:12 I can't seem to reproduce your chart; mine has more details in the annotations. Anyway, applying Stackdriver to a GKE cluster I only see nodes, which means it's showing you something on a node. My guess is that it's showing you the CPU usage of the default backend, which runs in the kube-system namespace (@piosz might be able to confirm -- Piotr, what is it using to produce that "glbc" annotation in the tooltip, the container name? In which case, is it actually getting a container from the GKE master?). The default backend is a simple Go HTTP server that serves the 404 page; if it's using CPU, that probably means your frontend is getting a lot of requests that don't match any URL. You can observe it via:
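A hedged example of commands that would show this (the name=glbc label is the one mentioned later in this thread and may differ per cluster version):

```sh
# Find the default backend pod in kube-system and check its CPU usage.
kubectl get pods --namespace=kube-system -l name=glbc
kubectl top pod --namespace=kube-system   # requires Heapster, which GKE runs by default
```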
From @jmn on January 27, 2017 20:22 Is the l7-default-backend-v1.0 logging anywhere? In the kube-system namespace logs I find only some errors related to heapster-nanny.
From @jmn on January 27, 2017 20:24 Also, is the l7-default-backend-v1.0 ReplicationController required on GKE?
From @bprashanth on January 27, 2017 20:28 A standard Go server doesn't log each request. The default backend is used by the nginx controller, and by all GCE cloud load balancers, to return an HTTP 404 whenever they receive a URL they don't understand. It has both resource requests and limits set:
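Based on the 10m/20Mi figures cited in this thread, the resources section of the pod spec looks roughly like this:

```yaml
resources:
  limits:
    cpu: 10m       # the 10m CPU cap discussed throughout this thread
    memory: 20Mi
  requests:
    cpu: 10m
    memory: 20Mi
```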
So if something else lands on the node and needs the resources, it'll be forced into the 10m/20Mi it requested.
From @jmn on January 27, 2017 20:40 Not sure that I need GLBC at the moment (?), so I did:
Seems fine?
From @jmn on January 27, 2017 21:03 Note I have another default-http-backend (in the default namespace): https://github.com/kubernetes/contrib/blob/master/ingress/controllers/nginx/examples/default-backend.yaml
From @tonglil on January 30, 2017 16:21 I can confirm the same issue. I always just thought this was normal (maybe CPU on the master, which I can't access, so 🤷♂️) until reading this. Here is a screenshot of three different dev clusters, where glbc is the top CPU consumer.
From @bprashanth on January 30, 2017 16:27 @tonglil your screenshot confirms that it's the default-http-backend (as you can see in the tooltip at the bottom right). First, I think we need to add some clarity around this. The source code for the default HTTP backend is https://github.com/kubernetes/contrib/tree/master/404-server; maybe as a first step we can output a periodic log line for every 100th request? Anyway, as mentioned before, it has limits set, so if anything else needs the CPU, the web server will be constrained to 10m.
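A minimal sketch of that logging idea as a plain net/http server (this is not the actual 404-server source; the port and response body are assumptions):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

var requests uint64

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Count every request, but only log every 100th one to keep log volume low.
		if n := atomic.AddUint64(&requests, 1); n%100 == 0 {
			log.Printf("served %d requests; latest: %s %s", n, r.Method, r.URL.Path)
		}
		w.WriteHeader(http.StatusNotFound)
		fmt.Fprint(w, "default backend - 404")
	})
	log.Fatal(http.ListenAndServe(":8080", nil)) // port 8080 is an assumption
}
```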
From @tonglil on January 31, 2017 21:22 Thanks, yes, I realize it is the default HTTP backend now. Regarding the second point, I don't believe there are a lot of requests to unknown URLs. For the endpoint where both HTTP and HTTPS are set up, there are together about 50 requests per day from IP hits plus unknown sources. The rest is the standard
It hasn't impacted anything so far (with the limits), so it's not a big concern.
From @mlazarov on February 16, 2017 21:48 I'm having the same issue on GKE, and the glbc is not the default backend. So far, what I'm observing is that the CPU load on glbc is going higher and higher over time. The cluster is almost 5 months old. Updating from Kubernetes 1.4.7 to 1.5.2 didn't make any change in the load.
From @bprashanth on February 16, 2017 22:02 Yeah, that is the default backend; it's just wrongly marked as glbc based on the label "name: glbc". You simply won't get glbc stats on GKE; glbc runs on the master, which is not in your project. The default backend is constrained to 10m (https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/cluster-loadbalancing/glbc/default-svc-controller.yaml#L38), so if you shove more pods onto that node, specify limits > 10m on them, and they actually need more CPU, the default backend should get boxed into that 10m. I'm actually not sure why a dumb HTTP server would start using more CPU over time. Perhaps there's a bug in the stats collection? What does something like top or atop show on the node?
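One hedged way to check that on a GCE/GKE node (the node name and zone are placeholders, and the container name filter may need adjusting):

```sh
# SSH to the node, then show per-container CPU via Docker's built-in stats.
gcloud compute ssh <node-name> --zone <zone>
sudo docker stats --no-stream | grep -i backend
```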
From @mlazarov on February 16, 2017 22:27 No one from the external world is hitting the default backend.
There are no logs inside it.
Any idea how to debug what this is doing?
From @tonglil on February 16, 2017 22:32 What about top?
From @mlazarov on February 16, 2017 22:45 Here is the output of top:
If I'm correct, 7m is 70% of the 10m limit, and that is exactly the same figure I'm seeing for glbc in the Stackdriver dashboard.
From @bprashanth on February 16, 2017 22:53 Oh, that's even more confusing then, as I'd expect that dashboard to show the percentage of available CPU used per pod, not the percentage of the specified limit (or to indicate somewhere that 100% means 10m). I'll follow up with a bug against Stackdriver monitoring.
From @mlazarov on February 16, 2017 22:56 If I'm correct, this is not a bug: 7m (the real usage) out of 10m (the real limit) is exactly 70% of the limit, and that is what Stackdriver is displaying. It looks right to me.
From @bprashanth on February 16, 2017 22:57 It's a confusing feature of the UI, not a bug.
From @tonglil on February 16, 2017 23:01 I agree; based on past experience with graphs, I would expect the UI to show percentages of the pod's or node's resources used. It should be made clearer somewhere in the Stackdriver docs that the percentages are usage/limit.
From @mlazarov on February 16, 2017 23:09 Yes, when I saw it for the first time it was confusing, but it seems like the right way to display the information. Some more inline docs would be good too. But what about the CPU usage itself?
From @bprashanth on February 16, 2017 23:51 I think that's alright; the limits are set to be a little higher than normal usage. Setting them much higher would be a waste, and setting them lower would result in throttling.
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with an /lifecycle frozen comment. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra, and/or fejta.
/close
Thanks @tonglil