KSM is unable to authenticate to cluster in 1.3.x+ releases #543

Closed
ehashman opened this issue Sep 25, 2018 · 9 comments · Fixed by #549
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ehashman
Member

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

When I run kube-state-metrics on any release 1.3.0+, it launches, but when it attempts to scrape any resources, I get the error:

User "system:anonymous" cannot list <resource>s at the cluster scope: No policy matched.

However, all of the ClusterRoleBindings are set up correctly for the cluster, so I am not sure why KSM is unable to authenticate.
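
For reference, our RBAC objects are essentially the upstream manifests; a rough sketch of the binding (upstream names and namespace assumed):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system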

The correct token is definitely loaded inside the KSM Pod, and I've successfully used it to authenticate as the ServiceAccount against the API server with a curl command (the API server identifies it as "system:serviceaccount:kube-system:kube-state-metrics" in this case). For some reason, it seems that KSM is ignoring the token?
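
Roughly the kind of check I ran from inside the Pod (a sketch; the paths are the standard ServiceAccount mount, and <master FQDN> is a placeholder):

TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  -H "Authorization: Bearer ${TOKEN}" \
  https://<master FQDN>/api/v1/pods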

What you expected to happen:

KSM successfully authenticates with the API server and scrapes the cluster. This does work correctly in the 1.2.0 release (though I believe certificate verification might not be working properly in that release, since I didn't have to load our custom CA certs to get KSM 1.2.0 working).

How to reproduce it (as minimally and precisely as possible):

I'm not really sure how to minimally reproduce this; it affects all our clusters (prod + QA) but there's nothing particularly special about them. We have RBAC enabled.

Environment:

  • Kubernetes version (use kubectl version): 1.8.7
  • Kube-state-metrics image version: 1.3.0, 1.3.1, 1.4.0
@k8s-ci-robot added the kind/bug label Sep 25, 2018
@mxinden
Contributor

mxinden commented Sep 25, 2018

@ehashman can you post the full logs of the kube-state-metrics pod? I am expecting something like "Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.", as kube-state-metrics should use the in-cluster configuration.

In addition, would you mind posting the deployment manifest and the RBAC-related manifests?

@ehashman
Member Author

Logs:

I0925 15:11:32.640473       1 main.go:76] Using default collectors
I0925 15:11:32.640519       1 main.go:90] Using all namespace
I0925 15:11:32.640525       1 main.go:96] No metric whitelist or blacklist set. No filtering of metrics will be done.
I0925 15:11:32.641979       1 main.go:145] Testing communication with server
I0925 15:11:32.670376       1 main.go:150] Running with Kubernetes cluster version: v1.8+. git version: v1.8.7-23+c83423d42b8533. git tree state: clean. commit: c83423d42b85332c503293c3f15845374424d68c. platform: linux/amd64
I0925 15:11:32.670401       1 main.go:152] Communication with server successful
I0925 15:11:32.671168       1 pod.go:207] collect pod with v1
I0925 15:11:32.671206       1 main.go:161] Starting kube-state-metrics self metrics server: 0.0.0.0:8081
I0925 15:11:32.671232       1 resourcequota.go:55] collect resourcequota with v1
I0925 15:11:32.671251       1 cronjob.go:97] collect cronjob with batch/v1beta1
I0925 15:11:32.671280       1 persistentvolume.go:61] collect persistentvolume with v1
I0925 15:11:32.671294       1 namespace.go:75] collect namespace with v1
I0925 15:11:32.671308       1 hpa.go:86] collect hpa with autoscaling/v2beta1
I0925 15:11:32.671326       1 job.go:122] collect job with batch/v1
I0925 15:11:32.671356       1 persistentvolumeclaim.go:67] collect persistentvolumeclaim with v1
I0925 15:11:32.671372       1 node.go:142] collect node with v1
I0925 15:11:32.671405       1 replicaset.go:82] collect replicaset with extensions/v1beta1
I0925 15:11:32.672069       1 replicationcontroller.go:90] collect replicationcontroller with v1
I0925 15:11:32.672596       1 statefulset.go:97] collect statefulset with apps/v1beta1
I0925 15:11:32.672841       1 endpoint.go:79] collect endpoint with v1
I0925 15:11:32.672865       1 secret.go:77] collect secret with v1
I0925 15:11:32.672884       1 configmap.go:61] collect configmap with v1
I0925 15:11:32.672895       1 daemonset.go:103] collect daemonset with extensions/v1beta1
I0925 15:11:32.672914       1 deployment.go:123] collect deployment with extensions/v1beta1
I0925 15:11:32.672936       1 limitrange.go:53] collect limitrange with v1
I0925 15:11:32.672947       1 service.go:70] collect service with v1
I0925 15:11:32.672960       1 main.go:231] Active collectors: pods,resourcequotas,cronjobs,persistentvolumes,namespaces,horizontalpodautoscalers,jobs,persistentvolumeclaims,nodes,replicasets,replicationcontrollers,statefulsets,endpoints,secrets,configmaps,daemonsets,deployments,limitranges,services
I0925 15:11:32.672968       1 main.go:186] Starting metrics server: 0.0.0.0:8080
E0925 15:11:32.674624       1 reflector.go:205] k8s.io/kube-state-metrics/pkg/collectors/collectors.go:91: Failed to list *v1.ResourceQuota: resourcequotas is forbidden: User "system:anonymous" cannot list resourcequotas at the cluster scope: No policy matched.
E0925 15:11:32.674691       1 reflector.go:205] k8s.io/kube-state-metrics/pkg/collectors/collectors.go:91: Failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User "system:anonymous" cannot list persistentvolumes at the cluster scope: No policy matched.

Command from the manifest:

command:
- /kube-state-metrics
- --apiserver=https://<master FQDN>
- --port=8080
- --telemetry-port=8081

Re: the full manifests, let me see if I can just get you the diff. They are not very different from upstream, particularly the RBAC manifests.

@ehashman
Member Author

The only deviations of our Kubernetes manifests from upstream are some small changes to the deployment: we drop the nanny Pod and set resource requests/limits explicitly, we specify the API server explicitly, and we mount our internal certificates inside the KSM Pod.

diff --git a/kubernetes/kube-state-metrics-deployment.yaml b/kubernetes/kube-state-metrics-deployment.yaml
index 28a119f..7a00a66 100644
--- a/kubernetes/kube-state-metrics-deployment.yaml
+++ b/kubernetes/kube-state-metrics-deployment.yaml
@@ -1,59 +1,61 @@
 apiVersion: apps/v1beta2
-# Kubernetes versions after 1.9.0 should use apps/v1
-# Kubernetes versions before 1.8.0 should use apps/v1beta1 or extensions/v1beta1
 kind: Deployment
 metadata:
-  name: kube-state-metrics
+  name: ksm-1.4.0
   namespace: kube-system
 spec:
   selector:
     matchLabels:
-      k8s-app: kube-state-metrics
+      k8s-app: ksm-1.4.0
   replicas: 1
   template:
     metadata:
       labels:
-        k8s-app: kube-state-metrics
+        k8s-app: ksm-1.4.0
     spec:
       serviceAccountName: kube-state-metrics
       containers:
       - name: kube-state-metrics
-        image: quay.io/coreos/kube-state-metrics:v1.4.0
+        image: <internal image server>/kube-state-metrics:v1.4.0
         ports:
         - name: http-metrics
           containerPort: 8080
         - name: telemetry
           containerPort: 8081
-        readinessProbe:
+        livenessProbe:
           httpGet:
             path: /healthz
             port: 8080
           initialDelaySeconds: 5
           timeoutSeconds: 5
-      - name: addon-resizer
-        image: k8s.gcr.io/addon-resizer:1.7
+        readinessProbe:
+          httpGet:
+            path: /metrics
+            port: 8080
+          initialDelaySeconds: 15
+          timeoutSeconds: 15
+        command:
+        - /kube-state-metrics
+        - --apiserver=https://<master FQDN>
+        - --port=8080
+        - --telemetry-port=8081
         resources:
-          limits:
-            cpu: 100m
-            memory: 30Mi
           requests:
-            cpu: 100m
-            memory: 30Mi
-        env:
-          - name: MY_POD_NAME
-            valueFrom:
-              fieldRef:
-                fieldPath: metadata.name
-          - name: MY_POD_NAMESPACE
-            valueFrom:
-              fieldRef:
-                fieldPath: metadata.namespace
-        command:
-          - /pod_nanny
-          - --container=kube-state-metrics
-          - --cpu=100m
-          - --extra-cpu=1m
-          - --memory=100Mi
-          - --extra-memory=2Mi
-          - --threshold=5
-          - --deployment=kube-state-metrics
+            cpu: 3
+            memory: 8Gi
+          limits:
+            cpu: 3
+            memory: 8Gi
+        volumeMounts:
+        - mountPath: /etc/ssl/certs
+          name: ca-certificates
+          readOnly: true
+        - mountPath: /etc/ssl/internal/certs/ca-certificates.crt
+          name: cert
+      volumes:
+      - name: ca-certificates
+        hostPath:
+          path: /etc/ssl/certs
+      - name: cert
+        hostPath:
+          path: /etc/ssl/certs/IT_Security.pem

@gregory-lyons

@ehashman backwards compatibility was sadly broken in #371

You can no longer use in-cluster config AND specify the API server URL.

We are also affected by this as we use the same configuration (in-cluster + API server URL) and are unable to upgrade past 1.2 at the moment.

@andyxning given your comment here #371 (comment), is it possible we could re-introduce this as a valid configuration? I would imagine that we are not the only two users affected by this.

@ehashman
Member Author

Ahhh, very interesting. In that case, I wonder if I can hack around this by overriding the master discovery environment variables (we need to specify the API server by FQDN because our certs don't contain IP SANs and otherwise fail verification). Will let you know how that goes.

@gregory-lyons

@ehashman great suggestion, overriding the KUBERNETES_SERVICE_HOST environment variable worked for me!
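
For anyone else hitting this, a minimal sketch of what that looks like in the container spec (assuming the API server listens on 443; <master FQDN> is the same placeholder as above):

env:
- name: KUBERNETES_SERVICE_HOST
  value: "<master FQDN>"
- name: KUBERNETES_SERVICE_PORT
  value: "443"

With these set, the in-cluster config builds the API server URL from the FQDN instead of the service IP, so hostname verification against the certificate's SANs can succeed.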

@ehashman
Member Author

Confirmed that this worked for me as well and I've successfully upgraded our KSM to 1.4.0 in production.

Question: I checked the release notes while trying to figure this out and didn't see any mention of the breaking change in #371 that caused this issue. Could we retroactively add a note about it with a link to the workaround?

@mxinden
Contributor

mxinden commented Sep 28, 2018

@ehashman right, this should have been a [CHANGE] entry in the changelog. Do you want to create a PR patching CHANGELOG.md? Otherwise I can do so as well.

@ehashman
Member Author

Happy to tackle this, PR incoming.
