KSM is unable to authenticate to cluster in 1.3.x+ releases #543

Closed
ehashman opened this issue Sep 25, 2018 · 9 comments · Fixed by #549
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ehashman
Member

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

When I run kube-state-metrics on any release 1.3.0+, it launches, but when it attempts to scrape any resources, I get the error:

User "system:anonymous" cannot list <resource>s at the cluster scope: No policy matched.

However, all of the ClusterRoleBindings are set up correctly for the cluster, so I am not sure why KSM is unable to authenticate.
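
For reference, our RBAC objects are essentially the upstream manifests; a rough sketch of the binding (upstream names and namespace assumed):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system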

The correct token is definitely loaded inside the KSM Pod, and I've successfully used it to authenticate as the ServiceAccount against the API server with a curl command (the API server identifies it as "system:serviceaccount:kube-system:kube-state-metrics" in this case). For some reason, it seems that KSM is ignoring the token?
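
Roughly the kind of check I ran from inside the Pod (a sketch; the paths are the standard ServiceAccount mount, and <master FQDN> is a placeholder):

TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  -H "Authorization: Bearer ${TOKEN}" \
  https://<master FQDN>/api/v1/pods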

What you expected to happen:

KSM successfully authenticates with the API server and scrapes the cluster. This does work correctly in the 1.2.0 release (though I believe certificate verification might not be working properly in that release, since I didn't have to load our custom CA certs to get KSM 1.2.0 working).

How to reproduce it (as minimally and precisely as possible):

I'm not really sure how to minimally reproduce this; it affects all our clusters (prod + QA) but there's nothing particularly special about them. We have RBAC enabled.

Environment:

  • Kubernetes version (use kubectl version): 1.8.7
  • Kube-state-metrics image version: 1.3.0, 1.3.1, 1.4.0
@k8s-ci-robot added the kind/bug label Sep 25, 2018
@mxinden
Contributor

mxinden commented Sep 25, 2018

@ehashman can you post the full logs of the kube-state-metrics pod? I am expecting something like "Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.", as kube-state-metrics should use the in-cluster configuration.

In addition, would you mind posting the deployment manifest and the RBAC-related manifests?

@ehashman
Member Author

Logs:

I0925 15:11:32.640473       1 main.go:76] Using default collectors
I0925 15:11:32.640519       1 main.go:90] Using all namespace
I0925 15:11:32.640525       1 main.go:96] No metric whitelist or blacklist set. No filtering of metrics will be done.
I0925 15:11:32.641979       1 main.go:145] Testing communication with server
I0925 15:11:32.670376       1 main.go:150] Running with Kubernetes cluster version: v1.8+. git version: v1.8.7-23+c83423d42b8533. git tree state: clean. commit: c83423d42b85332c503293c3f15845374424d68c. platform: linux/amd64
I0925 15:11:32.670401       1 main.go:152] Communication with server successful
I0925 15:11:32.671168       1 pod.go:207] collect pod with v1
I0925 15:11:32.671206       1 main.go:161] Starting kube-state-metrics self metrics server: 0.0.0.0:8081
I0925 15:11:32.671232       1 resourcequota.go:55] collect resourcequota with v1
I0925 15:11:32.671251       1 cronjob.go:97] collect cronjob with batch/v1beta1
I0925 15:11:32.671280       1 persistentvolume.go:61] collect persistentvolume with v1
I0925 15:11:32.671294       1 namespace.go:75] collect namespace with v1
I0925 15:11:32.671308       1 hpa.go:86] collect hpa with autoscaling/v2beta1
I0925 15:11:32.671326       1 job.go:122] collect job with batch/v1
I0925 15:11:32.671356       1 persistentvolumeclaim.go:67] collect persistentvolumeclaim with v1
I0925 15:11:32.671372       1 node.go:142] collect node with v1
I0925 15:11:32.671405       1 replicaset.go:82] collect replicaset with extensions/v1beta1
I0925 15:11:32.672069       1 replicationcontroller.go:90] collect replicationcontroller with v1
I0925 15:11:32.672596       1 statefulset.go:97] collect statefulset with apps/v1beta1
I0925 15:11:32.672841       1 endpoint.go:79] collect endpoint with v1
I0925 15:11:32.672865       1 secret.go:77] collect secret with v1
I0925 15:11:32.672884       1 configmap.go:61] collect configmap with v1
I0925 15:11:32.672895       1 daemonset.go:103] collect daemonset with extensions/v1beta1
I0925 15:11:32.672914       1 deployment.go:123] collect deployment with extensions/v1beta1
I0925 15:11:32.672936       1 limitrange.go:53] collect limitrange with v1
I0925 15:11:32.672947       1 service.go:70] collect service with v1
I0925 15:11:32.672960       1 main.go:231] Active collectors: pods,resourcequotas,cronjobs,persistentvolumes,namespaces,horizontalpodautoscalers,jobs,persistentvolumeclaims,nodes,replicasets,replicationcontrollers,statefulsets,endpoints,secrets,configmaps,daemonsets,deployments,limitranges,services
I0925 15:11:32.672968       1 main.go:186] Starting metrics server: 0.0.0.0:8080
E0925 15:11:32.674624       1 reflector.go:205] k8s.io/kube-state-metrics/pkg/collectors/collectors.go:91: Failed to list *v1.ResourceQuota: resourcequotas is forbidden: User "system:anonymous" cannot list resourcequotas at the cluster scope: No policy matched.
E0925 15:11:32.674691       1 reflector.go:205] k8s.io/kube-state-metrics/pkg/collectors/collectors.go:91: Failed to list *v1.PersistentVolume: persistentvolumes is forbidden: User "system:anonymous" cannot list persistentvolumes at the cluster scope: No policy matched.

Command from the manifest:

command:
- /kube-state-metrics
- --apiserver=https://<master FQDN>
- --port=8080
- --telemetry-port=8081

Re: the full manifests, let me see if I can just get you the diff. They are not very different from upstream, particularly the RBAC manifests.

@ehashman
Member Author

The only deviations of our Kubernetes manifests from upstream are some small changes to the deployment: we drop the nanny Pod and set resource requests/limits explicitly, we specify the API server explicitly, and we mount our internal certificates inside the KSM Pod.

diff --git a/kubernetes/kube-state-metrics-deployment.yaml b/kubernetes/kube-state-metrics-deployment.yaml
index 28a119f..7a00a66 100644
--- a/kubernetes/kube-state-metrics-deployment.yaml
+++ b/kubernetes/kube-state-metrics-deployment.yaml
@@ -1,59 +1,61 @@
 apiVersion: apps/v1beta2
-# Kubernetes versions after 1.9.0 should use apps/v1
-# Kubernetes versions before 1.8.0 should use apps/v1beta1 or extensions/v1beta1
 kind: Deployment
 metadata:
-  name: kube-state-metrics
+  name: ksm-1.4.0
   namespace: kube-system
 spec:
   selector:
     matchLabels:
-      k8s-app: kube-state-metrics
+      k8s-app: ksm-1.4.0
   replicas: 1
   template:
     metadata:
       labels:
-        k8s-app: kube-state-metrics
+        k8s-app: ksm-1.4.0
     spec:
       serviceAccountName: kube-state-metrics
       containers:
       - name: kube-state-metrics
-        image: quay.io/coreos/kube-state-metrics:v1.4.0
+        image: <internal image server>/kube-state-metrics:v1.4.0
         ports:
         - name: http-metrics
           containerPort: 8080
         - name: telemetry
           containerPort: 8081
-        readinessProbe:
+        livenessProbe:
           httpGet:
             path: /healthz
             port: 8080
           initialDelaySeconds: 5
           timeoutSeconds: 5
-      - name: addon-resizer
-        image: k8s.gcr.io/addon-resizer:1.7
+        readinessProbe:
+          httpGet:
+            path: /metrics
+            port: 8080
+          initialDelaySeconds: 15
+          timeoutSeconds: 15
+        command:
+        - /kube-state-metrics
+        - --apiserver=https://<master FQDN>
+        - --port=8080
+        - --telemetry-port=8081
         resources:
-          limits:
-            cpu: 100m
-            memory: 30Mi
           requests:
-            cpu: 100m
-            memory: 30Mi
-        env:
-          - name: MY_POD_NAME
-            valueFrom:
-              fieldRef:
-                fieldPath: metadata.name
-          - name: MY_POD_NAMESPACE
-            valueFrom:
-              fieldRef:
-                fieldPath: metadata.namespace
-        command:
-          - /pod_nanny
-          - --container=kube-state-metrics
-          - --cpu=100m
-          - --extra-cpu=1m
-          - --memory=100Mi
-          - --extra-memory=2Mi
-          - --threshold=5
-          - --deployment=kube-state-metrics
+            cpu: 3
+            memory: 8Gi
+          limits:
+            cpu: 3
+            memory: 8Gi
+        volumeMounts:
+        - mountPath: /etc/ssl/certs
+          name: ca-certificates
+          readOnly: true
+        - mountPath: /etc/ssl/internal/certs/ca-certificates.crt
+          name: cert
+      volumes:
+      - name: ca-certificates
+        hostPath:
+          path: /etc/ssl/certs
+      - name: cert
+        hostPath:
+          path: /etc/ssl/certs/IT_Security.pem

@gregory-lyons

@ehashman backwards compatibility was sadly broken in #371

You can no longer use in-cluster config AND specify the API server URL.

We are also affected by this as we use the same configuration (in-cluster + API server URL) and are unable to upgrade past 1.2 at the moment.

@andyxning given your comment here #371 (comment), is it possible we could re-introduce this as a valid configuration? I would imagine that we are not the only two users affected by this.

@ehashman
Member Author

Ahhh, very interesting. In that case, I wonder if I can hack around this by overriding the master discovery environment variables (we need to specify the API server by FQDN because our certs don't contain IP SANs and otherwise fail verification). Will let you know how that goes.

@gregory-lyons

@ehashman great suggestion, overriding the KUBERNETES_SERVICE_HOST environment variable worked for me!
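
For anyone else hitting this, a minimal sketch of what that looks like in the container spec (assuming the API server listens on 443; <master FQDN> is the same placeholder as above):

env:
- name: KUBERNETES_SERVICE_HOST
  value: "<master FQDN>"
- name: KUBERNETES_SERVICE_PORT
  value: "443"

With these set, the in-cluster config builds the API server URL from the FQDN instead of the service IP, so hostname verification against the certificate's SANs can succeed.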

@ehashman
Member Author

Confirmed that this worked for me as well and I've successfully upgraded our KSM to 1.4.0 in production.

Question: I checked the release notes while trying to figure this out and didn't see any mention of the breaking change in #371 that caused this issue. Could we retroactively add a note about it with a link to the workaround?

@mxinden
Contributor

mxinden commented Sep 28, 2018

@ehashman right, this should have been a [CHANGE] entry in the changelog. Do you want to create a PR patching CHANGELOG.md? Otherwise I can do so as well.

@ehashman
Member Author

Happy to tackle this, PR incoming.
