
amazon-cloudwatch-observability fails with open /root/.aws/credentials ignoring the IRSA credentials #1101

Open
ecerulm opened this issue Mar 22, 2024 · 12 comments

Comments

@ecerulm
Contributor

ecerulm commented Mar 22, 2024

Describe the bug
When IMDSv2 is enabled on the worker nodes with hop limit 1, IMDS is not accessible from the pods. In general, I don't want pods to access IMDS since they can get credentials for the node IAM role.

When IMDSv2 is not accessible, it seems that the cloudwatch agent (I'm using the amazon-cloudwatch-observability eks addon) tries to use credentials from the non-existing file /root/.aws/credentials instead of using the credentials from IRSA. The pod uses a service account with an IRSA annotation and it has the environment variables AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE (injected by IRSA).

But I believe the amazon-cloudwatch-agent is ignoring the IRSA credentials (I suspect it's because IMDSv2 is not available, and then it decides it is "onprem").

I see this on startup of the pod:

D! [EC2] Found active network interface
I! imds retry client will retry 1 times
D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
D! could not get hostname without imds v1 fallback enable thus enable fallback
E! [EC2] Fetch hostname from EC2 metadata fail: EC2MetadataError: failed to make EC2Metadata request
 status code: 401, request id: 
D! should retry true for imds error : RequestError: send request failed

...

I! Detected the instance is OnPremise

Steps to reproduce
If possible, provide a recipe for reproducing the error.

EKS 1.29
EKS nodes 1.29 bottlerocket
with IMDSv2 (http tokens required, hop limit 1; the relevant metadata options are sketched after this list)
amazon-cloudwatch-observability eks addon v1.4.0-eksbuild.1 (default config)
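
For reference, a minimal sketch of the metadata options that trigger this behavior, applied directly to a single instance with the AWS CLI (the instance id is a placeholder; on a managed nodegroup these options would normally be set through the launch template instead):

aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-tokens required \
  --http-put-response-hop-limit 1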

What did you expect to see?

I expected the eks addon configuration to allow me to override "OnPrem" / "EC2", but I don't see that possibility in the amazon-cloudwatch-observability addon:

 aws eks describe-addon-configuration --addon-name amazon-cloudwatch-observability --addon-version "v1.4.0-eksbuild.1" --query configurationSchema | jq '.|fromjson'

What did you see instead?

I see that it detects "OnPremise" and I believe that in turn forces it to use /root/.aws/credentials, when in fact it should be using the credentials from IRSA via the existing environment variables AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE.
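
A quick way to confirm that the IRSA variables really are injected into the agent pod (the pod name is a placeholder):

kubectl -n amazon-cloudwatch exec <cloudwatch-agent-pod> -- env | grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE'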

What version did you use?
Version: (e.g., v1.247350.0, etc)

I'm using amazon-cloudwatch-observability eks addon version v1.4.0-eksbuild.1; I don't know which version of the amazon-cloudwatch-agent is included with that.

What config did you use?
Config: (e.g. the agent json config file)

Environment
EKS 1.29
EKS nodes 1.29 bottlerocket
with IMDSv2 (http tokens required, hop limit 1)
amazon-cloudwatch-observability eks addon v1.4.0-eksbuild.1, (default config)

Additional context
Add any other context about the problem here.

@faizanshah-tp

Facing the same while running the cloudwatch agent as a DaemonSet.

D! [EC2] Found active network interface
I! imds retry client will retry 1 times
D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
D! could not get hostname without imds v1 fallback enable thus enable fallback
E! [EC2] Fetch hostname from EC2 metadata fail: EC2MetadataError: failed to make EC2Metadata request
 status code: 401, request id: 
D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
D! should retry true for imds error : RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
D! could not get instance document without imds v1 fallback enable thus enable fallback

@ecerulm
Contributor Author

ecerulm commented Mar 25, 2024

I uninstalled the amazon-cloudwatch-observability eks add-on and installed it using the instructions at https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-metrics.html
and I'm getting the same result.

But I can set RUN_WITH_IRSA="True" (I actually have a service account with an IRSA annotation) in the DaemonSet and that makes it detect:

I! Detected from ENV RUN_WITH_IRSA is True

The RUN_WITH_IRSA environment variable does not seem to be documented, but it's in the source code and it works; the value needs to be True (not true or 1).
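
For anyone who wants to try the same thing, a sketch of setting the variable on the already-deployed DaemonSet (the DaemonSet name is an assumption, adjust it to whatever was actually created; if the DaemonSet is managed by the operator the change may be reverted, so patching the AmazonCloudWatchAgent resource, as shown in a later comment, is more durable):

kubectl -n amazon-cloudwatch set env daemonset/cloudwatch-agent RUN_WITH_IRSA=True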

@ecerulm
Contributor Author

ecerulm commented Mar 25, 2024

Although using RUN_WITH_IRSA allows the DaemonSet to run and some metrics are sent to CloudWatch, I can see that it still tries to use the IMDS to get the EC2 metadata (instance id, image id, and instance type). I guess it's not possible to get those any other way currently.

CloudWatch Container Insights still lacks the "Top 10 Nodes by CPU Utilization", etc. I guess all metrics that have NodeName as a dimension are missing.

I guess this means that amazon-cloudwatch-agent really needs IMDS, and maybe that should be documented. You can't have it without it, can you?


2024-03-25T14:14:10Z I! {"caller":"host/ec2metadata.go:78","msg":"Fetch instance id and type from ec2 metadata","kind":"receiver","name":"awscontainerinsightreceiver","data_type":"metrics"}
2024-03-25T14:14:11.425Z	DEBUG	aws@v0.0.0-20231208183748-c00ca1f62c3e/imdsretryer.go:45	imds error : 	{"shouldRetry": true, "error": "RequestError: send request failed\ncaused by: Put \"http://169.254.169.254/latest/api/token\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}

@ecerulm
Contributor Author

ecerulm commented Mar 26, 2024

"Restrict the use of host networking and block access to instance metadata service" sort of recommends blocking IMDS from the pods:

While these privileges are required for the node to operate effectively, it is not usually desirable that the pods running on the node inherit these privileges.

But IMDS access seems like a hard requirement for using the cloudwatch agent for kubernetes container insights with enhanced observability (enhanced_container_insights).

The alternatives are:

  • Increase the IMDSv2 hop limit to 2 (that allows all pods to access IMDS and use the node IAM role)
  • Increase the IMDSv2 hop limit and add a k8s NetworkPolicy to each namespace to block access to 169.254.0.0/16; this is cumbersome since you need to add it to all namespaces (there is no GlobalNetworkPolicy in vanilla Kubernetes) and there is no explicit deny either (a minimal policy is sketched after this list)
  • Keep the IMDSv2 hop limit at 1 and run the cloudwatch agent DaemonSet with pod.spec.hostNetwork = true
    • you need to enforce that untrusted pods do not use host networking, with admission controllers
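
As an illustration of the NetworkPolicy alternative, here is a minimal sketch of a per-namespace policy that blocks egress only to the IMDS address (169.254.169.254/32 rather than the whole 169.254.0.0/16 range); it assumes a CNI that actually enforces NetworkPolicy, for example Calico or the VPC CNI with network policy enforcement enabled:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-imds
  namespace: default   # has to be repeated in every namespace
spec:
  podSelector: {}       # applies to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 169.254.169.254/32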

@jefchien
Contributor

Hi @ecerulm, we are aware of the current IMDS requirement and are internally tracking an alternative for when IMDS is unavailable.

@AaronFriel

AaronFriel commented Apr 10, 2024

But IMDS access seems like a hard requirement for using the cloudwatch agent for kubernetes container insights with enhanced observability (enhanced_container_insights).

This isn't the case - if you edit the operator's agent resource configuration like so, you will see that it is capable of using IRSA:

$ kubectl -n amazon-cloudwatch edit amazoncloudwatchagents.cloudwatch.aws.amazon.com 

Apply this change:

 apiVersion: v1
 items:
 - apiVersion: cloudwatch.aws.amazon.com/v1alpha1
   kind: AmazonCloudWatchAgent
   metadata:
     annotations:
       pulumi.com/patchForce: "true"
     creationTimestamp: "2024-04-01T08:21:38Z"
     generation: 5
     labels:
       app.kubernetes.io/managed-by: amazon-cloudwatch-agent-operator
     name: cloudwatch-agent
     namespace: amazon-cloudwatch
     resourceVersion: "3839446"
     uid: 542fecd4-0368-4ab1-8d8b-e7e5ad47c538
   spec:
     config: '{"agent":{"region":"us-west-2"},"logs":{"metrics_collected":{"app_signals":{"hosted_in":"opal-quokka-6860d02"},"kubernetes":{"cluster_name":"opal-quokka-6860d02","enhanced_container_insights":true}}},"traces":{"traces_collected":{"app_signals":{}}}}'
     env:
+  - name: RUN_WITH_IRSA
+    value: true  
   - name: K8S_NODE_NAME
     valueFrom:
       fieldRef:
         fieldPath: spec.nodeName

@ecerulm
Contributor Author

ecerulm commented Apr 10, 2024

@AaronFriel

Like I commented at #1101 (comment), even with RUN_WITH_IRSA it still goes to IMDS to obtain the instance id, etc.

Although using RUN_WITH_IRSA allows the DaemonSet to run and some metrics are sent to CloudWatch, I can see that it still tries to use the IMDS to get the EC2 metadata (instance id, image id, and instance type)

The instance id, etc. are needed for the "kubernetes container insights with enhanced observability (enhanced_container_insights)" metrics, and since they can't be obtained, those metrics are not sent.

I don't think there is any way to pass the instance id, etc. by other means today, but @jefchien seems to indicate that they may be working on some alternative.

@sbabalol

Any ETA on this fix?

@ArthurMelin

I don't think there is any way to pass the instance id, etc. by other means today, but @jefchien seems to indicate that they may be working on some alternative.

Maybe the agent could grab the instance ID from the spec.providerID field on the Node object, which can be fetched from the kube-api?
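
For what it's worth, on EKS the providerID has the form aws:///<availability-zone>/<instance-id>, so the instance id could in principle be parsed out of something like this (node name and output are illustrative placeholders):

kubectl get node <node-name> -o jsonpath='{.spec.providerID}'
# e.g. aws:///eu-west-1a/i-0123456789abcdef0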

@Cornul11

Cornul11 commented Jul 23, 2024

But IMDS access seems like a hard requirement for using the cloudwatch agent for kubernetes container insights with enhanced observability (enhanced_container_insights).

This isn't the case - if you edit the operator's agent resource configuration like so, you will see that it is capable of using IRSA:

$ kubectl -n amazon-cloudwatch edit amazoncloudwatchagents.cloudwatch.aws.amazon.com 

Apply this change:

 apiVersion: v1
 items:
 - apiVersion: cloudwatch.aws.amazon.com/v1alpha1
   kind: AmazonCloudWatchAgent
   metadata:
     annotations:
       pulumi.com/patchForce: "true"
     creationTimestamp: "2024-04-01T08:21:38Z"
     generation: 5
     labels:
       app.kubernetes.io/managed-by: amazon-cloudwatch-agent-operator
     name: cloudwatch-agent
     namespace: amazon-cloudwatch
     resourceVersion: "3839446"
     uid: 542fecd4-0368-4ab1-8d8b-e7e5ad47c538
   spec:
     config: '{"agent":{"region":"us-west-2"},"logs":{"metrics_collected":{"app_signals":{"hosted_in":"opal-quokka-6860d02"},"kubernetes":{"cluster_name":"opal-quokka-6860d02","enhanced_container_insights":true}}},"traces":{"traces_collected":{"app_signals":{}}}}'
     env:
+  - name: RUN_WITH_IRSA
+    value: true  
   - name: K8S_NODE_NAME
     valueFrom:
       fieldRef:
         fieldPath: spec.nodeName

With this change applied to the agent, the issue still persists.

As mentioned in another comment, I created a custom launch template with an increased number of max hops, which solved the issue. I do understand, however, that this may be a security concern and should be avoided, but, as a temporary measure until the addon is fixed, it is acceptable for our use case.
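
For anyone applying the same workaround, a sketch of what the launch template change looks like with the AWS CLI (the launch template id and source version are placeholders; the nodegroup then needs to be updated to the new version so the instances are replaced):

aws ec2 create-launch-template-version \
  --launch-template-id lt-0123456789abcdef0 \
  --source-version 1 \
  --launch-template-data '{"MetadataOptions":{"HttpTokens":"required","HttpPutResponseHopLimit":2}}'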

@mcujba

mcujba commented Jul 30, 2024

But IMDS access seems like a hard requirement for using the cloudwatch agent for kubernetes container insights with enhanced observability (enhanced_container_insights).

This isn't the case - if you edit the operator's agent resource configuration like so, you will see that it is capable of using IRSA:

$ kubectl -n amazon-cloudwatch edit amazoncloudwatchagents.cloudwatch.aws.amazon.com 

Apply this change:

 apiVersion: v1
 items:
 - apiVersion: cloudwatch.aws.amazon.com/v1alpha1
   kind: AmazonCloudWatchAgent
   metadata:
     annotations:
       pulumi.com/patchForce: "true"
     creationTimestamp: "2024-04-01T08:21:38Z"
     generation: 5
     labels:
       app.kubernetes.io/managed-by: amazon-cloudwatch-agent-operator
     name: cloudwatch-agent
     namespace: amazon-cloudwatch
     resourceVersion: "3839446"
     uid: 542fecd4-0368-4ab1-8d8b-e7e5ad47c538
   spec:
     config: '{"agent":{"region":"us-west-2"},"logs":{"metrics_collected":{"app_signals":{"hosted_in":"opal-quokka-6860d02"},"kubernetes":{"cluster_name":"opal-quokka-6860d02","enhanced_container_insights":true}}},"traces":{"traces_collected":{"app_signals":{}}}}'
     env:
+  - name: RUN_WITH_IRSA
+    value: true  
   - name: K8S_NODE_NAME
     valueFrom:
       fieldRef:
         fieldPath: spec.nodeName

With this change applied to the agent, the issue still persists.

As mentioned in another comment, I created a custom launch template with an increased number of max hops, which solved the issue. I do understand, however, that this may be a security concern and should be avoided, but, as a temporary measure until the addon is fixed, it is acceptable for our use case.

The workaround is valid. You need to write the "True" value starting with an uppercase letter.
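
In other words, the env entry has to end up looking like the snippet below; the value must be the quoted string "True", since Kubernetes env values are strings and the agent only accepts exactly "True":

env:
- name: RUN_WITH_IRSA
  value: "True"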

@kwangjong

kwangjong commented Jul 31, 2024

I used this helm chart to deploy the add-on:
https://github.com/aws-observability/helm-charts

Modifying amazon-cloudwatch-observability/templates/linux/cloudwatch-agent-daemonset.yaml like below solved the issue.

apiVersion: cloudwatch.aws.amazon.com/v1alpha1
kind: AmazonCloudWatchAgent
metadata:
  name: {{ template "cloudwatch-agent.name" . }}
  namespace: {{ .Release.Namespace }}
spec:
+ hostNetwork: true
  image: {{ template "cloudwatch-agent.image" . }}
  mode: daemonset
  ...
  env:
+ - name: RUN_WITH_IRSA
+   value: "True"
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  ...

Although this is not required, I configured Gatekeeper to restrict host network access exclusively to CloudWatch Agent pods for enhanced security.

constraintTemplate.yaml:

apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8sallowedhostnetworking
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedHostNetworking
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedhostnetworking

        default allow = false

        allow {
          input.review.object.metadata.labels["app.kubernetes.io/name"] == "cloudwatch-agent"
        }

        violation[{"msg": msg}] {
          not allow
          input.review.object.spec.hostNetwork == true
          msg := "Host network is not allowed for this pod"
        }

constraint.yaml:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedHostNetworking
metadata:
  name: allowed-host-networking
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
