
env-file generated with values such as: $(_TEL_APP_A__TEL_APP_A_xxxx):$(_TEL_APP_A__TEL_APP_A_yyyy) #2853

Closed
aaccioly opened this issue Nov 3, 2022 · 12 comments



aaccioly commented Nov 3, 2022

Description of the bug

I have containers configured with a combination of Environment Variables and Secrets, such as:

KAFKA_BOOTSTRAP_SERVERS : $(Kafka_Host):$(Kafka_Port)
KAFKA_INIT_ENABLED : true
Kafka_Host : secretKeyRef(cp-kafka.host) 
Kafka_Port : secretKeyRef(cp-kafka.port)
PG_PORT : secretKeyRef(postgresql-my-app.PG_PORT) 
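
For context, in the Deployment spec this roughly corresponds to an env block like the one below (the secret names and keys are assumed from the shorthand above; note that the variables being interpolated have to be defined earlier in the env list than the variable that references them):

      env:
        - name: Kafka_Host
          valueFrom:
            secretKeyRef:
              name: cp-kafka
              key: host
        - name: Kafka_Port
          valueFrom:
            secretKeyRef:
              name: cp-kafka
              key: port
        - name: PG_PORT
          valueFrom:
            secretKeyRef:
              name: postgresql-my-app
              key: PG_PORT
        - name: KAFKA_BOOTSTRAP_SERVERS
          value: $(Kafka_Host):$(Kafka_Port)
        - name: KAFKA_INIT_ENABLED
          value: "true"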

When I intercept a service with a command such as:

telepresence intercept my-app --port 8080:http --env-file ~/myconfig.env

It generates something like this:

KAFKA_BOOTSTRAP_SERVERS=$(_TEL_APP_A__TEL_APP_A_Kafka_Host):$(_TEL_APP_A__TEL_APP_A_Kafka_Port)
KAFKA_INIT_ENABLED=true
Kafka_Host=cp-kafka-headless
Kafka_Port=9092
PG_PORT=5432

As you can see, while Telepresence is able to retrieve environment variables and even secrets, it isn't handling the $(VAR) expansion correctly. I would expect output such as the following:

KAFKA_BOOTSTRAP_SERVERS=cp-kafka-headless:9092
KAFKA_INIT_ENABLED=true
Kafka_Host=cp-kafka-headless
Kafka_Port=9092
PG_PORT=5432

Or, alternatively:

KAFKA_BOOTSTRAP_SERVERS=$(Kafka_Host):$(Kafka_Port)
KAFKA_INIT_ENABLED=true
Kafka_Host=cp-kafka-headless
Kafka_Port=9092
PG_PORT=5432

The _TEL_APP_A__TEL_APP_A_ prefix shouldn't be there.

Logs:
You can find the logs attached below:

telepresence_logs.zip

Versions:

  • Telepresence: v2.8.5 (api v3)
  • Operating system: macOS Ventura 13.0
  • Kubernetes cluster: Running on GCP
❯ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.14", GitCommit:"0f77da5bd4809927e15d1658fb4aa8f13ad890a5", GitTreeState:"clean", BuildDate:"2022-06-15T14:17:29Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.14-gke.2700", GitCommit:"2211759978749a9c2935fe2f8db1e6823c3c0bb1", GitTreeState:"clean", BuildDate:"2022-07-26T09:20:06Z", GoVersion:"go1.16.15b7", Compiler:"gc", Platform:"linux/amd64"}
@thallgren
Member

@aaccioly I'm not able to reproduce this. From the looks of it, you've got the agent-injector injecting a traffic-agent sidecar for an already existing traffic-agent. I'm not sure how that can happen; I've never seen it before. I tested by doing telepresence intercept echo-env -- env on the simple manifest pasted below. The secret refs are expanded and interpolated correctly, and I can find the following entries in the output:

PUSER=admin
PPASSWD=1f2d1e2e67df
PCRED=admin:1f2d1e2e67df

The manifest:

---
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque
data:
  user: YWRtaW4=
  password: MWYyZDFlMmU2N2Rm
---
apiVersion: v1
kind: Service
metadata:
  name: echo-env
spec:
  type: ClusterIP
  selector:
    app: echo-env
  ports:
    - name: http
      port: 80
      targetPort: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-env
  labels:
    app: echo-env
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo-env
  template:
    metadata:
      labels:
        app: echo-env
    spec:
      containers:
        - name: echo
          image: jmalloc/echo-server
          env:
            - name: PUSER
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: user
            - name: PPASSWD
              valueFrom:
                secretKeyRef:
                  name: mysecret
                  key: password
            - name: PCRED
              value: $(PUSER):$(PPASSWD)
          ports:
            - name: http
              containerPort: 8080
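
For completeness, reproducing my test amounts to roughly the following (the file name is assumed):

kubectl apply -f echo-env.yaml
telepresence connect
telepresence intercept echo-env -- env | grep -E 'PUSER|PPASSWD|PCRED'
telepresence leave echo-env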


aaccioly commented Nov 12, 2022

Hey @thallgren, thanks for your input and for attempting to reproduce the issue.
In my client's environment the issue happens every time I connect to / intercept a deployment. I'm not sure if there's something specific to my client's environment that could result in the "injection of a second sidecar" issue. While I can work around it by post-processing the generated environment file to remove the offending variable prefixes (say, with a simple sed command), it's still a curious bug.
Is there anything else that I can do to help reproduce the issue?

@thallgren
Member

You'll need to check what the intercepted pod looks like when this happens. I suspect multiple recursive injections of the same thing, which isn't good at all. Maybe your client has another webhook that renames the injected traffic-agent, causing a second (and a third) agent to be injected?

@thallgren
Member

Closing this, as env injections work as expected.


aaccioly commented Dec 28, 2022

@thallgren, this is still happening with 2.9.5. A couple of other users have reported the same error as well; I wouldn't close the issue.

@thallgren
Member

@aaccioly feel free to reopen this if you can provide more info that would help us understand the problem. In particular, I'd be interested in information pertaining to my last comment. No such info has been provided and the reproducers we have work as expected, so this must be something that is very specific to your environment. Unless we get some more information about this so that we have a chance to reproduce, there isn't much that we can do.


aaccioly commented Dec 28, 2022

@thallgren, if you send me a specific list of commands that you want me to run, I'll be happy to do it.
I'm not an expert on K8s or Telepresence, so someone needs to break down what is missing.

This is neither specific to my environment nor such a difficult issue to reproduce (two other folks have linked to this issue with similar intermittent problems since I opened it; theirs went away, mine didn't).


thallgren commented Dec 28, 2022

Start by doing this:

telepresence quit -s

Now remove all your old logs so that we know what gets generated when you reconnect:

rm ~/Library/Logs/telepresence/*
telepresence connect
telepresence loglevel debug
telepresence status

What's the output of the status command?

Ensure that the agent is uninstalled:

telepresence uninstall agent my-app
  • What does the pod look like now? kubectl get pod my-app-xxx -o yaml
  • What does your workload look like? kubectl get deploy my-app -o yaml
  • What is listed in the telepresence-agent config map? kubectl describe configmap telepresence-agents

Now, start watching events in the ambassador namespace in one terminal using kubectl -n ambassador get events -w. Then, in another terminal, try your intercept again, and again check the following:

  • What does the pod look like now?
  • What does your workload look like?
  • What is listed in the telepresence-agent config map?

And please provide the output of telepresence gather-logs after doing all of this.

This ought to give us a clue of what's going on.
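
Something along these lines should collect all of the above in one go (a rough sketch; my-app, staging, and the intercept flags are placeholders, so adjust them to your workload, namespace, and ports):

#!/bin/sh
# Rough sketch of the diagnostics above; my-app, staging and the intercept flags are placeholders.
telepresence quit -s
rm ~/Library/Logs/telepresence/*
telepresence connect
telepresence loglevel debug
telepresence status

telepresence uninstall agent my-app
kubectl -n staging get pod -l app=my-app -o yaml > pod-before.yaml
kubectl -n staging get deploy my-app -o yaml > deploy-before.yaml
kubectl -n staging describe configmap telepresence-agents > agents-before.txt

# In a second terminal: kubectl -n ambassador get events -w
telepresence intercept my-app --port 8080:http --env-file ./myconfig.env

kubectl -n staging get pod -l app=my-app -o yaml > pod-after.yaml
kubectl -n staging get deploy my-app -o yaml > deploy-after.yaml
kubectl -n staging describe configmap telepresence-agents > agents-after.txt

telepresence gather-logs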

thallgren reopened this Dec 28, 2022

aaccioly commented Dec 28, 2022

Hey @thallgren, thanks for reopening the issue. I'll arrange for the logs and outputs from the above commands. It will probably take a while since I need clearance from my client before sharing any logs publicly (and everyone is currently on vacation), but I should have it in a week or two.


aaccioly commented Jan 5, 2023

Apparently this problem has disappeared from our environment.
As in, I've intercepted workloads on Azure clusters a few dozen times and had clean files come out of it every time. I reproduced it once in a GCP cluster running traffic-manager version 2.9.5 about 10 days ago, but that cluster was decommissioned and I'm not having any luck reproducing it on other GCP clusters.

So I guess that I'm stuck with this one.

For anyone else hitting the issue, here's the command that I use to clean up the file:

sed -i.bak 's/_TEL_APP_A__TEL_APP_A_//g' ~/myconfig.env

Version 2.9.5 of telepresence + traffic-manager really seems to help with this.
If I do manage to reproduce it again, I'll reopen this issue with the requested logs.

aaccioly closed this as completed Jan 5, 2023

dpoetzsch commented Jul 10, 2023

Hi,

We still experience this issue with telepresence 2.14.0.

We have a service running on GKE and use telepresence to intercept it. We have configuration such as:

      - env:
        - name: PGUSER
          valueFrom:
            secretKeyRef:
              key: user
              name: legal-tool-backend-db-credentials
        - name: PGPASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: legal-tool-backend-db-credentials
        - name: PGDATABASE
          value: backend
        - name: REDIS_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: legal-tool-redis-credentials
        - name: REDIS_QUEUE_URL
          value: redis://:$(REDIS_PASSWORD)@legal-tool-staging-redis-master:6379
        - name: DATABASE_URL
          value: postgres://$(PGUSER):$(PGPASSWORD)@$(PGHOST):$(PGPORT)/$(PGDATABASE)

During interceptions this ends up in env vars such as

DATABASE_URL=postgres://$(_TEL_APP_A__TEL_APP_A_PGUSER):secret@staging-gcloud-sqlproxy:5432/backend
REDIS_QUEUE_URL=redis://:$(_TEL_APP_A__TEL_APP_A_REDIS_PASSWORD)@staging-redis-master:6379

As you see, some variables have been replaced correctly while others have not. I have found that it is pretty random which variables are replaced and which are not. To me, it feels like some kind of race condition.

I also did all the steps you described earlier, @thallgren.

The output of telepresence status:

$ telepresence status
OSS User Daemon: Running
  Version           : v2.14.0
  Executable        : /usr/local/bin/telepresence
  Install ID        : 51b2fa17-ca6a-487c-ab36-be5304c2437e
  Status            : Connected
  Kubernetes server : https://35.198.177.182
  Kubernetes context: gke_jurbit-214022_europe-west3-a_legal-tool
  Manager namespace : ambassador
  Intercepts        : 0 total
OSS Root Daemon: Running
  Version    : v2.14.0
  DNS        : 
    Remote IP       : 127.0.0.1
    Exclude suffixes: [.com .io .net .org .ru]
    Include suffixes: []
    Timeout         : 8s
  Also Proxy : (0 subnets)
  Never Proxy: (1 subnets)
    - 35.198.177.182/32

As far as I can see, neither the Pod nor the Deployment contains anything telepresence-related at this time, except for a telepresence.getambassador.io/restartedAt annotation. The configmap shows as follows:

Name:         telepresence-agents
Namespace:    staging
Labels:       app.kubernetes.io/created-by=traffic-manager
              app.kubernetes.io/name=telepresence-agents
              app.kubernetes.io/version=2.14.0
Annotations:  <none>

Data
====

BinaryData
====

Events:  <none>

After intercepting, the deployment/pod contains telepresence-related entries such as:

    - name: _TEL_APP_A_DATABASE_URL
      value: postgres://$(_TEL_APP_A__TEL_APP_A_PGUSER):$(_TEL_APP_A__TEL_APP_A_PGPASSWORD)@$(_TEL_APP_A_PGHOST):$(_TEL_APP_A_PGPORT)/$(_TEL_APP_A__TEL_APP_A_PGDATABASE)
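
To illustrate what seems to be happening (just a guess at the mechanism, not Telepresence's actual code): if each injection pass prefixes env var names with _TEL_APP_A_ and rewrites $(NAME) references accordingly, then applying the same rewrite a second time to an already-rewritten value produces exactly the doubled prefix shown above. A quick shell sketch:

# Hypothetical rewrite pass: prefix every $(NAME) reference with _TEL_APP_A_
rewrite() { sed -E 's/\$\(([A-Za-z_][A-Za-z0-9_]*)\)/$(_TEL_APP_A_\1)/g'; }

echo 'postgres://$(PGUSER):$(PGPASSWORD)@$(PGHOST):$(PGPORT)/$(PGDATABASE)' | rewrite
# postgres://$(_TEL_APP_A_PGUSER):$(_TEL_APP_A_PGPASSWORD)@$(_TEL_APP_A_PGHOST):...

echo 'postgres://$(PGUSER):$(PGPASSWORD)@$(PGHOST):$(PGPORT)/$(PGDATABASE)' | rewrite | rewrite
# postgres://$(_TEL_APP_A__TEL_APP_A_PGUSER):$(_TEL_APP_A__TEL_APP_A_PGPASSWORD)@... (the doubled prefix)

In our real output only some of the references were doubled, so whatever actually happens is more selective, but the doubled-prefix pattern itself looks like a rewrite applied twice.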

If you do need specifics please let me know, as I don't feel comfortable sharing the whole deployment publicly on the internet.

The configmap looks as follows:

Name:         telepresence-agents
Namespace:    staging
Labels:       app.kubernetes.io/created-by=traffic-manager
              app.kubernetes.io/name=telepresence-agents
              app.kubernetes.io/version=2.14.0
Annotations:  <none>

Data
====
legal-tool-staging-backend:
----
EnvoyAdminPort: 19000
EnvoyLogLevel: warning
EnvoyServerPort: 18000
agentImage: docker.io/datawire/ambassador-telepresence-agent:1.13.17
agentName: legal-tool-staging-backend
containers:
- Mounts:
  - /etc/legal-backend-config
  envPrefix: A_
  intercepts:
  - agentPort: 9900
    containerPort: 3000
    containerPortName: http
    protocol: TCP
    serviceName: legal-tool-staging-backend
    servicePort: 3000
    servicePortName: http
    serviceUID: ef2b572f-5838-4737-8dea-58de91463d6f
  mountPoint: /tel_app_mounts/backend
  name: backend
logLevel: info
managerHost: traffic-manager.ambassador
managerPort: 8081
namespace: staging
tracingPort: 15766
workloadKind: Deployment
workloadName: legal-tool-staging-backend


BinaryData
====

Events:  <none>

Here are the logs: telepresence_logs.zip

I did not see any events via kubectl -n ambassador get events -w.

If you need more info let me know.

@thallgren
Member

I found a solution for this problem. Please see this comment for more info.
