
When telepresence crashes, previous state is not restored #260

Closed
itamarst opened this issue Aug 1, 2017 · 15 comments
Labels
bug Something isn't working

Comments


itamarst commented Aug 1, 2017

If the Telepresence client crashes, --swap-deployment won't restore the original Deployment, and the Deployment created by --new-deployment won't be deleted.

Implementation idea: have the k8s-proxy pod clean itself up after some timeout (1 minute?) of the client being disconnected. I believe the necessary Kubernetes API credentials are available in /var/run/secrets/kubernetes.io, the "service account" (https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/).

This may not work in all cases, e.g. in OpenShift or similar Kubernetes configurations where the pod may not necessarily have permission to access the API server. So the client should probably still do cleanup by default, even if this is implemented.
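For illustration, a minimal sketch of that in-cluster cleanup, assuming the pod's service account is allowed to delete Deployments; the function name and error handling are made up, not actual Telepresence code:

import requests

# In-cluster service account credentials (see the link above).
TOKEN_FILE = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_FILE = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
NAMESPACE_FILE = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"
API_SERVER = "https://kubernetes.default.svc"

def delete_own_deployment(deployment_name):
    """Called once the client has been unreachable for the timeout;
    deletes the proxy's own Deployment via the Kubernetes API."""
    with open(TOKEN_FILE) as f:
        token = f.read()
    with open(NAMESPACE_FILE) as f:
        namespace = f.read().strip()
    url = "{}/apis/apps/v1/namespaces/{}/deployments/{}".format(
        API_SERVER, namespace, deployment_name)
    resp = requests.delete(
        url, headers={"Authorization": "Bearer " + token}, verify=CA_FILE)
    resp.raise_for_status()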

itamarst added the bug label Aug 1, 2017
Contributor

ark3 commented Sep 11, 2017

Here's a related idea (from @james-atwill-hs in Gitter chat):

I think it'd be kinda handy if telepresence tucked the existing pod/deployment/etc configs in an annotation before taking them over in swap-deployment (then there could be 'recover-deployment' that wouldn't need local state)

it's common for developers to just close their laptops, or, when they go to a meeting, to fall off the 'dev' wifi network onto the general 'office' one, which doesn't have any k8s access.

(and if they're not debugging a microservice that is a leaf in the dependency tree, things would get messy pretty fast).
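For what it's worth, a rough sketch of that annotation idea using the kubernetes Python client; the annotation key and helper names here are invented for illustration, not anything Telepresence does today:

import json
from kubernetes import client, config

ANNOTATION = "telepresence/original-spec"  # made-up key

def stash_original(namespace, name):
    """Before swapping, record what is needed to restore the Deployment
    (here just the replica count) as an annotation on the Deployment itself."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment(name, namespace)
    original = {"replicas": dep.spec.replicas}
    apps.patch_namespaced_deployment(
        name, namespace,
        {"metadata": {"annotations": {ANNOTATION: json.dumps(original)}}})

def recover_deployment(namespace, name):
    """A 'recover-deployment' that needs no local state: read the annotation
    back and scale the Deployment to its original size."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment(name, namespace)
    original = json.loads(dep.metadata.annotations[ANNOTATION])
    apps.patch_namespaced_deployment(
        name, namespace, {"spec": {"replicas": original["replicas"]}})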

@jaredallard

What's the status of work on this? So far telepresence has shown itself to be really bad at cleaning up after itself.

Contributor

ark3 commented Apr 10, 2018

Plan:

  1. Telepresence writes to the filesystem the steps required to clean up the cluster. If it crashes, it informs the user that the file is available for manual fix-up (see the sketch after this list).
  2. Telepresence pushes that same information into the cluster, probably as an annotation, to allow any user to clean up the cluster.
  3. The Telepresence pod detects that it has lost its connection to the client and that sufficient time has passed, then performs cleanup itself, including deleting itself.
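For item 1, a minimal sketch of what writing out the cleanup steps could look like; the helper name, file layout, and exact kubectl commands are assumptions, not the actual implementation:

import os
import stat

def write_cleanup_script(session_dir, deployment_name, original_replicas):
    """Write a shell script with the commands needed to undo a swap,
    so the user can fix up the cluster by hand after a crash."""
    path = os.path.join(session_dir, "cleanup.sh")
    with open(path, "w") as f:
        f.write("#!/bin/sh\n")
        f.write("kubectl delete svc,deploy -l telepresence\n")
        f.write("kubectl scale --replicas={} deploy {}\n".format(
            original_replicas, deployment_name))
    os.chmod(path, os.stat(path).st_mode | stat.S_IEXEC)
    return path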


bourquep commented Jun 1, 2018

Whenever I get into this state, running Telepresence again doesn't work. What are the processes that I must kill to start from a clean slate, short of rebooting my Mac?

Contributor

ark3 commented Jun 5, 2018

@bourquep My apologies for not getting back to you sooner. This issue is mostly about things in the cluster (deployments, mainly), but it's possible that a rogue sshuttle-telepresence process gets left behind, which could cause networking problems on your local machine.

Can you describe the bad state you're experiencing, i.e. what are the symptoms? Do you have the telepresence log file from the crashed session? This info will help me answer your question.

Thanks for your help.

@ofpiyush
Collaborator

In our case, we often need to disconnect and reconnect wifi before restarting telepresence.

Haven't diagnosed deeper than this yet, as it usually happens in the middle of trying to get something else done.


mhworth commented Jan 6, 2019

Is there an official way to clean up after a crash or disconnection? I'm using --swap-deployment and --docker-run, which means when there's a disconnect, the telepresence deployment becomes a zombie. To restore, I do this:

  1. List all of the deployments
  2. Find the zombie one
  3. Delete it
  4. Edit the original deployment
  5. Change replicas from 0 to 1

Is there an easier way? What I would like is for telepresence to see that there is already a zombie deployment and either kill it and start a new one, or reconnect to it.

Contributor

ark3 commented Jan 7, 2019

If Telepresence crashes, it will still perform normal cleanup after you respond to the crash reporter, unless you hit Ctrl-C. Perhaps Telepresence should be more aggressive about preventing the user from breaking it forcibly, but I know in my usage (where Telepresence crashes a lot, since I'm changing the code frequently), I almost never have to clean up the cluster.

On the other hand, if you close your laptop lid or fall off the network, you'll definitely end up in a bad state where cluster cleanup will be necessary. It would be great if Telepresence automated post-hoc cleanup! In the meantime, the following two commands should cover most cases:

kubectl delete svc,deploy -l telepresence
kubectl scale --replicas=1 deploy [deployment name]


mhworth commented Jan 8, 2019

Makes sense; thanks. Those two commands are what I've been doing most of the time; I just wanted to check whether there was a blessed way to do this. The laptop-lid case is the one I'm worried about: we're using telepresence as the backbone of our development process, so this comes up pretty frequently when people close the lid before shutting down the telepresence session.

Thanks for the great tool by the way!

@tvvignesh

Currently, internet issues cause telepresence to crash, and then I have to manually delete the deployments from the cluster and run it again. Any idea how we can get telepresence to do this (if not automatically, then at least manually)? Thanks.


tenitski commented Jan 12, 2020

We do this sort of cleanup in a shell script that wraps the swap command:

echo "Cleaning up container: ${container_name}"
# Remove the docker container if it was not properly cleaned up by Telepresence
docker ps -a -q --filter name="${container_name}" | xargs -t docker rm --force
# Remove previous Telepresence proxy if still running
kubectl delete service,deployment -l "telepresence,app=${service}"

Then, when you do the actual swap, you need to set the container name so the cleanup code knows which container to watch for:

telepresence --swap-deployment --name ${container_name} ...

The cleanup happens on the next swap run, not when the swap session finishes. This is not ideal, but it's better than nothing.

@ark3 ark3 removed their assignment Jan 16, 2020
@takumakume

I think it is difficult to solve this problem within the telepresence process itself.

I developed a cleaner tool:

https://github.com/takumakume/telepolice

It solves the problem by running the cleaner as a resident process in the cluster.

Broken telepresence resources have problems with the sshd process in the container.
This tool detects them and performs something like a telepresence cleanup.

Easy to install.

  • To install as a cleaner on Kubernetes:
    kubectl apply -f https://raw.githubusercontent.com/takumakume/telepolice/master/manifests/release.yaml
  • To install the CLI tool:
    go get github.com/takumakume/telepolice

What do you think about this tool?

Contributor

ark3 commented Feb 14, 2020

@takumakume Nice job!

The proxy pod has its own idea of whether the connection is still okay. At present, it does not do anything with this information (other than logging a message). But this information could be exposed, e.g., as a file in the Pod. So your tool could check for the existence of /tmp/session-is-dead or maybe the contents of /tmp/session-status or something like that via kubectl exec. What do you think?

Similarly, it would be easy for Telepresence to mark the original deployment name and number of replicas as annotations on the proxy deployment/pod. This would make it easier for your tool to reset the original deployment. Would that be helpful?
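Purely as illustration, if Telepresence did write such a file, an external cleaner could poll it roughly like this (pod and namespace names are placeholders; no such file exists today):

import subprocess

def session_is_dead(pod, namespace):
    """Read the hypothetical /tmp/session-status file in the proxy pod via
    kubectl exec; treat a missing file or a "dead" marker as a dead session."""
    result = subprocess.run(
        ["kubectl", "exec", "-n", namespace, pod, "--",
         "cat", "/tmp/session-status"],
        capture_output=True, text=True)
    return result.returncode != 0 or result.stdout.strip() == "dead"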

@takumakume

@ark3 Thanks for the helpful comments!

The proxy pod has its own idea of whether the connection is still okay.

I think so too.

So your tool could check for the existence of /tmp/session-is-dead or maybe the contents of /tmp/session-status or something like that via kubectl exec.

I tried this.
However, I could not find the following files in the telepresence-k8s container:

  • /tmp/session-is-dead
  • /tmp/session-status

(I checked the telepresence code, but found no reference to them.)

Instead, I will use the following check method, based on the telepresence polling class:

def periodic(self):
    """Periodically query the client"""
    deferred = self.agent.request(b"HEAD", b"http://localhost:9055/")
    deferred.addCallback(self.success)
    deferred.addErrback(self.failure)

def success(self, response):
    """Client is still there"""
    if response.code == 200:
        self.log.info("Checkpoint")
    else:
        self.log.warn("Client returned code {}".format(response.code))

def failure(self, failure):
    """Client is not there"""
    self.log.error("Failed to contact Telepresence client:")
    self.log.error(failure.getErrorMessage())
    self.log.warn("Perhaps it's time to exit?")

#
# telepresence client is working
#
~ $ echo -en "HEAD / HTTP/1.1\n\n" | nc localhost 9055
HTTP/1.0 200 OK
Server: BaseHTTP/0.6 Python/3.7.6
Date: Wed, 19 Feb 2020 02:21:59 GMT

#
# telepresence client is not working
#
~ $ echo -en "HEAD / HTTP/1.1\n\n" | nc localhost 9055
~ $ echo -en "HEAD / HTTP/1.1\n\n" | nc localhost 9055
~ $ echo -en "HEAD / HTTP/1.1\n\n" | nc localhost 9055
~ $ echo -en "HEAD / HTTP/1.1\n\n" | nc localhost 9055

I am trying to perform a health check using this information source.
What do you think?
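In case it helps, the same probe run from outside the cluster via kubectl exec could look roughly like this (pod, namespace, and container names are placeholders):

import subprocess

def client_alive(pod, namespace, container="telepresence-k8s"):
    """Return True if the client-side HTTP endpoint forwarded to port 9055
    inside the proxy pod still answers."""
    probe = 'echo -en "HEAD / HTTP/1.1\\n\\n" | nc localhost 9055'
    try:
        result = subprocess.run(
            ["kubectl", "exec", "-n", namespace, "-c", container, pod,
             "--", "sh", "-c", probe],
            capture_output=True, text=True, timeout=10)
    except subprocess.TimeoutExpired:
        return False
    return "200 OK" in result.stdout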

@donnyyung
Contributor

I believe this is no longer an issue in Telepresence 2: you can use the uninstall command to remove agents, and if telepresence quits while an intercept is in place, the intercept will be removed. Here are the docs on how to install Telepresence (https://www.telepresence.io/docs/latest/install/). Please re-open if you still see this issue in our latest version!
