
When telepresence crashes, previous state is not restored #260

Closed
itamarst opened this issue Aug 1, 2017 · 15 comments
Labels
bug Something isn't working

Comments


itamarst commented Aug 1, 2017

If the Telepresence client crashes, --swap-deployment won't restore the original Deployment, and the Deployment created by --new-deployment won't be deleted.

Implementation idea: have the k8s-proxy pod clean itself up after some timeout (1 minute?) of the client being disconnected. I believe the necessary Kubernetes API credentials are available in /var/run/secrets/kubernetes.io, the "service account" (https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/).

This may not work in all cases, e.g. in OpenShift or similar Kubernetes configurations where the pod may not necessarily have permission to access the API server. So the client should probably still do cleanup by default, even if this is implemented.
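For illustration, a minimal sketch of that in-cluster cleanup, assuming the pod's service account is allowed to delete Deployments; the function name and error handling are made up, not actual Telepresence code:

import requests

# In-cluster service account credentials (see the link above).
TOKEN_FILE = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_FILE = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
NAMESPACE_FILE = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"
API_SERVER = "https://kubernetes.default.svc"

def delete_own_deployment(deployment_name):
    """Called once the client has been unreachable for the timeout;
    deletes the proxy's own Deployment via the Kubernetes API."""
    with open(TOKEN_FILE) as f:
        token = f.read()
    with open(NAMESPACE_FILE) as f:
        namespace = f.read().strip()
    url = "{}/apis/apps/v1/namespaces/{}/deployments/{}".format(
        API_SERVER, namespace, deployment_name)
    resp = requests.delete(
        url, headers={"Authorization": "Bearer " + token}, verify=CA_FILE)
    resp.raise_for_status()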

itamarst added the bug label Aug 1, 2017
Contributor

ark3 commented Sep 11, 2017

Here's a related idea (from @james-atwill-hs in Gitter chat):

I think it'd be kinda handy if telepresence tucked the existing pod/deployment/etc configs in an annotation before taking them over in swap-deployment (then there could be 'recover-deployment' that wouldn't need local state)

it's common for developers to just close their laptops, or, when they go to a meeting, to fall off the 'dev' wifi network onto the general 'office' one, which doesn't have any k8s access.

(and if they're not debugging a microservice that is a leaf in the dependency tree, things would get messy pretty fast).
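For what it's worth, a rough sketch of that annotation idea using the kubernetes Python client; the annotation key and helper names here are invented for illustration, not anything Telepresence does today:

import json
from kubernetes import client, config

ANNOTATION = "telepresence/original-spec"  # made-up key

def stash_original(namespace, name):
    """Before swapping, record what is needed to restore the Deployment
    (here just the replica count) as an annotation on the Deployment itself."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment(name, namespace)
    original = {"replicas": dep.spec.replicas}
    apps.patch_namespaced_deployment(
        name, namespace,
        {"metadata": {"annotations": {ANNOTATION: json.dumps(original)}}})

def recover_deployment(namespace, name):
    """A 'recover-deployment' that needs no local state: read the annotation
    back and scale the Deployment to its original size."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment(name, namespace)
    original = json.loads(dep.metadata.annotations[ANNOTATION])
    apps.patch_namespaced_deployment(
        name, namespace, {"spec": {"replicas": original["replicas"]}})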

@jaredallard

What's the status of work on this? So far telepresence has shown itself to be really bad at cleaning up after itself.

Contributor

ark3 commented Apr 10, 2018

Plan:

  1. Telepresence writes to the filesystem the steps required to clean up the cluster. If it crashes, it informs the user that the file is available for manual fix-up (see the sketch after this list).
  2. Telepresence pushes that same information into the cluster, probably as an annotation, to allow any user to clean up the cluster.
  3. The Telepresence pod detects that it has lost its connection to the client and that sufficient time has passed, then performs cleanup itself, including deleting itself.
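For item 1, a minimal sketch of what writing out the cleanup steps could look like; the helper name, file layout, and exact kubectl commands are assumptions, not the actual implementation:

import os
import stat

def write_cleanup_script(session_dir, deployment_name, original_replicas):
    """Write a shell script with the commands needed to undo a swap,
    so the user can fix up the cluster by hand after a crash."""
    path = os.path.join(session_dir, "cleanup.sh")
    with open(path, "w") as f:
        f.write("#!/bin/sh\n")
        f.write("kubectl delete svc,deploy -l telepresence\n")
        f.write("kubectl scale --replicas={} deploy {}\n".format(
            original_replicas, deployment_name))
    os.chmod(path, os.stat(path).st_mode | stat.S_IEXEC)
    return path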


bourquep commented Jun 1, 2018

Whenever I get into this state, running Telepresence again doesn't work. What are the processes that I must kill to start from a clean slate, short of rebooting my Mac?

Contributor

ark3 commented Jun 5, 2018

@bourquep My apologies for not getting back to you sooner. This issue is mostly about things in the cluster (deployments, mainly), but it's possible that a rogue sshuttle-telepresence process gets left behind, which could cause networking problems on your local machine.

Can you describe the bad state you're experiencing, i.e. what are the symptoms? Do you have the telepresence log file from the crashed session? This info will help me answer your question.

Thanks for your help.

@ofpiyush
Collaborator

In our case, we often need to disconnect and reconnect wifi before restarting telepresence.

Haven't diagnosed deeper than this yet, as it usually happens in the middle of trying to get something else done.


mhworth commented Jan 6, 2019

Is there an official way to clean up after a crash or disconnection? I'm using --swap-deployment and --docker-run, which means when there's a disconnect, the telepresence deployment becomes a zombie. To restore, I do this:

  1. List all of the deployments
  2. Find the zombie one
  3. Delete it
  4. Edit the original deployment
  5. Change replicas from 0 to 1

Is there an easier way? What I would like is for telepresence to see that there is already a zombie deployment and either kill it and start a new one, or reconnect to it.

Contributor

ark3 commented Jan 7, 2019

If Telepresence crashes, it will still perform normal cleanup after you respond to the crash reporter, unless you hit Ctrl-C. Perhaps Telepresence should be more aggressive about preventing the user from breaking it forcibly, but I know in my usage (where Telepresence crashes a lot, since I'm changing the code frequently), I almost never have to clean up the cluster.

On the other hand, if you close your laptop lid or fall off the network, you'll definitely end up in a bad state where cluster cleanup will be necessary. It would be great if Telepresence automated post-hoc cleanup! In the meantime, the following two commands should cover most cases:

kubectl delete svc,deploy -l telepresence
kubectl scale --replicas=1 deploy [deployment name]


mhworth commented Jan 8, 2019

Makes sense; thanks. Those two commands are what I've been doing most of the time; I just wanted to check whether there was a blessed way to do this. The laptop-lid case is the one I'm worried about: we're using telepresence as the backbone of our development process, so this comes up pretty frequently when people close the lid before shutting down the telepresence session.

Thanks for the great tool by the way!

@tvvignesh

Currently, internet issues cause telepresence to crash, and then I have to manually delete the deployments from the cluster and run it again. Any idea how we can get telepresence to do this (if not automatically, then at least manually)? Thanks.


tenitski commented Jan 12, 2020

We do this sort of cleanup in a shell script that wraps the swap command:

echo "Cleaning up container: ${container_name}"
# Remove the docker container if it was not properly cleaned up by Telepresence
docker ps -a -q --filter name="${container_name}" | xargs -t docker rm --force
# Remove previous Telepresence proxy if still running
kubectl delete service,deployment -l "telepresence,app=${service}"

Then, when you do the actual swap, you need to set the container name so the cleanup code knows which container to watch for:

telepresence --swap-deployment --name ${container_name} ...

The cleanup happens on the next swap run, not when the swap session finishes. This is not ideal, but it's better than nothing.

@ark3 ark3 removed their assignment Jan 16, 2020
@takumakume

I think it is difficult to solve this problem within the telepresence process itself.

I developed a cleaner tool:

https://github.com/takumakume/telepolice

It solves the problem by running the cleaner as a resident process in the cluster.

Broken telepresence resources have problems with the sshd process in the container.
This tool detects them and performs something like a telepresence cleanup.

Easy to install.

  • To install as a cleaner on Kubernetes:
    kubectl apply -f https://raw.githubusercontent.com/takumakume/telepolice/master/manifests/release.yaml
  • To install the CLI tool:
    go get github.com/takumakume/telepolice

What do you think about this tool?

Contributor

ark3 commented Feb 14, 2020

@takumakume Nice job!

The proxy pod has its own idea of whether the connection is still okay. At present, it does not do anything with this information (other than logging a message). But this information could be exposed, e.g., as a file in the Pod. So your tool could check for the existence of /tmp/session-is-dead or maybe the contents of /tmp/session-status or something like that via kubectl exec. What do you think?

Similarly, it would be easy for Telepresence to mark the original deployment name and number of replicas as annotations on the proxy deployment/pod. This would make it easier for your tool to reset the original deployment. Would that be helpful?
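Purely as illustration, if Telepresence did write such a file, an external cleaner could poll it roughly like this (pod and namespace names are placeholders; no such file exists today):

import subprocess

def session_is_dead(pod, namespace):
    """Read the hypothetical /tmp/session-status file in the proxy pod via
    kubectl exec; treat a missing file or a "dead" marker as a dead session."""
    result = subprocess.run(
        ["kubectl", "exec", "-n", namespace, pod, "--",
         "cat", "/tmp/session-status"],
        capture_output=True, text=True)
    return result.returncode != 0 or result.stdout.strip() == "dead"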

@takumakume

@ark3 Thanks for the helpful comments!

The proxy pod has its own idea of whether the connection is still okay.

I think so too.

So your tool could check for the existence of /tmp/session-is-dead or maybe the contents of /tmp/session-status or something like that via kubectl exec.

I tried this.
However, I could not find the following files in the telepresence-k8s container:

  • /tmp/session-is-dead
  • /tmp/session-status

(I checked the telepresence code, but found no reference to them.)

Instead, I will use the following check method, based on the telepresence polling class:

def periodic(self):
    """Periodically query the client"""
    deferred = self.agent.request(b"HEAD", b"http://localhost:9055/")
    deferred.addCallback(self.success)
    deferred.addErrback(self.failure)

def success(self, response):
    """Client is still there"""
    if response.code == 200:
        self.log.info("Checkpoint")
    else:
        self.log.warn("Client returned code {}".format(response.code))

def failure(self, failure):
    """Client is not there"""
    self.log.error("Failed to contact Telepresence client:")
    self.log.error(failure.getErrorMessage())
    self.log.warn("Perhaps it's time to exit?")

#
# telepresence client is working
#
~ $ echo -en "HEAD / HTTP/1.1\n\n" | nc localhost 9055
HTTP/1.0 200 OK
Server: BaseHTTP/0.6 Python/3.7.6
Date: Wed, 19 Feb 2020 02:21:59 GMT

#
# telepresence client is not working
#
~ $ echo -en "HEAD / HTTP/1.1\n\n" | nc localhost 9055
~ $ echo -en "HEAD / HTTP/1.1\n\n" | nc localhost 9055
~ $ echo -en "HEAD / HTTP/1.1\n\n" | nc localhost 9055
~ $ echo -en "HEAD / HTTP/1.1\n\n" | nc localhost 9055

I am trying to perform a health check using this information source.
What do you think?
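In case it helps, the same probe run from outside the cluster via kubectl exec could look roughly like this (pod, namespace, and container names are placeholders):

import subprocess

def client_alive(pod, namespace, container="telepresence-k8s"):
    """Return True if the client-side HTTP endpoint forwarded to port 9055
    inside the proxy pod still answers."""
    probe = 'echo -en "HEAD / HTTP/1.1\\n\\n" | nc localhost 9055'
    try:
        result = subprocess.run(
            ["kubectl", "exec", "-n", namespace, "-c", container, pod,
             "--", "sh", "-c", probe],
            capture_output=True, text=True, timeout=10)
    except subprocess.TimeoutExpired:
        return False
    return "200 OK" in result.stdout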

@donnyyung
Contributor

I believe this is no longer an issue in Telepresence 2: you can use the uninstall command to remove agents, and if telepresence quits while an intercept is in place, the intercept will be removed. Here are the docs on how to install Telepresence (https://www.telepresence.io/docs/latest/install/). Please re-open if you still see this issue in our latest version!
