When telepresence crashes, previous state is not restored #260
Comments
Here's a related idea (from @james-atwill-hs in Gitter chat):
What's the status of work on this? So far telepresence has shown to be really bad at cleaning up after itself.
Plan:
Whenever I get into this state, running Telepresence again doesn't work. What are the processes that I must kill to start from a clean slate, short of rebooting my Mac?
@bourquep My apologies for not getting back to you sooner. The issue is mostly talking about things in the cluster (deployments, mainly), but it's possible that a rogue local process is left behind as well. Can you describe the bad state you're experiencing, i.e. what are the symptoms? Do you have the telepresence log file from the crashed session? This info will help me answer your question. Thanks for your help.
In our case, we often need to disconnect and reconnect wifi before restarting telepresence. We haven't diagnosed it more deeply than this yet, as it usually happens in the middle of trying to get something else done.
Is there an official way to clean up after a crash or disconnection? I'm using --swap-deployment and --docker-run, which means when there's a disconnect, the telepresence deployment becomes a zombie. To restore, I do this:
Is there an easier way? What I would like is for telepresence to see that there is already a zombie deployment and either kill it and start up a new one, or reconnect to it.
If Telepresence crashes, it will still perform normal cleanup after you respond to the crash reporter, unless you hit Ctrl-C. Perhaps Telepresence should be more aggressive about preventing the user from breaking it forcibly, but I know in my usage (where Telepresence crashes a lot, since I'm changing the code frequently), I almost never have to clean up the cluster. On the other hand, if you close your laptop lid or fall off the network, you'll definitely end up in a bad state where cluster cleanup will be necessary. It would be great if Telepresence automated post-hoc cleanup! In the meantime, the following two commands should cover most cases:
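A minimal sketch of cleanup along those lines, assuming the default telepresence label on the proxy resources and a swapped deployment named myservice (the deployment name and replica count are placeholders; restore whatever replica count your Deployment actually had):

# Delete the telepresence proxy deployment and service left behind by the crash
kubectl delete deployment,service -l telepresence
# Scale the original (swapped) deployment back up
kubectl scale deployment myservice --replicas=1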
Makes sense; thanks. Those last two commands have been what I am doing most of the time. Just wanted to check and see if there was a blessed way to do this. The laptop lid use case is the one I'm worried about; we're using telepresence as the backbone for our development process, so this comes up pretty frequently when people close the lid before shutting down the telepresence session. Thanks for the great tool, by the way!
Currently, internet issues cause telepresence to crash, and then I have to manually go and delete the deployments from the cluster and run it again. Any idea how we can get telepresence to do this (if not automatically, then at least manually)? Thanks.
We do this sort of cleanup in a shell script which wraps the swap command:

echo "Cleaning up container: ${container_name}"
# Remove the docker container if it was not properly handled by Telepresence
docker ps -a -q --filter name="${container_name}" | xargs -t docker rm --force
# Remove the previous Telepresence proxy if it is still running
kubectl delete service,deployment -l "telepresence,app=${service}"

Then when you do the actual swap you need to set the container name so the cleanup code knows which container to watch for:
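A sketch of what that swap invocation could look like, reusing the variables from the script above (the image name is a placeholder, not from the original comment):

# Swap the deployment and give the local docker container a known name for later cleanup
telepresence --swap-deployment "${service}" \
  --docker-run --rm -it --name "${container_name}" my-dev-image:latest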
The cleanup happens on the next swap run, not after the swap session has finished. This is not ideal, but better than nothing.
I thought that it would be difficult to solve this problem inside the telepresence process itself, so I developed a cleaner tool: https://github.com/takumakume/telepolice It solves the problem by running the cleaner as a resident process that detects broken telepresence resources and cleans them up. It is easy to install.
What do you think about this tool?
@takumakume Nice job! The proxy pod has its own idea of whether the connection is still okay. At present, it does not do anything with this information (other than logging a message). But this information could be exposed, e.g., as a file in the Pod. So your tool could check for the existence of that file. Similarly, it would be easy for Telepresence to mark the original deployment name and number of replicas as annotations on the proxy deployment/pod. This would make it easier for your tool to reset the original deployment. Would that be helpful?
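A sketch of how a cleanup tool could consume those two proposals; the status file path and the annotation keys below are hypothetical, since neither is implemented in Telepresence:

# Hypothetical: check whether the proxy pod still sees a live client connection
kubectl exec "${PROXY_POD}" -- test -f /run/telepresence/connected && echo "client still connected"
# Hypothetical: read annotations recording the original deployment name and replica count
kubectl get deployment "${PROXY_DEPLOYMENT}" \
  -o jsonpath='{.metadata.annotations.telepresence-original-deployment} {.metadata.annotations.telepresence-original-replicas}'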
@ark3 Thanks for the helpful comments!
I think the same.
I tried it, but when I checked the telepresence code, no such file is written yet, so I will use the following check method instead (telepresence/k8s-proxy/periodic.py, lines 44 to 61 in 0fb4b14):
#
# telepresence client is working
#
~ $ echo -en "HEAD / HTTP/1.1\n\n" | nc localhost 9055
HTTP/1.0 200 OK
Server: BaseHTTP/0.6 Python/3.7.6
Date: Wed, 19 Feb 2020 02:21:59 GMT
#
# telepresence client is not working
#
~ $ echo -en "HEAD / HTTP/1.1\n\n" | nc localhost 9055
~ $ echo -en "HEAD / HTTP/1.1\n\n" | nc localhost 9055
~ $ echo -en "HEAD / HTTP/1.1\n\n" | nc localhost 9055
~ $ echo -en "HEAD / HTTP/1.1\n\n" | nc localhost 9055
I am trying to perform a health check using this information source.
I believe this is no longer an issue in Telepresence 2, since you can use the
If the Telepresence client crashes, --swap-deployment won't restore the original Deployment, and the Deployment created by --new-deployment won't be deleted.

Implementation idea: have the k8s-proxy pod clean itself up after some timeout (1 minute?) of the client being disconnected. I believe the necessary Kubernetes API credentials are available in /var/run/secrets/kubernetes.io, the "service account" (https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/).

This may not work in all cases, e.g. in OpenShift or similar Kubernetes configurations where the pod may not necessarily have permission to access the API server. So the client should probably still do cleanup by default, even if this is done.
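A rough sketch of what that in-pod self-cleanup could look like, assuming the mounted service account has RBAC permission to delete the proxy's own Deployment (the deployment name variable is a placeholder):

# Standard paths for the mounted service account credentials
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
NAMESPACE=$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# Ask the API server to delete this proxy's Deployment after the client times out
curl --cacert "${CACERT}" -H "Authorization: Bearer ${TOKEN}" -X DELETE \
  "https://kubernetes.default.svc/apis/apps/v1/namespaces/${NAMESPACE}/deployments/${PROXY_DEPLOYMENT_NAME}"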