
Liveness and Readiness Probes #109

Closed
acsulli opened this issue Apr 12, 2018 · 6 comments

Comments


acsulli commented Apr 12, 2018

To support HA of the Trident pod, liveness and readiness probes should be defined for both the Trident and etcd containers.

wavezhang (Contributor) commented May 23, 2018

You can do this yourself (this applies to version 18.04.0; other versions need some changes).

Use the following command to generate the YAML files:

tridentctl --generate-custom-yaml

You will then see a trident-deployment.yaml under the setup directory. Add the following lines to configure a livenessProbe.

For the trident-main container:

+        livenessProbe:
+          failureThreshold: 3
+          exec:
+            command:
+            - curl
+            - 127.0.0.1:8000/trident/v1/backend
+          initialDelaySeconds: 15
+          timeoutSeconds: 10
+          periodSeconds: 3

For the etcd container:

+        livenessProbe:
+          failureThreshold: 3
+          exec:
+            command:
+            - etcdctl
+            - -endpoint=http://127.0.0.1:8001/ 
+            - cluster-health
+          initialDelaySeconds: 15
+          timeoutSeconds: 3
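The issue title also asks for readiness probes, which the snippets above don't cover. Kubernetes supports a readinessProbe with the same fields, so a sketch for the trident-main container might look like the following. This is untested; it reuses the REST endpoint from the liveness snippet above, the timing values are placeholders to tune, and the `--fail` flag is added so curl exits non-zero on HTTP error responses rather than only on connection failures.

```yaml
        readinessProbe:
          exec:
            command:
            - curl
            - --fail            # exit non-zero on HTTP 4xx/5xx, not just on connect errors
            - 127.0.0.1:8000/trident/v1/backend
          initialDelaySeconds: 10   # placeholder; tune for your environment
          periodSeconds: 5
          timeoutSeconds: 10
          failureThreshold: 3
```

Unlike a failed livenessProbe, a failed readinessProbe does not restart the container; it only removes the pod from service endpoints until the check passes again.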

clintonk (Contributor)

@wavezhang Thanks for chiming in. We're working to get liveness probes into an upcoming release. Be careful with the timeout value on the trident-main container! Most operations, including listing backends, are protected by a shared lock in Trident's core. The Kubernetes liveness probe timeout defaults to 1 second which would not be long enough if the REST call is held off by another operation in Trident. For reference, tridentctl uses a default timeout of 90 seconds.
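To illustrate the warning above, the sketch below raises timeoutSeconds well past the 1-second Kubernetes default. The 30-second value is an assumption chosen for illustration, not a recommendation from this thread; pick something suited to your environment.

```yaml
        livenessProbe:
          exec:
            command:
            - curl
            - --fail            # assumed addition: fail the probe on HTTP error responses too
            - 127.0.0.1:8000/trident/v1/backend
          initialDelaySeconds: 15
          periodSeconds: 10
          # Kubernetes defaults timeoutSeconds to 1, which can fire while the
          # REST call waits on Trident's shared core lock; 30 is an assumed
          # middle ground between that default and tridentctl's 90-second timeout.
          timeoutSeconds: 30
          failureThreshold: 3
```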

wavezhang (Contributor)

@clintonk 90 seconds is a little long for our application; can this be optimized?

clintonk (Contributor)

@wavezhang 90 seconds is a worst case that we only see during heavy stress tests on older hardware. You shouldn't see delays of more than a few seconds during typical operation. But the default of 1 second is definitely too short, since creating a Flexvol or other storage operations can take more than that. You might try something like 15 seconds to start with and watch for any probe-triggered restarts over a few days. Alternatively, if you want something really short, you can use the version API (http://127.0.0.1:8000/trident/v1/version) which isn't gated by the shared lock; that one should always return instantly, but the tradeoff is that it won't detect issues like deadlocks or hangs in the core or the lower storage management layers.
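The lock-free alternative described above would look something like the following sketch, which swaps in the version endpoint from the comment. The timing values are illustrative assumptions; the short timeout is only safe because, per the comment, this endpoint is not gated by the shared lock.

```yaml
        livenessProbe:
          exec:
            command:
            - curl
            - --fail            # assumed addition: treat HTTP errors as probe failures
            - 127.0.0.1:8000/trident/v1/version
          initialDelaySeconds: 15
          periodSeconds: 5
          timeoutSeconds: 2     # short timeout is viable since /version isn't held off by the core lock
          failureThreshold: 3
```

The tradeoff stated above still applies: this probe confirms the REST frontend is alive but won't catch deadlocks or hangs in the core or storage layers.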

wavezhang (Contributor)

@clintonk What happens if the pod restarts while there are operations running? Will everything recover after the pod restarts?

clintonk (Contributor)

@wavezhang Operations like volume creations are wrapped with transactions so they are unwound cleanly if Trident restarts before completion. And Kubernetes is an "eventually consistent" system that continually tries to make its current state consistent with the desired state. Likewise, Trident would just retry any failed operation shortly after restarting.
