Cluster can't recover after losing too many pods at once. #21
Comments
We've run into this scenario a couple of times during upgrades to our GKE clusters. The fix was, as you mentioned, calling `SENTINEL failover mymaster` manually.
I wonder if this is something the controller could do itself if it noticed the cluster had been in a bad enough state for long enough.
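For illustration only, here's a rough sketch of what that automation could look like as an external watchdog (not something the operator provides today): it polls the surviving sentinel's view of the master and forces a failover if the master has been reported down past a threshold. The pod name, namespace, and threshold below are placeholders; 26379 is the default sentinel port.

```sh
#!/usr/bin/env bash
# Hypothetical watchdog sketch, NOT part of the operator: if the sentinel
# reports the master as down for longer than a threshold, force a failover.
SENTINEL_POD="redis-sentinel-0"   # placeholder sentinel pod name
NAMESPACE="default"               # placeholder namespace
MASTER="mymaster"
THRESHOLD=300                     # seconds before forcing a failover
DOWN_SINCE=0

while sleep 10; do
  # The "flags" field contains s_down/o_down when sentinel considers the master down.
  FLAGS=$(kubectl exec -n "$NAMESPACE" "$SENTINEL_POD" -- \
            redis-cli -p 26379 SENTINEL master "$MASTER" \
          | grep -A1 -w flags | tail -n 1)

  if [[ "$FLAGS" == *down* ]]; then
    [[ "$DOWN_SINCE" -eq 0 ]] && DOWN_SINCE=$(date +%s)
    if (( $(date +%s) - DOWN_SINCE > THRESHOLD )); then
      echo "master down for more than ${THRESHOLD}s, forcing failover"
      # SENTINEL failover forces a failover without requiring agreement
      # from the other sentinels, which is the locked-up case described here.
      kubectl exec -n "$NAMESPACE" "$SENTINEL_POD" -- \
        redis-cli -p 26379 SENTINEL failover "$MASTER"
      DOWN_SINCE=0
    fi
  else
    DOWN_SINCE=0
  fi
done
```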
Hi! That's a known issue we've been thinking about deeply, as we've seen it a few times, especially when updating the clusters as @miles- has mentioned. Because the implementation on the controller is very delicate (in order not to lose data), we made some improvements to the resilience of the redises and sentinels by adding an AntiAffinity #14.
We're now preparing a library to help create CRDs and Operators, and we're working on moving this project to that new concept. Once that's done, as it'll be easier to maintain resources, we'll focus on this problem. If you have suggestions on how the controller should approach it and what steps to follow, we'll be very happy to hear them so we can build a much better application.
I think this is resolved in 0.2. Feel free to reopen if it's needed.
I had an issue today where I was running the redis operator on some preemptible nodes.
It looks like two of the nodes went offline simultaneously, and each of them had a sentinel and a redis server (one of which was the master).
Here's a quick look at the pods:
Here's the log from the one sentinel that was still running at the time:
All the init containers for the 4 pods that were trying to come back up looked like this:
So basically, I think:
- The sentinel that still existed couldn't reach a quorum to elect a new master.
- The new sentinels were stuck waiting to come up since they couldn't talk to the master.
- The one remaining slave didn't fail over.
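For anyone debugging a similar lockup, here is a hedged sketch of how that state could be confirmed from the surviving sentinel pod. The pod name and namespace are placeholders; 26379 is the default sentinel port.

```sh
# Diagnostic sketch; pod name and namespace are placeholders.
# Ask the surviving sentinel what it thinks about the master and its peers.
kubectl exec -n default redis-sentinel-0 -- \
  redis-cli -p 26379 SENTINEL master mymaster      # "flags" should show s_down/o_down
kubectl exec -n default redis-sentinel-0 -- \
  redis-cli -p 26379 SENTINEL ckquorum mymaster    # errors if quorum can't be reached
kubectl exec -n default redis-sentinel-0 -- \
  redis-cli -p 26379 SENTINEL sentinels mymaster   # how many peer sentinels it still knows about
```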
I think in this case, where everything was locked up, manually calling
`SENTINEL failover mymaster`
would have fixed things (I came across the command afterwards, so didn't get to test it out).
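In case it helps others hitting the same lockup, a minimal sketch of that manual recovery, assuming you can exec into the surviving sentinel pod (pod name and namespace are placeholders):

```sh
# Manual recovery sketch; pod name and namespace are placeholders.
# SENTINEL failover forces a failover without requiring agreement from the
# other sentinels, which matches the locked-up state described above.
kubectl exec -n default redis-sentinel-0 -- \
  redis-cli -p 26379 SENTINEL failover mymaster

# Afterwards, verify that a new master was promoted:
kubectl exec -n default redis-sentinel-0 -- \
  redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
```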