Cluster can't recover after losing too many pods at once. #21

Closed
adamresson opened this issue Feb 1, 2018 · 4 comments

@adamresson

I had an issue today where I was running the redis operator on some preemptible nodes.

It looks like two of the nodes went offline simultaneously, and each of them had a sentinel and a redis server (one of which was the master).

Here's a quick look at the pods:

rfr-redis-0                            0/1       Init:0/1           0          1d
rfr-redis-1                            1/1       Running            0          3h
rfr-redis-2                            0/1       Init:0/1           0          3h
rfs-redis-6d89765fd6-8vcsj             0/1       Init:0/1           0          1d
rfs-redis-6d89765fd6-gmvd4             0/1       Init:0/1           0          1d
rfs-redis-6d89765fd6-zkkmd             1/1       Running            0          1d

Here's the log from the one sentinel running at the time:

1:X 01 Feb 17:40:08.235 # +sdown master mymaster 10.28.7.5 6379
1:X 01 Feb 17:40:08.235 # +sdown sentinel e052de8bf770601397f56bca3145a37db8321b59 10.28.7.4 26379 @ mymaster 10.28.7.5 6379
1:X 01 Feb 17:40:08.453 # +sdown slave 10.28.6.8:6379 10.28.6.8 6379 @ mymaster 10.28.7.5 6379
1:X 01 Feb 17:40:08.454 # +sdown sentinel 1a2db190705a578efe4011cfa5039885230a3344 10.28.6.5 26379 @ mymaster 10.28.7.5 6379

All the init containers for the 4 pods that were trying to come back up looked like this:

Attempting to connect to Master at 10.28.7.5...
Could not connect to Redis at 10.28.7.5:6379: Operation timed out
FAILED
Wating to retry
Attempting to connect to Sentinel service to retrieve current master...
Got '"10.28.7.5"'
OK
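
For what it's worth, that retry loop looks like it asks the Sentinel service for the current master and then tries to reach it, roughly along these lines (the rfs-redis service name is a guess based on the pod names; the real script is the operator's own):

# Illustration only: resolve the master through Sentinel, then try to reach it.
MASTER_IP=$(redis-cli -h rfs-redis -p 26379 SENTINEL get-master-addr-by-name mymaster | head -n 1)
redis-cli -h "$MASTER_IP" -p 6379 PING

Since no failover ever happened, Sentinel kept handing out the stale 10.28.7.5 address, so the init containers could never get past that check.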

So basically, I think:

The sentinel that still existed couldn't find a quorum to elect a new master.
The new sentinels were waiting to come up since they couldn't talk to the master.
The one slave didn't fail over.

I think in this case, where everything was locked up, manually calling SENTINEL failover mymaster would have fixed things (I came across the command afterwards, so I didn't get to test it out).
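
For reference, against the one Sentinel pod that was still running in the listing above, forcing the failover would look something like this (assuming redis-cli is available in that container):

kubectl exec -it rfs-redis-6d89765fd6-zkkmd -- redis-cli -p 26379 SENTINEL failover mymaster

SENTINEL failover forces a failover without asking the other Sentinels for agreement, which is why it can unstick a cluster that has lost its quorum.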

@miles-

miles- commented Feb 1, 2018

We've run into this scenario a couple of times during upgrades to our GKE clusters. The fix was, as you mentioned, calling SENTINEL failover mymaster to force a failover.

@adamresson
Author

I wonder if this is something that the controller could do if it realized that the cluster was in a bad enough state for a long enough amount of time.
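
Purely as a sketch of that idea (the names, thresholds and approach below are illustrative, not the operator's actual behaviour): the controller could track how long the master reported by Sentinel has been unreachable and, past some grace period, force a failover itself.

# Hypothetical watch loop; the rfs-redis host and the 120s grace period are assumptions.
SENTINEL_HOST=rfs-redis
GRACE_SECONDS=120
down_since=""
while true; do
  master_ip=$(redis-cli -h "$SENTINEL_HOST" -p 26379 SENTINEL get-master-addr-by-name mymaster | head -n 1)
  if timeout 5 redis-cli -h "$master_ip" -p 6379 PING | grep -q PONG; then
    down_since=""                                  # master is healthy, reset the timer
  else
    now=$(date +%s)
    down_since=${down_since:-$now}                 # remember when it first looked down
    if [ $((now - down_since)) -ge "$GRACE_SECONDS" ]; then
      # Same escape hatch as above: force a failover even without Sentinel quorum.
      redis-cli -h "$SENTINEL_HOST" -p 26379 SENTINEL failover mymaster
      down_since=""
    fi
  fi
  sleep 10
done

A real implementation would live in the controller itself rather than in a shell loop, but the decision it has to make is the same: how long to wait before overriding Sentinel.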

@jchanam
Collaborator

jchanam commented Feb 9, 2018

Hi!

That's a known issue we've been thinking about deeply, as we've seen it a few times, especially when updating the clusters, as @miles- has mentioned.

As the implementation in the controller is very delicate (in order not to lose data), we made some improvements to the resilience of the redises and sentinels by adding an AntiAffinity (#14).

We're now preparing a library to help create CRDs and Operators, and we're working on moving this operator to that new approach. Once that's done, as it'll be easier to maintain resources, we'll focus on this problem.

If you have suggestions on how the controller should approach this problem and what steps to follow, we'll be very happy to hear them so we can build a much better application.

@ese
Member

ese commented Mar 19, 2018

I think this is resolved in 0.2. Feel free to reopen if it's needed.

ese closed this as completed Mar 19, 2018