Cluster can't recover after losing too many pods at once. #21

Closed
adamresson opened this issue Feb 1, 2018 · 4 comments

@adamresson

I had an issue today where I was running the redis operator on some preemptible nodes.

It looks like two of the nodes went offline simultaneously, and each of them had a sentinel and a redis server (one of which was the master).

Here's a quick look at the pods:

rfr-redis-0                            0/1       Init:0/1           0          1d
rfr-redis-1                            1/1       Running            0          3h
rfr-redis-2                            0/1       Init:0/1           0          3h
rfs-redis-6d89765fd6-8vcsj             0/1       Init:0/1           0          1d
rfs-redis-6d89765fd6-gmvd4             0/1       Init:0/1           0          1d
rfs-redis-6d89765fd6-zkkmd             1/1       Running            0          1d

Here's the log from the one sentinel running at the time:

1:X 01 Feb 17:40:08.235 # +sdown master mymaster 10.28.7.5 6379
1:X 01 Feb 17:40:08.235 # +sdown sentinel e052de8bf770601397f56bca3145a37db8321b59 10.28.7.4 26379 @ mymaster 10.28.7.5 6379
1:X 01 Feb 17:40:08.453 # +sdown slave 10.28.6.8:6379 10.28.6.8 6379 @ mymaster 10.28.7.5 6379
1:X 01 Feb 17:40:08.454 # +sdown sentinel 1a2db190705a578efe4011cfa5039885230a3344 10.28.6.5 26379 @ mymaster 10.28.7.5 6379

All the init containers for the 4 pods that were trying to come back up looked like this:

Attempting to connect to Master at 10.28.7.5...
Could not connect to Redis at 10.28.7.5:6379: Operation timed out
FAILED
Wating to retry
Attempting to connect to Sentinel service to retrieve current master...
Got '"10.28.7.5"'
OK
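
For what it's worth, that retry loop looks like it asks the Sentinel service for the current master and then tries to reach it, roughly along these lines (the rfs-redis service name is a guess based on the pod names; the real script is the operator's own):

# Illustration only: resolve the master through Sentinel, then try to reach it.
MASTER_IP=$(redis-cli -h rfs-redis -p 26379 SENTINEL get-master-addr-by-name mymaster | head -n 1)
redis-cli -h "$MASTER_IP" -p 6379 PING

Since no failover ever happened, Sentinel kept handing out the stale 10.28.7.5 address, so the init containers could never get past that check.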

So basically, I think:

The sentinel that still existed couldn't find a quorum to elect a new master.
The new sentinels were waiting to come up since they couldn't talk to the master.
The one slave didn't fail over.

I think in this case, where everything was locked up, manually calling SENTINEL failover mymaster would have fixed things (I came across the command afterwards, so I didn't get to test it out).
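
For reference, against the one Sentinel pod that was still running in the listing above, forcing the failover would look something like this (assuming redis-cli is available in that container):

kubectl exec -it rfs-redis-6d89765fd6-zkkmd -- redis-cli -p 26379 SENTINEL failover mymaster

SENTINEL failover forces a failover without asking the other Sentinels for agreement, which is why it can unstick a cluster that has lost its quorum.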

@miles-

miles- commented Feb 1, 2018

We've run into this scenario a couple of times during upgrades to our GKE clusters. The fix was, as you mentioned, calling SENTINEL failover mymaster to force a failover.

@adamresson
Author

I wonder if this is something that the controller could do if it realized that the cluster was in a bad enough state for a long enough amount of time.
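
Purely as a sketch of that idea (the names, thresholds and approach below are illustrative, not the operator's actual behaviour): the controller could track how long the master reported by Sentinel has been unreachable and, past some grace period, force a failover itself.

# Hypothetical watch loop; the rfs-redis host and the 120s grace period are assumptions.
SENTINEL_HOST=rfs-redis
GRACE_SECONDS=120
down_since=""
while true; do
  master_ip=$(redis-cli -h "$SENTINEL_HOST" -p 26379 SENTINEL get-master-addr-by-name mymaster | head -n 1)
  if timeout 5 redis-cli -h "$master_ip" -p 6379 PING | grep -q PONG; then
    down_since=""                                  # master is healthy, reset the timer
  else
    now=$(date +%s)
    down_since=${down_since:-$now}                 # remember when it first looked down
    if [ $((now - down_since)) -ge "$GRACE_SECONDS" ]; then
      # Same escape hatch as above: force a failover even without Sentinel quorum.
      redis-cli -h "$SENTINEL_HOST" -p 26379 SENTINEL failover mymaster
      down_since=""
    fi
  fi
  sleep 10
done

A real implementation would live in the controller itself rather than in a shell loop, but the decision it has to make is the same: how long to wait before overriding Sentinel.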

@jchanam
Collaborator

jchanam commented Feb 9, 2018

Hi!

That's a known issue we've been thinking about deeply, as we've seen it a few times, especially when updating the clusters, as @miles- has mentioned.

As the implementation in the controller is very delicate (in order not to lose data), we made some improvements to the resilience of the redises and sentinels by adding an AntiAffinity (#14).

We're now preparing a library to help create CRDs and Operators, and we're working on moving this operator to that new approach. Once that's done, as it'll be easier to maintain resources, we'll focus on this problem.

If you have suggestions on how the controller should approach this problem and what steps to follow, we'll be very happy to hear them so we can build a much better application.

@ese
Member

ese commented Mar 19, 2018

I think this is resolved in 0.2. Feel free to reopen if it's needed.

ese closed this as completed Mar 19, 2018