
Issue with #533 #535

Closed · samof76 opened this issue Dec 5, 2022 · 9 comments

@samof76 (Contributor) commented Dec 5, 2022

@ese there seems to be an inherent issue with #533:

Before applying check-and-heal, wait for all expected pods to be up and running instead of waiting only for them to exist, to let Kubernetes controllers do their job.

Consider this scenario:

  1. The master and sentinel pods are running.
  2. The master pod and the sentinels get killed.
  3. Those pods are unable to get scheduled.

In this case, check-and-heal would not do what it is intended to do.

Consider another scenario:

  1. The master and sentinel pods are running.
  2. All of the sentinels get killed, along with one slave.
  3. The sentinels then get rescheduled, but the slave is still not scheduled.

In this case, check-and-heal would not configure the sentinels because of the fix introduced in #533.
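
For illustration, here is a minimal Go sketch (hypothetical names, not the operator's actual code) of the difference between the old "pods exist" condition and the stricter "all expected pods are Running" condition described for #533. When pods cannot be scheduled, the stricter check never passes, so check-and-heal never runs:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podsExist is the looser, pre-#533 style condition: the pods have been
// created by their controllers, even if some are still Pending/unschedulable.
func podsExist(pods []corev1.Pod, expected int) bool {
	return len(pods) >= expected
}

// allPodsRunning is the stricter, post-#533 style condition: every expected
// pod must have reached the Running phase before check-and-heal is applied.
func allPodsRunning(pods []corev1.Pod, expected int) bool {
	running := 0
	for _, p := range pods {
		if p.Status.Phase == corev1.PodRunning {
			running++
		}
	}
	return running >= expected
}

func main() {
	// One replica is stuck Pending because it cannot be scheduled.
	pods := []corev1.Pod{
		{Status: corev1.PodStatus{Phase: corev1.PodRunning}},
		{Status: corev1.PodStatus{Phase: corev1.PodPending}},
	}
	fmt.Println(podsExist(pods, 2))      // true: the pods exist
	fmt.Println(allPodsRunning(pods, 2)) // false: healing is skipped indefinitely
}
```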

@samof76 (Contributor, Author) commented Dec 7, 2022

Looks like #536 might fix this.

@ese (Member) commented Dec 7, 2022

First scenario: IMHO, if the Redis master and more than N/2 sentinels are deleted, the cluster is effectively broken until they can be scheduled and running again. Having the operator perform actions during that period is quite dangerous because, in the end, we rely on the sentinels to maintain quorum in the long term.

Second scenario: it makes sense not to wait for all Redis replicas to be available before reconciling the sentinels with the existing master.

I don't think #536 resolves this. Thanks @samof76.
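
One possible reading of the relaxed condition for the second scenario, as a rough Go sketch (hypothetical helper and threshold, not a proposed patch): sentinel reconciliation would only require a known master and a quorum of running sentinels, not every Redis replica:

```go
package main

import "fmt"

// canReconcileSentinels is a hypothetical helper illustrating the idea above:
// sentinels can be pointed at the existing master as soon as a quorum of them
// is running, regardless of how many Redis replicas are still unscheduled.
func canReconcileSentinels(runningSentinels, expectedSentinels int, masterKnown bool) bool {
	quorum := expectedSentinels/2 + 1 // majority of the expected sentinels
	return masterKnown && runningSentinels >= quorum
}

func main() {
	// Second scenario: all 3 sentinels are back, but one slave is still unscheduled.
	fmt.Println(canReconcileSentinels(3, 3, true)) // true: reconfigure the sentinels anyway
}
```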

@AndreasSko

We might have faced a scenario similar to the one described here, which resulted in an outage of our Redis cluster. We run 3 Redis and 3 Sentinel pods. In our case, some of the pods were rescheduled onto new nodes, but one Sentinel pod failed to properly terminate (the logs indicate that it shut down successfully, yet it stayed in the deployment; we are still trying to figure out what exactly happened here).
After this, the operator failed to configure the rest of the Redis and Sentinel pods and just logged:

Number of sentinel mismatch, waiting for sentinel deployment reconcile

Probably our setup would have continued to work if the redis-operator had been able to configure the remaining pods.

I will try to reproduce our scenario but wanted to already note my initial findings here 🙂

@AndreasSko

One update: I was able to reliably reproduce this issue by triggering an eviction of a Sentinel pod (for example, by allocating a huge file via fallocate -l 100G big.file and forcing an eviction by the kubelet). In this case there will be one Completed Sentinel pod, and the redis-operator will wait "for sentinel deployment reconcile". If the other Sentinel pods are restarted in the meantime, the whole cluster falls apart, as nothing is getting configured anymore.
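
A guess at the mechanism behind this behaviour, as a Go sketch (hypothetical, not the operator's actual code): if the check simply compares the number of sentinel pods it sees against the expected replica count, a leftover pod in the Completed (Succeeded) phase keeps the counts mismatched, so the operator keeps waiting for the deployment to reconcile:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// sentinelCountMatches is a hypothetical version of the kind of check that
// could log "Number of sentinel mismatch, waiting for sentinel deployment
// reconcile": it compares the sentinel pods it sees against the expected
// replica count. An evicted pod left behind in the Succeeded ("Completed")
// phase keeps the two numbers from ever matching.
func sentinelCountMatches(pods []corev1.Pod, expectedReplicas int) bool {
	return len(pods) == expectedReplicas
}

func main() {
	// 3 running sentinels plus one evicted pod stuck in Completed.
	pods := []corev1.Pod{
		{Status: corev1.PodStatus{Phase: corev1.PodRunning}},
		{Status: corev1.PodStatus{Phase: corev1.PodRunning}},
		{Status: corev1.PodStatus{Phase: corev1.PodRunning}},
		{Status: corev1.PodStatus{Phase: corev1.PodSucceeded}},
	}
	fmt.Println(sentinelCountMatches(pods, 3)) // false: reconcile is postponed indefinitely
}
```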

@samof76 (Contributor, Author) commented Jan 1, 2023

@AndreasSko were you able to reproduce this with the latest operator version?

@AndreasSko

Yes, we are running v1.2.4

@github-actions

This issue is stale because it has been open for 45 days with no activity.

github-actions bot added the stale label Feb 16, 2023
github-actions bot commented Mar 2, 2023

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as not planned Mar 2, 2023
@AndreasSko

Unfortunately, the issue is still happening with our system and has resulted in a couple of outages 😅 Would it be possible to re-open it?
