
Issue with #533 #535

Closed · samof76 opened this issue Dec 5, 2022 · 9 comments

@samof76 (Contributor) commented Dec 5, 2022

@ese there seems to be an inherent issue with #533:

Before applying check-and-heal, wait for all expected pods to be up and running instead of waiting only for them to exist, to let Kubernetes controllers do their job.

Consider this scenario:

  1. The master and sentinel pods are running.
  2. The master pod and the sentinels get killed.
  3. Those pods are unable to get scheduled.

In this case, check-and-heal would not do what it is intended to do.

Consider another scenario:

  1. The master and sentinel pods are running.
  2. All of the sentinels get killed, along with one slave.
  3. The sentinels then get rescheduled, but the slave is still not scheduled.

In this case, check-and-heal would not configure the sentinels because of the fix introduced in #533.
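
For illustration, here is a minimal Go sketch (hypothetical names, not the operator's actual code) of the difference between the old "pods exist" condition and the stricter "all expected pods are Running" condition described for #533. When pods cannot be scheduled, the stricter check never passes, so check-and-heal never runs:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podsExist is the looser, pre-#533 style condition: the pods have been
// created by their controllers, even if some are still Pending/unschedulable.
func podsExist(pods []corev1.Pod, expected int) bool {
	return len(pods) >= expected
}

// allPodsRunning is the stricter, post-#533 style condition: every expected
// pod must have reached the Running phase before check-and-heal is applied.
func allPodsRunning(pods []corev1.Pod, expected int) bool {
	running := 0
	for _, p := range pods {
		if p.Status.Phase == corev1.PodRunning {
			running++
		}
	}
	return running >= expected
}

func main() {
	// One replica is stuck Pending because it cannot be scheduled.
	pods := []corev1.Pod{
		{Status: corev1.PodStatus{Phase: corev1.PodRunning}},
		{Status: corev1.PodStatus{Phase: corev1.PodPending}},
	}
	fmt.Println(podsExist(pods, 2))      // true: the pods exist
	fmt.Println(allPodsRunning(pods, 2)) // false: healing is skipped indefinitely
}
```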

@samof76 (Contributor, Author) commented Dec 7, 2022

Looks like #536 might fix this.

@ese (Member) commented Dec 7, 2022

First scenario: IMHO, if the Redis master and more than N/2 sentinels are deleted, the cluster is effectively broken until they can be scheduled and running again. Having the operator perform actions during that period is quite dangerous because, in the end, we rely on the sentinels to maintain quorum in the long term.

Second scenario: it makes sense not to wait for all Redis replicas to be available before reconciling the sentinels with the existing master.

I don't think #536 resolves this. Thanks @samof76.
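
One possible reading of the relaxed condition for the second scenario, as a rough Go sketch (hypothetical helper and threshold, not a proposed patch): sentinel reconciliation would only require a known master and a quorum of running sentinels, not every Redis replica:

```go
package main

import "fmt"

// canReconcileSentinels is a hypothetical helper illustrating the idea above:
// sentinels can be pointed at the existing master as soon as a quorum of them
// is running, regardless of how many Redis replicas are still unscheduled.
func canReconcileSentinels(runningSentinels, expectedSentinels int, masterKnown bool) bool {
	quorum := expectedSentinels/2 + 1 // majority of the expected sentinels
	return masterKnown && runningSentinels >= quorum
}

func main() {
	// Second scenario: all 3 sentinels are back, but one slave is still unscheduled.
	fmt.Println(canReconcileSentinels(3, 3, true)) // true: reconfigure the sentinels anyway
}
```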

@AndreasSko

We might have faced a scenario similar to the one described here, which resulted in an outage of our Redis cluster. We run 3 Redis and 3 Sentinel pods. In our case, some of the pods were rescheduled onto new nodes, but one Sentinel pod failed to properly terminate (the logs indicate that it shut down successfully, yet it stayed in the deployment; we are still trying to figure out what exactly happened here).
After this, the operator failed to configure the rest of the Redis and Sentinel pods and just logged:

Number of sentinel mismatch, waiting for sentinel deployment reconcile

Probably our setup would have continued to work if the redis-operator had been able to configure the remaining pods.

I will try to reproduce our scenario but wanted to already note my initial findings here 🙂

@AndreasSko

One update: I was able to reliably reproduce this issue by triggering an eviction of a Sentinel pod (for example, by allocating a huge file via fallocate -l 100G big.file and forcing an eviction by the kubelet). In this case there will be one Completed Sentinel pod, and the redis-operator will wait "for sentinel deployment reconcile". If the other Sentinel pods are restarted in the meantime, the whole cluster falls apart, as nothing is getting configured anymore.
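
A guess at the mechanism behind this behaviour, as a Go sketch (hypothetical, not the operator's actual code): if the check simply compares the number of sentinel pods it sees against the expected replica count, a leftover pod in the Completed (Succeeded) phase keeps the counts mismatched, so the operator keeps waiting for the deployment to reconcile:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// sentinelCountMatches is a hypothetical version of the kind of check that
// could log "Number of sentinel mismatch, waiting for sentinel deployment
// reconcile": it compares the sentinel pods it sees against the expected
// replica count. An evicted pod left behind in the Succeeded ("Completed")
// phase keeps the two numbers from ever matching.
func sentinelCountMatches(pods []corev1.Pod, expectedReplicas int) bool {
	return len(pods) == expectedReplicas
}

func main() {
	// 3 running sentinels plus one evicted pod stuck in Completed.
	pods := []corev1.Pod{
		{Status: corev1.PodStatus{Phase: corev1.PodRunning}},
		{Status: corev1.PodStatus{Phase: corev1.PodRunning}},
		{Status: corev1.PodStatus{Phase: corev1.PodRunning}},
		{Status: corev1.PodStatus{Phase: corev1.PodSucceeded}},
	}
	fmt.Println(sentinelCountMatches(pods, 3)) // false: reconcile is postponed indefinitely
}
```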

@samof76 (Contributor, Author) commented Jan 1, 2023

@AndreasSko were you able to reproduce this with the latest operator version?

@AndreasSko

Yes, we are running v1.2.4

@github-actions

This issue is stale because it has been open for 45 days with no activity.

github-actions bot added the stale label Feb 16, 2023
github-actions bot commented Mar 2, 2023

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as not planned Mar 2, 2023
@AndreasSko

Unfortunately, the issue is still happening with our system and has resulted in a couple of outages 😅 Would it be possible to re-open it?
