This repository has been archived by the owner on Oct 20, 2022. It is now read-only.

[BUG] Pod disruption budget's MaxPodUnavailable gets changed unexpectedly when reading stale information from apiserver #321

Open
srteam2020 opened this issue Apr 28, 2021 · 4 comments


@srteam2020
Contributor

Bug Report

We find that after a restart the controller can accidentally invoke DeletePodDisruptionBudget and then CreatePodDisruptionBudget during ensureCassandraPodDisruptionBudget, which unexpectedly decreases the PDB's MaxPodUnavailable. If the user had increased MaxPodUnavailable to force the operator to keep making progress while multiple disruptions are ongoing, this unexpected decrease leaves the operator unable to take any action, including creating or updating the cluster's statefulset.

The root cause is stale data served by the apiserver: the controller sees a "desired" MaxPodUnavailable (from the stale CassandraCluster spec) that differs from the MaxPodUnavailable in the current PodDisruptionBudget, so it recreates the PodDisruptionBudget and resets MaxPodUnavailable to the stale value.
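For illustration, a minimal Go sketch of this pattern (not the actual casskop code; the function signature and parameter names are illustrative) could look like the following. The PDB is deleted and recreated whenever the value taken from the possibly stale CassandraCluster spec differs from the live PDB, with no check that the CassandraCluster read itself is up to date:

```go
package example

import (
	"context"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensurePDB mirrors the delete-then-recreate pattern described above.
// desiredMaxUnavailable comes from the CassandraCluster spec, which may be stale.
func ensurePDB(ctx context.Context, c client.Client, current *policyv1beta1.PodDisruptionBudget, desiredMaxUnavailable int) error {
	if current.Spec.MaxUnavailable != nil && current.Spec.MaxUnavailable.IntValue() == desiredMaxUnavailable {
		return nil // values already match, nothing to do
	}
	// Mismatch: delete the existing PDB and recreate it with the "desired" value.
	// If the spec read was stale, this silently rolls MaxPodUnavailable back.
	if err := c.Delete(ctx, current); err != nil {
		return err
	}
	recreated := current.DeepCopy()
	recreated.ResourceVersion = "" // must be empty on create
	desired := intstr.FromInt(desiredMaxUnavailable)
	recreated.Spec.MaxUnavailable = &desired
	return c.Create(ctx, recreated)
}
```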

One potential fix is to save a copy of the CassandraCluster's ResourceVersion inside the PDB when the PDB is created. Before recreating the PDB, the controller would compare the current CassandraCluster's ResourceVersion with the one stored in the PDB to check that the former is newer. If it is not, the CassandraCluster read is stale and the PDB should not be updated.

What did you do?
Concrete steps that led to this issue (in an HA k8s cluster):

  1. Create a CassandraCluster cassandra-cluster with MaxPodUnavailable set to 1. The controller is talking to apiserver1. At this point the operator also creates the corresponding PDB with MaxPodUnavailable set to 1.
  2. Resize the MaxPodUnavailable of cassandra-cluster to 4. The controller finds that the desired MaxPodUnavailable differs from the current one, so it deletes the current PDB and recreates it with MaxPodUnavailable set to 4. Meanwhile, apiserver2 remains stuck at a view where cassandra-cluster's MaxPodUnavailable is still 1 (because it is running slowly or has network issues).
  3. The controller restarts after a node failure and now talks to the stale apiserver2. Reading the stale cassandra-cluster from apiserver2, the controller believes the "desired" MaxPodUnavailable is still 1, which differs from the current value (4) in the PDB, so it deletes the PDB and recreates it with MaxPodUnavailable set to 1.

What did you expect to see?
The PDB should not be affected when the controller reads stale information about cassandra-cluster.

What did you see instead? Under which circumstances?
The PDB gets recreated with a smaller MaxPodUnavailable, which can block user operations if disruptions are ongoing, as described above.

  • casskop version:
    f87c8e0 (master branch)

  • Kubernetes version information:
    1.18.9

  • Cassandra version:
    3.11

Possible Solution
We are willing to help fix this bug by issuing a PR.
As mentioned above, we can attach the CassandraCluster's ResourceVersion to the PDB. The operator can then tell that the "desired" MaxPodUnavailable shown in the CassandraCluster spec is stale by checking whether the CassandraCluster's ResourceVersion (treated as an incremental ID) is older than the one attached to the PDB. In that stale-read case, the operator should not recreate the PDB but return with an error.
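A hypothetical sketch of this check (the annotation key and helper names are illustrative, and treating ResourceVersion as a numeric, incremental ID follows the proposal above rather than any Kubernetes API guarantee):

```go
package example

import (
	"fmt"
	"strconv"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical annotation key remembering which CassandraCluster version the PDB was built from.
const lastSeenCCResourceVersion = "casskop.example/cassandracluster-resourceversion"

// stampPDB records the CassandraCluster's ResourceVersion on the PDB when the PDB is created.
func stampPDB(pdb *policyv1beta1.PodDisruptionBudget, cc metav1.Object) {
	if pdb.Annotations == nil {
		pdb.Annotations = map[string]string{}
	}
	pdb.Annotations[lastSeenCCResourceVersion] = cc.GetResourceVersion()
}

// allowPDBRecreate returns an error when the CassandraCluster we just read is
// older than the one the existing PDB was built from, i.e. a stale read.
func allowPDBRecreate(pdb *policyv1beta1.PodDisruptionBudget, cc metav1.Object) error {
	stored, ok := pdb.Annotations[lastSeenCCResourceVersion]
	if !ok {
		return nil // no record yet, allow the recreate
	}
	old, errOld := strconv.ParseUint(stored, 10, 64)
	cur, errCur := strconv.ParseUint(cc.GetResourceVersion(), 10, 64)
	if errOld != nil || errCur != nil {
		return nil // cannot compare, fall back to the existing behaviour
	}
	if cur < old {
		return fmt.Errorf("stale CassandraCluster read (resourceVersion %d < %d), refusing to recreate the PDB", cur, old)
	}
	return nil
}
```

In this sketch, stampPDB would be called wherever the PDB is created, and allowPDBRecreate right before the delete/recreate path in ensureCassandraPodDisruptionBudget.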

@cscetbon
Contributor

@srteam2020 don't hesitate to send a PR, then we can start talking about those changes.

@srteam2020
Contributor Author

srteam2020 commented Jul 11, 2021

@cscetbon
Thanks for the reply! Sorry for the delay; we are working on a fix for this issue.

@srteam2020
Contributor Author

Sorry for the delay. We sent a fix for this problem here: #355
It seems that the fix breaks some unit tests. We will try to find a better way to gracefully solve the problem without introducing test failures.

@cscetbon
Contributor

@srteam2020 it also must keep working for existing clusters. So if the change could prevent an existing cluster from working, it needs to be activated via a parameter or another non-disruptive solution.
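One hypothetical way to do that, building on the sketch in the Possible Solution section above (the flag name is illustrative and not an existing casskop option), is to gate the new check behind an opt-in parameter so that existing clusters keep today's behaviour by default:

```go
// OperatorConfig is an illustrative operator-level setting, not an existing casskop option.
type OperatorConfig struct {
	// CheckStalePDBRecreate enables the ResourceVersion comparison before the
	// PDB is deleted and recreated. Defaults to false for backward compatibility.
	CheckStalePDBRecreate bool
}

// shouldRecreatePDB wraps allowPDBRecreate (from the earlier sketch) behind the flag.
func shouldRecreatePDB(cfg OperatorConfig, pdb *policyv1beta1.PodDisruptionBudget, cc metav1.Object) (bool, error) {
	if !cfg.CheckStalePDBRecreate {
		return true, nil // legacy behaviour: recreate whenever the values differ
	}
	if err := allowPDBRecreate(pdb, cc); err != nil {
		return false, err // stale CassandraCluster read: skip the recreate
	}
	return true, nil
}
```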
