This repository has been archived by the owner on Oct 20, 2022. It is now read-only.

[BUG] Pod disruption budget's MaxPodUnavailable gets changed unexpectedly when reading stale information from apiserver #321

Open
srteam2020 opened this issue Apr 28, 2021 · 4 comments


@srteam2020
Contributor

Bug Report

We find that after a restart the controller can accidentally invoke DeletePodDisruptionBudget and then CreatePodDisruptionBudget during ensureCassandraPodDisruptionBudget, which unexpectedly decreases the PDB's MaxPodUnavailable. If the user had increased MaxPodUnavailable to force the operator to keep making progress while multiple disruptions are ongoing, this unexpected decrease leaves the operator unable to take any action, including creating or updating the cluster's statefulset.

The root cause is stale data served by the apiserver: the controller sees a "desired" MaxPodUnavailable (from the stale CassandraCluster spec) that differs from the MaxPodUnavailable in the current PodDisruptionBudget, so it recreates the PodDisruptionBudget and resets MaxPodUnavailable to the stale value.
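For illustration, a minimal Go sketch of this pattern (not the actual casskop code; the function signature and parameter names are illustrative) could look like the following. The PDB is deleted and recreated whenever the value taken from the possibly stale CassandraCluster spec differs from the live PDB, with no check that the CassandraCluster read itself is up to date:

```go
package example

import (
	"context"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensurePDB mirrors the delete-then-recreate pattern described above.
// desiredMaxUnavailable comes from the CassandraCluster spec, which may be stale.
func ensurePDB(ctx context.Context, c client.Client, current *policyv1beta1.PodDisruptionBudget, desiredMaxUnavailable int) error {
	if current.Spec.MaxUnavailable != nil && current.Spec.MaxUnavailable.IntValue() == desiredMaxUnavailable {
		return nil // values already match, nothing to do
	}
	// Mismatch: delete the existing PDB and recreate it with the "desired" value.
	// If the spec read was stale, this silently rolls MaxPodUnavailable back.
	if err := c.Delete(ctx, current); err != nil {
		return err
	}
	recreated := current.DeepCopy()
	recreated.ResourceVersion = "" // must be empty on create
	desired := intstr.FromInt(desiredMaxUnavailable)
	recreated.Spec.MaxUnavailable = &desired
	return c.Create(ctx, recreated)
}
```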

One potential fix is to save a copy of the CassandraCluster's ResourceVersion inside the PDB when the PDB is created. Before recreating the PDB, the controller would compare the current CassandraCluster's ResourceVersion with the one stored in the PDB to check that the former is newer. If it is not, the CassandraCluster read is stale and the PDB should not be updated.

What did you do?
Concrete steps that led to this issue (in an HA k8s cluster):

  1. Create a CassandraCluster cassandra-cluster with MaxPodUnavailable set to 1. The controller is talking to apiserver1. At this point the operator also creates the corresponding PDB with MaxPodUnavailable set to 1.
  2. Resize the MaxPodUnavailable of cassandra-cluster to 4. The controller finds that the desired MaxPodUnavailable differs from the current one, so it deletes the current PDB and recreates it with MaxPodUnavailable set to 4. Meanwhile, apiserver2 remains stuck at a view where cassandra-cluster's MaxPodUnavailable is still 1 (because it is running slowly or has network issues).
  3. The controller restarts after a node failure and now talks to the stale apiserver2. Reading the stale cassandra-cluster from apiserver2, the controller believes the "desired" MaxPodUnavailable is still 1, which differs from the current value (4) in the PDB, so it deletes the PDB and recreates it with MaxPodUnavailable set to 1.

What did you expect to see?
The PDB should not be affected when the controller reads stale information about cassandra-cluster.

What did you see instead? Under which circumstances?
The PDB gets recreated with a smaller MaxPodUnavailable, which can block user operations if disruptions are ongoing, as described above.

  • casskop version:
    f87c8e0 (master branch)

  • Kubernetes version information:
    1.18.9

  • Cassandra version:
    3.11

Possible Solution
We are willing to help fix this bug by issuing a PR.
As mentioned above, we can attach the CassandraCluster's ResourceVersion to the PDB. The operator can then tell that the "desired" MaxPodUnavailable shown in the CassandraCluster spec is stale by checking whether the CassandraCluster's ResourceVersion (treated as an incremental ID) is older than the one attached to the PDB. In that stale-read case, the operator should not recreate the PDB but return with an error.
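A hypothetical sketch of this check (the annotation key and helper names are illustrative, and treating ResourceVersion as a numeric, incremental ID follows the proposal above rather than any Kubernetes API guarantee):

```go
package example

import (
	"fmt"
	"strconv"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical annotation key remembering which CassandraCluster version the PDB was built from.
const lastSeenCCResourceVersion = "casskop.example/cassandracluster-resourceversion"

// stampPDB records the CassandraCluster's ResourceVersion on the PDB when the PDB is created.
func stampPDB(pdb *policyv1beta1.PodDisruptionBudget, cc metav1.Object) {
	if pdb.Annotations == nil {
		pdb.Annotations = map[string]string{}
	}
	pdb.Annotations[lastSeenCCResourceVersion] = cc.GetResourceVersion()
}

// allowPDBRecreate returns an error when the CassandraCluster we just read is
// older than the one the existing PDB was built from, i.e. a stale read.
func allowPDBRecreate(pdb *policyv1beta1.PodDisruptionBudget, cc metav1.Object) error {
	stored, ok := pdb.Annotations[lastSeenCCResourceVersion]
	if !ok {
		return nil // no record yet, allow the recreate
	}
	old, errOld := strconv.ParseUint(stored, 10, 64)
	cur, errCur := strconv.ParseUint(cc.GetResourceVersion(), 10, 64)
	if errOld != nil || errCur != nil {
		return nil // cannot compare, fall back to the existing behaviour
	}
	if cur < old {
		return fmt.Errorf("stale CassandraCluster read (resourceVersion %d < %d), refusing to recreate the PDB", cur, old)
	}
	return nil
}
```

In this sketch, stampPDB would be called wherever the PDB is created, and allowPDBRecreate right before the delete/recreate path in ensureCassandraPodDisruptionBudget.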

@cscetbon
Contributor

@srteam2020 don't hesitate to send a PR, then we can start talking about those changes.

@srteam2020
Contributor Author

srteam2020 commented Jul 11, 2021

@cscetbon
Thanks for the reply! Sorry for the delay; we are working on a fix for this issue.

@srteam2020
Contributor Author

Sorry for the delay. We sent a fix for this problem here: #355
It seems that the fix breaks some unit tests. We will try to find a better way to gracefully solve the problem without introducing test failures.

@cscetbon
Contributor

@srteam2020 it also must keep working for existing clusters. So if the change could prevent an existing cluster from working, it needs to be activated via a parameter or another non-disruptive solution.
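One hypothetical way to do that, building on the sketch in the Possible Solution section above (the flag name is illustrative and not an existing casskop option), is to gate the new check behind an opt-in parameter so that existing clusters keep today's behaviour by default:

```go
// OperatorConfig is an illustrative operator-level setting, not an existing casskop option.
type OperatorConfig struct {
	// CheckStalePDBRecreate enables the ResourceVersion comparison before the
	// PDB is deleted and recreated. Defaults to false for backward compatibility.
	CheckStalePDBRecreate bool
}

// shouldRecreatePDB wraps allowPDBRecreate (from the earlier sketch) behind the flag.
func shouldRecreatePDB(cfg OperatorConfig, pdb *policyv1beta1.PodDisruptionBudget, cc metav1.Object) (bool, error) {
	if !cfg.CheckStalePDBRecreate {
		return true, nil // legacy behaviour: recreate whenever the values differ
	}
	if err := allowPDBRecreate(pdb, cc); err != nil {
		return false, err // stale CassandraCluster read: skip the recreate
	}
	return true, nil
}
```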
