
Evicted pod do not release controller ConfigMap lock Bug 1749620 #1874

Closed
ilanuriel opened this issue Sep 2, 2019 · 20 comments · Fixed by #2210
Labels
kind/documentation Categorizes issue or PR as related to documentation. triage/support Indicates an issue that is a support question.

@ilanuriel

The controller pod becomes a leader by acquiring a lock on a configmap resource.
Due to a bug in my code which caused a memory leak, the pod was evicted after a few hours.
However, the evicted pod's lock was not released, and thus the new pod could not become the leader.
The only way to fix this was to manually delete the evicted pod, which was holding the lock, and also delete the new pod, which could not acquire the lock and had given up after a few tries.
According to the Kubernetes documentation, locks held by evicted pods should be released when the pods are evicted.
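For context, operators scaffolded with the SDK normally acquire this lock at startup through the SDK's leader package; a minimal usage sketch, with an illustrative lock name:

```go
package main

import (
	"context"
	"log"

	"github.com/operator-framework/operator-sdk/pkg/leader"
)

func main() {
	// Become blocks until this pod holds the leader lock: it creates a
	// ConfigMap (named "my-operator-lock" here, an illustrative name) owned
	// by the current pod and retries while another owner already exists.
	if err := leader.Become(context.TODO(), "my-operator-lock"); err != nil {
		log.Fatalf("failed to become leader: %v", err)
	}

	// ... set up and start the manager and controllers here ...
}
```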

@camilamacedo86
Contributor

camilamacedo86 commented Sep 2, 2019

Hi @ilanuriel,

Note that when a pod is flagged for eviction, the API will try to remove it. Also, before removing the pod, it will ensure that all preconditions for doing so are met.

This shows that, because your code caused a memory leak, the pod was running out of resources. I'd suggest you check the following documentation.

Please let us know if the above information meets your expectations and we can close this ticket, or if there is anything else you are looking for to be done here.

@ilanuriel
Author

Thanks Camila,
Due to my memory leak, the pod was evicted. This part is OK: the leak violated resource consumption limits, so Kubernetes was right to evict the pod. However, the evicted pod was not removed and the lock was not freed, so the new pod replacing the evicted one was unable to become the leader. This part, in my opinion, is a bug: if an evicted pod, which is a non-working pod, does not release its resources, new pods will fail to operate, so eviction essentially means a service outage instead of replacing a sick pod with a new healthy one and resuming service. Unless I am missing something, I believe this is a bug. Do you agree?

@camilamacedo86
Contributor

Hi @ilanuriel,

The following k8s documents give a better idea of how the k8s API works in these scenarios and how you can set up your environment to deal better with this kind of disruption.

https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#eviction-policy

Also, could you please let us know why you believe that this is a bug in the operator-framework?

@camilamacedo86 camilamacedo86 self-assigned this Sep 3, 2019
@camilamacedo86 camilamacedo86 added triage/needs-information Indicates an issue needs more information in order to work on it. triage/support Indicates an issue that is a support question. labels Sep 3, 2019
@ilanuriel
Author

Hi Camila,
Per the documentation, an evicted pod's resources need to be released. The documentation suggests various methods to tune when this happens, but whenever it does happen, the eviction should end with the resources being released, which is what is not happening in this case.
The reason I believe this is an operator-framework bug is as follows:
The way the controller code in the framework works is that a ConfigMap is created by the pod that wants to become the leader. Upon successful creation of the ConfigMap, the pod becomes the leader, and it will stay the leader even if another pod instance tries to become one, because no second pod can create the ConfigMap as long as it is not deleted first.
However, when the leading pod is evicted, the one replacing it should be able to become the leader, which means it should be able to create the ConfigMap. This will happen only if the evicted pod deletes the ConfigMap when it is evicted, so that the replacement can successfully create the ConfigMap again and become the leader. I found no code responsible for deleting the ConfigMap on eviction, therefore I conclude that there is a missing handler in the framework. However, I am not familiar enough with the framework code to be sure my conclusion is correct, which is why I approached you.
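A simplified sketch of the mechanism described above, assuming a recent client-go clientset and the pod's own name/namespace (e.g. from the Downward API); it illustrates the ConfigMap-with-owner-reference idea and is not the SDK's actual leader.go:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryBecomeLeader attempts to create the lock ConfigMap owned by this pod.
// Only one pod can succeed; the owner reference ties the lock's lifetime to
// the leader pod, so deleting the pod object garbage-collects the lock.
func tryBecomeLeader(ctx context.Context, c kubernetes.Interface, ns string, me *corev1.Pod) (bool, error) {
	lock := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "operator-lock", // illustrative name
			Namespace: ns,
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "v1",
				Kind:       "Pod",
				Name:       me.Name,
				UID:        me.UID,
			}},
		},
	}
	_, err := c.CoreV1().ConfigMaps(ns).Create(ctx, lock, metav1.CreateOptions{})
	switch {
	case err == nil:
		return true, nil // we created the lock: we are the leader
	case apierrors.IsAlreadyExists(err):
		return false, nil // another pod holds the lock: retry later
	default:
		return false, err
	}
}
```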

@hasbro17
Contributor

hasbro17 commented Sep 3, 2019

Evicted pods not getting deleted/garbage collected (from an unresponsive or partitioned node) is an edge case with the SDK's leader-for-life approach.

See the original proposal for more context on this issue:
#481 (comment)

I'm not sure if it's something that we can easily fix given that the nature of the lock is not leased/expiry based. Worth discussing further though.

At the very least we should have this case clearly documented in the leader pkg godocs.

@hasbro17 hasbro17 added kind/documentation Categorizes issue or PR as related to documentation. kind/bug Categorizes issue or PR as related to a bug. labels Sep 3, 2019
@ilanuriel
Author

ilanuriel commented Sep 4, 2019

Thank you Haseeb,
May I suggest a resolution?
leader.go in the framework deals with the owner. At line 85, if an owner is found and this is not a restart of the same pod, it will wait for some time and, in the case I have raised, give up, since the lock will never be released.
What we can do is check whether the owner `existingOwner.name` belongs to a pod that is not in the RUNNING or PENDING state. In that case, the lock should be regarded as a stale lock, and we should allow the controller to proceed as the leader and, furthermore, delete the existing ConfigMap and re-create it so its ownership is updated to the new pod. This way, the operator ensures that locks are owned by living pods and do not stay in the ownership of invalid entities that for some reason were not released properly.
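A rough sketch of the check proposed above, reusing the imports from the earlier sketch; the function is illustrative and not the SDK's actual code:

```go
// lockIsStale reports whether the pod named in the lock ConfigMap's owner
// reference is neither Running nor Pending (or no longer exists), in which
// case the lock could be treated as stale and deleted/re-created.
func lockIsStale(ctx context.Context, c kubernetes.Interface, ns string, lock *corev1.ConfigMap) (bool, error) {
	for _, ref := range lock.GetOwnerReferences() {
		if ref.Kind != "Pod" {
			continue
		}
		owner, err := c.CoreV1().Pods(ns).Get(ctx, ref.Name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // owner pod is gone entirely
		}
		if err != nil {
			return false, err
		}
		switch owner.Status.Phase {
		case corev1.PodRunning, corev1.PodPending:
			return false, nil // owner may still be alive: leave the lock alone
		default:
			return true, nil // e.g. Failed (evicted) or Succeeded
		}
	}
	return false, nil
}
```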

@jianzhangbjz
Contributor

Since this blocks the CNV operators' development, I created a bug in Bugzilla for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1749620

@joelanford
Member

What we can do is check whether the owner `existingOwner.name` belongs to a pod that is not in the RUNNING or PENDING state. In that case, the lock should be regarded as a stale lock, and we should allow the controller to proceed as the leader and, furthermore, delete the existing ConfigMap and re-create it so its ownership is updated to the new pod.

@ilanuriel Unfortunately that approach would cause the leader-for-life election guarantees to fail in the scenario that the leader pod is evicted due to it being on an unresponsive or unreachable node. In that case, the pod may actually still be running so it would be invalid to delete its lock.

However, controller-runtime supports leader election with the leader-with-lease approach that may handle this particular issue correctly. See our leader election documentation for more details.

There are tradeoffs with either approach, so you should consider carefully which makes the most sense for your use case.
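For reference, a minimal sketch of enabling the leader-with-lease approach through controller-runtime's manager options; the election ID and namespace are illustrative values:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Leader-with-lease: the manager keeps renewing a lease, so if the
	// leader dies or is partitioned the lease expires and another replica
	// can take over (at the cost of weaker leader-for-life guarantees).
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "my-operator-leader-election", // illustrative
		LeaderElectionNamespace: "my-operator-namespace",       // illustrative
	})
	if err != nil {
		os.Exit(1)
	}

	// ... register controllers with mgr, then block in Start:
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```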

@joelanford joelanford removed kind/bug Categorizes issue or PR as related to a bug. triage/needs-information Indicates an issue needs more information in order to work on it. labels Sep 6, 2019
@joelanford
Member

Thinking about this more, it seems there are two ways a pod can be evicted:

  1. The pod is on a node that becomes unreachable.
  2. The pod is killed by kubelet due to resource usage.

For 1) there are two scenarios:
1a. The pod is able to maintain a connection to the API server.
1b. The pod loses the connection to the API server.

To cover 1a., we could have the leader pod monitor itself, checking to see if it ever gets evicted. If so, it could stop all of its controllers, delete the lock, and exit. This would allow an operator replica on a non-partitioned node to gain the lock and become the leader.

For 1b., if the leader is no longer able to communicate with the API server, it would no longer be able to monitor itself or delete the lock. In this case, it seems like there is still the possibility of a deadlock.

To cover 2, we could look into catching the signal from kubelet that terminates the container, which would allow us to stop all controllers, delete the lock, and exit cleanly.

Note that for 1a and 2, it is important that all of the controllers are stopped before the lock is deleted, to guarantee that another operator replica is not able to run the same controllers simultaneously.

@shawn-hurley @hasbro17 @estroz @camilamacedo86 Anything I'm getting wrong or any different perspectives or ideas for other approaches? Any ideas for the 1b case?
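A rough sketch of the case-2 handling described above (catch SIGTERM, stop the controllers, then release the lock); the stop and delete callbacks are hypothetical stand-ins supplied by the operator, not SDK functions:

```go
package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"
)

// handleTermination stops the controllers and releases the leader lock when
// the kubelet sends SIGTERM. stopControllers and deleteLock are hypothetical
// callbacks; the ordering is what matters: controllers must be stopped
// before the lock is deleted so another replica cannot run them concurrently.
func handleTermination(ctx context.Context, stopControllers func(), deleteLock func() error) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	go func() {
		select {
		case <-sigs:
		case <-ctx.Done():
			return
		}
		stopControllers() // 1. no reconcile may run past this point
		_ = deleteLock()  // 2. now another replica can take the lock
		os.Exit(0)
	}()
}
```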

@estroz
Member

estroz commented Sep 9, 2019

@joelanford won't the deadlock only occur in case 1b for pod-eviction-timeout amount of time, since the pod will be deleted after that amount of unresponsiveness?

@hasbro17
Contributor

@estroz Not necessarily. If the node is unreachable the apiserver can't tell the kubelet to delete the pods after pod-eviction-timeout.

https://kubernetes.io/docs/concepts/architecture/nodes/#condition

In some cases when the node is unreachable, the apiserver is unable to communicate with the kubelet on the node. The decision to delete the pods cannot be communicated to the kubelet until communication with the apiserver is re-established. In the meantime, the pods that are scheduled for deletion may continue to run on the partitioned node.

In versions of Kubernetes prior to 1.5, the node controller would force delete these unreachable pods from the apiserver. However, in 1.5 and higher, the node controller does not force delete pods until it is confirmed that they have stopped running in the cluster. You can see the pods that might be running on an unreachable node as being in the Terminating or Unknown state.

So deadlock forever is possible on partitioned/unreachable nodes. I don't think we should try to handle 1b. 1a is just a best case scenario for 1 so I don't know if it's worth optimizing for.

@joelanford I need to read more on the sequence of events when a pod is evicted, but from the termination lifecycle of a pod it doesn't seem that we need to handle the case of the pod being unresponsive to SIGTERM from the kubelet.

https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods

  1. User sends command to delete Pod, with default grace period (30s)
  2. The Pod in the API server is updated with the time beyond which the Pod is considered “dead” along with the grace period.
  3. Pod shows up as “Terminating” when listed in client commands
  4. (simultaneous with 3) When the Kubelet sees that a Pod has been marked as terminating because the time in 2 has been set, it begins the Pod shutdown process.
    ...

In the case of no network partitions, once the grace period expires the pod will be killed via SIGKILL and will be removed from the API server.
So for 2 we shouldn't be trying to catch SIGTERM/SIGKILL signals from the kubelet in the operator pod. At that point the pod will be successfully deleted anyway.
At best, the pod can monitor its own terminationGracePeriodSeconds field in the spec to realize it's being marked for eviction and then shut down on its own.

Going back to @ilanuriel's original issue:

However, the evicted pod was not removed
What I want to know is what the grace period is for that eviction policy and how long the pod sticks around after being evicted.
https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#soft-eviction-thresholds

@hasbro17
Contributor

hasbro17 commented Sep 10, 2019

Edit: The pods do get killed it seems.

Okay, so it seems evicted pods will get killed (all containers stopped and resources reclaimed) but not deleted (i.e. the deletion timestamp is not set) from the API. This is true for pods managed by Deployments but not for DaemonSets or other controllers. And it seems intentional.

See the comments on kubernetes/kubernetes#54525.

I guess for the low-resource eviction case, if we know the pod is going to stick around with status.phase=Failed, then the leader pod can give up the lock and exit on its own.
The operator can't remove its own pod if it's killed already. One of the other pods could do it, but that would be messy. Not sure we want to handle it that way.
Another solution would be to set the kube-controller-manager flag --terminated-pod-gc-threshold and bring it down from the default of 12500, but that won't completely solve the problem and you can still have evicted pods sticking around.
kubernetes/kubernetes#55051 (comment)

@ilanuriel I don't know if you can still confirm this for your evicted pods: were they marked with status.phase=Failed?

@ilanuriel
Author

I think it was marked as Failed.

@ilanuriel
Author

I would like to emphasize that even if there are cases the framework won't fix, please be advised that the newly created pod could not become the leader; after some time it stopped trying and remained in a RUNNING state where it was essentially doing nothing. Only by deleting BOTH the evicted pod AND the new pod was I able to get things working again, as the new pod had stopped trying to become the leader, so deleting ONLY the evicted pod was NOT enough. From a DevOps perspective, the new pod looked perfect: it was running, and nothing was reported to suggest that this operator would be doing nothing.
From a system management perspective, this is certainly a problem, because we get no signal that the operator is not working and that we have to do something about it. A new pod is in a RUNNING state, everything looks OK, the system administrator relaxes with a cup of coffee believing all is good, while in practice the new operator pod will NEVER work.
To make things even worse, even after deleting the evicted pod, the new pod has to be deleted as well to make things work again.
So to summarize: even if there are situations where the framework cannot recover from evicted-pod issues, at the very least it should not produce new pods that cannot work and give the illusion that all is great while they do nothing. A pod that cannot become the leader and function should switch into a Kubernetes state that allows admins to understand the system needs care.
Also, while every admin knows that an evicted pod should eventually be deleted, they probably have no idea that the new operator pod should be deleted as well to recover. In the current situation, they will delete the evicted pod, thinking that the new pod is OK since it is in a RUNNING state, while that pod gave up trying to become the leader long ago and remains in a "doing nothing" RUNNING state.

@jmazzitelli

This looks like a duplicate of an issue that was reported back in April. See: #1305

@mhrivnak
Member

I agree that the simplest way to address this is for the non-leader to check the status of the leader pod and delete its lock if that pod's phase is Failed.

Only by deleting BOTH the evicted pod AND the new pod was I able to get things working again, as the new pod had stopped trying to become the leader

That sounds like a separate bug. Do you have any additional findings or info on why the new Pod stopped trying to become the leader?
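A rough sketch of that idea, which is also roughly what the commits referenced below describe (delete the evicted lock-owner pod so the lock ConfigMap is garbage collected through its owner reference); imports as in the earlier sketches, names illustrative, not the actual patch:

```go
// deleteEvictedLockOwner checks the pod that owns the lock ConfigMap and, if
// it has been evicted (status.phase=Failed), deletes it. Garbage collection
// then removes the lock ConfigMap, letting this pod create its own lock.
func deleteEvictedLockOwner(ctx context.Context, c kubernetes.Interface, ns string, lock *corev1.ConfigMap) error {
	for _, ref := range lock.GetOwnerReferences() {
		if ref.Kind != "Pod" {
			continue
		}
		owner, err := c.CoreV1().Pods(ns).Get(ctx, ref.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if owner.Status.Phase == corev1.PodFailed {
			return c.CoreV1().Pods(ns).Delete(ctx, owner.Name, metav1.DeleteOptions{})
		}
	}
	return nil
}
```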

@camilamacedo86 camilamacedo86 removed their assignment Nov 13, 2019
asmacdo added a commit to asmacdo/operator-sdk that referenced this issue Nov 15, 2019
When the leader pod is evicted but not deleted, the leader lock
configmap is not garbage collected. With this patch, a pod attempting to
become the leader is able to delete the evicted pod, triggering garbage
collection and allowing it to create a new leader lock.

operator-framework#1305
Fixes operator-framework#1305
operator-framework#1874
Fixes operator-framework#1874
asmacdo added a commit to asmacdo/operator-sdk that referenced this issue Nov 15, 2019
Before this patch, when the leader pod is evicted but not deleted, the
leader lock configmap is not garbage collected and subsequent operators
can never become leader. With this patch, an operator attempting to
become the leader is able to delete the evicted operator pod, triggering
garbage collection and allowing leader election to continue.

Sometimes, evicted operator pods will remain, even with this patch.
This occurs when the leader operator pod is evicted and a new operator pod
is created on the same node. In this case, the new pod will also be
evicted. When an operator pod is created on a non-failing node, leader
election will delete only the evicted leader pod, leaving any evicted
operator pods that were not the leader.

To replicate the evicted state, I used a `kind` cluster with 2 worker
nodes with altered kubelet configuration:
  `memory.available`: Set about 8Gi less than the host machine `avail
  Mem` from `top`.
  `evictionPressureTransitionPeriod: 5s`: Allows kubelet to evict
  earlier.
With these settings in place (and kubelet restarted), a memory-explosion
function is triggered, which results in the eviction of all pods on that
node.

operator-framework#1305
Fixes operator-framework#1305
operator-framework#1874
Fixes operator-framework#1874
asmacdo added a commit to asmacdo/operator-sdk that referenced this issue Nov 15, 2019
@asmacdo asmacdo self-assigned this Nov 15, 2019
@avarf

avarf commented Nov 18, 2019

Is this issue solved?

asmacdo added a commit to asmacdo/operator-sdk that referenced this issue Nov 18, 2019
@asmacdo
Member

asmacdo commented Nov 18, 2019

@avarf the patch isn't merged yet, but if you would like to help test it, the PR is #2210.

@asmacdo asmacdo changed the title from "Evicted pod do not release controller ConfigMap lock" to "Evicted pod do not release controller ConfigMap lock Bug 1749620" Nov 20, 2019
@asmacdo
Member

asmacdo commented Nov 21, 2019

Demo of bugfix: https://www.youtube.com/watch?v=qTzdAYWPdCQ
