
Evicted pod do not release controller ConfigMap lock Bug 1749620 #1874

Closed
ilanuriel opened this issue Sep 2, 2019 · 20 comments · Fixed by #2210
Labels
kind/documentation Categorizes issue or PR as related to documentation. triage/support Indicates an issue that is a support question.

@ilanuriel

The controller pod becomes a leader by acquiring a lock on a configmap resource.
Due to a bug in my code which caused a memory leak, the pod was evicted after a few hours.
However, the evicted pod's lock was not released, and thus the new pod could not become the leader.
The only way to fix this was to manually delete the evicted pod, which was holding the lock, and also delete the new pod, which could not acquire the lock and had given up after a few tries.
According to the Kubernetes documentation, locks held by evicted pods should be released when the pods are evicted.
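For context, operators scaffolded with the SDK normally acquire this lock at startup through the SDK's leader package; a minimal usage sketch, with an illustrative lock name:

```go
package main

import (
	"context"
	"log"

	"github.com/operator-framework/operator-sdk/pkg/leader"
)

func main() {
	// Become blocks until this pod holds the leader lock: it creates a
	// ConfigMap (named "my-operator-lock" here, an illustrative name) owned
	// by the current pod and retries while another owner already exists.
	if err := leader.Become(context.TODO(), "my-operator-lock"); err != nil {
		log.Fatalf("failed to become leader: %v", err)
	}

	// ... set up and start the manager and controllers here ...
}
```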

@camilamacedo86
Contributor

camilamacedo86 commented Sep 2, 2019

Hi @ilanuriel,

Note that when a pod is flagged for eviction, the API will try to remove it. Also, before removing the pod, it will ensure that all preconditions for doing so are met.

This shows that, because your code caused a memory leak, the pod was running out of resources. I'd suggest you check the following documentation.

Please let us know if the above information meets your expectations and we can close this ticket, or if there is anything else you are looking for to be done here.

@ilanuriel
Author

Thanks Camila,
Due to my memory leak, the pod was evicted. This part is OK: the leak violated resource consumption limits, so Kubernetes was right to evict the pod. However, the evicted pod was not removed and the lock was not freed, so the new pod replacing the evicted one was unable to become the leader. This part, in my opinion, is a bug: if an evicted pod, which is a non-working pod, does not release its resources, new pods will fail to operate, so eviction essentially means a service outage instead of replacing a sick pod with a new healthy one and resuming service. Unless I am missing something, I believe this is a bug. Do you agree?

@camilamacedo86
Contributor

Hi @ilanuriel,

The following k8s documents give a better idea of how the k8s API works in these scenarios and how you can set up your environment to deal better with this kind of disruption.

https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#eviction-policy

Also, could you please let us know why you believe that this is a bug in the operator-framework?

@camilamacedo86 camilamacedo86 self-assigned this Sep 3, 2019
@camilamacedo86 camilamacedo86 added triage/needs-information Indicates an issue needs more information in order to work on it. triage/support Indicates an issue that is a support question. labels Sep 3, 2019
@ilanuriel
Author

Hi Camila,
Per the documentation, an evicted pod's resources need to be released. The documentation suggests various methods to tune when this happens, but whenever it does happen, the eviction should end with the resources being released, which is what is not happening in this case.
The reason I believe this is an operator-framework bug is as follows:
The way the controller code in the framework works is that a ConfigMap is created by the pod that wants to become the leader. Upon successful creation of the ConfigMap, the pod becomes the leader, and it will stay the leader even if another pod instance tries to become one, because no second pod can create the ConfigMap as long as it is not deleted first.
However, when the leading pod is evicted, the one replacing it should be able to become the leader, which means it should be able to create the ConfigMap. This will happen only if the evicted pod deletes the ConfigMap when it is evicted, so that the replacement can successfully create the ConfigMap again and become the leader. I found no code responsible for deleting the ConfigMap on eviction, therefore I conclude that there is a missing handler in the framework. However, I am not familiar enough with the framework code to be sure my conclusion is correct, which is why I approached you.
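A simplified sketch of the mechanism described above, assuming a recent client-go clientset and the pod's own name/namespace (e.g. from the Downward API); it illustrates the ConfigMap-with-owner-reference idea and is not the SDK's actual leader.go:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tryBecomeLeader attempts to create the lock ConfigMap owned by this pod.
// Only one pod can succeed; the owner reference ties the lock's lifetime to
// the leader pod, so deleting the pod object garbage-collects the lock.
func tryBecomeLeader(ctx context.Context, c kubernetes.Interface, ns string, me *corev1.Pod) (bool, error) {
	lock := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "operator-lock", // illustrative name
			Namespace: ns,
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "v1",
				Kind:       "Pod",
				Name:       me.Name,
				UID:        me.UID,
			}},
		},
	}
	_, err := c.CoreV1().ConfigMaps(ns).Create(ctx, lock, metav1.CreateOptions{})
	switch {
	case err == nil:
		return true, nil // we created the lock: we are the leader
	case apierrors.IsAlreadyExists(err):
		return false, nil // another pod holds the lock: retry later
	default:
		return false, err
	}
}
```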

@hasbro17
Contributor

hasbro17 commented Sep 3, 2019

Evicted pods not getting deleted/garbage collected (from an unresponsive or partitioned node) is an edge case with the SDK's leader-for-life approach.

See the original proposal for more context on this issue:
#481 (comment)

I'm not sure if it's something that we can easily fix given that the nature of the lock is not leased/expiry based. Worth discussing further though.

At the very least we should have this case clearly documented in the leader pkg godocs.

@hasbro17 hasbro17 added kind/documentation Categorizes issue or PR as related to documentation. kind/bug Categorizes issue or PR as related to a bug. labels Sep 3, 2019
@ilanuriel
Author

ilanuriel commented Sep 4, 2019

Thank you Haseeb,
May I suggest a resolution?
leader.go in the framework deals with the owner. At line 85, if an owner is found and this is not a restart of the same pod, it will wait for some time and, in the case I have raised, give up, since the lock will never be released.
What we can do is check whether the owner `existingOwner.name` belongs to a pod that is not in the RUNNING or PENDING state. In that case, the lock should be regarded as a stale lock, and we should allow the controller to proceed as the leader and, furthermore, delete the existing ConfigMap and re-create it so its ownership is updated to the new pod. This way, the operator ensures that locks are owned by living pods and do not stay in the ownership of invalid entities that for some reason were not released properly.
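A rough sketch of the check proposed above, reusing the imports from the earlier sketch; the function is illustrative and not the SDK's actual code:

```go
// lockIsStale reports whether the pod named in the lock ConfigMap's owner
// reference is neither Running nor Pending (or no longer exists), in which
// case the lock could be treated as stale and deleted/re-created.
func lockIsStale(ctx context.Context, c kubernetes.Interface, ns string, lock *corev1.ConfigMap) (bool, error) {
	for _, ref := range lock.GetOwnerReferences() {
		if ref.Kind != "Pod" {
			continue
		}
		owner, err := c.CoreV1().Pods(ns).Get(ctx, ref.Name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // owner pod is gone entirely
		}
		if err != nil {
			return false, err
		}
		switch owner.Status.Phase {
		case corev1.PodRunning, corev1.PodPending:
			return false, nil // owner may still be alive: leave the lock alone
		default:
			return true, nil // e.g. Failed (evicted) or Succeeded
		}
	}
	return false, nil
}
```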

@jianzhangbjz
Contributor

Since this blocks the CNV operators' development, I created a bug in Bugzilla for this issue: https://bugzilla.redhat.com/show_bug.cgi?id=1749620

@joelanford
Member

What we can do is check whether the owner `existingOwner.name` belongs to a pod that is not in the RUNNING or PENDING state. In that case, the lock should be regarded as a stale lock, and we should allow the controller to proceed as the leader and, furthermore, delete the existing ConfigMap and re-create it so its ownership is updated to the new pod.

@ilanuriel Unfortunately that approach would cause the leader-for-life election guarantees to fail in the scenario that the leader pod is evicted due to it being on an unresponsive or unreachable node. In that case, the pod may actually still be running so it would be invalid to delete its lock.

However, controller-runtime supports leader election with the leader-with-lease approach that may handle this particular issue correctly. See our leader election documentation for more details.

There are tradeoffs with either approach, so you should consider carefully which makes the most sense for your use case.
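For reference, a minimal sketch of enabling the leader-with-lease approach through controller-runtime's manager options; the election ID and namespace are illustrative values:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Leader-with-lease: the manager keeps renewing a lease, so if the
	// leader dies or is partitioned the lease expires and another replica
	// can take over (at the cost of weaker leader-for-life guarantees).
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "my-operator-leader-election", // illustrative
		LeaderElectionNamespace: "my-operator-namespace",       // illustrative
	})
	if err != nil {
		os.Exit(1)
	}

	// ... register controllers with mgr, then block in Start:
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```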

@joelanford joelanford removed kind/bug Categorizes issue or PR as related to a bug. triage/needs-information Indicates an issue needs more information in order to work on it. labels Sep 6, 2019
@joelanford
Member

Thinking about this more, it seems there are two ways a pod can be evicted:

  1. The pod is on a node that becomes unreachable.
  2. The pod is killed by kubelet due to resource usage.

For 1) there are two scenarios:
1a. The pod is able to maintain a connection to the API server.
1b. The pod loses the connection to the API server.

To cover 1a., we could have the leader pod monitor itself, checking to see if it ever gets evicted. If so, it could stop all of its controllers, delete the lock, and exit. This would allow an operator replica on a non-partitioned node to gain the lock and become the leader.

For 1b., if the leader is no longer able to communicate with the API server, it would no longer be able to monitor itself or delete the lock. In this case, it seems like there is still the possibility of a deadlock.

To cover 2, we could look into catching the signal from kubelet that terminates the container, which would allow us to stop all controllers, delete the lock, and exit cleanly.

Note that for 1a and 2, it is important that all of the controllers are stopped before the lock is deleted, to guarantee that another operator replica is not able to run the same controllers simultaneously.

@shawn-hurley @hasbro17 @estroz @camilamacedo86 Anything I'm getting wrong or any different perspectives or ideas for other approaches? Any ideas for the 1b case?
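A rough sketch of the case-2 handling described above (catch SIGTERM, stop the controllers, then release the lock); the stop and delete callbacks are hypothetical stand-ins supplied by the operator, not SDK functions:

```go
package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"
)

// handleTermination stops the controllers and releases the leader lock when
// the kubelet sends SIGTERM. stopControllers and deleteLock are hypothetical
// callbacks; the ordering is what matters: controllers must be stopped
// before the lock is deleted so another replica cannot run them concurrently.
func handleTermination(ctx context.Context, stopControllers func(), deleteLock func() error) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	go func() {
		select {
		case <-sigs:
		case <-ctx.Done():
			return
		}
		stopControllers() // 1. no reconcile may run past this point
		_ = deleteLock()  // 2. now another replica can take the lock
		os.Exit(0)
	}()
}
```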

@estroz
Member

estroz commented Sep 9, 2019

@joelanford won't the deadlock only occur in case 1b for pod-eviction-timeout amount of time, since the pod will be deleted after that amount of unresponsiveness?

@hasbro17
Contributor

@estroz Not necessarily. If the node is unreachable the apiserver can't tell the kubelet to delete the pods after pod-eviction-timeout.

https://kubernetes.io/docs/concepts/architecture/nodes/#condition

In some cases when the node is unreachable, the apiserver is unable to communicate with the kubelet on the node. The decision to delete the pods cannot be communicated to the kubelet until communication with the apiserver is re-established. In the meantime, the pods that are scheduled for deletion may continue to run on the partitioned node.

In versions of Kubernetes prior to 1.5, the node controller would force delete these unreachable pods from the apiserver. However, in 1.5 and higher, the node controller does not force delete pods until it is confirmed that they have stopped running in the cluster. You can see the pods that might be running on an unreachable node as being in the Terminating or Unknown state.

So deadlock forever is possible on partitioned/unreachable nodes. I don't think we should try to handle 1b. 1a is just a best case scenario for 1 so I don't know if it's worth optimizing for.

@joelanford I need to read more on the sequence of events when a pod is evicted, but from the termination lifecycle of a pod it doesn't seem that we need to handle the case of the pod being unresponsive to SIGTERM from the kubelet.

https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods

  1. User sends command to delete Pod, with default grace period (30s)
  2. The Pod in the API server is updated with the time beyond which the Pod is considered “dead” along with the grace period.
  3. Pod shows up as “Terminating” when listed in client commands
  4. (simultaneous with 3) When the Kubelet sees that a Pod has been marked as terminating because the time in 2 has been set, it begins the Pod shutdown process.
    ...

In the case of no network partitions, once the grace period expires the pod will be killed via SIGKILL and will be removed from the API server.
So for 2 we shouldn't be trying to catch SIGTERM/SIGKILL signals from the kubelet in the operator pod. At that point the pod will be successfully deleted anyway.
At best, the pod can monitor its own terminationGracePeriodSeconds field in the spec to realize it's being marked for eviction and then shut down on its own.

Going back to @ilanuriel's original issue:

However, the evicted pod was not removed
What I want to know is what the grace period is for that eviction policy and how long the pod sticks around after being evicted.
https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#soft-eviction-thresholds

@hasbro17
Contributor

hasbro17 commented Sep 10, 2019

Edit: The pods do get killed it seems.

Okay, so it seems evicted pods will get killed (all containers stopped and resources reclaimed) but not deleted (i.e. the deletion timestamp is not set) from the API. This is true for pods managed by Deployments but not for DaemonSets or other controllers. And it seems intentional.

See the comments on kubernetes/kubernetes#54525.

I guess for the low-resource eviction case, if we know the pod is going to stick around with status.phase=Failed, then the leader pod can give up the lock and exit on its own.
The operator can't remove its own pod if it's killed already. One of the other pods could do it, but that would be messy. Not sure we want to handle it that way.
Another solution would be to set the kube-controller-manager flag --terminated-pod-gc-threshold and bring it down from the default of 12500, but that won't completely solve the problem and you can still have evicted pods sticking around.
kubernetes/kubernetes#55051 (comment)

@ilanuriel I don't know if you can still confirm this for your evicted pods: were they marked with status.phase=Failed?

@ilanuriel
Author

I think it was marked as Failed.

@ilanuriel
Author

I would like to emphasize that even if there are cases the framework won't fix, please be advised that the newly created pod could not become the leader; after some time it stopped trying and remained in a RUNNING state where it was essentially doing nothing. Only by deleting BOTH the evicted pod AND the new pod was I able to get things working again, as the new pod had stopped trying to become the leader, so deleting ONLY the evicted pod was NOT enough. From a DevOps perspective, the new pod looked perfect: it was running, and nothing was reported to suggest that this operator would be doing nothing.
From a system management perspective, this is certainly a problem, because we get no signal that the operator is not working and that we have to do something about it. A new pod is in a RUNNING state, everything looks OK, the system administrator relaxes with a cup of coffee believing all is good, while in practice the new operator pod will NEVER work.
To make things even worse, even after deleting the evicted pod, the new pod has to be deleted as well to make things work again.
So to summarize: even if there are situations where the framework cannot recover from evicted-pod issues, at the very least it should not produce new pods that cannot work and give the illusion that all is great while they do nothing. A pod that cannot become the leader and function should switch into a Kubernetes state that allows admins to understand the system needs care.
Also, while every admin knows that an evicted pod should eventually be deleted, they probably have no idea that the new operator pod should be deleted as well to recover. In the current situation, they will delete the evicted pod, thinking that the new pod is OK since it is in a RUNNING state, while that pod gave up trying to become the leader long ago and remains in a "doing nothing" RUNNING state.

@jmazzitelli

This looks like a duplicate of an issue that was reported back in April. See: #1305

@mhrivnak
Member

I agree that the simplest way to address this is for the non-leader to check the status of the leader pod and delete its lock if that pod's phase is Failed.

Only by deleting BOTH the evicted pod AND the new pod was I able to get things working again, as the new pod had stopped trying to become the leader

That sounds like a separate bug. Do you have any additional findings or info on why the new Pod stopped trying to become the leader?
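A rough sketch of that idea, which is also roughly what the commits referenced below describe (delete the evicted lock-owner pod so the lock ConfigMap is garbage collected through its owner reference); imports as in the earlier sketches, names illustrative, not the actual patch:

```go
// deleteEvictedLockOwner checks the pod that owns the lock ConfigMap and, if
// it has been evicted (status.phase=Failed), deletes it. Garbage collection
// then removes the lock ConfigMap, letting this pod create its own lock.
func deleteEvictedLockOwner(ctx context.Context, c kubernetes.Interface, ns string, lock *corev1.ConfigMap) error {
	for _, ref := range lock.GetOwnerReferences() {
		if ref.Kind != "Pod" {
			continue
		}
		owner, err := c.CoreV1().Pods(ns).Get(ctx, ref.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if owner.Status.Phase == corev1.PodFailed {
			return c.CoreV1().Pods(ns).Delete(ctx, owner.Name, metav1.DeleteOptions{})
		}
	}
	return nil
}
```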

@camilamacedo86 camilamacedo86 removed their assignment Nov 13, 2019
asmacdo added a commit to asmacdo/operator-sdk that referenced this issue Nov 15, 2019
When the leader pod is evicted but not deleted, the leader lock
configmap is not garbage collected. With this patch, a pod attempting to
become the leader is able to delete the evicted pod, triggering garbage
collection and allowing it to create a new leader lock.

operator-framework#1305
Fixes operator-framework#1305
operator-framework#1874
Fixes operator-framework#1874
asmacdo added a commit to asmacdo/operator-sdk that referenced this issue Nov 15, 2019
Before this patch, when the leader pod is evicted but not deleted, the
leader lock configmap is not garbage collected and subsequent operators
can never become leader. With this patch, an operator attempting to
become the leader is able to delete the evicted operator pod, triggering
garbage collection and allowing leader election to continue.

Sometimes, evicted operator pods will remain, even with this patch.
This occurs when the leader operator pod is evicted and a new operator pod
is created on the same node. In this case, the new pod will also be
evicted. When an operator pod is created on a non-failing node, leader
election will delete only the evicted leader pod, leaving any evicted
operator pods that were not the leader.

To replicate the evicted state, I used a `kind` cluster with 2 worker
nodes with altered kubelet configuration:
  `memory.available`: Set about 8Gi less than the host machine `avail
  Mem` from `top`.
  `evictionPressureTransitionPeriod: 5s`: Allows kubelet to evict
  earlier.
With these settings in place (and kubelet restarted), a memory-explosion
function is triggered, which results in the eviction of all pods on that
node.

operator-framework#1305
Fixes operator-framework#1305
operator-framework#1874
Fixes operator-framework#1874
asmacdo added a commit to asmacdo/operator-sdk that referenced this issue Nov 15, 2019
@asmacdo asmacdo self-assigned this Nov 15, 2019
@avarf

avarf commented Nov 18, 2019

Is this issue solved?

asmacdo added a commit to asmacdo/operator-sdk that referenced this issue Nov 18, 2019
@asmacdo
Member

asmacdo commented Nov 18, 2019

@avarf the patch isn't merged yet, but if you would like to help test it, the PR is #2210.

@asmacdo asmacdo changed the title from "Evicted pod do not release controller ConfigMap lock" to "Evicted pod do not release controller ConfigMap lock Bug 1749620" Nov 20, 2019
@asmacdo
Member

asmacdo commented Nov 21, 2019

Demo of bugfix: https://www.youtube.com/watch?v=qTzdAYWPdCQ
