Fix stopping of modules started by kubernetes autodiscover #10476

jsoriano · 2019-02-01T09:43:16Z

Kubernetes autodiscover only emits events for containers that have
an ID, in pods that have an IP. But when a pod is being terminated,
its containers can lack an ID and the pod itself can lack an IP.
This leads to modules that are never stopped, because the
delete event that should stop them lacks the needed
information.

This change does two things to avoid this problem:

* Don't require the pod to have an IP on stop events.
* Use container IDs that don't depend on the container's state.
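
The two points above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `pod` and `container` structs, the `eventsFor` function and the `"start"`/`"stop"` flags are hypothetical stand-ins, not the real types or functions in `kubernetes.go`. Only the ID format (pod UID plus container name) comes from the diff discussed in this PR.

```go
package main

import "fmt"

// container and pod are simplified, hypothetical stand-ins for the
// Kubernetes objects the provider receives; they are not the real types.
type container struct {
	Name string
	ID   string // runtime ID, can be empty while the pod is terminating
}

type pod struct {
	UID        string
	IP         string // can be empty while the pod is terminating
	Containers []container
}

// eventID builds an identifier from the pod UID and the container name.
// Neither value depends on the runtime state of the container, so the
// same ID can be computed on start and on stop.
func eventID(p pod, c container) string {
	return fmt.Sprintf("%s.%s", p.UID, c.Name)
}

// eventsFor returns the IDs of the events that would be emitted for a pod.
// In this sketch, start events still require a pod IP; stop events do not.
func eventsFor(flag string, p pod) []string {
	if flag == "start" && p.IP == "" {
		return nil
	}
	ids := make([]string, 0, len(p.Containers))
	for _, c := range p.Containers {
		ids = append(ids, eventID(p, c))
	}
	return ids
}

func main() {
	running := pod{
		UID:        "uid-1",
		IP:         "10.0.0.5",
		Containers: []container{{Name: "nginx", ID: "docker://abc"}},
	}
	// The same pod while terminating: no IP and no container ID.
	terminating := pod{
		UID:        "uid-1",
		Containers: []container{{Name: "nginx"}},
	}

	fmt.Println(eventsFor("start", running))
	fmt.Println(eventsFor("stop", terminating))
}
```

Because the stop event for the terminating pod carries the same ID as the start event, whatever was started under that ID can be found and stopped even though the IP and the container ID are already gone.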

libbeat/autodiscover/providers/kubernetes/kubernetes.go

exekias · 2019-02-03T11:30:05Z

well done @jsoriano, thank you for taking this! ❤️

ruflin

LGTM. Could you rebase to get CI green?

When deleting pods, they are first marked for deletion and then, after a grace period or after all containers have stopped, they are actually deleted. We were stopping modules only on pod deletion, but at that moment the pod can lack information about containers that were running when the modules started but have been stopping during the grace period. This change schedules additional stops as soon as we receive an event with a deletion timestamp. At that moment the pod still contains the information of all containers, and thus the stop is actually done.

jsoriano · 2019-02-05T23:02:55Z

TL;DR: My original assumption was wrong. The problem was not that modules were being restarted because of a race condition between the last update and the delete; they weren't being stopped because on delete we may not have enough information to stop them.

In my first implementation of this change I messed up the handling of the DeletionTimestamp: I was using the GetNanos() method, but I have discovered that it always returns 0, so the modules were being stopped as soon as the pod was marked for termination. This fixed the original issue (no modules were running after the pod stopped), but it was stopping the modules too soon, which can be a problem in some beats.
When I discovered that, I tried to fix it by using Seconds, and confirmed that all the timestamps were properly calculated. Then I saw the original problem happening again, and then I discovered what was going on.
We only generate events for pods that have an IP and for containers that have an ID, because we need them. But when a pod is stopping, containers can be dead and without an ID. So now I schedule the stop of modules as soon as we know that the pod is stopping; at this moment the containers are still alive.

jsoriano · 2019-02-05T23:05:10Z

Writing my previous comment made me think that we may need an identifier for containers that doesn't depend on their state, so we can start and stop modules normally.

jsoriano · 2019-02-05T23:40:46Z

Writing my previous comment made me think that we may need an identifier for containers that doesn't depend on their state, so we can start and stop modules normally.

Yes, much simpler: it works by using another ID. Description updated.

houndci-bot · 2019-02-06T10:52:07Z

libbeat/autodiscover/providers/kubernetes/kubernetes.go

+
+		// This must be an id that doesn't depend on the state of the container
+		// so it works also on `stop` if containers have been already deleted.
+		eventId := fmt.Sprintf("%s.%s", pod.Metadata.GetUid(), c.GetName())


var eventId should be eventID

jsoriano · 2019-02-06T10:55:01Z

@jsoriano I think failing tests are related. Could be that some tests need update.

Except for the adjustment in an event ID, the failures were legit: I was also overwriting the value of container.id, which was wrong. Great to have these tests 🙂

jsoriano · 2019-02-06T11:54:25Z

jenkins, test this again please

jsoriano · 2019-02-07T10:45:05Z

jenkins, test this again

jsoriano · 2019-02-07T10:45:51Z

Oh, this was not the build I wanted to relaunch 🤦‍♂️

jsoriano · 2019-02-07T13:43:28Z

jenkins, test this again

jsoriano · 2019-02-07T19:02:32Z

I am merging this to start with the backports, I'll do follow-ups with further fixes if needed.

…0476) Kubernetes autodiscover only emits events for containers with an ID in pods with an IP, but when a pod is being terminated, their containers can lack of ID and the pod itself can lack of IP. This leads to modules that are never stopped because the delete event that should stop them lacks of the needed information. This change makes two things to avoid this problem: * Don't require the pod to have an IP on stop events. * Use IDs for containers that don't depend on its state. (cherry picked from commit 15f2f26)

…0476) (elastic#10644) Kubernetes autodiscover only emits events for containers with an ID in pods with an IP, but when a pod is being terminated, their containers can lack of ID and the pod itself can lack of IP. This leads to modules that are never stopped because the delete event that should stop them lacks of the needed information. This change makes two things to avoid this problem: * Don't require the pod to have an IP on stop events. * Use IDs for containers that don't depend on its state. (cherry picked from commit 4162aa9)

jsoriano added the bug, review, libbeat, needs_backport, containers, v7.0.0, Team:Integrations, v6.7.0, v7.0.0-beta1 and v6.6.1 labels Feb 1, 2019

jsoriano self-assigned this Feb 1, 2019

jsoriano requested a review from a team as a code owner February 1, 2019 09:43

jsoriano requested a review from a team February 1, 2019 10:05

ruflin approved these changes Feb 1, 2019

jsoriano commented Feb 1, 2019

jsoriano force-pushed the kubernetes-autodiscover-terminating branch from 12df0f0 to b67d810 Compare February 1, 2019 14:11

ruflin previously approved these changes Feb 4, 2019

jsoriano force-pushed the kubernetes-autodiscover-terminating branch from b67d810 to 5a6ce23 Compare February 5, 2019 22:45

jsoriano changed the title ~~Don't restart modules on terminated pod updates~~ Schedule stop of modules started by kubernetes autodiscover as soon as possible Feb 5, 2019

jsoriano added 5 commits February 6, 2019 00:20

Use an id for containers that don't depend on its state

4e25ed8

Use a better id

39c9898

Revert uneeded change

3cfff44

Fix changelog

bf9317f

Fix

972ceeb

jsoriano changed the title ~~Schedule stop of modules started by kubernetes autodiscover as soon as possible~~ Fix stopping of modules started by kubernetes autodiscover Feb 5, 2019

houndci-bot reviewed Feb 6, 2019

Make hound happy

d0d575a

ruflin approved these changes Feb 6, 2019

Add test cases for stop

604d2bd

jsoriano merged commit 15f2f26 into elastic:master Feb 7, 2019

jsoriano deleted the kubernetes-autodiscover-terminating branch February 7, 2019 19:03

jsoriano mentioned this pull request Feb 7, 2019

Cherry-pick #10476 to 7.x: Fix stopping of modules started by kubernetes autodiscover #10641

Merged

jsoriano added the v7.2.0 label and removed the needs_backport label Feb 7, 2019

This was referenced Feb 7, 2019

Cherry-pick #10476 to 7.0: Fix stopping of modules started by kubernetes autodiscover #10642

Merged

Cherry-pick #10476 to 6.7: Fix stopping of modules started by kubernetes autodiscover #10643

Merged

jsoriano mentioned this pull request Feb 7, 2019

Cherry-pick #10476 to 6.6: Fix stopping of modules started by kubernetes autodiscover #10644

Merged

jsoriano mentioned this pull request Mar 13, 2019

Filebeat still has memory leak? #9302

Closed
