
[libbeat] Fix / clarify the module reload logic #16440

Merged (14 commits) on Feb 26, 2020

Conversation

@faec (Contributor) commented on Feb 19, 2020

What does this PR do?

The loop logic for the module reloader had a bug that would reload module configurations even if they hadn't changed. We suspect that in some cases this could cause unbounded memory consumption (we are still investigating the exact scope of the problem). This PR fixes the error and restructures the surrounding code to clarify the intended logic.
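
To make the intended control flow concrete, here is a minimal, hypothetical sketch of the pattern this PR restores -- not the actual libbeat reloader. The scanner and runnerList interfaces and the run function are illustrative stand-ins; only the forceReload flag and the Reload call correspond to identifiers visible in the diff excerpts below.

```go
package reload

import (
	"log"
	"time"
)

// scanner and runnerList are hypothetical stand-ins for libbeat's glob
// watcher and RunnerList; they are not the real interfaces.
type scanner interface {
	// Scan returns the current configs and whether any watched file has
	// changed since the previous scan.
	Scan() (configs []string, changed bool)
}

type runnerList interface {
	Reload(configs []string) error
}

// run reloads the runner list only when the scan reports a change or when a
// previous reload failed, instead of unconditionally on every tick.
func run(s scanner, list runnerList, period time.Duration, done <-chan struct{}) {
	// Force a reload on the first iteration so the initial configs are applied.
	forceReload := true

	ticker := time.NewTicker(period)
	defer ticker.Stop()

	for {
		select {
		case <-done:
			return
		case <-ticker.C:
		}

		configs, changed := s.Scan()
		if !changed && !forceReload {
			// Nothing changed: skip the reload. Reloading here anyway is
			// the redundant behavior this PR removes.
			continue
		}

		forceReload = false
		if err := list.Reload(configs); err != nil {
			// Keep retrying on later ticks even if the files don't change again.
			forceReload = true
			log.Printf("reload failed: %v", err)
		}
	}
}
```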

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works

How to test this PR locally

With debug output on, any beat with module reloading enabled should log "Number of module configs found: ..." every time it reloads the configuration. Running a beat without this PR shows that this happens at every reload interval, whereas with this PR the reload should happen only when a configuration file has changed.

@faec faec added bug libbeat needs_backport PR is waiting to be backported to other branches. labels Feb 19, 2020
@faec faec requested a review from ruflin February 19, 2020 21:26
@urso commented on Feb 19, 2020

Fix looks good to me. Maybe we can introduce a unit, integration, or system test that makes the old behavior fail.

@ruflin (Member) left a comment:

Could you add a changelog entry?

```go
	// Make sure the next run also updates because some runners were not properly loaded
	overwriteUpdate = true
	if err = list.Reload(configs); err != nil {
		logp.Err("Error reloading module configurations: %v", err)
```

ruflin (Member):

Please check if we should really log this on the error level. Could this reload return an error because a harvester was not finished yet? I just want to make sure we don't spam the error logs with this in some cases.

faec (author):

Looking at the test runs, the results are real errors, but they're also already logged by Reload itself, so this does end up being noisy and redundant. Removed / commented.

```diff
@@ -185,7 +185,11 @@ func (rl *Reloader) Run(runnerFactory RunnerFactory) {
 		rl.config.Reload.Period = 0
 	}
 
-	overwriteUpdate := true
+	// If forceReload is set, the configuration should be reloaded
```

ruflin (Member):

I like the renaming of the variable.

@faec faec requested a review from a team as a code owner February 24, 2020 23:08
@ruflin (Member) left a comment:

LGTM and OK to merge for me. I really like the tests you added, but I'm a bit concerned that they might become flaky. We had many such issues with timing-related tests in the past.

```go
		continue
	}
	configReloads.Add(1)
```

ruflin (Member):

Wondering if the config reload counter should only increase when it was successful? And have an additional error counter for debugging?

faec (author):

I'm hesitant, because an error counter here wouldn't be at the right granularity -- we get an error here even if almost everything reloads successfully and there's one invalid file in the search path, and it would be misleading to report that as though the reload didn't happen.

The related monitor vars libbeat.config.module.{starts, stops, running} are defined in this file also but they aren't used at all. I was planning to enable them in a followup (it requires changing the RunnerList API and modifying other call sites or I'd have done it in this PR), but it might make sense to add errors at the same time to track reload failures at module granularity instead of the whole configuration directory.

```go
	}
	// time interval is slightly more than a second so we don't slightly
	// undershoot the first iteration and wait a whole extra second.
	time.Sleep(1100 * time.Millisecond)
```

ruflin (Member):

I'm a bit worried about tests with Sleep, as they tend to get flaky sooner or later. Is there another way to achieve this "wait"?

faec (author):

I don't know a better way, though I'm open to suggestions :-) A spin lock would presumably be even worse. The difficulty is that the behavior being tested is itself timer-based. I did structure the logic here so that the test is only affected by the state changes and not other timing issues -- the only "race condition" here is that the unit test might wait several seconds without the reloader's 1-second-interval timer going off, in which case it has no way to tell whether anything's happening. But, the retry timeout was shorter than I meant it to be, so I just increased it to 10 seconds.

ruflin (Member):

Sorry, I'm late on this one. Let's see if it shows up in flaky tests in the future ;-)

What we did as workarounds in the past:

  • Check for log messages
  • Have another data point we can check very frequently for a change.

In general I think it's problematic if a single unit test increases the test suite runtime by seconds and only spends that time waiting. To improve the above, perhaps there is a way to decrease the scan frequency and do checks every 10ms but retry 1000 times.
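
One way to implement that suggestion, sketched with hypothetical helper names (this is not code from the PR or from libbeat): the test polls a cheap condition every 10ms up to an overall deadline, so it returns as soon as the reloader has acted and only pays the full wait when something is genuinely wrong.

```go
package reload

import (
	"testing"
	"time"
)

// waitFor polls cond every 10ms until it returns true or the timeout expires.
// It is a generic test-helper sketch, not an existing libbeat utility.
func waitFor(t *testing.T, timeout time.Duration, cond func() bool) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if cond() {
			return
		}
		time.Sleep(10 * time.Millisecond)
	}
	t.Fatalf("condition not met within %v", timeout)
}
```

A test could then call something like waitFor(t, 10*time.Second, func() bool { return runnerCount() == 1 }), where runnerCount stands in for whatever cheap data point the test exposes, instead of sleeping for a fixed interval.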

@faec faec requested a review from a team as a code owner February 26, 2020 12:39
@faec (author) commented on Feb 26, 2020

I finally isolated the problem with the flaky reloader tests in the CI (test_reload.py in both heartbeat and metricbeat). The tests were waiting on a log message that, with "correct" reloader behavior, should never have happened in the first place (and were thus relying on the previous broken behavior to pass).

Some details for posterity: the tests were waiting for lines like "Starting reload procedure, current runners: 1" to detect that one runner was started, but this is the log line at the beginning of the reload procedure, and so this line will generally not appear unless the configuration is changed after a runner has been started (we will instead see "current runners: 0" followed by "Start list: 1"). When we incorrectly triggered changes every iteration this was ok, and just resulted in a short delay before the target log entry appeared because of a redundant scan, but fixing the reloader prevented this.

On the other hand, the test consistently passed locally because (as mentioned elsewhere in this PR) our glob watcher rounds down to the next earlier second, so with short timer intervals some changes can be detected twice, and local test runs were still getting the redundant log line they needed to pass. However, the slower CI runs often didn't hit the next scan within exactly a second, and in that case would not trigger a redundant change.

I've changed the tests to instead trigger on the less ambiguous log lines that specify how many runners were started / stopped, which is independent of the exact timer intervals and modification times.

@faec faec merged commit 2743d51 into elastic:master Feb 26, 2020
@faec faec deleted the module-reload branch February 27, 2020 17:05
faec added a commit to faec/beats that referenced this pull request Feb 28, 2020
@faec faec added v7.7.0 and removed needs_backport PR is waiting to be backported to other branches. labels Feb 28, 2020
faec added a commit to faec/beats that referenced this pull request Feb 28, 2020
@faec faec added the v7.6.1 label Feb 28, 2020
@jsoriano (Member) commented:

> We suspect that in some cases this could cause unbounded memory consumption (we are still investigating the exact scope of the problem).

@faec do you have more info about the cases where this happens? I know that this is an issue at least when some processors that keep some connections are used at the module or input level. I have some WIP PRs (#16349) to look for a solution for this.

@faec (author) commented on Feb 28, 2020

> do you have more info about the cases where this happens?

Sort of -- that suspicion was based on a user report of apparent memory leaks in which the number of changed / loaded configurations steadily grew, which is what led to finding this bug. However, we haven't been able to reproduce the problem, because the loaded configurations (after Reload is called in this PR) are still hashed and compared before the respective runners are actually started or stopped -- this bug caused redundant configuration processing, but we haven't observed any effect on the running modules.

However, if there's an additional bug somewhere that makes the hash for a particular loaded configuration unstable, then this bug combined with that one could produce the behavior mentioned. But we don't have any direct evidence that such a bug exists (we're still waiting on more configuration details in the original discussion).
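
To illustrate why the redundant reloads appear to be benign on their own, here is a rough sketch of the hash-and-compare step described above, with hypothetical names; libbeat's RunnerList does its own hashing and differs in detail.

```go
package reload

import (
	"crypto/sha256"
	"encoding/hex"
)

// hashConfig is an illustrative fingerprint of a raw runner configuration.
func hashConfig(raw []byte) string {
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:])
}

// diff compares the freshly loaded configs against the currently running set.
// Only hashes that are new get started and only hashes that disappeared get
// stopped, so re-processing an unchanged file is a no-op. If the hash of an
// unchanged config were unstable, however, every redundant reload would look
// like a change and runners would be stopped and restarted repeatedly.
func diff(running map[string]bool, loaded [][]byte) (toStart, toStop []string) {
	seen := make(map[string]bool, len(loaded))
	for _, raw := range loaded {
		h := hashConfig(raw)
		seen[h] = true
		if !running[h] {
			toStart = append(toStart, h)
		}
	}
	for h := range running {
		if !seen[h] {
			toStop = append(toStop, h)
		}
	}
	return toStart, toStop
}
```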

faec added a commit that referenced this pull request Mar 2, 2020
faec added a commit that referenced this pull request Mar 3, 2020