[Security Solution][Detections] Pull gap detection logic out in preparation for sharing between rule types #91966

marshallmain · 2021-02-19T01:49:10Z

Summary

The goal here is to move the gap detection and remediation logic out of searchAfterAndBulkCreate so that in the future it can be shared among all rule types. In the process I refactored some of the logic to avoid calculating the same values multiple times in different places. Examining the code also exposed some bugs, which I will comment on in the diff below; they should be fixed in the refactored code.

Follow-up work:

- Establish a now timestamp that is used throughout the rule executor to prevent subtle bugs from calculations using different values for now
- Use gap remediation logic for all rule types

Checklist

Delete any items that are not applicable to this PR.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios

For maintainers

This was checked for breaking API changes and was labeled appropriately

marshallmain · 2021-02-19T02:03:58Z

...ck/plugins/security_solution/server/lib/detection_engine/signals/search_after_bulk_create.ts

@@ -64,16 +62,6 @@ export const searchAfterAndBulkCreate = async ({
  // to ensure we don't exceed maxSignals
  let signalsCreatedCount = 0;

-  const totalToFromTuples = getSignalTimeTuples({


Rather than being calculated inside searchAfterAndBulkCreate, the tuples are now calculated in signal_rule_alert_type.ts and passed into searchAfterAndBulkCreate.

marshallmain · 2021-02-19T02:09:22Z

x-pack/plugins/security_solution/server/lib/detection_engine/signals/signal_rule_alert_type.ts

+        maxSignals,
+        buildRuleMessage,
+      });
+      const remainingGap = getRemainingGap({ tuples, previousStartedAt });


getRemainingGap calculates the difference between the from date of the earliest tuple and the previousStartedAt date (i.e. the last time the rule ran). The advantage of this approach is that we don't have to know the internals of how the tuples are computed to determine whether there is a gap.
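The idea can be sketched as follows. This is a minimal illustration, not the code from this PR: it assumes plain Date values instead of the moment objects used in the real code, and the names TimeRangeTuple and getRemainingGapMs are hypothetical.

```typescript
// Hypothetical shape: the real tuples carry moment objects and more fields.
interface TimeRangeTuple {
  from: Date;
  to: Date;
}

// Returns the time (in ms) between the previous run and the earliest tuple's
// `from` that no tuple covers. Zero means the tuples reach back far enough.
function getRemainingGapMs(
  tuples: TimeRangeTuple[],
  previousStartedAt: Date | null
): number {
  if (previousStartedAt === null || tuples.length === 0) {
    return 0;
  }
  const earliestFrom = Math.min(...tuples.map((t) => t.from.getTime()));
  return Math.max(0, earliestFrom - previousStartedAt.getTime());
}

const previousRun = new Date('2021-02-19T00:00:00Z');
const exampleTuples: TimeRangeTuple[] = [
  { from: new Date('2021-02-19T00:02:00Z'), to: new Date('2021-02-19T00:07:00Z') },
  { from: new Date('2021-02-19T00:07:00Z'), to: new Date('2021-02-19T00:12:00Z') },
];
// The earliest tuple starts two minutes after the previous run.
const remainingGapMs = getRemainingGapMs(exampleTuples, previousRun);
```

Because only the earliest from is inspected, the function does not need to know how many tuples were generated or how they were spaced.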

marshallmain · 2021-02-19T02:17:42Z

x-pack/plugins/security_solution/server/lib/detection_engine/signals/utils.test.ts

-    test('it returns null if the interval is an invalid string such as "invalid"', () => {
-      const gap = getGapBetweenRuns({
-        previousStartedAt: nowDate.clone().toDate(),
-        interval: 'invalid', // if not set to "x" where x is an interval such as 6m


getGapBetweenRuns now takes an intervalDuration which has already been parsed, so this case is not needed.

marshallmain · 2021-02-19T02:18:37Z

x-pack/plugins/security_solution/server/lib/detection_engine/signals/utils.ts

-  const dateMathRuleParamsFrom = dateMath.parse(ruleParamsFrom);
-  if (dateMathRuleParamsFrom != null && intervalMoment != null) {
-    const momentUnit = shorthandMap[unit].momentString as moment.DurationInputArg2;
-    const gapDiffInUnits = dateMathRuleParamsFrom.diff(calculatedFromAsMoment, momentUnit);


gapDiffInUnits was always an integer, so if we computed the difference in units like hours it could be truncated to 0 even though there was a fractional gap (e.g. 0.1 hours).
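A small sketch of the truncation problem, using plain millisecond arithmetic rather than moment so that it is self-contained. The helper names are hypothetical; Math.trunc stands in for an integer-valued diff in hours.

```typescript
const MS_PER_HOUR = 60 * 60 * 1000;

// Integer difference in whole hours: stands in for a truncating diff.
function diffInWholeHours(laterMs: number, earlierMs: number): number {
  return Math.trunc((laterMs - earlierMs) / MS_PER_HOUR);
}

// Difference kept in milliseconds: no unit conversion, so nothing is lost.
function diffInMs(laterMs: number, earlierMs: number): number {
  return laterMs - earlierMs;
}

const earlierMs = Date.UTC(2021, 1, 19, 0, 0, 0);
const laterMs = earlierMs + 6 * 60 * 1000; // 6 minutes later, i.e. 0.1 hours

const truncatedGapHours = diffInWholeHours(laterMs, earlierMs); // the gap disappears
const preservedGapMs = diffInMs(laterMs, earlierMs); // the gap is preserved
```

Keeping the gap in milliseconds until the last moment avoids having to choose a unit that is fine-grained enough.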

marshallmain · 2021-02-19T02:21:24Z

x-pack/plugins/security_solution/server/lib/detection_engine/signals/utils.ts

-          buildRuleMessage('failed to calculate maxCatchup, ratio, or gapDiffInUnits')
-        );
-      }
-      let tempTo = dateMath.parse(ruleParamsFrom);


tempTo is converted from a "now"-based timestamp to a concrete moment here. However, time keeps advancing before ruleParamsFrom is converted into a concrete timestamp for the current rule run's tuple at the end, so we can end up with a small gap between some of the tuples.
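One way to avoid this kind of drift is to capture the clock once and resolve every relative expression against that value. The sketch below is an illustration under that assumption; resolveRelative is a hypothetical helper that understands only a tiny subset of date math ("now" and "now-<n><s|m|h>"), not the parser used in the PR.

```typescript
const UNIT_TO_MS = { s: 1000, m: 60_000, h: 3_600_000 } as const;

// Hypothetical helper: resolves "now" or "now-<n><unit>" against a fixed clock value.
function resolveRelative(expression: string, nowMs: number): number {
  const match = /^now(?:-(\d+)([smh]))?$/.exec(expression);
  if (match === null) {
    throw new Error(`Unsupported expression: ${expression}`);
  }
  if (match[1] === undefined) {
    return nowMs;
  }
  const unit = match[2] as keyof typeof UNIT_TO_MS;
  return nowMs - Number(match[1]) * UNIT_TO_MS[unit];
}

// Capture "now" once at the start of the run...
const sharedNowMs = Date.UTC(2021, 1, 19, 2, 0, 0);
// ...so every later resolution agrees, no matter when it is evaluated.
const fromEarlyInRun = resolveRelative('now-5m', sharedNowMs);
const fromLateInRun = resolveRelative('now-5m', sharedNowMs);
const toMs = resolveRelative('now', sharedNowMs);
```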

marshallmain · 2021-02-19T02:23:15Z

x-pack/plugins/security_solution/server/lib/detection_engine/signals/utils.ts

-      }`;
-      logger.debug(buildRuleMessage(`calculatedFrom: ${calculatedFrom}`));
-
-      const intervalMoment = moment.duration(parseInt(interval, 10), unit);


unit is pulled off of the from parameter, but it is used to parse the interval here. If interval and from use different units, interval is not parsed correctly.
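To make the mismatch concrete, here is a hedged sketch with hypothetical helpers and a simplified unit table. It compares reading the unit from another parameter with reading it from the interval string itself.

```typescript
const MS_BY_UNIT: Record<string, number> = { s: 1000, m: 60_000, h: 3_600_000 };

function unitToMs(unit: string): number {
  const ms = MS_BY_UNIT[unit];
  if (ms === undefined) {
    throw new Error(`Unknown unit: ${unit}`);
  }
  return ms;
}

// Buggy shape: the number comes from `interval`, the unit from somewhere else.
function intervalMsWithBorrowedUnit(interval: string, borrowedUnit: string): number {
  return parseInt(interval, 10) * unitToMs(borrowedUnit);
}

// Fixed shape: both the number and the unit are read from `interval` itself.
function intervalMsFromOwnUnit(interval: string): number {
  return parseInt(interval, 10) * unitToMs(interval.slice(-1));
}

// interval is "5m", but the unit is taken from a `from` value such as "now-6h".
const wrongIntervalMs = intervalMsWithBorrowedUnit('5m', 'h'); // treated as 5 hours
const rightIntervalMs = intervalMsFromOwnUnit('5m'); // 5 minutes
```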

marshallmain · 2021-02-19T02:32:39Z

x-pack/plugins/security_solution/server/lib/detection_engine/signals/utils.ts

+    }),
+  ];
+  try {
+    const intervalDuration = parseInterval(interval);


I tried to minimize the parsing of strings into dates and instead parse once and pass the parsed value around. I also tried to reduce the number of places we can return null or undefined, and instead do up-front checks so the rest of the code can rely on the values existing.

Here, for example, parseInterval can throw. If it does, we know that the rest of the code path that relies on intervalDuration won't be able to complete, so we skip straight to the catch rather than catching the error inside parseInterval and returning null.
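The pattern described here, parse once, fail early, and let one catch handle the fallback, might look like the sketch below. parseIntervalMs and planCatchUp are hypothetical stand-ins, not the functions in utils.ts.

```typescript
// Throws on bad input instead of returning null, so callers never see a
// half-valid value.
function parseIntervalMs(interval: string): number {
  const match = /^(\d+)([smh])$/.exec(interval);
  if (match === null) {
    throw new Error(`Invalid interval: ${interval}`);
  }
  const unitMs = match[2] === 's' ? 1000 : match[2] === 'm' ? 60_000 : 3_600_000;
  return Number(match[1]) * unitMs;
}

interface CatchUpPlan {
  catchUpEnabled: boolean;
  intervalMs?: number;
  reason?: string;
}

function planCatchUp(interval: string): CatchUpPlan {
  try {
    const intervalMs = parseIntervalMs(interval);
    // Everything from here on can rely on intervalMs being a valid number;
    // no null checks are scattered through the rest of the path.
    return { catchUpEnabled: true, intervalMs };
  } catch (err) {
    // One fallback for the whole path: run without gap remediation.
    const reason = err instanceof Error ? err.message : String(err);
    return { catchUpEnabled: false, reason };
  }
}

const validPlan = planCatchUp('5m');
const fallbackPlan = planCatchUp('invalid');
```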

marshallmain · 2021-02-19T02:37:46Z

x-pack/plugins/security_solution/server/lib/detection_engine/signals/utils.ts

-  if (duration !== null) {
-    return duration.subtract(interval);
-  } else {
-    return null;


I didn't see a way for duration to be null, however if there is a way I'll revert this change.

marshallmain · 2021-02-20T03:46:57Z

@elasticmachine merge upstream

elasticmachine · 2021-02-20T05:58:04Z

Pinging @elastic/security-solution (Team: SecuritySolution)

elasticmachine · 2021-02-20T05:58:04Z

Pinging @elastic/security-detections-response (Team:Detections and Resp)

marshallmain · 2021-02-22T15:59:59Z

@elasticmachine merge upstream

madirey · 2021-02-23T18:13:18Z

x-pack/plugins/security_solution/server/lib/detection_engine/signals/signal_rule_alert_type.ts

+      if (remainingGap.asMilliseconds() > 0) {
+        const gapString = remainingGap.humanize();
+        const gapMessage = buildRuleMessage(
+          `${gapString} (${remainingGap.asMilliseconds()}ms) has passed since last rule execution, and signals may have been missed.`,


Let's be more clear with this wording (it's not the number of milliseconds that has passed... it's the gap).

madirey · 2021-02-23T18:23:06Z

x-pack/plugins/security_solution/server/lib/detection_engine/signals/utils.ts

-    ratio: null,
-    gapDiffInUnits: null,
-  };
+  const ratio = Math.ceil(gapInMilliseconds / intervalDuration.asMilliseconds());


Can intervalDuration.asMilliseconds() be 0?

I hope that gets prevented in the alerting framework, but it's a good point. I added a check to prevent divide by 0 here.
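A guard of this kind could look like the following sketch. The function name and the choice to return 0 for a non-positive interval are assumptions made for illustration; the check added in the PR may differ.

```typescript
// Number of whole intervals needed to cover the gap, rounded up.
// A non-positive interval yields 0 instead of Infinity or NaN.
function getCatchupRatio(gapMs: number, intervalMs: number): number {
  if (intervalMs <= 0) {
    return 0;
  }
  return Math.ceil(gapMs / intervalMs);
}

const ratioForPartialInterval = getCatchupRatio(90_000, 60_000); // 1.5 intervals rounds up
const ratioForZeroInterval = getCatchupRatio(90_000, 0); // guarded
```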

madirey · 2021-02-23T18:27:02Z

x-pack/plugins/security_solution/server/lib/detection_engine/signals/utils.ts

+  try {
+    const intervalDuration = parseInterval(interval);
+    const gap = getGapBetweenRuns({ previousStartedAt, intervalDuration, from, to });
+    const catchup = getGapMaxCatchupRatio({


Can we make it more clear that this is the number of intervals to use to catch up?

madirey · 2021-02-23T23:25:44Z

...ck/plugins/security_solution/server/lib/detection_engine/signals/search_after_bulk_create.ts

  mergeReturns,
  mergeSearchResults,
 } from './utils';
 import { SearchAfterAndBulkCreateParams, SearchAfterAndBulkCreateReturnType } from './types';

 // search_after through documents and re-index using bulk endpoint.
 export const searchAfterAndBulkCreate = async ({
-  gap,
-  previousStartedAt,
+  tuples: totalToFromTuples,


Nit: maybe a more descriptive name than tuples? e.g. timeRangeTuples ...

madirey · 2021-02-23T23:28:30Z

x-pack/plugins/security_solution/server/lib/detection_engine/signals/signal_rule_alert_type.ts

+        const gapString = remainingGap.humanize();
+        const gapMessage = buildRuleMessage(
+          `${gapString} (${remainingGap.asMilliseconds()}ms) were not queried between this rule execution and the last execution, so signals may have been missed.`,
+          'Consider increasing your look behind time or adding more Kibana instances.'


madirey · 2021-02-23T23:30:16Z

x-pack/plugins/security_solution/server/lib/detection_engine/signals/utils.test.ts

-        now: nowDate.clone(),
-      });
-      expect(gap).toBeNull();
+      expect(gap.asMilliseconds()).toEqual(0);


madirey · 2021-02-23T23:35:03Z

x-pack/plugins/security_solution/server/lib/detection_engine/signals/utils.test.ts

-      const someTuple = someTuples[1];
-      expect(moment(someTuple.to).diff(moment(someTuple.from), 's')).toEqual(10);
+      const someTuple = tuples[1];
+      expect(moment(someTuple.to).diff(moment(someTuple.from), 's')).toEqual(55);


So the tuples were reversed from the previous functionality, essentially?

The previous implementation had a special case for when the gap was less than one interval, which was the scenario in this test. In that case it created only one extra tuple, scaled so that it covered only the gap duration rather than a full rule interval. The refactored implementation always uses interval + lookback as the tuple duration for consistency, so the difference to - from here changed from 10s (the gap duration) to 55s (interval + lookback).

I'm not sure this is the appropriate avenue to pursue. If a rule is scheduled to run weekly but only had a gap of an hour, it's going to schedule a second search over a week's worth of data that it had already searched, instead of just the hour's gap? That doesn't seem efficient.

I left a comment in the code explaining some of the reasoning behind this change. The gist of it is that for some rule types a consistent query duration is important, and the overlap between consecutive rule runs affects the behavior as well. Since we have to be able to handle up to 4 extra gap-covering queries in a rule execution anyway, I think the benefit of having consistent query durations outweighs the cost of extending the partial time duration to one full time duration.

I would suggest re-working it in such a way that the default is the original functionality and the extended lookback is updated to use the overlap if the given rule type requires it. This way the lookback does not default to a full rule interval if the gap is less than that.

Will the dupes be ignored? In the case of threshold, they will. If they're ignored for all rule types, it seems fine to me to leave the consistent query duration... if we're experiencing gaps frequently, then probably something else is wrong?

All rule types have duplicate detection, and must have it due to the expected overlap between query time ranges, so dupes will be ignored. @madirey my reasoning is the same: gaps should be very infrequent, so ensuring correctness by making the code easy to reason about was a higher priority to me than optimizing for each case.
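The consistent-duration approach discussed in this thread can be sketched roughly as below. This is an illustration, not the function from utils.ts: buildTuples and its parameters are hypothetical, times are plain millisecond numbers, and the cap of 4 extra tuples is taken from the comment above.

```typescript
interface TimeRange {
  from: number; // epoch ms
  to: number; // epoch ms
}

// Builds the current run's range plus up to `maxCatchup` earlier ranges.
// Every range keeps the same duration (interval + lookback); each earlier
// range is the previous one shifted back by one interval, so consecutive
// ranges overlap by the lookback, the same way consecutive rule runs do.
function buildTuples(
  from: number,
  to: number,
  intervalMs: number,
  gapMs: number,
  maxCatchup = 4
): TimeRange[] {
  const tuples: TimeRange[] = [{ from, to }];
  if (intervalMs <= 0 || gapMs <= 0) {
    return tuples;
  }
  const extraTuples = Math.min(Math.ceil(gapMs / intervalMs), maxCatchup);
  let currentFrom = from;
  let currentTo = to;
  for (let i = 0; i < extraTuples; i++) {
    currentFrom -= intervalMs;
    currentTo -= intervalMs;
    tuples.push({ from: currentFrom, to: currentTo });
  }
  return tuples.reverse(); // oldest range first
}

// Illustrative numbers: 50s interval, 5s lookback, 10s gap.
const runTo = Date.UTC(2021, 1, 23, 0, 0, 0);
const runFrom = runTo - 55_000;
const ranges = buildTuples(runFrom, runTo, 50_000, 10_000);
```

With these inputs a gap shorter than one interval still produces one extra range of the full 55s, which is the trade-off debated above: some redundant querying in exchange for uniform query durations.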

madirey · 2021-02-23T23:57:19Z

x-pack/plugins/security_solution/server/lib/detection_engine/signals/utils.ts

+  const intervalInMilliseconds = intervalDuration.asMilliseconds();
+  let currentTo = to;
+  let currentFrom = from;
+  // This loop will create tuples with overlapping time ranges, the same way rule runs have overlapping time


marshallmain · 2021-02-24T18:51:41Z

@elasticmachine merge upstream

kibanamachine · 2021-02-24T21:05:39Z

💛 Build succeeded, but was flaky

continuous-integration/kibana-ci/pull-request
Commit: 716e09c
Storybooks Preview
Documentation Changes
Flaky suites:
- xpack-securitySolutionCypressChrome

Test Failures

Kibana Pipeline / general / "before all" hook for "should contain notes".Timeline notes tab "before all" hook for "should contain notes"

Link to Jenkins

Stack Trace

Failed Tests Reporter:
  - Test has not failed recently on tracked branches

AssertionError: Timed out retrying after 60000ms: Expected to find element: `[data-test-subj="add-a-note"] textarea`, but never found it.

Because this error occurred during a `before all` hook we are skipping the remaining tests in the current suite: `Timeline notes tab`

Although you have test retries enabled, we do not retry tests when `before all` or `after all` hooks fail
    at Object.addNotesToTimeline (http://localhost:6121/__cypress/tests?p=cypress/integration/timelines/notes_tab.spec.ts:15917:8)
    at Context.eval (http://localhost:6121/__cypress/tests?p=cypress/integration/timelines/notes_tab.spec.ts:15046:28)

Metrics [docs]

✅ unchanged

History

💚 Build #108888 succeeded 3154687
💔 Build #108868 failed a2f6589
💚 Build #108872 succeeded f867b10
💛 Build #108455 was flaky cebf49f
💚 Build #108220 succeeded 652a4a7

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

FrankHassanabad

LGTM,

There were lengthy conversations outside of GitHub about a few items that I'll add here.

It seems that, now that the rules have grown in variety and type, we might need individual rules to be able to select their backtracking strategy rather than having one size fit all.

For example, indicator match and KQL rules only need to clear the gap and do not need to go back and de-duplicate beyond it, whereas threshold and EQL rules need to clear the gap and re-examine their last time segment to ensure they didn't miss something within their max spans/aggs because of a time boundary.

As pointed out in the comments, rules that don't need to go back to the last segment and have a long interval of, say, 30 minutes but a short gap of, say, < 1 minute will incur a higher cost from querying and de-duplicating if they fall behind momentarily than they did before.

However, the 4-segment backtracking approach, where it's gap + last segment, is a good default for a rule that doesn't have a backtracking preference: it is the lowest common denominator in that it works for all rule types with regard to correctness.

As rules and rule types increase, we will more than likely want individual rules to choose what's best for them while keeping an easy fallback such as the 4-segment backtracking one. We might always stay with this one strategy, but as usual we revisit decisions from time to time as feedback rolls in and adjust as needed.

In the discussions, I think everyone is on the same page that rules should be clearing gaps only as rare events, because gaps are usually a sign that rule runs are already behind because the system is over-subscribed. Feedback to date from most teams and forum posters is that either the rules are running fine or they go 💥 kaboom rather quickly and then adjustments are made.

Typically we expect gaps to show up only when Kibana is rebooted or brought down for maintenance, or during unexpected, very short-lived surges in operation. Otherwise the system is over-subscribed, in which case it should be upgraded/fixed/maintained/tuned if possible to remove the gap messages.
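The per-rule choice with a shared fallback described in this comment could be sketched as follows. Nothing like this exists in the PR; the strategy names and rule type labels are hypothetical and only illustrate the shape of the idea.

```typescript
// Hypothetical strategies: clear only the gap, or clear the gap and also
// re-examine the last time segment.
type BacktrackStrategy = 'gapOnly' | 'gapPlusLastSegment';

// The fallback is the strategy that is correct for every rule type.
const DEFAULT_STRATEGY: BacktrackStrategy = 'gapPlusLastSegment';

// Hypothetical per-rule-type preferences; rule types not listed use the default.
const PREFERRED_STRATEGY = new Map<string, BacktrackStrategy>([
  ['indicator_match', 'gapOnly'],
  ['kql', 'gapOnly'],
  ['threshold', 'gapPlusLastSegment'],
  ['eql', 'gapPlusLastSegment'],
]);

function strategyFor(ruleType: string): BacktrackStrategy {
  return PREFERRED_STRATEGY.get(ruleType) ?? DEFAULT_STRATEGY;
}

const kqlStrategy = strategyFor('kql');
const unknownRuleStrategy = strategyFor('some_future_rule_type');
```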

…ration for sharing between rule types (elastic#91966) * Pull gap detection logic out in preparation for sharing between rule types * Remove comments and unused import * Remove unncessary function, cleanup, comment * Update comment * Address PR comments * remove unneeded mocks * Undo change to parseInterval * Remove another unneeded mock Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

…ration for sharing between rule types (#91966) (#92822) * Pull gap detection logic out in preparation for sharing between rule types * Remove comments and unused import * Remove unncessary function, cleanup, comment * Update comment * Address PR comments * remove unneeded mocks * Undo change to parseInterval * Remove another unneeded mock Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

…ration for sharing between rule types (elastic#91966) * Pull gap detection logic out in preparation for sharing between rule types * Remove comments and unused import * Remove unncessary function, cleanup, comment * Update comment * Address PR comments * remove unneeded mocks * Undo change to parseInterval * Remove another unneeded mock Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com> # Conflicts: # x-pack/plugins/security_solution/server/lib/detection_engine/signals/signal_rule_alert_type.ts

…ration for sharing between rule types (#91966) (#93787) * Pull gap detection logic out in preparation for sharing between rule types * Remove comments and unused import * Remove unncessary function, cleanup, comment * Update comment * Address PR comments * remove unneeded mocks * Undo change to parseInterval * Remove another unneeded mock Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com> # Conflicts: # x-pack/plugins/security_solution/server/lib/detection_engine/signals/signal_rule_alert_type.ts

Pull gap detection logic out in preparation for sharing between rule …

1040fd3

…types

marshallmain commented Feb 19, 2021

marshallmain added 3 commits February 18, 2021 21:40

Remove comments and unused import

14fdb10

Remove unncessary function, cleanup, comment

0eb8606

Update comment

267263f

Merge branch 'master' into share-gap-detection

652a4a7

marshallmain marked this pull request as ready for review February 20, 2021 05:57

marshallmain requested a review from a team as a code owner February 20, 2021 05:57

Merge branch 'master' into share-gap-detection

cebf49f

spong requested a review from a team February 22, 2021 18:19

madirey reviewed Feb 23, 2021

marshallmain added 5 commits February 23, 2021 14:35

Address PR comments

de875af

Merge branch 'share-gap-detection' of github.com:marshallmain/kibana …

a2f6589

…into share-gap-detection

remove unneeded mocks

f867b10

Undo change to parseInterval

f3867dd

Remove another unneeded mock

3154687

madirey reviewed Feb 23, 2021

Merge branch 'master' into share-gap-detection

716e09c

madirey approved these changes Feb 24, 2021

FrankHassanabad approved these changes Feb 25, 2021

marshallmain merged commit f0838e6 into elastic:master Feb 25, 2021

marshallmain mentioned this pull request Feb 25, 2021

[7.x] [Security Solution][Detections] Pull gap detection logic out in preparation for sharing between rule types (#91966) #92822

Merged

marshallmain added the v7.12.0 label Mar 5, 2021

marshallmain mentioned this pull request Mar 5, 2021

[7.12] [Security Solution][Detections] Pull gap detection logic out in preparation for sharing between rule types (#91966) #93787

Merged
