-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Security Solution][Detections] Pull gap detection logic out in preparation for sharing between rule types #91966
[Security Solution][Detections] Pull gap detection logic out in preparation for sharing between rule types #91966
Conversation
@@ -64,16 +62,6 @@ export const searchAfterAndBulkCreate = async ({ | |||
// to ensure we don't exceed maxSignals | |||
let signalsCreatedCount = 0; | |||
|
|||
const totalToFromTuples = getSignalTimeTuples({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than calculate the tuples inside searchAfterBulkCreate
, they are calculated in signal_rule_alert_type.ts
and passed into searchAfterBulkCreate
.
maxSignals, | ||
buildRuleMessage, | ||
}); | ||
const remainingGap = getRemainingGap({ tuples, previousStartedAt }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getRemainingGap
calculates the difference between the from
date for the earliest tuple and the previousStartedAt
date (aka last time the rule ran). The advantage of this approach is we don't have to know about internals of how the tuples are computed to determine if there is a gap.
x-pack/plugins/security_solution/server/lib/detection_engine/signals/signal_rule_alert_type.ts
Show resolved
Hide resolved
test('it returns null if the interval is an invalid string such as "invalid"', () => { | ||
const gap = getGapBetweenRuns({ | ||
previousStartedAt: nowDate.clone().toDate(), | ||
interval: 'invalid', // if not set to "x" where x is an interval such as 6m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getGapBetweenRuns
now takes an intervalDuration
which has already been parsed, so this case is not needed.
const dateMathRuleParamsFrom = dateMath.parse(ruleParamsFrom); | ||
if (dateMathRuleParamsFrom != null && intervalMoment != null) { | ||
const momentUnit = shorthandMap[unit].momentString as moment.DurationInputArg2; | ||
const gapDiffInUnits = dateMathRuleParamsFrom.diff(calculatedFromAsMoment, momentUnit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gapDiffInUnits
was always an integer, so if we compute the difference using units like hours then it can be truncated to 0, even though we should have a decimal gap (e.g. 0.1 hours)
buildRuleMessage('failed to calculate maxCatchup, ratio, or gapDiffInUnits') | ||
); | ||
} | ||
let tempTo = dateMath.parse(ruleParamsFrom); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tempTo is converted from a "now" based timestamp to a concrete moment here. However, time keeps advancing before ruleParamsFrom is converted into a concrete timestamp for the current rule run tuple at the end. Thus we can end up with a small gap between some of the tuples.
}`; | ||
logger.debug(buildRuleMessage(`calculatedFrom: ${calculatedFrom}`)); | ||
|
||
const intervalMoment = moment.duration(parseInt(interval, 10), unit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unit
is pulled off of the from
parameter, but is being used to parse the interval
here. If interval
and from
use different units, interval
wouldn't be parsed correctly.
}), | ||
]; | ||
try { | ||
const intervalDuration = parseInterval(interval); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to minimize the parsing of strings into dates and instead parse once and pass the parsed value around. I also tried to reduce the number of places we can return null
or undefined
, and instead do up-front checks so the rest of the code can rely on the values existing.
Here, for example, parseInterval
can throw, in which case we know that the rest of code path here that relies on intervalDuration
won't be able to complete so we skip straight to the catch
rather than catching the error inside parseInterval
and returning null.
if (duration !== null) { | ||
return duration.subtract(interval); | ||
} else { | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see a way for duration to be null, however if there is a way I'll revert this change.
@elasticmachine merge upstream |
Pinging @elastic/security-solution (Team: SecuritySolution) |
Pinging @elastic/security-detections-response (Team:Detections and Resp) |
@elasticmachine merge upstream |
if (remainingGap.asMilliseconds() > 0) { | ||
const gapString = remainingGap.humanize(); | ||
const gapMessage = buildRuleMessage( | ||
`${gapString} (${remainingGap.asMilliseconds()}ms) has passed since last rule execution, and signals may have been missed.`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's be more clear with this wording (it's not the number of milliseconds that has passed... it's the gap).
ratio: null, | ||
gapDiffInUnits: null, | ||
}; | ||
const ratio = Math.ceil(gapInMilliseconds / intervalDuration.asMilliseconds()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can intervalDuration.asMilliseconds()
be 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope that gets prevented in the alerting framework, but it's a good point. I added a check to prevent divide by 0 here.
try { | ||
const intervalDuration = parseInterval(interval); | ||
const gap = getGapBetweenRuns({ previousStartedAt, intervalDuration, from, to }); | ||
const catchup = getGapMaxCatchupRatio({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make it more clear that this is the number of intervals to use to catch up?
mergeReturns, | ||
mergeSearchResults, | ||
} from './utils'; | ||
import { SearchAfterAndBulkCreateParams, SearchAfterAndBulkCreateReturnType } from './types'; | ||
|
||
// search_after through documents and re-index using bulk endpoint. | ||
export const searchAfterAndBulkCreate = async ({ | ||
gap, | ||
previousStartedAt, | ||
tuples: totalToFromTuples, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: maybe a more descriptive name than tuples
? e.g. timeRangeTuples
...
const gapString = remainingGap.humanize(); | ||
const gapMessage = buildRuleMessage( | ||
`${gapString} (${remainingGap.asMilliseconds()}ms) were not queried between this rule execution and the last execution, so signals may have been missed.`, | ||
'Consider increasing your look behind time or adding more Kibana instances.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
now: nowDate.clone(), | ||
}); | ||
expect(gap).toBeNull(); | ||
expect(gap.asMilliseconds()).toEqual(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
const someTuple = someTuples[1]; | ||
expect(moment(someTuple.to).diff(moment(someTuple.from), 's')).toEqual(10); | ||
const someTuple = tuples[1]; | ||
expect(moment(someTuple.to).diff(moment(someTuple.from), 's')).toEqual(55); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the tuples were reversed from the previous functionality, essentially?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous implementation had a special case for if the gap was <1 interval which was the scenario in this test. In this special case it would create only one extra tuple which was scaled so it covered only the gap duration rather than a full rule interval duration. The refactored implementation always uses interval + lookback
as the tuple duration for consistency, so the difference to - from
here changed from 10s (the gap duration) to 55s (interval + lookback)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is the appropriate avenue to pursue. If a rule is scheduled to run weekly but only had a gap of an hour, it's going to schedule a second search over a weeks worth of data that it had already searched over instead of just the hours gap? That doesn't seem efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a comment in the code explaining some of the reasoning behind this change. The gist of it is that for some rule types a consistent query duration is important, and the overlap between consecutive rule runs affects the behavior as well. Since we have to be able to handle up to 4 extra gap-covering queries in a rule execution anyway I think the benefit of having consistent query durations outweighs the cost of extending the partial time duration to 1 full time duration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest re-working it in such a way that the default is the original functionality and the extended lookback is updated to use the overlap if the given rule type requires it. This way the lookback does not default to a full rule interval if the gap is less than that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will the dupes be ignored? In the case of threshold, they will. If they're ignored for all rule types, it seems fine to me to leave the consistent query duration... if we're experiencing gaps frequently, then probably something else is wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All rule types have duplicate detection and must have it due to the expected overlap between query time ranges, so dupes will be ignored. @madirey my reasoning is the same, the gaps should be very infrequent so ensuring correctness by making the code easy to reason about was a higher priority to me than optimizing for each case.
const intervalInMilliseconds = intervalDuration.asMilliseconds(); | ||
let currentTo = to; | ||
let currentFrom = from; | ||
// This loop will create tuples with overlapping time ranges, the same way rule runs have overlapping time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@elasticmachine merge upstream |
💛 Build succeeded, but was flaky
Test FailuresKibana Pipeline / general / "before all" hook for "should contain notes".Timeline notes tab "before all" hook for "should contain notes"Stack Trace
Metrics [docs]
History
To update your PR or re-run it, just comment with: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM,
There was long lengthy conversations outside of github about a few items I'll add here.
It seems like maybe now that the rules have grown in variety and types we might need individual rules be able to select their backtracking choices rather than a 1 size fits all.
Examples are indicator matches and KQL searches need to only clear the gap and not go back and de-duplicate beyond the gap where thresholds and EQL need to clear the gap + re-look at their last time segment to ensure they didn't miss something within their max spans/aggs because of a time boundary.
As pointed out in the comments, rules which don't need to go back to the last segment and have a long interval of say 30 minutes but a short gap of say < 1 minute will end up incurring a higher cost from querying and de-duplicating if they fall behind momentarily vs before.
However, a good default such as the 4 segment backtracking one where it's gap + last segment is a good default if a rule doesn't have a preference for backtracking as it is the lowest common denominator in that it will work for all rule types with regards to correctness.
As rules and rule types increase we will more than likely want individual rules to choose what's the best for them but have an easy to choose fall back such as the 4 segment back-tracking one. We might always stay with this one strategy but as usual we re-visit decisions and things from time to time as feedback rolls in and adjust as needed.
In the discussions, I think everyone is on the same page that rules shouldn't be clearing gaps other than as rare events because that is usually a sign the rule runs are already behind because the system is over-subscribed. Feedback from most teams and forum posters so far to date is that either the rules are running fine or they go 💥 kaboom rather quickly and then adjustments are made.
Typically only when Kibana is rebooted or brought down for maintenance or unexpected surges in operation that are very short lived happen do we expect the gaps to be showing up. Either that or the system is over-subscribed in which case it should be upgraded/fixed/maintained/tuned if possible to remove the gap messages.
…ration for sharing between rule types (elastic#91966) * Pull gap detection logic out in preparation for sharing between rule types * Remove comments and unused import * Remove unncessary function, cleanup, comment * Update comment * Address PR comments * remove unneeded mocks * Undo change to parseInterval * Remove another unneeded mock Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
…ration for sharing between rule types (#91966) (#92822) * Pull gap detection logic out in preparation for sharing between rule types * Remove comments and unused import * Remove unncessary function, cleanup, comment * Update comment * Address PR comments * remove unneeded mocks * Undo change to parseInterval * Remove another unneeded mock Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
…tiple-searchable-snapshot-actions * 'master' of github.com:elastic/kibana: [Rollup] Fix use of undefined value in JS import (elastic#92791) [ILM] Fix replicas not showing (elastic#92782) [Event Log] Extended README.md with the documentation for a REST API and Start plugin contract. (elastic#92562) [XY] Enables page reload toast for the legacyChartsLibrary setting (elastic#92811) [Security Solution][Case] Improve hooks (elastic#89580) [Security Solution] Update wordings and breadcrumb for timelines page (elastic#90809) [Security Solution] Replace EUI theme with mocks in jest suites (elastic#92462) docs: ✏️ use correct heading level (elastic#92806) [ILM ] Fix logic for showing/hiding recommended allocation on Cloud (elastic#90592) [Security Solution][Detections] Pull gap detection logic out in preparation for sharing between rule types (elastic#91966) [core.savedObjects] Remove _shard_doc tiebreaker since ES now adds it automatically. (elastic#92295) docs: ✏️ fix links in embeddable plugin readme (elastic#92778) # Conflicts: # x-pack/plugins/index_lifecycle_management/public/application/sections/edit_policy/components/phases/shared_fields/searchable_snapshot_field/searchable_snapshot_field.tsx
…ration for sharing between rule types (elastic#91966) * Pull gap detection logic out in preparation for sharing between rule types * Remove comments and unused import * Remove unncessary function, cleanup, comment * Update comment * Address PR comments * remove unneeded mocks * Undo change to parseInterval * Remove another unneeded mock Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com> # Conflicts: # x-pack/plugins/security_solution/server/lib/detection_engine/signals/signal_rule_alert_type.ts
…ration for sharing between rule types (#91966) (#93787) * Pull gap detection logic out in preparation for sharing between rule types * Remove comments and unused import * Remove unncessary function, cleanup, comment * Update comment * Address PR comments * remove unneeded mocks * Undo change to parseInterval * Remove another unneeded mock Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com> # Conflicts: # x-pack/plugins/security_solution/server/lib/detection_engine/signals/signal_rule_alert_type.ts
Summary
The goal here is to move the gap detection and remediation logic out of
searchAfterBulkCreate
so that in the future it can be shared among all rule types. In the process I refactored some of the logic to avoid calculating values multiple times in different places. Examining the code also exposed some bugs which I will comment on in the diff below and should be fixed in the refactored code.Follow up work:
now
timestamp that is used throughout rule executor to prevent subtle bugs from calculations using different values fornow
Checklist
Delete any items that are not applicable to this PR.
For maintainers