[Security Solution][Detections] Optimize rule fetching endpoints [WIP] #90639

banderror · 2021-02-08T15:52:28Z

Needed for: #89877

Summary

In order to be able to implement in-memory rule management table, we need these endpoints to be fast and scalable in terms of the number of requested/existing rules:

api/detection_engine/rules/_find
api/detection_engine/rules/_find_statuses
api/detection_engine/rules/prepackaged/_status

This PR attempts to do that.

Details

Currently, the endpoints mentioned above generate ~2N requests to Elasticsearch. I guess, we did this partly because of current limitations of saved object client and alerting client APIs.

Instead, to be able to fetch for example N latest statuses of N rules, and do it in one query to Elasticsearch instead of N queries, we need to be able to do term + terms queries and aggregations.

# Bulk fetch current (latest) rule execution statuses (one per each rule) by a list of alert ids
GET .kibana/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "siem-detection-engine-rule-status" } },
        {
          "terms": {
            "siem-detection-engine-rule-status.alertId": [
              "b921e020-6a13-11eb-bf61-5f56ac457c9b",
              "4bc22346-6a0f-11eb-b39f-e95fdc584ff7"
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "rules": {
      "terms": {
        "field": "siem-detection-engine-rule-status.alertId",
        "size": 2
      },
      "aggs": {
        "latest_status": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "siem-detection-engine-rule-status.statusDate": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

# Bulk fetch rule actions (one object per each rule) by a list of alert ids
GET .kibana/_search
{
  "size": 2,
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "siem-detection-engine-rule-actions" } },
        {
          "terms": {
            "siem-detection-engine-rule-actions.ruleAlertId": [
              "b921e020-6a13-11eb-bf61-5f56ac457c9b",
              "4bc22346-6a0f-11eb-b39f-e95fdc584ff7"
            ]
          }
        }
      ]
    }
  }
}

In this PR I'm trying to use ElasticsearchClient directly. I'm building a custom query, executing it via ElasticsearchClient.search() API and manually normalizing the response to a desired format of our domain entities. Search request directly queries .kibana index.

This naive hacky-dirty implementation works locally and reduces the api/detection_engine/rules/_find roundtrip time from 2-5 seconds down to 300 - 1100 ms when requesting 500 rules.

Notes on data model

We have detections rules which are implemented as "alerts" in terms of alerting framework, and rule statuses and rule actions implemented as what we call "sidecar" saved objects.

Detection rules are saved objects with "type": "alert" and "alert.alertTypeId": "siem.signals".
Rule statuses are saved objects with "type": "siem-detection-engine-rule-status".
Rule actions are saved objects with "type": "siem-detection-engine-rule-actions".

Rule statuses and actions objects have "foreign keys" pointing to detections rules via their ids. E.g. if a rule has an "_id": "alert:b921e020-6a13-11eb-bf61-5f56ac457c9b", its status object will contain a foreign key "alertId": "b921e020-6a13-11eb-bf61-5f56ac457c9b" in its saved object attributes.

Association types:

Rule --* Status (one to many, statuses are time based, and we store sort of status history instead of only the current status)
Rule -- Actions (one to one, actions is a single object containing all actions of the rule)

Example for rule with id b921e020-6a13-11eb-bf61-5f56ac457c9b:

# The rule itself
GET .kibana/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "alert" } },
        { "term": { "alert.alertTypeId": "siem.signals" } },
        { "term": { "_id": "alert:b921e020-6a13-11eb-bf61-5f56ac457c9b" } }
      ]
    }
  }
}

# Sidecar statuses of the rule
GET .kibana/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "siem-detection-engine-rule-status" } },
        { "term": { "siem-detection-engine-rule-status.alertId": "b921e020-6a13-11eb-bf61-5f56ac457c9b" } }
      ]
    }
  }
}

# Sidecar actions of the rule
GET .kibana/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "siem-detection-engine-rule-actions" } },
        { "term": { "siem-detection-engine-rule-actions.ruleAlertId": "b921e020-6a13-11eb-bf61-5f56ac457c9b" } }
      ]
    }
  }
}

Checklist

Delete any items that are not applicable to this PR.

Any text added follows EUI's writing guidelines, uses sentence case text and includes i18n support
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios
Any UI touched in this PR is usable by keyboard only (learn more about keyboard accessibility)
Any UI touched in this PR does not create any new axe failures (run axe in browser: FF, Chrome)
If a plugin configuration key changed, check if it needs to be allowlisted in the cloud and added to the docker list
This renders correctly on smaller devices using a responsive layout. (You can test this in your browser)
This was checked for cross-browser compatibility

For maintainers

This was checked for breaking API changes and was labeled appropriately

banderror · 2021-02-08T16:06:51Z

x-pack/plugins/security_solution/server/lib/detection_engine/routes/rules/find_rules_route.ts

+        const [ruleActions, ruleStatuses] = await Promise.all([
+          bulkFetchRuleActionsOfManyRules({ savedObjectsClient, ruleIds }),
+          bulkFetchLatestStatusesOfManyRules(elasticsearchClient.asCurrentUser, ruleIds),
+        ]);


Please review this from the security standpoint (access privileges, isolation (spaces), error handling, potential issues, etc.).

Is it ok to query .kibana index via elasticsearchClient.asCurrentUser, or privileges to .kibana are not guaranteed to all users by default?

I guess it would be better to query asInternalUser. Would it be safe in this particular case?

banderror · 2021-02-08T16:09:49Z

x-pack/plugins/security_solution/server/lib/detection_engine/routes/rules/find_rules_route.ts

+const bulkFetchLatestStatusesOfManyRules = async (
+  client: ElasticsearchClient,
+  ruleIds: string[]
+): Promise<IRuleSavedAttributesSavedObjectAttributes[]> => {
+  // This is not only a performance optimization. If ids is empty, the search
+  // request to Elasticsearch using terms aggregation will throw a 400 error,
+  // which is not what we want here.
+  if (ruleIds.length === 0) {
+    return [];
+  }
+
+  const response = await client.search({
+    index: '.kibana',
+    ignore_unavailable: true,
+    size: 0,
+    body: {
+      size: 0,
+      query: {
+        bool: {
+          filter: [
+            { term: { type: 'siem-detection-engine-rule-status' } },
+            { terms: { 'siem-detection-engine-rule-status.alertId': ruleIds } },
+          ],
+        },
+      },
+      aggs: {
+        rules: {
+          terms: {
+            field: 'siem-detection-engine-rule-status.alertId',
+            size: ruleIds.length,
+          },
+          aggs: {
+            latest_status: {
+              top_hits: {
+                size: 1,
+                sort: [{ 'siem-detection-engine-rule-status.statusDate': { order: 'desc' } }],
+              },
+            },
+          },
+        },
+      },
+    },
+  });
+
+  const ruleBuckets = response.body.aggregations?.rules?.buckets ?? [];
+  const statusHits = ruleBuckets.map((bucket: any) => {
+    const [firstHit] = bucket?.latest_status?.hits?.hits ?? [];
+    return firstHit ?? null;
+  });
+  const statuses = statusHits.map((hit: any) => {
+    return hit?._source?.['siem-detection-engine-rule-status'] ?? null;
+  });
+
+  return statuses.filter(Boolean) as IRuleSavedAttributesSavedObjectAttributes[];
+};


Please review this from the security standpoint (access privileges, isolation (spaces), error handling, potential issues, etc.).

Note that:

Detection rules are saved objects with "type": "alert" and "alert.alertTypeId": "siem.signals".

Rule statuses are saved objects with "type": "siem-detection-engine-rule-status".

Rule statuses have "foreign keys" pointing to detections rules via their ids. If a rule has an "_id": "alert:b921e020-6a13-11eb-bf61-5f56ac457c9b", its status object will contain a foreign key "alertId": "b921e020-6a13-11eb-bf61-5f56ac457c9b" in its saved object attributes.

In this endpoint handler ruleIds are not coming from request parameters. We retrieve them via alerting client which uses saved objects client. Which should handle spaces and RBAC for us correctly (I guess?).

See more notes on our data model in the PR description.

banderror · 2021-02-08T16:10:58Z

x-pack/plugins/security_solution/server/lib/detection_engine/routes/rules/find_rules_route.ts

+  }
+
+  const response = await client.search({
+    index: '.kibana',


TODO: Kibana index name should be retrieved from the config.

kibanamachine · 2021-02-11T13:49:47Z

💔 Build Failed

continuous-integration/kibana-ci/pull-request
Commit: a7836c6
Pipeline Steps (look for red circles / failed steps)
Interpreting CI Failures

Failed CI Steps

Test Failures

Kibana Pipeline / general / "before all" hook for "should open a modal".Open timeline Open timeline modal "before all" hook for "should open a modal"

Link to Jenkins

Stack Trace

Failed Tests Reporter:
  - Test has not failed recently on tracked branches

AssertionError: Timed out retrying after 60000ms: Expected to find element: `[data-test-subj="title-c10fd380-6c77-11eb-a1ff-69f5dff48f73"]`, but never found it.

Because this error occurred during a `before all` hook we are skipping the remaining tests in the current suite: `Open timeline`

Although you have test retries enabled, we do not retry tests when `before all` or `after all` hooks fail
    at Object.openTimelineById (http://localhost:61131/__cypress/tests?p=cypress/integration/timelines/open_timeline.spec.ts:16072:15)
    at Context.eval (http://localhost:61131/__cypress/tests?p=cypress/integration/timelines/open_timeline.spec.ts:15046:28)

Metrics [docs]

✅ unchanged

History

💔 Build #105186 failed 471dab6
💔 Build #104761 failed d57a937

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

banderror · 2021-02-12T16:03:02Z

We haven't got an approval from Kibana platform security team, unfortunately.

This sounds like a use-case for an 'advanced find API' that we've been considering adding: #84729
We haven't settled on a technical solution to how we would implement this safely yet, so any context you could add to that issue would really be helpful.
We won't have that implemented in a week like you need, but it could be the 'proper' solution to switch to later once we have implemented it. That said, querying the .kibana index direction is a non-trivial undertaking that requires a lot of careful consideration around authz. I'm worried about a temporary workaround that isn't tested very thoroughly. That has bitten us in the past pretty hard and I don't think we should compromise on quality just get something in the next release (remember the next train is only a few more weeks away)

We decided to go a different way by changing the data model of our rule execution statuses. The idea is to avoid "joins", queries that require sorting by status date, aggregations - any of these - when we need to load the rule management table, and so fetch N rules with their N current statuses in 1-2, but not N queries to Elasticsearch. We'll start by evaluating Alerting's APIs for these needs, in particular the "event log" API.

Closing this one.

banderror self-assigned this Feb 8, 2021

banderror commented Feb 8, 2021

View reviewed changes

banderror force-pushed the optimize-rule-fetching-endpoints branch from d57a937 to 471dab6 Compare February 9, 2021 14:58

TEMP COMMIT. REBASE ME

a7836c6

banderror force-pushed the optimize-rule-fetching-endpoints branch from 471dab6 to a7836c6 Compare February 11, 2021 13:34

banderror closed this Feb 12, 2021

banderror deleted the optimize-rule-fetching-endpoints branch February 12, 2021 17:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Security Solution][Detections] Optimize rule fetching endpoints [WIP] #90639

[Security Solution][Detections] Optimize rule fetching endpoints [WIP] #90639

banderror commented Feb 8, 2021 •

edited

Loading

banderror Feb 8, 2021 •

edited

Loading

banderror Feb 8, 2021 •

edited

Loading

banderror Feb 8, 2021

kibanamachine commented Feb 11, 2021 •

edited

Loading

Stack Trace

banderror commented Feb 12, 2021

[Security Solution][Detections] Optimize rule fetching endpoints [WIP] #90639

[Security Solution][Detections] Optimize rule fetching endpoints [WIP] #90639

Conversation

banderror commented Feb 8, 2021 • edited Loading

Summary

Details

Notes on data model

Checklist

For maintainers

banderror Feb 8, 2021 • edited Loading

Choose a reason for hiding this comment

banderror Feb 8, 2021 • edited Loading

Choose a reason for hiding this comment

banderror Feb 8, 2021

Choose a reason for hiding this comment

kibanamachine commented Feb 11, 2021 • edited Loading

💔 Build Failed

Failed CI Steps

Test Failures

Stack Trace

Metrics [docs]

History

banderror commented Feb 12, 2021

banderror commented Feb 8, 2021 •

edited

Loading

banderror Feb 8, 2021 •

edited

Loading

banderror Feb 8, 2021 •

edited

Loading

kibanamachine commented Feb 11, 2021 •

edited

Loading