Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Security Solution][Detections] Optimize rule fetching endpoints [WIP] #90639

Closed

Conversation

banderror
Copy link
Contributor

@banderror banderror commented Feb 8, 2021

Needed for: #89877

Summary

In order to be able to implement in-memory rule management table, we need these endpoints to be fast and scalable in terms of the number of requested/existing rules:

  • api/detection_engine/rules/_find
  • api/detection_engine/rules/_find_statuses
  • api/detection_engine/rules/prepackaged/_status

This PR attempts to do that.

Details

Currently, the endpoints mentioned above generate ~2N requests to Elasticsearch. I guess, we did this partly because of current limitations of saved object client and alerting client APIs.

Instead, to be able to fetch for example N latest statuses of N rules, and do it in one query to Elasticsearch instead of N queries, we need to be able to do term + terms queries and aggregations.

# Bulk fetch current (latest) rule execution statuses (one per each rule) by a list of alert ids
GET .kibana/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "siem-detection-engine-rule-status" } },
        {
          "terms": {
            "siem-detection-engine-rule-status.alertId": [
              "b921e020-6a13-11eb-bf61-5f56ac457c9b",
              "4bc22346-6a0f-11eb-b39f-e95fdc584ff7"
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "rules": {
      "terms": {
        "field": "siem-detection-engine-rule-status.alertId",
        "size": 2
      },
      "aggs": {
        "latest_status": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "siem-detection-engine-rule-status.statusDate": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

# Bulk fetch rule actions (one object per each rule) by a list of alert ids
GET .kibana/_search
{
  "size": 2,
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "siem-detection-engine-rule-actions" } },
        {
          "terms": {
            "siem-detection-engine-rule-actions.ruleAlertId": [
              "b921e020-6a13-11eb-bf61-5f56ac457c9b",
              "4bc22346-6a0f-11eb-b39f-e95fdc584ff7"
            ]
          }
        }
      ]
    }
  }
}

In this PR I'm trying to use ElasticsearchClient directly. I'm building a custom query, executing it via ElasticsearchClient.search() API and manually normalizing the response to a desired format of our domain entities. Search request directly queries .kibana index.

This naive hacky-dirty implementation works locally and reduces the api/detection_engine/rules/_find roundtrip time from 2-5 seconds down to 300 - 1100 ms when requesting 500 rules.

Notes on data model

We have detections rules which are implemented as "alerts" in terms of alerting framework, and rule statuses and rule actions implemented as what we call "sidecar" saved objects.

  • Detection rules are saved objects with "type": "alert" and "alert.alertTypeId": "siem.signals".
  • Rule statuses are saved objects with "type": "siem-detection-engine-rule-status".
  • Rule actions are saved objects with "type": "siem-detection-engine-rule-actions".

Rule statuses and actions objects have "foreign keys" pointing to detections rules via their ids. E.g. if a rule has an "_id": "alert:b921e020-6a13-11eb-bf61-5f56ac457c9b", its status object will contain a foreign key "alertId": "b921e020-6a13-11eb-bf61-5f56ac457c9b" in its saved object attributes.

Association types:

  • Rule --* Status (one to many, statuses are time based, and we store sort of status history instead of only the current status)
  • Rule -- Actions (one to one, actions is a single object containing all actions of the rule)

Example for rule with id b921e020-6a13-11eb-bf61-5f56ac457c9b:

# The rule itself
GET .kibana/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "alert" } },
        { "term": { "alert.alertTypeId": "siem.signals" } },
        { "term": { "_id": "alert:b921e020-6a13-11eb-bf61-5f56ac457c9b" } }
      ]
    }
  }
}

# Sidecar statuses of the rule
GET .kibana/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "siem-detection-engine-rule-status" } },
        { "term": { "siem-detection-engine-rule-status.alertId": "b921e020-6a13-11eb-bf61-5f56ac457c9b" } }
      ]
    }
  }
}

# Sidecar actions of the rule
GET .kibana/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "siem-detection-engine-rule-actions" } },
        { "term": { "siem-detection-engine-rule-actions.ruleAlertId": "b921e020-6a13-11eb-bf61-5f56ac457c9b" } }
      ]
    }
  }
}

Checklist

Delete any items that are not applicable to this PR.

For maintainers

@banderror banderror added release_note:enhancement performance Feature:Detection Rules Anything related to Security Solution's Detection Rules Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. labels Feb 8, 2021
@banderror banderror self-assigned this Feb 8, 2021
Comment on lines +115 to +118
const [ruleActions, ruleStatuses] = await Promise.all([
bulkFetchRuleActionsOfManyRules({ savedObjectsClient, ruleIds }),
bulkFetchLatestStatusesOfManyRules(elasticsearchClient.asCurrentUser, ruleIds),
]);
Copy link
Contributor Author

@banderror banderror Feb 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review this from the security standpoint (access privileges, isolation (spaces), error handling, potential issues, etc.).

  • Is it ok to query .kibana index via elasticsearchClient.asCurrentUser, or privileges to .kibana are not guaranteed to all users by default?
  • I guess it would be better to query asInternalUser. Would it be safe in this particular case?

Comment on lines +204 to +258
const bulkFetchLatestStatusesOfManyRules = async (
client: ElasticsearchClient,
ruleIds: string[]
): Promise<IRuleSavedAttributesSavedObjectAttributes[]> => {
// This is not only a performance optimization. If ids is empty, the search
// request to Elasticsearch using terms aggregation will throw a 400 error,
// which is not what we want here.
if (ruleIds.length === 0) {
return [];
}

const response = await client.search({
index: '.kibana',
ignore_unavailable: true,
size: 0,
body: {
size: 0,
query: {
bool: {
filter: [
{ term: { type: 'siem-detection-engine-rule-status' } },
{ terms: { 'siem-detection-engine-rule-status.alertId': ruleIds } },
],
},
},
aggs: {
rules: {
terms: {
field: 'siem-detection-engine-rule-status.alertId',
size: ruleIds.length,
},
aggs: {
latest_status: {
top_hits: {
size: 1,
sort: [{ 'siem-detection-engine-rule-status.statusDate': { order: 'desc' } }],
},
},
},
},
},
},
});

const ruleBuckets = response.body.aggregations?.rules?.buckets ?? [];
const statusHits = ruleBuckets.map((bucket: any) => {
const [firstHit] = bucket?.latest_status?.hits?.hits ?? [];
return firstHit ?? null;
});
const statuses = statusHits.map((hit: any) => {
return hit?._source?.['siem-detection-engine-rule-status'] ?? null;
});

return statuses.filter(Boolean) as IRuleSavedAttributesSavedObjectAttributes[];
};
Copy link
Contributor Author

@banderror banderror Feb 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review this from the security standpoint (access privileges, isolation (spaces), error handling, potential issues, etc.).

Note that:

  • Detection rules are saved objects with "type": "alert" and "alert.alertTypeId": "siem.signals".
  • Rule statuses are saved objects with "type": "siem-detection-engine-rule-status".
  • Rule statuses have "foreign keys" pointing to detections rules via their ids. If a rule has an "_id": "alert:b921e020-6a13-11eb-bf61-5f56ac457c9b", its status object will contain a foreign key "alertId": "b921e020-6a13-11eb-bf61-5f56ac457c9b" in its saved object attributes.
  • In this endpoint handler ruleIds are not coming from request parameters. We retrieve them via alerting client which uses saved objects client. Which should handle spaces and RBAC for us correctly (I guess?).
  • See more notes on our data model in the PR description.

}

const response = await client.search({
index: '.kibana',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Kibana index name should be retrieved from the config.

@banderror banderror force-pushed the optimize-rule-fetching-endpoints branch from d57a937 to 471dab6 Compare February 9, 2021 14:58
@banderror banderror force-pushed the optimize-rule-fetching-endpoints branch from 471dab6 to a7836c6 Compare February 11, 2021 13:34
@kibanamachine
Copy link
Contributor

kibanamachine commented Feb 11, 2021

💔 Build Failed

Failed CI Steps


Test Failures

Kibana Pipeline / general / "before all" hook for "should open a modal".Open timeline Open timeline modal "before all" hook for "should open a modal"

Link to Jenkins

Stack Trace

Failed Tests Reporter:
  - Test has not failed recently on tracked branches

AssertionError: Timed out retrying after 60000ms: Expected to find element: `[data-test-subj="title-c10fd380-6c77-11eb-a1ff-69f5dff48f73"]`, but never found it.

Because this error occurred during a `before all` hook we are skipping the remaining tests in the current suite: `Open timeline`

Although you have test retries enabled, we do not retry tests when `before all` or `after all` hooks fail
    at Object.openTimelineById (http://localhost:61131/__cypress/tests?p=cypress/integration/timelines/open_timeline.spec.ts:16072:15)
    at Context.eval (http://localhost:61131/__cypress/tests?p=cypress/integration/timelines/open_timeline.spec.ts:15046:28)

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@banderror
Copy link
Contributor Author

We haven't got an approval from Kibana platform security team, unfortunately.

This sounds like a use-case for an 'advanced find API' that we've been considering adding: #84729
We haven't settled on a technical solution to how we would implement this safely yet, so any context you could add to that issue would really be helpful.
We won't have that implemented in a week like you need, but it could be the 'proper' solution to switch to later once we have implemented it. That said, querying the .kibana index direction is a non-trivial undertaking that requires a lot of careful consideration around authz. I'm worried about a temporary workaround that isn't tested very thoroughly. That has bitten us in the past pretty hard and I don't think we should compromise on quality just get something in the next release (remember the next train is only a few more weeks away)

We decided to go a different way by changing the data model of our rule execution statuses. The idea is to avoid "joins", queries that require sorting by status date, aggregations - any of these - when we need to load the rule management table, and so fetch N rules with their N current statuses in 1-2, but not N queries to Elasticsearch. We'll start by evaluating Alerting's APIs for these needs, in particular the "event log" API.

Closing this one.

@banderror banderror closed this Feb 12, 2021
@banderror banderror deleted the optimize-rule-fetching-endpoints branch February 12, 2021 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Detection Rules Anything related to Security Solution's Detection Rules performance release_note:enhancement Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants