
[Metrics UI] Increase composite size to 10K for Metric Threshold Rule and optimize processing #121904

Conversation

@simianhacker (Member) commented on Dec 22, 2021

Summary

This PR closes #119501 by introducing a new configuration value, xpack.infra.alerting.metric_threshold.group_by_page_size, which controls the composite size for the group-by queries. The default value is now 10000; the original value was 100.
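For reference, a setting like this would be declared with Kibana's config schema roughly as follows. This is a minimal sketch, not the actual registration code in the infra plugin, and the nesting of the key is assumed.

```ts
import { schema } from '@kbn/config-schema';

// Minimal sketch: the real declaration lives in the infra plugin's config
// definition and may be nested or validated differently.
const metricThresholdConfig = schema.object({
  group_by_page_size: schema.number({ defaultValue: 10000 }),
});
```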

I also took the opportunity to do some performance optimizations, which also closes #120249, by refactoring how the Metric Threshold rule processes the data. I set up a baseline by indexing 10K unique events into Elasticsearch every 60 seconds and creating a Metric Threshold rule with a condition that checks the document count, grouped by a unique event identifier. Then I instrumented all the code that loops through the results using console.time/console.timeEnd. Ultimately, I identified three sections to focus on.
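The instrumentation itself is just the standard console timer pattern, roughly like this (a sketch rather than the exact instrumented code; the label mirrors an event name from the results tables below):

```ts
// Wrap each section being measured with a matching console.time/console.timeEnd pair;
// timeEnd logs the elapsed time together with the label.
console.time('Reduce results into groups');
// ... the reduce over compositeBuckets shown in the next snippet ...
console.timeEnd('Reduce results into groups');
```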

The first was a reduce that converts the ES buckets into a large hash with the groupings as the keys and the parsed results as the values. While performance testing this code, I noticed that it was also blocking the event loop for about 20 seconds with 10K groups. This code is identified as "Reduce results into groups" in the results.

```ts
const groupedResults = compositeBuckets.reduce(
  (result, bucket) => ({
    ...result,
    [Object.values(bucket.key)
      .map((value) => value)
      .join(', ')]: getValuesFromAggregations(
      bucket,
      aggType,
      dropPartialBucketsOptions,
      calculatedTimerange,
      bucket.doc_count
    ),
  }),
  {}
);
```
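The optimization here is essentially to stop spreading the accumulator on every iteration (which copies the whole object each time, making the reduce quadratic) and build the hash with a plain loop instead. A sketch of that approach, reusing the names from the snippet above; the code actually merged in this PR may differ in detail:

```ts
// Mutate a single object instead of re-copying it for every bucket.
const groupedResults: Record<string, ReturnType<typeof getValuesFromAggregations>> = {};
for (const bucket of compositeBuckets) {
  const key = Object.values(bucket.key).join(', ');
  groupedResults[key] = getValuesFromAggregations(
    bucket,
    aggType,
    dropPartialBucketsOptions,
    calculatedTimerange,
    bucket.doc_count
  );
}
```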

The second was a filter/includes that was essentially finding the difference between the groups returned from Elasticsearch and the groups the rule had previously seen. In the 10K entity scenario the gains seem modest, but when I changed to 50K this code was adding an additional second of processing time. This code is identified as "Find missing groups" in the results.

```ts
const missingGroups = prevGroups.filter((g) => !currentGroups.includes(g));
```
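The fix that landed, visible in the review diff further down, swaps this for Lodash's difference, which avoids calling includes once per element. A sketch of the change:

```ts
import { difference } from 'lodash';

// difference(a, b) returns the elements of a that are not present in b.
const missingGroups = difference(prevGroups, currentGroups);
```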

The third was another reduce, used to backfill the missing groups with either zero or null. When the groups go missing, this step takes ~19 seconds and also blocks the event loop. This code is identified as "Backfill previous groups" in the results.

```ts
const backfilledPrevGroups: Record<
  string,
  Array<{ key: string; value: number }>
> = missingGroups.reduce(
  (result, group) => ({
    ...result,
    [group]: [
      {
        key: backfillTimestamp,
        value: criterion.aggType === Aggregators.COUNT ? 0 : null,
      },
    ],
  }),
  {}
);
```
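As with the grouping reduce, the gain comes from building the record with a straightforward loop instead of spreading the accumulator on every iteration. A sketch of that shape (the merged implementation may differ in detail):

```ts
// Assign each missing group a single backfilled data point on a mutable record.
const backfilledPrevGroups: Record<string, Array<{ key: string; value: number | null }>> = {};
for (const group of missingGroups) {
  backfilledPrevGroups[group] = [
    {
      key: backfillTimestamp,
      value: criterion.aggType === Aggregators.COUNT ? 0 : null,
    },
  ];
}
```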

I measured two scenarios:

  1. 10K entities reporting - This would be a normal situation where everything is indexing and evaluating.
  2. 10K entities stop reporting - This would be a situation where we want to alert on missing groups that disappear from the index.

I also tested this with 50K unique events. The Metric Threshold rule executor handles the 50K group-bys with little effort; it takes about 2-4 seconds to process. But the alerting framework ends up with the following error:

[2021-12-21T15:33:15.396-07:00][ERROR][plugins.alerting] Executing Rule default:metrics.alert.threshold:06e93fa0-62a8-11ec-b34a-6fe767801337 has resulted in Error: search_phase_execution_exception: [illegal_argument_exception] Reason: Result window is too large, from + size must be less than or equal to: [10000] but was [50000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting., caused by: "Result window is too large, from + size must be less than or equal to: [10000] but was [50000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.,Result window is too large, from + size must be less than or equal to: [10000] but was [50000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."

This error is being tracked via #122288.

Results

10K entities reporting

| event | baseline | optimized |
| --- | --- | --- |
| Query Elasticsearch for all data | 1.045s | 64.151ms |
| Reduce results into groups | 21.735s | 693.883ms |
| Find missing groups | 92.738ms | 11.224ms |
| Backfill previous groups | 0.007ms | 0.007ms |
| evaluateRule | 22.961s | 865.421ms |
| MetricThresholdExecutor | 23.012s | 905.988ms |

As you can see above, the biggest gain was ~21 seconds from the "Reduce results into groups" event. There is also a modest gain of ~80 milliseconds with the "Find missing groups" event.

10K entities stop reporting

| event | baseline | optimized |
| --- | --- | --- |
| Query Elasticsearch for all data | 3.649ms | 22.049ms |
| Reduce results into groups | 0.004ms | 0.02ms |
| Find missing groups | 0.448ms | 2.302ms |
| Backfill previous groups | 19.056s | 3.239ms |
| evaluateRule | 19.137s | 117.733ms |
| MetricThresholdExecutor | 19.894s | 1.334s |

The biggest gain in this scenario is the "Backfill previous groups" event with a ~19 second difference.

Setup for PR Review

  1. Set up a "High Cardinality Cluster" using this Docker setup: https://github.com/elastic/high-cardinality-cluster
  2. Modify docker-compose.yml, setting EVENTS_PER_CYCLE to 10000 and PAYLOAD_SIZE to 5000.
  3. Start the Docker cluster with ELASTIC_VERSION=8.1.0-SNAPSHOT docker compose up --remove-orphans.
  4. Start Kibana with yarn start --ssl. DO NOT start Elasticsearch from the Kibana directory; the Docker cluster is configured to work with Kibana source.
  5. Create a "Metric Threshold Rule" in "Stack Management > Rules and Connectors".
  6. Configure the first condition to "Document count is below 1".
  7. Set the "Group alerts by" to label.eventId.
  8. Add an action; try "Server log" with the understanding that you will end up with 10K messages if the alert triggers.

@simianhacker simianhacker added Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v8.1.0 release_note:fix labels Dec 22, 2021
```diff
@@ -96,27 +99,22 @@ export const evaluateRule = <Params extends EvaluatedRuleParams = EvaluatedRuleP
   // If any previous groups are no longer being reported, backfill them with null values
   const currentGroups = Object.keys(currentValues);

-  const missingGroups = prevGroups.filter((g) => !currentGroups.includes(g));
+  const missingGroups = difference(prevGroups, currentGroups);
```
@klacabane (Contributor) commented on Dec 23, 2021

If currentGroups is large enough, we could convert it to a hash/set for constant read access to bring this down to a linear execution, but that's probably what lodash does?
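For illustration, the Set-based variant described in this comment would look roughly like this (a sketch, not code from the PR):

```ts
// Set#has is O(1), so the check becomes roughly O(prevGroups + currentGroups)
// instead of O(prevGroups * currentGroups).
const current = new Set(currentGroups);
const missingGroups = prevGroups.filter((g) => !current.has(g));
```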

@simianhacker (Member, Author) replied:

The gains from using Lodash's difference are probably sufficient, plus it's easy enough to read. If the new code had been in place when I was doing the performance testing, I probably wouldn't have even bothered with it.

@elasticmachine (Contributor) commented:

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@kibana-ci (Collaborator) commented:

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run `node scripts/build_api_docs --plugin [yourplugin] --stats comments` for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| infra | 22 | 25 | +3 |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| infra | 25 | 28 | +3 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@klacabane (Contributor) left a comment

Is this processing exposed to an endpoint or is it a background task? I'm wondering if we could leverage performance tooling to tune the group_by_page_size value depending on its impact on different hardware.

@simianhacker (Member, Author) replied:

@klacabane We consulted with the Elasticsearch team quite a bit and 10K was their general recommendation. I recently tested this on a smaller 8G node and it was similarly performant. I'm feeling pretty confident about 10K, but the setting will give users an escape hatch if they need it.

@simianhacker simianhacker merged commit 8e6ec25 into elastic:main Jan 18, 2022
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Jan 18, 2022
ogupte pushed a commit to ogupte/kibana that referenced this pull request Jan 28, 2022
… and optimize processing (elastic#121904)

* [Metrics UI] Increase composite size for Metric Threshold Rule to 10K

* Adding performance optimizations

* Fixing metrics_alerting integration test

* fixing tests

* Fixing integration test and config mock

* Removing the setTimeout code to simplify to a for/of

* Adding new setting to docs

* Adding metric_threshold identifier to the config setting
simianhacker added a commit to simianhacker/kibana that referenced this pull request Feb 28, 2022
… and optimize processing (elastic#121904)

* [Metrics UI] Increase composite size for Metric Threshold Rule to 10K

* Adding performance optimizations

* Fixing metrics_alerting integration test

* fixing tests

* Fixing integration test and config mock

* Removing the setTimeout code to simplify to a for/of

* Adding new setting to docs

* Adding metric_threshold identifier to the config setting

(cherry picked from commit ae0c8d5)

# Conflicts:
#	x-pack/plugins/infra/server/lib/alerting/metric_threshold/lib/evaluate_rule.ts
simianhacker added a commit that referenced this pull request Feb 28, 2022
… and optimize processing (#121904) (#126506)

* [Metrics UI] Increase composite size for Metric Threshold Rule to 10K

* Adding performance optimizations

* Fixing metrics_alerting integration test

* fixing tests

* Fixing integration test and config mock

* Removing the setTimeout code to simplify to a for/of

* Adding new setting to docs

* Adding metric_threshold identifier to the config setting

(cherry picked from commit ae0c8d5)

# Conflicts:
#	x-pack/plugins/infra/server/lib/alerting/metric_threshold/lib/evaluate_rule.ts
@simianhacker simianhacker deleted the issue-119501-increase-composite-size-metric-threshold branch April 17, 2024 15:37
Labels
backport:skip This commit does not require backporting Feature:Metrics UI Metrics UI feature release_note:fix Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v8.1.0