
[Metrics UI] Increase composite size to 10K for Metric Threshold Rule and optimize processing #121904

Conversation

@simianhacker (Member) commented on Dec 22, 2021

Summary

This PR closes #119501 by introducing a new configuration value, xpack.infra.alerting.metric_threshold.group_by_page_size, which controls the composite size for the group-by queries. The default value is now 10000; the original value was 100.
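For reference, a setting like this would be declared with Kibana's config schema roughly as follows. This is a minimal sketch, not the actual registration code in the infra plugin, and the nesting of the key is assumed.

```ts
import { schema } from '@kbn/config-schema';

// Minimal sketch: the real declaration lives in the infra plugin's config
// definition and may be nested or validated differently.
const metricThresholdConfig = schema.object({
  group_by_page_size: schema.number({ defaultValue: 10000 }),
});
```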

I also took the opportunity to do some performance optimizations, which also closes #120249, by refactoring how the Metric Threshold rule processes the data. I set up a baseline by indexing 10K unique events into Elasticsearch every 60 seconds and creating a Metric Threshold rule with a condition that checks the document count, grouped by a unique event identifier. Then I instrumented all the code that loops through the results using console.time/console.timeEnd. Ultimately, I identified three sections to focus on.
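The instrumentation itself is just the standard console timer pattern, roughly like this (a sketch rather than the exact instrumented code; the label mirrors an event name from the results tables below):

```ts
// Wrap each section being measured with a matching console.time/console.timeEnd pair;
// timeEnd logs the elapsed time together with the label.
console.time('Reduce results into groups');
// ... the reduce over compositeBuckets shown in the next snippet ...
console.timeEnd('Reduce results into groups');
```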

The first was a reduce that converts the ES buckets into a large hash with the groupings as the keys and the parsed results as the values. While performance testing this code, I noticed that it was also blocking the event loop for about 20 seconds with 10K groups. This code is identified as "Reduce results into groups" in the results.

```ts
const groupedResults = compositeBuckets.reduce(
  (result, bucket) => ({
    ...result,
    [Object.values(bucket.key)
      .map((value) => value)
      .join(', ')]: getValuesFromAggregations(
      bucket,
      aggType,
      dropPartialBucketsOptions,
      calculatedTimerange,
      bucket.doc_count
    ),
  }),
  {}
);
```
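The optimization here is essentially to stop spreading the accumulator on every iteration (which copies the whole object each time, making the reduce quadratic) and build the hash with a plain loop instead. A sketch of that approach, reusing the names from the snippet above; the code actually merged in this PR may differ in detail:

```ts
// Mutate a single object instead of re-copying it for every bucket.
const groupedResults: Record<string, ReturnType<typeof getValuesFromAggregations>> = {};
for (const bucket of compositeBuckets) {
  const key = Object.values(bucket.key).join(', ');
  groupedResults[key] = getValuesFromAggregations(
    bucket,
    aggType,
    dropPartialBucketsOptions,
    calculatedTimerange,
    bucket.doc_count
  );
}
```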

The second was a filter/includes that was essentially finding the difference between the groups returned from Elasticsearch and the groups the rule had previously seen. In the 10K entity scenario the gains seem modest, but when I changed to 50K this code was adding an additional second of processing time. This code is identified as "Find missing groups" in the results.

```ts
const missingGroups = prevGroups.filter((g) => !currentGroups.includes(g));
```
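The fix that landed, visible in the review diff further down, swaps this for Lodash's difference, which avoids calling includes once per element. A sketch of the change:

```ts
import { difference } from 'lodash';

// difference(a, b) returns the elements of a that are not present in b.
const missingGroups = difference(prevGroups, currentGroups);
```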

The third was another reduce, used to backfill the missing groups with either zero or null. When the groups go missing, this step takes ~19 seconds and also blocks the event loop. This code is identified as "Backfill previous groups" in the results.

```ts
const backfilledPrevGroups: Record<
  string,
  Array<{ key: string; value: number }>
> = missingGroups.reduce(
  (result, group) => ({
    ...result,
    [group]: [
      {
        key: backfillTimestamp,
        value: criterion.aggType === Aggregators.COUNT ? 0 : null,
      },
    ],
  }),
  {}
);
```
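As with the grouping reduce, the gain comes from building the record with a straightforward loop instead of spreading the accumulator on every iteration. A sketch of that shape (the merged implementation may differ in detail):

```ts
// Assign each missing group a single backfilled data point on a mutable record.
const backfilledPrevGroups: Record<string, Array<{ key: string; value: number | null }>> = {};
for (const group of missingGroups) {
  backfilledPrevGroups[group] = [
    {
      key: backfillTimestamp,
      value: criterion.aggType === Aggregators.COUNT ? 0 : null,
    },
  ];
}
```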

I measured two scenarios:

  1. 10K entities reporting - This would be a normal situation where everything is indexing and evaluating.
  2. 10K entities stop reporting - This would be a situation where we want to alert on missing groups that disappear from the index.

I also tested this with 50K unique events. The Metric Threshold rule executor handles the 50K group-bys with little effort; it takes about 2-4 seconds to process. But the alerting framework ends up with the following error:

[2021-12-21T15:33:15.396-07:00][ERROR][plugins.alerting] Executing Rule default:metrics.alert.threshold:06e93fa0-62a8-11ec-b34a-6fe767801337 has resulted in Error: search_phase_execution_exception: [illegal_argument_exception] Reason: Result window is too large, from + size must be less than or equal to: [10000] but was [50000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting., caused by: "Result window is too large, from + size must be less than or equal to: [10000] but was [50000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.,Result window is too large, from + size must be less than or equal to: [10000] but was [50000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."

This error is being tracked via #122288.

Results

10K entities reporting

| event | baseline | optimized |
| --- | --- | --- |
| Query Elasticsearch for all data | 1.045s | 64.151ms |
| Reduce results into groups | 21.735s | 693.883ms |
| Find missing groups | 92.738ms | 11.224ms |
| Backfill previous groups | 0.007ms | 0.007ms |
| evaluateRule | 22.961s | 865.421ms |
| MetricThresholdExecutor | 23.012s | 905.988ms |

As you can see above, the biggest gain was ~21 seconds from the "Reduce results into groups" event. There is also a modest gain of ~80 milliseconds with the "Find missing groups" event.

10K entities stop reporting

| event | baseline | optimized |
| --- | --- | --- |
| Query Elasticsearch for all data | 3.649ms | 22.049ms |
| Reduce results into groups | 0.004ms | 0.02ms |
| Find missing groups | 0.448ms | 2.302ms |
| Backfill previous groups | 19.056s | 3.239ms |
| evaluateRule | 19.137s | 117.733ms |
| MetricThresholdExecutor | 19.894s | 1.334s |

The biggest gain in this scenario is the "Backfill previous groups" event with a ~19 second difference.

Setup for PR Review

  1. Set up a "High Cardinality Cluster" using this Docker setup: https://github.com/elastic/high-cardinality-cluster
  2. Modify docker-compose.yml, setting EVENTS_PER_CYCLE to 10000 and PAYLOAD_SIZE to 5000.
  3. Start the Docker cluster with ELASTIC_VERSION=8.1.0-SNAPSHOT docker compose up --remove-orphans.
  4. Start Kibana with yarn start --ssl. DO NOT start Elasticsearch from the Kibana directory; the Docker cluster is configured to work with Kibana source.
  5. Create a "Metric Threshold Rule" in "Stack Management > Rules and Connectors".
  6. Configure the first condition to "Document count is below 1".
  7. Set the "Group alerts by" to label.eventId.
  8. Add an action; try "Server log" with the understanding that you will end up with 10K messages if the alert triggers.

@simianhacker simianhacker added Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v8.1.0 release_note:fix labels Dec 22, 2021
```diff
@@ -96,27 +99,22 @@ export const evaluateRule = <Params extends EvaluatedRuleParams = EvaluatedRuleP
   // If any previous groups are no longer being reported, backfill them with null values
   const currentGroups = Object.keys(currentValues);

-  const missingGroups = prevGroups.filter((g) => !currentGroups.includes(g));
+  const missingGroups = difference(prevGroups, currentGroups);
```
@klacabane (Contributor) commented on Dec 23, 2021

If currentGroups is large enough, we could convert it to a hash/set for constant read access to bring this down to a linear execution, but that's probably what lodash does?
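For illustration, the Set-based variant described in this comment would look roughly like this (a sketch, not code from the PR):

```ts
// Set#has is O(1), so the check becomes roughly O(prevGroups + currentGroups)
// instead of O(prevGroups * currentGroups).
const current = new Set(currentGroups);
const missingGroups = prevGroups.filter((g) => !current.has(g));
```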

@simianhacker (Member, Author) replied:

The gains from using Lodash's difference are probably sufficient, plus it's easy enough to read. If the new code had been in place when I was doing the performance testing, I probably wouldn't have even bothered with it.

@elasticmachine (Contributor) commented:

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@kibana-ci (Collaborator) commented:

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run `node scripts/build_api_docs --plugin [yourplugin] --stats comments` for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| infra | 22 | 25 | +3 |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| infra | 25 | 28 | +3 |

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@klacabane (Contributor) left a comment

Is this processing exposed to an endpoint or is it a background task? I'm wondering if we could leverage performance tooling to tune the group_by_page_size value depending on its impact on different hardware.

@simianhacker (Member, Author) replied:

@klacabane We consulted with the Elasticsearch team quite a bit and 10K was their general recommendation. I recently tested this on a smaller 8G node and it was similarly performant. I'm feeling pretty confident about 10K, but the setting will give users an escape hatch if they need it.

@simianhacker simianhacker merged commit 8e6ec25 into elastic:main Jan 18, 2022
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Jan 18, 2022
ogupte pushed a commit to ogupte/kibana that referenced this pull request Jan 28, 2022
… and optimize processing (elastic#121904)

* [Metrics UI] Increase composite size for Metric Threshold Rule to 10K

* Adding performance optimizations

* Fixing metrics_alerting integration test

* fixing tests

* Fixing integration test and config mock

* Removing the setTimeout code to simplify to a for/of

* Adding new setting to docs

* Adding metric_threshold identifier to the config setting
simianhacker added a commit to simianhacker/kibana that referenced this pull request Feb 28, 2022
… and optimize processing (elastic#121904)

* [Metrics UI] Increase composite size for Metric Threshold Rule to 10K

* Adding performance optimizations

* Fixing metrics_alerting integration test

* fixing tests

* Fixing integration test and config mock

* Removing the setTimeout code to simplify to a for/of

* Adding new setting to docs

* Adding metric_threshold identifier to the config setting

(cherry picked from commit ae0c8d5)

# Conflicts:
#	x-pack/plugins/infra/server/lib/alerting/metric_threshold/lib/evaluate_rule.ts
simianhacker added a commit that referenced this pull request Feb 28, 2022
… and optimize processing (#121904) (#126506)

* [Metrics UI] Increase composite size for Metric Threshold Rule to 10K

* Adding performance optimizations

* Fixing metrics_alerting integration test

* fixing tests

* Fixing integration test and config mock

* Removing the setTimeout code to simplify to a for/of

* Adding new setting to docs

* Adding metric_threshold identifier to the config setting

(cherry picked from commit ae0c8d5)

# Conflicts:
#	x-pack/plugins/infra/server/lib/alerting/metric_threshold/lib/evaluate_rule.ts
@simianhacker simianhacker deleted the issue-119501-increase-composite-size-metric-threshold branch April 17, 2024 15:37
Labels
backport:skip This commit does not require backporting Feature:Metrics UI Metrics UI feature release_note:fix Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v8.1.0