[ML] Autoscaling for machine learning #59309

Merged: 33 commits merged into elastic:master from feature/ml-autoscaling-integration on Nov 17, 2020

Conversation

@benwtrent benwtrent commented Jul 9, 2020

This provides an autoscaling service integration for machine learning.

The underlying logic is fairly straightforward with a couple of caveats:

  • When deciding whether to scale up or down, ML automatically translates between node size and the memory that the node will provide for ML once the scaling plan is implemented (see the sketch after this list).
  • If our knowledge of job sizes is out of date, we make a best-effort check for scaling up. If a decision cannot be made with the current view of job memory, we request a refresh and return a no_scale event.
  • We assume that if the automatic memory percent calculation is being used, all JVM sizes on the nodes are the same.
  • For scale down, the time of the last scale-down calculation is kept in memory only, so if the master node changes in between, the scale-down delay is reset.
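
A minimal sketch of the node-size-to-ML-memory translation mentioned in the first bullet. The class name, constant, and percentage below are illustrative stand-ins, not the calculator actually used in this PR:

```java
public final class MlNodeMemoryExample {

    // Assumed fraction of node memory usable by ML native processes (illustrative only).
    private static final double ML_MEMORY_PERCENT = 0.30;

    /** Memory that ML could use on a node of the given total size. */
    static long mlMemoryForNodeSize(long nodeSizeBytes) {
        return (long) (nodeSizeBytes * ML_MEMORY_PERCENT);
    }

    /** Smallest node size whose ML share covers the required ML memory (the inverse translation). */
    static long nodeSizeForMlMemory(long requiredMlBytes) {
        return (long) Math.ceil(requiredMlBytes / ML_MEMORY_PERCENT);
    }

    public static void main(String[] args) {
        long nodeSize = 8L * 1024 * 1024 * 1024;             // an 8 GB node
        long mlAvailable = mlMemoryForNodeSize(nodeSize);    // ~2.4 GB usable by ML
        long requirement = 4L * 1024 * 1024 * 1024;          // jobs need 4 GB of ML memory
        long nodeNeeded = nodeSizeForMlMemory(requirement);  // ~13.3 GB node requested
        System.out.printf("ML available: %d bytes, node size needed: %d bytes%n", mlAvailable, nodeNeeded);
    }
}
```

The inverse function is what lets the decider express "the jobs need X bytes of ML memory" as "request a node of at least Y bytes".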

@benwtrent benwtrent force-pushed the feature/ml-autoscaling-integration branch 2 times, most recently from c4e1f9a to e5f75e5 on August 25, 2020 12:21
@benwtrent benwtrent force-pushed the feature/ml-autoscaling-integration branch from e5f75e5 to 4ceb881 on August 27, 2020 18:21
@droberts195 droberts195 left a comment

I didn't review the code at a low level as it may change, but left some comments about edge cases that need consideration.

@benwtrent benwtrent marked this pull request as ready for review November 12, 2020 15:26
@elasticmachine

Pinging @elastic/es-distributed (:Distributed/Autoscaling)

@elasticmachine elasticmachine added the Team:Distributed (Meta label for distributed team) label on Nov 12, 2020
@elasticmachine

Pinging @elastic/ml-core (:ml)

@benwtrent benwtrent changed the title from "Autoscaling for machine learning" to "[ML] Autoscaling for machine learning" on Nov 12, 2020
@droberts195 droberts195 left a comment

I had to stop reviewing as other things came up, but here are the two minor things I'd typed before that. I'll have another look on Monday.

@droberts195 droberts195 left a comment

I just have one concern. Apart from that I think this is good to merge so we get a first attempt into CI.

Comment on lines 368 to 369
// TODO Better default???
AnalysisLimits.DEFAULT_MODEL_MEMORY_LIMIT_MB,
droberts195 (Contributor):

I think this is dangerous. It means we could scale the cluster significantly due to the memory tracker not having the required information. It would also be difficult to diagnose this based on a user complaint that their cluster scaled wildly. For example, suppose there are 50 jobs each with a memory requirement of 0.1GB, total requirement 5GB, but then due to a glitch in the memory tracker we treat that as 50GB and scale up to a much more costly cluster.

Since this is for scaling up, I think we should use 0 for jobs where we have no memory information, so we'll only scale up if the sum of the memory requirements for jobs that we know the memory requirement for imply a scale up.

(Same a few lines below.)
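
To make the arithmetic concrete, here is a small self-contained sketch (hypothetical numbers and names) contrasting the two fallback strategies for jobs whose memory requirement is unknown:

```java
import java.util.Collections;
import java.util.List;

public final class ScaleUpDefaultExample {

    // Sum the job requirements, substituting `fallback` when the tracker has no value (null).
    static long totalRequirement(List<Long> trackedRequirements, long fallback) {
        return trackedRequirements.stream()
            .mapToLong(req -> req == null ? fallback : req)
            .sum();
    }

    public static void main(String[] args) {
        long defaultLimit = 1024L * 1024 * 1024; // illustrative ~1 GB default per job
        // 50 jobs of ~0.1 GB each (5 GB total), but a tracker glitch lost every requirement.
        List<Long> glitched = Collections.nCopies(50, (Long) null);

        // Defaulting to a large value would request ~50 GB of extra capacity;
        // defaulting to 0 requests nothing until the tracker is refreshed, so only
        // known requirements can drive a scale up.
        System.out.println(totalRequirement(glitched, defaultLimit)); // 53687091200
        System.out.println(totalRequirement(glitched, 0L));           // 0
    }
}
```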

benwtrent (Member, author):

> Since this is for scaling up, I think we should use 0 for jobs where we have no memory information

The reason I did it this way is that using 0 caused a scale_up to be delayed. If we are fine with the scale up being delayed, then it's fine.

@droberts195 droberts195 Nov 16, 2020

I think this should only happen once per master node though, if the Cloud infrastructure asks for a scaling decision immediately after that master node became master. In that rare situation it might be better not to make a hasty decision that will need to get adjusted again at the next scaling decision.

When a job is opened/started via the open/start API, we put a value for its memory requirement in the memory tracker. And when a new master takes over, we recalculate the memory requirement of all jobs.

So the memory tracker should only return null for a job’s memory requirement in the period between a new master being elected and the recalculation of all jobs’ memory requirements completing.

I think it would be good to log a warning if the default is used during a scale up decision. We'll need this to debug the situation where a cluster doesn't scale at all due to some unforeseen problem. But unless I am mistaken I think it will be very rare to see that log message. In the case of a new master node being elected and an autoscaling decision being requested soon afterwards we should see the log message once. And only in the case of some unforeseen problem will we see the log message repeatedly.
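
A sketch of the fallback-plus-warning being suggested here; the tracker interface and logger are simplified placeholders rather than the PR's actual classes:

```java
import java.util.logging.Logger;

public final class ScaleUpMemoryLookupExample {

    private static final Logger logger = Logger.getLogger(ScaleUpMemoryLookupExample.class.getName());

    // Placeholder for the memory tracker lookup; returns null when the requirement is unknown.
    interface MemoryTracker {
        Long getJobMemoryRequirement(String jobId);
    }

    /** Treat unknown requirements as 0 so only known memory drives a scale up, and warn when that happens. */
    static long requirementOrZero(MemoryTracker tracker, String jobId) {
        Long requirement = tracker.getJobMemoryRequirement(jobId);
        if (requirement == null) {
            // Expected only in the window between a new master being elected and the
            // recalculation of all jobs' memory requirements completing.
            logger.warning("[" + jobId + "] memory requirement unknown during scale up decision; treating it as 0");
            return 0L;
        }
        return requirement;
    }
}
```

If the message appears repeatedly, something other than a recent master election is wrong, which is exactly the trail wanted for support cases.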

nodeLoads.add(nodeLoad);
isMemoryAccurateFlag = isMemoryAccurateFlag && nodeLoad.isUseMemory();
}
// Even if we verify that memory usage is up today before checking node capacity, we could still run into stale information.
droberts195 (Contributor):

Suggested change
// Even if we verify that memory usage is up today before checking node capacity, we could still run into stale information.
// Even if we verify that memory usage is up to date before checking node capacity, we could still run into stale information.

@droberts195 droberts195 left a comment

I'll be happy to merge this if you could just change a few more things related to what happens if memory information isn't available.

Duration memoryTrackingStale,
AutoscalingDeciderResult potentialResult) {
if (mlMemoryTracker.isRecentlyRefreshed(memoryTrackingStale) == false) {
return buildDecisionAndRequestRefresh(reasonBuilder);
droberts195 (Contributor):

Am I correct that getting here should be incredibly rare, because we checked isRecentlyRefreshed near the beginning of the scale method, so we'll only get here if the definition of "recently" was breached while the code of the scale method was running?

If this is correct then we should log a warning here, because if some strange bug causes us to get here more often then we'll want to know when dealing with the "why doesn't my cluster scale" support case.
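
A sketch of adding the warning suggested above at this guard. The names mirror the quoted snippet, but the surrounding types are simplified stand-ins, not the PR's decider classes:

```java
import java.time.Duration;
import java.util.logging.Logger;

public final class StalenessGuardExample {

    private static final Logger logger = Logger.getLogger(StalenessGuardExample.class.getName());

    // Simplified stand-ins for the PR's memory tracker and decider result types.
    interface MlMemoryTracker { boolean isRecentlyRefreshed(Duration staleTolerance); }
    record ScalingResult(String reason) {}

    private final MlMemoryTracker mlMemoryTracker;

    StalenessGuardExample(MlMemoryTracker mlMemoryTracker) {
        this.mlMemoryTracker = mlMemoryTracker;
    }

    ScalingResult checkMemoryFreshness(Duration memoryTrackingStale, ScalingResult potentialResult) {
        if (mlMemoryTracker.isRecentlyRefreshed(memoryTrackingStale) == false) {
            // Staleness was already checked earlier in the scale path, so reaching this
            // branch should be rare; the warning leaves a trail for support cases.
            logger.warning("ML memory tracking became stale while computing a scaling decision; "
                + "requesting a refresh and returning a no_scale result");
            return new ScalingResult("no_scale: memory tracking stale, refresh requested");
        }
        return potentialResult;
    }
}
```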

benwtrent (Member, author):

> Am I correct that getting here should be incredibly rare

We will hit this predicate if scale_up fails due to memory being stale.

We will also hit this predicate if scale_down fails due to memory being stale (more rare).

droberts195 (Contributor):

Oh yes, sorry. When I wrote this I was thinking that scale was checking for staleness before both scale up and scale down, but it checks in between.

int maxNumInQueue) {
List<Long> jobSizes = unassignedJobs
.stream()
// TODO do we want to verify memory requirements aren't stale? Or just consider `null` a fastpath?
droberts195 (Contributor):

Please remove this comment, as staleness is already being checked elsewhere.

}

private Long getAnalyticsMemoryRequirement(String analyticsId) {
return mlMemoryTracker.getDataFrameAnalyticsJobMemoryRequirement(analyticsId);
droberts195 (Contributor):

Given that we check staleness at the beginning of the scale method, I think it should be impossible to return null here given where it's called from. So:

  1. Please add a Javadoc comment saying this method must only be called after checking that the memory tracker has recently been refreshed
  2. Add an assertion that the return value isn't null so we can catch unexpected situations in tests
  3. Add a warning log so that if we return null in production and it causes a support case that autoscaling isn't working we have something in the log
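
A sketch of the three suggested changes applied to a getter of this shape; the tracker interface and logger are simplified stand-ins for the real MlMemoryTracker and the cluster's logging:

```java
import java.util.logging.Logger;

public final class MemoryRequirementGetterExample {

    private static final Logger logger = Logger.getLogger(MemoryRequirementGetterExample.class.getName());

    // Simplified stand-in for the real memory tracker.
    interface MemoryTracker {
        Long getDataFrameAnalyticsJobMemoryRequirement(String analyticsId);
    }

    private final MemoryTracker mlMemoryTracker;

    MemoryRequirementGetterExample(MemoryTracker mlMemoryTracker) {
        this.mlMemoryTracker = mlMemoryTracker;
    }

    /**
     * Must only be called after verifying that the memory tracker has recently been refreshed;
     * otherwise the requirement may be unknown.
     */
    Long getAnalyticsMemoryRequirement(String analyticsId) {
        Long requirement = mlMemoryTracker.getDataFrameAnalyticsJobMemoryRequirement(analyticsId);
        // The assertion catches unexpected nulls in tests ...
        assert requirement != null : "[" + analyticsId + "] memory requirement unexpectedly missing";
        if (requirement == null) {
            // ... and the warning leaves evidence in production if autoscaling silently stalls.
            logger.warning("[" + analyticsId + "] memory requirement missing despite a recent refresh");
        }
        return requirement;
    }
}
```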

benwtrent (Member, author):

> Given that we check staleness at the beginning of the scale method I think it should be impossible to return null here given where it's called from.

This method is also used in scale_up, which does not directly check that the memory information is not stale.

But, in one place where this method is used (right before scale down), we have recently checked. I will put an assert there.

benwtrent (Member, author):

I will also put an assert in scale down. If the node load comes back saying that memory usage is stale, that is also a weird scenario and we should fail, as we only just checked.

}

private Long getAnomalyMemoryRequirement(String anomalyId) {
return mlMemoryTracker.getAnomalyDetectorJobMemoryRequirement(anomalyId);
droberts195 (Contributor):

Given that we check staleness at the beginning of the scale method, I think it should be impossible to return null here given where it's called from. So:

  1. Please add a Javadoc comment saying this method must only be called after checking that the memory tracker has recently been refreshed
  2. Add an assertion that the return value isn't null so we can catch unexpected situations in tests
  3. Add a warning log so that if we return null in production and it causes a support case that autoscaling isn't working we have something in the log

@droberts195 droberts195 left a comment

LGTM

@benwtrent benwtrent merged commit ebd0996 into elastic:master Nov 17, 2020
@benwtrent benwtrent deleted the feature/ml-autoscaling-integration branch November 17, 2020 15:54
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Nov 17, 2020
benwtrent added a commit that referenced this pull request Nov 17, 2020
* [ML] Autoscaling for machine learning (#59309)
