[ML] Autoscaling for machine learning #59309
Conversation
I didn't review the code at a low level as it may change, but left some comments about edge cases that need consideration.
Pinging @elastic/es-distributed (:Distributed/Autoscaling)
Pinging @elastic/ml-core (:ml)
I had to stop reviewing as other things came up, but here are the two minor things I'd typed before that. I'll have another look on Monday.
I just have one concern. Apart from that I think this is good to merge so we get a first attempt into CI.
// TODO Better default???
AnalysisLimits.DEFAULT_MODEL_MEMORY_LIMIT_MB,
I think this is dangerous. It means we could scale the cluster significantly due to the memory tracker not having the required information. It would also be difficult to diagnose this based on a user complaint that their cluster scaled wildly. For example, suppose there are 50 jobs each with a memory requirement of 0.1GB, total requirement 5GB, but then due to a glitch in the memory tracker we treat that as 50GB and scale up to a much more costly cluster.
Since this is for scaling up, I think we should use 0 for jobs where we have no memory information, so we'll only scale up if the sum of the memory requirements for the jobs whose requirements we do know implies a scale up.
(The same applies a few lines below.)
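A minimal sketch of that suggestion (the helper and parameter names here are illustrative, not the actual PR code): sum only the requirements the memory tracker actually knows, so a missing entry contributes 0 and can never inflate the scale-up target the way the 50GB example above would.

```java
import java.util.List;
import java.util.function.Function;

class UnassignedJobMemory {
    /**
     * Sums the memory requirements of unassigned jobs for a scale-up decision.
     * memoryLookup returns the tracked requirement in bytes, or null when the
     * memory tracker has no information for that job.
     */
    static long totalRequiredBytes(List<String> jobIds, Function<String, Long> memoryLookup) {
        long total = 0;
        for (String jobId : jobIds) {
            Long requirement = memoryLookup.apply(jobId);
            if (requirement == null) {
                continue; // unknown requirement: treat as 0 so missing data never forces a scale up
            }
            total += requirement;
        }
        return total;
    }
}
```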
> Since this is for scaling up, I think we should use 0 for jobs where we have no memory information

The reason I didn't do this originally was that using 0 causes a scale_up to be delayed. If we are fine with the scale up being delayed, then it's fine.
I think this should only happen once per master node though, if the Cloud infrastructure asks for a scaling decision immediately after that master node became master. In that rare situation it might be better not to make a hasty decision that will need to get adjusted again at the next scaling decision.
When a job is opened/started via the open/start API, we put a value for its memory requirement in the memory tracker. And when a new master takes over, we recalculate the memory requirement of all jobs.
So the memory tracker should only return null for a job’s memory requirement in the period between a new master being elected and the recalculation of all jobs’ memory requirements completing.
I think it would be good to log a warning if the default is used during a scale up decision. We'll need this to debug the situation where a cluster doesn't scale at all due to some unforeseen problem. But unless I am mistaken I think it will be very rare to see that log message. In the case of a new master node being elected and an autoscaling decision being requested soon afterwards we should see the log message once. And only in the case of some unforeseen problem will we see the log message repeatedly.
nodeLoads.add(nodeLoad);
isMemoryAccurateFlag = isMemoryAccurateFlag && nodeLoad.isUseMemory();
}
// Even if we verify that memory usage is up today before checking node capacity, we could still run into stale information.
- // Even if we verify that memory usage is up today before checking node capacity, we could still run into stale information.
+ // Even if we verify that memory usage is up to date before checking node capacity, we could still run into stale information.
I'll be happy to merge this if you could just change a few more things related to what happens if memory information isn't available.
Duration memoryTrackingStale,
AutoscalingDeciderResult potentialResult) {
    if (mlMemoryTracker.isRecentlyRefreshed(memoryTrackingStale) == false) {
        return buildDecisionAndRequestRefresh(reasonBuilder);
Am I correct that getting here should be incredibly rare, because we checked isRecentlyRefreshed near the beginning of the scale method, so we'll only get here if the definition of "recently" was breached while the code of the scale method was running?
If this is correct, then we should log a warning here, because if some strange bug causes us to get here more often, we'll want to know when dealing with the "why doesn't my cluster scale" support case.
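If that reading is right, the change could be as small as the following sketch (the log message is a placeholder; the identifiers are the ones from the snippet above):

```java
if (mlMemoryTracker.isRecentlyRefreshed(memoryTrackingStale) == false) {
    // Should only trigger if "recently" expired while scale() was running,
    // so seeing this repeatedly in the logs points at a real problem.
    logger.warn("ML memory tracking became stale while making a scaling decision; requesting refresh and returning no_scale");
    return buildDecisionAndRequestRefresh(reasonBuilder);
}
```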
> Am I correct that getting here should be incredibly rare
We will hit this predicate if scale_up fails due to memory being stale.
We will also hit this predicate if scale_down fails due to memory being stale (more rare).
Oh yes, sorry. When I wrote this I was thinking that scale was checking for staleness before both scale up and scale down, but it checks in between.
int maxNumInQueue) {
    List<Long> jobSizes = unassignedJobs
        .stream()
        // TODO do we want to verify memory requirements aren't stale? Or just consider `null` a fastpath?
Please remove this comment, as staleness is already being checked elsewhere.
}

private Long getAnalyticsMemoryRequirement(String analyticsId) {
    return mlMemoryTracker.getDataFrameAnalyticsJobMemoryRequirement(analyticsId);
Given that we check staleness at the beginning of the scale method, I think it should be impossible to return null here given where it's called from. So:
- Please add a Javadoc comment saying this method must only be called after checking that the memory tracker is recently refreshed
- Add an assertion that the return value isn't null so we can catch unexpected situations in tests
- Add a warning log so that if we return null in production and it causes a support case that autoscaling isn't working, we have something in the log
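Putting those three suggestions together, the method could look roughly like this (sketch only; the logger name and log wording are illustrative):

```java
/**
 * Gets the memory requirement for the given data frame analytics job.
 * Must only be called after verifying that the memory tracker has been
 * recently refreshed, otherwise the requirement may be missing.
 */
private Long getAnalyticsMemoryRequirement(String analyticsId) {
    Long requirement = mlMemoryTracker.getDataFrameAnalyticsJobMemoryRequirement(analyticsId);
    // Staleness is checked before this is called, so a missing requirement is unexpected.
    assert requirement != null : "unexpected null memory requirement for analytics job [" + analyticsId + "]";
    if (requirement == null) {
        logger.warn("[{}] memory requirement unexpectedly missing; autoscaling decision may be inaccurate", analyticsId);
    }
    return requirement;
}
```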
> Given that we check staleness at the beginning of the scale method I think it should be impossible to return null here given where it's called from.

This method is also used in scale_up, which does not directly check that the memory information is not stale.
But, in one place where this method is used (right before scale down), we have recently checked. I will put an assert there.
I will also put an assert in scale down. If the node load returns saying that memory is stale, that is also a weird scenario and we should fail as we just recently checked.
}

private Long getAnomalyMemoryRequirement(String anomalyId) {
    return mlMemoryTracker.getAnomalyDetectorJobMemoryRequirement(anomalyId);
Given that we check staleness at the beginning of the scale method, I think it should be impossible to return null here given where it's called from. So:
- Please add a Javadoc comment saying this method must only be called after checking that the memory tracker is recently refreshed
- Add an assertion that the return value isn't null so we can catch unexpected situations in tests
- Add a warning log so that if we return null in production and it causes a support case that autoscaling isn't working, we have something in the log
LGTM
This provides an autoscaling service integration for machine learning. The underlying logic is fairly straightforward, with a couple of caveats:
- When considering whether to scale up/down, ML will automatically translate between node size and the memory that the node will potentially provide for ML after the scaling plan is implemented.
- If knowledge of job sizes is out of date, we do a best-effort check for scaling up. But if the decision cannot be made with our current view of job memory, we attempt a refresh and return a no_scale event.
- We assume that if the auto memory percent calculation is being used, all JVM sizes on the nodes are treated the same.
- For scale down, we keep our last scale down calculation time in memory. So, if master nodes change in between, we reset the scale down delay.
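A rough sketch of that last caveat, the in-memory scale-down timer (class and method names here are illustrative, not the actual implementation): because the timestamp lives only in memory on the elected master, a master change implicitly resets the delay.

```java
import java.time.Duration;
import java.time.Instant;

class ScaleDownDelayTracker {
    // When a scale down first looked possible; null means no scale down is pending.
    private volatile Instant scaleDownDetected;

    /** Returns true only once a scale down has looked possible for at least scaleDownDelay. */
    boolean scaleDownDelaySatisfied(Instant now, Duration scaleDownDelay) {
        if (scaleDownDetected == null) {
            scaleDownDetected = now; // start (or restart after a master change) the countdown
        }
        return Duration.between(scaleDownDetected, now).compareTo(scaleDownDelay) >= 0;
    }

    void reset() {
        scaleDownDetected = null; // e.g. when a scale up happens or the spare capacity is needed again
    }
}
```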