[ML] Autoscaling for machine learning #59309

Merged: 33 commits merged into elastic:master from feature/ml-autoscaling-integration on Nov 17, 2020

Conversation

@benwtrent benwtrent commented Jul 9, 2020

This provides an autoscaling service integration for machine learning.

The underlying logic is fairly straightforward with a couple of caveats:

  • When deciding whether to scale up or down, ML automatically translates between node size and the memory that the node will provide for ML once the scaling plan is implemented (see the sketch after this list).
  • If our knowledge of job sizes is out of date, we make a best-effort check for scaling up. If a decision cannot be made with the current view of job memory, we request a refresh and return a no_scale event.
  • We assume that if the automatic memory percent calculation is being used, all JVM sizes on the nodes are the same.
  • For scale down, the time of the last scale-down calculation is kept in memory only, so if the master node changes in between, the scale-down delay is reset.
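
A minimal sketch of the node-size-to-ML-memory translation mentioned in the first bullet. The class name, constant, and percentage below are illustrative stand-ins, not the calculator actually used in this PR:

```java
public final class MlNodeMemoryExample {

    // Assumed fraction of node memory usable by ML native processes (illustrative only).
    private static final double ML_MEMORY_PERCENT = 0.30;

    /** Memory that ML could use on a node of the given total size. */
    static long mlMemoryForNodeSize(long nodeSizeBytes) {
        return (long) (nodeSizeBytes * ML_MEMORY_PERCENT);
    }

    /** Smallest node size whose ML share covers the required ML memory (the inverse translation). */
    static long nodeSizeForMlMemory(long requiredMlBytes) {
        return (long) Math.ceil(requiredMlBytes / ML_MEMORY_PERCENT);
    }

    public static void main(String[] args) {
        long nodeSize = 8L * 1024 * 1024 * 1024;             // an 8 GB node
        long mlAvailable = mlMemoryForNodeSize(nodeSize);    // ~2.4 GB usable by ML
        long requirement = 4L * 1024 * 1024 * 1024;          // jobs need 4 GB of ML memory
        long nodeNeeded = nodeSizeForMlMemory(requirement);  // ~13.3 GB node requested
        System.out.printf("ML available: %d bytes, node size needed: %d bytes%n", mlAvailable, nodeNeeded);
    }
}
```

The inverse function is what lets the decider express "the jobs need X bytes of ML memory" as "request a node of at least Y bytes".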

@benwtrent benwtrent force-pushed the feature/ml-autoscaling-integration branch 2 times, most recently from c4e1f9a to e5f75e5 on August 25, 2020 12:21
@benwtrent benwtrent force-pushed the feature/ml-autoscaling-integration branch from e5f75e5 to 4ceb881 on August 27, 2020 18:21
@droberts195 droberts195 left a comment

I didn't review the code at a low level as it may change, but left some comments about edge cases that need consideration.

@benwtrent benwtrent marked this pull request as ready for review November 12, 2020 15:26
@elasticmachine

Pinging @elastic/es-distributed (:Distributed/Autoscaling)

@elasticmachine elasticmachine added the Team:Distributed (Meta label for distributed team) label on Nov 12, 2020
@elasticmachine

Pinging @elastic/ml-core (:ml)

@benwtrent benwtrent changed the title from "Autoscaling for machine learning" to "[ML] Autoscaling for machine learning" on Nov 12, 2020
@droberts195 droberts195 left a comment

I had to stop reviewing as other things came up, but here are the two minor things I'd typed before that. I'll have another look on Monday.

@droberts195 droberts195 left a comment

I just have one concern. Apart from that I think this is good to merge so we get a first attempt into CI.

Comment on lines 368 to 369
// TODO Better default???
AnalysisLimits.DEFAULT_MODEL_MEMORY_LIMIT_MB,
droberts195 (Contributor):

I think this is dangerous. It means we could scale the cluster significantly due to the memory tracker not having the required information. It would also be difficult to diagnose this based on a user complaint that their cluster scaled wildly. For example, suppose there are 50 jobs each with a memory requirement of 0.1GB, total requirement 5GB, but then due to a glitch in the memory tracker we treat that as 50GB and scale up to a much more costly cluster.

Since this is for scaling up, I think we should use 0 for jobs where we have no memory information, so we'll only scale up if the sum of the memory requirements for jobs that we know the memory requirement for imply a scale up.

(Same a few lines below.)
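
To make the arithmetic concrete, here is a small self-contained sketch (hypothetical numbers and names) contrasting the two fallback strategies for jobs whose memory requirement is unknown:

```java
import java.util.Collections;
import java.util.List;

public final class ScaleUpDefaultExample {

    // Sum the job requirements, substituting `fallback` when the tracker has no value (null).
    static long totalRequirement(List<Long> trackedRequirements, long fallback) {
        return trackedRequirements.stream()
            .mapToLong(req -> req == null ? fallback : req)
            .sum();
    }

    public static void main(String[] args) {
        long defaultLimit = 1024L * 1024 * 1024; // illustrative ~1 GB default per job
        // 50 jobs of ~0.1 GB each (5 GB total), but a tracker glitch lost every requirement.
        List<Long> glitched = Collections.nCopies(50, (Long) null);

        // Defaulting to a large value would request ~50 GB of extra capacity;
        // defaulting to 0 requests nothing until the tracker is refreshed, so only
        // known requirements can drive a scale up.
        System.out.println(totalRequirement(glitched, defaultLimit)); // 53687091200
        System.out.println(totalRequirement(glitched, 0L));           // 0
    }
}
```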

benwtrent (Member, author):

> Since this is for scaling up, I think we should use 0 for jobs where we have no memory information

The reason I did it this way is that using 0 caused a scale_up to be delayed. If we are fine with the scale up being delayed, then it's fine.

@droberts195 droberts195 Nov 16, 2020

I think this should only happen once per master node though, if the Cloud infrastructure asks for a scaling decision immediately after that master node became master. In that rare situation it might be better not to make a hasty decision that will need to get adjusted again at the next scaling decision.

When a job is opened/started via the open/start API, we put a value for its memory requirement in the memory tracker. And when a new master takes over, we recalculate the memory requirement of all jobs.

So the memory tracker should only return null for a job’s memory requirement in the period between a new master being elected and the recalculation of all jobs’ memory requirements completing.

I think it would be good to log a warning if the default is used during a scale up decision. We'll need this to debug the situation where a cluster doesn't scale at all due to some unforeseen problem. But unless I am mistaken I think it will be very rare to see that log message. In the case of a new master node being elected and an autoscaling decision being requested soon afterwards we should see the log message once. And only in the case of some unforeseen problem will we see the log message repeatedly.
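
A sketch of the fallback-plus-warning being suggested here; the tracker interface and logger are simplified placeholders rather than the PR's actual classes:

```java
import java.util.logging.Logger;

public final class ScaleUpMemoryLookupExample {

    private static final Logger logger = Logger.getLogger(ScaleUpMemoryLookupExample.class.getName());

    // Placeholder for the memory tracker lookup; returns null when the requirement is unknown.
    interface MemoryTracker {
        Long getJobMemoryRequirement(String jobId);
    }

    /** Treat unknown requirements as 0 so only known memory drives a scale up, and warn when that happens. */
    static long requirementOrZero(MemoryTracker tracker, String jobId) {
        Long requirement = tracker.getJobMemoryRequirement(jobId);
        if (requirement == null) {
            // Expected only in the window between a new master being elected and the
            // recalculation of all jobs' memory requirements completing.
            logger.warning("[" + jobId + "] memory requirement unknown during scale up decision; treating it as 0");
            return 0L;
        }
        return requirement;
    }
}
```

If the message appears repeatedly, something other than a recent master election is wrong, which is exactly the trail wanted for support cases.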

nodeLoads.add(nodeLoad);
isMemoryAccurateFlag = isMemoryAccurateFlag && nodeLoad.isUseMemory();
}
// Even if we verify that memory usage is up today before checking node capacity, we could still run into stale information.
droberts195 (Contributor):

Suggested change
// Even if we verify that memory usage is up today before checking node capacity, we could still run into stale information.
// Even if we verify that memory usage is up to date before checking node capacity, we could still run into stale information.

@droberts195 droberts195 left a comment

I'll be happy to merge this if you could just change a few more things related to what happens if memory information isn't available.

Duration memoryTrackingStale,
AutoscalingDeciderResult potentialResult) {
if (mlMemoryTracker.isRecentlyRefreshed(memoryTrackingStale) == false) {
return buildDecisionAndRequestRefresh(reasonBuilder);
droberts195 (Contributor):

Am I correct that getting here should be incredibly rare, because we checked isRecentlyRefreshed near the beginning of the scale method, so we'll only get here if the definition of "recently" was breached while the code of the scale method was running?

If this is correct then we should log a warning here, because if some strange bug causes us to get here more often then we'll want to know when dealing with the "why doesn't my cluster scale" support case.
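
A sketch of adding the warning suggested above at this guard. The names mirror the quoted snippet, but the surrounding types are simplified stand-ins, not the PR's decider classes:

```java
import java.time.Duration;
import java.util.logging.Logger;

public final class StalenessGuardExample {

    private static final Logger logger = Logger.getLogger(StalenessGuardExample.class.getName());

    // Simplified stand-ins for the PR's memory tracker and decider result types.
    interface MlMemoryTracker { boolean isRecentlyRefreshed(Duration staleTolerance); }
    record ScalingResult(String reason) {}

    private final MlMemoryTracker mlMemoryTracker;

    StalenessGuardExample(MlMemoryTracker mlMemoryTracker) {
        this.mlMemoryTracker = mlMemoryTracker;
    }

    ScalingResult checkMemoryFreshness(Duration memoryTrackingStale, ScalingResult potentialResult) {
        if (mlMemoryTracker.isRecentlyRefreshed(memoryTrackingStale) == false) {
            // Staleness was already checked earlier in the scale path, so reaching this
            // branch should be rare; the warning leaves a trail for support cases.
            logger.warning("ML memory tracking became stale while computing a scaling decision; "
                + "requesting a refresh and returning a no_scale result");
            return new ScalingResult("no_scale: memory tracking stale, refresh requested");
        }
        return potentialResult;
    }
}
```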

benwtrent (Member, author):

> Am I correct that getting here should be incredibly rare

We will hit this predicate if scale_up fails due to memory being stale.

We will also hit this predicate if scale_down fails due to memory being stale (more rare).

droberts195 (Contributor):

Oh yes, sorry. When I wrote this I was thinking that scale was checking for staleness before both scale up and scale down, but it checks in between.

int maxNumInQueue) {
List<Long> jobSizes = unassignedJobs
.stream()
// TODO do we want to verify memory requirements aren't stale? Or just consider `null` a fastpath?
droberts195 (Contributor):

Please remove this comment, as staleness is already being checked elsewhere.

}

private Long getAnalyticsMemoryRequirement(String analyticsId) {
return mlMemoryTracker.getDataFrameAnalyticsJobMemoryRequirement(analyticsId);
droberts195 (Contributor):

Given that we check staleness at the beginning of the scale method, I think it should be impossible to return null here given where it's called from. So:

  1. Please add a Javadoc comment saying this method must only be called after checking that the memory tracker has recently been refreshed
  2. Add an assertion that the return value isn't null so we can catch unexpected situations in tests
  3. Add a warning log so that if we return null in production and it causes a support case that autoscaling isn't working we have something in the log
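
A sketch of the three suggested changes applied to a getter of this shape; the tracker interface and logger are simplified stand-ins for the real MlMemoryTracker and the cluster's logging:

```java
import java.util.logging.Logger;

public final class MemoryRequirementGetterExample {

    private static final Logger logger = Logger.getLogger(MemoryRequirementGetterExample.class.getName());

    // Simplified stand-in for the real memory tracker.
    interface MemoryTracker {
        Long getDataFrameAnalyticsJobMemoryRequirement(String analyticsId);
    }

    private final MemoryTracker mlMemoryTracker;

    MemoryRequirementGetterExample(MemoryTracker mlMemoryTracker) {
        this.mlMemoryTracker = mlMemoryTracker;
    }

    /**
     * Must only be called after verifying that the memory tracker has recently been refreshed;
     * otherwise the requirement may be unknown.
     */
    Long getAnalyticsMemoryRequirement(String analyticsId) {
        Long requirement = mlMemoryTracker.getDataFrameAnalyticsJobMemoryRequirement(analyticsId);
        // The assertion catches unexpected nulls in tests ...
        assert requirement != null : "[" + analyticsId + "] memory requirement unexpectedly missing";
        if (requirement == null) {
            // ... and the warning leaves evidence in production if autoscaling silently stalls.
            logger.warning("[" + analyticsId + "] memory requirement missing despite a recent refresh");
        }
        return requirement;
    }
}
```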

benwtrent (Member, author):

> Given that we check staleness at the beginning of the scale method I think it should be impossible to return null here given where it's called from.

This method is also used in scale_up, which does not directly check that the memory information is not stale.

But, in one place where this method is used (right before scale down), we have recently checked. I will put an assert there.

benwtrent (Member, author):

I will also put an assert in scale down. If the node load comes back saying that memory usage is stale, that is also a weird scenario and we should fail, as we only just checked.

}

private Long getAnomalyMemoryRequirement(String anomalyId) {
return mlMemoryTracker.getAnomalyDetectorJobMemoryRequirement(anomalyId);
droberts195 (Contributor):

Given that we check staleness at the beginning of the scale method, I think it should be impossible to return null here given where it's called from. So:

  1. Please add a Javadoc comment saying this method must only be called after checking that the memory tracker has recently been refreshed
  2. Add an assertion that the return value isn't null so we can catch unexpected situations in tests
  3. Add a warning log so that if we return null in production and it causes a support case that autoscaling isn't working we have something in the log

@droberts195 droberts195 left a comment

LGTM

@benwtrent benwtrent merged commit ebd0996 into elastic:master Nov 17, 2020
@benwtrent benwtrent deleted the feature/ml-autoscaling-integration branch November 17, 2020 15:54
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Nov 17, 2020
benwtrent added a commit that referenced this pull request Nov 17, 2020
* [ML] Autoscaling for machine learning (#59309)
