
Feature Proposal: Asymmetric Scaling Policy #357

Open

alexlo03 (Contributor) opened this issue Jul 16, 2024 · 0 comments

Background

The current autoscaler logic is roughly (sketched in code after the list):

  1. Collect metrics over a time interval (default: the last 5m).
    1. Over that interval, take the max value for each metric (High Priority CPU, 24H CPU, Storage).
  2. Determine the appropriate Spanner size.
    1. Start with maxSuggestedSize = min_size.
    2. Count Spanner databases: maxSuggestedSize = the size implied by the number of databases.
    3. For each metric, determine a proposed size (subject to scaleInLimit in Linear), with rounding applied.
    4. maxSuggestedSize = the max of all metric proposals.
    5. Final proposal = min(maxSuggestedSize, max_size).
  3. Check the final proposal and scale.
    1. If a scaling operation is already in progress: exit.
    2. If within the cooldown period: exit.
    3. Otherwise, perform the scaling event.
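A minimal sketch of that sizing pass, assuming simplified inputs; the function and field names below (getSuggestedSize, sizeForDatabaseCount, proposeSizeForMetric, spanner.minSize, etc.) are illustrative, not the autoscaler's actual API:

```js
// Simplified sketch of the sizing pass described above. All names here are
// illustrative stand-ins, not the project's real functions or config fields.
function getSuggestedSize(spanner, metrics) {
  // 2.1: start from the configured floor.
  let maxSuggestedSize = spanner.minSize;

  // 2.2: never go below the size implied by the number of databases.
  maxSuggestedSize = Math.max(maxSuggestedSize, sizeForDatabaseCount(spanner));

  // 2.3 + 2.4: each metric proposes a size (scaleInLimit and rounding applied
  // inside the hypothetical helper); keep the largest proposal.
  for (const metric of metrics) {
    maxSuggestedSize = Math.max(
      maxSuggestedSize,
      proposeSizeForMetric(spanner, metric)
    );
  }

  // 2.5: never exceed the configured ceiling.
  return Math.min(maxSuggestedSize, spanner.maxSize);
}
```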

Problems

  1. The autoscaler logic is greedy, which can lead to scaling "bouncing": a single "calmer" 5m window is enough to trigger a scale-in. Spanner scaling events are not smooth experiences and we'd like to avoid them when possible. Here is an important production database (this example is from today):

[Screenshot 2024-07-16 at 09 50 40]

We don't want that scale-in behavior, but we do want the immediate scale-out behavior. We'd prefer an "Asymmetric Policy", for example:

Scale-in policy: "only scale in when it is clearly a good idea"
"If, over the last 1h/2h/4h, you wanted to scale in the entire time without exception, then scale in."

Scale-out policy: "scale out whenever things get hot"
"If you see over the last 5m that you need more, go ahead and scale out."

This is not easily expressible right now.

  • scaleInCoolingMinutes would still let the bounce happen.
  • Adding another High Priority CPU metric with a look-back "period" = 4h (or, 4h divided by 5) doesn't work either: if there is a spike and the autoscaler scales out, then even after scaling out it will still see that spike in the 4h look-back and want to scale out again.
  • The only solution I can see is to use Custom Scaling Methods. You'd have to define scale_out metrics with the short look-back (5m) and scale_in metrics with the longer look-back (4h), then process those metrics differently in calculateSize() (see the sketch after this list).
  2. Asymmetric metrics error handling: in the case where metrics are bad (see Ignore Bad Values from Google Metrics #355), we'd want different behavior for scale-out vs. scale-in. If you get an incomplete metric set (say a CPU metric is zero), the entire possibility of scaling in should be discarded (fail static). But if a single metric does return a signal that scaling out is warranted, then scale-out should happen; an incomplete signal can be enough to scale out. Again, this could be treated specially in a Custom Scaling Method.
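To make the ask concrete, here is a rough sketch of what such a Custom Scaling Method could look like. It only illustrates the intended behavior: the calculateSize() signature and the metric fields used below (isScaleInMetric, maxValue, highThreshold, lowThreshold) and the sizeForMetric() helper are assumptions, not the project's actual interfaces.

```js
// Illustrative only: the calculateSize() signature, the metric fields and
// sizeForMetric() are assumed, not the autoscaler's real API.
function calculateSize(currentSize, metrics) {
  // scale_out metrics use a short (5m) look-back, scale_in metrics a long (4h)
  // one; maxValue is the max observed over that metric's own look-back window.
  const scaleOutMetrics = metrics.filter((m) => !m.isScaleInMetric);
  const scaleInMetrics = metrics.filter((m) => m.isScaleInMetric);

  // Scale out: a single valid metric above its range is enough, even if other
  // metrics are missing or bad.
  let suggested = currentSize;
  for (const m of scaleOutMetrics) {
    if (m.maxValue != null && m.maxValue > m.highThreshold) {
      suggested = Math.max(suggested, sizeForMetric(m)); // hypothetical helper
    }
  }
  if (suggested > currentSize) {
    return suggested;
  }

  // Scale in: if any metric is missing or obviously bad (e.g. CPU reported as
  // 0), fail static. Only scale in when every metric stayed below its range
  // for the entire long look-back window, i.e. maxValue < lowThreshold.
  const anyBad = scaleInMetrics.some((m) => m.maxValue == null || m.maxValue === 0);
  const allBelow = scaleInMetrics.every((m) => m.maxValue < m.lowThreshold);
  if (scaleInMetrics.length > 0 && !anyBad && allBelow) {
    // min/max clamping from the existing flow would still apply afterwards.
    return Math.max(...scaleInMetrics.map(sizeForMetric));
  }

  return currentSize;
}
```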

Summary

I think I can make things work using a Custom Scaling Method, and maybe I will do that, but I think users of this project generally want the same things I do, and addressing this in the core project would be a good idea. Thanks.
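If this did land in core, one possible shape is per-direction look-backs and triggers. This is purely hypothetical; none of these config keys exist in the autoscaler today:

```js
// Purely hypothetical config shape -- none of these keys exist today; this is
// just one way the asymmetric policy could be expressed.
const asymmetricPolicyExample = {
  scaleOut: {
    lookbackMinutes: 5,                  // react to the most recent window
    trigger: 'ANY_METRIC_ABOVE_RANGE',   // one hot metric is enough
  },
  scaleIn: {
    lookbackMinutes: 240,                // e.g. 1h/2h/4h
    trigger: 'ALL_METRICS_BELOW_RANGE',  // every sample must agree; fail static otherwise
  },
};
```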

==

Side note: how the "Want Scale To" log metric works:

Examples:
            storage=0.5885764076510477%, BELOW the range [70%-80%] => however, cannot scale to 100 because it is lower than MIN 6000 PROCESSING_UNITS
            high_priority_cpu=7.566610239183218%, BELOW the range [60%-70%] => however, cannot scale to 700 because it is lower than MIN 6000 PROCESSING_UNITS

"want_scale_to %%{data:ignore}cannot scale to %%{number:want_scale_to}%%{data:ignore}"