
[Feature] Support for weighted zonal search request routing policy #2859

Closed
Bukhtawar opened this issue Apr 11, 2022 · 10 comments
Labels
distributed framework · enhancement (Enhancement or improvement to existing feature or request) · v2.5.0 (Issues and PRs related to version v2.5.0)

Comments

@Bukhtawar
Collaborator

Bukhtawar commented Apr 11, 2022

Is your feature request related to a problem? Please describe.
Search requests at the coordinator are routed round-robin across shard copies; with adaptive replica selection, the coordinator may instead route requests to copies that rank lower in preference based on certain parameters. However, there are use cases where a weighted routing policy adds value:

  1. Heterogeneous (zone-wise) instance types -- Certain instance capacities are sometimes available only in certain zones and not others, which means customers may choose to run 4xl instances in one zone and 2xl instances in others. Weighted routing, e.g. 2:1, might help such heterogeneous deployments.
  2. Zonal deployment model -- Software deployments can be slow, and operators may choose to deploy one zone at a time, which means traffic to the zone under deployment may need to be cut off. Setting the policy to 1:0 should effectively stop all search shard requests from going to copies in the AZ under deployment.
  3. Zonal failures -- Zonal failures are common, and there is no mechanism to weigh shard request traffic away from an unhealthy zone, even though HTTP traffic is weighted away.

Describe the solution you'd like
Support a weighted routing policy that can help incrementally weigh traffic away or route traffic according to the policy. To start small, we can provide a manual mechanism to configure policies, along with smarter defaults and guardrails to prevent acting on bad configurations.
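The per-zone weighting described above could be sketched roughly as follows. This is an illustrative sketch only, not the actual OpenSearch implementation; the function, zone names, and data shapes are all made up for the example:

```python
import random

def pick_copy(copies_by_zone, zone_weights):
    """Pick a shard copy, biasing selection by per-zone weight.

    copies_by_zone: dict mapping zone name -> list of shard copies
    zone_weights:   dict mapping zone name -> non-negative weight;
                    a weight of 0 removes the zone from routing entirely
                    (the 1:0 deployment/failure case above).
    """
    candidates = []
    weights = []
    for zone, copies in copies_by_zone.items():
        w = zone_weights.get(zone, 1)
        if w <= 0:
            continue  # zone is fully weighed away
        for copy in copies:
            candidates.append(copy)
            weights.append(w)
    if not candidates:
        raise RuntimeError("no eligible shard copies")
    # Weighted random pick: a 2:1 weight sends roughly twice the
    # traffic to copies in the heavier zone.
    return random.choices(candidates, weights=weights, k=1)[0]
```

With weights `{"zone-4xl": 2, "zone-2xl": 1}` this approximates the heterogeneous-instance case; with `{"zone-a": 1, "zone-b": 0}` it models cutting off the zone under deployment.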

@Bukhtawar Bukhtawar added enhancement Enhancement or improvement to existing feature or request untriaged labels Apr 11, 2022
@dblock
Member

dblock commented Apr 11, 2022

If I read this correctly, you're trying to solve for availability scenarios with AZ failures by using weighted routing. If that's the case, you might want to clarify above (maybe explain where you come from a little better?). Then, does it ever make sense to have heterogeneous instances within the same zone? in which case weighted routing may also be a good idea for better throughput and not just availability?

@Bukhtawar Bukhtawar changed the title [Feature] Support for weighted shard search request routing policy [Feature] Support for weighted zonal search request routing policy Apr 11, 2022
@Bukhtawar
Collaborator Author

If I read this correctly, you're trying to solve for availability scenarios with AZ failures by using weighted routing. If that's the case, you might want to clarify above (maybe explain where you come from a little better?). Then, does it ever make sense to have heterogeneous instances within the same zone? in which case weighted routing may also be a good idea for better throughput and not just availability?

Modified the issue to reflect zonal routing policy.

@dblock
Member

dblock commented Apr 11, 2022

So is the answer to "does it ever make sense to have heterogeneous instances within the same AZ" a no?

@Bukhtawar
Collaborator Author

It does make sense, but do you think we can build that capability incrementally? While building the zonal policy, we should see how it could be extended to these use cases in the future as well.

@dblock
Member

dblock commented Apr 13, 2022

Take a look at #2877 (comment), can we solve both this and that problem the same way or in the same path?

@andrross
Member

I agree with @dblock that there may be a common mechanism here that would solve many different use cases. Just for my own clarification though I have a couple questions :)

re: zonal failures - Are you referring to partial failures here where hosts in the failed zone are still responsive but have degraded performance? If the failed zone is fully partitioned away from the rest of the cluster and all network connections are broken then would weighting away be necessary?

re: zonal deployment model - Can this be solved by graceful shutdowns during deployments so that new traffic is not accepted and existing requests are allowed to complete? It seems like it would be preferable to solve this in a way that doesn't require the operator to orchestrate weighting policies during deployments, if that's possible.

@Bukhtawar
Collaborator Author

zonal failures - Are you referring to partial failures here where hosts in the failed zone are still responsive but have degraded performance? If the failed zone is fully partitioned away from the rest of the cluster and all network connections are broken then would weighting away be necessary?

Yes, weighing away would guarantee that transient network faults don't cause a flip-flop until the zonal failure completely heals. For predictability, it might be desirable to stop routing any traffic to the impacted zone/rack.
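To illustrate the flip-flop concern: if routing weights were driven directly by live health signals, a zone with intermittent network faults would oscillate in and out of rotation. An operator-pinned weight is sticky. The class and method names below are invented for the sketch and are not part of any real API:

```python
class RoutingWeights:
    """Hypothetical sketch: pinned weights override flapping health signals."""

    def __init__(self, zones):
        self.weights = {z: 1.0 for z in zones}
        self.pinned = set()

    def pin(self, zone, weight):
        """Operator pins a weight; transient health signals can't override it."""
        self.weights[zone] = weight
        self.pinned.add(zone)

    def unpin(self, zone):
        """Restore health-driven routing once the zone has fully healed."""
        self.pinned.discard(zone)

    def on_health_signal(self, zone, healthy):
        # Without pinning, a flapping health check would flip the weight
        # back and forth; a pinned zone ignores the signal entirely.
        if zone in self.pinned:
            return
        self.weights[zone] = 1.0 if healthy else 0.0
```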

zonal deployment model - Can this be solved by graceful shutdowns during deployments so that new traffic is not accepted and existing requests are allowed to complete? It seems like it would be preferable to solve this in a way that doesn't require the operator to orchestrate weighting policies during deployments, if that's possible.

Yes, the intent is to introduce graceful shutdowns. While I am not sure what you meant by "orchestrating weighting policies during deployments", the idea would be to allow controls to incrementally (e.g. 5% -> 20% -> 50% -> 100%) weigh traffic away or back, which might require operator or automated orchestration.
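The incremental ramp mentioned above could be driven by a small orchestration loop like the following sketch. The `apply_weight` callback is a placeholder for whatever mechanism pushes per-zone weights to the cluster; it is an assumption of this example, not a real API:

```python
import time

# Illustrative ramp: fraction of normal traffic the zone receives at
# each step (5% -> 20% -> 50% -> 100%); a weigh-away is the reverse.
RAMP_STEPS = [0.05, 0.20, 0.50, 1.00]

def ramp_zone_weight(apply_weight, zone, step_interval_s=60):
    """Gradually restore traffic to `zone` by stepping its weight up.

    apply_weight:    callback (zone, fraction) that pushes the new
                     per-zone weight to the cluster (hypothetical).
    step_interval_s: how long to hold each step before advancing,
                     giving time to observe error rates and latency.
    """
    applied = []
    for fraction in RAMP_STEPS:
        apply_weight(zone, fraction)
        applied.append(fraction)
        if fraction < 1.0:
            time.sleep(step_interval_s)
    return applied
```

An automated deployment pipeline could call this after a zone's hosts pass their post-deployment checks, holding or rolling back if metrics degrade at any step.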

@elfisher

Can we add labels for "roadmap" and the version of OpenSearch this is targeting? I can add it to the overall project roadmap in the right column once that is done.

@saratvemulapalli
Member

@Bukhtawar are we good to go for 2.5 ?
Code freeze is tomorrow.

@kotwanikunal
Member

@Bukhtawar are we good to go for 2.5 ? Code freeze is tomorrow.

@Bukhtawar Any updates?


8 participants