Envoy responding with errors despite correct config in config_dump
#9657
Comments
Are the routes served using RDS, or inline using LDS? If they are served using LDS you are likely looking at connection drain time.
I'm pretty sure Istio Pilot pushes routes using RDS. The listeners are already present when the routes are pushed.
Sorry, then I'm not sure off the top of my head. I would ask the Istio team to investigate.
The Istio team directed us here, since it appears that Envoy has the configuration but isn't acting on it.
Not sure which version of Istio and Envoy you are running, but FWIW, Envoy had a bug (#7939) that at one point caused config_dump to show the rejected config instead of the last applied config. Based on what you are describing here, it seems config_dump might be showing rejected config?
We're using Istio 1.4.2, which from what I can tell uses Envoy v1.12.1, and that version was cut after the resolution of that issue. Also, since the config does eventually get applied, I don't think it's being rejected, unless there are reasons Envoy might reject configuration other than the validity of the configuration itself.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
This remains an issue. Stale bots are bad practice.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.
We found that the 404 errors occur because the ClusterLoadAssignments (endpoints configured via EDS) are not yet ready for the cluster we want to reach. Those endpoints are not shown in the /config_dump endpoint, only in /clusters, which we didn't monitor.
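For anyone hitting the same symptom, here's a minimal sketch of checking `/clusters` alongside `/config_dump`. It assumes the Envoy admin interface is reachable at localhost:15000 (the Istio default) and uses the JSON form of the endpoint:

```python
import json
import urllib.request

ADMIN = "http://localhost:15000"  # assumed admin address (Istio's default)

def cluster_has_endpoints(cluster_name):
    """True once EDS has delivered at least one endpoint for the cluster.

    /config_dump omits ClusterLoadAssignments, so a route can look fully
    configured there while its cluster still has zero endpoints.
    """
    with urllib.request.urlopen(f"{ADMIN}/clusters?format=json") as resp:
        statuses = json.load(resp).get("cluster_statuses", [])
    for status in statuses:
        if status.get("name") == cluster_name:
            return bool(status.get("host_statuses"))
    return False  # this Envoy doesn't know about the cluster at all yet

# Hypothetical Istio-style cluster name:
# cluster_has_endpoints("outbound|80||my-app.default.svc.cluster.local")
```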
We (@astrieanna and I) are running scalability tests of Istio on GKE clusters. Our current results indicate that there's sometimes a significant delay between Istio configuring Envoy and Envoy (specifically ingress gateways) successfully serving traffic based on that configuration.
Our Question
We're wondering why there's such a discrepancy between when Envoy's `/config_dump` shows a dynamic route configuration and when we stop seeing 404s for that route. This feels like a discrepancy between what `config_dump` gives us and how Envoy is actually behaving.

We're also interested in advice on what other information we could capture to make debugging this easier, or alternative ways to configure Envoy to help with this issue. For context, scaling the Istio control plane is pretty much our job right now, so we'd love to hear your advice 😄
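To make that concrete, here's a minimal sketch (assuming the gateway's admin interface at localhost:15000) of pulling the virtual host names out of `/config_dump`, which is what we compare the 404s against:

```python
import json
import urllib.request

ADMIN = "http://localhost:15000"  # assumed admin address of the gateway

def known_virtual_hosts():
    """Collect dynamic_route_configs[].route_config.virtual_hosts[].name."""
    with urllib.request.urlopen(f"{ADMIN}/config_dump") as resp:
        sections = json.load(resp).get("configs", [])
    names = []
    for section in sections:
        # Only the routes section of the dump carries dynamic_route_configs.
        for rc in section.get("dynamic_route_configs", []):
            for vh in rc.get("route_config", {}).get("virtual_hosts", []):
                names.append(vh["name"])
    return names
```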
Our Setup
In our test we deploy 4,000 pods with services, then every 10 seconds we create a VirtualService (which becomes a `dynamic_route_config`) pointing to one of our applications. We then curl the route constantly until it comes up, record the time, then curl it for a few more minutes and record the last time we see an error (a sketch of this probe loop follows below).

In parallel, we monitor the list of hostnames that the gateways know about (via `dynamic_route_configs[].route_config.virtual_hosts[].name` from `/config_dump`). We record the first time a hostname appears in a given gateway's configuration, so we have a complete picture of how long configuration distribution takes.
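Here's a minimal sketch of that probe loop; the URL and timing constants are placeholders rather than our actual harness:

```python
import time
import urllib.error
import urllib.request

ROUTE_URL = "http://gateway.example.com/my-route"  # hypothetical route under test
PROBE_INTERVAL = 0.5   # seconds between probes (placeholder)
SETTLE_WINDOW = 180.0  # keep probing this long after the first success

def measure_route():
    """Return (first_success, last_error), in seconds since probing began."""
    start = time.monotonic()
    first_success = None
    last_error = None
    while first_success is None or (time.monotonic() - start) - first_success < SETTLE_WINDOW:
        now = time.monotonic() - start
        try:
            with urllib.request.urlopen(ROUTE_URL, timeout=5):
                pass  # any non-error response counts as "route is up"
            if first_success is None:
                first_success = now
        except urllib.error.URLError:
            last_error = now  # covers the 404s (HTTPError) we observe
        time.sleep(PROBE_INTERVAL)
    return first_success, last_error
```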
Summary of Results

Here are the graphs from our most recent test on 240 nodes with 80 ingress Envoys (each sharing a dedicated node with a single Istio Pilot). The lines are the max (red) and median (blue) latency between the creation of the VirtualService (the `kubectl apply`) and each of the events described above: the hostname first appearing in a gateway's `/config_dump`, the first successful response, and the last observed error.

Appendix: Full Results
Summary Results (which is where the above image is from)
We also have separate pages for each of the three runs that are summarized on that main page.
Thanks!
/cc @rosenhouse @howardjohn