
Envoy responding with errors despite correct config in config_dump endpoint #9657

Closed
XanderStrike opened this issue Jan 10, 2020 · 11 comments
Labels
investigate Potential bug that needs verification stale stalebot believes this issue/PR has not been touched recently

Comments

@XanderStrike

We (@astrieanna and I) are running scalability tests of Istio on GKE clusters. Our current results indicate that there's sometimes a significant delay between istio configuring Envoy and Envoy (specifically ingress gateways) successfully serving traffic based on that configuration.

Our Question

We're wondering why there's such a gap between when Envoy's config_dump shows a dynamic route configuration and when we stop seeing 404s for that route. This feels like a discrepancy between what config_dump reports and how Envoy is actually behaving.

We're also interested in advice on what other information we could capture to make debugging this easier, or alternative ways to configure Envoy to help with this issue. For context, scaling the Istio control plane is pretty much our job right now, so we'd love to hear your advice 😄

Our Setup

In our test we deploy 4,000 pods with services, then every 10 seconds we create a virtualservice (which becomes a dynamic_route_config) pointing to one of our applications. We then curl the route constantly until it comes up, record the time, then curl it for a few more minutes and record the last time we see an error.
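A sketch of that measurement loop, not the original test harness: poll the route until the first success, then keep polling for a settle window to record the last error. The probe, interval, and window are illustrative; in practice the probe would be a curl/HTTP request against the gateway.

```python
import time

def measure_route(probe, poll_interval=0.5, settle_seconds=120.0, clock=time.monotonic):
    """Poll `probe()` (returns True on HTTP success) until the route first
    succeeds, then keep polling for `settle_seconds` to catch late errors.
    Returns (first_success, last_error) as seconds since the start."""
    start = clock()
    first_success = None
    last_error = None
    while first_success is None:
        if probe():
            first_success = clock() - start
        else:
            last_error = clock() - start
            time.sleep(poll_interval)
    deadline = clock() + settle_seconds
    while clock() < deadline:
        if not probe():
            last_error = clock() - start
        time.sleep(poll_interval)
    return first_success, last_error
```

The gap between `first_success` and `last_error` is what makes the "Last Error" line in the graphs below interesting: a route can flap back to 404 after first serving traffic.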

Parallel to that, we monitor the list of hostnames that the gateways know about (via dynamic_route_configs[].route_config.virtual_hosts[].name from /config_dump). We record the first time a hostname appears in a given gateway's configuration, so we have a complete picture of how long it takes for configuration distribution to happen.
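A minimal sketch of that extraction, following the `dynamic_route_configs[].route_config.virtual_hosts[].name` path quoted above; the sample payload below is invented but mirrors the shape of Envoy's RoutesConfigDump section.

```python
def virtual_host_names(config_dump):
    """Collect virtual_hosts[].name from every dynamic route config in a
    parsed /config_dump payload."""
    names = set()
    for section in config_dump.get("configs", []):
        for rc in section.get("dynamic_route_configs", []):
            for vh in rc.get("route_config", {}).get("virtual_hosts", []):
                if "name" in vh:
                    names.add(vh["name"])
    return names

# Invented sample mirroring a RoutesConfigDump section:
dump = {"configs": [{
    "@type": "type.googleapis.com/envoy.admin.v3.RoutesConfigDump",
    "dynamic_route_configs": [
        {"route_config": {"virtual_hosts": [{"name": "reviews.default.svc:80"}]}}]}]}
print(virtual_host_names(dump))  # {'reviews.default.svc:80'}
```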

Summary of Results

Here are the graphs from our most recent test on 240 nodes with 80 ingress envoys (each sharing a dedicated node with a single Istio Pilot).

[image: latency graphs from the 240-node run, max and median per event]

The lines are the max (red) and median (blue) for the latency between the creation of the virtualservice (the kubectl apply) and one of these events:

  • First Gateway - first time a hostname appears on any gateway
  • First Success - first time a curl succeeds
  • Last Error - last time a curl fails
  • Last Gateway - last time a hostname first appears on any gateway (i.e., when the final gateway receives the config)

Appendix: Full Results

Summary Results (which is where the above image is from)

We also have separate pages for each of the three runs that are summarized on that main page.

Thanks!

/cc @rosenhouse @howardjohn

@mattklein123 mattklein123 added the investigate Potential bug that needs verification label Jan 11, 2020
@mattklein123
Member

Are the routes served using RDS or inline using LDS? If they are served using LDS you are likely looking at connection drain time.
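One way to check which case applies is to look at where routes live in the listener config from /config_dump: RDS-served routes are referenced via an `rds` field in the HTTP connection manager, while inline routes are embedded under `route_config`. A hedged sketch; the field names follow the HttpConnectionManager route_specifier oneof, and the sample configs are invented.

```python
def route_source(hcm_config):
    """Given a parsed http_connection_manager filter config (from the
    listeners section of /config_dump), report how routes are supplied."""
    if "rds" in hcm_config:
        return "rds"      # routes pushed dynamically; route updates don't drain the listener
    if "route_config" in hcm_config:
        return "inline"   # routes embedded in the listener; updating them replaces the listener
    return "unknown"

# Invented examples of the two shapes:
print(route_source({"rds": {"route_config_name": "http.80"}}))  # rds
print(route_source({"route_config": {"virtual_hosts": []}}))    # inline
```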

@XanderStrike
Author

I'm pretty sure Istio Pilot pushes routes using RDS. The listeners are already present when the routes are pushed.

@mattklein123
Member

Sorry, then I'm not sure off the top of my head. I would ask the Istio team to investigate.

@XanderStrike
Author

The istio team directed us here, since it appears that Envoy has configuration but isn't acting on it.

@ramaraochavali
Contributor

Not sure which version of Istio and Envoy you are running, but FWIW, Envoy had a bug (#7939) where config_dump at one point showed the rejected config instead of the last applied config. Based on what you are describing here, could config_dump be showing rejected config?

@XanderStrike
Author

We're using Istio 1.4.2 which from what I can tell is using Envoy v1.12.1, and that version was cut after the resolution of that issue.

Also, since the config does eventually get implemented I don't think it's rejected, unless there are reasons Envoy might reject configuration other than the validity of the configuration itself.

@stale

stale bot commented Feb 13, 2020

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Feb 13, 2020
@XanderStrike
Author

This remains an issue. Stale bots are bad practice.

@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Feb 13, 2020
@stale

stale bot commented Mar 14, 2020

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 14, 2020
@stale

stale bot commented Mar 21, 2020

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@stale stale bot closed this as completed Mar 21, 2020
@mike1808

We found that the 404 errors are caused by ClusterLoadAssignments (endpoints configured via EDS) not yet being ready for the cluster we want to reach. Those endpoints are not shown in the /config_dump endpoint, only in /clusters, which we didn't monitor.
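A sketch of the missing monitoring step: parse the plain-text /clusters admin output and list which endpoints report healthy for a given cluster. The `name::host:port::stat::value` line shape matches Envoy's /clusters output, but the cluster name below is invented.

```python
def healthy_endpoints(clusters_text, cluster_name):
    """Return the host:port addresses whose health_flags line reads
    'healthy' for the given cluster in /clusters text output."""
    healthy = []
    for line in clusters_text.splitlines():
        parts = line.split("::")
        if (len(parts) == 4 and parts[0] == cluster_name
                and parts[2] == "health_flags" and parts[3] == "healthy"):
            healthy.append(parts[1])
    return healthy

# Invented sample in the /clusters line format:
sample = "\n".join([
    "outbound|80||app.default.svc.cluster.local::10.8.1.4:8080::health_flags::healthy",
    "outbound|80||app.default.svc.cluster.local::10.8.1.4:8080::cx_active::2",
])
print(healthy_endpoints(sample, "outbound|80||app.default.svc.cluster.local"))
# ['10.8.1.4:8080']
```

Polling this alongside /config_dump would have shown that routes can exist before the cluster behind them has any ready endpoints, which is when the 404s occur.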
