
DNS resolution failure results in UH / no healthy upstream #31992

Closed
nirvanagit opened this issue Jan 24, 2024 · 21 comments
Labels: area/dns, bug, investigate, stale


@nirvanagit commented Jan 24, 2024

Title: Observing DNS resolution timeouts, resulting in UH at istio-proxy pod startup

Description:

Envoy should successfully perform DNS resolutions, come up with the cluster endpoints, and prevent UH / no healthy upstream errors. Instead, DNS resolution times out at istio-proxy startup and requests fail with UH / no healthy upstream.

Repro steps:

  1. Create 5000 STRICT_DNS clusters that point to AWS load balancers (a minimal example is sketched below).
  2. Start Envoy with DNS logs enabled at debug level.
  3. This results in status=1 / status=12 errors in DNS resolution (see the logs below).
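
For illustration, a minimal STRICT_DNS cluster of the shape used in step 1 might look like the following; the cluster name and port are placeholders, and the hostname is the sanitized LB name from the logs below:

```yaml
clusters:
- name: example_aws_lb_cluster
  type: STRICT_DNS
  connect_timeout: 5s
  dns_lookup_family: V4_ONLY
  load_assignment:
    cluster_name: example_aws_lb_cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: internal-abc.us-east-2.elb.amazonaws.com
              port_value: 443
```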


Logs:


2024-01-11T20:18:14.819795Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started	thread=26
2024-01-11T20:18:14.821345Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started	thread=26
2024-01-11T20:18:14.822802Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started	thread=26
2024-01-11T20:19:34.386036Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. failed with c-ares status 12	thread=26
2024-01-11T20:19:34.386045Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 1 log_from_custom_dns_patch	thread=26
2024-01-11T20:19:34.386069Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. failed with c-ares status 12	thread=26
2024-01-11T20:19:34.386076Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 1 log_from_custom_dns_patch	thread=26
2024-01-11T20:19:34.386093Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. failed with c-ares status 12	thread=26
2024-01-11T20:19:34.386099Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 1 log_from_custom_dns_patch	thread=26
2024-01-11T20:19:35.653971Z	info	Envoy proxy is ready
2024-01-11T20:20:34.483098Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started	thread=26
2024-01-11T20:20:34.517452Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started	thread=26
2024-01-11T20:20:34.525608Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started	thread=26
2024-01-11T20:20:39.483516Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 0 log_from_custom_dns_patch	thread=26 → First successful DNS resolution after Envoy marked itself as READY. It is 65 seconds after Envoy marked itself as ready.


nirvanagit added the bug and triage labels on Jan 24, 2024
@nirvanagit (Author)

We see a lot of timeouts; is there a way to configure them? I see there is an ARES_OPT_TIMEOUT option in the c-ares library, but I don't see a way to override it in Envoy.

zuercher added the area/dns and investigate labels and removed the triage label on Jan 25, 2024
@zuercher (Member)

Setting wait_for_warm_on_init on the cluster might help. I don't think there's a way to set c-ares' internal timeout at the moment.
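
For reference, a rough sketch of where that field sits on the cluster (the cluster name is a placeholder; per the field's documentation it defaults to true):

```yaml
clusters:
- name: example_aws_lb_cluster
  type: STRICT_DNS
  connect_timeout: 5s
  # When true (the default), cluster-manager initialization waits for this
  # cluster's initial warming (its first DNS resolution attempt) before
  # Envoy finishes initializing.
  wait_for_warm_on_init: true
  # load_assignment omitted; see the STRICT_DNS sketch in the issue body.
```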

cc @yanavlasov @mattklein123 as codeowners

@nirvanagit (Author)

> wait_for_warm_on_init

@zuercher - this is set to true by default.

@nirvanagit (Author)

@lambdai / @howardjohn - would be glad if you could shed some light on this one.

Similar to #20562

@nirvanagit (Author) commented Feb 14, 2024

It appears that Envoy marks itself ready even when DNS resolution fails with the following c-ares error codes:

  1. 12 - ARES_ETIMEOUT
  2. 11 - ARES_ECONNREFUSED
  3. 16 - ARES_EDESTRUCTION

cc: @lambdai @alyssawilk @mattklein123

@nirvanagit (Author) commented Feb 15, 2024

After patching Envoy's code with an increased DNS resolution timeout and a higher number of retries, we saw a roughly 90% reduction in errors.

The experiment showed that the issue is rooted in the way Envoy performs DNS resolution.

My proposal to solve this issue/bug is threefold, each step providing a fallback if the previous one fails:

  1. Envoy should batch DNS resolution queries, so that the socket/channel is not bombarded with thousands of requests at the same time.
  2. Envoy should categorize DNS failures as client-side or server-side, and for client-side failures Envoy should not mark itself as ready.
  3. When Envoy receives a request for a cluster endpoint that has no IPs because DNS resolution failed, Envoy should retry DNS resolution X number of times at runtime before giving up.

@nirvanagit (Author)

When there are 2 STRICT_DNS clusters with the same endpoint, Envoy does 2 separate DNS resolutions.

Can the DNS resolution mechanism be optimized to avoid duplicate DNS resolutions?
cc: @alyssawilk @mattklein123 @yuval-k

@ramaraochavali (Contributor)

@zuercher one question on c-ares: let's say we have 3 STRICT_DNS clusters; does c-ares open 3 persistent connections to the upstream resolver?

@alyssawilk (Contributor)

@nirvanagit are you setting the DNS cache config? I think you should be able to aim 2 clusters at one cache and avoid the duplication.
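
A rough sketch of what that looks like for the dynamic forward proxy: both clusters reference a dns_cache_config with the same name, so they share a single underlying cache (cluster and cache names below are placeholders):

```yaml
clusters:
- name: dfp_cluster_a
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
      dns_cache_config:
        name: shared_dns_cache   # same cache name in both clusters => one shared cache
        dns_lookup_family: V4_ONLY
- name: dfp_cluster_b
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
      dns_cache_config:
        name: shared_dns_cache
        dns_lookup_family: V4_ONLY
```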

github-actions bot commented Apr 4, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label on Apr 4, 2024
@ramaraochavali (Contributor)

@alyssawilk looks like they are using regular STRICT_DNS clusters, not the dynamic forward proxy, so there is no caching involved here?

@mattklein123 @alyssawilk Can you please help answer this question: does the dynamic forward proxy maintain a persistent connection for all lookups, or does it tear down the connection after each lookup? One of the problems we are seeing with these STRICT_DNS clusters is that one of the CoreDNS pods is overwhelmed with a lot of connections. Have you seen this?

github-actions bot removed the stale label on Apr 9, 2024
@alyssawilk (Contributor)

Ah sorry, I'm much more familiar with DFP than strict DNS.

> would dynamic forward proxy maintains a persistent connection for all look ups

I don't even know what this means?
For DFP we do a DNS lookup per hostname (DNS lookups are UDP and don't have connections associated), then cache the result until the TTL runs out, at which point there's another lookup.
The persistent connections are the TCP connections upstream. If there's a new DNS resolution we'll continue using the connection (latched to the old address) until the connection is closed.

The DFP, as it uses the DNS cache, also supports stale DNS: when a DNS entry expires and re-resolution fails, you can configure the cache to keep using the last successful resolve result. Sounds like the problem is that strict DNS doesn't get any of these benefits - may be worth adding as an optional feature.
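
For context, the knobs being referred to live on the DFP's DnsCacheConfig; a sketch with illustrative values (the cache name is a placeholder, and exact refresh/TTL semantics vary by Envoy version):

```yaml
# Fragment: goes under the dns_cache_config field of a DFP cluster or filter.
dns_cache_config:
  name: shared_dns_cache
  dns_lookup_family: V4_ONLY
  # How often cached hosts are re-resolved (interaction with the DNS
  # response TTL depends on the Envoy version in use).
  dns_refresh_rate: 60s
  # Back-off schedule used when a resolution attempt fails; per the
  # discussion above, the previously cached result can keep being used.
  dns_failure_refresh_rate:
    base_interval: 2s
    max_interval: 10s
  # How long an unused host stays in the cache before removal.
  host_ttl: 300s
```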

@ramaraochavali (Contributor)

> I don't even know what this means?

"Persistent connection" was the wrong choice of words here. When Envoy sends DNS queries to CoreDNS, what we have observed in some environments is that it always sends them to one single CoreDNS pod, flooding that pod. Curious whether there is something in Envoy/c-ares that makes it choose the same pod when DNS lookups are done for multiple STRICT_DNS clusters.

@ramaraochavali (Contributor)

#7965 - Found this, and a possible fix in c-ares: c-ares/c-ares#549. Is it OK to add this configuration to DNSResolverConfig?

@ramaraochavali (Contributor)

@alyssawilk ^^ WDYT?

@alyssawilk (Contributor)

If you've tested that this addresses your problem, SGTM.
We could either add a permanent knob, or set a runtime-guarded "sensible default" and add a knob if anyone dislikes the default.

@ramaraochavali (Contributor)

#33551 - adding a permanent knob here
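
As I understand it, that knob corresponds to c-ares's ARES_OPT_UDP_MAX_QUERIES. A sketch of what using it might look like on the bootstrap's c-ares resolver config, assuming the field added in that PR is named udp_max_queries (value illustrative):

```yaml
typed_dns_resolver_config:
  name: envoy.network.dns_resolver.cares
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.network.dns_resolver.cares.v3.CaresDnsResolverConfig
    # Rotate to a new UDP source port ("connection") after this many queries,
    # so lookups are spread across resolver endpoints instead of pinning one.
    udp_max_queries: 100
```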

@rsarunprashad commented May 10, 2024

> After patching Envoy's code with an increased DNS resolution timeout and a higher number of retries, we saw a roughly 90% reduction in errors.

@nirvanagit how did you manage to set timeouts on dns_resolver_config? Is that for TCP?

github-actions bot

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label on Jun 10, 2024
github-actions bot

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

github-actions bot closed this as not planned on Jun 17, 2024
@virajrk commented Jul 1, 2024

Hello @ramaraochavali, how did you apply this configuration to istio-proxy?
