
DNS resolution failure results in UH / no healthy upstream #31992

Closed
nirvanagit opened this issue Jan 24, 2024 · 21 comments
Labels: area/dns, bug, investigate, stale


@nirvanagit commented Jan 24, 2024

Title: Observing DNS resolution timeouts, resulting in UH at istio-proxy pod startup

Description:

Envoy should successfully perform DNS resolutions, come up with the cluster endpoints, and prevent UH / no healthy upstream errors. Instead, DNS resolution times out at istio-proxy startup and requests fail with UH / no healthy upstream.

Repro steps:

  1. Create 5000 STRICT_DNS clusters that point to AWS load balancers (a minimal example is sketched below).
  2. Start Envoy with DNS logs enabled at debug level.
  3. This results in status=1 / status=12 errors in DNS resolution (see the logs below).
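
For illustration, a minimal STRICT_DNS cluster of the shape used in step 1 might look like the following; the cluster name and port are placeholders, and the hostname is the sanitized LB name from the logs below:

```yaml
clusters:
- name: example_aws_lb_cluster
  type: STRICT_DNS
  connect_timeout: 5s
  dns_lookup_family: V4_ONLY
  load_assignment:
    cluster_name: example_aws_lb_cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: internal-abc.us-east-2.elb.amazonaws.com
              port_value: 443
```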


Logs:


2024-01-11T20:18:14.819795Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started	thread=26
2024-01-11T20:18:14.821345Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started	thread=26
2024-01-11T20:18:14.822802Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started	thread=26
2024-01-11T20:19:34.386036Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. failed with c-ares status 12	thread=26
2024-01-11T20:19:34.386045Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 1 log_from_custom_dns_patch	thread=26
2024-01-11T20:19:34.386069Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. failed with c-ares status 12	thread=26
2024-01-11T20:19:34.386076Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 1 log_from_custom_dns_patch	thread=26
2024-01-11T20:19:34.386093Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. failed with c-ares status 12	thread=26
2024-01-11T20:19:34.386099Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 1 log_from_custom_dns_patch	thread=26
2024-01-11T20:19:35.653971Z	info	Envoy proxy is ready
2024-01-11T20:20:34.483098Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started	thread=26
2024-01-11T20:20:34.517452Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started	thread=26
2024-01-11T20:20:34.525608Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. started	thread=26
2024-01-11T20:20:39.483516Z	dns resolution for internal-abc.us-east-2.elb.amazonaws.com. completed with status 0 log_from_custom_dns_patch	thread=26 → First successful DNS resolution after Envoy marked itself as READY. It is 65 seconds after Envoy marked itself as ready.


nirvanagit added the bug and triage labels on Jan 24, 2024
@nirvanagit (Author)

We see a lot of timeouts; is there a way to configure them? I see there is an ARES_OPT_TIMEOUT option in the c-ares library, but I don't see a way to override it in Envoy.

zuercher added the area/dns and investigate labels and removed the triage label on Jan 25, 2024
@zuercher (Member)

Setting wait_for_warm_on_init on the cluster might help. I don't think there's a way to set c-ares' internal timeout at the moment.
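
For reference, a rough sketch of where that field sits on the cluster (the cluster name is a placeholder; per the field's documentation it defaults to true):

```yaml
clusters:
- name: example_aws_lb_cluster
  type: STRICT_DNS
  connect_timeout: 5s
  # When true (the default), cluster-manager initialization waits for this
  # cluster's initial warming (its first DNS resolution attempt) before
  # Envoy finishes initializing.
  wait_for_warm_on_init: true
  # load_assignment omitted; see the STRICT_DNS sketch in the issue body.
```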

cc @yanavlasov @mattklein123 as codeowners

@nirvanagit (Author)

> wait_for_warm_on_init

@zuercher - this is set to true by default.

@nirvanagit (Author)

@lambdai / @howardjohn - would be glad if you could shed some light on this one.

Similar to #20562

@nirvanagit (Author) commented Feb 14, 2024

It appears that Envoy marks itself ready even when DNS resolution fails with the following c-ares error codes:

  1. 12 - ARES_ETIMEOUT
  2. 11 - ARES_ECONNREFUSED
  3. 16 - ARES_EDESTRUCTION

cc: @lambdai @alyssawilk @mattklein123

@nirvanagit (Author) commented Feb 15, 2024

After patching Envoy's code with an increased DNS resolution timeout and a higher number of retries, we saw a roughly 90% reduction in errors.

The experiment showed that the issue is rooted in the way Envoy performs DNS resolution.

My proposal to solve this issue/bug is threefold, each step providing a fallback if the previous one fails:

  1. Envoy should batch DNS resolution queries, so that the socket/channel is not bombarded with thousands of requests at the same time.
  2. Envoy should categorize DNS failures as client-side or server-side, and for client-side failures Envoy should not mark itself as ready.
  3. When Envoy receives a request for a cluster endpoint that has no IPs because DNS resolution failed, Envoy should retry DNS resolution X number of times at runtime before giving up.

@nirvanagit (Author)

When there are 2 STRICT_DNS clusters with the same endpoint, Envoy does 2 separate DNS resolutions.

Can the DNS resolution mechanism be optimized to avoid duplicate DNS resolutions?
cc: @alyssawilk @mattklein123 @yuval-k

@ramaraochavali (Contributor)

@zuercher one question on c-ares: let's say we have 3 STRICT_DNS clusters; does c-ares open 3 persistent connections to the upstream resolver?

@alyssawilk (Contributor)

@nirvanagit are you setting the DNS cache config? I think you should be able to aim 2 clusters at one cache and avoid the duplication.
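
A rough sketch of what that looks like for the dynamic forward proxy: both clusters reference a dns_cache_config with the same name, so they share a single underlying cache (cluster and cache names below are placeholders):

```yaml
clusters:
- name: dfp_cluster_a
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
      dns_cache_config:
        name: shared_dns_cache   # same cache name in both clusters => one shared cache
        dns_lookup_family: V4_ONLY
- name: dfp_cluster_b
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
      dns_cache_config:
        name: shared_dns_cache
        dns_lookup_family: V4_ONLY
```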

github-actions bot commented Apr 4, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label on Apr 4, 2024
@ramaraochavali (Contributor)

@alyssawilk looks like they are using regular STRICT_DNS clusters, not the dynamic forward proxy, so there is no caching involved here?

@mattklein123 @alyssawilk Can you please help answer this question: does the dynamic forward proxy maintain a persistent connection for all lookups, or does it tear down the connection after each lookup? One of the problems we are seeing with these STRICT_DNS clusters is that one of the CoreDNS pods is overwhelmed with a lot of connections. Have you seen this?

github-actions bot removed the stale label on Apr 9, 2024
@alyssawilk (Contributor)

Ah sorry, I'm much more familiar with DFP than strict DNS.

> would dynamic forward proxy maintains a persistent connection for all look ups

I don't even know what this means?
For DFP we do a DNS lookup per hostname (DNS lookups are UDP and don't have connections associated), then cache the result until the TTL runs out, at which point there's another lookup.
The persistent connections are the TCP connections upstream. If there's a new DNS resolution we'll continue using the connection (latched to the old address) until the connection is closed.

The DFP, as it uses the DNS cache, also supports stale DNS: when a DNS entry expires and re-resolution fails, you can configure the cache to keep using the last successful resolve result. Sounds like the problem is that strict DNS doesn't get any of these benefits - may be worth adding as an optional feature.
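
For context, the knobs being referred to live on the DFP's DnsCacheConfig; a sketch with illustrative values (the cache name is a placeholder, and exact refresh/TTL semantics vary by Envoy version):

```yaml
# Fragment: goes under the dns_cache_config field of a DFP cluster or filter.
dns_cache_config:
  name: shared_dns_cache
  dns_lookup_family: V4_ONLY
  # How often cached hosts are re-resolved (interaction with the DNS
  # response TTL depends on the Envoy version in use).
  dns_refresh_rate: 60s
  # Back-off schedule used when a resolution attempt fails; per the
  # discussion above, the previously cached result can keep being used.
  dns_failure_refresh_rate:
    base_interval: 2s
    max_interval: 10s
  # How long an unused host stays in the cache before removal.
  host_ttl: 300s
```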

@ramaraochavali (Contributor)

> I don't even know what this means?

"Persistent connection" was the wrong choice of words here. When Envoy sends DNS queries to CoreDNS, what we have observed in some environments is that it always sends them to one single CoreDNS pod, flooding that pod. Curious whether there is something in Envoy/c-ares that makes it choose the same pod when DNS lookups are done for multiple STRICT_DNS clusters.

@ramaraochavali (Contributor)

#7965 - Found this, and a possible fix in c-ares: c-ares/c-ares#549. Is it OK to add this configuration to DNSResolverConfig?

@ramaraochavali (Contributor)

@alyssawilk ^^ WDYT?

@alyssawilk (Contributor)

If you've tested that this addresses your problem, SGTM.
We could either add a permanent knob, or set a runtime-guarded "sensible default" and add a knob if anyone dislikes the default.

@ramaraochavali (Contributor)

#33551 - adding a permanent knob here
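
As I understand it, that knob corresponds to c-ares's ARES_OPT_UDP_MAX_QUERIES. A sketch of what using it might look like on the bootstrap's c-ares resolver config, assuming the field added in that PR is named udp_max_queries (value illustrative):

```yaml
typed_dns_resolver_config:
  name: envoy.network.dns_resolver.cares
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.network.dns_resolver.cares.v3.CaresDnsResolverConfig
    # Rotate to a new UDP source port ("connection") after this many queries,
    # so lookups are spread across resolver endpoints instead of pinning one.
    udp_max_queries: 100
```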

@rsarunprashad commented May 10, 2024

> After patching Envoy's code with an increased DNS resolution timeout and a higher number of retries, we saw a roughly 90% reduction in errors.

@nirvanagit how did you manage to set timeouts on dns_resolver_config? Is that for TCP?

github-actions bot

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions bot added the stale label on Jun 10, 2024
github-actions bot

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

github-actions bot closed this as not planned on Jun 17, 2024
@virajrk commented Jul 1, 2024

Hello @ramaraochavali, how did you apply this configuration to istio-proxy?
