Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

add retry on EAI_AGAIN to improve DNS lookup stability #34187

Closed
wants to merge 1 commit into from

Conversation

wfurt
Copy link
Member

@wfurt wfurt commented Dec 20, 2018

This is attempt to fix #32797 and improve DNS lookup stability. When dns lookup fails with transient error, we currently return it as failure. However upper layers (ping, sockets, HTTP) do not seems to retry as error code suggests.

To make it more resilient, this change adds retry. To limit possible endless loop, there is retry max so the call still possible can return the transient error.
Also note that the MAX_RETRY_COUNT was chosen arbitrary as best guess.

@wfurt wfurt added os-mac-os-x OS-X aka Mac OS os-linux Linux OS (any supported distro) labels Dec 20, 2018
@wfurt wfurt added this to the 3.0 milestone Dec 20, 2018
@wfurt wfurt self-assigned this Dec 20, 2018
@wfurt wfurt requested a review from a team December 20, 2018 19:21
@wfurt wfurt changed the title add retry on EAI_AGAIN add retry on EAI_AGAIN to improve DNS lookup stability Dec 20, 2018
@stephentoub
Copy link
Member

Does this actually help, i.e. have you seen it to reduce failure rates? My (wrong?) assumption was that the transience wasn't actually brief and such immediate retries wouldn't help.

@wfurt
Copy link
Member Author

wfurt commented Dec 20, 2018

no, I don't have good repro. However when I debugging DNS test failures before I've seen resolver returning permanent failures even for our echo servers.

While back I was also experimenting with running caching name server locally to avoid network failures but I got distracted before some good statistical data.

I'm trying to get access to CI systems so I can collect some traces and collect evidence if it is environmental problem or something else.

@stephentoub
Copy link
Member

@wfurt, can this be closed for now? Or do you have evidence this actually helps? Thanks.

@wfurt
Copy link
Member Author

wfurt commented Jan 29, 2019

I was not able to reproduce it. I have access to new arm64 systems now and I can give it another try.
I'm not sure if I can get stats you asked for without getting it and and watching Kusto test results for a while. I'm wondering if you see any harm in this change - it may take longer to fail but it increases chance of success in general case (like restore) as well test runs IMHO.

However you can also see that this bug is different than for example #21224. I also have personal observation from running my tests locally that immediate retry resolves failure most of the time. Typical case would be "dotnet restore" where it works when executed again after failure.

Another way to improve DNS stability would be adding in-process cache. On Unix resolver and functions we use do not cache every name resolution query would lead to new DNS transaction. With cache we could decrease that and for test runs we can prime it and than have conditional test runs based on ability to resolve.

If you have other suggestions how to solve #32797 please let me know.

@stephentoub
Copy link
Member

it may take longer to fail but it increases chance of success in general case

This is what I'm questioning. Does it actually? My understanding of this error in this particular case is that it's describing a condition that won't be resolved soon, but the change is immediately retrying. If that's true, all we're going to end up doing is spend more cycles retrying something that won't succeed, and I would have concerns about that.

Am I wrong?

@wfurt
Copy link
Member Author

wfurt commented Jan 29, 2019

Yes, I think your description is right @stephentoub. man page describes EAI_AGAIN as transient error but there is no suggestions if the conditions will or will not be resolved soon.

I could run tests in loop but that may or may not simulate this situation. There may be difference depending if name server you asking has answer in cache (and than we talking about local connectivity problems) or if it needs to do recursive lookup from root servers.

I've been thinking about more and if you feel uncomfortable with product change I can modify the test to try again and fail while logging more details.
(like success or fail of second lookup)
That would allow us to study behavior in real test run environment.

@stephentoub
Copy link
Member

I've been thinking about more and if you feel uncomfortable with product change I can modify the test to try again and fail while logging more details. (like success or fail of second lookup) That would allow us to study behavior in real test run environment.

That sounds good to me. Let's gather data :)

@wfurt
Copy link
Member Author

wfurt commented Jan 29, 2019

added instrumentation in #32797. Closing this for now.

@wfurt wfurt closed this Jan 29, 2019
@wfurt
Copy link
Member Author

wfurt commented Jan 29, 2019

correct PR number is #34934

@wfurt wfurt deleted the fix_32797_dns branch June 15, 2020 18:21
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Net os-linux Linux OS (any supported distro) os-mac-os-x OS-X aka Mac OS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

NameResolutionPalTest failing intermittently on Linux Arm64
3 participants