Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gather more diagnostics, and prepare skipping flaky DNS tests #36072

Closed
wants to merge 11 commits into from

Conversation

antonfirsov
Copy link
Member

@antonfirsov antonfirsov commented May 7, 2020

Test failures in #1488 and #27622 are suspected to originate from infrastructure/OS/configuration issues, we assume that the underlying OS calls should also fail.

This PR adds a utility to detect these failures and depending on a parameter:

  1. Either skip the test
  2. Or make the tests fail with detailed diagnostics

The plan is to first go with option 2. everywhere, then raise individual issues on possible new test failures with the current infrastructure. After collecting sufficient information, we can switch individual tests to go with option 1.

Fixes #1488, closes #27622 (already fixed)

Related past discussions: #28324, #16549

@ghost
Copy link

ghost commented May 7, 2020

Tagging subscribers to this area: @dotnet/ncl
Notify danmosemsft if you want to be subscribed.

@antonfirsov antonfirsov changed the title Gather more diagnostics, and prepare skipping of flaky DNS tests Gather more diagnostics, and prepare skipping flaky DNS tests May 7, 2020
@antonfirsov antonfirsov requested review from a team and wfurt May 7, 2020 23:29
@antonfirsov
Copy link
Member Author

/azp run runtime-libraries outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

private static extern unsafe int getaddrinfo(string node, string service, addrinfo* hints, addrinfo** res);

[DllImport("libc")]
private static extern unsafe void freeaddrinfo(addrinfo* res);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this P/Invoking is unlikely to be portable across various Unix implementations. It's why we use C shims as an intermediary for the production source.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was suggesting to use gethostbyname() as it takes string returns NULL or result so it should be easy to check and be portable IMHO. We chat about it with @antonfirsov and the goal is to check if system can resolve it self without depending on production code we are trying to test. We do not need to process the answer in any detailed way. We could still move it to Interop to make it more portable.

Copy link
Member Author

@antonfirsov antonfirsov May 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was suggesting to use gethostbyname()

@wfurt the main reason for choosing freeaddrinfo was because of gethostbyname being marked obsolete. Also: freeaddrinfo provides a trivial way to also get the error code, which I find useful.

All of this P/Invoking is unlikely to be portable across various Unix implementations.

@stephentoub while this is true in general, I don't see any reason to worry about it in this particular case. The addrinfo structure seems binary compatible across unixes, and the lines responsible for the query in our own PAL code are also portable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The addrinfo structure seems binary compatible across unixes

We've seen numerous cases where the actual layout differed in header files from what was shown in the man pages, including but not limited to when targeting 32-bit vs 64-bit.

and the lines responsible for the query in our own PAL code are also portable.

Those are compiled by the C compiler, which sees the actual header files. That's the primary reason the native shims exist.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also agree that we should not p/invoke directly but only use native C shims.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand that @antonfirsov. But it gets you out of the structure business. Since it has pointers, the size will differ. socklen_t seems to be fixed to 32bit but I'm not sure if that is mandatory for al OSes.
Shim as @stephentoub suggested would be portable path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I understand the issue now.

From the various options I see, the simplest and safest one seems to be the use of gethostbyname as @wfurt suggested. Looks like there are several existing cases where we P/Invoke into similar simple methods in test code. Is there any strong counter-argument against using gethostbyname?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you mean the:

struct hostent *gethostbyname(const char *name);

function, and we treat the return as an opaque pointer, e.g.

public static extern IntPtr gethostbyname(string name);

I'm ok with that.

@antonfirsov
Copy link
Member Author

The configuration net5.0-OSX-Debug-x64-CoreCLR_release-OSX.1014.Amd64.Open is failing to resolve it's own hostname returned by gethostname() with the following info being printed:

System.Exception : Failed to resolve 'dci-mac-build-167'! \nDns.GetHostName() == dci-mac-build-167
--- /etc/hosts ---
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1   localhost
255.255.255.255    broadcasthost
::1 localhost
--- /etc/resolv.conf ---
#
# macOS Notice
#
# This file is not consulted for DNS hostname resolution, address
# resolution, or the DNS query routing mechanism used by most
# processes on this system.
#
# To view the DNS configuration used by this system, use:
# scutil --dns
#
# SEE ALSO
# dns-sd(1), scutil(8)
#
# This file is automatically generated.
#
domain redmond.corp.microsoft.com
nameserver 10.50.50.50
nameserver 10.50.10.50
------

@wfurt do you think there's anything useful here to address the issue at infrastructure level?

@wfurt
Copy link
Member

wfurt commented May 19, 2020

This is not too surprising and I cannot resolve 'dci-mac-build-167' from corp net either.
I think it may be on different domain or subnet. That would probably be true for Azure machines as well. The reason why Linux work is that in most cases installer puts name into /etc/host.
But that is not requirement AFAIK.

cc: @MattGal

@MattGal
Copy link
Member

MattGal commented May 19, 2020

This is not too surprising and I cannot resolve 'dci-mac-build-167' from corp net either.
I think it may be on different domain or subnet. That would probably be true for Azure machines as well. The reason why Linux work is that in most cases installer puts name into /etc/host.
But that is not requirement AFAIK.

cc: @MattGal

This tracks with my understanding of the way these machines are meant to be configured, and I thought was why we use localhost / 127.0.01 for such testing. To add one thing, our MacOS machines are not meant to even be accessible from corpnet (they're on a special walled-off version of it) and instead are generally accessed via KVM.

@antonfirsov
Copy link
Member Author

/azp run runtime-libraries outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@antonfirsov
Copy link
Member Author

After having some discussions offline, it turned out that there is space for improvement in the behavior of our product code.

Problem

There's no guarantee that getaddrinfo() (or gethostbyname()) can resolve the systems's own hostname (returned by gethostname()) (unless it has been added manually to /etc/hosts). On CI we have a case where the behavior depends on macOS version: works for 10.13, fails for 10.14.

Currently we have a strict validation logic in the related PAL code:

int result = getaddrinfo((const char*)address, NULL, &hint, &info);
if (result != 0)
{
return ConvertGetAddrInfoAndGetNameInfoErrorsToPal(result);
}

This may lead to failing Dns.GetHostByName("") or Dns.GetHostEntry("") calls on some systems, which is an unexpected behavior (bug?), that does not comply with our documentation:

If an empty string is passed as the hostNameOrAddress argument, then this method returns the IPv4 and IPv6 addresses of the local host.

Suggestion

We already have a logic handling hostname as a special case and returning local IP's thanks to dotnet/corefx#41764:

result = gethostname((char*)name, _POSIX_HOST_NAME_MAX);
bool includeIPv4Loopback = true;
bool includeIPv6Loopback = true;
if (result == 0 && strcasecmp((const char*)address, name) == 0)

If getaddrinfo() returns 0, we can run this comparison earlier and handle the special case instead of failing. Seems like a relatively easy change with no risks.

@stephentoub @davidsh @ManickaP thoughts?

@antonfirsov
Copy link
Member Author

Note GitHub actions team managed to find and solve the macOS issue in their environment:
actions/runner-images#1042 (comment)

@stephentoub
Copy link
Member

@antonfirsov, are you still working on this?

@antonfirsov
Copy link
Member Author

Closing for now, as we started to consider product code improvements instead. The problem I'm trying to solve is better tracked in #36849 (and maybe #42185).

@ghost ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
@karelz karelz added this to the 6.0.0 milestone Jan 26, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
7 participants