Helix-machines has flaky test execution #1320

Open
1 of 5 tasks
riarenas opened this issue Apr 20, 2023 · 12 comments
Labels
dotnet-helix-machines Ops - Service Maintenance Used to track issues related to maintaining the services .NET Eng Supports

Comments

@riarenas
Member

  • This issue is blocking
  • This issue is causing unreasonable pain
    PRs to the repo are very painful because even small changes need a lot of attempts to get past both flaky problems.

Helix-machines build attempts are regularly failing with one of two issues:

  1. dotnet.exe returning exit code 1 without any info during the build step, just after restoring projects: https://dnceng.visualstudio.com/internal/_build/results?buildId=2161729&view=logs&j=3dc8fd7e-4368-5a92-293e-d53cefc8c4b3&s=6884a131-87da-5381-61f3-d7acc3b91d76&t=7510ef29-db76-5a4c-4d87-8cd81d6b33b8&l=83
  2. dotnet returning failure exit codes during test execution: https://dnceng.visualstudio.com/internal/_build/results?buildId=2160507&view=logs&j=3dc8fd7e-4368-5a92-293e-d53cefc8c4b3&t=baf0d620-26ff-5880-4e4c-71df819d0940

Release Note Category

  • Feature changes/additions
  • Bug fixes
  • Internal Infrastructure Improvements

Release Note Description

@riarenas
Member Author

Here is a PR build that saw these issues throughout 4 attempts: https://dnceng.visualstudio.com/internal/_build/results?buildId=2163280

Attempts 1, 2 and 4 all hit the problem in the build step, while attempt 3 hit the problem in the test step. Attempt 5 completed the build job successfully.

@riarenas
Member Author

riarenas commented Apr 20, 2023

The test issues are exactly the same as the ones reported in dotnet/arcade-services#2324 for arcade-services.

Some things that we can try here:

  • Updating the .NET SDK we use in the repo
  • Updating the version of vstest we use in the repo (I think this might just come from the SDK in the helix-machines case)
  • Updating the version of NuGet, in case the problems are related to the central package management features.

@dougbu
Member

dougbu commented Apr 20, 2023

see also

  • NuGet restore fails with The repository primary signature validity period has expired arcade#13070 which corresponds to (1) above i.e., we do have more information but the real errors are logged in white,
  • BuildImages errors about C:\Program Files (x86)\Windows Kits\10\Debuggers from windows-public-debuggers.ps1 and similar errors about C:\Program Files\dummyName from windows-docker.ps1,
  • BuildImages "Custom Script Extension failed to deploy" warnings that may cause failures of that job, and
  • DeployQueues flakiness, especially for Mariner images

@riarenas
Member Author

I think I would like this issue to only be about the build stage in helix-machines. Issues related to images seem like they should be dealt with differently as they have to do with the images.

Any errors in deployqueues and createCustomImage are more involved than just building the applications that execute those stages.

Can you explain a bit about how you see dotnet/arcade#13018 being related? That seems to be exclusively a publishing infrastructure bug, and the error doesn't seem to match what I'm seeing here.

@dougbu
Member

dougbu commented Apr 20, 2023

My bad. I meant dotnet/arcade#13070 but hash-autocomplete failed me. Editing…

@riarenas
Member Author

That makes some more sense. I still think this is something different, because every time we see the signature issue, we get that feedback. It's totally possible that both issues would be solved by updating to a newer NuGet though?

@dougbu
Member

dougbu commented Apr 20, 2023

> That makes some more sense. I still think this is something different, because every time we see the signature issue, we get that feedback. It's totally possible that both issues would be solved by updating to a newer NuGet though?

Why do you think the two restore problems are different❔ The link you provided shows the error from dotnet/arcade#13070 in nice clear (white) letters.

On updating NuGet, I have a branch that updates the Arcade SDK and therefore the .NET SDK. That might do it.

Question is how to move forward once 'main' is green❔ Let's talk ordering in the next little while.

@riarenas
Member Author

Ooh, that makes even more sense. I definitely missed that in my builds. I think the failures during test execution are the ones that are unaccounted for then.

@riarenas changed the title from "Helix-machines has flaky build and tests" to "Helix-machines has flaky test execution" Apr 21, 2023
@garath garath added the Operations Used by FR to track issues related to operations work label Jun 1, 2023
@ilyas1974 ilyas1974 added Ops - Service Maintenance Used to track issues related to maintaining the services .NET Eng Supports and removed Operations Used by FR to track issues related to operations work labels Jul 28, 2023
@ilyas1974 ilyas1974 added the Helix-Machines Asks to update images, to add new queues for new OSes, and maintenance of physical machines label Oct 17, 2023
@dougbu
Member

dougbu commented Oct 29, 2023

I looked at the test failure a bit. Unfortunately, the dotnet-helix-machines-ci Tests analytics lack tests run in the Test C# job. I'm sure we've seen similar errors since @riarenas opened this issue but can't verify that w/o searching through all failed and passed on retry runs in the pipeline.

However, the failure linked just above occurred in the Dispose() method of test/Script.Tests/Tests.cs after otherwise successfully running UnicodePathTest(). The logging shows an unexpected leftover file caused an Assert.Empty(...) call to fail.

[xUnit.net 00:00:54.09]     ScriptTests.Tests.UnicodePathTest [FAIL]
  Failed ScriptTests.Tests.UnicodePathTest [1 s]
  Error Message:
   Assert.Empty() Failure
Collection: ["D:\\a\\_work\\_temp\\.artifacts\\helix-scripts\\te"...]
  Stack Trace:
     at ScriptTests.Tests.Dispose() in D:\a\_work\1\s\test\Script.Tests\Tests.cs:line 58
   at ReflectionAbstractionExtensions.DisposeTestClass(ITest test, Object testClass, IMessageBus messageBus, ExecutionTimer timer, CancellationTokenSource cancellationTokenSource) in C:\Dev\xunit\xunit\src\xunit.execution\Extensions\ReflectionAbstractionExtensions.cs:line 79
...
 =========== WORKITEM CONSOLE OUTPUT ==============
 Console log: 'SomeName' from job 00000000-0000-0000-0000-00000000000f workitem 00000000-0000-0000-0000-000000000010 (testqueue) executed on machine 13faa7cac000000 running Windows-10-10.0.20348-SP0
 ['SomeName' END OF WORK ITEM LOG: Command exited with 0]
 New files found in %TEMP% directory:
D:\a\_work\_temp\.artifacts\helix-scripts\temp\lbozsokg

@dotnet/dnceng anyone familiar enough w/ that test class or how helix-scripts/ uses $env:TEMP to guess what is likely to write out a seemingly-random 8 character file or directory❔ (May be worth adding more information to the output e.g., the contents of the directory or at least whether it's a file or a directory.)
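
As a rough illustration of the extra output suggested above, here is a minimal sketch of a leftover-entry report that says whether each new entry is a file or a directory and lists a directory's contents. It is written in Python purely for illustration; the real check lives in the C# test class and is not shown in this thread. The helper names `snapshot`, `describe_entry` and `describe_new_entries` are hypothetical, and the sketch assumes the check compares the temp directory's entries before and after the test.

```python
import tempfile
from pathlib import Path
from typing import List, Set


def snapshot(directory: Path) -> Set[str]:
    """Return the names of the entries directly inside `directory`."""
    return {entry.name for entry in directory.iterdir()}


def describe_entry(path: Path, max_children: int = 10) -> str:
    """Summarize one entry: its kind, plus size (file) or contents (directory)."""
    if path.is_dir():
        children = sorted(child.name for child in path.iterdir())
        shown = ", ".join(children[:max_children]) or "<empty>"
        more = " ..." if len(children) > max_children else ""
        return f"{path} [directory, {len(children)} entries: {shown}{more}]"
    if path.is_file():
        return f"{path} [file, {path.stat().st_size} bytes]"
    return f"{path} [other]"


def describe_new_entries(directory: Path, before: Set[str]) -> List[str]:
    """Describe every entry that appeared in `directory` since `before` was taken."""
    new_names = sorted(snapshot(directory) - before)
    return [describe_entry(directory / name) for name in new_names]


if __name__ == "__main__":
    # Stand-in for the test's temp directory; the entry names are made up.
    with tempfile.TemporaryDirectory() as root:
        temp = Path(root)
        before = snapshot(temp)
        (temp / "leftover").mkdir()
        (temp / "leftover" / "output.log").write_text("example")
        print("New entries found in temp directory:")
        for line in describe_new_entries(temp, before):
            print(line)
```

With a report like this, a failure would show at a glance whether something like `lbozsokg` is an empty directory, a populated one, or a plain file, which should narrow down which tool created it.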

@dougbu dougbu transferred this issue from dotnet/arcade Oct 29, 2023
@dougbu dougbu self-assigned this Oct 29, 2023
@markwilkie
Member

Added to Helix Epic

@dougbu
Member

dougbu commented Nov 4, 2023

Unassigning myself because I don't understand the relevant tests and am not sure whether they'll need to change in other ways for the Helix epic.

@dougbu dougbu removed their assignment Nov 4, 2023
@dougbu dougbu added dotnet-helix-machines and removed Helix-Machines Asks to update images, to add new queues for new OSes, and maintenance of physical machines labels Jan 14, 2024
@garath garath added the Ops - P3 Operations task, priority 3 label Mar 25, 2024
@garath garath removed the Ops - P3 Operations task, priority 3 label Apr 9, 2024
5 participants