Out of disk space building on Docker NanoServer #34780

Closed · BruceForstall opened this issue Apr 9, 2020 · 13 comments · Fixed by #35011
Assignees: alnikola
Labels: area-Infrastructure-coreclr · blocking-outerloop (Blocking the 'runtime-coreclr outerloop' and 'runtime-libraries-coreclr outerloop' runs)
Milestone: 5.0

Comments

@BruceForstall (Member)

stress-http and stress-ssl jobs are failing to build due to out of disk space, e.g.:

D:\a\1\s\artifacts\obj\coreclr\Linux.x64.Release\crossgen\src\utilcode\stdafx.utilcode_dac.cpp(1,1): fatal error C1085: Cannot write precompiled header file: 'D:/a/1/s/artifacts/obj/coreclr/Linux.x64.Release/crossgen/src/utilcode/Release/stdafx.utilcode_dac.pch': There is not enough space on the disk. [D:\a\1\s\artifacts\obj\coreclr\Linux.x64.Release\crossgen\src\utilcode\utilcode_dac.vcxproj]
  D:\a\1\s\artifacts\obj\coreclr\Linux.x64.Release\crossgen\src\md\runtime\Release\mdruntime_dac.lib : fatal error LNK1180: insufficient disk space to complete link [D:\a\1\s\artifacts\obj\coreclr\Linux.x64.Release\crossgen\src\md\runtime\mdruntime_dac.vcxproj]
##[error]BUILD: Error: cross-arch components build failed. Refer to the build log files for details.

https://dev.azure.com/dnceng/public/_build/results?buildId=592971&view=logs&j=2d2b3007-3c5c-5840-9bb0-2b1ea49925f3&t=abae1f68-3c73-5bff-491f-f2b908580ce6

https://dev.azure.com/dnceng/public/_build/results?buildId=592970&view=logs&j=2d2b3007-3c5c-5840-9bb0-2b1ea49925f3&t=abae1f68-3c73-5bff-491f-f2b908580ce6

@dotnet/runtime-infrastructure

@BruceForstall added this to the 5.0 milestone Apr 9, 2020
@Dotnet-GitSync-Bot added the 'untriaged' (New issue has not been triaged by the area owner) label Apr 9, 2020
@BruceForstall removed the 'untriaged' label Apr 9, 2020
@jaredpar added the 'blocking-outerloop' (Blocking the 'runtime-coreclr outerloop' and 'runtime-libraries-coreclr outerloop' runs) label Apr 9, 2020
@jaredpar (Member) commented Apr 9, 2020

@MattGal

@MattGal (Member) commented Apr 9, 2020

Hi guys.

First off, I'm impressed you filled the whole disk in only 24 minutes. Also, nothing about the second job says the disk is full; it just hit a 2-hour timeout?

Anyway, I've recently had conversations with the MMS folks about similar issues in the macOS hosted pools. The amount of free space goes up and down with the payloads installed on the image, and no one actually knows how much it is, so they can't tell you how to keep your build from blowing up the machine.

Your options include:

  • Move to the buildpool.windows Helix queues, which have huge disks. This isn't a great time to go there given the core-reduction work we've done, and if you need Docker on the build agent (likely, based on the pipeline's title?) I don't think we have that combination ready to go.

  • Remove files you don't need from the build agent, but only if it's a hosted agent. This isn't good behavior in general and I'd recommend against it, but it's literally what AzDO suggested. If you do this, add an explicit check that you're not on a Helix machine, since we don't recycle Helix machines after every build; a sketch of such a check follows this list.
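A minimal sketch of that guard, assuming hypothetical environment variables (HELIX_WORKITEM_ROOT as a Helix marker, AGENT_TOOLSDIRECTORY as a hosted tool cache) and an example path; these are assumptions for illustration, not a documented AzDO or Helix contract, so verify them against the actual agent image before relying on anything like this:

#!/usr/bin/env bash
# Hypothetical pre-build cleanup; the env vars and paths below are illustrative assumptions.
set -euo pipefail

if [ -n "${HELIX_WORKITEM_ROOT:-}" ]; then
  # Assumed Helix marker: never delete shared state on a machine reused across builds.
  echo "Helix machine detected; skipping cleanup."
  exit 0
fi

echo "Disk usage before cleanup:"
df -h

# On a hosted image that is discarded after the build, reclaim space from
# pre-installed toolsets this pipeline does not use (example path only).
rm -rf "${AGENT_TOOLSDIRECTORY:-/nonexistent}/boost" || true

echo "Disk usage after cleanup:"
df -h

The important part is the guard at the top; which deletions actually help would depend on what the hosted image ships.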

@BruceForstall (Member, Author)

@dotnet/ncl Can someone look at this? It's failing every run.

@safern (Member) commented Apr 14, 2020

I think we should just move these builds out of the Hosted pools as they use considerable disk space.

cc: @eiriktsarpalis

@jaredpar (Member)

Agree. This is the only way we're going to get the desired reliability here.

@karelz (Member) commented Apr 14, 2020

@alnikola can you please help us here?
Why does the stress pipeline cause so much trouble?

@eiriktsarpalis (Member)

FWIW, this is happening when building the CLR and libraries on the host machine.

@karelz (Member) commented Apr 14, 2020

Are the machines supposed to handle the build? Are we using the right design for this workflow?

@eiriktsarpalis (Member) commented Apr 14, 2020 via email

@safern (Member) commented Apr 14, 2020

> The failures seem to have started abruptly 10 days ago, which suggests that one change may have broken it.

Actually, the build was failing with an unknown switch on Linux, and the build was not marked as failed:

MSBUILD : error MSB1001: Unknown switch.
Switch: -skiptests

For switch syntax, type "MSBuild -help"
Build failed (exit code '1').
Failed to restore the optimization data package.
The command '/bin/sh -c ./src/coreclr/build.sh -release -skiptests -clang9 &&     ./libraries.sh -c $CONFIGURATION -runtimeconfiguration release' returned a non-zero code: 1

https://dev.azure.com/dnceng/public/_build/results?buildId=587364

What changed the way we build this was: 42183b1#diff-41f10863d38cf298ee01c22c64e1b53a

We normally don't use Hosted machines for our builds because of their limited disk space; our build uses around 8 GB for the artifacts alone. This build also produces Docker containers, which take up additional space.
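A quick way to see how much of that the Docker side is actually consuming on the agent is sketched below; whether pruning is safe depends on what later pipeline steps still need, so treat it as illustration rather than a fix:

# Show how much space images, containers, volumes, and build cache are using.
docker system df

# Reclaim space held by dangling images and unused build cache; only safe if
# later steps don't rely on them.
docker image prune -f
docker builder prune -f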

@alnikola self-assigned this Apr 15, 2020
@alnikola (Contributor)

> MSBUILD : error MSB1001: Unknown switch. Switch: -skiptests
This is a known error that should have been fixed by #33553. Looking into it.
cc: @ViktorHofer

@davidsh (Contributor) commented Apr 15, 2020

> Why does the stress pipeline cause so much trouble?

FWIW, the stress pipeline doesn't run automatically on PRs, so any change to the overall build scripts (like removing the -skiptests argument) can break the stress pipeline without being caught.

@alnikola (Contributor)

Apparently, I misunderstood @safern's reply. The wrong-argument issue has since been fixed, so we definitely have to move the stress test build to a different agent pool.

alnikola added a commit that referenced this issue Jun 18, 2020
HttpStress and SslStress tests moved off hosted pool to different queues.

Note: HttpStress runs are still failing, but that's an actual test-code or product-code issue which will be investigated separately. Infrastructure-wise, everything looks good now.

Fixes #34780
@ghost locked as resolved and limited conversation to collaborators Dec 9, 2020