Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[lld][coff] undefined symbols from lazy-loaded archives #107365

Open
glandium opened this issue Sep 5, 2024 · 22 comments
Open

[lld][coff] undefined symbols from lazy-loaded archives #107365

glandium opened this issue Sep 5, 2024 · 22 comments
Labels

Comments

@glandium
Copy link
Contributor

glandium commented Sep 5, 2024

STR:

Expected result:

  • Linkage fails because of a missing file (-reproduce didn't store the manifest file, you may want to edit the response file to remove the manifest-related flags, in which case the expected result is a successful link)

Actual result:

lld-link: error: undefined symbol: __declspec(dllimport) _wassert
>>> referenced by /tmp/g/third_party/rust/whatsys/c/windows.c:38
>>>               libwhatsys-f4cc7b72e53ee839.rlib(db3b6bfb95261072-windows.o):(get_os_release)

lld-link: error: undefined symbol: __declspec(dllimport) wcscpy
>>> referenced by /tmp/g/third_party/rust/lmdb-rkv-sys/lmdb/libraries/liblmdb/mdb.c:4238
>>>               liblmdb_sys-564cd906496499b4.rlib(7b4e66c3aa61e8a8-mdb.o):(mdb_fopen)

lld-link: error: undefined symbol: __declspec(dllimport) _aligned_malloc
>>> referenced by /tmp/g/third_party/rust/lmdb-rkv-sys/lmdb/libraries/liblmdb/mdb.c:9324
>>>               liblmdb_sys-564cd906496499b4.rlib(7b4e66c3aa61e8a8-mdb.o):(mdb_env_copyfd1)

lld-link: error: undefined symbol: __declspec(dllimport) _aligned_free
>>> referenced by /tmp/g/third_party/rust/lmdb-rkv-sys/lmdb/libraries/liblmdb/mdb.c:9425
>>>               liblmdb_sys-564cd906496499b4.rlib(7b4e66c3aa61e8a8-mdb.o):(mdb_env_copyfd1)

What happens is that the symbols, by the time resolveRemainingUndefines is called, are still undefined, but isLazy() is true for them, and resolveRemainingUndefines doesn't handle that case.

A crude retry, like the following, works around the issue:

diff --git a/lld/COFF/Driver.cpp b/lld/COFF/Driver.cpp
index 3ef9fa3f65c6..46307a386328 100644
--- a/lld/COFF/Driver.cpp
+++ b/lld/COFF/Driver.cpp
@@ -2595,6 +2595,8 @@ void LinkerDriver::linkerMain(ArrayRef<const char *> argsArr) {
 
   // Resolve remaining undefined symbols and warn about imported locals.
   ctx.symtab.resolveRemainingUndefines();
+  run();
+  ctx.symtab.resolveRemainingUndefines();
   if (errorCount())
     return;
 
diff --git a/lld/COFF/SymbolTable.cpp b/lld/COFF/SymbolTable.cpp
index bb7583bb9a7d..5301a7af9c92 100644
--- a/lld/COFF/SymbolTable.cpp
+++ b/lld/COFF/SymbolTable.cpp
@@ -501,6 +501,10 @@ void SymbolTable::resolveRemainingUndefines() {
     // This odd rule is for compatibility with MSVC linker.
     if (name.starts_with("__imp_")) {
       Symbol *imp = find(name.substr(strlen("__imp_")));
+      if (imp && imp->isLazy()) {
+        forceLazy(imp);
+        continue;
+      }
       if (imp && isa<Defined>(imp)) {
         auto *d = cast<Defined>(imp);
         replaceSymbol<DefinedLocalImport>(sym, ctx, name, d);

But somehow, there is some non-determinism, as when running the command repeatedly, sometimes, it fails with strdup marked as undefined symbol.

Cc: @mstorsjo

@llvmbot
Copy link
Collaborator

llvmbot commented Sep 5, 2024

@llvm/issue-subscribers-lld-coff

Author: Mike Hommey (glandium)

STR: - Download https://drive.google.com/file/d/1HhG-LO4MU1GGOLstHlFBi7rOrwKfNDbZ/view?usp=sharing - Unpack the archive and enter the `repro` directory. - Run `lld-link @response.txt`

Expected result:

  • Linkage fails because of a missing file (-reproduce didn't store the manifest file, you may want to edit the response file to remove the manifest-related flags, in which case the expected result is a successful link)

Actual result:

lld-link: error: undefined symbol: __declspec(dllimport) _wassert
&gt;&gt;&gt; referenced by /tmp/g/third_party/rust/whatsys/c/windows.c:38
&gt;&gt;&gt;               libwhatsys-f4cc7b72e53ee839.rlib(db3b6bfb95261072-windows.o):(get_os_release)

lld-link: error: undefined symbol: __declspec(dllimport) wcscpy
&gt;&gt;&gt; referenced by /tmp/g/third_party/rust/lmdb-rkv-sys/lmdb/libraries/liblmdb/mdb.c:4238
&gt;&gt;&gt;               liblmdb_sys-564cd906496499b4.rlib(7b4e66c3aa61e8a8-mdb.o):(mdb_fopen)

lld-link: error: undefined symbol: __declspec(dllimport) _aligned_malloc
&gt;&gt;&gt; referenced by /tmp/g/third_party/rust/lmdb-rkv-sys/lmdb/libraries/liblmdb/mdb.c:9324
&gt;&gt;&gt;               liblmdb_sys-564cd906496499b4.rlib(7b4e66c3aa61e8a8-mdb.o):(mdb_env_copyfd1)

lld-link: error: undefined symbol: __declspec(dllimport) _aligned_free
&gt;&gt;&gt; referenced by /tmp/g/third_party/rust/lmdb-rkv-sys/lmdb/libraries/liblmdb/mdb.c:9425
&gt;&gt;&gt;               liblmdb_sys-564cd906496499b4.rlib(7b4e66c3aa61e8a8-mdb.o):(mdb_env_copyfd1)

What happens is that the symbols, by the time resolveRemainingUndefines is called, are still undefined, but isLazy() is true for them, and resolveRemainingUndefines doesn't handle that case.

A crude retry, like the following, works around the issue:

diff --git a/lld/COFF/Driver.cpp b/lld/COFF/Driver.cpp
index 3ef9fa3f65c6..46307a386328 100644
--- a/lld/COFF/Driver.cpp
+++ b/lld/COFF/Driver.cpp
@@ -2595,6 +2595,8 @@ void LinkerDriver::linkerMain(ArrayRef&lt;const char *&gt; argsArr) {
 
   // Resolve remaining undefined symbols and warn about imported locals.
   ctx.symtab.resolveRemainingUndefines();
+  run();
+  ctx.symtab.resolveRemainingUndefines();
   if (errorCount())
     return;
 
diff --git a/lld/COFF/SymbolTable.cpp b/lld/COFF/SymbolTable.cpp
index bb7583bb9a7d..5301a7af9c92 100644
--- a/lld/COFF/SymbolTable.cpp
+++ b/lld/COFF/SymbolTable.cpp
@@ -501,6 +501,10 @@ void SymbolTable::resolveRemainingUndefines() {
     // This odd rule is for compatibility with MSVC linker.
     if (name.starts_with("__imp_")) {
       Symbol *imp = find(name.substr(strlen("__imp_")));
+      if (imp &amp;&amp; imp-&gt;isLazy()) {
+        forceLazy(imp);
+        continue;
+      }
       if (imp &amp;&amp; isa&lt;Defined&gt;(imp)) {
         auto *d = cast&lt;Defined&gt;(imp);
         replaceSymbol&lt;DefinedLocalImport&gt;(sym, ctx, name, d);

But somehow, there is some non-determinism, as when running the command repeatedly, sometimes, it fails with strdup marked as undefined symbol.

Cc: @mstorsjo

@mstorsjo
Copy link
Member

mstorsjo commented Sep 5, 2024

Thanks, I can reproduce this. As additional context - I can reproduce the nondeterminism on Linux. (Some underlying details about how files are loaded in LLD differ between Linux and Windows, where they are loaded asynchronously on Windows, but that doesn't seem to play any role here.)

I wonder if #85290 makes any difference here - although I doubt it.

@EugeneZelenko EugeneZelenko removed the lld label Sep 5, 2024
@glandium
Copy link
Contributor Author

glandium commented Sep 5, 2024

I wonder if #85290 makes any difference here - although I doubt it.

It doesn't, but interestingly it makes strdup sometimes appear as undefined symbol, while it never happens with an unpatched tree.

@aganea
Copy link
Member

aganea commented Sep 5, 2024

I wonder if #85290 makes any difference here - although I doubt it.

It doesn't, but interestingly it makes strdup sometimes appear as undefined symbol, while it never happens with an unpatched tree.

Yeah I’ve noticed that with a recent Unreal Engine build. I’ll fix it.

@mstorsjo
Copy link
Member

mstorsjo commented Sep 6, 2024

I wonder if #85290 makes any difference here - although I doubt it.

It doesn't, but interestingly it makes strdup sometimes appear as undefined symbol, while it never happens with an unpatched tree.

Wasn’t that part of the original behavior too? From your original issue description:

But somehow, there is some non-determinism, as when running the command repeatedly, sometimes, it fails with strdup marked as undefined symbol.

@glandium
Copy link
Contributor Author

glandium commented Sep 6, 2024

No, I never got strdup as undefined symbol before applying my patch.

@mstorsjo
Copy link
Member

mstorsjo commented Sep 6, 2024

No, I never got strdup as undefined symbol before applying my patch.

Ah, I misunderstood - that was after applying your patch - I see. FWIW, I did see strdup as undefined symbol, occasionally, with an unpatched build.

@glandium
Copy link
Contributor Author

glandium commented Sep 6, 2024

Maybe I didn't try hard enough. Maybe the patch makes it more likely, too.

@glandium
Copy link
Contributor Author

@mstorsjo Ignoring the strdup issue, do you think the crude retry patch from the report is good enough as a start?

@mstorsjo
Copy link
Member

@mstorsjo Ignoring the strdup issue, do you think the crude retry patch from the report is good enough as a start?

Well it looks, as you say, a bit crude.

In one sense, this looks pretty much like what we do in loadMinGWSymbols() - where we scan for things that we may need to pull in, and do another run() before doing the final resolveRemainingUndefines().

So in one sense, it feels like a case where it's a tradeoff between always doing two passes through the symbols, or only doing it when we have a reason to believe we might need another round - and how costly is it to guess whether we need to make another round?

Or should we make resolveRemainingUndefines() return a value to indicate whether we need to pull in more object files and retry?

@glandium
Copy link
Contributor Author

The indeterminism of the error for strdup is due to the differing order of iteration of the symMap between runs. Replacing the llvm::DenseMap with a llvm::MapVector makes the order deterministic and makes the error happen reliably every time on my machine.

@glandium
Copy link
Contributor Author

I now know what's up with strdup, and it's all happening in resolveRemainingUndefines. There are two scenarios depending on the order the DenseMap wants to give:

  • (undefined) __imp_strdup is processed first, and we hit this code:
    if (name.starts_with("__imp_")) {
    Symbol *imp = find(name.substr(strlen("__imp_")));
    if (imp && isa<Defined>(imp)) {
    auto *d = cast<Defined>(imp);
    replaceSymbol<DefinedLocalImport>(sym, ctx, name, d);
    localImportChunks.push_back(cast<DefinedLocalImport>(sym)->getChunk());
    localImports[sym] = d;
    continue;
    }
    }
  • the code looks up for strdup, which is also undefined, so we don't go through the replaceSymbol, and _imp_strdup is added to undefs.
  • strdup is then processed, and is resolved through
    if (undef->resolveWeakAlias())
    continue;
  • Since _imp_strdup is in undefs we get the error.

The second scenario is:

  • (undefined) strdup is processed first, and is resolved through
    if (undef->resolveWeakAlias())
    continue;
  • (undefined) __imp_strdup is then processed, and we hit the code above again. But this time, when it looks up for strdup, it's defined, and we go through the replaceSymbol, marking _imp_strdup defined.
  • Thus we get no error.

I'm not sure yet how to fix this properly.

@mstorsjo
Copy link
Member

I now know what's up with strdup, and it's all happening in resolveRemainingUndefines. There are two scenarios depending on the order the DenseMap wants to give:

Good job in tracking this down! I wasn't really aware that DenseMap has such nondeterminism, but I guess it's to be expected.

I wonder if we should restructure resolveRemainingUndefines to operate in two steps; the first one does the same loop we have right now, more or less. But for any symbol that matches the __imp_ case, we don't inspect them right away, but put them in a local variable map and iterate through them afterwards.

Alternatively, can we in the case

    if (name.starts_with("__imp_")) {
      Symbol *imp = find(name.substr(strlen("__imp_")));
      if (imp && isa<Defined>(imp)) {

include a check kinda like this:

   if (imp && isa<Undefined>(imp))
      imp->resolveWeakAlias();

before checking whether it is defined?

@glandium
Copy link
Contributor Author

I wonder if we should restructure resolveRemainingUndefines to operate in two steps; the first one does the same loop we have right now, more or less. But for any symbol that matches the _imp case, we don't inspect them right away, but put them in a local variable map and iterate through them afterwards.

I was actually just testing that right now, and it works, but your suggestion to add a resolveWeakAlias sounds simpler.

@glandium
Copy link
Contributor Author

but your suggestion to add a resolveWeakAlias sounds simpler.

and, unsurprisingly, it works. I'll go with this. Do you prefer this is addressed in a separate PR? Considering the non-determinism involved, I'm not sure this specific part can have a corresponding testcase.

@mstorsjo
Copy link
Member

but your suggestion to add a resolveWeakAlias sounds simpler.

and, unsurprisingly, it works. I'll go with this. Do you prefer this is addressed in a separate PR? Considering the non-determinism involved, I'm not sure this specific part can have a corresponding testcase.

I'd prefer a separate PR, although they unfortunately kinda clash, touching the same area. I'm fine with the second PR being based on top of the first once, and you can add a note that it's based on that and reviews should focus on the second commit only.

As for a testcase - yeah, given the nondeterminism it's a bit tricky. But it would still be good with some sort of testcase that can trigger the codepaths, even if at least occasionally.

@glandium
Copy link
Contributor Author

Is this something that would be considered uplifting to the release/19.x branch? (I'm going to apply them downstream anyways, + ec4d5a6, without which I'm also getting undefined symbols for the non-import strdup on other link commands)

@mstorsjo
Copy link
Member

Is this something that would be considered uplifting to the release/19.x branch? (I'm going to apply them downstream anyways, + ec4d5a6, without which I'm also getting undefined symbols for the non-import strdup on other link commands)

Maybe, I guess. It's not a regression fix, but if it's a low-risk, small and selfcontained bugfix I guess it can be considered for backporting too.

@glandium
Copy link
Contributor Author

Moving the discussion from the PR back here, where this is more relevant:

I actually got things to link without the fix, by tweaking the command line lld-link is called with (which, incidentally, and annoyingly, -reproduce doesn't keep). And by using link.exe with a command line closer to the original, I actually get a more useful warning about what might, ultimately, be the root cause of all that's going on:

LINK : warning LNK4098: defaultlib 'msvcrt.lib' conflicts with the use of other libs; use /NODEFAULTLIB:library

Sadly, it doesn't say what it conflicts with.
This matches my earlier observation in the PR that the missing symbols are in ucrt.lib while it's trying to link libucrt.lib.

I'm not sure yet where the problems comes from, but a) this is probably a situation lld-link should also warn about b) considering the PR ends up converting errors into different warnings that lead to a binary that doesn't work, it might be better to leave the situation as is with the errors. Better to fail to link than to produce something utterly broken.

@glandium
Copy link
Contributor Author

glandium commented Sep 20, 2024

I finally found where this is all coming from. Here's the story, starting with the missing details in the original report:
This all comes from a coverage build of Firefox. They started to fail after we upgraded to rustc 1.80 with the undefined symbols error reported. In hindsight, I should have dug into what changed between rustc 1.79 and rustc 1.80, but here we are. Anyways, what changed is that before rustc 1.79, msvcrt.lib was appearing on the link command line. Because this is a coverage build, the link command includes the clang profile runtime. That runtime has a directive that says /defaultlib:libcmt because that runtime is compiled with /MT. Because msvcrt.lib was on the command line, it ends up being processed before libcmt.
We also have /defaultlib:msvcrt on the command line from other stuff, possibly even from rustc itself (I haven't really dug up where exactly the command line option comes from), and presumably in other .lib's directives.
So without msvcrt.lib explicitly on the command line, we end up linking libcmt first, and we end up in a total mess.
I think rust-lang/rust#124050 is what's responsible for the explicit msvcrt.lib not being on the link command line anymore.
Two things work to fix the problem:

  • putting msvcrt.lib on the command line again.
  • moving clang_rt.profile-x86_64.lib last.

At this point, I think the only thing that lld-link would need to fix here is to emit LNK4098.

@glandium
Copy link
Contributor Author

Although... as demonstrated by the testcase in #109082, the bug that it fixes is real... but I've only hit it by happenstance.

moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Sep 20, 2024
…invokes r=firefox-build-system-reviewers,sergesanspaille

See llvm/llvm-project#107365 (comment)
for an example of how things can go wrong when they are passed first.

I don't think there was a specific reason for them to have been put
first in the first place.

Differential Revision: https://phabricator.services.mozilla.com/D222899
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this issue Sep 20, 2024
…invokes r=firefox-build-system-reviewers,sergesanspaille

See llvm/llvm-project#107365 (comment)
for an example of how things can go wrong when they are passed first.

I don't think there was a specific reason for them to have been put
first in the first place.

Differential Revision: https://phabricator.services.mozilla.com/D222899

UltraBlame original commit: f6cff71bfc6cba2b1d37d79fce2cb6b5d0b00852
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this issue Sep 20, 2024
…invokes r=firefox-build-system-reviewers,sergesanspaille

See llvm/llvm-project#107365 (comment)
for an example of how things can go wrong when they are passed first.

I don't think there was a specific reason for them to have been put
first in the first place.

Differential Revision: https://phabricator.services.mozilla.com/D222899

UltraBlame original commit: f6cff71bfc6cba2b1d37d79fce2cb6b5d0b00852
@mstorsjo
Copy link
Member

Although... as demonstrated by the testcase in #109082, the bug that it fixes is real... but I've only hit it by happenstance.

Yes, that's a proper fix indeed. If this case is unlikely to come up otherwise, I guess it's not strictly needed to fix it, but then again, the fix isn't all that ugly either so it could be worth fixing it in any case. Do you have opinions on which way to go about it?

As for the conflicts between static and dynamic CRT - that's good info, that properly explains how you'd run into this situation. Even despite that, I don't see why the result of that setup would crash though, no reason specific to LLD at least. But perhaps there are bits within the statically linked CRT objects that break if you'd mix and match like this?

ErichDonGubler pushed a commit to erichdongubler-mozilla/firefox that referenced this issue Sep 21, 2024
…invokes r=firefox-build-system-reviewers,sergesanspaille

See llvm/llvm-project#107365 (comment)
for an example of how things can go wrong when they are passed first.

I don't think there was a specific reason for them to have been put
first in the first place.

Differential Revision: https://phabricator.services.mozilla.com/D222899
i3roly pushed a commit to i3roly/firefox-dynasty that referenced this issue Sep 23, 2024
…invokes r=firefox-build-system-reviewers,sergesanspaille

See llvm/llvm-project#107365 (comment)
for an example of how things can go wrong when they are passed first.

I don't think there was a specific reason for them to have been put
first in the first place.

Differential Revision: https://phabricator.services.mozilla.com/D222899
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants