Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode command line to UTF8 on Windows #5671

Merged
merged 19 commits into from
Sep 14, 2023
Merged

Encode command line to UTF8 on Windows #5671

merged 19 commits into from
Sep 14, 2023

Conversation

dschuff
Copy link
Member

@dschuff dschuff commented Apr 15, 2023

This PR changes how file paths and the command line are handled. On startup on Windows,
we process the wstring version of the command line (including the file paths) and re-encode
it to UTF8 before handing it off to the rest of the command line handling logic. This means
that all paths are stored in UTF8-encoded std::strings as they go through the program, right
up until they are used to open files. At that time, they are converted to the appropriate native
format with the new to_path function before passing to the stdlib open functions.

This has the advantage that all of the non-file-opening code can use a single type to hold paths
(which is good since std::filesystem::path has proved problematic in some cases), but has the
disadvantage that someone could add new code that forgets to convert to_path before
opening. That's somewhat mitigated by the fact that most of the code uses the ModuleIOBase
classes for opening files.

Fixes #4995

@dschuff
Copy link
Member Author

dschuff commented Apr 15, 2023

This is essentially the orthogonal piece to #5627 (i.e. it doesn't depend on std::filesystem, which may end up being problematic) and I think it actually should accomplish the same goal as #5514 . It's not exactly done but as I'm OOO next week I thought I'd upload it.
Basically it encodes the command line to UTF8 and keeps std::string as the way to carry filename information. Then it encodes back to wchar on Windows where we open the file.

@dschuff
Copy link
Member Author

dschuff commented May 3, 2023

This is slightly cleaned up. For some reason there still seems to be an issue with the source map paths. i.e. I tried testing by running tests in a directory with an emoji (so all paths would have it) and the input and output paths seem fine, but source maps are failing to open for some reason I don't understand yet.

But it does show one possibility that doesn't require std::filesystem or use of std::wstring anywhere other than right at the point of opening files. One downside is that it does require explicitly doing the conversion everywhere we open a file, so it's easy to make a mistake. I think the only way to avoid that is use an abstraction like std::filesystem (or reinventing the path part of std::filesystem).
What do you all think?

Copy link
Member

@kripken kripken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not an expert on the windows stuff but overall this looks reasonable and non-intrusive.

@dschuff
Copy link
Member Author

dschuff commented May 9, 2023

@brson maybe you can help here, I'm a bit stumped. Currently this patch looks like it should work, and it does when opening input files, but opening output files (either for e.g. wasm conversion output like in the lit tests here, or for source map output) fails. And not with a "file not found" error, but with an "invalid argument" error (you can see this in the test output on this PR). Does it have the same behavior on your machine? Do you have any ideas about what the issue might be?

@brson
Copy link
Contributor

brson commented May 19, 2023

I'll get back to this next week and give this patch a test then.

@brson
Copy link
Contributor

brson commented Jul 26, 2023

@dschuff I have tested this patch and made a fix:

brson@f8d37bb

This seems to resolve the issue of the -o flag not working.

@codecov
Copy link

codecov bot commented Aug 18, 2023

Codecov Report

Merging #5671 (17e512a) into main (5bfb192) will increase coverage by 0.01%.
Report is 4 commits behind head on main.
The diff coverage is 42.85%.

@@            Coverage Diff             @@
##             main    #5671      +/-   ##
==========================================
+ Coverage   42.63%   42.65%   +0.01%     
==========================================
  Files         484      484              
  Lines       74872    74850      -22     
  Branches    11938    11925      -13     
==========================================
+ Hits        31919    31924       +5     
+ Misses      39748    39725      -23     
+ Partials     3205     3201       -4     
Files Changed Coverage Δ
src/support/file.cpp 44.00% <25.00%> (-1.84%) ⬇️
src/tools/wasm-opt.cpp 50.00% <33.33%> (+5.48%) ⬆️
src/wasm/wasm-io.cpp 83.13% <42.85%> (-4.21%) ⬇️
src/support/command-line.cpp 79.27% <100.00%> (ø)
src/support/path.cpp 28.57% <100.00%> (+2.64%) ⬆️

... and 15 files with indirect coverage changes

@brson
Copy link
Contributor

brson commented Aug 31, 2023

Here's a patch for the current windows build error:

brson@5183149

The error is:

  D:\a\binaryen\binaryen\src\support\path.cpp(29,87): error C2440: 'return': cannot convert from 'std::wstring' to 'const wchar_t *' 

This is because the signature of Path::to_path

const PathString::value_type* to_path(const std::string& s) { return string_to_wstring(s); }

is asking for a wchar_t* on windows but returning a wstring. This signature won't work on windows because it requires returning a raw pointer to a temporary wstring's buffer.

Changing the signature to

const PathString to_path(const std::string& s) { return string_to_wstring(s); }

returning a string/wstring, works for all callers and fixes the build error on windows, still builds on linux.

I tested that wasm-opt on windows still works with unicode paths, but have not run the test suite.

@dschuff
Copy link
Member Author

dschuff commented Sep 1, 2023

That is basically what I had in 9fe2f9b (now rebased to 9fdb5df so you can see the current CI). The problem is that it doesn't work on mingw (which has GNU libstdc++ on top of windows)... libstdc++ doesn't have wchar versions of functions like std::basic_ifstream<char>::open(), presumably because they aren't useful on Linux where libstdc++ was developed. At this point I think I'm tempted to just not have good support for unicode paths on minGW. TBH I'm not even sure whether anyone uses the mingw build, but if someone pops up and asks about this, maybe they can help us figure out a solution :)

@dschuff dschuff marked this pull request as ready for review September 7, 2023 23:10
@dschuff
Copy link
Member Author

dschuff commented Sep 7, 2023

I think this can be reviewed now. @kripken @brson WDYT?

@dschuff
Copy link
Member Author

dschuff commented Sep 7, 2023

(CI note: the linter really wants "shellapi.h" to be included before "windows.h" but my understanding is that windows.h should be included first, so I'm leaving it).

@dschuff
Copy link
Member Author

dschuff commented Sep 8, 2023

Oh, and a note on testing: The current test case doesn't cover all the cases in the code. One easy way to test this comprehensively would be to create the build/install directory on windows such that the directory path (and therefore all file paths in use) contains an emoji. Do you think that would be worth it?

Copy link
Member

@kripken kripken left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an expert on the windows side, but those details look good in general, and overall this seems nonintrusive and helpful.

Testing here seems sufficient for now to me.

@brson can you confirm this fixes things for you?

@brson
Copy link
Contributor

brson commented Sep 14, 2023

@kripken Yes, confirmed this patch works for me. I have tested it with the Rust bindings and our Unicode tests work on windows with it.

Thanks @dschuff!

@dschuff dschuff merged commit f774eff into main Sep 14, 2023
13 of 14 checks passed
@dschuff dschuff deleted the encode-cmdline branch September 14, 2023 21:08
radekdoulik pushed a commit to dotnet/binaryen that referenced this pull request Jul 12, 2024
This PR changes how file paths and the command line are handled. On startup on Windows,
we process the wstring version of the command line (including the file paths) and re-encode
it to UTF8 before handing it off to the rest of the command line handling logic. This means
that all paths are stored in UTF8-encoded std::strings as they go through the program, right
up until they are used to open files. At that time, they are converted to the appropriate native
format with the new to_path function before passing to the stdlib open functions.

This has the advantage that all of the non-file-opening code can use a single type to hold paths
(which is good since std::filesystem::path has proved problematic in some cases), but has the
disadvantage that someone could add new code that forgets to convert to_path before
opening. That's somewhat mitigated by the fact that most of the code uses the ModuleIOBase
classes for opening files.

Fixes WebAssembly#4995
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Processing modules at paths that contain extended unicode characters on windows fails
3 participants