Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Unicode paths on Windows #5514

Closed
wants to merge 5 commits into from

Conversation

brson
Copy link
Contributor

@brson brson commented Feb 22, 2023

This patch makes Binaryen's APIs and binaries work on Windows with paths that contain Unicode characters outside of the ASCII range.

I expect this patch to require changes, for style at least.

The basic problem is that Windows requires processing paths as UTF-16-encoded wide chars, and obtaining these from the command line involves declaring a non-standard wmain function that accepts wide chars.

The strategy this patch uses is:

  • Use a macro to declare main
  • Options::parse re-encodes its arguments as UTF-8 on windows
  • Change APIs to take (a wrapper of ) std::filesystem::path, which uses the correct encoding on windows
  • Perform a UTF-8 to UTF-16 conversion on windows when calling the binaryen filesystem APIs

On non-Windows this patch should introduce no functional changes except for some additional string copies.

Some key decisions here:

This patch introduces a new header, pchar.h, with definitions needed to provide compatibility with platform-specific string/path types. It would be desirable to not need these defs, but it doesn't quite seem possible to accomplish this patch with just std.

pchar.h defines typedefs for a platftorm-specific character type, pchar,
which is a wchar_t on windows, and a platform-specific string type, pstring. These are aliases for std::filesystem::path::value_type and std::filesystem::path::string_type. It also defines unicode conversions between string and pstring, on windows using windows-specific apis, on unix just doing a string copy.

pchar.h also defines its own custom path type, fspath, which is just a wrapper around std::filesystem::path. The binaryen APIs use this type instead of std::filesystem::path. The only reason this type exists is because std::filesystem::path defines an implicit conversion from std::string which does not do unicode conversion. So this type is used to avoid silently introducing bugs where paths are not encoded correctly.

The binaryen file apis rely on an implicit conversion from std::string to fspath that does unicode encoding. This is to avoid introducing new explicit conversions to the various binaryen main functions.

Note that windows paths may contain non-standard unpaired surrogate code units. I have not attempted to support such strings as they are a rare corner case. The approach in this patch could work with unpaired surrogates by changing the encoding/decoding methods. Depending on how the windows encoding APIs work, these paths may already work - I have not tested. If these paths don't work the result should be mystery i/o failures, not crashes.

This patch probably also likely has the side-effect of allowing other argument strings to be UTF-8 encoded Unicode on Windows, such as --output-source-map-url, where before they would have been incorrectly encoded in some way. I have not thought hard about the consequences.

I have added a test of unicode path processing, but don't really know what I'm doing with the test suite. I just copied the hello_world files and modified them a bit.

I see a pre-existing test failure on windows (duplicate_imports.wat), so I have not successfully run the full test suite on windows.

Fixes #4995

@brson
Copy link
Contributor Author

brson commented Feb 22, 2023

I'll follow up on the CI failures another day.

@kripken
Copy link
Member

kripken commented Feb 23, 2023

Hi @brson !

Before I look at the details - I know little about windows or unicode paths there. How does this patch compare to what, say, LLVM does? Do all portable C++ programs need this kind of work?

@brson
Copy link
Contributor Author

brson commented Feb 24, 2023

I'll look into that question and get back to you. I am not too familiar with C++ programming.

@brson
Copy link
Contributor Author

brson commented Mar 28, 2023

@kripken I looked into what clang/llvm does.

They also use a strategy of immediately converting windows paths to UTF-8, but they don't macro-ize main to use different types. Instead they always define main to take char**, and they call a function that, on windows, calls GetCommandLineW to get the wide arguments, converts them to utf-8, then writes them back to a char** vector for the caller to use. In comments they say they do it this way to be less intrusive.

binaryen could do something like that in the Options class, and not macro-ize main.

As far as how llvm does I/O in a unicode-correct way, they appear to have a completely custom I/O stack, with specialized Path types and specialized file functions. In various places they convert back to wide chars right before the windows I/O calls. It is very llvm-specific.

@brson
Copy link
Contributor Author

brson commented Mar 28, 2023

Do all portable C++ programs need this kind of work?

As far as I can tell, yes, unless they are perhaps using e.g. a GUI framework that hides that complexity. Though I think what I've done here, leveraging c++17's std::filesystem::path is at least less onerous than what llvm does: writing completely custom per-plaftform file I/O functions.

@kripken
Copy link
Member

kripken commented Mar 31, 2023

I see, thanks @brson

I had no idea cross-platform file paths were this difficult 😄

This PR looks reasonable to me given all that, but ccing people that have more experience with such things: @dschuff @sbc100 - thoughts?

@@ -29,7 +30,7 @@

using namespace wasm;

int main(int argc, const char* argv[]) {
int BYN_MAIN(int argc, const pchar* argv[]) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little more invasive that would imagine.

Do we need to new pchar type? Can you we have parse1 and parse2 both just take char*

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On windows, the unicode-aware string type is wchar_t, so at some point the codebase has to deal with wchar_t instead of char on windows. The pchar typedef here is for cases where the actual type might be wchar_t, which is the case for the unicode windows main function.

We can at least get rid of it in main by instead calling GetCommandLineW after main, on windows, ignoring the char arguments to main. I will probably make that change, since I think nobody including me, likes the BYN_MAIN macro.

As to whether we can avoid a typedef altogether that depends on whether someone can find an equivalent standard typedef for this purpose; or possibly, after removing BYN_MAIN, the rest of the code that uses pchar can be refactored to not use it.

@dschuff
Copy link
Member

dschuff commented Apr 4, 2023

Ah yeah I recall I had some kind of similar nightmare with windows paths in PNaCl's toolchain back in the day (actually it wasn't about encodings but rather surprisingly low limits on the path lengths that applied to the C stdio functions; so we may still have some kind of issue there). So anyway I'm not too surprised that we have this problem and I'm supportive of fixing it. I'm not thrilled about a macroed main either, but I haven't thought enough about the alternative to have an opinion on whether it's better or worse :) Let me give this PR a closer look.

@dschuff
Copy link
Member

dschuff commented Apr 5, 2023

I like the idea of just going all-in on std::filesystem::path for all filenames. Maybe in the future we could also use other std::filesystem tools for things like wasm::file_size.
I'm less of a fan of wrapping it in wasm::fspath though. IIUC that's just so that we can always create one from whatever kind of string we have? What if we used LLVM way of using the global init function to just convert all the paths to UTF-8 and then use that everywhere internally? It looks like you can construct a filesystem::path from any type of string, so I think it should work without needing pchar and without needing the wrapper?
For re-encoding the command line arguments, I think my preference would be to call an init function in every main as LLVM does rather than using a macro, especially if we end up wanting other init-time things in every tool).

@dschuff
Copy link
Member

dschuff commented Apr 5, 2023

It also occurred to me that we can go toward this incrementally, starting with just using std::filesystem::path. What do you all think of #5627 ?

@dschuff
Copy link
Member

dschuff commented Apr 5, 2023

Bringing the discussion back here to keep it in one place: Based on the CI results both here and in #5627, it looks like we'll need to update our MacOS target to 10.15 if we want to use std::filesystem::path. This might be a possibility; the emsdk builder builds its own copy of libc++ so I don't think that should be an issue for us even if we want to keep 10.14 (I can check). We probably want to also check with other embedders that build Binaryen themselves such as @juj

@dschuff
Copy link
Member

dschuff commented Apr 6, 2023

I also confirmed that my changes in #5627 work when using boost::filesystem instead of std::filesystem. I'm not sure whether it's better to
a) drop support for MacOS 10.14
b) enable use of a (local or system-installed) boost:filesystem in Binaryen's CMake
c) do b) but also include Boost with Binaryen
d) add a build of libc++ in emsdk's binaryen build scripts (AFAIK @juj is the only one who uses this)

I'm not sure how many users we have of the prebuilt macOS binaries that are built on github, but the same options would apply to them.
b) seems fairly straightforward, but it would require whoever runs the emsdk script to ensure they have boost available. Many systems conveniently do, but I'm not sure about MacOS.
d) isn't terrible in principle (it's what I did for the Chromium emsdk builders) but it's kind of a pain because you have to pull in the entire LLVM source codebase just to build libc++. One option to mitigate this might be to package a prebuilt installation of libc++ rather than building it on the bot. Building and using gets more painful as you support more different kinds of configurations (static libs, dynamic libs, static libs with PIC, etc).
One advantage of d) is that it would work for other future C++ features too.

@brson
Copy link
Contributor Author

brson commented Apr 6, 2023

It looks like you can construct a filesystem::path from any type of string, so I think it should work without needing pchar and without needing the wrapper?

The implicit char* to filesystem::path constructor does not do any unicode encoding in my experiments - it just widens the chars naively. That is the reason there is a wrapper in this patch around filesystem::path - to avoid the incorrect conversion.

Edit: rather I think it is the string to filesystem::path ctor that i tested, but I assume it's true of char* as well.

@brson
Copy link
Contributor Author

brson commented Apr 6, 2023

I'll wait to see how #5627 goes; then after that plan to rework the entrypoints to keep the normal main function and instead call GetCommandLineW to retrieve then encode the correct arguments on windows. This patch may need to be redone from scratch, but I'll keep it open for now for discussion.

@dschuff
Copy link
Member

dschuff commented Apr 6, 2023

Yeah, #5627 makes no attempt at encoding correctness, it's just about using path rather than string because that seemed like a useful thing to do anyway. I'm sure there are places where there needs to be better conversion that I've missed.

It looks like you can construct a filesystem::path from any type of string, so I think it should work without needing pchar and without needing the wrapper?

The implicit char* to filesystem::path constructor does not do any unicode encoding in my experiments - it just widens the chars naively. That is the reason there is a wrapper in this patch around filesystem::path - to avoid the incorrect conversion.

Edit: rather I think it is the string to filesystem::path ctor that i tested, but I assume it's true of char* as well.

I suppose if the first thing we do on startup is re-encode everything to UTF-8, then maybe we don't even need to use std::filesystem at all? Can we use use std::string everywhere, and std::ifstream::open will get it right?

@brson
Copy link
Contributor Author

brson commented Apr 6, 2023

Yeah, #5627 makes no attempt at encoding correctness, it's just about using path rather than string because that seemed like a useful thing to do anyway. I'm sure there are places where there needs to be better conversion that I've missed.

It looks like you can construct a filesystem::path from any type of string, so I think it should work without needing pchar and without needing the wrapper?

The implicit char* to filesystem::path constructor does not do any unicode encoding in my experiments - it just widens the chars naively. That is the reason there is a wrapper in this patch around filesystem::path - to avoid the incorrect conversion.
Edit: rather I think it is the string to filesystem::path ctor that i tested, but I assume it's true of char* as well.

I suppose if the first thing we do on startup is re-encode everything to UTF-8, then maybe we don't even need to use std::filesystem at all? Can we use use std::string everywhere, and std::ifstream::open will get it right?

I suspect not, but have not tried. I think the filesystem API will just use the non-unicode windows code path and try to open a nonexistent file with a garbage name.

@brson
Copy link
Contributor Author

brson commented Aug 31, 2023

This patch is obsoleted by #5671

@brson brson closed this Aug 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Processing modules at paths that contain extended unicode characters on windows fails
4 participants