Add support for Unicode paths on Windows #5514

brson · 2023-02-22T23:24:45Z

This patch makes Binaryen's APIs and binaries work on Windows with paths that contain Unicode characters outside of the ASCII range.

I expect this patch to require changes, for style at least.

The basic problem is that Windows requires processing paths as UTF-16-encoded wide chars, and obtaining these from the command line involves declaring a non-standard wmain function that accepts wide chars.

The strategy this patch uses is:

Use a macro to declare main
Options::parse re-encodes its arguments as UTF-8 on windows
Change APIs to take (a wrapper of ) std::filesystem::path, which uses the correct encoding on windows
Perform a UTF-8 to UTF-16 conversion on windows when calling the binaryen filesystem APIs

On non-Windows this patch should introduce no functional changes except for some additional string copies.

Some key decisions here:

This patch introduces a new header, pchar.h, with definitions needed to provide compatibility with platform-specific string/path types. It would be desirable to not need these defs, but it doesn't quite seem possible to accomplish this patch with just std.

pchar.h defines typedefs for a platftorm-specific character type, pchar,
which is a wchar_t on windows, and a platform-specific string type, pstring. These are aliases for std::filesystem::path::value_type and std::filesystem::path::string_type. It also defines unicode conversions between string and pstring, on windows using windows-specific apis, on unix just doing a string copy.

pchar.h also defines its own custom path type, fspath, which is just a wrapper around std::filesystem::path. The binaryen APIs use this type instead of std::filesystem::path. The only reason this type exists is because std::filesystem::path defines an implicit conversion from std::string which does not do unicode conversion. So this type is used to avoid silently introducing bugs where paths are not encoded correctly.

The binaryen file apis rely on an implicit conversion from std::string to fspath that does unicode encoding. This is to avoid introducing new explicit conversions to the various binaryen main functions.

Note that windows paths may contain non-standard unpaired surrogate code units. I have not attempted to support such strings as they are a rare corner case. The approach in this patch could work with unpaired surrogates by changing the encoding/decoding methods. Depending on how the windows encoding APIs work, these paths may already work - I have not tested. If these paths don't work the result should be mystery i/o failures, not crashes.

This patch probably also likely has the side-effect of allowing other argument strings to be UTF-8 encoded Unicode on Windows, such as --output-source-map-url, where before they would have been incorrectly encoded in some way. I have not thought hard about the consequences.

I have added a test of unicode path processing, but don't really know what I'm doing with the test suite. I just copied the hello_world files and modified them a bit.

I see a pre-existing test failure on windows (duplicate_imports.wat), so I have not successfully run the full test suite on windows.

Fixes #4995

brson · 2023-02-22T23:29:55Z

I'll follow up on the CI failures another day.

kripken · 2023-02-23T16:10:44Z

Hi @brson !

Before I look at the details - I know little about windows or unicode paths there. How does this patch compare to what, say, LLVM does? Do all portable C++ programs need this kind of work?

brson · 2023-02-24T20:49:27Z

I'll look into that question and get back to you. I am not too familiar with C++ programming.

brson · 2023-03-28T21:36:22Z

@kripken I looked into what clang/llvm does.

They also use a strategy of immediately converting windows paths to UTF-8, but they don't macro-ize main to use different types. Instead they always define main to take char**, and they call a function that, on windows, calls GetCommandLineW to get the wide arguments, converts them to utf-8, then writes them back to a char** vector for the caller to use. In comments they say they do it this way to be less intrusive.

binaryen could do something like that in the Options class, and not macro-ize main.

As far as how llvm does I/O in a unicode-correct way, they appear to have a completely custom I/O stack, with specialized Path types and specialized file functions. In various places they convert back to wide chars right before the windows I/O calls. It is very llvm-specific.

brson · 2023-03-28T21:38:35Z

Do all portable C++ programs need this kind of work?

As far as I can tell, yes, unless they are perhaps using e.g. a GUI framework that hides that complexity. Though I think what I've done here, leveraging c++17's std::filesystem::path is at least less onerous than what llvm does: writing completely custom per-plaftform file I/O functions.

kripken · 2023-03-31T22:31:55Z

I see, thanks @brson

I had no idea cross-platform file paths were this difficult 😄

This PR looks reasonable to me given all that, but ccing people that have more experience with such things: @dschuff @sbc100 - thoughts?

sbc100 · 2023-03-31T22:45:21Z

src/tools/wasm-as.cpp

@@ -29,7 +30,7 @@

 using namespace wasm;

-int main(int argc, const char* argv[]) {
+int BYN_MAIN(int argc, const pchar* argv[]) {


This seems a little more invasive that would imagine.

Do we need to new pchar type? Can you we have parse1 and parse2 both just take char*

On windows, the unicode-aware string type is wchar_t, so at some point the codebase has to deal with wchar_t instead of char on windows. The pchar typedef here is for cases where the actual type might be wchar_t, which is the case for the unicode windows main function.

We can at least get rid of it in main by instead calling GetCommandLineW after main, on windows, ignoring the char arguments to main. I will probably make that change, since I think nobody including me, likes the BYN_MAIN macro.

As to whether we can avoid a typedef altogether that depends on whether someone can find an equivalent standard typedef for this purpose; or possibly, after removing BYN_MAIN, the rest of the code that uses pchar can be refactored to not use it.

dschuff · 2023-04-04T23:35:46Z

Ah yeah I recall I had some kind of similar nightmare with windows paths in PNaCl's toolchain back in the day (actually it wasn't about encodings but rather surprisingly low limits on the path lengths that applied to the C stdio functions; so we may still have some kind of issue there). So anyway I'm not too surprised that we have this problem and I'm supportive of fixing it. I'm not thrilled about a macroed main either, but I haven't thought enough about the alternative to have an opinion on whether it's better or worse :) Let me give this PR a closer look.

dschuff · 2023-04-05T00:23:16Z

I like the idea of just going all-in on std::filesystem::path for all filenames. Maybe in the future we could also use other std::filesystem tools for things like wasm::file_size.
I'm less of a fan of wrapping it in wasm::fspath though. IIUC that's just so that we can always create one from whatever kind of string we have? What if we used LLVM way of using the global init function to just convert all the paths to UTF-8 and then use that everywhere internally? It looks like you can construct a filesystem::path from any type of string, so I think it should work without needing pchar and without needing the wrapper?
For re-encoding the command line arguments, I think my preference would be to call an init function in every main as LLVM does rather than using a macro, especially if we end up wanting other init-time things in every tool).

dschuff · 2023-04-05T01:25:35Z

It also occurred to me that we can go toward this incrementally, starting with just using std::filesystem::path. What do you all think of #5627 ?

dschuff · 2023-04-05T16:31:48Z

Bringing the discussion back here to keep it in one place: Based on the CI results both here and in #5627, it looks like we'll need to update our MacOS target to 10.15 if we want to use std::filesystem::path. This might be a possibility; the emsdk builder builds its own copy of libc++ so I don't think that should be an issue for us even if we want to keep 10.14 (I can check). We probably want to also check with other embedders that build Binaryen themselves such as @juj

dschuff · 2023-04-06T18:52:20Z

I also confirmed that my changes in #5627 work when using boost::filesystem instead of std::filesystem. I'm not sure whether it's better to
a) drop support for MacOS 10.14
b) enable use of a (local or system-installed) boost:filesystem in Binaryen's CMake
c) do b) but also include Boost with Binaryen
d) add a build of libc++ in emsdk's binaryen build scripts (AFAIK @juj is the only one who uses this)

I'm not sure how many users we have of the prebuilt macOS binaries that are built on github, but the same options would apply to them.
b) seems fairly straightforward, but it would require whoever runs the emsdk script to ensure they have boost available. Many systems conveniently do, but I'm not sure about MacOS.
d) isn't terrible in principle (it's what I did for the Chromium emsdk builders) but it's kind of a pain because you have to pull in the entire LLVM source codebase just to build libc++. One option to mitigate this might be to package a prebuilt installation of libc++ rather than building it on the bot. Building and using gets more painful as you support more different kinds of configurations (static libs, dynamic libs, static libs with PIC, etc).
One advantage of d) is that it would work for other future C++ features too.

brson · 2023-04-06T19:39:26Z

It looks like you can construct a filesystem::path from any type of string, so I think it should work without needing pchar and without needing the wrapper?

The implicit char* to filesystem::path constructor does not do any unicode encoding in my experiments - it just widens the chars naively. That is the reason there is a wrapper in this patch around filesystem::path - to avoid the incorrect conversion.

Edit: rather I think it is the string to filesystem::path ctor that i tested, but I assume it's true of char* as well.

brson · 2023-04-06T19:43:49Z

I'll wait to see how #5627 goes; then after that plan to rework the entrypoints to keep the normal main function and instead call GetCommandLineW to retrieve then encode the correct arguments on windows. This patch may need to be redone from scratch, but I'll keep it open for now for discussion.

dschuff · 2023-04-06T20:18:04Z

Yeah, #5627 makes no attempt at encoding correctness, it's just about using path rather than string because that seemed like a useful thing to do anyway. I'm sure there are places where there needs to be better conversion that I've missed.

It looks like you can construct a filesystem::path from any type of string, so I think it should work without needing pchar and without needing the wrapper?

The implicit char* to filesystem::path constructor does not do any unicode encoding in my experiments - it just widens the chars naively. That is the reason there is a wrapper in this patch around filesystem::path - to avoid the incorrect conversion.

Edit: rather I think it is the string to filesystem::path ctor that i tested, but I assume it's true of char* as well.

I suppose if the first thing we do on startup is re-encode everything to UTF-8, then maybe we don't even need to use std::filesystem at all? Can we use use std::string everywhere, and std::ifstream::open will get it right?

brson · 2023-04-06T20:21:22Z

Yeah, #5627 makes no attempt at encoding correctness, it's just about using path rather than string because that seemed like a useful thing to do anyway. I'm sure there are places where there needs to be better conversion that I've missed.

It looks like you can construct a filesystem::path from any type of string, so I think it should work without needing pchar and without needing the wrapper?

The implicit char* to filesystem::path constructor does not do any unicode encoding in my experiments - it just widens the chars naively. That is the reason there is a wrapper in this patch around filesystem::path - to avoid the incorrect conversion.
Edit: rather I think it is the string to filesystem::path ctor that i tested, but I assume it's true of char* as well.

I suppose if the first thing we do on startup is re-encode everything to UTF-8, then maybe we don't even need to use std::filesystem at all? Can we use use std::string everywhere, and std::ifstream::open will get it right?

I suspect not, but have not tried. I think the filesystem API will just use the non-unicode windows code path and try to open a nonexistent file with a garbage name.

brson · 2023-08-31T18:47:11Z

This patch is obsoleted by #5671

brson added 5 commits February 22, 2023 15:25

Add pchar.h and main.h

4d163f2

Update binaryen APIs for platform-specific paths

570312c

Update tools to use platform-specific paths

cfd7b95

Add a unicode paths test

928f931

Update changelog

9b37923

sbc100 reviewed Mar 31, 2023

View reviewed changes

brson mentioned this pull request Apr 6, 2023

Processing modules at paths that contain extended unicode characters on windows fails #4995

Closed

dschuff mentioned this pull request Apr 15, 2023

Encode command line to UTF8 on Windows #5671

Merged

brson closed this Aug 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for Unicode paths on Windows #5514

Add support for Unicode paths on Windows #5514

brson commented Feb 22, 2023 •

edited

Loading

brson commented Feb 22, 2023

kripken commented Feb 23, 2023

brson commented Feb 24, 2023

brson commented Mar 28, 2023

brson commented Mar 28, 2023 •

edited

Loading

kripken commented Mar 31, 2023

sbc100 Mar 31, 2023

brson Apr 6, 2023

dschuff commented Apr 4, 2023 •

edited

Loading

dschuff commented Apr 5, 2023

dschuff commented Apr 5, 2023

dschuff commented Apr 5, 2023 •

edited

Loading

dschuff commented Apr 6, 2023

brson commented Apr 6, 2023 •

edited

Loading

brson commented Apr 6, 2023 •

edited

Loading

dschuff commented Apr 6, 2023

brson commented Apr 6, 2023

brson commented Aug 31, 2023

Add support for Unicode paths on Windows #5514

Add support for Unicode paths on Windows #5514

Conversation

brson commented Feb 22, 2023 • edited Loading

brson commented Feb 22, 2023

kripken commented Feb 23, 2023

brson commented Feb 24, 2023

brson commented Mar 28, 2023

brson commented Mar 28, 2023 • edited Loading

kripken commented Mar 31, 2023

sbc100 Mar 31, 2023

Choose a reason for hiding this comment

brson Apr 6, 2023

Choose a reason for hiding this comment

dschuff commented Apr 4, 2023 • edited Loading

dschuff commented Apr 5, 2023

dschuff commented Apr 5, 2023

dschuff commented Apr 5, 2023 • edited Loading

dschuff commented Apr 6, 2023

brson commented Apr 6, 2023 • edited Loading

brson commented Apr 6, 2023 • edited Loading

dschuff commented Apr 6, 2023

brson commented Apr 6, 2023

brson commented Aug 31, 2023

brson commented Feb 22, 2023 •

edited

Loading

brson commented Mar 28, 2023 •

edited

Loading

dschuff commented Apr 4, 2023 •

edited

Loading

dschuff commented Apr 5, 2023 •

edited

Loading

brson commented Apr 6, 2023 •

edited

Loading

brson commented Apr 6, 2023 •

edited

Loading