-
Notifications
You must be signed in to change notification settings - Fork 737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for Unicode paths on Windows #5514
Conversation
I'll follow up on the CI failures another day. |
Hi @brson ! Before I look at the details - I know little about windows or unicode paths there. How does this patch compare to what, say, LLVM does? Do all portable C++ programs need this kind of work? |
I'll look into that question and get back to you. I am not too familiar with C++ programming. |
@kripken I looked into what clang/llvm does. They also use a strategy of immediately converting windows paths to UTF-8, but they don't macro-ize main to use different types. Instead they always define main to take
binaryen could do something like that in the Options class, and not macro-ize main. As far as how llvm does I/O in a unicode-correct way, they appear to have a completely custom I/O stack, with specialized Path types and specialized file functions. In various places they convert back to wide chars right before the windows I/O calls. It is very llvm-specific. |
As far as I can tell, yes, unless they are perhaps using e.g. a GUI framework that hides that complexity. Though I think what I've done here, leveraging c++17's std::filesystem::path is at least less onerous than what llvm does: writing completely custom per-plaftform file I/O functions. |
@@ -29,7 +30,7 @@ | |||
|
|||
using namespace wasm; | |||
|
|||
int main(int argc, const char* argv[]) { | |||
int BYN_MAIN(int argc, const pchar* argv[]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a little more invasive that would imagine.
Do we need to new pchar
type? Can you we have parse1 and parse2 both just take char*
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On windows, the unicode-aware string type is wchar_t, so at some point the codebase has to deal with wchar_t instead of char on windows. The pchar typedef here is for cases where the actual type might be wchar_t, which is the case for the unicode windows main function.
We can at least get rid of it in main by instead calling GetCommandLineW after main, on windows, ignoring the char arguments to main. I will probably make that change, since I think nobody including me, likes the BYN_MAIN macro.
As to whether we can avoid a typedef altogether that depends on whether someone can find an equivalent standard typedef for this purpose; or possibly, after removing BYN_MAIN, the rest of the code that uses pchar can be refactored to not use it.
Ah yeah I recall I had some kind of similar nightmare with windows paths in PNaCl's toolchain back in the day (actually it wasn't about encodings but rather surprisingly low limits on the path lengths that applied to the C stdio functions; so we may still have some kind of issue there). So anyway I'm not too surprised that we have this problem and I'm supportive of fixing it. I'm not thrilled about a macroed main either, but I haven't thought enough about the alternative to have an opinion on whether it's better or worse :) Let me give this PR a closer look. |
I like the idea of just going all-in on |
It also occurred to me that we can go toward this incrementally, starting with just using std::filesystem::path. What do you all think of #5627 ? |
Bringing the discussion back here to keep it in one place: Based on the CI results both here and in #5627, it looks like we'll need to update our MacOS target to 10.15 if we want to use |
I also confirmed that my changes in #5627 work when using I'm not sure how many users we have of the prebuilt macOS binaries that are built on github, but the same options would apply to them. |
The implicit char* to filesystem::path constructor does not do any unicode encoding in my experiments - it just widens the chars naively. That is the reason there is a wrapper in this patch around filesystem::path - to avoid the incorrect conversion. Edit: rather I think it is the string to filesystem::path ctor that i tested, but I assume it's true of char* as well. |
I'll wait to see how #5627 goes; then after that plan to rework the entrypoints to keep the normal main function and instead call GetCommandLineW to retrieve then encode the correct arguments on windows. This patch may need to be redone from scratch, but I'll keep it open for now for discussion. |
Yeah, #5627 makes no attempt at encoding correctness, it's just about using
I suppose if the first thing we do on startup is re-encode everything to UTF-8, then maybe we don't even need to use std::filesystem at all? Can we use use std::string everywhere, and |
I suspect not, but have not tried. I think the filesystem API will just use the non-unicode windows code path and try to open a nonexistent file with a garbage name. |
This patch is obsoleted by #5671 |
This patch makes Binaryen's APIs and binaries work on Windows with paths that contain Unicode characters outside of the ASCII range.
I expect this patch to require changes, for style at least.
The basic problem is that Windows requires processing paths as UTF-16-encoded wide chars, and obtaining these from the command line involves declaring a non-standard
wmain
function that accepts wide chars.The strategy this patch uses is:
On non-Windows this patch should introduce no functional changes except for some additional string copies.
Some key decisions here:
This patch introduces a new header, pchar.h, with definitions needed to provide compatibility with platform-specific string/path types. It would be desirable to not need these defs, but it doesn't quite seem possible to accomplish this patch with just std.
pchar.h defines typedefs for a platftorm-specific character type, pchar,
which is a wchar_t on windows, and a platform-specific string type, pstring. These are aliases for std::filesystem::path::value_type and std::filesystem::path::string_type. It also defines unicode conversions between string and pstring, on windows using windows-specific apis, on unix just doing a string copy.
pchar.h also defines its own custom path type, fspath, which is just a wrapper around std::filesystem::path. The binaryen APIs use this type instead of std::filesystem::path. The only reason this type exists is because std::filesystem::path defines an implicit conversion from std::string which does not do unicode conversion. So this type is used to avoid silently introducing bugs where paths are not encoded correctly.
The binaryen file apis rely on an implicit conversion from std::string to fspath that does unicode encoding. This is to avoid introducing new explicit conversions to the various binaryen main functions.
Note that windows paths may contain non-standard unpaired surrogate code units. I have not attempted to support such strings as they are a rare corner case. The approach in this patch could work with unpaired surrogates by changing the encoding/decoding methods. Depending on how the windows encoding APIs work, these paths may already work - I have not tested. If these paths don't work the result should be mystery i/o failures, not crashes.
This patch probably also likely has the side-effect of allowing other argument strings to be UTF-8 encoded Unicode on Windows, such as
--output-source-map-url
, where before they would have been incorrectly encoded in some way. I have not thought hard about the consequences.I have added a test of unicode path processing, but don't really know what I'm doing with the test suite. I just copied the hello_world files and modified them a bit.
I see a pre-existing test failure on windows (duplicate_imports.wat), so I have not successfully run the full test suite on windows.
Fixes #4995