Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Processing modules at paths that contain extended unicode characters on windows fails #4995

Closed
brson opened this issue Aug 30, 2022 · 6 comments · Fixed by #5671
Closed

Processing modules at paths that contain extended unicode characters on windows fails #4995

brson opened this issue Aug 30, 2022 · 6 comments · Fixed by #5671

Comments

@brson
Copy link
Contributor

brson commented Aug 30, 2022

As far as I can determine, binaryen always fails to open files that contain non-ascii (or perhaps non-latin-1 or some other windowsy encoding) in their path on windows. This is probably because there is no explicit "wide char" support in the code, which is how windows represents unicode, as utf-16.

I have tried to process files via unicode paths with a self-built wasm-opt.exe within the developer command prompt, inside cygwin's mintty, and programmatically (from rust bindings):

cygwin:

$ bin/wasm-opt.exe ../..//hello_world❤️.wasm -o ../..//hello_world_opt.wasm
Failed opening '../..//hello_world??.wasm'

command prompt:

c:\cygwin64\home\Brian\binaryen\build>bin\wasm-opt.exe ..\..\hello_world❤️.wasm -o ..\..\hello_world_opt.wasm
Failed opening '..\..\hello_world??.wasm'

Fixing this seems like it would require all paths to be hidden behind some platform-specific typedefs that resolve to WCHAR on windows, but I don't have much experience here.

cc brson/wasm-opt-rs#40

@tlively
Copy link
Member

tlively commented Aug 31, 2022

Looks like we could use std::filesystem::path since we're on C++17.

@brson
Copy link
Contributor Author

brson commented Sep 1, 2022

filesystem::path looks like the right way to go and should just work with the standard file i/o methods.

From googling I don't see a standardized abstraction for getting wide char arguments from main. Seems like dealing with that and then sorting out which arguments are strings and which are paths could be messy.

@brson
Copy link
Contributor Author

brson commented Oct 31, 2022

Is this something the maintainers would be open to accepting patches for? I could see it being fairly disruptive to the codebase. It might be done in two steps: adjusting the internals to be unicode-correct on windows, then adapting all the main functions.

@tlively
Copy link
Member

tlively commented Oct 31, 2022

I think this would be worth accepting patches for. I can imagine the changes would be disruptive in the parts of the code base that deal with file paths, but I think those parts are fairly small overall, so it should be ok.

@brson
Copy link
Contributor Author

brson commented Jan 25, 2023

I have made progress on this here: https://github.com/brson/binaryen/tree/unicode

The wasm-opt in that branch can input and output to unicode paths on windows. I still need to convert all the other bins, do some cleanup, write some tests. It'll be a while yet.

The changes necessary are not all that invasive. As structured now the bins have to do some unicode conversions for path types, though it might be possible to hide those conversions inside the command-line Options type.

@brson
Copy link
Contributor Author

brson commented Apr 6, 2023

There are two PRs open related to this issue:

dschuff added a commit that referenced this issue Sep 14, 2023
This PR changes how file paths and the command line are handled. On startup on Windows,
we process the wstring version of the command line (including the file paths) and re-encode
it to UTF8 before handing it off to the rest of the command line handling logic. This means
that all paths are stored in UTF8-encoded std::strings as they go through the program, right
up until they are used to open files. At that time, they are converted to the appropriate native
format with the new to_path function before passing to the stdlib open functions.

This has the advantage that all of the non-file-opening code can use a single type to hold paths
(which is good since std::filesystem::path has proved problematic in some cases), but has the
disadvantage that someone could add new code that forgets to convert to_path before
opening. That's somewhat mitigated by the fact that most of the code uses the ModuleIOBase
classes for opening files.

Fixes #4995
radekdoulik pushed a commit to dotnet/binaryen that referenced this issue Jul 12, 2024
This PR changes how file paths and the command line are handled. On startup on Windows,
we process the wstring version of the command line (including the file paths) and re-encode
it to UTF8 before handing it off to the rest of the command line handling logic. This means
that all paths are stored in UTF8-encoded std::strings as they go through the program, right
up until they are used to open files. At that time, they are converted to the appropriate native
format with the new to_path function before passing to the stdlib open functions.

This has the advantage that all of the non-file-opening code can use a single type to hold paths
(which is good since std::filesystem::path has proved problematic in some cases), but has the
disadvantage that someone could add new code that forgets to convert to_path before
opening. That's somewhat mitigated by the fact that most of the code uses the ModuleIOBase
classes for opening files.

Fixes WebAssembly#4995
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants