UTF-8 to UTF-16 gives wrong result with SIMD #212

Closed
jeremy-coulon opened this issue Jan 18, 2024 · 3 comments

@jeremy-coulon

I am using the latest git revision from master (0e62505).

I am transcoding the following string from UTF-8 to UTF-16: config._initial_
UTF-8 code units:
[63, 6f, 6e, 66, 69, 67, 2e, 5f, 69, 6e, 69, 74, 69, 61, 6c, 5f]

By default, BOOST_TEXT_USE_SIMD is defined as 1, and I get the following UTF-16 code units:
[63, 6e, 69, 2e, 10, 0, ca74, 7f9d, 69, 69, 69, 6c, ea80, 7f9e, eab0, 7f9e]
which is wrong.

When I force BOOST_TEXT_USE_SIMD to 0 before including Boost.Text, I get the correct result:
[63, 6f, 6e, 66, 69, 67, 2e, 5f, 69, 6e, 69, 74, 69, 61, 6c, 5f]

My code is:

const std::u8string a = u8"config._initial_";
fmt::print("UTF-8  code units: {::4x}\n", a | std::views::transform([](auto c) { return static_cast<unsigned>(c); }));
std::u16string b;
b.resize(a.size());
const auto [_, out] = boost::text::transcode_to_utf16(a, b.data());
const std::ptrdiff_t newSize = std::ranges::distance(b.data(), out);
b.resize(newSize);
fmt::print("UTF-16 code units: {::4x}\n", b | std::views::transform([](auto c) { return static_cast<unsigned>(c); }));

What's even stranger is that I don't always get the same UTF-16 result. Maybe my SIMD registers already contain some value before transcoding?
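For comparison, a scalar reference transcoder makes the expected result easy to check independently of any SIMD path. The following is only a minimal sketch in standard C++: `scalar_utf8_to_utf16` is a hypothetical helper, not part of Boost.Text, and it takes a `std::string_view` rather than a `std::u8string`. It assumes mostly well-formed input and does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical reference helper (NOT a Boost.Text API): scalar UTF-8 -> UTF-16.
// Invalid lead bytes, truncated sequences, and bad continuation bytes are
// replaced with U+FFFD. Overlong forms and encoded surrogates are not checked.
std::u16string scalar_utf8_to_utf16(std::string_view in)
{
    std::u16string out;
    out.reserve(in.size()); // UTF-16 never needs more code units than UTF-8
    for (std::size_t i = 0; i < in.size();) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 0; // stays 0 for an invalid lead byte
        char32_t cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }

        // Fold in the continuation bytes, each of the form 10xxxxxx.
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const auto b = static_cast<unsigned char>(in[i + k]);
            ok = (b & 0xC0) == 0x80;
            cp = (cp << 6) | (b & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) {
            out.push_back(static_cast<char16_t>(0xFFFD));
            ++i; // resynchronize one byte at a time
            continue;
        }

        if (cp >= 0x10000) {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
        i += len;
    }
    return out;
}
```

For ASCII-only input such as "config._initial_", every output code unit equals the corresponding input byte, which matches the result reported above with BOOST_TEXT_USE_SIMD set to 0.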

@tzlaine
Owner

tzlaine commented Jan 18, 2024

Easy fix: I just turned off all the SIMD code. This repo is not being actively maintained; I thought you might want to know that if you're using it. I may cannibalize it to make other, smaller projects.

@jeremy-coulon
Author

Sad to hear that since this is the reference implementation of P2728.

I really hope that your paper can make it to C++26.

@tzlaine
Owner

tzlaine commented Jan 20, 2024

Well, it's a good news/bad news situation. I'm not pushing those Unicode papers any more, but someone who is not burned out on Unicode is picking them up! However, I have low-to-no expectation that this will happen in C++26. C++29 is way more likely.
