doc: document limitations with PCRE2_MATCH_INVALID_UTF #188

Closed
carenas wants to merge 1 commit

Conversation

carenas
Contributor

@carenas carenas commented Jan 12, 2023

Closes: #187

doc/pcre2unicode.3 (outdated)
@PhilipHazel
Collaborator

I agree there should be some rewording, but I'm not sure it should mention breaking down 16-bit units. PCRE expects 16-bit data to be in uint16_t vectors, that is, in the natural BE/LE arrangement for the environment. If somebody is handling 16-bit data in bytes, it is their responsibility to get it into the right format for PCRE. I will work on the documentation to try to make this clear.
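
To illustrate the point about getting byte-oriented data into the right format, here is a minimal sketch of converting a UTF-16BE byte buffer into host-order uint16_t code units before handing it to the 16-bit library. The helper name utf16be_bytes_to_units is hypothetical and not part of PCRE2, and the sketch assumes the caller knows the input is big-endian.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of PCRE2.
 * Converts UTF-16BE bytes into 16-bit code units in the host's natural
 * byte order. Returns the number of code units written, or 0 if nbytes
 * is odd or the output buffer is too small (an empty input also gives 0).
 * No UTF-16 validation is done here; surrogates are copied as-is.
 */
size_t utf16be_bytes_to_units(const unsigned char *bytes, size_t nbytes,
                              uint16_t *out, size_t out_cap)
{
    if (nbytes % 2 != 0 || nbytes / 2 > out_cap)
        return 0;

    for (size_t i = 0; i < nbytes / 2; i++) {
        /* High byte comes first in big-endian input. */
        out[i] = (uint16_t)((bytes[2 * i] << 8) | bytes[2 * i + 1]);
    }
    return nbytes / 2;
}
```

The resulting array is properly aligned and in host order, so it could then be passed as the subject to the 16-bit functions, with the length given in code units rather than bytes.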

@carenas
Contributor Author

carenas commented Jan 13, 2023

The proposed change doesn't mention breaking down 16-bit units. The misguided suggestion to do so as a workaround was only part of the original version, because it was explicitly mentioned in the ticket. I only added it because I found it funny while testing: it would match a UTF-16 'a' out of pure luck when a BE 'a' came after half of another, broken, composite character:

code = pcre2_compile(L'a', 1, PCRE2_MATCH_INVALID_UTF, &error, &erroff, NULL);
..
int buf = L'a';
buf <<= 8;
buf |= 0xd8000000;
memcpy(subject, &buf, sizeof(int));
error = pcre2_match(code, (char *)subject + 1, 3, 0, 0, data, NULL);
assert(error == 1);

@PhilipHazel
Collaborator

It took me a while to understand exactly what this was all about. I have added some slightly different wording in the places you suggest, in particular making it clear that this applies only to the 16-bit and 32-bit libraries.

Successfully merging this pull request may close these issues.

The code unit rules for _16 and _32 functions need to be more clearly explained