PCRE2_UTF flag detects U+202F as Mongolian #118

Closed
dAu6jARL opened this issue May 9, 2022 · 4 comments
Labels
invalid This doesn't seem right

Comments

@dAu6jARL

dAu6jARL commented May 9, 2022

If pcre2_compile() is called with the PCRE2_UTF option, U+202F (NARROW NO-BREAK SPACE) is matched as Mongolian.
The same behavior can be reproduced with pcre2grep and the -u option.

The contents of sample.text are as follows:

foo bar (blank is U+0020)
foo​bar (blank is U+200B)
foo bar (blank is U+202F)
foobar (blank is U+FEFF)
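Because three of the four separators above are invisible or easy to confuse when copied from a web page, a small script can recreate the file from explicit code points. This is only a sketch: the names `SAMPLE_LINES` and `write_sample` are hypothetical and are not part of the original report.

```python
from pathlib import Path

# Each line uses an explicit escape so the separator is unambiguous.
# The code points are the four named in the report.
SAMPLE_LINES = [
    "foo\u0020bar (blank is U+0020)",  # SPACE
    "foo\u200Bbar (blank is U+200B)",  # ZERO WIDTH SPACE
    "foo\u202Fbar (blank is U+202F)",  # NARROW NO-BREAK SPACE
    "foo\uFEFFbar (blank is U+FEFF)",  # ZERO WIDTH NO-BREAK SPACE
]


def write_sample(path="sample.text"):
    """Write the sample lines as UTF-8 and return the path that was used."""
    Path(path).write_text("\n".join(SAMPLE_LINES) + "\n", encoding="utf-8")
    return path
```

Calling `write_sample()` produces a `sample.text` in the current directory that can be passed to the pcre2grep command from the report.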

The command is as follows:

pcre2grep -u '\p{Mongolian}' sample.text

The output is as follows:

foo bar (blank is U+202F)

@PhilipHazel added the "invalid" label on May 11, 2022
@PhilipHazel
Collaborator

PhilipHazel commented May 11, 2022

It appears that Mongolian exists in the list of script extensions for U+202F. Here is output from the ucptest program:

$ ./ucptest 202f
U+202F CS Separator: Space separator, common, Other, [latin, mongolian], [alphabetic, caseignorable, cased, diacritic, graphemebase, idcontinue, idstart, lowercase]

Perl also recognizes U+202F as Mongolian. The Unicode file ScriptExtensions.txt from which PCRE2 gets its data contains this:

202F ; Latn Mong # Zs NARROW NO-BREAK SPACE

So it looks like this is deliberate on the part of Unicode. I am therefore closing this as invalid.
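For readers who want to check other code points themselves, a line in the style quoted from ScriptExtensions.txt can be split into its code point (or range) and its list of script codes. The following is a rough sketch, not how PCRE2 builds its tables; `parse_scx_line` is a hypothetical helper, and the `first..last` range form is an assumption based on the usual Unicode data file layout.

```python
def parse_scx_line(line):
    """Parse one ScriptExtensions.txt-style line.

    Returns (first, last, scripts), or None for blank and comment-only lines.
    """
    # Everything after '#' is a comment (general category and character name).
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    cp_field, scripts_field = (part.strip() for part in data.split(";", 1))
    if ".." in cp_field:
        # Range form, e.g. "1802..1803" (assumed from the usual UCD layout).
        first, last = (int(x, 16) for x in cp_field.split(".."))
    else:
        first = last = int(cp_field, 16)
    return first, last, scripts_field.split()


entry = parse_scx_line("202F ; Latn Mong # Zs NARROW NO-BREAK SPACE")
```

Here `entry` holds the code point 0x202F as both ends of the range, together with the script codes `Latn` and `Mong`, which is the pair that ucptest prints as `[latin, mongolian]`.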

@dAu6jARL
Author

Thank you for your reply.
I need to study more.

@PhilipHazel
Collaborator

Note that \p{Mong} works like \p{scx:Mong}, that is, it checks both the script and the script extensions. If you want to test just the script, use \p{sc:Mong}.
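The difference between the two tests can be modelled in a few lines. This is a toy model of the description above, not PCRE2's implementation: the table names and functions are hypothetical, and the data is limited to U+202F (script and extensions as reported by ucptest in this thread) plus two well-known characters for contrast.

```python
# Illustrative data only. U+202F follows the ucptest output in this thread:
# script "common" (Zyyy), script extensions [latin, mongolian].
SCRIPT = {
    0x0020: "Zyyy",  # SPACE: Common
    0x202F: "Zyyy",  # NARROW NO-BREAK SPACE: Common
    0x1820: "Mong",  # MONGOLIAN LETTER A: Mongolian
}
SCRIPT_EXTENSIONS = {
    0x202F: {"Latn", "Mong"},
}


def matches_sc(cp, script):
    """Model of \\p{sc:...}: test the Script property only."""
    return SCRIPT.get(cp) == script


def matches_scx(cp, script):
    """Model of \\p{scx:...}: test the script and the script extensions."""
    return matches_sc(cp, script) or script in SCRIPT_EXTENSIONS.get(cp, set())
```

In this model, `matches_scx(0x202F, "Mong")` is true while `matches_sc(0x202F, "Mong")` is false, which mirrors why `\p{Mongolian}` matched the third line of sample.text and why `\p{sc:Mong}` is the stricter test.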

@dAu6jARL
Author

Thank you for your advice. I'll use \p{sc:Xxx} as appropriate.
As it happened, I came across U+202F being matched by \p{Mongolian} in place names such as the following.

Arrêt Nation – Voltaire [56]
https://foursquare.com/v/4ffc7038e4b07354de08b054

Pasta & Tapas Pietro 池袋店
https://foursquare.com/v/613eee8f90eb43793c2e76fd
