New \p{Letter} Unicode property escape #1688

Merged: 1 commit, Mar 1, 2017


bhamiltoncx
Contributor

@bhamiltoncx bhamiltoncx commented Feb 22, 2017

@mike-lischke asked for this in his review of #1633. I think it's a great way to show the power of the new full Unicode functionality in ANTLR4.

This PR adds two new lexer escapes suitable for use in a charset, so:

[a-z]

could become:

[\p{Ll}]

or, equivalently:

[\p{Lowercase_Letter}]

to match a-z as well as exciting Unicode code points like 𝐚 (U+1D41A) through 𝐳 (U+1D433).
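To illustrate what "general category Ll" covers, here is a minimal Python sketch that uses the standard `unicodedata` module. It is not ANTLR code and says nothing about how ANTLR implements the escape. The helper name `is_lowercase_letter` is made up for this example.

```python
import unicodedata


def is_lowercase_letter(ch: str) -> bool:
    # Hypothetical helper: True when the code point's general category is
    # Ll (Lowercase_Letter), the set that [\p{Ll}] is described as matching.
    return unicodedata.category(ch) == "Ll"


# ASCII letters, one uppercase letter, and the mathematical bold letters
# U+1D41A and U+1D433 mentioned in the description.
for ch in ["a", "z", "A", "\U0001D41A", "\U0001D433"]:
    print(f"U+{ord(ch):04X}", is_lowercase_letter(ch))
```

The supplementary-plane letters report the same category as plain a-z, which is why a property escape picks them up where [a-z] would not.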

I included both matching and non-matching variants:

  1. \p{Letter}: Include all Unicode code points with the general category "Letter"
  2. \P{Letter}: Include all Unicode code points which do not have the general category "Letter"
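The relationship between the two variants can be sketched in Python as a predicate and its complement. This assumes "Letter" means every general category whose code starts with "L" (Lu, Ll, Lt, Lm, Lo); `matches_letter` is a hypothetical name, not part of ANTLR.

```python
import unicodedata


def matches_letter(ch: str, negated: bool = False) -> bool:
    # negated=False mimics \p{Letter}; negated=True mimics \P{Letter}.
    # All letter subcategories share the major class prefix "L".
    is_letter = unicodedata.category(ch).startswith("L")
    return is_letter != negated


for ch in ["é", "あ", "1", "_"]:
    print(repr(ch), matches_letter(ch), matches_letter(ch, negated=True))
```

For any single code point exactly one of the two calls returns True, which is the behavior the list above describes.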

This also works for:

  1. Scripts (\p{Latin}, \p{Hiragana}, \p{Cyrillic})
  2. Binary properties (\p{Emoji}, \p{Changes_When_Uppercased}, \p{Quotation_Mark})
  3. Blocks, prefixed with "In" because their names collide with script names (\p{InHiragana}, \p{InArabic_Ext_A}, \p{InGreek})

The names of properties are case-insensitive. In addition, - and _ are treated the same.
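One way to picture that name matching is to fold every name to a canonical key before lookup. The Python sketch below is an assumption about the general approach, not the code in this PR; `normalize_property_name`, `resolve_category`, and the tiny `ALIASES` table are all hypothetical.

```python
def normalize_property_name(name: str) -> str:
    # Fold case and treat "-" and "_" as the same character.
    return name.lower().replace("-", "_")


# Hypothetical alias table covering only the names used in this thread.
ALIASES = {
    "l": "L",
    "letter": "L",
    "ll": "Ll",
    "lowercase_letter": "Ll",
}


def resolve_category(name: str) -> str:
    # Raises KeyError for names missing from the illustrative table.
    return ALIASES[normalize_property_name(name)]


print(resolve_category("Lowercase-Letter"))
print(resolve_category("lowercase_letter"))
print(resolve_category("LETTER"))
```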

@mike-lischke
Member

I guess you followed http://www.regular-expressions.info/unicode.html with your names. Do you also support short forms (like \p{Ll})? And what about the Unicode block syntax (\p{InXXX})?

@bhamiltoncx
Contributor Author

bhamiltoncx commented Feb 23, 2017 via email

@bhamiltoncx bhamiltoncx force-pushed the unicode-property-escape branch 2 times, most recently from fb17610 to 93c625e Compare February 23, 2017 22:09
@bhamiltoncx
Contributor Author

bhamiltoncx commented Feb 23, 2017

OK, added support for Unicode blocks with names prefixed with In.

Since block and script names overlap, \p{InGreek} means the (basic) Greek block, but \p{Greek} means the entire Greek script, which includes multiple blocks like \p{InGreek_Ext}.
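A block is a single contiguous code point range, so the difference is easy to show with two characters. The Python sketch below hard-codes the ranges of the Greek and Coptic block (U+0370 to U+03FF) and the Greek Extended block (U+1F00 to U+1FFF). The `BLOCKS` table and `in_block` helper are hypothetical, and script data is left out because Python's standard library does not expose it.

```python
# Hypothetical lookup table keyed by the escape names used in this thread.
BLOCKS = {
    "InGreek": range(0x0370, 0x0400),      # Greek and Coptic block
    "InGreek_Ext": range(0x1F00, 0x2000),  # Greek Extended block
}


def in_block(ch: str, block: str) -> bool:
    # A block test is just a range check on the code point.
    return ord(ch) in BLOCKS[block]


alpha = "\u03B1"        # GREEK SMALL LETTER ALPHA
alpha_psili = "\u1F00"  # GREEK SMALL LETTER ALPHA WITH PSILI

for ch in (alpha, alpha_psili):
    print(f"U+{ord(ch):04X}", in_block(ch, "InGreek"), in_block(ch, "InGreek_Ext"))
```

Both characters belong to the Greek script, so \p{Greek} as described above would accept both, while each block escape accepts only one of them.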

@bhamiltoncx
Contributor Author

Split off two PRs which this is stacked on:

#1692
#1693

@bhamiltoncx
Contributor Author

OK, I refactored everything and this should be good to go once the two dependent PRs (#1692 and #1693) land.

I'll rebase it once those two land and update the PR.

@bhamiltoncx bhamiltoncx changed the title WIP: New \p{Letter} Unicode property escape New \p{Letter} Unicode property escape Mar 1, 2017
@bhamiltoncx
Contributor Author

OK, rebased and ready for review! (Hopefully all the tests pass.)

@parrt
Member

parrt commented Mar 1, 2017

boom, jack!

@parrt parrt merged commit 8d1df4c into antlr:master Mar 1, 2017
@bhamiltoncx bhamiltoncx deleted the unicode-property-escape branch March 1, 2017 23:27
@bhamiltoncx
Contributor Author

Awesomesauce.
