Support unicode characters in token definitions #324

Closed
2 of 4 tasks
andreasabel opened this issue Nov 13, 2020 · 0 comments

@andreasabel

Continues #249.

In most backends' lexer generators,

token Op ["⊗⊕"]

isn't translated correctly to a regular expression.

  • Haskell
  • OCaml
  • C / C++
  • Java
@andreasabel andreasabel added bug OCaml C++ C lexer Concerning the generated lexer unicode labels Nov 13, 2020
@andreasabel andreasabel added this to the 2.9 milestone Nov 13, 2020
@andreasabel andreasabel self-assigned this Nov 13, 2020
andreasabel added a commit that referenced this issue Nov 13, 2020
It seems that in OCaml a char is 8 bits wide, and Unicode characters
are represented as their UTF-8 encoded strings.  This means we cannot
represent Unicode character sets in the ocamllex lexer definition.
We can use string literals in some circumstances.

For that reason, RAlts is now translated to a disjunction of char or
string literals (the latter for unicode chars) rather than to a
@[charset]@.
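
A minimal sketch of the idea in this commit message, written in Python for illustration only (the actual change lives in the BNFC backend, and the helper names `ocamllex_literal` and `alts_to_ocamllex` are hypothetical). It assumes printable input characters and renders each alternative as an ocamllex char literal when it is ASCII, and as a string literal otherwise, because a non-ASCII character occupies several bytes in UTF-8 and so cannot be a single 8-bit char:

```python
def ocamllex_literal(ch: str) -> str:
    """Render one printable character as an ocamllex literal (illustrative)."""
    if ord(ch) < 128:
        # ASCII fits in one 8-bit OCaml char; escape quote and backslash.
        escaped = {"'": "\\'", "\\": "\\\\"}.get(ch, ch)
        return f"'{escaped}'"
    # Non-ASCII: several UTF-8 bytes, so emit a string literal instead.
    return f'"{ch}"'


def alts_to_ocamllex(chars: str) -> str:
    """Translate a set of alternatives into a disjunction of literals."""
    return "(" + " | ".join(ocamllex_literal(c) for c in chars) + ")"


# Each operator from the issue is three bytes long in UTF-8.
print("⊗".encode("utf-8"))        # b'\xe2\x8a\x97'
print(alts_to_ocamllex("⊗⊕"))     # ("⊗" | "⊕")
print(alts_to_ocamllex("+⊗"))     # ('+' | "⊗")
```

With a character set such as `['⊗' '⊕']`, a byte-oriented lexer would see six separate bytes rather than two characters, which is why the disjunction form is used here.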
andreasabel added a commit that referenced this issue Nov 13, 2020
…r classes

Instead of @l@, use name @_letter@ for predefined character class
@letter@ etc.

The included test makes sure the new names cannot clash with user
defined token names.  Previously, user token type L would be
translated to @l@ clashing with the predefined letter character
class.
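
The clash described above can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the list of predefined classes is partial, and the rule that a user token name is lowercased at its first character is inferred only from the `L` to `l` example in the message. It also assumes that user token names always start with a letter, so they can never begin with an underscore.

```python
# Partial, illustrative list of predefined character classes.
PREDEFINED_CLASSES = ["letter", "digit"]


def user_token_name(token: str) -> str:
    """Assumed scheme: lowercase the first character of a user token name."""
    return token[0].lower() + token[1:]


def old_class_name(cls: str) -> str:
    """Old scheme from the message: 'letter' becomes 'l'."""
    return cls[0]


def new_class_name(cls: str) -> str:
    """New scheme from the message: 'letter' becomes '_letter'."""
    return "_" + cls


# Old scheme: user token L collides with the predefined letter class.
print(user_token_name("L") == old_class_name("letter"))   # True

# New scheme: the leading underscore keeps the two namespaces apart.
print(user_token_name("L") == new_class_name("letter"))   # False
```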