[RFC] lregex: replace gnu_regex with Onigmo regex engine #2469

Closed
wants to merge 1,160 commits into from

Member

@masatake masatake commented Mar 12, 2020

In most cases, I expect no impact on existing optlib
code using the --regex-... options.

If the r regex flag is given, you can use the extended features of Onigmo
with Ruby regex syntax.

A demonstration of one of the miraculous features:

input.mylang


    (define (f1) 1)
    (define ((f2)) #t)
    (define (((f3))) "abc")

mylang.ctags

    --langdef=mylang
    --map-mylang=.mylang
    --kinddef-mylang=f,fun,function, function returning a function, or function returning a function returning a function...
    --_fielddef-mylang=symbol,symbol binding to the function
    --fields-mylang={symbol}
    --regex-mylang=/\(define +(([-a-z0-9]+)|\(\g<1>\))/\1/f/r{_field=symbol:\2}

See the r flag passed to --regex-mylang.

    (((f3)))	input.mylang	/^(define (((f3))) "abc")$/;"	f	symbol:f3
    ((f2))	input.mylang	/^(define ((f2)) #t)$/;"	f	symbol:f2
    (f1)	input.mylang	/^(define (f1) 1)$/;"	f	symbol:f1

Look at the tag names. The pairs of `(' and `)' are well balanced.
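
As a rough illustration of what the recursive call `\g<1>` in the pattern above does, here is a minimal hand-written sketch in C. It is not how Onigmo or ctags implements the match; the function names `is_symbol_char` and `match_group1` are hypothetical and exist only for this example. It mirrors group 1, which is either a bare symbol (group 2) or a parenthesized repetition of group 1 itself.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if c belongs to the character class [-a-z0-9]. */
static int is_symbol_char(char c)
{
    return c == '-' || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
}

/*
 * Hypothetical equivalent of group 1, (([-a-z0-9]+)|\(\g<1>\)):
 * either a bare symbol, or '(' followed by group 1 again and ')'.
 * Returns the number of bytes matched at s, or 0 if nothing matches.
 * On success, *sym and *sym_len describe the innermost symbol (group 2).
 */
static size_t match_group1(const char *s, const char **sym, size_t *sym_len)
{
    assert(s != NULL && sym != NULL && sym_len != NULL);

    if (is_symbol_char(s[0])) {
        size_t n = 0;
        while (is_symbol_char(s[n]))
            n++;
        *sym = s;
        *sym_len = n;
        return n;
    }
    if (s[0] == '(') {
        /* This recursive call plays the role of \g<1>. */
        size_t inner = match_group1(s + 1, sym, sym_len);
        if (inner > 0 && s[1 + inner] == ')')
            return inner + 2;
    }
    return 0;
}
```

Called on the text that follows `(define ` in the third input line, the sketch consumes `(((f3)))` and reports `f3` as the symbol, which corresponds to the tag name and the symbol field in the output above. Unbalanced input such as `((f2) #t)` does not match.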

I don't know whether this can also be used on VC2008.
* Show stack depth.
* Show header
* Show the type of stack top.
* Decrease stack number by one. `stk` points to the address where
  the next data should be written.
Named backrefs (\k<name>, \g{name}) refer only to the leftmost group with
that name in Perl.
* Show stack type in string.
* Fix bug when *op == OP_FINISH
Our parser uses recursion, so it causes stack overflow when parsing
deeply nested capture groups. E.g.:

  x2("(" * 32767 + "a" + ")" * 32767, "a", 0, 1)

Set a limit for this.
The default value is defined in regint.h:
* DEFAULT_PARSE_DEPTH_LIMIT (Currently 4096)

Also add two APIs to support this:
* onig_get_parse_depth_limit
* onig_set_parse_depth_limit
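
The idea behind the limit can be sketched as follows. This is a simplified stand-in, not Onigmo's parser: every `sketch_*` name and error code below is hypothetical, and only the default of 4096 is taken from the commit message above. The parser passes its nesting depth down the recursion and gives up with an error once the depth exceeds the configured limit, instead of overflowing the C stack.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the default quoted above; the name is made up for this sketch. */
#define SKETCH_DEPTH_LIMIT 4096u

enum {
    SKETCH_OK = 0,
    SKETCH_ERR_SYNTAX = -1,
    SKETCH_ERR_TOO_DEEP = -2
};

static unsigned int sketch_depth_limit = SKETCH_DEPTH_LIMIT;

/* Hypothetical getter/setter pair, analogous in spirit to the two new APIs. */
static unsigned int sketch_get_depth_limit(void) { return sketch_depth_limit; }
static void sketch_set_depth_limit(unsigned int d) { sketch_depth_limit = d; }

/* Parses  group := 'a' | '(' group ')'  recursively, tracking the depth. */
static int sketch_parse_group(const char **p, unsigned int depth)
{
    assert(p != NULL && *p != NULL);

    if (depth > sketch_depth_limit)
        return SKETCH_ERR_TOO_DEEP;   /* stop before the C stack overflows */

    if (**p == 'a') {
        (*p)++;
        return SKETCH_OK;
    }
    if (**p == '(') {
        int r;
        (*p)++;
        r = sketch_parse_group(p, depth + 1);
        if (r != SKETCH_OK)
            return r;
        if (**p != ')')
            return SKETCH_ERR_SYNTAX;
        (*p)++;
        return SKETCH_OK;
    }
    return SKETCH_ERR_SYNTAX;
}

/* Parses a whole pattern; trailing input counts as a syntax error. */
static int sketch_parse(const char *pattern)
{
    const char *p = pattern;
    int r = sketch_parse_group(&p, 0);
    if (r == SKETCH_OK && *p != '\0')
        return SKETCH_ERR_SYNTAX;
    return r;
}
```

With the limit lowered to 2 through `sketch_set_depth_limit`, `((a))` still parses while `(((a)))` is rejected with `SKETCH_ERR_TOO_DEEP`.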
Found by Coverity Scan.

This is not needed if RUBY is defined, because rb_raise() is called and
the function does not return.
Capturing inside subexp calls should use a stack.
This fixes only the first issue. The second one has not been fixed yet.
fix uppercasing for U+A64B, CYRILLIC SMALL LETTER MONOGRAPH UK

* enc/unicode.c: Add U+A64B to the special cases 03B9 and 03BC
  at the end of onigenc_unicode_case_map (Bug #12990).

* enc/unicode/case-folding.rb: Add U+A64B to the special cases
  03B9 and 03BC. Add a comment pointing to enc/unicode.c.
  Change warnings to exceptions for unpredicted cases,
  because this would have been more easily noticed
  (the warning was not noticed when upgrading to Unicode 9.0.0).
Regexp supports Unicode 9.0.0's \X

* meta character \X matches Unicode 9.0.0 characters with some workarounds
  for UTR #51 Unicode Emoji, Version 4.0 emoji zwj sequences.
  [Feature #12831] [ruby-core:77586]

The term "character" can have many meanings: bytes, codepoints, combined
characters, and so on. "Grapheme cluster" is the highest-level of these
terms and means a user-perceived character.
Unicode Standard Annex #29 UNICODE TEXT SEGMENTATION specifies how to
handle grapheme clusters (extended grapheme clusters).
However, some specs have not been updated to the current situation because
Unicode Emoji is being extended rapidly without a clear definition.
This breaks the precondition of UTR#29 "Grapheme cluster boundaries can be
easily tested by looking at immediately adjacent characters". (the
sentence will be removed in the next version)
Some of the details are described in Unicode Technical Report #51
UNICODE EMOJI, but they are not merged into UTR#29 yet.
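
To make the difference between these meanings of "character" concrete, here is a small sketch in C that counts bytes and codepoints of a UTF-8 string. The function names are hypothetical, and the codepoint counter assumes well-formed UTF-8. Counting grapheme clusters is deliberately left out, since that is the part that needs the segmentation rules discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of bytes in a NUL-terminated string. */
static size_t sketch_count_bytes(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/*
 * Number of codepoints in a NUL-terminated UTF-8 string.
 * Assumes valid UTF-8: every byte that is not a continuation byte
 * (bit pattern 10xxxxxx) starts a new codepoint.
 */
static size_t sketch_count_codepoints(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For `"e\xCC\x81"` (the letter e followed by U+0301 COMBINING ACUTE ACCENT) this gives 3 bytes and 2 codepoints, although a reader perceives a single character. An emoji zwj sequence such as U+1F468 U+200D U+1F469 is 11 bytes and 3 codepoints.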

http://unicode.org/reports/tr29/
http://unicode.org/reports/tr51/
http://unicode.org/Public/emoji/4.0/
constify CaseMappingSpecials
Use offsetof macro and shrink table size
k-takata and others added 17 commits July 29, 2019 20:43
Fix stack overflow with X+++++++++++++++++++…
Use spaces for indentation.

Also add `#ifndef RUBY` where needed.
If `RUBY` is defined, `malloc` is replaced with `ruby_xmalloc` which
fails with a `NoMemoryError` instead of returning `NULL`.
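
A minimal sketch of the pattern this commit message describes, assuming a helper named `sketch_dup_bytes` that is invented for this example and is not part of Onigmo. The `NULL` check is compiled only when `RUBY` is not defined, because in the Ruby build the allocation, per the message above, raises instead of returning `NULL`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: returns a heap copy of n bytes from src.
 * In a non-RUBY build it returns NULL when the allocation fails.
 */
static unsigned char *sketch_dup_bytes(const unsigned char *src, size_t n)
{
    unsigned char *dst;

    assert(src != NULL);
    dst = malloc(n > 0 ? n : 1);   /* avoid a zero-size request */
#ifndef RUBY
    /* Plain malloc can return NULL, so the result must be checked here.
     * With RUBY defined, the check would be dead code (see above). */
    if (dst == NULL)
        return NULL;
#endif
    memcpy(dst, src, n);
    return dst;
}
```

The caller owns the returned buffer and releases it with `free`.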
/[\x{111111}]/ causes out-of-bounds read when encoding is a single byte
encoding. \x{111111} is an invalid codepoint for a single byte encoding.
Check if it is a valid codepoint.
TODO: Currently \x{1000000} behaves differently depending on encodings.
Some encodings don't return an error for it. Should we make them
consistent?
Fix out-of-bounds read in parse_char_class()
git-subtree-dir: Onigmo
git-subtree-mainline: 0dbadb6
git-subtree-split: 0830382
@masatake
Member Author

It seems that using git-subtree is a bad idea.

@masatake masatake force-pushed the Onigmo branch 2 times, most recently from 97d07b7 to 8920563 Compare March 12, 2020 19:11
Close universal-ctags#1861.

In most cases, I expect no impact on existing optlib
code using the --regex-... options.

If the `r` regex flag is given, you can use the extended features of Onigmo
with Ruby regex syntax.

A demonstration of one of the miraculous features:

;; input.mylang

    (define (f1) 1)
    (define ((f2)) #t)
    (define (((f3))) "abc")

    --langdef=mylang
    --map-mylang=.mylang
    --kinddef-mylang=f,fun,function, function returning a function, or function returning a function returning a function...
    --_fielddef-mylang=symbol,symbol binding to the function
    --fields-mylang={symbol}
    --regex-mylang=/\(define +(([-a-z0-9]+)|\(\g<1>\))/\1/f/r{_field=symbol:\2}

See the r flag passed to --regex-mylang.

    (((f3)))	input.mylang	/^(define (((f3))) "abc")$/;"	f	symbol:f3
    ((f2))	input.mylang	/^(define ((f2)) #t)$/;"	f	symbol:f2
    (f1)	input.mylang	/^(define (f1) 1)$/;"	f	symbol:f1

Look at the tag names. The pairs of `(' and `)' are well balanced.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
@coveralls

Coverage Status

Coverage decreased (-0.01%) to 86.789% when pulling 9b73623 on masatake:Onigmo into 30cd8e0 on universal-ctags:master.

@k-takata
Member

> It seems that using git-subtree is a bad idea.

How about using the --squash option? See misc/pull-packcc.sh.

Maybe it would be better to also add misc/pull-onigmo.sh and misc/pull-libreadtags.sh?
I don't remember the git-subtree command line.

@masatake
Member Author

@k-takata, thank you for the suggestion.
I will add misc/pull-libreadtags.sh first.
I'm thinking about introducing a common script like:

misc/pull-subtree.sh [packcc|libreadtags|onigmo]

if possible.

@masatake
Member Author

Instead of replacing gnu_regex, I'm thinking about adding pcre2 to ctags. See #3036.
