[RFC] lregex: replace gnu_regex with Onigmo regex engine #2469

masatake · 2020-03-12T18:38:33Z

In most cases, I hope there is no impact on existing optlib
code using the --regex-... option.

If the r regex flag is given, you can use the extended features of Onigmo
with Ruby regex syntax.

A demonstration of one of its miracle features:

input.mylang


    (define (f1) 1)
    (define ((f2)) #t)
    (define (((f3))) "abc")

mylang.ctags

    --langdef=mylang
    --map-mylang=.mylang
    --kinddef-mylang=f,fun,function, function returning a function, or function returning a function returning function...
    --_fielddef-mylang=symbol,symbol binding to the function
    --fields-mylang={symbol}
    --regex-mylang=/\(define +(([-a-z0-9]+)|\(\g<1>\))/\1/f/r{_field=symbol:\2}

See the r flag passed to --regex-mylang.

    (((f3)))    input.mylang    /^(define (((f3))) "abc")$/;"   f       symbol:f3
    ((f2))      input.mylang    /^(define ((f2)) #t)$/;"        f       symbol:f2
    (f1)        input.mylang    /^(define (f1) 1)$/;"   f       symbol:f1

Look at the names of the tags. The pairs of `(` and `)` are well balanced.
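For readers who do not have an Onigmo build at hand, here is a rough Python sketch of what the `\g<1>` subexpression call in the pattern above does. Python's standard `re` module has no subexpression calls, so the recursion is written by hand. The helper names `match_group1` and `tag_for` are hypothetical and exist only for this illustration; this is not how lregex or Onigmo implements it.

```python
import re

HEAD = re.compile(r"\(define +")   # literal prefix of the pattern
NAME = re.compile(r"[-a-z0-9]+")   # group 2 of the pattern


def match_group1(text, pos):
    """Emulate group 1: NAME, or '(' followed by group 1 again and ')'.

    Returns (end_index, symbol) on success, None on failure.
    """
    m = NAME.match(text, pos)
    if m:
        return m.end(), m.group(0)
    if text.startswith("(", pos):
        inner = match_group1(text, pos + 1)   # plays the role of \g<1>
        if inner is not None:
            end, symbol = inner
            if text.startswith(")", end):
                return end + 1, symbol
    return None


def tag_for(line):
    """Return (tag_name, symbol) for the first match in line, else None."""
    for head in HEAD.finditer(line):
        result = match_group1(line, head.end())
        if result is not None:
            end, symbol = result
            return line[head.end():end], symbol
    return None


for line in ['(define (f1) 1)', '(define ((f2)) #t)', '(define (((f3))) "abc")']:
    print(tag_for(line))
```

Each opening parenthesis consumed by the recursion has to be closed before the match succeeds, which is why the tag names come out balanced. A line such as `(define ((f2) #t)` yields no tag in this sketch.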

Our parser uses recursion, so it causes stack overflow when parsing deeply nested capture groups. E.g.:

    x2("(" * 32767 + "a" + ")" * 32767, "a", 0, 1)

Set a limit for this. The default value is defined in regint.h:

* DEFAULT_PARSE_DEPTH_LIMIT (Currently 4096)

Also add two APIs to support this:

* onig_get_parse_depth_limit
* onig_set_parse_depth_limit
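The idea behind the parse depth limit can be sketched in a few lines of Python. This is only an illustration of the technique, not Onigmo's parser: `check_pattern`, `ParseDepthError`, and the demo limit of 64 are hypothetical names and values chosen for the example (the commit message cites 4096 as Onigmo's default; a smaller value is used here because the Python interpreter has its own recursion limit).

```python
DEMO_PARSE_DEPTH_LIMIT = 64  # hypothetical demo value, not Onigmo's default


class ParseDepthError(Exception):
    """Raised when groups nest deeper than the configured limit."""


def _parse(pattern, pos, depth, limit):
    """Consume pattern from pos up to an unmatched ')' or the end.

    Returns the index where parsing stopped.
    """
    if depth > limit:
        # Fail with an ordinary error instead of overflowing the stack.
        raise ParseDepthError(f"groups nested deeper than {limit}")
    while pos < len(pattern):
        ch = pattern[pos]
        if ch == "(":
            pos = _parse(pattern, pos + 1, depth + 1, limit)
            if pos >= len(pattern):
                raise ValueError("missing ')'")
            pos += 1  # skip the ')' that ended the nested group
        elif ch == ")":
            return pos
        else:
            pos += 1
    return pos


def check_pattern(pattern, limit=DEMO_PARSE_DEPTH_LIMIT):
    """Return True if the groups in pattern are balanced and within limit."""
    end = _parse(pattern, 0, 0, limit)
    if end != len(pattern):
        raise ValueError("unmatched ')'")
    return True


print(check_pattern("(" * 10 + "a" + ")" * 10))
try:
    check_pattern("(" * 100 + "a" + ")" * 100)
except ParseDepthError as err:
    print("rejected:", err)
```

The depth counter travels with each recursive call, so the check costs one comparison per group and the parser fails cleanly before the nesting can exhaust the stack.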

fix uppercasing for U+A64B, CYRILLIC SMALL LETTER MONOGRAPH UK

* enc/unicode.c: Add U+A64B to the special cases 03B9 and 03BC at the end of onigenc_unicode_case_map (Bug #12990).
* enc/unicode/case-folding.rb: Add U+A64B to the special cases 03B9 and 03BC. Add a comment pointing to enc/unicode.c. Change warnings to exceptions for unpredicted cases, because this would have been more easily noticed (the warning was not noticed when upgrading to Unicode 9.0.0).

Regexp supports Unicode 9.0.0's \X

* meta character \X matches Unicode 9.0.0 characters with some workarounds for UTR #51 Unicode Emoji, Version 4.0 emoji zwj sequences. [Feature #12831] [ruby-core:77586]

The term "character" can have many meanings: bytes, codepoints, combined characters, and so on. "Grapheme cluster" is the highest-level of these terms and means a user-perceived character. Unicode Standard Annex #29, UNICODE TEXT SEGMENTATION, specifies how to handle grapheme clusters (extended grapheme clusters). But some specs have not been updated to the current situation, because Unicode Emoji is being extended rapidly without a clear definition. This breaks the precondition of UTR#29, "Grapheme cluster boundaries can be easily tested by looking at immediately adjacent characters" (the sentence will be removed in the next version). Some of the details are described in Unicode Technical Report #51, UNICODE EMOJI, but they have not been merged into UTR#29 yet.

http://unicode.org/reports/tr29/
http://unicode.org/reports/tr51/
http://unicode.org/Public/emoji/4.0/

masatake · 2020-03-12T18:43:18Z

It seems that using git-subtree is a bad idea.

Close universal-ctags#1861.

In most cases, I hope there is no impact on existing optlib code using the --regex-... option.

If the `r` regex flag is given, you can use the extended features of Onigmo with Ruby regex syntax.

A demonstration of one of its miracle features:

    ;; input.mylang
    (define (f1) 1)
    (define ((f2)) #t)
    (define (((f3))) "abc")

    --langdef=mylang
    --map-mylang=.mylang
    --kinddef-mylang=f,fun,function, function returning a function, or function returning a function returning function...
    --_fielddef-mylang=symbol,symbol binding to the function
    --fields-mylang={symbol}
    --regex-mylang=/\(define +(([-a-z0-9]+)|\(\g<1>\))/\1/f/r{_field=symbol:\2}

See the r flag passed to --regex-mylang.

    (((f3))) input.mylang /^(define (((f3))) "abc")$/;" f symbol:f3
    ((f2)) input.mylang /^(define ((f2)) #t)$/;" f symbol:f2
    (f1) input.mylang /^(define (f1) 1)$/;" f symbol:f1

Look at the names of the tags. The pairs of `(` and `)` are well balanced.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>

coveralls · 2020-03-12T19:49:56Z

Coverage decreased (-0.01%) to 86.789% when pulling 9b73623 on masatake:Onigmo into 30cd8e0 on universal-ctags:master.

k-takata · 2020-03-13T01:42:14Z

> It seems that using git-subtree is a bad idea.

How about using the --squash option? See misc/pull-packcc.sh.

Maybe it would be better to also add misc/pull-onigmo.sh and misc/pull-libreadtags.sh?
I don't remember the git-subtree command line.

masatake · 2020-03-24T00:05:52Z

@k-takata, thank you for the suggestion.
I will add misc/pull-libreadtags.sh first.
I'm thinking about introducing a common script like:

misc/pull-subtree.sh [packcc|libreadtags|onigmo]

if possible.
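A rough sketch of what such a common helper might compute, written in Python rather than shell so it can be checked in isolation. Everything here is an assumption for illustration: the function name `subtree_pull_command`, the `master` default ref, the prefixes other than `Onigmo` (which is the git-subtree-dir recorded in this pull request), and the placeholder repository URLs. It only builds the `git subtree pull --squash` argument list and does not run git.

```python
# Hypothetical table: name -> (subtree prefix, upstream repository).
# Only the "Onigmo" prefix comes from this pull request; the rest are placeholders.
SUBTREES = {
    "packcc": ("misc/packcc", "<packcc-repository-url>"),
    "libreadtags": ("libreadtags", "<libreadtags-repository-url>"),
    "onigmo": ("Onigmo", "<onigmo-repository-url>"),
}


def subtree_pull_command(name, ref="master"):
    """Build the argument list for a squashed `git subtree pull` of one subtree."""
    if name not in SUBTREES:
        raise ValueError(f"unknown subtree: {name}")
    prefix, repository = SUBTREES[name]
    return ["git", "subtree", "pull", f"--prefix={prefix}", "--squash", repository, ref]


print(" ".join(subtree_pull_command("onigmo")))
```

With `--squash`, the upstream history is folded into a single commit, so the long list of imported upstream commits would not appear in the pull request.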

masatake · 2021-05-23T10:06:20Z

Instead of replacing gnu_regex, I'm thinking about adding pcre2 to ctags. See #3036.

k-takata added 30 commits November 17, 2016 22:42

FIXME: Update comment for conditional expressions (Issue universal-ctags#73)

f4984ec

win32: Enable multiprocess build on VC2010+

2f422ba

I don't know whether this can also be used on VC2008.

win32: Don't use LTCG on debug build

059e7a5

testpy: Add tests for onig_search_gpos()

2653971

testpy: Fix for UTF-16/32

c6f30b7

Improve debug log

728e0ef

* Show stack depth.
* Show header

Improve debug log

874be86

* Show the type of stack top.
* Decrease stack number by one. The `stk` points to the address which the next data should be written to.

Fix multiple name groups in Perl syntax (Fix universal-ctags#74)

86fbe2e

Named backrefs (\k<name>, \g{name}) refer only to the leftmost group with that name in Perl.

Refine parsing conditional expression

3c63d73

Improve debug log

1678485

* Show stack type in string.
* Fix bug when *op == OP_FINISH

Import the latest version of st.c from Ruby (Close universal-ctags#70)

40ecb9e

Import Ruby r56793 with some modifications.

st: Check allocation errors

33566cd

Remove codes for MSVC < 14

4f26e03

Update HISTORY

e35178d

st: Remove duplication

695e2ae

Fix type mismatch

31f23d3

Avoid negative array index read

18c317d

Found by Coverity Scan. This is not needed if RUBY is defined, because rb_raise() is called and the function does not return.

Update documents

f2ee02f

Update copyright information

f4b06cc

Fix wrong capture inside subexp calls (Issue universal-ctags#48)

dd4638c

Capturing inside subexp calls should use a stack. This fixes only the first issue. The second one has not been fixed yet.

Update HISTORY

a3699d7

Update HISTORY

dd1fdb6

Merge branch 'devel-6.0'

09ab902

Import Ruby r56951

2d36a89

constify CaseMappingSpecials

Import Ruby r56952

29e4389

Use offsetof macro and shrink table size

Update tool

ea050c9

Fix \X on UTF-16/32 (Issue universal-ctags#46)

bbb8b4c

k-takata and others added 17 commits July 29, 2019 20:43

st: Apply Onigmo specific changes again

f86dc0a

This reverts commit 5502cf8.

Merge pull request universal-ctags#134 from k-takata/fix-segv-in-error-str

9827d5a

Fix SEGV in onig_error_code_to_str() (Fix universal-ctags#132)

Merge pull request universal-ctags#135 from k-takata/fix-stack-overflow

bf856d8

Fix stack overflow with X+++++++++++++++++++…

Merge pull request universal-ctags#136 from k-takata/fix-doc

2a7441d

doc: Adjust wording (Close universal-ctags#109)

Add a test for the previous commit

5a44e02

Merge pull request universal-ctags#137 from k-takata/import-ruby

44339f8

Import the latest code from Ruby

Suppress warning on 64-bit builds

ced209d

st: Adjust style

40cc34c

Use spaces for indentation. Also add `#ifndef RUBY` where needed. If `RUBY` is defined, `malloc` is replaced with `ruby_xmalloc` which fails with a `NoMemoryError` instead of returning `NULL`.

st: Adjust coding style

e5ba624

Fix out-of-bounds read in parse_char_class() (Close universal-ctags#139)

d4cf99d

/[\x{111111}]/ causes out-of-bounds read when encoding is a single byte encoding. \x{111111} is an invalid codepoint for a single byte encoding. Check if it is a valid codepoint.

testpy: Add some tests for invalid codepoints

66dbbb4

TODO: Currently \x{1000000} behaves differently depending on encodings. Some encodings don't return an error for it. Should we make them consistent?

Merge pull request universal-ctags#140 from k-takata/fix-139

3ffa33b

Fix out-of-bounds read in parse_char_class()

Merge pull request universal-ctags#138 from k-takata/import-st

99db460

st: Import the latest code from Ruby

Disable error message for capture history when not needed

8217be2

Add `#ifdef USE_CAPTURE_HISTORY`.

testpy: Add tests for fetch_name(_with_level)

97a73c7

testpy: Add some tests

0830382

Add 'Onigmo/' from commit '0830382895a303a478de988b7ba7af0f0e86bb6c'

f2444e4

git-subtree-dir: Onigmo
git-subtree-mainline: 0dbadb6
git-subtree-split: 0830382

masatake force-pushed the Onigmo branch from 66de727 to 9a6ff90 Compare March 12, 2020 18:42

masatake force-pushed the Onigmo branch 2 times, most recently from 97d07b7 to 8920563 Compare March 12, 2020 19:11

masatake force-pushed the Onigmo branch from 8920563 to 9b73623 Compare March 12, 2020 19:17

masatake mentioned this pull request Apr 8, 2020

RFC: readtags enhacement especially about -Q and -S option #2475

Open

21 tasks

masatake closed this May 23, 2021

masatake mentioned this pull request Jun 9, 2021

write the specification of regex implemented #3034

Closed
