[RFC] lregex: replace gnu_regex with Onigmo regex engine #2469
I don't know whether this can also be used on VC2008.
* Show stack depth. * Show header
* Show the type of stack top. * Decrease stack number by one. `stk` points to the address where the next data should be written.
Named backrefs (\k<name>, \g{name}) refer only to the leftmost group with the name in Perl.
* Show stack type in string. * Fix bug when *op == OP_FINISH
Our parser uses recursion, so parsing deeply nested capture groups causes a stack overflow. E.g.:
    x2("(" * 32767 + "a" + ")" * 32767, "a", 0, 1)
Set a limit for this. The default value is defined in regint.h:
* DEFAULT_PARSE_DEPTH_LIMIT (currently 4096)
Also add two APIs to support this:
* onig_get_parse_depth_limit
* onig_set_parse_depth_limit
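A minimal sketch of the idea in this commit, written in Python rather than Onigmo's C: a recursive-descent walk over nested groups that stops with an explicit error once a configurable depth is exceeded. `ParseDepthError`, `max_group_depth`, and the limit value of 64 are illustrative assumptions. Only the getter/setter pair loosely mirrors the two API names listed above, and this is not their actual implementation.

```python
class ParseDepthError(Exception):
    """Raised when group nesting exceeds the configured limit (hypothetical name)."""


# Illustrative value only; the commit message gives 4096 as Onigmo's default.
_parse_depth_limit = 64


def get_parse_depth_limit():
    return _parse_depth_limit


def set_parse_depth_limit(limit):
    global _parse_depth_limit
    _parse_depth_limit = limit


def max_group_depth(pattern):
    """Return the deepest '(' ... ')' nesting in pattern using recursion."""

    def parse_seq(pos, depth):
        # Check the limit on entry, before recursing any further.
        if depth > _parse_depth_limit:
            raise ParseDepthError("too deeply nested: depth %d" % depth)
        deepest = depth
        while pos < len(pattern):
            ch = pattern[pos]
            if ch == "(":
                pos, inner = parse_seq(pos + 1, depth + 1)
                if pos >= len(pattern) or pattern[pos] != ")":
                    raise ValueError("unbalanced '('")
                pos += 1  # consume the closing ')'
                deepest = max(deepest, inner)
            elif ch == ")":
                break  # the caller consumes it
            else:
                pos += 1
        return pos, deepest

    pos, deepest = parse_seq(0, 0)
    if pos != len(pattern):
        raise ValueError("unmatched ')'")
    return deepest
```

With the limit at 64, a pattern nested 100 levels deep fails with `ParseDepthError` instead of exhausting the call stack. Raising the limit with `set_parse_depth_limit(200)` lets the same pattern parse.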
Import Ruby r56793 with some modifications.
Found by Coverity Scan. This is not needed if RUBY is defined, because rb_raise() is called and the function does not return.
Capturing inside subexp calls should use a stack. This fixes only the first issue. The second one has not been fixed yet.
fix uppercasing for U+A64B, CYRILLIC SMALL LETTER MONOGRAPH UK * enc/unicode.c: Add U+A64B to the special cases 03B9 and 03BC at the end of onigenc_unicode_case_map (Bug #12990). * enc/unicode/case-folding.rb: Add U+A64B to the special cases 03B9 and 03BC. Add a comment pointing to enc/unicode.c. Change warnings to exceptions for unpredicted cases, because this would have been more easily noticed (the warning was not noticed when upgrading to Unicode 9.0.0).
Regexp supports Unicode 9.0.0's \X
* meta character \X matches Unicode 9.0.0 characters with some workarounds for UTR #51 Unicode Emoji, Version 4.0 emoji zwj sequences. [Feature #12831] [ruby-core:77586]
The term "character" can have many meanings: bytes, codepoints, combined characters, and so on. "Grapheme cluster" is the highest-level of these terms and means a user-perceived character. Unicode Standard Annex #29 UNICODE TEXT SEGMENTATION specifies how to handle grapheme clusters (extended grapheme clusters). But some specs have not been updated to the current situation, because Unicode Emoji is being extended rapidly without a clear definition. This breaks the precondition of UTR#29, "Grapheme cluster boundaries can be easily tested by looking at immediately adjacent characters" (the sentence will be removed in the next version). Some of the details are described in Unicode Technical Report #51 UNICODE EMOJI, but it has not been merged into UTR#29 yet.
http://unicode.org/reports/tr29/
http://unicode.org/reports/tr51/
http://unicode.org/Public/emoji/4.0/
constify CaseMappingSpecials
Use offsetof macro and shrink table size
This reverts commit 5502cf8.
…r-str Fix SEGV in onig_error_code_to_str() (Fix universal-ctags#132)
Fix stack overflow with X+++++++++++++++++++…
doc: Adjust wording (Close universal-ctags#109)
Import the latest code from Ruby
Use spaces for indentation. Also add `#ifndef RUBY` where needed. If `RUBY` is defined, `malloc` is replaced with `ruby_xmalloc` which fails with a `NoMemoryError` instead of returning `NULL`.
/[\x{111111}]/ causes out-of-bounds read when encoding is a single byte encoding. \x{111111} is an invalid codepoint for a single byte encoding. Check if it is a valid codepoint.
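The check described here can be sketched as follows, in Python rather than Onigmo's C. The sketch assumes a character class for a single-byte encoding is stored as a 256-bit set and rejects any code point above 0xFF before touching it. `new_bitset`, `add_code`, and `has_code` are hypothetical names, not Onigmo functions.

```python
SINGLE_BYTE_MAX = 0xFF  # largest code point a single-byte encoding can represent


def new_bitset():
    # 256 bits, one per possible byte value.
    return bytearray(32)


def add_code(bitset, code):
    """Add a code point to a single-byte character class after validating it."""
    if not 0 <= code <= SINGLE_BYTE_MAX:
        # In C, indexing the bit set with such a value would touch memory
        # outside the table; rejecting it up front avoids that.
        raise ValueError("invalid code point for single-byte encoding: 0x%X" % code)
    bitset[code >> 3] |= 1 << (code & 7)


def has_code(bitset, code):
    if not 0 <= code <= SINGLE_BYTE_MAX:
        return False
    return bool(bitset[code >> 3] & (1 << (code & 7)))
```

Under these assumptions, `add_code(bs, 0x111111)` raises `ValueError` instead of indexing past the end of the table.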
TODO: Currently \x{1000000} behaves differently depending on encodings. Some encodings don't return an error for it. Should we make them consistent?
Fix out-of-bounds read in parse_char_class()
st: Import the latest code from Ruby
Add `#ifdef USE_CAPTURE_HISTORY`.
It seems that using git-subtree is a bad idea.
97d07b7 to 8920563
Close universal-ctags#1861.

In most cases, I hope there is no impact on existing optlib code using the --regex-... option.

If the `r` regex flag is given, you can use the extended features of Onigmo with Ruby regex syntax.

A demonstration of one of the miracle features:

    ;; input.mylang
    (define (f1) 1)
    (define ((f2)) #t)
    (define (((f3))) "abc")

    --langdef=mylang
    --map-mylang=.mylang
    --kinddef-mylang=f,fun,function, function returing a function, or function returing a function returing function...
    --_fielddef-mylang=symbol,symbol binding to the function
    --fields-mylang={symbol}
    --regex-mylang=/\(define +(([-a-z0-9]+)|\(\g<1>\))/\1/f/r{_field=symbol:\2}

See the r flag passed to --regex-mylang.

    (((f3))) input.mylang /^(define (((f3))) "abc")$/;" f symbol:f3
    ((f2)) input.mylang /^(define ((f2)) #t)$/;" f symbol:f2
    (f1) input.mylang /^(define (f1) 1)$/;" f symbol:f1

Look at the names of the tags. The pairs of `(' and `)' are well balanced.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
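To illustrate what the subexpression call `\g<1>` does in the pattern above, here is a rough Python sketch. Python's standard `re` module has no subexpression calls, so the recursion is written out by hand. `match_group1` and `extract_tag` are hypothetical helpers, not part of ctags or Onigmo. The assumption that the symbol field receives the innermost name is taken from the tags output shown above.

```python
import re

_HEAD = re.compile(r"\(define +")   # the literal prefix of the pattern
_NAME = re.compile(r"[-a-z0-9]+")   # the first alternative, group 2


def match_group1(text, pos):
    """Emulate group 1: NAME | "(" group1 ")".

    Returns (end_position, innermost_name), or None if nothing matches at pos.
    """
    m = _NAME.match(text, pos)
    if m:
        return m.end(), m.group(0)
    if text.startswith("(", pos):
        inner = match_group1(text, pos + 1)  # plays the role of \g<1>
        if inner is not None:
            end, symbol = inner
            if text.startswith(")", end):
                return end + 1, symbol
    return None


def extract_tag(line):
    """Return (tag_name, symbol) for a matching line, else None."""
    for head in _HEAD.finditer(line):
        result = match_group1(line, head.end())
        if result is not None:
            end, symbol = result
            return line[head.end():end], symbol
    return None
```

For the third input line this returns the pair `("(((f3)))", "f3")`, which corresponds to the tag name and the symbol field in the output above. A line with unbalanced parentheses such as `(define ((f2) #t)` yields `None`.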
How about using Maybe it's better to add also
@k-takata, thank you for the suggestion.
if possible.
Instead of replacing gnu_regex, I'm thinking about adding pcre2 to ctags. See #3036.
In most cases, I hope there is no impact on existing optlib code using the --regex-... option.
If the `r` regex flag is given, you can use the extended features of Onigmo with Ruby regex syntax.
A demonstration of one of the miracle features:
input.mylang
mylang.ctags
See the r flag passed to --regex-mylang.
Look at the names of the tags. The pairs of `(' and `)' are well balanced.