Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing the ctags regex engine #1861

Closed
hadrielk opened this issue Aug 28, 2018 · 17 comments
Closed

Changing the ctags regex engine #1861

hadrielk opened this issue Aug 28, 2018 · 17 comments

Comments

@hadrielk
Copy link
Contributor

Has anyone thought about changing ctags to using a more modern regex engine?

The current glibc-2.10.1 engine is... suboptimal. It's incredibly slow. And using the POSIX API misses pretty much every addition to regex engines that's happened in the last 20 years or so: lazy captures, non-capturing groups, atomic groups, possessive quantifiers, negative and positive look-ahead/behind, etc.

I've recently written a Yang file parser, using optlib multi-table regex. When I first wrote it, it took ctags ~40 seconds to process the 61 yang files I have in my employer's codebase. After a lot of effort to optimize the parser, I was able to get it down to ~16 seconds. That's just for 61 files! When I ran it against ~900 yang files for Cisco's XR router, it takes approximately 5 minutes.

So I went and compiled ctags to use PCRE2 instead. My 61 yang files were processed in 0.2 seconds! Cisco's 900 yang files took 2.4 seconds! (and I verified the generated tags were the same as using the current regex engine)

That's without taking advantage of PCRE's expanded regex abilities - this is just using the POSIX api only, without changing the yang.ctags optlib file.

Of course in regular usage ctags won't be improved nearly as much, since most of the built-in parsers don't use regex. But some do, such as the CMake parser. In my employer's code base, it usually takes ~12 seconds to process our ~10k files (with C/C++, CMake, etc.); but with PCRE this dropped to 5 seconds.

When I run ctags with my new yang.ctags optlib file as well, against my ~10k mixed files (including the 61 yang files), it takes ~29 seconds. With PCRE that drops back down to ~5 seconds.

p.s. I searched through the closed issues list and didn't see this topic. Apologies if it's already been discussed.

@masatake
Copy link
Member

I wrote a bit about this topic in #519.

[yamato@master]~/var/ctags-peg% ./ctags --list-regex-flags
#LETTER NAME               DESCRIPTION
b       basic              interpreted as a Posix basic regular expression.
e       extend             interpreted as a Posix extended regular expression (default)
i       icase              applied in a case-insensitive manner
x       exclusive          skip testing the other patterns if a line is matched to this pattern
-       placeholder        don't put this tag to tags file.
-       scope=ACTION       use scope stack: ACTION = ref|push|pop|clear|set
-       _extra=EXTRA       record the tag only when the extra is enabled
-       _field=FIELD:VALUE record the matched string(VALUE) to parser own FIELD of the tag
-       _role=ROLE         Set given ROLE to roles field

I don't want to remove the current engine borrowed from glibc.
Instead I would like you to ADD the new engine.

I reserved p for the purpose. I also reserved o for https://github.com/k-takata/Onigmo.
Let's se p for PCRE.

I would like to write some concerts(issues) in this area.

  1. updating the current engine to the latest code from glibc( or gnulib?).
    This item has been MY todo list for a long time.

  2. updating our build-scripts to link pcre.
    In this step, you will modify configure.ac to find libpcre. ./configure will find it.
    That means pcre is optional. We cannot expect it is always available.
    So we cannot use the p flag in .ctags converted to .c with optlib2c.
    (You can use p in your private .ctags.)
    Don't forget updating the output of --list-features. pcre should be listed there.
    See Units/TEST/features optional item in http://docs.ctags.io/en/latest/units.html?highlight=features.
    Updating the build scripts for windows will become an issue. [HEADS-UP,WIP] peg based new java parser #1847 has the same trouble.

  3. integrating the source code
    As I did for gnu_regex and fnmatch directories, I think we can integrate pcre source code
    into our source tree with git submodule. I did it in [HEADS-UP,WIP] peg based new java parser #1847.
    This will takes our time. However, as the result, we can expect pcre is always available. So
    we can use p flag in ./optlib/*.ctags files.

  4. updating optlib2c to support p flag and pcre expressions
    So we can integrate your Yang parser:-)

@hadrielk
Copy link
Contributor Author

Instead I would like you to ADD the new engine.

OK, but we'd have to either rename the current regex functions of glibc's code, or #define them to some other tokens before compiling and including them - since PCRE's posix API uses the same regcomp(), regexec(), etc. Or do you mean to use PCRE's native (non-POSIX) API?

I reserved p for the purpose.

I thought you didn't like short letters. 😉
So what would p mean exactly? Would it just mean "use PCRE for the engine for this pattern, but the pattern itself is still POSIX ERE syntax/meaning", or would it mean "use PCRE for the engine and pattern (i.e., Perl-style regex not POSIX)". The former means it would still work using the glibc regex engine if the PCRE wasn't available, just a lot slower1; the latter would mean it shouldn't use the glibc engine no matter what, and instead error out or be ignored.

  1. integrating the source code

If we're going to integrate the source code anyway (e.g., as a git submodule), then why bother with ./configure --with-pcre and making it an optional feature? Why not just go to a submodule to begin with?

1: Well... there is one difference using PCRE's POSIX API that makes it not exactly like glibc's POSIX: in a regex pattern using the REG_NEWLINE flag, a negative bracket expression like [^;] matches a newline character in PCRE, but not in glibc. This only really affects ctags for --mline-regex patterns, not --regex or --_mtable-regex.

@b4n
Copy link
Member

b4n commented Aug 29, 2018

Just a few thoughts:

  • I personally prefer not including third party source code in the source tree and rather link when possible. I however understand that if a dependency is not common enough to be reasonably expected to be everywhere, it makes sense to fall back on a submodule.
  • In Geany, I'd like to avoid having several regex engines, and we're using GLib that has GRegex which provides a PCRE engine (AFAIK, it's a linked or bundled libpcre); so we currently patch ctags to use this instead of PCRE (IIRC there was also a performance question a while back that comforted this or lead to it).
    IIUC, this way it is not possible to support the b flag; but that might be a limitation of the GLib wrapper API. However this hasn't been a problem because none of the parsers we have in Geany make use of the b flag, and we didn't see any problem with the ones we have -- although we don't have so many regex parsers there.
    Anyway, one of my points is that if GRegex is usable as an alternative to libpcre, it'd be nice to be able to use it as an alternative dependency in the event libpcre isn't available.
  • Is there a good reason to keep using glibc's engine; e.g. does it provide something the PCRE engine doesn't? (like maybe the b flag).
  • Lastly, I'm all for PCRE :)

@hadrielk
Copy link
Contributor Author

Anyway, one of my points is that if GRegex is usable as an alternative to libpcre, it'd be nice to be able to use it as an alternative dependency in the event libpcre isn't available.

A few years ago I was one of the core developers for Wireshark, and Wireshark uses GRegex, but it had a subtle issue that we hit regularly: #837. Although I'm not sure that will matter for ctags?

Is there a good reason to keep using glibc's engine; e.g. does it provide something the PCRE engine doesn't? (like maybe the b flag).

There is one place in ctags that actually uses the b flag (i.e., POSIX BRE syntax): in read.c the readLineFromBypassSlow() function compiles the tag's pattern as a regex BRE.

I'm guessing this is because ctag-file tag patterns are supposed to actually be regex patterns? If so, compiling them as BRE makes sense because it will successfully compile more frequently - because nothing in ctags escapes regex special characters like "()" or "{}" when it writes tag patterns, afaik... so they appear inside tag patterns all the time. But they would never regex-compile if ERE mode was used instead of BRE.

Of course it's still possible to not compile as BRE either, if for example the original parsed file's line had backslashes before special characters like "\{" or "\(". But that's less common obviously.

Actually now that I think about it, the "+" , "*" and "?" are special in BRE, and are not escaped by ctags when creating the tag pattern. This won't cause a regex-compile failure (unless they're the first character), but the compiled regex won't match what it should have. This must cause readLineFromBypassSlow() to fail to find stuff all the time. Does readLineFromBypassSlow() actually get used??

How does geany/vim/etc. find tag lines using the tag file pattern? Do they compile it as regex, or treat it as a raw string and do something like strstr()? The Sublime Text editor essentially treats it as a string instead of a regex.

@masatake
Copy link
Member

I found we can remove readLineFromBypassSlow. I introduced it for xcmd. However, xcmd feature is gone now. See #1863.

@masatake
Copy link
Member

Removing b is o.k. as far as we write it to man/ctags-incompatibilities.7.

@masatake
Copy link
Member

@hadrielk, I misunderstood the interface PCRE provides.
Is Perl-style regex is upper compatible with POSIX?
If yes, we don't have to introduce p flag. If not, p is needed to use the Perl-style regex.

It seems that there are two issues.

  • using PCRE directly or PCRE indirectly via glib
  • mline-regex related incompatibility

@hadrielk
Copy link
Contributor Author

Is Perl-style regex is upper compatible with POSIX?

I'm not sure I understand the question. PCRE (and PCRE2) provide a POSIX-compatible API: i.e., the same C-functions regcomp(), regexec(), regfree() with many of the same defined flags such as REG_ICASE. When you use the POSIX API, you can use a regex pattern that follows the Perl syntax... well, really it's a PCRE syntax because it has a few differences from Perl. What you cannot do, however, is use any of the PCRE compile/match flags other than the POSIX ones. There are some exceptions to this, because some flags can be set within the regex pattern itself, but generally if you want to use PCRE or PCRE2's native flags, you have to use their native APIs.

When using the PCRE POSIX API, if you use the REG_NEWLINE flag, it's close but not identical to POSIX's meaning of REG_NEWLINE - which for ctags only affects the multi-line regex, as far as I can tell. When I compiled ctags with PCRE2, I did have to make some small changes for the plain --regex as well as --_mtable-regex, but they were simple and backwards-compatible with glib's regex.

The hard part, really, is that if we make PCRE an optional feature, then an optlib .ctags file of patterns written to work with PCRE (even through the POSIX API) and take advantage of it's richer Perl-style regex patterns, will compile and work great with PCRE. But if the same optlib file is tried in a ctags program without PCRE and only the current one, either won't compile (which is good I guess); or worse it might actually compile fine but just not match what it did with PCRE. For example the regex pattern "foo[^\}]+" would match the input string "foo\bar" in PCRE, but not in glib or POSIX; because PCRE allows escaping the "}" in a bracket expression even though it's not a special character there, whereas POSIX says it's undefined and glibc treats it as a literal "\". Also a pattern of "(foo|foobar)" in PCRE would capture the first 3 letters of "foo" in an input string of "foobar", but in POSIX match the whole string; because PCRE stops on the first match, whereas POSIX requires the longest match.

@hadrielk
Copy link
Contributor Author

using PCRE directly or PCRE indirectly via glib

To get that part, I could write a trivial shim layer/API that gives ctags code a common API to use, and hides the details of which regex engine is being included/used. In other words, make it so that the code in lregex.c uses the shim API and isn't riddled with #ifdef stuff. So for example lregex.c wouldn't store a regex_t* in regexPattern, but instead an opaque pointer that would really be a regex_t* for POSIX, but a GRegex* for GRegex, etc.

That way geany doesn't even have to modify the lregex.c code from uctags for GRegex support. (well... other than how much it already is just a small subset of uctags' lregex.c)

@hadrielk
Copy link
Contributor Author

mline-regex related incompatibility

One really ugly brute-force way to solve this is: for multi-line patterns only, have ctags look into the regex pattern before compiling it, and if it finds a negated bracket expression, to add a \n newline character inside the negated brackets if one's not already there.

@b4n
Copy link
Member

b4n commented Aug 30, 2018

Is Perl-style regex is upper compatible with POSIX?

I'm not sure I understand the question.

What I understand is is PCRE able to interpret POSIX patterns like glibc does, e.g. can libpcre be a drop-in replacement for the currently used library.
But then whether or not we need a p flag, the question would then be are POSIX patterns valid PCRE patterns, and do they work the same?.

using PCRE directly or PCRE indirectly via glib

To get that part, I could write a trivial shim layer/API that gives ctags code a common API to use, and hides the details of which regex engine is being included/used.

That sounds like a good plan. However, looking at the bug report you mentioned earlier (GLib's Unicode raw RE), I see GLib developers are mentioning the current GRegex API is not possible to port to PCRE2, and that PCRE is getting deprecated. So if we wanna be able to use PCRE2 (not sure what it brings) having a shim layer would only make sense if it can be made to work with other backends like PCRE, PCRE2 and GLib.
From my point of view it currently wouldn't change anything if ctags was linking libpcre, as GLib already does so it doesn't bring in anything.
Anyway, this question requires one to know what PCRE2 brings, which I don't currently know.

mline-regex related incompatibility

One really ugly brute-force way to solve this is: for multi-line patterns only, have ctags look into the regex pattern before compiling it, and if it finds a negated bracket expression, to add a \n newline character inside the negated brackets if one's not already there.

The problem with such an approach parsing the pattern is that we need a robust way to do so, without ever getting fooled. It's not so hard, but it's not a mere matter of s/\[\^[^]]*/\0\n/g, we basically have to have a real regex parser -- which sounds kind of reasonable in the context of ctags, though 😄

@hadrielk
Copy link
Contributor Author

hadrielk commented Aug 30, 2018

are POSIX patterns valid PCRE patterns

yes all POSIX patterns are valid PCRE patterns.

and do they work the same?

No, not exactly the same. The "[^;]" matches differently, which we can fix/work-around; the other difference is the longest possible match issue, which cannot be worked around, but is also unlikely to be a problem in practice I think. Normally your regex would have some other character after the alternation as a separator, and PCRE would then be forced to match the longer choice.

For example, you'd rarely have a regex of only "(foo|foobar)", but instead you'd have "(foo|foobar)[ \t\n]" or "(foo|foobar)\b" or some such, so PCRE would capture the whole "foobar" in an input string of "foobar blah". Or at least, the regex should have such a delimiter, even using the current regex engine. And even if it doesn't, the user can fix it so it does, and it will work in all regex engines.

@hadrielk
Copy link
Contributor Author

GLib developers are mentioning the current GRegex API is not possible to port to PCRE2

I saw one of them said that, but I don't know why - it is possible to make GRegex support PCRE2, as far as I can tell. A different developer suggested it could be done.

having a shim layer would only make sense if it can be made to work with other backends like PCRE, PCRE2 and GLib

Yes, that would be the goal of the shim layer.

this question requires one to know what PCRE2 brings, which I don't currently know.

So far PCRE2 has mostly added some unicode improvements, more flags to tweak settings, and improved performance (in theory). Some new, fairly esoteric, regex pattern commands have been added (e.g., \g{+N}, which is a relative forward reference to a capture group) - but nothing really significant so far. Also, the API itself is different, including some new functions that don't exist for PCRE; but they aren't really applicable to ctags.

@amerlyq
Copy link

amerlyq commented Feb 4, 2020

updating the current engine to the latest code from glibc

Correct me, if I understood wrongly, currently ctags still uses bundled gnu_regex from old glibc=2.10 without appropriate unicode support ?
Is it the main reason why my range regex [\u2800-\u28FF] or [⠀-⣿] don't work on anything beside ASCII ?

@masatake
Copy link
Member

masatake commented Feb 4, 2020

currently ctags still uses bundled gnu_regex from old glibc=2.10 without appropriate unicode support ?

Yes.

masatake added a commit to masatake/ctags that referenced this issue Mar 12, 2020
Close universal-ctags#1861.

In the much of cases, I hope there is no impact on existing optlib
code using --regex-... option.

If `r` regex flag is given, you can use extended features of Onigmo
with ruby regex syntax.

A demonstration of one of miracle features:

;; input.mylang

    (define (f1) 1)
    (define ((f2)) #t)
    (define (((f3))) "abc")

    --langdef=mylang
    --map-mylang=.mylang
    --kinddef-mylang=f,fun,function, function returing a function, or function returing a function returing function...
    --_fielddef-mylang=symbol,symbol binding to the function
    --fields-mylang={symbol}
    --regex-mylang=/\(define +(([-a-z0-9]+)|\(\g<1>\))/\1/f/r{_field=symbol:\2}

See the r flag passed to --regex-mylang.

    (((f3)))	input.mylang	/^(define (((f3))) "abc")$/;"	f	symbol:f3
    ((f2))	input.mylang	/^(define ((f2)) #t)$/;"	f	symbol:f2
    (f1)	input.mylang	/^(define (f1) 1)$/;"	f	symbol:f1

Look at the name of tags. The pairs of `(' and `)' are balanced well.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
masatake added a commit to masatake/ctags that referenced this issue Mar 12, 2020
Close universal-ctags#1861.

In the much of cases, I hope there is no impact on existing optlib
code using --regex-... option.

If `r` regex flag is given, you can use extended features of Onigmo
with ruby regex syntax.

A demonstration of one of miracle features:

;; input.mylang

    (define (f1) 1)
    (define ((f2)) #t)
    (define (((f3))) "abc")

    --langdef=mylang
    --map-mylang=.mylang
    --kinddef-mylang=f,fun,function, function returing a function, or function returing a function returing function...
    --_fielddef-mylang=symbol,symbol binding to the function
    --fields-mylang={symbol}
    --regex-mylang=/\(define +(([-a-z0-9]+)|\(\g<1>\))/\1/f/r{_field=symbol:\2}

See the r flag passed to --regex-mylang.

    (((f3)))	input.mylang	/^(define (((f3))) "abc")$/;"	f	symbol:f3
    ((f2))	input.mylang	/^(define ((f2)) #t)$/;"	f	symbol:f2
    (f1)	input.mylang	/^(define (f1) 1)$/;"	f	symbol:f1

Look at the name of tags. The pairs of `(' and `)' are balanced well.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
masatake added a commit to masatake/ctags that referenced this issue Mar 12, 2020
Close universal-ctags#1861.

In the much of cases, I hope there is no impact on existing optlib
code using --regex-... option.

If `r` regex flag is given, you can use extended features of Onigmo
with ruby regex syntax.

A demonstration of one of miracle features:

;; input.mylang

    (define (f1) 1)
    (define ((f2)) #t)
    (define (((f3))) "abc")

    --langdef=mylang
    --map-mylang=.mylang
    --kinddef-mylang=f,fun,function, function returing a function, or function returing a function returing function...
    --_fielddef-mylang=symbol,symbol binding to the function
    --fields-mylang={symbol}
    --regex-mylang=/\(define +(([-a-z0-9]+)|\(\g<1>\))/\1/f/r{_field=symbol:\2}

See the r flag passed to --regex-mylang.

    (((f3)))	input.mylang	/^(define (((f3))) "abc")$/;"	f	symbol:f3
    ((f2))	input.mylang	/^(define ((f2)) #t)$/;"	f	symbol:f2
    (f1)	input.mylang	/^(define (f1) 1)$/;"	f	symbol:f1

Look at the name of tags. The pairs of `(' and `)' are balanced well.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
masatake added a commit to masatake/ctags that referenced this issue Mar 12, 2020
Close universal-ctags#1861.

In the much of cases, I hope there is no impact on existing optlib
code using --regex-... option.

If `r` regex flag is given, you can use extended features of Onigmo
with ruby regex syntax.

A demonstration of one of miracle features:

;; input.mylang

    (define (f1) 1)
    (define ((f2)) #t)
    (define (((f3))) "abc")

    --langdef=mylang
    --map-mylang=.mylang
    --kinddef-mylang=f,fun,function, function returing a function, or function returing a function returing function...
    --_fielddef-mylang=symbol,symbol binding to the function
    --fields-mylang={symbol}
    --regex-mylang=/\(define +(([-a-z0-9]+)|\(\g<1>\))/\1/f/r{_field=symbol:\2}

See the r flag passed to --regex-mylang.

    (((f3)))	input.mylang	/^(define (((f3))) "abc")$/;"	f	symbol:f3
    ((f2))	input.mylang	/^(define ((f2)) #t)$/;"	f	symbol:f2
    (f1)	input.mylang	/^(define (f1) 1)$/;"	f	symbol:f1

Look at the name of tags. The pairs of `(' and `)' are balanced well.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
masatake added a commit to masatake/ctags that referenced this issue Mar 12, 2020
Close universal-ctags#1861.

In the much of cases, I hope there is no impact on existing optlib
code using --regex-... option.

If `r` regex flag is given, you can use extended features of Onigmo
with ruby regex syntax.

A demonstration of one of miracle features:

;; input.mylang

    (define (f1) 1)
    (define ((f2)) #t)
    (define (((f3))) "abc")

    --langdef=mylang
    --map-mylang=.mylang
    --kinddef-mylang=f,fun,function, function returing a function, or function returing a function returing function...
    --_fielddef-mylang=symbol,symbol binding to the function
    --fields-mylang={symbol}
    --regex-mylang=/\(define +(([-a-z0-9]+)|\(\g<1>\))/\1/f/r{_field=symbol:\2}

See the r flag passed to --regex-mylang.

    (((f3)))	input.mylang	/^(define (((f3))) "abc")$/;"	f	symbol:f3
    ((f2))	input.mylang	/^(define ((f2)) #t)$/;"	f	symbol:f2
    (f1)	input.mylang	/^(define (f1) 1)$/;"	f	symbol:f1

Look at the name of tags. The pairs of `(' and `)' are balanced well.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
@masatake
Copy link
Member

See #3036.

@masatake
Copy link
Member

Now we can use pcre2 as an optional regex parser.
To avoid thinking about the detail of compatibility, I introduced pcre2 as an optional one. It can be used only when it is linked to ctags and {pcre2} or p flag is given.

I will close this.
Comments on this issue have been very informative and valuable. Thank you who joining the discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants