Changing the ctags regex engine #1861
Comments
I wrote a bit about this topic in #519.
I don't want to remove the current engine borrowed from glibc. I reserved I would like to write up some concerns (issues) in this area.
OK, but we'd have to either rename the current regex functions of glibc's code, or
I thought you didn't like short letters. 😉
If we're going to integrate the source code anyway (e.g., as a git submodule), then why bother with 1: Well... there is one difference using PCRE's POSIX API that makes it not exactly like glibc's POSIX: in a regex pattern using the
Just a few thoughts:
A few years ago I was one of the core developers for Wireshark, and Wireshark uses GRegex, but it had a subtle issue that we hit regularly: #837. Although I'm not sure that will matter for ctags?
There is one place in I'm guessing this is because ctag-file tag patterns are supposed to actually be regex patterns? If so, compiling them as BRE makes sense because it will successfully compile more frequently - because nothing in ctags escapes regex special characters like " Of course it's still possible to not compile as BRE either, if for example the original parsed file's line had backslashes before special characters like " Actually now that I think about it, the " How does geany/vim/etc. find tag lines using the tag file pattern? Do they compile it as regex, or treat it as a raw string and do something like
I found we can remove
Removing
@hadrielk, I misunderstood the interface PCRE provides. It seems that there are two issues.
I'm not sure I understand the question. PCRE (and PCRE2) provide a POSIX-compatible API: i.e., the same C-functions When using the PCRE POSIX API, if you use the The hard part, really, is that if we make PCRE an optional feature, then an optlib
To get that part, I could write a trivial shim layer/API that gives ctags code a common API to use, and hides the details of which regex engine is being included/used. In other words, make it so that the code in That way geany doesn't even have to modify the
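As a rough illustration of the shim-layer idea, here is a minimal sketch. It is written in Python rather than C, and Python's `re` module stands in for whichever engine is compiled in. Every name in it (`register_backend`, `compile_pattern`, the `"default"` engine label) is hypothetical and not part of ctags.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# A compiled pattern is modelled as a function: text -> captured groups, or None.
Matcher = Callable[[str], Optional[Tuple[Optional[str], ...]]]
# A backend turns (pattern, ignore_case) into a Matcher.
Backend = Callable[[str, bool], Matcher]

_BACKENDS: Dict[str, Backend] = {}


def register_backend(name: str, backend: Backend) -> None:
    """Make an engine available under a name (hypothetical registry)."""
    _BACKENDS[name] = backend


def _stdlib_backend(pattern: str, ignore_case: bool) -> Matcher:
    compiled = re.compile(pattern, re.IGNORECASE if ignore_case else 0)

    def match(text: str) -> Optional[Tuple[Optional[str], ...]]:
        m = compiled.search(text)
        # Whole match first, then the numbered groups.
        return None if m is None else (m.group(0),) + m.groups()

    return match


register_backend("default", _stdlib_backend)


def compile_pattern(pattern: str, ignore_case: bool = False,
                    engine: str = "default") -> Matcher:
    """The only entry point callers use; they never touch a regex library directly."""
    if engine not in _BACKENDS:
        # Fail loudly instead of silently falling back: a pattern written for
        # one engine is not guaranteed to mean the same thing in another.
        raise LookupError(f"regex engine {engine!r} is not built in")
    return _BACKENDS[engine](pattern, ignore_case)
```

Raising an error for a missing engine, rather than falling back, is a design choice made for this sketch: it keeps an optlib file that depends on an optional engine from being silently misinterpreted.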
One really ugly brute-force way to solve this is: for multi-line patterns only, have ctags look into the regex pattern before compiling it, and if it finds a negated bracket expression, to add a
What I want to understand is: is PCRE able to interpret POSIX patterns like glibc does, i.e., can libpcre be a drop-in replacement for the currently used library?
That sounds like a good plan. However, looking at the bug report you mentioned earlier (GLib's Unicode raw RE), I see GLib developers mention that the current GRegex API cannot be ported to PCRE2, and that PCRE is being deprecated. So if we want to be able to use PCRE2 (not sure what it brings), a shim layer would only make sense if it can be made to work with other backends like PCRE, PCRE2 and GLib.
The problem with such an approach parsing the pattern is that we need a robust way to do so, without ever getting fooled. It's not so hard, but it's not a mere matter of
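To make the "without ever getting fooled" point concrete, here is a minimal sketch of a scanner that locates negated bracket expressions in a pattern. It assumes POSIX bracket rules: a leading `]` is a literal member, `[:class:]`, `[.x.]` and `[=x=]` may appear inside a bracket, and a backslash escapes only outside a bracket. The function names are hypothetical, and the sketch only finds the spans; what ctags would do with them is left open.

```python
from typing import List, Tuple


def _bracket_end(pattern: str, start: int) -> int:
    """Return the index of the ']' closing the bracket opened at start, or -1."""
    i, n = start + 1, len(pattern)
    if i < n and pattern[i] == "^":
        i += 1
    if i < n and pattern[i] == "]":      # a leading ']' is a literal member
        i += 1
    while i < n and pattern[i] != "]":
        # [:alpha:], [.x.] and [=x=] carry their own closing ']'
        if pattern[i] == "[" and i + 1 < n and pattern[i + 1] in ":.=":
            close = pattern.find(pattern[i + 1] + "]", i + 2)
            if close == -1:
                return -1
            i = close + 2
        else:
            i += 1
    return i if i < n else -1


def find_negated_brackets(pattern: str) -> List[Tuple[int, int]]:
    """Return (start, end) spans of negated bracket expressions such as [^abc]."""
    spans = []
    i, n = 0, len(pattern)
    while i < n:
        if pattern[i] == "\\":           # escaped character outside a bracket
            i += 2
        elif pattern[i] == "[":
            end = _bracket_end(pattern, i)
            if end == -1:
                break                    # unterminated; leave it to the regex compiler
            if pattern.startswith("[^", i):
                spans.append((i, end + 1))
            i = end + 1
        else:
            i += 1
    return spans
```

A naive search for the substring `[^` would be fooled by an escaped `\[^` or by a `]` that is a literal member of the set. A pattern written in Perl or Ruby syntax, where a backslash also escapes inside a bracket, would need different rules from the ones assumed here.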
Yes, all POSIX patterns are valid PCRE patterns.
No, not exactly the same. The " For example, you'd rarely have a regex of only "
I saw that one of them said that, but I don't know why - it is possible to make GRegex support PCRE2, as far as I can tell. A different developer suggested it could be done.
Yes, that would be the goal of the shim layer.
So far PCRE2 has mostly added some Unicode improvements, more flags to tweak settings, and improved performance (in theory). Some new, fairly esoteric, regex pattern commands have been added (e.g.,
Correct me if I understood wrongly: currently ctags still uses bundled
Yes.
Close universal-ctags#1861.

In most cases, I hope there is no impact on existing optlib code using the --regex-... option.

If the `r` regex flag is given, you can use the extended features of Onigmo with Ruby regex syntax.

A demonstration of one of the miracle features:

    ;; input.mylang
    (define (f1) 1)
    (define ((f2)) #t)
    (define (((f3))) "abc")

    --langdef=mylang
    --map-mylang=.mylang
    --kinddef-mylang=f,fun,function, function returing a function, or function returing a function returing function...
    --_fielddef-mylang=symbol,symbol binding to the function
    --fields-mylang={symbol}
    --regex-mylang=/\(define +(([-a-z0-9]+)|\(\g<1>\))/\1/f/r{_field=symbol:\2}

See the r flag passed to --regex-mylang.

    (((f3))) input.mylang /^(define (((f3))) "abc")$/;" f symbol:f3
    ((f2)) input.mylang /^(define ((f2)) #t)$/;" f symbol:f2
    (f1) input.mylang /^(define (f1) 1)$/;" f symbol:f1

Look at the names of the tags. The pairs of `(' and `)' are well balanced.

Signed-off-by: Masatake YAMATO <yamato@redhat.com>
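The `\g<1>` in that pattern calls group 1 recursively, which is what keeps the parentheses balanced. Python's `re` module has no such subexpression call, so the sketch below spells the same recursion out as an ordinary function to show what the pattern expresses. It is not how ctags or Onigmo implement it. The helper names are hypothetical, and for simplicity the form is assumed to start the line, as in the sample input.

```python
import re
from typing import Optional, Tuple

_NAME = re.compile(r"[-a-z0-9]+")


def match_group1(text: str, pos: int) -> Optional[Tuple[int, str]]:
    # Mirrors group 1 of the commit's pattern: either a bare name,
    # or '(' + group 1 again + ')'. Returns (end index, symbol) or None.
    m = _NAME.match(text, pos)
    if m:
        return m.end(), m.group(0)
    if pos < len(text) and text[pos] == "(":
        inner = match_group1(text, pos + 1)      # the recursive call
        if inner is not None:
            end, symbol = inner
            if end < len(text) and text[end] == ")":
                return end + 1, symbol
    return None


def tag_for(line: str) -> Optional[Tuple[str, str]]:
    """Return (tag name, symbol) for a '(define ...' line, or None."""
    head = re.match(r"\(define +", line)
    if head is None:
        return None
    result = match_group1(line, head.end())
    if result is None:
        return None
    end, symbol = result
    return line[head.end():end], symbol
```

For the three sample lines this yields the same name and symbol pairs as the tags output in the commit message, and a line with unbalanced parentheses such as `(define ((f2) #t)` yields no tag.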
See #3036.
Now we can use pcre2 as an optional regex parser. I will close this.
Has anyone thought about changing ctags to use a more modern regex engine?
The current glibc-2.10.1 engine is... suboptimal. It's incredibly slow. And using the POSIX API misses pretty much every addition to regex engines that's happened in the last 20 years or so: lazy captures, non-capturing groups, atomic groups, possessive quantifiers, negative and positive look-ahead/behind, etc.
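For readers unfamiliar with some of these constructs, here is a small illustration. It uses Python's `re` module purely as a stand-in for a Perl-style engine; ctags itself is not involved, and the sample string is made up.

```python
import re

text = 'name="alpha" kind="beta"'

# Greedy vs. lazy quantifier: '.*' runs to the last quote, '.*?' stops at the first.
greedy = re.search(r'".*"', text).group(0)
lazy = re.search(r'".*?"', text).group(0)

# Non-capturing group: (?:...) groups the alternation without using up a capture number,
# so the quoted value is still group 1.
first_value = re.search(r'(?:name|kind)="(\w+)"', text).group(1)

# Positive look-behind: match a word only if it is preceded by 'kind="'.
after_kind = re.findall(r'(?<=kind=")\w+', text)

# Negative look-ahead combined with a positive one: keys followed by '=',
# except the key 'kind'.
keys_except_kind = re.findall(r'\b(?!kind)\w+(?==)', text)
```

With the POSIX API, the usual workaround for the lazy case is a negated bracket expression such as `"[^"]*"`, which is the construct whose newline handling is discussed in the comments.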
I've recently written a Yang file parser, using optlib multi-table regex. When I first wrote it, it took ctags ~40 seconds to process the 61 yang files I have in my employer's codebase. After a lot of effort to optimize the parser, I was able to get it down to ~16 seconds. That's just for 61 files! When I ran it against ~900 yang files for Cisco's XR router, it took approximately 5 minutes.
So I went and compiled ctags to use PCRE2 instead. My 61 yang files were processed in 0.2 seconds! Cisco's 900 yang files took 2.4 seconds! (and I verified the generated tags were the same as using the current regex engine)
That's without taking advantage of PCRE's expanded regex abilities - this is just using the POSIX API only, without changing the `yang.ctags` optlib file.
Of course in regular usage ctags won't be improved nearly as much, since most of the built-in parsers don't use regex. But some do, such as the CMake parser. In my employer's code base, it usually takes ~12 seconds to process our ~10k files (with C/C++, CMake, etc.); but with PCRE this dropped to 5 seconds.
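The figures quoted so far can be turned into speedup factors with a few lines of arithmetic. This sketch assumes the 0.2-second PCRE2 run used the already-optimized parser, so it is compared against the ~16-second figure rather than the original ~40 seconds. The labels are only descriptive.

```python
# (seconds with the bundled glibc engine, seconds with PCRE), as quoted in the post.
timings = {
    "61 yang files": (16.0, 0.2),
    "~900 yang files": (5 * 60.0, 2.4),   # "approximately 5 minutes"
    "~10k mixed files": (12.0, 5.0),
}


def speedup(before: float, after: float) -> float:
    """How many times faster the second measurement is."""
    return before / after


for label, (before, after) in timings.items():
    print(f"{label}: {speedup(before, after):.1f}x faster")
```

That works out to roughly 80x and 125x for the regex-heavy Yang runs, against about 2.4x for the mixed code base, where most parsers do not use regex at all.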
When I run ctags with my new `yang.ctags` optlib file as well, against my ~10k mixed files (including the 61 yang files), it takes ~29 seconds. With PCRE that drops back down to ~5 seconds.
P.S. I searched through the closed issues list and didn't see this topic. Apologies if it's already been discussed.