write the specification of regex implemented #3034

hirooih · 2021-05-21T15:18:58Z

I've been expecting to update regex engine as #1861. It was because I just wanted to use \s, \S, \w, and \W which I was familiar with via Perl.
Today I found they were already supported the GNU regex which is being used by ctags.

I've been referring man page of e?grep(1), regex(7), and manual of the GNU libc. But they don't describe about the feature.
I searched on the Web, https://www.gnu.org/software/grep/manual/grep.html#Regular-Expressions is the only document I could find.

@masatake -san, if you agree with me, I'd like to update examples in ctags-optlib.7.rst and optlib.rst to use them. It will improve readability of patterns. Some of them do not handle white-spaces correctly, i.e. ...private) +)?func (...

The text was updated successfully, but these errors were encountered:

masatake · 2021-05-21T20:26:53Z

Before starting this task, could you add a case testing \s, \S, \w, and \W to Tmain or Units?
@leleliu008 has added many platforms where our test runs. We can know whether we can expect \s, \S, \w, and \W are available on all the platforms or not.

hirooih · 2021-05-22T05:49:40Z

Before starting this task, could you add a case testing \s, \S, \w, and \W to Tmain or Units?

PR #3035. Failed on Cygwin, MSYS2, and MacOS 10.15.
Interesting... Is is possible for them to use a different regex engine from gnu_regex/ in the ctags distribution?

masatake · 2021-05-22T08:31:13Z

If a platform has an implementation of regex library, ctags (build-system) may use it.
The code in gnu_regex/ directory is for the platform where the library is not available.

hirooih · 2021-05-22T08:47:34Z

If a platform has an implementation of regex library, ctags (build-system) may use it.
The code in gnu_regex/ directory is for the platform where the library is not available.

That makes sense and explains this situation.
Let me think some more.

hirooih · 2021-05-23T03:25:06Z

@masatake -san,

If a platform has an implementation of regex library, ctags (build-system) may use it.
The code in gnu_regex/ directory is for the platform where the library is not available.

Why don't we use gnu_regex always?

If ctags can use a same regex engine on all platforms, we can ensure the compatibility of optlib parsers among the platforms.
And regex engines on some platform has less functionalities than the GNU regex. \s and \w are very useful for optlib parsers as we see on c18057e.

I tired to use gnu_regex on my MacOS machine and confirmed it passed the new tests.

Here is my fix of `configure.ac*. I'm not familiar with Autoconf. There must be better fixes.

--- a/configure.ac
+++ b/configure.ac
@@ -563,27 +563,11 @@ dnl
 dnl The final result can be check with ./ctags --list-features.
 dnl
 dnl
-AC_CHECK_FUNCS(regcomp,[
-AC_MSG_CHECKING(if regcomp works)
-AC_RUN_IFELSE([AC_LANG_SOURCE([
-#include <sys/types.h>
-#include <regex.h>
-int main(void) {
-   regex_t patbuf;
-   return (regcomp (&patbuf, "/hello/", 0) != 0);
-}
-])],[regcomp_works=yes],[regcomp_works=no],[AC_DEFINE(CHECK_REGCOMP)])
-AC_MSG_RESULT($regcomp_works)
-if test no = "$regcomp_works"; then
-   AC_MSG_ERROR([regcomp() on this system is broken.])
-fi
-have_regcomp=yes
-],[
 dnl Using bundled regex implementation
 AC_LIBOBJ([regex])
 REGCOMP_CPPFLAGS="-I${srcdir}/gnu_regex -D__USE_GNU"
-AC_DEFINE(HAVE_REGCOMP)
-have_regcomp=no])
+AC_DEFINE(HAVE_REGCOMP,0,[use GNU regexp])
+have_regcomp=no
 AC_SUBST([REGCOMP_CPPFLAGS])
 AM_CONDITIONAL([HAVE_REGCOMP], [test "xyes" = "x$have_regcomp"])

And I need one more fix on Ubuntu system.
-D_USE_GNU is specified on the command line, but it is undefined in limits.h. I spent some time to solve this issue.

--- a/gnu_regex/regex.c
+++ b/gnu_regex/regex.c
@@ -57,6 +57,7 @@
    #undefs RE_DUP_MAX and sets it to the right value.  */
 #include <limits.h>

+#define __USE_GNU    // may be undefined by limits.h
 #include "regex.h"
 #include "regex_internal.h"

If you agree, I will add the above fixes to PR #3035. Or I will create a separate PR if you prefer.

masatake · 2021-05-23T05:00:01Z

I would like to keep gnu_regex/ as the fall-back implementation.

The implementation of gnu_regex/ is old. We have to update it to use widely.
I wonder where we should get the latest version. gnulib?

Fedora that I'm using daily, has newer glibc that may include newer gnu_regex.
I don't use gnu_regex/ in ctags source tree.

The gnu_regex/ in ctags source tree is patched by us.
When updating gnu_regex/ in ctags source tree, we must re-apply the patches.

$ git log --oneline gnu_regex
2ead636db gnu_regex: Suppress warnings
cb6188882 regex: Fix cast error on VC2015 x64
46204f785 Fix a lot of typos
035c7afca Updated to use regex code from glibc.
82268cee8 Updated to use regex code from glibc.
dcc4c8458 Fixed regex support for MinGW. Gnu regex module now included in all distributions.
738646227 Added GNU regex files to SVN control.

These are not so critical. However, if we have time, taking time for using pcre2 in ctags is fruitful.

masatake · 2021-05-23T10:03:20Z

See #3036.

hirooih · 2021-05-23T14:05:27Z

See #3036.

Great! I look forward to merging this fix.

The implementation of gnu_regex/ is old. We have to update it to use widely.
I wonder where we should get the latest version. gnulib?

Today I tried to import regex from glibc-2.33, the latest stable glibc.
I've confirmed that all of the fixed on ctags (diff between gnu_regex/* and reg*.[ch] in glibc-2.10.1) are already applied.
But I could not compile regex from 2.33 due to missing some include files, libc-config.h, intprops.h, and verify.h, and I gave-up.

For my records, here's what I've confirmed. (The exact same fixes, like typos in comment, are not listed.)
Note that the last one is different from ours. It might be a bug in ctags.

regcomp.c

--- gnu_regex-2.10.1/regcomp.c	2009-01-08 09:42:28.000000000 +0900
+++ gnu_regex/regcomp.c	2021-05-22 23:59:24.908479300 +0900
@@ -2527,7 +2527,7 @@
     old_tree = NULL;
 
   if (elem->token.type == SUBEXP)
-    postorder (elem, mark_opt_subexp, (void *) (long) elem->token.opr.idx);
+    postorder (elem, mark_opt_subexp, (void *) (intptr_t) elem->token.opr.idx);
 
   tree = create_tree (dfa, elem, NULL, (end == -1 ? OP_DUP_ASTERISK : OP_ALT));
   if (BE (tree == NULL, 0))

Fixed as follows;

    {
      uintptr_t subidx = elem->token.opr.idx;
      postorder (elem, mark_opt_subexp, (void *) subidx);
    }

@@ -3740,7 +3740,7 @@
 static reg_errcode_t
 mark_opt_subexp (void *extra, bin_tree_t *node)
 {
-  int idx = (int) (long) extra;
+  int idx = (int) (intptr_t) extra;
   if (node->token.type == SUBEXP && node->token.opr.idx == idx)
     node->token.opt_subexp = 1;

Fixed as follows;

  Idx idx = (uintptr_t) extra;

regex_internal.c

--- gnu_regex-2.10.1/regex_internal.c	2009-01-08 09:22:50.000000000 +0900
+++ gnu_regex/regex_internal.c	2020-09-04 13:16:03.791141600 +0900
@@ -1390,13 +1390,15 @@
 
 
 /* Add the token TOKEN to dfa->nodes, and return the index of the token.
-   Or return -1, if an error will be occured.  */
+   Or return -1, if an error will be occurred.  */
 
 static int
 internal_function
 re_dfa_add_node (re_dfa_t *dfa, re_token_t token)
 {
+#ifdef RE_ENABLE_I18N
   int type = token.type;
+#endif
   if (BE (dfa->nodes_len >= dfa->nodes_alloc, 0))
     {
       size_t new_nodes_alloc = dfa->nodes_alloc * 2;

The typo in comment is fixed.
The variable type is deleted. (token.type is used.)

regex_internal.h

--- gnu_regex-2.10.1/regex_internal.h	2009-01-08 09:23:00.000000000 +0900
+++ gnu_regex/regex_internal.h	2020-09-04 13:16:03.791141600 +0900
@@ -418,7 +418,11 @@
 #define re_string_skip_bytes(pstr,idx) ((pstr)->cur_idx += (idx))
 #define re_string_set_index(pstr,idx) ((pstr)->cur_idx = (idx))
 
-#include <alloca.h>
+#ifdef WIN32
+# include <malloc.h>
+#else
+# include <alloca.h>
+#endif
 
 #ifndef _LIBC
 # if HAVE_ALLOCA

This is fixed as follows;

#if defined _LIBC || HAVE_ALLOCA
# include <alloca.h>
#endif

#ifndef _LIBC
# if HAVE_ALLOCA
...
# else
/* alloca is implemented with malloc, so just use malloc.  */
#  define __libc_use_alloca(n) 0
#  undef alloca
#  define alloca(n) malloc (n)
# endif
#endif

regexec.c

--- gnu_regex-2.10.1/regexec.c	2009-01-08 09:47:05.000000000 +0900
+++ gnu_regex/regexec.c	2020-09-04 13:16:03.791141600 +0900
@@ -227,7 +227,9 @@
 {
   reg_errcode_t err;
   int start, length;
+#ifdef _LIBC
   re_dfa_t *dfa = (re_dfa_t *) preg->buffer;
+#endif
 
   if (eflags & ~(REG_NOTBOL | REG_NOTEOL | REG_STARTEND))
     return REG_BADPAT;
@@ -418,7 +420,9 @@
   regmatch_t *pmatch;
   int nregs, rval;
   int eflags = 0;
+#ifdef _LIBC
   re_dfa_t *dfa = (re_dfa_t *) bufp->buffer;
+#endif
 
   /* Check for out-of-range.  */
   if (BE (start < 0 || start > length, 0))

Now dfa is used even when _LIBC is not defined (for lock_lock() and lock_unlock()).

@@ -468,7 +472,7 @@
 
   rval = 0;
 
-  /* I hope we needn't fill ther regs with -1's when no match was found.  */
+  /* I hope we need not to fill the regs with -1's when no match was found.  */
   if (result != REG_NOERROR)
     rval = -1;
   else if (regs != NULL)

"I hope we needn't fill their regs with -1's when no match was found."

@@ -2899,7 +2903,7 @@
 	      sizeof (re_dfastate_t *) * (path->alloc - old_alloc));
     }
 
-  str_idx = path->next_idx ?: top_str;
+  str_idx = path->next_idx ? 0 : top_str;
 
   /* Temporary modify MCTX.  */
   backup_state_log = mctx->state_log;

fixed as follows;

  str_idx = path->next_idx ? path->next_idx : top_str;

masatake · 2021-05-23T15:01:28Z

Thank you for taking your time for comparing the changes. What you are doing is important even if we introduce pcre2.

I think we should take a copy from gnulib, not from glibc.

https://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/regex.h
https://git.savannah.gnu.org/cgit/gnulib.git/tree/lib
https://www.gnu.org/software/gnulib/
https://www.gnu.org/software/gnulib/manual/html_node/index.html

As far as reading the regex.h at gnulib repo, the license is not changed.

k-takata · 2021-05-24T04:19:20Z

This might be off-topic, but

/* alloca is implemented with malloc, so just use malloc.  */

alloca allocates memory from the stack and it will be automatically freed when a function returns.
Just replacing alloca with malloc causes memory leaks.

hirooih · 2021-05-24T13:22:21Z

@masatake -san,
I've been thinking gnulib is another name for glibc. I've just start reading the document of gnulib.

@k-tanaka san,
The implementation of regex in gnulib never use alloca().

Very interesting!
Thank you for your info.

k-takata · 2021-05-24T13:40:03Z

(You are mentioning the wrong person.)

hirooih · 2021-06-05T10:04:25Z

What you are doing is important even if we introduce pcre2.

I think we should take a copy from gnulib, not from glibc.

I had to learn about not only gnulib but also GNU Autotools to use gnulib,
Now I am ready to send a PR merging the latest regex engine from gnulib to ctags.

There are some concerns. I'd like to know if they are acceptable or not, before making a PR.

The current Gnulib requires GNU Autoconf 2.64 (released in 2009). The current ctags requires 2.59 (released in 2003).
The current gnu_regex directory has only 6 *.c and *.h files. If we import regex module from Gnulib, it also copy dependent files, total 73, and 52 M4 macro files.
- See https://www.gnu.org/software/gnulib/MODULES.html#module=regex for details (dependent files, etc.). They may improve portability of ctags.
- The directory name gnu_regex/ is not proper any more. I create a directory gnulib instead.

If these are acceptable. I will send a PR.

masatake · 2021-06-05T13:50:14Z

Thank you very much.

Could you open a pull request?
I would like to see the result of the test running on CI/CD environments.

masatake · 2021-06-05T14:08:23Z

What I'm afraid of is related to the license. Some of the files may be distributed under GPLv3.

hirooih · 2021-06-05T15:16:27Z

Could you open a pull request?
I would like to see the result of the test running on CI/CD environments.

Sure.
I've opened #3054. There are some fails. I will take a look at them.

masatake · 2021-06-05T16:14:17Z

I should write my idea first of all.

We expect ctags supports POSIX Extended Regular Expression (ERE).
Many platforms provide the library for ERE.
We add gnu_regex to the ctags source tree. It is for the platform having no ERE implementation. It is a fallback.

I think we should update the gnu_regex code because it was imported to the ctags source tree a long time ago.
I'm afraid of the vulnerability of the old code. For the same reason, we should update fnmatch library.

As you know you can use \s in your .ctags. It is o.k. as far as you use it privately.
However, it will not be acceptable to merge your optlib parser into the ctags tree.
How can we know whether \? in our source tree? CI/CD environments massively extended by @leleliu008 judge the acceptance.

If you want to write a parser using \s and merge the parser to our source tree, use p/pcre2 I'm working on #3036.
I'm exhausted about C++ parser-related issues. So I'm happy if you drive #3036. If you are interested in this task talk to me #3036.

If a platform doesn't provide ERE implementation, ctags should use gnu_regex in the source tree.
However, if a platform provides ERE implementation, ctags should use it.
If a user WANTS, ctags can be built with gnu_regex in our source tree.

If there are incompatibilities between ERE implementations on the platform and the gnu version, a parser in .ctags will not work when we force users to use gnu implementation.

hirooih · 2021-06-06T09:00:15Z

@masatake san,

Let explain my intension one more time.

I thinks \s and \w are MUST-HAVE-FEATUREs for Universal ctags. They can be used on most cases for use-cases such as ctags.

Most of examples in https://docs.ctags.io/en/latest/optlib.html need \s*, \s+, and \w+ patterns.
Some of examples does not care of tab characters or white spaces on the beginning of the line.
It is OK as examples, but we need lots of \s* or \s+ for actual use-cases.

--regex-Pod=/^=head1[ \t]+(.+)/\1/c/
--regex-Man=/^\.TH[[:space:]]{1,}"([^"]{1,})".*/\1/t/{exclusive}{icase}{scope=push}

can be;

--regex-Pod=/^=head1\s+(.+)/\1/c/
--regex-Man=/^\.TH\s+"([^"]{1,})".*/\1/t/{exclusive}{icase}{scope=push}

--regex-m4=/^m4_define\(\[([^]$\(]+).+$/\1/d/x
--regex-m4=/^m4_define\(\[([^]$\(]+).+$/\1/d/{extend}

The above is OK as an example, but they should be;

--regex-m4=/^[ \t]*m4_define[ \t]*\(\[([^]$\(]+).+$/\1/d/x
--regex-m4=/^[ \t]*m4_define[ \t]*\(\[([^]$\(]+).+$/\1/d/{extend}

and can be

--regex-m4=/^\s*m4_define\s*\(\[([^]$\(]+).+$/\1/d/x
--regex-m4=/^\s*m4_define\s*\(\[([^]$\(]+).+$/\1/d/{extend}

Similarly;

--regex-Foo=/.*\(lambda .*//l/{_anonymous=L}

should be

--regex-Foo=/.*\(\s*lambda\s+.*//l/{_anonymous=L}

We have to write

--regex-Foo=/.*\([ \t]*lambda[ \t]+.*//l/{_anonymous=L}

--regex-C=/([a-zA-Z_][a-zA-Z_0-9]*) *=/\1/v/{_role=lvalue}
--regex-C=/([a-zA-Z_][a-zA-Z_0-9]*) *\+=/\1/v/{_role=lvalue}{_role=incremented}

This can and should be

--regex-C=/(\w+)\s*=/\1/v/{_role=lvalue}
--regex-C=/(\w+)\s*\+=/\1/v/{_role=lvalue}{_role=incremented}

(Note 123foo = xxxx; causes syntax error.)

--regex-Foo=/^class[[:blank:]]+([[:alpha:]]+):/\1/c/{scope=set}
--regex-Foo=/^[[:blank:]]+def[[:blank:]]+([[:alpha:]]+).*:/\1/d/{scope=ref}

can be

--regex-Foo=/^\s*class\s+(\w+):/\1/c/{scope=set}
--regex-Foo=/^\s*def\s+(\w+).*:/\1/d/{scope=ref}

Most of others examples also can use \s and \w.

My commit c18057e (#3035) is also another example.

Universal-Ctags extends optlib features extensively. There are little reasons to limit to ERE. We can define the specification of the Regular Expression for Universal-Ctags.

I don't think PCRE can be a default feature.

The above is the reason why I am proposing and working on #3054.

masatake · 2021-06-06T14:10:43Z

We can replace \s in a pattern with [ \t] or [[:space:]] just before passing the pattern string to regcomp.
I guess we can do the same with \w.

leleliu008 · 2021-06-06T23:39:15Z

I noticed that @masatake have working on pcre2, why keep to use gnu_regex as a fallback. don't pcre2 meet our needs?

@hirooih update the latest gnu_regex source from gnulib, he import regex moudle via gnulib-tool --import regex, this way is very much tied with Autotools. If we want to support other build system, it will become very difficult. the old version of gnu_regex ctags ever used only regcomp.c regex.c regex.h regex_internal.c regex_internal.h regexec.c 6 files. so, I think we should update this 6 files only. other dependencies is to support —with-included-regex configure argument. do we realy need support this configure argument?

Reference: modules/regex

Description:
Regular expression matching.

Files:
lib/regex.h
lib/regex.c
lib/regex_internal.c
lib/regex_internal.h
lib/regexec.c
lib/regcomp.c
m4/eealloc.m4
m4/regex.m4
m4/mbstate_t.m4

Depends-on:
c99
extensions
ssize_t
attribute       [test $ac_use_included_regex = yes]
btowc           [test $ac_use_included_regex = yes]
builtin-expect  [test $ac_use_included_regex = yes]
dynarray        [test $ac_use_included_regex = yes]
intprops        [test $ac_use_included_regex = yes]
langinfo        [test $ac_use_included_regex = yes]
libc-config     [test $ac_use_included_regex = yes]
lock            [test $ac_use_included_regex = yes]
memcmp          [test $ac_use_included_regex = yes]
memmove         [test $ac_use_included_regex = yes]
mbrtowc         [test $ac_use_included_regex = yes]
mbsinit         [test $ac_use_included_regex = yes]
nl_langinfo     [test $ac_use_included_regex = yes]
stdbool         [test $ac_use_included_regex = yes]
stdint          [test $ac_use_included_regex = yes]
verify          [test $ac_use_included_regex = yes]
wchar           [test $ac_use_included_regex = yes]
wcrtomb         [test $ac_use_included_regex = yes]
wctype-h        [test $ac_use_included_regex = yes]
wctype          [test $ac_use_included_regex = yes]

configure.ac:
gl_REGEX
if test $ac_use_included_regex = yes; then
  AC_LIBOBJ([regex])
  gl_PREREQ_REGEX
fi

Makefile.am:

Include:
<regex.h>

Link:
$(LIB_MBRTOWC)
$(LIBTHREAD)
$(LTLIBINTL) when linking with libtool, $(LIBINTL) otherwise

License:
LGPLv2+

Maintainer:
all

k-takata · 2021-06-06T23:44:33Z

In general, character classes and POSIX brackets are affected by the locale.
[a-zA-Z_] is not exactly same as \w. Same for \s, [[:blank:]] and so on.

hirooih · 2021-06-07T14:57:06Z

@masatake san,

We can replace \s in a pattern with [ \t] or [[:space:]] just before passing the pattern string to regcomp.
I guess we can do the same with \w.

This is not best way but could be one of solution.

@leleliu008,

why keep to use gnu_regex as a fallback. don't pcre2 meet our needs?

If we can build with pcre2 on all platforms u-ctags support, pcre2 will meet our needs.

If we want to support other build system, it will become very difficult.

Yes, we already see this issue on #3054.

other dependencies is to support —with-included-regex configure argument. do we realy need support this configure argument?

I don't understand this comment. Can we reduce the number of files to be imported?
I don't understand what [test $ac_use_included_regex = yes] means, either.

masatake · 2021-06-07T18:07:52Z

This is not best way but could be one of solution.

I wonder why this is not best way for your intention.

As @k-takata suggested, \s is not the same as [ \t]. \s reflects the current locale setting.
@hirooih, do you think this is a matter?

If we can build with pcre2 on all platforms u-ctags support, pcre2 will meet our needs.

Exploring the way to build ctags with pre-installed pcre2 on more platforms may be more attractive and fruitful.
I don't think we have to insist on supporting ALL platforms.
The position of pcre2 is like libxml2.

@leleliu008,

I noticed that @masatake have working on pcre2, why keep to use gnu_regex as a fallback. don't pcre2 meet our needs?

To avoid minor troubles, I would like to introduce pcre2 as a new feature, not a replacement for ERE implementation.

leleliu008 · 2021-06-08T00:48:06Z

@hirooih

run command configure --help, you will see 4 new configure arguments which was introduced by gnulib-tool

--enable-threads={isoc|posix|isoc+posix|windows}
                          specify multithreading API
--disable-threads       build without multithread safety

--enable-cross-guesses={conservative|risky}
                          specify policy for cross-compilation guesses

--without-included-regex
                          don't compile regex; this is the default on systems
                          with recent-enough versions of the GNU C Library
                          (use with caution on other systems).

if user run configure --with-included-regex, the ac_use_included_regex variable will be set to yes, other dependencies will be used.

hirooih · 2021-06-08T14:57:31Z

@masatake san,

@hirooih, do you think this is a matter?

No I don't.
I just think reinventing wheels never be the BEST way. Again it could be one of solution.

@k-takata san,

In general, character classes and POSIX brackets are affected by the locale.
[a-zA-Z_] is not exactly same as \w. Same for \s, [[:blank:]] and so on.

POSIX defines [:blank:], [:space:], etc. They are affected by the locale, as you wrote. (And if I understand correctly locale support is optional in POSIX.)
But POSIX does not define \s or \w. (This is the root of our issue.)
u-ctags can define the specification of them. (Of course, we should follow our predecessors as much as possible.)

@leleliu008 san,

Now I understand what you meant and what I could not explain to you.

I hope all of us understand the importance of \s and \w for u-ctags.
We are searching ways to support \s and \w on all platforms supported.

Replace \s to [ \t] (or [ \t\r\n\f] for multiline-regex) and \w to ....
enable (outdated) gnu_regex on all platform by default.
enable gnulib regex on all plaforms by default
use pcre on all plaforms by default

do we really need support this configure argument?

Yes, we do for option 3.

other dependencies is to support —with-included-regex configure argument.

Other dependencies are included to improve portability to compile gnulib-regex.

--without-included-regex
                          don't compile regex; this is the default on systems
                          with recent-enough versions of the GNU C Library
                          (use with caution on other systems).

Without this option (by default) configure uses the regex library from "recent-enough versions of the GNU C Library" or regex library of gnulib.
This is what we need.

masatake · 2021-06-08T17:07:09Z

@hirooih

Do you mean a developing a code replacing \s with [ \t] is a "reinventing wheel" ?

hirooih · 2021-06-08T23:44:18Z

Do you mean a developing a code replacing \s with [ \t] is a "reinventing wheel" ?

Yes.
I think it is a good idea. But not the best, I think.

leleliu008 · 2021-06-09T06:27:30Z

we always heard that "there is no best, only better". But waht is the better? Different people have their own options. Better never means better for everyone, it always means worse, for some. so, i think better is the result of a compromise. as a example, when we introduce a new library in our proect, what we have to do is to decide what we want to take it, what we should abandon it, and what we can fix it.

hirooih · 2021-06-09T12:52:22Z

@masatake san,

Great! I never thought the "replacing approach" could be implemented like this. Very simple and clean code.
I've learned the power of vStrings functions.

Now working for Gnulib becomes reinventing wheels.

Why don't you merge #3058?

Let me create a PR to update specification of regex.

masatake · 2021-06-09T20:43:15Z

We should touch the ERE layer only when we have a critical trouble like a security issue.
As I wrote, gnu_regex should be updated because it is old. I assume a security related fix in the newer code.
It seems that #3054 doesn't update gnu_regex. We have studied updating was not easy task. This is very valuable result of efforts. Thank you.

\s and \w may be useful. However, we can write a same parser with or without them. @hirooih wrote "MUST-HAVE". I don't think so. I think mtable-regex and optscript are the typical "MUST-HAVE". We can write a same parser without them. mline-regex is the typical non-"MUST-HAVE". Even we don't have it, write a same parser with mtable-regex. I want to eliminate mline-regex but it is too late.

I never say we should not have \s and \w. However, when thinking about cost of development, documentation, and maintenance,
I think touching the ERE layer is not good.

I think introducing another interface that supports rich regex is better. To evaluate my ideas, I made two pull requests.

lregex: use pcre2 if p flag is given #3036 (pcre2)
[RFC] lregex: replace gnu_regex with Onigmo regex engine #2469 (Onigmo)

I have wondered which is better for ctags for three years. This will be one of the hardest question in ctags because there is a great trade-off.

Onigmo includes works of akr. I think they are semi-"MUST-HAVE". However, comparing with pcre2, it is not so popular. It means we have to think about its build process. In #2469, I used git-subtree.
pcre2 is popular. We can expect it is already installed or a package is provided on the target platform.
(Though I did't tried I expect we can link it to ctags with vc when we use the result of https://github.com/kiyolee/pcre2-win-build.)

When starting this discussion, I inclined to choose pcre2.

[yamato@control]/tmp% rpm -qf /usr/lib64/libonig.so.5
rpm -qf /usr/lib64/libonig.so.5
oniguruma-6.8.2-1.fc28.x86_64
[yamato@control]/tmp% nm -D /usr/lib64/libonig.so.5 | grep regcomp
nm -D /usr/lib64/libonig.so.5 | grep regcomp
000000000002d6e0 T regcomp
[yamato@control]/tmp% nm -D /usr/lib64/libonig.so.5 | grep regexec
nm -D /usr/lib64/libonig.so.5 | grep regexec
000000000002d810 T regexec
[yamato@control]/tmp% nm -D /usr/lib64/libpcre2-8.so.0 | grep regcomp
nm -D /usr/lib64/libpcre2-8.so.0 | grep regcomp
[yamato@control]/tmp% nm -D /usr/lib64/libpcre2-8.so.0 | grep regexec
nm -D /usr/lib64/libpcre2-8.so.0 | grep regexec

(I guess the result may be the same in onigruma.)

In my understanding, introducing libonig implies touch the ERE layer of ctags.
As I wrote, I don't want to touch the area already worked. I want to use a RICH regex engine as a new layer.
Unlike libonig, libocre2-8 doesn't provide regcomp and regexec. It means introducing libpcre2-8 has no impact on the ERE layer.

See #3036 . I would like to say it is well written.
If we can ignore the symbol conflicts, we can adapt libonig to the interface provided in #3036.

So I wanted to merge #3036. However, I closed #3036 after thinking. When adding a new feature I always write a parser or extend an existing parser using the new feature. I follow this rule not to introduce too many features randomly developed. I cannot find an item I should develop with the rich pcre2 features. All items I cannot implement with the ERE but I can implement with pcre2 can be implemented with optscript. So I lost interest.

#3036 is the pain-less and broad way to go for introducing \s and \w.

The pain-less way means fewer documentation and fewer efforts for adjusting build-script.
The broad way means more benefits with fewer cost.

masatake · 2021-06-09T21:23:16Z

BTW, using gnulib regex with g flag is an acceptable way because it doesn't touch the ERE layer.

hirooih · 2021-06-10T14:27:56Z

@hirooih wrote "MUST-HAVE". I don't think so.

It's a disappointing conclusion, but I will follow your judgment.
But I'm not completely convinced the reason why you decided not to merge #3036. I still hope you reconsider someday.

BTW how do you think about examples as I wrote above?

Some of examples does not care of tab characters or white spaces on the beginning of the line.
It is OK as examples, but we need lots of \s* or \s+ for actual use-cases.

I think we should use [ \t] or [[:blank:]] everywhere required.

It seems that #3054 doesn't update gnu_regex. We have studied updating was not easy task.

I gave up because I did not have development environments for the systems.
This article shows that Autotools works on Windows. Did anyone try it?

And just for a record.
The "replacing approach" has a minor issue. It does not work in a bracket expression (ex. [0-3\s]).

masatake · 2021-06-10T18:59:36Z

But I'm not completely convinced the reason why you decided not to merge #3036. I still hope you reconsider someday.

I myself don't work on #3036 because I cannot find strong demand in myself now.
It doesn't mean I will reject #3036 if someone finishes #3036.

If there is a real parser written in pcre2, I will merge #3036.

Other than I know how to use optscript. So, ignore whether the real parser can be written in optscript or not.
The parser that cannot be written in ERE but can be written in PCRE2 is needed.
This is a good demonstration of the p flag and can be used for testing the p flag.

BTW how do you think about examples as I wrote above?

Some of examples does not care of tab characters or white spaces on the beginning of the line.
It is OK as examples, but we need lots of \s* or \s+ for actual use-cases.

I think we should use [ \t] or [[:blank:]] everywhere required.

About a parser like Foo defined in args.ctags, we don't have to change. Keep them as is.

As you wrote, about real parsers in optlib/, it is better to insert [ \t] where [ \t] is needed.
I use "better" here. We don't receive a bug report. So I guess missing [ \t] doesn't become an issue.
In such a case, I use "better".

[[:blank:]] is locate-sensitive. So I think [ \t]. I wonder there is any use case of [[:foo:]] in ctags.

hirooih · 2021-07-03T10:27:55Z

I found I misunderstood about gnulib regexp.

regex.m4 L76-L290 checks several compatibility checks of regex on the system running. If the checks succeeds it uses the regex in the platform, otherwise it uses the regex in gnulib.

This means that using gnulib regex does allow us to use GNU extension (i.e. \s or \w) to support multiple platforms.

In other words #3036 just gives us the up-to-date implementation of GNU regex and better compatibility checks.

Import regex from gnulib ctags distribution includes regex library from glibc, but it is too old (from glibc-2.10.1 released in 2009). See also #3034. This PR imported the latest gnulib regex module.

- describe the change made in universal-ctags#3034 - a regex engine may be imported from gnulib (see universal-ctags#3054) - describe how regex engine is chosen - describe the GNU extensions and discourages its use.

hirooih added the Documentation label May 21, 2021

hirooih self-assigned this May 21, 2021

hirooih mentioned this issue Jun 5, 2021

Import regex from gnulib #3054

Merged

masatake mentioned this issue Jun 9, 2021

\s, \S, \w, and \W in regexp #3058

Closed

hirooih mentioned this issue Jul 31, 2021

Docs: regex specification update #3109

Merged

hirooih closed this as completed in 0521db5 Aug 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

write the specification of regex implemented #3034

write the specification of regex implemented #3034

hirooih commented May 21, 2021

masatake commented May 21, 2021

hirooih commented May 22, 2021

masatake commented May 22, 2021

hirooih commented May 22, 2021

hirooih commented May 23, 2021

masatake commented May 23, 2021

masatake commented May 23, 2021

hirooih commented May 23, 2021

masatake commented May 23, 2021

k-takata commented May 24, 2021

hirooih commented May 24, 2021

k-takata commented May 24, 2021

hirooih commented Jun 5, 2021

masatake commented Jun 5, 2021

masatake commented Jun 5, 2021

hirooih commented Jun 5, 2021

masatake commented Jun 5, 2021

hirooih commented Jun 6, 2021

masatake commented Jun 6, 2021

leleliu008 commented Jun 6, 2021

k-takata commented Jun 6, 2021

hirooih commented Jun 7, 2021

masatake commented Jun 7, 2021

leleliu008 commented Jun 8, 2021 •

edited

Loading

hirooih commented Jun 8, 2021

masatake commented Jun 8, 2021

hirooih commented Jun 8, 2021

leleliu008 commented Jun 9, 2021 •

edited

Loading

hirooih commented Jun 9, 2021

masatake commented Jun 9, 2021 •

edited

Loading

masatake commented Jun 9, 2021

hirooih commented Jun 10, 2021

masatake commented Jun 10, 2021

hirooih commented Jul 3, 2021

write the specification of regex implemented #3034

write the specification of regex implemented #3034

Comments

hirooih commented May 21, 2021

masatake commented May 21, 2021

hirooih commented May 22, 2021

masatake commented May 22, 2021

hirooih commented May 22, 2021

hirooih commented May 23, 2021

masatake commented May 23, 2021

masatake commented May 23, 2021

hirooih commented May 23, 2021

regcomp.c

regex_internal.c

regex_internal.h

regexec.c

masatake commented May 23, 2021

k-takata commented May 24, 2021

hirooih commented May 24, 2021

k-takata commented May 24, 2021

hirooih commented Jun 5, 2021

masatake commented Jun 5, 2021

masatake commented Jun 5, 2021

hirooih commented Jun 5, 2021

masatake commented Jun 5, 2021

hirooih commented Jun 6, 2021

masatake commented Jun 6, 2021

leleliu008 commented Jun 6, 2021

k-takata commented Jun 6, 2021

hirooih commented Jun 7, 2021

masatake commented Jun 7, 2021

leleliu008 commented Jun 8, 2021 • edited Loading

hirooih commented Jun 8, 2021

masatake commented Jun 8, 2021

hirooih commented Jun 8, 2021

leleliu008 commented Jun 9, 2021 • edited Loading

hirooih commented Jun 9, 2021

masatake commented Jun 9, 2021 • edited Loading

masatake commented Jun 9, 2021

hirooih commented Jun 10, 2021

masatake commented Jun 10, 2021

hirooih commented Jul 3, 2021

leleliu008 commented Jun 8, 2021 •

edited

Loading

leleliu008 commented Jun 9, 2021 •

edited

Loading

masatake commented Jun 9, 2021 •

edited

Loading