vim: reimplement isMap with regex #134

masatake · 2014-11-22T20:47:38Z

isMapPredicateWords is more maintainable than repeated starts_with_cmd calls.

Signed-off-by: Masatake YAMATO yamato@redhat.com

Signed-off-by: Masatake YAMATO <yamato@redhat.com>

b4n · 2014-11-22T22:17:28Z

This looks a bit overly complex to me, and I don't really see what the regexes really give, but maybe possibly small performance improvement if the pattern compilation doesn't weight down the possible match speedup (which I would only expect on a huge number of runs).

For an example of why I think it's hard to read, the isMapPredicateWords table is really not obvious, as it uses a custom syntax that is then compiled to a regex which is in turn matched. It at least requires a comment explaining what this is about.
Is see that the problem is trying to have different prefix length so using a single [nvoilc]?(nore)?map([ \t!]|$) with partial matching isn't easy, but this kinda makes me think that regexes are not the answer here.

Also, as for now regexs are optional, it makes this parser optional.

Anyway, I would think that a simpler approach like this would be easier to read:

/* checks if a word at the start of `p` matches `word`, and returns the match
 * length, or 0 if no match */
static size_t wordMatchLen(const char *p, const char *word)
{
    size_t n = 0;

    while (*p && *p == *word)
    {
        p++;
        word++;
        n++;
    }

    if (*p == 0 || *p == ' ' || *p == '\t' || *p == '!')
        return n;

    return 0;
}

static boolean isMap(const char *line)
{
    return (wordMatchLen(line, "map") == 3 ||
            wordMatchLen(line, "nmap") >= 2 ||
            wordMatchLen(line, "vmap") >= 2 ||
            wordMatchLen(line, "omap") >= 2 ||
            wordMatchLen(line, "imap") >= 2 ||
            wordMatchLen(line, "lmap") >= 2 ||
            wordMatchLen(line, "cmap") >= 2 ||
            wordMatchLen(line, "noremap") >= 2 ||
            wordMatchLen(line, "nnoremap") >= 3 ||
            wordMatchLen(line, "vnoremap") >= 3 ||
            wordMatchLen(line, "onoremap") >= 3 ||
            wordMatchLen(line, "inoremap") >= 3 ||
            wordMatchLen(line, "lnoremap") >= 3 ||
            wordMatchLen(line, "cnoremap") >= 3);
}

Or if we really care about performances, go ahead and write the logic down, maybe something like this:

/* checks if `line` starts with a map kewyord:
 *   map               requires whole match
 *   [nvoilc]map       requires at least 2 chars prefix (e.g. "nm")
 *   noremap           requires at least 2 chars prefix (e.g. "no")
 *   [nvoilc]noremap   requires at least 3 chars prefix (e.g. "vno")
 */
static boolean isMap(const char *line)
{
    int offset = 0;

    /* match the prefix */
    if (strchr ("nvoilc", line[0]))
    {
        if (line[1] == 'n')
            line++;
        else if (line[1] == 'm')
        {
            /* map only require a 1 char match on "map" when it is prefixed */
            offset = 2;
            line++;
        }
    }

    if (line[0] == 'n')
        return (wordMatchLen(line, "noremap") + offset - 2 /* min len */) >= 0;
    else
        return (wordMatchLen(line, "map") + offset - 3 /* min len */) >= 0;
}

I don't think it's much harder to read (and could probably made easier sacrificing some lines) and it is most likely faster.

masatake · 2014-11-23T08:18:08Z

Thank you for comments. I withdraw this pull-request.
(We can use bundled regex source. So it should not be optional.)

Your wordMatchLen based code looks elegant.

How about adding the condtion/operator for comapring as arguments of wordMatchLen.

enum comp { EQ, MORE_OR_EQ };
static size_t wordMatchLen(const char *p, const char *word, unsigned int minimal_len, enum comp como)
{
...
}

static boolean isMap(const char *line)
{
return (wordMatchLen(line, "map", 3, EQ) ||
wordMatchLen(line, "nmap", 2, MORE_OR_EQ) ||
wordMatchLen(line, "vmap", 2, MORE_OR_EQ) ||...

...

b4n · 2014-11-23T13:46:30Z

How about adding the condtion/operator for comapring as arguments of wordMatchLen.

enum comp { EQ, MORE_OR_EQ };
[…]

Actually the operator is not useful as it's always >= actually, as the only == uses already the whole word's length, so it has the same meaning as >=.
And yes, if we prefer we can move the >= len check inside wordMatchLen() with something like this:

static boolean wordMatchLen(const char *p, const char *word, size_t min_len)
{
    /* ... */
    return n >= min_len;
}

If you want I can make a patch with either approach, or you can go ahead and commit it yourself I don't mind.

masatake · 2014-11-23T16:59:40Z

Yes, could you make the patch (and merge)?

Closes universal-ctags#134

b4n · 2014-11-23T22:33:07Z

I created #141 with the code. I'll merge later if you don't tell me it's not the way you saw it :)

BTW, other parts of the code could be improved by using something similar to wordMatchLen(), but as I don't know the specific rules about word termination and the like for Vim files I don't think I can do it reliably. I guess what would help is a generic wordMatchLen() (if the current one is not generic enough) and a new wordSkip() that skips past the current word -- or make wordMatchLen() do it in some way, by e.g. having an out arg for the word's end.

masatake · 2014-11-24T08:26:01Z

Thank you.

I agree with you; it is the reason why I talked about performance.
I don't know well about vimrc, too. However this kind of string comparison(is there any good name?) may be here and there in vim syntax. So using the technique compiling the pattern into regex widely will contribute the performance.

Remeber my original patch. "a bit overly complex" can be rearranged as common utility code. other vim parser related code can use the utility.

b4n · 2014-11-24T15:40:24Z

I don't know well about vimrc, too. However this kind of string comparison(is there any good name?) may be here and there in vim syntax. So using the technique compiling the pattern into regex widely will contribute the performance.

I don't know, because when I look around the vim.c file it does seem to match some words (so something like wordMatchLen() could come handy), but not the same ones, so a pattern wouldn't help much -- there would need to have quite a few of them. For example for things like vim.c:601, vim.c:225, etc.

b4n · 2014-11-24T16:05:41Z

Just to see how it could be (I do not know Vim syntax so this probably isn't strictly right -- for example for matching a word it's probably better to recognize what can be part of one rather than not part of it), here's a diff that makes uses of more wordMatchLen() and a new skipWord():

diff --git a/vim.c b/vim.c
index 7323520..b54644b 100644
--- a/vim.c
+++ b/vim.c
@@ -138,6 +138,11 @@ static const unsigned char* skipPrefix (const unsigned char* name, int *scope)
    return result;
 }

+static boolean isWordDelimiter(const unsigned char c)
+{
+   return (c == 0 || c == ' ' || c == '\t' || c == '!');
+}
+
 /* checks if a word at the start of `p` matches at least `min_len` first
  * characters from `word` */
 static boolean wordMatchLen(const unsigned char *p, const char *const word, size_t min_len)
@@ -152,12 +157,20 @@ static boolean wordMatchLen(const unsigned char *p, const char *const word, size
        n++;
    }

-   if (*p == 0 || *p == ' ' || *p == '\t' || *p == '!')
+   if (isWordDelimiter (*p))
        return n >= min_len;

    return FALSE;
 }

+static const unsigned char *skipWord(const unsigned char *p)
+{
+   /* FIXME: maybe rather check what *is* a word? */
+   while (*p && !isWordDelimiter (*p))
+       p++;
+   return p;
+}
+
 static boolean isMap (const unsigned char* line)
 {
    /*
@@ -220,12 +233,8 @@ static void parseFunction (const unsigned char *line)
    /* boolean inFunction = FALSE; */
    int scope;

-   const unsigned char *cp = line + 1;
+   const unsigned char *cp = line;

-   if ((int) *++cp == 'n'  &&  (int) *++cp == 'c'  &&
-       (int) *++cp == 't'  &&  (int) *++cp == 'i'  &&
-       (int) *++cp == 'o'  &&  (int) *++cp == 'n')
-           ++cp;
    if ((int) *cp == '!')
        ++cp;
    if (isspace ((int) *cp))
@@ -257,12 +266,7 @@ static void parseFunction (const unsigned char *line)
    /* TODO - update struct to indicate inside function */
    while ((line = readVimLine ()) != NULL)
    {
-       /* 
-        * Vim7 added the for/endfo[r] construct, so we must first
-        * check for an "endfo", before a "endf"
-        */
-       if ( (!strncmp ((const char*) line, "endfo", (size_t) 5) == 0) && 
-               (strncmp ((const char*) line, "endf", (size_t) 4) == 0)   )
+       if (wordMatchLen (line, "endfunction", 4))
            break;

        parseVimLine(line, TRUE);
@@ -275,10 +279,7 @@ static void parseAutogroup (const unsigned char *line)
    vString *name = vStringNew ();

    /* Found Autocommand Group (augroup) */
-   const unsigned char *cp = line + 2;
-   if ((int) *++cp == 'r' && (int) *++cp == 'o' &&
-           (int) *++cp == 'u' && (int) *++cp == 'p')
-       ++cp;
+   const unsigned char *cp = line;
    if (isspace ((int) *cp))
    {
        while (*cp && isspace ((int) *cp))
@@ -339,15 +340,9 @@ static boolean parseCommand (const unsigned char *line)
        while (*cp && isspace ((int) *cp))
            ++cp; 
    }
-   else if ( line && 
-                     (!strncmp ((const char*) line, "comp", (size_t) 4) == 0) && 
-                       (!strncmp ((const char*) line, "comc", (size_t) 4) == 0) && 
-                         (strncmp ((const char*) line, "com", (size_t) 3) == 0) )
+   else if ( line && wordMatchLen (cp, "command", 3) )
    {
-       cp += 2;
-       if ((int) *++cp == 'm' && (int) *++cp == 'a' &&
-               (int) *++cp == 'n' && (int) *++cp == 'd')
-           ++cp;
+       cp = skipWord (cp);

        if ((int) *cp == '!')
            ++cp;
@@ -438,7 +433,7 @@ static void parseLet (const unsigned char *line, int infunction)
 {
    vString *name = vStringNew ();

-   const unsigned char *cp = line + 3;
+   const unsigned char *cp = line;
    const unsigned char *np = line;
    /* get the name */
    if (isspace ((int) *cp))
@@ -578,9 +573,7 @@ static boolean parseMap (const unsigned char *line)

 static boolean isCommand (const unsigned char* line)
 {
-   return ( (!strncmp ((const char*) line, "comp", (size_t) 4) == 0) &&
-           (!strncmp ((const char*) line, "comc", (size_t) 4) == 0) &&
-           (strncmp ((const char*) line, "com", (size_t) 3) == 0) );
+   return wordMatchLen (line, "command", 3);
 }

 static boolean parseVimLine (const unsigned char *line, int infunction)
@@ -598,19 +591,19 @@ static boolean parseVimLine (const unsigned char *line, int infunction)
        parseMap(line);
    }

-   else if (strncmp ((const char*) line, "fu", (size_t) 2) == 0)
+   else if (wordMatchLen (line, "function", 2))
    {
-       parseFunction(line);
+       parseFunction(skipWord(line));
    }

-   else if (strncmp ((const char*) line, "aug", (size_t) 3) == 0)
+   else if (wordMatchLen (line, "augroup", 3))
    {
-       parseAutogroup(line);
+       parseAutogroup(skipWord(line));
    }

-   else if ( strncmp ((const char*) line, "let", (size_t) 3) == 0 )
+   else if (wordMatchLen (line, "let", 3))
    {
-       parseLet(line, infunction);
+       parseLet(skipWord(line), infunction);
    }

    return readNextLine;

@blueyed (as you seem to be the de facto vim.c maintainer): if you're interested by this kind of changes, I can either make a PR or you can steal this code.

blueyed · 2014-11-24T17:15:39Z

@b4n
Feel free to go for it.
If you do a PR, I'll be glad to review it.

as you seem to be the de facto vim.c maintainer

Well, I have only fixed problems with the parser, as they appeared to me.
I am glad that it worked out to be useful, but please do not consider me to be the maintainer.. :)

b4n · 2014-11-24T18:00:48Z

PR opened at #146

I am glad that it worked out to be useful, but please do not consider me to be the maintainer.. :)

Hehe ^^ Well, I assumed it from you making most of the recent changes that's all -- but actually all I wanted to find was someone that knows the syntax and would have an opinion, and it looks like I still found you :)

Closes universal-ctags#134

…r-str Fix SEGV in onig_error_code_to_str() (Fix universal-ctags#132)

vim: reimplement isMap with regex

68ad9f8

Signed-off-by: Masatake YAMATO <yamato@redhat.com>

b4n added a commit to b4n/fishman-ctags that referenced this pull request Nov 23, 2014

vim: Simplify word matching in isMap()

576e22e

Closes universal-ctags#134

b4n mentioned this pull request Nov 23, 2014

vim: Simplify word matching in isMap() #141

Merged

masatake closed this in #141 Nov 24, 2014

masatake deleted the vim-regex-based-isMap branch November 24, 2014 08:20

nomad-software pushed a commit to nomad-software/ctags that referenced this pull request Nov 29, 2014

vim: Simplify word matching in isMap()

9007a60

Closes universal-ctags#134

masatake pushed a commit to masatake/ctags that referenced this pull request Mar 12, 2020

Merge pull request universal-ctags#134 from k-takata/fix-segv-in-erro…

9827d5a

…r-str Fix SEGV in onig_error_code_to_str() (Fix universal-ctags#132)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vim: reimplement isMap with regex #134

vim: reimplement isMap with regex #134

masatake commented Nov 22, 2014

b4n commented Nov 22, 2014

masatake commented Nov 23, 2014

b4n commented Nov 23, 2014

masatake commented Nov 23, 2014

b4n commented Nov 23, 2014

masatake commented Nov 24, 2014

b4n commented Nov 24, 2014

b4n commented Nov 24, 2014

blueyed commented Nov 24, 2014

b4n commented Nov 24, 2014

vim: reimplement isMap with regex #134

vim: reimplement isMap with regex #134

Conversation

masatake commented Nov 22, 2014

b4n commented Nov 22, 2014

masatake commented Nov 23, 2014

b4n commented Nov 23, 2014

masatake commented Nov 23, 2014

b4n commented Nov 23, 2014

masatake commented Nov 24, 2014

b4n commented Nov 24, 2014

b4n commented Nov 24, 2014

blueyed commented Nov 24, 2014

b4n commented Nov 24, 2014