Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implicit wildcard for Prefix1 rules longer than 8 characters #3011

Closed
joey04 opened this issue Sep 13, 2017 · 18 comments
Closed

Implicit wildcard for Prefix1 rules longer than 8 characters #3011

joey04 opened this issue Sep 13, 2017 · 18 comments

Comments

@joey04
Copy link

joey04 commented Sep 13, 2017

The highlight+boldface in the logger for static rule matching appears to just match the first instance of the request URL, which is not always the actual match for the rule.

At ezgif.com there's a cookie notice script blocked by adding a /cookieconsent rule (my own, not in a 3p list). Given the word-boundary matching of this rule, the logger should highlight the second instance in the URL, not the first:

untitled

I've seen this a few other times at other sites with other word-boundary rules.

@gomoku
Copy link

gomoku commented Sep 13, 2017

Assuming the word-boundary in uBO is from left to right:

a cookie notice script blocked by adding a /cookieconsent

Wrong assumption, rightly:

a cookie phrase blocked by adding a /cookieconsent

If your intention was to block strictly a javascript, not a phrase, then the rule is wrong, coz the rule is javascript specific only in your mind, coz the rule actually isn't javascript specific, because it in no way indicates a javascript, thus the rule is only a phrase specific, which highlights the first occurence of the phrase, from left to right in the logger. The correct actual javascript specific rule is /cookieconsent.min.js or /cookieconsent*.js because it contains .js which makes the rule javascript specific. I think the issue will be tagged by "filter issue" and closed.

@gorhill
Copy link
Owner

gorhill commented Sep 13, 2017

The real explanation is that with fix to #2630, for any token longer than 7 characters, the word-boundary rule does not apply to the end of the string. Your filter qualifies as a "plain" filter, which means internally it's a simple but fast plain string comparison, no regex involved.

After fixing #2630, I knew the side effect would be to no longer respect word boundary for the end of string + for token greater than 7 characters, but the efficiency gain of #2630 was too interesting so I decided the side effect was acceptable. This is what you see.

I could theoretically add a second test to the plain string test to enforce the word boundary rule, but I decided for a wait-and-see approach -- at the end of the day, I want uBO to run as best as possible on low powered devices, and I decided the quirk was acceptable.

So I will leave this issue opened, but I won't fix it for now. Eventually if ever a more serious side effect arise, then I will address it.

@joey04
Copy link
Author

joey04 commented Sep 13, 2017

Thanks for the full explanation. I was unaware of the 7 char token optimization from a few months ago. So are you saying that in my specific example, uBO is actually matching the "/cookieconsent2" portion of the URL with my rule because the first 7 chars match? This would mean that any plain word rule beyond 7 chars is an implicit wildcard up to a / delimiter, but without the regex overhead.

I could theoretically add a second test to the plain string test to enforce the word boundary rule, but I decided for a wait-and-see approach -- at the end of the day, I want uBO to run as best as possible on low powered devices, and I decided the quirk was acceptable.

Please do add the second test for plain string rules. (Implicit wildcards are confusing and result in unwanted surprises.) There are hundreds of >7 char plain text rules in EasyList already, so there could be plenty of cases like mine out there in the wild that go largely unnoticed.

@gorhill
Copy link
Owner

gorhill commented Sep 13, 2017

There are hundreds of >7 char plain text rules in EasyList already

Yes, and it just happens that this matches EasyList's assumption of no word boundary.

@joey04
Copy link
Author

joey04 commented Sep 13, 2017

But that contradicts #2902. It'd be best to have them all the same, regardless of string length, even with a tiny performance overhead.

@joey04 joey04 changed the title Logger: highlights first match (not the actual match) Implicit wildcard for plain string rules longer than 7 chars Sep 13, 2017
@gorhill
Copy link
Owner

gorhill commented Sep 13, 2017

You don't mind a speculated "tiny performance overhead" but you mind the "no negative side effect to filtering".

Just accept that I disagree with you wanting me to fix this now. I repeat:

So I will leave this issue opened, but I won't fix it for now. Eventually if ever a more serious side effect arise, then I will address it.

@joey04
Copy link
Author

joey04 commented Sep 13, 2017

I respect your decision but would like to get this contradiction resolved at some point.

Is there a way to quickly check that if the plain string rule is longer than 7 chars then do a string compare of the plain rule with the delimited URL chunk? If so, that would be a very tiny performance overhead that would ensure a consistent word-boundary policy. It's tiny because you only do the second comparison in the event of a token match.

(In my example, that would be a quick string compare of "cookieconsent" with "cookieconsent2" after the 7-char token match. This would fail, but the second tokenized URL chunk would match the word-boundary check.)

@gorhill
Copy link
Owner

gorhill commented Sep 13, 2017

While at it since I see this as related, just a thought: with the advent of WebAssembly, all these quirks and choices are likely to become obsolete if uBO were to move to a WebAssembly-based static filtering engine, because it would no longer have to deal with javascript getting in the way, and only one filter representation internally would be needed instead of many optimized for specific purpose. So maybe this is only then I will address the issue here.

@joey04
Copy link
Author

joey04 commented Sep 13, 2017

Wow, WebAssembly would be really cool!
But what about the idea I just posted? (Since we posted at the same time.)

@joey04
Copy link
Author

joey04 commented Sep 14, 2017

I decided to figure this out for myself. Here's my modified plain string test:

FilterPlain.prototype.match = function(url, tokenBeg) {
    // https://github.com/gorhill/uBlock/issues/3011
    if ( url.startsWith(this.s, tokenBeg - this.tokenBeg) === true ) {
        if ( this.s.length > 7 ) {
            // _validTokenChars after filter violates word boundary policy
            if ( /[a-z0-9%]/.test(url.charAt(this.s.length + tokenBeg - this.tokenBeg)) === true ) {
                return false;
            }
        }
        return true;
    }
    return false;
};

This is supposed to return false after finding the "2" in the "cookieconsent2" directory portion of the URL, then later return true when it finds the "cookieconsent" filename.

But everything is working the same as before blocking and logger-wise, i.e. the logger is the exact same as my first post.

Is there something wrong with my method? Or is it correct and the logger merely reports the first match?

@joey04
Copy link
Author

joey04 commented Sep 14, 2017

SOLVED IT
Turns out I was functionally correct (except for len > 8 to account for the '/' initial char) but had modified the wrong function. That one, linked by gorhill earlier, must be the base class, but this particular rule is a Prefix1. (I don't know what that means, but console logs honed in on it for me.)

So now I have a working word-boundary enforcer for the /word type of rule:

FilterPlainPrefix1.prototype.match = function(url, tokenBeg) {
    //console.log('\tpre1 this.s: %s  len:%d', this.s, this.s.length);
    // https://github.com/gorhill/uBlock/issues/3011
    if ( url.startsWith(this.s, tokenBeg - 1) === true ) {
        if ( this.s.length > 8 ) {
            // _validTokenChars after filter violates word boundary policy
            if ( /[a-z0-9%]/.test(url.charAt(this.s.length + tokenBeg - 1)) === true ) {
                //console.log('\t\tWB violation \'%s\' at tokenBeg-1: %d', url.charAt(this.s.length + tokenBeg - 1), (tokenBeg - 1));
                return false;
            }
        }
        //console.log('\t\tvalid WB \'%s\' at tokenBeg-1: %d', url.charAt(this.s.length + tokenBeg - 1), (tokenBeg - 1));
        return true;
    }
    return false;
};

Here's console log for my particular rule:

untitled

The logger is still the same, so that's a bug there, but I can live with that knowing that word boundaries are now properly enforced for all of the rules I use.

A general word-boundary fix for any other types of rules would have to be provided by gorhill.

@joey04
Copy link
Author

joey04 commented Sep 14, 2017

Some follow-up after testing this more.

Now I realize Prefix1 means a 1-char prefix ('/', '_', etc.) before the tokenized word. (There's also a Prefix0 and other types, hence the need for a generalized rule, along with more appreciation and empathy from me for gorhill doing the hard, expert work of making this classification and optimizing it with things like tokens.)

I recently went entirely solo for uBO in my primary browser, which is why I'm deep-diving these things. Without the massive 3rd party lists, you have to figure out these rule patterns for yourself. Thankfully I've been able to achieve great coverage with a surprisingly small amount of rules, like those of the /word variety. Fortunately for me, the Prefix1 case is all I need.

I briefly stubbed out my length check to test any of my Prefix1, just to learn more of what's going on. The only other word-boundary violations I saw were these specific to IMDb (that I've had even before going solo)

logger

console

This is good confirmation of 2 things for me. 1) The word-boundary is indeed enforced, as there is only 1 match for each of these rules and the request was not blocked. 2) Need to keep my word rules limited to tokenized chars to avoid potential problems with any future rules I may add. (In this case the regular length > 8 check wouldn't matter, but I still removed the '-' chars for my own consistency.)

@ghost
Copy link

ghost commented Sep 16, 2017

That was a pretty interesting read up, by both joey04 and gorhill.

Especially what gorhill mentioned about WebAssembly too.

@joey04
Copy link
Author

joey04 commented Sep 20, 2017

In crafting a solution for #3037 it occurred to me that it's better to find the word boundary rather than a violation of it. So I've refined the above solution to be more robust, now able to find a non-token char at the end of the rule. This means both /longstring and /longstring- are properly detected, removing the restriction I mentioned in my prior post.

Here's the new version:

FilterPlainPrefix1.prototype.match = function(url, tokenBeg) {
    // https://github.com/gorhill/uBlock/issues/3011
    if ( url.startsWith(this.s, tokenBeg - 1) === true ) {
        if ( this.s.length > 8 ) {
            // not _validTokenChars is word boundary
            if ( /[a-z0-9%]/.test(url.charAt(this.s.length + tokenBeg - 1)) === false ||
                 /[a-z0-9%]/.test(url.charAt(this.s.length + tokenBeg - 2)) === false ) {
                return true;
            }
            return false;
        }
        return true;
    }
    return false;
};

@joey04
Copy link
Author

joey04 commented Mar 25, 2018

I revisited this today, since I was wondering if I could replace a group of related prefix1 rules (e.g. /word and _word) with a single prefix0 rule (e.g. word).

But it turns out that word is compiled to a hostname rule (thus equivalent to ||word^) for compatibility with HOSTS lists. So it's not actually a prefix0 rule. (I'm not even sure if one can actually make a prefix0 rule; perhaps by appending a non-token character? Since it's of no use to me, I won't experiment further.)

I now conclude that my prefix1 word-boundary enforcement covers all practical use cases, including the many prefix1's in EasyList.

@joey04 joey04 changed the title Implicit wildcard for plain string rules longer than 7 chars Implicit wildcard for Prefix1 rules longer than 8 characters Mar 25, 2018
@joey04
Copy link
Author

joey04 commented Aug 25, 2018

I'm closing this now because it's really only an issue for those (few) of us that decide to manage rules ourselves. It doesn't truly need to be fixed for the default list setup, given that EasyList terminates virtually all of its rules of this type with a boundary character.

@joey04 joey04 closed this as completed Aug 25, 2018
@gorhill gorhill reopened this Aug 25, 2018
@gorhill
Copy link
Owner

gorhill commented Aug 25, 2018

I thinks its best that the logger shows the real match which occurred inside uBO.

@gorhill
Copy link
Owner

gorhill commented Aug 25, 2018

Oops I meant to reopen the other: #3037.

@gorhill gorhill closed this as completed Aug 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants