-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new API function pcre2_set_optimization() for controlling enabled optimizations #471
Conversation
There are obviously a couple of things missing here (notably, documentation), but I am pushing it to allow for review and comments. |
One thing I would like to ask: Are there any other existing option bits which are used to turn optimizations on/off? Is it just the 3 which I have already identified? |
@@ -646,6 +647,7 @@ typedef struct pcre2_real_code { | |||
uint16_t top_backref; /* Highest numbered back reference */ | |||
uint16_t name_entry_size; /* Size (code units) of table entries */ | |||
uint16_t name_count; /* Number of name entries in the table */ | |||
uint32_t optimization_flags; /* Optimizations enabled at compile time */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated to this code, but this shouldn't be needed here, and indeed the fact that it is needed so that PCRE2_NO_START_OPTIMIZE
doesn't break is telling us IMHO there is a problem somewhere else.
this is also why I didn't like the idea of getting all optimizations together, since there are a lot more but those 3 (and JIT) are the only ones that are externally visible because have specific flags to disable them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's more than one way to look at this issue. In any case, any specific suggestion for improvement would be welcome.
Philip, Carlo, Zoltan: As usual, the code review on this project is great! |
Pushing an update which addresses many of the points raised by reviewers. |
Conversion is perfect, since it happens internally. During compile, the optimizations flags, which are already present as compile flags are "or'ed" to compile flags, so whichever is set, it activates the optimization (they don't need to have the same index). This should be easy since only a few flags are affected. For these optimizations the already present infrastructure will handle them (the optimization flags can be overwritten by compile flags). For newer optimizations, the optimization flagset is directly used. |
Just a couple of thoughts here... Right now, the default optimization behavior of PCRE2 is "enable all optimizations by default, disable selectively using option bits". For various reasons, it might become desirable to deviate from this behavior in the future. Reason 1: There is more than one metric which can be optimized for. It's possible to optimize for small memory footprint at the cost of CPU time, or vice versa. Also, it's possible to optimize for low average CPU time at the cost of worst-case CPU time, or vice versa. Reason 2: There may be certain "dangerous" optimizations, which improve performance at the cost of disabling certain features, or causing observable behavior changes. (For example: While DFA-based matching is desirable for consistent worst-case performance, the documentation for For both of these reasons, I don't think we can assume that the default behavior will always be to enable all optimizations. Sooner or later, some may need to be opt-in rather than opt-out. For this reason, I am thinking of adding a directive called Any comments? |
OK, latest contributions from Carlo and Zoltan have been incorporated. I hope to add tests and documentation tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a few questions.
src/pcre2_compile.c
Outdated
@@ -11005,7 +11051,7 @@ used in this code because at least one compiler gives a warning about loss of | |||
"const" attribute if the cast (PCRE2_UCHAR *)codestart is used directly in the | |||
function call. */ | |||
|
|||
if (errorcode == 0 && (re->overall_options & PCRE2_NO_AUTO_POSSESS) == 0) | |||
if (errorcode == 0 && (optim_flags & PCRE2_OPTIM_AUTO_POSSESS)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In many cases the == 0
and != 0
are removed. Is this a new coding style? We should eventually update all cases them, except for pointers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just trying to make the code shorter and didn't see any reason to include != 0
. However, I can add it back, that isn't a problem.
I am a new contributor to this project and don't know all the local conventions yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one of my "fetishes" -- I only ever use if (x) when x is a Boolean; if it's not I use if (x != 0) to emphasize that it is a numerical value. Similar to my use of if (pointer != NULL) to emphasize that the variable is a pointer (though in principle, NULL need not be zero, so this is more correct C).
Alex wrote:
I have no objection - we do already have pcre2_posix_test.c and pcre2_jit_test.c, but remember that any such program has to work with every possible combination of 8, 16, or 32 bit libraries. |
I don't see a use for this and even believe it is likely to cause confusion. The default is currently all on and when that changes you will still use PCRE2_OPTIMIZATION_FULL to override that default if you so desire so. This API call (just like the other On second thought, if you are concerned about reusing the context WHEN the default is no longer full, it might be better to wait until then to add this IMHO. |
OK, well, we have tests now. Next is to add documentation. |
The code is now using the new field, so why is Sure; we can only do the API call if we have a context, so to test the "legacy" codepath need to probably piggyback on the Still think that an API with an extra BOOL parameter to turn on/off the optimization makes more sense as well. |
PCRE2 needs to both support the new |
We should also test combinations, when a compile and optimization flag contradict to each other (compile flag should be stronger). Since we might need to add lots of test for different optimizations in the future, we could even start a new input/output file for them. |
I wasn't arguing about having more modifiers, although was explaining that we could do with almost none by abusing what we currently have, but that the additional modifiers are only used sparingly and therefore the new code is barely exercised. But changing what the current modifiers do testing will be more comprehensive, and of course adding tests that go through the "legacy" path was also expected. |
Ah, good point. Let me add that right now. |
Ah, true. That is a good point. Hmm. I do think that for PCRE2 developers, it makes things a bit easier if the Hmm. 🤔 |
Done, you'll see it on the next push. |
Just thinking about this... The tests do demonstrate that I think what needs more thorough testing is that the actual optimizations behave as expected and do not break anything. But we already have existing tests for that. Right now we are just testing a different way of switching them on and off. |
Thanks for the feedback. 👍 Please check the update which I have just pushed. |
On a fairly quick read (I know we want to get on with this) that looks good. Minor style nitpick is that you refer to pcre2_match rather than pcre2_match(). Are we now ready to merge this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good patch, only a few minor things remained.
Thanks for noting that. Fixed. |
Yes, we do want to get on with this, but to the extent possible, we also want to get the API "right" the first time rather than making breaking changes later. So by all means, let's invite more comments from reviewers especially if they concern the design of the API, as visible to client programmers. Non-breaking improvements to the implementation can be done any time. |
doc/html/pcre2api.html
Outdated
backtracks into a+ that can never be successful. However, if callouts are in | ||
use, auto-possessification means that some callouts are never taken. You can | ||
set this option if you want the matching functions to do a full unoptimized | ||
search and run all the callouts, but it is mainly provided for testing | ||
purposes. | ||
purposes. (It is recommended to use <b>pcre2_set_optimize</b> instead.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should mention the directive to use as well, and probably also mention that if both are given this one takes precedence.
the message should also be in its own paragraph instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know why I can't add a reply to the above comment about PCRE2_ERROR_BADOPTION
. Anyways, what I would like to say is that, while the other "set" functions return PCRE2_ERROR_BADDATA
on invalid input, I feel that the symbolic name BADDATA
does not describe the actual error well in this case. For other "set" functions it does make sense; for example, if the value passed for "maximum heap size" or "maximum variable lookbehind length" is invalid, we could (in natural human language) describe that as "bad data". In this case, my preferred name for the error code would be something like PCRE2_UNKNOWN_DIRECTIVE
. Do you think it would be a good idea to add a new error code for this reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The suggestion re: PCRE2_ERROR_NULL
is fine and I have implemented it.
@@ -3925,6 +3934,21 @@ for (;;) | |||
else *((uint32_t *)field) |= m->value; | |||
break; | |||
|
|||
case MOD_OPTMZ: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI there is a macro to help with these code, but agree is not needed if it is only called once (which I was hoping would change, though)
Converted |
Just pushed another update. |
@@ -10432,6 +10439,18 @@ if (patlen > ccontext->max_pattern_length) | |||
return NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic above that restrict PCRE2_LITERAL needs to be also likely updated for when the optimizations are disabled through the new API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. So it shouldn't be allowed to enable/disable optimizations when the pattern is literal? But some optimizations, like START_OPTIMIZE, still make sense on a literal pattern.
I appreciate the suggestion but don't see a good reason right now to make a change in this area.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it shouldn't be allowed to enable/disable optimizations when the pattern is literal?
I think you misread the code; it is more like you can't use the PCRE2_LITERAL option unless those two optimizations are enabled. Not sure if there might be another similar restrictions but @PhilipHazel could probably explain the rationale for this one better, or if there are any other.
don't see a good reason right now to make a change in this area.
It would be safer if you do IMHO, specially considering that the "legacy" modifiers still set the "legacy" bits ONLY, so we will need to construct a new test to exercise this new codepath an therefore there is nothing in the current tests that would cover it, even if it is obviously broken, as this (and their equivalent for the other optimization) should behave the same and don't:
PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
re> //literal,no_auto_possess
Failed: error 192 at offset 0: invalid option bits with PCRE2_LITERAL
re> //literal,auto_possess_off
data>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding further detail. I will read the relevant section of code again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I looked at the code again and it still seems that my first reading was correct.
PUBLIC_COMPILE_OPTIONS lists all the bits which can normally be set in options; but if PCRE2_LITERAL is set, then only the bits in PUBLIC_LITERAL_COMPILE_OPTIONS can be set, or else compilation fails with an error. PCRE2_NO_AUTO_POSSESS and PCRE2_NO_DOTSTAR_ANCHOR are not in PUBLIC_LITERAL_COMPILE_OPTIONS, so those two options cannot be used to disable optimizations for literal patterns.
Doubtless this is because the 2 optimizations concerned do not apply to literal patterns anyways, so it doesn't make much sense to specifically disable them in that case.
Again, this raises the question: should we block users from enabling/disabling certain optimizations in the compile context for literal patterns? And again, my opinion is: no, I don't think there is a good reason to do that.
One reason is because the compile context can be re-used for multiple calls to pcre2_compile()
. It might happen that some of those calls are for literal patterns, while others are not.
So again, from my viewpoint, this is a desired difference in behavior between the legacy option bits and the new API function. However, if other key contributors agree that the use of pcre2_set_optimize()
should be restricted if the compile context is subsequently used for a literal pattern... well and good, I can implement it in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The seeming "inconsistency" in this case has already been looked
My bad, should had used a different word; the "inconsistency" I am referring to, is that the code and documentation don't match, let me know if you need me to provide a patch to help drive this discussion.
Zoltan's suggestion is the same as the current behavior of this PR.
I know, and that is why I mentioned it as the first option that we could take.
We could also decide on a combination of them, for example letting the PCRE2_NO_DOTSTAR_ANCHOR to not fail pcre2_compile(), is also a very minimal change and apparently safe, from what you reported earlier. Eitherway all the assumptions we are making should be documented.
Tests have already been added confirming that the new API does not cause an error
A test that shows the error from pcre2_compile()
while using the "legacy" modifiers as I proposed in the original comments would be nice as well, since we are "committing" to that as the expected behaviour going forward.
Tests that show possible edge cases reflected from your auditing of the code even better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Carlo, I'm happy to update the documentation, but I'm not sure which part is currently not matching the code. Can you clarify?
I will add tests showing the error when the legacy modifiers are used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new optimization flags should not change this. I.e. if an optimization is not allowed in LITERAL, that optimization is ignored in the new option flags.
Instead of ignoring to disable the optimization, it is actually disabling it (even if it is known to conflict with LITERAL)
@alexdowad: could this be corrected and documented?
FWIW the test part of that documentation should show (maybe using BI
modifiers, as it is done for other cases) that the optimization isn't disabled when the literal modifier is used as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Carlo, thanks for explaining further what you meant. Further tests have been added showing what happens when the legacy options are used.
I'm glad that you are carefully checking what is needed for documentation. I wholeheartedly agree that for a library like PCRE2, documentation is of utmost importance. We both feel the same on that issue. 👍🏻
For the documentation, I can say there is nothing which needs to be added regarding this issue. I hope this does not give the (false) impression that I want to push this PR through quickly, while sweeping issues under the rug; that is not the case. I'm not sure to what extent you understand the auto-possess and dotstar anchor optimizations, but please bear with me as I explain these. I believe that if the following points are understood, we will be able to successfully conclude this discussion, with everyone on the same page.
Let's start with AUTO_POSSESS. What does this optimization do? It takes a regex like /x+y/
and converts it to /x++y/
. This does not help performance when the regex matches, but when it doesn't match, the amount of useless backtracking is reduced, so the match can fail faster and move on to another position.
Now, could that optimization ever apply to a literal regex? No, it is impossible! Why? Because a literal regex can never contain any quantifiers! So how would it be possible to convert quantifiers to possessive ones? Remember, a literal regex can only contain literal characters. The regex /x+y/
, if compiled with PCRE2_LITERAL, will just match a literal x
, a literal +
, and a literal y
.
How about DOTSTAR_ANCHOR? This optimization takes a regex like /.*abc/
and marks it as matching only at the beginning of a line. Then the regex engine won't try matching it in the middle of a line; it will skip to the beginning of the next line instead.
Could that ever apply to a literal regex? No, it is impossible! But why? Think about it: a literal regex can never contain META_DOT (it can only contain a literal period). And a literal regex can never include the quantifier star either (it can only contain a literal asterisk). So how could dotstar anchoring ever be done to a literal regex? It cannot!
So both AUTO_POSSESS and DOTSTAR_ANCHOR, regardless of whether they are enabled or disabled, could never have any effect whatsoever on a literal regex. They are completely just no-ops in that case.
Does it make more sense now why pcre2_compile()
does not allow the PCRE2_NO_AUTO_POSSESS and PCRE2_NO_DOTSTAR_ANCHOR options to be used for literal patterns? It is because these options are useless on literal regexes.
Does it also make more sense now why pcre2_set_optimize()
allows PCRE2_{AUTO_POSSESS,DOTSTAR_ANCHOR}{,_OFF} regardless of whether we are compiling literal regexes or not? It is because the pcre2_compile_context
can be re-used multiple times, and even if it is used to compile a literal regex now, the same compile context might be used for a non-literal regex next time.
Note this point: The options
argument to pcre2_compile
fundamentally applies to one compilation only. However, the directive
argument to pcre2_set_optimize
applies to a series of one or more compilations. So it is not surprising that while certain useless combinations are restricted from appearing in options
, directive
is handled a bit differently.
The documentation already accurately explains that only a few options may be used together with PCRE2_LITERAL, not including PCRE2_NO_{AUTO_POSSESS,DOTSTAR_ANCHOR}. That is good! At the same time, the documentation does not say anything about a relationship between pcre2_set_optimize()
and PCRE2_LITERAL
, because no such relationship exists. That is correct, and intended.
Hopefully it is clearer now that pcre2_set_optimize
is not "disabling the optimization rather than ignoring to disable it". It is also not true that these optimizations "are known to conflict with PCRE2_LITERAL".
Carlo, your code review on this PR has been great! While this particular issue was a "red herring" (meaning something which was distracting and not relevant), much of your other feedback was excellent. I can only say "2 thumbs up". 👍🏻 👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, and indeed (again) kudos in a very good "first" contribution.
I don't think it is a "red herring" though, as we both missed the point Zoltan was making originally and that implied the code also needed an update. Note that the proposed code changes are on pcre2_compile()
so I hope it helps clarify what the intended behaviour is and why "allowing" or not to disable those two optimizations in pcre2_set_optimize()
was never a question.
To make discussion on the specifics of that easier had prepared it as alexdowad#8, which completes the suggested behavior in the code and documents it so users of the new API aren't surprised by the "apparent" difference of behaviour between the legacy and new entrypoints.
I agree with you that the behaviour might be different, but must be documented somewhere.
Also agree with you that the code might be able to work with or without the optimization anyway, but since the code without the optimizations was blocked by the pcre2_compile()
error, then making sure we keep using the same code paths for the new API, only makes this proposed API "migration" safer and allow us to merge it now without further delay or the need of additional code audits in the current code.
Pushed another update. |
Can others share any further comments on what error code to return from |
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_match.c#L6764 PCRE2_ERROR_BADOPTION means the option is unknown, although the name is a bit strong. PCRE2_ERROR_NULL is returned if the pattern is not set. That means the current return values are ok for me. For the function body, the large switch does not look nice for me. For generic options, such as FULL, the switch is perfect, but for option bits, I liked the old code (which could be placed in the default case, and returns with BADOPTION if the bit is not known). As for LITERAL, it is ok if the compile limits the flags (we should not change this), but for compile context, which is a generic configuration store, there should be no limitations. The unrelated bits can be simply ignored. This is true for other members: if malloc is not needed by a function, it simply does not use it. |
Just updating code to match Zoltan's preference on the use of |
Pushed another update. |
A minor issue and easy to miss is that the generated html documentation is not linked correctly; had prepared alexdowad#7 where the first commit show the missing bits that could be cherry-picked over to fix it. That branch is a little old, and so the continuous rebasing of this PR makes it an odd target for a simple merge/rebase but hope you find also the other commits useful otherwise. |
… optimizations It is anticipated that over time, more and more optimizations will be added to PCRE2, and we want to be able to switch optimizations off/on, both for testing purposes and to be able to work around bugs in a released library version. The number of free bits left in the compile options word is very small. Hence, we will start putting all optimization enable/disable flags in a separate word. To switch these off/on, the new API function pcre2_set_optimization() will be used. The values which can be passed to pcre2_set_optimization() are different from the internal flag bit values. The values accepted by pcre2_set_optimization() are contiguous integers, so there is no danger of ever running out of them. This means in the future, the internal representation can be changed at any time without breaking backwards compatibility. Further, the 'directives' passed to pcre2_set_optimization() are not restricted to control a single, specific optimization. As an example, passing PCRE2_OPTIMIZATION_FULL will turn on all optimizations supported by whatever version of PCRE2 the client program happens to be linked with. Co-Authored-By: Carlo Marcelo Arenas Belón <carenas@gmail.com> Co-Authored-by: Zoltan Herczeg <hzmester@freemail.hu>
Incorporated the addition to |
Is this ready to go now? |
It is good for me. If something comes up, we can fix it later. Before the next release, we can even change the api if needed, but I think we discussed everything. |
I think that considering there are almost 150 comments, we probably discussed too much, but also run the risk of having overlooked, misunderstood or not resolved some of the points made as well. Merging it now, will hopefully help having a less volatile base to work on though, and hopefully will lead to less clutter in the communication about them. |
Yes, it's ready. |
It is anticipated that over time, more and more optimizations will be added to PCRE2, and we want to be able to switch optimizations off/on, both for testing purposes and to be able to work around bugs in a released library version.
The number of free bits left in the compile options word is very small. Hence, we will start putting all optimization enable/disable flags in a separate word. To switch these off/on, the new API function pcre2_set_optimization() will be used.
The values which can be passed to pcre2_set_optimization() are different from the internal flag bit values. The values accepted by pcre2_set_optimization() are contiguous integers, so there is no danger of ever running out of them. This means in the future, the internal representation can be changed at any time without breaking backwards compatibility. Further, the 'directives' passed to pcre2_set_optimization() are not restricted to control a single, specific optimization. As an example, passing PCRE2_FULL_OPTIMIZATION will turn on all optimizations supported by whatever version of PCRE2 the client program happens to be linked with.