Add new API function pcre2_set_optimization() for controlling enabled optimizations #471

Merged: 1 commit, Sep 21, 2024
47 changes: 47 additions & 0 deletions doc/html/pcre2_set_optimize.html
@@ -0,0 +1,47 @@
<html>
<head>
<title>pcre2_set_optimize specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_optimize man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_optimize(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>directive</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function controls which performance optimizations will be applied
by <b>pcre2_compile()</b>. It can be called multiple times with the same compile
context; the effects are cumulative, with the effects of later calls taking
precedence over earlier ones.
</P>
<P>
The result is zero for success, PCRE2_ERROR_NULL if <i>ccontext</i> is NULL,
or PCRE2_ERROR_BADOPTION if <i>directive</i> is unknown. The latter could be
useful to detect if a certain optimization is available.
</P>
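As a rough illustration of the calling convention described above, the following toy model in C mimics how successive directives accumulate on one context and how the two error cases are reported. Every name in it (toy_context, toy_set_optimize, and the OPT_*, FLAG_* and ERR_* constants) is hypothetical and chosen for this sketch. It is not PCRE2's implementation, and the real directive and error values are those defined in pcre2.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical directive values for this sketch; NOT the real PCRE2 constants. */
enum {
    OPT_NONE, OPT_FULL,
    OPT_AUTO_POSSESS, OPT_AUTO_POSSESS_OFF,
    OPT_DOTSTAR_ANCHOR, OPT_DOTSTAR_ANCHOR_OFF,
    OPT_START_OPTIMIZE, OPT_START_OPTIMIZE_OFF
};

/* Hypothetical internal flag bits, one per optimization. */
#define FLAG_AUTO_POSSESS   0x1u
#define FLAG_DOTSTAR_ANCHOR 0x2u
#define FLAG_START_OPTIMIZE 0x4u
#define FLAG_ALL            0x7u

/* Hypothetical stand-ins for PCRE2_ERROR_NULL and PCRE2_ERROR_BADOPTION. */
#define ERR_NULL      (-1)
#define ERR_BADOPTION (-2)

typedef struct { uint32_t flags; } toy_context;

/* A fresh context starts with every optimization enabled (the documented default). */
void toy_context_init(toy_context *ctx) { ctx->flags = FLAG_ALL; }

/* Apply one directive on top of whatever earlier calls left behind. */
int toy_set_optimize(toy_context *ctx, uint32_t directive)
{
    if (ctx == NULL) return ERR_NULL;
    switch (directive) {
    case OPT_NONE:               ctx->flags = 0;                     break;
    case OPT_FULL:               ctx->flags = FLAG_ALL;              break;
    case OPT_AUTO_POSSESS:       ctx->flags |= FLAG_AUTO_POSSESS;    break;
    case OPT_AUTO_POSSESS_OFF:   ctx->flags &= ~FLAG_AUTO_POSSESS;   break;
    case OPT_DOTSTAR_ANCHOR:     ctx->flags |= FLAG_DOTSTAR_ANCHOR;  break;
    case OPT_DOTSTAR_ANCHOR_OFF: ctx->flags &= ~FLAG_DOTSTAR_ANCHOR; break;
    case OPT_START_OPTIMIZE:     ctx->flags |= FLAG_START_OPTIMIZE;  break;
    case OPT_START_OPTIMIZE_OFF: ctx->flags &= ~FLAG_START_OPTIMIZE; break;
    default: return ERR_BADOPTION;  /* unknown directive: context left unchanged */
    }
    return 0;
}
```

In this model, calling with OPT_NONE and then OPT_START_OPTIMIZE leaves only the start-up optimizations enabled, which is the "later calls take precedence" behaviour in miniature. An unknown directive is rejected without altering the context, so a caller could probe for a directive and fall back if it is not recognised.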
<P>
There is a complete description of the PCRE2 native API, including all
permitted values for the <i>directive</i> parameter of <b>pcre2_set_optimize()</b>,
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
194 changes: 138 additions & 56 deletions doc/html/pcre2api.html
@@ -179,6 +179,10 @@ <h1>pcre2api man page</h1>
<br>
<b>int pcre2_set_compile_recursion_guard(pcre2_compile_context *<i>ccontext</i>,</b>
<b> int (*<i>guard_function</i>)(uint32_t, void *), void *<i>user_data</i>);</b>
<br>
<br>
<b>int pcre2_set_optimize(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>directive</i>);</b>
</P>
<br><a name="SEC5" href="#TOC1">PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS</a><br>
<P>
@@ -808,6 +812,7 @@ <h1>pcre2api man page</h1>
The compile time nested parentheses limit
The maximum length of the pattern string
The extra options bits (none set by default)
Which performance optimizations the compiler should apply
</pre>
A compile context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of
@@ -952,6 +957,110 @@ <h1>pcre2api man page</h1>
nesting, and the second is user data that is set up by the last argument of
<b>pcre2_set_compile_recursion_guard()</b>. The callout function should return
zero if all is well, or non-zero to force an error.
<br>
<br>
<b>int pcre2_set_optimize(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>directive</i>);</b>
<br>
<br>
PCRE2 can apply various performance optimizations during compilation, in order
to make matching faster. For example, the compiler might convert some regex
constructs into an equivalent construct which <b>pcre2_match()</b> can execute
faster. By default, all available optimizations are enabled. However, in rare
cases, one might wish to disable specific optimizations. For example, if it is
known that some optimizations cannot benefit a certain regex, it might be
desirable to disable them, in order to speed up compilation.
</P>
<P>
The permitted values of <i>directive</i> are as follows:
<pre>
PCRE2_OPTIMIZATION_NONE
</pre>
Disable all optional performance optimizations.
<pre>
PCRE2_OPTIMIZATION_FULL
</pre>
Enable all optional performance optimizations. This is the default value.
<pre>
PCRE2_AUTO_POSSESS
PCRE2_AUTO_POSSESS_OFF
</pre>
Enable/disable "auto-possessification" of variable quantifiers such as * and +.
This optimization, for example, turns a+b into a++b in order to avoid
backtracks into a+ that can never be successful. However, if callouts are in
use, auto-possessification means that some callouts are never taken. You can
disable this optimization if you want the matching functions to do a full,
unoptimized search and run all the callouts.
<pre>
PCRE2_DOTSTAR_ANCHOR
PCRE2_DOTSTAR_ANCHOR_OFF
</pre>
Enable/disable an optimization that is applied when .* is the first significant
item in a top-level branch of a pattern, and all the other branches also start
with .* or with \A or \G or ^. Such a pattern is automatically anchored if
PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set for any
^ items. Otherwise, the fact that any match must start either at the start of
the subject or following a newline is remembered. Like other optimizations,
this can cause callouts to be skipped.
</P>
<P>
Dotstar anchor optimization is automatically disabled for .* if it is inside an
atomic group or a capture group that is the subject of a backreference, or if
the pattern contains (*PRUNE) or (*SKIP).
<pre>
PCRE2_START_OPTIMIZE
PCRE2_START_OPTIMIZE_OFF
</pre>
Enable/disable optimizations which cause matching functions to scan the subject
string for specific code unit values before attempting a match. For example, if
it is known that an unanchored match must start with a specific value, the
matching code searches the subject for that value, and fails immediately if it
cannot find it, without actually running the main matching function. This means
that a special item such as (*COMMIT) at the start of a pattern is not
considered until after a suitable starting point for the match has been found.
Also, when callouts or (*MARK) items are in use, these "start-up" optimizations
can cause them to be skipped if the pattern is never actually used. The start-up
optimizations are in effect a pre-scan of the subject that takes place before
the pattern is run.
</P>
<P>
Disabling start-up optimizations ensures that in cases where the result is "no
match", the callouts do occur, and that items such as (*COMMIT) and (*MARK) are
considered at every possible starting position in the subject string.
</P>
<P>
Disabling start-up optimizations may change the outcome of a matching operation.
Consider the pattern
<pre>
(*COMMIT)ABC
</pre>
When this is compiled, PCRE2 records the fact that a match must start with the
character "A". Suppose the subject string is "DEFABC". The start-up
optimization scans along the subject, finds "A" and runs the first match
attempt from there. The (*COMMIT) item means that the pattern must match the
current starting position, which in this case, it does. However, if the same
match is run without start-up optimizations, the initial scan along the subject
string does not happen. The first match attempt is run starting from "D" and
when this fails, (*COMMIT) prevents any further matches being tried, so the
overall result is "no match".
</P>
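The (*COMMIT)ABC behaviour described above can be reproduced with a small stand-alone model in C. The function match_commit_abc and its start_optimize flag are invented for this sketch, and the matcher is hard-wired to this one pattern. It only imitates the first-code-unit scan described in the text and makes no claim about how PCRE2 implements it.

```c
#include <stddef.h>
#include <string.h>

/*
 * Toy matcher for the single pattern (*COMMIT)ABC.
 * Returns the offset of the match, or -1 for "no match".
 */
long match_commit_abc(const char *subject, int start_optimize)
{
    size_t len = strlen(subject);
    size_t start = 0;

    if (start_optimize) {
        /* Pre-scan: a match must begin with 'A', so jump to the first one. */
        const char *first_a = strchr(subject, 'A');
        if (first_a == NULL) return -1;   /* fail without running the matcher */
        start = (size_t)(first_a - subject);
    }

    /* Only one attempt is ever made: if it fails, (*COMMIT) forbids
       retrying at any later starting position. */
    if (len - start >= 3 && memcmp(subject + start, "ABC", 3) == 0)
        return (long)start;
    return -1;
}
```

With the subject "DEFABC", match_commit_abc(subject, 1) reports a match at offset 3, while match_commit_abc(subject, 0) reports no match, mirroring the two outcomes described above.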
<P>
Another start-up optimization makes use of a minimum length for a matching
subject, which is recorded when possible. Consider the pattern
<pre>
(*MARK:1)B(*MARK:2)(X|Y)
</pre>
The minimum length for a match is two characters. If the subject is "XXBB", the
"starting character" optimization skips "XX", then tries to match "BB", which
is long enough. In the process, (*MARK:2) is encountered and remembered. When
the match attempt fails, the next "B" is found, but there is only one character
left, so there are no more attempts, and "no match" is returned with the "last
mark seen" set to "2". Without start-up optimizations, however, matches are
tried at every possible starting position, including at the end of the subject,
where (*MARK:1) is encountered, but there is no "B", so the "last mark seen"
that is returned is "1". In this case, the optimizations do not affect the
overall match result, which is still "no match", but they do affect the
auxiliary information that is returned.
<a name="matchcontext"></a></P>
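The effect on the "last mark seen" can be traced with a similar stand-alone model in C. The names attempt and last_mark_seen are hypothetical, the matcher is hard-wired to (*MARK:1)B(*MARK:2)(X|Y), and the way it combines the starting-character scan with the minimum-length check simply follows the prose above. It is a sketch for following the example, not PCRE2's algorithm.

```c
#include <stddef.h>
#include <string.h>

/* One match attempt at position pos; updates *mark as (*MARK) items are passed.
   Returns 1 on a match, 0 on failure. */
static int attempt(const char *s, size_t pos, int *mark)
{
    *mark = 1;                                      /* (*MARK:1) */
    if (s[pos] != 'B') return 0;                    /* B         */
    *mark = 2;                                      /* (*MARK:2) */
    return s[pos + 1] == 'X' || s[pos + 1] == 'Y';  /* (X|Y)     */
}

/*
 * Returns the last mark seen (0 if no attempt was made) and sets *matched.
 * With start_optimize set, attempts are made only at a 'B' that has at
 * least two characters (the minimum match length) remaining.
 */
int last_mark_seen(const char *s, int start_optimize, int *matched)
{
    size_t len = strlen(s);
    int mark = 0;

    *matched = 0;
    for (size_t pos = 0; pos <= len; pos++) {
        if (start_optimize) {
            const char *b = strchr(s + pos, 'B');
            if (b == NULL) break;          /* no possible starting point left */
            pos = (size_t)(b - s);
            if (len - pos < 2) break;      /* too short for any match */
        }
        if (attempt(s, pos, &mark)) { *matched = 1; break; }
    }
    return mark;
}
```

For the subject "XXBB", last_mark_seen returns 2 with the optimizations modelled and 1 without them, and reports no match in both cases, in line with the description above.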
<br><b>
The match context
@@ -1807,85 +1916,57 @@ <h1>pcre2api man page</h1>
<pre>
PCRE2_NO_AUTO_POSSESS
</pre>
If this option is set, it disables "auto-possessification", which is an
optimization that, for example, turns a+b into a++b in order to avoid
If this (deprecated) option is set, it disables "auto-possessification", which
is an optimization that, for example, turns a+b into a++b in order to avoid
backtracks into a+ that can never be successful. However, if callouts are in
use, auto-possessification means that some callouts are never taken. You can
set this option if you want the matching functions to do a full unoptimized
search and run all the callouts, but it is mainly provided for testing
purposes.
</P>
<P>
If a compile context is available, it is recommended to use
<b>pcre2_set_optimize()</b> with the <i>directive</i> PCRE2_AUTO_POSSESS_OFF rather
than the compile option PCRE2_NO_AUTO_POSSESS. Note that PCRE2_NO_AUTO_POSSESS
takes precedence over the <b>pcre2_set_optimize()</b> optimization directives
PCRE2_AUTO_POSSESS and PCRE2_AUTO_POSSESS_OFF.
<pre>
PCRE2_NO_DOTSTAR_ANCHOR
</pre>
If this option is set, it disables an optimization that is applied when .* is
the first significant item in a top-level branch of a pattern, and all the
other branches also start with .* or with \A or \G or ^. The optimization is
automatically disabled for .* if it is inside an atomic group or a capture
group that is the subject of a backreference, or if the pattern contains
(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
automatically anchored if PCRE2_DOTALL is set for all the .* items and
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
must start either at the start of the subject or following a newline is
If this (deprecated) option is set, it disables an optimization that is applied
when .* is the first significant item in a top-level branch of a pattern, and
all the other branches also start with .* or with \A or \G or ^. The
optimization is automatically disabled for .* if it is inside an atomic group
or a capture group that is the subject of a backreference, or if the pattern
contains (*PRUNE) or (*SKIP). When the optimization is not disabled, such a
pattern is automatically anchored if PCRE2_DOTALL is set for all the .* items
and PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any
match must start either at the start of the subject or following a newline is
remembered. Like other optimizations, this can cause callouts to be skipped.
(If a compile context is available, it is recommended to use
<b>pcre2_set_optimize()</b> with the <i>directive</i> PCRE2_DOTSTAR_ANCHOR_OFF
instead.)
<pre>
PCRE2_NO_START_OPTIMIZE
</pre>
This is an option whose main effect is at matching time. It does not change
what <b>pcre2_compile()</b> generates, but it does affect the output of the JIT
compiler.
compiler. Setting this option is equivalent to calling <b>pcre2_set_optimize()</b>
with the <i>directive</i> parameter set to PCRE2_START_OPTIMIZE_OFF.
</P>
<P>
There are a number of optimizations that may occur at the start of a match, in
order to speed up the process. For example, if it is known that an unanchored
match must start with a specific code unit value, the matching code searches
the subject for that value, and fails immediately if it cannot find it, without
actually running the main matching function. This means that a special item
such as (*COMMIT) at the start of a pattern is not considered until after a
suitable starting point for the match has been found. Also, when callouts or
(*MARK) items are in use, these "start-up" optimizations can cause them to be
skipped if the pattern is never actually used. The start-up optimizations are
actually running the main matching function. The start-up optimizations are
in effect a pre-scan of the subject that takes place before the pattern is run.
</P>
<P>
The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
possibly causing performance to suffer, but ensuring that in cases where the
result is "no match", the callouts do occur, and that items such as (*COMMIT)
and (*MARK) are considered at every possible starting position in the subject
string.
</P>
<P>
Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching operation.
Consider the pattern
<pre>
(*COMMIT)ABC
</pre>
When this is compiled, PCRE2 records the fact that a match must start with the
character "A". Suppose the subject string is "DEFABC". The start-up
optimization scans along the subject, finds "A" and runs the first match
attempt from there. The (*COMMIT) item means that the pattern must match the
current starting position, which in this case, it does. However, if the same
match is run with PCRE2_NO_START_OPTIMIZE set, the initial scan along the
subject string does not happen. The first match attempt is run starting from
"D" and when this fails, (*COMMIT) prevents any further matches being tried, so
the overall result is "no match".
</P>
<P>
As another start-up optimization makes use of a minimum length for a matching
subject, which is recorded when possible. Consider the pattern
<pre>
(*MARK:1)B(*MARK:2)(X|Y)
</pre>
The minimum length for a match is two characters. If the subject is "XXBB", the
"starting character" optimization skips "XX", then tries to match "BB", which
is long enough. In the process, (*MARK:2) is encountered and remembered. When
the match attempt fails, the next "B" is found, but there is only one character
left, so there are no more attempts, and "no match" is returned with the "last
mark seen" set to "2". If NO_START_OPTIMIZE is set, however, matches are tried
at every possible starting position, including at the end of the subject, where
(*MARK:1) is encountered, but there is no "B", so the "last mark seen" that is
returned is "1". In this case, the optimizations do not affect the overall
match result, which is still "no match", but they do affect the auxiliary
information that is returned.
Disabling the start-up optimizations may cause performance to suffer. However,
this may be desirable for patterns which contain callouts or items such as
(*COMMIT) and (*MARK). See the above description of PCRE2_START_OPTIMIZE_OFF
for further details.
<pre>
PCRE2_NO_UTF_CHECK
</pre>
@@ -2312,6 +2393,7 @@ <h1>pcre2api man page</h1>
PCRE2_DOTALL is in force for .*
Neither (*PRUNE) nor (*SKIP) appears in the pattern
PCRE2_NO_DOTSTAR_ANCHOR is not set
Dotstar anchoring has not been disabled with PCRE2_DOTSTAR_ANCHOR_OFF
</pre>
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
options returned for PCRE2_INFO_ALLOPTIONS.
30 changes: 18 additions & 12 deletions doc/html/pcre2pattern.html
@@ -142,7 +142,8 @@ <h1>pcre2pattern man page</h1>
</b><br>
<P>
If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting
the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making quantifiers
the PCRE2_NO_AUTO_POSSESS option, or calling <b>pcre2_set_optimize()</b> with
a PCRE2_AUTO_POSSESS_OFF directive. This stops PCRE2 from making quantifiers
possessive when what follows cannot match the repeated item. For example, by
default a+b is treated as a++b. For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
@@ -153,8 +154,9 @@ <h1>pcre2pattern man page</h1>
</b><br>
<P>
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
PCRE2_NO_START_OPTIMIZE option. This disables several optimizations for quickly
reaching "no match" results. For more details, see the
PCRE2_NO_START_OPTIMIZE option, or calling <b>pcre2_set_optimize()</b> with
a PCRE2_START_OPTIMIZE_OFF directive. This disables several optimizations for
quickly reaching "no match" results. For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
@@ -163,7 +165,8 @@ <h1>pcre2pattern man page</h1>
</b><br>
<P>
If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that
setting the PCRE2_NO_DOTSTAR_ANCHOR option, or calling <b>pcre2_set_optimize()</b>
with a PCRE2_DOTSTAR_ANCHOR_OFF directive. This disables optimizations that
apply to patterns whose top-level branches all start with .* (match any number
of arbitrary characters). For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
@@ -2145,8 +2148,9 @@ <h1>pcre2pattern man page</h1>
(?&#62;.*?a)b
</pre>
It matches "ab" in the subject "aab". The use of the backtracking control verbs
(*PRUNE) and (*SKIP) also disable this optimization, and there is an option,
PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
(*PRUNE) and (*SKIP) also disables this optimization. To do so explicitly,
either pass the compile option PCRE2_NO_DOTSTAR_ANCHOR, or call
<b>pcre2_set_optimize()</b> with a PCRE2_DOTSTAR_ANCHOR_OFF directive.
</P>
<P>
When a capture group is repeated, the value captured is the substring that
@@ -2243,8 +2247,9 @@ <h1>pcre2pattern man page</h1>
PCRE2 has an optimization that automatically "possessifies" certain simple
pattern constructs. For example, the sequence A+B is treated as A++B because
there is no point in backtracking into a sequence of A's when B must follow.
This feature can be disabled by the PCRE2_NO_AUTOPOSSESS option, or starting
the pattern with (*NO_AUTO_POSSESS).
This feature can be disabled by the PCRE2_NO_AUTO_POSSESS option, by calling
<b>pcre2_set_optimize()</b> with a PCRE2_AUTO_POSSESS_OFF directive, or by
starting the pattern with (*NO_AUTO_POSSESS).
</P>
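To make the saving concrete, the toy function below counts how often a backtracking matcher would test for "B" when matching A+B versus A++B at a single starting position. The name count_b_tests and its possessive flag are invented for this sketch, and the model handles only this one pattern shape. It shows the general idea of possessive quantifiers and is not PCRE2's matching code.

```c
#include <stddef.h>

/*
 * Count how many times the matcher tests for 'B' when matching A+B
 * (possessive == 0) or A++B (possessive != 0) at the start of subject.
 * Sets *matched to 1 if the pattern matches there, else 0.
 */
unsigned count_b_tests(const char *subject, int possessive, int *matched)
{
    size_t n = 0;
    unsigned tests = 0;

    *matched = 0;
    while (subject[n] == 'A') n++;       /* greedy A+ takes every leading 'A' */
    if (n == 0) return 0;                /* A+ needs at least one 'A' */

    for (;;) {
        tests++;
        if (subject[n] == 'B') { *matched = 1; break; }
        if (possessive || n == 1) break; /* nothing may (or can) be given back */
        n--;                             /* backtrack: release one 'A' */
    }
    return tests;
}
```

For the subject "AAAC" the non-possessive form makes three tests and the possessive form makes one. Both fail, because any position given back by A+ still holds an "A", which can never be "B"; that is the reason the rewrite is safe for this pattern.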
<P>
When a pattern contains an unlimited repeat inside a group that can itself be
@@ -3464,9 +3469,9 @@ <h1>pcre2pattern man page</h1>
present. When one of these optimizations bypasses the running of a match, any
included backtracking verbs will not, of course, be processed. You can suppress
the start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option
when calling <b>pcre2_compile()</b>, or by starting the pattern with
(*NO_START_OPT). There is more discussion of this option in the section
entitled
when calling <b>pcre2_compile()</b>, by calling <b>pcre2_set_optimize()</b> with a
PCRE2_START_OPTIMIZE_OFF directive, or by starting the pattern with (*NO_START_OPT).
There is more discussion of this option in the section entitled
<a href="pcre2api.html#compiling">"Compiling a pattern"</a>
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
@@ -3597,7 +3602,8 @@ <h1>pcre2pattern man page</h1>
</P>
<P>
If you are interested in (*MARK) values after failed matches, you should
probably set the PCRE2_NO_START_OPTIMIZE option
probably either set the PCRE2_NO_START_OPTIMIZE option or call
<b>pcre2_set_optimize()</b> with a PCRE2_START_OPTIMIZE_OFF directive
<a href="#nooptimize">(see above)</a>
to ensure that the match is always attempted.
</P>