Add new API function pcre2_set_optimization() for controlling enabled optimizations

It is anticipated that over time, more and more optimizations will be
added to PCRE2, and we want to be able to switch optimizations off/on,
both for testing purposes and to be able to work around bugs in a
released library version.

The number of free bits left in the compile options word is very small.
Hence, we will start putting all optimization enable/disable flags in
a separate word. To switch these off/on, the new API function
pcre2_set_optimization() will be used.

The values which can be passed to pcre2_set_optimization() are
different from the internal flag bit values. The values accepted by
pcre2_set_optimization() are contiguous integers, so there is no
danger of ever running out of them. This means that the internal
representation can be changed at any time in the future without
breaking backwards compatibility. Further, the 'directives' passed to
pcre2_set_optimization() are not restricted to controlling a single,
specific optimization. As an example, passing PCRE2_OPTIMIZATION_FULL
will turn on all optimizations supported by whatever version of
PCRE2 the client program happens to be linked with.
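
The separation between public directives and internal flag bits described above can be sketched as follows. This is a minimal illustration, not the code added by this commit: the directive values mirror the defines in the header hunks of this commit (where the function is declared as pcre2_set_optimize()), but the `SKETCH_OPTIM_*` bit values, the `DIR_*` names, the function name `sketch_set_optimize` and the `-1` error return are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Public directive values; numbers mirror the header hunks in this commit. */
enum {
  DIR_OPTIMIZATION_NONE  = 0,
  DIR_OPTIMIZATION_FULL  = 1,
  DIR_AUTO_POSSESS       = 64,
  DIR_AUTO_POSSESS_OFF   = 65,
  DIR_DOTSTAR_ANCHOR     = 66,
  DIR_DOTSTAR_ANCHOR_OFF = 67,
  DIR_START_OPTIMIZE     = 68,
  DIR_START_OPTIMIZE_OFF = 69
};

/* Internal flag bits. These values are assumed for the sketch; the point of
the design is that they may change without affecting the directives. */
#define SKETCH_OPTIM_AUTO_POSSESS   0x1u
#define SKETCH_OPTIM_DOTSTAR_ANCHOR 0x2u
#define SKETCH_OPTIM_START_OPTIMIZE 0x4u
#define SKETCH_OPTIM_ALL            0x7u

/* Translate one directive into a change to the internal flags word.
Returns 0 on success, -1 (hypothetical error code) for an unknown directive,
in which case the flags are left untouched. */
int sketch_set_optimize(uint32_t *flags, uint32_t directive)
{
  assert(flags != NULL);

  /* Directives that affect more than one optimization at once. */
  if (directive == DIR_OPTIMIZATION_NONE) { *flags = 0; return 0; }
  if (directive == DIR_OPTIMIZATION_FULL) { *flags = SKETCH_OPTIM_ALL; return 0; }

  /* Paired on/off directives: even values enable, odd values disable. */
  if (directive >= DIR_AUTO_POSSESS && directive <= DIR_START_OPTIMIZE_OFF)
    {
    uint32_t bit = 1u << ((directive - DIR_AUTO_POSSESS) >> 1);
    if (directive & 1u) *flags &= ~bit; else *flags |= bit;
    return 0;
    }

  return -1;
}
```

A caller would start from a flags word such as `SKETCH_OPTIM_ALL` and apply directives one at a time; for example, `DIR_AUTO_POSSESS_OFF` clears only the auto-possess bit and leaves the other two set.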

Co-Authored-By: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Co-Authored-by: Zoltan Herczeg <hzmester@freemail.hu>
3 people committed Sep 16, 2024
1 parent 5e75d9b commit cacd570
Showing 17 changed files with 286 additions and 41 deletions.
2 changes: 1 addition & 1 deletion doc/html/pcre2pattern.html
@@ -2243,7 +2243,7 @@ <h1>pcre2pattern man page</h1>
PCRE2 has an optimization that automatically "possessifies" certain simple
pattern constructs. For example, the sequence A+B is treated as A++B because
there is no point in backtracking into a sequence of A's when B must follow.
This feature can be disabled by the PCRE2_NO_AUTOPOSSESS option, or starting
This feature can be disabled by the PCRE2_NO_AUTO_POSSESS option, or starting
the pattern with (*NO_AUTO_POSSESS).
</P>
<P>
2 changes: 1 addition & 1 deletion doc/pcre2pattern.3
@@ -2242,7 +2242,7 @@ package, and PCRE1 copied it from there. It found its way into Perl at release
PCRE2 has an optimization that automatically "possessifies" certain simple
pattern constructs. For example, the sequence A+B is treated as A++B because
there is no point in backtracking into a sequence of A's when B must follow.
This feature can be disabled by the PCRE2_NO_AUTOPOSSESS option, or starting
This feature can be disabled by the PCRE2_NO_AUTO_POSSESS option, or starting
the pattern with (*NO_AUTO_POSSESS).
.P
When a pattern contains an unlimited repeat inside a group that can itself be
17 changes: 16 additions & 1 deletion src/pcre2.h.generic
@@ -464,6 +464,18 @@ released, the numbers must not be changed. */
#define PCRE2_CONFIG_COMPILED_WIDTHS 14
#define PCRE2_CONFIG_TABLES_LENGTH 15

/* Optimization directives for pcre2_set_optimize().
For binary compatibility, only add to this list; do not renumber. */

#define PCRE2_OPTIMIZATION_NONE 0
#define PCRE2_OPTIMIZATION_FULL 1

#define PCRE2_AUTO_POSSESS 64
#define PCRE2_AUTO_POSSESS_OFF 65
#define PCRE2_DOTSTAR_ANCHOR 66
#define PCRE2_DOTSTAR_ANCHOR_OFF 67
#define PCRE2_START_OPTIMIZE 68
#define PCRE2_START_OPTIMIZE_OFF 69

/* Types for code units in patterns and subject strings. */

@@ -617,7 +629,9 @@ PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \
pcre2_set_parens_nest_limit(pcre2_compile_context *, uint32_t); \
PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \
pcre2_set_compile_recursion_guard(pcre2_compile_context *, \
int (*)(uint32_t, void *), void *);
int (*)(uint32_t, void *), void *); \
PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \
pcre2_set_optimize(pcre2_compile_context *, uint32_t);

#define PCRE2_MATCH_CONTEXT_FUNCTIONS \
PCRE2_EXP_DECL pcre2_match_context *PCRE2_CALL_CONVENTION \
@@ -912,6 +926,7 @@ pcre2_compile are called by application code. */
#define pcre2_set_newline PCRE2_SUFFIX(pcre2_set_newline_)
#define pcre2_set_parens_nest_limit PCRE2_SUFFIX(pcre2_set_parens_nest_limit_)
#define pcre2_set_offset_limit PCRE2_SUFFIX(pcre2_set_offset_limit_)
#define pcre2_set_optimize PCRE2_SUFFIX(pcre2_set_optimize_)
#define pcre2_set_substitute_callout PCRE2_SUFFIX(pcre2_set_substitute_callout_)
#define pcre2_substitute PCRE2_SUFFIX(pcre2_substitute_)
#define pcre2_substring_copy_byname PCRE2_SUFFIX(pcre2_substring_copy_byname_)
17 changes: 16 additions & 1 deletion src/pcre2.h.in
@@ -464,6 +464,18 @@ released, the numbers must not be changed. */
#define PCRE2_CONFIG_COMPILED_WIDTHS 14
#define PCRE2_CONFIG_TABLES_LENGTH 15

/* Optimization directives for pcre2_set_optimize().
For binary compatibility, only add to this list; do not renumber. */

#define PCRE2_OPTIMIZATION_NONE 0
#define PCRE2_OPTIMIZATION_FULL 1

#define PCRE2_AUTO_POSSESS 64
#define PCRE2_AUTO_POSSESS_OFF 65
#define PCRE2_DOTSTAR_ANCHOR 66
#define PCRE2_DOTSTAR_ANCHOR_OFF 67
#define PCRE2_START_OPTIMIZE 68
#define PCRE2_START_OPTIMIZE_OFF 69

/* Types for code units in patterns and subject strings. */

@@ -617,7 +629,9 @@ PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \
pcre2_set_parens_nest_limit(pcre2_compile_context *, uint32_t); \
PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \
pcre2_set_compile_recursion_guard(pcre2_compile_context *, \
int (*)(uint32_t, void *), void *);
int (*)(uint32_t, void *), void *); \
PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \
pcre2_set_optimize(pcre2_compile_context *, uint32_t);

#define PCRE2_MATCH_CONTEXT_FUNCTIONS \
PCRE2_EXP_DECL pcre2_match_context *PCRE2_CALL_CONVENTION \
@@ -912,6 +926,7 @@ pcre2_compile are called by application code. */
#define pcre2_set_newline PCRE2_SUFFIX(pcre2_set_newline_)
#define pcre2_set_parens_nest_limit PCRE2_SUFFIX(pcre2_set_parens_nest_limit_)
#define pcre2_set_offset_limit PCRE2_SUFFIX(pcre2_set_offset_limit_)
#define pcre2_set_optimize PCRE2_SUFFIX(pcre2_set_optimize_)
#define pcre2_set_substitute_callout PCRE2_SUFFIX(pcre2_set_substitute_callout_)
#define pcre2_substitute PCRE2_SUFFIX(pcre2_substitute_)
#define pcre2_substring_copy_byname PCRE2_SUFFIX(pcre2_substring_copy_byname_)
98 changes: 72 additions & 26 deletions src/pcre2_compile.c
@@ -834,7 +834,8 @@ enum { PSO_OPT, /* Value is an option bit */
PSO_BSR, /* Value is a \R type */
PSO_LIMH, /* Read integer value for heap limit */
PSO_LIMM, /* Read integer value for match limit */
PSO_LIMD /* Read integer value for depth limit */
PSO_LIMD, /* Read integer value for depth limit */
PSO_OPTMZ /* Value is an optimization bit */
};

typedef struct pso {
@@ -852,10 +853,10 @@ static const pso pso_list[] = {
{ STRING_UCP_RIGHTPAR, 4, PSO_OPT, PCRE2_UCP },
{ STRING_NOTEMPTY_RIGHTPAR, 9, PSO_FLG, PCRE2_NOTEMPTY_SET },
{ STRING_NOTEMPTY_ATSTART_RIGHTPAR, 17, PSO_FLG, PCRE2_NE_ATST_SET },
{ STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPT, PCRE2_NO_AUTO_POSSESS },
{ STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR, 18, PSO_OPT, PCRE2_NO_DOTSTAR_ANCHOR },
{ STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPTMZ, PCRE2_OPTIM_AUTO_POSSESS },
{ STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR, 18, PSO_OPTMZ, PCRE2_OPTIM_DOTSTAR_ANCHOR },
{ STRING_NO_JIT_RIGHTPAR, 7, PSO_FLG, PCRE2_NOJIT },
{ STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPT, PCRE2_NO_START_OPTIMIZE },
{ STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPTMZ, PCRE2_OPTIM_START_OPTIMIZE },
{ STRING_LIMIT_HEAP_EQ, 11, PSO_LIMH, 0 },
{ STRING_LIMIT_MATCH_EQ, 12, PSO_LIMM, 0 },
{ STRING_LIMIT_DEPTH_EQ, 12, PSO_LIMD, 0 },
@@ -8883,13 +8884,14 @@ this prevents the number of characters it matches from being adjusted.
cb points to the compile data block
atomcount atomic group level
inassert TRUE if in an assertion
dotstar_anchor TRUE if automatic anchoring optimization is enabled
Returns: TRUE or FALSE
*/

static BOOL
is_anchored(PCRE2_SPTR code, uint32_t bracket_map, compile_block *cb,
int atomcount, BOOL inassert)
int atomcount, BOOL inassert, BOOL dotstar_anchor)
{
do {
PCRE2_SPTR scode = first_significant_code(
@@ -8901,7 +8903,7 @@ do {
if (op == OP_BRA || op == OP_BRAPOS ||
op == OP_SBRA || op == OP_SBRAPOS)
{
if (!is_anchored(scode, bracket_map, cb, atomcount, inassert))
if (!is_anchored(scode, bracket_map, cb, atomcount, inassert, dotstar_anchor))
return FALSE;
}

@@ -8912,30 +8914,30 @@ do {
{
int n = GET2(scode, 1+LINK_SIZE);
uint32_t new_map = bracket_map | ((n < 32)? (1u << n) : 1);
if (!is_anchored(scode, new_map, cb, atomcount, inassert)) return FALSE;
if (!is_anchored(scode, new_map, cb, atomcount, inassert, dotstar_anchor)) return FALSE;
}

/* Positive forward assertion */

else if (op == OP_ASSERT || op == OP_ASSERT_NA)
{
if (!is_anchored(scode, bracket_map, cb, atomcount, TRUE)) return FALSE;
if (!is_anchored(scode, bracket_map, cb, atomcount, TRUE, dotstar_anchor)) return FALSE;
}

/* Condition. If there is no second branch, it can't be anchored. */

else if (op == OP_COND || op == OP_SCOND)
{
if (scode[GET(scode,1)] != OP_ALT) return FALSE;
if (!is_anchored(scode, bracket_map, cb, atomcount, inassert))
if (!is_anchored(scode, bracket_map, cb, atomcount, inassert, dotstar_anchor))
return FALSE;
}

/* Atomic groups */

else if (op == OP_ONCE)
{
if (!is_anchored(scode, bracket_map, cb, atomcount + 1, inassert))
if (!is_anchored(scode, bracket_map, cb, atomcount + 1, inassert, dotstar_anchor))
return FALSE;
}

@@ -8950,8 +8952,7 @@ do {
op == OP_TYPEPOSSTAR))
{
if (scode[1] != OP_ALLANY || (bracket_map & cb->backref_map) != 0 ||
atomcount > 0 || cb->had_pruneorskip || inassert ||
(cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)
atomcount > 0 || cb->had_pruneorskip || inassert || !dotstar_anchor)
return FALSE;
}

@@ -8988,13 +8989,14 @@ or *SKIP does not count, because once again the assumption no longer holds.
cb points to the compile data
atomcount atomic group level
inassert TRUE if in an assertion
dotstar_anchor TRUE if automatic anchoring optimization is enabled
Returns: TRUE or FALSE
*/

static BOOL
is_startline(PCRE2_SPTR code, unsigned int bracket_map, compile_block *cb,
int atomcount, BOOL inassert)
int atomcount, BOOL inassert, BOOL dotstar_anchor)
{
do {
PCRE2_SPTR scode = first_significant_code(
@@ -9025,7 +9027,8 @@ do {
return FALSE;

default: /* Assertion */
if (!is_startline(scode, bracket_map, cb, atomcount, TRUE)) return FALSE;
if (!is_startline(scode, bracket_map, cb, atomcount, TRUE, dotstar_anchor))
return FALSE;
do scode += GET(scode, 1); while (*scode == OP_ALT);
scode += 1 + LINK_SIZE;
break;
@@ -9039,7 +9042,7 @@ do {
if (op == OP_BRA || op == OP_BRAPOS ||
op == OP_SBRA || op == OP_SBRAPOS)
{
if (!is_startline(scode, bracket_map, cb, atomcount, inassert))
if (!is_startline(scode, bracket_map, cb, atomcount, inassert, dotstar_anchor))
return FALSE;
}

@@ -9050,22 +9053,23 @@ do {
{
int n = GET2(scode, 1+LINK_SIZE);
unsigned int new_map = bracket_map | ((n < 32)? (1u << n) : 1);
if (!is_startline(scode, new_map, cb, atomcount, inassert)) return FALSE;
if (!is_startline(scode, new_map, cb, atomcount, inassert, dotstar_anchor))
return FALSE;
}

/* Positive forward assertions */

else if (op == OP_ASSERT || op == OP_ASSERT_NA)
{
if (!is_startline(scode, bracket_map, cb, atomcount, TRUE))
if (!is_startline(scode, bracket_map, cb, atomcount, TRUE, dotstar_anchor))
return FALSE;
}

/* Atomic brackets */

else if (op == OP_ONCE)
{
if (!is_startline(scode, bracket_map, cb, atomcount + 1, inassert))
if (!is_startline(scode, bracket_map, cb, atomcount + 1, inassert, dotstar_anchor))
return FALSE;
}

@@ -9079,8 +9083,7 @@ do {
else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR || op == OP_TYPEPOSSTAR)
{
if (scode[1] != OP_ANY || (bracket_map & cb->backref_map) != 0 ||
atomcount > 0 || cb->had_pruneorskip || inassert ||
(cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)
atomcount > 0 || cb->had_pruneorskip || inassert || !dotstar_anchor)
return FALSE;
}

@@ -10362,6 +10365,10 @@ int regexrc; /* Return from compile */

uint32_t i; /* Local loop counter */

/* Enable all optimizations by default. */
uint32_t optim_flags = ccontext != NULL ? ccontext->optimization_flags :
PCRE2_OPTIMIZATION_ALL;

/* Comments at the head of this file explain about these variables. */

uint32_t stack_groupinfo[GROUPINFO_DEFAULT_SIZE];
@@ -10432,6 +10439,18 @@ if (patlen > ccontext->max_pattern_length)
return NULL;
}

/* Optimization flags in 'options' can override those in the compile context.
This is because some options to disable optimizations were added before the
optimization flags word existed, and we need to continue supporting them
for backwards compatibility. */

if (options & PCRE2_NO_AUTO_POSSESS)
optim_flags &= ~PCRE2_OPTIM_AUTO_POSSESS;
if (options & PCRE2_NO_DOTSTAR_ANCHOR)
optim_flags &= ~PCRE2_OPTIM_DOTSTAR_ANCHOR;
if (options & PCRE2_NO_START_OPTIMIZE)
optim_flags &= ~PCRE2_OPTIM_START_OPTIMIZE;

/* From here on, all returns from this function should end up going via the
EXIT label. */

@@ -10568,6 +10587,32 @@ if ((options & PCRE2_LITERAL) == 0)
else limit_depth = c;
skipatstart = ++pp;
break;

case PSO_OPTMZ:
optim_flags &= ~(p->value);

/* For backward compatibility the three original VERBs to disable
optimizations need to also update the corresponding external option. */

switch(p->value)
{
case PCRE2_OPTIM_AUTO_POSSESS:
cb.external_options |= PCRE2_NO_AUTO_POSSESS;
break;

case PCRE2_OPTIM_DOTSTAR_ANCHOR:
cb.external_options |= PCRE2_NO_DOTSTAR_ANCHOR;
break;

case PCRE2_OPTIM_START_OPTIMIZE:
cb.external_options |= PCRE2_NO_START_OPTIMIZE;
break;
}

break;

default:
PCRE2_UNREACHABLE();
}
break; /* Out of the table scan loop */
}
@@ -10863,6 +10908,7 @@ re->top_bracket = 0;
re->top_backref = 0;
re->name_entry_size = cb.name_entry_size;
re->name_count = cb.names_found;
re->optimization_flags = optim_flags;

/* The basic block is immediately followed by the name table, and the compiled
code follows after that. */
@@ -11005,7 +11051,7 @@ used in this code because at least one compiler gives a warning about loss of
"const" attribute if the cast (PCRE2_UCHAR *)codestart is used directly in the
function call. */

if (errorcode == 0 && (re->overall_options & PCRE2_NO_AUTO_POSSESS) == 0)
if (errorcode == 0 && (optim_flags & PCRE2_OPTIM_AUTO_POSSESS))
{
PCRE2_UCHAR *temp = (PCRE2_UCHAR *)codestart;
if (PRIV(auto_possessify)(temp, &cb) != 0) errorcode = ERR80;
@@ -11022,17 +11068,17 @@ there are no occurrences of *PRUNE or *SKIP (though there is an option to
disable this case). */

if ((re->overall_options & PCRE2_ANCHORED) == 0 &&
is_anchored(codestart, 0, &cb, 0, FALSE))
is_anchored(codestart, 0, &cb, 0, FALSE, (optim_flags & PCRE2_OPTIM_DOTSTAR_ANCHOR) != 0))
re->overall_options |= PCRE2_ANCHORED;

/* Set up the first code unit or startline flag, the required code unit, and
then study the pattern. This code need not be obeyed if PCRE2_NO_START_OPTIMIZE
is set, as the data it would create will not be used. Note that a first code
then study the pattern. This code need not be obeyed if PCRE2_OPTIM_START_OPTIMIZE
is disabled, as the data it would create will not be used. Note that a first code
unit (but not the startline flag) is useful for anchored patterns because it
can still give a quick "no match" and also avoid searching for a last code
unit. */

if ((re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0)
if (optim_flags & PCRE2_OPTIM_START_OPTIMIZE)
{
int minminlength = 0; /* For minimal minlength from first/required CU */

@@ -11096,7 +11142,7 @@ if ((re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0)
that disables this case.) */

else if ((re->overall_options & PCRE2_ANCHORED) == 0 &&
is_startline(codestart, 0, &cb, 0, FALSE))
is_startline(codestart, 0, &cb, 0, FALSE, (optim_flags & PCRE2_OPTIM_DOTSTAR_ANCHOR) != 0))
re->flags |= PCRE2_STARTLINE;

/* Handle the "required code unit", if one is set. In the UTF case we can
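
The src/pcre2_compile.c hunks above establish an order of precedence: start from the compile context's optimization flags (or all optimizations when there is no context), let the legacy PCRE2_NO_* compile options clear bits, then let pattern-start verbs such as (*NO_AUTO_POSSESS) clear bits and mirror the change into the external options. The sketch below isolates that logic. It is an illustration under assumptions: every `sketch_`/`SKETCH_` name, the struct layout and all numeric bit values are hypothetical stand-ins, not PCRE2's definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Internal optimization bits (values assumed for the sketch). */
#define SKETCH_OPTIM_AUTO_POSSESS   0x1u
#define SKETCH_OPTIM_DOTSTAR_ANCHOR 0x2u
#define SKETCH_OPTIM_START_OPTIMIZE 0x4u
#define SKETCH_OPTIM_ALL            0x7u

/* Legacy compile-option bits (illustrative values). */
#define SKETCH_NO_AUTO_POSSESS      0x00004000u
#define SKETCH_NO_DOTSTAR_ANCHOR    0x00008000u
#define SKETCH_NO_START_OPTIMIZE    0x00010000u

/* Hypothetical stand-in for the compile context. */
typedef struct { uint32_t optimization_flags; } sketch_compile_context;

/* Steps 1 and 2: take the context's flags, defaulting to "all enabled",
then let legacy option bits switch individual optimizations off. */
uint32_t sketch_resolve_optim_flags(const sketch_compile_context *ctx,
  uint32_t options)
{
  uint32_t optim = (ctx != NULL) ? ctx->optimization_flags : SKETCH_OPTIM_ALL;
  if (options & SKETCH_NO_AUTO_POSSESS)   optim &= ~SKETCH_OPTIM_AUTO_POSSESS;
  if (options & SKETCH_NO_DOTSTAR_ANCHOR) optim &= ~SKETCH_OPTIM_DOTSTAR_ANCHOR;
  if (options & SKETCH_NO_START_OPTIMIZE) optim &= ~SKETCH_OPTIM_START_OPTIMIZE;
  return optim;
}

/* Step 3: a pattern-start verb clears its optimization bit and, for
backward compatibility, also sets the matching external option bit. */
void sketch_apply_verb(uint32_t optim_bit, uint32_t *optim,
  uint32_t *external_options)
{
  assert(optim != NULL && external_options != NULL);
  *optim &= ~optim_bit;
  switch (optim_bit)
    {
    case SKETCH_OPTIM_AUTO_POSSESS:
    *external_options |= SKETCH_NO_AUTO_POSSESS;
    break;

    case SKETCH_OPTIM_DOTSTAR_ANCHOR:
    *external_options |= SKETCH_NO_DOTSTAR_ANCHOR;
    break;

    case SKETCH_OPTIM_START_OPTIMIZE:
    *external_options |= SKETCH_NO_START_OPTIMIZE;
    break;

    default:
    break;
    }
}
```

Both steps only ever clear optimization bits, so in this sketch neither a legacy option nor a verb can re-enable an optimization that the context has switched off.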