# Aho-Corasick algorithm
The Aho-Corasick algorithm allows us to quickly search for multiple patterns in a text.
The set of pattern strings is also called a dictionary.
We will denote the total length of its constituent strings by $m$.
The algorithm was proposed by Alfred Aho and Margaret Corasick in 1975.
A trie based on words "Java", "Rad", "Rand", "Rau", "Raum" and "Rose".
The image by [nd](https://de.wikipedia.org/wiki/Benutzer:Nd) is distributed under CC BY-SA 3.0 license.
Formally, a trie is a rooted tree, where each edge of the tree is labeled with some letter and outgoing edges of a vertex have distinct labels.
We will identify each vertex in the trie with the string formed by the labels on the path from the root to that vertex.
Each vertex will also have a flag $\text{output}$, which will be set if the string corresponding to the vertex belongs to the given set.
Accordingly, a trie for a set of strings is a trie such that each $\text{output}$ vertex corresponds to one string from the set, and conversely, each string of the set corresponds to one $\text{output}$ vertex.
We now describe how to construct a trie for a given set of strings in linear time with respect to their total length.
We introduce a structure for the vertices of the tree:
```cpp
const int K = 26;

struct Vertex {
    int next[K];
    bool output = false;

    Vertex() {
        fill(begin(next), end(next), -1);
    }
};

vector<Vertex> trie(1);
```
Here, we store the trie as an array of $\text{Vertex}$; $\text{next}[c]$ holds the index of the child vertex reached by the letter $c$, or $-1$ if there is no such edge. The trie initially consists of the single root vertex.

Now we implement a function that will add a string $s$ to the trie:
```cpp
void add_string(string const& s) {
    int v = 0;
    for (char ch : s) {
        int c = ch - 'a';
        if (trie[v].next[c] == -1) {
            trie[v].next[c] = trie.size();
            trie.emplace_back();
        }
        v = trie[v].next[c];
    }
    trie[v].output = true;
}
```
This implementation obviously runs in linear time, and since every vertex stores $k$ links, it will use $O(mk)$ memory.

It is possible to decrease the memory consumption to $O(m)$ by storing the edges in a map instead of an array, at the cost of an extra $O(\log k)$ factor on each transition.
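As a quick sanity check of the structure above, the trie can be exercised with a small lookup helper. The code below repeats the struct and `add_string` for self-containment; the `contains` function is not part of the article's code, just a hypothetical helper that walks the edges from the root:

```cpp
#include <bits/stdc++.h>
using namespace std;

const int K = 26;

struct Vertex {
    int next[K];
    bool output = false;
    Vertex() { fill(begin(next), end(next), -1); }
};

vector<Vertex> trie(1);

void add_string(string const& s) {
    int v = 0;
    for (char ch : s) {
        int c = ch - 'a';
        if (trie[v].next[c] == -1) {
            trie[v].next[c] = trie.size();
            trie.emplace_back();
        }
        v = trie[v].next[c];
    }
    trie[v].output = true;
}

// Hypothetical helper, not from the article: walks edges from the root
// and reports whether s was inserted as a complete string.
bool contains(string const& s) {
    int v = 0;
    for (char ch : s) {
        int c = ch - 'a';
        if (trie[v].next[c] == -1)
            return false;
        v = trie[v].next[c];
    }
    return trie[v].output;
}
```

After inserting "rad" and "rand", `contains("rad")` is true while `contains("ra")` is false, since "ra" is only a prefix and its vertex has no $\text{output}$ flag.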
Suppose we have built a trie for the given set of strings. Now let's look at it from a different side. If we look at any vertex, the string that corresponds to it is a prefix of one or more strings in the set, thus each vertex of the trie can be interpreted as a position in one or more strings from the set.
In fact, the trie vertices can be interpreted as states in a finite deterministic automaton.
From any state we can transition - using some input letter - to other states, i.e., to another position in the set of strings.
For example, if there is only one string $abc$ in the trie and we are standing at the vertex $ab$, then with the letter $c$ we can transition to the vertex $abc$.
Thus we can understand the edges of the trie as transitions in an automaton according to the corresponding letter. However, in an automaton we need to have transitions for each combination of a state and a letter. If we try to perform a transition using a letter, and there is no corresponding edge in the trie, then we nevertheless must go into some state.
More precisely, suppose we are in a state corresponding to a string $t$, and we want to transition to a different state using the letter $c$. If there is an edge labeled with this letter $c$, then we can simply go over this edge and get the vertex corresponding to $t + c$. If there is no such edge, we must go to the state corresponding to the longest proper suffix of $t$ that is present in the trie, and try to perform the transition from there.

For example, let the trie be constructed from the strings $ab$ and $bc$, and suppose we are currently at the vertex corresponding to $ab$. There is no outgoing edge with the letter $c$, so we move to the state corresponding to the longest proper suffix $b$, and from there take the edge labeled $c$, arriving at the vertex $bc$.
An Aho-Corasick automaton based on words "a", "ab", "bc", "bca", "c" and "caa".
Blue arrows are suffix links, green arrows are terminal links.
A **suffix link** for a vertex $p$ is an edge that points to the vertex corresponding to the longest proper suffix of the string of $p$ that is present in the trie. The only special case is the root, whose suffix link points to itself. Now we can reformulate the statement about the transitions in the automaton: while the current vertex has no outgoing edge with the current letter (and we have not reached the root), we follow the suffix link and retry the transition from there.
Thus we reduced the problem of constructing an automaton to the problem of finding suffix links for all vertices of the trie. However, we will build these suffix links, oddly enough, using the transitions constructed in the automaton.
The suffix links of the root vertex and all its immediate children point to the root vertex.
For any vertex $v$ deeper in the trie, we can calculate the suffix link as follows: if $p$ is the parent of $v$ and $c$ is the letter on the edge from $p$ to $v$, then we go to $p$, follow its suffix link, and perform from there the transition with the letter $c$.
Thus, the problem of finding the transitions has been reduced to the problem of finding suffix links, and the problem of finding suffix links has been reduced to the problem of finding a suffix link and a transition, except for vertices closer to the root. So we have a recursive dependence that we can resolve in linear time.
Let's move to the implementation.
Note that we will now store the parent $p$ of each vertex and the character $pch$ of the edge from $p$ to it. Additionally, each vertex stores its suffix link in $\text{link}$ and the automaton transitions in the array $\text{go}[]$; both are initialized to $-1$, meaning "not computed yet", and are filled lazily.
```cpp
const int K = 26;

struct Vertex {
    int next[K];
    bool output = false;
    int p = -1;
    char pch;
    int link = -1;
    int go[K];

    Vertex(int p=-1, char ch='$') : p(p), pch(ch) {
        fill(begin(next), end(next), -1);
        fill(begin(go), end(go), -1);
    }
};

vector<Vertex> t(1);

void add_string(string const& s) {
    int v = 0;
    for (char ch : s) {
        int c = ch - 'a';
        if (t[v].next[c] == -1) {
            t[v].next[c] = t.size();
            t.emplace_back(v, ch);
        }
        v = t[v].next[c];
    }
    t[v].output = true;
}

int go(int v, char ch);

int get_link(int v) {
    if (t[v].link == -1) {
        if (v == 0 || t[v].p == 0)
            t[v].link = 0;
        else
            t[v].link = go(get_link(t[v].p), t[v].pch);
    }
    return t[v].link;
}

int go(int v, char ch) {
    int c = ch - 'a';
    if (t[v].go[c] == -1) {
        if (t[v].next[c] != -1)
            t[v].go[c] = t[v].next[c];
        else
            t[v].go[c] = v == 0 ? 0 : go(get_link(v), ch);
    }
    return t[v].go[c];
}
```
It is easy to see that thanks to memoization of the suffix links and transitions, the total time for finding all suffix links and transitions will be linear.
For an illustration of the concept refer to slide number 103 of the Stanford slides.
Instead of computing transitions and suffix links with recursive calls to `go` and `get_link`, it is possible to compute them bottom-up, starting from the root, with a breadth-first traversal.
(In fact, when the dictionary consists of only one string, we obtain the familiar Knuth-Morris-Pratt algorithm.)
This approach has an advantage over the one described above: after the breadth-first pass, all suffix links and transitions are computed up front, so processing each subsequent letter takes $O(1)$ time with no recursion.
We can reason inductively using the fact that BFS from the root traverses vertices in order of increasing length.
We may assume that when we're in a vertex $v$, its suffix link $u = \text{link}[v]$ is already computed, and so are the transitions from all vertices of smaller depth.

Assume that at the moment we stand in a vertex $v$ and consider a character $c$. We essentially have two cases:
-   $go[v][c] = -1$. In this case, we may assign $go[v][c] = go[u][c]$, which is already known by the induction hypothesis;
-   $go[v][c] = w \neq -1$. In this case, we may assign $link[w] = go[u][c]$.
In this way, we spend $O(1)$ time per pair of a vertex and a character, so the whole automaton is built in $O(nk)$ time, where $n$ is the number of trie vertices.
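The two cases above translate directly into code. The sketch below uses flat arrays (`nxt`, `go_`, `link_`, `output_`) instead of the `Vertex` struct from earlier; these names and the `add_vertex`/`build` helpers are my own choice for this illustration, not the article's code:

```cpp
#include <bits/stdc++.h>
using namespace std;

const int K = 26;

vector<array<int, K>> nxt;   // trie edges, -1 if absent
vector<array<int, K>> go_;   // automaton transitions
vector<int> link_;           // suffix links
vector<char> output_;        // 1 if a dictionary string ends here

int add_vertex() {
    array<int, K> e;
    e.fill(-1);
    nxt.push_back(e);
    go_.push_back(e);
    link_.push_back(0);
    output_.push_back(0);
    return (int)nxt.size() - 1;
}

int root = add_vertex();     // create vertex 0, the root

void add_string(string const& s) {
    int v = 0;
    for (char ch : s) {
        int c = ch - 'a';
        if (nxt[v][c] == -1)
            nxt[v][c] = add_vertex();
        v = nxt[v][c];
    }
    output_[v] = 1;
}

// Bottom-up construction: BFS visits vertices in order of increasing
// depth, so link_[v] and the transitions of all shorter vertices are
// already known when v is processed.
void build() {
    queue<int> q;
    for (int c = 0; c < K; c++) {
        int w = nxt[0][c];
        if (w == -1) {
            go_[0][c] = 0;               // missing edge: stay at the root
        } else {
            go_[0][c] = w;
            link_[w] = 0;                // children of the root link to it
            q.push(w);
        }
    }
    while (!q.empty()) {
        int v = q.front(); q.pop();
        int u = link_[v];                // already computed
        for (int c = 0; c < K; c++) {
            int w = nxt[v][c];
            if (w == -1) {
                go_[v][c] = go_[u][c];   // case go[v][c] = -1
            } else {
                go_[v][c] = w;
                link_[w] = go_[u][c];    // case go[v][c] = w != -1
                q.push(w);
            }
        }
    }
}
```

After `build()` every `go_` entry and suffix link is final, so scanning a text is a plain table lookup per letter.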
### Find all strings from a given set in a text

We are given a set of strings and a text.
We have to print all occurrences of all strings from the set in the given text in $O(\text{len} + \text{ans})$, where $\text{len}$ is the length of the text and $\text{ans}$ is the total number of occurrences.
We construct an automaton for this set of strings.
We will now process the text letter by letter using the automaton,
starting at the root of the trie.
If we are at any time at state $v$, and the next letter of the text is $c$, then we move to the new state $go(v, c)$.
How can we find out for a state $v$, if there are any matches with strings from the set? First, it is clear that if we stand on an $\text{output}$ vertex, then some string of the set ends at the current position of the text. However, this is by no means the only possible case: if we can reach one or more $\text{output}$ vertices by moving along the suffix links, there is a match corresponding to each of them as well. A simple example demonstrating this situation is the set of strings $\{dabce, abc, bc\}$ and the text $dabc$.
Thus if we store in each $\text{output}$ vertex the index of the string ending there (or a list of indices if duplicates occur in the set), then after each letter we can report all matches by walking the suffix-link chain from the current state and collecting the indices of the $\text{output}$ vertices we visit.
If you only want to count the occurrences and not find the indices themselves, you can calculate the number of marked vertices on the suffix-link path for each vertex $v$; this quantity satisfies $\text{cnt}[v] = \text{output}[v] + \text{cnt}[\text{link}[v]]$ and can be precomputed for all vertices in $O(n)$ total time. The answer is then the sum of $\text{cnt}[v]$ over the states visited while reading the text.
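Putting the pieces together, here is an end-to-end sketch: the automaton code from above, plus a hypothetical `count_occurrences` driver (my own addition, not the article's) that feeds the text through the automaton and walks the suffix links at each position. This is the simple $O(n \cdot \text{len})$ counting variant, kept deliberately naive:

```cpp
#include <bits/stdc++.h>
using namespace std;

const int K = 26;

struct Vertex {
    int next[K];
    bool output = false;
    int p = -1;
    char pch;
    int link = -1;
    int go[K];

    Vertex(int p=-1, char ch='$') : p(p), pch(ch) {
        fill(begin(next), end(next), -1);
        fill(begin(go), end(go), -1);
    }
};

vector<Vertex> t(1);

void add_string(string const& s) {
    int v = 0;
    for (char ch : s) {
        int c = ch - 'a';
        if (t[v].next[c] == -1) {
            t[v].next[c] = t.size();
            t.emplace_back(v, ch);
        }
        v = t[v].next[c];
    }
    t[v].output = true;
}

int go(int v, char ch);

int get_link(int v) {
    if (t[v].link == -1) {
        if (v == 0 || t[v].p == 0)
            t[v].link = 0;
        else
            t[v].link = go(get_link(t[v].p), t[v].pch);
    }
    return t[v].link;
}

int go(int v, char ch) {
    int c = ch - 'a';
    if (t[v].go[c] == -1) {
        if (t[v].next[c] != -1)
            t[v].go[c] = t[v].next[c];
        else
            t[v].go[c] = v == 0 ? 0 : go(get_link(v), ch);
    }
    return t[v].go[c];
}

// Hypothetical driver: at every text position, walk the suffix links and
// count every output vertex on the chain (each is one occurrence).
int count_occurrences(string const& text) {
    int v = 0, cnt = 0;
    for (char ch : text) {
        v = go(v, ch);
        for (int u = v; u != 0; u = get_link(u))
            if (t[u].output)
                cnt++;
    }
    return cnt;
}
```

For the dictionary $\{a, ab, bc\}$ and the text $abc$, this counts three occurrences: $a$ and $ab$ at position $0$ and $bc$ at position $1$.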
### Finding the lexicographically smallest string of a given length that doesn't match any given strings
A set of strings and a length $L$ are given. We have to find a string of length $L$ which does not contain any of the strings, and output the lexicographically smallest such string.
We can construct the automaton for the set of strings.
Recall that the $\text{output}$ vertices, and the vertices from which an $\text{output}$ vertex is reachable via suffix links, are exactly the states in which some string of the set has matched. Since we must avoid all matches, we may never enter such "bad" states. So we delete all bad vertices from the automaton, and in the remaining graph we look for the lexicographically smallest path of length $L$ starting at the root.
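A straightforward way to realize this is backtracking over good states. The sketch below reuses the automaton code from above; the helpers `bad`, `dfs` and `smallest_avoiding` are my own additions, and this is the simple backtracking version (potentially exponential in contrived cases), not a linear-time solution:

```cpp
#include <bits/stdc++.h>
using namespace std;

const int K = 26;

struct Vertex {
    int next[K];
    bool output = false;
    int p = -1;
    char pch;
    int link = -1;
    int go[K];

    Vertex(int p=-1, char ch='$') : p(p), pch(ch) {
        fill(begin(next), end(next), -1);
        fill(begin(go), end(go), -1);
    }
};

vector<Vertex> t(1);

void add_string(string const& s) {
    int v = 0;
    for (char ch : s) {
        int c = ch - 'a';
        if (t[v].next[c] == -1) {
            t[v].next[c] = t.size();
            t.emplace_back(v, ch);
        }
        v = t[v].next[c];
    }
    t[v].output = true;
}

int go(int v, char ch);

int get_link(int v) {
    if (t[v].link == -1)
        t[v].link = (v == 0 || t[v].p == 0) ? 0 : go(get_link(t[v].p), t[v].pch);
    return t[v].link;
}

int go(int v, char ch) {
    int c = ch - 'a';
    if (t[v].go[c] == -1) {
        if (t[v].next[c] != -1)
            t[v].go[c] = t[v].next[c];
        else
            t[v].go[c] = v == 0 ? 0 : go(get_link(v), ch);
    }
    return t[v].go[c];
}

// A state is "bad" if a dictionary string ends there, possibly as a
// suffix (i.e. some output vertex lies on its suffix-link chain).
bool bad(int v) {
    return t[v].output || (v != 0 && bad(get_link(v)));
}

// Try letters in alphabetical order and backtrack on failure, so the
// first complete string found is the lexicographically smallest one.
bool dfs(int v, int rem, string& out) {
    if (rem == 0)
        return true;
    for (int c = 0; c < K; c++) {
        int w = go(v, 'a' + c);
        if (!bad(w)) {
            out.push_back('a' + c);
            if (dfs(w, rem - 1, out))
                return true;
            out.pop_back();
        }
    }
    return false;
}

string smallest_avoiding(int L) {
    string out;
    if (!dfs(0, L, out))
        return "";   // no string of length L avoids the whole set
    return out;
}
```

For the dictionary $\{ab, bc\}$ and $L = 2$ the answer is $aa$: the greedy first choice $a$ works, and from state $a$ the letter $a$ keeps us in a good state, while $b$ would complete $ab$.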
### Finding the shortest string containing all given strings

Here we use the same ideas.
For each vertex we store a mask that denotes the strings which match at this state.
Then the problem can be reformulated as follows: initially being in the state $(v = \text{root},\ \text{mask} = 0)$, we want to reach a state $(v,\ \text{mask} = 2^n - 1)$, where $n$ is the number of strings in the set. When we transition from one state to another using a letter, we update the mask with the strings that match in the new vertex. By running a breadth-first search over these states, we find a shortest path to a state with the full mask, which gives the answer.
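The BFS over pairs $(v,\ \text{mask})$ can be sketched as follows. Per vertex we store a bitmask of the patterns ending exactly there; `full_mask` accumulates the masks along suffix links, and `shortest_superstring_length` is a hypothetical driver (my naming) that returns only the length of the answer, assuming non-empty lowercase patterns:

```cpp
#include <bits/stdc++.h>
using namespace std;

const int K = 26;

struct Vertex {
    int next[K];
    int mask = 0;   // bitmask of the pattern indices ending exactly here
    int p = -1;
    char pch;
    int link = -1;
    int go[K];

    Vertex(int p=-1, char ch='$') : p(p), pch(ch) {
        fill(begin(next), end(next), -1);
        fill(begin(go), end(go), -1);
    }
};

vector<Vertex> t(1);

void add_string(string const& s, int idx) {
    int v = 0;
    for (char ch : s) {
        int c = ch - 'a';
        if (t[v].next[c] == -1) {
            t[v].next[c] = t.size();
            t.emplace_back(v, ch);
        }
        v = t[v].next[c];
    }
    t[v].mask |= 1 << idx;
}

int go(int v, char ch);

int get_link(int v) {
    if (t[v].link == -1)
        t[v].link = (v == 0 || t[v].p == 0) ? 0 : go(get_link(t[v].p), t[v].pch);
    return t[v].link;
}

int go(int v, char ch) {
    int c = ch - 'a';
    if (t[v].go[c] == -1) {
        if (t[v].next[c] != -1)
            t[v].go[c] = t[v].next[c];
        else
            t[v].go[c] = v == 0 ? 0 : go(get_link(v), ch);
    }
    return t[v].go[c];
}

// All patterns that match when the automaton is in state v: the vertex's
// own mask plus everything reachable over suffix links.
int full_mask(int v) {
    return t[v].mask | (v == 0 ? 0 : full_mask(get_link(v)));
}

// BFS over (vertex, mask) pairs; dist counts the letters appended so far.
int shortest_superstring_length(vector<string> const& patterns) {
    for (int i = 0; i < (int)patterns.size(); i++)
        add_string(patterns[i], i);
    int full = (1 << patterns.size()) - 1;
    vector<vector<int>> dist(t.size(), vector<int>(full + 1, -1));
    queue<pair<int,int>> q;
    dist[0][0] = 0;
    q.push({0, 0});
    while (!q.empty()) {
        auto [v, m] = q.front(); q.pop();
        if (m == full)
            return dist[v][m];
        for (int c = 0; c < K; c++) {
            int w = go(v, 'a' + c);
            int nm = m | full_mask(w);
            if (dist[w][nm] == -1) {
                dist[w][nm] = dist[v][m] + 1;
                q.push({w, nm});
            }
        }
    }
    return -1;   // not reached (cannot happen for non-empty patterns)
}
```

To also recover the string itself, one would store a predecessor for each visited $(v,\ \text{mask})$ pair and unwind it at the end.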
### Finding the lexicographically smallest string of length $L$ containing $k$ strings {data-toc-label="Finding the lexicographically smallest string of length L containing k strings"}
As in the previous problem, we calculate for each vertex the number of matches that correspond to it (that is the number of marked vertices reachable using suffix links).
We reformulate the problem: the current state is determined by a triple of numbers $(v,\ \text{len},\ \text{cnt})$, and we want to reach the state $(v,\ L,\ k)$ from the state $(\text{root},\ 0,\ 0)$, where $v$ can be any vertex. If we run a depth-first search over these states and try the letters in alphabetical order, the first path found will automatically be the lexicographically smallest answer.