Terms agg fails with medium-large "exclude" clause lists #11176

markharwood · 2015-05-15T06:54:17Z

A terms agg with an exclude arrray of medium size (in my case, 86 strings) was sufficient to cause this error:

Caused by: org.apache.lucene.util.automaton.TooComplexToDeterminizeException: Determinizing automaton would result in more than 10000 states.
    at org.apache.lucene.util.automaton.Operations.determinize(Operations.java:743)
    at org.apache.lucene.util.automaton.RunAutomaton.<init>(RunAutomaton.java:138)
    at org.apache.lucene.util.automaton.ByteRunAutomaton.<init>(ByteRunAutomaton.java:32)
    at org.apache.lucene.util.automaton.ByteRunAutomaton.<init>(ByteRunAutomaton.java:27)
    at org.elasticsearch.search.aggregations.bucket.terms.support.IncludeExclude$StringFilter.<init>(IncludeExclude.java:88)

Example query:

{
    "query": {
        ...
    },
    "aggs": {
        "sample": {
            "sampler": {
                "shard_size": 100
            },
            "aggs": {
                "suggestions": {
                    "terms": {
                        "field": "links",
                        "exclude": [
                            "Triangle-free graph",
                            "Digraph (mathematics)",
                            "Clique (graph theory)",
                            "K-vertex-connected graph",
                            "Connected graph",
                            "Semi-symmetric graph",
                            "Complement (graph theory)",
                            "Tournament (graph theory)",
                            "Tree (graph theory)",
                            "Degree (graph theory)",
                            "Node (graph theory)",
                            "Heawood graph",
                            "Cycle graph",
                            "Cage graph",
                            "Star (graph theory)",
                            "Hoffman–Singleton graph",
                            "Edge (graph theory)",
                            "Shrikhande graph",
                            "Cubic graph",
                            "Cycle (graph theory)",
                            "Trees (graph theory)",
                            "Directed graph",
                            "Moore graph",
                            "Graph homomorphism",
                            "Hamiltonian graph",
                            "Regular graph",
                            "Vertex (graph theory)",
                            "Minor (graph theory)",
                            "Harries graph",
                            "Complement graph",
                            "K-edge-connected graph",
                            "Okapi BM25",
                            "Controlled vocabularies",
                            "Document retrieval",
                            "Recall (information retrieval)",
                            "C. J. van Rijsbergen",
                            "Subject indexing",
                            "Enterprise Search",
                            "Natural Language Processing",
                            "Information Retrieval",
                            "Relevance",
                            "SMART Information Retrieval System",
                            "Text mining",
                            "Inverse document frequency",
                            "Ranking functions",
                            "Binary classification",
                            "Enterprise search",
                            "Document classification",
                            "SIGIR",
                            "Query language",
                            "Gerard Salton",
                            "Gain (information retrieval)",
                            "Relevance (information retrieval)",
                            "Text Retrieval Conference",
                            "Compound term processing",
                            "Latent semantic indexing",
                            "Query expansion",
                            "Information retrieval#Recall",
                            "Vector Space Model",
                            "Stop words",
                            "Search engine",
                            "Index (search engine)",
                            "Tf-idf",
                            "Precision and recall",
                            "Query",
                            "Precision (information retrieval)",
                            "Latent semantic analysis",
                            "Summary statistics for contingency tables",
                            "Term frequency",
                            "Text retrieval",
                            "Natural language processing",
                            "Inverted index",
                            "Standard Boolean model",
                            "Question answering",
                            "Search engine indexing",
                            "Rocchio Classification",
                            "Jaccard index",
                            "Searching",
                            "Ranking function",
                            "Information retrieval",
                            "Relevance feedback",
                            "Special Interest Group on Information Retrieval",
                            "Information need",
                            "Gerard Salton Award",
                            "Vector space model",
                            "Automatic summarization"
                        ]
                    }
                }
            }
        }
    },
    "size": 1
}

I can see how consulting large sets of terms can be expensive and we might want to cap it but in the above case this risk is mitigated through the use of the "sampler" agg to consider a maxiumum of 100 top-matching docs.

The use case for this type of query is looking for new terms outside of a set that has already been gathered by the client eg aiding graph exploration by looking for connections beyond what you already have collected:

These sorts of exclude lists can grow large so a cap of around 85 terms seems low for this use case.

The text was updated successfully, but these errors were encountered:

jpountz · 2015-05-15T10:27:52Z

We currently handle lists of terms as an automaton just like we handle regular expressions for simplicity. I guess we could specialize the terms list case if we want large lists to work.

ppf2 · 2015-05-15T17:27:41Z

Per @jpountz , this error is only affecting the master branch. Just want to confirm that this will not get introduced to 1.6 when it gets released for we already have users using the exclude exact value with large number of values :)

ppf2 · 2015-05-15T18:40:59Z

@markharwood cool thx!

markharwood added the :Analytics/Aggregations Aggregations label May 15, 2015

This was referenced May 15, 2015

Terms agg exclude clauses slow on high-cardinality fields #11181

Closed

Improve include/exclude clause list speed and scalability #11188

Closed

kevinkluge assigned markharwood May 15, 2015

markharwood closed this as completed in caf7235 May 15, 2015

markharwood mentioned this issue May 26, 2015

Improved search timeout checking capabilities: #9168

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Terms agg fails with medium-large "exclude" clause lists #11176

Terms agg fails with medium-large "exclude" clause lists #11176

markharwood commented May 15, 2015

jpountz commented May 15, 2015

ppf2 commented May 15, 2015

ppf2 commented May 15, 2015

Terms agg fails with medium-large "exclude" clause lists #11176

Terms agg fails with medium-large "exclude" clause lists #11176

Comments

markharwood commented May 15, 2015

jpountz commented May 15, 2015

ppf2 commented May 15, 2015

ppf2 commented May 15, 2015