Add support for early termination of search request #24398

Closed
wants to merge 2 commits into from

jimczi
Contributor

@jimczi jimczi commented Apr 28, 2017

Relates #6720
This change introduces early termination of search requests for indices sorted by specific fields.
When the index is sorted, the option called early_terminate indicates that top documents must be sorted by the index sort criteria
and that only the top N documents per segment should be visited.
Let's say for example that we have an index sorted by timestamp:

PUT events
{
    "settings" : {
        "index" : {
            "sort.field" : "timestamp",
            "sort.order" : "desc"
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "timestamp": {
                    "type": "date"
                }
            }
        }
    }
}

... it is then possible to retrieve the last N events without visiting all the documents in the index with the following query:

GET /events/_search
{
    "size": 10,
    "early_terminate": true
}

The sort of this search request is automatically set to the index sort, and at most the first 10 matching documents are visited in each segment.

Setting this option on an index that is not sorted by any criteria will throw an exception.
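
To see why visiting only the first N documents of each segment is enough, here is a minimal Python sketch of the idea. It is not the Elasticsearch or Lucene implementation: the segment lists and the `top_n_early_terminated` helper are hypothetical, timestamps are plain integers, and every document is assumed to match the query.

```python
import heapq
from itertools import islice

def top_n_early_terminated(segments, n):
    """Return the n largest timestamps, visiting at most n docs per segment.

    Assumption: each segment is already sorted by timestamp, descending,
    which is what the index sort in the example above is meant to provide.
    """
    visited = 0
    candidates = []
    for segment in segments:
        head = list(islice(segment, n))  # stop after n docs: early termination
        visited += len(head)
        candidates.extend(head)
    # Merge the per-segment heads into the global top n.
    return heapq.nlargest(n, candidates), visited

# Three hypothetical segments, each sorted by timestamp descending.
segments = [
    [90, 70, 50, 30, 10],
    [95, 85, 20],
    [60, 40],
]
top, visited = top_n_early_terminated(segments, 2)
```

The global top N is always contained in the union of the per-segment top N, so the merge step loses nothing while the number of visited documents is bounded by N times the number of segments.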

@jimczi jimczi added :Search/Search Search-related issues that do not fall into other categories >enhancement review v6.0.0-alpha1 labels Apr 28, 2017
@jpountz
Contributor

jpountz commented May 3, 2017

I'm wondering whether we should expose this early_terminate property like your PR does, or whether we should have an option like track_total_hits (similarly to track_scores) and let Elasticsearch do the right thing by comparing the search and index sorts.

@jimczi
Contributor Author

jimczi commented May 3, 2017

Good idea @jpountz
And we can early terminate even if track_total_hits is set to true for queries like match_all or a simple term query, where the number of matching docs is known beforehand.
I'll update this PR soon with the new option and heuristic.
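
A rough Python sketch of that heuristic for the match_all case, under stated assumptions: segments are plain lists sorted descending, deleted documents are ignored, and `search_match_all` is a hypothetical helper rather than anything in Elasticsearch.

```python
from itertools import islice

def search_match_all(segments, size):
    """Return the top `size` docs plus the exact total hit count.

    For match_all the total is the sum of the per-segment doc counts,
    which is known without iterating, so collection can still stop
    after `size` docs in each (sorted) segment.
    """
    total_hits = sum(len(segment) for segment in segments)
    heads = [doc for segment in segments for doc in islice(segment, size)]
    top = sorted(heads, reverse=True)[:size]
    return top, total_hits

example_segments = [[9, 7, 5], [8, 2]]
top, total_hits = search_match_all(example_segments, 2)
```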

@jpountz
Contributor

jpountz commented May 4, 2017

@jimczi Indeed. Something else that is interesting is that MultiCollector knows what to do when a sub collector throws a CollectionTerminatedException, so it could also help in presence of aggregations: the top hits collector would just be removed from the sub collectors after size hits have been collected.
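
The behaviour described here can be sketched in a few lines of Python. This is only an illustration of the pattern, not Lucene's `MultiCollector`: the `CollectionTerminated` exception, the `TopHits` and `Count` classes, and `collect_segment` are all hypothetical stand-ins.

```python
class CollectionTerminated(Exception):
    """Stand-in for Lucene's CollectionTerminatedException."""

class TopHits:
    """Collects the first `size` docs of a sorted segment, then terminates."""
    def __init__(self, size):
        self.size = size
        self.hits = []

    def collect(self, doc):
        if len(self.hits) >= self.size:
            raise CollectionTerminated()
        self.hits.append(doc)

class Count:
    """Stand-in for an aggregation that must see every document."""
    def __init__(self):
        self.count = 0

    def collect(self, doc):
        self.count += 1

def collect_segment(docs, collectors):
    active = list(collectors)
    for doc in docs:
        for collector in list(active):  # iterate over a copy while removing
            try:
                collector.collect(doc)
            except CollectionTerminated:
                active.remove(collector)  # drop it, keep the others going
        if not active:
            break  # every sub collector is done with this segment

top_hits, agg = TopHits(2), Count()
collect_segment([5, 4, 3, 2, 1], [top_hits, agg])
```

In this sketch the top hits collector stops receiving documents once it has its two hits, while the counting collector still sees all five.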

@weiqiyiji

weiqiyiji commented May 12, 2017

Hi @jimczi, I've tracked the "index-sorting" feature in Elasticsearch for months, and I've already backported index sorting #24055 to Elasticsearch 5.3 (the version I use in production). There's one missing part that I think is important for the "index-sorting" feature. We know that most of the segments are sorted, and that could cover maybe 95% of the docs. If we do an early-terminated search and some of our important docs happen to reside in unsorted segments, then we may miss these docs due to early termination. So could we do early termination only in sorted segments, and do a full collection in unsorted segments?

@jimczi
Contributor Author

jimczi commented May 12, 2017

> When we do early terminate search, and some of our important docs reside in unsorted segments, then we may miss these docs due to early termination. So could we just do early termination in sorted segments, and do full collect in unsorted segments?

This is detected by the EarlyTerminatingSortingCollector: if a segment is unsorted, all docs are collected. Since Lucene 6.5, though, segments are sorted on flush:
https://issues.apache.org/jira/browse/LUCENE-7579
... so if a sort is specified segments are always sorted.
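
As a rough illustration of that fallback (not the actual `EarlyTerminatingSortingCollector` code), the Python sketch below early terminates only on segments flagged as sorted and collects everything from the others. The `(is_sorted, docs)` pairs and the `collect_top` helper are hypothetical.

```python
import heapq
from itertools import islice

def collect_top(segments, n):
    """segments: list of (is_sorted, docs) pairs; docs are timestamps.

    Sorted segments are assumed to be in descending order.
    """
    candidates = []
    for is_sorted, docs in segments:
        if is_sorted:
            candidates.extend(islice(docs, n))  # early terminate after n docs
        else:
            candidates.extend(docs)  # unsorted: every doc must be visited
    return heapq.nlargest(n, candidates)

mixed_segments = [
    (True, [90, 70, 50]),   # sorted segment
    (False, [10, 99, 30]),  # unsorted segment holding the newest doc
]
result = collect_top(mixed_segments, 2)
```

The newest document (99) sits in the unsorted segment and is still returned, because that segment is fully collected.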

@weiqiyiji

Thanks @jimczi ! I get it!

@jimczi
Contributor Author

jimczi commented May 24, 2017

I opened a new pull request that implements what @jpountz suggested in #24398 (comment)

@jimczi jimczi closed this May 24, 2017
jimczi added a commit to jimczi/elasticsearch that referenced this pull request Jun 7, 2017
This is a spin off for elastic#24398.
This commit refactors the query phase in order to be able
to automatically detect queries that can be early terminated.
If the index sort matches the query sort, the top docs collection is early terminated
on each segment and the computing of the total number of hits that match the query is delegated
to a simple TotalHitCountCollector.
This change also adds a new parameter to the search request called `track_total_hits`.
It indicates if the total number of hits that match the query should be tracked.
If false, queries sorted by the index sort will not try to compute this information
and will limit the collection to the first N documents per segment.
Aggregations are not impacted and will continue to see every document
even when the index sort matches the query sort and `track_total_hits` is false.

Relates elastic#6720
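
The decision described in this commit message could be summarised roughly as follows. This is a sketch of the stated rules only: `plan_query_phase` and its return values are hypothetical, and treating "matches" as exact equality of the two sorts is an assumption, since the message does not spell out the comparison.

```python
def plan_query_phase(index_sort, query_sort, track_total_hits):
    """Return (early_terminate_top_docs, needs_separate_hit_counter)."""
    # Assumption: "matches" means the query sort equals the index sort.
    sort_matches = index_sort is not None and query_sort == index_sort
    # Top docs collection stops early on each segment when the sorts match.
    early_terminate = sort_matches
    # With early termination the top docs collector no longer sees every
    # hit, so the total is delegated to a separate counter when requested.
    needs_hit_counter = sort_matches and track_total_hits
    return early_terminate, needs_hit_counter

plan_tracked = plan_query_phase(["timestamp:desc"], ["timestamp:desc"], True)
plan_untracked = plan_query_phase(["timestamp:desc"], ["timestamp:desc"], False)
plan_mismatch = plan_query_phase(["timestamp:desc"], ["_score"], True)
```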
Labels
>feature :Search/Search Search-related issues that do not fall into other categories
4 participants