Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing: index-time sorting #6720

Closed
jpountz opened this issue Jul 3, 2014 · 26 comments
Closed

Indexing: index-time sorting #6720

jpountz opened this issue Jul 3, 2014 · 26 comments
Assignees
Labels
:Core/Infra/Core Core issues without another label >feature

Comments

@jpountz
Copy link
Contributor

jpountz commented Jul 3, 2014

Lucene has index-time sorting and early query termination capabilities. It would be nice to integrate it into Elasticsearch in order to speed up queries whose sort order matches the index order. This presentation contains some information about these features and their implementation in Lucene.

@jpountz jpountz added the adoptme label Jul 3, 2014
@s1monw
Copy link
Contributor

s1monw commented Jul 4, 2014

I think this is an awesome feature for specific problems. Yet I think we have to make sure that this never ever corrupts an index during a merge. IMO we should restrict this to numeric fields and allow only asc / desc and that field is automatically marked as using on-disk docvalues to make sure we don't fail the merge due to OOM.

@rmuir
Copy link
Contributor

rmuir commented Jul 4, 2014

why limit to numeric? whether a field is numeric or not is completely unrelated to this feature: sorry that just doesn't make sense. Since lucene 4.8 you can supply any arbitrary Sort here.

@s1monw
Copy link
Contributor

s1monw commented Jul 4, 2014

I am super concerned about memory consumption so I should have rather said restricted to on-disk DV. I don't think it's unrelated

@rmuir
Copy link
Contributor

rmuir commented Jul 4, 2014

my question was with regards to string fields though. I think those are ok too (as well as multiple-level sorts or whatever lucene has). If we want to limit it to DV to prevent traps in merging thats fine, but I think string is just as good as numeric.

Separately, the real issue here is how well early termination will work with situations like aggregations. I don't yet fully understand how this is working, but it looks like it simply plugs into collector. so you would get wrong aggregation counts with early termination if we aren't careful.

@jpountz
Copy link
Contributor Author

jpountz commented Jul 4, 2014

+1 on allowing string doc values and multi-level sorts. We don't need to have them from the first iteration but having them eventually would be nice. Then we could early-terminate on any prefix on the sort that is configured at index time.

Even without aggregations, simple things like the total number of hits could only be approximated. I guess there would need to be a per-query flag to turn early termination on.

@mikemccand mikemccand removed their assignment Jul 8, 2014
@rjernst
Copy link
Member

rjernst commented Jul 8, 2014

This sounds like the PageRank use case, but there is another as well.

Sorting can be very useful for compression, to allow "clumps" of documents to be near each other (thus producing smaller deltas in postings). For example, let's say we have songs lyrics as our documents, and we sort by genre. If we restrict the search to a specific genre, we will quickly narrow into a small portion of the postings list, with small deltas. Also, if we do a search across all genres, but for a term which is highly correlated with a genre (e.g. "cowboy" for "Country"), then again we will have good compression and search speed for that term.

My point is just that you don't need early termination for this to be useful. It can help speed up queries if you just understand your data and the common types of queries you run.

@ofavre
Copy link
Contributor

ofavre commented Aug 22, 2014

Algolia amazing speed is achieved through index-time ordering, and they do support typo-tolerance as well.
I think it's safe to say that this can very well be a major feature.

@Vineeth-Mohan
Copy link

Looking forward for this feature.

@clintongormley
Copy link

Comment copied over from #8873:

Documents of different types tend to have different fields, which can lead to problems of sparsity when documents are stored in the order that they were indexed. If we sort documents by type, then missing values (eg in doc values or norms) can be represented very efficiently.

On merge, documents should be sorted by type (assuming there is more than one type). Additionally, for certain use cases, it can make sense to apply a secondary sort (eg on timestamp). This secondary level should be customizable per index, and should probably be dynamically updatable (although it would only have effect on later merges).

Relates to #8870

@jpountz
Copy link
Contributor Author

jpountz commented Sep 4, 2015

We probably want to have https://issues.apache.org/jira/browse/LUCENE-6766 first.

@jpountz jpountz added the stalled label Sep 4, 2015
@otisg
Copy link

otisg commented Sep 5, 2015

@jpountz would this help in the typical ELK case where one indexes logs (older first, then newer ones) and typically short in the opposite order - first new ones, then old ones?

In other words, with a typical ELK case where logs are being indexed would the natural indexing order already result in ideally sorted index, or would one have to run something (whatever LUCENE-6766 and this ES issue expose) to inverse the sort order of such an index?

@jpountz
Copy link
Contributor Author

jpountz commented Sep 5, 2015

I don't think index sorting would be appealing for time series: typical ELK deployments are interested in ingestion speed and running aggregations and none of them would benefit from index sorting as it adds overhead at index time and aggregations can't be early-terminated.

In my opinion, this would be more useful for pure search use-cases that only need to fetch the top docs and have a low to moderate ingestion rate.

@nik9000
Copy link
Member

nik9000 commented Sep 7, 2015

I don't think index sorting would be appealing for time series: typical ELK deployments are interested in ingestion speed and running aggregations and none of them would benefit from index sorting as it adds overhead at index time and aggregations can't be early-terminated.

I suspect it'd be faster to answer "find me the last 10 things that match this query" which is pretty useful. You'd have to not care about the count of total matches and I'm not sure if that outweighs the index time cost. I don't know this feature well so I can speak to what that index time cost amounts to.

@clintongormley clintongormley added v2.2.0 :Core/Infra/Core Core issues without another label and removed v2.1.0 labels Nov 20, 2015
@jimczi
Copy link
Contributor

jimczi commented Dec 7, 2015

After reading the description I though it would be pretty easy to add the feature but then I realized that it would break the blockjoin/nested support. Seems like we would need another sorting merge policy which takes the nested document into account...

@JanJakes
Copy link

Any plans/updates on this issue? Using index time sorting would have huge performance impacts on queries like "give me 10 items with the highest rating" when having millions of them. Seems like Algolia is using exactly this approach: https://www.algolia.com/doc/faq/index-configuration/how-to-sort-the-results-with-a-specific-attribute

@jpountz
Copy link
Contributor Author

jpountz commented Jul 19, 2016

We are slowly making progress: index sorting is going to be better integrated into Lucene in the upcoming 6.2 release https://issues.apache.org/jira/browse/LUCENE-6766 so when 6.2 is out we can start working on the integration. This will take time so don't expect it to be implemented and automatically terminate queries early etc. soon, but there is definitely progress being made.

@weiqiyiji
Copy link

Hey, guys, is there any progress? Currently I need index-time sorting in my product, and I wonder if I can wait util you guys implement this feature or I have to do it myself first.

@mikemccand
Copy link
Contributor

There has been good progress on the Lucene side, including a change @jimczi just pushed to Lucene 6.x (https://issues.apache.org/jira/browse/LUCENE-7579) for its future 6.5.0 release, to give a big boost to indexing performance with index-time sorting. You can see the impact of that change in the nightly sparse Lucene benchmarks: it's annotation V in this chart: https://home.apache.org/~mikemccand/lucenebench/sparseResults.html#index_throughput

But we still have plenty of work to do to expose sorted indices in ES.

@weiqiyiji
Copy link

@mikemccand Ok, I'll watch this issue first. Thanks!

@makeyang
Copy link
Contributor

@mikemccand any branch working on this so we can take a look at it ?

@jpountz
Copy link
Contributor Author

jpountz commented Mar 27, 2017

There is no branch yet.

@makeyang
Copy link
Contributor

sorry to bother to propose some thoughts on this issue:

  1. add mapping parameters: index_sort with values[no, asc, desc] for datatype[long,integer,short,byte,double,float] and index_sort_order with no-negative int value if and only if fields setting index_sort
  2. these 2 parameters can't be modified once it is set for a index
  3. index_sort_order is used to decide order of multi fields and 1 stands for the first sequence to decide sort
  4. it should not allow null value for fields with mapping parameter index_sort set not no for easy to implement
  5. it should not allow indices with no-zero-doc to add fields with index_sort parameter for easy to implement

@jimczi
Copy link
Contributor

jimczi commented Mar 31, 2017

@makeyang thanks.
I am planning to add the index sorting spec in the index settings rather than on the index mapping since you can define an index sort based on multiple fields.
Anyway I am currently iterating on this feature and I'll add updates (and link to the PR when I open it) to this issue so stay tuned !

jimczi added a commit that referenced this issue Apr 19, 2017
This change adds an index setting to define how the documents should be sorted inside each Segment.
It allows any numeric, date, boolean or keyword field inside a mapping to be used to sort the index on disk.
It is not allowed to use a `nested` fields inside an index that defines an index sorting since `nested` fields relies on the original sort of the index.
This change does not add early termination capabilities in the search layer. This will be added in a follow up.

Relates #6720
@jimczi
Copy link
Contributor

jimczi commented Apr 19, 2017

Index time sorting has landed in master:
#24055

There are still missing pieces that we need to add in order to make index sorting useful:

  • Documentation regarding when to (not) use index sorting and how

  • Early termination for queries that rely on the index sort

  • Faster sorted scroll queries when the query sort matches the index sort.

jimczi added a commit to jimczi/elasticsearch that referenced this issue Jun 7, 2017
This is a spin off for elastic#24398.
This commit refactors the query phase in order to be able
to automatically detect queries that can be early terminated.
If the index sort matches the query sort, the top docs collection is early terminated
on each segment and the computing of the total number of hits that match the query is delegated
to a simple TotalHitCountCollector.
This change also adds a new parameter to the search request called `track_total_hits`.
It indicates if the total number of hits that match the query should be tracked.
If false, queries sorted by the index sort will not try to compute this information
and will limit the collection to the first N documents per segment.
Aggregations are not impacted and will continue to see every document
even when the index sort matches the query sort and `track_total_hits` is false.

Relates elastic#6720
jimczi added a commit that referenced this issue Jun 8, 2017
…4864)

This commit refactors the query phase in order to be able
to automatically detect queries that can be early terminated.
If the index sort matches the query sort, the top docs collection is early terminated
on each segment and the computing of the total number of hits that match the query is delegated to a simple TotalHitCountCollector.
This change also adds a new parameter to the search request called `track_total_hits`.
It indicates if the total number of hits that match the query should be tracked.
If false, queries sorted by the index sort will not try to compute this information and 
and will limit the collection to the first N documents per segment.
Aggregations are not impacted and will continue to see every document
even when the index sort matches the query sort and `track_total_hits` is false.

Relates #6720
jimczi added a commit to jimczi/elasticsearch that referenced this issue Jun 8, 2017
Sorted scroll search can use early termination when the index sort matches the scroll search sort.
The optimization can be done after the first query (which still needs to collect all documents)
by applying a query that only matches documents that are greater than the last doc retrieved in the previous request.
Since the index is sorted, retrieving the list of documents that are greater than the last doc
only requires a binary search on each segment.
This change introduces this new query called `SortedSearchAfterDocQuery` and apply it when possible.
Scrolls with this optimization will search all documents on the first request and then will early terminate each segment
after $size doc for any subsequent requests.

Relates elastic#6720
jimczi added a commit that referenced this issue Jun 12, 2017
…25138)

Sorted scroll search can use early termination when the index sort matches the scroll search sort.
The optimization can be done after the first query (which still needs to collect all documents)
by applying a query that only matches documents that are greater than the last doc retrieved in the previous request.
Since the index is sorted, retrieving the list of documents that are greater than the last doc
only requires a binary search on each segment.
This change introduces this new query called `SortedSearchAfterDocQuery` and apply it when possible.
Scrolls with this optimization will search all documents on the first request and then will early terminate each segment
after $size doc for any subsequent requests.

Relates #6720
@jimczi
Copy link
Contributor

jimczi commented Jun 12, 2017

Index sorting is now fully supported in master.
The documentation for this feature can be found here:
https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules-index-sorting.html

We'll continue to add enhancements and features based on sorted indices but this issue can be closed.

@jimczi jimczi closed this as completed Jun 12, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Core Core issues without another label >feature
Projects
None yet
Development

No branches or pull requests