Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inner_hits are far not scalable out of the box #32818

Closed
oleksii-sl opened this issue Aug 13, 2018 · 7 comments
Closed

inner_hits are far not scalable out of the box #32818

oleksii-sl opened this issue Aug 13, 2018 · 7 comments
Labels
>enhancement help wanted adoptme :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@oleksii-sl
Copy link

oleksii-sl commented Aug 13, 2018

During development of search for my project I found out that specific tuning is needed in order to make inner_hits usable. Why do we have at all that parent _source parsing for inner_hits if it is clear that it is not scalable at all?
If each nested object is a separate document why not to store it's own _source for each? (if possible)

Elasticsearch version (bin/elasticsearch --version):
5.6.4

Plugins installed: []

JVM version (java -version):
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

OS version (uname -a if on a Unix-like system):

Linux new-es-cluster-test-master 4.13.0-1008-gcp #11-Ubuntu SMP Thu Jan 25 11:08:44 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

current behavior: inner_hits not usable without using _source: false and doc_values (or stored fields)
expected behavior: inner_hits usable & scalable out of the box

Steps to reproduce:

  1. you need huge json document (I got issues with ~500 000 bytes json) with >250 nested objects
  2. run query on it with eg inner_hits: { size: 50 }
@jimczi
Copy link
Contributor

jimczi commented Aug 14, 2018

you need huge json document (I got issues with ~500 000 bytes json) with >250 nested objects

If you have big documents then retrieving the _source might be an issue in terms of performance but that's only the default behavior. For most of the cases this is enough but you've realized that in your case disabling _source and retrieving doc_values only is faster. I guess it is expected since you retrieve only a subset of the fields that are present in your nested documents ? Can you post a full recreation and some numbers of your perf tests ?

@jimczi jimczi added >enhancement feedback_needed :Search/Search Search-related issues that do not fall into other categories labels Aug 14, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@oleksii-sl
Copy link
Author

@jimczi

only is faster

NO! Not only faster, Scalable!

I need all source of matched nested documents, my statement is: "whatever the default behavior is - it should not be non-scalable", do you agree with it? I have docs with 50 nested objects and it works fine, but with 250 nested objects performance goes down ~10x (from 20ms to 200ms approx). So it is like y = 2 * x linear dependency, elastic is intended to be used in highly loaded, multi-user environments so this linear algo seem to be awful.
I can't post data which I use because it's proprietary.

I don't see any sense of keeping default behavior which obviously can wreck your production's thoughoutput at some point.

In my case I also have another level of nested objects under 1st level of nested objects, such things can't be retrieved by docvalues

Is it possible to store _source for each nested document? That would resolve scalability issue

Btw, currently I decided to store whole json inside each nested document as a string in order to get it using docvalues

@jimczi
Copy link
Contributor

jimczi commented Aug 14, 2018

NO! Not only faster, Scalable!

No need to yell ! I think there are ways to make it faster without changing how it is stored internally. For instance we parse the whole _source in json for each inner_hit. So the more nested documents you have the more it takes to build the "new" source of the inner_hits. We could cache the parsed json source and reuse it for inner_hits that appear in the same root document. I don't know how much this would save (we'd need some benchmarks) but that shouldn't be too difficult to do. I'll mark this issue with an adoptme tags because I don't have time to work on this now, would you like to work on this issue @serdyuk-alex ?

@oleksii-sl
Copy link
Author

@jimczi Can you give tips where I need to look at in the code?
I need following starting points:
document creation in index, _source storage
_source fetched after search completed

@jimczi
Copy link
Contributor

jimczi commented Aug 16, 2018

You can start by looking at org.elasticsearch.search.fetch.FetchPhase and more precisely FetchPhase#createNestedSearchHit which re-parses the _source of the root object for every nested documents:

Tuple<XContentType, Map<String, Object>> tuple = XContentHelper.convertToMap(source, true);

@javanna javanna added help wanted adoptme and removed feedback_needed labels Oct 2, 2018
@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
jtibshirani added a commit that referenced this issue Aug 3, 2020
Previously if an inner_hits block required _ source, we would reload and parse
the root document's source for every hit. This PR adds a shared SourceLookup to
the inner hits context that allows inner hits to reuse parsed source if it's
already available. This matches our approach for sharing the root document ID.

Relates to #32818.
@jtibshirani
Copy link
Contributor

jtibshirani commented Aug 3, 2020

I merged #60494, which avoids reloading and reparsing the root document's _source for every inner hit. I suspect this would make a significant positive difference in your case.

I'm going to close this issue for now, but please open a new one if you continue to see poor performance. Note there's already another issue about a high inner_hits overhead (but that doesn't seem related to loading _source): #56210.

jtibshirani added a commit to jtibshirani/elasticsearch that referenced this issue Aug 3, 2020
Previously if an inner_hits block required _ source, we would reload and parse
the root document's source for every hit. This PR adds a shared SourceLookup to
the inner hits context that allows inner hits to reuse parsed source if it's
already available. This matches our approach for sharing the root document ID.

Relates to elastic#32818.
jtibshirani added a commit that referenced this issue Aug 4, 2020
Previously if an inner_hits block required _ source, we would reload and parse
the root document's source for every hit. This PR adds a shared SourceLookup to
the inner hits context that allows inner hits to reuse parsed source if it's
already available. This matches our approach for sharing the root document ID.

Relates to #32818.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement help wanted adoptme :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team
Projects
None yet
Development

No branches or pull requests

6 participants