inner_hits are far not scalable out of the box #32818

oleksii-sl · 2018-08-13T17:33:24Z

During development of search for my project I found out that specific tuning is needed in order to make inner_hits usable. Why do we have at all that parent _source parsing for inner_hits if it is clear that it is not scalable at all?
If each nested object is a separate document why not to store it's own _source for each? (if possible)

Elasticsearch version (bin/elasticsearch --version):
5.6.4

Plugins installed: []

JVM version (java -version):
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

OS version (uname -a if on a Unix-like system):

Linux new-es-cluster-test-master 4.13.0-1008-gcp #11-Ubuntu SMP Thu Jan 25 11:08:44 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

current behavior: inner_hits not usable without using _source: false and doc_values (or stored fields)
expected behavior: inner_hits usable & scalable out of the box

Steps to reproduce:

you need huge json document (I got issues with ~500 000 bytes json) with >250 nested objects
run query on it with eg inner_hits: { size: 50 }

The text was updated successfully, but these errors were encountered:

jimczi · 2018-08-14T07:47:12Z

you need huge json document (I got issues with ~500 000 bytes json) with >250 nested objects

If you have big documents then retrieving the _source might be an issue in terms of performance but that's only the default behavior. For most of the cases this is enough but you've realized that in your case disabling _source and retrieving doc_values only is faster. I guess it is expected since you retrieve only a subset of the fields that are present in your nested documents ? Can you post a full recreation and some numbers of your perf tests ?

elasticmachine · 2018-08-14T07:47:35Z

Pinging @elastic/es-search-aggs

oleksii-sl · 2018-08-14T10:30:19Z

@jimczi

only is faster

NO! Not only faster, Scalable!

I need all source of matched nested documents, my statement is: "whatever the default behavior is - it should not be non-scalable", do you agree with it? I have docs with 50 nested objects and it works fine, but with 250 nested objects performance goes down ~10x (from 20ms to 200ms approx). So it is like y = 2 * x linear dependency, elastic is intended to be used in highly loaded, multi-user environments so this linear algo seem to be awful.
I can't post data which I use because it's proprietary.

I don't see any sense of keeping default behavior which obviously can wreck your production's thoughoutput at some point.

In my case I also have another level of nested objects under 1st level of nested objects, such things can't be retrieved by docvalues

Is it possible to store _source for each nested document? That would resolve scalability issue

Btw, currently I decided to store whole json inside each nested document as a string in order to get it using docvalues

jimczi · 2018-08-14T13:50:40Z

NO! Not only faster, Scalable!

No need to yell ! I think there are ways to make it faster without changing how it is stored internally. For instance we parse the whole _source in json for each inner_hit. So the more nested documents you have the more it takes to build the "new" source of the inner_hits. We could cache the parsed json source and reuse it for inner_hits that appear in the same root document. I don't know how much this would save (we'd need some benchmarks) but that shouldn't be too difficult to do. I'll mark this issue with an adoptme tags because I don't have time to work on this now, would you like to work on this issue @serdyuk-alex ?

oleksii-sl · 2018-08-16T08:47:57Z

@jimczi Can you give tips where I need to look at in the code?
I need following starting points:
document creation in index, _source storage
_source fetched after search completed

jimczi · 2018-08-16T09:23:24Z

You can start by looking at org.elasticsearch.search.fetch.FetchPhase and more precisely FetchPhase#createNestedSearchHit which re-parses the _source of the root object for every nested documents:

elasticsearch/server/src/main/java/org/elasticsearch/search/fetch/FetchPhase.java

Line 288 in 15ff3da

    
           Tuple<XContentType, Map<String, Object>> tuple = XContentHelper.convertToMap(source, true);

Previously if an inner_hits block required _ source, we would reload and parse the root document's source for every hit. This PR adds a shared SourceLookup to the inner hits context that allows inner hits to reuse parsed source if it's already available. This matches our approach for sharing the root document ID. Relates to #32818.

jtibshirani · 2020-08-03T22:34:12Z

I merged #60494, which avoids reloading and reparsing the root document's _source for every inner hit. I suspect this would make a significant positive difference in your case.

I'm going to close this issue for now, but please open a new one if you continue to see poor performance. Note there's already another issue about a high inner_hits overhead (but that doesn't seem related to loading _source): #56210.

Previously if an inner_hits block required _ source, we would reload and parse the root document's source for every hit. This PR adds a shared SourceLookup to the inner hits context that allows inner hits to reuse parsed source if it's already available. This matches our approach for sharing the root document ID. Relates to elastic#32818.

Previously if an inner_hits block required _ source, we would reload and parse the root document's source for every hit. This PR adds a shared SourceLookup to the inner hits context that allows inner hits to reuse parsed source if it's already available. This matches our approach for sharing the root document ID. Relates to #32818.

jimczi added >enhancement feedback_needed :Search/Search Search-related issues that do not fall into other categories labels Aug 14, 2018

javanna added help wanted adoptme and removed feedback_needed labels Oct 2, 2018

rjernst added the Team:Search Meta label for search team label May 4, 2020

jimczi mentioned this issue Jul 29, 2020

Clarify SourceLookup sharing across fetch subphases. #60179

Merged

jtibshirani mentioned this issue Jul 31, 2020

Avoid reloading _source for every inner hit. #60494

Merged

jtibshirani closed this as completed Aug 3, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

inner_hits are far not scalable out of the box #32818

inner_hits are far not scalable out of the box #32818

oleksii-sl commented Aug 13, 2018 •

edited

Loading

jimczi commented Aug 14, 2018

elasticmachine commented Aug 14, 2018

oleksii-sl commented Aug 14, 2018

jimczi commented Aug 14, 2018

oleksii-sl commented Aug 16, 2018

jimczi commented Aug 16, 2018

jtibshirani commented Aug 3, 2020 •

edited

Loading

inner_hits are far not scalable out of the box #32818

inner_hits are far not scalable out of the box #32818

Comments

oleksii-sl commented Aug 13, 2018 • edited Loading

jimczi commented Aug 14, 2018

elasticmachine commented Aug 14, 2018

oleksii-sl commented Aug 14, 2018

jimczi commented Aug 14, 2018

oleksii-sl commented Aug 16, 2018

jimczi commented Aug 16, 2018

jtibshirani commented Aug 3, 2020 • edited Loading

oleksii-sl commented Aug 13, 2018 •

edited

Loading

jtibshirani commented Aug 3, 2020 •

edited

Loading