[FEATURE] inner_hits in nested neural query should return all the chunks #2113

Open
yuye-aws opened this issue Sep 11, 2024 · 13 comments
Labels
Features: Introduces a new unit of functionality that satisfies a requirement

Comments

@yuye-aws
Member

yuye-aws commented Sep 11, 2024

What is the bug?

I am using the text_chunking and text_embedding processors to ingest documents into an index. The text_chunking search example works well, but inner_hits returns only a single element from the chunked string list. It does not matter whether I set the score_mode to max or avg.

How can one reproduce the bug?

  1. Register a text embedding model.
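As a sketch of this step: one of the OpenSearch pretrained text embedding models can be registered and deployed through the ML Commons API. The model name and version below are only illustrative; any text embedding model producing 768-dimensional vectors matches the mapping used later. The register call returns a task whose result contains the model_id that is then used in the pipeline and query below.
POST _plugins/_ml/models/_register?deploy=true
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
  "version": "1.0.2",
  "model_format": "TORCH_SCRIPT"
}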
  2. Create a text chunking and embedding pipeline.
PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline
{
  "description": "A text chunking and embedding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "6ipW4JEBXVV1cW1lcFvy",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}
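Not part of the original report, but a quick way to verify the chunking output before indexing is the ingest pipeline simulate API; the response should contain the passage_chunk strings and their embeddings for the sample passage used below.
POST _ingest/pipeline/text-chunking-embedding-ingest-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
      }
    }
  ]
}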
  3. Create an index with the mapping.
PUT testindex
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "passage_text": {
        "type": "text"
      },
      "passage_chunk_embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": 768,
            "method": {
              "name": "hnsw",
              "engine": "lucene"
            }
          }
        }
      }
    }
  }
}
  4. Ingest some sample documents into the index (run the following command two times).
POST testindex/_doc?pipeline=text-chunking-embedding-ingest-pipeline
{
  "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
}
  5. Search the index with a nested neural query.
GET testindex/_search
{
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "passage_chunk_embedding",
      "query": {
        "neural": {
          "passage_chunk_embedding.knn": {
            "query_text": "document",
            "model_id": "6ipW4JEBXVV1cW1lcFvy"
          }
        }
      },
      "inner_hits": {}
    }
  }
}
  6. Receive the search result.
{
    "took": 1361,
    "timed_out": false,
    "_shards": {
      "total": 1,
      "successful": 1,
      "skipped": 0,
      "failed": 0
    },
    "hits": {
      "total": {
        "value": 2,
        "relation": "eq"
      },
      "max_score": 0.02276505,
      "hits": [
        {
          "_index": "testindex",
          "_id": "7SqB4JEBXVV1cW1lKVvd",
          "_score": 0.02276505,
          "_source": {
            "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch.",
            "passage_chunk": [
              "This is an example document to be chunked. The document ",
              "The document contains a single paragraph, two sentences and 24 ",
              "and 24 tokens by standard tokenizer in OpenSearch."
            ],
            "passage_chunk_embedding": [
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              }
            ]
          },
          "inner_hits": {
            "passage_chunk_embedding": {
              "hits": {
                "total": {
                  "value": 1,
                  "relation": "eq"
                },
                "max_score": 0.02276505,
                "hits": [
                  {
                    "_index": "testindex",
                    "_id": "7SqB4JEBXVV1cW1lKVvd",
                    "_nested": {
                      "field": "passage_chunk_embedding",
                      "offset": 1
                    },
                    "_score": 0.02276505,
                    "_source": {
                      "knn": [ ... ]
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "_index": "testindex",
          "_id": "7iqB4JEBXVV1cW1l5lv_",
          "_score": 0.02276505,
          "_source": {
            "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch.",
            "passage_chunk": [
              "This is an example document to be chunked. The document ",
              "The document contains a single paragraph, two sentences and 24 ",
              "and 24 tokens by standard tokenizer in OpenSearch."
            ],
            "passage_chunk_embedding": [
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              },
              {
                "knn": [ ... ]
              }
            ]
          },
          "inner_hits": {
            "passage_chunk_embedding": {
              "hits": {
                "total": {
                  "value": 1,
                  "relation": "eq"
                },
                "max_score": 0.02276505,
                "hits": [
                  {
                    "_index": "testindex",
                    "_id": "7iqB4JEBXVV1cW1l5lv_",
                    "_nested": {
                      "field": "passage_chunk_embedding",
                      "offset": 1
                    },
                    "_score": 0.02276505,
                    "_source": {
                      "knn": [ ... ]
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }

What is the expected behavior?

inner_hits should return the matching score and offset of every chunk (nested document) within each retrieved parent document, not just the single top-scoring chunk.
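
As a sketch of the expected shape for one parent document, assuming all three chunks are returned as inner hits (scores elided):
"inner_hits": {
  "passage_chunk_embedding": {
    "hits": {
      "total": { "value": 3, "relation": "eq" },
      "hits": [
        { "_nested": { "field": "passage_chunk_embedding", "offset": 0 }, "_score": ... },
        { "_nested": { "field": "passage_chunk_embedding", "offset": 1 }, "_score": ... },
        { "_nested": { "field": "passage_chunk_embedding", "offset": 2 }, "_score": ... }
      ]
    }
  }
}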

What is your host/environment?

Mac OS


yuye-aws added the bug and untriaged labels on Sep 11, 2024
yuye-aws changed the title from "[BUG] inner_hits in nested neural query only returns one element" to "[BUG] inner_hits in nested neural query only returns one chunk" on Sep 11, 2024
@yuye-aws
Member Author

yuye-aws commented Sep 11, 2024

Neural search with explain is not working. I could not find a workaround.

@martin-gaievski
Member

@yuye-aws Inner hits are not supported in the hybrid query. There is a feature request for this (opensearch-project/neural-search#718), but at the moment there is no path forward.

@yuye-aws
Member Author

yuye-aws commented Sep 12, 2024

I'm not using a hybrid query, just a plain neural query.

@yuye-aws
Member Author

Are both features not supported due to the same blocking issue?

@martin-gaievski
Member

martin-gaievski commented Sep 12, 2024

Sorry, my bad. Neural query is different. I'm not sure why nested doesn't work; in the neural query code we delegate execution to the knn query, so you may want to check how it's done in knn. An easy test would be to try whether a plain knn query supports a "nested" clause.

@yuye-aws
Member Author

An easy test would be to try whether a plain knn query supports a "nested" clause.

Already tried in my fifth step.

@martin-gaievski
Member

An easy test would be to try whether a plain knn query supports a "nested" clause.

Already tried in my fifth step.

In step 5 you have a neural query. I mean the knn query, something like the following example but with a nested clause:

"query": {
        "knn": {
            "embedding_field": {
                "vector": [
                    5.0,
                    4.0,
                    ....
                    3.8
                ],
                "k": 12
            }
        }
    }
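
For reference, a nested variant of that plain knn query against the index from the reproduction steps above might look like the following sketch (the query vector is elided; the field names are taken from the earlier mapping):
GET testindex/_search
{
  "query": {
    "nested": {
      "path": "passage_chunk_embedding",
      "score_mode": "max",
      "query": {
        "knn": {
          "passage_chunk_embedding.knn": {
            "vector": [ ... ],
            "k": 12
          }
        }
      },
      "inner_hits": {}
    }
  }
}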

@martin-gaievski
Member

@yuye-aws I found this change in knn #1182. The essence of it is: in the case of nested documents we return only the one that gave the max score and drop the others. This became the new default behavior, replacing the old one where all nested docs (meaning inner hits) were returned. The neural query inherits this from knn.

@yuye-aws
Member Author

in the case of nested documents we return only the one that gave the max score and drop the others. This became the new default behavior, replacing the old one where all nested docs (meaning inner hits) were returned.

This does not make sense, because the score_mode can also be avg, in which case we expect to see all the scores.

@yuye-aws
Member Author

The neural query inherits this from knn.

Shall we open a PR to the knn repo? After all, the nested k-NN query also needs the avg score mode.

@heemin32
Collaborator

@yuye-aws Please add your use case, and any suggestion you have regarding avg score mode support in knn, to #1743.

@yuye-aws
Member Author

Replied in #1743 (comment). Also, resolving this issue would help resolve a user issue: opensearch-project/ml-commons#2612. I was considering implementing a new search response processor to retrieve the most relevant chunks, but it is unfortunately blocked by the current issue: opensearch-project/ml-commons#2612 (comment).

naveentatikonda transferred this issue from opensearch-project/neural-search on Sep 18, 2024
naveentatikonda added the Features label and removed the untriaged and bug labels on Sep 18, 2024
naveentatikonda changed the title from "[BUG] inner_hits in nested neural query only returns one chunk" to "[FEATURE] inner_hits in nested neural query only returns one chunk" on Sep 18, 2024
yuye-aws changed the title from "[FEATURE] inner_hits in nested neural query only returns one chunk" to "[FEATURE] inner_hits in nested neural query should return all the chunks" on Sep 18, 2024
@hagen6835

Would love this!
