
[FEATURE] Enable to use passage chunks from hybrid neural search result as RAG input #2612

Open
reuschling opened this issue Jul 4, 2024 · 44 comments
Labels: enhancement (New feature or request)


@reuschling

I have implemented hybrid search with corresponding ingest and search pipelines, using text embeddings on document chunks, since the embedding models of course have input token size limits.

The ingest pipeline follows https://opensearch.org/docs/latest/search-plugins/text-chunking/
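For reference, a simplified sketch of such an ingest pipeline (created via PUT {{ _.openSearchUrl }}/_ingest/pipeline/<pipeline name>). The chunking parameters and flat field names here are illustrative; my actual setup cascades paragraph- and token-level chunking and stores the vectors under a nested .knn subfield, which is omitted here:

{
    "description": "Sketch: chunk the body for the embedding model, then embed each chunk",
    "processors": [
        {
            "text_chunking": {
                "algorithm": {
                    "fixed_token_length": {
                        "token_limit": 384,
                        "overlap_rate": 0.2,
                        "tokenizer": "standard"
                    }
                },
                "field_map": {
                    "tns_body": "paragraphs_chunks_tns_body"
                }
            }
        },
        {
            "text_embedding": {
                "model_id": "{{ _.embeddingsModelId }}",
                "field_map": {
                    "paragraphs_chunks_tns_body": "embedding_chunked_512_paragraphs_chunks_tns_body"
                }
            }
        }
    ]
}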

The top results should now be used as input for RAG. I configured a search pipeline for this, following https://opensearch.org/docs/latest/search-plugins/conversational-search/ :

{
		"description": "Post and response processor for hybrid search and RAG",
		"phase_results_processors": [
			{
				"normalization-processor": {
					"normalization": {
						"technique": "min_max"
					},
					"combination": {
						"technique": "arithmetic_mean",
						"parameters": {
							"weights": [
								0.3,
								0.7
							]
						}
					}
				}
			}
		],
		"response_processors": [
			{
				"retrieval_augmented_generation": {
					"tag": "rag_pipeline",
					"description": "Pipeline using the configured LLM",
					"model_id": "{{ _.LlmId }}",
					"context_field_list": [
						"tns_body"
					],
					"system_prompt": "You are a helpful assistant",
                                        "user_instructions": "Generate a concise and informative answer in less than 100 words for the given question"
				}
			}
		]
}
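
This definition is stored as a named search pipeline, e.g. via PUT {{ _.openSearchUrl }}/_search/pipeline/hybrid-rag-pipeline.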

Now I am able to send a search request using this search pipeline to {{ _.openSearchUrl }}/{{ _.openSearchIndex }}/_search?search_pipeline=hybrid-rag-pipeline, which works:

{
	"_source": {
		"include": [
			"tns_body"
		]
	},
	"query": {
		"hybrid": {
			"queries": [
				{
					"bool": {
						"should": [
							{
								"match": {
									"tns_body": {
										"query": "${{query}}"
									}
								}
							},
							{
								"match": {
									"tns_body.ngram": {
										"query": "${{query}}"
									}
								}
							}
						]
					}
				},
				{
					"nested": {
						"score_mode": "max",
						"path": "embedding_chunked_512_paragraphs_chunks_tns_body",
						"query": {
							"neural": {
								"embedding_chunked_512_paragraphs_chunks_tns_body.knn": {
									"query_text": "{{ _.query }}",
									"model_id": "{{ _.embeddingsModelId }}",
									"k": 10
								}
							}
						}
					}
				}
			]
		}
	},
	"ext": {
		"generative_qa_parameters": {
			"llm_question": "${{query}}",
			"llm_model": "llama3",
			"context_size": 2,
			"message_size": 5,
			"timeout": 150
		}
	}
}

Now I am running into the issue that the documents in my index are too long for my LLM's input. In OpenSearch, context_size and message_size are currently configurable, but when the first document exceeds the input token limit, OpenSearch sends a message to the LLM provider that cannot be processed.

Two things come to mind:

  1. Add a 'max tokens' / 'max terms' config parameter alongside context_size and message_size that does not depend on the document sizes.
  2. Because the documents are so big, and chunking already works fine for the embeddings, why not allow the matching chunks to be used as RAG input instead of the whole documents? If the chunks are too small (because the embedding input size is small), I could also imagine a scripted field that considers the matching chunk together with its neighbor chunks. Or maybe it would be possible to associate, at ingest time, a text field with a generated embedding vector that also comprises the neighbor chunks. This could then be used for RAG afterwards.

Currently, big documents are not just silently lost in RAG. Because the whole prompt exceeds the input token limit of the LLM, it is (in my setup at least) accidentally truncated, meaning that the question - which is the last part of the generated prompt - is lost. So the user's question is not answered at all.

reuschling added the enhancement (New feature or request) and untriaged labels on Jul 4, 2024
@ylwu-amzn
Collaborator

@reuschling, can you share some sample data from your index? I think you should save the text chunks together with the embeddings in the index.

@reuschling
Author

This is what a document looks like, returned by an empty query. I truncated the embedding vectors for better readability. I use dynamic mappings, hence the prefixes in the field names.

Do you mean saving just the text chunks instead of the whole documents? In that case I cannot search the whole documents anymore. The hybrid search searches inside the whole documents in the 'classical' query part; only the embeddings rely on chunks, because the model input length is limited and the semantic representation of the embeddings may become too generalized, depending on the model used.

If I additionally chunk the body to the LLM input token size with an ingest pipeline, I get nested 'bodyChunk4LLM' fields on the original document, as is currently the case for the paragraph and paragraph->chunkSize4Embedding fields.
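
For illustration, such an additional chunking step would just be another text_chunking processor in the ingest pipeline; the token_limit of 1500, the overlap_rate, and the bodyChunk4LLM field name are placeholders, not values I have settled on:

{
    "text_chunking": {
        "algorithm": {
            "fixed_token_length": {
                "token_limit": 1500,
                "overlap_rate": 0.1,
                "tokenizer": "standard"
            }
        },
        "field_map": {
            "tns_body": "bodyChunk4LLM"
        }
    }
}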

In the hybrid query with RAG post-processing, the situation would be the following:

  1. The classic, term-based part of the hybrid query searches inside the original body or in one of the three chunk fields (bodyChunk4LLM, paragraph, paragraph->chunkSize4Embedding).
  2. The neural search part of the hybrid query searches inside the paragraph->chunkSize4Embedding field.
  3. The matched passages from bodyChunk4LLM are then needed as RAG input. But because these are nested fields, I cannot know which one was matched. Further, the fields matched inside the hybrid query are normally different fields, and there is no relationship between them. And the returned format must be valid input for the RAG processor.

I tried to generate a scripted field inside the search query, where the neural search returns the matched chunk offset via "inner_hits". But "inner_hits" are not returned with hybrid queries, and scripted fields don't have access to the inner_hits return values, only to the fields inside the original document. I also doubt that the returned format would be valid for the RAG processor, since scripted fields do not appear in _source.
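
For reference, the kind of nested neural query with inner_hits I experimented with looks roughly like this (standalone, outside of a hybrid query; simplified, with _source disabled so only the nested offsets and scores come back):

{
    "query": {
        "nested": {
            "path": "embedding_chunked_512_paragraphs_chunks_tns_body",
            "score_mode": "max",
            "inner_hits": {
                "_source": false,
                "size": 3
            },
            "query": {
                "neural": {
                    "embedding_chunked_512_paragraphs_chunks_tns_body.knn": {
                        "query_text": "{{ _.query }}",
                        "model_id": "{{ _.embeddingsModelId }}",
                        "k": 10
                    }
                }
            }
        }
    }
}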

The second possibility would be to generate additional documents with a single bodyChunk4LLM field, i.e. for each input document N new documents with the chunks and a reference to the original doc, then using these documents for RAG and the big original documents for the other searches (classic, neural, hybrid). But the chunking ingest processor doesn't generate separate documents, only new fields with N values inside the original document. I don't see an ingest processor that can do this; maybe you have a hint? 😄

Third, I could generate both kinds of documents (the original and its N chunks) outside OpenSearch, but this means changing code inside all possible document providers, which we often don't have access to. It would be much better if this could be done entirely inside OpenSearch configuration, e.g. with ingest pipelines.

There would also be a huge redundancy in terms of data size, because all embeddings, chunks, etc. would have to be generated for both kinds of documents, the original and the bodyChunk4LLM documents, in order to support hybrid neural search in both cases. But that is another story.

"hits": [
    {
        "_index": "testfiles",
        "_id": "/home/reuschling/projectz/leech/resource/testData/example_files/HTML.html",
        "_score": 1.0,
        "_source": {
            "tns_description": "",
            "tk_source": "file:/home/reuschling/projectz/leech/resource/testData/example_files/HTML.html",
            "paragraphs_tns_title": [
                "Laptop power supplies are available in First Class only"
            ],
            "resourceName": "/home/reuschling/projectz/leech/resource/testData/example_files/HTML.html",
            "dataEntityContentFingerprint": "1330962029000",
            "paragraphs_tns_description": [],
            "paragraphs_tns_body": [
                "\n    Code, Write, Fly\n\n",
                "    This chapter is being written 11,000 meters above New Foundland. \n  "
            ],
            "tns_title": "Laptop power supplies are available in First Class only",
            "date_modified": "2012.03.05 16:40:29:000",
            "tns_body": "\n    Code, Write, Fly\n\n    This chapter is being written 11,000 meters above New Foundland. \n  ",
            "embedding_chunked_512_paragraphs_chunks_tns_body": [
                {
                    "knn": [
                        -0.016918883,
                        -0.0045324834,
                        ...
                    ]
                },
                {
                    "knn": [
                        0.03815532,
                        0.015174329,
                        ...
                    ]
                }
            ],
            "X-TIKA:Parsed-By": [
                "org.apache.tika.parser.CompositeParser",
                "de.dfki.km.leech.parser.HtmlCrawlerParser"
            ],
            "Content-Encoding": "windows-1252",
            "dataEntityId": "/home/reuschling/projectz/leech/resource/testData/example_files/HTML.html",
            "paragraphs_chunks_tns_title": [
                "Laptop power supplies are available in First Class only"
            ],
            "paragraphs_chunks_tns_body": [
                "\n    Code, Write, Fly\n\n",
                "    This chapter is being written 11,000 meters above New Foundland. \n  "
            ],
            "embedding_512_tns_body": [
                0.021791747,
                0.0016991429,
                ...
            ],
            "Content-Type": "text/html; charset=windows-1252",
            "paragraphs_chunks_tns_description": []
        }
    }
]

@yuye-aws
Member

yuye-aws commented Jul 6, 2024

The second possibility would be to generate additional documents with a single bodyChunk4LLM field, i.e. for each input document N new documents with the chunks and a reference to the original doc, then using these documents for RAG and the big original documents for the other searches (classic, neural, hybrid). But the chunking ingest processor doesn't generate separate documents, only new fields with N values inside the original document. I don't see an ingest processor that can do this; maybe you have a hint? 😄

You are right. To the best of my knowledge, none of the ingest processors can return multiple documents. Can you take a look at Logstash? After the text chunking and embedding, maybe you can use Logstash to flatten the chunked list and embeddings into multiple documents.

@yuye-aws
Member

yuye-aws commented Jul 6, 2024

From my understanding, you only need the specific chunk instead of the whole document. You need a way for the nested query to tell you which chunk was matched. Please correct me if I am wrong. @reuschling

@reuschling
Author

Yes, exactly @yuye-aws, because the LLM can only process a limited text length. But there is still the problem that the chunks for the embedding model, which are matched against in the nested query, normally have different sizes than the chunks needed for the LLM, which are in general much bigger. Hence my thought to also get the neighbor chunks.

So, in terms of control over the whole search process, I would now assume it would be best to have control over the LLM chunk size, which is currently only possible via the field content size. This can currently be achieved only by generating separate documents with LLM-related chunk sizes, created either outside OpenSearch (Logstash, other document providers) or inside OpenSearch (ingest processor).

By far preferable would be inside OpenSearch: there are millions of existing document providers besides Logstash, and otherwise they would all have to be adjusted. I would now personally look into whether it is possible to create my own ingest processor (maybe script based?) that creates chunks like the current text chunking processor, but emits separate documents instead of additional fields.

The idea of using Logstash as a post-processing step is also not bad, but it is not so easy to realize for a non-static index where new document content is added frequently.

@yuye-aws
Member

yuye-aws commented Jul 9, 2024

So, in terms of control over the whole search process, I would now assume it would be best to have control over the LLM chunk size, which is currently only possible via the field content size. This can currently be achieved only by generating separate documents with LLM-related chunk sizes, created either outside OpenSearch (Logstash, other document providers) or inside OpenSearch (ingest processor).

Aligning the chunk size with the LLM is a good practice for search relevance. I have created an RFC on model-based tokenizers: opensearch-project/neural-search#794. Do you think your concern would be addressed if we could register the same tokenizer as the LLM?

By the way, I have listed a few options under the RFC, along with their pros and cons. Would you like to share your valuable opinions? Any comments will be appreciated.

@yuye-aws
Member

yuye-aws commented Jul 9, 2024

I personally would look now if there is a possibility to create an own ingest processor (maybe script based?) that would create chunks like the current text chunking processor, but creates separate documents instead of additional fields instead.

I am afraid not. An ingest processor in OpenSearch only performs certain actions on a single document. We cannot create multiple documents based on a single document.

@ylwu-amzn Can you provide some suggestions? Maybe we can have a discussion together.

@yuye-aws
Member

yuye-aws commented Jul 9, 2024

I tried to generate a scripted field inside the search query, where the neural search returns the matched chunk offset via "inner_hits". But "inner_hits" are not returned with hybrid queries, and scripted fields don't have access to the inner_hits return values, only to the fields inside the original document. I also doubt that the returned format would be valid for the RAG processor, since scripted fields do not appear in _source.

How about supporting inner_hits in hybrid queries? Feel free to create an RFC so that we can discuss next-step plans.

@reuschling
Author

How about supporting inner_hits in hybrid queries? Feel free to create an RFC so that we can discuss next-step plans.

There is already a feature request for this: opensearch-project/neural-search#718.
Nevertheless, the follow-up problems would remain (scripted fields don't have access to inner_hits, and the RAG processor cannot deal with scripted fields, as far as I know).

Otherwise this could also be a valid solution, because there would be no need for extra chunking for the LLM anymore. It would be possible to build the chunk for the LLM out of the matched embedding chunk and its neighbor chunks. The only drawback is that the match would rely only on the embedding part of the hybrid query.

@reuschling
Author

Aligning the chunk size with the LLM is a good practice for search relevance. I have created an RFC on model-based tokenizers: opensearch-project/neural-search#794. Do you think your concern would be addressed if we could register the same tokenizer as the LLM?

This is a better way to specify valid chunk lengths for the current model, right? I currently use the rule of thumb terms ≈ 0.75 * token limit, as documented at https://opensearch.org/docs/latest/ingest-pipelines/processors/text-chunking/. I think this is valid; with a specified overlap it feels to me that there is no relevant information loss. Specifying the real token limit (as known from the model description) directly would of course be much easier. But I am not sure whether every (future) model follows the same tokenization rules. If not, things would become complex, compared to a bit of redundancy in the data with the current approach. I will have a look at your RFC, thanks for the hint.
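
For example, for an embedding model with a 512-token input limit, that rule gives roughly 512 * 0.75 ≈ 384 terms per chunk, which also happens to be the processor's default token_limit.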

@yuye-aws
Member

Aligning the chunk size with the LLM is a good practice for search relevance. I have created an RFC on model-based tokenizers: opensearch-project/neural-search#794. Do you think your concern would be addressed if we could register the same tokenizer as the LLM?

This is a better way to specify valid chunk lengths for the current model, right? I currently use the rule of thumb terms ≈ 0.75 * token limit, as documented at https://opensearch.org/docs/latest/ingest-pipelines/processors/text-chunking/. I think this is valid; with a specified overlap it feels to me that there is no relevant information loss. Specifying the real token limit (as known from the model description) directly would of course be much easier. But I am not sure whether every (future) model follows the same tokenization rules. If not, things would become complex, compared to a bit of redundancy in the data with the current approach. I will have a look at your RFC, thanks for the hint.

Replied in opensearch-project/neural-search#794

@yuye-aws
Member

There is already a feature request for this: opensearch-project/neural-search#718.
Nevertheless, the follow-up problems would remain (scripted fields don't have access to inner_hits, and the RAG processor cannot deal with scripted fields, as far as I know).

I see. Supporting inner_hits in hybrid queries does not suffice to resolve your problem. It may take us some time to investigate and check valid solutions to your problem. Thanks for your patience.

If we enable the nested query to return the specific chunk, does that resolve your problem?

@reuschling
Author

If we enable the nested query to return the specific chunk, does that resolve your problem?

Still, the RAG processor doesn't support this as input. And there is also no real control for building chunks of the right size for RAG / the LLM.

The question is how OpenSearch can provide control over the LLM chunk sizes for RAG without the need to chunk the documents outside of OpenSearch.

Outside of OpenSearch there is the same problem when chunking a document as you described in opensearch-project/neural-search#794.

For answering a single question, the ideal LLM chunk size would be:

  • Specify the RAG input as the top N result documents
  • Store document chunks in the index with size (LlmInputMaxTokenCount - MaxLengthUserQuery) / N (see the example below)
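
(Purely as an illustration with assumed numbers: an 8,192-token context window, a 512-token budget for the question, and N = 4 documents would give (8192 - 512) / 4 = 1920 tokens per chunk.)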

Nevertheless, for a conversation with follow-up questions the chunk sizes have to be smaller. Is there a mechanism in conversational search that ensures the input context length of the LLM is not exceeded?

I see these possibilities for getting control over the LLM chunk sizes:

  1. Creating the chunks as extra documents, either inside OpenSearch (not supported yet) or outside of OpenSearch (with the gap of the model-specific tokenization rules).
  2. Enabling the OpenSearch RAG processor to somehow deal with field chunks made for LLM input (the hybrid neural query would return chunks sized for embedding input), which requires a relationship between the matched embedding chunk field of the hybrid query and the corresponding LLM chunk field. This is complex, especially when dealing with both parts of the hybrid query.
  3. If the LLM chunks are not built explicitly in the index, I see a possibility to build them on the fly out of the pre-built, matched embedding chunks together with their neighbor chunks. But in the case of a hybrid neural query, the term-based part also has to search in the embedding chunk field in order to return it; otherwise only the embedding part of the hybrid query will return a chunk. This was the original idea of this feature request; I now see the limitations.

1. and 2. would be clean solutions from my point of view.

@yuye-aws
Member

The current OpenSearch version does not support this feature. We will have some discussions within our team to explore possible solutions.

@b4sjoo
Collaborator

b4sjoo commented Jul 16, 2024

Hi @yuye-aws, can you help take care of this issue?

@yuye-aws
Member

Hi @yuye-aws, can you help take care of this issue?

Sure

@yuye-aws
Member

Hi @reuschling! We are investigating possible solutions for this issue. Can you provide the post-processing script you use to access the chunk offset?

@reuschling
Author

@yuye-aws I am not sure what you mean. Do you mean post-processing as part of the query, as part of my suggested solution 2? I'm not sure how to establish the relationship between the field matched by the hybrid query and the needed LLM chunk. Maybe with text offset overlaps?
Or do you mean your idea of using e.g. Logstash as a post-processing step to create chunk documents after indexing? There I see the main problem of how to deal with incremental indexing, if a document is modified in a later indexing run. The former chunk documents of the source document would have to be deleted in that case.

@ylwu-amzn
Collaborator

@reuschling, I'm building a solution based on agents for another customer who has a similar problem. Is it ok to use an agent to run RAG for your case? https://github.com/opensearch-project/ml-commons/blob/2.x/docs/tutorials/agent_framework/RAG_with_conversational_flow_agent.md

@reuschling
Author

@ylwu-amzn, thanks for the hint. I had a look at the agent framework, but the same constraints apply as with the other ways to configure RAG, right? Regarding text chunks for the LLM, it makes no difference whether RAG is configured with a conversational search template or a conversational agent.

@yuye-aws
Member

yuye-aws commented Sep 5, 2024

@yuye-aws I am not sure what you mean. Do you mean post-processing as part of the query, as part of my suggested solution 2? I'm not sure how to establish the relationship between the field matched by the hybrid query and the needed LLM chunk. Maybe with text offset overlaps?

I mean post-processing a nested search query (non-hybrid). How would you expect to use a script field and inner_hits to retrieve the specific field (if you could access the inner_hits from the script field)?

Sorry for taking so long to respond. I have recently been busy with other tasks.

@yuye-aws
Member

yuye-aws commented Sep 5, 2024

Well, after some investigation, I have come up with the following two options. Which option do you prefer, @reuschling?

  1. Implement a new type of search response processor that retrieves the specific chunk from the nested type. The chunks would be ranked in decreasing order of search relevance score.
  2. Within the retrieval_augmented_generation processor, allow users to configure the number of strings (instead of the number of documents). I was thinking about allowing users to configure a token limit, but unfortunately we do not support model-based tokenizers yet.

@yuye-aws
Member

yuye-aws commented Sep 5, 2024

Here are the pros and cons of both options. I am personally in favor of the first option.

Option 1

Pros

  1. Enable the users to directly obtain the most relevant chunk after reranking.

Cons

  1. The processor only makes sense for the max score_mode in the nested query. There would be no guarantee that the reranked chunks are the most relevant for other score_mode values.
  2. It is hard for users to retrieve neighbor chunks from the processor outputs.
  3. BM25 search may not need this processor. This processor would be implemented in either the neural-search or the ml-commons repo.

Option 2

Pros

  1. Perhaps users can devise a customized parameter to retrieve neighbor chunks.

Cons

  1. The 1st and 3rd cons of option 1.
  2. We need to implement a sorting method to aggregate and rerank the chunks. Suppose the user searches documents with the max score_mode, and the chunk relevance scores of the first and second document are [0.9, 0.2, 0.1] and [0.8, 0.7, 0.6]. The expected order would be [0.9, 0.8, 0.7, 0.6, 0.2, 0.1].

@reuschling
Author

You mean implementing my suggested solution 2, 'Enabling the RAG processor to somehow deal with field chunks made for LLM input', with a search response processor that can find the right chunk4llm_field from the hybrid query result? With the help of additional queries, so that similar chunk fields are ranked by search relevance score? But what would be the input of such a query? When processing a hybrid search we have a possible term-based, classic match and an embedding-based match against a chunk4embeddings_field. What are the general criteria? Further, I doubt that OpenSearch retrieves single (chunk) fields as a result; if there are multiple values for a single field, they are treated as concatenated for search, aren't they? And, last but not least, processing further queries per result document could also be a performance issue.
Still, it would be a nice solution to somehow deal with field chunks as input for the RAG processor.

Your second point, a token limit for the retrieval_augmented_generation processor, sounds good, at least for throwing an error if the input exceeds the configured limit.
I see no way to just truncate the input: if someone feeds in whole documents and we take only the beginning terms, the relevant chunk would not be considered at all.

@yuye-aws
Member

yuye-aws commented Sep 6, 2024

With the help of additional queries, so that similar chunk fields are ranked by search relevance score? But what would be the input of such a query? When processing a hybrid search we have a possible term-based, classic match and an embedding-based match against a chunk4embeddings_field. What are the general criteria? Further, I doubt that OpenSearch retrieves single (chunk) fields as a result; if there are multiple values for a single field, they are treated as concatenated for search, aren't they? And, last but not least, processing further queries per result document could also be a performance issue.

Actually, it's not a query; it's a search response processor: https://opensearch.org/docs/latest/search-plugins/search-pipelines/search-processors/#search-response-processors. The processor would further process the results retrieved from the nested query. To be specific, it would visit the inner_hits to see the relevance score of each chunk, rerank the chunks, and then return the results to the user.

I will take a look at the neural query and hybrid query in the coming days.

@yuye-aws
Member

yuye-aws commented Sep 6, 2024

Your second point, a token limit for the retrieval_augmented_generation processor, sounds good, at least for throwing an error if the input exceeds the configured limit.
I see no way to just truncate the input: if someone feeds in whole documents and we take only the beginning terms, the relevant chunk would not be considered at all.

Truncating the input is not supported. But I think your LLM can automatically do the truncation.

@yuye-aws
Member

yuye-aws commented Sep 6, 2024

Your second point, a token limit for the retrieval_augmented_generation processor, sounds good, at least for throwing an error if the input exceeds the configured limit.

The token limit solution is only feasible once we support model-based tokenizers in OpenSearch. I'm afraid it will take at least a few releases to accomplish. Perhaps you can wait for the OpenSearch 2.19 release.

@yuye-aws
Member

yuye-aws commented Sep 6, 2024

  • Implement a new type of search response processor that retrieves the specific chunk from the nested type. The chunks would be ranked in decreasing order of search relevance score.

I will take a look into this solution in the coming days. Just a question for you, @reuschling: is it required for you to use the hybrid query? Since the search response of a hybrid query does not support inner_hits, the proposed search response processor may not be able to retrieve the hybrid score of each chunk.

@reuschling
Author

reuschling commented Sep 6, 2024

Yes, I use hybrid queries; I think it is not wise to ignore the core search competences of Lucene/OpenSearch when looking for possible solutions. Generalized solutions would be better. But everything begins with a first step :)

So, if I understand you correctly, your suggestion is to write a search response processor that does what I tried with a scripted field, right? To build the right chunk for the LLM out of the matched embedding chunk and its neighbors.

This would be the third of the possible solutions I had in mind, but I am no longer sure it would be a valid solution.
Cons:

  1. In case of a hybrid query, only the neural part against the embedding chunks is guaranteed to return a chunk.
  2. Even in case of a pure neural query, the chunks generated for the embeddings may overlap to some degree, which is supported by the chunking processor. This is good practice, but if we generate LLM chunks out of overlapping embedding chunks, the result would not be so good.

Still, generating LLM chunks on the fly would of course be nice.
Pros:

  1. No need to create LLM chunks outside of OpenSearch or inside OpenSearch at ingestion time.
  2. Clear document structure: there is only the original document with its original length and no further related documents in the index for the LLM chunks. This document of course has fields for the embedding chunks for the neural search, but nothing more.
  3. Much smaller index size, because overlapping LLM chunks also more than double the disk usage.
  4. The LLM chunk size could be adjusted without the need to reindex the whole corpus.

Cons:

  1. Searching in (LLM) chunks in the term-based part of the hybrid query could be more precise. But that can still be done.
  2. Some performance overhead at search time.

Maybe a better solution would be to somehow get the term offset of the match inside the original field, i.e. the term offset of the term-based match and the term offset of the matched embedding chunk. With this offset it would be possible to cut the chunk for the LLM out of the original field. But for the term-based part of the match, there is no single clear term offset inside the original document; there are only several term match offsets, and it is still unclear which part of the document would be the best chunk. Thus this solution, too, would only be possible for the neural search part.

For the term-based part, I currently only see the possibility of searching inside pre-chunked document parts of the right size for the LLM, not 'on the fly'. The current gap in OpenSearch is that this chunking has to be done outside of OpenSearch. It would be a huge benefit, from my point of view, if it could also be done inside OpenSearch. Possibilities would be my suggested solutions 1. and 2., i.e.

  1. Enabling the creation of chunks as extra documents (maybe nested fields?), e.g. with a corresponding ingest processor, breaking the current restriction that ingest processors cannot create extra documents. Or
  2. Creating the LLM chunks with the current chunking processor as document fields, and enabling the RAG processor to somehow pick the right chunk field. This needs both a relationship from the term-based part of the hybrid query to the right chunk field (maybe as a nested query, as in the neural search part), and a relationship from the neural search result to the corresponding LLM chunk field (maybe by comparing chunk offsets and picking the LLM chunk field with overlapping offsets).

Possibility 1. sounds like the easiest solution to me, but I am not aware of possible hard restrictions for ingest processors.

@yuye-aws
Member

yuye-aws commented Sep 7, 2024

Possibility 1. sounds like the easiest solution to me, but I am not aware of possible hard restrictions for ingest processors.

Basically, the ingest pipeline in OpenSearch is a one-to-one mapping of the document: for any ingested document, both the input and the output are a single document. You can check the following code snippet to get a rough idea: https://github.com/opensearch-project/OpenSearch/blob/7c9c01d8831f57b853647bebebd8d91802186778/server/src/main/java/org/opensearch/ingest/IngestDocument.java#L797-L819

And also: https://github.com/opensearch-project/OpenSearch/blob/7c9c01d8831f57b853647bebebd8d91802186778/server/src/main/java/org/opensearch/ingest/Processor.java#L80-L86

@yuye-aws
Member

yuye-aws commented Sep 7, 2024

Hi @chishui! Do you think there is any possible way to generate multiple documents when a user ingests a document? Intuitively, can we modify the innerExecute method to produce an array of ingest documents? https://github.com/opensearch-project/OpenSearch/blob/7c9c01d8831f57b853647bebebd8d91802186778/server/src/main/java/org/opensearch/ingest/IngestService.java#L990-L1027

@yuye-aws
Member

yuye-aws commented Sep 7, 2024

2. Creating the LLM chunks with the current chunking processor as document fields, and enabling the RAG processor to somehow pick the right chunk field. This needs both a relationship from the term-based part of the hybrid query to the right chunk field (maybe as a nested query, as in the neural search part), and a relationship from the neural search result to the corresponding LLM chunk field (maybe by comparing chunk offsets and picking the LLM chunk field with overlapping offsets).

It is a good idea to reduce the overlapping tokens. I guess you are expecting to retrieve the neighbor chunks along with the matched chunk. In my opinion, that is not a hard requirement for addressing your problem. We just need to find the best-matching chunk and return it to the user.

@yuye-aws
Member

yuye-aws commented Sep 7, 2024

  • In case of a hybrid query, only the neural part against the embedding chunks is guaranteed to return a chunk.

I will take a look at the hybrid query in the next few days, just in case other solutions do not work. I would like to begin with the search response processor solution to support the neural query first. Also, maybe you can leave an email address or join the OpenSearch Slack channel so that we can have a meeting and respond to your messages ASAP. For your information: https://opensearch.org/slack.html

@reuschling
Author

  1. Creating the LLM chunks with the current chunking processor as document fields, and enabling the RAG processor to somehow pick the right chunk field. This needs both a relationship from the term-based part of the hybrid query to the right chunk field (maybe as a nested query, as in the neural search part), and a relationship from the neural search result to the corresponding LLM chunk field (maybe by comparing chunk offsets and picking the LLM chunk field with overlapping offsets).

It is a good idea to reduce the overlapping tokens. I guess you are expecting to retrieve the neighbor chunks along with the matched chunk. In my opinion, that is not a hard requirement for addressing your problem. We just need to find the best-matching chunk and return it to the user.

No, here I do not mean dealing with neighbor chunks. We have two different kinds of chunks, one with a chunk size for the LLM and one with a chunk size for the embeddings. Here I am thinking about how to get a relationship from a matched embedding chunk from the neural search to the corresponding pre-calculated chunk field in LLM size. One possibility could be looking at the source offsets of both chunks.

@yuye-aws
Member

yuye-aws commented Sep 9, 2024

Third, I could generate both kinds of documents (the original and its N chunks) outside OpenSearch, but this means changing code inside all possible document providers, which we often don't have access to. It would be much better if this could be done entirely inside OpenSearch configuration, e.g. with ingest pipelines.

Also, could you elaborate more on why you could not generate a document for each chunk? In my opinion, you could download the existing index and then create a new index with separate documents. You would just need to consume duplicate space, right?

@yuye-aws
Member

yuye-aws commented Sep 9, 2024

One possibility could be looking at the source offsets of both chunks.

Sure, inner_hits can do that. I will check with the neural-search team today what the blocker is for supporting inner_hits with the hybrid query.

@reuschling
Author

Also, could you elaborate more on why you could not generate a document for each chunk?

For my current use case, where I implemented a new importer, I do exactly that. But we have several existing document corpora that are indexed/mirrored into OpenSearch. To enable RAG for existing OpenSearch applications - where the implementation of the document import is finished, possibly complicated, and the code may not be available - it is currently mandatory to write code. Just configuring OpenSearch differently, transparently to the import process outside of OpenSearch, is much better and sometimes maybe the only possibility.

In my opinion, you could download the existing index and then create a new index with separate documents. You would just need to consume duplicate space, right?

This is right, of course. Things become complicated if you have (several) mirrored document corpora, where you check for modifications inside the corpus - new, modified or deleted documents - and re-index the delta incrementally. Someone would have to implement this mirroring functionality for the duplicated RAG index as well. Technically everything is possible and doable, of course. But again, it is a totally different scenario - much more work and cost - compared to having the possibility to just reconfigure OpenSearch on top of the existing, unmodified solution.

Or in other words: for building new applications, the current possibilities are sufficient. For migrating existing applications to RAG, it would be a benefit if this could be done purely through server configuration.

@yuye-aws
Member

I understand. As a first step, I am implementing a prototype of the search response processor. I will ping you when it is ready.

@yuye-aws
Member

yuye-aws commented Sep 11, 2024

Hi @reuschling. I regret to tell you some bad news :(

I am running inner_hits with the neural query to search documents according to their chunks. It only returns the highest-scoring chunk in the inner_hits field. This is weird behavior that differs from the BM25 query.

The search response processor will thus only return one chunk for each document. Suppose that the chunk relevance scores of two documents are [0.9, 0.8, 0.7] and [0.6, 0.5]. The search response processor can only return [0.9, 0.6].

This would definitely be an unexpected result. I will open an issue in the neural-search repo. Over the next few days, I will take a deeper dive to see if there are any blocking issues.

@reuschling
Author

reuschling commented Sep 12, 2024

Hi, but isn't the index of the chunk also there? With this, the neighbor chunks could be determined, couldn't they?

It could be a real performance issue to return all chunks, as loading field data generally takes a lot of time.

@yuye-aws
Member

Hi, but isn't the index of the chunk also there? With this, the neighbor chunks could be determined, couldn't they?

It could be a real performance issue to return all chunks, as loading field data generally takes a lot of time.

You can determine neighbor chunks via offsets, but there is no guarantee that the neighbor chunks are relevant to the user query.

Suppose that the chunk relevance scores of two documents are [0.9, 0.8, 0.7] and [0.6, 0.5]. The search response processor can only return [0.9, 0.6].

We can also have an example with [0.9, 0.2, 0.1] and [0.6, 0.5]. The expected returned chunks should be [0.9, 0.8] in the first example and [0.9, 0.6] in this one. Unfortunately, without inner_hits, we cannot distinguish between the two cases.

@yuye-aws
Member

You can check the bug issue in neural-search: opensearch-project/k-NN#2113. This is the current blocking issue.

@yuye-aws
Member

Latest update: the neural-search issue has been transferred to the k-NN repo: opensearch-project/k-NN#2113
