feat: implement text chunking processor with fixed token length and delimiter algorithm #607

yuye-aws · 2024-02-18T12:31:55Z

Description

This PR implements the text chunking processor proposed in the RFC. It provides two algorithms: the fixed token length algorithm and the delimiter algorithm. The chunking ingest processor can be used as follows:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "text_chunking": {
          "algorithm": {
            "fixed_token_length": {
              "token_limit": 10,
              "overlap_rate": 0.2,
              "tokenizer": "standard"
            }
          },
          "field_map": {
            "body": "body_chunk"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
      }
    }
  ]
}

The response is as follows:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "body_chunk": [
            "This is an example document to be chunked The document",
            "The document contains a single paragraph two sentences and 24",
            "and 24 tokens by standard tokenizer in OpenSearch"
          ],
          "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
        },
        "_ingest": {
          "timestamp": "2024-03-05T09:49:37.131255Z"
        }
      }
    }
  ]
}

Refer to the RFC for detailed parameter descriptions.
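To make the effect of token_limit and overlap_rate concrete, here is a minimal sketch of a sliding-window chunker. It is not the code in this PR: the class name FixedTokenLengthSketch, the chunk helper, and the regex split that stands in for the standard tokenizer are all assumptions made for illustration. It assumes consecutive chunks share token_limit * overlap_rate tokens, rounded down.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FixedTokenLengthSketch {

    // Hypothetical helper: windows of at most tokenLimit tokens, where each
    // window starts (tokenLimit - overlap) tokens after the previous one.
    static List<String> chunk(List<String> tokens, int tokenLimit, double overlapRate) {
        // BigDecimal avoids floating-point surprises when computing the overlap.
        int overlap = BigDecimal.valueOf(overlapRate).multiply(BigDecimal.valueOf(tokenLimit)).intValue();
        int step = tokenLimit - overlap;
        if (step <= 0) {
            throw new IllegalArgumentException("overlap must be smaller than the token limit");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start += step) {
            int end = Math.min(start + tokenLimit, tokens.size());
            chunks.add(String.join(" ", tokens.subList(start, end)));
            if (end == tokens.size()) {
                break; // the last window reached the end of the document
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        String body = "This is an example document to be chunked. The document contains a single paragraph, "
            + "two sentences and 24 tokens by standard tokenizer in OpenSearch.";
        // Assumption: splitting on non-alphanumeric runs approximates the standard tokenizer for this text.
        List<String> tokens = Arrays.asList(body.split("[^\\p{Alnum}]+"));
        chunk(tokens, 10, 0.2).forEach(System.out::println);
    }
}
```

With token_limit 10 and overlap_rate 0.2, the window advances by 8 tokens, so the 24-token example yields the same three chunks as the response above.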

Use Cases

Text Embedding

After configuring the text_embedding processor and obtaining the model ID, we can chain the chunking processor with the text_embedding processor to obtain an embedding vector for each chunked passage. Here is an example:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "text_chunking": {
          "algorithm": {
            "fixed_token_length": {
              "token_limit": 10,
              "overlap_rate": 0.2,
              "tokenizer": "standard"
            }
          },
          "field_map": {
            "body": "body_chunk"
          }
        }
      },
      {
        "text_embedding": {
          "model_id": "IYMBDo4BwlxmLrDqUr0a",
          "field_map": {
            "body_chunk": "body_chunk_embedding"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
      }
    }
  ]
}

And we obtain the following results:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "body_chunk": [
            "This is an example document to be chunked The document",
            "The document contains a single paragraph two sentences and 24",
            "and 24 tokens by standard tokenizer in OpenSearch"
          ],
          "body_chunk_embedding": [
            {
              "knn": [...]
            },
            {
              "knn": [...]
            },
            {
              "knn": [...]
            }
          ],
          "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
        },
        "_ingest": {
          "timestamp": "2024-03-05T09:49:37.131255Z"
        }
      }
    }
  ]
}

Cascaded Chunking Processors

Users can chain multiple chunking processors together. For example, if a user wishes to split documents by paragraph, they can apply the delimiter algorithm and set the delimiter parameter to "\n\n". If a paragraph exceeds the token limit, the user can then append another chunking processor with the fixed token length algorithm. The ingest pipeline in this example can be configured as follows:

PUT _ingest/pipeline/chunking-pipeline
{
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "delimiter": {
            "delimiter": "\n\n"
          }
        },
        "field_map": {
          "body": "body_chunk1"
        }
      }
    },
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 500,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "body_chunk1": "body_chunk2"
        }
      }
    }
  ]
}
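The two-stage pipeline above can be pictured with the following sketch, which splits on a delimiter first and then applies a token window to each resulting passage. All names here (CascadedChunkingSketch, splitByDelimiter, splitByTokenLimit, cascade) are hypothetical. Keeping the delimiter at the end of each first-stage chunk and tokenizing on whitespace are assumptions for illustration, not confirmed behavior of the processor.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CascadedChunkingSketch {

    // Stage 1 (assumed behavior): split on the delimiter, keeping it at the end of each chunk.
    static List<String> splitByDelimiter(String content, String delimiter) {
        List<String> chunks = new ArrayList<>();
        int start = 0;
        int next = content.indexOf(delimiter, start);
        while (next != -1) {
            int end = next + delimiter.length();
            chunks.add(content.substring(start, end));
            start = end;
            next = content.indexOf(delimiter, start);
        }
        if (start < content.length()) {
            chunks.add(content.substring(start)); // trailing text without a delimiter
        }
        return chunks;
    }

    // Stage 2 (assumed behavior): overlapping windows over whitespace-separated tokens.
    static List<String> splitByTokenLimit(String passage, int tokenLimit, double overlapRate) {
        List<String> chunks = new ArrayList<>();
        if (passage.trim().isEmpty()) {
            return chunks; // nothing to chunk, e.g. two delimiters in a row
        }
        List<String> tokens = Arrays.asList(passage.trim().split("\\s+"));
        int overlap = BigDecimal.valueOf(overlapRate).multiply(BigDecimal.valueOf(tokenLimit)).intValue();
        int step = tokenLimit - overlap;
        if (step <= 0) {
            throw new IllegalArgumentException("overlap must be smaller than the token limit");
        }
        for (int start = 0; start < tokens.size(); start += step) {
            int end = Math.min(start + tokenLimit, tokens.size());
            chunks.add(String.join(" ", tokens.subList(start, end)));
            if (end == tokens.size()) {
                break;
            }
        }
        return chunks;
    }

    // Mirrors body -> body_chunk1 -> body_chunk2, flattened into a single list.
    static List<String> cascade(String body, String delimiter, int tokenLimit, double overlapRate) {
        List<String> result = new ArrayList<>();
        for (String paragraph : splitByDelimiter(body, delimiter)) {
            result.addAll(splitByTokenLimit(paragraph, tokenLimit, overlapRate));
        }
        return result;
    }
}
```

In this sketch a short paragraph passes through the second stage unchanged apart from whitespace normalization, while a long paragraph is broken into overlapping windows.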

Issues Resolved

Implement document chunking processor and fixed token length algorithm

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

yuye-aws · 2024-02-18T12:40:28Z

For now, this PR is a POC for the RFC. I will mark this PR as ready when we finalize the high level design and add corresponding unit tests and integration tests.

codecov · 2024-02-18T12:41:11Z

Codecov Report

Attention: Patch coverage is 97.89916% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 84.19%. Comparing base (e41fba7) to head (68fef4f).
Report is 2 commits behind head on main.

Files	Patch %	Lines
.../neuralsearch/processor/TextChunkingProcessor.java	96.03%	2 Missing and 3 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main     #607      +/-   ##
============================================
+ Coverage     82.62%   84.19%   +1.56%     
- Complexity      666      743      +77     
============================================
  Files            52       59       +7     
  Lines          2072     2309     +237     
  Branches        334      370      +36     
============================================
+ Hits           1712     1944     +232     
- Misses          212      214       +2     
- Partials        148      151       +3

yuye-aws · 2024-02-22T15:25:08Z

Hi @zane-neo! I have modified the PR according to your comments. Feel free to review my code.

samuel-oci

Thank you for the draft, @yuye-aws. I would like us to follow the upcoming new feature release process.

Let's make sure all feature spec feedback is collected in the RFC [RFC] Text chunking design #548
Let's create a meta issue with design (I can create one and link it)
We will move forward with the changes

yuye-aws · 2024-02-26T01:42:32Z

Lets create a meta issue with design (I can create one and link it)

Do you mean the high-level design of the document chunking processor? Is the Interface Design section in the RFC what you are looking for?

model-collapse · 2024-03-18T00:03:37Z

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedTokenLengthChunker.java

+    private static final Set<String> WORD_TOKENIZERS = Set.of(
+        "standard",
+        "letter",
+        "lowercase",
+        "whitespace",
+        "uax_url_email",
+        "classic",
+        "thai"
+    );


For now, let's not support any customized tokenizers here, to avoid ones that produce overlapping tokens. We can add a smarter checker for tokenizers later.
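One way to enforce that restriction is to validate the configured tokenizer name against the allow-list before chunking. The sketch below is only an illustration: the class TokenizerCheckSketch, the validateTokenizer method, and the message wording are hypothetical, and only the tokenizer names are taken from the diff above.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

final class TokenizerCheckSketch {

    // Same names as the WORD_TOKENIZERS set in the diff above.
    private static final Set<String> WORD_TOKENIZERS = Set.of(
        "standard",
        "letter",
        "lowercase",
        "whitespace",
        "uax_url_email",
        "classic",
        "thai"
    );

    // Hypothetical validator: returns the name if allowed, otherwise fails fast.
    static String validateTokenizer(String tokenizer) {
        if (!WORD_TOKENIZERS.contains(tokenizer)) {
            throw new IllegalArgumentException(
                String.format(
                    Locale.ROOT,
                    "Tokenizer [%s] is not supported. Supported word tokenizers are %s.",
                    tokenizer,
                    new TreeSet<>(WORD_TOKENIZERS) // sorted so the message is deterministic
                )
            );
        }
        return tokenizer;
    }
}
```

Rejecting unknown names at configuration time keeps tokenizers such as n-gram variants, whose tokens overlap in the source text, out of the chunking logic.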

model-collapse · 2024-03-18T00:07:17Z

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedTokenLengthChunker.java

+            throw new IllegalStateException(
+                String.format(Locale.ROOT, "%s algorithm encounters exception in tokenization: %s", ALGORITHM_NAME, e.getMessage()),


It is ok to include the original message, but the wording is too simple. We need to explain why this is happening.
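As a rough sketch of what a more explanatory message could look like, the example below wraps a tokenization failure with the algorithm name, the tokenizer in use, and the original cause. The class TokenizationErrorSketch, the nested Tokenizer interface, the tokenizeOrExplain method, the value of ALGORITHM_NAME, and the message wording are all assumptions for illustration, not the code in this PR.

```java
import java.io.IOException;
import java.util.List;
import java.util.Locale;

final class TokenizationErrorSketch {

    // Assumed value; the diff only shows that a constant with this name exists.
    static final String ALGORITHM_NAME = "fixed_token_length";

    // Hypothetical stand-in for whatever performs tokenization and may fail with an I/O error.
    interface Tokenizer {
        List<String> tokenize(String text) throws IOException;
    }

    static List<String> tokenizeOrExplain(String tokenizerName, Tokenizer tokenizer, String content) {
        try {
            return tokenizer.tokenize(content);
        } catch (IOException e) {
            // State what failed and with which tokenizer, then include the original message.
            throw new IllegalStateException(
                String.format(
                    Locale.ROOT,
                    "%s algorithm failed to tokenize the input text with tokenizer [%s]. Cause: %s",
                    ALGORITHM_NAME,
                    tokenizerName,
                    e.getMessage()
                ),
                e // keep the original exception as the cause for stack traces
            );
        }
    }
}
```

Passing the caught exception as the cause keeps the full stack trace available even if the message itself is shortened later.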

…elimiter algorithm (#607) * implement chunking processor and fixed token length Signed-off-by: yuye-aws <yuyezhu@amazon.com> * initialize node client for document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * initialize document chunking processor with analysis registry Signed-off-by: yuye-aws <yuyezhu@amazon.com> * chunker factory create with analysis registry Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement tokenizer in fixed token length algorithm with analysis registry Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add max token count parsing logic Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix for non-existing index Signed-off-by: yuye-aws <yuyezhu@amazon.com> * change error log Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement evenly chunk Signed-off-by: yuye-aws <yuyezhu@amazon.com> * unit tests for chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * unit tests for chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add error message for chunker factory tests Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * Revert "implement evenly chunk" This reverts commit 93dd2f4. 
Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add default value logic back Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement unit test for fixed token length chunker Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add test cases in unit test for fixed token length chunker Signed-off-by: yuye-aws <yuyezhu@amazon.com> * support map type as an input Signed-off-by: yuye-aws <yuyezhu@amazon.com> * support map type as an input Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix for map type Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix for map type Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix for map type in document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove system out println Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add delimiter chunker Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add UT for delimiter chunker Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add delimiter chunker processor Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add more UTs Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add more UTs Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * basic unit tests for document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix tests for getProcessors in neural search Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit tests with string, map and nested map type for document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit tests for parameter valdiation in document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add back deleted xml file Signed-off-by: yuye-aws <yuyezhu@amazon.com> * restore xml file Signed-off-by: yuye-aws <yuyezhu@amazon.com> * integration tests for document chunking processor Signed-off-by: yuye-aws 
<yuyezhu@amazon.com> * add back Run_Neural_Search.xml Signed-off-by: yuye-aws <yuyezhu@amazon.com> * restore Run_Neural_Search.xml Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add changelog Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update integration test for cascade processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add max chunk limit Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove useless and apply spotless Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update error message Signed-off-by: yuye-aws <yuyezhu@amazon.com> * change field UT Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove useless and apply spotless Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * change logic of max chunk number Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add max chunk limit into fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * Support list<list<string>> type in embedding and extract validation logic to common class Signed-off-by: zane-neo <zaniu@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix unit tests for inference processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement unit tests for unit tests with max_chunk_limit in fixed token length Signed-off-by: yuye-aws <yuyezhu@amazon.com> * constructor for inference processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * use inference processor Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * draft code for extending inference processor with document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * api refactor for document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove nested list key for chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove 
unused function Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove processor validator Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove processor validator Signed-off-by: yuye-aws <yuyezhu@amazon.com> * Revert InferenceProcessor.java Signed-off-by: Yuye Zhu <yuyezhu@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * revert changes in text embedding and sparse encoding processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement chunk with map in document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add default delimiter value Signed-off-by: Lu <xinyual@88665a36eec8.ant.amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement max chunk logic in document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add initial value for max chunk limit in document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix in chunking processor: allow 0 max_chunk_limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement overlap rate with big decimal Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update max chunk limit in delimiter Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update parameter setting for fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update max chunk limit implementation in chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix unit tests for fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * spotless apply for document chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * initialize current chunk count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * parameter validation for max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix integration tests Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix current UT Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * change delimiter UT Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws 
<yuyezhu@amazon.com> * remove delimiter useless code Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add more UT Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add UT for list inside map Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add UT for list inside map Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update unit tests for chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add more unit tests for chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add java doc Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix import order Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix java doc error Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix update ut for fixed token length chunker Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement chunk count wrapper for max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * rename variable end to nextDelimiterPosition Signed-off-by: yuye-aws <yuyezhu@amazon.com> * adjust method place Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc for fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * reanme interface name and fixed token length algorithm name Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update fixed token length 
algorithm configuration for integration tests Signed-off-by: yuye-aws <yuyezhu@amazon.com> * make delimiter member variables static Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove redundant set field value in execute method Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add integration tests with more tokenizers Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix: unit test failure due to invalid tokenizer Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix: token concatenation in fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update chunker interface Signed-off-by: yuye-aws <yuyezhu@amazon.com> * track chunkCount within function Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix: allow white space as the delimiter Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix fixed length chunker Signed-off-by: xinyual <xinyual@amazon.com> * fix delimiter chunker Signed-off-by: xinyual <xinyual@amazon.com> * fix chunker factory Signed-off-by: xinyual <xinyual@amazon.com> * fix UTs Signed-off-by: xinyual <xinyual@amazon.com> * fix UT and chunker factory Signed-off-by: xinyual <xinyual@amazon.com> * move analysis_registry to non-runtime parameters Signed-off-by: xinyual <xinyual@amazon.com> * fix Uts Signed-off-by: xinyual <xinyual@amazon.com> * avoid java doc change Signed-off-by: xinyual <xinyual@amazon.com> * move validate to commonUtlis Signed-off-by: xinyual <xinyual@amazon.com> * remove useless function Signed-off-by: xinyual <xinyual@amazon.com> * change java doc Signed-off-by: xinyual <xinyual@amazon.com> * fix Document process ut Signed-off-by: xinyual <xinyual@amazon.com> * fixed token length: re-implement with start and end offset Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update exception message Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix document chunking processor IT Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix: adjust 
start, end content position in fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update changelog for 2.x release Signed-off-by: yuye-aws <yuyezhu@amazon.com> * rename processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update default delimiter to be \n\n Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove change log in 3.0 unreleased Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix IT failure due to chunking processor rename Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update javadoc for text chunking processor factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * adjust functions in chunker interface Signed-off-by: yuye-aws <yuyezhu@amazon.com> * move algorithm name definition to concrete chunker class Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update string formatted message for text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update string formatted message for chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update string formatted message for chunker parameter validator Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc for delimiter algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * support range double in chunker parameter validator Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update string formatted message for fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update sneaky throw with text chunking processor it Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add word tokenizer restriction for fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update error message for multiple algorithms in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add comment in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * validate max chunk limit with util parameter class Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update 
comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc Signed-off-by: yuye-aws <yuyezhu@amazon.com> * make parameter final Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement a map from chunker name to constuctor function in chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix in chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove get all chunkers in chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove type check for parameter check for max token count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove type check for parameter check for analysis registry Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement parser and validator Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update comment Signed-off-by: yuye-aws <yuyezhu@amazon.com> * provide fixed token length as the default algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * adjust exception message Signed-off-by: yuye-aws <yuyezhu@amazon.com> * adjust exception message Signed-off-by: yuye-aws <yuyezhu@amazon.com> * use object nonnull and require nonnull Signed-off-by: yuye-aws <yuyezhu@amazon.com> * apply final to ingest document and chunk count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * merge parameter validator into the parser Signed-off-by: yuye-aws <yuyezhu@amazon.com> * assign positive default value for max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * validate supported chunker algorithm in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update parameter setting of max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit test with non list of string Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit test with null input Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit test for tokenization excpetion in fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune method name 
in text chunking processor unit test Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune method name in delimiter algorithm unit test Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit test for overlap rate too small in fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune method modifier for all classes Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune code Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune code Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune exception type in parameter parser Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune comment Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune comment Signed-off-by: yuye-aws <yuyezhu@amazon.com> * include max chunk limit in both algorithms Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune comment Signed-off-by: yuye-aws <yuyezhu@amazon.com> * allow 0 for max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update runtime max chunk limit in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune code for chunker Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement test for multiple field max chunk limit exceed Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune methods name in text chunking proceesor unit tests Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit tests for both algorithms with max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * optimize code Signed-off-by: yuye-aws <yuyezhu@amazon.com> * extract max chunk limit check to util class Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix unit tests Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix: only update runtime max chunk limit when enabled Signed-off-by: yuye-aws <yuyezhu@amazon.com> --------- Signed-off-by: yuye-aws <yuyezhu@amazon.com> Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: zane-neo <zaniu@amazon.com> Signed-off-by: Yuye Zhu <yuyezhu@amazon.com> 
Signed-off-by: Lu <xinyual@88665a36eec8.ant.amazon.com> Co-authored-by: xinyual <xinyual@amazon.com> Co-authored-by: zane-neo <zaniu@amazon.com> Co-authored-by: Lu <xinyual@88665a36eec8.ant.amazon.com> (cherry picked from commit eea53aa)

…en length and delimiter algorithm (#644) * feat: implement text chunking processor with fixed token length and delimiter algorithm (#607)
start, end content position in fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update changelog for 2.x release Signed-off-by: yuye-aws <yuyezhu@amazon.com> * rename processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update default delimiter to be \n\n Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove change log in 3.0 unreleased Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix IT failure due to chunking processor rename Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update javadoc for text chunking processor factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * adjust functions in chunker interface Signed-off-by: yuye-aws <yuyezhu@amazon.com> * move algorithm name definition to concrete chunker class Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update string formatted message for text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update string formatted message for chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update string formatted message for chunker parameter validator Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc for delimiter algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * support range double in chunker parameter validator Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update string formatted message for fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update sneaky throw with text chunking processor it Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add word tokenizer restriction for fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update error message for multiple algorithms in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add comment in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * validate max chunk limit with util parameter class Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update 
comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update java doc Signed-off-by: yuye-aws <yuyezhu@amazon.com> * make parameter final Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement a map from chunker name to constuctor function in chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix in chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove get all chunkers in chunker factory Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove type check for parameter check for max token count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * remove type check for parameter check for analysis registry Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement parser and validator Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update comment Signed-off-by: yuye-aws <yuyezhu@amazon.com> * provide fixed token length as the default algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * adjust exception message Signed-off-by: yuye-aws <yuyezhu@amazon.com> * adjust exception message Signed-off-by: yuye-aws <yuyezhu@amazon.com> * use object nonnull and require nonnull Signed-off-by: yuye-aws <yuyezhu@amazon.com> * apply final to ingest document and chunk count Signed-off-by: yuye-aws <yuyezhu@amazon.com> * merge parameter validator into the parser Signed-off-by: yuye-aws <yuyezhu@amazon.com> * assign positive default value for max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * validate supported chunker algorithm in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update parameter setting of max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit test with non list of string Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit test with null input Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit test for tokenization excpetion in fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune method name 
in text chunking processor unit test Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune method name in delimiter algorithm unit test Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit test for overlap rate too small in fixed token length algorithm Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune method modifier for all classes Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune code Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune code Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune exception type in parameter parser Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune comment Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune comment Signed-off-by: yuye-aws <yuyezhu@amazon.com> * include max chunk limit in both algorithms Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune comment Signed-off-by: yuye-aws <yuyezhu@amazon.com> * allow 0 for max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * update runtime max chunk limit in text chunking processor Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune code for chunker Signed-off-by: yuye-aws <yuyezhu@amazon.com> * implement test for multiple field max chunk limit exceed Signed-off-by: yuye-aws <yuyezhu@amazon.com> * tune methods name in text chunking proceesor unit tests Signed-off-by: yuye-aws <yuyezhu@amazon.com> * add unit tests for both algorithms with max chunk limit Signed-off-by: yuye-aws <yuyezhu@amazon.com> * optimize code Signed-off-by: yuye-aws <yuyezhu@amazon.com> * extract max chunk limit check to util class Signed-off-by: yuye-aws <yuyezhu@amazon.com> * resolve code review comments Signed-off-by: yuye-aws <yuyezhu@amazon.com> * fix unit tests Signed-off-by: yuye-aws <yuyezhu@amazon.com> * bug fix: only update runtime max chunk limit when enabled Signed-off-by: yuye-aws <yuyezhu@amazon.com> --------- Signed-off-by: yuye-aws <yuyezhu@amazon.com> Signed-off-by: xinyual <xinyual@amazon.com> Signed-off-by: zane-neo <zaniu@amazon.com> Signed-off-by: Yuye Zhu <yuyezhu@amazon.com> 
Signed-off-by: Lu <xinyual@88665a36eec8.ant.amazon.com> Co-authored-by: xinyual <xinyual@amazon.com> Co-authored-by: zane-neo <zaniu@amazon.com> Co-authored-by: Lu <xinyual@88665a36eec8.ant.amazon.com> (cherry picked from commit eea53aa) * bug fix: fix compile error in integration test (#645) Signed-off-by: yuye-aws <yuyezhu@amazon.com> --------- Signed-off-by: yuye-aws <yuyezhu@amazon.com> Co-authored-by: Yuye Zhu <yuyezhu@amazon.com>

gaobinlong · 2024-04-26T10:51:20Z

src/main/java/org/opensearch/neuralsearch/processor/TextChunkingProcessor.java

+                // chunk the object when target key is of leaf type (null, string and list of string)
+                Object chunkObject = sourceAndMetadataMap.get(originalKey);
+                List<String> chunkedResult = chunkLeafType(chunkObject, runtimeParameters);
+                sourceAndMetadataMap.put(String.valueOf(targetKey), chunkedResult);


sourceAndMetadataMap contains metadata fields such as _index, _routing and _id. If targetKey equals the name of one of these metadata fields, the put call above overwrites that metadata value with the chunked result, which may cause unexpected behavior.

A simple solution is to prohibit any targetKey that starts with "_".
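
To make the suggestion concrete, here is a minimal sketch of such a check. The class `TargetKeyValidator` and the method `validateTargetKey` are hypothetical names used only for illustration and are not part of this PR. The sketch also assumes that rejecting every key with a leading underscore is acceptable, which is stricter than rejecting only the known metadata field names.

```java
import java.util.Locale;

/**
 * Hypothetical helper, not part of this PR: rejects target keys that could
 * collide with ingest metadata fields such as _index, _routing and _id.
 */
final class TargetKeyValidator {
    // Assumption: every key with this prefix is treated as reserved.
    private static final String RESERVED_PREFIX = "_";

    private TargetKeyValidator() {}

    static void validateTargetKey(final String targetKey) {
        if (targetKey == null || targetKey.isEmpty()) {
            throw new IllegalArgumentException("target key must not be null or empty");
        }
        if (targetKey.startsWith(RESERVED_PREFIX)) {
            throw new IllegalArgumentException(
                String.format(Locale.ROOT, "target key [%s] must not start with [%s]", targetKey, RESERVED_PREFIX)
            );
        }
    }
}
```

A check like this could run once when the processor parses its field_map, so an invalid configuration fails at pipeline creation rather than at ingest time.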

Let me check the behavior of other ingestion processors.

yuye-aws requested review from heemin32, navneet1v, VijayanB, vamshin, jmazanec15, naveentatikonda, junqiu-lei, martin-gaievski, sean-zheng-amazon, model-collapse, zane-neo, ylwu-amzn, jngz-es and vibrantvarun as code owners February 18, 2024 12:31

yuye-aws marked this pull request as draft February 18, 2024 12:32

zane-neo reviewed Feb 19, 2024

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedTokenLengthChunker.java (outdated, resolved)

zane-neo reviewed Feb 19, 2024

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedTokenLengthChunker.java (outdated, resolved)

zane-neo reviewed Feb 19, 2024

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedTokenLengthChunker.java (outdated, resolved)

zane-neo reviewed Feb 19, 2024

src/main/java/org/opensearch/neuralsearch/processor/chunker/FixedTokenLengthChunker.java (resolved)

yuye-aws force-pushed the feature/documentChunkingProcessor branch from 30fd0eb to 57a4a20 Compare February 22, 2024 15:21

yuye-aws requested a review from zane-neo February 22, 2024 15:25

samuel-oci suggested changes Feb 23, 2024

yuye-aws mentioned this pull request Feb 26, 2024

[META] Chunking and querying of long passages for vector search #612

Open

zane-neo reviewed Feb 26, 2024

src/main/java/org/opensearch/neuralsearch/processor/DocumentChunkingProcessor.java (outdated, resolved)

zane-neo reviewed Feb 26, 2024

src/main/java/org/opensearch/neuralsearch/processor/DocumentChunkingProcessor.java (outdated, resolved)

zane-neo reviewed Feb 26, 2024

src/main/java/org/opensearch/neuralsearch/processor/DocumentChunkingProcessor.java (outdated, resolved)

yuye-aws added 12 commits March 15, 2024 16:36

tune exception type in parameter parser

63bbae9

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

tune comment

aaee028

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

tune comment

ab2a151

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

include max chunk limit in both algorithms

1eb12aa

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

tune comment

40991a3

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

allow 0 for max chunk limit

ea4bbb8

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

update runtime max chunk limit in text chunking processor

f0dfb57

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

tune code for chunker

cb4b39b

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

implement test for multiple field max chunk limit exceed

98dd886

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

tune methods name in text chunking proceesor unit tests

d245a04

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

add unit tests for both algorithms with max chunk limit

ad7ba25

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

optimize code

9702168

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

yuye-aws requested review from navneet1v and zane-neo March 16, 2024 01:26

yuye-aws added 4 commits March 17, 2024 13:05

extract max chunk limit check to util class

3d8c030

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

resolve code review comments

9931fae

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

fix unit tests

fb6a961

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

bug fix: only update runtime max chunk limit when enabled

68fef4f

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

zane-neo approved these changes Mar 18, 2024

model-collapse approved these changes Mar 18, 2024

model-collapse merged commit eea53aa into opensearch-project:main Mar 18, 2024
60 checks passed

model-collapse added the backport 2.x label Mar 18, 2024

model-collapse assigned yuye-aws Mar 18, 2024

opensearch-trigger-bot bot mentioned this pull request Mar 18, 2024

[Backport 2.x] feat: implement text chunking processor with fixed token length and delimiter algorithm #644

Merged

vibrantvarun mentioned this pull request Mar 18, 2024

[Infrastructure] BWC tests for Chunking Processor #647

Closed

yuye-aws deleted the feature/documentChunkingProcessor branch March 26, 2024 02:19

yuye-aws mentioned this pull request Apr 2, 2024

Test: bwc test for text chunking processor #661

Merged

5 tasks

gaobinlong reviewed Apr 26, 2024

		throw new IllegalStateException(
		String.format(Locale.ROOT, "%s algorithm encounters exception in tokenization: %s", ALGORITHM_NAME, e.getMessage()),
