feat: implement text chunking processor with fixed token length and delimiter algorithm #607

Merged

@yuye-aws commented Feb 18, 2024

Description

This PR implements the text chunking processor proposed in the RFC. It provides two algorithms: the fixed token length algorithm and the delimiter algorithm. Users can invoke the chunking ingest processor as follows:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "text_chunking": {
          "algorithm": {
            "fixed_token_length": {
              "token_limit": 10,
              "overlap_rate": 0.2,
              "tokenizer": "standard"
            }
          },
          "field_map": {
            "body": "body_chunk"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
      }
    }
  ]
}

And then obtain the response:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "body_chunk": [
            "This is an example document to be chunked The document",
            "The document contains a single paragraph two sentences and 24",
            "and 24 tokens by standard tokenizer in OpenSearch"
          ],
          "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
        },
        "_ingest": {
          "timestamp": "2024-03-05T09:49:37.131255Z"
        }
      }
    }
  ]
}

Refer to the RFC for detailed parameter descriptions.
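To make the effect of token_limit and overlap_rate concrete, here is a minimal Java sketch of a sliding-window chunker. It is an illustration only, not the processor's actual code: the class name FixedTokenLengthSketch, the regex-based tokenize helper (a rough stand-in for the standard tokenizer), and the choice to join tokens with single spaces are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names and tokenization are assumptions, not the PR's implementation.
class FixedTokenLengthSketch {

    // Rough stand-in for the "standard" tokenizer: split on runs of non-letter, non-digit characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Emit windows of at most tokenLimit tokens; consecutive windows share
    // floor(tokenLimit * overlapRate) tokens.
    static List<String> chunk(String text, int tokenLimit, double overlapRate) {
        int overlap = (int) Math.floor(tokenLimit * overlapRate);
        if (tokenLimit <= 0 || overlap < 0 || overlap >= tokenLimit) {
            throw new IllegalArgumentException("overlap must be non-negative and smaller than tokenLimit");
        }
        List<String> tokens = tokenize(text);
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < tokens.size()) {
            int end = Math.min(start + tokenLimit, tokens.size());
            chunks.add(String.join(" ", tokens.subList(start, end)));
            if (end == tokens.size()) {
                break; // the last window reached the end of the text
            }
            start = end - overlap; // step back so the next window repeats the overlap
        }
        return chunks;
    }
}
```

With the 24-token example document, a token limit of 10, and an overlap rate of 0.2, the windows start at tokens 1, 9, and 17, which gives the same three chunks as the response shown in this description.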

Use Cases

Text Embedding

After configuring the text_embedding processor and obtaining the model ID, we can chain the chunking processor with the text_embedding processor to obtain an embedding vector for each chunked passage. Here is an example:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "text_chunking": {
          "algorithm": {
            "fixed_token_length": {
              "token_limit": 10,
              "overlap_rate": 0.2,
              "tokenizer": "standard"
            }
          },
          "field_map": {
            "body": "body_chunk"
          }
        }
      },
      {
        "text_embedding": {
          "model_id": "IYMBDo4BwlxmLrDqUr0a",
          "field_map": {
            "body_chunk": "body_chunk_embedding"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
      }
    }
  ]
}

And we obtain the following results:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "body_chunk": [
            "This is an example document to be chunked The document",
            "The document contains a single paragraph two sentences and 24",
            "and 24 tokens by standard tokenizer in OpenSearch"
          ],
          "body_chunk_embedding": [
            {
              "knn": [...]
            },
            {
              "knn": [...]
            },
            {
              "knn": [...]
            }
          ],
          "body": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
        },
        "_ingest": {
          "timestamp": "2024-03-05T09:49:37.131255Z"
        }
      }
    }
  ]
}

Cascaded Chunking Processors

Users can chain multiple chunking processors together. For example, if a user wishes to split documents by paragraph, they can apply the delimiter algorithm and set the delimiter parameter to "\n\n". In case a paragraph exceeds the token limit, the user can then append another chunking processor that uses the fixed token length algorithm. The ingestion pipeline for this example could be configured as follows:

PUT _ingest/pipeline/chunking-pipeline
{
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "delimiter": {
            "delimiter": "\n\n"
          }
        },
        "field_map": {
          "body": "body_chunk1"
        }
      }
    },
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 500,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "body_chunk1": "body_chunk2"
        }
      }
    }
  ]
}
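The two-stage pipeline above can be pictured with the following Java sketch: split on the delimiter first, then window each paragraph by token count. This is a simplified illustration under stated assumptions, not the processor's code. The class and method names are hypothetical, tokens are split on whitespace rather than by an OpenSearch tokenizer, the delimiter is dropped from the output, and the results are flattened into one list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; all names and simplifications here are assumptions.
class CascadedChunkingSketch {

    // Stage 1: split on a literal delimiter, dropping empty pieces and the delimiter itself.
    static List<String> splitByDelimiter(String text, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> parts = new ArrayList<>();
        int start = 0;
        int next;
        while ((next = text.indexOf(delimiter, start)) >= 0) {
            if (next > start) {
                parts.add(text.substring(start, next));
            }
            start = next + delimiter.length();
        }
        if (start < text.length()) {
            parts.add(text.substring(start));
        }
        return parts;
    }

    // Stage 2: windows of at most tokenLimit whitespace-separated tokens, sharing `overlap` tokens.
    static List<String> splitByTokenCount(String text, int tokenLimit, int overlap) {
        if (tokenLimit <= 0 || overlap < 0 || overlap >= tokenLimit) {
            throw new IllegalArgumentException("overlap must be non-negative and smaller than tokenLimit");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        List<String> tokens = Arrays.asList(trimmed.split("\\s+"));
        int start = 0;
        while (start < tokens.size()) {
            int end = Math.min(start + tokenLimit, tokens.size());
            chunks.add(String.join(" ", tokens.subList(start, end)));
            if (end == tokens.size()) {
                break;
            }
            start = end - overlap;
        }
        return chunks;
    }

    // Apply stage 2 to every paragraph produced by stage 1 and flatten the result.
    static List<String> cascade(String body, String delimiter, int tokenLimit, int overlap) {
        List<String> result = new ArrayList<>();
        for (String paragraph : splitByDelimiter(body, delimiter)) {
            result.addAll(splitByTokenCount(paragraph, tokenLimit, overlap));
        }
        return result;
    }
}
```

Short paragraphs pass through the second stage unchanged as a single chunk, so only paragraphs longer than the token limit are split further.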

Issues Resolved

Implement document chunking processor and fixed token length algorithm

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@yuye-aws
Member Author

For now, this PR is a POC for the RFC. I will mark this PR as ready once we finalize the high-level design and add the corresponding unit tests and integration tests.

codecov bot commented Feb 18, 2024

Codecov Report

Attention: Patch coverage is 97.89916%, with 5 lines in your changes missing coverage. Please review.

Project coverage is 84.19%. Comparing base (e41fba7) to head (68fef4f).
Report is 2 commits behind head on main.

Files                                                  Patch %  Lines
.../neuralsearch/processor/TextChunkingProcessor.java  96.03%   2 Missing and 3 partials ⚠️

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #607      +/-   ##
============================================
+ Coverage     82.62%   84.19%   +1.56%     
- Complexity      666      743      +77     
============================================
  Files            52       59       +7     
  Lines          2072     2309     +237     
  Branches        334      370      +36     
============================================
+ Hits           1712     1944     +232     
- Misses          212      214       +2     
- Partials        148      151       +3     


@yuye-aws force-pushed the feature/documentChunkingProcessor branch from 30fd0eb to 57a4a20 on February 22, 2024 15:21
@yuye-aws
Member Author

Hi @zane-neo! I have modified the PR according to your comments. Feel free to review my code.

@samuel-oci left a comment

Thank you for the draft, @yuye-aws. I would like us to follow the upcoming new feature release process.

  1. Let's make sure all feature spec feedback is collected in the RFC: [RFC] Text chunking design #548
  2. Let's create a meta issue with the design (I can create one and link it)
  3. We will move forward with the changes

@yuye-aws
Member Author

  • Let's create a meta issue with the design (I can create one and link it)

Do you mean the high-level design of the document chunking processor? Is the Interface Design section in the RFC what you are looking for?

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Comment on lines +48 to +56
    private static final Set<String> WORD_TOKENIZERS = Set.of(
        "standard",
        "letter",
        "lowercase",
        "whitespace",
        "uax_url_email",
        "classic",
        "thai"
    );
Collaborator

For now, let's not support any customized tokenizers here, to avoid ones that produce overlapping tokens. We can add an intelligent checker for tokenizers later.
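The restriction discussed here amounts to an allowlist check when the tokenizer parameter is parsed. The Java sketch below shows one way such a check could look. The set contents are taken from the snippet under review, while the class name, the method name validateTokenizer, and the error wording are hypothetical and not taken from the PR.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; the method name and message text are assumptions.
class TokenizerAllowlistSketch {

    // Built-in word tokenizers listed in the snippet under review.
    static final Set<String> WORD_TOKENIZERS = Set.of(
        "standard",
        "letter",
        "lowercase",
        "whitespace",
        "uax_url_email",
        "classic",
        "thai"
    );

    // Return the tokenizer name if it is on the allowlist, otherwise fail with the supported names.
    static String validateTokenizer(String tokenizer) {
        // The null check comes first because Set.of(...).contains(null) throws NullPointerException.
        if (tokenizer == null || !WORD_TOKENIZERS.contains(tokenizer)) {
            throw new IllegalArgumentException(
                String.format(
                    Locale.ROOT,
                    "Tokenizer [%s] is not supported. Supported tokenizers: %s",
                    tokenizer,
                    new TreeSet<>(WORD_TOKENIZERS) // sorted so the message is stable
                )
            );
        }
        return tokenizer;
    }
}
```

Rejecting unknown names up front means a tokenizer such as an n-gram tokenizer, whose tokens overlap in the source text, would fail at pipeline creation rather than produce confusing chunks at ingest time.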

Comment on lines 168 to 169
    throw new IllegalStateException(
        String.format(Locale.ROOT, "%s algorithm encounters exception in tokenization: %s", ALGORITHM_NAME, e.getMessage()),
Collaborator

It is ok to include the original message, but the wording is too simple. We need to explain why this is happening.
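One way to act on this suggestion is to keep the original message but add context about what failed and what the user can check. The Java sketch below is only an illustration: the class name, the wrap helper, the value of ALGORITHM_NAME, and the wording of the message are all assumptions, not the text that was eventually merged.

```java
import java.util.Locale;

// Illustrative sketch only; names and wording are assumptions.
class TokenizationErrorSketch {

    // Assumed to match the algorithm key used in the pipeline configuration.
    static final String ALGORITHM_NAME = "fixed_token_length";

    // Wrap a tokenization failure with context while preserving the original cause.
    static IllegalStateException wrap(String tokenizer, Exception cause) {
        String message = String.format(
            Locale.ROOT,
            "%s algorithm failed to tokenize the input text with tokenizer [%s]. "
                + "Check that the tokenizer is supported and that the field contains valid text. Cause: %s",
            ALGORITHM_NAME,
            tokenizer,
            cause.getMessage()
        );
        // Passing the cause keeps the original stack trace available for debugging.
        return new IllegalStateException(message, cause);
    }
}
```

Naming the algorithm and tokenizer in the message tells the reader where the failure happened, and the attached cause keeps the low-level detail for anyone who needs it.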

@model-collapse merged commit eea53aa into opensearch-project:main Mar 18, 2024
60 checks passed
@model-collapse added the backport 2.x label (Label will add auto workflow to backport PR to 2.x branch) Mar 18, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 18, 2024
…elimiter algorithm (#607)

* implement chunking processor and fixed token length

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* initialize node client for document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* initialize document chunking processor with analysis registry

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* chunker factory create with analysis registry

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement tokenizer in fixed token length algorithm with analysis registry

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add max token count parsing logic

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix for non-existing index

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* change error log

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement evenly chunk

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* unit tests for chunker factory

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* unit tests for chunker factory

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add error message for chunker factory tests

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* Revert "implement evenly chunk"

This reverts commit 93dd2f4.

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add default value logic back

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement unit test for fixed token length chunker

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add test cases in unit test for fixed token length chunker

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* support map type as an input

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* support map type as an input

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix for map type

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix for map type

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix for map type in document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove system out println

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add delimiter chunker

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add UT for delimiter chunker

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add delimiter chunker processor

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add more UTs

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add more UTs

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* basic unit tests for document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix tests for getProcessors in neural search

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add unit tests with string, map and nested map type for document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add unit tests for parameter valdiation in document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add back deleted xml file

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* restore xml file

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* integration tests for document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add back Run_Neural_Search.xml

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* restore Run_Neural_Search.xml

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add changelog

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update integration test for cascade processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add max chunk limit

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove useless and apply spotless

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update error message

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* change field UT

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove useless and apply spotless

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* change logic of max chunk number

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add max chunk limit into fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* Support list<list<string>> type in embedding and extract validation logic to common class

Signed-off-by: zane-neo <zaniu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix unit tests for inference processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement unit tests for unit tests with max_chunk_limit in fixed token length

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* constructor for inference processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* use inference processor

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* draft code for extending inference processor with document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* api refactor for document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove nested list key for chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove unused function

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove processor validator

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove processor validator

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* Revert InferenceProcessor.java

Signed-off-by: Yuye Zhu <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* revert changes in text embedding and sparse encoding processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement chunk with map in document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add default delimiter value

Signed-off-by: Lu <xinyual@88665a36eec8.ant.amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement max chunk logic in document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add initial value for max chunk limit in document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix in chunking processor: allow 0 max_chunk_limit

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement overlap rate with big decimal

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update max chunk limit in delimiter

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update parameter setting for fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update max chunk limit implementation in chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix unit tests for fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* spotless apply for document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* initialize current chunk count

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* parameter validation for max chunk limit

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix integration tests

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix current UT

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* change delimiter UT

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove delimiter useless code

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add more UT

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add UT for list inside map

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add UT for list inside map

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update unit tests for chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add more unit tests for chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve code review comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add java doc

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update java doc

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update java doc

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix import order

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update java doc

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix java doc error

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix update ut for fixed token length chunker

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve code review comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve code review comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve code review comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve code review comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement chunk count wrapper for max chunk limit

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* rename variable end to nextDelimiterPosition

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* adjust method place

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update java doc for fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* reanme interface name and fixed token length algorithm name

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update fixed token length algorithm configuration for integration tests

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* make delimiter member variables static

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove redundant set field value in execute method

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve code review comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add integration tests with more tokenizers

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix: unit test failure due to invalid tokenizer

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix: token concatenation in fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update chunker interface

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* track chunkCount within function

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix: allow white space as the delimiter

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix fixed length chunker

Signed-off-by: xinyual <xinyual@amazon.com>

* fix delimiter chunker

Signed-off-by: xinyual <xinyual@amazon.com>

* fix chunker factory

Signed-off-by: xinyual <xinyual@amazon.com>

* fix UTs

Signed-off-by: xinyual <xinyual@amazon.com>

* fix UT and chunker factory

Signed-off-by: xinyual <xinyual@amazon.com>

* move analysis_registry to non-runtime parameters

Signed-off-by: xinyual <xinyual@amazon.com>

* fix Uts

Signed-off-by: xinyual <xinyual@amazon.com>

* avoid java doc change

Signed-off-by: xinyual <xinyual@amazon.com>

* move validate to commonUtlis

Signed-off-by: xinyual <xinyual@amazon.com>

* remove useless function

Signed-off-by: xinyual <xinyual@amazon.com>

* change java doc

Signed-off-by: xinyual <xinyual@amazon.com>

* fix Document process ut

Signed-off-by: xinyual <xinyual@amazon.com>

* fixed token length: re-implement with start and end offset

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update exception message

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix document chunking processor IT

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix: adjust start, end content position in fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update changelog for 2.x release

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* rename processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update default delimiter to be \n\n

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove change log in 3.0 unreleased

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix IT failure due to chunking processor rename

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update javadoc for text chunking processor factory

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* adjust functions in chunker interface

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* move algorithm name definition to concrete chunker class

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update string formatted message for text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update string formatted message for chunker factory

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update string formatted message for chunker parameter validator

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update java doc for delimiter algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* support range double in chunker parameter validator

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update string formatted message for fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update sneaky throw with text chunking processor it

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add word tokenizer restriction for fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update error message for multiple algorithms in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add comment in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* validate max chunk limit with util parameter class

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update java doc

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update java doc

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* make parameter final

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement a map from chunker name to constuctor function in chunker factory

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix in chunker factory

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove get all chunkers in chunker factory

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove type check for parameter check for max token count

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove type check for parameter check for analysis registry

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement parser and validator

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update comment

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* provide fixed token length as the default algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* adjust exception message

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* adjust exception message

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* use object nonnull and require nonnull

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* apply final to ingest document and chunk count

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* merge parameter validator into the parser

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* assign positive default value for max chunk limit

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* validate supported chunker algorithm in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update parameter setting of max chunk limit

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add unit test with non list of string

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add unit test with null input

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add unit test for tokenization excpetion in fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune method name in text chunking processor unit test

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune method name in delimiter algorithm unit test

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add unit test for overlap rate too small in fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune method modifier for all classes

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune code

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune code

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune exception type in parameter parser

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune comment

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune comment

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* include max chunk limit in both algorithms

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune comment

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* allow 0 for max chunk limit

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update runtime max chunk limit in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune code for chunker

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement test for multiple field max chunk limit exceed

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune methods name in text chunking proceesor unit tests

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add unit tests for both algorithms with max chunk limit

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* optimize code

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* extract max chunk limit check to util class

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve code review comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix unit tests

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix: only update runtime max chunk limit when enabled

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

---------

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: zane-neo <zaniu@amazon.com>
Signed-off-by: Yuye Zhu <yuyezhu@amazon.com>
Signed-off-by: Lu <xinyual@88665a36eec8.ant.amazon.com>
Co-authored-by: xinyual <xinyual@amazon.com>
Co-authored-by: zane-neo <zaniu@amazon.com>
Co-authored-by: Lu <xinyual@88665a36eec8.ant.amazon.com>
(cherry picked from commit eea53aa)
zane-neo pushed a commit that referenced this pull request Mar 18, 2024
…en length and delimiter algorithm (#644)

* feat: implement text chunking processor with fixed token length and delimiter algorithm (#607)

* implement chunking processor and fixed token length

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* initialize node client for document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* initialize document chunking processor with analysis registry

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* chunker factory create with analysis registry

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement tokenizer in fixed token length algorithm with analysis registry

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add max token count parsing logic

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix for non-existing index

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* change error log

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement evenly chunk

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* unit tests for chunker factory

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* unit tests for chunker factory

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add error message for chunker factory tests

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* Revert "implement evenly chunk"

This reverts commit 93dd2f4.

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add default value logic back

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement unit test for fixed token length chunker

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add test cases in unit test for fixed token length chunker

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* support map type as an input

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* support map type as an input

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix for map type

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix for map type

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix for map type in document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove system out println

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add delimiter chunker

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add UT for delimiter chunker

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add delimiter chunker processor

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add more UTs

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add more UTs

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* basic unit tests for document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix tests for getProcessors in neural search

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add unit tests with string, map and nested map type for document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add unit tests for parameter valdiation in document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add back deleted xml file

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* restore xml file

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* integration tests for document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add back Run_Neural_Search.xml

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* restore Run_Neural_Search.xml

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add changelog

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update integration test for cascade processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add max chunk limit

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove useless and apply spotless

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update error message

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* change field UT

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove useless and apply spotless

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* change logic of max chunk number

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add max chunk limit into fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* Support list<list<string>> type in embedding and extract validation logic to common class

Signed-off-by: zane-neo <zaniu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix unit tests for inference processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement unit tests with max_chunk_limit in fixed token length

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* constructor for inference processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* use inference processor

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* draft code for extending inference processor with document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* api refactor for document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove nested list key for chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove unused function

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove processor validator

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove processor validator

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* Revert InferenceProcessor.java

Signed-off-by: Yuye Zhu <yuyezhu@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* revert changes in text embedding and sparse encoding processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement chunk with map in document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add default delimiter value

Signed-off-by: Lu <xinyual@88665a36eec8.ant.amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement max chunk logic in document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add initial value for max chunk limit in document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix in chunking processor: allow 0 max_chunk_limit

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement overlap rate with big decimal

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update max chunk limit in delimiter

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update parameter setting for fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update max chunk limit implementation in chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix unit tests for fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* spotless apply for document chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* initialize current chunk count

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* parameter validation for max chunk limit

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix integration tests

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix current UT

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* change delimiter UT

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove delimiter useless code

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add more UT

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add UT for list inside map

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add UT for list inside map

Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update unit tests for chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add more unit tests for chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve code review comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add java doc

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update java doc

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update java doc

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix import order

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update java doc

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix java doc error

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix update ut for fixed token length chunker

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve code review comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve code review comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve code review comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve code review comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement chunk count wrapper for max chunk limit

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* rename variable end to nextDelimiterPosition

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* adjust method place

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update java doc for fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* rename interface name and fixed token length algorithm name

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update fixed token length algorithm configuration for integration tests

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* make delimiter member variables static

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove redundant set field value in execute method

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve code review comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add integration tests with more tokenizers

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix: unit test failure due to invalid tokenizer

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix: token concatenation in fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update chunker interface

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* track chunkCount within function

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix: allow white space as the delimiter

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix fixed length chunker

Signed-off-by: xinyual <xinyual@amazon.com>

* fix delimiter chunker

Signed-off-by: xinyual <xinyual@amazon.com>

* fix chunker factory

Signed-off-by: xinyual <xinyual@amazon.com>

* fix UTs

Signed-off-by: xinyual <xinyual@amazon.com>

* fix UT and chunker factory

Signed-off-by: xinyual <xinyual@amazon.com>

* move analysis_registry to non-runtime parameters

Signed-off-by: xinyual <xinyual@amazon.com>

* fix UTs

Signed-off-by: xinyual <xinyual@amazon.com>

* avoid java doc change

Signed-off-by: xinyual <xinyual@amazon.com>

* move validate to commonUtils

Signed-off-by: xinyual <xinyual@amazon.com>

* remove useless function

Signed-off-by: xinyual <xinyual@amazon.com>

* change java doc

Signed-off-by: xinyual <xinyual@amazon.com>

* fix Document process ut

Signed-off-by: xinyual <xinyual@amazon.com>

* fixed token length: re-implement with start and end offset

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update exception message

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix document chunking processor IT

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix: adjust start, end content position in fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update changelog for 2.x release

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* rename processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update default delimiter to be \n\n

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove change log in 3.0 unreleased

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix IT failure due to chunking processor rename

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update javadoc for text chunking processor factory

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* adjust functions in chunker interface

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* move algorithm name definition to concrete chunker class

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update string formatted message for text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update string formatted message for chunker factory

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update string formatted message for chunker parameter validator

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update java doc for delimiter algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* support range double in chunker parameter validator

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update string formatted message for fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update sneaky throw with text chunking processor it

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add word tokenizer restriction for fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update error message for multiple algorithms in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add comment in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* validate max chunk limit with util parameter class

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update java doc

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update java doc

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* make parameter final

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement a map from chunker name to constructor function in chunker factory

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix in chunker factory

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove get all chunkers in chunker factory

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove type check for parameter check for max token count

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* remove type check for parameter check for analysis registry

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement parser and validator

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update comment

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* provide fixed token length as the default algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* adjust exception message

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* adjust exception message

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* use object nonnull and require nonnull

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* apply final to ingest document and chunk count

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* merge parameter validator into the parser

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* assign positive default value for max chunk limit

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* validate supported chunker algorithm in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update parameter setting of max chunk limit

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add unit test with non list of string

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add unit test with null input

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add unit test for tokenization exception in fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune method name in text chunking processor unit test

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune method name in delimiter algorithm unit test

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add unit test for overlap rate too small in fixed token length algorithm

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune method modifier for all classes

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune code

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune code

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune exception type in parameter parser

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune comment

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune comment

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* include max chunk limit in both algorithms

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune comment

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* allow 0 for max chunk limit

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* update runtime max chunk limit in text chunking processor

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune code for chunker

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* implement test for multiple field max chunk limit exceed

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* tune method names in text chunking processor unit tests

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* add unit tests for both algorithms with max chunk limit

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* optimize code

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* extract max chunk limit check to util class

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* resolve code review comments

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* fix unit tests

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

* bug fix: only update runtime max chunk limit when enabled

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

---------

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Signed-off-by: xinyual <xinyual@amazon.com>
Signed-off-by: zane-neo <zaniu@amazon.com>
Signed-off-by: Yuye Zhu <yuyezhu@amazon.com>
Signed-off-by: Lu <xinyual@88665a36eec8.ant.amazon.com>
Co-authored-by: xinyual <xinyual@amazon.com>
Co-authored-by: zane-neo <zaniu@amazon.com>
Co-authored-by: Lu <xinyual@88665a36eec8.ant.amazon.com>
(cherry picked from commit eea53aa)

* bug fix: fix compile error in integration test (#645)

Signed-off-by: yuye-aws <yuyezhu@amazon.com>

---------

Signed-off-by: yuye-aws <yuyezhu@amazon.com>
Co-authored-by: Yuye Zhu <yuyezhu@amazon.com>
@yuye-aws deleted the feature/documentChunkingProcessor branch March 26, 2024 02:19
// chunk the object when target key is of leaf type (null, string and list of string)
Object chunkObject = sourceAndMetadataMap.get(originalKey);
List<String> chunkedResult = chunkLeafType(chunkObject, runtimeParameters);
sourceAndMetadataMap.put(String.valueOf(targetKey), chunkedResult);

sourceAndMetadataMap contains metadata fields such as _index, _routing and _id. If targetKey equals the name of one of these metadata fields, this put call will overwrite that field's value.

Member Author

A simple solution is to prohibit any targetKey that starts with "_".
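
A minimal sketch of that check, for illustration only. `TargetKeyValidator` and `validate` are hypothetical names that do not appear in this PR, and the sketch assumes `field_map` values are either a target key string or a nested map of the same shape:

```java
import java.util.Map;

/** Hypothetical helper, not part of this PR: rejects target keys that look like metadata fields. */
public final class TargetKeyValidator {

    private TargetKeyValidator() {}

    /**
     * Walks a field_map-style configuration and throws if any target key starts with "_".
     * Assumption: values are either a target key (string) or a nested map of the same shape.
     */
    public static void validate(Map<String, ?> fieldMap) {
        for (Map.Entry<String, ?> entry : fieldMap.entrySet()) {
            Object target = entry.getValue();
            if (target instanceof Map) {
                // Nested mapping: check the inner target keys recursively.
                @SuppressWarnings("unchecked")
                Map<String, ?> nested = (Map<String, ?>) target;
                validate(nested);
            } else {
                String targetKey = String.valueOf(target);
                if (targetKey.startsWith("_")) {
                    throw new IllegalArgumentException(
                        "target key [" + targetKey + "] for source field [" + entry.getKey()
                            + "] must not start with '_'"
                    );
                }
            }
        }
    }
}
```

With this sketch, `validate(Map.of("body", "body_chunk"))` passes, while `validate(Map.of("body", "_id"))` throws `IllegalArgumentException`. Running such a check once when the processor is created, rather than per document, would reject a bad pipeline before any document reaches the `put` call.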

Member Author

Let me check the behavior of other ingestion processors.

Labels
backport 2.x Label will add auto workflow to backport PR to 2.x branch
9 participants