[RFC] Markdown Chunking Algorithm #751

Open
yuye-aws opened this issue May 16, 2024 · 1 comment

yuye-aws commented May 16, 2024

In OpenSearch 2.13, we released the text chunking processor. This processor lets users split documents into chunks so that embedding models do not lose information by truncating long inputs. This RFC introduces the markdown algorithm that was proposed in the RFC on text chunking. We are opening this RFC to solicit feedback and determine whether users actually need this algorithm.

Introduction

This algorithm is dedicated to markdown files. Within a markdown document, the heading hierarchy provides context for the passages under each subtitle. We can construct a tree from the title levels in the document. For a given node in the tree, the chunk includes the titles and contents of every node on its path to the root title, that is, of all its ancestors. Users can configure the maximum depth at which nodes are split into separate chunks. As the examples show, headings deeper than that depth stay inside the chunk of their ancestor at the maximum depth. The following examples illustrate the algorithm.

Examples

Here is a simple example of a markdown file:

// input
# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
### Title 1.2
Content 1.2
## Title 2
Content 2
### Title 2.1
Content 2.1
### Title 2.2
Content 2.2

For this example, the constructed tree looks like this:

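[Figure: the tree built from the heading levels above. "Root title" is at depth 1, "Title 1" and "Title 2" are at depth 2, and "Title 1.1", "Title 1.2", "Title 2.1", and "Title 2.2" are at depth 3.]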
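
One way to build this tree is to keep a stack of the headings that are still open and attach each new heading to the nearest shallower one. The sketch below is illustrative Python, not the processor's implementation: `Node`, `build_tree`, and `outline` are hypothetical names, and it assumes ATX-style `#` headings only, so a `#` line inside a fenced code block would be misread as a heading.

```python
import re
from dataclasses import dataclass, field

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Node:
    level: int                                     # number of leading '#' characters
    title: str                                     # heading text without the markers
    content: list = field(default_factory=list)    # body lines directly under the heading
    children: list = field(default_factory=list)   # nested headings


def build_tree(markdown: str) -> Node:
    """Parse headings into a tree. Level 0 is a synthetic holder above the root title."""
    top = Node(level=0, title="")
    stack = [top]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match is None:
            # Body text belongs to the most recently opened heading.
            stack[-1].content.append(line)
            continue
        node = Node(level=len(match.group(1)), title=match.group(2).strip())
        # Close every open heading at the same or a deeper level.
        while stack[-1].level >= node.level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return top


def outline(node: Node, depth: int = 0) -> list:
    """Return one indented line per heading, for a quick look at the tree."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + child.title)
        lines.extend(outline(child, depth + 1))
    return lines


DOC = """# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
### Title 1.2
Content 1.2
## Title 2
Content 2
### Title 2.1
Content 2.1
### Title 2.2
Content 2.2"""

print("\n".join(outline(build_tree(DOC))))
```

The printed outline nests "Title 1" and "Title 2" under "Root title", with the four third-level titles beneath them.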

Example 1

// output when max_depth = 1
[
'''
# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
### Title 1.2
Content 1.2
## Title 2
Content 2
### Title 2.1
Content 2.1
### Title 2.2
Content 2.2
'''
]

Example 2

// output when max_depth = 2
[
'''
# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
### Title 1.2
Content 1.2
'''
,
'''
# Root title
Root content
## Title 2
Content 2
### Title 2.1
Content 2.1
### Title 2.2
Content 2.2
'''
]

Example 3

// output when max_depth = 3
[
'''
# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
'''
,
'''
# Root title
Root content
## Title 1
Content 1
### Title 1.2
Content 1.2
'''
,
'''
# Root title
Root content
## Title 2
Content 2
### Title 2.1
Content 2.1
'''
,
'''
# Root title
Root content
## Title 2
Content 2
### Title 2.2
Content 2.2
'''
]
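
The three outputs above are consistent with the following reading of the algorithm. This reading is an assumption, since the RFC does not spell it out: a node at `max_depth` (or a shallower node with no children) produces one chunk made of its ancestors' titles and contents followed by its own complete subtree. The Python sketch below follows that reading. The names `parse`, `subtree_lines`, and `chunk_markdown` are hypothetical, heading markers are kept in the output, and `max_chunk_limit` is not modeled.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+\S")


def parse(markdown):
    """Build a tree of sections; each section keeps its heading line and body lines."""
    top = {"level": 0, "lines": [], "children": []}
    stack = [top]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            node = {"level": len(match.group(1)), "lines": [line], "children": []}
            # Close open sections at the same or a deeper heading level.
            while stack[-1]["level"] >= node["level"]:
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            stack[-1]["lines"].append(line)
    return top


def subtree_lines(node):
    """All lines of a section and of everything nested beneath it, in document order."""
    lines = list(node["lines"])
    for child in node["children"]:
        lines.extend(subtree_lines(child))
    return lines


def chunk_markdown(markdown, max_depth=3):
    """Return one chunk per node at max_depth, prefixed by its ancestors' lines."""
    top = parse(markdown)
    if not top["children"]:
        return [markdown]  # no headings: keep the text as a single chunk
    chunks = []

    def visit(node, depth, prefix):
        if depth >= max_depth or not node["children"]:
            chunks.append("\n".join(prefix + subtree_lines(node)))
            return
        for child in node["children"]:
            visit(child, depth + 1, prefix + node["lines"])

    for root in top["children"]:
        visit(root, 1, top["lines"])
    return chunks


DOC = """# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
### Title 1.2
Content 1.2
## Title 2
Content 2
### Title 2.1
Content 2.1
### Title 2.2
Content 2.2"""

for depth in (1, 2, 3):
    print(depth, len(chunk_markdown(DOC, max_depth=depth)))
```

For the sample document this yields 1, 2, and 4 chunks at depths 1, 2, and 3, matching Examples 1 to 3.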

Pros and cons

Here are the pros and cons of this algorithm.

Pros

  1. Neither of the two existing chunking algorithms (fixed token length and delimiter) takes document structure into account, so neither chunks documents in an organic and coherent manner. This algorithm lets users keep the context information for each section under a subtitle.

Cons

  1. The algorithm does not apply to other hierarchically structured text formats, such as HTML and Wikipedia markup.
  2. Even for markdown documents, the algorithm works well only when most of the content sits under leaf nodes.
  3. The algorithm may consume extra space, because the root title and root content are repeated in every chunk.
  4. We may need to cascade multiple chunking algorithms if the root title and root content alone exceed the truncation limit of the text embedding model. In that case, the downstream algorithm also loses the context information that ties each chunk to the root title.
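
To give a feel for the extra space mentioned in the third point, the sketch below compares the total size of the chunks from Examples 2 and 3 with the size of the original document. It is plain Python arithmetic over the sample text. `expansion_ratio` is a hypothetical helper, and the chunks are assembled by hand rather than by a real chunker.

```python
def expansion_ratio(original: str, chunks: list) -> float:
    """Total characters stored across all chunks divided by the source length."""
    return sum(len(chunk) for chunk in chunks) / len(original)


# Sections of the sample document from the Examples section.
root = "# Root title\nRoot content\n"
s1 = "## Title 1\nContent 1\n"
s11 = "### Title 1.1\nContent 1.1\n"
s12 = "### Title 1.2\nContent 1.2\n"
s2 = "## Title 2\nContent 2\n"
s21 = "### Title 2.1\nContent 2.1\n"
s22 = "### Title 2.2\nContent 2.2\n"

original = root + s1 + s11 + s12 + s2 + s21 + s22

# max_depth = 2: the root section is stored twice.
depth2 = [root + s1 + s11 + s12, root + s2 + s21 + s22]

# max_depth = 3: the root section is stored four times, each second-level section twice.
depth3 = [root + s1 + s11, root + s1 + s12, root + s2 + s21, root + s2 + s22]

print(f"{expansion_ratio(original, depth2):.2f}")  # prints 1.15
print(f"{expansion_ratio(original, depth3):.2f}")  # prints 1.70
```

In this small sample the duplicated share grows with `max_depth`, because every additional level of splitting repeats all ancestor sections once more per chunk.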

Parameters

| Parameter | Required/Optional | Data type | Description |
| --- | --- | --- | --- |
| max_depth | Optional | Int | The max depth for title in markdown formatted texts. Default is 3. |
| max_chunk_limit | Optional | Int | The chunk limit for chunking algorithms. Default is 100. Users can set this value to -1 to disable this parameter. |

The markdown algorithm has two parameters: max_chunk_limit behaves as it does in the other chunking algorithms, and max_depth is the deepest title level the algorithm considers.

API

Here is an example request that creates an ingest pipeline with the markdown algorithm:

PUT _ingest/pipeline/markdown-chunking-pipeline
{
  "description": "This pipeline performs chunking for markdown files",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "markdown": {
            "max_depth": 3,
            "max_chunk_limit": 100
          }
        },
        "field_map": {
          "<input_field>": "<output_field>"
        }
      }
    }
  ]
}
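
The same request could be prepared from a client script. The Python sketch below only builds the request object with the standard library and does not send it. The cluster URL `http://localhost:9200`, the field names `body` and `body_chunks`, and the helper `build_pipeline_request` are assumptions for illustration. Because the `markdown` algorithm is only proposed in this RFC, a cluster without it would be expected to reject the request.

```python
import json
from urllib import request


def build_pipeline_request(base_url, input_field, output_field,
                           max_depth=3, max_chunk_limit=100):
    """Build (but do not send) the PUT request for the proposed markdown pipeline."""
    body = {
        "description": "This pipeline performs chunking for markdown files",
        "processors": [
            {
                "text_chunking": {
                    "algorithm": {
                        # Proposed in this RFC; not an existing algorithm name.
                        "markdown": {
                            "max_depth": max_depth,
                            "max_chunk_limit": max_chunk_limit,
                        }
                    },
                    "field_map": {input_field: output_field},
                }
            }
        ],
    }
    return request.Request(
        url=f"{base_url}/_ingest/pipeline/markdown-chunking-pipeline",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical cluster address and field names.
req = build_pipeline_request("http://localhost:9200", "body", "body_chunks")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it. Authentication and TLS settings are left out of this sketch.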

@yuye-aws (Member, Author)

Feel free to assign this RFC to me.
