[RFC] Markdown Chunking Algorithm #751

yuye-aws · 2024-05-16T07:05:17Z

In OpenSearch 2.13, we released the text chunking processor. This processor lets users split documents into chunks so that embedding models do not lose information by truncating long inputs. This RFC introduces the markdown algorithm proposed in the RFC on text chunking. We are opening this RFC to solicit feedback and determine whether users actually need this algorithm.

Introduction

This algorithm is dedicated to markdown files. Within a markdown document, the hierarchical structure provides related context for the passages under each subtitle. We can construct a tree from the title levels in the document. Given a node in the tree, the chunk includes the titles and contents along its path to the root title, that is, the node itself and all of its ancestor nodes. Users can configure the maximum depth of the tree nodes. The examples below illustrate the algorithm.
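The tree construction described above can be sketched in Python as follows. This is a minimal illustration and not the proposed implementation: the `Node` class and `build_tree` function are hypothetical names, only `#`-style headings are recognized, and `#` lines inside fenced code blocks are not handled specially.

```python
import re
from dataclasses import dataclass, field

# Matches "#"-style headings such as "## Title 1" (levels 1 to 6).
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Node:
    level: int                      # 1 for "#", 2 for "##", ...; 0 for the dummy root
    title: str                      # the heading line, including its "#" markers
    content: list[str] = field(default_factory=list)
    children: list["Node"] = field(default_factory=list)


def build_tree(markdown: str) -> Node:
    """Parse headings into a tree; a level-0 dummy root holds the top-level titles."""
    root = Node(level=0, title="")
    stack = [root]                  # path from the dummy root to the current node
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match is None:
            # Plain text belongs to the most recently opened title.
            stack[-1].content.append(line)
            continue
        level = len(match.group(1))
        # Close every open title that is at the same level or deeper.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(level=level, title=line)
        stack[-1].children.append(node)
        stack.append(node)
    return root
```

The stack always holds the path from the root to the title currently being read, which is exactly the set of ancestors whose titles and contents a chunk would include.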

Examples

Here is a simple example of a markdown file:

// input
# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
### Title 1.2
Content 1.2
## Title 2
Content 2
### Title 2.1
Content 2.1
### Title 2.2
Content 2.2

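For this example, the constructed tree has Root title as the root node, Title 1 and Title 2 as its children, and Title 1.1, Title 1.2, Title 2.1, and Title 2.2 as the leaf nodes under them.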

Example 1

// output when max_depth = 1
[
'''
# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
### Title 1.2
Content 1.2
## Title 2
Content 2
### Title 2.1
Content 2.1
### Title 2.2
Content 2.2
'''
]

Example 2

// output when max_depth = 2
[
'''
# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
### Title 1.2
Content 1.2
'''
,
'''
# Root title
Root content
## Title 2
Content 2
### Title 2.1
Content 2.1
### Title 2.2
Content 2.2
'''
]

Example 3

// output when max_depth = 3
[
'''
# Root title
Root content
## Title 1
Content 1
### Title 1.1
Content 1.1
'''
,
'''
# Root title
Root content
## Title 1
Content 1
### Title 1.2
Content 1.2
'''
,
'''
# Root title
Root content
## Title 2
Content 2
### Title 2.1
Content 2.1
'''
,
'''
# Root title
Root content
## Title 2
Content 2
### Title 2.2
Content 2.2
'''
]
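The behavior shown in the examples can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: `chunk_markdown` is a hypothetical name, titles deeper than `max_depth` are treated as plain content of their nearest ancestor, and one chunk is emitted per leaf of the depth-limited tree.

```python
import re

# Matches "#"-style headings such as "## Title 1" (levels 1 to 6).
HEADING = re.compile(r"^(#{1,6})\s+\S")


def chunk_markdown(markdown: str, max_depth: int = 3) -> list[str]:
    """Return one chunk per leaf title, prefixed with all of its ancestor sections."""
    # Step 1: split the text into sections. A new section starts at every title
    # whose level is <= max_depth; deeper titles stay inside the current section.
    sections: list[tuple[int, list[str]]] = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        level = len(match.group(1)) if match else None
        if level is not None and level <= max_depth:
            sections.append((level, [line]))
        elif sections:
            sections[-1][1].append(line)
        else:
            # Text before the first title becomes a level-0 section.
            sections.append((0, [line]))

    # Step 2: walk the sections while tracking the path from the root title.
    chunks: list[str] = []
    path: list[tuple[int, list[str]]] = []
    for index, (level, lines) in enumerate(sections):
        while path and path[-1][0] >= level:
            path.pop()
        path.append((level, lines))
        next_level = sections[index + 1][0] if index + 1 < len(sections) else 0
        # A section is a leaf when the next section is not nested beneath it.
        if next_level <= level:
            chunks.append("\n".join(l for _, sec in path for l in sec))
    return chunks
```

For the sample input, calling this function with `max_depth` set to 1, 2, and 3 yields one, two, and four chunks respectively. Content that sits directly under a non-leaf title appears only as a prefix of its descendants' chunks, which is why the approach fits documents whose content is mostly under leaf titles.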

Pros and cons

Here are the pros and cons of this algorithm.

Pros

Neither of the two existing chunking algorithms splits documents along their natural, coherent structure. This algorithm lets users preserve the context information for each section under a subtitle.

Cons

The algorithm does not apply to other hierarchically structured text formats, such as HTML and Wikipedia.
Even for markdown documents, the algorithm is only suitable when most of the content is located under leaf nodes.
The algorithm may consume extra space, because overlapping content such as the root title and root content is repeated across chunks.
If the root title and root content are themselves longer than the truncation limit of the text embedding model, we may need to cascade multiple chunking algorithms. In that case, the downstream algorithm also loses the context information from the root title.
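To get a rough sense of the extra space consumption mentioned above, one could compare the total size of the chunks with the size of the original document. The helper below is a hypothetical sketch, not part of the proposal; it counts characters and ignores the few newline separators that differ between the original and the chunks.

```python
def duplication_overhead(original: str, chunks: list[str]) -> float:
    """Return the extra characters stored across chunks as a fraction of the original length."""
    if not original:
        return 0.0
    total = sum(len(chunk) for chunk in chunks)
    return (total - len(original)) / len(original)
```

The overhead grows with the number of leaf titles and with the length of the shared ancestor content, so a long root section combined with many small leaf sections is the least favorable case.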

Parameters

Parameter	Required/Optional	Data type	Description
max_depth	Optional	Integer	The maximum title depth considered in markdown-formatted text. Default is 3.
max_chunk_limit	Optional	Integer	The maximum number of chunks that the algorithm may produce. Default is 100. Set this value to -1 to disable the limit.

The markdown algorithm has two parameters. The max_chunk_limit parameter behaves as it does in the other chunking algorithms, and the max_depth parameter sets the deepest title level that the algorithm considers.
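The defaults and the -1 convention can be sketched as a small Python parameter holder. The class name, the validation rules, and the choice to raise an error when the limit is exceeded are assumptions made for illustration; the RFC does not specify them.

```python
from dataclasses import dataclass

DEFAULT_MAX_DEPTH = 3
DEFAULT_MAX_CHUNK_LIMIT = 100
DISABLED_CHUNK_LIMIT = -1   # sentinel value that turns the chunk limit off


@dataclass(frozen=True)
class MarkdownChunkerParams:
    max_depth: int = DEFAULT_MAX_DEPTH
    max_chunk_limit: int = DEFAULT_MAX_CHUNK_LIMIT

    def __post_init__(self) -> None:
        # Assumption: a depth below 1 is rejected, since "#" is the shallowest title.
        if self.max_depth < 1:
            raise ValueError("max_depth must be a positive integer")
        # Assumption: apart from -1, the limit must be positive.
        if self.max_chunk_limit != DISABLED_CHUNK_LIMIT and self.max_chunk_limit < 1:
            raise ValueError("max_chunk_limit must be positive, or -1 to disable it")

    def check_chunk_count(self, count: int) -> None:
        """Raise if the number of chunks exceeds the limit (assumed behavior)."""
        if self.max_chunk_limit == DISABLED_CHUNK_LIMIT:
            return
        if count > self.max_chunk_limit:
            raise ValueError(
                f"chunk count {count} exceeds max_chunk_limit {self.max_chunk_limit}"
            )
```

With the defaults, `MarkdownChunkerParams().check_chunk_count(100)` passes and a count of 101 is rejected, while `MarkdownChunkerParams(max_chunk_limit=-1)` accepts any count.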

API

Here is an example request that creates an ingest pipeline with the markdown algorithm:

PUT _ingest/pipeline/markdown-chunking-pipeline
{
  "description": "This pipeline performs chunking for markdown files",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "markdown": {
            "max_depth": 3,
            "max_chunk_limit": 100
          }
        },
        "field_map": {
          "<input_field>": "<output_field>"
        }
      }
    }
  ]
}


yuye-aws · 2024-05-16T07:05:51Z

Feel free to assign this RFC to me.

github-actions bot added the untriaged label May 16, 2024

zhichao-aws assigned yuye-aws Jun 5, 2024

zhichao-aws added RFC and removed untriaged labels Jun 5, 2024

martin-gaievski mentioned this issue Jul 29, 2024

[META] Chunking and querying of long passages for vector search #612

Open
