feat: split PreProcessor #3557

Closed · wants to merge 54 commits

Conversation

@ZanSara (Contributor) commented Nov 11, 2022

Related Issues

Epic: #3308

Proposed Changes:

  • Creates a NewPreProcessor class, an umbrella for 2 smaller classes:

    • DocumentSplitter
    • DocumentCleaner
  • Improves the capabilities of DocumentMerger to be able to merge on a sliding window and to merge by token count.

  • split_by now accepts "character", "token", "word", "sentence", "paragraph", "page", or "regex" (see the usage sketch after this list)

    • split_by="token" will allow you to pass the tokenizer model as you would do in the Reader. That's relatively slow compared to the other methods and should be further tested on larger documents.

      • One can also pass "nltk" as the model name to initialize an NLTKWordTokenizer (faster but far less precise).
    • split_by="word" will split on whitespace (watch out for languages that don't use it, like Chinese)

      • TODO: add a warning if no whitespace is found in the document text?
    • split_by="regex" will require you to provide the regex to split upon

    • split_by="paragraph" and split_by="page" are syntactic sugar over split_by="regex", split_regex="\n\n" and split_by="regex", split_regex="\f" respectively

    • split_by="sentence" uses NLTK's PunktSentenceTokenizer, like the old PreProcessor.

  • split_length now refers to how many of the split_by units to include, so:

    • split_by="page" and split_length=3 means that each document will contain the content of three pages from the original document
    • split_by="word" and split_length=20 means that each document will contain 20 words, regardless of sentence boundaries
    • split_by="sentence" and split_length=5 means that each document will contain 5 sentences, regardless of how long they are.
  • Introduces a parameter called split_max_chars that puts an upper bound on the number of characters a split is allowed to have, regardless of split_by. This is supposed to be a safety parameter, set way higher than the expected document length, to prevent mega-documents from being passed along. If a document generated by the normal split parameters exceeds the split_max_chars value in length, it is hard-split into chunks that are at most split_max_chars long and an ERROR log is produced (see the safety sketch after this list).

  • Introduces a parameter called split_max_tokens that will attempt to keep all documents below the specified token count. It can be used with or without split_length (both cases are tested). If a single unit (for example, a single sentence) exceeds the maximum token count, a document is generated for that single sentence and an ERROR log is produced. The super-long sentence, though, will not be split.

    • ❓ When huge sentences are found, is it better to split on the max_tokens ignoring the sentence boundaries?
  • Deprecates:

    • The old PreProcessor, which will stay as long as we need it (likely until Haystack 2.0, but at minimum for two more minor versions, as defined in the deprecation policy).
    • clean_substrings in favor of clean_regex.
    • split_respect_sentence_boundary: use split_by="sentence" instead and adjust split_length and max_tokens to your needs.
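The usage sketch referenced above: the class and parameter names come from this description, but the import path and exact constructor signature are assumptions rather than the merged code.

# Hypothetical usage sketch; import path and exact signature are assumptions
# based on this PR description, not the final implementation.
from haystack import Document
from haystack.nodes import DocumentSplitter  # assumed import path

splitter = DocumentSplitter(
    split_by="word",   # or "character", "token", "sentence", "paragraph", "page", "regex"
    split_length=20,   # each resulting document contains 20 words
    split_overlap=2,   # consecutive splits share 2 words (sliding window)
)
result = splitter.run(documents=[Document(content="A fairly long text to split ...")])

# split_by="paragraph" and split_by="page" are syntactic sugar for:
paragraph_splitter = DocumentSplitter(split_by="regex", split_regex="\n\n")
page_splitter = DocumentSplitter(split_by="regex", split_regex="\f")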
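The safety sketch referenced above: a minimal, standalone illustration of the split_max_chars behaviour described in the list (split_max_tokens follows an analogous pattern with tokens instead of characters). This is not the PR's actual code; enforce_max_chars is a made-up helper name.

# Illustrative sketch of the split_max_chars safety net: documents above the
# limit are hard-split into fixed-size chunks and an ERROR is logged.
import logging
from typing import List

logger = logging.getLogger(__name__)

def enforce_max_chars(text: str, split_max_chars: int) -> List[str]:
    if len(text) <= split_max_chars:
        return [text]
    logger.error(
        "Found a document of %d chars, above the %d chars limit. Splitting it blindly.",
        len(text),
        split_max_chars,
    )
    return [text[i : i + split_max_chars] for i in range(0, len(text), split_max_chars)]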

How did you test it?

  • Added several (~200) unit tests across test_preprocessor.py, test_document_merger.py, test_document_splitter.py, test_document_cleaner.py
  • Integration and "real-world" tests: TODO

Notes for the reviewer

  • The diff is very large. I recommend a test-driven review: go through the tests, see if anything seems to be missing and, if so, add such a test and re-run the suite. Once there are no more corner cases we can think of, this should be good to merge.

Checklist

@ZanSara changed the title from "refactor: fix several Preprocessor corner cases" to "refactor: rewrite PreProcessor" Nov 22, 2022
@ZanSara marked this pull request as ready for review November 23, 2022 11:43
@ZanSara requested a review from a team as a code owner November 23, 2022 11:43
@ZanSara requested review from julian-risch, ju-gu, masci, robpasternak and agnieszka-m and removed request for a team November 23, 2022 11:43

# If a document longer than max_chars is found, split it into max_length chunks and log loudly.
sane_documents = []
for document in split_documents:
The variable document in this for loop should be changed since it is already used as an iterator in the for loop on line 417:

for document in tqdm(documents, disable=not self.progress_bar, desc="Splitting", unit="docs"):
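A possible fix, sketched with stand-in names (split_doc and the placeholder data are not from the PR):

# Sketch of the suggested rename: give the inner loop a name that does not
# shadow `document` from the outer tqdm loop. Names and data are stand-ins.
split_documents = ["chunk one", "chunk two"]  # placeholder data for illustration

sane_documents = []
for split_doc in split_documents:
    sane_documents.append(split_doc)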

For example, if `split_by="word" and split_length=5 and split_overlap=2`, then the splits would be like:
`[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11]`.
Set the value to 0 to ensure there is no overlap among the documents after splitting.
:param max_chars: Absolute maximum number of chars allowed in a single document. Reaching this boundary

@sjrl (Contributor) commented Dec 5, 2022:

I think max_chars is meant to be split_max_chars here based on the variable name in the run_batch function.

This is a safety parameter to avoid extremely long documents ending up in the document store.
Keep in mind that huge documents (tens of thousands of chars) will strongly impact the
performance of Reader nodes and can drastically slow down the indexing speed.
:param max_tokens: Maximum number of tokens that are allowed in a single split. If set to 0, it will be
It doesn't look like max_tokens is used by run_batch. Should this be added?

Mind that some languages have limited support by the tokenizer: for example, it seems incapable of splitting Chinese text
by word, but it can correctly split it by sentence.

:param nltk_folder: If `split_by="sentence"`, specifies the path to the folder containing the NTLK `PunktSentenceTokenizer` models,
Suggested change
:param nltk_folder: If `split_by="sentence"`, specifies the path to the folder containing the NTLK `PunktSentenceTokenizer` models,
:param nltk_folder: If `split_by="sentence"`, specifies the path to the folder containing the NLTK `PunktSentenceTokenizer` models,

import re
from copy import deepcopy
import warnings
from uuid import uuid4
This looks to be an unused import. Is it safe to remove?

for key, value in list_of_dicts[0].items():

# if not all other dicts have this key, delete directly
if not all(key in dict.keys() for dict in list_of_dicts):
To be on the safe side I think it would be good to rename the variable dict here to a name that doesn't shadow the Python built-in dict.


def run( # type: ignore
self,
documents: List[Document],
Missing docstring for documents


def run_batch( # type: ignore
self,
documents: List[List[Document]],
Missing docstring for documents

merge_dictionary[key] = common_values(list_of_subdicts)

# If all dicts have this key and it's not a dictionary, delete only if the values differ
elif not all(value == dict[key] for dict in list_of_dicts):
To be on the safe side I think it would be good to rename the variable dict here to a name that doesn't shadow the Python built-in dict.
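A sketch of how the loop could look with the rename applied to both branches quoted above. The variable name metadata is a suggestion, and the logic (keep only the keys whose values all agree, recursing into nested dicts) is a simplified reading of the quoted hunks, not the PR's exact implementation:

# Sketch only: `metadata` replaces the built-in-shadowing `dict`.
from typing import Any, Dict, List

def common_values(list_of_dicts: List[Dict[str, Any]]) -> Dict[str, Any]:
    merge_dictionary: Dict[str, Any] = {}
    for key, value in list_of_dicts[0].items():
        # Skip keys that are missing from any of the other dicts
        if not all(key in metadata for metadata in list_of_dicts):
            continue
        if isinstance(value, dict):
            # Recurse into nested dictionaries
            list_of_subdicts = [metadata[key] for metadata in list_of_dicts]
            merge_dictionary[key] = common_values(list_of_subdicts)
        elif all(value == metadata[key] for metadata in list_of_dicts):
            # Keep the value only if all dicts agree on it
            merge_dictionary[key] = value
    return merge_dictionary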

@ZanSara (Author) commented Dec 12, 2022:

Hey @robpasternak @mathislucka @sjrl ! Thank you a ton for the detailed reviews! 🙏 I won't reply to each comment here, but I assure you I've gone through all of them.

Given that this PR is becoming huge, I will split it into a few smaller ones that we can manage separately, probably one for each subclass (DocumentMerger, DocumentSplitter, DocumentCleaner), and once they're all reviewed I'll tackle DocumentPreProcessor.

I'm now focusing on making the whole stack much faster, which will imply some changes, as the bottlenecks I identified are deepcopy, regex.finditer, and in part my own management of Document objects. I'll make sure to notify you as soon as they're ready to test/review and to provide some performance numbers we can discuss 😊


Sorry for the recurrent length typo, that's embarrassing but I often get that wrong 😅 I'll find/replace it asap.

@ZanSara marked this pull request as draft December 12, 2022 17:59
@ZanSara (Author) commented Jan 9, 2023:

For transparency, let's clarify the status of this PR. Unfortunately Q4 ended before I could get this change into good enough shape, and as we don't really have roadmap items related to this work, I think it's best to pause it. The PR will be closed, but of course, as soon as the opportunity comes up again, we should be able to revamp it and resume work quickly. Thank you to everyone for all the feedback!

@ZanSara closed this Jan 9, 2023
@ZanSara (Author) commented Jan 9, 2023:

For future reference: one of the core issues that led to this rewrite is that _split_by_word_respecting_sent_boundary loses characters, due to the use of split().

split() is hyper-fast because it throws away all the whitespace, which makes it impossible to reconstruct the original document once applied if the doc has repeated whitespace anywhere. This is what makes the current PreProcessor so fast, but also so buggy: a splitting method that should not lose characters actually loses track of all the whitespace when counting words and re-introduces it based on arbitrary assumptions. This brings up a series of issues with word counting, headline and page alignment, document shortening, etc.

To summarize: do not use split_respect_sentence_boundary if you're not planning to use clean_whitespace, because PreProcessor will do that anyway 😅
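A minimal standalone demonstration of the problem (illustration only, not Haystack code):

# split() discards all whitespace, including newlines and repeated spaces, so
# the original text cannot be reconstructed from the resulting word list.
text = "First sentence.  Two spaces and a\nnewline here."
words = text.split()
rebuilt = " ".join(words)
assert rebuilt != text  # the double space and the newline are gone for good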

@masci deleted the preprocessor-max-length branch September 13, 2023 08:55
Labels
topic:preprocessing, type:enhancement, type:feature (New feature or request), type:refactor (Not necessarily visible to the users)
Projects
None yet
8 participants