feat: split PreProcessor #3557

Closed · wants to merge 54 commits

Conversation

@ZanSara (Contributor) commented Nov 11, 2022

Related Issues

Epic: #3308

Proposed Changes:

  • Creates a NewPreProcessor class, an umbrella for 2 smaller classes:

    • DocumentSplitter
    • DocumentCleaner
  • Improves the capabilities of DocumentMerger to be able to merge on a sliding window and to merge by token count.

  • split_by now accepts "character", "token", "word", "sentence", "paragraph", "page", or "regex" (see the usage sketch after this list)

    • split_by="token" will allow you to pass the tokenizer model as you would do in the Reader. That's relatively slow compared to the other methods and should be further tested on larger documents.

      • One can also pass "nltk" as the model name to initialize an NLTKWordTokenizer (faster but far less precise).
    • split_by="word" will split on whitespace (watch out for languages that don't use it, like Chinese)

      • TODO: add a warning if no whitespace is found in the document text?
    • split_by="regex" will require you to provide the regex to split upon

    • split_by="paragraph" and split_by="page" are syntactic sugar over split_by="regex", split_regex="\n\n" and split_by="regex", split_regex="\f" respectively

    • split_by="sentence" uses NLTK's PunktSentenceTokenizer, like the old PreProcessor.

  • split_length now refers to how many of the split_by units to include, so:

    • split_by="page" and split_length=3 means that each document will contain the content of three pages from the original document
    • split_by="word" and split_length=20 means that each document will contain 20 words, regardless of sentence boundaries
    • split_by="sentence" and split_length=5 means that each document will contain 5 sentences, regardless of how long they are.
  • Introduces a parameter called split_max_chars that puts an upper bound on the number of characters a split is allowed to have, regardless of split_by. This is supposed to be a safety parameter, set way higher than the expected document length, to prevent mega-documents from being passed along. If a document generated by the normal split parameters exceeds the split_max_chars value in length, it is hard-split into chunks that are at most split_max_chars long and an ERROR log is produced (see the safety sketch after this list).

  • Introduces a parameter called split_max_tokens that will attempt to keep all documents below the specified token count. It can be used with or without split_length (both cases are tested). If a single unit (for example, a single sentence) exceeds the maximum token count, a document is generated for that single sentence and an ERROR log is produced. The super-long sentence, though, will not be split.

    • ❓ When huge sentences are found, is it better to split on the max_tokens ignoring the sentence boundaries?
  • Deprecates:

    • The old PreProcessor, which will stay as long as we need it (likely until Haystack 2.0, but at minimum for two more minor versions, as defined in the deprecation policy).
    • clean_substrings in favor of clean_regex.
    • split_respect_sentence_boundary: use split_by="sentence" instead and adjust split_length and max_tokens to your needs.
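The usage sketch referenced above: the class and parameter names come from this description, but the import path and exact constructor signature are assumptions rather than the merged code.

# Hypothetical usage sketch; import path and exact signature are assumptions
# based on this PR description, not the final implementation.
from haystack import Document
from haystack.nodes import DocumentSplitter  # assumed import path

splitter = DocumentSplitter(
    split_by="word",   # or "character", "token", "sentence", "paragraph", "page", "regex"
    split_length=20,   # each resulting document contains 20 words
    split_overlap=2,   # consecutive splits share 2 words (sliding window)
)
result = splitter.run(documents=[Document(content="A fairly long text to split ...")])

# split_by="paragraph" and split_by="page" are syntactic sugar for:
paragraph_splitter = DocumentSplitter(split_by="regex", split_regex="\n\n")
page_splitter = DocumentSplitter(split_by="regex", split_regex="\f")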
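The safety sketch referenced above: a minimal, standalone illustration of the split_max_chars behaviour described in the list (split_max_tokens follows an analogous pattern with tokens instead of characters). This is not the PR's actual code; enforce_max_chars is a made-up helper name.

# Illustrative sketch of the split_max_chars safety net: documents above the
# limit are hard-split into fixed-size chunks and an ERROR is logged.
import logging
from typing import List

logger = logging.getLogger(__name__)

def enforce_max_chars(text: str, split_max_chars: int) -> List[str]:
    if len(text) <= split_max_chars:
        return [text]
    logger.error(
        "Found a document of %d chars, above the %d chars limit. Splitting it blindly.",
        len(text),
        split_max_chars,
    )
    return [text[i : i + split_max_chars] for i in range(0, len(text), split_max_chars)]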

How did you test it?

  • Added several (~200) unit tests across test_preprocessor.py, test_document_merger.py, test_document_splitter.py, test_document_cleaner.py
  • Integration and "real-world" tests: TODO

Notes for the reviewer

  • The diff is very large. I recommend a test-driven review: go through the tests, see if anything seems to be missing and, if so, add such a test and re-run the suite. Once there are no more corner cases we can think of, this should be good to merge.

Checklist

@ZanSara changed the title from "refactor: fix several Preprocessor corner cases" to "refactor: rewrite PreProcessor" Nov 22, 2022
@ZanSara marked this pull request as ready for review November 23, 2022 11:43
@ZanSara requested a review from a team as a code owner November 23, 2022 11:43
@ZanSara requested review from julian-risch, ju-gu, masci, robpasternak and agnieszka-m and removed request for a team November 23, 2022 11:43

# If a document longer than max_chars is found, split it into max_length chunks and log loudly.
sane_documents = []
for document in split_documents:
The variable document in this for loop should be changed since it is already used as an iterator in the for loop on line 417:

for document in tqdm(documents, disable=not self.progress_bar, desc="Splitting", unit="docs"):
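A possible fix, sketched with stand-in names (split_doc and the placeholder data are not from the PR):

# Sketch of the suggested rename: give the inner loop a name that does not
# shadow `document` from the outer tqdm loop. Names and data are stand-ins.
split_documents = ["chunk one", "chunk two"]  # placeholder data for illustration

sane_documents = []
for split_doc in split_documents:
    sane_documents.append(split_doc)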

For example, if `split_by="word" and split_length=5 and split_overlap=2`, then the splits would be like:
`[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11]`.
Set the value to 0 to ensure there is no overlap among the documents after splitting.
:param max_chars: Absolute maximum number of chars allowed in a single document. Reaching this boundary

@sjrl (Contributor) commented Dec 5, 2022:

I think max_chars is meant to be split_max_chars here based on the variable name in the run_batch function.

This is a safety parameter to avoid extremely long documents ending up in the document store.
Keep in mind that huge documents (tens of thousands of chars) will strongly impact the
performance of Reader nodes and can drastically slow down the indexing speed.
:param max_tokens: Maximum number of tokens that are allowed in a single split. If set to 0, it will be
It doesn't look like max_tokens is used by run_batch. Should this be added?

Mind that some languages have limited support by the tokenizer: for example, it seems incapable of splitting Chinese text
by word, but it can correctly split it by sentence.

:param nltk_folder: If `split_by="sentence"`, specifies the path to the folder containing the NTLK `PunktSentenceTokenizer` models,
Suggested change
:param nltk_folder: If `split_by="sentence"`, specifies the path to the folder containing the NTLK `PunktSentenceTokenizer` models,
:param nltk_folder: If `split_by="sentence"`, specifies the path to the folder containing the NLTK `PunktSentenceTokenizer` models,

import re
from copy import deepcopy
import warnings
from uuid import uuid4
This looks to be an unused import. Is it safe to remove?

for key, value in list_of_dicts[0].items():

# if not all other dicts have this key, delete directly
if not all(key in dict.keys() for dict in list_of_dicts):
To be on the safe side I think it would be good to rename the variable dict here to a name that doesn't shadow the Python built-in dict.


def run( # type: ignore
self,
documents: List[Document],
Missing docstring for documents


def run_batch( # type: ignore
self,
documents: List[List[Document]],
Missing docstring for documents

merge_dictionary[key] = common_values(list_of_subdicts)

# If all dicts have this key and it's not a dictionary, delete only if the values differ
elif not all(value == dict[key] for dict in list_of_dicts):
To be on the safe side I think it would be good to rename the variable dict here to a name that doesn't shadow the Python built-in dict.
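A sketch of how the loop could look with the rename applied to both branches quoted above. The variable name metadata is a suggestion, and the logic (keep only the keys whose values all agree, recursing into nested dicts) is a simplified reading of the quoted hunks, not the PR's exact implementation:

# Sketch only: `metadata` replaces the built-in-shadowing `dict`.
from typing import Any, Dict, List

def common_values(list_of_dicts: List[Dict[str, Any]]) -> Dict[str, Any]:
    merge_dictionary: Dict[str, Any] = {}
    for key, value in list_of_dicts[0].items():
        # Skip keys that are missing from any of the other dicts
        if not all(key in metadata for metadata in list_of_dicts):
            continue
        if isinstance(value, dict):
            # Recurse into nested dictionaries
            list_of_subdicts = [metadata[key] for metadata in list_of_dicts]
            merge_dictionary[key] = common_values(list_of_subdicts)
        elif all(value == metadata[key] for metadata in list_of_dicts):
            # Keep the value only if all dicts agree on it
            merge_dictionary[key] = value
    return merge_dictionary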

@ZanSara (Author) commented Dec 12, 2022:

Hey @robpasternak @mathislucka @sjrl ! Thank you a ton for the detailed reviews! 🙏 I won't reply to each comment here, but I assure you I've gone through all of them.

Given that this PR is becoming huge, I will split it into a few smaller ones that we can manage separately, probably one for each subclass (DocumentMerger, DocumentSplitter, DocumentCleaner), and once they're all reviewed I'll tackle DocumentPreProcessor.

I'm now focusing on making the whole stack much faster, which will imply some changes, as the bottlenecks I identified are deepcopy, regex.finditer, and in part my own management of Document objects. I'll make sure to notify you as soon as they're ready to test/review and to provide some performance numbers we can discuss 😊


Sorry for the recurrent length typo, that's embarrassing but I often get that wrong 😅 I'll find/replace it asap.

@ZanSara marked this pull request as draft December 12, 2022 17:59
@ZanSara (Author) commented Jan 9, 2023:

For transparency, let's clarify the status of this PR. Unfortunately Q4 ended before I could get this change into good enough shape, and as we don't really have roadmap items related to this work, I think it's best to pause it. The PR will be closed, but of course, as soon as the opportunity comes up again, we should be able to revamp it and resume work quickly. Thank you to everyone for all the feedback!

@ZanSara closed this Jan 9, 2023
@ZanSara (Author) commented Jan 9, 2023:

For future reference: one of the core issues that led to this rewrite is that _split_by_word_respecting_sent_boundary loses characters, due to the use of split().

split() is hyper-fast because it throws away all the whitespace, which makes it impossible to reconstruct the original document once applied if the doc has repeated whitespace anywhere. This is what makes the current PreProcessor so fast, but also so buggy: a splitting method that should not lose characters actually loses track of all the whitespace when counting words and re-introduces it based on arbitrary assumptions. This brings up a series of issues with word counting, headline and page alignment, document shortening, etc.

To summarize: do not use split_respect_sentence_boundary if you're not planning to use clean_whitespace, because PreProcessor will do that anyway 😅
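A minimal standalone demonstration of the problem (illustration only, not Haystack code):

# split() discards all whitespace, including newlines and repeated spaces, so
# the original text cannot be reconstructed from the resulting word list.
text = "First sentence.  Two spaces and a\nnewline here."
words = text.split()
rebuilt = " ".join(words)
assert rebuilt != text  # the double space and the newline are gone for good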

@masci deleted the preprocessor-max-length branch September 13, 2023 08:55
Labels
topic:preprocessing, type:enhancement, type:feature (New feature or request), type:refactor (Not necessarily visible to the users)
Projects
None yet
8 participants