feat: split PreProcessor
#3557
Conversation
PreProcessor

# If a document longer than max_chars is found, split it into max_length chunks and log loudly.
sane_documents = []
for document in split_documents:
The variable `document` in this for loop should be changed, since it is already used as an iterator in the for loop on line 417:

for document in tqdm(documents, disable=not self.progress_bar, desc="Splitting", unit="docs"):
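A standalone sketch of the shadowing issue and the suggested rename (simplified data, not the actual Haystack code):

    documents = ["first document", "second document"]

    for document in documents:
        split_documents = document.split()

        # Reusing `document` below would silently rebind the outer loop variable.
        # A distinct name such as `split_doc` avoids the shadowing:
        sane_documents = []
        for split_doc in split_documents:
            sane_documents.append(split_doc)

        # `document` still refers to the element of the outer loop here.
        print(document, "->", sane_documents)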
For example, if `split_by="word" and split_length=5 and split_overlap=2`, then the splits would be like:
`[w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11]`.
Set the value to 0 to ensure there is no overlap among the documents after splitting.
:param max_chars: Absolute maximum number of chars allowed in a single document. Reaching this boundary
I think `max_chars` is meant to be `split_max_chars` here, based on the variable name in the `run_batch` function.
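As a side note on the `split_overlap` example quoted above, this is the sliding-window arithmetic it describes (standalone sketch, not the PR's code):

    def sliding_splits(units, split_length, split_overlap):
        # Each window advances by (split_length - split_overlap) units.
        step = split_length - split_overlap
        return [units[i : i + split_length] for i in range(0, len(units) - split_overlap, step)]

    words = [f"w{i}" for i in range(1, 12)]
    print(sliding_splits(words, split_length=5, split_overlap=2))
    # -> [w1..w5], [w4..w8], [w7..w11]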
This is a safety parameter to prevent extremely long documents from ending up in the document store.
Keep in mind that huge documents (tens of thousands of chars) will strongly impact the
performance of Reader nodes and can drastically slow down the indexing speed.
:param max_tokens: Maximum number of tokens that are allowed in a single split. If set to 0, it will be
It doesn't look like `max_tokens` is used by `run_batch`. Should this be added?
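A purely hypothetical sketch of how the parameter could be forwarded by `run_batch`; the class, signatures and return values below are stand-ins to illustrate the pattern, not the PR's actual code:

    from typing import List, Optional

    class SketchSplitter:
        def run(self, documents: List[str], max_tokens: Optional[int] = None):
            # Stand-in for the real per-list processing.
            return {"documents": documents}, "output_1"

        def run_batch(self, documents: List[List[str]], max_tokens: Optional[int] = None):
            # Forward max_tokens to run() for every list of documents in the batch.
            all_docs = []
            for docs in documents:
                output, _ = self.run(documents=docs, max_tokens=max_tokens)
                all_docs.append(output["documents"])
            return {"documents": all_docs}, "output_1"

    print(SketchSplitter().run_batch([["a"], ["b", "c"]], max_tokens=128))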
Mind that some languages have limited support by the tokenizer: for example, it seems incapable of splitting Chinese text
by word, but it can correctly split it by sentence.
:param nltk_folder: If `split_by="sentence"`, specifies the path to the folder containing the NTLK `PunktSentenceTokenizer` models, |
Suggested change (typo: NTLK → NLTK):
- :param nltk_folder: If `split_by="sentence"`, specifies the path to the folder containing the NTLK `PunktSentenceTokenizer` models,
+ :param nltk_folder: If `split_by="sentence"`, specifies the path to the folder containing the NLTK `PunktSentenceTokenizer` models,
import re
from copy import deepcopy
import warnings
from uuid import uuid4
This looks to be an unused import. Is it safe to remove?
for key, value in list_of_dicts[0].items():

    # if not all other dicts have this key, delete directly
    if not all(key in dict.keys() for dict in list_of_dicts):
To be on the safe side I think it would be good to rename the variable `dict` here to a name that is not a Python built-in.
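A standalone sketch of the suggested rename (simplified data, not the actual Haystack code):

    list_of_dicts = [{"a": 1, "b": 2}, {"a": 1}]

    for key, value in list_of_dicts[0].items():
        # `d` instead of `dict`, so the built-in type is not shadowed
        if not all(key in d.keys() for d in list_of_dicts):
            print(f"{key!r} is not present in every dict")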
def run(  # type: ignore
    self,
    documents: List[Document],
Missing docstring for documents
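A minimal sketch of the kind of entry that could be added (the wording and the stand-in class are mine, not the author's):

    from typing import List
    from haystack.schema import Document

    class SketchNode:  # stand-in class, only to show where the entry would go
        def run(self, documents: List[Document]):  # type: ignore
            """
            :param documents: The documents to clean and/or split.
            """
            return {"documents": documents}, "output_1"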
def run_batch(  # type: ignore
    self,
    documents: List[List[Document]],
Missing docstring for documents
    merge_dictionary[key] = common_values(list_of_subdicts)

# If all dicts have this key and it's not a dictionary, delete only if the values differ
elif not all(value == dict[key] for dict in list_of_dicts):
To be on the safe side I think it would be good to rename the variable `dict` here to a name that is not a Python built-in.
Hey @robpasternak @mathislucka @sjrl! Thank you a ton for the detailed reviews! 🙏 I won't reply to each comment here, but I assure you I've gone through all of them. Given that this PR is becoming huge, I will split it into a few smaller ones that we can manage separately, probably one for each subclass: [...] I'm now focusing on making the whole stack much faster, which will imply some changes, as the bottlenecks I identified are [...]. Sorry for the recurrent [...]
For transparency, let's clarify the status of this PR. Unfortunately Q4 ended before I could get this change into good enough shape, and as we don't really have roadmap items related to this work, I think it's best to pause it. The PR will be closed, but of course as soon as the opportunity comes up again we should be able to revamp it and resume work quickly. Thank you to everyone for all the feedback!
For future reference: one of the core issues that led to this rewrite is that [...]
To summarize: do not use [...]
Related Issues

- Epic: #3308
- PreProcessor to split files by custom regex (e.g. for markdown headlines) #2583

Proposed Changes:
- Creates a `NewPreProcessor` class, an umbrella for 2 smaller classes:
  - `DocumentSplitter`
  - `DocumentCleaner`
- Improves the capabilities of `DocumentMerger` to be able to merge on a sliding window and to merge by token count.
- `split_by` now accepts "character", "token", "word", "sentence", "paragraph", "page" or "regex":
  - `split_by="token"` will allow you to pass the tokenizer model as you would do in the Reader. That's relatively slow compared to the other methods and should be further tested on larger documents. You can also pass "nltk" as model name to initialize an `NLTKWordTokenizer` (faster but way less precise).
  - `split_by="word"` will split on whitespace (watch out for languages that don't use it, like Chinese)
  - `split_by="regex"` will require you to provide the regex to split upon
  - `split_by="paragraph"` and `split_by="page"` are syntactic sugar over `split_by="regex", split_regex="\n\n"` and `split_by="regex", split_regex="\f"` respectively
  - `split_by="sentence"` uses NLTK's `PunktSentenceTokenizer` as the old PreProcessor.
- `split_length` now refers to how many of the `split_by` units to include, so (see the usage sketch after this list):
  - `split_by="page" and split_length=3` means that each document will contain the content of three pages from the original document
  - `split_by="word" and split_length=20` means that each document will contain 20 words, regardless of sentence boundaries
  - `split_by="sentence" and split_length=5` means that each document will contain 5 sentences, regardless of how long they are.
- Introduces a parameter called `split_max_chars` that puts an upper bound on the number of characters a split is allowed to have, regardless of `split_by`. This is supposed to be a safety parameter, set way higher than the expected document length, to prevent mega-documents from being passed over. If a document generated by the normal split parameters exceeds the `split_max_chars` value in length, it is hard-split into chunks that are at most `split_max_chars` characters long and an ERROR log is produced.
- Introduces a parameter called `split_max_tokens` that will attempt to keep all documents below the specified token count. Can be used with `split_length` or without (both cases are tested). If single units (for example, single sentences) exceed the maximum token count, a document is generated for that single sentence and an ERROR log is produced. The super-long sentence, though, will not be split (see the grouping sketch after this list). Open question: should `max_tokens` ignore the sentence boundaries?
Deprecates:
- `PreProcessor`, which will stay as long as we need it (likely until Haystack 2.0, but minimum for two more minor versions as we defined in the deprecation policy).
- `clean_substrings` in favor of `clean_regex`.
- `split_respect_sentence_boundary`: use `split_by="sentence"` instead and adjust `split_length` and `max_tokens` to your needs.
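To make the interplay of these parameters concrete, here is a hedged usage sketch. The class name and parameters are the ones proposed in this PR, but the import path, exact signature and return value are assumptions and may differ from the final code:

    from haystack.schema import Document
    # Import path is an assumption; the splitter class is introduced by this PR.
    from haystack.nodes import DocumentSplitter

    splitter = DocumentSplitter(
        split_by="sentence",      # "character", "token", "word", "sentence", "paragraph", "page" or "regex"
        split_length=5,           # how many split_by units end up in each output document
        split_overlap=0,          # 0 = no shared units between consecutive splits
        split_max_tokens=256,     # soft cap: try to keep every split below this token count
        split_max_chars=10_000,   # hard safety cap: anything longer is hard-split and an ERROR is logged
    )
    results, _ = splitter.run(documents=[Document(content="First sentence. Second sentence. Third one.")])

And a tiny standalone illustration of the `split_max_tokens` behaviour described above: sentences are greedily grouped so each group stays below the limit, and a single over-long sentence becomes its own document rather than being split (pure-Python sketch, not the PR's implementation):

    def group_by_max_tokens(sentences, count_tokens, max_tokens):
        groups, current, current_tokens = [], [], 0
        for sentence in sentences:
            n = count_tokens(sentence)
            if current and current_tokens + n > max_tokens:
                groups.append(current)          # close the current group before it overflows
                current, current_tokens = [], 0
            current.append(sentence)            # an over-long sentence still lands in a group of its own
            current_tokens += n
        if current:
            groups.append(current)
        return groups

    # Whitespace "tokenizer" just for the example:
    print(group_by_max_tokens(["one two", "three four five six seven", "eight"],
                              lambda s: len(s.split()), max_tokens=4))
    # [['one two'], ['three four five six seven'], ['eight']]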
How did you test it?
`test_preprocessor.py`, `test_document_merger.py`, `test_document_splitter.py`, `test_document_cleaner.py`
Notes for the reviewer
Checklist