feat: split PreProcessor #3557

Closed
wants to merge 54 commits into from
54 commits
e4794d2
test and improve header/footer detection
ZanSara Nov 8, 2022
b503a1e
improve preprocessor testing
ZanSara Nov 9, 2022
3a02ef0
extracted headline alignment algorithm
ZanSara Nov 10, 2022
ad44cc2
reuse remove_substrings to remove headers and footers
ZanSara Nov 10, 2022
6627196
clean functions tested
ZanSara Nov 11, 2022
81e3e69
add options to split
ZanSara Nov 11, 2022
e5bb35b
implement regex cleaning and splitting, tested
ZanSara Nov 15, 2022
9ddc82b
testing regex splitting
ZanSara Nov 16, 2022
f335c8b
one stubborn test failing
ZanSara Nov 17, 2022
e8e8f5c
stub word tokenization
ZanSara Nov 17, 2022
9aee20d
simplify testing slightly
ZanSara Nov 18, 2022
cbfcbd9
all tests seem to be passing
ZanSara Nov 18, 2022
0030e67
few more tests failing
ZanSara Nov 18, 2022
a0d95b4
Remove base class
ZanSara Nov 21, 2022
b9ea09c
improving the merger to use in preprocessor
ZanSara Nov 21, 2022
1e12a36
split into splitter, cleaner and merger. Tests not passing
ZanSara Nov 21, 2022
07b3677
Merge branch 'main' into preprocessor-max-length
ZanSara Nov 22, 2022
c4b1e65
merger and splitter pass all tests
ZanSara Nov 22, 2022
59833cd
all tests are passing
ZanSara Nov 22, 2022
ae3fad7
add char based split
ZanSara Nov 23, 2022
217228f
expose split_max_chars in preProcessor
ZanSara Nov 23, 2022
f848c3b
separate split_by word and token
ZanSara Nov 23, 2022
65b01c2
more tests
ZanSara Nov 23, 2022
6c209fd
proper split_by token
ZanSara Nov 23, 2022
46af584
openapi
ZanSara Nov 23, 2022
bc482e5
some mypy fixes
ZanSara Nov 24, 2022
603ba46
integrate reviewer feedback
ZanSara Nov 24, 2022
b79d7f5
reintroduce old preprocessor with deprecation notice
ZanSara Nov 24, 2022
ec7e218
implement and lightly test max_tokens
ZanSara Nov 24, 2022
ff87cb8
Expose max_tokens to the NewPreprocessor and add docstrings
ZanSara Nov 24, 2022
ce4be7d
add error log
ZanSara Nov 24, 2022
fa8432f
Merge branch 'main' into preprocessor-max-length
ZanSara Nov 24, 2022
4a857c8
typo
ZanSara Nov 24, 2022
f226f83
mypy
ZanSara Nov 24, 2022
6047e9b
pylint
ZanSara Nov 24, 2022
d0e8c02
mypy + pylint
ZanSara Nov 24, 2022
163400d
schema fix
ZanSara Nov 24, 2022
2b20564
typing
ZanSara Nov 24, 2022
1242fa4
investigating weaviate
ZanSara Nov 28, 2022
d7a435c
implement most of the reviewer feedback
ZanSara Nov 29, 2022
5f9dd7d
simplify split_respect_sentence_boundary warning message
ZanSara Nov 29, 2022
6c918ff
fix a few bugs with max_tokens
ZanSara Nov 30, 2022
c6bf667
fixing headlines
ZanSara Nov 30, 2022
89c1c80
add new tokenization option to speed up the tokens count
ZanSara Nov 30, 2022
3fb14e6
default to word as in the old preprocessor
ZanSara Dec 5, 2022
674d0ff
rename into DocumentPreprocessor
ZanSara Dec 5, 2022
3c4d676
less deepcopy
ZanSara Dec 7, 2022
d4e8332
improving merger
ZanSara Dec 9, 2022
f3cc2cc
improve merger, drastic speedup of splitter - wip
ZanSara Dec 12, 2022
0a979b6
Merge branch 'main' into preprocessor-max-length
ZanSara Dec 12, 2022
02efb06
extracting helpers
ZanSara Dec 13, 2022
b6af119
making extracted helpers work again
ZanSara Dec 14, 2022
6e1ed66
testing merging alg
ZanSara Dec 19, 2022
c39a6bf
another approach, not working
ZanSara Dec 20, 2022
7 changes: 7 additions & 0 deletions docs/_src/api/openapi/openapi-1.12.0rc0.json
@@ -661,6 +661,13 @@
     "embedding": {
       "title": "Embedding",
       "type": "string"
+    },
+    "id_hash_keys": {
+      "title": "Id Hash Keys",
+      "type": "array",
+      "items": {
+        "type": "string"
+      }
     }
   }
 },
7 changes: 7 additions & 0 deletions docs/_src/api/openapi/openapi.json
@@ -661,6 +661,13 @@
     "embedding": {
       "title": "Embedding",
       "type": "string"
+    },
+    "id_hash_keys": {
+      "title": "Id Hash Keys",
+      "type": "array",
+      "items": {
+        "type": "string"
+      }
     }
   }
 },
2 changes: 1 addition & 1 deletion haystack/document_stores/base.py
@@ -442,7 +442,7 @@ def add_eval_data(
         if preprocessor is not None:
             assert preprocessor.split_by != "sentence", (
                 f"Split by sentence not supported.\n"
-                f"Please set 'split_by' to either 'word' or 'passage' in the supplied PreProcessor."
+                f"Please set 'split_by' to either 'word', 'paragraph', or 'page' in the supplied PreProcessor."
             )
             assert preprocessor.split_respect_sentence_boundary == False, (
                 f"split_respect_sentence_boundary not supported yet.\n"
2 changes: 1 addition & 1 deletion haystack/document_stores/es_converter.py
@@ -6,7 +6,7 @@
 from haystack.schema import Document
 from haystack.document_stores.base import BaseDocumentStore
 from haystack.document_stores.filter_utils import LogicalFilterClause
-from haystack.nodes.preprocessor.preprocessor import PreProcessor
+from haystack.nodes.preprocessor.preprocessor_old import PreProcessor


 def open_search_index_to_document_store(
2 changes: 1 addition & 1 deletion haystack/document_stores/utils.py
@@ -149,7 +149,7 @@ def _extract_docs_and_labels_from_dict(
             ## Create Document
             cur_full_doc = Document(content=paragraph["context"], meta=cur_meta)
             if preprocessor is not None:
-                splits_docs = preprocessor.process(documents=[cur_full_doc])
+                splits_docs = preprocessor.run(documents=[cur_full_doc])[0]["documents"]
                 # we need to pull in _split_id into the document id for unique reference in labels
                 splits: List[Document] = []
                 offset = 0
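
Note on the call-site change in this hunk: replacing `preprocessor.process(...)` with `preprocessor.run(...)[0]["documents"]` appears to rely on the node convention that `run()` returns a tuple of `(output_dict, edge_name)`, so `[0]["documents"]` picks the processed documents out of the first element. A minimal, self-contained sketch of that unpacking follows. `StubSplitter` and the simplified `Document` are hypothetical stand-ins written for illustration and are not part of this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Document:
    # Simplified stand-in for haystack.schema.Document (assumption, illustration only).
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


class StubSplitter:
    """Hypothetical node that follows the (output_dict, edge_name) return convention."""

    def run(self, documents: List[Document]) -> Tuple[Dict[str, List[Document]], str]:
        # Split each document on blank lines and copy the metadata onto every split.
        splits = [
            Document(content=part, meta=dict(doc.meta))
            for doc in documents
            for part in doc.content.split("\n\n")
            if part
        ]
        return {"documents": splits}, "output_1"


full_doc = Document(content="First paragraph.\n\nSecond paragraph.", meta={"name": "demo"})

# Explicit unpacking is equivalent to indexing with [0]["documents"].
output, edge = StubSplitter().run(documents=[full_doc])
split_docs = output["documents"]
```

Unpacking the tuple by name, as in the last two lines, may be easier to read at call sites than chained indexing, at the cost of one extra line.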
13 changes: 12 additions & 1 deletion haystack/document_stores/weaviate.py
@@ -256,6 +256,10 @@ def _convert_weaviate_result_to_document(
         if props.get("content_type") is not None:
             content_type = str(props.pop("content_type"))

+        id_hash_keys = None
+        if props.get("id_hash_keys") is not None:
+            id_hash_keys = str(props.pop("id_hash_keys"))
+
         # Weaviate creates "_additional" key for semantic search
         if "_additional" in props:
             if "certainty" in props["_additional"]:
@@ -288,7 +292,14 @@ def _convert_weaviate_result_to_document(
                 meta_data[k] = v

         document = Document.from_dict(
-            {"id": id, "content": content, "content_type": content_type, "meta": meta_data, "score": score}
+            {
+                "id": id,
+                "content": content,
+                "content_type": content_type,
+                "meta": meta_data,
+                "id_hash_keys": id_hash_keys,
+                "score": score,
+            }
         )

         if return_embedding and embedding:
10 changes: 8 additions & 2 deletions haystack/nodes/__init__.py
@@ -21,8 +21,14 @@
     ParsrConverter,
 )
 from haystack.nodes.label_generator import PseudoLabelGenerator
-from haystack.nodes.other import Docs2Answers, JoinDocuments, RouteDocuments, JoinAnswers, DocumentMerger
-from haystack.nodes.preprocessor import BasePreProcessor, PreProcessor
+from haystack.nodes.other import Docs2Answers, JoinDocuments, RouteDocuments, JoinAnswers
+from haystack.nodes.preprocessor import (
+    PreProcessor,
+    DocumentMerger,
+    DocumentSplitter,
+    DocumentCleaner,
+    DocumentPreProcessor,
+)
 from haystack.nodes.query_classifier import SklearnQueryClassifier, TransformersQueryClassifier
 from haystack.nodes.question_generator import QuestionGenerator
 from haystack.nodes.ranker import BaseRanker, SentenceTransformersRanker
1 change: 0 additions & 1 deletion haystack/nodes/other/__init__.py
@@ -3,4 +3,3 @@
 from haystack.nodes.other.route_documents import RouteDocuments
 from haystack.nodes.other.join_answers import JoinAnswers
 from haystack.nodes.other.join import JoinNode
-from haystack.nodes.other.document_merger import DocumentMerger
88 changes: 0 additions & 88 deletions haystack/nodes/other/document_merger.py

This file was deleted.

7 changes: 5 additions & 2 deletions haystack/nodes/preprocessor/__init__.py
@@ -1,2 +1,5 @@
-from haystack.nodes.preprocessor.base import BasePreProcessor
-from haystack.nodes.preprocessor.preprocessor import PreProcessor
+from haystack.nodes.preprocessor.preprocessor_old import PreProcessor
+from haystack.nodes.preprocessor.preprocessor_new import DocumentPreProcessor
+from haystack.nodes.preprocessor.splitter import DocumentSplitter
+from haystack.nodes.preprocessor.cleaner import DocumentCleaner
+from haystack.nodes.preprocessor.merger import DocumentMerger
107 changes: 0 additions & 107 deletions haystack/nodes/preprocessor/base.py

This file was deleted.
