feat: Extraction of headlines in markdown files #3445

bogdankostic · 2022-10-20T18:40:18Z

Related Issues

fixes Extract headings from Markdown files #3056

Proposed Changes:

This PR adds the ability to extract headlines from a Markdown file. To do so, it adds the parameter extract_headlines to the MarkdownConverter and adapts the PreProcessor so that, when the original Document is split, each resulting Document keeps only the headlines relevant to it.
Headlines are stored as a list of dictionaries in the Document's meta field "headlines" and are structured as follows:

{
    "headline": <THE HEADLINE STRING>,
    "start_idx": <IDX OF HEADLINE START IN document.content, int or None>,
    "level": <LEVEL OF THE HEADLINE, int>
}

start_idx is set to None by the PreProcessor when the headline is relevant for the current split but appears in a previous split.
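To make the format concrete, here is a minimal sketch of how headlines could be selected for one split. It is not the PreProcessor's actual code: the helper name headlines_for_split is hypothetical, and two rules in it are assumptions made for illustration (only the most recent earlier headline counts as "relevant", and start_idx is re-based to the split's own content).

```python
from typing import Any, Dict, List, Optional

Headline = Dict[str, Any]  # keys: "headline", "start_idx", "level"


def headlines_for_split(headlines: List[Headline], split_start: int, split_end: int) -> List[Headline]:
    """Hypothetical helper: pick the headlines for content[split_start:split_end]."""
    carried: Optional[Headline] = None
    inside: List[Headline] = []
    for headline in headlines:
        idx = headline["start_idx"]
        if idx < split_start:
            # Assumption: only the most recent earlier headline stays relevant.
            # It is not part of this split's text, so start_idx becomes None.
            carried = {**headline, "start_idx": None}
        elif idx < split_end:
            # Assumption: indices are re-based to the split's own content.
            inside.append({**headline, "start_idx": idx - split_start})
    return ([carried] if carried is not None else []) + inside


content = "Intro\nSome text.\nSetup\nMore text."
headlines = [
    {"headline": "Intro", "start_idx": content.index("Intro"), "level": 1},
    {"headline": "Setup", "start_idx": content.index("Setup"), "level": 2},
]
# Split covering only "Some text.\n": "Intro" is kept with start_idx None.
print(headlines_for_split(headlines, 6, 17))
```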

Further changes:

Add parameter remove_code_snippets to MarkdownConverter to allow the users to choose whether to remove the code snippets (previously, code snippets were always removed)
Refactor PreProcessor's split method to make it (hopefully) more readable
Adapt the SentenceTokenizer in _split_sentences to not remove whitespace characters when splitting a text into sentences

How did you test it?

I added a couple of unit tests, let me know if I should add more.

Notes for the reviewer

I'm not quite sure about the dictionary format for the headlines, especially about setting start_idx to None if the headline cannot be found in Document.content. Please let me know if you think this is okay or you see a better solution.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added tests that demonstrate the correct behavior of the change
I've used the conventional commit convention for my PR title
I documented my code
I ran pre-commit hooks and fixed any issue

haystack/nodes/file_converter/markdown.py

haystack/nodes/preprocessor/preprocessor.py

agnieszka-m · 2022-10-24T08:13:28Z

haystack/nodes/preprocessor/preprocessor.py

+            word_count_sen = len(sen.split())
+
+            if word_count_sen > split_length:
+                long_sentence_message = f"One or more sentence found with word count higher than the split length."


Can we also add a sentence about how they can fix it or what happens now?

Not sure what we should add. Nothing really happens here; we are just informing the user that their Document contains at least one exceptionally long sentence (longer than their specified split_length). The consequence is that they will have at least one Document consisting of a single sentence.
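One way to address the request would be to state the consequence in the log message itself. The sketch below is only an illustration: the function name warn_on_long_sentences and the message wording are assumptions, not what the PR merged.

```python
import logging
from typing import List

logger = logging.getLogger(__name__)


def warn_on_long_sentences(sentences: List[str], split_length: int) -> int:
    """Hypothetical helper: count sentences whose word count exceeds split_length."""
    long_count = sum(1 for sen in sentences if len(sen.split()) > split_length)
    if long_count:
        # Log once per call, and say what the user will observe as a result.
        logger.warning(
            "%d sentence(s) found with a word count higher than the split length (%d). "
            "At least one resulting Document will consist of a single sentence.",
            long_count,
            split_length,
        )
    return long_count
```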

haystack/nodes/preprocessor/preprocessor.py

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

haystack/nodes/file_converter/markdown.py

vblagoje · 2022-10-25T08:02:03Z

haystack/nodes/file_converter/markdown.py

        """
-        # md -> html -> text since BeautifulSoup can extract text cleanly
-        html = markdown(markdown_string)
+        headline_tags = {"h1", "h2", "h3", "h4", "h5", "h6"}


Make the depth level configurable?

Or even better - find the deepest tree level in one pass of regular expressions?

These are pre-defined HTML tags (h1 through h6 are the only heading levels HTML defines), as listed for example here.
Before converting a Markdown file to text, we convert it to HTML. The headline_tags set is only used to check whether an HTML element is a headline.
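As a sketch of the tag check only: the removed comment in the diff indicates the converter goes md -> html -> text with BeautifulSoup, whereas the example below swaps in Python's standard html.parser so that it runs without extra dependencies. HeadlineCollector and collect_headlines are hypothetical names, not part of the PR.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

HEADLINE_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadlineCollector(HTMLParser):
    """Collects (text, level) pairs for every heading element in an HTML string."""

    def __init__(self) -> None:
        super().__init__()
        self.headlines: List[Tuple[str, int]] = []
        self._current_level: Optional[int] = None
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADLINE_TAGS:
            # The level is the digit in the tag name, e.g. "h3" -> 3.
            self._current_level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that sits inside a heading element.
        if self._current_level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in HEADLINE_TAGS and self._current_level is not None:
            self.headlines.append(("".join(self._buffer).strip(), self._current_level))
            self._current_level = None


def collect_headlines(html: str) -> List[Tuple[str, int]]:
    parser = HeadlineCollector()
    parser.feed(html)
    parser.close()
    return parser.headlines
```

Because the set of heading tags is fixed by HTML, a configurable depth would only filter this list by level rather than change which tags count as headlines.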

haystack/nodes/file_converter/markdown.py

vblagoje

Overall, LGTM @bogdankostic, especially the effort you put into unit tests. Some total newcomer views: would it be possible/beneficial for users to return the headlines as a tree, as that's the natural order of the headlines? Then they can order it any way they want (post/pre/in order). Great work overall

bogdankostic · 2022-10-25T13:45:51Z

would it be possible/beneficial for users to return the headlines as a tree, as that's the natural order of the headlines?

My initial idea was actually to represent the extracted headlines as a tree. I decided against it because this can result in quite nested, complex structures - especially for large Documents. I thought it would be much more understandable like this.
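Since every entry carries its level, a user who prefers a tree could derive one from the flat list. The following is a sketch under that assumption; headlines_to_tree and the "children" key are hypothetical and not something the PR provides.

```python
from typing import Any, Dict, List


def headlines_to_tree(headlines: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: nest a flat, document-ordered headline list by level."""
    roots: List[Dict[str, Any]] = []
    stack: List[Dict[str, Any]] = []  # path from a root to the most recent headline
    for headline in headlines:
        node = {**headline, "children": []}
        # Pop until the top of the stack is a strictly higher-level ancestor.
        while stack and stack[-1]["level"] >= node["level"]:
            stack.pop()
        if stack:
            stack[-1]["children"].append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots
```

The flat list stays the stored format; the tree is built on demand and can then be traversed in whatever order the user needs.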

vblagoje · 2022-10-26T08:20:37Z

haystack/nodes/file_converter/markdown.py

-            id_hash_keys = self.id_hash_keys
+
+        id_hash_keys = id_hash_keys if id_hash_keys is not None else self.id_hash_keys
+        remove_code_snippets = remove_code_snippets if remove_code_snippets is not None else self.remove_code_snippets


@bogdankostic another option is id_hash_keys = id_hash_keys or self.id_hash_keys, but I am undecided about which one is better.

id_hash_keys = id_hash_keys or self.id_hash_keys would work in this case because id_hash_keys is supposed to be a list, but this doesn't work with optional boolean parameters.

For example, if we explicitly set remove_code_snippets to False when calling convert, remove_code_snippets or self.remove_code_snippets would evaluate to whatever value self.remove_code_snippets has.

For consistency, I would like all these lines to have the same pattern.
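The difference between the two patterns can be reproduced in isolation. The Converter class below is a stand-in written for this example, not the MarkdownConverter; only the two expressions mirror the ones discussed.

```python
from typing import Optional


class Converter:
    """Stand-in class holding an instance-level default for a boolean option."""

    def __init__(self, remove_code_snippets: bool = True) -> None:
        self.remove_code_snippets = remove_code_snippets

    def resolve_with_or(self, remove_code_snippets: Optional[bool] = None) -> bool:
        # An explicit False is falsy, so `or` falls through to the default.
        return remove_code_snippets or self.remove_code_snippets

    def resolve_with_none_check(self, remove_code_snippets: Optional[bool] = None) -> bool:
        # Only a missing value (None) falls back to the default.
        return remove_code_snippets if remove_code_snippets is not None else self.remove_code_snippets


converter = Converter(remove_code_snippets=True)
print(converter.resolve_with_or(False))          # True: the explicit False is lost
print(converter.resolve_with_none_check(False))  # False: the explicit value wins
```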

Great points

vblagoje

GTG now @bogdankostic

bogdankostic added 6 commits October 19, 2022 17:37

Extract headings from markdown files + adapt PreProcessor

d661a0b

Merge origin/main into headings_md_file

bcb4115

Add tests

55b06fe

Fix mypy

411f80c

Merge remote-tracking branch 'origin/main' into headings_md_files

e5662f1

Generate JSON schema

3763116

bogdankostic added type:feature New feature or request topic:file_converter topic:preprocessing labels Oct 20, 2022

bogdankostic marked this pull request as ready for review October 20, 2022 21:11

bogdankostic requested review from a team as code owners October 20, 2022 21:11

bogdankostic requested review from mayankjobanputra and removed request for a team October 20, 2022 21:11

agnieszka-m requested changes Oct 24, 2022

bogdankostic and others added 3 commits October 24, 2022 18:08

Apply suggestions from code review

62ab837

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

Update haystack/nodes/file_converter/markdown.py

5c59081

Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>

Apply black

bf76e3f

bogdankostic requested a review from vblagoje October 25, 2022 07:46

vblagoje reviewed Oct 25, 2022

vblagoje reviewed Oct 25, 2022

vblagoje requested changes Oct 25, 2022

Add PR feedback

eea5a72

bogdankostic requested review from agnieszka-m and vblagoje October 25, 2022 13:46

agnieszka-m approved these changes Oct 25, 2022

vblagoje reviewed Oct 26, 2022

vblagoje approved these changes Oct 26, 2022

bogdankostic merged commit 4fbe80c into main Oct 26, 2022

bogdankostic deleted the headings_md_files branch October 26, 2022 09:57

bogdankostic mentioned this pull request Oct 27, 2022

feat: Add headline extraction to ParsrConverter #3488

Merged
