Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Extraction of headlines in markdown files #3445

Merged
merged 10 commits into from
Oct 26, 2022
Merged

Conversation

bogdankostic
Copy link
Contributor

@bogdankostic bogdankostic commented Oct 20, 2022

Related Issues

Proposed Changes:

This PR adds the possibility to extract headlines out of a markdown file. For this, it adds the parameter extract_headlines to the MarkdownConverter and adapts the PreProcessor to be able to keep only the relevant headlines for each Document split when splitting the original Document.
Headlines are stored as a list of dictionaries in the Document's meta field "headlines" and are structured as followed:

{
    "headline": <THE HEADLINE STRING>,
    "start_idx": <IDX OF HEADLINE START IN document.content, int or None>,
    "level": <LEVEL OF THE HEADLINE, int>
}

start_idx is set to None by the PreProcessor when the headline is relevant for the current split but it appears in a previous split.

Further changes:

  • Add parameter remove_code_snippets to MarkdownConverter to allow the users to choose whether to remove the code snippets (previously, code snippets were always removed)
  • Refactor PreProcessor's split method to make it (hopefully) more readable
  • Adapt the SentenceTokenizer in _split_sentences to not remove whitespace characters when splitting a text into sentences

How did you test it?

I added a couple of unit tests, let me know if I should add more.

Notes for the reviewer

I'm not quite sure about the dictionary format for the headlines, especially about setting start_idx to None if the headline cannot be found in Document.content. Please let me know if you think this is okay or you see a better solution.

Checklist

@bogdankostic bogdankostic marked this pull request as ready for review October 20, 2022 21:11
@bogdankostic bogdankostic requested review from a team as code owners October 20, 2022 21:11
@bogdankostic bogdankostic requested review from mayankjobanputra and removed request for a team October 20, 2022 21:11
haystack/nodes/file_converter/markdown.py Outdated Show resolved Hide resolved
haystack/nodes/file_converter/markdown.py Outdated Show resolved Hide resolved
haystack/nodes/file_converter/markdown.py Outdated Show resolved Hide resolved
haystack/nodes/file_converter/markdown.py Outdated Show resolved Hide resolved
haystack/nodes/preprocessor/preprocessor.py Outdated Show resolved Hide resolved
word_count_sen = len(sen.split())

if word_count_sen > split_length:
long_sentence_message = f"One or more sentence found with word count higher than the split length."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add a sentence about how they can fix it or what happens now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nor sure what we should add. Nothing really happening here, just informing the user that they have at least one exceptionally long sentence (longer than their specified split_length) in their Document. The consequence of this is that they will have at least one Document consisting of a single sentence.

haystack/nodes/preprocessor/preprocessor.py Outdated Show resolved Hide resolved
haystack/nodes/preprocessor/preprocessor.py Outdated Show resolved Hide resolved
bogdankostic and others added 3 commits October 24, 2022 18:08
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
"""
# md -> html -> text since BeautifulSoup can extract text cleanly
html = markdown(markdown_string)
headline_tags = {"h1", "h2", "h3", "h4", "h5", "h6"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make the depth level configurable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even better - find the deepest tree level in one pass of regular expressions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are pre-defined HTML tags as listed for example here.
Before converting a markdown file to text, we convert it to HTML. The headline_tags set here is just used to check if an HTML element is a headline.

Copy link
Member

@vblagoje vblagoje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, LGTM @bogdankostic, especially the effort you put into unit tests. Some total newcomer views: would it be possible/beneficial for users to return the headlines as a tree, as that's the natural order of the headlines? Then they can order it any way they want (post/pre/in order). Great work overall

@bogdankostic
Copy link
Contributor Author

bogdankostic commented Oct 25, 2022

would it be possible/beneficial for users to return the headlines as a tree, as that's the natural order of the headlines?

My initial idea was actually to represent the extracted headlines as a tree. I decided against it because this can result in quite nested, complex structures - especially for large Documents. I thought it would be much more understandable like this.

id_hash_keys = self.id_hash_keys

id_hash_keys = id_hash_keys if id_hash_keys is not None else self.id_hash_keys
remove_code_snippets = remove_code_snippets if remove_code_snippets is not None else self.remove_code_snippets
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdankostic another option is id_hash_keys = id_hash_keys or self.id_hash_keys, but I am undecided about which one is better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id_hash_keys = id_hash_keys or self.id_hash_keys would work in this case because id_hash_keys is supposed to be a list, but this doesn't work with optional boolean parameters.

For example, if we explicitly set remove_code_snippets to False when calling convert, remove_code_snippets or self.remove_code_snippets would evaluate to whatever value self.remove_code_snippets has.

For consistency, I would like all these lines to have the same pattern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great points

Copy link
Member

@vblagoje vblagoje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GTG now @bogdankostic

@bogdankostic bogdankostic merged commit 4fbe80c into main Oct 26, 2022
@bogdankostic bogdankostic deleted the headings_md_files branch October 26, 2022 09:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extract headings from Markdown files
3 participants