-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Extraction of headlines in markdown files #3445
Conversation
word_count_sen = len(sen.split()) | ||
|
||
if word_count_sen > split_length: | ||
long_sentence_message = f"One or more sentence found with word count higher than the split length." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also add a sentence about how they can fix it or what happens now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nor sure what we should add. Nothing really happening here, just informing the user that they have at least one exceptionally long sentence (longer than their specified split_length
) in their Document. The consequence of this is that they will have at least one Document consisting of a single sentence.
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
""" | ||
# md -> html -> text since BeautifulSoup can extract text cleanly | ||
html = markdown(markdown_string) | ||
headline_tags = {"h1", "h2", "h3", "h4", "h5", "h6"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make the depth level configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or even better - find the deepest tree level in one pass of regular expressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are pre-defined HTML tags as listed for example here.
Before converting a markdown file to text, we convert it to HTML. The headline_tags
set here is just used to check if an HTML element is a headline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, LGTM @bogdankostic, especially the effort you put into unit tests. Some total newcomer views: would it be possible/beneficial for users to return the headlines as a tree, as that's the natural order of the headlines? Then they can order it any way they want (post/pre/in order). Great work overall
My initial idea was actually to represent the extracted headlines as a tree. I decided against it because this can result in quite nested, complex structures - especially for large Documents. I thought it would be much more understandable like this. |
id_hash_keys = self.id_hash_keys | ||
|
||
id_hash_keys = id_hash_keys if id_hash_keys is not None else self.id_hash_keys | ||
remove_code_snippets = remove_code_snippets if remove_code_snippets is not None else self.remove_code_snippets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bogdankostic another option is id_hash_keys = id_hash_keys or self.id_hash_keys
, but I am undecided about which one is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
id_hash_keys = id_hash_keys or self.id_hash_keys
would work in this case because id_hash_keys
is supposed to be a list, but this doesn't work with optional boolean parameters.
For example, if we explicitly set remove_code_snippets
to False
when calling convert
, remove_code_snippets or self.remove_code_snippets
would evaluate to whatever value self.remove_code_snippets
has.
For consistency, I would like all these lines to have the same pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great points
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GTG now @bogdankostic
Related Issues
Proposed Changes:
This PR adds the possibility to extract headlines out of a markdown file. For this, it adds the parameter
extract_headlines
to theMarkdownConverter
and adapts thePreProcessor
to be able to keep only the relevant headlines for each Document split when splitting the original Document.Headlines are stored as a list of dictionaries in the Document's meta field
"headlines"
and are structured as followed:start_idx
is set to None by thePreProcessor
when the headline is relevant for the current split but it appears in a previous split.Further changes:
remove_code_snippets
toMarkdownConverter
to allow the users to choose whether to remove the code snippets (previously, code snippets were always removed)PreProcessor
'ssplit
method to make it (hopefully) more readable_split_sentences
to not remove whitespace characters when splitting a text into sentencesHow did you test it?
I added a couple of unit tests, let me know if I should add more.
Notes for the reviewer
I'm not quite sure about the dictionary format for the headlines, especially about setting
start_idx
toNone
if the headline cannot be found inDocument.content
. Please let me know if you think this is okay or you see a better solution.Checklist