feat: Add page number to Documents coming from PDFConverters and PreProcessor #2932
Took a first pass
Co-authored-by: Agnieszka Marzec <97166305+agnieszka-m@users.noreply.github.com>
One small nit but feel free to merge as it is
Thanks for implementing that feature @bogdankostic, it saved me from doing it myself :-) Unfortunately, I found some wrong page numbers when using tesla_annual_report.pdf. I'm not 100% sure whether the bug is easy to fix or whether the real solution is to store the page as metadata already when extracting the PDF, instead of doing it during preprocessing. Should I open a bug ticket?
Hi @brunnurs, thanks for bringing this up! Please open an issue about this and provide some details about the configuration you used for the PDFToTextConverter and the PreProcessor. This would help me immensely to debug this and hopefully provide a fix :)
@bogdankostic I tried to make the bug as easy to replicate as possible in the ticket above. Let me know if you need more information! Many thanks for your effort.
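For context on the page numbers discussed in this thread: the PR description states that page breaks are marked by the `"\f"` character and that a Document's `"page"` is the page it starts on. The following is a minimal sketch of that counting rule, not the actual Haystack implementation. The names `page_at_offset` and `split_with_pages` are hypothetical, and the fixed-size character split is a simplifying assumption (the real PreProcessor has its own splitting options).

```python
from typing import List, Tuple

# Form feed: the page separator named in the PR description.
PAGE_BREAK = "\f"


def page_at_offset(text: str, offset: int) -> int:
    """Return the 1-based page number of the character at `offset`.

    The page number is one more than the number of form feeds that
    occur strictly before `offset`.
    """
    return text.count(PAGE_BREAK, 0, offset) + 1


def split_with_pages(text: str, split_length: int) -> List[Tuple[str, int]]:
    """Split `text` into fixed-size character chunks (an assumption made
    for illustration) and pair each chunk with the page it starts on."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    chunks = []
    for start in range(0, len(text), split_length):
        chunk = text[start:start + split_length]
        chunks.append((chunk, page_at_offset(text, start)))
    return chunks


if __name__ == "__main__":
    # Two consecutive form feeds represent an empty page 3.
    sample = "page one\fpage two\f\fpage four"
    for chunk, page in split_with_pages(sample, 9):
        print(page, repr(chunk))
```

Under this rule, an empty page still advances the count, because every form feed before the start offset is counted whether or not any text sits between them. A converter or cleaning step that drops or merges form feeds would therefore shift every later page number.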
Related Issue(s):
Closes: #2374

Proposed changes:
This PR adds the parameter `add_page_number` to `ParsrConverter`, `AzureConverter` and `PreProcessor`. In `ParsrConverter` and `AzureConverter`, setting this parameter to `True` has the effect of adding a meta field `"page"` to Documents of content_type table containing the page number the table occurs on.

In `PreProcessor`, setting this parameter to `True` has the effect of adding a meta field `"page"` to Documents containing the page number the Document starts on. Page breaks are determined by the `"\f"` character, which is added by `PDFToTextConverter`, `ParsrConverter` and `AzureConverter` in between pages.

Also, I noticed that we don't display API documentation on our documentation website for `ParsrConverter` and `AzureConverter`, so I added those.

Pre-flight checklist