Improve document conversion #3308

Closed
4 tasks done
masci opened this issue Oct 3, 2022 · 0 comments

Comments

@masci
Contributor

masci commented Oct 3, 2022

File conversion and data extraction have proven problematic lately. We have several converters and parsers in Haystack, but it’s not always clear which one (or which combination) to use. A tool that works well for one task may not work for another, and special characters and text layout can easily make document ingestion fail.

Desired outcome

  • Fix the pain points with PDF layouts: special characters, multi-column text, and headlines
  • Test on a larger dataset of examples; this will also ease comparison if we want to introduce OCR later
  • Conversion and preprocessing often fail silently: review the code and find where adding a warning message could help users
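
To illustrate the last point, here is a minimal sketch of what "warn instead of failing silently" could look like. The helper `check_converted_text`, its parameters, and the thresholds are all hypothetical and not part of Haystack; the idea is just to flag empty or suspicious conversion output at the point where it is produced.

```python
import warnings


def check_converted_text(
    text: str,
    source: str,
    min_chars: int = 1,
    max_non_printable_ratio: float = 0.2,
) -> str:
    """Hypothetical post-conversion check: emit a warning when the output looks wrong.

    The text is always returned unchanged; this only makes problems visible.
    """
    stripped = text.strip()

    # Case 1: the converter returned nothing usable (e.g. a scanned, image-only PDF).
    if len(stripped) < min_chars:
        warnings.warn(
            f"{source}: conversion produced no text; the file may need OCR.",
            UserWarning,
        )
        return text

    # Case 2: many non-printable characters, which often indicates an
    # encoding or font-mapping problem (assumed threshold, tune as needed).
    odd = sum(1 for ch in stripped if not (ch.isprintable() or ch.isspace()))
    if odd / len(stripped) > max_non_printable_ratio:
        warnings.warn(
            f"{source}: {odd} of {len(stripped)} characters are non-printable; "
            "the extracted text may be garbled.",
            UserWarning,
        )
    return text
```

A check like this could be called right after each converter runs, so users see which file caused trouble rather than finding empty documents later in the pipeline.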

Related issues:


Process

@masci masci added the epic label Oct 3, 2022
@ZanSara ZanSara mentioned this issue Nov 23, 2022
6 tasks
@masci masci closed this as completed Mar 13, 2023
@masci masci removed the epic:in-progress Epic is in progress label Mar 13, 2023

2 participants