Improve document conversion #3308

masci · 2022-10-03T10:37:13Z

File conversion and data extraction have proven problematic lately. We have several converters and parsers in Haystack, but it's not always clear which one (or which combination) to use. Sometimes a tool works well for one task but not for another, and special characters and text layout can easily make document ingestion fail.

Desired outcome

Fix the pain points with PDF layouts: special characters, multi-column text, and headlines
Test on a larger dataset of examples; this will also ease comparison if we want to introduce OCR later
Conversion and preprocessing often fail silently; review the code and find where adding a warning message could help users
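
To illustrate the last point, here is a minimal sketch of a check that logs a warning when a conversion yields little or no text, rather than passing an empty result along silently. It uses only the standard `logging` module. The function name `extract_text_or_warn` and the `min_chars` parameter are hypothetical and not part of Haystack.

```python
import logging

logger = logging.getLogger(__name__)


def extract_text_or_warn(raw_text: str, source: str, min_chars: int = 1) -> str:
    """Strip the converted text and warn if almost nothing was extracted.

    Hypothetical helper: `source` is just a label (e.g. a file name) used
    in the log message, and `min_chars` is an assumed threshold.
    """
    text = raw_text.strip()
    if len(text) < min_chars:
        # Surface the problem to the user instead of failing silently.
        logger.warning(
            "Conversion of %s produced %d characters of text (expected at "
            "least %d). The file may be scanned, encrypted, or use an "
            "unsupported layout.",
            source,
            len(text),
            min_chars,
        )
    return text
```

A converter could call a check like this right after extracting text, so that users see which file produced the empty result.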

Related issues:

Process

Assess LayoutLM's capabilities and whether it makes sense to integrate it
Write a guide about when and how to use the Document converters
- https://docs.haystack.deepset.ai/docs/file_converters#choosing-the-right-pdf-converter
Identify and solve pain points during preprocessing
- ~~feat: split PreProcessor #3557~~
- feat: preprocessor raises warning when doc length exceeds threshold #3837
[OPTIONAL] Implement CSVConverter
- feat: Add CsvTextConverter #3587
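
As a rough sketch of the length-threshold warning mentioned in the list above, the snippet below flags documents whose text exceeds a character limit. It is not the implementation from #3837. The function name `find_overlong_documents` and the default `max_chars` value are assumptions made for illustration.

```python
import warnings
from typing import List


def find_overlong_documents(documents: List[str], max_chars: int = 10_000) -> List[int]:
    """Return the indices of documents longer than `max_chars` and warn about each.

    Hypothetical helper: documents are plain strings here, and the
    threshold is an assumed default, not a Haystack setting.
    """
    overlong = [i for i, doc in enumerate(documents) if len(doc) > max_chars]
    for i in overlong:
        warnings.warn(
            f"Document {i} has {len(documents[i])} characters, above the "
            f"threshold of {max_chars}. It may not have been split as expected.",
            UserWarning,
            stacklevel=2,
        )
    return overlong
```

Returning the indices as well as warning lets the caller decide whether to re-split, skip, or inspect the affected documents.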

masci added the epic label Oct 3, 2022

masci assigned bogdankostic Oct 3, 2022

masci added the topic:preprocessing label Oct 3, 2022

ZanSara mentioned this issue Nov 23, 2022

feat: split PreProcessor #3557

Closed

6 tasks

masci closed this as completed Mar 13, 2023

masci removed the epic:in-progress label Mar 13, 2023
