File conversion and data extraction have proven problematic lately. We have several converters and parsers in Haystack, but it's not always clear which one (or which combination) to use. Sometimes a tool works well for one task but not for another, and special characters or text layout can easily make document ingestion fail.
Desired outcome
Fix the pain points with PDF layouts: special characters, multi-column text, and headlines.
Test on a larger dataset of examples. This will also make comparison easier if we want to introduce OCR later.
Conversion and preprocessing often fail silently. Review the code and find places where a warning message would help users.
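To illustrate the last point, here is a minimal sketch of the kind of check that could surface a silent failure. It is not existing Haystack code: the function name `check_extracted_text`, its parameters, and the 0.9 threshold are all hypothetical, and it uses only the Python standard library.

```python
import logging
from typing import List

logger = logging.getLogger(__name__)


def check_extracted_text(text: str, source: str, min_printable_ratio: float = 0.9) -> List[str]:
    """Log and return warnings for suspicious conversion output.

    Hypothetical helper: `source` is just a label (e.g. a file name) used in
    the messages, and `min_printable_ratio` is an arbitrary example threshold.
    """
    problems: List[str] = []

    if not text.strip():
        # The converter returned nothing usable, which often goes unnoticed.
        problems.append(f"{source}: conversion produced no text")
    else:
        # Count characters that are printable or ordinary whitespace.
        readable = sum(ch.isprintable() or ch.isspace() for ch in text)
        ratio = readable / len(text)
        if ratio < min_printable_ratio:
            problems.append(
                f"{source}: only {ratio:.0%} of extracted characters are readable; "
                "the file may contain special characters the converter cannot handle"
            )

    for message in problems:
        logger.warning(message)
    return problems
```

A check like this could run right after a converter returns, so that empty or garbled output is reported instead of being passed on to preprocessing unnoticed.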
Related issues:
PreProcessor to split files by custom regex (e.g. for markdown headlines) #2583
Process feat: split PreProcessor #3557
CsvTextConverter #3587