Prevent line breaks, deliver reading order. #3878

Merged 1 commit on Sep 23, 2024

Commits on Sep 23, 2024

  1. Prevent line breaks, deliver reading order.

    Refactor plain text and "words" extraction with sort=True:
    Previously, we simply sorted the output by ascending bottom and left coordinates.
    This change collects words (and, correspondingly, text) that are approximately on the same line.
    Except on extremely malformed pages, words and text are now returned in "natural" reading sequence.
    
    This change also suppresses line breaks that MuPDF generates solely because of large horizontal distances (as often happens, e.g., between the contents of table cells in the same row).
    JorjMcKie committed Sep 23, 2024
    Commit dfe1fdc
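
The line-collection idea in the commit message can be illustrated with a minimal sketch. This is not the code from this commit: the function name `group_words_into_lines`, the `tolerance` parameter, and its default value are hypothetical, chosen only for illustration. The sketch assumes word tuples that start with `(x0, y0, x1, y1, text)`, the same leading fields that `page.get_text("words")` returns, and treats words whose bottom coordinates differ by no more than `tolerance` as belonging to the same line.

```python
def group_words_into_lines(words, tolerance=3.0):
    """Group word tuples (x0, y0, x1, y1, text, ...) into lines of text.

    Hypothetical helper: `tolerance` is an assumed threshold in points.
    """
    lines = []  # each entry: [reference_bottom, [word, ...]]
    # Visit words top to bottom; ties are broken left to right.
    for word in sorted(words, key=lambda w: (w[3], w[0])):
        bottom = word[3]
        if lines and abs(bottom - lines[-1][0]) <= tolerance:
            # Close enough to the current line's first word: same line.
            lines[-1][1].append(word)
        else:
            # Otherwise start a new line, using this bottom as reference.
            lines.append([bottom, [word]])
    # Inside each line, restore left-to-right order before joining.
    return [
        " ".join(w[4] for w in sorted(line_words, key=lambda w: w[0]))
        for _, line_words in lines
    ]


# Two table rows whose cells have slightly different bottom coordinates.
sample = [
    (300.0, 100.5, 340.0, 112.4, "Price"),
    (50.0, 100.0, 90.0, 112.0, "Item"),
    (50.0, 120.0, 95.0, 132.0, "Apple"),
    (300.0, 119.6, 330.0, 131.7, "1.20"),
]
print(group_words_into_lines(sample))
```

In the sample, a plain sort by bottom and then left coordinate would place "1.20" before "Apple", because its bottom edge is 0.3 points higher. Grouping by approximate line first keeps each row in reading order and joins widely separated cells of one row without a line break.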