Prevent line breaks, deliver reading order. #3878

JorjMcKie · 2024-09-22T22:31:05Z

Refactor plain text and "words" extraction when option sort=True is used:
Previously, the output was simply sorted by ascending bottom and left coordinates. This change collects words (or text portions, respectively) that are approximately on the same line. Except on extremely malformed pages, words and text are now returned in "natural" reading sequence.

This change also suppresses line breaks that MuPDF generates merely because of large horizontal distances, as often happens, for example, between the contents of table cells in the same row.

This change does not alter the user-facing API and is only activated if Page.get_text(sort=True) or Page.get_text("words", sort=True) is used.
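The idea described above can be illustrated with a minimal, library-free sketch. It is not the code from src/utils.py: the function names `group_words_into_lines` and `lines_to_text`, the `(x0, y0, x1, y1, text)` word layout, and the tolerance of 3 points are all assumptions chosen for illustration. Words whose bottom coordinates differ by no more than the tolerance are treated as one line, each line is ordered left to right, and a line is emitted without a break regardless of how wide the horizontal gaps are.

```python
from typing import List, Tuple

# Hypothetical word layout for this sketch: (x0, y0, x1, y1, text)
Word = Tuple[float, float, float, float, str]


def group_words_into_lines(words: List[Word], tolerance: float = 3.0) -> List[List[Word]]:
    """Group words whose bottom coordinates are approximately equal.

    The tolerance value is an assumption for illustration only.
    """
    lines: List[List[Word]] = []
    # Start from the naive order: ascending bottom, then left coordinate.
    for word in sorted(words, key=lambda w: (w[3], w[0])):
        # Compare against the first word of the current line (the line's anchor).
        if lines and abs(word[3] - lines[-1][0][3]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order the words from left to right.
    return [sorted(line, key=lambda w: w[0]) for line in lines]


def lines_to_text(lines: List[List[Word]]) -> str:
    """Join each line with spaces so wide horizontal gaps cause no line break."""
    return "\n".join(" ".join(w[4] for w in line) for line in lines)


if __name__ == "__main__":
    # Two table rows whose cells have slightly different bottom coordinates.
    sample = [
        (300.0, 100.5, 340.0, 112.5, "B1"),
        (72.0, 100.0, 110.0, 112.0, "A1"),
        (72.0, 120.0, 110.0, 132.0, "A2"),
        (300.0, 119.0, 340.0, 131.2, "B2"),
    ]
    print(lines_to_text(group_words_into_lines(sample)))
```

In the sample data, sorting purely by bottom and left coordinate would place "B2" before "A2", because its bottom edge is slightly higher. Grouping by approximate line first keeps each table row in left-to-right order.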

JorjMcKie · 2024-09-23T09:41:40Z

Added test data and a test script.

src/utils.py

Refactor plain text and "words" extraction with sort=True: Previously, the output was simply sorted by ascending bottom and left coordinates. This change collects words (or text, respectively) that are approximately on the same line. Except on extremely malformed pages, words and text are now returned in "natural" reading sequence. This change also suppresses line breaks that MuPDF generates merely because of large horizontal distances (as often happens, for example, between the contents of table cells in the same row).

JorjMcKie · 2024-09-23T14:23:40Z

Now verified that it works for MuPDF v1.24.9 ...

JorjMcKie requested review from julian-smith-artifex-com and jamie-lemon September 22, 2024 22:31

JorjMcKie force-pushed the prevent-linebreaks branch from 485230c to 7790aff Compare September 23, 2024 09:40

JorjMcKie force-pushed the prevent-linebreaks branch from 7790aff to 68d915a Compare September 23, 2024 10:16

julian-smith-artifex-com reviewed Sep 23, 2024

JorjMcKie force-pushed the prevent-linebreaks branch from 68d915a to dfe1fdc Compare September 23, 2024 14:21

JorjMcKie requested a review from julian-smith-artifex-com September 23, 2024 14:23

julian-smith-artifex-com approved these changes Sep 23, 2024

JorjMcKie requested a review from julian-smith-artifex-com September 23, 2024 15:11

JorjMcKie merged commit 75a060d into main Sep 23, 2024
2 checks passed

github-actions bot locked and limited conversation to collaborators Sep 23, 2024

JorjMcKie deleted the prevent-linebreaks branch September 23, 2024 16:20
