Apply detect() on readable PDF files #29
Thanks for pointing this out. PDF support is among our future plans. In the meantime, you could try to use the …
@lolipopshock thanks for the response! I thought about this workflow as well; however, I assumed this approach would ignore the fact that the PDF files have already been run through OCR (i.e., they are readable). It seems somewhat cumbersome to first convert a readable PDF to an image, only to then re-apply OCR...
You might want to check the following snippet:

```python
import pdfplumber
import layoutparser as lp

def obtain_word_tokens(cur_page: pdfplumber.page.Page) -> lp.Layout:
    # Extract word-level tokens (with font info) from a pdfplumber page
    words = cur_page.extract_words(
        x_tolerance=1.5,
        y_tolerance=3,
        keep_blank_chars=False,
        use_text_flow=True,
        horizontal_ltr=True,
        vertical_ttb=True,
        extra_attrs=["fontname", "size"],
    )
    # Wrap each word in a layoutparser TextBlock
    return lp.Layout([
        lp.TextBlock(
            lp.Rectangle(float(ele['x0']), float(ele['top']),
                         float(ele['x1']), float(ele['bottom'])),
            text=ele['text'],
            id=idx,
        )
        for idx, ele in enumerate(words)
    ])

plumber_pdf_object = pdfplumber.open(pdf_path)

all_pages_tokens = []
for page_id in range(len(plumber_pdf_object.pages)):
    cur_page = plumber_pdf_object.pages[page_id]
    tokens = obtain_word_tokens(cur_page)
    all_pages_tokens.append(tokens)
```
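Once you have per-page token layouts like `all_pages_tokens`, the remaining glue is assigning each token to a detected layout block. A minimal sketch in pure Python, using plain `(x1, y1, x2, y2)` tuples rather than layoutparser objects (both the helper names and the center-point heuristic are illustrative assumptions, not layoutparser APIs):

```python
def box_center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def assign_tokens_to_blocks(token_boxes, block_boxes):
    """Map each token index to the first block whose box contains
    the token's center point, or None if no block matches."""
    assignment = {}
    for t_idx, t_box in enumerate(token_boxes):
        cx, cy = box_center(t_box)
        assignment[t_idx] = None
        for b_idx, (bx1, by1, bx2, by2) in enumerate(block_boxes):
            if bx1 <= cx <= bx2 and by1 <= cy <= by2:
                assignment[t_idx] = b_idx
                break
    return assignment

tokens = [(10, 10, 20, 20), (100, 100, 120, 110)]
blocks = [(0, 0, 50, 50)]
print(assign_tokens_to_blocks(tokens, blocks))  # {0: 0, 1: None}
```

Note that this assumes the token boxes and block boxes share the same coordinate space, which is exactly the issue discussed further down in this thread.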
I am thinking of combining Poppler and this layout parser to extract structured paragraphs from readable PDFs, so that the advantages of both can be leveraged.
Hm, indeed that would help to extract the actual text, token by token. However, if I understood correctly, this approach would still require using something like the …

EDIT: I started implementing the following workflow:
This raised a new issue, however. That is, the coordinates (x1, x2, y1, y2) assigned by …
Hi there, …
Something like tabula (https://github.com/tabulapdf/tabula) might be helpful? Although it is geared toward tables, it does leverage the structure of the PDF to pull tables out.
May I ask how you solved this scale issue? I also get different x, y coordinates for the text from pdfplumber and the layout boxes from LayoutParser.

UPDATE: It looks like the default DPI of pdf2image is 200, while pdfplumber uses 72 DPI by default. So if you multiply (or divide, depending on your perspective) the coordinates by 200/72, it seems to solve the issue.
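The DPI conversion described above can be sketched as a small helper (`scale_box` is a hypothetical name; the 200 and 72 defaults come from pdf2image and pdfplumber as noted above):

```python
def scale_box(box, from_dpi=200, to_dpi=72):
    """Rescale an (x1, y1, x2, y2) box from one DPI to another."""
    scale = to_dpi / from_dpi
    x1, y1, x2, y2 = box
    return (x1 * scale, y1 * scale, x2 * scale, y2 * scale)

# A box detected on a 200-DPI page image, mapped back to PDF (72-DPI) space:
pdf_box = scale_box((200.0, 100.0, 400.0, 300.0))
```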
Hi @gevezex,
I must admit that I simply dropped the project for now. I found the workflow far too tedious, and I hope the team will add support for already-readable PDF files in the future.
Solved it like this with PyMuPDF (pip install pymupdf). I hope it can help someone with the same issue.

```python
# Function for rescaling (x, y) coordinates from the 200-DPI page image
# back to the 72-DPI PDF coordinate space
def scale_xy(textblock, scale=72/200):
    x1 = textblock.block.x_1 * scale
    y1 = textblock.block.y_1 * scale
    x2 = textblock.block.x_2 * scale
    y2 = textblock.block.y_2 * scale
    return (x1, y1, x2, y2)

# Using PyMuPDF for retrieving text in a bounding box
import fitz  # this is pymupdf

# Function for retrieving the tokens (words). See the PyMuPDF utilities.
def make_text(words):
    """Return textstring output of get_text("words").
    Word items are sorted for reading sequence left to right,
    top to bottom.
    """
    line_dict = {}  # key: vertical coordinate, value: list of words
    words.sort(key=lambda w: w[0])  # sort by horizontal coordinate
    for w in words:  # fill the line dictionary
        y1 = round(w[3], 1)  # bottom of a word: don't be too picky!
        word = w[4]  # the text of the word
        line = line_dict.get(y1, [])  # read current line content
        line.append(word)  # append new word
        line_dict[y1] = line  # write back to dict
    lines = list(line_dict.items())
    lines.sort()  # sort vertically
    return "\n".join([" ".join(line[1]) for line in lines])

# Open your PDF in PyMuPDF
pdf_doc = fitz.open('/location/to/your/file.pdf')
pdf_page4 = pdf_doc[3]  # this will retrieve, for example, page 4
words = pdf_page4.get_text("words")

# Get one of the TextBlocks detected by your LayoutParser model.
# In the docs it was called "layout", so we will use that name.
# First recognized bounding box: layout[0]
# When I print it for my PDF, the output of layout[0] looks like this:
# TextBlock(block=Rectangle(x_1=104.882, y_1=133.696, x_2=124.79, y_2=147.696), text=Het, id=0, type=None, parent=None, next=None, score=None)

# Rescale the coordinates
new_coordinates = scale_xy(layout[0])

# Create a Rect object for fitz (similar to TextBlock for the bounding box coordinates)
rect = fitz.Rect(*new_coordinates)

# Now we can find and print all the tokens in the bounding box:
mywords = [w for w in words if fitz.Rect(w[:4]).intersects(rect)]
print("\nSelect the words intersecting the rectangle")
print("-------------------------------------------")
print(make_text(mywords))
```

Sorry for the confusion of terminologies. I am still learning PDF-related stuff.
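The intersection filtering in the answer above can be reasoned about without fitz: two axis-aligned boxes overlap exactly when they overlap on both the x and y axes. A pure-Python sketch with synthetic word tuples (the helper name `rects_intersect` is illustrative, not a PyMuPDF API, and its edge-touching behavior may differ slightly from `fitz.Rect.intersects`):

```python
def rects_intersect(a, b):
    """True if two (x0, y0, x1, y1) axis-aligned boxes overlap."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

# Synthetic PyMuPDF-style word tuples: (x0, y0, x1, y1, text)
words = [(0, 0, 10, 10, "Het"), (50, 50, 60, 60, "einde")]
rect = (5, 5, 20, 20)  # a rescaled layout box
inside = [w[4] for w in words if rects_intersect(w[:4], rect)]
print(inside)  # ['Het']
```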
You might also want to refer to the PDF parsers that I've implemented in another project recently -- https://github.com/allenai/VILA/blob/master/src/vila/pdftools/pdfplumber_extractor.py @ allenai/vila#6 . They should provide similar functionality and be readily applicable to the layout-parser library as well. I will merge the PDF parsers into the layout-parser library soon.
Hi there,
from the docs I infer that detect() operates, for example, on PIL.Image objects. Is there a way to operate directly on already readable PDF files (which would obviate the need for applying OCR as well)?

Greetings