Apply detect() on readable PDF files #29
Thanks for pointing this out. PDF support is among our future plans. In the meantime, you could try to use the …
@lolipopshock thanks for the response! I thought about this workflow as well; however, I assumed this approach would ignore the fact that the PDF files have already been run through OCR (i.e., they are readable). It seems somewhat cumbersome to first convert a readable PDF to an image, only to then re-apply OCR...
You might want to check the following snippet:

```python
import pdfplumber
import layoutparser as lp

def obtain_word_tokens(cur_page: pdfplumber.page.Page) -> lp.Layout:
    # Extract word-level tokens (with font info) from a pdfplumber page
    words = cur_page.extract_words(
        x_tolerance=1.5,
        y_tolerance=3,
        keep_blank_chars=False,
        use_text_flow=True,
        horizontal_ltr=True,
        vertical_ttb=True,
        extra_attrs=["fontname", "size"],
    )
    # Wrap each word in a layoutparser TextBlock
    return lp.Layout([
        lp.TextBlock(
            lp.Rectangle(float(ele['x0']), float(ele['top']),
                         float(ele['x1']), float(ele['bottom'])),
            text=ele['text'],
            id=idx,
        )
        for idx, ele in enumerate(words)
    ])

plumber_pdf_object = pdfplumber.open(pdf_path)

all_pages_tokens = []
for page_id in range(len(plumber_pdf_object.pages)):
    cur_page = plumber_pdf_object.pages[page_id]
    tokens = obtain_word_tokens(cur_page)
    all_pages_tokens.append(tokens)
```
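Once you have per-page token layouts like `all_pages_tokens`, the remaining glue is assigning each token to a detected layout block. A minimal sketch in pure Python, using plain `(x1, y1, x2, y2)` tuples rather than layoutparser objects (both the helper names and the center-point heuristic are illustrative assumptions, not layoutparser APIs):

```python
def box_center(box):
    """Center point of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def assign_tokens_to_blocks(token_boxes, block_boxes):
    """Map each token index to the first block whose box contains
    the token's center point, or None if no block matches."""
    assignment = {}
    for t_idx, t_box in enumerate(token_boxes):
        cx, cy = box_center(t_box)
        assignment[t_idx] = None
        for b_idx, (bx1, by1, bx2, by2) in enumerate(block_boxes):
            if bx1 <= cx <= bx2 and by1 <= cy <= by2:
                assignment[t_idx] = b_idx
                break
    return assignment

tokens = [(10, 10, 20, 20), (100, 100, 120, 110)]
blocks = [(0, 0, 50, 50)]
print(assign_tokens_to_blocks(tokens, blocks))  # {0: 0, 1: None}
```

Note that this assumes the token boxes and block boxes share the same coordinate space, which is exactly the issue discussed further down in this thread.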
I am thinking of combining Poppler and this layout parser to extract structured paragraphs from readable PDFs, so that the advantages of both can be leveraged.
Hm, indeed that would help to extract the actual text, token by token. However, if I understood correctly, this approach would still require using something like the …

EDIT: I started implementing the following workflow:
This raised a new issue, however. That is, the coordinates (x1, x2, y1, y2) assigned by …
Hi there, …
Something like tabula (https://github.com/tabulapdf/tabula) might be helpful? Although it is geared toward tables, it does leverage the structure of the PDF to pull tables out.
May I ask how you solved this scale issue? I also get different x, y coordinates for the text from pdfplumber and the layout boxes from LayoutParser.

UPDATE: It looks like the default DPI of pdf2image is 200, while pdfplumber uses 72 DPI by default. So if you multiply (or divide, depending on your perspective) the coordinates by 200/72, it seems to solve the issue.
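The DPI conversion described above can be sketched as a small helper (`scale_box` is a hypothetical name; the 200 and 72 defaults come from pdf2image and pdfplumber as noted above):

```python
def scale_box(box, from_dpi=200, to_dpi=72):
    """Rescale an (x1, y1, x2, y2) box from one DPI to another."""
    scale = to_dpi / from_dpi
    x1, y1, x2, y2 = box
    return (x1 * scale, y1 * scale, x2 * scale, y2 * scale)

# A box detected on a 200-DPI page image, mapped back to PDF (72-DPI) space:
pdf_box = scale_box((200.0, 100.0, 400.0, 300.0))
```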
Hi @gevezex,
I must admit that I simply dropped the project for now. I found the workflow far too tedious, and I hope the team will add support for already-readable PDF files in the future.
Solved it like this with PyMuPDF (pip install pymupdf). I hope it can help someone with the same issue.

```python
# Function for rescaling (x, y) coordinates from the 200-DPI page image
# back to the 72-DPI PDF coordinate space
def scale_xy(textblock, scale=72/200):
    x1 = textblock.block.x_1 * scale
    y1 = textblock.block.y_1 * scale
    x2 = textblock.block.x_2 * scale
    y2 = textblock.block.y_2 * scale
    return (x1, y1, x2, y2)

# Using PyMuPDF for retrieving text in a bounding box
import fitz  # this is pymupdf

# Function for retrieving the tokens (words). See the PyMuPDF utilities.
def make_text(words):
    """Return textstring output of get_text("words").
    Word items are sorted for reading sequence left to right,
    top to bottom.
    """
    line_dict = {}  # key: vertical coordinate, value: list of words
    words.sort(key=lambda w: w[0])  # sort by horizontal coordinate
    for w in words:  # fill the line dictionary
        y1 = round(w[3], 1)  # bottom of a word: don't be too picky!
        word = w[4]  # the text of the word
        line = line_dict.get(y1, [])  # read current line content
        line.append(word)  # append new word
        line_dict[y1] = line  # write back to dict
    lines = list(line_dict.items())
    lines.sort()  # sort vertically
    return "\n".join([" ".join(line[1]) for line in lines])

# Open your PDF in PyMuPDF
pdf_doc = fitz.open('/location/to/your/file.pdf')
pdf_page4 = pdf_doc[3]  # this will retrieve, for example, page 4
words = pdf_page4.get_text("words")

# Get one of the TextBlocks detected by your LayoutParser model.
# In the docs it was called "layout", so we will use that name.
# First recognized bounding box: layout[0]
# When I print it for my PDF, the output of layout[0] looks like this:
# TextBlock(block=Rectangle(x_1=104.882, y_1=133.696, x_2=124.79, y_2=147.696), text=Het, id=0, type=None, parent=None, next=None, score=None)

# Rescale the coordinates
new_coordinates = scale_xy(layout[0])

# Create a Rect object for fitz (similar to TextBlock for the bounding box coordinates)
rect = fitz.Rect(*new_coordinates)

# Now we can find and print all the tokens in the bounding box:
mywords = [w for w in words if fitz.Rect(w[:4]).intersects(rect)]
print("\nSelect the words intersecting the rectangle")
print("-------------------------------------------")
print(make_text(mywords))
```

Sorry for the confusion of terminologies. I am still learning PDF-related stuff.
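The intersection filtering in the answer above can be reasoned about without fitz: two axis-aligned boxes overlap exactly when they overlap on both the x and y axes. A pure-Python sketch with synthetic word tuples (the helper name `rects_intersect` is illustrative, not a PyMuPDF API, and its edge-touching behavior may differ slightly from `fitz.Rect.intersects`):

```python
def rects_intersect(a, b):
    """True if two (x0, y0, x1, y1) axis-aligned boxes overlap."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

# Synthetic PyMuPDF-style word tuples: (x0, y0, x1, y1, text)
words = [(0, 0, 10, 10, "Het"), (50, 50, 60, 60, "einde")]
rect = (5, 5, 20, 20)  # a rescaled layout box
inside = [w[4] for w in words if rects_intersect(w[:4], rect)]
print(inside)  # ['Het']
```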
You might also want to refer to the PDF parsers that I've implemented in another project recently -- https://github.com/allenai/VILA/blob/master/src/vila/pdftools/pdfplumber_extractor.py @ allenai/vila#6 . They should provide similar functionality and be readily applicable to the layout-parser library as well. I will merge the PDF parsers into the layout-parser library soon.
Hi there,
from the docs I infer that detect() operates, for example, on PIL.Image objects. Is there a way to operate directly on already readable PDF files (which would obviate the need for applying OCR as well)?

Greetings