Commit

Merge pull request #2733 from sciencehistory/fix_pdf_ocr_extraction
Fix and improve PDF text extraction hocr generation
eddierubeiz committed Sep 4, 2024
2 parents 813c00d + e8ba032 commit f30dc8b
Showing 1 changed file with 15 additions and 4 deletions.
19 changes: 15 additions & 4 deletions app/services/pdf_to_page_images.rb
@@ -27,7 +27,8 @@ class PdfToPageImages
   attr_reader :pdf_file_path, :dpi

   # @param pdf_asset [Asset] asset holding a pdf
-  # @param dpi [Integer] dpi we are extracting page image at, defaults to 300
+  # @param dpi [Integer] dpi we are extracting page image at, defaults to 300. We need to make sure we target
+  # consistent DPI in image and hocr, so they match!
   def initialize(pdf_file_path, dpi: DEFAULT_TARGET_DPI)
     @pdf_file_path = pdf_file_path
     @dpi = dpi
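
The comment above says the page image and the hOCR must target the same DPI. A minimal sketch of why, assuming the bounding boxes from `pdftotext -bbox-layout` are in PDF points (72 per inch) and have to be scaled to pixel coordinates of the rendered page image. The helper names `points_to_pixels` and `scale_bbox` are hypothetical and are not taken from `PopplerBboxToHocr`, whose implementation this commit does not show:

```ruby
# Assumption: coordinates arrive in PDF points, 72 points per inch.
PDF_POINTS_PER_INCH = 72.0

# Hypothetical helper: convert one coordinate from points to pixels
# for an image rendered at the given dpi.
def points_to_pixels(value_in_points, dpi:)
  (value_in_points * dpi / PDF_POINTS_PER_INCH).round
end

# Hypothetical helper: scale a bounding box [x_min, y_min, x_max, y_max].
def scale_bbox(bbox_in_points, dpi:)
  bbox_in_points.map { |value| points_to_pixels(value, dpi: dpi) }
end

# A word one inch from the left edge and half an inch from the top,
# scaled for a 300 dpi page image.
p scale_bbox([72.0, 36.0, 144.0, 54.0], dpi: 300)
# prints [300, 150, 600, 225]
```

If the image were rendered at 300 dpi but the boxes scaled for 150 dpi, every box would land at half the intended offset, which is the mismatch the comment warns about.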
@@ -144,23 +145,33 @@ def extract_jpeg_for_page(page_num)
   def extract_hocr_for_page(page_num)
     page_num_arg_check!(page_num)

-    poppler_bbox_layout_out, err = TTY::Command.new(printer: :null).run(
+    args = [
       pdftotext_command,
       "-bbox-layout",
       pdf_file_path,
       # this tool uses 1-based page numbers
       "-f", page_num,
       "-l", page_num,
       "-" # stdout output
-    )
+    ]
+
+    poppler_bbox_layout_out, err = TTY::Command.new(printer: :null).run( *args)
+

     # if there are no actual words, this still gives us HTML skeleton back, but with
     # nothing in it... just return nil, don't return an empty hocr
     unless poppler_bbox_layout_out.include?("<word")
       return nil
     end

-    return PopplerBboxToHocr.new(poppler_bbox_layout_out).transformed_to_hocr
+    meta_tags = {
+      "pdftotext-command" => args.join(" "),
+      "pdftotext-version" => `#{pdftotext_command} -v 2>&1`,
+      "pdftotext-conversion" => "converted from pdftotext to hocr by ScihistDigicoll app PopplerBboxToHocr class",
+      "pdftotext-generation-date" => DateTime.now.iso8601
+    }
+
+    return PopplerBboxToHocr.new(poppler_bbox_layout_out, dpi: dpi, meta_tags: meta_tags).transformed_to_hocr
   end

   def num_pdf_pages
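
The new `meta_tags` hash records how the hOCR was produced. One way such a hash could end up in the hOCR `<head>` is sketched below. This is an assumption for illustration only: the method name `meta_tags_to_html` is hypothetical, and the commit does not show how `PopplerBboxToHocr` actually consumes `meta_tags`.

```ruby
require "erb"

# Hypothetical helper: render a name => content hash as <meta> elements.
# Values are stripped and HTML-escaped, since command output such as a
# version banner may span lines or contain characters like "&".
def meta_tags_to_html(meta_tags)
  meta_tags.map do |name, content|
    escaped_name = ERB::Util.html_escape(name)
    escaped_content = ERB::Util.html_escape(content.to_s.strip)
    %(<meta name="#{escaped_name}" content="#{escaped_content}" />)
  end.join("\n")
end

puts meta_tags_to_html(
  "pdftotext-command" => "pdftotext -bbox-layout in.pdf -f 1 -l 1 -"
)
# prints <meta name="pdftotext-command" content="pdftotext -bbox-layout in.pdf -f 1 -l 1 -" />
```

Escaping is the main design point here: the values come from shell output and file paths, so they should not be interpolated into markup unescaped.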
