
Workflow Guide format conversion

Konstantin Baierer edited this page Sep 30, 2020 · 7 revisions

In this processing step, the PAGE XML files produced so far can be converted to ALTO, PDF, hOCR or plain-text files. ALTO and hOCR files can in turn be converted into other formats, whereas the PDF version of the PAGE XML OCR results is a widely accessible format that can be used as-is by experts and laypeople alike.

Available processors

Processor: ocrd-fileformat-transform

Parameter:
        {"from-to": "alto2.0 alto3.0"}
        # or {"from-to": "alto2.0 alto3.1"}
        # or {"from-to": "alto2.0 hocr"}
        # or {"from-to": "alto2.1 alto3.0"}
        # or {"from-to": "alto2.1 alto3.1"}
        # or {"from-to": "alto2.1 hocr"}
        # or {"from-to": "alto page"}
        # or {"from-to": "alto text"}
        # or {"from-to": "gcv hocr"}
        # or {"from-to": "hocr alto2.0"}
        # or {"from-to": "hocr alto2.1"}
        # or {"from-to": "hocr text"}
        # or {"from-to": "page alto"}
        # or {"from-to": "page hocr"}
        # or {"from-to": "page text"}

Remarks:

  • Because each value consists of two words, it has to be enclosed in quotation marks when passed with `-P`, e.g. `-P from-to "alto2.0 alto3.0"`.
  • If you want to save all OCR results in one file, you can use the following command: `cat OCR* > full.txt`

Call:

  ocrd-fileformat-transform -I OCR-D-OCR -O OCR-D-ALTO

Processor: ocrd-pagetopdf

Parameter:
      {
        # fix (invalid) negative coordinates
        "negative2zero": true,
        # create a single "fat" PDF
        "multipage": "name_of_pdf",
        # render text on this level
        "textequiv_level": "word",
        # draw polygon outlines in the PDF
        "outlines": "line"
      }

Call:

  ocrd-pagetopdf -I OCR-D-OCR -O OCR-D-PDF -P textequiv_level word
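
The quoting requirement for two-word `from-to` values comes from how the shell splits a command line into arguments. The following is a minimal sketch that does not call any OCR-D processor; `count_args` is a hypothetical helper introduced only to show how many arguments a command would receive in each case.

```shell
#!/bin/sh
# count_args is a hypothetical helper used only for this illustration;
# it prints how many separate arguments the shell hands to a command.
count_args() {
  echo "$#"
}

# Quoted: the two-word value stays a single argument (-P, from-to, value).
quoted=$(count_args -P from-to "alto2.0 alto3.0")

# Unquoted: the shell splits the value into two separate arguments.
unquoted=$(count_args -P from-to alto2.0 alto3.0)

echo "quoted: $quoted, unquoted: $unquoted"
```

With quotes, the processor sees the parameter name followed by one value; without them, the second word would arrive as an extra, unexpected argument.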

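The remark about collecting all OCR results in one file relies on the shell expanding a glob in sorted order. Below is a small sketch under stated assumptions: the directory name `OCR-D-TXT`, the file names and their contents are made up for illustration and stand in for whatever a `page text` conversion produced in your workspace.

```shell
#!/bin/sh
set -e

# Assumed layout: one text file per page in a directory named OCR-D-TXT.
# These names and contents are placeholders, not output of a real run.
workdir=$(mktemp -d)
mkdir "$workdir/OCR-D-TXT"
printf 'text of page 1\n' > "$workdir/OCR-D-TXT/OCR-D-TXT_0001.txt"
printf 'text of page 2\n' > "$workdir/OCR-D-TXT/OCR-D-TXT_0002.txt"

# The glob expands in sorted order, so pages are concatenated in sequence
# as long as the file names sort in page order. The result is written
# outside the globbed directory so it is never part of its own input.
cat "$workdir"/OCR-D-TXT/OCR-D-TXT_*.txt > "$workdir/full.txt"

cat "$workdir/full.txt"
```

Zero-padded page numbers in the file names keep the sort order and the page order identical.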
Notes on parameter usage

E.g.

  • which parameters do you use with what values?
  • which parameters are insufficiently documented?
  • which aspects of a processor should be parameterizable but are not?

Notes on document-specific usage

E.g. which processors worked best with what material? -- feel free to post sample images here, too.
