Comparison of OCR formats

Philipp Zumstein edited this page Sep 28, 2019 · 1 revision

Format Descriptions

hOCR

Version Released Specs Schema Samples
1.0 December 2007 - -
1.1 March 2010 -

Smallest unit: word

<span
  class="ocrx_word"
  id="word_1_33"
  title="bbox 1584 1199 1997 1284; x_wconf 87"
  lang="deu-frak"
  dir="ltr"
  >Verhältnisse.</span>

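The `title` attribute in the example above packs several properties into one string, separated by semicolons. A minimal sketch of how it could be unpacked, assuming no property value itself contains a semicolon; the helper name `parse_hocr_title` is hypothetical and not part of any hOCR library:

```python
def parse_hocr_title(title):
    """Split an hOCR title attribute into a {property: [values]} dict.

    Simplification: assumes no value contains a ';' (quoted file names
    in an 'image' property could violate this).
    """
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


title = "bbox 1584 1199 1997 1284; x_wconf 87"
props = parse_hocr_title(title)

# bbox is given as left, top, right, bottom in pixels
bbox = tuple(int(v) for v in props["bbox"])
# x_wconf is the word confidence on a 0..100 scale
wconf = float(props["x_wconf"][0])

print(bbox, wconf)
```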
ALTO

Version Released Specs Schema Samples
1.0 December 02, 2004 - XSD -
2.0 January 11, 2010 - XSD
2.1 February 20, 2014 - XSD
3.0 August, 2014 - XSD -
3.1 January, 2016 - XSD -

ABBYY FineReader XML

Version Released Specs Schema Samples
6v1 2002? - XSD - -
8v2 2006? - XSD - -
9v1 2007? - XSD - -
10v1 2011? XSD

Comparison Tables

Typographic levels

Page
hOCR: <div class="ocr_page">
ALTO: <Page>
ABBYY: <page>

Text Area / Column
hOCR: <div class="ocr_carea"> or <div class="ocrx_block">
ALTO: <PrintSpace>
ABBYY: (not listed)

Paragraph
hOCR: <div class="ocr_par">
ALTO: <TextBlock STYLEREFS="...">
ABBYY: (not listed)

Text Line
hOCR: <div class="ocr_line">
ALTO: <TextLine>
ABBYY:
<line>
  <formatting>...</formatting>
</line>

Word
hOCR: <span class="ocrx_word">
ALTO: <String>
ABBYY:
<line>
  <formatting>...</formatting>
</line>

Bounding Boxes

hOCR:
<div title="bbox 100 200 150 250"/>

ALTO:
<String
  HEIGHT="250"
  WIDTH="150"
  VPOS="100"
  HPOS="200"/>

ABBYY:
<line
  l="200"
  t="100"
  r="1200"
  b="130">
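The three formats describe the same rectangle in different ways: hOCR gives left, top, right, bottom after `bbox`; ALTO gives an origin (`HPOS`, `VPOS`) plus `WIDTH` and `HEIGHT`; ABBYY gives the four edges as `l`, `t`, `r`, `b`. The following sketch converts between them, assuming all values are in the same pixel units (ALTO files may declare other measurement units, which this does not handle). The function names are hypothetical.

```python
def hocr_to_alto(bbox):
    """hOCR (x0, y0, x1, y1) -> ALTO origin plus size."""
    x0, y0, x1, y1 = bbox
    return {"HPOS": x0, "VPOS": y0, "WIDTH": x1 - x0, "HEIGHT": y1 - y0}


def alto_to_abbyy(alto):
    """ALTO origin plus size -> ABBYY edge coordinates."""
    return {
        "l": alto["HPOS"],
        "t": alto["VPOS"],
        "r": alto["HPOS"] + alto["WIDTH"],
        "b": alto["VPOS"] + alto["HEIGHT"],
    }


def abbyy_to_hocr(box):
    """ABBYY edge coordinates -> hOCR (x0, y0, x1, y1)."""
    return (box["l"], box["t"], box["r"], box["b"])


# Round trip using the word bbox from the hOCR sample earlier on this page
hocr_bbox = (1584, 1199, 1997, 1284)
alto = hocr_to_alto(hocr_bbox)
abbyy = alto_to_abbyy(alto)
print(alto)
print(abbyy)
```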

Hyphenation

hOCR:
&shy;
Soft hyphens must be represented using the HTML &shy; entity.[1]
Regular hyphenation characters are just dashes.

ALTO:
<HYP/>
[1]

ABBYY: (not listed)
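When text is extracted from hOCR, the soft-hyphen entity decodes to the character U+00AD. A minimal sketch of rejoining words split across lines on that basis, assuming each input string is one already-decoded text line; `join_lines` is a hypothetical helper, and regular dashes at line ends are deliberately left untouched because they cannot be told apart from real hyphens without a dictionary:

```python
SOFT_HYPHEN = "\u00ad"  # the character the HTML soft-hyphen entity decodes to


def join_lines(lines):
    """Join text lines, merging words split by a trailing soft hyphen."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if text.endswith(SOFT_HYPHEN):
            # Drop the soft hyphen and glue the word halves together
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


joined = join_lines(["Die Verhält\u00ad", "nisse sind", "gut."])
print(joined)
```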

Confidence values

Page
hOCR: -
ALTO: <Page PC="0.743"> [1]
ABBYY: -

Word
hOCR:
<span
  class="ocrx_word"
  title="x_wconf 71">foo</span>
"if possible, convert word confidences to values between 0 and 100 and have them approximate posterior probabilities (expressed in %)"[1]
ALTO:
<String WC="0.422">
"Word Confidence: Confidence level of the ocr for this string. A value between 0 (unsure) and 1 (sure)."[1]
ABBYY: -

Character
hOCR:
<span
  class="ocrx_word"
  title="x_wconf 71">foo</span>
"if possible, convert word confidences to values between 0 and 100 and have them approximate posterior probabilities (expressed in %)"[1]
Not implemented in common engines?
ALTO:
<String CC="0 0 4 0" CONTENT="luft"/>
"Confidence level of each character in that string. A list of numbers, one number between 0 (sure) and 9 (unsure) for each character."[1]
ABBYY: (not listed)

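The scales quoted above differ in both range and direction: hOCR `x_wconf` runs from 0 to 100, ALTO `WC` from 0 (unsure) to 1 (sure), and ALTO `CC` uses one digit per character from 0 (sure) to 9 (unsure). A sketch of mapping all three onto a common 0..1 scale where 1 means sure; the linear mapping for `CC` is an assumption, since the quoted definition does not say how the digits relate to probabilities, and the function names are hypothetical:

```python
def hocr_wconf_to_unit(x_wconf):
    """hOCR x_wconf (0..100, higher is surer) -> 0..1."""
    return float(x_wconf) / 100.0


def alto_wc_to_unit(wc):
    """ALTO WC is already 0 (unsure) .. 1 (sure)."""
    return float(wc)


def alto_cc_to_unit(cc):
    """ALTO CC string, one digit per character, 0 (sure) .. 9 (unsure).

    Assumption: a linear, inverted mapping onto 0..1.
    """
    return [1.0 - int(d) / 9.0 for d in cc.split()]


word_hocr = hocr_wconf_to_unit("71")
word_alto = alto_wc_to_unit("0.422")
chars_alto = alto_cc_to_unit("0 0 4 0")  # per-character values for "luft"
print(word_hocr, word_alto, chars_alto)
```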
Links