Create corpora for benchmarking #130

Open
3 tasks done
mweidling opened this issue Sep 13, 2022 · 5 comments
@mweidling
Collaborator

mweidling commented Sep 13, 2022

To run the benchmarking, we need data with different characteristics to work on.
@mweidling has already examined the OCR-D GT repository and wants to discuss useful corpora with @tboenig and @cneud.

TODOs:

  • schedule a meeting for discussion
  • create the workspaces
  • provide them in a separate repo (QUIVER assets)
@mweidling mweidling self-assigned this Sep 13, 2022
@mweidling
Collaborator Author

This is a first naive overview of my GT categorization: gt_overview.ods.

If this isn't of much use, I'll take a deeper look.

@mweidling
Collaborator Author

mweidling commented Sep 19, 2022

Here is a second, reviewed version of the sheet:

gt_overview.ods

EDIT: Replaced all instances of schwabacher with fraktur (20.09.22).

@mweidling
Collaborator Author

First draft for the corpora

General thoughts

  • We focus on Ground Truth (GT) that encompasses both layout and text information. Although we have lots of GT that covers page structure only, GT that has both lets us measure both layout and text detection quality.
  • Our first and most important category should be the century in which a work was created, since we aim for VD corpora.
  • Fraktur vs. Schwabacher won't be a separate category, since our models have been trained on Fraktur fonts in general (which encompass Schwabacher).

Categories

16th century, fraktur, simple layout

  • kistler_kraeuter_1500.ocrd + trota_mordtbrenner_1540.ocrd + luther_auszlegunge_1520.ocrd

16th century, fraktur, complex layout

  • two-columned: luther_babstum_1526.ocrd
  • hand-written additions, stamps: petrarca_psalmi_1506.ocrd + nn_lied_1520.ocrd
  • partly tabular-like structures, stamps: aventinus_grammatica_1515.ocrd
  • labelled illustration, initial, stamps: nn_historia_1500.ocrd

16th century, antiqua, simple layout

  • heyden_paedono_1548.ocrd

16th century, antiqua, complex layout

  • marginal notes [both printed and hand-written], initial: alberti_pictura_1540.ocrd

16th century, font mix, simple layout

  • -/-

16th century, font mix, complex layout

  • -/-

17th century, fraktur, simple layout

  • calvi_beutelschneider01_1627.ocrd

17th century, fraktur, complex layout

  • musical notation: silesius_seelenlust01_1657.ocrd
  • hand-written additions, with title page: huebner_handbuch_1696.ocrd

17th century, antiqua, simple layout

  • -/-

17th century, antiqua, complex layout

  • -/-

17th century, font mix, simple layout

fraktur, antiqua

  • rollenhagen_reysen_1603.ocrd + loeber_heuschrecken_1693.ocrd + bohse_helicon_1696.ocrd (feat. initials, with title page, colour chart)

fraktur, antiqua, ancient Greek, Hebrew

  • weigel_gnothi02_1618.ocrd + dannhauer_catechismus10_1673.ocrd

17th century, font mix, complex layout

fraktur, antiqua

  • partly two-columned, initials: glauber_opera01_1658.ocrd
  • partly two-columned, initials, marginal notes: arnold_ketzerhistorie01_1699.ocrd
  • tabular-like layout: lohenstein_agrippina_1665.ocrd
  • marginal notes, initials: meyfart_rhetorica_1634.ocrd
  • initials, hand-written additions: valentinus_occulta_1603.ocrd

18th century, fraktur, simple layout

  • lessing_menschengeschlecht_1780.ocrd

18th century, fraktur, complex layout

  • marginal notes: justi_abhandlung01_1758.ocrd + estor_rechtsgelehrsamkeit02_1758.ocrd
  • partly two-columned: buerger_gedichte_1778.ocrd
  • tables, mathematics: euler_rechenkunst01_1738.ocrd
  • handwritten additions: nn_besuch_1780.ocrd
  • with title page, stamps: luz_blitz_1784.ocrd + bernd_lebensbeschreibung_1738.ocrd

18th century, antiqua, simple layout

  • ballenstedt_delatio_1777.ocrd

18th century, antiqua, complex layout

  • -/-

18th century, font mix, simple layout

  • -/-

18th century, font mix, complex layout

  • partly two-columned, initial: benner_herrnhuterey04_1748.ocrd

19th century, antiqua [1]

  • blumenbach_anatomie_1805.ocrd

19th century, fraktur [1]

  • arnimb_goethe03_1835.ocrd

[1] We only have two works with text GT for the 19th century, blumenbach_anatomie_1805.ocrd and arnimb_goethe03_1835.ocrd. Since the 19th century isn't part of our scope, we'll limit ourselves to the material we already have.
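
The category scheme used in this draft (century, then font, then layout) for the 16th to 18th centuries can be enumerated mechanically, which makes it easy to see which slots are still empty (`-/-`). The following is only a minimal sketch: the constant and function names are illustrative and not part of any QUIVER code, and the 19th century is left out because the draft only splits it by font.

```python
from itertools import product

# Dimensions taken from the draft above; the names are illustrative only.
CENTURIES = ("16th", "17th", "18th")
FONTS = ("fraktur", "antiqua", "font mix")
LAYOUTS = ("simple", "complex")


def category_grid():
    """Return all century/font/layout combinations in the order used in the draft."""
    return [
        f"{century} century, {font}, {layout} layout"
        for century, font, layout in product(CENTURIES, FONTS, LAYOUTS)
    ]


categories = category_grid()
for name in categories:
    print(name)
```

Because `product` varies the last dimension fastest, the output follows the same order as the headings in the draft.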

@mweidling
Collaborator Author

mweidling commented Sep 21, 2022

Creating the simple cases

Categories

16th century, fraktur, simple layout

  • kistler_kraeuter_1500.ocrd + trota_mordtbrenner_1540.ocrd + luther_auszlegunge_1520.ocrd
    • images with licenses
    • workspace
    • GT

16th century, antiqua, simple layout

  • heyden_paedono_1548.ocrd
    • images with licenses
    • workspace
    • GT

16th century, antiqua, complex layout

  • marginal notes [both printed and hand-written], initial: alberti_pictura_1540.ocrd
    • images with licenses
    • workspace
    • GT

17th century, fraktur, simple layout

  • calvi_beutelschneider01_1627.ocrd
    • images with licenses
    • workspace
    • GT

17th century, font mix, simple layout

fraktur, antiqua

  • rollenhagen_reysen_1603.ocrd + loeber_heuschrecken_1693.ocrd + bohse_helicon_1696.ocrd (feat. initials, with title page, colour chart)
    • images with licenses
    • workspace
    • GT

fraktur, antiqua, ancient Greek, Hebrew

  • weigel_gnothi02_1618.ocrd + dannhauer_catechismus10_1673.ocrd
    • images with licenses
    • workspace
    • GT

18th century, fraktur, simple layout

  • lessing_menschengeschlecht_1780.ocrd
    • images with licenses
    • workspace
    • GT

18th century, antiqua, simple layout

  • ballenstedt_delatio_1777.ocrd
    • images with licenses
    • workspace
    • GT

18th century, font mix, complex layout

  • partly two-columned, initial: benner_herrnhuterey04_1748.ocrd
    • images with licenses
    • workspace
    • GT

19th century, antiqua

  • blumenbach_anatomie_1805.ocrd
    • images with licenses
    • workspace
    • GT

19th century, fraktur

  • arnimb_goethe03_1835.ocrd
    • images with licenses
    • workspace
    • GT
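
The workspace names listed above appear to end in a four-digit year followed by `.ocrd`. Assuming that holds, the century category can be derived from the name instead of being assigned by hand. This is a rough sketch under that assumption; `NAME_PATTERN` and `century_of` are hypothetical helpers. Note that it follows the grouping used in this thread, where a work dated 1500 is counted as 16th century.

```python
import re

# Assumed naming convention: <anything>_<four-digit year>.ocrd
NAME_PATTERN = re.compile(r"^(?P<stem>.+)_(?P<year>\d{4})\.ocrd$")


def century_of(workspace_name: str) -> int:
    """Derive the century from the year in a GT workspace name.

    Uses year // 100 + 1, so 1500 maps to 16, matching the grouping in this thread.
    """
    match = NAME_PATTERN.match(workspace_name)
    if match is None:
        raise ValueError(f"unexpected workspace name: {workspace_name!r}")
    return int(match.group("year")) // 100 + 1


for name in (
    "kistler_kraeuter_1500.ocrd",
    "calvi_beutelschneider01_1627.ocrd",
    "arnimb_goethe03_1835.ocrd",
):
    print(name, century_of(name))
```

Font and layout still have to be assigned manually, since neither is encoded in the name.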

@mweidling
Collaborator Author

The data is now available at https://github.com/OCR-D/quiver-data.git.
