Commit

Merge branch 'HTR-United:master' into master
matgille authored Sep 13, 2024
2 parents 94e1455 + 29ebd52 commit d757787
Showing 91 changed files with 36,617 additions and 4,302 deletions.
9 changes: 6 additions & 3 deletions .github/workflows/Catalog.yaml
@@ -7,28 +7,31 @@ on:
branches:
- master
workflow_dispatch: #Allows for manual triggering
schedule:
- cron: "0 23 * * 0"
jobs:
catalog:
runs-on: ubuntu-latest
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
- name: Set up Python 3.10
uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: "3.11"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install htruc
- name: Run HTRUC
run: |
htruc make ./catalog --access_token ${{ secrets. GITHUB_TOKEN }} --graph-csv data.csv --statistics statistics.csv --output htr-united.yml --graph graph.png --json catalog.json --ids catalog-ids.json --organization htr-united --organization gallicorpora --organization fondue-htr
htruc make ./catalog --access_token ${{ secrets. GITHUB_TOKEN }} --graph-csv data.csv --statistics statistics.csv --output htr-united.yml --graph graph.png --json catalog.json --ids catalog-ids.json --check-link --no-remote
- name: Commit files
run: |
git config --local user.email "41898282+github-actions[bot]@users.noreply.github.com"
git config --local user.name "github-actions[bot]"
python3 spid.py
git add htr-united.yml graph.png statistics.csv catalog.json
git commit -m "[Automatic] Update of the Catalog" || echo "Nothing to commit"
git push || echo "Nothing to push"
8 changes: 4 additions & 4 deletions README.md
@@ -10,7 +10,7 @@ HTR-United

## What is HTR-United

HTR-United is a Github organization without any other form of legal personality. It aims at **gathering HTR/OCR transcriptions of all periods and style of writing, mostly but not exclusively in French**. It was born from the mere necessity -for projects- to possess potentiel ground truth to rapidly train models on smaller corpora.
HTR-United is a Github organization without any other form of legal personality. It aims at **gathering HTR/OCR transcriptions of all periods and style of writing, mostly but not exclusively in French**. It was born from the mere necessity -for projects- to possess potential ground truth to rapidly train models on smaller corpora.

## What is shared?

@@ -19,7 +19,7 @@ Datasets shared or referenced with HTR-United must, at minimum, take the form of
- an ensemble of corresponding images. They can be shared in the form of a simple permalink to ressources hosted somewhere else, or can be the contact information necessary to request access to the images. It must be possible to recompose the link between the XML files and the image without any intermediary process such as changing the images' names;
- a documentation on the context in which the dataset was produced and the rules followed to segment and transcribed the documents. For Github repositories, this documentation is usually presented in the README.

A corpus can be sub-diveded into smaller ensembles if it seems necessary.
A corpus can be sub-divided into smaller ensembles if it seems necessary.

If you need help to compose your repository, you can check [our template](https://github.com/HTR-United/template-htr-united-datarepo)!

@@ -36,7 +36,7 @@ There are two cases:

### You already have data in a repository

It's rather convinient: you stay in control, and there's no issue with joining the organization. However, if you want your dataset to gain visibility, it seems important to us that you describe it here. In deed, if you take benefit from data or models provided by HTR-United, you may as well contribute!
It's rather convenient: you stay in control, and there's no issue with joining the organization. However, if you want your dataset to gain visibility, it seems important to us that you describe it here. Indeed, if you take benefit from data or models provided by HTR-United, you may as well contribute!

To do so, you just need to [open an issue](https://github.com/HTR-United/htr-united/issues/new) or request an update on the deposit repository by adding a YAML file generated with [our form](https://htr-united.github.io/document-your-data.html), presented as follows:

@@ -86,7 +86,7 @@ Well, we'll be happy to get help from you. [Open an issue here](https://github.c

## Overview

You can browse the content of the catalog from out website: [here](https://htr-united.github.io/catalog.html).
You can browse the content of the catalog from our website: [here](https://htr-united.github.io/catalog.html).

Here is an overview of the periods covered by the datasets documented in HTR-United's catalog!

1 change: 1 addition & 0 deletions catalog-ids.json
@@ -0,0 +1 @@
{"https://doi.org/10.5281/zenodo.5153263": "repo-00000", "https://zenodo.org/record/4780947#.YhN5pVvMLUQ": "repo-00001", "https://github.com/calfa-co/rasam-dataset": "repo-00002", "https://github.com/DesenrollandoElCordel/FoNDUE-Spanish-chapbooks-Dataset": "repo-00003", "https://zenodo.org/record/3333627#.YhN1G1vMLUQ": "repo-00004", "https://github.com/rescribe/carolineminuscule-groundtruth": "repo-00005", "http://dx.doi.org/10.34847/nkl.acb724xs": "repo-00006", "https://github.com/e-ditiones/OCR17plus": "repo-00007", "https://github.com/PSL-Chartes-HTR-Students/TNAH-2021-Projet-Notre-Dame": "repo-00008", "https://github.com/PSL-Chartes-HTR-Students/TNAH-2021-ArgusDesBrevets": "repo-00009", "https://github.com/PSL-Chartes-HTR-Students/TNAH-2021-DecameronFR": "repo-00010", "https://github.com/PSL-Chartes-HTR-Students/HN2021-Kovalewsky-1893": "repo-00011", "https://github.com/PSL-Chartes-HTR-Students/HN2021-ChateauChavigny": "repo-00012", "https://github.com/PSL-Chartes-HTR-Students/HN2021-Boccace": "repo-00013", "https://github.com/PSL-Chartes-HTR-Students/HN2021-Memorials_Jane_Lathrop_Stanford": "repo-00014", "https://github.com/PSL-Chartes-HTR-Students/TNAH-2021-Expositions_Universelles": "repo-00015", "https://github.com/PSL-Chartes-HTR-Students/TNAH-2021-Projet-Correspondance-Berlioz": "repo-00016", "https://github.com/jpmjpmjpm/genauto-td-htr.git": "repo-00017", "https://doi.org/10.5281/zenodo.5179361": "repo-00018", "HTR-United/tapuscorpus": "repo-00019", "HTR-United/timeuscorpus": "repo-00020", "HTR-United/dahncorpus": "repo-00021", "HTR-United/cremma-medieval": "repo-00022", "HTR-United/cremma-16-17-print": "repo-00023", "HTR-United/CREMMA-Medieval-LAT": "repo-00024", "HTR-United/CREMMA-MSS-17": "repo-00025", "HTR-United/CREMMA-MSS-18": "repo-00026", "HTR-United/CREMMA-MSS-19": "repo-00027", "HTR-United/CREMMA-MSS-20": "repo-00028", "HTR-United/lectaurep-bronod": "repo-00029", "HTR-United/lectaurep-mariages-et-divorces": "repo-00030", 
"HTR-United/lectaurep-repertoires": "repo-00031", "HTR-United/CREMMA-AN-TestamentDePoilus": "repo-00032", "HTR-United/cremma-wikipedia": "repo-00033", "Gallicorpora/HTR-MSS-15e-Siecle": "repo-00034", "Gallicorpora/HTR-incunable-15e-siecle": "repo-00035", "Gallicorpora/HTR-imprime-16e-siecle": "repo-00036", "Gallicorpora/HTR-imprime-17e-siecle": "repo-00037", "Gallicorpora/HTR-imprime-gothique-16e-siecle": "repo-00038", "Gallicorpora/HTR-imprime-18e-siecle": "repo-00039", "FoNDUE-HTR/FONDUE-FR-PRINT-17": "repo-00040", "FoNDUE-HTR/FONDUE-FR-PRINT-16": "repo-00041"}
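
The added line in `catalog-ids.json` appears to be a flat JSON object that maps a repository URL, DOI, or `owner/name` slug to a `repo-NNNNN` identifier. Assuming that reading is right, a lookup could be sketched as follows. The helper name `find_repo_id` and the three-entry sample are illustrative only and are not part of htruc.

```python
import json

# Three entries copied from the catalog-ids.json line above; the real file has more.
SAMPLE = """{
  "https://doi.org/10.5281/zenodo.5153263": "repo-00000",
  "https://github.com/calfa-co/rasam-dataset": "repo-00002",
  "HTR-United/cremma-medieval": "repo-00022"
}"""


def find_repo_id(ids, key):
    """Return the catalog identifier for a URL or slug, or None if unknown.

    Hypothetical helper: it only assumes the file is a flat key -> id mapping.
    """
    return ids.get(key)


ids = json.loads(SAMPLE)
print(find_repo_id(ids, "HTR-United/cremma-medieval"))
print(find_repo_id(ids, "https://example.org/unknown"))
```

Note that the keys are not normalized: some are DOIs, some are full GitHub URLs, and some are bare slugs, so a lookup has to use exactly the form stored in the file.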
15,179 changes: 15,178 additions & 1 deletion catalog.json

Large diffs are not rendered by default.

60 changes: 60 additions & 0 deletions catalog/ajmc/ajmc-layout.yml
@@ -0,0 +1,60 @@
schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: 'GT4HistCommentLayout: Layout Ground Truth for Historical Commentaries'
url: https://github.com/AjaxMultiCommentary/GT-commentaries-OLR
authors:
- name: Matteo
surname: Romanello
orcid: 0000-0002-7406-6286
roles:
- project-manager
- name: Sven
surname: Najem-Meyer
orcid: 0000-0002-3661-4579
roles:
- transcriber
- quality-control
- name: Carla
surname: Amaya
roles:
- transcriber
description: 'This dataset contains layout annotations for ca. 370 pages sampled from
8 public domain classical commentaries, published in the 19th century in English,
German and Latin. The commentaries concern Ancient Greek and Latin works from prose
and poetry (caveat: AGreek poetry is slightly over-represented). Pages were annotated
according to a taxonomy mapped to the SegmOnto controlled vocabulary.'
project-name: Ajax Multi-Commentary
project-website: https://mromanello.github.io/ajax-multi-commentary/
language:
- eng
- deu
- lat
- grc
production-software: Kraken + VGG Image Annotator (VIA)
script:
- iso: Latn
- iso: Grek
script-type: only-typed
time:
notBefore: '1835'
notAfter: '1903'
hands:
count: '1'
precision: exact
license:
- name: CC-BY 4.0
url: https://creativecommons.org/licenses/by/4.0/
format: Alto-XML
volume:
- metric: characters
count: 0
- metric: files
count: 371
- metric: lines
count: 0
- metric: regions
count: 2386
transcription-guidelines: SegmOnto guidelines (v. 0.9)
citation-file-link: https://github.com/AjaxMultiCommentary/GT-commentaries-layout/blob/master/CITATION.cff
characters:
mode: NFD
members: []
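
The `volume` block of this entry reports zero characters and zero lines, which is consistent with a layout-only dataset (regions without transcription). A minimal sketch of how such a block might be inspected is given below. The YAML list is rewritten as a Python literal so that no YAML parser is needed, and `volume_by_metric` is a hypothetical helper, not an htruc function.

```python
# The `volume` list from ajmc-layout.yml, written as a Python literal
# so the sketch needs no YAML parser.
volume = [
    {"metric": "characters", "count": 0},
    {"metric": "files", "count": 371},
    {"metric": "lines", "count": 0},
    {"metric": "regions", "count": 2386},
]


def volume_by_metric(entries):
    """Index a catalog `volume` list by metric name (hypothetical helper)."""
    return {entry["metric"]: entry["count"] for entry in entries}


counts = volume_by_metric(volume)
# Metrics reported as zero, e.g. for datasets that carry layout but no text.
empty_metrics = sorted(name for name, count in counts.items() if count == 0)
print(counts["regions"], empty_metrics)
```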
125 changes: 125 additions & 0 deletions catalog/alix-tz/moonshines.yml
@@ -0,0 +1,125 @@
schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: Moonshines
url: https://github.com/alix-tz/moonshines
authors:
- name: Alix
surname: "Chagu\xE9"
orcid: 0000-0002-0136-4434
roles:
- transcriber
- aligner
- project-manager
- digitization
institutions: []
description: This dataset is composed of pages of text written in 2023 by a single
person, copying texts taken from Guillaume Apollinaire's poems published in Alcools,
and taken from Guillaume Apollinaire's Wikipedia page.
language:
- fra
production-software: eScriptorium + Kraken
script:
- iso: Latn
script-type: only-manuscript
time:
notBefore: '2023'
notAfter: '2023'
hands:
count: '1'
precision: exact
license:
- name: CC-BY 4.0
url: https://creativecommons.org/licenses/by/4.0/
format: Alto-XML
volume:
- metric: characters
count: 27734
- metric: files
count: 45
- metric: lines
count: 1016
- metric: regions
count: 45
citation-file-link: https://github.com/alix-tz/moonshines/blob/master/CITATION.cff
transcription-guidelines: The transcription strictly follows what is written on the
images, including accentuation or capitalization errors. The segmentation follows
the SegmOnto ontology and mostly relies on MainZone and DefaultLine. Beware that
this dataset barely contains any ponctuation and that most lines begin with a capital
letter.
characters:
mode: NFD
members:
- e
- s
- a
- n
- r
- i
- t
- u
- o
- l
- d
- m
- c
- p
- "\u0301"
- ''''
- v
- g
- b
- h
- "\u0300"
- f
- L
- q
- E
- '1'
- A
- C
- x
- y
- "\u0302"
- S
- '9'
- P
- M
- j
- T
- D
- '-'
- N
- J
- R
- '0'
- z
- O
- I
- '2'
- '8'
- V
- F
- G
- U
- '5'
- B
- Q
- )
- H
- '3'
- (
- '7'
- '6'
- w
- k
- '4'
- "\u0327"
- K
- Z
- "\u0308"
- Y
- '{'
- '}'
- W
- .
- X
- ','
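
The `characters` block declares `mode: NFD`, which would explain why combining marks such as `"\u0301"` (acute accent) and `"\u0327"` (cedilla) are listed as separate members instead of precomposed letters like `é` or `ç`. The sketch below illustrates that decomposition with the standard library. Whether htruc builds the `members` list this way is an assumption, and `nfd_members` is a hypothetical helper.

```python
import unicodedata


def nfd_members(text):
    """Return the distinct code points of `text` after NFD normalization,
    in first-seen order (hypothetical helper)."""
    seen = []
    for ch in unicodedata.normalize("NFD", text):
        if ch not in seen:
            seen.append(ch)
    return seen


# "\u00e9t\u00e9" is the French word for "summer", with precomposed e-acute.
members = nfd_members("\u00e9t\u00e9")
print([f"U+{ord(ch):04X}" for ch in members])
```

Under NFD, each accented letter is split into a base letter followed by a combining mark, so a character inventory built from NFD text lists the base letters and the marks independently.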
55 changes: 55 additions & 0 deletions catalog/alix-tz/peraire-ground-truth.yml
@@ -0,0 +1,55 @@
schema: https://htr-united.github.io/schema/2022-04-15/schema.json
title: Peraire Ground Truth
url: https://github.com/alix-tz/peraire-ground-truth
authors:
- name: Alix
surname: Chagué
orcid: 0000-0002-0136-4434
roles:
- transcriber
- quality-control
institutions:
- name: Bibliothèque Sébert, Espéranto-France, Paris
roles:
- digitization
description: >-
This dataset was created in order to produce an HTR model for the Digital
Peraire project. The documents are handwritten, dating from the second half of
the 20th century, written by Lucien Péraire in French with a blue ink pen or,
more frequently, with a blue pencil.
project-name: Digital Peraire
language:
- fra
production-software: eScriptorium + Kraken
script:
- iso: Latn
script-type: only-manuscript
time:
notBefore: '1928'
notAfter: '1971'
hands:
count: '1'
precision: exact
license:
- name: CC-BY 4.0
url: https://creativecommons.org/licenses/by/4.0/
format: Alto-XML
volume:
- metric: characters
count: 38793
- metric: files
count: 33
- metric: lines
count: 1059
- metric: regions
count: 80
citation-file-link: https://github.com/alix-tz/peraire-ground-truth/blob/master/CITATION.cff
transcription-guidelines: >-
The transcription respects what is written on the document, including
ponctuation and spelling errors. The case is respected: capital letters are
transcribed with capital letters. Crossed out words are signaled by # which
isn't used to transcribe anything else. The SegmOnto ontology was used for the
segmentation of this dataset. For regions, MainZone and MarginTextZone were
used. For lines, DefaultLine and InterlinearLine were used. The original
documents are held at the Bibliothèque Sébert, Espéranto-France, Paris. They
should be mentionned every time the images are used.