Skip to content

Latest commit

 

History

History
26 lines (19 loc) · 2.15 KB

README.md

File metadata and controls

26 lines (19 loc) · 2.15 KB

Article Separation

Description

Article separation (AS), also called article segmentation, is the process of dividing a newspaper page into its articles. So far, existing systems still need considerable input from human users to solve this task due to handcrafted rules that need parameter tuning to work well dependent on the layout of the newspaper page. In the context of the NewsEye project the aim is to create an automated workflow that is independant from any user input.

Workflow

The preceeding tasks 2.1 Layout Analysis (LA) and 2.2 Automated Text Recognition (ATR) in the NewsEye project provide geometrical information in the form of text lines / baselines and the corresponding transcription of the text. This information we want to use to solve the AS task by combining traditional LA methods, semantic information and machine learning based approaches.

Tasks 2.1 and 2.2 are processed and further developed in the Transkribus platform which has its roots in the FP7 Project tranScriptorium (2013-2015) and was further developed in the H2020 Project READ (2016-2019).

The Transkribus GitHub repository can be found at https://github.com/transkribus/.

Used models and algorithms

Data

To our knowledge, there is no general AS dataset on newspaper pages on which a comparison with other existing workflows is possible.

CITlab AS GitHub Repository:

The code for the article separation can be found in the following GitHub repository, which was updated for M45: https://github.com/CITlabRostock/citlab-article-separation-new