Web api #222

Merged
merged 87 commits into from
Mar 3, 2023
2f3930c
Ignore the .idea directory from Intellij.
tdoan2010 Aug 16, 2022
d94f8b7
Add some first text.
tdoan2010 Aug 16, 2022
cbf056b
Briefly describe the spec.
tdoan2010 Aug 16, 2022
92ff42d
Add the usage section.
tdoan2010 Aug 17, 2022
54a0009
Add the centralized approach.
tdoan2010 Aug 17, 2022
e64df2d
Add the distributed web API server.
tdoan2010 Aug 18, 2022
c0a94ca
Merge Processing and Workflow server.
tdoan2010 Aug 18, 2022
d7dd17c
Add the distributed system architecture section.
tdoan2010 Aug 19, 2022
ff5888f
Add the distributed system architecture section.
tdoan2010 Aug 19, 2022
68f2eb7
Add the Processor API section.
tdoan2010 Aug 19, 2022
6021303
Merge branch 'master' into web-api
tdoan2010 Dec 5, 2022
2143e65
Apply suggestions from code review
tdoan2010 Dec 6, 2022
409dcad
Better reasoning.
tdoan2010 Dec 6, 2022
87c12da
Describe the architecture.
tdoan2010 Dec 6, 2022
b7196fd
Add schema files.
tdoan2010 Dec 7, 2022
f146df7
Better formulate the text. Starting the Processing Broker section.
tdoan2010 Dec 7, 2022
590bbbe
Add text to describe the configuration file and database.
tdoan2010 Dec 12, 2022
edf725e
Describe processing server, message queue, and database.
tdoan2010 Dec 12, 2022
84fac2a
Add credentials to message_queue.
tdoan2010 Dec 12, 2022
a48e9f9
update decisions.md to current state of things
kba Dec 13, 2022
709af1c
Update the Web API section.
tdoan2010 Dec 13, 2022
d8b8e78
Update the example config file.
tdoan2010 Dec 13, 2022
2f25ff4
Add the configuration file schema.
tdoan2010 Dec 13, 2022
480fb98
Update message_queue address and port with $ref.
tdoan2010 Dec 13, 2022
73fe0f1
Update the example with path_to_privkey.
tdoan2010 Dec 13, 2022
167b7e2
Fix the $id.
tdoan2010 Dec 14, 2022
7ae56cd
Add the result message schema.
tdoan2010 Dec 14, 2022
e96eb50
Update the result message example.
tdoan2010 Dec 14, 2022
c09bf8f
Add the RUNNING state.
tdoan2010 Dec 14, 2022
a4c85cb
Apply suggestions from code review
tdoan2010 Jan 6, 2023
a46eb97
Integrate comments.
tdoan2010 Jan 9, 2023
90a1910
Fix broken links.
tdoan2010 Jan 9, 2023
dfd0e3c
Re-phrase some text.
tdoan2010 Jan 9, 2023
7bd1bc6
Ignore the .idea directory from Intellij.
tdoan2010 Aug 16, 2022
fb8af40
Merging from master
tdoan2010 Aug 16, 2022
08d6b9d
Briefly describe the spec.
tdoan2010 Aug 16, 2022
c2837ab
Add the usage section.
tdoan2010 Aug 17, 2022
605ea8a
Add the centralized approach.
tdoan2010 Aug 17, 2022
973739f
Add the distributed web API server.
tdoan2010 Aug 18, 2022
6a28255
Merge Processing and Workflow server.
tdoan2010 Aug 18, 2022
a723fa9
Add the distributed system architecture section.
tdoan2010 Aug 19, 2022
6c5c79e
Add the distributed system architecture section.
tdoan2010 Aug 19, 2022
568b532
Add the Processor API section.
tdoan2010 Aug 19, 2022
45a7436
Apply suggestions from code review
tdoan2010 Dec 6, 2022
cb4ecc2
Better reasoning.
tdoan2010 Dec 6, 2022
6f23627
Describe the architecture.
tdoan2010 Dec 6, 2022
5262c6b
Add schema files.
tdoan2010 Dec 7, 2022
9a939ab
Better formulate the text. Starting the Processing Broker section.
tdoan2010 Dec 7, 2022
b3e7165
Add text to describe the configuration file and database.
tdoan2010 Dec 12, 2022
3b712b7
Describe processing server, message queue, and database.
tdoan2010 Dec 12, 2022
22d0ca5
Add credentials to message_queue.
tdoan2010 Dec 12, 2022
5b8b091
Update the example config file.
tdoan2010 Dec 13, 2022
4ab14c7
Add the configuration file schema.
tdoan2010 Dec 13, 2022
1652440
Update message_queue address and port with $ref.
tdoan2010 Dec 13, 2022
4e3016b
Update the example with path_to_privkey.
tdoan2010 Dec 13, 2022
5ab65bb
Fix the $id.
tdoan2010 Dec 14, 2022
2274934
Add the result message schema.
tdoan2010 Dec 14, 2022
b4c8ff0
Update the result message example.
tdoan2010 Dec 14, 2022
e0859ef
Add the RUNNING state.
tdoan2010 Dec 14, 2022
ffb85ca
Apply suggestions from code review
tdoan2010 Jan 6, 2023
d32722d
Integrate comments.
tdoan2010 Jan 9, 2023
548f390
Re-phrase some text.
tdoan2010 Jan 9, 2023
f0919a5
Add Web API section.
tdoan2010 Jan 9, 2023
dbc7316
Merge remote-tracking branch 'origin/web-api' into web-api
tdoan2010 Jan 9, 2023
4618ec0
Fix broken link and typos.
tdoan2010 Jan 9, 2023
ddfd0dd
Add --address option to the processing-broker command.
tdoan2010 Jan 16, 2023
9019c8a
Apply suggestions from code review
tdoan2010 Jan 24, 2023
e2bf67f
Apply suggestions from code review
tdoan2010 Jan 24, 2023
4b6c7cb
Update the architecture image and terminologies.
tdoan2010 Jan 24, 2023
966244b
Fix broken links.
tdoan2010 Jan 24, 2023
db6c902
Add an example about how to listen on a result queue in Python.
tdoan2010 Jan 24, 2023
fa5f076
Add the terminology section.
tdoan2010 Jan 24, 2023
0946255
Use the correct architecture picture.
tdoan2010 Jan 25, 2023
d0d2f5f
Apply suggestions from code review
tdoan2010 Jan 25, 2023
a56b662
Reformulate some sentences.
tdoan2010 Jan 25, 2023
77a0d1f
Rename some keys.
tdoan2010 Jan 25, 2023
f22fdd4
Update the config schema to support standalone server deployment.
tdoan2010 Jan 25, 2023
47e2bd4
Add callback URL and more terminologies.
tdoan2010 Jan 25, 2023
39b937b
Add credentials and port to the Python example.
tdoan2010 Jan 25, 2023
e24c898
Rename message_queue to process_queue and processors to workers.
tdoan2010 Jan 26, 2023
b1aea7c
Adapt the configuration file example. Move the Processing Worker sect…
tdoan2010 Jan 26, 2023
559e1b5
Update the CLI syntax.
tdoan2010 Jan 26, 2023
9ca42bc
Update the figure with proper arrow direction.
tdoan2010 Jan 26, 2023
64d568b
Apply suggestions from code review
tdoan2010 Feb 10, 2023
ad6fe9e
Re-formulate the Workflow Server terminology.
tdoan2010 Feb 10, 2023
3b2dee9
Change the syntax to start a processing worker, since nested command …
tdoan2010 Feb 10, 2023
b670165
Remove some text.
tdoan2010 Feb 10, 2023
1 change: 1 addition & 0 deletions .gitignore
@@ -1 +1,2 @@
/.project
.idea/
330 changes: 167 additions & 163 deletions decisions.md
@@ -1,163 +1,167 @@
# Decisions in OCR-D

In a software project, especially a highly distributed one like OCR-D,
decisions need to be made on the technology used, how interfaces should
interoperate and how the software as a whole is designed.

In this document, such decisions on key aspects of OCR-D are discussed for the
benefit of all OCR-D stakeholders.

## Terminology

* *current* refers to **September 1, 2022**, the last change of this document
* *Q1-Q4* refers to yearly quarters
* *target version* is the version we mainly test and develop for
* *supported version* means that we test this version and ensure compatibility

## General decisions

* [2022] We will update to Ubuntu 22.04 and Python 3.7 as soon as possible.
* [2022] Switch to Slim Containers in ```ocrd_all```
* [2022] Python API changes (Pagewise processing): <https://github.com/OCR-D/zenhub/issues/2>

## Workflow format

* [Q3 2022] We use Nextflow. The whole `.nf` file (Nextflow file) as the workflow
format workflow server and processing
server including web API implementation is part of the
[implementation projects](phase3). Further details can be found [in the nextflow spec](spec/nextflow).

## Web API

* [2022] OCR-D Coordination Project provides the [Web API spec](spec/web_api).
Only the [REST API wrapper](https://github.com/OCR-D/core/pull/884) of a single processor is provided by OCR-D Core.

## QUIVER

* [2022] We will create a web application, QUIVER (for QUalIty oVERview), in which several information about OCR-D processors are provided:
* a general overview of the projects (i.e. GitHub repositories), e.g. if their `ocrd-tool.json` is valid, when their last release has been made etc.
* a workflow section where we [benchmark](#benchmarking) different workflows for different corpora.
* a general overview of the available processors

### Benchmarking

* [2022] To execute the benchmarking, we will create several corpora with different characteristics (font, creation date, layout, …) and
run different workflows with these as input. The result is then displayed in the QUIVER workflow tab.
The corpora will be publicly available for better transparency.
* [2022] Relevant benchmarks for the mininum viable product (MVP) will be:
* CER
* WER
* Bag of Words
* Reading order
* IoU
* CPU time
* wall time
* I/O
* Memory Usage
* Disc usage
* [2022] The benchmarking will be executed automatically in a regular intervall to measure if changes in the processors improve the result.
This might be done via CI, GitHub Actions or as a CRON job on a separate server.

## OCR-D/core

### METS server

The current approach to file management requires processors accessing a single
METS file on disk, which turns file management into a bottleneck for workflows.

To alleviate this, we will develop an HTTP server that provides asynchronous
and parallel access to the METS in **Q4 2022**.

### Decentralized resource list

We currently maintain a list of processor resources centrally in OCR-D/core.

In **Q3 2022**, to allow processor developers to maintain their own separate
list of resources, we have implemented mechanisms to store resource lists in a
processor's `ocrd-tool.json` and bundle resources in their own module directory.

By **Q4 2022** we should have updated all the processors and whittled down the
central list to a mostly empty list.

### Page-wise processing

Currently, processors iterate through the files of a workspace by looping through
all the files in the input file group(s) themselves.

In **Q4 2022** we will refactor the processor API, deprecate the current
approach of processors iterating in a `process` method and enable processors
to process individual pages in a `process_page` method.

<!--
-## Processors
-
-In this section we outline our plans with the various processor projects.
-
-**NOTE** Currently only anybaseocr as an example
-
-### [ocrd_anybaseocr](https://github.com/OCR-D/ocrd_anybaseocr)
-
-`ocrd_anybaseocr` is a fairly complex project with multiple processors working
-on different problems with different technologies. Some of the processors are
-powerful, some are too experimental to be recommended. The original developers
-have moved on from the projects, so it is essential for maintainability by the
-community that we refactor it.
-->

## ocrd_all Docker deployment


* Our current target container is a **fat container**, with **maximum**,
**medium** and **minimum** versions with decreasing amount of processors
contained.
* We will wrap processor projects individually and transition to **slim containers** in **Q1 2023**.

## Supported Python versions


* Our current target version for Python is **3.7**, we support **3.6** and **3.7** fully, later versions partially.
* :warning: We cannot currently upgrade beyond **3.7** because there are no [tensorflow v1.15.x](#tensorflow) prebuilt images available. We need to investigate how to alleviate this until **Q4 2022**.
* We will change the target version for Python to **3.10** in **Q4 2022** when we have solved the tensorflow problem.
* Support for **3.6** will end **Q3 2022**. We will not test and include Python 3.6 after that.
* We will start to support **3.11** in **Q4 2022**.
* We will start to support **3.12** in **Q2 2023** (:warning: won't have distutils anymore)
* Support for **3.7** will end **Q2 2023**.
* Support for **3.8** will end **Q3 2023**.
* Support for **3.9** will end **Q4 2023**.

## Base OS image

* Our current base image for deployment is **Ubuntu 18.04**, we support **Ubuntu 18.04**, **20.04** and **22.04**.
* We will change the base image to **Ubuntu 22.04** in **Q4 2022**.
* Support for **Ubuntu 18.04** will end in **Q1 2023**.
* Support for **Ubuntu 20.04** will end in **Q2 2024**.

## Software libraries

### [calamari](https://github.com/OCR-D/ocrd_calamari)

* Our currently supported calamari version is **1.x**.
* We will switch to **2.x** in **Q4 2022**.

* Support for **1.x** will end in in **Q3 2022**

### [pillow](https://pillow.readthedocs.io/)

* We currently support Pillow **5.x** to **v9.x**

### [tensorflow](https://github.com/tensorflow/tensorflow)

* Our target version is **2.5.0**
* We currently support **1.15.x**, **2.4.0** and **2.5.0**.
While we strongly encourage moving away from **1.15.x**, due to the
logistics of updating trained models, we don't have a fixed
cut-off date.

### [torch](https://pytorch.org/)

* Our current target version is **1.10.x**.

### bash

* We use bash scripting for development tasks and for the bashlib library in OCR-D.
* Our current target version is **4.4**.
# Decisions in OCR-D

In a software project, especially a highly distributed one like OCR-D,
decisions need to be made on the technology used, how interfaces should
interoperate and how the software as a whole is designed.

In this document, such decisions on key aspects of OCR-D are discussed for the
benefit of all OCR-D stakeholders.

## Terminology

* *current* refers to **December 13, 2022**, the last change of this document
* *Q1-Q4* refers to yearly quarters
* *target version* is the version we mainly test and develop for
* *supported version* means that we test this version and ensure compatibility

## General decisions

* [Q1 2023] We will update to Ubuntu 22.04 and Python 3.7 as soon as possible. [OCR-D/core#956](https://github.com/OCR-D/core/pull/956)
* [Q1 2023] Switch to Slim Containers in `ocrd_all`
* [Q1 2023] Python API changes (Pagewise processing) [OCR-D/core#322](https://github.com/OCR-D/core/issues/322)

## Workflow format

* [Q3 2022] We use Nextflow, with the whole `.nf` file (Nextflow file) as the workflow
format. The workflow server and processing
server, including the [web API implementation](https://github.com/OCR-D/ocrd-webapi-implementation), are part of the
[implementation projects](phase3). Further details can be found [in the Nextflow spec](nextflow).

## Web API

* [Q3 2022] Switch to the new architecture with a message queue.
* Processing Broker and Processing Server will be provided via OCR-D Core.
* [2022] OCR-D Coordination Project provides the [Web API spec](web_api). Only
the [REST API wrapper](https://github.com/OCR-D/core/pull/884) of a single processor is provided by OCR-D Core.

## QUIVER

* [2022] We will create a web application, [QUIVER](https://github.com/OCR-D/quiver-back-end) (for QUalIty oVERview), which
provides information about OCR-D processors:
  * a general overview of the projects (i.e. GitHub repositories), e.g. whether their `ocrd-tool.json` is valid, when their last release was made, etc.
  * a workflow section where we [benchmark](#benchmarking) different workflows for different corpora.
  * a general overview of the available processors

### Benchmarking

* [2022] To execute the benchmarking, we will create several corpora with different characteristics (font, creation date, layout, …) and
run different workflows with these as input. The result is then displayed in the QUIVER workflow tab.
The corpora will be publicly available for better transparency.
* [2022] [Relevant metrics](https://github.com/OCR-D/spec/pull/225) for the minimum viable product (MVP) will be:
  * CER
  * WER
  * Bag of Words
  * Reading order
  * IoU
  * CPU time
  * wall time
  * I/O
  * Memory usage
  * Disk usage
* [2022] The benchmarking will be executed automatically in a regular interval to measure if changes in the processors improve the result.
This might be done via CI, GitHub Actions or as a CRON job on a separate server.
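
As a rough illustration of the first two text metrics listed above, the sketch below computes CER and WER using the common definition: Levenshtein edit distance divided by the length of the reference. This is an assumption for illustration only; the metric definitions actually adopted for QUIVER are the ones discussed in the linked pull request, and the function names `levenshtein`, `cer` and `wer` are hypothetical.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits per reference word (whitespace tokens)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)
```

With this definition, both rates are 0.0 for a perfect match and can exceed 1.0 when the OCR output is much longer than the ground truth.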

## OCR-D/core

### METS server

The current approach to file management requires processors accessing a single
METS file on disk, which turns file management into a bottleneck for workflows.

To alleviate this, we [will develop an HTTP server](https://github.com/OCR-D/core/pull/966) that provides asynchronous and
parallel access to the METS in **Q4 2022**.

### Decentralized resource list

We currently maintain a list of processor resources centrally in OCR-D/core.

In **Q3 2022**, to allow processor developers to maintain their own separate
list of resources, we have implemented mechanisms to store resource lists in a
processor's `ocrd-tool.json` and bundle resources in their own module directory.

By **Q1 2023** we should have updated all the processors and whittled down the
central list to a mostly empty list.

### Page-wise processing

Currently, processors iterate through the files of a workspace by looping through
all the files in the input file group(s) themselves.

In **Q1 2023** we will refactor the processor API, deprecate the current
approach of processors iterating in a `process` method and enable processors
to process individual pages in a `process_page` method.
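
The intended shift can be sketched as follows. This is a minimal illustration under stated assumptions, not the OCR-D core API: the class names, the constructor and the use of plain strings as "pages" are hypothetical. Only the method names `process` and `process_page` come from the text above.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class PagewiseProcessor(ABC):
    """Hypothetical base class: iteration lives here, not in each processor."""

    def __init__(self, input_pages: List[str]):
        # Stand-in for the files of the input file group(s).
        self.input_pages = input_pages

    def process(self) -> Dict[str, str]:
        # The base class owns the loop, so it could later be parallelized
        # or resumed per page without touching the processors themselves.
        return {page: self.process_page(page) for page in self.input_pages}

    @abstractmethod
    def process_page(self, page: str) -> str:
        """Handle exactly one page and return its result."""


class UppercaseProcessor(PagewiseProcessor):
    """Toy processor that only implements the per-page step."""

    def process_page(self, page: str) -> str:
        return page.upper()
```

A processor written this way no longer contains a loop over the workspace; calling `UppercaseProcessor(["p1", "p2"]).process()` drives the per-page method once for each input page.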

<!--
-## Processors
-
-In this section we outline our plans with the various processor projects.
-
-**NOTE** Currently only anybaseocr as an example
-
-### [ocrd_anybaseocr](https://github.com/OCR-D/ocrd_anybaseocr)
-
-`ocrd_anybaseocr` is a fairly complex project with multiple processors working
-on different problems with different technologies. Some processors are
-powerful, some are too experimental to be recommended. The original developers
-have moved on from the projects, so it is essential for maintainability by the
-community that we refactor it.
-->

## ocrd_all Docker deployment


* Our current target container is a **fat container**, with **maximum**,
**medium** and **minimum** versions containing decreasing numbers of
processors.
* We will wrap processor projects individually and transition to **slim containers** in **Q1 2023**.

## Supported Python versions


* Our current target version for Python is **3.7**; we support **3.6** and **3.7** fully, later versions partially.
* :warning: We cannot currently upgrade beyond **3.7** because there are no [tensorflow v1.15.x](#tensorflow) prebuilt images available. We need to investigate how to alleviate this by **Q4 2022**.
* We will change the target version for Python to **3.10** in **Q1 2023** when we have solved the tensorflow problem.
* Support for **3.6** will end **Q4 2022**. We will not test and include Python 3.6 after that.
* We will start to support **3.11** in **Q1 2023**.
* We will start to support **3.12** in **Q2 2023** (:warning: won't have distutils anymore)
* Support for **3.7** will end **Q2 2023**.
* Support for **3.8** will end **Q3 2023**.
* Support for **3.9** will end **Q4 2023**.

## Base OS image

* Our current base image for deployment is **Ubuntu 18.04**; we support **Ubuntu 18.04**, **20.04** and **22.04**.
* We will change the base image to **Ubuntu 20.04** in **Q4 2022**.
* We will change the base image to **Ubuntu 22.04** in **Q1 2023**.
* Support for **Ubuntu 18.04** will end in **Q1 2023**.
* Support for **Ubuntu 20.04** will end in **Q2 2024**.

## Software libraries

### [calamari](https://github.com/OCR-D/ocrd_calamari)

* Our currently supported calamari version is **1.x**.
* We will switch to **2.x** in **Q1 2023**.

* Support for **1.x** will end in **Q1 2022**

### [pillow](https://pillow.readthedocs.io/)

* We currently support Pillow **5.x** to **9.x**.

### [tensorflow](https://github.com/tensorflow/tensorflow)

* Our target version is **2.5.0**
* We currently support **1.15.x**, **2.4.0** and **2.5.0**.
While we strongly encourage moving away from **1.15.x**, due to the
logistics of updating trained models, we don't have a fixed
cut-off date.

### [torch](https://pytorch.org/)

* Our current target version is **1.10.x**.

### bash

* We use bash scripting for development tasks and for the bashlib library in OCR-D.
* Our current target version is **4.4**.
Binary file added images/web-api-distributed-queue.jpg