OCR_XTRACT

This project is conducted by the Lab IA at Etalab.
The Lab IA helps French administrations modernize their services through the use of modern AI techniques.
Other Lab IA projects can be found on our GitHub organization.

Project Status: Active

OCR Xtract

OCR-Xtract is a tool that performs OCR and information extraction on documents. It is meant to speed up the work of state agents who handle documents whose formats are not directly machine-readable. OCR-Xtract will consist of:

A front-end for uploading files
An API to access the trained model for key information extraction
The code that extracts the information from the scanned images

Methods Used

OCR
Image Processing

Technologies

Python 3.7

Project Description

For now, only a proof of concept (POC) is available. It extracts information from French national identity cards (CNI) and French payslips.

Getting Started for development

Fork this repo
Update pip: pip install --upgrade pip
Install requirements: pip install -r requirements.txt

Install Doctr

Since we use doctr, you will need extra dependencies. We also use pdf2image, so you will have to install poppler.

For MacOS users

You can install them as follows:

brew install cairo pango gdk-pixbuf libffi

pdf2image also needs poppler. Install it with the command:

brew install poppler

If this does not work, an alternative is to use conda:

conda install -c conda-forge poppler

For Windows users

Those dependencies are included in GTK. Install them with the latest GTK installer for Windows.

Windows users will also have to build or download poppler for Windows. We recommend @oschwartz10612's version, which is the most up to date. You will then have to add the bin/ folder to PATH, or pass poppler_path = r"C:\path\to\poppler-xx\bin" as an argument to convert_from_path.
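The poppler_path option can be wrapped so that the same code works whether or not poppler is on PATH. This is a minimal sketch, not part of the project: the helper names poppler_kwargs and pdf_to_images are hypothetical, and only convert_from_path and its poppler_path and dpi arguments come from pdf2image.

```python
from pathlib import Path
from typing import Optional


def poppler_kwargs(poppler_bin: Optional[str] = None) -> dict:
    """Build the extra keyword arguments for pdf2image.convert_from_path.

    Hypothetical helper: returns an empty dict when poppler is on PATH,
    otherwise passes the bin/ folder explicitly.
    """
    if poppler_bin is None:
        return {}
    return {"poppler_path": str(Path(poppler_bin))}


def pdf_to_images(pdf_path, poppler_bin: Optional[str] = None, dpi: int = 200):
    """Convert a PDF into a list of PIL images, one per page."""
    # Imported lazily so this module can be loaded without pdf2image installed.
    from pdf2image import convert_from_path

    return convert_from_path(str(pdf_path), dpi=dpi, **poppler_kwargs(poppler_bin))
```

On Windows without poppler on PATH, a call would look like pdf_to_images("scan.pdf", poppler_bin=r"C:\path\to\poppler-xx\bin"), where the folder is the placeholder from the paragraph above.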

Linux

If you experience trouble with WeasyPrint and pango, install the following packages:

apt install python3-pip python3-cffi python3-brotli libpango-1.0-0 libharfbuzz0b libpangoft2-1.0-0 libgl1-mesa-glx

Most distros ship with pdftoppm and pdftocairo. If they are not installed, use your package manager to install poppler-utils.
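To check whether pdftoppm and pdftocairo are already available, something like the following can be used. It is a small sketch built on the standard library only; the function name find_poppler_tools is hypothetical.

```python
import shutil


def find_poppler_tools(names=("pdftoppm", "pdftocairo")):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# List the tools that still need to be installed (via poppler-utils on most distros).
missing = [name for name, path in find_poppler_tools().items() if path is None]
if missing:
    print("Missing poppler tools:", ", ".join(missing))
```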

Install DVC (Optional)

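To access the datasets on which the models are trained, install dvc and connect to our S3 bucket. In the commands below, the bucket path, endpoint URL, access key and secret key are placeholders to replace with the actual values.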

dvc remote add -d minio s3://labia/PATH/TO/STORE -f --local
dvc remote modify minio endpointurl https://our.endpoint.fr --local
dvc remote modify minio access_key_id ourkey --local
dvc remote modify minio secret_access_key ourpassword --local
dvc pull

How to perform the annotation

CNI

Prepare the data for annotation: launch the script align_cni_in_folder.py. This script aligns the images to the reference CNI, which makes the CNI easier to detect.
Prepare the annotated data: launch the script script_prepare_CNI_annotation.py. This script creates a csv file containing the OCR data extracted from the images, to be annotated with Label Studio.
Place the images to be annotated in /data/label_studio_files.
Launch the annotation platform. The command below mounts your local folder data/label-studio inside Label Studio so that Label Studio can use it. It also mounts the folder label_studio_files, where you have put the images to be annotated.

docker run -it -p 8080:8080 -v `pwd`/data/label-studio:/label-studio/data \
--env-file .env \
-v `pwd`/data/label_studio_files:/label-studio/files \
heartexlabs/label-studio:latest

On Windows, use absolute paths instead of `pwd`, for example:

docker run -it -p 8080:8080 -v C:\Users\Utilisateur\PythonProjects\ocr-xtract\data\label-studio:/label-studio/data --env-file .env -v C:\Users\Utilisateur\PythonProjects\ocr-xtract\data\label_studio_files:/label-studio/files heartexlabs/label-studio:latest

Create an annotation project:
- name your project
- import the json file generated in step 2
- select Object Detection with Bounding Boxes in Labeling Setup
- define the labels corresponding to the categories you want to extract
Once the annotation is complete, export the annotations in the json format.
Convert the annotations from json to csv with script_get_csv_from_annotation_json.py.
Use the resulting csv file to train a new model with train_cni_pipeline.py.
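The docker run command used to launch the annotation platform differs between shells only in how the project path is written. As an illustration, it can be assembled from Python so that the same call works on Linux, macOS and Windows. This is a sketch under assumptions: the function name label_studio_command and its parameters are hypothetical, while the ports, mount points and image name are the ones from the command in the steps above.

```python
from pathlib import Path

IMAGE = "heartexlabs/label-studio:latest"


def label_studio_command(project_root=None, env_file=".env", port=8080):
    """Return the docker run argument list that starts Label Studio.

    project_root defaults to the current working directory, which plays
    the role of `pwd` in the shell version of the command.
    """
    root = Path(project_root) if project_root is not None else Path.cwd()
    data_dir = root / "data" / "label-studio"
    files_dir = root / "data" / "label_studio_files"
    return [
        "docker", "run", "-it",
        "-p", f"{port}:8080",
        "-v", f"{data_dir}:/label-studio/data",
        "--env-file", env_file,
        "-v", f"{files_dir}:/label-studio/files",
        IMAGE,
    ]


# To launch the container from the repository root (requires Docker):
#   import subprocess
#   subprocess.run(label_studio_command(), check=True)
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.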

How to convert the label studio annotated file into csv file

Export the Label Studio annotations using the Label Studio json format, then run:

from pathlib import Path
# LabelStudioConvertor is provided by this project; import it from its module in the source tree.

df = LabelStudioConvertor(Path("export.json"), Path("annotated_data.csv")).transform()

where:

export.json: path to the exported Label Studio json file
annotated_data.csv: path where you want to save the csv file
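To give an idea of what such a conversion involves, here is a standalone sketch that flattens bounding-box annotations from a Label Studio json export into csv rows. It is not the project's LabelStudioConvertor and its output columns are an assumption. It also assumes that each task stores its image under the "image" data key, which depends on the labeling configuration. The function names are hypothetical.

```python
import csv
import io

FIELDS = ("image", "label", "x", "y", "width", "height")


def flatten_label_studio_export(tasks):
    """Yield one dict per labelled bounding box in a Label Studio export.

    tasks is the list loaded from the exported json file. Coordinates are
    kept as exported by Label Studio (percentages of the image size).
    """
    for task in tasks:
        image = task.get("data", {}).get("image")  # assumed data key
        for annotation in task.get("annotations", []):
            for region in annotation.get("result", []):
                value = region.get("value", {})
                for label in value.get("rectanglelabels", []):
                    yield {
                        "image": image,
                        "label": label,
                        "x": value.get("x"),
                        "y": value.get("y"),
                        "width": value.get("width"),
                        "height": value.get("height"),
                    }


def rows_to_csv(rows, fieldnames=FIELDS):
    """Serialize the flattened rows to csv text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

With a real export, the tasks would be loaded with json.load before being passed to flatten_label_studio_export, and the returned text written to the csv path of your choice.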

Contributing Lab IA Members

How to contribute to this project

We love your input! We want to make contributing to this project as easy and transparent as possible: see our contribution rules.

