ArabicToTXM

This project was created during the CERES Hackathon with the participation of Rimane Karam and Marceau Hernandez. The goal was to apply POS tags to an Arabic corpus. We use the work presented in "Camelira: An Arabic Multi-Dialect Morphological Disambiguator" (Ossama Obeid, Go Inoue, Nizar Habash, 2022) to apply multiple POS tags to each word of the corpus.

Installation

To install ArabicToTXM, you must have Python 3.x and pip installed. You must first install some dependencies for CAMeL Tools, which is the package used to apply multiple POS tags. Refer to the official CAMeL Tools documentation for more information. Note that the camel_data command is provided by the camel-tools package itself, so it is only available once that package is installed. Here are the commands to install those dependencies (for Ubuntu):

sudo apt-get install cmake libboost-all-dev
camel_data -i light

Once all the dependencies are installed, clone this repository to your computer. Open your terminal and go to the ArabicToTXM folder (where main.py is). From that folder, install the required packages with the following command:

pip install -r requirements.txt
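
As a quick sanity check after installation, the following minimal sketch tests whether a Python module can be imported. It assumes that requirements.txt pulls in camel-tools (imported as `camel_tools`); the helper name `is_installed` is made up for illustration and is not part of this repository.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: camel-tools is listed in requirements.txt.
if is_installed("camel_tools"):
    print("camel_tools is available")
else:
    print("camel_tools is missing; re-run: pip install -r requirements.txt")
```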

How to use

This script takes raw text files (.txt) as input. It tokenizes each file and applies POS tags to every token. The result is an .xml file in output/tagged containing one word per line with its POS tags. This file is compatible with TXM. The command is as follows:

python main.py --tag --model [MODEL_NAME]
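
The exact XML schema written by main.py is not described here. As a rough illustration of a "one word per line" file that TXM can import, the sketch below assumes a layout where each token is a `<w>` element and each tag is an attribute. The function name `tokens_to_txm_xml`, the attribute names, and the sample tokens are all hypothetical and not taken from this repository.

```python
import xml.etree.ElementTree as ET


def tokens_to_txm_xml(tokens):
    """Serialize (word, tags) pairs as one <w> element per line.

    `tags` is a dict mapping a tag name to its value; each entry
    becomes an XML attribute on the <w> element.
    """
    text = ET.Element("text")
    text.text = "\n"
    for word, tags in tokens:
        w = ET.SubElement(text, "w", attrib=tags)
        w.text = word
        w.tail = "\n"  # line break after each word
    return ET.tostring(text, encoding="unicode")


# Made-up tagged tokens, for illustration only.
sample = [
    ("كتب", {"pos": "verb", "lex": "كتب"}),
    ("الولد", {"pos": "noun", "lex": "ولد"}),
]
print(tokens_to_txm_xml(sample))
```

Storing tags as attributes keeps the word itself as the element text, which is why a layout of this kind is convenient for word-level annotation.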

The list of applied POS tags can be found in src/tags_list.json. For more information about the tags you can add to the list, please refer to Camelira's online documentation and Camelira's tag list. If you want to segment your corpus into sentences, you can use the following command:

python main.py --sentence

It creates a .csv file in output/sentence containing three columns: sentence (raw sentence), tags (POS tags of each word in this sentence) and lem (canonical form of each word in this sentence).
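
To work with that file afterwards, a minimal reading sketch could look like the one below. The column names sentence, tags and lem come from the description above; everything else is an assumption: a comma delimiter, a header row, and space-separated values inside the tags and lem cells. The function name `read_sentence_rows` and the sample row are made up for illustration.

```python
import csv
import io


def read_sentence_rows(fileobj):
    """Read the sentence CSV and split tags and lemmas into lists.

    Assumes a header row with the columns sentence, tags and lem,
    and space-separated values in the tags and lem cells.
    """
    rows = []
    for row in csv.DictReader(fileobj):
        rows.append({
            "sentence": row["sentence"],
            "tags": row["tags"].split(),
            "lem": row["lem"].split(),
        })
    return rows


# Made-up sample standing in for a file from output/sentence.
sample = io.StringIO(
    "sentence,tags,lem\n"
    "كتب الولد,verb noun,كتب ولد\n"
)
for row in read_sentence_rows(sample):
    # Pair each tag with the lemma at the same position.
    for tag, lemma in zip(row["tags"], row["lem"]):
        print(tag, lemma)
```

For a real file, replace the `io.StringIO` sample with `open(path, newline="", encoding="utf-8")` and adjust the splitting to match the actual cell format.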

JulienBez/ArabicToTXM

Folders and files

Latest commit

History

Repository files navigation

ArabicToTXM

Installation

How to use

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages