
Script used to convert an arabic corpus to a TXM compatible file. Takes .txt files as input.


JulienBez/ArabicToTXM


ArabicToTXM

This project was created during the CERES Hackathon event with the participation of Rimane Karam and Marceau Hernandez. The goal was to apply POS tags to an Arabic corpus. We use the work presented in Camelira: An Arabic Multi-Dialect Morphological Disambiguator (Ossama Obeid, Go Inoue, Nizar Habash, 2022) to apply multiple POS tags to each word of the corpus.

Installation

To install ArabicToTXM, you need Python 3.x and pip. You must first install some dependencies for Camel-Tools, the package used to apply the POS tags. Refer to the official Camel-Tools documentation for more information. On Ubuntu, the commands to install those dependencies are:

sudo apt-get install cmake libboost-all-dev
camel_data -i light

Once all the dependencies are installed, clone this repository to your computer. Open a terminal and go to the ArabicToTXM folder (where main.py is). From that folder, install the required packages with the following command:

pip install -r requirements.txt

How to use

This script takes raw text files (.txt) as input. It tokenizes each file and applies POS tags to each token. The result is a .xml file in output/tagged containing one word per line with its POS tags. This file is compatible with TXM. The command is as follows:

python main.py --tag --model [MODEL_NAME]

The list of applied POS tags can be found in src/tags_list.json. For more information about the tags you can add to the list, please refer to Camelira's online documentation and Camelira's tag list. If you want to segment your corpus into sentences, you can use the following command:

python main.py --sentence

It will create a .csv file in output/sentence containing three columns: sentence (the raw sentence), tags (the POS tags of each word in this sentence) and lem (the canonical form of each word in this sentence).
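As an illustration of how this output could be consumed, here is a minimal sketch that reads the three columns with Python's standard csv module. Only the column names (sentence, tags, lem) come from the description above. The read_sentence_rows helper, the comma delimiter, the header row and the in-memory sample are assumptions made for the example, and the way tags are separated inside a cell is not specified here, so the sketch leaves each cell as a raw string.

```python
import csv
import io


def read_sentence_rows(handle):
    """Yield one dict per CSV row, keeping only the three documented columns.

    Hypothetical helper: assumes a header row naming the columns
    sentence, tags and lem, and a comma delimiter.
    """
    reader = csv.DictReader(handle)
    for row in reader:
        yield {
            "sentence": row["sentence"],
            "tags": row["tags"],
            "lem": row["lem"],
        }


# Made-up sample standing in for a file from output/sentence.
# The cell contents are placeholders, not real output of the script.
sample = io.StringIO(
    "sentence,tags,lem\n"
    '"first sentence","tag_a tag_b","first sentence"\n'
)

rows = list(read_sentence_rows(sample))
for row in rows:
    print(row["sentence"], "|", row["tags"], "|", row["lem"])
```

To read a real file instead of the sample, open it with open(path, encoding="utf-8", newline="") and pass the handle to the helper. Explicit UTF-8 decoding matters here because the sentences contain Arabic text.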
