This repository contains the Arabic Keyphrase Extraction Corpus (AKEC) built by Muhammad Helmy, Marco Basaldella, Eddy Maddalena, Stefano Mizzaro and Gianluca Demartini.
The corpus and the process we used to build it are described in detail in the paper "Towards Building a Standard Dataset for Arabic Keyphrase Extraction Evaluation", presented at the 20th International Conference on Asian Language Processing (IALP 2016), held in Tainan, Taiwan, from November 21 to 23, 2016.
You can find a brief statistical overview of the corpus in the docs folder, or you can see it online at this link.
The corpus consists of 160 Arabic documents and their keyphrases. We selected the documents from a variety of sources and collected the keyphrases through a large-scale crowdsourcing experiment.
The repository is structured as follows.
├── docs
├── documents
│   ├── pure
│   └── raw
└── keyphrases
    ├── sort_frequency
    └── sort_lm
The docs folder contains the statistical analysis mentioned above.
The documents folder contains the documents. We provide them in their original form (plus some formatting) in the raw folder and in pure form, i.e. with diacritics removed, in the pure folder.
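To illustrate what "diacritics removed" means, here is a minimal Python sketch that strips the Arabic short-vowel marks (tashkeel) from a string. It is not the procedure we used to produce the pure folder; the function name strip_diacritics and the exact set of code points are assumptions chosen for illustration.

```python
import re

# Assumption: "diacritics" here means the tashkeel marks from fathatan (U+064B)
# through sukun (U+0652), plus the superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return text with the Arabic diacritic marks listed above removed."""
    return _DIACRITICS.sub("", text)


if __name__ == "__main__":
    # kaf + fatha, ta + fatha, ba + fatha (a fully vocalized word)
    vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
    print(strip_diacritics(vocalized))
```

Base letters and all non-Arabic characters pass through unchanged, so the function can be applied to whole documents.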
The keyphrases folder contains the crowd-assigned keyphrases. We provide four files: two of them contain the keyphrases ordered by their frequency in the crowd workers' selections (in the sort_frequency folder), and the other two contain the keyphrases ordered using a simple language model, also generated from the crowd selections (in the sort_lm folder). For each sorting, we provide the keyphrases both in their pure form (pure.txt) and in their lemmatized form (lemmatized.txt).
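As a rough idea of what a frequency-based ordering looks like, the following Python sketch ranks keyphrases by the number of workers who selected them. It is only an illustration: the input structure (one list of phrases per worker), the function name rank_by_frequency, and the alphabetical tie-breaking are all assumptions, not a description of how the files in sort_frequency were generated.

```python
from collections import Counter


def rank_by_frequency(selections):
    """Rank keyphrases by how many workers selected them.

    `selections` is a hypothetical structure: one list of phrases per worker.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter()
    for worker_phrases in selections:
        # Count each phrase at most once per worker.
        counts.update(set(worker_phrases))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    demo = [["a", "b"], ["a"], ["a", "b", "c"]]
    for phrase, count in rank_by_frequency(demo):
        print(phrase, count)
```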
We divided the corpus into 100 documents for training and 60 documents for testing. The filenames of the training and testing documents are listed in train.ids and test.ids respectively.
We provide a convenient shell script called split.sh that does the dirty job of getting the training/testing documents and putting them into two folders called, unsurprisingly, train and test. To split the documents in their original form, just navigate to the folder where you downloaded the corpus and run ./split.sh raw. To split the documents in the pure form, just write pure instead of raw.
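If you cannot run shell scripts, the same split can be sketched in Python. This is not a port of split.sh; it is a minimal sketch under several assumptions: train.ids and test.ids sit in the corpus root and list one filename per line, the documents live under documents/raw or documents/pure, and the files are copied (not moved) into train and test folders created in the corpus root. The names read_ids and split_corpus are hypothetical.

```python
import shutil
from pathlib import Path


def read_ids(ids_file):
    """Return the non-empty lines of an ids file (assumed: one filename per line)."""
    lines = Path(ids_file).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def split_corpus(corpus_root, form="raw"):
    """Copy documents of the given form ("raw" or "pure") into train/ and test/.

    Returns the number of files copied into each folder.
    """
    root = Path(corpus_root)
    source = root / "documents" / form
    copied = {}
    for subset in ("train", "test"):
        target = root / subset
        target.mkdir(exist_ok=True)
        names = read_ids(root / f"{subset}.ids")
        for name in names:
            shutil.copy(source / name, target / name)
        copied[subset] = len(names)
    return copied


if __name__ == "__main__":
    # Run from the folder where you downloaded the corpus.
    print(split_corpus(".", form="raw"))
```

Pass form="pure" to split the documents with diacritics removed instead.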
If you use our dataset, please cite the reference paper:
@inproceedings{akec2016,
author={Helmy, Muhammad and Basaldella, Marco and Maddalena, Eddy and Mizzaro, Stefano and Demartini, Gianluca},
title={Towards Building a Standard Dataset for Arabic Keyphrase Extraction Evaluation},
booktitle={Proceedings of the 20th International Conference on Asian Language Processing (IALP 2016)},
year={2016},
address={Tainan (Taiwan)}
}
To perform NLP in Arabic, we used the AraMorph software, which is a Java port of the Buckwalter Arabic Morphological Analyzer.
We selected the documents for our corpus from 4 different sources:
- Arabic Newspapers Corpus
- Corpus of Contemporary Arabic
- Essex Arabic Summaries Corpus
- Open Source Arabic Corpora
Please be aware that some of the resources we used to assemble our corpus are licensed for research purposes only. For this reason, we also make our corpus available for research purposes only.