Exploring different state-of-the-art NLP techniques and architectures for multi-label text classification

jlacv/pytorch-Text_Classification

pytorch-Text_Classification

Dataset: UCC (Unhealthy Comments Corpus)

The goal of this practical project is to implement state-of-the-art NLP models in PyTorch for multi-label text classification on the high-quality UCC dataset. The dataset was published in 2020 in the paper "Six Attributes of Unhealthy Conversation".

The dataset contains over 40,000 healthy comments and fewer than 3,000 unhealthy comments. In addition to the binary healthy/unhealthy label, it also captures 6 unhealthy sub-attributes, such as (1) hostile, (2) insulting and trolling, (3) dismissive, ..., (6) unfair generalization. For some of these attributes, this was the first large publicly available dataset to capture them.
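To make the multi-label setup concrete, here is a minimal sketch of how such annotations can be turned into multi-hot target vectors, and how the per-label positive rate exposes the class imbalance. The label names are hypothetical (only attributes mentioned in this README are used, and the real column names may differ), and the toy annotations are made up.

```python
# Hypothetical label names: only attributes mentioned in this README are used;
# the real dataset's column names may differ.
LABELS = ["hostile", "dismissive", "sarcasm", "unfair_generalization"]


def to_multi_hot(active, labels=LABELS):
    """Turn a set of active attribute names into a 0/1 target vector."""
    unknown = set(active) - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1.0 if name in active else 0.0 for name in labels]


def positive_rates(targets):
    """Fraction of comments carrying each label (shows the class imbalance)."""
    n = len(targets)
    return [sum(column) / n for column in zip(*targets)]


# Toy, made-up annotations purely for illustration.
toy = [
    to_multi_hot({"hostile", "sarcasm"}),
    to_multi_hot(set()),  # a healthy comment: no attribute set
    to_multi_hot({"dismissive"}),
    to_multi_hot(set()),
]
rates = positive_rates(toy)
```

A comment can carry several attributes at once, which is why each label gets its own 0/1 slot instead of a single class index.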

Model training:

The original paper aimed to present the dataset, and its authors trained a BERT model on the text classification task. I used the BERT-base, T5 and RoBERTa models. RoBERTa scored better than the other two on the classification of the unhealthy labels.
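The training objective is not spelled out above. A common choice for multi-label classification, assumed here, is an independent sigmoid per label with binary cross-entropy averaged over all labels. The sketch below writes that loss in plain Python so the arithmetic is visible; in PyTorch, `torch.nn.BCEWithLogitsLoss` with its default mean reduction computes the same quantity on tensors. The function names and the 0.5 threshold are illustrative assumptions, not taken from the repository.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over all (comment, label) pairs.

    logits:  one list of raw model scores per comment, one score per label
    targets: matching lists of 0.0/1.0 values
    """
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for z, y in zip(row_logits, row_targets):
            # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


def predict(logits, threshold=0.5):
    """Each label is decided independently: 1 if its probability reaches the threshold."""
    return [[1 if sigmoid(z) >= threshold else 0 for z in row] for row in logits]


loss = multilabel_bce([[0.0, 0.0]], [[1.0, 0.0]])
labels = predict([[2.0, -2.0]])
```

Unlike a softmax, this formulation lets a comment be assigned zero, one, or several labels.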

Results:

Because the original paper was published in 2020 and focused on the dataset rather than on modeling, there was room for improvement: using the same performance measure as the authors, I achieved better scores on all labels before any hyperparameter optimization.

The authors scored 50% on the sarcasm label and discussed the difficulty of detecting sarcasm. With the fine-tuned RoBERTa model, I achieved a score of 75% before any hyperparameter optimization.
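The performance measure is not named above. As an illustration only, the sketch below assumes a per-label ROC AUC, a common choice for imbalanced multi-label problems; the actual measure used in the paper and in this project may differ. The helper names and the toy values are hypothetical.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5).

    Compares every positive with every negative, which is fine for a sketch
    but quadratic in the number of examples.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_label_auc(y_true_rows, score_rows, label_names):
    """Score each label separately by slicing out its column."""
    true_cols = zip(*y_true_rows)
    score_cols = zip(*score_rows)
    return {name: roc_auc(t, s) for name, t, s in zip(label_names, true_cols, score_cols)}


# Toy, made-up targets and model scores for two hypothetical labels.
y_true = [[1, 0], [0, 1], [1, 0], [0, 1]]
y_score = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.6, 0.7]]
aucs = per_label_auc(y_true, y_score, ["hostile", "sarcasm"])
```

Reporting one value per label, rather than a single average, keeps hard labels such as sarcasm visible.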
