
Document classification by inversion of word vectors

I recently read Matt Taddy's ACL 2015 paper "Document classification by inversion of distributed language representations" (available here). The paper presents a neat method of making use of distributional information for document classification. However, my intuition is that the method would not work well for small labelled data sets, so I decided to check this experimentally.
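
In outline, Taddy's method fits a separate word2vec language model to the training documents of each class, scores a new document under every model, and assigns the class under which the document is most likely. A minimal sketch with gensim (illustrative only, not this repo's exact code; function and variable names are mine):

from gensim.models import Word2Vec

def train_class_models(sentences_by_class, **w2v_params):
    # Fit one skip-gram model per class label. Hierarchical softmax (hs=1)
    # is required because Word2Vec.score() only supports such models.
    return {label: Word2Vec(sents, sg=1, hs=1, negative=0, **w2v_params)
            for label, sents in sentences_by_class.items()}

def classify(models, doc_sentences):
    # Sum the per-sentence log-likelihoods under each class model and
    # return the most likely label (assuming uniform class priors).
    loglik = {label: model.score(doc_sentences, total_sentences=len(doc_sentences)).sum()
              for label, model in models.items()}
    return max(loglik, key=loglik.get)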

Requirements

The implementation of Taddy's method is based on an example provided by gensim; you will need a bleeding-edge copy. You will also need the standard Python 3 scientific stack, i.e. numpy and scikit-learn. Legacy Python (<3.0) is not supported.
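
For example, assuming pip is available, something like the following should set up the dependencies (the second command installs gensim straight from its GitHub repository; the URL reflects where the project lived at the time of writing):

pip install numpy scikit-learn
pip install git+https://github.com/RaRe-Technologies/gensim.git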

Current features

  • Taddy's method, Naive Bayes and SVM classifiers, with grid search over parameter settings (sketched below)
  • Yelp reviews and 20 Newsgroups data sets
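
The repo's exact pipelines and parameter grids may differ; as a generic illustration, a scikit-learn grid search over an SVM's regularisation strength on the 20 Newsgroups data looks roughly like this:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Bag-of-words features followed by a linear SVM, with the penalty C
# chosen by 5-fold cross-validated grid search.
train = fetch_20newsgroups(subset='train')
pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('svm', LinearSVC())])
grid = GridSearchCV(pipeline, {'svm__C': [0.1, 1, 10]}, cv=5)
grid.fit(train.data, train.target)
print(grid.best_params_, grid.best_score_)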

Usage

  • get a pre-trained word2vec model, either from Google or by training one yourself (see the loading sketch after this list)
  • get the Yelp reviews data from here (you will need to log into Kaggle)
  • run the following commands:
python write_conf_files.py # generate one configuration file per experiment
python prepare_models.py # train word2vec models for each labelled corpus
python train_classifiers.py # train and evaluate classifiers
  • inspect the results in ./results/exp*
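
For example, loading Google's pre-trained GoogleNews vectors with gensim looks roughly like this (the file name assumes you have downloaded and unpacked Google's archive; the exact API has moved between gensim versions):

from gensim.models import KeyedVectors

# Load the 300-dimensional GoogleNews vectors in binary word2vec format.
vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(vectors.most_similar('review', topn=5))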

Todo

  • More classifiers and labelled data sets
  • Grid search for word2vec parameters (currently using half of cwiki with default settings)
  • Clean up and publish script I used to pre-train word2vec
  • Compare to Word Mover's Distance (another method that reports state-of-the-art results in document classification)

Disclaimer: This is a weekend hack, very much a work in progress. I have not had a chance to run an extensive evaluation. Preliminary results on the Yelp data suggest Naive Bayes < Taddy < SVM (with grid search).
