The goal of this project is to tag question/answer samples, structured in a tab- and line-separated text file, with any of many topics (this implementation involves ~325 unique topics).
The best solution ended up using a separate logistic regression classifier for each class with TF-IDF-vectorized input. Given the small data size, the choice of seed for the train/test split also proved to be an important factor during development.
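A minimal sketch of that setup, assuming scikit-learn (the texts, labels, and parameters below are illustrative, not the actual project code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy question/answer texts and topic labels (illustrative only)
texts = [
    "what is a noun",
    "define verb tense",
    "how do adjectives work",
    "explain verb agreement",
]
labels = [["grammar"], ["verbs"], ["grammar"], ["verbs"]]

# Binarize the topic labels into an indicator matrix
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)

# One logistic regression per class over shared TF-IDF features
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(class_weight="balanced")),
)
model.fit(texts, y)

preds = model.predict(["how does verb tense work"])
print(binarizer.inverse_transform(preds))
```

`OneVsRestClassifier` is one convenient way to get a distinct logistic regression per class; training the per-class models by hand in a loop would be equivalent.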
The data was highly imbalanced: many labels appeared only once in the training data, making them impossible to train on in some instances. There was a trade-off between never being able to predict a given class and trying to get the model to incorporate one-off cases; removing such labels from training produced better results.
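That filtering step might look roughly like this (a sketch with toy data; the threshold and sample format are assumptions, not the project's actual code):

```python
from collections import Counter

# Toy (text, label) training pairs; "rare-topic" appears only once
samples = [
    ("what is a noun", "grammar"),
    ("define verb tense", "verbs"),
    ("how do adjectives work", "grammar"),
    ("explain verb agreement", "verbs"),
    ("obscure one-off question", "rare-topic"),
]

# Count label frequency and drop samples whose label is too rare to learn
counts = Counter(label for _, label in samples)
min_count = 2  # keep only labels seen at least twice (illustrative threshold)
filtered = [(text, label) for text, label in samples if counts[label] >= min_count]
print(len(filtered))  # the singleton "rare-topic" sample is removed
```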
Several methods of exploration were used to determine the right set of hyper-parameters (e.g. class weight, train/test split seed, number of labels to keep, which part of the question/answer to use, etc.). A fully detailed explanation of my process is available in the presentation video.
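One way such a seed sweep might be structured, assuming scikit-learn's `train_test_split` (the data and the stand-in scoring line are illustrative):

```python
from sklearn.model_selection import train_test_split

# Toy corpus (illustrative only)
texts = [f"question {i}" for i in range(20)]
labels = ["a" if i % 2 == 0 else "b" for i in range(20)]

# Try several split seeds and record a dev-set score for each
results = {}
for seed in (0, 1, 2, 42):
    X_tr, X_dev, y_tr, y_dev = train_test_split(
        texts, labels, test_size=0.2, random_state=seed
    )
    # ...train a model on (X_tr, y_tr) and evaluate on (X_dev, y_dev)...
    results[seed] = len(X_dev)  # placeholder for a real dev-set metric

best_seed = max(results, key=results.get)
```

The same loop structure extends to class weights or label-count thresholds by nesting over those values as well.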
My algorithm ended up achieving a 0.54 F2 score. I'm quite proud of this result given the small training data size. This was also higher than the score of the algorithm developed by the creator of OpenClass.
NOTE: the presentation states that my highest F2 score was only 0.52. I had made the presentation before my last "hail-mary" submission to the competition, which used the entire training set for training (as opposed to setting aside a dev set).
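For reference, the F2 score weights recall more heavily than precision; with scikit-learn it can be computed like this (toy labels, not the competition data):

```python
from sklearn.metrics import fbeta_score

# Toy binary predictions (illustrative only)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# beta=2 weights recall twice as heavily as precision
f2 = fbeta_score(y_true, y_pred, beta=2)
print(round(f2, 3))
```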
Only a couple of new files have been added to this repo beyond the template:
- /test_scores/ <= this folder contains the .csv and corresponding .xlsx files for the tests I ran to determine the best classifier setup
- /code/pickle_files <= this folder contains the best-scoring model as well as the binarizer and vectorizer used
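Reusing those artifacts follows the standard pickle save/load pattern, sketched below (the object and file name here are stand-ins; the actual file names inside `pickle_files` are not shown in this README):

```python
import os
import pickle
import tempfile

# Stand-in for a trained model/binarizer/vectorizer object
artifact = {"name": "best_model", "params": {"C": 1.0}}

path = os.path.join(tempfile.gettempdir(), "model.pkl")

# Save the artifact, as done when populating pickle_files
with open(path, "wb") as f:
    pickle.dump(artifact, f)

# Load it back for reuse at inference time
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded["name"])
```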
- This is the link to the recorded presentation.
NOTE: The video gives the most comprehensive overview of my methodology, but a PDF of the presentation slides, named presentation-slides.pdf, is in the presentation-slides directory.
The data used for training can be accessed here.
On the OpenClass leaderboard the team name I used was "I guess this'll work"
To run the run_model.py script, which will train a model and print out the F1 score, first run the installation line below (to build the Docker image) and then run the following command:
```sh
# executed from within the 'code' directory
docker run ling-539/final-project-dalcantara7 python scripts/run_model.py YOUR_TRAINING_FILE YOUR_TESTING_FILE
```
Dependencies are listed in requirements.txt, and the Dockerfile is set up to install them.
```sh
# executed from within the `code` directory:
docker build -f Dockerfile -t "ling-539/final-project:latest" .
```