
Text_Classification_ML-DL

This is an end-to-end NLP project based on text classification. We have created a real-time web application that takes input text from the user and predicts its diagnosis out of 10 predefined labels. To see the project, please visit: https://conditionpredictor.herokuapp.com/

Developed by

For more details on the project, see the report and the online notebooks.


Online Notebooks:

  1. EDA

  2. Text cleaning, hand-made features, label encoding, train-test split

  3. a. Text representations (BOW and one-hot)

    b. Text representations (tf-idf) and the analysis of most correlated words

  4. a. ML classification with 3 text representations (Naive-Bayes)

    b. ML classification with 3 text representations (Random Forest)

    c. ML classification with 3 text representations (Logistic Regression)

    d. ML classification with 3 text representations (Support Vector Machines)

    e. ML classification with 3 text representations (XGBoost Classifier)

    f. Training word embeddings and a vanilla neural network

    g. Fine-tuning GloVe (6B, 50d) embeddings and a CNN

  5. a. Comparing models and deciding the best one

    b. Topic modeling with LDA, WordClouds and PCA

  6. Utility functions for training models


Repo Content and Implementation Steps:

1.EDA

  • Exploratory Data Analysis for gaining insights into the data. We examined the dataset using univariate and bivariate analysis.
  • We also determined which attribute to use as the target.

2.feature engineering

  • We remove punctuation, special characters, numbers, and possessive pronouns (a sketch of the full cleaning pipeline follows this list).

  • We convert all letters to lowercase.

  • All reviews are processed with the WordNet lemmatizer from the NLTK library to convert words into their dictionary forms.

  • Stopwords are removed using NLTK's stopword list.

  • Some text-based features are computed, such as word count, unique word count, letter count, punctuation count, count of upper-cased words, number count, stopword count, and average word length. Bivariate analysis and a correlation matrix of these features are also produced. However, we chose not to use them as training features since we want predictions to be based on semantics.

  • Finally, label encoding of the target attribute and a 75/25 train-test split are performed.
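
A minimal sketch of this preprocessing stage, assuming a pandas DataFrame with a `review` text column and a `condition` target column (the file and column names are illustrative, not the repo's actual ones):

```python
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_review(text):
    text = text.lower()                    # lowercase everything
    text = re.sub(r"'s\b", "", text)       # drop possessive 's
    text = re.sub(r"[^a-z\s]", " ", text)  # strip punctuation, digits, special chars
    lemmas = (lemmatizer.lemmatize(tok) for tok in text.split())
    return " ".join(tok for tok in lemmas if tok not in stop_words)

df = pd.read_csv("reviews.csv")            # hypothetical file name
df["clean_review"] = df["review"].apply(clean_review)

y = LabelEncoder().fit_transform(df["condition"])  # encode the 10 labels as 0..9
X_train, X_test, y_train, y_test = train_test_split(
    df["clean_review"], y, test_size=0.25, random_state=42)  # 75/25 split
```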

3.text representation

  • We implemented three text representation techniques (one-hot, BOW, and tf-idf vectorizers from scikit-learn) in two different notebooks.

  • The most correlated unigrams and bigrams for each label are identified with the chi-squared (chi2) method.
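
A sketch of the three vectorizers and the chi2 analysis with scikit-learn, reusing `X_train`/`y_train` from the sketch above (the n-gram range and top-5 cutoff are illustrative choices):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import chi2

bow = CountVectorizer()                       # bag-of-words counts
onehot = CountVectorizer(binary=True)         # 0/1 word presence ("one-hot")
tfidf = TfidfVectorizer(ngram_range=(1, 2))   # tf-idf over unigrams + bigrams

X_tfidf = tfidf.fit_transform(X_train)

# Most correlated unigrams/bigrams per label via the chi-squared statistic.
features = tfidf.get_feature_names_out()
for label in np.unique(y_train):
    scores, _ = chi2(X_tfidf, y_train == label)  # one-vs-rest correlation
    top = features[np.argsort(scores)[-5:]]
    print(label, "->", ", ".join(top))
```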

4.a.ML classification algorithms

  • 5 ML models (Naive Bayes, Logistic Regression, SVM, Random Forest, XGBoost) are trained on each of the 3 text representations with 5-fold stratified cross-validation and grid search, as sketched below.
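
One representative combination as a sketch: tf-idf features with logistic regression, tuned by grid search over 5 stratified folds (the parameter grid is illustrative; the same pattern applies to the other four models and representations):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},   # illustrative grid
    cv=cv,
    scoring="accuracy",
)
grid.fit(X_tfidf, y_train)                # X_tfidf/y_train from the sketch above
print(grid.best_params_, grid.best_score_)
```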

4.b.DL classification algorithms

  • We also tried word embeddings with deep learning models. In both experiments we used ReLU as the activation function for all hidden layers. Two approaches were run on Kaggle GPU notebooks: an Embedding layer in a vanilla NN to train our own vectors, and pretrained GloVe vectors + a CNN (sketches of both architectures follow this list).

  • First, we trained our own word embeddings using the Embedding layer (dimension = 50) of Keras. In this experiment we used a vanilla neural network: GlobalMaxPooling layer + Dropout (0.5) layer + Dense layer with 50 nodes + softmax layer with 10 nodes.

  • As a second experiment, we used pretrained word embeddings (GloVe 6B, 50d) with a CNN model. The CNN architecture is: Embedding layer (trainable=True for fine-tuning) + Conv1D (filters=50, kernel size=3) + MaxPooling1D (3) layer + Dropout (0.5) layer + Dense layer with 50 nodes + softmax layer with 10 nodes.
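
A Keras sketch of the first architecture, assuming illustrative vocabulary and sequence sizes (`VOCAB_SIZE` and `MAX_LEN` are not from the repo):

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 20000, 200, 10   # assumed sizes

vanilla_nn = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 50, input_length=MAX_LEN),  # trained from scratch
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),
    layers.Dense(50, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
vanilla_nn.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```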
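
And a sketch of the second architecture, with the GloVe weights loaded into a trainable Embedding layer; the `Flatten` before the dense head is an assumption, since the layer list above does not say how the sequence axis is collapsed:

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 20000, 200, 10   # assumed sizes
glove_matrix = np.zeros((VOCAB_SIZE, 50))  # in practice, filled from glove.6B.50d.txt

glove_cnn = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 50, weights=[glove_matrix],
                     input_length=MAX_LEN, trainable=True),  # trainable => fine-tuning
    layers.Conv1D(filters=50, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=3),
    layers.Dropout(0.5),
    layers.Flatten(),                       # assumption: collapse the sequence axis
    layers.Dense(50, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
glove_cnn.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
```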

5.a.classification results

  • The neural networks are the most successful models on the training set, but on the test set their accuracy scores fall slightly behind RF and XGBoost. On the other hand, the training scores of RF and XGBoost are lower than their test scores, which is unusual behaviour, so we chose the GloVe CNN model over them for our app.

5.b.topic_modeling

  • Latent Dirichlet Allocation (LDA) is an unsupervised topic-modeling method used to cluster texts into topics. We ran LDA on the tf-idf representation (a sketch follows this list).

  • We plot WordClouds to see the words belonging to each topic and to compare the results of LDA.

  • We also visualize the tf-idf vectors in 2D and 3D (with Plotly), using PCA for dimensionality reduction.
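
A sketch of this step with scikit-learn, reusing the `tfidf`/`X_tfidf` objects from earlier (the topic and component counts are illustrative):

```python
from sklearn.decomposition import PCA, LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, random_state=42)
doc_topics = lda.fit_transform(X_tfidf)

# Top words per topic; these word lists can be fed into per-topic WordClouds.
terms = tfidf.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    print(f"topic {k}:", ", ".join(terms[weights.argsort()[-8:]]))

# 2-D PCA projection of the tf-idf vectors (densifying is fine at this scale).
coords_2d = PCA(n_components=2).fit_transform(X_tfidf.toarray())
```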

6.best model predictor

  • A Python script that makes predictions with the best model (GloVe+CNN) for testing purposes. It preprocesses the input, converts the string to an array with the pretrained Keras tokenizer, runs the pretrained CNN model on it, and prints the predictions to the screen.
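
A sketch of that flow; the artifact file names and `maxlen` are assumptions, and `clean_review` is the cleaning function sketched in step 2:

```python
import pickle

import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

with open("tokenizer.pkl", "rb") as f:   # hypothetical artifact names
    tokenizer = pickle.load(f)
glove_cnn = load_model("glove_cnn.h5")

def predict_condition(text):
    cleaned = clean_review(text)                   # same preprocessing as training
    seq = tokenizer.texts_to_sequences([cleaned])  # string -> integer ids
    padded = pad_sequences(seq, maxlen=200)        # assumed sequence length
    probs = glove_cnn.predict(padded)[0]
    return int(np.argmax(probs))                   # index of the predicted label
```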

7.application deployment

  • In the deployment phase, Dash and Heroku are used to build and deploy the app. Dash is a framework for creating single-page web applications on top of Flask and Plotly, and Heroku is a free deployment platform.

  • The input text is preprocessed and mapped into the pretrained word-embedding space; the CNN classifier is then applied to predict its condition.
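
A minimal Dash sketch of such an app: a text area, a button, and a callback that runs the predictor (component ids and layout are illustrative, not the deployed app's actual code):

```python
from dash import Dash, Input, Output, State, dcc, html

app = Dash(__name__)
app.layout = html.Div([
    dcc.Textarea(id="review-input", style={"width": "100%"}),
    html.Button("Predict", id="predict-btn"),
    html.Div(id="prediction-output"),
])

@app.callback(Output("prediction-output", "children"),
              Input("predict-btn", "n_clicks"),
              State("review-input", "value"),
              prevent_initial_call=True)
def on_predict(n_clicks, text):
    # predict_condition is the helper sketched in step 6.
    return f"Predicted condition: {predict_condition(text)}"

if __name__ == "__main__":
    app.run_server()   # on Heroku, gunicorn serves app.server via a Procfile
```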

text-classification-utility-functions.ipynb

  • A notebook containing two functions for displaying test scores and tuning hyperparameters.
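
A sketch of what such utilities typically look like (the names and signatures here are assumptions, not the notebook's actual functions):

```python
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV

def display_test_scores(model, X_test, y_test):
    """Print overall accuracy and a per-class report for a fitted model."""
    preds = model.predict(X_test)
    print("accuracy:", accuracy_score(y_test, preds))
    print(classification_report(y_test, preds))

def tune_hyperparameters(model, param_grid, X, y, cv=5):
    """Grid-search wrapper that returns the refit best estimator."""
    grid = GridSearchCV(model, param_grid, cv=cv).fit(X, y)
    print("best params:", grid.best_params_)
    return grid.best_estimator_
```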

report.pdf

  • A detailed report on the project goals and implementation steps, including explanations of the methods used.
