
Text_Classification_ML-DL

This is an end-to-end NLP project based on text classification. We have created a real-time web application that takes input text from the user and predicts its diagnosis out of 10 predefined labels. To see the project, please visit: https://conditionpredictor.herokuapp.com/

Developed by

For more details on the project, see the report and the online notebooks.


Online Notebooks:

  1. EDA

  2. Text cleaning, hand-made features, label encoding, train-test split

  3. a. Text representations (BOW and one-hot)

    b. Text representations (tf-idf) and the analysis of most correlated words

  4. a. ML classification with 3 text representations (Naive-Bayes)

    b. ML classification with 3 text representations (Random Forest)

    c. ML classification with 3 text representations (Logistic Regression)

    d. ML classification with 3 text representations (Support Vector Machines)

    e. ML classification with 3 text representations (XGBoost Classifier)

    f. Training word embeddings and a vanilla neural network

    g. Fine-tuning GloVe (6B, 50d) embeddings and a CNN

  5. a. Comparing models and deciding the best one

    b. Topic modeling with LDA, WordClouds and PCA

  6. Utility functions for training models


Repo Content and Implementation Steps:

1.EDA

  • Exploratory Data Analysis for gaining insights into the data. We examined the dataset using univariate and bivariate analysis.
  • We also determined which attribute to use as the target.

2.feature engineering

  • We remove punctuation, special characters, numbers, and possessive pronouns (a sketch of the full cleaning pipeline follows this list).

  • We convert all letters to lowercase.

  • All reviews are processed with the WordNet lemmatizer from the NLTK library to convert words into their dictionary forms.

  • Stopwords are removed using NLTK's stopword list.

  • Some text-based features are computed, such as word count, unique word count, letter count, punctuation count, count of upper-cased words, number count, stopword count, and average word length. Bivariate analysis and a correlation matrix of these features are also produced. However, we chose not to use them as training features since we want predictions to be based on semantics.

  • Finally, label encoding of the target attribute and a 75/25 train-test split are performed.
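
A minimal sketch of this preprocessing stage, assuming a pandas DataFrame with a `review` text column and a `condition` target column (the file and column names are illustrative, not the repo's actual ones):

```python
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_review(text):
    text = text.lower()                    # lowercase everything
    text = re.sub(r"'s\b", "", text)       # drop possessive 's
    text = re.sub(r"[^a-z\s]", " ", text)  # strip punctuation, digits, special chars
    lemmas = (lemmatizer.lemmatize(tok) for tok in text.split())
    return " ".join(tok for tok in lemmas if tok not in stop_words)

df = pd.read_csv("reviews.csv")            # hypothetical file name
df["clean_review"] = df["review"].apply(clean_review)

y = LabelEncoder().fit_transform(df["condition"])  # encode the 10 labels as 0..9
X_train, X_test, y_train, y_test = train_test_split(
    df["clean_review"], y, test_size=0.25, random_state=42)  # 75/25 split
```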

3.text representation

  • We implemented three text representation techniques (one-hot, BOW, and tf-idf vectorizers from scikit-learn) in two different notebooks.

  • The most correlated unigrams and bigrams for each label are identified with the chi-squared (chi2) method.
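
A sketch of the three vectorizers and the chi2 analysis with scikit-learn, reusing `X_train`/`y_train` from the sketch above (the n-gram range and top-5 cutoff are illustrative choices):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import chi2

bow = CountVectorizer()                       # bag-of-words counts
onehot = CountVectorizer(binary=True)         # 0/1 word presence ("one-hot")
tfidf = TfidfVectorizer(ngram_range=(1, 2))   # tf-idf over unigrams + bigrams

X_tfidf = tfidf.fit_transform(X_train)

# Most correlated unigrams/bigrams per label via the chi-squared statistic.
features = tfidf.get_feature_names_out()
for label in np.unique(y_train):
    scores, _ = chi2(X_tfidf, y_train == label)  # one-vs-rest correlation
    top = features[np.argsort(scores)[-5:]]
    print(label, "->", ", ".join(top))
```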

4.a.ML classification algorithms

  • 5 ML models (Naive Bayes, Logistic Regression, SVM, Random Forest, XGBoost) are trained on each of the 3 text representations with 5-fold stratified cross-validation and grid search, as sketched below.
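
One representative combination as a sketch: tf-idf features with logistic regression, tuned by grid search over 5 stratified folds (the parameter grid is illustrative; the same pattern applies to the other four models and representations):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},   # illustrative grid
    cv=cv,
    scoring="accuracy",
)
grid.fit(X_tfidf, y_train)                # X_tfidf/y_train from the sketch above
print(grid.best_params_, grid.best_score_)
```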

4.b.DL classification algorithms

  • We also tried word embeddings with deep learning models. In both experiments we used ReLU as the activation function for all hidden layers. Two approaches were run on Kaggle GPU notebooks: an Embedding layer in a vanilla NN to train our own vectors, and pretrained GloVe vectors + a CNN (sketches of both architectures follow this list).

  • First, we trained our own word embeddings using the Embedding layer (dimension = 50) of Keras. In this experiment we used a vanilla neural network: GlobalMaxPooling layer + Dropout (0.5) layer + Dense layer with 50 nodes + softmax layer with 10 nodes.

  • As a second experiment, we used pretrained word embeddings (GloVe 6B, 50d) with a CNN model. The CNN architecture is: Embedding layer (trainable=True for fine-tuning) + Conv1D (filters=50, kernel size=3) + MaxPooling1D (3) layer + Dropout (0.5) layer + Dense layer with 50 nodes + softmax layer with 10 nodes.
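
A Keras sketch of the first architecture, assuming illustrative vocabulary and sequence sizes (`VOCAB_SIZE` and `MAX_LEN` are not from the repo):

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 20000, 200, 10   # assumed sizes

vanilla_nn = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 50, input_length=MAX_LEN),  # trained from scratch
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),
    layers.Dense(50, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
vanilla_nn.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```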
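
And a sketch of the second architecture, with the GloVe weights loaded into a trainable Embedding layer; the `Flatten` before the dense head is an assumption, since the layer list above does not say how the sequence axis is collapsed:

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 20000, 200, 10   # assumed sizes
glove_matrix = np.zeros((VOCAB_SIZE, 50))  # in practice, filled from glove.6B.50d.txt

glove_cnn = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 50, weights=[glove_matrix],
                     input_length=MAX_LEN, trainable=True),  # trainable => fine-tuning
    layers.Conv1D(filters=50, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=3),
    layers.Dropout(0.5),
    layers.Flatten(),                       # assumption: collapse the sequence axis
    layers.Dense(50, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
glove_cnn.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
```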

5.a.classification results

  • The neural networks are the most successful models on the training set, but on the test set their accuracy scores fall slightly behind RF and XGBoost. On the other hand, the training scores of RF and XGBoost are lower than their test scores, which is unusual behaviour, so we chose the GloVe CNN model over them for our app.

5.b.topic_modeling

  • Latent Dirichlet Allocation (LDA) is an unsupervised topic-modeling method used to cluster texts into topics. We ran LDA on the tf-idf representation (a sketch follows this list).

  • We plot WordClouds to see the words belonging to each topic and to compare the results of LDA.

  • We also visualize the tf-idf vectors in 2D and 3D (with Plotly), using PCA for dimensionality reduction.
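
A sketch of this step with scikit-learn, reusing the `tfidf`/`X_tfidf` objects from earlier (the topic and component counts are illustrative):

```python
from sklearn.decomposition import PCA, LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, random_state=42)
doc_topics = lda.fit_transform(X_tfidf)

# Top words per topic; these word lists can be fed into per-topic WordClouds.
terms = tfidf.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    print(f"topic {k}:", ", ".join(terms[weights.argsort()[-8:]]))

# 2-D PCA projection of the tf-idf vectors (densifying is fine at this scale).
coords_2d = PCA(n_components=2).fit_transform(X_tfidf.toarray())
```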

6.best model predictor

  • A Python script that makes predictions with the best model (GloVe+CNN) for testing purposes. It preprocesses the input, converts the string to an array with the pretrained Keras tokenizer, runs the pretrained CNN model on it, and prints the predictions to the screen.
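
A sketch of that flow; the artifact file names and `maxlen` are assumptions, and `clean_review` is the cleaning function sketched in step 2:

```python
import pickle

import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

with open("tokenizer.pkl", "rb") as f:   # hypothetical artifact names
    tokenizer = pickle.load(f)
glove_cnn = load_model("glove_cnn.h5")

def predict_condition(text):
    cleaned = clean_review(text)                   # same preprocessing as training
    seq = tokenizer.texts_to_sequences([cleaned])  # string -> integer ids
    padded = pad_sequences(seq, maxlen=200)        # assumed sequence length
    probs = glove_cnn.predict(padded)[0]
    return int(np.argmax(probs))                   # index of the predicted label
```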

7.application deployment

  • In the deployment phase, Dash and Heroku are used to build and deploy the app. Dash is a framework for creating single-page web applications on top of Flask and Plotly, and Heroku is a free deployment platform.

  • The input text is preprocessed and mapped into the pretrained word-embedding space; the CNN classifier is then applied to predict its condition.
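
A minimal Dash sketch of such an app: a text area, a button, and a callback that runs the predictor (component ids and layout are illustrative, not the deployed app's actual code):

```python
from dash import Dash, Input, Output, State, dcc, html

app = Dash(__name__)
app.layout = html.Div([
    dcc.Textarea(id="review-input", style={"width": "100%"}),
    html.Button("Predict", id="predict-btn"),
    html.Div(id="prediction-output"),
])

@app.callback(Output("prediction-output", "children"),
              Input("predict-btn", "n_clicks"),
              State("review-input", "value"),
              prevent_initial_call=True)
def on_predict(n_clicks, text):
    # predict_condition is the helper sketched in step 6.
    return f"Predicted condition: {predict_condition(text)}"

if __name__ == "__main__":
    app.run_server()   # on Heroku, gunicorn serves app.server via a Procfile
```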

text-classification-utility-functions.ipynb

  • A notebook containing two functions for displaying test scores and tuning hyperparameters.
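
A sketch of what such utilities typically look like (the names and signatures here are assumptions, not the notebook's actual functions):

```python
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV

def display_test_scores(model, X_test, y_test):
    """Print overall accuracy and a per-class report for a fitted model."""
    preds = model.predict(X_test)
    print("accuracy:", accuracy_score(y_test, preds))
    print(classification_report(y_test, preds))

def tune_hyperparameters(model, param_grid, X, y, cv=5):
    """Grid-search wrapper that returns the refit best estimator."""
    grid = GridSearchCV(model, param_grid, cv=cv).fit(X, y)
    print("best params:", grid.best_params_)
    return grid.best_estimator_
```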

report.pdf

  • A detailed report on the project goals and implementation steps, including explanations of the methods used.
