Predict Customer Churn - with Spark for Python

Table of Contents

  1. Overview
  2. Installations
  3. Data
  4. Modelling
  5. Results & Thoughts
  6. Conclusion & Acknowledgements

Overview

The Capstone project for Udacity's Data Scientist Nanodegree. This project involves predicting Customer Churn for a hypothetical music streaming app, Sparkify, using Spark's MLlib to engineer features and build a classification model. The dataset used here is a medium-sized version (248 MB, with 544,000 rows) of the full 12 GB dataset.
The project was developed on IBM Cloud's Watson Studio, with the data uploaded to a cluster and worked on in a Python 3.7/Spark 3.0 enabled Jupyter Notebook.

Using pyspark, the project broadly involves the following:

  • Loading and cleaning the data (a minimal sketch follows this list).
  • Exploratory Data Analysis - the data is explored in depth, with visualizations that help select features and build a better understanding of the data.
  • Feature Engineering - appropriate features are selected based on the EDA, and new features are created from the existing data, producing a final dataset ready for training with Spark's ML.
  • Modelling - four different classification algorithms are tested, each chosen for its characteristics, and evaluated using a defined metric. This section also covers further tuning of the model with the highest potential (based on that metric).
  • Concluding Remarks - a summary of the whole project, along with some thoughts and possible improvements for scaling to the full dataset.
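
A minimal sketch of the loading and cleaning step (the file name and the cleaning rule are assumptions for illustration, not the notebook's exact code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("Sparkify").getOrCreate()

    # Load the event log (file name assumed for illustration).
    df = spark.read.json("medium-sparkify-event-data.json")

    # Drop rows without a usable userId (the raw log contains empty strings).
    df_clean = df.filter((df.userId.isNotNull()) & (df.userId != ""))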

Installations

  • Python 3.6+
  • pyspark.*
  • Jupyter - available through this link, or IBM Watson Studio (Lite)

Data

The data is provided by Udacity. Due to its size, it is not available in this repository.
A broad overview of the raw data:

  • artist - the artist of the track played
  • auth - indicates whether the user has cancelled the subscription
  • firstName - first name of the user
  • gender - gender of the user
  • itemInSession - item ID for each row recorded within a session
  • lastName - last name of the user
  • length - length (in seconds) of the song played
  • level - the subscription level of the user (free or paid)
  • location - location of the user (city and state)
  • page - the page in Sparkify the user visited in each session
  • song - the song listened to by the user in each session
  • ts - the timestamp of each event (row)
  • userAgent - the user agent used by the user to visit Sparkify
  • userId - unique number identifying each user

The target variable for modelling, which indicates whether a customer has churned, is derived from the Cancellation Confirmation event in the page column. This is then encoded as a binary Churn variable, as sketched below.
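
A minimal pyspark sketch of this labelling step (variable names such as df_clean are assumptions carried over from the loading sketch above):

    from pyspark.sql import functions as F

    # Users with a "Cancellation Confirmation" event are the churned users.
    churned = (df_clean
               .filter(F.col("page") == "Cancellation Confirmation")
               .select("userId").distinct()
               .withColumn("Churn", F.lit(1)))

    # Left-join the flag back onto the full log; non-churned users get 0.
    df_labeled = (df_clean.join(churned, on="userId", how="left")
                  .fillna(0, subset=["Churn"]))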

For more information about the data, visit the Jupyter Notebook above!

Modelling

  1. The classification algorithms tested here are RandomForestClassifier, Gradient Boosted Trees Classifier (GBTClassifier), LogisticRegression, and Linear Support Vector Machines (LinearSVM). Thorough documentation for these algorithms and their application is available in the Spark ML documentation.
  2. The models are evaluated on the chosen performance metric, the F1-score, and, for the love of experimentation, 3 out of the 4 were taken forward for hyperparameter tuning, which forms the second part of this section of the notebook. For a more detailed discussion of the algorithms selected for tuning and their parameters, visit the Modelling section linked in the Jupyter Notebook above. A rough sketch of the evaluation setup follows this list.
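
As a rough illustration only, an F1-based evaluation and tuning setup in Spark ML might look like this (the parameter grid, fold count, and the train/test DataFrames with a vector-assembled "features" column are assumptions):

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    rf = RandomForestClassifier(featuresCol="features", labelCol="Churn")
    evaluator = MulticlassClassificationEvaluator(labelCol="Churn",
                                                  metricName="f1")

    # A small, assumed grid; the notebook's actual grid may differ.
    grid = (ParamGridBuilder()
            .addGrid(rf.numTrees, [20, 50])
            .addGrid(rf.maxDepth, [5, 10])
            .build())

    cv = CrossValidator(estimator=rf, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)

    model = cv.fit(train)                           # `train` is assumed
    f1 = evaluator.evaluate(model.transform(test))  # `test` is assumed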

Results & Thoughts

The Random Forest classifier performed the best, with an F1-score of approximately 88.5%. The other algorithms performed reasonably well too. However:

  • This result must be taken with a grain of salt, as the target variable Churn is imbalanced. Even though the F1-score accounts for false positives and false negatives, the result should be treated cautiously before scaling the model to the entire 12 GB dataset.
  • SMOTE-like oversampling, or undersampling to equalize the size of the two classes, could reduce this bias, but these techniques come with disadvantages of their own; they are discussed in the last section of the notebook above. A simple undersampling sketch follows this list.
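
For illustration, random undersampling of the majority class could be sketched as below (user_features is an assumed per-user DataFrame carrying the Churn label; SMOTE itself has no built-in pyspark implementation):

    from pyspark.sql import functions as F

    # Split the assumed per-user feature table by class.
    minority = user_features.filter(F.col("Churn") == 1)
    majority = user_features.filter(F.col("Churn") == 0)

    # Randomly downsample the majority class to roughly the minority's size.
    ratio = minority.count() / majority.count()
    balanced = minority.union(
        majority.sample(withReplacement=False, fraction=ratio, seed=42))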

Conclusion & Acknowledgements

Thanks to Udacity for a wonderful opportunity to learn a new technology and gain hands-on experience applying it. This nanodegree is nearing its end, but the journey as a Data Scientist begins! So, a big shoutout to them!

Customer churn is a relatively common business/product problem in real-world Data Science applications, and the dataset used here is curated, near-real data streamed from the Sparkify app. Predicting churn can potentially save a company millions in revenue, since incentives can be offered before customers churn to keep them subscribed. Using Spark's high-level ML library was one of the best parts of the project: it offered the opportunity to learn each algorithm in depth, analyze its applicability to the problem at hand, and then compare the actual result with the expected one.

To get a brief overview of the project apart from the notebook and this documentation, do visit my blog here!
