The Capstone project for Udacity's Data Scientist Nanodegree. This project involves predicting Customer Churn for a hypothetical music streaming app Sparkify, using Spark's MLlib to engineer features and build a classification model. The dataset used here is a medium-sized (248 MB, with 544,000 rows) version of the whole dataset (which is 12 GB).
The project was developed on IBM Cloud's Watson Studio, with the data uploaded to the cluster, using a Python 3.7/Spark 3.0-enabled Jupyter Notebook.
Using `pyspark`, the project broadly involves the following:
- Loading and cleaning the data.
- Exploratory Data Analysis - the data is explored in depth, with visualizations that aid feature selection and build a general understanding of the data.
- Feature Engineering - appropriate features are selected based on the EDA, as well as creating new features from existing data, to create a final dataset ready for training using Spark's ML.
- Modelling - four different classification algorithms are tested, based on their characteristics, and evaluated using metrics. This section also includes further tuning of the model with the highest potential (based on the metric defined).
- Concluding Remarks - This section summarizes the whole project, along with a couple of thoughts and possible improvements to the project for further scalability to the whole dataset.
- Python 3.6+
- `pyspark`
- Jupyter - available through this link, or IBM Watson Studio (Lite)
The data is provided by Udacity. Due to its size, it is not available in this repository.
A broad overview of the raw data:
- `artist` - the artist of the soundtrack
- `auth` - variable indicating whether the user has cancelled the subscription or not
- `firstName` - first name of the user
- `gender` - gender of the user
- `ItemInSession` - item ID for each session (row) recorded
- `lastName` - last name of the user
- `length` - length of each session by the user
- `level` - the subscription level of the user (free trial or paid)
- `location` - location data of the user (city and state)
- `page` - the page in Sparkify the user visited in each session
- `song` - the song listened to in each session, by the user
- `ts` - the timestamp of each session
- `userAgent` - the user agent used by the user to visit Sparkify
- `userId` - unique number identifying each user
The target variable for modelling, which indicates whether the customer has churned or not, is derived from the `Cancellation Confirmation` event in the `page` column. This is then dummy-fied into a `Churn` variable.
For more information about the data, visit the Jupyter Notebook above!
- The classification algorithms tested here are `RandomForestClassifier`, Gradient-Boosted Trees (`GBTClassifier`), `LogisticRegression`, and Linear Support Vector Machines (`LinearSVC`). Thorough documentation for the algorithms and their application is available in the Spark ML documentation.
- The models are evaluated on the F1-score as the performance metric and, for the love of experimentation, I took 3 out of the 4 for hyperparameter tuning, which forms the second part of this section of the notebook. For a more detailed discussion of the parameters and algorithms selected for tuning, visit the Modelling section linked in the Jupyter Notebook above!
The Random Forest classifier performed the best, with an F1-score of approximately 88.5%. The other algorithms performed reasonably well too. However:
- This result must be taken with a grain of salt, as the target variable `Churn` is imbalanced. Even though the F1-score accounts for false positives and false negatives, we need to be wary of the result before scaling the model to the entire 12 GB dataset, due to the imbalance.
- SMOTE-like oversampling, or undersampling to equalize the sizes of the two classes, could yield a less biased model, but these techniques come with disadvantages of their own; they are discussed in the last section of the notebook above.
Thanks to Udacity for a wonderful opportunity to learn a new technology and gain hands-on experience with it. This nanodegree is nearing its end, but the journey as a Data Scientist begins! So, a big shoutout to them!
Customer Churn is a common business/product problem in real-world Data Science applications, and the dataset used here is curated, near-real data streamed from the app Sparkify. Predicting churn can potentially save a company millions in revenue, as incentives can be offered before customers churn, to keep them subscribed. Using Spark's high-level ML library was one of the best parts of the project: it offered the opportunity to learn about each algorithm in depth, analyze its applicability to the problem at hand, and then compare the expected result with the actual one.
To get a brief overview of the project apart from the notebook and this documentation, do visit my blog here!