- This project builds a predictive model, delivered as a dashboard tool, that predicts whether a customer will cancel their booking.
- The link to the deployed dashboard is: PP5 Hotel Hocuspocus
- GitHub Project Management was used to manage this project via its kanban board feature.
- Link to the kanban board
- Link to the completed and closed tasks/issues.
- Link to the uncompleted and open tasks/issues in the backlog, which will be implemented as future features.
- In addition to Agile methodology, the Cross Industry Standard Process for Data Mining (CRISP-DM) guidelines were observed to help with the workflow of this project. The workflow has the following six stages:
- Business understanding
- Data understanding
- Data preparation
- Modelling
- Evaluation
- Deployment
- Determine business requirements through extensive discussions with the client.
- Identify, collect, and analyse the dataset that can accomplish the business requirements.
- Prepare the data for modelling through cleaning and feature engineering.
- Build and assess various models using several modelling techniques and hyperparameter optimisation.
- Evaluate which model best meets the business requirements and decide what to do next: proceed to deployment or iterate further.
- Develop a web application using Streamlit, taking into account business requirements and principles for good user experience. The app is deployed through the Heroku cloud hosting service.
- The dataset is from Kaggle.
- A fictitious business scenario was created based on the dataset to conduct conventional data analysis and develop an ML predictive model to meet business requirements.
- This dataset contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data.
- The dataset from Kaggle originally contains 100k+ entries, but in this project it has been scaled down to 8k entries to keep the model pipeline small enough to push to the repository. It contains the following 32 variables:
No. | Variable | Description | Values or range |
---|---|---|---|
1 | hotel | Type of hotel | Resort Hotel, City Hotel |
2 | is_canceled | Reservation cancellation status | 0 = not cancelled, 1 = cancelled |
3 | lead_time | Number of days between booking and arrival | 0-629 days |
4 | arrival_date_year | Year of arrival | 2015-2017 |
5 | arrival_date_month | Month of arrival | January-December |
6 | arrival_date_week_number | Week number of the year for arrival | 1-53 |
7 | arrival_date_day_of_month | Day of the month of arrival | 1-31 |
8 | stays_in_weekend_nights | Number of weekend nights (Saturday and Sunday) the guest stayed or booked | 0-16 |
9 | stays_in_week_nights | Number of week nights the guest stayed or booked | 0-40 |
10 | adults | Number of adults | 0-27 |
11 | children | Number of children | 0-3 |
12 | babies | Number of babies | 0-2 |
13 | meal | Type of meal booked | BB, FB, HB, SC, Undefined |
14 | country | Country of origin of the guest | Countries |
15 | market_segment | Market segment designation | Offline TA/TO, Online TA, Groups, Direct, Corporate, Complementary, Aviation |
16 | distribution_channel | Booking distribution channel | TA/TO, Direct, Corporate, GDS |
17 | is_repeated_guest | If the guest is a repeat customer | 0 = not repeated, 1 = repeated |
18 | previous_cancellations | Number of previous bookings that were canceled by the customer | 0-26 |
19 | previous_bookings_not_canceled | Number of previous bookings that were not canceled by the customer | 0-57 |
20 | reserved_room_type | Type of reserved room | A, D, E, G, F, B, P, C, H, L |
21 | assigned_room_type | Type of assigned room | A, D, E, C, G, B, K, F, P, H, I |
22 | booking_changes | Number of changes made to the booking | 0-14 |
23 | deposit_type | Type of deposit made | No Deposit, Refundable, Non Refund |
24 | agent | ID of the travel agent responsible for the booking | 1-531 |
25 | company | ID of the company responsible for the booking | 9-534 |
26 | days_in_waiting_list | Number of days the booking was in the waiting list | 0-391 |
27 | customer_type | Type of customer | Transient, Contract, Transient-Party, Group |
28 | adr | Average Daily Rate | 0-392 |
29 | required_car_parking_spaces | Number of car parking spaces required | 0-2 |
30 | total_of_special_requests | Number of special requests made | 0-5 |
31 | reservation_status | Last reservation status | Check-Out, Canceled, No-Show |
32 | reservation_status_date | Date of the last reservation status | Dates |
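The README does not say how the 100k+ rows were reduced to 8k. As a minimal sketch, assuming a plain reproducible random sample was taken, the reduction could look like the following. The `downsample` helper, the seed, and the tiny stand-in frame are all hypothetical and only illustrate the idea.

```python
import pandas as pd


def downsample(df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Return a reproducible random sample of at most n_rows rows."""
    n = min(n_rows, len(df))  # never ask for more rows than exist
    return df.sample(n=n, random_state=seed).reset_index(drop=True)


# Stand-in data; the real frame would be read from the Kaggle CSV.
full = pd.DataFrame({
    "lead_time": range(100),
    "is_canceled": [1 if i % 3 == 0 else 0 for i in range(100)],
})

small = downsample(full, n_rows=20)
print(small.shape)
```

For the real dataset the call would use `n_rows=8000`. A stratified sample on `is_canceled` would be an alternative if preserving the class balance exactly matters.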
The client wishes to develop operational plans for minimising cancellations, improving room occupancy, and maximising revenue based on data-driven insights. After discussion with the stakeholders, the following two business requirements were agreed upon:
- Using conventional data analysis, the client is interested in answering the following questions:
  - Which months have the highest number of cancellations?
  - What are the top 5 countries with the highest number of cancellations?
  - Which booking channels have the highest number of cancellations?
  - Are bookings with weekend night stays more likely to be cancelled than those with none?
- The client is interested in determining whether or not a given booking will be cancelled.
- We suspect that the majority of booking cancellations occur in the summer season.
- Plot a countplot of the number of cancellations per month in descending order.
- We suspect that there are more cancellations for bookings made through distribution partners than direct bookings.
- Plot a countplot of the number of cancellations by distribution channel in descending order.
- We suspect that bookings with weekend night stays are more likely to be cancelled than those with none.
- Plot a histogram of the distribution of stays in weekend nights.
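The "descending order" used in the validation plots above can be derived with pandas before plotting. This is a sketch on made-up rows, not the project's notebook code; the `cancellations_by` helper is a hypothetical name. The column names come from the variable table.

```python
import pandas as pd


def cancellations_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Count cancelled bookings per category, largest count first."""
    cancelled = df[df["is_canceled"] == 1]
    # value_counts sorts in descending order by default
    return cancelled[column].value_counts()


# Made-up rows for illustration only.
bookings = pd.DataFrame({
    "arrival_date_month": ["August", "August", "July", "January", "July", "August"],
    "distribution_channel": ["TA/TO", "TA/TO", "Direct", "TA/TO", "TA/TO", "TA/TO"],
    "is_canceled": [1, 1, 1, 0, 0, 1],
})

by_month = cancellations_by(bookings, "arrival_date_month")
by_channel = cancellations_by(bookings, "distribution_channel")
print(by_month)
print(by_channel)
```

The resulting index (for example `by_month.index`) can be passed as the `order` argument of Seaborn's `countplot` so the bars appear from most to fewest cancellations.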
- Business Requirement 1: Conventional Data Analysis and Visualization
- We will conduct conventional data analysis on the following variables: arrival_date_month, country, distribution_channel, and stays_in_weekend_nights to answer the questions from stakeholders.
- We will plot the above variables against is_canceled to visualise insights.
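Two of the stakeholder questions, the top 5 countries by cancellations and whether weekend stays are cancelled more often, reduce to simple aggregations. The sketch below uses made-up rows and hypothetical labels ("with weekend nights", "no weekend nights"); it shows one possible way to compute the answers, not necessarily the approach taken in the notebooks.

```python
import pandas as pd

# Made-up rows for illustration only.
bookings = pd.DataFrame({
    "country": ["PRT", "PRT", "GBR", "ESP", "PRT", "GBR"],
    "stays_in_weekend_nights": [0, 2, 1, 0, 0, 2],
    "is_canceled": [1, 1, 0, 1, 0, 1],
})

# Top 5 countries by number of cancelled bookings.
cancelled = bookings[bookings["is_canceled"] == 1]
top_countries = cancelled["country"].value_counts().head(5)

# Share of bookings cancelled, with and without weekend nights.
weekend_flag = (bookings["stays_in_weekend_nights"] > 0).map(
    {True: "with weekend nights", False: "no weekend nights"}
)
rate_by_weekend = bookings.groupby(weekend_flag)["is_canceled"].mean()

print(top_countries)
print(rate_by_weekend)
```

Comparing cancellation rates rather than raw counts avoids favouring whichever group simply has more bookings.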
- Business Requirement 2: Classification for Predicting Cancellation
- We want to predict if a customer will cancel their booking or not. We want to build a binary classifier.
- We want an ML model to predict if a customer will cancel their booking or not based on historical data, which doesn't include the following variables:
- company (irrelevant as it's the company's ID number)
- agent (irrelevant as it's the agent's ID number)
- country (high cardinality and might affect fitting of model)
- arrival_date_year (model must be able to generalise future bookings)
- reservation_status (directly related to is_canceled and might cause data leakage)
- reservation_status_date (same reason as above)
- assigned_room_type (same reason as above; only set in the system when guests actually check in)
- The target variable is categorical with two classes, so we use a supervised, single-label, binary classification model. The model output is 0 (not cancelled) or 1 (cancelled).
- Our ideal outcome is to provide the client's operational planning team with reliable insight into minimising cancellations, improving room occupancy, and thus maximising revenue.
- The model success metrics are:
- At least 80% Recall for cancellation on train and test sets
- Version 1 of the predictive model achieves 87% Recall for cancellation on the train set, but only 66% on the test set. This suggests that the model has overfitted as there is a considerable difference.
- Stakeholders are made aware of this performance. Future improvements, such as more extensive hyperparameter optimisation or the collection of new variables to add to the dataset, will be made to fine-tune the model's performance.
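To make the success metric concrete: recall for the cancelled class is the share of actual cancellations that the model flags. The sketch below is plain Python with hypothetical helper names (`recall`, `meets_target`); scikit-learn's `recall_score` computes the same quantity. Only the 0.87 and 0.66 figures come from this README.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def meets_target(train_recall, test_recall, threshold=0.80):
    """The success metric requires the threshold on both train and test sets."""
    return train_recall >= threshold and test_recall >= threshold


# Toy labels: three real cancellations, two of them predicted.
print(recall([1, 1, 0, 1], [1, 0, 0, 1]))

# Version 1 figures reported above: 87% on train, 66% on test.
print(meets_target(0.87, 0.66))
gap = 0.87 - 0.66  # a large train/test gap is the sign of overfitting
```

With the reported figures the check fails on the test set, which is why the model is described as overfitted.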
- Heuristics: Currently, there is no approach to predict cancellations.
- The training data to fit the model comes from Kaggle. The dataset originally contains 100k+ entries, but in this project it has been scaled down to 8k entries to minimise the size of the model pipeline.
- Train data - target: is_canceled; features: all other variables, except the ones listed above.
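The feature and target definition above can be sketched with pandas as follows. The excluded column names are the seven listed earlier; the `split_features_target` helper and the sample rows are hypothetical.

```python
import pandas as pd

# Variables excluded from the model, as listed above.
EXCLUDED = [
    "company", "agent", "country", "arrival_date_year",
    "reservation_status", "reservation_status_date", "assigned_room_type",
]
TARGET = "is_canceled"


def split_features_target(df: pd.DataFrame):
    """Drop excluded columns and separate the features from the target."""
    # errors="ignore" lets the helper work on frames missing some columns
    X = df.drop(columns=EXCLUDED + [TARGET], errors="ignore")
    y = df[TARGET]
    return X, y


# Made-up rows for illustration only.
bookings = pd.DataFrame({
    "lead_time": [10, 200, 45],
    "adr": [80.0, 120.5, 95.0],
    "country": ["PRT", "GBR", "ESP"],
    "agent": [9, 240, 14],
    "reservation_status": ["Check-Out", "Canceled", "Canceled"],
    "is_canceled": [0, 1, 1],
})

X, y = split_features_target(bookings)
print(list(X.columns))
```

From here `X` and `y` would typically be passed to a train/test split, for example scikit-learn's `train_test_split`.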
- List the questions from the client as per business requirement 1
- Checkbox: data inspection (display the number of rows and columns in the data, and display the first ten rows of the data)
- Display the answers to the questions as bulleted list.
- Checkboxes to show each plot that provides the answers to the questions.
- State business requirement 2
- A set of input widgets, relating to a customer's booking, used to predict cancellation.
- "Run predictive analysis" button that serves the booking data to the ML pipelines and predicts if the customer will cancel their booking or not.
- Before the analysis, we knew we wanted this page to describe each project hypothesis, the conclusions, and how we validated each. After the data analysis, we can report that:
- We suspect that the majority of booking cancellations occur in the summer season.
- Correct. Data analysis shows that August and July are the two months with the most cancellations. Both fall in the summer season.
- We suspect that there are more cancellations for bookings made through distribution partners than direct bookings.
- Correct. The data analysis conducted supports this hypothesis.
- We suspect that bookings with weekend night stays are more likely to be cancelled than those with none.
- Considerations and conclusions after the pipeline is trained
- Present ML pipeline steps
- Feature importance
- Pipeline performance
- Please refer to the TESTING.md file for all test-related documentation.
- There are no known unfixed bugs in this project.
- The App live link is: https://pp5-hotel-hocuspocus-10951e8f369c.herokuapp.com/
- Set the Python version in runtime.txt to one currently supported by the Heroku-20 stack.
- The project was deployed to Heroku using the following steps:
- Log in to Heroku and create an App
- At the Deploy tab, select GitHub as the deployment method.
- Select your repository name and click Search. Once it is found, click Connect.
- Select the branch you want to deploy, then click Deploy Branch.
- Deployment should proceed smoothly if all deployment files are fully functional. Click the Open App button at the top of the page to access your App.
- If the slug size is too large, add large files not required by the app to the .slugignore file.
- Jupyter: used to create the interactive notebooks in which the code for the ML project was written.
- Streamlit: used to create the dashboard and turn the project code into a web application.
- NumPy: a Python library used for working with arrays, for example to return an array of zeros with the same shape and type as a given array.
- Pandas: used for working with data sets, for example to read a comma-separated values (csv) file into a DataFrame.
- Matplotlib, Seaborn, and Plotly: used for visualization of the data by generating different types of plots.
- Pandas Profiling: used to create a comprehensive Report of the dataset to help with Exploratory Data Analysis (EDA).
- ppscore: used to determine the predictive power score between two columns.
- feature-engine: used to engineer the dataset’s variables and select features for use in the machine learning model.
- scikit-learn: a Python library for machine learning, used for example to randomly split the data into train and test sets.
- imbalanced-learn: used for handling target imbalance, e.g. SMOTE.
- The dataset is from Kaggle.
- This project was based on Code Institute's Churnometer walkthrough project.
- This Kaggle notebook by Farzad Nekouei was also consulted for ideas.
- This Kaggle notebook by Nitesh Yadav was also consulted for ideas.
- I would like to thank my mentor, Rohit Sharma, for his tips and support in completing this project.