EddieAmaitum/BrainStation_capstone_project


Taxi Demand Prediction

 photo

1. Project overview

1.1 Introduction

Demand in the taxi and car-share service industry

  • For ride-sharing companies like Uber to maximize revenue and improve customer satisfaction, one challenge they must address is taxi delays.
  • They must ensure that enough drivers (supply) are readily available to service customers (demand) at all times.
  • Companies therefore need to forecast demand.

1.2 Solution

  • My aim in this project is to use machine learning to build a model that predicts taxi demand per hour by location.
  • This way drivers can be allocated to high-demand areas at peak times and reallocated when demand falls, thus maximizing efficiency.
  • Using push notifications the drivers can be incentivized to go to specific locations.

1.3 Project Impact

  • Predictive models like this one have been shown to cut wait times by up to 20%.
  • Improved efficiency in driver deployment, which lets companies generate more revenue.
  • Increased customer satisfaction due to enhanced service reliability.

1.4 Data Dictionary

I will use the NYC Taxi & Limousine Commission (TLC) Trip Records dataset.

  • The NYC TLC website offers an extensive set of monthly taxi trip data, encompassing various records from 2009 to present day.
  • The official data dictionary can be found here.
  • The analysis starts with the January 2022 yellow_trip data; more months are added as the project progresses to build a robust model.
  • Below are the descriptions of the columns:
| Field Name | Description |
| --- | --- |
| VendorID | A code indicating the TPEP provider that provided the record. 1 = Creative Mobile Technologies, LLC; 2 = VeriFone Inc. |
| tpep_pickup_datetime | The date and time when the meter was engaged. |
| tpep_dropoff_datetime | The date and time when the meter was disengaged. |
| Passenger_count | The number of passengers in the vehicle. This is a driver-entered value. |
| Trip_distance | The elapsed trip distance in miles reported by the taximeter. |
| PULocationID | TLC Taxi Zone in which the taximeter was engaged. |
| DOLocationID | TLC Taxi Zone in which the taximeter was disengaged. |
| RateCodeID | The final rate code in effect at the end of the trip. 1 = Standard rate; 2 = JFK; 3 = Newark; 4 = Nassau or Westchester; 5 = Negotiated fare; 6 = Group ride |
| Store_and_fwd_flag | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip |
| Payment_type | A numeric code signifying how the passenger paid for the trip. 1 = Credit card; 2 = Cash; 3 = No charge; 4 = Dispute; 5 = Unknown; 6 = Voided trip |
| Fare_amount | The time-and-distance fare calculated by the meter. |
| Extra | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges. |
| MTA_tax | $0.50 MTA tax that is automatically triggered based on the metered rate in use. |
| Improvement_surcharge | $0.30 improvement surcharge assessed trips at the flag drop. The improvement surcharge began being levied in 2015. |
| Tip_amount | Tip amount. This field is automatically populated for credit card tips. Cash tips are not included. |
| Tolls_amount | Total amount of all tolls paid in trip. |
| Total_amount | The total amount charged to passengers. Does not include cash tips. |
| Congestion_Surcharge | Total amount collected in trip for NYS congestion surcharge. |
| Airport_fee | $1.25 for pick up only at LaGuardia and Jo |

  • The dataset was relatively clean.
  • I spent a good amount of my time doing feature engineering.
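
As an illustration of the kind of check a validation step might run on the raw trips, here is a minimal sketch that keeps only rows whose `tpep_pickup_datetime` falls inside the requested month. The helper name `keep_pickups_in_month` and the dict-per-row layout are assumptions made for this example, not code taken from the notebooks; only the column names come from the data dictionary. It uses the standard library alone so that it runs as-is.

```python
from datetime import datetime


def keep_pickups_in_month(rows, year, month):
    """Keep rows whose tpep_pickup_datetime falls inside the given month.

    `rows` is an iterable of dicts keyed by the TLC column names.
    Hypothetical helper: the notebooks may validate the data differently.
    """
    start = datetime(year, month, 1)
    # First instant of the following month; December rolls over to January.
    end = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)
    return [row for row in rows if start <= row["tpep_pickup_datetime"] < end]


# Made-up sample rows, for illustration only.
sample_rows = [
    {"tpep_pickup_datetime": datetime(2022, 1, 15, 8, 30), "PULocationID": 43},
    {"tpep_pickup_datetime": datetime(2021, 12, 31, 23, 59), "PULocationID": 43},
    {"tpep_pickup_datetime": datetime(2022, 2, 1, 0, 0), "PULocationID": 161},
]
valid_rows = keep_pickups_in_month(sample_rows, 2022, 1)
print(len(valid_rows))
```

Using a half-open interval (`start <= t < end`) avoids having to know how many days the month has.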

2. Solution approach

  • Set up the working environment: I use Poetry, conda, VS Code, and Git/GitHub as development tools.
  • Data preparation:
    • Find this in notebooks 01 - 05.
    • First I fetch raw data from the website and validate it.
    • I then analyze the validated data, transform it into time series data, and finally into tabular data for model building.
  • Model training:
    • Find this in notebooks 06 - 10.
    • I split the data into train and test sets.
    • I start by building baseline models, followed by more robust models.
    • I evaluate model performance using Mean Absolute Error (MAE), the average magnitude of the difference between each prediction and the true value of that observation, and iterate to obtain the best-performing model.
  • Model operationalization (MLops):
    • I'm continuously working to deploy the model as a batch scoring service.
    • Find this in notebooks 11+; this is an ongoing process.
    • Here I use Hopsworks as a feature store.
    • I use GitHub Actions to automate model runs.
    • For model deployment I use Streamlit to build a UI. This is also an ongoing process.
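
The data preparation step described above (raw trips, then an hourly time series, then tabular features and targets) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebooks' actual code: the function names `hourly_ride_counts` and `to_features_and_targets`, the choice of lagged hourly counts as features, and the sample pickups are all hypothetical, and the sketch uses only the standard library to stay self-contained.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_ride_counts(pickups, start, n_hours):
    """Count rides per (location, hour), including hours with zero rides.

    `pickups` is an iterable of (pickup_datetime, location_id) pairs.
    Returns {location_id: [ride count for each hour, beginning at `start`]}.
    """
    pickups = list(pickups)
    # Truncate each pickup time to the top of its hour before counting.
    counts = Counter(
        (loc, ts.replace(minute=0, second=0, microsecond=0)) for ts, loc in pickups
    )
    hours = [start + timedelta(hours=i) for i in range(n_hours)]
    locations = sorted({loc for _, loc in pickups})
    # A Counter returns 0 for unseen keys, which fills hours without rides.
    return {loc: [counts[(loc, h)] for h in hours] for loc in locations}


def to_features_and_targets(series, n_lags):
    """Slide a window over one location's hourly series.

    Each feature row holds the previous `n_lags` hourly counts and the
    matching target is the count for the hour that follows them.
    """
    features, targets = [], []
    for i in range(n_lags, len(series)):
        features.append(series[i - n_lags:i])
        targets.append(series[i])
    return features, targets


# Made-up pickups for a single location ID, for illustration only.
pickups = [
    (datetime(2022, 1, 1, 0, 5), 43),
    (datetime(2022, 1, 1, 0, 40), 43),
    (datetime(2022, 1, 1, 2, 15), 43),
    (datetime(2022, 1, 1, 3, 1), 43),
]
series = hourly_ride_counts(pickups, datetime(2022, 1, 1), n_hours=4)
features, targets = to_features_and_targets(series[43], n_lags=2)
print(series[43], features, targets)
```

Filling empty hours with zeros matters here: without it, quiet hours would silently disappear from the series and the lag features would no longer line up with consecutive hours.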

3. Conclusion

In this project I use time series data to build a model that predicts taxi demand per hour per location.

The dataset used was quite large and relatively clean. I spent a good amount of time on feature transformations while preparing the data for modeling.

I built 7 models in total. The best-performing model was LightGBM with hyperparameter tuning.

I used mean absolute error to evaluate model performance (the magnitude of difference between the prediction of an observation and the true value of that observation). See model performance below:

| Model | Mean Absolute Error (MAE) | Notes |
| --- | --- | --- |
| Ad Hoc model 1 | 6.05 | Baseline model |
| Ad Hoc model 2 | 3.68 | Baseline model |
| Ad Hoc model 3 | 3.19 | Baseline model |
| XGBoost | 2.70 | Models improved |
| LightGBM | 2.57 | Models improved |
| LightGBM + feature engineering | 2.59 | Added average rides per month |
| LightGBM + hyperparameter tuning | 2.54 | Best model for production (num_leaves, min_child_samples, etc) |
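
For reference, MAE is the mean of |y_i - ŷ_i| over all observations. The sketch below computes it by hand and applies it to a naive baseline that predicts each hour's demand as the previous hour's count. The baseline `predict_previous_hour` and the numbers are assumptions chosen for illustration; they are not necessarily how the "Ad Hoc" models in the table work, and they do not reproduce the table's scores.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true values and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def predict_previous_hour(features):
    """Hypothetical baseline: predict the most recent observed hourly count."""
    return [row[-1] for row in features]


# Made-up lag features (last two hourly counts) and next-hour targets.
features = [[2, 0], [0, 1], [1, 1], [1, 4]]
targets = [1, 1, 4, 2]

predictions = predict_previous_hour(features)
mae = mean_absolute_error(targets, predictions)
print(predictions, mae)
```

Because MAE is expressed in the same unit as the target, a score of 2.54 in the table reads as being off by about 2.5 rides per hour per location on average.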

4. Next steps

I hope to further improve model performance by adding more features, building pipelines to automate processes, and completing model operationalization.

5. Appendix

  • Sample Streamlit UI dashboard: hot zones with high demand are highlighted.

 photo

  • Sample time series data.

 photo

  • Sample features and targets.

 photo

About

In this repository I build a taxi demand predictor for NYC
