
This repository contains my final project for UT Austin's Data Analytics Bootcamp. My teammates and I explored a Wine Reviews dataset and built an interactive Tableau dashboard to recommend wines for a novice based on price, rating, variety, and country. We also built a machine learning model to train it to rate wine like an experienced sommelier.


whitneyshine/austin_project


wine_ut

Final Project - Vine & Vault

Table of Contents

Presentation

Predictive Wine Ratings

For this repository we chose to explore a Wine Reviews dataset compiled from Wine Enthusiast magazine. We selected this topic because we're a group of wine enthusiasts but we're certainly no sommeliers. Since wine can be complicated and overwhelming, we wanted to create a fun and interactive way for beginners to discover new wines. With this idea in mind, we built a dashboard to recommend wines for a novice based on such things as price, rating, variety, country and province. We also built a machine learning model to see if we could train it to rate wine like an experienced sommelier. For an in-depth look at our project, see our Vine & Vault presentation on Google Slides.

wine_row

Technologies Used

  • Data Cleaning and Analysis
    We performed our data transformation and analysis with Python and Pandas using Jupyter Notebook. All members of the group were familiar with Pandas so this came as an easy decision and allowed the analysis to run smoothly. See Wine_Ratings.ipynb for the code that transformed and analyzed our data.
  • Database Storage
    We used PostgreSQL for database storage. Connections to our SQL database were created in our machine learning and data analysis notebooks. Again, this decision was made due to familiarity.
  • Machine Learning
    For the machine learning portion, we chose a scikit-learn random forest model because we needed a supervised model and because random forests tend to be accurate and resist overfitting.
  • Dashboard
    We used Tableau to build our dashboard and story. To interact with the dashboard, select a country from the dropdown or set a price range with the slider in the upper left-hand corner.
  • Final Project Website
    We built a website using Bootstrap v4.1, Flask v1.0.2 and Jinja2 and hosted it on Google App Engine, giving us a single polished location to access and view all the elements of our final project. We even embedded our Tableau dashboard and Google Slides presentation.

wine_communication

Database

Dataset

Our raw dataset contained almost 130,000 rows of information that included the wine's title, grape variety, winery, country and region of origin, as well as the price per bottle, wine rating, taster name, and a description of the wine. The original data was created by Wine Enthusiast and the Wine Reviews dataset was posted on Kaggle. For this project, our team stored the data in a SQL database - see our Entity Relationship Diagram (ERD) with relationships. After we finished cleaning and transforming the data, our final dataset contained almost 115,000 rows and 12 columns.

wine_database

Machine Learning Model

Question we would like to answer with our machine learning model

Can a machine learning model be trained to rate wine like an experienced sommelier?

Machine Learning Model

We chose a random forest model since we needed a supervised learning model. Random forest algorithms work for both classification and regression problems and typically produce a higher degree of accuracy than a single decision tree. The model does a good job of avoiding overfitting and can efficiently handle large datasets like ours. The biggest downside to this type of model is computing time: it can take hours to fit to the training data, which makes it very time-consuming to optimize.

Output Label

Our machine learning model's output label is a wine rating -- a continuous value between 80 and 100 -- otherwise known as "points" in the dataset.

Data Preprocessing

Our initial dataset was fairly robust with lots of data (almost 130,000 rows and 13 columns) but offered a limited number of valuable features to analyze and explore. Therefore, we engineered the following features: 

  • We extracted the year the wine was made by searching the title column with a regular expression, then added it as an extra feature to our dataset. We focused on wines made from 2000 onward since these made up most of our dataset.
  • We used dictionary keys to search the description, variety and title columns and assigned each wine a red or white designation. We added this feature as an additional column called wine type.
  • We added a column to group ratings into 5 categories -- below average, average, good, very good and excellent. The idea was we could use these categories to add context and value to our consumer-friendly dashboard. However, we did not use this feature to train our machine learning model since it was derived from the feature we were trying to predict.

To clean and transform our dataset further:

  • We replaced null values in the region_1 column with province name and in the taster_name column with "unknown".
  • We reluctantly dropped the description, designation, title and winery columns since they presented computational challenges for our machine learning model.
  • We dropped the region_2 and taster_twitter_handle columns since they didn’t add value to our model or dashboard.
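
The feature engineering and cleaning steps above can be sketched in plain Python. This is a minimal illustration, not the code from Wine_Ratings.ipynb (which uses Pandas). The keyword lists, the rating cut-offs, and all function names are assumptions made up for the example; only the column names region_1, province and taster_name come from the dataset.

```python
import re

# Hypothetical keyword lists; the project's actual dictionaries are not shown here.
WINE_TYPE_KEYWORDS = {
    "red": ["pinot noir", "cabernet", "merlot", "syrah", "malbec"],
    "white": ["chardonnay", "riesling", "sauvignon blanc", "pinot gris"],
}

# Assumed upper bounds for the five rating categories (points run from 80 to 100).
RATING_BINS = [
    (84, "below average"),
    (87, "average"),
    (90, "good"),
    (94, "very good"),
    (100, "excellent"),
]

# Matches a standalone four-digit year from 2000 to 2099.
YEAR_PATTERN = re.compile(r"\b(20\d{2})\b")


def extract_year(title):
    """Return the first year from 2000 onward found in the title, else None."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


def wine_type(*text_fields):
    """Return 'red' or 'white' if a keyword appears in any field, else 'unknown'."""
    text = " ".join(field.lower() for field in text_fields if field)
    for label, keywords in WINE_TYPE_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return label
    return "unknown"


def rating_category(points):
    """Map a points score to one of the five (assumed) rating categories."""
    for upper_bound, label in RATING_BINS:
        if points <= upper_bound:
            return label
    raise ValueError(f"points out of range: {points}")


def fill_missing(row):
    """Fill region_1 with the province and taster_name with 'unknown' when missing."""
    cleaned = dict(row)
    if not cleaned.get("region_1"):
        cleaned["region_1"] = cleaned.get("province")
    if not cleaned.get("taster_name"):
        cleaned["taster_name"] = "unknown"
    return cleaned
```

Because the rating category is derived directly from points, it would leak the target into the model, which is why it is kept for the dashboard only.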

How the model works

See our flowchart for a broad overview of the machine learning process. First, the notebook connected to our SQL database and read the dataset into a Pandas DataFrame. Then, the data was cleaned and transformed. Once the data was ready, the categorical columns were converted to binary columns using scikit-learn’s OneHotEncoder. This tool creates a new column for each unique value in the original columns, which made the dataset considerably wider than before. The data was then split with scikit-learn’s train_test_split function into 75% training data and 25% testing data. Finally, the model was fit to the training data. This was the most time-consuming part of the process: at 100 estimators, the model took about an hour to fit.
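
To show what the encoding and splitting steps do to the data, here is a plain-Python sketch of one-hot encoding and a 75/25 split. It is an illustration of the idea only and deliberately does not use scikit-learn, so it is not the project's code. The function names, the seed, and the sample rows are all made up for the example.

```python
import random


def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0/1 column per unique value."""
    # Collect the sorted unique values of every categorical column.
    categories = {
        column: sorted({row[column] for row in rows})
        for column in categorical_columns
    }
    encoded = []
    for row in rows:
        # Keep the non-categorical columns unchanged.
        new_row = {key: value for key, value in row.items() if key not in categories}
        for column, values in categories.items():
            for value in values:
                new_row[f"{column}_{value}"] = 1 if row[column] == value else 0
        encoded.append(new_row)
    return encoded


def split_rows(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and return (training rows, testing rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical sample rows; the values are invented for illustration.
rows = [
    {"country": "US", "wine_type": "red", "price": 30.0, "points": 88},
    {"country": "US", "wine_type": "white", "price": 18.0, "points": 86},
    {"country": "France", "wine_type": "red", "price": 45.0, "points": 91},
    {"country": "France", "wine_type": "white", "price": 22.0, "points": 87},
    {"country": "Italy", "wine_type": "red", "price": 25.0, "points": 89},
    {"country": "Italy", "wine_type": "white", "price": 15.0, "points": 85},
    {"country": "Spain", "wine_type": "red", "price": 20.0, "points": 88},
    {"country": "Spain", "wine_type": "white", "price": 12.0, "points": 84},
]

encoded = one_hot_encode(rows, ["country", "wine_type"])
train, test = split_rows(encoded)
```

Two categorical columns with four and two unique values become six binary columns here. With thousands of unique values in a column such as winery, the same expansion produces thousands of columns, which is the source of the computational cost described above.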

Model Accuracy

Since our target is continuous rather than discrete, we could not use a confusion matrix or the traditional accuracy score to rate the performance of our model. Instead, we used the coefficient of determination (r²) and the mean squared error (MSE). MSE is the average squared difference between the predicted and actual ratings, and r² is the share of the variance in the actual ratings that the model explains. A perfect model has an r² of 1 and an MSE of 0.
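
For reference, both metrics can be computed from their standard definitions in a few lines. This is a minimal sketch with made-up ratings, not the project's evaluation code; the function names and sample values are assumptions for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical ratings, invented for illustration.
actual = [85, 88, 90, 93]
predicted = [86, 87, 91, 92]

mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
```

A model that always predicts the mean rating scores an r² of 0, so r² reads as the improvement over that baseline.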

When we trained the model to predict wine ratings, it scored an r² of 0.478 and an MSE of 4.78. When we trained it to predict categories of wine ratings, it scored an r² of 0.149 and an MSE of 0.109. In the end, our model's performance was mediocre, which we attribute to our computational limitations and to the abundance of categorical features in our dataset, several of which we had to omit.

wine_cellar

Looking Ahead

What are some possible improvements we could make?

If we had far more computing power (over 100GB of RAM), we could go back and include the title and winery columns to improve our model's results. Also, every feature other than price proved to be a weak predictor. Clearly there are factors outside our dataset that strongly affect a wine's rating, such as climate and weather. Given more time, we could bring in additional features like these and improve our model. Finally, we had a few outliers in our dataset that we should consider addressing.

Ideas for further development

Ideally, natural language processing techniques would be used to predict score based on text found in the description column. This was simply out of reach for us given our skill sets and time constraints.

wine_toast_sunset
