
Online News Popularity Prediction

Problem statement

The marketing team at Mashable is responsible for identifying popular articles and selling ads on them to earn large profits. Currently, the team uses heuristics to predict article popularity and charges a flat rate for ads irrespective of popularity. The team lacks a framework for predicting article popularity from the historical data of articles published on the website.

Goal

Determine the popularity of articles prior to publication by training machine learning models on various features, implement a new pricing strategy with the help of a cost-based analysis (cost matrix), and help Mashable's marketing team maximize profit.

Dataset

Instances – 39,000

Features – 61

Target variable (number of shares) – determines the popularity of an article based on its shares

Source – UCI Machine Learning Repository

Link - https://archive.ics.uci.edu/ml/datasets/online+news+popularity
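A minimal sketch of loading the data, assuming the CSV from the UCI link above has been downloaded locally (the filename below is illustrative):

```python
# Load the Online News Popularity dataset; the UCI CSV pads its column
# headers with spaces, so strip them before use.
import pandas as pd

df = pd.read_csv("OnlineNewsPopularity.csv")
df.columns = df.columns.str.strip()

print(df.shape)                  # ~39,000 rows, 61 columns
print(df["shares"].describe())   # target: number of shares (continuous)
```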

Exploratory Data Analysis (Univariate, Bivariate, Multivariate)

  1. Understand the target variable (shares): the number of shares determines an article's popularity; it is a continuous variable.

  2. Converted the task to a binary classification problem by thresholding the target variable. The median number of shares was 1,400, but we chose 1,500 to avoid class imbalance, which poses fewer challenges when building ML models (a code sketch follows this list).

    • Shares > 1500 – Popular (1)

    • Shares ≤ 1500 – Unpopular (0)

    Reason for choosing 1500

    • Close to the median (50th percentile)

    • Resolves the class imbalance problem

  3. Performed correlation analysis using heatmaps, scatter plots, and bar charts to understand variable relationships and identify highly correlated features.
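A minimal sketch of the binarization and correlation steps above, assuming `df` is the frame loaded earlier and that seaborn/matplotlib are available:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Binary target: 1 = Popular (shares > 1500), 0 = Unpopular
df["popularity"] = (df["shares"] > 1500).astype(int)
print(df["popularity"].value_counts(normalize=True))  # check class balance

# Heatmap of pairwise correlations to flag highly correlated features
corr = df.select_dtypes(include=np.number).corr()
plt.figure(figsize=(14, 12))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Feature correlation heatmap")
plt.tight_layout()
plt.show()
```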


Insights Obtained after EDA

• Articles with titles of moderate length are more likely to be shared; articles with very short titles are not well received.

• Articles with more videos were shared fewer times, possibly because readers cannot skim them.

• As the rate of negative words in the content increases, the number of shares decreases.

• The number of shares for articles posted on weekends is lower than for those posted on weekdays, yet the proportion of popular articles is higher on weekends.


Preprocessing

• Removed empty articles – articles with no words, as they are not helpful for model building

• Removed non-predictive features such as the article URL and timedelta

• Removed highly correlated attributes with the help of the correlation matrix: n_non_stop_unique_tokens, n_non_stop_words, kw_avg_min, self_reference_avg_shares_new, avg_negative_polarity_new, avg_positive_polarity_new

• Removed the redundant weekday_is_saturday indicator with the help of cross plots and chi-square values (a sketch of these drops follows below)
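A minimal sketch of these drops, using the feature names from the bullets above (guarded with a membership check in case a name differs in the actual CSV):

```python
# Drop empty articles (no words in the content)
df = df[df["n_tokens_content"] > 0]

# Drop non-predictive features
df = df.drop(columns=["url", "timedelta"])

# Drop highly correlated and chi-square-flagged features
to_drop = [
    "n_non_stop_unique_tokens", "n_non_stop_words", "kw_avg_min",
    "self_reference_avg_shares_new", "avg_negative_polarity_new",
    "avg_positive_polarity_new", "weekday_is_saturday",
]
df = df.drop(columns=[c for c in to_drop if c in df.columns])
```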


Dealing with Skewness and Outliers

• Most of the features are right-skewed. To deal with this, we applied power transformation techniques from the preprocessing package.

• Box-Cox transformation: this transformation requires values greater than 0. Since our dataset contained many negative values, we first shifted them to be positive and then applied the Box-Cox transformation.

• Most of the features in our dataset contained outliers. The Box-Cox transformation partially took care of them; we then capped the remaining outliers:

• Lower bound – 1st percentile (values below it are treated as outliers and capped)

• Upper bound – 99th percentile (values above it are treated as outliers and capped)
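A sketch of the transformation and capping, assuming scikit-learn; in practice only the skewed continuous features (not the binary indicators) would be transformed, so the column list here is illustrative:

```python
from sklearn.preprocessing import PowerTransformer

# Illustrative: all numeric predictors except the targets
num_cols = [c for c in df.columns if c not in ("shares", "popularity")]

# Box-Cox needs strictly positive input, so shift each column first
shifted = df[num_cols] - df[num_cols].min() + 1
pt = PowerTransformer(method="box-cox")
df[num_cols] = pt.fit_transform(shifted)

# Cap remaining outliers at the 1st and 99th percentiles
lower = df[num_cols].quantile(0.01)
upper = df[num_cols].quantile(0.99)
df[num_cols] = df[num_cols].clip(lower=lower, upper=upper, axis=1)
```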


Machine Learning Model and Evaluation

Naïve Bayes

Logistic Regression

KNN

Random Forest

We used AUC (with ROC curves), F1 score, and accuracy as performance metrics to evaluate these models.

Random Forest gave the highest accuracy and AUC score, so we performed hyperparameter tuning on it and observed that its performance increased significantly (see the sketch below).
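A sketch of the model comparison and tuning, assuming scikit-learn; the split ratio and hyperparameter grid are illustrative, not the project's exact settings:

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

X = df.drop(columns=["shares", "popularity"])
y = df["popularity"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    print(f"{name}: acc={accuracy_score(y_te, pred):.3f}  "
          f"F1={f1_score(y_te, pred):.3f}  AUC={roc_auc_score(y_te, proba):.3f}")

# Tune the best performer (Random Forest) on AUC
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [200, 500], "max_depth": [None, 10, 20]},
    scoring="roc_auc", cv=5,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.best_score_)
```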


Cost-based Analysis Using a Cost Matrix

Based on the problem statement and our research on advertisement sales, we created a cost matrix capturing the profit or loss in each of the following cases (a worked sketch follows the list):

• Correctly identifying a popular article allows its ads to be sold at a higher price (True Positive)

• Correctly identifying an unpopular article lowers the ad price, but increases the number of ads sold, albeit at that lower price (True Negative)

• A loss is incurred when a popular article is predicted as unpopular, as its ads are sold at the lower price (False Negative)

• A loss is incurred when an unpopular article is predicted as popular: its ads are sold at a higher price, but we may lose potential future ad sales (False Positive)
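A sketch of how such a cost matrix turns a confusion matrix into an expected profit; the dollar payoffs below are hypothetical placeholders, not the project's actual figures, and `grid`, `X_te`, and `y_te` come from the modeling sketch above:

```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_te, grid.predict(X_te)).ravel()

# Hypothetical per-article payoffs (dollars); replace with real figures
payoff = {"tp": 800, "tn": 300, "fn": -400, "fp": -200}

model_profit = (tp * payoff["tp"] + tn * payoff["tn"]
                + fn * payoff["fn"] + fp * payoff["fp"])

# Simplistic reading of the no-model scenario described in the
# profitability section below: a flat $400 per article
baseline_profit = 400 * (tn + fp + fn + tp)

print(f"model: ${model_profit:,}   flat-rate baseline: ${baseline_profit:,}")
```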


Model Profitability Analysis

• By the cost matrix, Random Forest has the highest profitability

• In the no-model scenario, ads are sold at a flat rate of $400 for all articles irrespective of popularity

• The red line indicates the no-model scenario

• Naïve Bayes gives a negative lift based on the cost matrix

• Of the four supervised learning models, Random Forest has the best profitability ($6.47M, up from $3M in the no-model scenario)


Conclusion

Random Forest shows the highest accuracy, AUC, and profitability, but it is not the most interpretable model. Given the recurrent trade-off between performance and interpretability, our goal was to favor a more interpretable model. While KNN performs well in our case, it may become slow as the data size increases because it generates hypotheses on the fly at prediction time. The logistic regression model gives almost the same lift as KNN; however, it is more interpretable and would still work as the data grows. One of the major takeaways from the project is that accuracy is not the only metric to optimize when building a model; other parameters need careful consideration based on the use case. In the future, we could build ensemble models to achieve higher accuracy, precision, and AUC on this data. Additionally, the research could be extended to multiclass classification, which would help distinguish Most Popular, Popular, Neutral, Unpopular, and No Popularity articles.
