Puppularity Contest

By Sam Celarek

"How might we use machine learning to identify the factors that drive engagement for a popular Twitter account dedicated to comedic dog pictures and ratings?"

🎯 Project Overview

This project examines the ‘We Rate Dogs’ Twitter account to determine which aspects of its tweets, comedic or aesthetic, primarily drive user engagement. I fit three multiple linear regression models: one with only comedic features as predictors, one with only aesthetic features, and one combining both. The target variable for each model is a proxy for engagement called ‘popularity rating’, calculated as the sum of retweets and favorites. I then assess the magnitude and significance of each variable’s impact on the popularity rating.

📊 Dataset

The primary dataset for this analysis was sourced from the ‘We Rate Dogs’ Twitter account using the Twitter API. It contains detailed tweet data, including tweet text, user data, retweet count, favorite count, and other meta information. I also used a second dataset of neural network classifications that assign each dog picture in the ‘We Rate Dogs’ tweets to a dog breed.

🧹 Data Wrangling

The initial dataset from the Twitter API was in JSON format, and several preprocessing steps were performed before analysis.
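As an illustration of the kind of preprocessing involved, here is a minimal sketch of flattening raw tweet JSON into flat records. It assumes one JSON object per line and the standard Twitter API v1.1 tweet fields (`id`, `full_text` or `text`, `created_at`, `retweet_count`, `favorite_count`). The function name `load_tweets` and the sample records are hypothetical, and this is not necessarily the exact wrangling used in the project.

```python
import json


def load_tweets(lines):
    """Parse one JSON tweet object per line and keep only the fields used later."""
    tweets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        raw = json.loads(line)
        tweets.append({
            "tweet_id": raw["id"],
            # Extended-mode responses use "full_text"; otherwise fall back to "text".
            "text": raw.get("full_text", raw.get("text", "")),
            "created_at": raw["created_at"],
            "retweet_count": raw.get("retweet_count", 0),
            "favorite_count": raw.get("favorite_count", 0),
        })
    return tweets


# Made-up sample records, for illustration only.
sample = [
    '{"id": 1, "full_text": "This is Bella. 13/10", '
    '"created_at": "Tue Aug 01 16:23:56 +0000 2017", '
    '"retweet_count": 100, "favorite_count": 400}',
    "",
    '{"id": 2, "text": "Meet Rex, a doggo. 12/10", '
    '"created_at": "Mon Jul 31 00:18:03 +0000 2017", '
    '"retweet_count": 50, "favorite_count": 200}',
]
tweets = load_tweets(sample)
print(len(tweets), tweets[0]["text"])
```

Keeping only the needed fields at load time makes the later feature engineering steps simpler to reason about.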

🛠️ Feature Engineering

To prepare for the modeling phase, several new features were engineered in addition to the features already extracted from the tweet text. The target feature, Popularity Rating, was also created by summing the retweets and favorites for each tweet.
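The sketch below shows one way features like these could be derived from a single tweet record. The regular expression for ratings, the moniker list (limited to the two monikers named in this write-up), and the function name `engineer_features` are assumptions for illustration. Only the popularity rating formula (retweets plus favorites) comes directly from the project description.

```python
import re

# Assumed pattern: ratings appear in the text as "<number>/10".
RATING_RE = re.compile(r"(\d+)/10\b")
# Assumed moniker list; only these two are named in this write-up.
MONIKERS = ("doggo", "puppo")


def engineer_features(tweet):
    """Derive a rating label, a moniker label, and the popularity rating."""
    text = tweet["text"]
    match = RATING_RE.search(text)
    lowered = text.lower()
    moniker = next((m for m in MONIKERS if m in lowered), "none")
    return {
        "rating": f"{match.group(1)}/10" if match else None,
        "moniker": moniker,
        # Target variable: sum of retweets and favorites.
        "popularity_rating": tweet["retweet_count"] + tweet["favorite_count"],
    }


example = engineer_features({
    "text": "This is Bella, a very good puppo. 13/10",
    "retweet_count": 100,
    "favorite_count": 400,
})
print(example)
# -> {'rating': '13/10', 'moniker': 'puppo', 'popularity_rating': 500}
```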

📶 Exploratory Data Analysis (EDA)

EDA was performed to understand the underlying patterns in the data:

In the timeline plot below, popularity rating increases steadily over time. This means that the passage of time, represented by the feature dates_number, is going to be very important in my model. It likely serves as a proxy for the steadily growing number of followers of the ‘We Rate Dogs’ account on Twitter, a figure I unfortunately did not have access to in the available datasets.

[Image: timeline plot of popularity rating over time]
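A numeric time feature such as dates_number could be built as sketched below. The exact definition used in the project is not stated, so the choice here (days elapsed since the earliest tweet) is an assumption, as is the helper name `add_dates_number`. The timestamp format string matches the `created_at` field of Twitter API v1.1 tweet objects.

```python
from datetime import datetime

# Format of "created_at" in Twitter API v1.1, e.g. "Tue Aug 01 12:00:00 +0000 2017".
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def add_dates_number(tweets):
    """Attach days elapsed since the earliest tweet as 'dates_number' (assumed definition)."""
    stamps = [datetime.strptime(t["created_at"], TWITTER_FORMAT) for t in tweets]
    first = min(stamps)
    for tweet, stamp in zip(tweets, stamps):
        tweet["dates_number"] = (stamp - first).total_seconds() / 86400
    return tweets


dated = add_dates_number([
    {"created_at": "Tue Aug 01 12:00:00 +0000 2017"},
    {"created_at": "Mon Jul 31 00:00:00 +0000 2017"},
])
print([t["dates_number"] for t in dated])
# -> [1.5, 0.0]
```

Expressing time as a single increasing number lets a linear model absorb the upward trend with one coefficient.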

🖥️ Modeling

Three multiple linear regression models with different sets of features were trained to predict popularity rating. One MLR model had comedic features, another had aesthetic features, and the final one had all features. The significance and effect size of each feature were then assessed in each model. To summarize the findings, the graph below highlights the statistically significant variables in the all-features model.

[Image: statistically significant variables in the all-features model]

As we can see, among the categorical features, the ratings 14/10 and 13/10 have the most significant impact relative to the baseline rating of 11/10. The monikers ‘Puppo’ and ‘Doggo’ also appear to drive higher levels of engagement. Finally, breeds that were not posted frequently enough (the threshold being more than 30 posts) or that were hard for the neural net to recognize generated lower popularity ratings than the baseline breed category of ‘not dogs’.
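To make the idea of coefficients "relative to a baseline" concrete, here is a dependency-free sketch that dummy-codes categorical features against a baseline level and fits ordinary least squares through the normal equations. The toy data, the helper names (`dummy_code`, `solve`, `fit_ols`), and the assignment of ratings to the comedic set and breeds to the aesthetic set are all assumptions for illustration. It does not compute p-values; a statistics library would normally report standard errors and significance.

```python
def dummy_code(values, baseline):
    """One-hot encode a categorical column, dropping the baseline level."""
    levels = sorted(set(values) - {baseline})
    rows = [[1.0 if v == level else 0.0 for level in levels] for v in values]
    return levels, rows


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(features, y):
    """Return (coefficients, r_squared); coefficients[0] is the intercept."""
    x = [[1.0] + row for row in features]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in x]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return beta, 1.0 - ss_res / ss_tot


# Made-up toy data, for illustration only.
ratings = ["11/10", "11/10", "13/10", "13/10", "14/10", "14/10"]
breeds = ["not_dog", "golden", "not_dog", "golden", "golden", "not_dog"]
popularity = [100.0, 120.0, 200.0, 220.0, 300.0, 340.0]

rating_levels, comedic = dummy_code(ratings, baseline="11/10")
breed_levels, aesthetic = dummy_code(breeds, baseline="not_dog")
combined = [c + a for c, a in zip(comedic, aesthetic)]

results = {}
for name, features in [("comedic", comedic), ("aesthetic", aesthetic), ("all", combined)]:
    results[name] = fit_ols(features, popularity)
    beta, r2 = results[name]
    print(name, [round(b, 1) for b in beta], round(r2, 3))
```

In the comedic-only toy model the intercept is the mean popularity of the baseline rating (11/10), and each remaining coefficient is the difference between that rating's mean and the baseline's mean, which is how the baseline comparisons above should be read.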

📈 Discussion

From the analysis and modeling, several key insights were derived:

image

In conclusion, comedic features drive more engagement than aesthetic ones. However, this model only weakly explains the variance in popularity, meaning there are likely other latent (unmeasured) factors that contribute to popularity rating.

Best Wishes,
Sam Celarek

💡 Other Resources