In this project, I examine the ‘We Rate Dogs’ Twitter account to determine which aspects of its tweets, comedic or aesthetic, primarily drive user engagement. I fit three multiple linear regressions: the first uses only comedic features as predictors, the second only aesthetic attributes, and the third a combination of both. The target variable for these models is a proxy for engagement called ‘popularity rating’, calculated as the sum of retweets and favorites. I then assess the magnitude and significance of each variable’s impact on the popularity rating.
The primary dataset for this analysis was sourced from the ‘We Rate Dogs’ Twitter account using the Twitter API. It contains detailed tweet data, including tweet text, user data, retweet count, favorite count, and other meta information. I also used a second dataset containing neural network classifications that assign each dog picture from the ‘We Rate Dogs’ tweets to a dog breed.
The initial dataset from the Twitter API was in JSON format and several preprocessing steps were performed:
To prepare for the modeling phase, several new features were engineered in addition to the features already extracted from the tweet text. The target feature, Popularity Rating, was also created by summing retweets and favorites for each tweet.
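As a minimal sketch of how the target could be built, the snippet below sums the two counts per tweet. The records and the field names `retweet_count` and `favorite_count` are assumptions made for illustration (they mirror the retweet and favorite counts described above), and the values are made up.

```python
# Hypothetical tweet records; the values are invented for illustration.
tweets = [
    {"tweet_id": 1, "retweet_count": 120, "favorite_count": 480},
    {"tweet_id": 2, "retweet_count": 35, "favorite_count": 210},
    {"tweet_id": 3, "retweet_count": 900, "favorite_count": 3100},
]


def add_popularity_rating(records):
    """Return copies of the records with popularity_rating = retweets + favorites."""
    return [
        {**r, "popularity_rating": r["retweet_count"] + r["favorite_count"]}
        for r in records
    ]


rated = add_popularity_rating(tweets)
for r in rated:
    print(r["tweet_id"], r["popularity_rating"])
```

The function returns new dictionaries rather than mutating the input, so the raw counts stay available for later checks.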
EDA was performed to understand the underlying patterns in the data:
In the timeline plot below, popularity rating steadily increases over time. This means that the passage of time, represented by the feature dates_number, is going to be very important in my model. This is likely because it serves as a proxy for the constantly increasing number of followers that the ‘We Rate Dogs’ account has on Twitter, a figure I unfortunately did not have access to in the available datasets.
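The write-up does not say how dates_number was computed. One plausible encoding, shown here purely as an assumption, is the number of whole days elapsed since the earliest tweet in the dataset. The function name `days_since_first_tweet` and the timestamps are hypothetical.

```python
from datetime import datetime


def days_since_first_tweet(timestamps):
    """Map ISO 8601 timestamps to whole days elapsed since the earliest one."""
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    start = min(parsed)
    return [(p - start).days for p in parsed]


# Made-up timestamps for illustration only.
example = [
    "2016-03-01T09:30:00+00:00",
    "2016-03-11T09:30:00+00:00",
    "2017-03-01T09:30:00+00:00",
]
print(days_since_first_tweet(example))
```

A numeric encoding like this lets a linear model capture a steady upward trend with a single coefficient.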
Three multiple linear regression (MLR) models with different sets of features were trained to predict popularity rating. One model had comedic features, another had aesthetic features, and the third had all features. The significance and effect size of each feature were then assessed in each model. To summarize the findings, the following graph highlights the statistically significant variables in the all-features model.
As we can see, among the categorical features in the model, the ratings of 14/10 and 13/10 have the most significant impact relative to the baseline rating of 11/10. The monikers ‘Puppo’ and ‘Doggo’ also appear to drive higher levels of engagement. Finally, breeds that were not posted frequently enough (>30 times) or that were hard for the neural net to recognize did not generate popularity ratings as high as those of the baseline breed category, ‘not dogs’.
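Comparing categories against a baseline, as above, is what dummy (indicator) coding with a reference level does: the baseline level gets no column, so every other level's coefficient reads as a difference from it. The sketch below is a dependency-free illustration; the helper name `dummy_encode` and the sample ratings are hypothetical.

```python
def dummy_encode(values, baseline):
    """Build one 0/1 indicator column per level, omitting the baseline level."""
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Made-up sample of dog ratings; 11/10 is the reference level, as in the text.
ratings = ["11/10", "13/10", "14/10", "12/10", "11/10"]
levels, rows = dummy_encode(ratings, baseline="11/10")

print(levels)
for rating, row in zip(ratings, rows):
    print(rating, row)
```

Tweets at the baseline rating encode as all zeros, which is why the coefficient for 14/10 is interpreted as the expected change in popularity rating relative to an 11/10 tweet, holding the other features fixed.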
From the analysis and modeling, several key insights were derived:
| Model | R-squared | Key insight |
| --- | --- | --- |
| Comedic Model | 0.37 | The comedic model was more predictive than the aesthetic model, and tweets with higher dog ratings tend to get more retweets and favorites. |
| Aesthetic Model | 0.32 | No dog breed was statistically significant; however, the nicknames ‘doggo’ and ‘puppo’ had a positive impact on popularity. |
| All Features Model | 0.38 | Time passing (date_number) was a very important variable to have in the model, as popularity ratings rise dramatically over time as the account gains fans and followers. |
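The R-squared comparison across feature sets can be sketched as follows. This is a dependency-free illustration, not the code used for the analysis: it fits ordinary least squares by solving the normal equations, and the data, the column names (`rating_is_13`, `is_puppo`, `days_elapsed`) and the assignment of columns to feature sets are all made up. In practice a statistics library would also report coefficient p-values, which this sketch does not compute.

```python
def fit_ols(rows, y):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (X'X) b = X'y."""
    X = [[1.0] + [float(v) for v in r] for r in rows]  # prepend intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * t for x, t in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def r_squared(rows, y, coefs):
    """Share of variance in y explained by the fitted linear model."""
    preds = [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot


# Made-up toy data; column names and values are hypothetical.
columns = ["rating_is_13", "is_puppo", "days_elapsed"]
data = [
    (0, 0, 1), (1, 0, 2), (0, 1, 3), (1, 1, 4),
    (0, 0, 5), (1, 0, 6), (0, 1, 7), (1, 1, 8),
]
y = [100, 300, 250, 500, 220, 430, 380, 640]  # popularity rating

# Assumed split of columns into feature sets, for illustration only.
feature_sets = {
    "comedic": ["rating_is_13"],
    "aesthetic": ["is_puppo"],
    "all": ["rating_is_13", "is_puppo", "days_elapsed"],
}

r2 = {}
for name, feats in feature_sets.items():
    idx = [columns.index(f) for f in feats]
    rows = [[row[i] for i in idx] for row in data]
    r2[name] = r_squared(rows, y, fit_ols(rows, y))
    print(f"{name}: R-squared = {r2[name]:.2f}")
```

Because the all-features model nests the other two, its R-squared on the training data can never be lower than theirs, so a small gain such as 0.37 to 0.38 says little by itself about whether the added features matter.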
In conclusion, comedic features drive more engagement than aesthetic ones. However, even the best model (R-squared = 0.38) only weakly explains the variance in popularity, meaning there are likely other latent (unmeasured) factors that contribute to popularity rating.