How many Facebook friends will you have? Linear Regression has the answer.
1 Executive Summary
This case study models the number of Facebook friends with Linear Regression on a relatively small dataset. We will see the importance of feature selection, review the essential assumptions linear regression makes about the data, and apply a log transformation. We conclude that the best model uses the features length of relationship, number of groups, number of music likes, and gender, resulting in an adjusted R-squared of 56%.
Our task is to build a well-performing linear regression model that predicts the number of Facebook friends from a small dataset of social media data. We will first explore the data, apply several tests to check whether it is suitable for linear regression, and build models with different combinations of variables. We will select the best model by its average rank across three key metrics: adjusted R-squared, BIC, and AIC. We will also discuss which future modifications to the data pre-processing and transformation could improve performance.
This project was part of the third course in Statistical Predictive Modelling and Applications in the University of Edinburgh Business School Predictive Analytics using Python MicroMasters program.
3.1 Analytic Approach and Data Requirements
We take a predictive analytics approach to estimating Facebook friends from social media data. First, the data must be in a numerical format to build Linear Regression models. Second, it needs to satisfy the assumptions of a linear relationship, no severe multicollinearity, and multivariate normality. Third, as we will see in Section 3.2 Pre-processing and Transformation, we will need to apply a log transformation to make the data acceptable for Linear Regression.
We start with a small dataset of 200 observations and nine features describing each user's social media activity. The course instructors provided the data in CSV format.
Figure 3.1 Original Dataset shows a five-row sample from the dataset.
I will quote the descriptions for each column directly from the course lab:
LOR: length of the relationship, tells us how long somebody is on Facebook (expressed in weeks)
NBRlikes: the total number of likes a user has given to Facebook pages
NBRgroups: the number of groups that a user is a part of
NBRcheckins: the number of times a user has shared a check-in to a particular location
NBRmovies: the number of movies a user has liked
NBRlinks: the number of links a user has posted
NBRfriends: the dependent variable expressing a user’s network size
female: indicates whether the user is female (1) or male (0)
Figure 3.2 Dataset Descriptions shows vital statistics for each variable, including mean, standard deviation, quartiles, and other metrics.
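As a hedged sketch of this step (the filename `facebook.csv` and all values below are assumptions; the real CSV came from the course lab), loading and summarising the data might look like:

```python
import numpy as np
import pandas as pd

# In the real study the data would be loaded with something like
#   df = pd.read_csv("facebook.csv")   # filename assumed
# Here we fabricate a frame with the same schema purely to illustrate describe().
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "LOR": rng.integers(1, 400, n),           # weeks on Facebook
    "NBRlikes": rng.integers(0, 500, n),
    "NBRgroups": rng.integers(0, 50, n),
    "NBRcheckins": rng.integers(0, 100, n),
    "NBRmovies": rng.integers(0, 60, n),
    "NBRmusic": rng.integers(0, 80, n),
    "NBRlinks": rng.integers(0, 200, n),
    "female": rng.integers(0, 2, n),
    "NBRfriends": rng.integers(10, 2000, n),  # dependent variable
})
print(df.describe())  # mean, std, quartiles, as summarised in Figure 3.2
```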
3.2 Pre-processing and Transformation
First, we visually inspect the variables for signs of collinearity by plotting them against each other.
We also look at histograms depicted in Figure 3.4 Histograms.
The plots do not suggest that the variables follow a Gaussian distribution, one of the crucial assumptions of a Linear Regression model. We will correct this with a log transformation.
Next, to test for multicollinearity, we look at the correlation matrix. We see that NBRmovies and NBRmusic are correlated. NBRgroups is the variable most correlated with the target, at 34%. NBRgroups, NBRmusic, NBRmovies, and length of relationship could be predictors. The individual correlations between the features and the target are not strong, and no pair of independent variables has a correlation over 80%, the usual flag for multicollinearity.
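A minimal sketch of this check, using synthetic stand-in data (only the 80% threshold comes from the discussion above; the values are fabricated, with NBRmovies and NBRmusic made correlated on purpose):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
movies = rng.normal(30, 10, n)
df = pd.DataFrame({
    "NBRmovies": movies,
    "NBRmusic": movies + rng.normal(0, 3, n),  # deliberately correlated
    "NBRgroups": rng.normal(20, 6, n),
    "NBRfriends": rng.normal(500, 150, n),
})
corr = df.corr()
# Flag independent-variable pairs whose absolute correlation exceeds 80%
flagged = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if a != "NBRfriends" and b != "NBRfriends" and abs(corr.loc[a, b]) > 0.8
]
print(flagged)
```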
Next, we log-transform all the variables except the boolean “female” variable.
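A sketch of the log step (column names mirror the dataset; the data and the use of `log1p` to guard against zero counts are my assumptions, as the lab may have used a plain `np.log`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({
    "LOR": rng.integers(1, 400, n),
    "NBRgroups": rng.integers(0, 50, n),
    "NBRfriends": rng.integers(10, 2000, n),
    "female": rng.integers(0, 2, n),
})
log_cols = [c for c in df.columns if c != "female"]  # keep the boolean as-is
df_log = df.copy()
df_log[log_cols] = np.log1p(df[log_cols])  # log(1 + x) tolerates zero counts
```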
We visually recheck the transformed dataset using the pair plot in Figure 3.7 Log Transformed Pair Plot and the correlation matrix in Figure 3.8 Log Transformed Correlation Matrix.
The independent variables NBRmovies and NBRmusic had a high correlation of 75%. The course instructors recommended dropping NBRmovies to avoid multicollinearity.
3.3 Preliminary Data Analysis
We will use the statsmodels package to do some preliminary analysis and check the linear regression assumptions in greater detail (using the best model selected in Section 4 Results).
The F-statistic is significantly higher than one, and the p-value is close to zero. These two measures show that at least one variable has non-zero predictive value.
NBRmusic and NBRcheckins are not statistically significant at the 5% level. The lack of statistical significance does not necessarily mean they cannot be good predictors, but for interpretation purposes we cannot say their coefficients are statistically different from zero. In other words, we cannot conclude with confidence that the number of music likes or check-ins impacts the number of friends.
As we can see in Figure 3.10 VIF, the VIF for LOR and NBRgroups is above four, which may be problematic. With a more conservative VIF threshold, we could also drop the NBRmusic variable.
Multivariate normality test
The Q-Q plot appears to follow the Normal distribution expectation, although we see outliers at the higher and lower ends of the quantiles.
With a Kolmogorov-Smirnov p-value of 24.4%, we fail to reject the null hypothesis that the residuals of the best model follow a normal distribution, so we consider the normality assumption met.
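The normality check can be sketched with scipy (synthetic residuals standing in for the model's `resid`; note that standardising with estimated parameters makes this the Lilliefors variant of the KS test, so the p-value is approximate):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
resid = rng.normal(0, 0.4, 200)  # stand-in for the fitted model's residuals
z = (resid - resid.mean()) / resid.std(ddof=1)
stat, pvalue = stats.kstest(z, "norm")
print(f"KS statistic={stat:.3f}, p-value={pvalue:.3f}")
# A p-value above 0.05 means we fail to reject normality of the residuals
```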
The Durbin-Watson value of 1.926 in Figure 3.9 OLS Regression Summary is within the range where we see no signs of autocorrelation.
For the assignment, I did not remove the outliers, but, as the course instructors pointed out in the solution, the data is visibly skewed, and outliers affect the model for some variables.
From Figure 3.12 Heteroscedasticity, it is hard to say for sure whether the residuals are equally spread. Arguably, they are more spread around the lower values, almost resembling a funnel shape, which may signal a heteroscedastic error term even after the log transformation. We may need a different feature transformation or a different model. The curvature may also indicate a higher-order polynomial relationship; a polynomial transformation could help, but that was out of scope for this project.
4 Results
After the transformation, we built models for every combination of columns in the dataset and recorded their AIC, BIC, and adjusted R-squared scores. We selected the best model using the average of its rankings across the three metrics.
Figure 4.1 Feature Selection shows the features used for each model and the weighted score. The best overall model uses the features LOR, NBRgroups, NBRmusic, female, and const, resulting in an AIC of 616.96, a BIC of 633.45, and an adjusted R-squared of 56%.
Since we log-transformed the variables, we can interpret the model by analysing its coefficients:
- const 4.517
- LOR 1.335
- NBRgroups 2.342
- NBRmusic 1.109
- NBRcheckins 1.581
- female 1.401
People start with 4.5 friends, and for every week somebody is on Facebook, the model expects 1.33 additional friends. Every group adds 2.34 friends, and each music like and check-in adds 1.1 and 1.6 friends respectively (although these are not statistically significant at 5%, so the true coefficients may be zero). Gender also affects the number of Facebook friends.
Having built a highly interpretable model, we can predict the number of friends one has over time and identify the factors associated with it. The model helps make predictions, but we have not analysed the cause-and-effect relationship between the variables and the target. For example, is the number of check-ins reflective of a person's offline social activity, or does the higher Facebook activity of checking in attract additional friends? Or does one check in more often when one has more friends on Facebook, as there is a greater incentive to share?
Causality can be the subject of independent research, but we have now achieved the task we set out to do: build a predictive model.
We have gone through the entire cycle of pre-processing, transformation, and model building.
We selected the features leading to the best-performing model, interpreted the coefficients, and reviewed the assumptions necessary for a linear regression model to be effective. Multiple Linear Regression is one of the simplest models. It may not be the most powerful, but it is remarkable how much understanding of the data, the relationships between the variables, and what they represent in the physical world we can gain from it.
The log transformation helped with the normality assumption. Looking at the residual plot, we could improve linearity and reduce heteroscedasticity by removing outliers and applying further transformations, such as polynomial, log, or square-root, to make the residuals more equally spread. We have not tried regularisation here, nor plotted a learning curve to decide whether to focus on reducing bias or variance. One could also perform error analysis and try to improve the data. I have not discussed one element of the mini-case study, building the model using scikit-learn and cross-validation, but I will leave that for another post.
Copyright © 2021 Schwarzwald_AI