We will use data from the \(2009\) British Social Attitudes Survey. You can download the data here.
Our response variable is:
We will use the following predictors:
Let’s have a look at the distribution of the response variable (the red dashed line shows the true percentage of non-Western immigrants):
The plot shows that \(759\) of \(1049\) respondents (or \(72.4\%\)) overestimate the percentage of non-Western immigrants in the UK.
library(foreign)
library(dplyr)
# Load data set
load("bsas_short.RData")
# Declare factor variables
bsas_data <- bsas_data %>%
dplyr::mutate(resp_urban_area = factor(resp_urban_area,
levels = 1:4,
labels = c("rural", "rather rural",
"rather urban", "urban")),
resp_health = factor(resp_health,
levels = 0:3,
labels = c("bad", "fair", "fairly good", "good")))
We will use the glmnet()
function from the glmnet
package to perform ridge regression and the Lasso. Before doing so, we use the model.matrix()
function to create a matrix of predictors. This function automatically transforms any qualitative variables into dummy variables. This is important because glmnet()
can only take quantitative inputs.
We remove the intercept from the matrix produced by model.matrix()
because glmnet()
will automatically include an intercept. We also exclude the predictor resp_party_cons, which will serve as the baseline in our model.
# Matrix of predictors (remove the intercept and resp_party_cons)
x <- model.matrix(imm_brit ~ . -1 -resp_party_cons, bsas_data)
# Response variable
y <- bsas_data$imm_brit
The glmnet()
function has an alpha
argument that determines what type of model is fit. If alpha = 0
, then a ridge regression model is fit, and if alpha = 1
, then a Lasso model is fit. We first fit a ridge regression model to our data. Note that by default, the glmnet()
function standardizes the variables so that they are on the same scale.
library(glmnet)
# Split data into training and test sets
train <- sample(1:nrow(x), nrow(x) / 2)
y_test <- y[-train]
# Define grid of lambda values
grid <- 10^seq(10, -2, length = 100)
# Fit ridge regression models using the glmnet function
ridge_out <- glmnet(x[train, ], y[train], alpha = 0, lambda = grid)
# Plot shrinkage of coefficient estimates
plot(ridge_out, xvar = "lambda", label = TRUE)
Associated with each value of \(\lambda\) is a vector of ridge regression coefficients, stored in a matrix that can be accessed by coef()
. In our case, this is a \(23 \times 100\) matrix, with a row for each predictor and the intercept and a column for each value of \(\lambda\).
dim(coef(ridge_out))
## [1] 23 100
We now use cross-validation to choose the optimal tuning parameter \(\lambda\). We can do this using the built-in cross-validation function, cv.glmnet()
. By default, the function performs \(10\)-fold cross-validation, though we can change this using the argument nfolds
. We here do \(5\)-fold cross-validation. We also split our data into a training data set and a test data set, so that we can estimate the test error.
set.seed(1234)
# Perform cross-validation
ridge_cv_out <- cv.glmnet(x[train, ], y[train], alpha = 0, lambda = grid, nfolds = 5)
# Plot estimated test MSE for each lambda value
plot(ridge_cv_out)
# Determine optimal lambda value
(lambda_opt <- ridge_cv_out$lambda.min)
## [1] 10.72267
We see that the optimal value of \(\lambda\) is \(2.009233\). What is the test MSE associated with this optimal value?
ridge_out_pred <- predict(ridge_out, s = lambda_opt, newx = x[-train, ])
mean((ridge_out_pred - y_test)^2)
## [1] 371.6465
Next, let’s see what the benefit is of performing ridge regression instead of least squares regression.
ridge_out_pred <- predict(ridge_out, s = 0, x = x[train, ], y = y[train],
newx = x[-train, ], exact = TRUE)
mean((ridge_out_pred - y_test)^2)
## [1] 367.2372
Finally, we re-fit our ridge regression model on the full data set, using the optimal value of \(\lambda\).
ridge_final <- glmnet(x, y, alpha = 0)
predict(ridge_final, type = "coefficients", s = lambda_opt)[1:23, ]
## (Intercept) resp_female
## 35.553763732 4.886957882
## resp_age resp_household_size
## -0.032312452 0.811902360
## resp_party_lab resp_party_libdem
## -1.779711511 -3.265929217
## resp_party_snp resp_party_green
## 3.351761067 -2.601119200
## resp_party_ukip resp_party_bnp
## -3.200234545 7.615164426
## resp_party_other resp_newspaper
## 2.625833051 1.094508322
## resp_internet_hrs resp_religious
## -0.023866830 0.541491591
## resp_time_current_employment resp_urban_arearural
## -0.004461721 -1.133232628
## resp_urban_arearather rural resp_urban_arearather urban
## -0.722514958 0.239799448
## resp_urban_areaurban resp_healthfair
## 1.494439037 -0.815808996
## resp_healthfairly good resp_healthgood
## -0.141514011 -0.248405055
## resp_household_income
## -0.948736534
In order to fit a Lasso model, we again use the glmnet()
function. However, we now use the argument alpha = 1
.
# Fit Lasso regression models using the glmnet function
lasso_out <- glmnet(x[train, ], y[train], alpha = 1, lambda = grid)
# Plot shrinkage of coefficient estimates
plot(lasso_out, xvar = "lambda", label = TRUE)
We now perform cross-validation and compute the associated test error.
set.seed(1234)
# Perform cross-validation
lasso_cv_out <- cv.glmnet(x[train, ], y[train], alpha = 1, lambda = grid, nfolds = 5)
# Plot estimated test MSE for each lambda value
plot(lasso_cv_out)
# Determine optimal lambda value
(lambda_opt <- lasso_cv_out$lambda.min)
## [1] 0.869749
# Fit Lasso model using the optimal lambda value
lasso_out_pred <- predict(lasso_out, s = lambda_opt, newx = x[-train, ])
# Test error
mean((lasso_out_pred - y_test)^2)
## [1] 366.8876
This value is very similar to the test MSE of ridge regression with \(\lambda\) chosen by cross-validation.
Finally, we again re-fit our Lasso regression model on the full data set, using the optimal value of \(\lambda\).
lasso_final <- glmnet(x, y, alpha = 1)
predict(lasso_final, type = "coefficients", s = lambda_opt)[1:23, ]
## (Intercept) resp_female
## 36.7124923556 5.3438555078
## resp_age resp_household_size
## -0.0121796559 0.9336677630
## resp_party_lab resp_party_libdem
## -0.5800356517 -1.4922918181
## resp_party_snp resp_party_green
## 0.0000000000 0.0000000000
## resp_party_ukip resp_party_bnp
## -0.0002045951 5.8155536429
## resp_party_other resp_newspaper
## 2.5567523020 0.0068577237
## resp_internet_hrs resp_religious
## 0.0000000000 0.0000000000
## resp_time_current_employment resp_urban_arearural
## 0.0000000000 0.0000000000
## resp_urban_arearather rural resp_urban_arearather urban
## 0.0000000000 0.0000000000
## resp_urban_areaurban resp_healthfair
## 0.1538081862 0.0000000000
## resp_healthfairly good resp_healthgood
## 0.0000000000 0.0000000000
## resp_household_income
## -1.2840413798