In this lab, we will introduce the basics of machine learning in R. We’ll cover some data exploration, designing a machine learning workflow (including a cross-validation strategy and a performance metric), and trying different algorithms. We’ll also look briefly at making predictions with our model and exploring its results.
Before starting, we’ll define some vocabulary for the process of ML model building:
The data we will use contains daily counts of rented bicycles from the bicycle rental company Capital-Bikeshare in Washington D.C., along with weather and seasonal information. Our goal is to build a model to predict the count of bikes rented on any given day (the count is the outcome). Before starting the lab, you will need to set up a new folder for your working directory. Download the file bike.csv from GitHub and move it to this folder (the code below assumes the file sits in a datafiles subfolder, so either create that subfolder or adjust the path in the read.csv() call). Now start R or RStudio and set your working directory to this folder (if you’re not sure how to do this, please ask).
We’ll need to load a few add-ons to help. R has a large number of packages for individual machine learning algorithms, but also has a couple of meta-packages that are designed to manage a machine learning workflow. These meta-packages take care of setting up training and testing data, as well as evaluating the models. The package we will use is called caret, which is one of the oldest and best established. You will need to install this, as well as a couple of other useful packages. If these are already installed on your computer (you can check in the ‘Packages’ tab in RStudio), then you can skip this step.
install.packages(c("caret", "tidyverse", "pdp", "vip", "patchwork", "skimr"))
Now load the first few packages:
library(tidyverse)
library(patchwork)
library(skimr)
We’ll start by loading the data and carrying out some simple exploration.
bike <- read.csv("./datafiles/bike.csv")
Let’s take a quick look at the data:
head(bike)
## season yr mnth holiday weekday workingday weathersit temp
## 1 WINTER 2011 JAN NO HOLIDAY SAT NO WORKING DAY MISTY 8.175849
## 2 WINTER 2011 JAN NO HOLIDAY SUN NO WORKING DAY MISTY 9.083466
## 3 WINTER 2011 JAN NO HOLIDAY MON WORKING DAY GOOD 1.229108
## 4 WINTER 2011 JAN NO HOLIDAY TUE WORKING DAY GOOD 1.400000
## 5 WINTER 2011 JAN NO HOLIDAY WED WORKING DAY GOOD 2.666979
## 6 WINTER 2011 JAN NO HOLIDAY THU WORKING DAY GOOD 1.604356
## hum windspeed count days_since_2011
## 1 80.5833 10.749882 985 0
## 2 69.6087 16.652113 801 1
## 3 43.7273 16.636703 1349 2
## 4 59.0435 10.739832 1562 3
## 5 43.6957 12.522300 1600 4
## 6 51.8261 6.000868 1606 5
And get some basic summary statistics using the skimr package:
skim(bike)
Name | bike |
Number of rows | 731 |
Number of columns | 12 |
_______________________ | |
Column type frequency: | |
character | 6 |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
season | 0 | 1 | 4 | 6 | 0 | 4 | 0 |
mnth | 0 | 1 | 3 | 3 | 0 | 12 | 0 |
holiday | 0 | 1 | 7 | 10 | 0 | 2 | 0 |
weekday | 0 | 1 | 3 | 3 | 0 | 7 | 0 |
workingday | 0 | 1 | 11 | 14 | 0 | 2 | 0 |
weathersit | 0 | 1 | 4 | 15 | 0 | 3 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
yr | 0 | 1 | 2011.50 | 0.50 | 2011.00 | 2011.00 | 2012.00 | 2012.00 | 2012.00 | ▇▁▁▁▇ |
temp | 0 | 1 | 15.28 | 8.60 | -5.22 | 7.84 | 15.42 | 22.80 | 32.50 | ▂▇▇▇▅ |
hum | 0 | 1 | 62.79 | 14.24 | 0.00 | 52.00 | 62.67 | 73.02 | 97.25 | ▁▁▆▇▂ |
windspeed | 0 | 1 | 12.76 | 5.19 | 1.50 | 9.04 | 12.13 | 15.63 | 34.00 | ▃▇▅▁▁ |
count | 0 | 1 | 4504.35 | 1937.21 | 22.00 | 3152.00 | 4548.00 | 5956.00 | 8714.00 | ▂▅▇▅▃ |
days_since_2011 | 0 | 1 | 365.00 | 211.17 | 0.00 | 182.50 | 365.00 | 547.50 | 730.00 | ▇▇▇▇▇ |
We’ll now make some plots to take a look at how the features relate to the count of rental bikes. First, let’s plot the time series of daily rentals. This shows a couple of things: a clear seasonal cycle and a long-term trend across the two years:
ggplot(bike, aes(x = days_since_2011, y = count)) +
geom_line() +
theme_bw()
We can also look at the distribution by day of the week, month, holiday, etc. Note that we need to make sure R plots the days and months in the correct order by using a factor variable:
bike <- bike %>%
mutate(mnth = factor(mnth, levels = c("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
"JUL", "AUG", "SEP", "OCT", "NOV", "DEC")),
weekday = factor(weekday, levels = c("SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT"))
)
ggplot(bike, aes(x = mnth, y = count)) +
geom_boxplot() +
theme_bw()
ggplot(bike, aes(x = weekday, y = count)) +
geom_boxplot() +
theme_bw()
p1 = ggplot(bike, aes(x = workingday, y = count)) +
geom_boxplot() +
theme_bw()
p2 = ggplot(bike, aes(x = holiday, y = count)) +
geom_boxplot() +
theme_bw()
p1 + p2
Again we can see the clear seasonal cycle, as well as a slightly higher rate on non-holidays. There’s little to no variation across weekdays, however. We can also use some scatter plots to show the relationship of rentals to environmental features:
p1 = ggplot(bike, aes(x = temp, y = count)) +
geom_point() +
theme_bw()
p2 = ggplot(bike, aes(x = hum, y = count)) +
geom_point() +
theme_bw()
p3 = ggplot(bike, aes(x = windspeed, y = count)) +
geom_point() +
theme_bw()
(p2 + p3) / p1
It’s difficult to make out much in the humidity and windspeed plots, except that rentals appear to decline at higher values. Rentals generally increase with temperature, but appear to decline at the highest temperatures. Most of this makes sense: cycling in high winds or in hot, humid conditions is generally less appealing.
ggplot(bike, aes(x = weathersit, y = count)) +
geom_boxplot() +
theme_bw()
We’ll now build a machine learning model with these data. We’ll model the rental numbers using the environmental data, months and holiday/non-holiday variables.
The general steps in constructing any ML model are:
We’ll first walk through doing this by hand, then switch to using caret to help automate some of these steps. We’ll start by loading the libraries we need:
library(caret)
library(ModelMetrics)
library(randomForest)
Prior to building a model, we will want to clean the data to help optimize the training process and the predictive skill of the model. Some things to check for are:
This dataset has already been cleaned, so there is relatively little to do in processing it before building models. However, the plots above showed an observation with a relative humidity value of 0, which is likely an error. We’ll use the filter() function to select only observations with hum > 0:
bike2 = bike %>%
filter(hum > 0)
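If you want to check how much was removed (it should be just the single zero-humidity observation, if your data match the summary above), you can compare the row counts and re-check the humidity range:

nrow(bike) - nrow(bike2)
summary(bike2$hum)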
As the majority of ML algorithms have no built-in diagnostics similar to those found in traditional statistical models, we need a different approach to assess our models. Cross-validation refers to the process of dividing the data into two subsets: a training set used to fit the model, and a testing set used to evaluate how well it predicts new observations.
There are several different ways to create the training and test sets. Here, we’ll use a simple hold-out method. We use createDataPartition() to select a proportion of the original data to go into the training set (controlled by the argument p = 0.8). This returns an index with the row number of each observation selected for training, which can then be used to create a training (train) and test (test) dataset.
## Cross-validation strategy
train_id = createDataPartition(bike2$count, p = 0.8)
train = bike2[train_id[[1]], ]
test = bike2[-train_id[[1]], ]
Check the sizes (the test should be roughly 1/4 the size of the training set):
nrow(train)
## [1] 586
nrow(test)
## [1] 144
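Note that createDataPartition() makes a random selection, so the exact rows in your training and test sets will differ from run to run. If you want a reproducible split, set the random seed just before creating the partition, for example:

set.seed(42)   # any fixed number will make the split reproducible
train_id = createDataPartition(bike2$count, p = 0.8)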
Now we can go ahead and train a model. We’ll start with a simple linear regression model. These are considered to be machine learning algorithms (much to the annoyance of most statisticians). While they tend not to perform as well as more complex algorithms (tree methods, neural networks, etc.), they are useful in providing a baseline that more complex models should improve on. We’ll build it here using R’s lm() function. Note that this takes a formula argument that describes the outcome and the features, separated by a tilde (~). As we’ll use this formula throughout this exercise, we’ll create it and store it for reuse:
f1 <- count ~ temp + hum + windspeed + mnth + holiday + weathersit
Now fit the model using only the training data:
fit_lm = lm(f1, data = train)
summary(fit_lm)
##
## Call:
## lm(formula = f1, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3435.9 -988.3 -191.4 1065.9 3312.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3841.267 515.754 7.448 3.55e-13 ***
## temp 154.034 16.216 9.499 < 2e-16 ***
## hum -36.072 5.754 -6.269 7.19e-10 ***
## windspeed -57.484 11.450 -5.020 6.91e-07 ***
## mnthFEB 33.339 274.877 0.121 0.903507
## mnthMAR 663.560 297.821 2.228 0.026268 *
## mnthAPR 772.395 332.742 2.321 0.020624 *
## mnthMAY 848.347 375.184 2.261 0.024127 *
## mnthJUN 164.729 440.726 0.374 0.708717
## mnthJUL -525.858 474.671 -1.108 0.268401
## mnthAUG 273.727 441.833 0.620 0.535818
## mnthSEP 1317.713 388.289 3.394 0.000738 ***
## mnthOCT 1572.453 321.874 4.885 1.34e-06 ***
## mnthNOV 1133.447 286.808 3.952 8.73e-05 ***
## mnthDEC 774.570 267.161 2.899 0.003885 **
## holidayNO HOLIDAY 799.094 313.659 2.548 0.011107 *
## weathersitMISTY -117.290 145.647 -0.805 0.420981
## weathersitRAIN/SNOW/STORM -1630.741 420.110 -3.882 0.000116 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1297 on 568 degrees of freedom
## Multiple R-squared: 0.5678, Adjusted R-squared: 0.5549
## F-statistic: 43.89 on 17 and 568 DF, p-value: < 2.2e-16
In a statistical model, we’d spend some time looking at the coefficients, standard errors and \(p\)-values. In an ML model, we’re more interested in the predictive skill of the model, so we’ll go on to check this now. First, we use the predict() function to estimate the bike rental count for each observation in the test set:
y_pred = predict(fit_lm, newdata = test)
We’ll use the root mean squared error (RMSE) as a performance metric to compare the observed and predicted bike counts. As the name suggests, this is the square root of the mean squared difference between the observed and predicted values.
rmse(test$count, y_pred)
## [1] 1262.745
So our prediction error from this model is approximately 1263 bikes per day. Note that your results may vary from this due to differences in random selection and model building.
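As a check on what rmse() is doing, you can compute the same value directly from the definition above (this assumes y_pred from the previous step is still in your workspace):

# RMSE by hand: square root of the mean squared difference
sqrt(mean((test$count - y_pred)^2))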
Now, we’ll see if we can improve on this using a random forest model. Random forests (RFs) were first introduced by Leo Breiman to tackle very complex, noisy radar data. They work by building a series of decision trees that use the features to break the dataset down into small ranges of the outcome. For example, a decision might be that any relative humidity above 80% will have much lower values of bike rentals, or that summer months will have some of the highest counts. Each tree is based on a series of these decisions, which makes it possible to model more complex relationships. For example, with these data a tree might predict low counts at temperatures below 5 degrees, higher counts between 5 and 20 degrees, and then low counts again.
RF models build several hundred of these trees based on different subsets of the data and different subsets of the available features. This may seem counter-intuitive (why would you use less data to build a model?), and each individual tree is considered to be a weak model. But the ensemble of trees that is built is extremely robust to variations in the data. One additional advantage is that the RF model can provide a range of predicted outcomes (one per tree), although in practice these are usually averaged to a single value. We’ll fit this here using the randomForest package. Note that this uses the same formula (f1) as the linear model.
### Random forest
fit_rf = randomForest(f1, data = train)
fit_rf
##
## Call:
## randomForest(formula = f1, data = train)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 1417709
## % Var explained: 62.43
Now we can go through the same steps of predicting the outcome and calculating the RMSE:
y_pred = predict(fit_rf, newdata = test)
rmse(test$count, y_pred)
## [1] 1237.825
Although our predictive error is still high, it has dropped a little from the linear model.
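As mentioned above, a random forest really produces one prediction per tree, which are then averaged. If you want to see the individual tree predictions, the predict() method for randomForest objects has a predict.all argument; a quick sketch using the model and test set from above:

pred_all = predict(fit_rf, newdata = test, predict.all = TRUE)
dim(pred_all$individual)   # one row per test observation, one column per tree
head(pred_all$aggregate)   # the averaged predictions used above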
Most ML algorithms have a set of hyperparameters that control the way in which the algorithm learns. For example, many algorithms use a series of iterations to update their weights, so one hyperparameter might control the number of iterations, and another might control the amount that the weights can be updated in any step. Selecting the optimal value of these, or tuning them, is often an important step in model building and may markedly improve the model skill.
The set of hyperparameters is specific to each model. A random forest has several, but the most important are generally considered to be a) the number of features considered at each split in the decision trees and b) the number of trees that are grown. In R, you can find the default settings for these on the help page for the randomForest function (?randomForest), under the arguments mtry and ntree respectively. Like most defaults, these are chosen to work reasonably well in most situations, but you can easily change them to see if it improves your model. Here, we’ll re-run the random forest with fewer trees (100) and using 4 features at each decision tree split:
fit_rf = randomForest(f1, data = train, mtry = 4, ntree = 100)
fit_rf
##
## Call:
## randomForest(formula = f1, data = train, mtry = 4, ntree = 100)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 1448087
## % Var explained: 61.63
## Test predictive skill
y_pred = predict(fit_rf, newdata = test)
rmse(test$count, y_pred)
## [1] 1273.925
In this case, the hand-picked values make little difference to the predictive skill compared to the default settings (here the RMSE is actually slightly worse, although your numbers will vary with the random split). This is often the case with random forests, which tend to work well with their defaults; other algorithms may show bigger changes.
Choosing the best values for the hyperparameters can be a tedious exercise, and often a substantial part of any machine learning project, as testing every possible combination of parameters can become very time consuming. To help with this, most software allows automated tuning of the parameters. Here, the training data are split again into two subsets (one still called training and one called validation). A series of models are built on the new training set with different values of the hyperparameters, and each is used to predict the validation set. The hyperparameter values that give the best performance (lowest RMSE in this example) are then considered to be optimal. The caret package provides a function (train()) that will do all this for you. You need to provide the following:

- the model formula and the training data
- the algorithm to use, set with the method argument (e.g. rf will build a random forest; the full set of algorithms can be found here: https://topepo.github.io/caret/available-models.html)
- a grid of hyperparameter values to test (tuneGrid)
- a trainControl object that describes how the data will be split up for training and validation

Let’s start by defining the last two of these. First, the hyperparameter grid. We’ll try values of mtry from 1 to 12:
rf_grid <- expand.grid( mtry = seq(1, 12))
Next, we’ll define how to split the data for training and validation. Previously, we used a hold-out method to get a training and testing set. Here, we’ll use k-fold cross-validation. In this method the data are split \(k\) times into a training and validation set, and a model is built for each split. The size of each validation set will be \(1/k\) of the data, so for a 5-fold cross-validation each validation set is 20% of the data (the argument number sets this). The advantage of this method is that we test each combination of hyperparameters \(k\) times, which can provide a more stable estimate of which is best. Note that there are other methods that can be used (e.g. repeated cross-validation, bootstrapping), but these are more time-consuming; \(k\)-fold is often preferred as a good balance of speed and robustness.
fit_control = trainControl(method = "cv", number = 5)
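If you wanted one of the more time-consuming options mentioned above, the control object is set up in the same way; for example, repeated 5-fold cross-validation with 3 repeats would look like this (not run here):

fit_control_rep = trainControl(method = "repeatedcv", number = 5, repeats = 3)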
Now we have this, we can set up and run the tuning process. Note that we specify method to select the algorithm used. This will take a few seconds to run: it is fitting each of the 12 mtry values on 5 cross-validation folds, so 60 random forests in total:
fit_rf_cv = train(f1, data = train,
method = "rf",
trControl = fit_control,
tuneGrid = rf_grid)
You can see the results for each value of the hyperparameters (and plot the RMSE)
fit_rf_cv
## Random Forest
##
## 586 samples
## 6 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 470, 468, 469, 469, 468
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 1567.536 0.5242597 1299.8883
## 2 1320.411 0.5813329 1134.9446
## 3 1225.060 0.6149341 1051.0878
## 4 1192.725 0.6279974 1018.1909
## 5 1189.575 0.6273249 1007.6487
## 6 1183.898 0.6295057 999.0764
## 7 1189.626 0.6252977 1001.6499
## 8 1187.268 0.6269126 998.7479
## 9 1189.591 0.6254172 1000.6482
## 10 1193.652 0.6229455 1004.2152
## 11 1198.398 0.6198843 1006.0964
## 12 1198.628 0.6199428 1006.9436
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
plot(fit_rf_cv)
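If you just want the selected hyperparameter value rather than the full table, caret stores it in the bestTune element of the fitted object:

fit_rf_cv$bestTune   # the mtry value chosen by cross-validation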
You can also see the best model:
fit_rf_cv$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 1406755
## % Var explained: 62.72
Now let’s test the predictive skill of the selected model (the RMSE printed above is for the validation step):
y_pred = predict(fit_rf_cv, newdata = test)
rmse(test$count, y_pred)
## [1] 1226.116
Simple regression models such as the one we built above are generally easy to interpret. For example, in our model we found a coefficient of 154 for daily temperature, which we can interpret as the number of bikes rented increasing by about 154 for each 1 degree increase in temperature (so a day that is 10 degrees warmer would be predicted to have roughly 1,540 more rentals, all else being equal). Interpretation of ML models is much less straightforward, for several reasons:
There are several tools that have been developed to help explore these models. We’ll look at a couple of them here.
First, we’ll look at the permutation-based variable importance for this model. Variable importance is a measure of how much worse a model becomes when we randomly shuffle the values of one of the features. The model is used to predict the outcome for some test data twice: once with the original values of the feature and once with randomly shuffled values. If there is a large difference in the skill of the model, this feature is important in controlling the outcome.
We’ll use the vip() function from the vip package to plot the variable importance scores from the best model we obtained in the last step.
library(vip)
##
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
##
## vi
vip(fit_rf_cv$finalModel)
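To see what the permutation idea means in practice, here is a rough manual version for a single feature (temperature), using the test set from earlier. This is only a sketch of the idea; vip can also compute permutation importances for all features in a more careful way.

# Baseline test error with the original data
baseline = rmse(test$count, predict(fit_rf_cv, newdata = test))
# Shuffle the temperature values only, then re-predict
test_shuffled = test
test_shuffled$temp = sample(test_shuffled$temp)
permuted = rmse(test$count, predict(fit_rf_cv, newdata = test_shuffled))
permuted - baseline   # a large increase in error suggests temp is important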
The variable importance plot shows which of the features are the most useful in predicting bike rentals, but not how they are related to it. We can look at the form of the relationship between bike rentals and any one feature using a partial dependency (PD) plot. This shows changes in the outcome across the range of some feature (with all other features effectively held constant).
To get this, we’ll use the partial() function from the pdp package to obtain the PD values. As arguments, this requires the model, the feature that you want the PD for, and the set of data used to produce the model. This produces the PD values for temperature:
library(pdp)
##
## Attaching package: 'pdp'
## The following object is masked from 'package:purrr':
##
## partial
partial(fit_rf_cv, "temp", train = bike2)
## temp yhat
## 1 -5.22087120 1904.002
## 2 -4.46648680 1904.002
## 3 -3.71210239 1910.569
## 4 -2.95771799 1914.056
## 5 -2.20333358 1914.621
## 6 -1.44894918 1915.568
## 7 -0.69456478 1942.961
## 8 0.05981963 1939.875
## 9 0.81420403 1965.655
## 10 1.56858844 1951.752
## 11 2.32297284 2220.089
## 12 3.07735724 2238.067
## 13 3.83174165 2436.982
## 14 4.58612605 2595.527
## 15 5.34051046 3127.303
## 16 6.09489486 3162.332
## 17 6.84927926 3253.859
## 18 7.60366367 3376.516
## 19 8.35804807 3589.989
## 20 9.11243248 3685.819
## 21 9.86681688 3950.366
## 22 10.62120128 4192.139
## 23 11.37558569 4272.479
## 24 12.12997009 4314.649
## 25 12.88435450 4723.124
## 26 13.63873890 4712.756
## 27 14.39312330 4990.241
## 28 15.14750771 5029.532
## 29 15.90189211 4942.864
## 30 16.65627652 5216.038
## 31 17.41066092 5218.807
## 32 18.16504532 5574.587
## 33 18.91942973 5566.377
## 34 19.67381413 5583.547
## 35 20.42819854 5533.608
## 36 21.18258294 5524.034
## 37 21.93696734 5465.281
## 38 22.69135175 5485.531
## 39 23.44573615 5463.019
## 40 24.20012056 5443.804
## 41 24.95450496 5474.137
## 42 25.70888936 5524.808
## 43 26.46327377 5471.446
## 44 27.21765817 5402.746
## 45 27.97204258 5307.683
## 46 28.72642698 5270.019
## 47 29.48081138 5187.292
## 48 30.23519579 5128.667
## 49 30.98958019 4953.746
## 50 31.74396460 4752.147
## 51 32.49834900 4774.637
More typically, we plot this to see the changing response, which illustrates the non-linearity in the relationship. Relatively few bikes are predicted to be rented below about 2 degC. Rentals then increase in a fairly linear fashion to a maximum at about 18 degC, then remain high. Beyond about 25 degC, rentals start to decline again.
partial(fit_rf_cv, "temp", train = bike2, plot = TRUE, plot.engine = 'ggplot2')
Try plotting these for other variables (e.g. humidity or wind speed). Partial dependency plots can also be made for two variables at the same time, for example to see the combined effect of temperature and windspeed. Note how the response to temperature is much flatter at higher wind speeds.
partial(fit_rf_cv, c("temp", "windspeed"), train = bike2, plot = TRUE)
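As suggested above, the same kind of single-variable plot can be made for any other feature, for example relative humidity:

partial(fit_rf_cv, "hum", train = bike2, plot = TRUE, plot.engine = 'ggplot2')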
We’ve been through most of the main steps of a machine learning exercise, including processing, training and testing, and exploring the model. There’s quite a lot more that can be done in these steps, but this basic framework works for most projects. This exercise has focused on working with the algorithm - selecting one, tuning it, etc. A very important part of the work is also in selecting the correct variables.
The models we built all have fairly high error, largely because there is a long-term increasing trend in rentals across the two years. If you have time, try rebuilding and testing a model that includes a feature to capture this (e.g. yr or days_since_2011), and you’ll see a substantial decrease in the RMSE. One thing to note, though, is that while this might help in reducing the error, it is difficult to see how such a feature could be used in practice. This illustrates some of the challenge: selecting meaningful features that result in a robust predictive model.
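If you want to try this, a minimal sketch (reusing the training and test sets from above; f2 and fit_rf2 are just names chosen here for the extended formula and model) would be:

# Add the trend term to the original formula and refit the random forest
f2 <- count ~ temp + hum + windspeed + mnth + holiday + weathersit + days_since_2011
fit_rf2 <- randomForest(f2, data = train)
rmse(test$count, predict(fit_rf2, newdata = test))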
Bike rental dataset from https://christophm.github.io/interpretable-ml-book/bike-data.html:
- season: The season, either spring, summer, fall or winter.
- yr: The year, either 2011 or 2012.
- mnth: The month.
- holiday: Indicator of whether the day was a holiday or not.
- weekday: Day of the week.
- workingday: Indicator of whether the day was a working day or weekend.
- weathersit: The weather situation on that day. One of: GOOD, MISTY, RAIN/SNOW/STORM.
- temp: Temperature in degrees Celsius.
- hum: Relative humidity in percent (0 to 100).
- windspeed: Wind speed in km per hour.
- count: Count of bicycles rented, including both casual and registered users. The count is used as the target in the regression task.
- days_since_2011: Number of days since 01.01.2011 (the first day in the dataset). This feature was introduced to take account of the trend over time.