Introduction

In this lab, we will introduce the basics of machine learning in R. We’ll cover some data exploration, designing a machine learning workflow (including a cross-validation strategy and performance metric) and look at how to try different algorithms. We’ll also look briefly at making predictions with our model and exploring the results of the model.

Before starting, we’ll define some vocabulary for the process of ML model building:

  • Outcome or target: the variable that we are interested in predicting. Equivalent to the response (dependent) variable in a regression model
  • Features: the variables we will use to predict the outcome. Equivalent to covariates in regression models
  • Training: the process of estimating model weights
  • Loss function: a measure of how well the predicted outcome (\(\hat{y}\)) maps to the observed outcome (\(y\))
  • Performance metric: a measure of how well the model can predict for an independent dataset
  • Model weights: one or more values that are used to map the features to the outcome. The value of these is learned during the training process
  • Hyperparameters: algorithm-specific parameters that control the way that the algorithm learns or updates the weights
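
To make the ideas of weights and a loss function concrete, here is a toy example (the numbers are invented purely for illustration): a simple linear model with two weights, and a squared-error loss measuring how far its predictions fall from the observed outcome.

x <- c(1, 2, 3, 4)            # a single feature
y <- c(2.1, 3.9, 6.2, 8.1)    # observed outcome
w0 <- 0; w1 <- 2              # model weights (normally learned during training)
y_hat <- w0 + w1 * x          # predicted outcome
mean((y - y_hat)^2)           # squared-error loss, averaged over observations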

The data we will use contains daily counts of rented bicycles from the bicycle rental company Capital Bikeshare in Washington D.C., along with weather and seasonal information. Our goal is to build a model to predict the count of bikes rented on any given day (the count is the outcome). Before starting the lab, you will need to set up a new folder for your working directory. Download the file bike.csv from GitHub and move it to this folder. Now start R or RStudio and set your working directory to this folder (if you’re not sure how to do this, please ask).
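
For example, if you created a folder called ml_lab in your home directory (this path is just a placeholder; use whatever folder you actually created), setting the working directory would look like this:

setwd("~/ml_lab")  # placeholder path: replace with your own folder
getwd()            # confirm the working directory
list.files()       # bike.csv should be listed here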

We’ll need to load a few add-ons to help. R has a large number of packages for individual machine learning algorithms, but also has a couple of meta-packages that are designed to manage a machine learning workflow. These meta-packages take care of setting up training and testing data, as well as evaluating the models. The package we will use is called caret, which is one of the oldest and best established. You will need to install this, as well as a couple of other useful packages. If these are already installed on your computer (you can check in the ‘Packages’ tab in RStudio), then you can skip this step.

install.packages(c("caret", "tidyverse", "pdp", "vip", "patchwork", "skimr", "ModelMetrics", "randomForest"))

Now load the first few packages:

library(tidyverse)
library(patchwork)
library(skimr)

Data

We’ll start by loading the data and carrying out some simple exploration.

bike <- read.csv("bike.csv")

Let’s take a quick look at the data:

head(bike)
##   season   yr mnth    holiday weekday     workingday weathersit     temp
## 1 WINTER 2011  JAN NO HOLIDAY     SAT NO WORKING DAY      MISTY 8.175849
## 2 WINTER 2011  JAN NO HOLIDAY     SUN NO WORKING DAY      MISTY 9.083466
## 3 WINTER 2011  JAN NO HOLIDAY     MON    WORKING DAY       GOOD 1.229108
## 4 WINTER 2011  JAN NO HOLIDAY     TUE    WORKING DAY       GOOD 1.400000
## 5 WINTER 2011  JAN NO HOLIDAY     WED    WORKING DAY       GOOD 2.666979
## 6 WINTER 2011  JAN NO HOLIDAY     THU    WORKING DAY       GOOD 1.604356
##       hum windspeed count days_since_2011
## 1 80.5833 10.749882   985               0
## 2 69.6087 16.652113   801               1
## 3 43.7273 16.636703  1349               2
## 4 59.0435 10.739832  1562               3
## 5 43.6957 12.522300  1600               4
## 6 51.8261  6.000868  1606               5

And get some basic summary statistics using the skimr package:

skim(bike)
Data summary
  Name: bike
  Number of rows: 731
  Number of columns: 12
  Column type frequency: character 6, numeric 6
  Group variables: None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
season 0 1 4 6 0 4 0
mnth 0 1 3 3 0 12 0
holiday 0 1 7 10 0 2 0
weekday 0 1 3 3 0 7 0
workingday 0 1 11 14 0 2 0
weathersit 0 1 4 15 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
yr 0 1 2011.50 0.50 2011.00 2011.00 2012.00 2012.00 2012.00 ▇▁▁▁▇
temp 0 1 15.28 8.60 -5.22 7.84 15.42 22.80 32.50 ▂▇▇▇▅
hum 0 1 62.79 14.24 0.00 52.00 62.67 73.02 97.25 ▁▁▆▇▂
windspeed 0 1 12.76 5.19 1.50 9.04 12.13 15.63 34.00 ▃▇▅▁▁
count 0 1 4504.35 1937.21 22.00 3152.00 4548.00 5956.00 8714.00 ▂▅▇▅▃
days_since_2011 0 1 365.00 211.17 0.00 182.50 365.00 547.50 730.00 ▇▇▇▇▇

We’ll now make some plots to take a look at how the features relate to the count of rental bikes. First, let’s plot the time series of daily rentals. This shows a couple of things: a clear seasonal cycle and a long-term trend across the two years:

ggplot(bike, aes(x = days_since_2011, y = count)) +
  geom_line() +
  theme_bw()

We can also look at the distribution by day of the week, month, holiday, etc. Note that we need to make sure R plots the days and months in the correct order by converting them to factor variables with explicitly ordered levels:

  • Month:
bike <- bike %>%
  mutate(mnth = factor(mnth, levels = c("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                                        "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")),
         weekday = factor(weekday, levels = c("SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT"))
         )
ggplot(bike, aes(x = mnth, y = count)) +
  geom_boxplot() +
  theme_bw()

  • Day of week:
ggplot(bike, aes(x = weekday, y = count)) +
  geom_boxplot() +
  theme_bw()

  • Holidays and working days (this uses patchwork to combine figures):
p1 = ggplot(bike, aes(x = workingday, y = count)) +
  geom_boxplot() +
  theme_bw()
p2 = ggplot(bike, aes(x = holiday, y = count)) +
  geom_boxplot() +
  theme_bw()
p1 + p2

Again we can see the clear seasonal cycle, as well as slightly higher rentals on non-holidays. There’s little to no variation across weekdays, however. We can also use some scatter plots to show the relationship of rentals to environmental features:

p1 = ggplot(bike, aes(x = temp, y = count)) +
  geom_point() +
  theme_bw()
p2 = ggplot(bike, aes(x = hum, y = count)) +
  geom_point() +
  theme_bw()
p3 = ggplot(bike, aes(x = windspeed, y = count)) +
  geom_point() +
  theme_bw()
(p2 + p3) / p1

It’s difficult to make out much in the humidity and windspeed plots, except that rentals appear to decline at higher values. Rentals generally increase with temperature, but appear to decline at the highest temperatures. Most of this makes sense: cycling in high winds or hot, humid conditions is generally less appealing. Finally, let’s look at how rentals vary with the overall weather situation:

ggplot(bike, aes(x = weathersit, y = count)) +
  geom_boxplot() +
  theme_bw()

Machine learning

We’ll now build a machine learning model with these data. We’ll model the rental numbers using the environmental data, months and holiday/non-holiday variables.

The general steps in constructing any ML model are:

  • Preprocess data
  • Set up cross-validation strategy
  • Train (and optionally tune) the model
  • Estimate the predictive skill through cross-validation

We’ll first walk through doing this by hand, then switch to using caret to help automate some of these steps. We’ll start by loading the libraries we need:

library(caret)
library(ModelMetrics)
library(randomForest)

Preprocessing

Prior to building a model, we will want to clean the data to help optimize the training process and the predictive skill of the model. Some things to check for are:

  • Outliers in the outcome variable
  • Missing values
  • High correlations between features

This dataset has already been cleaned so there is relatively little to do in processing it before building models. However, the plots above showed an observation with a relative humidity value of 0, which is likely an error. We’ll use the filter() function to select only observations with hum > 0:

bike2 = bike %>%
  filter(hum > 0)
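
If you want to run the checks listed above yourself, something along these lines would do (a sketch only; findCorrelation() comes from the caret package loaded above):

colSums(is.na(bike))                       # missing values per column
summary(bike$count)                        # implausible values in the outcome?
num_feats <- bike[, c("temp", "hum", "windspeed")]
cor(num_feats)                             # pairwise feature correlations
findCorrelation(cor(num_feats), cutoff = 0.9)  # columns worth considering dropping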

Cross-validation strategy

As the majority of ML algorithms have no built-in diagnostics comparable to those found in traditional statistical models, we need a different approach to assess our models. Cross-validation refers to the process of dividing the data into two subsets:

  • The training set is used to build or train the model. Training selects the model weights that minimize the loss function (and therefore maximize the fit of the model to the training data).
  • The test set is used to assess the model. Once the model weights have been established, the trained model is used to predict the outcome for this set. The difference between predicted and observed value is assessed using the performance metric.

There are several different ways to create the training and test set. Here, we’ll use a simple hold-out method. We use createDataPartition to select a proportion of the original data to go into the training set (controlled by the argument p=0.8). This returns an index with the row number for each observation selected for training, which can then be used to create a training (train) and test (test) dataset.

## Cross-validation strategy
train_id = createDataPartition(bike2$count, p = 0.8)

train = bike2[train_id[[1]], ] 
test = bike2[-train_id[[1]], ] 

Check the sizes (the test set should be roughly 1/4 the size of the training set):

nrow(train)
## [1] 586
nrow(test)
## [1] 144
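
Note that createDataPartition makes a random selection, so the exact rows chosen (and the results below) will differ from run to run. If you want your split to be reproducible, set a seed before partitioning (the seed value here is arbitrary):

set.seed(42)  # arbitrary seed, for reproducibility only
train_id = createDataPartition(bike2$count, p = 0.8)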

Training the model

Now we can go ahead and train a model. We’ll start with a simple linear regression model. Linear regression is often counted among machine learning algorithms (much to the annoyance of most statisticians). While it tends not to perform as well as more complex algorithms (tree methods, neural networks, etc.), it provides a useful baseline that more complex models should improve on. We’ll build it here using R’s lm() function. Note that this takes a formula argument that describes the outcome and the features, separated by a tilde (~). As we’ll use this formula throughout the exercise, we’ll create it and store it for reuse:

f1 <- count ~ temp + hum + windspeed + mnth + holiday + weathersit

Now fit the model using only the training data:

fit_lm = lm(f1, data = train)
summary(fit_lm)
## 
## Call:
## lm(formula = f1, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3435.9  -988.3  -191.4  1065.9  3312.1 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3841.267    515.754   7.448 3.55e-13 ***
## temp                        154.034     16.216   9.499  < 2e-16 ***
## hum                         -36.072      5.754  -6.269 7.19e-10 ***
## windspeed                   -57.484     11.450  -5.020 6.91e-07 ***
## mnthFEB                      33.339    274.877   0.121 0.903507    
## mnthMAR                     663.560    297.821   2.228 0.026268 *  
## mnthAPR                     772.395    332.742   2.321 0.020624 *  
## mnthMAY                     848.347    375.184   2.261 0.024127 *  
## mnthJUN                     164.729    440.726   0.374 0.708717    
## mnthJUL                    -525.858    474.671  -1.108 0.268401    
## mnthAUG                     273.727    441.833   0.620 0.535818    
## mnthSEP                    1317.713    388.289   3.394 0.000738 ***
## mnthOCT                    1572.453    321.874   4.885 1.34e-06 ***
## mnthNOV                    1133.447    286.808   3.952 8.73e-05 ***
## mnthDEC                     774.570    267.161   2.899 0.003885 ** 
## holidayNO HOLIDAY           799.094    313.659   2.548 0.011107 *  
## weathersitMISTY            -117.290    145.647  -0.805 0.420981    
## weathersitRAIN/SNOW/STORM -1630.741    420.110  -3.882 0.000116 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1297 on 568 degrees of freedom
## Multiple R-squared:  0.5678, Adjusted R-squared:  0.5549 
## F-statistic: 43.89 on 17 and 568 DF,  p-value: < 2.2e-16

In a statistical model, we’d spend some time looking at the coefficients, standard errors and \(p\)-values. In an ML model, we are more interested in the predictive skill of the model, so we’ll go on to check this now. First, we use the predict() function to estimate the bike rental count for each observation in the test set:

y_pred = predict(fit_lm, newdata = test)

We’ll use the root mean squared error (RMSE) as a performance metric to compare the observed and predicted bike counts. As the name suggests, this is the square root of the mean squared difference between the observed and predicted values: \(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\).

rmse(test$count, y_pred)
## [1] 1262.745
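
As a quick check, you can compute the same quantity by hand:

sqrt(mean((test$count - y_pred)^2))  # root of the mean squared difference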

So our prediction error from this model is approximately 1263 bikes per day. Note that your results may vary from this due to differences in random selection and model building.

Now, we’ll see if we can improve on this using a random forest model. Random forests (RFs) were first introduced by Leo Breiman to tackle very complex, noisy radar data. They work by building a series of decision trees that use the features to break the dataset down into small ranges of the outcome. For example, a decision might be that any relative humidity above 80% will have much lower values of bike rentals, or that summer months will have some of the highest counts. Each tree is based on a series of these decisions, which makes it possible to model more complex relationships; for example, with these data a tree might predict low counts at temperatures below 5 degrees, higher counts between 5 and 20 degrees, and then low counts again.

RF models build several hundred of these trees based on different subsets of the data, and different subsets of the available features. This may seem counter-intuitive (why would you use less data to build a model?), and each individual tree is considered to be a weak model. But the ensemble of trees is extremely robust to variations in the data. One additional advantage is that the RF model can provide a range of predicted outcomes (one per tree), although in practice these are usually averaged to a single value. We’ll fit the model here using the randomForest package. Note that this uses the same formula (f1) as the linear model.

### Random forest
fit_rf = randomForest(f1, data = train)
fit_rf
## 
## Call:
##  randomForest(formula = f1, data = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 1417709
##                     % Var explained: 62.43

Now we can go through the same steps of predicting the outcome and calculating the RMSE:

y_pred = predict(fit_rf, newdata = test)
rmse(test$count, y_pred)
## [1] 1237.825

Although our predictive error is still high, it has dropped a little compared to the linear model (from roughly 1263 to roughly 1238 bikes per day).

Tuning hyperparameters

Most ML algorithms have a set of hyperparameters that control the way in which the algorithm learns. For example, many algorithms use a series of iterations to update their weights, so one hyperparameter might control the number of iterations, and another might control the amount that the weights can be updated in any step. Selecting the optimal value of these, or tuning them, is often an important step in model building and may markedly improve the model skill.

The set of hyperparameters is specific to each model. A random forest has several, but the most important are generally considered to be a) the number of features considered at each split in the decision trees and b) the number of trees that are built. In R, you can find the default settings for these on the help page for the randomForest function (?randomForest), under the arguments mtry and ntree respectively. Like most defaults, these are chosen to work reasonably well in most situations, but you can easily change them to see if it improves your model. Here, we’ll re-run the random forest with fewer trees (100) and using 4 features at each decision tree split:

fit_rf = randomForest(f1, data = train, mtry = 4, ntree = 100)
fit_rf
## 
## Call:
##  randomForest(formula = f1, data = train, mtry = 4, ntree = 100) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 1448087
##                     % Var explained: 61.63

## Test predictive skill
y_pred = predict(fit_rf, newdata = test)
rmse(test$count, y_pred)
## [1] 1273.925

In this case, the predictive skill is actually slightly worse than with the default settings (random forests are often fairly insensitive to their hyperparameters; other algorithms may show bigger changes).

Choosing the best value for the hyperparameters can be a tedious exercise, and often a substantial part of any machine learning project, as testing every possible combination of parameters can become very time consuming. To help with this, most software allows automated tuning of the parameters. Here, the training data are split again into two subsets (one still called training and one called validation). A series of models are built using the new training set with different values for the hyperparameters, and then predicted for the validation set. The hyperparameter values that give the best performance (lowest RMSE in this example) are then considered to be optimal. The caret package provides a function (train()) that will do all this for you. You need to provide the following:

  • A description of the model (this is the formula we created earlier)
  • The dataset to be used (the training data we created earlier)
  • The algorithm to be used (rf will build a random forest. The full set of algorithms can be found here https://topepo.github.io/caret/available-models.html)
  • A trainControl object that describes how the data will be split up for training and validation
  • A grid of hyperparameters that we want to test for (there are defaults for most algorithms)

Let’s start by defining the last two of these. First the hyperparameter grid. We’ll try values of mtry from 1 to 12:

rf_grid <- expand.grid(mtry = seq(1, 12))

Next, we’ll define how to split the data for training and validation. Previously, we used a holdout method to get a training and testing set. Here, we’ll use k-fold cross-validation. In this method the data are split \(k\) times into a training and validation set, and a model is built for each split. The size of each validation set will be \(1/k\) of the data, so for a 5-fold cross-validation each validation set is 20% of the data (the argument number sets this). The advantage of this method is that we test each combination of hyperparameters \(k\) times, which can provide a more stable estimate of which is best. Note that there are other methods that can be used (e.g. repeated cross-validation, bootstrapping), but these are more time-consuming; \(k\)-fold is often preferred as it is a good balance of speed and robustness.

fit_control = trainControl(method = "cv", number = 5)
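
For reference, the repeated cross-validation and bootstrap approaches mentioned above are set up the same way; these are standard trainControl options, shown here only as a sketch (we won’t use them further):

trainControl(method = "repeatedcv", number = 5, repeats = 3)  # repeated 5-fold CV
trainControl(method = "boot", number = 25)                    # bootstrap resampling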

Now we have this, we can set up and run the tuning process. Note that we use method to specify the algorithm. This will take a few seconds to run - it’s fitting 5 folds for each of the 12 hyperparameter values, so 60 random forests in total:

fit_rf_cv = train(f1, data = train, 
                  method = "rf", 
                  trControl = fit_control, 
                  tuneGrid = rf_grid)

You can see the results for each value of the hyperparameters (and plot the RMSE)

fit_rf_cv
## Random Forest 
## 
## 586 samples
##   6 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 470, 468, 469, 469, 468 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##    1    1567.536  0.5242597  1299.8883
##    2    1320.411  0.5813329  1134.9446
##    3    1225.060  0.6149341  1051.0878
##    4    1192.725  0.6279974  1018.1909
##    5    1189.575  0.6273249  1007.6487
##    6    1183.898  0.6295057   999.0764
##    7    1189.626  0.6252977  1001.6499
##    8    1187.268  0.6269126   998.7479
##    9    1189.591  0.6254172  1000.6482
##   10    1193.652  0.6229455  1004.2152
##   11    1198.398  0.6198843  1006.0964
##   12    1198.628  0.6199428  1006.9436
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 6.
plot(fit_rf_cv)
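
You can also pull out just the selected hyperparameter value:

fit_rf_cv$bestTune  # the mtry value chosen by the tuning process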

You can also see the best model:

fit_rf_cv$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 1406755
##                     % Var explained: 62.72

Now let’s test the predictive skill of the selected model (the RMSE printed above is for the validation step):

y_pred = predict(fit_rf_cv, newdata = test)
rmse(test$count, y_pred)
## [1] 1226.116
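
It can also be useful to plot the predicted against the observed counts for the test set; a quick sketch using ggplot:

pred_df = data.frame(obs = test$count, pred = y_pred)
ggplot(pred_df, aes(x = obs, y = pred)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +  # 1:1 line: perfect predictions fall here
  theme_bw()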

Exploring the model

Simple regression models such as the one we built above are generally easy to interpret. For example, our model had a coefficient of about 154 for daily temperature, which we can interpret as the number of bikes rented increasing by about 154 for each 1 degree increase in temperature. Interpretation of ML models is much less straightforward for several reasons:

  • They do not have a basis for statistical inference (i.e. you can’t get \(p\)-values)
  • They capture non-linearities in the relationship between features and labels
  • Often they include complex interactions between features

There are several tools that have been developed to help explore these models. We’ll look at a couple of them here.

Variable importance

First, we’ll look at the permutation-based variable importance for this model. Variable importance is a measure of how much worse a model becomes when we randomly shuffle the values of one of the features. The model is used to predict the outcome for some test data twice: once with the original values of the feature and once with randomly shuffled values. If there is a large difference in the skill of the model, this feature is important in controlling the outcome.

We’ll use the vip() function from the vip package to calculate and plot the variable importance scores for the best model we obtained in the last step.

library(vip)
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
vip(fit_rf_cv$finalModel)
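
If you want the underlying numbers rather than the plot, the companion function vi() (also from the vip package, and the object mentioned in the masking message above) returns them as a table:

vi(fit_rf_cv$finalModel)  # importance scores as a data frame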

Partial dependency plots

The variable importance plot shows which of the features are the most useful in predicting bike rentals, but not how they are related. We can look at the form of the relationship between bike rentals and any single feature using a partial dependency (PD) plot. This shows changes in the outcome across the range of some feature (with all other features effectively held constant).

To get this, we’ll use the partial() function from the pdp package to obtain the PD values. As arguments, this requires the model, the feature that you want the PD for, and the set of data used to produce the model. This produces the PD values for temperature:

library(pdp)
## 
## Attaching package: 'pdp'
## The following object is masked from 'package:purrr':
## 
##     partial
partial(fit_rf_cv, "temp", train = bike2)
##           temp     yhat
## 1  -5.22087120 1904.002
## 2  -4.46648680 1904.002
## 3  -3.71210239 1910.569
## 4  -2.95771799 1914.056
## 5  -2.20333358 1914.621
## 6  -1.44894918 1915.568
## 7  -0.69456478 1942.961
## 8   0.05981963 1939.875
## 9   0.81420403 1965.655
## 10  1.56858844 1951.752
## 11  2.32297284 2220.089
## 12  3.07735724 2238.067
## 13  3.83174165 2436.982
## 14  4.58612605 2595.527
## 15  5.34051046 3127.303
## 16  6.09489486 3162.332
## 17  6.84927926 3253.859
## 18  7.60366367 3376.516
## 19  8.35804807 3589.989
## 20  9.11243248 3685.819
## 21  9.86681688 3950.366
## 22 10.62120128 4192.139
## 23 11.37558569 4272.479
## 24 12.12997009 4314.649
## 25 12.88435450 4723.124
## 26 13.63873890 4712.756
## 27 14.39312330 4990.241
## 28 15.14750771 5029.532
## 29 15.90189211 4942.864
## 30 16.65627652 5216.038
## 31 17.41066092 5218.807
## 32 18.16504532 5574.587
## 33 18.91942973 5566.377
## 34 19.67381413 5583.547
## 35 20.42819854 5533.608
## 36 21.18258294 5524.034
## 37 21.93696734 5465.281
## 38 22.69135175 5485.531
## 39 23.44573615 5463.019
## 40 24.20012056 5443.804
## 41 24.95450496 5474.137
## 42 25.70888936 5524.808
## 43 26.46327377 5471.446
## 44 27.21765817 5402.746
## 45 27.97204258 5307.683
## 46 28.72642698 5270.019
## 47 29.48081138 5187.292
## 48 30.23519579 5128.667
## 49 30.98958019 4953.746
## 50 31.74396460 4752.147
## 51 32.49834900 4774.637

More typically, we plot this to see the changing response. This illustrates the non-linearity in the response. Relatively few bikes are predicted to be rented below about 2 degC. Rentals then increase in a relatively linear fashion to a maximum at about 18 degC, then remain high. Beyond about 25 degC, rentals start to decline again.

partial(fit_rf_cv, "temp", train = bike2, plot = TRUE, plot.engine = 'ggplot2')

Try plotting these for other variables (e.g. humidity or wind speed). Partial dependency plots can also be made for two variables at the same time, for example to see the combined effect of temperature and windspeed. Note how the response to temperature is much flatter at higher wind speeds.

partial(fit_rf_cv, c("temp", "windspeed"), train = bike2, plot = TRUE)
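
As a starting point for the single-variable exercise suggested above, the humidity plot only needs a change of feature name:

partial(fit_rf_cv, "hum", train = bike2, plot = TRUE, plot.engine = 'ggplot2')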

Final thoughts

We’ve been through most of the main steps of a machine learning exercise, including processing, training and testing, and exploring the model. There’s quite a lot more that can be done in these steps, but this basic framework works for most projects. This exercise has focused on working with the algorithm - selecting one, tuning it, etc. A very important part of the work is also in selecting the correct variables.

The models we built all have fairly high error, largely because there is a long-term increasing trend in rentals across the two years. If you have time, try rebuilding and testing a model that includes a feature to capture this (e.g. yr or days_since_2011), and you’ll see a substantial decrease in the RMSE. One thing to note, though, is that while this might help in reducing error, it is difficult to see how such a feature could be used in practice. This illustrates part of the challenge: selecting meaningful features that result in a robust predictive model.
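
If you want to try this, a minimal sketch is simply to extend the formula and re-run the tuning and testing steps (f2, fit_rf_cv2 and y_pred2 are new names introduced here for illustration):

f2 <- count ~ temp + hum + windspeed + mnth + holiday + weathersit + days_since_2011
fit_rf_cv2 <- train(f2, data = train,
                    method = "rf",
                    trControl = fit_control,
                    tuneGrid = rf_grid)
y_pred2 <- predict(fit_rf_cv2, newdata = test)
rmse(test$count, y_pred2)  # compare with the RMSE values above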

Appendix 1: Bike rental dataset

Bike rental dataset from https://christophm.github.io/interpretable-ml-book/bike-data.html:

  • season: The season, either spring, summer, fall or winter.
  • yr: The year, either 2011 or 2012.
  • mnth: The month
  • holiday: Indicator whether the day was a holiday or not.
  • weekday: Day of week
  • workingday: Indicator whether the day was a working day or weekend.
  • weathersit: The weather situation on that day. One of:
    • clear, few clouds, partly cloudy, cloudy
    • mist + clouds, mist + broken clouds, mist + few clouds, mist, light snow, light rain + thunderstorm + scattered clouds, light rain + scattered clouds
    • heavy rain + ice pellets + thunderstorm + mist, snow + mist
  • temp: Temperature in degrees Celsius.
  • hum: Relative humidity in percent (0 to 100).
  • windspeed: Wind speed in km per hour.
  • count: Count of bicycles including both casual and registered users. The count is used as the target in the regression task.
  • days_since_2011: Number of days since the 01.01.2011 (the first day in the dataset). This feature was introduced to take account of the trend over time.