Chapter 8 High-dimensional data

8.1 The Big Picture

  • The rise of high-dimensional data. The new data frontiers in the social sciences—text (Gentzkow et al. 2019; Grimmer and Stewart 2013) and image (Joo and Steinert-Threlkeld 2018)—are all high-dimensional.

  • The rise of the new approach: statistics + computer science = machine learning

  • Statistical inference

    • \(y\) <- some probability models (e.g., linear regression, logistic regression) <- \(x\)

    • \(y\) = \(X\beta\) + \(\epsilon\)

    • The goal is to estimate \(\beta\)

  • Machine learning

    • \(y\) <- unknown <- \(x\)

    • \(y\) <-> decision trees, neural nets <-> \(x\)

    • For the main idea behind prediction modeling, see Breiman, Leo (Berkeley stat faculty who passed away in 2005). “Statistical modeling: The two cultures (with comments and a rejoinder by the author).” Statistical science 16, no. 3 (2001): 199-231.

    • “The problem is to find an algorithm \(f(x)\) such that for future \(x\) in a test set, \(f(x)\) will be a good predictor of \(y\).”

    • “There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.”

  • How does ML differ from econometrics?

  • A review by Athey, Susan, and Guido W. Imbens. “Machine learning methods that economists should know about.” Annual Review of Economics 11 (2019): 685-725.

  • Stat:

    • Specifying a target (i.e., an estimand)

    • Fitting a model to data using an objective function (e.g., the sum of squared errors)

    • Reporting point estimates (effect size) and standard errors (uncertainty)

    • Validation: yes/no decisions based on goodness-of-fit tests and residual examination

  • ML:

    • Developing algorithms (estimating f(x))

    • Prediction power, not structural/causal parameters

    • Essentially, statistics for high-dimensional data (often N < P)

    • The major problem is to avoid “the curse of dimensionality” (too many features -> overfitting)

    • Validation: out-of-sample comparisons (cross-validation) not in-sample goodness-of-fit measures

    • So, it’s curve fitting, but the primary focus is unseen data (the test set), not seen data (the training set)

  • A quick review on ML lingos for those trained in econometrics

    • Sample to estimate parameters = Training sample

    • Estimating the model = Being trained

    • Regressors, covariates, or predictors = Features

    • Regression parameters = Weights

    • Prediction problems = Supervised (some \(y\) are known) + Unsupervised (\(y\) unknown)

How to teach machines. Based on vas3k blog. Many images in this chapter come from vas3k blog.

The main types of machine learning. Based on vas3k blog

The map of the machine learning universe. Based on vas3k blog

Classical machine learning. Based on vas3k blog

8.2 Dataset

# Load packages

## CRAN packages
pacman::p_load(
  here,
  tidyverse,
  tidymodels,
  doParallel, # parallel processing
  patchwork, # arranging ggplots
  remotes,
  SuperLearner,
  vip,
  glmnet,
  xgboost,
  rpart,
  ranger,
  conflicted
)

remotes::install_github("ck37/ck37r")
## Skipping install of 'ck37r' from a github remote, the SHA1 (b5214459) has not changed since last install.
##   Use `force = TRUE` to force installation
conflicted::conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
## Jae's custom functions
source(here("functions", "ml_utils.r"))

# Import the dataset

data_original <- read_csv(here("data", "heart.csv"))
## Rows: 303 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpea...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(data_original)
## Rows: 303
## Columns: 14
## $ age      <dbl> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5…
## $ sex      <dbl> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1…
## $ cp       <dbl> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0…
## $ trestbps <dbl> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1…
## $ chol     <dbl> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2…
## $ fbs      <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ restecg  <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1…
## $ thalach  <dbl> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1…
## $ exang    <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0…
## $ slope    <dbl> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1…
## $ ca       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0…
## $ thal     <dbl> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3…
## $ target   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
# Create a copy
data <- data_original

theme_set(theme_minimal())

8.3 Workflow

    1. Preprocessing
    2. Model building
    3. Model fitting
    4. Model evaluation
    5. Model tuning
    6. Prediction
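
In tidymodels terms (introduced in the next section), the six steps map onto package functions as follows. Here is a minimal end-to-end sketch using the built-in mtcars data (a hypothetical example; the chapter's heart-disease pipeline follows the same pattern):

library(tidymodels)

set.seed(1234)
split <- initial_split(mtcars)

# 1. Preprocessing: recipes
rec <- recipe(mpg ~ ., data = training(split)) %>%
  step_normalize(all_predictors())

# 2. Model building: parsnip
spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# 3. Model fitting: workflows
wf_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec) %>%
  fit(training(split))

# 6. Prediction + 4. Model evaluation: yardstick
# (5. Model tuning is skipped here; see the tune sections below)
predict(wf_fit, testing(split)) %>%
  bind_cols(testing(split) %>% select(mpg)) %>%
  rmse(truth = mpg, estimate = .pred)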

8.4 tidymodels

  • Like tidyverse, tidymodels is a collection of packages.

    • rsample: for data splitting

    • recipes: for pre-processing

    • parsnip: for model building

      • tune: for hyperparameter tuning

    • yardstick: for model evaluation

    • workflows: for bundling a pipeline that combines preprocessing, modeling, and post-processing requests

  • Why take a tidyverse approach to machine learning?

  • Benefits

    • Readable code

    • Reusable data structures

    • Extendable code

Tidymodels. From RStudio.

tidymodels is an integrated, modular, extensible set of packages that implements a framework facilitating the creation of predictive stochastic models. - Joseph

8.5 Pre-processing

  • recipes: for pre-processing

  • textrecipes for text pre-processing

  • Step 1: recipe() defines target and predictor variables (ingredients).

  • Step 2: step_*() defines preprocessing steps to be taken (recipe).

    The preprocessing steps list draws on the vignette of the parsnip package.

    • dummy: Also called one-hot encoding

    • zero variance: Removing columns (or features) with a single unique value

    • impute: Imputing missing values

    • decorrelate: Mitigating correlated predictors (e.g., principal component analysis)

    • normalize: Centering and/or scaling predictors. Scaling matters because many algorithms (e.g., lasso) are scale-variant (tree-based algorithms are a notable exception). Recall that normalization (sensitive to outliers) = \(\frac{X - X_{min}}{X_{max} - X_{min}}\) and standardization (less sensitive to outliers) = \(\frac{X - \mu}{\sigma}\); see the short sketch after this list.

    • transform: Making predictors symmetric

  • Step 3: prep() prepares a dataset to base each step on.

  • Step 4: bake() applies the preprocessing steps to your datasets.
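
To see what the two scalings mentioned above actually do, here is a minimal sketch on a toy vector (not part of the chapter pipeline):

x <- c(1, 2, 3, 100) # 100 is an outlier

# Min-max normalization: maps values into [0, 1]; the outlier
# compresses all other values toward 0
(x - min(x)) / (max(x) - min(x))

# Standardization: centers at 0 with unit standard deviation;
# less distorted by a single extreme value
(x - mean(x)) / sd(x)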

In this course, we focus on two preprocessing tasks.

  • One-hot encoding (creating dummy/indicator variables)
# Turn selected numeric variables into factor variables
data <- data %>%
  dplyr::mutate(across(c("sex", "ca", "cp", "slope", "thal"), as.factor))

glimpse(data)
## Rows: 303
## Columns: 14
## $ age      <dbl> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5…
## $ sex      <fct> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1…
## $ cp       <fct> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0…
## $ trestbps <dbl> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1…
## $ chol     <dbl> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2…
## $ fbs      <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ restecg  <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1…
## $ thalach  <dbl> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1…
## $ exang    <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0…
## $ slope    <fct> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1…
## $ ca       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0…
## $ thal     <fct> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3…
## $ target   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
  • Imputation
# Check missing values

map_df(data, ~ is.na(.) %>% sum())
## # A tibble: 1 × 14
##     age   sex    cp trestbps  chol   fbs restecg thalach exang oldpeak slope
##   <int> <int> <int>    <int> <int> <int>   <int>   <int> <int>   <int> <int>
## 1     0     0     0        0     0     0       0       0     0       0     0
## # ℹ 3 more variables: ca <int>, thal <int>, target <int>
# Add missing values

data$oldpeak[sample(seq_len(nrow(data)), size = 10)] <- NA # randomly pick 10 rows

# Check the number of missing values
data %>%
  map_df(~ is.na(.) %>% sum())
## # A tibble: 1 × 14
##     age   sex    cp trestbps  chol   fbs restecg thalach exang oldpeak slope
##   <int> <int> <int>    <int> <int> <int>   <int>   <int> <int>   <int> <int>
## 1     0     0     0        0     0     0       0       0     0      10     0
## # ℹ 3 more variables: ca <int>, thal <int>, target <int>
# Check the rate of missing values
data %>%
  map_df(~ is.na(.) %>% mean())
## # A tibble: 1 × 14
##     age   sex    cp trestbps  chol   fbs restecg thalach exang oldpeak slope
##   <dbl> <dbl> <dbl>    <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1     0     0     0        0     0     0       0       0     0  0.0330     0
## # ℹ 3 more variables: ca <dbl>, thal <dbl>, target <dbl>

8.5.1 Regression setup

8.5.1.1 Outcome variable

# Continuous variable
data$age %>% class()
## [1] "numeric"

8.5.1.2 Data splitting using random sampling

# for reproducibility
set.seed(1234)

# split
split_reg <- initial_split(data, prop = 0.7)

# training set
raw_train_x_reg <- training(split_reg)

# test set
raw_test_x_reg <- testing(split_reg)

8.5.1.3 recipe

# Regression recipe
rec_reg <- raw_train_x_reg %>%
  # Define the outcome variable
  recipe(age ~ .) %>%
  # Median impute oldpeak column
  step_impute_median(oldpeak) %>%
  # Expand "sex", "ca", "cp", "slope", and "thal" features out into dummy variables (indicators).
  step_dummy(c("sex", "ca", "cp", "slope", "thal"))

# Prepare a dataset to base each step on
prep_reg <- rec_reg %>% prep(retain = TRUE)
# x features
train_x_reg <- juice(prep_reg, all_predictors())

test_x_reg <- bake(
  object = prep_reg,
  new_data = raw_test_x_reg, all_predictors()
)

# y variables
train_y_reg <- juice(prep_reg, all_outcomes())$age %>% as.numeric()
test_y_reg <- bake(prep_reg, raw_test_x_reg, all_outcomes())$age %>% as.numeric()

# Checks
names(train_x_reg) # Make sure there's no age variable!
##  [1] "trestbps" "chol"     "fbs"      "restecg"  "thalach"  "exang"   
##  [7] "oldpeak"  "target"   "sex_X1"   "ca_X1"    "ca_X2"    "ca_X3"   
## [13] "ca_X4"    "cp_X1"    "cp_X2"    "cp_X3"    "slope_X1" "slope_X2"
## [19] "thal_X1"  "thal_X2"  "thal_X3"
class(train_y_reg) # Make sure this is a continuous variable!
## [1] "numeric"
  • Note that other imputation methods are also available.
grep("impute", ls("package:recipes"), value = TRUE)
##  [1] "step_bagimpute"     "step_impute_bag"    "step_impute_knn"   
##  [4] "step_impute_linear" "step_impute_lower"  "step_impute_mean"  
##  [7] "step_impute_median" "step_impute_mode"   "step_impute_roll"  
## [10] "step_knnimpute"     "step_lowerimpute"   "step_meanimpute"   
## [13] "step_medianimpute"  "step_modeimpute"    "step_rollimpute"
  • You can also create your own step_ functions. For more information, see tidymodels.org.

8.5.2 Classification setup

8.5.2.1 Outcome variable

data$target %>% class()
## [1] "numeric"
data$target <- as.factor(data$target)

data$target %>% class()
## [1] "factor"

8.5.2.2 Data splitting using stratified random sampling

# split
split_class <- initial_split(data %>%
  mutate(target = as.factor(target)),
prop = 0.7,
strata = target
)

# training set
raw_train_x_class <- training(split_class)

# testing set
raw_test_x_class <- testing(split_class)

8.5.2.3 recipe

# Classification recipe
rec_class <- raw_train_x_class %>%
  # Define the outcome variable
  recipe(target ~ .) %>%
  # Median impute oldpeak column
  step_impute_median(oldpeak) %>%
  # Expand "sex", "ca", "cp", "slope", and "thal" features out into dummy variables (indicators).
  step_normalize(age) %>%
  step_dummy(c("sex", "ca", "cp", "slope", "thal"))

# Prepare a dataset to base each step on
prep_class <- rec_class %>% prep(retain = TRUE)
# x features
train_x_class <- juice(prep_class, all_predictors())
test_x_class <- bake(prep_class, raw_test_x_class, all_predictors())

# y variables
train_y_class <- juice(prep_class, all_outcomes())$target %>% as.factor()
test_y_class <- bake(prep_class, raw_test_x_class, all_outcomes())$target %>% as.factor()

# Checks
names(train_x_class) # Make sure there's no target variable!
##  [1] "age"      "trestbps" "chol"     "fbs"      "restecg"  "thalach" 
##  [7] "exang"    "oldpeak"  "sex_X1"   "ca_X1"    "ca_X2"    "ca_X3"   
## [13] "ca_X4"    "cp_X1"    "cp_X2"    "cp_X3"    "slope_X1" "slope_X2"
## [19] "thal_X1"  "thal_X2"  "thal_X3"
class(train_y_class) # Make sure this is a factor variable!
## [1] "factor"

8.6 Supervised learning

x -> f -> y (y defined)

8.6.1 OLS and Lasso

8.6.1.1 parsnip

  • Build models (parsnip)
  1. Specify a model
  2. Specify an engine
  3. Specify a mode
# OLS spec
ols_spec <- linear_reg() %>% # Specify a model
  set_engine("lm") %>% # Specify an engine: lm, glmnet, stan, keras, spark
  set_mode("regression") # Declare a mode: regression or classification

Lasso is one of the regularization techniques along with ridge and elastic-net.

# Lasso spec
lasso_spec <- linear_reg(
  penalty = 0.1, # tuning hyperparameter
  mixture = 1
) %>% # 1 = lasso, 0 = ridge
  set_engine("glmnet") %>%
  set_mode("regression")

# If you don't understand parsnip arguments
lasso_spec %>% translate() # See the documentation
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.1
##   mixture = 1
## 
## Computational engine: glmnet 
## 
## Model fit template:
## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
##     alpha = 1, family = "gaussian")
  • Fit models
ols_fit <- ols_spec %>%
  fit_xy(x = train_x_reg, y = train_y_reg)
# fit(age ~ ., data = raw_train_x_reg) # Formula interface, when your data are not preprocessed

lasso_fit <- lasso_spec %>%
  fit_xy(x = train_x_reg, y = train_y_reg)

8.6.1.2 yardstick

  • Visualize model fits
map2(list(ols_fit, lasso_fit), c("OLS", "Lasso"), visualize_fit)
## [[1]]

## 
## [[2]]

# Define performance metrics
metrics <- yardstick::metric_set(rmse, mae, rsq)

# Evaluate many models
evals <- purrr::map(list(ols_fit, lasso_fit), evaluate_reg) %>%
  reduce(bind_rows) %>%
  mutate(type = rep(c("OLS", "Lasso"), each = 3))

# Visualize the test results
evals %>%
  ggplot(aes(x = fct_reorder(type, .estimate), y = .estimate)) +
  geom_point() +
  labs(
    x = "Model",
    y = "Estimate"
  ) +
  facet_wrap(~ glue("{toupper(.metric)}"), scales = "free_y")

- For more information, read Tidy Modeling with R by Max Kuhn and Julia Silge.

8.6.1.3 tune

Hyperparameters are parameters that control the learning process.

8.6.1.3.1 tune ingredients
  • Search space for hyperparameters
  1. Grid search: exhaustively evaluate a grid of candidate hyperparameter values

  2. Random search: randomly sample points from a bounded domain (see the short illustration below)
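
A quick illustration of the two strategies using dials, which ships with tidymodels (a hedged aside; only the regular grid is used below):

# Grid search: 5 evenly spaced penalty values (on the log scale)
grid_regular(penalty(), levels = 5)

# Random search: 5 randomly sampled penalty values
grid_random(penalty(), size = 5)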

# tune() = placeholder

tune_spec <- linear_reg(
  penalty = tune(), # tuning hyperparameter
  mixture = 1
) %>% # 1 = lasso, 0 = ridge
  set_engine("glmnet") %>%
  set_mode("regression")

tune_spec
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = tune()
##   mixture = 1
## 
## Computational engine: glmnet
# grid_regular() below creates 50 candidate penalty values

lambda_grid <- grid_regular(penalty(), levels = 50)

Source: Kaggle

# 10-fold cross-validation

set.seed(1234) # for reproducibility

rec_folds <- vfold_cv(train_x_reg %>% bind_cols(tibble(age = train_y_reg)))
8.6.1.3.2 Add these elements to a workflow
# Workflow
rec_wf <- workflow() %>%
  add_model(tune_spec) %>%
  add_formula(age ~ .)
# Tuning results
rec_res <- rec_wf %>%
  tune_grid(
    resamples = rec_folds,
    grid = lambda_grid
  )
8.6.1.3.3 Visualize
# Visualize

rec_res %>%
  collect_metrics() %>%
  ggplot(aes(penalty, mean, col = .metric)) +
  geom_errorbar(aes(
    ymin = mean - std_err,
    ymax = mean + std_err
  ),
  alpha = 0.3
  ) +
  geom_line(size = 2) +
  scale_x_log10() +
  labs(x = "log(lambda)") +
  facet_wrap(~ glue("{toupper(.metric)}"),
    scales = "free",
    nrow = 2
  ) +
  theme(legend.position = "none")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

8.6.1.3.4 Select
conflict_prefer("filter", "dplyr")
## [conflicted] Removing existing preference.
## [conflicted] Will prefer dplyr::filter over any other package.
top_rmse <- show_best(rec_res, metric = "rmse")

best_rmse <- select_best(rec_res, metric = "rmse")

best_rmse
## # A tibble: 1 × 2
##   penalty .config              
##     <dbl> <chr>                
## 1   0.391 Preprocessor1_Model48
glue('The RMSE of the initial model is
     {evals %>%
  filter(type == "Lasso", .metric == "rmse") %>%
  select(.estimate) %>%
  round(2)}')
## The RMSE of the initial model is
##    7.82
glue('The RMSE of the tuned model is {rec_res %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  arrange(mean) %>%
  dplyr::slice(1) %>%
  select(mean) %>%
  round(2)}')
## The RMSE of the tuned model is 7.55
finalize_lasso <- rec_wf %>%
  finalize_workflow(best_rmse)

finalize_lasso %>%
  fit(train_x_reg %>% bind_cols(tibble(age = train_y_reg))) %>%
  pull_workflow_fit() %>%
  vip::vip()
## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## ℹ Please use `extract_fit_parsnip()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

8.6.1.3.5 Test fit
  • Apply the tuned model to the test dataset
test_fit <- finalize_lasso %>%
  fit(test_x_reg %>% bind_cols(tibble(age = test_y_reg)))

evaluate_reg(test_fit)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       7.06 
## 2 mae     standard       5.80 
## 3 rsq     standard       0.407

8.6.2 Decision tree

8.6.2.1 parsnip

  • Build a model
  1. Specify a model
  2. Specify an engine
  3. Specify a mode
# workflow
tree_wf <- workflow() %>% add_formula(target ~ .)

# spec
tree_spec <- decision_tree(

  # Mode
  mode = "classification",

  # Tuning hyperparameters
  cost_complexity = NULL,
  tree_depth = NULL
) %>%
  set_engine("rpart") # rpart, c5.0, spark

tree_wf <- tree_wf %>% add_model(tree_spec)
  • Fit a model
tree_fit <- tree_wf %>% fit(train_x_class %>% bind_cols(tibble(target = train_y_class)))

8.6.2.2 yardstick

  • Let’s formally test prediction performance.
  1. Confusion matrix

A confusion matrix is often used to describe the performance of a classification model. The example below is based on a binary classification model.

              Predicted: YES        Predicted: NO
Actual: YES   True positive (TP)    False negative (FN)
Actual: NO    False positive (FP)   True negative (TN)
  2. Metrics
  • accuracy: The proportion of the data predicted correctly (\(\frac{TP + TN}{total}\)). 1 - accuracy = misclassification rate.

  • precision: Positive predictive value. When the model predicts yes, how correct is it? (\(\frac{TP}{TP + FP}\))

  • recall (sensitivity): True positive rate (e.g., actual positives correctly identified as positive). When the actual value is yes, how often does the model predict yes? (\(\frac{TP}{TP + FN}\))

  • F-score: A weighted harmonic mean of precision and recall. (For a worked example of these metrics, see the sketch below.)

  • ROC curve (receiver operating characteristic curve): a plot of the true positive rate (y-axis) against the false positive rate (x-axis) at different classification thresholds. What matters is the AUC (Area Under the ROC Curve), which equals the probability that the model ranks a randomly chosen “positive” above a randomly chosen “negative” (for this probabilistic interpretation of AUC, see this blog post).

Source: Google Machine Learning Crash Course
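
To make these metric definitions concrete, here is a toy calculation from a hypothetical confusion matrix (all counts are made up):

# Hypothetical counts: TP = 40, FN = 10, FP = 5, TN = 45
tp <- 40; fn <- 10; fp <- 5; tn <- 45

(tp + tn) / (tp + tn + fp + fn) # accuracy = 0.85

prec <- tp / (tp + fp) # precision ~ 0.889
rec <- tp / (tp + fn) # recall = 0.8
2 * prec * rec / (prec + rec) # F1 score ~ 0.842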

  • To learn more about other metrics, check out the yardstick package references.
# Define performance metrics

metrics <- yardstick::metric_set(accuracy, precision, recall)

# Visualize

tree_fit_viz_metr <- visualize_class_eval(tree_fit)

tree_fit_viz_metr

tree_fit_viz_mat <- visualize_class_conf(tree_fit)

tree_fit_viz_mat

8.6.2.3 tune

8.6.2.3.1 tune ingredients

Decision trees tend to overfit. To reduce this problem, we need to consider two things: how to split and when to stop growing a tree.

  • complexity parameter: a high CP means a simple decision tree with few splits.

  • tree_depth

tune_spec <- decision_tree(
  cost_complexity = tune(), # how to split
  tree_depth = tune(), # when to stop
  mode = "classification"
) %>%
  set_engine("rpart")

tree_grid <- grid_regular(cost_complexity(),
  tree_depth(),
  levels = 5
) # 2 hyperparameters -> 5*5 = 25 combinations

tree_grid %>%
  count(tree_depth)
## # A tibble: 5 × 2
##   tree_depth     n
##        <int> <int>
## 1          1     5
## 2          4     5
## 3          8     5
## 4         11     5
## 5         15     5
# 10-fold cross-validation

set.seed(1234) # for reproducibility

tree_folds <- vfold_cv(train_x_class %>% bind_cols(tibble(target = train_y_class)),
  strata = target
)
8.6.2.3.2 Add these elements to a workflow
# Update workflow
tree_wf <- tree_wf %>% update_model(tune_spec)

# Determine the number of cores
no_cores <- detectCores() - 1

# Initiate
cl <- makeCluster(no_cores)

registerDoParallel(cl)

# Tuning results
tree_res <- tree_wf %>%
  tune_grid(
    resamples = tree_folds,
    grid = tree_grid,
    metrics = metrics
  )
8.6.2.3.3 Visualize
  • The following plot draws on the vignette of the tidymodels package.
tree_res %>%
  collect_metrics() %>%
  mutate(tree_depth = factor(tree_depth)) %>%
  ggplot(aes(cost_complexity, mean, col = .metric)) +
  geom_point(size = 3) +
  # Subplots
  facet_wrap(~tree_depth,
    scales = "free",
    nrow = 2
  ) +
  # Log scale x
  scale_x_log10(labels = scales::label_number()) +
  # Discrete color scale
  scale_color_viridis_d(option = "plasma", begin = .9, end = 0) +
  labs(
    x = "Cost complexity",
    col = "Tree depth",
    y = NULL
  ) +
  coord_flip()

8.6.2.3.4 Select
# Optimal hyperparameter
best_tree <- select_best(tree_res, "recall")

# Add the hyperparameter to the workflow
finalize_tree <- tree_wf %>%
  finalize_workflow(best_tree)
tree_fit_tuned <- finalize_tree %>%
  fit(train_x_class %>% bind_cols(tibble(target = train_y_class)))

# Metrics
(tree_fit_viz_metr + labs(title = "Non-tuned")) / (visualize_class_eval(tree_fit_tuned) + labs(title = "Tuned"))

# Confusion matrix
(tree_fit_viz_mat + labs(title = "Non-tuned")) / (visualize_class_conf(tree_fit_tuned) + labs(title = "Tuned"))

  • Visualize variable importance
tree_fit_tuned %>%
  pull_workflow_fit() %>%
  vip::vip()

8.6.2.3.5 Test fit
  • Apply the tuned model to the test dataset
test_fit <- finalize_tree %>%
  fit(test_x_class %>% bind_cols(tibble(target = test_y_class)))

evaluate_class(test_fit)
## # A tibble: 3 × 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 accuracy  binary         0.761
## 2 precision binary         0.778
## 3 recall    binary         0.667

In the next subsection, we will learn variants of ensemble models that improve decision tree models by putting models together.

8.6.3 Bagging (Random forest)

Key idea applied across all ensemble models (bagging, boosting, and stacking): single learner -> N learners (N > 1)

Many learners could perform better than a single learner as this approach reduces the variance of a single estimate and provides more stability.

Here we focus on the difference between bagging and boosting. In short, boosting may reduce bias (while potentially increasing variance), whereas bagging may reduce variance but does little about bias. Please check out What is the difference between Bagging and Boosting? by aporras. A toy sketch contrasting the two aggregation rules appears after the boosting summary below.

bagging

  • Data: Training data will be randomly sampled with replacement (bootstrapping samples + drawing random subsets of features for training individual trees)

  • Learning: Building models in parallel (independently)

  • Prediction: Simple average of the estimated responses (majority vote system)

From Sebastian Raschka’s blog

boosting

  • Data: Weighted training data will be randomly sampled

  • Learning: Building models sequentially (mispredicted cases would receive more weights)

  • Prediction: Weighted average of the estimated responses

From Sebastian Raschka’s blog
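
To make the two aggregation rules concrete, here is a toy sketch (the predictions and weights are made up):

# Hypothetical 0/1 predictions from three learners for three observations
preds <- rbind(
  learner1 = c(1, 0, 1),
  learner2 = c(1, 1, 0),
  learner3 = c(0, 0, 1)
)

# Bagging-style aggregation: simple majority vote across learners
apply(preds, 2, function(v) as.integer(mean(v) >= 0.5))

# Boosting-style aggregation: weighted average, where stronger learners
# receive larger (made-up) weights
w <- c(0.5, 0.3, 0.2)
as.integer(colSums(preds * w) >= 0.5)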

8.6.3.1 parsnip

  • Build a model
  1. Specify a model
  2. Specify an engine
  3. Specify a mode
# workflow
rand_wf <- workflow() %>% add_formula(target ~ .)

# spec
rand_spec <- rand_forest(

  # Mode
  mode = "classification",

  # Tuning hyperparameters
  mtry = NULL, # The number of predictors available for splitting at each node
  min_n = NULL, # The minimum number of data points needed to keep splitting nodes
  trees = 500
) %>% # The number of trees
  set_engine("ranger",
    # We want the importance of predictors to be assessed.
    seed = 1234,
    importance = "permutation"
  )

rand_wf <- rand_wf %>% add_model(rand_spec)
  • Fit a model
rand_fit <- rand_wf %>% fit(train_x_class %>% bind_cols(tibble(target = train_y_class)))

8.6.3.2 yardstick

# Define performance metrics
metrics <- yardstick::metric_set(accuracy, precision, recall)

rand_fit_viz_metr <- visualize_class_eval(rand_fit)

rand_fit_viz_metr

  • Visualize the confusion matrix.
rand_fit_viz_mat <- visualize_class_conf(rand_fit)

rand_fit_viz_mat

8.6.3.3 tune

8.6.3.3.1 tune ingredients

We focus on the following two hyperparameters:

  • mtry: The number of predictors available for splitting at each node.

  • min_n: The minimum number of data points needed to keep splitting nodes.

tune_spec <-
  rand_forest(
    mode = "classification",

    # Tuning hyperparameters
    mtry = tune(),
    min_n = tune()
  ) %>%
  set_engine("ranger",
    seed = 1234,
    importance = "permutation"
  )

rand_grid <- grid_regular(mtry(range = c(1, 10)),
  min_n(range = c(2, 10)),
  levels = 5
)

rand_grid %>%
  count(min_n)
## # A tibble: 5 × 2
##   min_n     n
##   <int> <int>
## 1     2     5
## 2     4     5
## 3     6     5
## 4     8     5
## 5    10     5
# 10-fold cross-validation

set.seed(1234) # for reproducibility

rand_folds <- vfold_cv(train_x_class %>% bind_cols(tibble(target = train_y_class)),
  strata = target
)
8.6.3.3.2 Add these elements to a workflow
# Update workflow
rand_wf <- rand_wf %>% update_model(tune_spec)

# Tuning results
rand_res <- rand_wf %>%
  tune_grid(
    resamples = rand_folds,
    grid = rand_grid,
    metrics = metrics
  )
8.6.3.3.3 Visualize
rand_res %>%
  collect_metrics() %>%
  mutate(min_n = factor(min_n)) %>%
  ggplot(aes(mtry, mean, color = min_n)) +
  # Line + Point plot
  geom_line(size = 1.5, alpha = 0.6) +
  geom_point(size = 2) +
  # Subplots
  facet_wrap(~.metric,
    scales = "free",
    nrow = 2
  ) +
  # Log scale x
  scale_x_log10(labels = scales::label_number()) +
  # Discrete color scale
  scale_color_viridis_d(option = "plasma", begin = .9, end = 0) +
  labs(
    x = "The number of predictors to be sampled",
    col = "The minimum number of data points needed for splitting",
    y = NULL
  ) +
  theme(legend.position = "bottom")

# Optimal hyperparameter
best_tree <- select_best(rand_res, "accuracy")

best_tree
## # A tibble: 1 × 3
##    mtry min_n .config              
##   <int> <int> <chr>                
## 1     1     4 Preprocessor1_Model06
# Add the hyperparameter to the workflow
finalize_tree <- rand_wf %>%
  finalize_workflow(best_tree)
rand_fit_tuned <- finalize_tree %>%
  fit(train_x_class %>% bind_cols(tibble(target = train_y_class)))

# Metrics
(rand_fit_viz_metr + labs(title = "Non-tuned")) / (visualize_class_eval(rand_fit_tuned) + labs(title = "Tuned"))

# Confusion matrix
(rand_fit_viz_mat + labs(title = "Non-tuned")) / (visualize_class_conf(rand_fit_tuned) + labs(title = "Tuned"))

  • Visualize variable importance
rand_fit_tuned %>%
  pull_workflow_fit() %>%
  vip::vip()

8.6.3.3.4 Test fit
  • Apply the tuned model to the test dataset
test_fit <- finalize_tree %>%
  fit(test_x_class %>%
    bind_cols(tibble(target = test_y_class)))

evaluate_class(test_fit)
## # A tibble: 3 × 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 accuracy  binary         0.924
## 2 precision binary         0.927
## 3 recall    binary         0.905

8.6.4 Boosting (XGboost)

8.6.4.1 parsnip

  • Build a model
  1. Specify a model
  2. Specify an engine
  3. Specify a mode
# workflow
xg_wf <- workflow() %>% add_formula(target ~ .)

# spec
xg_spec <- boost_tree(

  # Mode
  mode = "classification",

  # Tuning hyperparameters
  # (parsnip arguments take single values; grids of candidate
  # values belong in tune -- see the tune subsection below)

  # The number of trees to fit, aka boosting iterations
  trees = 100,
  # The depth of the decision tree (how many levels of splits).
  tree_depth = 1,
  # Learning rate: lower means the ensemble will adapt more slowly.
  learn_rate = 0.0001,
  # Stop splitting a tree if we only have this many obs in a tree node.
  min_n = 10L
) %>%
  set_engine("xgboost")

xg_wf <- xg_wf %>% add_model(xg_spec)
  • Fit a model
xg_fit <- xg_wf %>% fit(train_x_class %>% bind_cols(tibble(target = train_y_class)))

8.6.4.2 yardstick

metrics <- metric_set(
  yardstick::accuracy,
  yardstick::precision,
  yardstick::recall
)

evaluate_class(xg_fit)
## # A tibble: 3 × 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 accuracy  binary         0.739
## 2 precision binary         0.705
## 3 recall    binary         0.738
xg_fit_viz_metr <-
  visualize_class_eval(xg_fit)

xg_fit_viz_metr

  • Visualize the confusion matrix.
xg_fit_viz_mat <-
  visualize_class_conf(xg_fit)

xg_fit_viz_mat

8.6.4.3 tune

8.6.4.3.1 tune ingredients
  • We focus on the following hyperparameters: trees, tree_depth, learn_rate, min_n, mtry, loss_reduction, and sample_size
tune_spec <-
  boost_tree(

    # Mode
    mode = "classification",

    # Tuning hyperparameters

    # The number of trees to fit, aka boosting iterations
    trees = tune(),
    # The depth of the decision tree (how many levels of splits).
    tree_depth = tune(),
    # Learning rate: lower means the ensemble will adapt more slowly.
    learn_rate = tune(),
    # Stop splitting a tree if we only have this many obs in a tree node.
    min_n = tune(),
    loss_reduction = tune(),
    # The number of randomly selected predictors considered at each split
    mtry = tune(),
    # The size of the data set used for modeling within an iteration
    sample_size = tune()
  ) %>%
  set_engine("xgboost")

# Space-filling hyperparameter grids
xg_grid <- grid_latin_hypercube(
  trees(),
  tree_depth(),
  learn_rate(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), train_x_class),
  size = 30
)

# 10-fold cross-validation

set.seed(1234) # for reproducibility

xg_folds <- vfold_cv(train_x_class %>% bind_cols(tibble(target = train_y_class)),
  strata = target
)
8.6.4.3.2 Add these elements to a workflow
# Update workflow
xg_wf <- xg_wf %>% update_model(tune_spec)

# Tuning results
xg_res <- xg_wf %>%
  tune_grid(
    resamples = xg_folds,
    grid = xg_grid,
    control = control_grid(save_pred = TRUE)
  )
8.6.4.3.3 Visualize
conflict_prefer("filter", "dplyr")
## [conflicted] Removing existing preference.
## [conflicted] Will prefer dplyr::filter over any other package.
xg_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  pivot_longer(mtry:sample_size,
    values_to = "value",
    names_to = "parameter"
  ) %>%
  ggplot(aes(x = value, y = mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(
    y = "AUC",
    x = NULL
  )

# Optimal hyperparameter
best_xg <- select_best(xg_res, "roc_auc")

best_xg
## # A tibble: 1 × 8
##    mtry trees min_n tree_depth    learn_rate loss_reduction sample_size .config 
##   <int> <int> <int>      <int>         <dbl>          <dbl>       <dbl> <chr>   
## 1     6  1856     6         10 0.00000000859  0.00000000102       0.681 Preproc…
# Add the hyperparameter to the workflow
finalize_xg <- xg_wf %>%
  finalize_workflow(best_xg)
xg_fit_tuned <- finalize_xg %>%
  fit(train_x_class %>% bind_cols(tibble(target = train_y_class)))

# Metrics
(xg_fit_viz_metr + labs(title = "Non-tuned")) / (visualize_class_eval(xg_fit_tuned) + labs(title = "Tuned"))

# Confusion matrix
(xg_fit_viz_mat + labs(title = "Non-tuned")) / (visualize_class_conf(xg_fit_tuned) + labs(title = "Tuned"))

  • Visualize variable importance
xg_fit_tuned %>%
  pull_workflow_fit() %>%
  vip::vip()

8.6.4.3.4 Test fit
  • Apply the tuned model to the test dataset
test_fit <- finalize_xg %>%
  fit(test_x_class %>% bind_cols(tibble(target = test_y_class)))

evaluate_class(test_fit)
## # A tibble: 3 × 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 accuracy  binary         0.761
## 2 precision binary         0.763
## 3 recall    binary         0.690

8.6.5 Stacking (SuperLearner)

This stacking part of the book heavily relies on Chris Kennedy’s notebook.

8.6.5.1 Overview

8.6.5.1.1 Stacking

Wolpert, D.H., 1992. Stacked generalization. Neural networks, 5(2), pp.241-259.

Breiman, L., 1996. [Stacked regressions](https://statistics.berkeley.edu/sites/default/files/tech-reports/367.pdf). Machine learning, 24(1), pp.49-64.

8.6.5.1.2 SuperLearner

The SuperLearner R package simplifies ensemble learning by letting you simultaneously evaluate the cross-validated performance of multiple algorithms and/or a single algorithm with differently tuned hyperparameters. This is generally a more advisable approach than fitting a single algorithm.

Let’s see how the four classification algorithms you learned in this chapter (1-lasso, 2-decision tree, 3-random forest, and 4-gradient boosted trees) compare to each other and also to 5-binary logistic regression (glm) and the 6-mean of Y as a benchmark algorithm, in terms of their cross-validated error!

A “wrapper” is a short function that adapts an algorithm for the SuperLearner package; a simplified sketch of one wrapper appears below. The next subsection lists all the wrappers SuperLearner offers.
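
For intuition, here is a slightly simplified version of the built-in SL.mean wrapper (the packaged version also accepts an id argument):

# A wrapper returns a list with the predictions and the fitted object
SL.mean_sketch <- function(Y, X, newX, family, obsWeights, ...) {
  # "Fit": the weighted mean of the outcome, ignoring all features
  meanY <- weighted.mean(Y, w = obsWeights)
  # Predict that mean for every new observation
  pred <- rep.int(meanY, times = nrow(newX))
  fit <- list(object = meanY)
  out <- list(pred = pred, fit = fit)
  class(out$fit) <- "SL.mean"
  return(out)
}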

8.6.5.2 Choose algorithms

# Review available models
SuperLearner::listWrappers()
## All prediction algorithm wrappers in SuperLearner:
##  [1] "SL.bartMachine"      "SL.bayesglm"         "SL.biglasso"        
##  [4] "SL.caret"            "SL.caret.rpart"      "SL.cforest"         
##  [7] "SL.earth"            "SL.extraTrees"       "SL.gam"             
## [10] "SL.gbm"              "SL.glm"              "SL.glm.interaction" 
## [13] "SL.glmnet"           "SL.ipredbagg"        "SL.kernelKnn"       
## [16] "SL.knn"              "SL.ksvm"             "SL.lda"             
## [19] "SL.leekasso"         "SL.lm"               "SL.loess"           
## [22] "SL.logreg"           "SL.mean"             "SL.nnet"            
## [25] "SL.nnls"             "SL.polymars"         "SL.qda"             
## [28] "SL.randomForest"     "SL.ranger"           "SL.ridge"           
## [31] "SL.rpart"            "SL.rpartPrune"       "SL.speedglm"        
## [34] "SL.speedlm"          "SL.step"             "SL.step.forward"    
## [37] "SL.step.interaction" "SL.stepAIC"          "SL.svm"             
## [40] "SL.template"         "SL.xgboost"
## 
## All screening algorithm wrappers in SuperLearner:
## [1] "All"
## [1] "screen.corP"           "screen.corRank"        "screen.glmnet"        
## [4] "screen.randomForest"   "screen.SIS"            "screen.template"      
## [7] "screen.ttest"          "write.screen.template"
# Compile the algorithm wrappers to be used.
sl_lib <- c(
  "SL.mean", # Marginal mean of the outcome (benchmark)
  "SL.glmnet", # GLM with lasso/elastic-net regularization
  "SL.rpart", # Decision tree
  "SL.ranger", # Random forest
  "SL.xgboost" # XGBoost
)

8.6.5.3 Fit model

Fit the ensemble!

# This is a seed that is compatible with multicore parallel processing.
# See ?set.seed for more information.
set.seed(1, "L'Ecuyer-CMRG")

# This will take a few minutes to execute - take a look at the .html file to see the output!
cv_sl <- SuperLearner::CV.SuperLearner(
  Y = as.numeric(as.character(train_y_class)),
  X = train_x_class,
  family = binomial(),
  # For a real analysis we would use V = 10.
  cvControl = list(V = 5L, stratifyCV = TRUE),
  SL.library = sl_lib,
  verbose = FALSE
)

8.6.5.4 Risk

Risk is the average loss, and loss is how far off the prediction was for an individual observation. The lower the risk, the fewer errors the model makes in its prediction. SuperLearner’s default loss metric is squared error \((y_{actual} - y_{predicted})^2\), so the risk is the mean-squared error (just like in ordinary least squares regression). View the summary, plot results, and compute the Area Under the ROC Curve (AUC)!
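
A quick illustration of these definitions (made-up predictions):

y_actual <- c(1, 0, 1, 1)
y_predicted <- c(0.9, 0.2, 0.6, 0.4)

# Loss: squared error for each observation
(y_actual - y_predicted)^2

# Risk: the average loss (here, the mean squared error)
mean((y_actual - y_predicted)^2)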

8.6.5.4.1 Summary
  • Discrete SL chooses the best single learner (in this case, SL.glmnet or lasso).
  • SuperLearner takes a weighted average of the models using the coefficients (importance of each learner in the overall ensemble). Coefficient 0 means that learner is not used at all.
  • SL.mean_All (the weighted mean of \(Y\)) is a benchmark algorithm (ignoring features).
summary(cv_sl)
## 
## Call:  
## SuperLearner::CV.SuperLearner(Y = as.numeric(as.character(train_y_class)),  
##     X = train_x_class, family = binomial(), SL.library = sl_lib, verbose = FALSE,  
##     cvControl = list(V = 5L, stratifyCV = TRUE)) 
## 
## Risk is based on: Mean Squared Error
## 
## All risk estimates are based on V =  5 
## 
##       Algorithm     Ave        se      Min     Max
##   Super Learner 0.11277 0.0135685 0.075782 0.14318
##     Discrete SL 0.11854 0.0145053 0.074937 0.16302
##     SL.mean_All 0.24798 0.0030968 0.247743 0.24895
##   SL.glmnet_All 0.10735 0.0135893 0.074937 0.14324
##    SL.rpart_All 0.16730 0.0197985 0.111990 0.22102
##   SL.ranger_All 0.12566 0.0119774 0.097945 0.15955
##  SL.xgboost_All 0.13035 0.0149709 0.098472 0.16302
8.6.5.4.2 Plot
# Plot the cross-validated risk estimate with 95% CIs.

plot(cv_sl)

8.6.5.5 Compute AUC for all estimators

ROC

ROC: a receiver operating characteristic curve plots the True Positive Rate (y-axis) against the False Positive Rate (x-axis).

Area Under the ROC Curve

AUC

AUC: Area Under the ROC Curve

1 = perfect

0.5 = no better than chance
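
The ranking interpretation of AUC can be computed directly (a toy sketch with made-up scores):

# AUC = probability that a random positive scores above a random negative
pos_scores <- c(0.9, 0.8, 0.4) # scores for actual positives (made up)
neg_scores <- c(0.7, 0.3, 0.2) # scores for actual negatives (made up)

# Proportion of correctly ranked positive-negative pairs (ignoring ties)
mean(outer(pos_scores, neg_scores, ">")) # 8 of 9 pairs = 0.889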

ck37r::auc_table(cv_sl)
##                      auc         se  ci_lower  ci_upper      p-value
## SL.mean_All    0.5000000 0.06912305 0.3645213 0.6354787 5.228498e-10
## SL.rpart_All   0.8235355 0.03869743 0.7476899 0.8993810 5.550478e-03
## SL.xgboost_All 0.8850572 0.02430397 0.8374223 0.9326921 6.525143e-02
## SL.ranger_All  0.9071396 0.02018027 0.8675870 0.9466922 2.336564e-01
## DiscreteSL     0.9073913 0.02036305 0.8674805 0.9473021 2.394812e-01
## SuperLearner   0.9167735 0.01947639 0.8786004 0.9549465 3.980169e-01
## SL.glmnet_All  0.9218078 0.01910904 0.8843548 0.9592608 5.000000e-01
8.6.5.5.1 Plot the ROC curve for the best estimator (DiscreteSL)
ck37r::plot_roc(cv_sl)

8.6.5.5.2 Review weight distribution for the SuperLearner
print(ck37r::cvsl_weights(cv_sl), row.names = FALSE)
##  # Learner    Mean      SD     Min     Max
##  1  glmnet 0.78127 0.23613 0.45149 0.99803
##  2 xgboost 0.12768 0.19994 0.00000 0.46785
##  3  ranger 0.07492 0.16644 0.00000 0.37265
##  4   rpart 0.01613 0.03607 0.00000 0.08065
##  5    mean 0.00000 0.00000 0.00000 0.00000

The general stacking approach is available in the tidymodels framework through the stacks package (still under development).

However, SuperLearner is currently not available in the tidymodels framework. You can easily build and add a parsnip model if you’d like to. If you are interested in learning more, please look at this tidymodels vignette.

8.6.6 Applications

8.6.6.1 Bandit algorithm (optimizing an experiment)

8.6.6.2 Causal forest (estimating heterogeneous treatment effect)

8.7 Unsupervised learning

x -> f -> y (y not defined)

8.7.1 Dimension reduction

Projecting 2D-data to a line (PCA). From vas3k.com

8.7.1.1 Correlation analysis

This dataset is a good candidate for PCA because some features are highly correlated.

Again, think about what the dataset is about. The following data dictionary comes from this site.

  • age - age in years
  • sex - sex (1 = male; 0 = female)
  • cp - chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
  • trestbps - resting blood pressure (in mm Hg on admission to the hospital)
  • chol - serum cholestoral in mg/dl
  • fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
  • restecg - resting electrocardiographic results (0 = normal; 1 = having ST-T; 2 = hypertrophy)
  • thalach - maximum heart rate achieved
  • exang - exercise induced angina (1 = yes; 0 = no)
  • oldpeak - ST depression induced by exercise relative to rest
  • slope - the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
  • ca - number of major vessels (0-3) colored by flourosopy
  • thal - 3 = normal; 6 = fixed defect; 7 = reversable defect
  • num - the predicted attribute - diagnosis of heart disease (angiographic disease status) (Value 0 = < 50% diameter narrowing; Value 1 = > 50% diameter narrowing)
data_original %>%
  select(-target) %>%
  corrr::correlate()
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
## # A tibble: 13 × 14
##    term         age     sex      cp trestbps     chol      fbs restecg  thalach
##    <chr>      <dbl>   <dbl>   <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
##  1 age      NA      -0.0984 -0.0687   0.279   0.214    0.121   -0.116  -0.399  
##  2 sex      -0.0984 NA      -0.0494  -0.0568 -0.198    0.0450  -0.0582 -0.0440 
##  3 cp       -0.0687 -0.0494 NA        0.0476 -0.0769   0.0944   0.0444  0.296  
##  4 trestbps  0.279  -0.0568  0.0476  NA       0.123    0.178   -0.114  -0.0467 
##  5 chol      0.214  -0.198  -0.0769   0.123  NA        0.0133  -0.151  -0.00994
##  6 fbs       0.121   0.0450  0.0944   0.178   0.0133  NA       -0.0842 -0.00857
##  7 restecg  -0.116  -0.0582  0.0444  -0.114  -0.151   -0.0842  NA       0.0441 
##  8 thalach  -0.399  -0.0440  0.296   -0.0467 -0.00994 -0.00857  0.0441 NA      
##  9 exang     0.0968  0.142  -0.394    0.0676  0.0670   0.0257  -0.0707 -0.379  
## 10 oldpeak   0.210   0.0961 -0.149    0.193   0.0540   0.00575 -0.0588 -0.344  
## 11 slope    -0.169  -0.0307  0.120   -0.121  -0.00404 -0.0599   0.0930  0.387  
## 12 ca        0.276   0.118  -0.181    0.101   0.0705   0.138   -0.0720 -0.213  
## 13 thal      0.0680  0.210  -0.162    0.0622  0.0988  -0.0320  -0.0120 -0.0964 
## # ℹ 5 more variables: exang <dbl>, oldpeak <dbl>, slope <dbl>, ca <dbl>,
## #   thal <dbl>

8.7.1.2 Descriptive statistics

Notice the scaling issues? PCA is not scale-invariant, so we need to fix this problem by rescaling first.

min_max <- list(
  min = ~ min(.x, na.rm = TRUE),
  max = ~ max(.x, na.rm = TRUE)
)

data_original %>%
  select(-target) %>%
  summarise(across(where(is.numeric), min_max))
## # A tibble: 1 × 26
##   age_min age_max sex_min sex_max cp_min cp_max trestbps_min trestbps_max
##     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>        <dbl>        <dbl>
## 1      29      77       0       1      0      3           94          200
## # ℹ 18 more variables: chol_min <dbl>, chol_max <dbl>, fbs_min <dbl>,
## #   fbs_max <dbl>, restecg_min <dbl>, restecg_max <dbl>, thalach_min <dbl>,
## #   thalach_max <dbl>, exang_min <dbl>, exang_max <dbl>, oldpeak_min <dbl>,
## #   oldpeak_max <dbl>, slope_min <dbl>, slope_max <dbl>, ca_min <dbl>,
## #   ca_max <dbl>, thal_min <dbl>, thal_max <dbl>

8.7.1.3 Preprocessing

recipe is essential for preprocessing multiple features at once.

pca_recipe <- recipe(~., data = data_original) %>%
  # Imputing NAs using mean
  step_impute_mean(all_predictors()) %>%
  # Normalize some numeric variables
  step_normalize(c("age", "trestbps", "chol", "thalach", "oldpeak"))

8.7.1.4 PCA analysis

pca_res <- pca_recipe %>%
  step_pca(all_predictors(),
    id = "pca"
  ) %>% # id argument identifies each PCA step
  prep()

pca_res %>%
  tidy(id = "pca")
## # A tibble: 196 × 4
##    terms        value component id   
##    <chr>        <dbl> <chr>     <chr>
##  1 age      -0.00101  PC1       pca  
##  2 sex       0.216    PC1       pca  
##  3 cp        0.321    PC1       pca  
##  4 trestbps  0.00118  PC1       pca  
##  5 chol     -0.000292 PC1       pca  
##  6 fbs       0.0468   PC1       pca  
##  7 restecg   0.166    PC1       pca  
##  8 thalach   0.0137   PC1       pca  
##  9 exang     0.0962   PC1       pca  
## 10 oldpeak  -0.00863  PC1       pca  
## # ℹ 186 more rows
8.7.1.4.1 Screeplot
# To avoid conflicts
conflict_prefer("filter", "dplyr")
## [conflicted] Removing existing preference.
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("select", "dplyr")
## [conflicted] Will prefer dplyr::select over any other package.
pca_recipe %>%
  step_pca(all_predictors(),
    id = "pca"
  ) %>% # id argument identifies each PCA step
  prep() %>%
  tidy(id = "pca", type = "variance") %>%
  filter(terms == "percent variance") %>%
  ggplot(aes(x = component, y = value)) +
  geom_col() +
  labs(
    x = "PCAs of heart disease",
    y = "% of variance",
    title = "Scree plot"
  )

8.7.1.4.2 View factor loadings

Loadings are the covariances between the features and the principal components (=eigenvectors).

pca_recipe %>%
  step_pca(all_predictors(),
    id = "pca"
  ) %>% # id argument identifies each PCA step
  prep() %>%
  tidy(id = "pca") %>%
  filter(component %in% c("PC1", "PC2")) %>%
  ggplot(aes(
    x = fct_reorder(terms, value), y = value,
    fill = component
  )) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(
    x = "Terms",
    y = "Contribtutions",
    fill = "PCAs"
  )

The key lesson

You can use these low-dimensional data to mitigate the curse of dimensionality. Compressing the feature space via dimension reduction techniques is called feature extraction. PCA is one way of doing this.
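
As a hedged follow-up sketch, the component scores themselves can be pulled from the prepped recipe above and used as a compact feature set (step_pca keeps five components by default):

# Extract the processed training data, now expressed as components
pca_scores <- bake(pca_res, new_data = NULL)

# The first two components as a two-dimensional feature set
pca_scores %>%
  select(PC1, PC2) %>%
  head()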

8.7.2 Topic modeling

8.7.2.1 Setup

pacman::p_load(
  tidytext, # tidy text analysis
  glue, # paste string and objects
  stm, # structural topic modeling
  gutenbergr
) # toy datasets

8.7.2.2 Dataset

The data munging process draws on Julia Silge’s blog post.

sherlock_raw <- gutenberg_download(1661)
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
sherlock <- sherlock_raw %>%
  # Mutate story using a conditional statement
  mutate(
    story = ifelse(str_detect(text, "ADVENTURE"), text, NA)
  ) %>%
  # Fill in missing values with next value
  tidyr::fill(story, .direction = "down") %>%
  # Filter
  dplyr::filter(story != "THE ADVENTURES OF SHERLOCK HOLMES") %>%
  # Factor
  mutate(story = factor(story, levels = unique(story)))

sherlock <- sherlock[, 2:3] # no id

8.7.2.3 Key ideas

Source: paperswithcode.com

  • Main papers: See Latent Dirichlet Allocation by David M. Blei, Andrew Y. Ng and Michael I. Jordan (then all Berkeley) and this follow-up paper with the same title.

  • Topics as distributions of words (\(\beta\) distribution)

  • Documents as distributions of topics (\(\alpha\) distribution)

  • What distributions?

    • Probability

    • Multinomial

  • Words lie on a lower-dimensional space (dimension reduction akin to PCA)

  • Co-occurrence of words (clustering)

  • Bag of words (feature engineering)

    • Upside: easy and fast (and it often works quite well)
    • Downside: ignores grammatical structure and rich interactions among words (alternative: word embeddings; please check out text2vec)
  • Documents are exchangeable (sequencing won’t matter).

  • Topics are independent (uncorrelated). If you don’t think this assumption holds, use Correlated Topics Models by Blei and Lafferty (2007).

8.7.2.4 Exploratory data analysis

sherlock_n <- sherlock %>%
  unnest_tokens(
    output = word,
    input = text
  ) %>%
  count(story, word, sort = TRUE)

sherlock_total_n <- sherlock_n %>%
  group_by(story) %>%
  summarise(total = sum(n))

sherlock_words <- sherlock_n %>%
  left_join(sherlock_total_n)
## Joining with `by = join_by(story)`
sherlock_words %>%
  mutate(freq = n / total) %>%
  group_by(story) %>%
  top_n(10) %>%
  ggplot(aes(
    x = fct_reorder(word, freq),
    y = freq,
    fill = story
  )) +
  geom_col() +
  coord_flip() +
  facet_wrap(~story,
    ncol = 2,
    scales = "free_y"
  ) +
  scale_fill_viridis_d() +
  labs(
    x = "",
    fill = "Story"
  ) +
  theme(legend.position = "bottom")
## Selecting by freq

8.7.2.5 STM

Structural Topic Modeling by Roberts, Stewart, and Tingley helps estimate how topics’ proportions vary by covariates. If you don’t use covariates, this approach is close to CTM. The other useful (and very recent) topic modeling package is Keyword Assisted Topic Models (keyATM) by Eshima, Imai, and Sasaki.

Also, note that we didn’t cover other important techniques in topic modeling, such as dynamic and hierarchical topic modeling.

8.7.2.5.1 Turn text into document-term matrix

The stm package has its own preprocessing function.

dtm <- textProcessor(
  documents = sherlock$text,
  metadata = sherlock,
  removestopwords = TRUE,
  verbose = FALSE
)
8.7.2.5.2 Tuning K
  • K is the number of topics.
  • Let’s try K = 5, 10, 15.
test_res <- searchK(
  dtm$documents,
  dtm$vocab,
  K = c(5, 10, 15),
  prevalence = ~story,
  data = dtm$meta
)
## Beginning Spectral Initialization 
##   Calculating the gram matrix...
##   Finding anchor words...
##      .....
##   Recovering initialization...
##      ..............................................
## Initialization complete.
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 1 (approx. per word bound = -7.627) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 2 (approx. per word bound = -7.512, relative change = 1.510e-02) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 3 (approx. per word bound = -7.419, relative change = 1.228e-02) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 4 (approx. per word bound = -7.381, relative change = 5.151e-03) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 5 (approx. per word bound = -7.365, relative change = 2.165e-03) 
## Topic 1: littl, man, see, hand, shall 
##  Topic 2: upon, holm, think, come, take 
##  Topic 3: said, will, just, know, word 
##  Topic 4: one, may, came, tell, ask 
##  Topic 5: time, sherlock, case, saw, face 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 6 (approx. per word bound = -7.358, relative change = 9.504e-04) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 7 (approx. per word bound = -7.355, relative change = 4.015e-04) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 8 (approx. per word bound = -7.354, relative change = 1.580e-04) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Model Converged 
## Beginning Spectral Initialization 
##   Calculating the gram matrix...
##   Finding anchor words...
##      ..........
##   Recovering initialization...
##      ..............................................
## Initialization complete.
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 1 (approx. per word bound = -7.699) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 2 (approx. per word bound = -7.499, relative change = 2.594e-02) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 3 (approx. per word bound = -7.373, relative change = 1.684e-02) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 4 (approx. per word bound = -7.287, relative change = 1.172e-02) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 5 (approx. per word bound = -7.257, relative change = 4.115e-03) 
## Topic 1: miss, littl, came, man, good 
##  Topic 2: said, might, sudden, hous, went 
##  Topic 3: upon, just, never, right, two 
##  Topic 4: upon, will, one, see, may 
##  Topic 5: sherlock, name, think, laugh, holm 
##  Topic 6: see, hard, night, cri, forward 
##  Topic 7: littl, stone, becam, whole, sure 
##  Topic 8: can, know, matter, now, say 
##  Topic 9: man, hand, knew, one, even 
##  Topic 10: holm, ask, sat, “pray, long 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 6 (approx. per word bound = -7.248, relative change = 1.256e-03) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 7 (approx. per word bound = -7.247, relative change = 9.258e-05) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Model Converged 
## Beginning Spectral Initialization 
##   Calculating the gram matrix...
##   Finding anchor words...
##      ...............
##   Recovering initialization...
##      ..............................................
## Initialization complete.
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 1 (approx. per word bound = -7.749) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 2 (approx. per word bound = -7.417, relative change = 4.283e-02) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 3 (approx. per word bound = -7.297, relative change = 1.624e-02) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 4 (approx. per word bound = -7.242, relative change = 7.558e-03) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 5 (approx. per word bound = -7.222, relative change = 2.745e-03) 
## Topic 1: think, holm, turn, now, “ye 
##  Topic 2: might, dress, hous, place, near 
##  Topic 3: know, without, now, “’s, money 
##  Topic 4: open, may, look, much, one 
##  Topic 5: hand, well, see, way, littl 
##  Topic 6: question, salesman, told, companion, close 
##  Topic 7: littl, told, feel, remark, quit 
##  Topic 8: can, matter, “oh, say, away 
##  Topic 9: will, shall, must, come, littl 
##  Topic 10: one, man, light, time, two 
##  Topic 11: upon, holm, miss, man, sherlock 
##  Topic 12: room, came, ask, just, hous 
##  Topic 13: may, tell, sir, find, help 
##  Topic 14: said, holm, believ, laugh, will 
##  Topic 15: littl, now, noth, day, saw 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 6 (approx. per word bound = -7.212, relative change = 1.382e-03) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 7 (approx. per word bound = -7.207, relative change = 5.993e-04) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 8 (approx. per word bound = -7.203, relative change = 5.851e-04) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Completing Iteration 9 (approx. per word bound = -7.202, relative change = 9.837e-05) 
## ....................................................................................................
## Completed E-Step (0 seconds). 
## Completed M-Step. 
## Model Converged
8.7.2.5.3 Evaluating models

Several metrics assess topic models’ performance: the held-out likelihood, residuals, semantic coherence, and exclusivity. Here we examine the relationship between semantic coherence and exclusivity to understand the trade-off involved in selecting K.

  • Semantic coherence: high-probability words for a topic tend to co-occur within documents.

  • Exclusivity: keywords of one topic are unlikely to appear as keywords of other topics.

In Roberts et al. 2014 we proposed using the Mimno et al. 2011 semantic coherence metric for helping with topic model selection. However, we found that semantic coherence alone is relatively easy to achieve by having only a couple of topics that dominate the most common words. Thus we also proposed an exclusivity measure.

Our exclusivity measure includes some information on word frequency as well. It is based on the FREX labeling metric (calcfrex) with the weight set to .7 in favor of exclusivity by default.

test_res$results %>%
  unnest(c(K, exclus, semcoh)) %>%
  select(K, exclus, semcoh) %>%
  mutate(K = as.factor(K)) %>%
  ggplot(aes(x = exclus, y = semcoh)) +
  geom_point() +
  geom_text(
    label = glue("K = {test_res$results$K}"),
    size = 5,
    color = "red",
    position = position_jitter(width = 0.05, height = 0.05)
  ) +
  labs(
    x = "Exclusivity",
    y = "Semantic coherence",
    title = "Exclusivity and semantic coherence"
  )

8.7.2.5.4 Finalize
final_stm <- stm(dtm$documents,
  dtm$vocab,
  K = 10, prevalence = ~story,
  max.em.its = 75,
  data = dtm$meta,
  init.type = "Spectral",
  seed = 1234567,
  verbose = FALSE
)
8.7.2.5.5 Explore the results
  • Using the stm package.
plot(final_stm)

  • Using ggplot2

In LDA, \(\alpha\) governs document-topic density and \(\beta\) governs topic-word density.

# tidy
tidy_stm <- tidy(final_stm)

# top terms
tidy_stm %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  ggplot(aes(fct_reorder(term, beta), beta, fill = as.factor(topic))) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~topic, scales = "free_y") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_viridis_d()

8.8 References

8.8.1 Books

  • An Introduction to Statistical Learning - with Applications in R (2013) by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. Springer: New York. Amazon or free PDF.

  • Hands-On Machine Learning with R (2020) by Bradley Boehmke & Brandon Greenwell. CRC Press or Amazon

  • Applied Predictive Modeling (2013) by Max Kuhn and Kjell Johnson. Springer: New York. Amazon

  • Feature Engineering and Selection: A Practical Approach for Predictive Models (2019) by Kjell Johnson and Max Kuhn. Taylor & Francis. Amazon or free HTML.

  • Tidy Modeling with R (2020) by Max Kuhn and Julia Silge (work-in-progress)

8.8.3 Blog posts