Regression trees

06 May, 2022

Classification & Regression Trees

Advantages

Feature complexity
Prediction
Relative importance
(Multi)collinearity

Classification & Regression Trees

Disadvantages

overfitting (over-learning)

Classification & Regression Trees

Classification

Categorical response

Regression

Continuous response

CART

Simple regression trees

split (partition) the data into major chunks
- each split maximizes the change in explained deviance
- with Gaussian errors, this is equivalent to
  - maximizing the between-group sum of squares (SS)
  - minimizing the error (residual) SS, SSerror
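
A minimal sketch of how a single split could be chosen under Gaussian errors, using base R only. The data, the `ss()` helper and the step at `x = 0.6` are hypothetical and exist only for illustration. Every candidate threshold is scored by its SSerror, and the threshold with the smallest SSerror is also the one with the largest between-group SS.

```r
# Hypothetical data: one predictor with a step change at x = 0.6
set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.6, 2, 5) + rnorm(100, sd = 0.5)

# Sum of squares about the mean (helper defined for this sketch)
ss <- function(v) sum((v - mean(v))^2)

# Candidate thresholds: midpoints between consecutive sorted x values
xs <- sort(unique(x))
cuts <- (head(xs, -1) + tail(xs, -1)) / 2

# SSerror of a split = within-group SS summed over the two groups
ss_error <- sapply(cuts, function(cc) ss(y[x < cc]) + ss(y[x >= cc]))

best_cut   <- cuts[which.min(ss_error)]
ss_total   <- ss(y)
ss_between <- ss_total - min(ss_error)  # largest exactly where SSerror is smallest

best_cut
```

Because SStotal is fixed for a given set of observations, minimizing SSerror and maximizing between-group SS select the same threshold.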

Simple regression trees

split these subsets

Simple regression trees

recursively partition (split) each subset
the result is a decision tree
simple trees tend to overfit
- the error (noise) is fitted along with the model

Simple regression trees

Pruning

reduce overfitting
- deviance at each terminal node (leaf)
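
A hedged sketch of growing and then pruning a tree. It uses `rpart`'s cost-complexity pruning with cross-validated error, which is one common way to prune and not necessarily the procedure behind these slides. The simulated data and the `n_leaves()` helper are assumptions made for illustration.

```r
library(rpart)

# Hypothetical data: only x1 carries signal, x2 is noise
set.seed(1)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- ifelse(dat$x1 < 0.5, 1, 3) + rnorm(n, sd = 0.5)

# Grow a deliberately large tree (cp = 0 switches off the complexity penalty)
big_tree <- rpart(y ~ x1 + x2, data = dat, method = "anova",
                  control = rpart.control(cp = 0, minsplit = 5, xval = 10))

# cptable lists, for each subtree size, the cross-validated relative error
cp_tab  <- big_tree$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]

# Prune back to the subtree with the smallest cross-validated error
pruned_tree <- prune(big_tree, cp = best_cp)

# Count terminal nodes (leaves) before and after pruning
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
c(grown = n_leaves(big_tree), pruned = n_leaves(pruned_tree))
```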

Simple regression trees

Predictions

partial plots
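
One way a partial plot could be computed by hand, sketched with `rpart` and base R. The data and variable names are hypothetical. For each grid value of `x1`, every observation has its `x1` overwritten with that value, predictions are made, and the predictions are averaged over the observed values of the other predictors.

```r
library(rpart)

# Hypothetical data with a step effect of x1 and a linear effect of x2
set.seed(1)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- ifelse(dat$x1 < 0.5, 1, 3) + dat$x2 + rnorm(n, sd = 0.3)

fit <- rpart(y ~ x1 + x2, data = dat, method = "anova")

# Partial effect of x1: fix x1 at each grid value, average predictions over x2
grid <- seq(0, 1, length.out = 25)
partial_x1 <- sapply(grid, function(v) {
  tmp <- dat
  tmp$x1 <- v
  mean(predict(fit, newdata = tmp))
})

plot(grid, partial_x1, type = "s", xlab = "x1", ylab = "partial effect on y")
```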

Classification and Regression Trees

R packages

simple CART

library(tree)

an extension that facilitates (some) non-Gaussian errors

library(rpart)
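
As a small illustration of the non-Gaussian option, the sketch below fits the same simulated count response twice with `rpart`: once with `method = "anova"` (Gaussian) and once with `method = "poisson"`. The data are hypothetical, and the example shows only the call pattern.

```r
library(rpart)

# Hypothetical count data whose rate changes at x1 = 0.5
set.seed(1)
n <- 200
dat <- data.frame(x1 = runif(n))
dat$count <- rpois(n, lambda = ifelse(dat$x1 < 0.5, 2, 8))

# Gaussian (least-squares) tree vs Poisson tree on the same response
gauss_tree <- rpart(count ~ x1, data = dat, method = "anova")
pois_tree  <- rpart(count ~ x1, data = dat, method = "poisson")

# Predictions from the Poisson tree are estimated event rates
head(predict(pois_tree))
```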

Classification and Regression Trees

Limitations

crude overfitting protection
low resolution
limited error distributions
little scope for random effects

Boosted regression Trees

Boosting

machine learning meets predictive modelling
ensemble models
- a sequence of simple trees (10,000+ trees)
- each tree is built to predict the residuals of the trees before it
- shrinkage (each tree contributes only a small step)
- produces an excellent fit
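
The idea can be sketched by hand with `rpart` stumps and squared-error loss. This is a simplified illustration under assumed settings (200 trees, shrinkage 0.1, simulated data), not the implementation of any particular boosting package. Each stump is fitted to the current residuals, and only a shrunken fraction of its prediction is added to the ensemble.

```r
library(rpart)

# Hypothetical data
set.seed(1)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- sin(2 * pi * dat$x1) + dat$x2 + rnorm(n, sd = 0.3)

n_trees   <- 200   # assumed for this sketch; real fits often use far more
shrinkage <- 0.1   # fraction of each tree's prediction that is added

pred      <- rep(mean(dat$y), n)   # start from the overall mean
train_mse <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  # Residuals of the ensemble built so far
  dat$res <- dat$y - pred
  # A simple tree (single split) fitted to those residuals
  stump <- rpart(res ~ x1 + x2, data = dat, method = "anova",
                 control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
  # Add a shrunken version of the new tree to the ensemble
  pred <- pred + shrinkage * predict(stump, newdata = dat)
  train_mse[m] <- mean((dat$y - pred)^2)
}

range(train_mse)
```

The training error keeps falling as trees are added, which is why the number of trees has to be chosen with data the model has not seen.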

Boosted regression Trees

Overfitting

over- vs under-fitting
residual error vs precision
minimizing squared-error loss

Boosted regression Trees

minimizing squared-error loss

test (validation) data
- 75% train, 25% test
out of bag
- 50% in, 50% out
cross-validation
- 3 folds
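
A sketch of the first option (75% train, 25% test), again with hand-rolled boosting of `rpart` stumps on simulated data. All settings and names here are assumptions for illustration. Trees are fitted to the training rows only, and the number of trees is chosen where the squared-error loss on the held-out rows is smallest.

```r
library(rpart)

# Hypothetical data
set.seed(1)
n <- 400
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- sin(2 * pi * dat$x1) + dat$x2 + rnorm(n, sd = 0.3)

# 75% train, 25% test
train_idx <- sample(n, size = 0.75 * n)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

n_trees   <- 300
shrinkage <- 0.1
pred_train <- rep(mean(train$y), nrow(train))
pred_test  <- rep(mean(train$y), nrow(test))
test_mse   <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  train$res <- train$y - pred_train
  stump <- rpart(res ~ x1 + x2, data = train, method = "anova",
                 control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
  pred_train <- pred_train + shrinkage * predict(stump, newdata = train)
  pred_test  <- pred_test  + shrinkage * predict(stump, newdata = test)
  # Squared-error loss on data the trees have never seen
  test_mse[m] <- mean((test$y - pred_test)^2)
}

# Number of trees that minimizes the held-out loss
best_n_trees <- which.min(test_mse)
best_n_trees
```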

Boosted regression Trees

Overfitting

Boosted regression Trees

Predictions

Boosted regression Trees

Variable importance

##    var  rel.inf
## x1  x1 55.06919
## x2  x2 44.93081
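
One common definition of relative influence sums, for each predictor, the squared-error improvement of every split made on it, then rescales the totals to 100. The sketch below applies that definition to hand-rolled boosting with `rpart` stumps. The data and coefficients are hypothetical, and the numbers it produces are unrelated to the output shown above.

```r
library(rpart)

# Hypothetical data in which x1 has the stronger effect
set.seed(42)
n <- 300
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- 4 * dat$x1 + 1 * dat$x2 + rnorm(n, sd = 0.3)

shrinkage   <- 0.1
pred        <- rep(mean(dat$y), n)
improvement <- c(x1 = 0, x2 = 0)

for (m in 1:200) {
  dat$res <- dat$y - pred
  stump <- rpart(res ~ x1 + x2, data = dat, method = "anova",
                 control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
  fr <- stump$frame
  if (nrow(fr) > 1) {
    # Row 1 is the root; its 'var' names the split variable.
    # Improvement = root deviance minus the summed deviance of the two leaves.
    split_var <- as.character(fr$var[1])
    improvement[split_var] <- improvement[split_var] +
      fr$dev[1] - sum(fr$dev[-1])
  }
  pred <- pred + shrinkage * predict(stump, newdata = dat)
}

# Rescale so the influences sum to 100
rel_inf <- 100 * improvement / sum(improvement)
rel_inf
```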

Boosted regression Trees

Pseudo \(R^2\)

\[ 1-\left(\frac{\sum (y_i - E(y_i))^2}{\sum(y_i - \bar{y})^2}\right) \]

## [1] 0.5650695
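
The formula translates directly into a small helper, sketched here in base R. `pseudo_r2()` is a name chosen for this example, and a linear model on simulated data stands in for the fitted model so the result can be checked against the ordinary \(R^2\).

```r
# Pseudo R^2: 1 - (residual sum of squares / total sum of squares)
pseudo_r2 <- function(y, fitted) {
  1 - sum((y - fitted)^2) / sum((y - mean(y))^2)
}

# Hypothetical data and a stand-in model
set.seed(1)
x <- runif(100)
y <- 2 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)

pseudo_r2(y, fitted(fit))
```

Any model that returns predictions on the response scale can be passed in the same way, for example `pseudo_r2(y, predict(model, newdata = dat))`, where `model` and `dat` are placeholders.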