06 May, 2022

Classification & Regression Trees

Advantages

  • Feature complexity
  • Prediction
  • Relative importance
  • (Multi)collinearity

Classification & Regression Trees

Disadvantages

  • overfitting (over-learning)

Classification & Regression Trees

Classification

  • Categorical response


Regression

  • Continuous response


CART

Simple regression trees

  • split (partition) data up into major chunks
    • maximizing change in explained deviance
    • with Gaussian errors, this is equivalent to
      • maximizing between-group SS
      • minimizing SSerror
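
The split criterion above can be sketched in a few lines of base R. This is only an illustration, not how `tree` or `rpart` implement it: the helper name `best_split` and the simulated data are assumptions made for the example. For a Gaussian response it scans candidate thresholds on one predictor and keeps the one with the smallest error sum of squares, which is the same as the largest between-group sum of squares.

```r
# Hypothetical helper: exhaustive search for the best binary split on one predictor
best_split <- function(x, y) {
  xs <- sort(unique(x))
  # candidate thresholds: midpoints between consecutive distinct x values
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  ss_error <- sapply(cuts, function(cut) {
    left  <- y[x <  cut]
    right <- y[x >= cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  ss_total <- sum((y - mean(y))^2)
  best <- which.min(ss_error)
  list(cut        = cuts[best],
       ss_error   = ss_error[best],
       ss_total   = ss_total,
       ss_between = ss_total - ss_error[best])
}

# Simulated data (an assumption): a step in the mean at x = 5
set.seed(1)
x <- runif(100, 0, 10)
y <- ifelse(x < 5, 2, 6) + rnorm(100, sd = 0.5)

res <- best_split(x, y)
res$cut         # expected to fall close to the true break at 5
res$ss_between  # the part of the total SS explained by the split
```

Because the total SS is fixed for a given node, minimizing `ss_error` and maximizing `ss_between` always select the same threshold.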

Simple regression trees

  • split these subsets

Simple regression trees

  • recursively partition (split)
  • decision tree
  • simple trees tend to overfit
    • the noise (error) is fitted along with the signal
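
Recursive partitioning, and its tendency to overfit, can be illustrated with a small base R sketch. The helpers `grow_tree` and `predict_tree`, the stopping rules (`max_depth`, `min_n`) and the simulated data are all assumptions for this example; real packages use more refined rules.

```r
# Hypothetical helper: grow a regression tree on one predictor by recursive partitioning
grow_tree <- function(x, y, depth = 0, max_depth = 2, min_n = 2) {
  node <- list(pred = mean(y), n = length(y))
  xs <- sort(unique(x))
  # stop at the maximum depth, or when the node is too small to split
  if (depth >= max_depth || length(y) < 2 * min_n || length(xs) < 2) return(node)
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- sapply(cuts, function(cut) {
    l <- y[x < cut]; r <- y[x >= cut]
    if (length(l) < min_n || length(r) < min_n) return(Inf)
    sum((l - mean(l))^2) + sum((r - mean(r))^2)
  })
  if (!any(is.finite(sse))) return(node)
  cut  <- cuts[which.min(sse)]
  left <- x < cut
  node$cut   <- cut
  # the same procedure is applied again to each subset
  node$left  <- grow_tree(x[left],  y[left],  depth + 1, max_depth, min_n)
  node$right <- grow_tree(x[!left], y[!left], depth + 1, max_depth, min_n)
  node
}

# Follow the splits down to a leaf and return that leaf's mean
predict_tree <- function(node, x) {
  if (is.null(node$cut)) return(rep(node$pred, length(x)))
  out  <- numeric(length(x))
  left <- x < node$cut
  out[left]  <- predict_tree(node$left,  x[left])
  out[!left] <- predict_tree(node$right, x[!left])
  out
}

# Simulated data (an assumption): smooth signal plus Gaussian noise
set.seed(1)
x_train <- runif(100, 0, 6); y_train <- sin(x_train) + rnorm(100, sd = 0.5)
x_test  <- runif(100, 0, 6); y_test  <- sin(x_test)  + rnorm(100, sd = 0.5)

shallow <- grow_tree(x_train, y_train, max_depth = 2)
deep    <- grow_tree(x_train, y_train, max_depth = 10)

mse <- function(tree, x, y) mean((y - predict_tree(tree, x))^2)
errors <- c(train_shallow = mse(shallow, x_train, y_train),
            train_deep    = mse(deep,    x_train, y_train),
            test_shallow  = mse(shallow, x_test,  y_test),
            test_deep     = mse(deep,    x_test,  y_test))
errors
```

The deep tree refines the shallow one, so it fits the training data at least as well. On a fresh sample it will often do worse, because part of what it learned was noise.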

Simple regression trees

Pruning

  • reduce overfitting
    • deviance at each terminal node (leaf)
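
One common way to prune is cost-complexity pruning as provided by `rpart`. The sketch below is an assumption about workflow rather than the method used for these slides: it grows a deliberately large tree on simulated data, then cuts it back to the subtree with the lowest cross-validated error.

```r
library(rpart)

# Simulated data (an assumption): smooth signal plus Gaussian noise
set.seed(1)
dat <- data.frame(x = runif(200, 0, 6))
dat$y <- sin(dat$x) + rnorm(200, sd = 0.5)

# Grow a deliberately large tree: cp = 0 switches off the complexity penalty
full <- rpart(y ~ x, data = dat, method = "anova",
              control = rpart.control(cp = 0, minsplit = 5, xval = 10))

# The cp table lists, for each candidate subtree, its cross-validated error (xerror)
cp_table <- full$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Cut the tree back to the subtree with the lowest cross-validated error
pruned <- prune(full, cp = best_cp)

# Number of terminal nodes (leaves) before and after pruning
n_leaves <- c(full   = sum(full$frame$var == "<leaf>"),
              pruned = sum(pruned$frame$var == "<leaf>"))
n_leaves
```

The `dev` column of `pruned$frame` holds the deviance at each node, including the leaves.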

Simple regression trees

Predictions

  • partial plots

Classification and Regression Trees

R packages

  • simple CART
library(tree)


  • an extension that accommodates (some) non-Gaussian errors
library(rpart)
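
As a hedged illustration of the non-Gaussian support, the sketch below fits the same simulated count data twice with `rpart`: once with `method = "anova"` (Gaussian) and once with `method = "poisson"`. The data and variable names are assumptions made for the example.

```r
library(rpart)

set.seed(1)
dat <- data.frame(x = runif(200, 0, 10))
# Simulated counts (an assumption): the mean jumps from 2 to 8 at x = 5
dat$count <- rpois(200, lambda = ifelse(dat$x < 5, 2, 8))

# method = "anova" treats the response as Gaussian;
# method = "poisson" splits on a Poisson deviance instead
fit_gauss <- rpart(count ~ x, data = dat, method = "anova")
fit_pois  <- rpart(count ~ x, data = dat, method = "poisson")

print(fit_pois)
```

`rpart` also offers `method = "class"` for categorical responses and `method = "exp"` for survival data.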

Classification and Regression Trees

Limitations

  • crude overfitting protection
  • low resolution
  • limited error distributions
  • little scope for random effects

Boosted regression Trees

Boosting

  • machine learning meets predictive modelling
  • ensemble models
    • sequence of simple trees (10,000+ trees)
    • built to predict residuals of previous tree
    • shrinkage
    • produce excellent fit
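
The boosting idea in the bullets above can be sketched in base R with one-split trees (stumps). This is a toy version for squared error loss only, not the algorithm of any particular package; `fit_stump`, `predict_stump`, `boost` and the simulated data are assumptions for the example.

```r
# Hypothetical helper: fit a one-split tree (stump) to y on a single predictor
fit_stump <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- sapply(cuts, function(cut) {
    l <- y[x < cut]; r <- y[x >= cut]
    sum((l - mean(l))^2) + sum((r - mean(r))^2)
  })
  cut <- cuts[which.min(sse)]
  list(cut = cut, left = mean(y[x < cut]), right = mean(y[x >= cut]))
}
predict_stump <- function(stump, x) ifelse(x < stump$cut, stump$left, stump$right)

boost <- function(x, y, n_trees = 200, shrinkage = 0.05) {
  pred      <- rep(mean(y), length(y))   # start from the overall mean
  train_mse <- numeric(n_trees)
  stumps    <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    resid       <- y - pred              # what the ensemble has not yet explained
    stumps[[m]] <- fit_stump(x, resid)   # the next tree is fitted to those residuals
    # shrinkage: add only a small fraction of each tree's prediction
    pred        <- pred + shrinkage * predict_stump(stumps[[m]], x)
    train_mse[m] <- mean((y - pred)^2)
  }
  list(init = mean(y), stumps = stumps, shrinkage = shrinkage,
       fitted = pred, train_mse = train_mse)
}

# Simulated data (an assumption): smooth signal plus Gaussian noise
set.seed(1)
x <- runif(100, 0, 6)
y <- sin(x) + rnorm(100, sd = 0.3)

fit <- boost(x, y)
fit$train_mse[c(1, 50, 200)]  # training error keeps falling as trees are added
```

Each stump on its own is a poor model; the fit comes from summing many small corrections. The steadily falling training error is also why the number of trees has to be chosen with held-out data.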

Boosted regression Trees

Overfitting

  • overfitting vs underfitting
  • residual error vs precision
  • minimizing squared error loss

Boosted regression Trees

Minimizing squared error loss

  • test (validation) data
    • 75% train, 25% test
  • out of bag
    • 50% in, 50% out
  • cross validation
    • 3 folds
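
The three ways of choosing the number of trees listed above map onto arguments of the `gbm` package. The slides do not name the package used, so treating it as `gbm` is an assumption, as are the simulated data and tuning values below.

```r
library(gbm)

# Simulated data (an assumption); rows are already in random order
set.seed(1)
n <- 400
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- sin(2 * pi * dat$x1) + 2 * dat$x2 + rnorm(n, sd = 0.3)

fit <- gbm(y ~ x1 + x2, data = dat,
           distribution = "gaussian",  # squared error loss
           n.trees = 2000,
           shrinkage = 0.01,
           interaction.depth = 2,
           train.fraction = 0.75,      # first 75% of rows train, last 25% test
           bag.fraction = 0.5,         # each tree sees a random 50%; the rest is out of bag
           cv.folds = 3)               # 3-fold cross validation

# Estimated optimal number of trees under each approach
best <- c(test = gbm.perf(fit, method = "test", plot.it = FALSE),
          oob  = gbm.perf(fit, method = "OOB",  plot.it = FALSE),
          cv   = gbm.perf(fit, method = "cv",   plot.it = FALSE))
best
```

Because `train.fraction` takes the first rows as training data, the rows should be shuffled beforehand if they are ordered in any meaningful way.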

Boosted regression Trees

Overfitting

Boosted regression Trees

Predictions

Boosted regression Trees

Variable importance

##    var  rel.inf
## x1  x1 55.06919
## x2  x2 44.93081
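
A table of this shape is what `summary()` returns for a `gbm` fit. The sketch below assumes that package and uses simulated data, so its numbers will not match the output above.

```r
library(gbm)

# Simulated data (an assumption): two predictors that both affect y
set.seed(1)
n <- 300
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- sin(2 * pi * dat$x1) + 2 * dat$x2 + rnorm(n, sd = 0.3)

fit <- gbm(y ~ x1 + x2, data = dat,
           distribution = "gaussian",
           n.trees = 1000, shrinkage = 0.01, interaction.depth = 2)

# Relative influence of each predictor, scaled to sum to 100
imp <- summary(fit, n.trees = 1000, plotit = FALSE)
imp
```

Relative influence measures how much each predictor contributed to reducing the loss across all splits in the ensemble, so it is relative to the other predictors in the model rather than an absolute effect size.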

Boosted regression Trees

Pseudo \(R^2\)

\[ 1-\left(\frac{\sum (y_i - E(y_i))^2}{\sum(y_i - \bar{y})^2}\right) \]

## [1] 0.5650695
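
The pseudo \(R^2\) is one minus the ratio of the residual sum of squares to the total sum of squares, where the residuals use the fitted (expected) value for each observation. A minimal base R version, with the function name `pseudo_r2` chosen for this example:

```r
# Hypothetical helper: pseudo R-squared from observed values and model predictions
pseudo_r2 <- function(y, fitted) {
  1 - sum((y - fitted)^2) / sum((y - mean(y))^2)
}

y <- c(1, 2, 3, 4, 5)
pseudo_r2(y, y)                        # perfect predictions give 1
pseudo_r2(y, rep(mean(y), length(y)))  # predicting the mean everywhere gives 0
```

For a boosted model, `fitted` would be the predictions at the chosen number of trees, ideally for data not used in fitting.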