Ensemble Methods for Machine Learning: Random Forests & Boosting

February 2014

Carlos Becker

Ensemble Methods

Ensemble of predictors: $F(x) = \sum_t \alpha_t f_t(x)$, a weighted combination of base predictors $f_t$

How to learn the model parameters, i.e. the weights $\alpha_t$ and the predictors $f_t$?


But, wait a minute! How do you solve

$\min_{\{\alpha_t,\, f_t\}} \; \sum_i L\!\left(y_i,\ \sum_t \alpha_t f_t(x_i)\right)$ ?

● Generally non-convex, many local minima, depending on L and on the family of predictors $f_t$
● Most popular and successful methods do greedy minimization instead
– e.g. Random Forests, Boosting
● "Regularization arises in form of algorithmic constraints rather than explicit penalty terms" [P. Bühlmann et al., Statistics for High-Dimensional Data]


Now let's look at two popular methods:
● Random Forests: bagging + random subspace search
● Boosting: greedy gradient descent in function space

Bagging & Random Forests


Random Forests (RFs): initially proposed by Breiman in 2001. RFs use trees as predictors, combining:
1) predictor bagging (model averaging)
2) random subspace search


Bagging: Bootstrap-AGGregation
Key idea: variance reduction through averaging (trees typically have low bias but high variance); see the small simulation below.
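To make the variance-reduction intuition concrete, here is a small numpy simulation (my own illustration, not from the slides): it measures the variance of an average of unit-variance predictors that share a given pairwise correlation, which also previews why de-correlating the trees matters later.

```python
import numpy as np

rng = np.random.default_rng(0)

def variance_of_average(n_predictors, correlation, n_trials=100_000):
    """Variance of the mean of unit-variance predictions with a given pairwise correlation."""
    shared = rng.normal(size=(n_trials, 1)) * np.sqrt(correlation)
    individual = rng.normal(size=(n_trials, n_predictors)) * np.sqrt(1.0 - correlation)
    predictions = shared + individual          # each column: one predictor
    return predictions.mean(axis=1).var()

for rho in (0.0, 0.5, 0.9):
    print(f"correlation {rho:.1f}: var of average of 100 predictors ~ "
          f"{variance_of_average(100, rho):.3f}")

# Theory: var = rho*sigma^2 + (1 - rho)*sigma^2/n, so averaging pays off most
# when the predictors are weakly correlated (hence the later de-correlation tricks).
```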

Intermission: how are classification trees learned? (focusing here on trees with orthogonal splits, with a greedy approach)



● Greedy learning is not optimal, but very efficient (see the split-search sketch after this list)
● Training hundreds of trees for bagging is generally feasible
● Automatic feature selection
● No need for a metric feature space or prior feature scaling
● Trees can deal with very high-dimensional feature spaces
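For concreteness, a minimal sketch of how one greedy, axis-aligned split could be found (illustrative only; the helper names are mine, and real CART implementations add stopping criteria, pruning, class weights, etc.):

```python
import numpy as np

def gini(labels):
    """Gini impurity of an array of 0/1 class labels."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, feature_indices=None):
    """Greedily pick the axis-aligned split minimizing the children's weighted Gini impurity."""
    n, d = X.shape
    if feature_indices is None:            # plain CART: consider every feature
        feature_indices = range(d)
    best = (None, None, np.inf)            # (feature, threshold, impurity)
    for j in feature_indices:
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# A tree is grown by applying best_split recursively to each child node
# until nodes are pure or a maximum depth is reached.
```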


Example: no bagging vs bagging, comparing a single CART tree with 100 bagged trees [figures from Jessie Li's slides, Penn State University].
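The original figures are not reproduced in this text, but a comparable experiment is easy to run; a hedged scikit-learn sketch (the toy dataset and parameters are my own choices):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# BaggingClassifier uses a decision tree as its default base estimator:
# 100 trees, each fit on a bootstrap sample, predictions aggregated by voting.
bagged = BaggingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("single CART tree:", single_tree.score(X_te, y_te))
print("100 bagged trees:", bagged.score(X_te, y_te))
```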


Bagging: variance reduction with bootstrapping. But let's de-correlate the predictors further, to reduce the final variance even more.

Random Forests: subsample the feature space as well → helps de-correlate the predictors further.

How: modify split learning during tree construction so that each split only searches a random subset of Mtry features (a minimal sketch follows):
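Reusing the hypothetical best_split helper from the earlier tree sketch, the Random-Forest modification is a one-liner (again illustrative, not the slides' own code):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_subspace_split(X, y, mtry):
    """RF-style split: search only a random subset of mtry features at this node."""
    d = X.shape[1]
    candidate_features = rng.choice(d, size=mtry, replace=False)
    return best_split(X, y, feature_indices=candidate_features)

# A fresh random subset is drawn at every node; for classification,
# mtry is typically around sqrt(d).
```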


(Example worked on the board.)


Quotes about RFs [The Elements of Statistical Learning]:
– Typical values for Mtry are √K or even as low as 1
– "Not all estimators can be improved by shaking up the data like this. It seems that highly nonlinear estimators, such as trees, benefit the most."
– "On many problems the performance of random forests is very similar to boosting, and they are simpler to train and tune"


RFs and Overfitting [The Elements of Statistical Learning, Sec. 15.3.4]:
– Mtry: related to the ratio of useful vs. noisy attributes
– The "Random Forests cannot overfit" claim:
● the averaged prediction converges as the number of trees increases
● but RFs can still overfit, through the trees themselves (e.g. tree depth)


Out-of-bag (OOB) samples:
– Because of bootstrapping, p(sample is picked for tree Ti) = 1 - (1 - 1/n)^n ≈ 1 - exp(-1) ≈ 63.2%, so each tree leaves out roughly a third of the samples
– We have free cross-validation-like estimates!
● classification error / accuracy
● variable importance
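A quick scikit-learn illustration of these "free" OOB estimates and variable importances (dataset and parameter values are illustrative assumptions, not from the slides):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # Mtry = sqrt(number of features)
    oob_score=True,        # evaluate each sample only with trees that never saw it
    random_state=0,
).fit(X, y)

print("OOB accuracy (cross-validation-like, no held-out set):", rf.oob_score_)

# Variable importance, averaged over the trees of the forest
top = sorted(zip(data.feature_names, rf.feature_importances_), key=lambda p: -p[1])[:5]
for name, imp in top:
    print(f"{name:25s} {imp:.3f}")
```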


Summary
– RFs = bagging trees + random subspace search at tree splits

Advantages
– Few parameters to tune: Mtry, number of trees, (tree depth)
– Typical values already do very well in general
– 'Free' out-of-bag estimates: cross-validation-like
– Variable importance can be very useful in some situations
– Learning is highly parallelizable

Disadvantages
– If trees don't model the problem properly, performance may be very poor
● e.g. the circle example given before: a kernel machine will do better

Boosting (and its many variants)




● First practical boosting ML algorithm: AdaBoost [Freund, Schapire]
● Later it was found that AdaBoost is a particular case of the more general Gradient Boosting approach
● Literature can be very confusing:
– AdaBoost, LogitBoost, RealBoost, TangentBoost, L2Boost, …

Gradient Boosting in a nutshell

How: build the ensemble greedily, one term at a time. At each iteration the terms learned so far are kept fixed, and a new term is added so as to decrease the loss as much as possible.
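The formulas from the original slides are not preserved in this text; a standard way to write the greedy (stagewise) update that the "fixed" / "new term" keywords refer to is, in LaTeX:

```latex
% Everything learned up to iteration m-1 stays fixed; only the new term is optimized.
F_m(x) = \underbrace{F_{m-1}(x)}_{\text{fixed}}
       + \underbrace{\alpha_m f_m(x)}_{\text{new term}},
\qquad
(\alpha_m, f_m) = \arg\min_{\alpha,\, f} \; \sum_i L\bigl(y_i,\; F_{m-1}(x_i) + \alpha f(x_i)\bigr).
```

In gradient boosting this arg min is approximated by fitting $f_m$ to the negative gradient of the loss evaluated at $F_{m-1}$ (the "pseudo-residuals"), followed by a line search for $\alpha_m$: greedy gradient descent in function space, as announced earlier.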

Let's clarify with an example:
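The step-by-step example on the slides is not recoverable from this text, so here is a hedged stand-in: a tiny gradient-boosting loop with squared-error loss, where the negative gradient is simply the residual, using depth-1 regression trees (stumps) as weak learners. The data and constants are my own choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 6.0, size=(200, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)   # noisy 1-D regression target

shrinkage = 0.1
F = np.full_like(y, y.mean())            # F_0: constant prediction
stumps = []

for m in range(100):
    residuals = y - F                    # negative gradient of 1/2 * squared error
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    stumps.append(stump)
    F += shrinkage * stump.predict(X)    # greedy update: F_m = F_{m-1} + nu * f_m

print("training MSE after 100 boosting rounds:", np.mean((y - F) ** 2))

def predict(X_new):
    """Evaluate the learned additive ensemble on new inputs."""
    return y.mean() + shrinkage * sum(t.predict(X_new) for t in stumps)
```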


Gradient Boosted Trees (GBTs):
– Very successful & popular ML approach
– Use regression trees as weak learners
– Regularization term (Friedman, Gradient Boosting; Zheng, Newton-like method)
– Generally no closed form for the line search is available, but it is still easy to implement
● some say the line search is not necessary [Bühlmann]
– Almost any loss can be used, for either classification or regression


Gradient Boosted Trees (GBTs): Overfitting & Parameter Tuning
– Weak learners must be neither too weak nor too strong
● generally depth-2 trees do very well
– Gradient descent that moves too fast risks overfitting
● use shrinkage: scale each new term by a small factor ν < 1 (the learning rate)
– Stochastic Gradient Boosting [Friedman]: apply bootstrapping at each iteration
● boosts performance and speeds up training!
– Subsampling in feature space, as with RFs
● also generally boosts performance and speeds up training!
– Number of iterations M: if well set up, performance on the test set increases asymptotically with M (all of these knobs appear in the sketch below)
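As a concrete (and hedged) illustration, scikit-learn's GradientBoostingClassifier exposes exactly these knobs; the values below are typical-looking choices of mine, not recommendations from the slides:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbt = GradientBoostingClassifier(
    n_estimators=500,      # M: number of boosting iterations
    learning_rate=0.1,     # shrinkage nu: slow down the functional gradient descent
    max_depth=2,           # weak learners: shallow regression trees
    subsample=0.5,         # stochastic gradient boosting: random row subsampling per iteration
    max_features="sqrt",   # RF-style feature subsampling at each split
    random_state=0,
).fit(X_tr, y_tr)

print("test accuracy:", gbt.score(X_te, y_te))
```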


But wait: why does boosting generally do so well?
● In general, boosting does as well as or better than RFs (tree-based)
● Boosting is often said to first reduce bias, then reduce variance later
● The greedy minimization used by boosting is similar to L1-regularized minimization [The Elements of Statistical Learning]
● Possible "explanation":
– as M increases, the weak learners become weaker and the α's smaller
– which forces averaging to happen, similar to bagging


Summary
● Boosting constructs an ensemble greedily, one term at a time
● GBTs: a successful method for boosting trees, for either classification or regression
