Ensemble Methods for Machine Learning: Random Forests & Boosting

February 2014

Carlos Becker

Ensemble Methods

Ensemble of predictors: $F(x) = \sum_t \alpha_t f_t(x)$, a weighted combination of base predictors $f_t$

How to learn the model parameters, i.e. the weights $\alpha_t$ and the predictors $f_t$?


But, wait a minute! How do you solve

$\min_{\{\alpha_t,\, f_t\}} \; \sum_i L\!\left(y_i,\ \sum_t \alpha_t f_t(x_i)\right)$ ?

● Generally non-convex, many local minima, depending on L and on the family of predictors $f_t$
● Most popular and successful methods do greedy minimization instead
– e.g. Random Forests, Boosting
● "Regularization arises in form of algorithmic constraints rather than explicit penalty terms" [P. Bühlmann et al., Statistics for High-Dimensional Data]


Now let's look at two popular methods:
● Random Forests: bagging + random subspace search
● Boosting: greedy gradient descent in function space

Bagging & Random Forests


Random Forests (RFs): initially proposed by Breiman in 2001. RFs use trees as predictors, combining:
1) predictor bagging (model averaging)
2) random subspace search


Bagging: Bootstrap-AGGregation
Key idea: variance reduction through averaging (trees typically have low bias but high variance); see the small simulation below.
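To make the variance-reduction intuition concrete, here is a small numpy simulation (my own illustration, not from the slides): it measures the variance of an average of unit-variance predictors that share a given pairwise correlation, which also previews why de-correlating the trees matters later.

```python
import numpy as np

rng = np.random.default_rng(0)

def variance_of_average(n_predictors, correlation, n_trials=100_000):
    """Variance of the mean of unit-variance predictions with a given pairwise correlation."""
    shared = rng.normal(size=(n_trials, 1)) * np.sqrt(correlation)
    individual = rng.normal(size=(n_trials, n_predictors)) * np.sqrt(1.0 - correlation)
    predictions = shared + individual          # each column: one predictor
    return predictions.mean(axis=1).var()

for rho in (0.0, 0.5, 0.9):
    print(f"correlation {rho:.1f}: var of average of 100 predictors ~ "
          f"{variance_of_average(100, rho):.3f}")

# Theory: var = rho*sigma^2 + (1 - rho)*sigma^2/n, so averaging pays off most
# when the predictors are weakly correlated (hence the later de-correlation tricks).
```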

Intermission: how are classification trees learned? (focusing here on trees with orthogonal splits, with a greedy approach)



● Greedy learning is not optimal, but very efficient (see the split-search sketch after this list)
● Training hundreds of trees for bagging is generally feasible
● Automatic feature selection
● No need for a metric feature space or prior feature scaling
● Trees can deal with very high-dimensional feature spaces
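For concreteness, a minimal sketch of how one greedy, axis-aligned split could be found (illustrative only; the helper names are mine, and real CART implementations add stopping criteria, pruning, class weights, etc.):

```python
import numpy as np

def gini(labels):
    """Gini impurity of an array of 0/1 class labels."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, feature_indices=None):
    """Greedily pick the axis-aligned split minimizing the children's weighted Gini impurity."""
    n, d = X.shape
    if feature_indices is None:            # plain CART: consider every feature
        feature_indices = range(d)
    best = (None, None, np.inf)            # (feature, threshold, impurity)
    for j in feature_indices:
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# A tree is grown by applying best_split recursively to each child node
# until nodes are pure or a maximum depth is reached.
```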


Example: no bagging vs bagging, comparing a single CART tree with 100 bagged trees [figures from Jessie Li's slides, Penn State University].
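The original figures are not reproduced in this text, but a comparable experiment is easy to run; a hedged scikit-learn sketch (the toy dataset and parameters are my own choices):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# BaggingClassifier uses a decision tree as its default base estimator:
# 100 trees, each fit on a bootstrap sample, predictions aggregated by voting.
bagged = BaggingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("single CART tree:", single_tree.score(X_te, y_te))
print("100 bagged trees:", bagged.score(X_te, y_te))
```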


Bagging: variance reduction with bootstrapping. But let's de-correlate the predictors further, to reduce the final variance even more.

Random Forests: subsample the feature space as well → helps de-correlate the predictors further.

How: modify split learning during tree construction so that each split only searches a random subset of Mtry features (a minimal sketch follows):
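Reusing the hypothetical best_split helper from the earlier tree sketch, the Random-Forest modification is a one-liner (again illustrative, not the slides' own code):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_subspace_split(X, y, mtry):
    """RF-style split: search only a random subset of mtry features at this node."""
    d = X.shape[1]
    candidate_features = rng.choice(d, size=mtry, replace=False)
    return best_split(X, y, feature_indices=candidate_features)

# A fresh random subset is drawn at every node; for classification,
# mtry is typically around sqrt(d).
```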


(Example worked on the board.)


Quotes about RFs [The Elements of Statistical Learning]:
– Typical values for Mtry are √K or even as low as 1
– "Not all estimators can be improved by shaking up the data like this. It seems that highly nonlinear estimators, such as trees, benefit the most."
– "On many problems the performance of random forests is very similar to boosting, and they are simpler to train and tune"


RFs and Overfitting [The Elements of Statistical Learning, Sec. 15.3.4]:
– Mtry: related to the ratio of useful vs. noisy attributes
– The "Random Forests cannot overfit" claim:
● the averaged prediction converges as the number of trees increases
● but RFs can still overfit, through the trees themselves (e.g. tree depth)


Out-of-bag (OOB) samples:
– Because of bootstrapping, p(sample is picked for tree Ti) = 1 - (1 - 1/n)^n ≈ 1 - exp(-1) ≈ 63.2%, so each tree leaves out roughly a third of the samples
– We have free cross-validation-like estimates!
● classification error / accuracy
● variable importance
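A quick scikit-learn illustration of these "free" OOB estimates and variable importances (dataset and parameter values are illustrative assumptions, not from the slides):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # Mtry = sqrt(number of features)
    oob_score=True,        # evaluate each sample only with trees that never saw it
    random_state=0,
).fit(X, y)

print("OOB accuracy (cross-validation-like, no held-out set):", rf.oob_score_)

# Variable importance, averaged over the trees of the forest
top = sorted(zip(data.feature_names, rf.feature_importances_), key=lambda p: -p[1])[:5]
for name, imp in top:
    print(f"{name:25s} {imp:.3f}")
```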


Summary
– RFs = bagging trees + random subspace search at tree splits

Advantages
– Few parameters to tune: Mtry, number of trees, (tree depth)
– Typical values already do very well in general
– 'Free' out-of-bag estimates: cross-validation-like
– Variable importance can be very useful in some situations
– Learning is highly parallelizable

Disadvantages
– If trees don't model the problem properly, performance may be very poor
● e.g. the circle example given before: a kernel machine will do better

Boosting (and its many variants)




● First practical boosting ML algorithm: AdaBoost [Freund, Schapire]
● Later it was found that AdaBoost is a particular case of the more general Gradient Boosting approach
● Literature can be very confusing:
– AdaBoost, LogitBoost, RealBoost, TangentBoost, L2Boost, …

Gradient Boosting in a nutshell

How: build the ensemble greedily, one term at a time. At each iteration the terms learned so far are kept fixed, and a new term is added so as to decrease the loss as much as possible.
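The formulas from the original slides are not preserved in this text; a standard way to write the greedy (stagewise) update that the "fixed" / "new term" keywords refer to is, in LaTeX:

```latex
% Everything learned up to iteration m-1 stays fixed; only the new term is optimized.
F_m(x) = \underbrace{F_{m-1}(x)}_{\text{fixed}}
       + \underbrace{\alpha_m f_m(x)}_{\text{new term}},
\qquad
(\alpha_m, f_m) = \arg\min_{\alpha,\, f} \; \sum_i L\bigl(y_i,\; F_{m-1}(x_i) + \alpha f(x_i)\bigr).
```

In gradient boosting this arg min is approximated by fitting $f_m$ to the negative gradient of the loss evaluated at $F_{m-1}$ (the "pseudo-residuals"), followed by a line search for $\alpha_m$: greedy gradient descent in function space, as announced earlier.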

Let's clarify with an example:
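The step-by-step example on the slides is not recoverable from this text, so here is a hedged stand-in: a tiny gradient-boosting loop with squared-error loss, where the negative gradient is simply the residual, using depth-1 regression trees (stumps) as weak learners. The data and constants are my own choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 6.0, size=(200, 1)), axis=0)
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)   # noisy 1-D regression target

shrinkage = 0.1
F = np.full_like(y, y.mean())            # F_0: constant prediction
stumps = []

for m in range(100):
    residuals = y - F                    # negative gradient of 1/2 * squared error
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    stumps.append(stump)
    F += shrinkage * stump.predict(X)    # greedy update: F_m = F_{m-1} + nu * f_m

print("training MSE after 100 boosting rounds:", np.mean((y - F) ** 2))

def predict(X_new):
    """Evaluate the learned additive ensemble on new inputs."""
    return y.mean() + shrinkage * sum(t.predict(X_new) for t in stumps)
```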


Gradient Boosted Trees (GBTs):
– Very successful & popular ML approach
– Use regression trees as weak learners
– Regularization term (Friedman, Gradient Boosting; Zheng, Newton-like method)
– Generally no closed form for the line search is available, but it is still easy to implement
● some say the line search is not necessary [Bühlmann]
– Almost any loss can be used, for either classification or regression


Gradient Boosted Trees (GBTs): Overfitting & Parameter Tuning
– Weak learners must be neither too weak nor too strong
● generally depth-2 trees do very well
– Gradient descent that moves too fast risks overfitting
● use shrinkage: scale each new term by a small factor ν < 1 (the learning rate)
– Stochastic Gradient Boosting [Friedman]: apply bootstrapping at each iteration
● boosts performance and speeds up training!
– Subsampling in feature space, as with RFs
● also generally boosts performance and speeds up training!
– Number of iterations M: if well set up, performance on the test set increases asymptotically with M (all of these knobs appear in the sketch below)
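As a concrete (and hedged) illustration, scikit-learn's GradientBoostingClassifier exposes exactly these knobs; the values below are typical-looking choices of mine, not recommendations from the slides:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

gbt = GradientBoostingClassifier(
    n_estimators=500,      # M: number of boosting iterations
    learning_rate=0.1,     # shrinkage nu: slow down the functional gradient descent
    max_depth=2,           # weak learners: shallow regression trees
    subsample=0.5,         # stochastic gradient boosting: random row subsampling per iteration
    max_features="sqrt",   # RF-style feature subsampling at each split
    random_state=0,
).fit(X_tr, y_tr)

print("test accuracy:", gbt.score(X_te, y_te))
```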


But wait: why does boosting generally do so well?
● In general, boosting does as well as or better than RFs (tree-based)
● Boosting is often said to first reduce bias, then reduce variance later
● The greedy minimization used by boosting is similar to L1-regularized minimization [The Elements of Statistical Learning]
● Possible "explanation":
– as M increases, the weak learners become weaker and the α's smaller
– which forces averaging to happen, similar to bagging


Summary
● Boosting constructs an ensemble greedily, one term at a time
● GBTs: a successful method for boosting trees, for either classification or regression
