Machine Learning: A Statistics and Optimization Perspective
Nan Ye
Mathematical Sciences School, Queensland University of Technology
What is Machine Learning?
Machine Learning
• Machine learning turns data into insight, predictions and/or decisions.
• Numerous applications in diverse areas, including natural language processing, computer vision, recommender systems, medical diagnosis.
A Much Sought-after Technology
Enabled Applications
Make reminders by talking to your phone.
Tell the car where you want to go to and the car takes you there.
Check your emails, see some spam in your inbox, mark it as spam, and similar spam will not show up.
Video recommendations.
Play Go against the computer.
Tutorial Objective
Essentials for crafting basic machine learning systems.
• Formulate applications as machine learning problems: classification, regression, density estimation, clustering...
• Understand and apply basic learning algorithms: least squares regression, logistic regression, support vector machines, K-means,...
• Theoretical understanding: position and compare the problems and algorithms in a unifying statistical framework.
• Have fun...
Outline
• A statistics and optimization perspective
• Statistical learning theory
• Regression
• Model selection
• Classification
• Clustering
Hands-on
• An exercise on using WEKA.
  WEKA          Java     http://www.cs.waikato.ac.nz/ml/weka/
  H2O           Java     http://www.h2o.ai/
  scikit-learn  Python   http://scikit-learn.org/
  CRAN          R        https://cran.r-project.org/web/views/MachineLearning.html
• Some technical details are left as exercises. These are tagged with (verify).
A Statistics and Optimization Perspective
Illustrations
• Learning a binomial distribution
• Learning a Gaussian distribution
Learning a Binomial Distribution
I pick a coin with the probability of heads being θ. I flip it 100 times for you and you see a dataset D of 70 heads and 30 tails. Can you learn θ?

Maximum likelihood estimation
The likelihood of θ is P(D | θ) = θ^70 (1 − θ)^30. Learning θ is an optimization problem:
θ_ml = arg max_θ P(D | θ) = arg max_θ ln P(D | θ) = arg max_θ (70 ln θ + 30 ln(1 − θ)).

Setting the derivative of the log-likelihood to 0,
70/θ − 30/(1 − θ) = 0,
we have θ_ml = 70/(70 + 30).
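As a quick numerical check, a minimal sketch (assuming Python with numpy; the variable names are illustrative) that recovers the same estimate by maximizing the log-likelihood over a grid:

    import numpy as np

    heads, tails = 70, 30
    thetas = np.linspace(1e-6, 1 - 1e-6, 10001)              # candidate values of theta
    log_lik = heads * np.log(thetas) + tails * np.log(1 - thetas)
    print(thetas[np.argmax(log_lik)])                        # approximately 0.7, matching the closed form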
Learning a Gaussian distribution
I pick a Gaussian N(µ, σ^2) and give you a bunch of data D = {x1, . . . , xn} independently drawn from it. Can you learn µ and σ?
[Figure: the bell-shaped density f(X) of a Gaussian.]

P(x | µ, σ) = (1/(σ√(2π))) e^{−(x−µ)^2/(2σ^2)}.
Maximum likelihood estimation
ln P(D | µ, σ) = ln ∏_{i=1}^n (1/(σ√(2π))) exp(−(x_i − µ)^2/(2σ^2))
             = −n ln(σ√(2π)) − Σ_i (x_i − µ)^2/(2σ^2).

Setting the derivative w.r.t. µ to 0,
Σ_i (x_i − µ)/σ^2 = 0   ⇒   µ_ml = (1/n) Σ_i x_i.

Setting the derivative w.r.t. σ to 0,
−n/σ + Σ_i (x_i − µ)^2/σ^3 = 0   ⇒   σ_ml^2 = (1/n) Σ_{i=1}^n (x_i − µ_ml)^2.
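A minimal sketch (assuming Python with numpy) of these closed-form estimates on simulated data:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=1000)    # data drawn from N(2, 1.5^2)

    mu_ml = x.mean()                                 # (1/n) sum_i x_i
    sigma2_ml = ((x - mu_ml) ** 2).mean()            # (1/n) sum_i (x_i - mu_ml)^2
    print(mu_ml, np.sqrt(sigma2_ml))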
What You Need to Know...
Learning is...
• Collect some data, e.g. coin flips.
• Choose a hypothesis class, e.g. binomial distribution.
• Choose a loss function, e.g. negative log-likelihood.
• Choose an optimization procedure, e.g. set derivative to 0.
• Have fun...

Statistics and optimization provide powerful tools for formulating and solving machine learning problems.
Statistical Learning Theory
• The framework
• Applications in classification, regression, and density estimation
• Does empirical risk minimization work?
There is nothing more practical than a good theory. Kurt Lewin
...at least in the problems of statistical inference. Vladimir Vapnik
Learning...
• H. Simon: Any process by which a system improves its performance.
• M. Minsky: Learning is making useful changes in our minds.
• R. Michalski: Learning is constructing or modifying representations of what is being experienced.
• L. Valiant: Learning is the process of knowledge acquisition in the absence of explicit programming.
A Probabilistic Framework
Data
Training examples z1, . . . , zn are drawn i.i.d. from a fixed but unknown distribution P(Z) on Z, e.g. outcomes of coin flips.

Hypothesis space H
e.g. head probability θ ∈ [0, 1].

Loss function L(z, h)
Measures the penalty for hypothesis h on example z, e.g. the log-loss
L(z, θ) = − ln P(z | θ), which is − ln θ if z = H, and − ln(1 − θ) if z = T.
Expected risk
• The expected risk of h is R(h) = E(L(Z, h)).
• We want to find the hypothesis with minimum expected risk, arg min_{h∈H} E(L(Z, h)).

Empirical risk minimization (ERM)
Minimize the empirical risk R_n(h) = (1/n) Σ_i L(z_i, h) over h ∈ H,
e.g. choose θ to minimize −70 ln θ − 30 ln(1 − θ).
This provides a unified formulation for many machine learning problems, which differ in • the data domain Z, • the choice of the hypothesis space H, and • the choice of loss function L.
Most algorithms that we see later can be seen as special cases of ERM.
Classification: predict a discrete class
Digit recognition: image to {0, 1, . . . , 9}.
Spam filter: email to {spam, not spam}.
Given D = {(x1, y1), . . . , (xn, yn)} ⊆ X × Y, find a classifier h that maps an input x ∈ X to a class y ∈ Y.

We usually use the 0/1 loss
L((x, y), h) = I(h(x) ≠ y), which is 1 if h(x) ≠ y, and 0 if h(x) = y.

ERM chooses the classifier with minimum classification error
min_{h∈H} (1/n) Σ_i I(h(x_i) ≠ y_i).
Regression: predict a numerical value
Stock market prediction: predict stock price using recent trading data.
Given D = {(x1 , y1 ), . . . , (xn , yn )} ⊆ X × R, find a function f that maps an input x ∈ X to a value y ∈ R.
We usually use the quadratic loss L((x, y ), h) = (y − h(x))2 .
ERM is often called the method of least squares in this case:
min_{h∈H} (1/n) Σ_i (y_i − h(x_i))^2.
Density Estimation
E.g. learning a binomial distribution, or a Gaussian distribution.

We often use the log-loss L(x, h) = − ln p(x | h). ERM is MLE in this case.
Does ERM Work?
Estimation error
• How does the empirically best hypothesis h_n = arg min_{h∈H} R_n(h) compare with the best in the hypothesis space? Specifically, how large is the estimation error R(h_n) − inf_{h∈H} R(h)?
• Consistency: Does R(h_n) converge to inf_{h∈H} R(h) as n → ∞?

If |H| is finite, ERM is likely to pick the function with minimal expected risk when n is large, because then R_n(h) is close to R(h) for all h ∈ H. If |H| is infinite, we can still show that ERM is likely to choose a near-optimal hypothesis if H has finite complexity (such as VC-dimension).
Approximation error
How good is the best hypothesis in H? That is, how large is the approximation error inf_{h∈H} R(h) − inf_h R(h)?

Trade-off between estimation error and approximation error:
• A larger hypothesis space implies smaller approximation error, but larger estimation error.
• A smaller hypothesis space implies larger approximation error, but smaller estimation error.
Optimization error
Is the optimization algorithm computing the empirically best hypothesis exact?

While ERM can be efficiently implemented in many cases, there are also computationally intractable cases, and efficient approximations are sought. The performance gap between the sub-optimal hypothesis and the empirically best hypothesis is the optimization error.
What You Need to Know...
Recognise machine learning problems as special cases of the general statistical learning problem.
Understand that the performance of ERM depends on the approximation error, estimation error and optimization error.
Regression
• Ordinary least squares
• Ridge regression
• Basis function method
• Regression function
• Nearest neighbor regression
• Kernel regression
• Classification as regression
Ordinary Least Squares
Find a best fitting hyperplane for (x1 , y1 ), . . . , (xn , yn ) ∈ Rd × R.
OLS finds a hyperplane minimizing the sum of squared errors
β_n = arg min_{β∈R^d} Σ_{i=1}^n (x_i^T β − y_i)^2.

A special case of function learning using ERM
• The input set is X = R^d, and the output set is Y = R.
• The hypothesis space consists of the hyperplanes H = {x^T β : β ∈ R^d}.
• Quadratic loss is used, as is typical in regression.
Empirically best hyperplane. The solution to OLS is
β_n = (X^T X)^{-1} X^T y,
where X is the n × d matrix with x_i as the i-th row, and y = (y1, . . . , yn)^T.

The formula holds when X^T X is non-singular. When X^T X is singular, there are infinitely many possible values for β_n. They can be obtained by solving the linear system (X^T X)β = X^T y.
Proof. The empirical risk is (ignoring a factor of 1/n)
R_n(β) = Σ_{i=1}^n (x_i^T β − y_i)^2 = ||Xβ − y||_2^2.
Setting the gradient of R_n to 0,
∇R_n = 2X^T(Xβ − y) = 0, (verify)
we have β_n = (X^T X)^{-1} X^T y.
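A minimal sketch (assuming Python with numpy) of the normal-equation solution on simulated data; solving the linear system is preferable to forming the inverse explicitly:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                    # n x d design matrix
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    beta_n = np.linalg.solve(X.T @ X, X.T @ y)       # solves (X^T X) beta = X^T y
    print(beta_n)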
Optimal hyperplane. The hyperplane β* = E(XX^T)^{-1} E(XY) minimizes the expected quadratic loss among all hyperplanes.

Proof. The expected quadratic loss of a hyperplane β is
R(β) = E((β^T X − Y)^2) = E(β^T XX^T β − 2β^T XY + Y^2) = β^T E(XX^T)β − 2β^T E(XY) + E(Y^2).
Setting the gradient of R to 0,
∇R(β) = 2 E(XX^T)β − 2 E(XY) = 0   ⇒   β* = E(XX^T)^{-1} E(XY).
Consistency. We can show that least squares linear regression is consistent, that is, R(β_n) → R(β*) in probability, by using the law of large numbers.
Least Squares as MLE
• Consider the class of conditional distributions {p_β(Y | X) : β ∈ R^d}, where
  p_β(Y | X = x) = N(Y; x^T β, σ) = (1/(σ√(2π))) e^{−(Y − x^T β)^2/(2σ^2)},
  with σ being a constant.
• The (conditional) likelihood of β is L_n(β) = p_β(y1 | x1) · · · p_β(yn | xn).
• Maximizing the likelihood L_n(β) gives the same β_n as given by the method of least squares. (verify)
Ridge Regression
When collinearity is present, the matrix X^T X may be singular or close to singular, making the solution unreliable.

Ridge regression
We add a quadratic/ℓ2 regularizer λ||β||_2^2 to the OLS objective, where λ > 0 is a fixed constant:
β_n = arg min_{β∈R^d} Σ_{i=1}^n (x_i^T β − y_i)^2 + λ||β||_2^2.

Empirically optimal hyperplane
β_n = (λI + X^T X)^{-1} X^T y. (verify)
The matrix λI + X^T X is non-singular (verify), and thus there is always a unique solution.
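A minimal sketch (assuming Python with numpy) of the ridge solution on simulated data:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    lam = 0.1                                        # regularization constant lambda
    beta_n = np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, X.T @ y)
    print(beta_n)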
Regression with a Bias
So far we have only considered hyperplanes of the form y = x^T β, which pass through the origin (green line).
Considering hyperplanes with a bias term, that is, hyperplanes of the form y = x^T β + b, is more useful (red line).
OLS with a bias solves
(b_n, β_n) = arg min_{b∈R, β∈R^d} Σ_{i=1}^n (x_i^T β + b − y_i)^2.

Solution. Reduce it to regression without a bias term by replacing x^T β + b with (1, x^T)(b, β)^T, i.e. prepend a constant feature 1 to each input and stack b on top of β.
Ridge regression with a bias solves
(b_n, β_n) = arg min_{b∈R, β∈R^d} Σ_{i=1}^n (x_i^T β + b − y_i)^2 + λ||β||_2^2.

Solution. Reduce it to ridge regression without a bias term as follows. Let x̂_i = x_i − x̄ and ŷ_i = y_i − ȳ, where x̄ = Σ_{i=1}^n x_i/n and ȳ = Σ_{i=1}^n y_i/n. Then
β_n = arg min_{β∈R^d} Σ_{i=1}^n (x̂_i^T β − ŷ_i)^2 + λ||β||_2^2,
b_n = ȳ − x̄^T β_n. (verify)
Basis Function Method
We can use linear regression to learn complex regression functions:
• Choose some basis functions g1, . . . , gk : R^d → R.
• Transform each input x to (g1(x), . . . , gk(x)).
• Perform linear regression on the transformed data.

Examples
• Linear regression: use basis functions g0, g1, . . . , gd with gi(x) = xi for i ≥ 1, and g0(x) = 1.
• Quadratic functions: use basis functions of the above form, together with basis functions of the form gij(x) = xi xj for all 1 ≤ i ≤ j ≤ d.
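A minimal sketch (assuming Python with numpy) of the transform-then-regress recipe: fit a quadratic in one dimension with the basis (1, x, x^2) and ordinary least squares:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(-3, 3, size=200)
    y = 2.0 - x + 0.5 * x**2 + 0.2 * rng.normal(size=200)

    Phi = np.column_stack([np.ones_like(x), x, x**2])   # transformed inputs (g0, g1, g2)
    beta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)      # OLS on the transformed data
    print(beta)                                          # roughly (2, -1, 0.5)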
Regression Function
The minimizer of the expected quadratic loss is the regression function h*(x) = E(Y | x).

Proof. The expected quadratic loss of a function h is
E((h(X) − Y)^2) = E_X [h(X)^2 − 2h(X) E(Y | X) + E(Y^2 | X)].
Hence we can set the value of h(x) independently for each x by choosing it to minimize the expression under the expectation. This leads to h*(x) = E(Y | x).
k nearest neighbor (kNN) regression
kNN approximates the regression function using
h_n(x) = avg(y_i | x_i ∈ N_k(x)),
which is the average of the values of the set N_k(x) of the k nearest neighbors of x in the training data.

• Under mild conditions, as k → ∞ and n/k → ∞, h_n(x) → h*(x), for any distribution P(X, Y).
• (Curse of dimensionality) The number of samples required for accurate approximation is exponential in the dimension.
• kNN is a non-parametric method, while linear regression is a parametric method.
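A minimal sketch (assuming Python with numpy) of kNN regression with Euclidean distance; the function name is illustrative:

    import numpy as np

    def knn_regress(x, X_train, y_train, k=5):
        # average the targets of the k training points closest to x
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]
        return y_train[nearest].mean()

    rng = np.random.default_rng(3)
    X_train = rng.uniform(0, 1, size=(200, 2))
    y_train = np.sin(2 * np.pi * X_train[:, 0]) + 0.1 * rng.normal(size=200)
    print(knn_regress(np.array([0.3, 0.5]), X_train, y_train, k=10))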
Kernel Regression
h_n(x) = Σ_{i=1}^n K(x, x_i) y_i / Σ_{i=1}^n K(x, x_i),
where K(x, x′) is a function measuring the similarity between x and x′, and is often called a kernel function.

Example kernel functions
• Gaussian kernel K_λ(x, x′) = (1/λ) exp(−||x′ − x||^2/(2λ^2)).
• kNN kernel K_k(x, x′) = I(||x′ − x|| ≤ max_{x″∈N_k(x)} ||x″ − x||). Note that this kernel is data-dependent and non-symmetric.
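A minimal sketch (assuming Python with numpy) of this weighted average with the Gaussian kernel; note the 1/λ factor in the kernel cancels in the ratio:

    import numpy as np

    def kernel_regress(x, X_train, y_train, lam=0.1):
        # Gaussian-kernel weighted average of the training targets
        sq_dists = np.sum((X_train - x) ** 2, axis=1)
        weights = np.exp(-sq_dists / (2 * lam ** 2))
        return np.sum(weights * y_train) / np.sum(weights)

    rng = np.random.default_rng(4)
    X_train = rng.uniform(0, 1, size=(200, 1))
    y_train = np.sin(2 * np.pi * X_train[:, 0]) + 0.1 * rng.normal(size=200)
    print(kernel_regress(np.array([0.3]), X_train, y_train))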
Binary Classification as Regression
• Label one class as −1 and the other as +1.
• Fit a function f(x) using least squares regression.
• Given a test example x, predict −1 if f(x) < 0 and +1 otherwise.
What You Need to Know...
• Regression function.
• Parametric methods: ordinary least squares, ridge regression, basis function method.
• Non-parametric methods: kNN, kernel regression.
Model Selection
Bias-Variance Tradeoff
The predicted value Y′ at a fixed point x can be considered a random function of the training set. Let Y be the true value at x. The expected prediction error E((Y′ − Y)^2) is a property of the model class.

Bias-variance decomposition
E((Y′ − Y)^2) = (E(Y′) − E(Y))^2 + E((Y′ − E(Y′))^2) + E((Y − E(Y))^2),
that is, expected prediction error = bias (squared) + variance + irreducible noise.

Proof. Expand the RHS and simplify.

Bias-variance tradeoff
In general, as model complexity increases (i.e., the hypothesis becomes more complex), variance tends to increase, and bias tends to decrease.
Bias-variance Tradeoff in kNN
Assumption
Suppose Y | X ∼ N(f(X), σ) for some function f and some fixed σ. In addition, suppose x1, . . . , xn are fixed.

Bias and variance
At x, Y′ = (1/k) Σ_{x_i∈N_k(x)} y_i is predicted and the true value is Y.
bias = E(Y′) − E(Y) = (1/k) Σ_{x_i∈N_k(x)} f(x_i) − f(x),
variance = E((Y′ − E(Y′))^2) = σ^2/k.

1/k as a complexity measure
With smaller 1/k (larger k), h_n(x) = avg(y_i | x_i ∈ N_k(x)) is closer to a constant, and thus model complexity is smaller.

Bias-variance trade-off
As 1/k increases (or as model complexity increases), bias is likely to decrease, and variance increases.
Model Selection
Assume model complexity is controlled by some parameter θ. How do we pick the best value among candidates θ0, . . . , θm?

Using a development set
• Split the available data into a training set T and a development set D.
• For each θi, train a model on T, and test it on D.
• Choose the parameter with the best performance.

A lot of data is needed, while the amount available may be limited.
K-fold cross validation
• Split the training data into K folds.
• For each θi, train K models, each trained on K − 1 folds and tested on the remaining fold.
• Choose the parameter with the best average performance.

Computationally more expensive than using a development set.
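A minimal sketch (assuming Python with numpy) of K-fold cross validation for choosing the ridge parameter λ; the helper function names are illustrative:

    import numpy as np

    def ridge_fit(X, y, lam):
        d = X.shape[1]
        return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

    def cv_error(X, y, lam, K=5, seed=0):
        # average held-out squared error over K folds
        folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), K)
        errors = []
        for fold in folds:
            train = np.setdiff1d(np.arange(len(y)), fold)
            beta = ridge_fit(X[train], y[train], lam)
            errors.append(np.mean((X[fold] @ beta - y[fold]) ** 2))
        return np.mean(errors)

    rng = np.random.default_rng(5)
    X = rng.normal(size=(100, 5))
    y = X @ np.ones(5) + 0.5 * rng.normal(size=100)
    candidates = [0.01, 0.1, 1.0, 10.0]
    print(min(candidates, key=lambda lam: cv_error(X, y, lam)))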
What You Need to Know...
The expected prediction error is a function of the model complexity.
The bias-variance decomposition implies that minimizing the expected prediction error requires careful tuning of the model complexity.
Using a development set and cross-validation are two basic methods for model selection.
Recap
• A statistics and optimization perspective: learning a binomial and a Gaussian
• Statistical learning theory: data, hypotheses, loss function, expected risk, empirical risk...
• Regression: regression function, parametric regression, non-parametric regression...
• Model selection: bias-variance tradeoff, development set, cross-validation...
• Classification
• Clustering
Classification
• Bayes optimal classifier
• Nearest neighbor classifier
• Naive Bayes classifier
• Logistic regression
• The perceptron
• Support vector machines
Recall Classification
Output a function f : X → Y where Y is a finite set.

0/1 loss
L((x, y), h) = I(y ≠ h(x)), which is 1 if y ≠ h(x), and 0 if y = h(x).

Expected/true risk
R(h) = E(L((X, Y), h)).
Bayes Optimal Classifier
The expected 0/1 loss is minimized by the Bayes optimal classifier
h*(x) = arg max_{y∈Y} P(y | x).

Proof. The expected 0/1 loss of a classifier h is
E(L((X, Y), h)) = E_X E_{Y|X}(I(Y ≠ h(X))) = E_X P(Y ≠ h(X) | X).
Hence we can set the value of h(x) independently for each x by choosing it to minimize the expression under the expectation. This leads to h*(x) = arg max_{y∈Y} P(y | x).

However, P(Y | x) is unknown...
Idea. Estimate P(y | x) from data.
Nearest Neighbor Classifier
Approximate P(y | x) using the label distribution in {y_i | x_i ∈ N_k(x)}, where N_k(x) consists of the k nearest examples of x (with respect to some distance measure), and predict the majority label
h_n(x) = majority{y_i | x_i ∈ N_k(x)}.

• Under mild conditions, as k → ∞ and n/k → ∞, h_n(x) → h*(x), for any distribution P(X, Y).
• (Curse of dimensionality) The number of samples required for accurate approximation is exponential in the dimension.
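A minimal sketch (assuming Python with numpy) of the majority vote with Euclidean distance and integer class labels; the function name is illustrative:

    import numpy as np

    def knn_classify(x, X_train, y_train, k=15):
        # majority label among the k nearest training examples
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]
        return np.bincount(y_train[nearest]).argmax()

    rng = np.random.default_rng(6)
    X_train = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
    y_train = np.array([0] * 100 + [1] * 100)
    print(knn_classify(np.array([2.5, 2.5]), X_train, y_train))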
[Figures: decision boundaries of the 1-NN classifier, the 15-NN classifier, and the Bayes optimal classifier.]
Naive Bayes Classifier (NB)
Model
• X = X1 × . . . × Xd, where each Xi is a finite set.
• A model p(X, Y) satisfies the independence assumption
  p(x1, . . . , xd | y) = p(x1 | y) · · · p(xd | y).

Classification
An example x = (x1, . . . , xd) is classified as
y = arg max_{y′∈Y} p(y′ | x).
This is equivalent to
y = arg max_{y′∈Y} p(y′, x) = arg max_{y′∈Y} p(y′) p(x1 | y′) · · · p(xd | y′),
by the independence assumption.
Learning (MLE)
The maximum likelihood Naive Bayes model is p̂(X, Y) given by
p̂(y) = n_y/n,   p̂(x_i | y) = n_{y,x_i}/n_y,
where n_y is the number of times class y appears in the training set, and n_{y,x_i} is the number of times attribute i takes value x_i when the class label is y. (verify)

Issues
• The independence assumption is unlikely to be satisfied.
• The counts n_y may be 0, making the estimates undefined.
• The counts may be very small, leading to unstable estimates.
Laplace correction
p̂(y) = (n_y + c0) / Σ_{y′∈Y} (n_{y′} + c0),
p̂(x_i | y) = (n_{y,x_i} + c1) / Σ_{x_i′∈X_i} (n_{y,x_i′} + c1),
where c0 > 0 and c1 > 0 are user-chosen constants.

Laplace correction makes NB more stable, but it still relies on a strong independence assumption.
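A minimal sketch (assuming Python with numpy) of these counts and the Laplace-corrected estimates for discrete features; the function names and toy data are illustrative:

    import numpy as np

    def nb_fit(X, y, c0=1.0, c1=1.0, n_classes=2, n_values=2):
        # X: n x d matrix of discrete feature values in {0, ..., n_values - 1}
        n, d = X.shape
        class_counts = np.array([(y == k).sum() for k in range(n_classes)])
        p_y = (class_counts + c0) / (class_counts + c0).sum()
        p_x = np.zeros((n_classes, d, n_values))     # p_x[k, i, v] estimates p(x_i = v | y = k)
        for k in range(n_classes):
            Xk = X[y == k]
            for i in range(d):
                counts = np.array([(Xk[:, i] == v).sum() for v in range(n_values)])
                p_x[k, i] = (counts + c1) / (counts + c1).sum()
        return p_y, p_x

    def nb_predict(x, p_y, p_x):
        # class maximizing log p(y) + sum_i log p(x_i | y)
        scores = [np.log(p_y[k]) + sum(np.log(p_x[k, i, v]) for i, v in enumerate(x))
                  for k in range(len(p_y))]
        return int(np.argmax(scores))

    X = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 1]])
    y = np.array([1, 1, 0, 0, 1])
    p_y, p_x = nb_fit(X, y)
    print(nb_predict([1, 0], p_y, p_x))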
Logistic Regression (LR)
Model
• X = R^d.
• Logistic regression estimates conditional distributions of the form
  p(y | x, θ) = exp(x^T θ_y) / Σ_{y′∈Y} exp(x^T θ_{y′}),
  where θ_y = (θ_{y1}, . . . , θ_{yd}) ∈ R^d, and θ is the concatenation of the θ_y's.

Classification
An example x is classified as y = arg max_{y′∈Y} p(y′ | x, θ).
Learning
• Training is often done by maximizing the regularized log-likelihood
  L(θ) = log ∏_{i=1}^n p(y_i | x_i, θ) − λ||θ||_2^2.
  That is, the parameter estimate is θ_n = arg max_θ L(θ).
• L(θ) is a concave function, and can be optimized using standard numerical methods (such as L-BFGS).
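A minimal sketch (assuming Python with numpy) for the binary special case, trained by plain gradient ascent on the regularized log-likelihood; in practice L-BFGS or another standard numerical method would be used, and the function names here are illustrative:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lr_fit(X, y, lam=0.1, step=0.1, iters=2000):
        # gradient ascent on (1/n) * (sum_i log p(y_i | x_i, theta) - lam * ||theta||^2), labels in {0, 1}
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            grad = X.T @ (y - sigmoid(X @ theta)) - 2 * lam * theta
            theta += step * grad / len(y)
        return theta

    rng = np.random.default_rng(7)
    X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)
    theta = lr_fit(X, y)
    print(np.mean((sigmoid(X @ theta) > 0.5).astype(int) == y))   # training accuracy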
Comparing kNN, NB and LR
All approximate the Bayes optimal classifier:
• Learn an approximation P̃(X, Y) to P(X, Y) (or an approximation P̃(Y | X) to P(Y | X)).
• Choose the function h_n(x) = arg max_y P̃(y | x).

kNN and logistic regression estimate P(Y | X), while naive Bayes estimates P(X, Y). kNN is a non-parametric method, while naive Bayes and logistic regression are both parametric methods.
Application in Digit Recognition
Assume each digit image is a binary image, represented as a vector with the pixel values as its elements.
Applying the algorithms
• kNN: use the Euclidean distance between the examples.
• NB can be applied because each example is a discrete vector.
• LR can be applied because each example is a continuous vector.
The Perceptron
[Figure: a perceptron computes y = sgn(x^T w) from inputs x0 = 1, x1, . . . , xd with weights w0, w1, . . . , wd.]

A perceptron maps an input x ∈ R^{d+1} to
h(x) = sgn(x^T w), which is 1 if x^T w > 0, 0 if x^T w = 0, and −1 if x^T w < 0.
Here x = (1, x1, . . . , xd) includes a dummy variable 1.
[Figure: positive (+) and negative (−) examples in the plane, separated by a line.]
A perceptron corresponds to a linear decision boundary (i.e., the boundary between the regions for examples of the same class).
It is NP-hard to minimize the empirical 0/1 loss of a perceptron. That is, given a training set {(x1, y1), . . . , (xn, yn)}, it is NP-hard to solve
min_w (1/n) Σ_i I(sgn(x_i^T w) ≠ y_i).

Idea: Can we use (x_i^T w − y_i)^2 as a surrogate loss for the 0/1 loss?
Least Squares May Fail
Recall: binary classification as regression
• Label one class as −1 and the other as +1.
• Fit a function f(x) using least squares regression.
• Given a test example x, predict −1 if f(x) < 0 and +1 otherwise.

Issue
Least squares fitting may not find a separating hyperplane (i.e., a hyperplane which puts the positive and negative examples on different sides of it) even when there is one.

[Figure: the decision boundary learned using least squares fitting (red line) wrongly classifies a negative example, while separating hyperplanes (like the blue line) exist.]
Perceptron Algorithm
Require: (x1, y1), . . . , (xn, yn) ∈ R^{d+1} × {−1, +1}, η ∈ (0, 1].
Ensure: weight vector w.
  Randomly or smartly initialize w.
  while there is any misclassified example do
    Pick a misclassified example (x_i, y_i).
    w ← w + η y_i x_i.
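A minimal sketch (assuming Python with numpy) of this update loop; the max_passes cap is an added safeguard for non-separable data, not part of the algorithm above:

    import numpy as np

    def perceptron(X, y, eta=1.0, max_passes=100):
        # X: n x (d+1) inputs with a leading constant-1 column, y in {-1, +1}
        w = np.zeros(X.shape[1])
        for _ in range(max_passes):
            misclassified = False
            for xi, yi in zip(X, y):
                if yi * (xi @ w) <= 0:             # (xi, yi) is misclassified
                    w += eta * yi * xi             # moves yi * xi^T w towards positive
                    misclassified = True
            if not misclassified:
                break
        return w

    rng = np.random.default_rng(8)
    X = np.hstack([np.ones((200, 1)),
                   np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])])
    y = np.array([-1] * 100 + [1] * 100)
    w = perceptron(X, y)
    print(np.mean(np.sign(X @ w) == y))            # training accuracy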
Why the update rule w ← w + η y_i x_i?
• w classifies (x_i, y_i) correctly if and only if y_i x_i^T w > 0.
• If w classifies (x_i, y_i) wrongly, then the update rule moves y_i x_i^T w towards positive, because
  y_i x_i^T (w + η y_i x_i) = y_i x_i^T w + η ||x_i||^2 > y_i x_i^T w.
Perceptron convergence theorem
If the training data is linearly separable (i.e., there exists some w* such that y_i x_i^T w* > 0 for all i), then the perceptron algorithm terminates with all training examples correctly classified.

Proof. Suppose w* separates the data. We can scale w* such that |x_i^T w*| ≥ ||x_i||_2^2 for all i. If w classifies (x_i, y_i) wrongly, then it is updated to w′ = w + η y_i x_i. We show that
||w* − w′||_2^2 ≤ ||w* − w||_2^2 − ηR^2,
where R = min_i ||x_i||_2. This implies that only finitely many updates are possible. The inequality can be shown as follows:
||w* − w′||_2^2 = ||w* − w − η y_i x_i||_2^2
              = ||w* − w||_2^2 − 2η y_i x_i^T (w* − w) + η^2 y_i^2 ||x_i||_2^2
              ≤ ||w* − w||_2^2 − 2η(|x_i^T w*| + |x_i^T w|) + η ||x_i||_2^2
              ≤ ||w* − w||_2^2 − 2η|x_i^T w*| + η ||x_i||_2^2
              ≤ ||w* − w||_2^2 − η|x_i^T w*|
              ≤ ||w* − w||_2^2 − ηR^2.
Issues
• When the data is separable, the hyperplane found by the perceptron algorithm depends on the initial weights and is thus arbitrary.
• Convergence can be very slow, especially when the gap between the positive and negative examples is small.
• When the data is not separable, the algorithm does not stop, but this can be difficult to detect.
Support Vector Machines (SVMs)
Separable data

Geometric intuition
Find a separating hyperplane w^T x + w0 = 0 with maximal margin (i.e., the minimum distance from the points to it). The distance from a point (x_i, y_i) to the hyperplane is y_i(w^T x_i + w0)/||w||_2.

Algebraic formulation
max_{M, w, w0} M   subject to   y_i(w^T x_i + w0)/||w||_2 ≥ M,  i = 1, . . . , n.

Equivalent formulation (add M||w||_2 = 1)
min_{w, w0} (1/2)||w||_2^2   subject to   y_i(w^T x_i + w0) ≥ 1,  i = 1, . . . , n.
Soft-margin SVMs
Non-separable data

Algebraic formulation
min_{w, w0, ξ1,...,ξn} (1/2)||w||_2^2 + C Σ_i ξ_i
subject to y_i(w^T x_i + w0) ≥ 1 − ξ_i, i = 1, . . . , n,
           ξ_i ≥ 0, i = 1, . . . , n.

• C > 0 is a user-chosen constant.
• Introducing ξ_i allows (x_i, y_i) to be misclassified with a penalty of Cξ_i added to the original objective function (1/2)||w||_2^2.

An SVM always has a unique solution that can be found efficiently.
SVM as minimizing regularized hinge loss
Soft-margin SVMs can be equivalently written as
min_{w, w0} (1/(2C))||w||_2^2 + Σ_i max(0, 1 − y_i(w^T x_i + w0)),
where max(0, 1 − y(w^T x + w0)) is the hinge loss
L_hinge((x, y), h) = max(0, 1 − y h(x))
of the classifier h(x) = w^T x + w0, and upper bounds the 0/1 loss
L_{0/1}((x, y), h) = I(y ≠ sgn(h(x))).
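A minimal sketch (assuming Python with numpy) that minimizes this regularized hinge loss by subgradient descent; dedicated solvers (e.g. those behind scikit-learn's SVC) are what one would use in practice, and the function name is illustrative:

    import numpy as np

    def svm_fit(X, y, C=1.0, lr=0.01, iters=2000):
        # subgradient descent on (1/(2C))||w||^2 + sum_i max(0, 1 - y_i (w^T x_i + w0))
        n, d = X.shape
        w, w0 = np.zeros(d), 0.0
        for _ in range(iters):
            margins = y * (X @ w + w0)
            active = margins < 1                    # examples with non-zero hinge loss
            grad_w = w / C - (y[active, None] * X[active]).sum(axis=0)
            grad_w0 = -y[active].sum()
            w -= lr * grad_w / n
            w0 -= lr * grad_w0 / n
        return w, w0

    rng = np.random.default_rng(9)
    X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
    y = np.array([-1] * 100 + [1] * 100)
    w, w0 = svm_fit(X, y)
    print(np.mean(np.sign(X @ w + w0) == y))        # training accuracy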
What You Need to Know...
• Bayes optimal classifier.
• Probabilistic classifiers: NN, NB and LR.
• Perceptrons and SVMs.
Clustering
Clustering
Clustering is unsupervised function learning which learns a function from X to [K] = {1, . . . , K}. The objective generally depends on some similarity/distance measure between the items, so that similar items are grouped together.
K-means Clustering
Given observations x1, . . . , xn, find a K-clustering f (a surjective function from [n] to [K]) to minimize the cost
Σ_{i=1}^n ||x_i − c_{f(i)}||^2,
where c_k is the centroid of cluster k, that is, the average of the x_i's with f(i) = k.
K-means algorithm
  Randomly initialize c_{1:K}, and set each f(i) = 0.
  repeat
    (Assignment) Set each f(i) to be the index of the c_j closest to x_i.
    (Update) Set each c_j to be the centroid of cluster j given by f.
  until f does not change

Initialization
• Forgy method: randomly choose K observations as centroids, and initialize f as in the assignment step.
• Random Partition: assign a random cluster to each example.
Random partition is preferable.
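A minimal sketch (assuming Python with numpy) of the two alternating steps with Forgy initialization; the function name is illustrative:

    import numpy as np

    def kmeans(X, K, max_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=K, replace=False)]   # Forgy initialization
        assign = None
        for _ in range(max_iters):
            # Assignment: index of the closest centroid for each point
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_assign = dists.argmin(axis=1)
            if assign is not None and np.array_equal(new_assign, assign):
                break                               # f did not change
            assign = new_assign
            # Update: each centroid becomes the mean of its assigned points
            for k in range(K):
                if np.any(assign == k):
                    centroids[k] = X[assign == k].mean(axis=0)
        return assign, centroids

    rng = np.random.default_rng(10)
    X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])
    assign, centroids = kmeans(X, K=2)
    print(centroids)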
Illustration
[Figures: K-means on a 2-D point set — initialize centroids; iteration 1 assignment and update; iteration 2 assignment and update (converged). Generated using http://www.naftaliharris.com/blog/visualizing-k-means-clustering/]
K-means as Coordinate Descent
K-means is a coordinate descent algorithm for the cost function
C(f, c_{1:K}) = Σ_{i=1}^n ||x_i − c_{f(i)}||^2.
Specifically, we have (verify)
(Assignment) f ← arg min_{f′} C(f′, c_{1:K}), where f′ ranges over K-clusterings,
(Update) c_{1:K} ← arg min_{c′_{1:K}} C(f, c′_{1:K}).
Convergence
• The cost decreases at each iteration before termination.
• The cost converges to a local minimum by the monotone convergence theorem.
• Convergence may be very slow, taking exponential time in some cases, but such cases do not seem to arise in practice.

Dealing with poor local minima
• Restart multiple times and pick the minimum-cost clustering found.

K-means gives hard clusterings. Can we give probabilistic assignments to clusters?
Soft Clustering with Gaussian Mixtures
Assumption
• Each cluster is represented as a Gaussian
  N(x; µ_k, Σ_k) = (1/√|2πΣ_k|) exp(−(1/2)(x − µ_k)^T Σ_k^{-1} (x − µ_k)).
• Cluster k has weight w_k ≥ 0, with Σ_{k=1}^K w_k = 1.
Equivalently, we assume the probability of observing x with it belonging to cluster z = k is
p(x, z = k | θ) = w_k N(x; µ_k, Σ_k),
where θ = {w_{1:K}, µ_{1:K}, Σ_{1:K}}.

The distribution p(x | θ) = Σ_k w_k N(x; µ_k, Σ_k) is called a Gaussian mixture.
Computing a soft clustering
p(Z = k | x, θ) = p(x, Z = k | θ) / Σ_{k′=1}^K p(x, Z = k′ | θ).

Learning a Gaussian mixture
Given the observations D = {x1, . . . , xn}, we choose θ by maximizing the log-likelihood L(D | θ) = Σ_{i=1}^n ln p(x_i | θ):
max_θ L(D | θ).
This is solved using the EM algorithm.
The EM Algorithm
Let z_i be the random variable representing the cluster from which x_i is drawn. EM starts with some initial parameter θ^(0), and repeats the following steps:

Expectation step:   Q(θ | θ^(t)) = E( Σ_i ln p(x_i, z_i | θ) | D, θ^(t) ),
Maximization step:  θ^(t+1) = arg max_θ Q(θ | θ^(t)).

In words...
• E-step: expectation of the log-likelihood for the complete data, w.r.t. the conditional distribution p(z_{1:n} | D, θ^(t)).
• M-step: maximization of the expectation.

Data completion interpretation
• E-step: create complete data (x_i, z_i) with weight p(z_i | x_i, θ^(t)), for each x_i and each z_i ∈ [K].
• M-step: perform maximum likelihood estimation on the complete data set.
EM algorithm is iterative likelihood maximization
The EM algorithm iteratively improves the likelihood function, L(D | θ^(t+1)) ≥ L(D | θ^(t)).

Proof. With some algebraic manipulation, we have
Q(θ^(t+1) | θ^(t)) − Q(θ^(t) | θ^(t)) = L(D | θ^(t+1)) − L(D | θ^(t)) − Σ_i KL(p(Z_i | x_i, θ^(t)) || p(Z_i | x_i, θ^(t+1))),
where KL(q || q′) is the KL-divergence Σ_x q(x) ln(q(x)/q′(x)).

The result follows by noting that the LHS is non-negative by the choice of θ^(t+1) in the M-step, and the non-negativity of the KL-divergence.
Q(θ^(t+1) | θ^(t)) − Q(θ^(t) | θ^(t))
 = E( Σ_i ln p(x_i, z_i | θ^(t+1)) | D, θ^(t) ) − E( Σ_i ln p(x_i, z_i | θ^(t)) | D, θ^(t) )
 = E( Σ_i ( ln p(x_i | θ^(t+1)) + ln p(z_i | x_i, θ^(t+1)) − ln p(x_i | θ^(t)) − ln p(z_i | x_i, θ^(t)) ) | D, θ^(t) )
 = Σ_i ln p(x_i | θ^(t+1)) − Σ_i ln p(x_i | θ^(t)) − E( Σ_i ln ( p(z_i | x_i, θ^(t)) / p(z_i | x_i, θ^(t+1)) ) | D, θ^(t) )
 = L(D | θ^(t+1)) − L(D | θ^(t)) − Σ_i KL(p(Z_i | x_i, θ^(t)) || p(Z_i | x_i, θ^(t+1))).
Update Equations for Gaussian Mixtures
Scalar covariance matrices
Assume each covariance matrix Σ_k is a scalar matrix σ_k^2 I_d. Let w_k^(t,i) = p(z_i = k | x_i, θ^(t)). Then given θ^(t), θ^(t+1) can be computed using
w_k^(t+1) = Σ_i w_k^(t,i) / n,
µ_k^(t+1) = Σ_i w_k^(t,i) x_i / Σ_i w_k^(t,i),
(σ_k^(t+1))^2 = Σ_i w_k^(t,i) Σ_j (x_{ij} − µ_{kj}^(t+1))^2 / (d Σ_i w_k^(t,i)).
Data completion interpretation
• Split example x_i into K complete examples (x_i, 1), . . . , (x_i, K), where example (x_i, k) has weight w_k^(t,i).
• Apply maximum likelihood estimation to the complete data:
  • w_k^(t+1) is the total weight of examples in cluster k.
  • µ_k^(t+1) is the mean x value of the (weighted) examples in cluster k.
  • σ_k^(t+1) is the standard deviation of all the attributes of the (weighted) examples in cluster k.
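A minimal sketch (assuming Python with numpy) of these E- and M-step updates for the scalar-covariance case; the function name and initialization are illustrative:

    import numpy as np

    def em_gmm_scalar(X, K, iters=50, seed=0):
        n, d = X.shape
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(n, size=K, replace=False)]     # initial means
        sigma2 = np.full(K, X.var())                      # initial scalar variances
        w = np.full(K, 1.0 / K)                           # initial mixture weights
        for _ in range(iters):
            # E-step: responsibilities w_k^(t,i) = p(z_i = k | x_i, theta)
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # n x K
            log_p = np.log(w) - 0.5 * d * np.log(2 * np.pi * sigma2) - sq / (2 * sigma2)
            log_p -= log_p.max(axis=1, keepdims=True)     # for numerical stability
            resp = np.exp(log_p)
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: weighted maximum likelihood on the completed data
            Nk = resp.sum(axis=0)
            w = Nk / n
            mu = (resp.T @ X) / Nk[:, None]
            sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            sigma2 = (resp * sq).sum(axis=0) / (d * Nk)
        return w, mu, sigma2

    rng = np.random.default_rng(11)
    X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(4, 0.5, (150, 2))])
    print(em_gmm_scalar(X, K=2))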
Diagonal covariance matrices
Assume each covariance matrix Σ_k is a diagonal matrix with diagonal entries σ_{k1}^2, . . . , σ_{kd}^2. Let w_k^(t,i) = p(z_i = k | x_i, θ^(t)). Then given θ^(t), θ^(t+1) can be computed using
w_k^(t+1) = Σ_i w_k^(t,i) / n,
µ_k^(t+1) = Σ_i w_k^(t,i) x_i / Σ_i w_k^(t,i),
(σ_{kj}^(t+1))^2 = Σ_i w_k^(t,i) (x_{ij} − µ_{kj}^(t+1))^2 / Σ_i w_k^(t,i).
Arbitrary covariance matrices
Let w_k^(t,i) = p(z_i = k | x_i, θ^(t)). Then given θ^(t), θ^(t+1) can be computed using
w_k^(t+1) = Σ_i w_k^(t,i) / n,
µ_k^(t+1) = Σ_i w_k^(t,i) x_i / Σ_i w_k^(t,i),
Σ_k^(t+1) = Σ_i w_k^(t,i) (x_i − µ_k^(t+1))(x_i − µ_k^(t+1))^T / Σ_i w_k^(t,i).
What You Need to Know...
• The clustering problem.
• Hard clustering with the K-means algorithm.
• Soft clustering with Gaussian mixtures.
Density Estimation
• Maximum likelihood estimation.
• Naive Bayes.
• Logistic regression.
This Tutorial...
Essentials for crafting basic machine learning systems.
• Formulate applications as machine learning problems: classification, regression, density estimation, clustering.
• Understand and apply basic learning algorithms: least squares regression, logistic regression, support vector machines, K-means,...
• Theoretical understanding: position and compare the problems and algorithms in the unifying framework of statistical learning theory.
Beyond This Course
• Representation: dimensionality reduction, feature selection,...
• Algorithms: decision trees, artificial neural networks, Gaussian processes,...
• Meta-learning algorithms: boosting, stacking, bagging,...
• Learning theory: generalization performance of learning algorithms
• Many other exciting topics...
Further Readings