Chapter 4 DJM 13 February 2017

Announcements

1. Homework 3 is due Tuesday 2/20 instead of Thursday 2/15.
2. Exam 1 period is Wednesday 2/21–Wednesday 2/28 instead of Friday–Friday. (Still by 11:59pm)
3. I will be out of town 2/21–2/23. You will have a substitute on 2/22.
4. I will have extra office hours on 2/26 and 2/27.

Workflow for doing statistics

1. Choose a family of models.
2. Split the data in half (randomly).
3. For each model:
    1. Use half the data to. . .
    2. Calculate CV to get estimates of the risk.
    3. Choose the tuning parameter that gets the lowest estimate of the risk.
4. Choose a model by picking the model with the lowest estimate of the risk.
5. Evaluate and describe your model. Make plots, interpret coefficients, make predictions, etc. Use the other half. Why?
6. If you see things in step 5 you don't like, propose new model(s) to handle these issues and return to step 3.
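The splitting step above can be sketched in a few lines of R. This is a stand-in example (the data frame and sizes here are made up); the model fitting and CV steps are left as comments since they depend on the family of models you chose.

```r
# A sketch of the split: half the rows (chosen at random) for model
# selection, the other half held out for evaluation and description.
set.seed(1)
n = 100
df = data.frame(x = runif(n), y = rnorm(n))  # stand-in data
train_id = sample(n, n/2)                    # random half for selection
train = df[train_id, ]                       # fit/tune models here (steps 3-4)
test  = df[-train_id, ]                      # plots, predictions, etc. (step 5)
# ... fit each candidate model on `train`, estimate its risk with CV,
# ... pick the winner, then evaluate it on `test`.
nrow(train) + nrow(test) == n                # the halves partition the data
```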

“Smoothers” and easy CV

Linear smoothers

• Recall S431: The “Hat Matrix” puts the hat on Y: Ŷ = HY.
• If I want to get fitted values from the linear model, Ŷ = Xβ̂ = X(XᵀX)⁻¹XᵀY = HY.
• We generalize this to arbitrary matrices: A linear smoother is any predictor f that gives fitted values via f(X) = WY.
• Today, we will learn other ways of predicting Y from X.
• If I can get the fitted values at my original datapoints X by multiplying Y by a matrix, then that is a linear smoother.
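As a quick numerical check (not in the original notes), we can verify on simulated data that the hat matrix H = X(XᵀX)⁻¹Xᵀ really reproduces lm's fitted values, i.e. that lm is a linear smoother:

```r
# Build the hat matrix by hand and compare with fitted(lm(...)).
set.seed(42)
n = 50
x = runif(n)
y = 2 + 3*x + rnorm(n)
X = cbind(1, x)                       # design matrix with an intercept
H = X %*% solve(t(X) %*% X) %*% t(X)  # the hat matrix
yhat_H  = drop(H %*% y)               # H puts the hat on Y
yhat_lm = unname(fitted(lm(y ~ x)))
all.equal(yhat_H, yhat_lm)            # TRUE: lm is a linear smoother
```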


Example

[Figure: scatterplot of y (0 to 6) against x (0 to 6)]

At each x, find 2 points on the left, and 2 on the right. Average their y values with that of your current point.

W = toeplitz(c(rep(1,3),rep(0,n-3)))
W = sweep(W, 1, rowSums(W), '/')
df$Yhat = W %*% df$y
geom_line(mapping = aes(x,Yhat), color=green)

This is a linear smoother. What is W?

What is W?

• I actually built this one directly into the code.
• An example with a 10 x 10 matrix:

W = toeplitz(c(rep(1,3),rep(0,7)))
round(sweep(W, 1, rowSums(W), '/'), 2)

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,] 0.33 0.33 0.33 0.00 0.00 0.00 0.00 0.00 0.00  0.00
##  [2,] 0.25 0.25 0.25 0.25 0.00 0.00 0.00 0.00 0.00  0.00
##  [3,] 0.20 0.20 0.20 0.20 0.20 0.00 0.00 0.00 0.00  0.00
##  [4,] 0.00 0.20 0.20 0.20 0.20 0.20 0.00 0.00 0.00  0.00
##  [5,] 0.00 0.00 0.20 0.20 0.20 0.20 0.20 0.00 0.00  0.00
##  [6,] 0.00 0.00 0.00 0.20 0.20 0.20 0.20 0.20 0.00  0.00
##  [7,] 0.00 0.00 0.00 0.00 0.20 0.20 0.20 0.20 0.20  0.00
##  [8,] 0.00 0.00 0.00 0.00 0.00 0.20 0.20 0.20 0.20  0.20
##  [9,] 0.00 0.00 0.00 0.00 0.00 0.00 0.25 0.25 0.25  0.25
## [10,] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.33 0.33  0.33

• This is a “kernel” smoother.


What is a “kernel” smoother?

• The mathematics: A kernel is any function K such that for any u, K(u) ≥ 0, ∫ K(u) du = 1, and ∫ uK(u) du = 0.
• The idea: a kernel is a nice way to take weighted averages. The kernel function gives the weights.
• The previous example is called the boxcar kernel. It looks like this:
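As a sanity check on this definition (my own addition, using R's integrate), the two kernels used in this chapter can be verified numerically. The boxcar here is the standard one with support [−1, 1]:

```r
# Check K(u) >= 0, integral of K is 1, and integral of u*K(u) is 0.
boxcar = function(u) 0.5 * (abs(u) <= 1)          # flat on [-1, 1]
stopifnot(all(boxcar(seq(-2, 2, 0.1)) >= 0))      # nonnegative
integrate(boxcar, -1, 1)$value                    # = 1 (total weight)
integrate(function(u) u * boxcar(u), -1, 1)$value # = 0 (mean zero)
integrate(dnorm, -Inf, Inf)$value                 # Gaussian kernel: = 1
```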

[Figure: the boxcar kernel centered at a point, overlaid on the scatterplot of y against x]

• Notice that the kernel gets centered at each x. The weights of the average are determined by the shape of the kernel.
• For the boxcar, all the points inside the box get the same weight, and all the rest get 0.

Other kernels

• Most of the time, we don’t use the boxcar because the weights are weird.
• A more common one is the Gaussian kernel:


[Figure: the Gaussian kernel centered at a point, overlaid on the scatterplot of y against x]

• Let’s look at row 49 of the W matrix here:

W_{49,j} = (1/√(2πσ²)) exp(−(x_j − x_49)² / (2σ²))

• For the plot, I made σ = 0.2.

Other kernels

• What if I made σ = 0.8?




[Figure: the wider (σ = 0.8) Gaussian kernel overlaid on the scatterplot of y against x]

• Before, points far from x_49 got very small weights for predicting at x_49; now they have more influence.
• For the Gaussian kernel, σ determines something like the “range” of the smoother.

Many Gaussians

• Using my formula for W, I can calculate different linear smoothers with different σ:

dmat = as.matrix(dist(x))
Wgauss <- function(sig){
  gg = exp(-dmat^2/(2*sig^2)) / (sig * sqrt(2*pi))
  sweep(gg, 1, rowSums(gg), '/')
}
df$W1 = with(df, Wgauss(1) %*% y)
df$W.5 = with(df, Wgauss(.5) %*% y)
df$W.1 = with(df, Wgauss(.1) %*% y)
ggplot(df, aes(x, y)) + geom_point() + xlim(0,2*pi) + ylim(0,max(df$y)) +
  stat_function(fun=trueFunction, color=red) +
  geom_line(aes(x, W1), color=blue) +
  geom_line(aes(x, W.5), color=green) +
  geom_line(aes(x, W.1), color=orange)


[Figure: the three Gaussian smoothers (σ = 1, 0.5, 0.1) plotted over the data]

The bandwidth

• Choosing σ is very important.
• This “range” parameter is called the bandwidth.
• Most practitioners will tell you that it is way more important than which kernel you use.
• The default kernel is something called ‘Epanechnikov’:

epan <- function(x) 3/4*(1-x^2)*(abs(x)<1)
ggplot(data.frame(x=c(-2,2)), aes(x)) + stat_function(fun=epan, color=green)
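A quick numerical check (my own addition) that epan really is a kernel in the earlier sense: nonnegative, integrating to 1, with mean zero on its support [−1, 1].

```r
epan = function(x) 3/4 * (1 - x^2) * (abs(x) < 1)
integrate(epan, -1, 1)$value                    # = 1
integrate(function(x) x * epan(x), -1, 1)$value # = 0 (symmetric)
min(epan(seq(-2, 2, 0.01)))                     # >= 0 everywhere
```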

[Figure: the Epanechnikov kernel, supported on [−1, 1], peaking at 0.75 at x = 0]

How do you choose the bandwidth?

• Cross validation of course!

• Now the trick: For linear smoothers, one can show (after pages of tedious algebra which I wouldn’t wish on my worst enemy, but might, in a fit of rage, assign to a belligerent graduate student) that for Ŷ = WY,

LOO-CV = (1/n) Σ_{i=1}^n (y_i − ŷ_i)² / (1 − w_ii)² = (1/n) Σ_{i=1}^n ê_i² / (1 − w_ii)².

• This trick means that you only have to fit the model once rather than n times!
• You still have to calculate this for each model!
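To see the trick in action on a model where refitting is cheap: for lm (a linear smoother with w_ii = hatvalues), the shortcut matches brute-force leave-one-out exactly. This check is my own addition, on simulated data:

```r
# Shortcut: one fit, using hat values. Brute force: n refits.
set.seed(7)
n = 40
x = runif(n); y = sin(x * 2 * pi) + rnorm(n, sd = 0.3)
mdl = lm(y ~ x)
shortcut = mean(residuals(mdl)^2 / (1 - hatvalues(mdl))^2)
brute = mean(sapply(1:n, function(i) {
  fit_i = lm(y ~ x, subset = -i)     # refit without point i
  (y[i] - predict(fit_i, newdata = data.frame(x = x[i])))^2
}))
all.equal(shortcut, brute)           # TRUE: identical, but n times cheaper
```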

Back to my Gaussian example

looCV <- function(y, W){
  n = length(y)
  resids2 = ((diag(n)-W) %*% y)^2
  denom = (1-diag(W))^2
  return(mean(resids2/denom))
}
looCV.forNiceModels <- function(mdl){
  mean(residuals(mdl)^2/(1-hatvalues(mdl))^2)
}
looCVs = double(20)
sigmas = seq(.05, 1, length.out=length(looCVs))
for(i in 1:length(looCVs)){
  W = Wgauss(sigmas[i])
  looCVs[i] = looCV(df$y, W)
}
ggplot(data.frame(sigmas,looCVs), aes(sigmas,looCVs)) + geom_point() + geom_line()

[Figure: LOO-CV estimate as a function of σ, ranging over roughly 0.56–0.64]

Back to my Gaussian example

df$Wstar = with(df, Wgauss(sigmas[which.min(looCVs)]) %*% y)
ggplot(df, aes(x, y)) + geom_point() + xlim(0,2*pi) + ylim(0,max(df$y)) +
  stat_function(fun=trueFunction, color=red) +
  geom_line(aes(x, Wstar), color=blue)

[Figure: the CV-selected Gaussian smoother plotted over the data]

Heads up on Ch. 4

Ugly formulas

• These are things like (4.10)–(4.12) and (4.14).
• The purpose of these formulas is to illustrate VERY GENERALLY how to trade bias and variance with kernel smoothers.
• The highest-level overview is equation (4.16):

MSE − σ²(x) = O(h⁴) + O(1/nh).

• Note: we have moved the irreducible noise to the left of the =.
• The first term on the right is the squared bias, while the second term on the right is the variance.
• The “big-Oh” notation means we have removed a bunch of constants that don’t depend on n or h. [They DO depend on the properties of the kernel, and the distribution which generated the data.]
• The Optimal Bandwidth minimizes the MSE:

h_opt = argmin_h [ C₁h⁴ + C₂/(nh) ]
⇒ 0 = 4C₁h³ − C₂/(nh²)   (setting the derivative to zero)
⇒ h⁵ = O(1/n)
⇒ h_opt = O(1/n^(1/5)).

• If we plug this in, we get the Oracle MSE: the MSE for the optimal, though unavailable, estimator.

MSE − σ² = O(h_opt⁴) + O(1/(n·h_opt)) = O(n^(−4/5)) + O(1/n^(4/5)) = O(1/n^(4/5)).

Ok, you asked for the algebra

• You don’t want the algebra.
• Like the formula for LOO-CV, if I were a horrible, soul-destroying person, I would wade through it for the next two hours (to get (4.10)).
• Believe me, I’ve done it. Not fun. The hand-wavy, “big-Oh” stuff is what you should keep in mind.
• If you really want it, I will write up a document with all the work.

Kernels and interactions

• In multivariate kernel regressions, you estimate a surface over the input variables.
• This is trying essentially to find f̂(x₁, . . . , x_p).
• Therefore, this function by construction includes interactions, handles categorical data, etc.
• This is in contrast with linear models, which need you to specify these things.
• This extra complexity (automatically including interactions, as well as other things) comes with tradeoffs.

Issue 1

• More complicated functions (smooth kernel regressions vs. linear models) tend to have lower bias but higher variance.
• For p = 1, equations (4.19) and (4.20) show this:
• Bias
  1. The bias of using a linear model when it is wrong is a number b(x, θ₀) which doesn’t depend on n.
  2. The bias of using kernel regression is O(1/n^(4/5)). This goes to 0 as n → ∞.
• Variance
  1. The variance of using a linear model is O(1/n).
  2. The variance of using kernel regression is O(1/n^(4/5)).
• To conclude: the bias of kernels goes to zero (not so for lines), but the variance of lines goes to zero faster than for kernels.
• If the linear model is right, you win. But if it’s wrong, you (eventually) lose.
• How do you know if you have enough data? Do model selection (CV to choose models).
• Compare the CV estimate of the risk for the kernel version (with CV-selected tuning parameter) with the CV estimate of the risk for the linear model.

Issue 2

• For p > 1, there is more trouble.
• First, let’s look again at

MSE(h) − σ²(x) = O(1/n^(4/5)).

That is for p = 1. It’s not that much slower than O(1/n), the variance for linear models.
• If p > 1, similar calculations show

MSE(h) − σ²(x) = O(1/n^(4/(4+p)))
MSE(θ₀) − σ²(x) = b(x, θ₀) + O(p/n).

• What if p is big?
  1. Then O(1/n^(4/(4+p))) is still big.
  2. But O(p/n) is small.
  3. So unless b(x, θ₀) is big, we should use the linear model.
• How do you tell? Use CV to decide.

Issue 3

• When p is big, npreg is slow.
• Not much to do about that.
• Chapter 8 has some compromises that people use.
• A very, very questionable rule of thumb: if p > log(n), this may not work.

Summary

• This is the lesson of the class (the second one).
• How to do data analysis:
  1. Choose a family of models, some parametric and some nonparametric.
  2. Split the data in half (randomly).
  3. For each model:
     1. Use half the data to. . .
     2. Calculate CV to get estimates of the risk.
     3. Choose any tuning parameters by using the one that has the lowest CV.
  4. Choose a model by picking the model with the lowest CV.
  5. Evaluate and describe your model. Make plots, interpret coefficients, make predictions, etc. Use the other half.
  6. If you see things in step 5 you don’t like, propose new model(s) to handle these issues and return to step 3.
• We like CV. It is good.
• Split your data to make reasonable inferences.

npreg computational advice

• Read section 4.6 carefully; it will make your life much easier.
• npreg works like lm: out = npreg(y~x1+x2)
• The + just means “use these variables”.
• There’s no reason to use I(x1^2) or x1*x2; it already does that. (Why?)
• npreg takes a little while to run, so be sure to set cache=TRUE so you need only run it once.
• You can use ordered(x2) or factor(x2). This may improve the speed a bit.
• DO NOT CROSS VALIDATE. npreg does it automatically. The CV risk estimate is in out$bws$fval.
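The comparison suggested under Issue 1 (kernel CV risk vs. linear-model CV risk) can be sketched with the hand-rolled Gaussian smoother rather than npreg. This is my own simulated example; Wgauss and looCV are redefined here so the snippet stands alone, and the normalizing constant in Wgauss is dropped since it cancels when rows are rescaled.

```r
# Compare LOO-CV risk estimates: CV-tuned kernel smoother vs. lm,
# on data where the truth (a sine) is far from linear.
set.seed(3)
n = 100
x = sort(runif(n, 0, 2 * pi))
y = sin(x) + rnorm(n, sd = 0.3)
dmat = as.matrix(dist(x))
Wgauss = function(sig) {                 # Gaussian smoother matrix
  gg = exp(-dmat^2 / (2 * sig^2))
  sweep(gg, 1, rowSums(gg), '/')         # rows sum to 1
}
looCV = function(y, W) {                 # linear-smoother LOO shortcut
  mean(((diag(length(y)) - W) %*% y)^2 / (1 - diag(W))^2)
}
sigmas = seq(0.05, 1, length.out = 20)
kernel_cv = min(sapply(sigmas, function(s) looCV(y, Wgauss(s))))
lin = lm(y ~ x)
linear_cv = mean(residuals(lin)^2 / (1 - hatvalues(lin))^2)
kernel_cv < linear_cv   # here the kernel wins, since b(x, theta0) is big
```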

Some more npreg discussion

• npreg is using CV and optimization to try to choose the bandwidth(s) for you.
• The tol and ftol arguments control how close the solution needs to be to an optimum.
• Very basic minimization (called Gradient descent):
  – Suppose I want to minimize f(x) = (x − 6)² numerically.
  – If I start at a point (say x₁ = 23), vaguely, I want to “go” in the negative direction of the gradient.
  – The gradient (at x₁ = 23) is f′(23) = 2(23 − 6) = 34.
  – Gradient descent says, ok, go that way by some small amount: x₂ = x₁ − γ·34, for γ small.
  – In general, xₙ₊₁ = xₙ − γf′(xₙ).

niter = 10
gam = 0.1
x = double(niter)
x[1] = 23
grad <- function(x) 2*(x-6)
for(i in 2:niter) x[i] = x[i-1] - gam*grad(x[i-1])
x

## [1] 23.000000 19.600000 16.880000 14.704000 12.963200 11.570560 10.456448
## [8]  9.565158  8.852127  8.281701

• How do I decide if I’m done? The easiest way is to check how much I’m moving.

Fixing my gradient descent code

maxiter = 1000
conv = FALSE
gam = 0.1
x = 23
tol = 1e-3
grad <- function(x) 2*(x-6)
for(iter in 1:maxiter){
  x.new = x - gam * grad(x)
  conv = (x - x.new < tol)
  x = x.new
  if(conv) break
}
x

## [1] 6.003531

iter

## [1] 38

• What happens if I change tol to 1e-7?
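One way to answer: rerun the loop above with the smaller tolerance. Since the error shrinks geometrically by a factor of 0.8 per step, tightening tol from 1e-3 to 1e-7 roughly doubles the iteration count while barely changing the answer.

```r
# Same loop, tighter tolerance: more iterations, x closer to the minimizer 6.
maxiter = 1000; gam = 0.1; tol = 1e-7
x = 23
grad = function(x) 2 * (x - 6)
for (iter in 1:maxiter) {
  x.new = x - gam * grad(x)
  conv = (x - x.new < tol)    # stop when the step size is tiny
  x = x.new
  if (conv) break
}
c(x = x, iter = iter)         # x within about 4e-7 of 6; iter = 79 (was 38)
```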

