Chapter 1
DJM
23 January 2017

The normal linear model

Assume that

$$y_i = x_i^\top \beta + \epsilon_i.$$

1. What are all these things?
2. What is the mean of $y_i$?
3. What is the distribution of $\epsilon_i$?
4. What is the notation $X$ or $Y$?

Drawing a sample

$$y_i = x_i^\top \beta + \epsilon_i.$$

Write code which draws a sample from the population given by this model.

    p = 3
    n = 100
    sigma = 2
    epsilon = rnorm(n, sd=sigma) # this is random
    X = matrix(runif(n*p), n, p) # treat this as fixed, but I need numbers
    beta = rpois(p+1, 5) # also fixed, but I again need numbers
    Y = cbind(1,X) %*% beta + epsilon # epsilon is random, so this is random too
    ## Equiv: Y = beta[1] + X %*% beta[-1] + epsilon

How do we estimate beta?

1. Guess.
2. Ordinary least squares (OLS).
3. Maximum likelihood.
4. Do something more creative.

Method 1: Guess

This method isn’t very good, as I’m sure you can imagine.

Method 2. OLS

Suppose I want to find an estimator $\hat\beta$ which makes small errors on my data. I measure errors with the difference between predictions $X\hat\beta$ and the responses $Y$. I don’t care if the differences are positive or negative, so I try to measure the total error with

$$\sum_{i=1}^n \left| y_i - x_i^\top \hat\beta \right|.$$

This is fine, but hard to minimize (what is the derivative of $|\cdot|$?). So I use

$$\sum_{i=1}^n (y_i - x_i^\top \hat\beta)^2.$$


Method 2. OLS solution

We write this as

$$\hat\beta = \arg\min_\beta \sum_{i=1}^n (y_i - x_i^\top \beta)^2.$$

“Find the $\beta$ which minimizes the sum of squared errors.”

Note that this is the same as

$$\hat\beta = \arg\min_\beta \frac{1}{n}\sum_{i=1}^n (y_i - x_i^\top \beta)^2.$$

“Find the $\beta$ which minimizes the mean squared error.”

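Method 2. Ok, do it

We differentiate and set to zero:

$$\begin{aligned}
\frac{\partial}{\partial\beta}\, \frac{1}{n}\sum_{i=1}^n (y_i - x_i^\top \beta)^2
  &= -\frac{2}{n}\sum_{i=1}^n x_i (y_i - x_i^\top \beta) \\
  &= \frac{2}{n}\sum_{i=1}^n \left( x_i x_i^\top \beta - x_i y_i \right) \\
0 &\equiv \sum_{i=1}^n \left( x_i x_i^\top \beta - x_i y_i \right) \\
\Rightarrow \sum_{i=1}^n x_i x_i^\top \beta &= \sum_{i=1}^n x_i y_i \\
\Rightarrow \beta &= \left( \sum_{i=1}^n x_i x_i^\top \right)^{-1} \sum_{i=1}^n x_i y_i
\end{aligned}$$

In matrix notation...

...this is

$$\hat\beta = (X^\top X)^{-1} X^\top Y.$$

The $\beta$ which “minimizes the sum of squared errors”, AKA the SSE.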
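As a rough check of the closed form, the sketch below computes $(X^\top X)^{-1} X^\top Y$ directly and compares it with R's built-in `lm()`. The seed, the coefficient values, and the names `Xd`, `beta_hat`, `beta_lm` and `max_diff` are assumptions made for illustration, not part of the original slides.

```r
# Simulate data from the linear model (all numbers are illustrative).
set.seed(1)
n <- 100; p <- 3; sigma <- 2
X <- matrix(runif(n * p), n, p)
beta <- c(5, 2, 4, 3)                  # hypothetical coefficients, intercept first
Xd <- cbind(1, X)                      # design matrix with an intercept column
Y <- drop(Xd %*% beta) + rnorm(n, sd = sigma)

# Closed form: solve (X'X) b = X'Y instead of inverting X'X explicitly.
beta_hat <- drop(solve(crossprod(Xd), crossprod(Xd, Y)))

# lm() adds its own intercept, so pass X without the column of ones.
beta_lm <- unname(coef(lm(Y ~ X)))

# Largest absolute disagreement between the two sets of estimates.
max_diff <- max(abs(beta_hat - beta_lm))
```

Solving the normal equations with `solve(A, b)` avoids forming the inverse, which is usually the more numerically stable choice; `lm()` itself uses a QR decomposition, so tiny floating-point differences between the two are expected.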

Method 3: maximum likelihood

Method 2 didn’t use anything about the distribution of $\epsilon$. But if we know that $\epsilon$ has a normal distribution, we can write down the joint distribution of $Y = (y_1, \ldots, y_n)$:

$$\begin{aligned}
f_Y(y; \beta) &= \prod_{i=1}^n f_{y_i;\beta}(y_i) \\
  &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (y_i - x_i^\top \beta)^2 \right) \\
  &= \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^\top \beta)^2 \right)
\end{aligned}$$

In M463, we think of $f_Y$ as a function of $y$ with $\beta$ fixed:

1. If we integrate over $y$ from $-\infty$ to $\infty$, it’s 1.
2. If we want the probability of $(a, b)$, we integrate from $a$ to $b$.
3. etc.

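Turn it around...

...instead, think of it as a function of $\beta$. We call this “the likelihood” of beta: $L(\beta)$. Given some data, we can evaluate the likelihood for any value of $\beta$ (assuming $\sigma$ is known). It won’t integrate to 1 over $\beta$. But it is “convex”, meaning we can maximize it. (Strictly, it is the logarithm of the likelihood that is concave: its second derivative with respect to $\beta$ is $-X^\top X/\sigma^2$, which is negative definite whenever $X$ has full column rank.)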

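So let’s maximize

The derivative of this thing is kind of ugly. But if we’re trying to maximize over $\beta$, we can take an increasing transformation without changing anything. I choose $\log_e$.

$$\begin{aligned}
L(\beta) &= \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^\top \beta)^2 \right) \\
\ell(\beta) &= -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^\top \beta)^2
\end{aligned}$$

But we can ignore constants, so this gives

$$\hat\beta = \arg\max_\beta\; -\sum_{i=1}^n (y_i - x_i^\top \beta)^2$$

The same as before!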
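To see the equivalence numerically, here is a minimal sketch that maximizes the log-likelihood with a general-purpose optimizer and compares the result with the OLS closed form. It assumes $\sigma$ is known; the simulated data and the names `negloglik`, `neg_grad`, `beta_mle` and `beta_ols` are illustrative choices, not from the slides.

```r
# Simulated data (illustrative values), with sigma treated as known.
set.seed(2)
n <- 200; sigma <- 2
X <- cbind(1, matrix(runif(n * 2), n, 2))   # intercept plus two covariates
beta <- c(1, -2, 3)                         # hypothetical true coefficients
Y <- drop(X %*% beta) + rnorm(n, sd = sigma)

# Negative log-likelihood as a function of beta (optim() minimizes).
negloglik <- function(b) {
  resid <- Y - drop(X %*% b)
  n / 2 * log(2 * pi * sigma^2) + sum(resid^2) / (2 * sigma^2)
}

# Its gradient: -X'(Y - Xb) / sigma^2.
neg_grad <- function(b) -drop(crossprod(X, Y - drop(X %*% b))) / sigma^2

fit <- optim(rep(0, 3), negloglik, gr = neg_grad, method = "BFGS")
beta_mle <- fit$par

# OLS closed form for comparison.
beta_ols <- drop(solve(crossprod(X), crossprod(X, Y)))
```

The two estimates should agree up to the optimizer's convergence tolerance, since the terms involving only $\sigma$ and $n$ do not affect where the maximum over $\beta$ occurs.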



The here and now

In S432, we focus on OLS. In S420, you look at maximum likelihood (for this and many other distributions). Here, the two methods give the same estimator. However, we need to be able to evaluate how good this estimator is.

Mean squared error (MSE)

Let’s look at the population version, and let’s forget about the linear model. Suppose we think that there is some function which relates $y$ and $x$. Let’s call this function $f$ for the moment. How do we estimate $f$? What is $f$?

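Minimizing MSE

Let’s try to minimize the expected squared error (MSE):

$$\begin{aligned}
E\left[ (Y - f(X))^2 \right] &= E\left[ E\left[ (Y - f(X))^2 \mid X \right] \right] \\
  &= E\left[ \mathrm{Var}[Y \mid X] + E\left[ (Y - f(X)) \mid X \right]^2 \right] \\
  &= E\left[ \mathrm{Var}[Y \mid X] \right] + E\left[ E\left[ (Y - f(X)) \mid X \right]^2 \right]
\end{aligned}$$

The first part doesn’t depend on $f$, it’s constant, and we toss it. To minimize the rest, take derivatives and set to 0.

$$\begin{aligned}
\frac{\partial}{\partial f} E\left[ E\left[ (Y - f(X))^2 \mid X \right] \right] &= -E\left[ E\left[ 2(Y - f(X)) \mid X \right] \right] \\
\Rightarrow 2E[f(X) \mid X] &= 2E[Y \mid X] \\
\Rightarrow f(X) &= E[Y \mid X]
\end{aligned}$$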
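A small simulation can illustrate the result. In the sketch below the true conditional mean is taken to be $\sin(2\pi x)$ with unit-variance noise; this function, the sample size, and the two competing predictors are assumptions made purely for illustration.

```r
# Simulate from a model whose conditional mean is known (illustrative choice).
set.seed(6)
n <- 1e5
X <- runif(n)
mu <- function(x) sin(2 * pi * x)            # hypothetical regression function
Y <- mu(X) + rnorm(n)                        # noise with variance 1

# Sample analogue of E[(Y - f(X))^2] for a candidate predictor f.
mse <- function(f) mean((Y - f(X))^2)

mse_mu    <- mse(mu)                         # the conditional mean itself
mse_shift <- mse(function(x) mu(x) + 0.5)    # conditional mean plus a constant bias
mse_zero  <- mse(function(x) 0 * x)          # a predictor that ignores X
```

Under these assumptions `mse_mu` should land near the noise variance, while the other two candidates pay an extra squared-bias penalty on top of it.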

The regression function

We call this solution

$$\mu(X) = E[Y \mid X]$$

the regression function. If we assume that $\mu(x) = E[Y \mid X = x] = x^\top \beta$, then we get back exactly OLS. But why should we assume $\mu(x) = x^\top \beta$?

The regression function

In mathematics: $\mu(x) = E[Y \mid X = x]$.

In words: Regression is really about estimating the mean.

1. If $Y \sim N(\mu, 1)$, our best guess for a new $Y$ is $\mu$.
2. For regression, we let the mean ($\mu$) depend on $X$.
3. Think of $Y \sim N(\mu(X), 1)$, then conditional on $X = x$, our best guess for a new $Y$ is $\mu(x)$ [whatever this function $\mu$ is].

Causality

For any two variables $Y$ and $X$, we can always write

$$Y \mid X = \mu(X) + \eta(X)$$

such that $E[\eta(X)] = 0$.

• Suppose $\mu(X) = \mu_0$ (constant in $X$). Are $Y$ and $X$ independent?
• Suppose $Y$ and $X$ are independent. Is $\mu(X) = \mu_0$?

Previews of future chapters

Linear smoothers

What is a linear smoother?

1. Suppose I observe $Y_1, \ldots, Y_n$.
2. A linear smoother is any prediction function that’s linear in $Y$.
   • Linear functions of $Y$ are simply premultiplications by a matrix, i.e. $\hat{Y} = WY$ for any matrix $W$.
3. Examples:
   • $\bar{Y} = \frac{1}{n}\sum Y_i = \frac{1}{n} \begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix} Y$
   • Given $X$, $\hat{Y} = X(X^\top X)^{-1} X^\top Y$
   • You will see many other smoothers in this class
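The two examples can be written out as explicit matrices. The following sketch builds a $W$ for the sample mean and for OLS on simulated data; the data and the names `W_mean`, `W_ols`, `yhat_mean` and `yhat_ols` are hypothetical and only meant to make the definition concrete.

```r
# Small simulated data set (illustrative values).
set.seed(3)
n <- 20
X <- cbind(1, runif(n))                      # intercept plus one covariate
Y <- drop(X %*% c(1, 2)) + rnorm(n)

# Sample mean as a smoother: every row of W is (1/n, ..., 1/n).
W_mean <- matrix(1 / n, n, n)

# OLS as a smoother: W is the "hat" matrix X (X'X)^{-1} X'.
W_ols <- X %*% solve(crossprod(X), t(X))

# In both cases the predictions are W times Y.
yhat_mean <- drop(W_mean %*% Y)
yhat_ols  <- drop(W_ols %*% Y)
```

Here `yhat_mean` repeats `mean(Y)` in every entry, and `yhat_ols` matches the fitted values from `lm(Y ~ X[, 2])`; in neither case does $W$ depend on $Y$, which is what makes the smoother linear.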

kNN as a linear smoother

(We will see smoothers in more detail in Ch. 4)

1. For kNN, consider a particular pair $(Y_i, X_i)$
2. Find the $k$ covariates $X_j$ which are closest to $X_i$
3. Predict $Y_i$ with the average of the $Y_j$’s that belong to those $X_j$’s
4. This turns out to be a linear smoother

• How would you specify $W$?
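One possible way to specify $W$ for kNN, offered as a sketch rather than the intended answer: row $i$ puts weight $1/k$ on the $k$ nearest neighbours of $X_i$ and zero elsewhere. The code assumes a single covariate, absolute-difference distance, and that a point counts as its own nearest neighbour; all of these are illustrative conventions.

```r
# Simulated one-dimensional data (illustrative values).
set.seed(4)
n <- 30; k <- 5
X <- runif(n)
Y <- sin(2 * pi * X) + rnorm(n, sd = 0.3)

D <- as.matrix(dist(X))            # pairwise distances |X_i - X_j|
W <- matrix(0, n, n)
for (i in seq_len(n)) {
  nbrs <- order(D[i, ])[1:k]       # indices of the k closest points (includes i itself)
  W[i, nbrs] <- 1 / k              # equal weight on each neighbour
}

# kNN predictions are the averages of the neighbours' responses.
yhat <- drop(W %*% Y)
```

Each row of `W` sums to 1 and depends only on the covariates, so the prediction is a linear function of $Y$. Excluding the point itself from its neighbour set would be an equally reasonable convention and changes only which entries are non-zero.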

Kernels

(Again, more info in Ch. 4)

• There are two definitions of “kernels”. We’ll use only one.
• Recall the pdf for the Normal density:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)$$

• The part that depends on the data ($x$) is a kernel
• The kernel has a center ($\mu$) and a range ($\sigma$)

Kernels (part 2)

• In general, any function which is integrable, non-negative, and symmetric is a kernel in the sense used in the book
• You can think of any (unnormalized) symmetric density function (uniform, normal, Cauchy, etc.)
• The way you use a kernel is to take a weighted average of nearby data to make predictions
• The weight of $X_j$ is given by the height of the density centered at $X_i$, evaluated at $X_j$
• Examples:
   • The Gaussian kernel is $K(x - x_0) = e^{-(x - x_0)^2/2}$
   • The Boxcar kernel is $K(x - x_0) = I(|x - x_0| < 1)$

Kernels (part 3)

• You don’t need the normalizing constant
• To alter the support: take $z = (x - x_0)/h$ and use $K(z)/h$ in place of $K(x - x_0)$
• Now, the range of the density is determined by $h$
• You can interpret kNN as a particular kind of kernel
• The range is determined by $k$
• The center is determined by $X_i$
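The points above can be sketched as a kernel-weighted average. The code below uses the unnormalized Gaussian kernel with a bandwidth `h` and rescales each row of weights to sum to 1; the data, the value of `h`, and the row-normalization step are assumptions for illustration (Ch. 4 may define the smoother differently).

```r
# Simulated one-dimensional data (illustrative values).
set.seed(5)
n <- 50; h <- 0.1
X <- sort(runif(n))
Y <- sin(2 * pi * X) + rnorm(n, sd = 0.3)

gauss <- function(z) exp(-z^2 / 2)        # unnormalized Gaussian kernel

# Entry (i, j) is K((X_i - X_j) / h): the height at X_j of a kernel centered at X_i.
K <- gauss(outer(X, X, "-") / h)

# Rescale rows to sum to 1; the 1/h factor and any normalizing
# constant would cancel in this step, so they are omitted.
W <- K / rowSums(K)

# Predictions are weighted averages of the responses.
yhat <- drop(W %*% Y)
```

A larger `h` spreads the weight over more neighbours and gives a smoother fit, which is the sense in which `h` sets the range of the kernel.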

