LECTURE 13: FROM INTERPOLATIONS TO REGRESSIONS TO GAUSSIAN PROCESSES
• So far we have mostly been doing linear or nonlinear regression of data points with a simple, small basis (for example, the linear function y = ax + b)
• The basis can be arbitrarily large and can be defined in many different ways: we do not care about the values of a, b… but about the predictions for y
• This is the task of machine learning
• Interpolation can be viewed as regression without noise (but still with sparse sampling of the data)
• There are many different regressions

Interpolation
• Goal: we have a function y = f(x) defined at points y_i = f(x_i), and we want to know its value at any x. For now we will assume the solution passes through the points (x_i, y_i)
• Why? Perhaps it is very expensive to evaluate the function everywhere. Or perhaps we really do not know it.
• Local interpolation: use a few nearby points
• Example: polynomial interpolation $f(x) = \sum_{i=1}^{N} a_i x^i$
• How do we choose N? Higher N is better for smooth functions and worse for sharp kinks

Differentiability
• Lagrange formula for polynomial interpolation:
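For data points (x_i, y_i), i = 1…N, the standard Lagrange form of the interpolating polynomial is

$$P(x) = \sum_{i=1}^{N} y_i \prod_{j \neq i} \frac{x - x_j}{x_i - x_j}$$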

• Local polynomial interpolation does not guarantee that the derivatives are continuous everywhere.
• A stiffer solution is splines, where we enforce the derivatives to be continuous
• Most popular is the cubic spline, where the 1st derivative is smooth and the 2nd derivative is continuous

Cubic spline

• Start with piecewise linear interpolation
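One standard way to write the linear interpolant on the interval (x_j, x_{j+1}):

$$y = A\,y_j + B\,y_{j+1}, \qquad A = \frac{x_{j+1} - x}{x_{j+1} - x_j}, \quad B = 1 - A$$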

• This will not have a continuous 2nd derivative: it is zero inside the intervals and infinite at the x_i
• But we can add another interpolation term involving the 2nd derivatives y''. If we also arrange it to be 0 at the x_i then we do not spoil the linear interpolation above.
• This has a unique solution:
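With A and B as above, the standard form of this solution on (x_j, x_{j+1}) is

$$y = A\,y_j + B\,y_{j+1} + C\,y''_j + D\,y''_{j+1},$$
$$C = \tfrac{1}{6}(A^3 - A)(x_{j+1} - x_j)^2, \qquad D = \tfrac{1}{6}(B^3 - B)(x_{j+1} - x_j)^2$$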

Cubic spline

• So far we assumed we know y'', but we do not
• We can determine it by requiring continuity of the 1st derivative on both sides of each x_i:
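Matching the first derivatives of the cubic pieces at each interior point x_j gives

$$\frac{x_j - x_{j-1}}{6}\,y''_{j-1} + \frac{x_{j+1} - x_{j-1}}{3}\,y''_j + \frac{x_{j+1} - x_j}{6}\,y''_{j+1} = \frac{y_{j+1} - y_j}{x_{j+1} - x_j} - \frac{y_j - y_{j-1}}{x_j - x_{j-1}}$$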

• We get N−2 equations for the N unknown y_i'': a tridiagonal system, solvable in O(N)
• Natural cubic spline: set the endpoint second derivatives to zero, y_1'' = 0 and y_N'' = 0
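As an illustration (a minimal NumPy sketch, not part of the original slides), the natural-spline system above can be assembled and solved directly; a production code would use a dedicated O(N) tridiagonal solver instead of a dense solve:

```python
import numpy as np

def natural_cubic_spline_second_derivs(x, y):
    """Solve the tridiagonal system above for the second derivatives y''
    of a natural cubic spline (y'' = 0 at both endpoints)."""
    n = len(x)
    h = np.diff(x)                              # interval widths
    A = np.zeros((n, n))
    rhs = np.zeros(n)
    A[0, 0] = A[-1, -1] = 1.0                   # natural boundary conditions
    for i in range(1, n - 1):
        A[i, i - 1] = h[i - 1] / 6.0
        A[i, i] = (h[i - 1] + h[i]) / 3.0
        A[i, i + 1] = h[i] / 6.0
        rhs[i] = (y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1]
    return np.linalg.solve(A, rhs)              # dense solve for clarity; O(N) possible

def cubic_spline_eval(x, y, ypp, xq):
    """Evaluate the spline at points xq using the A, B, C, D weights above."""
    j = np.clip(np.searchsorted(x, xq) - 1, 0, len(x) - 2)
    hj = x[j + 1] - x[j]
    A_w = (x[j + 1] - xq) / hj
    B_w = 1.0 - A_w
    C_w = (A_w ** 3 - A_w) * hj ** 2 / 6.0
    D_w = (B_w ** 3 - B_w) * hj ** 2 / 6.0
    return A_w * y[j] + B_w * y[j + 1] + C_w * ypp[j] + D_w * ypp[j + 1]
```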

Rational function expansion
• The spline is mostly used for interpolation
• Polynomials can be used for extrapolation outside the interval (x_0, x_N), but the term with the largest power dominates and results in divergence
• Rational functions (ratios of polynomials, degree µ in the numerator and ν in the denominator) can be better for extrapolation: if we know the function goes to 0 we choose ν > µ
• Rational functions are better for interpolation if the function has poles (in the real or complex plane)
• Rational functions can also be used for analytic work (Padé approximation)

Interpolation on a grid in higher dimensions
• Simplest 2d example: bilinear interpolation
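With local grid coordinates t = (x − x_i)/(x_{i+1} − x_i) and u = (y − y_j)/(y_{j+1} − y_j), the bilinear interpolant of grid values f_{i,j} is

$$f(x, y) \approx (1-t)(1-u)\,f_{i,j} + t(1-u)\,f_{i+1,j} + t\,u\,f_{i+1,j+1} + (1-t)\,u\,f_{i,j+1}$$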

• Higher order accuracy: polynomials (biquadratic…)
• Higher order smoothness: bicubic spline

From spline to B-spline to Gaussian
• We can write basis functions for (cubic) splines (B for basis), so that the solution is a linear combination of them. For cubic B-splines on uniform points:
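In one common normalization, the cubic B-spline centered on a uniform grid with unit spacing is

$$B(t) = \begin{cases} \tfrac{1}{6}\left(4 - 6t^2 + 3|t|^3\right) & |t| \le 1 \\ \tfrac{1}{6}\left(2 - |t|\right)^3 & 1 \le |t| \le 2 \\ 0 & \text{otherwise} \end{cases}$$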

• Very close to a Gaussian
• The Gaussian is infinitely differentiable (i.e. smoother) but not sparse (it does not have compact support)

Examples of basis functions
• Polynomial, spline/Gaussian, sigmoid

• Sigmoid (used for classification)
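Typical forms of these basis functions (Bishop, Ch. 3) are the Gaussian $\phi_j(x) = \exp\!\left(-\frac{(x-\mu_j)^2}{2s^2}\right)$ and the sigmoidal $\phi_j(x) = \sigma\!\left(\frac{x-\mu_j}{s}\right)$ with $\sigma(a) = \frac{1}{1+e^{-a}}$.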

From interpolation to regression
• One can view function interpolation as regression in the limit of zero noise, but still with sparse sampling
• Sparse sampling will induce an error, as will noise
• Both can use the same basis expansion of f(x)
• For regression with noise or sparse sampling we add a regularization term (Tikhonov/ridge/L2, Lasso/L1…); this can prevent overfitting
• $L = \sum_{i=1}^{N} \left[ \sum_{j=1}^{M} a_j \phi_j(x_i) - y_i \right]^2 + \lambda \sum_{j=1}^{M} a_j^2$
• The question is how to choose the regularization parameter λ

• High λ: low scatter, high bias; low λ: high scatter, low bias
• Example: N = M = 25, Gaussian basis
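A minimal NumPy sketch of this setup (the sine target, basis width s = 0.1, and the λ values here are illustrative assumptions, not taken from the slides):

```python
import numpy as np

def gaussian_design_matrix(x, centers, s):
    """Design matrix Phi_ij = phi_j(x_i) with Gaussian basis functions."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * s ** 2))

def ridge_fit(Phi, y, lam):
    """Minimize ||Phi a - y||^2 + lam ||a||^2, i.e. a = (Phi^T Phi + lam I)^{-1} Phi^T y."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

# toy data: N = M = 25, Gaussian basis functions centered on the sample points
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 25)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(25)
Phi = gaussian_design_matrix(x, centers=x, s=0.1)
a_low = ridge_fit(Phi, y, lam=1e-8)    # low lambda: low bias, high scatter (overfits)
a_high = ridge_fit(Phi, y, lam=1.0)    # high lambda: high bias, low scatter (oversmooths)
```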

Bayesian regression

• In the Bayesian context we perform regression of the coefficients a_j, assigning them some prior distribution, such as a Gaussian with precision α. If we also have noise precision β then ln p is:
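In this notation (up to an additive constant),

$$\ln p(a \mid y) = -\frac{\beta}{2} \sum_{i=1}^{N} \left[ y_i - \sum_{j=1}^{M} a_j \phi_j(x_i) \right]^2 - \frac{\alpha}{2} \sum_{j=1}^{M} a_j^2 + \text{const}$$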

• So the regularizing parameter is λ = α/β

Linear algebra solution
• More general prior:

• Posterior
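With a general Gaussian prior $p(a) = N(a \mid m_0, S_0)$ and design matrix $\Phi_{ij} = \phi_j(x_i)$, the posterior is (Bishop, Ch. 3)

$$p(a \mid y) = N(a \mid m_N, S_N), \qquad S_N^{-1} = S_0^{-1} + \beta\,\Phi^T\Phi, \qquad m_N = S_N\left(S_0^{-1} m_0 + \beta\,\Phi^T y\right)$$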

• We want to predict the target t at an arbitrary new input:

Prediction for t
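In the same notation, the predictive distribution for the target t at a new input x is

$$p(t \mid x, y) = N\!\left(t \mid m_N^T \phi(x),\; \sigma_N^2(x)\right), \qquad \sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N\, \phi(x)$$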

Examples

Kernel picture
• Kernels work as a closeness measure, giving more weight to nearby points. Assuming m_0 = 0:
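For m_0 = 0 the posterior predictive mean can be written as a kernel-weighted sum of the observed targets,

$$\bar{y}(x) = \sum_{i=1}^{N} k(x, x_i)\, y_i, \qquad k(x, x') = \beta\, \phi(x)^T S_N\, \phi(x'),$$

which is the "equivalent kernel" picture (Bishop, Ch. 3).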

Kernel examples

Kernel regression
• So far the kernel was defined in terms of a finite number of basis functions φ(x)
• We can eliminate the concept of the basis functions and work simply with kernels
• Nadaraya-Watson regression: we want to estimate y(x) = h(x); we use points close to x, weighted by some function of the distance K_h(x − x_i): as the distance increases the weights drop off
• The kernel K_h does not have to be a covariance function, but if it is then this becomes a Gaussian process
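The Nadaraya-Watson estimate is the kernel-weighted average

$$\hat{h}(x) = \frac{\sum_{i=1}^{N} K_h(x - x_i)\, y_i}{\sum_{i=1}^{N} K_h(x - x_i)}$$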

Gaussian process
• We interpret the kernel as K_ij = Cov(y(x_i), y(x_j)) and define y(x) as a random variable with a multivariate Gaussian distribution N(µ, K)
• 2-variable example
• We then use the posterior prediction for y: we get both the mean µ and the variance (standard deviation K^{1/2})
• We can interpret a GP as regression on an infinite basis

Gaussian process

• Example covariance function K for the translationally and rotationally invariant case:
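A standard example is the squared-exponential (RBF) kernel, which depends only on the distance |x − x'|:

$$K(x, x') = \sigma_f^2 \exp\!\left(-\frac{|x - x'|^2}{2\ell^2}\right)$$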

• Kernel parameters can be learned from the data using optimization

GP for regression
• Model:
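The model is observations equal to a latent function plus Gaussian noise, with a GP prior on the function:

$$y_i = f(x_i) + \epsilon_i, \qquad f \sim \mathcal{GP}(0, K), \qquad \epsilon_i \sim N(0, \sigma_n^2)$$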

• We can marginalize over f
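Marginalizing over f gives $y \sim N(0, K + \sigma_n^2 I)$, and the predictive distribution at a test input x_* is

$$\bar{f}_* = k_*^T \left(K + \sigma_n^2 I\right)^{-1} y, \qquad \mathrm{Var}(f_*) = k(x_*, x_*) - k_*^T \left(K + \sigma_n^2 I\right)^{-1} k_*,$$

where $k_*$ is the vector of covariances between x_* and the training inputs.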

GP regression

Note we have to invert a matrix: this can be expensive (O(N³))
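A minimal NumPy sketch of GP regression with the squared-exponential kernel above (the hyperparameter values and toy data are illustrative assumptions); the Cholesky factorization avoids forming the explicit inverse but is still O(N³):

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential kernel K_ij = s_f^2 exp(-(x_i - x_j)^2 / (2 l^2)), 1d inputs."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return signal_var * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(x_train, y_train, x_test, noise_var=1e-2, **kernel_args):
    """GP regression posterior mean and variance at x_test (Cholesky-based)."""
    K = rbf_kernel(x_train, x_train, **kernel_args) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, **kernel_args)
    K_ss = rbf_kernel(x_test, x_test, **kernel_args)
    L = np.linalg.cholesky(K)                      # the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                           # k_*^T (K + s^2 I)^{-1} y
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v ** 2, axis=0) + noise_var
    return mean, var

# toy usage: noisy sine data (illustrative, not from the lecture)
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
mu, sigma2 = gp_predict(x, y, np.linspace(0, 1, 100), lengthscale=0.2)
```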

Examples of different kernels

Predictions for different kernels

Learning the kernel

This can be viewed as a likelihood of θ given the data (x, y). We can use nonlinear optimization to determine the MLE of θ.
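The log marginal likelihood being optimized is

$$\ln p(y \mid X, \theta) = -\frac{1}{2} y^T \left(K_\theta + \sigma_n^2 I\right)^{-1} y - \frac{1}{2} \ln \left|K_\theta + \sigma_n^2 I\right| - \frac{N}{2} \ln 2\pi$$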

From linear regression to GP

• We started with linear regression with inputs x_i and noisy y_i: $y_i = a + b x_i + \epsilon_i$
• We generalized this to a general basis with M basis functions: $y_i = \sum_{j=1}^{M} a_j \phi_j(x_i) + \epsilon_i$
• Next we performed Bayesian regression by adding a Gaussian prior on the coefficients, $a_j \sim N(0, \lambda_j)$: this is equivalent to adding an L2 norm and minimizing $L = \sum_{i=1}^{N} \left[ \sum_{j=1}^{M} a_j \phi_j(x_i) - y_i \right]^2 + \sum_{j=1}^{M} a_j^2/\lambda_j$
• Next we marginalize over the coefficients a_j: we are left with $E(y_i) = 0$ and $K_{ij} = \mathrm{Cov}(y(x_i), y(x_j)) = \sum_{k=1}^{M} \lambda_k\, \phi_k(x_i)\, \phi_k(x_j)$
• This is a Gaussian process with a finite number of basis functions. Many GP kernels correspond to an infinite number of basis functions.

Connections between regression/classification models

Summary
• Interpolations: polynomial, spline, rational…
• They are just a subset of regression models, and one can connect them to regularized regression, which can be connected to Bayesian regression, which can be connected to kernel regression
• This can also be connected to classification: next lecture

Literature
• Bishop, Ch. 3
• Gelman et al., Ch. 20, 21
• http://www.gaussianprocess.org
• https://arxiv.org/pdf/1505.02965.pdf
• http://mlss2011.comp.nus.edu.sg/uploads/Site/lect1gp.pdf
