DP Mixture of Warped Correlated GPs for Individualized Time Series Prediction

Yun Jie Serene Yeo, Kian Ming A. Chai, Weiping Priscilla Fan, Si Hui Maureen Lee, Junxian Ong, Poh Ling Tan, Yu Li Lydia Law, Kok-Yong Seng

DSO National Laboratories, Singapore
12 Science Park Drive, Singapore 118225
{yyunjie,ckianmin,fweiping,lsihui,ojunxian,tpohling,lydia.law,skokyong}@dso.org.sg

Abstract

We propose a Dirichlet process (DP) mixture of warped Gaussian processes (GPs) to model a collection of profiles, each of which is a set of correlated time-series. Given a new profile, we predict one of its time-series given the others in a streaming manner. Variational inference is used for tractability. This includes an online formulation of the projected process approximation of GPs to handle real-time prediction. We demonstrate our method on predicting the body-core temperature of subjects given their heart-rate and skin temperature.

1 Introduction

We model a collection of subject-specific profiles of correlated time-series trajectories using a Dirichlet process (DP) mixture of warped Gaussian processes (GPs). Our aim is to predict, for a new subject, one of the trajectories given incoming observations of the other trajectories. The model allows the new profile to switch among the mixture components based on its observed trajectories. Variational approximation is used for tractable inference: for the DP mixture, we use the non-truncated approximation so that a profile can switch to the tail components not involving the training set [7]; for the GPs, we give an online formulation of the projected process approximation for real-time prediction. Existing works use MCMC [5], or use a truncated DP approximation with no GP approximation [3, 9].

Our motivation is to reduce the possibility of heat injury in soldiers. Heat injury is commonly associated with a rise in body-core temperature (Tc), but measuring Tc in real-time is impractical. Hence, we need reliable real-time prediction of Tc. We present results on subjects participating in a route-march activity. For each subject, a wearable sensor collects real-time measurements of heart-rate (HR) and skin temperature (Tsk), and an ingestible thermometer pill collects the Tc measurements. For a new subject, we aim to predict Tc given HR and Tsk in the absence of the pill.

2 Dirichlet Process Mixture of Warped Correlated Gaussian Processes

Consider a collection of N time-series profiles. Each time-series profile, say the ith, consists of S time-series trajectories, where the sth trajectory is a sequence of observations ỹ_is := (ỹ_{is1}, ..., ỹ_{is n_is}) at a corresponding sequence t_is := (t_{is1}, ..., t_{is n_is}) of n_is time-points. Each observation from the sth time-series is assumed to be generated by warping the values from a latent function of time h_is(t) drawn from a Gaussian process with prior mean m_is(t) and covariance function κ_i(t, t'), which is common among the S trajectories [12]. These trajectories are further modelled with a time-independent intrinsic correlation ρ_{is,is'} between h_is(t) and h_{is'}(t); such a model is also called multi-task [1]. For warping, any monotonic transformation may be used; here, we use the Box-Cox family [2]:

$$y_{is}(t) \sim \mathcal{N}(h_{is}(t), \sigma_{is}^2), \qquad y'_{is}(t) := (\lambda_{is}\, y_{is}(t) + 1)^{1/\lambda_{is}}, \qquad \tilde{y}_{isj} := \mathrm{scale}_s\, y'_{is}(t_{isj}) + \mathrm{offset}_s, \qquad (1)$$


where σ_is² is a latent noise variance for the ith subject, and λ_is is the transformation parameter. The fixed scaling scale_s and translation offset_s reduce the range of values taken by y'_is(t) so that the transformations are better-behaved. Let us denote this observation process (1) by p_obs.
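To make the observation process concrete, here is a minimal Python sketch of the warp in (1) and its inverse, under illustrative (hypothetical) parameter values; `lam`, `scale` and `offset` stand in for λ_is, scale_s and offset_s.

```python
import numpy as np

def boxcox_warp(y, lam, scale, offset):
    """Map a latent-scale value y to the observed scale, as in eq. (1).
    Requires lam != 0 and lam * y + 1 > 0."""
    y_prime = (lam * y + 1.0) ** (1.0 / lam)   # inverse Box-Cox transformation
    return scale * y_prime + offset

def boxcox_unwarp(y_tilde, lam, scale, offset):
    """Map an observation back to the latent scale (the 'inverse-transformed'
    values used at training time)."""
    y_prime = (y_tilde - offset) / scale
    return (y_prime ** lam - 1.0) / lam        # Box-Cox transformation

# Example with hypothetical values: a noisy latent value mapped to a
# Tc-like observation scale (offset 35, scale 0.5 as in Section 5).
y_latent = 2.1
y_obs = boxcox_warp(y_latent, lam=0.5, scale=0.5, offset=35.0)
assert np.isclose(boxcox_unwarp(y_obs, 0.5, 0.5, 35.0), y_latent)
```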

We use a Dirichlet process (DP) to model a mixture of functions drawn from GPs. The base distribution G_0 in this DP mixture model is GP(θ^f) × p(θ^f | ϑ^f) × ∏_{s=1}^{S} p(θ^y_s | ϑ^y). Here, GP(θ^f) is the distribution of the S latent functions, given by the correlated GPs described above for the h_is's, parameterized by θ^f, which collects the parameters of the mean functions, the parameters of the covariance function, and the intrinsic correlations; and θ^y_s collects the parameters of p_obs. The parameters θ^f and θ^y_s are further drawn from their respective priors. Such involvement of p(θ^f | ϑ^f) in G_0 has been explored before [5]. The cth random draw from G_0 consists of S random correlated functions f_c1, ..., f_cS, the associated GP parameters θ^f_c, and the parameters θ^y_c1, ..., θ^y_cS for p_obs. The observations for the ith profile are generated via a random variable z_i that indexes into the infinite number of draws from G_0:

$$z_i \mid \beta'_1, \beta'_2, \ldots \sim \mathrm{Categorical}(\beta'_1, \beta'_2, \ldots),$$
$$\tilde{y}_{is}(t) \mid z_i, \{f_{1s}, f_{2s}, \ldots\}, (\theta^y_{1s}, \theta^y_{2s}, \ldots) \sim p_{\mathrm{obs}}(\,\cdot \mid f_{z_i s}(t), \theta^y_{z_i s}), \qquad s = 1, \ldots, S,$$

where β'_c is the probability that z_i takes the value c for the cth component. The β'_c's are given by a stick-breaking process: β'_c := β_c ∏_{c'=1}^{c-1} (1 − β_{c'}), where β_c | α ~ Beta(1, α). In this model, the latent function h_is for the sth trajectory of the ith profile is f_{z_i s}. The function f_cs is a random function, and it is shared probabilistically among all profiles. Hence, the posterior of f_cs will depend on the observations of all the profiles to the extent weighted by the probabilities of the assignments z_i's. This allows borrowing of statistical strength.

Let ϑ := (ϑ^f, ϑ^y), θ_c := (θ^y_c1, ..., θ^y_cS, θ^f_c), f_c := (f_c1, ..., f_cS), Θ := (θ_1, θ_2, ...), F := (f_1, f_2, ...), β := (β_1, β_2, ...), z := (z_1, ..., z_N), Ỹ_i := (ỹ_i1, ..., ỹ_iS), and D := (Ỹ_1, ..., Ỹ_N). Placing a gamma prior Γ(k, γ) on the concentration parameter α, the marginal likelihood of the data is

$$p(D \mid \vartheta, k, \gamma) = \int\!\!\int\!\!\int\!\!\int \sum_{z} p(D \mid z, F, \Theta)\, p(z \mid \beta)\, p(F \mid \Theta)\, dF\, p(\Theta \mid \vartheta)\, d\Theta\, p(\beta \mid \alpha)\, d\beta\, p(\alpha \mid k, \gamma)\, d\alpha,$$

where the dependence on the time-points t_is's is implicit.
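The stick-breaking construction above can be illustrated with the following sketch; the truncation at a finite number of sticks is purely for demonstration, since the model itself is not truncated.

```python
import numpy as np

def stick_breaking_weights(alpha, n_sticks, rng):
    """Draw the first n_sticks mixture weights beta'_c."""
    betas = rng.beta(1.0, alpha, size=n_sticks)              # beta_c | alpha ~ Beta(1, alpha)
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * stick_left                                # beta'_c = beta_c prod_{c'<c} (1 - beta_{c'})

rng = np.random.default_rng(0)
weights = stick_breaking_weights(alpha=1.5, n_sticks=50, rng=rng)
z_i = rng.choice(len(weights), p=weights / weights.sum())   # z_i ~ Categorical(beta'_1, beta'_2, ...)
```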

3 Approximate Inference

Exact inference of the model is intractable. Here, we use a variational approach. The joint posterior is approximated by the factorization q(z) q(F|Θ) q(Θ) q(β|α) q(α), where q(z) := ∏_{i=1}^{N} q(z_i), q(F|Θ) := ∏_{c=1}^{∞} q(f_c | θ^f_c), q(Θ) := ∏_{c=1}^{∞} q(θ_c), and q(β|α) := ∏_{c=1}^{∞} q(β_c | α). An approximation parameter C is chosen, and for all c > C we fix q(f_c | θ^f_c) := p(f_c | θ^f_c), q(θ_c) := p(θ_c | ϑ), and q(β_c | α) := p(β_c | α) = Beta(1, α) [7]. Our factorization involves q(α), which is absent in existing work [7].

Let f_c(t) give the evaluations of all the S functions at every point in t. An additional approximation is, for every c ≤ C, to select a set τ_c of inducing sites and to define q(f_c | θ^f_c) by the factorization

$$q(f_c(t), f_c(\tau_c) \mid \theta^f_c, u_c, V_c) := p(f_c(t) \mid f_c(\tau_c), \theta^f_c)\; q(f_c(\tau_c) \mid u_c, V_c) \qquad (2)$$

that holds for any collection of time-points t [13]. The newly introduced variational parameters u_c and V_c are the mean and covariance of the values of f_c at the inducing sites.

Integrating a GP over its hyper-parameters is generally intractable, so a further approximation is needed. Here, we use point estimates q(θ_c) = δ(θ_c, θ̂_c) for c ≤ C, and we set p(θ|ϑ) = (∑_{c'=1}^{C} δ(θ, θ̂_{c'}))/C. A similar effect is achievable via a DP mixture of the hyper-parameters within G_0 in a Bayesian hierarchical manner. Estimation is by optimizing the variational lower bound on p(D | θ̂_1, ..., θ̂_C, k, γ), which also gives updates on the approximate distributions and their parameters [6]. The updates are omitted here, but we note that the update for q(z) involves ∫ q(α) ψ(1 + α) dα, where ψ is the digamma function and q(α) is a gamma distribution.
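For concreteness, the term ∫ q(α) ψ(1 + α) dα can be evaluated by one-dimensional quadrature. A sketch, with hypothetical shape and rate values for the gamma distribution q(α):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import digamma
from scipy.stats import gamma as gamma_dist

a, b = 2.0, 1.0                          # hypothetical shape and rate of q(alpha)
q_alpha = gamma_dist(a, scale=1.0 / b)
expectation, _ = quad(lambda x: q_alpha.pdf(x) * digamma(1.0 + x), 0.0, np.inf)
print(expectation)                       # E_{q(alpha)}[psi(1 + alpha)]
```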

4 Prediction

For a new subject, we predict one of his time-series trajectories as measurements from the other trajectories arrive over time. Let the prediction be on the first trajectory, and collate the observations up to time t from the other trajectories into the set Ỹ_*(t). The predictive distribution p(ỹ_*1(t) | Ỹ_*(t), D, ϑ, k, γ) is approximated by

$$\int\!\!\int\!\!\int\!\!\int \sum_{\bar z} p(\tilde{y}_{*1}(t) \mid z_*, F, \Theta)\; q_*(\bar z)\, q_*(F \mid \Theta)\, q(\Theta)\, q(\beta \mid \alpha)\, q(\alpha)\; dF\, d\Theta\, d\beta\, d\alpha$$
$$\approx \sum_{c=1}^{C} q_*(z_* = c) \int p(\tilde{y}_{*1}(t) \mid z_* = c, f_{c1}, \hat\theta^y_{c1})\; q_*(f_{c1} \mid \hat\theta^f_c)\, df_{c1} \;+\; \Big(\sum_{c=C+1}^{\infty} q_*(z_* = c)\Big)\, \frac{1}{C} \sum_{c'=1}^{C} \int p(\tilde{y}_{*1}(t) \mid f_{c'1}, \hat\theta^y_{c'1})\; p(f_{c'1} \mid \hat\theta^f_{c'})\, df_{c'1},$$

where z_* is the assignment variable for the new subject, z̄ := (z_1, ..., z_N, z_*), and q_*(·) indicates that the variational distribution is updated with Ỹ_*(t). For the variational distributions, we only update the factor q_*(z_*) in the fully factorized distribution of z̄, and q_*(f_c | θ_c) for c ≤ C. Continuous updates to q_*(z_*) allow different components in the DP mixture to contribute towards the prediction as observations from the other trajectories arrive. The next subsection gives the updates for q_*(f_c | θ_c).
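A minimal sketch of how the head and tail terms of this approximation combine into a single predictive mean; the per-component predictive means are assumed given (for instance, computed as in Section 4.1), and all names below are hypothetical placeholders.

```python
import numpy as np

def mixture_predictive_mean(q_z, posterior_means, prior_means):
    """q_z: array of length C+1 holding q_*(z_* = c) for c = 1..C, with the
    collective tail mass q_*(z_* > C) as the last entry.
    posterior_means: predictive means under the updated q_*(f_c1), length C.
    prior_means: predictive means under the GP priors p(f_c1 | theta_c_hat),
    one per point-estimated hyper-parameter setting, length C."""
    C = len(posterior_means)
    head = np.dot(q_z[:C], posterior_means)
    tail = q_z[C] * np.mean(prior_means)  # tail components average over the C settings
    return head + tail
```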

4.1 Online Prediction and Conditioning for Gaussian Processes

The approximation (2) is the projected process approximation (PPA) [8], where we track q(f_c(τ_c)) at the inducing time-points τ_c. In the manner of the sparse online GP (soGP) [4], define

$$Q := K_{cc}^{-1}, \qquad C := -Q + (K_{cc} + K_c \Sigma_c^{-1} K_c^{\mathsf T})^{-1}, \qquad \alpha := (Q + C)\, K_c \Sigma_c^{-1} (y - m),$$

where K_cc is the covariance between the inducing points at τ_c, K_c is the covariance between the inducing points and the complete training data, y is the vector of collated inverse-transformed observations, m is the vector of values from the mean functions, and Σ_c is a diagonal matrix wherein the entry for the sth trajectory of the ith profile is σ_cs²/q(z_i = c) [3, 9].
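Assuming the kernel matrices and noise terms are given, a minimal sketch of computing Q, C and α (with jitter added for numerical stability):

```python
import numpy as np

def ppa_batch(Kcc, Kc, sigma2, y, m, jitter=1e-8):
    """Kcc: covariance between the M inducing points (M x M).
    Kc: covariance between inducing points and the n training points (M x n).
    sigma2: diagonal of Sigma_c, i.e. sigma_cs^2 / q(z_i = c) per observation.
    y, m: inverse-transformed observations and prior mean values (length n)."""
    M = Kcc.shape[0]
    eye = jitter * np.eye(M)
    Q = np.linalg.inv(Kcc + eye)
    QC = np.linalg.inv(Kcc + (Kc / sigma2) @ Kc.T + eye)  # equals Q + C
    C = QC - Q
    alpha = QC @ ((Kc / sigma2) @ (y - m))
    return Q, C, alpha
```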

The predictive mean and variance of the approximate GP for the sth trajectory at time t are

$$\mathbb{E}[f_{cs}(t)] \approx \hat m := k^{\mathsf T} \alpha + m_{cs}(t), \qquad \mathbb{V}[f_{cs}(t)] \approx \kappa_c(t, t) + k^{\mathsf T} C k,$$

where k is the covariance of the prediction point with the inducing points. The predictive mean E[ỹ_*s(t)] for the measurement is obtained by a one-dimensional integration of the warp function under the normal distribution, taking into account the noise σ_cs² and the fixed scaling and offset in p_obs.

During testing, we further condition on a new observation ỹ on the sth trajectory at time t by using it as a non-inducing point. Let

$$\hat s := (C + Q)\, k, \qquad \hat r := \frac{-1}{\sigma_{cs}^2 / q(z_* = c) + k^{\mathsf T} C k + k^{\mathsf T} Q k}, \qquad \hat q := -\hat r\, (y - \hat m),$$

where k and m̂ are as in the preceding paragraph, and y is the inverse-transformed ỹ. The updates are C ← C + r̂ ŝ ŝ^T and α ← α + q̂ ŝ. These updates shadow those of soGP, except that the κ_c(t, t) within r̂ is replaced here by k^T Q k. This is because PPA never involves the variance of the non-inducing points, but instead replaces it by projection in the manner of the Nyström approximation [14]. At present, we do not use the new observation as an inducing point because such a set of PPA update equations would require the covariance of the observation point with all existing data.
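The conditioning step then reduces to rank-one updates of C and α. A sketch, assuming `k`, `m_hat` and the effective noise variance are provided:

```python
import numpy as np

def ppa_online_update(Q, C, alpha, k, y, m_hat, sigma2_eff):
    """k: covariance of the new time-point with the inducing points (length M).
    y: inverse-transformed new observation; m_hat = k^T alpha + mean function.
    sigma2_eff: sigma_cs^2 / q(z_* = c)."""
    s_hat = (C + Q) @ k
    r_hat = -1.0 / (sigma2_eff + k @ C @ k + k @ Q @ k)
    q_hat = -r_hat * (y - m_hat)
    C_new = C + r_hat * np.outer(s_hat, s_hat)
    alpha_new = alpha + q_hat * s_hat
    return C_new, alpha_new
```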

5 Experiment and Results

Model training and validation are conducted using 29 and 43 sets of heat-strain data measured from laboratory and field route-marches, respectively. Data from the two settings have dissimilar characteristics due to different conditions in the laboratory and field trials. Each subject's time-series profile consists of Tc, HR and Tsk measurements as its time-series trajectories, that is, S = 3. The extended Kalman filter model [10] and the non-linear mixed effects model [11] have previously been applied to a different data set of a similar nature. Our experiment uses the mean and covariance functions

$$m_{is}(t) = \mu_{is} + \beta_{is}\, t, \qquad \kappa_i(t, t') := \sum_{l=1}^{L} \sigma_l^2\; k_{\mathrm{Matern}}(|t - t'|; \nu_l, \ell_{il}),$$

where k_Matern(r; ν, ℓ) is of the Matérn class [8], and L = 4 accommodates varied length-scales. The four process smoothness parameters ν_1 to ν_4 are fixed at 1/2, 5/2, 3/2 and 1/2; and the offset-scale pairs for Tc, HR and Tsk are (35, 0.5), (48, 25) and (30, 1). We use C = 35.
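A sketch of this sum-of-Matérn covariance with the fixed smoothness values; the variances σ_l² and length-scales ℓ_il below are hypothetical placeholders.

```python
import numpy as np

def matern(r, nu, ell):
    """Matern correlation for the half-integer smoothness values used here."""
    x = r / ell
    if nu == 0.5:
        return np.exp(-x)
    if nu == 1.5:
        return (1.0 + np.sqrt(3.0) * x) * np.exp(-np.sqrt(3.0) * x)
    if nu == 2.5:
        return (1.0 + np.sqrt(5.0) * x + 5.0 * x**2 / 3.0) * np.exp(-np.sqrt(5.0) * x)
    raise ValueError("only nu in {1/2, 3/2, 5/2} handled in this sketch")

def kappa(t, t_prime, variances, length_scales, nus=(0.5, 2.5, 1.5, 0.5)):
    """kappa_i(t, t') = sum_l sigma_l^2 k_Matern(|t - t'|; nu_l, ell_il)."""
    r = abs(t - t_prime)
    return sum(s2 * matern(r, nu, ell)
               for s2, nu, ell in zip(variances, nus, length_scales))

# Hypothetical example values (length-scales in minutes):
print(kappa(10.0, 12.5, variances=[1.0, 0.5, 0.5, 0.1],
            length_scales=[5.0, 20.0, 60.0, 120.0]))
```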

We evaluate the performance of our model under two cross-validation (CV) settings, A and B. In each setting, data from three subjects in the laboratory setting and four in the field are used for model validation, with data from the other sixty-five subjects used for training. Differences between observed and predicted Tc values are examined using mean-squared error (MSE), mean bias, and the percentage of prediction-observation deviations within ±0.1, ±0.3 and ±0.5 °C.

Table 1: Prediction accuracy from the two CV runs.

                                     % of deviation within
             MSE      Mean bias      ±0.1    ±0.3    ±0.5
    Set A    0.209    0.0238          15      63      79
    Set B    0.150    0.0185          22      61      81

Table 2: Breakdown of MSEs and log-likelihoods (LL/10³) over the individual validation subjects.

             CV Set A               CV Set B
             MSE      LL/10³        MSE      LL/10³
             0.035     -11          0.027     -37
             0.045     -18          0.064     -28
             0.050     -30          0.068     -38
             0.052     -19          0.119     -26
             0.058     -77          0.159    -108
             0.543     -45          0.174    -196
             0.613     -44          0.327    -177

[Figure 1: Comparison between observed Tc and predicted Tc against time for two subjects, for the laboratory data (left plot) and field data (right plot). The observed Tc's are shown as asterisks and the predicted Tc's as a continuous line; axes are Tc/°C against Time/min.]

[Figure 2: For a new subject, we compute the frequencies of the most probable components. Plotted are the three top components with c ≤ C (red, blue and green solid lines) and the collective tail components (black dashed line); axes are running frequency against Time/min.]

Table 1 summarizes the prediction errors. Overall, the CV results indicate that the model is able to associate a new, unseen subject with components more similar to his available observations. In Figure 1, predicted Tc values are superimposed on the observed Tc values for two representative subjects. The plots show that the Tc estimates are predicted with a reasonable amount of accuracy, and that the work-rest-cycle transitions of the activity are present in the predicted Tc despite the absence of explicit knowledge of this information. This periodicity is inferred from the incoming HR and Tsk observations.

Table 2 summarizes the fourteen individual MSEs and their corresponding log-likelihoods (LLs) upon conditioning on the incoming HR and Tsk observations. From these figures, we can see that closely related HR and Tsk observations generally provide a basis for Tc predictions, as decreasing log-likelihoods often suggest increasing MSEs.

We also investigate the suitability of the DP mixture in clustering test subjects participating in another activity with a different work-rest cycle. Figure 2 shows the cumulative counts of the top components that one of the test subjects is assigned to. As the HR and Tsk observations of the test subject peak at a much faster rate, displaying a trend different from the training data, the DP mixture model initially assigns the subject to the tail components with the greatest probability. Gradually, the HR and Tsk observations begin to display trends more similar to those of the training data, resulting in a switch to the other components with greater probability. This suggests that a truncation-free DP mixture model can appropriately handle data dissimilar from the training data.

Acknowledgments

This work is funded by Future Systems and Technology Directorate, Singapore.

References

[1] Edwin V. Bonilla, Kian Ming A. Chai, and Christopher K. I. Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems 20, pages 153–160. Curran Associates, Inc., 2008.
[2] G. E. P. Box and D. R. Cox. An analysis of transformations. Journal of the Royal Statistical Society, Series B (Methodological), 26(2):211–252, 1964.
[3] Sotirios P. Chatzis and Yiannis Demiris. Nonparametric mixtures of Gaussian processes with power-law behavior. IEEE Transactions on Neural Networks and Learning Systems, 23(12):1862–1871, 2012.
[4] Lehel Csató and Manfred Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668, 2002.
[5] Edmund Jackson, Manuel Davy, Arnaud Doucet, and William J. Fitzgerald. Bayesian unsupervised signal classification by Dirichlet process mixtures of Gaussian processes. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), volume 3, pages III-1077–III-1080, 2007.
[6] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[7] Kenichi Kurihara, Max Welling, and Nikos Vlassis. Accelerated variational Dirichlet process mixtures. In Advances in Neural Information Processing Systems 19, pages 761–768. MIT Press, 2007.
[8] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, USA, 2006.
[9] James C. Ross and Jennifer G. Dy. Nonparametric mixture of Gaussian processes with constraints. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), pages III-1346–III-1354, 2013.
[10] K.-Y. Seng, Y. Chen, K. M. A. Chai, T. Wang, C. Y. D. Fun, Y. Teo, M. S. P. Tan, W. H. Ang, and J. Lee. Tracking body core temperature in military thermal environments: An extended Kalman filter approach. In IEEE EMBS 13th Annual International Body Sensor Networks Conference (BSN'16), 2016.
[11] K.-Y. Seng, Y. Chen, T. Wang, A. K. M. Chai, D. C. Y. Fun, Y. S. Teo, P. M. S. Tan, W. H. Ang, and J. K. W. Lee. Nonlinear mixed effects modelling for the analysis of longitudinal body core temperature data in healthy volunteers. Physiological Measurement, 37:485–502, 2016.
[12] Edward Snelson, Zoubin Ghahramani, and Carl E. Rasmussen. Warped Gaussian processes. In Advances in Neural Information Processing Systems 16, pages 337–344. MIT Press, 2004.
[13] Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 567–574. PMLR, 2009.
[14] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.

