Learning Compact Representations of Time-varying Processes: Motivation and Related Work

Philip Bachman and Doina Precup
1 Motivation
As stated in the title, the goal of our work is to learn compact representations of time-varying processes. The methods we introduce concern cases in which a parametric model serves as the representation, changes in its optimal parametrization provide the temporal variation, and the learned representation compacts the model by restricting temporal variation in the estimated parametrization to a subspace of the model's intrinsic parameter space. While the specific methods we introduce deal exclusively with dimension reduction in time-varying linear regression via linear projection onto a subspace of the possible parametrizations, the idea of reducing dimension in the parameter space of a model, rather than in the observation space, extends readily to many practical domains in which the models and/or dimension reduction methods are more complex than those investigated here. Some real-world problems to which extensions of our methods are applicable are mentioned in the section on related work. We hope to initiate a line of work that fruitfully combines ideas from the rich fields of dimension reduction and varying models.

Current approaches to estimating time-varying models generally face difficulties stemming from interacting factors, including noisy observations and rapid variation in the optimal model relative to both the temporal density and the dimension of the observations. Researchers working on dimension reduction have produced effective methods for combating similar issues in the context of static models, many of which suggest straightforward adaptations to time-varying models. Sharing concepts with existing work on dimension reduction in regression, the methods we introduce empirically prove, in certain circumstances, to significantly reduce both the variance and the error in model estimation that result from the aforementioned factors.
Though our methods are technically uncomplicated, their effectiveness should encourage further investigation into what looks to be a promising area for both practically and theoretically oriented researchers.
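To make the general idea concrete, the following is a minimal sketch of parameter-space dimension reduction for time-varying linear regression: local parametrizations are estimated at each time step, and their temporal variation is then restricted to a principal subspace of the parameter space. The function names, the sliding-window estimator, and the use of PCA on the parameter trajectory are our own illustrative choices, not the exact procedure developed in the paper.

```python
import numpy as np

def fit_local_params(X, y, half_width=10):
    """Estimate a separate least-squares parametrization at each time step,
    using only observations inside a sliding temporal window."""
    T, d = X.shape
    B = np.zeros((T, d))
    for t in range(T):
        lo, hi = max(0, t - half_width), min(T, t + half_width + 1)
        B[t], *_ = np.linalg.lstsq(X[lo:hi], y[lo:hi], rcond=None)
    return B

def restrict_to_subspace(B, k):
    """Compact the time-varying model by projecting its trajectory of local
    parameter estimates onto the top-k principal subspace of parameter space."""
    mean = B.mean(axis=0)
    _, _, Vt = np.linalg.svd(B - mean, full_matrices=False)  # PCA via SVD
    V = Vt[:k]                       # rows form a basis of the k-dim subspace
    return mean + (B - mean) @ V.T @ V, V
```

When the optimal parametrization truly varies within a low-dimensional subspace, the projection discards estimation noise in the complementary directions, which is the source of the variance reduction discussed above.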
2 Related Work
Our methods bring together ideas from prior work in dimension reduction and (time) varying model learning. Most work on dimension reduction has focused on exploiting structure in the distribution of an observed set of inputs in order to form a reduced (e.g. in dimension) representation of the inputs that maintains an optimal balance between compactness and information retention, for some definition of optimal. This general approach to dimension reduction encompasses a significant fraction of the existing literature on unsupervised learning, including numerous variants of PCA (e.g. standard PCA, kernel PCA [10], sparse PCA [15], etc.) and techniques such as ISOMAP [13] and locally-linear embedding [9]. More closely related to our methods is work on what might be called supervised dimension reduction, in which compact representations of a joint set of inputs and outputs are sought with the goal of maintaining information about the relationship between the inputs and their corresponding outputs, while discarding irrelevant and potentially misleading aspects of the inputs. Most work in this area falls into the category of "dimension reduction for regression", with the defining feature of such work being the search for dimension-reducing transformations of the inputs that facilitate regression onto the outputs. One early approach to dimension reduction for regression is projection pursuit regression [4], in which a stagewise search is performed for one-dimensional linear projections of the input that best explain (e.g. via locally-weighted regression) the residual remaining after accounting for the projections selected in earlier stages. Coming later, the method of sliced inverse regression [8] is similar to our method in its use of PCA to determine relevant directions in the inputs, but it differs significantly in that it assumes a static model and requires a spherically-symmetric distribution of the inputs to be effective.
In sliced inverse regression, the inputs are placed into bins (a.k.a. slices) based on the values of their corresponding outputs, and the first k principal components of the sample means of the bins are then taken as bases for the k-dimensional reduced representation of the inputs. In [8], it is shown that, under certain (strong) conditions, this procedure converges to the correct subspace. A more recent introduction is the kernel-based method presented in [5], in which the properties of reproducing kernel Hilbert spaces are used to derive an algorithm that finds a k-dimensional linear subspace of the inputs which, roughly speaking, approximately maximizes the mutual information between the projection of the inputs into this subspace and their corresponding outputs, over the full set of k-dimensional linear subspaces. This method makes few assumptions about the distribution of the inputs, the distribution of the outputs, or the form of the relationship between the inputs and outputs, thus permitting its successful application to a broader range of problems than earlier methods.

Moving away from dimension reduction, we also draw on work addressing varying models. Describing variation in the parameters of a model is the focus of work on varying-coefficient models [6]. In a typical varying-coefficient model, it is assumed that a single parametric model is adequate for describing the relationship between a set of inputs and outputs, provided that the parametrization is allowed to vary in some systematic way over the range of the inputs. In this sense, the parameters of the underlying model can be described as functions of some controlling variable(s), which may be exogenous to the input (e.g. time) or may be some subspace of the inputs. Varying-coefficient models thus subsume locally-weighted regression [2], which can be
seen as a nonparametric method for estimating the functions by which the controlling variables effect parametrizations of the underlying model, with the space in which locality is measured defining the controlling variables. Change-point detection (e.g. [3, 14, 1]) is another area of work aimed at capturing variation in the underlying model. In change-point detection, the underlying model is assumed to vary abruptly at certain points in time, and the goal is to detect such points. The abruptness of the model variation at these change-points distinguishes them from other points in time, while the relatively smooth variation assumed by varying-coefficient models does not similarly privilege particular points in time. Common to both change-point detection and varying-coefficient models is the need to effectively estimate multiple parametrizations of the underlying model. Typically, little information is explicitly shared between estimated parametrizations, aside from the implicit smoothness induced by methods using some form of locally-weighted regression, in which neighboring models are estimated using similar subsets of the inputs. In change-point detection, even this local information sharing is usually abandoned, with individual parametrizations estimated de novo. Our goal is to produce methods for improved information sharing among the estimated parametrizations of a time-varying model. The two methods that we introduce presently are directed towards situations in which the underlying model varies smoothly over time, but the basic approach extends readily to situations more amenable to change-point-style modeling. An example of recent work that can be augmented with an extension of our methods is [7], in which parameters are estimated under the assumption that the underlying model has sparse structure (i.e. relatively few non-zero parameters) subject to abrupt variation.
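The locally-weighted regression view of varying-coefficient estimation, with time as the controlling variable, can be sketched as follows; the Gaussian kernel, the bandwidth parameter, and the function name are our own illustrative assumptions.

```python
import numpy as np

def lwr_coefficients(X, y, t_query, bandwidth=10.0):
    """Estimate the parametrization of a varying-coefficient linear model at
    time t_query via locally-weighted least squares, with a Gaussian kernel
    on temporal distance supplying the observation weights."""
    times = np.arange(len(y))
    w = np.exp(-0.5 * ((times - t_query) / bandwidth) ** 2)
    Xw = X * w[:, None]                    # weight each observation (row)
    # solve the weighted normal equations (X'WX) beta = (X'W) y
    beta, *_ = np.linalg.lstsq(Xw.T @ X, Xw.T @ y, rcond=None)
    return beta
```

Sweeping `t_query` over all time steps yields one parametrization per step, and the kernel overlap between neighboring queries is exactly the implicit local information sharing described above.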
Two more examples are [11] and [12], in which smoothly-varying sparse graphical models are estimated using locally-weighted ℓ1-regularized logistic regression and locally-weighted ℓ1-regularized linear regression, respectively.
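As a concrete illustration of the supervised dimension reduction methods surveyed above, the sliced inverse regression procedure of [8] can be sketched as follows. The function name, the equal-size slicing by sorted output, and the whitening details are our own choices; this is a minimal sketch rather than a faithful reproduction of the published algorithm.

```python
import numpy as np

def sliced_inverse_regression(X, y, k, n_slices=10):
    """Sliced inverse regression: bin the inputs by output value and take the
    top-k principal components of the slice means (computed in a standardized
    input space) as a basis for the reduced representation."""
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # whitening transform via the inverse matrix square root of the covariance
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mean) @ W
    # slice by sorted output value and collect the per-slice input means
    order = np.argsort(y)
    M = np.vstack([Z[idx].mean(axis=0)
                   for idx in np.array_split(order, n_slices)])
    # principal components of the slice means give the relevant directions
    _, _, Vt = np.linalg.svd(M - M.mean(axis=0), full_matrices=False)
    # map the directions back to the original input coordinates
    return W @ Vt[:k].T              # columns span the estimated subspace
```

Note that the spherical-symmetry assumption mentioned above matters here: the guarantee that the slice means concentrate in the relevant subspace relies on an elliptically symmetric input distribution.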
References

[1] Ryan Prescott Adams and David J. C. MacKay. Bayesian online changepoint detection. Technical report, University of Cambridge, 2007.
[2] Christopher G. Atkeson, Andrew W. Moore, and Stefan Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1-5):11–73, 1997.
[3] Jushan Bai and Pierre Perron. Estimating and testing linear models with multiple structural changes. Econometrica, 66(1):47, 1998.
[4] Jerome Friedman and Werner Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817–823, 1981.
[5] Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.
[6] Trevor Hastie and Robert Tibshirani. Varying-coefficient models. Journal of the Royal Statistical Society B, 55(4):757–796, 1993.
[7] Mladen Kolar, Le Song, and Eric P. Xing. Sparsistent learning of varying-coefficient models with structural changes. In Neural Information Processing Systems 23, 2009.
[8] Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
[9] S. T. Roweis and L. K. Saul. Locally linear embedding. Science, 290:2323–2326, 2000.
[10] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 327–352. MIT Press, 1999.
[11] Le Song, Mladen Kolar, and Eric P. Xing. KELLER: estimating time-varying interactions between genes. Bioinformatics, 25:i128–i138, 2009.
[12] Le Song, Mladen Kolar, and Eric P. Xing. Time-varying dynamic Bayesian networks. In Neural Information Processing Systems 23, 2009.
[13] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
[14] Xiang Xuan and Kevin Murphy. Modeling changing dependency structure in multivariate time series. In International Conference on Machine Learning, 2007.
[15] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):301–320, 2006.