Learning Compact Representations of Time-varying Processes: Motivation and Related Work

Philip Bachman and Doina Precup

1 Motivation

As stated in the title, the goal of our work is to learn compact representations of time-varying processes. The methods we introduce here concern cases in which a parametric model serves as the representation, changes in its optimal parametrization provide the temporal variation, and the learned representation compacts the model by restricting temporal variation in the estimated parametrization to a subspace of the parameter space intrinsic to the model. While the specific methods we introduce deal exclusively with dimension reduction in time-varying linear regression via linear projection onto a subspace of the possible parametrizations of a model, dimension reduction in the parameter space of a model, rather than in the observation space, extends readily to many practical domains in which the models and/or the dimension reduction methods are more complex than those investigated here. Some real-world problems to which extensions of our methods are applicable are mentioned in the section on related work. We hope to initiate a line of work that fruitfully combines ideas from the rich fields of dimension reduction and varying models. Current approaches to estimating time-varying models generally face difficulties stemming from interacting factors, including noisy observations and rapid variation in the optimal model relative to both the temporal density and the dimension of the observations. Researchers working on dimension reduction have produced effective methods for combating similar issues in the context of static models, many of which suggest straightforward adaptations to time-varying models. Sharing concepts with existing work on dimension reduction for regression, the methods we introduce here empirically prove effective, in certain circumstances, at significantly reducing both the variance and the error of model estimates affected by the aforementioned factors. Though technically uncomplicated, the effectiveness of our methods should encourage further investigation of what looks to be a promising area for both practically and theoretically oriented researchers.
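To fix ideas, the sketch below illustrates one simple instance of this parameter-space view: given a trajectory of per-time-step coefficient estimates for a time-varying linear regression (obtained by any estimator, e.g. locally-weighted least squares), the trajectory is confined to the low-dimensional subspace of parameter space that captures most of its variance. This is only an illustration of the general idea, not the exact methods developed in the remainder of the paper; the function name, the use of PCA over the estimated trajectory, and the choice of k are our own assumptions.

```python
# Illustrative sketch (not the exact methods developed later): given a T x d
# trajectory W of per-time-step coefficient estimates for a time-varying linear
# regression, restrict its temporal variation to the k-dimensional subspace of
# parameter space capturing the most variance in the estimated trajectory.
import numpy as np

def compact_parameter_trajectory(W, k=2):
    """Project a coefficient trajectory onto its top-k principal directions."""
    mean = W.mean(axis=0)
    _, _, Vt = np.linalg.svd(W - mean, full_matrices=False)
    basis = Vt[:k]                       # k x d basis of the parameter subspace
    coords = (W - mean) @ basis.T        # compact T x k representation
    W_compact = mean + coords @ basis    # parametrizations confined to the subspace
    return W_compact, basis, coords
```

Discarding parameter-space directions that mostly carry estimation noise is the intuition behind the variance reduction referred to above.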

2 Related Work

Our methods bring together ideas from prior work on dimension reduction and on learning (time-)varying models. Most work on dimension reduction has focused on exploiting structure in the distribution of an observed set of inputs in order to form a reduced (e.g. lower-dimensional) representation of the inputs that maintains an optimal balance between compactness and information retention, for some definition of optimal. This general approach encompasses a significant fraction of the existing literature on unsupervised learning, including numerous variants of PCA (e.g. standard PCA, kernel PCA [10], and sparse PCA [15]) and techniques such as ISOMAP [13] and locally-linear embedding [9].

More closely related to our methods is work on what might be called supervised dimension reduction, in which compact representations of a joint set of inputs and outputs are sought with the goal of retaining information about the relationship between the inputs and their corresponding outputs, while discarding irrelevant and potentially misleading aspects of the inputs. Most work in this area falls into the category of "dimension reduction for regression", whose defining feature is the search for dimension-reducing transformations of the inputs that facilitate regression onto the outputs. One early approach is projection pursuit regression [4], in which a stagewise search is performed for one-dimensional linear projections of the input that best explain (e.g. via locally-weighted regression) the residual remaining after accounting for the projections selected in earlier stages. The later method of sliced inverse regression [8] is similar to our method in its use of PCA to determine relevant directions in the inputs, but it differs significantly in that it assumes a static model and requires a spherically-symmetric distribution of the inputs to be effective. In sliced inverse regression, the inputs are placed into bins (a.k.a. slices) based on the values of their corresponding outputs, and the first k principal components of the per-bin sample means are then taken as bases for the k-dimensional reduced representation of the inputs (a short illustrative sketch of this step is given below). In [8], it is shown that, under certain (strong) conditions, this procedure converges to the correct subspace. A more recent contribution is the kernel-based method of [5], in which the properties of reproducing kernel Hilbert spaces are used to derive an algorithm that finds, among all k-dimensional linear subspaces of the input space, one that approximately maximizes the mutual information between the projection of the inputs onto the subspace and their corresponding outputs. This method makes few assumptions about the distribution of the inputs, the distribution of the outputs, or the form of the relationship between them, permitting its successful application to a broader range of problems than earlier methods.
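To make the slicing step of [8] described above concrete, the following is a minimal sketch of that procedure, assuming equal-count slices formed by sorting on the output; the full method in [8] also standardizes the inputs before slicing, which is omitted here, and the function name and default parameters are our own choices.

```python
# Minimal sketch of the slicing procedure of sliced inverse regression [8]:
# bin the data by output value, average the inputs within each slice, and take
# the leading principal components of the slice means as a basis for the
# reduced input representation.
import numpy as np

def sir_basis(X, y, n_slices=10, k=2):
    """Slice by output value, average inputs within slices, PCA of slice means."""
    order = np.argsort(y)                              # sort observations by output
    slices = np.array_split(order, n_slices)           # roughly equal-count slices
    means = np.stack([X[idx].mean(axis=0) for idx in slices])  # per-slice input means
    means = means - means.mean(axis=0)
    _, _, Vt = np.linalg.svd(means, full_matrices=False)
    return Vt[:k]              # k x d basis; reduced inputs are given by X @ Vt[:k].T
```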
Moving away from dimension reduction, we also draw on work addressing varying models. Describing variation in the parameters of a model is the focus of work on varying-coefficient models [6]. In a typical varying-coefficient model, it is assumed that a single parametric model is adequate for describing the relationship between a set of inputs and outputs, provided that the parametrization is allowed to vary in some systematic way over the range of the inputs. In this sense, the parameters of the underlying model can be described as functions of some controlling variable(s), which may be exogenous to the inputs (e.g. time) or may be some subspace of the inputs. Varying-coefficient models thus subsume locally-weighted regression [2], which can be seen as a nonparametric method for estimating the functions by which the controlling variables determine parametrizations of the underlying model, with the space in which locality is measured defining the controlling variables.

Change-point detection (e.g. [3, 14, 1]) is another area of work aimed at capturing variation in the underlying model. In change-point detection, the underlying model is assumed to vary abruptly at certain points in time, and the goal is to detect such points. The abruptness of the model variation at these change-points distinguishes them from other points in time, whereas the relatively smooth variation assumed by varying-coefficient models does not similarly privilege particular points in time. Common to both change-point detection and varying-coefficient models is the need to effectively estimate multiple parametrizations of the underlying model. Typically, little information is explicitly shared between estimated parametrizations, aside from the implicit smoothness induced by methods based on some form of locally-weighted regression, in which neighboring models are estimated using similar subsets of the inputs. In change-point detection, even this local information sharing is usually abandoned, with individual parametrizations estimated de novo.

Our goal is to produce methods for improved information sharing among the estimated parametrizations of a time-varying model. The two methods that we introduce presently are directed towards situations in which the underlying model varies smoothly over time, but the basic approach extends readily to situations more amenable to change-point-style modeling. An example of recent work that can be augmented with an extension of our methods is [7], in which parameters are estimated under the assumption that the underlying model has sparse structure (i.e. relatively few non-zero parameters) subject to abrupt variation. Two more examples are [11] and [12], in which smoothly-varying sparse graphical models are estimated using locally-weighted ℓ1-regularized logistic regression and locally-weighted ℓ1-regularized linear regression, respectively.
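For concreteness, the sketch below shows locally-weighted ℓ1-regularized linear regression of the general kind used in [11, 12] for smoothly-varying sparse models, with a per-time-step weighted lasso solved by a basic proximal-gradient loop; the Gaussian weighting, the solver, and all parameter values are our own illustrative choices rather than the exact estimators of [11, 12].

```python
# Illustrative sketch: locally-weighted l1-regularized linear regression for a
# smoothly time-varying sparse model. Observations near the target time step
# receive larger weights; each weighted lasso is solved with a simple
# proximal-gradient (ISTA) loop. Bandwidth, penalty, and iteration count are
# arbitrary assumptions.
import numpy as np

def weighted_lasso(X, y, weights, lam=0.1, n_iters=500):
    """Minimize sum_i w_i * (y_i - x_i . b)^2 + lam * ||b||_1 via ISTA."""
    Xw = X * np.sqrt(weights)[:, None]      # fold the weights into the data
    yw = y * np.sqrt(weights)
    L = 2.0 * np.linalg.norm(Xw, 2) ** 2    # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2.0 * Xw.T @ (Xw @ b - yw)
        z = b - grad / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return b

def time_varying_sparse_coefficients(X, y, bandwidth=10.0, lam=0.1):
    """Estimate one sparse coefficient vector per time step via local weighting."""
    T = X.shape[0]
    times = np.arange(T)
    return np.stack([
        weighted_lasso(X, y, np.exp(-0.5 * ((times - t) / bandwidth) ** 2), lam)
        for t in range(T)
    ])  # T x d array of smoothly-varying sparse parametrizations
```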

References

[1] Ryan Prescott Adams and David J. C. MacKay. Bayesian online changepoint detection. Technical report, University of Cambridge, 2007.
[2] Christopher G. Atkeson, Andrew W. Moore, and Stefan Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1-5):11–73, 1997.
[3] Jushan Bai and Pierre Perron. Estimating and testing linear models with multiple structural changes. Econometrica, 66(1):47, 1998.
[4] Jerome Friedman and Werner Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817–823, 1981.
[5] Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.
[6] Trevor Hastie and Robert Tibshirani. Varying-coefficient models. Journal of the Royal Statistical Society, Series B, 55(4):757–796, 1993.

[7] Mladen Kolar, Le Song, and Eric P. Xing. Sparsistent learning of varying-coefficient models with structural changes. In Neural Information Processing Systems, 2009.
[8] Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
[9] S. T. Roweis and L. K. Saul. Locally linear embedding. Science, 290:2323–2326, 2000.
[10] B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 327–352. MIT Press, 1999.
[11] Le Song, Mladen Kolar, and Eric P. Xing. KELLER: estimating time-varying interactions between genes. Bioinformatics, 25:i128–i138, 2009.
[12] Le Song, Mladen Kolar, and Eric P. Xing. Time-varying dynamic Bayesian networks. In Neural Information Processing Systems, 2009.
[13] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
[14] Xiang Xuan and Kevin Murphy. Modeling changing dependency structure in multivariate time series. In International Conference on Machine Learning, 2007.
[15] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):301–320, 2006.

