LONG-RANGE OUT-OF-SAMPLE PROPERTIES OF AUTOREGRESSIVE NEURAL NETWORKS

PATRICK L. LEONI

Abstract. We consider already-trained discrete autoregressive neural networks in their most general representations, excluding time-varying input, and we provide tight sufficient conditions, with elementary proofs, for: 1- existence of an attractor, 2- uniqueness, 3- global convergence. These conditions can be used as easy-to-check criteria when convergence (or non-convergence) of long-range predictions is desirable.

1. Introduction

Out-of-sample accuracy of predictions is one of the most attractive features of neural networks; that is, once a neural network is trained on a truncated data sample, or training set, its ability to forecast the remaining sample is remarkable (see McNelis, 2005, for an introduction). Several training methods have been successfully developed with this objective in mind (see Navia-Vázquez and Figueiras-Vidal, 2000, Šíma, 2002, and Lu and Leen, 2007, among many others). Despite the considerable effort devoted to achieving ever better forecasting properties, little has been done so far to analyze the sensitivity of long-range predictions to the functional representation of the network and to the choice of initial conditions used to generate predictions. This issue is often innocuous with linear models, but it is of critical importance with neural networks because of their non-linearity. If the non-linear deterministic system representing the trained network has an attractor, then long-term predictions may converge toward it, depending on the choice of initial conditions. This situation may have desirable features when, for instance, the original data suggests some form of long-run convergence.

In this case, the network can be trained, and initial conditions chosen, so as to learn this long-run stationary value. However, when the original data displays some form of chaotic behavior, convergence toward the attractor may nevertheless occur, and out-of-sample predictions will surely be misleading. Another problematic aspect of attractors in neural networks is the lack of uniqueness. Network predictions may then converge to different attractors and, depending on initial conditions, long-run accuracy of predictions can be severely flawed even when the network is correctly trained. This situation occurs quite often; for instance, we give in Section 3 an example where a standard Elliott activation function leads to three different attractors.

This paper considers already-trained discrete autoregressive neural networks in their most general representations, excluding time-varying input, and it provides new and tight sufficient conditions, with elementary proofs, for: 1- existence of an attractor in the neural network, 2- uniqueness, 3- global convergence. These conditions bear on an intuitive restriction of the general functional form of the network, and they can be used as criteria when, for instance, choosing an activation function to match anticipated long-range properties of actual data. As discussed earlier, it may be desirable for long-range out-of-sample predictions to converge (or not) toward some particular value, and training the right network with such considerations in mind leads to a significant improvement in accuracy.

Other studies have addressed related issues. Trapletti et al. (2000) analyze global convergence and stationarity only for single hidden-layer perceptrons with shortcut connections. Van den Driessche and Zou (1998) focus on global attractivity solely in delayed Hopfield neural network models. Vanualailai and Nakagiri (2005) use Lyapunov functions to derive sufficient conditions for global convergence in our general setting, although their conditions appear stronger than ours.

Hu and Liu (2005) and Ji et al. (2007) add time-varying input to a global convergence analysis in some particular networks; we leave this issue, in a general setting, for later work.

The paper is organized as follows: in Section 2 we present the model, in Section 3 we establish existence of attractors under a suitable assumption, in Section 4 we study uniqueness and convergence for every set of initial conditions, and Section 5 concludes the paper.

2. The model

We consider an already-trained autoregressive (of order $p > 0$) neural network whose dynamics are represented by

(2.1)    $y_t = G(y_{t-1}, y_{t-2}, \ldots, y_{t-p}),$

where $G$ is a non-linear function mapping $\mathbb{R}^p$ into $\mathbb{R}$ and representing the autoregressive neural network in its most general specification. We omit the standard dependence on exogenous variables to simplify the analysis, although none of our results would change if we added it. We also take as given any $p$ arbitrary reals $(y_1, \ldots, y_p)$ as initial conditions.

This representation encompasses, without being restricted to, the following forms. Denote by $A$ any $p \times p$ matrix of reals and write, in compact notation, $\mathbf{y}_t = (y_{t-1}, y_{t-2}, \ldots, y_{t-p})$; the neural network (2.1) can then take, for instance, the form

(2.2)    $y_t = \Gamma\!\left(A \cdot \mathbf{y}_t\right),$

where $\Gamma : \mathbb{R} \to \mathbb{R}$ can be any activation function. Adding a linear autoregressive part to (2.2) is also a particular case of our study: fix a $p$-vector of reals $(b_1, \ldots, b_p)$; then any neural network of the form

(2.3)    $y_t = b_1 \cdot y_{t-1} + \ldots + b_p \cdot y_{t-p} + \Gamma\!\left(A \cdot \mathbf{y}_t\right)$

fits into our analysis. One can imagine many other specifications of classes of neural networks; however, we leave for further work the issue of time-varying input as in Hu and Liu (2005) and Ji et al. (2007).
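For readers who prefer to see the forward iteration spelled out, the following sketch (illustrative Python, not part of the paper) generates long-range out-of-sample predictions from an already-trained network of the form (2.3). The coefficients, the tanh activation and the choice of $A$ as a $1 \times p$ weight vector (so that $A \cdot \mathbf{y}_t$ is a scalar) are placeholder assumptions made only for the demonstration.

```python
import numpy as np

def predict_long_range(b, A, activation, y_init, horizon):
    """Iterate (2.3): y_t = b . y_lag + activation(A . y_lag),
    with y_lag = (y_{t-1}, ..., y_{t-p}), for `horizon` steps."""
    p = len(b)
    lags = list(np.asarray(y_init)[::-1])   # most recent value first
    path = []
    for _ in range(horizon):
        y_lag = np.array(lags[:p])
        y_next = float(b @ y_lag + activation(A @ y_lag))
        path.append(y_next)
        lags.insert(0, y_next)              # shift the lag window
    return np.array(path)

# Placeholder coefficients for p = 2; y_init lists (y_1, y_2) in time order.
b = np.array([0.3, 0.2])                    # linear autoregressive part
A = np.array([0.5, -0.4])                   # weights feeding the activation
path = predict_long_range(b, A, np.tanh, y_init=[0.1, -0.2], horizon=200)
print(path[-5:])                            # long-range out-of-sample values
```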


We next define the notion of attractor, central to this paper. For every integer $n$, denote by $G^n$ the $n$-times iterated functional composition of the general function $G$.

Definition. We say that $y \in \mathbb{R}$ is an attractor if there exist initial conditions $(y_1, \ldots, y_p)$ such that $G^n(y_1, \ldots, y_p)$ converges to $y$.

We are primarily interested in finding sufficient conditions for existence, uniqueness and global attractiveness of such attractors.

3. Existence

In this section, we give a set of tight sufficient conditions ensuring existence of an attractor. Those conditions bear not on the general functional specification $G$, but rather on its restriction to a small and intuitive subspace. For any function $G$ as above, we define the function $\Psi_G$ mapping $\mathbb{R}$ into $\mathbb{R}$ as $\Psi_G(x) = G(x, \ldots, x)$ for every real $x$.

Theorem 1. Assume that $\Psi_G$ is continuous and bounded. Then there exists an attractor of (2.1).

Notice first that Theorem 1 does not require that the function $G$ be continuous; continuity is required only of its restriction $\Psi_G$. It is easy to check in practice whether the conditions of Theorem 1 are met. As a concrete illustration, we next consider the neural network (2.3) where the matrix $A$ is diagonal. We then have

(3.1)    $\Psi_G(x) = x \cdot \sum_{i=1}^{p} b_i + \Gamma\!\left(x \cdot \sum_{i=1}^{p} a_{ii}\right),$

for every real $x$. Boundedness of $\Psi_G$ amounts to the existence of a positive constant $c$ such that $|\Psi_G(x)| \le c$ for every real $x$, which rewrites as

(3.2)    $-c - x \cdot \sum_{i=1}^{p} b_i \;\le\; \Gamma\!\left(x \cdot \sum_{i=1}^{p} a_{ii}\right) \;\le\; c - x \cdot \sum_{i=1}^{p} b_i,$

for every real $x$. Thus, for (2.3) to have an attractor, it is sufficient that 1- the activation function be continuous, and 2- it be bounded above and below by affine functions whose common slope depends on the network coefficients.
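In practice condition (3.2) can be probed numerically before trusting long-range predictions. The sketch below (illustrative Python with made-up coefficient sums, not part of the paper) evaluates $\Psi_G$ from (3.1) on increasingly wide grids; a supremum that keeps growing as the grid widens suggests that $\Psi_G$ is unbounded and that Theorem 1 does not apply. This is only a heuristic check, not a proof of boundedness.

```python
import numpy as np

def psi_G(x, b_sum, a_sum, activation):
    """Restriction (3.1) for the diagonal-A case: Psi_G(x) = x*sum(b_i) + Gamma(x*sum(a_ii))."""
    return x * b_sum + activation(x * a_sum)

def probe_boundedness(b_sum, a_sum, activation, half_widths=(1e2, 1e4, 1e6)):
    """Report sup |Psi_G| over widening grids as a crude probe of condition (3.2)."""
    for w in half_widths:
        grid = np.linspace(-w, w, 100_001)
        sup = np.max(np.abs(psi_G(grid, b_sum, a_sum, activation)))
        print(f"half-width {w:>9.0e}: sup|Psi_G| ~ {sup:.3e}")

# Bounded activation and no linear part: Psi_G stays bounded, Theorem 1 applies.
probe_boundedness(b_sum=0.0, a_sum=0.7, activation=np.tanh)
# A non-zero sum of the b_i makes Psi_G grow linearly, so (3.2) fails.
probe_boundedness(b_sum=0.3, a_sum=0.7, activation=np.tanh)
```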


The proof of Theorem 1 is given next; it uses a fixed point argument applied to a restriction of the already restricted function $\Psi_G$. This approach allows for a short and intuitive proof, and it does not rely on finding appropriate Lyapunov functions as in Vanualailai and Nakagiri (2005).

Proof. The strategy is to apply a fixed point argument to a well-chosen restriction of the already restricted function $\Psi_G$. Define $M = \sup_{y \in \mathbb{R}} |\Psi_G(y)|$, which is finite since $\Psi_G$ is bounded, and consider the restriction of $\Psi_G$ to the set $[-M, M]$. Since this restriction maps the compact convex set $[-M, M]$ into itself, and since $\Psi_G$ is continuous, a direct application of Brouwer's Fixed Point Theorem (see Border, 1985, Ch. 6) provides the existence of a real $\bar{y} \in [-M, M]$ such that $\bar{y} = \Psi_G(\bar{y})$. Consider the initial values $\mathbf{y}_0 = (\bar{y}, \ldots, \bar{y})$; then, by construction of $\bar{y}$,

(3.3)    $\bar{y} = G^n(\bar{y}, \ldots, \bar{y})$ for every integer $n$,

and thus $\bar{y}$ is an attractor reached from the initial condition $\mathbf{y}_0$. The proof is now complete. $\square$
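The construction in the proof is easy to reproduce numerically: since $|\Psi_G| \le M$, the function $h(y) = \Psi_G(y) - y$ satisfies $h(-M) \ge 0$ and $h(M) \le 0$, so plain bisection on $[-M, M]$ locates a fixed point $\bar{y} = \Psi_G(\bar{y})$. The sketch below is illustrative Python, not part of the paper; the tanh example and the bound $M = 1$ are assumptions chosen for the demonstration.

```python
import numpy as np

def attractor_candidate(psi, M, tol=1e-12, max_iter=200):
    """Bisection on h(y) = psi(y) - y over [-M, M], mirroring the proof of Theorem 1."""
    h = lambda y: psi(y) - y
    lo, hi = -M, M                   # h(lo) >= 0 and h(hi) <= 0 because |psi| <= M
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if abs(h(mid)) < tol:
            return mid
        if h(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: Psi_G(x) = tanh(0.7 * x) is continuous and bounded by M = 1.
psi = lambda x: float(np.tanh(0.7 * x))
y_bar = attractor_candidate(psi, M=1.0)
print(y_bar, psi(y_bar))             # y_bar and Psi_G(y_bar) coincide (here ~0)
```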



It is easy to find examples of neural networks with multiple attractors that satisfy the conditions of Theorem 1. For instance, consider the autoregressive system of order 1 defined as $y_{t+1} = G(y_t)$, where $G$ is the standard (scaled) Elliott activation function $G(y) = 2y/(1+|y|)$. In this case, the values $-1$, $0$ and $1$ are all attractors.

4. Uniqueness

We now give a sufficient condition ensuring uniqueness of the attractor, together with convergence toward it for every set of initial conditions.

Assumption 1. There exists a sequence of reals $(\theta_n)_{n \ge 0}$ with $\sum_n \theta_n < \infty$ such that $|G^n(x) - G^n(x')| \le \theta_n \|x - x'\|$ for every $x, x' \in \mathbb{R}^p$ and every integer $n$.
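The multiplicity in the Elliott example above can be checked directly, and it also illustrates why a condition such as Assumption 1 is needed for uniqueness: trajectories started on opposite sides of zero settle on different attractors, so no summable sequence $(\theta_n)$ can dominate $|G^n(x) - G^n(x')| / \|x - x'\|$. The sketch below is illustrative Python, not part of the paper; the starting values are arbitrary.

```python
def elliott(y):
    """Scaled Elliott activation G(y) = 2y / (1 + |y|), as in the example of Section 3."""
    return 2.0 * y / (1.0 + abs(y))

def iterate(y0, n=100):
    """n-fold iteration G^n(y0) of the order-1 autoregressive system."""
    y = y0
    for _ in range(n):
        y = elliott(y)
    return y

print(iterate(0.01), iterate(0.0), iterate(-0.01))   # -> ~1.0, 0.0, ~-1.0: three attractors

# Nearby starting points on opposite sides of zero end up about 2 apart, so the
# ratio |G^n(x) - G^n(x')| / |x - x'| approaches 100 here instead of vanishing.
x, x_prime = 0.01, -0.01
print(abs(iterate(x) - iterate(x_prime)) / abs(x - x_prime))
```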


It turns out that the previous assumption is all that is needed to prove uniqueness and convergence for all initial conditions.

Theorem 2. Under Assumption 1, there exists a unique attractor and the system (2.1) globally converges toward it.

Assumption 1 is significantly weaker than the standard assumptions used to prove global convergence in similar settings. Indeed, it is nearly always assumed that the function $G$ be globally Lipschitz with Lipschitz coefficient strictly less than 1 (see van den Driessche and Zou, 1998, among many others). It is easy to see that any such function satisfies Assumption 1. By definition, a Lipschitz function $G$ with coefficient $\theta < 1$ satisfies

$|G(y) - G(y')| \le \theta \|y - y'\|$ for every $y, y' \in \mathbb{R}^p$.

Iterating $G$ and applying the Lipschitz condition to the once-iterated function gives

$|G^2(y) - G^2(y')| \le \theta^2 \|y - y'\|.$

Repeating this iteration shows that

$|G^n(y) - G^n(y')| \le \theta^n \|y - y'\|$ for every integer $n$.

To see that such a Lipschitz function $G$ satisfies Assumption 1, it thus suffices to define the sequence $(\theta_n)_{n \ge 0}$ by $\theta_n = \theta^n$ for every $n$; the fact that $\theta < 1$ directly implies that $\sum_n \theta_n < \infty$. The converse is not true, though: a function $G$ satisfying Assumption 1 need not be globally Lipschitz with coefficient strictly less than 1, since for some values of $n$ the corresponding coefficient $\theta_n$ may exceed 1.

For a concrete application to neural networks, consider the representation (2.3) with a linear autoregressive part. Define the function $\Upsilon(x) = \Gamma(A \cdot x)$ for every $x \in \mathbb{R}^p$, and assume that $\Upsilon$ is Lipschitz with coefficient $\alpha > 0$.


For the system (2.3) to satisfy Assumption 1, by the preceding remarks it is sufficient to show that $G$ is Lipschitz with coefficient strictly less than 1. Fix $x, x' \in \mathbb{R}^p$; we have

$|G(x) - G(x')| = \left| \sum_{i=1}^{p} b_i x_i + \Upsilon(x) - \sum_{i=1}^{p} b_i x'_i - \Upsilon(x') \right|$
$\le \left| \sum_{i=1}^{p} b_i (x_i - x'_i) \right| + |\Upsilon(x) - \Upsilon(x')|$
$\le \max_i |b_i| \cdot \|x - x'\| + \alpha \cdot \|x - x'\| = \left( \max_i |b_i| + \alpha \right) \cdot \|x - x'\|,$

where $\|\cdot\|$ denotes the $\ell_1$-norm on $\mathbb{R}^p$ (any equivalent norm would do, with adjusted constants). For Assumption 1 to hold, and thus to obtain existence and uniqueness of an attractor, it is therefore sufficient that $\max_i |b_i| + \alpha < 1$.

The proof of Theorem 2 is non-standard because the operator $G$ is not an endomorphism (that is, it does not map a space into itself), so a standard proof based on contraction mapping arguments is inappropriate in our setting.

Proof. We first establish existence of an attractor for every set of initial conditions. Consider a sequence $(y_t)_t$ generated by iterating (2.1) from arbitrary initial conditions $\mathbf{y}^1 = (y_1, \ldots, y_p)$. Fix two integers $n$ and $m$ such that $n > m$. By Assumption 1, we have

(4.1)    $|y_n - y_m| \le \sum_{j=m+1}^{n} |y_j - y_{j-1}| \le \sum_{j=m+1}^{n} \theta_j \, \|\mathbf{y}^2 - \mathbf{y}^1\| = \|\mathbf{y}^2 - \mathbf{y}^1\| \sum_{j=m+1}^{n} \theta_j,$

where $\mathbf{y}^2 = (y_2, \ldots, y_{p+1})$. Since the sequence $(\theta_n)_n$ is summable, the sequence $(\sum_{j=1}^{n} \theta_j)_n$ is a Cauchy sequence, and from (4.1) so is $(y_t)_t$. Thus $(y_t)_t$ must converge to some value $y$, which is an attractor by definition.


We now prove uniqueness. Consider two different initial conditions $\mathbf{y}_0 = (y_1, \ldots, y_p)$ and $\mathbf{y}'_0 = (y'_1, \ldots, y'_p)$, with respective generated sequences $(y_t)_t$ and $(y'_t)_t$ and respective attractors $y$ and $y'$. Assume, by way of contradiction, that $y \neq y'$. By definition of the attractors, it is straightforward to derive that, for every $\varepsilon > 0$, there exists $\bar{n}$ such that, for every $n \ge \bar{n}$,

(4.2)    $|y - y'| \le |G^n(y_1, \ldots, y_p) - G^n(y'_1, \ldots, y'_p)| + \varepsilon.$

By Assumption 1, it follows that

(4.3)    $|y - y'| \le \theta_n \|\mathbf{y}_0 - \mathbf{y}'_0\| + \varepsilon.$

Since $(\theta_n)_n$ is summable, $(\theta_n)$ converges to 0, and thus the difference $|y - y'|$ can be made arbitrarily small since $\varepsilon$ was chosen arbitrarily. Uniqueness of the attractor is established, and the proof is now complete. $\square$
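As an informal complement to the proof, global convergence can be observed numerically. The sketch below (illustrative Python, not part of the paper) iterates a network of the form (2.3) with small, made-up coefficients for which the sufficient condition $\max_i |b_i| + \alpha < 1$ discussed above plausibly holds, and checks that widely spread random initial conditions all settle on the same long-run value.

```python
import numpy as np

rng = np.random.default_rng(0)

def long_run_value(b, A, y_init, horizon=500):
    """Iterate y_t = b . y_lag + tanh(A . y_lag), as in (2.3), and return the last value."""
    lags = list(np.asarray(y_init)[::-1])     # most recent value first
    for _ in range(horizon):
        y_lag = np.array(lags[:len(b)])
        lags.insert(0, float(b @ y_lag + np.tanh(A @ y_lag)))
    return lags[0]

# Small placeholder coefficients; Theorem 2 then predicts a unique attractor
# reached from every initial condition.
b = np.array([0.2, 0.1])
A = np.array([0.3, -0.2])
limits = [long_run_value(b, A, rng.uniform(-10, 10, size=2)) for _ in range(5)]
print(limits)                                  # the five long-run values coincide
```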

5. Conclusions

We have given sufficient conditions on autoregressive neural networks in their most general representations (excluding time-varying parameters) for existence of an attractor, uniqueness and global convergence. Those criteria are useful and easy to check when choosing activation functions to match anticipated long-range properties of neural networks. Further work will include a similar analysis with more general specifications than (2.1), such as

(5.1)    $y_t = G_t(y_{t-1}, y_{t-2}, \ldots, y_{t-p}),$

for every integer $t$ (time-varying parameters). That work shall identify sufficient conditions on the functions $(G_t)_{t \ge 0}$ for existence and uniqueness of attractors.

References

[1] K. Border, Fixed Point Theorems with Applications to Economics and Game Theory, Cambridge University Press (1985).

[2] P. van den Driessche and X. Zou, Global Attractivity in Delayed Hopfield Neural Network Models, SIAM J. Appl. Math. 58 (1998), pp. 1878-1890.


[3] S. Hu and D. Liu, On the Global Output Convergence of a Class of Recurrent Neural Networks with Time-Varying Input, Neural Networks 18 (2005), pp. 171-178.

[4] Y. Ji, X. Lou and B. Cui, Global Output Convergence of Cohen-Grossberg Neural Networks with both Time-Varying and Distributed Delays, forthcoming in Chaos, Solitons & Fractals (2007).

[5] Z. Lu and T. Leen, Penalized Probabilistic Clustering, Neural Comp. 19 (2007), pp. 1528-1567.

[6] P. McNelis, Neural Networks in Finance: Gaining Predictive Edge in the Market, Elsevier Academic Press (2005).

[7] A. Navia-Vázquez and A. R. Figueiras-Vidal, Efficient Block Training of Multilayer Perceptrons, Neural Comp. 12 (2000), pp. 1429-1447.

[8] J. Šíma, Training a Single Sigmoidal Neuron is Hard, Neural Comp. 14 (2002), pp. 2709-2728.

[9] A. Trapletti, F. Leisch and K. Hornik, Stationary and Integrated Autoregressive Neural Network Processes, Neural Comp. 12 (2000), pp. 2427-2450.

[10] J. Vanualailai and S.-I. Nakagiri, Some Generalized Sufficient Convergence Criteria for Nonlinear Continuous Neural Networks, Neural Comp. 17 (2005), pp. 1820-1835.
