
Nonlinear System Modeling with Random Matrices: Echo State Networks Revisited

Bai Zhang, David J. Miller, and Yue Wang

Abstract—Echo state networks (ESNs) are a novel form of recurrent neural network (RNN) that provides an efficient and powerful computational model for approximating nonlinear dynamical systems. A unique feature of an ESN is that a large number of neurons (called the "reservoir") are used, with the synaptic connections generated randomly and only the connections from the reservoir to the output modified by learning. Why a large, randomly generated, fixed RNN or reservoir gives such excellent performance in approximating nonlinear systems is still not well understood. In this brief paper, we apply random matrix theory to examine the properties of random reservoirs in ESNs under different reservoir topologies (sparse or fully connected) and connection weights (Bernoulli or Gaussian). We precisely quantify the asymptotic gap between the scaling factor bounds for the necessary and sufficient conditions previously proposed for the echo state property. We then show that the state transition mapping is contractive with high probability when only the necessary condition is satisfied, which corroborates and thus analytically explains the observation that in practice one obtains echo states when the spectral radius of the reservoir weight matrix is smaller than 1.

Index Terms—Echo state networks, recurrent neural networks, echo state property, random matrix theory, circular law, concentration of measure.

B. Zhang and Y. Wang are with the Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Arlington, VA 22203 USA. D. J. Miller is with the Department of Electrical Engineering, The Pennsylvania State University, University Park, PA 16802. E-mail: [email protected].

I. INTRODUCTION

Recurrent neural networks (RNNs) have been widely used to model nonlinear dynamical systems in science and engineering. Recently, a new framework for RNNs, namely echo state networks (ESNs), was proposed by H. Jaeger et al. [1], [2]. ESNs (and the closely related liquid state machines, independently proposed by Maass et al. [3]) share some features that are characteristic of models of learning mechanisms in biological brains, and they exhibit superior performance when used as "black-box" time-series models. ESNs have drawn great interest from the research community and have been successfully applied to various tasks, e.g., chaotic time series prediction [4], communications channel equalization [1], dynamical pattern recognition [5], [6], and gene regulatory network modeling [7]. Various ESN schemes have been explored, including a small-world recurrent neural system with scale-free distribution [8], decoupled ESNs with lateral inhibition [9], and ESNs with uniformly distributed poles and adaptive bias [10]. Lukoševičius and Jaeger presented a comprehensive review of the theoretical results and applications of ESNs in [11].

The salient difference from traditional recurrent networks [12], [13] is that an ESN employs a large number of randomly connected neurons (usually on the order of 50 to 1000), namely the "reservoir". The neurons in the reservoir are driven


by the input signals, and the trainable output neurons combine the output of the excited reservoir state to generate task-specific temporal patterns. This new paradigm of recurrent neural networks is also known as "reservoir computing".

The working principle of an ESN derives from an important algebraic property of the reservoir, namely the echo state property. This property in essence states that the effect of both previous states and previous inputs on a future state should gradually vanish (i.e., neither persist nor become amplified) as time passes [2]. Jaeger presented both a necessary condition (under the assumption that the input space includes the zero sequence) and a sufficient condition for the echo state property [2]. Buehner and Young proposed a less restrictive sufficient condition based on minimizing the matrix operator D-norm over the set of diagonal matrices [14]. These papers did not consider the unique characteristic of the reservoir, i.e., that it is randomly generated, and thus the sufficient conditions in [2] and [14] are rather conservative.

The topology of the reservoir in ESNs has also been of great research interest, with the classical form being a randomly generated and sparsely connected network [1], [2]. Several attempts have been made to search for a better topology, specifically small-world, scale-free, and biologically inspired reservoir topologies. However, follow-up investigations have indicated that "none of the investigated network topologies was able to perform significantly better than simple random networks, both in terms of eigenvalue spread as well as testing error" [11].

In this brief paper, we first analytically examine the essential characteristics of random reservoirs. We apply recent results of random matrix theory to demonstrate the asymptotic distributions of eigenvalues and singular values of reservoir weight matrices. Then, we show that randomly generated reservoirs, either sparse or fully connected, either with Bernoulli or Gaussian connection weights (or, in fact, with weights distributed according to other density families), are all expected to behave similarly. Lastly and most importantly, we explicitly discuss the critical role of the random reservoir in achieving the echo state property and the gap between the scaling factor bounds for the necessary and sufficient conditions previously proposed for the echo state property. We show that when the spectral radius of the reservoir weight matrix is smaller than 1 (the necessary condition for the echo state property when the input space contains the zero sequence), the state transition mapping is contractive with high probability, given a sufficiently large reservoir. This result corroborates the observation in [2] that the necessary condition for the echo state property is often good enough in practice, i.e., good enough so that violations of the echo state property are not practically observed.

The remainder of this paper is organized as follows. In Section II, we revisit the ESN model, random reservoirs, and the echo state property. This is followed by detailed discussion in Section III of relevant results of random matrix theory, the properties of random reservoirs, and the gap between the sufficient and necessary conditions previously proposed for the echo state property. In Section IV, we prove that the necessary condition for the echo state property ensures that the state transition mapping is contractive with high probability.



We briefly conclude our work in Section V.

II. THE ECHO STATE NETWORK FORMULATION

A. Basic ESN Formulation

Fig. 1: Illustration of an echo state network.

A typical ESN is shown in Fig. 1. It can be mathematically represented by a state update equation and an output equation. As adopted by others [14], we consider ESNs without output feedback in this work. Thus, the activation of internal units is updated according to

    x(n + 1) = f(W x(n) + W_in u(n + 1)),    (1)

where x is an N × 1 vector of the reservoir state, W is the N × N reservoir weight matrix, W_in is the N × N_in input weight matrix, u is an N_in × 1 vector of system inputs, y is an N_out × 1 vector of system outputs, and f is the neuron activation function (usually a tanh sigmoid function), applied component-wise. For notational convenience, we denote the state transition equation by

    x(n + 1) = T(x(n), u(n + 1)) = f(W x(n) + W_in u(n + 1)),    (2)

and then calculate the output according to

    y(n) = g(W_out [x(n); u(n)]),    (3)

where W_out is the N_out × (N + N_in) output weight matrix, and g is usually a tanh sigmoid or identity function, applied component-wise.
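To make (1)–(3) concrete, the following minimal sketch runs the state update and a linear readout for a small, randomly generated ESN. It is illustrative only and is not the authors' implementation; the reservoir size, input dimension, scaling constant, and the use of NumPy are assumptions made here, and in practice the readout weights W_out would be learned by linear regression rather than drawn at random.

    import numpy as np

    rng = np.random.default_rng(0)
    N, N_in, N_out = 100, 1, 1                     # assumed sizes, for illustration only

    W = 0.9 * rng.standard_normal((N, N)) / np.sqrt(N)   # reservoir weights (scaling discussed in Secs. II-C and III)
    W_in = rng.uniform(-0.5, 0.5, (N, N_in))              # input weights
    W_out = rng.standard_normal((N_out, N + N_in))        # readout weights (normally trained, random here)

    def esn_step(x, u):
        # State update x(n+1) = f(W x(n) + W_in u(n+1)), eqs. (1)/(2), with f = tanh
        return np.tanh(W @ x + W_in @ u)

    def esn_output(x, u):
        # Readout y(n) = g(W_out [x(n); u(n)]), eq. (3), with g the identity
        return W_out @ np.concatenate([x, u])

    x = np.zeros(N)                                 # initial reservoir state
    for n in range(200):                            # drive the reservoir with a scalar input sequence
        u = np.array([np.sin(0.2 * n)])
        x = esn_step(x, u)
        y = esn_output(x, u)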

B. Random Reservoirs in ESNs

A salient feature that distinguishes ESNs from simple recurrent neural networks is the use of large, fixed random reservoirs. The classical topology of reservoirs in ESNs is a randomly generated and sparsely connected network [1]. It was thought that "this condition lets the reservoir decompose into many loosely coupled subsystems, establishing a richly structured reservoir of excitable dynamics" [1]. Nevertheless, this is not generally true, and it has in fact been reported that fully connected reservoirs work as well as sparsely connected ones [15]. This observation prompts inquiry into the essential characteristics of random reservoirs and their role in approximating nonlinear dynamical systems.

The types of random reservoirs are characterized by the structure of the reservoir weight matrix. Assume the matrix W = αW_N, where α is a properly chosen global scaling factor (its utility will be discussed later), and where the elements of the matrix W_N are random variables that are independent and identically distributed (i.i.d.). Here we consider the following three types of reservoir weight matrices.

Sparse random reservoir: This is the most common type of random reservoir in ESNs [1], [2]. The random variable w (which characterizes each element of W_N) in a sparse random reservoir follows the modified Bernoulli probability mass function

    Pr(w = 0) = 1 − c,   Pr(w = ±1) = c/2,    (4)

where Pr(·) denotes the probability of an event and c is "the connectivity" of the reservoir.

Fully-connected Gaussian random reservoir: The random variable w follows the standard normal distribution

    w ∼ N(0, 1).    (5)

Fully-connected Bernoulli random reservoir: The random variable w follows the Bernoulli distribution

    Pr(w = ±1) = 1/2.    (6)

These three types of reservoir weight matrices exhibit different network topologies, i.e., either sparsely connected or fully connected neurons in the reservoir, and different types of weights, i.e., either continuous-valued or discrete-valued. All three types have been used as random reservoirs in ESNs, and successfully applied.
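As a concrete illustration, here is one way the three reservoir weight matrices W_N in (4)–(6) might be generated. This is a sketch under assumed values of N and c (and assumes NumPy); it is not code from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    N, c = 500, 0.05                         # reservoir size and connectivity, chosen only for illustration

    # Sparse random reservoir, eq. (4): w = 0 with prob. 1-c, w = +/-1 with prob. c/2 each
    mask = rng.random((N, N)) < c
    signs = rng.choice([-1.0, 1.0], size=(N, N))
    W_sparse = mask * signs

    # Fully-connected Gaussian random reservoir, eq. (5): w ~ N(0, 1)
    W_gauss = rng.standard_normal((N, N))

    # Fully-connected Bernoulli random reservoir, eq. (6): w = +/-1 with prob. 1/2 each
    W_bern = rng.choice([-1.0, 1.0], size=(N, N))

    # Each W_N is then globally rescaled, W = alpha * W_N (see Secs. II-C and III).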

C. Definition of Echo State Property

In order to work properly, an echo state network should possess the echo state property, as defined in [2].

Definition 1 (Jaeger [2]). Assume standard compactness conditions, i.e., inputs drawn from a compact input space U and network states restricted to a compact set A. Assume that the network has no output feedback connections. Then, the network has echo states if the network state x(n) is uniquely determined by any left-infinite input sequence ū^{−∞}. More precisely, this means that for every input sequence · · · , u(n − 1), u(n) ∈ U^{−N}, for all state sequence pairs · · · , x(n − 1), x(n) ∈ A^{−N} and · · · , x′(n − 1), x′(n) ∈ A^{−N}, where x(k) = T(x(k − 1), u(k)), x′(k) = T(x′(k − 1), u(k)), and N is the set of natural numbers, it holds that x(n) = x′(n).

The definition of the echo state property implies that similar echo state sequences must represent similar input histories. The following theorem gives a sufficient condition (that the largest singular value of W is smaller than 1) and a necessary condition (that the spectral radius of W must be smaller than 1) for the network to hold the echo state property.

Theorem 1 (Jaeger [2]). Assume a sigmoid network, i.e., with f = tanh, applied component-wise. (a) Let the weight matrix W satisfy σ_max = Λ < 1, where σ_max is its largest singular value. Then d(T(x, u), T(x′, u)) < Λ d(x, x′) for


all inputs u ∈ U and for all states x, x′ ∈ [−1, 1]^N, where d(·, ·) is any valid distance metric. This implies that the echo state property holds. (b) Let the weight matrix have spectral radius |λ_max| > 1, where λ_max is the eigenvalue of W with the largest absolute value. Then the network has an asymptotically unstable null state. This implies that it does not satisfy the echo state property for an input space U containing 0 and admissible state space A = [−1, 1]^N.

As suggested in [2], a convenient strategy to obtain ESNs is to start with some weight matrix W_N and then select a global scaling factor α to define W = αW_N. Let σ_max(W_N) and |λ_max(W_N)| denote the largest singular value and the spectral radius of W_N, respectively. Then, according to [2], the sufficient condition is α ≤ σ_max^{−1}(W_N) and the necessary condition is α ≤ |λ_max(W_N)|^{−1} for the echo state property to hold. Furthermore, although the existence of the echo state property for α ∈ [σ_max^{−1}(W_N), |λ_max(W_N)|^{−1}] has not been theoretically proved, it has been observed that "one obtains echo states even when α is only marginally smaller than |λ_max(W_N)|^{−1}" and "the sufficient condition is very restrictive" [2].

Buehner and Young proposed a tighter bound for the echo state property. The main idea is to minimize the matrix operator D-norm over the set of diagonal matrices [14]. However, because the matrix D does not have full structure (and in fact is restricted to being diagonal), the sufficient condition derived in [14] is still in general conservative. Pertinent to the sequel, we observe that the derivations of the existing results on the sufficient condition have not taken into account the unique characteristic of an ESN, i.e., that the reservoir matrix is a random matrix.

III. RANDOM MATRIX THEORY AND RANDOM RESERVOIRS

In this section, we first introduce some recent results in random matrix theory, and then apply them in developing some relevant properties of random reservoirs in ESNs.

A. The Empirical Spectral Distribution of Random Matrices

Let

    μ_{W_N}(s, t) := (1/N) |{ i | 1 ≤ i ≤ N, Re(λ_i) ≤ s, Im(λ_i) ≤ t }|    (7)

be the empirical spectral distribution (ESD) of W_N's eigenvalues λ_i ∈ C, i = 1, ..., N, where | · | denotes the cardinality of the set and Re(·) and Im(·) are the real and imaginary parts of a complex number, respectively. A well-known conjecture is the circular law of random matrices, which states that asymptotically, as N gets large, the eigenvalues of a properly normalized matrix W_N are uniformly distributed on the unit disk in the complex plane. After many pioneering efforts toward proving the circular law for various scenarios, including sparse random matrices [16]–[20], the circular law was proved in full generality, in both weak and strong forms, quite recently [21].

Theorem 2 (Circular Law [21]). Let W_N be the N × N random matrix whose entries are i.i.d. complex random variables with mean 0 and variance 1. Define W = (1/√N) W_N. Then the ESD of W converges (in both the strong and weak senses) to the uniform distribution on the unit disk, as N → ∞.

Corollary 1. The ESDs of reservoir weight matrices W as defined in (4) with the scaling factor α = 1/√(cN), (5) with the scaling factor α = 1/√N, and (6) with the scaling factor α = 1/√N all have the same limit distribution; more specifically, they converge (in both the strong and weak senses) to the uniform distribution on the unit disk.

The circular law implies that when N is sufficiently large (as is typical for ESNs), the eigenvalues of W spread out evenly over the unit disk in the complex plane, independent of the specific distribution of w, as illustrated in Fig. 2. It is also important to note that when a sparse reservoir is used in ESNs, the connectivity c of the sparse reservoir weight matrix must satisfy the bound c > N^{−1+ε_1}, where ε_1 > 0 is a small positive constant, because otherwise, with non-negligible probability, the sparse reservoir weight matrix would lose its rank-efficiency as N gets large (Theorem 1.3 in [20]).

B. Singular Values of Random Matrices

Similarly, let σ_1, σ_2, ..., σ_N be the singular values of W. The empirical distribution of the squares of the singular values of W is defined by

    ν_W(t) := (1/N) |{ i | 1 ≤ i ≤ N, σ_i^2 ≤ t }|.    (8)

It has been shown that ν_W is governed by the Marchenko-Pastur law [22]–[24].

Theorem 3 (Marchenko-Pastur Law). Let W_N be the N × N random matrix whose entries are i.i.d. complex random variables with mean 0 and variance 1. Define W = (1/√N) W_N. Then the empirical distribution of the squares of the singular values of W, ν_W(t), converges (both in the sense of probability and in the almost sure sense) to

    (1/2π) ∫_0^{min(t,4)} √(4/x − 1) dx,

as N → +∞.

Remark: Supported by rigorous mathematical proofs, the circular and Marchenko-Pastur laws reveal an important, fundamental property of random matrices, i.e., that both the eigenvalues and the singular values of random reservoir weight matrices have unique limit distributions, independent of the distribution and connectivity of w, as N → ∞.

Fig. 2: The empirical eigenvalue distributions of three types of random matrices (N = 1000): (a) a sparse random matrix, (b) a Gaussian random matrix, and (c) a Bernoulli random matrix. In each panel, the eigenvalues are plotted in the complex plane (real part vs. imaginary part) and fill the unit disk.
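A quick numerical check of Theorem 2, Corollary 1, and Theorem 3 (an illustrative sketch with an arbitrarily chosen N and c, not the simulation code behind Fig. 2): after the normalization α = 1/√(cN), the eigenvalues of a sparse reservoir should roughly fill the unit disk, and its largest singular value should approach 2 as N grows.

    import numpy as np

    rng = np.random.default_rng(0)
    N, c = 1000, 0.1                         # assumed size and connectivity

    # Normalized sparse reservoir, eq. (4), with alpha = 1/sqrt(cN)
    W = (rng.random((N, N)) < c) * rng.choice([-1.0, 1.0], size=(N, N)) / np.sqrt(c * N)

    eigvals = np.linalg.eigvals(W)
    print("fraction of eigenvalues in the unit disk:", np.mean(np.abs(eigvals) <= 1.0))  # close to 1 (circular law)
    print("spectral radius:", np.abs(eigvals).max())                                     # close to 1
    print("largest singular value:", np.linalg.svd(W, compute_uv=False)[0])              # approaches 2 (Marchenko-Pastur edge)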

C. The Gap between the Sufficient and Necessary Conditions in [2]

As discussed in [2] and noted in Subsection II-C, the global rescaling factor α must be properly chosen to ensure the echo state property for W = αW_N. Specifically, when α ≤ |λ_max(W_N)|^{−1}, the system is stable, which serves as the necessary condition (assuming the input space contains the zero sequence); when α ≤ σ_max^{−1}(W_N), the echo state property is guaranteed, i.e., this serves as the sufficient condition. However, the sufficient condition α ≤ σ_max^{−1}(W_N) is considered conservative, leading to a suboptimal ESN design wherein the amount of memory in the dynamical system is compromised (the smaller α is, the shorter the system memory). In fact, it has been observed that "one obtains echo states even when α is only marginally smaller than |λ_max(W_N)|^{−1}" [2].

The discrepancy between the theoretical sufficient condition for the echo state property and the empirical observation that the necessary condition often works well in practice raises a natural question: how big is the gap between σ_max^{−1}(W_N) and |λ_max(W_N)|^{−1}? Let the ratio r = σ_max(W_N) / |λ_max(W_N)| quantify the gap between the sufficient and necessary condition bounds. It turns out that the gap is quite large. We have obtained the following result on the asymptotic value of r as N → ∞.

Theorem 4 (Gap between the Sufficient and Necessary Conditions). If the random reservoir weight matrix is generated according to (4), (5), or (6), then r → 2 almost surely, as N → +∞.

Proof: We first prove the cases of random reservoir weight matrices generated according to (5) and (6). From the Theorem in [25] (p. 1319), we have

    |λ_max((1/√N) W_N)| ≤ 1   almost surely, as N → +∞.    (9)

Then, combining (9) with the conclusion of the circular law, we have

    |λ_max((1/√N) W_N)| → 1   almost surely, as N → +∞.    (10)

Next, from Theorem 3.1 in [26], we have

    σ_max((1/√N) W_N) → 2   almost surely, as N → +∞.    (11)

Therefore, we have

    r = σ_max(W_N) / |λ_max(W_N)| = σ_max((1/√N) W_N) / |λ_max((1/√N) W_N)| → 2   almost surely, as N → +∞.    (12)

For the case of random reservoir weight matrices generated according to (4), if we replace 1/√N by 1/√(cN) in the above equations, it is straightforward to show the same conclusion stated in (12).

Fig. 3: A simulation study on σ_max(W), λ_max(W), and ‖W‖_D of Gaussian, Bernoulli, and sparse reservoirs, respectively, as N increases.

Figure 3 illustrates the asymptotic trend of σ_max(W), λ_max(W), and ‖W‖_D for Gaussian, Bernoulli, and sparse reservoir weight matrices as N increases. Each point in Figure 3 is the average of 20 independent simulations, and ‖W‖_D is calculated using MATLAB's µ-analysis toolbox, as suggested in [14]. First, we can see in Figure 3 that when N is large, Gaussian, Bernoulli, and sparse reservoirs all have similar respective values of σ_max(W), λ_max(W), and ‖W‖_D. Second, as N increases, σ_max(W) tends to 2 and λ_max(W) tends to 1. Thus, consistent with Theorem 4, the bound for the necessary condition is about twice the bound for the sufficient condition for an ESN to possess the echo state property as N gets large. Third, for the sufficient bound proposed in [14], ‖W‖_D is indeed tighter than σ_max(W) when N is small, for example for N = 20, but ‖W‖_D is very close to σ_max(W) when N is large.
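The trend reported in Theorem 4 and Figure 3 can be probed with a few lines of code. The sketch below uses Gaussian reservoirs, assumed sizes, and 20 repetitions per size, mirroring the averaging described above; it is illustrative only and omits the ‖W‖_D computation, for which the authors used MATLAB's µ-analysis toolbox.

    import numpy as np

    rng = np.random.default_rng(0)

    for N in [20, 100, 500, 1000]:
        ratios = []
        for _ in range(20):                        # average over 20 independent reservoirs
            W_N = rng.standard_normal((N, N))      # Gaussian reservoir, eq. (5)
            sigma_max = np.linalg.svd(W_N, compute_uv=False)[0]
            lam_max = np.abs(np.linalg.eigvals(W_N)).max()
            ratios.append(sigma_max / lam_max)     # r = sigma_max(W_N) / |lambda_max(W_N)|
        print(N, np.mean(ratios))                  # tends to 2 as N grows (Theorem 4)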

IV. WHY THE NECESSARY CONDITION FOR ECHO STATES IS OFTEN "SUFFICIENT IN PRACTICE"

To establish the sufficient condition for the echo state property, Jaeger in [2] and Buehner and Young in [14] showed that the distance between two states x(n) and x̃(n),



d(T(x(n), u(n + 1)), T(x̃(n), u(n + 1))) ≤ Λ d(x(n), x̃(n)) with Λ < 1, shrinks at every step, regardless of the input. This Lipschitz condition results in echo states.

In this section, we show that when the necessary condition for the echo state property is satisfied, the state transition mapping T(·, ·) is contractive with high probability, regardless of the input. More specifically, for x(n), x̃(n) ∈ [−1, 1]^N and a random reservoir weight matrix W, the inequality d(T(x(n), u(n + 1)), T(x̃(n), u(n + 1))) ≤ d(x(n), x̃(n)) holds with probability 1 − O(e^{−C_ρ(ε) N}), where the constant C_ρ(ε) depends on the spectral radius of W.

A key ingredient for establishing this result is the concentration of measure phenomenon [27]: when projecting the state vector z onto the properly normalized random reservoir weight matrix W, the ℓ2 norm of Wz is approximately equal to the ℓ2 norm of z, when N is sufficiently large. Let W_N = (w_ij)_{N×N}, W = αW_N, and z = [z_1, z_2, ..., z_N]^T. Suppose W_N follows (4), (5), or (6), with α set to 1/√(cN) under (4) or 1/√N under (5) and (6). We have

    Wz = [ Σ_j α w_1j z_j,  Σ_j α w_2j z_j,  · · · ,  Σ_j α w_Nj z_j ]^T.    (13)

For the i-th element, we have

    E[ Σ_j α w_ij z_j ] = Σ_j α E[w_ij] z_j = 0,    (14)

    Var[ Σ_j α w_ij z_j ] = E[ Σ_j Σ_k α^2 w_ij z_j w_ik z_k ] = (1/N) Σ_j z_j^2,    (15)

where E[·] denotes expectation and Var[·] denotes variance. Thus, we have

    E[ ‖Wz‖^2 ] = E[ Σ_i ( Σ_j α w_ij z_j )^2 ] = (1/N) Σ_i Σ_j z_j^2 = ‖z‖^2,    (16)

where ‖·‖ denotes the ℓ2 norm; i.e., the expected squared length of Wz is the same as the squared length of z. Now we need to investigate how the distribution of ‖Wz‖ concentrates around ‖z‖. We first develop the following lemma.

Lemma 1. Assume the random matrix W follows (4), (5), or (6), with the scaling factor set to 1/√(cN) or 1/√N, as appropriate. Let ẑ ∈ R^N be a unit vector; then ‖Wẑ‖ converges to 1 in probability, as N → ∞.

Proof: Applying Lemma 4 and Lemma 5 in [28], we have, for a small positive constant ε_2 > 0,

    Pr( ‖Wẑ‖^2 ≥ 1 + ε_2 ) < exp(−(N/2)(ε_2^2/2 − ε_2^3/2)),    (17)

    Pr( ‖Wẑ‖^2 ≤ 1 − ε_2 ) < exp(−(N/2)(ε_2^2/2 − ε_2^3/2)).    (18)

Then, for 0 < ε_3 < 1,

    Pr( |‖Wẑ‖ − 1| ≥ ε_3 ) = Pr( ‖Wẑ‖ ≤ 1 − ε_3 ) + Pr( ‖Wẑ‖ ≥ 1 + ε_3 )
                          = Pr( ‖Wẑ‖^2 ≤ (1 − ε_3)^2 ) + Pr( ‖Wẑ‖^2 ≥ (1 + ε_3)^2 )
                          < Pr( ‖Wẑ‖^2 ≤ 1 − ε_3 ) + Pr( ‖Wẑ‖^2 ≥ 1 + ε_3 )
                          < 2 exp(−(N/2)(ε_3^2/2 − ε_3^3/2)).    (19)

Therefore, as N → +∞, Pr( |‖Wẑ‖ − 1| ≥ ε_3 ) → 0.
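Lemma 1 can also be checked empirically. The sketch below (illustrative only; the sizes and number of trials are assumptions) draws Gaussian reservoirs scaled by α = 1/√N and random unit vectors ẑ, and shows that ‖Wẑ‖ concentrates around 1 as N grows.

    import numpy as np

    rng = np.random.default_rng(0)

    for N in [50, 200, 1000]:
        norms = []
        for _ in range(50):
            W = rng.standard_normal((N, N)) / np.sqrt(N)   # Gaussian reservoir with alpha = 1/sqrt(N)
            z_hat = rng.standard_normal(N)
            z_hat /= np.linalg.norm(z_hat)                  # random unit vector
            norms.append(np.linalg.norm(W @ z_hat))
        print(N, np.mean(norms), np.std(norms))             # mean near 1, spread shrinking with N (Lemma 1)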

Theorem 5. Assume the network defined in (2) and (3) with neuron activation function f = tanh, applied component-wise. Suppose that x(n), x̃(n) ∈ [−1, 1]^N and that W is a random reservoir weight matrix defined according to (4), (5), or (6), with W appropriately scaled to have spectral radius |λ_max| = ρ ≤ 1 − ε. Then,

    Pr( ‖x(n + 1) − x̃(n + 1)‖ ≤ ‖x(n) − x̃(n)‖ ) > 1 − exp(−(N/2)(ε^2/2 − ε^3/2)),    (20)

where x(n + 1) = T(x(n), u(n + 1)) and x̃(n + 1) = T(x̃(n), u(n + 1)).

Proof: Let z(n) = x(n) − x̃(n). We start by writing

    ‖z(n + 1)‖ = ‖T(x(n), u(n + 1)) − T(x̃(n), u(n + 1))‖
               = ‖f(Wx(n) + W_in u(n + 1)) − f(Wx̃(n) + W_in u(n + 1))‖
               ≤ ‖(Wx(n) + W_in u(n + 1)) − (Wx̃(n) + W_in u(n + 1))‖
               = ‖Wx(n) − Wx̃(n)‖
               = ‖W(x(n) − x̃(n))‖
               = ‖Wz(n)‖.    (21)

Let ẑ(n) = z(n)/‖z(n)‖. Then rewrite (21) as

    ‖z(n + 1)‖ ≤ ‖Wz(n)‖ = ‖Wẑ(n)‖ · ‖z(n)‖.

We have W = αW_N, with W_N generated according to (4), (5), or (6). Let α be ρ/√(cN), ρ/√N, or ρ/√N, respectively, where 0 < ρ ≤ 1 − ε and ε > 0. From the circular law and the main theorem in [25], we know that the spectral radius of W converges to ρ, as N → ∞. Applying Lemma 1, we have

    ‖z(n + 1)‖ ≤ ‖Wẑ(n)‖ · ‖z(n)‖ = ρ ‖(1/ρ) Wẑ(n)‖ · ‖z(n)‖ → ρ ‖z(n)‖   in probability, as N → ∞.

Further, let us characterize the probability that ‖z(n + 1)‖ ≥ ‖z(n)‖, i.e., that the contractive property is not satisfied, when


N is finite. We have

    Pr( ‖z(n + 1)‖ ≥ ‖z(n)‖ ) ≤ Pr( ‖Wz(n)‖ ≥ ‖z(n)‖ )
                              = Pr( ‖Wẑ(n)‖ ≥ 1 )
                              = Pr( ‖(1/ρ) Wẑ(n)‖ ≥ 1/ρ )
                              ≤ Pr( ‖(1/ρ) Wẑ(n)‖ ≥ 1/(1 − ε) )
                              < Pr( ‖(1/ρ) Wẑ(n)‖ ≥ 1 + ε )    (since 1 + ε < 1/(1 − ε))
                              < exp(−(N/2)(ε^2/2 − ε^3/2)).

We thus see that the probability that ‖z(n + 1)‖ > ‖z(n)‖ is exponentially decreasing with N. Moreover, since ρ ≤ 1 − ε, ‖z(n + 1)‖ ≤ ‖z(n)‖ with probability 1 − O(e^{−C_ρ(ε) N}), where C_ρ(ε) = (1/2)(ε^2/2 − ε^3/2).

Theorem 5 shows that when ρ < 1, for x(n), x̃(n) ∈ [−1, 1]^N and a random reservoir weight matrix W, T(·, ·) is contractive with probability 1 − O(e^{−C_ρ(ε) N}). This result supports previous observations in echo state network research: "extensive experience with this scaling game indicates that one obtains echo states when α is only marginally smaller than α_max" [2] (α_max = |λ_max(W_N)|^{−1}).

As a final cautionary statement, we note that while we have shown that there is a contractive property with high probability for large N, Theorem 5 is not definitive on whether the strict echo state property given in Definition 1 holds with high probability for large N. This remains an open question.
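The contraction behavior asserted by Theorem 5 is easy to observe in simulation. The sketch below (illustrative only; the reservoir size, ρ, and the input signal are assumptions made here) scales a Gaussian reservoir to spectral radius ρ < 1, drives two different initial states with the same input sequence, and prints the distance between the two trajectories, which should shrink at every step with overwhelming probability.

    import numpy as np

    rng = np.random.default_rng(0)
    N, rho = 500, 0.95

    W_N = rng.standard_normal((N, N))
    W = rho * W_N / np.abs(np.linalg.eigvals(W_N)).max()   # scale the reservoir to spectral radius rho < 1
    W_in = rng.uniform(-0.5, 0.5, (N, 1))

    x = rng.uniform(-1, 1, N)                               # two different initial states
    x_tilde = rng.uniform(-1, 1, N)
    for n in range(50):
        u = np.array([np.sin(0.2 * n)])                     # same input for both trajectories
        x = np.tanh(W @ x + W_in @ u)
        x_tilde = np.tanh(W @ x_tilde + W_in @ u)
        # ||x(n+1) - x_tilde(n+1)|| <= ||x(n) - x_tilde(n)|| with high probability (Theorem 5)
        print(n, np.linalg.norm(x - x_tilde))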

V. CONCLUSIONS

In this paper, we applied random matrix theory to examine the properties of the random reservoirs used by ESNs, including different reservoir topologies (sparse or fully connected) and different connection weights (Bernoulli or Gaussian). The asymptotic uniform distribution of the eigenvalues of the reservoir weight matrix ensures diverse dynamical patterns of the reservoir states. Moreover, this phenomenon does not depend on the topology of the reservoir or the distribution of the connection weights. We showed that the bound for the necessary condition in [2] is about twice the bound for the sufficient condition in [2] for an ESN to possess the echo state property. Finally, we showed that when the spectral radius of the reservoir weight matrix ρ < 1, for x(n), x̃(n) ∈ [−1, 1]^N the state transition mapping T(·, ·) is contractive with high probability, which explains why the necessary condition has been found to be "sufficient in practice".

ACKNOWLEDGMENT

The authors would like to thank William T. Baumann for his valuable suggestions. This work is supported in part by the National Institutes of Health under Grants CA149147 and NS029525.

REFERENCES

[1] H. Jaeger and H. Haas, "Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication," Science, vol. 304, pp. 78–80, April 2004.
[2] H. Jaeger, "The 'echo state' approach to analysing and training recurrent neural networks," German National Research Center for Information Technology, Tech. Rep. 148, 2001.
[3] W. Maass, T. Natschläger, and H. Markram, "Real-time computing without stable states: A new framework for neural computation based on perturbations," Neural Computation, vol. 14, no. 11, pp. 2531–2560, 2002.
[4] Z. Shi and M. Han, "Support vector echo-state machine for chaotic time-series prediction," IEEE Transactions on Neural Networks, vol. 18, no. 2, pp. 359–372, 2007.
[5] M. C. Ozturk and J. C. Príncipe, "An associative memory readout for ESNs with applications to dynamical pattern recognition," Neural Networks, vol. 20, no. 3, pp. 377–390, 2007.
[6] M. D. Skowronski and J. G. Harris, "Automatic speech recognition using a predictive echo state network classifier," Neural Networks, vol. 20, no. 3, pp. 414–423, 2007.
[7] B. Zhang and Y. Wang, "Echo state networks with decoupled reservoir states," in Proc. IEEE Workshop on Machine Learning for Signal Processing, pp. 444–449, 2008.
[8] Z. Deng and Y. Zhang, "Collective behavior of a small-world recurrent neural system with scale-free distribution," IEEE Transactions on Neural Networks, vol. 18, no. 5, pp. 1364–1375, 2007.
[9] Y. Xue, L. Yang, and S. Haykin, "Decoupled echo state networks with lateral inhibition," Neural Networks, vol. 20, no. 3, pp. 365–376, 2007.
[10] M. C. Ozturk, D. Xu, and J. C. Príncipe, "Analysis and design of echo state networks," Neural Computation, vol. 19, no. 1, pp. 111–138, 2007.
[11] M. Lukoševičius and H. Jaeger, "Reservoir computing approaches to recurrent neural network training," Computer Science Review, vol. 3, no. 3, pp. 127–149, 2009.
[12] J. L. Elman, "Finding structure in time," Cognitive Science, vol. 14, no. 2, pp. 179–211, 1990.
[13] M. I. Jordan, "Attractor dynamics and parallelism in a connectionist sequential machine," IEEE Computer Society Neural Networks Technology Series, pp. 112–127, 1990.
[14] M. Buehner and P. Young, "A tighter bound for the echo state property," IEEE Transactions on Neural Networks, vol. 17, no. 3, pp. 820–824, 2006.
[15] H. Jaeger, "Echo state network," Scholarpedia, no. 9, 2007.
[16] M. Mehta, Random Matrices and the Statistical Theory of Energy Levels. New York: Academic Press, 1967.
[17] A. Edelman, "The probability that a random real Gaussian matrix has k real eigenvalues, related distributions, and the circular law," Journal of Multivariate Analysis, vol. 60, no. 2, pp. 203–232, 1997.
[18] V. L. Girko, "Circular law," Theory of Probability and Its Applications, vol. 29, pp. 694–706, 1984.
[19] Z. D. Bai, "Circular law," Annals of Probability, vol. 25, no. 1, pp. 494–529, 1997.
[20] T. Tao and V. Vu, "Random matrices: the circular law," Communications in Contemporary Mathematics, vol. 10, no. 2, pp. 261–307, 2008.
[21] T. Tao, V. Vu, and M. Krishnapur, "Random matrices: Universality of ESDs and the circular law," Annals of Probability, vol. 38, no. 5, pp. 2023–2065, 2010.
[22] T. Tao and V. Vu, "Random matrices: the distribution of the smallest singular values," Geometric and Functional Analysis, vol. 19, 2010.
[23] V. A. Marchenko and L. A. Pastur, "Distribution of eigenvalues for some sets of random matrices," Mathematics of the USSR-Sbornik, vol. 1, no. 4, p. 457, 1967.
[24] Y. Q. Yin, "Limiting spectral distribution for a class of random matrices," Journal of Multivariate Analysis, vol. 20, no. 1, pp. 50–68, 1986.
[25] S. Geman, "The spectral radius of large random matrices," Annals of Probability, vol. 14, no. 4, pp. 1318–1328, 1986.
[26] Y. Q. Yin, Z. D. Bai, and P. R. Krishnaiah, "On the limit of the largest eigenvalue of the large dimensional sample covariance matrix," Probability Theory and Related Fields, vol. 78, no. 4, pp. 509–521, 1988.
[27] M. Ledoux, The Concentration of Measure Phenomenon. American Mathematical Society, 2001.
[28] D. Achlioptas, "Database-friendly random projections," in Proc. 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '01), New York, NY, USA, pp. 274–281, ACM, 2001.
