A Systolic Solution for Computing the Symmetrizer of a Hessenberg Matrix

Anil B. Biradar
Department of Electrical and Electronics Engg.
M.S. Ramaiah Institute of Technology
Bangalore, India
[email protected]
Abstract—Nonsymmetric matrix eigenvalue problems, which are computationally more expensive than symmetric matrix problems, arise in many signal-processing and dynamic-system control applications. A symmetrizer is used to convert a nonsymmetric eigenvalue problem into an equivalent symmetric problem with the same eigenvalues. Computing such problems in real time on extant von Neumann machines is impractically slow. In this paper, we present a systolic array to compute a symmetrizer of a lower Hessenberg matrix that has better time complexity and uses fewer processing elements (PEs) than previous methods, i.e., the Leiserson, Double Pipe and Fitted Diagonal methods.
I. INTRODUCTION
High-performance special-purpose VLSI computer systems are typically utilized to meet specific application requirements or to offload computations that are especially burdensome to general-purpose computers. A systolic array is a specialized computer architecture which uses massive pipelining and exploits the regularity inherent in certain large classes of problems (like DSP) to achieve high performance with low I/O requirements. There is a broad range of engineering problems that require a tremendous amount of computation. In cases where these computations have to be performed in real time (like missile guidance and control of dynamic systems), a conventional machine is hopelessly inadequate. A major portion of the computational needs of signal processing and applied mathematics can, in fact, be reduced to a basic set of matrix operations in which nonsymmetric matrices arise [1]. A nonsymmetric matrix problem is computationally more expensive than a symmetric matrix problem [2]. Computing the eigenvalues of a symmetric matrix, real or
Harish Gopala
School of Computer Science
Carleton University
Ottawa, Canada
[email protected]
complex, is easier than computing the eigenvalues of a nonsymmetric matrix [2]. Further, no stable method is yet available for obtaining the eigenvalues of a nonsymmetric matrix, particularly closely spaced eigenvalues [3]. A symmetrizer of a matrix A is a matrix X satisfying the equations XA = AᵗX and X = Xᵗ. Symmetrizers are used to transform a nonsymmetric eigenvalue problem into a symmetric one, which is relatively easy to solve. They are useful in many engineering problems, specifically stability problems in control theory and the study of general matrices [4].

Let A = [aij] be the given n × n lower Hessenberg matrix with nonzero codiagonals. Let x1, x2, …, xn be the rows of the symmetrizer X to be computed.

Step 1. Choose xn ≠ 0 (to ensure nonsingularity of X) arbitrarily. In practice, xn may be chosen as (c 0 0 … 0) where c ≠ 0.

Step 2. Compute xn-1, xn-2, …, x1 recursively from

    xi = (xi+1A - Σ(p=i+1 to n) ap,i+1xp) / ai,i+1,   i = n-1, n-2, …, 1    (1)

that is, xi = (xi+1A - ai+1,i+1xi+1 - ai+2,i+1xi+2 - … - an,i+1xn) / ai,i+1.

Step 3. Print x1, x2, x3, …, xn and stop. [5]

It can be seen that the symmetrizer is not unique: if we choose a different xn, we get a different X. We apply a new systolic method for symmetrizing a lower Hessenberg matrix using the above algorithm and compare it with the previous methods, viz. the Leiserson, Double Pipe and Fitted Diagonal methods.
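For concreteness, the three steps above can be sketched in NumPy (an illustrative model of the recursion, not the systolic implementation; the sample matrix and the choice c = 1 are assumptions made for the example):

```python
import numpy as np

def symmetrizer(A, c=1.0):
    """Compute a symmetrizer X of a lower Hessenberg matrix A
    (nonzero codiagonals), following Steps 1-3 above (0-based rows)."""
    n = A.shape[0]
    X = np.zeros((n, n))
    X[n - 1, 0] = c                            # Step 1: x_n = (c 0 ... 0)
    for i in range(n - 2, -1, -1):             # Step 2: i = n-1, ..., 1
        s = sum(A[p, i + 1] * X[p] for p in range(i + 1, n))
        X[i] = (X[i + 1] @ A - s) / A[i, i + 1]
    return X                                   # Step 3

# An illustrative 4x4 lower Hessenberg matrix with nonzero codiagonals
A = np.array([[4., 1., 0., 0.],
              [2., 3., 1., 0.],
              [1., 2., 5., 1.],
              [3., 1., 2., 6.]])
X = symmetrizer(A)
assert np.allclose(X, X.T)          # X = X^t
assert np.allclose(X @ A, A.T @ X)  # XA = A^t X
```

Note that only the superdiagonal entries a(i,i+1) appear as divisors, which is why the algorithm requires the codiagonals to be nonzero.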
II. SINGLE ASSIGNMENT CODE
A single assignment code is a form in which every variable is assigned only one value during the execution of the algorithm [6]. The single assignment code for symmetrizing a lower Hessenberg matrix is as follows.

for i := 1 to n do
  for j := 1 to n do
    read A[i,j];
for k := 1 to n do
  read X[n,k];
for i := n-1 downto 1 do
begin
  for j := 1 to n do
  begin
    Y[i,j,1] := 0.0;
    Z[i,j,1] := 0.0
  end;
  for j := 1 to n do
  begin
    for k := 1 to n do
      Y[i,j,k+1] := Y[i,j,k] + A[k,j] × X[i+1,k];
    for k := 1, p := i+1 to k := n-i, p := n do
      Z[i,j,k+1] := Z[i,j,k] + A[p,i+1] × X[p,j];
    X[i,j] := (Y[i,j,n+1] - Z[i,j,n-i+1]) / A[i,i+1];
    write X[i,j];
  end
end

Here k is the recursion index. The implementation of this single assignment algorithm on our method, using a 4 × 4 matrix, is depicted in fig.1.

III. ARCHITECTURE OF THE SYSTEM
Fig.1 shows the architecture of the system used to implement the single assignment code. It consists of three modules:
• Module I consists of processing elements (PEs) 1-4 and is controlled by the systolic array processor handler SAPH(Y).
• Module II consists of PEs 5-7 and is controlled by SAPH(Z).
• Module III consists of PE8 and is controlled by SAPH(X).
A SAPH is a micro-programmable processor that interfaces with the PEs [7]. We use two types of PEs, whose architecture is shown in fig.2a and fig.2b. PEs 1-7 are of type I and PE8 is of type II. The type I PE is a Multiply-Add (MA) PE. In the MA PE, registers R1 and R2 store the value broadcast from the SAPH (A[k,j] and X[i+1,k] in PEs 1-4; A[p,i+1] and X[p,j] in PEs 5-7) and R3 stores the result from the previous PE. The computation performed here is output = [R3] + [R1] × [R2]. The type II PE is a Subtract-Divide (SD) PE. In the SD PE, registers R1 and R2 store the results computed along the two parallel paths (Y[i,j,n+1] and Z[i,j,n-i+1]) and R3 stores the relevant matrix element (A[i,i+1]), as described in the single assignment code. The computation performed in this PE is output = ([R1] - [R2]) / [R3]. The number of PEs of type I required is 2n-1 (for a generalized n × n Hessenberg matrix).
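The behaviour of the two cell types follows directly from these register equations; a minimal functional sketch (register names follow the text, the numeric operands are arbitrary):

```python
def ma_pe(r1, r2, r3):
    """Type I (Multiply-Add) PE: R1 and R2 hold the broadcast
    operands, R3 the partial result from the previous PE."""
    return r3 + r1 * r2

def sd_pe(r1, r2, r3):
    """Type II (Subtract-Divide) PE: R1 = Y[i,j,n+1],
    R2 = Z[i,j,n-i+1], R3 = A[i,i+1]."""
    return (r1 - r2) / r3

# One multiply-accumulate step followed by the final subtract-divide:
y = ma_pe(2.0, 3.0, 1.0)   # 1 + 2*3 = 7.0
x = sd_pe(y, 3.0, 2.0)     # (7 - 3)/2 = 2.0
```

Chaining ma_pe through PEs 1-4 (and 5-7) accumulates the two inner products, which sd_pe (PE8) then combines into one symmetrizer element.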
Fig.1 Schematic diagram of the architecture (interface processor and memory; SAPH(Y), SAPH(Z) and SAPH(X) handlers; PEs 1-8; data streams y1-y4 and z0-z4; connection to system bus/host)
Fig.2a Architecture of type I PE (registers R1, R2, R3; multiply and add units)
Fig.2b Architecture of type II PE (registers R1, R2, R3; subtract and divide units)
In the problem chosen, the dimension of the Hessenberg matrix is 4 × 4; hence the number of PEs of type I is 7. Of these 7 PEs, 4 (n) are used for computing Y and the remaining 3 (n-1) are used for computing Z. Only one PE of type II is required for computing the symmetrizer of a Hessenberg matrix of any dimension (n × n). Thus, the total number of PEs required for the system is 2n.

III. ALLOCATION OF THE ELEMENTS
The allocation (broadcasting) of the elements to the different PEs is controlled by the micro-programmable processor SAPH, which interfaces with them. Fig.3 shows the allocation of the rows of the 4 × 4 Hessenberg matrix to PEs 1-4. These matrix elements are broadcast by the SAPH according to the single assignment code. The time delay for broadcasting an element is achieved by using a zero (0) to meet the requirement of the computation. Fig.4 shows the pumping of the row vector xi,k and the matrix A into the type I and type II cells. The last row of matrix A is positioned appropriately to be pumped into PE1, taking into consideration the time-space relationships from the single assignment code. The previous row of A is fed into PE2, and so on. The generalization to an n × n matrix is immediate.
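The zero-padding delay can be illustrated with a small helper (a hypothetical sketch; it assumes, as is typical for systolic skewing, one extra zero of delay for each successive PE):

```python
def skewed_stream(values, delay, total_len):
    """Prepend `delay` zeros so the operands reach their PE on the
    right clock cycle; pad the tail with zeros up to total_len."""
    stream = [0.0] * delay + list(values)
    return stream + [0.0] * (total_len - len(stream))

# Illustrative: the stream for the (k+1)-th PE is delayed by k cycles
rows = [[4.0, 1.0, 0.0],
        [2.0, 3.0, 1.0],
        [1.0, 2.0, 5.0]]
streams = [skewed_stream(r, k, 6) for k, r in enumerate(rows)]
# e.g. streams[1] == [0.0, 2.0, 3.0, 1.0, 0.0, 0.0]
```

The zeros are harmless because a multiply-add with a zero operand leaves the accumulated partial result unchanged.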
Fig.3 Allocation of rows of the Hessenberg matrix to PEs (skewed streams of aij and xij values, padded with zeros, feeding PEs 1-4)

Fig.4 Allocation of the elements to the PEs (data streams into PEs 1-8 for computing a row of the symmetrizer)
IV. WORKING
Fig.4 shows the pumping of the elements for the computation of the third (i.e., (n-1)-th) row of the symmetrizer. PEs 1-4 compute xi+1 × A (Y[i,j,n+1]) and PEs 5-7 compute Σ(p=i+1 to n) ap,i+1xp, i = n-1, n-2, …, 1. PE8 does the subtract and divide operation as required by the single assignment code. Since the computations xi+1 × A (Y[i,j,n+1]) and Σ(p=i+1 to n) ap,i+1xp (Z[i,j,n-i+1]) are independent of each other, we perform them in parallel (the first computation through PEs 1-4 and the second through PEs 5-7). The relevant matrix elements required for the computation are preloaded into the registers of the cells through the contacts by the SAPH. In the first module, the fourth element of row vector x4 of the symmetrizer is fed into PE1 along with the elements of the fourth row of A. Then the third element of x4 is fed into PE2 along with the elements of the third row of A, and so on. The second computation, of Z[i,j,n-i+1], is done in parallel in the second module, during which the fourth element of a4 (ap,i+1) is fed into PE5 along with the elements of the fourth row of the symmetrizer matrix X (xp), as depicted in fig.4. While computing the third row of the symmetrizer, PE6 and PE7 are not used (i.e., there is no change in the value of Z[i,j,k] due to computation in these PEs), and while computing the second row, PE7 is not used.

Fig.5 shows the data flow during the first 2n-1 (7) clock cycles in the system. The cells represent the PEs as shown in the architecture of the system. After the first n+1 clock cycles, the result X[i,1] is out; the next clock cycle gives us X[i,2], and so on, until all the elements of the symmetrizer row are computed. It may be observed from fig.5 that at the end of n+1 (5) clock cycles the elements of the input matrix A are exhausted. This does not mean that there are no computations in the remaining two clock cycles, i.e., the sixth and seventh. As soon as the elements of matrix A are exhausted, the same elements are fed again for the computation of the next row of the symmetrizer. Thus the elements of matrix A are broadcast continuously, in a rotating fashion, by the SAPH along with the corresponding elements of the symmetrizer matrix, as described in the single assignment code.

Fig.5 Data flow during the first 2n-1 (7) clock cycles (per-cycle multiply-add updates such as y1 = y1 + a41x44 in PEs 1-4 and z-updates such as z4 = z4 + a44x44 in PEs 5-7)
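The two parallel paths and the final subtract-divide can be modelled functionally (a sketch of one outer iteration of the single assignment code, not a cycle-accurate simulation of the array; the 4 × 4 matrix is an illustrative assumption):

```python
import numpy as np

def compute_row(A, X, i):
    """One outer iteration: row i of the symmetrizer from rows
    i+1..n-1 of X (0-based), mirroring the two parallel paths."""
    n = A.shape[0]
    Y = X[i + 1] @ A                                      # Module I (PEs 1-4)
    Z = sum(A[p, i + 1] * X[p] for p in range(i + 1, n))  # Module II (PEs 5-7)
    return (Y - Z) / A[i, i + 1]                          # Module III (PE8)

A = np.array([[4., 1., 0., 0.],   # illustrative lower Hessenberg matrix
              [2., 3., 1., 0.],
              [1., 2., 5., 1.],
              [3., 1., 2., 6.]])
X = np.zeros((4, 4))
X[3] = [1.0, 0.0, 0.0, 0.0]       # arbitrarily chosen last row
for i in (2, 1, 0):
    X[i] = compute_row(A, X, i)
assert np.allclose(X, X.T)        # the rows assemble a symmetrizer
```

In the array, Y and Z for one row are produced concurrently by modules I and II, so each outer iteration costs only the pipeline latency rather than two sequential inner products.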
V. COMPARISON
The comparison of our method with the previously proposed methods, in terms of number of PEs and time complexity, is shown in Table I. The previous methods use three types of processing elements, and the data is pumped diagonal-wise. The Leiserson method has tag bits interleaved between the data [8]. The Double Pipe method minimizes the tag bits between the data but uses a delay cell and an add cell [8]. The Fitted Diagonal method uses fewer processing elements, but with increased complexity [8, 9]. Our method uses only two types of processing elements, and the data is pumped row-wise. We have eliminated the tag bits interleaved between the data pumped into the array. Thus our method decreases the hardware complexity and takes less time to compute a symmetrizer than the previous methods.

TABLE I
COMPARISON
Method            No. of PEs          Time complexity
Leiserson         2n+1                4n+1
Double pipe       w = 2n+3            w + (n+1)/2
Fitted Diagonal   q = (n+1)/2 + n     q + 2n
Our method        2n                  (n+1)+3
Table I compares the number of processing elements used and the time complexity to compute a row of a symmetrizer.
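Taking the table's expressions at face value (and reading (n+1)/2 as integer division, an assumption on our part), the PE counts can be compared for a sample size:

```python
def pe_counts(n):
    """Processing-element counts per method, from Table I."""
    return {
        "Leiserson": 2 * n + 1,
        "Double pipe": 2 * n + 3,             # w
        "Fitted Diagonal": (n + 1) // 2 + n,  # q, integer division assumed
        "Our method": 2 * n,
    }

print(pe_counts(4))   # n = 4: 9, 11, 6 and 8 PEs respectively
```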
VI. CONCLUSION
A symmetrizer is of great interest because it can reduce the complexity of computation enormously. A symmetrizer for any matrix can be computed recursively by choosing the last row of the symmetrizer arbitrarily. Such an error-free symmetrizer will produce a more accurate equivalent symmetric matrix [2] than an approximate one does. Using the algorithm [5] described above, we have developed a novel systolic array that generates a symmetrizer using fewer PEs and with better time complexity than previous methods. The proposed systolic array requires no interleaved tag bits (dummy elements). Such a systolization radically reduces the complexity of computing a symmetrizer that produces an accurate equivalent symmetric matrix.

REFERENCES
[1] S.Y. Kung, K.S. Arun, Ron J. Galezer and D.V. Bhaskar Rao, "Wavefront Array Processor: Language, Architecture, and Applications", IEEE Transactions on Computers, Vol. C-31, No. 11, Nov. 1982, pp. 1054-1066.
[2] S.K. Sen, V.Ch. Venkaiah, "On Computing an Equivalent Symmetric Matrix for a Nonsymmetric Matrix", Department of Applied Mathematics, Indian Institute of Science, Bangalore, 560 012, India, pp. 169-180.
[3] S.K. Sen, E.V. Krishnamurthy, "Numerical Algorithms: Computations in Science and Engineering", Affiliated East-West Press, New Delhi, 1988.
[4] V.Ch. Venkaiah, S.K. Sen, "Error-free Matrix Symmetrizers and Equivalent Symmetric Matrices", Acta Applicandae Mathematicae, Vol. 21, pp. 291-313, 1990.
[5] B.N. Datta, "An Algorithm for Computing a Symmetrizer of a Hessenberg Matrix", unpublished.
[6] S.Y. Kung, "VLSI Array Processors", Prentice Hall, Englewood Cliffs, New Jersey, 1988.
[7] P.M. Dew, "VLSI Architectures for Problems in Numerical Computation", in D.J. Paddon (ed.), Supercomputers and Parallel Computation, New Series No. 1, The Institute of Mathematics and its Applications Series, 1984.
[8] S.K. Sen and F.R.K. Kumar, "Symmetrizing a Hessenberg Matrix: Designs for VLSI Parallel Processor Arrays", Proc. Indian Acad. Sci. (Math. Sci.), Vol. 105, No. 1, pp. 59-77, Feb. 1995.
[9] R. Suros and E. Montagne, "Optimizing Systolic Networks by Fitting Diagonals", Parallel Computing, pp. 167-174, 1987.
[10] Dan I. Moldovan, "Parallel Processing: From Applications to Systems", Morgan Kaufmann Publishers, pp. 218-220, 1993.
[11] Hosagrahar V. Jagadish, Sailesh K. Rao, Thomas Kailath, "Array Architectures for Iterative Algorithms", Proceedings of the IEEE, Vol. 75, No. 9, Sep. 1987, pp. 1304.
[12] Satish Kumar B.S., Harish G., "Symmetrizing Hessenberg Matrix Systolically without Interleaved Tag Bits and a Processing Cell", VLSI 2001, All India Seminar on Recent Trends in VLSI, Indian Institute of Technology (IIT), Roorkee, Uttar Pradesh, India, pp. 121-125.
[13] Siang W. Song, "Systolic Algorithms: Concepts, Synthesis and Evolution".