A Systolic Solution for Computing the Symmetrizer of a Hessenberg Matrix

Anil B. Biradar
Department of Electrical and Electronics Engg.
M.S. Ramaiah Institute of Technology
Bangalore, India
[email protected]
Abstract—Nonsymmetric matrix eigenvalue problems, which are computationally more expensive than symmetric matrix problems, arise in many signal-processing and dynamic-system control applications. A symmetrizer is used to convert a nonsymmetric eigenvalue problem into an equivalent symmetric problem with the same eigenvalues. Computing such problems in real time on extant von Neumann machines is impractically slow. In this paper, we present a systolic array to compute a symmetrizer of a lower Hessenberg matrix that has better time complexity and uses fewer processing elements (PEs) than previous methods, i.e., the Leiserson, Double Pipe and Fitted Diagonal methods.
I. INTRODUCTION
High-performance special-purpose VLSI computer systems are typically utilized to meet specific application requirements or to offload computations that are especially burdensome to general-purpose computers. A systolic array is a specialized computer architecture which uses massive pipelining and exploits the regularity inherent in certain large classes of problems (like DSP) to achieve high performance with low I/O requirements. There is a broad range of engineering problems that require a tremendous amount of computation. In cases where these computations have to be performed in real time (like missile guidance and control of dynamic systems), a conventional machine is hopelessly inadequate. A major portion of the computational needs of signal processing and applied mathematics can, in fact, be reduced to a basic set of matrix operations in which nonsymmetric matrices arise [1]. A nonsymmetric matrix problem is computationally more expensive than a symmetric matrix problem [2]. Computing the eigenvalues of a symmetric matrix, real or
Harish Gopala
School of Computer Science
Carleton University
Ottawa, Canada
[email protected]
complex, is easier than computing the eigenvalues of a nonsymmetric matrix [2]. Further, no stable method is yet available for obtaining the eigenvalues of a nonsymmetric matrix, particularly closely spaced eigenvalues [3]. A symmetrizer of a matrix A is a matrix X satisfying the equations XA = AᵗX and X = Xᵗ. Symmetrizers are used to transform a nonsymmetric eigenvalue problem into a symmetric one, which is relatively easy to solve. They are useful in many engineering problems, specifically stability problems in control theory and the study of general matrices [4].

Let A = [aij] be the given n × n lower Hessenberg matrix with nonzero codiagonals. Let x1, x2, …, xn be the rows of the symmetrizer X to be computed.

Step 1. Choose xn ≠ 0 (to ensure nonsingularity of X) arbitrarily. In practice, xn may be chosen as (c 0 0 … 0) where c ≠ 0.

Step 2. Compute xn-1, xn-2, …, x1 recursively from

    xi = (xi+1A - Σ(p=i+1 to n) ap,i+1xp) / ai,i+1,   i = n-1, n-2, …, 1    (1)

that is, xi = (xi+1A - ai+1,i+1xi+1 - ai+2,i+1xi+2 - … - an,i+1xn) / ai,i+1.

Step 3. Print x1, x2, x3, …, xn and stop. [5]

It can be seen that the symmetrizer is not unique: if we choose a different xn, we get a different X. We apply a new systolic method for symmetrizing a lower Hessenberg matrix using the above algorithm and compare it with the previous methods, viz. the Leiserson, Double Pipe and Fitted Diagonal methods.
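For concreteness, the three steps above can be sketched in NumPy (an illustrative model of the recursion, not the systolic implementation; the sample matrix and the choice c = 1 are assumptions made for the example):

```python
import numpy as np

def symmetrizer(A, c=1.0):
    """Compute a symmetrizer X of a lower Hessenberg matrix A
    (nonzero codiagonals), following Steps 1-3 above (0-based rows)."""
    n = A.shape[0]
    X = np.zeros((n, n))
    X[n - 1, 0] = c                            # Step 1: x_n = (c 0 ... 0)
    for i in range(n - 2, -1, -1):             # Step 2: i = n-1, ..., 1
        s = sum(A[p, i + 1] * X[p] for p in range(i + 1, n))
        X[i] = (X[i + 1] @ A - s) / A[i, i + 1]
    return X                                   # Step 3

# An illustrative 4x4 lower Hessenberg matrix with nonzero codiagonals
A = np.array([[4., 1., 0., 0.],
              [2., 3., 1., 0.],
              [1., 2., 5., 1.],
              [3., 1., 2., 6.]])
X = symmetrizer(A)
assert np.allclose(X, X.T)          # X = X^t
assert np.allclose(X @ A, A.T @ X)  # XA = A^t X
```

Note that only the superdiagonal entries a(i,i+1) appear as divisors, which is why the algorithm requires the codiagonals to be nonzero.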
II. SINGLE ASSIGNMENT CODE
A single assignment code is a form in which every variable is assigned only one value during the execution of the algorithm [6]. The single assignment code for symmetrizing a lower Hessenberg matrix is as follows.

for i := 1 to n do
  for j := 1 to n do
    read A[i,j];
for k := 1 to n do
  read X[n,k];
for i := n-1 downto 1 do
begin
  for j := 1 to n do
  begin
    Y[i,j,1] := 0.0;
    Z[i,j,1] := 0.0
  end;
  for j := 1 to n do
  begin
    for k := 1 to n do
      Y[i,j,k+1] := Y[i,j,k] + A[k,j] × X[i+1,k];
    for k := 1, p := i+1 to k := n-i, p := n do
      Z[i,j,k+1] := Z[i,j,k] + A[p,i+1] × X[p,j];
    X[i,j] := (Y[i,j,n+1] - Z[i,j,n-i+1]) / A[i,i+1];
    write X[i,j];
  end
end

Here k is the recursion index. The implementation of this single assignment algorithm on our method, using a 4 × 4 matrix, is depicted in fig.1.

III. ARCHITECTURE OF THE SYSTEM
Fig.1 shows the architecture of the system used to implement the single assignment code. It consists of three modules:
• Module I consists of processing elements (PEs) 1-4 and is controlled by the systolic array processor handler SAPH(Y).
• Module II consists of PEs 5-7 and is controlled by SAPH(Z).
• Module III consists of PE8 and is controlled by SAPH(X).
A SAPH is a micro-programmable processor that interfaces with the PEs [7]. We use two types of PEs, whose architecture is shown in fig.2a and fig.2b. PEs 1-7 are of type I and PE8 is of type II. The type I PE is a Multiply-Add (MA) PE. In the MA PE, registers R1 and R2 store the value broadcast from the SAPH (A[k,j] and X[i+1,k] in PEs 1-4; A[p,i+1] and X[p,j] in PEs 5-7) and R3 stores the result from the previous PE. The computation performed here is output = [R3] + [R1] × [R2]. The type II PE is a Subtract-Divide (SD) PE. In the SD PE, registers R1 and R2 store the results computed along the two parallel paths (Y[i,j,n+1] and Z[i,j,n-i+1]) and R3 stores the relevant matrix element (A[i,i+1]), as described in the single assignment code. The computation performed in this PE is output = ([R1] - [R2]) / [R3]. The number of PEs of type I required is 2n-1 (for a generalized n × n Hessenberg matrix).
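The behaviour of the two cell types follows directly from these register equations; a minimal functional sketch (register names follow the text, the numeric operands are arbitrary):

```python
def ma_pe(r1, r2, r3):
    """Type I (Multiply-Add) PE: R1 and R2 hold the broadcast
    operands, R3 the partial result from the previous PE."""
    return r3 + r1 * r2

def sd_pe(r1, r2, r3):
    """Type II (Subtract-Divide) PE: R1 = Y[i,j,n+1],
    R2 = Z[i,j,n-i+1], R3 = A[i,i+1]."""
    return (r1 - r2) / r3

# One multiply-accumulate step followed by the final subtract-divide:
y = ma_pe(2.0, 3.0, 1.0)   # 1 + 2*3 = 7.0
x = sd_pe(y, 3.0, 2.0)     # (7 - 3)/2 = 2.0
```

Chaining ma_pe through PEs 1-4 (and 5-7) accumulates the two inner products, which sd_pe (PE8) then combines into one symmetrizer element.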
Fig.1 Schematic diagram of the architecture (interface processor and memory; SAPH(Y), SAPH(Z) and SAPH(X) handlers; PEs 1-8; data streams y1-y4 and z0-z4; connection to system bus/host)
Fig.2a Architecture of type I PE (registers R1, R2, R3; multiply and add units)
Fig.2b Architecture of type II PE (registers R1, R2, R3; subtract and divide units)
In the problem chosen, the dimension of the Hessenberg matrix is 4 × 4; hence the number of PEs of type I is 7. Of these 7 PEs, 4 (n) are used for computing Y and the remaining 3 (n-1) are used for computing Z. Only one PE of type II is required for computing the symmetrizer of a Hessenberg matrix of any dimension (n × n). Thus, the total number of PEs required for the system is 2n.

III. ALLOCATION OF THE ELEMENTS
The allocation (broadcasting) of the elements to the different PEs is controlled by the micro-programmable processor SAPH, which interfaces with them. Fig.3 shows the allocation of the rows of the 4 × 4 Hessenberg matrix to PEs 1-4. These matrix elements are broadcast by the SAPH according to the single assignment code. The time delay for broadcasting an element is achieved by using a zero (0) to meet the requirement of the computation. Fig.4 shows the pumping of the row vector xi,k and the matrix A into the type I and type II cells. The last row of matrix A is positioned appropriately to be pumped into PE1, taking into consideration the time-space relationships from the single assignment code. The previous row of A is fed into PE2, and so on. The generalization to an n × n matrix is immediate.
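The zero-padding delay can be illustrated with a small helper (a hypothetical sketch; it assumes, as is typical for systolic skewing, one extra zero of delay for each successive PE):

```python
def skewed_stream(values, delay, total_len):
    """Prepend `delay` zeros so the operands reach their PE on the
    right clock cycle; pad the tail with zeros up to total_len."""
    stream = [0.0] * delay + list(values)
    return stream + [0.0] * (total_len - len(stream))

# Illustrative: the stream for the (k+1)-th PE is delayed by k cycles
rows = [[4.0, 1.0, 0.0],
        [2.0, 3.0, 1.0],
        [1.0, 2.0, 5.0]]
streams = [skewed_stream(r, k, 6) for k, r in enumerate(rows)]
# e.g. streams[1] == [0.0, 2.0, 3.0, 1.0, 0.0, 0.0]
```

The zeros are harmless because a multiply-add with a zero operand leaves the accumulated partial result unchanged.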
Fig.3 Allocation of rows of the Hessenberg matrix to PEs (skewed streams of aij and xij values, padded with zeros, feeding PEs 1-4)

Fig.4 Allocation of the elements to the PEs (data streams into PEs 1-8 for computing a row of the symmetrizer)
IV. WORKING
Fig.4 shows the pumping of the elements for the computation of the third (i.e., (n-1)-th) row of the symmetrizer. PEs 1-4 compute xi+1 × A (Y[i,j,n+1]) and PEs 5-7 compute Σ(p=i+1 to n) ap,i+1xp, i = n-1, n-2, …, 1. PE8 does the subtract and divide operation as required by the single assignment code. Since the computations xi+1 × A (Y[i,j,n+1]) and Σ(p=i+1 to n) ap,i+1xp (Z[i,j,n-i+1]) are independent of each other, we perform them in parallel (the first computation through PEs 1-4 and the second through PEs 5-7). The relevant matrix elements required for the computation are preloaded into the registers of the cells through the contacts by the SAPH. In the first module, the fourth element of row vector x4 of the symmetrizer is fed into PE1 along with the elements of the fourth row of A. Then the third element of x4 is fed into PE2 along with the elements of the third row of A, and so on. The second computation, of Z[i,j,n-i+1], is done in parallel in the second module, during which the fourth element of a4 (ap,i+1) is fed into PE5 along with the elements of the fourth row of the symmetrizer matrix X (xp), as depicted in fig.4. While computing the third row of the symmetrizer, PE6 and PE7 are not used (i.e., there is no change in the value of Z[i,j,k] due to computation in these PEs), and while computing the second row, PE7 is not used.

Fig.5 shows the data flow during the first 2n-1 (7) clock cycles in the system. The cells represent the PEs as shown in the architecture of the system. After the first n+1 clock cycles, the result X[i,1] is out; the next clock cycle gives us X[i,2], and so on, until all the elements of the symmetrizer row are computed. It may be observed from fig.5 that at the end of n+1 (5) clock cycles the elements of the input matrix A are exhausted. This does not mean that there are no computations in the remaining two clock cycles, i.e., the sixth and seventh. As soon as the elements of matrix A are exhausted, the same elements are fed again for the computation of the next row of the symmetrizer. Thus the elements of matrix A are broadcast continuously, in a rotating fashion, by the SAPH along with the corresponding elements of the symmetrizer matrix, as described in the single assignment code.

Fig.5 Data flow during the first 2n-1 (7) clock cycles (per-cycle multiply-add updates such as y1 = y1 + a41x44 in PEs 1-4 and z-updates such as z4 = z4 + a44x44 in PEs 5-7)
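The two parallel paths and the final subtract-divide can be modelled functionally (a sketch of one outer iteration of the single assignment code, not a cycle-accurate simulation of the array; the 4 × 4 matrix is an illustrative assumption):

```python
import numpy as np

def compute_row(A, X, i):
    """One outer iteration: row i of the symmetrizer from rows
    i+1..n-1 of X (0-based), mirroring the two parallel paths."""
    n = A.shape[0]
    Y = X[i + 1] @ A                                      # Module I (PEs 1-4)
    Z = sum(A[p, i + 1] * X[p] for p in range(i + 1, n))  # Module II (PEs 5-7)
    return (Y - Z) / A[i, i + 1]                          # Module III (PE8)

A = np.array([[4., 1., 0., 0.],   # illustrative lower Hessenberg matrix
              [2., 3., 1., 0.],
              [1., 2., 5., 1.],
              [3., 1., 2., 6.]])
X = np.zeros((4, 4))
X[3] = [1.0, 0.0, 0.0, 0.0]       # arbitrarily chosen last row
for i in (2, 1, 0):
    X[i] = compute_row(A, X, i)
assert np.allclose(X, X.T)        # the rows assemble a symmetrizer
```

In the array, Y and Z for one row are produced concurrently by modules I and II, so each outer iteration costs only the pipeline latency rather than two sequential inner products.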
V. COMPARISON
The comparison of our method with the previously proposed methods, in terms of number of PEs and time complexity, is shown in Table I. The previous methods use three types of processing elements, and the data is pumped diagonal-wise. The Leiserson method has tag bits interleaved between the data [8]. The Double Pipe method minimizes the tag bits between the data but uses a delay cell and an add cell [8]. The Fitted Diagonal method uses fewer processing elements, but with increased complexity [8, 9]. Our method uses only two types of processing elements, and the data is pumped row-wise. We have eliminated the tag bits interleaved between the data pumped into the array. Thus our method decreases the hardware complexity and takes less time to compute a symmetrizer than the previous methods.

TABLE I
COMPARISON
Method            No. of PEs          Time complexity
Leiserson         2n+1                4n+1
Double pipe       w = 2n+3            w + (n+1)/2
Fitted Diagonal   q = (n+1)/2 + n     q + 2n
Our method        2n                  (n+1)+3
Table I compares the number of processing elements used and the time complexity to compute a row of a symmetrizer.
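Taking the table's expressions at face value (and reading (n+1)/2 as integer division, an assumption on our part), the PE counts can be compared for a sample size:

```python
def pe_counts(n):
    """Processing-element counts per method, from Table I."""
    return {
        "Leiserson": 2 * n + 1,
        "Double pipe": 2 * n + 3,             # w
        "Fitted Diagonal": (n + 1) // 2 + n,  # q, integer division assumed
        "Our method": 2 * n,
    }

print(pe_counts(4))   # n = 4: 9, 11, 6 and 8 PEs respectively
```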
VI. CONCLUSION
A symmetrizer is of great interest because it can reduce the complexity of computation enormously. A symmetrizer for any matrix can be computed recursively by choosing the last row of the symmetrizer arbitrarily. Such an error-free symmetrizer will produce a more accurate equivalent symmetric matrix [2] than an approximate one does. Using the algorithm [5] described above, we have developed a novel systolic array that generates a symmetrizer using fewer PEs and with better time complexity than previous methods. The proposed systolic array requires no interleaved tag bits (dummy elements). Such a systolization radically reduces the complexity of computing a symmetrizer that produces an accurate equivalent symmetric matrix.

REFERENCES
[1] S.Y. Kung, K.S. Arun, Ron J. Galezer and D.V. Bhaskar Rao, "Wavefront Array Processor: Language, Architecture, and Applications", IEEE Transactions on Computers, Vol. C-31, No. 11, Nov. 1982, pp. 1054-1066.
[2] S.K. Sen, V.Ch. Venkaiah, "On Computing an Equivalent Symmetric Matrix for a Nonsymmetric Matrix", Department of Applied Mathematics, Indian Institute of Science, Bangalore, 560 012, India, pp. 169-180.
[3] S.K. Sen, E.V. Krishnamurthy, "Numerical Algorithms: Computations in Science and Engineering", Affiliated East-West Press, New Delhi, 1988.
[4] V.Ch. Venkaiah, S.K. Sen, "Error-free Matrix Symmetrizers and Equivalent Symmetric Matrices", Acta Applicandae Mathematicae, Vol. 21, pp. 291-313, 1990.
[5] B.N. Datta, "An Algorithm for Computing a Symmetrizer of a Hessenberg Matrix", unpublished.
[6] S.Y. Kung, "VLSI Array Processors", Prentice Hall, Englewood Cliffs, New Jersey, 1988.
[7] P.M. Dew, "VLSI Architectures for Problems in Numerical Computation", in D.J. Paddon (ed.), Supercomputers and Parallel Computation, New Series No. 1, The Institute of Mathematics and its Applications Series, 1984.
[8] S.K. Sen and F.R.K. Kumar, "Symmetrizing a Hessenberg Matrix: Designs for VLSI Parallel Processor Arrays", Proc. Indian Acad. Sci. (Math. Sci.), Vol. 105, No. 1, pp. 59-77, Feb. 1995.
[9] R. Suros and E. Montagne, "Optimizing Systolic Networks by Fitting Diagonals", Parallel Computing, pp. 167-174, 1987.
[10] Dan I. Moldovan, "Parallel Processing: From Applications to Systems", Morgan Kaufmann Publishers, pp. 218-220, 1993.
[11] Hosagrahar V. Jagadish, Sailesh K. Rao, Thomas Kailath, "Array Architectures for Iterative Algorithms", Proceedings of the IEEE, Vol. 75, No. 9, Sep. 1987, pp. 1304.
[12] Satish Kumar B.S., Harish G., "Symmetrizing Hessenberg Matrix Systolically without Interleaved Tag Bits and a Processing Cell", VLSI 2001, All India Seminar on Recent Trends in VLSI, Indian Institute of Technology (IIT), Roorkee, Uttar Pradesh, India, pp. 121-125.
[13] Siang W. Song, "Systolic Algorithms: Concepts, Synthesis and Evolution".