A Parallel Implementation of the Conjugate Gradient Method on the Meiko CS-2

Antonio d'Acierno, Giuseppe De Pietro, Antonio Giordano
IRSIP - CNR, via P. Castellino 111, 80131 Napoli (Italy)
Abstract
In this paper we describe a parallel implementation of a preconditioned conjugate gradient algorithm on the Meiko CS-2. The parallel implementation is based on the use of a global reduction operator; this allows us to obtain a simple and efficient code whose performance is easily and precisely predictable. We describe and validate a performance model, and we show and comment on experimental results.
1 Introduction

Solving large sparse systems of linear equations is an integral part of mathematical computing and finds application in a variety of fields such as fluid dynamics, structural analysis, atmospheric modeling and so on. Iterative methods such as the conjugate gradient method for solving such systems are becoming increasingly appealing as they can be easily parallelised [1, 2, 3, 4]. In this paper we describe a parallel implementation of a Preconditioned Conjugate Gradient (PCG) algorithm on the Meiko CS-2. The implementation is based on the use of reduce operations, which are well supported by the CS-2 hardware. The paper is organised as follows. In section 2 the PCG algorithm is briefly described and the sequential code is sketched; section 3 describes the CS-2 hardware and introduces the reduce operation supported by the hw/sw system used. In section 4 we describe the parallel implementation while, in section 5, we propose and validate a performance model and we show and comment on experimental results; conclusions are the concern of section 6.
2 The PCG Algorithm

The PCG algorithm is used for solving systems of linear equations of the form:

$A\vec{x} = \vec{b}$    (1)
where A is an $N \times N$ (symmetric positive definite) sparse matrix. Let $K_i$ be the number of non-zero elements in the $i$-th column and let $K_{max} = \max_i K_i$, $i = 0, \ldots, N-1$. In our implementation the matrix A is stored using two $K_{max} \times N$ matrices, AC and ROWS, where AC is a compressed version of the matrix A and ROWS contains the row indices of the elements of A stored in AC, so that:
AC[i][j] = A[ROWS[i][j]][j]    (2)

The matrix-vector product of a compressed matrix by a vector can be described in the following way:

void mv_prod (int n, int K_MAX, double *result,
              double **matr, int **rows, double *vect)
{
  int i, j;

  for (i = 0; i < n; i++)
    result[i] = 0.0;
  /* each stored element matr[i][j] = A[rows[i][j]][j] contributes
     to the component rows[i][j] of the result */
  for (j = 0; j < n; j++)
    for (i = 0; i < K_MAX; i++)
      result[rows[i][j]] += matr[i][j] * vect[j];
}
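As a concrete illustration of the layout defined by (2) (a small self-contained example of ours, not taken from the paper), a 4x4 tridiagonal matrix with $K_{max} = 3$ can be packed into AC and ROWS; the product $A\vec{v}$ computed from the dense matrix and from the compressed pair gives the same result:

#include <stdio.h>

#define EX_N    4
#define EX_KMAX 3

int main (void)
{
  /* dense 4x4 tridiagonal test matrix */
  double A[EX_N][EX_N] = {{ 2, -1,  0,  0},
                          {-1,  2, -1,  0},
                          { 0, -1,  2, -1},
                          { 0,  0, -1,  2}};
  double AC[EX_KMAX][EX_N];
  int    ROWS[EX_KMAX][EX_N];
  double v[EX_N] = {1, 2, 3, 4};
  double y_dense[EX_N] = {0}, y_comp[EX_N] = {0};
  int i, j, r;

  /* pack: slot i of column j holds A[j-1+i][j]; slots falling
     outside the matrix are padded with a zero value */
  for (j = 0; j < EX_N; j++)
    for (i = 0; i < EX_KMAX; i++) {
      r = j - 1 + i;
      if (r < 0 || r >= EX_N) { ROWS[i][j] = j; AC[i][j] = 0.0; }
      else                    { ROWS[i][j] = r; AC[i][j] = A[r][j]; }
    }

  /* y = A v from the dense matrix ... */
  for (i = 0; i < EX_N; i++)
    for (j = 0; j < EX_N; j++)
      y_dense[i] += A[i][j] * v[j];

  /* ... and from the compressed pair, as in relation (2) */
  for (j = 0; j < EX_N; j++)
    for (i = 0; i < EX_KMAX; i++)
      y_comp[ROWS[i][j]] += AC[i][j] * v[j];

  for (i = 0; i < EX_N; i++)            /* the two columns coincide */
    printf("%g  %g\n", y_dense[i], y_comp[i]);
  return 0;
}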
For the sake of completeness we report the code for the inner product of two vectors:

double vv_prod (int n, double *v1, double *v2)
{
  int i;
  double result;

  result = 0;
  for (i = 0; i < n; i++)
    result += v1[i] * v2[i];
  return result;
}
In this paper we consider a diagonal PCG algorithm; the preconditioner M is then a vector, simply obtained (in our code) in the following way:

M[i] = A[i][i]    (3)
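Since A is available only in the compressed form (2), obtaining the diagonal requires scanning each column for the entry whose row index equals the column index. The routine below is a minimal sketch of this step (the paper omits the corresponding code; the name init_precond is ours):

void init_precond (int n, int K_MAX, double *m,
                   double **matr, int **rows)
{
  int i, j;

  for (j = 0; j < n; j++) {
    m[j] = 0.0;
    /* accumulate the entries of column j stored with row index j:
       the diagonal element appears once, padded slots contribute 0 */
    for (i = 0; i < K_MAX; i++)
      if (rows[i][j] == j)
        m[j] += matr[i][j];
  }
}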
So, the PCG algorithm can be described as follows:

#define N
#define K_MAX
main (int argc, char *argv[])
{
  double a[K_MAX][N];      /* a is the compressed matrix A */
  double x[N], b[N], m[N], s[N], q[N], d[N], r[N];
  int j, rows[K_MAX][N];
  double d_new, d_old, alpha, beta, eps;

  /* code for the input of a, rows, b and eps omitted */
  /* code for initialising m omitted */
  /* initialisation of x, r and d omitted */

  d_new = vv_prod(N, r, d);
  d_old = d_new;
  while (d_new > (eps * d_old)) {
    mv_prod(N, K_MAX, q, a, rows, d);     /* q = A d */
    alpha = vv_prod(N, d, q);
    alpha = d_new / alpha;
    /* updates of x, r, s, d_new, beta and d omitted */
  }
}
As shown in the code, the PCG algorithm performs a few basic operations in each iteration:

1. matrix-vector multiplications;
2. vector-vector inner products;
3. scalar-vector multiplications;
4. vector-vector additions.

Operations belonging to classes 2, 3 and 4 clearly have time complexity O(N); the complexity of a matrix-vector product, obviously, is linear in the number of non-zero elements of the matrix. In our experiments
[Figure 1: Sequential implementation: theoretical time vs. measured time (in seconds) as N varies (one iteration); measured points against $(3.24\, N\sqrt{N} + 0.08\, N)\cdot 10^{-6}$.]

we used a general form of sparse matrices, i.e. the non-zero elements of A occur in such a way that it is possible to restrict them within a band around the principal diagonal. In order to deal with just one parameter, we fixed the width of the band to $K_{max} = 1 + 2\sqrt{N}$, so that the number of non-zero elements is about $2 N\sqrt{N}$. Let us call $T^{seq}(N)$ the time employed to perform an iteration in the sequential case; we have that:
$T^{seq} = k_1 N\sqrt{N} + k_2 N$    (4)

We experimentally obtained that, when $T^{seq}$ is in microseconds, $k_1 = 3.24$ and $k_2 = 0.08$. Figure 1 shows the difference between the theoretical model and the experimental results. By substituting $k_1$ and $k_2$ in (4) we have:

$T^{seq} = k_1 N\sqrt{N} + k_2 N \simeq 3.24\, N\sqrt{N}$    (5)
[Figure 2: The architecture of a Meiko CS-2 with 16 processors (8x8 crosspoint switches, processors 0-15).]
3 The Meiko CS-2

The Meiko CS-2 [5] is a MIMD parallel computer able to provide both scalar and vector processing nodes, well suited for both scientific computation and database applications according to the particular configuration chosen. The key point of the Meiko machine is the communication structure, which is composed of the communication network and of the communication processors. The CS-2 communication network is a logarithmic network constructed from 8-way crosspoint switches and bidirectional links; it can be considered to be a Benes network folded about its centre line. Bandwidth is constant at each stage of the network, and the number of links out of each stage grows with the number of processors. A 16 processor network is illustrated in figure 2. The multi-stage network used in the CS-2 can also be viewed as a "fat tree"; in this type of network, packets are routed back down at the first possible node, so that the bandwidth at higher levels of the tree can be reduced. Although the CS-2 network permits this local packet routing, its bandwidth is not reduced at the higher levels. In order to improve the performance of parallel applications, every processing element in a CS-2 machine has its own communication processor, a dedicated interface to the communication network. This processor supports remote read, write and synchronization operations specified by virtual processor number and virtual address. Latency hiding is supported by non-blocking instructions, instruction sequences and completion tests. Another important aspect of the communication structure is the availability of a library of network functions which allows the user to efficiently solve some communication problems without using low-level routines. For example, very simple and efficient routines are provided for message broadcasting, for DMA access to any node's memory, etc. Particularly interesting is the gsum function: this function performs a reduce operation, and the result is broadcast to all processors. In other words, suppose we have P processes and P processors, and that each process has an array a of N elements. When all processes call gsum on the array, the corresponding elements of the arrays are summed, and each sum is placed in the corresponding position of each array. As will be shown later, this function has been widely used in the implementation of the PCG algorithm.
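The semantics just described correspond to an all-reduce. As an illustration only (expressed with MPI rather than the Meiko library, whose exact calling sequence is not reproduced here), the effect of gsum on an array a of n doubles can be written as:

#include <mpi.h>

/* illustration: sum the corresponding elements of every process's
   copy of a and leave the result in each copy (reduce-and-broadcast);
   the CS-2 code uses the native gsum/gdsum routines instead */
void gsum_like (double *a, int n)
{
  MPI_Allreduce(MPI_IN_PLACE, a, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}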
4 The Parallel Implementation

The basic operation to be parallelised is the matrix-vector product, and there are two main strategies for such an operation: in the first the matrix is split row-wise, while in the second one it is split column-wise. Regarding the row-wise mapping we have that, for each product and considering dense matrices, an all-to-all broadcast is required. If unstructured sparse matrices are considered, just point-to-point communications are required, but both the communication pattern and the number of communications are strictly dependent on the matrix characteristics, so that the code has to be designed and implemented for each class of matrices and hardware. In some special cases, such as when matrices derived from a two-dimensional discretization are considered, it is possible to show [1] that only nearest-neighbour communications are required; this feature really improves performance if the underlying hardware is based on a direct communication network. The column-wise method, on the other hand, allows one to develop a general-purpose code that does not depend on the communication network used (indirect, mesh, ring, hypercube and so on); moreover, the code and the communication overhead do not depend on the type of sparsity of the matrix. Other features that make such a strategy really interesting are: (i) the resulting parallel code is obtained from the sequential one with simple modifications, (ii) the performance is highly predictable and (iii) it is possible to derive special-purpose hardware that makes the communication overhead really negligible. In the parallel implementation we developed, the compressed matrix AC is therefore split column-wise among processors. Let P be the number of processors and let us suppose, for the sake of simplicity, that P exactly divides N ($N_L = N/P$); processor j (j = 0, 1, ..., P-1) holds columns $j N_L, j N_L + 1, \ldots, N_L(j+1) - 1$. By using the gdsum function described in section 3, a parallel version of the matrix-vector product can be easily coded as follows:

void par_mv_prod (int N, int NL, int K_MAX, double *res,
                  double **matr, int **rows, double *vect)
{
  int i, j;

  for (i = 0; i < N; i++)
    res[i] = 0.0;
  /* contribution of the NL columns held by this processor */
  for (j = 0; j < NL; j++)
    for (i = 0; i < K_MAX; i++)
      res[rows[i][j]] += matr[i][j] * vect[j];
  /* reduce-and-broadcast of the N partial sums over all processors
     (the exact gdsum calling sequence is library-specific) */
  gdsum(res, N);
}
In the same way, it is possible to code the parallel version of the vv_prod routine:

double par_vv_prod (int NL, double *v1, double *v2)
{
  int i;
  double result[1];     /* one-element array, as required by gdsum */

  result[0] = 0;
  for (i = 0; i < NL; i++)
    result[0] += v1[i] * v2[i];
  /* global sum of the local partial products */
  gdsum(result, 1);
  return result[0];
}
The parallel code trivially becomes:

#define N
#define K_MAX
#define P
#define NL N/P
main (int argc, char *argv[])
{
  double a[K_MAX][NL];     /* local columns of the compressed matrix */
  double x[NL], b[NL], m[NL], s[NL], q[NL], d[NL], r[NL];
  int j, rows[K_MAX][NL];
  double temp[N];
  double d_new, d_old, alpha, beta, eps;
  int name;

  /* initialize the parallel environment */
  ew_baseInit();
  name = (int) ew_state.vp;     /* this is the name of the process */

  /* code for the input of a, rows, b and eps omitted */
  /* code for initialising m omitted */
  /* initialisation of x, r and d omitted */

  d_new = par_vv_prod(NL, r, d);
  d_old = d_new;
  while (d_new > (eps * d_old)) {
    par_mv_prod(N, NL, K_MAX, temp, a, rows, d);
    for (j = 0; j < NL; j++)
      q[j] = temp[name * NL + j];   /* extract the local NL-vector q */
    /* extraction of the local vector r and the remaining updates of
       x, r, s, d_new, beta and d omitted */
  }
}
Let us call $T^{par}(N, P)$ the time to perform an iteration with a problem of dimension N on P processors; since two matrix-vector products and two vector-vector products are performed, and by neglecting the time to extract the $N_L$-vectors $\vec{q}$ and $\vec{r}$ from the N-vector temp, we have that:

$T^{par} = T^{seq}/P + 2\, T^g(N, P) + 2\, T^g(1, P)$    (6)

where $T^g(N, P)$ is the time to perform a gdsum on a vector of dimension N using P processors. It should be noted that formula (6) is just a first-order analysis representing an upper bound for the actual $T^{par}$; indeed, measured times are heavily affected by the availability in the cache memory of the data to be used. In a second-order analysis, the term related to $T^{seq}$ should be multiplied by a factor smaller than one; as a consequence, a super speed-up can be obtained, as will be shown in section 5.
5 Experimental Results

First, let us analyse the time complexity of the gsum function: we observed a linear dependence both on the dimension N of the vector (see figure 3) and on the number P of processors (see figure 4).
[Figure 3: The time (in seconds) for a gsum operation as N varies, with P as parameter (P = 2, 4, 6, 8, 10).]

This behaviour can be represented in the following form:
$T^g(N, P) = k_3 N + k_4 P + k_5 N P + k_6$    (7)

When $T^g$ is in microseconds we found, using a least squares error analysis on 25 experimental points:

$k_3 = 0.48$    (8)
$k_4 = -46$    (9)
$k_5 = 0.37$    (10)
$k_6 = 327$    (11)
From equations (7) and (5) it should be clear that we can neglect $T^g(1, P)$ in equation (6), since for our machine $P_{max} = 10$; we then obtain:

$T^{par} = T^{seq}/P + 2\,(k_3 N + k_4 P + k_5 N P + k_6)$    (12)
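To make the use of the model concrete, the small program below (ours, not part of the paper; the helper names are hypothetical) evaluates equations (5), (7) and (12) with the fitted constants, expressed in microseconds, and prints the predicted iteration time and speed-up:

#include <math.h>
#include <stdio.h>

/* fitted constants of equations (5) and (8)-(11), in microseconds */
#define K1 3.24
#define K2 0.08
#define K3 0.48
#define K4 (-46.0)
#define K5 0.37
#define K6 327.0

double t_seq (double n)              /* equation (5): sequential iteration */
{ return K1 * n * sqrt(n) + K2 * n; }

double t_gsum (double n, double p)   /* equation (7): gsum on an n-vector  */
{ return K3 * n + K4 * p + K5 * n * p + K6; }

double t_par (double n, double p)    /* equation (12): parallel iteration  */
{ return t_seq(n) / p + 2.0 * t_gsum(n, p); }

int main (void)
{
  double n = 10000.0, p = 10.0;
  printf("predicted T_par = %.0f us, speed-up = %.2f\n",
         t_par(n, p), t_seq(n) / t_par(n, p));
  return 0;
}

For N = 10000 and P = 10 this gives a predicted speed-up of roughly 8, consistent with figure 8.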
[Figure 4: The time (in seconds) for a gsum operation as P varies, with N as parameter (N = 1000, 2500, 5000, 7500, 10000).]
[Figure 5: Parallel implementation: theoretical time vs. measured time (in seconds) as N varies (P = 2, one iteration); measured points against $T^{seq}/2 + (1.22N - 235)\cdot 10^{-6}$.]

The difference between the theoretical model and the measured $T^{par}$ is shown in figure 5 for P = 2, in figure 6 for P = 4 and in figure 7 for P = 10; in figure 7 we can also appreciate the difference from the situation with optimal efficiency ($T^{par} = T^{seq}/P$). Figure 8 summarises the performance of our implementation. The measured times slightly differ from the expected ones because of the problems related to the cache memory; this justifies the super speed-up obtained in some cases. Moreover, there is an effect related to load unbalancing that clearly appears when P does not divide N exactly. Last, it is worth noting that formula (5) can be easily generalised for every $K_{max}$, so obtaining:
$T^{seq} \simeq 1.62\, K_{max}\, N$    (13)

Since the overhead due to the communication does not depend on $K_{max}$, we have that the performance model (6) holds also in the general case.
[Figure 6: Parallel implementation: theoretical time vs. measured time (in seconds) as N varies (P = 4, one iteration); measured points against $T^{seq}/4 + (1.96N + 143)\cdot 10^{-6}$.]
[Figure 7: Parallel implementation: theoretical time vs. measured time (in seconds) as N varies (P = 10, one iteration); measured points against $T^{seq}/10 + (4.18N - 133)\cdot 10^{-6}$ and against $T^{seq}/10$.]
[Figure 8: Parallel implementation: the speed-up as a function of N, with P as parameter (P = 2, 4, 6, 8, 10).]
6 Conclusions

In this paper we proposed a parallel implementation of a diagonally preconditioned conjugate gradient method. The algorithm has been realised by using just a global reduction operator and by avoiding point-to-point communication. The experiments on the Meiko CS-2 showed interesting performance, even though the complexity of the global operator used is O(NP). In this context, the proposed method seems very promising, since it is possible to make the complexity of the gsum operation O(N log(P)) simply by using tree-connected adders. The possibility of pipelining such a communication-computation structure, in order to design a parallel machine whose reduction complexity is O(N + log(P)), as well as the possibility of overlapping such a step with computation, and the range of applicability of such special-purpose hardware, are the concerns of our work in progress.
References
[1] Gupta A., Kumar V., Sameh A., Performance and Scalability of Preconditioned Conjugate Gradient Methods on Parallel Computers, IEEE Transactions on Parallel and Distributed Systems, Vol. 6, 455-469, 1995.

[2] Aykanat C., Ozguner F., Ercal F., Sadayappan P., Iterative Algorithms for Solution of Large Sparse Systems of Linear Equations on Hypercubes, IEEE Transactions on Computers, Vol. 37, 1554-1567, 1988.

[3] Hammond S. W., Schreiber R., Efficient ICCG on a Shared Memory Multiprocessor, Int. J. High Speed Comput., Vol. 4, 1-22, 1992.

[4] Kim S. K., Chronopoulos A., A Class of Lanczos-Like Algorithms Implemented on Parallel Computers, Parallel Computing, Vol. 17, 763-777, 1991.

[5] Barton E., Cownie J., McLaren M., Message Passing on the Meiko CS-2, Parallel Computing, Vol. 20, 497-507, 1994.