A Parallel Implementation of the Conjugate Gradient Method on the Meiko CS-2

Antonio d'Acierno, Giuseppe De Pietro, Antonio Giordano
IRSIP - CNR, via P. Castellino 111, 80131 Napoli (Italy)

Abstract

In this paper we describe a parallel implementation of a preconditioned conjugate gradient algorithm on the Meiko CS-2. The parallel implementation is based on the use of a global reduction operator; this allows us to obtain a simple and efficient code whose performance is easily and precisely predictable. We describe and validate a performance model, and we show and comment on experimental results.

1 Introduction

Solving large sparse systems of linear equations is an integral part of mathematical computing and finds application in a variety of fields such as fluid dynamics, structural analysis, atmospheric modeling and so on. Iterative methods such as the conjugate gradient method are becoming increasingly appealing for solving such systems as they can be easily parallelised [1, 2, 3, 4]. In this paper we describe a parallel implementation of a Preconditioned Conjugate Gradient (PCG) algorithm on the Meiko CS-2. The implementation is based on the use of reduce operations, which are well supported by the CS-2 hardware. The paper is organised as follows. In section 2 the PCG algorithm is briefly described and the sequential code is sketched; section 3 describes the CS-2 hardware and introduces the reduce operation supported by the hw/sw system used. In section 4 we describe the parallel implementation while, in section 5, we propose and validate a performance model and we show and comment on experimental results; conclusions are the concern of section 6.

2 The PCG Algorithm

The PCG algorithm is used for solving systems of linear equations in the form:

A \vec{x} = \vec{b}    (1)

where A is an N \times N (symmetric positive definite) sparse matrix. Let K_i be the number of non-zero elements in the i-th column and let K_{max} = \max_i K_i, i = 0, \ldots, N-1. In our implementation the matrix A is stored using two K_{max} \times N matrices, AC and ROWS, where AC is a compressed version of the matrix A and ROWS contains the indices of the elements of A stored in AC, so that:

AC[i][j] = A[ROWS[i][j]][j]    (2)

The matrix-vector product of a compressed matrix by a vector can be coded in the following way:

void mv_prod (int n, int K_MAX, double *result, double **matr,
              int **rows, double *vect)
{
  int i, j;

  for (i = 0; i < n; i++)
    result[i] = 0.0;
  /* each stored element AC[i][j] = A[rows[i][j]][j] contributes to
     component rows[i][j] of the result */
  for (i = 0; i < K_MAX; i++)
    for (j = 0; j < n; j++)
      result[rows[i][j]] += matr[i][j] * vect[j];
}
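As an illustration of the storage scheme (a self-contained sketch with made-up values, assuming the mv_prod routine above is in scope), a small 4 x 4 matrix with two non-zero elements per column can be packed into AC and ROWS and multiplied by a vector as follows:

#include <stdio.h>

int main(void)
{
    double ac0[4], ac1[4];
    double *ac[2] = { ac0, ac1 };      /* AC: K_max x n values         */
    int r0[4], r1[4];
    int *rows[2] = { r0, r1 };         /* ROWS: K_max x n row indices  */
    double vect[4] = { 1.0, 1.0, 1.0, 1.0 };
    double result[4];
    int i, j;

    /* Column j holds A[j][j] = 2.0 and A[(j+1) % 4][j] = 1.0;
       all other entries of A are zero. */
    for (j = 0; j < 4; j++) {
        ac[0][j] = 2.0;  rows[0][j] = j;
        ac[1][j] = 1.0;  rows[1][j] = (j + 1) % 4;
    }

    mv_prod(4, 2, result, ac, rows, vect);
    for (i = 0; i < 4; i++)
        printf("result[%d] = %g\n", i, result[i]);  /* every entry is 3.0 */
    return 0;
}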
For the sake of completeness we report the code for the inner product of two vectors:

double vv_prod (int n, double *v1, double *v2)
{
  int i;
  double result = 0.0;

  for (i = 0; i < n; i++)
    result += v1[i] * v2[i];
  return result;
}
In this paper we consider a diagonal PCG algorithm; the preconditioner M is then a vector, simply obtained (in our code) in the following way:

M[i] = A[i][i]    (3)

So, the PCG algorithm can be described as follows:

#define N
#define K_MAX

main (int argc, char *argv[])
{
  double a[K_MAX][N];              /* a is the compressed matrix A */
  double x[N], b[N], m[N], s[N], q[N], d[N], r[N];
  int j, rows[K_MAX][N];
  double d_new, d_old, alpha, beta, eps;

  /* code for the input of a, rows, b and eps omitted */
  /* code for initialising m omitted */

  for (j = 0; j < N; j++) {        /* x = 0, r = b, d = M^-1 r */
    x[j] = 0.0;
    r[j] = b[j];
    d[j] = r[j] / m[j];
  }
  d_new = vv_prod(N, r, d);
  d_old = d_new;
  while (d_new > (eps * d_old)) {
    mv_prod(N, K_MAX, q, a, rows, d);       /* q = A d */
    alpha = vv_prod(N, d, q);
    alpha = d_new / alpha;
    for (j = 0; j < N; j++) {
      x[j] += alpha * d[j];
      r[j] -= alpha * q[j];
      s[j] = r[j] / m[j];                   /* s = M^-1 r */
    }
    d_old = d_new;
    d_new = vv_prod(N, r, s);
    beta = d_new / d_old;
    for (j = 0; j < N; j++)
      d[j] = s[j] + beta * d[j];
  }
}
As shown in the code, the PCG algorithm performs a few basic operations in each iteration:

1. matrix-vector multiplications;
2. vector-vector inner products;
3. scalar-vector multiplications;
4. vector-vector additions.

Operations belonging to classes 2, 3 and 4 clearly have time complexity O(N); the complexity of a matrix-vector product, obviously, is O(\eta), where \eta is the number of non-zero elements in the matrix.


Figure 1: Sequential implementation: theoretical time vs. measured time (in seconds) as N varies (one iteration).

In our experiments we used a general form of sparse matrices, i.e. the non-zero elements of A occur in such a way that it is possible to restrict them within a band around the principal diagonal. In order to deal with just one parameter, we fixed the width of the band to K_{max} = 1 + 2\sqrt{N}, so that \eta \simeq 2 N \sqrt{N}. Let us call T^{seq}(N) the time employed to perform an iteration in the sequential case; we have that:

T^{seq} = k_1 N \sqrt{N} + k_2 N    (4)

We experimentally obtained that, when T^{seq} is in microseconds, k_1 = 3.24 and k_2 = 0.08. Figure 1 shows the differences between the theoretical model and the experimental results. By substituting k_1 and k_2 in (4) we have:

T^{seq} = k_1 N \sqrt{N} + k_2 N \simeq 3.24 N \sqrt{N}    (5)
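For instance (a simple check of the model): for N = 10000, the rightmost point in figure 1 (N/100 = 100), equation (5) gives T^{seq} \simeq 3.24 \cdot 10000 \cdot 100 microseconds \simeq 3.24 s, the k_2 term contributing less than a millisecond; this is consistent with the measured per-iteration times of a few seconds shown in the figure.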


Figure 2: The architecture of a Meiko CS-2 with 16 processors.

3 The Meiko CS-2

The Meiko CS-2 [5] is a MIMD parallel computer able to provide both scalar and vector processing nodes, well suited for both scientific computation and database applications according to the particular configuration chosen. The key point of the Meiko machine is the communication structure, which is composed of the communication network and of communication processors. The CS-2 communication network is a logarithmic network constructed from 8-way crosspoint switches and bidirectional links; it can be considered to be a Benes network folded about its centre line. Bandwidth is constant at each stage of the network, and the number of links out of each stage scales with the number of processors. A 16-processor network is illustrated in figure 2. The multi-stage network used in the CS-2 can also be viewed as a "fat tree"; in this type of network, packets are routed back down at the first node possible, so the bandwidth at the higher levels of the tree can be reduced. Although the CS-2 network permits this local packet routing, its bandwidth is not reduced at the higher levels. In order to improve the performance of parallel applications, every processing element in a CS-2 machine has its own communication processor, a dedicated interface to the communication network. This processor supports remote read, write and synchronization operations specified by virtual processor number and virtual address. Latency hiding is supported by non-blocking instructions, instruction sequences and completion tests.

Another important aspect of the communication structure is the availability of a library of network functions which allows the user to efficiently solve some communication problems without using low-level routines. For example, very simple and efficient routines are provided for message broadcasting, for DMA access to any node's memory, etc. Particularly interesting is the gsum function: this function performs a reduce operation, and the results are broadcast to all processors. In other words, suppose we have P processes on P processors, and that each process has an array a of N elements. When all processes call gsum on the array, all the corresponding elements of the arrays are summed, and the sum is placed in the corresponding position of each array. As will be shown later, this function has been widely used in the implementation of the PCG algorithm.
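For readers more familiar with standard message-passing libraries, the semantics of such a reduction can be sketched with its MPI equivalent (this is an analogy only; the CS-2 library call has its own machine-specific interface):

#include <mpi.h>

/* Element-wise global sum, semantically equivalent to the CS-2 gsum/gdsum
 * reduction: every process contributes a[0..n-1], and after the call every
 * process holds the element-wise sum over all processes. */
void global_sum(double *a, int n)
{
    MPI_Allreduce(MPI_IN_PLACE, a, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}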

4 The Parallel Implementation

The basic operation to be parallelised is the matrix-vector product, and there are two main strategies for such an operation: in the first the matrix is split row-wise, while in the second it is split column-wise. Regarding the row-wise mapping we have that, for each product and considering dense matrices, an all-to-all broadcast is required. If unstructured sparse matrices are considered, just point-to-point communications are required, but both the communication pattern and the number of communications are strictly dependent on the matrix characteristics, so that code has to be designed and implemented for each class of matrices and hardware. In some special cases, such as when matrices derived from a two-dimensional discretization are considered, it is possible to show [1] that only nearest-neighbour communications are required; this feature really improves performance if the underlying hardware is based on a direct communication network.

The column-wise method, on the other hand, allows us to develop a general-purpose code that does not depend on the communication network used (indirect, mesh, ring, hypercube and so on); moreover, the code and the communication overhead do not depend on the sparsity pattern of the matrix. Other features that make such a strategy really interesting are: (i) the resulting parallel code is obtained from the sequential one with simple modifications, (ii) the performance is highly predictable and (iii) it is possible to derive special-purpose hardware that makes the communication overhead really negligible.

In the parallel implementation we developed, the compressed matrix AC is split column-wise among the processors. Let P be the number of processors and let us suppose, for the sake of simplicity, that P exactly divides N (NL = N/P); processor j (j = 0, 1, ..., P-1) holds columns j*NL, j*NL+1, ..., NL*(j+1)-1. By using the gdsum function described in section 3, a parallel version of the matrix-vector product can be easily coded as follows:

void par_mv_prod (int N, int NL, int K_MAX, double *res,
                  double **matr, int **rows, double *vect)
{
  int i, j;

  for (i = 0; i < N; i++)
    res[i] = 0.0;
  /* local contribution of the NL columns held by this processor */
  for (i = 0; i < K_MAX; i++)
    for (j = 0; j < NL; j++)
      res[rows[i][j]] += matr[i][j] * vect[j];
  /* global reduction: element-wise sum of the partial result vectors
     over all processors (call signature assumed here) */
  gdsum(res, N);
}
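As a concrete example of the mapping, with N = 12 and P = 3 (so NL = 4) processor 0 holds columns 0-3, processor 1 holds columns 4-7 and processor 2 holds columns 8-11; each processor accumulates the contribution of its own columns into the full-length vector res, and the final global sum adds these partial vectors element-wise, so that every processor ends up with the complete product.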
In the same way, it is possible to code the parallel version of the vv_prod routine:

double par_vv_prod (int NL, double *v1, double *v2)
{
  int i;
  double result[1];

  result[0] = 0.0;
  for (i = 0; i < NL; i++)
    result[0] += v1[i] * v2[i];
  /* global sum of the one-element partial results */
  gdsum(result, 1);
  return result[0];
}
The parallel code trivially becomes:

#define N
#define K_MAX
#define P
#define NL N/P

main (int argc, char *argv[])
{
  double a[K_MAX][NL];
  double x[NL], b[NL], m[NL], s[NL], q[NL], d[NL], r[NL];
  int j, name, rows[K_MAX][NL];
  double temp[N];
  double d_new, d_old, alpha, beta, eps;

  /* initialize the parallel environment */
  ew_baseInit();
  name = (int) ew_state.vp;      /* this is the name of the process */

  /* code for the input of a, rows, b and eps omitted */
  /* code for initialising m omitted */

  for (j = 0; j < NL; j++) {
    x[j] = 0.0;
    r[j] = b[j];
    d[j] = r[j] / m[j];
  }
  d_new = par_vv_prod(NL, r, d);
  d_old = d_new;
  while (d_new > (eps * d_old)) {
    par_mv_prod(N, NL, K_MAX, temp, a, rows, d);
    for (j = 0; j < NL; j++)
      q[j] = temp[name * NL + j];        /* extract the local NL-vector from temp */
    alpha = par_vv_prod(NL, d, q);
    alpha = d_new / alpha;
    for (j = 0; j < NL; j++) {
      x[j] += alpha * d[j];
      r[j] -= alpha * q[j];
      s[j] = r[j] / m[j];
    }
    d_old = d_new;
    d_new = par_vv_prod(NL, r, s);
    beta = d_new / d_old;
    for (j = 0; j < NL; j++)
      d[j] = s[j] + beta * d[j];
  }
}
Let us call T^{par}(N, P) the time to perform an iteration with a problem of dimension N on P processors; since two matrix-vector products and two vector-vector products are performed, and by neglecting the time to extract the NL-vectors \vec{q} and \vec{r} from the N-vector temp, we have that:

T^{par} = T^{seq}/P + 2 T^g(N, P) + 2 T^g(1, P)    (6)

where T^g(N, P) is the time to perform a gdsum on a vector of dimension N using P processors. It should be noted that formula (6) is just a first-order analysis representing an upper bound for the actual T^{par}; indeed, measured times are heavily affected by the availability in the cache memory of the data to be used. In a second-order analysis, the term related to T^{seq} should be multiplied by a factor smaller than 1; as a consequence a super speed-up can be obtained, as will be shown in section 5.

5 Experimental Results

First, let us analyse the time complexity of the gsum function: we observed a linear dependence both on the dimension N of the vector (see figure 3) and on the number P of processors (see figure 4).


Figure 3: The time (in seconds) for a gsum operation as N varies, with P as a parameter.

This behaviour can be represented in the following form:

T^g(N, P) = k_3 N + k_4 P + k_5 N P + k_6    (7)

When T^g is in microseconds we found, using a least-squares error analysis on 25 experimental points:

k_3 = 0.48    (8)
k_4 = -46    (9)
k_5 = 0.37    (10)
k_6 = 327    (11)

From equations (7) and (5) it should be clear that we can neglect T^g(1, P) in equation (6), since, for our machine, P_{max} \leq 10; we then obtain:

T^{par} = T^{seq}/P + 2 (k_3 N + k_4 P + k_5 N P + k_6)    (12)
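For convenience, the model can be evaluated directly; the sketch below (our own addition, not part of the implementation, with the fitted constants of equations (5), (7) and (12)) prints the predicted per-iteration times in seconds and the resulting predicted speed-up for given N and P:

#include <math.h>
#include <stdio.h>

/* Predicted sequential time per iteration, in seconds (equation (5)). */
double t_seq(double n)
{
    return (3.24 * n * sqrt(n) + 0.08 * n) * 1e-6;
}

/* Predicted parallel time per iteration, in seconds (equation (12)),
 * using the fitted gsum constants k3..k6 and neglecting T^g(1, P). */
double t_par(double n, double p)
{
    double t_gsum = 0.48 * n - 46.0 * p + 0.37 * n * p + 327.0;  /* eq. (7), microseconds */
    return t_seq(n) / p + 2.0 * t_gsum * 1e-6;
}

int main(void)
{
    double n = 10000.0, p = 10.0;
    printf("T_seq = %.3f s, T_par = %.3f s, predicted speed-up = %.2f\n",
           t_seq(n), t_par(n, p), t_seq(n) / t_par(n, p));
    return 0;
}

For N = 10000 and P = 10, for example, the model predicts a speed-up of about 8.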


Figure 4: The time (in seconds) for a gsum operation as P varies, with N as a parameter.


[Plot: measured points and the theoretical curve T^{seq}/2 + (1.22 N - 235) \cdot 10^{-6} s.]

Figure 5: Parallel implementation: theoretical time vs. measured time (in seconds) as N varies (P = 2, one iteration).

The difference between the theoretical model and the measured T^{par} is shown in figure 5 for P = 2, in figure 6 for P = 4 and in figure 7 for P = 10; in figure 7 we can also appreciate the difference from the situation with optimal efficiency (T^{par} = T^{seq}/P). Figure 8 summarises the performance of our implementation. The measured times slightly differ from the expected ones because of the problems related to the cache memory; this justifies the super speed-up obtained in some cases. Moreover, there is the effect related to a load unbalancing that clearly appears when P does not divide N exactly. Last, it is worth noting that formula (5) can be easily generalised for every K_{max}, thus obtaining:

T^{seq} \simeq 1.62 K_{max} N    (13)

Since the overhead due to the communication does not depend on K_{max}, we have that the performance model (6) holds also in the general case.
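Indeed, with the band width used in our experiments, K_{max} = 1 + 2\sqrt{N} \simeq 2\sqrt{N}, so (13) reduces to T^{seq} \simeq 1.62 \cdot 2\sqrt{N} \cdot N = 3.24 N \sqrt{N}, which is exactly (5).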


[Plot: measured points and the theoretical curve T^{seq}/4 + (1.96 N + 143) \cdot 10^{-6} s.]

Figure 6: Parallel implementation: theoretical time vs. measured time (in seconds) as N varies (P = 4, one iteration).


[Plot: measured points, the theoretical curve T^{seq}/10 + (4.18 N - 133) \cdot 10^{-6} s, and the ideal curve T^{seq}/10.]

Figure 7: Parallel implementation: theoretical time vs. measured time (in seconds) as N varies (P = 10, one iteration).



Figure 8: Parallel implementation: the speed-up as a function of N, with P as a parameter.


6 Conclusions

In this paper we proposed a parallel implementation of a diagonal preconditioned conjugate gradient method. The algorithm has been realised by using just a global reduction operator and by avoiding point-to-point communication. The experiments on the Meiko CS-2 showed interesting performance, even though the complexity of the global operator used is O(N P). In this context, the proposed method seems very promising, since it is possible to make the complexity of the gsum operation O(N log(P)) simply by using tree-connected adders. The possibility of having pipelining in such a communication-computation structure, in order to design a parallel machine whose reduction complexity is O(N + log(P)), as well as the possibility of overlapping such a step with computation, and the range of applicability of such special-purpose hardware, are the concerns of our work in progress.

References

[1] Gupta A., Kumar V., Sameh A., Performance and Scalability of Preconditioned Conjugate Gradient Methods on Parallel Computers, IEEE Transactions on Parallel and Distributed Systems, Vol. 6, 455-469, 1995.

[2] Aykanat C., Ozguner F., Ercal F., Sadayappan P., Iterative Algorithms for Solution of Large Sparse Systems of Linear Equations on Hypercubes, IEEE Transactions on Computers, Vol. 37, 1554-1567, 1988.

[3] Hammond S. W., Schreiber R., Efficient ICCG on a Shared Memory Multiprocessor, Int. J. High Speed Comput., Vol. 4, 1-22, 1992.

[4] Kim S. K., Chronopoulos A., A Class of Lanczos-Like Algorithms Implemented on Parallel Computers, Parallel Computing, Vol. 17, 763-777, 1991.

[5] Barton E., Cownie J., McLaren M., Message Passing on the Meiko CS-2, Parallel Computing, Vol. 20, 497-507, 1994.

