Writing Parallel Programs on LINUX CLUSTER

Kabir Ahmed
Syed Asadul Haq
Md. Monerul Islam
Md. Iftakher Hossain

A graduation project completed under the supervision of Syed Akhter Hossain, Associate Professor & Chairperson, Department of Computer Science & Engineering

East West University 43 Mohakhali, Dhaka-1212.


Copyright © October 2003

Modification of any part of this document without the prior permission of the author(s) is considered a violation of copyright law.


ACKNOWLEDGMENTS

We would like to thank our project supervisor, Syed Akhter Hossain, Associate Professor and Chairperson of the Department of Computer Science and Engineering. It is obvious that without his endless support, encouragement and assistance it would not have been possible for us to complete this report. We are grateful to Professor Dr. Mozammel Huq Azad Khan for his patience and supportive attitude. We would also like to mention Safiqul Islam, lab assistant, for his tremendous help. We are grateful to all our teachers and to our friends for their kind cooperation during our graduation period. Above all, we are very grateful to our family members for their endless love, encouragement and support.

Abstract

Parallel programming has evolved over the years with new dimensions, and the scientific community keeps fostering new approaches to it. Clusters of networked computers running parallel programs have become the cheapest practical alternative to a supercomputer. This paper is an initiative toward a newer approach of using parallel programming languages on top of the Message Passing Interface (MPI) on a Linux cluster. It discusses the basics of building a Linux cluster and of parallel programming using different implementations of MPI, and briefly measures the performance of the cluster.

Table of Contents

CHAPTER 01 - Introduction
1. Introduction
   1.1 Beowulf – A Parallel Computing Architecture
   1.2 The Evolution of Beowulf
      1.2.1 First-Generation Beowulf
      1.2.2 Second-Generation Beowulf
         1.2.2.1 BProc
         1.2.2.2 The Scyld Implementation
   1.3 References

CHAPTER 02 - Parallel Computing Architecture
2. Parallel Computing Architecture
   2.1 Parallel Computing Systems
      2.1.1 SIMD Systems
      2.1.2 MIMD Systems
      2.1.3 SPMD Systems
   2.2 Beowulf Architecture
   2.3 Cluster Design
      2.3.1 Cluster Setup and Installation
      2.3.2 OS Installation
      2.3.3 Configuration of Master Node and Client Node
   2.4 Testing the System
   2.5 References

CHAPTER 03 - Communication API
3. Communication API
   3.1.1 Parallel Virtual Machine
   3.1.2 Message Passing Interface (MPI)
      3.1.2.1 Architecture of MPI
   3.2 Software Bindings on MPI
      3.2.1 MPICH
         3.2.1.1 Installation and Configuration of MPICH
         3.2.1.2 Testing the Beowulf Cluster and MPICH Configuration
      3.2.2 mpiJava
         3.2.2.1 Class Hierarchy of mpiJava
         3.2.2.2 API of mpiJava
         3.2.2.3 Installation and Configuration of mpiJava
      3.2.3 HPJava
         3.2.3.1 Installing HPJava
         3.2.3.2 Compiling and Running HPJava Programs
   3.3 References

CHAPTER 04 - Writing Parallel Application
4. Parallel Application Architecture
   4.1 Writing Parallel Programs in MPICH
      4.1.1 HelloWorld.c
   4.2 A Parallel Java Application using mpiJava
      4.2.2 Source Code of Matrix-Matrix Multiplication using Scatter-Gather
      4.2.3 Result of Matrix-Matrix Multiplication
   4.3.1 Image Enhancement using Fourier Transform
      4.3.1.1 How the Fourier Transform Works
      4.3.1.2 Implementation
      4.3.1.3 Source Code
   4.4 References

CHAPTER 05 - Performance Analysis
5. Performance Analysis
   5.1 PI Calculation in MPICH
   5.2 Matrix-Matrix Multiplication in MPICH
   5.3 Matrix-Matrix Multiplication in mpiJava
   5.4 References

CHAPTER 06 - Conclusion & Future Works
6. Conclusion
Communicated Paper to ICCIT 2003

WRITING PARALLEL PROGRAMS ON LINUX CLUSTER

Chapter 1

Introduction


1. Introduction

From the dawn of the computer era, computational power has stood as the main driving force behind the development of computers. Scientists and engineers believe that more computational power makes a computer more powerful, and on the basis of computational power computers have been categorized as supercomputers, minicomputers, microcomputers, and so on. Hardware vendors keep developing more powerful CPUs to gain more processing power; nowadays, processors over 3 GHz are available on the market for desktop computers. But even now, to run a complex scientific program, such as the simulation of a weather forecasting model, a complex fluid dynamics code or a data mining application, we need a supercomputer, or more precisely the processing power of a supercomputer.

The obvious question is what causes this ever-escalating need for greater computational power. The answer relies on the fact that, for centuries, science has followed the basic paradigm of first observe, then theorize, and then test the theory through experimentation. Similarly, engineers have traditionally first designed (typically on paper), then built and tested prototypes, and finally built a finished product. However, it is becoming less expensive to carry out detailed computer simulations than to perform numerous real experiments or build a series of prototypes. Thus, experiment and observation in the scientific paradigm, and design and prototyping in the engineering paradigm, are being increasingly replaced by computation. Furthermore, in some cases we can now simulate phenomena that could not be studied using experimentation, e.g., the evolution of the universe.

But the cost of a supercomputer is extremely high, and its installation and maintenance are also very complex. Moreover, it has been shown that, in some cases, the need for greater computational power subsumes the needs for both greater speed and greater storage [1]. So, to meet the need for more computational power for complex applications, scientists devised a new approach called parallel computing. It is a method of computing where a collection of computers works together to solve a problem.

As the performance of commodity computer and network hardware increases and prices decrease, it becomes more and more practical to build parallel computational systems from off-the-shelf components rather than buying CPU time on very expensive supercomputers. In fact, the price-to-performance ratio of a Beowulf-type machine is three to ten times better than that of traditional supercomputers. The Beowulf architecture scales well and is easy to construct, and one only has to pay for the hardware, as most of the software is free.

1.1 Beowulf – A Parallel Computing Architecture

There are probably as many Beowulf definitions as there are people who build or use Beowulf supercomputer facilities. Some claim that one can call a system Beowulf only if it is built in the same way as NASA's original machine. Others go to the other extreme and call any system of workstations running parallel code a Beowulf. We take a definition of Beowulf that lies between these two views. Beowulf is a multi-computer architecture which can be used for parallel computations. It is a system which usually consists of one server node and one or more client nodes connected together via Ethernet or some other network. It is built using commodity hardware components, like any PC capable of running Linux, standard Ethernet adapters, and switches. It does not contain any custom hardware components and is trivially reproducible. Beowulf also uses commodity software such as the Linux operating system, Parallel Virtual Machine (PVM) and Message Passing Interface (MPI). The server node controls the whole cluster and serves files to the client nodes. It is also the cluster's console and gateway to the outside world. Large Beowulf machines might have more than one server node, and possibly other nodes dedicated to particular tasks, for example consoles or monitoring stations. In most cases client nodes in a Beowulf system are dumb, and the dumber the better. Nodes are configured and controlled by the server node and do only what they are told to do. In a disk-less client configuration, client nodes don't even know their IP address or name until the server tells them what it is. One of the main differences between Beowulf and a Cluster of Workstations (COW) is the fact that Beowulf behaves more like a single machine than like many workstations. In most cases client nodes do not have keyboards or monitors, and are accessed only via remote login or possibly a serial terminal or a Keyboard Video Monitor (KVM) switch. Beowulf nodes can be thought of as a CPU + memory package which can be plugged into the cluster, just like a CPU or memory module can be plugged into a motherboard. Beowulf is not a special software package or a new network topology; rather, it is a technology for clustering Linux computers to form a parallel, virtual supercomputer [2].

1.2 The Evolution of Beowulf

The original concept for Beowulf clusters was conceived by Donald Becker while he was at NASA Goddard in 1994 [3]. The premise was that commodity computing parts could be used, in parallel, to produce an order-of-magnitude leap in computing price/performance for a certain class of problems. The proof of concept was the first Beowulf cluster, Wiglaf, which was operational in late 1994. Wiglaf was a 16-processor system with 66MHz Intel 80486 processors that were later replaced with 100MHz DX4s, achieving a sustained performance of 74Mflops/s (74 million floating-point operations per second). Three years later, Becker and the CESDIS (Center of Excellence in Space Data and Information Sciences) team won the prestigious Gordon Bell award. The award was given for a cluster of Pentium Pros, assembled for SC'96 (the 1996 Supercomputing Conference), that achieved 2.1Gflops/s (2.1 billion floating-point operations per second). The software developed at Goddard was in wide use by then at many national labs and universities.

1.2.1 First-Generation Beowulf The first generation of Beowulf clusters had the following characteristics: commodity hardware, open-source operating systems such as Linux or FreeBSD and dedicated compute nodes residing on a private network. In addition, all of the nodes possessed a full operating system installation, and there was individual process space on each node.

These first-generation Beowulfs ran software to support a message-passing interface, either PVM (Parallel Virtual Machine) or MPI (Message Passing Interface). Message passing is typically how slave nodes in a high-performance computing (HPC) cluster environment exchange information.

Some common problems plagued the first-generation Beowulf clusters, largely because the system management tools used to control the new clusters did not scale well; they were more platform- or operating-system-specific than the parallel programming software. After all, Beowulf is all about running high-performance parallel jobs, and far less attention went into writing robust, portable system administration code. The following types of problems hampered early Beowulfs:



•	Early Beowulfs were difficult to install. There was either the labor-intensive, install-each-node-manually method, which was error-prone and subject to typos, or the more sophisticated install-all-the-nodes-over-the-network method using PXE/TFTP/NFS/DHCP--clearly, getting all of one's acronyms properly configured and running all at once is a feat in itself.

•	Once installed, Beowulfs were hard to manage. For a semi-large cluster with dozens or hundreds of nodes, management becomes nearly impossible. To run a new kernel on a slave node, one had to install the kernel in the proper place and tell LILO (or another favorite boot loader) about it, dozens or hundreds of times. To facilitate node updates the r commands, such as rsh and rcp, were employed. The r commands, however, require user account management accessibility on the slave nodes and open a plethora of security holes.

•	It was hard to adapt the cluster: adding new computing power in the form of more slave nodes required fervent prayers to the Norse gods. To add a node, one needed to install the operating system, update all the configuration files, update the user space on the nodes and, of course, all the HPC code that had configuration requirements of its own.

•	It didn't look and feel like a computer; it felt like a lot of little independent nodes off doing their own thing, sometimes playing together nicely long enough to complete a parallel programming job.

In short, for all the progress made in harnessing the power of commodity hardware, there was still much work to be done in making Beowulf 1 an industrial-strength computing appliance. Over the last year or so, the Rocks and OSCAR clustering software distributions have developed into the epitome of Beowulf 1 implementations [ ``The Beowulf State of Mind'', LJ May 2002, and ``The OSCAR Revolution'', LJ June 2002]. But if Beowulf commodity computing was to become more sophisticated and simpler to use, it was going to require extreme Linux engineering.


1.2.2 Second-Generation Beowulf

The hallmark of second-generation Beowulf is that the most error-prone components have been eliminated, making the new design far simpler and more reliable than first-generation Beowulf. Scyld Computing Corporation, led by CTO Don Becker and some of the original NASA Beowulf staff, has achieved a breakthrough in Beowulf technology as significant as the original Beowulf itself was in 1994. The commodity aspects and message-passing software remain constant from Beowulf 1 to Beowulf 2. However, significant modifications have been made in node setup and process space distribution.

1.2.2.1 BProc

At the very heart of the second-generation Beowulf solution is BProc, short for Beowulf Distributed Process Space, which was developed by Erik Arjan Hendriks of Los Alamos National Lab. BProc consists of a set of kernel modifications and system calls that allow a process to be migrated from one node to another. The process migrates under the complete control of the application itself--the application explicitly decides when to move over to another node and initiates the move via an rfork system call. The process is migrated without its associated file handles, which makes the process lean and quick. Any required files are re-opened by the application itself on the destination node, giving complete control to the application process.

Of course, the ability to migrate a process from one node to another is meaningless without the ability to manage the remote process. BProc provides such a method by putting a ``ghost process'' in the master node's process table for each migrated process. These ghost processes require no memory on the master--they merely are placeholders that communicate signals and perform certain operations on behalf of the remote process. For example, through the ghost process on the master node, the remote process can receive signals, including SIGKILL and SIGSTOP, and fork child processes. Since the ghost processes appear in the process table of the master node, tools that display the status of processes work in the same familiar ways.

The elegant simplicity of BProc has far-reaching effects. The most obvious effect is the Beowulf cluster now appears to have a single-process space managed from the master node. This concept of a single, cluster-wide process space with centralized management is called single-system image or, sometimes, single-system illusion because the mechanism provides the illusion that the cluster is a single-compute resource. In addition, BProc does not require the r commands (rsh and rlogin) for process management because processes are managed directly from the master. Eliminating the r commands means there is no need for user account management on the slave nodes, thereby reducing a significant portion of the operating system on the slaves. In fact, to run BProc on a slave node, only a couple of dæmons are required to be present on the slave: bpslave and sendstats.

1.2.2.2 The Scyld Implementation

Scyld has completely leveraged BProc to provide an expandable cluster computing solution, eliminating everything from the slave nodes except what is absolutely required in order to run a BProc process. The result is an ultra-thin compute node that has only a small portion of Linux running--enough to run BProc. The power of BProc and the ultra-thin Scyld node, taken in conjunction, have a great impact on the way the cluster is managed. There are two distinguishing features of the Scyld distribution and of Beowulf 2 clusters. First, the cluster can be expanded by simply adding new nodes. Because the nodes are ultra-thin, installation is a matter of booting the node with the Scyld kernel and making it a receptacle for BProc-migrated processes. Second, version skew is eliminated. Version skew is what happens on clusters with fully installed slave nodes: over time, because of nodes that are down during software updates or because of simple update failures, the software on the nodes that is supposed to be in lockstep shifts out of phase. Since only the bare essentials are required on the nodes to run BProc, version skew is virtually eliminated.

The above gives an elaborate history of the evolution of Beowulf. For further information and references, visit the official Beowulf site at http://www.beowulf.org .

Figure 1. A Beowulf Cluster


1.3 References
[1] Peter S. Pacheco. Parallel Programming with MPI.
[2] James Demmel. Lecture notes for intro parallel computing, Spring 1995. http://www.cs.berkeley.edu/~demmel/cs267.
[3] Glen Otero and Richard Ferri. The Beowulf Evolution. Linux Journal, Issue 100. http://www.linuxjournal.com.


Chapter 2

Architecture Overview & System Design


2. Parallel Computing Architecture

Parallel processing refers to the concept of speeding up the execution of a program by dividing it into multiple fragments that can execute simultaneously, each on its own processor. A program executing across n processors might execute n times faster than it would using a single processor. The original classification of parallel computers is popularly known as Flynn's taxonomy. In 1966 Michael Flynn classified systems according to the number of instruction streams and the number of data streams. The classical von Neumann machine has a single instruction stream and a single data stream, and hence is identified as single-instruction single-data (SISD) [1]. At the opposite extreme is the multiple-instruction multiple-data (MIMD) system, in which a collection of autonomous processors operates on their own data streams; in Flynn's taxonomy, this is the most general architecture for parallel computing. Intermediate between SISD and MIMD systems are SIMD and MISD.

2.1 Parallel Computing Systems

2.1.1 SIMD Systems

SIMD (Single Instruction stream, Multiple Data stream) refers to a parallel execution model in which all processors execute the same operation at the same time, but each processor is allowed to operate upon its own data. This model naturally fits the concept of performing the same operation on every element of an array, and is thus often associated with vector or array manipulation. Because all of these operations are inherently synchronized, interactions among SIMD processors tend to be easily and efficiently implemented. The execution of the following code

for (i = 0; i < 1000; i++)
    if (y[i] != 0.0)
        z[i] = x[i]/y[i];
    else
        z[i] = x[i];

gives the following sequence of operations:

Time Step 1. Test local_y != 0.0.
Time Step 2. a. If local_y was nonzero, z[i] = x[i]/y[i].
             b. If local_y was zero, do nothing.
Time Step 3. c. If local_y was nonzero, do nothing.
             d. If local_y was zero, z[i] = x[i].

This implies the completely synchronous execution of statements, and this example makes the disadvantage of SIMD systems clear: at any given instant of time, a given subordinate process is either "active" and doing exactly the same thing as all the other active processes, or it is idle. So, in a program with many conditional branches or long segments of code whose execution depends on conditionals, it is entirely possible that many processors will remain idle for long periods of time.

2.1.2 MIMD Systems

MIMD (Multiple Instruction stream, Multiple Data stream) refers to a parallel execution model in which each processor is essentially acting independently [2]. This model naturally fits the concept of decomposing a program for parallel execution on a functional basis; for example, one processor might update a database file while another processor generates a graphic display of the new entry. This is a more flexible model than SIMD execution, but it comes with the risk of debugging nightmares called race conditions, in which a program may intermittently fail due to timing variations reordering the operations of one processor relative to those of another.


2.1.3 SPMD Systems

SPMD (Single Program, Multiple Data) is a restricted version of MIMD in which all processors run the same program. Unlike SIMD, each processor executing SPMD code may take a different control-flow path through the program [7]. The SPMD model captures parallelism well, as it reduces both the processor idleness of SIMD systems and the race-condition hazards of fully general MIMD systems. Symmetric execution of code across the processors is guaranteed in this model.
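The following short C program is a minimal sketch of the SPMD model (it is not one of the report's own listings): every process runs the same executable, and each process chooses its own control-flow path according to its MPI rank.

/* spmd.c - illustrative sketch only: one program image, rank-dependent paths. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* every process starts here   */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?                   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us are there?   */

    if (rank == 0)
        printf("Process 0 of %d takes the coordinator path\n", size);
    else
        printf("Process %d of %d takes the worker path\n", rank, size);

    MPI_Finalize();                         /* every process ends here     */
    return 0;
}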

2.2 Beowulf Architecture

We stated before that Beowulf is not a special software package, a new network topology or the latest kernel hack. It is a technology for clustering Linux computers to form a virtual supercomputer. It is built on the SPMD model of parallel computing, wherein a group of processes cooperates by executing identical program images on local data values. The whole system is built with commodity hardware components, like any PC capable of running Linux, standard Ethernet adapters, hubs and switches. It also uses common software such as the Linux operating system, Parallel Virtual Machine (PVM), Message Passing Interface (MPI), etc. It does not require any custom hardware components.

Beowulf systems usually consist of one server node and one or more client nodes connected together via Ethernet or some other network. The server node controls the whole cluster and serves files and commands to the client nodes. It is also the cluster's console and gateway to the outside world. Nodes are configured and controlled by the server node and do only the tasks they are asked to do. In a disk-less client configuration, client nodes don't even know their IP address or name until the server node tells them what it is.


Large Beowulf clusters may contain more than one server node, with some dedicated to specific tasks such as monitoring cluster performance. Most importantly, the client nodes do not have their own keyboards and monitors. The terminals of the client nodes are attached to the server node through KVM (Keyboard Video Monitor) switches, which play the main role in presenting the cluster as a single machine rather than a pile of PCs. The physical layout of a Beowulf system looks like the following picture:

Figure 2. The physical layout of a Beowulf system

2.3 Cluster Design

Beowulf clusters have been constructed from a variety of parts, so the type of application that will run on the cluster and the availability of hardware components determine the system configuration. There is no rule that the same commodity components must be used throughout the system, but it is generally held that a cluster of identically configured nodes works better than a mixed one. In our research work the EWU Beowulf system was built with one server and four client nodes, and the configuration of the system was as below:

•	Processor: P4 1.8 GHz
•	RAM: 128 MB
•	Hard disk: 40 GB
•	Network card: Realtek
•	Bandwidth: 100 Mbps

2.3.1 Cluster Setup and Installation

This section covers the construction and configuration of the EWU Beowulf system. It differs from the references in many places, as it is a research work and was carried out according to the project supervisor's guidance.

There are at least four methods of configuring disk storage in a Beowulf cluster. These configurations differ in price, performance and ease of administration. In this paper we cover the fully local install configuration.

a. Disk-less Configuration
In this configuration the server serves all files to disk-less clients. The main advantage of a disk-less client system is the flexibility of adding new nodes and administering the cluster. Since the client nodes do not contain any information locally, when adding a new node you only have to modify a few files on the server or run a script which will do the job. There is no need to install the operating system or any other software on any of the nodes except the server node. The disadvantages are increased NFS traffic and a slightly more complex initial setup [4,1,2].

b. Fully Local Install
The other extreme is to have everything stored on each client. With this configuration the operating system and all the software have to be installed on each of the clients. The advantage of this setup is no NFS traffic; the disadvantage is a very complicated installation and maintenance. Maintenance of such a configuration can be made easier with shell scripts and utilities such as rsync, which can update all the file systems.


c. Standard NFS Installation
The third choice is a halfway point between the disk-less client and fully local install configurations. In this setup clients have their own disks with the operating system and swap locally, and only mount /home and /usr/local off the server. This is the most commonly used configuration of Beowulf clusters.

2.3.2 OS Installation

For our Beowulf cluster we chose the Red Hat Inc. Linux distribution, version 7.3 (Valhalla). For detailed installation instructions we refer to the Red Hat Linux installation documentation. For the cluster, the setup should be completed with full network support and remote shell facilities. If the packages for basic network communication and remote communication are not installed, they will have to be installed manually later. And since the cluster works in a trusted private network, any firewall protection should be disabled for proper functioning of the cluster.

2.3.3 Configuration of Master Node and Client Node

1. Create .rhosts files in the user home and /root directories. Our .rhosts files for the beowulf users are as follows:
node00 beowulf
node01 beowulf
node02 beowulf
node03 beowulf
And the .rhosts files for the root users are:
node00 root
node01 root
node02 root
node03 root


2. Create the hosts file in the /etc directory. The /etc/hosts file for the master node (node00) in our EWU Beowulf cluster is:
192.168.1.220 node00.ewubd.edu node00
127.0.0.1 localhost
192.168.1.221 node01
192.168.1.222 node02
192.168.1.223 node03
The /etc/hosts file for a child node (node01) in our EWU Beowulf cluster is:
192.168.1.221 node01.ewubd.edu node01
127.0.0.1 localhost
192.168.1.220 node00
192.168.1.222 node02
192.168.1.223 node03
Precaution: the ordering of the nodes is very important. The node being configured should be placed first, and the rest of the nodes should be listed in ascending order.
3. Modify the hosts.allow file in /etc by adding the following lines:
For node00:
ALL: 192.168.1.220
ALL: 192.168.1.221
ALL: 192.168.1.222
ALL: 192.168.1.223
ALL: node00.ewubd.edu
ALL: node01.ewubd.edu
ALL: node02.ewubd.edu
ALL: node03.ewubd.edu
For node01:
ALL: 192.168.1.221
ALL: 192.168.1.220
ALL: 192.168.1.222
ALL: 192.168.1.223
ALL: node01.ewubd.edu
ALL: node00.ewubd.edu
ALL: node02.ewubd.edu
ALL: node03.ewubd.edu
Precaution: give a space after the colon, and maintain the ordering of the nodes.
4. Modify the hosts.deny file in the /etc directory by adding the following line:

ALL: ALL
5. Add the following lines to the /etc/securetty file:
rsh
rlogin
rexec
pts/0
pts/1
6. Modify the rsh file in the /etc/pam.d directory as follows:
auth    sufficient    /lib/security/pam_nologin.so
auth    optional      /lib/security/pam_securetty.so
auth    sufficient    /lib/security/pam_env.so
auth    sufficient    /lib/security/pam_rhosts_auth.so
auth    sufficient    /lib/security/pam_stack.so service=system-auth
auth    sufficient    /lib/security/pam_stack.so service=system-auth
7. Modify the rsh, rlogin, telnet and rexec files of the /etc/xinetd.d directory: change the disable = yes line to disable = no.
8. After making all the changes, restart xinetd:
service xinetd restart

2.4 Testing the System

•	To test the system, first use the ping command to check whether there is a physical connection between the nodes.
•	Try to log in remotely to each of the machines. A successful login confirms the remote access mechanism between the nodes, so that users of the system can use commands such as rcp and rexec.
•	Install the software needed to run a parallel program, and test the system by running a demo program.


2.5 References
[1] The latest version of the Beowulf HOWTO. http://www.sci.usq.edu.au/staff/jacek/beowulf
[2] Building a Beowulf System. http://www.cacr.caltech.edu/beowulf/tutorial/building.html
[3] Jacek's Beowulf Links. http://sci.usq.edu.au/staff/jacek/beowulf.
[4] Chance Reschke, Thomas Sterling, Daniel Ridge, Daniel Savarese, Donald Becker, and Phillip Merkey. A Design Study of Alternative Network Topologies for the Beowulf Parallel Workstation. Proceedings, Fifth IEEE International Symposium on High Performance Distributed Computing, 1996. http://www.beowulf.org/papers/HPDC96/hpdc956.html
[5] Thomas Sterling, Daniel Ridge, Daniel Savarese, Michel R. Berry, and Chance Reschke. Achieving a Balanced Low-Cost Architecture for Mass Storage Management through Multiple Fast Ethernet Channels on the Beowulf Parallel Workstation. Proceedings, International Parallel Processing Symposium, 1996. http://www.beowulf.org/papers/IPPS96/ipps96.html
[6] Donald J. Becker, Thomas Sterling, Daniel Savarese, John E. Dorband, Udaya A. Ranawak, and Charles V. Packer. Beowulf: A Parallel Workstation for Scientific Computation. Proceedings, International Conference on Parallel Processing, 1995. http://www.beowulf.org/papers/ICPP95/icpp95.html
[7] Beowulf Homepage. http://www.beowulf.org
[8] Extreme Linux. http://www.extremelinux.org
[9] Extreme Linux Software from Red Hat. http://www.redhat.com/extreme.

Chapter 3

Communication API


3. Communication API

A basic prerequisite for parallel programming is a good communication API. There are many software packages which are optimized for parallel computation. Applications built using these packages pass messages between nodes to communicate with each other. Message-passing architectures are conceptually simple, but their operation and debugging can be quite complex. Two popular message-passing libraries are in common use:

•	Parallel Virtual Machine (PVM)
•	Message Passing Interface (MPI)

3.1.1 Parallel Virtual Machine

PVM is a freely available (http://www.epm.ornl.gov/pvm/pvm_home.htm), portable, message-passing library generally implemented on top of sockets. It is clearly established as the de facto standard for message-passing cluster parallel programming. PVM supports single-processor and SMP Linux machines, as well as clusters of Linux machines linked by socket-capable networks (e.g. SLIP, PLIP, Ethernet and ATM). In fact, PVM will even work across groups of machines in which a variety of different processors, configurations and physical networks are used (a heterogeneous cluster), even to the scale of treating machines linked by the Internet as a parallel cluster. PVM also provides facilities for parallel job control across a cluster [1,3,5,7]. It is important to note that PVM message-passing calls generally add significant overhead to standard socket operations, which already have high latency. Furthermore, the message handling calls themselves do not constitute a particularly "friendly" programming model.


3.1.2 Message Passing Interface (MPI)

The standardization of MPI began at the Williamsburg Workshop in April 1992 and was formally organized at Supercomputing '92 (November); the final version of the draft standard was released in May 1994. Although PVM is the de facto standard message-passing library, MPI (Message Passing Interface) is the relatively new official standard. MPI is implemented using standard networking primitives. It attempts to preserve the functionality needed by scientific applications while hiding the details of networking, sockets, etc. It is efficient, portable and functional for parallel implementations of programs. The features included in MPI that give it an edge over PVM are:

•	Completely separate address spaces and namespaces.
•	The library handles all network reliability/retransmission/handshake issues.
•	A simple (trivial) naming scheme for communicating participants.
•	Collective primitives (gather, scatter, broadcast, etc.).
•	User-defined data types, topologies and other advanced features.

In our research work we used the Message Passing Interface (MPI) as the communication API between nodes, because we found that MPI provides a bit more functionality than PVM [1,3,6].

3.1.2.1 Architecture of MPI

MPI has a large library of functions which provide extensive functionality to support the different branches of parallel computing. To be an efficient parallel programmer one needs to master all the parts of MPI. This is essential because there are different approaches to communication modes and architecture; each of them is distinct from the others and should be used intelligently to achieve parallel efficiency. The architectural features that MPI includes are [7]:

•	General
	- Communicators combine context and group for message security.
	- Thread safety.
•	Point-to-point communication
	- Structured buffers and derived data types.
	- Communication modes: normal (blocking and non-blocking), synchronous, ready, buffered.
•	Collective
	- Both built-in and user-defined collective operations.
	- Large number of data movement routines.
	- Subgroups defined directly or by topology.
•	Application-oriented process topologies
	- Built-in support for grids and graphs (uses groups).
•	Profiling
	- Hooks allow users to intercept MPI calls and install their own tools.
•	Environmental
	- Inquiry
	- Error control

MPI has over 126 library routines to support these various communication modes, but only six basic functions are enough to build a complete parallel application. These are:

•	MPI_Init() – initializes the MPI environment.
•	MPI_Send() – sends a message.
•	MPI_Recv() – receives a message.
•	MPI_Comm_size() – determines the size of the communicator (the number of processes in the environment).
•	MPI_Comm_rank() – determines the rank of the calling process.
•	MPI_Finalize() – finalizes MPI and cleans up all the objects.

The parameters of these routines vary between the different bindings of MPI; that is, implementations of MPI in different languages can have different parameter lists [4].

3.2 Software Bindings on MPI

The Message Passing Interface (MPI) has been implemented in different languages, and many different approaches have been taken to develop MPI libraries. Fortran and C were the first languages in which MPI packages were delivered. The important initiatives are:

•	MPICH (Message Passing Interface Chameleon) – in C.
•	mpiJava, Java/DSM and JavaPVM use Java as a wrapper for existing frameworks and libraries.
•	MPJ, jmpi, DOGMA, JPVM and JavaNOW use pure Java libraries.
•	HPJava, Manta, JavaParty and Titanium extend the Java language with new keywords and use a preprocessor or their own compiler to create Java (byte) code.
•	WebFlow, IceT and Javelin are web oriented and use Java applets to execute parallel tasks.

In our research work we carried out our experiments using different distributions of MPI. We used MPICH, mpiJava and HPJava, which together cover a wide range of approaches to parallel programming. Here we discuss the features, architecture and other issues of these bindings of MPI, and in Chapter 4 we discuss the considerations involved in writing programs on these distributions, along with some program listings that use them [4].

3.2.1 MPICH

MPICH is an open-source, portable implementation of the Message Passing Interface standard. It was developed by David Ashton, Anthony Chan, Bill Gropp, Rob Latham, Rusty Lusk, Rob Ross, Rajeev Thakur and Brian Toonen at Argonne National Laboratory. MPICH is a portable implementation of the full MPI specification for a wide variety of parallel and distributed computing environments. Along with the MPI library itself, MPICH contains a programming environment for working with MPI programs. The programming environment includes a portable startup mechanism, several profiling libraries for studying the performance of MPI programs, and an X interface to all of the tools. In MPICH, the six basic MPI functions are implemented with the following parameters:

•	MPI_Init(&argc, &argv);
•	MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
•	MPI_Comm_size(MPI_COMM_WORLD, &p);
•	MPI_Send(msg, msg_length, msg_type, dest, tag, MPI_COMM_WORLD);
•	MPI_Recv(msg, msg_length, msg_type, source, tag, MPI_COMM_WORLD, &status);
•	MPI_Finalize();

MPI_Init() initializes the MPI environment, and every program must start with this initialization. MPI_Comm_rank() determines the rank of the calling process. Its first parameter (MPI_COMM_WORLD) is the communicator object, which is a collection of processes that can send messages to and receive messages from each other; the second parameter (my_rank) stores the rank of the process. MPI_Comm_size() determines the number of processes in the environment. It also takes the communicator object as a parameter and stores the result in its second parameter (p). MPI_Send() and MPI_Recv() are responsible for sending and receiving messages between nodes; they take the message buffer, message length, message type and communicator object as parameters. The tag and status arguments are needed when sending or receiving a message because they ensure that the data sent or received is of the correct type and length and is matched correctly. By conditional branching according to the rank of the process, a program obtains the SPMD structure.
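To make the calling sequence concrete, the following short program is a minimal sketch (it is not one of the report's own listings): every process with a rank other than 0 sends a greeting string to process 0, which receives and prints the messages, using only the basic calls listed above.

/* greetings.c - illustrative sketch only: rank 0 collects a greeting
 * from every other process using the basic MPICH calls. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int my_rank, p, source, tag = 0;
    char msg[100];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (my_rank != 0) {
        /* every worker builds a message and sends it to process 0 */
        sprintf(msg, "Greetings from process %d!", my_rank);
        MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
    } else {
        /* process 0 receives one message from each other process */
        for (source = 1; source < p; source++) {
            MPI_Recv(msg, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
            printf("%s\n", msg);
        }
    }

    MPI_Finalize();
    return 0;
}

Such a program would be compiled with mpicc and launched with mpirun, as described in Section 3.2.1.2.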

3.2.1.1 Installation and Configuration of MPICH

In our research we used MPICH version 1.2.5, which is freely available from http://www.mcs.anl.gov/mpi/mpich/download.html . The installation of MPICH 1.2.5 requires a few steps.

1. Unpack mpich.tar.gz in the desired directory:
prompt $ gunzip -c mpich.tar.gz | tar xovf -
2. From the newly created directory mpich-1.2.5, run the configure script:
prompt $ ./configure
3. Then compile and install the distribution:
prompt $ make
4. Update the .bash_profile file by adding the following to the PATH environment variable, e.g. for the root user:
PATH=/root/mpich-1.2.5/util:/root/mpich-1.2.5/bin:$PATH
and for a user other than root (if the user is beowulf):
PATH=/home/beowulf/mpich-1.2.5/util:/home/beowulf/mpich-1.2.5/bin:$PATH
5. Modify the machines.LINUX file in /root/mpich-1.2.5/util/machines by writing the names of all nodes except the host node. For node00 of our four-node cluster, the file appears as:
node01
node02
node03
6. Restart the PC.

3.2.1.2 Testing the Beowulf Cluster and MPICH Configuration

To test the cluster configuration and MPICH, run an example from the mpich-1.2.5/examples/basic directory. The syntax for running a program is:
prompt $ mpirun -np 4 hello
Here mpirun is the program which executes the program on the nodes of the cluster, -np 4 determines the number of processes involved in the execution, and hello is the executable binary compiled from the source program. The detailed process of compiling and executing a parallel program is:

•	Compile the program using mpicc, which includes all the necessary header files and does all the necessary linking with the library files, e.g.
prompt $ mpicc -o hello hello.c
•	Run the program:
prompt $ mpirun -np 4 hello
•	Successful greetings from all of the nodes confirm the installation [5].


3.2.2 mpiJava

mpiJava was developed at Syracuse University by Bryan Carpenter, Mark Baker, Geoffrey Fox and Guansong Zhang. The existing MPI standards specify language bindings for Fortran, C and C++, and this approach implements a Java API on top of the Message Passing Interface Chameleon (MPICH). More precisely, mpiJava is a Java interface which binds Java Native Interface (JNI) C stubs to the underlying native MPI C interface; that is, mpiJava uses Java wrappers to invoke the C MPI calls through the JNI [1,11]. mpiJava runs parallel Java programs on top of MPICH through the Java Virtual Machine (JVM). The architecture stack of the environment when running a parallel Java program using mpiJava is shown in Figure 3.1.

3.2.2.1 Class Hierarchy of mpiJava

The existing MPI standard is explicitly object-based. The C and Fortran bindings rely on "opaque objects" that can be manipulated only by acquiring object handles from constructor functions and passing the handles to suitable functions in the library. The C++ bindings specified in the MPI-2 standard collect these objects into suitable class hierarchies and define most of the library functions as class member functions. The mpiJava API follows this model, lifting the structure of its class hierarchy directly from the C++ binding.

Figure 3.1: Execution stack of a parallel Java program using mpiJava. On each node (node 0, node 1, node 2) the stack consists of the Java application, the Java Virtual Machine (JVM), mpiJava, MPICH, the OS (Linux), the protocol layer (TCP/UDP) and the Ethernet card.


The class MPI only has static members. It acts as a module containing global services, such as initialization of MPI, and many global constants including the default communicator COMM_WORLD [6]. The most important class in the package is the communicator class Comm. All communication functions in mpiJava are members of Comm or its subclasses. As usual in MPI, a communicator stands for a "collective object" logically shared by a group of processors. The processes communicate, typically, by addressing messages to their peers through the common communicator. The principal classes of mpiJava are shown in Figure 3.2.

Figure 3.2: Class hierarchy of mpiJava. The MPI package contains the classes MPI, Group, Datatype, Status and Request (with subclass Prequest), and the communicator class Comm with subclasses Intracomm (specialized further as Cartcomm and Graphcomm) and Intercomm.

Another important class of mpiJava is the Datatype class. This describes the type of the elements in the message buffers passed to send, receive and all the other communication functions. Various datatypes are predefined in the package [1,2]. These mainly correspond to the primitive types of Java; the correspondence between mpiJava datatypes and Java types, used across the Java Native Interface (JNI), is shown in Table 3.1.


3.2.2.2 API of mpiJava

Table 3.1: Basic datatypes of mpiJava

MPI datatype   Java datatype
MPI.BYTE       byte
MPI.CHAR       char
MPI.SHORT      short
MPI.BOOLEAN    boolean
MPI.INT        int
MPI.LONG       long
MPI.FLOAT      float
MPI.DOUBLE     double
MPI.PACKED     (none)

There are some basic communication APIs in mpiJava which are used to develop parallel programs on the Java platform. The functions MPI.Init(args) and MPI.Finalize() must be used at the start and end of the program: Init() initializes MPI for the current environment, and Finalize() finalizes MPI and frees memory by destroying all the communicator objects that were created for communication purposes.

In basic message passing, the processes coordinate their activities by explicitly sending and receiving messages. The standard send and receive operations of MPI are members of Comm or its subclasses [8]:

•	public void Send(Object buf, int offset, int count, Datatype datatype, int dest, int tag)
•	public Status Recv(Object buf, int offset, int count, Datatype datatype, int source, int tag)

where buf is an array of primitive type or class type (if the elements of buf are objects, they must be serializable), offset is the starting point of the message within the buffer, the Datatype object describes the type of the elements, count is the number of elements to be sent or received, dest and source are the ranks of the destination and source processes, and tag is used to identify the message.

One issue that needs to be addressed is that the commands executed by process 0 (the send operation) will be different from those executed by process 1 (the receive operation). However, this does not mean that the programs need to be different. By conditional branching according to the rank of the process, the program can obtain the SPMD paradigm [1,3]. E.g.:

my_process_rank = MPI.COMM_WORLD.Rank();
if (my_process_rank == 0)
    MPI.COMM_WORLD.Send(buf, offset, count, datatype, 1, tag);
else if (my_process_rank == 1)
    MPI.COMM_WORLD.Recv(buf, offset, count, datatype, 0, tag);

3.2.2.3 Installation and Configuration of mpiJava

1. Install your preferred Java programming environment. In our research we used j2sdk1.4.1_03. After the Java JDK is installed successfully, add the Java JDK `bin' directory to the PATH setting of .bash_profile, so that the `mpiJava/configure' script can find the `java', `javac', and `javah' commands.
2. Install your preferred MPI software and add the MPI `bin' directory to your PATH setting. Test the MPI installation before attempting to install mpiJava. We used MPICH 1.2.5 as our MPI software.
3. Now install the mpiJava interface.
Step 1. Unpack the software, e.g.
prompt $ gunzip -c mpiJava-x.x.x.tar.gz | tar -xvf -
A subdirectory `mpiJava/' is created.
Step 2. Go to the `mpiJava/' directory and configure the software for the platform:
prompt $ ./configure
Step 3. Build (compile) the software:
prompt $ make
After successful compilation, the makefile will put the generated class files in the directory `lib/classes/mpi/', and also place a native dynamic library in the directory `lib/'. Now:
Add the mpiJava `src/scripts' directory to your PATH environment variable.
Add the mpiJava `lib/classes' directory to your CLASSPATH environment variable.
Add the mpiJava `lib' directory to your LD_LIBRARY_PATH (Linux, Solaris, etc.) or LIBPATH (AIX) environment variable.
Step 4. Test the installation:
prompt $ make check

During the installation of mpiJava we encountered several problems. We found some errors in the configuration scripts of mpiJava and faced some puzzling situations. These are:

•	After configuring mpiJava, go to the mpiJava/src/C directory, open the makefile in any editor and correct the entry for mpicc. By default it should be /root/mpich-1.2.5/bin/mpicc.

•	mpiJava runs parallel Java byte code through the `prunjava' command, a Java wrapper over the MPICH bindings. While using prunjava we faced the difficulty that our programs sometimes crashed, reporting that a "filename.jig" file was not found. We discovered that the *.jig file is nothing but a file which includes all the necessary path environment variables, library paths and the commands to execute the program in parallel. The error occurs because when MPI tries to execute a program on a child node through a remote command (using rsh, rlogin, etc.), it cannot find the appropriate path settings; as a result it cannot generate the *.jig file and the program crashes. To solve this problem we developed our own Java wrapper over MPICH and named it "ewurun". The syntax for running a Java program is
prompt $ ewurun 4 hello
where 4 is the number of processes to be involved in the execution and hello is the class to run. Here is the sample Java wrapper over MPI, ewurun, which we used to execute parallel Java programs. This wrapper may vary according to the application types and requirements.

PNUMBER=$1
CLASSNAME=$2
cat > $CLASSNAME.kab << EOF
CLASSPATH=/usr/java/mpiJava/lib/classes
export CLASSPATH
LD_LIBRARY_PATH=/root/mpiJava/lib
export LD_LIBRARY_PATH
exec /usr/java/j2sdk1.4.1_03/bin/java $CLASSNAME $CLASSNAME \$*
EOF
chmod a+x $CLASSNAME.kab
rcp $CLASSNAME.class $CLASSNAME.kab node01:/root
rcp $CLASSNAME.class $CLASSNAME.kab node02:/root
rcp $CLASSNAME.class $CLASSNAME.kab node03:/root
mpirun -pg -np $PNUMBER $CLASSNAME.kab

3.2.3 HPJava +, -_._ __ ________ __ _ ____' ___*___ "___ ___"_'_._'__((____ _!_ _____ __ +__! -_._ ___*___ S__! (___"* $___ ___*5_*____! _ _"!__"_____ _ R!_T'_ _3_____' .__ *'_ T ' __ S'__3_____ $_ ' N'_____ %/ _ ___ ,__T'_(__5_ N'__1___3 ____ T'T_+, -_ ______ __ _ _ ____S _+ ,N 'T "___ _ ____ +, -_._ __ ________ __ _ ____' ___*___ "___ ___"_ _ ' __._'((._ _3__$___ _____ !_ ___%/ ____ _ +__! -_._ ___*___ S__! (___"* $___ ___*5_*____! _ _"!__"_____ _ R!_T'_ _3_____' __ ___ *'__T '_ S' ' __N'_ ___ ,__T'_(__5_ N'__1___3 ____ T'T_+, -_ ______ __ _ _ ____S _+ ,N 'T "___ _ ____

HPJava is designed as a language for parallel computing and was developed by Bryan Carpenter, Han Ku Lee, Sang Boem Lim, Geoffrey Fox and Guansong Zhang at Pervasive Technology Labs, Indiana University. It extends the standard Java language with syntax for manipulating a new kind of parallel data structure, the distributed array. It has a realistic implementation and supports the SPMD architecture. Process management and communication between nodes are more classified and organized. In HPJava a "process" is always a Java virtual machine, so an HPJava program has a self-contained context, where each program executes its own thread (or threads) of control and has its own protected memory and associated address space [9]. This original model of programming has been adopted in HPJava. What sets HPJava apart is that it has two execution modes: a multi-process model of execution and a multi-threaded model of execution. Many other advanced features included in HPJava make it an efficient parallel programming language.

3.2.3.1 Installing HPJava

Installation of HPJava is an easy process.
•	Get the binary package from http://hpjava.org/download.htm .
•	To install, type at the shell prompt:
tar xvfz hpjdk-1.0.tar.gz
•	Add the following lines to the environment settings:
HPJAVA_HOME=/root/hpjdk
CLASSPATH="$HPJAVA_HOME/classes:$HPJAVA_HOME/classes/multithreaded"
PATH=/root/hpjdk/bin:$PATH [10]

3.2.3.2 Compiling and Running HPJava Programs

To compile an HPJava program use the command hpjavac, e.g.
prompt $ hpjavac hello.hpj
and to run the program use the traditional java command with an additional parameter, e.g.
prompt $ java -Dhpjava.numprocs=4 Hello (in the multithreaded model)


3.3 References
[1] Baker M., "mpiJava: A Java MPI Interface." EuroPar 98, Southampton, UK, September 1998.
[2] Parallel Compiler Runtime Consortium. HPCC and Java - a report by the Parallel Compiler Runtime Consortium. http://www.npac.syr.edu/users/gcf/hpjava3.html, May 1996.
[3] Geoffrey C. Fox, editor. Java for Computational Science and Engineering - Simulation and Modelling, volume 9(6) of Concurrency: Practice and Experience, June 1997.
[4] Geoffrey C. Fox, editor. Java for Computational Science and Engineering - Simulation and Modelling II, volume 9(11) of Concurrency: Practice and Experience, November 1997.
[5] Geoffrey C. Fox, editor. ACM 1998 Workshop on Java for High-Performance Network Computing, Palo Alto, February 1998. Concurrency: Practice and Experience, 1998. To appear. http://www.cs.ucsb.edu/conferences/java98.
[6] Vladimir Getov, Susan Flynn-Hummel, and Sava Mintchev. High-performance parallel programming in Java: Exploiting native libraries. In ACM 1998 Workshop on Java for High-Performance Network Computing, Palo Alto, February 1998. Concurrency: Practice and Experience, 1998. To appear.
[7] George Crawford III, Yoginder Dandass, and Anthony Skjellum. The jmpi commercial message passing environment and specification: Requirements, design, motivations, strategies, and target users. http://www.mpisofttech.com/publications.
[8] Dincer K., "jmpi and a Performance Instrumentation Analysis and Visualization Tool for jmpi." First UK Workshop on Java for High Performance Network Computing, EuroPar 98, Southampton, UK, September 2-3, 1998.
[9] "The IceT Project".
[10] Gray P. and Sunderam V., "The IceT Environment for Parallel and Distributed Computing." Proceedings of ISCOPE 97 (Springer Verlag), Marina del Rey, CA, December 1997.
[Foster95] Foster I., "Designing and Building Parallel Programs". Addison-Wesley Publishing Company Inc., New York, 1995.

Chapter 4

Writing Parallel Application


4. Parallel Application Architecture

In order to run an application in parallel on multiple CPUs, it must be explicitly broken into concurrent parts. A standard single-CPU application will run no faster on multiple processors than it does on a single processor. There are some tools and compilers that can break up programs, but parallelizing code is not a "plug and play" operation. Depending on the application, parallelizing code can be easy, extremely difficult, or in some cases impossible due to algorithm dependencies. Before jumping into writing parallel applications, we should discuss an important issue: the difference between concurrent and parallel. Concurrent parts of a program are those that can be computed independently. Parallel parts of a program are those concurrent parts that are executed on separate processing elements at the same time.

The distinction is very important, because concurrency is a property of the program and efficient parallelism is a property of the machine. Ideally, parallel execution should result in faster performance. The limiting factor in parallel performance is the communication speed and latency between compute nodes. Parallel applications are not so simple, and executing concurrent parts of a program in parallel may actually cause the program to run slower, offsetting any performance gains in other concurrent parts. In simple terms, the cost of communication time must pay for the savings in computation time; otherwise the parallel execution of the concurrent part is inefficient. The task of the programmer is to determine which concurrent parts of the program should be executed in parallel and which should not. The answer to this will determine the efficiency of the application.
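This trade-off can be sketched with a simple inequality. Under the simplifying assumption that a concurrent part needing computation time T_comp is split evenly over p processors at a total communication cost T_comm (both symbols introduced here only for illustration), parallel execution of that part pays off only when

\frac{T_{comp}}{p} + T_{comm} < T_{comp}
\qquad \Longleftrightarrow \qquad
T_{comm} < T_{comp}\left(1 - \frac{1}{p}\right).

If the inequality fails, that concurrent part is better left sequential.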


The following graph summarizes the situation for the programmer:

[Figure 5.1: Application Performance Graph - percentage of the application (y-axis) against the ratio of communication time to processing time (x-axis).]

In a perfect parallel computer the ratio of communication to processing would be equal, and anything that is concurrent could be implemented in parallel. Unfortunately, real parallel computers, including shared-memory machines, are subject to the effects described in this graph. When designing a Beowulf, the user may want to keep this graph in mind, because parallel efficiency depends upon the ratio of communication time to processing time for a specific parallel computer. Applications may be portable between parallel computers, but there is no guarantee they will be efficient on a different platform.

There is yet another consequence of the above graph. Since efficiency depends upon the communication/processing ratio, changing only one component of the ratio does not necessarily mean a specific application will perform faster. A change in processor speed, while keeping the communication speed the same, may have non-intuitive effects on your program. For example, doubling or tripling the CPU speed while keeping the communication speed the same may mean that some previously efficient parallel portions of your program would now run faster if they were executed sequentially; that is, it may now be faster to run the previously parallel parts as sequential. Furthermore, running inefficient parts in parallel will actually keep your application from reaching its maximum speed. Thus, adding a faster processor may actually slow down the application. So, to gain efficiency in a parallel program, the programmer must keep the following things in mind:

1. Determine the concurrent parts of the program.
2. Design an appropriate parallel algorithm for the program.
3. Minimize communication between nodes.
4. Estimate the parallel efficiency.

In our research, we use the Message Passing Interface (MPI) as our communication software. Using the different communication modes, parallel libraries and topologies of MPI can help to improve the performance of a parallel application. As stated before, determining the concurrent and parallel parts of a program plays a vital role in its efficiency. In MPI, this can be achieved efficiently and effectively by conditional branching within the program. Later we will show that the performance of a parallel program depends largely on the application architecture.
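As a rough illustration of this idea (not code from our project; the problem size N and the sum-of-squares work are hypothetical), the following sketch shows how conditional branching on the process rank lets every node run the same program while computing only its own concurrent slice:

#include <stdio.h>
#include "mpi.h"

#define N 1000000               /* hypothetical problem size */

int main(int argc, char *argv[])
{
    int rank, p;
    long i, lo, hi;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* conditional branching on the rank: every process runs the same
       program, but each works only on its own block of the data */
    lo = (long) rank * N / p;
    hi = (long) (rank + 1) * N / p;
    for (i = lo; i < hi; i++)
        local += (double) i * i;          /* some concurrent computation */

    /* the only communication paid for is combining the partial results */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)                        /* only process 0 reports */
        printf("sum of squares below %d = %.0f\n", N, total);

    MPI_Finalize();
    return 0;
}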

4.1 Writing Parallel Programs in MPICH

Here is the first program we have written in MPICH, the HelloWorld program.

4.1.1 HelloWorld.c

#include <stdio.h>
#include <string.h>
#include "mpi.h"


main(int argc, char *argv[]){
    int my_rank;
    int p;
    int source;
    int dest;
    int tag = 0;
    char message[100];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    if(my_rank != 0){
        sprintf(message, "Hello from process %d!", my_rank);
        dest = 0;
        MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    } else {
        /* receive the greeting from every other process and print it */
        for(source = 1; source < p; source++){
            MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    }
    MPI_Finalize();
    return 0;
}
4.2 A Parallel Java Application using mpiJava: Matrix-Matrix Multiplication

mpiJava is an interface over the Message Passing Interface, and message passing is a model for interconnection between processors within a parallel system, where a message is constructed by software on one processor and is sent through an interconnection network


to another processor, which then must accept and act upon the message contents. So, to develop programs in Java for a parallel platform, the prerequisite is knowledge of the class structure of mpiJava and of the mpiJava API. Consider the problem of computing C = AxB, where A and B are dense matrices of size MxN and NxP (the resultant matrix C has size MxP). We know that matrix-matrix multiplication involves O(N^3) operations, because each element C_ij of C is equal to:

C_{ij} = \sum_{k=0}^{N-1} A_{i,k} \, B_{k,j}

We are looking for an efficient parallel algorithm for matrix-matrix multiplication using mpiJava. Consequently, we examined several algorithms developed for this purpose. First, consider a one-dimensional, row-wise decomposition such that each processor is responsible for all computation associated with the C_ij's in its assigned rows. Each task requires all of matrix B to compute the C_ij corresponding to its rows of A. The algorithm performs very well when N is much larger than the number of processors P [4]. In a two-dimensional decomposition each task requires an entire row A_i,* and column B_*,j of A and B respectively. The one-dimensional decomposition requires N^2/P data and the two-dimensional decomposition requires N^2/P^(1/2). Because of the way Java manages arrays, we consider a one-dimensional row-wise decomposition, as shown in Figure 4.1.

Figure 4.1: One-dimensional row-wise decomposition

The input matrices A and B are initially available on process zero. Matrix A is sliced row-wise (N/p rows per process) and each set of rows of A, together with the entire matrix B, is sent to the processes. After computing, the result matrix C is returned to process zero. Each process computes the sum of products


of its set of rows of A and all the columns of B, resulting in its rows of C. The SPMD paradigm for matrix-matrix multiplication can be achieved by the following code:

for ( i = myRank*N/p ; i < (myRank+1)*N/p ; i++){
    for ( j = 0 ; j < N ; j++){
        c[i][j] = 0;
        for ( k = 0 ; k < N ; k++){
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

The successive phases of the computation are illustrated in Figure 4.2.

[Figure 4.2: Computation phases of the mpiJava program - send rows of A and matrix B, compute local C (row of A x column of B), receive C at process zero.]

4.2.1 Source Code of Matrix-Matrix Multiplication using Send and Receive

import mpi.*;
import java.util.*;

public class matrixmul {
  static int N = 400;
  public static void main(String args[]) throws MPIException {
    double startwtime = 0.0, endwtime;
    int me, p;
    int a[][] = new int[N][N];
    int b[][] = new int[N][N];
    int c[][] = new int[N][N];
    int store[] = new int[N];
    int i, j, k;
    MPI.Init(args);
    p = MPI.COMM_WORLD.Size();
    me = MPI.COMM_WORLD.Rank();
    if(me==0) {
      startwtime = MPI.Wtime();
    }
    for(i=0;i

4.2.2 Source Code of Matrix-Matrix Multiplication using Scatter-Gather

import mpi.*;
import java.util.*;

public class scgt {
  static int N = 400;
  public static void main(String args[]) throws MPIException {
    double startwtime = 0.0, endwtime;
    int me, p, i, j, k;
    int a[][] = new int[N][N];
    int b[][] = new int[N][N];
    int c[][] = new int[N][N];
    MPI.Init(args);
    p = MPI.COMM_WORLD.Size();
    me = MPI.COMM_WORLD.Rank();
    for(i=0;i
    {
      for(j=0;j
4.2.3 Result of Matrix-Matrix Multiplication

Using the approach listed first (Send and Receive), we tested the implementation and obtained the results shown in Table 2.

Table 2: Numerical results for matrix-matrix multiplication (execution time in seconds)

Np/Size    2x2        128x128    256x256    400x400
1          2.59E-5    0.047      0.431      1.998
2          5.24E-5    0.029      0.222      1.034
3          4.89E-5    0.022      0.154      0.681
4          5.20E-5    0.019      0.110      0.504


4.3 Writing Parallel Applications in HPJava

In HPJava we try to solve an imaging problem: we use the Fourier Transform to enhance a Portable Graymap image. This is an elementary attempt at this type of application, and we perform only a very basic level of image enhancement.

4.3.1 Image Enhancement using the Fourier Transform

The Fourier Transform is an important image processing tool which is used to decompose an image into its sine and cosine components. The output of the transformation represents the image in the Fourier or frequency domain, while the input image is the spatial domain equivalent. In the Fourier domain image, each point represents a particular frequency contained in the spatial domain image.

The Fourier Transform is used in a wide range of applications, such as image analysis, image filtering, image reconstruction and image compression.[8].

4.3.1.1 How the Fourier Transform Works

As we are only concerned with digital images, we will restrict this discussion to the Discrete Fourier Transform (DFT) [8].

The DFT is the sampled Fourier Transform and therefore does not contain all frequencies forming an image, but only a set of samples which is large enough to fully describe the spatial domain image. The number of frequencies corresponds to the number of pixels in the spatial domain image, i.e. the image in the spatial and Fourier domains are of the same size.


For a square image of size N×N, the two-dimensional DFT is given by:
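The standard form of this transform, written here with no normalization factor on the forward transform (other conventions place a 1/N or 1/N^2 factor here), is:

F(k,l) \;=\; \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} f(i,j)\, e^{-\mathrm{i}\, 2\pi \left( \frac{k i}{N} + \frac{l j}{N} \right)},
\qquad k, l = 0, 1, \dots, N-1 .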

where f(i,j) is the image in the spatial domain and the exponential term is the basis function corresponding to each point F(k,l) in the Fourier space. The equation can be interpreted as: the value of each point F(k,l) is obtained by multiplying the spatial image with the corresponding base function and summing the result.

The basis functions are sine and cosine waves with increasing frequencies, i.e. F(0,0) represents the DC-component of the image which corresponds to the average brightness and F(N-1,N-1) represents the highest frequency.[8]

In a similar way, the Fourier image can be re-transformed to the spatial domain. The inverse Fourier transform is given by:
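With the convention above (all of the normalization placed on the inverse), the standard inverse transform reads:

f(i,j) \;=\; \frac{1}{N^{2}} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} F(k,l)\, e^{\,\mathrm{i}\, 2\pi \left( \frac{k i}{N} + \frac{l j}{N} \right)} .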

To obtain the result for the above equations, a double sum has to be calculated for each image point. However, because the Fourier Transform is separable, it can be written as two successive one-dimensional transforms:
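A sketch of the standard separable form, using the same convention as above and writing the intermediate (row-transformed) image as P(i,l) - a symbol introduced here only for illustration - is:

F(k,l) \;=\; \sum_{i=0}^{N-1} P(i,l)\, e^{-\mathrm{i}\, 2\pi k i / N},
\qquad \text{where} \qquad
P(i,l) \;=\; \sum_{j=0}^{N-1} f(i,j)\, e^{-\mathrm{i}\, 2\pi l j / N} .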


Using these two formulas, the spatial domain image is first transformed into an intermediate image using N one-dimensional Fourier Transforms. This intermediate image is then transformed into the final image, again using N one-dimensional Fourier Transforms. Expressing the two-dimensional Fourier Transform in terms of a series of 2N one-dimensional transforms decreases the number of required computations.

Even with these computational savings, the ordinary one-dimensional DFT has O(N^2) complexity. This can be reduced to O(N log2 N) if we employ the Fast Fourier Transform (FFT) to compute the one-dimensional DFTs. This is a significant improvement, in particular for large images. There are various forms of the FFT and most of them restrict the size of the input image that may be transformed, often to N = 2^n where n is an integer. The mathematical details are well described in the literature [8]. The Fourier Transform produces a complex-valued output image which can be displayed with two images, either with the real and imaginary parts or with magnitude and phase.

In image processing, often only the magnitude of the Fourier Transform is displayed, as it contains most of the information of the geometric structure of the spatial domain image. However, if we want to re-transform the Fourier image into the correct spatial domain after some processing in the frequency domain, we must make sure to preserve both magnitude and phase of the Fourier image.

The Fourier domain image has a much greater range than the image in the spatial domain. Hence, to be sufficiently accurate, its values are usually calculated and stored in float values.


The total enhancement process can be summarized as follows: the input image f(x,y) is preprocessed, Fourier transformed to F(u,v), multiplied by a filter function H(u,v), the product H(u,v)F(u,v) is inverse Fourier transformed, and the result is postprocessed to give the enhanced image g(x,y).

[Figure 4.3: Basic steps for enhancement of an image using the Fourier Transform [8] - preprocessing, Fourier transform, filter function H(u,v), inverse Fourier transform, postprocessing.]

4.3.1.2 Implementation

We have studied a 2D FFT program written in HPJava. The original program was written by Bhaven Avalani. As the format of the image is Portable Graymap (PGM), the enhancement process is easier. In the PGM format the information is stored as below:

P2
256 256
255
209 227 222 217 225 217 234 221 217 224 211 213 226 229 216 232 234 209 232 194 237 224 231 218 231 207 237 207 224 217 229 232 231 226 217 208 194 231 202 217 217 180 175 222 223 222 198 194 222 208 206 ...

So in the enhancement process it is easy to find the frequency range and to detect the higher and lower frequency modes. By removing these modes, an image can be extensively enhanced. The sample output of the program shows the original image and the filtered image side by side [8].


4.3.1.3 Source Code

Wolf.hpj

import java.io.* ;
import java.awt.* ;
import java.awt.event.* ;
import javax.swing.* ;
import javax.swing.event.* ;
import javax.swing.text.* ;
import hpjava.adlib.Adlib ;

public class Wolf implements HPspmd { String fileName ; int itrunc ; Procs0 control ; Procs1 p ; BeforeAndAfter pictures ; public static void main(String [] argv) { new Wolf().run() ; // Create an instance and run it, because I want to use // fields as global variables. If there was no instance,


// that would imply static fields; but that isn't // consistent with the multithreaded model of HPJava // execution. } public void run() { control = new Procs0() ; p = new Procs1() ; MyFrame gui = null ; on(control) { gui = new MyFrame() ; gui.show() ; gui.fetchParams() ;

// Wait for GUI to set `fileName'.

} int [[*,-]] image = readImage(fileName) ; int nx = image.rng(0).size() ; Range y = image.rng(1) ; while(true) { int ny = y.size() ; on(control) { gui.fetchParams() ;

// Wait for GUI to set `itrunc'.

int maxtrunc = Math.min(nx, ny) ; if(itrunc > maxtrunc) { System.out.println( "WARNING: Can't delete that many modes " + "(max " + maxtrunc + "). " + "Deleting " + maxtrunc + " modes.") ; itrunc = maxtrunc ; } else if(itrunc < 0) { System.out.println("WARNING: Number of modes to delete " + "can't be negative. " + "Deleting 0 modes.") ; itrunc = 0 ; } }


itrunc = Adlib.broadcast(itrunc, control) ; int [[*,-]] filtered = new int [[nx, y]] on p ; on(p) { Range x = new BlockRange(nx, p.dim(0)) ; float [[*,-]] reA = new float [[nx, y]], imA = new float [[nx, y]] ; overall(j = y for :) for(int i = 0 ; i < nx ; i++) { reA [i, j] = image [i, j] ; imA [i, j] = 0 ; } float norm = 1.0F / (nx * ny) ; overall(j = y for :) fft1d(reA [[:, j]], imA [[:, j]], 1) ; float [[-,*]] reB = new float [[x, ny]], imB = new float [[x, ny]] ; Adlib.remap(reB, reA) ; Adlib.remap(imB, imA) ; overall(i = x for :) fft1d(reB [[i, :]], imB [[i, :]], 1) ; // Scale magnitude. overall(i = x for :) for(int j = 0 ; j < ny ; j++) { reB [i, j] *= norm ; imB [i, j] *= norm ; } // Throw away modes in the middle of the frequency range. int ioff = (nx - itrunc) / 2 ; int joff = (ny - itrunc) / 2 ; overall(i = x for ioff : ioff + itrunc - 1) for(int j = joff ; j < joff + itrunc ; j++) { reB [i, j] = 0 ; imB [i, j] = 0 ; }


// // Throw away Low frequency modes... // // overall(i = x for 0 : itrunc - 1) // for(int j = 0 ; j < itrunc ; j++) { // reB [i, j] = 0 ; // imB [i, j] = 0 ; // } // // Throw away High frequency modes // // overall(i = x for nx - itrunc : nx - 1) // for(int j = 0 ; j < ny ; j++) { // reB [i, j] = 0 ; // imB [i, j] = 0 ; // } // // overall(i = x for 0 : nx - itrunc - 1) // for(int j = ny - itrunc ; j < ny ; j++) { // reB [i, j] = 0 ; // imB [i, j] = 0 ; // } // Do Inverse transform. overall(i = x for :) fft1d(reB [[i, :]], imB [[i, :]], -1) ; // Transpose Adlib.remap(reA, reB) ; Adlib.remap(imA, imB) ; overall(j = y for :) fft1d(reA [[:, j]], imA [[:, j]], -1) ; // Extract output image. overall(j = y for :) for(int i = 0 ; i < nx ; i++) { float re = reA [i, j] ; float im = imA [i, j] ; filtered [i, j] = (int) Math.sqrt(re * re + im * im) ; } } drawImage(filtered, Adlib.maxval(filtered)) ; } }


void fft1d(float [[*]] re, float [[*]] im, int isgn) { // One-dimensional FFT on sequential (non-distributed) data. // isgn = +1 or -1 for forward or reverse transform. final float pi = (float) Math.PI ; final int N = re.rng(0).size() ; int nv2 = N / 2 ; int nm1 = N - 1 ; // Sort by bit-reversed ordering. for(int index = 0, jndex = 0 ; index < nm1 ; index++) { if(jndex > index) { // Swap entries float tmpRe = re [jndex] ; float tmpIm = im [jndex] ; re [jndex] = re [index] ; im [jndex] = im [index] ; re [index] = tmpRe ; im [index] = tmpIm ; } int m = nv2 ; while ((m >= 2) && (jndex >= m)) { jndex = jndex - m ; m=m/2; } jndex = jndex + m ; } int ln2 = ilog2(N) ; // Base 2 log of the leading dimension. // Danielson-Lanczos algorithm for FFT. for(int ilevel = 1 ; ilevel <= ln2 ; ilevel++) { int le = ipow(2,ilevel) ; int lev2 = le / 2 ; float uRe = 1.0F ; float uIm = 0.0F ;


float wRe = (float) Math.cos(isgn * pi / lev2) ; float wIm = (float) Math.sin(isgn * pi / lev2) ; for(int jj = 0 ; jj < lev2 ; jj++) { for(int ii = jj ; ii < N ; ii += le) { int jndex = ii + lev2 ; int index = ii ; //tmp = u * a(jndex) ; float tmpRe = uRe * re [jndex] - uIm * im [jndex] ; float tmpIm = uRe * im [jndex] + uIm * re [jndex] ; //a(jndex) = a(index) - tmp ; re [jndex] = re [index] - tmpRe ; im [jndex] = im [index] - tmpIm ; //a(index) = a(index) + tmp ; re [index] = re [index] + tmpRe ; im [index] = im [index] + tmpIm ; } //tmp = u * w ; float tmpRe = uRe * wRe - uIm * wIm ; float tmpIm = uRe * wIm + uIm * wRe ; //u = tmp ; uRe = tmpRe ; uIm = tmpIm ; } } } static int ipow(int i, int j) { int k, tmp ; tmp = 1 ; for(k = 1 ; k <= j ; k++) tmp = tmp * i ; return tmp ; } static int ilog2(int n) { int i, n2, result ; n2 = n ; result = 0 ; for(i = 1 ; i <= n ; i++) { if(n2 > 1) { result = result + 1 ;


n2 = n2 / 2 ; } else break ; } return result ; } int [[*,-]] readImage(String fileName) { try { StreamTokenizer tokens = null ; int nx = 0, ny = 0, maxin = 0 ; on(control) { tokens = new StreamTokenizer(new FileReader(fileName)) ; if(tokens.nextToken() != StreamTokenizer.TT_WORD || !tokens.sval.equals("P2")) { System.out.println("Bad file format.") ; System.exit(1) ; } getNumber(tokens) ; nx = (int) tokens.nval ; getNumber(tokens) ; ny = (int) tokens.nval ; getNumber(tokens) ; maxin = (int) tokens.nval ; } nx = Adlib.broadcast(nx, control) ; ny = Adlib.broadcast(ny, control) ; int [[*,*]] local = new int [[nx, ny]] on control ; on(control) { for(int i = 0 ; i < nx ; i++) for(int j = 0 ; j < ny ; j++) { getNumber(tokens) ; local [i, j] = (int) tokens.nval ; } pictures = new BeforeAndAfter("Original data", local, maxin) ; java.awt.Dimension d1 = pictures.getToolkit().getScreenSize();


pictures.setLocation(d1.width/2 - pictures.getWidth()/2, d1.height/2 - pictures.getHeight()/2); pictures.show() ; } Range y = new BlockRange(ny, p.dim(0)) ; int [[*,-]] result = new int [[nx, y]] on p ; Adlib.remap(result, local) ; return result ; } catch(IOException e) { System.out.println(e.getMessage()) ; e.printStackTrace() ; System.exit(1) ; return null ; } } static void getNumber(StreamTokenizer tokens) throws IOException { if(tokens.nextToken() != StreamTokenizer.TT_NUMBER) { System.out.println("Bad file format.") ; System.exit(1) ; } }

void drawImage(int [[*,-]] image, final int maxin) { int nx = image.rng(0).size(), ny = image.rng(1).size() ; final int [[*,*]] local = new int [[nx, ny]] on control ; Adlib.remap(local, image) ; on(control) { final BeforeAndAfter pix = pictures ; // Need a final variable for use in anonymous class. SwingUtilities.invokeLater(new Runnable() { public void run() { pix.setAfter(local, maxin) ; }


}) ; } } class MyFrame extends JFrame implements ActionListener { MyFrame() { setTitle("HPJava 2D Fourier Transform") ; setSize(500, 400) ; addWindowListener(new WindowAdapter() { public void windowClosing(WindowEvent e) { System.exit(0) ; } }) ; Container content = getContentPane() ; Box p = Box.createVerticalBox() ; p.add(new JLabel("File to Load:")) ; Box loading = Box.createHorizontalBox() ; fileField = new JTextField("wolf.pgm", 20) ; loading.add(fileField) ; loadButton = new JButton("Load") ; loadButton.addActionListener(this) ; loading.add(loadButton) ; p.add(loading) ; p.add(new JLabel("Number of Fourier modes to remove:")) ; Box truncating = Box.createHorizontalBox() ; modesField = new JTextField("220", 5) ; modesField.setEnabled(false) ; truncating.add(modesField) ; goButton = new JButton("Run") ; goButton.setEnabled(false) ; goButton.addActionListener(this) ; truncating.add(goButton) ; p.add(truncating) ; content.add(p) ;


pack() ; } public synchronized void actionPerformed(ActionEvent evt) { try { if(evt.getSource() == loadButton) { Document file = fileField.getDocument() ; fileName = file.getText(0, file.getLength()) ; fileField.setEnabled(false) ; loadButton.setEnabled(false) ; modesField.setEnabled(true) ; goButton.setEnabled(true) ; } else { Document modes = modesField.getDocument() ; String modesText = modes.getText(0, modes.getLength()) ; try { itrunc = Integer.parseInt(modesText.trim()) ; } catch(NumberFormatException e) { Toolkit.getDefaultToolkit().beep() ; return ; } } if(waiting) notify() ; else { waiting = true ; try { wait() ; // Shouldn't have to wait long. } catch(InterruptedException e) {} // Shouldn't happen. waiting = false ; } } catch(BadLocationException e) { Toolkit.getDefaultToolkit().beep() ; }


} public synchronized void fetchParams() { if(waiting) notify() ; else { waiting = true ; try { wait() ; } catch(InterruptedException e) {}

// Shouldn't happen.

waiting = false ; } } private JTextField fileField, modesField ; private JButton goButton, loadButton ; private boolean waiting = false ; } class BeforeAndAfter extends JFrame { BeforeAndAfter(String title, int [[*,*]] data, int maxin) { addWindowListener(new WindowAdapter() { public void windowClosing(WindowEvent e) { System.exit(0) ; } }) ; int width = data.rng(1).size() ; int height = data.rng(0).size() ; setTitle("Before and after") ;

Container content = getContentPane() ; Box p = Box.createHorizontalBox() ; Box before = Box.createVerticalBox() ; before.add(new JLabel("Original image")) ; PGMIcon image = new PGMIcon(data, maxin) ; JLabel picture = new JLabel(image) ;


before.add(picture) ; p.add(before) ; Box after = Box.createVerticalBox() ; after.add(new JLabel("Filtered image")) ; image2 = new PGMIcon(data.rng(1).size(), data.rng(0).size()) ; JLabel picture2 = new JLabel(image2) ; after.add(picture2) ; p.add(after) ; content.add(p) ; pack() ; } void setAfter(int [[*,*]] data, int maxin) { image2.setData(data, maxin) ; repaint() ; } private PGMIcon image2 ; } }

PGMIcon.hpj

import java.awt.*;
import javax.swing.*;

public class PGMIcon implements Icon, SwingConstants {
  private int width, height, maxin ;
  private int [[*,*]] data ;
  public PGMIcon(int [[*,*]] data, int maxin) {
    this.data = data ;
    height = data.rng(0).size() ;
    width = data.rng(1).size() ;
    this.maxin = maxin ;
  }
  public PGMIcon(int width, int height) {
    this.width = width ;
    this.height = height ;


} public void setData(int [[*,*]] data, int maxin) { this.data = data ; this.maxin = maxin ; } public int getIconHeight() { return height ; } public int getIconWidth() { return width ; } public void paintIcon(Component c, Graphics g, int x, int y) { if(data != null) { Color [] colors = new Color [maxin + 1] ; for(int k = 0 ; k < maxin + 1 ; k++) { int col = (int) ((255.0 * k) / maxin) ; colors [k] = new Color(col, col, col) ; } for(int i = 0 ; i < width ; i++) for(int j = 0 ; j < height ; j++) { g.setColor(colors [data [j, i]]) ; g.fillRect(x + i, y + j, 1, 1) ; } } } }


4.4 References

[1] Ian Foster. Designing and Building Parallel Programs. Reading, MA: Addison-Wesley, 1995. http://www.mcs.anl.gov/dbpp
[2] Message Passing Interface Forum. MPI: A Message Passing Interface Standard, version 1.1. June 12, 1995. ftp.mcs.anl.gov
[3] Geoffrey Fox et al. Solving Problems on Concurrent Processors. Englewood Cliffs, NJ: Prentice Hall, 1998.
[4] William Gropp, Ewing Lusk and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. Cambridge, MA: MIT Press, 1994.
[5] Vipin Kumar et al. Introduction to Parallel Computing: Design and Analysis of Algorithms. Redwood City, CA: Benjamin/Cummings, 1994.
[6] Message Passing Interface Forum. MPI: a message-passing interface standard. International Journal of Supercomputer Applications 8(3-4), 1994. Also available by anonymous ftp from ftp.netlib.org as Computer Science Dept. Technical Report CS-94-230, University of Tennessee, Knoxville, TN, May 5, 1994.
[7] Marc Snir, Steve Otto, Steve Huss-Lederman, David Walker, and Jack Dongarra. MPI: The Complete Reference. Cambridge, MA: MIT Press, 1996. Available on the World Wide Web at http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
[8] Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing.

Chapter 5

Performance Analysis


5. Performance Analysis

The price/performance ratio of a Beowulf-type machine is between three and ten times better than that of traditional supercomputers. We have developed several applications and also worked on some existing programs for performance analysis.

5.1 PI Calculation in MPICH

"Cpilog" is a program written in MPICH which generates an approximate value of PI and also reports the error.
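Cpilog is based on the standard approach of approximating PI by midpoint-rule integration of 4/(1+x^2); the sketch below shows that approach in plain MPI C (the interval count n and the output format are our assumptions, not taken from cpilog itself):

#include <stdio.h>
#include <math.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int n = 1000000;               /* number of intervals (assumed) */
    int rank, size, i;
    double h, sum, x, mypi, pi;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    t0 = MPI_Wtime();
    h = 1.0 / (double) n;
    sum = 0.0;
    /* each process handles every size-th interval, starting at rank+1 */
    for (i = rank + 1; i <= n; i += size) {
        x = h * ((double) i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;

    /* combine the partial sums on process 0 */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("pi is approximately %.16f, error %.16f, time %f s\n",
               pi, fabs(pi - M_PI), t1 - t0);

    MPI_Finalize();
    return 0;
}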

From the following table we get the time for generating the value of PI and the relevant errors.

Table 5.1: Numerical results for PI calculation

Np    Time (s)    PI                      Error
1     0.7080      3.1415926535897634      0.0000000000002980
2     0.4314      3.1415926535899406      0.0000000000001474
3     0.4087      3.1415926535899095      0.0000000000001164
4     0.4006      3.1415926535899028      0.0000000000001097

The table shows that the efficiency of the computation depends on the number of nodes: as the number of nodes increases, it takes less time to calculate the desired outcome, and vice versa.
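This behaviour can be quantified with the usual definitions of speedup and efficiency, computed here from the times in Table 5.1 (the formulas are the standard textbook ones, not part of the cpilog output):

S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p},
\qquad \text{e.g.} \quad
S(4) = \frac{0.7080}{0.4006} \approx 1.77, \quad E(4) \approx 0.44 .

So on four nodes roughly 44% of the ideal four-fold speedup is realized; the rest is lost to communication.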


5.2 Matrix-Matrix Multiplication in MPICH

The following table shows that when the matrices are large, the algorithm performs better as the number of processors is increased. For small matrix-matrix multiplications (like 12x12), this algorithm is not efficient [1][2][4].

Table 5.2: Numerical results for matrix-matrix multiplication in MPICH (execution time in seconds)

Np/Size    12x12       60x60       1200x1200      2400x2400
1          0.000535    0.017995    155.678127     334.352198
2          0.002026    0.042677    103.713092     278.457612
3          0.003153    0.058966    83.210713      203.654390
4          0.004135    0.074956    77.077650      168.235632
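Applying the same speedup measure S(p) = T(1)/T(p) to the table above makes the point explicit (values computed directly from the measured times):

S_{2400 \times 2400}(4) = \frac{334.352198}{168.235632} \approx 1.99,
\qquad
S_{12 \times 12}(4) = \frac{0.000535}{0.004135} \approx 0.13 .

The 2400x2400 problem gains almost a two-fold speedup on four nodes, while the 12x12 problem runs nearly eight times slower than on a single node because communication dominates the tiny amount of computation.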

[Graph: Graphical representation of the matrix-matrix multiplication results - elapsed time (y-axis, 0 to 400 seconds) against matrix size (12, 60, 1200, 2400) for runs using 1, 2, 3 and 4 nodes.]


5.3 Matrix-Matrix Multiplication in mpiJava

We have implemented another program for matrix-matrix multiplication in both MPICH and mpiJava, but we used a different algorithm on each platform (MPICH and mpiJava).

Table 5.3: Numerical results for matrix-matrix multiplication in mpiJava (execution time in seconds)

Np/Size    2x2        128x128    256x256    400x400
1          2.59E-5    0.047      0.431      1.998
4          5.24E-5    0.029      0.222      1.034
6          4.89E-5    0.022      0.154      0.681
8          5.20E-5    0.019      0.110      0.504

The table shows that the algorithm improves as the number of processors increases, which is the primary goal of parallel computation. It also shows that if the number of processors is less than 4, the performance decreases because of the overhead produced during execution. The execution time increases because of the communication between the machines through the VM and is totally dependent on the machine architecture.


5.4 References

[1] Message Passing Interface Forum. MPI: A Message Passing Interface Standard, version 1.1. June 12, 1995. ftp.mcs.anl.gov
[2] Message Passing Interface Forum. MPI: a message-passing interface standard. International Journal of Supercomputer Applications 8(3-4), 1994. Also available by anonymous ftp from ftp.netlib.org as Computer Science Dept. Technical Report CS-94-230, University of Tennessee, Knoxville, TN, May 5, 1994.
[3] William Gropp, Ewing Lusk and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. Cambridge, MA: MIT Press, 1994.
[4] Marc Snir, Steve Otto, Steve Huss-Lederman, David Walker, and Jack Dongarra. MPI: The Complete Reference. Cambridge, MA: MIT Press, 1996. Available on the World Wide Web at http://www.netlib.org/utk/papers/mpi-book/mpi-book.html

Chapter 6

Conclusion and Future Works


6. Conclusion and Future Works:

For complex computations and tasks, such as weather forecasting and complex fluid dynamics codes, people still depend on supercomputers. But parallel programming can be the cheapest solution for these types of tasks. We have implemented a parallel architecture, a Beowulf cluster, and shown that it can perform very complex computations in an efficient time.

We found that the Java implementation of MPI is much simpler and more flexible than the existing implementation of MPI in C.

Coming from a developing country, we have no affordable way to do scientific research without parallel computation. We have taken a simple, initial approach to this parallel world and gained some experience of this vast field.

We have realized the importance of this field in real life and hope to do more research in it. For weather forecasting and complex scientific computation, this field may have an important significance.

Communicated Paper To

ICCIT 2003


Parallel Programming Using Java On LINUX CLUSTER

Abstract

Parallel programming has evolved over the years with new dimensions, and the scientific community is fostering new approaches to parallel programming. This paper is an initiative toward a newer approach of using the Java programming language on top of the Message Passing Interface (MPI), which brings Java programming and its object orientation into the parallel programming environment. The paper implements matrix-matrix multiplication with significant improvements over the traditional parallel techniques.

Keywords: Parallel Programming, MPI, Message Passing, Cluster

1.0 Introduction

Nowadays the object oriented programming language Java is used for scientific and engineering computation, and in particular for parallel computation [3,5,6,7]. It has been argued on behalf of Java that it is simple, efficient and platform-neutral, a natural language for network programming, which makes it potentially attractive to scientific programmers hoping to harness the collective computational power of networks of workstations (NOW) and PCs, or even of the Internet. A basic prerequisite for parallel programming is a good communication API. Java comes with various ready-made packages for communication, notably an easy-to-use interface to BSD sockets and the Remote Method Invocation (RMI) mechanism. Both communication models of Java are optimized for client-server programming, whereas the parallel computing world is mainly concerned with "symmetric" communication, occurring in groups of interacting peers. This symmetric model of communication is captured in the successful Message Passing Interface standard (MPI), a set of optimized libraries established a few years ago [4]. MPI directly supports the Single Program Multiple Data (SPMD) model of parallel computing, wherein a group of processes cooperate by executing identical program images on local data values. And we use


mpiJava, an object oriented Java interface to the Message Passing Interface (MPI), to write Java programs on a Linux cluster.

2.0 mpiJava: Message Passing Interface on Java

mpiJava was developed at Syracuse University by Bryan Carpenter, Mark Baker, Geoffrey Fox and Guansong Zhang. The existing MPI standards specify language bindings for Fortran, C and C++, and this approach implements a Java API for the Message Passing Interface Chameleon (MPICH). More precisely, mpiJava is a Java interface which binds Java Native Interface (JNI) C stubs to the underlying native MPI C interface; that is, mpiJava uses Java wrappers to invoke the C MPI calls through the Java Native Interface (JNI) [1,8]. mpiJava runs parallel Java programs on top of MPICH through the Java Virtual Machine (JVM). The architecture stack of the environment when running parallel Java programs using mpiJava is shown in Fig 1.

2.1 Class Hierarchy of mpiJava

[Fig 1: Execution stack of a parallel Java program using mpiJava - Java Application, Java Virtual Machine (JVM), mpiJava, MPICH, OS (Linux), Protocol (TCP/UDP), Ethernet Card, connecting Node 0, Node 1 and Node 2.]

The existing MPI standard is explicitly object-based. The C and Fortran bindings rely on "opaque objects" that can be manipulated only by acquiring object handles from constructor functions, and passing the handles to suitable functions in the library. The C++ bindings specified in the MPI 2 standard collect these objects into suitable class hierarchies and define most of the library functions as class member functions. The mpiJava API follows this model, lifting the structure of its class hierarchy directly from the C++ binding. The class MPI only has static members. It acts as a module containing global services, such as initialization of MPI, and many global constants including the default

communicator COMM_WORLD. The most important class in the package is the communicator class Comm. All communication functions in mpiJava are members of Comm or its subclasses. As usual in MPI, a communicator stands for a "collective object" logically shared by a group of processors. The processes communicate, typically by addressing messages to their peers through the common communicator. The principal classes of mpiJava are shown in Fig 2.

[Fig 2: Class hierarchy of mpiJava - package mpi contains MPI, Group, Comm (with subclasses Intracomm, further specialized by Cartcomm and Graphcomm, and Intercomm), Datatype, Status, Request and Prequest.]

Another important class of mpiJava is the Datatype class. This describes the type of the elements in the message buffers passed to send, receive and all the other communication functions. Various datatypes are predefined in the package [1,2]. These mainly correspond to the primitive types of Java; the basic datatypes of mpiJava, which are passed to MPI via the Java Native Interface (JNI), are shown in Table 1.

2.2 API of mpiJava

There are some basic communication APIs of mpiJava which are used to develop programs on the Java platform. The functions MPI.Init(args) and MPI.Finalize() must be used at the start and the end of the program: Init() initializes MPI for the current environment, and Finalize() finalizes MPI and destroys all the communicator objects that were created for communication purposes.

Table 1: Basic datatypes of mpiJava

MPI Datatype     Java Datatype
MPI.BYTE         byte
MPI.CHAR         char
MPI.SHORT        short
MPI.BOOLEAN      boolean
MPI.INT          int
MPI.LONG         long
MPI.FLOAT        float
MPI.DOUBLE       double
MPI.PACKED       -


In basic message passing, the processes coordinate their activities by explicitly sending and receiving messages. The standard send and receive operations of MPI are members of Comm or its subclasses [1]:

public void send(Object buf, int offset, int count, Datatype datatype, int dest, int tag)

public void receive(Object buf, int offset, int count, Datatype datatype, int source, int tag)

Here buf is an array of primitive type or class type; if the elements of buf are objects, they must be serializable. offset is the starting point of the message, the datatype class describes the type of the elements, count is the number of elements to be sent or received, source and dest are the ranks of the source and destination processes, and tag is used to identify the message. One issue needs to be addressed: the commands executed by process 0 (the send operation) are different from those executed by process 1 (the receive operation). However, this does not mean that the programs need to be different. By conditional branching according to the process rank, the program can obtain the SPMD paradigm, e.g.:

my_process_rank = MPI.COMM_WORLD.Rank();
if(my_process_rank==0)
    MPI.COMM_WORLD.Send(buf,offset,count,datatype,dest,tag);
if(my_process_rank==1)
    MPI.COMM_WORLD.Receive(buf,offset,count,datatype,source,tag);

3.0 A Parallel Java Application using mpiJava: Matrix-Matrix Multiplication

mpiJava is an interface over the Message Passing Interface, and message passing is a model for interconnection between processors within a parallel system, where a message is constructed by software on one processor and is sent through an interconnection network to another processor, which then must accept and act upon the message contents. So, to develop programs in Java for a parallel platform, the prerequisite is knowledge of the class structure of mpiJava and of the mpiJava API.


Consider the problem of computing C = AxB, where A and B are dense matrices of size MxN and NxP (the resultant matrix C has size MxP). We know that matrix-matrix multiplication involves O(N^3) operations, because each element C_ij of C is equal to:

C_{ij} = \sum_{k=0}^{N-1} A_{i,k} \, B_{k,j}    (1)

We are looking for an efficient parallel algorithm for matrix-matrix multiplication using mpiJava. Consequently, we examined several algorithms developed for this purpose. First, consider a one-dimensional, row-wise decomposition such that each processor is responsible for all computation associated with the C_ij's in its assigned rows [2]. Each task requires all of matrix B to compute the C_ij corresponding to its rows of A. The algorithm performs very well when N is much larger than the number of processors P [4]. In a two-dimensional decomposition each task requires an entire row A_i,* and column B_*,j of A and B respectively. The one-dimensional decomposition requires N^2/P data and the two-dimensional decomposition requires N^2/P^(1/2). Because of the way Java manages arrays, we consider a one-dimensional row-wise decomposition, as shown in Fig 3.

Fig 3: One-dimensional row-wise decomposition

The input matrices A and B are initially available on process zero. Matrix A is sliced row-wise (N/p rows per process) and each set of rows, together with the entire matrix B, is sent to the child processes. After computing, the result matrix C is returned to process zero. Each process computes the sum of products of its set of rows of A and all the columns of B, resulting in its rows of C. The SPMD paradigm for matrix-matrix multiplication can be achieved by the following code:

for ( i = my_process_rank*N/p ; i < (my_process_rank+1)*N/p ; i++){
    for ( j = 0 ; j < N ; j++){
        C[i][j] = 0;
        for ( k = 0 ; k < N ; k++)
            C[i][j] += A[i][k] * B[k][j];
    }
}


The successive phases of the computation are illustrated in Fig 4.

[Fig 4: Computation phases of matrix-matrix multiplication - send set of rows of A, send B, compute local C, receive C.]

4.0 Result and Analysis

Using the above approach we tested the implementation and obtained the results shown in Table 2.

Table 2: Numerical results for matrix-matrix multiplication (execution time in seconds)

Np/Size    2x2        128x128    256x256    400x400
1          2.59E-5    0.047      0.431      1.998
2          5.24E-5    0.029      0.222      1.034
3          4.89E-5    0.022      0.154      0.681
4          5.20E-5    0.019      0.110      0.504

Comparing these results with those obtained by jmpi [9] and IceT [10], we see that the algorithm is much more efficient than the other implementations and improves on their results [2].

Moreover, the table shows that the algorithm improves as the number of processors increases, which is the primary goal of parallel computation. It also shows that if the matrices are very small (smaller than 4x4), the performance decreases because of the overhead produced during execution. The execution time increases because of the communication between the machines through the VM and is totally dependent on the machine and application architecture.

5.0 Conclusion and Future Works

As Java is an object oriented language, robust with no pointers, and has features such as portability, safety and pervasiveness, it will be an important tool for scientific and high-performance applications, and we showed that mpiJava is an efficient tool for parallel processing on clustered


network. The only disadvantage of mpiJava is that this approach conflicts with the essence of Java, "write once, run anywhere", as mpiJava depends on the availability of MPICH. We plan to work in this specific area on more complex, real-life problems such as image enhancement and weather forecasting.

References

[1] Baker M., "mpiJava: A Java MPI Interface." EuroPar98, Southampton, UK, September 1998.
[2] Freddy Perez, "On Implementation Issues of Parallel Computing Applications in Java".
[3] Parallel Compiler Runtime Consortium. HPCC and Java - a report by the Parallel Compiler Runtime Consortium. URL: http://www.npac.syr.edu/users/gcf/hpjava3.html, May 1996.
[4] Geoffrey C. Fox, editor. Java for Computational Science and Engineering - Simulation and Modelling, volume 9(6) of Concurrency: Practice and Experience, June 1997.
[5] Geoffrey C. Fox, editor. Java for Computational Science and Engineering - Simulation and Modelling II, volume 9(11) of Concurrency: Practice and Experience, November 1997.
[6] Geoffrey C. Fox, editor. ACM 1998 Workshop on Java for High-Performance Network Computing. Palo Alto, February 1998, Concurrency: Practice and Experience, 1998. To appear. http://www.cs.ucsb.edu/conferences/java98
[7] Vladimir Getov, Susan Flynn-Hummel, and Sava Mintchev. High-Performance parallel programming in Java: Exploiting native libraries. In ACM 1998 Workshop on Java for High-Performance Network Computing. Palo Alto, February 1998, Concurrency: Practice and Experience, 1998. To appear.
[8] George Crawford III, Yoginder Dandass, and Anthony Skjellum. The jmpi commercial message passing environment and specification: Requirements, design, motivations, strategies, and target users. http://www.mpi-softtech.com/publications
[9] Dincer K., "jmpi and a Performance Instrumentation Analysis and Visualization Tool for jmpi." First UK Workshop on Java for High Performance Network Computing, EUROPAR-98, Southampton, UK, September 2-3, 1998.
[10] "The IceT Project". [Foster95] Foster I., "Designing and Building Parallel Programs". Addison Wesley Publishing Company Inc., New York, 1995.
[11] Gray P. and Sunderam V., "The IceT Environment for Parallel and Distributed Computing." Proceedings of ISCOPE97 (Springer Verlag), Marina del Rey, CA, December 1997.
