
A Lightweight Approach to Automatic Resource Configuration in Distributed Computing
Hao Liu, Søren-Aksel Sørensen, Amril Nazir
Department of Computer Science, University College London, United Kingdom
{h.liu, s.sorensen, a.nazir}@cs.ucl.ac.uk

Abstract— A key problem in executing performance-critical applications on distributed computing environments (e.g. the Grid) is the selection of resources for execution. Much research on "automatic resource selection" has been devoted to allocating best-effort resources on behalf of users in order to optimize execution performance. However, most current approaches are based on the static principle (i.e. resource selection is performed prior to execution) and need detailed application-specific information. In this paper, we introduce a lightweight approach to automatic resource selection/configuration. The approach is based on a simple control-loop idea: the application continuously reports performance values to the middleware Application Agent (AA), which relies on the reported values to decide how to dynamically reconfigure the execution environment during the execution so as to meet the user's performance requirements (e.g. an execution deadline, or running N iterations per second). We divide the research into two paradigms: neglecting network latency and considering network latency. For the first paradigm, we use a linear prediction with a Kalman filter to find the expected resource configuration (RC) that satisfies a given performance requirement. For the second, we let AA probe possible RCs and roll back the bad ones, looking for a locally optimal RC that provides the application with the highest performance.

Index Terms— automatic resource selection, adaptive application, performance value.

I. INTRODUCTION

Large-scale distributed environments (e.g. the Grid) provide a cost-effective computing platform for scientific applications that need a large amount of processing power. A key problem in this computing model is selecting the right resources from the environment to obtain good execution performance. Manual resource selection relies on a resource information system (e.g. Globus MDS [1]) and on users' experience to specify resource requirements (e.g. the number of processors, processor speed, processor architecture, etc.) through job submission languages (e.g. ClassAd [2], JSDL [3]); according to this specification, resource matchmakers (scheduling systems or resource brokers) select and allocate resources. However, due to the heterogeneous nature of the resource environment and the complexity of execution behavior, it is difficult for most users to provide the necessary resource specification. Therefore, a lot of research has been devoted to "automatic resource selection/configuration", which aims to help users who have insufficient knowledge about resources to allocate proper distributed resources, in order to optimize the performance of either the applications or the resource pool.

AppLeS [4], which depends on NWS [5] for resource information, provides automatic resource selection and task scheduling suggestions that take both application-intrinsic information and resource characteristics into account in order to minimize the execution time. It requires users to fill in an application template that provides information about the general functional decomposition of the application, such as the data needed to start it. Nimrod/G [6] introduces an economy-oriented scheduling methodology for parametric computational batch jobs. It uses the Globus information services to select the most cost-effective resources to meet an application's execution deadline and budget constraints. The scheduling strategy can change according to the execution performance of the application (e.g. it will offer more powerful resources at higher cost if the deadline cannot be met with the current ones). Others [7], [8], [9] present various approaches that enable automatic resource selection in different scenarios.

Most existing approaches follow two principles. One is the static principle: the middleware selects resources before actually allocating them to the application, and the selection process ends once execution starts. The other is the information principle, which can be summarized by a statement in [4]: "Application- and system-specific information is needed for good schedules". These approaches require users to provide more or less information about the application, such as the communication topology and I/O patterns, to obtain optimized performance. Normally, the more information the middleware knows, the better the selection it makes.

In this paper we introduce a lightweight approach to automatically select resources for distributed/parallel applications. The approach dynamically tunes the resource environment by adding, releasing and replacing resources during the application's execution rather than statically selecting resources before execution commences. It tries different resource configurations for the application and finally tunes the execution environment to a locally optimal level. Unlike the static approaches above, it does not need to know explicit characteristics of the application: the application is simply treated as a black box. The approach only listens to the execution performance reports (the output of the black box) from the application and, according to those reports, decides how to reconfigure the execution environment (the input of the black box) to approach the performance requirements.


II. RESEARCH CONTEXT

A. Precondition

The approach first requires the served application to be structured as an Adaptive Distributed/Parallel Application (ADA) [10], defined as a parallel application that is able to add or release resources at any time during its execution and that autonomously balances the computational load over each resource to adapt to the current resource configuration (i.e. it is reconfigurable). An ADA has high execution flexibility: it can execute with fewer resources and allows additional resources to be added on the fly to meet certain performance requirements. It can autonomously rearrange the distribution of the computational load (e.g. objects) over the allocated resources according to their capacities, obtaining better load balance and accordingly better execution performance. The flexibility and adaptability of an ADA give us the opportunity to try different resource configurations and observe the execution reaction of the application.

The middleware that supports automatic resource selection is called the Application Agent (AA) [10], which can add and release resources by contacting the underlying Distributed Resource Manager (DRM) on behalf of an application during its execution. The extended AA can automatically configure the Resource Configuration (RC, the set of hosts and the related network on which an application runs) for an application. It can decide independently when to perform resource addition/release/replacement actions without a request from the application. (In the AA model, a process runs on one resource, so the addition/release/replacement of a resource corresponds to the addition/release/replacement of a process.) AA only needs to inform the application of the result of each action. The application, on the other hand, has to check continuously whether any actions have been performed by calling GetNotification(int tag). Two tag types, PROCESS_ADDED and PROCESS_KILLED, are used to check whether a new process has been added or a process has been killed; a process replacement is decomposed into the addition of a new process and the killing of an old one. When GetNotification() returns true, a notification message including the target process and host information (e.g. CPU speed, CPU load) is returned. Once the RC is changed, the application balances the computational load over its processes to adapt to the new RC.

B. Assumption

Many factors of an RC influence the execution of an application: the network latency between hosts, the architecture or operating system of a host, a host's memory size, a host's input/output speed, a host's processing ability, etc. In this paper we consider only the two factors with the biggest influence on execution: the processing ability of each host and the network latency within the RC. We assume the application can execute on any processor architecture or operating system and that memory is always sufficient. We also assume the network latency is determined only by the distance between hosts.
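To make the AA-ADA interaction described in the Precondition subsection concrete, the sketch below shows how an ADA worker might poll AA for RC changes and then re-balance itself. GetNotification and the PROCESS_ADDED/PROCESS_KILLED tags are named in the text; the out-parameter signature, the Notification fields and the stub bodies are assumptions added only so the sketch compiles on its own.

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

enum Tag { PROCESS_ADDED, PROCESS_KILLED };

struct Notification {          // assumed shape of AA's notification message
    int    processId;
    double cpuSpeed;           // host information, e.g. CPU speed
    double cpuLoad;
};

// Stubbed AA call: the real implementation lives in the middleware.
bool GetNotification(int /*tag*/, Notification& /*out*/) { return false; }

int main() {
    std::vector<int> processes = {0};        // processes this ADA knows about

    for (int iteration = 0; iteration < 3; ++iteration) {
        // 1. Ask AA whether the RC changed since the last check.
        Notification n;
        while (GetNotification(PROCESS_ADDED, n))   // a new process joined
            processes.push_back(n.processId);
        while (GetNotification(PROCESS_KILLED, n))  // a process was removed
            processes.erase(std::remove(processes.begin(), processes.end(),
                                        n.processId),
                            processes.end());

        // 2. Re-balance objects over the surviving processes according to
        //    their processing abilities (application-specific, omitted).
        std::cout << "iteration " << iteration << ": "
                  << processes.size() << " processes\n";
    }
}
```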

Fig. 1. AA dynamically configures the RC according to the reported PV by acquiring resources from the computational grids. The application autonomously reforms itself to adapt to the provided RC.

Based on these assumptions, an RC is composed of a set of hosts characterized by their processing abilities P and locations L (the latency between two hosts is |L1 − L2|): RC = {{P0, L0}, {P1, L1}, ..., {Pi, Li}}.

III. METHODOLOGY

Figure 1 shows how the AA interacts with the underlying resource environment and the application to automatically configure the execution environment (RC). The AA initially assigns a random number of hosts with random processing abilities to the application. During the execution, the application periodically (every n seconds or every iteration) reports a Performance Value (PV) to AA through the API ReportPerfVal(). PV(t) is an abstract value that represents the satisfaction degree of one or more observed performance metrics at time t. It is a floating-point number starting from 0.0; a value of 1.0 means the performance metric is satisfied, and a satisfactory PV can also be defined as a range, e.g. 0.9 to 1.1. The meaning of the PV is defined by the application developer. A simple example of a PV is the execution speed satisfaction degree: for an iterative application whose user expects an execution speed of 2.0 iterations/second, if the currently monitored speed is 0.5 iterations/second, then PV(now) = 0.5/2.0 = 0.25. The actual meaning of a PV is transparent to AA. AA's job is only to bring the PV to 1.0, or into the satisfactory range, by tuning the RC according to the reported PVs; an intuitive policy is that when the PV is smaller than 1.0 AA is likely to add more hosts to the RC, while when the PV is bigger than 1.0 AA is likely to release some hosts from the current RC. Since an ADA can adapt to different RCs, AA can try different RCs and, learning from the execution reactions (PVs) under these configurations, finally tune the RC to a best-effort level. In the next sections we introduce the detailed methodologies for two paradigms: applications neglecting network latency and applications considering network latency.
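As a worked illustration of the PV just defined, the snippet below computes the speed-satisfaction PV from the 2.0 iterations/second example and checks it against the default satisfactory range. Only ReportPerfVal is named in the paper; its stub body, the constants and the printed output are illustrative assumptions.

```cpp
#include <cstdio>

// Stub for the reporting call named in the paper; here it only prints.
void ReportPerfVal(double pv) { std::printf("reported PV = %.2f\n", pv); }

int main() {
    const double targetSpeed   = 2.0;   // user expects 2.0 iterations/second
    const double measuredSpeed = 0.5;   // currently observed speed

    // PV is the satisfaction degree of the observed metric.
    const double pv = measuredSpeed / targetSpeed;   // 0.5 / 2.0 = 0.25

    const bool satisfied = (pv >= 0.9 && pv <= 1.1); // default range
    std::printf("PV = %.2f, inside satisfactory range: %s\n",
                pv, satisfied ? "yes" : "no");

    ReportPerfVal(pv);   // AA decides from this whether to reconfigure the RC
}
```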


IV. NEGLECTING NETWORK LATENCY

For applications with no communication between computational components, or whose computation time is much larger than their communication time, the execution is not affected by network latency. Examples are parameter sweep applications, which consist of independent jobs each testing a different set of parameters, and the rendering of computer graphics, where each pixel can be rendered independently. When network latency is neglected, the only factor affecting the execution is processing ability: the more hosts are provided, the higher the execution performance. For a given performance requirement, the question is really how to satisfy the requirement with the fewest resources.

As described above, an ADA balances the distribution of objects over the allocated resources according to the processing ability of each host, so ideally each process takes the same time to compute its objects even though the processes run on hosts with different processing abilities. If a processing ability Pi changes over time (e.g. the CPU load changes), then Ni changes accordingly. Let

Ni = the number of objects on host i;
Pi = the processing ability (instructions per time unit) of host i;
M = the number of instructions per object, assuming every object has the same amount of computation;
T = the time to complete the whole computation.

At any time: N1·M/P1 = N2·M/P2 = ... = Ni·M/Pi = T.
Hence ΣNi = N1 + N2 + ... + Ni = (T/M)·(P1 + P2 + ... + Pi) = (T/M)·ΣPi, i.e. T = (ΣNi·M)/ΣPi.

Therefore, ideally the reported PV, which is inversely related to T, has a linear relationship with ΣPi. From the PV and the current RC's ΣPi, AA can predict the expected ΣPi: on each report, expected ΣPi = (1.0/PV) × current ΣPi (the direct prediction). In practice, however, various factors (e.g. the overhead of application adaptation, load imbalance due to a non-optimal policy, and operating system scheduling variation) introduce errors into this prediction. We therefore apply the Kalman filter [11] to remove the effect of noisy values and obtain a better estimate of the expected ΣPi; we write expectedK ΣPi for the Kalman-filtered value. The measurement noise covariance R [11] controls how quickly the filter responds to the observed measurement (PV): a bigger R means a slower response and a smaller R a quicker one. A smaller R makes expectedK ΣPi adapt quickly to the dynamics of the execution, but also makes it quickly "believe" noisy measurements.

On each report, AA first calculates expectedK ΣPi and then checks whether the PV is inside the satisfactory range (default 0.9 to 1.1). If it is not, AA tries to reconfigure the RC by adding, releasing or replacing hosts: if the current ΣPi < expectedK ΣPi, AA keeps adding hosts to reach that value, and vice versa. Table I summarizes the three actions; a code sketch of this prediction-and-decision step is given after the table. Additionally, there are some other factors to consider when making the reconfiguration decision:

• The effect of an RC reconfiguration on the execution may take some time to appear because of the time needed for load balancing. For example, it may take a couple of minutes after a resource addition before the PV increases. So when AA intends to add hosts, it needs to check whether the PV is already increasing; if it is, the application is probably still re-adapting to the current RC and the PV may gradually reach the expected 1.0, so AA performs no reconfiguration operation. The same applies to the host release action.
• The PV may fluctuate even under the same RC. For example, under the same RC, PV1 = 0.95 and PV2 = 0.89; on the second report AA would try to reconfigure the RC since the PV is not in the satisfactory range. We already address this problem by applying the Kalman filter to smooth out the fluctuation: even though PV2 falls outside the satisfactory range, the filtered expectedK ΣPi changes little, so no unnecessary reconfiguration is triggered.

TABLE I
SUMMARY OF THREE ACTIONS: ADDING, RELEASING, AND REPLACING.

Action    | How to perform                                                     | When to perform
Adding    | Continuously add hosts until ΣPi > expectedK ΣPi                   | (1) PV < 0.9; (2) PVcurrent − PVprevious ≤ 0; (3) current ΣPi < expectedK ΣPi
Releasing | Continuously release hosts until ΣPi − smallest Pi < expectedK ΣPi | (1) PV > 1.1; (2) PVcurrent − PVprevious ≥ 0; (3) current ΣPi − smallest Pi > expectedK ΣPi
Replacing | Once per report                                                    | (1) PV > 1.1; (2) PVcurrent − PVprevious ≥ 0; (3) ΣPi > expectedK ΣPi and ΣPi − smallest Pi < expectedK ΣPi
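The following sketch combines the scalar Kalman filter of [11] with the decision rules of Table I, as run by AA on each PV report. The Q and R values match those used in the experiment below; the class names, the example PV sequence and the fixed ΣPi values are assumptions made only to keep the sketch self-contained.

```cpp
#include <cstdio>
#include <vector>

struct ScalarKalman {
    double x = 0.0;        // estimate of the expected sum(Pi)
    double p = 1.0;        // estimate error covariance
    double q = 1e-5;       // process noise covariance   (Q in the paper)
    double r = 1e-2;       // measurement noise covariance (R in the paper)
    double update(double z) {
        p += q;                        // time update
        double k = p / (p + r);        // Kalman gain
        x += k * (z - x);              // measurement update
        p *= (1.0 - k);
        return x;
    }
};

enum Action { NONE, ADD, RELEASE, REPLACE };

// Decision rules mirroring Table I.
Action decide(double pv, double prevPv, double sumP, double smallestP,
              double expectedSumP) {
    double trend = pv - prevPv;
    if (pv < 0.9 && trend <= 0 && sumP < expectedSumP)
        return ADD;                                   // row "Adding"
    if (pv > 1.1 && trend >= 0) {
        if (sumP - smallestP > expectedSumP)
            return RELEASE;                           // row "Releasing"
        if (sumP > expectedSumP && sumP - smallestP < expectedSumP)
            return REPLACE;                           // row "Replacing"
    }
    return NONE;
}

int main() {
    ScalarKalman filter;
    double sumP = 300000, smallestP = 59455, prevPv = 1.0;
    std::vector<double> reports = {0.45, 0.50, 0.48, 0.70, 0.95};
    for (double pv : reports) {
        double direct   = (1.0 / pv) * sumP;          // direct prediction
        double expected = filter.update(direct);      // Kalman-filtered value
        Action a = decide(pv, prevPv, sumP, smallestP, expected);
        std::printf("PV=%.2f expected=%.0f action=%d\n", pv, expected, a);
        prevPv = pv;
        // in AA the chosen action would now change the RC, altering sumP
    }
}
```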

TABLE II
SIMULATED RESOURCE SET-1.

Type | Number of Hosts | Processing Ability (MIPS)
A    | 11              | 59455
B    | 6               | 49161
C    | 18              | 27079
D    | 15              | 18938

A. Experiments

The execution of the application runs on one machine. Processes are spawned on the machine as long as AA finds an eligible host in the emulated resource pool (Table II). Once a process is spawned, the host's processing ability P is passed to the process, and the time to compute an object is then emulated by usleep(M/P × 10^6). In the experiment, the user's performance requirement is set to 18 to 22 iterations per 1000 seconds; the aim of the configuration is to satisfy this requirement with the fewest resources. We set the process noise covariance Q = 10^-5 and the measurement noise covariance R = 10^-2, which is intended to smooth the prediction and remove as much measurement noise as possible. AA initially assigns 5 type A hosts to the application; during the execution AA gradually changes the RC to satisfy the performance requirement. Figure 2 shows that the execution speed changed gradually with the RC's reconfiguration and was finally tuned to the required level; the tuning process lasted 4 iterations. Figure 3 shows how the RC was reconfigured according to the predicted value: more hosts were added and ΣPi finally reached expectedK ΣPi, which is around 750,000. Without Kalman filtering, the direct prediction has significant noise which would affect AA's reconfiguration decisions.
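The host-speed emulation described above can be reproduced in a few lines: each spawned process receives the processing ability P of its emulated host and "computes" an object of M instructions by sleeping for M/P seconds, following the usleep(M/P × 10^6) formula in the text. The particular M and object count below are assumed values chosen only to keep the run short.

```cpp
#include <unistd.h>
#include <cstdio>

int main() {
    const double P = 59455;   // emulated host's processing ability (Table II, type A)
    const double M = 1e4;     // instructions per object (assumed value)
    const int objects = 5;    // objects assigned to this process (assumed)

    for (int i = 0; i < objects; ++i) {
        // Emulate the computation of one object: sleep for M/P seconds,
        // i.e. usleep(M/P * 1e6) microseconds as in the paper's formula.
        usleep(static_cast<useconds_t>(M / P * 1e6));
    }
    std::printf("computed %d objects on a host with P = %.0f\n", objects, P);
}
```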

V. CONSIDERING NETWORK LATENCY

The execution of an application that has frequent communication between its computational components is significantly affected by network latency: the computing speed is limited by the latency among hosts. In this section we investigate how to configure the RC from the currently available resource pool to maximize the application's execution performance.

Our methodology is to keep adding hosts from the available host pool while reading the reported PV. If the PV increases after a host is added, the positive effect of the host's processing ability outweighs the negative effect of its latency; we call such hosts positive hosts. If the PV decreases after a host addition, the host brings more negative effect to the execution; we call such hosts negative hosts. Our aim is to find the negative hosts, which are the bottleneck of the execution, remove them from the RC, and avoid adding similar hosts in the future.

We keep a record of each RC (the RC is dynamically changed by AA's reconfiguration; two RCs are different if any {P, L} differs) and give each RC an index number (the same RC on different reports keeps the same index). We also record the highest PV observed under each RC. We then check whether the performance under the current RC has decreased by comparing the highest PVs of the current and previous RCs. If it has, the current RC contains bottleneck hosts that limit the execution performance. We therefore roll the current RC back to the previous one, which gave better performance, find the bottleneck hosts by comparing the two RCs, and identify the reason for the bottleneck: the network latency. We record this bottleneck latency and make it a requirement for adding future hosts: the latency a new host brings must be smaller than the bottleneck latency. Algorithm 1 presents the heuristic algorithm for the dynamic configuration of the RC.

Algorithm 1 Resource configuration algorithm for applications that consider network communication latency.
Let current = the index of the current RC; RC.age = the age of an RC, namely the number of reports received under it; RC.pv = the highest PV under the RC (in order to remove the variance); b = the bottleneck latency, initially ∞.
On each PV report:
1. If the current RC is a new RC, record it in the RC list, and record the PV for the current RC.
2. If RCcurrent.pv < RCcurrent−1.pv and RCcurrent.age >= 2, go to step 3; else go to step 5.
3. Roll back RCcurrent to RCcurrent−1 and delete RCcurrent from the RC record list.
4. Find the largest latency in the set H of bottleneck hosts (found by comparing the two RCs) and let b equal it.
5. Keep adding new hosts from the available resource pool with the requirements: (1) the largest processing ability in the available pool; (2) latency < b.

Please note that the age of an RC is used because it takes time for the application to adapt to an RC reconfiguration; we only roll back when age >= 2, by which point the application should already have adapted to the RC.
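A compact sketch of Algorithm 1: each RC records its best PV and its age, an aged RC that performs worse than its predecessor is rolled back, and the bottleneck latency b constrains future additions. The RC structure, the driver values and the way a new RC is pushed are illustrative assumptions; only the rollback test and the use of b follow the algorithm above.

```cpp
#include <cstdio>
#include <limits>
#include <vector>

// One recorded RC: its highest PV, its age in reports, and the latency
// introduced by the hosts whose addition created it.
struct RC {
    double bestPv = 0.0;
    int    age = 0;
    double addedLatency = 0.0;
};

// Called on each PV report for the RC at the back of the list.
void onReport(std::vector<RC>& rcs, double pv, double& b) {
    RC& rc = rcs.back();
    rc.age++;
    if (pv > rc.bestPv) rc.bestPv = pv;                 // step 1: track best PV
    if (rcs.size() > 1 && rc.age >= 2 &&
        rc.bestPv < rcs[rcs.size() - 2].bestPv) {       // step 2: got worse
        b = rc.addedLatency;                            // step 4: bottleneck
        rcs.pop_back();                                 // step 3: roll back
        std::printf("rolled back; bottleneck latency b = %.1f\n", b);
    }
    // step 5: request hosts with the largest P and latency < b (not shown)
}

int main() {
    double b = std::numeric_limits<double>::infinity(); // bottleneck latency
    std::vector<RC> rcs = { {0.0, 0, 0.0} };            // the initial RC

    // Reports under the initial RC, then under an RC created by adding
    // distant hosts (latency 2.0); the PVs are invented to drive the sketch.
    onReport(rcs, 0.40, b);
    onReport(rcs, 0.50, b);
    rcs.push_back({0.0, 0, 2.0});                       // AA added distant hosts
    onReport(rcs, 0.30, b);
    onReport(rcs, 0.20, b);                             // triggers the rollback
    std::printf("RCs recorded: %zu, b = %.1f\n", rcs.size(), b);
}
```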

A. Experiments

The setup of the experiment is similar to that in Section IV. The tested application has 100 objects in each iteration during the whole execution, and the computation of an object depends on the results produced by its two neighbor objects in the previous iteration. The emulated resource pool also has communication latency: as introduced above, we use the host location to represent the network connection latency. Table III shows the resource pool; the internal latency is the communication latency between two hosts inside a cluster, and external latencies are calculated from the hosts' locations. In order to demonstrate our algorithm's ability, we make the resource pool dynamically available during the application's execution: initially 5 hosts in Cluster A are available → another 5 in D become available in iteration 2 → another 5 in C in iteration 8 → another 5 in A in iteration 13 → another 5 in B in iteration 25 → another 5 in D in iteration 28.

Fig. 2. Execution speed is tuned to the required level (Section IV).
Fig. 3. The RC (ΣPi) is reconfigured during the execution (Section IV).
Fig. 4. The available hosts to the application during the execution (Section V).

Fig. 5. The change of execution speed during the execution (Section V).
Fig. 6. RC's dynamic reconfiguration during the execution (Section V).
Fig. 7. The change of RC (ΣPi) to ensure the deadline is met for the Mandelbrot application (Section VI).

TABLE III
SIMULATED RESOURCE SET-2.

Cluster   | Hosts | MIPS  | Location | Internal Latency
Cluster A | 11    | 59455 | 0.0      | 0.1
Cluster B | 6     | 49161 | 0.5      | 0.1
Cluster C | 18    | 27079 | 1.5      | 0.1
Cluster D | 15    | 18938 | 2.0      | 0.1

Figure 6, 5 shows RC’s reconfiguration and corresponding tuning of execution speed (iteration/1000sec). We can see the RC started from the 5 hosts in Cluster A and AA kept trying to add positive hosts and throw away negative hosts. P In iteration 2, AA added 5 hosts from Cluster D. The Pi increased to 391965 while the total latency also increased. We notice the speed immediately dropped after the 5 hosts has been added due to the negative effect of latency. AA detected the decline of performance and soon rollbacked the RC to previous configuration which was composed of 5 hosts from Cluster AA. After rollback the speed raised up again. The second resource addition attempt occurred between iteration 8 to 10. AA tried to add another 5 hosts in Cluster C but ended by giving them up because of the decline in performance. The third attempt was successful. AA added 5 hosts from Cluster A in iteration 13, and the reported PV went up, which mean they brought positive effect in (the network latency was very small since the added hosts were in the same cluster of the previous added hosts). After the addition the speed gradually went up to 25 iteration/1000sec, which is the highest speed so far. The fourth attempt was to try to add another 5 hosts in Cluster B. We can see between iteration 25 to 28 the speed

dropped, which indicated that the latency significantly dragged down the execution speed. AA soon released those hosts and the speed came back to its maximum point. Although another 5 hosts in Cluster D became available at the end of the execution, AA did not try to add them, since from the fourth attempt it had learned that under the current configuration the bottleneck latency was 0.5, i.e. adding any host that brings in a latency of more than 0.5 could decrease the performance. This experiment shows that AA is able to find a best-effort RC that maximizes the execution performance: AA probes the application's execution reaction by reconfiguring the RC, learns what the bottleneck hosts could be, and raises the execution performance to a local maximum as more resources become available to the application.

VI. VALIDATION WITH A MANDELBROT APPLICATION

In addition to the emulation-based experiments, we also validate our approach by executing an application in a real distributed environment. The application draws a Mandelbrot set on a 1000×1000-dot canvas with magnification 1.0; the benchmark iteration count is 500,000. The data is broken into 1000 objects, each responsible for computing one row of 1000 dots. The objects are distributed over a number of "herder" processes. The application uses a simple dynamic load balancing policy: each process balances its load with its two neighbor processes every 10 seconds. The application is implemented with the AA API and fully benefits from the AA services [10].

The resource pool available to the application is a homogeneous cluster in the Computer Science Department of University


College London. This pool is composed of 57 hosts, each equipped with an Intel(R) Celeron(R) 2.80 GHz CPU and 502 MB of RAM, and running Red Hat Enterprise Linux. Some hosts may have a light load, and the load does not change frequently. In order to observe the change of ΣPi, we define the standard P of a host with load 0 as 1.0; the P of a host with load l is then calculated as 1.0/(1 + l). We use the execution deadline (ED) as the performance requirement and define the PV as follows:

ne = the number of objects expected to be finished at time t, ne = 1000 × t / ED;
n = the number of objects actually finished at time t;
PV = n / ne.

The application reports the PV to AA every 20 seconds. AA's job is to keep PV ≥ 1.0 with the fewest resources, so as to satisfy the execution deadline. Since not all of the 1000×1000 dots require the same amount of computation (the dots belonging to the Mandelbrot set require less), the amount of computation varies slightly during the execution (it is intrinsically dynamic). We therefore set R = 10^-4 in the Kalman filter to respond quickly to the dynamics while still removing small noise signals.

We ran two experiments, with deadlines of 200 seconds and 300 seconds; in both, AA initially assigned 4 hosts. Figure 7 shows how AA tuned the RC during the execution to meet the deadlines. In the first experiment AA dynamically added 13 hosts (some of them lightly loaded) to serve the application and completed the execution in 174 seconds. In the second experiment AA assigned about 7 hosts (some lightly loaded) for most of the time and completed the execution in 228 seconds. As anticipated, AA not only satisfied the application's performance requirement but also aimed to use the fewest resources with high utilization: around 200 seconds it released 1 host after detecting that the remaining resources were enough to meet the deadline.
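The deadline-based PV and the load-adjusted processing ability used in this section can be computed as in the sketch below. The formulas ne = 1000 × t / ED, PV = n / ne and P = 1/(1 + l) come from the text; the sample t, n and l values are made up for illustration.

```cpp
#include <cstdio>

int main() {
    const double ED = 200.0;      // execution deadline in seconds
    const double t  = 50.0;       // elapsed time so far (illustrative)
    const double n  = 230.0;      // objects actually finished by time t

    double ne = 1000.0 * t / ED;  // objects expected to be finished by t
    double pv = n / ne;           // PV >= 1.0 means the deadline pace is met
    std::printf("ne = %.0f, PV = %.2f\n", ne, pv);

    // Effective processing ability of a host with load l (standard host
    // with load 0 has P = 1.0):
    double l = 0.5;
    double P = 1.0 / (1.0 + l);
    std::printf("host with load %.1f has relative P = %.2f\n", l, P);
}
```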

VII. CONCLUSION AND DISCUSSION

In this paper we introduce a novel approach that automatically configures the resource environment for executing a distributed/parallel application without requiring users to provide information about the application's characteristics. The approach, quite different from traditional static automatic resource selection approaches, resembles a simple control loop that dynamically tunes the RC depending on the execution reaction of the application. AA monitors the execution performance by letting the application itself report an abstract PV, rather than placing sensors into the code or using event tracing to gain specific information [12], [13], [14]; this gives users more freedom in choosing performance metrics and is more lightweight. We divide the research into two paradigms: neglecting network latency and considering network latency. For the first paradigm, we use a linear prediction with a Kalman filter to find the expected RC that satisfies the performance requirement. For the second, we let AA probe possible RCs and roll back failed ones, looking for a locally optimal RC that gives the application maximized performance.

However, since the approach may cause frequent load balancing and adaptation, the overhead of these actions may need to be taken into account. Additionally, it requires the application to have a small enough object granularity to allow the execution to expand over a large number of resources with good load balance. Even with these limitations, we believe the research brings some new and interesting aspects to the area. In the future, the research will focus on refining the methodologies of the two paradigms and on validation with more complex scientific applications in large-scale networks.

REFERENCES

[1] K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith, and S. Tuecke, "A resource management architecture for metacomputing systems," Lecture Notes in Computer Science, vol. 1459, pp. 62-??, 1998.
[2] R. Raman, "Matchmaking frameworks for distributed resource management," Ph.D. dissertation, 2000, supervisor: Miron Livny.
[3] A. Anjomshoaa, F. Brisard, M. Drescher et al., "Job submission description language (JSDL) specification, version 1.0," Global Grid Forum, Tech. Rep., 2005.
[4] F. D. Berman, R. Wolski, S. Figueira, J. Schopf, and G. Shao, "Application-level scheduling on distributed heterogeneous networks," in Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing. Washington, DC, USA: IEEE Computer Society, 1996, p. 39.
[5] R. Wolski, N. T. Spring, and J. Hayes, "The network weather service: a distributed resource performance forecasting service for metacomputing," Future Generation Computer Systems, vol. 15, no. 5-6, pp. 757-768, 1999.
[6] R. Buyya, M. Murshed, and D. Abramson, "A deadline and budget constrained cost-time optimization algorithm for scheduling task farming applications on global grids," in Int. Conf. on Parallel and Distributed Processing Techniques and Applications, Las Vegas, 2002.
[7] K.-W. Kang and G. Woo, "An automatic resource selection scheme for grid computing systems," Computational Science and Its Applications - ICCSA 2005, pp. 29-36, 2005.
[8] R. Huang, H. Casanova, and A. A. Chien, "Automatic resource specification generation for resource selection," in SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. New York, NY, USA: ACM, 2007, pp. 1-11.
[9] P. Lindner, E. Gabriel, and M. M. Resch, "Performance prediction based resource selection in grid environments," in HPCC, 2007, pp. 228-238.
[10] H. Liu, A. Nazir, and S.-A. Sørensen, "A software framework to support adaptive applications in distributed/parallel computing," in HPCC '09, 2009.
[11] G. Welch and G. Bishop, "An introduction to the Kalman filter," Tech. Rep.
[12] R. L. Ribler, J. S. Vetter, H. Simitci, and D. A. Reed, "Autopilot: Adaptive control of distributed applications," in Proceedings of the 7th IEEE Symposium on High-Performance Distributed Computing, 1998, pp. 172-179.
[13] A. Morajko, P. Caymes-Scutari, T. Margalef, and E. Luque, "MATE: Monitoring, analysis and tuning environment for parallel/distributed applications," Concurrency and Computation: Practice and Experience, vol. 19, no. 11, pp. 1517-1531, 2007.
[14] J. A. Kohl and G. A. Geist, "The PVM 3.4 tracing facility and XPVM 1.1," 1995, pp. 290-299.
