DGSchedSim: a trace-driven simulator to evaluate scheduling algorithms for desktop grid environments

Patricio Domingues, ESTG, Leiria, Portugal ([email protected])
Paulo Marques, Univ. Coimbra, Portugal ([email protected])
Luis Silva, Univ. Coimbra, Portugal ([email protected])

Abstract
This paper describes DGSchedSim, a trace-driven simulator for evaluating scheduling algorithms aimed at minimising the turnaround time of applications executed on heterogeneous desktop grid systems. The simulator can model task-farming applications comprised of a set of independent, equal-sized tasks, similar to numerous @Home public computing projects like the popular SETI@Home. DGSchedSim allows scheduling policies to be assessed under several scenarios, controlling parameters such as the application's requirements (number of tasks, per-task CPU time), the properties of the environment (machine computing capabilities and availability) and the characteristics of the execution (frequency and storage location of checkpoints, etc.). The simulations are driven by traces collected from real desktop grid systems. Besides DGSchedSim, the paper presents the Cluster Ideal Execution Time (CIET) algorithm, which computes the ideal wall-clock time required by a fully dedicated, totally reliable cluster of M heterogeneous machines to process the T tasks of an application. As a test of the simulator's capabilities, the paper analyses the suitability of two scheduling algorithms, FCFS and MinMax, for delivering fast turnaround times in desktop grids. Both algorithms, when combined with a centrally stored checkpoint policy, achieve efficiency close to 50% of CIET in certain scenarios.

Keywords: desktop grid computing, turnaround time, trace-based simulation, scheduling, task-farming applications.

1. Introduction
Desktop grid systems have emerged in the last few years as an important methodology for harnessing the idle resources of desktop systems, allowing the use of computing power that would otherwise frequently sit at CPU idleness above 90% [1][2]. Although mature solutions for LAN environments like Condor [3] have existed for more than a decade, the advent of the Internet brought a global-scale vision for desktop grids. In particular, the popularity of SETI@Home, attested by its impressive 60 TFlops of computing power [4], has drawn attention to global computing, fuelling numerous @Home-like projects [5]. Nowadays, a plethora of frameworks and systems for harnessing desktop resources exists, ranging from academic projects such as Condor [3], BOINC [6], XtremWeb [7] and Alchemi.Net [8] to commercial offerings like Entropia [9] and DataSynapse [10], amongst many others. The availability of such systems allows the construction of institutional desktop grid environments, where the computing resources of an institution are harnessed to provide demanding users (researchers, staff, etc.) with inexpensive computing power. In this study we consider desktop grids comprised of resources connected by a local area network (for instance, all the desktop machines of a university campus or a company).

Public desktop grid systems are mostly oriented toward high-throughput computing, being tuned to deliver high steady-state performance when processing a large number of normally independent tasks. In fact, the primary goal of public projects is to process a high number of tasks over a long period of time, maximising the number of tasks processed per time unit. In contrast, users of smaller desktop grids normally pursue fast execution times, trading high throughput, if needed, for faster turnaround. So, instead of having a high number of tasks processed over a relatively long period, fast-turnaround schemes focus on delivering fast execution times for a moderate number of tasks.
However, scheduling on desktop grid environments with the minimisation of total execution time as the objective function is an NP-hard problem [11]. Thus, heuristic algorithms are needed to achieve good solutions in polynomial time or better. Indeed, while eager algorithms like First Come First Served (FCFS) yield good throughput for long-running applications, such strategies are normally regarded as inefficient for delivering fast turnaround times [12].

Besides the natural complexity of scheduling, a major challenge for desktop grid computing lies in dealing with the volatility of volunteer resources. Indeed, a desktop computer is frequently assigned to an individual who, besides using the machine, effectively controls the resource and acts as its owner. Thus, desktop grid execution schemes need to tolerate the variation of resources induced by the interactive presence of users. Moreover, the success of volunteer mechanisms is tightly linked to owners' perception that they remain in full control of their machines [13]. Choi et al. [14] categorise failures of volunteer resources in two main classes: volatility failures and interference failures. The former comprises failures like network and machine crashes. The latter is a consequence of the shared nature of resources, where the interactive users of a machine have priority over volunteer computation in accessing the machine's resources. For the purpose of scheduling on desktop grids, a failure is perceived as an interruption which stops or even aborts the execution, possibly compromising the goal of a fast turnaround time. A usual technique to minimise the effects of transient failures is checkpointing [15]. This technique allows applications to save their state at regular intervals to persistent storage, so that after an interruption the application may be restarted from the last valid checkpoint. Since desktop grids are highly failure-prone environments, frameworks like BOINC [6] offer support for application-level checkpointing.
Moreover, depending on the portability of checkpoints and on the characteristics of the desktop grid environment, checkpointing may enable task migration and task replication, two interesting techniques for enhancing scheduling strategies on desktop grids [12]. To research the issues and factors involved in scheduling for minimal turnaround time on volatile desktop grid environments, we developed the DGSchedSim simulator. The main goal of the simulator is to provide accurate execution models of applications comprised of multiple independent, equal-sized tasks executed over a set of machines characterised by their relative performance. To keep simulations close to real environments, the simulations are driven by traces collected from real desktop grid systems. In the context of grids, one of the key benefits of simulation over real test-bed experiments for assessing scheduling policies is that it permits reproducible and

controlled experiments. In particular, real desktop grid systems are subject to various uncontrollable external factors, such as interactive load induced by users, variable network load and failures, to name just a few. These random external factors provoke wide and unpredictable fluctuations of resource availability, making it difficult, if not impossible, to obtain repeatable execution conditions for successive experiments except through simulation. Additionally, compared to real test-beds, simulated scenarios are much easier to set up and change, since no real resources (hardware or human), besides the ones that effectively run the simulations, are actually needed. Indeed, simulation makes it possible to study environments that are unavailable, or even nonexistent, in the real world. For example, increasing the number of machines in a simulated scenario may be a simple matter of editing a resource file, while achieving the same effect in a real test-bed requires much more effort, if it is doable at all.

The remainder of this paper is organised as follows. Section 2 exposes the main features of DGSchedSim: the expected input, the output produced and the support for extensibility. Section 3 presents the Cluster Ideal Execution Time algorithm and some scheduling strategies simulated with DGSchedSim, reporting the main results. Section 4 describes related work, while section 5 concludes the paper and presents future work.

2. The simulator DGSchedSim
The requirements that guided the development of the simulator DGSchedSim are:
- Capability of simulating the execution, in desktop grid environments, of applications comprised of independent tasks.
- Ability to support the simulation of both predefined and user-defined scheduling algorithms.
- Support for modelling heterogeneous resources with variable performance.
- Capability of recreating real load scenarios based on load traces collected from real environments.
- Ability to provide relevant information about the temporal evolution of a simulation, so that results can be better understood, allowing the refinement of scheduling algorithms.
- Support for a predefined set of parameters relevant to scheduling in desktop grids, namely checkpoint policies and their associated parameters.

2.1. Input
To carry out a simulation, DGSchedSim requires four main items: the application requirements, the characteristics of the desktop machines that represent the

grid to simulate, the load traces needed to drive the simulation, and the user-defined scheduling algorithms. As stated before, DGSchedSim supports applications comprised of equal-sized independent tasks (commonly referred to as work units in public computing projects [6]). The description of an application includes, besides the number of tasks, the computing requirements of an individual task. The computing needs of a task are expressed by its required CPU time: the number of dedicated CPU time units necessary for the complete execution of the task, given relative to a reference machine. To extrapolate the required CPU time of a task, the computing capabilities of the involved machines need to be quantified. For this purpose, DGSchedSim resorts to INTFP, the arithmetic mean of two numerical performance indexes, INT and FP. Both indexes are measured with the NBench benchmark [16], which relies on well-known algorithms to summarise integer and floating-point performance. To compute the CPU time needed for the execution of a task on a given machine, DGSchedSim uses the ratio between the machine's INTFP index and the reference machine's index. For example, a ratio of 3 means that the machine is credited as being three times faster than the reference machine, so the execution of a task will consume 1/3 of the CPU time that would have been needed on the reference machine. Other characterising elements of a task include the maximum required memory, the disk space, the input data size, and the checkpoint size, in case checkpointing is enabled. Every simulated machine is defined by a single entry in a so-called desktop grid configuration file. An entry holds the machine name, its INTFP performance index, and static attributes like CPU model and speed, amount of main memory, disk space and the speed of the network interface.
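The CPU-time extrapolation described above reduces to a single ratio. The sketch below illustrates it in Python; the function name and signature are ours, not DGSchedSim's actual code:

```python
def cpu_time_on_machine(task_ref_cpu_secs, machine_intfp, ref_intfp):
    """Estimate the dedicated CPU time a machine needs for a task whose
    cost is given in seconds of the reference machine's CPU time.

    INTFP is the arithmetic mean of NBench's INT and FP indexes; a
    machine with three times the reference INTFP is credited as three
    times faster, so it consumes 1/3 of the reference CPU time.
    """
    ratio = machine_intfp / ref_intfp
    return task_ref_cpu_secs / ratio

# A task needing 3600 s on the reference machine (INTFP 25.008) takes
# 1200 s on a machine three times faster (INTFP 75.024):
# cpu_time_on_machine(3600, 75.024, 25.008) → 1200.0
```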
The entry also defines the thresholds for volunteer computing, namely the maximum main memory and maximum disk space available to volunteer tasks. Tasks requiring resources above a machine's thresholds cannot run on that machine. To drive a simulation, DGSchedSim uses the traces produced by the Distributed Data Collector (DDC) [2]. DDC is a framework that periodically runs probes over a set of distributed machines, collecting the output of every probe execution into a central repository. Traces suited to DGSchedSim are collected via DDC with the W32Probe [2]. The traces are organised by timestamps: a timestamp corresponds to the chronological time, in UNIX epoch format, when the data was collected. At every timestamp, the

collected data aggregates various metrics for all the monitored machines, such as uptime, CPU idleness, presence of an interactive user, main memory load, and so on. A trace is comprised of data captured at successive timestamps. The time interval between consecutive timestamps may influence the accuracy of the simulation results: a wide interval, even if it speeds up the simulation (fewer timestamps need to be processed), might worsen its precision, since events occurring between two timestamps cannot be reproduced. To perform a simulation with a given trace, it is necessary to specify the timestamp of the trace to be used as the starting point. However, different starting points might yield substantially different results, since different load patterns can be crossed by the simulation. For instance, a Friday afternoon will probably yield a much different execution pattern and turnaround time than a Monday morning. This is especially relevant for short-lived applications. Therefore, to prevent results biased by the unpredictability of the starting point, DGSchedSim supports a multi-run mode, in which several simulations are run from different starting timestamps and the final results correspond to the average over all the executed simulations. Under this mode, the starting points may be user-selected (to allow reproducible results) or randomly generated by the simulator. Besides the trace data, the machine description file and the characterisation of the application and its tasks, a DGSchedSim simulation can be tuned with additional parameters such as the checkpoint policy, the checkpoint frequency and the scheduling algorithm. Currently, the simulator supports two checkpointing policies: local and centralised.
Under the local policy, checkpoints are saved on the executing machine, while the centralised policy stores checkpoints on a central server. While the centralised policy incurs network overheads, it permits the sharing of checkpoints, allowing the execution of a task interrupted at a given machine (for instance, because the machine went off) to be resumed at another machine from the last valid checkpoint kept on the central server. As the name implies, the checkpoint frequency parameter defines the regularity of checkpoint saving. For instance, a 10% frequency means that a checkpoint is taken at every 10% of completed work; a 0% frequency effectively disables checkpointing. Since one of the main goals of DGSchedSim is to allow experimentation with scheduling policies, the framework supports the addition of user-defined

scheduling algorithms. To add a scheduling algorithm, the user only needs to implement a Python class derived from the base class dgs_sched. The class must override the method DoSchedule(), which is called at every timestamp. This method receives as parameters the current timestamp and the list of non-executing tasks (i.e., tasks that have not yet been started or that are currently stopped). Through its base class, the method can also access the core data of the simulation, such as the list of machines and their associated status. So far, we have implemented the FCFS and MinMax algorithms (see section 3). DGSchedSim updates the status of the simulation at every timestamp. Based on the trace information of the timestamp being processed, the simulator updates the status of machines and tasks. If the machine assigned to a given task is still available, the simulator advances the task's progress percentage according to the idle CPU time of the executing machine, weighted by its computing capabilities relative to the reference machine.
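The paper names the base class dgs_sched and the DoSchedule() hook, but does not document the base class internals. The following sketch therefore mocks them with a plain list of dicts to illustrate the plugin shape; every field name and the constructor are our assumptions:

```python
class dgs_sched:
    """Stand-in for DGSchedSim's scheduler base class. The real base
    class exposes the simulation core (machines and their status); its
    actual attributes are undocumented, so a list of dicts is assumed."""
    def __init__(self, machines):
        self.machines = machines

    def DoSchedule(self, timestamp, pending_tasks):
        raise NotImplementedError


class FCFSSched(dgs_sched):
    """First come first served: assign each pending task, in order, to
    the next machine that is currently free."""
    def DoSchedule(self, timestamp, pending_tasks):
        free = [m for m in self.machines if m["free"]]
        assignments = []
        for task, machine in zip(pending_tasks, free):
            machine["free"] = False        # machine is now busy with task
            assignments.append((task, machine["name"]))
        return assignments
```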

2.2. Output
Besides turnaround time, DGSchedSim can be set to produce other types of results. The goal of this diversity of outputs is to allow a better understanding of the outcome of a simulation. For every simulation executed, several statistics are generated, such as turnaround times and the numbers of saved and loaded checkpoints, amongst others. These statistics are saved in CSV format, allowing the use of generic tools like spreadsheets for further analysis and processing. If instructed to do so, DGSchedSim can also produce a file containing the evolution over time of the counts of tasks completed, stopped and under execution. This information can be plotted to survey the temporal evolution of the execution and the consequent behaviour of the scheduling policy. Furthermore, to enable a better understanding of the execution, important for devising strategies to minimise turnaround time, DGSchedSim can also produce a set of images depicting the physical distribution of tasks over machines as well as their state of execution. Since one image is generated per timestamp, the whole set can be combined into an animation displaying the application execution driven by the traces of the desktop grid. DGSchedSim can also compute the cluster equivalence ratio (CER) yielded by a given load trace. The CER is a metric defined by Anderson et al. [13] and further refined by Kondo et al. [12] to gauge the usable performance of a set of non-dedicated machines as a desktop grid system. DGSchedSim applies the CER

definition by computing the CPU availability of a machine for a given period as its measured CPU idleness over the analysed period of time. For instance, a machine with 95% CPU idleness is accounted as 0.95 of a dedicated machine. This methodology assumes that all idle CPU time can be harvested; thus, the obtained results should be regarded as an upper limit of the CPU resources that can effectively be exploited.
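Under this assumption, the CER computation amounts to averaging measured idleness. A minimal sketch, where the input layout (per-machine lists of idleness percentages sampled at each timestamp) is our assumption:

```python
def cluster_equivalence_ratio(idleness_samples):
    """Upper-bound CER over a trace period: a machine with 95% CPU
    idleness counts as 0.95 of a dedicated machine, and the CER is the
    average of those fractions over all monitored machines.

    idleness_samples: dict mapping machine name -> list of per-timestamp
    CPU idleness percentages (0-100).
    """
    per_machine = [sum(s) / len(s) / 100.0
                   for s in idleness_samples.values()]
    return sum(per_machine) / len(per_machine)
```

For example, one machine idling 95% of the time and another idling 5% together yield a CER of 0.50, i.e., half of an always-available dedicated machine.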

3. Simulations and results
In this section, we analyse the results of two scheduling algorithms that have been simulated with DGSchedSim: first come first served (FCFS) and unassigned min-max (henceforth MinMax). The former is an implementation of the classical scheduling algorithm, which is used in the major desktop grid schedulers [12]. The latter is a simple algorithm that, at every scheduling point, tries to assign non-started and non-running tasks to the machines that are free at the moment of scheduling. For this purpose, MinMax analyses four different scenarios and predicts, for each one, the maximum time needed to execute the tasks when assigned to the available machines. The four scenarios result from combining the list of free machines with the list of unexecuted tasks: since each list can be sorted in ascending or descending order, the combination of the two lists yields the four scenarios. The machine list is sorted by individual performance as measured by the INTFP index, while the task list is ranked according to each task's remaining computing requirement (i.e., how much CPU time is still needed to complete the task). For every scenario, MinMax computes the expected time to finish, which corresponds to the last machine/task pair to terminate, since the last task to finish determines the turnaround time. The expected time to finish is an optimistic prediction based on the assumption that each machine will be fully dedicated to its task until completion. Finally, the scheduler chooses the scenario with the lowest predicted turnaround time. Thus, this algorithm reduces the scheduling decision to a relatively simple min-max problem.
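The four-scenario evaluation above can be sketched as follows. Data layouts, names and the default reference index are illustrative assumptions, not the simulator's actual code:

```python
def minmax_schedule(free_machines, pending_tasks, ref_intfp=25.008):
    """Sketch of the unassigned MinMax heuristic: evaluate the four
    sorted/reverse-sorted pairings of free machines and pending tasks,
    predict each pairing's worst-case (last) finish time assuming fully
    dedicated machines, and keep the pairing with the lowest prediction.

    free_machines: list of (name, intfp).
    pending_tasks: list of (task_id, remaining_ref_cpu_secs).
    Returns a list of (task_id, machine_name) assignments.
    """
    def finish_time(machine, task):
        _, intfp = machine
        _, remaining = task
        return remaining / (intfp / ref_intfp)   # dedicated-machine estimate

    best = None
    for m_rev in (False, True):           # machines by INTFP, asc/desc
        for t_rev in (False, True):       # tasks by remaining time, asc/desc
            ms = sorted(free_machines, key=lambda m: m[1], reverse=m_rev)
            ts = sorted(pending_tasks, key=lambda t: t[1], reverse=t_rev)
            pairs = list(zip(ts, ms))     # pairwise task/machine assignment
            makespan = max(finish_time(m, t) for t, m in pairs)
            if best is None or makespan < best[0]:
                best = (makespan, [(t[0], m[0]) for t, m in pairs])
    return best[1]
```

With one machine twice as fast as the reference and one half as fast, the heuristic sends the long task to the fast machine, since the pairings that do the opposite predict a much later last finish.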

3.1. Simulated scenarios
To study the turnaround time of these two algorithms, we considered different scenarios involving applications and machines. Applications with 25 and 50 tasks were simulated, with individual tasks requiring 1800, 3600 and 7200 seconds of the reference machine's CPU time. The

reference machine was a Pentium 4/1.6 GHz with an INTFP index of 25.008. The simulations were driven by a 35-day trace collected at a two-minute period from 32 Windows classroom machines. The average CER of the trace is 0.52. Figure 1 aggregates two CER-related plots of the trace: the top plot shows the CER over the 35 days, while the bottom plot shows its weekly distribution. The high volatility of the resources causes both plots to exhibit numerous high-frequency changes.

Figure 1: CER over the trace interval (top) and its weekly distribution (bottom).

Besides execution without checkpointing, local and centralised checkpoint policies were simulated with checkpoint frequencies of 5% (9 checkpoints taken over the execution) and 50% (a single checkpoint). The checkpoint size was set to 10 KB. The pool of 32 simulated machines is grouped into four even sets (eight machines each) according to their performance. The main characteristics of the machines are shown in Table 1; the last column gives each machine's INTFP ratio relative to the reference machine. The network speed was set to 96 Mb/s, as reported by several measurements performed with the IPerf tool [17].

3.2. Cluster Ideal Execution Time
We define the Cluster Ideal Execution Time (CIET) as the theoretical time needed by the machines to carry out the application, considering that the desktop grid machines work as a cluster fully dedicated to the execution of the application's tasks, that no failure occurs throughout the execution, and that no task migration takes place. As the no-failure condition implies, CIET is an ideal execution time. To compute the CIET for a pool of M heterogeneous machines executing T equal-sized tasks, we devised a simple iterative algorithm that determines the number of completed tasks as a function of the elapsed time t. The algorithm relies on a list with one entry per machine. Every entry keeps the count of tasks already executed by the machine, plus one for the task currently being executed. Along with the count, the entry holds the machine's relative execution time, which is simply the task count multiplied by the time the machine needs to compute an individual task; it corresponds to the elapsed time since the distributed computation of the application began. At the start of an iteration, the list is sorted by the relative time of every entry, lowest first. Then, the machine of the lowest entry is credited with another executed task and its relative time is recomputed accordingly. The algorithm terminates when the count of completed tasks equals the number of tasks of the application. Simply put, successive iterations of the algorithm correspond to the successive time marks at which another task gets completed. The CIET values for the scenarios reported in this study are shown in Table 2. Note that the second column ("Task (secs)") specifies the CPU time needed to execute one task on the reference machine.

    Count   CPU               INTFP    Ratio ref.
    8       PIII@650 MHz      12.952   0.518
    8       [email protected] GHz   21.533   0.861
    8       [email protected] GHz   37.791   1.511
    8       [email protected] GHz   42.320   1.692
    Avg.    --                28.649   1.146

Table 1: Machines simulated in the experiments

    Num. tasks   Task (secs)   Turnaround (mins)
    25           1800          36
    25           3600          71
    25           7200          142
    50           1800          58
    50           3600          116
    50           7200          232

Table 2: Cluster Ideal Execution Time (CIET)
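The iterative CIET computation lends itself to a priority queue. The sketch below follows the paper's assumptions (dedicated machines, no failures, no migration) but uses a heap in place of repeatedly re-sorting the list; the function name and signature are ours:

```python
import heapq

def ciet(machine_task_times, num_tasks):
    """Cluster Ideal Execution Time: wall-clock time for a dedicated,
    failure-free cluster to complete num_tasks equal-sized tasks.

    machine_task_times: per-machine time to compute one task. Each heap
    entry is (finish time of the machine's current task, per-task time);
    every pop credits one completed task to the machine that finishes
    earliest, which then immediately starts another task.
    """
    heap = [(t, t) for t in machine_task_times]
    heapq.heapify(heap)
    finish = 0.0
    for _ in range(num_tasks):
        finish, per_task = heapq.heappop(heap)
        heapq.heappush(heap, (finish + per_task, per_task))
    return finish   # time at which the num_tasks-th task completes
```

Feeding it per-task times derived from Table 1 (1800 s divided by each ratio) for 25 tasks yields a little over 35 minutes, close to the 36 minutes reported in Table 2.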

3.3. Results
To prevent results biased by the starting-point effect, the multi-run factor for all simulations was set to 8, meaning that for every scenario 8 simulations were effectively run, with the reported results corresponding to the average of these executions. The turnaround times for the FCFS and MinMax scheduling policies are shown in Table 3 and Table 4, respectively. All turnaround times are in minutes. Columns marked C refer to runs with centralised checkpoints, while columns marked L represent executions with checkpoints stored at the executing machine.

    Num.   Task     No chkpt (min.)   5% chkpt (min.)   50% chkpt (min.)
    tasks  (secs)   C      L          C      L          C      L
    25     1800     75     131        87     140        75     130
    25     3600     154    382        180    389        149    368
    25     7200     323    883        342    744        299    726
    50     1800     127    177        137    216        129    178
    50     3600     256    674        277    682        254    655
    50     7200     514    1200       589    893        491    1007

Table 3: Turnaround times for FCFS

    Num.   Task     No chkpt (min.)   5% chkpt (min.)   50% chkpt (min.)
    tasks  (secs)   C      L          C      L          C      L
    25     1800     73     130        86     139        75     130
    25     3600     150    381        167    374        146    356
    25     7200     326    877        343    740        309    723
    50     1800     116    304        132    211        118    172
    50     3600     248    681        275    675        244    667
    50     7200     509    1195       554    890        514    1005

Table 4: Turnaround times for MinMax

The results clearly show that the centralised checkpointing approach yields turnaround times up to three times faster than the local checkpoint policy. This is consistent with the fact that centralised checkpointing permits execution to be rapidly resumed on another machine when a failure occurs, effectively allowing the sharing of checkpoints. Centralised policies yield turnaround times ranging from 2.09 (25 tasks/1800 seconds, 50% checkpoint frequency) to 2.54 (50 tasks/1800 seconds, 5% checkpoint frequency) times slower than CIET. The local policies yield far worse results, with turnaround times ranging from 3.05 (50 tasks/1800 seconds, no checkpoint) to 5.88 (50 tasks/3600 seconds, 5% checkpoint frequency) times slower than CIET. Regarding checkpoint frequencies, the results indicate that the 50% frequency yields the best turnaround times. In fact, the 5% frequency induces a noticeable overhead, slowing down execution to the point of being slower than the checkpointless approach. This is especially true for the centralised checkpoints, which pay the network costs of checkpointing. Finally, both scheduling algorithms present similar results, meaning that MinMax does not bring any real improvement over FCFS scheduling. Interestingly, in their study involving the BOINC desktop grid system for gene sequence alignment, Pellicer et al. [17] also found that several FCFS-based implementations produced turnaround times similar to the ant colony scheduling algorithm they developed.

4. Related work
Several simulation frameworks targeting grid environments have been developed. SimGrid [18] is a C toolkit that provides core functionality for simulating distributed applications in heterogeneous environments. As stated by its authors, SimGrid's specific goal is to foster research into the scheduling of distributed and parallel applications on distributed computing platforms, ranging from simple networks of workstations to computational grids [19]. The toolkit allows the creation of tasks defined by their required CPU execution time and resource usage, expressed relative to the computing capabilities of a reference machine. We adopted a similar approach for DGSchedSim, resorting to the combined INTFP performance index. SimGrid supports the modelling of the underlying network infrastructure and allows trace-driven simulations, supporting traces collected by the Network Weather Service (NWS) [20]. A limitation of resorting to NWS traces lies in the setup required to install, configure and maintain NWS. Furthermore, NWS does not support Windows machines, a major inconvenience when considering desktop grid environments: as shown by the statistics of several @Home projects, a huge percentage of volunteer machines run a version of the Windows operating system. For instance, statistics from the popular SETI@Home project reveal that 81.5% of the results were computed by Windows machines [4]. A drawback that hampers the wider adoption of SimGrid is the need to define all the resources of the simulation, as well as the interactions between components. Additionally, although useful, as proved by the research carried out with the simulator [11], SimGrid is still a work in progress, as the recent changes in its structures and API (for instance, the SG layer being replaced by SURF) demonstrate.
GridSim [21] is an object-oriented framework targeted at the simulation of grid environments comprised of heterogeneous computing resources, where several applications contend for the available means. The toolkit is based on SimJava [22], a package for discrete event simulation. With GridSim, resource capabilities are defined in the form of MIPS as per the SPEC benchmark. GridSim is oriented toward research on economic scheduling, combining deadlines with economic constraints, that is, how to meet a given deadline under a given economic budget.

MicroGrid [23] is a grid simulation framework that resorts to emulation to provide a virtual grid platform based on the Globus middleware infrastructure. The tool virtualises every resource of a Globus grid system, such as CPU, memory and network. Applications run under MicroGrid have their relevant Globus library calls trapped and redirected to the emulation layer. Thus, from the application's point of view, execution is carried out on a regular grid platform, with the application perceiving only the virtual grid resources, independently of the physical resources being utilised. Beneath, the framework coordinates the virtual resources involved in the execution. For the purpose of our research, MicroGrid was unsuited, since the framework is oriented toward Globus applications and has no specific support for desktop grids. Also, since it relies on resource emulation, simulations carried out by MicroGrid might require a significant amount of computing resources and consequently take a relatively long time to execute.

Contrary to the previously mentioned simulators, which focus on tasks, OptorSim [24] aims to evaluate data placement and data replication strategies in data grid environments. The simulator was developed in Java for the EU Data Grid project (EDG). OptorSim takes as input a grid configuration and a replica optimiser algorithm, and then runs a number of grid jobs using the given configuration. The emphasis is on data placement and replication, with the goal of maximising execution efficiency while minimising data movement.

Kondo et al.
[12] used a self-developed object-oriented Perl simulator in their study of scheduling strategies for optimising turnaround time in the execution of independent tasks on desktop grid environments. Although that study gives scarce details of the simulation tool, the tool resorts to trace-driven simulation and is capable of using traces collected from an Entropia grid environment [25]. Our goals differ, since we consider tasks with checkpointing capabilities, a feature that permits task migration and restart from partial executions. In contrast to the simulation tools presented above, DGSchedSim is a more specific tool, targeting volatile desktop grids for the study of checkpoint-based scheduling strategies.

5. Conclusions and future work
DGSchedSim is a trace-driven simulator useful for studying and assessing scheduling policies with respect to their ability to provide fast turnaround times. DGSchedSim's main goal is to be a simulation tool that makes it easy to set up and execute simulations for testing and assessing scheduling policies for desktop grid systems, especially checkpoint-oriented scheduling algorithms. One of the major strengths of the tool lies in its association with the DDC framework: combined with DDC's ability to collect traces, DGSchedSim permits the quick setup of simulation scenarios and experiments that model real environments, as demonstrated by the two scheduling strategies analysed in this paper. The simulations involving FCFS and MinMax clearly demonstrated that centralised checkpointing outperforms local checkpoint policies, at least for a moderate number of machines.

DGSchedSim is still being actively developed. As future work, we plan to continue testing the simulator, tuning its simulation model appropriately. A planned modification is support for applications comprised of heterogeneous tasks with different resource requirements, namely in the CPU time needed for completion. We anticipate that this feature should not require major internal changes to the tool, since the framework already supports tasks that, despite having started at the same time, are at different percentages of their execution due to the normal circumstances of scheduling (different machine availabilities, etc.). Another feature, not directly related to DGSchedSim itself, is the ability to characterise the traces used for simulations, complementing CER with other metrics. We also plan to use DGSchedSim to actively research scheduling algorithms for desktop grid environments, namely task migration and replication oriented scheduling policies. We anticipate that these features will strengthen the usability of the simulator.

Acknowledgements This work was partially supported by PRODEP III/Acção 5.3, by R&D Unit 326/94 (CISUC) and by the FP6 Network of Excellence CoreGRID funded by the European Commission (Contract IST-2002-004265).

References
[1] D. G. Heap, "Taurus - A Taxonomy of Actual Utilization of Real UNIX and Windows Servers," IBM White Paper GM12-0191, 2003.
[2] P. Domingues, P. Marques, and L. Silva, "Resource Usage of Windows Computer Laboratories," presented at International Conference on Parallel Processing (ICPP 2005)/Workshop PEN-PCGCS, Oslo, Norway, 2005.
[3] M. Litzkow, M. Livny, and M. Mutka, "Condor - A Hunter of Idle Workstations," presented at 8th International Conference of Distributed Computing Systems, San José, California, 1988.
[4] SETI, "SETI@Home Project Stats (http://setiathome2.ssl.berkeley.edu/stats/oss.html)," 2004.
[5] J. Bohannon, "Grassroots supercomputing," Science, vol. 308, pp. 810-813, 2005.
[6] D. Anderson, "BOINC: A System for Public-Resource Computing and Storage," presented at 5th IEEE/ACM International Workshop on Grid Computing, Pittsburgh, USA, 2004.
[7] G. Fedak, C. Germain, V. Neri, and F. Cappello, "XtremWeb: A Generic Global Computing System," presented at 1st Int'l Symposium on Cluster Computing and the Grid (CCGRID'01), Brisbane, 2001.
[8] Alchemi.Net, "Alchemi.Net Project (http://www.alchemi.net/)," 2005.
[9] A. Chien, B. Calder, S. Elbert, and K. Bhatia, "Entropia: architecture and performance of an enterprise desktop grid system," Journal of Parallel and Distributed Computing, vol. 63, pp. 597-610, 2003.
[10] DataSynapse, "DataSynapse, Inc. (http://www.datasynapse.com)."
[11] O. Beaumont, A. Legrand, L. Marchal, and Y. Robert, "Independent and Divisible Tasks Scheduling on Heterogeneous Star-shaped Platforms with Limited Memory," presented at 13th Euromicro Parallel, Distributed and Network-Based Processing (PDP2005), Lugano, Switzerland, 2005.
[12] D. Kondo, A. Chien, and H. Casanova, "Resource management for rapid application turnaround on enterprise desktop grids," presented at 2004 ACM/IEEE conference on Supercomputing, 2004.
[13] T. E. Anderson, D. E. Culler, and D. Patterson, "A case for NOW (Networks of Workstations)," IEEE Micro, vol. 15, pp. 54-64, 1995.
[14] S. Choi, M. Baik, C. Hwang, J. Gil, and H. Yu, "Volunteer availability based fault tolerant scheduling mechanism in DG computing environment," presented at 3rd IEEE International Symposium on Network Computing and Applications (NCA'04), 2004.
[15] E. Elnozahy, L. Alvisi, Y.-M. Wang, and D. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Comput. Surv., vol. 34, pp. 375-408, 2002.
[16] U. Mayer, "Linux/Unix nbench project page (http://www.tux.org/~mayer/linux/bmark.html)," 2003.
[17] S. Pellicer, N. Ahmed, and Y. Pan, "Gene Sequence Alignment on a Public Computing Platform," presented at ICPP 2005 - 7th Workshop on High Performance Scientific and Engineering Computing, Oslo, Norway, 2005.
[18] H. Casanova, "Simgrid: a toolkit for the Simulation of Application Scheduling," presented at IEEE International Symposium on Cluster Computing and the Grid (CCGrid'01), Brisbane, Australia, 2001.
[19] H. Casanova, A. Legrand, and L. Marchal, "Scheduling Distributed Applications: the SimGrid Simulation Framework," presented at 3rd IEEE Int'l Symposium on Cluster Computing and the Grid (CCGrid'03), 2003.
[20] R. Wolski, N. Spring, and J. Hayes, "The Network Weather Service: A Distributed Resource Performance Forecasting Service for Metacomputing," Journal of Future Generation Computing Systems, vol. 15, pp. 757-768, 1999.
[21] R. Buyya and M. Murshed, "GridSim: A Toolkit for the Modeling and Simulation of Distributed Resource Management and Scheduling for Grid Computing," Concurrency and Computation: Practice and Experience (CCPE), pp. 1-32, May 2002.
[22] F. Howell and R. McNab, "SimJava: a discrete event simulation package for Java with applications in computer systems modelling," presented at 1st International Conference on Web-based Modelling and Simulation, San Diego, CA, USA, 1998.
[23] H. Song, X. Liu, D. Jakobsen, R. Bhagwan, X. Zhang, K. Taura, and A. Chien, "The MicroGrid: a Scientific Tool for Modeling Computational Grids," Scientific Programming, vol. 8, pp. 127-141, 2000.
[24] W. Bell, D. Cameron, L. Capozza, A. Millar, K. Stockinger, and F. Zini, "OptorSim - A Grid Simulator for Studying Dynamic Data Replication Strategies," International Journal of High Performance Computing Applications, vol. 17, 2003.
