High-Speed Network Modeling For Full System Simulation D. Lugones*, D. Franco*, D. Rexachs*, J. C. Moure*, E. Luque*, E. Argollo♦, A. Falcón♦, D. Ortega♦, P. Faraboschi♦ *Department of Computer Architecture and Operating Systems. University Autónoma of Barcelona. Spain. [email protected], {daniel.franco, dolores.rexachs, juancarlos.moure, emilio.luque}@uab.es ♦ HP Labs, Exascale Computing Lab, Barcelona, Spain {eduardo.argollo, ayose.falcon, daniel.ortega, paolo.faraboschi}@hp.com

Abstract

The widespread adoption of cluster computing systems has shifted the modeling focus from synthetic traffic to realistic workloads to better capture the complex interactions between applications and architecture. In this context, a full-system simulation environment also needs to model the networking component, but the simulation duration that is practically affordable is too short to appropriately stress the networking bottlenecks. In this paper, we present a methodology that overcomes this problem and enables the modeling of interconnection networks while ensuring representative results with fast simulation turnaround. We use standard network tools to extract simplified models that are statistically validated and at the same time compatible with a full-system simulation environment. We propose three models with different accuracy vs. speed ratios that compute network latency according to the estimated traffic, and measure them on a real-world parallel scientific application.

1. Introduction

More than 80% of the Top500 supercomputers [21] are today based on clusters of industry-standard processors and standard high-speed interconnect fabrics. It is a well-known fact that the performance of a typical high-performance computing application based on message-passing libraries (such as MPI) is highly dependent on the throughput and latency of the underlying interconnection network. The recent advances of multi-core processors and integrated memory controllers are rapidly increasing the complexity of these systems. Hardware – from processing cores to cluster interconnects – has become hierarchical, with wildly different access latencies. Heterogeneous architectures and accelerators are also contributing to this complexity. As a consequence, several interoperating policies are necessary to optimize the different components of these environments. In addition, it is common for the network itself to be heterogeneous and multi-protocol in order to provide connectivity and compatibility inside and outside any large-scale system. Tools that allow designing, configuring, dimensioning and tuning new scale-out architectures and their interconnection networks are currently an open research challenge. Increasingly, computer architects need to deal with interconnection problems in order to design and evaluate the components of a system that can meet the requirements of parallel applications in terms of computing, communication bandwidth and latency, and storage. In this context, we believe that the traditional methods of building "off-line" network models exercised with statistical inputs have serious limitations, and should be integrated in a full-system simulation environment to better capture the complex interactions between applications and architectural components. Developing a framework that makes it possible to study existing networks and understand their behavior under different realistic inputs significantly helps the design and customization of large-scale computing systems. Finally, we also believe that any network modeling system should come with a verification framework to ensure that the assumptions are reasonable and statistically validated. This increases the confidence that, if correctly implemented, the final solution will approximate the expected results [11].

1.1 Network models and Full System Simulation As a result of the growing evolution and increasing availability of computational resources, Full Systems Simulation (FSS) environments have become an important tool in the design, validation and performance analysis of new computing systems. This is particularly true when some of the system components are not available, as it is often the case during the sizing phase of any large system (which, for example, may be planned to use next-generation processors). Simulation provides a good way to predict performance and compare alternatives. Furthermore, even if a system is available for measurement, a simulation model may be preferred over measurements because it allows alternatives to be compared under a wider variety of workloads and environments without incurring large prototyping costs. FSS frameworks provide accuracy by modeling the complete software stack including applications, middleware, operating systems, networks protocols, etc. Also, most of the important hardware components of a system and their associated device drivers can be fully represented.

1 This research has been supported by the MEC-MICINN Spain under contract TIN2007-64974.

978-1-4244-5156-2/09/$26.00 ©2009 IEEE


In architecture and microarchitecture research, it is becoming common to assess the performance of new proposals (such as branch predictors or prefetching algorithms) by dynamically adjusting simulation accuracy and speed through sampling [8][10]. Similar simulation techniques can also be used at the system level to evaluate which combinations of computing resources better match customer requirements in terms of performance/cost [9]. In many FSS environments, robust interfaces enable users to incorporate their CPU timing and power models into the simulation infrastructure, and quickly evaluate them in a full-system context by running a full OS and real-world applications. The logical extension of a FSS to simulate a computing cluster is to interconnect multiple FSS instances through a simulated “network switch”. The switch is responsible for the functional delivery of packets to the FSS instances with the correct timing, based on the characteristics of the simulated network such as topology, link latencies, bandwidths, routing algorithm, etc. (Fig. 1).

A variety of network-oriented simulators, such as Opnet [17] or NS-2 [15], today provide excellent support for modeling large-scale networks. However, their complexity makes it difficult to integrate them in FSS environments. For example, they are typically not designed to run real applications and operating systems; rather, they support simple behavioral workload models, statistical traffic generators, or traces of network events. As mentioned in [14], this is currently not compatible with the need to simulate the target system with sufficient detail and a realistic workload. In this paper, we combine the best of both approaches: the accuracy and expressiveness of network-oriented simulations, and the system analysis capabilities of FSS environments. We describe a general methodology that performs fast and accurate interconnection network modeling, while remaining compatible with integration in a FSS environment. The proposed methodology employs accurate reference network models to extract simplified, faster network models with minimal accuracy loss. We use an iterative procedure to make speed vs. accuracy tradeoffs. The resulting models can then be applied to study latency/load curves under different communication patterns, to calculate average network delays, to identify communication hot-spots, and to perform network analysis and planning. Using this methodology we have developed a few different models with varying accuracy vs. speed tradeoffs. In particular, we have observed that a simple destination-based model (where latencies are computed based only on the target node) is able to accurately identify the communication overhead in the majority of the cases that we observed in real applications. We use a quantized simulation approach, where in time interval n we accumulate the statistics for each destination node, which are then used in time interval n+1 to assign the latency of the packets targeting the same destination node.
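As a rough sketch of this quantized scheme (class and function names are our own illustration, not part of any simulator API), the statistics gathered in one interval determine the latency applied in the next:

```python
from collections import defaultdict

class QuantizedDestinationModel:
    """Sketch of the quantized approach: per-destination traffic observed in
    time interval n sets the latency applied to packets in interval n+1.
    latency_fn maps observed load (bytes per interval) to a latency value."""

    def __init__(self, latency_fn):
        self.latency_fn = latency_fn
        self.prev_load = defaultdict(int)   # statistics from interval n
        self.curr_load = defaultdict(int)   # statistics being accumulated

    def on_packet(self, dst, size_bytes):
        self.curr_load[dst] += size_bytes
        # Latency is based on the *previous* interval's load for dst.
        return self.latency_fn(self.prev_load[dst])

    def end_interval(self):
        self.prev_load = self.curr_load
        self.curr_load = defaultdict(int)
```

The one-interval lag is the key design choice: it avoids a circular dependency between the latency assigned to a packet and the load that packet contributes to.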
The rest of this paper is organized as follows. Section 2 presents our proposed network modeling methodology. Section 3 provides more details on some of the possible network models. Section 4 shows accuracy results for proposed models. Finally, Section 5 summarizes our conclusions.

Fig. 1 Cluster simulation

The latency of a packet going through the simulated switch is what the switch model determines during simulation execution. Evaluating precise timing must take into consideration dynamic parameters such as the overall aggregated traffic load. This creates a complication, because performance evaluation at the interconnection level requires modeling a large-enough parallel system and integrating tools that are not a good fit for a FSS environment [16]. What we propose in this work is to decouple the two problems by simulating, in the FSS, a simplified network model (such as the ones presented in [1], [4], [5], [18], and [20]) which has been previously calibrated by a more precise and statistically validated network tool. The methodology enables accurate evaluation studies to fine-tune hardware and software, and can be particularly important for those parallel applications whose performance dynamics depend on the system on which they are executed (such as applications using load balancing features [12], or the dynamic master-worker paradigm [19]). To summarize the objectives of this work, we want to develop network models that are accurate enough to well represent the interconnection system, and fast enough to be used in a FSS environment.
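A minimal sketch of such a decoupled, pluggable latency model is shown below; the interface and names are our own illustration, not the API of any specific FSS:

```python
from abc import ABC, abstractmethod

class NetworkModel(ABC):
    """Sketch of a pluggable latency model queried by the simulated switch
    (interface and names are illustrative, not an actual simulator API)."""

    @abstractmethod
    def latency(self, src, dst, size_bytes, t_now):
        """Return the delivery delay (in seconds) for one packet."""

class ZeroLatencyModel(NetworkModel):
    """Ideal network used during trace collection: instantaneous delivery."""

    def latency(self, src, dst, size_bytes, t_now):
        return 0.0

def deliver_time(model, src, dst, size_bytes, t_now):
    """The switch delays each packet's functional delivery by the model's answer."""
    return t_now + model.latency(src, dst, size_bytes, t_now)
```

During trace collection (the first phase of the methodology in Section 2), the switch would be configured with the zero-latency model; a calibrated model later replaces it.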

2. Methodology for Network Modeling The proposed methodology is based on a closed-loop approach which allows modeling network behavior by analyzing the communication pattern of applications and iteratively improving the accuracy of the model (if necessary). We combine the accurate network description provided by the reference model and the real application execution provided by the FSS. With this approach we can reach a realistic timing approximation of the network


model. Our proposal follows an iterative improvement process composed of four phases, as shown in Fig. 2. The first phase is the trace collection (box "1" in Fig. 2) and uses the FSS to generate a trace of network events for a given application running in the simulator under a zero-latency assumption (i.e., all packets are instantaneously delivered to the destination node). During trace collection, the system performance and the behavioral characteristics of the application, assuming an ideal network, can be measured. The network trace records the correlation (logical) and interference (timing) effects of the simulated workload. Because we execute real applications (Fig. 2 (a)), no simplification is required. For example, we do not need to extract an analytical representation of the workload, as is common in many network simulation environments. In other words, the network models that we develop use real traffic and not abstracted workloads. Because of the high level of detail that the trace contains, it is possible to study the effect of small changes and make various tradeoffs in the model. For example, we can easily compare different network topologies under the same input load. We believe this is much more powerful than using a randomized load that may not well represent the

real behavior of the application. The second phase is the reference execution (box "2" in Fig. 2). The application logical trace is used as input to a network-oriented simulator in order to obtain an accurate timing trace (Reference Network, in Fig. 2). Reference models (which we describe in section 3.1) are used to represent the logic of real networks, as accurately as desired. This phase produces a set of parameters for the network model, as well as a time-annotated trace to be used for accuracy validation. The third phase is the timing execution (box "3" of Fig. 2) and repeats the FSS simulation using the simplified network model parameters extracted in the second phase, generating another trace. Unlike in the first phase, the FSS now assigns different latencies to the packets traversing the network according to the model parameters. If the model is accurate, the generated trace approximates the trace that the network simulator hypothesized in the second phase. The fourth phase is the model comparison (box "4" of Fig. 2). The two timing traces generated in phases two and three are compared in order to assess the accuracy of the model (Model Comparison, in Fig. 2) and to decide whether the projected accuracy meets the objectives of the

Fig. 2 Modeling Methodology


experiment. Depending upon the measured accuracy, the last three phases can be iterated to refine the network parameters obtained from the reference simulator (Network model design, in Fig. 2). Successive iterations minimize the error produced by variations in the workload, and provide a realistic network model according to a desired accuracy/speed tradeoff. This methodology can be used to model networks of any topology and size. Also, different parameters for routers and switching techniques are supported, covering the simulation of many of the technologies of interest. The developed model eventually converges to meet the designer's requirements on accuracy and speed. Once the process reaches convergence, the resulting model can be integrated in the FSS and used independently of the calibration system. In other words, because the model faithfully represents a given network topology for a certain class of applications, it is no longer necessary to run the iterative trace extraction and analysis phases.
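The closed loop above can be summarized in a short sketch; every callable is a stand-in for the corresponding tool (FSS run, reference simulator, parameter fitting, trace comparison), not a real API:

```python
def calibrate(fss_run, reference_run, fit_params, error_metric, tol, max_iters=5):
    """Sketch of the four-phase iterative methodology:
      phase 1: FSS trace collection with an ideal (zero-latency) network,
      phase 2: reference-network replay and parameter extraction,
      phase 3: FSS timing execution with the simplified model,
      phase 4: trace comparison; iterate until the error is acceptable."""
    logical_trace = fss_run(params=None)           # phase 1: ideal network
    params = None
    for _ in range(max_iters):
        ref_trace = reference_run(logical_trace)   # phase 2: reference timing
        params = fit_params(ref_trace)             # extract model parameters
        timed_trace = fss_run(params=params)       # phase 3: timing execution
        if error_metric(ref_trace, timed_trace) <= tol:   # phase 4: compare
            break
        logical_trace = timed_trace                # refine on the next pass
    return params
```

Once the loop converges, only the returned parameters are needed inside the FSS; the reference simulator is no longer in the loop.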

2.1 Implementation Details

Using a FSS (Fig. 2 (b)) enables us to gather very detailed information in a network trace. The minimal required information is the source-destination pair of each routed packet, which we capture in the application logical trace (Fig. 2 (c)) and feed to the reference network model. The logical trace describes the communication pattern of the application. Source and destination nodes are represented by a unique identifier (such as the MAC address of the corresponding NIC), and the packet information also records the message size in bytes. In addition, it records an approximate injection time for the packet, as reported by the timing subsystem of the FSS. Because clustered FSS environments commonly use parallel discrete event simulation techniques to distribute the execution, the precision of the timing information is limited to the synchronization quantum of the FSS. Messages in the trace are ordered according to their injection time, which we then use as sequencing information to replay the trace with the reference model (Fig. 2 (e)). The application timing trace (Fig. 2 (d)) generated by a successive FSS execution with the parameterized model contains the same type of information (source, destination, size and modeled latency), so that it can be compared with the reference timing trace (Fig. 2 (f)) provided by the reference simulator. This mechanism allows us to estimate the accuracy of the model, as shown in Fig. 2 (h) and (i). Message latency is the key metric we focus on. Latency enables the modeling of the delays produced when messages contend for network resources, and it can also be used to represent important physical constraints such as the distance between nodes, the link bandwidth, and the effects of the packet size.

The accuracy evaluation phase compares the reference timing trace (Fig. 2 (f)) to the application timing trace (Fig. 2 (d)) in order to decide whether the model must be refined or not. Comparing network traces is not trivial, because the features we select for comparison must be representative enough to assess overall accuracy and trigger the iterative process. We use a combination of several aggregated metrics, such as the global average latency of all messages, destination-based latency per node, source-destination latency pairs, network or link throughput, and load distribution across the entire network. After the trace comparison phase, if the accuracy analysis shows an acceptable error, the iterative process stops and the network model can be used standalone in the FSS. Otherwise, the model needs to be refined by feeding the application timing trace back into the reference model and iterating the process one more time. The reference network models (Fig. 2 (e)) need to support the description of the relevant features of modern networks, such as topology, routing protocols, switching techniques, flow control mechanisms, and virtual channels. Also, they should include a set of parameters that can easily map to the implementation-specific details of the most commonly used interconnection fabrics in high-performance computing clusters (e.g., 10G-Ethernet, InfiniBand, Myrinet). The use of a reference network simulation also provides important information for the iterative refinement process (Fig. 2(j)): average bandwidth, point-to-point communication bandwidth, sent and received packets per node, packet size distribution, average end-to-end latency, single packet latency, link utilization, channel throughput, flow contention, and so on (Fig. 2(g)). This information allows a better tuning of the model parameters by adding new equations and parameter definitions. We show some examples of this process in the following section. The resulting simplified network model is finally merged into the FSS through an interface that depends on the specific infrastructure (Fig. 2 (k)). In our case, the network simulator includes a pluggable interface where standalone models can be independently developed and linked at runtime.
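For illustration, a logical-trace record carrying the information just described might look as follows; the paper specifies the information content, not a concrete format, so the field names are ours:

```python
from dataclasses import dataclass

@dataclass
class TraceRecord:
    """One routed packet of the application logical trace."""
    src: str          # unique sender id (e.g., the NIC's MAC address)
    dst: str          # unique receiver id
    size_bytes: int   # message size in bytes
    t_inject: float   # approximate injection time from the FSS timing subsystem

def replay_order(trace):
    """Messages are sequenced by injection time before being replayed
    on the reference model."""
    return sorted(trace, key=lambda r: r.t_inject)
```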

3 Network Model Design In this section, we present some of the considerations that can help navigate the complex design space of network models. We also show three concrete examples that we developed through the methodology previously described. The challenge in network modeling is to define the appropriate abstraction level that can reach an acceptable accuracy with minimal impact on simulation speed and complexity. This involves exploring the different


alternatives to evaluate delays and other network performance aspects. The goal of the network model is to abstractly describe the dynamic behavior of data communications in the parallel system. The interconnection network can be represented as a collection of links, routers, and computing nodes. We consider the topology, size, and an arbitrary routing algorithm as inputs to the network model. Router characteristics like the routing delay (the time to take a routing decision), the switch delay (the time to traverse a router), the link propagation delay (the time to traverse a link), and the link bandwidth are also input parameters to the model. The application program is modeled as a graph of communicating tasks. We assume that each task executes independently and communicates with other tasks by means of logical channels. Each task executes for an interval, then performs some communication action with another task by means of a channel, and finally resumes execution until the next communication action. The elapsed time between communications is called the message interval. Note that in a FSS infrastructure simulating the complete hardware and software stack, the message interval is determined by the real application communication pattern. The amount of information exchanged in each communication event is represented by the message size. Application tasks are assigned to computing nodes by a task mapping abstraction. This assignment, together with topology and routing, determines the links belonging to a given source-destination channel. In short, topology, routing algorithm, the set of logical channels, task assignment, traffic load and message size are the input parameters of the model, as depicted by the model "template" of Fig. 3.
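The input "template" just enumerated could be collected in a single structure; this is an illustrative sketch, with field names of our own choosing:

```python
from dataclasses import dataclass

@dataclass
class ModelInputs:
    """Illustrative container for the model 'template' of Fig. 3."""
    topology: dict            # adjacency: node/switch -> connected links
    routing: callable         # (src, dst) -> list of traversed links
    channels: list            # logical source-destination channels
    task_mapping: dict        # task id -> computing node
    traffic_load: dict        # channel -> offered load (bytes/sec)
    message_size: dict        # channel -> average message size (bytes)
    # Router and link characteristics are also inputs:
    routing_delay: float = 0.0      # time to take a routing decision (s)
    switch_delay: float = 0.0       # time to traverse a router (s)
    link_delay: float = 0.0         # link propagation delay (s)
    link_bandwidth: float = 1e9     # bits/sec
```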

Fig. 3 Model Inputs & Outputs

The objective of the model is to evaluate the network behavior during a time interval specified by the user. Therefore, the model estimates the communication delay for each source-destination logical channel. This delay is the time that messages spend travelling from source to destination, including the transmission time itself, as well as the contention time due to collisions with other messages. In other words, the model offers a snapshot of the communication system state for a given workload and time interval. From that snapshot, we can derive an analysis of the factors (if any) limiting network and application performance. Also, we can obtain the latency vs. traffic load curve by evaluating the model under different traffic loads. The network model design space introduced in Fig. 2(j) can be divided into two major components: latency and traffic estimation (Fig. 4). This approach allows evaluating the communication traffic during a time interval and then, according to network features (i.e., topology, link bandwidth, etc.), estimating the latency behavior.

Fig. 4 Model space design

Communication traffic can be evaluated using different alternatives according to the desired model accuracy and complexity. The global injection approach considers the whole load (bytes/sec) without specifying its distribution within the network. In this case, the network is analyzed as a “black box”. In the destination-based approach, the network load is calculated independently for each destination node. The packet-based approach considers each packet, and its interaction with others, at each network link, as is the case for simulation models used in discrete event simulators [13]. Latency can also be estimated by different approaches depending on the desired accuracy. The simplest model consists in representing the network state as congested or non-congested (step model), and fixed latency values are used to represent those states. Alternatively, the latency can be estimated according to a statistical model which represents the network behavior, or according to an analytical model which provides latency values for packets in the network. Fig. 5 shows an example of a step network model designed according to the global injection evaluation and step latency estimation approaches. Despite the simplicity of


this model, it is very useful to test the basic hardware and software functionality of the system simulated by the FSS environment. It is also surprisingly accurate for networks with limited congestion, few nodes and simple topologies (such as crossbar buses or token rings). We use this model as the starting point of the methodology presented in Section 2 to extract the preliminary traces.
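As a hedged sketch, the step model amounts to a single threshold comparison; the parameter names follow the L0/L1/Bcut notation used with Fig. 5, and the values are calibrated against the reference network:

```python
def step_latency(bytes_injected, dt, l0, l1, b_cut):
    """Step model sketch: the aggregate bandwidth injected during the
    interval dt selects one of two fixed latencies (l0 below the b_cut
    threshold, l1 above it)."""
    bw = bytes_injected / dt        # global injection (bytes/sec)
    return l0 if bw < b_cut else l1
```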

Fig. 5 Step Network Model

As shown in Fig. 5, all the traffic (bytes) injected by source nodes is accumulated for each time interval Δt. Thus, the global injection is calculated and used as an input to the latency “step” model. All packets sent within a time interval use the modeled latency L, computed as L0 if the input bandwidth Bw < Bcut, or L1 otherwise. After we obtain the application logical trace (Fig. 2(c)) and inject it into the reference model, the latency estimation may be refined if the model's accuracy is not sufficient. The information gathered by the reference model can also be processed to obtain a statistically distributed function which estimates network latency in a more accurate manner, as shown in Fig. 6.

Fig. 6 Global Average Network Model

In this case, the traffic estimation uses the same inputs as the step model, but the latency is estimated using an average latency function extracted from the reference network model. This function considers the average latency at a given aggregated communication load. This model also represents the network as a black box, and computes the average latency function for fixed network parameters. This implies that if a designer needs to analyze the network behavior with different parameters (e.g., a different topology or routing policy), the average latency function needs to be recomputed. To address this issue it is necessary to abandon the "black-box" approach, and start looking deeper inside the network to establish the parameters that can provide a more flexible model. Also, by modeling the internal characteristics of the network, we can analyze the way in which the traffic load is distributed over the topology. Therefore, it is possible to identify performance hotspots and address an expanded design space covering areas such as new routing algorithms or quality-of-service policies. In order to model internal network features, we first need to "disaggregate" the network traffic and split the global traffic into per-destination traffic. This allows us to quantify the behavior of each source-destination pair in the network and to estimate the latency according to an analytical representation describing the interaction of packet flows delivered to each destination node. For example, we can use queue and regression models as described in [11]. The destination-based model that we propose evaluates, for each time interval Δt, the communication load sent to each destination node, as shown in Fig. 7. This is accomplished by accumulating the traffic sent to the same destination, averaged over Δt. In this way, we can easily estimate the bandwidth for each destination, and we can use this value as an input to estimate the latency. When the model calculations are completed and integrated within the FSS, the packets sent to a given destination are then delivered after the corresponding latency. In this case, we estimate the latency with an equation which considers the topology, network component delays, and the dynamic interaction of packets contending for network resources:

Latency = f_NetComp(N) + Ø_QueueDynamics

- f_NetComp(N) represents the physical delay, defined as the minimal latency accumulated by a packet for a given source-destination path, under the assumption that the packet does not contend for resources with other packets. Thus, its physical latency is given by the network's physical constraints, such as the distance between nodes (hops), the link bandwidth, the packet size, the router delay, the switching technique, and the virtual channel policies.
- Ø_QueueDynamics represents the contention delay when network resources are shared by several sender nodes.
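As one possible realization (our own sketch, not the paper's implementation), the average latency function of the global average model can be obtained by interpolating (load, latency) pairs measured on the reference simulator:

```python
import bisect

def make_avg_latency_fn(samples):
    """Build a piecewise-linear latency(load) function from reference-model
    measurements. `samples` is a list of (load, avg_latency) pairs, one per
    aggregated load level exercised on the reference network."""
    samples = sorted(samples)
    loads = [s[0] for s in samples]

    def latency(load):
        # Clamp outside the measured range, interpolate linearly inside.
        if load <= loads[0]:
            return samples[0][1]
        if load >= loads[-1]:
            return samples[-1][1]
        i = bisect.bisect_left(loads, load)
        (x0, y0), (x1, y1) = samples[i - 1], samples[i]
        return y0 + (y1 - y0) * (load - x0) / (x1 - x0)

    return latency
```

As noted above, these samples are valid only for the fixed network parameters under which they were measured; changing the topology or routing policy requires rebuilding the function.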


Fig. 7 Destination Based model

To compute the physical delay f_NetComp(N), we analyze the network components in a given source-destination path, as shown in Fig. 8. A path contains the network cards (NCs) of the source and destination nodes, a number N of switches, and N+1 links. The time a NC takes to process and inject each packet into the network is modeled by the Tnc value. The fly time (Ft) is defined as the time taken by the first byte of the packet header to cross a link (propagation delay). The switching time (Tsw) models the time it takes for one packet to cross a switch; this value depends on the switch architecture and its components.

Fig. 8 Source-destination path

An important aspect to consider in the calculation of the physical latency is the switching technique implemented by the switch. Switching techniques determine when and how internal switches are set to connect inputs to outputs, and the time at which message components may be transferred. For example, Virtual Cut-Through (VCT, [6]) is one of the most used techniques in high performance clusters, as mentioned in [7]. For a network composed of switches using VCT, the physical delay experienced by a packet crossing a path containing N switches can be calculated as follows:

f(N) = Tnc + Ft + PkSZ / LkBW + N · (Ft + Tsw + PkHR / LkBW)

Where:
PkSZ : Packet full size [bits]
PkHR : Packet header size [bits]
LkBW : Link bandwidth [bit/sec]
Ft : Link fly time [sec]
Tsw : Switching time [sec]
Tnc : Network card time [sec]
N : Number of switches in the src-dst path

To compute the contention delay Ø_QueueDynamics, representing the dynamic interaction of packets in network resources, we use the traffic patterns, topology, mapping, shared components (filtered through the model abstraction level), and the communication load evaluation. The destination-based model calculates the contention delay for packets by evaluating their destination nodes during a time interval called the quantum. From the FSS point of view, the quantum is used to synchronize different simulated (virtual) nodes running in parallel on distributed, unsynchronized physical hosts. Hence, synchronization takes place at fixed intervals to optimize simulation speed with a low accuracy loss, as studied in [9]. Within a quantum, each simulated node sends packets to the network without any synchronization. Therefore, the model assumes a certain inter-arrival distribution of packets inside the quantum in order to correctly account for the contention and determine the delay of each packet. In this model, we assume a uniform distribution of arrival times within a quantum for each destination node (represented by a queue). For example, Fig. 9 shows the latency calculation when four packets contend for a given destination in a given quantum. When the first packet arrives, the network model assigns it to the start of the quantum. If there are no conflicts and no contention, the model assigns the packet latency based on its physical delay. When the second packet arrives at the same destination, the model assumes that each packet is distributed uniformly across the quantum. If the physical delay of the first packet overlaps with the arrival time of the second packet, contention appears and causes additional delay.
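Under these assumptions (VCT physical delay, uniform arrivals within a quantum), the per-destination latency calculation can be sketched as follows; the function and variable names are ours, mirroring the formulas of this section:

```python
def physical_delay(n_switches, pk_size, pk_header, lk_bw, ft, tsw, tnc):
    """VCT physical delay f(N) for a path with N switches (sizes in bits,
    bandwidth in bits/sec, times in seconds)."""
    return tnc + ft + pk_size / lk_bw + n_switches * (ft + tsw + pk_header / lk_bw)

def contention_delays(phys_delays, quantum):
    """Per-packet contention delay for packets sharing one destination
    queue in a quantum, assuming uniformly distributed arrivals: each
    packet waits for the previous packet's contention plus physical delay,
    minus the inter-arrival gap Q/Npks, clamped at zero."""
    npks = len(phys_delays)
    gap = quantum / npks
    phi = [0.0]                      # the first packet sees no contention
    for i in range(1, npks):
        phi.append(max(0.0, phi[i - 1] + phys_delays[i - 1] - gap))
    return phi

def total_latencies(phys_delays, quantum):
    """Total latency = physical delay + contention delay."""
    return [f + p for f, p in zip(phys_delays, contention_delays(phys_delays, quantum))]
```

The clamping at zero corresponds to the case where the previous packet finishes before the next one arrives, so no contention accumulates.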

Fig. 9 Estimation of contention latency

In general, the calculation is performed as follows. For each incoming packet in a quantum:
- Identify the destination node
- Recalculate the arrival distribution including the new packet
- Calculate the Ø latency based on the destination queue.

The contention delay of packet i with destination x is:

Ø[i-1] = Ø[i-2] + f[i-1] − Q/Npks,  for i > 0 and Ø > 0
Ø[i-1] = 0,  for Ø < 0

Where:
f[i-1] : Physical delay of packet i-1 with destination x
Ø[i-1] : Contention delay of packet i-1 with destination x
Q : Quantum duration
Npks : Number of packets processed in the quantum

Finally, the total latency is calculated by adding the physical and contention delays: Lat_i = f(N) + Ø[i-1]

3.1 Reference Models

As mentioned in section 2, we use a reference network simulator to assess accuracy and derive the parameters of the simplified models (Fig. 2(e)). In our experiments, we use OPNET Modeler [17], a discrete event simulator supporting several abstraction levels for networks, nodes, and processes. OPNET modules typically represent applications, protocol layers, and physical resources, such as buffers, ports, and buses. A complete description of the reference models for network cards, switch architectures, and topologies is provided in [13]. In our work, we investigate adaptive routing policies, virtual cut-through switching, flow control mechanisms and multipath features. By using these models it is possible to specify link parameters such as bandwidth, bit error rate, propagation delay, and supported packet formats, as well as other attributes.

4 Experiments

We implemented the proposed methodology using the HP Labs' COTSon simulation infrastructure [2]. COTSon uses AMD’s SimNow simulator [3] as the full-system simulator component, augmented with the timing models presented in [8] for the CPUs and system devices. The SimNow simulator is a fast platform emulator using dynamic compilation and caching techniques, which can run unmodified OSs and complex applications. SimNow implements the x86 and x86_64 instruction sets. In order to simulate clusters, we use the "virtual network" functionality of COTSon, which interconnects and synchronizes multiple parallel distributed instances of node simulators. COTSon provides a simple interface to integrate the proposed network models in its simulated switch. Packets are sent from the simulated nodes to the switch, which then queries the network model for the packet latency and delays their delivery for the appropriate duration. We tested our approach on NAMD [12], a parallel scalable molecular dynamics application designed for high-performance simulation of bio-molecular systems. NAMD is particularly interesting because it uses a standard message-passing library but also includes load balancing features which dynamically modify execution based on runtime measurements. These optimizations are becoming increasingly important in the area of scalable applications, but they also break traditional offline network modeling and really require a full-system approach to accurately capture their behavior. We evaluated a cluster of 108 simulated (single-core) nodes connected with a 6-ary 3-tree (fat-tree) topology using typical InfiniBand parameters. Fig. 10 shows a summary of the speed vs. accuracy tradeoffs of the proposed models. On the x axis we plot the relative accuracy error (percent) vs. the baseline reference model (smaller values represent better accuracy). On the logarithmic y axis we plot the simulation speedup with respect to the same baseline (larger numbers

31

imply faster simulation). Each point represents the accuracy error and speed of a given experiment.
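The per-destination contention estimate from Section 3 can be sketched in a few lines of code. This is a minimal illustration of the recurrence, not the authors' implementation; the function and parameter names (`contention_delays`, `phys_delays`, `quantum`, `npks`) are our own.

```python
def contention_delays(phys_delays, quantum, npks):
    """Per-packet contention delay O[i] for one destination within a quantum.

    phys_delays[i] is the physical delay f[i] of packet i with destination x;
    quantum is the quantum duration Q and npks the number of packets
    processed in the quantum.
    """
    service_slot = quantum / npks   # Q/Npks: queue drain per processed packet
    delays = [0.0]                  # the first packet sees no contention
    for i in range(1, len(phys_delays)):
        # O[i] = O[i-1] + f[i] - Q/Npks, clamped to zero when negative
        delays.append(max(delays[i - 1] + phys_delays[i] - service_slot, 0.0))
    return delays

def total_latency(phys_delay, contention_delay):
    """Lat_i = f(N) + O: physical delay plus accumulated contention delay."""
    return phys_delay + contention_delay
```

When packets arrive faster than the quantum can drain them, the clamp keeps the queue estimate non-negative while letting contention accumulate across successive packets.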

Fig. 10 Speed vs. accuracy tradeoffs

The destination-based model speedup is ~30x with an accuracy error of ~8% vs. the reference model. The average-latency model is faster (100x), but its accuracy error goes up to 24%. The step model is even faster (165x speedup) but less accurate (39% error). Fig. 11 shows the latency results over the (simulated) execution time for the three proposed models together with the reference model. The step model was used to start the iterative methodology, and its logical trace was used to calculate the parameters Bcut, L0, and L1 described in Section 3. When the network load is low (for example, between time 0 and 13s), the step model achieves an accurate simulation (less than 5% error), as we would intuitively expect. When the network load grows, the accuracy degrades significantly. At the same time, if we are only interested in the congestion hotspots (for example, around execution time 21s), the step model is still able to identify them rather accurately.

Fig. 11 Modeled latency

We also computed an average-latency model by measuring latency values for a representative range of traffic loads. This generates a latency curve representative of the communication pattern and topology. The average model slightly improves on the accuracy of the step model; however, at high communication loads the error is still above 30%.

In the destination-based model, we use per-destination traffic modeling, with latency represented by the analytical model previously described. The accuracy error is reduced to 8%, showing that we can correctly capture the contention effects. Notice that the destination-based model still cannot capture the congestion that can appear in intermediate network switches, and this is likely the primary cause of the accuracy error. In Fig. 12, we plot the message latency measured at each destination node for NAMD. Some nodes perceive a shorter latency due to internal collisions of packets in the network, as mentioned above.

Fig. 12 Latency perceived by each destination node

In summary, we believe that the accuracy provided by the destination-based model is adequate for typical network experiments using FSS environments. As the computing power per node increases with the growing number of cores, clusters with no more than a few hundred nodes will accommodate the vast majority of computing needs below the very high end of the HPC spectrum [21]. At this scale, crossbar-switched networks or multistage interconnection networks (such as fat-trees, butterflies, or Benes networks) with 1-5 levels are sufficient. This trend, combined with path diversity, causes contention problems or hotspots to appear with more intensity and frequency at the destination nodes rather than at the intermediate levels (if any at all). This behavior is well captured by the destination-based model. Combining this with the computational, memory, and complexity requirements of full network-oriented simulators (such as OPNET), we conclude that a simple destination-based model represents an excellent sweet spot and is what we recommend adopting for FSS environments.
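As a concrete reading of the two simpler models: the step model reduces latency estimation to a threshold test on the offered load, and the average-latency model to interpolation over a measured latency curve. The sketch below is illustrative only; the parameter names `b_cut`, `l0`, and `l1` mirror the paper's Bcut, L0, and L1, but the code is our own, not the authors' implementation.

```python
import bisect

def step_latency(load, b_cut, l0, l1):
    """Step model: constant latency L0 below the load threshold Bcut, L1 above."""
    return l0 if load < b_cut else l1

def average_latency(load, loads, latencies):
    """Average-latency model: linear interpolation over a measured curve.

    `loads` must be sorted ascending; `latencies[i]` is the latency measured
    at traffic load `loads[i]`. Loads outside the measured range are clamped.
    """
    if load <= loads[0]:
        return latencies[0]
    if load >= loads[-1]:
        return latencies[-1]
    j = bisect.bisect_right(loads, load)           # first measured load > load
    frac = (load - loads[j - 1]) / (loads[j] - loads[j - 1])
    return latencies[j - 1] + frac * (latencies[j] - latencies[j - 1])
```

Both models are table lookups, which is why they trade accuracy at high load for one to two orders of magnitude in simulation speed.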

5 Conclusions

The key to successful network simulation studies of cluster systems running commercial workloads is to have a solid foundation for rapid development of timing models, coupled with a rigorous methodology for obtaining representative measurement results quickly and reliably. Naturally, it is desirable to simulate networks with both high accuracy and high simulation speed; in most cases, however, it is difficult to implement a detailed, accurate model that can also run realistic workloads through full-system simulation. The methodology presented in this paper optimizes the network model design by carefully trading off accuracy and speed. Models are designed according to network statistical information and/or analytical descriptions that ensure accurate measurement results with reduced simulation turnaround time. The proposed models enable designers to optimize network configurations for a given application, analyze larger and/or faster networks, evaluate new protocol stacks, and identify tuning opportunities.

We have shown how to apply the proposed methodology to the analysis of three network models with different accuracy/speed ratios. In particular, we have shown that the destination-based model provides a faithful abstraction for the scale of systems of interest in the foreseeable future, reaching an 8% error and a speedup of around 30x vs. a reference network simulation.

In the future, we plan to extend this work with models that consider the load at each link (link-based) in order to better capture internal network contention. As link-based models can be slow, we also plan to combine different models according to the communication load in the network: for low traffic, a less accurate model is used to improve speed, and if traffic grows significantly we switch to a more accurate model. We believe that this combination of models will open up a much wider application space for full-system simulation.
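The load-adaptive combination of models proposed as future work could be as simple as a threshold-based dispatcher. The sketch below is a hypothetical illustration; the threshold value and model names are our assumptions, not measured parameters from the paper.

```python
def pick_model(offered_load, threshold=0.3):
    """Choose a latency model based on current network load (hypothetical policy).

    At low load the cheap step model is accurate enough (it showed under 5%
    error at low load in the experiments); beyond the threshold, switch to
    the slower but more accurate destination-based model.
    """
    return "step" if offered_load < threshold else "destination-based"
```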

References

[1] Agarwal, N., Peh, L.-S., and Jha, N. K., "GARNET: A Detailed Interconnection Network Model inside a Full-system Simulation Framework," CE-P08-001, Dept. of Electrical Engineering, Princeton Univ., Feb. 2008.
[2] Argollo, E., Falcón, A., Faraboschi, P., Monchiero, M., and Ortega, D., "COTSon: Infrastructure for Full System Simulation," SIGOPS Oper. Syst. Rev. 43(1), Jan. 2009, 52-61.
[3] Bedicheck, R., "SimNow: Fast Platform Simulation Purely in Software," in Hot Chips 16, Aug. 2004.
[4] Binkert, N. L., Hallnor, E. G., and Reinhardt, S. K., "Network-Oriented Full-System Simulation Using M5," in 6th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW), Feb. 2003.
[5] Chen, H., and Wyckoff, P., "Performance Evaluation of a Gigabit Ethernet Switch and Myrinet Using Real Application Cores," in Hot Interconnects 8, Aug. 2000.
[6] Duato, J., Yalamanchili, S., and Ni, L., "Interconnection Networks: An Engineering Approach," Morgan Kaufmann, 2002.
[7] Duato, J., Robles, A., Silla, F., and Beivide, R., "A Comparison of Router Architectures for Virtual Cut-Through and Wormhole Switching in a NOW Environment," in IEEE IPPS/SPDP, pp. 240-247, Apr. 1999.
[8] Falcón, A., Faraboschi, P., and Ortega, D., "Combining Simulation and Virtualization through Dynamic Sampling," in Proc. IEEE Int. Symp. on Performance Analysis of Systems & Software (ISPASS), Apr. 2007.
[9] Falcón, A., Faraboschi, P., and Ortega, D., "An Adaptive Synchronization Technique for Parallel Simulation of Networked Clusters," in Proc. IEEE Int. Symp. on Performance Analysis of Systems & Software (ISPASS), Apr. 2008.
[10] Hardavellas, N., Somogyi, S., Wenisch, T. F., Wunderlich, R. E., Chen, S., Kim, J., Falsafi, B., Hoe, J. C., and Nowatzyk, A. G., "SimFlex: A Fast, Accurate, Flexible Full-System Simulation Framework for Performance Evaluation of Server Architecture," SIGMETRICS Perform. Eval. Rev., Mar. 2004, 31-34.
[11] Jain, R., "The Art of Computer Systems Performance Analysis," John Wiley & Sons, New York, 1991.
[12] Kalé, L. V., Bhandarkar, M. A., and Brunner, R., "Load Balancing in Parallel Molecular Dynamics," in Proc. 5th Int. Symp. on Solving Irregularly Structured Problems in Parallel, LNCS vol. 1457, Springer-Verlag, London, Aug. 1998, 251-261.
[13] Lugones, D., Franco, D., and Luque, E., "Modeling Adaptive Routing Protocols in High Speed Interconnection Networks," in OPNETWORK 2008, Washington, USA, Aug. 2008. Available at: https://aomail.uab.es/~dlugones/opnet.html
[14] Martin, M. M., Sorin, D. J., Beckmann, B. M., Marty, M. R., Xu, M., Alameldeen, A. R., Moore, K. E., Hill, M. D., and Wood, D. A., "Multifacet's General Execution-Driven Multiprocessor Simulator (GEMS) Toolset," SIGARCH Comput. Archit. News 33(4), 2005, 92-99.
[15] McCanne, S., and Floyd, S., "UCB/LBNL/VINT Network Simulator - ns (ver. 2)," http://www.isi.edu/nsnam/ns/, March 2009.
[16] Navaridas, J., Ridruejo, F. J., and Miguel-Alonso, J., "Evaluation of Interconnection Networks Using Full-System Simulators: Lessons Learned," in 40th Annual Simulation Symposium (ANSS '07), pp. 155-162, March 2007.
[17] OPNET Technologies, "OPNET Modeler: Accelerating Network R&D," June 2008, http://opnet.com.
[18] Ortiz, A., Ortega, J., Diaz, A., and Prieto, A., "Modeling Network Behaviour by Full-System Simulation," Journal of Software 2(2), 2007, 11-18.
[19] Michailidis, P. D., Stefanidis, V., and Margaritis, K. G., "Performance Analysis of Overheads for Matrix-Vector Multiplication in Cluster Environment," in Panhellenic Conference on Informatics, LNCS, 2005, 245-255.
[20] Ridruejo Perez, F., and Miguel-Alonso, J., "INSEE: An Interconnection Network Simulation and Evaluation Environment," in Euro-Par 2005 Parallel Processing, 1014-1023, 2005.
[21] Top500 Supercomputers Site, "Interconnect Family Share," June 2008, http://www.top500.org.
