Eberhard Karls Universität Tübingen
Faculty of Science
Department of Computer Science
Communication Networks Group

Bachelor Thesis in Computer Science

An Open Source Based Performance Monitoring Tool for InfiniBand Networks
(German title: Ein Open-Source-basiertes Tool zur Überwachung des Leistungsverhaltens in InfiniBand-Netzwerken)

Christian Kniep 10.10.2011

Reviewer

Prof. Dr. habil. Michael Menth
Department of Computer Science, Communication Networks Group
University of Tübingen

Advisors

Dipl.-Inform. Michael Höfling, M.Sc.
Dr. Marcus Camen (science + computing ag)

Christian Kniep: An Open Source Based Performance Monitoring Tool for InfiniBand Networks. Bachelor thesis in computer science, Eberhard Karls Universität Tübingen. Working period: 10.06.2011 – 10.10.2011

Declaration

I hereby declare that I have written this bachelor thesis independently, using only the stated aids, and that all passages taken verbatim or in substance from other works have been marked as borrowings by citing their sources. This thesis has not been submitted in the same or a similar form as an examination achievement in any other degree program.

Tübingen, 10.10.2011

(Christian Kniep)

Acknowledgments

This thesis is a waypoint whose foundation was laid several years ago. Reaching it would not have been possible without the support and encouragement of many people, whom I would like to thank here. The cornerstone was laid during my apprenticeship at the computing center of the University of Lüneburg, where I first came into contact with Unix and gained a solid grounding in working with this operating system. Special thanks go to Andreas Paul, whose open and humorous manner created a working environment that kept me constantly motivated; the friendships with the team from that time, which continue to this day, bear witness to that. After completing the apprenticeship I spent three months abroad and then took a position in Ingolstadt, working for a subcontractor of AUDI AG in a team that further encouraged my affinity for Linux. My team leader at the time, and good friend ever since, Alexander Weimer, finally encouraged me to strengthen my IT foundation with a degree in computer science; without him I might never have taken that step. In the years since that decision I have been studying computer science at the University of Tübingen while working at science + computing ag, where I found innovation-friendly, dedicated colleagues who, together with my studies, enabled a very steep learning curve. After a stay abroad in autumn 2010 I moved to my current team, which operates compute clusters for a large German car manufacturer. This complex and demanding environment allowed me to familiarize myself with the technologies underlying this thesis, so the topic of my final thesis had been settled for a long time.
In Michael Menth I found a committed, open-minded professor who accepted my preferred topic for supervision without hesitation. His research assistant Michael Hoefling, with his experience in writing scientific papers, also made a substantial contribution. A big thank-you goes to all colleagues and friends (and the coffee machine) at science + computing. Further thanks are due to the head of the department we support, who granted me the privilege of not restricting the development of the system to a small test installation. Very special thanks go to my mother Annegret Kniep and my uncle Eckhard Matthies; without the support and encouragement of these two people I could never have taken this path.

Contents

1. Introduction
2. Motivation
   2.1. Problem Statement
   2.2. Purpose
   2.3. Usage Scenario
3. InfiniBand
   3.1. Components
        3.1.1. Channel Adapter
        3.1.2. Switches
        3.1.3. Topologies
   3.2. Communication
        3.2.1. Flow Control
        3.2.2. Queuing
        3.2.3. Connection Types
        3.2.4. GUID
        3.2.5. Addressing
        3.2.6. Service Level
        3.2.7. Virtual Lanes
   3.3. Management
        3.3.1. Subnet Management
        3.3.2. General Services
4. Related Work
   4.1. Mellanox
   4.2. QLogic
5. System Design
   5.1. Data Aggregator
   5.2. Data Storage
   5.3. Data Processing
   5.4. Data Visualization
   5.5. Data Reporting
6. Implementation
   6.1. Technologies
        6.1.1. OpenSM
        6.1.2. Nagios
        6.1.3. RRDtool
        6.1.4. Gnuplot
        6.1.5. Graphviz
        6.1.6. Foswiki
   6.2. Architecture
   6.3. Component Interaction
        6.3.1. Data Aggregator
        6.3.2. Data Storage / Processor
        6.3.3. Data Visualization
7. Testing and Evaluation
   7.1. Tools
        7.1.1. qperf
        7.1.2. LINPACK
   7.2. Design of the Test
        7.2.1. Link Utilization
        7.2.2. LINPACK
   7.3. The Testing Network
        7.3.1. Laboratory Network
        7.3.2. Customer Network
   7.4. System Performance
        7.4.1. Network Graph
        7.4.2. Port Utilization
        7.4.3. Congestion Visualization
        7.4.4. Traffic Locality
        7.4.5. Utilization Shares
   7.5. Limitations
        7.5.1. Link Speed / Performance Counter Width
        7.5.2. XmitWait Counter
        7.5.3. Scalability
   7.6. Evaluation Summary
8. Conclusions
   8.1. Usage Scenario Fulfillment
   8.2. Comparison
   8.3. Future Work
A. Appendix
   A.1. InfiniBand
        A.1.1. ibnetdiscover
        A.1.2. saquery
   A.2. qperf
        A.2.1. Full List of Tests
   A.3. SQL
        A.3.1. Function osmInPerfdata()
        A.3.2. Function upsert_perf()
        A.3.3. Function upsert_perfache()
   A.4. Nagios
        A.4.1. Nagios Agents
   A.5. System
        A.5.1. Startup
        A.5.2. Normal Operation
        A.5.3. LINPACK
Glossary
Bibliography
List of Figures
List of Tables

This bachelor thesis aims to create an open-source monitoring system for InfiniBand networks. Users have a set of small free tools at hand to handle problems and debug the network, but it is not easy to get an overview of the current network state. Commercial products are available, but they come with counterintuitive user interfaces and a high price even for small environments. The goal of this thesis is to create a flexible, modular, and intuitive system that offers an overview of the network's current state.

1. Introduction

From the beginning of computing until very recently, the dominant paradigm for solving computational problems was the serial one: the full problem is decomposed into (a large number of) elementary computing steps which are executed one after another on a single processor. For decades this paradigm went unchallenged. To speed up computation, engineers increased the number of transistors (Moore's law), the clock speed, and the precision of the data types used; furthermore, they implemented more complex functions, expanded the number of arithmetic units, and so forth. This run came to an end in the last couple of years. Even though the computer industry keeps increasing the number of transistors per CPU, the race towards higher clock speeds has wound down. Due to thermal and physical limits, the industry has shifted to adding more cores per CPU, and with that the urge to parallelize programs arose. At first this approach was used to run independent programs at the same time. If big problems are to be solved with parallel processes, the processes have to exchange data and synchronize among each other. Keeping this on one processor with multiple cores minimizes the cost of exchanging data between fast components such as caches and memory. As the problems grew bigger, the number of cores had to increase, but there was no system bus to connect multiple CPUs in a scalable way. In the late 1990s, two groups tried to address the lack of system-bus performance [26]. One group (Compaq, HP, and IBM) concentrated on further development of the PCI-X specification, with a bandwidth of 1 GB/s, under the name 'Future I/O'. The other group (Microsoft, Dell, Hitachi, Intel, NEC, Siemens, and Sun) pushed a protocol named 'Next Generation I/O', based on serial connections with 200 MB/s bandwidth. In 1999 the two groups combined their efforts by founding the InfiniBand Trade Association (ITA). The initial name of the protocol was 'System I/O'.
In October 1999 the final name InfiniBand (IB), a contraction of 'infinite bandwidth', was settled; it denotes the protocol as well as the hardware. In the years since, many proprietary cluster interconnects have been used, but IB was able to raise its share (Fig. 1.2). An InfiniBand interconnect is cost-efficient in terms of price/performance. Furthermore, InfiniBand is an open standard, which reduces the risk of vendor lock-in [19]. The current share of InfiniBand within the top500 list of supercomputers is shown in Fig. 1.1.


Figure 1.1.: Current performance share in supercomputing interconnects [24]

Figure 1.2.: Performance share over time [24]

The thesis motivation is given in Section 2. Section 3 focuses on the technical background of InfiniBand. Existing proprietary solutions and a comparison are laid down in Section 4 (Related Work). The system design is presented in Section 5, and the implementation is described in Section 6. An evaluation of the implemented system is shown in Section 7, followed by the conclusions (Section 8). The glossary is found at the end of this thesis.

2. Motivation

For a user without years of experience, an InfiniBand network is something of a black box. The existing tools give either no broad view of the topology, its problems, and its performance, or they are too expensive to deploy even in smaller networks.

2.1. Problem Statement

Users of an IB network are handed a set of tools to look at various aspects of the fabric, but there is no free tool that provides a bird's-eye view of it. If the user can answer a couple of questions with efficient, well-visualized answers, it becomes easier to gain a broader understanding:
• Does the topology serve the needs?
• Where are the hotspots / bottlenecks within the network?
• How can all components of the fabric be assessed in real time?
• What is the normal utilization of certain links?
• How do the workflows running in the network behave?
• How can better turnaround times be achieved with the intended workflows?

2.2. Purpose

To address this issue, this thesis aims to develop an open-source-based monitoring system. The system has to address the following steps:
1. Discover the topology
2. Collect performance data
3. Process the collected information
4. Visualize the data in an intuitive way

2.3. Usage Scenario

To improve the administration of an InfiniBand network, a user needs an application that gives an intuitive view of the system. This offers the ability to track down events within the cluster without the need to dig deep into InfiniBand internals.

3. InfiniBand

This chapter introduces the fundamentals of InfiniBand. The first section describes the components of an IB network, followed by an introduction to the communication basics. The last section lays out the management services within IB.

3.1. Components

IB is a switched network of point-to-point connections. Besides the network interfaces of computers and peripherals (storage, graphics, etc.), switches and routers complete the network infrastructure. In this context, the network is called a fabric (Fig. 3.1).

Figure 3.1.: InfiniBand-Fabric with HCA, TCA, switches and routers

3.1.1. Channel Adapter

Every endpoint of a point-to-point connection is called a Channel Adapter (CA). Unlike switches, a CA is either the start point or the end point of a connection.

3.1.1.1. HCA
To connect a computer to the fabric, a Host Channel Adapter (HCA) (as shown in Fig. 3.2) is used. This device has a direct connection to the CPU and the memory controller. Like other network interfaces, an HCA is able to implement more than one independent network connection.


Figure 3.2.: Host Channel Adapter

Figure 3.3.: Target Channel Adapter

3.1.1.2. TCA
I/O peripherals such as storage targets are connected through a Target Channel Adapter (TCA) (Fig. 3.3). This type of CA is not allowed to initiate connections; therefore, the logic to do so is not implemented (see also 3.3.2.3 'Communication Management').

3.1.2. Switches

To connect more than two CAs, a crossbar switch is needed. See Fig. 3.4 for a diagrammatic depiction of a crossbar.

Figure 3.4.: Crossbar to connect 2x 4 Ports in a nonblocking way

Figure 3.5.: Crossbar and IB-logic encapsulated in an ASIC

This connection method ensures that disjoint pairs of ports are connected with full bandwidth and do not affect each other. Within an IB network these crossbars are encapsulated in an Application-Specific Integrated Circuit (ASIC) (Fig. 3.5), including the IB protocol logic. A common ASIC has 36 ports (as of mid-2011). Regardless of its size, the ASIC implements a port 0 which holds the protocol logic. At this scale, the complexity of the actual switch as a product comes down to adding a power supply and connecting the ASIC to the external ports (Fig. 3.6).

Figure 3.6.: Switch assembled with one ASIC

3.1.3. Topologies

CLOS network
The network shown in Fig. 3.7 establishes a dedicated connection between host 1 and host 5. A second dedicated connection (red dotted) between host 2 and host 4 cannot be established due to a shared inter-switch link on the left side of the picture. Such a network is therefore called blocking.


Figure 3.7.: Blocked network

In the early 1950s, Charles Clos formalized the concept of a fully non-blocking network for switching telephone calls. Such a network ensures that every unused incoming port can be connected to an unused outgoing port. In Fig. 3.8, the number of outgoing links on the leaf switches is referred to as n and the number of connections to the inner switches as m. To build a strictly non-blocking network, m must be greater than or equal to 2n − 1. Fig. 3.9 shows a strictly non-blocking network with two root switches: if the number of hosts per leaf switch (n) is 4, the number of uplinks (m) has to be at least 7.
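The condition m ≥ 2n − 1 can be checked with a tiny helper; this is an illustrative sketch (the function names are mine, not part of the thesis tooling):

```python
def min_uplinks(n: int) -> int:
    """Minimum number of uplinks m per leaf switch for a Clos network
    with n downlinks per leaf to be strictly non-blocking (m >= 2n - 1)."""
    return 2 * n - 1

def is_strictly_nonblocking(n: int, m: int) -> bool:
    """Check the Clos condition for n downlinks and m uplinks per leaf."""
    return m >= min_uplinks(n)

# The example from the text: 4 hosts per leaf switch need at least 7 uplinks.
print(min_uplinks(4))                 # -> 7
print(is_strictly_nonblocking(4, 7))  # -> True
print(is_strictly_nonblocking(4, 4))  # -> False
```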

Figure 3.8.: CLOS network


Figure 3.9.: Strictly non-blocking network

Binary Fat Tree
To get a more efficient topology, binary fat-tree topologies (Fig. 3.10) are used. A fat tree provides dedicated uplinks for every downlink within a binary tree. A simple practical example of a fat tree is shown in Fig. 3.11. Within this network, non-blocking disjoint connections can be established.

Figure 3.10.: Fat Tree [1]


Figure 3.11.: Fully non-blocking fat tree

Rearrangeably fully non-blocking
A topology in between a fat tree and a strictly non-blocking network is a rearrangeably fully non-blocking network. As shown in Fig. 3.12, the number of uplinks equals the number of downlinks. The network therefore does not satisfy the constraint for a strictly non-blocking network, and not all connection pairs can be set up right away: a dedicated connection along the dotted red line cannot be established due to the already existing blue one. But after rerouting the green connection (Fig. 3.13) the problem is solved, and every disjoint pair of hosts can be connected.

Figure 3.12.: Rearrangeable fully non-blocking


Figure 3.13.: Rearranged network

3.2. Communication

After the description of the components in the previous section, this section introduces the communication mechanisms within InfiniBand.

3.2.1. Flow Control

Flow control is implemented in the scope of a physical link between two components. The sender accounts for the number of transmitted packets and is allowed to send further packets only if the receiver's buffer capacity is big enough to hold them. The receiver periodically sends the value of its counters, thus keeping the counters on both sides appropriately synchronized.
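The credit-based scheme described above can be modeled in a few lines. The class below is a toy illustration only (all names are hypothetical, and real IB flow control accounts in credit units per virtual lane rather than whole packets):

```python
class Link:
    """Toy model of link-level flow control: the receiver advertises its
    buffer capacity as credits; the sender transmits only while credits
    remain and stalls otherwise."""

    def __init__(self, rx_buffer_slots: int):
        self.credits = rx_buffer_slots  # advertised free receive slots
        self.queued = 0                 # packets sitting in the rx buffer

    def send(self, packets: int) -> int:
        """Transmit up to `packets`, limited by available credits;
        returns the number actually sent."""
        sent = min(packets, self.credits)
        self.credits -= sent
        self.queued += sent
        return sent

    def receiver_drain(self, packets: int) -> None:
        """The receiver consumes packets and returns the freed slots as new
        credits (in real IB this happens via periodic flow-control packets)."""
        freed = min(packets, self.queued)
        self.queued -= freed
        self.credits += freed

link = Link(rx_buffer_slots=4)
first = link.send(6)     # only 4 credits available
stalled = link.send(1)   # no credits left, the sender must wait
link.receiver_drain(2)   # receiver frees two buffer slots
second = link.send(6)    # two fresh credits
print(first, stalled, second)  # 4 0 2
```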

3.2.2. Queuing

The mechanism to send and receive data is implemented with pairs of queues, each containing a send and a receive queue. These pairs are referred to as Queue Pairs (QPs). QPs can be initialized through the driver's API and are dedicated to the calling application and a specific port; multiple QPs per application and port are allowed. The communication is based on different queues which are addressed through a well-defined API. In contrast to competing protocols and their implementations, no interaction of the operating system is required: the underlying logic is completely implemented in hardware. This is one reason for the ultra-low latency within an IB fabric.


Figure 3.14.: InfiniBand Communication Stack [23]

QPs are filled with tasks called Work Queue Elements (WQEs) (pronounced 'wookie'), whose meaning depends on the kind of queue they are inserted into. The completion queue allows the requesting application to finish the operation (Fig. 3.14).

3.2.2.1. Send Queue
The Send Queue (SQ) can process the following WQEs:

Send: This WQE sends data from a given local memory address to a specific host within the fabric.

Memory Binding: To use Remote Direct Memory Access (RDMA), it is necessary to make local memory available for remote access. This WQE registers the memory and creates a key (r_key) which remote operations need in order to verify their access rights.

RDMA: Through RDMA operations an application is able to access the memory of a remote host without further interaction of the accessed host, apart from the initial allocation of memory for shared use. The following operations are possible:
• Write: Like a Send WQE, but the data is written directly to the remote host's memory without a corresponding Receive WQE.
• Read: Reads data out of remote memory.
• Atomic: This WQE reads 64 bits of memory on a remote host and is able to execute atomic changes. Atomic operations are used to secure access to a shared resource such as a semaphore. Besides Fetch&Add (read and increment a shared counter), Compare&Swap (IF value != x THEN value = y) is possible. To verify access rights, the remote memory key (r_key) created by the Memory Binding WQE is needed.


3.2.2.2. Receive Queue
This queue can only hold Receive WQEs. A Receive WQE has to be generated to complete a corresponding Send WQE; it defines the local address which is supposed to hold the received data.

3.2.2.3. Completion Queue
Every CA implements a Completion Queue (CQ). Every completed WQE adds a Completion Queue Element (CQE) to the CQ, which notifies the application that requested the WQE of the task's completion. The CQE contains all information required for the application to complete the operation.
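The interplay of send queue, receive queue, and completion queue can be illustrated with a small model. This is a conceptual sketch only (class and method names are mine); the real interface is the verbs API, and the 'hardware step' below stands in for what the HCA does:

```python
from collections import deque

class QueuePair:
    """Illustrative model of the QP/CQ mechanics described above: work
    requests (WQEs) are posted to a send or receive queue, and each
    completed WQE places a completion entry (CQE) on the CQ."""

    def __init__(self):
        self.send_queue = deque()
        self.recv_queue = deque()
        self.completion_queue = deque()

    def post_send(self, data):
        self.send_queue.append(("SEND", data))

    def post_recv(self, buffer_name):
        self.recv_queue.append(("RECV", buffer_name))

    def process(self, peer: "QueuePair"):
        """'Hardware' step: complete one Send WQE against one of the peer's
        Receive WQEs, generating a CQE on both sides."""
        if self.send_queue and peer.recv_queue:
            _, data = self.send_queue.popleft()
            _, buf = peer.recv_queue.popleft()
            peer.completion_queue.append(("RECV_DONE", buf, data))
            self.completion_queue.append(("SEND_DONE", data))

a, b = QueuePair(), QueuePair()
b.post_recv("buf0")           # receiver prepares a landing buffer first
a.post_send("payload")
a.process(b)
print(a.completion_queue[0])  # ('SEND_DONE', 'payload')
print(b.completion_queue[0])  # ('RECV_DONE', 'buf0', 'payload')
```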

3.2.3. Connection Types

IB supports multiple kinds of connections (Table 3.1). In addition to reliable/unreliable connections, it is possible to send raw packets, which use IB headers to encapsulate non-IB protocols (e.g. IPv6) within an IB fabric.

Table 3.1.: InfiniBand connection types [21]

Service Type            Connection Oriented   Acknowledged   Transport
Reliable Connection     yes                   yes            IBA
Unreliable Connection   yes                   no             IBA
Reliable Datagram       no                    yes            IBA
Unreliable Datagram     no                    no             IBA
RAW Datagram            no                    no             no

3.2.4. GUID

A Globally Unique Identifier (GUID) is a 64-bit value and is roughly comparable to a MAC address [7] in TCP/IP [5] networks. The first 24 bits hold a company ID assigned by the Institute of Electrical and Electronics Engineers (IEEE) [4]; the remaining 40 bits are assigned by the manufacturer. Unlike MAC addresses, GUIDs are assigned to several kinds of components in order to uniquely identify every active part within a fabric. Depending on the kind of component, a split into CAs and switches eases the differentiation.

Channel Adapter: All ports are independent and therefore uniquely addressable via separate PortGUIDs; every CA implements at least one port. The CA itself is labeled by a NodeGUID (synonymous with CaGUID). The location of the next level is not that easy to pin down. The ITA standard [21] describes the sysimgGUID as follows: 'Provides a means for system software to indicate the availability of multiple paths to the same destination via multiple nodes.'
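The 24/40-bit split can be illustrated in a few lines of Python. The helper name is mine, and the low 40 bits in the example are made up for illustration (0x0002c9 is a real IEEE OUI assigned to Mellanox):

```python
def split_guid(guid: int) -> tuple:
    """Split a 64-bit GUID into the 24-bit IEEE company ID (OUI) and the
    40-bit manufacturer-assigned part, as described in the text."""
    oui = guid >> 40                      # top 24 bits
    vendor_part = guid & ((1 << 40) - 1)  # low 40 bits
    return f"0x{oui:06x}", f"0x{vendor_part:010x}"

print(split_guid(0x0002c903001a2b3c))  # ('0x0002c9', '0x03001a2b3c')
```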


Switches: As described in 3.1.2, a basic switch is composed of one ASIC which implements an array of connectible ports. Besides these ports, a switch implements a management port (port 0). The external ports are not assigned a unique GUID because they are not targets of connections (neither for data nor for management); therefore, the switch's NodeGUID is assigned to the management port, and management packets arriving on an external port are redirected to it. As in the CA context, the sysimgGUID identifies a logical entity; in practice it is uniquely assigned to an ASIC, even if two ASICs are located on one physical board (Fig. 3.15).

Figure 3.15.: Switch containing multiple ASICs on multiple boards

The chassisGUID is the highest-order GUID a switch can be identified with. It is provided through the 'Baseboard Management' interface and identifies a chassis which holds multiple physical boards (Fig. 3.15).

3.2.5. Addressing

Within a fabric there are two ways to address a node. During normal operation the node is addressed by a logical identifier, which is transparent and unambiguous. If this fails, the node is addressed with a direct route.

3.2.5.1. Direct Route
A direct route addresses a node by concatenating the ports which should be taken at every hop along the path. For example, the vector (1, 3, 1, 2) would take
• port 1 at the local interface
• port 3 at the connected switch
• port 1 at the switch at hop 2
• port 2 at the final hop.
Direct routes are relative to the source port and therefore not unambiguous. Direct routes are used while discovering the network topology.

3.2.5.2. Local Identifier (LID)
The Subnet Manager (SM) (see also 3.3.1.1 'Subnet Manager') assigns a 16-bit Local Identifier (LID) to every node within the fabric (1 to 65535). The initial LID is 0.

3.2.6. Service Level

For every connection a 4-bit value is defined which describes its Service Level (SL). The SL is assigned by the application during packet creation. In interaction with Virtual Lanes (VLs), it is used to implement Quality of Service (QoS) within the InfiniBand network.

3.2.7. Virtual Lanes

VLs are used to divide the physical I/O buffer into logical buffers. Every port has to implement at least one data buffer (VL0) and a buffer for subnet management packets (VL15). The implementation of 14 extra data buffers (VL1–VL14) is optional. Among all VLs, VL15 has the highest priority. The data buffers are served with two different priorities (high and low): after transmitting an item from a lower-priority buffer, the scheduler checks for higher-priority items. The SM can be configured to map different SLs to different VLs in order to implement Quality of Service within the fabric.
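The prioritization described above, with VL15 always served first, can be sketched as a simple arbiter. This is a deliberately simplified toy (real VL arbitration uses weighted high/low priority tables, not a strict ordering):

```python
from collections import deque

class VLArbiter:
    """Toy sketch of VL scheduling: VL15 (subnet management) always wins,
    then the high-priority data buffers, then the low-priority ones."""

    def __init__(self):
        self.vl15 = deque()   # management packets
        self.high = deque()   # high-priority data VLs
        self.low = deque()    # low-priority data VLs

    def next_packet(self):
        # Strict priority for illustration: check queues from highest down.
        for queue in (self.vl15, self.high, self.low):
            if queue:
                return queue.popleft()
        return None

arb = VLArbiter()
arb.low.append("bulk-data")
arb.vl15.append("subnet-mgmt")
arb.high.append("qos-data")
print(arb.next_packet())  # subnet-mgmt
print(arb.next_packet())  # qos-data
print(arb.next_packet())  # bulk-data
```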

3.3. Management

To ensure the functionality of IB, a series of management classes is used. The common basis of all classes can be described as follows:
• Manager: The manager is an active component. It sets variables and requests information.
• Interface: Implements a gateway between an agent/manager and the network interface.
• Agent: An agent is mostly passive and responds to commands or requests from the manager. Its one slightly active role is to send traps (e.g. on a port status change).
• Messages: The interaction of manager and agent is based on well-defined message schemes and state models. The messages are called MADs (Management Datagrams); their size is exactly 256 bytes.


3.3.1. Subnet Management

3.3.1.1. Subnet Manager
To enable a fabric to react to changes, it is essential to run an SM on a node (managed switch or host) connected to the fabric. If more than one SM is started, the one with the highest priority (a value between 0 and 15) becomes the master instance; in the case of equal priorities, the SM with the lowest GUID takes charge. The SM is responsible for the following functions:

Discovery: To discover the network, the SM uses direct routing. Beginning at its local interface, it iterates through all ports. Every port is asked for a neighbor; if one exists and is not yet known, the algorithm is applied to that node recursively.

Address assignment: After discovering the network, a LID is assigned to every node port and switch. The assignment is based on the PortGUID (for ports of Channel Adapters) or the NodeGUID (for switch port 0).

Route calculation: Calculating routes within the fabric is a hard mathematical problem; its complexity increases exponentially with the number of nodes. To reduce the need for recalculation, most SM implementations cache the mapping and reassign the same LIDs (e.g. when a node reboots). The algorithm used to calculate optimized routes through the fabric depends on the requirements and the implementation.

Route distribution: After the routes are calculated, they are published to all ASICs within the fabric. During this procedure the fabric is in a fragile state, because some switches still rely on old routing information while others already use the new one.

Sweeping: Once the initial steps are done, the SM performs periodic checks; the interval is defined in the SM's configuration. During these checks the SM looks for changes within the fabric (e.g. link speeds, link states, added or removed hosts). According to the specification [21], the same mechanism as for discovering the topology is used.

3.3.1.2. Methods
To fulfill its functions, the SM uses a series of methods (Table 3.2) which are transmitted as Subnet Management Packets (SMPs). As described in 3.2.7, the SM communicates via VL15 to transmit its packets with the highest priority.
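The recursive sweep described under 'Discovery' can be sketched over a toy adjacency model, where a direct route is the vector of egress ports taken from the starting node. This is illustrative code only, not OpenSM's implementation:

```python
def discover(fabric, node, route=(), seen=None):
    """Recursively discover a toy fabric: ask every port of `node` for its
    neighbor; if the neighbor is new, recurse into it.  `route` is the
    direct route (port vector) taken from the starting node.
    `fabric` maps node -> {egress_port: neighbor}."""
    if seen is None:
        seen = {}
    seen[node] = route
    for port, neighbor in sorted(fabric[node].items()):
        if neighbor not in seen:
            discover(fabric, neighbor, route + (port,), seen)
    return seen

# Toy topology: host1 - switch - host2 (made-up names).
fabric = {
    "host1":  {1: "switch"},
    "switch": {1: "host1", 2: "host2"},
    "host2":  {1: "switch"},
}
routes = discover(fabric, "host1")
print(routes["host2"])  # (1, 2): port 1 at host1, then port 2 at the switch
```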

Table 3.2.: Methods of the SubnetManager [21]

Method Type        Description
SubnGet()          Request a get (read) of an attribute.
SubnSet()          Request a set (write) of an attribute.
SubnGetResp()      Response from a get or set request.
SubnTrap()         Notify that an event occurred.
SubnTrapRepress()  Cease sending of repeated traps.

3.3.2. General Services

Besides Subnet Management, an IB network implements a couple of General Services. Like the SM, the General Services use a subclass of the Management Datagram (MAD). The General Management Packets (GMPs) used here are transmitted through the data buffers VL0–VL14 and are therefore not transmitted with the highest priority within the network. The following subsections explain the mandatory and optional General Services of IB.

3.3.2.1. Subnet Administration
Even the SM has tasks to handle that do not need preferred transmission. To exchange information that is not necessary for the basic operation of the master SM instance, a general service is needed which does not use the high-priority buffer VL15. The General Service 'Subnet Administration' serves this need by providing not only an unprivileged equivalent (Table 3.3) of the SM methods (Table 3.2) but also additional functions such as deleting attributes and requesting reports or more complex data structures. This service is not dedicated to a single role within the fabric; every node is allowed to use it.

Table 3.3.: Methods of the SubnetManagerAgent

Method Type             Description
SubnAdmGet()            Request a get (read) of an attribute from a node.
SubnAdmGetResp()        Response from a get or set request.
SubnAdmSet()            Request a set (write) of an attribute in a node. The responder shall issue a SubnAdmGetResp() as its response.
SubnAdmReport()         Forward an event previously subscribed for.
SubnAdmReportResp()     Reply to a SubnAdmReport() method.
SubnAdmGetTable()       Table request.
SubnAdmGetTableResp()   Table request response.
SubnAdmDelete()         Request to delete an attribute.
SubnAdmDeleteResp()     Response to a SubnAdmDelete() method.

3.3.2.2. Performance Management
A basic need within a running fabric is to get information about the performance and occurring errors. The General Service 'Performance Management' addresses this by offering a series of counters. The counters are implemented in hardware and altered during normal operation within the ASIC without interaction of


an external service. The service allows interaction with these counters (read/reset). To do so, every IB device has to implement the following attributes:

• ClassPortInfo Verifies the support for the service and discloses the supported methods and MAD versions to use.

• PortCounters Register which holds the actual error and performance counters for every port; some of them are optional, some are not. The performance counters are implemented with 32 bit, error counters have different widths from 8 to 16 bit.

• PortSamplesControl This attribute is used for measurements outside the scope of the normal port counters. The register can be triggered to sample a specific performance counter of a port apart from its normal performance counters. This way the requesting service does not rely on the limited width of the performance counters. The requesting service can limit the sampling time to two seconds and thereby avoid the saturation of a performance counter. The status and the results are offered by the following attribute.

• PortSamplesResult Offers information about the status of a measurement and the results.

The following listing shows an exemplary output of a performance management query applied to a host's port.

# Port counters: Lid 5 port 1
ibwarn: [11459] dump_perfcounters: PortXmitWait not indicated so ignore this counter
PortSelect:......................1
CounterSelect:...................0x0000
SymbolErrors:....................0
LinkRecovers:....................0
LinkDowned:......................0
RcvErrors:.......................0
RcvRemotePhysErrors:.............0
RcvSwRelayErrors:................0
XmtDiscards:.....................0
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
CounterSelect2:..................0x00
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................63522
RcvData:.........................370477
XmtPkts:.........................849
RcvPkts:.........................8568
XmtWait:.........................0

Congestion Counter Links suffer congestion when packets are ready to be sent but are not transmitted due to saturated buffers on the receiver's side. The amount


of congested packets is indicated by the counter xmit_wait, which is part of the output of perfquery. Congestion can only appear on switch ports. Ports attached to a CA rather prohibit the transmitting application from transferring more data into the sending buffer than accept data that cannot be sent. Due to this congestion handling within a CA, the command perfquery indicates within the first line of output that the xmit_wait counter is always set to zero and should be ignored. A switch port does not have such a direct mechanism to decrease the amount of data transmitted by another node. It relies on the IB flow control mechanism to reduce the incoming stream of packets. Unfortunately, the flow control is asynchronous and the switch is forced to transfer packets from the receiving to the sending buffer. Due to that, the amount of packets within the sending buffer will increase until the flow control reduces the incoming data. Within this period of time there is no alternative other than preventing packets from being sent. Contrary to the output shown, the warning concerning the xmit_wait counter does not occur when requesting performance data of a switch port.

3.3.2.3. Communication Management
To use data connections and datagrams (reliable and unreliable) between nodes within the fabric, the Communication Management is fundamental. This service creates QPs for a requesting application. Afterwards the Communication Manager (CM) manages the connection and closes the QPs when the connection terminates. The CM abstracts the whole process of creating client/server and peer-to-peer connections, including the segmentation and packaging of messages. It is fully transparent for the application. The communication between CMs uses separate QPs which are not used for data communication.

3.3.2.4. Baseboard Management
To obtain useful basic information about the device that holds the network device, the Baseboard Management can be used.
Unlike all the previously described services, which implement functions and attributes only within the IB standard, this service can access information outside the scope of the InfiniBand standard. One use case of the service is to access the system temperature through the Baseboard Management Agent.

3.3.2.5. Optional Management
Besides the introduced mandatory services, an IB device can implement the following additional services:

Device Management The Device Management is responsible for devices directly attached to the CA. One common use case is a storage array which contains a series of I/O-Controllers (IOC) directly connected to a TCA. The TCA implements


a Device Manager (DM) while the IOCs implement Device Management Agents (DMA) (Fig. 3.16).

Figure 3.16.: Storage array with TCA [22]

Incoming requests are forwarded to a DMA to respond. If it is allowed to send traps, a DMA is able to send traps to the DM, which forwards them to the destination within the fabric (e.g. the SM).

SNMP Tunneling Management The IB specification allows the implementation of an SNMP infrastructure through a General Service.

Vendor / Application-specific Management If there is a need to implement proprietary services, the standard addresses that by allowing vendor- or application-specific General Services.


4. Related Work

This chapter introduces the leading commercial products to monitor and manage IB fabrics. The two big vendors of IB hardware have developed software suites to manage and supervise fabrics. Because InfiniBand is an open standard and the clusters are mostly built out of commodity hardware, the hardware vendors try to differentiate themselves by shipping management software that assists customers in getting the highest performance out of the system.

4.1. Mellanox
The company Voltaire [20] developed the 'Unified Fabric Manager' (UFM) [18], which provides a system to manage and supervise an InfiniBand fabric. Mellanox acquired Voltaire in 2011. UFM provides an overview of the status of the fabric (Fig. 4.1), information about the performance of nodes and hosts (Fig. 4.3) and the topology (Fig. 4.2).

Figure 4.1.: UFM dashboard


Figure 4.2.: Topology visualization

Figure 4.3.: Performance chart

The main components included in UFM are:
• Visualize Fabric Visualization of the fabric topology and real-time monitoring of events and performance.
• Job Scheduler UFM contains job-scheduler-aware routing.
• Service Level Assignment of different service levels to different kinds of traffic to improve (e.g. MPI) performance.
• Voltaire's Traffic Aware Routing Algorithm (TARA) TARA allows UFM to track down congestion and reroute the traffic to improve the overall performance.

Besides this, UFM offers two proprietary routing algorithms: the traffic optimized routing (TOR) and the latest Traffic Aware Routing Algorithm (TARA). The algorithms try to optimize routing based on measurements or indicators like


different job schedulers. They aim to avoid congestion and oversubscribed links to minimize latency and maximize throughput. The Voltaire website provides a case study 'Achieving Peak Performance with Advanced Fabric Management at HLRS' that claims substantial performance improvements [25].

4.2. QLogic
QLogic [13] offers a system called 'InfiniBand Fabric Suite' (IFS) [14]. Resources to gather information about IFS are hard to find. The system's concept is shown in Fig. 4.4.

Figure 4.4.: System encapsulation of IFS 6.0

It addresses three aspects of the customer's IB fabric:
• Virtual Fabric With Virtual Fabric the traffic can be assigned to 16 service levels to implement quality of service for different traffic flows.
• Intelligent Switches Due to adaptive and dispersive routing within the switches, IFS attempts to avoid and reduce congestion.
• Vendor-specific MPI libraries IFS is not pinned to a specific MPI version. On the contrary, IFS explicitly supports vendor-specific MPI libraries, which allows the customer to take advantage of IFS and optimizations within specific MPI variants.


5. System Design

For the development of a monitoring system that serves the basic needs of a user, a concept (Fig. 5.1) was created. The system should be clearly structured and expandable. This part of the thesis describes the initially planned four parts of the system and the reports that should be included.

Figure 5.1.: System concept

5.1. Data Aggregator
The first component gathers topology information and performance data. The topology is persistently stored in the database system and will be altered by the collection mechanism if necessary. Performance information will be collected and linked to the stored topology.


5.2. Data Storage
The persistent part of the information is stored in a database model that matches the InfiniBand components of the standard.

5.3. Data Processing
Stored data has to be processed in a way that contributes to the understanding of processes within the fabric. On the one hand the topology information has to be kept consistent; on the other hand the current performance data has to be mapped into this context, and the whole information mix has to fulfill constraints that ensure a consistent database. If constraints are violated, the system has to react appropriately.

5.4. Data Visualization
To allow the end user to benefit from the data gathered and processed so far, it needs to be presented in an intuitive way. The visualization offers a bird's-eye view of the fabric, but it also has to offer detailed information if needed.

5.5. Data Reporting
The visualization gives the user an intuitive view of the performance data within the fabric. The reporting starts at the most atomic entity and continues to higher levels.

Port Utilization The port performance shows the utilization at the end points of every link (Fig. 5.2). The links within the fabric will show the relation between transmitted data (xmit_data) and the maximal bandwidth (xmit_max):

    xmit_data / xmit_max = port utilization

Figure 5.2.: Conceptual utilization graph
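Computed from two successive counter readings, the ratio reduces to a few lines. A minimal Python sketch follows; the assumption that the PortXmitData delta counts 4-octet units matches the normalization used by the osmeventplugin in the implementation chapter, while link_bw_bytes_s must be supplied for the link type at hand.

```python
def port_utilization(xmit_data_words, interval_s, link_bw_bytes_s):
    """Utilization of a link over one sampling interval.

    xmit_data_words : delta of the PortXmitData counter, which counts
                      in units of 4 octets (hence the factor below)
    link_bw_bytes_s : maximum payload bandwidth in bytes per second
    """
    xmit_data = xmit_data_words * 4          # counter units -> bytes
    xmit_max = link_bw_bytes_s * interval_s  # bytes transferable at line rate
    return xmit_data / xmit_max
```

For example, a delta of 250000 counter units within one second on a link with 4 MB/s of payload bandwidth yields a utilization of 0.25.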


The color goes from green (low utilization) to red (saturation). If there is no traffic, the link is shown as a black line. Besides a visualization that shows the current utilization as an enriched graph, the utilization should be measured in another way: the user will be able to measure the utilization during a given time period such that the bandwidth shares are visualized (Fig. 5.3). This will provide information about the real usage of the network.

Figure 5.3.: Utilization share over period of time

Congestion The second graph offers a glance at the amount of congestion that happens within the fabric. The measurement within the system calculates how many transmissions suffer from congestion. Assuming that 10 time slots are blocked (xmit_wait) and 100 packets are transmitted (xmit_pkts) within a time period, the calculation would be 10%:

    xmit_wait / xmit_pkts = 0.1

The colorization starts with green and leads to red, which indicates abnormal link congestion.

Traffic Locality Traffic locality measures the fraction of traffic that went through the node's uplink. Traffic within a node is either incoming or outgoing via an uplink (up_in / up_out) or downlink (down_in / down_out) as shown in Fig. 5.4. The measurement computes the locality of both directions as follows:

    up_in / down_out = locality_down
    up_out / down_in = locality_up

If connected hosts are only using their uplinks to transmit and receive data, the data stays at the switch and does not affect higher-level switches (Fig. 5.5). The traffic locality in Fig. 5.5 has no uplink involved (locality_down/up = 0). In Fig. 5.6 the


incoming and outgoing traffic is global to the full extent. No traffic is transmitted within the sub-tree (locality_down/up = 100%). Fig. 5.7 shows a scenario where 50% of the downstream traffic is global (locality_down = 50%, locality_up = 0%).

Figure 5.4.: Traffic locality scheme

Figure 5.5.: Local traffic


Figure 5.6.: 100% global traffic

Figure 5.7.: Mix of local and global traffic
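Both reporting metrics defined in this section are simple counter ratios. A small Python sketch; the function names are illustrative and not part of the monitoring system.

```python
def congestion_ratio(xmit_wait, xmit_pkts):
    """Fraction of transmissions that had to wait for free buffers."""
    return xmit_wait / xmit_pkts

def locality(up_in, up_out, down_in, down_out):
    """Locality metrics as defined in this section; 0 means fully local."""
    return {
        "down": up_in / down_out,  # share of downstream traffic from the uplink
        "up": up_out / down_in,    # share of upstream traffic leaving via uplink
    }
```

With the numbers of the Fig. 5.7 scenario (half of the downstream traffic arriving via the uplink, no upstream traffic leaving it), the sketch reproduces locality_down = 50% and locality_up = 0%.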

System Monitoring Besides information about the network's performance and status, the system itself will be in scope as well. Especially a system under ongoing development requires close monitoring of system parameters to avoid performance issues and to debug the system.


6. Implementation

This chapter describes the setup of the monitoring system. The first section specifies which software products are used within the implementation, followed by the basic architecture. The last section describes the interaction of the components.

6.1. Technologies

6.1.1. OpenSM
OpenSM [10] is an open-source implementation of the SM. It is provided by the OpenFabrics Enterprise Distribution (OFED), which is distributed by the OpenFabrics Alliance [10], and is the core of most proprietary SM implementations.

Performance Manager The Performance Manager (PerfMgr) is a component included in OpenSM which periodically collects the performance data of all nodes within the fabric.

osmeventplugin OpenSM provides a plug-in structure to implement additional functionality. Two plug-ins are included in the source code: osmtest and osmeventplugin. The first plug-in implements a test suite. The second one implements callback functions to handle events within the SM, which are useful to create an efficient aggregation system.

• report Main callback function which detects the kind of event and decides what to do. Simple reports like 'Subnet up' or 'Sweep completed' are only logged. Counter and trap events are handed to one of the following functions.
• handle_port_counter This function is triggered when error counters arrive.
• handle_port_counter_ext Handles information about the performance counters.
• handle_trap_event This function reports what kind of trap (e.g. link up / link down) was received.
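The dispatch logic of these callbacks can be sketched as follows. This is a hypothetical Python rendering for illustration only; the actual plug-in is written in C against the OpenSM plug-in API, and the event representation below is an assumption.

```python
# Hypothetical rendering of the osmeventplugin dispatch logic.

def handle_port_counter(event):
    """Triggered when error counters arrive."""
    return ("error", event)

def handle_port_counter_ext(event):
    """Handles the extended (performance) counters."""
    return ("perf", event)

def handle_trap_event(event):
    """Reports the kind of trap received (e.g. link up / link down)."""
    return ("trap", event)

def report(event):
    """Main callback: detect the kind of event and decide what to do."""
    kind = event["kind"]
    if kind in ("subnet_up", "sweep_completed"):
        return ("log", event)  # simple reports are only logged
    dispatch = {
        "port_counter": handle_port_counter,
        "port_counter_ext": handle_port_counter_ext,
        "trap": handle_trap_event,
    }
    return dispatch[kind](event)
```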


6.1.2. Nagios
Nagios [8] is a popular open-source monitoring framework. Initially released in 1999, it has since gained a huge number of plug-ins and add-ons. The core function schedules monitoring tasks, manages the processing and triggers actions (usually notifications) when a monitoring status changes. The monitoring is based on an object-oriented structure. Agents are small independent programs which monitor a certain value, service or status. Within the Nagios mindset they are referred to as monitoring services (e.g. the fill rate of a hard drive). Based on the agent's exit code (EC) the service can be in one of four states:

• OK (EC 0) Default state with no problems at all.
• WARNING (EC 1) Warning status which indicates minor problems.
• CRITICAL (EC 2) An error occurred and the service suffers a major problem.
• UNKNOWN (EC 3 or higher) The exit code is outside of the expected scope.

Besides the EC, one line of standard output will be processed by Nagios. The output is split into status and performance data. The status output will be displayed to illustrate the status level. Performance data can be processed through various plug-ins.
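Such an agent reduces to a few lines: one status line on standard output (status text, then performance data after a pipe) plus the matching exit code. The following Python sketch is illustrative; check_disk and its thresholds are hypothetical and not one of the agents of this thesis.

```python
#!/usr/bin/env python3
# Minimal Nagios-style agent sketch (POSIX only, uses os.statvfs).
import os
import sys

def check_disk(path="/", warn=0.80, crit=0.90):
    """Return (exit code, status line) for the fill rate of `path`."""
    st = os.statvfs(path)
    used = 1 - st.f_bavail / st.f_blocks
    perfdata = f"used={used:.2f}"
    if used >= crit:
        return 2, f"CRITICAL - disk {used:.0%} full | {perfdata}"
    if used >= warn:
        return 1, f"WARNING - disk {used:.0%} full | {perfdata}"
    return 0, f"OK - disk {used:.0%} full | {perfdata}"

if __name__ == "__main__":
    code, line = check_disk()
    print(line)   # Nagios parses this single line
    sys.exit(code)  # the exit code selects the service state
```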

6.1.2.1. Nagiosgraph
One plug-in which processes performance data is nagiosgraph [9]. It parses the performance data and by default creates a Round Robin Database (RRD) for every service/value pair (Figures 6.1–6.2). Due to the multitude of agents which provide performance data, nagiosgraph is an easy way to manage graphs within Nagios.


Figure 6.1.: Screenshot of nagiosgraph

Figure 6.2.: Graph of a host's current load

The actual visualization is provided by a couple of CGI scripts which create graphs as they are requested. One drawback of nagiosgraph when it comes to visualizing the performance data is that it will not scale to every level. Due to the textual step during the processing and the file-based data storage, it is very expensive in terms of performance.

6.1.3. RRDtool
RRDtool implements a RRD in which the data is held in a shift register; it is used by nagiosgraph to store the performance data (which is reported by the Nagios agents, not to be confused with InfiniBand's counters). Old data is


dropped due to the continuous stream of new incoming data. The RRD container is able to precalculate values (e.g. averages over different histories). The data visualization can be done by defining queries on the RRD.
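The shift-register behavior can be illustrated with a toy Python model. MiniRRD below is a hypothetical stand-in for illustration; real RRDtool additionally consolidates samples into configurable archives and stores them on disk.

```python
from collections import deque

class MiniRRD:
    """Toy round-robin archive: fixed capacity, oldest samples drop out."""

    def __init__(self, slots):
        # deque with maxlen behaves like the shift register described above
        self.data = deque(maxlen=slots)

    def update(self, value):
        """Append a sample; the oldest one is evicted once full."""
        self.data.append(value)

    def average(self):
        """One example of a precalculated consolidation value."""
        return sum(self.data) / len(self.data)
```

After four updates into a three-slot archive, the first sample has been shifted out and only the three newest remain.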

6.1.4. Gnuplot
Gnuplot [3] comes with a very powerful programming language to create plots of all kinds of data. It supports many data formats and offers a powerful programming interface to customize the output.

6.1.5. Graphviz
Graphviz [3] is an abbreviation for Graph Visualization Software and was developed by AT&T and Bell Labs. It is a very powerful tool to define graphs and visualize them in various formats. Graphviz comes with built-in commands (dot, neato, twopi) to arrange the nodes and edges with different layouts (Table 6.1).

Table 6.1.: A simple Graphviz example source, rendered with dot, neato and twopi
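Such dot source can be generated programmatically from a link list. A minimal Python sketch; fabric_to_dot and the example links are illustrative, not part of the monitoring system.

```python
def fabric_to_dot(links):
    """Render an undirected fabric graph as Graphviz dot source."""
    lines = ["graph fabric {"]
    for a, b in links:
        # "--" declares an undirected edge in the dot language
        lines.append(f'    "{a}" -- "{b}";')
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be piped into dot, neato or twopi to produce the rendered image.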

6.1.6. Foswiki
Foswiki [2] is a fork of the Perl [11] based wiki system TWiki [17]. The data storage is implemented with plain text files within a well-defined directory structure. It uses the Revision Control System (RCS [15]) to manage changes to the text files. This allows external programs to alter or update the content. A huge number of plug-ins and an easy way to create individual macros offer the flexibility that is needed in this project.

6.2. Architecture
The implementation of the system differs from the conceptual design in that the storage and processing parts are merged (Fig. 6.3). It turned out that this is necessary, because the data has to be processed right away to avoid performance bottlenecks.


The reduction of complexity is no drawback at all; on the contrary, it is a benefit due to the improved locality of data and data processing.

Figure 6.3.: Conceptual implementation

The implementation's interaction is shown in Fig. 6.3 and is described as follows:

1. The initial step is to get information about the topology of the fabric. The Nagios agent parse_ibnetdiscover executes the tool ibnetdiscover, evaluates the output, normalizes the topology data and stores the topology in the database system.
2. The OpenSM itself and the add-on Performance Manager react on events within the fabric and scan the fabric periodically. Each event that occurs is handed to the osmeventplugin, which inserts the performance data into the database system.
3. SQL functions are used to process the incoming data and aggregate information.
4. Monitoring the current system is quite useful when it comes to detecting performance issues of one of the components. The Nagios agent eval_statistics monitors a series of services such as
   • size of database tables, amount of connections to the database system, state of transactions


   • allocated system memory, to detect memory leaks
   • free disk space, system load, running services
   and creates RRDs to store the evaluations.
5. The RRDtool is used to store data as a function of time in a shift register.
6. To visualize the RRDs requested by the user, nagiosgraph is used.
7. The Nagios agent create_netgraph uses Graphviz to draw the various fabric graphs.
8. Visualizations which are not a function of time are created with the agent create_measurement by using gnuplot.
9. Changes within the structure of the Foswiki are applied by the agent update_foswiki.
10. Last but not least, the Foswiki engine imports all components and visualizes the current state of the monitoring system.

6.3. Component Interaction

6.3.0.1. Nagios
To encapsulate the system, which is made out of a series of independent open-source tools, a framework is needed which keeps all components together. Nagios serves as such a framework to schedule and supervise the different components. This framework cannot be assigned to one of the components of the concept.

6.3.0.2. Nagios Agents
Simple agents ascertain information like the current load of the system. The appendix shows an agent which executes the UNIX command uptime and extracts the current load from the output (Appendix A.4.1.1). Information printed on standard output will be processed by Nagios and its plug-ins. The exit code defines the service status. Each agent adds performance data to its return information to monitor the system itself.

6.3.1. Data Aggregator
parse_ibnetdiscover This Nagios agent executes ibnetdiscover and parses the output. The output is processed in such a way that the topology of the network is normalized and stored in the database system.
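The normalization step can be sketched in Python. The sample below uses a heavily simplified, hypothetical line format, not the real ibnetdiscover grammar; see Appendix A.1.1 for actual output.

```python
import re

# Hypothetical, simplified link format: "<guid> <port> -> <guid> <port>".
SAMPLE = """\
0x0002c90200000001 1 -> 0x0002c90200000010 3
0x0002c90200000002 1 -> 0x0002c90200000010 4
"""

LINK = re.compile(r"(0x[0-9a-f]+) (\d+) -> (0x[0-9a-f]+) (\d+)")

def parse_links(text):
    """Normalize each line into a (guid, port, guid, port) tuple."""
    return [m.groups() for m in map(LINK.match, text.splitlines()) if m]
```

The resulting tuples are in a shape that can be inserted directly into the ports table of the database model.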


OpenSM The system's SM runs with the lowest priority (zero) and therefore takes over only if every regular SM is down. To prevent the fabric from adopting the default configuration if this inconvenient event occurs, the configuration should be adapted to the local fabric.

Performance Manager The PerfMgr is used to collect performance data. The following lines in an opensm configuration enable the PerfMgr within the SM and instruct it to scan the performance counters every 4 seconds.

enable_perfmgr TRUE
perfmgr_sweep_time_s 4

After restarting the SM to apply the configuration, the performance manager add-on scans all port counters in periods of 4 seconds. As the following fragment of an opensm logfile shows, the process is very fast, which assures the chronological synchronicity of the measurements.

Jun 17 17:32:09 [...] -> PerfMgr total sweep time: 0.002320 s
                         fastest mad: 154 us
                         slowest mad: 701 us
                         average mad: 500.414 us

osmeventplugin The triggered gathering of performance data enables the use of callbacks. Each time a MAD with performance data is received by the performance manager, it calls a function within the osmeventplugin. Within this implementation all events (error counters, traps from nodes, additional nodes joining the fabric, etc.) except the arrival of performance counters are discarded. Without the topology information evaluated in step 1 (6.2 on page 32), the information cannot be stored into the system and is discarded as well. The incoming information is directly handed to a SQL function to reduce the complexity of the osmeventplugin.

sprintf(str, "select * from osmInPerfdata('%" PRIx64 "','xmit_data',%d,%lu);",
        epc->port_id.node_guid, epc->port_id.port_num,
        epc->xmit_data / 250000); // normalize to MB
res = PQexec(conn, str);

The SQL function is shown in Appendix A.3.1.

OpenSM Modifications Modifications to the source code of the PerfMgr and the osmeventplugin were made to extract performance data about the amount of congestion the transmitted packets suffer. The original source code does not offer this information, presumably because valid data is only provided by switch ports (see also 3.3.2.2 'Performance Management' - page 17).

InfiniBand Diagnostic Tools The software package infiniband-diags offers a series of useful tools to analyze an IB fabric. They rely on the methods of the General Service 'Subnet Administration' and the registers that hold the information of the fabric entities. Some of them are essential for administration and thus for the development of a monitoring system.


ibnetdiscover The tool ibnetdiscover works the same way as the network discovery of the SM, so no active SM is necessary to use it. As shown in Appendix A.1.1, the output contains all information about the topology and the involved nodes, like link width, link speed, node description, the different GUIDs, et cetera. Within the monitoring system it is used to extract the IB topology. The output is processed by the Nagios agent parse_ibnetdiscover.

perfquery One way to visualize the counters of the General Service 'Performance Management' of a specific node is perfquery. The output is shown in the 'Performance Management' section (3.3.2.2). Like ibnetdiscover it does not depend on an active SM within the fabric; the communication is sent directly from the client to the port's performance management. This tool is not explicitly used by the system itself, but it is indispensable for debugging an InfiniBand fabric.

saquery Every node implements a Subnet Management Agent which exchanges information with the SM. To request such information the tool saquery is used. Among other information, saquery provides the node records (Appendix A.1.2.1) or the Linear Forwarding Tables.

6.3.2. Data Storage / Processor

6.3.2.1. PostgreSQL

Figure 6.4.: Database model

The database system of choice is PostgreSQL. It is fully compatible with the SQL standard and open-source software with good performance. Further influence


came from the good experience with the system in a series of other projects. A stable and easy integration into the used operating system (CentOS) settled the decision. The left side of the database model (DBM) in Fig. 6.4 reflects the various GUIDs within an IB fabric. A system (server, switch) can be part of a chassis (chassis) and implements at least one node. The node type is represented in nodetypes. Every node contains at least one port, stored in ports. On the right side the DBM represents the performance management. The actual performance value of every port/counter pair is stored in perfcache. The database tables whose names contain measurement are used to hold the measurement data.

6.3.2.2. RRDtool
Data which is going to be visualized as a function of time is stored in RRDs. Due to nagiosgraph, an agent's performance data is one source for this storage. The second source is the agent eval_statistics.

6.3.2.3. eval_statistics
This agent's purpose is to monitor a series of system parameters to ensure that the system works as designed. Some of the measurements are:

• size of database tables If certain database tables grow over a defined threshold, the overall performance will decrease.
• system load Another major measurement is the overall system load. If the load rises too high, the system will perform badly.
• database system performance By monitoring certain system parameters the database system is supervised (Fig. 6.5).

Figure 6.5.: RRD chart of system load


6.3.3. Data Visualization

6.3.3.1. graphviz / create_netgraph
A graph showing the network topology is the main communicator of information. In the plain view, just the topology (Fig. 6.6) is visualized. Other views add information like port/link utilization, congestion and traffic locality.

Figure 6.6.: Simple graph of a small network

6.3.3.2. nagiosgraph
To process the feedback of the various agents and present it as a chart, the plug-in nagiosgraph is used. The following configuration leads to a processing of performance data every 30 seconds.

# cat /usr/local/nagios/etc/nagios.cfg | grep perf
process_performance_data=1
service_perfdata_file=/tmp/perfdata.log
service_perfdata_file_template=$LASTSERVICECHECK$||$HOSTNAME$||$DESC$||$OUTPUT$||$PERFDATA$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=30
service_perfdata_file_processing_command=process-service-perfdata-for-nagiosgraph

The perfdata will be extracted from the agents' feedback and then dumped into the perfdata file.

# cat /tmp/perfdata.log
1315248607||ibeat1||updateFoswiki||OK - 0 changedFiles||changedFiles=0
1315248606||platini||portData||OK - 4 dataIns,||dataIns=4,
1315248606||maradonna||portData||OK - 4 dataIns,||dataIns=4,
1315248609||ibeat1||updateMeasurements||OK - 0 MigratedMeasurements||MigratedMeasurements=0
1315248606||ibeat1||portData||OK - 4 dataIns,||dataIns=4,
1315248606||kempes||portData||OK - 5 dataIns,||dataIns=5,
1315248606||voltaire1||portData||OK - 18 dataIns,||dataIns=18,
1315248603||voltaire1||portPkts||OK - 89 pktsIns,||pktsIns=89,


Every key/value pair will be transformed into an RRD chart (Fig. 6.7). The nagiosgraph plug-in provides configuration options to tune the amount of RRD files, use regular expressions to match only certain information, et cetera.

Figure 6.7.: RRD chart 'Current Load'

6.3.3.3. update_foswiki
This agent updates the structure of the Foswiki if necessary. If the user defines a new measurement within the wiki, the information is transferred into the database and the agent creates a new topic out of a template (Fig. 6.8).

Figure 6.8.: List of measurements within Foswiki


6.3.3.4. Foswiki
One of the major challenges is to visualize the gathered information in a way that is easily consumable by a user (Fig. 6.9).

Figure 6.9.: Dashboard front page

All the different information is accessible with a few clicks and without a huge amount of configuration if e.g. the network topology changes. The information is linked in a way that the user is able to get to the point quickly. The foundation of this framework is the wiki engine Foswiki. The Nagios agent update_foswiki alters the sections without human interaction if necessary. The monitoring configuration is stored in wiki pages and is therefore easily accessible.

Dashboard The front page (Fig. 6.9) offers a map showing the topology of the fabric. Nodes with different purposes are presented in different styles. Within the SVG picture, status information is overlaid to enable the user to get an overview at first glance. The navigation section at the top of the web page provides easy navigation within the monitoring system.

Netgraph To visualize the network, a plain netgraph without additional information is used. It is created by the Nagios agent create_netgraph (Fig. 6.6).


Performance The plain netgraph provides no further information other than the topology. The Nagios agent create_netgraph also creates graphs with information about the port performance (Fig. 6.10), link congestion (Fig. 6.11) and information about the traffic locality (Fig. 6.12).

Figure 6.10.: Dashboard view on port utilization



Figure 6.11.: Dashboard view on congestion visualization

Figure 6.12.: Dashboard view on traffic locality


Statistic View A deeper look into the system is offered by the system's statistic view. The first table (Fig. 6.13) shows the size of the most volatile tables within the database system. The second view (Fig. 6.14) visualizes some performance parameters for analyzing the system's efficiency. The number of locks within the database is shown in Fig. 6.15 to supervise database connections and prevent deadlocks.

Figure 6.13.: DB sizes as functions of time

Figure 6.14.: System statistics

Figure 6.15.: Amount of different db locks


7. Testing and Evaluation

After the implementation, the system was tested. This chapter starts with a short description of the tools used in the testing scenarios. The next section describes the test design, followed by the testbeds. The system performance is shown in Section 7.4. The chapter concludes with a discussion of the limitations and a short evaluation summary on whether the measurements are accurate.

7.1. Tools

Not only the analysis of the topology and the node attributes but also the measurement of the network performance is an essential part of the administration and optimization of networks. This section introduces some of the tools used to perform these measurements.

7.1.1. qperf

A convenient way to measure the performance is provided by qperf. With a broad set of options it covers a wide range of test scenarios.

Tests The following shows a shortened list of the available tests (a full list is given in Appendix A.2.1).

Miscellaneous
    conf                  Show configuration
    quit                  Cause the server to quit
Socket Based
    rds_bw                RDS streaming one way bandwidth
RDMA Send/Receive
    rc_bi_bw              RC streaming two way bandwidth
RDMA
    rc_rdma_read_bw       RC RDMA read streaming one way bandwidth
InfiniBand Atomics
    rc_compare_swap_mr    RC compare and swap messaging rate
Verification
    ver_rc_compare_swap   Verify RC compare and swap



Example measurement The following listing shows the measurement of the rate of atomic compare and swap operations from one node to another over a period of 4 minutes.

# qperf cruyff -t 4m rc_compare_swap_mr
rc_compare_swap_mr:
    msg_rate  =  186 K/sec

A bandwidth saturation is started as follows:

root@x15 $ qperf
root@x55 $ qperf x15 -t 10min rc_rdma_read_bw
rc_rdma_read_bw:
    bw  =  3.4 GB/sec
root@x55 $

7.1.2. LINPACK

LINPACK [6] solves large systems of linear equations. Every iteration has to be synchronized, and the processes exchange intermediate data. The benchmark is used to create the top500 list of the most powerful computer systems [16]. In this testing scenario LINPACK is used to create a synthetic workload. It utilizes the CPU and, as the number of compute nodes increases, the interconnect. Within the small testbeds the utilization will not saturate the network, but rather produce a moderate utilization. To get the best performance out of LINPACK it is necessary to plan and calculate the parameters wisely. The goal of this test was not good performance; the benchmark was chosen to approximate a workload under normal operation.
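As a rough illustration of this parameter planning, a common rule of thumb sizes the problem so that the N x N matrix of doubles fills about 80% of the aggregate memory. The node count and memory size below are assumptions for the sketch, not the testbed's actual values.

```python
import math

def hpl_problem_size(nodes, mem_per_node_gb, fill=0.8, nb=168):
    """Rule-of-thumb HPL problem size: the N x N double-precision matrix
    (8 bytes per entry) should fill about `fill` of aggregate memory.
    N is rounded down to a multiple of the block size NB."""
    total_bytes = nodes * mem_per_node_gb * 1024 ** 3
    n = int(math.sqrt(total_bytes * fill / 8))
    return (n // nb) * nb

# e.g. 3 nodes with an assumed 16 GB each, NB = 168 as in the input file
n = hpl_problem_size(3, 16)
```

The actual Ns value used in the laboratory (60000, see Appendix A.5.3) is deliberately smaller than such a memory-filling maximum, in line with the moderate-workload goal stated above.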



7.2. Design of the Test

7.2.1. Link Utilization

The link utilization benchmark uses qperf to saturate links within the fabric. The laboratory network is utilized as shown in Fig. 7.1. Two connections saturate the interswitch connection; combined with a third connection, the link to the receiving host is shared by two connections. Within the customer network (Fig. 7.2) three connections are used to utilize the interswitch connection. Local connections are established with two and three pairs of qperf connections to create an appropriate testbed for the traffic locality measurement.

Figure 7.1.: qperf testdesign in laboratory

Figure 7.2.: qperf testdesign within the customer's fabric



7.2.2. LINPACK

As described, the LINPACK benchmark solves a linear equation problem with a distributed approach.

Figure 7.3.: LINPACK test within lab

Figure 7.4.: LINPACK benchmark in customer context

The intermediate data of every iteration within the benchmark is exchanged between the nodes. The expected traffic within the laboratory is shown in Fig. 7.3; the interaction graph within the customer's fabric is shown in Fig. 7.4. The LINPACK input files for both environments are shown in Appendix A.5.3.



7.3. The Testing Network

7.3.1. Laboratory Network

The fabric used to start this project contains four servers connected to a 24 port and an eight port switch (Fig. 7.5). All links run at SDR speed due to the capabilities of the 24 port switch and the hosts' SDR channel adapters. The 24 port switch does not implement congestion counters. Therefore, only the eight port switch is able to indicate congestion, even though the other switch may suffer from it as well.

Figure 7.5.: InfiniBand laboratory testbed

The laboratory network was used to develop the basic framework of the monitoring system and to make some measurements to prove its accuracy.

7.3.2. Customer Network

The testbed (Fig. 7.6) is a section of a customer's network and contains:
• 1 file server
• 1 monitoring server
• 3 36 port QDR IB switches
• 14 compute nodes

Since the complete network is in production use, the tests did not run exclusively.



Figure 7.6.: Testing environment

It had to be guaranteed that the tests did not affect the production workload. The entire fabric contains more than 100 compute nodes with independent workflows which potentially influence the test scenarios by creating errors, interacting with the subnet manager, and so forth.

7.4. System Performance

7.4.1. Network Graph

The network graphs are shown in Figures 7.7 and 7.8. They are updated periodically by the agent create_netgraph.

Figure 7.7.: Laboratory network



Figure 7.8.: Plain customer fabric

7.4.2. Port Utilization

To check the visualization of the port performance, the synthetic performance measurement tool qperf is used. The sending and receiving hosts were described earlier (Section 7.2). The port utilization graph shows how much of the available bandwidth a port uses. The color scheme contains 10 shades: if there is no traffic at all, the link is represented by a black line; otherwise the color shifts from green to red in 10% steps. If a connection between two ports consists of multiple links, the most utilized link is visualized. The utilization in the laboratory during the qperf benchmark is shown in Fig. 7.10. The customer network shows port utilization (Fig. 7.9) and congestion (Fig. 7.14) along the traffic path.
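The color scheme just described can be sketched in a few lines. This is an illustrative model, not the actual create_netgraph code; the hex-color encoding is an assumption.

```python
def utilization_color(used_mb_s, capacity_mb_s):
    """Map link utilization to one of 10 shades from green to red,
    in 10% steps; an idle link is drawn as a black line."""
    if used_mb_s == 0:
        return "#000000"                    # no traffic: black line
    share = min(used_mb_s / float(capacity_mb_s), 1.0)
    step = min(int(share * 10), 9)          # bucket 0..9 in 10% steps
    red = int(255 * step / 9.0)             # shift from green toward red
    green = 255 - red
    return "#%02x%02x00" % (red, green)
```

For a multi-link interswitch connection, this function would be applied to the most utilized of the parallel links, as described above.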



Figure 7.9.: Port performance during qperf

An excerpt of the performance data used to generate the graphs is shown with Fig. 7.10.



Figure 7.10.: Port performance during qperf

| Src  | Dst       | XmitSrc (MB/s) | XmitDst (MB/s) |
|------|-----------|----------------|----------------|
| sw8  | kempes    |              0 |            524 |
| sw8  | platini   |            989 |            988 |
| sw24 | sw8       |            462 |            988 |
| sw24 | maradonna |            988 |              0 |
| sw24 | ibmon1    |              0 |            461 |

Data base of Fig. 7.10

The LINPACK benchmark saturated neither the links in the lab (Fig. 7.11) nor those in the customer network (Fig. 7.12).



Figure 7.11.: Port performance during LINPACK

Figure 7.12.: Port performance during LINPACK



7.4.3. Congestion Visualization

The congestion visualization measures the number of packets that suffer congestion. If the fraction exceeds 5 · 10^-5 of all packets sent, the color deflection is maximal. This threshold was chosen to obtain a range of different colorizations from the testbeds. During the qperf performance test, congestion appears along the traffic path between the end nodes (Figures 7.13 and 7.14).
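The thresholding can be sketched as follows, assuming the congestion fraction is computed from the per-interval deltas of the xmit_wait and packet counters. This is an illustrative model of the scheme described above, not the actual implementation.

```python
CONG_MAX = 5e-5  # fraction of packets at which color deflection is maximal

def congestion_deflection(xmit_wait_delta, xmit_pkts_delta):
    """Return the color deflection in [0.0, 1.0]: the fraction of packets
    that suffered congestion, clamped at the 5 * 10^-5 threshold."""
    if xmit_pkts_delta == 0:
        return 0.0
    fraction = xmit_wait_delta / float(xmit_pkts_delta)
    return min(fraction / CONG_MAX, 1.0)
```

Clamping at the threshold means that even the heavy congestion of 23593 · 10^-5 in the table below Fig. 7.13 and a barely-over-threshold value both render with full deflection; the threshold only spreads the colorization within the low range seen in the testbeds.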

Figure 7.13.: Congestion graph during qperf

| Src  | Dst       | CongSrc (10^-5) | CongDst (10^-5) |
|------|-----------|-----------------|-----------------|
| sw8  | kempes    |           23593 |               0 |
| sw8  | platini   |               0 |               0 |
| sw24 | sw8       |               0 |              19 |
| sw24 | maradonna |               0 |               0 |
| sw24 | ibmon1    |               0 |               0 |

Data base of Fig. 7.13




Figure 7.14.: Congestion within qperf within the customer fabric

The congestion graph during the LINPACK benchmark (Figures 7.15–7.16) indicates that almost every link suffers congestion.

Figure 7.15.: Congestion map during LINPACK



Figure 7.16.: Congestion map during LINPACK

Since only one switch in the laboratory is compliant with the newest InfiniBand standard, only its links can indicate congestion. The data leading to the congestion graph is shown for Fig. 7.13.

7.4.4. Traffic Locality

Due to restrictions in Graphviz, the node is colored according to the higher percentage of traffic locality (up/down). The locality of the traffic within the laboratory is shown in Figures 7.17–7.18.
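Working from the four per-node counters listed in the data table of Fig. 7.17 (upIn, upOut, downIn, downOut), the up/down percentages could be derived roughly as below. This is only a sketch of the idea; the exact formula used by create_netgraph is not reproduced here.

```python
def locality_percentages(up_in, up_out, down_in, down_out):
    """Share of a node's traffic that travels 'up' (toward the root switch)
    versus 'down' (away from it), separately for inbound and outbound
    traffic, in percent."""
    def pct(part, total):
        return 100.0 * part / total if total else 0.0
    return {
        "up_in":    pct(up_in, up_in + down_in),
        "down_in":  pct(down_in, up_in + down_in),
        "up_out":   pct(up_out, up_out + down_out),
        "down_out": pct(down_out, up_out + down_out),
    }
```

For a leaf host such as platini all traffic crosses its single uplink, so both percentages are 100; for a switch the split characterizes how much traffic stays local versus crossing the interswitch link.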



Figure 7.17.: Traffic locality during qperf

| Node      | Up  | Down | upIn | downOut | upOut | downIn |
|-----------|-----|------|------|---------|-------|--------|
| kempes    |   0 |  100 |    0 |       0 |   524 |      0 |
| platini   | 100 |  100 |  989 |       0 |   988 |      0 |
| sw8       |  46 |   65 |  462 |     989 |   988 |   1512 |
| maradonna | 100 |    0 |  988 |       0 |     0 |      0 |
| ibmon1    |   0 |  100 |    0 |       0 |   461 |      0 |
| sw24      |   0 |    0 |    0 |    1450 |     0 |   1449 |

Data base of Fig. 7.17

Within the customer's network the locality is shown in Figures 7.19–7.20.




Figure 7.18.: Locality during LINPACK

Figure 7.19.: Locality during qperf



Figure 7.20.: Locality during LINPACK

The data underlying the graph is shown below Fig. 7.17.

7.4.5. Utilization Shares

Within the laboratory network, the measurement of utilization shares shows how the link utilization is distributed. Figures 7.21–7.23 show the distribution of bandwidth on three links during a five minute measurement. The outliers above 1000 MB/s are due to inconsistencies during the measurement: because of heavy SQL calculations, a time drift leads to a higher measured bandwidth than expected. This could be eliminated by optimizing the process workflow or by limiting the number of parallel measurements. The first plot shows the link between maradonna and the root switch sw24. The link is fully utilized.
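Conceptually, these distribution plots are histograms over per-interval bandwidth samples. A minimal sketch, with an assumed bucket width:

```python
def utilization_distribution(samples_mb_s, bucket_mb_s=100):
    """Bucket per-interval bandwidth samples (MB/s) into a histogram:
    maps each bucket's lower bound to the number of samples in it."""
    dist = {}
    for s in samples_mb_s:
        bucket = int(s // bucket_mb_s) * bucket_mb_s
        dist[bucket] = dist.get(bucket, 0) + 1
    return dist
```

A fully utilized SDR link would pile almost all samples into the bucket just below 1000 MB/s, while the time-drift outliers mentioned above would appear as a small count in buckets above it.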



Figure 7.21.: Utilization distribution during 5 minutes of measurement between sw24 and maradonna

The second plot (Fig. 7.22) visualizes the bandwidth distribution of the interswitch connection, which consists of two links. A comparison with the port performance graph (Fig. 7.13) nicely shows the consistency of the measurement. Towards the root switch one link is fully utilized while the other is not used. The downlink utilization is 50 percent.

Figure 7.22.: Utilization distribution during 5 minutes of measurement between sw24 and sw8


The last section of the performance link (Fig. 7.23) is fully utilized in both directions.

Figure 7.23.: Utilization distribution during 5 minutes of measurement between sw8 and platini

7.5. Limitations

7.5.1. Link Speed / Performance Counter Width

As described earlier (3.3.2.2 'Performance Management' - page 17), the performance counters are only 32 bit wide. The xmit_data counter counts 4 byte words, which leads to ≈ 17 GB (2^32 · 4 B) before the counter reaches its highest value. At QDR speed on a link with a width of four (40 Gbit/s ≈ 3.8 GB/s), the counter saturates within four to five seconds; data transmitted between saturation and the next reset is not measured. Since the sampling rate is four seconds, the possibly lost measurements are negligible and the problem does not harm the monitoring system. In the worst case, a counter saturates and its increase from then until the end of the time period is not measured.
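The saturation time quoted above can be verified with a few lines, using the effective data rate from the text:

```python
def seconds_until_saturation(effective_gb_per_s, counter_bits=32, word_bytes=4):
    """Time until a PortXmitData-style counter saturates: the counter is
    `counter_bits` wide and counts `word_bytes`-byte words, so it covers
    2^32 * 4 B, roughly 17 GB of traffic."""
    capacity_bytes = (2 ** counter_bits) * word_bytes
    return capacity_bytes / (effective_gb_per_s * 1e9)

# 4x QDR with ~3.8 GB/s effective data rate -> roughly 4.5 seconds
t = seconds_until_saturation(3.8)
```

At SDR speed (roughly a quarter of that rate) the same counter lasts about four times as long, which is why the issue only becomes pressing on faster fabrics.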

7.5.2. XmitWait Counter

The modification of the osmeventplugin and OpenSM to offer performance data leads to two limitations:

1. As described previously (see also 3.3.2.2 'Performance Management' - page 17), congestion can only affect switch ports; ports on CAs do not provide proper counter values. The way the patch is applied, OpenSM internally assumes that the received performance data contains valid packets including the xmit_wait counter. Within the SQL functions the congestion information of CAs is dropped. This could lead to side effects, even though none occurred during the testing.

2. During normal operation the PerfMgr part of OpenSM executes a read-and-clear operation on every port. This ensures that only the interval is measured. The mechanism does not affect the xmit_wait counter; therefore the counter is not reliably reset every time it is read. To work around that, the counter is cached in the database table perfhist and the delta is calculated. The Nagios agent eval_statistics resets a port's counters if the counter reaches a value above 4 trillion.

7.5.3. Scalability

The developed monitoring system works as expected within the small laboratory network, and it also works within the bigger network. At the end of the testing period, however, the performance data was no longer collected within the expected time period of 4 seconds: the delay of incoming performance data increased to 30 seconds and stayed at that level. As described, the customer's network carries the normal workload it was bought for, and during this strange behavior a broken simulation job was running, which may have had a side effect on the network. Prior to the submission of this thesis, the cause of the delay could not be ascertained.

7.6. Evaluation Summary

Apart from the issue within the customer's network, all visualizations and measurements are created accurately. Therefore the implementation serves the concept.


8. Conclusions

In this chapter the usage scenario fulfillment is considered and the system is compared to the existing monitoring systems; a discussion of ideas for future work with respect to the current system's limitations concludes the chapter.

8.1. Usage Scenario Fulfillment

The developed system offers the user a tool to visualize the given network topology. This is an improvement compared to the current situation. Furthermore, graphs enriched with an indication of the amount of data transmitted through the network, the congestion that occurs, and the traffic locality are offered. This enables the user to diagnose problems and to track down bottlenecks within the fabric. The utilization share plots are a very powerful tool to characterize a workload's utilization fingerprint: they reveal the real usage of the network during a workflow.

8.2. Comparison

QLogic's and Mellanox's suites address the same issues. They approach easy monitoring by implementing a series of functions. The developed system does not have enough resources behind it to bring it close to the market, but with its flexibility, open-source approach and modular design it is a good starting point for exploring an InfiniBand fabric.

Table 8.1.: Comparison of UFM, IFS and the developed system

| Feature                         | UFM | IFS | Developed System |
|---------------------------------|-----|-----|------------------|
| Visualize the network           |  ✓  |  ✓  |        ✓         |
| Traffic locality                |     |     |        ✓         |
| Rerouting the traffic           |  ✓  |  ✓  |                  |
| Partitioning the fabric         |  ✓  |  ✓  |                  |
| Easy customization              |     |     |        ✓         |
| open-source components          |     |     |        ✓         |
| Integration into standard tools |     |     |        ✓         |

8.3. Future Work

In future work the system has to be brought to a reliable, tested and stable condition.



The reset of xmit_wait should be addressed by analyzing the OpenSM code and implementing the missing functions. If the system is to be deployed on bigger fabrics and the performance counters saturate too fast, it should be considered to implement a general service that samples the performance values through the PortSampleControl register. After finishing this thesis, the author plans to implement a plug-in to enrich the graphs with additional information such as events within the fabric (errors, warnings, status changes) and information from the job queuing system used in the customer's environment. Furthermore, it would be valuable to compare the current state to older ones to detect missing hosts and degraded links.


A. Appendix

A.1. InfiniBand

A.1.1. ibnetdiscover

Listing of ibnetdiscover within the laboratory network.

$ ibnetdiscover
#
# Topology file: generated on Fri Sep 30 22:02:45 2011
#
# Initiated from node 0008f10403990a7c port 0008f10403990a7d

vendid=0x8f1
devid=0x5a2f
sysimgguid=0x8f104004127bd
switchguid=0x8f104004127bc(8f104004127bc)
Switch 24 "S-0008f104004127bc"    # "ISR9024S-M Voltaire" enhanced port 0 lid 1 lmc 0
[3]  "H-0008f10403990944"[1](8f10403990945)    # "platini HCA-1" lid 2 4xSDR
[4]  "H-0008f104039901d4"[1](8f104039901d5)    # "maradonna HCA-1" lid 3 4xSDR
[6]  "H-0008f1040399094c"[1](8f1040399094d)    # "kempes HCA-1" lid 9 4xSDR
[7]  "H-0008f10403990a7c"[1](8f10403990a7d)    # "ibmon1" lid 5 4xSDR

vendid=0x8f1
devid=0x6274
sysimgguid=0x8f1040399094f
caguid=0x8f1040399094c
Ca 1 "H-0008f1040399094c"    # "kempes HCA-1"
[1](8f1040399094d)  "S-0008f104004127bc"[6]    # lid 9 lmc 0 "ISR9024S-M Voltaire" lid 1 4xSDR

vendid=0x8f1
devid=0x6274
sysimgguid=0x8f104039901d7
caguid=0x8f104039901d4
Ca 1 "H-0008f104039901d4"    # "maradonna HCA-1"
[1](8f104039901d5)  "S-0008f104004127bc"[4]    # lid 3 lmc 0 "ISR9024S-M Voltaire" lid 1 4xSDR

vendid=0x8f1
devid=0x6274
sysimgguid=0x8f10403990947
caguid=0x8f10403990944



Ca 1 "H-0008f10403990944"    # "platini HCA-1"
[1](8f10403990945)  "S-0008f104004127bc"[3]    # lid 2 lmc 0 "ISR9024S-M Voltaire" lid 1 4xSDR

vendid=0x8f1
devid=0x6274
sysimgguid=0x8f10403990a7f
caguid=0x8f10403990a7c
Ca 1 "H-0008f10403990a7c"    # "ibmon1"
[1](8f10403990a7d)  "S-0008f104004127bc"[7]    # lid 5 lmc 0 "ISR9024S-M Voltaire" lid 1 4xSDR

A.1.2. saquery

A.1.2.1. Node Record

NodeRecord dump:
    lid.....................0x3
    reserved................0x0
    base_version............0x1
    class_version...........0x1
    node_type...............Channel Adapter
    num_ports...............0x1
    sys_guid................0x0008f104039901d7
    node_guid...............0x0008f104039901d4
    port_guid...............0x0008f104039901d5
    partition_cap...........0x40
    device_id...............0x6274
    revision................0xA0
    port_num................0x1
    vendor_id...............0x8F1
    NodeDescription.........maradonna HCA-1

A.2. qperf

A.2.1. Full List of Tests

Miscellaneous
    conf                      Show configuration
    quit                      Cause the server to quit
Socket Based
    rds_bw                    RDS streaming one way bandwidth
    rds_lat                   RDS one way latency
    sctp_bw                   SCTP streaming one way bandwidth
    sctp_lat                  SCTP one way latency
    sdp_bw                    SDP streaming one way bandwidth
    sdp_lat                   SDP one way latency
    tcp_bw                    TCP streaming one way bandwidth
    tcp_lat                   TCP one way latency
    udp_bw                    UDP streaming one way bandwidth
    udp_lat                   UDP one way latency
RDMA Send/Receive
    rc_bi_bw                  RC streaming two way bandwidth
    rc_bw                     RC streaming one way bandwidth
    rc_lat                    RC one way latency
    uc_bi_bw                  UC streaming two way bandwidth
    uc_bw                     UC streaming one way bandwidth
    uc_lat                    UC one way latency
    ud_bi_bw                  UD streaming two way bandwidth
    ud_bw                     UD streaming one way bandwidth
    ud_lat                    UD one way latency
RDMA
    rc_rdma_read_bw           RC RDMA read streaming one way bandwidth
    rc_rdma_read_lat          RC RDMA read one way latency
    rc_rdma_write_bw          RC RDMA write streaming one way bandwidth
    rc_rdma_write_lat         RC RDMA write one way latency
    rc_rdma_write_poll_lat    RC RDMA write one way polling latency
    uc_rdma_write_bw          UC RDMA write streaming one way bandwidth
    uc_rdma_write_lat         UC RDMA write one way latency
    uc_rdma_write_poll_lat    UC RDMA write one way polling latency
InfiniBand Atomics
    rc_compare_swap_mr        RC compare and swap messaging rate
    rc_fetch_add_mr           RC fetch and add messaging rate
Verification
    ver_rc_compare_swap       Verify RC compare and swap
    ver_rc_fetch_add          Verify RC fetch and add

A.3. SQL

A.3.1. Function osmInPerfdata()

SQL function to insert the performance data into the database system [12]. It hides the logic of the insertion process from the osmeventplugin and ensures that the osmeventplugin stays independent.

CREATE OR REPLACE FUNCTION osmInPerfdata(text, text, int, bigint, int) RETURNS VOID AS $$
/* IN : NodeGUID, Perfkey, PortNr, value to insert, time_diff since last insert
 * OUT: void */
DECLARE
    pk    intval%ROWTYPE;
    ports porttype%ROWTYPE;
BEGIN
    /* Resolve the textual perfkey into the id inside the current db state */
    SELECT pk_id INTO pk FROM perfkeys WHERE pk_name = $2 ORDER BY pk_id LIMIT 1;
    /* Get key information to insert the key/value pair */
    SELECT p_id, n_guid, s_name, s_nagios, lid INTO ports FROM getport
        WHERE n_guid = $1 AND port = $3;
    IF ports.s_nagios THEN
        /* Update performance cache to ensure current data, e.g. in the network map */
        PERFORM upsert_perf(ports.p_id, pk.val, $4, $5);
        /* Perform measurement if this port is part of one */
        PERFORM measure(ports.p_id, ports.lid, pk.val, $4, $5);
    END IF;
END;
$$ LANGUAGE 'plpgsql';

A.3.2. Function upsert_perf()

Function to decide how to insert a given performance value.

CREATE OR REPLACE FUNCTION upsert_perf(int, int, bigint, int) RETURNS VOID AS $$
/* IN : PortID, PerfkeyID, value to insert, timediff to last upsert
 * OUT: void */
DECLARE
    typepk perfkeys%ROWTYPE;
    typepc perfhist%ROWTYPE;
    txt    type1txt%ROWTYPE;
BEGIN
    SELECT nt_name INTO txt FROM ports NATURAL JOIN nodes NATURAL JOIN nodetypes
        WHERE p_id = $1;
    SELECT * INTO typepk FROM perfkeys WHERE pk_id = $2;
    IF typepk.pk_name IN ('xmit_wait') THEN
        IF txt.val IN ('root', 'switch') THEN
            SELECT * INTO typepc FROM perfhist WHERE p_id = $1 AND pk_id = $2;
            IF NOT FOUND THEN
                INSERT INTO perfhist (p_id, pk_id, pc_val) VALUES ($1, $2, $3);
                INSERT INTO perfcache (p_id, pk_id, pc_val) VALUES ($1, $2, $3 / $4);
            ELSE
                IF typepc.pc_val < $3 THEN
                    PERFORM upsert_perfcache($1, $2, ($3 - typepc.pc_val) / $4);
                    UPDATE perfhist SET pc_val = $3 WHERE p_id = $1 AND pk_id = $2;
                ELSE
                    PERFORM upsert_perfcache($1, $2, $3 / $4);
                END IF;
            END IF;
        END IF;
    ELSE
        -- perfkeys that work correctly need no special treatment
        UPDATE perfcache SET pc_val = $3 / $4 WHERE p_id = $1 AND pk_id = $2;
        IF NOT FOUND THEN
            INSERT INTO perfcache (p_id, pk_id, pc_val) VALUES ($1, $2, $3 / $4);
        END IF;
    END IF;
END;
$$ LANGUAGE 'plpgsql';

A.3.3. Function upsert_perfcache()

Exemplary function to update a value if the key exists; otherwise the key/value pair is inserted.

CREATE OR REPLACE FUNCTION upsert_perfcache(int, int, bigint) RETURNS VOID AS $$
/* IN : PortID, PerfkeyID, value to insert
 * OUT: void */
BEGIN
    UPDATE perfcache SET pc_val = $3 WHERE p_id = $1 AND pk_id = $2;
    IF NOT FOUND THEN
        INSERT INTO perfcache (p_id, pk_id, pc_val) VALUES ($1, $2, $3);
    END IF;
END;
$$ LANGUAGE 'plpgsql';

A.4. Nagios

A.4.1. Nagios Agents

A.4.1.1. curLoad1

The shortened excerpt below illustrates the check logic; the complete agent follows.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
(ec, out) = commands.getstatusoutput("uptime")
r = 'load average\: (\d+\.\d+), (\d+\.\d+), (\d+\.\d+)'
m = re.search(r, out)
if m:
    (l1, l5, l15) = m.groups()
print "load1 %s | load1=%s" % (l1, l1)
if l1 < warn:
    sys.exit(0)   # if load < warn-threshold: OK
elif l1 < err:
    sys.exit(1)   # if load < err-threshold: WARN
else:
    sys.exit(2)   # if load >= err-threshold: CRIT

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# load libraries
import re, os, sys, commands, time
sys.path.append('/usr/lib64/python2.4/site-packages/')
import pgdb, datetime
from optparse import OptionParser
sys.path.append('/usr/local/lib/')
sys.path.append('/usr/local/src/lib/')
import dbCon, debugFile
import parseNagiosNodes as pnn

class Parameter(object):
    def __init__(self, argv):
        # parameter handling
        usageStr = "check_curLoad1 [options]"
        self.parser = OptionParser(usage=usageStr)
        self.default()
        (self.options, args) = self.parser.parse_args()
        # copy over all class attributes
        self.__dict__ = self.options.__dict__
        self.args = args
    def default(self):
        # default options
        self.parser.add_option("-d", action="count", dest="debug",
            help="increases debug [default: None, -d: 1, -ddd: 3]")
        self.parser.add_option("-t", dest="tabs", default="",
            action="store", help="tables to count")
    def check(self):
        pass

class checks(object):
    def __init__(self, debug, opt):
        self.opt = opt
        self.retEC = 0
        self.start = int(datetime.datetime.now().strftime("%s"))
        self.statusTXT = ""
        self.perfTXT = ""
    def addPerf(self, key, val):
        self.perfTXT += "%s=%s, " % (key, val)
        if val != 0:
            self.statusTXT += "%s %s, " % (val, key)
    def load1(self):
        cmd = "uptime"
        (ec, out) = commands.getstatusoutput(cmd)
        if ec != 0:
            print "CRITICAL: %s" % out.replace("\n", "||")
            sys.exit(2)
        r = 'load average\: (\d+\.\d+), (\d+\.\d+), (\d+\.\d+)'
        m = re.search(r, out)
        if m:
            (l1, l5, l15) = m.groups()
        else:
            print "CRITICAL: %s" % out.replace("\n", "||")
            sys.exit(2)
        self.addPerf("load1", l1)
    def __str__(self):
        if self.retEC == 0:
            retTXT = "OK"
        elif self.retEC == 1:
            retTXT = "WARN"
        elif self.retEC == 2:
            retTXT = "CRIT"
        retTXT += " - %s | %s" % (self.statusTXT, self.perfTXT)  # perfdata
        return retTXT
    def getEC(self):
        return self.retEC

def main(argv=None):
    # parameters
    options = Parameter(argv)
    options.check()
    deb = debugFile.debug(options, "/usr/local/nagios/var/check_ports.log")
    chk = checks(deb, options)
    chk.load1()
    print chk
    ec = chk.getEC()
    sys.exit(ec)

# call main() at the very bottom
if __name__ == "__main__":
    main()

A.5. System

A.5.1. Startup

The system automatically discovers the network and stores the information within the database system. To provide all necessary tables, functions and views, the database has to be initialized.

$ psql -U postgres -f init_pg.sql
DROP DATABASE
CREATE DATABASE
ALTER DATABASE
[...]
$

The next step is to start the Nagios framework and the OpenSM subnet manager.


$ /usr/local/etc/init.d/opensmd start
Starting opensm :                                          [  OK  ]
$ /etc/init.d/nagios start
Starting nagios : done.
$



From now on Nagios will supervise the execution of the agents needed to run the system. The OpenSM plug-in osmeventplugin will provide the performance data.

A.5.2. Normal Operation

The agent parse_ibnetdiscover inserts the topology information. The visualization in various graphs is done by the agent create_netgraph.

$ su -c '/usr/local/nagios/libexec/parse_ibnetdiscover' nagios
OK 1871 querys, 1350 commits, wall 2 sec | querys=1871 commits=1350 wall=2
$ su -c '/usr/local/nagios/libexec/create_netgraph' nagios
OK 0 sec createDot | createDot=0
$

Nagios provides a web-interface to supervise the monitoring services.

Figure A.1.: Nagios services

A.5.3. LINPACK

Laboratory The input file of the LINPACK benchmark within the laboratory.

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
60000        Ns
1            # of NBs



168          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
3            Ps
4            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
0            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

The benchmark was started with the following command and output. The mpd.hosts file contains three server names that are used in the distributed run. Each server hosts 4 threads of LINPACK, which leads to 12 threads in total.

# time mpirun -hostfile mpd.hosts -np 12 ./xhpl
================================================================================
HPLinpack 2.0  --  High-Performance Linpack benchmark  --  September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
[...]
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

================================================================================

================================================================================ Ç



T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR01L2R4       60000   168     3     4            2678.54              5.376e+01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) = 0.0035458 ...... PASSED
================================================================================

Finished 1 tests with the following results:
            1 tests completed and passed residual checks,
            0 tests completed and failed residual checks,
            0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

real    45m34.338s
user    178m19.830s
sys     0m53.796s

Customer Network The input file of the LINPACK benchmark within the customer network. The run was started with 12 threads per node; 14 nodes were used.

mpirun -ppn 12 -n 168 ./xhpl_intel64
================================================================================
HPLinpack 2.0  --  High-Performance Linpack benchmark  --  September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
[...]
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0


Column=001008   Fraction=0.005   Mflops=1432450.47
[...]
Column=169176   Fraction=0.995   Mflops=1404509.88
================================================================================
T/V                N    NB     P     Q               Time             Gflops
--------------------------------------------------------------------------------
WR01C2R4      169888   168    11    12            2328.57          1.404e+03
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) =  0.0018721 ...... PASSED
================================================================================
T/V                N    NB     P     Q               Time             Gflops
--------------------------------------------------------------------------------
WR01C2R4           0   168    11    12               0.00          0.000e+00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) =  0.0000000 ...... PASSED
================================================================================

Finished 2 tests with the following results:
    2 tests completed and passed residual checks,
    0 tests completed and failed residual checks,
    0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
Done: Thu Oct 6 13:36:44 CEST 2011

real    39m13.595s
user    0m0.613s
sys     0m0.263s
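The PASSED verdicts in the listings above come from HPL's scaled residual check. A minimal NumPy sketch of that check (not part of the thesis tooling; matrix size and seed are arbitrary illustration values):

```python
import numpy as np

def hpl_scaled_residual(A, x, b):
    """Scaled residual as printed by HPL:
    ||Ax-b||_oo / (eps * (||A||_oo * ||x||_oo + ||b||_oo) * N)
    """
    n = A.shape[0]
    # relative machine precision as reported by HPL (~1.110223e-16)
    eps = np.finfo(np.float64).eps / 2
    r_inf = np.linalg.norm(A @ x - b, ord=np.inf)
    scale = eps * (np.linalg.norm(A, ord=np.inf) * np.linalg.norm(x, ord=np.inf)
                   + np.linalg.norm(b, ord=np.inf)) * n
    return r_inf / scale

# Solve a small random system and apply the same pass criterion as HPL.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
b = rng.standard_normal(200)
x = np.linalg.solve(A, b)
residual = hpl_scaled_residual(A, x, b)
print("PASSED" if residual < 16.0 else "FAILED")
```

The threshold 16.0 and the value of eps match the figures printed in the benchmark header above.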


Glossary

ASIC - Application-Specific Integrated Circuit
Chip which contains the crossbar and the management port (port 0) of a switch.

CA - Channel Adapter
Consumer/producer device within a processor node (HCA) or an I/O unit (TCA).

CM - Communication Manager
In charge of implementing and supervising the QP communication. Abstracts the communication so that applications, for example, do not have to choose ports themselves.

CQ - Completion Queue
Every processed WQE is placed in the Completion Queue to conclude the command.

CQE - Completion Queue Element
When a work request is completed, a CQE is created. It is necessary to signal the completion of the operation on application level.

GMP - General Management Packets
Non-prioritized MADs which are transferred in VL0-14. They are used for all General Services.

GUID - Globally Unique Identifier
64-bit field comparable to a MAC address. Unlike a MAC address, a GUID is assigned to more than just the network interface.

HCA - Host Channel Adapter
Network device of a compute system.

IEEE - Institute of Electrical and Electronics Engineers
Non-profit association that standardizes technical protocols and thereby advances technical innovation.

LID - Local Identifier
16-bit integer value which the Subnet Manager assigns to every host within a network. This identifier is used to address a component in the fabric.


MAD - Management Datagram
Packet type which contains management payload and has a fixed payload of exactly 256 bytes.

OFED - OpenFabrics Enterprise Distribution
Software bundle of kernel drivers, userspace libraries and applications. De facto core of all commercial SM implementations.

QP - Queue Pair
Bundle of a send and a receive queue on two nodes within a fabric.

RDMA - Remote Direct Memory Access
Allows read, write and atomic operations on specific memory regions of a remote host without interaction on the remote side (besides allocating memory for shared use).

RRD - Round Robin Database
Circular register of predefined size on a time-scaled axis. Once the register is filled, the oldest value is dropped. Commonly, older data is kept at a declining resolution. A popular tool of this type is rrdtool, which was developed to store and visualize monitored performance data over time.

SDR - Single Data Rate
The initial signalling rate of the InfiniBand standard (2.5 GHz).

SL - Service Level
Field in the LRH (Local Routing Header) which is mapped to a Virtual Lane. Can be used to implement QoS (Quality of Service).

SM - Subnet Manager
The Subnet Manager discovers the network, computes all routes within it, distributes the routes to every switch, and adjusts the configuration if an event (e.g., a new node) occurs.

SMP - Subnet Management Packets
MADs which are transferred in the prioritized VL15.

SQ - Send Queue
The Send Queue contains the send WQEs of a specific Queue Pair.

TCA - Target Channel Adapter
Network device of a special subsystem (e.g., storage).

VL - Virtual Lane
Virtual buffer on top of the physical buffer of every port. Every port implements at least VL0 as a buffer for unprioritized traffic and VL15 for SMPs (Subnet Management Packets). The unprioritized traffic communicates with the same methods as normal data traffic (link flow control). SMPs are privileged over all data VLs.
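The ring-buffer behaviour described in the RRD entry can be sketched as follows (a hypothetical minimal Python illustration; rrdtool itself is implemented differently and additionally consolidates older data into lower resolutions):

```python
from collections import deque

class RoundRobinRegister:
    """Fixed-size circular register: once full, each new sample
    pushes out the oldest one, as described for RRDs above."""
    def __init__(self, slots):
        self.samples = deque(maxlen=slots)

    def update(self, timestamp, value):
        # When the deque is full, appending drops the oldest entry.
        self.samples.append((timestamp, value))

reg = RoundRobinRegister(slots=3)
for t, v in [(0, 1.0), (60, 2.0), (120, 3.0), (180, 4.0)]:
    reg.update(t, v)
print(list(reg.samples))  # the sample at t=0 has been dropped
```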


WQE - Work Queue Element
Item which defines a task in a send or receive queue. A processed WQE is placed in the Completion Queue. WQE is pronounced 'wookie'.


Bibliography

[1] Fat tree. http://en.wikipedia.org/wiki/Fat_tree (visited 10.2011).
[2] Foswiki. http://foswiki.org/ (visited 10.2011).
[3] Gnuplot. http://www.gnuplot.info/ (visited 10.2011).
[4] IEEE. http://www.ieee.org (visited 10.2011).
[5] Internet protocol suite. http://en.wikipedia.org/wiki/Tcpip (visited 10.2011).
[6] LINPACK. http://www.netlib.org/linpack/ (visited 10.2011).
[7] MAC address. http://en.wikipedia.org/wiki/MAC_address (visited 10.2011).
[8] Nagios - the industry standard in IT infrastructure monitoring. http://www.nagios.org (visited 08.2011).
[9] Nagiosgraph - data collection and graphing for Nagios. http://nagiosgraph.sourceforge.net/ (visited 08.2011).
[10] OpenFabrics Alliance. https://www.openfabrics.org (visited 10.2011).
[11] Perl. http://www.perl.org/ (visited 10.2011).
[12] PostgreSQL documentation. http://www.postgresql.org/docs/8.2 (visited 08.2011).
[13] QLogic. http://www.qlogic.com/ (visited 10.2011).
[14] QLogic InfiniBand products. http://www.qlogic.com/Products/Switches/Pages/InfiniBandSwitches.aspx (visited 10.2011).
[15] RCS (Revision Control System). http://www.cs.purdue.edu/homes/trinkle/RCS/ (visited 10.2011).


[16] Top500. http://www.top500.org (visited 09.2011).
[17] TWiki. http://www.twiki.org (visited 10.2011).
[18] Unified Fabric Manager. http://www.voltaire.com/Products/Unified_Fabric_Manager (visited 10.2011).
[19] Vendor lock-in. http://en.wikipedia.org/wiki/Vendor_lock-in (visited 10.2011).
[20] Voltaire. http://www.voltaire.com/ (visited 10.2011).
[21] InfiniBand architecture specification, Nov 2007.
[22] M. Franke. Managementtool for InfiniBand. Diploma thesis, Technical University Chemnitz, 2005.
[23] O. Pentakalos. An introduction to the InfiniBand architecture. http://www.oreillynet.com/pub/a/network/2002/02/04/windows.html (visited 09.2011).
[24] Top500. Performance share of interconnects in the Top500 list, Jun 2011. http://top500.org/charts/list/37/connfam (visited 06.2011).
[25] Y. H. Uwe Küster, Dr. Andreas Findling. Achieving peak performance with advanced fabric management: a case study with HLRS and NEC. 2010.
[26] C. Windeck. Was kommt nach PCI-X?, 1999. http://www.heise.de/newsticker/meldung/Was-kommt-nach-PCI-X-23854.html (visited 06.2011).


List of Figures

1.1.  Current performance share in supercomputing interconnects [24] . . 3
1.2.  Performance share over time [24] . . 3

3.1.  InfiniBand fabric with HCA, TCA, switches and routers . . 5
3.2.  Host Channel Adapter . . 6
3.3.  Target Channel Adapter . . 6
3.4.  Crossbar to connect 2x 4 ports in a non-blocking way . . 6
3.5.  Crossbar and IB logic encapsulated in an ASIC . . 6
3.6.  Switch assembled with one ASIC . . 7
3.7.  Blocked network . . 8
3.8.  CLOS network . . 8
3.9.  Strictly non-blocked network . . 9
3.10. Fat tree [1] . . 9
3.11. Fully non-blocking fat tree . . 10
3.12. Rearrangeable fully non-blocking network . . 10
3.13. Rearranged network . . 11
3.14. InfiniBand communication stack [23] . . 12
3.15. Switch containing multiple ASICs on multiple boards . . 14
3.16. Storage array with TCA [22] . . 20

4.1.  UFM dashboard . . 21
4.2.  Topology visualization . . 22
4.3.  Performance chart . . 22
4.4.  System encapsulation of IFS 6.0 . . 23

5.1.  System concept . . 24
5.2.  Conceptual utilization graph . . 25
5.3.  Utilization share over a period of time . . 26
5.4.  Traffic locality scheme . . 27
5.5.  Local traffic . . 27
5.6.  100% global traffic . . 28
5.7.  Mix of local and global traffic . . 28

6.1.  Screenshot of nagiosgraph . . 31
6.2.  Graph of a host's current load . . 31
6.3.  Conceptual implementation . . 33
6.4.  Database model . . 36
6.5.  RRD chart of system load . . 37
6.6.  Simple graph of a small network . . 38
6.7.  RRD chart 'Current Load' . . 39
6.8.  List of measurements within Foswiki . . 39
6.9.  Dashboard front page . . 40
6.10. Dashboard view on port utilization . . 41
6.11. Dashboard view on congestion visualization . . 42
6.12. Dashboard view on traffic locality . . 42
6.13. DB sizes as functions of time . . 43
6.14. System statistics . . 43
6.15. Amount of different DB locks . . 43

7.1.  qperf test design in the laboratory . . 46
7.2.  qperf test design within the customer's fabric . . 46
7.3.  LINPACK test within the lab . . 47
7.4.  LINPACK benchmark in the customer context . . 47
7.5.  InfiniBand laboratory testbed . . 48
7.6.  Testing environment . . 49
7.7.  Laboratory network . . 49
7.8.  Plain customer fabric . . 50
7.9.  Port performance during qperf . . 51
7.10. Port performance during qperf . . 52
7.11. Port performance during LINPACK . . 53
7.12. Port performance during LINPACK . . 53
7.13. Congestion graph during qperf . . 54
7.14. Congestion within qperf within the customer fabric . . 55
7.15. Congestion map during LINPACK . . 55
7.16. Congestion map during LINPACK . . 56
7.17. Traffic locality during qperf . . 57
7.18. Locality during LINPACK . . 58
7.19. Locality during qperf . . 58
7.20. Locality during LINPACK . . 59
7.21. Utilization distribution during 5 minutes of measurement between sw24 and maradonna . . 60
7.22. Utilization distribution during 5 minutes of measurement between sw24 and sw8 . . 60
7.23. Utilization distribution during 5 minutes of measurement between sw8 and platini . . 61

A.1.  Nagios services . . 72

List of Tables

3.1. InfiniBand connection types [21] . . 13
3.2. Methods of the SubnetManager [21] . . 17
3.3. Methods of the SubnetManagerAgent . . 17
6.1. Simple Graphviz example . . 32
8.1. Comparison of UFM, IFS and the developed system . . 63

