Impact of Higher Level Functional Units on High Performance Multi-Core Node Architectures Aravind Vasudevan‡ WARAN Research Foundation
[email protected]
Balaji Subramaniam‡ WARAN Research Foundation
[email protected]
Vidya Sangkar L‡ WARAN Research Foundation
[email protected]
Abstract As a result of the increasing requirements of computationally intensive applications, recognizing and exploiting the computational concurrency of an application becomes vital in the design process of a node. Performing these computations faster directly improves execution time, and hence overall performance. Thus, with the increase in complexity and problem size of such applications, there is a need for a change in the design of functional units. To address these issues, this paper presents a new paradigm involving Higher Level Functional Units (HLFUs) that can replace ALU-based units in a processing node, and examines its impact on performance. 1. Introduction The evolution of new architectural concepts is required to harness the power of the technology available in the current supercomputing era [2]. With the increase in the complexity of applications, there is a need for a change in the design of the functional units. In conventional multiprocessors, cores are replicated processors with a common cache shared through non-uniform access, as in TRIPS [3]. These designs have increased the raw processing power of the node. But with increasing application complexity, the number of basic operations handled by the cores, and hence by the processing nodes, increases. Analysis of the characteristics of the application therefore becomes a vital step in design, because different applications are computationally intensive with respect to different classes of algorithms. Thus, with the increase in the complexity and the problem size to be handled by the processing node for such applications, the node architecture has to incorporate higher level functional units. The classes of functional units for the node architecture can be found by analyzing a set of applications [4]. HLFUs such as matrix multiplication units, sorter units and max/min finders can thus replace a large number of ALU-based resources.
Effective utilization of such units can deliver performance that cannot be achieved by ALU-based nodes. The rest of this paper is organized as follows: Section 2 describes the advantages of using Higher Level Functional Units over conventional ALU-based units in detail. Section 3 presents simulation results that demonstrate the advantages of the HLFU in terms of instruction counts and memory fetches with respect to the problem sizes of different classes of algorithms. ‡
Undergraduate Research Trainee at WARAN Research Foundation, Chennai, India
2. Higher Level Functional Units As discussed earlier, Higher Level Functional Units such as matrix multiplication units, graph partitioning units, graph traversal units and sorter units can deliver higher performance. The types of functional units used are determined by the characteristics of the application [4]. The corresponding number of instruction fetches and memory accesses is lower than for conventional ALU-based processors. Hence, the usage of higher level functional units has a significant effect on execution in terms of the mapping, computation and communication complexity of application execution at the cluster level. One might expect that scaling functional units from ALUs to higher level functional units increases power consumption, but adopting a low power design for the functional units results in power consumption comparable with that of ALU-based units; this is discussed in detail in [5]. Not all algorithms can be realized as higher level functional units. This is evident in the case of Singular Value Decomposition (SVD) [6], where the algorithm is not data partitionable; because of this, SVD cannot be fabricated as a single unit. To overcome this, the required units can be placed closer together, and the scheduler can delegate the algorithm to these functional units. Although this is not exactly equivalent to fabricating a dedicated SVD unit, the mapping reduces the inter-unit communication complexity.
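The claim that an HLFU needs fewer instruction fetches and memory accesses can be illustrated with a simple closed-form cost model for matrix multiplication. The model below is a sketch under stated assumptions (two operand loads per multiplication, one store per output element, a single-instruction matrix-multiply unit); it is not taken from the cited designs.

```python
def alu_cost(n):
    """Cost of an n x n matrix multiply on a scalar ALU under a
    simplified model: each of the n^2 outputs needs n multiplications
    and n - 1 additions, two operand loads per multiplication and one
    store per output. Returns (instructions, memory accesses)."""
    arith = n * n * (2 * n - 1)        # n^3 mults + n^2 (n - 1) adds
    mem = 2 * n**3 + n * n             # operand loads + result stores
    return arith + mem, mem

def hlfu_cost(n):
    """Hypothetical matrix-multiplication HLFU: a single instruction,
    with memory traffic only for the two inputs and one output."""
    return 1, 3 * n * n

for n in (2, 4, 8):
    print(f"n={n}: ALU {alu_cost(n)} vs HLFU {hlfu_cost(n)}")
```

Under this model the gap widens cubically with the matrix dimension, which is consistent with the growing separation between the ALU and HLFU curves in the graphs of section 3.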
Figure 1: Higher level instruction set
Figure 2: Instruction set comparison
2.1 Impact on Instruction Set Architecture (ISA): The ISA must fulfill the prerequisites for extracting the performance of the underlying node by exploiting hardware-level parallelism. A new ISA [1, 4] is designed to govern the proposed higher-level functional units. In the higher-level ISA, a single instruction corresponds to a number of ALU instructions (fig. 2), which are executed after resolving their dependencies. The higher-level ISA has a direct impact on the mapping and communication complexity within a node of a cluster. The detailed ISA for the HLFU is given in figure 1.
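The one-to-many correspondence between a higher-level instruction and ALU instructions can be sketched as an expansion table. The mnemonics below are illustrative assumptions, not taken from the ISA of figure 1; the SORT8 count assumes a Batcher odd-even merge network for 8 inputs, which uses 19 compare-exchange operations.

```python
# Hypothetical expansion table: the scalar ALU instruction sequence
# that one higher-level instruction subsumes.
HLFU_EXPANSION = {
    "MATMUL2x2": ["MUL"] * 8 + ["ADD"] * 4,   # 8 mults + 4 adds
    "SORT8":     ["CMPSWAP"] * 19,            # Batcher network, 8 inputs
    "MAX8":      ["CMP"] * 7,                 # pairwise reduction, 8 values
}

for hl, seq in HLFU_EXPANSION.items():
    print(f"{hl}: 1 HLFU instruction replaces {len(seq)} ALU instructions")
```

Each entry mirrors the relationship depicted in fig. 2: the fetch/decode overhead is paid once per higher-level instruction instead of once per ALU operation.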
Graphs 1–3: Data accessed (in KB) vs. scale factor for LU Decomposition, Convex Hull and the Kernighan-Lin algorithm (ALU vs. HLFU).
Graphs 4–6: Instruction count vs. scale factor for LU Decomposition, Convex Hull and the Kernighan-Lin algorithm (ALU vs. HLFU).
3. Simulation Analysis: To compare the usage of the ALU and the HLFU, a simulation methodology has been adopted which makes use of the IDA disassembler [7]. The first step is to analyze the code fragments obtained from the disassembled output. The disassembled output consists of flow graphs and code sections which make up the original code. The flow graphs depict the control flow of the program, and each code block represents a functional block of code. The next step is to identify a pattern from this output and assign a sequence of activities to each instruction. For example, a single add instruction boils down into the following sub-steps: load the contents of the registers into temporary registers, issue the “add” select signal to the ALU, store the result in the temporary output register, and finally move it to the destination register. This is not the case with the HLFUs. Since they are hard-wired, there is no need to store intermediate data: the input data is fed directly to these functional units and routed through the unit. This removes all the intermediate loads and stores. The only memory accesses for the HLFUs are the inputs and the outputs, which vary with the type of the unit. This reduction in memory access is shown in graphs 1-3: for a given algorithm, the reduction in memory access is plotted against increasing problem size. As discussed in the previous sections, using higher level functional units abstracts the instruction set to a higher level. Although each instruction becomes more complex, the number of instructions reduces. This is shown in graphs 4-6. For example, in a conventional ALU architecture, performing a single 2×2 matrix multiplication effectively boils down to 8 multiplications and 4 additions.
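The tallying step described above can be sketched by expanding the 2×2 matrix multiplication into a scalar ALU instruction stream and counting it the way the simulator would. The mnemonics and the cost model are illustrative assumptions, not real IDA output.

```python
def alu_stream_2x2():
    """Scalar ALU instruction stream for a 2x2 matrix multiply:
    each output element c[i][j] needs 2 multiplications (each with two
    operand loads), 1 addition to combine the partials, and 1 store."""
    stream = []
    for i in range(2):
        for j in range(2):
            for k in range(2):
                stream += ["load", "load", "mul"]   # a[i][k], b[k][j]
            stream += ["add", "store"]              # combine, write c[i][j]
    return stream

MEM_OPS = {"load", "store"}

def tally(stream):
    """Instruction count and memory accesses, as the simulator records them."""
    return len(stream), sum(1 for m in stream if m in MEM_OPS)

stream = alu_stream_2x2()
print("mults:", stream.count("mul"), "adds:", stream.count("add"))  # 8 and 4
print("(instructions, memory accesses):", tally(stream))
```

The arithmetic work matches the count in the text (8 multiplications, 4 additions), and the loads and stores surrounding it are exactly the traffic a hard-wired unit would eliminate.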
But if we have a unit that performs this multiplication, a single instruction suffices. This is the reason for the smaller number of instructions in the case of HLFUs. Choosing the types of HLFUs is also important. The choice is heavily influenced by the characteristics of the application to be executed. For example, if a matrix-centric application has to be run, it would be meaningful to use matrix-related units rather than graph-theoretic units. Likewise, a unit that can perform n additions together makes more sense than n separate adder units. 4. Conclusion With the increasing need for computational power, extraction of parallelism and pipelining, we move towards the use of higher level functional units. This paper described the advantages of using HLFUs over the conventional ALU-based architecture. Simulation results were then presented that contrast the ALU-based architecture with HLFUs. From these results it is evident that HLFUs are substantially more advantageous than a conventional ALU-based architecture; the feasibility of such units is discussed in [1, 5].
5. References
[1] N. Venkateswaran et al., “Towards Node Architecture Designs for Realizing High Productivity Supercomputers”, presented at the pre-conference of the 23rd International Supercomputing Conference (ISC ’08), Dresden, Germany, 2008.
[2] Peter Hildebrand (Chair), Warren Wiscombe (Science Lead) et al., “Earth Science Vision 2030: Predictive Pathways for a Sustainable Future”, NASA Working Group Report.
[3] Doug Burger et al., “Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture”, in ACM SIGARCH Computer Architecture News, Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA ’03).
[4] Shyamsundar Gopalakrishnan, “High Performance Node Architecture for Supercomputing Clusters: A Generalized Design Methodology and Development of CAD Environment”, thesis proposal submitted to WARAN Research Foundation.
[5] Ravindhiran Mukundrajan, “Power Estimation of Higher Level Functional Units in Heterogeneous Multi-Core Processors”, submitted to the HiPC ’08 Student Research Symposium.
[6] Golub, Gene H. & Kahan, William (1965), “Calculating the singular values and pseudo-inverse of a matrix”, Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis 2(2): 205–224.
[7] IDA Disassembler, http://www.datarescue.com/idabase/