Run-time Adaptive Resources Allocation and Balancing on Nanoprocessors Arrays

Danilo Pani
DIEE - University of Cagliari, 09123 Cagliari, Italy
[email protected]

Giuseppe Passino
DIEE - University of Cagliari, 09123 Cagliari, Italy

Luigi Raffo
DIEE - University of Cagliari, INFM - Section of Cagliari
[email protected]

Abstract
Modern processor architectures try to exploit the different kinds of parallelism that can be found even in general-purpose applications. In this paper we present a new architecture based on an array of nanoprocessors that supports both Thread and Instruction Level Parallelism in parallel and cooperatively. Such an architecture does not explicitly require any particular programming technique, since it has been developed to deal with standard sequential programs. Preliminary results on a model of the architecture show the feasibility of the proposed approach.

1. Introduction
Multithreading is a well-known approach that allows the execution of complex programs made up of many separate threads in time-sharing on the same processor. In such systems the same processor serves each thread at a different moment, so that the time needed to serve N threads is always the sum of the single thread times. Many approaches have been proposed to address this issue, exploiting multiple processing parts working in parallel. These approaches speed up the execution by allowing multiple sub-parts of the elaboration to be processed in parallel on different hardware resources. Such architectures exploit, in general, Thread Level Parallelism (TLP) or Instruction Level Parallelism (ILP). In the latter case, the involved resources are established at compilation time by the compiler. In the former case, the effort invested in thread execution is fixed, because the compiler cannot perform any load analysis: the number and kind of threads running on a processor depend on the user's choices and are not predictable. In this work we present a novel processor architecture able to support both TLP and ILP, exploiting some important new features such as decentralized control and cooperative behaviors.

The architecture is made up of a fabric of locally interconnected nanoprocessors (nPs) and of other modules at a higher hierarchical level than the fabric. The hierarchical structure involves centralized and decentralized control strategies at different levels: centralized approaches (related to the top hierarchical modules) are responsible for resource allocation and adaptive load balancing, whereas decentralized ones (related to the nPs fabric) mask low-level characteristics, allowing a transparent usage of the processing elements. A key part of the proposed architecture has been behaviorally simulated to define some basic strategies in resource management. The obtained results show the feasibility of the approach and deserve further investigation. The paper is organized as follows: in Sec. 2 we present a brief state of the art; in Sec. 3 we present the general guidelines and system requirements we have followed to design the architecture; in Sec. 4 an architecture that respects the previously highlighted guidelines is presented and analyzed; Sec. 5 discusses the centralized and decentralized control strategies; Sec. 6 shows the results of some simulations on a case-study. Conclusions are presented in Sec. 7.

2. State of the art
Modern microprocessors use a large number of microarchitectural mechanisms to increase the concurrency in instruction processing and to overcome the classical sequential execution model. The challenge is to execute in parallel instructions written with sequential semantics. To achieve this goal, there are two main, orthogonal possibilities: ILP and TLP. Many architectures mix both of them. Moreover, within the same type of parallelism many different solutions can be used at the same time.

2.1. ILP-oriented architectures
ILP-oriented architectures try to exploit instruction overlap within an instruction stream [6]. There are many techniques, not mutually exclusive, to achieve this goal: many of them have been developed to be used together with older ones.


2.1.1. Superscalar architectures
A superscalar architecture overcomes the limits of traditional pipelined architectures by allowing different neighboring instructions to be executed entirely at the same time. Superscalar and pipelined approaches are orthogonal, but they are often used together. These architectures require a sufficient number of functional units, to support the execution of several instructions at the same time, and a parallel instruction decoder, to detect instruction dependencies during decoding.

2.1.2. VLIW architectures
Very Long Instruction Word (VLIW) architectures aim at reducing the high-complexity scheduling logic of superscalar ones: instruction scheduling is performed statically. The compiler provides groups of independent instructions ready for parallel execution on the functional units provided by the architecture. Obviously, static software scheduling and dependency resolution make a VLIW compiler more complex than a superscalar one. Additionally, the compiler needs in-depth knowledge of the architectural structure, and minor changes to it make old code incompatible.

2.1.3. Out-Of-Order architectures
Out-of-order execution logic allows the parallel execution of instructions even if they are not neighbors in the execution flow, hence representing a generalization of superscalar logic. Of course, out-of-order execution may violate the code semantics, and this must be prevented by additional hardware structures able to ensure that the visible state of the job execution is the same as in the sequential execution.

2.1.4. Speculative (control) architectures
If we restrict parallelism to the overlap of instructions whose execution is certain, there is no way to go beyond the performance of the out-of-order strategy. The speculative approach overcomes this limit by speculating over conditional branches, allowing instructions to be executed even if their execution is not certain. This type of speculation is often referred to as control speculation, emphasizing that it is performed only on control instructions like branch tests. Two major architectural requirements arise from this approach: the ability to save the processor state and restore it in case of a wrong prediction, and a good branch prediction logic that, by correctly forecasting the largest part of branch directions, can effectively improve processor performance.

2.1.5. Speculative (data) architectures
All the architectures seen so far cannot solve the Read After Write (RAW) data dependence, which occurs when trying to execute an instruction that depends on the result of another one that has not been completed yet. The only way to solve this type of dependence is the data speculation approach: speculation is performed on data instead of only on control. The potential system speed-up due to this approach is remarkable (many studies have confirmed its validity [5, 8]), but the data prediction logic needed is more complex than the control prediction one, and at this time there are no predictors accurate enough to make this approach feasible.

2.1.6. Instruction-Reuse based architectures
Another technique for achieving high performance in instruction processing is instruction reuse: this technique does not involve speculation, but acts like an instruction caching. Many instructions, in fact, produce the same result repeatedly; by buffering the inputs and outputs of such instructions, their results can be obtained through lookup tables instead of involving the ALU.

2.2. TLP-oriented architectures
TLP-oriented architectures can handle multiple instruction streams at a time, and often these streams are processed exploiting ILP, so that such architectures can mix ILP and TLP support. The key property of TLP-oriented architectures is the presence of multiple program sequencers. There are two main types of architectural structures: the first approach is to take a classical ILP-oriented architecture, like a superscalar one, and add multiple sequencers in the front-end logic of the machine, replicating the instruction fetch/decode unit and so on; the second one is to create a cluster of processing units, one for each thread. The first option is commonly referred to as a simultaneous multithreaded (SMT) processor, and it is an extension of superscalar (out-of-order, speculative) processors, where the functional units can be used both for ILP and TLP support. The second option, referred to as a chip multiprocessor (CMP), is a single-chip parallel multiprocessor with processing units sharing fewer resources than the functional units in an SMT processor. Multiple sequencers can be used to improve processor throughput or single program performance. The first way is the easiest one: each sequencer is associated to an independently running program, or thread. Since there are no inter-thread dependencies, there is no need for synchronization, but single thread performance is not improved. The second way involves a parallelization of single threads: by splitting a thread into a number of dependent instruction streams called speculative threads, single program performance can be improved, but there can be many dependencies among the different speculative threads.

2.2.1. Multiscalar architectures
The multiscalar architecture [9, 1] is an example of the use of multiple sequencers to improve single thread performance. A thread is split into multiple speculative threads, called tasks, which are dispatched and executed in parallel. Inter-task dependencies at the register level are solved using a speculative approach, without checking for RAW dependencies, and the sequential task ordering is kept by the architecture. Single tasks are executed exploiting ILP, and a number of predictors are used to achieve good speculation performance. This approach is still under study, but two processors were produced with a very similar approach: the Sun MAJC [10] and the NEC Merlot [2].

2.2.2. Helper Threads-based architectures
Another way to use multiple sequencers is to run helper threads, threads that can help the main thread complete faster [12]. Helper threads are written independently of the specific thread they are going to support, so they cannot substitute the main thread in its normal operations. They are instead used to support some operations normally executed in hardware, like complex prediction algorithms (to support other types of parallelism) or memory pre-fetching.

2.3. Other approaches to speed-up computations
Another way to support parallelism is to have a complex architecture, composed of a large number of homogeneous or heterogeneous elementary units, where the units and the interconnections among them can be configured to perform specific jobs. These architectures are referred to as reconfigurable architectures [3], and are based on the configuration of a hardware substrate to define a specific datapath or the interconnections among the processing elements of a cluster. The configuration must be downloaded into the architecture: a given algorithm must first be written in parallel form according to the hardware architecture on which it will run. Given a certain algorithm, it is typically not easy to define the configuration: sometimes this can be done by coding the algorithm in a specific language, even if some projects include a compiler to simplify the mapping process [11], [4]. Since changing the configuration is usually not quick, these architectures are effectively used for streaming processing.

3. Guidelines for the architectural definition
The starting point for this work was to define the outline of an innovative general-purpose processor architecture with four major features: scalability at the architectural level, scalability at the performance level, run-time adaptive resource allocation and balancing, and ILP and TLP support.

3.1. Architectural scalability
Architectural scalability is the property of an architecture to vary in size without affecting system functionality and with minor modifications to the non-scaled hardware. Chip multiprocessor systems (see 2.2) exploit TLP by allocating different threads to different processors.

In this manner threads actually run in parallel, and a consistent speed-up can be achieved. Such systems are coarse-grained, since the complexity of a single processor is usually very high, and hence it is difficult to cheaply arrange a large number of them. A finer scalability should take into account processing units with reduced size and functionality. In this case a large number of such units, which we call nanoprocessors (nPs), could fit into the same chip, obtaining a significant speed-up in communications compared to off-chip clusters. This architectural approach can exploit technology scaling in the best way, allowing wider implementations to follow CMOS technology improvements.

3.2. Performance scalability beyond TLP: adaptive resource allocation and balancing
Performance scalability surely depends on the previously described architectural choice, since a large number of nPs implies a large number of threads that can be allocated and run in parallel. Usually, in multiprocessor systems, once a thread is allocated its execution time is fixed by the thread complexity. However, threads can be subdivided into (possibly) dependent tasks. Exploiting the fine-grained approach introduced by the nPs, it is conceivable that a set of nPs could be assigned to a single thread, so that task-level parallelism is also supported (see 2.2.1). In this manner, thread execution performance can be modulated at run-time to follow the user's requirements. Furthermore, a central allocation unit can decide, at run-time, to reallocate a thread to a reduced number of nPs if other threads with harder constraints need more resources than are actually available. For example, we would allow threads related to real-time constrained applications (e.g. audio and video coders and decoders) to use as many resources as they need, even exploiting the computational strength of resources previously assigned to other threads with looser timing constraints. All these aspects lead to a flexible computing platform composed of a fabric of nPs and a centralized unit for allocation and monitoring purposes. To keep architectural and performance scalability as simple as possible, the centralized unit must not know the hardware details.

3.3. Cooperative processing
In our architecture we intend to support both TLP and ILP at the same time, using the same resource replication strategies. To support ILP in a flexible way, it is possible to allow unused nPs to cooperate with overloaded ones, without external intervention, to reduce the overall workload. Cooperation could also involve task-level parallelism, since it is easy to extend this mechanism to tasks (see 5.2). Threads can be allocated to "chains" of closely interconnected nPs, so that modulating the number of processors in the chain modulates the execution performance.


Figure 1. System architecture overview.

Figure 2. The nP structure.

Unchained nPs could work for ILP support together with the chained nPs assigned to threads. Compared to other state-of-the-art approaches, in our cooperative solution the possibility of ILP exploitation is evaluated at run-time and locally (in a decentralized way). Hence, this approach is flexible and light in terms of compilation, with no need for hardware ILP-exploitation structures inside the nP. Decentralized approaches require direct interactions among elements. Global interactions require knowledge of the hardware fabric map, since each element must know the way to reach every other one. This fact limits scalability, and routing strategies would also become much more complex. For this reason, in decentralized systems it is preferable to adopt local interactions and only local knowledge, since this can dramatically reduce the overhead even if it could lead to suboptimal solutions. The interconnection network can consist of a regular mesh where every node communicates only with its 4 direct neighbors. This choice allows a large amount of simultaneous communications within the fabric, with a light control. This network approach is also commonly used in Grid Processors, like TRIPS [7], but in our model the network nodes are independent nPs rather than simple execution units, and the static instruction allocation mechanism of Grid Processors is replaced by a dynamic, fully decentralized one.

4. Architectural overview

The architecture, shown in Fig. 1, takes advantage of the high density of modern chips by means of a scalable array of nP units, and realizes the distributed and scalable parallel structure introduced in Sec. 3. The main architectural components are:

• the nPs, which constitute a parallel computation fabric;
• the system caches, which provide a parallel high-bandwidth memory system;
• a Scheduler, which performs the centralized jobs in the architecture and maintains the architectural state;
• some Dispatchers ("DS" in Fig. 1), which are the interfaces between the nPs fabric and the rest of the system.

The nPs fabric is the architectural core: a large number of independent processing units that can be dynamically allocated to handle a job, exploiting both TLP and ILP. This parallel approach is similar to a CMP one, because there are many independent processors, but it differs in that the nPs are very simple and communications between them are easy: they are not built to handle independent threads, but rather speculative threads, or tasks. More units can be used to exploit ILP as well. The processing array can be functionally split into three layers:

• a computational layer, constituted by the nanoprocessing modules;
• a memorization layer, constituted by the local distributed memory system used by the nPs to handle tasks;
• a communication layer, given by the network switches that link up all the different nPs.

The border elements (Dispatchers and Caches) and the Scheduler make the architecture similar to an SMT system, because a common big functional unit (the nP fabric) is shared among several front-end structures (the Dispatchers) to handle multiple program streams. The maximum number of independent threads that can be allocated is equal to the number of available dispatchers.

4.1. The nanoprocessor

The nP is the elementary elaboration unit of the system. Every nP is composed of a simple processing module, a local memory, and an interconnection module that allows communication with the other units. The nP structure is shown in Fig. 2.
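Purely as a structural illustration of this organization, the following C++ sketch models one nP with its three layers; the type names, the memory size and the link enumeration are our own assumptions, not details taken from the architecture.

#include <array>
#include <cstdint>

// Illustrative structural model of one nP (names and sizes are assumptions).
struct ProcessingModule { /* fetch/decode logic, register bank, ALU */ };
struct LocalMemory      { std::array<std::uint16_t, 128> words; };  // enough for about one task
struct InterconnectionModule {
    // four local links: a node talks only to its direct neighbors in the mesh
    enum class Link { North, East, South, West };
};

struct Nanoprocessor {
    ProcessingModule      proc;   // computational layer
    LocalMemory           mem;    // memorization layer
    InterconnectionModule net;    // communication layer
};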

4.1.1. Processing Module
The processing module is a very small, simple RISC processor, providing a reduced instruction set and an ALU, so that a large number of nPs can be placed into the computational fabric. Every thread can be handled by a single processing module. Every processing module contains a register bank, an instruction fetching/decoding unit and a simple ALU: no ILP is supported at this level.


The register bank contains 16 registers, each 16 bits wide. This number could be reduced if future investigations show that the area required by those registers could be better spent on a larger ALU (and consequently a larger instruction set). Instructions are fixed-size and 16 bits wide, with a 4-bit opcode. The instructions are register-based and there is no accumulator in the processing unit. Standard instructions are provided: LOAD/STORE instructions, ADD/SUB/MUL arithmetic instructions, and branch instructions. Additionally, there is a small number of coordination and communication instructions, to support cooperation and TLP. The latency of each instruction is typically one clock cycle, but operations that take more than one cycle can aid thread-level parallelism by relieving the network (fewer instructions executed per time unit means fewer instructions sent into the network). Additionally, they help in obtaining a lighter ALU: in our model, the multiply instruction requires 8 clock cycles.

4.1.2. Local Memory
The local memory is a small storage area used by the nP to handle its instructions without being heavily dependent on the network. The local memory should be able to contain at least one task to be performed by the processing module (about 100 words). The memory also serves the interconnection module, which can store there data coming from other units. If there is concurrency in write operations, the processing module has the higher priority. A buffer can be placed between the interconnection module and the local memory to prevent network deadlocks due to memory write operations made by the processing module.

4.1.3. Interconnection Module
The interconnection module acts as a network switch, routing information packets along the network. The network topology provides only local connections, and the network is very regular, to achieve the simplest routing logic. For each thread, nPs are allocated in a chain to perform task-level parallelism, so that every nP has at most two other boundary nPs allocated to the same thread; if possible, other boundary nPs are locally recruited to perform ILP. These aspects are further analyzed in Sec. 5. Every nP in the chain is identified by its position relative to the first link, and routing can be done by counting the number of links that must be traversed to reach the receiving unit. Since little routing information is required, a simple routing logic is sufficient, consisting of a subtractor, to perform the countdown on the number of remaining steps, and an OR module, to check whether the remaining steps have reached zero. Since every allocated nP has at most two active communication directions, if a packet comes from one direction, the opposite one in the chain should be taken for routing. Network links are half-duplex. The priority can be given to packets going forward or backward in the allocation array: the best

approach is to give the priority to packets going backward, because this choice gives priority to data sent by the nPs over the instructions coming from outside the nPs fabric.
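A minimal C++ model of this hop-count routing rule may help; the Packet fields and function names below are illustrative choices of ours, not the actual hardware interface.

// Hypothetical packet header: only a hop counter is needed for chain routing.
struct Packet {
    int hops_left;   // links still to be traversed to reach the destination nP
    // ... payload ...
};

enum class Port { Forward, Backward };  // the two active directions of a chained nP

// Route a packet arriving on 'in': deliver it locally when the countdown
// reaches zero, otherwise decrement and forward it on the opposite port.
Port route(Packet& p, Port in, bool& deliver_here) {
    deliver_here = (p.hops_left == 0);          // the "OR module": remaining steps all zero?
    if (deliver_here) return in;                // no forwarding needed
    --p.hops_left;                              // the subtractor: one-step countdown
    return (in == Port::Forward) ? Port::Backward : Port::Forward;
}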

4.2. Dispatcher
The dispatcher is the interface between the nPs fabric and the rest of the system, i.e. the memory system, the I/O system, and the centralized scheduling unit. The memory system is a parallel cache system, to ensure the high bandwidth required to feed the processing array with instructions and data. The interface between dispatcher and fabric is very simple, because the dispatcher acts as a switch like all the other interconnection modules in the fabric, using the same routing method. Dispatchers also communicate with the scheduler to obtain general job information such as allocation directives. This communication is bidirectional, because the dispatcher can give the scheduler some feedback about the allocation results or the state of the computation. Special packets due to scheduler activity, like allocation and deallocation packets, are sent into the network by the dispatcher. The allocation, given the number of units to reserve for a job, is performed autonomously by the units, using only local information. The main dispatcher task is dependence translation. The code read from the cache is a thread subdivided into various tasks that are sent to the allocated nPs; the inter-task dependencies are found by the compiler and the tasks are statically linked, so that when a task contains an inter-task dependence, the nP which handles it is blocked until an unblock command is sent by the nP which solves the dependence. The links between tasks obviously do not depend on the number of allocated units, which is given at run-time: once the number of units has been chosen by the scheduler for a thread, the dispatcher translates the inter-task dependencies into inter-nP links, according to the number of allocated units and to the task distribution algorithm. Finally, the dispatcher handles the task commit: while a task is being executed in an nP, the system does not see any change until the task is committed by the dispatcher, confirming the state change in the system caused by the task. This method also allows speculative task execution.
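As an illustration of the dependence-translation step, the C++ sketch below maps compiler-level inter-task dependencies onto inter-nP links; the round-robin distribution policy and all identifiers are hypothetical, since a specific distribution algorithm is not fixed here.

#include <cstdlib>
#include <vector>

// Hypothetical compiler-level dependence: task 'consumer' needs a value
// produced by task 'producer'; both are indices into the thread's task list.
struct TaskDep { int producer; int consumer; };

// One possible distribution policy (round-robin over the allocated chain).
int task_to_np(int task_idx, int allocated_nps) { return task_idx % allocated_nps; }

// Translate inter-task dependencies into inter-nP links: each link records how
// many chain hops separate the nP that must wait from the nP that will unblock it.
struct NpLink { int waiting_np; int unblocking_np; int hops; };

std::vector<NpLink> translate(const std::vector<TaskDep>& deps, int allocated_nps) {
    std::vector<NpLink> links;
    for (const TaskDep& d : deps) {
        int src = task_to_np(d.producer, allocated_nps);
        int dst = task_to_np(d.consumer, allocated_nps);
        if (src != dst)                              // same nP: ordering is implicit
            links.push_back({dst, src, std::abs(dst - src)});
    }
    return links;
}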

4.3. Scheduler
The scheduler represents all the centralized logic in the architecture. This module handles all the jobs that cannot be performed by the nPs: resource allocation and the maintenance of the global system state information. The scheduler should have basic information about every job running in the system in order to perform resource allocation: job priority, real-time requirements, and information about the achievable parallelism level, such as the maximum number of nPs to allocate to a job.


If a job has real-time requirements, it needs a minimum amount of resources; on the other hand, every job has information about the maximum number of resources that can improve its performance. For real-time purposes, support for anytime algorithms can be provided, for example in some elementary nP instructions or in some groups of instructions, like a task. All this information is supplied by the compiler, but much of it requires special directives to be provided by the programmer. Once the scheduler has got the information about a job, it can send the dispatcher requests indicating how many units must be allocated to every job. If there are allocation problems, the dispatcher informs the scheduler. Once the allocation is done, the scheduler can receive feedback about the job progress from the dispatcher. If there is negative feedback, or if there is another job to be executed in the system, a resource reallocation can be performed: this decision is taken by the scheduler, and the dispatching units are informed.
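Purely as an illustration of the kind of decision involved, the following C++ sketch shows one possible rebalancing pass over the job table; the NanoJob fields and the two-pass policy are assumptions made only for this example, not the scheduler's actual algorithm.

#include <algorithm>
#include <vector>

// Hypothetical descriptor of a running job on the nP fabric.
struct NanoJob {
    int  id;
    int  min_nps;        // minimum nPs needed (e.g. to meet a real-time deadline)
    int  max_useful_nps; // beyond this, extra nPs give no speed-up
    int  allocated_nps;  // current allocation
    bool real_time;      // has hard timing constraints
};

// Recompute the allocation from scratch: real-time jobs get their minima first,
// then the leftover nPs are spread up to each job's useful maximum.
void rebalance(std::vector<NanoJob>& jobs, int total_nps) {
    std::sort(jobs.begin(), jobs.end(),
              [](const NanoJob& a, const NanoJob& b) { return a.real_time > b.real_time; });
    int free_nps = total_nps;
    for (NanoJob& j : jobs) {                       // first pass: guarantee the minima
        j.allocated_nps = std::min(j.min_nps, free_nps);
        free_nps -= j.allocated_nps;
    }
    for (NanoJob& j : jobs) {                       // second pass: distribute what is left
        int extra = std::max(0, std::min(free_nps, j.max_useful_nps - j.allocated_nps));
        j.allocated_nps += extra;
        free_nps -= extra;
    }
}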

5. Centralized and decentralized control strategies
Centralized approaches have traditionally been considered the best way to accomplish control tasks. Decentralization can be an alternative way to deal with control and processing. Nevertheless, even in decentralized systems there are some tasks that may need a centralized approach, because they are related to a global "organization", in its widest meaning. In our work we have joined centralized and decentralized approaches at different levels. The nPs fabric has some degrees of freedom in resource allocation and recruitment, and in the exploitation of ILP. At the same time, our system works with standard sequential code, without explicitly parallel coding techniques, so a centralized scheduler must be present in the system.

5.1. Distributed recruitment
When the scheduler allocates a thread, it tries to involve in that thread the best number of nPs according to the compiler indications (off-line information) and to the actual system status (run-time information). The scheduler knows whether there are available dispatchers, and it asks an unloaded one to allocate the optimal number of nPs. That dispatcher informs the first border-line nP by sending an allocation packet to it: at this point the nPs try to create a chain, where every link knows only its two direct neighbors. If an nP is not able to find another unallocated nP in the same line, it tries to change the search direction. The priority search direction depends on the incoming request direction, hence allowing the generation of folded allocation patterns like the one shown in Fig. 3.

Figure 3. TLP and ILP onto the fabric.

If it is not possible to add further links to the chain, the last allocated nP informs the dispatcher with an error packet. The dispatcher then informs the scheduler, which decides either to allocate that thread by asking another dispatcher, or to reduce the exploited parallelism, leaving the thread assigned to the number of nPs available in the chain already instantiated. Since the scheduler knows the number of nPs allocated to a thread but does not know the actual map of the allocated resources, its complexity is reduced and scalability is preserved, even if fragmentation is possible. This is a direct consequence of the adopted distributed allocation scheme, which masks hardware details. At the same time, distributed recruitment allows, if necessary, different solutions to be tried in parallel without overhead for the scheduler.
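The following toy C++ fragment sketches the local recruitment rule as we understand it (each nP sees only its four neighbors); the direction ordering and all names are illustrative assumptions, and the real packet handling is not specified here.

#include <array>
#include <optional>

enum class Dir { East, West, North, South };

Dir opposite(Dir d) {
    switch (d) {
        case Dir::East:  return Dir::West;
        case Dir::West:  return Dir::East;
        case Dir::North: return Dir::South;
        default:         return Dir::North;
    }
}

// Preferred search order: keep growing along the same line first, then turn.
// Making the order depend on the incoming direction is what produces the
// folded allocation patterns of Fig. 3.
std::array<Dir, 4> search_order(Dir incoming) {
    Dir ahead = opposite(incoming);
    if (ahead == Dir::East || ahead == Dir::West)
        return {ahead, Dir::North, Dir::South, incoming};
    return {ahead, Dir::East, Dir::West, incoming};
}

// One recruitment step: pick the first free neighbor in the preferred order,
// or report failure so an error packet can travel back to the dispatcher.
template <class IsFreeFn>
std::optional<Dir> recruit_next(Dir incoming, IsFreeFn is_free) {
    for (Dir d : search_order(incoming))
        if (is_free(d)) return d;       // extend the chain towards this neighbor
    return std::nullopt;                // chain cannot grow: notify the dispatcher
}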

5.2. Run-time unscheduled ILP support
As stated in 3.3, ILP generally requires either a complex hardware structure or a complex compiler. This is because ILP is normally supported inside a single processor. Instead, in our architecture we exploit decentralized ILP support: this technique is flexible, even if it cannot reach the performance of specifically designed ILP-oriented architectures. Since our processor consists of a fabric of nPs, ILP can be supported not by the single nP but by means of interactions among the nPs. The typical situation consists of various threads allocated to different nP chains: every chain is responsible for a thread. If there are boundary nPs (with respect to a chain) not involved in any thread execution, they can be used for ILP. Figure 3 shows this situation limited to two allocated threads. If nPs 2 and 4 of chain A, and nP 2 of chain B, want to exploit ILP, they try to search for not-in-chain adjacent helper nPs to which some instructions in their flow can be assigned. Obviously it is not guaranteed that this mechanism can actually be exploited in every situation, since it depends on the physical amount of allocated resources and also on their displacement in the nPs fabric. ILP support is based on the sharing of register information, so it involves an overhead due to the synchronization of the involved nPs. This mechanism can easily be extended to tasks. In fact, the allocation of a thread implies that a chain has to deal with all the tasks included in the thread, and hence each nP could be engaged in the execution of more than one task. Exploiting cooperation, tasks could migrate towards other unloaded not-in-chain nPs. At the moment, cooperation for task-level parallelism has not been implemented.
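A possible shape of the local offloading decision is sketched below in C++; the bitmask register representation and the cost comparison are our own simplifications, introduced only to make the register-sharing overhead trade-off explicit.

#include <cstdint>

// Illustrative instruction descriptor (fields are assumptions for this sketch).
struct Instruction {
    std::uint16_t reads;   // bitmask of source registers (16 registers -> 16 bits)
    std::uint16_t writes;  // bitmask of destination registers
    int           cycles;  // estimated execution latency
};

// Two instructions can run in parallel on a helper nP only if they touch
// disjoint registers (no RAW/WAR/WAW dependence between them).
bool independent(const Instruction& a, const Instruction& b) {
    return ((a.writes & (b.reads | b.writes)) == 0) &&
           ((b.writes & a.reads) == 0);
}

// Offload 'candidate' to a free neighbor only if the register transfer and the
// final synchronization cost less than the cycles saved by running it remotely.
bool worth_offloading(const Instruction& current, const Instruction& candidate,
                      bool helper_free, int sync_overhead_cycles) {
    return helper_free &&
           independent(current, candidate) &&
           candidate.cycles > sync_overhead_cycles;
}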

6. Performances analysis on a case-study
A system composed of the nPs fabric, where the roles of dispatchers and scheduler have been only functionally described, has been simulated. The simulations perform TLP over the allocated nPs, and the tasks must be hand-written as they would be after the dispatcher operation, since a compilation tool is not available at the moment. The main simulation subject is the network system performance: the network system is a key item in the architecture, because its speed can limit system performance. Since we are interested in network communication, we have chosen a simple algorithm that can be split into tasks without inter-task dependencies: this is not a common situation, but it helps in evaluating the task transmission overhead, because it removes all the other causes of performance reduction. The algorithm taken into account is a simple linear space transform, like the one shown in Fig. 4: we split it into tasks composed of a single outer loop step. The task code is rather simple, and is substantially a multiply/accumulate (MAC) algorithm. Every task is composed of the assembly statements coding the MAC operation, and of the data on which the operation is performed. The space dimension (i.e. the vector length) is chosen equal to 16 or 32, so that a large number of computational units can be used.

for(int i = 0; i < N; i++){        // outer loop
    y[i] = 0;
    for(int j = 0; j < N; j++){    // inner loop
        y[i] += c[i][j] * x[j];
    }
}

Figure 4. C++ code for a simple linear space transform algorithm.

6.1. Tests
Surely it is useless to allocate too large a number of nPs, because the network bandwidth is not large enough to feed an arbitrarily large number of units: the performance rises as more units are allocated, until a saturation effect due to the limited network bandwidth sets in. So the key question is: what is the maximum number of nPs that can be allocated to a thread so that the performance does not saturate? Fig. 5 shows the results relative to the execution of the previously described algorithm with N = 16 and N = 32, compared to the ideal case result, obtained by dividing the time used by one nP to perform the whole thread by the number of available nPs. The performance saturation effect is gradual, but it becomes complete when the number of nPs reaches a threshold. In the simple case of equal-length tasks, this threshold is given by:

N_th = T_task / T_send    (1)

T_task being the time spent by an nP to complete a single task, and T_send the time spent by the dispatcher to send the task to the nP (equal to the number of task instructions). This happens because the dispatcher must send all the tasks along the same path (every task has only one access point to the network), so the M-th task can be sent only after the previous M−1 ones. Performance therefore keeps improving with N allocated nPs as long as the time spent by an nP to process one task (T_task) is greater than the time needed to feed all the N nPs (N · T_send); beyond the threshold in (1) it saturates. Tasks composed of a large number of simple instructions and that do not include any loop are less suitable for parallel execution than tasks composed of a small number of complex instructions. In Fig. 6 the performance factor is plotted: basically, this factor is obtained by normalizing and inverting the time spent to complete the entire thread with a given number of nPs. The ideal case is shown, together with the obtained results for a 16-element vector and for a 32-element one (different T_task and T_send).

Figure 5. Computation time versus number of allocated nPs for the algorithm described in Fig. 4 with: (a) N = 16, and (b) N = 32.
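To make the saturation mechanism concrete, the following C++ sketch reproduces the feeding model described above (a single access point and equal-length tasks); the numeric values chosen for T_task and T_send are illustrative assumptions, not measured simulation data.

#include <algorithm>
#include <cstdio>
#include <vector>

// Completion time of a thread split into equal tasks on a chain of nPs, under
// the model of Sec. 6.1: T_send cycles to ship one task, T_task cycles to run it,
// and the k-th task cannot leave the dispatcher before the previous k-1 ones.
long completion_time(int n_tasks, int n_nps, long t_task, long t_send) {
    std::vector<long> np_free(n_nps, 0);             // when each nP becomes idle
    long t = 0;                                       // dispatcher clock
    for (int k = 0; k < n_tasks; ++k) {
        t += t_send;                                  // k-th task fully received
        long& np = np_free[k % n_nps];
        np = std::max(np, t) + t_task;                // start when both nP and task are ready
    }
    return *std::max_element(np_free.begin(), np_free.end());
}

int main() {
    const long t_task = 200, t_send = 25;             // assumed figures: N_th = 200/25 = 8
    for (int n = 1; n <= 16; n *= 2)
        std::printf("nPs = %2d -> time = %ld\n", n, completion_time(16, n, t_task, t_send));
    // The printed times improve up to 8 nPs and then flatten: the dispatcher
    // cannot usefully feed more units than N_th = T_task / T_send.
}

With these assumed figures the completion time stops improving around 8 allocated nPs, which is exactly the threshold predicted by (1).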


Figure 6. Performances improvement factor versus number of allocated nPs.

7. Discussion and conclusion
In this paper we have presented a novel architectural paradigm for general-purpose processors. The proposed architecture consists of a fine-grained fabric of simple RISC processors (nanoprocessors) that, coordinated by larger dispatching and scheduling units, is able to perform complex computations in a multithreading environment. This kind of processor structure is especially designed to support architectural and performance scalability, hence representing a good solution to follow technology scaling and the demands of multimedia applications. Such applications are generally noise tolerant and quality scalable: since performance scalability is connected to processing power, it is also possible to modulate the resources assigned to a thread as a function of the desired quality. Exploiting mixed centralized-decentralized control strategies, the processor is able to mask architectural details, reaching the goal of flexibility in thread allocation, task division and cooperative ILP support. This latter aspect allows the nPs to be kept small while preserving the speed-up introduced by ILP exploitation. Moreover, the proposed architecture is a good substrate for the evaluation of novel methodologies for traditional general-purpose problems, exploiting a high flexibility and a good resource allocation mechanism. Since such an architecture is based on an array of simple units exploiting locality and cooperation, and without absolute localization requirements, it is potentially a very promising platform for fault-tolerant systems. In fact, nP faults can be detected by means of simple global (or local) tests, and broken nPs can then be easily bypassed by their switches, without affecting system functionality. Some simulations have been performed on assembly code to analyze system performance and highlight possible bottlenecks in the processing fabric. These simulations have revealed the presence of a saturation mechanism in the performance

speed-up. This negative aspect is primarily due to a bottleneck in the network structure, caused by the topology of the network and by the limits of linear allocation structures with only strictly local interconnections. The problem is more evident when the allocated threads are composed of overly simple subtasks, since in this case the overhead of task transmission is comparable to that of task execution. At the same time, the simulations show the feasibility of the approach, which deserves further study and experimentation.

References
[1] S. E. Breach. Design and Evaluation of a Multiscalar Processor. PhD thesis, University of Wisconsin - Madison, 1998.
[2] M. Edahiro, S. Matsushita, M. Yamashina, and N. Nishi. A single-chip multiprocessor for smart terminals. IEEE Micro, pages 12–20, Jul.-Aug. 2000.
[3] R. Hartenstein. A decade of reconfigurable computing: a visionary retrospective. In Proc. Design, Automation and Test in Europe (DATE'01), pages 642–649, 2001.
[4] J. R. Hauser. Augmenting a Microprocessor with Reconfigurable Hardware. PhD dissertation, University of California, Berkeley, 2000.
[5] M. H. Lipasti. Value Locality and Speculative Execution. PhD dissertation, Carnegie Mellon University, Pittsburgh, PA, April 1997.
[6] A. Moshovos and G. Sohi. Microarchitectural innovations: Boosting microprocessor performance beyond semiconductor technology scaling. In Proceedings of the IEEE, volume 89, pages 1560–1575, November 2001.
[7] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. Keckler, and C. Moore. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In International Symposium on Computer Architecture, 2003.
[8] Y. Sazeides and J. E. Smith. The predictability of data values. In Proc. 30th Annu. Int. Symp. Microarchitecture, pages 248–258, December 1997.
[9] G. S. Sohi, S. E. Breach, and T. Vijaykumar. Multiscalar processors. In Proc. 22nd Annu. Int. Symp. Computer Architecture, pages 414–425, June 1995.
[10] M. Tremblay, J. Chan, S. Chaudhry, A. W. Conigliaro, and S. S. Tse. The MAJC architecture: A synthesis of parallelism and scalability. IEEE Micro, pages 12–25, Nov.-Dec. 2000.
[11] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring it all to software: RAW machines. IEEE Computer, 30(9):86–93, September 1997.
[12] P. Wang, J. Collins, H. Wang, D. Kim, B. Greene, K.-M. Chan, A. Yunus, T. Sych, S. Moore, and J. P. Shen. Helper threads via virtual multithreading. IEEE Micro, 24(6):74–82, Nov.-Dec. 2004.

