A Design of Asynchronous Double Grain Reconfigurable Computing Array
Pakon Thuphairo1, Arthit Thongtak1*
1 DSEL (Digital System Engineering Laboratory), Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Patumwan, Bangkok, Thailand 10330
E-mail: [email protected], [email protected]*
Tel: +66-(0)2-218-6956
1. ABSTRACT
An asynchronous double-grain reconfigurable computing array is proposed in this paper. The architecture is designed as a general-purpose, partially run-time reconfigurable architecture with a 2D array style and 4NN interconnection, serving both coarse-grain and fine-grain tasks. It includes a separate carry path, which is also routed into the single-bit datapath for use with logic operations. The single-bit datapath serves not only 1-bit tasks themselves but also multi-bit tasks that have single-bit conditional datapaths. The static data-flow structure [4] and Balsa [2], an asynchronous design tool, are used to design the processing element (PE) at the architectural level. Three basic functions are provided to show the feasibility of compilation. The first is a loop control-signal generator that contains both multi-bit and single-bit datapaths. The second is a 32-bit ripple carry adder constructed from PEs. The third is a 16-bit loop control-signal generator, given as an example of a multi-bit function. The functions are manually synthesized from a high-level specification and mapped onto the computing array. The results show that the circuits generated from Balsa operate correctly under the QDI asynchronous delay model.
Keywords: Asynchronous circuit, Data flow computing, Reconfigurable architecture.
2. INTRODUCTION
2.1 Reconfigurable architecture
Reconfigurable computing (RC) bridges the gap between the limited speed of the von Neumann machine paradigm and the customized functions of ASICs by providing reconfigurable fabrics. High-performance computing, for example, is one application in which RC helps compute tasks more efficiently. Asynchronous circuit design methodologies have emerged to solve design problems in synchronous systems [4], such as clock skew. Moreover, the advantages of asynchronous design have been exploited to achieve higher performance. Fine-grain reconfigurable architectures offer more flexibility in serving arbitrary functions, but their configuration data size is a drawback compared with coarse-grain architectures. Recent work on asynchronous reconfigurable architectures has been proposed in [7], [5], [3].
2.2 Asynchronous logic circuit
Asynchronous circuits offer advantages over synchronous ones [4]. For example, asynchronous circuits have no clock skew problems, since no global clock controls the entire circuit; clock gating is not needed, since the circuit operates only when it is required; and the circuit adapts automatically to physical properties.
2.3 Static dataflow structure
Designing an asynchronous circuit at gate level can be a labor-intensive task for a complex system. The static dataflow structure (SDF) is a methodology that helps asynchronous designers build a circuit at pipeline level without knowing how its components are implemented, for example, which handshaking protocol they use. SDF is used in our work to design the architecture of the Processing Element (PE). Templates for constructing an expression of a high-level programming language have already been shown in [4]. Hence, it is feasible to compile a task expressed in a high-level specification directly into a pipeline-level circuit.
13th Annual Symposium on Computational Science and Engineering (ANSCSE 13)
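The pipeline-level view that SDF offers can be illustrated with a small Python sketch. It is a simplified abstraction under stated assumptions, not the template set of [4]: each latch holds either a data token or a bubble (None), a token advances only into a bubble, and the handshake protocol and empty tokens are deliberately left out. The name run_pipeline and the stage functions are hypothetical.

```python
from typing import Callable, List, Optional


def run_pipeline(inputs: List[int],
                 stages: List[Callable[[int], int]]) -> List[int]:
    """Push tokens through a chain of latches separated by function blocks.

    latches[i] is None (a bubble) or a data token. stages[i] sits between
    latches[i] and latches[i + 1].
    """
    latches: List[Optional[int]] = [None] * (len(stages) + 1)
    pending = list(inputs)
    outputs: List[int] = []
    while pending or any(token is not None for token in latches):
        # The consumer removes the token in the last latch, leaving a bubble.
        if latches[-1] is not None:
            outputs.append(latches[-1])
            latches[-1] = None
        # Scan from the output side so a token moves at most one latch per step.
        for i in range(len(stages) - 1, -1, -1):
            if latches[i] is not None and latches[i + 1] is None:
                latches[i + 1] = stages[i](latches[i])
                latches[i] = None
        # The producer fills the first latch only when it holds a bubble.
        if pending and latches[0] is None:
            latches[0] = pending.pop(0)
    return outputs
```

Tokens leave the pipeline in the order they entered, and no latch is ever overwritten while it still holds a token, which is the property the designer relies on when working at pipeline level.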
3. OVERALL ARCHITECTURE
The architecture of ARCA [7], which has a simple 4NN interconnection, is adopted in our design. Each PE communicates with its adjacent neighbors in the up, left, right, and down directions, as shown in Figure 1.
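As a minimal sketch of the 4NN connectivity, the following Python function lists the neighbors a PE can talk to. The (row, column) coordinate scheme, the function name neighbors_4nn, and the assumption that PEs on the array edge have no wrap-around links are illustrative choices, not details taken from the design.

```python
from typing import Dict, Tuple


def neighbors_4nn(row: int, col: int,
                  rows: int, cols: int) -> Dict[str, Tuple[int, int]]:
    """Return the adjacent PEs reachable from (row, col) in a rows x cols array."""
    candidates = {
        "up": (row - 1, col),
        "left": (row, col - 1),
        "right": (row, col + 1),
        "down": (row + 1, col),
    }
    # Assumption: no wrap-around, so positions outside the array are dropped.
    return {
        direction: (r, c)
        for direction, (r, c) in candidates.items()
        if 0 <= r < rows and 0 <= c < cols
    }
```

An interior PE has four neighbors, while a corner PE under this assumption has only two.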
Figure 1. The overall architecture
In our design, each PE is reconfigured to perform an operation, receiving up to 2 inputs from its neighbors. The input and output widths are 8 bits, with a single-bit datum for the carry signal of addition/subtraction and also for a conditional signal. Our PE design is based on the SDF structure, and a PE contains most of the SDF components. The configuration signals are fed to multiplexers and merges to select the data paths of the components. Our design is a mixed-grain architecture: a single-bit datapath is also included to perform single-bit computation. Without Single-bit Function Units (SFU), 1-bit operations would have to be padded and mapped onto Multi-bit Function Units (MFU).
Figure 2. The PE architecture (blocks: MFU, SFU, carry selector; ports: multi-bit input, multi-bit output, single-bit input, single-bit output, carry in, carry out, conditional signals)
The main part of the PE, the multi-bit function unit, operates on the data routed from the input selector. The single-bit condition unit takes charge of initializing the single-bit latch, controls both the input and output selectors, and also performs single-bit operations. Placing latches in an asynchronous circuit can form a process into a pipeline or a ring. In this design, a latch is placed immediately after the function unit, which reduces the need for an additional PE when a latch is required in the SDF, without affecting the correctness of the function. The SFU takes charge of computing single-bit tasks. With only the MFU, the circuit would have to wait for the padded signals, at least if it is implemented with asynchronous methodologies that use completion detection. The single-bit datapath can be mapped separately from the multi-bit datapath, but it can also be routed to the MFU in the same PE, and vice versa. One advantage is that a single-bit signal can be passed through a group of PEs that is configured only for multi-bit operation.
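The split between the two datapaths can be made concrete with a purely functional Python sketch of one PE. It assumes a hypothetical operation set ("add", "and", "pass" for the MFU; "and", "or", "pass" for the SFU) and a hypothetical configuration flag carry_to_sfu standing in for the carry selector; the actual operation set and configuration encoding of the PE are not specified here, and latches and handshaking are not modeled.

```python
from dataclasses import dataclass
from typing import Tuple

WIDTH = 8                  # multi-bit datapath width stated in the text
MASK = (1 << WIDTH) - 1


@dataclass
class PEConfig:
    mfu_op: str            # hypothetical: "add", "and" or "pass"
    sfu_op: str            # hypothetical: "and", "or" or "pass"
    carry_to_sfu: bool     # hypothetical: route carry-out into the single-bit path


def mfu(op: str, a: int, b: int, carry_in: int) -> Tuple[int, int]:
    """Multi-bit function unit: returns (8-bit result, carry-out)."""
    if op == "add":
        total = (a & MASK) + (b & MASK) + (carry_in & 1)
        return total & MASK, total >> WIDTH
    if op == "and":
        return a & b & MASK, 0
    if op == "pass":
        return a & MASK, 0
    raise ValueError(f"unknown MFU operation: {op}")


def sfu(op: str, x: int, y: int) -> int:
    """Single-bit function unit operating on two 1-bit values."""
    if op == "and":
        return x & y & 1
    if op == "or":
        return (x | y) & 1
    if op == "pass":
        return x & 1
    raise ValueError(f"unknown SFU operation: {op}")


def pe_step(cfg: PEConfig, a: int, b: int,
            carry_in: int, s_in: int) -> Tuple[int, int, int]:
    """Evaluate one PE: returns (multi-bit output, carry-out, single-bit output)."""
    result, carry_out = mfu(cfg.mfu_op, a, b, carry_in)
    # Carry selector: the carry may feed the single-bit datapath for logic operations.
    s_x = carry_out if cfg.carry_to_sfu else s_in
    return result, carry_out, sfu(cfg.sfu_op, s_x, s_in)
```

For example, with the MFU set to "add" and the carry routed to an SFU configured as "or", adding 200 and 100 yields the 8-bit result 44 with a carry-out of 1, and the single-bit output becomes 1 without any padded multi-bit operation.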
4. RESULTS
4.1 Application
Manually mapping a complex application is a labor-intensive task. Thus, only a few applications are given in this paper.
Legend: Sk = Sink latch; E = Empty latch; CMP = Comparator
Figure 3. Loop control signal generator.
The loop control signal generator provides Control Signal 1 (CS1) and Control Signal 2 (CS2) to control the loop activity in a circuit. For instance, a multiplexer pulls input data into a circuit and, after the loop activity is completed, a demultiplexer releases the result to the next circuits.
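The behavior of the loop control signal generator can be sketched in Python as a counter compared against a constant. This is only a behavioral model under the assumption that CS1 is issued once per iteration and CS2 once when the loop count is reached, after which the counter restarts; the name loop_control and the parameter limit (the comparator constant) are hypothetical.

```python
from typing import Iterator


def loop_control(limit: int) -> Iterator[str]:
    """Yield "CS1" limit times, then "CS2", and repeat indefinitely."""
    count = 0
    while True:
        if count < limit:      # comparator (CMP) against the configured constant
            count += 1         # adder increments the loop counter
            yield "CS1"
        else:
            count = 0          # restart the counter for the next loop
            yield "CS2"
```

Changing limit corresponds to changing the constant fed to the comparator, so the loop count is set by configuration data alone.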
Figure 4. The PE array configured for rectangular mapping
Figure 4 shows a group of 6 PEs performing the application in Figure 3. Because of the 4NN style of interconnection, one PE is needed as a forwarder. The SFUs in the PEs labeled 4-1 and 2 are configured to operate as a latch containing a data token and as a forwarder, respectively.
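Why a forwarder appears can be seen from a short Python sketch: with links only between adjacent PEs, two PEs that are not neighbors need intermediate PEs to pass the signal along. The function below gives the minimum number of forwarders on a shortest path, assuming hypothetical (row, column) coordinates and that the PEs along that path are free; it does not reproduce the actual placement of Figure 4.

```python
from typing import Tuple


def forwarders_needed(src: Tuple[int, int], dst: Tuple[int, int]) -> int:
    """Minimum number of forwarding PEs between src and dst on a 4NN array."""
    # Each hop on a 4NN array changes the row or the column by exactly one.
    distance = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    # Adjacent PEs (distance 1) connect directly; every extra hop costs one forwarder.
    return max(distance - 1, 0)
```

A diagonal connection, for instance from (0, 0) to (1, 1), therefore costs one forwarder.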
Figure 5. 32-bit ripple carry adder
Reconfiguring the array as a 32-bit ripple carry adder is straightforward. Four PEs are configured to construct a chain of adders. The carry signal of each MFU is routed to the next one.
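The arithmetic performed by the chain can be checked with a short Python sketch that adds two 32-bit values in four 8-bit slices, passing the carry from one slice to the next. It models only the arithmetic, not the handshaking between PEs, and the function name ripple_add_32 is hypothetical.

```python
from typing import Tuple

PE_WIDTH = 8     # width of one MFU
NUM_PES = 4      # four chained PEs give 32 bits


def ripple_add_32(a: int, b: int, carry_in: int = 0) -> Tuple[int, int]:
    """Add two 32-bit values slice by slice; returns (32-bit sum, carry-out)."""
    mask = (1 << PE_WIDTH) - 1
    result = 0
    carry = carry_in & 1
    for i in range(NUM_PES):
        shift = PE_WIDTH * i
        # Each PE adds its own 8-bit slice plus the carry from the previous PE.
        total = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (total & mask) << shift
        carry = total >> PE_WIDTH
    return result, carry
```

The carry-out of the last PE signals overflow of the 32-bit sum, for example when 1 is added to 0xFFFFFFFF.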
Figure 6. 16-bit loop control-signal generator
The first example generates the control signal for a loop operation. This example produces a control signal for loop operations whose loop count exceeds the value 255; in this circuit, the number of loop iterations is 768. Two adders are needed to handle the 16-bit data value, as shown in Figure 6. All the example circuits given in this paper show the possibility of mapping dataflow circuits onto the computing array; whether real-world applications can be mapped onto the array needs further investigation.
4.2 Functional verification
Figure 7. Behavioral simulation of the loop control signal generator.
In this simulation, the circuit is configured to produce CS1 ten times before CS2 occurs. The number of loop iterations can be reconfigured just by changing the constant value fed to the comparator. Since the control signals are not fed to other circuits and each of the PEs is repeatedly reconfigured every time it responds to the environment, the circuit operates indefinitely, as shown in Figure 7. The computing array was manually configured to perform the function. The behavioral simulation model was generated with Xilinx ISE WebPACK 10.1 and simulated with ModelSim XE III/Starter 6.3c under the QDI (Quasi Delay Insensitive) asynchronous delay model. The remaining example circuits were also verified, but not synthesized, because the reset circuits need to be inserted manually. The Balsa simulation shows that all the remaining example circuits operate correctly at the architectural level.
5. CONCLUSION
The proposed asynchronous double-grain architecture is intentionally designed for a non-specific application domain. Our results show only that basic functions synthesized from a high-level specification can be mapped onto the array. Manual synthesis becomes a labor-intensive task if the desired functions are complex. The design of the array architecture needs further investigation before it can perform arbitrary functions. Designing the controller for the computing array is part of our future work. Existing research has already proposed placement algorithms for reconfigurable architectures with a 2D array style [1], [6].
6. LITERATURE CITED
1. Esmaeildoust, M., Fazlali, M., and Zakerolhosseini, A., An Efficient Algorithm for Online Placement in a Reconfigurable System, International Conference on Optimization of Electrical and Electronic Equipment, 2008, 69-74.
2. http://intranet.cs.man.ac.uk/apt/projects/tools/balsa
3. Kagotani, H., and Schmit, H., Asynchronous PipeRench: architecture and performance evaluations, Proceedings of the 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2003, 121-129.
4. Sparso, J., Asynchronous Circuit Design - A Tutorial, Technical University of Denmark, 2006, 4, 29-40.
5. Sun, K., Pan, X., and Wang, J., Design of A Novel Asynchronous Reconfigurable Architecture for Cryptographic Applications, Proceedings of the First International Multi-Symposiums on Computer and Computational Sciences, 2006, 751-757.
6. Walder, H., Steiger, C., and Platzner, M., Fast online task placement on FPGAs: free space partitioning and 2D-hashing, Proceedings of the International Parallel and Distributed Processing Symposium, 2003, 8 pp.
7. Zhang, J., Pan, X., and Shen, H., Asynchronous Reconfigurable Computing Array Design, Proceedings of the Second International Conference on Embedded Software and Systems, 2005, 6 pp.