
A Design of Asynchronous Double-Grain Reconfigurable Computing Array

Pakon Thuphairo¹, Arthit Thongtak¹*
¹DSEL (Digital System Engineering Laboratory), Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University, Patumwan, Bangkok, Thailand 10330
E-mail: [email protected], [email protected]*
Tel: +66-(0)2-218-6956

1. ABSTRACT

This paper proposes an asynchronous double-grain reconfigurable computing array. The architecture is a general-purpose 2D array with 4NN interconnection that supports partial runtime reconfiguration and is intended for both coarse- and fine-grain tasks. It also provides a separate carry path, which is routed into the single-bit datapath so that it can be used with logic operations. The single-bit datapath serves not only 1-bit tasks but also multi-bit tasks that contain single-bit conditional datapaths. The processing element (PE) is designed at the architectural level using the static dataflow structure [4] and Balsa [2], an asynchronous design tool. To show the feasibility of compilation, three basic functions are presented. First, a loop control-signal generator contains both multi-bit and single-bit datapaths. Second, a 32-bit ripple-carry adder is constructed from PEs. Finally, a 16-bit loop control-signal generator is given as an example of a multi-bit function. The functions are synthesized manually from a high-level specification and mapped onto the computing array. The results show that the circuits generated by Balsa operate correctly under the QDI asynchronous delay model.

Keywords: Asynchronous circuit, Data flow computing, Reconfigurable architecture.

13th Annual Symposium on Computational Science and Engineering (ANSCSE 13)

2. INTRODUCTION

2.1 Reconfigurable architecture

Reconfigurable computing (RC) bridges the gap between the limited speed of the von Neumann machine paradigm and the customized, fixed functions of ASICs by providing reconfigurable fabrics. High-performance computing, for example, is one application area in which RC helps compute tasks more efficiently. Asynchronous circuit design methodologies have emerged to solve design problems of synchronous systems, such as clock skew [4], and the advantages of asynchronous design have also been exploited to achieve higher performance. Fine-grain reconfigurable architectures are more flexible in implementing arbitrary functions, but their larger configuration data size is a drawback compared with coarse-grain architectures. Asynchronous reconfigurable architectures have recently been proposed in [7], [5], and [3].

2.2 Asynchronous logic circuit

Asynchronous circuits offer several advantages over synchronous ones [4]. For example, they have no clock skew problems because no global clock controls the entire circuit; they need no clock gating because the circuit operates only when required; and they adapt automatically to physical properties such as variations in temperature and supply voltage.

2.3 Static dataflow structure

Designing an asynchronous circuit at the gate level can be labor-intensive for a complex system. The static dataflow structure (SDF) is a methodology that lets asynchronous designers build a circuit at the pipeline level without needing to know how the components are implemented, for example, which handshaking protocol they use. We use SDF to design the architecture of the processing element (PE). Templates for constructing expressions of a high-level programming language as SDF structures have already been shown in [4], which makes it feasible to compile a task expressed in a high-level specification directly into a pipeline-level circuit.
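To illustrate the pipeline-level view that SDF gives the designer, the sketch below models each latch abstractly as holding either a data token or a bubble (an empty latch) and ignores the handshaking protocol entirely. It is a simplified token-flow model assumed for illustration, not Balsa's semantics or the authors' implementation, and the names `ring_step` and `run` are hypothetical. It shows a ring of latches, the structure used for iterative computations such as loops: a token advances only into an empty latch, so a ring without bubbles cannot make progress.

```python
from typing import Callable, List, Optional

# Hypothetical token-flow model: an int is a data token, None is a bubble.
Latch = Optional[int]


def ring_step(ring: List[Latch], f: Callable[[int], int]) -> List[Latch]:
    """One step of a ring of latches.

    A token in latch i moves into latch i+1 (wrapping around) only if that
    latch held a bubble before the step; f models the function unit between
    the two latches. All moves are decided from the old state, so two tokens
    never collide.
    """
    n = len(ring)
    nxt = list(ring)
    for i in range(n):
        j = (i + 1) % n
        if ring[i] is not None and ring[j] is None:
            nxt[j] = f(ring[i])
            nxt[i] = None
    return nxt


def run(ring: List[Latch], f: Callable[[int], int], steps: int) -> List[Latch]:
    """Apply ring_step repeatedly."""
    for _ in range(steps):
        ring = ring_step(ring, f)
    return ring


if __name__ == "__main__":
    inc = lambda x: x + 1
    # One token and two bubbles: the token circulates, incremented on each hop.
    print(run([0, None, None], inc, 3))   # [3, None, None]
    # No bubbles: every latch is full, so nothing can move.
    print(run([0, 0, 0], inc, 3))         # [0, 0, 0]
```

In this abstract model a ring needs at least one bubble to make progress; how many latches a real ring needs depends on the handshake protocol and latch implementation.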

3. OVERALL ARCHITECTURE

Our design adopts the architecture of ARCA [7], which uses a simple 4NN (four-nearest-neighbor) interconnection: each PE communicates with its adjacent neighbors in the up, down, left, and right directions, as shown in Figure 1.

Figure 1. The overall architecture

In our design, each PE is configured to perform one operation on up to two inputs received from its neighbors. The multi-bit inputs and outputs are 8 bits wide and are accompanied by a single-bit signal that carries the carry of addition/subtraction and also serves as a conditional signal. The PE design is based on the SDF structure, and a PE contains most of the SDF components. Configuration signals drive multiplexers and merges that select the datapaths between these components. Because our design is a mixed-grain architecture, a single-bit datapath is also included for single-bit computation. Without Single-bit Function Units (SFU), 1-bit operations would have to be padded and mapped onto Multi-bit Function Units (MFU).

Figure 2. The PE architecture (blocks: MFU with multi-bit input and output; carry selector with carry in and carry out; conditional signals; SFU with single-bit input and output)

The main part of the PE, the MFU, operates on the data routed from the input selector. The single-bit condition unit initializes the single-bit latch, controls both the input and output selectors, and also performs single-bit operations. Latches placed in an asynchronous circuit allow a process to be organized as a pipeline or a ring. In this design, a latch is placed immediately after each function unit, so that no additional PE is needed when the SDF structure requires a latch; this does not affect functional correctness. The SFU computes the single-bit tasks. With only an MFU, a single-bit operation would also have to wait for the padded bits, at least when the circuit is implemented with an asynchronous methodology that uses completion detection. The single-bit datapath can be mapped separately from the multi-bit datapath, but it can also be routed to the MFU in the same PE, and vice versa. One advantage is that a single-bit signal can pass through a group of PEs that are configured only for multi-bit operations.
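The cost of padding a 1-bit operation onto an MFU can be illustrated with completion detection. The sketch below assumes dual-rail encoding, a common delay-insensitive code in QDI circuits in which each bit uses a true rail and a false rail; the paper does not state which encoding its PEs use, and the names `encode` and `is_complete` are hypothetical. A word is complete only when every bit carries a valid code, so a 1-bit result padded to 8 bits is not detected as complete until the seven padding bits have also arrived.

```python
from typing import List, Tuple

# Assumed dual-rail code: (true_rail, false_rail).
DualRail = Tuple[int, int]
EMPTY: DualRail = (0, 0)  # spacer: the bit has not arrived yet


def encode(bit: int) -> DualRail:
    """Valid codes: (1, 0) means logic 1, (0, 1) means logic 0."""
    return (1, 0) if bit else (0, 1)


def is_complete(word: List[DualRail]) -> bool:
    """Completion detection: every bit must have exactly one rail high."""
    return all(t + f == 1 for t, f in word)


if __name__ == "__main__":
    result_bit = encode(1)

    # SFU: a 1-bit word is complete as soon as its single bit arrives.
    print(is_complete([result_bit]))                  # True

    # MFU: the same bit padded to 8 bits; the padding has not arrived yet.
    print(is_complete([result_bit] + [EMPTY] * 7))    # False

    # Only after the zero padding arrives is the 8-bit word complete.
    print(is_complete([result_bit] + [encode(0)] * 7))  # True
```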


4. RESULTS

4.1 Application

Manually mapping a complex application is labor-intensive, so only a few applications are presented in this paper.

Figure 3 legend: Sk = sink latch, E = empty latch, CMP = comparator.

Figure 3. Loop control-signal generator

The loop control-signal generator provides Control Signal 1 (CS1) and Control Signal 2 (CS2) to control the loop activity in a circuit. For instance, a multiplexer passes input data into the loop, and after the loop activity is completed, a demultiplexer releases the result to the next circuit.
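A behavioral sketch of the loop control-signal generator may help. It assumes that a counter is compared against a configured constant (the CMP in Figure 3), that CS1 is issued once per iteration while the count is below the constant, and that CS2 is issued when the loop finishes, after which the counter restarts. These semantics and the names `loop_control`, `limit`, and `width` are assumptions for illustration; the actual token flow is defined by the circuit in Figure 3.

```python
from itertools import islice
from typing import Iterator


def loop_control(limit: int, width: int = 8) -> Iterator[str]:
    """Hypothetical behavioral model of the loop control-signal generator.

    Yields "CS1" once per loop iteration and "CS2" when the loop completes,
    then starts over, so the generator runs indefinitely.
    """
    # As in any generator, this check runs on the first next() call.
    if not 0 <= limit < (1 << width):
        raise ValueError("limit must fit in the counter width")
    count = 0
    while True:
        if count < limit:      # comparator against the configured constant
            yield "CS1"        # assumed: keep the data circulating in the loop
            count += 1
        else:
            yield "CS2"        # assumed: loop finished, release the result
            count = 0


if __name__ == "__main__":
    # Ten CS1 signals before each CS2, as in the simulation of Figure 7.
    print(list(islice(loop_control(10), 11)))
```

Changing `limit` corresponds to changing the constant fed to the comparator, which is how the number of loop iterations is reconfigured.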

Figure 4. The PE array configured for rectangular mapping

Figure 4 shows a group of six PEs implementing the circuit in Figure 3. Because of the 4NN interconnection, one PE has to act as a forwarder. The SFUs in the PEs labeled 4-1 and 2 are configured to operate as a latch containing a data token and as a forwarder, respectively.
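The need for a forwarder follows from the 4NN interconnection: a PE can exchange data only with its up, down, left, and right neighbors, so operators placed diagonally or farther apart must be connected through intermediate PEs. The sketch below counts the minimum number of forwarding PEs on a shortest route, assuming the intermediate PEs are free to forward (in the paper's design this can be an SFU configured as a forwarder). The coordinate convention and function names are hypothetical.

```python
from typing import List, Tuple

Pos = Tuple[int, int]  # (row, column) of a PE in the 2D array


def neighbors(p: Pos, rows: int, cols: int) -> List[Pos]:
    """PEs directly reachable from p under 4NN interconnection."""
    r, c = p
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < rows and 0 <= j < cols]


def forwarders_needed(src: Pos, dst: Pos) -> int:
    """Minimum number of intermediate PEs on a shortest 4NN route.

    A shortest route takes Manhattan-distance hops, so it passes through
    (distance - 1) intermediate PEs that must forward the data.
    """
    distance = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return max(distance - 1, 0)


if __name__ == "__main__":
    print(neighbors((0, 0), rows=2, cols=3))   # [(1, 0), (0, 1)]
    print(forwarders_needed((0, 0), (0, 1)))   # 0: adjacent PEs
    print(forwarders_needed((0, 0), (1, 1)))   # 1: diagonal PEs need one forwarder
```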

Figure 5. 32-bit ripple-carry adder

Configuring a 32-bit ripple-carry adder is straightforward: four PEs are configured to form a chain of 8-bit adders, and the carry signal of each MFU is routed to the MFU of the next PE.
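The chaining can be sketched behaviorally as follows. Each PE's MFU is modeled as an 8-bit adder with a carry input and a carry output, and four of them are connected so that each carry out drives the next PE's carry in. The sketch models only the arithmetic, not the asynchronous handshaking, and `mfu_add` and `ripple_add_32` are hypothetical names.

```python
from typing import Tuple

WIDTH = 8                   # multi-bit datapath width of one PE
MASK = (1 << WIDTH) - 1     # 0xFF
NUM_PES = 4                 # 4 x 8 bits = 32 bits


def mfu_add(a: int, b: int, carry_in: int) -> Tuple[int, int]:
    """8-bit MFU addition: returns (sum, carry_out)."""
    total = a + b + carry_in
    return total & MASK, total >> WIDTH


def ripple_add_32(x: int, y: int) -> Tuple[int, int]:
    """Add two 32-bit values with a chain of four 8-bit MFUs."""
    carry = 0
    result = 0
    for k in range(NUM_PES):             # PE k handles bits 8k .. 8k+7
        a = (x >> (WIDTH * k)) & MASK
        b = (y >> (WIDTH * k)) & MASK
        s, carry = mfu_add(a, b, carry)  # carry out ripples to PE k+1
        result |= s << (WIDTH * k)
    return result, carry


if __name__ == "__main__":
    print(hex(ripple_add_32(0x12345678, 0x11111111)[0]))  # 0x23456789
    print(ripple_add_32(0xFFFFFFFF, 1))                   # (0, 1)
```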

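The 16-bit loop control-signal generator described next needs a loop count larger than 255, the largest value an 8-bit datapath can hold. The sketch below assumes that the 16-bit count is kept as a low byte and a high byte in two 8-bit adders, with the low adder's carry out driving the high adder's carry in. This arrangement and the names `add8`, `increment16`, and `count_to` are assumptions, not a transcription of Figure 6.

```python
from typing import Tuple

MASK8 = 0xFF


def add8(a: int, b: int, carry_in: int) -> Tuple[int, int]:
    """One 8-bit adder: returns (sum, carry_out)."""
    total = a + b + carry_in
    return total & MASK8, total >> 8


def increment16(hi: int, lo: int) -> Tuple[int, int]:
    """Increment a 16-bit count split into two bytes using two 8-bit adders."""
    lo, carry = add8(lo, 1, 0)   # low adder: add one
    hi, _ = add8(hi, 0, carry)   # high adder: absorb the carry from the low byte
    return hi, lo


def count_to(n: int) -> Tuple[int, int]:
    """Run n loop iterations and return the final (high byte, low byte)."""
    hi, lo = 0, 0
    for _ in range(n):
        hi, lo = increment16(hi, lo)
    return hi, lo


if __name__ == "__main__":
    hi, lo = count_to(768)
    print(hi, lo, (hi << 8) | lo)   # 3 0 768
```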

Figure 6. 16-bit loop control-signal generator

Like the first example, this circuit generates control signals for a loop operation, but for loops whose iteration count exceeds 255, the largest value an 8-bit datapath can hold; in this circuit, the number of loop iterations is 768. As shown in Figure 6, two adders are needed to handle the 16-bit count value. All the example circuits given in this paper show that dataflow circuits can be mapped onto the computing array; whether real-world applications can also be mapped onto the array requires further investigation.

4.2 Functional verification

Figure 7. Behavioral simulation of the loop control-signal generator

In this simulation, the circuit is configured to produce CS1 ten times before CS2 occurs. The number of loop iterations can be reconfigured simply by changing the constant input to the comparator. Because the control signals are not fed to other circuits and each PE is reconfigured every time it responds to the environment, the circuit operates indefinitely, as shown in Figure 7. The computing array was configured manually to perform this function. The behavioral simulation model was generated with Xilinx ISE WebPACK 10.1 and simulated with ModelSim XE III/Starter 6.3c under the QDI (quasi-delay-insensitive) asynchronous delay model. The remaining example circuits were also verified but not synthesized, because reset circuits would have had to be inserted manually; Balsa simulation shows that all of them operate correctly at the architectural level.

5. CONCLUSION

The proposed asynchronous double-grain architecture is intentionally designed for no specific application domain. Our results show only that basic functions synthesized from a high-level specification can be mapped onto the array; manual synthesis will be labor-intensive if the desired functions are complex. The array architecture needs further investigation before it can perform arbitrary functions. Designing a controller for the computing array is part of our future work; online placement algorithms for reconfigurable architectures with a 2D array style have already been proposed [1], [6].

6. LITERATURE CITED

1. Esmaeildoust, M., Fazlali, M., and Zakerolhosseini, A., An Efficient Algorithm for Online Placement in a Reconfigurable System, International Conference on Optimization of Electrical and Electronic Equipment, 2008, 69–74.
2. Balsa, http://intranet.cs.man.ac.uk/apt/projects/tools/balsa
3. Kagotani, H., and Schmit, H., Asynchronous PipeRench: Architecture and Performance Evaluations, Proceedings of the 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2003, 121–129.
4. Sparsø, J., Asynchronous Circuit Design - A Tutorial, Technical University of Denmark, 2006, 4, 29–40.
5. Sun, K., Pan, X., and Wang, J., Design of A Novel Asynchronous Reconfigurable Architecture for Cryptographic Applications, Proceedings of the First International Multi-Symposiums on Computer and Computational Sciences, 2006, 751–757.
6. Walder, H., Steiger, C., and Platzner, M., Fast Online Task Placement on FPGAs: Free Space Partitioning and 2D-Hashing, Proceedings of the International Parallel and Distributed Processing Symposium, 2003, 8 pp.
7. Zhang, J., Pan, X., and Shen, H., Asynchronous Reconfigurable Computing Array Design, Proceedings of the Second International Conference on Embedded Software and Systems, 2005, 6 pp.
