An Evaluation System for Distributed-Time VHDL Simulation

Alessandra Costa, Alessandro De Gloria, Paolo Faraboschi and Mauro Olivieri

DIBE - University of Genoa, Via Opera Pia 11a, 16145 Genova, Italy

Performance of VHDL simulation is a critical issue in electronic circuit design and is hard to achieve due to the complexity of the language and the different abstraction levels. This paper presents a system for performance evaluation of distributed-time VHDL simulation based on the analysis of simulation traces. The system allows different architectures, interconnection topologies and simulation algorithms to be modeled. The main tools are a VHDL analyzer to extract dependencies, and a trace-driven simulator to evaluate the execution time on a given architecture.

Abstract

The paper presents a system¹ for performance evaluation of distributed-time VHDL simulation on distributed-memory multiprocessors, based on the analysis of simulation traces. The system allows different architectural configurations to be modeled, parameterized in terms of number of processors and interconnection topology. The main tools are a VHDL analyzer to extract process dependencies, and a trace-driven simulator to evaluate the execution time on a given architecture. Our analysis aims at identifying the most critical parameters that influence the performance of a parallel simulation system. In this research stage, we are not interested in proposing a particular architecture, but we want to estimate the conditions under which a distributed simulation can be effective in terms of performance and cost.

1 Introduction

VHDL is today's language of choice for hardware design projects [1]. It is a worldwide accepted industrial standard for specifying all levels of detail from behavioral to gate. Originally proposed for documentation purposes, VHDL is today commonly used also for specification, simulation and synthesis, and its diffusion is expected to increase considerably in the near future.

A VHDL description is a collection of concurrent processes (or concurrent statements in their simple form) that generate events on data structures called signals. An event on a signal is defined by its new value and by the time when the new value will be assigned (also known as time-stamp). A process can be suspended by a wait statement and reactivated on specific events. Statements within a process are executed sequentially and can use temporary storage elements (called variables) that are local to the process and have no timing information.

The algorithm behind VHDL simulation is the basic two-step simulation cycle used in many other fields of Discrete Event Simulation (DES [4]): in the scheduling phase we compute the new values and time-stamps of the signals according to the actions specified by the active processes, and in the driving phase we actually assign the new values to the signals.

Performance of VHDL simulation is a critical issue in the process of electronic circuit design and is difficult to improve due to the complexity of the language and the different supported abstraction levels: software solutions that optimize the simulation algorithms on the fastest workstations are not sufficient for industrial needs, and dedicated hardware solutions cannot keep pace with the increasing speed of state-of-the-art microprocessors and will become obsolete in a short time.

As in any other DES problem, we can exploit parallelism in VHDL by distributing the processes and the event queue over different processors with one of two strategies:

● Centralized Time (synchronous simulation): we execute events in parallel only if they have the same time-stamp: processors must re-synchronize before any event with a greater time-stamp can be processed, thus following a centralized time [8].
● Distributed Time (asynchronous simulation): we execute events in parallel regardless of their time-stamps: processors can proceed with their own local time, but they must guarantee the correctness of the simulation results [9].

Today's VHDL accelerators only exploit centralized-time parallelism, and can reach a limited speedup due to the intrinsic limitations of the technique [10]. By comparing to other simulation fields, we believe that VHDL could greatly benefit from the adoption of a distributed-time technique: we can exploit parallelism and we can scale node performance with the improvements of future microprocessors. Once the distributed-time approach is chosen, we can distinguish between the two basic algorithm classes, conservative [2] and optimistic [5]. However, in this paper we do not commit to either class, to keep the analysis as independent as possible.

2 The Evaluation System

There are two basic ways to analyze the performance of a parallel simulation architecture for VHDL:

● Execution-driven simulation: requires a VHDL compiler, a run-time kernel embedding the required distributed-time algorithm, and an architecture simulator. This method is precise but very expensive in terms of development time, mainly due to the complexity required for a VHDL compiler.
● Trace-driven simulation: requires a trace-driven architecture simulator and a VHDL analyzer to extract signal dependencies. This technique allows us to analyze simulation traces of commercial VHDL simulators to estimate the execution time on a given architecture.

Trace-driven simulation has proved to be useful, rather precise, inexpensive and flexible in several application fields, such as microprocessor performance analysis [6]. For these reasons, we have built the evaluation system on a trace-driven approach that uses traces derived from commercial VHDL simulators. The tools composing the TDPA are:

● The Signal Dependency Extractor (SDE), to analyze the VHDL code of the simulated benchmark and gather the information necessary to the trace analyzer.
● The Partitioner (PART), to find the partitions in the process graph that have to be allocated to each processor.
● The Machine Description Generator (MACH), to generate the information associated to a given interconnection topology in a proper form for the trace analyzer.
● The Instruction Trace Analyzer (ITA), to operate a trace-driven architecture simulation.

Figure 1 shows the data flow among the different tools of the TDPA.

Figure 1: The data flow within the TDPA

¹ This work has been supported by the EU under the ESPRIT-6498 project (CHESS).

2.1 The Signal Dependency Extractor

The front-end of the TDPA is the Signal Dependency Extractor (SDE). Its role is to extract information on static dependencies of each VHDL statement in a VHDL multi-file benchmark.

The VHDL source analyzer. To understand the working principles of the analyzer, it is important to define the concept of dependency. We assume that a VHDL process (or concurrent statement) is a basic unit that cannot be split among processors. As a consequence, the only data structure that may generate inter-process (and inter-processor) dependencies is the signal. For this reason, variables are never considered in a dependency relation, as the correct execution of variable statements is always guaranteed by the original sequentiality. With this hypothesis we only evaluate inter-process parallelism, without considering parallelism within a process, which should not be too interesting due to the inherent sequentiality of the process constructs.

We say that there is a dependency between two events on two different signals if and only if their order of execution has to be preserved. In other words, our concept of dependency is equivalent to a sequentiality constraint. The main problem of a parallel simulation is to satisfy the dependency conditions. The trace-driven simulation approach is something in-between sequential and parallel simulation, as we want to emulate the behavior of a distributed simulation by analyzing the events of a sequential simulation and mapping the sequential events to the corresponding processors, while ensuring that the dependency constraints are satisfied.

At the bottom of any consideration we have the concepts of Execution Time and Simulation Time of a VHDL statement. Two processors at the same execution time (real time) can operate on events with different simulation times and, conversely, two events with the same simulation time can be executed by two processors at different execution times.

When analyzing a VHDL source code, we can extract only static dependencies. Although far from being perfect, the static information in a VHDL statement can be used to obtain a rather precise estimate of the execution profile when analyzing statement traces. In particular, we lose predictability with composite data structures and unknown indexes: in this case we assume a conservative hypothesis, by forcing a dependency with all the elements of the corresponding array.

The hierarchy flattener. The possibility to handle VHDL hierarchy is a major feature of the evaluation system. The instantiation of components is flattened in the pre-simulation phase and each instance assumes a unique identity. A trace file usually shows the statements within a component module and its instance path. From this, we must be able to associate the instance signals to the actual top-level signals, and local signals in instances must be assigned a unique identifier. The HIER module flattens the hierarchy and operates the substitutions in the formal parameters of the component instances so that the statement trace can be interpreted by the ITA in a non-ambiguous way. The flattening procedure is operated by a recursive algorithm that traverses the instance tree.

2.2 The Partitioner

The quality of the partitioning algorithm is paramount when load balancing is static, as in our case. The role of the partitioning algorithm comes into play when the number of VHDL processes is larger than the number of physical processors. From now on, processes and processors assume a separate identity. Formally, the problem is equivalent to a graph partitioning that minimizes the number of cuts while balancing the subset sizes with a defined ratio. The parameters involved in the partitioning process are the number of available processors, the interconnection network topology, the "distance" between processors (in terms of estimated communication latency), and the load of each process to be allocated.

The TDPA partitioner is based on a linear-time implementation of the min-cut algorithm [3] that minimizes the number of "cuts" between two subsets while maintaining a fixed ratio between the subset dimensions. Nodes represent VHDL processes and subsets represent processors. In order to minimize the number of cuts (inter-processor communication) while maintaining a good load balancing, we assign to the weight of each node a function that derives from an estimate of its execution time. This function is approximated by the number of VHDL statements composing the process, weighted by the number of signals written by each signal assignment.
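The cost terms used by the partitioner of Section 2.2 can be illustrated with a minimal Python sketch. It only evaluates a candidate allocation (node weights, processor loads and number of cuts); it does not implement the min-cut heuristic of [3]. The weight rule (one unit per statement, and one unit per written signal for signal assignments), the process names and the dependency edges are assumptions made for illustration.

```python
def node_weight(signals_written_per_statement):
    """Estimated load of one VHDL process.

    Assumed rule: a statement writing n signals costs max(1, n) units,
    so plain statements cost 1 and signal assignments cost 1 per signal.
    """
    return sum(max(1, n) for n in signals_written_per_statement)


def processor_loads(allocation, weights, n_processors):
    """Sum the weights of the processes allocated to each processor."""
    loads = [0] * n_processors
    for process, processor in allocation.items():
        loads[processor] += weights[process]
    return loads


def cut_count(edges, allocation):
    """Number of dependency edges whose endpoints sit on different processors."""
    return sum(1 for a, b in edges if allocation[a] != allocation[b])


# Hypothetical benchmark: per process, signals written by each statement.
processes = {"alu": [1, 1, 2, 0], "ctrl": [1, 0, 0], "reg": [3], "io": [1, 1]}
weights = {name: node_weight(stmts) for name, stmts in processes.items()}

# Hypothetical signal dependencies between processes.
edges = [("alu", "reg"), ("ctrl", "alu"), ("reg", "io"), ("ctrl", "io")]

# One candidate allocation on two processors.
allocation = {"alu": 0, "reg": 0, "ctrl": 1, "io": 1}
print(weights, processor_loads(allocation, weights, 2), cut_count(edges, allocation))
```

A partitioning heuristic would compare such candidates, preferring allocations with fewer cuts among those whose loads respect the required ratio.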

2.3 The Machine Description Extractor

As the choice of the possible architecture configurations is wide, we have developed an automated support for three commonly used topologies: linear interconnected network (2 links per processor), 2-dimensional mesh (4 links per processor), and n-dimensional hypercube (n links per processor). The MACH tool generates a link table (links connecting neighbors) and a hop table (length of the shortest route) for a given topology and a given number of processors. The tables are used by the ITA to estimate the transmission time of an event message between two processors.
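As an illustration of the two tables produced by MACH, the following Python sketch builds a link table for each of the three topologies and derives the hop table by breadth-first search. The function names and the processor numbering are hypothetical; the mesh is assumed to have no wrap-around links, so border processors have fewer than 4 links.

```python
from collections import deque


def linear_links(n):
    """Link table of a linear array of n processors."""
    return {p: [q for q in (p - 1, p + 1) if 0 <= q < n] for p in range(n)}


def mesh_links(rows, cols):
    """Link table of a rows x cols mesh, processors numbered row by row."""
    links = {}
    for r in range(rows):
        for c in range(cols):
            neighbors = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    neighbors.append(rr * cols + cc)
            links[r * cols + c] = neighbors
    return links


def hypercube_links(dim):
    """Link table of a dim-dimensional hypercube (neighbors differ in one bit)."""
    return {p: [p ^ (1 << b) for b in range(dim)] for p in range(1 << dim)}


def hop_table(links):
    """hops[p][q] = length of the shortest route from p to q (BFS per source)."""
    n = len(links)
    hops = [[None] * n for _ in range(n)]
    for src in links:
        hops[src][src] = 0
        queue = deque([src])
        while queue:
            p = queue.popleft()
            for q in links[p]:
                if hops[src][q] is None:
                    hops[src][q] = hops[src][p] + 1
                    queue.append(q)
    return hops


print(hop_table(linear_links(4))[0][3])      # end to end on a 4-node line
print(hop_table(mesh_links(2, 3))[0][5])     # opposite corners of a 2x3 mesh
print(hop_table(hypercube_links(3))[0][7])   # opposite corners of a 3-cube
```

Breadth-first search is used here because all links are assumed to have the same cost, so the first visit of a processor already gives its shortest route.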

2.4 The Instruction Trace Analyzer

The purpose of the ITA (Instruction Trace Analyzer) is to assign an execution time to every VHDL statement. The output of the ITA is a set of information about performance and statistics of the distributed simulation on the given architecture, including: execution time for each processor, critical path length, estimated speed-up over a sequential simulation, parallelism, communication, rollback cost and most used links. The ITA algorithm has the following structure:

    while (trace_line) {
        processor    = get_processor(trace_line);
        latency      = get_latency(trace_line);
        dep_list     = get_dep_list(trace_line);
        current_time = get_time(processor);

        switch (action(trace_line)) {
        case SCHEDULING:
            current_time = processor_time(processor);
            max_dep_time = 0;
            /* latest arrival time among all backward dependencies */
            for (dep = dep_list; dep != NULL; dep = dep->next) {
                dep_processor = get_processor(dep);
                max_dep_time = max(max_dep_time,
                                   exec_time(dep_processor, dep)
                                   + KM * hops(processor, dep_processor));
            }
            if (current_time < max_dep_time) rollback_factor = 1;
            else rollback_factor = 0;
            current_time = KS * latency + KR * rollback_factor
                           + max(current_time, max_dep_time);
            break;
        case DRIVING:
            current_time += latency;
            break;
        }
        update_processor_time(current_time, trace_line);
    }

The basic time unit for the ITA is the execution time of a VHDL statement reading and writing a scalar signal (e.g. a <= not b), and all costs are expressed relative to it. The analyzer is parameterizable in terms of communication cost (KM), rollback cost (KR, when assuming an optimistic algorithm), and scheduling cost (KS). The latency of a statement is statically estimated on the basis of the number of signals read/written by the statement. In the driving phase of a statement, we just add the statement latency, since timing constraints are handled in the scheduling phase. In the scheduling phase of a statement, the ITA computes the estimated execution time by evaluating when all the backward dependencies are satisfied. This includes the computation of the time to receive the messages from the processors that assign the signals generating the dependencies. The execution time of a statement i in a processor p can then be computed as:

$$T_e(i) = A(i) + \max\Bigl(T_e(p),\ \max_{s \in D(i)} \{\, T_e(s) + R(s, p) \,\}\Bigr)$$

where $T_e(p)$ is the current execution time of processor p, $A(i)$ is the latency of statement i, $D(i)$ is the set of signals on which i depends, $T_e(s)$ is the execution time when signal s changed (driving), and $R(s, p)$ is the communication delay of the message s to the processor p. The communication delay is computed as:

$$R(s, p) = M \cdot L(p, p(s))$$

where $p(s)$ is the processor generating s, $L(p, q)$ is the number of hops between processors p and q, and $M$ is the communication latency of a single hop. The statement latency is computed by the SDE, by assuming that $A(i)$ is proportional to the number of signals written by the statement i.

Causality Errors and Rollback. A distributed simulation may incur causality errors when using an optimistic algorithm. Optimistic algorithms allow a processor to resume execution on a message without checking whether its input data are correct. When an erroneous computation is discovered, the algorithm rolls back to a previously saved state [5, 7]. Although the ITA cannot simulate optimistic algorithms, it can be used to estimate the overhead due to rollback. The ITA understands that a rollback situation would occur in a distributed simulation whenever the execution time of a statement i generating a signal s is larger than the execution time of any statement reading s with a simulation time larger than the simulation time of i. In these cases, the ITA adds an overhead time for the rollback recovery routine. To estimate optimistic algorithms, we can also add overhead time for state saving. A limitation of the ITA is that it cannot take into account the propagation of wrong events generated by the execution of incorrect statements. We can only assume that the number of wrong events is proportional to the time difference between the beginning of the wrong path and the instant of the error detection.

3 Experiments and Results

This section describes a set of evaluations performed with the TDPA system that show the potentiality of the system and the effectiveness of the distributed-time approach for VHDL simulation.

(a) Name (des, imdsp, bfly, encdec, h-mean): VHDL lines, Process number, Signal number
    281 4398 493 379 149 351 520 125 ... 410 1155

(b)
Name     Stat. number   Time stamps   Stat./cycle   Event number   Stat./event
des             67984           155           438            275           ...
imdsp         1466406         79635           ...         297729           ...
bfly           543868          3853           141          34822           ...
encdec         235750         28776             8          34671           ...
h-mean            ...           ...           ...            ...           ...
    Stat./event: ... 5 6

Table 1: Information on benchmarks
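The timing rule of Section 2.4 can be sketched in Python as follows. The trace record layout (TraceLine), the analyze function and the bookkeeping of signal driving times are assumptions made for illustration, not the actual ITA data structures; the constants KM, KS and KR play the roles described in Section 2.4, and hops is a hop table such as the one generated by MACH.

```python
from dataclasses import dataclass, field


@dataclass
class TraceLine:
    processor: int      # processor the statement is allocated to
    action: str         # "SCHEDULING" or "DRIVING"
    latency: float      # statement latency, in basic statement units
    signal: str = ""    # signal driven (DRIVING lines only)
    deps: list = field(default_factory=list)  # signals read (SCHEDULING lines)


def analyze(trace, hops, KM=1.0, KS=1.0, KR=0.0):
    """Assign an execution time to every trace line; return per-processor
    finishing times and the number of potential rollbacks."""
    proc_time = {}    # current execution time of each processor
    drive_time = {}   # signal -> (driving processor, execution time of driving)
    rollbacks = 0
    for line in trace:
        now = proc_time.get(line.processor, 0.0)
        if line.action == "SCHEDULING":
            # Latest arrival time of the messages for all backward dependencies.
            ready = 0.0
            for s in line.deps:
                if s in drive_time:
                    src, t = drive_time[s]
                    ready = max(ready, t + KM * hops[line.processor][src])
            rollback = 1 if now < ready else 0
            rollbacks += rollback
            now = KS * line.latency + KR * rollback + max(now, ready)
        elif line.action == "DRIVING":
            now += line.latency
            drive_time[line.signal] = (line.processor, now)
        proc_time[line.processor] = now
    return proc_time, rollbacks


# Toy trace on two directly connected processors (hypothetical data).
hops = [[0, 1], [1, 0]]
trace = [
    TraceLine(0, "SCHEDULING", 1.0),
    TraceLine(0, "DRIVING", 1.0, signal="a"),
    TraceLine(1, "SCHEDULING", 1.0, deps=["a"]),
    TraceLine(1, "DRIVING", 1.0, signal="b"),
]
times, rollbacks = analyze(trace, hops, KM=10.0)
print(times, rollbacks)
```

In this toy trace the statement on processor 1 reads signal a, driven on processor 0 at time 2, so with KM = 10 it cannot be scheduled before time 12 and the sketch counts one potential rollback.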

Table 1 summarizes the static (a) and dynamic (b) characteristics of the benchmarks and the simulated traces. The benchmarks are behavioral/structural, just above the synthesizable abstraction level, and the grain size of the processes varies widely from benchmark to benchmark. This can be seen in the table, where it is evident how the number of statements per simulation cycle varies significantly (from 8 stat/cycle of encdec to 438 stat/cycle of des). All benchmarks have been simulated with significant input stimuli and for enough time to filter out initialization effects. The huge output traces (70,000-1,500,000 VHDL statements) have been filtered and compressed to retain just the necessary information.

Centralized and Distributed Time. A first set of evaluations assumes an uncommitted topology with null communication cost and ideal conservative algorithm (no rollback cost). Figure 2 shows the values of speedup with respect to a sequential simulation for varying number of processors. We can see that the speedup for a centralized algorithm is bounded by a factor of approx. 2, while that of distributed algorithms by a factor of approx. 10. We can observe that the performance of centralized-time algorithms is limited by the "circuit activity", that will not likely increase with complexity. On the contrary, the speedup of distributed-time algorithms is linked with the inherent concurrency in VHDL and will probably increase for larger descriptions.

Figure 2: Ideal performance

Communication Issues. Figure 3 shows the performance of the mesh topology, for varying communication costs. We remind that communication cost is relative to the execution cost (i.e. a cost of 100 means that the time to send an event message is equal to the time to execute 100 VHDL statements). The graph represents the harmonic mean of the speedup of the four benchmarks. We can observe the presence of a threshold in message latency beyond which performance degrades when processors increase, as execution time is mostly spent waiting for messages. The value of the threshold for the considered benchmarks is approximately 10.

Figure 3: Performance vs. communication latency

Role of Rollback. If we consider an optimistic algorithm, the influence of rollback must be evaluated before we can say that the approach is valuable. Figure 4 shows that rollback does not seem a heavy limiting factor for VHDL simulation. We can note that there is a negligible performance degradation when the rollback cost is less than 100 VHDL statements. Beyond that value performance decreases considerably. As is intuitive, the degradation is higher for larger numbers of processors, and indicates that the rollback recovery routine must be optimized if we want to exploit large numbers of processors.

Figure 4: Performance vs. rollback cost

4 Conclusions

In this paper we have shown how to build an evaluation system based on a trace-driven architectural simulator to evaluate parallel VHDL performance. The system can be used to estimate the influence of several parameters on performance, such as number of processors, communication latency and topology, and cost of rollback recovery. Results obtained on a set of medium-sized VHDL benchmarks show that a distributed-time approach for VHDL simulation performs much better than centralized-time. With the same resources, the gain can be measured between a factor of 1.5 (4 processors) and a factor of 4.3 (128 processors). We can note that this last value is likely to increase with higher complexity circuits (the average number of processes in the benchmarks is approximately 125). Performance saturates when the number of processors is larger than 64-128. While for 4 processors the efficiency is about 57%, it decreases to 7% with 128 processors. Although the efficiency is likely to increase with larger circuits, this seems the limit for typical circuit dimensions.

We have also seen that communication latency is an important issue for VHDL: we need to keep the latency of an event message to less than (approximately) 10 times the latency of a VHDL statement to avoid performance degradation with as many as 64-128 processors. Somewhat surprisingly, rollback does not seem to be a limiting factor for VHDL optimistic simulation, as long as the cost for rollback recovery is kept within certain limits. A feasible value can be estimated in the range of 100 (i.e. a rollback recovery operation takes as long as 100 VHDL statements), which seems to be a reasonable value for an optimized algorithm without specific hardware support. Future work will use these kinds of evaluations to extract the specification requirements for a parallel simulation system, in terms of node features and communication parameters of a fine grain distributed-memory multiprocessor machine.

References

[1] J. Armstrong. Chip-Level Modeling with VHDL. Prentice Hall, NJ, 1989.

[2] K. Chandy and J. Misra. Asynchronous distributed simulation via a sequence of parallel computations. Communications of the ACM, 24(11):198-206, Apr. 1981.

[3] C. Fiduccia and R. Mattheyses. A linear-time heuristic for improving network partitioning. In Proc. 19th Design Automation Conference, 1982.

[4] R. Fujimoto. Parallel discrete event simulation. Communications of the ACM, 33(10):30-53, Oct. 1990.

[5] D. Jefferson. Virtual time. ACM Transactions on Programming Languages and Systems, 7(3):404-425, July 1985.

[6] E. Koldinger, S. Eggers, and H. Levy. On the validity of trace driven simulation for multiprocessors. In 18th Annual International Symposium on Computer Architecture (ISCA 18), pages 244-253, Toronto, Canada, May 1991.

[7] Y.-B. Lin and E. Lazowska. A study of time warp rollback mechanisms. ACM Transactions on Modeling and Computer Simulation, 1(1):51-72, Jan. 1991.

[8] L. Soule. Parallel logic simulation: An evaluation of centralized-time and distributed-time algorithms. Technical Report CSL-TR-92-527, Stanford University, June 1992.

[9] L. Soule and A. Gupta. Parallel distributed-time logic simulation. IEEE Design & Test, pages 32-48, Dec. 1989.

[10] J. Willis and D. Siewiorek. Optimizing VHDL compilation for parallel simulation. IEEE Design & Test, pages 42-53, Sept. 1992.

Patient Signature Date. Page 1 of 1. CONSENT FOR EVALUATION AND TREATMENT.pdf. CONSENT FOR EVALUATION AND TREATMENT.pdf. Open. Extract.