ST200: A VLIW Architecture for Media-Oriented Applications

Paolo Faraboschi, Hewlett-Packard Laboratories (Cambridge, MA)
Fred Homewood, STMicroelectronics (Cambridge, MA)

http://www.hpl.hp.com/cambridge/projects/cfp
http://www.st.com

Project Overview
❖ ST200: the first implementation of the “Lx” Processor Family
❖ Joint Hewlett-Packard Labs and STMicroelectronics design
  • Technology platform for System-On-Chip (SOC) VLIW cores
❖ Lx is an “architecture framework”
  • Customizable and scalable, high performance VLIW
  • Targeting ‘soft core’ approach for SOC
  • Includes hardware and toolchain
    – Very aggressive retargetable ILP compiler is fundamental
  • Presented as one compatible architecture family to the user

HP Labs: Evolution of Computing, Towards the Information Utility
[Figure: diagram with the labels Very Large Scale Computing (Internet Data Centers); Pervasive Computing (Transparent); Common Substrate; Specialized Embedded Computing; Software Technologies; Compilers and Tools; Memory Technologies; Design Automation; Custom Processors; Custom Accelerators; Lx Project.]

Lx: Optimized Solution for Speed and Flexibility

Technology    Performance/Cost   Time to market   Time to change functionality
ASIC          Very High          Very Long        Impossible: Redesign
DSP           High               Long             Long
Custom VLIW   High               Short            Short
RISC          Low-Medium         Very Short       Very Short

[Figure axis labels: Speed, Flexibility]

Target Applications for the Lx Family
❖ Integer computation-intensive media-processing applications
  • Programmed in a high-level language (C, C++, extensions)
  • “DSP-style” computational kernels
  • Significant “control” component
    – Multitasking O/S, user interfaces, interrupt handlers, etc.
❖ Examples of domains that share these common properties
  • Digital still-imaging
  • Video processing
  • Networking, cryptography
  • Audio processing

Architecture and Compiler Design
❖ Lx Design Philosophy
  • “Build only the features you can compile for”
    – Compiler-driven architecture and microarchitecture design
    – Compiler technology already in place
    – Built-in scalability in compiler, tools and ISA
❖ Lx compiler based on HP Labs technology
  • Descendant of the Multiflow compiler
  • Very aggressive and robust ILP compiler
  • Global scheduling (trace scheduling)
  • Table-driven retargetable
    – Wide class of VLIW architectures, including clusters

Major Lx Architectural Features
❖ Base VLIW ISA + Extensions
  • Efficient 32-bit integer ISA
  • Extensible for floating point and SIMD
  • Large set of general purpose and branch registers
  • Simple predication through “select” operations
  • Static branch prediction
❖ Execution Model
  • Precise interrupts, explicit speculation model
❖ Customizable for a specific application
  • Variable number of clusters and registers, cache sizes, and operation latencies
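
The “select” style of predication listed above can be illustrated with a small model. This is only a sketch: `cmp_lt`, `select`, and the two `clamp_*` functions are hypothetical names chosen for illustration, not ST200 mnemonics. It shows how a compiler might replace data-dependent branches by computing both candidate values and choosing one with a 1-bit condition.

```python
def cmp_lt(a: int, b: int) -> int:
    """Model of a compare that produces a 1-bit condition (hypothetical name)."""
    return 1 if a < b else 0


def select(cond: int, if_true: int, if_false: int) -> int:
    """Model of a select operation: pick one of two already-computed values."""
    return if_true if cond else if_false


def clamp_branching(x: int, lo: int, hi: int) -> int:
    # Control-flow version: two data-dependent branches.
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x


def clamp_if_converted(x: int, lo: int, hi: int) -> int:
    # If-converted version: only compares and selects, no branches.
    # Assumes lo <= hi, as the branching version does in practice.
    below = cmp_lt(x, lo)
    above = cmp_lt(hi, x)
    t = select(below, lo, x)
    return select(above, hi, t)
```

Both compares are independent of each other, so a wide machine could issue them in the same bundle; whether that pays off depends on the operation latencies of the particular implementation.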

Multi-Cluster Lx Architecture
[Figure: block diagram with the labels Single PC; MMU; Instruction Cache (I$); Fetch and Expansion Unit; Interrupt and Exception Controller; Control Registers; Core Memory Controller; Cluster 0, Cluster 1, ..., Cluster N; D$ (three instances); Inter Cluster Bus; System Bus.]

ST200 Lx Cluster Architecture
[Figure: block diagram of Cluster 0 with the labels I$; Pre-Decode; IPU; DPU; Exception Control; Control Registers; Branch Unit; Branch RegFile, 8 BR (1 bit); Reg File, 64 GPR (32b); Multiply (two units); ALU (four units); Load Store Unit; D$; Prefetch Buffer; connections to Other Clusters.]

ST200 Lx Instruction Formats

Format        Leading bits   Opcode field
INT3-Reg      X X 0 0 0      IntOp-CmpOp
INT3-Im       X X 0 0 1      IntOp-CmpOp
Special       X X 0 1        SpecialOp
Load/Store    X X 1 0        LoadStoreOp
Call/Branch   X X 1 1        CallBrOp

[Figure: 32-bit syllable layout, bits 31 to 0. The two X bits are the Cluster Start and Bundle Stop bits. Operand field labels shown in the figure: Source1 (Src1); Source2 (Src2); Destination (Dest); Branch Reg (BrReg); Dest / BrReg; Dest / Src2; Operand(s); Immediate[8:0]; Immediate[22:0].]

❖ Syllables are grouped to form Bundles (VLIW instructions)
  • Variable-size length with Bundle-stop bits
    – Allows for no-op folding
  • Clustering with Cluster-start bit
    – Allows scalability
❖ Simple and extensible coding format
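
To make the bundle-stop mechanism concrete, here is a minimal sketch of how a decoder might split a stream of 32-bit syllables into variable-length bundles and classify each syllable's format. The bit positions are assumptions made for illustration: the slide only shows two leading marker bits followed by the format bits, so placing the Bundle-stop bit at bit 31 and the format field at bits 29..27 is a guess, and `split_bundles` and `syllable_format` are hypothetical names.

```python
STOP_BIT = 1 << 31  # assumption: position of the Bundle-stop bit


def split_bundles(syllables):
    """Group 32-bit syllables into bundles; a set stop bit ends a bundle."""
    bundles, current = [], []
    for s in syllables:
        current.append(s)
        if s & STOP_BIT:
            bundles.append(current)
            current = []
    if current:
        raise ValueError("stream ends without a stop bit")
    return bundles


def syllable_format(s):
    """Classify a syllable from bits 29..27 (assumed positions)."""
    top = (s >> 27) & 0b111
    if top == 0b000:
        return "INT3-Reg"
    if top == 0b001:
        return "INT3-Im"
    # The remaining formats are identified by bits 29..28 alone.
    return {0b01: "Special", 0b10: "Load/Store", 0b11: "Call/Branch"}[top >> 1]


stream = [0x00000000, 0x90000000, 0x88000000]
bundles = split_bundles(stream)
print([len(b) for b in bundles])
```

Under these assumptions the three-syllable stream decodes to one bundle of two syllables and one bundle of a single syllable. Neither bundle is padded to the full issue width, which is the effect the slide calls no-op folding.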

ST200 Lx Cluster Pipeline Structure
[Figure: pipeline diagram with six phases: Fetch, Decode, Read, E1, E2, Write. Block labels: 128b Wide ICache Word; I-buffer And Variable Size Decoding; Register File Access And Immediate Generation; IU; Mul; Load/Store; Dcache Fetch; Address Generation; PC and Branch Unit; Exception Generation; Register Rollback, Writeback and Bypassing.]

ST200: The First Lx Implementation
❖ First implementation sampling January 2001
  • ST200-STB1 device
    – Includes Lx and peripherals
❖ Instance of single cluster Lx
  • 4 issue processor with 32 kByte I$ and 32 kByte D$
  • 64 x 32-bit registers
  • 2 multipliers, 4 integer, 1 load/store, 1 branch
  • 300 MHz in 0.25µ technology (2.5 V)
    – Single-cluster processor core: 5 mm²
    – Total core size: 21 mm²
  • 400 MHz in 0.18µ technology (1.8 V)
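
The functional-unit mix listed above determines which operations can share a bundle. The sketch below is a deliberately simple greedy list scheduler for a single basic block, offered only to illustrate how those limits constrain packing. It is not the Lx compiler's algorithm (which the slides describe as trace scheduling), and it assumes every operation has a latency of one bundle. The unit names and the `schedule` function are hypothetical.

```python
LIMITS = {"alu": 4, "mul": 2, "mem": 1, "br": 1}  # per-bundle units from the slide
ISSUE_WIDTH = 4


def schedule(ops):
    """Pack ops into bundles.

    ops: list of (name, unit, deps) in program order, where deps names
    earlier operations. Assumption: every operation has unit latency.
    """
    done_at = {}  # name -> index of the bundle that issued it
    pending = list(ops)
    bundles = []
    while pending:
        cycle = len(bundles)
        used = {unit: 0 for unit in LIMITS}
        bundle, rest = [], []
        for name, unit, deps in pending:
            # An operand is ready only if it was issued in an earlier bundle.
            ready = all(d in done_at and done_at[d] < cycle for d in deps)
            if ready and len(bundle) < ISSUE_WIDTH and used[unit] < LIMITS[unit]:
                bundle.append(name)
                used[unit] += 1
                done_at[name] = cycle
            else:
                rest.append((name, unit, deps))
        if not bundle:
            raise ValueError("dependence on an unknown or later operation")
        bundles.append(bundle)
        pending = rest
    return bundles


ops = [
    ("ld_a", "mem", []),
    ("ld_b", "mem", []),
    ("mul", "mul", ["ld_a", "ld_b"]),
    ("add", "alu", ["mul"]),
    ("inc", "alu", []),
    ("cmp", "alu", ["inc"]),
    ("br", "br", ["cmp", "add"]),
]
for bundle in schedule(ops):
    print(bundle)
```

In this example the single load/store unit forces the two loads into different bundles, so the seven operations occupy five bundles even though the issue width is four.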

ST200-STB1 Chip Features
❖ 32 mm² in 0.25µ technology (Core: 21 mm², Peripherals: 11 mm²)
❖ 372 BGA, 1.7 W peak power
❖ 64-bit SDRAM / DDRAM interface

[Figure: chip block diagram with the labels Lx VLIW Core; ST-Bus Bridge; Clock Control; Timer; Debug Support; Interrupt Controller; PCI I/F (PCI Port); DDRAM Interface (DRAM Port); Serial Port 1; Serial Port 2; EPROM Port.]

Performance Analysis of ST200-STB1 Measurements
• Benchmarking of ST200-STB1:
  – Performance with a real memory system
  – SDRAM 133 MHz
  – Level 1 32K I$ and 32K D$
  – Code size
• Compared to reference platform:
  – StrongARM SA-110 / 275
  – High-end 32b embedded RISC CPU
  – Corel Netwinder system (gcc / Linux)

Industry Standard and Application Specific Benchmark Suites

Name      Description
bmark     Color printing imaging pipeline (optimized)
cpmark    Color copier imaging pipeline (optimized)
crypto    Cryptography benchmarks (optimized)
csc       Color-space conversion loops (optimized)
mpeg2     MPEG-2 decoder (optimized)
tjpeg     JPEG-like encoder / decoder (optimized)
boise     Color printing rendering pipeline (C++)
adpcm     ADPCM audio decoder and encoder
dhry      Dhrystone 1.1 and 2.1 benchmarks
gcc       SPECINT'95 GNU cc compiler
go        SPECINT'95 game of GO
li        SPECINT'95 LISP interpreter
m88ksim   SPECINT'95 M88000 simulator
gs        Ghostscript PostScript interpreter

ST200 Speed relative to StrongARM (for 1 cluster Lx implementation)
[Figure: bar chart. Vertical axis: Speedup vs. StrongARM SA-110, scale 0 to 15. Series: ST200/300, ST200/400, PIII/800. Benchmarks: gs, m88k, gcc, go, li, dhry, adpcm, boise, tjpeg, mpeg2, bmark, csc, crypto, cpmark.]

ST200: Code Compression System
❖ Software Compression
❖ Hardware Decompression
❖ LAT (line-address-table) based

[Figure: system diagram with the labels System CPU + Caches; Decompress Device (Optional Peripheral); Memory Controller; ROM; Flash; DRAM; Other Memory Devices. Inside the Decompress Device: Region Mapping; Address Translation; Hit; Comp; Decomp. Block; Decomp Circuitry. Signals: Address from Device; Data to Device; Addr’; Address to Memory Controller; Data from Memory.]
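
A line-address-table scheme can be sketched in a few lines. The model below is illustrative only: the slide does not name the compression algorithm or the line size, so `zlib` is used as a stand-in compressor and `LINE_SIZE` is an assumed value; `compress_image` and `fetch_line` are hypothetical names. The point it shows is that each line is compressed independently and the table maps a line index to the offset of its compressed bytes, so any line can be fetched without decompressing the ones before it.

```python
import zlib

LINE_SIZE = 32  # bytes per uncompressed line (assumed value)


def compress_image(code: bytes):
    """Compress each line separately; return (blob, lat).

    lat[i] is the byte offset of compressed line i inside blob. A final
    entry marks the end, so lat[i + 1] - lat[i] is the length of line i.
    """
    blob, lat = bytearray(), []
    for start in range(0, len(code), LINE_SIZE):
        lat.append(len(blob))
        blob += zlib.compress(code[start:start + LINE_SIZE])
    lat.append(len(blob))
    return bytes(blob), lat


def fetch_line(blob: bytes, lat, address: int) -> bytes:
    """Return the uncompressed line that contains `address`."""
    index = address // LINE_SIZE
    return zlib.decompress(blob[lat[index]:lat[index + 1]])


code = bytes(range(256)) * 2
blob, lat = compress_image(code)
line = fetch_line(blob, lat, 100)
print(line == code[96:128])
```

A general-purpose compressor such as `zlib` does not shrink lines this short, so the sketch demonstrates only the addressing scheme, not any compression ratio.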

ST200: Code Size and Optimization
[Figure: bar chart. Vertical axis: Lx Code Size relative to StrongARM, scale -25% to 100%. Series: Maxspeed, Mincode, Maxspeed Comp., Mincode Comp. Benchmarks: cpmark (36k), tjpeg (42k), li (43k), mpeg2 (55k), crypto (89k), then m88k, go, and gcc (size labels 207k, 95k, and 1011k, in the order extracted), and AVG.]

Wrapping Up…
❖ VLIW is becoming the predominant embedded / DSP technology
  • Custom-VLIW: right balance of performance and flexibility
❖ HP/ST Lx is a custom-VLIW “technology platform”
  • Can be effectively customized to an application
  • First implementation (ST200-STB1) sampling soon
❖ Customizing embedded VLIW is beneficial
  • 4x-12x gains vs. a general-purpose RISC architecture
    – Starting from C-level code
    – At similar cost and technology
❖ General purpose code performance
  • Comparable to RISC
  • Efficient precise interrupts and exceptions in the ST200
