ST200: A VLIW Architecture for Media-Oriented Applications

Paolo Faraboschi, Hewlett-Packard Laboratories (Cambridge, MA)
Fred Homewood, STMicroelectronics (Cambridge, MA)

http://www.hpl.hp.com/cambridge/projects/cfp
http://www.st.com

Project Overview

❖ ST200: the first implementation of the “Lx” Processor Family
❖ Joint Hewlett-Packard Labs and STMicroelectronics Design
  • Technology platform for System-On-Chip (SOC) VLIW cores
❖ Lx is an “architecture framework”
  • Customizable and Scalable, high performance VLIW
  • Targeting ‘soft core’ approach for SOC
  • Includes hardware and toolchain
    – Very aggressive retargetable ILP compiler is fundamental
  • Presented as one compatible architecture family to the user


HP Labs: Evolution of Computing, Towards the Information Utility

[Diagram. Labels: Very Large Scale Computing (Internet Data Centers); Pervasive Computing (Transparent); Common Substrate; Specialized Embedded Computing; Software Technologies; Compilers and Tools; Memory Technologies; Design Automation; Custom Processors; Custom Accelerators; Lx Project.]


Lx: Optimized Solution for Speed and Flexibility

Technology    Performance/Cost   Time to market   Time to change functionality
ASIC          Very High          Very Long        Impossible: Redesign
DSP           High               Long             Long
Custom VLIW   High               Short            Short
RISC          Low-Medium         Very Short       Very Short

(Slide axes: Speed, Flexibility)


Target Applications for the Lx Family

❖ Integer computation-intensive media-processing applications
  • Programmed in a high-level language (C, C++, extensions)
  • “DSP-style” computational kernels
  • Significant “control” component
    – Multitasking O/S, user interfaces, interrupt handlers, etc.
❖ Examples of domains that share these common properties
  • Digital still-imaging
  • Video processing
  • Networking, cryptography
  • Audio processing


Architecture and Compiler Design

❖ Lx Design Philosophy
  • “Build only the features you can compile for”
    – Compiler-driven architecture and microarchitecture design
    – Compiler technology already in place
    – Built-in scalability in compiler, tools and ISA
❖ Lx compiler based on HP Labs technology
  • Descendant of the Multiflow compiler
  • Very aggressive and robust ILP compiler
  • Global scheduling (trace scheduling)
  • Table-driven, retargetable
    – Wide class of VLIW architectures, including clusters
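To make "table driven retargetable" concrete, here is a minimal sketch of a scheduler whose only machine knowledge is a resource table. It is a plain greedy list scheduler over one straight-line block with unit latencies. It is not trace scheduling and not the Lx compiler's algorithm. The table layout, the operation classes and the `schedule` function are hypothetical; the resource counts mirror the single-cluster ST200 instance described later in this deck.

```python
# Hypothetical machine description: per-cycle limits by operation class.
MACHINE = {"issue": 4, "alu": 4, "mul": 2, "mem": 1, "br": 1}


def schedule(ops, machine=MACHINE):
    """Greedy list scheduling with unit latencies (illustrative only).

    ops: list of (name, op_class, [names of ops it depends on]),
         given in program order.
    Returns a list of bundles, each a list of op names.
    """
    cycle_of = {}   # op name -> cycle in which it was issued
    bundles = []    # bundles[c] = names issued in cycle c
    used = []       # used[c] = per-class usage counts in cycle c
    for name, cls, deps in ops:
        # Earliest legal cycle: one after the latest producer.
        c = max((cycle_of[d] + 1 for d in deps), default=0)
        while True:
            while len(bundles) <= c:
                bundles.append([])
                used.append({})
            fits_issue = len(bundles[c]) < machine["issue"]
            fits_unit = used[c].get(cls, 0) < machine[cls]
            if fits_issue and fits_unit:
                break
            c += 1  # resource conflict: try the next cycle
        bundles[c].append(name)
        used[c][cls] = used[c].get(cls, 0) + 1
        cycle_of[name] = c
    return bundles


ops = [
    ("ld1", "mem", []),
    ("ld2", "mem", []),
    ("m", "mul", ["ld1", "ld2"]),
    ("a", "alu", ["m"]),
    ("b", "alu", []),
]
one_port = schedule(ops)                         # single load/store unit
two_ports = schedule(ops, dict(MACHINE, mem=2))  # retargeted: two units
```

Changing only the table, as in `dict(MACHINE, mem=2)`, yields a shorter schedule for the same input, which is the point of a table-driven design. The sketch ignores memory ordering, branches and multi-cycle latencies.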


Major Lx Architectural Features

❖ Base VLIW ISA + Extensions
  • Efficient 32-bit integer ISA
  • Extensible for Floating point and SIMD
  • Large set of general purpose and branch registers
  • Simple predication through “select” operations
  • Static branch prediction
❖ Execution Model
  • Precise interrupts, explicit speculation model
❖ Customizable for a specific application
  • Variable number of clusters and registers, cache sizes, and operation latencies
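The "select" bullet can be illustrated with a small model of if-conversion. In the sketch below a branchy clamp is rewritten as straight-line code: compares produce 1-bit conditions (the kind of value a branch register would hold) and a select picks one of two already-computed values. The names `cmp_gt`, `select`, `clamp_branchy` and `clamp_select` are hypothetical and are not ST200 mnemonics.

```python
def cmp_gt(a: int, b: int) -> int:
    """Compare producing a 1-bit condition."""
    return 1 if a > b else 0


def select(cond: int, if_true: int, if_false: int) -> int:
    """Model of a select operation: pick one of two computed values."""
    return if_true if cond else if_false


def clamp_branchy(x: int, lo: int, hi: int) -> int:
    """Reference version with control flow."""
    if x > hi:
        return hi
    if x < lo:
        return lo
    return x


def clamp_select(x: int, lo: int, hi: int) -> int:
    """If-converted version: two compares and two selects, no branches."""
    too_high = cmp_gt(x, hi)
    too_low = cmp_gt(lo, x)
    t = select(too_high, hi, x)
    return select(too_low, lo, t)
```

Both versions agree whenever `lo <= hi`. The if-converted form gives a static scheduler independent operations to pack into a bundle instead of a chain of branches.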


Multi-Cluster Lx Architecture

[Block diagram. Labels: Single PC; Instruction Cache (I$); Fetch and Expansion Unit; Cluster 0, Cluster 1, ..., Cluster N; D$ (shown three times); Inter Cluster Bus; Core Memory Controller and MMU; Interrupt and Exception Controller; Control Registers; System Bus.]


ST200 Lx Cluster Architecture

[Block diagram of Cluster 0. Labels: I$; Pre-Decode; IPU; DPU; Prefetch Buffer; D$; Load Store Unit; Reg File (64 GPR, 32b); Branch RegFile (8 BR, 1 bit); Branch Unit; ALU (x4); Multiply (x2); Exception Control; Control Registers; connections to Other Clusters.]


ST200 Lx Instruction Formats

Each syllable is 32 bits wide (bits 31 down to 0). The two leading bits, shown as "X X", are the Bundle Stop and Cluster Start markers.

Format        Leading bits   Opcode field   Operand fields
INT3-Reg      X X 0 0 0      IntOp-CmpOp    BrReg, Dest, Src2, Src1
INT3-Im       X X 0 0 1      IntOp-CmpOp    Immediate[8:0], Dest / BrReg, Src1
Special       X X 0 1        SpecialOp      Operand(s)
Load/Store    X X 1 0        LoadStoreOp    Immediate[8:0], Dest / Src2, Src1
Call/Branch   X X 1 1        CallBrOp       Immediate[22:0]

(Field legend on the slide: Source1, Source2, Destination, Branch Reg, Immediate.)

❖ Syllables are grouped to form Bundles (VLIW instructions)
  • Variable-size length with Bundle-stop bits
    – Allows for no-op folding
  • Clustering with Cluster-start bit
    – Allows scalability
❖ Simple and extensible coding format
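The bundling scheme can be sketched as a short decoder. It assumes that bit 31 is the Bundle-stop bit; the slide shows two marker bits ahead of the format field but the extracted text does not say which is which, so treat that choice as hypothetical. The format lookup uses the two bits that follow the markers, as listed in the format table. `STOP_BIT`, `syllable_format` and `split_bundles` are illustrative names.

```python
STOP_BIT = 1 << 31  # assumption: bit 31 marks the last syllable of a bundle

# Two-bit format field following the marker bits (from the format table).
FORMATS = {0b00: "INT3", 0b01: "Special", 0b10: "Load/Store", 0b11: "Call/Branch"}


def syllable_format(word: int) -> str:
    """Classify a 32-bit syllable by bits 29:28."""
    return FORMATS[(word >> 28) & 0b11]


def split_bundles(words):
    """Group a flat list of 32-bit syllables into bundles at each stop bit."""
    bundles, current = [], []
    for w in words:
        current.append(w)
        if w & STOP_BIT:
            bundles.append(current)
            current = []
    if current:  # trailing syllables with no stop bit: keep them visible
        bundles.append(current)
    return bundles


# Three syllables: the second and third carry the stop bit.
stream = [0x00000000, 0xA0000000, 0xB0000000]
bundles = split_bundles(stream)
```

Here the first bundle holds two syllables and the second holds one. Neither is padded to the full issue width, which is what no-op folding means in practice.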


ST200 Lx Cluster Pipeline Structure

Pipeline phases: Fetch, Decode, Read, E1, E2, Write

[Pipeline diagram. Labels: 128b Wide ICache Word; I-buffer and Variable Size Decoding; Register File Access and Immediate Generation; IU; Mul; Load/Store; Dcache Fetch Address Generation; PC and Branch Unit; Exception Generation; Register Rollback, Writeback and Bypassing.]


ST200: The first Lx Implementation

❖ First implementation sampling January 2001
  • ST200-STB1 device
    – Includes Lx and peripherals
❖ Instance of single cluster Lx
  • 4 issue processor with 32 kByte I$ and 32 kByte D$
  • 64 x 32-bit registers
  • 2 Multipliers, 4 integer, 1 load/store, 1 branch
  • 300 MHz in 0.25 µm technology (2.5 V)
    – Single-cluster processor core: 5 mm²
    – Total core size: 21 mm²
  • 400 MHz in 0.18 µm technology (1.8 V)


ST200-STB1 Chip Features

❖ 32 mm² in 0.25 µm technology
  Core: 21 mm², Peripherals: 11 mm²
❖ 372 BGA, 1.7 W peak power
❖ 64-bit SDRAM / DDRAM interface

[Chip block diagram. Labels: Lx VLIW Core; ST-Bus Bridge; Clock Control; Timer; Debug Support; Interrupt Controller; Serial Port 1; Serial Port 2; EPROM Port; PCI Port, PCI I/F; DRAM Port, DDRAM Interface.]


Performance Analysis of ST200-STB1

Measurements
  • Benchmarking of ST200-STB1:
    – Performance with a real memory system
    – SDRAM 133 MHz
    – Level 1 32K I$ and 32K D$
    – Code size
  • Compared to Reference platform:
    – StrongARM SA-110 / 275
    – High-end 32b embedded RISC CPU
    – Corel Netwinder system (gcc / Linux)


Industry Standard and Application Specific Benchmark Suites

Name      Description
bmark     Color printing imaging pipeline (optimized)
cpmark    Color copier imaging pipeline (optimized)
crypto    Cryptography benchmarks (optimized)
csc       Color-space conversion loops (optimized)
mpeg2     MPEG-2 decoder (optimized)
tjpeg     JPEG-like encoder / decoder (optimized)
boise     Color printing rendering pipeline (C++)
adpcm     ADPCM audio decoder and encoder
dhry      Dhrystone 1.1 and 2.1 benchmarks
gcc       SPECINT'95 GNU cc compiler
go        SPECINT'95 game of GO
li        SPECINT'95 LISP interpreter
m88ksim   SPECINT'95 M88000 simulator
gs        Ghostscript PostScript interpreter


ST200 Speed relative to StrongARM
For 1 cluster Lx implementation

[Bar chart. Y axis: Speedup vs. StrongARM SA-110, 0 to 15. Series: ST200/300, ST200/400, PIII/800. Benchmarks: gs, m88k, gcc, go, li, dhry, adpcm, boise, tjpeg, mpeg2, bmark, csc, crypto, cpmark.]


ST200: Code Compression System

❖ Software Compression
❖ Hardware Decompression
❖ LAT (line-address-table) based

[Block diagram. Labels: System CPU + Caches; Decompress Device (Optional Peripheral); Address from Device; Data to Device; Region Mapping; Hit; Address Translation; Addr’; Comp; Decomp. Block; Decomp Circuitry; Address to Memory Controller; Data from Memory; Memory Controller; ROM; Flash; DRAM; Other Memory Devices.]
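A line-address table can be sketched in a few lines. The idea shown here is only the addressing scheme: code is compressed one line at a time in software, and a table records where each compressed line starts so that a fetch of an uncompressed address can find its data. The slides do not name the compression algorithm or the line size, so `zlib` and the 32-byte `LINE_SIZE` are stand-in assumptions, and `compress_image` and `fetch_line` are hypothetical names.

```python
import zlib

LINE_SIZE = 32  # hypothetical line size in bytes


def compress_image(code: bytes, line_size: int = LINE_SIZE):
    """Software side: compress each line separately and record its offset."""
    lat, blob = [], bytearray()
    for off in range(0, len(code), line_size):
        lat.append(len(blob))                 # LAT entry: start of this line
        blob += zlib.compress(code[off:off + line_size])
    lat.append(len(blob))                     # sentinel: end of the last line
    return lat, bytes(blob)


def fetch_line(lat, blob: bytes, address: int, line_size: int = LINE_SIZE) -> bytes:
    """Decompression side: translate an uncompressed address through the LAT."""
    index = address // line_size
    start, end = lat[index], lat[index + 1]
    return zlib.decompress(blob[start:end])


code = bytes(range(256)) * 4
lat, blob = compress_image(code)
line = fetch_line(lat, blob, 100)  # the line that contains byte address 100
```

With lines this short a general-purpose compressor such as `zlib` typically expands the data, so the sketch demonstrates random access through the table and says nothing about achievable compression ratios.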


ST200: Code Size and Optimization

[Bar chart. Y axis: Lx Code Size relative to StrongARM, -25% to 100%. Series: Maxspeed, Mincode, Maxspeed Comp., Mincode Comp. Benchmarks: cpmark (36k), tjpeg (42k), li (43k), mpeg2 (55k), crypto (89k), m88k (95k), go (207k), gcc (1011k), AVG.]


Wrapping Up…

❖ VLIW is becoming the predominant embedded / DSP technology
  • Custom-VLIW: right balance of performance and flexibility
❖ HP/ST Lx is a custom-VLIW “technology platform”
  • Can be effectively customized to an application
  • First implementation (ST200-STB1) sampling soon
❖ Customizing embedded VLIW is beneficial
  • 4x-12x gains vs. a general-purpose RISC architecture
    – Starting from C-level code
    – At similar cost and technology
❖ General purpose code performance
  • Comparable to RISC
  • Efficient precise interrupts and exceptions in the ST200
