ST200: A VLIW Architecture for Media-Oriented Applications
Paolo Faraboschi, Hewlett-Packard Laboratories (Cambridge, MA), http://www.hpl.hp.com/cambridge/projects/cfp
Fred Homewood, STMicroelectronics (Cambridge, MA), http://www.st.com
Project Overview
❖ ST200: the first implementation of the “Lx” processor family
❖ Joint Hewlett-Packard Labs and STMicroelectronics design
  • Technology platform for System-On-Chip (SOC) VLIW cores
❖ Lx is an “architecture framework”
  • Customizable and scalable, high-performance VLIW
  • Targeting a “soft core” approach for SOC
  • Includes hardware and toolchain
    – Very aggressive retargetable ILP compiler is fundamental
  • Presented as one compatible architecture family to the user
HP Labs: Evolution of Computing, Towards the Information Utility
[Figure not recovered; surviving label: Very Large Scale Computing (Internet Data Centers)]
Target Applications for the Lx Family
❖ Integer computation-intensive media-processing applications
  • Programmed in a high-level language (C, C++, extensions)
  • “DSP-style” computational kernels
  • Significant “control” component
    – Multitasking O/S, user interfaces, interrupt handlers, etc.
❖ Examples of domains that share these common properties
  • Digital still-imaging
  • Video processing
  • Networking, cryptography
  • Audio processing
Architecture and Compiler Design
❖ Lx Design Philosophy
  • “Build only the features you can compile for”
    – Compiler-driven architecture and microarchitecture design
    – Compiler technology already in place
    – Built-in scalability in compiler, tools and ISA
❖ Lx compiler based on HP Labs technology
  • Descendant of the Multiflow compiler
  • Very aggressive and robust ILP compiler
  • Global scheduling (trace scheduling)
  • Table-driven, retargetable
    – Wide class of VLIW architectures, including clusters
Major Lx Architectural Features
❖ Base VLIW ISA + Extensions
  • Efficient 32-bit integer ISA
  • Extensible for floating point and SIMD
  • Large set of general purpose and branch registers
  • Simple predication through “select” operations
  • Static branch prediction
❖ Execution Model
  • Precise interrupts, explicit speculation model
❖ Customizable for a specific application
  • Variable number of clusters and registers, cache sizes, and operation latencies
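To illustrate what "simple predication through select operations" can mean for C-level code, here is a minimal sketch of if-conversion. The helper `select32` is a hypothetical C model of a select operation (destination = condition ? a : b), not an actual Lx mnemonic or intrinsic, and the sketch assumes `lo <= hi`.

```c
#include <stdint.h>

/* Branching form: each "if" would normally become a conditional branch. */
int32_t clamp_branch(int32_t x, int32_t lo, int32_t hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Hypothetical model of a "select" operation: pick a or b from a condition.
 * This is an illustrative helper, not a real Lx instruction name. */
static inline int32_t select32(int cond, int32_t a, int32_t b)
{
    return cond ? a : b;
}

/* If-converted form: two compares and two selects, no control flow.
 * Assumes lo <= hi, in which case it matches clamp_branch. */
int32_t clamp_select(int32_t x, int32_t lo, int32_t hi)
{
    int below = x < lo;                 /* compare result used as a condition */
    int above = x > hi;
    int32_t t = select32(below, lo, x); /* t = below ? lo : x */
    return select32(above, hi, t);      /* result = above ? hi : t */
}
```

In the second form both compares are independent of each other, so a scheduler is free to place them in the same wide instruction; whether a given compiler does so depends on the target description.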
Multi-Cluster Lx Architecture
[Block diagram not recovered; surviving labels: Single PC, MMU, Cluster 0, Instruction Cache, Fetch and Expansion Unit]
❖ Syllables are grouped to form Bundles (VLIW instructions)
  • Variable-size length with Bundle-stop bits
    – Allows for no-op folding
  • Clustering with Cluster-start bit
    – Allows scalability
❖ Simple and extensible coding format
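The bundle-stop mechanism can be sketched as a small decoder loop. This is an illustration only: the 32-bit syllable width and the position of the stop bit (bit 31 here) are assumptions made for the example, and `bundle_length` and `count_bundles` are hypothetical helper names.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: 32-bit syllables, stop bit in bit 31. */
#define STOP_BIT (UINT32_C(1) << 31)

/* Return the number of syllables in the bundle starting at code[0],
 * or 0 if no stop bit is found within the n syllables available. */
size_t bundle_length(const uint32_t *code, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (code[i] & STOP_BIT)
            return i + 1;   /* the syllable carrying the stop bit ends the bundle */
    }
    return 0;
}

/* Count the complete bundles in a syllable stream. */
size_t count_bundles(const uint32_t *code, size_t n)
{
    size_t bundles = 0;
    size_t pos = 0;
    while (pos < n) {
        size_t len = bundle_length(code + pos, n - pos);
        if (len == 0)
            break;          /* trailing syllables without a stop bit */
        pos += len;
        bundles++;
    }
    return bundles;
}
```

Because a bundle ends wherever the stop bit is set, a bundle with only one useful operation occupies one syllable instead of being padded with explicit no-ops, which is the no-op folding the slide refers to.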
ST200 Lx Cluster Pipeline Structure
[Pipeline diagram not recovered. Phases: Fetch, Decode, Read, E1, E2, Write. Surviving labels: 128b Wide ICache Word; I-buffer And Variable Size Decoding; Register File Access And Immediate Generation; IU; Mul; Load/Store; Register Rollback, Writeback and Bypassing; Dcache Fetch Address Generation; PC and Branch Unit; Exception Generation]
ST200: The First Lx Implementation
❖ First implementation sampling January 2001
  • ST200-STB1 device
    – Includes Lx and peripherals
❖ Instance of single-cluster Lx
  • 4-issue processor with 32 KB I$ and 32 KB D$
  • 64 x 32-bit registers
  • 2 multipliers, 4 integer, 1 load/store, 1 branch
  • 300 MHz in 0.25 µm technology (2.5 V)
    – Single-cluster processor core: 5 mm²
    – Total core size: 21 mm²
  • 400 MHz in 0.18 µm technology (1.8 V)
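As a rough illustration of how the unit counts above constrain what can be packed into one 4-issue bundle, the following sketch checks a candidate bundle against those limits. It reads the slide's figures (4 integer, 2 multipliers, 1 load/store, 1 branch) as per-bundle resource limits, which is an assumption; the names `op_class` and `bundle_fits` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Operation classes, one per kind of functional unit listed on the slide. */
enum op_class { OP_INT, OP_MUL, OP_MEM, OP_BRANCH, OP_CLASS_COUNT };

/* Assumed per-bundle limits: 4 integer, 2 multiply, 1 load/store, 1 branch. */
static const int unit_limit[OP_CLASS_COUNT] = { 4, 2, 1, 1 };

#define ISSUE_WIDTH 4

/* Return true if the n operations could issue together in one bundle. */
bool bundle_fits(const enum op_class *ops, size_t n)
{
    int used[OP_CLASS_COUNT] = { 0 };

    if (n > ISSUE_WIDTH)
        return false;                   /* more operations than issue slots */

    for (size_t i = 0; i < n; i++) {
        int c = (int)ops[i];
        if (c < 0 || c >= OP_CLASS_COUNT)
            return false;               /* unknown operation class */
        if (++used[c] > unit_limit[c])
            return false;               /* not enough units of this kind */
    }
    return true;
}
```

A table-driven compiler would hold limits like `unit_limit` in its machine description rather than in code, so that the same scheduler can be retargeted to a differently sized cluster.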
ST200-STB1 Chip Features
❖ 32 mm² in 0.25 µm technology
  Core: 21 mm², Peripherals: 11 mm²
❖ 372 BGA, 1.7 W peak power
❖ 64-bit SDRAM / DDRAM interface
[Block diagram not recovered; surviving labels: PCI Port, PCI I/F, DRAM Port, DDRAM Interface, Lx VLIW Core, ST-Bus Bridge, Clock Control, Timer, Debug Support, Interrupt Controller, Serial Port 1, Serial Port 2, EPROM Port]
Performance Analysis of ST200-STB1
❖ Measurements
  • Benchmarking of ST200-STB1:
    – Performance with a real memory system
    – SDRAM 133 MHz
    – Level 1 32K I$ and 32K D$
    – Code size
  • Compared to reference platform:
    – StrongARM SA-110 / 275
    – High-end 32b embedded RISC CPU
    – Corel Netwinder system (gcc / Linux)
Industry Standard and Application Specific Benchmark Suites
[Table not recovered; only the column header "Name" survives]
Wrapping Up…
❖ VLIW is becoming the predominant embedded / DSP technology
  • Custom-VLIW: right balance of performance and flexibility
❖ HP/ST Lx is a custom-VLIW “technology platform”
  • Can be effectively customized to an application
  • First implementation (ST200-STB1) sampling soon
❖ Customizing embedded VLIW is beneficial
  • 4x-12x gains vs. a general-purpose RISC architecture
    – Starting from C-level code
    – At similar cost and technology
❖ General purpose code performance
  • Comparable to RISC
  • Efficient precise interrupts and exceptions in the ST200