A Free and Open ISA Enabling a Diversity of CPU Cores and Accelerators
Guy Lemieux CEO
Professor
What is RISC-V?
• 5th generation RISC Instruction Set Architecture (UC Berkeley)
– Andrew Waterman, Yunsup Lee, Dave Patterson, Krste Asanovic
– First public specification released in May 2011
• High-quality, license-free, royalty-free ISA spec.
– Microcontrollers to supercomputers
• Standard maintained by RISC-V Foundation
Arduino Cinque: announced at Bay Area Maker Faire, May 20, 2017
RISC-V ISA “Green Card” (RISC-V Reference Card)

Base Integer Instructions: RV{32|64|128}I
Category      Name                       Fmt  RV{32|64|128}I Base
Loads         Load Byte                  I    LB rd,rs1,imm
              Load Halfword              I    LH rd,rs1,imm
              Load Word                  I    L{W|D|Q} rd,rs1,imm
              Load Byte Unsigned         I    LBU rd,rs1,imm
              Load Half Unsigned         I    L{H|W|D}U rd,rs1,imm
Stores        Store Byte                 S    SB rs1,rs2,imm
              Store Halfword             S    SH rs1,rs2,imm
              Store Word                 S    S{W|D|Q} rs1,rs2,imm
Shifts        Shift Left                 R    SLL{|W|D} rd,rs1,rs2
              Shift Left Immediate       I    SLLI{|W|D} rd,rs1,shamt
              Shift Right                R    SRL{|W|D} rd,rs1,rs2
              Shift Right Immediate      I    SRLI{|W|D} rd,rs1,shamt
              Shift Right Arithmetic     R    SRA{|W|D} rd,rs1,rs2
              Shift Right Arith Imm      I    SRAI{|W|D} rd,rs1,shamt
Arithmetic    ADD                        R    ADD{|W|D} rd,rs1,rs2
              ADD Immediate              I    ADDI{|W|D} rd,rs1,imm
              SUBtract                   R    SUB{|W|D} rd,rs1,rs2
              Load Upper Imm             U    LUI rd,imm
              Add Upper Imm to PC        U    AUIPC rd,imm
Logical       XOR                        R    XOR rd,rs1,rs2
              XOR Immediate              I    XORI rd,rs1,imm
              OR                         R    OR rd,rs1,rs2
              OR Immediate               I    ORI rd,rs1,imm
              AND                        R    AND rd,rs1,rs2
              AND Immediate              I    ANDI rd,rs1,imm
Compare       Set <                      R    SLT rd,rs1,rs2
              Set < Immediate            I    SLTI rd,rs1,imm
              Set < Unsigned             R    SLTU rd,rs1,rs2
              Set < Imm Unsigned         I    SLTIU rd,rs1,imm
Branches      Branch =                   SB   BEQ rs1,rs2,imm
              Branch ≠                   SB   BNE rs1,rs2,imm
              Branch <                   SB   BLT rs1,rs2,imm
              Branch ≥                   SB   BGE rs1,rs2,imm
              Branch < Unsigned          SB   BLTU rs1,rs2,imm
              Branch ≥ Unsigned          SB   BGEU rs1,rs2,imm
Jump & Link   J&L                        UJ   JAL rd,imm
              Jump & Link Register       I    JALR rd,rs1,imm
Synch         Synch thread               I    FENCE
              Synch Instr & Data         I    FENCE.I
System        System CALL                I    SCALL
              System BREAK               I    SBREAK
Counters      ReaD CYCLE                 I    RDCYCLE rd
              ReaD CYCLE upper Half      I    RDCYCLEH rd
              ReaD TIME                  I    RDTIME rd
              ReaD TIME upper Half       I    RDTIMEH rd
              ReaD INSTR RETired         I    RDINSTRET rd
              ReaD INSTR upper Half      I    RDINSTRETH rd

RV Privileged Instructions (32|64|128)
Category       Name                            Fmt  RV mnemonic
CSR Access     Atomic R/W                      R    CSRRW rd,csr,rs1
               Atomic Read & Set Bit           R    CSRRS rd,csr,rs1
               Atomic Read & Clear Bit         R    CSRRC rd,csr,rs1
               Atomic R/W Imm                  R    CSRRWI rd,csr,imm
               Atomic Read & Set Bit Imm       R    CSRRSI rd,csr,imm
               Atomic Read & Clear Bit Imm     R    CSRRCI rd,csr,imm
Change Level   Env. Call                       R    ECALL
               Environment Breakpoint          R    EBREAK
               Environment Return              R    ERET
Trap Redirect  Trap Redirect to Supervisor     R    MRTS
               Redirect Trap to Hypervisor     R    MRTH
               Hypervisor Trap to Supervisor   R    HRTS
Interrupt      Wait for Interrupt              R    WFI
MMU            Supervisor FENCE                R    SFENCE.VM rs1

Optional Multiply-Divide Extension: RV32M
Category    Name                       Fmt  RV32M (Mult-Div)
Multiply    MULtiply                   R    MUL{|W|D} rd,rs1,rs2
            MULtiply upper Half        R    MULH rd,rs1,rs2
            MULtiply Half Sign/Uns     R    MULHSU rd,rs1,rs2
            MULtiply upper Half Uns    R    MULHU rd,rs1,rs2
Divide      DIVide                     R    DIV{|W|D} rd,rs1,rs2
            DIVide Unsigned            R    DIVU rd,rs1,rs2
Remainder   REMainder                  R    REM{|W|D} rd,rs1,rs2
            REMainder Unsigned         R    REMU{|W|D} rd,rs1,rs2

Optional Atomic Instruction Extension: RVA
Category   Name                Fmt  RV{32|64|128}A (Atomic)
Load       Load Reserved       R    LR.{W|D|Q} rd,rs1
Store      Store Conditional   R    SC.{W|D|Q} rd,rs1,rs2
Swap       SWAP                R    AMOSWAP.{W|D|Q} rd,rs1,rs2
Add        ADD                 R    AMOADD.{W|D|Q} rd,rs1,rs2
Logical    XOR                 R    AMOXOR.{W|D|Q} rd,rs1,rs2
           AND                 R    AMOAND.{W|D|Q} rd,rs1,rs2
           OR                  R    AMOOR.{W|D|Q} rd,rs1,rs2
Min/Max    MINimum             R    AMOMIN.{W|D|Q} rd,rs1,rs2
           MAXimum             R    AMOMAX.{W|D|Q} rd,rs1,rs2
           MINimum Unsigned    R    AMOMINU.{W|D|Q} rd,rs1,rs2
           MAXimum Unsigned    R    AMOMAXU.{W|D|Q} rd,rs1,rs2

3 Optional FP Extensions: RV32{F|D|Q}
Category       Name                          Fmt  RV{F|D|Q} (HP/SP,DP,QP)
Load           Load                          I    FL{W,D,Q} rd,rs1,imm
Store          Store                         S    FS{W,D,Q} rs1,rs2,imm
Arithmetic     ADD                           R    FADD.{S|D|Q} rd,rs1,rs2
               SUBtract                      R    FSUB.{S|D|Q} rd,rs1,rs2
               MULtiply                      R    FMUL.{S|D|Q} rd,rs1,rs2
               DIVide                        R    FDIV.{S|D|Q} rd,rs1,rs2
               SQuare RooT                   R    FSQRT.{S|D|Q} rd,rs1
Mul-Add        Multiply-ADD                  R    FMADD.{S|D|Q} rd,rs1,rs2,rs3
               Multiply-SUBtract             R    FMSUB.{S|D|Q} rd,rs1,rs2,rs3
               Negative Multiply-SUBtract    R    FNMSUB.{S|D|Q} rd,rs1,rs2,rs3
               Negative Multiply-ADD         R    FNMADD.{S|D|Q} rd,rs1,rs2,rs3
Sign Inject    SiGN source                   R    FSGNJ.{S|D|Q} rd,rs1,rs2
               Negative SiGN source          R    FSGNJN.{S|D|Q} rd,rs1,rs2
               Xor SiGN source               R    FSGNJX.{S|D|Q} rd,rs1,rs2
Min/Max        MINimum                       R    FMIN.{S|D|Q} rd,rs1,rs2
               MAXimum                       R    FMAX.{S|D|Q} rd,rs1,rs2
Compare        Compare Float =               R    FEQ.{S|D|Q} rd,rs1,rs2
               Compare Float <               R    FLT.{S|D|Q} rd,rs1,rs2
               Compare Float ≤               R    FLE.{S|D|Q} rd,rs1,rs2
Categorize     Classify Type                 R    FCLASS.{S|D|Q} rd,rs1
Move           Move from Integer             R    FMV.S.X rd,rs1
               Move to Integer               R    FMV.X.S rd,rs1
Convert        Convert from Int              R    FCVT.{S|D|Q}.W rd,rs1
               Convert from Int Unsigned     R    FCVT.{S|D|Q}.WU rd,rs1
               Convert to Int                R    FCVT.W.{S|D|Q} rd,rs1
               Convert to Int Unsigned       R    FCVT.WU.{S|D|Q} rd,rs1
Configuration  Read Stat                     R    FRCSR rd
               Read Rounding Mode            R    FRRM rd
               Read Flags                    R    FRFLAGS rd
               Swap Status Reg               R    FSCSR rd,rs1
               Swap Rounding Mode            R    FSRM rd,rs1
               Swap Flags                    R    FSFLAGS rd,rs1
               Swap Rounding Mode Imm        I    FSRMI rd,imm
               Swap Flags Imm                I    FSFLAGSI rd,imm

3 Optional FP Extensions: RV{64|128}{F|D|Q}
Category   Name                        Fmt  RV{F|D|Q} (HP/SP,DP,QP)
Move       Move from Integer           R    FMV.{D|Q}.X rd,rs1
           Move to Integer             R    FMV.X.{D|Q} rd,rs1
Convert    Convert from Int            R    FCVT.{S|D|Q}.{L|T} rd,rs1
           Convert from Int Unsigned   R    FCVT.{S|D|Q}.{L|T}U rd,rs1
           Convert to Int              R    FCVT.{L|T}.{S|D|Q} rd,rs1
           Convert to Int Unsigned     R    FCVT.{L|T}U.{S|D|Q} rd,rs1

Optional Compressed Instructions: RVC
Category     Name                    Fmt  RVC
Loads        Load Word               CL   C.LW rd′,rs1′,imm
             Load Word SP            CI   C.LWSP rd,imm
             Load Double             CL   C.LD rd′,rs1′,imm
             Load Double SP          CI   C.LDSP rd,imm
             Load Quad               CL   C.LQ rd′,rs1′,imm
             Load Quad SP            CI   C.LQSP rd,imm
             Load Byte Unsigned      CL   C.LBU rd′,rs1′,imm
             Float Load Word         CL   C.FLW rd′,rs1′,imm
             Float Load Double       CL   C.FLD rd′,rs1′,imm
             Float Load Word SP      CI   C.FLWSP rd,imm
             Float Load Double SP    CI   C.FLDSP rd,imm
Stores       Store Word              CS   C.SW rs1′,rs2′,imm
             Store Word SP           CSS  C.SWSP rs2,imm
             Store Double            CS   C.SD rs1′,rs2′,imm
             Store Double SP         CSS  C.SDSP rs2,imm
             Store Quad              CS   C.SQ rs1′,rs2′,imm
             Store Quad SP           CSS  C.SQSP rs2,imm
             Float Store Word        CSS  C.FSW rd′,rs1′,imm
             Float Store Double      CSS  C.FSD rd′,rs1′,imm
             Float Store Word SP     CSS  C.FSWSP rd,imm
             Float Store Double SP   CSS  C.FSDSP rd,imm
Arithmetic   ADD                     CR   C.ADD rd,rs1
             ADD Word                CR   C.ADDW rd′,rs2′
             ADD Immediate           CI   C.ADDI rd,imm
             ADD Word Imm            CI   C.ADDIW rd,imm
             ADD SP Imm * 16         CI   C.ADDI16SP x0,imm
             ADD SP Imm * 4          CIW  C.ADDI4SPN rd′,imm
             Load Immediate          CI   C.LI rd,imm
             Load Upper Imm          CI   C.LUI rd,imm
             MoVe                    CR   C.MV rd,rs1
             SUB                     CR   C.SUB rd′,rs2′
             SUB Word                CR   C.SUBW rd′,rs2′
Logical      XOR                     CS   C.XOR rd′,rs2′
             OR                      CS   C.OR rd′,rs2′
             AND                     CS   C.AND rd′,rs2′
             AND Immediate           CB   C.ANDI rd′,rs2′
Shifts       Shift Left Imm          CI   C.SLLI rd,imm
             Shift Right Immediate   CB   C.SRLI rd′,imm
             Shift Right Arith Imm   CB   C.SRAI rd′,imm
Branches     Branch=0                CB   C.BEQZ rs1′,imm
             Branch≠0                CB   C.BNEZ rs1′,imm
Jump         Jump                    CJ   C.J imm
             Jump Register           CR   C.JR rd,rs1
Jump & Link  J&L                     CJ   C.JAL imm
             Jump & Link Register    CR   C.JALR rs1
System       Env. BREAK              CI   C.EBREAK

16-bit (RVC) and 32-bit Instruction Formats
16-bit (RVC): CR, CI, CSS, CIW, CL, CS, CB, CJ
32-bit: R, I, S, SB, U, UJ
RISC-V Base + Standard Extensions
• Base < 50 instructions, 4 variants
– 16 registers: RV32E
– 32 registers: RV32I, RV64I, RV128I
• Standard extensions
– M: Integer multiply/divide
– A: Atomic memory operations (AMOs + LR/SC)
– F: Single-precision floating-point
– D: Double-precision floating-point
– Q: Quad-precision floating-point
• Fairly standard RISC encoding, 32-bit instruction format
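The naming scheme on this slide (a base such as RV32I or RV32E followed by extension letters, as in the RV32IM used by ORCA later in the deck) can be illustrated with a small parser. This is a minimal sketch, not part of the talk: the function name `parse_isa` and the two lookup tables are hypothetical, and only the five extension letters listed on the slide are accepted.

```python
import re

# Extension letters and descriptions as listed on the slide.
STANDARD_EXTENSIONS = {
    "M": "Integer multiply/divide",
    "A": "Atomic memory operations (AMOs + LR/SC)",
    "F": "Single-precision floating-point",
    "D": "Double-precision floating-point",
    "Q": "Quad-precision floating-point",
}

# Register counts per base variant, from the slide.
BASE_REGISTERS = {"E": 16, "I": 32}


def parse_isa(name):
    """Split a name such as 'RV32IM' into (xlen, base, registers, extensions)."""
    match = re.fullmatch(r"RV(32|64|128)([IE])([MAFDQ]*)", name.upper())
    if match is None:
        raise ValueError(f"unrecognized ISA string: {name!r}")
    xlen = int(match.group(1))
    base = match.group(2)
    if base == "E" and xlen != 32:
        # The slide lists the 16-register base only as RV32E.
        raise ValueError("16-register base is only listed as RV32E")
    extensions = [STANDARD_EXTENSIONS[letter] for letter in match.group(3)]
    return xlen, base, BASE_REGISTERS[base], extensions
```

For example, `parse_isa("RV32IM")` reports a 32-bit base with 32 registers plus the integer multiply/divide extension.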
Base ISA Encoding: Always 32 Bits
▪ 32b x 32 registers (32b x 16 in “embedded”)
▪ 64b, 128b variants
▪ rd/rs1/rs2 in fixed location, no implicit registers
▪ Immediate field always sign-extended (sign bit in instr[31])
Copyright © 2010–2014, The Regents of the University of California. All rights reserved.
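The two encoding properties on this slide (register fields in fixed positions, immediates sign-extended from instr[31]) can be sketched as a decoder for the I-type format. This is an illustrative sketch: the helper names are hypothetical, and the bit positions come from the RISC-V base ISA specification rather than from the slide itself.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return (value ^ sign_bit) - sign_bit


def decode_i_type(instr):
    """Decode the fixed fields of a 32-bit I-type instruction word."""
    return {
        "opcode": instr & 0x7F,
        "rd": (instr >> 7) & 0x1F,
        "funct3": (instr >> 12) & 0x7,
        "rs1": (instr >> 15) & 0x1F,
        # imm[11:0] lives in instr[31:20]; instr[31] is the sign bit.
        "imm": sign_extend(instr >> 20, 12),
    }
```

Applied to the word `0x00450513` (`addi a0,a0,4` in the FIR listing later in the deck), this yields rd = 10, rs1 = 10 and imm = 4.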
Rich Instruction Encoding Space
16b:
                                   xxxxxxxxxxxxxxaa   16-bit (aa ≠ 11)
32b:
                  xxxxxxxxxxxxxxxx xxxxxxxxxxxbbb11   32-bit (bbb ≠ 111)
Extra:
          ···xxxx xxxxxxxxxxxxxxxx xxxxxxxxxx011111   48-bit
          ···xxxx xxxxxxxxxxxxxxxx xxxxxxxxx0111111   64-bit
          ···xxxx xxxxxxxxxxxxxxxx xnnnxxxxx1111111   (80+16*nnn)-bit, nnn ≠ 111
          ···xxxx xxxxxxxxxxxxxxxx x111xxxxx1111111   Reserved for ≥192-bits
Byte Address:      base+4           base+2   base
Figure 1.1: RISC-V instruction length encoding.
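The length encoding in Figure 1.1 can be read as a decision procedure on the lowest 16 bits of an instruction. The sketch below follows the bit patterns in the figure; the function name `instruction_length_bits` is a hypothetical choice, and returning `None` for the reserved pattern is an assumption of this example.

```python
def instruction_length_bits(first_parcel):
    """Return the instruction length in bits from its lowest 16-bit parcel.

    Returns None for the reserved (nnn = 111) encoding.
    """
    first_parcel &= 0xFFFF
    if first_parcel & 0b11 != 0b11:
        return 16                      # aa != 11
    if first_parcel & 0b11100 != 0b11100:
        return 32                      # bbb != 111
    if first_parcel & 0b111111 == 0b011111:
        return 48
    if first_parcel & 0b1111111 == 0b0111111:
        return 64
    # Low seven bits are now all ones; nnn sits in bits 14:12.
    nnn = (first_parcel >> 12) & 0b111
    if nnn != 0b111:
        return 80 + 16 * nnn
    return None
```

A fetch unit would apply this check to the parcel at `base` to decide how many further 16-bit parcels belong to the same instruction.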
Compressed Instructions
▪ 16b encoding
  ▪ Expands into one 32b instr.
  ▪ 2-address forms (32 reg.)
  ▪ 3-address forms (8 reg.)
▪ Implementation
  ▪ Decoder ~700 gates
  ▪ 16-bit instruction alignment
  ▪ Compiler-oblivious
[Chart: relative code size, 32-bit address; axis from 80% to 180%; bars at 173%, 140%, 136%, 126%, 126%, 101%, 100%; label: Other Architectures]
RISC-V smallest on SPECint2006
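The point that each 16-bit encoding expands into exactly one 32-bit instruction can be illustrated with a single case, C.ADDI, which becomes `ADDI rd, rd, imm`. This is a sketch of one instruction only, not the ~700-gate decoder mentioned on the slide; the function name is hypothetical and the field positions are taken from the RVC specification, not from the talk.

```python
def expand_c_addi(parcel):
    """Expand a 16-bit C.ADDI encoding into the equivalent 32-bit ADDI word."""
    opcode = parcel & 0b11
    funct3 = (parcel >> 13) & 0b111
    if opcode != 0b01 or funct3 != 0b000:
        raise ValueError("not a C.ADDI encoding")
    rd = (parcel >> 7) & 0x1F
    # 6-bit immediate: imm[5] is parcel bit 12, imm[4:0] is parcel bits 6:2.
    imm6 = (((parcel >> 12) & 1) << 5) | ((parcel >> 2) & 0x1F)
    imm = imm6 - 64 if imm6 & 0x20 else imm6
    # ADDI rd, rd, imm: imm[11:0] | rs1 | funct3=000 | rd | opcode=0010011
    return ((imm & 0xFFF) << 20) | (rd << 15) | (rd << 7) | 0x13
```

Because the expansion is a fixed rewiring of bit fields, the rest of the pipeline only ever sees 32-bit instructions, which is what keeps the scheme compiler-oblivious.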
RISC-V Growth
• Rapid uptake in industry + academia
• Google, Microsoft, IBM, Samsung, AMD, NVIDIA, …
• Growing open software ecosystem
• Variety of proprietary + open-source cores
– Seeded by Rocket SoC Generator (open source)
Survey: 26 RISC-V CPU Designs
Survey: Standard Extensions
Core Count
Survey: Multicore
big.LITTLE
Survey: Availability Status
Licensing
Core and Accelerator Choices
RISC-V enables two kinds of choices…
1. Choices as a user
– Variety of RISC-V providers/products
– Best selection / performance / availability / competitive prices
2. Choices as a provider (SoC designer, CPU architect, …)
– Variety of design/implementation options
– More value to users
FPGA Soft Processor Choices
Vendor         ISA          Interconnect   Tool
Intel/Altera   Nios II      Avalon         Qsys
Xilinx         MicroBlaze   AXI            IPI
Lattice        Mico32       Wishbone3      MicoSystemBuilder
Microsemi      ARM M1       APB/AHB/AXI    SystemBuilder
• Highly fractured
– No common ISA, closed-source CPUs
– No common interconnect
– No common system build tools
– No common IP or software
FPGAs + RISC-V
• Healthy shared ecosystem, cross-platform potential (all vendors)
• Members of RISC-V Foundation
– Microsemi: Rocket chip + SoftConsole IDE
– Lattice
• FPGA-optimized soft cores
– ORCA (VectorBlox): 200 MHz pipelined (all vendors)
– PicoRV32 (Clifford Wolf): 300+ MHz multicycle (Xilinx)
– GRVI (Jan Gray): 1,000+ pipelined cores (Xilinx)
Case Study
• VectorBlox mission
– Design custom vector accelerators
• But…
– No access to soft processor source code
– Hard ARM cores are inflexible
– Fragmented ecosystems: very costly/risky for small IP vendors
• VectorBlox ORCA http://www.github.com/vectorblox/orca
– Open source RISC-V, optional proprietary extensions
• Lightweight vector: share the RISC-V ALU
• Full vector: up to 256 parallel ALUs
Vector Programming
Data-level parallelism:
for ( i=0; i<8; i++ ) A[i] = B[i] * C[i];
Vector form:
set VL, 8
vmult A, B, C
[Figure: 4 vector lanes; source vectors B, C; destination vector A]
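The `set VL, 8` / `vmult A, B, C` pair on this slide can be modeled in software to show how one vector instruction replaces the eight-iteration loop. This is an idealized sketch: the function `vmult`, its `lanes` parameter and the one-group-per-cycle cost model are assumptions made for illustration, not a description of the actual hardware timing.

```python
def vmult(b, c, vector_length, lanes=4):
    """Elementwise multiply of the first vector_length elements of b and c.

    Returns (result, cycles), assuming one group of `lanes` elements
    completes per cycle (an idealized model).
    """
    a = []
    cycles = 0
    for start in range(0, vector_length, lanes):
        stop = min(start + lanes, vector_length)
        # All elements in this group are processed in parallel lanes.
        a.extend(b[i] * c[i] for i in range(start, stop))
        cycles += 1
    return a, cycles
```

With a vector length of 8 and 4 lanes, the model finishes in 2 groups instead of 8 scalar iterations.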
System Model
[Diagram labels: Off-chip RAM; Memory Bus; ORCA Processor (RV32IM); LVE; Scratchpad]
• Vector instructions operate only on scratchpad
– Stream data through RISC-V ALU
– Address generators in hardware
Lightweight Vector Extension (LVE)
Vadd  Vsub  Vsll  Vsrl  Vsra  Vxor  Vor  Vand  Vslt  Vsltu   // RV32I base (vectorized)
Vmul  Vmulh  Vdiv  Vrem                                      // RV32M multiplication (vectorized)
Vcmv_z  Vcmv_nz                                              // VectorBlox: conditional move
Vtype                                                        // VectorBlox: data type, vector length
FIR filter (12x speedup)

• RV32IM

00000030 <... long*, long*, int, int)>:
        sub   a3,a3,a4
        blez  a3,98 <.L6>
        slli  a3,a3,0x2
        slli  t3,a4,0x2
        add   t4,a0,a3
        add   t3,a2,t3
        li    t5,1

0000004c <.L10>:
  4c:   0005a683   lw    a3,0(a1)
  50:   00062803   lw    a6,0(a2)
  54:   00460793   addi  a5,a2,4
  58:   00058893   mv    a7,a1
  5c:   03068833   mul   a6,a3,a6
  60:   01052023   sw    a6,0(a0)
  64:   02ef5263   ble   a4,t5,88 <.L11>

00000068 <.L13>:
  68:   0048a683   lw    a3,4(a7)
  6c:   0007a303   lw    t1,0(a5)
  70:   00478793   addi  a5,a5,4
  74:   00488893   addi  a7,a7,4
  78:   026686b3   mul   a3,a3,t1
  7c:   00d80833   add   a6,a6,a3
  80:   01052023   sw    a6,0(a0)
  84:   fefe12e3   bne   t3,a5,68 <.L13>

00000088 <.L11>:
  88:   00450513   addi  a0,a0,4
  8c:   00458593   addi  a1,a1,4
  90:   faae9ee3   bne   t4,a0,4c <.L10>
  94:   00008067   ret

• RV32IM + LVE

00000000 <... long*, long*, int, int)>:
        lui   a5,0x0
        sw    a4,0(a5)
        sub   a3,a3,a4
        blez  a3,2c <.L1>
        slli  a3,a3,0x2
        add   a3,a1,a3

00000018 <.L3>:
  18:   08c5fe2b   vtype.www a1,a2
  1c:   a6e50cab   vmul.vv.1d.sss.acc a0,a4
  20:   00458593   addi  a1,a1,4
  24:   00450513   addi  a0,a0,4
  28:   fed598e3   bne   a1,a3,18 <.L3>

0000002c <.L1>:
  2c:   00008067   ret

Vectors: 1 instruction, 0 stalls; 20 bytes code for 2 loops
No vectors: 8 instructions, N stalls; 72 bytes code for 2 loops
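The two listings appear to compute the same thing: an outer loop over output samples and an inner multiply-accumulate over the filter taps, with the inner loop collapsing to a single accumulating vector multiply in the LVE version. The sketch below expresses both shapes in Python. The function names are hypothetical, and the output count of `len(samples) - len(coeffs)` is an assumption that mirrors the `sub a3,a3,a4` / `blez` guard at the top of each listing; the original C source is not shown in the deck.

```python
from operator import mul


def fir_scalar(samples, coeffs):
    """Two nested loops, as in the RV32IM listing."""
    n_out = len(samples) - len(coeffs)
    out = []
    for i in range(n_out):
        acc = 0
        for j, coeff in enumerate(coeffs):
            acc += samples[i + j] * coeff   # lw, lw, mul, add per tap
        out.append(acc)
    return out


def fir_vector(samples, coeffs):
    """Inner loop replaced by one multiply-accumulate over a whole window."""
    taps = len(coeffs)
    n_out = len(samples) - taps
    # Each output is a single accumulating vector multiply of length `taps`.
    return [sum(map(mul, samples[i:i + taps], coeffs)) for i in range(n_out)]
```

Both functions return identical results; the difference is that the vector form issues one operation per output instead of one loop iteration per tap.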
Real application: “butt-posting”, “butt-tweeting”
cause? solution!
Deep Learning: Face Detector
Built using ORCA RISC-V + Lightweight Vectors + BNN Accel.
https://www.youtube.com/watch?v=_0rWDrOGqrk
Deep Learning w/ Binary Weights
[Network diagram: rgb input, 3 maps 32 x 32 → convolve → 48 maps 32 x 32 → convolve → 48 maps 32 x 32 → maxpool → 48 maps 16 x 16 → convolve → 96 maps 16 x 16 → convolve → 96 maps 16 x 16 → maxpool → 96 maps 8 x 8 → convolve → 128 maps 8 x 8 → convolve → 128 maps 8 x 8 → maxpool → 128 maps 4 x 4 → dense → 256 x 1 → dense → 256 x 1 → dense → 10 x 1]
• Binary weights: only +1/-1, no *
• Database > 75,000 face images
• Trained 99% accurate
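The "only +1/-1, no *" bullet means every weighted sum reduces to additions and subtractions. A minimal sketch of that idea follows; the function name `binary_dot` is hypothetical and this is a software illustration only, not the accelerator's datapath.

```python
def binary_dot(activations, weights):
    """Dot product with weights restricted to +1 or -1, using no multiplies."""
    total = 0
    for x, w in zip(activations, weights):
        if w == 1:
            total += x      # weight +1: add the activation
        elif w == -1:
            total -= x      # weight -1: subtract the activation
        else:
            raise ValueError("binary weights must be +1 or -1")
    return total
```

Since each weight needs only one bit of storage and selects between add and subtract, the multiplier disappears from the inner loop of every convolve and dense layer.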
Binary-weight NN Accelerator
[Datapath diagram labels: srcA (4x8b); srcB (4x8b); row0, row1, row2; align[1:0]; invert[8:0]; dst]
Implementation uses two of these in parallel.
Face Detector
7.3s   RISC-V soft core (unaccelerated)
1.0s   + vector
0.1s   + BNN accel. (71x)
       + power-optimized (< 5mW)
[Diagram labels: VectorBlox ORCA; FPGA: UltraPlus, 5,280 LUTs; VGA Camera (3mm x 3mm)]
Low-power Face Detection
• RISC-V made this possible!
– Open ISA → FPGA-based CPU + gcc
– Lightweight vectors → 10x performance
– BNN accelerator → 73x performance
  (lightweight vectors + BNN accelerator: 70x combined)
– Power optimizations → about 15x lower power
• Not possible with closed-source CPUs
– Could not add lightweight vectors
– Could not add BNN accelerator
– Could not make power optimizations
Deep Learning Application: YOLO
Full Vector Extension (MXP)
[Block diagram labels: Memory (DDR3/4); AXI (Master, Slave); RISC-V with I$ and D$; RISC-V Custom Instructions; Instruction & DMA Queue; DMA Engine; Vector Engine: VectorBlox MXP™ Matrix Processor]
Inside the VectorBlox MXP
[Block diagram labels: (2) Custom Instructions; Instruction & DMA Queue; DMA and Vector Work Queues, Instruction Decode & Control; Address Generation (A, B, C, D); DMA Engine (M, S); AXI Bus; Scratchpad (Rd SrcA, Rd SrcB, Wr DstC, R/W DMA); Align A, Align B, Align C; ALU Pipelines; Σ; + CNN Accelerator, 2000 ops/cycle]
Deep Learning Performance
• Same 28nm FPGA
– ARM Cortex A9: 667 MHz
– VectorBlox MXP: 160 MHz (1/4 of Cortex A9)
• About 300x speedup over A9
– RISC-V Custom Instructions: Vectors + CNN Accel.
– Parallelism: 2,000 operations / clock cycle
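The figures on this slide can be combined into two back-of-envelope numbers: the peak operation rate implied by 2,000 operations per cycle at 160 MHz, and the per-cycle advantage implied by a 300x speedup at roughly a quarter of the clock rate. The sketch below only restates the slide's numbers; the variable names are hypothetical and the results are peak or derived values, not measurements.

```python
MXP_CLOCK_HZ = 160_000_000   # VectorBlox MXP clock, from the slide
A9_CLOCK_HZ = 667_000_000    # ARM Cortex A9 clock, from the slide
OPS_PER_CYCLE = 2_000        # parallelism quoted on the slide
SPEEDUP_OVER_A9 = 300        # "about 300x", from the slide

# Peak throughput if every cycle issues the full 2,000 operations.
peak_ops_per_second = OPS_PER_CYCLE * MXP_CLOCK_HZ

# The MXP runs at about a quarter of the A9 clock ...
clock_ratio = MXP_CLOCK_HZ / A9_CLOCK_HZ

# ... so the work done per clock cycle must be higher than 300x.
per_cycle_speedup = SPEEDUP_OVER_A9 / clock_ratio
```

Under these assumptions the peak rate is 320 billion operations per second, and the per-cycle advantage over the A9 works out to roughly 1,250x.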
Deep Learning on VectorBlox
https://www.youtube.com/watch?v=SS2Rc5zpwxs
Summary
• RISC-V is a free and open ISA
– Enables healthy software (and hardware) ecosystems
– Variety of cores available / in development
• Impact on VectorBlox
– Single ecosystem for all FPGA vendors
  • Shared SW + HW costs, leverage open source contributions
– Access to CPU internals → performance + power
– Full vectors (MXP) → 10x to 10,000x performance (vs. soft core)
  • CNN application → about 300x performance (vs. hard ARM)
– Lightweight vectors (LVE) → 10x performance
  • BNN application → 70x performance
  • Power optimizations → about 15x lower power
RISC-V CPU Survey Respondents
Core                        Source
Rocket                      github.com/freechipsproject/
RV01                        n/a
Freedom Everywhere          SiFive.com
IntenCore                   intensivate.com
Mythic IPU                  mythic-ai.com
YARVI                       github.com/tommythorn/yarvi
riscV                       n/a
RISCVBusiness               purduesoceet.github.io
PulpTR                      ankasys.com
GRVI Phalanx                fpga.org/gray-research-llc
proc_rv32ec_p2              astc-design.com
SCR1                        syntacore.com
proc_rv32ic_p5              astc-design.com
SCR2                        syntacore.com
KCP53000                    kestrelcomputer.github.io/kestrel
SCR3                        syntacore.com
RV12                        roalogic.com
SCR4                        syntacore.com
PicoRV32                    github.com/cliffordwolf/picorv32
SCR5                        syntacore.com
Klessydra processing core   en.uniroma1.it
Celerity                    n/a
BOOM                        ucb-bar.github.io/riscv-boom
riscv-lanzones              github.com/e19293001/riscvlanzones
PULP                        github.com/pulp-platform
ORCA                        github.com/vectorblox/orca