A 1.41W H.264/AVC REAL-TIME ENCODER SoC FOR HDTV1080P
Zhenyu Liu(1), Yang Song(1), Ming Shao(1), Shen Li(1), Lingfeng Li(1), Shunichi Ishiwata(2), Masaki Nakagawa(2), Satoshi Goto(1), Takeshi Ikenaga(1)
(1) The Graduate School of IPS, Waseda University, Japan
(2) Toshiba Corporation, Kawasaki, Japan
A 1.41W H.264/AVC REAL-TIME ENCODER SoC FOR HDTV1080P – p.1/26
Outline
  H.264/AVC Introduction
  Design Specifications
  Optimizations of the design
    IME
    FME and Intra engine
    Embedded MeP Processor
    Embedded SiS DRAM
  Chip features and performance
  Conclusion
H.264/AVC Introduction
Superior performance
  25%-45% bit-rate reduction relative to MPEG-4
  50%-70% bit-rate reduction relative to MPEG-2
Prominent techniques in H.264/AVC
  Variable block size (VBS): 4x4, 4x8, 8x4, 8x8, 8x16, 16x8, 16x16, with multiple reference frames (MRF)
  1/4-pel accurate motion prediction
  17 intra prediction modes
  Lagrangian cost oriented RD optimization
  In-loop deblocking filter
  Context-based adaptive variable length entropy coding
Design Specifications
TSMC 0.18um 1P6M CMOS Technology
H.264 features
  Baseline profile
  Max frame: YUV 4:2:0, 1920x1080 at 30fps
  Search range: H:[-96,95], V:[-64,63], with 1 reference frame
  Inter modes: 16x16, 16x8, 8x16 and 8x8
Embedded MeP processor
Embedded 64Mb SiS DRAM
Chip Block Diagram
[Figure: chip block diagram. Labeled blocks: 64-bit external bus (AMBA-AHB) with external DRAM and IO I/F; 64-bit system bus (AMBA-AHB) with DMA Ctrl. and DRAM I/F; SiS-DRAM I/F to the 64Mb SiS DRAM (0.11um Triple-Well TLM, 512 bits, 11.5Gbps); MeP (Main Ctrl. core, IVC, UCI, HWE, Ctrl., cache SRAMs, AHB I/F, local bus) with CAVLC; and the H.264/AVC HDTV1080p encoder hardware engines in 0.18um CMOS technology (2-D filter, IME, FME, INTRA engine, reconstruction, EC/DB) with Level 1 and Level 2 buffers (Cur. MB, Ref. Pel. SRAMs, fractional Ref. Pel., Rec. Pel., Res. Pel., RD-Cost). A timing chart shows the 1st, 2nd and 3rd pipeline stages working on MB N+1, MB N, MB N-1 and MB N-2.]
IME Sub-Sampling Algorithm
Low-pass filter alleviates the adverse effect of down-sampling
Haar low-pass filter based 4:1 sub-sampling
  75% of the SAD operations are saved at each search position
  75% of the internal IO bandwidth is saved at each search position
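The 4:1 sub-sampling above can be sketched as a 2x2 average followed by decimation. This is a minimal illustration, not the chip's datapath: the function names, the test macroblock and the round-to-nearest offset are assumptions made here.

```python
def haar_subsample(block):
    """Average each 2x2 group of pixels and keep one value per group (4:1)."""
    height, width = len(block), len(block[0])
    return [
        [
            # "+ 2" rounds to nearest before the divide-by-4 shift (assumed rounding).
            (block[y][x] + block[y][x + 1] + block[y + 1][x] + block[y + 1][x + 1] + 2) >> 2
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))


# A 16x16 macroblock shrinks to 8x8, so each SAD needs 64 terms instead of 256.
mb = [[(3 * x + 5 * y) % 256 for x in range(16)] for y in range(16)]
small = haar_subsample(mb)
saved = 1 - (len(small) * len(small[0])) / (len(mb) * len(mb[0]))
```

With a 16x16 macroblock the sub-sampled block has 64 pixels, which is where the 75% figure comes from.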
IME Coarse-Fine Search Scheme
[Figure: search window of 192 (x) by 128 (y) positions, showing a search area around [0,0] and a search area around Coarse_MV16x16, each extending 32 positions to either side in x.]
Coarse search is processed on the [even_col, even_row] candidates
Fine search area for the other kinds of search positions is
  $\{[x, y] \mid (-32 \le x < 32) \cup (-32 \le x - \mathrm{Coarse\_MV}_{16\times16}[x] < 32)\}$
Saves 25%-50% of the search candidates
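The 25%-50% saving can be checked by counting candidates. The sketch below assumes the search range from the design specifications (H:[-96,95], V:[-64,63]) and that the fine search skips the even/even positions already covered; `count_candidates` and `saved_fraction` are names introduced here for illustration.

```python
def count_candidates(coarse_mv_x, h_range=(-96, 95), v_range=(-64, 63), half_width=32):
    """Count coarse (even col, even row) and fine candidates for one macroblock."""
    coarse = 0
    fine = 0
    for x in range(h_range[0], h_range[1] + 1):
        near_zero = -half_width <= x < half_width
        near_coarse = -half_width <= x - coarse_mv_x < half_width
        for y in range(v_range[0], v_range[1] + 1):
            if x % 2 == 0 and y % 2 == 0:
                coarse += 1            # always searched in the coarse pass
            elif near_zero or near_coarse:
                fine += 1              # other parities, restricted in x
    total = (h_range[1] - h_range[0] + 1) * (v_range[1] - v_range[0] + 1)
    return coarse, fine, total


def saved_fraction(coarse_mv_x):
    coarse, fine, total = count_candidates(coarse_mv_x)
    return 1 - (coarse + fine) / total


best_case = saved_fraction(0)    # both windows coincide
worst_case = saved_fraction(64)  # windows do not overlap
```

When the coarse vector has x = 0 the two windows coincide and half of the candidates are skipped; when they are disjoint a quarter are skipped, matching the range quoted on the slide.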
IME Level C+ data reuse
[Figure: Level C+ data reuse search window with HF3V2 n-stitched zigzag scan over current blocks CBn to CBn+7. Labeled regions: search area for CBn, extended area for CBn+1 and CBn+2, extended area in the vertical direction, and refilling buffer. Dimension labels: 16, 128+16, 192+16, 32, 16.]
Memory volume increases by 26.875%
Compared with the Level C scheme, a further 44.5% of the external IO bandwidth is saved
IME Memory Mapping
Target: reduce SRAM partitions; improve utilization
Search window pixels are categorized as [even_col,even_row], [even_col,odd_row], [odd_col,even_row] and [odd_col,odd_row], and stored in four areas of the physical memory
[Figure: search window (width by height) before and after mapping. Pre-mapping, the four pixel categories are interleaved and the read-out pixels (labeled 00, 02, 04, 06, 20, 22, ..., 66) are scattered; post-mapping, the same pixels sit together in one area.]
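A minimal sketch of the parity-based mapping follows. The area names, the `post_map` address layout and the assumption that a sub-sampled match reads every second pixel in each direction are illustrative choices made here, not details taken from the slides.

```python
AREAS = {
    (0, 0): "even_col,even_row",
    (0, 1): "even_col,odd_row",
    (1, 0): "odd_col,even_row",
    (1, 1): "odd_col,odd_row",
}


def post_map(col, row):
    """Map a search window pixel to (area, column in area, row in area)."""
    return AREAS[(col % 2, row % 2)], col // 2, row // 2


def areas_touched(x0, y0, size=8, step=2):
    """Areas read by one sub-sampled match starting at (x0, y0)."""
    return {
        post_map(x0 + step * i, y0 + step * j)[0]
        for i in range(size)
        for j in range(size)
    }


# Every search position reads pixels of a single parity class, hence one area.
single_area = all(
    len(areas_touched(x, y)) == 1 for x in range(0, 16) for y in range(0, 16)
)
```

Under these assumptions each candidate position fetches from one area only, and the pixels it needs are adjacent after mapping.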
IME Memory Organization
[Figure: search window SRAM organization. Four areas hold the even_column/even_row, even_column/odd_row, odd_column/even_row and odd_column/odd_row pixels; further regions are the extension for CBn+1 and CBn+2, the refilling buffer, and the extension for the C+ search window. Other labels: 80x8bit, 71x8 2-1 MUX, 71x8bit reference buffer (71 pixels), 64bit, and dimensions 8, 24 and 128.]
Memory partitions are reduced from 16 to 5
The original design costs 426.7k gates, ours 253.8k: 40.52% of the hardware is saved
IO utilization is 2.91 times that of the original
IME SAD Tree Optimization
[Figure: SAD tree schematic. Labeled elements: R[7:0] and inv(C[7:0]) entering an 8-bit adder with outputs s8 to s0, the signals Cx and ABSx, and a 4:2 compressor tree.]
Circuit optimizations
  |R-C| is adopted to make use of the multi-cycle path delay through the current MB
  Inverters of the current MB pixels are merged into the source register array
  Addition of 'Cx' and 'ABSx' is merged into the 4:2 compressor tree
Performance
  Clock speed is 200MHz
  Each SAD tree costs 12.1-12.2k gates
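One common way to obtain |R-C| from an adder fed with R and the inverted C is sketched below. The slides name the signals s8, Cx and ABSx but do not define them, so the roles given to them here are an assumption: the carry-out s8 selects between the low sum bits and their complement, and Cx is a deferred "+1" that a compressor tree could absorb.

```python
def abs_diff_parts(r, c):
    """Return (absx, cx) such that absx + cx == |r - c| for 8-bit r and c."""
    s = r + ((~c) & 0xFF)     # 9-bit sum of R and inv(C), no carry-in
    s8 = s >> 8               # carry-out: 1 exactly when r > c
    low = s & 0xFF
    if s8:
        return low, 1         # r - c = low + 1; the +1 is deferred as cx
    return (~low) & 0xFF, 0   # c - r = 255 - low


def sad_from_parts(ref, cur):
    """SAD of two pixel lists, summing the ABSx and Cx terms separately."""
    parts = [abs_diff_parts(r, c) for r, c in zip(ref, cur)]
    return sum(a for a, _ in parts) + sum(cx for _, cx in parts)
```

For example, `abs_diff_parts(5, 3)` yields `(1, 1)` and `abs_diff_parts(3, 5)` yields `(2, 0)`; both sum to 2.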
IME Engine Block Diagram
[Figure: IME engine. Labeled blocks: 2-D low-pass filter and 4-1 decimation producing the 8×8-pel current MB (SRC. pels); C+ search window (512, 80 pels), MUX (80 pels) and reference pel buffer (71×8 pels, 71 pels) with zigzag scan of the current search area (GEN. pels); PU#0 to PU#31 with 64 PEs each; MVD_costs generator; comparison tree; RD_costs & IMV registers.]
485.7k gates of standard cells and 327.68 kb of SRAM
44.5% of the external IO bandwidth is saved
40.52% memory hardware reduction and 11.6% datapath hardware reduction
416-608 processing cycles for each MB
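The 416-608 cycle figure can be put next to the per-macroblock cycle budget implied by the specification. The sketch assumes the frame height is padded to a whole number of macroblocks (1088 lines) and that each pipeline stage must finish one MB per MB period at 200MHz; `mb_rate` is a name introduced here.

```python
import math


def mb_rate(width, height, fps, mb_size=16):
    """Macroblocks per second, rounding each dimension up to whole MBs."""
    mbs_per_frame = math.ceil(width / mb_size) * math.ceil(height / mb_size)
    return mbs_per_frame * fps


CLOCK_HZ = 200_000_000
rate = mb_rate(1920, 1080, 30)   # 120 x 68 MBs per frame, 30 frames per second
budget = CLOCK_HZ / rate         # clock cycles available per macroblock
```

Under these assumptions the budget is about 817 cycles per MB, so the reported worst case of 608 cycles leaves some margin.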
FME Optimizations
Mode trimming and 8x8 block based processing unit
Optimized schedule reduces pipeline bubbles
1/2-pel reusing: avoids the search window access and the 1/2-pel calculation during the 1/4-pel search
SATD reusing: depending on whether the MVs match, the SATD8x8 values of the overlapped 8x8 blocks are reused
FME 8x8 Based PU
[Figure: 8x8 block based PU. Labeled blocks: current MB and integer, horizontal, vertical and diagonal pixels entering a multi-function MUX; ¼-pel interpolation (prediction position: half/quarter); residual generation; two Hadamard transform units; adder tree (SATD tree); partition; SATD output.]
8x8 block based PU doubles the throughput
Mode trimming saves 42.9% of the computation
Hardware utilization is improved
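The residual, Hadamard transform and adder tree steps of the PU compute a SATD. The sketch below assumes that an 8x8 SATD is the sum of four 4x4 Hadamard SATDs and applies no normalization; the slides do not state either detail, and the function names are introduced here.

```python
# 4x4 Hadamard matrix (symmetric, so it equals its own transpose).
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def hadamard4(block):
    """Two-dimensional 4x4 Hadamard transform: H4 * block * H4."""
    tmp = [[sum(H4[i][k] * block[k][j] for k in range(4)) for j in range(4)] for i in range(4)]
    return [[sum(tmp[i][k] * H4[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def satd4x4(cur, pred):
    """Sum of absolute transformed differences of one 4x4 block."""
    residual = [[cur[i][j] - pred[i][j] for j in range(4)] for i in range(4)]
    return sum(abs(v) for row in hadamard4(residual) for v in row)


def satd8x8(cur, pred):
    """SATD of an 8x8 block as the sum over its four 4x4 sub-blocks."""
    total = 0
    for by in (0, 4):
        for bx in (0, 4):
            c = [row[bx:bx + 4] for row in cur[by:by + 4]]
            p = [row[bx:bx + 4] for row in pred[by:by + 4]]
            total += satd4x4(c, p)
    return total
```

A constant residual of 1 concentrates in the DC coefficient, giving 16 per 4x4 block and 64 per 8x8 block.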
FME Processing Schedule
[Figure: FME processing schedule charts ordering the ½-pel and ¼-pel searches of the 16×16, 16×8, 8×16 and 8×8 modes.]
FME 1/2 Pixel Reusing
During the 1/4-pel search, 1/2 pels are fetched directly from a buffer
  Avoids access to the search window memory
  Skips the 1/2-pel recalculation
  Saves considerable power
Skipping the 1/2-pel generation improves the hardware utilization
  Improves the 1/4-pel search throughput
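In H.264, a quarter-pel sample is the rounded average of two neighboring integer or half-pel samples, which is why buffered half pels are enough for the 1/4-pel search. The sketch below covers only the one-dimensional horizontal case; the buffer layout `half_row` and the function name are assumptions made for illustration.

```python
def quarter_from_half_row(half_row, qx):
    """Sample at quarter-pel offset qx along one row.

    half_row[i] holds the sample at half-pel position i
    (even i: integer pel, odd i: horizontal half pel).
    """
    if qx % 2 == 0:
        # Integer or half-pel position: read straight from the buffer.
        return half_row[qx // 2]
    left = half_row[(qx - 1) // 2]
    right = half_row[(qx + 1) // 2]
    # Quarter pel: average of the two neighbors, rounded up.
    return (left + right + 1) >> 1


row = [10, 20, 31]   # integer, half, integer
samples = [quarter_from_half_row(row, q) for q in range(4)]
```

No six-tap filtering and no search window access appear in this step; only the buffered row is read.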
FME SATD Reusing
if (Block[i] of mode m is totally or partially overlapped by Block[j] of mode n) {
    if (IMV of Block[i] of mode m == IMV of Block[j] of mode n) {
        SATD8×8 values of the overlapped area are reused in the ½-pel search
    }
    if (HMV of Block[i] of mode m == HMV of Block[j] of mode n) {
        SATD8×8 values of the overlapped area are reused in the ¼-pel search
    }
}
[Figure: bar charts of IMV matching ratio (%) and HMV matching ratio (%) for modes 16×8, 8×16 and 8×8. Plotted values: 85.09, 64.58, 69.78, 75.67, 50.5, 55.93.]
Sequence: station1080p; QP=28
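The reuse rule can be sketched by describing each partition as a set of 8x8 block indices. The index numbering (0 top-left, 1 top-right, 2 bottom-left, 3 bottom-right), the `PARTITIONS` table and the function name are illustrative assumptions; the same check applies with IMVs for the ½-pel search and with HMVs for the ¼-pel search.

```python
PARTITIONS = {
    "16x16": [frozenset({0, 1, 2, 3})],
    "16x8": [frozenset({0, 1}), frozenset({2, 3})],   # top half, bottom half
    "8x16": [frozenset({0, 2}), frozenset({1, 3})],   # left half, right half
    "8x8": [frozenset({0}), frozenset({1}), frozenset({2}), frozenset({3})],
}


def reusable_8x8(mode_m, i, mv_m, mode_n, j, mv_n):
    """8x8 blocks of Block[i] (mode m) whose SATD can be taken from Block[j] (mode n)."""
    overlap = PARTITIONS[mode_m][i] & PARTITIONS[mode_n][j]
    # Reuse is possible only where the blocks overlap and the vectors match.
    return overlap if mv_m == mv_n else frozenset()
```

For example, the top 16x8 block with the same vector as the 16x16 block can reuse the SATD8×8 values of 8x8 blocks 0 and 1.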
FME Block Diagram
[Figure: FME engine. Labeled blocks: luma reference pixel SRAMs, current luma MB buffer, half-pixel interpolation, luma half-pixel buffer, multi-functional MUX, FME controller, IMVs, MV comparison, MV buffer, 8×8 based PU #0 to PU #8, MV cost generation, prediction position decision, prediction mode decision, SATD buffer, residual generation, luma sub-pixel output.]
Throughput of FME Architecture
  Architecture                                        Throughput
  4×4 based PUs, no optimization                      170k MB/s
  8×8 based PUs, no optimization                      292k MB/s
  8×8 based PUs, pixel reusing                        459k MB/s
  8×8 based PUs, pixel and SATD reusing (proposed)    956k MB/s
Intra Optimizations
With the edge direction based mode reduction algorithm [1], 66% of the intra prediction computation is saved
The edge detection algorithm is improved and made more VLSI friendly
  2x2 filter without neighboring pixels simplifies the IO operation
  Multiplication and division are removed for easier HW implementation
[1] F. Pan et al., "Fast mode decision algorithm for intra prediction in H.264/AVC video coding," IEEE Trans. on CSVT, vol. 15, no. 6, pp. 813-821, July 2005.
Intra Improved Edge Detection
1) 2x2 local edge detector, 9 samples
[Figure: pixel grid along axes i and j showing f_{i,j} and G_{i,j}; legend: true pixel, virtual pixel.]
  $\vec{G}_{i,j} = (Gx_{i,j}, Gy_{i,j})$
  $Gx_{i,j} = f_{i+1,j} + f_{i+1,j+1} - f_{i,j} - f_{i,j+1}$
  $Gy_{i,j} = f_{i,j} + f_{i+1,j} - f_{i+1,j+1} - f_{i,j+1}$
  $\mathrm{Amp}(\vec{G}_{i,j}) = |Gx_{i,j}| + |Gy_{i,j}|$
2) Linear gradient vector direction decision
  $err_i = \bigl|\,|Gx_{i,j}| - k_i\,|Gy_{i,j}|\,\bigr|, \quad k_i \in \{8, 0.125, 1, 1, 2, 0.5, 2, 0.5\}$
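The two steps above can be sketched as follows. Because every k_i is a power of two, the errors can be computed with shifts only; scaling all errors by 8 to stay in integers is a choice made here, not something the slides specify, and the function names are hypothetical. How the repeated k values (1, 2 and 0.5 each appear twice) are told apart is not stated on the slide and is left out.

```python
K = [8, 0.125, 1, 1, 2, 0.5, 2, 0.5]
K_SHIFTS = [6, 0, 3, 3, 4, 2, 4, 2]   # log2(8 * k_i), so 8 * k_i * v == v << shift


def gradient(f, i, j):
    """Gradient (Gx, Gy) of the 2x2 sample whose top-left pixel is f[i][j]."""
    gx = f[i + 1][j] + f[i + 1][j + 1] - f[i][j] - f[i][j + 1]
    gy = f[i][j] + f[i + 1][j] - f[i + 1][j + 1] - f[i][j + 1]
    return gx, gy


def amplitude(gx, gy):
    return abs(gx) + abs(gy)


def scaled_errors(gx, gy):
    """8 * err_i for every k_i, using shifts instead of multiplication."""
    ax, ay = abs(gx), abs(gy)
    return [abs((ax << 3) - (ay << s)) for s in K_SHIFTS]


f = [[0, 0], [8, 0]]
gx, gy = gradient(f, 0, 0)
errs = scaled_errors(gx, gy)
```

For this sample |Gx| equals |Gy|, so the smallest error falls on the entries with k_i = 1.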
MeP Module
[Figure: MeP module. Labeled blocks: 32-bit RISC processor core with instruction decoder and extension; 4KB instruction RAM, 4KB data RAM, 4KB instruction cache, 4KB data cache and bus I/F unit; VLIW coprocessor (IVC) with pixel_in (q3..q0, p3..p0), input selection MUXes, Filter #1 and pixel_out (o3..o1); UCI with instruction input, Rm and Rn (32bit), functional units (Func. #1 to Func. 11) and a 32bit result; CAVLC HWE with CAVLC Ctrl., coefficient input buffer, coefficient scan, symbol buffer, code table, bitstream packer and coded bits output buffer; DMAC with data streamer; control bus, local bus, and AMBA-AHB I/F unit to the system AHB bus.]
SiS Architecture
[Figure: System-in-Silicon structure. Labeled elements: H.264 encoder chip, SiS-DRAM, silicon interposer, micro bumps, global wires, local wires.]
Low power SiS-DRAM [2]
Techniques
  H.264 encoder: TSMC CL018G 1P6M
  SiS-DRAM: 0.11um Triple-Well TLM
[2] K. Kumagai et al., "System-in-Silicon Architecture and its Application to an H.264/AVC Motion Estimate for 1080HDTV," ISSCC 2006.
Chip Photograph
Chip Features
  Feature                   Huang [ISSCC2005]     Ours
  Chip feature              ASIC                  SoC
  Embedded CPU              −                     32-bit MeP
  Embedded DRAM             −                     64Mb SiS DRAM
  Technology (ASIC)         0.18µm CMOS 1P6M      0.18µm CMOS 1P6M
  Technology (SiS DRAM)     −                     0.11µm Triple-Well TLM
  Core size                 31.7mm2               27.1mm2
  Logic gates               922.8k                1140k
  SRAMs                     34.72KB               108.3KB
  H.264 feature / profile   Baseline, 720p        Baseline, 1080p
  Ref. num.                 1                     1
  Search range              128×64                196×128
  Frequency                 108MHz                200MHz
  Power                     785mW                 1409mW
Encoding Performance
[Figure: rate-distortion curves, PSNR (dB, axis 36 to 45) versus bitrate (Mbps, axis 0 to 20), for the HDTV1080p sequences Sun Flower and Pedestrian, each encoded by JM8.1a (5 ref.) and by ours.]
Conclusion
High performance
  H.264/AVC BP HDTV1080p@30fps real-time encoder
Low hardware cost
  3-stage MB pipelining architecture
  27.1 mm2 with 0.18um 1P6M technology
High flexibility
  Embedded MeP processor
Low power dissipation
  Proposed algorithms reduce the computation complexity
  Embedded low power and high bandwidth SiS DRAM
  Circuit optimizations reduce the power dissipation