A 1.41W H.264/AVC REAL-TIME ENCODER SoC FOR HDTV1080P

Zhenyu Liu (1), Yang Song (1), Ming Shao (1), Shen Li (1), Lingfeng Li (1), Shunichi Ishiwata (2), Masaki Nakagawa (2), Satoshi Goto (1), Takeshi Ikenaga (1)

(1) The Graduate School of IPS, Waseda University, Japan
(2) Toshiba Corporation, Kawasaki, Japan

A 1.41W H.264/AVC REAL-TIME ENCODER SoC FOR HDTV1080P – p.1/26

Outline
  H.264/AVC Introduction
  Design Specifications
  Optimizations of the design
    IME
    FME and Intra engine
    Embedded MeP Processor
    Embedded SiS DRAM
  Chip features and performance
  Conclusion


H.264/AVC Introduction
  Superior performance
    25%-45% bit-rate reduction compared with MPEG-4
    50%-70% bit-rate reduction compared with MPEG-2
  Prominent techniques in H.264/AVC
    Variable block sizes (VBS: 4x4, 4x8, 8x4, 8x8, 8x16, 16x8, 16x16) with multiple reference frames (MRF)
    1/4-pel accurate motion prediction
    17 intra prediction modes
    Lagrangian cost oriented RD optimization
    In-loop deblocking filter
    Context-based adaptive variable length entropy coding (EC)


Design Specifications
  TSMC 0.18um 1P6M CMOS technology
  H.264 features
    Baseline profile
    Max frame: YUV 4:2:0, 1920x1080 at 30fps
    Search range: H [-96,95], V [-64,63], with 1 reference frame
    Inter modes: 16x16, 16x8, 8x16 and 8x8
  Embedded MeP processor
  Embedded 64Mb SiS DRAM
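As a rough sanity check of what these specifications imply, the following Python sketch computes the macroblock (MB) rate for 1920x1080 at 30fps and the resulting cycle budget per MB. It assumes the 200MHz clock reported in the chip features of this deck, 16x16 macroblocks, and frame dimensions rounded up to whole macroblocks; all names are illustrative and not taken from the design.

```python
import math

# Figures from the slides: frame size, frame rate, and the 200MHz clock.
WIDTH, HEIGHT, FPS = 1920, 1080, 30
CLOCK_HZ = 200_000_000
MB_SIZE = 16  # assumption: 16x16 luma macroblocks


def macroblocks_per_frame(width, height, mb=MB_SIZE):
    """Number of MBs when each dimension is rounded up to a whole MB."""
    return math.ceil(width / mb) * math.ceil(height / mb)


mbs_per_frame = macroblocks_per_frame(WIDTH, HEIGHT)  # 120 * 68
mbs_per_second = mbs_per_frame * FPS
cycles_per_mb = CLOCK_HZ / mbs_per_second

print(mbs_per_frame, mbs_per_second, round(cycles_per_mb, 1))
```

Under these assumptions each pipeline stage has a little under 817 clock cycles per MB to sustain real-time encoding.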

Chip Block Diagram
[Figure: chip block diagram. The H.264/AVC HDTV1080p encoder (0.18um CMOS technology) and the 64Mb SiS DRAM (0.11um Triple-Well TLM, 512bits, 11.5Gbps) are joined through the SiS-DRAM I/F. Other labels: 64bits external bus (AMBA-AHB) with IO I/F and external DRAM; 64bits system bus (AMBA-AHB) with DMA Ctrl.; main controller (MeP core) with MeP IVC, MeP UCI, MeP HWE (CAVLC), cache SRAMs, AHB I/F and local bus; HW engines (IME, FME, INTRA, EC/DB, 2-D filter, reconstruction) with level-1 and level-2 buffers, Ref. Pel. SRAMs and Cur. MB buffers; three MB pipeline stages (stage 1 to stage 3) working on macroblocks MB N+1 to MB N-2.]


IME Sub-Sampling Algorithm
  A low-pass filter alleviates the adverse effect of down-sampling
  4:1 sub-sampling based on a Haar low-pass filter
  75% of the SAD operations are saved at each search position
  75% of the internal IO bandwidth is saved at each search position
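The sub-sampling idea can be sketched in Python as follows. This is a minimal illustration, assuming the Haar low-pass filter is a 2x2 average followed by 4:1 decimation and that rounding is done by adding 2 before the shift; the rounding rule and the function names are assumptions, not details given on the slide.

```python
def haar_subsample(block):
    """Average each non-overlapping 2x2 group (Haar low-pass) and keep one
    sample per group, so a 16x16 block becomes 8x8 (4:1 sub-sampling)."""
    rows, cols = len(block), len(block[0])
    return [
        [
            (block[r][c] + block[r][c + 1]
             + block[r + 1][c] + block[r + 1][c + 1] + 2) >> 2  # assumed rounding
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# A 16x16 macroblock shrinks to 8x8, so each SAD needs 64 instead of 256
# absolute differences, which is the 75% saving quoted on the slide.
cur = [[(r * 16 + c) % 256 for c in range(16)] for r in range(16)]
ref = [[(r * 16 + c + 3) % 256 for c in range(16)] for r in range(16)]
cur_small, ref_small = haar_subsample(cur), haar_subsample(ref)
operations_saved = 1 - (len(cur_small) * len(cur_small[0])) / (16 * 16)
```

The same filter must be applied to both the current block and the reference pixels so that the reduced SAD compares like with like.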


IME Coarse-Fine Search Scheme
[Figure: search window of 192 (x) by 128 (y) positions, marking the search area around [0,0] and the search area around Coarse_MV16x16, each built from 32-wide strips.]
  Coarse search is processed on the [even_col, even_row] candidates
  The fine search area for the other kinds of search positions is
    {[x, y] | (−32 ≤ x < 32) ∪ (−32 ≤ x − Coarse_MV16×16[x] < 32)}
  Saves 25%-50% of the search candidates
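The 25%-50% figure can be reproduced with a small counting sketch. It assumes the search range from the design specifications (H [-96,95], V [-64,63]), that the coarse pass visits every [even_col, even_row] position, and that the fine pass visits all remaining positions whose x coordinate falls in either 64-wide strip, with no restriction on y. The function names are illustrative only.

```python
H_RANGE = range(-96, 96)   # horizontal search range [-96, 95]
V_RANGE = range(-64, 64)   # vertical search range [-64, 63]


def count_candidates(coarse_mv_x):
    """Return (coarse, fine) candidate counts for a given coarse MV x value."""
    coarse = 0
    fine = 0
    for x in H_RANGE:
        in_fine_strip = (-32 <= x < 32) or (-32 <= x - coarse_mv_x < 32)
        for y in V_RANGE:
            if x % 2 == 0 and y % 2 == 0:
                coarse += 1          # visited by the coarse search
            elif in_fine_strip:
                fine += 1            # remaining positions inside the strips
    return coarse, fine


def saved_fraction(coarse_mv_x):
    """Fraction of full-search candidates that are skipped."""
    coarse, fine = count_candidates(coarse_mv_x)
    full = len(H_RANGE) * len(V_RANGE)
    return 1 - (coarse + fine) / full


print(saved_fraction(0))    # strips coincide      -> 0.5
print(saved_fraction(64))   # strips are disjoint  -> 0.25
```

When the coarse motion vector stays near zero the two strips overlap and half of the candidates are skipped; when the strips are disjoint the saving drops to a quarter, matching the range quoted on the slide.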

IME Level C+ data reuse
[Figure: Level C+ data reuse search window with the HF3V2 n-stitched zigzag scan mode over current blocks CBn to CBn+7. Labels: search area for CBn (192+16 by 128+16), extended area for CBn+1 and CBn+2 (32), extended area in the vertical direction (16), refilling buffer (16).]
  Memory volume increases by 26.875%
  Compared with the Level C scheme, 44.5% more external IO bandwidth is saved

IME Memory Mapping
  Target: reduce SRAM partitions; improve utilization
  Search window pixels are categorized as [even_col,even_row], [even_col,odd_row], [odd_col,even_row] and [odd_col,odd_row]
  The four categories are stored in four areas of physical memory
[Figure: search window pre-mapping and post-mapping. Axes: search window width and search window height. Legend: even-column even-row, even-column odd-row, odd-column even-row and odd-column odd-row pixels. The read-out data are pixels 00, 02, 04, 06, 20, 22, ..., 66.]
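A minimal Python sketch of this parity-based mapping is given below. The slide states only that pixels are split into four categories stored in four memory areas; the area numbering and the address formula inside each area are assumptions made for illustration.

```python
def map_pixel(col, row):
    """Map a search-window pixel to (area, (col_in_area, row_in_area)).

    area is (col parity, row parity): (0, 0) holds even-column/even-row
    pixels, (0, 1) even-column/odd-row, and so on. The in-area address
    simply halves each coordinate (hypothetical layout).
    """
    area = (col % 2, row % 2)
    return area, (col // 2, row // 2)


def areas_touched(pixels):
    """Set of memory areas needed to read the given (col, row) pixels."""
    return {map_pixel(col, row)[0] for col, row in pixels}


# The read-out pattern in the figure (00, 02, ..., 66) uses only even
# coordinates, so it is served by a single area.
read_out = [(c, r) for r in (0, 2, 4, 6) for c in (0, 2, 4, 6)]
print(areas_touched(read_out))
```

With this layout a coarse-search read that steps over even columns and rows never needs data from the other three areas.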


IME Memory Organization
[Figure: reference memory organization. Four areas hold the even_column/even_row, even_column/odd_row, odd_column/even_row and odd_column/odd_row pixels, together with the extension for CBn+1 and CBn+2, the refilling buffer and the extension for the C+ search window. Read path labels: 64bit, 80x8bit, 2-1 MUX, 71x8bit (71 pixels), 71x8 reference buffer.]
  Memory partitions are reduced from 16 to 5
  The original design takes 426.7k gates and ours 253.8k gates, a 40.52% hardware saving
  IO utilization is 2.91 times that of the original design

IME SAD Tree Optimization
[Figure: SAD tree circuit. Labels: R[7:0], inv(C[7:0]), 8-bit adder with outputs s8 and s7 to s0, Cx, ABSx, 4:2 compressor.]
  Circuit optimizations
    |R-C| is adopted to make use of the multi-cycle path delay through the current MB
    Inverters for the current MB pixels are merged into the source register array
    The addition of 'Cx' and 'ABSx' is merged into the 4:2 compressor tree
  Performance
    Clock speed is 200MHz
    Each SAD tree takes 12.1k-12.2k gates
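One common way to build |R-C| from an adder and pre-inverted C, which appears consistent with the labels in the figure, is sketched below in Python. It is an interpretation, not the authors' circuit: here Cx is taken to be the carry-out (s8) of the 8-bit addition R + inv(C), and ABSx is the sum s7..s0, bit-inverted when the carry is 0. The absolute difference is then ABSx + Cx, so the +1 corrections can be summed together with the ABSx terms in one compressor tree.

```python
def abs_diff_parts(r, c):
    """Split |r - c| for 8-bit pixels into (absx, cx) with |r - c| == absx + cx."""
    total = r + (~c & 0xFF)      # 8-bit adder fed with R and inv(C)
    cx = total >> 8              # carry-out: 1 exactly when r > c
    s = total & 0xFF             # sum bits s7..s0
    absx = s if cx else (~s & 0xFF)
    return absx, cx


def sad(ref_pixels, cur_pixels):
    """SAD where the cx corrections are added alongside the absx terms."""
    parts = [abs_diff_parts(r, c) for r, c in zip(ref_pixels, cur_pixels)]
    return sum(absx for absx, _ in parts) + sum(cx for _, cx in parts)


print(abs_diff_parts(5, 3), abs_diff_parts(3, 5), sad([5, 3, 7], [3, 5, 7]))
```

Deferring the carry correction avoids a separate incrementer per pixel, which is the kind of saving the slide attributes to merging 'Cx' and 'ABSx' into the compressor tree.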

IME Engine Block Diagram
[Figure: IME engine. Labels: current MB through the 2-D low-pass filter and 4-1 decimation to 8×8 pels (source pels); C+ search window (512) and current search area with zigzag scan; MUX (80 pels to 71 pels) and reference pels buffer (71×8 pels); PU#0 to PU#31 with 64 PEs each; MVD cost generation; comparison tree; RD costs and IMV registers.]
  485.7k gates of standard cells and 327.68 kb of SRAM
  44.5% of the external IO bandwidth is saved
  40.52% memory hardware reduction and 11.6% datapath hardware reduction
  416-608 processing cycles for each MB

FME Optimizations
  Mode trimming and an 8x8 block based processing unit
  Optimized schedule reduces pipeline bubbles
  1/2-pel reusing: avoids the search window access and the 1/2-pel calculation during 1/4-pel search
  SATD reusing: depending on the MVs' matching, the SATD8x8 values of the overlapped 8x8 blocks are reused


FME 8x8 Based PU
[Figure: 8x8 based PU datapath. Inputs: current MB and integer, horizontal, vertical and diagonal reference pixels through a multi-function MUX. Stages: ¼-pel interpolation (half/quarter prediction position), residual generation, two Hadamard transform units, adder tree / SATD tree, partition, SATD output.]
  The 8x8 block based PU doubles the throughput
  Mode trimming saves 42.9% of the computation
  Hardware utilization is improved
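For readers unfamiliar with the SATD cost that this PU produces, here is a minimal Python sketch of residual generation followed by a 4x4 Hadamard transform and an absolute sum. The slide does not say how the 8x8 PU arranges its transforms, so the 4x4 block size and the absence of any normalization are assumptions made purely for illustration.

```python
# 4x4 Hadamard matrix (symmetric, entries +1/-1).
H4 = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]


def matmul4(a, b):
    """Plain 4x4 integer matrix product."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def satd4x4(cur, pred):
    """Sum of absolute Hadamard-transformed differences of two 4x4 blocks."""
    residual = [[cur[r][c] - pred[r][c] for c in range(4)] for r in range(4)]
    coeffs = matmul4(matmul4(H4, residual), H4)   # H * residual * H
    return sum(abs(v) for row in coeffs for v in row)


flat = [[1] * 4 for _ in range(4)]
zero = [[0] * 4 for _ in range(4)]
print(satd4x4(flat, zero))   # constant residual: only the DC term is non-zero
```

Because the transform concentrates a smooth residual into few coefficients, SATD tracks the eventual coding cost more closely than a plain SAD.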


FME Processing Schedule
[Figure: FME processing schedule, showing how the ½-pel and ¼-pel searches of the 16×16, 16×8, 8×16 and 8×8 modes are ordered and interleaved.]


FME 1/2 Pixel Reusing
  During the 1/4-pel search, the 1/2 pels are fetched directly from a buffer
    Avoids the access to the search window memory
    Skips the 1/2-pel recalculation
    Saves a considerable amount of power
  Skipping the 1/2-pel generation improves the hardware utilization
    Improves the 1/4-pel search throughput


FME SATD Reusing

  if (Block[i]_mode_m is totally or partially overlapped by Block[j]_mode_n) {
      if (IMV of Block[i]_mode_m == IMV of Block[j]_mode_n) {
          SATDs8×8 of the overlapped area are reused in the ½-pel search
      }
      if (HMV of Block[i]_mode_m == HMV of Block[j]_mode_n) {
          SATDs8×8 of the overlapped area are reused in the ¼-pel search
      }
  }

[Figure: bar charts of the IMV matching ratio (%) and the HMV matching ratio (%) for modes 16×8, 8×16 and 8×8. Values shown: 85.09, 64.58, 69.78, 75.67, 50.5 and 55.93. Sequence: station1080p; QP=28.]
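The reuse rule can be illustrated with a short Python sketch. It models each partition as the set of 8x8 sub-blocks it covers inside the macroblock (indexed 0 to 3 in raster order) and returns the sub-blocks whose SATD8x8 could be reused when two partitions share a motion vector. The table, the indexing and the function name are hypothetical; the same check would apply to IMVs for the ½-pel search and to HMVs for the ¼-pel search.

```python
# 8x8 sub-block indices (raster order within the MB) covered by each
# partition of each inter mode. Hypothetical representation.
PARTITIONS = {
    "16x16": [{0, 1, 2, 3}],
    "16x8": [{0, 1}, {2, 3}],
    "8x16": [{0, 2}, {1, 3}],
    "8x8": [{0}, {1}, {2}, {3}],
}


def reusable_blocks(mode_m, i, mode_n, j, mv_m, mv_n):
    """8x8 sub-blocks of partition i (mode_m) whose SATD can be taken from
    partition j (mode_n): the overlapped area, but only if the MVs match."""
    overlap = PARTITIONS[mode_m][i] & PARTITIONS[mode_n][j]
    return overlap if mv_m == mv_n else set()


# The upper 16x8 partition lies fully inside the 16x16 block.
print(reusable_blocks("16x8", 0, "16x16", 0, (3, -2), (3, -2)))
# A 16x8 and an 8x16 partition overlap in one 8x8 sub-block only.
print(reusable_blocks("16x8", 0, "8x16", 1, (3, -2), (3, -2)))
```

When the motion vectors differ the function returns an empty set and the SATD has to be computed afresh.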

FME Block Diagram
[Figure: FME engine. Labels: luma reference pixel SRAMs, current luma MB buffer, half-pixel interpolation, luma half-pixel buffer, multi-functional MUX, FME controller, MV comparison (IMVs) and MV buffer, 8×8 based PU #0 to PU #8, MV cost generation, prediction position decision, prediction mode decision, SATD buffer, residual generation, luma sub-pixel output.]
[Figure: throughput of FME architectures. Bars: 4×4 based PUs without optimization; 8×8 based PUs without optimization; 8×8 based PUs with pixel reusing; proposed 8×8 based PUs with pixel and SATD reusing. Values shown: 170k, 292k, 459k and 956k MB/s.]


Intra Optimizations
  With the edge direction based mode reduction algorithm [1], 66% of the intra prediction computation is saved
  The edge detection algorithm is improved to make it more VLSI friendly
    A 2x2 filter without neighboring pixels simplifies the IO operation
    Multiplication and division are removed for easier HW implementation

[1] F. Pan et al., "Fast mode decision algorithm for intra prediction in H.264/AVC video coding," IEEE Trans. on CSVT, vol. 15, no. 6, pp. 813-821, July 2005.

Intra Improved Edge Detection
[Figure: grid of true pixels f_{i,j} and virtual pixels G_{i,j}, with axes i and j.]

1) 9 samples, 2x2 local edge detector

  \vec{G}_{i,j} = (Gx_{i,j}, Gy_{i,j})
  Gx_{i,j} = f_{i+1,j} + f_{i+1,j+1} - f_{i,j} - f_{i,j+1}
  Gy_{i,j} = f_{i,j} + f_{i+1,j} - f_{i+1,j+1} - f_{i,j+1}
  Amp(\vec{G}_{i,j}) = |Gx_{i,j}| + |Gy_{i,j}|

2) Linear gradient vector direction decision

  err_i = \big| |Gx_{i,j}| - k_i |Gy_{i,j}| \big|
  k_i = \{8, 0.125, 1, 1, 2, 0.5, 2, 0.5\}
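The two steps can be sketched in Python as follows. The gradient formulas are taken directly from the slide; scaling the errors by 8 so that every k_i becomes a power-of-two shift is an assumption about how the multiplication-free form might be realized, and the slide does not show how entries with equal k_i are told apart, so the sketch only returns the list of errors.

```python
# Shift amounts such that (1 << shift) == 8 * k_i for
# k_i = {8, 0.125, 1, 1, 2, 0.5, 2, 0.5}. Assumed realization.
K_SHIFTS = [6, 0, 3, 3, 4, 2, 4, 2]


def gradient(f, i, j):
    """2x2 local edge detector: returns (Gx, Gy) at position (i, j)."""
    gx = f[i + 1][j] + f[i + 1][j + 1] - f[i][j] - f[i][j + 1]
    gy = f[i][j] + f[i + 1][j] - f[i + 1][j + 1] - f[i][j + 1]
    return gx, gy


def amplitude(gx, gy):
    """Amp(G) = |Gx| + |Gy|, which needs no multiplication or square root."""
    return abs(gx) + abs(gy)


def direction_errors_x8(gx, gy):
    """8 * err_i for each k_i, computed with shifts only."""
    return [abs((abs(gx) << 3) - (abs(gy) << shift)) for shift in K_SHIFTS]


gx, gy = gradient([[0, 0], [10, 10]], 0, 0)
print(gx, gy, amplitude(gx, gy))
print(direction_errors_x8(8, 1))   # the k = 8 entry gives zero error
```

A direction would then be chosen from the smallest error, with the sign of the gradient components presumably separating the entries that share the same k_i.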


MeP Module
[Figure: MeP module block diagram. Labels: 32-bit RISC processor core with 4KB data RAM, 4KB instruction RAM, 4KB data cache, 4KB instruction cache, bus I/F unit, DMAC and data streamer; VLIW coprocessor (IVC) with pixel inputs (p0 to p3, q0 to q3), input-select MUXes, filter units and pixel output; UCI extension with instruction decoder, 32-bit Rm and Rn inputs, function units #1 to #11 and a 32-bit result; CAVLC HWE with CAVLC control, coefficient input buffer, coefficient scan, symbol buffer, code table, bitstream packer and coded-bits output buffer; control bus, local bus, and AMBA-AHB I/F unit to the system AHB bus.]


SiS Architecture
[Figure: SiS structure. Labels: H.264 encoder chip, SiS-DRAM, silicon interposer, micro bumps, global wires, local wires.]
  Low power SiS-DRAM [2]
  Technologies
    H.264 encoder: TSMC CL018G 1P6M
    SiS-DRAM: 0.11um Triple-Well TLM

[2] K. Kumagai et al., "System-in-Silicon Architecture and its Application to an H.264/AVC Motion Estimate for 1080HDTV," ISSCC 2006.

Chip Photograph


Chip Features

                      Huang [ISSCC2005]     Ours
Chip Feature          ASIC                  SoC
Embedded CPU          −                     32-bit MeP
Embedded DRAM         −                     64Mb SiS DRAM
Technology (ASIC)     0.18µm CMOS 1P6M      0.18µm CMOS 1P6M
Technology (SiS DRAM) −                     0.11µm Triple-Well TLM
Core Size             31.7mm2               27.1mm2
Logic Gates           922.8k                1140k
SRAMs                 34.72KB               108.3KB
H.264 Feature
Profile               Baseline, 720p        Baseline, 1080p
Ref. Num.             1                     1
Search Range          128×64                192×128
Frequency             108MHz                200MHz
Power                 785mW                 1409mW

Encoding Performance
[Figure: rate-distortion curves, PSNR (dB, 36 to 45) versus bitrate (Mbps, 0 to 20), for the Sun Flower and Pedestrian sequences (HDTV1080p). Each sequence is encoded by JM8.1a (5 ref.) and by ours.]

Conclusion
  High performance
    H.264/AVC BP HDTV1080p@30fps real-time encoder
  Low hardware cost
    3-stage MB pipelining architecture
    27.1 mm2 with 0.18um 1P6M technology
  High flexibility
    Embedded MeP processor
  Low power dissipation
    Proposed algorithms reduce the computation complexity
    Embedded low power and high bandwidth SiS DRAM
    Circuit optimizations reduce the power dissipation
