
GPP vs DSP: A Performance/Energy Characterization and Evaluation of Video Decoding

Yahia Benmoussa∗†, Jalil Boukhobza∗
∗ Univ. Bretagne Occidentale, UMR 6285, Lab-STICC, F-29200 Brest, France
Email: {firstname.lastname}@univ-brest.fr

Eric Senn
Univ. Bretagne Sud, UMR 6285, Lab-STICC, F-56100 Lorient, France
Email: [email protected]

Djamel Benazzouz
† Univ. M’hamed Bougara, Boumerdes, Algeria
Email: [email protected]

Abstract—Mobile devices such as smart-phones and tablets are increasingly becoming the most important channel for delivering end-user Internet traffic, especially multimedia content. One of the most popular uses of these terminals is video streaming. In this type of application, video decoding is considered the most compute- and energy-intensive part. Specific processing units, such as dedicated Digital Signal Processors (DSPs), are added to these devices in order to optimize performance and energy consumption. In this context, the objective of this paper is to give a comprehensive and comparative study of the performance and energy consumption of video decoding applications on embedded heterogeneous platforms containing a GPP and a DSP. To achieve this goal, a performance and energy characterization methodology for H.264/AVC video decoding is proposed. This methodology considers a large set of video coding parameters and operating clock frequencies to reflect different execution scenarios, ranging from low-quality video decoding on low-end mobile phones to high-quality video decoding on tablets. The obtained results reveal that the best performance-energy trade-off highly depends on the required video bit-rate and resolution. For instance, the GPP can be the best choice in many cases due to a significant overhead in DSP decoding, which may represent 30% of the total decoding energy in some cases. Explanations of the obtained performance and overheads are given. Finally, guidelines on which processing element to choose according to video properties are proposed.

Keywords-Video decoding, Performance, Energy, GPP, DSP, DVFS, H.264/AVC, OMAP, GStreamer.

I. INTRODUCTION

The limited energy supply of mobile devices such as smart-phones and tablets is a critical issue to deal with in hardware and software design. In fact, lithium battery technologies are not evolving fast enough to cope with the ever-increasing complexity of both the hardware and the software of mobile devices [1]. So, mobile applications must do better with the same (or even a smaller) energy budget. This becomes especially critical when using processor-intensive applications such as video playback. It is shown in [2] that video playback is the most energy-intensive application used on mobile devices. This is due to the intensive use of processing resources, which are responsible for more than 60% of the consumed energy [2]. Furthermore, to allow high-quality video decoding, the embedded processors equipping mobile devices are more and more powerful. Hardware configurations including a processor clocked at more than 1 GHz are becoming common. The main drawback of using high frequencies is that they require higher voltage levels. This leads to a considerable increase in energy consumption due to the quadratic relation between the dynamic power and the supply voltage in CMOS circuits [3].

To overcome this issue, one possible solution is to use Digital Signal Processors (DSPs), which provide better performance-energy properties. Indeed, the use of parallelism in data processing increases the performance without the need for higher voltages and frequencies [4]. This makes DSPs an energy-efficient choice in energy-constrained devices [5]. In smart-phones and tablets, they are more and more often integrated in multi-core SoCs in addition to the General Purpose Processor (GPP) [6].

When decoding a video stream, using the full processing capabilities of the hardware is not always necessary. For example, due to bandwidth limitations, the video may be coded at a low quality, which leads to lower decoding processing requirements [7]. In this case, in order to save energy, one might use the dynamic voltage and frequency scaling (DVFS) feature provided by some low-power processors. This mechanism is used to scale down the voltage and the frequency when the processing workload (in our case video decoding) is low [8], [9].

As stated above, video decoding energy saving efforts should consider the video coding parameters and the utilization of heterogeneous hardware processing resources, without neglecting performance constraints. Understanding the impact of all these parameters on energy consumption is a challenging task in a context of ever-growing hardware and software complexity. In this paper, we propose a comprehensive study of the performance and energy consumption of video decoding in terms of video quality and processing resources utilization.
As we target two types of processors (GPP and DSP), we propose system-level and application-level characterization methodologies that are independent of the complexity of the underlying hardware architecture. The proposed methodology is based on experimental measurements conducted on a low-power embedded platform including both a GPP and a DSP. On this platform, we used a common multimedia framework, GStreamer, for both GPP and DSP video decoding in order to guarantee a more accurate performance and energy consumption comparison. The obtained experimental results revealed that the best performance-energy trade-off highly depends on the required video bit-rate and resolution, as the GPP can be the best choice in many cases. In fact, in case of low video bit-rate and resolution, the DSP decoding suffers from a considerable inter-processor overhead leading to a degradation in performance and energy efficiency.

The remainder of this paper is organized as follows: in section II, architecture and system considerations for video decoding performance and energy consumption are presented. In sections III and IV, the experimental methodology and setup are detailed. Experimental results are discussed in section V. Related works on energy considerations of video decoding are summarized in section VI. Finally, conclusions and some future work perspectives are given in section VII.

II. BACKGROUND

As described in the introduction, the performance and energy consumption of video decoding depend on the processing elements used (processor type and clock frequency) and on the video coding quality. We describe hereafter how this is reflected at both the architectural and the system level.

A video consists of a data stream which is processed sequentially to produce the decoded video. In digital circuits, the decoding time and the power consumption depend on the clock frequency. In fact, in CMOS circuits, the dynamic power consumption is described by the following equation:

$$P_{dyn} = C_{eff} \cdot V^{2} \cdot f \tag{1}$$

where $C_{eff}$ represents the circuit effective capacitance and $V$ the supply voltage associated with the clock frequency $f$ [3]. For instance, Fig. 1 shows a simplified representation of a CMOS circuit which processes a set of data D using a block B. The block B operates at frequencies $f/2$ (Fig. 1-b) and $f$ (Fig. 1-a), corresponding to the supply voltage levels $V_1 = 1.06$ V and $V_2 = 1.2$ V respectively¹. If we suppose that the processing time is $t$, then the energy consumption when B operates at frequency $f$ is $E_{V_2} = P_{V_2} \cdot t$, where $P_{V_2} = C_{eff} \cdot V_2^2 \cdot f$. Fig. 1-a illustrates this configuration. Fig. 1-b shows the case where B operates at frequency $f/2$. If we suppose that the processing time at frequency $f/2$ is doubled, then the ratio between the energy $E_{V_1}$ consumed by the circuit at frequency $f/2$ with $V_1 = 1.06$ V, and $E_{V_2}$ is:

$$\frac{E_{V_1}}{E_{V_2}} = \frac{C_{eff} \cdot V_1^2 \cdot \frac{f}{2} \cdot 2t}{C_{eff} \cdot V_2^2 \cdot f \cdot t} = \left(\frac{V_1}{V_2}\right)^2 \simeq \frac{4}{5}$$

In this case, scaling down the voltage and the frequency decreases the power consumption to $P_{V_1} = C_{eff} \cdot V_1^2 \cdot \frac{f}{2} \simeq \frac{2}{5} \cdot P_{V_2}$, which leads to a 20% energy saving at the cost of a performance decrease. This may represent a scenario where the operating system dynamically scales down the processor frequency at run time when it detects a load decrease. This illustrates a system-driven voltage scaling.

In order to save energy without sacrificing performance, an architecture-driven voltage scaling [10] can be achieved by using two B blocks which are both clocked at a frequency $f/2$ and a voltage $V_1$, as described in Fig. 1-c. We note the consumed power and energy associated with this configuration $P_{2.V_1}$ and $E_{2.V_1}$ respectively. Since two blocks are operating in parallel, the execution time is not increased and the ratio between $E_{2.V_1}$ and $E_{V_2}$ is:

$$\frac{E_{2.V_1}}{E_{V_2}} = \frac{C_{eff} \cdot V_1^2 \cdot \frac{f}{2} \cdot t + C_{eff} \cdot V_1^2 \cdot \frac{f}{2} \cdot t}{C_{eff} \cdot V_2^2 \cdot f \cdot t} = \left(\frac{V_1}{V_2}\right)^2 \simeq \frac{4}{5}$$

In this configuration, the total power consumption $P_{2.V_1}$ is the sum of the power consumptions of the two blocks, $2 \cdot C_{eff} \cdot V_1^2 \cdot \frac{f}{2} \simeq \frac{4}{5} \cdot P_{V_2}$, and the energy saving is equal to 20% without sacrificing performance, but at the cost of additional circuit area.

Fig. 1: System and architecture driven frequency scaling

Theoretically, this type of parallelism provides better performance and energy efficiency for multimedia data processing applications [10]. Nevertheless, it generates an energy overhead at both the architectural and the system levels. In fact, the duplication of the processing circuits and the use of multiplexing/demultiplexing circuits in the architecture increase the leakage energy [4]. When this type of parallelism is implemented in additional specialized processors such as DSPs², the GPP-DSP inter-processor communication may generate a system overhead, and thus additional energy consumption [11], [12].

As an illustration, Fig. 2 describes the steps of a typical DSP video decoding process controlled by a GPP. The video frames are assumed to be in an input buffer in memory:
1) The GPP writes back the frame located in its cache to a shared memory so that the DSP can access it.
2) The GPP sends the parameters to the DSP codec via a GPP-DSP hardware bus.
3) The DSP invalidates the entries in its cache corresponding to a frame buffer in the shared memory.
4) The DSP decodes and transfers the frame to the output buffer.
5) The DSP sends the return status to the GPP.

¹ $V_1$ and $V_2$ are the voltage levels associated with the frequencies 250 MHz and 500 MHz respectively, for the Cortex A8 processor used in our experimental part.

² Parallelism can also be used internally in GPPs through pipelining and SIMD instruction sets.
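The energy ratios of the three configurations of Fig. 1 can be checked numerically. The following Python sketch simply evaluates equation (1) with the voltage levels quoted in the text; the function and variable names are illustrative, and `C_EFF`, `F` and `T` are arbitrary placeholder values, since they cancel out in the ratios.

```python
def dynamic_power(c_eff, voltage, freq):
    """Dynamic CMOS power from equation (1): P = C_eff * V^2 * f."""
    return c_eff * voltage ** 2 * freq


# Placeholder values (assumptions): they cancel out in the ratios below.
C_EFF = 1.0      # effective capacitance, arbitrary unit
F = 500e6        # reference clock frequency f, in Hz
T = 1.0          # processing time t at frequency f, arbitrary unit

V1, V2 = 1.06, 1.2   # supply voltages quoted in the text, in volts

# Fig. 1-a: one block at frequency f and voltage V2, processing time t.
e_ref = dynamic_power(C_EFF, V2, F) * T

# Fig. 1-b: one block at f/2 and V1, processing time doubled (system-driven).
e_system = dynamic_power(C_EFF, V1, F / 2) * (2 * T)

# Fig. 1-c: two blocks at f/2 and V1, processing time unchanged
# (architecture-driven).
e_arch = 2 * dynamic_power(C_EFF, V1, F / 2) * T

ratio_system = e_system / e_ref
ratio_arch = e_arch / e_ref
power_ratio_system = dynamic_power(C_EFF, V1, F / 2) / dynamic_power(C_EFF, V2, F)

print(f"E_V1 / E_V2   = {ratio_system:.3f}")
print(f"E_2.V1 / E_V2 = {ratio_arch:.3f}")
print(f"P_V1 / P_V2   = {power_ratio_system:.3f}")
```

With these voltages, both energy ratios evaluate to about 0.78 and the power ratio of the frequency-scaled block to about 0.39, which is consistent with the 4/5 and 2/5 approximations used above.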

Fig. 2: DSP video decoding

Fig. 3: Performance and energy characterization levels

The fact that both the DSP and the GPP have their own cache memory and communicate through a shared memory makes it necessary to manage cache coherency each time data are shared between the DSP and the GPP. In addition, from the operating system point of view, the GPP-DSP communication is managed by a driver through a system call. From the GPP point of view, a frame decoding is considered as an I/O operation, which generates a system latency caused by entering the idle state and handling the hardware interrupt. So, in order to compare GPP and DSP performance for video decoding fairly, one must pay particular attention to the additional communication overhead incurred when choosing the DSP as processing element, and not only compare the raw performance of both processors.

III. PERFORMANCE AND ENERGY CHARACTERIZATION METHODOLOGY

In this work, we chose to characterize the H.264/AVC [13] video decoding standard. Although other solutions are emerging, such as HEVC [14] and VP8 [15], H.264/AVC remains the most mature and widely used standard. An H.264/AVC video consists of a sequence of frames (Coded Pictures). Each frame may depend on previous and/or next frames, and its decoding requires the execution of different steps such as entropy decoding, inverse transform and motion compensation [13]. Each step is characterized by its performance and energy consumption properties [16], which highly depend on the underlying execution platform [7].

In our methodology, we decided to characterize the performance and the energy consumption of video decoding regardless of the internal intricacies of the H.264/AVC decoding process. For this purpose, we propose a performance characterization methodology which operates at three hierarchical levels:
1) Operating system level.
2) Video frame level.
3) Video sequence level.

As depicted in Fig. 3, the results of each level serve to characterize and understand the performance and the energy properties of the level above.

A. Operating System Level Power Characterization

In this step, the power consumption of both the GPP and the DSP in the idle and the active states is measured at the available processor clock frequencies, regardless of any video decoding process. The objective of this level is to build a set of reference power consumption values that help to understand and quantify the performance and the energy consumption of the different video decoding process phases at a frame granularity.

B. Frame-level Performance and Energy Characterization

This is a very important step in our methodology. It helps to understand the global performance study and is related to both the operating system and the decoder application. At this level, the elementary video frame processing is characterized according to a set of parameters such as buffer transfers, GPP-DSP communication and cache coherency maintenance. The objective of this step is to understand how the processing elements are used and where the energy goes when decoding a single frame, and to evaluate the overhead of both the GPP and the DSP video decoding processes. This step relies on the preceding operating system level characterization to identify the active and idle processor states.

A frame processing period includes the decoding of the frame located in memory (in the input buffer) and its copy to the output buffer. In case of GPP decoding, the overhead is all the application logic controlling the sequential calls to the frame processing functions. In this case, the frame decoding function is local (to the GPP) and the overhead is related to the design of the decoder application. For example, a multi-threaded decoder may induce an overhead due to thread synchronization. We refer to this kind of overhead as GPP-overhead. In the case of DSP decoding, the frame decoding function is executed on a remote processor (the DSP). As described in section II, this induces an overhead due to GPP-DSP communication, such as cache maintenance and hardware interrupt handling. We refer to this overhead as DSP-overhead. Both of these overheads induce extra performance and energy costs. We describe in the experimental methodology how these costs have been evaluated.

C. Video Decoding Performance and Energy Characterization

At this level, the average performance (number of decoded frames per second) of GPP and DSP H.264/AVC decoding of the overall video sequence is measured in terms of bit-rate, resolution and clock frequency. The chosen performance evaluation criterion is the displaying rate of the decoded video. In fact, we considered that a decoding rate which is lower than the displaying rate of the coded video is not sufficient for playing back the video with respect to real-time constraints.

The overall energy consumption is then calculated by summing the elementary measured power values, each multiplied by the sampling interval, over the decoding time. The average energy per frame (mJ/Frame) is then obtained by dividing the overall energy by the total number of frames.

IV. EXPERIMENTAL METHODOLOGY AND SETUP

This section describes the details of the conducted experiments, which follow the above-mentioned methodology.

A. Experimental Methodology

In the conducted experiments, the GStreamer multimedia framework was used for both GPP and DSP video decoding. GStreamer allows an accurate decoding comparison thanks to its modular design, which makes it possible to plug and execute both the GPP and the DSP video decoders in the same software environment. In addition, in order to improve the precision of the measurements, the phases which are not part of the actual GPP and DSP video decoding process are not considered: the execution time and the energy consumption related to buffering and displaying were deliberately omitted. In fact, these parts are I/O dependent and their performance may vary according to bandwidth fluctuation or file system performance. Studying the impact of these parts is beyond the scope of this paper.

B. Experimental Hardware Setup

The power measurements performed for this study were carried out on the OMAP3530EVM board containing the low-power OMAP3530 SoC. This SoC consists of a Cortex A8 ARM processor and a TMS320C64x DSP. The OMAP3530 supports six operating frequencies ranging from 125 MHz to 720 MHz for the ARM and from 90 MHz to 520 MHz for the DSP. The power consumptions of the DSP and the ARM processors are measured using the Open-PEOPLE framework [17], a multi-user and multi-target power and energy optimization platform and estimator. It includes the NI-PXI-4472 digitizer, which allows a sampling rate of up to 100 kHz. The OMAP3530EVM board provides a single jumper for measuring the power consumption of both the DSP and the ARM processor.
In case of ARM video decoding, the measured power represents the ARM dynamic power plus the ARM and DSP static powers. In case of DSP video decoding, both the ARM and the DSP are involved. In fact, the ARM controls the DSP, which executes the video decoding process. The measured power is thus equal to the sum of the static and dynamic powers of both the ARM and the DSP. In the rest of the paper, Pstatic is the sum of the ARM and the DSP static powers.

C. Experimental Software Setup

On this hardware platform, the Linux operating system version 2.6.32 was used with cpufreq [18] enabled to drive the ARM and DSP frequency scaling. The user-space governor was activated: this frequency scaling policy allows the clock frequency to be controlled at the application level. The H.264/AVC video decoding was achieved using GStreamer [19], a multimedia development framework. The ARM decoding was

Fig. 4: GStreamer H.264/AVC decoder plug-in design [19]

performed using ffdec_h264, an open-source plug-in based on the widely used ffmpeg/libavcodec library, compiled with support for the NEON SIMD instruction set. According to [20], NEON gives a 60-150% performance boost for video codecs. For DSP decoding, we used TIViddec2, a proprietary GStreamer H.264/AVC baseline profile plug-in provided by Texas Instruments. Its internal design is illustrated in Fig. 4. The video frames are moved from the encoded video buffer (the input buffer which contains the coded frames) to a circular buffer by a queuing thread. The circular buffer is a memory shared between the ARM and the DSP. This shared memory is not cached on the ARM side, to avoid incoherency when the DSP accesses it. The video decoding thread invokes the DSP decoder via the dsplink DSP driver. The DSP codec executes a cache invalidation operation so that it sees the right data in the shared memory, decodes the frame and transfers it to the decoded frame buffer using DMA.

In our experimental tests, videos are decoded from a flash memory. As described in section IV-A, we measured only the video decoding phase. However, GStreamer is a multi-threaded application and the buffering operations (transfers of the video frames from the file system to the input buffer in memory) may interleave with the video decoding operations, which makes it difficult to measure the performance and the energy of the decoding phase alone. To avoid this situation, a customized video decoder was developed using the GStreamer API. As shown in Fig. 5, in this decoder, the decoding thread is initially kept in the pause state while the video stream is being copied into an input buffer (GStreamer queue2 element) by the buffering thread. The transmission of the QoS messages containing the buffer level information is activated in the queue2 element. These messages are then monitored by a dedicated handler that wakes up the decoding thread when the whole video stream is held in the input buffer.

On the other hand, the decoded pictures were redirected to the /dev/null device in order not to account for the processing related to copying the frame from the output buffer to the memory of the display driver or to a file. This mechanism is the same in both GPP and DSP video decoding. Fig. 5 shows the GPP and DSP GStreamer pipes. filesrc, queue2 and filesink are GStreamer modules allowing

Fig. 5: GStreamer GPP and DSP video decoding pipes

to read a video file, buffer it in memory and send it to a destination location, respectively. These modules are shared between the GPP and DSP pipes, and we suppose that they generate the same load for both the DSP and the GPP decoding processes. The measured phases are steps 2, 3 and 4 (see Fig. 5). Step 5 is negligible since no video frame copying is performed there.

The video sequences used in our tests are Harbor and Soccer [21]. Each video is coded at 13 different bit-rates (64 Kb/s, 128 Kb/s, 256 Kb/s, 512 Kb/s, 1024 Kb/s, ..., 5120 Kb/s) and 3 different resolutions (qcif, cif and 4cif). Table I gives a complete summary of the hardware and software setups.

TABLE I: Hardware and software setup summary

| Category | Item | Parameter | Value |
|---|---|---|---|
| Test sequences | Videos | Name | Harbor, Soccer |
| Test sequences | Videos | Rate | 30 Frames/s |
| Test sequences | Videos | Resolution | qcif (176x144), cif (352x288), 4cif (704x576) |
| Test sequences | Videos | Bit-rate (Kb/s) | 64, 128, 256, 512, 1024, 1536, 2048, 2560, 3072, 3584, 4096, 4608, 5120 |
| Applications | GStreamer | ARM plug-in | ffdec_h264 |
| Applications | GStreamer | DSP plug-in | TIViddec |
| Applications | Operating System | Name | Linux 2.6.32 |
| Applications | Operating System | DVFS driver | cpufreq |
| Hardware | DSP | Name | TMS320C64x |
| Hardware | DSP | Frequencies (MHz) | 90, 180, 360, 400, 430, 520 |
| Hardware | ARM | Name | Cortex A8 + NEON |
| Hardware | ARM | Frequencies (MHz) | 125, 250, 500, 550, 600, 720 |
| Hardware | SoC | Name | OMAP3530 |
| Hardware | SoC | Voltage levels (V) | 0.975, 1.05, 1.2, 1.27, 1.35, 1.35 |
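The paper does not list the commands used to fix the clock frequency from user space. As a minimal sketch, assuming the standard Linux cpufreq sysfs interface (the `scaling_governor` and `scaling_setspeed` files, the latter expressed in kHz), the user-space governor of section IV-C could be driven as follows. The function name and the `cpufreq_dir` parameter are hypothetical, and writing to the real sysfs files requires root privileges.

```python
from pathlib import Path

# Standard cpufreq sysfs location for the first CPU (assumed layout).
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")


def set_fixed_frequency(freq_mhz, cpufreq_dir=CPUFREQ_DIR):
    """Select the userspace governor, then request a fixed clock frequency.

    cpufreq expects the target frequency in kHz, hence the conversion.
    """
    cpufreq_dir = Path(cpufreq_dir)
    (cpufreq_dir / "scaling_governor").write_text("userspace\n")
    (cpufreq_dir / "scaling_setspeed").write_text(f"{freq_mhz * 1000}\n")


# Example (needs root on a cpufreq-enabled kernel):
# set_fixed_frequency(720)   # keep the ARM at 720 MHz for a whole test run
```

Keeping the directory as a parameter makes the helper easy to exercise against a temporary directory before touching a real board.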

D. Performance and Energy Characterization

As described in the characterization methodology in section III, the experiments follow three successive steps.

1) Operating System Level Power Characterization: In this step, the operating system power consumption is characterized. At each clock frequency, the power consumption of the ARM processor is measured in the idle and active states. The active state corresponds to the execution of a variable incrementation loop. The power consumption of the DSP in the idle state is measured by calculating the power difference due to the activation of the DSP using the LPM (Local Power Manager) driver [22]. The cpuidle driver [23], which allows deeper low-power modes during idle states, is disabled in the kernel used, as dynamic power management (DPM) is beyond the scope of this study.

2) Video Frame Level Decoding Performance and Energy Characterization: In this step, the video decoding power and performance are characterized at the frame granularity. A 100 kHz power measurement sampling rate (10 µs resolution) is used to capture as much power variation information as possible within a frame decoding phase. This is especially useful in case of low video resolution decoding, where a frame can be decoded in 1 ms. In ARM decoding, the frame decoding time is measured by displaying timestamp information at the beginning and at the end of the avcodec_decode_video2() function (the video decoding function) of the libavcodec library. In case of DSP decoding, tracing is enabled in the DSP API through an environment variable, as described in [24]. Enabling the tracing does not impact the frame decoding time since no debug information is displayed within the decoding process. The timing information is provided as a number of clock cycles, which is converted to a time duration by dividing it by the DSP clock frequency [24]. The frame execution times are mapped to the measured power variations so as to identify the beginning and the end of the frame decoding.

In order to calculate the time overhead, the sum of the frame decoding times is subtracted from the total video decoding time (with tracing disabled in case of DSP decoding). Similarly, the energy overhead is calculated by subtracting the sum of the frame decoding energies from the overall video decoding energy. The frame decoding energy is obtained by multiplying the average frame decoding power consumption by the frame decoding time already calculated. In fact, we have noticed that the average power consumption corresponding to the frame decoding phase is constant for a given video resolution.
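The cycle-to-time conversion and the time overhead calculation described in this step can be sketched as follows. The helper names are illustrative, and the cycle counts and the total decoding time are hypothetical values chosen only to make the arithmetic easy to follow; they are not measurements from this study.

```python
def cycles_to_ms(cycles, clock_hz):
    """Convert a DSP cycle count to a duration in milliseconds."""
    return 1e3 * cycles / clock_hz


def time_overhead_ms(total_decoding_ms, frame_times_ms):
    """Overhead = total decoding time minus the sum of frame decoding times."""
    return total_decoding_ms - sum(frame_times_ms)


DSP_CLOCK_HZ = 520e6  # highest DSP frequency of the platform

# Hypothetical per-frame cycle counts, as they could appear in a DSP trace.
frame_cycles = [1_040_000, 1_014_000, 1_066_000]
frame_ms = [cycles_to_ms(c, DSP_CLOCK_HZ) for c in frame_cycles]

# Hypothetical total decoding time for these three frames.
total_ms = 12.0
overhead = time_overhead_ms(total_ms, frame_ms)
overhead_percent = 100.0 * overhead / total_ms

print([round(t, 2) for t in frame_ms], round(overhead, 2), round(overhead_percent, 1))
```

The energy overhead follows the same pattern, with frame energies (average frame decoding power multiplied by frame decoding time) in place of frame times.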
3) Video Sequence Level Performance and Energy Characterization: In the third step, the number of decoded frames per second and the energy per frame (mJ) of ARM and DSP H.264/AVC video decoding are measured. This operation is executed for each tested video sequence (Harbor and Soccer), bit-rate, resolution and clock frequency. The clock frequency was kept constant during the decoding of the whole video sequence for a given test. The video decoding energy is obtained by summing elementary energies using a 1 kHz power measurement sampling rate.

V. EXPERIMENTAL RESULTS & DISCUSSIONS

A. Operating System Power Characterization

Fig. 6-a shows both the active and the idle power consumption of the ARM processor for the six available clock frequencies. The power consumption in the active state is almost 35% greater than that of the idle state. This is due to the Wait For Interrupt (WFI) ARM instruction called when entering the idle state. WFI puts the processor in a low-power state by disabling most of the clocks while keeping the processor powered up. We note that the WFI power consumption is not equal to the processor static power, which corresponds to the state where all the clocks are gated.
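The energy summation used in the video sequence level step of section IV-D can be sketched as a simple rectangle-rule integration of the sampled power. This is an assumption about how the elementary energies are accumulated; the function names and the synthetic power trace are illustrative and do not come from the measurements.

```python
def energy_mj(power_samples_w, sample_rate_hz):
    """Sum elementary energies P_i * dt over all samples, returned in mJ."""
    dt = 1.0 / sample_rate_hz  # sampling interval in seconds
    return 1e3 * sum(p * dt for p in power_samples_w)


def energy_per_frame_mj(power_samples_w, sample_rate_hz, n_frames):
    """Average decoding energy per frame, in mJ/Frame."""
    return energy_mj(power_samples_w, sample_rate_hz) / n_frames


# Synthetic example: 1 s of decoding at a constant 0.6 W, sampled at 1 kHz,
# during which 30 frames are decoded.
samples = [0.6] * 1000
total = energy_mj(samples, 1000)
per_frame = energy_per_frame_mj(samples, 1000, 30)

print(f"{total:.1f} mJ in total, {per_frame:.1f} mJ/Frame")
```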

[Fig. 6: power (W) versus time (s) at 125, 250, 500, 550, 600 and 720 MHz. Panel (a): ARM active/idle power consumption. Panel (b): ARM and DSP idle power consumption, with the DSP activation and deactivation instants marked.]

Fig. 6: Cortex A8 and DSP dynamic power consumption

[Fig. 7, panels (a) and (b): DSP frame decoding power consumption for 4cif and qcif, plotted as power (W) versus time (ms) for the memory and for DSP + ARM. Annotated phases: frame processing period, DSP decoding, overhead, decoded frame transfer using DMA, memory power increase due to frame copy. Annotated states: DSP idle/ARM idle, DSP active/ARM idle, DSP idle/ARM active.]

Fig. 6-b shows the DSP idle power consumption at different frequency levels. The idle state corresponds to the state where the DSP is activated without executing any instruction. Table II summarizes the measured power characteristics of the Cortex A8 processor and the TMS320C64x DSP. The values of the static power corresponding to each voltage level are taken from [25].

TABLE II: ARM and DSP power characterization

| Vdd (V) | f_arm (MHz) | f_dsp (MHz) | P_arm_act (W) | P_arm_idle (W) | P_dsp_idle (W) | P_static (W) |
|---|---|---|---|---|---|---|
| 1.35 | 720 | 520 | 0.5965 | 0.4342 | 0.2312 | 0.01975 |
| 1.35 | 600 | 430 | 0.4997 | 0.3801 | 0.1778 | 0.01975 |
| 1.27 | 550 | 400 | 0.4087 | 0.3089 | 0.1490 | 0.01527 |
| 1.2 | 500 | 360 | 0.3276 | 0.2476 | 0.1217 | 0.01135 |
| 1.05 | 250 | 180 | 0.1238 | 0.0913 | 0.0498 | 0.00716 |
| 0.975 | 125 | 90 | 0.0421 | 0.0275 | 0.0224 | 0.00516 |

One can notice that the dynamic power

represents the major part of the total power as compared to the static power consumption. We can also observe that the power consumption of the ARM and DSP idle states is not negligible and may constitute an important part of the energy budget.

The different power consumption levels obtained in this step provide information on how the energy is consumed in the different processor states. The amount of time spent in each state is one of the parameters which impact the overall energy consumption. This is illustrated when analyzing video decoding in the following sections.

B. Frame-level Performance and Energy Characterization

In this step, we analyzed the ARM and DSP decoding (at 720 MHz clock frequency) of 10 frames from the Harbor sequence coded in the (4cif, 4 Mb/s), (cif, 1 Mb/s) and (qcif, 128 Kb/s) configurations. The objective of this step is to understand how the processor is used and where the consumed energy goes at a frame granularity. The video decoding overhead is thus assessed in this section. As described in the methodology section, we define the overhead as all the processing which is not related to the

[Fig. 7, panels (c) and (d): ARM frame decoding power consumption for 4cif and qcif, plotted as power (W) versus time (ms) for the memory and for ARM + DSP. Annotated phases: frame processing period, frame decoding, overhead, memory power increase due to frame copy.]

Fig. 7: ARM and DSP frames decoding

frame decoding. In case of ARM decoding, the overhead is related to the GStreamer framework. In addition to this overhead, the DSP decoding is characterized by an inter-processor communication overhead, as described in section II.

Fig. 7-a and Fig. 7-b show the power consumption level of 4cif and qcif DSP video decoding. The DSP frame decoding phase is represented by the strip varying between 0.7 W and 1.1 W, corresponding to the [32 ms, 62 ms] and [6.2 ms, 7.5 ms] intervals for 4cif and qcif respectively. This phase is terminated by a burst of DMA transfers of the decoded frame macro-blocks from the DSP cache to the shared memory. This burst corresponds to the intervals [56 ms, 62 ms] and [7.2 ms, 7.5 ms] for 4cif and qcif respectively, and is illustrated by an increase in memory power consumption. When the DSP terminates the frame decoding, it returns the execution status to the GPP and enters the idle state. This event occurs,

for example, at 25 ms in Fig. 7-a. The ARM wake-up latency is represented by the power level 0.66 W which is the sum of the power consumptions of both ARM and DSP in the idle state (0.43 W +0.23 W) as described in Table II. The ARM wake-up is represented by the power transition to 0.83 W level which is the sum of the ARM active state (0.59 W) and the DSP idle state (0.23 W). The ARM sends then the parameters (the next frame to decode) to the DSP codecs and triggers a DSP decoding function. Fig. 7-c and 7-d show the power consumption variation in case of 4cif and qcif ARM decoding. Like the DSP decoding, the frame decoding phase is characterized by a constant level of power consumption. The decoded frame copy does not appear clearly as in case of the DSP decoding since the frames are decoded in the ARM cache and evicted when no space is left in the cache. One can notice also that the frame decoding time is lower than the frame processing period due to a GStreamer overhead. One can observe that the amount of time spent in frame decoding as compared to the total video decoding time varies according to the video resolution. For example, in case of qcif DSP decoding (Fig. 7-b), the frame decoding represents almost 50% of the total time. In what follows, we show how we evaluated the amount of time and energy spent in frame decoding phases and calculated the overhead accordingly. The overhead decoding time is the difference between the total video decoding time and the sum of the elementary frames decoding times calculated according the method described in IV-D2. Table III shows the obtained values for qcif, cif and 4cif. TABLE III: ARM and DSP decoding times Resolution qcif (128kb) cif (1024kb) 4cif (5120 kb)

ARM decoding time(ms/frame) Processing Total Overhead (%) 2.19 2.87 10.04 10.85 12.04 9.88 47.23 52.39 9.86

DSP decoding time (ms/frame) Processing Total Overhead (%) 1.97 4.16 52.64 6.016 8.36 28.11 23.73 25.93 8.48
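The overhead percentages in Table III appear to follow from the processing and total times as (total - processing) / total. The short Python sketch below illustrates that reading on the DSP columns; the function name `overhead_percent` and the dictionary layout are illustrative choices, not part of the original measurement tooling, and the recomputed values agree with the published ones only to within rounding.

```python
def overhead_percent(processing_ms: float, total_ms: float) -> float:
    """Share of the per-frame time that is not spent in frame decoding (in %)."""
    if total_ms <= 0 or processing_ms > total_ms:
        raise ValueError("expected 0 < processing_ms <= total_ms")
    return 100.0 * (total_ms - processing_ms) / total_ms


# DSP columns of Table III: (processing, total) in ms/frame.
dsp_times_ms = {
    "qcif": (1.97, 4.16),
    "cif": (6.016, 8.36),
    "4cif": (23.73, 25.93),
}

# Recomputed overhead share per resolution, rounded to two decimals.
dsp_overhead = {
    res: round(overhead_percent(proc, total), 2)
    for res, (proc, total) in dsp_times_ms.items()
}

for res, pct in dsp_overhead.items():
    print(f"{res}: {pct}% overhead")
```

The qcif and 4cif values reproduce the table entries (52.64 and 8.48); the cif value differs from the published 28.11 by less than 0.1 percentage point, which is presumably a rounding effect in the reported times.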

The time overhead is 8%, 28% and 52% of the total frame decoding time in the case of 4cif, cif and qcif DSP decoding respectively. On the other hand, it is almost constant (about 10%) in the case of ARM decoding. We note that the overhead is not negligible as compared to the total decoding time. For example, the total qcif DSP decoding time is higher than that of the ARM although the frames are processed faster by the DSP.

To investigate the energy overhead, we calculated the energy consumption of the frame decoding phases and subtracted it from the total decoding energy, as described in Section IV-D2. Table IV shows the calculated average power values and Table V the energy overhead calculated accordingly.

TABLE IV: Frames processing average power consumption

Resolution         ARM frame decoding average power (W)  DSP frame decoding average power (W)
qcif (128 Kb/s)    0.55                                  0.87
cif (1024 Kb/s)    0.57                                  0.91
4cif (5120 Kb/s)   0.58                                  0.96

TABLE V: ARM and DSP decoding energy overhead (mJ/frame)

Resolution         ARM Processing  ARM Total  ARM Overhead (%)  DSP Processing  DSP Total  DSP Overhead (%)
qcif (128 Kb/s)    1.20            1.54       10.01             1.71            2.33       30.48
cif (1024 Kb/s)    6.18            6.87       9.97              5.35            6.72       20.38
4cif (5120 Kb/s)   27.39           28.4       3.55              21.59           22.16      2.5
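As a sanity check on how Tables III, IV and V relate, the following Python sketch multiplies the average frame decoding power (Table IV) by the frame processing time (Table III) to obtain the processing energy per frame. It is only an illustration under the assumption that the processing energy is the product of these two averages; the helper names `energy_mj` and `energy_overhead_mj` are hypothetical. The ARM rows reproduce the Table V processing energies after rounding.

```python
def energy_mj(power_w: float, time_ms: float) -> float:
    """Energy in millijoules: watts multiplied by milliseconds gives mJ."""
    return power_w * time_ms


def energy_overhead_mj(processing_mj: float, total_mj: float) -> float:
    """Energy per frame spent outside the frame decoding phase (mJ)."""
    return total_mj - processing_mj


# ARM values copied from Table IV (average power) and Table III (processing time).
arm_power_w = {"qcif": 0.55, "cif": 0.57, "4cif": 0.58}
arm_time_ms = {"qcif": 2.19, "cif": 10.85, "4cif": 47.23}

arm_energy = {
    res: round(energy_mj(arm_power_w[res], arm_time_ms[res]), 2)
    for res in arm_power_w
}
print(arm_energy)

# Absolute DSP energy overhead for qcif, from the Table V processing and total columns.
print(energy_overhead_mj(1.71, 2.33))
```

The same product for the DSP cif and 4cif rows does not match Table V as closely, so the relation should be read as an approximation rather than as the exact procedure used to build the table.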

One can observe that the energy overhead is higher in the case of DSP decoding. In fact, the ARM-DSP communication contributes a significant part of the energy consumption, especially for low resolutions. We can also notice that keeping the DSP in the idle state while it waits for a decoding request from the ARM contributes to the energy overhead. In fact, the DSP is not deactivated when it is waiting to receive the next frame. The idle interval is so short that entering a deeper sleep mode would have a negative impact on performance. During this interval, the DSP consumes about 0.23 W at 520 MHz (refer to Table II) without executing any task. For example, in the case of qcif decoding, the DSP is not used for more than 50% of the time, but it still consumes idle power. We can conclude from the conducted tests that the time and energy overhead is not negligible, especially in the case of DSP qcif decoding. These results are used, in the next section, to analyze the overall performance and energy consumption properties of whole video decoding.

C. Video Stream Performance and Energy Characterization

1) Decoding Performance Results: Fig. 8 shows a comparison between ARM and DSP video decoding performance in the case of 4cif, cif and qcif resolutions for the Harbor video sequence. The performance variation behavior is the same in the case of the Soccer video sequence. The flat surface represents the acceptable reference video displaying rate (30 Frames/s). The first observation which can be drawn is that, in the case of qcif and cif resolutions, the video is decoded at a rate that is higher than the displaying rate (30 Frames/s) even for low clock frequencies, regardless of the video bit-rate and the processor type. The ratio between the actual decoding speed and the displaying rate increases for high clock frequencies and low bit-rates. It reaches the maximum value of 10 for qcif 64 Kb/s ARM decoding at a 720 MHz clock frequency.
In the case of 4cif resolution, a decoding rate higher than 30 Frames/s can be achieved by the DSP starting from a 180 MHz frequency (corresponding to a 250 MHz ARM frequency) for low bit-rates and starting from a 430 MHz frequency (corresponding to a 600 MHz ARM frequency) for high bit-rates. The performances of the ARM processor and the DSP are almost equivalent in the case of qcif resolution. However, the ARM decoding speed is 43% higher than that of the DSP at a 64 Kb/s bit-rate, while the DSP decoding speed is 14% higher than that of the ARM at a 5120 Kb/s bit-rate. On the other hand, DSP decoding is almost 50% faster than ARM decoding in the case of cif resolution and 100% faster in the case of 4cif. This ratio decreases drastically for low bit-rates, where the ARM processor performance tends to increase faster than that of the DSP.
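The real-time margin discussed above (decoding rate divided by the 30 Frames/s displaying rate) and the search for the lowest clock frequency that still meets the displaying deadline can be sketched in Python as follows. This is a minimal illustration, not the authors' tooling: the function names are hypothetical, and the frequency-to-rate mapping in the usage part contains made-up numbers chosen only to exercise the code, not measurements from the paper.

```python
from typing import Mapping, Optional

DISPLAY_RATE_FPS = 30.0  # reference displaying rate used in the paper


def realtime_ratio(decoding_fps: float) -> float:
    """Ratio between the decoding rate and the displaying rate."""
    return decoding_fps / DISPLAY_RATE_FPS


def lowest_realtime_frequency(
    rate_by_freq: Mapping[int, float],
    target_fps: float = DISPLAY_RATE_FPS,
) -> Optional[int]:
    """Lowest clock frequency (MHz) whose measured rate meets the target.

    rate_by_freq maps a clock frequency in MHz to a measured decoding
    rate in frames/s. Returns None when no frequency is fast enough.
    """
    feasible = [freq for freq, fps in rate_by_freq.items() if fps >= target_fps]
    return min(feasible) if feasible else None


# A ratio of 10 corresponds to decoding at 300 frames/s.
print(realtime_ratio(300.0))

# Hypothetical measurements (NOT from the paper): DSP clock in MHz -> frames/s.
example_rates = {90: 14.0, 180: 31.0, 360: 58.0, 520: 80.0}
print(lowest_realtime_frequency(example_rates))
```

Working from measured rates rather than from a scaling model avoids assuming that the decoding rate grows linearly with the clock frequency, which the results reported here do not guarantee.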

Fig. 8: ARM and DSP decoding performance of the Harbor video (panels: qcif, cif and 4cif ARM and DSP decoding; axes: clock frequency, bit-rate (Kb/s) and decoding rate (Frames/s))

Fig. 9: ARM and DSP video decoding power consumption (panels: qcif, cif and 4cif decoding average power consumption for the Harbor video; axes: clock frequency, bit-rate and power (W))

Fig. 10: ARM vs DSP decoding energy consumption of H.264/AVC video (panels: qcif, cif and 4cif decoding energy consumption for the Harbor and Soccer videos; axes: clock frequency, bit-rate (Kb/s) and energy (mJ/Frame))

2) Power Consumption Results: Fig. 9 illustrates the variation of the average power consumption of ARM and DSP video decoding according to the video resolution and bit-rate in the case of the Harbor video (the Soccer video sequence gave similar results). We note that the power consumption depends mainly on the clock frequency, which is explained by the dominance of the dynamic power as compared to the static power. For example, at 720 MHz, the static power is 19.75 mW (refer to Table II), which represents 3.4% and 2.8% of the ARM and DSP qcif video decoding power consumption (540 mW and 700 mW respectively).

We can also observe that, unlike the ARM decoding average power consumption, the DSP power consumption increases when the video resolution increases. The DSP power consumption is thus 30%, 40%, and 50% higher than that of the ARM in the case of qcif, cif and 4cif resolution respectively. This can be explained by the results obtained in Section V-B regarding the overhead evaluation. In fact, we found that the percentage of time overhead is almost constant in the case of ARM decoding and decreases when the video resolution increases in the case of DSP decoding. A frame-level power characterization shows that the overhead phase generally corresponds to a power consumption decrease due to entering the idle state (refer to Fig. 7). Consequently, the larger the overhead, the lower the average power consumption.

3) Energy Consumption Results: The previous results showed a very important variation of the DSP/ARM performance and power consumption depending on the clock frequency, the video bit-rate and the resolution. The energy consumption is the combination of the power consumption and the decoding time properties. Fig. 10 shows a comparison between the ARM and DSP video decoding average energy consumption (mJ/Frame) in the case of 4cif, cif and qcif resolutions for both the Soccer and Harbor videos.

DSP qcif video decoding consumes 100% more energy than the ARM for low bit-rates and 20% more for high bit-rates. This is explained by the lower performance and higher power consumption of DSP decoding as compared to the ARM. In fact, it was shown in Section V-B that, for low video qualities, the system overhead of DSP decoding is responsible for the drop in performance and energy properties as compared to the ARM. For example, as described in Table III, in the case of qcif 128 Kb/s video decoding, although the DSP frame decoding is faster than that of the ARM, its overall performance is lower. On the other hand, DSP 4cif video decoding globally consumes less energy than the ARM although it consumes 60% more power. This is due to a better DSP decoding performance, which can be 100% higher than that of the ARM. In the case of cif resolution, we noticed a crossing between the ARM and DSP energy consumption levels. In fact, for bit-rates lower than 1 Mb/s, the ARM consumes less energy than the DSP, while for higher bit-rates the DSP consumes less.

4) Discussion: The analysis of the video decoding results shows that the overall performance and the energy efficiency of the DSP as compared to the ARM processor depend mainly on the required video coding quality (bit-rate and resolution).
In fact, DSP video decoding is the best choice in terms of performance and energy efficiency in the case of 4cif resolution, while ARM decoding is better in the case of qcif resolution and of cif resolution with a bit-rate lower than 1 Mb/s. If we suppose that the qcif, cif and 4cif resolutions are used respectively in low-end phones, high-end phones and tablets, we can propose the guideline in Table VI for selecting the processor type (ARM Cortex A8 or DSP TMS320C64x) which offers the best performance and energy properties for decoding a video according to the bit-rate and the mobile device type.

TABLE VI: Processor type selection guideline
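The selection rule stated in the text (ARM for qcif, ARM for cif below about 1 Mb/s, DSP otherwise) can be written as a small Python function. This sketch encodes only what the paragraph above says; the function name, the constant `CIF_CROSSOVER_KBPS` and the choice of 1000 Kb/s as the numeric value of "1 Mb/s" are assumptions made for illustration, and the behaviour exactly at the threshold is not specified by the paper.

```python
# Assumed numeric value of the "1 Mb/s" crossover reported for cif resolution.
CIF_CROSSOVER_KBPS = 1000.0


def select_processor(resolution: str, bitrate_kbps: float) -> str:
    """Return "ARM" or "DSP" following the guideline stated in the text.

    resolution is one of "qcif", "cif" or "4cif"; bitrate_kbps is the
    video bit-rate in Kb/s.
    """
    res = resolution.lower()
    if res == "qcif":
        return "ARM"  # the ARM is reported as better for qcif
    if res == "cif":
        # Crossing point: ARM below the threshold, DSP at or above it (assumed).
        return "ARM" if bitrate_kbps < CIF_CROSSOVER_KBPS else "DSP"
    if res == "4cif":
        return "DSP"  # the DSP is reported as better for 4cif
    raise ValueError(f"unsupported resolution: {resolution!r}")


for res, rate in [("qcif", 64), ("cif", 512), ("cif", 2048), ("4cif", 5120)]:
    print(res, rate, "->", select_processor(res, rate))
```

With this threshold, the cif 1024 Kb/s case maps to the DSP, which is consistent with the total energy values of Table V (6.72 mJ/frame for the DSP against 6.87 mJ/frame for the ARM).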

On the other hand, the analysis of the experimental results according to the processor clock frequency reveals that, in many cases, even if the clock frequency is scaled down, the video can still be decoded while meeting the displaying deadline. For example, in the case of 64 Kb/s qcif resolution, when using the maximum frequency (720 MHz), the video can be decoded 10x faster than the displaying rate on the ARM processor. This is a typical configuration where energy can be saved by scaling down the processor clock frequency.

VI. RELATED WORKS

H.264/AVC video decoding performance and energy consumption characterization may be addressed at the application, system, or architectural level. In [4], the performance and energy considerations when using pipelines and parallelism in CMOS circuits are studied, and it is shown why these architectures enhance energy efficiency. In [26], the particular case of H.264/AVC video decoding is analyzed and an energy-aware processor architecture design methodology is proposed for energy-efficient H.264/AVC decoding. At this level, the impact of the application and operating system layers on the energy consumption is not considered.

At a higher level, in [7], H.264/AVC decoding performance is characterized on different GPP architectures at the CPU cycle level. This approach is used in [16] for energy characterization and modeling of the different H.264/AVC decoder modules. The results were used to develop an energy-aware video decoding strategy for ARM processors supporting DVFS. The results of this study are generalized in [27] to account for the variation in video bit-rate. These studies focused only on GPPs.

On the other hand, several works studied the performance and energy consumption of DSP video decoding. In [28], [29], and [11], performance considerations of DSP decoding are analyzed, especially with respect to cache coherency maintenance and DMA transfers. In [30], energy characterization of DSP processing is addressed in terms of memory accesses and DMA transfers. In [31], DSP video decoding energy consumption is analyzed in terms of different video coding qualities.
Many of these studies highlight the performance and the energy efficiency of DSP video decoding; however, they do not compare them to those of GPP video decoding. In this study, we have proposed to characterize the performance and energy consumption of GPP and DSP H.264/AVC video decoding in terms of bit-rate, resolution and clock frequency. To allow an objective evaluation of GPP decoding as compared to DSP decoding, we focused, in the proposed methodology, on carrying out the GPP and DSP video decoding performance and energy measurements under the same conditions by using GStreamer, a GPP-DSP multimedia framework. As far as we know, no study has considered all these parameters for both GPP and DSP within the same application framework.

VII. CONCLUSION

In this paper, we proposed a comprehensive study of the performance and the energy consumption of GPP and DSP H.264/AVC video decoding. A system and application level characterization methodology was proposed to evaluate the performance and energy properties of H.264/AVC GPP and DSP decoding in terms of architecture, system and application parameters. An important contribution of our work is the evaluation of the performance and energy overhead of GPP and DSP video decoding for different video bit-rates and resolutions. It was shown that, for low video coding qualities, the system overhead is significant in the case of DSP decoding, which leads to a degradation of the performance and energy properties. The overhead evaluation results were used to explain the overall performance and energy properties of both GPP and DSP video decoding. This made it possible to identify the video coding qualities for which using a GPP is more energy efficient than using a DSP.

The results are based on experimental tests conducted on the OMAP3530 low-power platform; however, our methodology can be applied to any other platform for performance and energy evaluation. The results of this work can be used as guidelines for selecting the appropriate processing element of the studied SoC when decoding a video having a given bit-rate and resolution. An interesting application, which we plan to implement, is to use these guidelines to dynamically select the processor type to be used in energy-aware decoding for dynamic streaming [32], [33], where the quality of the decoded video changes dynamically according to the bandwidth fluctuation or the displaying capabilities of the mobile device.

REFERENCES

[1] M. Broussely and G. Archdale, “Li-ion batteries and portable power source prospects for the next 5-10 years,” Journal of Power Sources, vol. 136, no. 2, pp. 386–394, Oct. 2004.
[2] A. Carroll and G. Heiser, “An analysis of power consumption in a smartphone,” Proceedings of the 2010 USENIX conference on USENIX annual technical conference, 2010.
[3] T. Burd and R. Brodersen, “Energy efficient CMOS microprocessor design,” System Sciences, Proceedings of the Twenty-Eighth Hawaii International Conference on, vol. 1, pp. 288–297, 1995.
[4] D. Markovic, V. Stojanovic, B. Nikolic, M. Horowitz, and R. Brodersen, “Methods for true energy-performance optimization,” Solid-State Circuits, IEEE Journal of, vol. 39, no. 8, pp. 1282–1293, 2004.
[5] A. Wang and A. Chandrakasan, “Energy-efficient DSPs for wireless sensor networks,” Signal Processing Magazine, IEEE, vol. 19, no. 4, pp. 68–78, 2002.
[6] C. H. K. Van Berkel, “Multi-core for mobile phones,” Proceedings of the Conference on Design, Automation and Test in Europe, pp. 1260–1265, 2009.
[7] M. Horowitz, A. Joch, F. Kossentini, and A. Hallapuro, “H.264/AVC baseline profile decoder complexity analysis,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 13, no. 7, pp. 704–716, 2003.
[8] Z. Lu, J. Lach, M. Stan, and K. Skadron, “Reducing multimedia decode power using feedback control,” Computer Design, 2003. Proceedings. 21st International Conference on, pp. 489–496, 2003.
[9] J. Pouwelse, K. Langendoen, and H. Sips, “Dynamic voltage scaling on a low-power microprocessor,” Proceedings of the 7th annual international conference on Mobile computing and networking, pp. 251–259, 2001.
[10] A. Chandrakasan, S. Sheng, and R. Brodersen, “Low-power CMOS digital design,” IEEE Journal of Solid-State Circuits, vol. 27, no. 4, pp. 473–484, Apr. 1992.
[11] J. Golston, S. Arora, and R. Reddy, “Optimized video decoder architecture for TMS320C64x dsp generation,” Proc. SPIE 5022, Image and Video Communications and Processing, pp. 719–726, 2003.
[12] “Codec engine overhead.” [Online]. Available: http://processors.wiki.ti.com/index.php/Codec Engine Overhead
[13] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, Jul. 2003.
[14] G. Sullivan, J. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 22, no. 12, pp. 1649–1668, 2012.
[15] C. Feller, J. Wuenschmann, T. Roll, and A. Rothermel, “The VP8 video codec - overview and comparison to H.264/AVC,” Consumer Electronics - Berlin (ICCE-Berlin), 2011 IEEE International Conference on, pp. 57–61, 2011.
[16] Z. Ma, H. Hu, and Y. Wang, “On complexity modeling of H.264/AVC video decoding and its application for energy efficient decoding,” IEEE Transactions on Multimedia, vol. 13, no. 6, pp. 1240–1255, Dec. 2011.
[17] E. Senn, D. Chillet, O. Zendra, C. Belleudy, S. Bilavarn, R. Atitallah, C. Samoyeau, and A. Fritsch, “Open-people: Open power and energy optimization PLatform and estimator,” 2012 15th Euromicro Conference on Digital System Design (DSD), pp. 668–675, Sep. 2012.
[18] V. Pallipadi and A. Starikovskiy, “The ondemand governor: past, present and future,” in Proceedings of Linux Symposium, vol. 2, pp. 223–238, 2006.
[19] C. M. Don Darling and B. Singh, “Gstreamer on texas instruments OMAP35x processors,” Proceedings of the Ottawa Linux Symposium, pp. 69–78, 2009.
[20] “The ARM NEON general-purpose SIMD.” [Online]. Available: http://www.arm.com/products/processors/technologies/neon.php
[21] “Svc test sequences.” [Online]. Available: ftp://ftp.tnt.unihannover.de/pub/svc/testsequences/
[22] “Local power manager driver.” [Online]. Available: http://softwaredl.ti.com/dsps/dsps public sw/sdo sb/targetcontent/lpm/index.html
[23] V. Pallipadi, “cpuidle - do nothing, efficiently...” Proceedings of the Ottawa Linux Symposium, 2007.
[24] “Codec engine profiling - texas instruments embedded processors wiki.” [Online]. Available: http://processors.wiki.ti.com/index.php/Codec Engine Profiling
[25] “OMAP3530 power estimation spreadsheet.” [Online]. Available: http://processors.wiki.ti.com/index.php/OMAP3530 Power Estimation Spreadsheet
[26] K. Xu, T.-M. Liu, J.-I. Guo, and C.-S. Choy, “Methods for power/throughput/area optimization of H.264/AVC decoding,” Journal of Signal Processing Systems, vol. 60, no. 1, pp. 131–145, 2010.
[27] X. Li, Z. Ma, and F. Fernandes, “Modeling power consumption for video decoding on mobile platform and its application to power-rate constrained streaming,” Visual Communications and Image Processing (VCIP), 2012 IEEE, pp. 1–6, 2012.
[28] P. Ramachandra and M. R. Satish, “H.264 main profile video decoding implementation techniques on OMAP3430IVA,” Signal Processing (ICSP), 2010 IEEE 10th International Conference on, pp. 271–274, 2010.
[29] S. Kant, U. Mithun, and P. S. S. B. K. Gupta, “Real time H.264 video encoder implementation on a programmable dsp processor for videophone applications,” Consumer Electronics, 2006. ICCE ’06. 2006 Digest of Technical Papers. International Conference on, 2006.
[30] N. Julien, J. Laurent, E. Senn, and E. Martin, “Power consumption modeling and characterization of the TI C6201,” IEEE Micro, vol. 23, no. 5, pp. 40–49, Sep. 2003.
[31] E. Juárez, F. Pescador, P. J. Lobo, A. Groba, and C. Sanz, “Distortion-energy analysis of an OMAP-Based H.264/SVC decoder,” Mobile Multimedia Communications, no. 77, pp. 544–559, Jan. 2012.
[32] T. Stockhammer, “Dynamic adaptive streaming over HTTP: standards and design principles,” Proceedings of the second annual ACM conference on Multimedia systems, pp. 133–144, 2011.
[33] H. Schwarz, D. Marpe, and T. Wiegand, “Overview of the scalable video coding extension of the H.264/AVC standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 9, Sep. 2007.
