A SURVEY OF MOBILE AUDIO ARCHITECTURE ISSUES

PIERRE-LOUIS BOSSART

Freescale Semiconductors*, Wireless and Mobile Systems Group
6501 William Cannon Dr. W, Austin, TX 78735, USA
[email protected]

This paper aims at providing a broad overview of the status of audio support in mobile hand-held devices. The evolution from audio players or cell-phones into convergent devices supporting multiple and complex use-cases has generated new system constraints and new mobile audio architectures. By reviewing the complete mobile audio framework, from codecs to audio middleware, from platform architecture to hardware, we provide a down-to-earth explanation for critical and dimensioning factors and shed some light on existing standards, upcoming solutions and needed compromises.

INTRODUCTION
After a definition of mobile audio platforms, we present a set of use cases and discuss implementation problems and critical factors. Audio transforms such as codecs and effects are then reviewed in detail. A section is dedicated to output devices, with specific attention to audio quality issues in the D/A conversion. Bluetooth audio protocols are presented, along with the issues raised by platform architectures and interfaces. The core of this paper is dedicated to the system constraints and requirements that stem from OS integration and the complexity of concurrent use cases. Several cell-phone architectures are described, as well as possible hardware-acceleration solutions and their support in multimedia software frameworks.

1 DEFINITION
A mobile audio platform relies on:
• A host application processor, such as an ARM core, running a general operating system. In most cases a proprietary RTOS is used; however, open operating systems such as Linux, Symbian, WindowsCE and Windows Mobile are gaining ground.
• RAM, either embedded or external. This RAM is shared between the host processor and the audio and multimedia subsystems.
• Flash memory to store the firmware. NAND flash requires error correction, and the firmware must be copied to RAM before it is executed.
• Input keys, a display and a GUI.
• Mass storage, such as Memory-Stick, SD and MMC flash cards or an HDD, to store recorded or downloaded audio files.

• Audio interfaces such as I2S or AC-97, with audio data routed through DMA.
• Baseband modems and RF for GSM, 3G, WLAN and Bluetooth wireless links.
• Connectivity interfaces such as USB and HSI.
• FM-Radio, DVB-H or T-DMB tuners.
• Audio acceleration capabilities.
• Security support in hardware.

As can be seen from this list, one fundamental difference between audio platforms lies in the choice of mass storage and connectivity interfaces. Silicon vendors often provide support for most of these interfaces and rely on alternate pin functions to keep pin count and prices low. Each input or output pin can be assigned to different peripherals; however, only a limited number of combinations are supported. A thorough analysis is required to make sure the needed interfaces can be enabled in all use cases.

2 USE CASES

2.1 Audio Player
This use case requires access to mass storage and its file system, Digital Rights Management (DRM) capabilities, audio decompression and rendering. The iPod is one of the most successful products in this category. Besides design and simplicity, the critical differentiating factors are storage size and battery life. Current playback times are in the range of 20 hours, and will reach over 100 hours in future products. Reaching such long playback times requires a look-ahead of compressed audio data with copies to RAM, reduced processor clock speeds, a minimized number of interrupts, and simplified audio routing. Additional PCM processing proves costly and significantly decreases battery life. Audio content can be high-quality audio as well as MMS or voice memos.
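To make the look-ahead strategy concrete, the following C sketch (names and sizes are illustrative, not taken from any product) reads compressed audio into RAM in large bursts so that mass storage can power down between refills; the decoder then pulls small frames from RAM only.

```c
/* Minimal look-ahead sketch: refill a large RAM buffer in one burst,
 * then let mass storage sleep while the decoder drains it. */
#include <stdio.h>
#include <string.h>

#define LOOKAHEAD_BYTES (2 * 1024 * 1024)  /* ~2 MB of compressed audio */

typedef struct {
    unsigned char *data;   /* LOOKAHEAD_BYTES of RAM */
    size_t         level;  /* valid bytes in the buffer */
    size_t         pos;    /* decoder read position */
} lookahead_buf;

/* One large burst read; storage can go back to sleep afterwards. */
static int refill(lookahead_buf *b, FILE *f)
{
    b->level = fread(b->data, 1, LOOKAHEAD_BYTES, f);
    b->pos = 0;
    return b->level > 0;
}

/* The decoder pulls small frames from RAM; short reads are possible. */
static size_t pull(lookahead_buf *b, unsigned char *dst, size_t n, FILE *f)
{
    if (b->pos == b->level && !refill(b, f))
        return 0;                      /* end of stream */
    if (n > b->level - b->pos)
        n = b->level - b->pos;         /* serve what is left in RAM */
    memcpy(dst, b->data + b->pos, n);
    b->pos += n;
    return n;
}
```

With a 2 MB buffer and a 128 kbit/s stream, storage would only need to wake up roughly every two minutes.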


2.2 Audio Encoder
Few devices provide high-quality audio encoding capabilities. Digital input interfaces are rarely provided, and microphones are often not suitable for high-quality audio. In most cases encoding is used for low sampling rates and speech data, such as MMS or voice memos. Encoding at high sample rates is more common when combined with video in a camcorder application.

2.3 Multimedia Player
This use case is the natural evolution of the audio player. The efficiency of the audio decoding is decreased by the additional traffic on buses and memory controllers. Audio playback needs to keep the highest priority to prevent drops and clicks. An additional demux layer running on the host processor splits audio and video payloads while maintaining A/V sync. The increased processing load on the host, video and display subsystems dramatically reduces battery life.

2.4 Circuit-switched (CS) speech call
Cell-phones perform full-duplex speech encoding and decoding, along with Acoustic Echo Cancellation (AEC), Noise Reduction (NR) and dynamics processing (DRC), as presented in Figure 1. Network bandwidth is optimized by virtue of Voice Activity Detection and SID (Silence Descriptor) transmission. The GSM system defines strict requirements on latency and round-trip delays; double buffering needs to be minimized. Most cell-phones rely on the 8 kHz sampling frequency; however, 3G phones will support an extended bandwidth with AMR-WB operating at 16 kHz, which will improve speech intelligibility yet dramatically increase the MIPS requirements of the AEC. Algorithms operating in the time domain will need to be replaced with frequency-domain adaptive filters.

Figure 1: Speech call data flow (decoder and EQ on the downlink; AEC, NR, DRC and encoder on the uplink)

2.5 Voice-over-IP (VoIP) call
Instead of relying on circuit-switched protocols, speech is transmitted through Internet protocols. Connections rely on RTP, either over WLAN or over GPRS/EDGE/3G. Operating system vendors generally provide Internet protocols and VoIP software stacks, which run on the host processor. The strict latency constraints of CS calls do not exist in VoIP calls; however, speech packets can be lost or transmitted with an unpredictable delay. VoIP implementations therefore rely on a jitter buffer. Clock recovery and handling of lost packets can be performed with:
• Packet Loss Concealment (PLC) implemented in decoders. The jitter buffer stores compressed frames; the granularity depends on the speech codec and varies between 10 and 30 ms.
• Time-scale algorithms such as PSOLA, which have a finer granularity and better audio output but require additional memory and MIPS.
Cancellation of long echo tails (> 200 ms) is difficult and often results in muting one signal when the other is active. Full-duplex conversation is hardly possible in this use case. The efficiency of PLC and AEC is the major differentiating factor in VoIP.

2.6 UMA
UMA stands for Unlicensed Mobile Access [2], which enables seamless hand-overs between CS calls, WLAN and Bluetooth. This technology helps reduce cell-phone bills, and was demonstrated at 3GSM on Nokia phones. UMA involves a considerable amount of protocol stack software and is extremely complex to validate.

2.7 Video telephony
From an audio standpoint this is basically the same use case as the VoIP call, only with additional A/V sync and buffering constraints, as audio and video are transmitted in separate sessions possibly routed through different servers. Transmission errors are a major issue, with audio and video quality significantly decreased compared to wired PC implementations.

2.8 Audio streaming
This use case relies on the same protocols as VoIP, only at higher sampling rates. Jitter buffer management and error concealment are needed as well; however, skipping and repeating frames or time-scale expansion degrade the audio quality. Driving a fractional audio PLL based on the fill ratio of the jitter buffer proves more efficient, with no increase in MIPS or memory. For cost reasons such fractional PLLs are almost never available in mobile devices.
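Both the VoIP and streaming use cases revolve around the jitter buffer. The sketch below shows the kind of per-frame playout decision such a buffer might apply; the watermarks are illustrative, and a real implementation would adapt them to the measured jitter.

```c
/* Jitter-buffer playout control sketch: the fill level drives either
 * concealment/time-scale expansion (underrun) or compression (overrun). */
enum playout_action { PLAY_NORMAL, CONCEAL_OR_STRETCH, DROP_OR_COMPRESS };

typedef struct {
    int frames;     /* frames currently queued */
    int low_mark;   /* e.g. 2 frames (~40 ms for a 20 ms codec) */
    int high_mark;  /* e.g. 8 frames (~160 ms) */
} jitter_buf;

static enum playout_action next_action(const jitter_buf *jb, int frame_lost)
{
    if (frame_lost || jb->frames == 0)
        return CONCEAL_OR_STRETCH;  /* PLC in the decoder, or repeat/mute */
    if (jb->frames < jb->low_mark)
        return CONCEAL_OR_STRETCH;  /* expand time scale to refill */
    if (jb->frames > jb->high_mark)
        return DROP_OR_COMPRESS;    /* shrink latency: PSOLA or frame drop */
    return PLAY_NORMAL;
}
```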


2.9 Mobile TV
Mobile TV is described as the next killer application, with two competing standards (DVB-H and T-DMB). Video is the main hurdle, while audio requirements do not differ drastically from the audio player use case. The audio subsystem needs to support recent low-bitrate algorithms suitable for broadcast, such as e-AACPlus, AAC-BSAC, AMR-WB+ and possibly WMA. Since T-DMB relies on the DAB framework, MPEG Layer 2 is needed as well. Besides power consumption, the main issue in mobile TV lies in transmission error handling, especially with long periods of lost data. Error concealment algorithms need to preserve A/V sync and support seamless transitions between frame substitution and frame muting schemes. In addition, decoders need to be rock-solid when decoding corrupted data, as CRC detection is not always enabled in the transport stream demux layer.

2.10 FM Radio
Listening to FM radio is a classic feature of audio players and cell phones. Time-shifting capabilities will be added by recording an FM channel, encoding it e.g. in AAC, storing it on MMC and reading it later. Battery life is strongly impacted by high-quality audio encoding and FM demodulation.

2.11 Push-to-talk over Cellular (PoC)
PoC allows for voice-enabled chats or Instant Messaging with multiple users connected to a server. This walkie-talkie functionality mostly appeals to teenagers. PoC is not full-duplex; speech encoding and decoding are not active simultaneously, which limits the complexity of this use case.

2.12 MMS/ringtone creation
Audio content creation involves recording voice or music, mixing with existing content, encoding, possibly transcoding to meet network requirements, and storing on MMC. Even though the hardware can support each of these tasks, the limited man-machine interaction proves frustrating; content creation will most likely rely on a PC, with the transcoding and network adaptation performed on the mobile platform after download.

2.13 Gaming
Gaming applications require multiple audio tracks to be played simultaneously and mixed. Data may be stored on mass storage or downloaded on-the-fly. Digital effects such as 3D positioning or Doppler are applied to individual channels. The mixed result is processed with environmental effects, mainly room reverberation. See Figure 2 for an example of this use case, with sample-rate conversions omitted for the sake of readability. One of the main goals in games is to maximize the number of tracks, possibly by decreasing sampling frequencies (32 kHz max) and/or using plain PCM files. Nevertheless, gaming requirements are the most difficult to sustain, both in terms of speed and memory bandwidth. Battery life is a major issue, as audio is used in combination with highly demanding, high-bandwidth processing (3D graphics, video). Keeping a high priority for audio is mandatory to avoid dropouts.

Figure 2: Example of gaming use case (decoded game sounds and background music routed through 3D and EQ effects into a mixer, followed by stereo widening and reverb)
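A minimal sketch of the mixing stage of Figure 2 follows, assuming 16-bit PCM tracks and Q15 per-track gains; a real engine would add sample-rate conversion and per-track effects before this point, and DRC rather than hard clipping at the output.

```c
/* Game-mixer sketch: accumulate tracks in 32 bits, apply a per-track
 * Q15 gain, and saturate once on the way out, as a DSP's saturating
 * MAC instructions would. */
#include <stdint.h>

static int16_t sat16(int32_t v)
{
    if (v >  32767) return  32767;
    if (v < -32768) return -32768;
    return (int16_t)v;
}

/* in[t]: one 16-bit PCM sample per active track; gain_q15[t]: Q15 gain. */
static int16_t mix_sample(const int16_t *in, const int16_t *gain_q15, int n)
{
    int32_t acc = 0;
    for (int t = 0; t < n; t++)
        acc += ((int32_t)in[t] * gain_q15[t]) >> 15;  /* Q15 multiply */
    return sat16(acc);
}
```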


2.14 Gaming with speech call
Supporting VoIP calls with other players amounts to adding several speech decoders whose outputs are upsampled and mixed with the main game content. One uplink speech encoder is needed as well. Acoustic Echo Cancellation is next to impossible in this situation; the output is muted when talking. As a refinement to push the limits of engineering further, each VoIP session is positioned in 3D space.

2.15 Speech calls and music
A much-hyped use case consists in live sharing of an MP3 music file during a speech call, either in CS or VoIP mode, as shown in Figure 3. Alternatively, the MP3 stream is only rendered locally, and not shared with the far end. This feature is requested by operators and cell-phone manufacturers but raises many audio quality issues.


Figure 3: MP3 listening during speech call






• Upsampling both the MP3 and the speech to 48 kHz could be an option; however, the AEC processing requirements are impossible to sustain at this sampling rate on a mobile platform.
• Inserting the AEC prior to the mix decreases the MIPS requirements, but the music contribution to the echo is not compensated for.
• Band-limiting and downsampling the music to 8 kHz is not an elegant solution either.
• AEC algorithms also rely on voice-activity and double-talk detection, which are easily fooled by music content.




• Audio quality differs between local playback and uplink encoded data, as speech encoders do not perform well on voice mixed with music.

2.16 Conversation record
This is one of the most complex use cases in terms of combination of streams and audio routing. Recording a conversation can be enabled at any time, for both circuit-switched and VoIP calls. In theory this amounts to mixing the up- and down-link speech and encoding the result, possibly at a lower sampling rate to decrease MIPS requirements. However, to comply with regulations, the far end must be informed by a periodic tone that the conversation is being recorded; this tone shall not be mixed into the recording. Figure 4 depicts the data flows and speech transforms.

Figure 4: Conversation record use case
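A per-sample sketch of the two mix buses implied by Figure 4 follows, under the assumption stated above that the tone reaches the far end but stays out of the recorded mix; all names are illustrative.

```c
/* Routing sketch for conversation record: one bus towards the network
 * (with the notification tone), one bus towards the record encoder
 * (without it). */
#include <stdint.h>

typedef struct {
    int16_t uplink;    /* near-end speech after AEC/NR */
    int16_t downlink;  /* decoded far-end speech */
    int16_t tone;      /* periodic "you are being recorded" tone */
} call_frame;

static int16_t sat16(int32_t v)
{
    return v > 32767 ? 32767 : v < -32768 ? -32768 : (int16_t)v;
}

/* Bus 1: what goes to the speech encoder / network (tone included). */
static int16_t uplink_bus(const call_frame *f)
{
    return sat16((int32_t)f->uplink + f->tone);
}

/* Bus 2: what goes to the record encoder (tone excluded). */
static int16_t record_bus(const call_frame *f)
{
    return sat16((int32_t)f->uplink + f->downlink);
}
```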


2.17 Output record
Recording the output may be enabled on top of any playback use case. It only makes sense when several streams are mixed, with some audio effects. The output can be re-encoded in any available format, even the most complex ones such as AAC, with optional sample-rate conversion. A possible example is the replay of a game session after the game is over. This output record use case is critical both in terms of MIPS and memory; battery life is shortened, as the only solution consists in re-using the host processor at the expense of power consumption. Recording the output is legal only if none of the audio sources are DRM-protected.

3 AUDIO TRANSFORMS

3.1 Codecs
Progress in audio and speech compression technologies, along with the convergence of PC and mobile applications, has generated a proliferation of standard codecs. Speech codecs support extended bandwidths and music encoding capabilities, while the Spectral Band Replication (SBR) and Parametric Stereo (PS) technologies embedded in the latest e-AACPlus standard have resulted in spectacular bit-rate reductions.

• 3GPP codecs such as AMR and AMR-WB are widely used for speech calls. AMR-WB+ competes with e-AACPlus for broadcast applications at low bit-rates; AMR-WB+ encoding has however few applications and is scarcely requested.
• ITU codecs such as G.711, G.722, G.726, G.723.1 and G.729ab are mandatory in a mobile platform for VoIP and camcorder use cases.
• CDMA codecs (QCELP13, EVRC) are used both for MMS and speech calls. SMV has not been widely deployed so far.
• Audio codecs include SBC for Bluetooth applications, MPEG Layer 2, MP3, as well as different AAC profiles such as LC/LTP/BSAC. e-AACPlus is a must-have codec required in most low-bitrate applications.
• Non-standard codecs such as ATRAC3, RealAudio, WMA and Ogg Vorbis are required in some applications.
• The IETF has also endorsed iLBC, a royalty-free speech codec used in VoIP applications.

3.2 Codec issues
The proliferation described in the previous section has a significant cost and brings additional system constraints.

3.2.1 Footprint of constant data
All recent speech codecs listed above are based on the ACELP technology, yet use different tables and frame lengths. Typically 10k words are needed to store the constant tables and dictionaries of each speech encoder. The proliferation of speech codecs has a direct impact on system architecture, as these tables must be stored in ROM or in external memory.

3.2.2 Increased memory requirements
Table 1 describes the evolution of audio coding algorithms and their increasing memory requirements.

Codec   Granularity (words)
MPEG    32
AC3     256
MP3     576
AAC     1024
WMA     2048
OGG     4096

Table 1: Evolution of frame sizes in words

This evolution, combined with concurrent use cases, no longer allows for implementations based only on tightly coupled memory. Accesses to the external memory are a crucial dimensioning factor, as the efficiency of the memory hierarchy becomes more critical than the raw MIPS of the audio processor.


3.2.3 Conformance issues
Audio encoders standardized by MPEG are typically tested with objective measurements (PEAQ, Opera) or listening tests. MPEG only mandates that the syntax of the bitstream be compliant, which leaves room to improve the quality of encoders or tune them to a specific platform. The adoption of e-AACPlus in 3GPP has generated a heated debate on conformance methodology, as mobile carriers are used to bit-exact tests. In order to allow for some flexibility, implementations can be either bit-exact or tested with objective measures that assess the divergence from the 3GPP reference floating-point encoder. This compromise will result in all 3GPP-compliant implementations of e-AACPlus providing the same quality, with little scope for enhancements. This fundamental difference with MPEG restricts the number of technology providers and is a strong constraint for silicon vendors.

3.3 MIDI synthesis
The Mobile DLS standard has been endorsed by the MMA and 3GPP. It mandates support for wavetable synthesis so that the user experience is the same across different platforms. Wavetable synthesis results in large memory requirements as well as highly variable bandwidth needs, as the peak bandwidth differs dramatically from the normal or average case. Depending on the pitch-bend and modulation parameters, the bandwidth may be multiplied by 4, which becomes somewhat intractable when applied to all 16 channels. The efficiency of a wavetable synthesizer therefore depends mainly on the design of the memory hierarchy. Implementations vary between full-software solutions (Beatnik, Sonic Solutions) and full-hardware ones (Japanese manufacturers such as Yamaha and Rohm). Evolutions include applying effects such as 3D on individual MIDI channels and relying on encoded data (MP3) for wavetables. The use of MIDI for ringtones will more than likely decrease, as MP3 is supported by all platforms, even though its low-bandwidth and low-footprint properties are still appealing to game developers.
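The bandwidth variability shows up directly in the inner loop of a wavetable oscillator, sketched below in fixed point: the Q16.16 step scales how fast the voice walks through sample memory, so a step of 4.0 (two octaves up) reads four times more table data for the same number of output samples. The structure is generic, not taken from any particular synthesizer.

```c
/* Wavetable voice with a fractional phase accumulator and linear
 * interpolation. */
#include <stdint.h>

typedef struct {
    const int16_t *table;   /* sampled waveform in RAM/ROM */
    uint32_t       len;     /* table length in samples */
    uint32_t       phase;   /* Q16.16 fixed-point position */
    uint32_t       step;    /* Q16.16 increment per output sample */
} wt_voice;

static int16_t wt_next(wt_voice *v)
{
    uint32_t i    = v->phase >> 16;      /* integer sample index */
    uint32_t frac = v->phase & 0xFFFF;   /* fractional part, Q16 */
    int16_t  a = v->table[i % v->len];
    int16_t  b = v->table[(i + 1) % v->len];
    /* linear interpolation between neighboring samples */
    int16_t out = (int16_t)(a + (((int32_t)(b - a) * frac) >> 16));
    v->phase += v->step;                 /* step > 1.0 reads ahead faster */
    return out;
}
```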

3.4 Digital Effects
Digital effects can be classified in two categories.

3.4.1 User-defined effects
Audio engineers are familiar with effects that modify the listening experience. The user typically enables and controls these effects through standard APIs. Effect parameters are modified independently of audio accessories and routing. This category includes:

• Tone generation: DTMF, sines for alerts, beeps.
• Volume controls: digital gains, ramps, pan/balance.
• Limiter/compressor and multi-band DRC for music.
• Mixer. Sample-rate conversion is needed in order to mix different sources; DRC may be used to prevent saturation when new sources are plugged into the mixer engine.
• Frequency transforms: graphical EQ, bass boost, loudness.
• Stereo widening (loudspeakers, headphones).
• 3D processing: 3D positioning and Doppler effects.
• Environmental effects: room reverberation, occlusion, obstruction.
• Music effects: chorus, flanger.

3.4.2 Platform-defined effects
PC or audio workstation users are less familiar with these effects, which are fundamental enablers of mobile audio platforms. Such effects are factory-tuned, tightly linked to acoustic parameters, and not accessible to user applications. Effects on the audio output include:
• A compressor/limiter to protect the speakers and maximize the volume.
• A parametric EQ to correct speaker nonlinearities.
Due to noisy environments, enhancements on input data are even more critical. Acoustic Echo Cancellation (AEC) is mandatory for full-duplex speech calls, and is typically enabled along with noise reduction, wind-noise reduction, AGC and DC removal. If the platform includes several microphones, beam-forming enhances sound directionality and speech intelligibility. Platform-defined effects need to be updated when the audio routing or the audio output devices are modified (jack insertion, Bluetooth earpiece detection). In some cases the audio accessories store the relevant parameters in ROM; these values are passed to the audio processing unit when the device is detected.
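A single band of such a parametric EQ is typically a biquad. The sketch below uses the widely known RBJ "Audio EQ Cookbook" peaking-filter formulas, which is one common choice rather than the method of any particular product.

```c
/* One parametric-EQ band as a biquad, coefficients from the RBJ
 * peaking-filter formulas, processed in transposed direct-form II. */
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef struct { double b0, b1, b2, a1, a2, z1, z2; } biquad;

static void peaking_init(biquad *f, double fs, double f0,
                         double q, double gain_db)
{
    double A     = pow(10.0, gain_db / 40.0);
    double w0    = 2.0 * M_PI * f0 / fs;
    double alpha = sin(w0) / (2.0 * q);
    double a0    = 1.0 + alpha / A;

    f->b0 = (1.0 + alpha * A) / a0;
    f->b1 = (-2.0 * cos(w0)) / a0;
    f->b2 = (1.0 - alpha * A) / a0;
    f->a1 = (-2.0 * cos(w0)) / a0;
    f->a2 = (1.0 - alpha / A) / a0;
    f->z1 = f->z2 = 0.0;
}

static double peaking_process(biquad *f, double x)
{
    double y = f->b0 * x + f->z1;
    f->z1 = f->b1 * x - f->a1 * y + f->z2;
    f->z2 = f->b2 * x - f->a2 * y;
    return y;
}
```

In a platform-defined EQ, peaking_init() would be called with factory-tuned parameters whenever the output device changes.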

4 OUTPUT DEVICES

4.1 D/A conversion
Since mobile platforms run on batteries and need different voltage supplies for the application processor and the baseband processor, a power-management device is always required. This IC relies on mixed-signal technologies; embedding the D/A and A/D conversion there is generally easier than adding analog functionality to high-speed CMOS devices. For this reason, the power-management chip is a critical element of any mobile platform.


Reaching low distortion and high SNR is a challenge in mobile environments, as a compromise between SNR and power consumption needs to be found. One major difference with PCs and set-top boxes is the low-voltage power supplies: since the noise floor stays the same independently of the voltage, using a low voltage amounts to having larger relative noise levels. Internal step-up voltage converters or charge pumps added to reach good SNR levels typically increase power consumption. Mobile phones used to rely on 13- or 14-bit D/A converters; a 16-bit D/A with a multi-bit sigma-delta modulator is now a minimum as audio playback and recording become mandatory, at the expense of higher power consumption. As in a PC, the audio quality suffers from mixing digital processing and analog conversion. Additional problems such as EMI and RF interference are extremely difficult to control, as the speakers are typically 20 cm away from the D/A converters in clamshell devices. The power IC often integrates the D/A and amplification stages, as audio quality would suffer if the two steps were performed on separate devices. Two different amplifier classes are used:
• Class D for line and loudspeaker outputs, with PWM and a differential pair. An 85-90% efficiency helps reach minimum power consumption at output powers up to 1 W.
• Class AB for headsets, as Class D is too noisy for weak signals. Despite a lower efficiency (45-55%), the additional power consumption is not significant, as the output power remains small.
Mobile devices typically rely on 10 ppm crystals to minimize cost, which is a real issue when PWM is used. Jitter shapers and smart clock corrections are needed to decrease the jitter below 0.5 ppm.

4.2 Bluetooth audio
Bluetooth headsets and speakers are ubiquitous and a mandatory requirement [1]. The main issue with Bluetooth is the use of different protocols for speech and audio.
• Speech data is transmitted over a Synchronous Connection-Oriented (SCO) link, whose bandwidth is limited to 64 kbit/s. The format for speech is 8 kHz, 8-bit mono (A-law, mu-law, CVSD). Synchronous transfers (play or record) to the BT module rely on serial links such as I2S.
• Audio data is transmitted over an Asynchronous ConnectionLess (ACL) link, whose 721 kbit/s bandwidth is incompatible with 48 kHz stereo PCM. The A2DP profile relies on SBC compression to decrease the bit-rate below 512 kbit/s, but does not provide real QoS; channels are not isochronous, only best-effort.

The retransmission rate for lost packets needs to be adapted to the environment, which further decreases the available bandwidth. The A2DP stack runs on the host and inserts SBC frames into RTP packets sent to the Bluetooth module over a UART interface. The increased latency makes this profile unsuitable for full-duplex communications. To the author's knowledge, audio recording using A2DP is not supported in any wireless device.
The two Bluetooth protocols require separate interfaces for audio and speech, and a seamless switch between speech and music is not possible. Combining music and speech is not allowed either; for example, a user cannot listen to an MP3 file during a speech call. Clock references make mixing music and speech even more difficult: in a cell-phone, the baseband processor typically provides the audio clock, and if the baseband is off when listening to music, the clock needs to be generated from a timer derived from the system clock.
The bandwidth extension to 16 kHz in 3G phones has generated the need for a new wideband speech profile. This standard is not yet ratified; its support by silicon vendors is to be confirmed. This new profile still relies on a SCO link, which will not solve the issues mentioned above. The UWB (Ultra-Wide Band) technology endorsed by Bluetooth will, on the contrary, solve all these issues by providing a 20+ Mbit/s bandwidth suitable for speech as well as hi-fi multi-channel audio.

5 SYSTEM CONSTRAINTS

5.1 OpenOS
In modern operating systems, dedicated drivers and software layers control the file system, and audio data may not be fetched directly by the audio subsystem. By construction, software stacks require a copy of the data into user space before it can be processed. This increases the power consumption and restricts the scope of hardware acceleration.
Open operating systems are now as reactive as an RTOS, but specific attention needs to be given to thread priorities. Audio-related threads need to be assigned the highest priority to meet real-time constraints. Variable latency may still occur with kernels such as Linux that work with time quanta and dynamically decrease the priority of low-latency recurrent threads. Controlling the latency is mandatory when interfacing with transmission protocols.
Last, open operating systems rely on legacy APIs and multimedia frameworks. The WAVE drivers in WindowsCE, ALSA in Linux or DevSound in Symbian are not suitable for the complex use cases with interactions between streams described in Section 2. These APIs are either too constrained or too flexible, with ever-changing ad-hoc extensions, which prevents software vendors from providing high-performance portable applications.


5.2 Monolithic software
Monolithic VoIP or streaming applications combining GUI, file browsing, network protocols and audio decoding are a direct consequence of the lack of suitable audio APIs. Audio acceleration requires a clean split between the audio and application parts, and is further limited by the use of proprietary codecs.

5.3 DRM
Digital Rights Management technologies rely on different levels of protection, which severely constrain mobile audio architectures.
Step 1: By relying on cryptography, key exchange and secure environments, hackers are prevented from stripping the DRM information. DRM would not make any sense otherwise, and does require dedicated hardware support.
Step 2: A key constraint is to prevent users from gaining access to decrypted data prior to decoding. This results in DRM-enabled applications being implemented as monolithic software combining decryption and decoding. In most implementations, DRM does not allow for audio acceleration and low-power implementations.
Step 3: This step constrains the output stage, by preventing audio content from being rendered on insecure interfaces. As an example, DRM-protected files may not be rendered on a Bluetooth headset, even though the quality is already degraded by the SBC encoding. Software and silicon vendors are required to route DRM-protected audio to analog interfaces only; PCM decoded data may not be grabbed on digital interfaces.
Step 4: Set-top boxes typically rely on random address shuffling on RAM interfaces to prevent users from watching pay-TV for free. The same type of requirement can be expected in the mobile environment, the main questions being the additional cost and customer acceptance.

5.4 System tones
High-priority tones are generated to signal alerts, low battery, key presses or incoming calls. These tones can be triggered at any time, possibly several at once. In open operating systems, the user can configure system and ring tones, with no constraints on file formats and sampling frequencies: tones can be simple 8 kHz PCM as well as e-AACPlus files. Support for system tones requires additional processing power to decode, upsample and mix on top of a given use case. The extreme variability of the MIPS and memory needs generates strong constraints for the hardware and memory hierarchy dimensioning, and taking these requirements into account significantly increases the cost of a mobile device. In many cases a fall-back strategy is implemented, where a simple tone is substituted for the user-defined tone.

5.5 Error concealment
As described above, error concealment is mandatory in broadcast applications, streaming and VoIP calls. A/V sync is maintained and lost packets are compensated for, either through frame-loss substitution or muting, time-scale expansion, repetition in the frequency domain or interpolation of speech parameters. Error concealment implementations require explicit signaling between the network protocol stack and the audio decoders; a tight coupling with frame-accurate control is needed. Working at the frame level is a strong constraint in low-power devices, as it requires the host processor operating system to handle interrupts every 20 ms. In contrast with the audio player use case, look-ahead is not possible, copies are needed, and the system clock cannot be decreased, so that interrupts are handled in a timely manner.
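A minimal substitution/muting scheme of the kind described above can be sketched as follows, assuming a 20 ms frame at 8 kHz; the three-frame attenuation schedule is illustrative.

```c
/* Concealment sketch: good frames decode normally; the first lost frames
 * are replaced by attenuated repetitions of the last good frame, after
 * which the output is muted. */
#include <stdint.h>
#include <string.h>

#define FRAME 160              /* 20 ms at 8 kHz */

typedef struct {
    int16_t last_good[FRAME];
    int     lost_in_a_row;
} plc_t;

static void conceal(plc_t *s, int16_t *out, int frame_ok, const int16_t *dec)
{
    if (frame_ok) {
        memcpy(out, dec, sizeof(int16_t) * FRAME);
        memcpy(s->last_good, dec, sizeof(int16_t) * FRAME);
        s->lost_in_a_row = 0;
        return;
    }
    if (++s->lost_in_a_row <= 3) {        /* repeat at -6 dB per frame */
        for (int i = 0; i < FRAME; i++)
            out[i] = s->last_good[i] >> s->lost_in_a_row;
    } else {
        memset(out, 0, sizeof(int16_t) * FRAME);  /* mute long gaps */
    }
}
```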

6 CELL-PHONE INTEGRATION

6.1 Baseband-centric architecture
This architecture is typical of cell-phone platforms with optional multimedia capabilities. All audio and speech data are routed through the modem, as shown in Figure 5. The baseband processor handles CS speech calls, speech enhancements and the mixing between audio and speech. It provides the reference audio clock derived from the network.

Figure 5: Baseband-centric architecture


The benefits of this architecture stem from its simplicity, as the baseband and application processors can be developed and validated separately. Its drawbacks are numerous as well:
• The application processor handles the VoIP codecs and the MMS use case, which leads to a duplication of speech codecs on the two processors.
• Power consumption is not optimal, as the baseband is always on, even for audio playback.
• Additional mixing and SRC capabilities are required on the modem.

6.2 Application processor-centric architecture
Audio and speech data are always routed through the application processor, as shown in Figure 6; the baseband processor becomes a slim modem, a pure data pump focusing on channel encoding, modulation, etc.

Figure 6: Application processor-centric architecture

Speech enhancements and speech codecs run on the application processor only. The benefits are numerous:
• The duplication issue mentioned above is solved.
• The modem complexity is removed.
• Mixing capabilities are enhanced.
However, this architecture proves extremely difficult to implement:
• The modem cannot be validated without the application processor, which requires coordination of roadmaps and chip availability.
• During audio playback the baseband is turned off, while it remains the clock master during a speech call. Changing the audio clock reference is a headache for audio architects.
• CS call latency and round-trip delays constrain the implementation to rely on small buffers, which raises the number of context switches and decreases the available processing power on the application processor.
In addition, all components of a mobile platform need to be synchronous and locked on the same master audio clock. Since the baseband and application processors may be activated or turned off independently, depending on the use case, the management of the audio reference clock is extremely complex.

6.3 Application processor-centric variant
This architecture is a pragmatic variant of the previous one; it preserves the strong coupling between speech codecs and channel encoding implementations. The CS speech codecs still run on the baseband processor, while the speech enhancements run on the application processor. The modem can still be validated alone, at the expense of a duplication of the CS speech codecs.

Figure 7: Application processor-centric architecture (2)

6.4 Power IC-centric architecture
The previous architectures assume that speech and music are mutually exclusive. When both audio and speech links are active, sample-rate conversion is required and the problems discussed in Section 2.15 are encountered.

Figure 8: Power IC-centric architecture; the audio and speech data paths are independent


By adding several digital and analog interfaces on the power IC, the audio routing becomes simpler, with little if any interaction between the baseband and application processors, as seen in Figure 8. Mixing is typically performed in the analog domain, or in the digital domain when an asynchronous sample-rate converter is available. This approach results in simple platform designs, but suffers from severe drawbacks as well: mixed-signal devices are difficult to design and require several silicon revisions to become stable, and their digital parts suffer from speed limitations when compared to pure CMOS devices.

6.5 Single-core modems
Baseband and application processors usually rely on a similar architecture, with a host processor and DSP capabilities. On the baseband, the host runs an RTOS and the communication protocols. New architectures, such as the Freescale MXC, allow for significant cost savings, with all DSP processing and protocols running on a single StarCore DSP. Many issues related to audio clocks, latency and inter-processor communication are solved as well.

6.6 Single-host platforms
An alternative approach requires the application processor to run an open operating system while handling real-time protocols such as a GSM stack, which was not feasible until OS vendors improved their kernels. For example, real-time constraints can be taken into account with the Symbian EKA2 kernel delivered since version 8.1. Microsoft has also announced better real-time support in Windows Mobile, and this approach may become more popular in the future.

6.7 Memory sharing
The bill of materials and the platform cost can be reduced by sharing the same RAM interface between the modem and the application processor, as shown in Figure 9. The concurrency between the application processor and the modem generates conflicts on RAM accesses; the additional latency decreases the system efficiency and requires a careful architecture study based on execution traces. Power management is also a major issue, as the application processor needs to be powered at a minimum to let the modem handle speech calls. Such an approach requires strong partnerships between modem and application processor vendors; new interfaces are, for example, being standardized in the MIPI consortium.

Figure 9: Memory sharing through bypass of the application processor


6.8 Stacking
The integration of the various audio platform components on a single die is extremely difficult, and may not be the wisest choice in terms of performance; for example, the application and baseband processors may not rely on the same manufacturing process. A simpler solution consists in stacking several dies in the same package. Stacking the application processor with RAM and flash memory dramatically reduces the number of chips on the PCB and allows for smaller form factors. Stacked devices are usually more stable and more immune to noise. Their main drawbacks are late availability and a longer tuning of the manufacturing process.

7 HARDWARE-ACCELERATION ARCHITECTURES

7.1 Tightly coupled
The performance of general-purpose processors can be enhanced by adding specific instructions for audio and multimedia processing (XScale, ARM9-11) or an additional co-processor tied to the main CPU (ARM Neon, ARC). This approach is familiar to audio software developers, as such extensions have been used in Pentium and PowerPC architectures as well. The benefits are a simple programming model, as audio acceleration runs in user space with no need for specific multimedia frameworks; debugging is easy and time-to-market minimal. The host processor still needs to handle low-latency output device interrupts, however, and performance and power consumption are not optimal. SIMD instructions are also not always well suited to saturated arithmetic.

7.2 Coprocessor
This approach relies on a dedicated core with true DSP functionality and architecture. The host offloads all processing routines to this coprocessor, which operates in slave mode and provides the results back to the host. Controlling such a coprocessor requires dedicated software drivers, which are more difficult to debug. This approach is a good compromise between time-to-market and performance; the interface with communication protocols is simple, as data is exchanged with a frame-level granularity.
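The slave-mode, frame-granularity control flow can be made concrete with a hypothetical host-side sketch; mbox_send()/mbox_recv() and the message layout stand in for whatever inter-processor mailbox and shared-memory scheme the platform actually provides.

```c
/* Hypothetical host-side view of a slave DSP coprocessor. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t opcode;     /* e.g. DSP_DECODE_FRAME */
    uint32_t buf_addr;   /* physical address of a shared buffer */
    uint32_t buf_len;
} dsp_msg;

extern int mbox_send(const dsp_msg *m);           /* platform-specific */
extern int mbox_recv(dsp_msg *m, int timeout_ms); /* platform-specific */

/* Offload one compressed frame and block until the PCM result is ready;
 * the host does nothing audio-related in between and may idle. */
static int decode_frame_on_dsp(uint32_t in_phys, size_t in_len,
                               uint32_t out_phys)
{
    dsp_msg cmd = { /*DSP_DECODE_FRAME*/ 1, in_phys, (uint32_t)in_len };
    (void)out_phys;  /* the output buffer would be part of the command */
    if (mbox_send(&cmd) != 0)
        return -1;
    dsp_msg rsp;
    return mbox_recv(&rsp, /*timeout_ms=*/40);  /* frame period + margin */
}
```

This frame-level exchange is exactly what the error-concealment signaling of Section 5.5 requires.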


7.3 Distributed processing
To further boost performance, the coprocessor can also handle audio playback and direct rendering on the audio peripherals. The host processor is only interrupted to provide compressed data, as the coprocessor is the master and pulls data at its own pace. The host clock frequency can be reduced, which decreases power consumption. The main drawback of this architecture is time-to-market: as in the previous section, debugging multi-core implementations is difficult and requires time before the complete platform is stable. This approach is also incompatible with protocol stacks that require frame-level granularity. In addition, applications can only access hardware acceleration resources through standard APIs, unless dedicated support is put in place by the silicon vendors.

8 OS INTEGRATION

8.1 Symbian/Series60
The standard distributions rely on audio processing on the ARM core, using DSP instructions to reduce the MIPS requirements. A Multimedia Framework allows for hardware acceleration of natural audio playback. However, support for error concealment mandates that hardware acceleration resources work in slave mode with a frame-level granularity. Effects may also be hardware-accelerated.

8.2 WindowsCE
Windows applications may use two different audio frameworks. The old Wave API does not really allow for hardware acceleration: all the codecs provided with it run on the host. On the contrary, the DirectShow framework allows for the addition of custom filters, both for codecs and digital effects, running on distributed hardware.

8.3 WindowsMobile
WindowsMobile limits hardware acceleration by enforcing strict DRM rules. Bluetooth is supported through a specific framework, which prevents acceleration as well. In practice, the coprocessor and multi-core approaches are feasible only with dedicated support from Microsoft.

8.4 Linux
The standard audio API is ALSA, which provides little scope for hardware acceleration, all codecs running in software on top of alsa-lib. The GStreamer framework shares many features with DirectShow: audio transforms may be accelerated, and data may be tunneled between compatible filters without being copied into user space. Security considerations require a master application running as root to prevent rogue applications from creating insecure connections. Platform-defined filters such as AEC need to be added automatically by a central server.

8.5 Standard audio APIs
As seen in the sections above, operating systems and middleware do not always provide support for hardware acceleration of audio processing. Application developers also lack standard APIs that allow for cross-platform portability. The Khronos Group [3] has standardized APIs to decrease this dependency on operating systems and hardware. The OpenMAX IL (Integration Layer) API defines a standardized media component interface to integrate multimedia codecs implemented in hardware or software; this middleware API provides the same flexibility as GStreamer or DirectShow filters, and has been demonstrated on Linux. The OpenMAX DL (Development Layer) API will enable codec developers to accelerate the portability of codecs with a standardized set of primitive functions across a range of computing platforms; this level is suitable for tightly coupled acceleration, but its granularity is too small for coprocessor or loosely coupled architectures. The OpenMAX AL (Application Layer) defines a standardized interface between an application and multimedia middleware. The OpenSL ES API targets application developers, with standardized access to audio features (e.g. 3D positional audio, MIDI playback, DSP effects). Cross-platform applications can also be developed in Java, with the JSR-135 and JSR-234 specifications focused respectively on codecs and MIDI support, and on effects and 3D positioning.
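The flavor of the OpenMAX IL component model can be conveyed with the following skeleton. The names are simplified stand-ins for the real IL entry points, which differ in signatures, state constants and error handling; only the control flow (load a component, walk it to Executing, exchange buffers through callbacks) is meant to be representative.

```c
/* Simplified, hypothetical IL-style component interface. */
typedef void *il_handle;

typedef struct {
    void (*empty_buffer_done)(il_handle h, void *buf); /* input consumed  */
    void (*fill_buffer_done)(il_handle h, void *buf);  /* output produced */
} il_callbacks;

extern il_handle il_get_component(const char *name, il_callbacks *cb);
extern int       il_set_state(il_handle h, int state); /* IDLE, EXECUTING */
extern int       il_empty_this_buffer(il_handle h, void *compressed);
extern int       il_fill_this_buffer(il_handle h, void *pcm);

static void play_with_accelerated_decoder(il_callbacks *cb,
                                          void *in_buf, void *out_buf)
{
    il_handle dec = il_get_component("audio_decoder.mp3", cb);
    il_set_state(dec, /*IDLE*/ 1);       /* buffers are allocated here */
    il_set_state(dec, /*EXECUTING*/ 2);
    il_empty_this_buffer(dec, in_buf);   /* hand compressed data to codec */
    il_fill_this_buffer(dec, out_buf);   /* ask for decoded PCM back */
}
```

Whether the component runs on the host, a coprocessor or a distributed DSP is hidden behind this interface, which is precisely the point of the Integration Layer.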

9 CONCLUSIONS
This paper has given a wide overview of use cases, architectures and critical factors, as well as of the compromises on audio quality that are needed with Bluetooth and when mixing music and speech. The limitations of current audio interfaces and the clock reference issues will be reduced in the future with the MIPI SLIMbus [4], unveiled at this conference. The future of low-power hardware acceleration and multi-core solutions is less clear, as their success depends on the support of software and OS vendors.

REFERENCES

[1] "Bluetooth and Wireless Networking: A Primer for Audio Engineers", J. Audio Eng. Soc., Vol. 50, No. 11, 2002 November. http://www.aes.org/tutorials

[2] http://www.umatechnology.org

[3] http://www.khronos.org

[4] http://www.mipi.org


* This paper was submitted while the author was still with STMicroelectronics.

