
Audio Engineering Society

Convention Paper 8432 Presented at the 130th Convention 2011 May 13–16 London, UK The papers at this Convention have been selected on the basis of a submitted abstract and extended precis that have been peer reviewed by at least two qualified anonymous reviewers. This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

System Identification for Acoustic Echo Cancellation Using Stepped Sine Method Related to FFT Size

TaeJin Park, Seung Kim, and Koeng-mo Sung

Applied Acoustics Lab, School of EE and CS, INMC, Seoul National University
[email protected], [email protected], [email protected]

ABSTRACT

A stepped sine method was applied for system identification to cancel acoustic echoes in the speaker phone systems that are widely used in recent mobile devices. We applied the stepped sine method by regarding the Discrete Fourier Transform (DFT) as a uniform-DFT filter bank. With this method, we were able to obtain more accurate and detailed characteristics of the system. We stored the frequency response information in linear transform matrices and estimated the responses of the mobile device speaker. The proposed method exhibited higher Echo Return Loss Enhancement (ERLE) and increased correlation when compared to the conventional method.

1. INTRODUCTION

The usage of smart phones and hands-free units has become remarkably widespread in recent years. Because of its convenience, speaker phone mode is widely used in these kinds of telecommunication devices. When speaker phone mode is used while driving, for teleconferences, or for video-relay conversations, acoustic echoes deteriorate the quality of the call. To guarantee that these telecommunication devices retain an acceptable call quality in speaker phone mode, the performance of the device's acoustic echo controller (AEC) is of critical importance [1].

Of the numerous parts comprising the acoustic echo controller, the acoustic echo canceller plays the most essential role. The acoustic echo canceller reduces the acoustic echo level while keeping the user's speech intact. To build an efficient echo controller, it is most important to perform an exact system identification of the acoustic system, ranging from its speaker and microphone to its estimated acoustic path. Since acoustic echoes are influenced more by the speaker and microphone than by the acoustic path of the system, this paper focuses on the identification of the speaker and microphone system. With the trend of miniaturization in mobile devices, the thickness of the accompanying speaker devices has

Park et al.

System ID for AEC Using Stepped Sine Method

decreased. Ironically, however, this has caused the sound quality of those speaker devices to deteriorate compared to their earlier versions. The distortion and harmonic signals present in these new compact speakers have also made system identification a more challenging task than when mobile speaker devices were thicker. Therefore, system identification methods should be improved so that exact target echo signals can be set for acoustic echo controllers. Additionally, mobile phones are limited in hardware operation capability, calculation time, and complexity, which are also significant issues when considering real-time application for mobile phone systems.

Thus far, various methods have been proposed for system identification and distortion simulation. The swept sine method, based on the Least Mean Square (LMS) or Volterra series method [2], is currently widely employed for system identification. An alternative has also been proposed by Angelo Farina, in the form of an effective technique to measure impulse response and harmonic distortion with the swept-sine concept [3]. These methods each have their strengths and weaknesses. The LMS method based on the swept sine concept successfully estimates fundamental frequencies, but it cannot handle harmonic frequencies [4], which makes it inadequate for compact speakers that produce abundant harmonic distortion. Several researchers have developed adaptive filters based on the Volterra series method. Such a filter is also relatively successful in estimating nonlinear distortion [5]. However, the Volterra series method requires a large number of coefficients and enormously complex calculations of order O(N^3) or O(N^4). Because of these shortcomings, the Volterra series method is inadequate for real-time application on devices with limited computing resources.

Because of these difficulties in system identification using conventional methods, our investigation in this paper focuses on performance within the most widely used environment. The suggested algorithm was investigated in the Short-Time Fourier Transform (STFT) domain within a 128-tap and 128-sample-window environment. By thus limiting the environment, the suggested algorithm obtained more detailed frequency responses of the tested system with stepped sine measurement. We also propose an algorithm with O(N^2) calculation complexity.

Generally, the stepped sine method is considered adequate for filter bank system modeling, but is rarely used in loudspeaker identification due to its discontinuity in frequency. It is true that, when approaching acoustic echo cancellation in the time domain, the stepped sine method may not be adequate. However, in the case of a frequency domain approach, it is possible to use the stepped sine method by regarding the Discrete Fourier Transform (DFT) as a uniform-DFT filter bank. In this paper, the stepped sine method has been applied for system identification in echo cancellation under this assumption of DFT filter banks. The discrete stepped sine method was capable of obtaining more precise distortion and frequency response measurements, while the traditional swept sine method has some shortcomings in terms of Signal-to-Noise Ratio (SNR) [6].

By using the stepped sine method, we obtained more accurate and detailed frequency characteristics of non-linearity compared to the swept-sine method in certain environments. We then stored the resulting harmonic distortion information and amplitude-dependent non-linearity information in linear transform matrices. By multiplying these matrices with the proposed technique, we obtained an estimation of the response of mobile loudspeaker devices. In order to verify the performance of our proposed stepped sine measurement technique, we employed an acoustic echo canceller provided by MIGHTYWORKS, Inc. Performance was measured in terms of Echo Return Loss Enhancement (ERLE), Mean Square Error (MSE) and the correlation between the original and the estimated signal. The proposed method exhibited higher ERLE, lower MSE and increased correlation compared to conventional methods.

2. ECHO CONTROLLER ENVIRONMENT

Our investigation was conducted based on the assumption that the echo controller system works in the frequency domain. As our proposed measurement method depends on window length and FFT length, we also assumed that the echo controller system uses 128-tap windows and 128-sample FFTs.
Although our proposed method is based on these specifications, the assumption can be changed together with the measurement strategy. A typical audio terminal with AEC is shown in Figure 1. From the network system, the echo controller system receives the far-end talker's voice signal $x(k)$. But because of loudspeaker distortion and a time-varying acoustic path, the echo signal $\tilde{x}(k)$ is quite different from the

AES 130th Convention, London, UK, 2011 May 13–16 Page 2 of 7


original far-end voice signal $x(k)$. Therefore, the echo canceller needs an exact signal close to $\tilde{x}(k)$. The closer the estimator output $\hat{x}(k)$ gets to the real echo signal $\tilde{x}(k)$, the more successfully the echo canceller can reduce the echo signal. In this paper, we excluded the effect of the time-varying acoustic path and concentrated on the estimation of $\tilde{x}(k)$.

By regarding the device under test (DUT) as 65 DFT filter banks, we were able to reduce calculation complexity while preserving performance compared to other conventional methods. Figure 2 describes the DFT filter banks. The original signal passes through M DFT filter banks. After being processed in the frequency domain, the processed signal y(n) passes through the synthesis filter F(z). As mentioned above, for our test environment, we set M=128.

Figure 2 Diagram of DFT filter banks

Figure 1 Typical frequency domain echo controller

3. SYSTEM IDENTIFICATION

3.1. Stepped sine with DFT filter bank

Figure 3 Uniform-DFT filter banks

Stepped sine methods are rarely used for loudspeaker system identification. Because stepped sine waves excite the system only at discrete frequencies, the stepped sine method is applied for special purposes rather than for measuring the frequency response of the system [7][8]. However, stepped sine excitation is also the established method when it comes to precise distortion measurement. In this case, every harmonic can be picked up with ease and high precision from the FFT spectra [6]. For conventional loudspeaker frequency response measurement, the stepped sine method is inadequate because it leaves numerous frequency gaps between the selected excitation frequencies. However, by limiting the conditions of the experiment to a 128-point FFT and an 8 kHz sampling frequency, we can regard the whole Device Under Test (DUT) as 65 DFT filter banks.
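The filter bank view can be illustrated with a minimal Python sketch. It assumes a rectangular 128-sample frame and a direct DFT; the helper names `dft_magnitudes` and `bin_center_tone` are hypothetical and not from the paper. A sine placed exactly at a bin center frequency (a multiple of 8000/128 = 62.5 Hz) excites only its own bin, which is why each stepped sine frequency can be treated as probing one filter of the bank.

```python
import cmath
import math

FS = 8000   # sampling frequency in Hz, as stated in the text
M = 128     # DFT length, as stated in the text


def dft_magnitudes(frame):
    """Magnitudes of the first M/2 + 1 DFT bins of a real-valued frame."""
    m = len(frame)
    mags = []
    for k in range(m // 2 + 1):
        acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / m)
                  for n in range(m))
        mags.append(abs(acc))
    return mags


def bin_center_tone(k, length=M):
    """One frame of a sine at the center frequency of bin k (k * FS / M Hz)."""
    freq = k * FS / M
    return [math.sin(2 * math.pi * freq * n / FS) for n in range(length)]


bin_spacing = FS / M      # spacing between filter center frequencies
num_bins = M // 2 + 1     # number of non-redundant bins for a real signal

# Excite bin 10 only and look at how much energy lands in the other bins.
mags = dft_magnitudes(bin_center_tone(10))
peak_bin = max(range(num_bins), key=lambda k: mags[k])
leakage = max(v for k, v in enumerate(mags) if k != peak_bin)
```

With a tone that is not centered on a bin, or with a non-rectangular window, energy would spread into neighboring bins, so this isolation holds only under the stated assumptions.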

Figure 3 describes the uniform-DFT filter banks. We assumed that the DUT consists of M uniform filter banks, and we set the frequencies of the stepped sine wave to the center frequencies of the DFT filters. With M set to 128, the first frequency bin, k=0, is the DC signal component. Since the signal is real-valued, we concentrated only on the first 65 frequencies; that is, N=65 frequencies were set as below (N=(M/2)+1). The excitation signal is laid out below, where

$$x_k[n] = \sin\left(2\pi n \frac{62.5k}{8000}\right), \quad k = 1, 2, \ldots, \frac{M}{2}+1 \qquad (1)$$

$x_k[n]$ is the k-th excitation signal and 62.5 Hz is the fundamental frequency. For each frequency, the excitation file contains 3 seconds of sine signal followed by 1 second of silence. This duration was set to minimize the effect of the inertia of the transducer. With the 65-frequency stepped sine signal, we obtained a response signal that consisted of the 65 frequencies and their harmonic frequencies. Figure 4 describes the excitation signal and the response signal.
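As a rough illustration of the excitation described above, the following Python sketch generates the stepped sine signal of equation (1), with each tone followed by silence. The function names and keyword parameters are hypothetical; the index range k = 1 to M/2 + 1 is taken as printed in equation (1).

```python
import math

FS = 8000    # sampling frequency in Hz
F0 = 62.5    # fundamental (bin spacing) in Hz
M = 128      # DFT length


def stepped_sine_segment(k, tone_seconds=3.0, silence_seconds=1.0, fs=FS):
    """k-th excitation x_k[n] = sin(2*pi*n*62.5*k/fs), followed by silence."""
    tone_len = int(round(tone_seconds * fs))
    silence_len = int(round(silence_seconds * fs))
    tone = [math.sin(2 * math.pi * n * F0 * k / fs) for n in range(tone_len)]
    return tone + [0.0] * silence_len


def stepped_sine_excitation(num_steps=M // 2 + 1, **kwargs):
    """Concatenate the segments for k = 1 .. num_steps."""
    signal = []
    for k in range(1, num_steps + 1):
        signal.extend(stepped_sine_segment(k, **kwargs))
    return signal


# One default segment: 3 s of tone plus 1 s of silence at 8 kHz.
first_segment = stepped_sine_segment(1)
```

Calling `stepped_sine_excitation()` with the defaults produces the full 65-step file; shorter durations can be passed for quick experiments.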

Figure 4 Excitation signal and response of system

3.2. Linear Transform Matrix

Next, we took the DFT of the N frequency data and stored it in the N by N matrix A. $a_{ij}$ is a component of matrix A, where

$$a_{ij} = \frac{x_{output}[j]}{x_{input}[i]} \qquad (2)$$

$x_{input}[i]$ is the i-th DFT coefficient of the input signal and $x_{output}[j]$ is the j-th DFT coefficient of the response signal. Before dividing, the absolute value of each coefficient was taken, so that the phase information of the original input data is preserved. Figure 5 describes matrix A, which is obtained from the response signal.

Figure 5 Image map of matrix A (axes: original frequency bin, response frequency bin)

By constructing this linear transform matrix A, we were able to generate a fundamental frequency and harmonic frequencies from the input signal.

4. SYNTHESIZING OF ESTIMATED SIGNAL

4.1. STFT and Windowing

As our proposed algorithm uses the STFT and the overlap-add method, the application of a proper window function is necessary. Thus, a Hamming window was applied to each frame to minimize the frequency spreading effect. $x[n]$ is the original input signal and $\tilde{x}[n]$ is the signal with the window applied. The window frames were shifted by 64 samples.

4.2. Fundamental Frequency Estimation

As stated in 3.2, the frequency response data and harmonic frequency data are stored in the linear transform matrix A. The diagonal part of matrix A is stored in matrix F. Since the diagonal part of matrix A constitutes the fundamental frequency components of matrix A, a simple matrix multiplication yields the estimated fundamental frequency vector $\tilde{d}$.

$$\tilde{d}_{1 \times N} = d_{1 \times N} F_{N \times N} \qquad (3)$$
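Equations (2) and (3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the inputs are assumed to be DFT magnitudes already measured for each stepped sine step, and the helper names and toy numbers are hypothetical.

```python
def build_transform_matrix(input_mags, response_mags):
    """Equation (2): a[i][j] = |x_output[j]| / |x_input[i]|.

    input_mags[i] is the magnitude of the i-th excitation at its own bin;
    response_mags[i][j] is the magnitude found at bin j in the response
    to the i-th excitation (assumed data layout).
    """
    n = len(input_mags)
    return [[response_mags[i][j] / input_mags[i] for j in range(n)]
            for i in range(n)]


def diagonal_part(a):
    """Matrix F: the diagonal of A, zeros elsewhere."""
    n = len(a)
    return [[a[i][j] if i == j else 0.0 for j in range(n)] for i in range(n)]


def row_times_matrix(d, m):
    """Row vector times matrix, as in equation (3): d_tilde = d F."""
    n = len(d)
    return [sum(d[i] * m[i][j] for i in range(n)) for j in range(n)]


# Toy 3-bin example with made-up magnitudes.
input_mags = [64.0, 64.0, 64.0]
response_mags = [[32.0, 6.4, 0.0],
                 [0.0, 48.0, 3.2],
                 [0.0, 0.0, 16.0]]
A = build_transform_matrix(input_mags, response_mags)
F = diagonal_part(A)
d = [2 + 0j, 0 + 4j, 1 - 1j]          # spectrum of one input frame
d_tilde = row_times_matrix(d, F)
```

Because F holds real gains only, multiplying by it scales each bin while leaving the phase of the input spectrum untouched. The matrix-vector product costs on the order of N^2 multiplications per frame.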

Figure 6 Image map of matrix F (axes: original frequency bin, response frequency bin)

4.3. Harmonic Frequency Estimation

Due to the nonlinearity of compact speakers, it is a challenging task to estimate the harmonic frequencies of their systems. To estimate the nonlinear activity of the transducer, we calculated the whole signal power in each frame. Since harmonic distortion depends on signal level, we controlled the number of frequency bins used to generate the harmonic frequencies. The number of frequencies required to calculate the harmonic frequencies is described by the following equation, where

$$N_F = \left[ \frac{\sum_{k=1}^{N} d[k]}{R} + 1 \right] \qquad (4)$$

$N_F$ is the number of peak frequencies required to calculate the harmonic frequencies, and R is the ratio between the signal power and the number of peak frequencies. The value of R can be set to maximize the performance. We chose R=12 to calculate the frequencies to be multiplied with matrix H. After calculating the number of frequencies, our proposed algorithm searches for the $N_F$ peak frequencies in the current frame. The selected frequency bins are stored in matrix P as ones and the other frequency bins are stored in matrix P as zeros. Matrix d is the data matrix and matrix H is the harmonic part of the linear transform matrix A. The equation below describes the calculation of harmonic frequencies, where

$$\hat{d}_{1 \times N} = (P_{1 \times N} \circ d_{1 \times N}) H_{N \times N} \qquad (5)$$

$\circ$ denotes element-wise multiplication. This matrix contains the harmonic frequencies of each frame.

Figure 7 Image map of matrix H (axes: original frequency bin, response frequency bin)

4.4. Total Signal Generation

Finally, the conjugate part of the frame was filled with the conjugate of the real part, where

$$\tilde{d}[(M+1)-n] = \mathrm{conj}(\tilde{d}[n]), \quad \hat{d}[(M+1)-n] = \mathrm{conj}(\hat{d}[n]), \quad n = 1, 2, \ldots, \frac{M}{2} \qquad (6)$$

$\tilde{d}[n]$ denotes the fundamental frequency signal from vector $\tilde{d}$ and $\hat{d}[n]$ denotes the harmonic frequency signal from vector $\hat{d}$. To overlap these harmonic frequencies, the harmonic frequency vector $\hat{d}$ should be multiplied by the window function after an Inverse Discrete Fourier Transform (IDFT) of $\hat{d}$ is performed. After summing the IDFT of $\tilde{d}$ and the IDFT of $\hat{d}$, we were able to obtain a total estimated signal.
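The harmonic estimation and total signal generation steps can be sketched in Python as below. Several points are assumptions rather than statements from the paper: the bracket in equation (4) is read as an integer part, magnitudes are used for the sum and the peak search, the conjugate fill uses the standard 0-based DFT symmetry instead of the 1-based indexing of equation (6), and windowing and overlap-add are omitted. All function names are hypothetical.

```python
import cmath
import math


def num_peak_bins(d, r=12.0):
    """Equation (4), read as an integer part: N_F = int(sum|d[k]| / R + 1)."""
    nf = int(sum(abs(v) for v in d) / r + 1)
    return min(nf, len(d))


def peak_mask(d, nf):
    """Vector P: 1.0 at the nf largest-magnitude bins, 0.0 elsewhere."""
    order = sorted(range(len(d)), key=lambda k: abs(d[k]), reverse=True)
    chosen = set(order[:nf])
    return [1.0 if k in chosen else 0.0 for k in range(len(d))]


def harmonic_estimate(d, h, r=12.0):
    """Equation (5): d_hat = (P o d) H."""
    n = len(d)
    p = peak_mask(d, num_peak_bins(d, r))
    masked = [p[k] * d[k] for k in range(n)]
    return [sum(masked[i] * h[i][j] for i in range(n)) for j in range(n)]


def hermitian_extend(half):
    """Fill the conjugate half of a length M = 2 * (N - 1) spectrum."""
    n = len(half)
    m = 2 * (n - 1)
    full = list(half) + [0j] * (m - n)
    for k in range(1, n - 1):
        full[m - k] = half[k].conjugate()
    return full


def idft_real(spectrum):
    """Real part of the inverse DFT, computed directly."""
    m = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / m)
                for k in range(m)).real / m for t in range(m)]


def synthesize_frame(d_tilde, d_hat):
    """Sum of the IDFTs of the fundamental and harmonic estimates."""
    fund = idft_real(hermitian_extend(d_tilde))
    harm = idft_real(hermitian_extend(d_hat))
    return [a + b for a, b in zip(fund, harm)]
```

In a full STFT pipeline, each synthesized frame would additionally be windowed and overlap-added with a hop of 64 samples, as described in Section 4.1.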


5. PERFORMANCE EVALUATION

Performance evaluation was conducted by testing compact loudspeakers. We tested the speakers installed in the LG Optimus One and Samsung Galaxy S mobile phones, and in the Sony PCG-4JFP and Samsung Q30 laptop computers. Performance was evaluated by calculating the Mean Square Error (MSE) between the original signal and the estimated signal, the Echo Return Loss Enhancement (ERLE) with an echo canceller, and the correlation between the original signal and the estimated signal. We also compared the performance of this method with that of the LMS method, the Volterra series method, and the impulse response convolution method proposed by Angelo Farina [2].

To simplify the tables, each estimation method is represented by a letter as follows.

A. LMS method
B. Volterra series method
C. Impulse response convolution
D. Proposed method

Each criterion for the performance evaluation is calculated as follows,

$x$: Original speech signal vector
$\hat{x}$: Estimated speech signal vector
$e$: Output error signal from echo canceller

MSE: $E[(x - \hat{x})^2]$ (7)

ERLE: $10 \log \frac{E[x^2(n)]}{E[e^2(n)]}$ (8)

Residue signal: $e$ (9)

Correlation: $\mathrm{corr}(x, \hat{x})$ (10)

where the expected value is estimated as follows [9].

$$E[x^2] = \frac{1}{N} \sum_{k=1}^{N} x^2 \qquad (11)$$

Lower values of the MSE and residue signal, as well as higher correlation and ERLE values, represent better performance.

The following tables summarize the performance of each estimation method with each criterion.

Samsung Q30       A          B          C          D
MSE               42.48      42.18      11.73      14.34
ERLE              16.85      43.91      35.68      63.49
Residue signal    772.46     377.24     512.26     247.99
Correlation       0.25       0.31       0.78       0.82

Table 1 Performance comparison for Samsung Q30

Sony PCG-4JFP     A          B          C          D
MSE               368.51     296.94     503.78     330.06
ERLE              14.69      43.41      22.98      56.47
Residue signal    2659.7     1492.7     2270.6     1260.6
Correlation       0.11       0.22       0.23       0.25

Table 2 Performance comparison for Sony PCG-4JFP

Samsung Galaxy S  A          B          C          D
MSE               1.13·10^4  1.12·10^4  1.52·10^4  4.73·10^3
ERLE              17.86      24.7       19.8       35.44
Residue signal    2.4·10^4   1.9·10^4   2.26·10^4  1.42·10^4
Correlation       0.21       0.20       0.52       0.54

Table 3 Performance comparison for Samsung Galaxy S


LG Optimus One    A          B          C          D
MSE               1.24·10^3  142.28     350.12     218.13
ERLE              13.46      102.08     28.76      84.07
Residue signal    2910.1     226.05     1819.9     436.28
Correlation       0.35       0.87       0.81       0.69

Table 4 Performance comparison for LG Optimus One
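For reference, the MSE, ERLE and correlation criteria of this section can be sketched in Python as follows. The sketch assumes a base-10 logarithm for the ERLE in decibels and the Pearson coefficient for the correlation; neither choice is stated explicitly in the text, and the function names are hypothetical.

```python
import math


def mean_square(values):
    """Equation (11): E[x^2] estimated as the sample mean of x^2."""
    return sum(v * v for v in values) / len(values)


def mse(x, x_hat):
    """Equation (7): mean square error between original and estimate."""
    return mean_square([a - b for a, b in zip(x, x_hat)])


def erle_db(x, e):
    """Equation (8): 10 * log10(E[x^2] / E[e^2]), in dB (base 10 assumed)."""
    return 10.0 * math.log10(mean_square(x) / mean_square(e))


def correlation(x, y):
    """Pearson correlation coefficient (assumed meaning of corr in (10))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy signals: the residual is 20 dB below the reference signal.
x = [1.0, -1.0, 1.0, -1.0]
e = [0.1, -0.1, 0.1, -0.1]
toy_erle = erle_db(x, e)
```

A lower `mse` and a higher `erle_db` or `correlation` correspond to better estimation, matching the reading of the tables above.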

The performance evaluation test did not yield a unified result for all criteria. However, except for the LG Optimus One model, the results produced by the proposed algorithm generally exhibited higher performance. For compact speaker devices with low distortion, the conventional Volterra series method showed better results. However, for systems with high levels of distortion, our proposed algorithm showed lower MSE and higher ERLE and correlation. Although the referenced methods have not been recently optimized, the results can serve as a brief comparison among conventional methods.

6. SUMMARY

In this paper, we have proposed an identification method based on the stepped sine technique. The frequencies of the stepped sine signal are related to the DFT size and the sampling frequency. By using a stepped sine and a linear transform matrix, our proposed algorithm obtained better results than the conventional Volterra series method and the impulse response convolution method in a specific test environment. Also, by using a linear transform matrix, the calculation complexity was reduced to the order of O(N^2).

7. ACKNOWLEDGEMENTS

This work was supported by MIGHTYWORKS, Inc.

8. REFERENCES

[1] ITU-T Recommendation G.167, Acoustic Echo Controllers. Helsinki, 1993.

[2] Angelo Farina, Alberto Bellini and Enrico Armelloni, “Non-linear convolution: A new approach for the auralization of distorting systems”, presented at the AES 110th Convention, Amsterdam, Netherlands, 2001 May 12-15.

[3] Angelo Farina, “Simultaneous measurement of impulse response and distortion with a swept-sine technique”, presented at the AES 108th Convention, Paris, France, 2000 February 19-22.

[4] Simon Haykin, “Adaptive Filter Theory”, Second Edition, Prentice-Hall, 2002, pp. 231-238.

[5] Heung Ki Baik and John Mathews, “Adaptive Lattice Bilinear Filters”, IEEE Transactions on Signal Processing, Vol. 41, No. 6, June 1993.

[6] Swen Müller and Paulo Massarani, “Transfer-Function Measurement with Sweeps”, Journal of the AES, Vol. 49, No. 6, 2001 June.

[7] Pascal Brunet, Brian Fallon and Steve Temme, “Practical Implementation of Perceptual Rub & Buzz Distortion and Experimental Results”, presented at the AES 129th Convention, San Francisco, USA, 2010 November 4-7.

[8] Christopher J. Struck, “An Adaptive Scan Algorithm for Fast and Accurate Response Measurements”, presented at the AES 91st Convention, New York, USA, 2010 October 4-7.
