
Low Complexity, Non-Intrusive Speech Quality Assessment Volodya Grancharov, Student Member, IEEE, David Y. Zhao, Student Member, IEEE, Jonas Lindblom, and W. Bastiaan Kleijn, Fellow, IEEE

Abstract— Monitoring of speech quality in emerging heterogeneous networks is of great interest to network operators. The most efficient way to satisfy such a need is through non-intrusive, objective speech quality assessment. In this paper we describe an algorithm for monitoring speech quality over a network with extremely low complexity and memory requirements. The features used in the proposed algorithm can be computed from commonly used speech-coding parameters; reconstruction and perceptual transformation of the signal are not performed. The critical advantage of the approach lies in generating quality assessment ratings without explicit distortion modelling. The results of the performed simulations indicate that the proposed output-based objective quality measure performs better than the ITU-T P.563 standard.

Index Terms— quality assessment, non-intrusive, quality of service.

I. INTRODUCTION

Speech quality assessment is an important problem in mobile communications. The quality of a speech signal is a subjective measure. It can be expressed in terms of how natural the signal sounds or how much effort is required to understand the message. In a subjective test, speech is played to a group of listeners, who are asked to rate the quality of the speech signal [1], [2]. The most common measure of user opinion is the mean opinion score (MOS), obtained by averaging absolute category ratings (ACR). In ACR, listeners compare the distorted signal with their internal model of high-quality speech. In degradation MOS (DMOS) tests, the subjects listen to the original speech first, and are then asked to select the degradation category rating (DCR) corresponding to the distortion of the processed signal, see Table I. DMOS tests are more common in audio quality assessment [3], [4]. Assessment of listening quality [1]–[4] is not the only form of quality of service (QoS) monitoring. In many cases conversational subjective tests [2] are the preferred method of subjective evaluation; participants hold conversations over a number of different networks and vote on their perception of conversational quality. An objective model of conversational quality can be found in [5]. Yet another class

V. Grancharov, D. Y. Zhao, and W. B. Kleijn are with the Sound and Image Processing Lab, Royal Institute of Technology, Stockholm, Sweden (e-mail: [email protected]; [email protected]; [email protected]; phone: +46 87908819; fax: +46 87917654). J. Lindblom was with the Sound and Image Processing Lab, Royal Institute of Technology, and is currently with Skype Technologies (e-mail: [email protected]). This work was funded by Wireless@KTH.

TABLE I
GRADES IN MOS AND DMOS

Grade   ACR (MOS)   DCR (DMOS)
5       Excellent   Inaudible
4       Good        Audible, but not annoying
3       Fair        Slightly annoying
2       Poor        Annoying
1       Bad         Very annoying

of QoS monitoring consists of intelligibility tests. The most popular intelligibility tests are the Diagnostic Rhyme Test (DRT) and the Modified Rhyme Test (MRT) [6]. In this paper we will not further discuss intelligibility and conversational quality tests, and will focus entirely on ACR listening quality, denoted for simplicity as subjective quality.

Subjective tests are believed to give the "true" speech quality. However, the involvement of human listeners makes them expensive and time-consuming. They can be used only in the final stages of developing a speech communication system, and are not suitable for measuring QoS on a daily basis. Objective measures use mathematical expressions to predict speech quality. Their low cost means that they can be used to continuously monitor quality over the network. Two test situations can be distinguished: 1) intrusive (both the original and distorted signals are available), and 2) non-intrusive (only the distorted signal is available). The methods are illustrated in Fig. 1.

Fig. 1. Intrusive and non-intrusive quality assessment. Non-intrusive algorithms do not have access to the reference signal.

The simplest class of intrusive objective quality measures are waveform-comparison algorithms, such as the signal-to-noise ratio (SNR) and the segmental signal-to-noise ratio (SSNR). Waveform-comparison algorithms are simple to implement and have low computational complexity, but they do not correlate well with subjective measurements when different types of distortion are compared. Frequency-domain techniques, such as the Itakura-Saito (IS) measure and the spectral distortion (SD) measure, are also widely used. Frequency-domain techniques are not sensitive


to a time shift and are generally more consistent with human perception [7]. The distinguishing characteristic of both waveform-comparison and frequency-domain techniques is that they are equipped with very simple error pooling schemes and that they do not contain mappings trained on databases. By error pooling we denote the final stage of any quality metric, which has to combine the estimated per-frame distortions into a single value.

A significant number of intrusive perceptual-domain measures have been developed. These measures incorporate knowledge of the human perceptual system: mimicry of human perception is used for dimension reduction, and a "cognitive" stage performs the mapping to a quality scale. The cognitive stage is trained by means of a database. Such measures include the Bark Spectral Distortion (BSD) [8], the Perceptual Speech Quality Measure (PSQM) [9], and Measuring Normalizing Blocks (MNB) [10], [11]. Perceptual evaluation of speech quality (PESQ) [12] and perceptual evaluation of audio quality (PEAQ) [13] are standardized state-of-the-art algorithms for intrusive quality assessment of speech and audio, respectively.

Existing intrusive objective speech quality measures can automatically assess the performance of a communication system without the need for human listeners. However, intrusive measures require the presence of the original signal, which is typically not available in QoS monitoring. For such applications non-intrusive quality assessment must be used. These methods often include mimicry of human perception and/or a mapping to the quality measure that is trained using a database. An early attempt at a non-intrusive speech quality measure, based on the spectrogram of the perceived signal, is presented in [14]. The spectrogram is partitioned, and variance and dynamic range are calculated on a block-by-block basis. The average levels of variance and dynamic range are used to predict speech quality.
The non-intrusive speech quality assessment method reported in [15] attempts to predict the likelihood that the passing audio stream is generated by the human vocal production system. The speech stream under assessment is reduced to a set of features, and the parameterized data is used to generate physiologically based rules for error assessment. The measure proposed in [16] is based on comparing the output speech to an artificial reference signal that is appropriately selected from an optimally clustered codebook. The Perceptual Linear Prediction (PLP) [17] coefficients are used for a parametric representation of the speech. A fifth-order all-pole model is applied to suppress speaker-dependent details of the auditory spectrum. The average distance from the unknown test vector to its nearest reference centroids provides an indication of speech degradation. Recent algorithms based on Gaussian-mixture probability models (GMM) of features derived from perceptually motivated spectral-envelope representations can be found in [18] and [19]. A novel, perceptually motivated speech quality assessment algorithm based on a temporal envelope representation of speech is presented in [20]. The International Telecommunication Union (ITU) standard for non-intrusive quality assessment, ITU-T P.563, can be found

in [21]. A total of 51 speech features are extracted from the signal. Key features are used to determine a dominant distortion class, and in each distortion class a linear combination of features is used to predict an intermediate speech quality. The final speech quality is estimated from the intermediate quality and 11 additional features.

The measures listed above are designed to predict the effects of many types of distortion, and typically have high computational complexity. These types of algorithms will be referred to as general speech quality predictors. It has been shown that non-intrusive quality prediction is possible at much lower complexity if it is assumed that the type of distortion is known [22], [23]. However, this class of measures is likely to suffer from poor prediction performance if the expected working conditions are not met. We conclude that existing algorithms either have a high complexity and a broad range of application, or a low complexity and a narrow range of application. This has motivated us to develop a speech-quality assessment algorithm with low computational complexity. The algorithm predicts speech quality from generic features commonly used in speech coding, without assumptions about the type of distortion. In the proposed low-complexity, non-intrusive speech quality assessment (LCQA) algorithm an explicit distortion model is not used; instead, the quality estimate is based on global statistical properties of per-frame features. In the next section we provide the motivations for the critical choices made in the development of the LCQA algorithm, followed by a detailed algorithm description in section III. The performance of the proposed algorithm is compared with ITU-T P.563 in section IV.

II. KEY ISSUES IN OBJECTIVE QUALITY ASSESSMENT

In this section we discuss some unresolved questions in speech quality assessment.
We give the reasoning for the conceptual choices behind the particular LCQA implementation, and outline the distinguishing features of the algorithm. A critical issue in the design of an automatic system for QoS monitoring is the scale of the quality mapping: continuous or discrete. In practice, subjective MOS scores do not have a continuous character, due to the limited number of listeners' opinions used in averaging. This behavior is demonstrated in Fig. 2. One approach, proposed in [19], is to view quality prediction not as a regression problem, but rather as the classification of quality ratings into intervals of predefined range. The major disadvantage of this classification formulation is that the resolution has to be set in advance, which may not be appropriate if the algorithm is to be used in different applications. Therefore, the proposed LCQA algorithm predicts the speech quality on a continuous scale, a choice supported by the simulations presented in section IV.

Fig. 2. Distribution of the subjective MOS scores over a database of 1000 utterances. The 200-bin histogram shows the discrete character of subjective scores.

The human speech quality assessment process can be divided into two parts: 1) conversion of the received speech signal into auditory nerve excitations for the brain, and 2) cognitive processing in the brain, see Fig. 3. The key principles of the perceptual transform are signal masking, critical-band spectral resolution, the equal-loudness curve, and the intensity-loudness law, e.g., [24]. These principles are well studied, and in most existing quality assessment algorithms a perceptual transform is a pre-processing step. The main implicit purpose of the perceptual transform is to perform dimension reduction on the speech signal. An advantage of this mimicry-motivated approach is that it reduces the need for a sophisticated feature selection mechanism based on a database. However, this comes at the cost of the high computational expense of the perceptual transform. Moreover, mimicry may result in the removal of relevant information. Therefore, the proposed LCQA algorithm does not perform a perceptual transform; instead, the dimensionality is reduced jointly with optimizing the mapping function coefficients. This guarantees minimum loss of relevant information. Our approach is consistent with the recent emergence of algorithms performing quality assessment without a perceptual transform in image quality assessment [25].

Fig. 3. Human perception of speech quality involves both hearing and judgement.

Many of the existing quality assessment algorithms are based on specific models of distortion, e.g., the level of background noise, multiplicative noise, or the presence of ringing tones [21], or simulate a known distortion such as handset receiver characteristics [12]. The LCQA algorithm does not incorporate an explicit model of the distortion. The speech quality estimate is based entirely on the statistics of the processed speech signal, and the distortion is implicitly assessed by its impact on these statistics. As a result, the LCQA algorithm is easily adapted to next-generation communication systems that will likely produce new types of distortions.

In some methods the speaker-dependent information is removed [18], [16]. However, it is known that telephony systems provide higher quality scores for some voices than for others [26]. The algorithm presented in this paper incorporates the speaker-dependent information in the form of the pitch period and the coefficients of a tenth-order autoregressive (AR) model estimated by means of linear prediction.

III. LOW-COMPLEXITY QUALITY ASSESSMENT

The objective of the proposed LCQA algorithm is to provide an estimate of the MOS score of each utterance, using a simple set of features that is readily available from speech codecs in the network. Thus, the speech quality is predicted at low computational complexity, which makes the method useful for practical applications. The dotted area in Fig. 4 shows extraction of the per-frame feature vector from the compact speech representation used in Code-Excited Linear Prediction (CELP) coders [27].

Fig. 4. The structure of the LCQA algorithm. The dotted area represents the LCQA mode optimal for CELP coders. In any other environment, LCQA can extract the required features from the waveform.

Each 20 ms speech frame is represented by the variance of the excitation of the AR model, the pitch period, and a ten-dimensional vector of line-spectral frequencies (LSF) coefficients [28], {Een, Tn, fn}, where n is the frame index. We hypothesize that such a compact speech representation, successfully used in speech coding, is likely to give us meaningful features. The LCQA algorithm predicts the speech quality from the global speech statistics. The statistical properties of per-frame features, captured by per-utterance features, form the input for GMM mapping, which estimates the speech quality level on a MOS scale.
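As an illustration of this per-frame parametrization, the sketch below estimates the AR coefficients and the excitation variance for one frame via the autocorrelation method and the Levinson-Durbin recursion. This is a generic sketch, not the paper's codec front-end: the pitch estimator of [39] and the LSF conversion are omitted, and the function names are ours.

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: returns AR coefficients [1, a_1..a_p]
    and the prediction-error (excitation) variance, given the
    autocorrelation sequence r[0..p]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                      # reflection coefficient
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # E_i = E_{i-1}(1 - k_i^2)
    return a, err

def frame_parameters(frame, order=10):
    """Codec-style per-frame parameters: AR coefficients, excitation
    variance E_e, and signal variance E_s (the pitch T_n is not shown)."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # biased autocorrelation estimate (keeps |k_i| < 1, recursion stable)
    r = np.array([frame[:n - k] @ frame[k:] for k in range(order + 1)]) / n
    a, Ee = levinson_durbin(r, order)
    return a, Ee, r[0]
```

The biased autocorrelation estimate guarantees a positive semidefinite sequence, so the reflection coefficients stay in (-1, 1) and the recursion does not divide by zero for non-degenerate frames.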


A. Speech Features

The basis of any automatic quality analysis system is the extraction of a feature vector. The set of features used in LCQA aims to capture the structural information of a speech signal. This is motivated by the fact that the natural speech signal is highly structured, and it is likely that human quality judgement relies on patterns extracted from information describing this structure. In this section we list the features that we have selected.

The spectral flatness measure [29] is related to the amount of resonant structure in the power spectrum:

\Phi_1(n) = \frac{\exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi} \log(P_n(\omega))\, d\omega\right)}{\frac{1}{2\pi}\int_{-\pi}^{\pi} P_n(\omega)\, d\omega},   (1)

where the AR envelope P_n(ω) is defined as the frequency response of the AR model with coefficients a_k:

P(\omega) = \frac{1}{\left|1 + \sum_{k=1}^{p} a_k e^{-j\omega k}\right|^2}.   (2)

As a second feature we use the spectral dynamics, defined as

\Phi_2(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left(\log P_n(\omega) - \log P_{n-1}(\omega)\right)^2 d\omega.   (3)

The spectral dynamics have been studied and successfully used in speech coding [30] and speech enhancement [31]. The spectral centroid [32] determines the frequency area around which most of the signal energy concentrates:

\Phi_3(n) = \frac{\int_{-\pi}^{\pi} \omega \log(P_n(\omega))\, d\omega}{\int_{-\pi}^{\pi} \log(P_n(\omega))\, d\omega},   (4)

and it is also frequently used as an approximation for a measure of perceptual "brightness".

The last three features are the variance of the excitation of the AR model Een, the speech signal variance Esn, and the pitch period Tn. They are denoted Φ_4(n), Φ_5(n), and Φ_6(n), respectively. The features presented above and their first time derivatives (except the derivative of the spectral dynamics) are grouped in an 11-dimensional per-frame feature vector Φ(n). We hypothesize that the speech quality can be estimated from the statistical properties of these per-frame features, and describe their probability distribution with the mean, variance, skewness, and kurtosis. The moments are calculated independently for each feature, which gives a set of features that globally describe one speech utterance:

\mu_{\Phi_i} = \frac{1}{|\tilde{\Omega}|} \sum_{n\in\tilde{\Omega}} \Phi_i(n),   (5)

\sigma_{\Phi_i} = \frac{1}{|\tilde{\Omega}|} \sum_{n\in\tilde{\Omega}} (\Phi_i(n) - \mu_{\Phi_i})^2,   (6)

s_{\Phi_i} = \frac{1}{|\tilde{\Omega}|\, \sigma_{\Phi_i}^{3/2}} \sum_{n\in\tilde{\Omega}} (\Phi_i(n) - \mu_{\Phi_i})^3,   (7)

k_{\Phi_i} = \frac{1}{|\tilde{\Omega}|\, \sigma_{\Phi_i}^{2}} \sum_{n\in\tilde{\Omega}} (\Phi_i(n) - \mu_{\Phi_i})^4.   (8)

With Ω̃ we denote the set of frames, of cardinality |Ω̃|, used to calculate the statistics of each per-frame feature Φ_i(n).
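Assuming the per-frame track of one feature is available as an array, the global descriptors (5)-(8) amount to a few lines. This is illustrative only; note that σ in (6) is a variance, which is why (7) normalizes by σ^{3/2} rather than by the cube of a standard deviation.

```python
import numpy as np

def global_descriptors(phi):
    """Mean, variance, skewness, and kurtosis of one per-frame feature
    track phi over the accepted frame set, following eqs. (5)-(8)."""
    phi = np.asarray(phi, dtype=float)
    mu = phi.mean()                                   # eq. (5)
    sigma = ((phi - mu) ** 2).mean()                  # eq. (6), a variance
    skew = ((phi - mu) ** 3).mean() / sigma ** 1.5    # eq. (7)
    kurt = ((phi - mu) ** 4).mean() / sigma ** 2      # eq. (8)
    return mu, sigma, skew, kurt
```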

The global features are grouped in one feature vector Ψ = {µΦi, σΦi, sΦi, kΦi}, i = 1, ..., 11. In the next subsection we describe a two-step dimensionality reduction procedure that 1) extracts the "best" subset of frames Ω̃ out of all available frames Ω, and 2) transforms the feature vector Ψ into a feature vector Ψ̃ of low dimensionality.

B. Dimensionality Reduction

The feature selection algorithm is important to the practical performance of quality assessment systems. Its main purpose is to improve the predictive accuracy of the quality assessment system by removing irrelevant and redundant data. A commonly used approach in the quality assessment literature is to remove non-speech regions based on a voice activity detector or an energy threshold [33]. It is interesting to note that the removal of low-energy regions can be seen as a feature selection problem. We propose a generalization of this concept by considering activity thresholds in all feature dimensions. The scheme, presented in Table II, allows speech-active frames to be excluded if they do not carry information that improves the accuracy of speech quality prediction.

TABLE II
THE OPTIMAL SET OF FRAMES AS A FUNCTION OF A THRESHOLD VECTOR Θ

initialize: Ω̃ = ∅
for n ∈ Ω
    if Φ_1(n) ∈ [Θ_1^L, Θ_1^U] & ... & Φ_11(n) ∈ [Θ_11^L, Θ_11^U]
        accept the n-th frame: Ω̃ = Ω̃ + {n}

From Table II we can see that the optimal set of frames is determined by the threshold Θ = {Θ_i^L, Θ_i^U}, i = 1, ..., 11, i.e., Ω̃ = Ω̃(Θ). We search for the threshold Θ that minimizes the criterion ε:

\Theta = \arg\min_{\Theta^*} \varepsilon(\tilde{\Omega}(\Theta^*)).   (9)

The criterion ε is related to the root-mean-square error (RMSE) performance of the LCQA algorithm, and is properly defined in section IV. The choice of optimization criterion is motivated by the fact that no objective measure other than the performance of the regression function can determine the set of optimal features.

Once the optimal subset of frames Ω̃ is found, we search for the optimal subset of features Ψ̃. This optimization step is defined as follows: given the original set of features Ψ of cardinality |Ψ|, and the optimal set of frames Ω̃, select a subset of features Ψ̃ ⊂ Ψ of cardinality |Ψ̃| < |Ψ| that is optimized for the performance of the LCQA algorithm:

\tilde{\Psi} = \arg\min_{\tilde{\Psi}^* \subset \Psi} \varepsilon(\tilde{\Psi}^*).   (10)

A full search is the only dimensionality reduction procedure that guarantees that a global optimum is found. It is rarely applied due to its high computational requirements.
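The frame-acceptance rule of Table II is a per-dimension interval test. A compact sketch follows, with the threshold vectors supplied by the caller, since their optimization via (9) is not shown here.

```python
import numpy as np

def select_frames(features, theta_lo, theta_hi):
    """Frame selection of Table II: frame n enters the accepted set
    only if every feature dimension lies within its [lower, upper]
    activity threshold. Returns the indices of accepted frames."""
    features = np.asarray(features, dtype=float)   # shape (frames, dims)
    inside = (features >= theta_lo) & (features <= theta_hi)
    return np.flatnonzero(inside.all(axis=1))
```

For example, with two feature dimensions and unit thresholds, a frame whose first feature exceeds the upper threshold is rejected even though it is speech-active.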


The well-known Sequential Forward Selection and Sequential Backward Selection procedures, e.g., [34], are only step-optimal, since the best (worst) feature is added (discarded) at each step, but the decision cannot be corrected at a later stage. The more advanced (L,R) algorithm [35] consists of applying Sequential Forward Selection L times, followed by R steps of Sequential Backward Selection. The Floating Search methods [36] are extensions of the (L,R) search methods in which the number of forward and backward steps is not pre-defined, but obtained dynamically. In our simulations we use the Sequential Floating Backward Selection procedure, which consists of applying, after each backward step, a number of forward steps as long as the resulting subsets are better than the previously evaluated ones, see Table III.

TABLE III
THE SEQUENTIAL FLOATING BACKWARD SELECTION PROCEDURE

initialize: Ψ̃ = Ψ
while error does not increase by more than a threshold
    Exclusion step: find the least significant feature
        Ψ_{i−} = arg min_{Ψ_i ∈ Ψ̃} ε(Ψ̃ − {Ψ_i})
    Exclude the feature: Ψ̃ = Ψ̃ − {Ψ_{i−}}
    while error decreases by more than a threshold
        Inclusion step: find the most significant feature
            Ψ_{i+} = arg min_{Ψ_i ∉ Ψ̃} ε(Ψ̃ + {Ψ_i})
        Include the feature: Ψ̃ = Ψ̃ + {Ψ_{i+}}

The presented two-stage dimensionality reduction procedure is sub-optimal, i.e., we do not optimize jointly for the optimal sets Ω̃ and Ψ̃. The main reason for this is the high computational complexity. However, the simulations presented in section IV show that the proposed training scheme is sufficient to outperform the reference quality assessment methods.

C. Quality Estimation Given the Global Feature Set

Let Q denote the subjective quality of an utterance as obtained from a MOS-labelled training database. We construct an objective estimator Q̂ of the subjective quality as a function of a feature vector, i.e., Q̂ = Q̂(Ψ̃), and search for the function closest to the subjective quality with respect to the criterion

\hat{Q}(\tilde{\Psi}) = \arg\min_{Q^*(\tilde{\Psi})} E\{(Q - Q^*(\tilde{\Psi}))^2\},   (11)

where E{·} is the expectation operator. The criterion defined above is the probabilistic measure corresponding to (10). It is well known, e.g., [37], that (11) is minimized by the conditional expectation

\hat{Q}(\tilde{\Psi}) = E\{Q|\tilde{\Psi}\},   (12)

and the problem reduces to the estimation of the conditional probability. To facilitate this estimation, we model the joint density of the feature variables and the subjective MOS scores as a Gaussian mixture,

f(\varphi|\lambda) = \sum_{m=1}^{M} \omega^{(m)} \mathcal{N}(\varphi|\mu^{(m)}, \Sigma^{(m)}),   (13)

where φ = [Q, Ψ̃], the ω^{(m)} are the mixture weights, and N(φ|μ^{(m)}, Σ^{(m)}) are multivariate Gaussian densities. The Gaussian mixture is completely specified by the mean vectors, covariance matrices, and mixture weights,

\lambda = \{\omega^{(m)}, \mu^{(m)}, \Sigma^{(m)}\}_{m=1}^{M},   (14)

and these coefficients are estimated off-line from a large training set using the EM algorithm [38]. Finally, we express the optimal quality estimator (12) in the form of a weighted sum of known quantities:

E\{Q|\tilde{\Psi}\} = \sum_{m=1}^{M} u^{(m)}(\tilde{\Psi})\, \mu_{Q|\tilde{\Psi}}^{(m)},   (15)

where

u^{(m)}(\tilde{\Psi}) = \frac{\omega^{(m)} \mathcal{N}(\tilde{\Psi}|\mu_{\tilde{\Psi}}^{(m)}, \Sigma_{\tilde{\Psi}\tilde{\Psi}}^{(m)})}{\sum_{k=1}^{M} \omega^{(k)} \mathcal{N}(\tilde{\Psi}|\mu_{\tilde{\Psi}}^{(k)}, \Sigma_{\tilde{\Psi}\tilde{\Psi}}^{(k)})},   (16)

and

\mu_{Q|\tilde{\Psi}}^{(m)} = \mu_{Q}^{(m)} + \Sigma_{\tilde{\Psi}Q}^{(m)} (\Sigma_{\tilde{\Psi}\tilde{\Psi}}^{(m)})^{-1} (\tilde{\Psi} - \mu_{\tilde{\Psi}}^{(m)}).   (17)
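A sketch of the estimator (15)-(17), assuming a trained model λ is given (EM training is not shown); the convention that Q occupies index 0 of the joint mean and covariance is ours, not the paper's.

```python
import numpy as np

def gmm_predict_quality(psi, weights, means, covs):
    """MMSE quality estimate E{Q|psi} under a joint GMM over
    phi = [Q, psi], following eqs. (15)-(17). Each means[m] is the
    joint mean [mu_Q, mu_psi]; covs[m] is the joint covariance,
    with Q assumed to sit at index 0."""
    psi = np.asarray(psi, dtype=float)
    d = psi.size
    M = len(weights)
    resp = np.zeros(M)
    cond = np.zeros(M)
    for m in range(M):
        mu_q, mu_p = means[m][0], means[m][1:]
        S_qp = covs[m][0, 1:]          # cross-covariance of Q and psi
        S_pp = covs[m][1:, 1:]         # covariance of the psi-marginal
        diff = psi - mu_p
        inv = np.linalg.inv(S_pp)
        # unnormalized u^(m): psi-marginal Gaussian density, eq. (16)
        dens = np.exp(-0.5 * diff @ inv @ diff) / \
               np.sqrt((2 * np.pi) ** d * np.linalg.det(S_pp))
        resp[m] = weights[m] * dens
        # component-conditional mean of Q given psi, eq. (17)
        cond[m] = mu_q + S_qp @ inv @ diff
    resp /= resp.sum()                 # normalize, eq. (16)
    return float(resp @ cond)          # weighted sum, eq. (15)
```

With a single component and cross-covariance 0.5, the estimate reduces to the familiar linear regression of Q on Ψ̃.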
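Returning to the feature selection of Table III, the floating search can be sketched as a greedy wrapper around an arbitrary error function; the error function below is a stand-in for the regression error ε, and the two stopping thresholds are simplified to a single tolerance.

```python
def sfbs(features, error_fn, tol=1e-6):
    """Sequential Floating Backward Selection (sketch of Table III):
    repeatedly drop the least significant feature, then re-include
    features as long as that strictly decreases the error.
    error_fn maps a frozenset of features to a validation error."""
    current = set(features)
    best_err = error_fn(frozenset(current))
    while len(current) > 1:
        # Exclusion step: find the least significant feature
        cand, cand_err = min(
            ((f, error_fn(frozenset(current - {f}))) for f in current),
            key=lambda t: t[1])
        if cand_err > best_err + tol:
            break                        # error would grow: stop
        current.remove(cand)
        best_err = cand_err
        # Inclusion steps: re-add features while the error decreases
        improved = True
        while improved:
            improved = False
            outside = set(features) - current
            if not outside:
                break
            f_in, err_in = min(
                ((f, error_fn(frozenset(current | {f}))) for f in outside),
                key=lambda t: t[1])
            if err_in + tol < best_err:
                current.add(f_in)
                best_err = err_in
                improved = True
    return current, best_err
```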

D. Implementation Details

In this section we describe how the n-th frame features are calculated, based entirely on {Een, Tn, fn} and {Een−1, Tn−1, fn−1}. We then show how the global statistical properties are calculated recursively, without storing the local features in a buffer. We calculate the pitch period Tn according to [39], and the AR coefficients are extracted from the speech signal every 20 ms without overlap. To keep the complexity of the LCQA algorithm low, we redefine the per-frame features spectral flatness, spectral dynamics, and spectral centroid. The new definitions are based entirely on the speech codec bit-stream, and signal reconstruction is avoided.

We calculate the spectral flatness as the ratio of the tenth-order prediction error variance and the signal variance:

\Phi_1(n) = \frac{E_{e_n}}{E_{s_n}}.   (18)

Given the variance of the excitation of the AR model, its definition

e_k = s_k - \sum_{i=1}^{10} a_i s_{k-i},   (19)

and the AR coefficients a_i, we calculate the signal variance without reconstructing the waveform s_k, using the reverse Levinson-Durbin recursion (step-down algorithm). The spectral dynamics are redefined as a weighted Euclidean distance in the LSF space:

\Phi_2(n) = (f_n - f_{n-1})^T W_n (f_n - f_{n-1}),   (20)

where the inverse harmonic mean weights are defined by the components of the LSF vector:

W_n^{(ii)} = (f_n^{(i)} - f_n^{(i-1)})^{-1} + (f_n^{(i+1)} - f_n^{(i)})^{-1},   (21)

W_n^{(ij)} = 0, \quad i \neq j.   (22)

These weights are also used to obtain a redefined spectral centroid:

\Phi_3(n) = \frac{\sum_{i=1}^{10} i\, W_n^{(ii)}}{\sum_{i=1}^{10} W_n^{(ii)}}.   (23)

We calculate the selected global descriptors recursively, i.e., the per-frame features are not stored in a buffer. Until the end of the utterance the mean is recursively updated as

\mu_{\Phi}(n) = \frac{n-1}{n}\,\mu_{\Phi}(n-1) + \frac{1}{n}\,\Phi(n)   (24)

to obtain the desired µ_Φ. Here n is the index over the accepted frame set Ω̃, as discussed earlier in this section. In a similar fashion, we propagate Φ²(n), Φ³(n), and Φ⁴(n) to obtain the moments µ_{Φ²}, µ_{Φ³}, and µ_{Φ⁴}. These quantities are used to obtain the remaining global descriptors, namely the variance, skewness, and kurtosis:

\sigma_{\Phi} = \mu_{\Phi^2} - (\mu_{\Phi})^2,   (25)

s_{\Phi} = \frac{\mu_{\Phi^3} - 3\mu_{\Phi}\mu_{\Phi^2} + 2(\mu_{\Phi})^3}{\sigma_{\Phi}^{3/2}},   (26)

k_{\Phi} = \frac{\mu_{\Phi^4} - 4\mu_{\Phi}\mu_{\Phi^3} + 6(\mu_{\Phi})^2\mu_{\Phi^2} - 3(\mu_{\Phi})^4}{\sigma_{\Phi}^{2}}.   (27)
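The recursions (24)-(27) can be sketched as a small accumulator class (illustrative only; frame acceptance per Table II is assumed to happen before `update` is called):

```python
import numpy as np

class RunningMoments:
    """Buffer-free accumulation of mu_Phi, mu_{Phi^2}, mu_{Phi^3},
    mu_{Phi^4} via the running-mean update (24), then variance,
    skewness, and kurtosis via eqs. (25)-(27)."""
    def __init__(self):
        self.n = 0
        self.m = np.zeros(4)   # running means of Phi^1 .. Phi^4

    def update(self, phi):
        """Fold in one accepted frame's feature value."""
        self.n += 1
        powers = np.array([phi, phi ** 2, phi ** 3, phi ** 4])
        self.m += (powers - self.m) / self.n        # eq. (24) for each power

    def descriptors(self):
        m1, m2, m3, m4 = self.m
        var = m2 - m1 ** 2                                          # (25)
        skew = (m3 - 3 * m1 * m2 + 2 * m1 ** 3) / var ** 1.5        # (26)
        kurt = (m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2
                - 3 * m1 ** 4) / var ** 2                           # (27)
        return m1, var, skew, kurt
```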
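The step-down computation of the signal variance used in (18)-(19) can be sketched as follows; this is a generic reverse Levinson-Durbin recursion under our own conventions, not the paper's exact code.

```python
import numpy as np

def signal_variance_from_ar(a, Ee):
    """Reverse Levinson-Durbin (step-down): recover the signal variance
    E_s from the AR coefficients a = [1, a_1, ..., a_p] and the
    excitation variance E_e, without reconstructing the waveform."""
    a = np.asarray(a, dtype=float)
    p = len(a) - 1
    Es = Ee
    cur = a.copy()
    for i in range(p, 0, -1):
        k = cur[i]                       # reflection coefficient k_i
        Es /= 1.0 - k * k                # undo E_i = E_{i-1}(1 - k_i^2)
        prev = np.zeros(i)
        prev[0] = 1.0
        for j in range(1, i):            # step down to order i-1
            prev[j] = (cur[j] - k * cur[i - j]) / (1.0 - k * k)
        cur = prev
    return Es                            # equals r(0), the signal variance
```

For an AR(1) model with a_1 = -0.5 and E_e = 0.75, the recursion returns E_s = 0.75 / (1 - 0.25) = 1.0, matching the forward Levinson-Durbin relation.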

Table IV gives a short overview of the proposed LCQA algorithm.

TABLE IV
OVERVIEW OF THE LCQA ALGORITHM

1) For the n-th speech frame, calculate {Een, Tn, fn} from the waveform, or extract them from the bit-stream.
2) Calculate the per-frame feature vector Φ(n), based on {Een, Tn, fn} and the buffered {Een−1, Tn−1, fn−1}.
3) Over the selected subset of frames Ω̃, recursively calculate the moments {µ_Φ, µ_{Φ²}, µ_{Φ³}, µ_{Φ⁴}}. Frame selection is controlled by the threshold Θ.
4) At the end of the utterance, calculate the global feature vector Ψ̃ = {µΦi, σΦi, sΦi, kΦi} as the mean, variance, skewness, and kurtosis of the local features.
5) Predict the speech quality as a function of the global feature vector, Q̂ = Q̂(Ψ̃), through GMM mapping.

IV. SIMULATIONS

In this section we discuss the training procedure and the MOS-labelled databases used with the LCQA algorithm. We also present simulation results, with respect to both the prediction accuracy and the computational complexity of the proposed algorithm.

A. Training

For the training procedure we used 11 MOS-labelled databases provided by Ericsson AB, and one ITU database [40]. The combined database contains utterances in the following languages: English, French, Japanese, Italian, and Swedish. The database contains a large variety of distortions, such as different coding, tandeming, and modulated noise reference unit (MNRU) [41] conditions, as well as packet loss, background noise, effects of noise suppression, switching effects, different input levels, etc. The total size of the database is 7646 speech files. In the training we use 10-fold cross-validation with 20% of the speech material, to provide robustness in the performance evaluation [42]. To further improve generalization performance we perform training with noise: we create virtual training patterns by adding zero-mean white Gaussian noise to the true training patterns.

B. Performance Evaluation

In this section we compare the performance of the proposed LCQA algorithm with the standardized ITU-T P.563. The estimation performance is assessed using the correlation coefficient R and the RMSE ε between the predicted quality Q̂ and the subjective quality Q. The RMSE is given by

\varepsilon = \sqrt{\frac{\sum_{i=1}^{N} (Q_i - \hat{Q}_i)^2}{N}},   (28)

and the correlation coefficient is defined as

R = \frac{\sum_{i=1}^{N} (\hat{Q}_i - \mu_{\hat{Q}})(Q_i - \mu_{Q})}{\sqrt{\sum_{i=1}^{N} (\hat{Q}_i - \mu_{\hat{Q}})^2}\, \sqrt{\sum_{i=1}^{N} (Q_i - \mu_{Q})^2}},   (29)
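For reference, (28) and (29) in code (a direct transcription of the two formulas):

```python
import numpy as np

def rmse(q_pred, q_true):
    """Eq. (28): root-mean-square error between predicted and
    subjective MOS values."""
    q_pred = np.asarray(q_pred, dtype=float)
    q_true = np.asarray(q_true, dtype=float)
    return float(np.sqrt(np.mean((q_true - q_pred) ** 2)))

def correlation(q_pred, q_true):
    """Eq. (29): Pearson correlation coefficient between predicted
    and subjective MOS values."""
    q_pred = np.asarray(q_pred, dtype=float)
    q_true = np.asarray(q_true, dtype=float)
    dp = q_pred - q_pred.mean()
    dt = q_true - q_true.mean()
    return float(np.sum(dp * dt) / np.sqrt(np.sum(dp ** 2) * np.sum(dt ** 2)))
```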

where µ_Q and µ_Q̂ are the mean values of the corresponding variables. Here N is the number of MOS-labelled utterances used in the evaluation. Table V contains the averaged results of the cross-validation tests, and Table VI contains the RMSE outliers in %. The test results clearly indicate that the proposed LCQA algorithm outperforms the standardized ITU-T P.563. In Fig. 5 we demonstrate the correlation between subjective speech quality ratings and LCQA-predicted values over a database.

TABLE V
AVERAGED PERFORMANCE IN CORRELATION AND RMSE

               R      ε
LCQA          0.89   0.39
ITU-T P.563   0.75   0.61

TABLE VI
OUTLIERS IN RMSE, AVERAGED OVER CROSS-VALIDATION TESTS

               Outliers (in %)
               ε > 0.6   ε > 0.8   ε > 1.0
LCQA             6.1       3.9       2.6
ITU-T P.563     22.5      14.6      10.3

Fig. 5. Correlation between subjective and predicted MOS values for the ITU-T P.23 database.

Processing time and memory requirements are important figures of merit for quality estimation algorithms. The LCQA algorithm has insignificant memory requirements: a buffer of 12 scalar values calculated from the previous and current frames is needed (future frames are not required), as well as memory for the 12 Gaussian mixtures. In Table VII we demonstrate the difference in computational complexity between the proposed LCQA algorithm and ITU-T P.563. The comparison is between the optimized ANSI-C implementation of ITU-T P.563 and the MATLAB 7 implementation of LCQA, both executed on a Pentium 4 machine at 2.8 GHz with 1 GB RAM. With LCQA-P we denote the case where the input features {Een, Tn, fn} are readily available from codecs used in the network.

TABLE VII
EXECUTION TIME (IN S) FOR UTTERANCES OF AVERAGE LENGTH 8 S

               Execution time (in s)
ITU-T P.563          4.63
LCQA                 1.24
LCQA-P               0.01

V. CONCLUSIONS

We demonstrated that a low-cost non-intrusive speech quality assessment algorithm can be a valuable tool for monitoring the performance of a speech communication system. The proposed quality assessment algorithm operates on a heavily restricted parametric representation of speech, without the need for a perceptual transform of the input signal. By means of simulations over a large database we demonstrated that the presented algorithm predicts speech quality more accurately than the standardized ITU-T P.563, at much lower complexity. In the proposed algorithm the distortion is modeled only implicitly, by its effect on the distribution of the selected speech features. Since there is no explicit distortion model, the algorithm is easily extendable towards quality assessment of future communication systems.

ACKNOWLEDGMENT

The authors would like to thank Stefan Bruhn of Ericsson AB for providing the MOS databases.

REFERENCES

REFERENCES

[1] ITU-T Rec. P.830, "Subjective performance assessment of telephone-band and wideband digital codecs," 1996.
[2] ITU-T Rec. P.800, "Methods for subjective determination of transmission quality," 1996.
[3] ITU-R Rec. BS.1534, "Method for the subjective assessment of intermediate quality level of coding systems," 2001.
[4] ITU-R Rec. BS.562-3, "Subjective assessment of sound quality," 1990.
[5] ITU-T Rec. G.107, "The E-model, a computational model for use in transmission planning," 2003.
[6] M. Goldstein, "Classification of methods used for assessment of text-to-speech systems according to the demands placed on the listener," Speech Communication, vol. 16, pp. 225–244, 1995.
[7] S. Quackenbush, T. Barnwell, and M. Clements, Objective Measures of Speech Quality. Prentice Hall, 1988.
[8] S. Wang, A. Sekey, and A. Gersho, "An objective measure for predicting subjective quality of speech coders," IEEE J. Selected Areas in Commun., vol. 10, no. 5, pp. 819–829, 1992.
[9] J. Beerends and J. Stemerdink, "A perceptual speech-quality measure based on a psychoacoustic sound representation," J. Audio Eng. Soc., vol. 42, no. 3, pp. 115–123, 1994.
[10] S. Voran, "Objective estimation of perceived speech quality - Part I: Development of the measuring normalizing block technique," IEEE Trans. Speech, Audio Processing, vol. 7, no. 4, pp. 371–382, 1999.
[11] S. Voran, "Objective estimation of perceived speech quality - Part II: Evaluation of the measuring normalizing block technique," IEEE Trans. Speech, Audio Processing, vol. 7, no. 4, pp. 383–390, 1999.
[12] ITU-T Rec. P.862, "Perceptual evaluation of speech quality (PESQ)," 2001.
[13] ITU-R Rec. BS.1387, "Method for objective measurements of perceived audio quality (PEAQ)," 1998.
[14] O. Au and K. Lam, "A novel output-based objective speech quality measure for wireless communication," Signal Processing Proceedings, 4th Int. Conf., vol. 1, pp. 666–669, 1998.
[15] P. Gray, M. Hollier, and R. Massara, "Non-intrusive speech-quality assessment using vocal-tract models," in Proc. IEE Vision, Image and Signal Processing, vol. 147, pp. 493–501, 2000.
[16] J. Liang and R. Kubichek, "Output-based objective speech quality," IEEE 44th Vehicular Technology Conf., vol. 3, pp. 1719–1723, 1994.
[17] H. Hermansky, "Perceptual linear prediction (PLP) analysis of speech," J. Acoust. Soc. Amer., vol. 87, pp. 1738–1752, 1990.
[18] T. Falk, Q. Xu, and W.-Y. Chan, "Non-intrusive GMM-based speech quality measurement," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 1, pp. 125–128, 2005.
[19] G. Chen and V. Parsa, "Bayesian model based non-intrusive speech quality evaluation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 1, pp. 385–388, 2005.
[20] D. Kim, "ANIQUE: An auditory model for single-ended speech quality estimation," IEEE Trans. Speech, Audio Processing, vol. 13, pp. 821–831, 2005.
[21] ITU-T Rec. P.563, "Single-ended method for objective speech quality assessment in narrow-band telephony applications," 2004.
[22] M. Werner, T. Junge, and P. Vary, "Quality control for AMR speech channels in GSM networks," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 3, pp. 1076–1079, 2004.
[23] A. Conway, "Output-based method of applying PESQ to measure the perceptual quality of framed speech signals," in Proc. IEEE Wireless Communications and Networking, vol. 4, pp. 2521–2526, 2004.
[24] B. C. J. Moore, An Introduction to the Psychology of Hearing. London: Academic Press, 1989.
[25] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Processing, vol. 13, pp. 600–612, 2004.
[26] R. Reynolds and A. Rix, "Quality VoIP - an engineering challenge," BT Technology Journal, vol. 19, pp. 23–32, 2001.
[27] M. Schroeder and B. Atal, "Code-excited linear prediction (CELP): High-quality speech at very low bit rates," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 937–940, 1985.
[28] F. Itakura, "Line spectrum representation of linear predictor coefficients of speech signals," J. Acoust. Soc. Amer., vol. 57, S35(A), 1975.
[29] N. S. Jayant and P. Noll, Digital Coding of Waveforms. Englewood Cliffs, NJ: Prentice-Hall, 1984.
[30] H. Knagenhjelm and W. B. Kleijn, "Spectral dynamics is more important than spectral distortion," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 1, pp. 732–735, 1995.
[31] T. Quatieri and R. Dunn, "Speech enhancement based on auditory spectral change," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 1, pp. 257–260, 2002.
[32] J. Beauchamp, "Synthesis by spectral amplitude and brightness matching of analyzed musical instrument tones," J. Audio Eng. Soc., vol. 30, pp. 396–406, 1982.
[33] S. Voran, "A simplified version of the ITU algorithm for objective measurement of speech codec quality," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 1, pp. 537–540, 1998.
[34] P. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. London, UK: Prentice Hall, 1982.
[35] S. Stearns, "On selecting features for pattern classifiers," in Proc. 3rd Int. Conf. on Pattern Recognition, pp. 71–75, 1976.
[36] P. Pudil, F. Ferri, J. Novovicova, and J. Kittler, "Floating search methods for feature selection with nonmonotonic criterion functions," in Proc. IEEE Int. Conf. Pattern Recognition, pp. 279–283, 1994.
[37] T. Söderström, Discrete-time Stochastic Systems. London: Springer-Verlag, second ed., 2002.
[38] A. Dempster, N. Laird, and D. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 39, pp. 1–38, 1977.
[39] W. B. Kleijn, P. Kroon, L. Cellario, and D. Sereno, "A 5.85 kb/s CELP algorithm for cellular applications," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, vol. 2, pp. 596–599, 1993.
[40] ITU-T Rec. P. Supplement 23, "ITU-T coded-speech database," 1998.
[41] ITU-T Rec. P.810, "Modulated noise reference unit (MNRU)," 1996.
[42] R. Duda, P. Hart, and D. Stork, Pattern Classification. Wiley-Interscience, second ed., 2001.
