Voice Filtering over a Wideband Stereophonic Audio Signal

Debdoot Sheet
Department of Electronics & Communication Engineering
Institute of Engineering & Management
Salt Lake City, Kolkata, INDIA 700 091
[email protected]

Abstract— An audio signal consists of voiced as well as unvoiced components. Human voice signals are, in general, band limited to the range of 300 Hz to 3.4 kHz. The CCITT recommendation extends this with a guard band up to 4 kHz. However, in analysing samples of voiced sounds we have found the spectrum to remain largely concentrated in the region of 300 Hz to 2.5 kHz. For a stereophonic audio signal conforming to the MPEG standard, the spectrum extends up to 24 kHz, typically with a sampling rate of 48 kHz. Down sampling is therefore required before such a signal can be transmitted over a telecommunication channel with a sampling rate of 8 kHz. To reduce the chance of losing useful voiced information in a typical audio signal, filtering of the signal prior to down sampling is of utmost importance. The following methodology proposes one such scheme to band limit a wideband audio signal to the voice band for the purpose of down sampling without considerable degradation in signal quality.

Index Terms—Audio signal, Telecommunication, DSP, Filter, Sampling, Bandwidth, Voice.
I. INTRODUCTION

Stereophonic sound, commonly called stereo, is the reproduction of sound using two or more independent audio channels, through a symmetrical configuration of loudspeakers, in such a way as to create a pleasant and natural impression of sound heard from various directions, as in natural hearing. It is often contrasted with monophonic (or "monaural", or just mono) sound, where audio is in the form of one channel, often centered in the sound field (analogous to a visual field). The word "stereophonic" — derived from Greek stereos = "solid" and phōnē = "sound" — was coined by Western Electric, by analogy with the word "stereoscopic". In popular usage, stereo usually means 2-channel sound recording and sound reproduction using data for more than one speaker simultaneously [1]. Stereo recordings often cannot be played on monaural systems without a significant loss of fidelity. Since each microphone records each wavefront at a slightly different time, the wavefronts are out of phase; as a result, constructive and destructive interference can occur if both tracks are played back on the same speaker [2]. This phenomenon is known as phase cancellation.
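The phase cancellation described above is easy to demonstrate numerically. The following sketch (synthetic tones, not part of the original paper; Python is used purely for illustration, since the paper's own implementation is in MATLAB/SIMULINK) mixes two stereo channels down to one speaker and measures the surviving signal level:

```python
import math

fs, n = 48000, 480          # sampling rate and number of samples (illustrative)
f = 1000                    # test tone frequency in Hz

# Two stereo channels carrying the same tone, once in phase and once out of phase.
in_phase_l = [math.sin(2 * math.pi * f * t / fs) for t in range(n)]
in_phase_r = list(in_phase_l)
out_phase_r = [-s for s in in_phase_l]          # 180 degrees out of phase

def mono_mix(left, right):
    """Play both tracks through one speaker: the samples simply add."""
    return [(l + r) / 2 for l, r in zip(left, right)]

def rms(x):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in x) / len(x))

print(rms(mono_mix(in_phase_l, in_phase_r)))    # tone survives
print(rms(mono_mix(in_phase_l, out_phase_r)))   # tone cancels
```

With the channels in phase the tone's RMS level is preserved (about 0.707 for a unit sine); with one channel inverted the mono mix cancels to zero, which is exactly the fidelity loss stereo recordings can suffer on monaural playback.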
A voice frequency (VF) or voice band is one of the frequency bands within the audio range that is used for the transmission of speech. The voiced speech of a typical adult male has a fundamental frequency from 85 to 155 Hz, and that of a typical adult female from 165 to 255 Hz [4]. Thus, the fundamental frequency of most speech falls at the bottom of the "voice frequency" band as defined above. However, enough of the harmonic series is present for the missing fundamental to create the impression of hearing the fundamental tone [3]. In telephony, the usable voice frequency band ranges from approximately 300 Hz to 3400 Hz. It is for this reason that the band of the electromagnetic spectrum between 300 and 3000 Hz is also referred to as "voice frequency" (despite the fact that this part of the electromagnetic spectrum is rarely used to actually convey audio signals). The bandwidth allocated for a single voice-frequency transmission channel is usually 4 kHz, including guard bands, allowing a sampling rate of 8 kHz to be used as the basis of the pulse code modulation system of the digital PSTN [4][10].

The aforementioned spectral spread of a voiced signal highlights a problem for transmission over the narrowband telecommunication channel [10]. The problem cannot be solved by down sampling alone, as this may lead to a considerable loss of the voiced information. The technique presented here instead attempts to solve the problem by filtering the informative voice band out of the wideband audio signal for faithful transmission over the telecommunication channel.

II. NATURE OF AN AUDIO SIGNAL
An audio signal typically has a spectral spread of 20 Hz to 20 kHz, although the audible range varies from listener to listener. According to the CCITT recommendation, a voiced signal has a spectral spread of 300 Hz to 3.4 kHz [10]. Laboratory analysis of 256-point FFT spectra of typical voiced sounds showed the energy to be concentrated in the range of 300 Hz to 3 kHz¹ (Fig. 2.1).
1. The measurement was carried out using the following configuration: Pentium 4 CPU, MATLAB R2007b [6], AudioMax PC sound card with 48 kHz sampling.
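The 256-point FFT analysis mentioned above can be reproduced in miniature. The sketch below is an illustrative stand-in (in Python, not the paper's MATLAB setup; the 1 kHz tone and 8 kHz rate are assumed values, not the paper's test material) that computes a frame's magnitude spectrum with a naive DFT and locates its peak:

```python
import cmath, math

def dft_magnitude(frame):
    """Naive O(N^2) DFT magnitude spectrum of one frame (first N/2 bins)."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc))
    return mags

fs, n = 8000, 256                     # sampling rate and frame length
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(n)]  # 1 kHz tone
spectrum = dft_magnitude(tone)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print(peak_bin * fs / n)              # frequency of the spectral peak in Hz
```

Each bin k corresponds to frequency k·fs/N, so a 256-point frame at 8 kHz gives a resolution of 31.25 Hz per bin; applied to recorded speech frames, the peak bins cluster in the 300 Hz to 3 kHz region reported above.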
National Conference on VLSI & Communication 2008
Fig. 2.1. Spectral response of a voiced signal

The spectral response of the voiced signal gives information about its spectral spread and bandwidth. However, an audio signal typically consists of both voice and music (unvoiced content) spread over a considerably wider spectrum (Fig. 2.2), so a technique is necessary for faithful extraction of the voice signal.
Fig. 2.2. Spectral response of an audio signal

III. POSSIBLE METHOD TO EXTRACT THE VOICED SIGNAL

According to the CCITT [10] and MPEG [11] recommendations, the signal available for transmission is in digitized form, with a typical sampling rate in the range of 44.1 kHz to 48 kHz. To transmit it over a channel with a sampling rate of 8 kHz, down sampling may appear to be a suitable solution. However, down sampling discards a considerable number of samples per frame, degrading the signal and causing a loss of information that prevents faithful reproduction at the receiver. A possible way to overcome this difficulty is to filter the high frequency components out of the signal first and then resample it. This works well in most cases, but it runs into difficulty with stereophonic recordings: since the reproduction from both channels contributes to the synthesis of the perceived sound, filtering the two channels separately may cause a significant loss in the fidelity of the reproduced sound [1].

Laboratory analysis of the signal spectra provided conclusive evidence in this respect. In stereophonic sound the voiced signal is normally recorded in phase in both channels, whereas the music signal is not always in phase [1]. It was observed that only the high pitched music components are in phase in both channels, while the lower frequency components are out of phase [2]. A method that selectively eliminates the high frequency, highly pitched components from the audio signal may therefore retain the fidelity of the voiced signal. This analysis led to the proposed method for extracting voiced information from a wideband audio signal.

IV. MODELLING OF THE VOICE FILTERING METHOD

A typical stereophonic audio signal carried in two channels was analysed. The analysis led to the following design concept (Fig. 4) for faithful extraction of voiced information from a wideband audio signal. Analysis of the stereophonic signal spectrum shows instances in which voiced information is not equally present in both channels. We therefore adopt a methodology that transfers the high frequency components constituting this voiced signal to both channels [5]. Since the high frequency components in certain cases contribute to the reproduction of voiced information, the internal high frequency cancellation network enables this approach. The system is easily implementable on DSP hardware, which is an important point in favour of applying this methodology for extracting voiced information from a wideband audio signal. The system has been implemented on a simulated platform using MATLAB and SIMULINK [6][7][8]. It functions as follows:

1. Signal input is through a .wav file using the Signal Processing Blockset / From Wave File block.
2. Submatrix blocks isolate the information pertaining to the two channels separately. (Table 1)
3. The FDA Toolbox has been used to design the IIR Butterworth lowpass and highpass filters [6][8][9]. (Table 2)
4. A Matrix Concatenation block reconstructs the two-channel stereophonic signal fed to the output stage. (Table 3)
5. The output sink for the two-channel .wav output is the standard PC sound output path. An FFT block has been used to analyse and study the spectral information pertaining to one of the channels. (Table 4)
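The Simulink stages above can be sketched in plain code. The following is a minimal stand-in (Python rather than MATLAB; the one-pole lowpass is a deliberately simplified substitute for the Butterworth designs of Table 2, and the helper names are invented for illustration). The channel split mirrors the Submatrix blocks and the re-interleave mirrors Matrix Concatenation:

```python
import math

def one_pole_lowpass(x, fc, fs):
    """Simple one-pole IIR lowpass -- a stand-in for the Butterworth stages."""
    a = math.exp(-2 * math.pi * fc / fs)
    y, prev = [], 0.0
    for s in x:
        prev = (1 - a) * s + a * prev
        y.append(prev)
    return y

def split_channels(stereo):
    """Mimics the Submatrix blocks: isolate left/right from interleaved samples."""
    return stereo[0::2], stereo[1::2]

def merge_channels(left, right):
    """Mimics Matrix Concatenation: re-interleave the two filtered channels."""
    out = []
    for l, r in zip(left, right):
        out.extend((l, r))
    return out

fs = 48000
stereo = [0.0, 0.0] * 4                      # placeholder interleaved samples
left, right = split_channels(stereo)
filtered = merge_channels(one_pole_lowpass(left, 2000, fs),
                          one_pole_lowpass(right, 2000, fs))
```

In the actual model the per-channel filters are the minimum-order Butterworth designs of Table 2 and the data flows in frames, but the split-filter-merge topology is the same.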
Fig. 4. Designed model for the voice filtering method

Fig. 4 shows the model for the voice filtering method implemented using MATLAB. The two lowpass Butterworth filters attached to the two channels in the second stage eliminate the jitter introduced in them by the reconstructed high frequency signal.

V. RESULTS AND DISCUSSION
The proposed system was simulated using MATLAB (ver. 7.5.0, R2007b) and SIMULINK (ver. 7.0, R2007b) [6]. The results were obtained by tracking the signal spectra at different stages of the functional model and are compared below.
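Before turning to the scope outputs, a short numerical aside (illustrative values only, not the paper's test signal) shows why band limiting before down sampling matters: naively decimating a 48 kHz signal by 6 folds any component above the new 4 kHz Nyquist limit straight into the voice band.

```python
import cmath, math

fs_in, factor = 48000, 6            # 48 kHz source, naive 6:1 decimation to 8 kHz
f_tone = 10000                      # a component above the new Nyquist limit
n = 256 * factor

x = [math.sin(2 * math.pi * f_tone * t / fs_in) for t in range(n)]
y = x[::factor]                     # decimation with no anti-aliasing filter
fs_out = fs_in // factor

def peak_frequency(frame, fs):
    """Bin with the largest DFT magnitude, reported in Hz (naive O(N^2) DFT)."""
    m = len(frame)
    mags = [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / m)
                    for t in range(m))) for k in range(m // 2)]
    return max(range(m // 2), key=mags.__getitem__) * fs / m

print(peak_frequency(y, fs_out))    # the 10 kHz tone aliases into the voice band
```

Here the 10 kHz tone reappears as a 2 kHz alias, indistinguishable from genuine voice-band content; filtering before decimation, as the proposed model does, prevents this.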
Fig. 5.1. Spectral response of the input signal. (Scope 0)

Fig. 5.2. Spectral response of the signal through the first-stage lowpass filter. (Scope 3)

Fig. 5.3. Spectral response of the signal with unmatched high frequency voiced components. (Scope 2)
Fig. 5.4. Spectral response of the band limited voiced signal extracted from the wideband stereophonic audio signal. (Scope 1)

The scopes used here analyse the signal using a 256-point FFT algorithm, making it possible to measure the spectral response of a frame of the audio signal. All the scopes show data pertaining to Channel 1 of the two-channel stereophonic audio signal being sourced. The input signal spectrum (Fig. 5.1) spreads to around 15 kHz for our system. The signal filtered by the first stage (the lowpass Butterworth filter of Fig. 4) contains an unwanted jitter around 4 kHz (Fig. 5.2). This spike may be attributed to some voiced information residing in the high frequency zone. When we analyse the reconstructed signal together with the high frequency components (Fig. 5.3), we can attribute this jitter to a significant amount of voiced information content. The resultant signal spectrum (Fig. 5.4) constitutes a more faithful reproduction of the voiced signal. Moreover, the method also eliminates unwanted low frequency noise (e.g. breathing sounds) from the voiced signal. The quality of an audio signal is, however, best judged by human hearing rather than by spectral responses alone. Since such evidence cannot be supplied in print, it is hoped that the comparative analysis using the supplied spectral samples serves as concrete evidence, and that further reproduction of the proposed filtering technique will yield similarly fruitful outcomes.

VI. CONCLUSION

Voice filtering over a wideband stereophonic audio signal has been implemented as per the proposed design (Fig. 4). The results, analysed both from the spectral responses and from the quality of sound as judged by human listeners, were satisfactory. This method can hence be applied to stereophonic (two-channel) audio signals to obtain the voice band signal with minimum loss in fidelity.

The method has implications beyond matching a wideband signal onto a narrowband channel without loss of the voiced information: it can also be used to filter voiced information out of a stereophonic source for further processing and similar uses.

TABLES

Table 1. Submatrix blocks
Submatrix: Row span: all rows; Column span: one column; Column: first
Submatrix1: Row span: all rows; Column span: one column; Column: last

Table 2. Filter specifications
Lowpass Butterworth Filter (0,1,2,3)
  Structure: Direct Form II, second-order sections
  Response type: Lowpass; Design method: IIR Butterworth
  Filter order: minimum (14); Match exactly: stopband
  Frequency specifications: Units: Hz; Fs: 48000; Fpass: 2000; Fstop: 4000
  Magnitude specifications: Units: dB; Apass: 1; Astop: 80
Highpass Butterworth Filter (0,1)
  Structure: Direct Form II, second-order sections
  Response type: Highpass; Design method: IIR Butterworth
  Filter order: minimum (34); Match exactly: stopband
  Frequency specifications: Units: Hz; Fs: 48000; Fstop: 3000; Fpass: 4000
  Magnitude specifications: Units: dB; Apass: 1; Astop: 80

Table 3. Matrix Concatenation
Number of inputs: 2; Mode: Multidimensional array; Concatenate dimension: 2

Table 4. Spectrum Scope (0,1,2,3)
Window type: Hann; Window sampling: Periodic; Number of spectral averages: 2
Frequency units: Hz; Frequency range: 0 – Fs/2; Amplitude scaling: dB
Axes: Autoscale; Channel: Channel 1 -> visible
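Table 2 specifies the filters as Direct Form II structures built from second-order sections. As a hedged illustration of one such building block (a single second-order Butterworth lowpass section designed via the bilinear transform, in Python rather than MATLAB's FDA Toolbox; the paper's minimum-order designs of orders 14 and 34 would cascade many such sections), consider:

```python
import math

def butter2_lowpass(fc, fs):
    """Coefficients of one 2nd-order Butterworth lowpass section
    (bilinear transform with frequency prewarping)."""
    w = math.tan(math.pi * fc / fs)          # prewarped analog cutoff
    k = 1 + math.sqrt(2) * w + w * w
    b0 = w * w / k
    b = (b0, 2 * b0, b0)
    a = (1.0,
         2 * (w * w - 1) / k,
         (1 - math.sqrt(2) * w + w * w) / k)
    return b, a

def df2_filter(b, a, x):
    """Direct Form II realisation of one second-order section."""
    w1 = w2 = 0.0
    y = []
    for s in x:
        w0 = s - a[1] * w1 - a[2] * w2
        y.append(b[0] * w0 + b[1] * w1 + b[2] * w2)
        w1, w2 = w0, w1
    return y

b, a = butter2_lowpass(3000, 48000)
dc_gain = sum(b) / sum(a)                    # passband gain at DC, should be 1
print(dc_gain)
```

The DC gain of the section is exactly 1 (0 dB passband at DC) and the gain at the Nyquist frequency is exactly 0, as expected for a lowpass Butterworth section; meeting the Apass/Astop targets of Table 2 is what drives the minimum order up to 14.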
REFERENCES

[1] http://en.wikipedia.org/wiki/Stereophonic_sound
[2] Track "Jenny Ondioline" by the band Stereolab on their album Transient Random-Noise Bursts with Announcements.
[3] Baken, R. J. (1987). Clinical Measurement of Speech and Voice. London: Taylor and Francis Ltd.
[4] http://en.wikipedia.org/wiki/Voice_frequency
[5] George W. P. York, Christopher M. Rondeau, Dane F. Fuller, "Teaching Real-time DSP Applications (Voice Removal) with the C6711 DSK and MATLAB," Proceedings of the 2004 AEEE Conference.
[6] MATLAB, http://www.mathworks.com
[7] Kermit Sigmon, MATLAB Primer, 3rd ed., Dept. of Mathematics, University of Florida.
[8] Sanjit K. Mitra, Digital Signal Processing Laboratory Using MATLAB, McGraw-Hill.
[9] B. A. Shenoi, Introduction to Digital Signal Processing and Filter Design, Wiley-Interscience.
[10] Thiagarajan Viswanathan, Telecommunication Switching Systems and Networks, Prentice Hall of India.
[11] http://en.wikipedia.org/wiki/MPEG
Debdoot Sheet is an undergraduate student at the Institute of Engineering & Management, Kolkata, INDIA. He was born in 1986 at Kharagpur, West Bengal, INDIA. He is currently enrolled in the final semester of the eight-semester Electronics & Communication Engineering course under The West Bengal University of Technology, leading to the Bachelor of Technology degree. He achieved a WBJEE rank in the top 3% range. The author has been working in the fields of signal processing, renewable energy resources and RFID, and has published his work at international conferences.