Super Saiyan Audio (S.S. Audio)

A conversation between man and machine

Four Unknown Men: "Hello Siri"
Siri: "State your name please"
Four Unknown Men: "We are ... " (Dramatic Pause)
Four Unknown Men: "SUPER SAIYAN AUDIO!"
Siri: "You said 'Hyper Alien Monster', is that correct?"
(Four Unknown Men groan in unison ...)
(The first of the Four Unknown Men, Rinku Patel, emerges)
Rinku (to the other Unknown Men): "You other guys' voices are messing with Siri's ability to distinguish the proper pitch. It needs a strong, deep baritone like mine!"
(The other Unknown Men suppress giggles)
Rinku: "O.K. Siri, my name is Rinku Patel"
(One of the Unknown Men can be heard saying "You realize that you've revealed your identity, Mr. 'Unknown Man'")
Siri: "You said your name is 'Ice rink petal'"
(Rinku recedes back into the shadows, head hung in shame. The second of the Four Unknown Men, Ravesh Sukhnandan, emerges)
Ravesh: "Let me give it a shot. Siri, my name is Ravesh Sukhnandan"
Siri: "You said 'rubbish stick man'"
(The other Unknown Men have to prevent Ravesh from smashing the iPhone. The third of the Unknown Men, Bilal Javaid, gives the speech recognition a test)
Bilal (to the other Unknown Men): "You must enunciate. Watch this."
(Bilal clears his throat)
Bilal: "Siri, I am Bilal Javaid"

Siri: "You said 'Billy Javelin'"
(The final member, Xueyu Huang, picks up the phone, confident that he has mastered the dark art of speaking clearly)
Xueyu: "I am Xueyu Huang, Lord of the Saiyans"
Siri: "Sorry, I don't understand"
(The Four Unknown Men leave in shame, their identities revealed and their speaking skills questioned)

The above (fictional) conversation illustrates the tremendous difficulties associated with speech recognition. Even well-developed systems like Apple's "Siri", which employ state-of-the-art algorithms and technology, can still be thrown off by unknown words, accents and background noise. For the EECS 451 course project, our group, "Super Saiyan Audio" (or S.S. Audio for brevity), investigated speech recognition and its digital signal processing underpinnings, with the intent of creating a MATLAB-based isolated-word recognition system. While we certainly could not hope to replicate the robustness or utility of "Siri" or related speech-recognition systems, we hoped to gain an idea of how such systems work.

Progression of Goals

Our goals for this project can be divided into distinct phases, from broad goals when we had little knowledge of the speech recognition process to much more finely detailed goals when we had considerably more knowledge and hence understood what was feasible in the given time frame.

Phase 1 – Big Dreams, Bigger Challenges

Before the daunting nature of implementing a robust speech recognition system dawned on us, we initially wanted to implement a system that could isolate and match an English word spoken by an individual of varying age and gender against a dictionary of 20 pre-defined words. From the gathered data, we also wanted to introduce person-identification capabilities, i.e. if a person says a word, could we identify who the person is from a database of speakers? Finally, we wanted to implement some noise cancellation in the post-processing analysis, so that the speaker could be speaking with noticeable ambient noise present and we would still be able to recognize the spoken word accurately.

Methods investigated

The first and most obvious method was to use a form of template matching, where we compare the raw audio signal being investigated (a plot of sound amplitude against time), called an acoustic signal in the speech-recognition literature (Keller, p. 5), to a database of pre-recorded acoustic signals. Then, using some metric or comparison algorithm, we search for the best match. While template matching is a viable option in speech recognition, the raw acoustic signal turns out to be a poor representation of the speech data. This is because human speech "need not be perfect" to convey information (Keller, p. 6). The person speaking (sending the message) can say the same word slightly differently each time, but the person receiving the message would most likely

receive the same information that was intended. Thus, while an individual may say the same word twice, and we may decipher that the speaker said the same word twice, the raw acoustic signal may look completely different between the first and second utterances of the word (Keller, p. 6). This is illustrated in the figure below, where Ravesh says the word "Michigan" two times:

Figure 1: Ravesh saying the word "Michigan" twice. Audio was recorded at 44.1 kHz using the SONY ICD-PX333 Sound Recorder.

Although the waveforms show similarities, they are not exact replicas of each other. Sensitive metrics like the Euclidean distance measure would be thrown off by these small fluctuations. Furthermore, we expect that the differences between the acoustic signals of the same word spoken by different speakers would be even greater than those shown in the figure above, because of variations in intonation, pitch, amplitude, etc. This is illustrated in the figure below:

Figure 2: Xueyu and Ravesh saying the word "Michigan". Notice that the waveforms show a great deal of difference, despite being temporal signals of the same word.

Because the template matching method was unlikely to work, more effort had to be spent on the word-recognition portion of the system rather than the noise-isolation and person-identification subsystems. This led us into Phase 2 of our goals for this project.

Phase 2 – Smaller Dreams, Still Big Challenges

As more effort was required to implement the word-recognition subsystem, the scale of the project was drastically reduced. Instead of a rigid dictionary of 20 words, the dictionary was reduced to six words: grey, red, orange, cool, tool and Michigan, with the option to substitute words that would be easier to recognize. We chose simple words that, with the exception of Michigan, were mono- or bi-syllabic, which we believed would be easier to recognize and distinguish. Furthermore, the person-identification and noise-cancellation portions of the project were eliminated, leaving the project solely focused on isolated-word speech recognition.

The final reduction in the scope of the project came in the robustness of our system to different speakers. Initially we wanted our system to be able to recognize words spoken by a large variety of individuals. However, as Fig. 2 above indicates, the waveforms of the same word spoken by different individuals look significantly different. Thus, a very robust system would have to be implemented to make recognition speaker-independent. One method widely used in practical speech recognition systems to achieve this speaker independence is a Hidden Markov Model (HMM) based approach. The HMM-based approach is well suited to speech because, just as a sentence is composed of a series of words, words can be subdivided into syllables, and syllables can be subdivided into phonemes. In an HMM, the probability of the current state depends on the state before it (Furui, p. 280). When you say a word, for instance "hello", you don't sound out each letter – the sounds all flow into each other. Thus, with an HMM-based system, a probability of a sound being a certain phoneme or group of letters is assigned, with that probability being affected by the sounds uttered before it. For instance, in English, the letter "Q" is almost always followed by the letter "U", so the probability that "U" comes after "Q" is high. Unfortunately, the HMM, while very robust and in wide use, requires a large amount of training data to estimate these state probabilities, and so we decided to make our system speaker-dependent, meaning that we hope only to recognize words spoken by an individual against a database of words also spoken by that individual.

Thus, our final goals for this project are to implement an isolated-word recognition system in MATLAB that is subject to the following constraints:
1. The dictionary/database is nominally composed of six words: orange, grey, red, cool, tool and Michigan.
2. The person must be speaking in conditions that are as similar as possible to the conditions under which the template was recorded, i.e. same distance from microphone, little background noise, same intonation, etc.
3. The system is only expected to work for individuals attempting to match a word against a databank of words spoken by that same individual. For the purposes of this project, we expect it to work only for individuals in our group.

OUR SYSTEM

Introduction

As we explained earlier, a raw amplitude-time representation of our acoustic signal is not well suited for speech applications. In its place, we take advantage of the spectral properties of the speech signal, and using this representation, we then attempt to condense the features characteristic of a certain signal into aptly named feature vectors. These feature vectors can then be compared to the templates, and using some sort of metric or algorithm, we obtain a numerical measure of the match between the acoustic signal under investigation and our databank of speech signals. From our research, we found that the Mel Frequency Cepstrum Coefficient (MFCC) feature vector representation offers a good balance between ease of implementation and robustness, while Dynamic Time Warping (DTW) provides a suitable algorithm to investigate the similarities between signals. We go into more detail in the following sections.

Mel Frequency Cepstrum Coefficients (MFCC)

Background on human speech and spectral representation

We stated earlier that spectral properties are often utilized in practical speech processing and recognition systems. For our purposes, the spectral properties are defined as the magnitude of the frequency representation, |X(ω)|, of the temporal acoustic signal x[n] resulting from Fourier analysis (Oppenheim & Schafer, p. 49; Furui, p. 14), i.e., in discrete time:

|X(\omega)| = \left| \sum_{n=-\infty}^{\infty} x[n] e^{-j\omega n} \right|   (Eq. 1)

An obvious question is: why? To understand how a frequency representation is helpful, specifically the magnitude of the frequency spectrum, we need some background on human speech. The major roles of frequency in the transmission and perception of human speech are:
1. The speech wave is "reproducible by summing the sinusoidal waves, the amplitude and phase of which vary slowly" (Furui, p. 52).
2. The "critical features" in human perception of speech are mainly concerned with the spectral information, i.e. the amplitude, with "the phase information not usually playing a key role" (Furui, p. 52).

The implication is that just by looking at the magnitude of the frequency representation and disregarding the phase information, we can develop a model of speech that is close to what a human being perceives, or recognizes. This forms the foundation of our speech-recognition project.

Short-time Fourier Transform

Simply taking the Fast Fourier Transform (FFT) of a speech signal is not enough, however. As evidence, the figure below is a 32768-point FFT of Ravesh saying the word "Michigan", which was sampled at a rate of 32 kHz for 1 second:

Figure: Frequency domain representation of Ravesh saying the word “Michigan”
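As a rough illustration of how a plot like this can be produced in MATLAB (the file name is hypothetical; the FFT length follows the figure's description):

[x, fs] = audioread('michigan_ravesh.wav');   % hypothetical file name for the recording
x = x(:, 1);                                  % keep a single channel
Nfft = 32768;                                 % FFT length used for the figure
X = fft(x, Nfft);                             % frequency-domain representation of the word
f = (0:Nfft/2) * fs / Nfft;                   % one-sided frequency axis in Hz
plot(f, abs(X(1:Nfft/2 + 1)));                % magnitude spectrum |X(f)|
xlabel('Frequency (Hz)'); ylabel('|X(f)|');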

While we see some distinctive features, like large peaks at low frequencies, this representation of the acoustic signal does not seem to be much better than the temporal representation. Part of the problem with purely time-based and purely frequency-based representations is that human speech is a combination of the two. Thus, a combination of the two domains would provide a more complete representation of the acoustic signal. This is accomplished through the Short-time Fourier Transform (STFT). In discrete time, it is defined by the equation below, where x[n] is the temporal representation of the acoustic signal, w[m] is the windowing function applied, λ is the continuous frequency variable and n is the discrete time variable (Oppenheim & Schafer, p. 714):

X[n, \lambda) = \sum_{m=-\infty}^{\infty} x[n+m] \, w[m] \, e^{-j\lambda m}   (Eq. 2)

The windowing function essentially separates the acoustic signal into different frames of time, to each of which we apply the DTFT. In other words, this is a DTFT over a window, i.e. a short segment of time. This gives rise to the idea of the spectrogram, which captures the time and frequency dependence in a 2-D, color-coded plot. Below are various spectrograms of Xueyu and Ravesh saying the words "grey", "Michigan" and "cool", using a window length of 512 points.

Figure: Various spectrograms of Xueyu and Ravesh saying the words “grey”, “Michigan” and “cool”
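A minimal sketch of how a spectrogram like these can be generated in MATLAB, reusing x and fs from the earlier sketch (the window type and overlap are assumptions; the 512-point window length follows the text):

window   = hamming(512);   % 512-point window, as in the figures (window type assumed)
noverlap = 256;            % 50% overlap between frames (assumed)
nfft     = 512;            % DFT length per frame
spectrogram(x, window, noverlap, nfft, fs, 'yaxis');   % plot the STFT magnitude over time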

Looking at spectrograms similar to these, researchers found that the speech wave and spectrum are "short-time stationary", meaning that while the frequency characteristics change with time, over a span of approximately 20-40 milliseconds the wave and spectrum can be considered to have "constant characteristics" (Furui, p. 14). Thus, our use of the STFT is justified.

The effects of windowing on a signal are primarily reduced resolution and amplitude attenuation (Oppenheim & Schafer, p. 701). Furthermore, the choice of window size comes with consequences. A shorter window gives a greater ability to resolve changes of the acoustic signal in time, but the frequency resolution decreases, which reduces the ability to resolve and identify harmonics, which are critical in vowel formation. Choosing a longer window results in the reverse (Oppenheim & Schafer, p. 717). This trade-off (increased resolution in time ↔ decreased resolution in frequency, and vice versa) can be seen in the equation below for one frame v[n], where v[n] = x[n] × w[n], V[k] is the DFT of v[n], N is the length of the window w[n], and ω is the discrete-time frequency (Oppenheim & Schafer, p. 696):

V[k] = V(\omega)\big|_{\omega = 2\pi k / N} = \sum_{n=0}^{N-1} v[n] \, e^{-j(2\pi/N)kn}   (Eq. 3)

Here, the key relation is \omega = 2\pi k / N (Eq. 4). If we increase N, then we increase the number of samples that we take of V(ω), the DTFT of v[n]. However, if N approaches the length of the signal x[n], then we are no longer really windowing the signal, and hence we lose the temporal information. On the other hand, if we choose N to be very small, then we increase the number of frames we have, and hence the time resolution, but we sample V(ω) at considerably fewer points, which results in decreased resolution in frequency. In our project, we used windows of length 25 milliseconds (ms), which provide a good tradeoff between frequency and time resolution.

Power Spectral Density (PSD)

Now that we have established the validity of using spectral properties and the STFT in our system, we delve into the power spectral density representation of the acoustic signal, and how it relates to the usual model of speech. The model of speech production is usually that of an "excitation sequence convolved with the impulse response of the vocal system model" (Deller, Proakis & Hansen, p. 352). In mathematical terms, if we denote the voiced acoustic signal that we perceive by x(t), the excitation by g(t) and the vocal tract impulse response by h(t), then x(t) can be written as (Furui, p. 64):

x(t) = \int_0^t g(\tau)\, h(t - \tau)\, d\tau   (Eq. 5)

Because there is well-developed theory on LTI systems, the system model of human speech production is assumed to be linear, a simplification that works well in practice. In the frequency domain, the above convolution becomes:

X(\omega) = G(\omega)\, H(\omega)   (Eq. 6)

In a practical system, such as one implemented in MATLAB, X(ω) is computed using the DFT. Usually, we do not have access to g(t) or h(t), but only the recorded acoustic signal x(t). The goal then becomes trying to find g(t) and h(t). Consider taking the logarithm of the magnitude of the above equation. From the properties of logarithms, we get (Furui, p. 64):

\log|X(\omega)| = \log|G(\omega)| + \log|H(\omega)|   (Eq. 7)

where log|X(ω)| is the power spectral density in logarithmic form. To see an illustration, the figure below shows the power spectral density (PSD) of Ravesh saying the word "Michigan" over a single frame (800 samples). From the figure, we can see that the PSD appears to be composed of two things: the spectral envelope, which bounds the upper portion of the PSD plot and varies slowly with frequency, and the spectral fine structure (Furui, p. 52), which corresponds to the quickly varying peaks and valleys in the figure (Deller, Proakis & Hansen, p. 360). The envelope corresponds to the impulse response of the vocal tract, while the fine structure corresponds to the "excitations", which are normally periodic and very fast, such as in vowels. Thus, from Eq. 7 above, because we plotted the logarithmic PSD, this combination is a linear sum, where the envelope corresponds to log|H(ω)| and the fine structure corresponds to log|G(ω)|.

We can now see the importance of the logarithmic PSD: with the convolution or the multiplication, it is much more difficult to identify either the excitation or the impulse response, because as mentioned above, we normally only have access to the acoustic signal, x(t), and by extension, X(ω). However, the problem becomes much simpler when we take the logarithm of Eq. 6: we get a linear sum. This means we can now apply our linear systems theory to the problem.

Figure: Power spectral density of Ravesh saying the word "Michigan". The approximate "envelope" is highlighted in black, with the spectral fine structure in blue.

You may be wondering: why are we so interested in the spectral envelope and fine structure? As explained earlier, the envelope corresponds to the slowly varying, i.e. low-frequency (actually low-quefrency, as will be explained later), portion of the PSD. Furthermore, because it is not varying quickly, it captures a great deal of the overall spectral characteristics of the acoustic signal. As Furui says (p. 52), the envelope "reflects not only the resonance and antiresonance characteristics of the articulatory organs, but also the overall shape of the glottal source spectrum and radiation characteristics at the lips and nostrils". In a nutshell, because the envelope has a direct correlation with the parts of speech that are difficult to capture

by the raw acoustic signal alone, it provides a good encapsulation of the phoneme or sound that is being said. Since we take the PSD of each frame of the speech signal, if you utter the same word twice, the sounds in corresponding frames should be similar from the standpoint of their envelopes. This gives us a measure of the sound with which we can compare or recognize the acoustic signal. In the next section, we investigate how to obtain this envelope.

Envelope detection using Cepstral Analysis

As mentioned above, the PSD is composed of a slowly varying envelope (low quefrency) and quickly changing spectral fine features (high quefrency), of which the envelope is the more important for speech recognition purposes. Note that the term quefrency is used to avoid confusion: the PSD is already in a frequency domain, ω, so quefrency refers to "frequency" components of a function that itself lives in the frequency domain. From Eq. 7, the envelope in the PSD domain is log|H(ω)|, the logarithmic magnitude of the Fourier transform of the vocal tract impulse response, h(t). We said earlier that we are interested in the low-frequency component of log|X(ω)|. Normally, when we are interested in the frequency content of some general signal, g(λ), we use the Fourier Transform to analyze it in a generalized frequency domain, represented by ψ. The envelope, log|H(ω)|, however, is already a function of frequency, ω. Thus, because we have taken the logarithmic magnitude, which distorts the scaling, the inverse Fourier Transform of log|X(ω)|, denoted F⁻¹, results in a representation of our signal in a pseudo-frequency domain, called quefrency and denoted by τ (Deller, Proakis & Hansen, p. 362). The resulting waveform is called the cepstrum, c_s(τ). This is summarized in Eq. 8 below:

c_s(\tau) = F^{-1}\{\log|X(\omega)|\} = F^{-1}\{\log|G(\omega)|\} + F^{-1}\{\log|H(\omega)|\}   (Eq. 8)

As alluded to earlier, quefrency is a reference to the fact that the inverse Fourier Transform of log|X(ω)| is not quite a representation in the frequency domain, ω, but in a pseudo-frequency domain. The same goes for cepstrum, which is like a spectrum, but in the quefrency domain. Now that we have a cepstral representation of the acoustic signal, we can apply a low-pass filter (an operation called liftering in the quefrency domain) to extract the low-quefrency components, denoted h(τ), which we recognize from our earlier discussion as being the key components of the spectral envelope (Furui, p. 65). Note that h(τ) is not simply F⁻¹{log|H(ω)|} but the result of lowpass-filtering c_s(τ). Finally, we take the Fourier Transform, F, of h(τ) to get back an estimate of log|H(ω)|, i.e.

\log|H(\omega)| = F\{h(\tau)\}   (Eq. 9)

The entire system used to extract the spectral envelope is shown below (a reproduction of Figure 6.4 on p. 363 of Deller, Proakis and Hansen's "Discrete-Time Processing of Speech Signals"):

x(n)
  → DTFT
  → log|·|           (new "signal": log|X_s(ω)|)
  → IDTFT            (convert to the "quefrency domain": the "cepstrum" c_s(n))
  → low-time lifter  (the "liftering" operation: multiply the cepstrum by a lowpass filter, giving an estimate of h(τ))
  → DTFT
  → log|H(ω)|        (estimate of the spectral envelope)
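A minimal MATLAB sketch of this liftering pipeline for a single 800-sample frame (the window type and the lifter cutoff are assumptions; x is a recorded word as in the earlier sketch):

v      = x(1:800) .* hamming(800);   % one windowed 800-sample frame, as in the PSD figure (window type assumed)
V      = fft(v);                     % DTFT samples of the frame
logmag = log(abs(V));                % log|X(w)|
c      = real(ifft(logmag));         % cepstrum c_s(tau); real-valued for a real frame
L      = 50;                         % low-time lifter cutoff in quefrency samples (assumed)
lifter = zeros(size(c));
lifter([1:L, end-L+2:end]) = 1;      % keep the low-quefrency components (the cepstrum is symmetric)
h_tau  = c .* lifter;                % liftered cepstrum: estimate of h(tau)
envelope = real(fft(h_tau));         % log|H(w)|: the spectral envelope estimate
plot(envelope(1:400));               % one-sided view of the estimated envelope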

Mel Frequency Cepstrum Coefficients (MFCC)

Finally we arrive at the star of our show, the MFCC. The MFCC is a form of cepstral analysis that is relatively easy to implement while still capturing the essential features of the PSD envelope. To see what makes the MFCC so attractive and useful, note that cepstral analysis must be performed for each frame. The resulting spectral envelopes, log|H(ω)|, for each frame may be hundreds of points long, depending on the length of the DFTs taken. Furthermore, the number of points of a given envelope in our test data is unlikely to be exactly equal to the number of points of a given envelope in the template data, complicating the process of comparing data. The MFCC resolves many of these problems by distilling the essential features of the spectral envelope into a fixed number of coefficients, which can be orders of magnitude smaller than the number of points describing the spectral envelope. The MFCC is able to do this because of two key features:

1. In the stage where we take the logarithm of |X(ω)|, a special logarithmic scale is used, called the Mel scale. It is described by Eq. 10 below, where F_mel is the frequency in the Mel scale and F_Hz is the frequency in Hertz (Deller, Proakis & Hansen, p. 380):

F_{mel} = \frac{1000}{\log 2}\,\log\!\left[1 + \frac{F_{Hz}}{1000}\right] \approx 1127\,\log\!\left[1 + \frac{F_{Hz}}{700}\right]   (Eq. 10)

This is a useful scale because it more closely mimics the way a human being perceives frequency: for frequencies below about 700 Hz, F_mel is approximately linear in F_Hz, while above 700 Hz the spacing becomes noticeably logarithmic. This corresponds to human pitch perception, where additional high-frequency detail contributes little to the message being conveyed; it is also why we can sample at a rate as low as 8000 Hz and still understand the acoustic signal.

2. Instead of taking the IDFT to get the cepstrum, the Discrete Cosine Transform (DCT) is used. It is similar to the DFT and IDFT, but for this type of application it proves superior because:
a. It is less sensitive to the edge effect in waveform extraction than the DFT and thus introduces less distortion in the frequency domain (Furui, p. 164).

b. The DCT takes fewer samples to compute than the DFT (Furui, p. 164).

The following section explains our implementation of the MFCC in MATLAB.

MFCC Implementation in MATLAB

Note: The following steps borrow heavily from this website and this website. To compute the MFCC of the raw acoustic signal x[n] (discrete time) in MATLAB, we go through the following steps. These are described in detail below, with reference to the code in the file MFCC_coeff.m that is used to compute the MFCC.

1. Decimation and Pre-filtering: As mentioned previously, while high frequencies are important to the overall perception of sound, it is the low frequencies which are dominant in conveying a message. Thus, to reduce the amount of data we have to deal with, we downsample, or decimate, x[n]. After decimation, we apply a pre-emphasis filter of the form H(z) = 1 − αz⁻¹, which is a high-pass filter. This is done to counterbalance the fact that much of the energy is in the lower portion of the spectrum (Muda, Begam and Elamvazuthi, p. 139). The magnitude response of the filter H(z) is shown below:

Figure: Magnitude response of the high-pass pre-emphasis filter, H(z), used to flatten the spectrum

The portion of our code that this corresponds to is shown in the screen shot below
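A sketch of what this decimation and pre-emphasis stage can look like (the variable names raw_adata and rec_freq follow the text; the original sampling rate and the use of resample are assumptions):

fs_orig  = 44100;                                % original recording rate (44.1 kHz, per the data section)
rec_freq = 16000;                                % target rate after decimation (16 kHz, per the text)
x_ds  = resample(raw_adata, rec_freq, fs_orig);  % downsample the raw acoustic signal
alpha = 0.97;                                    % standard pre-emphasis coefficient
x_pre = filter([1 -alpha], 1, x_ds);             % H(z) = 1 - alpha*z^-1, high-pass pre-emphasis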

For the filter, α is chosen as 0.97, which is a standard value for this kind of application. Our x[n], which in the code is raw_adata, is downsampled to a frequency of 16 kHz, which is sufficient to capture the low-frequency components. This frequency is denoted by rec_freq.

2. Framing the signal: As mentioned previously, simply taking the FFT of the entire signal x[n] gives us the frequency characteristics of the signal but no knowledge of the temporal characteristics. As explained, speech is a combination of the two domains. Thus, we frame the signal into short segments, where we refer to the i-th frame of x[n] as x_i[n], and look at the frequency characteristics over each frame; if we choose the frame length appropriately, speech can be considered "short-time stationary" over a frame, and we only have to be concerned with the frequency content. The total number of frames, N_f, in the signal can then be computed by Eq. 11 below, where N_{x[n]} is the number of samples contained in x[n], and N_{Overlap Begin} = N_{frame} − N_{overlap} is the number of samples between consecutive frame start points (the frame length N_{frame} minus the overlap N_{overlap}). Note that we take the lower bounding integer (floor) to avoid indexing issues:

N_f = \left\lfloor \frac{N_{x[n]}}{N_{Overlap\,Begin}} - 1 \right\rfloor   (Eq. 11)

The portion of our code that does this framing for us is shown below:
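A sketch of the framing step consistent with the variable names used in the text (the amount of overlap and the zero-padding of the last frame are assumptions):

timeframe   = 0.030;                                      % frame length in seconds (30 ms, per the text)
overlap_t   = 0.015;                                      % overlap between frames in seconds (assumed 15 ms)
framelength = round(timeframe * rec_freq);                % frame length in samples
hop         = framelength - round(overlap_t * rec_freq);  % samples between frame start points
numframes   = floor(length(x_pre) / hop - 1);             % total number of frames, cf. Eq. 11
x_padded    = [x_pre; zeros(framelength, 1)];             % zero-pad so the last frame is complete
frames      = zeros(numframes, framelength);
for i = 1:numframes
    idx = (i-1)*hop + (1:framelength);                    % sample indices of the i-th frame
    frames(i, :) = x_padded(idx);                         % extract the i-th frame x_i[n]
end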

We have chosen the length of our frame, given by the variables timeframe (time domain) and framelength (discretized time) in the code, to be 30 ms. Because we don't know when the short-time transitions occur, we have some overlap between frames to account for this. In the code, the variable overlap_t denotes the amount of overlap in time. The for-loop above extracts each frame.

3. Periodogram estimate of the Power Spectral Density (PSD)

Now that we have the signal divided into frames, we want to characterize the spectral envelope of each frame. Thus, we must first compute the Power Spectral Density (PSD). For each frame, x_i[n], where n ranges from 1 to the length of the frame, N, we can compute the Discrete Fourier Transform, X_i[k], for the i-th frame by multiplying x_i[n] by the window function w[n]. This is described by Eq. 12 below, where K is the length of the DFT:

X_i[k] = \sum_{n=1}^{N} x_i[n]\, w[n]\, e^{-j 2\pi k n / N}, \quad 1 \le k \le K   (Eq. 12)

We use a window because it "gradually attenuates the amplitude" so that we don't get abrupt changes at the endpoints of the frames (Furui, p. 57). Thus, each frame has a smoother transition to the next frame. Now that we have a frequency representation of x_i[n], we can find the periodogram-based estimate of the PSD for the i-th frame, denoted by P_i[k]. This is done by applying Eq. 14:

P_i[k] = \frac{|X_i[k]|^2}{\lambda}   (Eq. 14)

Here, λ is a scale factor that scales the periodogram by the weight of the window function, w[n]. It is given by Eq. 15 below:

\lambda = \frac{f_s \sum_{n=1}^{N} |w[n]|^2}{2}   (Eq. 15)

Here, f_s is the sampling frequency and N is the length of the frame. Note the factor of 2 in the denominator: in taking the DFT in Eq. 12, we know that for real signals the DFT is conjugate symmetric (i.e. the magnitude is even), so we keep only the first N/2 + 1 DFT coefficients. Because of this, to conserve total power, we include the factor of 2. Our implementation is shown in the screenshot below:
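A sketch of this PSD estimation step (the window type and DFT length are assumptions; the scaling follows Eq. 14 and Eq. 15):

K = 2^nextpow2(framelength);                    % DFT length (assumed to be the next power of two)
w = hamming(framelength).';                     % analysis window as a row vector (window type assumed)
lambda = rec_freq * sum(abs(w).^2) / 2;         % scale factor, Eq. 15
P = zeros(numframes, K/2 + 1);                  % one-sided PSD estimate, one row per frame
for i = 1:numframes
    Xi = fft(frames(i, :) .* w, K);             % windowed DFT of the i-th frame, Eq. 12
    P(i, :) = abs(Xi(1:K/2 + 1)).^2 / lambda;   % modified periodogram estimate, Eq. 14
end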

Here, we utilize the FFT to calculate the DFT. As we discovered, MATLAB already has an implementation of the periodogram estimate of the PSD (which it calls the modified periodogram estimate because of the inclusion of the scale factor in Eq. 15 above), built into the command spectrogram. MATLAB does all the heavy lifting for us, as it frames the signal (here x[n] is raw_adata), computes the FFT, etc. To ensure our algorithm was correct, we compared our PSD estimate with MATLAB's (contained in the variable Pa in the screenshot above), and we found that it was identical, with the only exception being the number of frames: MATLAB chooses to discard the last frame if the length of the signal is not exactly divisible by the frame length, while we chose to zero-pad it.

4. Mel-spaced Filterbanks

After we have the periodogram estimate of the PSD, we create the filterbank based on the Mel-frequency scale, described by Eq. 10 above. We start by selecting the lower frequency, f_lm, and the higher frequency, f_hm, both of which are in Hz. Applying Eq. 10, we get f_lm and f_hm in the Mel-frequency scale. We want M filters whose center frequencies, f_Cm, are linearly spaced in the Mel-frequency domain, because when we transform these filters back into the regular frequency domain, the spacing of the filters will be such that there are more filters at lower frequencies and fewer filters at higher frequencies, which, as we discussed earlier, closely mimics the way human beings hear and perceive speech. In MATLAB, the following three lines of code give us the correct positions of the centers of the filters in the frequency domain, as well as placing them into the correct "frequency bin" defined by the DFT:
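A sketch of what those three lines can look like, using the 1127·log(1 + f/700) form of Eq. 10 (the variable names follow the text, e.g. lowmelf = 300, highmelf = rec_freq/2, numfiltbanks = 26, fft_length = K; the rounding to FFT bins is an assumption):

melpts = linspace(1127*log(1 + lowmelf/700), 1127*log(1 + highmelf/700), numfiltbanks + 2);  % equally spaced in Mel
hzpts  = 700 * (exp(melpts/1127) - 1);                % convert the Mel points back to Hz
binpts = floor((fft_length + 1) * hzpts / rec_freq);  % nearest DFT "frequency bin" for each point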

We have to place the frequencies that we calculate into an appropriate "FFT frequency bin" because the DFT is basically a sampling of the DTFT at K points, where K is the length of the DFT, and so, because frequency is discretized, we do not have the resolution to place a filter at an exact frequency. In the above code, lowmelf corresponds to f_lm, highmelf corresponds to f_hm, fft_length is the length of the DFT (i.e. K), numfiltbanks is the number of filters that you want to implement (i.e. M) and rec_freq is the sampling frequency, f_s. Once the center frequencies for the filters have been established, we go about the process of creating the filters, M_fj[k]. The filters are triangular in shape, with the j-th filter starting at f_c(j−1), i.e. the center frequency of the filter before it, and ending at f_c(j+1), i.e. the center frequency of the filter after it. The peak occurs at the filter's own center frequency, f_cj. The following code gives these filterbanks:
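A sketch of the triangular filterbank construction, using the bin positions binpts from the previous step (the exact edge handling is an assumption):

M = numfiltbanks;
filterbank = zeros(M, fft_length/2 + 1);         % one row per triangular filter Mf_j[k]
for j = 1:M
    for k = binpts(j):binpts(j+1)                % rising edge: from f_c(j-1) up to f_c(j)
        filterbank(j, k+1) = (k - binpts(j)) / (binpts(j+1) - binpts(j));
    end
    for k = binpts(j+1):binpts(j+2)              % falling edge: from f_c(j) down to f_c(j+1)
        filterbank(j, k+1) = (binpts(j+2) - k) / (binpts(j+2) - binpts(j+1));
    end
end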

Normally, the low Mel frequency is chosen to be around 300 Hz, the high Mel frequency is chosen to be half the sampling rate (i.e. the Nyquist frequency), and the number of filters is chosen to be 26. Below is a plot of the filterbank with these specifications:

Figure: Plot of the filterbank of 26 filters. Note that the lower the frequency, the closer together the filters are, while the higher the frequency, the greater the spacing becomes.

Finally, for each frame, we sum the product of the PSD, P_i[k], with each filter to get the coefficient U_ij. This is summarized in the equation below for the i-th frame:

U_{ij} = \sum_{k=1}^{K/2 + 1} P_i[k]\, M_{f_j}[k]   (Eq. 16)

Here, M_fj[k] is the j-th filter. We repeat this process for all filters, from j = 1 to M. Thus, for each frame, we get M coefficients, which in our case is 26. This is given by the code below:
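A sketch of this step, computing all frames at once and folding in the DCT of the next step (the number of kept coefficients follows the text; the matrix layout is an assumption):

U = P * filterbank.';    % Eq. 16: U(i,j) = sum_k P_i[k] * Mf_j[k], for every frame at once
C = dct(log(U).').';     % Eq. 17: DCT of the log filterbank energies, taken along each frame
mfcc = C(:, 2:12).';     % drop the first coefficient, keep the next 11; one row per coefficient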

In this step, we also include step 5, which is taking the DCT of the logarithm of U.

5. For each frame, we take the logarithm of the j-th coefficient, U_ij, and then take the DCT across all U_ij in the i-th frame, i.e.

C_{\tau_i}[j] = DCT(\log(U_i[j]))   (Eq. 17)

6. Finally, we discard the first coefficient and keep the next 10-13 coefficients, because keeping the remaining coefficients has been found to degrade the quality of the speech recognition process. The C_{\tau_i}[j] are then the Mel Frequency Cepstral Coefficients for the i-th frame.

Dynamic Time Warping

Now that we have distilled the important elements of the spectral envelope for each frame into 10-13 coefficients using cepstral analysis, and specifically the MFCC, the problem becomes how to match the acoustic signal we are investigating to our database of acoustic signals, all of which are represented by feature vectors of MFCC coefficients. One approach is to use a distance metric, like the Euclidean distance. Consider, however, the two waveforms below, where v[n] is the black waveform and b[n] is the green waveform. Both have the same length of 100 samples.

Figure: Plot of two waveforms v[n] and b[n]

Qualitatively, they seem quite similar. In fact, they are the same waveform, except that b[n] is shifted relative to v[n]. If we could disregard the time shift, the Euclidean metric would give us 0, because there would be no "distance" between corresponding pairs of points. If we compute the Euclidean metric directly on these waveforms, however, and sum the distances, we get a value of 22.74, clearly not 0. Furthermore, what if v[n] and b[m] had different lengths? This is a frequent occurrence in our MFCC feature vectors: although each frame's vector has a fixed length (10-13 coefficients), the number of frames is not necessarily the same, because the recordings are not necessarily the same duration. We could not use the Euclidean metric directly in such a case, because we cannot form a coordinate pair between v[n] and b[m] for all (n, m).

Clearly, the Euclidean metric alone is not robust enough to account for these problems, although it is conceptually very simple. Thus, we combine the Euclidean metric with Dynamic Time Warping (DTW) so that we have a more accurate way of comparing our feature vectors.

Dynamic Time Warping – a brief introduction

To illustrate what DTW is, consider our sequences v[n] (length N) and b[m] (length M), where N is not necessarily equal to M. In the grid figure below, the sequence v[n] is displayed vertically on the left-hand side, while b[m] is displayed horizontally along the bottom. Within the grid (coloured portion), each entry d(m_o, n_o) corresponds to the metric defining the distance between b[m_o] and v[n_o]; this distance can be the Euclidean metric. As one can probably suspect, we can traverse many different paths from the starting point at d(1,1) to the ending point at d(M,N). However, we are concerned with the path that minimizes the cumulative distance, D, traversed, where D is described by Eq. 18 below and k is the current position along the path (Deller, Proakis & Hansen, p. 636):

D = \sum_{k=1}^{K} d(m_k, n_k)   (Eq. 18)

Figure: Sample grid of distances d(m, n) formed from b[m] (horizontal axis, b[1] ... b[M]) and v[n] (vertical axis, v[1] ... v[N]), running from d(1,1) in the lower-left corner to d(M,N) in the upper-right corner.

If the cumulative distance D is small, then we did not have to take a very warped path to reach the goal. For instance, in the figure below, the middle blue path is not very warped and corresponds to good agreement between v[n] and b[m]. The yellow and red paths show much more warping, however, indicating that the match between the two sequences is not as good. This discussion indicates that the minimum value of D, called D_min, can be used as a measure of the agreement between two signals. So suppose the signal under investigation is q[p], with length P not necessarily equal to M or N. Let us also denote the minimum cumulative distance between q[p] and v[n] using the DTW method as D_V, and the minimum cumulative distance between q[p] and b[m] as D_B. Then the signal that best matches q[p] is the one with the smaller D, i.e. min(D_V, D_B). This is particularly useful if the signal is shifted in time or has some compression and expansion in the time domain, as can happen when you put more emphasis on a sound, because the DTW essentially finds a path around these features.

Our implementation

In general, finding the shortest path through the grid is a complicated problem because there are many possible combinations. Our initial attempt at implementing DTW solved the problem by recursion. However, for vectors of even moderate length, the number of paths through the grid grows exponentially, so this was a highly inefficient and impractical way of implementing DTW. The method of Dynamic Programming (DP) provides a convenient and efficient way to find the minimum distance, D_min, through the grid by determining the minimum cost to reach each square in the grid. Furthermore, there are some constraints that can be placed on the path that are relevant to signal processing and can improve the efficiency even more (Furui, p. 268). These are:
1. Monotonicity: the path cannot go back in time for either of the sequences.
2. Boundary condition: the path starts at (m = 1, n = 1) and finishes at (m = M, n = N).
3. Adjustment window: this is imposed to prevent extreme expansion and contraction. In other words, for a good match, the best path through the grid is most likely to be close to the diagonal.

For the purposes of our project, we found a very simple implementation of DTW on MATLAB's File Exchange, authored by Quan Wang, which is itself based on this Wikipedia code. The code is shown below in its entirety:
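A sketch in the spirit of that function, consistent with the description that follows (the actual File Exchange code may differ in its details):

function d = dtw_distance(s, t)
% DTW_DISTANCE  Minimum cumulative distance between two 1-D sequences.
ns = numel(s);
nt = numel(t);
D = inf(ns + 1, nt + 1);      % cumulative-cost grid, one row and column larger than the signals
D(1, 1) = 0;                  % the only finite entry point, so every path starts at (1,1)
for i = 1:ns
    for j = 1:nt
        cost = abs(s(i) - t(j));                 % local distance between the two samples
        D(i+1, j+1) = cost + min([D(i, j), ...   % diagonal step (match)
                                  D(i+1, j), ... % step along t only
                                  D(i, j+1)]);   % step along s only
    end
end
d = D(ns + 1, nt + 1);        % minimum cumulative distance D_min through the grid
end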

So, for the two signals s[n] and t[n], the grid D is first constructed to be slightly larger than the lengths of s and t. All points in the grid are initialized to infinity. The for-loops then calculate the cost to reach each specific point in the grid, based on the Euclidean distance (given by abs(s(i) − t(j))). We can accomplish this in two for-loops because we initialized all the costs to be infinite, which rules out many paths right away: the only legal starting maneuver is moving diagonally from the starting position at (1,1). After the for-loops have finished executing, the minimum cost to reach every point in the grid has been calculated, including the point of interest at D(ns+1, nt+1), which corresponds to the minimum distance, D_min, through the grid.

In our project, this function is called once we have all the MFCC coefficients for the test signal and all the template signals. Then, we compare each of the 11 rows of MFCC coefficients of the test signal to the corresponding row of a template signal by evaluating the DTW. Thus, we are left with vectors of DTW distances, which we

then sum up. The template signal with the lowest cumulative DTW score is the one that most closely matches our test signal.
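A sketch of this matching step, reusing the dtw_distance function above (the variable names mfcc_test and templates are hypothetical):

% mfcc_test : 11 x Nf matrix of MFCCs for the test word (one row per coefficient)
% templates : cell array of 11 x Nf_t MFCC matrices, one per template word
scores = zeros(1, numel(templates));
for t = 1:numel(templates)
    total = 0;
    for r = 1:size(mfcc_test, 1)   % one DTW evaluation per MFCC coefficient row
        total = total + dtw_distance(mfcc_test(r, :), templates{t}(r, :));
    end
    scores(t) = total;             % cumulative DTW distance for this template
end
[~, best_match] = min(scores);     % the template with the lowest cumulative score is declared the match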

MEASURED DATA

To test our system, we had Xueyu and Ravesh record themselves saying the words cool, tool, red, grey, orange and Michigan three times each. Data was taken with the SONY ICD-PX333 Sound Recorder and sampled at a rate of 44.1 kHz. All environmental conditions, like ambient noise, distance from the microphone, intonation, pitch, etc., were kept as controlled as possible. In total, we had 36 sound signals. In the MATLAB program, we loaded one recording of each word into the "template" section, and then selected an audio signal to be used as the "test" signal. The program then told us which word in the template section most likely matched the test signal, based on which template had the smallest cumulative DTW score. We could then determine how successful the program was at recognizing the isolated words. For instance, if we sent in a signal with "orange" and the program said we sent in "red", then the program did not successfully detect the word.

Overall, the system does not work very well at all, with a successful detection rate in the mid-teens (≈ 14%), which is roughly the level of random guessing for the six words tried (1/6 ≈ 17%). This was true whether the user attempted to match his signal against a database of templates of his own speech, or against a database composed of a different group member's speech signals.

Consider the histogram plots below of the cumulative DTW distance calculated for each word in the database with respect to the test signal. We can see that even when the program correctly predicts a word, it barely does so. For instance, when Xueyu says "tool" in the first histogram, the DTW distance is 121, but "orange" is not far behind at 130. When Xueyu says "orange" in the second histogram, the DTW distance to the "orange" template is 67, but the DTW distance to "tool" is 72. This was a consistent pattern for virtually all the cases in which the program correctly interpreted a word, and so we cannot decisively say that our program accurately detected the word, because the margin by which the correct template word "beats out" the other words in the template database is very small. In all likelihood, then, our system's correct recognitions of the isolated words are mostly due to chance.

If we view the histograms for Ravesh's words, the situation gets even worse. For instance, in Figure d, when he says "Michigan", the corresponding template word has the second highest DTW distance, an indication of a poor match. Furthermore, the system predicts that "orange" is the correct word, with a DTW distance far below the rest of the words. The situation is repeated in Figure e, where the template corresponding to the test word "grey" has the second highest DTW distance. If the system were working properly, we would expect that even if the template corresponding to the spoken word did not have the lowest DTW distance of all the words in the database, it would be the second or third lowest. That this is not the case further indicates that the system is in need of more refinement.

Figure a: Histogram of cumulative DTW distance for each template word (michigan, orange, red, grey, cool, tool) when Xueyu says "tool". The program correctly recognizes the word. Remember, the lower the DTW distance, the better the match.

Figure b: Histogram of cumulative DTW distance for Xueyu saying "orange". The program correctly recognizes the word.

Figure c: Histogram of cumulative DTW distance for Xueyu saying "cool". The program does not recognize the word, saying the speech signal contains the word "red".

Figure d: Histogram of cumulative DTW distance for Ravesh saying "Michigan". The program does not recognize the word, saying the speech signal contains the word "orange".

Figure e: Histogram of cumulative DTW distance for Ravesh saying "grey". The program does not recognize the word, saying the speech signal contains the word "tool".

We will also take a look at some of the probabilities that the system could detect a correct word. In Figure f below, we can see that the probability that the system correctly interprets the speech of Xueyu is low, with the exception of the words cool and tool. Relatedly, it was found that when the system incorrectly interpreted a word, these words were the most likely candidates to be offered by the system as the best match. This is an indication that our system has problems with the 'oo' sound, and that in the next phase of testing, these words should be removed to see if the system performance increases. In Figure g, where Ravesh is the one who voiced the template and test signals, we have a similar situation. There is a relatively high detection rate for cool and tool, but the probability that we correctly interpreted these signals is close to random. Again, substituting cool and tool with other words that don't have such long vowel sounds may improve the performance of the system.

Figure f: Histogram of probabilities that the system correctly interpreted each word (michigan, orange, red, grey, cool, tool) voiced by Xueyu.

Figure g: Histogram of probabilities that the system correctly interpreted each word voiced by Ravesh.

SOURCES OF ERROR AND FUTURE IMPROVEMENTS

As shown in our measurement and results section, our system performed well below expectations, with detection rates that were usually less than 20%, with the exception of "cool" and "tool". The most probable sources of error behind this underwhelming performance are:
1. We do not believe that the MFCC coefficients themselves are incorrect, but the method with which we compared them may be. We simply applied the DTW algorithm to each row of the feature matrix and summed the resulting vector of DTW values to get a cumulative DTW distance. A more statistically robust method would involve fitting multidimensional Gaussian distributions, as is done here (Ning, p. 5). However, fitting such distributions would require taking much more data, which we could not do given our time constraints.
2. We did very little post-processing on the signals, with the exception of the high-pass filter in the pre-emphasis stage of calculating the MFCC coefficients. Doing some more filtering or time-shifting might help to remove the anomalies that appear in the speech data.
3. The words chosen were not based on any scientific reasoning, but purely on intuition. As noted, the system seemed to have an affinity for the "oo" sound, with many false positives for "cool" and "tool". This is perhaps due to the resonant frequencies characteristic of vowel sounds overpowering the consonant sounds, so that words with prominent vowel sounds, like "red" or "grey", would be falsely matched to "cool" and "tool" because the consonants are not well represented in the speech signals.

Overall, though, the scientific and mathematical basis of our system is sound, and with a few tweaks, we are confident that it could work very well. Perhaps not to the level of Siri, but well enough to not frustrate the Four Unknown Men.

REFERENCES

Furui, S., 2001, 'Digital Speech Processing, Synthesis and Recognition: Second Edition', Marcel Dekker, Inc., New York.
Deller, J., Proakis, J., Hansen, J., 1993, 'Discrete-Time Processing of Speech Signals', Macmillan Publishing Company, New York.
