IEEE ISIE 2005, June 20-23, 2005, Dubrovnik, Croatia

Voice Operated Intelligent Wheelchair - VOIC

G. Pačnik, K. Benkič and B. Brečko
Faculty of Electrical Engineering and Computer Science, Institute of Robotics, Maribor, Slovenia
[email protected], [email protected], [email protected]

Abstract - The development of an intelligent wheelchair laboratory prototype is presented in the paper. VOIC is designed for physically disabled persons who cannot control their movements and therefore cannot drive a wheelchair with a joystick. The article describes the basic components of the voice recognition and wheelchair control system. Voice recognition begins with input signal sampling, followed by word isolation, LPC cepstral analysis, coefficient dimension reduction and trajectory recognition using a fixed-point approach with neural networks. The wheelchair control system is divided into a system for sensor data acquisition and a system for wheelchair steering. The complexity of the voice recognition is reduced using LPC cepstral analysis and coefficient dimension reduction with minimal loss of vital information. The decision to use neural networks for speech recognition is supported by experimental results.

I. INTRODUCTION

The main motivation for the speech recognition system is the strong need for verbal commands when operating a wheelchair. The most important property of such a verbally commanded wheelchair is the safety of the user, a physically disabled person, and of the people in the vicinity. The wheelchair system is therefore built redundantly: if the speech recognition system fails, the wheelchair control system, connected to ultrasound sensors, stops the wheelchair in a dangerous situation. VOIC is suitable for users such as quadriplegics, people paralyzed from the neck downward, and people who cannot control their movements (for example patients with cerebral palsy), and it can of course be used by any other physically disabled person. Disabled persons and the wheelchair application are not the only potential users of this system. The system could also be used in numerous other locations or applications, such as elevators and doors, and it is applicable in various industrial situations where voice data input or voice control of the secondary functions of machine tools is needed in a very noisy environment.

Recognizing speech is natural and simple for people, but it is a difficult task for computers, and no perfect solution has been found so far. Several methods have been developed, such as the Hidden Markov Model (HMM), but none of them is perfect. Based on this, we decided to use a known but fresh approach and implement voice recognition using neural networks.

Neural networks are used mainly because of their simple usage and their strong resistance to noise added to the signal when voice commands are recognized, as well as to changes in the tone and color of speech, for example when the user is ill or walking down the street. The speech recognition is based on the work of Rabiner and Juang [6], Furui [1], Zegers [7] and Kirsching [3]. The "long short-term memory" (LSTM) neural network architecture and a new learning algorithm are added and explained in this paper. The main advantage of the proposed speech recognition system is that it is not limited to speech signals; it can be used in any application where some kind of useful measurable signal is present. An example of such usage is recognizing commands by measuring the frequency of the vocal cords with a special sensor mounted on the user's neck, where very good recognition can be achieved because virtually no noise is present.

II. WHEELCHAIR

The wheelchair STORM3 EURO provided by the Invacare Company is shown in picture 2-1. The voice operated wheelchair is currently usable only as a laboratory prototype. The developed wheelchair control system for acquiring data from the ultrasound sensors is visible at the front of the wheelchair. The voice recognition and control logic was built on a PC workstation using the .NET framework. The wheelchair currently communicates with the PC over two RS232 connections: the first is used for the control connection, while the other is used for sensor data communication.


Picture 2-1: Lab prototype of voice operated intelligent wheelchair
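The split between a control link and a sensor link can be illustrated with a short, purely hypothetical sketch; the port names, baud rate and message framing below are assumptions for illustration only, not the actual protocol of the prototype.

```python
# Hypothetical sketch of the dual RS232 setup described above; port names,
# baud rate and framing are illustrative assumptions only.
import serial  # pyserial

control_link = serial.Serial("COM1", baudrate=9600, timeout=0.1)  # steering commands
sensor_link = serial.Serial("COM2", baudrate=9600, timeout=0.1)   # ultrasound sensor data

def send_command(cmd: str) -> None:
    """Send a recognized command (e.g. 'left') over the control connection."""
    control_link.write((cmd + "\r\n").encode("ascii"))

def read_sensor_frame() -> bytes:
    """Read one line of sensor data from the second connection."""
    return sensor_link.readline()
```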


III. VOICE RECOGNITION

Picture 3-1: Basic principle of VOIC

The basic principle of operating the wheelchair with recognized voice commands is shown in picture 3-1. Detailed description of the picture elements:

1. the wheelchair user utters the command "left" ("levo" in the Slovene language),
2. the voice recognition system recognizes the command,
3. the command is transferred to the part of the system designed for its evaluation and execution,
4. the dynamic obstacles surrounding the wheelchair are observed continuously during operation,
5. a specially designed component measures the distance to the obstacle,
6. the control system collects all necessary data and decides whether it is safe to execute the recognized command (a minimal sketch of this decision is given after this list),
7. if the recognized command poses no threat to the wheelchair user or to the people around it, the command is transmitted to the wheelchair for execution, and
8. the wheelchair receives the command and executes it.

The command list supported by the control system can easily be adapted to a specific user or application. Simple one-word commands such as "left", "right" and "forward", as well as more complicated multi-word commands such as "left 10" and "forward left", are supported.
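Step 6 of the list above can be illustrated with a minimal decision sketch. The 25 cm distance threshold and the mapping from commands to the sensors that must report free space are illustrative assumptions; the paper describes the actual decision logic only as a fixed logic.

```python
# Minimal sketch of the safety decision in step 6; the threshold and the
# command-to-sensor mapping are assumptions, not values from the paper.
SAFE_DISTANCE_CM = 25.0

# Sensor layout on the prototype: three at the front, one on each side (see section IV).
RELEVANT_SENSORS = {
    "forward": ["front_left", "front_center", "front_right"],
    "left": ["left"],
    "right": ["right"],
    "back": [],   # no rear sensor on the prototype, so "back" is not checked here
    "stop": [],   # stopping is always considered safe
}

def is_safe(command: str, distances_cm: dict) -> bool:
    """Return True if every sensor relevant to the command reports enough free space."""
    return all(distances_cm[s] > SAFE_DISTANCE_CM for s in RELEVANT_SENSORS.get(command, []))

# Example: "forward" is blocked because the center sensor sees an obstacle at 12 cm.
readings = {"front_left": 80, "front_center": 12, "front_right": 95, "left": 60, "right": 70}
print(is_safe("forward", readings))  # False
print(is_safe("left", readings))     # True
```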

It is possible to reconfigure the number and locations of the wheelchair sensors, so the voice operated wheelchair can be individually suited to the user's needs. Voice recognition is composed of three successive components, shown in picture 3-2.

Picture 3-2: Basic components of the voice recognition

The component shown in picture 3-2 under item 1 performs the processing of the signal and is presented in detail in picture 3-3. This component is independent of the purpose and of the final user of the speech recognition system, so it can also be used for the other applications mentioned in section I.

Picture 3-3: The input signal processing

The individual parts of the sound processing are (a minimal sketch of steps 4 to 6 follows the list):

1. sampling of the signal at 11025 Hz, 8 bits, one channel,
2. filtering with a high-pass filter,
3. word isolation with the help of a running average and a threshold value that defines the beginning and the end of every word (there has to be a gap between words),
4. splitting into frames: the pattern is divided into blocks of defined length, and the discontinuity at the block edges is removed with the Hamming window,
5. LPC analysis, calculating 12 LPC coefficients for every window, and
6. cepstral analysis, the transformation of the LPC coefficients into cepstral coefficients, which are the output of the sound processing component.
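The following is a minimal sketch of steps 4 to 6 (framing, LPC analysis and LPC-to-cepstrum conversion), reconstructed from the description above rather than taken from the original implementation; the frame length of 256 samples is an assumption borrowed from the buffer size mentioned in section IV.

```python
# Minimal sketch of steps 4-6: framing, LPC analysis, LPC cepstral coefficients.
import numpy as np

FS = 11025        # sampling rate from step 1
FRAME_LEN = 256   # assumed frame length (the 256-sample buffer from section IV)
LPC_ORDER = 12    # 12 LPC coefficients per window (step 5)

def frames(signal, frame_len=FRAME_LEN):
    """Step 4: split the signal into blocks and remove edge discontinuities with a Hamming window."""
    window = np.hamming(frame_len)
    n = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] * window for i in range(n)]

def lpc(frame, order=LPC_ORDER):
    """Step 5: LPC analysis via the autocorrelation method and the Levinson-Durbin recursion."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        k = (r[i + 1] - np.dot(a[:i], r[1:i + 1][::-1])) / err
        a[:i] -= k * a[:i][::-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a  # predictor coefficients a_1 .. a_12

def lpc_to_cepstrum(a):
    """Step 6: convert LPC coefficients to cepstral coefficients (recursion from Rabiner and Juang [6])."""
    p = len(a)
    c = np.zeros(p)
    for m in range(1, p + 1):
        c[m - 1] = a[m - 1] + sum((k / m) * c[k - 1] * a[m - 1 - k] for k in range(1, m))
    return c

# One isolated word (random noise stands in for a real recording) becomes a
# sequence of 12-dimensional cepstral vectors, one per frame.
word = np.random.randn(FS // 2)
cepstra = [lpc_to_cepstrum(lpc(f)) for f in frames(word)]
```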

The voice recognition system processes the speech signal, analyzes and reduces the coefficient dimensions, and recognizes the signal trajectory. The result is a suitably coded recognized command.


The component shown in picture 3-2 under item 2 performs the dimension reduction, implemented with the help of a self-organizing map (SOM) neural network [4] and unsupervised learning. The neural network enables the reduction from 12 cepstral coefficients to 2 coefficients, which represent the coordinates of the winning neuron in a map of 16 by 16 neurons. The reduction itself is necessary for an easier and consequently better recognition of the speech signal; from our experience and from the general theory of neural networks it is known that the reduction does not lose any important information of the speech signal. If the output of the reduction is presented graphically for a variety of speech commands, assuming the same neural network is used for all of them, repeating patterns can be seen for the same words, as shown in pictures 3-4 and 3-5.

Picture 3-4: Patterns for two utterances of the word "left"

Picture 3-5: Patterns for two utterances of the word "right"

Trajectory recognition can be realized in two ways. The first is the fixed approach, in which the pairs of coefficients obtained from the SOM reduction are scaled to a length of 100 pairs, which form the input to a special neural network with 200 input neurons. Either an ordinary feedforward neural network or a special neural network based on the SVM ("support vector machine") [5] can be chosen. The advantage of the fixed approach is simple training of the neural network; its weakness is sensitivity to the isolation of the words, because the location of each coefficient depends on the detected beginning and end of the isolated word.

The second approach is continuous recognition of the trajectory, in which a special neural network with only two inputs receives the time-dependent coefficients of the trajectory [3]. This neural network has a time-delay architecture with internal recurrent connections, which provides a memory of already seen patterns and therefore enables successful recognition on the basis of already known characteristics. The advantage of this approach is its insensitivity to word isolation; in fact, it could be used without isolation at all. However, it has a significant disadvantage in the way it is trained, for which there are several possible solutions. The first is the use of genetic algorithms, which are guaranteed to find a solution, although the time needed is unknown. The second is to use known learning methods (backpropagation) for time-delayed neural networks, which suffer from convergence and oscillation problems. The third is the recently presented neural network with special memory cells, the "long short-term memory" (LSTM) network [2]. A minimal sketch of the SOM reduction and of the scaling used in the fixed approach is given below.
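The sketch below illustrates the SOM reduction (12 cepstral coefficients to the coordinates of the winning neuron on a 16 by 16 map) and the scaling to 100 pairs used by the fixed approach. The SOM weights are random stand-ins for the trained map, so the sketch only shows the mechanics, not a trained recognizer.

```python
# Minimal sketch of the SOM-based reduction and of the fixed-length scaling.
import numpy as np

MAP_SIZE, DIM = 16, 12
rng = np.random.default_rng(0)
som_weights = rng.normal(size=(MAP_SIZE, MAP_SIZE, DIM))  # one 12-dim prototype per neuron

def reduce_frame(cepstral_vector):
    """Return the (row, column) of the winning neuron, i.e. the 2-D reduced coefficients."""
    dist = np.linalg.norm(som_weights - cepstral_vector, axis=2)
    return np.unravel_index(np.argmin(dist), dist.shape)

def word_trajectory(cepstral_frames):
    """Map every 12-dim frame of a word to a 2-D point, giving the word's trajectory on the map."""
    return np.array([reduce_frame(f) for f in cepstral_frames], dtype=float)

def scale_to_fixed_length(trajectory, n_points=100):
    """Resample the trajectory to 100 (x, y) pairs, i.e. the 200-value input of the fixed approach."""
    t_old = np.linspace(0.0, 1.0, len(trajectory))
    t_new = np.linspace(0.0, 1.0, n_points)
    x = np.interp(t_new, t_old, trajectory[:, 0])
    y = np.interp(t_new, t_old, trajectory[:, 1])
    return np.stack([x, y], axis=1).reshape(-1)

# A word of about 40 frames becomes a 40 x 2 trajectory and then a 200-value input vector.
inputs = scale_to_fixed_length(word_trajectory(rng.normal(size=(40, DIM))))
```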

IV. APPLICATION

The application is divided into several parts, as seen in picture 4-1, which are explained below. Basically, the application is divided between learning how to recognize the words and recognizing them. Under item 1 (picture 4-1) the basic controls for speech recognition can be seen. They are the following:

"LOAD SOM" - loading weights for the self organizational neural net, 2. "LOAD NN" - loading weights for the neural net for trajectory recognition, 3. "START" button begins the recognition of speech and controlling of the wheelchair, 4. "STOP" button finishes the recognition and control of the wheelchair, 5. "WRITE LPC", selection of the LPC cepstral coefficient for every perceived word, 6. input field with the value of 256 shows the present buffer size, 7. selection field with the value 0,20 shows the present threshold value for a successful speech recognition and 8. “SAVE SOM”, save weights for the self organizational neural net. The list marked with the item 4 is meant to show the recognized words. Written out are only the successfully recognized words. In this version the program also writes out the output values of the neural net. The first value is for the word “backward”, second for “backward”, third for “stop”, the fourth for “stop” and the last (the fifth) for “stop”.


Picture 4-1: Application window

Under item 5 the following can be seen:

1. a running display of the calculated average value,
2. the first vertical line, which marks the beginning of an isolated word,
3. the second vertical line, which marks the end of an isolated word, and
4. the threshold value, shown as a thin horizontal line.

The threshold value can be adjusted with the help of the choice field "Threshold". The currently selected value is 1.4, which in our case successfully eliminates the unnecessary noise between words (a minimal sketch of this isolation idea is given at the end of this subsection). The group of controls under item 3 is used for training the neural networks for speech recognition. It also enables independent control of the wheelchair when speech recognition is turned off but we still want to drive the wheelchair and collect data from the sensors; the corresponding buttons simulate recognized words, so the behavior of the wheelchair can be studied more easily. In this screen it is also possible to see up to five different text fields, which show the present values of the individual wheelchair sensors. The fields are arranged in the same way as on the wheelchair: three at the front and one on each of the left and right sides.
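The isolation idea behind this display can be sketched as follows; the averaging window length is an assumption, while the threshold of 1.4 is the value mentioned above.

```python
# Minimal sketch of word isolation: a running average of the signal magnitude
# is compared with a threshold to find the beginning and end of each word.
import numpy as np

def isolate_words(samples, threshold=1.4, window=256):
    """Return (start, end) sample indices of segments whose running average exceeds the threshold."""
    avg = np.convolve(np.abs(samples), np.ones(window) / window, mode="same")
    active = avg > threshold
    edges = np.flatnonzero(np.diff(active.astype(int))).tolist()
    if active[0]:
        edges.insert(0, 0)
    if active[-1]:
        edges.append(len(samples) - 1)
    return list(zip(edges[0::2], edges[1::2]))

# Example: a loud burst between two quiet stretches is detected as one word;
# the exact boundaries are smeared by the averaging window.
signal = np.concatenate([np.zeros(500), 3.0 * np.ones(300), np.zeros(500)])
print(isolate_words(signal))
```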

A special part of the application is the group under item 2, which is used for training both neural networks. The button "Open LPC files" opens a dialogue for choosing the files with the LPC cepstral coefficients of individual words; these files are obtained simply by selecting "write LPC" during the recognition procedure. The loading is followed by the training of the SOM, which is started by pressing the button "Train SOM with data". At the moment the training is constructed in a very simple way: every coefficient is presented to the neural network 60 times, which in our example ensures more than satisfactory convergence towards a small error. Alternatively, it is possible to simply change the training so that it stops when the network converges to a small enough error. After the training, the values of the weights of the neural network can be saved by pressing the button "Save SOM". The training of the SOM neural network is followed by the reduction of the LPC cepstral coefficients from 12 to 2 dimensions, achieved by pressing the button "SOM data"; pressing the button "Norm SOM" then normalizes the values obtained with the SOM neural network to 100 patterns. The previously recorded words have approximately 30 to 60 frames of LPC cepstral coefficients; these 30 to 60 frames are extended to 100 frames to prepare them for the training of the neural network.
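A minimal sketch of this simple training regime, assuming a Gaussian neighbourhood with an illustrative learning rate and radius (the paper does not state these values), is given below.

```python
# Minimal sketch of the simple SOM training regime: every coefficient vector is
# presented to the map 60 times. Learning rate and radius are assumptions.
import numpy as np

MAP_SIZE, DIM, PASSES = 16, 12, 60
rng = np.random.default_rng(0)
rows, cols = np.mgrid[0:MAP_SIZE, 0:MAP_SIZE]

def train_som(weights, samples, lr=0.1, radius=2.0):
    """Unsupervised SOM training: pull the winner and its neighbours towards each sample."""
    for _ in range(PASSES):
        for x in samples:
            dist = np.linalg.norm(weights - x, axis=2)
            wr, wc = np.unravel_index(np.argmin(dist), dist.shape)
            grid_d2 = (rows - wr) ** 2 + (cols - wc) ** 2
            h = np.exp(-grid_d2 / (2.0 * radius ** 2))      # Gaussian neighbourhood around the winner
            weights += lr * h[..., None] * (x - weights)    # move prototypes towards the sample
    return weights

# Example: train on cepstral vectors loaded from the "write LPC" files (random stand-ins here).
weights = train_som(rng.normal(size=(MAP_SIZE, MAP_SIZE, DIM)),
                    rng.normal(size=(200, DIM)))
```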


In the final step the program saves the obtained trajectory of every word into a file which is already formatted for use in NeuroSolutions, a special program package for building and training neural networks. We chose an SVM ("support vector machine") neural network, which in our case is a good choice for the classification of the patterns (trajectories). NeuroSolutions then enables training of the neural network with classic learning techniques, which can be chosen separately for every layer of the network. The learning parameters can be defined manually or with a special genetic algorithm; unfortunately, the optimization of the learning parameters can take quite a while (approximately 2 hours) with a large number of samples. After the training is concluded, NeuroSolutions automatically loads the neural network weights with the smallest error and prepares them for export into a text file. The trained neural network is then exported and translated into a Visual C++ 6.0 dynamic link library (DLL) with the use of the wizard for the construction of a custom application. The library is then copied to the folder containing the speech recognition program and renamed to "svm.dll".
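Since NeuroSolutions and the exported svm.dll are not reproduced here, the following stand-in sketch trains an ordinary SVM classifier from scikit-learn (a substitution, not the tool used in the paper) on 200-value trajectory vectors.

```python
# Illustrative stand-in for the exported SVM classifier; random data replaces
# the real scaled trajectories and labels.
import numpy as np
from sklearn.svm import SVC

COMMANDS = ["left", "right", "forward", "back", "stop"]
rng = np.random.default_rng(0)

X = rng.random(size=(300, 200))                 # stand-in for the scaled word trajectories
y = rng.integers(0, len(COMMANDS), size=300)    # stand-in labels

clf = SVC(kernel="rbf").fit(X, y)               # trained offline, like the exported network

def recognize(trajectory_vector):
    """Return the command word predicted for one 200-value trajectory."""
    return COMMANDS[int(clf.predict(trajectory_vector.reshape(1, -1))[0])]

print(recognize(X[0]))
```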

V. RESULTS

Tests of recognizing the verbal commands were carried out with two different subjects. Subject 1 was male, aged 22. The patterns of his commands were used for training the neural networks in the first environment described below. The training was made on the basis of the following numbers of patterns:

- left: 141 samples,
- right: 142 samples,
- forward: 145 samples,
- back: 155 samples and
- stop: 199 samples.

Subject 2 was female, aged 22. In comparison with subject 1 there was a slight difference in pronunciation and accent. The test was executed in two different environments:

- The first environment had only a little background sound; cars on the road could be heard, but strongly suppressed.
- The second environment had a louder background sound, mainly caused by a radio receiver playing at full volume, so that a normal conversation was not possible. The test subjects were 1.5 meters from the receiver, which was turned directly towards them. The sound consisted mainly of music with some speech interference.

The neural networks were trained on the samples gathered from subject 1 in the first environment.

The test was done in 6 different ways:

- repeating the word "left",
- repeating the word "right",
- repeating the word "forward",
- repeating the word "back",
- repeating the word "stop" and
- cyclically repeating the pattern "left", "right", "forward", "back", "stop".

The results for the words spoken by subject 1 in the first environment can be seen in table I.

TABLE I: The results of speech recognition by subject 1 in the first environment.

Word       Number of samples   Error count   Recognition quality
Left       27                  0             100%
Right      30                  0             100%
Forward    30                  0             100%
Back       34                  0             100%
Stop       34                  0             100%
Cyclical   37                  0             100%

The results for the words spoken by subject 1 in the second environment can be seen in table II.

TABLE II: The results of speech recognition by subject 1 in the second environment.

Word       Number of samples   Error count   Recognition quality
Left       34                  1             97%
Right      32                  0             100%
Forward    32                  3             91%
Back       39                  2             95%
Stop       42                  1             98%
Cyclical   44                  0             100%

The results for the words spoken by subject 2 in the first environment can be seen in table III.

TABLE III: The results of speech recognition by subject 2 in the first environment.

Word       Number of samples   Error count   Recognition quality
Left       31                  0             100%
Right      37                  3             92%
Forward    37                  3             92%
Back       44                  0             100%
Stop       44                  7             84%
Cyclical   64                  2             97%

The results for the words spoken by subject 2 in the second environment can be seen in table IV.


TABLE IV: The results of speech recognition by subject 2 in the second environment.

Word       Number of samples   Error count   Recognition quality
Left       38                  0             100%
Right      37                  0             100%
Forward    38                  9             76%
Back       43                  5             88%
Stop       46                  10            78%
Cyclical   60                  0             100%

Recognition of the commands pronounced by subject 1 in the first environment was 100% for all commands; this is the same situation as the one in which the neural networks were trained. We can therefore positively state that the system for speech recognition works in all environments in which the patterns for learning were recorded. Checking the results of subject 1 in the second environment, we can see that there were some errors in recognizing the commands. However, in the case of cyclical recognition, which is similar to everyday use, the recognition was 100%. This means that constant repetition of the same command, especially in loud environments where the user wishes to shout over the surroundings in order to hear himself or herself, is completely unnecessary. Similar results were obtained with subject 2 in the first environment, where we again see improved recognition in the cyclical sequence; besides that, the recognition of the commands left and back was 100%. Very interesting results were obtained with subject 2 in the second environment: a large influence of the surrounding environment can be seen on the recognition of the repeatedly spoken commands forward, back and stop, whereas the cyclical pronouncement of the commands was recognized with 100% quality. This can be related to the fact that the cyclical pronouncement is closer to everyday command giving, because each command differs from the previous one and the pronunciation does not become monotone, as it does when the same command is given several times in a row.

We can say that the recognition of isolated words in speech with neural networks is successful. Moreover, the neural networks were trained on the words of subject 1 in the first environment, and there was still 100% recognition of the speech of subject 2 in the second environment in the case of the usual cyclical pronouncement. Certainly, it is possible to improve the system to the point where the recognition for the primary user in unpredictable situations would be 100% in every case. There is also a possibility to improve the system so that the recognition of spoken commands is enabled or disabled with a special command or a sequence of commands; the suggested improvement is based on repeating the word "stop" three times to enable or disable the system, depending on its current state.
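As a consistency check, the recognition quality in the tables equals the share of correctly recognized samples, rounded to whole percent; for table IV:

```python
# Consistency check: "recognition quality" = (samples - errors) / samples,
# rounded to whole percent (values taken from table IV).
table_iv = {"Left": (38, 0), "Right": (37, 0), "Forward": (38, 9),
            "Back": (43, 5), "Stop": (46, 10), "Cyclical": (60, 0)}
for word, (samples, errors) in table_iv.items():
    print(word, round(100 * (samples - errors) / samples), "%")  # 100, 100, 76, 88, 78, 100
```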

VI. CONCLUSION

Analysis of the speech recognition results shows that speech recognition with artificial neural networks is certainly possible. The quality of the recognition is adequate for laboratory use, but it does not yet satisfy the criteria for use in circumstances where the user's safety is at stake. A strong point of speech recognition with artificial neural networks is their resistance to background sounds, which is shown in the results. However, there is also a disadvantage: the system does not distinguish between a normal conversation and a command. For now, using the wheelchair while having a normal conversation at the same time is difficult. One solution is an appropriate choice of commands that are not normally repeated in everyday conversation. The problem of commands being recognized when spoken by another person is minimized when the user wears a dynamic microphone with a narrow field of sound, which also partially removes the background sounds and emphasizes changes in the color of the commands pronounced by the user; this does not have a larger effect on the quality of the recognition.

Future work is to improve the feature extractor and the recognizer and to find a way to use the recognizer for a larger set of commands, including combined commands. An important improvement is also the construction of a system for separating everyday conversation from command giving, which is certainly part of improving the decision system for wheelchair operation, where a fixed logic is currently in use. The presented work is also a good base for additional improvements. Some possible upgrades are: recognition of facial expressions with the help of a camera, recognition of the environment with the help of a camera, additional systems or sensors, and the use of different neural networks for recognition, for instance the "long short-term memory" (LSTM) neural networks.

Speech recognition in a laboratory environment promises very good results. However, it would be necessary to run additional tests in the field, in environments with different background sounds.

VII. REFERENCES

[1] S. Furui, Digital Speech Processing, Synthesis and Recognition, Marcel Dekker Inc., 1989.
[2] A. Graves, D. Eck, N. Beringer, J. Schmidhuber, Biologically Plausible Speech Recognition with LSTM Neural Nets.
[3] I. Kirsching, Continuous Speech Recognition Using the Time-Sliced Paradigm, University of Tokushima, Tokushima Shi, 1995.
[4] T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, Berlin, 1984.
[5] P. Vincent, Y. Bengio, A Neural Support Vector Network Architecture with Adaptive Kernels, IJCNN (5), 2000, pp. 187-192.
[6] L. Rabiner, B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.
[7] P. Zegers, Speech Recognition Using Neural Networks, University of Arizona, Arizona, 1998.

