TRIBHUVAN UNIVERSITY INSTITUTE OF ENGINEERING PULCHOWK CAMPUS

“Voice for the Voiceless” Nepali Sign Language To Speech Converter

By: Karun Poudel (064BCT514) Kiran Timsina (064BCT515) Krishna Prasad Panthi (064BCT516) Madhu Sudan Sigdel (064BCT517)

Supervisor: Associate Prof. Dr. Subarna Shakya Assistant Dean, IOE

A PROJECT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING IN PARTIAL FULFILLMENT OF THE REQUIREMENT FOR THE BACHELOR’S DEGREE IN COMPUTER ENGINEERING

DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING
LALITPUR, NEPAL
November, 2011

CERTIFICATE OF APPROVAL The undersigned certify that they have read and recommended to the Department of Electronics and Computer Engineering, a final year project work entitled “Voice for the Voiceless, Nepali Sign Language to Speech Converter” submitted by Karun Poudel, Kiran Timsina, Krishna Prasad Panthi and Madhu Sudan Sigdel in fulfillment of the requirement for the Bachelor’s Degree in Computer Engineering.

……………………………………………. Project Supervisor Associate Prof. Dr. Subarna Shakya Assistant Dean, Institute Of Engineering

…………………………………
Internal Examiner
Dr. Aman Shakya
Assistant Professor,
Department of Electronics and Computer Engineering

……………………………………..
External Examiner
Er. Pratima Pradhan
Manager,
Directorate of Wireless Telephone
Nepal Telecom

………………………………………. Project Coordinator Surendra Shrestha, PhD Deputy Head, Department of Electronics and Computer Engineering

Date of Approval: November 7, 2011


COPYRIGHT The author has agreed that the Library, Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering may make this report freely available for inspection. Moreover, the author has agreed that permission for extensive copying of this project report for scholarly purposes may be granted by the supervisors who supervised the project work recorded herein or, in their absence, by the Head of the Department wherein the project report was done. It is understood that due recognition will be given to the author of this report and to the Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering in any use of the material of this project report. Copying, publication or any other use of this report for financial gain without the approval of the Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering and the author’s written permission is prohibited. Requests for permission to copy or to make any other use of the material in this report in whole or in part should be addressed to:

Head of Department Department of Electronics and Computer Engineering Pulchowk Campus, Institute of Engineering Pulchowk, Lalitpur Nepal


ACKNOWLEDGEMENT We would like to express our deep sense of gratitude to our project supervisor, Associate Prof. Dr. Subarna Shakya, Asst. Dean of IOE, for providing a lot of inspiration and intellectual guidance while being very encouraging and supportive. We are obliged to the Department of Electronics and Computer Engineering for providing us an opportunity to develop a major project. We are very grateful to Mr. Narayan Bhakta Shrestha (Raayan), Principal, School for Deaf and Dumb, Naxal, for his cooperation. We would like to express our acknowledgement to Mr. Rishi Devkota, teacher in the same school, for his expert help on Nepali Sign Language and cooperation during the collection of training data. We want to gratefully acknowledge the generous support of Mr. Kul Prasad Bhattarai, teacher of the same school, and all the students who volunteered in the collection of training data. We would also like to thank the Nepal National Federation of the Deaf (NNFD), Kamalpokhari, for their cooperation. We would like to express our gratitude to lecturers Dr. Sanjeeb Prasad Panday and Er. Manoj Ghimire for providing help in the image processing and data mining fields respectively. We are grateful to Sr. Er. Sushant Pokharel, Deerwalk Inc., for helping us plan the project properly. We are thankful to our seniors Er. Suvash Sedhain and Er. Hari Prasad Gaire for helping us with pattern recognition. We would like to thank all the teachers, friends and seniors who provided valuable suggestions regarding the project, and we want to give our special thanks to our families, whose love enabled us to complete this project.

Karun Poudel ([email protected]) Kiran Timsina ([email protected]) Krishna Prasad Panthi ([email protected]) Madhu Sudan Sigdel ([email protected])


ABSTRACT There is a huge communication gap between normal people and hearing impaired people. A human interpreter is needed for communication between the two worlds. This project aims to provide a foundation towards making that communication automated. There are two halves of communication between normal people and hearing impaired people. First, normal people have to understand the signs made by hearing impaired people, which involves conversion of sign language to speech. Second, hearing impaired people have to understand the speech produced by normal people, which involves conversion of speech to some 3D model of hands. This project deals with the first part, which involves a great deal of image processing, feature extraction and pattern recognition. Similar projects have been implemented in other countries, but sign language differs from one country to another, so for each sign language a new system has to be developed. Although the output of this project does not completely replace a human interpreter, it is a first step towards the development of a complete automated interpreter.

Keywords: Nepali sign language, PCA, Hu moments, SVM, bare hand detection


TABLE OF CONTENTS
CERTIFICATE OF APPROVAL ................................................... ii
COPYRIGHT ................................................................. iii
ACKNOWLEDGEMENT ........................................................... iv
ABSTRACT ................................................................... v
TABLE OF CONTENTS ......................................................... vi
LIST OF FIGURES ......................................................... viii
LIST OF ACRONYMS ........................................................... x
1. INTRODUCTION ............................................................ 1
   1.1. Background ......................................................... 1
   1.2. Objectives ......................................................... 2
   1.3. Sign Language Overview ............................................. 2
2. LITERATURE REVIEW ...................................................... 10
   2.1. Data Acquisition .................................................. 10
   2.2. Hand Features ..................................................... 21
   2.3. Pattern Recognition ............................................... 24
   2.4. Text To Speech .................................................... 27
3. SYSTEM DEVELOPMENT AND METHODOLOGY .................................... 29
   3.1. Preliminary investigation phase ................................... 29
   3.2. General project description ....................................... 34
   3.3. Problem Analysis phase ............................................ 34
   3.4. Decision Analysis phase ........................................... 37
   3.5. System Modeling ................................................... 49
4. IMPLEMENTATION ......................................................... 51
   4.1. Frame Grabber ..................................................... 52
   4.2. Image Processor ................................................... 52
   4.3. Feature Extractor ................................................. 57
   4.4. Classifier ........................................................ 59
   4.5. Audio Generator ................................................... 61
   4.6. GUI ............................................................... 61
   4.7. Control Box ....................................................... 62
5. DEVELOPMENT METHODS .................................................... 63
   5.1. Methodology ....................................................... 63
   5.2. Tools and Environment ............................................. 63
6. PROBLEM FACED AND SOLUTION ............................................. 64
   6.1. Some Solved Problems .............................................. 64
   6.2. Some Unsolved Problems ............................................ 64
7. RESULT AND ANALYSIS .................................................... 66
   7.1. Experimental Setup ................................................ 66
   7.2. Signer Dependent Testing .......................................... 66
   7.3. Signer Independent Testing ........................................ 69
8. FUTURE SYSTEM ENHANCEMENT .............................................. 73
9. CONCLUSION ............................................................. 75
10. REFERENCE ............................................................. 76
11. BIBILOGRAPHY .......................................................... 77
12. APPENDIX A: NSL Alphabets and normalized image ........................ 79
13. APPENDIX B: VFV Database Trainer ...................................... 81
14. APPENDIX C: Detail Result ............................................. 82


LIST OF FIGURES
Figure 1.1 Taxonomy of hand gestures for HCI ............................... 3
Figure 1.2 Nepali sign gesture for alphabets (For more gestures refer to Appendix A) ... 6
Figure 1.3 Gesture with hand motion (a) Rotational (b) Translational ....... 8
Figure 2.1 Sensor Placement for Bi-Channel recognition system proposed by Kim ... 10
Figure 2.2 The Acceleglove ................................................ 10
Figure 2.3 Left: Flock of Birds 3D motion tracker, Right: CyberGlove ...... 10
Figure 2.4 Head mounted camera and accelerometer data collection .......... 10
Figure 2.5 Colored Gloves ................................................. 11
Figure 2.6 Tracking using Motion Energy Image (MEI) ....................... 13
Figure 2.7 Tracking using Motion History Image ............................ 14
Figure 2.8 Convexity defects for finger count ............................. 18
Figure 2.9 Template matching that sweeps a template image patch across another image looking for matches ... 20
Figure 2.10 Haar Cascade .................................................. 20
Figure 2.11 PC1 and PC2 of a set of data points ........................... 24
Figure 2.12 A conventional pattern recognition procedure .................. 24
Figure 2.13 Examples of hyperplanes separating data points of two classes ... 26
Figure 3.1 Automated communication system between normal and hearing impaired people (highlighted in light grey is our working domain) ... 30
Figure 3.2 Complete domain of the sign language (highlighted in light grey is our working domain) ... 31
Figure 3.3 System Block Diagram ........................................... 49
Figure 3.4 Class Diagram of the system .................................... 50
Figure 4.1 Different Types of Background in the application ............... 53
Figure 4.2 Background Subtraction ......................................... 53
Figure 4.3 Foreground mask after face elimination ......................... 54
Figure 4.4 Foreground mask after color subtraction ........................ 54
Figure 4.5 Foreground mask with all noise removed ......................... 55
Figure 4.6 Sample raw binary image of size 24*32 .......................... 57
Figure 4.7 Sample normalized image of letter “च” with unfilled contour .... 58
Figure 4.8 A sample contour of sign “च” filled with white ................. 59
Figure 4.9 GUI of Debug Tab ............................................... 62
Figure 6.1 Overlapping hand and face ...................................... 64
Figure 6.2 Deviation from ideal shape of Nepali sign “ट” .................. 65
Figure 6.3 Effect of lighting ............................................. 65


LIST OF ACRONYMS
ASL     American Sign Language
BSL     British Sign Language
CCMSPF  Connected Component Mean Shift Particle Filter
CSL     Chinese Sign Language
EMG     Electromyogram
FD      Fourier Descriptor
HCI     Human Computer Interface
HMM     Hidden Markov Model
MEI     Motion Energy Image
MHI     Motion History Image
MS      Manual Sign
NLP     Natural Language Processing
NMS     Non-Manual Sign
NN      Neural Network
NSLR    Nepali Sign Language Recognition
PCA     Principal Component Analysis
PDF     Probability Density Function
PF      Particle Filter
RBF     Radial Basis Function
RGB     Red Green Blue
SLR     Sign Language Recognition
SVM     Support Vector Machine
VFV     Voice for the Voiceless
WV      Wave Descriptor

1. INTRODUCTION

1.1. Background
Sign languages are used all over the world as a primary means of communication by deaf people. According to current estimates, 1 in every 1000 people is deaf. This is a huge number and thus there is a great need for systems that can interpret sign language or can serve as interpreters between sign languages and spoken languages. There is a limited number of hearing people who can communicate competently in sign language. Hence, there is a huge communication gap between the sign world and the spoken world, requiring a human interpreter. An example of this scenario is in our Constituent Assembly, where a deaf and dumb member of the assembly needs a human interpreter to convey his message. Upon the completion of this project, we hope to ease the communication between the two worlds.

Sign language interpreters can be used to aid communication between deaf and hearing people, but this is often difficult due to the limited availability and high cost of interpreters. These difficulties in communication between hearing and deaf people can lead to problems in the integration of deaf people into society and conflict with an independent and self-determined lifestyle. Hearing people learn and perceive written language as a visual representation of spoken language where letters encode phonemes. For deaf people, this correspondence does not exist; letters are just seen as symbols without any meaning. Deaf people therefore have great difficulties with reading and writing due to the fact that there is no direct correspondence between their natural language (sign language) and written language. Research in automated recognition is therefore needed in order to improve communication between deaf and hearing people. Similar projects have been implemented in other countries, but sign language differs from one country to another. For example, the letters “क”, “ख” etc. are the same in the Nepali and Hindi written languages, but the corresponding signs for those letters can vary. So, for each sign language, a new system has to be developed.


1.2. Objectives
i. To convert the Nepali Sign Language posture to text and then to speech.
ii. To develop an image processing module to extract the final image for pattern recognition.
iii. To develop a recognition model to classify signs.
iv. To compare and analyze different algorithms applicable to build the system.

1.3. Sign Language Overview
Webster’s Dictionary defines a gesture as (1) “a movement usually of the body or limbs that expresses or emphasizes an idea, sentiment, or attitude"; (2) “the use of motions of the limbs or body as a means of expression". Most gestures are performed with the hands, but the face and body are also used. In the case of hand gestures, the shape of the hand, together with its movement and position with respect to the other body parts, forms a hand gesture. Gestures are used in many aspects of human communication. They can be used to accompany speech, or alone for communication in noisy environments or in places where it is not possible to talk. In a more structured way, they are used to form the sign languages of hearing-impaired people. With the progress of HCI, gestures have found a new area of usage. Systems that enable the use of computer programs with hand gestures, such as operating system control, games, and virtual reality applications, have been developed.

1.3.1. Hand Gestures Accompanying Speech
Hand gestures are frequently used in human-to-human communication, either alone or together with speech. There is considerable evidence that hand gestures are produced unconsciously along with speech in many situations and enhance the content of the accompanying speech. It is also known that hand gestures are produced even when the listener cannot see the hands of the speaker, or when there is no listener at all. Although hand gesture recognition for HCI is a relatively recent research area, the research on hand gestures for human-human communication is well developed. Several taxonomies are presented in the literature by considering different aspects of gestures. Gestures can be classified with respect to their independence, such as autonomous gestures (independent gestures) and gesticulation (gestures that are used together with another means of communication). Some literature classifies gestures into three groups: iconic gestures, metaphoric gestures and beats. Other work classifies gestures into four groups: conversational, controlling, manipulative and communicative. A similar categorization views gestures in terms of the relation between the intended interpretation and the abstraction of the movement. The conversational and controlling gesture types are accepted as a sub-category under communicative gestures. A widely accepted taxonomy of gestures for HCI is given in Figure 1.1.

Figure 1.1 Taxonomy of hand gestures for HCI.

The hand/arm movements during conversation can generally be classified into two groups: intended or unintended. Although unintended hand movements must also be taken into account in order to make human-computer interaction as natural as human-human interaction, current research on gesture recognition focuses on intended gestures, which are used for either communication or manipulation purposes. Manipulative gestures are used to act on objects, such as rotation, grasping, etc. Communicative gestures have an inherent communicational purpose. In a natural environment they are usually accompanied by speech. Communicative gestures can be acts or symbols. Symbol gestures are generally used in a linguistic role with a short motion (i.e. sign languages). In most cases, the symbol itself is not directly related to the meaning, and these gestures have a predetermined convention. Two types of symbol gestures are referential and modalizing. Referential gestures are used to refer to an object or a concept independently. For example, rubbing the index finger and the thumb in a circular fashion independently refers to money. Modalizing gestures are used with some other means of communication, such as speech. For example, in ASL, the sentence “I saw a fish. It was this big." is only meaningful with the gesture of the speaker. Another example is the symbol for continuation, which means that the person should continue whatever he/she is doing. Unlike symbol gestures, act gestures are directly related to the intended interpretation. Such movements are classified as either mimetic or deictic. Mimetic gestures usually mimic the concept or object they refer to. For example, a smoker going through the motion of “lighting up" with a cigarette in his mouth indicates that he needs a light, or the hand is shaped as if holding a gun. Deictic gestures or pointing gestures are used for pointing to objects and are classified as specific, generic or metonymic by context.

There is another type of gesture that has different characteristics than the other gesture types. These are called beat gestures. In beat gestures, the hand moves, in a short and quick way, along with the rhythm of speech. A typical beat gesture is a simple and short motion of the hand or the fingers up and down or back and forth. The main effect of a beat gesture is that it emphasizes the phrase it accompanies.

In order to differentiate intended and unintended hand/arm movements or different gestures, one should know the exact start and end time of a gesture. This is called the gesture segmentation problem. Gestures are dynamic processes and the temporal characteristics of gestures are important for segmentation purposes. With the exception of beat gestures, each gesture starts, continues for some interval and ends. This is not only valid for dynamic gestures, which include both the spatial and the temporal component, but also for static gestures, which only contain the spatial component. A gesture is constituted in three phases: preparation, stroke, and retraction or recovery. In the preparation phase, the hand is oriented for the gesture. The stroke phase is the phase of the actual gesture. Finally, in the retraction phase, the hand returns to the rest position. The preparation and the stroke phases constitute a gesture phrase and, together with the recovery phase, they constitute a gesture unit.


1.3.2. Manual Signals
Manual signals are the basic components that form sign languages. Manual sign language communication can be considered as a subset of gestural communication where the former is highly structured and restricted. Thus, analysis of manual signs is highly connected to hand gesture analysis but needs customized methods to solve several issues such as analysis of a large-vocabulary system, correlation analysis of signals, and dealing with its structured nature. Some of the studies in the SLR literature concentrate only on recognizing static hand shapes. These hand shapes are generally selected from the finger alphabet or from static signs of the language. However, a majority of the signs in many sign languages contain a significant amount of hand motion, and a recognition system that focuses only on the static aspects of the signs has a limited vocabulary. Hence, for recognizing hand gestures and signs, one must use methods that are successful in modeling the inherent temporal aspect of the data. Grammatical structures in the language are often expressed as systematic variations of the base manual signs. These variations can be in the form of speed, tension, and rate. Most of the SLR systems in the literature ignore these variations. However, special care must be paid to variations, especially for continuous signing and in sign-to-text systems. Psycholinguistic research reveals two main aspects within manual signals:
a. Hand Posture Recognition
   i. Hand shape
   ii. Orientation
b. Spatiotemporal Gesture Recognition
   i. Place of articulation
   ii. Motion

I. Hand Shape

Hand postures include hand shape and orientation. Hand shape refers to finger configuration and orientation refers to the inclinations of palm and arm. Different postures are used to represent the letters and numbers of writing and numeral systems. Hand shape is one of the main modalities of the gestured languages. Apart from sign language, the hand shape modality is widely used in gesture-controlled computer systems where predefined hand shapes are used to give specific commands to the operating system or a program. Analysis of the hand shape is a very challenging task as a result of the high degree of freedom of the hand. For systems that use a limited number of simple hand shapes, such as hand gesture controlled systems (hand shapes are determined manually by the system designer), the problem is easier. However, for sign languages, the unlimited number and the complexity of the hand shapes make discrimination a very challenging task, especially with 2D vision-based capture systems.

Figure 1.2 Nepali sign gesture for alphabets (For more gestures refer to Appendix A)

In sign languages, the number of hand shapes is much higher. For example, without considering finger spelling, American Sign Language (ASL) has around 150 hand shapes, and in British Sign Language (BSL) there are 57 hand shapes. For the analysis of hand shapes, a vast majority of the studies in the literature use appearance based methods. These methods extract features of a hand shape by analyzing a 2D hand image and are preferred due to their simplicity and low computation times, especially for real time applications. These features include region based descriptors (image moments, image eigenvectors, Zernike moments, Hu invariants, or grid descriptors) and edge based descriptors (contour representations, Fourier descriptors). When the position and angles of all the joints in the hand are needed with high precision, 3D hand models should be preferred. The 3D hand model can be estimated either from a multiple camera system by applying 3D reconstruction methods, or, in a single camera setting, the 2D appearances of the hand are matched with 3D hand models in a shape database. However, the computational complexity of these methods currently prevents their use in SLR systems.

II. Place of Articulation

The location of the hand must be analyzed with respect to the context. It is important to determine the reference point in the space and on the hand. In sign languages, where both the relative location and the global motion of the hand are important, the continuous coordinates and the location of the hand with respect to body parts should be analyzed. This analysis can be done by using the center of mass of the hand. On the other hand, for pointing signs, using the center of mass is not appropriate and the coordinates of the pointing finger and the pointing direction should be used. Since signs are generally performed in 3D space, location analysis should be done in 3D if possible. Stereo cameras can be used to reconstruct 3D coordinates in vision based systems.

III. Motion

In gestured communication, it is important to determine whether the performed hand motion conveys a meaning by itself. In sign languages, the hand motion is one of the primary modalities that form the sign, together with the hand shape and location. Depending on the sign, the characteristic of the hand trajectory can change, requiring different levels of analysis. For example, some signs are purely static and there is no need for trajectory analysis. The motion of dynamic signs can be examined as either of two types:
i. Signs with global hand motion: In these signs, the hand center of mass translates in the signing space.
ii. Signs with local hand motion: This includes signs where the hand rotates with no significant translation of the center of mass, or where the finger configuration of the hand changes.

For signs with global hand motion, trajectory analysis is needed. For signs with local motion, the change of the hand shape over time should be analyzed in detail, since even small changes of the hand shape convey information. The first step of hand trajectory analysis is tracking the center of mass of each segmented hand. Hand trajectories are generally noisy due to segmentation errors resulting from bad illumination or occlusion. Thus, a filtering and tracking algorithm is needed to smooth the trajectories and to estimate the hand location when necessary. Moreover, since hand detection is a costly operation, hand detection and segmentation can be applied not in every frame but less frequently, provided that a reliable estimation algorithm exists. For this purpose, algorithms such as Kalman filters and particle filters can be used, and the estimations of these filters for the position and its first and second order derivatives, the velocity and the acceleration of the hand, are used as hand motion features. Based on the context and the sign, hand coordinates can be normalized with respect to the reference point of the sign. The relative motion and position of each hand with respect to the other is another important feature in sign language. The synchronization characteristics of the two hands differ from sign to sign, i.e., the two hands can move in total synchronization; one hand can be stationary and the other can be moving; they can be approaching or moving away. Although the hand motion features are continuous values, they can be discretized for use in simpler models, especially when there is a low amount of training data. The discretized feature vector includes the discretized values for the position of the hands relative to each other, the position of the hands relative to key body locations, the relative movement of the hands and the shape of the hands (the class index).

Figure 1.3 Gesture with hand motion (a) Rotational (b) Translational

IV. Non-manual Signals

Non-manual signals are used in sign language either to strengthen or weaken, or sometimes to completely change, the meaning of the manual sign [1]. These include facial expressions, facial movements, and body posture. For example, by using the same MS (Manual Signals) but different NMS (Non-Manual Signals), the ASL sign HERE may mean NOT HERE, HERE (affirmative) or IS HERE. The non-manual signs can also be used by themselves, especially for negation. As opposed to studies that try to improve SLR performance by adding lip reading to the system, analysis of NMS is a must for building a complete SLR system: two signs with exactly the same manual component can have completely different meanings. Some limited studies on non-manual signs attempt to recognize only the NMS without the MS. In the SLR literature, there are only a few studies that integrate manual and non-manual signs. The synchronization characteristics of non-manual signals and manual signs should be further analyzed. In a continuous sign language sentence, the non-manual signal does not always coincide with the manual sign. It may start and finish before or after the manual sign. It may last through multiple manual signs or the whole sentence. Thus, the joint analysis of the manual signs and non-manual signals requires the fusion of modalities in different time scales.


2. LITERATURE REVIEW

2.1. Data Acquisition

2.1.1. Wearable Computing Based Acquisition
We can acquire data through various means like hardware sensors, accelerogloves, cyber gloves, accelerometers or colored gloves.

Figure 2.1 Sensor Placement for Bi-Channel recognition system proposed by Kim

Figure 2.2 The Acceleglove

Figure 2.3 Left: Flock of Birds 3D motion tracker, Right: CyberGlove

Figure 2.4 Head mounted camera and accelerometer data collection

Figure 2.5 Colored Gloves

2.1.2. Vision Based Acquisition

I. Locating Hand

Hand detection and segmentation can be done with or without markers. Several markers are used in the literature, such as single colored gloves on each hand, or gloves with different colors on each finger or joint. With or without a marker, descriptors of color, motion and shape information, separately or together, can be used to detect hands in images. However, each source of information has its shortcomings and restrictions. The following table shows the shortcomings of each source of information and the assumptions or restrictions on the systems that use them.

Type of Information   Problems                                        Assumptions/Restrictions
Color cue             - Existence of other skin colored regions       - Long-sleeved clothing
                      - Contact of two hands                          - Excluding the face
                      - Identifying the left and right hands          - Only single hand usage
Motion cue            - Motion of objects other than the hands        - Stationary background
                      - Fast & highly variable motion of hand         - Hand moves with constant velocity
Shape cue             - High degree of freedom of the hand            - Restricting the hand shapes

a. Color based
Color information is used with the strong assumption that hands are the only skin regions in the camera view. Thus, users have to wear long-sleeved clothing to cover other skin regions such as the arms. Face detection can be applied to exclude the face from the image sequence, leaving the hands as the only skin regions. However, this approach ignores situations where the hand is in front of the face: a common and possible situation in sign languages. When there are two skin regions resulting from the two hands of the signer, the two biggest skin colored regions can be selected as the two hands. This approach will fail when the two hands are in contact, forming a single skin-colored region. Another problem is to decide which of these two regions corresponds to the right hand and which to the left. In some studies, it is assumed that users always use, or at least start with, their left hand on the left and right hand on the right. Starting with this assumption, an appropriate tracking algorithm can be used to track each region. However, when the tracking algorithm fails, the users need to re-initialize the system. Some of these problems can be solved by using motion and shape cues.
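As an illustration of this colour-cue approach, the sketch below (Python with OpenCV, not the project's own implementation) segments skin-coloured pixels in HSV space and blanks out the face found by a stock Haar cascade, leaving the hands as the dominant skin blobs. The HSV range and parameter values are assumptions that would need tuning per camera and lighting.

```python
import cv2
import numpy as np

def skin_mask_without_face(frame_bgr):
    """Binary mask of skin-coloured pixels with the face region removed."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Illustrative skin-colour range; real systems tune this per camera/lighting.
    mask = cv2.inRange(hsv, np.array([0, 40, 60]), np.array([25, 180, 255]))

    # Erase the face detected by a stock Haar cascade so only the hands remain.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, 1.3, 5):
        mask[y:y + h, x:x + w] = 0

    return cv2.medianBlur(mask, 5)   # drop small skin-coloured noise
```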

b. Motion based
Hand movements are relatively distinct from other movements because the hands move frequently, faster and in different directions. Other parts of the body also exhibit movement, but the displacement is smaller and in the same direction. The basic idea of the motion based technique is that a region that moves frequently is likely to be a hand. For this technique to work, however, there should be no moving objects in the background.
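A minimal frame-differencing sketch of this motion cue (the camera index and threshold are assumed values; this is illustrative, not the project's code):

```python
import cv2

cap = cv2.VideoCapture(0)                       # assumed camera index
ok, prev = cap.read()
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pixels that changed between consecutive frames are candidate moving regions.
    diff = cv2.absdiff(gray, prev)
    _, motion = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    prev = gray
    cv2.imshow("motion mask", motion)
    if cv2.waitKey(1) & 0xFF == 27:             # Esc quits
        break

cap.release()
cv2.destroyAllWindows()
```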

c. Shape based
The main disadvantage of using the shape information alone comes from the fact that the hand is a non-rigid object with a very high degree of freedom. If the number of hand shapes is limited, shape information can be robustly used for detection and segmentation, even during occlusion with the face or with the other hand. Thus, to achieve high classification accuracy of the hand shape, either the training set must contain all configurations that the hand may have in a sign language video, or the features must be invariant to rotation, translation, scale and deformation in 3D.

d. Hybrid model
Color, motion and shape based techniques all suffer from some disadvantages, so a hybrid approach incorporating all the techniques can be used to locate the hands. Systems that combine several cues for locating the hand have fewer restrictions and are more robust to changes in the environment.

II. Hand Tracking

a. MEI (Motion Energy Image)
Given a rich vocabulary of motions that are recognizable, an exhaustive matching search is not feasible, especially if real-time performance is desired. In keeping with the hypothesis-and-test paradigm, the first step is to construct an initial index into the known motion library. Calculating the index requires a data-driven, bottom-up computation that can suggest a small number of plausible motions to test further. Consider the example of someone sitting, as shown in the figure below. The top row contains key frames from a sitting sequence. The bottom row displays a cumulative binary motion energy image (MEI) sequence corresponding to the frames above. The MEIs highlight regions in the image where any form of motion was present. The summation of the square of consecutive image differences often provides a robust spatial motion distribution signal. Image differencing also permits real-time acquisition of the MEIs. As expected, the MEI sequence sweeps out a particular (and perhaps distinctive) region of the image.

Figure 2.6 Tracking using Motion Energy Image (MEI)


b. MHI (Motion History Image)
Consider Figure 2.7. This image captures the essence of the underlying motion pattern of someone sitting (sweeping down and back) superimposed on the corresponding MEI silhouette. Here, both where (MEI silhouette) the motion is happening and how (arrows) the motion is occurring are present in one compact template representation. This single image appears to contain the necessary information for determining how a person has moved during the action. The temporal motion information is collapsed into a single image template where intensity is a function of how recently the motion occurred. The resultant image yields a description similar to the "arrow" picture. To represent how motion is moving, the motion history image (MHI) is used. In an MHI, pixel intensity is a function of the motion history at that location, where brighter values correspond to more recent motion. A simple replacement and linear decay operator using the binary image difference frames can be used. In examples of MHIs for three actions (sit-down, arms-raise, crouch-down), the final motion locations appear brighter in the MHIs.

Figure 2.7 Tracking using Motion History Image

This technique can be used to find the velocity and direction of a moving blob (which in our case is the hand).
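A minimal NumPy sketch of the motion history idea described above (the duration constant is an arbitrary illustrative value, not one from this project): each new motion mask stamps the current timestamp into the MHI, older entries are forgotten, and pixel brightness then encodes how recently motion occurred.

```python
import numpy as np

MHI_DURATION = 1.0   # seconds a motion trace stays visible (illustrative value)

def update_mhi(mhi, motion_mask, timestamp):
    """Stamp current motion into the MHI and forget entries older than MHI_DURATION."""
    mhi[motion_mask > 0] = timestamp
    mhi[mhi < timestamp - MHI_DURATION] = 0
    return mhi

def mhi_to_image(mhi, timestamp):
    """Scale the MHI so that more recent motion appears brighter (0-255)."""
    scaled = np.clip((mhi - (timestamp - MHI_DURATION)) / MHI_DURATION, 0.0, 1.0)
    return (scaled * 255).astype(np.uint8)
```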


c. Connected Component
In connected component tracking, similar image components are found in consecutive frames. In video surveillance observing the movement of people in an airport, for example, the background is still, so in two consecutive frames the background does not appear as a moving connected component, while the people in the foreground change their position from frame to frame as they move. Moving blobs (the hands, in our project) can therefore be found by finding the connected components.
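Connected component labelling is available directly in OpenCV; the sketch below keeps only the largest non-background component of a binary mask, which is one simple way to isolate a hand blob (illustrative, with an 8-connectivity assumption).

```python
import cv2
import numpy as np

def largest_blob(binary_mask):
    """Return a mask containing only the largest connected component."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary_mask, connectivity=8)
    if n < 2:                                   # only background present
        return np.zeros_like(binary_mask)
    # stats[:, cv2.CC_STAT_AREA] holds each component's pixel count; label 0 is background.
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    return np.where(labels == largest, 255, 0).astype(np.uint8)
```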

d. Mean-Shift
Mean shift is a powerful and versatile non-parametric iterative algorithm that can be used for a lot of purposes like finding modes, clustering, etc. Mean shift was introduced by Fukunaga and Hostetler in 1975 and has been extended to be applicable in other fields like computer vision.

Mean Shift Tracking Procedure
The mean-shift algorithm is a robust method of finding local extrema in the density distribution of a data set. This is an easy process for continuous distributions; in that context, it is essentially just hill climbing applied to a density histogram of the data. For discrete data sets, however, this is a somewhat less trivial problem. The descriptor “robust” is used here in its formal statistical sense; that is, mean-shift ignores outliers in the data. This means that it ignores data points that are far away from peaks in the data. It does so by processing only those points within a local window of the data and then moving that window. The mean-shift algorithm runs as follows (a small tracking sketch is given after the steps):
i. Choose a search window:
   a. its initial location,
   b. its type (uniform, polynomial, exponential or Gaussian),
   c. its shape (symmetric or skewed, possibly rotated, rounded or rectangular),
   d. its size (the extent at which it rolls off or is cut off).
ii. Compute the window’s (possibly weighted) center of mass.
iii. Center the window at the center of mass.
iv. Return to step ii until the window stops moving (it always will).
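In practice the steps above are usually applied to the back-projection of a hue histogram of the initially detected hand; the sketch below does exactly that with OpenCV's built-in meanShift. The initial window, histogram settings and key handling are assumptions made only for illustration.

```python
import cv2

def track_with_meanshift(cap, init_window):
    """Track the region init_window = (x, y, w, h) across frames from cap with mean shift."""
    ok, frame = cap.read()
    x, y, w, h = init_window
    hsv_roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    # Hue histogram of the initial hand region; its back-projection turns every later
    # frame into a "probability of hand" image that mean shift climbs.
    hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

    window = init_window
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        backproj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        _, window = cv2.meanShift(backproj, window, term)   # steps i-iv of the procedure
        x, y, w, h = window
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow("mean shift tracking", frame)
        if cv2.waitKey(1) & 0xFF == 27:
            break
```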

e. Camshift
Another similar variation of the tracking algorithm is the camshift algorithm. It differs from mean shift in that the search window adjusts itself in size. If you have well-segmented distributions (say face features that stay compact), then this algorithm will automatically adjust itself for the size of the face as the person moves closer to and further from the camera. Many people think of mean-shift and camshift as tracking using color features, but this is not entirely correct. Both of these algorithms track the distribution of any kind of feature that is expressed in the source image; hence they make for very lightweight, robust, and efficient trackers.

f. Kalman Filter
A Kalman filter is simply an optimal recursive data processing algorithm. There are many ways of defining optimal, dependent upon the criteria chosen to evaluate performance. The Kalman filter is optimal with respect to virtually any criterion that makes sense. One aspect of this optimality is that the Kalman filter incorporates all information that can be provided to it. It processes all available measurements, regardless of their precision, to estimate the current value of the variables of interest, using (i) knowledge of the system and measurement device dynamics, (ii) the statistical description of the system noises, measurement errors, and uncertainty in the dynamics models, and (iii) any available information about the initial conditions of the variables of interest.
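For smoothing a noisy hand-centroid trajectory, a constant-velocity Kalman filter such as the sketch below (state = position and velocity, measurement = detected centroid) is the standard recipe; the noise covariances here are illustrative guesses, not tuned values from this project.

```python
import cv2
import numpy as np

def make_point_kalman():
    """Constant-velocity Kalman filter for a 2D point: state (x, y, vx, vy), measurement (x, y)."""
    kf = cv2.KalmanFilter(4, 2)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3      # assumed noise levels
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
    return kf

# Usage per frame: predicted = kf.predict(); when a centroid (cx, cy) is detected,
# kf.correct(np.array([[cx], [cy]], np.float32)) refines the estimate.
```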

g. Particle Filter
One fundamental assumption of the Kalman filter is that the distributions of the state variables are Gaussian. Particle filters are sophisticated model estimation techniques based on simulations which generate a set of random samples (particles) with weights (sampling probabilities) that approximate the filtering distribution. The weight of each sample is used to define its importance, which can also be seen as the observation frequency.

h. CCMSPF (Connected Component Mean Shift Particle Filter)
As the name suggests, this is a hybrid technique adopted by OpenCV (an image processing library) using the connected component method, the mean shift algorithm and the particle filter method. The core logic behind this method is that connected component tracking efficiently tracks blobs which are not likely to occlude each other's paths. When occlusion is likely to occur, the particle filtering method is applied based on the weights given by the mean shift algorithm. The particle filtering method can predict the position of a blob after the occlusion is over.

III. Other Image Processing Techniques

a. Erosion and Dilation
The basic morphological transformations are called dilation and erosion, and they arise in a wide variety of contexts such as removing noise, isolating individual elements, and joining disparate elements in an image. Morphology can also be used to find intensity bumps or holes in an image and to find image gradients. Dilation is a convolution of some image (or region of an image), which we will call A, with some kernel, which we will call B. The kernel, which can be any shape or size, has a single defined anchor point. Most often, the kernel is a small solid square or disk with the anchor point at the center. The kernel can be thought of as a template or mask, and its effect for dilation is that of a local maximum operator. As the kernel B is scanned over the image, we compute the maximal pixel value overlapped by B and replace the image pixel under the anchor point with that maximal value. This causes bright regions within an image to grow. This growth is the origin of the term “dilation operator”.
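A short sketch of erosion and dilation on a binary hand mask (the file name, kernel size and iteration counts are arbitrary choices for illustration):

```python
import cv2
import numpy as np

mask = cv2.imread("hand_mask.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input file
kernel = np.ones((3, 3), np.uint8)

eroded = cv2.erode(mask, kernel, iterations=1)    # shrinks bright regions, removes speckle noise
dilated = cv2.dilate(mask, kernel, iterations=1)  # grows bright regions, fills small holes
# Opening (erode then dilate) removes small noise while roughly preserving blob size.
opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```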


b. Histogram
An image histogram [2] is a type of histogram that acts as a graphical representation of the tonal distribution in a digital image. It plots the number of pixels for each tonal value. By looking at the histogram for a specific image a viewer will be able to judge the entire tonal distribution at a glance.

c. Contour Convexity and Convexity Defect
Another useful way of comprehending the shape of an object or contour is to compute a convex hull for the object and then compute its convexity defects. The shapes of many complex objects are well characterized by such defects. The figure below illustrates the concept of a convexity defect using an image of a human hand. The convex hull is pictured as a dark line around the hand, and the regions labeled A through H are each “defects” relative to that hull. As you can see, these convexity defects offer a means of characterizing not only the hand itself but also the state of the hand.

Figure 2.8 Convexity defects for finger count. The dark contour line is a convex hull around the hand; the gridded regions (A–H) are convexity defects in the hand contour relative to the convex hull.
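Convexity defects can be computed directly from a contour, and the number of deep defects gives a rough finger count, as the figure suggests. The sketch below assumes a clean binary hand mask; the depth threshold is an arbitrary illustrative value.

```python
import cv2
import numpy as np

def count_deep_defects(binary_mask, depth_thresh=20.0):
    """Count convexity defects deeper than depth_thresh pixels on the largest contour."""
    contours, _ = cv2.findContours(binary_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return 0
    cnt = max(contours, key=cv2.contourArea)
    hull = cv2.convexHull(cnt, returnPoints=False)
    defects = cv2.convexityDefects(cnt, hull)
    if defects is None:
        return 0
    depths = defects[:, 0, 3] / 256.0            # depths are stored as fixed-point * 256
    return int(np.sum(depths > depth_thresh))
```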

d. Image Binarization
Once we obtain a gray scale image of an object by some processing technique, we often only need to know whether a pixel is white or black. For example, in SLR systems a binary image of the hand contour is enough to extract useful information. To obtain the binary image from the gray scale image we use a threshold. For example, if the maximum gray level is 255, pixels above 127 can be treated as white (255) and others can be treated as black (0).
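In OpenCV this is a single call; the sketch shows both the fixed threshold described above and Otsu's automatic threshold (the input file name is a placeholder).

```python
import cv2

gray = cv2.imread("hand_gray.png", cv2.IMREAD_GRAYSCALE)    # hypothetical input

# Fixed threshold as described in the text: above 127 becomes white (255), else black (0).
_, binary_fixed = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)

# Otsu's method chooses the threshold automatically from the image histogram.
_, binary_otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```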

e. Canny
The Laplace method for finding edges was further refined by J. Canny in 1986 into what is now commonly called the Canny edge detector. One of the differences between the Canny algorithm and the simpler, Laplace-based algorithm is that, in the Canny algorithm, the first derivatives are computed in x and y and then combined into four directional derivatives. The points where these directional derivatives are local maxima are then candidates for assembling into edges.

f. Background Subtraction
In order to perform background subtraction, we first must “learn” a model of the background. Once learned, this background model is compared against the current image and then the known background parts are subtracted away. The objects left after subtraction are presumably new foreground objects.

g. Averaging Background Method
One simple modification of the basic principle is the averaging background method, which learns the average and standard deviation (or similarly, but computationally faster, the average difference) of each pixel as its model of the background. A single frame cannot adequately represent the background in cases where there is movement in the background or where lighting conditions are slightly changing. In such situations, a good solution is to accumulate a considerable number of frames and take the average of all the frames as the representative of the background.
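The averaging background idea maps onto OpenCV's running average: frames are accumulated into a floating-point background model and the current frame is compared against it to get a foreground mask. The camera index, learning rate and threshold below are assumed values for illustration only.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture(0)                      # assumed camera index
ok, frame = cap.read()
background = np.float32(frame)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Running average: a small alpha means the model adapts slowly to change.
    cv2.accumulateWeighted(frame, background, alpha=0.02)
    diff = cv2.absdiff(frame, cv2.convertScaleAbs(background))
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    _, foreground = cv2.threshold(gray, 30, 255, cv2.THRESH_BINARY)
    cv2.imshow("foreground", foreground)
    if cv2.waitKey(1) & 0xFF == 27:
        break

cap.release()
cv2.destroyAllWindows()
```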


h. Template Matching
Template matching is not based on histograms; rather, the function matches an actual image patch against an input image by “sliding” the patch over the input image. As in Figure 2.9, if we have an image patch containing a face, we can slide that patch over an input image looking for strong matches that would indicate another face is present.

Figure 2.9 Template matching that sweeps a template image patch across another image looking for matches.
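OpenCV's matchTemplate slides the patch over the image and returns a similarity score at every position; the best location is read off with minMaxLoc (file names here are placeholders).

```python
import cv2

image = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)        # hypothetical inputs
template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)

# Normalised cross-correlation: higher scores mean a better match.
scores = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(scores)

h, w = template.shape
print("best match score", max_val, "with top-left corner at", max_loc,
      "and bottom-right corner at", (max_loc[0] + w, max_loc[1] + h))
```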

i. Haar Cascade

Figure 2.10 Haar Cascade

The core basis for Haar classifier object detection is the Haar-like features. These features, rather than using the intensity values of a pixel, use the change in contrast values between adjacent rectangular groups of pixels. The contrast variances between the pixel groups are used to determine relative light and dark areas. Two or three adjacent groups with a relative contrast variance form a Haar-like feature. Haar-like features, as shown in Figure 2.10, are used to detect objects in an image. Haar features can easily be scaled by increasing or decreasing the size of the pixel group being examined. This allows features to be used to detect objects of various sizes.
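OpenCV ships trained Haar cascades; the sketch below runs the stock frontal-face cascade, which is how the face can be located (and, for example, excluded from a skin mask) in a vision-based pipeline. The file name and detection parameters are assumptions.

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("signer.png")               # hypothetical input image
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# scaleFactor and minNeighbors trade off speed against detection robustness.
for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5):
    cv2.rectangle(frame, (x, y), (x + w, y + h), (255, 0, 0), 2)
```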

2.2. Hand Features
In this section we review the shape representation and pattern recognition framework used to classify segmented hand images. Under data acquisition, we discussed a variety of different hand segmentation techniques. These techniques can be utilized to extract binary images of a user’s hand from a video stream. In order to utilize the hand segmentation methods to their full potential, a technique which can accurately classify hand postures from the hand segmentation data is needed. In this section, we will see how hand shape representations are computed from a segmented hand contour. Our proposed hand shape features are developed to utilize computer vision based hand segmentation data.

2.2.1. Shape Representations
Appearance based gesture recognition requires an accurate extraction of an effective feature set that can separate the hand shapes. Accurate shape representations must be able to identify similar shapes and distinguish between different shapes.

2.2.2. Hu Moments
In image processing, computer vision and related fields, an image moment is a certain particular weighted average (moment) of the image pixels' intensities, or a function of such moments, usually chosen to have some attractive property or interpretation. Image moments are useful to describe objects after segmentation. Simple properties of the image which are found via image moments include its area (or total intensity), its centroid, and information about its orientation.

Hu Moments [5], which are a reformulation of the non-orthogonal centralized moments, are a set of translation, scale and rotation invariant moments. The set of Hu moments, I = {I1, I2, I3, I4, I5, I6, I7}, is calculated from the hand contour.

The first one, I1, is analogous to the moment of inertia around the image's centroid, where the pixels' intensities are analogous to physical density. The last one, I7, is skew invariant, which enables it to distinguish mirror images of otherwise identical images. A general theory on deriving complete and independent sets of rotation invariant moments was proposed by J. Flusser and T. Suk. They showed that the traditional Hu invariant set is neither independent nor complete: I2 and I3 are not very useful for pattern recognition, as they are dependent, and the original Hu set is missing a third order independent moment invariant.
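OpenCV computes the seven Hu moments directly from the image moments; a common extra step is a log transform so that the widely varying magnitudes end up on a comparable scale. A small sketch, assuming a binary hand image as input:

```python
import cv2
import numpy as np

def hu_feature_vector(binary_mask):
    """Seven log-scaled Hu moments of a binary hand image."""
    m = cv2.moments(binary_mask, binaryImage=True)
    hu = cv2.HuMoments(m).flatten()
    # Log transform keeps the widely varying magnitudes on a comparable scale.
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)
```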

2.2.3. Principal Component Analysis
Principal Component Analysis (PCA) [4] is one of the most successful techniques that have been used in image recognition and compression. PCA is a statistical method under the broad title of factor analysis. The purpose of PCA is to reduce the large dimensionality of the data space (observed variables) to the smaller intrinsic dimensionality of the feature space (independent variables), which is needed to describe the data economically. This is the case when there is a strong correlation between observed variables. The jobs which PCA can do are prediction, redundancy removal, feature extraction, data compression, etc. Because PCA is a classical technique which works in the linear domain, applications having linear models are suitable, such as signal processing, image processing, system and control theory, communications, etc.

Application: When using these sorts of matrix techniques in computer vision, we must consider the representation of images. A square N by N image can be expressed as an N²-dimensional vector where the rows of pixels in the image are placed one after the other to form a one-dimensional image. The values in the vector are the intensity values of the image, possibly a single grayscale value.

a. PCA to find patterns
i. Say each processed image is of size (N*N).
ii. Take all the pixels of the image in a row, forming a (1*N²) matrix.
iii. Stack 'm' such row images to form an input image matrix of size (m*N²).
iv. Find the mean image of the matrix.
v. Make the input image matrix mean centered.
vi. Find the covariance matrix (C) of size (N²*N²).
vii. Compute the eigenvectors and eigenvalues of the covariance matrix. The size of the eigenvector matrix is (N²*N²) if m >= N², else ((m-1)*N²). The eigenvectors are the eigenimages and define the new eigenspace. The size of the eigenvalues is (N²*1) if m >= N², else ((m-1)*1).
viii. Get the projected images by projecting the input images into the new eigenspace defined by taking the 'k' eigenvectors with the highest eigenvalues.
ix. The difference between the projections of a test image and of the training images gives a distance measure which can be used for classification.
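The eigen-image procedure above maps directly onto a few NumPy calls; the sketch below builds the eigenspace from an m×N² training matrix using the usual small-covariance trick and projects images onto the top k eigenvectors. Shapes and names are illustrative.

```python
import numpy as np

def fit_pca(train_images, k):
    """train_images: (m, N*N) array of flattened images; returns mean and top-k eigen-images."""
    mean = train_images.mean(axis=0)
    centered = train_images - mean
    # Economy trick: eigenvectors of the small (m x m) matrix yield the eigen-images.
    eigvals, eigvecs_small = np.linalg.eigh(centered @ centered.T)
    order = np.argsort(eigvals)[::-1][:k]
    eigen_images = centered.T @ eigvecs_small[:, order]          # shape (N*N, k)
    eigen_images /= np.linalg.norm(eigen_images, axis=0, keepdims=True)
    return mean, eigen_images

def project(images, mean, eigen_images):
    """Project flattened images into the k-dimensional eigenspace."""
    return (images - mean) @ eigen_images
```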

b. PCA for finding the direction of data variance
Given a set of data points, PCA finds the direction of largest variance, called PC1 (Principal Component 1), and further orthonormal directions of smaller variance called PC2, PC3 and so on. The eigenvector with the highest eigenvalue defines PC1, the eigenvector with the next smaller eigenvalue defines PC2, and so on.

Figure 2.11 PC1 and PC2 of a set of data points.

2.3. Pattern Recognition

Figure 2.12 A conventional pattern recognition procedure

Pattern recognition [6] deals with the mathematical and technical aspects of classifying objects through their observable information, such as the grey levels of pixels for an image, energy levels in the frequency domain for a waveform, or the percentage of certain contents in a product. In pattern recognition, the observable information of an unknown object is first transduced into signals that can be analyzed by computer systems. Features suitable for classification are then extracted from the collected signals. The extracted features are classified in the final step based on certain types of measures, such as distance, likelihood and Bayesian measures, over class models.

Some Basic Concepts

Pattern: A pattern is a quantitative or structural description of an object or some other entity of interest. It is usually arranged in the form of a feature vector. The requirement on features is that they can reflect the characteristics of the desired objects.

Class: A class is a set of patterns that share common properties. The feature vectors of the same type of objects will naturally form one set. Due to the diversity of the objects, the patterns extracted from the same type of objects are seldom identical.

Classification Criterion: The classification criterion is also called the decision rule. The most widely used classification criteria are distance, the Bayes decision rule and likelihood.

Classifier: A classifier creates a series of functions g_i(x), i = 1, ..., k, as the input-output function; these are called discriminant functions.

2.3.1. SVM (Support Vector Machine)
Support Vector Machines (SVM) [7] are a statistical learning technique that can be seen as a new method for training classifiers based on polynomial functions, radial basis functions, neural networks, splines or other functions. Support vector machines use a linear separating hyperplane to create a classifier. For problems that cannot be linearly separated in the input space, this machine offers the possibility of finding a solution by making a non-linear transformation of the original input space into a high dimensional feature space, where an optimal separating hyperplane can be found. Those separating planes are optimal, which means that a maximal margin classifier with respect to the training data set is obtained. SVM was developed by Vladimir Vapnik and co-workers at AT&T Bell Laboratories in 1995. A classification task usually involves separating data into training and testing sets. Each instance in the training set contains one “target value" (i.e. the class label) and several “attributes" (i.e. the features or observed variables). The goal of SVM is to produce a model (based on the training data) which predicts the target values of the test data given only the test data attributes.

A Two-Dimensional Example:

Figure 2.13 Examples of hyper planes separating data points of two classes.

Kernel Function
The simplest way to divide two groups is with a straight line, flat plane or an N dimensional hyper plane. But what if the points are separated by a nonlinear region, as shown in Figure 2.14? In this case we need a nonlinear dividing line.


Figure 2.14 Irregular line separating two classes

Rather than fitting nonlinear curves to the data, SVM handles this by using a kernel function to map the data into a different space where a hyper plane can be used to do the separation.

Figure 2.15 Kernel function transforms data into a higher dimension, obtaining a linear hyper plane

The concept of a kernel mapping function is very powerful. It allows SVM models to perform separations even with very complex boundaries.
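To make the idea concrete, the small sketch below (illustrative only) computes the widely used RBF (Gaussian) kernel value K(x, z) = exp(-gamma * ||x - z||^2) between two feature vectors; the kernel value plays the role of a dot product in the higher dimensional space without ever computing the mapping explicitly.

using System;

static class Kernels
{
    // RBF (Gaussian) kernel: K(x, z) = exp(-gamma * ||x - z||^2).
    public static double Rbf(double[] x, double[] z, double gamma)
    {
        double squaredDistance = 0;
        for (int i = 0; i < x.Length; i++)
        {
            double diff = x[i] - z[i];
            squaredDistance += diff * diff;
        }
        return Math.Exp(-gamma * squaredDistance);
    }
}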

2.4. Text To Speech
The .NET Framework provides a speech synthesizer for the English language. When the same API is applied to the Nepali language, the true natural Nepali sound cannot be obtained. This is due to the lack of some phonemes in English; for example, English has no phoneme for the 2nd Nepali consonant "ख". The other, easier approach is to load "wav" files and play the sound: the required sounds are stored on disk and played as needed. .NET has a namespace, System.Media, which brings audio support. Previously the common way to add sounds to an application was to use the PlaySound API (either invoking it ourselves or using one of many wrappers). Now that the framework itself has built-in support, which matches the desktop .NET v2.0 framework, it makes sense to standardize on the SoundPlayer component.

In this approach, when we specify a filename, the call does not simply pass the filename through to the native APIs; the file is first loaded into memory and then the WaveOut APIs are used to play it. This means that if we simply create a new SoundPlayer and call Play, the file will not yet be loaded: the Play method first has to load the file contents and only then play the sound, creating a noticeable delay before the sound is heard. The class allows us to load the file at any time prior to calling Play, using either Load or LoadAsync. Once the file is loaded, the Play method can immediately begin playing it. Exactly where we call Load/LoadAsync depends on the application design. Keeping a large audio file cached ties up valuable memory, so a SoundPlayer instance should be disposed once the work with it has finished.
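A minimal example of this pattern with System.Media.SoundPlayer is shown below (the file path is illustrative); the file is loaded up front so that playback starts without a noticeable delay.

using System.Media;

class AudioDemo
{
    static void Main()
    {
        // Load the WAV file into memory ahead of time, then play it on demand.
        using (var player = new SoundPlayer(@"sounds\kha.wav")) // path is illustrative
        {
            player.Load();      // blocking load; LoadAsync() is the non-blocking variant
            player.PlaySync();  // Play() would return immediately and play on another thread
        }
    }
}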


3. SYSTEM DEVELOPMENT AND METHODOLOGY
3.1. Preliminary investigation phase
3.1.1. Problem in existing system
Due to the multimodal nature of sign language, sign language recognition is a multidisciplinary research area involving pattern recognition, machine learning, computer vision, natural language processing and linguistics. Since the research area is very large and complex, there are many problems both in sign language itself and in the existing systems that try to implement sign language recognition.
i. Sign language is not universal. It varies from nation to nation and from locality to locality.
ii. There is no complete and robust SLR system. SLR systems are usually imposed with many constraints, typically related to the signing environment and clothing.
iii. Most systems acquire data through either hardware based or vision based techniques using colored gloves. Very few works deal with the bare hand.
iv. Movement epenthesis (the problem of distinguishing a valid sign from an invalid sign during the transition from one sign to another during continuous testing).
v. It is difficult to collect large training and testing datasets. Pattern recognition usually demands a very large dataset for good performance. On top of that, the training dataset should ideally represent the actual signs.
vi. It is difficult to implement unsupervised training (training with no manual input specifying which class a feature vector belongs to).
vii. It is difficult to integrate the combined meaning of manual and non-manual signs. The meaning expressed by a manual sign can be changed by non-manual signs, making it necessary to interpret data on multiple channels.
viii. No system has a complete vocabulary. This is because the vocabulary itself is quite large and some signs cannot be differentiated when viewed from a single camera.
ix. There are no benchmark databases to evaluate the performance of an SLR system.
x. It is difficult to achieve a signer independent recognition framework due to variations in hand shapes and color of signers.


xi. The environment in which the system is used has many constraints, like uniform lighting, uniform clothing, and a uniform background with no moving objects.
xii. It is difficult to achieve real-time performance due to the computational complexity of evaluating multiple channels of information.
xiii. It is difficult to perform segmentation during occlusion between the two hands and between hands and face.
xiv. It is impractical to use multiple cameras, while it is difficult to differentiate between similar signs from 2D information from a single camera.
xv. There is very little research in this field.
xvi. A large vocabulary reduces the inter-class separation, due to which high accuracy is difficult to achieve.

3.1.2. Scope of project
Automated communication between normal people and hearing impaired people is a very large project consisting of sub-systems to handle two modes of communication. The two sub-systems are the sign language to speech converter and the speech to 3D modeler. Our focus is on the sign language to speech converter. This sub-system further consists of three main subsystems: the sign language to text converter, NLP (Natural Language Processing) and the speech synthesizer. NLP is a huge area involving detailed knowledge of the grammar of the Nepali language (the language in which the speech will be generated). Due to time constraints, NLP of the text lies outside the scope of this project.

Figure 3.1 Automated communication system between normal and hearing impaired people (highlighted in light grey is our working domain)

The following figure shows the overview of sign language; marked in light grey are the fields which our system deals with. Sign language consists of manual gestures and non-manual gestures. A manual gesture further consists of hand postures and spatio-temporal gestures. Spatio-temporal gestures have space and time properties which make them very complicated to model. So, taking the time constraint into account, we will be working on hand posture recognition only.

Figure 3.2 Complete domain of the sign language (highlighted in light grey is our working domain)

3.1.3. Project Worth
One of the main uses proposed for a sign language recognition system is a sign to text conversion system. This would require the complete translation of signed sentences to the text, or speech, of a spoken language. Such a translation system is not the only use for sign language recognition. Other envisaged applications include translation systems for specific transactional domains such as post offices and banks. Another application is a bandwidth conserving system allowing communication between signers, where recognized signs, which are the input of the communication system at one end, can be translated to avatar based animations at the other. An additional proposed application is an automated sign language teaching system.

It could support users suffering from hearing loss, deaf people with sign language deficiencies and hearing people wishing to learn sign language. Other envisaged applications include an automated, or semi-automated, system for the annotation of video databases of native signing. Linguistic research on sign language requires large scale annotated corpora, and automated methods of analyzing sign language videos would greatly improve annotation efficiency. Finally, sign language recognition systems could be incorporated into applications which enable an input interface for augmented communication systems. Assistive technology implemented for human to human communication by people with speech impairments often requires keyboard, mouse and joystick inputs. Systems which could incorporate natural aspects of sign language would increase the accessibility of these systems.

The techniques used in this project are not limited to sign language recognition. They have the potential to be applied to other problems that focus on human motion modeling and recognition, such as gesture controlled Human Computer Interface (HCI) systems, human action analysis and social interaction analysis. Such a wide area of application shows that the project is worth working on. But before that we need to consider some other factors to see whether the project is feasible.

I. Operational Feasibility

Performance
During continuous testing, it would be quite awkward if there were a large delay between signing and speech production, so real-time performance is needed. Our proposed system adopts better techniques, especially for image processing and pattern recognition, to ensure quick prediction of a sign.

Economy
Most SLR systems make use of accessories for data acquisition. Some systems require instrumented gloves which are costly and not readily available in our market, making the whole project infeasible.


Our proposed system avoids all such accessories, so the signer can sign with a bare hand. The cost of the system is therefore just the software development cost, making it economically feasible.

Accuracy
Accuracy is one of the critical aspects which has to be considered for an SLR system. Most of the existing systems suffer from poor accuracy. Signer dependent tests will have high accuracy in our system, while signer independent tests will have slightly lower accuracy due to variation in the shape and size of the hands of different signers.

Ease of use
Since our proposed system is totally GUI based, it provides the end user with an easy to use interface. A user can customize the system by himself/herself to achieve better performance.

Environment
The environment in which the signer uses our proposed system has some conditions imposed (due to difficulty in image processing): there should be no moving objects in the background, there should be uniform lighting and the signer has to wear a shirt of uniform color. But as this system is currently targeted at indoor application, all these conditions can easily be achieved.

II. Technical Feasibility

Today very little is technically impossible. The technologies in the related fields are mature enough to be applied to our problem. It is certainly a large project involving different and vast fields like image processing and pattern recognition, but the growing body of research in those fields will help us in realizing the system.


3.1.4. Conclusion
The preliminary investigation phase gave us a brief idea of and information about the proposed system. From this phase, we came to the conclusion that the project undertaken is feasible technically, operationally and environmentally. The information gathered will also be helpful for further study of the system.

3.2. General project description
3.2.1. Requirement description
Following is the list of requirements.
i. The upper half portion of the body of a signer sitting in front of a web camera has to be detected.
ii. Hand postures of Nepali Sign Language will be recognized.
iii. The recognized text will be converted to speech.
iv. A signer can easily customize the system by providing his training data (images).
v. An easy to use GUI for quick training and testing.

3.3. Problem Analysis phase
3.3.1. The problem domain
Our system consists of many different problems to be addressed.
i. Get image frames from the video grabbed by the webcam
The web camera provides us with a real-time video of the signer. We need to get each frame from the video for further processing.
ii. Face detection and elimination
The face also carries non-manual gestures. But the scope of this project is limited to manual hand postures, so the face has to be detected and eliminated.

iii. Hand detection and segmentation
The hand contains all the information needed to interpret a posture. Proper segmentation has to be done to detect the region containing the hand (palm).
iv. Normalize the segmented palm
The detected palm has to be represented in some standard form by normalizing it.
v. Feature extraction
A hand shape has to be represented mathematically by a vector of features for classification. This is achieved by feature extraction.
vi. Recognition framework
A framework has to be constructed that recognizes the respective class (in text) of a given vector of features.
vii. Text to speech converter
The recognized class in text form has to be converted to speech.

3.3.2. System improvement objective and constraints
Improvements
i. Data acquisition by a vision based method with the bare hand, without the need of accessories.
Most SLR systems make use of accessories for data acquisition. Some systems even require instrumented gloves which are costly and not readily available in our market, while others require colored gloves to identify the hands. Bare hand detection is one of the major improvements of our proposed system.
ii. Feature extraction by multiple techniques to train multiple classifier models.
Most research is based on the features of only a single feature extraction technique. Doing so can lead to improper classification of signs, because the set of features extracted by one technique may predict a class with the highest, but still very low, probability estimate. This observation may not be repeated if we take another set of features extracted by another technique for the same hand shape. Our proposed system makes use of three feature extraction techniques. The final output of the system is the one weighted by taking into account the probability estimate and cross validation accuracy of each technique.
iii. Significantly ensure user independency.
For an SLR system to work, it has to be trained with data taken from certain registered users. But different users have different hand sizes and shapes, resulting in user dependency. This means a system trained for registered users is likely to give high accuracy for those users but shows low accuracy for unregistered users. So, user independency is one of the ideal goals of every SLR system. We aim to ensure user independency by being careful in image processing and in choosing a proper feature extraction technique.
iv. Relatively larger vocabulary of hand postures compared to existing systems.
A high vocabulary is one of the major challenges. Signs that appear to be distinct in 3D may not be so in 2D. The variation in the signing process by different users shifts the class boundary, resulting in small inter-class separation. All these reasons make it difficult to maintain a large vocabulary. Most of the systems dealing with ASL handle 5 to 23 signs. Our proposed system will be able to increase that number to 36.
v. Use of a single camera.
Some signs that appear to be distinct in 3D may not be so in 2D. To overcome this problem, some systems make use of multiple cameras (one in front and one mounted on the head). This increases system performance but is quite impractical. We will be using only an ordinary web camera (provided in most laptops).


Constraints
i. Indoor environment.
ii. Static background with uniform lighting.
iii. Difficult to differentiate between similar signs from 2D information from a single camera.
iv. Difficult to identify frames constituting movement epenthesis (the period of shift in between two different hand postures) during continuous testing.
v. No existing databases, resulting in a small training set adversely affecting system accuracy.

3.4. Decision Analysis phase
3.4.1. Identify candidate solution
a) Candidate solution for face detection and elimination
i. Haar cascade
ii. Template matching
b) Candidate solution for hand detection
i. Something that lies within a specified skin color range is presumably hands and face
ii. Something that remains after elimination of background, clothing and face is a hand
iii. Edge detection
iv. Something that is in motion is a hand
v. Something that is not part of the recorded background is a hand
vi. Template matching of known hand shapes
vii. Manually specifying the hand region during system initialization
c) Candidate solution for hand tracking
i. Cam shift hand tracking using the histogram of the hand
ii. Motion Energy Image (MEI) / Motion History Image (MHI)
iii. Connected Component Tracking
iv. Particle Filter
v. Kalman Filter
vi. CCMSPF (Connected Component Mean Shift Particle Filter)
d) Candidate solution for segmentation
i. Color based segmentation
ii. Motion based segmentation
iii. Mean Shift Clustering
e) Candidate solution for Feature Extraction
i. Geometrical Features
ii. PCA (Principal Component Analysis)
iii. HU moment
iv. Binary Image
f) Candidate solution for Recognition Framework
i. SVM (Support Vector Machine)
ii. Neural Network
g) Candidate solution for Text to Speech Converter
i. Speech Synthesizer
ii. Play recorded audio

3.4.2. Analyze candidate solution
I. Analysis of candidate solution for face detection and elimination
a. Template matching
i. Template matching is not based on histograms; rather, an actual image patch is matched against the input image by "sliding" the patch over it, so matching can take a fairly long time (see the sketch after this list).

ii. This technique uses the intensity values of the pixels.
iii. The matching process is simple and easy to understand.
iv. The template to be matched has to be taken for each signer. So, to work for multiple signers, the whole matching process has to be repeated.
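To illustrate why the sliding-window comparison is slow (a hedged sketch in plain C#, not the project's implementation and not the library call), the code below evaluates the sum of squared differences at every position and keeps the best one.

static class TemplateMatcher
{
    // Finds the top-left position where the template best matches the image (SSD criterion).
    // image and template are grayscale intensity arrays indexed as [row, col].
    public static int[] Match(byte[,] image, byte[,] template)
    {
        int ih = image.GetLength(0), iw = image.GetLength(1);
        int th = template.GetLength(0), tw = template.GetLength(1);
        long bestScore = long.MaxValue;
        int bestRow = 0, bestCol = 0;

        for (int r = 0; r <= ih - th; r++)
            for (int c = 0; c <= iw - tw; c++)
            {
                long ssd = 0;
                for (int y = 0; y < th; y++)
                    for (int x = 0; x < tw; x++)
                    {
                        int diff = image[r + y, c + x] - template[y, x];
                        ssd += diff * diff;
                    }
                if (ssd < bestScore) { bestScore = ssd; bestRow = r; bestCol = c; }
            }

        return new[] { bestRow, bestCol };
    }
}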

b. Haar cascade
i. This technique uses the change in contrast values between adjacent rectangular groups of pixels.
ii. Once the haar feature xml file representing all faces is made, multiple iterations as in template matching are not needed to detect the faces of multiple signers.

II. Candidate solution for hand detection
a. Something that lies within a specified skin color range is presumably hands and face
i. A simple technique to understand and implement, as it only demands a color range to filter out.
ii. The colors of hands and face are similar, thus demanding the face detection and elimination process.
iii. It is often difficult to give a color range that works for all users due to the variation in skin color.
iv. Something in the background which lies within the specified color range can also be treated as a hand.

b. Something that remains after elimination of background, clothing and face is a hand
i. The background is defined as the real background without the signer in it.
ii. A background of any color can be subtracted to obtain the foreground mask containing the hand, face and shirt region.
iii. The face is eliminated by the face detection process.
iv. The shirt is eliminated by manually picking the shirt color. But the shirt should be of uniform color.
v. Other noise can be eliminated by taking the blob with the biggest contour perimeter.

c. Edge detection
i. An edge detection algorithm like the Canny edge algorithm can be used.
ii. Canny edges give an overview of the hand contour when viewed by the human eye, but small edges inside and outside the hand region make it difficult to extract the hand contour.

d. Something that is in motion is a hand
i. In this technique, the background is defined as the real background items, including the signer with his hands laid down with no or very little movement.
ii. Background subtraction provides us with a foreground mask consisting of the hand and noise.
iii. If the signer moves only slightly away from the accumulated background, no further processing is needed to filter the hand.
iv. If the signer's movement is larger, manual selection of colors must be done to filter out things that are not recorded in the background model.

e. Template matching of known hand shapes
i. The whole frame grabbed by the camera can be matched against a small patch consisting of possible hand shapes.
ii. The process is complicated compared to problems like matching fixed shape objects such as cars, people and faces, since the hand can be in any shape.
iii. The comparison process can be aided by a decision tree which provides us with the subset of hand shapes to be compared.

f. Manually specifying the hand region during system initialization
i. This process imposes a constraint during system initialization. The signer has to sit with his hands wide open for the first few frames and the user must manually define a region containing the hand.
ii. This technique has to be accompanied by a tracker algorithm, since after the initialization process it would be inefficient to compare all possible hand shapes.
iii. Once the tracker loses track of the object being tracked (i.e. the hand), the system has to be reinitialized to locate the hand.

III. Candidate solution for hand tracking
a. Cam shift hand tracking using the histogram of the hand
i. The algorithm works quite well for objects whose shape doesn't change, like a car or a face.
ii. For an object whose shape can vary, like the hand in our system, the tracker window either includes a much larger region than the hand or sometimes doesn't include the entire hand. In such cases, further intensive processing is often needed to extract the hand contour.

b. Motion Energy Image (MEI) / Motion History Image (MHI)
i. MEI and MHI can provide us with a binary mask of the parts that are in motion. The fainter parts are the ones that moved earlier and the brighter parts are the ones that moved a few moments ago.
ii. The binary mask provided by this technique contains the hand along with other moving parts like the body and face. It is often difficult to distinguish which part is the hand.

c. Connected Component Tracking
i. This technique tracks the shift in position of a blob from frame to frame.
ii. The technique fails when the object being tracked is occluded by other objects.
iii. It shows poor tracking performance when the shape of the object being tracked is varying.
iv. It demands highly uniform lighting.

d. Kalman Filter
i. The Kalman filter can track objects even after occlusion. It can predict the position of the object after the occlusion is over using the information from before the occlusion.
ii. It assumes that the tracked object moves according to a linear dynamic system with Gaussian noise, which may not hold.
iii. It can track multiple objects.
iv. For the application of this technique in an SLR system, limited and restricted motion of the hands has to be assumed.
v. It needs a dynamic model of the hand motion, which is not easy to estimate.

e. Particle Filter
i. The particle filter is well suited to applications where the dynamics of the tracked object are not well-defined, or with non-linear dynamics and non-Gaussian noise conditions.
ii. Although a PF is capable of multiple object tracking, it cannot properly handle cases when the objects touch or occlude.
iii. It also needs a dynamic model of the hand motion, which is not easy to estimate.
iv. Estimates of a PF in the case of hand tracking are more accurate, but the computational overhead makes it unsuitable for real-time systems.

f. CCMSPF (Connected Component Mean Shift Particle Filter)
i. It inherits all features of the connected component, mean shift and particle filter methods.
ii. It can work in cases of occlusion as well.
iii. The method was originally developed for video surveillance, in which it is sufficient to know the center of mass of the objects (especially people) being tracked.
iv. In our system, it can work well to track the center of the hand, but further extensive processing is needed to find the hand contour.

IV. Candidate solution for segmentation
This process is similar to the candidate solutions for hand detection, except that in the hand detection section we work on the whole input image, while here we work on the rectangular region provided by a tracker that contains the hand along with some other unwanted background items.
a. Color based segmentation
i. The hand can be segmented by using the color cue as in the candidate solution for hand detection.
b. Mean Shift Clustering
i. The hand can be segmented by using mean shift clustering or cam shift clustering as discussed in the candidate solution for hand detection.

V. Candidate solution for Feature Extraction
a. Geometrical Features
i. Geometrical features are the features seen from the geometry of the hand shape, like the number of fingers projected out, circularity, ellipticity, etc.
ii. The extraction process is relatively simpler than other feature extraction techniques.
iii. The discriminative power of these properties is low, making it difficult to distinguish between large numbers of hand shapes.

b. PCA (Principal Component Analysis)
i. The purpose of PCA is to reduce the large dimensionality of the data space to the smaller intrinsic dimensionality of the feature space (independent variables) needed to describe the data economically.
ii. When the number of features is too large, a classifier can suffer from the curse of dimensionality.
iii. Classifiers like neural networks heavily suffer from the curse of dimensionality, but SVMs can handle a pretty large number of dimensions efficiently.
iv. Before using PCA transformed features for SVM, a question has to be addressed: "Is it wise to use PCA transformed features for SVM?"
There is a hot debate in this regard on internet forums. Some experts say that PCA is only needed for algorithms that suffer from the curse of dimensionality, a neural network for example, whereas there is no need to use PCA to reduce dimensionality for SVM, since kernel-based techniques have overcome the problem of high dimensions; PCA is also a compression technique that could lose some information. Others argue that PCA preprocessing helps a lot in reducing parameter grid search time, improves SVM accuracy a little, and can improve performance by removing correlations between variables and possibly removing noise. Perhaps the answer to this question is problem-dependent. So, we are giving it a try to observe the performance of SVM with and without PCA transformed features.

c. HU moment
i. HU moments are a set of translation, scale and rotation invariant moments (a small computation sketch follows this list).
ii. This helps us to tackle the problem of slight variations in scale that occur due to changing distance from the camera.
iii. This property also helps in tackling the problem of variation in rotation of a standard shape by the same user at different times and by different users.
iv. The vector of features is just of length 7. Thus the training time for the classifier model is very small.
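As a rough sketch (only the first of the seven invariants, for illustration; the project obtains all seven from the image processing library), the code below computes raw and central moments of a binary image and the first Hu moment, phi1 = eta20 + eta02.

using System;

static class HuMomentSketch
{
    // Computes the first Hu invariant of a binary image (1 = foreground, 0 = background).
    public static double FirstInvariant(int[,] binary)
    {
        int h = binary.GetLength(0), w = binary.GetLength(1);
        double m00 = 0, m10 = 0, m01 = 0;

        // Raw moments.
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
            {
                int v = binary[y, x];
                m00 += v; m10 += x * v; m01 += y * v;
            }

        double cx = m10 / m00, cy = m01 / m00;   // centroid
        double mu20 = 0, mu02 = 0;

        // Central moments (translation invariant).
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
            {
                int v = binary[y, x];
                mu20 += (x - cx) * (x - cx) * v;
                mu02 += (y - cy) * (y - cy) * v;
            }

        // Normalized central moments (scale invariant); for p+q = 2 the exponent is 2.
        double eta20 = mu20 / Math.Pow(m00, 2);
        double eta02 = mu02 / Math.Pow(m00, 2);
        return eta20 + eta02;   // phi1, also rotation invariant
    }
}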

d. RAW Binary Image
i. The term RAW is used throughout this report to indicate a binary image in raw form (without any further processing).
ii. The raw binary image is fed directly to the classifier with the list of pixel values arranged in a row.
iii. The training time for the classifier is high since the dimension is large.
iv. No information is lost, whereas the other feature extraction techniques extract features at the cost of losing some information.

44

VI. Candidate solution for Recognition Framework
a. Neural Network
i. The number of nodes in a multilayer perceptron is fixed.
ii. When we apply PCA to training data, the number of features depends on the size of the training dataset. In such situations where the number of features varies, it is difficult to handle the problem.
iii. NNs suffer from a number of drawbacks and limitations as compared to SVMs.

b. SVM (Support Vector Machine)
i. SVM offers numerous advantages over NN for our problem.
ii. Good generalization.
iii. Explicit dependence of the model on the data (via support vectors).
iv. No false minima as in NN, since it involves quadratic programming for optimizing a convex function and has exactly one solution.
v. Few parameters to fine tune.
vi. Can implement a confidence measure (probabilistic measure).
vii. Different starting points lead to the same solution, as it involves an optimization problem subject to some constraints.
viii. Faster.
ix. Doesn't suffer from the curse of dimensionality.
x. Can handle variable length inputs (including strings) with the help of string kernels, which is very difficult in a NN since its number of input nodes is fixed.
xi. Can handle graphs. This is helpful in bioinformatics.
VII. Candidate solution for Text to Speech Converter
a. Speech Synthesizer
i. The speech synthesizer is not the novel part of our work.
ii. The .Net Framework provides a speech synthesizer for the English language. When the same API is applied to the Nepali language, we cannot obtain the true natural Nepali tone. This is due to the lack of some phonemes in the English language.
b. Play recorded audio
i. To obtain the Nepali tone, recordings of each letter can be taken and played when needed.

3.4.3. Conclusion
I. Selected solution for face detection and elimination
Haar cascade is selected for face detection since it works for any face once we have the haar feature file. The feature file can be downloaded or is provided by the image processing library.
II. Selected solution for hand detection
Two solutions are selected for two situations. For the situation in which we record the background containing only the actual background objects, "Something that remains after elimination of background, clothing and face is a hand" is selected, since the technique is quite easy to implement provided the user wears a uniformly colored shirt. This technique allows a high degree of body movement of the signer. For the situation in which we record the background containing the actual background objects and the signer with his hands laid down exhibiting very little or no motion, "Something that is in motion is a hand" is selected, since the technique is easy to implement and is less affected by noise if the signer remains very close to his initial position while the background is recorded.
III. Selected solution for hand tracking
All the trackers suffered from defects like very slow performance making them unfit for a real-time system, high dependence on lighting, the need for extra information in the form of a dynamic model of hand motion, etc. In the cases where a tracker showed somewhat good results, the hand was accompanied by other noise which was very difficult to filter out. Further processing techniques could be adopted to solve the problem, but that did not seem feasible within the time bound. On applying the hand detection algorithm on every frame, the performance was satisfactory. This could be because for now we are tracking just a single hand. Thus we did not feel the need of a hand tracking alternative.

IV. Selected solution for segmentation
This step is not needed when we do not apply a tracker algorithm.
V. Selected solution for Feature Extraction
The performance of a classifier fed with features extracted by different techniques is a matter of curiosity. So, instead of choosing one alternative, we are going for all the alternatives, and their performance shall be analyzed in the result and analysis section. Most of the research works on SLR systems adopt a single feature extraction technique and train one classifier model for all the signs. This obviously decreases the performance, since the inter-class separation reduces when a single model is trained. To get rid of this, we take the help of a geometrical feature, the finger count, to train different models based on the count of fingers projected outwards in the hand shapes. In the former approach a single model has to classify 37 signs, whereas in the latter a model has to classify at most 12 signs. So, we will be going for all of the following techniques.
a. Geometrical Features
b. PCA (Principal Component Analysis)
c. HU moment
d. RAW Binary Image


VI. Selected solution for Recognition Framework
In our problem, the number of features fed to the classifier can vary depending on the number of training data. The dimension of the data can be pretty large in the case of the RAW binary image. When the number of features is large, the training time can be painfully high, so a faster technique is needed. When the final prediction of the class has to be made based on the confidence measures of multiple techniques, a method that provides a probabilistic measure is needed. Considering all these conditions, SVM was judged to be the appropriate choice.
VII. Selected solution for Text to Speech Converter
Since the sound produced by the speech synthesizer of .NET did not sound natural for Nepali words, and this is not the novel part of our project, we are going for the easier technique of playing recorded audio. The number of letters handled by our system is just 37, so it is not difficult to record each sound.


3.5. System Modeling
3.5.1. System Block Diagram

Figure 3.3 System Block Diagram

Once individual frames are extracted from the video source, the hand is segmented. The segmented hand is then normalized to make it ready for feature extraction. The features are used to train SVM models, which are saved to disk. This ends the training phase. In the testing phase, the same procedure is followed: the class of a vector of features is predicted by loading the saved model, and then the audio recorded for that class is played.


3.5.2. Class diagram

Figure 3.4 Class Diagram of the system


4. IMPLEMENTATION
As sign language is of a multimodal nature, sign language recognition is a multidisciplinary research area involving pattern recognition, machine learning, computer vision, natural language processing and linguistics. Our working domain is restricted to hand posture recognition and conversion of the recognized text to speech, thus mainly involving pattern recognition and computer vision.

For implementation, we need to modularize the system according to the requirements. First of all, we need a module that deals with image acquisition. In our system, we have planned to go for a vision based data (image) acquisition technique using a web camera, so we need a module called "Frame Grabber" that can provide us with 2D images of the signer. The second module is called "Image Processor", which performs all the pre-processing jobs followed by hand detection and segmentation. Once we have the final processed image, we need techniques that provide us with a mathematical representation of a hand shape, usually in the form of a vector of features. This process is formally called feature extraction, and we call the module performing this job the "Feature Extractor". Each similar shape has a similar vector of features which differs (in the ideal case) from the feature vectors of other classes. To distinguish different hand shapes, a classifier, when trained properly, will be able to predict the correct class of a hand shape provided for testing. The module performing this job is named "Classifier". The final output of our system is speech. To achieve this, the predicted class in the form of text is converted to speech by the module called "Audio Generator". For the users to interface with the system, a suitable GUI (Graphical User Interface) with the necessary controls is designed. This is called the "GUI" module. To perform all these jobs, we need a central control mechanism which eases the process of training and testing. This module is called the "Control Box". We present the details of the implementation of each module in the following sections.

4.1. Frame Grabber
This module handles the initialization and release of the webcam or video file. After the video source is initialized, it returns consecutive frames from the webcam or video file. It also controls the size and orientation of the frame.

4.2. Image Processor
The main task of this module is to obtain a normalized binary image of the hand region from the image returned by the frame grabber. The processing of the image can be done in various color formats like Gray, Bgr, Bgra, Hsv, YCbCr, RGB, RGBa, etc. In the YCbCr color space, Y is the luma component and Cb and Cr are the blue difference and red difference chroma components. Since the luminance component is separate, lighting has an effect only on the Y component. But in the BGR format, the color and intensity information are not separable. Hence most of the processing is done in the YCbCr format. This module is further divided into the following sub-modules.

4.2.1. Background Accumulator
A certain number of initial frames is recorded as background at the beginning. All the frames are converted from the (B,G,R) color format to the (Y,Cr,Cb) color format.
i. 10 initial frames where the background is almost stable are taken. The term background means just the background items, with or without the signer included in it.

Figure 4.1 Different Types of Background in the application

ii. The value at each pixel is averaged to find the mean background.
iii. The absolute deviation from the mean of each frame is calculated.
iv. For each pixel, the color range of acceptable background is calculated by adding and subtracting the absolute deviation from the mean.
v. Pixels within the background color range are eliminated to obtain the foreground mask, which is in binary form. A simplified sketch of this computation is given after Figure 4.2.

Figure 4.2 Background Subtraction
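The sketch below is a simplified, single-channel version of the per-pixel statistic described above (plain arrays rather than the EmguCV image types used in the system): a pixel of a new frame is treated as foreground if it falls outside the mean plus or minus the mean absolute deviation learned from the initial frames.

using System;

static class BackgroundModel
{
    // frames: the initial frames (e.g. 10) used to learn the background, indexed [frame][row, col].
    // Returns a binary foreground mask for the current frame (true = foreground).
    public static bool[,] Subtract(byte[][,] frames, byte[,] current)
    {
        int h = current.GetLength(0), w = current.GetLength(1);
        var mean = new double[h, w];
        var deviation = new double[h, w];

        // Per-pixel mean over the recorded background frames.
        foreach (var f in frames)
            for (int y = 0; y < h; y++)
                for (int x = 0; x < w; x++)
                    mean[y, x] += (double)f[y, x] / frames.Length;

        // Per-pixel mean absolute deviation from that mean.
        foreach (var f in frames)
            for (int y = 0; y < h; y++)
                for (int x = 0; x < w; x++)
                    deviation[y, x] += Math.Abs(f[y, x] - mean[y, x]) / frames.Length;

        // Pixels outside [mean - dev, mean + dev] are kept as foreground.
        var mask = new bool[h, w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                mask[y, x] = current[y, x] < mean[y, x] - deviation[y, x] ||
                             current[y, x] > mean[y, x] + deviation[y, x];
        return mask;
    }
}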

4.2.2. Face Detector
When the signer's body is included in the background, slight movement of the head produces a small region in the subtracted frame as if it were a foreground object. And when the signer's body is not included in the background, the whole face comes into the foreground. To eliminate these unwanted regions, we need to detect the face and eliminate it.
i. Define the haar features of a face in an xml file.
ii. Get the haar feature file and a test frame in which the face is to be detected.
iii. A region in the test frame which is similar to the haar features is detected as a face.
iv. The rectangular region above the bottom of the detected face is painted black to eliminate the face.

Figure 4.3 Foreground mask after face elimination

4.2.3. Color Subtractor
The above step can still fail to separate the region containing only the hand. In such cases, objects like the shirt, tie, etc. can be eliminated by manually picking their color.

Figure 4.4 Foreground mask after color subtraction

4.2.4. Noise Remover
The image supposed to contain only the hand region contains unwanted noise like small patches of background and the difference region that occurs when the signer's body moves.
To remove small patches,
i. The image is eroded with a rectangular structuring element, resulting in a shrunk image.
ii. The original image is regained by dilation.
To remove larger patches that are still smaller than the hand region,
i. The contours of all patches are found.
ii. The largest contour is considered to be the hand.

Figure 4.5 Foreground mask with all noise removed.

4.2.5. Principal Component Analyzer
The job of this sub-module is to perform PCA of the binary hand image. This is useful to obtain PC1 (First Principal Component) and PC2 (Second Principal Component), which is perpendicular to PC1.
i. Get the binary image of the hand.
ii. Perform PCA on it.
iii. The Eigen vector with the largest Eigen value gives the direction of PC1 (First Principal Component), which is the direction of largest hand region extension.
iv. The Eigen vector with the second largest Eigen value gives the direction of PC2 (Second Principal Component), which is the direction of second largest hand region extension.

4.2.6. Finger Counter
When a single classifier model is trained for all signs, the prediction accuracy can be greatly reduced. This is because the inter-class separation of data points in feature space is reduced. To overcome this problem, we use a geometrical feature, the finger count.


A finger is counted when it is well separated from the adjacent finger. When two fingers are projected outward with no gap in between, the finger count is 0 and not 2. On the basis of this finger count, different classifier models are trained with the groups of signs having the same finger count. The counting process for finger-like projections in the hand is (a small sketch of the angle test follows this list):
i. Identify the convexity defect point sets (start, depth and end) in the contour of the hand.
ii. Refine the defect points by selecting only the start point of a defect set if the total distance from start to depth to end point is greater than 2.5% of the contour perimeter.
iii. Iterate through the list of refined defect points and check whether the included angle is less than 70 degrees. If so, the finger count is incremented by 1.
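The sketch below shows one reasonable reading of the angle test in step iii (plain vector math; the convexity defects themselves come from the image processing library, and the choice of measuring the angle at the start point is an assumption for illustration).

using System;

struct Pt { public double X, Y; public Pt(double x, double y) { X = x; Y = y; } }

static class FingerCounter
{
    // starts/depths/ends: one refined convexity defect per index
    // (candidate finger tip, deepest point of the defect, defect end point).
    public static int Count(Pt[] starts, Pt[] depths, Pt[] ends)
    {
        int fingers = 0;
        for (int i = 0; i < starts.Length; i++)
        {
            // Angle at the start point between the vectors towards the depth and end points.
            double a1x = depths[i].X - starts[i].X, a1y = depths[i].Y - starts[i].Y;
            double a2x = ends[i].X - starts[i].X, a2y = ends[i].Y - starts[i].Y;
            double cos = (a1x * a2x + a1y * a2y) /
                         (Math.Sqrt(a1x * a1x + a1y * a1y) * Math.Sqrt(a2x * a2x + a2y * a2y));
            double angleDeg = Math.Acos(cos) * 180.0 / Math.PI;

            if (angleDeg < 70.0)   // sharp angle -> finger-like projection
                fingers++;
        }
        return fingers;
    }
}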

4.2.7. Palm Separator
The signer can appear in either a full-sleeved or a half-sleeved shirt. The job of the palm separator is to separate the palm from the arm when the signer wears a half-sleeved shirt; it is therefore not needed in the case of a full-sleeved shirt.
i. Find the convexity defect points (start, depth and end) in the contour of the hand.
ii. Compute the total distance from start to depth and depth to end points for each defect set.
iii. Pick the depth point from the defect set having the largest total distance. This depth point is called the depth point below the palm, as it usually lies around the region just below the palm.
iv. Draw a line through the depth point below the palm and along PC2.
v. Shade in black the region below that line to eliminate the arm.
vi. Any finger-like projection counted as a finger in the shaded region is subtracted to give the final finger count.

56

4.2.8. Normalizer
The extracted palm region varies in size and orientation. This can have an adverse effect during feature extraction, so it needs to be transformed into some standard form. This process is called normalization (a small sketch of the rotation angle and bounding box computation follows this list).
i. Resize the whole image (640*480) containing the binary palm region into a smaller image (24*32).
ii. Rotate the image by an angle such that PC1 aligns with the vertical.
iii. Find a box that bounds the palm.
iv. Resize the bounding box to map it to the original frame.
v. Resize the bounding box to a standard size (24*32) to obtain the normalized palm image.
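As a small illustration of steps ii and iii only (assuming PC1 is available as a vector from the Principal Component Analyzer; the actual rotation and resizing are performed with the image processing library, and the sign convention depends on the image coordinate system):

using System;

static class NormalizerHelpers
{
    // Angle (degrees) by which the image should be rotated so that PC1 points straight up.
    public static double RotationToVertical(double pc1X, double pc1Y)
    {
        // Angle of PC1 measured from the vertical axis (y grows downwards in image coordinates).
        return Math.Atan2(pc1X, -pc1Y) * 180.0 / Math.PI;
    }

    // Bounding box {top, left, bottom, right} of all foreground pixels in a binary mask.
    public static int[] BoundingBox(bool[,] mask)
    {
        int h = mask.GetLength(0), w = mask.GetLength(1);
        int top = h, left = w, bottom = -1, right = -1;
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                if (mask[y, x])
                {
                    if (y < top) top = y;
                    if (y > bottom) bottom = y;
                    if (x < left) left = x;
                    if (x > right) right = x;
                }
        return new[] { top, left, bottom, right };
    }
}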

4.3. Feature Extractor
For the classifier to distinguish between the postures, the normalized image should be represented by some distinguishing characteristic values. For this, we could directly feed the raw image to the classifier, but this introduces noise, causing the characteristic features to be outnumbered, and increases complexity. So, different processes have been employed to extract features from the segmented normalized image. The applied processes are:

4.3.1. Raw Image
To feed the segmented normalized binary image directly to the classifier, we take each pixel value as a feature. So the number of features is the total number of pixels in the image.

Figure 4.6 Sample raw binary image of size 24*32.

When we take such an image the features will be a list of 0s and 1s.

4.3.2. PCA
Here are the steps we adopted to perform PCA (a compact sketch follows the list).
i. Get normalized images of each hand posture with unfilled hand contour of size (24*32).

Figure 4.7 Sample normalized image of letter "च" with unfilled contour.

ii. Take all the pixels of the image in a row, forming a (1*768) matrix.
iii. If 'm' is the number of training images obtained, then we obtain an input image matrix of size (m*768).
iv. Find the mean image of the matrix and save it to disk.
v. Make the input image matrix mean centered.
vi. Find the covariance matrix (C) of size (768*768).
vii. Compute the Eigen vectors and Eigen values of the covariance matrix. The size of the Eigen vector matrix is (768*768) if m >= 768, else ((m-1)*768). The size of the Eigen values is (768*1) if m >= 768, else ((m-1)*1). Save the Eigen vectors to disk.
viii. Project the input images into the new Eigen space defined by taking 'k' Eigen vectors with the highest Eigen values to get the new features for each normalized image.
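A compact sketch of steps iv to viii using the Math.NET Numerics library (an assumption made for illustration; the project computes PCA through the image processing library) might look like the following.

using System.Linq;
using MathNet.Numerics.LinearAlgebra;

static class PcaSketch
{
    // rows = training images (m), columns = pixels (768); returns the m x k projected features.
    public static Matrix<double> Project(Matrix<double> images, int k)
    {
        // iv. Mean image (one value per pixel column).
        var mean = images.ColumnSums() / images.RowCount;

        // v. Mean-centre every row.
        var centred = Matrix<double>.Build.DenseOfRowVectors(
            images.EnumerateRows().Select(r => r - mean));

        // vi. Covariance matrix (768 x 768).
        var covariance = centred.TransposeThisAndMultiply(centred) / (images.RowCount - 1);

        // vii. Eigen decomposition; eigenvectors are returned column by column.
        var evd = covariance.Evd();

        // viii. Keep the k eigenvectors with the largest eigenvalues and project onto them.
        var indices = Enumerable.Range(0, evd.EigenValues.Count)
                                .OrderByDescending(i => evd.EigenValues[i].Real)
                                .Take(k);
        var basis = Matrix<double>.Build.DenseOfColumnVectors(
            indices.Select(i => evd.EigenVectors.Column(i)));

        return centred * basis;
    }
}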

4.3.3. Hu Moment
The steps to extract features using Hu moments are:
i. Get a normalized image of a hand posture with filled contour.

Figure 4.8 A sample contour of sign "च" filled with white.

ii. Find the seven Hu moments of the normalized image.

4.4. Classifier
The main objective of the classifier is to classify a feature vector into a unique class. A class is simply the representation of a single sign. The features, with their equivalent classes, that are fed into the classifier for training are stored in a file. The classifier used in our system is SVM. Implementing a classifier from scratch is out of the scope of this project, so we take the help of a suitable library called SVM.NET. SVM.NET is a .NET conversion of LIBSVM, which is written in C++ and Java.

4.4.1. Training Phase
i. Get the file containing the vectors of features of all training data.
ii. Transform the file into the file format understood by SVM.NET.
The file format is: <true label> <feature_index1>:<v1> <feature_index2>:<v2> ...
Example:
3 1:-2271.244 2:24.79307 3:-5.973912 4:-364.3655 5:-534.7533 6:-1844.195 7:808.4861 8:1836.421 9:2051.527 10:306.1829 11:-706.0677 12:0 13:1198.748 14:-306.2599 15:1191.686 16:1749.029 17:-1658.997 18:-1158.746

iii. Conduct scaling on the data if necessary.
This prevents features with larger variance from dominating those with smaller variance. Scaling is usually done to the range [-1, +1] or [0, 1]; [0, 1] is faster than [-1, +1].
iv. Choose a suitable kernel function. RBF is often a good choice and works well for most problems. Find the best parameters (C, γ) by performing a grid search (a sketch of the search loop follows this list).
v. The best values can be identified with the help of the cross-validation accuracy. Normally 5-fold cross-validation is done, which means the training set is divided into 5 random subsets; 4 of the subsets are used to train the model and the remaining one is used to test the model for its accuracy.
vi. Use those best values to train the SVM model.
To train a model means to find the best values of the variables in a decision function that predicts the class.
vii. Save the model to disk.
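The grid search in steps iv and v is simple to express. The sketch below assumes a hypothetical crossValidate(trainingFile, c, gamma) helper that wraps the library's cross-validation call and returns the accuracy, so it is an outline rather than working SVM.NET code.

using System;

static class GridSearch
{
    // Exhaustive search over exponentially spaced (C, gamma) pairs, as commonly done for RBF kernels.
    public static void FindBestParameters(string trainingFile,
        Func<string, double, double, double> crossValidate)   // hypothetical helper
    {
        double bestC = 0, bestGamma = 0, bestAccuracy = 0;

        for (int logC = -5; logC <= 15; logC += 2)
        {
            for (int logGamma = -15; logGamma <= 3; logGamma += 2)
            {
                double c = Math.Pow(2, logC);
                double gamma = Math.Pow(2, logGamma);
                double accuracy = crossValidate(trainingFile, c, gamma); // 5-fold CV accuracy

                if (accuracy > bestAccuracy)
                {
                    bestAccuracy = accuracy;
                    bestC = c;
                    bestGamma = gamma;
                }
            }
        }
        Console.WriteLine("Best C = {0}, gamma = {1}, CV accuracy = {2}%", bestC, bestGamma, bestAccuracy);
    }
}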

4.4.2. Testing Phase
i. Load the appropriate model from disk.
ii. Get the file with the vectors of features whose classes are to be predicted. The file format is similar to that of the training dataset, but for testing we do not know the class label, so an arbitrary class label is written; the field cannot be left empty (a small file-writing sketch follows this list).
iii. Predict the classes.
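A small helper that writes labelled feature vectors in this shared training/testing file format could look like the following (illustrative only; it is not part of the SVM.NET library).

using System.Globalization;
using System.IO;

static class SvmFileWriter
{
    // Writes feature vectors in LIBSVM format: "<label> 1:v1 2:v2 ..." (1-based feature indices).
    public static void Write(string path, int[] labels, double[][] features)
    {
        using (var writer = new StreamWriter(path))
        {
            for (int i = 0; i < labels.Length; i++)
            {
                writer.Write(labels[i]);
                for (int j = 0; j < features[i].Length; j++)
                    writer.Write(" {0}:{1}", j + 1,
                        features[i][j].ToString(CultureInfo.InvariantCulture));
                writer.WriteLine();
            }
        }
    }
}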


4.5. Audio Generator
The output of the classifier can be visualized in the GUI of the system. It is even better if the system can give audio output; this makes the communication more practical. Since the speech synthesizer API of Microsoft was not applicable to the Nepali language, we chose the easier way of playing audio files using the functions in the .NET framework. Once the output of the classifier is found, the audio saved in the system is played. There are only 37 lexicons, which represent the Devanagari consonants, in the current system, so only the required audio files are saved and played accordingly.

4.6. GUI
The main function of the GUI module is to accept input from the user and supply it to the Control Box, from where the input can be forwarded to the appropriate modules and vice versa. Our GUI is organized in four tabs, viz., Home, Debug, Settings and Help. The Home tab consists of a minimal interface for normal use. It consists of a window to view the video, a display for the converted text and a control to choose the video source. The Debug tab consists of the full interface to train and test the recognition model and debug the software. The Settings tab provides an interface to change the settings of
i. the web camera and video,
ii. the color range, direction of hand, and nature of shirt for image processing,
iii. the path to load the model from.

Help tab gives the information about how to operate the software.


Figure 4.9 GUI of Debug Tab

4.7. Control Box
The 'Control Box' module is the operational backbone of the system. The modules discussed above are interfaced by the control box. When the system executes, the control box is created, and it handles the creation of the GUI, frame grabber, image processor, feature extractor and classifier. It accepts user requests through the GUI and delegates them to the appropriate module for smooth operation of the system.


5. DEVELOPMENT METHODS
5.1. Methodology
Our project "Voice for the Voiceless" is a research oriented project; therefore traditional approaches like the Waterfall model are not suited to the development process. We have followed an agile software methodology, but we have not strictly adhered to the principles of the agile method. Some of the practices we followed during our project are listed below:
i. Pair Programming
ii. Stand-Up Meetings
iii. Code Ownership
iv. Method Comments

5.2. Tools and Environment
i. Language: C#
ii. Framework: .NET 4.0
iii. Image Processing Library: EmguCV (wrapper of OpenCV for the .NET framework)
iv. SVM Library: SVM.NET (wrapper of LIBSVM for the .NET framework)
v. IDE: Visual Studio 2010
vi. GUI Design: Microsoft Expression Blend 4.0


6. PROBLEM FACED AND SOLUTION
6.1. Some Solved Problems
i. Variation in shape and size of hand
This problem is considerably reduced by normalization and by customized training for those users for whom the system performs poorly.
ii. Variation in skin color
A simple technique for hand detection is to specify a color range of skin, but it is difficult to define a color range that is applicable for all users due to variation in skin color. This problem is solved by the use of background subtraction, shirt removal, and face detection and elimination.
iii. Detection and tracking of an object whose shape is changing
In most tracking systems, an object whose shape does not change is tracked. In our system, an object whose shape is continuously changing, i.e. the hand, has to be tracked. This problem is solved by running the hand detection algorithm on every frame, with no adverse effect on real-time system performance.

6.2. Some Unsolved Problems
Some of the problems are such that we could not find a solution due to their complexity. Some of these problems are as follows.
i. Overlapping of hand and face

Figure 6.1 Overlapping hand and face

In segmentation of the hand, the hand cannot be separated when the face and hand overlap.

ii. Deviation from ideal shape
In many cases, the normalized image is different when the orientation of the hand differs while performing the same sign. Figure 6.2 shows the same symbol "ट" performed differently; the normalized image is different in the two cases because of the difference in orientation of the hand.

Figure 6.2 Deviation from ideal shape of Nepali sign "ट"

iii. Effect of lighting
Segmentation of the hand differs under different lighting conditions. The problem of the effect of lighting is still unsolved.

Figure 6.3 Effect of lighting


7. RESULT AND ANALYSIS
7.1. Experimental Setup
The signer has to sit in front of the camera exposing the upper half of his body. One or more videos can be recorded to obtain the training images. The training images are the normalized images. These images are extracted from the video by manual selection of frames that tend to be ideal in shape. The class of each image is given manually, so the experiment is totally supervised. Once the training dataset is ready, an SVM model can be trained and used for prediction in a new video.

7.2. Signer Dependent Testing
7.2.1. Individual Testing
In signer dependent testing for an individual, the SVM model is trained on the dataset taken from the video of one signer. While testing, the same model is used for predicting the test data of the same signer. Individual testing of all signers is presented in Appendix C. Below is a sample of Sujana Limbu.

A. Taking PCA transformed data in the maximum number of dimensions available

Branch | Training | Testing | Training accuracy (%)   | Testing accuracy (%)
number | data     | data    | PCA   | HU    | RAW     | PCA   | HU    | RAW
0      | 220      | 44      | 100   | 83.18 | 100     | 95.46 | 97.73 | 95.46
1      | 274      | 54      | 100   | 89.42 | 100     | 66.67 | 92.59 | 66.67
2      | 200      | 43      | 100   | 82.00 | 100     | 89.66 | 96.55 | 89.66
3      | 78       | 29      | 100   | 94.87 | 100     | 85.71 | 100   | 85.71

Analysis:
i. The accuracy of PCA on the training set is 100%. The number of support vectors in the SVM model is equal to the number of training data, which creates a very rigid model suffering from overfitting. This usually gives cent percent accuracy on the training data, but accuracy gets lower on the testing data since the model is very rigid.
ii. The performance of the system is highly dependent on the way the signer performs a sign, and also depends slightly on the angle the hand faces relative to the camera.
iii. The performance of the system can degrade depending on the signer, as shown in detail in Appendix C.

B. Taking PCA transformed data in 3 dimensions
This result is obtained by taking only 3 dimensions during the PCA transformation. The effect of dimension reduction on system accuracy is to be observed.

Branch | Training | Testing | Training accuracy (%)   | Testing accuracy (%)
number | data     | data    | PCA   | HU    | RAW     | PCA   | HU    | RAW
0      | 220      | 44      | 100   | 83.18 | 100     | 95.46 | 97.73 | 95.46
1      | 274      | 54      | 100   | 89.42 | 100     | 66.67 | 92.59 | 66.67
2      | 200      | 43      | 100   | 82.00 | 100     | 89.66 | 96.55 | 89.66
3      | 78       | 29      | 100   | 94.87 | 100     | 85.71 | 100   | 85.71

Analysis:
i. There seems to be no effect of the reduction of dimensions by the PCA transformation on the accuracy of the system, but it does reduce the training time.

C. Number of Support Vectors
When the training dataset is small, almost all data points are taken as support vectors in the SVM model. The effect of this on training and testing dataset accuracy is to be observed.

Branch | Training |              Number of support vectors
number | data     | PCA (all features) | PCA (only 3 features) | HU  | RAW
0      | 220      | 220                | 220                   | 72  | 220
1      | 269      | 269                | 269                   | 75  | 269
2      | 227      | 227                | 227                   | 64  | 227
3      | 42       | 42                 | 42                    | 2   | 42

Analysis:
i. The number of support vectors for PCA transformed data is equal to the number of training data. This could possibly be due to the small size of the training dataset.
ii. There is no change in the SVM model and the support vectors on changing the number of dimensions for the PCA projection. There could be a significant change when the dimension is reduced from thousands to a few.
iii. The number of support vectors for HU transformed data is much smaller than the number of training data, so the model is not overfitted, giving average accuracy when testing on both the training and testing datasets.
iv. The number of support vectors when feeding the RAW binary image is the same as that of PCA. The time taken for training on RAW features is high compared to that of PCA, while similar SVM models are generated. So, the technique of RAW features could possibly be dropped.

7.2.2. Time efficiency for SVM versus PCA+SVM
The effect of the reduction in dimension using PCA on the training time, compared to the training time using features without dimension reduction, is noted.

Description                 | Training data | SVM training time, RAW features | SVM training time, PCA transformed features
Sample data of Sujana Limbu | 772           | 25 min 47 sec                   | 43.75 sec (0.75 sec + 43 sec)


Analysis:
i. The time taken for the PCA transformation and SVM training is very small compared to the training time on RAW features. SVMs are well known for their performance on large dimensional data, but from our observation the training time is greatly improved by the reduction in dimension.

7.3. Signer Independent Testing
In signer independent testing, the SVM model is trained on the data taken from all training users. The model obtained is used for testing new data of any new individual.

7.3.1. Letter wise Accuracy with finger count
There will be high accuracy for some signs due to their distinct shape, while some signs are difficult to identify since they have very little discriminative property. Further, the accuracy of the system also depends on the use of the finger count. The accuracy for each sign is noted (accuracy in %, with correct predictions/total data in brackets).

Letter | PCA training set | PCA testing set | HU training set | HU testing set
क      | 100 (80/80)      | 31.25 (5/16)    | 80 (64/80)      | 50.00 (8/16)
ख      | 100 (61/61)      | 52.94 (9/17)    | 47.54 (29/61)   | 47.06 (8/17)
       | 100 (79/79)      | 26.31 (5/19)    | 91.31 (72/79)   | 89.47 (17/19)
       | 100 (79/79)      | 42.86 (9/21)    | 44.30 (35/79)   | 57.14 (12/21)
       | 100 (59/59)      | 16.00 (4/25)    | 72.88 (43/59)   | 64.00 (16/25)
       | 100 (53/53)      | 38.89 (7/18)    | 79.24 (42/53)   | 77.78 (14/18)
       | 100 (63/63)      | 64.29 (9/14)    | 6.35 (4/63)     | 14.29 (2/14)
       | 100 (92/92)      | 71.42 (15/21)   | 93.48 (86/92)   | 90.48 (19/21)
       | 100 (116/116)    | 61.54 (16/26)   | 82.76 (96/116)  | 69.23 (18/26)
       | 100 (123/123)    | 100.00 (27/27)  | 66.67 (82/123)  | 62.96 (17/27)
       | 100 (24/24)      | 50.00 (4/8)     | 91.67 (22/24)   | 62.50 (5/8)
       | 100 (82/82)      | 50.00 (9/18)    | 75.61 (62/82)   | 66.67 (12/18)
       | 100 (18/18)      | 28.57 (2/7)     | 66.67 (12/18)   | 42.86 (3/7)
       | 100 (66/66)      | 33.33 (8/24)    | 25.75 (17/66)   | 37.50 (9/24)
       | 100 (106/106)    | 78.95 (15/19)   | 72.64 (77/106)  | 68.42 (13/19)
       | 100 (74/74)      | 42.86 (6/24)    | 70.27 (52/74)   | 78.57 (11/14)
       | 100 (160/160)    | 100.00 (21/21)  | 62.5 (100/160)  | 47.62 (10/21)
       | 100 (53/53)      | 63.64 (7/11)    | 86.79 (46/53)   | 81.82 (9/11)
       | 100 (117/117)    | 70.59 (12/17)   | 43.59 (51/117)  | 47.06 (8/17)
       | 100 (113/113)    | 50.00 (11/22)   | 83.19 (94/113)  | 77.27 (17/22)
       | 100 (92/92)      | 72.00 (18/25)   | 77.17 (71/92)   | 80.00 (20/25)
       | 100 (68/68)      | 41.67 (10/24)   | 47.06 (32/68)   | 41.67 (10/24)
       | 100 (49/49)      | 35.71 (5/14)    | 10.20 (5/49)    | 7.14 (1/14)
       | 100 (58/58)      | 38.89 (7/18)    | 91.34 (53/58)   | 83.33 (15/18)
       | 100 (80/80)      | 47.62 (10/21)   | 18.75 (15/80)   | 28.57 (6/21)
       | 100 (111/111)    | 58.33 (21/36)   | 81.98 (91/111)  | 88.89 (32/36)
       | 100 (67/67)      | 43.75 (7/16)    | 50.75 (34/67)   | 68.75 (11/16)
       | 100 (84/84)      | 53.57 (15/28)   | 53.57 (45/84)   | 67.85 (19/28)
       | 100 (52/52)      | 76.47 (13/17)   | 63.46 (33/52)   | 64.71 (11/17)
       | 100 (61/61)      | 66.67 (12/18)   | 81.97 (50/61)   | 83.33 (15/18)
       | 100 (43/43)      | 66.67 (8/12)    | 39.53 (17/43)   | 33.33 (4/12)
       | 100 (102/102)    | 76.19 (16/21)   | 81.37 (83/102)  | 76.19 (16/21)
       | 100 (138/138)    | 100 (29/29)     | 58.70 (81/138)  | 55.17 (16/29)
क्ष     | 100 (57/57)      | 58.33 (7/12)    | 38.60 (22/57)   | 50.00 (6/12)
त्र     | 100 (94/94)      | 100.00 (20/20)  | 97.87 (92/94)   | 95.00 (19/20)
ज्ञ     | 100 (90/90)      | 47.62 (10/21)   | 90.00 (81/90)   | 80.95 (17/21)
श्र     | 100 (77/77)      | 72.00 (18/25)   | 88.31 (68/77)   | 92.00 (23/25)

Analysis:
i. There is a large variation in the way different signers perform the same sign. The hand shape remains basically the same in 3D space, but the shape seen by the camera is strongly affected by rotation of the hand about different axes. This lowers the accuracy for some signs.
ii. The finger count helps to increase the accuracy.
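The letter-wise entries in the tables of this section (correct predictions over total samples per sign) can be computed from a model's predictions as in the following sketch; the label lists are placeholders.

```python
# Sketch only: per-sign accuracy as (correct, total, percentage) from predictions.
from collections import Counter

def letter_wise_accuracy(true_labels, predicted_labels):
    total = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {letter: (correct[letter], total[letter],
                     100.0 * correct[letter] / total[letter])
            for letter in total}
```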


7.3.2. Letter wise Accuracy without finger count
How much the accuracy degrades when the finger count is not used is noted below.

Letter | PCA training accuracy in % (correct/total) | PCA testing accuracy in % (correct/total) | HU training accuracy in % (correct/total) | HU testing accuracy in % (correct/total)
- | 100 (80/80) | 31.25 (5/16) | 77.50 (62/80) | 43.75 (7/16)
- | 100 (61/61) | 52.94 (9/17) | 39.34 (24/61) | 47.05 (8/17)
- | 100 (79/79) | 26.31 (5/19) | 55.69 (44/79) | 47.36 (9/19)
- | 100 (79/79) | 42.85 (9/21) | 44.30 (35/79) | 57.14 (12/21)
- | 100 (59/59) | 16.00 (4/25) | 55.93 (33/59) | 52.00 (13/25)
- | 100 (53/53) | 38.88 (7/18) | 69.81 (37/53) | 72.22 (13/18)
- | 100 (63/63) | 64.28 (9/14) | 0 (0/63) | 0 (0/14)
- | 100 (92/92) | 71.42 (15/21) | 90.21 (83/92) | 85.71 (18/21)
- | 100 (116/116) | 61.53 (16/26) | 65.51 (76/116) | 57.69 (15/26)
- | 100 (123/123) | 51.85 (14/27) | 46.34 (57/123) | 33.33 (9/27)
- | 100 (24/24) | 50.00 (4/8) | 66.66 (16/24) | 25.00 (2/8)
- | 100 (82/82) | 50.00 (9/18) | 59.75 (49/82) | 61.11 (11/18)
- | 100 (18/18) | 28.57 (2/7) | 0 (0/18) | 0 (0/7)
- | 100 (66/66) | 33.33 (8/24) | 24.24 (16/66) | 37.50 (9/24)
- | 100 (106/106) | 78.94 (15/19) | 55.66 (59/106) | 42.10 (8/19)
- | 100 (74/74) | 42.85 (6/14) | 47.29 (35/74) | 57.14 (8/14)
- | 100 (160/160) | 100.00 (21/21) | 50.00 (80/160) | 38.09 (8/21)
- | 100 (53/53) | 63.63 (7/11) | 84.90 (45/53) | 63.63 (7/11)
- | 100 (117/117) | 70.58 (12/17) | 55.55 (65/117) | 52.94 (9/17)
- | 100 (113/113) | 50.00 (11/22) | 82.30 (93/113) | 72.72 (16/22)
- | 100 (92/92) | 72.00 (18/25) | 70.65 (65/92) | 76.00 (19/25)
- | 100 (68/68) | 41.66 (10/24) | 42.64 (29/68) | 37.50 (9/24)
- | 100 (49/49) | 35.71 (5/14) | 0 (0/49) | 0 (0/14)
- | 100 (58/58) | 38.88 (7/18) | 77.58 (45/58) | 66.66 (12/18)
- | 100 (80/80) | 47.61 (10/21) | 20.00 (16/80) | 28.57 (6/21)
- | 100 (111/111) | 58.33 (21/36) | 62.16 (69/111) | 72.22 (26/36)
- | 100 (67/67) | 43.75 (7/16) | 32.83 (22/67) | 68.75 (11/16)
- | 100 (84/84) | 53.57 (15/28) | 53.57 (45/84) | 71.42 (20/28)
- | 100 (52/52) | 76.47 (13/17) | 21.15 (11/52) | 5.88 (1/17)
- | 100 (61/61) | 66.66 (12/18) | 80.32 (49/61) | 83.33 (15/18)
- | 100 (43/43) | 66.66 (8/12) | 13.95 (6/43) | 0 (0/12)
- | 100 (102/102) | 76.19 (16/21) | 81.37 (83/102) | 76.19 (16/21)
- | 100 (138/138) | 75.86 (22/29) | 58.69 (81/138) | 55.17 (16/29)
क्ष | 100 (57/57) | 58.33 (7/12) | 38.59 (22/57) | 50.00 (6/12)
त्र | 100 (94/94) | 80.00 (16/20) | 81.91 (77/94) | 85.00 (17/20)
ज्ञ | 100 (90/90) | 47.61 (10/21) | 88.88 (80/90) | 76.19 (16/21)
श्र | 100 (77/77) | 72.00 (18/25) | 61.03 (47/77) | 80.00 (20/25)

Analysis:
i. The finger count decreases the number of probable classes for a sign, thus increasing the accuracy.
ii. It is often difficult to count the fingers when the finger projection is not distinct. This makes the finger-count step critical: if a wrong finger count sends the frame into the wrong branch, there is no chance of getting the right prediction (see the sketch below).
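A minimal sketch of the branch-then-classify step discussed in point ii is given here. It assumes the finger count has already been estimated by an upstream stage and that `branch_models` maps each finger-count branch number to a trained classifier; both names are placeholders, not part of the original implementation.

```python
# Sketch only: the detected finger count selects one branch-specific SVM.
# A wrong count routes the frame to a model whose class set cannot contain the right sign.
def classify_with_finger_count(features, finger_count, branch_models):
    """branch_models: dict mapping a finger-count branch number to a trained classifier."""
    if finger_count not in branch_models:
        return None                              # no branch for this count: no prediction
    branch_svm = branch_models[finger_count]
    return branch_svm.predict([features])[0]     # prediction is restricted to that branch
```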

7.3.3. Time efficiency for SVM versus PCA+SVM

Number of training data | Time for SVM training of RAW features (dimension = 768) | Time for SVM training of PCA-transformed features (dimension = 3)
2941 | 6 hrs 16 min | 10 sec + 10 min 12 sec = 10 min 22 sec

7.3.4. Number of Support Vectors

Branch number | No. of training data | Number of support vectors: PCA (all features) | PCA (only 3 features) | HU | RAW
0 | 835 | 820 | 820 | 777 | 820
1 | 1002 | 989 | 989 | 812 | 989
2 | 848 | 840 | 840 | 635 | 840
3 | 256 | 245 | 245 | 114 | 245

8. FUTURE SYSTEM ENHANCEMENT
i. Increase the number of recognized hand postures
The system currently recognizes only 37 static signs. As the number of signs increases, the data-extraction time and the system-training time also increase, and due to limited time we could not add all the signs. The same system can work without modification if the training data are extended to include new hand postures.

ii. Include spatio-temporal gestures
The system handles only static hand postures. Many hand gestures have temporal as well as spatial properties, and all such signs could be included.

iii. Detect non-manual signs
Lexicons in an SLR system comprise not only static and spatio-temporal gestures but also facial expressions and body/torso movements. Analyzing facial expression and torso movement could increase the accuracy of the system, so this would be an important enhancement.

iv. Implement Natural Language Processing (NLP)
When a letter, say क, is followed by हलन्त, the letter has to be rendered as क्. Such cases must be handled by understanding the Nepali language, and the NLP task becomes even larger once whole words are included (a minimal sketch of the हलन्त case follows this list).

v. Implement speech to 3D modeller
To make two-way communication possible, we need a system that converts speech into a 3D model that can be understood by hearing-impaired people.

vi. Increase the training dataset to increase the accuracy
Our system has a limited number of trainers, which strongly affects the accuracy. A larger training dataset, obtained either by applying distortion algorithms or by manually recording more training frames, would increase the accuracy of the system.

vii. A similar system for mobile phones
Once the fully automated communication system is implemented for computers, a similar system could be implemented for mobile phones for use in outdoor environments.
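The हलन्त case mentioned in item iv amounts to appending the Devanagari virama (U+094D) to the preceding consonant so that क followed by हलन्त is stored and rendered as क्. A minimal sketch:

```python
# Sketch only: combining a recognized consonant with a recognized halanta sign.
HALANTA = "\u094D"   # Devanagari virama (हलन्त)

def append_sign(text, sign):
    """Append a recognized sign; a halanta simply attaches to the previous consonant."""
    return text + sign          # "क" + "\u094D" -> "क्"

print(append_sign("क", HALANTA))   # prints क्
```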


9. CONCLUSION
The project “Voice for the Voiceless, Nepali Sign Language To Speech Converter” aims at automating the recognition of hand postures of Nepali Sign Language. We hope that our work will serve as a foundation for easing communication between normal people and hearing-impaired people. The system mainly focuses on image processing, feature extraction and pattern recognition. During development we studied many research works in these fields and, based on their results and conclusions, selected the algorithms that best addressed our project scenario, carefully weighing trade-offs at every step. The results obtained still show fairly low accuracy, but this is likely to improve as the quantity of near-ideal training data increases. Despite our effort, much work and research remains in the field of automated communication between normal people and hearing-impaired people, since a complete system requires the hand posture recognition system together with other components such as a hand gesture recognition system, a Natural Language Processing system and a Speech to 3D Modeler.


10. REFERENCES
[1] Daniel Kelly, Computational Models for the Automatic Learning and Recognition of Irish Sign Language
[2] O'Reilly, Learning OpenCV
[3] Robin Hewitt, Seeing with OpenCV: Face Recognition with Eigenface
[4] Lindsay I. Smith, A Tutorial on Principal Components Analysis
[5] Ming-Kuei Hu, Visual Pattern Recognition by Moment Invariants
[6] Christopher J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition
[7] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, A Practical Guide to Support Vector Classification


11. BIBLIOGRAPHY
[1] Gary Bradski and Adrian Kaehler, Learning OpenCV, 2008
[2] नेपाली साङ्केतिक भाषा शब्दकोष (Nepali Sign Language Dictionary)
[3] Chih-Chung Chang and Chih-Jen Lin, LIBSVM: A Library for Support Vector Machines
[4] Dorin Comaniciu, Visvanathan Ramesh, Peter Meer, Real-Time Tracking of Non-Rigid Objects using Mean Shift
[5] Vladimir I. Pavlovic, Rajeev Sharma, and Thomas S. Huang, Visual Interpretation of Hand Gestures for Human-Computer Interaction: A Review
[6] S. Chandrasekaran, B. S. Manjunath, Y. F. Wang, J. Winkeler, and H. Zhang, An Eigenspace Update Algorithm for Image Analysis
[7] Namita Gupta, Pooja Mittal, Sumantra Dutta Roy, Santanu Chaudhury, Subhashis Banerjee, Developing a Gesture-based Interface
[8] Dorin Comaniciu, Visvanathan Ramesh, Peter Meer, Real-Time Tracking of Non-Rigid Objects using Mean Shift
[9] Dr. Jane J. Stephan, Sana'a Khudayer, Gesture Recognition for Human-Computer Interaction (HCI)
[10] Thierry Messer, Static Hand Gesture Recognition
[11] http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
[12] Oya Aran, M.S. in CmpE., Bogazici University, 2002, Vision Based Sign Language Recognition: Modeling and Recognizing Isolated Signs with Manual and Non-Manual Components
[13] Muhammad Yousuf Bin Azhar, Israr Ahmed, Sameer Rafiq, Suleman Mumtaz, Mehmood Usman, Razi Ur Rehman, Boltay Haath: Pakistan Sign Language Recognition
[14] Rubén San-Segundo, Verónica López, Raquel Martín, David Sánchez, Adolfo García, Language Resources for Spanish - Spanish Sign Language (LSE) Translation
[15] Peter Wray Vamplew, Recognition of Sign Language Using Neural Networks
[16] Xiaoming Yin and Ming Xie, Hand Posture Segmentation, Recognition and Application for Human-Robot Interaction
[17] http://www.cambridgeincolour.com/tutorials/histograms1.htm
[18] http://saravananthirumuruganathan.wordpress.com/2010/04/01/introduction-to-mean-shift-algorithm/
[19] http://www.cs.unc.edu/~welch/media/pdf/maybeck_ch1.pdf
[20] Omer Rashid, Ayoub Al-Hamadi, Axel Panning and Bernd Michaelis, Posture Recognition using Combined Statistical and Geometrical Feature Vectors based on SVM
[21] Xuechuan Wang, Posture Recognition using Combined Statistical and Geometrical Feature Vectors based on SVM
[22] Jan Flusser, Tomáš Suk and Barbara Zitová, Moments and Moment Invariants in Pattern Recognition
[23] Dorin Comaniciu, Visvanathan Ramesh, Peter Meer, Real-Time Tracking of Non-Rigid Objects using Mean Shift
[24] Caifeng Shan, Yucheng Wei, Tieniu Tan, Frédéric Ojardias, Real Time Hand Tracking by Combining Particle Filtering and Mean Shift
[25] Bruno Fernandes, Joaquin Fernández, Using Haar-like Feature Classifiers for Hand Tracking in Tabletop Augmented Reality
[26] http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[27] Philipp Michel, Churchill College, Support Vector Machines in Automated Emotion Classification

12. APPENDIX A: NSL ALPHABETS AND NORMALIZED IMAGE


13. APPENDIX B: VFV DATABASE TRAINERS
Students of the School for the Deaf and Dumb, Naxal, cooperated with us and helped to build the VFV database. Video was recorded from the webcam, and the frames extracted from the video were used to train the system. The videos include the gestures for the Nepali alphabets क, ख, ग, घ, ......, ज्ञ.
1. Aarati Tamang
2. Aayushma Karki
3. Bhawani Silwal
4. Binita Thapa
5. Nisha Lama
6. Pooja Yadav
7. Pradip Parajuli
8. Pujata Aryal
9. Pushpa Khadka
10. Sangita Acharya
11. Sangita Basnet
12. Sanoj Maharjan
13. Santosh Nepal
14. Sharmila Dangol
15. Sujana Limbu


14. APPENDIX C: DETAIL RESULT
Taking the PCA-transformed data with the maximum number of dimensions available.

1. Aarati Tamang
Branch number | No. of training data | No. of testing data | Training data accuracy (in %): PCA | HU | RAW | Testing data accuracy (in %): PCA | HU | RAW | Remarks
0 | 31 | 19 | 100 | 41.94 | 100 | 0 | 89.47 | 0 | The data for branch 0 were not ideal.
1 | 55 | 47 | 90.91 | 69.09 | 90.91 | 17.02 | 91.49 | 17.02 |
2 | 37 | 46 | 100 | 56.76 | 100 | 17.39 | 67.39 | 17.39 |
3 | 12 | 2 | 100 | 100 | 100 | 100 | 100 | 100 |

2. Aayusma Karki
Branch number | No. of training data | No. of testing data | Training data accuracy (in %): PCA | HU | RAW | Testing data accuracy (in %): PCA | HU | RAW
0 | 17 | 4 | 100 | 52.94 | 100 | 50 | 75 | 50
1 | 33 | 4 | 100 | 42.42 | 100 | 100 | 100 | 100
2 | 44 | 15 | 100 | 77.27 | 100 | 25 | 87.5 | 25
3 | 9 | 2 | 100 | 88.88 | 100 | 100 | 100 | 100

3. Deepa Basnet
Branch number | No. of training data | No. of testing data | Training data accuracy (in %): PCA | HU | RAW | Testing data accuracy (in %): PCA | HU | RAW
0 | 94 | 8 | 100 | 24.47 | 100 | 25 | 37.5 | 25
1 | 115 | 7 | 100 | 78.26 | 100 | 42.86 | 71.43 | 42.86
2 | 88 | 7 | 100 | 48.86 | 100 | 71.43 | 85.71 | 71.43
3 | 29 | 2 | 100 | 79.31 | 100 | 100 | 100 | 100

4. Nisha Lama
Branch number | No. of training data | No. of testing data | Training data accuracy (in %): PCA | HU | RAW | Testing data accuracy (in %): PCA | HU | RAW
0 | 36 | 16 | 100 | 66.67 | 100 | 37.5 | 56.25 | 37.5
1 | 36 | 15 | 100 | 50 | 100 | 46.67 | 60 | 46.67
2 | 38 | 14 | 100 | 39.47 | 100 | 42.86 | 78.57 | 42.86
3 | 5 | 3 | 100 | 100 | 100 | 100 | 100 | 100

5. Pooja Yadab
Branch number | No. of training data | No. of testing data | Training data accuracy (in %): PCA | HU | RAW | Testing data accuracy (in %): PCA | HU | RAW
0 | 109 | 25 | 100 | 39.45 | 100 | 88 | 84 | 88
1 | 56 | 16 | 100 | 19.64 | 100 | 75 | 100 | 75
2 | 39 | 13 | 100 | 43.59 | 100 | 76.92 | 92.31 | 76.92
3 | 5 | 1 | 100 | 20 | 100 | 100 | 100 | 100

6. Pradeep Parajuli
Branch number | No. of training data | No. of testing data | Training data accuracy (in %): PCA | HU | RAW | Testing data accuracy (in %): PCA | HU | RAW
0 | 145 | 23 | 100 | 46.90 | 100 | 65.21 | 86.95 | 65.21
1 | 197 | 27 | 100 | 75.13 | 100 | 77.78 | 88.89 | 77.78
2 | 144 | 32 | 100 | 78.47 | 100 | 65.63 | 93.75 | 65.63
3 | 35 | 10 | 100 | 100 | 100 | 100 | 100 | 100

7. Sanoj Maharjan
Branch number | No. of training data | No. of testing data | Training data accuracy (in %): PCA | HU | RAW | Testing data accuracy (in %): PCA | HU | RAW
0 | 146 | 25 | 100 | 33.56 | 100 | 68 | 76 | 68
1 | 195 | 40 | 100 | 58.97 | 100 | 75 | 85 | 75
2 | 201 | 38 | 100 | 82.59 | 100 | 76.31 | 97.36 | 76.31
3 | 67 | 13 | 100 | 97.01 | 100 | 84.61 | 100 | 84.61
