A New Approach for Eye Detection in Remote Gaze-Estimation Systems

by

Jerry Chi Ling Lam

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto

© Copyright by Jerry Chi Ling Lam 2007

Abstract

A New Approach for Eye Detection in Remote Gaze-Estimation Systems

Jerry Chi Ling Lam
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2007

Neural network and face symmetry algorithms were developed for eye detection and eye identification (i.e. left eye or right eye). A convolutional neural network (CNN) with six layers was designed and trained to detect and identify eyes in video images from a remote gaze estimation system. To improve the eye identification performance of the CNN, a face symmetry algorithm that is based on the symmetry of local facial features was designed and integrated with the CNN into a single structure. Experiments with 3 subjects showed that for the full range of expected head movements, the CNN achieved an eye detection rate of 95.2% with a false alarm rate of 2.65 × 10⁻⁴%. The combined CNN and face symmetry algorithm for eye identification achieved an identification rate of 99.4% with a rejection rate (i.e. eyes that cannot be identified) of 0.6%.


Acknowledgements

I would like to thank some individuals who have helped me to make this thesis possible. I am greatly indebted to my supervisors, Professor Eizenman and Professor Aarabi, for their valuable advice over the course of this work. I would especially like to thank Professor Eizenman for the immeasurable value of his guidance throughout the thesis. To my fellow lab members, Jeff Kang and Elias Guestrin, I would like to express my appreciation for providing advice and assistance over the past two years. I learned a lot from each of them. I would like to acknowledge the financial support provided by the Natural Sciences and Engineering Research Council of Canada. Last, but not least, I would like to express my gratitude to my parents and my sister for their encouragement and support over the past 7 years of study. They are always with me whenever I need them.


Contents

1 Introduction
  1.1 Literature Review: Eye Detection
  1.2 Feature-Based Eye Detection
      1.2.1 Face Detection
      1.2.2 Feature Extraction and Detection
  1.3 Pattern-Based Eye Detection
      1.3.1 Eigen-Eye
      1.3.2 Support Vector Machine
      1.3.3 Neural Networks
  1.4 Research Objectives

2 Eye Detection
  2.1 Neural Networks Methodology
  2.2 CNN Architecture for Eye Detection
  2.3 Training Database
  2.4 The Training of Convolutional Neural Networks
  2.5 Convolutional Neural Network Model Selection
  2.6 Determining the System Performance
  2.7 Eye Localization Algorithm
  2.8 Experimental Results
  2.9 Conclusions

3 Midline Detection
  3.1 Symmetry-Based Methodology
  3.2 Local Symmetry Algorithm
      3.2.1 Face Detection
      3.2.2 Local Symmetry Detection
  3.3 Performance Comparison: Face Symmetry Algorithm and Local Symmetry Algorithm
  3.4 Conclusions

4 Eye Identification
  4.1 Criteria for Eye Identification by CNN
  4.2 Criteria for Eye Identification by Local Symmetry Algorithm
  4.3 Experimental Results and Discussions
  4.4 Conclusions
      4.4.1 Future Work

Bibliography

List of Tables

2.1 Network Connection between Layer S1 and Layer C2
2.2 Generalization Performance of Different Architectures
3.1 Experimental Results: Face Symmetry Algorithm
3.2 Experimental Results: Local Symmetry Algorithm

List of Figures

1.1 Visual Acuity Assessment Based on Visual Scanning Pattern
1.2 Feature-based Image Processing Chain
1.3 A Typical Image Captured by the Eye Tracker's Camera
1.4 Pattern-based Image Processing Chain
1.5 Sample Eye Images
2.1 A Typical Convolutional Neural Network
2.2 Hidden Unit Connections
2.3 Framework for the CNN Architecture
2.4 Sample Eye Images
2.5 Mirrored Eye Images
2.6 Rotated Eye Images
2.7 Intensity Transformed Eye Images
2.8 Sample of Non-Eye Images
2.9 Average Error vs. the Number of Weights
2.10 Network Training Phase Diagram
2.11 ROC Curve
2.12 Eye Localization Algorithm
2.13 Network Response Images of Different Scaled Images
2.14 Image Coordinate System
2.15 Convolutional Neural Network's False Negatives and False Positives
2.16 Translational Head Movements
2.17 Rotational Head Movements
3.1 Graphical Representation of Face Symmetry
3.2 Face Detection
3.3 Graphical Representation of Local Symmetric Features
3.4 Symmetry Function
3.5 Asymmetric Projection
3.6 Non-Uniform Illumination
3.7 Local Symmetry Algorithm Applied on Head-Tilted Image
3.8 Translational Head Movements
3.9 Rotational Head Movements
4.1 Confidence-Based Fusion for Eye Identification
4.2 False Eye Identification
4.3 Eye Identification Performance using CNN
4.4 Correct Eye Identification by Local Symmetry Algorithm
4.5 False Eye Identification by Local Symmetry Algorithm
4.6 Eye Identification Performance using Local Symmetry Algorithm
4.7 Eye Rejection Performance using Local Symmetry Algorithm
4.8 CNN Approach for Eye Identification
4.9 Local Symmetry Algorithm for Eye Identification
4.10 Correct Eye Identification Using Hybrid Approach
4.11 Eye Unidentified by Hybrid Approach

Chapter 1

Introduction

One of the long-term research goals in pediatric ophthalmology is to develop a methodology that will provide reliable and objective assessment of visual functions in preverbal subjects. To reach this goal, Dr. Eizenman and his research group at the University of Toronto proposed a novel methodology that is based on the analysis of infants' visual scanning behavior. The remote gaze estimation system that is described in this thesis is a key component of the proposed methodology.

When infants are presented with a visual stimulus that contains similar patterns, their visual attention is distributed approximately equally between the different patterns. If one of the patterns has a unique characteristic (e.g. color, motion), it will tend to attract more of the infants' attention as long as they can detect this unique characteristic. If, for example, infants view a visual stimulus that includes several static checkerboards and one flickering checkerboard (in a flickering checkerboard the black and white squares alternate), their attention will be drawn to the flickering checkerboard. When the check-size of the checkerboards is small enough, they can no longer differentiate between the static and flickering checkerboards. The check-size for which an infant can no longer differentiate between the static and flickering checkerboards can serve as an estimate of his/her visual acuity. In a preliminary study with adult subjects, we have shown a high correlation (R = 0.94) between the minimum check-size for which the static and flickering checkerboards can be differentiated and the subject's visual acuity (as measured by a standard eye chart).

When subjects can differentiate between the static and flickering checkerboards, the typical scanning pattern includes relatively short fixations on the static checkerboards and longer fixations and re-fixations on the flickering checkerboard (Figure 1.1a). As the check-size of the flickering checkerboard is reduced, the visual scanning patterns of the static and flickering checkerboards become similar (Figure 1.1b). By determining the check-size for which the visual scanning patterns of the static and flickering checkerboard are the same, a remote gaze estimation system [1] that monitors the infant's visual scanning patterns can determine the infant's visual acuity.

Figure 1.1: Visual Acuity Assessment Based on Visual Scanning Pattern. Each dot indicates one fixation point. At the beginning of each assessment, the subject is looking at the center of the screen (thick border: flickering checkerboard; light border: static checkerboards): (a) the subject can see the flickering checkerboard and moves his attention from the center to the flickering checkerboard; (b) the subject cannot see the flickering checkerboard and searches for it on the screen.

The current remote gaze estimation system allows for a very limited range of head movements and requires that the subject's head be supported by a chin-rest. Also, the current system assumes that only one of the subject's eyes is in the field-of-view of the eye-tracker's video camera, so there is no need to differentiate between the left and right eyes. To monitor infants' visual scanning patterns, it is important to increase the allowable range of head movements. With an increased range of head movements, both eyes will be in the field-of-view of the eye-tracker's camera and it will be important to determine the identity (left or right) of the tracked eyes. The goal of the research reported in this thesis is to develop algorithms for the detection and identification of the right and left eyes when the range of translational head movements is increased from 6 × 6 × 6 cm³ (the current system) to 20 × 20 × 20 cm³. Note that the current remote gaze estimation system uses a chin-rest and therefore rotational head movements are very limited. Another goal of the algorithm is to allow the remote gaze estimation system to detect the left and right eyes with rotational head movements of ±20° in the roll, pitch and yaw directions. In the next section, I will provide a literature review of current eye detection techniques.

1.1 Literature Review: Eye Detection

Eye detection is often an important first step in applications such as face recognition, human computer communications and the analysis of visual scanning patterns. In face recognition, finding the eyes, nose and mouth and their geometrical relationships helps to identify the subject [2]. In human computer communications, eye blinks [3] and eye movements [4] can be used as an alternative input modality to allow people with severe disabilities to access a computer. The analysis of visual scanning patterns is used in the quantification of mood disorders [5], studies of perception, attention and learning disorders [6, 7], driving research and safety [8, 9, 10], pilot training [11] and ergonomics [12].

Eye detection requires two steps: a) the detection of regions that contain the eyes, and b) the identification of the left and right eyes. The goal of this thesis is to develop algorithms for reliable detection and identification of the right and left eyes. Following the detection of the right and left eyes, algorithms that were developed by Dr. Eizenman's group will be used to determine the coordinates of several eye features [13, 14] that are used to compute the subject's point-of-gaze [1].

Eye detection methods can be broadly classified into feature-based approaches and pattern-based approaches. In the following sections, a review of feature-based and pattern-based approaches is presented.

1.2 Feature-Based Eye Detection

Feature-based approaches for eye detection require three operations: preprocessing, feature extraction and detection (Figure 1.2). The preprocessing operations separate the foreground (faces) from the background (we will refer to this stage as face detection). The feature extraction operations use distinctive features of the eyes or other facial features to detect and identify eye regions. The detection module is designed to satisfy a set of operation criteria, such as a specific false positive rate, and often makes use of the geometrical relationships between facial features to detect the left and right eyes. Each module is usually designed by experts who have domain expertise in the applicable environment for eye detection. In the next two sections, a review of face detection methods is followed by a review of facial feature extraction and detection methods.

Figure 1.2: Image Processing Chain for Eye Region Detection: Feature-based Approach


1.2.1 Face Detection

Skin color is one of the most widely used cues for face detection because the distinctive color of human skin is highly robust to changes in face illumination and to geometric variations among people. In these face detection algorithms, the color of each pixel is first normalized. For example, if one selects the Red, Green and Blue (RGB) color space, the normalized RGB representation, which is invariant to changes in the relative orientation of the subject's face and the light source [15], is:

r = \frac{R}{R+G+B}, \quad g = \frac{G}{R+G+B}, \quad b = \frac{B}{R+G+B} \qquad (1.1)

The normalization helps to reduce the dependency of the color components on the brightness of the source [16]. The choice of the optimum color space is application dependent. For instance, Zarit et al. [17], McKenna et al. [18] and Sigal et al. [19] used the HSI (Hue, Saturation, Intensity) color space because it supports easier segmentation of different facial features (i.e. lips, eyes and eyebrows), while Terrillon et al. [20] and Brown et al. [21] used the TSL (Tint, Saturation, Lightness) color space because it produces the best segmentation results (highest rate of face detection) among the normalized color spaces, and Phung et al. [22] used Y Cr Cb because it is less sensitive to large changes in lighting.

After the color has been normalized, the next step is to classify each pixel as a skin or a non-skin pixel. This is usually accomplished by introducing a metric that measures the distance of each normalized pixel color from a standard skin tone. For example [23], (R, G, B) is classified as skin if:

R > 96 and G > 40 and B > 20 and
max(R, G, B) − min(R, G, B) > 15 and |R − G| > 15 and R > G and R > B     (1.2)

The advantage of this classifier is its simplicity. However, it relies heavily on rigid decision rules that might impair its performance when the face is illuminated by different light sources (e.g. incandescent, fluorescent, sunlight).
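As an illustration, the rule in (1.2) can be written directly as a small function; the function name and the example pixel values below are mine, not part of the cited classifier.

```python
def is_skin_pixel(r, g, b):
    """Rule-based skin classifier of equation 1.2 (R, G, B in 0-255)."""
    return (r > 96 and g > 40 and b > 20 and
            max(r, g, b) - min(r, g, b) > 15 and
            abs(r - g) > 15 and r > g and r > b)

# Example: a reddish skin tone passes the rule, a gray background pixel does not.
print(is_skin_pixel(200, 120, 90))   # True
print(is_skin_pixel(100, 100, 100))  # False
```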


Other researchers suggest the use of machine learning techniques such as a Bayes classifier [24], a self-organizing map [21] and a mixture of Gaussians [25] for more robust classification, because these techniques estimate the skin distribution from training data without deriving an explicit model of the skin color. However, they require large memory storage and the training procedure might take a long time to converge to a global/local optimal solution.

After the pixels are classified based on their color content, the skin pixels are clustered to define the face region. A simple heuristic method that is described in [26] and [27] detects the largest connected region of skin pixels and defines it as the face region. Although techniques based on skin color have been shown to be robust for face detection, the remote gaze estimation system uses monochromatic infrared light to illuminate the subject's face and therefore color information cannot be used for face detection.

Govindaraju [28] proposed to use edge information for face detection. In his algorithm, edge operators are used to determine contours of objects in the image. The contours are then segmented according to the discontinuity in local curvature. Based on a face model which has two relatively flat (i.e. minimum concavity) vertical segments and one relatively flat horizontal segment, the segments are grouped to form face candidates. Face candidates that satisfy the specific width-to-height ratio of an ideal face are detected as faces in the image.

Other techniques rely on the physical appearance of higher-level features to detect faces. For instance, Gunn et al. [29] and Huang et al. [30] use snake algorithms to detect faces. The snake algorithm is a methodology to extract contours of objects. It is an energy-minimizing spline, guided by external constraint forces and influenced by image forces that pull it toward features such as lines and edges. In Huang et al. [30], the snake is first initialized in the proximity of the head boundary and then the face boundary is refined iteratively by minimizing an energy function:

E_{snake} = E_{continuity} + E_{curvature} + E_{image} \qquad (1.3)


where E_{continuity} and E_{curvature} are the energies associated with the shape of the snake and E_{image} is the energy associated with the image that is surrounded by the snake. The shape of the snake is designed to fit the face contour, which is usually smooth, and the average curvature along the contour is usually small. In equation 1.3, E_{continuity} and E_{curvature} are defined so that the magnitude of E_{continuity} decreases as the snake contour becomes smoother and the magnitude of E_{curvature} decreases as the average curvature along the contour decreases. Since face images are usually brighter than the background, E_{image} is defined as:

E_{image} = E_{line} + E_{edge} \qquad (1.4)

where E_{line} decreases with brighter pixels in the area surrounded by the snake and E_{edge} decreases when more edges are included in the contour of the snake. Since video images from the camera of the gaze estimation system exhibit relatively high contrast between the subject's face and the background (as shown in Figure 1.3), the snake algorithm can be simplified. In Chapter 3, we will show that the gaze estimation system uses an algorithm that tends to maximize the brightness within the contour of the estimated face boundary.

Figure 1.3: A Typical Image Captured by the Eye Tracker’s Camera

1.2.2 Feature Extraction and Detection

One of the simplest and most efficient algorithms to locate the eyes within the face is based on image intensity. Since pupils are usually darker than the rest of the face, researchers [27] suggested searching within the face for dark regions by iteratively thresholding the face image to locate eye candidates. Two eye candidates that satisfy certain anthropometric constraints are then identified as the left and right eyes. This approach is reliable as long as subjects remain relatively frontal with respect to the camera and do not wear eyeglasses (i.e. glare from the eyeglasses sometimes interferes with the pupil image).

Feng et al. [31] suggested locating the eye boundary (i.e. the upper eyelid, the lower eyelid and the eye corners) based on changes in projection functions that guide the search of eye regions. They defined a vertical projection function as the integral of pixel intensity along the y-axis of an eye image and a horizontal projection function as the integral of pixel intensity along the x-axis of an eye image. Eyelids are detected when there is a rapid change in the horizontal projection function and eye corners are detected when there is a rapid change in the vertical projection function. This methodology is not very robust since the projection functions are sensitive to changes in head rotation and to non-uniform illumination of the face.

Several researchers [32, 33, 34] proposed to design deformable eye templates that can be translated, rotated and deformed to fit the best representation of eyes in images. In this technique, the eye position is obtained through a recursive process that minimizes a specific energy function. In [32], knowing that the intensity of the iris is very low, one of the energy functions is designed such that the energy is minimized when the total brightness inside the iris candidate is small. While this method can detect eyes accurately, it requires the eye model to be initialized in close proximity to the eyes. Furthermore, it is computationally expensive and requires good image contrast.

Instead of finding the eye features alone, several researchers suggested using other facial features to locate the eye regions. For example, in [26] the geometric relations of the eyebrows and nostrils are used to locate the eye regions. Other researchers [35] used the eye corners and the nostrils as feature points to locate the eye regions. Shinjiro Kawato et al. [36] used the forehead, the nose bridge, the eyes and the eyebrows to design a filter that estimates the location of a mid-point between the eyes. Eyes are subsequently detected as two dark regions, symmetrically located on each side of this mid-point.

Feng et al. [37] proposed to use three cues to find the exact location of the eyes. The first cue is associated with the low brightness of the pupil. The second cue is the direction of the line joining the centers of the eyes, which is determined by a principal component analysis (PCA) of the face edge image. The last cue is obtained from the convolution of an eye variance filter with the image. The eye variance filter is designed to match the expected variance in the brightness of an eye image. Based on these three cues, a cross-validation process is conducted to extract the true eye window candidates. This approach has many shortcomings. First, it is difficult to find an optimal rule for the cross-validation process. Second, finding eye regions based on pixel intensity is sensitive to changes in illumination. Third, using PCA to find the line joining the centers of the eyes is not robust since PCA is very sensitive to outliers. Lastly, the proposed variance filter suffers from a large false positive rate [37].
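To make the projection-function idea of Feng et al. [31] concrete, the following sketch computes the two integral projections of a grayscale patch and flags positions where they change rapidly; the simple threshold on the first difference is an illustrative stand-in, not the rule used in [31].

```python
import numpy as np

def projection_functions(patch):
    """Integral projections of a 2-D grayscale patch: one sum per row and per column."""
    horizontal = patch.sum(axis=1)  # row sums: rapid changes mark eyelids
    vertical = patch.sum(axis=0)    # column sums: rapid changes mark eye corners
    return horizontal, vertical

def rapid_changes(profile, k=2.0):
    """Indices where the profile's first difference is unusually large (> mean + k std)."""
    diff = np.abs(np.diff(profile.astype(float)))
    return np.where(diff > diff.mean() + k * diff.std())[0]

# Toy example: a dark horizontal band (an "eyelid/iris" region) inside a bright patch.
patch = np.full((36, 36), 200, dtype=np.uint8)
patch[14:22, :] = 40
h, v = projection_functions(patch)
print(rapid_changes(h))  # rows where the horizontal projection jumps (near 13 and 21)
```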

Since facial features exhibit large variability among subjects and under different experimental conditions, feature-based approaches that explicitly model facial features tend to work well with some subjects in some experiments but can fail completely with other subjects under the same experimental conditions. Also, many feature-based techniques are limited to relatively frontal views of faces. Based on the above observations, we will not use feature-based approaches for eye detection.

1.3 Pattern-Based Eye Detection

Pattern-based approaches are concerned with the automatic discovery of regularities in data through the use of computer algorithms, and with the use of these regularities to detect the eyes. Pattern-based approaches for eye detection have been shown to work well with different head poses and various lighting conditions [38, 39, 40, 41, 42, 43, 44]. The modules of the pattern-based algorithms are similar to those of the feature-based algorithms (Figure 1.4), except that the feature extraction module and the eye detection module are embedded in a model that is controlled by a set of free parameters. The free parameters are adjusted to minimize a criterion function via a training procedure that classifies images into eye and non-eye classes. The criterion function is designed to give a quantitative measure of the model's performance in terms of a detection rate. At the end of the training procedure, the model is optimized to differentiate between eye and non-eye images. The number of free parameters in the model determines the flexibility of the model and depends on the complexity of the eye image. The main function of the preprocessing module for the pattern-based approach is to standardize the eye and non-eye images. This involves subtracting the mean intensity from the images, enhancing the contrast of the images and scaling the images to fit the input size of the model.

Figure 1.4: Image Processing Chain for Eye Region Detection: Pattern-based Approach

Most pattern-based approaches use a window scanning technique to detect eye regions. The window scanning algorithm is in essence an exhaustive search of the input image for possible eye regions at all scales. Typically, the size of the scanning window, the sub-sampling rate of the image, the step size, and the number of iterations vary depending on the method proposed and the need for a computationally efficient system.
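A schematic version of the window scanning search is sketched below; the 36 × 36 window matches the network input used in Chapter 2, while the step size, the scale factor and the `classify_window` callable are illustrative placeholders rather than values taken from any particular method.

```python
import numpy as np

def resize_nn(img, factor):
    """Nearest-neighbour down-scaling of a 2-D array by `factor` (< 1)."""
    h, w = img.shape
    rows = (np.arange(int(h * factor)) / factor).astype(int)
    cols = (np.arange(int(w * factor)) / factor).astype(int)
    return img[np.ix_(rows, cols)]

def scan_image(image, classify_window, window=36, step=4, scale=1.25):
    """Exhaustive multi-scale window scan.

    Slides a window x window patch over progressively sub-sampled copies of the
    image and records the positions (mapped back to the original resolution)
    that the classifier labels as eye regions.
    """
    detections = []
    factor = 1.0
    current = image.astype(float)
    while min(current.shape) >= window:
        h, w = current.shape
        for y in range(0, h - window + 1, step):
            for x in range(0, w - window + 1, step):
                if classify_window(current[y:y + window, x:x + window]):
                    detections.append((int(x * factor), int(y * factor), factor))
        current = resize_nn(current, 1.0 / scale)
        factor *= scale
    return detections

# Toy usage: "detect" very bright 36 x 36 patches in a random image.
rng = np.random.default_rng(0)
img = rng.integers(0, 255, size=(128, 128))
hits = scan_image(img, classify_window=lambda p: p.mean() > 250)
print(len(hits))  # 0 for this random image
```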

In the following sections, a summary of the most popular pattern-based approaches for eye detection is presented.

1.3.1 Eigen-Eye

Eigen-Eye [38, 39, 40] uses Principal Component Analysis (PCA) [45] to compute a set of basis images that provides a low dimensional representation of all possible eye images. To classify an image pattern as an eye or a non-eye, the image pattern is mapped to the space formed by the basis images and a similarity measure is used to classify this pattern as an eye or a non-eye. The basis images are a subset of the eigenvectors of the covariance matrix of a large set of eye images. The basis images (eigenvectors) are selected based on the magnitude of their eigenvalues. The number of eigenvectors depends on the quality of the approximation. In [38], only 5 eigenvectors were used to represent the eye sub-space. The Eigen-Eye approach can fail to detect eye images that cannot be described by a simple linear combination of the basis images. Due to the large variability in expected eye shapes and experimental conditions (i.e. illumination), it is difficult to find a compact linear subspace that will provide support for the full range of expected eye images.
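A minimal numerical sketch of the Eigen-Eye idea follows, assuming a matrix of vectorized training eye crops: the top principal components span the eye subspace and a candidate patch is scored by its reconstruction error in that subspace. The 5-component choice mirrors the figure quoted from [38]; the random data and the threshold are illustrative only.

```python
import numpy as np

def fit_eigen_eyes(eye_patches, n_components=5):
    """eye_patches: (N, D) array of vectorized eye images. Returns (mean, basis)."""
    mean = eye_patches.mean(axis=0)
    # Principal components = top right-singular vectors of the centred data matrix.
    _, _, vt = np.linalg.svd(eye_patches - mean, full_matrices=False)
    return mean, vt[:n_components]                 # basis: (n_components, D)

def reconstruction_error(patch, mean, basis):
    """Distance between a patch and its projection onto the eye subspace."""
    centred = patch.ravel() - mean
    coeffs = basis @ centred
    return np.linalg.norm(centred - basis.T @ coeffs)

# Toy usage with random data standing in for real 36 x 36 eye crops.
rng = np.random.default_rng(1)
train = rng.normal(size=(200, 36 * 36))
mean, basis = fit_eigen_eyes(train)
candidate = rng.normal(size=(36, 36))
is_eye = reconstruction_error(candidate, mean, basis) < 30.0  # illustrative threshold
print(is_eye)
```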

1.3.2 Support Vector Machine

A Support Vector Machine (SVM) [46] is a learning machine in which the basis functions can be adapted to the data. To achieve this, the SVM defines basis functions that are centered on the training data points and then selects a subset of these points during training. One advantage of the SVM is that, although the training involves nonlinear optimization, the objective function is convex, and so any local solution is also a global optimum. The number of basis functions in the resulting models is generally much smaller than the number of training points, although it is often still relatively large and typically increases with the size of the training set.

SVMs have been applied to eye detection by using a large database of eye and non-eye images [41, 42]. The difference between various SVM techniques is associated with the choice of kernel functions in the feature extraction module and in the preprocessing steps. One of the disadvantages of the SVM is that the model designer has to make subjective decisions about the type of kernel to be used before the training procedure. This might not be ideal because it is difficult to justify quantitatively the choice of a specific kernel for a specific application. Also, the number of eye images that are selected as basis functions is often too large for real-time eye detection.

1.3.3 Neural Networks

Multi-layer feedforward networks (Neural Networks) [47] are systems of interconnected artificial neurons working together to produce an output function. The outputs of Neural Networks rely on the cooperation of the individual neurons within the networks. Neural Networks process information in parallel; they can perform their overall function even if some of the neurons are not functioning, which is the reason why they are robust to error or failure. Also, Neural Networks are self-adaptive systems; they learn to solve complex problems from a set of examples and generalize the acquired knowledge to unseen data. There are numerous algorithms available to train Neural Networks. Most of them employ some form of gradient descent: the derivative of the cost function with respect to the network parameters is computed, and the parameters are then changed in the direction that minimizes the cost function. Neural Networks have been applied successfully in applications such as hand-written character recognition [48], speech recognition [49], credit card fraud detection [50] and eye detection [43, 44].

The output of a Neural Network for eye detection is a posterior probability that represents the confidence level that the input image is an eye image. To classify an image pattern as eye or non-eye, the posterior probability is compared with a threshold. Neural Networks can be used to model complex relationships between their inputs and outputs [51]. Unlike the Eigen-Eye and SVM approaches, the feature extraction module is built into the Neural Network architecture, and Neural Networks learn discriminative features of the eye from the training data. The difficulty with Neural Networks is that it is impossible to ensure that the trained network is the optimal network for a specific application. It is well known that it is difficult to optimize the parameters of a Neural Network when the network has multiple hidden layers with many free parameters [52, 53].

1.4 Research Objectives

The goal of the thesis is to develop algorithms for the detection and identification of the left and right eyes for the expected range of subjects' translational (X = ±10 cm, Y = ±10 cm and Z = ±10 cm) and rotational head movements (Yaw = ±20°, Pitch = ±20° and Roll = ±20°). To detect the left and right eyes for different subjects and for the expected range of head movements, the algorithm must be able to cope with large variability in eye images (Figure 1.5). Since pattern-based approaches for eye detection were shown to be more robust than feature-based approaches when eye images exhibit large variability, we decided to use a pattern-based approach for eye detection in this thesis. Among all pattern-based approaches for eye detection (see Section 1.3), the Neural Networks methodology was found to be the most attractive because it is able to learn the discriminative features for eye detection from a set of eye images.

Figure 1.5: Sample Eye Images

The next chapter describes a novel eye detection algorithm that is based on the Neural Networks methodology. The third chapter discusses a methodology that determines the identity (left or right) of the detected eyes. The fourth chapter combines the algorithms for eye detection and eye identification to create a detector that can identify the left and right eyes for the required range of head movements.

Chapter 2

Eye Detection

As shown in the literature review (Section 1.2.2), it is possible to use handcrafted rules or heuristics to detect eyes, but in practice such an approach leads to a proliferation of rules and exceptions to the rules in order to cope with specific subjects' features and varying experimental conditions. Invariably, this approach leads to poor detection performance when the input set exhibits large variability. In many applications, pattern-based approaches provide better detection performance. In a recent review of pattern-based approaches [54], it was found that Multi-layer Feedforward Neural Networks (Neural Networks) provide the best detection performance when the input set exhibits large variability. This chapter starts with an analysis and evaluation of a specific type of Feedforward Neural Network: the Convolutional Neural Network (CNN). Then, a description of the CNN architecture is presented and, finally, the training procedure and the performance of the CNN are discussed.

2.1 Neural Networks Methodology

LeCun et al. [48] introduced the CNN specifically for adaptive image processing, and it has been successfully applied in many practical applications [48, 55, 56, 57]. When compared with conventional Neural Networks, the CNN has three properties that are important for eye detection.

Firstly, the CNN is invariant to translation and robust to changes in scale and rotation [48]. Invariance properties are important because eye detection should be independent of the position of the eye or the size of the eye within the image. Conventional neural networks (other than the CNN) can learn to cope with changes in translation, scale, rotation, etc. However, this requires that the training set include numerous examples of the effects of these parameters, which might be impractical because the possible number of combinations grows exponentially with the number of parameters.

Secondly, the CNN exploits a key property of images, which is that nearby pixels are much more likely to be correlated than more distant pixels. It achieves this by extracting features which depend only on small sub-regions of the image. Information from such features is merged in later stages of processing in order to detect more complex features, and ultimately to yield information about the image as a whole.

Lastly, in many applications that use conventional Neural Networks [58, 59, 60], the original image is first preprocessed and the processed image is then fed into the Neural Network. This preprocessing step is essential, for example, for image intensity normalization. The CNN does not require any preprocessing steps: it learns to build the preprocessing module and the detection module in a single integrated scheme.

A typical CNN is shown in Figure 2.1. It consists of a set of layers, each of which contains one or more planes. Each unit in a plane receives inputs from a small neighborhood in the planes of the previous layer. Each plane can be considered as a feature map with a fixed feature detector that is convolved with a local window which is scanned over the planes of the previous layer. Multiple planes are usually used in each layer so that multiple features can be detected. These layers are called convolutional layers. Once a feature has been detected, its exact location is less important. Hence, the convolutional layers are typically followed by another layer which performs a local averaging and subsampling operation. These layers are called subsampling layers. Finally, the network is typically connected to a fully connected Feedforward Neural Network which carries out the classification task using the features extracted in the previous layers. The network is trained with the usual backpropagation gradient descent procedure [61].

Figure 2.1: A Typical Convolutional Neural Network [48]

Three mechanisms that are unique to the CNN and give rise to the three properties described at the beginning of this section are: (i) local receptive fields (sub-region connections), (ii) weight sharing, and (iii) sub-sampling. In conventional Neural Networks other than the CNN, each hidden unit (Hi) is connected to all pixels in the previous layer (Vi) and each hidden unit has its own set of weights, as shown in Figure 2.2a. In the CNN, each unit in a feature map takes inputs only from a small sub-region of the previous layer, and all the units in the feature map are constrained to share the same weight values, as shown in Figure 2.2b.

Figure 2.2: Hidden Unit Connections: (a) Conventional Neural Networks, (b) CNN

For instance, a feature map in the convolutional layer might consist of 100 units z[i, j] arranged in a 10 × 10 grid with i = 1, · · · , 10 and j = 1, · · · , 10, with each unit taking inputs from a 3 × 3 pixel patch of the image x[i, j]. The whole feature map therefore has 9 adjustable weight parameters W[u, v] plus one adjustable bias parameter w_0. Input values from a patch are linearly combined using the weights and the bias:

a[i, j] = \sum_{u=0}^{2} \sum_{v=0}^{2} \left( W[u, v] \, x[i + u, j + v] \right) + w_0 \qquad (2.1)

The result is then transformed by a sigmoidal non-linearity h(.) to form the output z[i, j]:

z[i, j] = h(a[i, j]) \qquad (2.2)
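Equations 2.1 and 2.2 amount to convolving the input with a single shared kernel and passing the result through a sigmoid. The sketch below computes one feature map this way; the 3 × 3 kernel and 10 × 10 output follow the example in the text, and the logistic sigmoid is used as a stand-in for the unspecified non-linearity h(.).

```python
import numpy as np

def feature_map(x, W, w0):
    """One CNN feature map: z[i, j] = h(sum_{u,v} W[u, v] * x[i+u, j+v] + w0).

    All output units share the same kernel W and bias w0 (weight sharing), so the
    map has only 10 adjustable parameters regardless of its size.
    """
    k = W.shape[0]
    h_out = x.shape[0] - k + 1
    w_out = x.shape[1] - k + 1
    z = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            a = np.sum(W * x[i:i + k, j:j + k]) + w0   # equation 2.1
            z[i, j] = 1.0 / (1.0 + np.exp(-a))          # equation 2.2, sigmoidal h(.)
    return z

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 12))           # input patch (12 x 12 gives a 10 x 10 map)
W = rng.normal(scale=0.1, size=(3, 3))  # shared 3 x 3 kernel
print(feature_map(x, W, w0=0.0).shape)  # (10, 10)
```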

Units in the feature map can be regarded as feature detectors, and therefore all the units in a feature map detect the same pattern but at different locations in the input image. Due to the weight sharing, the operation of these units is equivalent to a convolution of a kernel comprising the weight parameters with the input image. If the input image is shifted, the output of the feature map will be shifted by the same amount but will be otherwise unchanged. This provides the basis for the invariance of the network outputs to translations of the input image. Since most often multiple features are needed for effective classification and detection, there are generally multiple feature maps in the convolutional layer, each having its own set of weight and bias parameters.

The outputs of the convolutional units serve as inputs to the sub-sampling layer of the network. For each feature map in the convolutional layer there is a plane of units in the sub-sampling layer, and each unit averages inputs from a 2 × 2 region in the corresponding feature map. This average is multiplied by an adaptive weight, a bias is added, and the results are then transformed using a sigmoidal non-linear activation function. The receptive fields in the subsampling layer are chosen to be contiguous and non-overlapping, so that there are half the number of rows and columns in the subsampling layer compared with the convolutional layer. CNN structures usually include several stages of pairs of convolutional and subsampling layers. At each stage, there is a progressively larger degree of invariance to changes in input parameters (i.e. scale, rotation, etc.). In a given convolutional layer, there may be several feature maps for each plane of units in the previous sub-sampling layer, so that the gradual reduction in spatial resolution is compensated by an increase in the number of features. The final layer of the network is typically a fully connected, fully adaptive layer, with output unit activations a_k:

a_k = \sum_{j=1}^{M} w_{kj} z_j + w_{k0} \qquad (2.3)

where w_{k0} is the bias parameter for the output unit k and z_j is the output of the hidden unit j in the previous layer. The number of output units (K) depends on the number of objects to be classified. For the eye detection problem, K = 2, where one output is associated with the detection of eye images and the other output is associated with the detection of non-eye images. The CNN is usually trained by the gradient descent error backpropagation algorithm [62]. Due to the use of local receptive fields, the number of weights in the network is much smaller than if the network were fully connected. Furthermore, the number of independent parameters to be learned from the training data is much smaller due to the weight sharing mechanism. In the following sections, the details of a CNN for eye detection are presented.

2.2 CNN Architecture for Eye Detection

As discussed in the previous section, the architecture of the CNN can be parameterized by the number of stages, the number of feature maps in each convolutional layer and the number of weights connected to each unit (the kernel) in each feature map. Note that all units in a feature map are constrained to share the same weight values and that the size of the kernels for all feature maps in the same convolutional layer is the same. Since the values of these parameters depend in a complex way on the number of training cases, the complexity of the classification to be learnt, the amount of noise in the outputs and the training algorithm [63, 64], they are determined experimentally. It is impractical to test all the possible structures for a CNN for eye detection and therefore a reasonable framework is needed to constrain the number of parameters that determine the architecture. For a CNN, it is essential to use more than one convolutional layer because the first layer generally implements non-linear template-matching at a relatively fine spatial resolution, extracting basic features of the input image [48]. Subsequent layers learn to recognize particular spatial combinations of previous features, generating complex features in a hierarchical manner. To limit the complexity of the architecture, the CNN for eye detection will use only 2 convolutional layers and 2 subsampling layers (i.e. 2 stages).

Figure 2.3: Framework for the CNN Architecture

Figure 2.3 illustrates the architecture of the CNN for eye detection. Layers C1 and C2 are the convolutional layers and layers S1 and S2 are the corresponding subsampling layers. In order to force different feature maps to extract different features in layer C2, each of the feature maps in layer C2 receives a different set of inputs from layer S1. For instance, if there are 4 feature maps in S1, one can construct 15 different combinations (without replacement) of these maps to create 15 different feature maps in layer C2. Therefore, the possible number of feature maps in layer C2 is constrained by the number of feature maps in layer C1. It is expected that layer S2 will be able to extract a series of disjoint features of low dimensionality that can be used for classification. Therefore, layer C3 is a fully connected layer where each unit is connected to all units of a single corresponding map in layer S2. Finally, all units in C3 are fully connected to form two outputs with a softmax activation function:

y_k = \frac{e^{a_k}}{\sum_{j=1}^{2} e^{a_j}} \qquad (2.4)

where k = 1 for eye images, k = 2 for non-eye images and a_j is the output unit activation defined in equation 2.3. A softmax activation function is used because the outputs of the network can then be interpreted as posterior probabilities for the K categorical variables. It is highly desirable for those outputs to be in the range 0 ≤ y_k ≤ 1 and to satisfy \sum_k y_k = 1. The purpose of the softmax activation function is to enforce these constraints on the outputs and to provide a confidence measure for the classification.
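Equations 2.3 and 2.4 form the output stage of the network. A minimal sketch with made-up feature values shows how the two activations are turned into class probabilities that sum to one; the 15 inputs correspond to the units of layer C3 in the example architecture described below.

```python
import numpy as np

def output_layer(z, W, b):
    """Fully connected output (eq. 2.3) followed by a softmax (eq. 2.4).

    z: (M,) features from the last hidden layer, W: (2, M) weights, b: (2,) biases.
    Returns y with y[0] ~ P(eye | x) and y[1] ~ P(non-eye | x); y sums to one.
    """
    a = W @ z + b                       # a_k = sum_j w_kj z_j + w_k0
    e = np.exp(a - a.max())             # subtract the max for numerical stability
    return e / e.sum()

# Toy usage with 15 hidden-unit outputs.
rng = np.random.default_rng(0)
z = rng.normal(size=15)
W = rng.normal(scale=0.5, size=(2, 15))
b = np.zeros(2)
y = output_layer(z, W, b)
print(y, y.sum())                       # two probabilities summing to 1.0
```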

Using equations 2.1, 2.2, 2.3 and 2.4, the overall network function for each output can be represented by y_k(x, w), where x is the input image and the set of all weight and bias parameters has been grouped together into a vector w. Thus, the Neural Network model is simply a nonlinear function from an input image x to a set of output variables y_k, controlled by a vector w of adjustable parameters. Using this framework, the degrees of freedom in the architecture of the CNN for eye detection can be reduced to three parameters: the number of feature maps in the first layer (layer C1), the size of the kernel in layer C1 and the size of the kernel in layer C2.

As an example, a detailed description of a CNN architecture using 4 feature maps in layer C1 with 5 × 5 kernels, and 3 × 3 kernels in layer C2, is presented. This architecture is denoted as A(4,5,3). Given an input image of 36 × 36 pixels, each unit in each feature map in layer C1 is connected to a 5 × 5 neighborhood in the input image. The size of the feature maps is 32 × 32 pixels, which is the result of the convolution of the 5 × 5 kernel (no zero-padding) with the input image. Each feature map has 26 adaptive parameters and the total number of parameters in layer C1 is 104. Layer S1 is composed of four feature maps, one for each feature map in C1, and the size of each feature map in layer S1 is half the size of the feature maps in layer C1 (16 × 16 pixels). Each feature map has 2 adaptive parameters and the total number of parameters in layer S1 is 8. Layer C2 is a convolutional layer with 15 feature maps. Each unit in each feature map is connected to a small neighborhood, at identical locations, in a subset of the feature maps in S1. An example of the mapping strategy that is used, in this thesis, to connect the feature maps in layer S1 to the feature maps in layer C2 is shown in Table 2.1. According to the design criteria in this example, each connection has a unique 3 × 3 kernel. For instance, referring to Table 2.1, feature map number 4 in layer C2 is connected to feature maps number 0 and 1 in layer S1. Therefore, feature map number 4 in layer C2 has 19 adaptive parameters. In total, the size of each feature map in layer C2 is 14 × 14 pixels and the total number of parameters in layer C2 is 303.

Table 2.1: Each column indicates which feature maps in S1 are combined by the units in a particular feature map in C2 (X = connected, . = not connected)

S1\C2   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
  0     X  .  .  .  X  X  X  .  .  .  X  X  X  .  X
  1     .  X  .  .  X  .  .  X  X  .  X  X  .  X  X
  2     .  .  X  .  .  X  .  X  .  X  X  .  X  X  X
  3     .  .  .  X  .  .  X  .  X  X  .  X  X  X  X

Layer S2 is composed of 15 feature maps. The size of each feature map is 7 × 7 and there are 30 adaptive parameters in this layer. Layer C3 has 15 units, with each unit fully connected to all units of a single feature map in layer S2. Therefore, there are 750 adaptive parameters in this layer. Finally, all units in layer C3 are fully connected to form two outputs and there are 32 adaptive parameters in the final layer. In total, this architecture has 1227 parameters.

To determine the values of the three parameters that define the architecture of the CNN for eye detection, the generalization performance of several different architectures will be measured. The generalization performance is the ability to classify correctly new eye images that differ from those used for the training of the Neural Network. The architecture that achieves the best generalization performance will be selected for eye detection.
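The per-layer parameter counts quoted above for A(4,5,3) can be verified with a few lines of arithmetic; this sketch simply tallies the numbers given in the text and in Table 2.1.

```python
# Parameter tally for the example architecture A(4,5,3) with a 36 x 36 input.
c1 = 4 * (5 * 5 + 1)         # 4 feature maps, shared 5x5 kernel + bias each         -> 104
s1 = 4 * 2                   # one weight + one bias per sub-sampling map            -> 8
# C2: 15 maps built from the subsets of the 4 S1 maps (Table 2.1); each input map
# contributes a 3x3 kernel, plus one bias per map.
c2 = sum(n * 9 + 1 for n in (1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4))  #        -> 303
s2 = 15 * 2                  # 15 sub-sampling maps                                  -> 30
c3 = 15 * (7 * 7 + 1)        # 15 units, each fully connected to one 7x7 S2 map      -> 750
out = 2 * (15 + 1)           # 2 softmax outputs fully connected to the 15 C3 units  -> 32
print(c1, s1, c2, s2, c3, out, c1 + s1 + c2 + s2 + c3 + out)  # total 1227
```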

In the next section, the training database and the training algorithm are introduced.

2.3 Training Database

The CNN is trained with a large set of eye images and non-eye images. This large set of images is called the training database.


The training database was built by manually cropping eye images from face images of 10 subjects. For each subject, 150 images of different head poses and face illuminations were collected. For each image, the portions of the image that contained the left and/or right eyes were cropped to fit an image size of 36 × 36 pixels (Figure 2.4). The total number of cropped eye images was 3000 images.

Figure 2.4: Sample Eye Images

Contrary to many pattern-based techniques [58, 60, 65, 66], no preprocessing (i.e. overall brightness correction or histogram equalization) was applied to the cropped images. The CNN is trained to extract problem-specific features directly from the raw pixels.

The generalization performance of the CNN depends strongly on the quality and the quantity of the training data [67]. To include a large variety of experimental conditions (i.e. different face illuminations and different head poses), simulated images were created and added to the original 3000 images. The simulated images were created in the following manner. Since eyes are symmetrical, mirror images of the eyes in the original images were added to the training data (Figure 2.5 is the mirror image of Figure 2.4). To train the network for larger head rotations in the roll direction, rotated versions of the eye images were added to the training data (Figure 2.6). The degree of rotation was randomly selected between −30° and 30°. To train the network for larger variations in eye illumination, contrast and intensity transformations were applied to the original set of eye images (Figure 2.7). Using this strategy, the number of images in the training database was expanded to 30000 eye images.
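A sketch of the augmentation steps described above (mirroring, random roll rotation, and an intensity/contrast transform) using NumPy and SciPy; only the ±30° rotation range is taken from the text, while the gain and offset ranges of the intensity transform are illustrative.

```python
import numpy as np
from scipy.ndimage import rotate

def augment(eye, rng):
    """Return mirrored, rotated and intensity-transformed copies of a 36 x 36 eye crop."""
    mirrored = np.fliplr(eye)                                   # left/right eye symmetry
    angle = rng.uniform(-30.0, 30.0)                            # roll rotation, +/- 30 degrees
    rotated = rotate(eye, angle, reshape=False, mode='nearest')
    gain = rng.uniform(0.7, 1.3)                                # illustrative contrast change
    offset = rng.uniform(-20.0, 20.0)                           # illustrative brightness change
    transformed = np.clip(gain * eye.astype(float) + offset, 0, 255).astype(np.uint8)
    return mirrored, rotated, transformed

rng = np.random.default_rng(0)
eye = rng.integers(0, 255, size=(36, 36), dtype=np.uint8)       # stand-in for a real crop
print([a.shape for a in augment(eye, rng)])                     # three 36 x 36 variants
```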

Figure 2.5: Mirrored Eye Images

Figure 2.6: Rotated Eye Images

To collect a representative set of non-eye images, a method described in [59, 68] was adopted. Any portion of the cropped image obtained by the remote gaze estimation system that does not include an eye can be used as a non-eye image. The method aims to collect only non-eye images with high information value. The idea is the following:

1. Start with a small and possibly incomplete set of non-eye images in the training data.
2. Train the network with the current training data.
3. Use the trained CNN to detect eyes in the full images (1280 × 1024 pixels) of the original face images. If the CNN classifies a portion of the image that does not include an eye as an eye image, this portion is cropped from the full image to fit the image size of 36 × 36 pixels and is added to the training data.
4. Return to Step 2.

The procedure stops when a total of 30000 cropped (36 × 36 pixels) non-eye images has been collected. Some examples of the collected non-eye images are shown in Figure 2.8. The resulting training dataset contains 60000 cropped eye and non-eye images.
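The four-step procedure can be written as a short loop; `train`, `scan_for_eyes` and `crop` are placeholders for the CNN training routine, the full-image scan and the 36 × 36 cropping step, so this is a structural sketch rather than the actual implementation.

```python
def bootstrap_non_eyes(face_images, eye_boxes, non_eyes, train, scan_for_eyes, crop,
                       target=30000):
    """Iteratively harvest hard false positives as non-eye training examples.

    face_images : full 1280 x 1024 frames
    eye_boxes   : per-frame ground-truth eye regions (to reject true detections)
    non_eyes    : initial, possibly incomplete, list of 36 x 36 non-eye crops
    """
    while len(non_eyes) < target:
        model = train(non_eyes)                              # step 2: train on current data
        new_negatives = []
        for frame, true_boxes in zip(face_images, eye_boxes):
            for box in scan_for_eyes(model, frame):          # step 3: scan the full frames
                if not any(overlaps(box, t) for t in true_boxes):
                    new_negatives.append(crop(frame, box))   # false positive -> non-eye crop
        if not new_negatives:
            break                                            # detector makes no more mistakes
        non_eyes.extend(new_negatives)                       # step 4: go back to step 2
    return non_eyes

def overlaps(a, b):
    """Axis-aligned overlap test for (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah
```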


Figure 2.7: Intensity Transformed Eye Images

Figure 2.8: Sample of Non-Eye Images

2.4 The Training of Convolutional Neural Networks

The training of Convolutional Neural Networks involves adjusting the network's parameters by fitting the network function y_k(x, w) in equation 2.4 to the training data. This can be done by minimizing an error function that measures the error between the network function, for any given value of w, and the training set of cropped images. For the softmax activation function (equation 2.4), the cross-entropy error function, defined in equation 2.5, is used because it models the error exhibited by the CNN more accurately than the common sum-of-squares error function. Also, the cross-entropy error function leads to faster training and improved generalization [69, 70]:

E(w) = \sum_{n=1}^{N} E_n(w), \qquad E_n(w) = -\sum_{k=1}^{2} t_{kn} \ln y_k(x_n, w) \qquad (2.5)

where n = 1, · · · , N indexes the cropped images, t_k ∈ {0, 1} indicates the 2 classes of images (eye or non-eye), and y_k are the network outputs with y_1(x, w) = p(t_1 = 1|x) and y_2(x, w) = p(t_2 = 0|x). The goal of the optimization is to find a vector w such that E(w) is minimized. Because the neural network function is differentiable with respect to all network parameters, the error E(w) is a smooth continuous function of w. The minimum occurs at a point in weight space where the gradient of the error function vanishes, so that

\nabla E(w) = \sum_{n=1}^{N} \nabla E_n(w) = 0 \qquad (2.6)
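For concreteness, the per-image error E_n of equation 2.5 for the two softmax outputs can be evaluated as follows; the target coding t = (1, 0) for an eye image is an assumption consistent with the definition above.

```python
import numpy as np

def cross_entropy(y, t):
    """E_n(w) = -sum_k t_k * ln(y_k) for one cropped image (equation 2.5)."""
    return -np.sum(t * np.log(y))

y = np.array([0.9, 0.1])    # network outputs: P(eye), P(non-eye)
t = np.array([1.0, 0.0])    # target coding for an eye image
print(cross_entropy(y, t))  # ~0.105; grows as the network becomes less confident
```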

However, the error function typically has a highly nonlinear dependence on the weights and bias parameters, and so there will be many points (local minima) in weight space at which the gradient vanishes. In general, it will not be known whether the global minimum has been found. To find the solution for equation 2.6, iterative numerical procedures are used [61]. Most iterative procedures involve choosing some initial value for the weight vector and then moving through weight space in a succession of steps of the form

w^{\tau+1} = w^{\tau} + \Delta w^{\tau} \qquad (2.7)

where τ is the iteration step. Different algorithms involve different choices for the weight vector update Δw^τ, but they all make use of gradient information and therefore require that, after each update, the value of ∇E(w) is evaluated at the new weight vector w^{τ+1}. The gradient of an error function can be evaluated efficiently by means of the backpropagation procedure [62]. The backpropagation procedure is numerically efficient, with a computational complexity that is linear in the dimensionality of w. This is the reason why the use of gradient information forms the basis for most practical algorithms for training neural networks. For convolutional neural networks, the standard backpropagation procedure must be slightly modified to take into account the weight sharing. The partial derivatives of the loss function with respect to each connection are computed first, as if the network were a conventional multi-layer network without weight sharing; then the partial derivatives of all the connections that share the same parameter w_k are added to form the derivative with respect to that parameter w_k:

\frac{\partial E_n}{\partial w_k} = \sum_{(i,j) \in V_k} \frac{\partial E_n}{\partial u_{ij}} \qquad (2.8)

where u_{ij} is the connection weight from unit j to unit i and V_k is the set of unit index pairs (i, j) such that the connection between i and j shares the parameter w_k.

To train the Convolutional Neural Network, the stochastic diagonal Levenberg-Marquardt method is used [71]. It is based on stochastic gradient descent algorithms [72], which make an update to the weight vector based on one data point at a time. This is contrary to batch learning algorithms that consider the entire training dataset before the weights are updated. Stochastic learning methods are useful for training neural networks on large data sets [73] because they handle redundancy in the data much more efficiently and converge much faster than batch learning algorithms. Also, stochastic learning algorithms have the possibility of escaping from local minima, since a stationary point with respect to the error function for the whole data set will generally not be a stationary point for each data point. Other nonlinear optimization algorithms, such as Conjugate Gradient Descent [74] and BFGS [75], are not suitable because they can only be trained in batch mode, since they rely on accurate evaluation of successive conjugate descent directions. Also, most of these training procedures require full Hessian information and therefore can only be applied to very small networks due to their high computation requirements (at least O(W²) per update, where W is the number of parameters in the network). The stochastic diagonal Levenberg-Marquardt method is an extended version of the standard stochastic gradient descent algorithm

w^{\tau+1} = w^{\tau} - \eta \nabla E_n(w^{\tau}) \qquad (2.9)

where η is a scalar constant (the learning rate) and the weight update in (2.7) is comprised of a small step in the direction of the negative gradient based on one data point, Δw^τ = −η∇E_n(w^τ). The problem with the standard stochastic gradient descent algorithm is that it uses a single global learning rate η for all weights in the network. Neural Networks with multiple layers usually have very small second derivatives in the lower layers and higher second derivatives in the upper layers [71]. Therefore, some weights may require a slow learning rate in order to avoid divergence, while others may require a fast learning rate in order to converge at a reasonable speed. In order to allow the network to converge faster, different learning rates for different weights are desirable, so that all the weights in the network converge roughly at the same speed. The stochastic diagonal Levenberg-Marquardt method estimates the learning rate for each weight in the network by computing the inverse diagonal Hessian:

The second derivatives,

∂2E , ∂wk2

ηk = 1 ∂2 E

2 +µ ∂wk

(2.10)

can be obtained via backpropagation [76] and µ is a

parameter to prevent ηk from numerical instability when the second derivative is small (i.e. when the optimization moves in flat parts of the error function). Finally, the Stochastic Levenberg Marquardt method has the form wτ +1 = wτ − η τ ∇En (wτ )

(2.11)

where η τ represents a vector of η (Equation 2.10) evaluated at time τ . As noted in [71], the second derivatives can be pre-computed, before the training, over a subset of the training set. Also, the re-estimation does not need to be done often since the second order properties of the error surface change slowly.
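A minimal sketch of the update of Equations 2.10 and 2.11, assuming the gradient for one data point and the diagonal second derivatives have already been obtained by backpropagation. It is written in Python/NumPy for illustration (the thesis implementation is in Matlab) and all numerical values are hypothetical.

```python
import numpy as np

def sdlm_update(w, grad, diag_hessian, mu=0.02):
    """One stochastic diagonal Levenberg-Marquardt step (Eqs. 2.10-2.11).

    w            -- current weight vector
    grad         -- gradient of E_n (single data point) with respect to w
    diag_hessian -- estimates of d^2E/dw_k^2 (pre-computed on a subset
                    of the training set, as noted in the text)
    mu           -- regularizer that keeps the step finite in flat regions
    """
    eta = 1.0 / (diag_hessian + mu)      # per-weight learning rates (Eq. 2.10)
    return w - eta * grad                # weight update (Eq. 2.11)

# Hypothetical values: weights with larger curvature automatically
# receive smaller learning rates.
w = np.array([0.5, -0.3, 0.8])
grad = np.array([0.02, -0.01, 0.05])
diag_h = np.array([0.1, 0.1, 2.0])
print(sdlm_update(w, grad, diag_h))
```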

2.5 Convolutional Neural Network Model Selection

To determine the architecture of the CNN (the effective complexity of the network), several different architectures were trained using the Stochastic diagonal Levenberg Marquardt method and the architecture with the best generalization performance was selected for eye detection.


The training of nonlinear networks corresponds to an iterative reduction of an error function with respect to a set of training data. As the number of training iterations increases, the training error decreases. However, the error with respect to validation data, which is not used in the training, might not decrease monotonically as a function of the number of training iterations. Since the purpose of the training process is to achieve the best generalization performance, Sarle [64] suggested stopping the training process when the error is minimized for a validation dataset. This usually results in an earlier stopping of the iterative training process. "Early Stopping" can be explained qualitatively in terms of the effective number of degrees of freedom in the network. The number of degrees of freedom is small at the beginning of the training and grows during the training process, corresponding to a steady increase in the effective complexity of the model. Halting training before the error is minimized is therefore a way of limiting the effective network complexity. A theoretical explanation of the "Early Stopping" procedure can be found in [64].

To train the CNN with the "Early Stopping" procedure, the training database is divided into 2 partitions, the training dataset and the validation dataset. The training dataset includes 50,000 images (25,000 eye and 25,000 non-eye) and the validation dataset includes 10,000 images (5,000 eye and 5,000 non-eye). Using the framework defined in Section 2.2, we tested CNN architectures with 3, 5 and 7 feature maps in the first layer and 3 × 3, 5 × 5 and 7 × 7 kernels in layer C1 and layer C2. To extract features with high spatial frequencies, the kernel sizes are kept small. The generalization performances of a total of 27 architectures were evaluated. Each architecture was trained for 100 training iterations and the weight vector associated with the minimum error on the validation dataset was chosen for each architecture. In each iteration, the CNN is trained using the entire training dataset and validated with the entire validation dataset. The results are summarized in Table 2.2. A(a,b,c) defines an architecture A with a representing the number of feature maps in the first layer, b representing the kernel size in the first convolutional layer (b × b) and c representing the kernel size in the second convolutional layer (c × c). Note in Table 2.2 that as the number of weights increases, the error decreases. Figure 2.9 shows that the average error decreases rapidly when the number of weights is relatively small and is relatively constant when the number of weights is greater than 5000. Although A(7,5,3) has the lowest average error, the number of weights of this architecture is relatively large. To limit the complexity of the CNN while achieving good generalization performance, A(5,5,7) (4991 weights) was selected for eye detection.

Table 2.2: Top: architectures with 3 feature maps in the first layer. Middle: architectures with 5 feature maps in the first layer. Bottom: architectures with 7 feature maps in the first layer.

            A(3,3,3)  A(3,3,5)  A(3,3,7)  A(3,5,3)  A(3,5,5)  A(3,5,7)  A(3,7,3)  A(3,7,5)  A(3,7,7)
Avg. Error  0.254     0.206     0.176     0.275     0.248     0.199     0.274     0.201     0.132
# weights   628       715       912       571       672       883       643       744       955

            A(5,3,3)  A(5,3,5)  A(5,3,7)  A(5,5,3)  A(5,5,5)  A(5,5,7)  A(5,7,3)  A(5,7,5)  A(5,7,7)
Avg. Error  0.038     0.027     0.026     0.041     0.027     0.024     0.033     0.029     0.028
# weights   2920      3735      5252      2535      3412      4991      2655      3532      5111

            A(7,3,3)  A(7,3,5)  A(7,3,7)  A(7,5,3)  A(7,5,5)  A(7,5,7)  A(7,7,3)  A(7,7,5)  A(7,7,7)
Avg. Error  0.019     0.016     0.017     0.015     0.016     0.018     0.022     0.021     0.018
# weights   12880     18143     27244     11087     16604     25959     11255     16772     26127

The behavior of the architecture A(5,5,7) during the training session and on the validation set is illustrated in Figure 2.10. The training is stopped at epoch 56, corresponding to the minimum error for the validation set, with an average cross-entropy error of 0.024. This error is reported in Table 2.2. Note that the training error is not a monotonically decreasing function of the iteration index because the learning algorithm is stochastic and the weights may not move directly toward the minimum at each iteration. The error on the validation dataset tends to decrease at first and then increases as the network starts to over-fit the training dataset.
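The "Early Stopping" procedure used above can be sketched as follows. The methods train_one_epoch and validation_error are hypothetical stand-ins for one stochastic diagonal Levenberg-Marquardt pass over the training set and the average cross-entropy error on the validation set; this is an illustrative outline, not the thesis code.

```python
import copy

def train_with_early_stopping(model, train_set, val_set, max_epochs=100):
    """Keep the weights that minimize the validation error (early stopping)."""
    best_error = float("inf")
    best_weights = copy.deepcopy(model.weights)
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        model.train_one_epoch(train_set)        # stochastic weight updates
        err = model.validation_error(val_set)   # may rise once over-fitting starts
        if err < best_error:
            best_error, best_epoch = err, epoch
            best_weights = copy.deepcopy(model.weights)
    model.weights = best_weights                # roll back to the best epoch
    return best_epoch, best_error               # e.g. epoch 56, error 0.024 for A(5,5,7)
```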


Figure 2.9: Average Error vs. the Number of Weights (the selected architecture is marked)

Figure 2.10: Training Session (average cross-entropy error vs. epoch for the training and validation sets; minimum validation error = 0.024 at epoch 56)

2.6 Determining the System Performance

In the remote gaze estimation system, the regions of the eyes that are detected by the CNN are provided to an algorithm (the eye feature detector) that estimates specific eye features. These specific eye features are the pupil and a set of corneal reflections. The corneal reflections are virtual images of the light sources that illuminate the eyes. These eye features are used by the remote gaze estimation system to calculate the Point-Of-Gaze. If the CNN classifies a non-eye region as an eye region (false alarm), the eye feature


detector will not be able to find the specific eye features in the region defined by the CNN, and therefore no Point-Of-Gaze estimate will be derived for this region. Since the cost associated with a false eye classification is low, one could, in principle, allow the CNN to have a very high false positive rate. Unfortunately, the eye feature detector is very computationally intensive, and if the CNN has a high false alarm rate, the remote gaze estimation system will not be able to estimate the Point-Of-Gaze in real time. Based on the current characteristics of the remote gaze estimation system (frame rate, computational power), real-time Point-Of-Gaze estimation can be achieved with a false alarm rate of 10% for the CNN. To determine the classification criterion for the CNN so that a false alarm rate of 10% is achieved, the Receiver Operating Characteristic (ROC) curve of the CNN classifier was constructed. Each point on the ROC was determined by applying a different threshold to the output of the CNN for the entire validation dataset. The result is plotted in Figure 2.11. As shown in Figure 2.11, a false alarm rate of 10% corresponds to a true positive rate (detection rate) of 99.3% (the actual network threshold is 0.45).

Figure 2.11: ROC Curve Used to Determine the Threshold
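A sketch of how the ROC points and the operating threshold could be computed from the CNN outputs on the validation set; the array names and the threshold grid are illustrative assumptions, not the thesis code.

```python
import numpy as np

def roc_points(eye_outputs, non_eye_outputs, thresholds):
    """Return (false alarm rate, detection rate) for each candidate threshold."""
    eye_outputs = np.asarray(eye_outputs)
    non_eye_outputs = np.asarray(non_eye_outputs)
    points = []
    for t in thresholds:
        detection = np.mean(eye_outputs >= t)        # true positive rate
        false_alarm = np.mean(non_eye_outputs >= t)  # false positive rate
        points.append((false_alarm, detection))
    return points

def threshold_for_false_alarm(eye_outputs, non_eye_outputs, target_fa=0.10):
    """Pick the threshold whose false alarm rate is closest to the target
    (10% in the text) and report the corresponding operating point."""
    thresholds = np.linspace(0.0, 1.0, 201)
    pts = roc_points(eye_outputs, non_eye_outputs, thresholds)
    idx = int(np.argmin([abs(fa - target_fa) for fa, _ in pts]))
    return thresholds[idx], pts[idx]
```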

2.7 Eye Localization Algorithm

For each image from the remote gaze estimation system, the CNN generates a corresponding network response image. The value at each pixel in the network response image corresponds to the confidence of the network that an eye is present at that pixel location. The network response image is then compared with the threshold that was obtained in the previous section (0.45). Note that because the architecture of the CNN contains 2 subsampling layers (each subsamples the image by a factor of 2), the number of pixels in the network response image is approximately 16 times smaller than the number of pixels in the input image. Pixels in the network response image that have values below the threshold are set to zero. As shown in Figure 2.12b, for each eye there are several pixels with network responses above the threshold. This is due to the fact that for 2 adjacent pixels in the network response image, the difference in the input image is only 12.5%, so it is very likely that two adjacent outputs detect the same eye (note that when the network response image is mapped back to the original image space, as in Figure 2.12b, the distance between 2 adjacent outputs is 4 pixels). All network responses that are 4 pixels apart are clustered together to represent an eye candidate (Figure 2.12c). Finally, the center of the eye window for each eye candidate is computed as the centroid of the positions of the pixels in the cluster, weighted by the magnitude of the network response. The size of each eye window is similar to the size of the cropped images used for the training of the CNN (36 × 36 pixels). The CNN's network response for each eye candidate is computed and only eye candidates with a network response higher than 0.45 are considered. Since only one face is present in the field of view of the eye tracker's camera, only the two eye candidates with the highest network outputs are passed to the eye feature detector (Figure 2.12d).

Figure 2.12: Eye Localization Algorithm (a) original image, (b) network response, (c) group image, (d) eye candidates

The CNN was developed for a specific optical configuration in which the eye image could fit within a window of 36 × 36 pixels for the expected range of head movements. As the remote gaze estimation system evolved, the sensor resolution and the optical configuration changed, and the number of pixels per millimeter (pixel density) varies substantially from the images collected for the training of the CNN. Based on the set of lenses and camera sensors that are used by the remote gaze estimation system, the pixel density can change by a factor of 2. This implies that in some optical configurations the eye image can be as large as 72 × 72 pixels. To cope with different configurations without retraining the network for each specific optical configuration, images from the remote gaze estimation system are subsampled recursively 4 times (each time by a factor of 0.8) to generate 5 scaled images (including the original image), so that in at least one of the images the eyes can fit within a window of 36 × 36 pixels. Each scaled image is then processed by the CNN and an image containing the network response is obtained. Pixels in each network response image with values lower than the threshold (0.45) are set to zero and the network response image of each scaled image is then mapped back to the original


image space (Figure 2.13a-e). The sum of all network response images over all scales is shown in Figure 2.13f. Note that in Figures 2.13a-e the CNN detects the eyes at several different scales, which demonstrates the robustness of the CNN to changes in scale. All network responses that belong to the same eye are clustered together. The center of the eye window for each eye candidate is computed as the centroid of the positions of the pixels in the cluster, weighted by the magnitude of the network response at each pixel. The size of the eye window for each eye candidate is computed as the average of the eye-window sizes associated with the pixels in the cluster, also weighted by the magnitude of the network response at each pixel. Finally, the two eye candidates with the highest network outputs are passed to the eye feature detector.

Figure 2.13: Network Response Images of Different Scaled Images with: (a) scale factor 0.4, (b) scale factor 0.5, (c) scale factor 0.64, (d) scale factor 0.8, (e) scale factor 1 (original image). (f) Total network response image
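The thresholding, clustering and weighted-centroid steps described above can be sketched as below. The sketch uses SciPy connected-component labelling as a stand-in for the "4 pixels apart" grouping rule, so it approximates the procedure rather than reproducing the exact implementation; names and defaults are illustrative.

```python
import numpy as np
from scipy import ndimage

def eye_candidates(response, threshold=0.45, stride=4, max_candidates=2):
    """Extract eye candidates from a CNN network-response image.

    response -- 2-D array of network outputs (one value per stride x stride
                block of the input image)
    stride   -- spacing of adjacent outputs in the original image (4 pixels
                for the two 2x subsampling layers)
    """
    mask = response >= threshold
    labels, n = ndimage.label(mask)          # group adjacent above-threshold outputs
    candidates = []
    for lab in range(1, n + 1):
        ys, xs = np.nonzero(labels == lab)
        w = response[ys, xs]                 # network response used as weight
        cy = np.sum(ys * w) / np.sum(w)      # response-weighted centroid
        cx = np.sum(xs * w) / np.sum(w)
        score = w.max()
        candidates.append((score, cy * stride, cx * stride))  # back to image coords
    candidates.sort(reverse=True)            # strongest responses first
    return candidates[:max_candidates]       # at most two eyes per image
```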

2.8 Experimental Results

The performance of the CNN for eye detection was tested on three subjects. For each subject, the range of head movements (relative to the initial position of the subject's head) for which the CNN was able to detect the subject's eyes was determined. Head movements were measured by a head tracker [77]. The head tracker recorded three translational head movements along the X, Y and Z directions (Figure 2.14) and three rotational head movements: yaw (rotation around the Z-axis), roll (rotation around the X-axis) and pitch (rotation around the Y-axis). A right-handed Cartesian coordinate system is used, with the origin positioned at the initial position of the subject's head. The nodal point of the camera (image size: 1280 × 1024, 35 mm lens) in the remote gaze estimation system was set to X = 75 cm, Y = 0 cm, Z = −45 cm, and the pitch of the camera was set to 30° (yaw and roll were set to 0°). The Y-Z plane was set to be parallel to a computer screen, with the Z-axis pointing upward as shown in Figure 2.14. The computer displays the subject's face with superimposed eye windows.

Figure 2.14: Image Coordinate System

Subjects were asked to move their heads in one direction at a time (X, Y, Z, yaw, pitch and roll) over the entire expected range of head movements in that direction. For translational head movements (X, Y and Z), the expected range is ±10 cm and images were captured at intervals of 1 cm. For rotational head movements (yaw, pitch and roll), the expected range is ±20° and images were captured at intervals of 2°. A total of 378 images were collected for the test database.

To determine the performance of the CNN, the number of eyes that were correctly detected and the number of false alarms were counted. An eye is correctly detected if and only if the detected window contains the full eye image. If the detected eye window contains only a partial region of the eye, or no region of the eye at all, it is counted as a false alarm. The detection rate for the CNN is 95.2%. This detection rate is lower than the expected detection rate reported on the ROC in Section 2.6, mainly because the eyes in some of the images in the test database are partially occluded due to eye blinks (Figure 2.15a). Since the training database contains only full eye images, the CNN has difficulty detecting eyes that are partially occluded. The false alarm rate of the CNN is 2.65 × 10−6 (i.e., 2.65 × 10−4 %) for the entire test database.

Figure 2.15: Mistakes made by the Convolutional Neural Network: (a) False Negative, (b) False Alarm

This false alarm rate seems to contradict the designed false alarm rate of 10% for cropped non-eye images (see Section 2.6). The explanation for this apparent contradiction is straightforward. In each image from the remote gaze estimation system (1280 × 1024 pixels), there are approximately 81839 non-eye regions (36 × 36 pixels each). With a false alarm rate of 10%, approximately 8184 false alarms per image would therefore be expected. However, since each image from the remote gaze estimation system contains the face of only one subject, the maximum number of eyes per image is two (i.e., the maximum number of false alarms is two). Moreover, since only the two eye candidates with the highest network outputs are


selected, false alarms in images that include two eyes are rejected, because the network outputs associated with the eye regions are always higher than the network outputs of the non-eye regions. However, when only one eye appears in a given image (Figure 2.15b), it is possible that a portion of the image that does not contain an eye will be selected as an eye candidate. Due to the constraint of at most two eye candidates per image, the probability that a non-eye portion of an image is classified incorrectly is much higher than the probability that this portion is eventually selected as an eye candidate for the image. Figure 2.16 and Figure 2.17 show that the CNN is robust to large head movements (changes in scale and orientation) and to changes in face illumination. The experimental data suggest that the CNN can detect the subjects' eyes for head movements that span 100% of the required range for translational (X, Y and Z) and rotational (yaw, pitch and roll) head movements.

Figure 2.16: Translational Head Movements: (a) toward the monitor, (b) away from the monitor, (c) to the left, (d) to the right, (e) up, (f) down


Figure 2.17: Rotational Head Movements: Convolutional Neural Network (a) roll to the left, (b) roll to the right, (c) yaw to the left, (d) yaw to the right, (e) pitch upward, (f) pitch downward

2.9 Conclusions

A specific architecture of CNN that was trained to detect eyes in images from a remote gaze estimation system was evaluated with 3 subjects. The experimental results suggest that the CNN can detect the eyes of these subjects for the full range of expected head movements (X ±10cm, Y ±10cm, Z ±10cm, yaw ±20°, roll ±20° and pitch ±20°). To estimate the Point-Of-Gaze, the gaze-estimation system calculates the position of each eye in space and uses a set of parameters that are unique to each eye to calculate the intersection of the visual axis of each eye with objects of interest in the visual field of the subject. Therefore, for each detected eye, the gaze estimation system has to know whether it is the right eye or the left eye. When two eyes are detected, the relative horizontal position


of the two eyes in the image can be used to identify the right-eye and the left-eye (this is due to the fact that the subject’s head movements in the roll direction are limited to ±20◦ ). But, for the expected range of head movements, it is possible that the CNN detects only one eye. In this situation, it is impossible to determine if the detected eye is the right-eye or the left-eye. For instance, when the subject exhibits large head rotation in the yaw direction (more than 15◦ ) as shown in Figure 2.17c and Figure 2.17d, one of the eyes is obscured by the nose bridge and the CNN can only detect one eye. In this case, it cannot determine if the detected eye is the right-eye or the left-eye. In the following chapter, we will discuss a novel methodology to determine the identity (i.e. right or left) of each detected eye even when only one eye appears in the image.

Chapter 3

Midline Detection

The identity of each detected eye can be easily obtained if the system can determine a line that separates the two eyes. An eye that is detected by the CNN algorithm and lies to the right/left of this line can be identified as the right/left eye. One such line is the axis of face symmetry. In the following section, a methodology to determine the axis of face symmetry is described in detail [78]. Then, a new algorithm to detect a vertical line that separates the left and the right eyes (the midline) is presented.

3.1 Symmetry-Based Methodology

The algorithm to find the axis of face symmetry is based on the research of Colmenarez et al. [78], who used face symmetry to determine if frontal views of faces can be detected in still images. To derive the symmetry function, let x[n], n = 0, 1, ..., N − 1, represent a row in the image, and let x_L and x_R be two neighboring subsegments of length W/2, with W ≤ N, such that

x_L[m, k] = x[m - k], \quad x_R[m, k] = x[m + k], \quad k = 1, \ldots, W/2    (3.1)

where W is the window size (set to the width of the face) and m is the position at which the symmetry is measured. A graphical representation of Equation 3.1 is shown in Figure 3.1.

Figure 3.1: Graphical Representation of Face Symmetry

The symmetry at a point m is measured with the correlation function C[m]:

C[m] = \frac{1}{\sigma_{x_L}[m]\,\sigma_{x_R}[m]} \cdot \frac{1}{W/2} \sum_{k=1}^{W/2} \left(x_L[m,k] - \bar{x}_L[m]\right)\left(x_R[m,k] - \bar{x}_R[m]\right)    (3.2)

where

\bar{x}_L[m] = \frac{1}{W/2} \sum_{u=1}^{W/2} x[m-u]

is the mean intensity value of the left window,

\bar{x}_R[m] = \frac{1}{W/2} \sum_{u=1}^{W/2} x[m+u]

is the mean intensity value of the right window,

\sigma_{x_L}[m] = \sqrt{\frac{1}{W/2} \sum_{v=1}^{W/2} \left(x_L[m,v] - \bar{x}_L[m]\right)^2}

is the standard deviation of the left window, and

\sigma_{x_R}[m] = \sqrt{\frac{1}{W/2} \sum_{v=1}^{W/2} \left(x_R[m,v] - \bar{x}_R[m]\right)^2}

is the standard deviation of the right window. The correlation function C[m] is bounded between −1 and 1.

The symmetry function S[m] is determined by computing the average of the symmetry measurements over all rows along the y-axis:

S[m] = \frac{1}{M} \sum_{i=0}^{M} C_i[m]    (3.3)

where i is a row index and M is the number of rows in the face image.

The position of the peak of the symmetry function is an estimate of the position of the axis of face symmetry. The area to the left of the axis of face symmetry contains the left eye and the area to the right of the axis of face symmetry contains the right eye. To detect the axis of face symmetry for a range of roll head movements, the image is first rotated by a set of discrete angles that span the expected range of roll head movements. Then, the face symmetry algorithm is applied to each rotated image and the axis of face symmetry is determined by the line with the highest peak in the symmetry function.

In practice, the image is first smoothed with a low-pass filter to reduce the effect of noise and spatially limited features. To further enhance the performance of the algorithm, Reisfeld et al. [79] suggested using the vertical component of the gradient of the image, \Delta(i, j), instead of the raw data X(i, j) to compute the correlation function in Equation 3.2:

\Delta(i, j) = \left(G_\sigma(i)\,\frac{\partial}{\partial j} G_\sigma(j)\right) \ast X(i, j)    (3.4)

where G_\sigma is a Gaussian filter with \sigma = 0.05H and H is the height of the face in the image. The vertical component of the gradient is used because the face is dominated by horizontal edges; with this representation, areas with constant illumination do not bias the symmetry measure. The face symmetry algorithm described above has several limitations [78] that limit its use for eye identification:

1. It fails to find the axis of face symmetry when the illumination of the face is nonuniform. For instance, when one side of the face is partially covered by a shadow, the CNN might only detect the eye that is not covered by the shadow and therefore, a failure to detect the axis of face symmetry will lead to a failure to identify the detected eye.

2. It is sensitive to asymmetric features. For instance, the projection of the eyes on the image plane becomes asymmetric due to the change in the yaw direction of the head. With large head movements in the yaw direction, the CNN might detect only one eye and without the axis of face symmetry, it is impossible to identify the detected eye.

To overcome these limitations, a new local symmetry algorithm was developed.
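For reference, the face symmetry computation of Equations 3.1-3.4 can be sketched as follows (Python/NumPy with SciPy filters). Border handling is simplified and the function names are illustrative; the thesis implementation is in Matlab, so this is only a sketch of the technique.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def gradient_image(image, face_height):
    """Equation 3.4: Gaussian smoothing along one axis, Gaussian derivative
    along the other, applied to the raw image X."""
    sigma = 0.05 * face_height
    smoothed = gaussian_filter1d(image.astype(float), sigma, axis=0)
    return gaussian_filter1d(smoothed, sigma, axis=1, order=1)

def symmetry_function(image, W):
    """Equations 3.1-3.3: average normalized correlation of mirrored
    sub-segments of length W/2 over all rows."""
    M, N = image.shape
    half = W // 2
    S = np.zeros(N)
    for m in range(half, N - half):
        c = 0.0
        for row in image:
            xL = row[m - half:m][::-1]        # x[m-k], k = 1..W/2 (Eq. 3.1)
            xR = row[m + 1:m + half + 1]      # x[m+k], k = 1..W/2
            sL, sR = xL.std(), xR.std()
            if sL > 0 and sR > 0:             # normalized correlation (Eq. 3.2)
                c += np.mean((xL - xL.mean()) * (xR - xR.mean())) / (sL * sR)
        S[m] = c / M                          # row average (Eq. 3.3)
    return S
```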

3.2 Local Symmetry Algorithm

In the gaze estimation system, face illumination changes very gradually and the illumination of each local area within the face (e.g., the left-eye and right-eye regions) tends to be uniform. Since some of the local facial features are internally symmetric (e.g., each eye, the mouth), it is possible to estimate the symmetry axis of these local features and use their locations to estimate a line that separates the left and right eyes (the midline). By using the symmetry of local features within the face, the algorithm becomes less sensitive to non-uniform illumination and to asymmetric projections of face features on the image plane of the eye-tracker's camera. The design of a local symmetry algorithm to determine the midline has two stages: 1) face detection and 2) local symmetry detection.

3.2.1 Face Detection

The remote gaze estimation system uses infrared (IR) light to illuminate the subject's eyes. Due to this illumination, the brightest regions in the image are associated with reflections from the subject's face. The face detection algorithm uses a relatively high threshold (the "skin" threshold) to locate regions of the face with brightness above the "skin" threshold. To accommodate the expected changes in face brightness due to head movements, the "skin" threshold was set to approximately 60% of image saturation (i.e., 160 for a saturation level of 255). Figure 3.2b illustrates regions of the face that are above the "skin" threshold in the original image (Figure 3.2a). Each bright region is then dilated using a flood-fill technique [80] to form a face blob. Each unconnected skin blob is considered as a start node for the flood-fill algorithm, and the algorithm expands the skin blob to include all pixels with intensity values higher than a second threshold (the "face boundary" threshold) that is lower than the "skin" threshold. The dilated skin blobs are then connected to form a face blob (as shown in Figure 3.2c). In the gaze estimation system, the background illumination is very faint and therefore


the “face boundary” threshold, which is set to separate regions of the face from regions in the background, can have a relatively low level (i.e. 60). The face bounding box (Figure 3.2d) is determined by fitting a rectangular box over the face blob. The width and the height of the bounding box determine the values of W and H, respectively, in equations 3.1, 3.2 and 3.4.

Figure 3.2: Face Detection: (a) Original Image, (b) Skin Blob, (c) Face Blob, (d) Bounding Box Window
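A compact sketch of the two-threshold face detection described above. Connected-component labelling is used here in place of the flood-fill of [80] (components above the "face boundary" threshold are kept only if they contain at least one "skin" pixel), which approximates the same behaviour; thresholds follow the text and the function name is illustrative.

```python
import numpy as np
from scipy import ndimage

def face_bounding_box(image, skin_thr=160, boundary_thr=60):
    """Return (top, bottom, left, right) of the face blob, or None.

    Bright 'skin' pixels (about 60% of saturation) seed the blob, which is
    grown to every connected pixel above the lower 'face boundary'
    threshold, approximating the flood-fill step of the algorithm.
    """
    above_boundary = image >= boundary_thr
    labels, n = ndimage.label(above_boundary)
    face_mask = np.zeros_like(above_boundary)
    for lab in range(1, n + 1):
        component = labels == lab
        if np.any(image[component] >= skin_thr):   # component contains a skin seed
            face_mask |= component
    if not face_mask.any():
        return None
    ys, xs = np.nonzero(face_mask)
    return ys.min(), ys.max(), xs.min(), xs.max()  # box width/height give W and H
```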

3.2.2 Local Symmetry Detection

The face symmetry algorithm can be easily extended to search for local symmetric features. This is done by reducing the length of the two sub-segments (xL , xR in equation 3.1) as shown in Figure 3.3. As the width of the two sub-segments decreases, the shape of the symmetry function changes from a function with single dominant peak (Figure 3.4a) to a function with multiple peaks (Figure 3.4b). The positions of the peaks mark the


locations of local symmetric features in the face. Based on standard face measurements [81], one eye occupies approximately 20% of the width of the face. Therefore, when the width of each sub-segment is set to be approximately 10% of the total width of the bounding box (Figure 3.2d), the peaks in Figure 3.4b represent the left-eye, the right eye and the nose-mouth complex. The performance of the local symmetry algorithm is insensitive to the width of the bounding box window as long as it is within a range of 8%-20% of W .

Figure 3.3: Graphical Representation of Local Symmetric Features

Figure 3.4: Symmetry Function (a) Face Symmetry Algorithm, (b) Local Symmetry Algorithm

In the local symmetry algorithm, the midline is calculated by averaging the locations of up to three peaks in the symmetry function. If the amplitude of a peak is less than 50% of the amplitude of the highest peak of the symmetry function, it is not used in the averaging process (i.e., such peaks do not provide information about local symmetric features of the face). The following examples demonstrate the differences between the face symmetry algorithm and the local symmetry algorithm. Figure 3.5 shows the ability of the local symmetry algorithm to detect the midline when the eye projections on the image plane are asymmetric. In Figure 3.5b, the locations of the left eye and the nose-mouth complex are detected by the local symmetry detector and the midline separates the left eye and the right eye. Note that the face symmetry algorithm (Figure 3.5a) fails to detect the midline.

Figure 3.5: Asymmetric Projection (a) Face Symmetry Algorithm, (b) Local Symmetry Algorithm
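The peak-averaging midline estimate described in this section can be sketched as below; the local-maximum peak test is a simplification and all names are illustrative.

```python
import numpy as np

def midline_from_symmetry(S, rel_amplitude=0.5, max_peaks=3):
    """Estimate the midline position from a local symmetry function S[m]."""
    # Simple local maxima: S[m] larger than both neighbours.
    peaks = [m for m in range(1, len(S) - 1) if S[m] > S[m - 1] and S[m] > S[m + 1]]
    if not peaks:
        return None
    dominant = max(S[m] for m in peaks)
    # Keep up to three peaks whose amplitude is at least 50% of the dominant peak.
    strong = sorted(peaks, key=lambda m: S[m], reverse=True)
    strong = [m for m in strong if S[m] >= rel_amplitude * dominant][:max_peaks]
    return float(np.mean(strong))   # average location of the symmetric features
```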

Figure 3.6 shows that even when one side of the face is poorly illuminated, the local symmetry algorithm is capable of finding the midline. It detects two local symmetric features on the face (Figure 3.6b), which mark the locations of the left eye and the nose-mouth complex, and the midline separates the left eye and the right eye, while the axis of face symmetry that is detected by the face symmetry algorithm fails to separate the left and right eyes (Figure 3.6a).

Figure 3.6: Non-Uniform Illumination (a) Face Symmetry Algorithm, (b) Local Symmetry Algorithm

Figure 3.7 demonstrates the ability of the local symmetry algorithm to detect the midline when the subject's head is tilted in the roll direction. Note that the local symmetry algorithm does not require image rotation prior to the determination of the midline. Therefore, it is computationally much more efficient than the face symmetry algorithm.

Figure 3.7: Local Symmetry Algorithm Applied on Head-Tilted Image


3.3 Performance Comparison: Face Symmetry Algorithm and Local Symmetry Algorithm

The performance of the face symmetry and the local symmetry algorithms was compared with three subjects using the same setting as described in section 2.8. A computer was used to calculate and display in real-time the subject’s face with superimposed midline and face bounding box and a head tracker was used to record the subjects’ head movements (X, Y, Z, yaw, pitch and roll). Figure 3.8 and Figure 3.9 show several examples that demonstrate the performance of the local symmetry algorithm for translational head movements (X - Figure 3.8a and 3.8b, Y - Figure 3.8c and 3.8d and Z - Figure 3.8e and 3.8f) and rotational head movements (roll - Figure 3.9a and 3.9b, yaw - Figure 3.9c and 3.9d and pitch - Figure 3.9e and 3.9f).

The results of the experiments are summarized in Tables 3.1 and 3.2.

Table 3.1: Experimental Results: Face Symmetry Algorithm

Subject  X (cm)  Y (cm)  Z (cm)  Yaw (°)  Pitch (°)  Roll (°)
S1       ±21.5   ±8      ±7      ±11      ±25        ±40
S2       ±21     ±6.5    ±6.5    ±8       ±46        ±45
S3       ±19     ±5      ±10.5   ±9       ±55        ±43

Table 3.2: Experimental Results: Local Symmetry Algorithm

Subject  X (cm)  Y (cm)  Z (cm)  Yaw (°)  Pitch (°)  Roll (°)
S1       ±21.5   ±11.5   ±7      ±29      ±25        ±32
S2       ±21     ±9.5    ±6.5    ±24      ±46        ±45
S3       ±19     ±7.5    ±10.5   ±33      ±55        ±31

Figure 3.8: Translational Head Movements: Local Symmetry Algorithm: (a) close to the monitor, (b) far from the monitor, (c) to the left of the monitor, (d) to the right of the monitor, (e) upward, (f) downward

For translational head movements in the X and Z directions and for pitch head rotations, the two

algorithms had similar performance. For translational head movements in the Y direction, the local symmetry algorithm outperforms the face symmetry algorithm. This is due to the ability of the local symmetry algorithm to handle non-uniform face illumination. The local symmetry algorithm also outperforms the face symmetry algorithm in the yaw direction due to its ability to cope with asymmetric projections of face features. The face symmetry algorithm outperforms the local symmetry algorithm for head rotations in the roll direction. However, the improved performance of the face symmetry algorithm in the roll direction required significantly more processing time: it took about 90 times longer for the face symmetry algorithm to detect the axis of face symmetry than it took for the local symmetry algorithm to detect the midline (image size 1280 × 1024 pixels, 8 discrete image rotations 10° apart). Note that the local symmetry algorithm detects the midline for the full range of expected head movements in the yaw, pitch, roll and X directions (yaw ±20°, pitch ±20°, roll ±20° and X ±10cm). However, it fails to detect the midline for the required range of head movements in the Y and Z directions (Y ±10cm, Z ±10cm).

Figure 3.9: Rotational Head Movements: Local Symmetry Algorithm: (a) roll to the left, (b) roll to the right, (c) yaw to the left, (d) yaw to the right, (e) pitch upward, (f) pitch downward

3.4 Conclusions

If the local symmetry algorithm could support the full range of required head movements, it could be used to identify the eyes detected by the CNN, and the combined detection and identification performance of the two algorithms would satisfy the requirements of the remote gaze estimation system. However, the local symmetry algorithm fails to detect the midline for the full range of translational head movements in the Y and Z directions.


The performance of the local symmetry algorithm is limited only for translational head movements in the Y and Z directions, and the CNN can detect both eyes for the full range of head movements in these directions (if both eyes are detected, the relative horizontal position of the eyes determines their identity). It is therefore possible that by combining the CNN and the local symmetry algorithm, one might improve the eye identification performance. The next chapter will describe an algorithm that combines the CNN and the local symmetry algorithm for eye identification.

Chapter 4

Eye Identification

In the gaze-estimation system, the point-of-gaze is calculated for each eye separately. In these calculations, the coordinates of several eye features (the center of the pupil and the corneal reflections) are used with physiological parameters that are unique to each eye (corneal radius, the distance from the pupil to the center of the cornea, and the deviation between the optical and visual axes) to calculate the point-of-gaze [1]. For these calculations, each detected eye has to be identified as either the left or the right eye. Based on our experience in the analysis of visual scanning patterns [5], a false identification rate of 1% is acceptable for scan-path analysis and therefore this rate was selected as the system's performance criterion for eye identification. However, since one can reduce the false identification rate arbitrarily by not processing all the images from the eye gaze estimation system's camera, a bound on the rejection rate is needed in order to maintain a reasonable temporal resolution for the scan-path analysis. For a system that operates at 20 frames per second, a rejection rate of 5% can still provide enough temporal resolution for the analysis of visual scan-paths [5]. Therefore, 5% was selected as an upper bound for the rejection rate of images in the remote gaze estimation system. To achieve this goal, a hybrid algorithm for eye identification that uses information from both the Convolutional Neural Network and the local symmetry algorithm is proposed. The proposed algorithm fuses the CNN and the local symmetry algorithm based on their confidence measures, as shown in Figure 4.1. If the CNN outputs for the two detected eyes are above a certain confidence level, a, the CNN is used for eye identification. In situations where the CNN detects only one eye, or the network response of one of the detected eyes is lower than the expected threshold for eye identification, the local symmetry algorithm is used for eye identification. If a criterion for eye identification by the local symmetry algorithm exceeds a certain level, b, the left and right eyes are determined by the local symmetry algorithm. Otherwise, the remote gaze estimation system cannot identify the left and right eyes and will not estimate the subject's point-of-gaze. In the next sections, the threshold a for the CNN and the threshold b for the local symmetry algorithm are determined such that the overall system performance criteria (a false identification rate of 1% and a rejection rate of 5%) are satisfied.

Figure 4.1: Confidence-Based Fusion for Eye Identification
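The decision logic of Figure 4.1 can be summarized by the sketch below, with a and b set to the values derived in Sections 4.1 and 4.2 (0.9 for the CNN outputs and 12 dB for the SNR of the local symmetry algorithm). The inputs, helper structure and left/right labelling convention are illustrative stand-ins, not the thesis code.

```python
def identify_eyes(cnn_candidates, symmetry_snr, midline, a=0.9, b=12.0):
    """Fuse the CNN and the local symmetry algorithm for eye identification.

    cnn_candidates -- list of (network_output, x_position) for detected eyes
    symmetry_snr   -- SNR of the symmetry function (Equation 4.1), in dB
    midline        -- estimated midline position (column index) or None
    Returns a dict of labelled eyes, or None if the image is rejected.
    """
    # Case 1: two confident CNN detections -> relative horizontal position
    # (valid because roll is limited to +/- 20 degrees).
    if len(cnn_candidates) == 2 and all(out >= a for out, _ in cnn_candidates):
        left, right = sorted(cnn_candidates, key=lambda c: c[1])
        return {"left": left, "right": right}
    # Case 2: fall back to the local symmetry algorithm if it is confident enough.
    if midline is not None and symmetry_snr >= b:
        labels = {}
        for out, x in cnn_candidates:
            labels["left" if x < midline else "right"] = (out, x)
        return labels
    # Case 3: reject -- no point-of-gaze estimate for this image.
    return None
```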

4.1 Criteria for Eye Identification by CNN

When two eyes are detected by the CNN, the relative horizontal position of the two eyes in the image can be used to identify the right and left eyes. This is because head movements in the roll direction are limited to ±20◦ . However, when the CNN detects two eye candidates in the image but one of them is not a real eye (false alarm), it can lead to a false identification (Figure 4.2). This condition usually occurs when only one eye appears in the image.

Figure 4.2: False Eye Identification: the right eye is identified as the left eye, as indicated by the label L

To determine a criterion for the output of the CNN so that an overall false identification rate of 1% is achieved, the plot shown in Figure 4.3 is used. Figure 4.3 is based on 5000 face images. For each image, the two regions with the highest CNN outputs were classified as the two detected eyes (each output had to exceed the classifier threshold, 0.45). The data in Figure 4.3 show the false identification rate as a function of the amplitude of the CNN output for the eye with the lower network output. As shown in Figure 4.3, the classifier threshold at the output of the CNN has to be larger than 0.8 to limit the false identification rate to less than 1%. If 0.8 is taken as the threshold for eye identification, 25% of the images cannot be processed, since in these images one could


not find two regions with CNN outputs higher than 0.8. If one assumes that in some of these images the eyes will not be identified correctly by the local symmetry algorithm, the overall error rate for eye identification would exceed 1%; therefore, the error rate of the CNN had to be reduced to satisfy the system performance criteria. Based on these considerations, the CNN is used for eye identification only when the CNN outputs for two regions in an image exceed 0.9. Under this condition, the false eye identification rate is 0.02% and the rejection rate is 36.74%. Images that are not processed by the CNN are processed by the local symmetry algorithm, as shown in Figure 4.1.

Figure 4.3: Eye Identification Performance using CNN (false eye identification rate vs. network output)
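The curve in Figure 4.3 could be reproduced from labelled validation images roughly as sketched below, assuming one has, for each image, the lower of the two highest CNN outputs and a flag indicating whether the relative-position labelling was correct; all names are hypothetical.

```python
import numpy as np

def identification_vs_threshold(min_outputs, correct, thresholds):
    """For each threshold a, compute the false identification rate and the
    rejection rate when only images whose lower of the two top CNN outputs
    exceeds a are identified by the CNN.

    min_outputs -- per-image minimum of the two highest CNN outputs
    correct     -- per-image flag: True if relative-position labelling is correct
    """
    min_outputs = np.asarray(min_outputs)
    correct = np.asarray(correct, dtype=bool)
    results = []
    for a in thresholds:
        accepted = min_outputs >= a
        n_acc = accepted.sum()
        false_id = np.mean(~correct[accepted]) if n_acc else 0.0
        rejection = 1.0 - n_acc / len(min_outputs)
        results.append((a, false_id, rejection))
    return results
```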

4.2 Criteria for Eye Identification by Local Symmetry Algorithm

When the images that are not processed by the CNN (36.74% of all images) are processed by the local symmetry algorithm, the false eye identification rate is 4.4%. This corresponds to an overall false eye identification rate of 1.62% (36.74% × 4.4%), which does not satisfy the system performance criteria. In order to reduce the false eye identification rate, one needs to derive a measure that indicates the confidence with which the


local symmetry algorithm identifies the left and right eyes. Using this measure, the local symmetry algorithm can reject images in which it has low confidence in identifying the eyes. The measure is based on the following observations. When the local symmetry algorithm determines the midline correctly, the amplitudes of the peaks that correspond to local symmetric facial features are much higher than those of the other peaks in the symmetry function (Figure 4.4).

Figure 4.4: Correct Eye Identification by Local Symmetry Algorithm

When the local symmetry algorithm fails to identify the eyes correctly, all the peaks in the symmetry function have similar amplitudes (Figure 4.5) and the locations of the dominant peaks might not represent the locations of symmetric facial features. Based on this observation, a signal-to-noise ratio (SNR) measure, defined by Equation 4.1, is used to determine the confidence with which the local symmetry algorithm identifies the left and right eyes. For each image, the peaks of the symmetry function are first identified and the peak with the largest amplitude (the dominant peak) is selected.


Figure 4.5: False Eye Identification by Local Symmetry Algorithm

If two other peaks are at least 50% of the amplitude of the dominant peak, all three peaks are considered to be associated with symmetric facial features and they are used to compute A_signal in Equation 4.1. The other peaks of the symmetry function are used to compute A_noise.

\mathrm{SNR} = 20 \log_{10}\left(\frac{A_{signal}}{A_{noise}}\right)    (4.1)

where

A_{signal} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} s_i^2}, \quad M \le 3

is the RMS amplitude of the signal peaks s_i, and

A_{noise} = \sqrt{\frac{1}{K} \sum_{i=1}^{K} n_i^2}

is the RMS amplitude of the noise peaks n_i. To determine the criteria for the local symmetry algorithm so that the overall false eye identification rate will be less than 1% and the rejection rate will be less than 5%,


graphs that show the false identification rate as a function of the SNR (Figure 4.6) and the rejection rate as a function of the SNR (Figure 4.7) are constructed. Note that the data in Figures 4.6 and 4.7 are associated with the images that could not be processed by the CNN.

Figure 4.6: Eye Identification Performance using Local Symmetry Algorithm (false eye identification rate vs. SNR)

Figure 4.7: Eye Rejection Performance using Local Symmetry Algorithm (rejection rate vs. SNR)

Figures 4.6 and 4.7 show that as the SNR threshold increases, the false identification rate decreases and the rejection rate increases. Figure 4.6 shows that in order to achieve an overall false eye identification rate of less than 1%, the SNR threshold has to be larger than 4 dB (36.74% × 1.52%). Figure 4.7 shows that to achieve a rejection rate of less than 5%, the SNR threshold has to be less than 12 dB (36.74% × 13.6%). When the data from


Figures 4.6 and 4.7 are combined, any threshold in the range of 4 dB to 12 dB can be used to satisfy the false eye identification rate and rejection rate requirements of the remote gaze estimation system. As discussed in the introduction of Chapter 4, a rejection rate of 5% will not affect the ability to analyze the visual scan-path. Therefore, in an attempt to minimize the false eye identification rate, an SNR threshold of 12 dB was selected.
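A sketch of the SNR measure of Equation 4.1: the strongest peaks (up to three, within 50% of the dominant peak) form the signal, the remaining peaks form the noise. The peak test is simplified and the function name is illustrative.

```python
import numpy as np

def symmetry_snr(S, rel_amplitude=0.5, max_signal_peaks=3):
    """SNR (in dB) of a symmetry function S[m], following Equation 4.1."""
    peaks = [m for m in range(1, len(S) - 1) if S[m] > S[m - 1] and S[m] > S[m + 1]]
    if len(peaks) < 2:
        return 0.0                       # not enough peaks to form a measure
    amps = np.array(sorted((S[m] for m in peaks), reverse=True))
    dominant = amps[0]
    signal = amps[amps >= rel_amplitude * dominant][:max_signal_peaks]  # M <= 3
    noise = amps[len(signal):]           # all remaining peaks
    if noise.size == 0:
        return float("inf")              # no competing peaks at all
    a_signal = np.sqrt(np.mean(signal ** 2))   # RMS of the signal peaks
    a_noise = np.sqrt(np.mean(noise ** 2))     # RMS of the noise peaks
    return 20.0 * np.log10(a_signal / a_noise)
```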

4.3 Experimental Results and Discussions

The performance of the identification algorithm was tested using the same test database described in Section 2.8. The CNN was used for eye detection and the hybrid algorithm was used for eye identification. For comparison, the CNN approach and the local symmetry approach for eye identification were also evaluated separately. For the CNN approach, the threshold was set to 0.8, which corresponds to a false eye identification rate of 1% (Figure 4.3). For the local symmetry approach, the threshold was set to 4 dB, which corresponds to a false eye identification rate of 1%. For the test database, the hybrid approach for eye identification achieved a correct eye identification rate of 99.4% with a rejection rate of 0.6% and a false identification rate of 0%. The CNN approach for eye identification achieved a correct eye identification rate of 92.5% with a rejection rate of 3.17% and a false identification rate of 4.23%. The local symmetry algorithm achieved a correct eye identification rate of 94.4% with a rejection rate of 2.38% and a false identification rate of 3.22%. The hybrid algorithm outperforms the CNN and the local symmetry approaches for eye identification because it combines the strengths of both approaches while minimizing their deficiencies. The CNN has excellent performance for head movements in the X, Y, Z, roll and pitch directions because both eyes can be detected reliably even when faces are poorly illuminated (Figure 4.8a). The CNN approach has poor performance for


large head movements in the yaw direction because the CNN can only detect one eye in the image (Figure 4.8b).

Figure 4.8: CNN Approach for Eye Identification: (a) Correct Eye Identification, (b) Eye Rejection

The local symmetry algorithm has excellent performance for head movements in the X, yaw, roll and pitch directions (Figure 4.9a), but it cannot process images with low contrast, which are very common for large head movements in the Y and Z directions (Figure 4.9b). Note that images that cannot be processed by the CNN (Figure 4.8b) are identified correctly by the local symmetry algorithm (Figure 4.9a), and images that cannot be processed by the local symmetry algorithm (Figure 4.9b) can be identified by the CNN (Figure 4.8a). The robustness of the hybrid approach for eye identification comes from the fact that the CNN is used to identify the eyes in images that result from large head movements in the Y and Z directions (Figure 4.10a), while the local symmetry algorithm is used to identify the eyes in images that result from large head movements in the yaw direction (Figure 4.10b). From this perspective, the CNN and the local symmetry algorithm are complementary algorithms for eye identification. The hybrid approach for eye identification has a rejection rate of 0.6%.

Figure 4.9: Local Symmetry Algorithm for Eye Identification: (a) Correct Eye Identification, (b) Eye Rejection

Figure 4.10: Correct Eye Identification Using Hybrid Approach

Rejections happen when images are poorly illuminated and both the CNN and the local symmetry algorithm cannot identify the detected eyes with high confidence. For instance, in Figure 4.11, the CNN detects only one eye and the symmetry algorithm has low confidence in determining the identity of the detected eye, so the remote gaze estimation system will not process the detected eye for point-of-gaze estimation.

Figure 4.11: Eye Unidentified by Hybrid Approach

The experimental results suggest that the hybrid approach can accurately determine the identity of eyes detected by the CNN, with an identification rate of 99.4%. This identification performance covers the full range of expected head movements (X ±10cm, Y ±10cm, Z ±10cm, yaw ±20°, roll ±20° and pitch ±20°). When the detection rate of the CNN (95.2%) is combined with the performance of the hybrid algorithm for eye identification, 94.6% of the eyes in the images from the gaze estimation system are detected and identified correctly.

4.4 Conclusions

The contributions of this thesis are as follows:

1. The development of a novel method to detect eyes over the full range of the expected head movements in remote gaze estimation systems. A specific architecture of Convolutional Neural Networks was constructed and trained to detect human eyes. Experiments with 3 subjects showed that the CNN for eye detection achieves a detection rate of 95.2% with a false alarm rate of 2.65 × 10−4 %.

2. The development of a novel algorithm for the identification of the left and right eyes. The local symmetry algorithm uses the symmetry of local facial features to detect a line that separates the left and right eye regions. This line (midline) is used to determine the identity of the detected eyes. It achieves a correct identification rate of 94.4% with a rejection rate of 2.38%.

3. The development of a hybrid algorithm for eye identification. The hybrid algorithm for eye identification fuses the CNN and the local symmetry algorithms based on the confidence level of each algorithm. It achieves a correct identification rate of 99.4% with a rejection rate of 0.6%.

The methodology for eye detection and identification that was developed in this thesis enables the remote gaze estimation system to monitor visual scanning patterns for the full range of the expected head movements. This new methodology will be used as part of a system that can provide reliable and objective assessment of visual functions in infants based on the analysis of their visual scanning behavior.

4.4.1 Future Work

Remote gaze estimation systems for the analysis of visual scanning patterns are required to operate in real time at a rate of at least 20 frames per second. In the current system, the eye detection and point-of-gaze estimation algorithms require 20 milliseconds per image.


Therefore, the proposed methodology should take less than 30 milliseconds. Currently, the local symmetry algorithm and the Convolutional Neural Network are implemented in Matlab without optimization. For an image size of 1280 × 1024 pixels, on an Intel Pentium 4 3.2 GHz with 2 GB of RAM, the local symmetry algorithm takes about 100 milliseconds per frame, whereas the CNN takes about 4 seconds per frame. In order to be integrated into the remote gaze estimation system, the algorithms might have to be implemented directly in hardware, such as a very-large-scale integration (VLSI) circuit [82] or a field-programmable gate array (FPGA) [83]. The proposed methodology requires processing the entire image to detect the eyes in every frame. This process can be made more efficient by retaining information between images. When the eyes are detected in a given image, the search windows for the eyes in subsequent images can be limited to the possible changes in eye positions between frames. By using this temporal information, the time required by the eye detection and identification algorithm can be dramatically reduced. Although the methodology for eye detection was shown to be robust to changes in experimental conditions, further investigation is required to validate its performance with infants. The effectiveness of the proposed methodology has to be tested on subjects of different ages and with different facial features in order to verify its detection performance.

Bibliography

[1] E. D. Guestrin and M. Eizenman, "General theory of remote gaze estimation using the pupil center and corneal reflections," IEEE Transactions on Biomedical Engineering, vol. 53, pp. 1124–1134, 2006.

[2] R. Chellappa, C. Wilson, and S. Sirohey, "Human and machine recognition of faces: a survey," in Proceedings of the IEEE, vol. 83, 1995, pp. 705–740.

[3] K. Grauman, M. Betke, J. Lombardi, J. Gips, and G. Bradski, "Communication via eye blinks and eyebrow raises: Video-based human-computer interfaces," Universal Access in the Information Society Journal, pp. 359–373, 2003.

[4] R. J. K. Jacob, "The use of eye movements in human-computer interaction techniques: what you look at is what you get," ACM Transactions on Information Systems, vol. 9, no. 2, pp. 152–169, 1991.

[5] M. Eizenman, L. H. Yu, L. Grupp, E. Eizenman, M. Ellenbogen, M. Gemar, and R. D. Levitan, "A naturalistic visual scanning approach to assess selective attention in major depressive disorder," Psychiatry Research, vol. 118, pp. 117–128, 2003.

[6] C. Karatekin and R. F. Asarnow, "Exploratory eye movements to pictures in childhood-onset schizophrenia and attention-deficit/hyperactivity disorder," Journal of Abnormal Child Psychology, vol. 27, pp. 35–49, 1999.


[7] M. D. Luca, E. D. Pace, A. Judica, D. Spinelli, and P. Zoccolotti, “Eye movement patterns in linguistic and non-linguistic tasks in developmental surface dyslexia,” Neuropsychologia, vol. 37, pp. 1407–1420, 1999. [8] M. Eizenman, T. Jares, and A. Smiley, “A new methodology for the analysis of eye movements and visual scanning in drivers,” Transport Canada, Tech. Rep., 1999. [9] J. L. Harbluk, I. Y. Noy, and M. Eizenman, “The impact of cognitive distraction on driver visual and vehicle control,” Transport Canada, Tech. Rep., 2002. [10] D. Cleveland, “Unobtrusive eyelid closure and visual point of regard measurement system,” in Proceedings Technical Conference on Ocular Measures of Driver Alertness, 1999, pp. 57–74. [11] P. A. Wetzel, G. Krueger-Anderson, C. Poprik, and P. Bascom, “An eye tracking system for analysis of pilots scan paths,” in Proceedings of the 18th Industry Training Systems and Education Conference, 1996, pp. 1–16. [12] J. H. Goldberg and X. P. Kotval, “Computer interface evaluation using eye movements: methods and constructs,” International Journal of Industrial Ergonomics, vol. 24, pp. 631–645, 1999. [13] E. D. Guestrin, “A novel head-free point-of-gaze estimation system,” M.A.Sc., Dept. of Electrical and Computer Engineering, University of Toronto, Canada, 2003. [14] B. J. M. Lui, “A point-of-gaze estimation system for studies of visual attention,” M.A.Sc, Dept. of Electrical and Computer Engineering, University of Torotno, Canada, 2003. [15] W. Skarbek and A. Koschan, “Colour image segmentation - a survey -,” Institute for Technical Informatics, Technical University of Berlin, Tech. Rep., October 1994.


[16] N. Oliver, A. Pentland, and F. Berard, “Lafter: Lips and face real time tracker,” in Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition (CVPR ’97). Washington, DC, USA: IEEE Computer Society, 1997, pp. 123–129.

[17] B. D. Zarit, B. J. Super, and F. K. H. Quek, “Comparison of five color models in skin pixel classification,” in Proceedings of the International Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems. Washington, DC, USA: IEEE Computer Society, 1999, pp. 58–63.

[18] S. McKenna, S. Gong, and Y. Raja, “Modeling facial colour and identity with gaussian mixtures,” Pattern Recognition, vol. 31, pp. 1883–1892, 1998.

[19] L. Sigal, S. Sclaroff, and V. Athitsos, “Estimation and prediction of evolving color distributions for skin segmentation under varying illumination,” in Proceedings IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2000, pp. 152–159.

[20] J. C. Terrillon, M. N. Shirazi, H. Fukamachi, and S. Akamatsu, “Comparative performance of different skin chrominance models and chrominance spaces for the automatic detection of human faces in color images,” in Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition 2000. Washington, DC, USA: IEEE Computer Society, 2000, pp. 54–61.

[21] D. Brown, I. Craw, and J. Lewthwaite, “A SOM based approach to skin detection with application in real time systems,” in Proceedings of the British Machine Vision Conference 2001, 2001, pp. 10–13.

[22] S. L. Phung, A. Bouzerdoum, and D. Chai, “A novel skin color model in YCbCr color space and its application to human face detection,” in IEEE International Conference on Image Processing (ICIP’2002), vol. 1, 2002, pp. 289–292.


[23] P. Peer, J. Kovac, and F. Solina, “Human skin colour clustering for face detection,” EUROCON 2003. Computer as a Tool. The IEEE Region 8, vol. 2, pp. 144–148, 2003. [24] D. Chai and A. Bouzerdoum, “A bayesian approach to skin color classification in ycbcr color space,” in Proceedings IEEE Region Ten Conference (TENCON’2000), vol. 2, 2000, pp. 421–424. [25] M. Yang and N. Ahuja, “Gaussian mixture model for human skin color and its application in image and video databases,” in Procceeding of the SPIE: Conf. on Storage and Retrieval for Image and Video Databases (SPIE 99), vol. 3656, 1999, pp. 458–466. [26] A. M. Bagci, R. Ansari, A. Khokhar, and E. Cetin, “Eye tracking using markov models,” in Proceedings of the 17th Internal Conference on Pattern Recognition, 2004, pp. 1051–4651. [27] R. Stiefelhagen, J. Yang, and A. Waibel, “Tracking eyes and monitoring eye gaze,” in Proceedings of the Workshop on Perceptual User Interfaces (PUI’97), 1997, pp. 98–100. [28] V. Govindaraju, “Locating human faces in photographs,” International Journal Computer Vision, vol. 19, pp. 129–146, 1996. [29] S. R. Gunn and M. S. Nixon, “A dual active contour for head and boundary extraction,” IEE Colloguium on Image Processing Biometric Measurement, vol. 1, pp. 6/1–6/4, 1994. [30] C. L. Huang and C. W. Chen, “Human facial feature extraction for face interpretation and recognition,” Pattern Recognition, vol. 25, pp. 1435–1444, 1992.


[31] G. C. Feng and P. C. Yuen, “Variance projection function and its application to eye detection for human face recognition,” Pattern Recognition Letters, vol. 19, pp. 899–906, July 1998.

[32] A. L. Yuille, P. W. Hallinan, and D. S. Cohen, “Feature extraction from faces using deformable templates,” International Journal of Computer Vision, vol. 8, pp. 99–111, 1992.

[33] X. Xie, R. Sudhakar, and H. Zhuang, “On improving eye feature extraction using deformable templates,” Pattern Recognition, vol. 27, pp. 791–799, 1994.

[34] K.-M. Lam and H. Yan, “Locating and extracting the eye in human face images,” Pattern Recognition, vol. 29, no. 5, pp. 771–779, May 1996.

[35] F. Wu, T. Yang, and M. Ouhyoung, “Automatic feature extraction and face synthesis in facial image coding,” in Proceedings of the Sixth Pacific Conference on Computer Graphics and Applications, 1998, pp. 218–219.

[36] S. Kawato and J. Ohya, “Real-time detection of nodding and head-shaking by directly detecting and tracking the between-eyes,” in Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, 2000, pp. 40–45.

[37] G. Feng and P. Yuen, “Multi-cues eye detection on gray intensity image,” Pattern Recognition, vol. 34, pp. 1033–1046, 2001.

[38] Q. Ji and X. Yang, “Real time 3D face pose discrimination based on active IR illumination,” in Proceedings of the 16th International Conference on Pattern Recognition (ICPR’02), vol. 4, 2002, pp. 310–313.

[39] W. Huang and R. Mariani, “Face detection and precise eyes location,” in Proceedings of the International Conference on Pattern Recognition. Washington, DC, USA: IEEE Computer Society, 2000, pp. 722–727.


[40] W. Huang, Q. Sun, C. P. Lam, and J. K. Wu, “A robust approach to face and eyes detection from images with cluttered background,” in Proceedings of the 14th International Conference on Pattern Recognition, vol. 1. Washington, DC, USA: IEEE Computer Society, 1998, pp. 110–113.

[41] A. Cozzi, M. Flickner, J. Mao, and S. Vaithyanathan, “A comparison of classifiers for real-time eye detection,” in Proceedings of the International Conference on Artificial Neural Networks, 2001, pp. 993–999.

[42] J. Huang, X. Shao, and H. Wechsler, “Face pose discrimination using support vector machines,” in Proceedings of the 14th International Conference on Pattern Recognition, vol. 1. IEEE Computer Society, 1998, p. 154.

[43] M. Motwani, R. Motwani, and F. C. Harris, “Eye detection using wavelets and ANN,” in Proceedings of Global Signal Processing Conferences and Expos for Industry, 2004, pp. 27–30.

[44] H. Peng, C. Zhang, and Z. Bian, “Human eyes detection using hybrid neural method,” in Proceedings of the 4th International Conference on Signal Processing, 1998, pp. 1088–1091.

[45] I. T. Jolliffe, Principal Component Analysis. New York, NY, USA: Springer-Verlag, 1986.

[46] C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, pp. 273–297, 1995.

[47] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation. Addison-Wesley Longman Publ. Co., Inc., 1991.

[48] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, pp. 2278–2324, 1998.


[49] R. Lippmann, “Review of neural networks for speech recognition,” Neural Computation, vol. 1, pp. 1–38, 1989.

[50] D. S. Ghosh, “Credit card fraud detection with a neural-network,” in Proceedings of the 27th Annual Hawaii International Conference on System Sciences, vol. 3, 1994, pp. 621–630.

[51] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.

[52] D. DeMers and G. W. Cottrell, “Nonlinear dimensionality reduction,” Advances in Neural Information Processing Systems, vol. 5, pp. 580–587, 1993.

[53] R. Hecht-Nielsen, “Replicator neural networks for universal optimal source coding,” Science, vol. 269, pp. 1860–1863, 1995.

[54] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, pp. 1527–1554, 2006.

[55] B. Fasel, “Robust face analysis using convolutional neural networks,” in Proceedings of the International Conference on Pattern Recognition (ICPR 02), vol. 2, 2002, pp. 40–43.

[56] D. D. Lee and H. S. Seung, “A neural network based head tracking system,” in NIPS ’97: Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, vol. 10. Cambridge, MA, USA: MIT Press, 1998, pp. 908–914.

[57] S. J. Nowlan and J. C. Platt, “A convolutional neural network hand tracker,” in Advances in Neural Information Processing Systems, G. Tesauro, D. Touretzky, and T. Leen, Eds., vol. 7. The MIT Press, 1995, pp. 901–908.


[58] R. Feraud, O. J. Bernier, J.-E. Viallet, and M. Collobert, “A fast and accurate face detector based on neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 1, pp. 42–53, 2001.

[59] K. K. Sung and T. Poggio, “Example-based learning for view-based human face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 39–51, 1998.

[60] E. Osuna, R. Freund, and F. Girosi, “Training support vector machines: an application to face detection,” in Proceedings of the 1997 Conference on Computer Vision and Pattern Recognition (CVPR ’97). IEEE Computer Society, 1997, pp. 130–136.

[61] S. A. Vavasis, Nonlinear Optimization: Complexity Issues. Oxford University Press, Inc., 1991.

[62] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” in Neurocomputing: Foundations of Research. MIT Press, 1988, pp. 696–699.

[63] S. Lawrence, C. L. Giles, and A. C. Tsoi, “Lessons in neural network training: Overfitting may be harder than expected,” in Proceedings of the Fourteenth National Conference on Artificial Intelligence. AAAI Press, 1997, pp. 540–545.

[64] W. S. Sarle, “Stopped training and other remedies for overfitting,” in Proceedings of the 27th Symposium on the Interface of Computing Science and Statistics, 1995, pp. 352–360.

[65] H. Rowley, S. Baluja, and T. Kanade, “Rotation invariant neural network-based face detection,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Washington, DC, USA: IEEE Computer Society, 1998, pp. 38–44.


[66] D. Roth, M. Yang, and N. Ahuja, “A SNoW-based face detector,” in Neural Information Processing Systems. MIT Press, 2000, pp. 862–868.

[67] I. S. Helliwell, M. A. Turega, and R. A. Cottis, “Accountability of neural networks trained with real world data,” in Proceedings of the Fourth International Conference on Artificial Neural Networks, 1995, pp. 218–222.

[68] D. Beymer, A. Shashua, and T. Poggio, “Example based image analysis and synthesis,” Massachusetts Institute of Technology, Tech. Rep., 1993.

[69] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” in ICDAR ’03: Proceedings of the Seventh International Conference on Document Analysis and Recognition. Washington, DC, USA: IEEE Computer Society, 2003, pp. 958–963.

[70] R. A. Dunne and N. A. Campbell, “On the pairing of the softmax activation and cross-entropy penalty functions and the derivation of the softmax activation function,” in Proceedings of the Eighth Australasian Conference on Neural Networks, 1997, pp. 181–185.

[71] S. Becker and Y. LeCun, “Improving the convergence of back-propagation learning with second-order methods,” in Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, Eds., 1989, pp. 29–37.

[72] D. E. Rumelhart and J. L. McClelland, Eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 2: Psychological and Biological Models. Cambridge, MA, USA: MIT Press, 1986.

[73] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.


[74] M. R. Hestenes and E. Stiefel, “Methods of conjugate gradients for solving linear systems,” Journal of Research of the National Bureau of Standards, vol. 49, no. 6, pp. 409–436, 1952.

[75] D. Shanno, “On the Broyden-Fletcher-Goldfarb-Shanno method,” Journal of Optimization Theory and Applications, vol. 46, pp. 87–94, 1985.

[76] C. Bishop, “Exact calculation of the Hessian matrix for the multilayer perceptron,” Neural Computation, vol. 4, no. 4, pp. 494–501, 1992.

[77] R. Allison, M. Eizenman, and B. Cheung, “Combined head and eye tracking system for dynamic testing of the vestibular system,” IEEE Transactions on Biomedical Engineering, vol. 43, pp. 1073–1082, 1996.

[78] A. Colmenarez and T. Huang, “Frontal view face detection,” Visual Communications and Image Processing, vol. 2501, pp. 90–98, 1995.

[79] D. Reisfeld and Y. Yeshurun, “Robust detection of facial features by generalized symmetry,” in Proceedings of the International Conference on Pattern Recognition, Conference A: Computer Vision and Applications, vol. 1, 1992, pp. 117–120.

[80] J. New, “A method for hand gesture recognition,” in Proceedings of the ACM MidSoutheast Chapter Fall Conference, November 2002, pp. 12–16.

[81] E. Makinen and R. Raisamo, “Real-time face detection for kiosk interfaces,” in Proceedings of Tampere Unit for Computer-Human Interaction 2002, vol. 2, 2002, pp. 528–539.

[82] B. E. Boser, E. Sackinger, J. Bromley, Y. LeCun, R. E. Howard, and L. D. Jackel, “An analog neural network processor and its application to high-speed character recognition,” in Proceedings of the International Joint Conference on Neural Networks, vol. 1, July 1991, pp. 415–420.


[83] R. Gadea, F. Ballester, A. Mocholi, and J. Cerda, “Artificial neural network implementation on a single FPGA of a pipelined on-line backpropagation,” in Proceedings of the 13th International Symposium on System Synthesis. Los Alamitos, CA, USA: IEEE Computer Society, 2000, pp. 225–230.
