Semantically-based Human Scanpath Estimation with HMMs
Huiying Liu, Dong Xu, Qingming Huang, Wen Li, Min Xu, Stephen Lin
Institute for Infocomm Research (I2R), Singapore; Nanyang Technological University; Chinese Academy of Sciences; University of Technology, Sydney; Microsoft Research, Beijing
Presenter: Huiying Liu
Scanpath estimation
What is a scanpath? An eye gaze sequence: the ordered series of fixations a viewer makes while looking at an image.
Our purpose: estimate human scanpaths.
[Figure: user scanpaths vs. estimated scanpaths]
Motivation
Potential applications:
• Understanding human behavior when watching an image
• Training: show the trainee how experienced viewers watch the scene (e.g. medical diagnosis, driving)
• Design: guide audiences to watch the content in an intended sequence
Scanpath estimation
Motivation
Most prior work addresses gaze density estimation and salient region detection; comparatively little work targets scanpath estimation.
• Gaze density estimation and salient region detection: biological methods (Itti's, GBVS), information-based methods (Information, SUN, AIM), contrast-based methods, learning-based methods (CRF, SVM, MTL, MIL ranking, C2OH)
• Saliency ranking: proto-object based ranking (SGC, proto-objects)
• Scanpath estimation: biologically inspired methods (WW)
Scanpath estimation
• Saliency ranking: order regions by saliency to produce a scanpath, either from a saliency map (Itti saliency ranking) or from proto-objects (proto-object based saliency ranking).
  L. Itti et al. A model of saliency-based visual attention for rapid scene analysis. T-PAMI, 1998.
  D. Walther et al. Modelling attention to salient proto-objects. Neural Networks, 2006.
• Biologically inspired method:
  W. Wang et al. Simulating human saccadic scanpaths on natural images. CVPR, 2011.
Overview
Three factors affecting gaze shift:
• Feature saliency: salient regions attract more attention; saliency comes from feature differences over YUV color and Gabor features.
• Semantic content: gaze focuses on meaningful content; modeled with an HMM over BoVW representations.
• Spatial position: gaze tends to shift to nearby positions; modeled as a Levy flight with a Cauchy distribution.
Overview
Write a gaze point as g_t = (y_t, z_t, u_t), where y_t is the feature saliency, z_t the semantic content, and u_t the spatial position:

p(g_{t+1} | g_1, ..., g_t) = p(y_{t+1}, z_{t+1}, u_{t+1} | y_1, z_1, u_1, ..., y_t, z_t, u_t)

Assume gaze shift to be a Markov process:

p(g_{t+1} | g_t) = p(y_{t+1}, z_{t+1}, u_{t+1} | y_t, z_t, u_t)

Further assume independence between the three factors:

p(g_{t+1} | g_t) = p(y_{t+1} | y_t) p(z_{t+1} | z_t) p(u_{t+1} | u_t)
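Under these assumptions, predicting the next gaze point reduces to multiplying the three per-factor probabilities for each candidate region and renormalizing. A minimal NumPy sketch (the function name and all factor values are illustrative placeholders, not the paper's actual maps):

```python
import numpy as np

def gaze_shift_probability(p_feature, p_semantic, p_spatial):
    """p(g_{t+1} | g_t) as the product of the three factor probabilities,
    renormalized over the candidate regions (independence assumption)."""
    p = p_feature * p_semantic * p_spatial
    return p / p.sum()

# Toy example with 4 candidate regions (made-up factor values):
p_y = np.array([0.1, 0.4, 0.3, 0.2])      # feature saliency term
p_z = np.array([0.25, 0.25, 0.25, 0.25])  # semantic term
p_u = np.array([0.4, 0.1, 0.1, 0.4])      # spatial term
p_next = gaze_shift_probability(p_y, p_z, p_u)
```

Here region 3 wins: its spatial term compensates for a modest feature term, which is exactly the kind of trade-off the factored model allows.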
Feature saliency
Weight between two regions r and s: W_{r,s} = || y_s − y_r ||

Transfer probability:

p(y_s | y_r) = W_{r,s} / Σ_{s'=1}^{R} W_{r,s'}

A region with higher feature difference will be gazed at more frequently: feature difference maximizes saliency.
J. Harel et al. Graph-based visual saliency. NIPS, 2006.
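The transfer probability above is just a row-normalized matrix of pairwise feature distances. A small sketch, assuming the per-region feature vectors y_r are given as rows of an array (the function name and toy features are mine):

```python
import numpy as np

def feature_transfer_probs(features):
    """Transfer probability p(y_s | y_r) = W_{r,s} / sum_s' W_{r,s'},
    with W_{r,s} = ||y_s - y_r|| the pairwise feature difference.
    features: (R, D) array of per-region feature vectors (e.g. YUV + Gabor)."""
    W = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    return W / W.sum(axis=1, keepdims=True)  # each row sums to 1

# Three regions with toy 2-D features; region 2 differs most from region 0,
# so p(y_2 | y_0) > p(y_1 | y_0): larger feature difference, higher probability.
P = feature_transfer_probs(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]))
```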
Semantic content with an HMM
• Hidden states z_1, z_2, z_3, z_4, ... represent semantic topics; the prior distribution π_i = p(z_1 = i) and the transition matrix govern how gaze moves between topics.
• Each observation x_t (the gazed region) is described with a BoVW representation over visual words w_1, ..., w_K.
• Emission matrix: β_{i,k} = p(w_k | z_i).
Gaze shift prediction
Forward method. Given the gazed regions g_1, g_2, ..., g_t and their BoVW representations x_1, x_2, ..., x_t:

Observation likelihood: b_i(x) = p(x | z_i) = Π_{k=1}^{K} β_{i,k}^{x^k}

The probability of state i at time t:
α_{t,i} = p(z_t = i, x_1, ..., x_t),  α_{1,i} = b_i(x_1) π_i
α_{t,i} = b_i(x_t) Σ_{j=1}^{M} a_{j,i} α_{t−1,j},  t = 2, ..., T

The probability of x being gazed next:
p(x_{t+1} = x | x_1, ..., x_t) ∝ p(x_1, ..., x_t, x) = Σ_{i=1}^{M} Σ_{j=1}^{M} b_i(x) a_{j,i} α_{t,j}
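The forward recursion and next-region scoring can be sketched in a few lines of NumPy. This is a toy sketch, not the trained model: the state/word counts (M = 2, K = 2) and all parameter values are illustrative, and the function names are mine.

```python
import numpy as np

def bow_likelihood(beta, x):
    """b_i(x) = prod_k beta[i,k] ** x[k] for a BoVW count vector x,
    computed in log space for numerical stability. Requires beta > 0."""
    return np.exp(x @ np.log(beta).T)  # shape (M,)

def forward(pi, A, beta, X):
    """alpha[t,i] = p(z_t = i, x_1..x_t) via the forward recursion.
    pi: (M,) prior; A: (M,M) transitions with A[j,i] = p(z_{t+1}=i | z_t=j);
    beta: (M,K) emissions p(w_k | z_i); X: (T,K) BoVW histograms."""
    T, M = len(X), len(pi)
    alpha = np.zeros((T, M))
    alpha[0] = bow_likelihood(beta, X[0]) * pi
    for t in range(1, T):
        alpha[t] = bow_likelihood(beta, X[t]) * (alpha[t - 1] @ A)
    return alpha

def predict_next(alpha_t, A, beta, candidates):
    """Score each candidate histogram x by
    p(x_1..x_t, x) = sum_i b_i(x) sum_j A[j,i] alpha[t,j]."""
    state_pred = alpha_t @ A  # sum_j alpha[t,j] A[j,i], shape (M,)
    return np.array([bow_likelihood(beta, x) @ state_pred for x in candidates])

# Toy model: 2 states, 2 visual words, one observed histogram.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
beta = np.array([[0.8, 0.2], [0.3, 0.7]])
X = np.array([[1.0, 0.0]])
alpha = forward(pi, A, beta, X)
scores = predict_next(alpha[-1], A, beta, np.array([[1.0, 0.0], [0.0, 1.0]]))
```

A useful sanity check: when the candidates enumerate all possible next observations, the scores sum to p(x_1, ..., x_t) = Σ_i α_{t,i}.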
Model learning (Baum-Welch)
Forward:
α_{1,i} = b_i(x_1) π_i
α_{t,i} = b_i(x_t) Σ_{j=1}^{M} a_{j,i} α_{t−1,j}

Backward (β̂ denotes the backward variable, distinct from the emission matrix β):
β̂_{T,i} = 1
β̂_{t,i} = Σ_{j=1}^{M} a_{i,j} b_j(x_{t+1}) β̂_{t+1,j}

Posteriors:
ξ_{t,i,j} ∝ α_{t,i} a_{i,j} b_j(x_{t+1}) β̂_{t+1,j}
γ_{t,i} = Σ_{j=1}^{M} ξ_{t,i,j},  γ_{T,i} ∝ α_{T,i}

Re-estimation over N training sequences (sequence n has length T_n):
a_{i,j} = Σ_{n=1}^{N} Σ_{t=1}^{T_n−1} ξ^n_{t,i,j} / Σ_{n=1}^{N} Σ_{t=1}^{T_n−1} Σ_{j=1}^{M} ξ^n_{t,i,j}
β_{i,k} = Σ_{n=1}^{N} Σ_{t=1}^{T_n} γ^n_{t,i} x^n_{t,k} / Σ_{k'=1}^{K} Σ_{n=1}^{N} Σ_{t=1}^{T_n} γ^n_{t,i} x^n_{t,k'}
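One EM re-estimation pass can be written compactly with the forward/backward variables above. This is a simplified sketch under my own naming (no scaling, so it is only numerically safe for short sequences); it is not the authors' implementation.

```python
import numpy as np

def baum_welch_step(pi, A, beta, sequences):
    """One EM re-estimation pass of the HMM parameters from BoVW sequences.
    pi: (M,) prior; A: (M,M) transitions A[i,j] = p(z_{t+1}=j | z_t=i);
    beta: (M,K) emissions p(w_k | z_i); sequences: list of (T,K) count arrays."""
    M, K = beta.shape
    xi_sum = np.zeros((M, M))  # expected transition counts
    gw = np.zeros((M, K))      # expected word counts per state
    pi_acc = np.zeros(M)       # expected initial-state counts
    for X in sequences:
        T = len(X)
        B = np.exp(X @ np.log(beta).T)            # B[t,i] = b_i(x_t)
        alpha = np.zeros((T, M))
        alpha[0] = B[0] * pi
        for t in range(1, T):                     # forward pass
            alpha[t] = B[t] * (alpha[t - 1] @ A)
        bw = np.ones((T, M))                      # backward variable
        for t in range(T - 2, -1, -1):
            bw[t] = A @ (B[t + 1] * bw[t + 1])
        px = alpha[-1].sum()                      # p(x_1, ..., x_T)
        gamma = alpha * bw / px                   # gamma[t,i] = p(z_t=i | X)
        pi_acc += gamma[0]
        gw += gamma.T @ X
        for t in range(T - 1):                    # accumulate xi[t,i,j]
            xi_sum += np.outer(alpha[t], B[t + 1] * bw[t + 1]) * A / px
    return (pi_acc / pi_acc.sum(),
            xi_sum / xi_sum.sum(axis=1, keepdims=True),
            gw / gw.sum(axis=1, keepdims=True))

# Toy run: 2 states, 2 words, two short training sequences (made-up counts).
pi0 = np.array([0.5, 0.5])
A0 = np.array([[0.9, 0.1], [0.2, 0.8]])
b0 = np.array([[0.8, 0.2], [0.3, 0.7]])
seqs = [np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]),
        np.array([[0.0, 1.0], [0.0, 1.0]])]
pi1, A1, b1 = baum_welch_step(pi0, A0, b0, seqs)
```

The re-estimated π, A, and β stay properly normalized by construction, which is the minimal correctness check on the update formulas.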
Brief discussion of the HMM
The trained parameters have real meaning:
• The hidden states represent semantic topics.
• The prior distribution represents where human gaze starts.
• The transition matrix represents how gaze moves between topics.
Spatial position
Gaze tends to shift to nearby positions; the step length follows a Levy flight, modeled with a Cauchy distribution:

p(u_{t+1} = u | u_t) = σ / (2π (||u − u_t||² + σ²)^{3/2})

which gives the step-length density, with d = ||u_{t+1} − u_t||:

p(d) = σ d / (d² + σ²)^{3/2}

D. Brockmann et al. Are human scanpaths Levy flights? Artificial Neural Networks, 1999.
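The step-length density above is heavy-tailed (its median is √3·σ and it has no finite mean), which is what produces occasional long saccades among mostly short ones. A small sketch under that assumed 2D-Cauchy form (function names are mine), including inverse-CDF sampling from F(d) = 1 − σ/√(d² + σ²):

```python
import numpy as np

def step_density(d, sigma):
    """Step-length density p(d) = sigma * d / (d^2 + sigma^2)^(3/2):
    heavy-tailed, so occasional long gaze 'flights' are likely."""
    return sigma * d / (d ** 2 + sigma ** 2) ** 1.5

def sample_steps(n, sigma, rng):
    """Draw step lengths by inverting the CDF F(d) = 1 - sigma/sqrt(d^2+sigma^2)."""
    u = rng.random(n)
    return sigma * np.sqrt(1.0 / (1.0 - u) ** 2 - 1.0)
```

The density integrates to 1, so no extra normalization constant is needed when combining it with the other two factors.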
Test data
• NUSEF: 476 of the 758 images; on average 25 users per image. Two subsets are taken from it: NUSEF-portrait and NUSEF-face.
• JUDD: 1003 images; 15 users per image.
Measurement
Similarity between sequences: the Smith-Waterman local alignment algorithm.

Initialization: h_{i,0} = 0 for 0 ≤ i ≤ L1; h_{0,j} = 0 for 0 ≤ j ≤ L2.

Recurrence:
h_{i,j} = max{ 0,
  h_{i−1,j−1} + w(S1_i, S2_j),   (match/mismatch)
  h_{i−1,j} + w(S1_i, −),        (deletion)
  h_{i,j−1} + w(−, S2_j) }       (insertion)

where w(S1_i, S2_j) = w_match if S1_i = S2_j, and w_mismatch otherwise.

Example: S1 = ACACACTA, S2 = AGCACACA, with local alignment
S1' = A-CACACTA
S2' = AGCACAC-A
Worked example with w_match = 2 and w_mismatch = w_insertion = w_deletion = −1.

Step (1,1): comparing S1_1 = A with S2_1 = A (match):
h_{1,1} = max{0, h_{0,0} + 2, h_{0,1} − 1, h_{1,0} − 1} = max{0, 2, −1, −1} = 2, recorded as ↘ in the traceback matrix O.

Step (2,1): comparing S1_2 = C with S2_1 = A (mismatch):
h_{2,1} = max{0, h_{1,0} − 1, h_{1,1} − 1, h_{2,0} − 1} = max{0, −1, 1, −1} = 1, recorded as ↓.

Filling every cell this way produces the full score matrix H and traceback matrix O.
O =      A  G  C  A  C  A  C  A
      0  0  0  0  0  0  0  0  0
   A  0  ↘  ↓  ↓  ↘  ↓  ↘  ↓  ↘
   C  0  →  ↘  ↘  ↓  ↘  ↓  ↘  ↓
   A  0  ↘  ↓  →  ↘  ↓  ↘  ↓  ↘
   C  0  →  ↘  ↘  →  ↘  ↓  ↘  ↓
   A  0  ↘  ↓  →  ↘  →  ↘  ↓  ↘
   C  0  →  ↘  ↘  →  ↘  →  ↘  ↓
   T  0  →  ↘  →  →  →  →  →  ↘
   A  0  ↘  ↓  →  ↘  →  ↘  →  ↘

H =      A  G  C  A  C  A  C  A
      0  0  0  0  0  0  0  0  0
   A  0  2  1  0  2  1  2  1  2
   C  0  1  1  3  2  4  3  4  3
   A  0  2  1  2  5  4  6  5  6
   C  0  1  1  3  4  7  6  8  7
   A  0  2  1  2  5  6  9  8 10
   C  0  1  1  3  4  7  8 11 10
   T  0  0  0  2  3  6  7 10 10
   A  0  2  1  1  4  5  8  9 12

Similarity = max(H) = 12, with alignment
S1' = A-CACACTA
S2' = AGCACAC-A
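The recurrence above fits in a few lines of Python; this sketch (function name mine) returns only the similarity score, omitting the traceback, and reproduces the worked example's score of 12.

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score: fill h with the recurrence
    h[i][j] = max(0, diag + w(s1_i, s2_j), up + gap, left + gap)
    and return the maximum cell, used here as the scanpath similarity."""
    h = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    best = 0
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            w = match if s1[i - 1] == s2[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + w,
                          h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))  # → 12
```

The max with 0 is what makes the alignment local: a badly matching prefix never drags the score below zero.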
Parameter test: codebook size
A large codebook over-fits the dataset.

Parameter test: number of states
Performance is stable with respect to the number of states.

Parameter test: training samples
Performance increases fast at first and then slows down; it is stable with respect to the number of training samples.
Effectiveness of gaze factors
Semantic content modeled with the HMM is effective and outperforms feature saliency and spatial position alone. The full combination of the three factors performs significantly better than any individual one.

Comparison with other methods
Our method significantly outperforms Itti and the proto-object based method (t-test, p = 0.05). It significantly outperforms WW on NUSEF-face, NUSEF-portrait, and NUSEF, and is comparable with it on JUDD.
Examples
Conclusion
A scanpath estimation method is proposed. It considers three factors: feature saliency, spatial position, and semantic content. Semantic content is represented through an HMM; spatial position is modeled with a Cauchy distribution.
Experiments have verified the effectiveness of the method: it outperforms the existing methods.