LEARNING STYLE AND STRUCTURE OF HUMAN BEHAVIOR

YI WANG (1), ZHI-QIANG LIU (2), and LI-ZHU ZHOU (1)

(1) Department of Computer Science and Technology, Tsinghua University, Graduate School at Shenzhen, China.
(2) School of Creative Media, City University of Hong Kong, Kowloon, Hong Kong.
E-mail: [email protected], [email protected], [email protected]

Abstract: A new model, VLMM/S-DTG, is presented to learn the style and structure of human motion and to support automatic synthesis of long-term motion with complex dynamics and arbitrarily specified style. An EM algorithm is proposed to estimate the VLMM/S-DTG by learning and representing prototypical poses as stylized decomposable triangulated graphs (S-DTG) and modeling the high-order statistical dependencies among the prototypical poses as a variable-length Markov model (VLMM). The VLMM/S-DTG features three major advantages: (1) computational robustness to under-sampling or biased sampling, which is common for high-dimensional motion capture data; (2) the ability to extract motion style from motion structure, instead of treating style as noise; (3) accurate prediction performance, which supports the generation of realistic new motion that is even longer than the training motion.

Keywords: stylized decomposable triangulated graph, hidden Markov model, variable-length Markov model, linearly parametric Gaussian density

1. Introduction

Learning human motion is the basis for motion detection, recognition, and identification in computer vision, and the basis for automatic motion synthesis in computer graphics. Several challenges make the learning difficult: (1) The high dimensionality of the pose space. Parameterizing the 3D rotations of only the 20 major joints of a human body results in a pose space of over 60 dimensions. Limited by the frame rate of motion capture, the high ratio of dimensions to frames gives rise to the danger of under-sampling and biased sampling and requires models and algorithms that are more computationally robust. (2) The rich semantics and systematic variations. The complex variations resulting from the rich semantics of human motion require separating the modeling of motion structure from that of motion style ([12]), but traditional modeling methods treat style as noise and lose the semantic information. An example is shown in the experiment section of this work. (3) The complex dynamics of human motion. The dynamics of many full-body human motions are too complex to be fully captured by low-order models with short memory, such as the frequently used hidden Markov model ([13], [9], [11], [2], [12], [1]).

To address these difficulties, the VLMM/S-DTG model is presented to learn long-term human motion as a sequence of prototypical poses. The statistical dependencies among the prototypical poses are captured by a variable-length Markov model, which, compared with first-order Markov models, is able to learn the optimal length of memory for more accurate modeling. Each prototypical pose is represented by a stylized decomposable triangulated graph (S-DTG), which features the two advantages of stylization and decomposition. Stylization means that the S-DTG is an approximation to a linear Gaussian density whose mean is parameterized by an external style variable; the learning algorithm automatically extracts and quantizes the style of the training motion as the style value. Decomposition means that the S-DTG holds the a priori assumption, which has been verified as practical for human motion modeling by [10], that every dimension of the pose space is statistically dependent on at most two other dimensions. The fully connected statistical dependencies among all dimensions implied by the covariance matrix over the pose space are thus decomposed into a graph named a decomposable triangulated graph ([10]), which substitutes several 3-dimensional covariance matrices for the high-dimensional covariance matrix. In this work, the learning algorithm exploits this property to avoid the singular covariance matrices that result from under-sampling or biased sampling. The encouraging experiments show that the VLMM/S-DTG model is suitable for automatic synthesis of motion with complex dynamics, with a longer duration than the training motion, and with new styles as demanded by users.

2. Stylized Decomposable Triangulated Graph

DTG and Stylized-DTG. A DTG models a multivariate probability distribution over N dimensions under the assumption that each dimension is statistically dependent on at most two other dimensions. The high dimensionality is decomposed into a set of triangular cliques, where, within each clique c, the three dimensions, denoted A^(c), B^(c), and C^(c), are described by a conditional p.d.f. P(A^(c) | B^(c), C^(c)). The dependencies among cliques are organized so that the A^(c1) of a clique c1 must be the B^(c2) or C^(c2) of another clique c2. The DTG therefore forms a general directed acyclic graph (DAG), and the likelihood can be written as

    p(x | G) = P(R_1, R_2) ∏_{c∈C} P(A^(c) | B^(c), C^(c)),                              (1)

where R_1 and R_2 are the two roots of the DAG. To ease computation, the distributions over the triangles are usually assumed Gaussian, so the joint probability represented by the DTG is also Gaussian. The S-DTG is a linear parameterization of a Gaussian DTG, where the mean is a linear transformation of the style vector θ by the style transformation matrix W,

    p(x | G) = N(x; μ̂ = Wθ + μ, Σ),                                                     (2)

where G = {W, μ, Σ} are the parameters to be estimated, and x and θ are given in the training samples. To ease the following derivation, we define

    Z = [W  μ],   Ω_k = [θ_k ; 1],                                                       (3)

so that μ̂(θ_k) = Wθ_k + μ = ZΩ_k.
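As a concrete illustration of Equations 2-3, the following Python sketch evaluates the stylized Gaussian density for a single pose; in the actual S-DTG this density factorizes over the triangular cliques of Equation 1 rather than using one dense covariance. Names and shapes are illustrative assumptions, not the authors' code.

# A minimal sketch of the stylized Gaussian of Equations 2-3: the mean of the
# pose distribution is a linear function of the style vector theta, written
# compactly as mu_hat = Z @ Omega.
import numpy as np
from scipy.stats import multivariate_normal

def make_omega(theta):
    """Augment the style vector with a constant 1, i.e. Omega_k = [theta_k; 1]."""
    return np.append(theta, 1.0)

def stylized_log_density(x, W, mu, Sigma, theta):
    """log N(x; W theta + mu, Sigma) for one pose x with style theta (Equation 2)."""
    Z = np.hstack([W, mu[:, None]])          # Z = [W  mu], Equation 3
    mu_hat = Z @ make_omega(theta)           # mu_hat = W theta + mu
    return multivariate_normal.logpdf(x, mean=mu_hat, cov=Sigma)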

Learning the Style Transformation. Given K sets of training samples, where within each set the T_k samples share the same style value θ_k, an S-DTG can be learned by maximizing the log likelihood log ∏_k N(X; ZΩ_k, Σ). By solving ∂/∂Z log ∏_k N(X; ZΩ_k, Σ) = 0, as shown in Fig. 1, Z can be computed analytically as

    Z = [ ∑_{k,t} x_{kt} Ω_k^T ] [ ∑_{k,t} Ω_k Ω_k^T ]^{-1}.                             (4)

If under-sampling or biased-sampling is not considered, the covariance matrix Σ can be estimated in the usual way once the mean is determined,

    Σ = 1/(N−1) ∑_{k,t} ( x_{kt} − μ̂(θ_k) ) ( x_{kt} − μ̂(θ_k) )^T.                      (5)

Figure 1: Taking the derivative of the equation ∂/∂Z log ∏_k N(X; ZΩ_k, Σ) = 0:

    ∂/∂Z log ∏_k N(X; ZΩ_k, Σ)
      = ∑_k ∑_t ∂/∂Z [ log( 1/√(2π|Σ|) ) − ½ ( x_{kt} − μ̂(θ_k) )^T Σ^{-1} ( x_{kt} − μ̂(θ_k) ) ]
      = −½ ∑_k ∑_t ∂/∂Z ( x_{kt} − μ̂(θ_k) )^T Σ^{-1} ( x_{kt} − μ̂(θ_k) )
      = −½ ∑_k ∑_t [ −2 ∂/∂Z ( Ω_k^T Z^T Σ^{-1} x_{kt} ) + ∂/∂Z ( Ω_k^T Z^T Σ^{-1} )( ZΩ_k ) ]
      = Σ^{-1} ∑_k ∑_t ( x_{kt} Ω_k^T − ZΩ_k Ω_k^T )
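Equation 4 is an ordinary least-squares fit of the augmented style vectors to the poses, and Equation 5 the sample covariance around the stylized means; a minimal sketch under assumed array shapes (x_sets[k] is a T_k × D array of poses and thetas[k] is the style vector of set k):

# Sketch of Equations 4-5: closed-form estimate of Z = [W mu] and Sigma from
# K style-labelled sets of poses.
import numpy as np

def estimate_Z_and_Sigma(x_sets, thetas):
    D  = x_sets[0].shape[1]
    S1 = len(thetas[0]) + 1
    A = np.zeros((D, S1))                       # accumulates sum_{k,t} x_kt Omega_k^T
    B = np.zeros((S1, S1))                      # accumulates sum_{k,t} Omega_k Omega_k^T
    for xk, theta in zip(x_sets, thetas):
        om = np.append(theta, 1.0)              # Omega_k = [theta_k; 1]
        A += np.outer(xk.sum(axis=0), om)       # (sum_t x_kt) Omega_k^T
        B += len(xk) * np.outer(om, om)
    Z = A @ np.linalg.inv(B)                    # Equation 4

    # Equation 5: sample covariance of the residuals around the stylized means
    R = np.vstack([xk - Z @ np.append(theta, 1.0) for xk, theta in zip(x_sets, thetas)])
    return Z, (R.T @ R) / (len(R) - 1)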

Learning the Triangulated Topology. To avoid under-sampling or biased-sampling producing a singular Σ, the high-dimensional covariance matrix is decomposed into a DTG model. The log likelihood of a DTG model is approximated by the sum of conditional entropies of the triangular cliques as

    log P(X | G) = ∑_{n=1}^{N} ( ∑_{t=1}^{T−1} log p(A_t^n | B_t^n, C_t^n) + log p(A_T^n, B_T^n, C_T^n) )
                 ≈ −N ∑_{t=1}^{T−1} h(A_t | B_t, C_t) − N · h(A_T, B_T, C_T),            (6)

where T is the number of triangles in the DTG. For d-dimensional data, T = 2d − 2. In [10], a greedy algorithm is proposed to learn the topology by gradually constructing triangles ⟨A_t, B_t, C_t⟩ to maximize −h(A_t | B_t, C_t) = h(B_t, C_t) − h(A_t, B_t, C_t). A constraint can easily be added to avoid combining triples of dimensions with a singular covariance matrix, i.e., |Σ_{A_t,B_t,C_t}| = 0, which can be checked by applying Equation 5 to every triple of dimensions. For N-dimensional samples, the covariance matrix of a Gaussian joint distribution is singular as long as any one of the ∑_{i=2}^{N} C_N^i combinations of dimensions has a singular covariance matrix. The DTG significantly reduces this danger because a valid DTG can be constructed as long as N − 2 of all the C_N^3 triples are non-singular.
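The greedy search of [10] with the singularity constraint can be sketched as follows; this is a simplified re-implementation of the idea (attach each new dimension to an existing edge by maximizing −h(A_t | B_t, C_t) under the Gaussian entropy), not the authors' code.

# Simplified greedy construction of a triangulated topology in the spirit of [10].
# Each new dimension A is attached to an existing edge (B, C), choosing the
# attachment that maximizes -h(A | B, C) and skipping singular 3x3 covariances.
import numpy as np
from itertools import combinations

def gaussian_entropy(cov):
    sign, logdet = np.linalg.slogdet(2.0 * np.pi * np.e * cov)
    return 0.5 * logdet if sign > 0 else -np.inf   # -inf flags a singular block

def neg_cond_entropy(Sigma, a, b, c):
    # -h(A | B, C) = h(B, C) - h(A, B, C), both under the Gaussian assumption
    h_bc  = gaussian_entropy(Sigma[np.ix_([b, c], [b, c])])
    h_abc = gaussian_entropy(Sigma[np.ix_([a, b, c], [a, b, c])])
    if not np.isfinite(h_abc):                     # singular triple: forbidden
        return -np.inf
    return h_bc - h_abc

def greedy_dtg(Sigma):
    d = Sigma.shape[0]
    # seed with the best non-singular triple
    best = max(((a, b, c) for a, b, c in combinations(range(d), 3)),
               key=lambda t: neg_cond_entropy(Sigma, *t))
    triangles, used = [best], set(best)
    edges = {tuple(sorted(e)) for e in combinations(best, 2)}
    while len(used) < d:
        cands = [(neg_cond_entropy(Sigma, a, b, c), a, (b, c))
                 for a in range(d) if a not in used for (b, c) in edges]
        score, a, (b, c) = max(cands)
        triangles.append((a, b, c))
        used.add(a)
        edges |= {tuple(sorted((a, b))), tuple(sorted((a, c)))}
    return triangles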

Testing Stylized-DTG. For a sample x coupled with a style value θ, the likelihood p(x | G) can be computed directly as in Equation 1. For samples with θ absent, the unknown style can be estimated by maximizing the log likelihood with respect to θ, i.e., solving ∂/∂θ log ∏_k N(X; ZΩ_k, Σ) = 0 in a similar way as in Fig. 1,

    θ = [ W^T Σ^{-1} W ]^{-1} [ W^T Σ^{-1} (x − μ) ].                                    (7)

It is notable that it is meaningless, and mathematically forbidden, to choose a style vector with a dimension larger than that of the feature space. If the two dimensions are equal, the likelihood function degenerates to having the same value over the whole feature space, because the columns of W span a style space identical to the feature space.
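Equation 7 is a generalized least-squares estimate of the style of an unlabelled sample; a minimal sketch, assuming for simplicity a full non-singular Σ (in the S-DTG the same computation would be applied clique by clique):

# Sketch of Equation 7: recover the style vector of a pose whose style label is
# unknown, given the learned W, mu, Sigma.
import numpy as np

def estimate_style(x, W, mu, Sigma):
    Sigma_inv = np.linalg.inv(Sigma)
    lhs = W.T @ Sigma_inv @ W                 # (S x S)
    rhs = W.T @ Sigma_inv @ (x - mu)          # (S,)
    return np.linalg.solve(lhs, rhs)          # theta = (W^T Sigma^-1 W)^-1 W^T Sigma^-1 (x - mu)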

3. Variable Length Markov Model of S-DTGs

Learning Prototypical Poses. Because a VLMM can only model stochastic processes over a one-dimensional discrete feature space, namely an alphabet, an EM algorithm is presented here to cluster frames into prototypical poses represented by S-DTGs. Therefore, the VLMM can be learned from the sequence of cluster indices. Given K training motion sequences F = {f_{k,t}}_{k,t} with the same temporal structure but different styles θ_k, and the number N of clusters to be learned, the clustering can be considered as learning N S-DTGs G_i by maximizing the log likelihood. To ease the estimation, a pose label l_{k,t} is introduced as hidden data for each frame f_{k,t} to designate from which cluster the frame is generated. So the log of the complete-data likelihood maximized by the EM algorithm can be written as

    Q(G | Ĝ) = E_{L|F,Ĝ} [ log p(F, L | G) ]
             = E_{L|F,Ĝ} [ ∑_{k,t} log p( f_{k,t} | G_{l_{k,t}} ) ]
             = ∑_{k,t,i} p( l_{k,t} = i | F, Ĝ ) log p( f_{k,t} | G_i ),                  (8)

where Ĝ = {Ĝ_i}_{i=1}^N is the current value of the model parameters, G = {G_i}_{i=1}^N is to be updated for maximization, 0 ≤ i ≤ N, and L = {l_{k,t}}_{k,t} are the pose labels.

As required by the convergence proof of the EM algorithm, the hidden variable l_{k,t} must be derived from the current model parameters by computing p(l_{k,t} = i | F, Ĝ) in the E-step. For the clustering problem, a frame is associated with only a single cluster, so we assume a Dirichlet distribution on the hidden variables,

    p( l_{k,t} = i | F, Ĝ ) = 1, if i = argmax_j p( f_{k,t} | G_j );  0, otherwise,        (9)

which implies the rule to update l_{k,t} as

    l_{k,t} ← argmax_i p( l_{k,t} = i | F, Ĝ ) = argmax_i p( f_{k,t} | G_i ).              (10)

Given the updated L, G is adjusted in the M-step to maximize Q(·) by solving the equation

    ∂Q/∂G = ∑_k ∑_t ∑_i p( l_{k,t} = i | F, Ĝ ) · ( ∂ p( f_{k,t} | G_i ) / ∂G ) / p( f_{k,t} | G_i ) = 0    (11)

by considering p(l_{k,t} = i | F, Ĝ) as an extra factor. So we get

    Z_i = [ ∑_{k,t} I{l_{k,t}=i} x_{kt} Ω_k^T ] [ ∑_{k,t} I{l_{k,t}=i} Ω_k Ω_k^T ]^{-1},
    Σ_i = ∑_{k,t} I{l_{k,t}=i} ( x_{kt} − μ̂(θ_k) ) ( x_{kt} − μ̂(θ_k) )^T / ∑_{k,t} I{l_{k,t}=i}.            (12)

Updating L and G as in Equation 10 and Equation 12 alternately, the EM algorithm is proven to converge to an optimal estimate.
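The alternating updates of Equations 9, 10, and 12 can be sketched as below; for brevity a single full Gaussian per cluster stands in for the clique-factorized S-DTG density (the very quantity the DTG decomposition is designed to keep non-singular), and all names are illustrative:

# Sketch of the EM clustering of frames into N prototypical poses.
# Cluster i is the pair (Z_i, Sigma_i) with mean Z_i @ [theta; 1] (Equations 2-3).
import numpy as np
from scipy.stats import multivariate_normal

def omega(theta):
    return np.append(theta, 1.0)

def log_density(x, model, theta):
    # Full-Gaussian stand-in for the S-DTG likelihood of Equation 2
    Z, Sigma = model
    return multivariate_normal.logpdf(x, mean=Z @ omega(theta), cov=Sigma, allow_singular=True)

def refit(xs, oms):
    # Closed-form M-step of Equation 12 for one cluster.
    # B is invertible only if the cluster mixes frames of different styles,
    # which is why the initialization draws frames from different sequences.
    A = sum(np.outer(x, om) for x, om in zip(xs, oms))
    B = sum(np.outer(om, om) for om in oms)
    Z = A @ np.linalg.inv(B)
    R = np.array([x - Z @ om for x, om in zip(xs, oms)])
    return Z, (R.T @ R) / len(R)

def em_cluster(frames, thetas, models, n_iters=20):
    """frames[k]: (T_k, D) array; thetas[k]: style of sequence k; models: list of (Z, Sigma)."""
    for _ in range(n_iters):
        # E-step (Equations 9-10): hard-assign every frame to its most likely cluster.
        labels = [[int(np.argmax([log_density(f, m, thetas[k]) for m in models]))
                   for f in seq] for k, seq in enumerate(frames)]
        # M-step (Equation 12): refit each cluster from the frames assigned to it.
        for i in range(len(models)):
            xs  = [f for k, seq in enumerate(frames) for t, f in enumerate(seq) if labels[k][t] == i]
            oms = [omega(thetas[k]) for k, seq in enumerate(frames) for t in range(len(seq)) if labels[k][t] == i]
            if xs:
                models[i] = refit(xs, oms)
    return models, labels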

It is important to initialize Ĝ so that each set of frames used to train a prototypical pose contains frames from different training sequences. Because the training sequences are selected with similar temporal structure, the following greedy strategy can be used to initialize {G_i}_{i=1}^N: (1) use K-means to find N prototypical poses from only the first sequence {f_{1,t}}_{t=1}^{T_1}; (2) the sequence of pose labels {l_{1,t}}_{t=1}^{T_1} subdivides {f_{1,t}}_{t=1}^{T_1} into a set of short segments, within each of which all frames belong to the same prototypical pose; (3) for each of the remaining motion sequences, 2 ≤ k ≤ K, linearly scale the lengths of these segments by T_k/T_1 to form {{l_{k,t}}_{t=1}^{T_k}}_{k=2}^{K}; (4) estimate {G_i}_{i=1}^N given the initial sequences of pose labels {{l_{k,t}}_{t=1}^{T_k}}_{k=1}^{K}. Then, during the subsequent EM iterations, the boundaries of the duration of each prototypical pose within the motion sequences are adjusted iteratively until a maximum likelihood is reached.

Learning and Reinterpreting VLMM. Because all training sequences have similar temporal structure, the pose label sequence of any of them can be used to estimate the VLMM that captures the temporal structure. A classic algorithm to learn a VLMM was proposed in [8], which represents all the memories of variable length as terminals of a context tree. So the VLMM has a graphical representation similar to the first-order Markov model, but with each state a sequence of prototypical poses corresponding to a terminal of the context tree. Because the memories learned by the VLMM have the minimum length that is still long enough for accurate prediction, from a statistical view the states of a VLMM can be interpreted as atomic behaviors, i.e., the segments of motion sequences that are shortest but long enough to carry basic dynamics; any further breaking of these states makes them too short to predict the next one accurately. The ability to discover this optimal granularity is the main advantage of modeling human motion with a VLMM rather than a first-order Markov model: the optimal length of memory increases the accuracy of prediction, and therefore the accuracy of classification, recognition, and detection, as demonstrated in [3]. A simplified sketch of variable-length context statistics is given below.
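The sketch below is a deliberately simplified illustration of variable-length memory, not Rissanen's context-tree algorithm from [8]: it stores next-symbol counts for every context up to a fixed maximum depth and, at prediction time, backs off to the longest context actually observed.

# Simplified illustration of variable-length memory over the pose-label alphabet.
# A plain suffix/count table with longest-match back-off stands in for the
# context-tree construction of [8]; max_depth bounds the memory length.
from collections import defaultdict, Counter
import random

def train_contexts(labels, max_depth=4):
    counts = defaultdict(Counter)                 # context tuple -> next-symbol counts
    for t in range(len(labels)):
        for d in range(max_depth + 1):
            if t - d >= 0:
                counts[tuple(labels[t - d:t])][labels[t]] += 1
    return counts

def sample_next(counts, history, max_depth=4, rng=random):
    for d in range(min(max_depth, len(history)), -1, -1):   # back off to shorter contexts
        ctx = tuple(history[len(history) - d:])
        if ctx in counts:
            symbols, freqs = zip(*counts[ctx].items())
            return rng.choices(symbols, weights=freqs, k=1)[0]
    raise ValueError("empty model")

# Usage: simulate a path of prototypical pose indices for the synthesis algorithm.
# counts = train_contexts(pose_labels); path = [pose_labels[0]]
# for _ in range(200): path.append(sample_next(counts, path))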

Generating New Motion Sequences. In order to support automatic motion synthesis, a VLMM/S-DTG model is estimated from 3D motion capture data. For each short period (1/66 sec), a frame f_t is captured to record the global position and joint rotations r_t and their derivative r'_t. The synthesis algorithm presented as follows generates new motion from a learned VLMM/S-DTG model given a specified style value θ_0 (a sketch is given after this list):

1. Generate a path of prototypical pose indices {l_i}_{i=1}^{L} by simulating the VLMM.

2. Recode the length of each run of successive l_i with the same value as the duration d_i of l_i. For example, {l_t}_{t=1}^{T} = [1, 1, 2, 2, 2, 2, 3] is recoded as {l_i}_i = [1, 2, 3] and {d_i}_i = [2, 4, 1].

3. Take the rotation parts of the mean vectors of the prototypical poses indexed by {l_i}_i as control points in the pose space, denoted {r(G_{l_i})}_i. The computation of the mean vectors relies on θ_0.

4. Between any two adjacent control points r(G_{l_{i−1}}) and r(G_{l_i}), two additional control points are generated as r(G_{l_{i−1}}) + (1/6)(d_{i−1} + d_i) r'(G_{l_{i−1}}) and r(G_{l_i}) − (1/6)(d_{i−1} + d_i) r'(G_{l_i}).

5. The motion trajectory, a B-spline curve T(t) in pose space, is constructed from this set of control points. New poses are generated by interpolating T(t). To ensure that the synthesized motion has the same frame rate as the training motion, the interpolation along T(t) is piecewise linear, so that d_i frames are generated near r(G_{l_i}).

Compared with sampling S-DTGs directly along the sequence of prototypical pose indices {l_i}_{i=1}^{L}, which might cause successive frames generated from the same prototypical pose to have wrong joint rotation directions, the above algorithm takes both r_t and r'_t into consideration and ensures that the local derivative ∂T(t)/∂t of the motion trajectory is consistent with the derivatives r' at the control points, so the rotation and movement of frames interpolated from T(t) are guaranteed to be consistent with those of the training motion.
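A sketch of the bookkeeping in steps 1-5; the clamped cubic B-spline construction and the uniform sampling are simplifying assumptions (the paper samples piecewise-linearly so that d_i frames fall near r(G_{l_i})), and r, dr below stand for arrays of the stylized mean poses and their derivatives computed from the S-DTGs with θ_0.

# Sketch of the synthesis steps: run-length recoding of the simulated pose-index
# path, construction of control points, and sampling along a B-spline.
import numpy as np
from scipy.interpolate import BSpline

def run_length(path):
    """Step 2: [1, 1, 2, 2, 2, 2, 3] -> labels [1, 2, 3], durations [2, 4, 1]."""
    labels, durations = [], []
    for l in path:
        if labels and labels[-1] == l:
            durations[-1] += 1
        else:
            labels.append(l)
            durations.append(1)
    return labels, durations

def control_points(r, dr, durations):
    """Steps 3-4: r[i], dr[i] are the stylized mean pose (rotation part) and its
    derivative for the i-th recoded label; both are (len(labels), D) arrays."""
    pts = [r[0]]
    for i in range(1, len(r)):
        s = (durations[i - 1] + durations[i]) / 6.0
        pts += [r[i - 1] + s * dr[i - 1], r[i] - s * dr[i], r[i]]
    return np.array(pts)

def synthesize(pts, n_frames, degree=3):
    """Step 5 (simplified): clamped cubic B-spline through the control points,
    sampled uniformly instead of piecewise-linearly by duration."""
    n = len(pts)
    knots = np.concatenate((np.zeros(degree), np.linspace(0.0, 1.0, n - degree + 1), np.ones(degree)))
    return BSpline(knots, pts, degree)(np.linspace(0.0, 1.0, n_frames))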

4. Experiments

Experiment 1. Motion Blending. A possible application of VLMM/S-DTG is to generate new motion sequences with blended style. For this experiment, we captured two sequences of human motion as training data, as shown in Fig. 2(a). The first sequence is a regular walk performed by an actor (96 frames). The second is a cat walk performed by an actress with her right arm raised over her head (87 frames). Compared with the dimension of the pose space (120D), these two sequences are so short that 48 combinations of dimensions have singular covariance matrices and cannot be represented by parametric distributions such as the Gaussian. Both sequences contain four paces and have a similar walking structure. Because the major difference in style is the height of the raised right arm, it can be encoded by a scalar style value. We assign 0 to this value for the first sample and 1 for the second. After learning, three new motion sequences are generated with given style values of 0.2, 0.5, and 0.75, as shown in Fig. 2(b), with the right arm raised higher and higher but within the range of the two training samples.

Figure 2: Learning VLMM/S-DTG from two motion sequences with different styles (a); new sequences can be generated by interpolating the learned style vector (b).

Experiment 2. Synthesis of New Motion Longer Than the Training Motion. As pointed out in Section 3, an advantage of the VLMM over first-order Markov machines is that it can discover atomic behaviors and provide more accurate prediction. Utilizing these advantages, new realistic motion sequences that are longer than the training motion can be generated as well. For the model trained in the previous experiment, with the topology shown in Fig. 3(b), a new motion sequence over 10 times longer (1447 frames) is generated and is partly shown in Fig. 3(a). The path of atomic behaviors required by the synthesis algorithm is synthesized with the back-off algorithm of the VLMM. In this experiment, we make a minor modification to the synthesis algorithm: only the global velocity is used in the training data, and the global position information is neglected. This is because the training motion is too simple to contain enough cyclic patterns when the global position of every frame is considered. Because of the lack of global position, only joint rotations can be interpolated from the motion trajectory curve. The global positions of the frames in the synthesized motion sequences are calculated by accumulating the global velocities of the previous frames.

Figure 3: Learning VLMM/S-DTG from two motion sequences with different styles to generate new motion longer than the training motion. (a) A new motion with 1447 frames (only part is shown) generated from the VLMM/S-DTG learned from samples w0 and w1 (both are fewer than 100 frames). (b) The topology of the VLMM/S-DTG model over a set of atomic behaviors.
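The global-position bookkeeping in Experiment 2 is a cumulative sum of per-frame global velocities; a minimal sketch, where the 1/66-second frame period is taken from the capture setup above and it is assumed the stored derivative is a true velocity rather than a per-frame displacement:

# Recover global positions from per-frame global velocities by accumulation,
# as done for the synthesized sequence in Experiment 2.
import numpy as np

def accumulate_positions(velocities, dt=1.0 / 66.0):
    """velocities: (n_frames, 3) global velocities; returns global positions
    with the first frame taken as the origin."""
    return np.cumsum(velocities * dt, axis=0)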

5. Conclusions and Discussions

A new model, VLMM/S-DTG, is presented to cope with three inevitable difficulties of modeling and regenerating 3D human motion: the under-sampling or biased-sampling caused by the high-dimensional pose space and limited motion capture frequency, the usual mistake of modeling systematic variations as noise, and the highly varied dynamics of human motion. The experiments are encouraging and show that the VLMM/S-DTG is able to learn from under-sampled short motion sequences, to extract motion style from motion structure, and to synthesize realistic new motion that is even longer than the training motion. However, the maximum-likelihood learning algorithm of VLMM/S-DTG presented in this work considers only the transition probabilities among the prototypical poses, rather than optimizing the likelihood of the model given the frame sequences of the training motion. An improvement on this weak point might result in a hidden variable-length Markov model with non-parametric output densities.

6. Acknowledgment

This research has been supported in part by research grants from the Natural Science Foundation of China No. 60173008, and Hong Kong RGC CityU 1062/02E and CityU 1247/03E.

References

[1] Matthew Brand and Aaron Hertzmann. Style machines. In Proc. ACM SIGGRAPH, pages 183-192, 2000.

[2] L.W. Campbell, D.A. Becker, A.J. Azarbayejani, A.F. Bobick, and A. Pentland. Invariant features for 3-D gesture recognition. In Proc. Second Int'l Conf. Face and Gesture Recognition, pages 157-162, Killington, 1996.

[3] Aphrodite Galata, Neil Johnson, and David Hogg. Learning variable-length Markov models of behavior. Computer Vision and Image Understanding, 81(3):398-413, 2001.

[4] Yan Li, Tianshu Wang, and Heung-Yeung Shum. Motion texture: A two-level statistical model for character motion synthesis. In Proc. ACM SIGGRAPH, pages 465-472, 2002.

[5] Ben North, Andrew Blake, Michael Isard, and Jens Rittscher. Learning and classification of complex dynamics. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(9):1016-1034, 2000.

[6] Vladimir Pavlović, James M. Rehg, Tat-Jen Cham, and Kevin P. Murphy. A dynamic Bayesian network approach to figure tracking using learned dynamic models. In Proc. IEEE ICCV, volume 1, pages 94-101, 1999.

[7] Vladimir Pavlović, James M. Rehg, Tat-Jen Cham, and Kevin P. Murphy. Impact of dynamic model learning on classification of human motion. In Proc. IEEE ICCV, volume 1, 2000.

[8] Jorma Rissanen. A universal data compression system. IEEE Trans. Information Theory, 29(5):656-664, 1983.

[9] J. Schlenzig, E. Hunter, and R. Jain. Vision based hand gesture interpretation using recursive estimation. In Proc. 28th Asilomar Conf. Signals, Systems, and Computers, 1994.

[10] Yang Song, Luis Goncalves, and Pietro Perona. Unsupervised learning of human motion. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(7):814-827, 2003.

[11] T.E. Starner and A. Pentland. Visual recognition of American Sign Language using hidden Markov models. In Proc. Int'l Workshop Automatic Face- and Gesture-Recognition, Zurich, 1995.

[12] Andrew D. Wilson and Aaron Bobick. Parametric hidden Markov models for gesture recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, 21(9):884-900, 1999.

[13] J. Yamato, J. Ohya, and K. Ishii. Recognizing human action in time-sequential images using hidden Markov model. In Proc. IEEE CVPR, pages 379-385, 1992.
