
Data-Driven Statistical Models for Computer Integrated Surgery

Balakrishnan Varadarajan, Carol E. Reiley, Sanjeev Khudanpur, and Gregory D. Hager

Abstract—Hidden Markov models (HMMs) are developed for the individual surgical gestures (or surgemes) that comprise a typical bench-top surgical training task. It is known that such HMMs can be used to recognize and segment surgemes in previously unseen trials [1]. Here, the topology of each surgeme HMM is designed in a data-driven manner, mixing trials from multiple subjects with varying skill levels, resulting in HMM states that model skill-specific sub-gestures. The sequence of HMM states visited while performing a surgeme is therefore indicative of skill and style differences. This expectation is confirmed by the average edit distance between the state-level "transcripts" of a surgeme performed by two subjects with different expertise levels. Some surgemes are further shown to be more indicative of skill than others. We also introduce a novel technique to obtain the underlying hidden structure that models each of the surgemes in our training vocabulary. The motivation comes from speech recognition, where it is crucial to learn the several allophonic variations of a phone. We compare our framework with a traditional LDA classification approach and with Gaussian mixture models in a single-state HMM. We propose a variant of classical LDA in which we jointly learn the HMM parameters and the transform matrix in an iterative fashion. We use this to obtain the surgemes that make up the surgery from untranscribed trials, which plays a crucial role in identifying surgemes when the vocabulary of gestures is unknown. It can also be used to align two surgical videos and to assess the expertise level of the subjects performing the designated task(s).

Index Terms—Skill Assessment, Structure Induction, Iterative (H)LDA, Registration.


1 SURGICAL SKILL ASSESSMENT

Robotic minimally invasive surgery (RMIS) has experienced rapid development and growth over the past decade, and the da Vinci robotic surgery system has emerged as the leader in RMIS [2]. RMIS is considered quite effective in reducing risk of complications and death due to inadequate real-time medical feedback. Yet, adoption of RMIS has not been as widespread in surgery as these benefits may suggest. Training for RMIS has often been cited as challenging, even for experienced subjects [3]. One approach to overcome this challenge is to develop techniques for automatic assessment of surgical skills during the performance of benchmark tasks, such as suturing or knot-tying, that simulate live tasks used for clinical skill evaluation [4]. This paper presents such techniques based on gesture recognition using hidden Markov models (HMMs). RMIS is uniquely amenable to automatic skill assessment. The robot functions as a measurement tool for dexterous motion. As part of its run-time system, the da Vinci exposes an application programming interface (API) that provides accurate and detailed motion measurements, including the surgeon console "master" manipulators and all patient-side tools. These kinematic measurements may be used to recognize individual surgical gestures [1]. In prior work, Richards et al. [5] have successfully demonstrated that force/torque signatures in RMIS may be used for two-way skill classification of a trial; they observed that novices employ higher force/torque than experts. Rosen et al. [6] have used HMMs to model tool-tissue interactions in laparoscopic surgery; a separate HMM for each skill level was trained using a pool of experts, intermediates and novices, and a statistical distance between these HMMs was shown to correlate well with the learning curve of these trainee subjects.
In these and other reported efforts, the automatic assessment is for entire trials, while the work presented here assesses finer-grained segments, namely individual surgical gestures. Lin et al. [7] have used linear discriminant analysis (LDA)

to project high-dimensional kinematic measurements from the da Vinci API to three or four dimensions, and used a Bayes classifier to successfully detect and segment surgical gestures from the low-dimensional signal. Reiley et al. [1] replace the Bayes classifier with a simple HMM for each gesture, and demonstrate improved accuracy on previously unseen users. A part of the work presented here improves upon [1] by applying LDA to discriminate not between the kinematic signals of entire gestures, as [7] does, but between sub-gestures represented by individual states of a gesture HMM. More importantly, however, this work applies the HMM methodology to gesture-specific skill assessment. A data-driven algorithm is used to design the HMM topology for each gesture. As a consequence, in addition to improving automatic detection and segmentation of surgical gestures, one is able to compare individual gestures of expert, intermediate and novice subjects in a quantitative manner. For instance, some gestures in a suturing task, such as navigating a needle through the tissue, are demonstrated to be more indicative of expertise than others, such as pulling the suture. Such fine-grained assessment can ultimately lead to better automatic surgical assessment and training methods. The remainder of this paper is organized as follows. We begin in Section 2 by describing a bench-top suturing task. The use of HMMs for gesture recognition and segmentation is described in Section 3, including two technical novelties: state-specific LDA and data-derived HMM topologies. These innovations lead to improved recognition accuracies. In Section 4, we propose an automatic HMM learning scheme from untranscribed data. In Section 5.1, we demonstrate how paths through the HMM state space are indicative of the expertise with which the gesture has been performed, leading to the main claim of the paper regarding gesture-level surgical skill assessment.
We jointly learn the structure and transform matrices from the data using an iterative (H)LDA setup, described in Section 6. In Section 6.1, we show how the states of the HMM automatically grown from untranscribed data are indicative of the actual gestures. In this section, we also show the correlation between the automatically discovered labels and the manual ones. We end in Section 6.2.1 by describing a very useful application of our automatic state-growing, namely aligning two different trials. Such alignments, which are also called registration, are very useful and can provide real-time feedback to novice surgeons during training.

2 THE SURGEME RECOGNITION SETUP

Kinematic Data Recordings: We recorded robot kinematic measurements from 2 expert, 3 intermediate and 3 novice subjects performing a bench-top suturing task—four throws along a line—on the teleoperated da Vinci surgical system. The average duration of a trial is 2 minutes, and the video and kinematic data are recorded at 30 frames per second. The kinematic measurements include position, velocity, etc. from both the surgeon- and patient-side manipulators, for a total of 78 motion variables. We use {y_t, t = 1, 2, ..., T} to denote the sequence of kinematic measurements for a trial, with y_t ∈ R^78 and T ≈ 3400. A total of 30 trials were recorded, roughly four from each of the eight subjects.

Manual Labeling of Surgemes: Each trial was manually segmented into semantically "atomic" gestures, based on the eleven-symbol vocabulary proposed by [1]. Following their terminology, we will call each gesture a surgeme. Typical surgemes include, for instance, (i) positioning the needle for insertion with the right hand, (ii) inserting the needle through the tissue until it comes out where desired, (iii) reaching for the needle-tip with the left hand, (iv) pulling the suture with the left hand, etc. We use {σ_[i], i = 1, 2, ..., k} to denote the surgeme label-sequence of a trial, with σ_[i] ∈ {1, ..., 11} and k ≈ 20, and [b_i, e_i] the begin- and end-time of σ_[i], 1 ≤ b_i < e_i ≤ T. Note that b_1 = 1, b_{i+1} = e_i + 1, e_k = T.

The Surgeme Recognition Task: Given a partition of the 30 trials into training and test trials, the surgeme recognition task is to automatically assign to each trial in the test partition a surgeme transcript {σ̂_[i], i = 1, 2, ..., k̂} and time-marks [b̂_i, b̂_{i+1} − 1]. Trials in the training partition are used to train the HMMs, as described below. We report results with three different training/test partitions. Setup I: Of the 30 trials, 8 have some minor errors by the subjects during suturing. These are excluded altogether in Setup I.
Leave-one-out cross-validation is carried out with the remaining 22 trials, so that each trial appears exactly once in the test partition. The test results of all 22 folds (22 trials) are aggregated. Setup II: The training partition in Setup II comprises the 22 "good" trials, while the test partition comprises only the 8 "imperfect" trials. Setup III: User-disjoint partitions of the 30 trials are created in Setup III. An eight-fold cross-validation akin to Setup I is carried out, except that in each fold, all the trials of one subject are in the test partition and all trials of the remaining 7 subjects are in training. Test results of all 30 trials are aggregated. Setup I is relatively the easiest, with 22 good test trials and the surgeon of each test trial seen in training. Setup II is harder, with seen subjects but with test trials that have some visible errors, a situation not dissimilar from recognition of slightly disfluent speech. Setup II is most similar to the multiple-user results in [1, Table 3], with which we make direct comparisons. Setup III is the hardest, because all trials of the test surgeon have also been removed from training. This simulates the case when the system sees a new surgeon for the first time.

Recognition accuracy is measured as the fraction of kinematic frames that are assigned the correct surgeme label by an automatic system. Formally,

    Accuracy of test trial {y_1, ..., y_T} = (1/T) Σ_{t=1}^{T} I(σ_t = σ̂_t),        (1)

where σ_t = σ_[i] for all t ∈ [b_i, e_i] and σ̂_t = σ̂_[i] for all t ∈ [b̂_i, ê_i]. This measures the goodness of both the labels and the segmentation proposed by {σ̂_t}.
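As a concrete illustration, the frame-level accuracy of Eq. (1) can be computed from segment-level transcripts as below. This is a minimal sketch; the (label, begin, end) segment representation and the toy numbers are our own, not from the paper.

```python
def frame_accuracy(ref_segments, hyp_segments, T):
    """Frame-level accuracy of Eq. (1): expand (label, begin, end) segments
    into per-frame labels and count agreements. Times are 1-indexed, inclusive."""
    def expand(segments):
        labels = [None] * T
        for label, b, e in segments:
            for t in range(b - 1, e):
                labels[t] = label
        return labels
    ref, hyp = expand(ref_segments), expand(hyp_segments)
    return sum(r == h for r, h in zip(ref, hyp)) / T

# Toy example: a 10-frame trial with two surgemes; the hypothesis
# misplaces the boundary by two frames, so 8 of 10 frames match.
ref = [(2, 1, 6), (4, 7, 10)]
hyp = [(2, 1, 4), (4, 5, 10)]
print(frame_accuracy(ref, hyp, 10))  # 0.8
```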

3 BASE PIPELINE ARCHITECTURE USING HMMS

Formally, an HMM is defined by a 5-tuple (S, V, Π, T, E), where S is a set of states, V is a set of possible symbols, Π gives the initial state probabilities, T the transition probabilities, and E the emission distribution attached to each state s ∈ S. In our setup, the emission is the kinematic observation¹ describing the positions of the various arms of the da Vinci. Each state can be thought of as a gesture denoting the kind of movement or action being performed.

Embedding Linear Discrimination in HMMs: The variability across the gestures/dexemes (states in the HMMs) is described using a generative model for the d = 78 dimensional signal corresponding to each gesture/dexeme. Noting that the degrees of freedom in the da Vinci are far fewer than 78, it suffices to model this variability on a subspace of dimension significantly smaller than d = 78. Given a sequence of observations o_1, ..., o_N along with its gesture transcription g_1, ..., g_N, we assume that the variability between the gestures lies on a subspace X^p of dimension p << d. The remaining (d − p)-dimensional subspace X^{d−p} is modeled as a random signal with variability independent of the gesture. We call the component of o_t on X^p as x_t^p and that on X^{d−p} as x_t^{d−p}. This model framework in the supervised setting (gesture sequence known) is the HLDA initially studied by Kumar et al. [8]. Writing x_t = [x_t^p ; x_t^{d−p}], the state-space representation of this model is

    x_t = μ_{g_t} + z_{g_t},        z_{g_t} ~ N(0, Σ_{g_t}),
    o_t = A^{−1} x_t,
    μ_c = [ μ_c^p ; μ^{d−p} ],
    Σ_c = [ Σ_c^p      0
            0          Σ^{d−p} ].        (2)
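To make the generative model of Eq. (2) concrete, the following sketch samples observations from it. The toy dimensions, the random invertible transform, and the per-gesture parameter values are our own assumptions for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 6, 2          # toy dimensions standing in for d = 78, p << d

# Hypothetical model parameters: a random invertible transform A,
# gesture-dependent stats on the p-dim subspace, shared stats on the rest.
A = rng.standard_normal((d, d))
mu_p = {1: np.array([2.0, 0.0]), 2: np.array([-2.0, 1.0])}   # per-gesture means
mu_rest = np.zeros(d - p)                                     # gesture-independent
Sigma_p = np.eye(p)
Sigma_rest = np.eye(d - p)

def sample_observation(g):
    """Draw o_t = A^{-1} x_t where x_t = mu_g + z_g, z_g ~ N(0, Sigma_g),
    per Eq. (2): only the first p coordinates of x_t depend on the gesture g."""
    x = np.concatenate([
        rng.multivariate_normal(mu_p[g], Sigma_p),
        rng.multivariate_normal(mu_rest, Sigma_rest),
    ])
    return np.linalg.solve(A, x)   # A^{-1} x without forming the inverse

o = sample_observation(1)
assert o.shape == (d,)
```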

The problem now is to jointly estimate the subspace (A), the class-dependent means and covariance components (μ_g^p, Σ_g^p), and the class-independent means and covariances (μ^{d−p}, Σ^{d−p}).

Parameter Estimation: In order to perform a maximum likelihood estimate of the parameters described in (2), given an observation sequence {o_1, ..., o_N} along with its gesture transcription g_1, ..., g_N (g_i ∈ {1, ..., G}), one may estimate the subspace A by optimizing the following objective (as shown in [8]):

    Q_Θ = log |A| − (1/2) log |A_{d−p} Σ̂^o A_{d−p}^T|
          − (1/2) Σ_{g=1}^{G} (N_g / N) log |A_p Σ̂_g^o A_p^T|.        (3)

1. In fact, we use a linearly transformed version of the 78-dimensional data as our observation; a standard LDA technique is used to achieve this.
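The objective (3) can be transcribed directly, as in the sketch below. The helper names and the use of `slogdet` (for numerically stable log-determinants) are our own choices, not the authors' implementation.

```python
import numpy as np

def hlda_objective(A, Sigma_o, Sigma_g_o, N_g, p):
    """Evaluate the HLDA objective of Eq. (3) for a candidate transform A.
    Sigma_o: d x d empirical covariance of all observations;
    Sigma_g_o: dict g -> d x d per-gesture covariance; N_g: dict g -> count."""
    Ap, Arest = A[:p], A[p:]
    N = sum(N_g.values())
    q = np.linalg.slogdet(A)[1]                     # log |A|
    q -= 0.5 * np.linalg.slogdet(Arest @ Sigma_o @ Arest.T)[1]
    for g, Sg in Sigma_g_o.items():
        q -= 0.5 * (N_g[g] / N) * np.linalg.slogdet(Ap @ Sg @ Ap.T)[1]
    return q
```

With identity covariances and A = I, every log-determinant vanishes and the objective is 0, which is a quick sanity check.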

TABLE 1
Surgeme Recognition Accuracies with Dexeme-level LDA.
(a) A 1-state HMM per Surgeme

LDA d   Setup I   Setup II   Setup III
3       75%       75%        58%
5       81%       72%        69%
7       81%       70%        72%

Here N_g is the number of instances of gesture g seen in the training data o_1, ..., o_N. Σ̂^o denotes the empirical covariance of the entire d = 78 dimensional observation, and Σ̂_g^o is the empirical covariance of observations corresponding only to the state g. Finally, A_p and A_{d−p} are the first p rows and the last d − p rows of A, respectively. Once the subspace A is identified, one may estimate the parameters μ_c and Σ_c from the sufficient statistics of the projected p-dimensional vector x_t^p = A_p o_t. In the model described in (2), if one assumes Σ_c^p = Σ^p ∀c, the estimate for A simplifies to classical linear discriminant analysis. In this case, the solution A is given by the columns of the eigen-matrix (sorted according to the eigenvalues) of W^{−1}B, where W and B are the within- and between-class covariances of the data points, respectively.

Surgeme Recognition: A surgeme (HMM) is permitted to be followed by any other surgeme during recognition, and the Viterbi algorithm [9] is used to find the sequence {ŝ_t ∈ ∪_σ S_σ, t = 1, ..., T} of HMM states with the highest a posteriori likelihood given a test trial {o_t}. The surgeme sequence {σ̂_[i], i = 1, 2, ..., k̂} and time-marks [b̂_i, ê_i] are a byproduct of the Viterbi algorithm.
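The classical LDA solution mentioned above, the eigen-matrix of W^{−1}B sorted by eigenvalue, can be sketched as follows. The function name and data layout are our own; this is an illustrative implementation, not the paper's code.

```python
import numpy as np

def lda_transform(X, labels, p):
    """Classical LDA: columns of the eigen-matrix of W^{-1} B, sorted by
    eigenvalue, keeping the top p directions.
    X: (n, d) observations; labels: (n,) class ids."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    W, B = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        W += (Xc - mc).T @ (Xc - mc)          # within-class scatter
        diff = (mc - mu)[:, None]
        B += len(Xc) * (diff @ diff.T)        # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(-eigvals.real)
    return eigvecs.real[:, order[:p]]         # project with X @ A_p
```

On two well-separated toy classes, the leading direction aligns with the axis separating the class means.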

TABLE 2
Surgeme Recognition Accuracies with Dexeme-level LDA for 3-state HMMs.

LDA d   Setup I   Setup II   Setup III
3       79%       70%        73%
5       83.4%     76%        73%
7       83.2%     83%        81%
9       84.1%     86%        78%
17      87%       83%        81%

3.2 Using multi-state HMMs per gesture

In this section, we model each gesture as a hidden Markov model with multiple states. The states within each gesture connote various sub-actions (not annotated by humans) that are semantically identical but possess significant variability in the kinematic signal. To start, we assign a simple left-to-right hidden Markov model to each gesture, hoping to capture the temporal variability within each gesture. Since the state sequence is now a hidden random variable, one may not directly use (3) to identify the subspaces A_p and A_{d−p}, as the state identities at each time t are unknown. A commonly used approximation is to solve for A in (3) using the gesture identities directly, i.e. assuming there is no state variability. Once this gesture-discriminating subspace is identified, the state-dependent parameters may be estimated by simply projecting the observations {o_t} onto this subspace. [1] report accuracies of 72% to 77% using this coarse gesture-level LDA.²

2. Recall that LDA is simply obtained by tying the class-dependent covariances Σ_s^p in the HLDA setting.
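A left-to-right gesture HMM of the kind described above is determined by a banded transition matrix: each state either self-loops or advances. A minimal sketch (the initial self-loop probability is our own assumed starting value, re-estimated during training):

```python
import numpy as np

def left_to_right_transitions(n_states, self_loop=0.8):
    """Initial transition matrix for a left-to-right gesture HMM: each state
    either stays (self-loop) or advances to the next state; the extra final
    column represents exiting the gesture into the next surgeme."""
    T = np.zeros((n_states, n_states + 1))
    for s in range(n_states):
        T[s, s] = self_loop
        T[s, s + 1] = 1.0 - self_loop
    return T

T = left_to_right_transitions(3)
assert np.allclose(T.sum(axis=1), 1.0)   # rows are proper distributions
```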


The primary purpose of (H)LDA is to reduce the dimensionality of {y_t} without losing information necessary to discriminate between gestures σ_t. Note, however, that each surgeme is modeled by an HMM with several states s ∈ S_σ, each of which models a sub-gesture—we name these sub-gestures dexemes to connote small dextrous motions. It is natural, therefore, to investigate whether it is better to perform HLDA to discriminate between dexemes rather than entire surgemes. As described in the previous section, one may not directly use (3) to identify the state-discriminating subspaces, as there are multiple possible state sequences {s_1, ..., s_N} that result in the same gesture sequence {g_1, ..., g_N}. However, using an initial model for the subspaces and the state parameters, one may estimate the probability γ_t^s of being in state s at time t. This can be done in the HMM setting using the standard forward-backward algorithm. The sufficient statistics for each state are then accumulated, resulting in a soft estimation of N_s and Σ̂_s^o as follows.

3.1 Using 1-state HMMs per gesture

This special case is not actually an HMM, as the entire state sequence s_1, ..., s_N of the observation sequence is known. In other words, we are simply modeling each gesture as a Gaussian random variable. Here s_t = g_t, as there is only one state corresponding to each gesture. Hence one may solve for A by directly maximizing the objective described in (3). Using the standard Viterbi algorithm, Table 1(a) shows the gesture recognition accuracies averaged over the test trials.


3.2.1 Improved Dimensionality Reduction and Modeling for multi-state HMMs

Accumulating the forward-backward posteriors γ_t^s yields the soft counts and sufficient statistics

    N_s    = Σ_{t=1}^{N} γ_t^s ,
    μ̂_s^o = (1/N_s) Σ_{t=1}^{N} γ_t^s o_t ,
    Σ̂_s^o = (1/N_s) Σ_{t=1}^{N} γ_t^s o_t o_t^T − μ̂_s^o (μ̂_s^o)^T .

Using these, one may estimate Â by maximizing the following objective:

    Q_Θ = log |A| − (1/2) log |A_{d−p} Σ̂^o A_{d−p}^T|
          − (1/2) Σ_{s=1}^{S} (N_s / N) log |A_p Σ̂_s^o A_p^T| .        (4)

If one wants to do LDA, Â may simply be estimated as the eigen-matrix of W^{−1}B, where the within- and between-class covariance matrices are estimated using the soft statistics of the {o_t} via the posterior probabilities γ_t^s. This dexeme-level (H)LDA is better able to preserve information that distinguishes temporal sub-gestures of a single gesture, as well as stylistic variations within (sub)gestures, as evidenced in Tables 3 and 4(a). The dexeme-level LDA provides up to 86% accuracy, as shown in Table 2. We also see from Table 2 that maximum accuracy is achieved when the number of dimensions d is between 9 and 17, indicating the need for more dimensions to differentiate between the finer-grained motions represented by dexemes.
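The soft sufficient statistics above can be accumulated in a few lines. A sketch with our own array layout (observations by rows, posteriors as an (N, S) matrix); a hard, one-hot γ reduces these to the usual per-class empirical statistics.

```python
import numpy as np

def soft_statistics(O, gamma):
    """Soft counts, means and covariances of Section 3.2.1.
    O: (N, d) observations; gamma: (N, S) state posteriors gamma_t^s
    from forward-backward. Returns N_s (S,), mu (S, d), Sigma (S, d, d)."""
    Ns = gamma.sum(axis=0)                        # N_s = sum_t gamma_t^s
    mu = (gamma.T @ O) / Ns[:, None]              # soft means
    S, d = gamma.shape[1], O.shape[1]
    Sigma = np.empty((S, d, d))
    for s in range(S):
        # soft second moment minus outer product of the soft mean
        second = (gamma[:, s, None] * O).T @ O / Ns[s]
        Sigma[s] = second - np.outer(mu[s], mu[s])
    return Ns, mu, Sigma
```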

TABLE 3
Surgeme Recognition Accuracies using best 3-state HMM (exhaustive search).

LDA d   Setup I   Setup II   Setup III
3       77.1%     75.24%     75.72%
5       84.1%     77.6%      78.01%
7       83.3%     82.9%      84.0%
9       82.7%     87.1%      81.1%
17      85.7%     84.0%      80.2%

4

WHAT IS THE BEST 3-STATE TOPOLOGY?

In the previous section, we described how recognition accuracies can be improved by providing more flexibility. We accomplished this by modeling each gesture as a 3-state left-to-right HMM. However, it is not clear that the optimal 3-state HMM for each gesture is a left-to-right one.

4.1

Exhaustive Search

One obvious way to resolve this problem is to perform an exhaustive search over all rooted unlabeled strongly connected graphs with 5 nodes (3 emitting nodes, a start state and an end state). Liskovets [10] uses Burnside's lemma to give a system of linear recurrences for the number of strongly connected digraphs with n nodes. For each gesture, we use an exhaustive search over all 287 3-state HMMs. For each of the 287 HMMs, we perform state-level LDA as described in Section 3.2.1 and compute the data likelihood under the current model. We pick the structure that gives maximum likelihood on the training data (for each gesture). In Table 3, we show the recognition accuracies for different setups and dimensions using the best 3-state HMM obtained.

4.2

Data-derived HMM Topologies

In the work of [1], and in our initial work here, we used a 3-state left-to-right HMM to model each gesture. However, it is quite clear that each gesture has not only temporally distinct sub-gestures—which would be well modeled by states of a left-to-right HMM—but also contextual variability in sub-gestures. Some of the latter variability is due to the skill level of the surgeon, some due to the dynamics of a previous or subsequent gesture, while some depends on where in the suturing task (e.g. on the first or fourth throw) the gesture is being performed. We investigate induction of an optimal HMM topology directly from the data. More formally, we want to obtain the topology of a surgeme HMM that maximizes the likelihood of the training data {x_t}. Finding the optimal HMM with n states, however, requires considering every directed graph with n states, running the Baum-Welch algorithm, and evaluating the training-data likelihood. This is clearly intractable. Similar problems have been addressed in speech recognition: HMM topologies are derived for capturing context-dependent (allophonic) variations of phonemes using greedy algorithms. We apply one such algorithm by Varadarajan et al. [11], called the modified successive state splitting (SSS) algorithm, to our problem. We begin with a single-state HMM for each surgeme, and iteratively estimate the HMM parameters and increment the number of HMM states via SSS until diminishing gains are seen in the training data likelihood. Data-derived HMM topologies yield accurate models for surgeme recognition, and also capture sub-gesture patterns indicative of skill, as shown in Section 5.1.

Fig. 1. Modified four-way split of a state s.

4.2.1 Evolving the hidden structure from untranscribed data using SSS

We now describe how to obtain the structure of a hidden Markov model that completely describes a surgical task. We wish to automatically derive the parameters (transition probabilities, means and covariances) of an HMM with n states in a completely unsupervised fashion. The states are hoped to represent the gestures/sub-gestures in the designated task. More formally, we want to obtain the structure of an HMM that maximizes the likelihood of the training data. If T_ij = P(j|i) is the state transition probability and f(x|s) is the likelihood of a single data point at state s, then the total likelihood of the data (x^1, ..., x^N) over the N trials is given as

    Σ_{i=1}^{N} log ( Σ_{s̃} Π_{k=1}^{T_i} P(s̃_k | s̃_{k−1}) f(x_k^i | s̃_k) ),

where the inner sum ranges over all state sequences s̃ of length T_i.

In order to get the right structure of n states, we would need to form all possible directed graphs with n states, run the EM algorithm, and evaluate the likelihood. The number of connected digraphs grows exponentially with n,³ so it is not possible to adopt this approach; instead we follow a greedy approach to get the best structure for the given data. A similar problem is encountered in speech, where it is required to learn the allophonic variations of a given phone. The general idea in allophone learning is to begin with an inventory of only one allophone per phoneme, and incrementally refine the inventory to better fit the speech signal. Typically, each phoneme is modeled by a separate HMM. In early stages of refinement, when very few allophones are available, it is hoped that "similar" allophones of a phoneme will be modeled by shared HMM states, and that subsequent refinement will result in distinct states for different allophones. The key therefore is to devise a scheme for successive refinement of a model shared by many allophones. In the HMM setting, this amounts to simultaneously refining the topology and the model parameters. A successive state splitting (SSS) algorithm to achieve this was proposed by Takami and Sagayama [12], and enhanced by Singer and Ostendorf [13]. Improvements in phoneme recognition accuracy using these derived allophonic models over phonemic models were obtained. We now briefly describe the modified SSS algorithm proposed by Balakrishnan and Khudanpur [11]. The improvement of the SSS algorithm of Takami and Sagayama [12], renamed ML-SSS by Singer and Ostendorf [13], proceeds roughly as follows.

1) Model all the signal⁴ using a 1-state HMM with a diagonal-covariance Gaussian. (N = 1.)
2) For each HMM state s, compute the gain in log-likelihood (LL) of the signal from either a contextual or a temporal split of s into two states s1 and s2. Among the N states, select and split the one that yields the most gain in LL.
3) If the gain is above a threshold, retain the split and set N = N + 1; furthermore, if N is less than desired, re-estimate all parameters of the new HMM, and go to Step 2.

Note that the key computational steps are the for-loop of Step 2 and the re-estimation of Step 3.

Modifications to the ML-SSS Algorithm: We made the following modifications, which are favorable in terms of greater speed and larger search space, thereby yielding a gain in likelihood that is potentially greater than the original ML-SSS.

1) Model all the signal using a 1-state HMM with a full-covariance Gaussian density. Set N = 1.
2) Simultaneously replace each state s of the HMM with the 4-state topology shown in Figure 1, yielding a 4N-state HMM. If the state s had parameters (μ_s, Σ_s), then the means of its 4-state replacement are μ_{s1} = μ_s − δ = μ_{s4} and μ_{s2} = μ_s + δ = μ_{s3}, with δ = ε λ* v*, where λ* and v* are the principal eigenvalue and eigenvector of Σ_s, and 0 < ε << 1 is typically 0.2.
3) Re-estimate all parameters of this (overgrown) HMM. Gather the Gaussian sufficient statistics for each of the 4N states from the last pass of re-estimation: the state occupancy π_{s_i}, the sample mean μ_{s_i}, and the sample covariance Σ_{s_i}.
4) Each quartet of states (see Figure 1) that resulted from the same original state s can be merged back in different ways to produce 3, 2 or 1 HMM states. There are 6 ways to end up with 3 states, and 7 ways to end up with 2 states. Retain for further consideration the 4-state split of s, the best merge back to 3 states among the 6 ways, the best merge back to 2 states among the 7 ways, and the merge back to 1 state.

3. There are 10 different HMMs using 2 emitting states, 287 3-state HMMs, 24427 4-state HMMs and 6222928 5-state HMMs.
4. Note that the original application of SSS was for learning the allophonic variations of a phoneme; hence the phrase "all the signal" meant all the signal corresponding separately to each dexeme. Here it really means all the signal.
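The perturbed means of Step 2 of the modified algorithm can be computed from the principal eigenpair of the state covariance. A sketch under our reading of the split rule δ = ε λ* v* (function name and test values are our own):

```python
import numpy as np

def four_way_split_means(mu, Sigma, eps=0.2):
    """Means for the modified-SSS 4-state replacement of a state (mu, Sigma):
    mu_{s1} = mu - delta = mu_{s4}, mu_{s2} = mu + delta = mu_{s3},
    with delta = eps * lambda* * v* along the principal axis of Sigma."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)      # ascending eigenvalues
    lam, v = eigvals[-1], eigvecs[:, -1]          # principal eigenpair
    delta = eps * lam * v
    return mu - delta, mu + delta, mu + delta, mu - delta   # s1, s2, s3, s4
```

The split deliberately breaks symmetry so that re-estimation can pull the four copies toward distinct sub-gestures.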
5) Reduce the number of states from 4N to N + Δ by optimally⁵ merging back quartets that cause the least loss in log-likelihood of the signal.
6) Set N = N + Δ. If N is less than the desired HMM size, retrain the HMM and go to Step 2.

Observe that the 4-state split of Figure 1 permits a slight look-ahead in our scheme, in the sense that the goodness of a contextual or temporal split of two different states can be compared in the same iteration with two consecutive splits of a single state. Also, the split/merge statistics for a state are gathered in our modified SSS assuming that the other states have already been split, which facilitates consideration of concurrent state splitting. If s_1, ..., s_m are merged into s̃, the loss of log-likelihood in Step 4 is

    (d/2) Σ_{i=1}^{m} π_{s_i} log |Σ_s̃| − (d/2) Σ_{i=1}^{m} π_{s_i} log |Σ_{s_i}| ,        (5)

where Σ_s̃ = ( Σ_{i=1}^{m} π_{s_i} (Σ_{s_i} + μ_{s_i} μ_{s_i}') ) / ( Σ_{i=1}^{m} π_{s_i} ) − μ_s̃ μ_s̃'. Note that there are several ways to select the best Δ states that can be added to the HMM, and we need to select the best among them. E.g., going up from N = 6 to N + 3 = 9 HMM

5. This entails solving a constrained knapsack problem.

TABLE 4
Surgeme Recognition Accuracies using best 3-state HMM (Successive State Splitting).
(a) A 3-state HMM per Surgeme

LDA d   Setup I   Setup II   Setup III
3       78.1%     74.64%     75.44%
5       83.2%     77%        77.92%
7       83.1%     82.9%      83.11%
9       82.5%     87.1%      80.53%
17      86.9%     84.2%      81.7%

TABLE 5
Surgeme Recognition Accuracies with Dexeme-level LDA for data-derived HMMs.
(a) Data-derived HMM Topology

LDA d   Setup I   Setup II   Setup III
3       69%       67%        64%
4       73%       73%        70%
10      83%       82%        73%
15      86%       82%        71%
20      87%       83%        70%

states could be achieved by a 4-way split of a single state, a 3-way split of one state and a 2-way split of another, or a 2-way split of three distinct states. Each of these has to be explored. Since the number of possibilities is too large, we propose a dynamic programming solution to the problem. In Table 4(a), we show the recognition accuracies for different setups and dimensions using the best 3-state HMM obtained using SSS in conjunction with the state-level (H)LDA of Section 3.2.1.
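The merge loss of Eq. (5), used to rank candidate merge-backs in Step 5, can be sketched as below. The function name and test values are our own; merging two identical Gaussians costs nothing, which serves as a sanity check.

```python
import numpy as np

def merge_loss(pis, mus, Sigmas):
    """Loss in log-likelihood (Eq. 5) from merging states s_1..s_m into s~.
    pis: (m,) occupancies; mus: (m, d) means; Sigmas: (m, d, d) covariances."""
    d = mus.shape[1]
    pi_tot = pis.sum()
    mu_merged = (pis[:, None] * mus).sum(axis=0) / pi_tot
    # Sigma~ = (sum_i pi_i (Sigma_i + mu_i mu_i')) / (sum_i pi_i) - mu~ mu~'
    second = sum(pi * (S + np.outer(mu, mu))
                 for pi, mu, S in zip(pis, mus, Sigmas))
    Sigma_merged = second / pi_tot - np.outer(mu_merged, mu_merged)
    logdet = lambda M: np.linalg.slogdet(M)[1]
    return (d / 2) * pi_tot * logdet(Sigma_merged) \
         - (d / 2) * sum(pi * logdet(S) for pi, S in zip(pis, Sigmas))
```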

5 USING DATA-DERIVED HMMS FOR SEGMENTATION

We use the algorithm described in Section 4 to evolve a 6-state HMM for each of the gestures, using the training/test partitions described in Section 2. Table 5(a) shows recognition results for the different setups. The recognition accuracies remain high for Setups I and II using data-derived HMMs. The maximum recognition accuracy is obtained when the number of dimensions d is 20, indicating the need for more dimensions to differentiate between the larger number of dexemes. We also note that the accuracies drop considerably for Setup III. We conjecture that, in addition to expertise-dependent dexemes, the data-derived HMMs may also be modeling user-specific dexemes. This leads to improved recognition when a new trial of a seen user is presented, but also to some overfitting to the seen users.

5.1

Using data-derived HMMs for skill differentiation

To illustrate how the data-derived HMM topologies encode dexterity information, consider Figure 2, which shows a 5-state HMM derived via the SSS algorithm for surgeme #3, corresponding to the act of "inserting the needle through the tissue." Training samples of surgeme #3 were aligned with this 5-state HMM, and the state-level time marks were used to isolate individual dexemes corresponding to the HMM states a, b, c, d and e ∈ S_3. We studied the associated video to understand what the segments that align with each dexeme (HMM state) represent, and observed the following:

TABLE 6
Dexeme Similarity of Surgemes Performed with Different Skill Levels

(a) Similarities in Surgeme #2
              Expert   Inter.   Novice
Expert        0.65     0.55     0.55
Intermediate  0.55     0.50     0.53
Novice        0.55     0.53     0.46

Fig. 2. The Data-derived HMM for n = 5 States for Gesture #3.

Dexemes a, b and c: They all constitute rotating the right-hand patient-side wrist to drive the needle from the entry- to the exit-point.

Dexeme c versus a and b: All examples that aligned to c were from novice subjects. Examining the videos revealed that c corresponds to a sub-gesture where the novice hesitates/retracts while pushing the needle to the exit point. In most cases, c is followed by a or b, in which the trainee surgeon eventually performs the task (inserting the needle till it exits) correctly. States a and b appear to be indistinguishable, except for some stylistic differences.

Dexeme d: It represents the left arm reaching for the exiting needle. Often, when the left arm is already positioned near the exit point, this gesture is omitted. This explains the transitions from states a and b directly to state e.

Dexeme e: It represents gripping the needle with the left arm.

These observations reinforce the claim that SSS provides a means for automatically inducing meaningful units for modeling dexterous motion. While not demonstrated here, it may be applied to entire trials, automatically discovering and modeling gestures without requiring any manual labeling!

5.1.1

Measuring Expertise by Aligning Dexeme-transcripts

To compare how dissimilar two instances of a surgeme are, we compute an edit distance between their dexeme transcripts as described below. Let {x_t^1, t = b̂_i, ..., ê_i} and {x_t^2, t = b̂_j, ..., ê_j} denote two automatically segmented and labeled realizations of the surgeme σ, i.e. σ̂_[i] = σ̂_[j] = σ. We use the Viterbi alignment of {x_t^1} with the states S_σ of the surgeme HMM to obtain the sequence {ŝ_t^1, t = b̂_i, ..., ê_i}, and similarly {ŝ_t^2, t = b̂_j, ..., ê_j} from {x_t^2}. We then obtain the sequence of HMM states visited by {x_t^1} (resp. {x_t^2}) by simply compacting each run of state labels. In other words, we ignore how many consecutive frames are aligned with a state, counting them collectively as one "visit" to the state. Let {ŝ_[i]^1, i = 1, ..., k̂_1} and {ŝ_[j]^2, j = 1, ..., k̂_2} denote the dexeme transcripts of the two gestures generated in this manner. We then align {ŝ_[i]^1} and {ŝ_[j]^2} using the Levenshtein distance, and each element in the two sequences is marked as matched if it is aligned with the identical element in the other sequence. Inserted, deleted and (both sides of a pair of) mismatched symbols are marked as mismatched. The similarity of the realizations σ̂_[i] and σ̂_[j] is defined as the number of matched dexemes divided by k̂_1 + k̂_2. A similarity of 1 corresponds to identical dexeme sequences: k̂_1 = k̂_2 and ŝ_[i]^1 = ŝ_[i]^2 for each i. Otherwise the similarity ranges between 0 and 1. We calculate the average edit distance between realizations of σ drawn from different expertise levels for the four most frequent gestures: σ = 2, 3, 4 and 6. Note from Tables 6(a), 6(b) and 6(c) that some surgemes (e.g. #2: "positioning the needle at the entry point" or #3

(b) Similarities in Surgeme #3

              Expert   Inter.   Novice
Expert        0.69     0.60     0.53
Intermediate  0.60     0.51     0.50
Novice        0.53     0.50     0.50

(c) Similarities in Surgeme #4

              Expert   Inter.   Novice
Expert        0.71     0.57     0.54
Intermediate  0.57     0.58     0.58
Novice        0.54     0.58     0.51

(d) Similarities in Surgeme #6

              Expert   Inter.   Novice
Expert        0.74     0.69     0.68
Intermediate  0.69     0.65     0.67
Novice        0.68     0.67     0.61

: "inserting the needle through the tissue") show low expert-novice similarity compared to expert-expert similarity, indicating the need for skillful execution. In comparison, surgeme #6 (pulling the suture) in Table 6(d) exhibits significant similarity even between experts and novices.
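As a concrete sketch of the similarity computation described above: the helper names (`compact`, `align_matches`, `dexeme_similarity`) are ours, and the normalization by twice the matched count over $\hat{k}_1 + \hat{k}_2$ is our reading of the definition in the text.

```python
def compact(frames):
    """Collapse each run of identical frame-level state labels into one 'visit'."""
    out = []
    for s in frames:
        if not out or out[-1] != s:
            out.append(s)
    return out

def align_matches(a, b):
    """Number of matched symbols in a minimum-cost Levenshtein alignment.

    Among all minimum-cost alignments, ties are broken toward more matches.
    dp[i][j] = (edit_cost, -matches) for prefixes a[:i], b[:j].
    """
    m, n = len(a), len(b)
    dp = [[(0, 0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = (i, 0)
    for j in range(1, n + 1):
        dp[0][j] = (j, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c_del = (dp[i - 1][j][0] + 1, dp[i - 1][j][1])
            c_ins = (dp[i][j - 1][0] + 1, dp[i][j - 1][1])
            if a[i - 1] == b[j - 1]:
                # Matched pair: no cost, one more match (stored negated).
                c_diag = (dp[i - 1][j - 1][0], dp[i - 1][j - 1][1] - 1)
            else:
                c_diag = (dp[i - 1][j - 1][0] + 1, dp[i - 1][j - 1][1])
            dp[i][j] = min(c_del, c_ins, c_diag)
    return -dp[m][n][1]

def dexeme_similarity(frames1, frames2):
    """Similarity = 2 * matches / (k1 + k2) between two dexeme transcripts."""
    t1, t2 = compact(frames1), compact(frames2)
    return 2.0 * align_matches(t1, t2) / (len(t1) + len(t2))
```

For example, the frame sequences `"aaabbc"` and `"abc"` compact to the same dexeme transcript and so have similarity 1.0.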

6 JOINT HMM STRUCTURE AND SUBSPACE IDENTIFICATION FROM UNLABELED DATA

In the last few sections, we demonstrated how to build subspace HMM models from transcribed kinematic data. We now describe a methodology for building HMMs from data with no labels. The motivation for building such models is to handle large amounts of data with little or no supervision. The subspace identification problem in this case is a slight generalization of the one described in Section 3.2.1. We saw in Section 3 that when each gesture is modeled as a single-state HMM, it suffices to maximize (3) to obtain the subspace $A$. However, when each gesture is made up of multiple hidden states, one needs to employ the techniques described in Section 3.2.1: first estimate the posterior probabilities $\gamma_t^s$ using the current HMM parameters, then solve (4) to identify the subspaces $A_p$ and $A_{d-p}$. In the present setting we have no labels for the data, not even at the gesture level. We therefore must estimate the subspace when all the states are hidden and, in addition, identify the HMM structure. To this end, we use the techniques of Sections 3.2.1 and 4.2.1 to learn all the parameters $\Theta$, which include the HMM parameters and the subspace $A_p$, via a joint EM algorithm:

1) Set $k = 0$. Initialize the subspace matrix $A^0$ by a principal component analysis of the data.
2) Using the subspace $A_p^k$, learn the HMM topology with the algorithm described in Section 4 on the entire set of unlabeled trials.
3) Compute the posterior alignment probability $\gamma_t^s$ of each state $s \in \{1, \ldots, S\}$ at each time $t$ using the standard Baum-Welch algorithm.
4) Estimate the model covariance $\hat{\Sigma}_s$ of each state from these soft sufficient statistics.
5) Estimate the transform matrix $A^{k+1}$ by solving (4), and set $k = k + 1$.
6) If $\|A^k - A^{k-1}\| < \epsilon$, or the likelihood of the training data has converged, stop; otherwise go to step 2.

6.1 Evaluating the outcome of a data-derived HMM

Using the algorithm of Section 6, we can now jointly learn the subspace $A$ and the parameters of the HMM. The learned HMM can then be used to perform a state-level alignment by finding the best path through the HMM with the standard Viterbi algorithm. We do this on each trial using the learned HMM and pull out the contiguous frames that are aligned to a particular state. In Figure 3, we show the starting, middle and ending frames of two segments from different trials that were aligned to HMM state 9. Similarly, in Figure 4, we show three frames of two different segments that were aligned to HMM state 8. It can be seen that state 9 models the pulling gesture while state 8 models part of the needle-inserting gesture. In Table 7, we give a quantitative description of how the HMM state labels correlate with the manual segmentation. We list only the most prominent HMM state labels; the states not listed either occur very infrequently or occur for an extremely short duration, corresponding either to a transient gesture or to some other atypical sub-gesture.

Fig. 3. Two different segments aligned to state 9 of the HMM

Fig. 4. Two different segments aligned to state 8 of the HMM

TABLE 7
Percentage of the time a state gets aligned to a pre-existing manual label

State ID   Manual Label (M)   % of times S maps to M
4          Gesture 2          68.4%
6          Gesture 3          74.2%
8          Gesture 3          81.2%
9          Gesture 6          71.2%
10         Gesture 4          69.5%

6.2 Other applications

6.2.1 Surgeme-level registration

One may also use the automatic labeling provided by the HMM learned in Section 6 to register similar trials from two different subjects, for example an expert and a novice. One advantage of such registration is that it provides after-the-fact visual feedback to trainees, allowing them to compare their style and procedure with that of an established surgeon.
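The subspace re-estimation step of the joint EM algorithm of Section 6 (state-conditional statistics from Baum-Welch posteriors, followed by an update of the transform) can be sketched as follows. This is a simplified, LDA-style stand-in for solving the paper's objective (4), not the exact objective: the soft within-/between-state scatter construction and the name `soft_scatter_transform` are our assumptions.

```python
import numpy as np

def soft_scatter_transform(X, gamma, p):
    """Estimate a p-dim discriminative subspace from frames X (n x d) and
    state posteriors gamma (n x S), LDA-style with soft state assignments.

    NOTE: hypothetical, simplified stand-in for the paper's objective (4).
    """
    n, d = X.shape
    S = gamma.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))  # soft within-state scatter
    Sb = np.zeros((d, d))  # soft between-state scatter
    for s in range(S):
        w = gamma[:, s]
        ns = w.sum()
        mus = (w[:, None] * X).sum(axis=0) / ns  # soft state mean
        Xc = X - mus
        Sw += (w[:, None] * Xc).T @ Xc           # sum_i w_i (x_i - mu_s)(x_i - mu_s)^T
        diff = (mus - mu)[:, None]
        Sb += ns * (diff @ diff.T)
    # Top-p eigenvectors of Sw^{-1} Sb (small ridge keeps Sw invertible).
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(-evals.real)
    return evecs[:, order[:p]].real              # d x p projection A_p
```

Frames are then projected as `Y = X @ A_p` before re-estimating the HMM topology in the next iteration of the loop.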

Fig. 5. Pulling gesture

Fig. 6. First and last frames of a needle transfer

We now briefly describe the procedure used to perform such an alignment:
• Using the HMM $H$ obtained by the algorithm in Section 6, we use the standard Viterbi algorithm to compute the state sequence corresponding to each of the two trials.
• If the two trials have lengths $m$ and $n$, let $A_{m,n}$ be the cost of the desired alignment of the two videos. If $A_{i,j}$ is the best alignment of the first $i$ frames of trial 1 with the first $j$ frames of trial 2, we can write

$$A_{i,j} = d(s_i^1, s_j^2) + \min\big(A_{i-1,j},\ A_{i,j-1},\ A_{i-1,j-1}\big) \qquad (6)$$

where $s_i^1$ is the state to which the $i$-th frame of trial 1 is aligned, and $s_j^2$ is the state to which the $j$-th frame of trial 2 is aligned. In our case we take $d(x, y) = 1$ if $x \neq y$ and $d(x, y) = 0$ if $x = y$.
• Using the alignment obtained in the last step, we get a sequence of frame pairs $(i_n, j_n)$ for each time $1 \le n \le N$. From these frame pairs we construct an aligned video of length $N$. Since each frame in either trial may be repeated, the total aligned length $N$ is at least $\max(m, n)$.

Figures 5, 6, 7 and 8 show how we are able to align two different trials from two different subjects using this alignment procedure.

Fig. 7. First and last frames of a needle insertion

Fig. 8. End of needle pulling

6.2.2 Trial-level skill differentiation

We may also use our unsupervised segmentation technique to measure similarities between two trials without any surgeme labels. First, an HMM is learned using the algorithm described in Section 6; in our case, we evolve a 10-state HMM from all the suturing trials. The dexeme sequence for each trial is then computed by performing a Viterbi alignment. An edit distance, similar to the one described in Section 5.1.1, is used to compute a global similarity matrix. Table 8 shows the dexeme similarities between different levels of expertise. Note that the similarity of experts to each other is highest, indicating that experts converge to a common pattern of motion. Similarity also drops monotonically with experience level, suggesting a strong relationship between pattern regularity and level of skill, i.e. that learning manual dexterity corresponds to developing an efficient pattern of dexterous motion.

TABLE 8
Dexeme similarity of the entire trial based on the evolved HMM

              Expert   Inter.   Novice
Expert        0.74     0.55     0.48
Intermediate  0.55     0.56     0.47
Novice        0.48     0.47     0.50
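The dynamic-programming alignment of Section 6.2.1, i.e. the recurrence (6) together with the backtrace that recovers the frame pairs $(i_n, j_n)$, can be sketched as follows. The function name `align_trials` is ours, and the tie-breaking order among the three predecessors is our choice.

```python
def align_trials(s1, s2):
    """Align two frame-level state sequences with the recurrence of Eq. (6).

    Returns the sequence of 1-based frame pairs (i_n, j_n); its length is
    at least max(len(s1), len(s2)) because frames may be repeated.
    """
    m, n = len(s1), len(s2)
    INF = float("inf")
    A = [[INF] * (n + 1) for _ in range(m + 1)]
    A[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            # d(s1_i, s2_j) = 0 iff both frames exist and share a state label.
            d = 0 if (i and j and s1[i - 1] == s2[j - 1]) else 1
            best = INF
            if i and j:
                best = min(best, A[i - 1][j - 1])
            if i:
                best = min(best, A[i - 1][j])
            if j:
                best = min(best, A[i][j - 1])
            A[i][j] = d + best
    # Backtrace to recover the aligned frame pairs.
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        pairs.append((i, j))
        cands = []
        if i and j:
            cands.append((A[i - 1][j - 1], i - 1, j - 1))
        if i:
            cands.append((A[i - 1][j], i - 1, j))
        if j:
            cands.append((A[i][j - 1], i, j - 1))
        _, i, j = min(cands)  # tuple order prefers the diagonal move on cost ties
    pairs.reverse()
    return pairs
```

For instance, aligning state sequences `[1, 2, 3]` and `[1, 2, 2, 3]` repeats the second frame of the first trial, yielding an alignment of length 4 with zero cost.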

7 CONCLUSION

We have introduced a complete pipeline for modeling complex motions using hidden Markov models. We have shown that robust dimensionality reduction yields better performance than pre-existing techniques. In the last few sections, we have also introduced a novel technique to learn the HMM structure that best describes an evolving process. Data-derived HMMs automatically discover and model skill-specific sub-gestures. As shown in Table 7, the segmentation obtained by our technique correlates strongly with the manual segments. In other words, our algorithm captures the temporal and contextual variability of complex motions by carefully exploring the exponentially large space of candidate HMMs in polynomial time using dynamic programming. This also leads to a natural metric for comparing trials for the purpose of skill assessment without supervision. Finally, we show that this kind of segmentation enables several other useful applications, such as registration of surgical videos, thereby opening up immense possibilities for teaching and training.

REFERENCES

[1] Reiley, C., Lin, H., Varadarajan, B., Khudanpur, S., Yuh, D.D., Hager, G.D.: Automatic recognition of surgical motions using statistical modeling for capturing variability. MMVR (2008)
[2] Shuford, M.: Robotically assisted laparoscopic radical prostatectomy: a brief review of outcomes. Proc. Baylor University Medical Center 20(4) (2007) 354–356
[3] Lenihan Jr, J., Kovanda, C., Seshadri-Kreaden, U.: What is the learning curve for robotic assisted gynecologic surgery? J Minim Invasive Gynecol 15(5) (2008) 589–594
[4] Martin, J., Regehr, G., Reznick, R., MacRae, H., Murnaghan, J., Hutchison, C., Brown, M.: Objective structured assessment of technical skill (OSATS) for surgical residents. British Journal of Surgery 84(2) (1997) 273–278
[5] Richards, C., Rosen, J., Hannaford, B., Pellegrini, C., Sinanan, M.: Skills evaluation in minimally invasive surgery using force/torque signatures. Surgical Endoscopy 14 (2000) 791–798
[6] Rosen, J., Solazzo, M., Hannaford, B., Sinanan, M.: Task decomposition of laparoscopic surgery for objective evaluation of surgical residents' learning curve using hidden Markov model. Computer Aided Surgery 7(1) (2002) 49–61
[7] Lin, H.C., Shafran, I., Murphy, T.E., Okamura, A.M., Yuh, D.D., Hager, G.D.: Automatic detection and segmentation of robot-assisted surgical motions. In: MICCAI. (2005) 802–810
[8] Kumar, N., Andreou, A.G.: Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Commun. 26(4) (1998) 283–297
[9] Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2) (1989) 257–286
[10] Liskovets, V.A.: The number of strongly connected directed graphs. Mathematical Notes 8 (1970) 877–882
[11] Varadarajan, B., Khudanpur, S., Dupoux, E.: Unsupervised learning of acoustic sub-word units. In: Proceedings of ACL-08: HLT, Short Papers, Columbus, Ohio, Association for Computational Linguistics (June 2008) 165–168
[12] Takami, J., Sagayama, S.: A successive state splitting algorithm for efficient allophone modeling. In: ICASSP. (1992) 573–576
[13] Singer, H., Ostendorf, M.: Maximum likelihood successive state splitting. In: ICASSP. (1996) 601–604
