Online Supplementary Materials for Simple decision ...


A.C. Tan et al.

Online Supplementary Materials for
Simple decision rules for classifying human cancers from gene expression profiles

Aik Choon Tan1,*, Daniel Q. Naiman1,2, Lei Xu1, Raimond L. Winslow1,* and Donald Geman1,2,*

1 Center for Cardiovascular Bioinformatics and Modeling, Whitaker Biomedical Engineering Institute, and 2 Department of Applied Mathematics and Statistics, Johns Hopkins University, 3400 N. Charles St., Baltimore, MD 21218, USA

Contact: {actan, daniel.naiman, leixu, rwinslow, geman}@jhu.edu

Table of Contents
1. The TSP algorithm
2. The pruning algorithm
3. Multi-class decomposition schemes
4. Additional Results

* To whom correspondence should be addressed.


1. The TSP algorithm

TSP Algorithm
Input: Training sample S of P genes and N arrays.
Output: TSP classifier $h_{TSP}$.
1. Compute a score $\Delta_{ij}$ based on the training set for every pair of genes (i, j), $i, j \in \{1,\dots,P\}$, $i \neq j$.
2. Select the maximum score $\Delta_{\max}$ from the list of scores.
3. Proceed down the list of pairs; if $\Delta_{ij} = \Delta_{\max}$, recruit pair (i, j) into a candidate list.
4. Calculate the rank score $\gamma_{ij}$ for each pair (i, j) in the candidate list.
5. Sort the rank scores $\gamma_{ij}$ from largest to smallest.
6. Select the top scoring gene pair to construct the TSP classifier ($h_{TSP}$).
7. Return $h_{TSP}$.
Supplementary Figure 1: Description of the TSP algorithm.
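The steps of Supplementary Figure 1 can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation. It assumes the score is the absolute difference between the two classes in the frequency of x_i < x_j (which agrees with the count-based formula given later in this supplement), and it assumes a rank score equal to the absolute difference between the class means of the within-array rank difference. The rank score is not defined in this supplement, so that form, like the names `tsp_fit` and `tsp_predict`, is hypothetical.

```python
import numpy as np

def tsp_fit(X, y):
    """Return (i, j, c1_if_less) for the top scoring pair.

    X: (P, N) array of expression values (genes x arrays).
    y: (N,) array of class labels, 0 for C1 and 1 for C2.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    P = X.shape[0]

    # Step 1: score every pair (assumed form: |P(xi<xj | C1) - P(xi<xj | C2)|).
    less = X[:, None, :] < X[None, :, :]            # shape (P, P, N)
    delta = np.abs(less[:, :, y == 0].mean(axis=2)
                   - less[:, :, y == 1].mean(axis=2))

    # Steps 2-3: collect all pairs i < j that attain the maximum score.
    rows, cols = np.triu_indices(P, k=1)
    d_max = delta[rows, cols].max()
    candidates = [(int(i), int(j)) for i, j in zip(rows, cols)
                  if np.isclose(delta[i, j], d_max)]

    # Steps 4-5: break ties with a rank score (assumed form, see lead-in).
    ranks = X.argsort(axis=0).argsort(axis=0)       # within-array ranks

    def gamma(pair):
        diff = ranks[pair[0]] - ranks[pair[1]]
        return abs(diff[y == 0].mean() - diff[y == 1].mean())

    # Step 6: keep the candidate with the largest rank score.
    i, j = max(candidates, key=gamma)
    in_c1 = less[i, j, y == 0].mean()
    in_c2 = less[i, j, y == 1].mean()
    return i, j, bool(in_c1 >= in_c2)

def tsp_predict(model, x):
    """Classify one array x (length P) with the fitted pair."""
    i, j, c1_if_less = model
    return 0 if (x[i] < x[j]) == c1_if_less else 1

# Toy data: 3 genes, 6 arrays; all three pairs tie on the score,
# so the rank score decides.
X = [[1, 1, 1, 5, 5, 5],
     [2, 2, 2, 4, 4, 4],
     [3, 3, 3, 3, 3, 3]]
y = [0, 0, 0, 1, 1, 1]
model = tsp_fit(X, y)
```

On this toy input every pair separates the classes perfectly, and the pair of genes 0 and 2 is kept because its rank difference changes most between the classes.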


2. The pruning algorithm

We have developed an algorithm that produces a pruned list of gene pairs that is guaranteed to contain every pair that could possibly be identified as one of the TSPs and k-TSPs using the TSP (Suppl. Fig. 1) and k-TSP (Fig. 1 in the main paper) algorithms, no matter which of at most n arrays are removed for the purpose of cross-validation. Using this pruned list, the algorithm's search space is significantly reduced, which leads to a drastic reduction in computational time (especially during cross-validation).

Let $\Delta_{ij}(F)$ denote the score obtained for a given pair of genes (i, j) when a subset of arrays $F \subseteq \{1,\dots,N\}$ is left out from the training set S. Our algorithm is based on the ability to calculate: (i) a sharp lower bound $L_{ij}(n)$ for the score that is obtained for the gene pair (i, j) no matter which of at most n arrays are left out, and (ii) a sharp upper bound $U_{ij}(n)$ for the score that can be obtained for the gene pair (i, j) no matter which of at most n arrays are left out. Thus

$$L_{ij}(n) \le \min\{\Delta_{ij}(F) : F \subseteq S,\ |F| \le n\}$$

and

$$U_{ij}(n) \ge \max\{\Delta_{ij}(F) : F \subseteq S,\ |F| \le n\}.$$

Details for obtaining these bounds are provided below.
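To make the two extremes concrete, the following sketch enumerates every left-out set F with at most n arrays for a single gene pair and records the smallest and largest score. It is a brute-force illustration only, feasible for tiny examples. The score form (absolute difference of the per-class frequencies of x_i < x_j) and the function names are assumptions made for the example, and subsets that would empty a class are skipped.

```python
from itertools import combinations

def pair_score(xi, xj, labels):
    """Score of one gene pair: |P(xi<xj | C1) - P(xi<xj | C2)| (assumed form)."""
    c1 = [a < b for a, b, c in zip(xi, xj, labels) if c == 0]
    c2 = [a < b for a, b, c in zip(xi, xj, labels) if c == 1]
    return abs(sum(c1) / len(c1) - sum(c2) / len(c2))

def leave_out_extremes(xi, xj, labels, n):
    """Exact min and max of the score over all left-out sets F with |F| <= n."""
    N = len(labels)
    scores = []
    for size in range(n + 1):
        for F in combinations(range(N), size):
            keep = [t for t in range(N) if t not in F]
            kept_labels = [labels[t] for t in keep]
            if 0 not in kept_labels or 1 not in kept_labels:
                continue  # skip subsets that would empty a class
            scores.append(pair_score([xi[t] for t in keep],
                                     [xj[t] for t in keep],
                                     kept_labels))
    return min(scores), max(scores)

# Four arrays per class; x_i < x_j holds in 3 of 4 C1 arrays and 1 of 4 C2 arrays.
xi = [1, 1, 1, 3, 1, 3, 3, 3]
xj = [2, 2, 2, 2, 2, 2, 2, 2]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
lo, hi = leave_out_extremes(xi, xj, labels, n=1)
```

With the full data the score is 0.5. Leaving out one array moves it anywhere between 5/12 and 3/4, which is the range the bounds must cover.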

Let $k_{\max}$ denote the pre-specified upper bound on the number of gene pairs that could be included in the final classifier (see footnote 1). Rank all of the pairs (i, j) from largest to smallest according to the value of $L_{ij}(n)$ and apply Step 4 of the k-TSP algorithm (Fig. 1 in the main paper) using $2k_{\max}$ iterations to yield a list of disjoint pairs $(i_u, j_u)$, $u = 1,\dots,2k_{\max}$. Let L denote the lower bound obtained for the $2k_{\max}$-th of these pairs, that is, $L = L_{ij}(n)$ for $(i, j) = (i_{2k_{\max}}, j_{2k_{\max}})$. We then take as our pruned list all of those pairs (i, j) for which $U_{ij}(n) \ge L$.

Footnote 1: Even though we need to set an initial value for $k_{\max}$, the final value of the parameter k is determined via the internal cross-validation of the k-TSP algorithm.

Claim: If $U_{ij}(n) < L$ then the pair (i, j) cannot arise as one of the k-TSPs when the k-TSP algorithm is used on any of the datasets obtained by removing at most n samples, for all $k \le k_{\max}$.

Proof: If (i, j) is a pair of indices for which $U_{ij}(n) < L$, and the removed samples correspond to the index set $F \subseteq \{1,\dots,N\}$ with $|F| \le n$, then we have

$$\Delta_{ij}(F) \le U_{ij}(n) < L.$$

On the other hand, by the definition of L, there exist disjoint pairs $(i_1, j_1), (i_2, j_2), \dots, (i_{2k_{\max}}, j_{2k_{\max}})$ with $L_{i_u j_u}(n) \ge L$ for $u = 1,\dots,2k_{\max}$. Combining inequalities we see that

$$\Delta_{ij}(F) < L \le L_{i_u j_u}(n) \le \Delta_{i_u j_u}(F) \quad \text{for } u = 1,\dots,2k_{\max},$$

so that when Step 2c of the k-TSP algorithm is carried out with this choice of F, all $2k_{\max}$ of the pairs $(i_u, j_u)$ appear in the ordered list before the pair (i, j). These $2k_{\max}$ pairs do not have to appear in any particular order. When Step 4 of the algorithm is applied, and the top scoring pairs list is created, each pair added to the top scoring pairs list in (i) leads to at most 2 of the pairs $(i_u, j_u)$ being eliminated from the ordered list in (ii). Consequently, in every one of the $k_{\max}$ iterations, either one of the pairs $(i_u, j_u)$ is added to the top scoring pairs list, or a pair gets added to it that precedes $(i_{2k_{\max}}, j_{2k_{\max}})$ in the ordered list. In any case, the pair (i, j) cannot possibly be selected for inclusion in the top scoring pairs list.

The pruning algorithm is illustrated in Supplementary Figure 2. Calculation of the pruned list requires two passes through the set of all pairs, and the computational effort is roughly double that of computing the k-TSP classifier. The advantage of computing the pruned list arises in the double loop of cross-validation, since repeated k-TSP calculations are needed but these can be carried out using a significantly reduced list of gene pairs.
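A minimal sketch of the pruning step is given here, assuming the bounds have already been computed and stored in two dictionaries keyed by gene pair. It also assumes that Step 4 of the k-TSP algorithm amounts to a greedy pass that takes the best remaining pair and then discards every pair sharing a gene with it; that step is defined in the main paper, not in this supplement. The names `greedy_disjoint_pairs` and `prune_pairs` are hypothetical.

```python
def greedy_disjoint_pairs(ordered_pairs, k):
    """Walk an ordered list and keep the first k pairs that share no gene."""
    used, chosen = set(), []
    for i, j in ordered_pairs:
        if i in used or j in used:
            continue                      # a gene of this pair is already taken
        chosen.append((i, j))
        used.update((i, j))
        if len(chosen) == k:
            break
    return chosen

def prune_pairs(lower, upper, k_max):
    """Return the pairs whose upper bound reaches the threshold L.

    lower, upper: dicts mapping (i, j) to the lower and upper bound.
    """
    # Order all pairs by decreasing lower bound.
    ordered = sorted(lower, key=lower.get, reverse=True)
    # Select 2 * k_max disjoint pairs from the ordered list.
    disjoint = greedy_disjoint_pairs(ordered, 2 * k_max)
    # Threshold L is the lower bound of the last selected pair.
    threshold = lower[disjoint[-1]]
    # Keep every pair that could still reach the threshold.
    return [p for p in ordered if upper[p] >= threshold]

# Toy bounds for four pairs over genes 0..3.
lower = {(0, 1): 0.90, (0, 2): 0.85, (2, 3): 0.80, (1, 3): 0.10}
upper = {(0, 1): 1.00, (0, 2): 0.95, (2, 3): 0.90, (1, 3): 0.50}
pruned = prune_pairs(lower, upper, k_max=1)
```

In the toy example the two disjoint pairs are (0, 1) and (2, 3), so the threshold is 0.80 and only the pair (1, 3), whose upper bound is 0.50, is dropped. If fewer than 2·k_max disjoint pairs exist, this sketch simply uses the last one found.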


Pruning Algorithm
Input: Training sample S of P genes and N arrays, $k_{\max}$ and n.
Output: Pruned list of pairs.
1. For every pair of genes (i, j), compute the lower bound $L_{ij}(n)$ and upper bound $U_{ij}(n)$ for the score that can be obtained when a subset of at most n arrays is left out.
2. Create a list of all of the pairs (i, j) in descending order according to the value of $L_{ij}(n)$.
3. Apply Step 4 of the k-TSP algorithm to the ordered list to create a list of $2k_{\max}$ disjoint pairs.
4. Take $L = L_{ij}(n)$ where $(i, j) = (i_{2k_{\max}}, j_{2k_{\max}})$.
5. Take the pruned list to consist of all pairs (i, j) for which $U_{ij}(n) \ge L$.
6. Return the pruned list.
Supplementary Figure 2: Description of the pruning algorithm.

Lower and Upper Bounds

We now describe the lower and upper bounds $L_{ij}(n)$ and $U_{ij}(n)$ used in the pruning algorithm. Consider a pair (i, j), fixed for the remainder of the discussion. Each of the N samples can be classified into a two by two table according to class and the relative expression levels to give counts a, b, c, and d, as in the following table.

        x_i < x_j    x_i > x_j
C1          a            b
C2          c            d

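For these counts, the score is given by $\Delta_{ij} = s(a, b, c, d)$, where

$$s(a, b, c, d) = \begin{cases} \dfrac{a}{a+b} - \dfrac{c}{c+d}, & \text{if } \dfrac{a}{a+b} \ge \dfrac{c}{c+d}, \\[6pt] \dfrac{b}{a+b} - \dfrac{d}{c+d}, & \text{if } \dfrac{a}{a+b} < \dfrac{c}{c+d}. \end{cases}$$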
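The count table and the score translate directly into code. The sketch below is illustrative: the function names are made up for the example, ties x_i = x_j are assumed not to occur, and both classes are assumed to be non-empty.

```python
def counts(xi, xj, labels):
    """Fill the 2x2 table: rows are classes C1 (label 0) and C2 (label 1)."""
    a = sum(1 for u, v, c in zip(xi, xj, labels) if c == 0 and u < v)
    b = sum(1 for u, v, c in zip(xi, xj, labels) if c == 0 and u > v)
    c_ = sum(1 for u, v, c in zip(xi, xj, labels) if c == 1 and u < v)
    d = sum(1 for u, v, c in zip(xi, xj, labels) if c == 1 and u > v)
    return a, b, c_, d

def s(a, b, c, d):
    """Score from the counts, following the two-branch definition."""
    p1 = a / (a + b)          # frequency of x_i < x_j in C1
    p2 = c / (c + d)          # frequency of x_i < x_j in C2
    if p1 >= p2:
        return p1 - p2
    return b / (a + b) - d / (c + d)

xi = [1, 1, 1, 3, 1, 3, 3, 3]
xj = [2, 2, 2, 2, 2, 2, 2, 2]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
table = counts(xi, xj, labels)
score = s(*table)
```

Both branches give the same value as the absolute difference |a/(a+b) - c/(c+d)|, since b/(a+b) - d/(c+d) equals c/(c+d) - a/(a+b).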

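For the lower bound we define

$$L_{ij}(n) = \begin{cases} \min\{s(a-n, b, c, d),\ s(a, b, c, d-n)\}, & \text{if } \dfrac{a}{a+b} \ge \dfrac{c}{c+d} \text{ and } a, d \ge n, \\[6pt] \min\{s(a, b-n, c, d),\ s(a, b, c-n, d)\}, & \text{if } \dfrac{a}{a+b} < \dfrac{c}{c+d} \text{ and } b, c \ge n, \\[6pt] 0, & \text{otherwise.} \end{cases}$$

Similarly, we take the upper bound as

$$U_{ij}(n) = \begin{cases} \max\{s(a, b-n, c, d),\ s(a, b, c-n, d)\}, & \text{if } \dfrac{a}{a+b} \ge \dfrac{c}{c+d} \text{ and } b, c \ge n, \\[6pt] \max\{s(a-n, b, c, d),\ s(a, b, c, d-n)\}, & \text{if } \dfrac{a}{a+b} < \dfrac{c}{c+d} \text{ and } a, d \ge n, \\[6pt] 1, & \text{otherwise.} \end{cases}$$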
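The two bound formulas can be transcribed as follows. This is a sketch of the formulas as written, not a verified implementation: the function names are hypothetical, and it assumes n is small relative to the class sizes, so that removing n arrays neither empties a class (which would divide by zero) nor reverses the inequality between a/(a+b) and c/(c+d).

```python
def s(a, b, c, d):
    """Score from the 2x2 counts (two-branch definition)."""
    p1, p2 = a / (a + b), c / (c + d)
    if p1 >= p2:
        return p1 - p2
    return b / (a + b) - d / (c + d)

def lower_bound(a, b, c, d, n):
    """Lower bound on the score when at most n arrays are left out."""
    first_branch = a / (a + b) >= c / (c + d)
    if first_branch and a >= n and d >= n:
        return min(s(a - n, b, c, d), s(a, b, c, d - n))
    if not first_branch and b >= n and c >= n:
        return min(s(a, b - n, c, d), s(a, b, c - n, d))
    return 0.0                      # trivial bound

def upper_bound(a, b, c, d, n):
    """Upper bound on the score when at most n arrays are left out."""
    first_branch = a / (a + b) >= c / (c + d)
    if first_branch and b >= n and c >= n:
        return max(s(a, b - n, c, d), s(a, b, c - n, d))
    if not first_branch and a >= n and d >= n:
        return max(s(a - n, b, c, d), s(a, b, c, d - n))
    return 1.0                      # trivial bound

# Example counts: score 0.8 - 0.3 = 0.5 on the full data, one array left out.
low = lower_bound(8, 2, 3, 7, 1)
high = upper_bound(8, 2, 3, 7, 1)
```

For the counts (8, 2, 3, 7) and n = 1, the lower bound comes from removing one array from cell d and the upper bound from removing one array from cell b.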

The fact that these functions have the desired properties reduces to the following.

Claim: If $n_a$, $n_b$, $n_c$, and $n_d$ are nonnegative integers with $n_a + n_b + n_c + n_d \le n$, then

$$s(a - n_a, b - n_b, c - n_c, d - n_d) \ge L_{ij}(n)$$

and

$$s(a - n_a, b - n_b, c - n_c, d - n_d) \le U_{ij}(n).$$

Proof: We give the proof of the claim only for the lower bound, as the proof for the upper bound is similar. Suppose that $\frac{a}{a+b} \ge \frac{c}{c+d}$ and $a, d \ge n$. Since, in this range, $s(a, b, c, d)$ is a decreasing function of b and c, we have

$$s(a - n_a, b - n_b, c - n_c, d - n_d) \ge s(a - n_a, b, c, d - n_d),$$

and since $s(a, b, c, d)$ is an increasing function of a and d, so that $s(a - n_a, b, c, d - n_d)$ decreases as $n_a$ and $n_d$ grow, we can assume that the constraint is tight, $n_a + n_d = n$, that is, $n_d = n - n_a$. A straightforward calculation shows that $s(a - n_a, b, c, d - (n - n_a))$ is a concave function of $n_a$, so the minimum of this function, subject to the constraint that $0 \le n_a \le n$, occurs when either $n_a = 0$ or $n_a = n$, so

$$s(a - n_a, b, c, d - n_d) \ge \min\{s(a - n, b, c, d),\ s(a, b, c, d - n)\}.$$

Combining this inequality with the one above, we can conclude that

$$s(a - n_a, b - n_b, c - n_c, d - n_d) \ge L_{ij}(n).$$

The proof for the case when $\frac{a}{a+b} < \frac{c}{c+d}$ and $b, c \ge n$ is analogous. If neither of the two inequality conditions considered holds then 0 is a trivial lower bound.
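The concavity step in the proof can be spelled out as follows. The derivation writes x for $n_a$ and treats s by its first-branch expression over the whole range $0 \le x \le n$. That is an assumption: it requires that removing arrays does not reverse the inequality between the two class frequencies, and that $a + b - x > 0$ and $c + d - n + x > 0$.

```latex
\begin{aligned}
f(x) &= s\bigl(a - x,\; b,\; c,\; d - (n - x)\bigr)
      = \frac{a - x}{a + b - x} - \frac{c}{c + d - n + x} \\
     &= 1 - \frac{b}{a + b - x} - \frac{c}{c + d - n + x},
      \qquad 0 \le x \le n, \\[4pt]
f''(x) &= -\frac{2b}{(a + b - x)^{3}} - \frac{2c}{(c + d - n + x)^{3}} \;\le\; 0 .
\end{aligned}
```

Since both denominators are positive and b, c are nonnegative, f is concave on the interval, so its minimum over $0 \le x \le n$ is attained at an endpoint, x = 0 or x = n.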


3. Multi-class decomposition schemes

Supplementary Figure 3: Multi-class decomposition schemes. (a) 1-vs-r, (b) 1-vs-1 and (c) HC where number of classes M = 4 and hi denotes the binary classifier in each decomposition scheme.
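As a rough illustration of how the three schemes differ in size, the sketch below counts the binary classifiers each one needs for M classes. The counts for 1-vs-r and 1-vs-1 are standard. For HC the sketch assumes a binary tree with one classifier per internal node, which is an assumption, since the tree construction is not described in this supplement. The function name is hypothetical.

```python
def n_binary_classifiers(M, scheme):
    """Number of binary classifiers h_i needed for M classes."""
    if scheme == "1-vs-r":
        return M                    # one classifier per class against the rest
    if scheme == "1-vs-1":
        return M * (M - 1) // 2     # one classifier per unordered class pair
    if scheme == "HC":
        return M - 1                # assumed: one per internal node of a binary tree
    raise ValueError(f"unknown scheme: {scheme}")

# The case shown in Supplementary Figure 3.
sizes = {name: n_binary_classifiers(4, name) for name in ("1-vs-r", "1-vs-1", "HC")}
```

If each TSP classifier contributes a single gene pair, the number of genes used would be twice these counts. That is consistent with the TSP rows of Supplementary Table 2, where for example a column reading 6, 6, 4 corresponds to M = 3 under this assumption.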


4. Additional Results

Supplementary Table 1: Accuracy (%) of classifiers on the independent test set for multi-class expression data sets. The best prediction rate for each data set is highlighted in boldface.

Method          Leuk1    Lung1    Leuk2    SRBCT    Breast   Lung2    DLBCL    Leuk3    Cancers  GCM      Average
TSP    1-vs-1   91.18    56.25    60.00    100.00   50.00    92.54    86.67    83.04    59.46    56.52    73.56
TSP    1-vs-r   91.18    71.88    86.67    70.00    56.67    89.55    63.33    56.25    59.46    32.61    67.76
TSP    HC       97.06    71.88    80.00    95.00    66.67    83.58    83.33    77.68    74.32    52.17    78.17
k-TSP  1-vs-1   97.06    75.00    66.67    95.00    80.00    94.03    83.33    91.07    63.51    54.35    80.00
k-TSP  1-vs-r   91.18    84.38    86.67    90.00    73.33    94.03    63.33    56.25    66.22    52.17    75.76
k-TSP  HC       97.06    78.13    100.00   100.00   66.67    94.03    83.33    82.14    82.43    67.39    85.12
DT              85.29    78.13    80.00    75.00    73.33    88.06    86.67    75.89    68.92    52.17    76.35
NB              85.29    81.25    100.00   60.00    66.67    88.06    86.67    32.14    79.73    52.17    73.20
k-NN            67.65    75.00    86.67    30.00    63.33    88.06    93.33    75.89    64.86    34.78    67.96
SVM    1-vs-1   79.41    87.50    100.00   100.00   83.33    97.01    100.00   84.82    83.78    65.22    88.11
SVM    1-vs-r   79.41    84.38    100.00   100.00   76.67    94.03    100.00   75.00    60.81    45.65    81.59
PAM             97.06    78.13    93.33    95.00    93.33    100.00   90.00    93.75    87.84    56.52    88.50

Supplementary Table 2: Number of genes used in the classifiers for multi-class expression data sets.

Method          Leuk1    Lung1    Leuk2    SRBCT    Breast   Lung2    DLBCL    Leuk3    Cancers  GCM
TSP    1-vs-1   6        6        6        12       20       20       30       42       110      182
TSP    1-vs-r   6        6        6        8        10       10       12       14       22       28
TSP    HC       4        4        4        6        8        8        10       12       20       26
k-TSP  1-vs-1   54       26       22       44       72       52       66       202      274      274
k-TSP  1-vs-r   22       46       46       36       54       50       72       98       98       136
k-TSP  HC       36       20       24       30       24       28       46       64       128      134
DT              2        4        2        3        4        5        5        16       10       18
PAM             44       13       62       285      4822     614      3949     3338     2008     1253
