IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. XX, NO. XX, XX 2009


A Partial Set Covering Model for Protein Mixture Identification Using Mass Spectrometry Data

Zengyou He, Can Yang, and Weichuan Yu

Abstract—Protein identification is a key and essential step in mass spectrometry (MS) based proteome research. To date, there are many protein identification strategies that employ either MS data or MS/MS data for database searching. While MS-based methods provide wider coverage than MS/MS-based methods, their identification accuracy is lower since MS data carry less information than MS/MS data. Thus, it is desirable to design more sophisticated algorithms that achieve higher identification accuracy using MS data. Peptide Mass Fingerprinting (PMF) has been widely used to identify single purified proteins from MS data for many years. In this paper, we extend this technology to protein mixture identification. First, we formulate the problem of protein mixture identification as a Partial Set Covering (PSC) problem. Then, we present several algorithms that solve the PSC problem efficiently. Finally, we extend the partial set covering model to both MS/MS data and the combination of MS data and MS/MS data. The experimental results on simulated data and real data demonstrate the advantages of our method: (1) it outperforms previous MS-based approaches significantly; (2) it is useful in MS/MS-based protein inference; and (3) it combines MS data and MS/MS data in a unified model such that the identification performance is further improved.

Index Terms—Protein Identification, Proteomics, Peptide Mass Fingerprinting, Mass Spectrometry, Set Covering, Linear Programming, Optimization


1 INTRODUCTION

PROTEIN identification from MS data or MS/MS data is a key proteomics technology. Many protein identification strategies have been proposed; for a comprehensive review, please refer to [1]. Typical identification strategies include Peptide Mass Fingerprinting (PMF) [2], [3], [4], [5], [6], MS/MS-based database search [7] and de novo sequencing [8]. We can categorize existing protein identification strategies according to the type of input data: PMF takes single-stage MS data as input, while MS/MS-based database search and de novo sequencing require MS/MS data. The MS/MS-based method is probably the most widely used identification approach nowadays. However, one inherent disadvantage of such methods is that they cannot perform tandem mass spectrometry scanning on every single ion, leading to an incomplete identification of peptides. Single-stage MS data, in contrast, have broader mass coverage. Therefore, we may use PMF to discover proteins whose peptide digestion products are not selected for MS/MS sequencing. Besides, we can either combine PMF with the MS/MS-based method [9] or use it to extract protein profiles in profiling-based biomarker discovery [10]. Initially, PMF was used to identify single purified proteins separated by two-dimensional gel electrophoresis (2D gels).

• Z. He, C. Yang and W. Yu are with the Laboratory for Bioinformatics and Computational Biology, Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China. E-mail: [email protected], [email protected], [email protected].

Manuscript received 2 Dec. 2008; revised 5 May 2009; accepted 17 May 2009; published online XX 2009.

To date, many popular PMF tools such as Mascot [11], ProFound [12], Protein Prospector [13] and Aldente [14] have been developed. Please refer to [15] for a survey of the history of PMF before 2003 and to [16] for a summary of available PMF tools. To further improve the identification accuracy, many new algorithms have been proposed recently (e.g., [17], [18], [19], [20], [21], [22]). Note that all these methods focus on single protein identification rather than protein mixture identification.

Some methods have also been proposed to identify protein mixtures using PMF [23], [24], [25]. Jensen et al. [23] proposed a subtraction strategy in which proteins are identified in an iterative manner. In each iteration, the most probable protein (the one with the highest ranking score) is identified. Then, the peaks matching this identified protein are removed prior to the next iteration. This procedure terminates after sufficient proteins have been identified. Park and Russell [24] as well as Eriksson and Fenyö [25] also used the same strategy for protein mixture identification. Though the subtraction approach is effective in identifying simple protein mixtures, it is highly heuristic and its performance deteriorates on complex and noisy protein mixtures.

In this paper, we first formulate the problem of protein mixture identification as a Partial Set Covering (PSC) problem. More precisely, we take the input peak list as the ground set to be covered and regard each candidate protein as a subset of matched peaks. In addition, the cost of each protein is modeled as the number of its theoretically digested peptides. The objective is to find a subcollection of proteins that has minimal cost and covers at least a fixed fraction of the peaks. While many algorithms have been proposed to solve


the PSC problem, only a few of them are capable of handling large-scale data in practice. To balance effectiveness and efficiency, we suggest three algorithms: a greedy algorithm, a linear programming (LP) rounding algorithm and a dual feasible algorithm. All these algorithms achieve good identification accuracy and running efficiency. We also show that the PSC model is applicable to MS/MS-based protein identification. With minor modifications, we present a generalized PSC model that can combine MS data and MS/MS data to identify proteins. One limitation of the PSC model is that it requires the user to specify the desired covering fraction. To address this issue, we develop an estimation method that suggests a covering fraction automatically. To demonstrate the advantages of our methods, we conduct experiments on both simulated data and real data. The experimental results show that our methods outperform previous MS-based approaches significantly. In addition, our methods are very useful in identifying proteins using only MS/MS data or the combination of MS data and MS/MS data.

The rest of the paper is organized as follows: Section 2 formulates the problem and introduces the algorithms. Section 3 shows the experimental results. Section 4 concludes the paper.

2 METHODS

In this section, we first formulate the protein mixture identification problem as a partial set covering problem. Then, we suggest three effective algorithms that are capable of performing large-scale protein mixture identification. Finally, we extend the model to handle data sets that contain both MS data and MS/MS data and propose an algorithm for automatic covering fraction estimation.

2.1 Models

In MS-based protein mixture identification, suppose we have a database of n proteins D = {d1, d2, ..., dn} and a set of m experimental peaks Z = {z1, z2, ..., zm} as input. Our objective is to find the set of proteins from D that generated these peaks. The value of n depends on the protein database we use. For instance, there are more than 260,000 proteins in the Swiss-Prot database (Release 52). The value of m varies from several hundred to ten thousand. In the real MS data studied in this paper, we have more than 3000 peaks.

After protease (such as trypsin) digestion, a protein dj will produce a set of peptides Tj = {tj1, tj2, ..., tjnj}. Ideally, each peptide should correspond to a peak in the mass spectrum. Due to the imperfection of sample preparation, mass spectrometry scanning, and other factors, we often observe that some expected peaks are missing, while some noisy peaks are introduced.


Usually, we use a user-specified mass tolerance threshold σ to define the peak matching criterion: one experimental peak is considered to correspond to a theoretical peak if their distance is not larger than σ. Given the mass tolerance threshold σ, we define the set of peaks Sj corresponding to a protein dj as:

Sj = {zi | zi ∈ Z, ∃ tjk ∈ Tj, |zi − tjk| ≤ σ}.  (1)
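As an illustration, the matching rule of Eq. (1) can be sketched in a few lines (a sketch only; the function name and data layout are ours, not part of the paper):

```python
# Sketch of the peak-matching rule in Eq. (1): an experimental peak z_i
# belongs to S_j when it lies within the mass tolerance sigma of at least
# one theoretical peptide mass of protein d_j. All names are illustrative.

def matched_peaks(peaks, peptide_masses, sigma):
    """Return the set of peak indices 'explained' by one protein."""
    return {
        i
        for i, z in enumerate(peaks)
        if any(abs(z - t) <= sigma for t in peptide_masses)
    }

# Toy example: three experimental peaks, two theoretical peptides.
peaks = [1000.50, 1200.00, 1500.25]
peptides = [1000.48, 1500.30]
S_j = matched_peaks(peaks, peptides, sigma=0.1)  # peaks 0 and 2 match
```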

Here Sj is the subset of experimental peaks that can be "explained" by protein dj. The size of Sj reflects the power of protein dj in interpreting the observed peak list. Assuming that random peak matches also occur, |Sj| is proportional to |Tj|, where | · | denotes the size of a set. In other words, if one protein has more theoretical peptides, it has a larger probability of randomly matching more observed peaks. Hence, we define the cost of each Sj as:

wj = |Tj|.  (2)

Alternatively, we can incorporate the number of missing peaks as a penalty into the cost function:

wj = |Tj| − |Sj|.  (3)

On the one hand, Eq. (3) is better than Eq. (2) in the sense that it gives preference to proteins with fewer missing peaks when they have the same number of theoretical peptides. On the other hand, Eq. (3) risks underestimating the cost of longer proteins since the number of random matches in Sj is proportional to |Tj|. In our implementation, we use Eq. (2) as the default setting and provide Eq. (3) as an alternative choice. In the experimental study, we found that they exhibit almost identical performance. Thus, we omit the comparison between these two cost functions in the experimental section.

The objective of protein mixture identification is to find a set of proteins that "best" explains Z. Here we decompose the high-level "best" criterion into two computational criteria:
1) Maximum coverage: the number of covered experimental peaks should be maximized.
2) Minimum cost: the total cost should be minimized.
We have to make a trade-off between these two criteria since they conflict with each other. A natural formulation is the well-known set covering (SC) problem: identify a minimum-cost subcollection from S = {S1, S2, ..., Sn} such that it covers all elements in Z. Formally, let J = {1, 2, ..., n} denote the set of protein indices; our objective is to identify a subset C ⊆ J such that:

(SC)  minimize_{C⊆J}  ∑_{j∈C} wj  (4)
      subject to  |∪_{j∈C} Sj| = m,  (5)

where m is the number of peaks.


Unfortunately, the SC formulation is unrealistic since it requires full coverage of the peaks. In real MS data, there is always a large portion of noisy peaks that are not generated by the ground-truth proteins. Motivated by this observation, we relax the coverage requirement so that some peaks may remain uncovered. This leads to a partial set covering (PSC) problem:

(PSC)  minimize_{C⊆J}  ∑_{j∈C} wj  (6)
       subject to  |∪_{j∈C} Sj| ≥ pm,  (7)

where p is the expected covering fraction (0 < p ≤ 1). In the PSC problem, p is specified by the user, and the goal is to find a collection of sets with minimum cost covering at least a p-fraction of the ground set Z. The PSC formulation for protein mixture identification offers the following advantages:
• The covering fraction p has a clear meaning in practice, i.e., the expected fraction of non-noisy peaks. This value depends on the experimental procedures of MS data generation.
• During the last decade, many effective algorithms have been proposed to solve the PSC problem [26], [27], [28], [29], [30], [31]. We can apply some of these algorithms to protein mixture identification directly.

2.2 Algorithms

While many algorithms are available for solving the PSC problem, not all of them are feasible in our context. For instance, the randomized algorithm in [27] has a time complexity exponential in the number of covered elements. Here, we re-use and design three scalable algorithms: a greedy algorithm, an LP rounding algorithm and a dual feasible algorithm.

In our original identification problem, the input includes a protein database D, an experimental peak list Z and a mass tolerance threshold σ. To describe the proposed algorithms in a concise manner, we first compute the peak set Sj and the corresponding cost value wj of each protein dj according to σ. Then, we prune the protein database using a parameter k (k ≪ n). More precisely, we retain only the top k proteins with the largest values of |Sj| as candidates. In addition, we delete from Z the peaks that do not appear in any Sj. Without loss of generality, we still use n to denote the number of proteins and m to denote the number of peaks after pruning. The above procedure provides us with a transformed input: an experimental peak list Z = {z1, z2, ..., zm}, a family S = {S1, S2, ..., Sn} of subsets of Z with a corresponding set of cost values w = {w1, w2, ..., wn}, and a covering fraction p. The output C is a subset of protein indices, C ⊆ J with J = {1, 2, ..., n}. Moreover, we use I = {1, 2, ..., m} to denote the set of peak indices.
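The pruning step above can be sketched as follows (a minimal sketch; the dictionary-based data layout and function name are our own illustration, not the authors' implementation):

```python
# Sketch of the pruning step: keep only the k candidate proteins with the
# most matched peaks and drop peaks matched by none of them.
# 'protein_peak_sets' maps protein id -> set of matched peak indices;
# 'costs' maps protein id -> w_j. All names are illustrative.

def prune_candidates(protein_peak_sets, costs, k):
    # Retain the top-k proteins by |S_j|.
    top = sorted(protein_peak_sets,
                 key=lambda j: len(protein_peak_sets[j]),
                 reverse=True)[:k]
    S = {j: set(protein_peak_sets[j]) for j in top}
    w = {j: costs[j] for j in top}
    # Keep only peaks that appear in at least one retained S_j.
    covered = set().union(*S.values()) if S else set()
    return S, w, covered

S, w, peaks_kept = prune_candidates(
    {"P1": {0, 1, 2}, "P2": {2, 3}, "P3": {4}},
    {"P1": 5, "P2": 3, "P3": 2},
    k=2,
)
# Only P1 and P2 survive; peak 4 leaves the ground set.
```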

2.2.1 Greedy Algorithm

A greedy algorithm is the most natural heuristic for the set covering problem. It works by selecting one set at a time, namely the one that covers the most elements among the uncovered ones [32]. As shown in Algorithm 1, the greedy algorithm for the PSC problem works in the same way as for the SC problem except that the stopping criterion becomes |∪_{j∈C} Sj| ≥ pm.

Algorithm 1 Greedy Algorithm (S, w, p)
  Initialize C ← ∅; l ← 0
  while |∪_{j∈C} Sj| < pm do
    l ← l + 1
    Select a set Sjl such that |Sjl|/wjl = max_{j∈J} |Sj|/wj
    Set C ← C ∪ {jl}; Sj ← Sj \ Sjl for all j ∈ J; J ← J \ {jl}
  end while
  return C

The greedy algorithm has the following salient features:
• It is very fast: its time complexity is O(mnr), where r = |C|, m is the number of peaks and n is the number of candidate proteins.
• It guarantees an approximation ratio of H(⌈pm⌉) [26], where H(⌈pm⌉) = ∑_{i=1}^{⌈pm⌉} (1/i) ≤ 1 + ln(pm). This means that the greedy algorithm always obtains a solution whose cost is at most H(⌈pm⌉) times that of the optimal solution.
Moreover, it establishes the connection between the PSC model and the subtraction strategy in [23]. In fact, we can regard the greedy algorithm as a special case of the subtraction strategy in which |Sj|/wj is used as the protein identification score. In the context of protein identification, |Sj| is also called the shared peak count [1]. Hence, |Sj|/wj is actually the normalized shared peak count, since it ranges between 0 and 1 when wj = |Tj|. It is well recognized that such a simple scoring method cannot achieve good performance in single protein identification. To our surprise, it performs extremely well and outperforms more complicated scoring methods such as Piums [33] and ProFound [12] in our experiments on protein mixture identification. The reason is probably two-fold:
• |Sj|/wj has a theoretical justification under the PSC model since the greedy algorithm provides a good performance guarantee.
• Simple scoring methods are more robust for protein mixture identification since they are not very sensitive to random matching. More precisely, the shared peak count is linear in the number of peak matches, while other sophisticated scores are nonlinear in the number of peak matches.
Note that the greedy algorithm is just one possible choice for solving the PSC problem. There are many other algorithms that are totally different from this greedy heuristic and the subtraction strategy.
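Algorithm 1 can be sketched as a short runnable program (a sketch under our own data layout, with dictionaries mapping protein identifiers to matched-peak sets and costs; ties are broken arbitrarily):

```python
# Sketch of Algorithm 1 (greedy PSC). S: dict protein -> set of peak
# indices, w: dict protein -> cost, m: number of peaks, p: covering
# fraction. Illustrative, not the authors' implementation.

def greedy_psc(S, w, m, p):
    S = {j: set(s) for j, s in S.items()}    # work on copies
    C, covered = [], set()
    while len(covered) < p * m and S:
        # Pick the set maximizing the normalized shared peak count |S_j|/w_j.
        j = max(S, key=lambda j: len(S[j]) / w[j])
        if not S[j]:
            break                             # nothing left to cover
        C.append(j)
        covered |= S[j]
        chosen = S.pop(j)
        for s in S.values():                  # subtract newly covered peaks
            s -= chosen
    return C

# Toy instance: 5 peaks, cover at least 80% of them.
C = greedy_psc({"A": {0, 1, 2}, "B": {2, 3}, "C": {3, 4}},
               {"A": 3, "B": 2, "C": 2}, m=5, p=0.8)
# C now covers at least 4 of the 5 peaks.
```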


2.2.2 LP Rounding Algorithm

The linear programming (LP) rounding technique is very popular in the design of approximation algorithms. A general LP rounding method consists of the following steps:
1) Formulate the optimization problem as an integer programming problem.
2) Relax the integrality constraints to obtain an LP problem, which can be solved in polynomial time.
3) Round the fractional solution to an integral solution.
The PSC problem can be formulated as an integer program (PSC IP). In this formulation, the variable xj indicates whether we select Sj (xj = 1 if j ∈ C), whereas the variable yi indicates whether a peak zi is left uncovered (yi = 0 if zi ∈ ∪_{j∈C} Sj). Mathematically, constraint (9) guarantees that we either pick at least one set that contains zi or declare this element uncovered by setting yi = 1; constraint (10) forces any feasible solution to cover at least pm peaks; constraints (11) and (12) force xj and yi to be binary variables, respectively.

(PSC IP)  minimize_{x,y}  ∑_{j∈J} wj xj  (8)
          subject to  ∑_{j: zi∈Sj} xj + yi ≥ 1, i ∈ I  (9)
                      ∑_{i∈I} yi ≤ m(1 − p)  (10)
                      xj ∈ {0, 1}, j ∈ J  (11)
                      yi ∈ {0, 1}, i ∈ I.  (12)

The corresponding LP relaxation below is obtained by setting the domain of xj and yi to be 0 ≤ xj, yi ≤ 1. Notice that the upper bound on xj and yi is unnecessary and is thus dropped in constraints (16) and (17).
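For illustration, the relaxation can be laid out in the standard form min cᵀu subject to A_ub·u ≤ b_ub, u ≥ 0 that generic LP solvers accept, with u = (x1, ..., xn, y1, ..., ym). The helper below is our own sketch, not the authors' implementation:

```python
# Sketch: constraint (14) becomes -(sum_{j: z_i in S_j} x_j) - y_i <= -1,
# and constraint (15) becomes sum_i y_i <= m(1 - p). S is a list of sets
# of peak indices; all names are illustrative. Any LP solver accepting
# this standard form could then be invoked on (c, A_ub, b_ub).

def build_psc_lp(S, w, m, p):
    n = len(S)
    c = list(w) + [0.0] * m            # objective: sum_j w_j x_j
    A_ub, b_ub = [], []
    for i in range(m):                 # one covering constraint per peak
        row = [-1.0 if i in S[j] else 0.0 for j in range(n)]
        row += [-1.0 if k == i else 0.0 for k in range(m)]
        A_ub.append(row)
        b_ub.append(-1.0)
    A_ub.append([0.0] * n + [1.0] * m)  # sum_i y_i <= m(1 - p)
    b_ub.append(m * (1.0 - p))
    return c, A_ub, b_ub

c, A, b = build_psc_lp([{0, 1}, {1, 2}], [2.0, 3.0], m=3, p=0.9)
```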

(PSC LP Primal)  minimize_{x,y}  ∑_{j∈J} wj xj  (13)
                 subject to  ∑_{j: zi∈Sj} xj + yi ≥ 1, i ∈ I  (14)
                             ∑_{i∈I} yi ≤ m(1 − p)  (15)
                             xj ≥ 0, j ∈ J  (16)
                             yi ≥ 0, i ∈ I.  (17)

In Algorithm 2, we present a simple LP rounding algorithm to solve the PSC problem. We first use the LP relaxation to find the fractional solution. Then, we select one set at a time, namely the one that maximizes the value of xjl (i.e., rounding xjl to 1), until pm peaks are covered.

Algorithm 2 LP Rounding Algorithm (S, w, p)
  Construct the LP relaxation of the PSC IP problem
  Invoke an LP solver to get an optimal solution x
  Initialize C ← ∅; l ← 0
  while |∪_{j∈C} Sj| < pm do
    l ← l + 1
    Select a set Sjl such that xjl = max_{j∈J} xj
    Set C ← C ∪ {jl}; J ← J \ {jl}
  end while
  return C

2.2.3 Dual Feasible Algorithm

The dual feasible algorithm solves the problem by finding a feasible solution of its dual problem [32], [29]. The dual problem of PSC is given as:

(PSC LP Dual)  maximize_{λ,v}  ∑_{i∈I} λi − v(m − pm)  (18)
               subject to  ∑_{i: zi∈Sj} λi ≤ wj, j ∈ J  (19)
                           λi ≤ v, i ∈ I  (20)
                           λi ≥ 0, i ∈ I  (21)
                           v ≥ 0.  (22)

The m dual variables (λi for each zi ∈ Z, i ∈ I) correspond to constraint (14), and the dual variable v corresponds to constraint (15) in the primal LP. The dual feasible algorithm (Algorithm 3) derives the dual solution implicitly. The dual information is placed in square brackets to indicate that it is not an indispensable part of the algorithm. Whenever a set is selected, its corresponding dual constraint becomes binding, as each of the elements covered by it is assigned an equal share of the set's (reduced) cost [32]. This algorithm works similarly to the greedy algorithm; the apparent difference is that it chooses a set using the reduced cost.

Algorithm 3 Dual Feasible Algorithm (S, w, p)
  Initialize C ← ∅; l ← 0; [λi = 0, i ∈ I]
  while |∪_{j∈C} Sj| < pm do
    l ← l + 1
    Select a set Sjl such that |Sjl|/wjl = max_{j∈J} |Sj|/wj
    Set C ← C ∪ {jl}; Sj ← Sj \ Sjl for all j ∈ J; J ← J \ {jl}
    Set wj ← wj − (wjl/|Sjl|)·|Sj| for all j ∈ J; [λi = wjl/|Sjl|, ∀zi ∈ Sjl]
  end while
  return C
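The dual feasible procedure can be sketched in Python as follows (a sketch, not the authors' implementation; the cost update here is one reasonable reading of the pseudocode, reducing each remaining set's cost by the share w_{jl}/|S_{jl}| for every element it loses to the chosen set, and all names are illustrative):

```python
# Sketch of the dual feasible algorithm. Same inputs as the greedy sketch;
# the implicit dual values lambda_i = w_{jl}/|S_{jl}| are tracked purely
# for illustration.

def dual_feasible_psc(S, w, m, p):
    S = {j: set(s) for j, s in S.items()}
    w = dict(w)
    lam = {}                                  # implicit dual solution
    C, covered = [], set()
    while len(covered) < p * m and S:
        j = max(S, key=lambda j: len(S[j]) / w[j])
        if not S[j]:
            break
        share = w[j] / len(S[j])              # equal share of the reduced cost
        for i in S[j]:
            lam[i] = share
        C.append(j)
        covered |= S[j]
        chosen = S.pop(j)
        w.pop(j)
        for jj, s in S.items():               # reduce costs of remaining sets
            w[jj] = max(w[jj] - share * len(s & chosen), 1e-9)
            s -= chosen
    return C, lam
```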

2.3 Extension to MS/MS Data

The MS/MS-based protein identification method first fragments some peptides to generate tandem MS spectra. It then searches these spectra against a database to identify peptides in the sample. Since the same peptide sequence may belong to different proteins, such a database search may lead to ambiguities in determining which proteins are indeed present. Correspondingly, inferring proteins from peptide identification results, known as the protein inference problem [34], is a challenging task. To solve the protein inference problem, many formulations and solutions have been proposed [34], [35], [36], [37], [38]. In particular, parsimonious (set covering) formulations are widely adopted [34], [36], [37]. The idea is to find a subset of proteins with minimum cost such that it covers all identified peptides.

While the MS/MS-based approach can identify tens of thousands of peptides in large-scale biological studies (e.g., [39], [40]) with a very low false discovery rate (FDR) of 1%–5%, there is still a small portion of false peptide identifications. Furthermore, the most popular method for estimating the FDR is the target-decoy search strategy [41]. Such a database-dependent estimation method often underestimates the error rate since many identifications from the target database are also incorrect. Therefore, it is desirable to leave those incorrect peptides uncovered in the stage of protein inference. This means that the principle of partial coverage instead of full coverage should be used in MS/MS-based protein inference.

It is straightforward to extend the partial set covering model to MS/MS-based protein inference: let Z = {z1, z2, ..., zm} be the set of identified peptides and define each Sj as:

Sj = Z ∩ Tj.  (23)

Then, the optimization problem remains the same and the three algorithms developed in the previous sections can be applied directly. In the context of MS/MS data, all proteins that contain at least one identified peptide are considered as candidate proteins in the optimization process. Here the covering fraction p can be interpreted as the expected percentage of true peptide identifications. Informally, we can consider (1 − p) as the desired FDR value.
For instance, if the expected FDR is 5%, then we set p = 0.95.

2.4 Extension to the Combination of MS Data and MS/MS Data

Single-stage MS data is complementary to MS/MS data in that it provides broader mass coverage but less sequence information. The combination of MS data and MS/MS data has been used in [9] to improve protein identification. Here we extend the partial set covering model to this combination. First, we use Z = {z1, z2, ..., zm} as the set of elements to be covered, where zi can be either an identified peptide or a peak. Since a peptide identified from an MS/MS spectrum is more informative than a single-stage MS peak, we assign different "benefit" values to different types of elements. More precisely, we introduce a set B = {b1, b2, ..., bm}, where each bi denotes the importance value of zi. In the current implementation, we use a user-specified parameter λ to generate bi: bi = λ if zi is an identified peptide and bi = 1 − λ otherwise. In general, λ should be larger than 0.5 (e.g., 0.9) to reflect the fact that identified peptides are more important than single-stage MS peaks. Then, the generalized partial set covering (GPSC) problem becomes:

(GPSC)

minimize_{C⊆J}  ∑_{j∈C} wj  (24)
subject to  ∑_{zi ∈ ∪_{j∈C} Sj} bi ≥ pt,  (25)

where p is the expected covering fraction (0 < p ≤ 1) and t = ∑_{i=1}^{m} bi is the sum of the elements in B. Obviously, the GPSC problem reduces to the standard partial set covering problem when bi = 1 (1 ≤ i ≤ m). The three PSC algorithms need the following minor modifications to handle the generalized PSC problem:
1) In all algorithms, we use ∑_{zi ∈ ∪_{j∈C} Sj} bi ≥ pt instead of |∪_{j∈C} Sj| ≥ pm as the stopping criterion.
2) In all algorithms, we use ∑_{zi∈Sj} bi to replace |Sj|.
3) In the LP rounding algorithm, we replace m(1 − p) with t(1 − p) in constraint (15).
In candidate protein selection, one protein is kept either when it contains at least one identified peptide or when it is among the best k proteins with respect to the number of matched single-stage MS peaks (k is an input parameter).

2.5 The Estimation of Covering Fraction

The covering fraction p is of primary importance in our algorithms. In this section, we study how to specify this parameter automatically under the setting of the generalized PSC model. Algorithm 4 describes the procedure to estimate the covering fraction. We have introduced the notations S, w and B in previous sections. Here we use F to denote a set of real numbers, each of which is used as a candidate covering fraction. For instance, we can set F = {0.1, 0.2, ..., 0.9}. Furthermore, we use Alg to denote the algorithm that solves the PSC problem. So far, we have three choices for Alg: the greedy algorithm, the LP rounding algorithm and the dual feasible algorithm.

Given f ∈ F, we obtain a subset of proteins C using Alg when the covering fraction is f. To evaluate the quality of C, one natural choice is u/v, where u = ∑_{zi ∈ ∪_{j∈C} Sj} bi is the "benefit" and v = ∑_{j∈C} wj is the "cost". We can consider u/v as the "covering efficiency" of C, i.e., a larger u/v indicates a better cover. In general, the value of this evaluation function decreases as the covering fraction increases. This is because our algorithms generate C in a greedy manner, making it harder for late-coming proteins to achieve the same level of covering efficiency (with respect to the uncovered elements). To compare different settings of the covering fraction, we use the modified covering efficiency (u/v) · f as the evaluation criterion. We report p = arg max_{f} (u/v) · f as the estimate of the covering fraction.

Algorithm 4 Covering Fraction Estimation Algorithm (S, w, B, F, Alg)
  Initialize E ← 0 and p ← 0
  for each f ∈ F do
    Set C ← Alg(S, w, f)
    Let u = ∑_{zi ∈ ∪_{j∈C} Sj} bi and v = ∑_{j∈C} wj
    if (u/v) · f > E then
      Set E ← (u/v) · f and p ← f
    end if
  end for
  return p
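Algorithm 4 can be sketched as follows (a sketch with illustrative names; toy_alg below is a trivial stand-in for Alg, not one of the paper's three solvers):

```python
# Sketch of Algorithm 4: try each candidate fraction f, run a PSC solver
# 'alg', and keep the f maximizing the modified covering efficiency
# (u/v) * f. 'benefits' gives b_i per element (all 1.0 in the plain PSC
# case). Illustrative, not the authors' implementation.

def estimate_covering_fraction(S, w, benefits, fractions, alg):
    best_score, best_p = 0.0, 0.0
    for f in fractions:
        C = alg(S, w, f)
        covered = set().union(*(S[j] for j in C)) if C else set()
        u = sum(benefits[i] for i in covered)   # total covered benefit
        v = sum(w[j] for j in C)                # total cost of the cover
        score = (u / v) * f if v > 0 else 0.0
        if score > best_score:
            best_score, best_p = score, f
    return best_p

# A toy solver standing in for Alg: pick the largest sets until an
# f-fraction of the ground set is covered (purely illustrative).
def toy_alg(S, w, f):
    m = len(set().union(*S.values()))
    C, covered = [], set()
    for j in sorted(S, key=lambda j: -len(S[j])):
        if len(covered) >= f * m:
            break
        C.append(j)
        covered |= S[j]
    return C

p_hat = estimate_covering_fraction(
    {"A": {0, 1, 2, 3}, "B": {4}}, {"A": 4, "B": 4},
    benefits={i: 1.0 for i in range(5)},
    fractions=[0.2, 0.8, 1.0], alg=toy_alg)
```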

3 RESULTS

We use both simulation data and real data to demonstrate the superiority of the PSC model in protein mixture identification. The evaluation criteria are standard performance metrics in information retrieval: precision, recall, and F1-measure:
• Precision (p) is the proportion of identified ground-truth proteins to all identified proteins.
• Recall (r) is the proportion of identified ground-truth proteins to all ground-truth proteins.
• F1-measure is the harmonic mean of recall and precision: 2pr/(p + r), with p and r defined as above.
In the experiments, we compare our algorithms against the following two generic identification algorithms:
• SPI (single protein identification) algorithm: It uses a standard PMF algorithm directly to identify proteins from the MS spectrum of the protein mixture. Concretely, the algorithm ranks each protein in the database separately. To handle a protein mixture, we report a set of top-ranked proteins as the result. In this paper, we select two single protein identification methods: Piums [33] and ProFound [12]. The first is representative of newly developed algorithms and the second is representative of current state-of-the-art PMF tools.
• Subtraction algorithm: The algorithm by Jensen et al. [23] is the only method that aims to identify proteins from a mixture of MS peaks. In the implementation, we repeat g iterations to identify g proteins as the output, where g is a user-specified parameter. In each iteration, we use a standard PMF algorithm to perform protein identification, and the matched peaks are removed prior to the next iteration of searching. To be consistent with SPI, we implement two versions of the subtraction algorithm, one version with Piums as the component PMF


algorithm and the other version with ProFound as the component PMF algorithm. (Though Mascot [11] is probably the most popular PMF method for single protein identification, it is difficult to directly embed it into the subtraction strategy since the technical details of Mascot are not publicly available.)

In the experiments, we let the SPI algorithm and the subtraction algorithm report g proteins, where g is the number of ground-truth proteins. Note that we usually do not know the true protein number in practice; thus, such a specification favors the performance of these two algorithms. Under this setting, the precision, recall and F1-measure are identical for both the SPI algorithm and the subtraction algorithm. We use the following parameters in protein identification: trypsin digestion with a maximum of one missed cleavage, mono-isotopic peaks, single charge state, and unrestricted protein mass.

3.1 Simulation Study

3.1.1 Simulator

Our simulator requires the following input parameters: mass error, sequence coverage, noise level and protein number.
• Mass error measures the difference between the theoretical mass and the observed mass.
• Sequence coverage denotes the ratio between the number of detected peptides and the number of all peptides within the mass acquisition range (800–4500 Da in our simulation).
• Noise level is the ratio between the number of man-made noisy peaks and the number of total peaks.
• Protein number is the number of ground-truth proteins in the mixture.
The data simulation process works as follows:
1) We randomly select a set of proteins from the sequence database (Swiss-Prot, Release 52) as the ground-truth proteins according to the protein number parameter.
2) We perform trypsin-based protein digestion in silico (allowing one missed cleavage) and simulate peptide detectability by retaining only a portion of the proteolytic peptides according to the sequence coverage parameter.
3) We alter the mass of each peptide by adding a number randomly generated from a zero-mean Gaussian distribution whose standard deviation equals the mass error parameter.
4) We add a set of noisy peaks that are randomly generated and uniformly distributed within the mass acquisition range according to the noise level parameter.

3.1.2 Performance Comparison

To compare different algorithms, we generate 4 groups of protein mixtures with different characteristics by varying



TABLE 1 Parameter setting in data simulation. Since the ground-truth peaks of one protein may be considered as noise to other proteins, we didn’t include noisy peaks in group d to study the effect of protein number on the identification performance.

Group | Mass Error (Da)        | Sequence Coverage  | Noise Level (%) | Protein Number
a     | 0.01, 0.02, 0.03, 0.04 | 0.3                | 50              | 20
b     | 0.02                   | 0.1, 0.2, 0.3, 0.4 | 50              | 20
c     | 0.02                   | 0.3                | 10, 30, 50, 70  | 20
d     | 0.02                   | 0.3                | 0               | 10, 40, 70, 100

[Fig. 1 here: panels (a1)-(d3) plot the average precision, recall and F1-measure of SPI(Piums), SPI(ProFound), Subtraction(Piums), Subtraction(ProFound), Greedy, LP Rounding and Dual Feasible against mass error, sequence coverage, noise level and protein number.]

Fig. 1. Identification performance comparison of different algorithms on the simulation data. In our algorithms, we set the covering fraction to 0.5 and fix the number of candidate proteins to 5000.

In each simulation group, we vary one parameter while fixing the other parameters (see Table 1). Under each specific parameter setting, we randomly create 10 protein mixtures and report the average performance of each algorithm. In database searching, the mass tolerance threshold is set to the known mass error for all PMF algorithms. Fig. 1 illustrates the identification performance of the different methods. The results show that our methods significantly outperform previous methods. We also observe that the greedy algorithm, the LP rounding algorithm and the dual feasible algorithm have similar performance. Furthermore, Fig. 1 shows that the mass error, the sequence coverage, the noise level and the number of proteins in the mixture all have a significant influence on the performance of PMF algorithms. Concretely, increasing the mass accuracy and the sequence coverage boosts performance, while increasing the noise level and the protein number deteriorates it.

We also compare the running time of the different algorithms in Fig. 2. Among our methods, the LP rounding

[Fig. 2 panels (a)-(d): average running time (s, log scale, roughly 10 to 10^3) plotted against mass error (0.01-0.04 Da), sequence coverage (0.1-0.4), noise level (10%-70%) and number of proteins (10-100), for the same seven algorithms as in Fig. 1.]

Fig. 2. Running time comparison of different algorithms on the simulation data. In our algorithms, we set the covering fraction to 0.5 and fix the number of candidate proteins to 5000.

algorithm is much more time-consuming than the other algorithms since it needs to find the optimal LP solution. The greedy algorithm and the dual feasible algorithm need similar execution time since they have the same time complexity. Furthermore, the running time of our algorithms is comparable to that of previous algorithms. This means that they achieve better identification performance while maintaining comparable running efficiency.

3.1.3 Parameter Estimation

To test the effectiveness of the covering fraction estimation procedure, we use F = {0.1, 0.2, ..., 0.9} as the set of candidate values for the covering fraction. We also plug each of the three proposed algorithms into the estimation process to test its performance. We plot both the estimated covering fraction and the ground truth in Fig. 3. Here the ground-truth covering fraction equals (1 − ε), where ε is the noise level used in generating the simulation data (see Table 1). In Fig. 3(a) and Fig. 3(b), the percentage of noisy peaks is fixed to 0.5. Thus, we consider 0.5 as the underlying true covering fraction in evaluating the estimation

method. We found that the estimated covering fraction varies from 0.4 to 0.6, approximating the true covering fraction within a reasonable range. In Fig. 3(c), the ground-truth covering fraction ranges from 0.9 down to 0.3. Interestingly, the estimation becomes more accurate as the noise level increases. This is a nice property since real MS data are always very noisy. We believe the reasons are the following. Our estimation is based on (u/v)f, which is a trade-off between u/v and f. As the candidate covering fraction value f increases, the corresponding u/v decreases, and the decreasing rate is roughly proportional to f. Given f and the ground-truth covering fraction f*, we have:

• If f > f*, we can assume that the covering efficiency value obtained at f is less than that at f*. This is because we have to cover at least (f − f*) noisy peaks, leading to a very low u/v value.
• If f < f*, there are two cases:
  – When f* is small (i.e., the noise level is high), the probability of reporting f* is very high because the difference in the covering efficiency is mainly determined by the covering fraction.
  – When f* is large (i.e., the noise level is low), the probability of obtaining the maximal covering efficiency value at f is very high because the difference in the covering efficiency is dominated by u/v.

In Fig. 3(d), the ground-truth covering fraction is fixed to 1 since there are no noisy peaks in the simulation data. It shows that increasing the number of proteins in the mixture decreases the estimated covering fraction. This is because the introduction of additional proteins in generating the simulation data can also bring "noisy peaks" (recall that we alter the mass of each peptide by adding a random number drawn from a Gaussian distribution with the mass error parameter as its standard deviation). Overall, the proposed estimation method provides a covering fraction that is practically useful. We will further clarify this point through the experiments on the real data.

[Fig. 3 panels: (a) mass error 0.01-0.04 Da; (b) sequence coverage 0.1-0.4; (c) noise level 10%-70%; (d) protein number 10-100; each plots the estimated covering fraction (%) of Greedy, LP Rounding and Dual Feasible against the ground truth.]

Fig. 3. Covering fraction estimation on the simulation data. We use the Greedy algorithm, the LP Rounding algorithm and the Dual Feasible algorithm as the component algorithm to estimate the covering fraction, respectively. In all tests, we set the number of candidate proteins to 5000.

3.2 Real Data

We use a standard mixture of 49 human proteins from the ABRF sPRG2006 study². The 74 labs participating in this study analyzed the sample using different techniques, e.g., 2D gels and LC-MS/MS. Here we select one data set that was generated on a linear ion trap (LTQ)-Orbitrap instrument. Note that most of the labs in this study did not provide an LTQ-Orbitrap data set. The use of such high-accuracy data enables us to obtain high-quality single-stage MS peaks. In the experiments, we search against a sequence database³ that consists of all Swiss-Prot human proteins plus some bonus proteins and contaminant compounds. To extract single-stage MS peaks, we use the Decon2LS software [42] and the VIPER software [43] to pre-process the raw LC-MS data with their default parameter settings. The final peak list contains 3366 de-convolved monoisotopic peaks. Note that we use all the single-MS peaks obtained from the raw MS data rather than only the single-MS peaks selected for MS/MS. In database searching, we set the mass tolerance threshold to 1 ppm since the mass accuracy of the data is very high.

To test the identification performance of our algorithms on MS/MS data, we use X!Tandem (version 2007.07.01.2) [44] to identify peptides from MS/MS spectra. The parameters used for peptide identification are: mono-isotopic masses, a mass tolerance of 2 Da for the precursor, a mass tolerance of 1 Da for fragment ions, a fixed modification on Cys, one missed cleavage site, and only b and y fragment ions are taken into account.

2. http://www.abrf.org/index.cfm/group.show/ProteomicsInformatics/ResearchGroup.53.htm
3. http://www.abrf.org/ResearchGroups/ProteomicsInformaticsResearchGroup/Studies/sPRGBICFASTA.zip


3.2.1 MS Data

Table 2 reports the identification results of different algorithms using only single-stage MS data. It shows that our algorithms achieve a significantly higher protein identification rate than previous PMF methods.

TABLE 2
Performance comparison on a real MS data set of 49 human proteins. Here the number of reported proteins for the SPI algorithm and the subtraction algorithm is 49, i.e., the number of ground-truth proteins. In our algorithms, the covering fraction is set to 0.2 and the number of candidate proteins is set to 5000.

Category     Algorithm      Precision  Recall  F1-Measure
SPI          Piums          24%        24%     24%
SPI          ProFound       24%        24%     24%
Subtraction  Piums          43%        43%     43%
Subtraction  ProFound       37%        37%     37%
PSC          Greedy         84%        53%     65%
PSC          LP Rounding    87%        55%     67%
PSC          Dual Feasible  84%        55%     66%
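The precision, recall and F1-measure values reported throughout these experiments follow the standard set-based definitions; a minimal sketch (the reported and ground-truth protein sets below are made up):

```python
def prf1(reported, truth):
    """Precision, recall and F1 of a reported protein set vs. ground truth.

    reported and truth are non-empty sets of protein identifiers.
    """
    tp = len(reported & truth)  # true positives: correctly reported proteins
    precision = tp / len(reported)
    recall = tp / len(truth)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Toy example: 3 of 4 reported proteins are among 6 ground-truth proteins.
reported = {"A", "B", "C", "D"}
truth = {"A", "B", "C", "E", "F", "G"}
p, r, f = prf1(reported, truth)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.75 0.5 0.6
```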

The covering fraction has a significant influence on the performance of our algorithms. To clarify this point, we vary the covering fraction from 0.1 to 0.9 and check the identification performance of the different algorithms. The results in Fig. 4(a), Fig. 4(b) and Fig. 4(c) show that these algorithms exhibit similar behavior under the same parameter setting. The precision decreases and the recall increases as the covering fraction increases. This is easy to understand since a larger covering fraction makes the algorithms report more proteins.

Since there is no ground truth for the covering fraction in the real MS data, we evaluate the covering fraction estimation method based on the identification performance. More precisely, if the estimated covering fraction is very close to the true percentage of non-noisy peaks, it is reasonable to expect that such a parameter setting will yield good identification performance. Fig. 4(d) plots the covering efficiency of different candidate fraction values. It shows that we obtain the maximum covering efficiency when the covering fraction is 0.2. According to the covering fraction estimation method, we would therefore recommend 0.2 to the user. Fig. 4(c) shows that this setting provides the best overall identification performance (F1-measure) among all candidate values. Furthermore, we observe that the trend of the covering efficiency coincides with the trend of the F1-measure very well.

3.2.2 MS/MS Data

To demonstrate the benefit of partial coverage over full coverage in the context of MS/MS-based protein inference, we vary the covering fraction p from 0.1 to 1. Note that p = 1 corresponds to full coverage. Fig. 5 presents the experimental results. It reveals that leaving a small portion of peptides uncovered improves


the overall identification performance in terms of F1-measure. This is because there are still some false identifications among the peptide-spectrum pairs. The partial set covering model can help alleviate the effect of these peptides in the stage of protein inference. Since the expected number of false identifications is relatively small, it is reasonable to assume that the percentage of false identifications is around 5%-10% (i.e., the covering fraction is 90%-95%). Here we would like to know whether we can obtain a good estimate automatically using our estimation algorithm. In Fig. 5(d), we use the covering efficiency as the evaluation criterion to find the best covering fraction. It indicates that 0.8 beats all other candidate values. If we assume that 0.9 is the ground-truth covering fraction, the estimated covering fraction does not match the ground truth perfectly. The reason is that the estimation result is less accurate when the ground-truth covering fraction is relatively large, as previously discussed in the simulation study section. On the other hand, we would also like to point out that this parameter setting still outperforms the full covering model and is the second best among all candidates.

3.2.3 The Combination of MS Data and MS/MS Data

The proposed algorithms also work when we combine MS data with MS/MS data. Since MS/MS-based peptide identification results are more informative than single-stage MS peaks, we use λ = 0.9 as the "benefit" value of each identified peptide. Correspondingly, the "benefit" value of each single-MS peak is 0.1. Fig. 6 plots the experimental results when the covering fraction varies from 0.1 to 0.9. This figure is similar to Fig. 5 since we place more weight on the MS/MS data. In Fig. 6(d), our estimation algorithm identifies 0.6 as the best covering fraction through the evaluation of covering efficiency values. Again, the estimated value is only the second best: the best covering fraction in terms of F1-measure is 0.7.
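A hedged sketch of the combined model described above: each element to be covered carries a "benefit" weight (0.9 for an MS/MS-identified peptide, 0.1 for a single-MS peak), and a greedy solver selects proteins until a fraction p of the total benefit is covered. The element names and the greedy solver below are illustrative assumptions, not the paper's code.

```python
def greedy_weighted_cover(candidates, benefit, p):
    """Greedy partial cover where each element has a 'benefit' weight.

    candidates: dict protein -> set of elements (peptide ids or peak ids).
    benefit: dict element -> weight (e.g. 0.9 for MS/MS peptides, 0.1 for MS peaks).
    p: fraction of the total benefit that must be covered.
    """
    total = sum(benefit.values())
    uncovered = set(benefit)
    gained, chosen = 0.0, []
    while gained < p * total:
        # Pick the protein whose uncovered elements carry the most benefit.
        best = max(candidates,
                   key=lambda q: sum(benefit[e] for e in candidates[q] & uncovered))
        gain = candidates[best] & uncovered
        g = sum(benefit[e] for e in gain)
        if g == 0:  # no remaining benefit can be covered
            break
        chosen.append(best)
        uncovered -= gain
        gained += g
    return chosen

# Toy example: one MS/MS peptide outweighs two single-MS peaks.
benefit = {"pep1": 0.9, "peak1": 0.1, "peak2": 0.1}
candidates = {"P1": {"pep1"}, "P2": {"peak1", "peak2"}}
print(greedy_weighted_cover(candidates, benefit, 0.7))  # → ['P1']
```

Because the MS/MS peptide carries most of the benefit, covering it alone already satisfies p = 0.7, which mirrors how the combined model places more weight on the MS/MS evidence.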
To illustrate the benefit of including single-stage MS data in protein mixture identification, we compare the identification results obtained with different types of data in Table 3. Not surprisingly, the MS/MS-based method performs better than the MS-based method, and the performance gap is still considerable. The promising result is that combining MS data and MS/MS data consistently improves the overall identification performance (F1-measure).
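The covering fraction estimation procedure used throughout these experiments can be sketched as follows. This assumes a particular reading of the covering-efficiency score (u/v)f, namely the product of u/v (peaks covered per selected protein) and the candidate fraction f; the greedy solver and the toy data are likewise illustrative assumptions, not the paper's implementation.

```python
def greedy(proteins, peaks, p):
    """Minimal greedy PSC solver used only to drive the estimation loop."""
    uncovered, chosen = set(peaks), []
    while len(peaks) - len(uncovered) < p * len(peaks):
        best = max(proteins, key=lambda q: len(proteins[q] & uncovered))
        gain = proteins[best] & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen

def estimate_covering_fraction(solver, proteins, peaks,
                               candidates=(0.1, 0.2, 0.3, 0.4, 0.5,
                                           0.6, 0.7, 0.8, 0.9)):
    """Pick the candidate f maximizing the covering-efficiency score.

    For each candidate f, run the PSC solver and score its solution by
    (u / v) * f, with u = peaks covered and v = proteins selected
    (an assumed reading of the paper's (u/v)f notation).
    """
    best_f, best_score = None, float("-inf")
    for f in candidates:
        chosen = solver(proteins, peaks, f)
        if not chosen:
            continue
        covered = len(set().union(*(proteins[q] for q in chosen)) & peaks)
        score = (covered / len(chosen)) * f
        if score > best_score:
            best_f, best_score = f, score
    return best_f

# Toy example: P1 explains 4 of 10 peaks; the other 6 peaks are "noise"
# covered only by six low-efficiency one-peak proteins, so the true
# covering fraction is 0.4.
peaks = set(range(10))
proteins = {"P1": {0, 1, 2, 3}}
proteins.update({f"N{i}": {i + 3} for i in range(1, 7)})
print(estimate_covering_fraction(greedy, proteins, peaks))  # → 0.4
```

In this toy setting the score peaks exactly at the true fraction: pushing f above 0.4 forces the solver to add one-peak "noise" proteins, which drags u/v down faster than f rises.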

4 CONCLUSIONS

In this paper, we proposed a new unified framework, the partial set covering (PSC) model, to identify protein mixtures using both MS data and MS/MS data. The experimental results demonstrate the advantages of our model and the three PSC-based methods. In our future work, we plan to incorporate prior biological knowledge into the model to further improve the identification performance. For instance, known factors


[Fig. 4 panels: (a) precision, (b) recall, (c) F1-measure and (d) covering efficiency of Greedy, LP Rounding and Dual Feasible as the covering fraction varies from 0.1 to 0.9.]

Fig. 4. The effect and estimation of covering fraction on the real single-stage MS data. The number of candidate proteins is set to 5000. In the first three sub-figures, we plot precision, recall and F1-measure of the proposed algorithms, respectively. In the last sub-figure, we describe the covering efficiency values when different PSC algorithms are used in the estimation procedure.

[Fig. 5 panels: (a) precision, (b) recall, (c) F1-measure and (d) covering efficiency of Greedy, LP Rounding and Dual Feasible as the covering fraction varies from 0.1 to 1.]

Fig. 5. Performance test and covering fraction estimation on the real MS/MS data. From (a) to (c), we compare the performance of the partial covering model with that of the full covering model. In (d), we plot the covering fraction estimation results in terms of covering efficiency.


[Fig. 6 panels: (a) precision, (b) recall, (c) F1-measure and (d) covering efficiency of Greedy, LP Rounding and Dual Feasible as the covering fraction varies from 0.1 to 0.9.]

Fig. 6. Performance test and covering fraction estimation on the combination of MS data and MS/MS data. The number of candidate proteins is set to 5000 for MS data. In (a), (b) and (c), we plot the identification performance of different algorithms when the covering fraction varies from 0.1 to 0.9. In (d), we plot the covering efficiency when the covering fraction is set to different candidate values.

TABLE 3
Performance comparison using single MS data, tandem MS data and the combination of MS data and MS/MS data. Since our objective here is to investigate the effect of using different types of data, we set the covering fraction p to its known best value, i.e., p = 0.2 (single MS), p = 0.9 (tandem MS) and p = 0.7 (MS and MS/MS).

Algorithm      Data           Precision  Recall  F1-Measure
Greedy         Single MS      84%        53%     65%
Greedy         Tandem MS      83%        78%     80%
Greedy         MS and MS/MS   81%        86%     83%
LP Rounding    Single MS      87%        55%     67%
LP Rounding    Tandem MS      75%        73%     74%
LP Rounding    MS and MS/MS   81%        86%     83%
Dual Feasible  Single MS      87%        55%     67%
Dual Feasible  Tandem MS      83%        78%     80%
Dual Feasible  MS and MS/MS   79%        86%     82%

such as post-translational modifications (PTMs) and partial digestions may generate "noisy peaks". Such information can help us distinguish true peptide-related signals from noise, leading to a higher identification rate.

ACKNOWLEDGEMENTS

The comments and suggestions from the anonymous reviewers greatly improved the paper. This work was supported by the general research fund 621707 from the Hong Kong Research Grants Council, a research proposal competition award RPC07/08.EG25 and a postdoctoral fellowship from the Hong Kong University of Science and Technology. The source codes and data are available at: http://bioinformatics.ust.hk/PSCMixture.rar.

REFERENCES

[1] L. McHugh and J. W. Arthur, "Computational methods for protein identification from mass spectrometry data," PLoS Computational Biology, vol. 4, no. 2, p. e12, 2008.
[2] W. J. Henzel, T. M. Billeci, J. T. Stults, S. C. Wong, C. Grimley, and C. Watanabe, "Identifying proteins from two-dimensional gels by molecular mass searching of peptide fragments in protein sequence databases," Proceedings of the National Academy of Sciences of the United States of America, vol. 90, no. 11, pp. 5011–5015, 1993.
[3] P. James, M. Quadroni, E. Carafoli, and G. Gonnet, "Protein identification by mass profile fingerprinting," Biochemical and Biophysical Research Communications, vol. 195, no. 1, pp. 58–64, 1993.
[4] M. Mann, P. Hojrup, and P. Roepstorff, "Use of mass spectrometric molecular weight information to identify proteins in sequence databases," Biological Mass Spectrometry, vol. 22, no. 6, pp. 338–345, 1993.
[5] D. J. Pappin, P. Hojrup, and A. J. Bleasby, "Rapid identification of proteins by peptide-mass fingerprinting," Current Biology, vol. 3, no. 6, pp. 327–332, 1993.
[6] J. R. Yates, S. Speicher, P. R. Griffin, and T. Hunkapiller, "Peptide mass maps: a highly informative approach to protein identification," Analytical Biochemistry, vol. 214, no. 2, pp. 397–408, 1993.
[7] J. K. Eng, A. L. McCormack, and J. R. Yates, "An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database," Journal of the American Society for Mass Spectrometry, vol. 5, no. 11, pp. 976–989, 1994.
[8] V. Dancik, T. A. Addona, K. R. Clauser, J. E. Vath, and P. A. Pevzner, "De novo peptide sequencing via tandem mass spectrometry," Journal of Computational Biology, vol. 6, no. 3/4, pp. 327–342, 1999.
[9] B. Lu, A. Motoyama, C. Ruse, J. Venable, and J. R. Yates, "Improving protein identification sensitivity by combining MS and MS/MS information for shotgun proteomics using LTQ-Orbitrap high mass accuracy data," Analytical Chemistry, vol. 80, no. 6, pp. 2018–2025, 2008.
[10] D. Mantini, F. Petrucci, P. D. Boccio, D. Pieragostino, M. D. Nicola, A. Lugaresi, G. Federici, P. Sacchetta, C. D. Ilio, and A. Urbani, "Independent component analysis for the extraction of reliable protein signal profiles from MALDI-TOF mass spectra," Bioinformatics, vol. 24, no. 1, pp. 63–70, 2008.
[11] D. N. Perkins, D. J. C. Pappin, D. M. Creasy, and J. S. Cottrell, "Probability-based protein identification by searching sequence databases using mass spectrometry data," Electrophoresis, vol. 20, no. 18, pp. 3551–3567, 1999.
[12] W. Zhang and B. T. Chait, "ProFound: an expert system for protein identification using mass spectrometric peptide mapping information," Analytical Chemistry, vol. 72, no. 11, pp. 2482–2489, 2000.
[13] P. R. Baker and K. R. Clauser, "Protein Prospector." [Online]. Available: http://prospector.ucsf.edu
[14] M. Tuloup, C. Hernandez, I. Coro, C. Hoogland, P.-A. Binz, and R. D. Appel, "Aldente and BioGraph: an improved peptide mass fingerprinting protein identification environment," in Swiss Proteomics Society 2003 Congress: Understanding Biological Systems through Proteomics, 2003, pp. 174–176. [Online]. Available: http://www.expasy.org/tools/aldente/
[15] W. J. Henzel, C. Watanabe, and J. T. Stults, "Protein identification: the origins of peptide mass fingerprinting," Journal of the American Society for Mass Spectrometry, vol. 14, no. 9, pp. 931–942, 2003.
[16] I. Shadforth, D. Crowther, and C. Bessant, "Protein and peptide identification algorithms using MS for use in high-throughput, automated pipelines," Proteomics, vol. 5, no. 16, pp. 4082–4095, 2005.
[17] J. Eriksson and D. Fenyö, "Probity: a protein identification algorithm with accurate assignment of the statistical significance of the results," Journal of Proteome Research, vol. 3, no. 1, pp. 32–36, 2004.
[18] J. Magnin, A. Masselot, C. Menzel, and J. Colinge, "OLAV-PMF: a novel scoring scheme for high-throughput peptide mass fingerprinting," Journal of Proteome Research, vol. 3, no. 1, pp. 55–60, 2004.
[19] J. A. Siepen, E. J. Keevil, D. Knight, and S. J. Hubbard, "Prediction of missed cleavage sites in tryptic peptides aids protein identification in proteomics," Journal of Proteome Research, vol. 6, no. 1, pp. 399–408, 2007.
[20] Z. Song, L. Chen, A. Ganapathy, X.-F. Wan, L. Brechenmacher, N. Tao, D. Emerich, G. Stacey, and D. Xu, "Development and assessment of scoring functions for protein identification using PMF data," Electrophoresis, vol. 28, no. 5, pp. 864–870, 2007.
[21] D. Yang, K. Ramkissoon, E. Hamlett, and M. C. Giddings, "High-accuracy peptide mass fingerprinting using peak intensity data with machine learning," Journal of Proteome Research, vol. 7, no. 1, pp. 62–69, 2008.
[22] Z. He, C. Yang, and W. Yu, "Peak bagging for peptide mass fingerprinting," Bioinformatics, vol. 24, no. 10, pp. 1293–1299, 2008.
[23] O. N. Jensen, A. V. Podtelejnikov, and M. Mann, "Identification of the components of simple protein mixtures by high-accuracy peptide mass mapping and database searching," Analytical Chemistry, vol. 69, no. 23, pp. 4741–4750, 1997.
[24] Z. Y. Park and D. H. Russell, "Identification of individual proteins in complex protein mixtures by high-resolution, high-mass-accuracy MALDI TOF-mass spectrometry analysis of in-solution thermal denaturation/enzymatic digestion," Analytical Chemistry, vol. 73, no. 11, pp. 2558–2564, 2001.
[25] J. Eriksson and D. Fenyö, "Protein identification in complex mixtures," Journal of Proteome Research, vol. 4, no. 2, pp. 387–393, 2005.
[26] P. Slavik, "Improved performance of the greedy algorithm for partial cover," Information Processing Letters, vol. 64, no. 5, pp. 251–254, 1997.
[27] M. Bläser, "Computing small partial coverings," Information Processing Letters, vol. 85, no. 6, pp. 327–331, 2003.
[28] R. Gandhi, S. Khuller, and A. Srinivasan, "Approximation algorithms for partial covering problems," Journal of Algorithms, vol. 53, no. 1, pp. 55–84, 2004.
[29] T. Fujito, "On combinatorial approximation of covering 0-1 integer programs and partial set cover," Journal of Combinatorial Optimization, vol. 8, no. 4, pp. 439–452, 2004.
[30] J. Könemann, O. Parekh, and D. Segev, "A unified approach to approximating partial covering problems," in Proceedings of the 14th Annual European Symposium on Algorithms (ESA 2006), ser. Lecture Notes in Computer Science, Y. Azar and T. Erlebach, Eds., vol. 4168. Springer, 2006, pp. 468–479.
[31] J. Mestre, "Lagrangian relaxation and partial cover," in Proceedings of the 25th International Symposium on Theoretical Aspects of Computer Science (STACS 2008), S. Albers and P. Weil, Eds., Dagstuhl, Germany, 2008, pp. 539–550. [Online]. Available: http://drops.dagstuhl.de/opus/volltexte/2008/1315
[32] D. S. Hochbaum, "Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems," in Approximation Algorithms for NP-Hard Problems. PWS Publishing Company, 1997, pp. 94–143.
[33] J. Samuelsson, D. Dalevi, F. Levander, and T. Rögnvaldsson, "Modular, scriptable and automated analysis tools for high-throughput peptide mass fingerprinting," Bioinformatics, vol. 20, no. 18, pp. 3628–3635, 2004.
[34] A. I. Nesvizhskii and R. Aebersold, "Interpretation of shotgun proteomic data: the protein inference problem," Molecular & Cellular Proteomics, vol. 4, no. 10, pp. 1419–1440, 2005.
[35] A. I. Nesvizhskii, A. Keller, E. Kolker, and R. Aebersold, "A statistical model for identifying proteins by tandem mass spectrometry," Analytical Chemistry, vol. 75, no. 17, pp. 4646–4658, 2003.
[36] B. Zhang, M. C. Chambers, and D. L. Tabb, "Proteomic parsimony through bipartite graph analysis improves accuracy and transparency," Journal of Proteome Research, vol. 6, no. 9, pp. 3549–3557, 2007.
[37] P. Alves, R. J. Arnold, M. V. Novotny, P. Radivojac, J. P. Reilly, and H. Tang, "Advancement in protein inference from shotgun proteomics using peptide detectability," in Proceedings of the 2007 Pacific Symposium on Biocomputing (PSB 2007), 2007, pp. 409–420.
[38] Y. F. Li, R. J. Arnold, Y. Li, P. Radivojac, Q. Sheng, and H. Tang, "A Bayesian approach to protein inference problem in shotgun proteomics," in Proceedings of the 12th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2008), ser. LNBI, M. Vingron and L. Wong, Eds., vol. 4955. Springer, 2008, pp. 167–180.
[39] K. Baerenfaller, J. Grossmann, M. A. Grobei, R. Hull, M. Hirsch-Hoffmann, S. Yalovsky, P. Zimmermann, U. Grossniklaus, W. Gruissem, and S. Baginsky, "Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics," Science, vol. 320, pp. 938–941, 2008.
[40] N. E. Castellana, S. H. Payne, Z. Shen, M. Stanke, V. Bafna, and S. P. Briggs, "Discovery and revision of Arabidopsis genes by proteogenomics," Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 52, pp. 21034–21038, 2008.
[41] J. E. Elias and S. P. Gygi, "Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry," Nature Methods, vol. 4, no. 3, pp. 207–214, 2007.
[42] N. Jaitly, A. Mayampurath, K. Littlefield, J. N. Adkins, G. A. Anderson, and R. D. Smith, "Decon2LS: an open-source software package for automated processing and visualization of high resolution mass spectrometry data," BMC Bioinformatics, vol. 10, p. 87, 2009.
[43] M. E. Monroe, N. Tolic, N. Jaitly, J. L. Shaw, J. N. Adkins, and R. D. Smith, "VIPER: an advanced software package to support high-throughput LC-MS peptide identification," Bioinformatics, vol. 23, no. 15, pp. 2021–2023, 2007.
[44] R. Craig and R. C. Beavis, "TANDEM: matching proteins with tandem mass spectra," Bioinformatics, vol. 20, no. 9, pp. 1466–1467, 2004.

Zengyou He received the BS, MS and PhD degrees in computer science from Harbin Institute of Technology, China, in 2000, 2002 and 2006, respectively. Currently, he is a research associate in the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology. His research interests include data mining and computational mass spectrometry.

Can Yang received the bachelor’s degree and master’s degree in automatic control from Zhejiang University, China, in 2003 and 2006, respectively. He is currently working toward the PhD degree in the Department of Electronic and Computer Engineering at the Hong Kong University of Science and Technology. He is interested in bioinformatics, machine learning and data mining.

Weichuan Yu received the Ph.D. degree in Computer Vision and Image Analysis from University Kiel, Germany in 2001. He was a postdoctoral associate at Yale University from 2001 to 2004 and a research faculty member in the Center for Statistical Genomics and Proteomics at Yale University from 2004 to 2006. Since 2006, he has been an assistant professor in the Department of Electronic and Computer Engineering at the Hong Kong University of Science and Technology. He is interested in computational analysis problems with biological and medical applications. He has published papers on a variety of topics including bioinformatics, computational biology, biomedical imaging, signal processing and computer vision.

