IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. XX, NO. XX, XX 2009


A Partial Set Covering Model for Protein Mixture Identification Using Mass Spectrometry Data

Zengyou He, Can Yang, and Weichuan Yu

Abstract—Protein identification is a key and essential step in mass spectrometry (MS) based proteome research. To date, there are many protein identification strategies that employ either MS data or MS/MS data for database searching. While MS-based methods provide wider coverage than MS/MS-based methods, their identification accuracy is lower since MS data carry less information than MS/MS data. Thus, it is desirable to design more sophisticated algorithms that achieve higher identification accuracy using MS data. Peptide Mass Fingerprinting (PMF) has been widely used to identify single purified proteins from MS data for many years. In this paper, we extend this technology to protein mixture identification. First, we formulate the problem of protein mixture identification as a Partial Set Covering (PSC) problem. Then, we present several algorithms that solve the PSC problem efficiently. Finally, we extend the partial set covering model to both MS/MS data and the combination of MS data and MS/MS data. The experimental results on simulated data and real data demonstrate the advantages of our method: (1) it outperforms previous MS-based approaches significantly; (2) it is useful in MS/MS-based protein inference; and (3) it combines MS data and MS/MS data in a unified model such that the identification performance is further improved.

Index Terms—Protein Identification, Proteomics, Peptide Mass Fingerprinting, Mass Spectrometry, Set Covering, Linear Programming, Optimization


1 INTRODUCTION

PROTEIN identification from MS data or MS/MS data is a key proteomics technology. Many protein identification strategies have been proposed; for a comprehensive review, please refer to [1]. Typical identification strategies include Peptide Mass Fingerprinting (PMF) [2], [3], [4], [5], [6], MS/MS-based database search [7] and de novo sequencing [8]. We can categorize existing protein identification strategies according to the type of input data: PMF takes single-stage MS data as input, while MS/MS-based database search and de novo sequencing require MS/MS data. The MS/MS-based method is probably the most widely used identification approach nowadays. However, one inherent disadvantage of such methods is that they cannot perform tandem mass spectrometry scanning on every single ion, leading to an incomplete identification of peptides. Single-stage MS data, in contrast, have broader mass coverage. Therefore, we may use PMF to discover proteins whose peptide digestion products are not selected for MS/MS sequencing. Besides, we can either combine PMF with the MS/MS-based method [9] or use it to extract protein profiles in profiling-based biomarker discovery [10]. Initially, PMF was used to identify single purified proteins separated by two-dimensional gel electrophoresis (2D gels).

• Z. He, C. Yang and W. Yu are with the Laboratory for Bioinformatics and Computational Biology, Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China. E-mail: [email protected], [email protected], [email protected].

Manuscript received 2 Dec. 2008; revised 5 May 2009; accepted 17 May 2009; published online XX 2009.

To date, many popular PMF tools such as Mascot [11], ProFound [12], Protein Prospector [13] and Aldente [14] have been developed. Please refer to [15] for a survey of the history of PMF before 2003 and to [16] for a summary of available PMF tools. To further improve the identification accuracy, many new algorithms have been proposed recently (e.g., [17], [18], [19], [20], [21], [22]). Note that all these methods focus on single protein identification rather than protein mixture identification.

Some methods have also been proposed to identify protein mixtures using PMF [23], [24], [25]. Jensen et al. [23] proposed a subtraction strategy in which proteins are identified in an iterative manner. In each iteration, the most probable protein (the one with the highest ranking score) is identified. Then, the peaks matching this identified protein are removed prior to the next iteration. This procedure terminates after sufficient proteins have been identified. Park and Russell [24] as well as Eriksson and Fenyö [25] also used the same strategy for protein mixture identification. Though the subtraction approach is effective in identifying simple protein mixtures, it is highly heuristic and its performance deteriorates on complex and noisy protein mixtures.

In this paper, we first formulate the problem of protein mixture identification as a Partial Set Covering (PSC) problem. More precisely, we take the input peak list as the ground set to be covered and regard each candidate protein as a subset of matched peaks. In addition, the cost of each protein is modeled as the number of its theoretically digested peptides. The objective is to find a subcollection of proteins that has minimal cost and covers at least a fixed fraction of the peaks. While many algorithms have been proposed to solve


the PSC problem, only a few of them are capable of handling large-scale data in practice. To balance effectiveness and efficiency, we suggest three algorithms: a greedy algorithm, a linear programming (LP) rounding algorithm and a dual feasible algorithm. All these algorithms achieve good identification accuracy and running efficiency. We also show that the PSC model is applicable to MS/MS-based protein identification. With minor modifications, we present a generalized PSC model that can combine MS data and MS/MS data to identify proteins. One limitation of the PSC model is that it requires the user to specify the desired covering fraction. To address this issue, we develop an estimation method that suggests a covering fraction automatically. To demonstrate the advantages of our methods, we conduct experiments on both simulated data and real data. The experimental results show that our methods outperform previous MS-based approaches significantly. In addition, our methods are very useful in identifying proteins using only MS/MS data or the combination of MS data and MS/MS data.

The rest of the paper is organized as follows: Section 2 formulates the problem and introduces the algorithms. Section 3 shows the experimental results. Section 4 concludes the paper.

2 METHODS

In this section, we first formulate the protein mixture identification problem as a partial set covering problem. Then, we suggest three effective algorithms that are capable of performing large-scale protein mixture identification. Finally, we extend the model to handle data sets that contain both MS data and MS/MS data and propose an algorithm for automatic covering fraction estimation.

2.1 Models

In MS-based protein mixture identification, suppose we have a database of n proteins D = {d1, d2, ..., dn} and a set of m experimental peaks Z = {z1, z2, ..., zm} as input. Our objective is to find the set of proteins from D that generated these peaks. The value of n depends on the protein database we use. For instance, there are more than 260,000 proteins in the Swiss-Prot database (Release 52). The value of m varies from several hundred to ten thousand. In the real MS data studied in this paper, we have more than 3000 peaks.

After protease (such as trypsin) digestion, a protein dj will produce a set of peptides Tj = {tj1, tj2, ..., tjnj}. Ideally, each peptide should correspond to a peak in the mass spectrum. Due to the imperfection of sample preparation, mass spectrometry scanning, and other factors, we often observe that some expected peaks are missing, while some noisy peaks are introduced.


Usually, we use a user-specified mass tolerance threshold σ to define the peak matching criterion: one experimental peak is considered to correspond to a theoretical peak if their distance is not larger than σ. Given the mass tolerance threshold σ, we define the set of peaks Sj corresponding to a protein dj as:

Sj = {zi | zi ∈ Z, ∃ tjk ∈ Tj, |zi − tjk| ≤ σ}.  (1)
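As an illustration, the matching rule of Eq. (1) can be sketched in a few lines (a sketch only; the function name and data layout are ours, not part of the paper):

```python
# Sketch of the peak-matching rule in Eq. (1): an experimental peak z_i
# belongs to S_j when it lies within the mass tolerance sigma of at least
# one theoretical peptide mass of protein d_j. All names are illustrative.

def matched_peaks(peaks, peptide_masses, sigma):
    """Return the set of peak indices 'explained' by one protein."""
    return {
        i
        for i, z in enumerate(peaks)
        if any(abs(z - t) <= sigma for t in peptide_masses)
    }

# Toy example: three experimental peaks, two theoretical peptides.
peaks = [1000.50, 1200.00, 1500.25]
peptides = [1000.48, 1500.30]
S_j = matched_peaks(peaks, peptides, sigma=0.1)  # peaks 0 and 2 match
```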

Here Sj is the subset of experimental peaks that can be "explained" by protein dj. The size of Sj reflects the power of protein dj in interpreting the observed peak list. Assuming that random peak matches also occur, |Sj| is proportional to |Tj|, where | · | denotes the size of a set. In other words, if one protein has more theoretical peptides, it has a larger probability of randomly matching more observed peaks. Hence, we define the cost of each Sj as:

wj = |Tj|.  (2)

Alternatively, we can incorporate the number of missing peaks as a penalty into the cost function:

wj = |Tj| − |Sj|.  (3)

On the one hand, Eq. (3) is better than Eq. (2) in the sense that it gives preference to proteins with fewer missing peaks when they have the same number of theoretical peptides. On the other hand, Eq. (3) risks underestimating the cost of longer proteins since the number of random matches in Sj is proportional to |Tj|. In our implementation, we use Eq. (2) as the default setting and provide Eq. (3) as an alternative choice. In the experimental study, we found that they exhibit almost identical performance. Thus, we omit the comparison between these two cost functions in the experimental section.

The objective of protein mixture identification is to find a set of proteins that "best" explains Z. Here we decompose the high-level "best" criterion into two computational criteria:
1) Maximum coverage: the number of covered experimental peaks should be maximized.
2) Minimum cost: the total cost should be minimized.
We have to make a trade-off between these two criteria since they conflict with each other. A natural formulation is the well-known set covering (SC) problem: identify a minimum-cost subcollection from S = {S1, S2, ..., Sn} such that it covers all elements in Z. Formally, let J = {1, 2, ..., n} denote the set of protein indices; our objective is to identify a subset C ⊆ J such that:

(SC)  minimize_{C⊆J}  ∑_{j∈C} wj  (4)
      subject to  |∪_{j∈C} Sj| = m,  (5)

where m is the number of peaks.


Unfortunately, the SC formulation is unrealistic since it requires full coverage of the peaks. In real MS data, there is always a large portion of noisy peaks that are not generated by the ground-truth proteins. Motivated by this observation, we relax the coverage requirement so that some peaks may remain uncovered. This leads to a partial set covering (PSC) problem:

(PSC)  minimize_{C⊆J}  ∑_{j∈C} wj  (6)
       subject to  |∪_{j∈C} Sj| ≥ pm,  (7)

where p is the expected covering fraction (0 < p ≤ 1). In the PSC problem, p is specified by the user, and the goal is to find a collection of sets with minimum cost covering at least a p-fraction of the ground set Z. The PSC formulation for protein mixture identification offers the following advantages:
• The covering fraction p has a clear meaning in practice, i.e., the expected fraction of non-noisy peaks. This value depends on the experimental procedures of MS data generation.
• During the last decade, many effective algorithms have been proposed to solve the PSC problem [26], [27], [28], [29], [30], [31]. We can apply some of these algorithms to protein mixture identification directly.

2.2 Algorithms

While many algorithms are available for solving the PSC problem, not all of them are feasible in our context. For instance, the randomized algorithm in [27] has a time complexity exponential in the number of covered elements. Here, we re-use and design three scalable algorithms: a greedy algorithm, an LP rounding algorithm and a dual feasible algorithm.

In our original identification problem, the input includes a protein database D, an experimental peak list Z and a mass tolerance threshold σ. To describe the proposed algorithms in a concise manner, we first compute the peak set Sj and the corresponding cost value wj of each protein dj according to σ. Then, we prune the protein database using a parameter k (k ≪ n). More precisely, we retain only the top k proteins with the largest values of |Sj| as candidates. In addition, we delete from Z the peaks that do not appear in any Sj. Without loss of generality, we still use n to denote the number of proteins and m to denote the number of peaks after pruning. The above procedure provides us with a transformed input: an experimental peak list Z = {z1, z2, ..., zm}, a family S = {S1, S2, ..., Sn} of subsets of Z with a corresponding set of cost values w = {w1, w2, ..., wn}, and a covering fraction p. The output C is a subset of protein indices, C ⊆ J with J = {1, 2, ..., n}. Moreover, we use I = {1, 2, ..., m} to denote the set of peak indices.
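The pruning step above can be sketched as follows (a minimal sketch; the dictionary-based data layout and function name are our own illustration, not the authors' implementation):

```python
# Sketch of the pruning step: keep only the k candidate proteins with the
# most matched peaks and drop peaks matched by none of them.
# 'protein_peak_sets' maps protein id -> set of matched peak indices;
# 'costs' maps protein id -> w_j. All names are illustrative.

def prune_candidates(protein_peak_sets, costs, k):
    # Retain the top-k proteins by |S_j|.
    top = sorted(protein_peak_sets,
                 key=lambda j: len(protein_peak_sets[j]),
                 reverse=True)[:k]
    S = {j: set(protein_peak_sets[j]) for j in top}
    w = {j: costs[j] for j in top}
    # Keep only peaks that appear in at least one retained S_j.
    covered = set().union(*S.values()) if S else set()
    return S, w, covered

S, w, peaks_kept = prune_candidates(
    {"P1": {0, 1, 2}, "P2": {2, 3}, "P3": {4}},
    {"P1": 5, "P2": 3, "P3": 2},
    k=2,
)
# Only P1 and P2 survive; peak 4 leaves the ground set.
```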

2.2.1 Greedy Algorithm

A greedy algorithm is the most natural heuristic for the set covering problem. It works by selecting one set at a time, namely the one that covers the most elements among the uncovered ones [32]. As shown in Algorithm 1, the greedy algorithm for the PSC problem works in the same way as for the SC problem except that the stopping criterion becomes |∪_{j∈C} Sj| ≥ pm.

Algorithm 1 Greedy Algorithm (S, w, p)
  Initialize C ← ∅; l ← 0
  while |∪_{j∈C} Sj| < pm do
    l ← l + 1
    Select a set Sjl such that |Sjl|/wjl = max_{j∈J} |Sj|/wj
    Set C ← C ∪ {jl}; Sj ← Sj \ Sjl for all j ∈ J; J ← J \ {jl}
  end while
  return C

The greedy algorithm has the following salient features:
• It is very fast: its time complexity is O(mnr), where r = |C|, m is the number of peaks and n is the number of candidate proteins.
• It guarantees an approximation ratio of H(⌈pm⌉) [26], where H(⌈pm⌉) = ∑_{i=1}^{⌈pm⌉} (1/i) ≤ 1 + ln(pm). This means that the greedy algorithm always obtains a solution whose cost is at most H(⌈pm⌉) times that of the optimal solution.
Moreover, it establishes the connection between the PSC model and the subtraction strategy in [23]. In fact, we can regard the greedy algorithm as a special case of the subtraction strategy in which |Sj|/wj is used as the protein identification score. In the context of protein identification, |Sj| is also called the shared peak count [1]. Hence, |Sj|/wj is actually the normalized shared peak count, since it ranges between 0 and 1 when wj = |Tj|. It is well recognized that such a simple scoring method cannot achieve good performance in single protein identification. To our surprise, it performs extremely well and outperforms more complicated scoring methods such as Piums [33] and ProFound [12] in our experiments on protein mixture identification. The reason is probably two-fold:
• |Sj|/wj has a theoretical justification under the PSC model since the greedy algorithm provides a good performance guarantee.
• Simple scoring methods are more robust for protein mixture identification since they are not very sensitive to random matching. More precisely, the shared peak count is linear in the number of peak matches, while other sophisticated scores are nonlinear in the number of peak matches.
Note that the greedy algorithm is just one possible choice for solving the PSC problem. There are many other algorithms that are totally different from this greedy heuristic and the subtraction strategy.
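Algorithm 1 can be sketched as a short runnable program (a sketch under our own data layout, with dictionaries mapping protein identifiers to matched-peak sets and costs; ties are broken arbitrarily):

```python
# Sketch of Algorithm 1 (greedy PSC). S: dict protein -> set of peak
# indices, w: dict protein -> cost, m: number of peaks, p: covering
# fraction. Illustrative, not the authors' implementation.

def greedy_psc(S, w, m, p):
    S = {j: set(s) for j, s in S.items()}    # work on copies
    C, covered = [], set()
    while len(covered) < p * m and S:
        # Pick the set maximizing the normalized shared peak count |S_j|/w_j.
        j = max(S, key=lambda j: len(S[j]) / w[j])
        if not S[j]:
            break                             # nothing left to cover
        C.append(j)
        covered |= S[j]
        chosen = S.pop(j)
        for s in S.values():                  # subtract newly covered peaks
            s -= chosen
    return C

# Toy instance: 5 peaks, cover at least 80% of them.
C = greedy_psc({"A": {0, 1, 2}, "B": {2, 3}, "C": {3, 4}},
               {"A": 3, "B": 2, "C": 2}, m=5, p=0.8)
# C now covers at least 4 of the 5 peaks.
```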


2.2.2 LP Rounding Algorithm

The linear programming (LP) rounding technique is very popular in the design of approximation algorithms. A general LP rounding method consists of the following steps:
1) Formulate the optimization problem as an integer programming problem.
2) Relax the integrality constraints to obtain an LP problem, which can be solved in polynomial time.
3) Round the fractional solution to an integral solution.
The PSC problem can be formulated as an integer program (PSC IP). In this formulation, the variable xj indicates whether we select Sj (xj = 1 if j ∈ C), whereas the variable yi indicates whether a peak zi is left uncovered (yi = 0 if zi ∈ ∪_{j∈C} Sj). Mathematically, constraint (9) guarantees that we either pick at least one set that contains zi or declare this element uncovered by setting yi = 1; constraint (10) forces any feasible solution to cover at least pm peaks; constraints (11) and (12) force xj and yi to be binary variables, respectively.

(PSC IP)  minimize_{x,y}  ∑_{j∈J} wj xj  (8)
          subject to  ∑_{j: zi∈Sj} xj + yi ≥ 1, i ∈ I  (9)
                      ∑_{i∈I} yi ≤ m(1 − p)  (10)
                      xj ∈ {0, 1}, j ∈ J  (11)
                      yi ∈ {0, 1}, i ∈ I.  (12)

The corresponding LP relaxation below is obtained by setting the domain of xj and yi to be 0 ≤ xj, yi ≤ 1. Notice that the upper bound on xj and yi is unnecessary and is thus dropped in constraints (16) and (17).
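For illustration, the relaxation can be laid out in the standard form min cᵀu subject to A_ub·u ≤ b_ub, u ≥ 0 that generic LP solvers accept, with u = (x1, ..., xn, y1, ..., ym). The helper below is our own sketch, not the authors' implementation:

```python
# Sketch: constraint (14) becomes -(sum_{j: z_i in S_j} x_j) - y_i <= -1,
# and constraint (15) becomes sum_i y_i <= m(1 - p). S is a list of sets
# of peak indices; all names are illustrative. Any LP solver accepting
# this standard form could then be invoked on (c, A_ub, b_ub).

def build_psc_lp(S, w, m, p):
    n = len(S)
    c = list(w) + [0.0] * m            # objective: sum_j w_j x_j
    A_ub, b_ub = [], []
    for i in range(m):                 # one covering constraint per peak
        row = [-1.0 if i in S[j] else 0.0 for j in range(n)]
        row += [-1.0 if k == i else 0.0 for k in range(m)]
        A_ub.append(row)
        b_ub.append(-1.0)
    A_ub.append([0.0] * n + [1.0] * m)  # sum_i y_i <= m(1 - p)
    b_ub.append(m * (1.0 - p))
    return c, A_ub, b_ub

c, A, b = build_psc_lp([{0, 1}, {1, 2}], [2.0, 3.0], m=3, p=0.9)
```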

(PSC LP Primal)  minimize_{x,y}  ∑_{j∈J} wj xj  (13)
                 subject to  ∑_{j: zi∈Sj} xj + yi ≥ 1, i ∈ I  (14)
                             ∑_{i∈I} yi ≤ m(1 − p)  (15)
                             xj ≥ 0, j ∈ J  (16)
                             yi ≥ 0, i ∈ I.  (17)

In Algorithm 2, we present a simple LP rounding algorithm to solve the PSC problem. We first use the LP relaxation to find the fractional solution. Then, we select one set at a time, namely the one that maximizes the value of xjl (i.e., rounding xjl to 1), until pm peaks are covered.

Algorithm 2 LP Rounding Algorithm (S, w, p)
  Construct the LP relaxation of the PSC IP problem
  Invoke an LP solver to get an optimal solution x
  Initialize C ← ∅; l ← 0
  while |∪_{j∈C} Sj| < pm do
    l ← l + 1
    Select a set Sjl such that xjl = max_{j∈J} xj
    Set C ← C ∪ {jl}; J ← J \ {jl}
  end while
  return C

2.2.3 Dual Feasible Algorithm

The dual feasible algorithm solves the problem by finding a feasible solution of its dual problem [32], [29]. The dual problem of PSC is given as:

(PSC LP Dual)  maximize_{λ,v}  ∑_{i∈I} λi − v(m − pm)  (18)
               subject to  ∑_{i: zi∈Sj} λi ≤ wj, j ∈ J  (19)
                           λi ≤ v, i ∈ I  (20)
                           λi ≥ 0, i ∈ I  (21)
                           v ≥ 0.  (22)

The m dual variables (λi for each zi ∈ Z, i ∈ I) correspond to constraint (14), and the dual variable v corresponds to constraint (15) in the primal LP. The dual feasible algorithm (Algorithm 3) derives the dual solution implicitly. The dual information is placed in square brackets to indicate that it is not an indispensable part of the algorithm. Whenever a set is selected, its corresponding dual constraint becomes binding, as each of the elements covered by it is assigned an equal share of the set's (reduced) cost [32]. This algorithm works similarly to the greedy algorithm; the apparent difference is that it chooses a set using the reduced cost.

Algorithm 3 Dual Feasible Algorithm (S, w, p)
  Initialize C ← ∅; l ← 0; [λi = 0, i ∈ I]
  while |∪_{j∈C} Sj| < pm do
    l ← l + 1
    Select a set Sjl such that |Sjl|/wjl = max_{j∈J} |Sj|/wj
    Set C ← C ∪ {jl}; Sj ← Sj \ Sjl for all j ∈ J; J ← J \ {jl}
    Set wj ← wj − (wjl/|Sjl|)·|Sj| for all j ∈ J; [λi = wjl/|Sjl|, ∀zi ∈ Sjl]
  end while
  return C
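The dual feasible procedure can be sketched in Python as follows (a sketch, not the authors' implementation; the cost update here is one reasonable reading of the pseudocode, reducing each remaining set's cost by the share w_{jl}/|S_{jl}| for every element it loses to the chosen set, and all names are illustrative):

```python
# Sketch of the dual feasible algorithm. Same inputs as the greedy sketch;
# the implicit dual values lambda_i = w_{jl}/|S_{jl}| are tracked purely
# for illustration.

def dual_feasible_psc(S, w, m, p):
    S = {j: set(s) for j, s in S.items()}
    w = dict(w)
    lam = {}                                  # implicit dual solution
    C, covered = [], set()
    while len(covered) < p * m and S:
        j = max(S, key=lambda j: len(S[j]) / w[j])
        if not S[j]:
            break
        share = w[j] / len(S[j])              # equal share of the reduced cost
        for i in S[j]:
            lam[i] = share
        C.append(j)
        covered |= S[j]
        chosen = S.pop(j)
        w.pop(j)
        for jj, s in S.items():               # reduce costs of remaining sets
            w[jj] = max(w[jj] - share * len(s & chosen), 1e-9)
            s -= chosen
    return C, lam
```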

2.3 Extension to MS/MS Data

The MS/MS-based protein identification method first fragments some peptides to generate tandem MS spectra. It then searches these spectra against a database to identify peptides in the sample. Since the same peptide sequence may belong to different proteins, such a database search may lead to ambiguities in determining which proteins are indeed present. Correspondingly, inferring proteins from peptide identification results, known as the protein inference problem [34], is a challenging task. To solve the protein inference problem, many formulations and solutions have been proposed [34], [35], [36], [37], [38]. In particular, parsimonious (set covering) formulations are widely adopted [34], [36], [37]. The idea is to find a subset of proteins with minimum cost such that it covers all identified peptides.

While the MS/MS-based approach can identify tens of thousands of peptides in large-scale biological studies (e.g., [39], [40]) with a very low false discovery rate (FDR) of 1%–5%, there is still a small portion of false peptide identifications. Furthermore, the most popular method for estimating the FDR is the target-decoy search strategy [41]. Such a database-dependent estimation method often underestimates the error rate since many identifications from the target database are also incorrect. Therefore, it is desirable to leave those incorrect peptides uncovered in the stage of protein inference. This means that the principle of partial coverage instead of full coverage should be used in MS/MS-based protein inference.

It is straightforward to extend the partial set covering model to MS/MS-based protein inference: let Z = {z1, z2, ..., zm} be the set of identified peptides and define each Sj as:

Sj = Z ∩ Tj.  (23)

Then, the optimization problem remains the same and the three algorithms developed in the previous sections can be applied directly. In the context of MS/MS data, all proteins that contain at least one identified peptide are considered as candidate proteins in the optimization process. Here the covering fraction p can be interpreted as the expected percentage of true peptide identifications. Informally, we can consider (1 − p) as the desired FDR value.
For instance, if the expected FDR is 5%, then we set p = 0.95.

2.4 Extension to the Combination of MS Data and MS/MS Data

Single-stage MS data is complementary to MS/MS data in that it provides broader mass coverage but less sequence information. The combination of MS data and MS/MS data has been used in [9] to improve protein identification. Here we extend the partial set covering model to this combination. First, we use Z = {z1, z2, ..., zm} as the set of elements to be covered, where zi can be either an identified peptide or a peak. Since a peptide identified from an MS/MS spectrum is more informative than a single-stage MS peak, we assign different "benefit" values to different types of elements. More precisely, we introduce a set B = {b1, b2, ..., bm}, where each bi denotes the importance value of zi. In the current implementation, we use a user-specified parameter λ to generate bi: bi = λ if zi is an identified peptide and bi = 1 − λ otherwise. In general, λ should be larger than 0.5 (e.g., 0.9) to reflect the fact that identified peptides are more important than single-stage MS peaks. Then, the generalized partial set covering (GPSC) problem becomes:

(GPSC)

minimize_{C⊆J}  ∑_{j∈C} wj  (24)
subject to  ∑_{zi ∈ ∪_{j∈C} Sj} bi ≥ pt,  (25)

where p is the expected covering fraction (0 < p ≤ 1) and t = ∑_{i=1}^{m} bi is the sum of the elements in B. Obviously, the GPSC problem reduces to the standard partial set covering problem when bi = 1 (1 ≤ i ≤ m). The three PSC algorithms need the following minor modifications to handle the generalized PSC problem:
1) In all algorithms, we use ∑_{zi ∈ ∪_{j∈C} Sj} bi ≥ pt instead of |∪_{j∈C} Sj| ≥ pm as the stopping criterion.
2) In all algorithms, we use ∑_{zi∈Sj} bi to replace |Sj|.
3) In the LP rounding algorithm, we replace m(1 − p) with t(1 − p) in constraint (15).
In candidate protein selection, one protein is kept either when it contains at least one identified peptide or when it is among the best k proteins with respect to the number of matched single-stage MS peaks (k is an input parameter).

2.5 The Estimation of Covering Fraction

The covering fraction p is of primary importance in our algorithms. In this section, we study how to specify this parameter automatically under the setting of the generalized PSC model. Algorithm 4 describes the procedure to estimate the covering fraction. We have introduced the notations S, w and B in previous sections. Here we use F to denote a set of real numbers, each of which is used as a candidate covering fraction. For instance, we can set F = {0.1, 0.2, ..., 0.9}. Furthermore, we use Alg to denote the algorithm that solves the PSC problem. So far, we have three choices for Alg: the greedy algorithm, the LP rounding algorithm and the dual feasible algorithm.

Given f ∈ F, we obtain a subset of proteins C using Alg when the covering fraction is f. To evaluate the quality of C, one natural choice is u/v, where u = ∑_{zi ∈ ∪_{j∈C} Sj} bi is the "benefit" and v = ∑_{j∈C} wj is the "cost". We can consider u/v as the "covering efficiency" of C, i.e., a larger u/v indicates a better cover. In general, the value of this evaluation function decreases as the covering fraction increases. This is because our algorithms generate C in a greedy manner, making it harder for late-coming proteins to achieve the same level of covering efficiency (with respect to the uncovered elements). To compare different settings of the covering fraction, we use the modified covering efficiency (u/v) · f as the evaluation criterion. We report p = arg max_{f} (u/v) · f as the estimate of the covering fraction.

Algorithm 4 Covering Fraction Estimation Algorithm (S, w, B, F, Alg)
  Initialize E ← 0 and p ← 0
  for each f ∈ F do
    Set C ← Alg(S, w, f)
    Let u = ∑_{zi ∈ ∪_{j∈C} Sj} bi and v = ∑_{j∈C} wj
    if (u/v) · f > E then
      Set E ← (u/v) · f and p ← f
    end if
  end for
  return p
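Algorithm 4 can be sketched as follows (a sketch with illustrative names; toy_alg below is a trivial stand-in for Alg, not one of the paper's three solvers):

```python
# Sketch of Algorithm 4: try each candidate fraction f, run a PSC solver
# 'alg', and keep the f maximizing the modified covering efficiency
# (u/v) * f. 'benefits' gives b_i per element (all 1.0 in the plain PSC
# case). Illustrative, not the authors' implementation.

def estimate_covering_fraction(S, w, benefits, fractions, alg):
    best_score, best_p = 0.0, 0.0
    for f in fractions:
        C = alg(S, w, f)
        covered = set().union(*(S[j] for j in C)) if C else set()
        u = sum(benefits[i] for i in covered)   # total covered benefit
        v = sum(w[j] for j in C)                # total cost of the cover
        score = (u / v) * f if v > 0 else 0.0
        if score > best_score:
            best_score, best_p = score, f
    return best_p

# A toy solver standing in for Alg: pick the largest sets until an
# f-fraction of the ground set is covered (purely illustrative).
def toy_alg(S, w, f):
    m = len(set().union(*S.values()))
    C, covered = [], set()
    for j in sorted(S, key=lambda j: -len(S[j])):
        if len(covered) >= f * m:
            break
        C.append(j)
        covered |= S[j]
    return C

p_hat = estimate_covering_fraction(
    {"A": {0, 1, 2, 3}, "B": {4}}, {"A": 4, "B": 4},
    benefits={i: 1.0 for i in range(5)},
    fractions=[0.2, 0.8, 1.0], alg=toy_alg)
```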

3 RESULTS

We use both simulation data and real data to demonstrate the superiority of the PSC model in protein mixture identification. The evaluation criteria are standard performance metrics in information retrieval: precision, recall, and F1-measure:
• Precision (p) is the proportion of identified ground-truth proteins to all identified proteins.
• Recall (r) is the proportion of identified ground-truth proteins to all ground-truth proteins.
• F1-measure is the harmonic mean of recall and precision: 2pr/(p + r), with p and r defined as above.
In the experiments, we compare our algorithms against the following two generic identification algorithms:
• SPI (single protein identification) algorithm: It uses a standard PMF algorithm directly to identify proteins from the MS spectrum of the protein mixture. Concretely, the algorithm ranks each protein in the database separately. To handle a protein mixture, we report a set of top-ranked proteins as the result. In this paper, we select two single protein identification methods: Piums [33] and ProFound [12]. The first is representative of newly developed algorithms and the second is representative of current state-of-the-art PMF tools.
• Subtraction algorithm: The algorithm by Jensen et al. [23] is the only method that aims to identify proteins from a mixture of MS peaks. In the implementation, we repeat g iterations to identify g proteins as the output, where g is a user-specified parameter. In each iteration, we use a standard PMF algorithm to perform protein identification, and the matched peaks are removed prior to the next iteration of searching. To be consistent with SPI, we implement two versions of the subtraction algorithm, one version with Piums as the component PMF


algorithm and the other version with ProFound as the component PMF algorithm. (Though Mascot [11] is probably the most popular PMF method for single protein identification, it is difficult to directly embed it into the subtraction strategy since the technical details of Mascot are not publicly available.)

In the experiments, we let the SPI algorithm and the subtraction algorithm report g proteins, where g is the number of ground-truth proteins. Note that we usually do not know the true protein number in practice; thus, such a specification favors the performance of these two algorithms. Under this setting, the precision, recall and F1-measure are identical for both the SPI algorithm and the subtraction algorithm. We use the following parameters in protein identification: trypsin digestion with a maximum of one missed cleavage, mono-isotopic peaks, single charge state, and unrestricted protein mass.

3.1 Simulation Study

3.1.1 Simulator

Our simulator requires the following input parameters: mass error, sequence coverage, noise level and protein number.
• Mass error measures the difference between the theoretical mass and the observed mass.
• Sequence coverage denotes the ratio between the number of detected peptides and the number of all peptides within the mass acquisition range (800–4500 Da in our simulation).
• Noise level is the ratio between the number of man-made noisy peaks and the number of total peaks.
• Protein number is the number of ground-truth proteins in the mixture.
The data simulation process works as follows:
1) We randomly select a set of proteins from the sequence database (Swiss-Prot, Release 52) as the ground-truth proteins according to the protein number parameter.
2) We perform trypsin-based protein digestion in silico (allowing one missed cleavage) and simulate peptide detectability by retaining only a portion of the proteolytic peptides according to the sequence coverage parameter.
3) We alter the mass of each peptide by adding a number randomly generated from a zero-mean Gaussian distribution whose standard deviation equals the mass error parameter.
4) We add a set of noisy peaks that are randomly generated and uniformly distributed within the mass acquisition range according to the noise level parameter.

3.1.2 Performance Comparison

To compare different algorithms, we generate 4 groups of protein mixtures with different characteristics by varying



TABLE 1 Parameter setting in data simulation. Since the ground-truth peaks of one protein may be considered as noise to other proteins, we didn’t include noisy peaks in group d to study the effect of protein number on the identification performance.

Group | Mass Error (Da)        | Sequence Coverage  | Noise Level (%) | Protein Number
a     | 0.01, 0.02, 0.03, 0.04 | 0.3                | 50              | 20
b     | 0.02                   | 0.1, 0.2, 0.3, 0.4 | 50              | 20
c     | 0.02                   | 0.3                | 10, 30, 50, 70  | 20
d     | 0.02                   | 0.3                | 0               | 10, 40, 70, 100

[Fig. 1 here: panels (a1)-(d3) plot the average precision, recall and F1-measure of SPI(Piums), SPI(ProFound), Subtraction(Piums), Subtraction(ProFound), Greedy, LP Rounding and Dual Feasible against mass error, sequence coverage, noise level and protein number.]

Fig. 1. Identification performance comparison of different algorithms on the simulation data. In our algorithms, we set the covering fraction to 0.5 and fix the number of candidate proteins to 5000.

In each simulation group, we vary one parameter while fixing the other parameters (see Table 1). Under each specific parameter setting, we randomly create 10 protein mixtures and report the average performance of each algorithm. In database searching, the mass tolerance threshold is set to the known mass error for all PMF algorithms. Fig. 1 illustrates the identification performance of the different methods. The results show that our methods significantly outperform previous methods. We also observe that the greedy algorithm, the LP rounding algorithm and the dual feasible algorithm have similar performance. Furthermore, Fig. 1 shows that the mass error, the sequence coverage, the noise level and the number of proteins in the mixture all have a significant influence on the performance of PMF algorithms. Concretely, increasing the mass accuracy and the sequence coverage boosts performance, while increasing the noise level and the protein number deteriorates it.

We also compare the running time of the different algorithms in Fig. 2. Among our methods, the LP rounding

[Fig. 2 panels (a)-(d): average running time (s, log scale, roughly 10 to 10^3) plotted against mass error (0.01-0.04 Da), sequence coverage (0.1-0.4), noise level (10%-70%) and number of proteins (10-100), for the same seven algorithms as in Fig. 1.]

Fig. 2. Running time comparison of different algorithms on the simulation data. In our algorithms, we set the covering fraction to 0.5 and fix the number of candidate proteins to 5000.

algorithm is much more time-consuming than the other algorithms since it needs to find the optimal LP solution. The greedy algorithm and the dual feasible algorithm need similar execution time since they have the same time complexity. Furthermore, the running time of our algorithms is comparable to that of previous algorithms. This means that they achieve better identification performance while maintaining comparable running efficiency.

3.1.3 Parameter Estimation

To test the effectiveness of the covering fraction estimation procedure, we use F = {0.1, 0.2, ..., 0.9} as the set of candidate values for the covering fraction. We also plug each of the three proposed algorithms into the estimation process to test its performance. We plot both the estimated covering fraction and the ground truth in Fig. 3. Here the ground-truth covering fraction equals (1 − ε), where ε is the noise level used in generating the simulation data (see Table 1). In Fig. 3(a) and Fig. 3(b), the percentage of noisy peaks is fixed to 0.5. Thus, we consider 0.5 as the underlying true covering fraction in evaluating the estimation

method. We found that the estimated covering fraction varies from 0.4 to 0.6, approximating the true covering fraction within a reasonable range. In Fig. 3(c), the ground-truth covering fraction ranges from 0.9 down to 0.3. Interestingly, the estimation becomes more accurate as the noise level increases. This is a nice property since real MS data are always very noisy. We believe the reasons are the following. Our estimation is based on (u/v)f, which is a trade-off between u/v and f. As the candidate covering fraction value f increases, the corresponding u/v decreases, and the decreasing rate is roughly proportional to f. Given f and the ground-truth covering fraction f*, we have:

• If f > f*, we can assume that the covering efficiency value obtained at f is less than that at f*. This is because we have to cover at least (f − f*) noisy peaks, leading to a very low u/v value.
• If f < f*, there are two cases:
  – When f* is small (i.e., the noise level is high), the probability of reporting f* is very high because the difference in the covering efficiency is mainly determined by the covering fraction.
  – When f* is large (i.e., the noise level is low), the probability of obtaining the maximal covering efficiency value at f is very high because the difference in the covering efficiency is dominated by u/v.

In Fig. 3(d), the ground-truth covering fraction is fixed to 1 since there are no noisy peaks in the simulation data. It shows that increasing the number of proteins in the mixture decreases the estimated covering fraction. This is because the introduction of additional proteins in generating the simulation data can also bring "noisy peaks" (recall that we alter the mass of each peptide by adding a random number drawn from a Gaussian distribution with the mass error parameter as its standard deviation). Overall, the proposed estimation method provides a covering fraction that is practically useful. We will further clarify this point through the experiments on the real data.

[Fig. 3 panels: (a) mass error 0.01-0.04 Da; (b) sequence coverage 0.1-0.4; (c) noise level 10%-70%; (d) protein number 10-100; each plots the estimated covering fraction (%) of Greedy, LP Rounding and Dual Feasible against the ground truth.]

Fig. 3. Covering fraction estimation on the simulation data. We use the Greedy algorithm, the LP Rounding algorithm and the Dual Feasible algorithm as the component algorithm to estimate the covering fraction, respectively. In all tests, we set the number of candidate proteins to 5000.

3.2 Real Data

We use a standard mixture of 49 human proteins from the ABRF sPRG2006 study². The 74 labs participating in this study analyzed the sample using different techniques, e.g., 2D gels and LC-MS/MS. Here we select one data set that was generated on a linear ion trap (LTQ)-Orbitrap instrument. Note that most of the labs in this study did not provide an LTQ-Orbitrap data set. The use of such high-accuracy data enables us to obtain high-quality single-stage MS peaks. In the experiments, we search against a sequence database³ that consists of all Swiss-Prot human proteins plus some bonus proteins and contaminant compounds. To extract single-stage MS peaks, we use the Decon2LS software [42] and the VIPER software [43] to pre-process the raw LC-MS data with their default parameter settings. The final peak list contains 3366 de-convolved monoisotopic peaks. Note that we use all the single-MS peaks obtained from the raw MS data rather than only the single-MS peaks selected for MS/MS. In database searching, we set the mass tolerance threshold to 1 ppm since the mass accuracy of the data is very high.

To test the identification performance of our algorithms on MS/MS data, we use X!Tandem (version 2007.07.01.2) [44] to identify peptides from MS/MS spectra. The parameters used for peptide identification are: mono-isotopic masses, a mass tolerance of 2 Da for the precursor, a mass tolerance of 1 Da for fragment ions, a fixed modification on Cys, one missed cleavage site, and only b and y fragment ions are taken into account.

2. http://www.abrf.org/index.cfm/group.show/ProteomicsInformatics/ResearchGroup.53.htm
3. http://www.abrf.org/ResearchGroups/ProteomicsInformaticsResearchGroup/Studies/sPRGBICFASTA.zip


3.2.1 MS Data

Table 2 reports the identification results of different algorithms using only single-stage MS data. It shows that our algorithms achieve a significantly higher protein identification rate than previous PMF methods.

TABLE 2
Performance comparison on a real MS data set of 49 human proteins. Here the number of reported proteins for the SPI algorithm and the subtraction algorithm is 49, i.e., the number of ground-truth proteins. In our algorithms, the covering fraction is set to 0.2 and the number of candidate proteins is set to 5000.

Category     Algorithm      Precision  Recall  F1-Measure
SPI          Piums          24%        24%     24%
SPI          ProFound       24%        24%     24%
Subtraction  Piums          43%        43%     43%
Subtraction  ProFound       37%        37%     37%
PSC          Greedy         84%        53%     65%
PSC          LP Rounding    87%        55%     67%
PSC          Dual Feasible  84%        55%     66%
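The precision, recall and F1-measure values reported throughout these experiments follow the standard set-based definitions; a minimal sketch (the reported and ground-truth protein sets below are made up):

```python
def prf1(reported, truth):
    """Precision, recall and F1 of a reported protein set vs. ground truth.

    reported and truth are non-empty sets of protein identifiers.
    """
    tp = len(reported & truth)  # true positives: correctly reported proteins
    precision = tp / len(reported)
    recall = tp / len(truth)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Toy example: 3 of 4 reported proteins are among 6 ground-truth proteins.
reported = {"A", "B", "C", "D"}
truth = {"A", "B", "C", "E", "F", "G"}
p, r, f = prf1(reported, truth)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.75 0.5 0.6
```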

The covering fraction has a significant influence on the performance of our algorithms. To clarify this point, we vary the covering fraction from 0.1 to 0.9 and check the identification performance of the different algorithms. The results in Fig. 4(a), Fig. 4(b) and Fig. 4(c) show that these algorithms exhibit similar behavior under the same parameter setting. The precision decreases and the recall increases as the covering fraction increases. This is easy to understand since a larger covering fraction makes the algorithms report more proteins.

Since there is no ground truth for the covering fraction in the real MS data, we evaluate the covering fraction estimation method based on the identification performance. More precisely, if the estimated covering fraction is very close to the true percentage of non-noisy peaks, it is reasonable to expect that such a parameter setting will yield good identification performance. Fig. 4(d) plots the covering efficiency of different candidate fraction values. It shows that we obtain the maximum covering efficiency when the covering fraction is 0.2. According to the covering fraction estimation method, we would therefore recommend 0.2 to the user. Fig. 4(c) shows that this setting provides the best overall identification performance (F1-measure) among all candidate values. Furthermore, we observe that the trend of the covering efficiency coincides with the trend of the F1-measure very well.

3.2.2 MS/MS Data

To demonstrate the benefit of partial coverage over full coverage in the context of MS/MS-based protein inference, we vary the covering fraction p from 0.1 to 1. Note that p = 1 corresponds to full coverage. Fig. 5 presents the experimental results. It reveals that leaving a small portion of peptides uncovered improves


the overall identification performance in terms of F1-measure. This is because there are still some false identifications among the peptide-spectrum pairs. The partial set covering model can help alleviate the effect of these peptides in the stage of protein inference. Since the expected number of false identifications is relatively small, it is reasonable to assume that the percentage of false identifications is around 5%-10% (i.e., the covering fraction is 90%-95%). Here we would like to know whether we can obtain a good estimate automatically using our estimation algorithm. In Fig. 5(d), we use the covering efficiency as the evaluation criterion to find the best covering fraction. It indicates that 0.8 beats all other candidate values. If we assume that 0.9 is the ground-truth covering fraction, the estimated covering fraction does not match the ground truth perfectly. The reason is that the estimation result is less accurate when the ground-truth covering fraction is relatively large, as previously discussed in the simulation study section. On the other hand, we would also like to point out that this parameter setting still outperforms the full covering model and is the second best among all candidates.

3.2.3 The Combination of MS Data and MS/MS Data

The proposed algorithms also work when we combine MS data with MS/MS data. Since MS/MS-based peptide identification results are more informative than single-stage MS peaks, we use λ = 0.9 as the "benefit" value of each identified peptide. Correspondingly, the "benefit" value of each single-MS peak is 0.1. Fig. 6 plots the experimental results when the covering fraction varies from 0.1 to 0.9. This figure is similar to Fig. 5 since we place more weight on the MS/MS data. In Fig. 6(d), our estimation algorithm identifies 0.6 as the best covering fraction through the evaluation of covering efficiency values. Again, the estimated value is only the second best: the best covering fraction in terms of F1-measure is 0.7.
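A hedged sketch of the combined model described above: each element to be covered carries a "benefit" weight (0.9 for an MS/MS-identified peptide, 0.1 for a single-MS peak), and a greedy solver selects proteins until a fraction p of the total benefit is covered. The element names and the greedy solver below are illustrative assumptions, not the paper's code.

```python
def greedy_weighted_cover(candidates, benefit, p):
    """Greedy partial cover where each element has a 'benefit' weight.

    candidates: dict protein -> set of elements (peptide ids or peak ids).
    benefit: dict element -> weight (e.g. 0.9 for MS/MS peptides, 0.1 for MS peaks).
    p: fraction of the total benefit that must be covered.
    """
    total = sum(benefit.values())
    uncovered = set(benefit)
    gained, chosen = 0.0, []
    while gained < p * total:
        # Pick the protein whose uncovered elements carry the most benefit.
        best = max(candidates,
                   key=lambda q: sum(benefit[e] for e in candidates[q] & uncovered))
        gain = candidates[best] & uncovered
        g = sum(benefit[e] for e in gain)
        if g == 0:  # no remaining benefit can be covered
            break
        chosen.append(best)
        uncovered -= gain
        gained += g
    return chosen

# Toy example: one MS/MS peptide outweighs two single-MS peaks.
benefit = {"pep1": 0.9, "peak1": 0.1, "peak2": 0.1}
candidates = {"P1": {"pep1"}, "P2": {"peak1", "peak2"}}
print(greedy_weighted_cover(candidates, benefit, 0.7))  # → ['P1']
```

Because the MS/MS peptide carries most of the benefit, covering it alone already satisfies p = 0.7, which mirrors how the combined model places more weight on the MS/MS evidence.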
To illustrate the benefit of including single-stage MS data in protein mixture identification, we compare the identification results obtained with different types of data in Table 3. Not surprisingly, the MS/MS-based method performs better than the MS-based method, and the performance gap is still considerable. The promising result is that combining MS data and MS/MS data consistently improves the overall identification performance (F1-measure).
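The covering fraction estimation procedure used throughout these experiments can be sketched as follows. This assumes a particular reading of the covering-efficiency score (u/v)f, namely the product of u/v (peaks covered per selected protein) and the candidate fraction f; the greedy solver and the toy data are likewise illustrative assumptions, not the paper's implementation.

```python
def greedy(proteins, peaks, p):
    """Minimal greedy PSC solver used only to drive the estimation loop."""
    uncovered, chosen = set(peaks), []
    while len(peaks) - len(uncovered) < p * len(peaks):
        best = max(proteins, key=lambda q: len(proteins[q] & uncovered))
        gain = proteins[best] & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen

def estimate_covering_fraction(solver, proteins, peaks,
                               candidates=(0.1, 0.2, 0.3, 0.4, 0.5,
                                           0.6, 0.7, 0.8, 0.9)):
    """Pick the candidate f maximizing the covering-efficiency score.

    For each candidate f, run the PSC solver and score its solution by
    (u / v) * f, with u = peaks covered and v = proteins selected
    (an assumed reading of the paper's (u/v)f notation).
    """
    best_f, best_score = None, float("-inf")
    for f in candidates:
        chosen = solver(proteins, peaks, f)
        if not chosen:
            continue
        covered = len(set().union(*(proteins[q] for q in chosen)) & peaks)
        score = (covered / len(chosen)) * f
        if score > best_score:
            best_f, best_score = f, score
    return best_f

# Toy example: P1 explains 4 of 10 peaks; the other 6 peaks are "noise"
# covered only by six low-efficiency one-peak proteins, so the true
# covering fraction is 0.4.
peaks = set(range(10))
proteins = {"P1": {0, 1, 2, 3}}
proteins.update({f"N{i}": {i + 3} for i in range(1, 7)})
print(estimate_covering_fraction(greedy, proteins, peaks))  # → 0.4
```

In this toy setting the score peaks exactly at the true fraction: pushing f above 0.4 forces the solver to add one-peak "noise" proteins, which drags u/v down faster than f rises.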

4 CONCLUSIONS

In this paper, we proposed a new unified framework, the partial set covering (PSC) model, to identify protein mixtures using both MS data and MS/MS data. The experimental results demonstrate the advantages of our model and the three PSC-based methods. In our future work, we plan to incorporate prior biological knowledge into the model to further improve the identification performance. For instance, known factors


[Fig. 4 panels: (a) precision, (b) recall, (c) F1-measure and (d) covering efficiency of Greedy, LP Rounding and Dual Feasible as the covering fraction varies from 0.1 to 0.9.]

Fig. 4. The effect and estimation of covering fraction on the real single-stage MS data. The number of candidate proteins is set to 5000. In the first three sub-figures, we plot precision, recall and F1-measure of the proposed algorithms, respectively. In the last sub-figure, we describe the covering efficiency values when different PSC algorithms are used in the estimation procedure.

[Fig. 5 panels: (a) precision, (b) recall, (c) F1-measure and (d) covering efficiency of Greedy, LP Rounding and Dual Feasible as the covering fraction varies from 0.1 to 1.]

Fig. 5. Performance test and covering fraction estimation on the real MS/MS data. From (a) to (c), we compare the performance of the partial covering model with that of the full covering model. In (d), we plot the covering fraction estimation results in terms of covering efficiency.


[Fig. 6 panels: (a) precision, (b) recall, (c) F1-measure and (d) covering efficiency of Greedy, LP Rounding and Dual Feasible as the covering fraction varies from 0.1 to 0.9.]

Fig. 6. Performance test and covering fraction estimation on the combination of MS data and MS/MS data. The number of candidate proteins is set to 5000 for MS data. In (a), (b) and (c), we plot the identification performance of different algorithms when the covering fraction varies from 0.1 to 0.9. In (d), we plot the covering efficiency when the covering fraction is set to different candidate values.

TABLE 3
Performance comparison using single MS data, tandem MS data and the combination of MS data and MS/MS data. Since our objective here is to investigate the effect of using different types of data, we set the covering fraction p to its known best value, i.e., p = 0.2 (single MS), p = 0.9 (tandem MS) and p = 0.7 (MS and MS/MS).

Algorithm      Data           Precision  Recall  F1-Measure
Greedy         Single MS      84%        53%     65%
Greedy         Tandem MS      83%        78%     80%
Greedy         MS and MS/MS   81%        86%     83%
LP Rounding    Single MS      87%        55%     67%
LP Rounding    Tandem MS      75%        73%     74%
LP Rounding    MS and MS/MS   81%        86%     83%
Dual Feasible  Single MS      87%        55%     67%
Dual Feasible  Tandem MS      83%        78%     80%
Dual Feasible  MS and MS/MS   79%        86%     82%

such as post-translational modifications (PTMs) and partial digestions may generate "noisy peaks". Such information can help us distinguish true peptide-related signals from noise, leading to a higher identification rate.

ACKNOWLEDGEMENTS

The comments and suggestions from the anonymous reviewers greatly improved the paper. This work was supported by the general research fund 621707 from the Hong Kong Research Grants Council, a research proposal competition award RPC07/08.EG25 and a postdoctoral fellowship from the Hong Kong University of Science and Technology. The source codes and data are available at: http://bioinformatics.ust.hk/PSCMixture.rar.

REFERENCES

[1] L. McHugh and J. W. Arthur, "Computational methods for protein identification from mass spectrometry data," PLoS Computational Biology, vol. 4, no. 2, p. e12, 2008.
[2] W. J. Henzel, T. M. Billeci, J. T. Stults, S. C. Wong, C. Grimley, and C. Watanabe, "Identifying proteins from two-dimensional gels by molecular mass searching of peptide fragments in protein sequence databases," Proceedings of the National Academy of Sciences of the United States of America, vol. 90, no. 11, pp. 5011–5015, 1993.
[3] P. James, M. Quadroni, E. Carafoli, and G. Gonnet, "Protein identification by mass profile fingerprinting," Biochemical and Biophysical Research Communications, vol. 195, no. 1, pp. 58–64, 1993.
[4] M. Mann, P. Hojrup, and P. Roepstorff, "Use of mass spectrometric molecular weight information to identify proteins in sequence databases," Biological Mass Spectrometry, vol. 22, no. 6, pp. 338–345, 1993.
[5] D. J. Pappin, P. Hojrup, and A. J. Bleasby, "Rapid identification of proteins by peptide-mass fingerprinting," Current Biology, vol. 3, no. 6, pp. 327–332, 1993.
[6] J. R. Yates, S. Speicher, P. R. Griffin, and T. Hunkapiller, "Peptide mass maps: a highly informative approach to protein identification," Analytical Biochemistry, vol. 214, no. 2, pp. 397–408, 1993.
[7] J. K. Eng, A. L. McCormack, and J. R. Yates, "An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database," Journal of the American Society for Mass Spectrometry, vol. 5, no. 11, pp. 976–989, 1994.
[8] V. Dancik, T. A. Addona, K. R. Clauser, J. E. Vath, and P. A. Pevzner, "De novo peptide sequencing via tandem mass spectrometry," Journal of Computational Biology, vol. 6, no. 3/4, pp. 327–342, 1999.
[9] B. Lu, A. Motoyama, C. Ruse, J. Venable, and J. R. Yates, "Improving protein identification sensitivity by combining MS and MS/MS information for shotgun proteomics using LTQ-Orbitrap high mass accuracy data," Analytical Chemistry, vol. 80, no. 6, pp. 2018–2025, 2008.
[10] D. Mantini, F. Petrucci, P. D. Boccio, D. Pieragostino, M. D. Nicola, A. Lugaresi, G. Federici, P. Sacchetta, C. D. Ilio, and A. Urbani, "Independent component analysis for the extraction of reliable protein signal profiles from MALDI-TOF mass spectra," Bioinformatics, vol. 24, no. 1, pp. 63–70, 2008.
[11] D. N. Perkins, D. J. C. Pappin, D. M. Creasy, and J. S. Cottrell, "Probability-based protein identification by searching sequence databases using mass spectrometry data," Electrophoresis, vol. 20, no. 18, pp. 3551–3567, 1999.
[12] W. Zhang and B. T. Chait, "ProFound: an expert system for protein identification using mass spectrometric peptide mapping information," Analytical Chemistry, vol. 72, no. 11, pp. 2482–2489, 2000.
[13] P. R. Baker and K. R. Clauser, "Protein Prospector." [Online]. Available: http://prospector.ucsf.edu
[14] M. Tuloup, C. Hernandez, I. Coro, C. Hoogland, P.-A. Binz, and R. D. Appel, "Aldente and BioGraph: an improved peptide mass fingerprinting protein identification environment," in Swiss Proteomics Society 2003 Congress: Understanding Biological Systems through Proteomics, 2003, pp. 174–176. [Online]. Available: http://www.expasy.org/tools/aldente/
[15] W. J. Henzel, C. Watanabe, and J. T. Stults, "Protein identification: the origins of peptide mass fingerprinting," Journal of the American Society for Mass Spectrometry, vol. 14, no. 9, pp. 931–942, 2003.
[16] I. Shadforth, D. Crowther, and C. Bessant, "Protein and peptide identification algorithms using MS for use in high-throughput, automated pipelines," Proteomics, vol. 5, no. 16, pp. 4082–4095, 2005.
[17] J. Eriksson and D. Fenyö, "Probity: a protein identification algorithm with accurate assignment of the statistical significance of the results," Journal of Proteome Research, vol. 3, no. 1, pp. 32–36, 2004.
[18] J. Magnin, A. Masselot, C. Menzel, and J. Colinge, "OLAV-PMF: a novel scoring scheme for high-throughput peptide mass fingerprinting," Journal of Proteome Research, vol. 3, no. 1, pp. 55–60, 2004.
[19] J. A. Siepen, E. J. Keevil, D. Knight, and S. J. Hubbard, "Prediction of missed cleavage sites in tryptic peptides aids protein identification in proteomics," Journal of Proteome Research, vol. 6, no. 1, pp. 399–408, 2007.
[20] Z. Song, L. Chen, A. Ganapathy, X.-F. Wan, L. Brechenmacher, N. Tao, D. Emerich, G. Stacey, and D. Xu, "Development and assessment of scoring functions for protein identification using PMF data," Electrophoresis, vol. 28, no. 5, pp. 864–870, 2007.
[21] D. Yang, K. Ramkissoon, E. Hamlett, and M. C. Giddings, "High-accuracy peptide mass fingerprinting using peak intensity data with machine learning," Journal of Proteome Research, vol. 7, no. 1, pp. 62–69, 2008.
[22] Z. He, C. Yang, and W. Yu, "Peak bagging for peptide mass fingerprinting," Bioinformatics, vol. 24, no. 10, pp. 1293–1299, 2008.
[23] O. N. Jensen, A. V. Podtelejnikov, and M. Mann, "Identification of the components of simple protein mixtures by high-accuracy peptide mass mapping and database searching," Analytical Chemistry, vol. 69, no. 23, pp. 4741–4750, 1997.
[24] Z. Y. Park and D. H. Russell, "Identification of individual proteins in complex protein mixtures by high-resolution, high-mass-accuracy MALDI TOF-mass spectrometry analysis of in-solution thermal denaturation/enzymatic digestion," Analytical Chemistry, vol. 73, no. 11, pp. 2558–2564, 2001.
[25] J. Eriksson and D. Fenyö, "Protein identification in complex mixtures," Journal of Proteome Research, vol. 4, no. 2, pp. 387–393, 2005.
[26] P. Slavik, "Improved performance of the greedy algorithm for partial cover," Information Processing Letters, vol. 64, no. 5, pp. 251–254, 1997.
[27] M. Bläser, "Computing small partial coverings," Information Processing Letters, vol. 85, no. 6, pp. 327–331, 2003.
[28] R. Gandhi, S. Khuller, and A. Srinivasan, "Approximation algorithms for partial covering problems," Journal of Algorithms, vol. 53, no. 1, pp. 55–84, 2004.
[29] T. Fujito, "On combinatorial approximation of covering 0-1 integer programs and partial set cover," Journal of Combinatorial Optimization, vol. 8, no. 4, pp. 439–452, 2004.
[30] J. Könemann, O. Parekh, and D. Segev, "A unified approach to approximating partial covering problems," in Proceedings of the 14th Annual European Symposium on Algorithms (ESA 2006), ser. Lecture Notes in Computer Science, Y. Azar and T. Erlebach, Eds., vol. 4168. Springer, 2006, pp. 468–479.
[31] J. Mestre, "Lagrangian relaxation and partial cover," in Proceedings of the 25th International Symposium on Theoretical Aspects of Computer Science (STACS 2008), S. Albers and P. Weil, Eds., Dagstuhl, Germany, 2008, pp. 539–550. [Online]. Available: http://drops.dagstuhl.de/opus/volltexte/2008/1315
[32] D. S. Hochbaum, "Approximating covering and packing problems: set cover, vertex cover, independent set, and related problems," in Approximation Algorithms for NP-Hard Problems. PWS Publishing Company, 1997, pp. 94–143.
[33] J. Samuelsson, D. Dalevi, F. Levander, and T. Rögnvaldsson, "Modular, scriptable and automated analysis tools for high-throughput peptide mass fingerprinting," Bioinformatics, vol. 20, no. 18, pp. 3628–3635, 2004.
[34] A. I. Nesvizhskii and R. Aebersold, "Interpretation of shotgun proteomic data: the protein inference problem," Molecular & Cellular Proteomics, vol. 4, no. 10, pp. 1419–1440, 2005.
[35] A. I. Nesvizhskii, A. Keller, E. Kolker, and R. Aebersold, "A statistical model for identifying proteins by tandem mass spectrometry," Analytical Chemistry, vol. 75, no. 17, pp. 4646–4658, 2003.
[36] B. Zhang, M. C. Chambers, and D. L. Tabb, "Proteomic parsimony through bipartite graph analysis improves accuracy and transparency," Journal of Proteome Research, vol. 6, no. 9, pp. 3549–3557, 2007.
[37] P. Alves, R. J. Arnold, M. V. Novotny, P. Radivojac, J. P. Reilly, and H. Tang, "Advancement in protein inference from shotgun proteomics using peptide detectability," in Proceedings of the 2007 Pacific Symposium on Biocomputing (PSB 2007), 2007, pp. 409–420.
[38] Y. F. Li, R. J. Arnold, Y. Li, P. Radivojac, Q. Sheng, and H. Tang, "A Bayesian approach to protein inference problem in shotgun proteomics," in Proceedings of the 12th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2008), ser. LNBI, M. Vingron and L. Wong, Eds., vol. 4955. Springer, 2008, pp. 167–180.
[39] K. Baerenfaller, J. Grossmann, M. A. Grobei, R. Hull, M. Hirsch-Hoffmann, S. Yalovsky, P. Zimmermann, U. Grossniklaus, W. Gruissem, and S. Baginsky, "Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics," Science, vol. 320, pp. 938–941, 2008.
[40] N. E. Castellana, S. H. Payne, Z. Shen, M. Stanke, V. Bafna, and S. P. Briggs, "Discovery and revision of Arabidopsis genes by proteogenomics," Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 52, pp. 21034–21038, 2008.
[41] J. E. Elias and S. P. Gygi, "Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry," Nature Methods, vol. 4, no. 3, pp. 207–214, 2007.
[42] N. Jaitly, A. Mayampurath, K. Littlefield, J. N. Adkins, G. A. Anderson, and R. D. Smith, "Decon2LS: an open-source software package for automated processing and visualization of high resolution mass spectrometry data," BMC Bioinformatics, vol. 10, p. 87, 2009.
[43] M. E. Monroe, N. Tolic, N. Jaitly, J. L. Shaw, J. N. Adkins, and R. D. Smith, "VIPER: an advanced software package to support high-throughput LC-MS peptide identification," Bioinformatics, vol. 23, no. 15, pp. 2021–2023, 2007.
[44] R. Craig and R. C. Beavis, "TANDEM: matching proteins with tandem mass spectra," Bioinformatics, vol. 20, no. 9, pp. 1466–1467, 2004.

Zengyou He received the BS, MS and PhD degrees in computer science from Harbin Institute of Technology, China, in 2000, 2002 and 2006, respectively. Currently, he is a research associate in the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology. His research interests include data mining and computational mass spectrometry.

Can Yang received the bachelor’s degree and master’s degree in automatic control from Zhejiang University, China, in 2003 and 2006, respectively. He is currently working toward the PhD degree in the Department of Electronic and Computer Engineering at the Hong Kong University of Science and Technology. He is interested in bioinformatics, machine learning and data mining.

Weichuan Yu received the Ph.D. degree in Computer Vision and Image Analysis from University Kiel, Germany in 2001. He was a postdoctoral associate at Yale University from 2001 to 2004 and a research faculty member in the Center for Statistical Genomics and Proteomics at Yale University from 2004 to 2006. Since 2006, he has been an assistant professor in the Department of Electronic and Computer Engineering at the Hong Kong University of Science and Technology. He is interested in computational analysis problems with biological and medical applications. He has published papers on a variety of topics including bioinformatics, computational biology, biomedical imaging, signal processing and computer vision.

