Extracting Protein-Protein interactions using simple contextual features

Leif Arda Nielsen
School of Informatics
University of Edinburgh
[email protected]

1 Introduction

There has been much interest in recent years in extracting Protein-Protein Interaction (PPI) information automatically from scientific publications. This interest stems from the need to organise the large body of literature that research generates and that is collected at sites such as PubMed. Easy access to the information contained in published work is vital for facilitating new research, but the rate of publication makes manual collection of all such data infeasible. Information Extraction approaches based on Natural Language Processing can be used, and are already being used, to facilitate this process. The dominant approach so far has been the use of hand-built, knowledge-based systems, working at levels ranging from surface syntax to full parses (Blaschke and Valencia, 2002; Huang et al., 2004; Plake et al., 2005; Rebholz-Schuhmann et al., 2005; Yakushiji et al., 2005). A similar approach to the one presented here is described in (Sugiyama et al., 2003), but the results cannot be compared because the datasets differ and only limited information is available about their methods.

2 Data

A gene-interaction corpus derived from the BioCreAtIvE task-1A data is used for the experiments. This data was kindly made available by Jörg Hakenberg¹ and is described in (Plake et al., 2005). The data consists of 1000 sentences marked up for POS tags, genes (both genes and proteins are marked as ‘gene’; the terms will be used interchangeably in this paper) and iWords. The corpus contains 255 relations, all of which are intra-sentential, and the “interaction word” (iWord)² for each relation is also marked up. I utilise the annotated entities, and focus only on relation extraction. The data contains directionality information for each relation, denoting which entity is the ‘agent’ and which the ‘target’, or denoting that this distinction cannot be made. This information is not used in the current experiments, as my main aim is simply to identify relations between entities; deriving directionality is left for future work.

¹ See http://www.informatik.hu-berlin.de/~hakenber/publ/suppl/sac05/

I use the Naive Bayes, KStar, and JRip classifiers from the Weka toolkit, Zhang Le’s Maximum Entropy classifier (Maxent), TiMBL, and LibSVM to test performance. All experiments are done using 10-fold cross-validation. Performance is measured using Recall, Precision and F1.
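As a rough illustration of the evaluation measures named above, the following Python sketch computes Precision, Recall and F1 by comparing a set of predicted relations against a set of gold relations. The relation key `(sentence_id, protein_1, protein_2)` and the toy data are assumptions made for the example; they are not the format of the corpus or of the original experiments.

```python
from typing import Set, Tuple

# Hypothetical relation key: (sentence id, first protein, second protein).
Relation = Tuple[str, str, str]


def precision_recall_f1(predicted: Set[Relation],
                        gold: Set[Relation]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for a set of predicted relations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: two of three predictions are correct, two of four gold relations are found.
gold = {("s1", "A", "B"), ("s1", "A", "C"), ("s2", "D", "E"), ("s3", "F", "G")}
predicted = {("s1", "A", "B"), ("s2", "D", "E"), ("s2", "D", "X")}

p, r, f = precision_recall_f1(predicted, gold)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")  # prints: P=0.667 R=0.500 F1=0.571
```

Under 10-fold cross-validation, one plausible way to use such a function is to pool the predictions from the ten held-out folds before scoring; the paper does not state whether scores were pooled or averaged per fold.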

3 Experiments

Each possible combination of a protein pair and an iWord in a sentence was generated as a candidate relation ‘triple’, which combines the relation extraction task with the additional task of finding the iWord that describes each relation. 3400 such triples occur in the data. After each instance is given a probability by the classifiers, the highest scoring instance for each protein pairing is compared to a threshold to decide the outcome. Correct triples are those that match the iWord assigned to a PPI by the annotators.

² A limited set of words that have been determined to be informative of when a PPI occurs, such as interact, bind, inhibit, phosphorylation. See footnote 1 for the complete list.

For each instance, a list of features was used to construct a ‘generic’ model:

  interindices: The combination of the indices of the proteins of the interaction; “P1-position:P2-position”
  interwords: The combination of the lexical forms of the proteins of the interaction; “P1:P2”
  p1prevword, p1currword, p1nextword: The lexical form of P1, and the two words surrounding it
  p2prevword, p2currword, p2nextword: The lexical form of P2, and the two words surrounding it
  p2pdistance: The distance, in tokens, between the two proteins
  inbetween: The number of other identified proteins between the two proteins
  iWord: The lexical form of the iWord
  iWordPosTag: The POS tag of the iWord
  iWordPlacement: Whether the iWord is between, before or after the proteins
  iWord2ProteinDistance: The distance, in words, between the iWord and the protein nearest to it

A second model adds more domain-specific features to those of the ‘generic’ model:

  patterns: The 22 syntactic patterns used in (Plake et al., 2005) are each used as boolean features³.
  lemmas and stems: Lemma and stem information was used instead of surface forms, using a system developed for the biomedical domain.
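The candidate generation, the ‘generic’ features and the final decision step can be sketched in Python as follows. This is only an illustration under stated assumptions: each protein and iWord is taken to be a single token identified by its index, distances are plain index differences, `<PAD>` is a made-up marker for positions outside the sentence, and the threshold test uses `>=`. The sentence, scores and function names are hypothetical and do not come from the original system.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Triple = Tuple[int, int, int]  # (P1 token index, P2 token index, iWord token index)


def candidate_triples(proteins: List[int], iwords: List[int]) -> List[Triple]:
    """Combine every protein pair in a sentence with every iWord in it."""
    return [(p1, p2, w)
            for p1, p2 in combinations(sorted(proteins), 2)
            for w in iwords]


def generic_features(tokens: List[str], pos_tags: List[str],
                     proteins: List[int], triple: Triple) -> Dict[str, object]:
    """Build the 'generic' feature set for one candidate triple."""
    p1, p2, w = triple

    def word(i: int) -> str:
        # Assumption: out-of-sentence positions get a padding symbol.
        return tokens[i] if 0 <= i < len(tokens) else "<PAD>"

    if w < p1:
        placement = "before"
    elif w > p2:
        placement = "after"
    else:
        placement = "between"
    return {
        "interindices": f"{p1}:{p2}",
        "interwords": f"{tokens[p1]}:{tokens[p2]}",
        "p1prevword": word(p1 - 1), "p1currword": word(p1), "p1nextword": word(p1 + 1),
        "p2prevword": word(p2 - 1), "p2currword": word(p2), "p2nextword": word(p2 + 1),
        "p2pdistance": p2 - p1,
        "inbetween": sum(1 for p in proteins if p1 < p < p2),
        "iWord": tokens[w],
        "iWordPosTag": pos_tags[w],
        "iWordPlacement": placement,
        "iWord2ProteinDistance": min(abs(w - p1), abs(w - p2)),
    }


def select_relations(scores: Dict[Triple, float], threshold: float) -> List[Triple]:
    """Keep the best-scoring triple of each protein pair if it clears the threshold."""
    best: Dict[Tuple[int, int], Tuple[Triple, float]] = {}
    for triple, score in scores.items():
        pair = triple[:2]
        if pair not in best or score > best[pair][1]:
            best[pair] = (triple, score)
    return [triple for triple, score in best.values() if score >= threshold]


# Toy sentence with placeholder protein names; not taken from the corpus.
tokens = ["ProtA", "binds", "ProtB", "and", "inhibits", "ProtC"]
pos_tags = ["NN", "VBZ", "NN", "CC", "VBZ", "NN"]
proteins, iwords = [0, 2, 5], [1, 4]

triples = candidate_triples(proteins, iwords)  # 3 protein pairs x 2 iWords = 6 triples
features = generic_features(tokens, pos_tags, proteins, triples[0])

# Made-up classifier probabilities, one per candidate triple.
scores = {(0, 2, 1): 0.9, (0, 2, 4): 0.2, (0, 5, 1): 0.3,
          (0, 5, 4): 0.4, (2, 5, 1): 0.1, (2, 5, 4): 0.7}
print(select_relations(scores, threshold=0.5))  # prints: [(0, 2, 1), (2, 5, 4)]
```

In this toy run the pair (ProtA, ProtC) is rejected because its best triple scores 0.4, below the threshold, while the other two pairs are accepted together with the iWord of their best-scoring triple.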

4 Results

Tables 1 and 2 show the results for the two models described above. The system achieves a peak performance of 59.2% F1 (LibSVM with the extended model), which represents a noticeable improvement over previous results on the same dataset (52% F1 (Plake et al., 2005)), and demonstrates the feasibility of the approach adopted. Simple contextual features are quite informative for the task, but significant gains can be made using more elaborate methods.

Algorithm     Recall (%)   Precision (%)   F1 (%)
Naive Bayes   61.3         35.6            45.1
KStar         65.2         41.6            50.8
JRip          66.0         45.4            53.8
Maxent        58.5         48.2            52.9
TiMBL         49.0         41.1            44.7
LibSVM        49.4         56.8            52.9

Table 1: Results using ‘generic’ model

Algorithm     Recall (%)   Precision (%)   F1 (%)
Naive Bayes   64.8         44.1            52.5
KStar         60.9         45.0            51.8
JRip          44.3         45.7            45.0
Maxent        57.7         56.6            57.1
TiMBL         42.7         74.0            54.1
LibSVM        54.5         64.8            59.2

Table 2: Results using extended model

³ These patterns are in regular expression form, e.g. “P1 word{0,n} Iverb word{0,m} P2”. This particular pattern matches sentences where a protein is followed, within at most n words, by an iWord that is a verb, which is in turn followed, within at most m words, by another protein. In their paper, (Plake et al., 2005) optimise the values for n and m using Genetic Algorithms, but I simply set them all to 5, which is what they report as being the best unoptimized setting.
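To make the pattern quoted in footnote 3 concrete, the sketch below shows one way a pattern such as “P1 word{0,n} Iverb word{0,m} P2” could be turned into a boolean feature. It rests on assumptions that are not taken from (Plake et al., 2005): every token is first reduced to a one-letter role (`P` for a protein, `V` for an iWord that is a verb, `W` for any other word), and “word” matches only `W` tokens. The function name and role encoding are hypothetical; the defaults n = m = 5 follow the setting stated in the footnote.

```python
import re
from typing import List


def matches_p1_iverb_p2(roles: List[str], p1: int, p2: int,
                        n: int = 5, m: int = 5) -> bool:
    """True if the tokens from p1 to p2 fit 'P1 word{0,n} Iverb word{0,m} P2'.

    roles holds one letter per token: 'P' (protein), 'V' (verbal iWord),
    'W' (any other word). This encoding is an assumption of the sketch.
    """
    span = "".join(roles[p1:p2 + 1])
    # Protein, up to n plain words, a verbal iWord, up to m plain words, protein.
    regex = rf"PW{{0,{n}}}VW{{0,{m}}}P"
    return re.fullmatch(regex, span) is not None


# Toy role sequences, not taken from the corpus.
print(matches_p1_iverb_p2(["P", "V", "W", "P"], 0, 3))            # prints: True
print(matches_p1_iverb_p2(["P", "V"] + ["W"] * 6 + ["P"], 0, 8))  # prints: False
print(matches_p1_iverb_p2(["P", "W", "P"], 0, 2))                 # prints: False
```

The second call fails because six words separate the verb from the second protein, exceeding m = 5; the third fails because no verbal iWord lies between the two proteins.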

References

C. Blaschke and A. Valencia. 2002. The frame-based module of the suiseki information extraction system. IEEE Intelligent Systems, (17):14–20.

Minlie Huang, Xiaoyan Zhu, Yu Hao, Donald G. Payan, Kunbin Qu, and Ming Li. 2004. Discovering patterns to extract protein-protein interactions from full texts. Bioinformatics, 20(18):3604–3612.

Conrad Plake, Jörg Hakenberg, and Ulf Leser. 2005. Optimizing syntax-patterns for discovering protein-protein-interactions. In Proc ACM Symposium on Applied Computing, SAC, Bioinformatics Track, volume 1, pages 195–201, Santa Fe, USA, March.

D. Rebholz-Schuhmann, H. Kirsch, and F. Couto. 2005. Facts from text–is text mining ready to deliver? PLoS Biol, 3(2).

Kazunari Sugiyama, Kenji Hatano, Masatoshi Yoshikawa, and Shunsuke Uemura. 2003. Extracting information on protein-protein interactions from biological literature based on machine learning approaches. Genome Informatics, 14:699–700.

Akane Yakushiji, Yusuke Miyao, Yuka Tateisi, and Jun’ichi Tsujii. 2005. Biomedical information extraction with predicate-argument structure patterns. In Proceedings of the First International Symposium on Semantic Mining in Biomedicine, pages 60–69.
