
BioLINK

Moara project: a flexible and trainable Java library for the gene/protein recognition and normalization tasks

Mariana L. Neves¹,*, José Maria Carazo¹ and Alberto Pascual-Montano¹,²

¹ Biocomputing Unit, Centro Nacional de Biotecnología – CSIC, C/ Darwin 3, Campus de Cantoblanco, 28049, Madrid, Spain
² Departamento de Arquitectura de Computadores, Universidad Complutense de Madrid, Facultad de Ciencias Físicas, 28040, Madrid, Spain

ABSTRACT

Motivation: Gene/protein recognition and normalization are important prerequisite steps for many biological text mining tasks, such as the extraction of protein-protein interactions. Although great efforts have been dedicated to these problems and effective solutions have been reported, easily integrated tools that perform these tasks remain scarce. We therefore propose Moara, a Java library that implements gene/protein tagging and normalization steps based on machine learning approaches. The system may be trained with extra documents for the recognition procedure, and new organisms may be added in the normalization step. The novelty of the methodology used in Moara lies in the design of a system that is not tailored to a specific organism and therefore does not need any organism-dependent tuning of the algorithms or of the dictionaries it uses. Moara can be used either as a standalone application or incorporated into a text mining system. Moara is available at: http://moara.dacya.ucm.es.

1 INTRODUCTION

Some of the most important steps in the analysis of scientific literature are the extraction and identification of genes and proteins in the text and their association with the corresponding entries in biological databases. These steps are known as gene/protein recognition and normalization, and they are common prerequisites for more complex text mining tasks. The main difficulties of the gene mention and normalization problems lie in the high number of existing gene and protein entities, along with the lack of rules in their nomenclature. First, some of these entities coincide with common English words, which makes their detection in free-form text very complex. Also, names may appear as long descriptive names or as acronyms, which makes identification even more difficult. The situation gets worse because existing biological entities may have their original names changed. Newly discovered biological entities do not alleviate this problem, since their newly assigned names may coincide with existing ones. In the case of the gene normalization task, different organisms need different strategies depending on the complexity of their nomenclature and the degree of ambiguity of the synonyms within an organism and among organisms, because the same mention may refer to distinct entities of one organism, or even of distinct organisms.

Due to their importance, the gene/protein extraction and normalization problems have received a lot of attention from the scientific community. One clear example is the BioCreative evaluation [1-3], a community-wide effort for evaluating text mining systems applied to the biological domain. Although some taggers are freely available, a mix of them is desirable in order to extract most of the mentions from a text, as none of them is usually good enough when used alone. Also, despite the great efforts of the scientific community to improve the gene/protein extraction and normalization tasks, reliable systems and dictionaries of synonyms that can be easily integrated into more general text mining systems remain scarce. We therefore propose Moara, a freely available Java library, as an alternative to these systems. Moara has been running for one year, and improvements have been made regarding new functionalities and a more stable version. The gene/protein recognition and normalization tasks are carried out by a Case-Based Reasoning approach (CBR-Tagger) and by a mix of a small amount of organism-dependent knowledge and machine learning methodologies (ML-Normalization), respectively. The system makes use of MySQL databases and two external libraries: the Weka machine learning tool [4] and SecondString¹ for the string distances. Detailed documentation of the available classes, as well as code examples, is presented on the documentation page².

* To whom correspondence should be addressed.
¹ http://secondstring.sourceforge.net/
² http://moara.dacya.ucm.es/documentation.html

© Oxford University Press 2005

2 DESCRIPTION AND USE

2.1 CBR-Tagger

Gene/protein recognition is carried out by CBR-Tagger [5], a tagger based on Case-Based Reasoning (CBR) foundations that, in its initial version [6], participated in the BioCreative 2 Gene Mention task [1]. The tagger is available in five versions according to the datasets used in the training step: the BioCreative 2 Gene Mention task [1] corpus alone (CbrBC2), and the same corpus combined with the BioCreative task 1B [2] corpus for yeast (CbrBC2y), mouse (CbrBC2m), fly (CbrBC2f) or all three of them (CbrBC2ymf), in order to better extract mentions from different organisms. A specific Java class is available for gene/protein extraction (GeneRecognition), with five methods, one for each training set. There is no need to train the system; all five models are included in the specified database. In all cases, the method receives a string argument containing the text from which mentions are to be recognized. The output is a list of GeneMention objects, each of which encapsulates an extracted mention together with its start and end positions in the original text. The example below extracts and prints the mentions present in the specified text:

. . .
String text = "A gene (pkt1) was isolated from " +
    "the filamentous fungus Trichoderma reesei, " +
    "which exhibits high homology with the " +
    "yeast YPK1 and YKR2 (YPK2) genes.";
GeneRecognition gr = new GeneRecognition();
ArrayList<GeneMention> gms = gr.extractForYeast(text);
for (int i=0; i
. . .

CBR-Tagger may be trained with extra corpora; the only requirement is that the documents be provided in the format used in the BioCreative 2 Gene Mention task³, in which the text of the documents and the annotated gene/protein mentions are provided in two distinct files. In addition, the cases already learned by CBR-Tagger from the five training datasets discussed above may be reused, which allows the tagger to be retrained and improved. The code below illustrates the training functionality, combining the data generated when training the tagger on the BioCreative 2 Gene Mention task with the documents provided in the specified files:

. . .
TrainTagger tt = new TrainTagger();
tt.useBC2Data();
tt.readDocuments("train.txt");
tt.readAnnotations("annotations.txt");
tt.train();
. . .
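Before training on a new corpus, it can help to check that each annotation agrees with its sentence. The sketch below is not part of Moara. It assumes the BioCreative 2 Gene Mention layout, in which an annotation line has the form "sentenceId|start end|mention" and the two offsets count characters with whitespace ignored, the end offset being inclusive. The class Bc2Annotation, its methods and the sample line are hypothetical and serve only as an illustration.

```java
/** One annotation line in the (assumed) BioCreative 2 Gene Mention layout. */
final class Bc2Annotation {
    final String sentenceId;
    final int start;      // first character of the mention, whitespace not counted
    final int end;        // last character of the mention (inclusive), whitespace not counted
    final String mention;

    Bc2Annotation(String sentenceId, int start, int end, String mention) {
        this.sentenceId = sentenceId;
        this.start = start;
        this.end = end;
        this.mention = mention;
    }

    /** Parses a line of the assumed form "sentenceId|start end|mention". */
    static Bc2Annotation parse(String line) {
        String[] fields = line.split("\\|", 3);
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected id|start end|mention but got: " + line);
        }
        String[] offsets = fields[1].trim().split("\\s+");
        if (offsets.length != 2) {
            throw new IllegalArgumentException("Expected two offsets but got: " + fields[1]);
        }
        return new Bc2Annotation(fields[0], Integer.parseInt(offsets[0]),
                Integer.parseInt(offsets[1]), fields[2]);
    }

    /** True if the offsets select the mention once whitespace is removed from both strings. */
    boolean matches(String sentence) {
        String compactSentence = sentence.replaceAll("\\s+", "");
        String compactMention = mention.replaceAll("\\s+", "");
        if (start < 0 || start > end || end >= compactSentence.length()) {
            return false;
        }
        return compactSentence.substring(start, end + 1).equals(compactMention);
    }

    public static void main(String[] args) {
        // Illustrative sentence and annotation, not taken from the paper.
        String sentence = "Comparison with alkaline phosphatases and 5-nucleotidase";
        Bc2Annotation a = parse("P00001606T0076|14 33|alkaline phosphatases");
        System.out.println(a.mention + " consistent with sentence: " + a.matches(sentence));
    }
}
```

A check of this kind catches offset errors early, before the two files are passed to the training step.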

2.2 ML-Normalization

The normalization task is accomplished by ML-Normalization, which consists of a flexible matching approach and a machine learning matching approach, currently available for four organisms: Saccharomyces cerevisiae (yeast), Mus musculus (mouse), Drosophila melanogaster (fly) and Homo sapiens (human). However, the system may be trained with new organisms. When more than one identifier matches a given mention, a disambiguation strategy selects the best candidate. This strategy takes into account the text under consideration and a minimum of organism-specific data that is freely available to the scientific community, such as the name, symbol, description and Gene Ontology terms of the genes and proteins of the specified organism. ML-Normalization provides specific classes for the available matching methods, exact (flexible) and machine learning. Both classes receive as arguments the original text (the one presented to CBR-Tagger) and the mentions extracted from it, as a list of GeneMention objects (the output of CBR-Tagger). The use of CBR-Tagger is not mandatory, however, as the GeneMention class provides specific constructors. The examples below illustrate the normalization procedure for both matching methods:

. . .
ExactMatchingNormalization gn = new ExactMatchingNormalization();
gn.useCosineDisambiguation();
gms = gn.normalize(Constant.ORGANISM_YEAST,text,gms);
. . .
MachineLearningNormalization gn = new MachineLearningNormalization();
gn.useNumWordsDisambiguation();
gms = gn.normalize(Constant.ORGANISM_YEAST,text,gms);
. . .
for (int i=0; i
. . .

³ http://biocreative.sourceforge.net/biocreative_1_task1.html

Pre-defined parameters are available for the four organisms discussed above. For each mention, the system saves the normalized identifier, if any, as a GenePrediction object that provides the information related to the normalized entity, such as the dictionary synonym that has been matched to the mention and the score of the disambiguation strategy. All the candidates that have been taken into account during the disambiguation step are also output, together with their respective scores. With respect to the disambiguation procedure, single or multiple selection may be chosen, returning only the best candidate or the several best candidates, respectively. Also, different methods are available for scoring the candidates: cosine similarity, number of common words, or a mix of both (cf. Section 3.2). Additional functions are available for the machine learning matching in order to select the model according to the parameters used in the training of the algorithm (cf. Section 3.2): type of algorithm, selection of the pairs of synonyms and string distance, among others. The system comes with a default model based on the parameters that work best for the four organisms, although it may be trained for other values. Detailed documentation on both training and normalization with the machine learning matching may be found on Moara's documentation page. Even if the system is initially implemented for only four organisms, it may be trained to support others for both matching approaches. For an organism to be included, it is enough to provide the system with its gene data (the gene_info.gz and gene2go.gz files) available at the Entrez Gene FTP site⁴, as shown below for Bos taurus:

. . .
String code = "9913";
String name = "cattle";
String directory = "normalization";
TrainNormalization tn = new TrainNormalization(code);
tn.train(name,directory);
. . .
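To give an idea of the data that such a training step consumes, the sketch below collects the names of the genes of one organism from lines of a gene_info file. It is not Moara's implementation. It relies on the tab-separated layout of NCBI gene_info files (tax_id, GeneID, Symbol, LocusTag, Synonyms, dbXrefs, chromosome, map_location, description, ...), in which synonyms are separated by "|" and an empty field is written as "-". The class GeneInfoDictionary and the two records in the demo are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Collects symbol, synonyms and description per gene for one taxonomy identifier. */
final class GeneInfoDictionary {

    /** Maps each GeneID of the given organism to the names found on its gene_info line. */
    static Map<String, Set<String>> build(List<String> lines, String taxId) {
        Map<String, Set<String>> dictionary = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;                        // skip the header line
            }
            String[] f = line.split("\t", -1);
            if (f.length < 9 || !f[0].equals(taxId)) {
                continue;                        // malformed line or another organism
            }
            Set<String> names = dictionary.computeIfAbsent(f[1], k -> new LinkedHashSet<>());
            addIfPresent(names, f[2]);           // official symbol
            for (String synonym : f[4].split("\\|")) {
                addIfPresent(names, synonym);    // pipe-separated synonyms
            }
            addIfPresent(names, f[8]);           // description
        }
        return dictionary;
    }

    private static void addIfPresent(Set<String> names, String value) {
        if (!value.isEmpty() && !value.equals("-")) {
            names.add(value);
        }
    }

    public static void main(String[] args) {
        // Made-up records for illustration only; real lines come from gene_info.gz.
        List<String> lines = Arrays.asList(
            "#tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\tdbXrefs\tchromosome\tmap_location\tdescription",
            "9913\t1001\tGENE1\t-\tG1|G1A\t-\t1\t-\thypothetical protein one\tprotein-coding",
            "10090\t2002\tGene2\t-\t-\t-\t2\t-\tanother hypothetical protein\tprotein-coding");
        System.out.println(build(lines, "9913"));
    }
}
```

For the real compressed file, the lines could be read through java.util.zip.GZIPInputStream wrapped in a BufferedReader.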

3 METHODS

3.1 CBR-Tagger

CBR-Tagger [5] is based on Case-Based Reasoning foundations [7], a machine learning method that first learns cases from the training documents and then, during the testing step, retrieves the case most similar to a given problem, from which the final solution is derived. One advantage of the algorithm is that it can explain why a certain category has been assigned to a given token, by checking the features that compose the case-solution. Also, the case base may be used as a natural source of knowledge from which to learn extra information about the training dataset, such as the number of tokens (or cases) that share a certain value of a feature. In a first step, cases of the two classes considered here (gene mention or not) are stored in two bases, one for the known and one for the unknown cases [8]. The known cases are used by the system to classify tokens that are not new, i.e. tokens that have appeared in the training documents. The unknown base stores the cases that represent tokens that are not present in the training documents. The main difference between the known and unknown cases is that the former store the token itself, while the latter keep a shape of the token, so that the system can classify unknown tokens by looking for cases with a similar shape. The shape of a token is obtained by transforming it into a sequence of symbols: “A” and “a” for upper and lower case letters, respectively; “1” for numbers; “p” for stopwords; “g” for Greek letters; and “$” for marking 3-letter prefixes and 4-letter suffixes. For example, “Dorsal” is represented by “Aa”, “Bmp4” by “Aa1”, “the” by “p”, “cGKI(alpha)” by “aAAA(g)”, “patterning” by “pat$a” (“$” separates the 3-letter prefix) and “activity” by “a$vity” (“$” separates the 4-letter suffix).
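The shape transformation can be sketched as follows. This is an illustration and not the CBR-Tagger code. The collapsing rules are inferred from the six examples above: a run of lower case letters or of digits becomes a single symbol, while each upper case letter is kept. The stopword and Greek letter lists are tiny placeholders, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Token shapes in the spirit of the examples given for CBR-Tagger. */
final class TokenShape {
    // Placeholder lists; the lists actually used by CBR-Tagger are not given in the text.
    private static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("the", "of", "and", "in"));
    private static final Set<String> GREEK =
        new HashSet<>(Arrays.asList("alpha", "beta", "gamma", "delta", "kappa"));

    /** Basic shape: "p" for stopwords, "g" for Greek letter names, A/a/1 otherwise. */
    static String shape(String token) {
        if (STOPWORDS.contains(token.toLowerCase(Locale.ROOT))) {
            return "p";
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < token.length()) {
            char c = token.charAt(i);
            if (Character.isLetter(c)) {
                int j = i;
                while (j < token.length() && Character.isLetter(token.charAt(j))) {
                    j++;
                }
                String run = token.substring(i, j);
                if (GREEK.contains(run.toLowerCase(Locale.ROOT))) {
                    out.append('g');
                } else {
                    appendLetters(run, out);
                }
                i = j;
            } else if (Character.isDigit(c)) {
                while (i < token.length() && Character.isDigit(token.charAt(i))) {
                    i++;
                }
                out.append('1');                 // a run of digits collapses to one symbol
            } else {
                out.append(c);                   // punctuation is kept as it is
                i++;
            }
        }
        return out.toString();
    }

    /** Each upper case letter gives "A"; a run of other letters gives a single "a". */
    private static void appendLetters(String run, StringBuilder out) {
        boolean inLowerRun = false;
        for (char c : run.toCharArray()) {
            if (Character.isUpperCase(c)) {
                out.append('A');
                inLowerRun = false;
            } else if (!inLowerRun) {
                out.append('a');
                inLowerRun = true;
            }
        }
    }

    /** Keeps the first three characters and shapes the rest, e.g. "pat$a". */
    static String prefixShape(String token) {
        if (token.length() <= 3) {
            return shape(token);
        }
        return token.substring(0, 3) + "$" + shape(token.substring(3));
    }

    /** Shapes the beginning and keeps the last four characters, e.g. "a$vity". */
    static String suffixShape(String token) {
        if (token.length() <= 4) {
            return shape(token);
        }
        int cut = token.length() - 4;
        return shape(token.substring(0, cut)) + "$" + token.substring(cut);
    }

    public static void main(String[] args) {
        for (String t : new String[] {"Dorsal", "Bmp4", "the", "cGKI(alpha)"}) {
            System.out.println(t + " -> " + shape(t));
        }
        System.out.println("patterning -> " + prefixShape("patterning"));
        System.out.println("activity -> " + suffixShape("activity"));
    }
}
```

The text does not state when the prefix and suffix variants are used, so they are kept here as separate methods.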
In the testing step, the system searches the bases for the case most similar to the problem, and the decision is given by the class of that case. If more than one case is found, the one with the highest frequency is chosen. The search procedure is separated into two parts, one for the known and one for the unknown cases. Priority is always given to the known ones.
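The retrieval order described above (known cases first, then unknown cases, with the most frequent class winning) can be sketched as follows. This is a strong simplification made for illustration: a case is reduced to a token or a shape plus its class, whereas the cases of CBR-Tagger carry further features, and both bases are filled here from the same training tokens. The class CaseBase, the labels and the shape function passed to it are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Two simplified case bases: known cases keyed by token, unknown cases keyed by shape. */
final class CaseBase {
    private final Map<String, Map<String, Integer>> known = new HashMap<>();
    private final Map<String, Map<String, Integer>> unknown = new HashMap<>();
    private final Function<String, String> shapeOf;

    CaseBase(Function<String, String> shapeOf) {
        this.shapeOf = shapeOf;
    }

    /** Stores one training token with its class in both bases and counts its frequency. */
    void learn(String token, String label) {
        count(known, token, label);
        count(unknown, shapeOf.apply(token), label);
    }

    /** Known cases have priority; among the retrieved cases the most frequent class wins. */
    String classify(String token, String fallback) {
        Map<String, Integer> cases = known.get(token);
        if (cases == null) {
            cases = unknown.get(shapeOf.apply(token));
        }
        if (cases == null) {
            return fallback;                     // no similar case in either base
        }
        String best = fallback;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : cases.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    private static void count(Map<String, Map<String, Integer>> base, String key, String label) {
        // TreeMap keeps the iteration order stable, so ties resolve deterministically.
        base.computeIfAbsent(key, k -> new TreeMap<>()).merge(label, 1, Integer::sum);
    }

    public static void main(String[] args) {
        // Crude stand-in for a shape function, for the demo only.
        Function<String, String> shape = t ->
            t.replaceAll("[a-z]+", "a").replaceAll("[A-Z]", "A").replaceAll("[0-9]+", "1");
        CaseBase base = new CaseBase(shape);
        base.learn("Bmp4", "GENE");
        base.learn("binding", "OTHER");
        System.out.println(base.classify("Bmp4", "OTHER"));   // found among the known cases
        System.out.println(base.classify("Wnt7", "OTHER"));   // found through its shape
    }
}
```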

3.2 ML-Normalization

The flexible matching approach compares the mention extracted from the text with the synonyms in the dictionaries. The initial lists of synonyms for the four organisms were those made available in the two editions of the BioCreative challenge: BioCreative task 1B [2] for yeast, mouse and fly, and the BioCreative 2 gene normalization task [3] for human. The matching is called flexible because both the mention and the synonyms are pre-processed by dividing them into tokens according to punctuation, numbers, Greek letters, BioThesaurus⁵ terms and organism names (NCBI Entrez Taxonomy database⁶), and then ordering the parts alphabetically. Also, we ignore some biomedical terms of the BioThesaurus lexicon through a gradual cleaning, in which variations of the same mention (or synonym) are generated by gradually ignoring the biomedical terms it contains. The cleaning is gradual because the number of terms considered for removal increases step by step, according to their frequency in the lexicon. For example, after all of the editing procedures above, the synonym “alpha subunit of the rod cGMP-gated channel” is transformed to “cgmp channel gated phosphodiesterase rod subunit”, “cgmp channel gated phosphodiesterase rod” and “cgmp phosphodiesterase rod”, as the biomedical terms “subunit”, “channel” and “gated” are gradually cleaned. This procedure increases the possibility of finding an exact match with no need to provide organism-specific data.

The machine learning matching is based on the Weka implementations of the Support Vector Machines, Random Forests and Logistic Regression algorithms. In order to construct a training set for the algorithms, we used the methodology proposed in [9]: the attributes of the training examples are obtained by comparing two synonyms in the dictionary according to some predefined features.
A comparison between a pair of synonyms of the same gene/protein constitutes a positive example for the machine learning algorithm; otherwise, it is a negative example.

When more than one identifier is obtained for the same mention, a disambiguation procedure is used to decide which of the candidates is most probably correct. The selection among the candidates is based on the document similarity between the abstract of the article and a document representative of each of the genes/proteins (gene-document). The gene-document is constructed by compiling information extracted from several databases, such as Entrez Gene⁷. Three methodologies may be selected for the disambiguation step. The first uses only the cosine similarity [10] between the article and the gene-documents, while the second takes into account the number of common tokens between both texts. In the first case, the gene-document with the highest cosine similarity is chosen as the correct identifier for the mention. In the second, the gene-document with the highest number of common tokens is chosen as the best solution. The third methodology is based on the other two, and the decision is given by the highest product of the cosine similarity and the number of common tokens. Also, both single and multiple disambiguation strategies are available. The first selects only the best candidate, while the second returns the top-scoring candidates according to a threshold that is automatically calculated for each ambiguous mention as 50% of the score of the highest-scoring candidate.

⁴ ftp://ftp.ncbi.nih.gov/gene/DATA/
⁵ http://pir.georgetown.edu/pirwww/iprolink/biothesaurus.shtml
⁶ http://www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy
⁷ http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene
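The three scoring methods and the two selection strategies of the disambiguation step can be sketched as follows. This is a minimal illustration under stated assumptions and not Moara's implementation: the texts are split on non-alphanumeric characters, the cosine similarity is computed on raw term counts without any weighting, and the number of common tokens counts distinct shared tokens. All class and method names, as well as the texts in the demo, are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Scoring and selection of candidate identifiers, in simplified form. */
final class Disambiguation {

    /** Lower-cased term counts of a text, split on non-alphanumeric characters. */
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity between two term-count vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Number of distinct tokens present in both texts. */
    static int commonTokens(Map<String, Integer> a, Map<String, Integer> b) {
        int n = 0;
        for (String token : a.keySet()) {
            if (b.containsKey(token)) {
                n++;
            }
        }
        return n;
    }

    /** Third method: product of the cosine similarity and the number of common tokens. */
    static double productScore(Map<String, Integer> a, Map<String, Integer> b) {
        return cosine(a, b) * commonTokens(a, b);
    }

    /** Single selection keeps the best candidate; multiple keeps scores >= 50% of the best. */
    static List<String> select(Map<String, Double> scores, boolean multiple) {
        List<String> kept = new ArrayList<>();
        String bestId = null;
        double best = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                bestId = e.getKey();
            }
        }
        if (bestId == null) {
            return kept;                         // no candidates at all
        }
        if (!multiple) {
            kept.add(bestId);
            return kept;
        }
        double threshold = 0.5 * best;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() >= threshold) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up abstract and gene-documents, for illustration only.
        Map<String, Integer> article = termCounts("The kinase regulates cell wall integrity in yeast");
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("geneA", productScore(article, termCounts("protein kinase involved in cell wall integrity")));
        scores.put("geneB", productScore(article, termCounts("ribosomal protein of the large subunit")));
        System.out.println(select(scores, true));
    }
}
```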

4 RESULTS

Table 1 presents the results for the gene mention extraction problem, evaluated on the test dataset (5,000 documents) of the BioCreative 2 Gene Mention task [1], for the five datasets used for training CBR-Tagger, as well as the best result of the challenge. The results shown in Table 1 confirm that CbrBC2 is the best dataset for training CBR-Tagger for the gene mention recognition problem. However, the results presented in Table 2 for the normalization task show that in some cases, depending on the organism under consideration, a tagger trained with organism-specific documents may improve the recall and F-Measure of gene/protein normalization.

Table 1. Results for the gene/protein recognition task.

Training set       Recall   Precision   F-Measure
CbrBC2             64.11    76.01       69.56
CbrBC2y            42.90    80.98       56.08
CbrBC2m            29.14    76.08       42.14
CbrBC2f            51.05    73.66       60.30
CbrBC2ymf          24.53    77.00       37.21
Best BioCreative   86.0     88.5        87.2

Table 2 presents the results for the gene/protein normalization task, evaluated on the test corpora of BioCreative task 1B [2] for yeast, mouse and fly, and of BioCreative 2 [3] for human, as well as the best result of each challenge. Many experiments were carried out on the development dataset in order to find the set of parameters that works reasonably well for the four organisms, i.e., the best set of features for the machine learning algorithm as well as the best configuration for the disambiguation procedure. These results are available at Moara's web page⁸.

Table 2. Results for the gene/protein normalization task (R: recall, P: precision, FM: F-Measure).

            Best BioCreative       Exact matching          ML matching
Organism    R      P      FM       R      P      FM        R      P      FM
Yeast       89.4   95.0   92.1     83.52  95.17  88.97     84.34  81.67  82.99
Mouse       81.9   76.5   79.1     77.57  65.83  71.22     79.60  32.90  46.56
Fly         80.0   83.1   81.5     69.76  59.12  63.58     69.00  55.22  61.35
Human       83.3   78.9   81.0     83.31  55.00  66.26     85.99  29.13  43.52

⁸ http://moara.dacya.ucm.es/results.html

5 CONCLUSIONS

The Java library presented in this work is a good alternative for scientists working in the text mining field, where gene and protein mention recognition and normalization are needed as an important prerequisite step. The performance of Moara may be somewhat below the best results achieved in past BioCreative competitions, in which the participating systems made use of specific knowledge for each of the organisms considered, which is not always available to the scientific community. Our intention here was to construct a system that could perform reasonably well for any organism while requiring as little organism-specific information as possible. The free availability and ease of use of the library, as well as the possibility of training both CBR-Tagger and ML-Normalization with extra documents or organisms, respectively, make it a good and useful component for text mining systems.

ACKNOWLEDGEMENTS

This work has been partially funded by the Spanish grants BIO2007-67150-C03-02, S-Gen-0166/2006, PS-0100002008-1, TIN2005-5619. APM acknowledges the support of the Spanish Ramón y Cajal program. The authors acknowledge support from Integromics, S.L.

REFERENCES

1. Smith, L., et al., Overview of BioCreative II gene mention recognition. Genome Biol, 2008. 9 Suppl 2: p. S2.
2. Hirschman, L., et al., Overview of BioCreAtIvE task 1B: normalized gene lists. BMC Bioinformatics, 2005. 6 Suppl 1: p. S11.
3. Morgan, A.A., et al., Overview of BioCreative II gene normalization. Genome Biol, 2008. 9 Suppl 2: p. S3.
4. Witten, I.H. and E. Frank, Data mining: Practical machine learning tools and techniques. 2nd ed. 2005, San Francisco: Morgan Kaufmann.
5. Neves, M., et al. CBR-Tagger: a case-based reasoning approach to the gene/protein mention problem. in Proceedings of the BioNLP 2008 Workshop at ACL 2008. 2008. Columbus, OH, USA.
6. Neves, M. Identifying Gene Mentions by Case-Based Reasoning. in Second BioCreative Challenge Evaluation Workshop. 2007. Madrid, Spain.
7. Aamodt, A. and E. Plaza, Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches. AI Communications, 1994. 7(1): p. 39-59.
8. Daelemans, W., et al. MBT: A Memory-Based Part of Speech Tagger-Generator. in Fourth Workshop on Very Large Corpora. 1996. Copenhagen, Denmark.
9. Tsuruoka, Y., et al., Learning string similarity measures for gene/protein name dictionary look-up using logistic regression. Bioinformatics, 2007. 23(20): p. 2768-74.
10. Shatkay, H. and R. Feldman, Mining the biomedical literature in the genomic era: an overview. J Comput Biol, 2003. 10(6): p. 821-55.
