New challenges for biological text-mining in the next decade ...

Viewer
Transcript

New challenges for biological text-mining in the next decade

1

New challenges for biological text-mining in the next decade Hong-Jie Dai1,2, Yen-Ching Chang1, Richard Tzong-Han Tsai3 and Wen-Lian Hsu1,2 1 Institute of Information Science, Academia Sinica, 115, Taiwan, R.O.C. 2 Department of Computer Science, National Tsing-Hua University, 300, Taiwan, R.O.C. 3 Department of Computer Science and Engineering, Yuan Ze University, 320, Taiwan, R.O.C. E-mail: [email protected]; ro3789@ iis.sinica.edu.tw; [email protected]; [email protected]

Abstract

The massive flow of scholarly publications from traditional paper journals to online outlets

has benefited biologists because of its ease to access. However, due to the sheer volume of available biological literature, researchers are finding it increasingly difficult to locate needed information. As a result, recent biology contests, notably JNLPBA and BioCreAtIvE, have focused on evaluating various methods in which the literature may be navigated. Among these methods, text-mining technology has shown the most promise. With recent advances in text-mining technology and the fact that publishers are now making the full texts of articles available in XML format, TMSs can be adapted to accelerate literature curation, maintain the integrity of information, and ensure proper linkage of data to other resources. Even so, several new challenges have emerged in relation to full text analysis, life-science terminology, complex relation extraction, and information fusion. These challenges must be overcome in order for text-mining to be more effective. In this paper, we identify the challenges, discuss how they might be overcome, and consider the resources that may be helpful in achieving that goal. Keywords 1

Bioinformatics databases, Mining methods and algorithms, Text mining

Introduction Life-science journal publishing has under-

gy contests, such as JNLPBA [1] and BioCreA-

gone a digital revolution in the last decade. The

tIvE [2, 3], have evaluated ways in which the lite-

massive flow of scholarly publications from tradi-

rature may be navigated. Among the methods

tional paper journals to online outlets has bene-

evaluated, text mining has shown the most prom-

fited biologists in the ease of access, but has also

ise because it makes biological literature more ac-

left these scholars adrift in the deluge of biological

cessible, and therefore more useful [4, 5].

literature they have made available. Recent biolo

Text mining involves analyzing a large col-

Regular Paper

This work was supported by the National Science Council under grants NSC 97-2218-E-155-001 and NSC96-2752-E-001-001-PAE, the Research Center for Humanities and Social Sciences, and the Thematic Program of Academia Sinica under grant AS95ASIA02.

2

J. Comp. Sci. & Tech.

lection of documents in a manner that reveals spe-

terature curation, maintain the integrity of infor-

cific information, such as the relationships and

mation, and ensure proper linkage of data to rele-

patterns buried in the collection, which is normally

vant resources. However, several new challenges

imperceptible to readers. A key text mining task

have emerged in relation to full text analysis,

involves linking extracted information to form

life-science terminology, complex relation extrac-

new facts or new hypotheses that can be explored

tion, and information mergence. First, terms, such

further by more conventional means of experi-

as gene names and corresponding database iden-

mentation [6].

tifiers, are so numerous and varied that even spe-

In the biomedical domain, several tools [7-9],

cialists have difficulty understanding them and

competitions [10-12] and projects [13, 14] have

keeping track of updates and revisions. While

started to incorporate text mining technology.

state-of-the-art normalization systems developed

However, text mining is difficult to implement in

for BioCreAtIvE II [16] can normalize gene iden-

many cases because the vital components of scien-

tifiers for humans relatively well, such systems

tific communication—journals and databases—are

have yet to be developed for inter-species norma-

designed to be read by people, not computers.

lization. Second, full-text analysis requires the use

Computers cannot extract information efficiently

of more sophisticated natural language processing

from unstructured text, which is the format

(NLP) techniques than current biological informa-

adopted by most journals and databases. Fortu-

tion retrieval and extraction tools can handle [17,

nately, some publishers, e.g., the Public Library of

18]. For example, in full text analysis, TMSs must

Science (PLoS) and BioMed Central, have sought

extract cross-sentence relations, while most cur-

to address this problem by making the full texts of

rent TMSs can only extract relations within sen-

their publications available as downloadable XML

tences. Third, current TMSs are unable to merge

files that can be processed easily by computer

information from disparate sources with different

programs. The FEBS Letters journal is currently

contextual

experimenting with embedding text-mining sys-

However, associating works in the literature via

tems (TMSs) in the manuscript submission

pathway information is essential. In the next sec-

process to construct structured digital abstracts

tion, we discuss the above-mentioned challenges

semi-automatically [15]—machine-readable XML

in more detail.

summaries of pertinent facts in the published ar-

2

and

typographical representations.

Full Text Processing

ticles. With the recent advances in text-mining technology and the fact that some publishers are now making the full texts of articles available in XML format, TMSs can be applied to the full texts rather than just the abstracts, and to accelerate li-2-

Most

Biological

Natural

Language

Processing systems have only been applied to abstracts because of the latter’s availability and abridged nature. Abstracts are good targets for information extraction (IE) as they summarize the

New challenges for biological text-mining in the next decade

-3-

content of articles. However, the full texts of pa-

literature that are machine-readable, but many

pers contain more information, relevant or not,

other aspects need to be improved further, as we

which should be treated carefully [19].

explain in the following subsections.

The preliminary results of applying current

2.1

Named Entity Identification In Full Text

state-of-the-art TMSs to full texts showed a promising F-score1 of 28.85% [22] in the BioCreAtIvE (critical assessment for information extraction in biology) protein-protein interaction (PPI) annotation extraction task [20]. However, they also re-

Errors resulting from converting PDF or

Difficulties in processing tables and figure

Multiple references to organisms and the resulting

inter-species

ambiguity

in

gene/protein normalization 4.

Sentence boundary detection errors

5.

Difficulties in extracting the associations and handling the coordination of multiple interaction pairs in single sentences.

6.

Phrases used to describe interactions in legends or titles that do not correspond to grammatically correct sentences in the text.

7.

first step towards making full use of the informa-

ty recognition (NER) task in the biomedical do-

newswire domain, such as the MUC-7 NER task [21]. The unique difficulties of biomedical NER

legends 3.

ical terms, such as gene and protein names, is the

main has different characteristics from that in the

HTML formatted documents to plain text 2.

The fundamental task of recognizing biolog-

tion encoded in biomedical texts. The named enti-

vealed several issues of concern: 1.

2.1.1 Named Entity Recognition

Errors in shallow parsing and POS-tagging tools trained on general English text collections when applied to specific expressions and abbreviations found in biomedical texts. The open text mining interface, a project di-

rected by the Nature Publishing Group, helps solve the data format conversion errors and difficulties mentioned in issues 1 and 2 because it provides open access to full text documents published in XML format. The full-text versions of scientific

are as follows. First, the number of new gene names is growing continually, and it is hard to recognize all of them because there is so much inconsistency among them [22]. Second, authors do not use standardized names; they prefer to use abbreviations or other forms depending on personal inclination [23]. Because of their limited length, abbreviations/acronyms are often identical to the respective genes’ symbols and thus increase the ambiguity of the nomenclature [24]. For instance, 80% of the abbreviations listed in the UMLS have ambiguous versions in MEDLINE [25]. Third, gene names are similar to/occur with other terminology varying from gene/protein names, such as the names of cells, tissues or organs [26]. For example, C1R is a cell line, but it is also a gene (SwissProt P00736). TMSs must be able to distinguish between different genes with identical names as well as to determine whether certain gene names refer to completely different biological entities like viruses. For compound names, it is

1

F-score is the weighted harmonic mean of precision and recall.

also necessary to determine where the name be-

4

J. Comp. Sci. & Tech.

gins and ends within a sentence. The task can be

In addition, it may also be possible to use al-

particularly difficult when verbs and adjectives are

gorithms that can identify acronyms/abbreviations

embedded in names [27].

to extract acronyms from text automatically with-

A large number of machine learning algo-

out checking whether they overlap with the gene

rithms have been developed to deal with the NER

nomenclature. Although several algorithms have

problems; for example, the hidden Markov model

been proposed for this purpose [35, 36], only a

[28], the support vector machine model [29], the

few can extract acronyms and disambiguate gene

maximum entropy Markov model [30] and the

names [37]. We hope that integrating these tools

conditional random field model [31]. To capture

will improve the NER performance.

the diverse characteristics of biomedical entities,

2.1.2 Inter-species Normalization

several feature sets, including lexicons, ortho-

Gene normalization (GN) determines the

graphic/affix information, and even external re-

unique identifiers of genes and proteins mentioned

sources like the WWW have been incorporated

in the literature. The concept was inspired by a

into different algorithms. It is conceivable that the

step in a typical curation pipeline for model or-

recognition results derived by these algorithms

ganism databases. After an article has been se-

will be diverse but complementary to each other.

lected for curation, curators list the genes or pro-

One natural idea for improving the performance of

teins of interest in this article [16]. Although the

biomedical NER is to combine the results of sev-

concept of GN was inspired by curation, the Bio-

eral algorithms. The results of the BioCreAtIvE II

CreAtIvE I/II computer-aided GN task [16, 38]

gene mention task [22] and those reported by Si et

oversimplified curation and performed GN by

al. [32] show that it is possible to achieve higher

normalizing genes in abstracts rather than on the

recognition accuracy by combining the results of

full-text. Actually, human curators normally work

multiple NER algorithms.

on the full texts and only identify particular kinds

False positive gene/protein names found in

of genes of interest. Cohen et al. [39] proposed a

the full texts of articles pose great challenges for

computer-aided GN system that, given a document,

TMSs in such basic tasks as identifying gene and

provides a ranked list of genes that are discussed

protein names in biomedical texts. Broadening the

in the document. The BioCreAtIvE II.5 competi-

range of entities beyond genes/proteins to include

tion of 2009 [40] included a similar ranking task.

entities like chemicals and diseases [33, 34] can

Such a ranked list could be used as an aid by hu-

resolve the problem. Identifying these entities also

man curators.

allows us to consider biologically relevant rela-

Computer-aided GN presents several difficult

tions, such as which entities they are derived from,

problems that need to be solved in order to reduce

where they are located, which have agency in

the workload of human curators. First, gene and

which processes, or which participate in what

protein names often have several spelling varia-

processes.

tions or abbreviations. Second, gene products are

-4-

New challenges for biological text-mining in the next decade

-5-

often described indirectly via phrases, such as

observe that ―eRF1‖ appears in the title as a hu-

―light chain-3 of microtubule-associated proteins

man gene and in the third sentence of the abstract

1A and 1B,‖ instead of by specific names or codes.

as

A number of approaches [41, 42] have been pro-

―eRF1.eRF3.GTP‖ in the last sentence is a protein

posed to address these problems in the BioCreA-

complex and should not be associated with any

tIvE I/II’s GN task. The evaluation results provide

database identifiers. These few sentences illustrate

some insight into how these problems affect our

how much more GN needs to be improved before

capacity to normalize the genes mentioned in bio-

it can be used in practice.

a

yeast

gene.

Finally,

the

complex

logical abstracts, but GN is not yet practical. The BioCreAtIvE I/II GN task involved normalizing various abstracts and demonstrating how much TMSs’ success varied according to the organism discussed in the abstracts. The results showed that the performances were satisfactory for normalizing abstracts that mentioned the genes and proteins of humans (F-score 0.81), mice (F-score 0.79), yeast (F-score 0.92) and flies (F-score 0.82), re-

HemK2 protein, encoded on human chromosome 21, methylates translation termination factor eRF1. Abstract The ubiquitous tripeptide Gly-Gly-Gln in class 1 polypeptide release factors triggers polypeptide release on ribosomes. The Gln residue in both bacterial and yeast release factors is N5-methylated, despite their distinct evolutionary origin. Methylation of eRF1 in yeast is performed by the heterodimeric methyltransferase (MTase) Mtq2p/Trm112p, and requires eRF3 and GTP. Homologues of yeast Mtq2p and Trm112p are found in man, annotated as an N6-DNA-methyltransferase and of unknown function. Here we show that the human proteins methylate human and yeast eRF1.eRF3.GTP in vitro, and that the MTase catalytic subunit can complement the growth defect of yeast strains deleted for mtq2. [PMID: 18539146]

spectively. However, the task did not address the important issue of inter-species GN, which exists

Fig.1. The abstract (PMID 18539146) in PubMed

in many published articles. The extract in Figure 1, taken from an abstract in PubMed, exemplifies the challenges posed by many articles when using computer-aided GN. One name, abbreviation or code, may refer to genes in multiple species, each with its own unique ID, or even to multiple genes in the same species or across different species. For example, the abstract (PMID: 18539146) in Figure 1 discusses the methylation of the gene ―eRF1‖. In the UniProt database, the gene’s name is listed as a synonym of multiple genes, such as ZFP36L1 (SwissProt Q07352) and ETF1 (SwissProt P62495) even though their functions are different. Moreover, both ZFP36L1 and ETF1 refer to multiple species, namely humans, mice, and rats. We also

2.2

Relation and Fact Extraction from Full

Texts TMSs, like human curators, should work on full text articles. The information provided in the headings, figure legends, and tables of full text articles helps TMSs extract relations and facts; and may help users discover implicit associations between genes and diseases in the future [18]. Seki and Javed [43] conducted a small preliminary experiment and reported that using the full text articles, rather than just their abstracts, to extract gene-disease relations greatly improved the ability of their text-mining system to discover facts and relations. In addition, Cooper and Kershenbaum [44] conducted a detailed study of 65 abstracts and

6

J. Comp. Sci. & Tech.

found that some PPIs were only reported in the

there is a need for much greater collaboration be-

full texts of the respective papers. The abstracts of

tween researchers in the two fields before TMSs

some papers did not contain any protein names.

can perform image recognition and mine related

Hence, TMSs should analyze the full text articles,

text easily.

not just their abstracts. In the following sections,

2.2.2

we consider the challenges that must be addressed. 2.2.1

Relevant versus Irrelevant Information

Relation Extraction In the biomedical field, researchers are inter-

ested in PPIs, gene-gene interactions and pro-

TMSs need to distinguish between relevant

tein-disease interactions. The major goal of rela-

and irrelevant information, but different criteria

tion extraction is to discover the relations embed-

may have to be applied depending on which sec-

ded within sentences, paragraphs, or entire docu-

tion of an article the text-mining system is analyz-

ments. Currently, the most popular relation extrac-

ing or mining. Shah et al. [45] demonstrated that

tion approaches include rule-based [48, 49], ker-

there are substantial differences in the content of

nel-based [50, 51], and co-occurrence-based [52,

different sections of a publication. For example,

53] methods. Most works focus on identifying the

specific terms, like the names of certain genes,

relations between proteins [53-55]. Craven and

may be mentioned in the titles of articles in a pa-

Kumlien [56] identified the relations between pro-

per’s bibliography, but TMSs should disregard

teins and sub-cellular locations; while Rindflesch

such terms. To identify useful terms, TMSs should

et al. [57] extracted the relations between can-

compare terms mentioned in papers’ abstracts,

cer-related genes, drugs and cell lines. Less work

which usually contain a high density of relevant

has been done on extracting the relations between

terms (keywords), to terms appearing throughout

genes and diseases [58, 59], but the area is now

the full texts of the respective papers.

attracting more research efforts.

Moreover, TMSs should also be able to asso-

Among existing methods, employing parsers

ciate useful pieces of information in the legends of

to analyze syntactic and semantic structures is

figures and tables with the text of the article, but

useful. Yusuke et al. [60] performed a comparative

this task is quite challenging. One reason is that

evaluation of state-of-art syntactic parsing me-

figures and images often have multiple sub-figures,

thods, including dependency parsing, phrase

so TMSs must be able to identify the sub-figures

structure parsing and deep parsing, and their con-

and match each one with the appropriate sentences

tribution to PPI extraction. The study provides re-

or references in the text. Although this task may

searchers with a good reference for choosing ap-

seem difficult, TMSs that have such a capacity

propriate parsers for their work. However, there is

might discover more useful relations or facts than

no guarantee that the results reported by Yusuke et

those normally extracted. Some researchers have

al. can be generalized to other datasets and tasks.

been successful in combining text-mining and

The results of the BioCreAtIvE II PPI task

image recognition techniques [46, 47]; however,

[20] demonstrate that current TMSs can detect bi-

-6-

New challenges for biological text-mining in the next decade

-7-

nary relations in abstracts reasonably well [49, 61],

with co-references. Nguyen et al. [64] conducted a

but they are not always as effective in extracting

pioneering study of the differences between

significant relations from full-text articles. There

newswire and biomedical co-reference annotated

are three reasons for this phenomenon.

corpora. We look forward to the integration of

First, biomedical terms, such as gene names, may have different meanings in full texts depend-

more sophisticated NLP techniques in this respect. 3

The Future of Text-mining Applications

ing on the context or the section in which they appear. The same gene in one section may belong to

3.1

User-focused Applications

different species (consider the example shown in

Text-mining researchers are typically good at

Figure 1). Second, the frequent use of synonyms,

analyzing textual content, but they are not as good

abbreviations, and acronyms in biomedical texts

at building interactive systems that users can adopt

hinders semantic analysis. For instance, extracting

easily [33]. To resolve the problem, researchers

facts from the Results section may require resolv-

must design applications with intuitive interfaces

ing acronyms or synonyms only mentioned in the

that require little or no knowledge of text-mining

Introduction. Third, biomedical texts usually con-

and NLP technology. The objective is to provide

tain several compound nouns as well as noun

bioinformatics, biological, biomedical, and phar-

phrases linked by prepositions. Fourth, TMSs have

macological researchers with a high-level view of

difficulty when one or more proteins involved in

biological interactions and help them form new

an interaction are expressed by more than one

hypotheses. The useful PubMed-EX browser ex-

sentence; or when they are expressed using ana-

tension [65], shown in Figure 2, is an example of

phora, as shown in the following example:

such an effort.

Human growth hormone (hGH) binds to its receptor (hGHr) in a three-body interaction: one molecule of it and two identical monomers of the receptor from a trimer. Many papers have addressed relation extraction, summarization, and evaluation issues, but few have focused on co-reference (anaphora) resolution [62], possibly because there are few publicly available datasets for system building and evaluation. Despite the substantial amount of annotation work carried out on co-referencing in molecular biology, few biomedical corpora with co-reference annotations are currently

available

[63]. Recently, the GENIA corpus was annotated

Fig.2. A PubMed abstract annotated with text-mining results by PubMed-EX

PubMed-EX annotates onsite PubMed search results with additional text-mining information but users don’t pay any extra effort such as to learn how to input a specific query. Currently, its processing speed is quite slow, but it does hide the complicated text-mining technology on which it is

8

J. Comp. Sci. & Tech.

based. Text-mining researchers should strike a compromise between the accuracy of text-mining results and the overall processing speed. Obviously, full text analysis requires more computational capacity and time than the analysis of abstracts. Users may accept a processing time of 10 minutes per article for off-line processes, such as database curation, but they may not be as patient when it comes to on-line services that provide semantic annotations or relation extraction. Therefore, providing on-the-fly full text processing, while maintaining a satisfactory accuracy level, remains a

informatics data, so complex queries remain challenging. Integrating data from multiple databases and analyzing it via TMSs is difficult. Zhang et al. [66] proposed a Web 2.0 [69] based model that represents a shift in focus from working locally to working in networked settings. Under this new approach, the Web is seen as a social, collaborative, and collective space. The model provides a vision of the future, where annotation will be performed collaboratively and innovative web tools will support such collaboration. Further development of tools like WikiProtein [70] CBioC [71], which support collaborative annotation is essential. 3.3

Information fusion

challenge for text-mining researchers. Certain types of users, such as content pro-

With the advent of advanced TMSs, re-

viders and corpus annotators, require interfaces

searchers may be able to integrate mined informa-

that allow them to change annotations, dredge for

tion and thereby gain more insight into biological

information, link resources, and create new infor-

literature. The most critical biological reactions

mation resources to capture new concepts [33].

are recorded in ―pathways,‖ which include a my-

The research community requires more collabora-

riad of cellular or disease events with multiple

tive annotation and up-to-date knowledge in bio-

protein-protein relationships and tend to influence

logical databases, but it is does not have the tools

each other directly. However, for a number of rea-

that make these procedures easy to implement. We

sons, TMSs have trouble fusing mined information

discuss this issue in the next section.

to reveal pathways. First, mapping named entities to nodes in

3.2

Integration, Communication and Colla-

boration Bioinformatics researchers often need to consult numerous databases and web servers, but many find integrating heterogeneous datasets from disparate databases associated with multiple web servers a daunting task [66]. To integrate biological data from multiple heterogeneous databases, researchers have adopted two major approaches: centralization [67] and decentralization [68]. However, the integration efforts have been piecemeal and have only considered a fraction of bio-8-

pathways requires highly context dependent properties. Named entities (NEs) may have different meanings in the same context. For example, an NE may be located in the nucleus, in the cytoplasm, or on the cell membrane. It may also refer to a cellular function, in which case it might be phosphorylated or acetylated. Thus, two consecutive sentences may mention a named entity, but the named entity may actually refer to two totally different events.

New challenges for biological text-mining in the next decade

-9-

Currently, biologists use their domain knowledge to infer information that text mining cannot predict accurately. Oda et al. [72] categorized six

hope that more resources will be used to accelerate progress in the field 4.1

Evaluating text mining via task-based

inference characters, namely, the state of an entity before or after reaction, the function of an entity

challenges

before or after a reaction, the influence of state or

Evaluation via task-based challenges is es-

functional changes of an entity, related reactions,

sential to the biology community [74]. To date,

reverse reactions, and characteristics of reactions.

several biological tasks, including document re-

If annotated corpora incorporated these features,

trieval, NER, and relation extraction, have been

TMSs would be able to infer information with lit-

evaluated. We list the major challenges below:

tle human help [72].



Experimental data even confuse biologists

ticipants to identify papers to be curated for

sometimes. Open databases of pathway references, such as BioCarta2, STKE3, and KEGG [73], ena-

The KDD Cup 2002 task 14 [75] asked par-

Drosophila gene expression. 

The TREC Genomics Track5 [76], one of the

ble biologists to predict the next steps of protein

largest and longest-running challenge evalua-

pathways. However, subtle factors cause the re-

tions in biomedicine (from 2003 to 2007),

sults of many experiments to deviate from what is

evaluates systems for information retrieval.

considered consistent for proven pathways. Incon-



The Genic Interaction Extraction6 (GIE) chal-

sistencies do not necessarily mean that the proven

lenge [77], a part of the Learning Language in

pathways are wrong, but they may indicate me-

Logic workshop, evaluates the ability of par-

chanisms or parts of pathways that were previous-

ticipating TMSs to identify protein/gene inte-

ly unobserved. Therefore, pathway prediction

ractions from biological abstracts.

should be independent of experiments. In the fu-



BioCreAtIvE7 [2] is a community-wide effort

ture, TMSs may be able solve many of the argu-

that promotes the development and evaluation

ments or discrepancies that occur in research today

of text-mining and IE systems applied in the

because of their ability to map large amounts of

biological domain. The most recent challenge

data quickly.

(BioCreAtIvE II.5) in March 2009, which also

4

involved the publisher Elsevier/FEBS Letters

Text-mining Resources Text-mining

main-specific

resources,

thesauri,

such

lexicons,

as

and the MINT database, evaluated real-time

do-

text-mining capabilities on full text articles.

terminology

standards, ontologies, and additional evaluations by task-based challenges are very important. We summarize them in the following sections. It is our

3 4 5 6

2

http://cgap.nci.nih.gov/Pathways/BioCarta_Pathways

7

http://stke.sciencemag.org/ http://www.biostat.wisc.edu/~craven/kddcup/ http://ir.ohsu.edu/genomics/ http://genome.jouy.inra.fr/texte/LLLchallenge/ http://www.biocreative.org/

10



J. Comp. Sci. & Tech.

The BioNLP shared task8 is concerned with the recognition of bio-molecular named entities [1] and events [78] that appear in biomedical literature. The 2009 task used a dataset based on the GENIA event corpus [79]. In contrast to BioCreAtIvE II.5, which aims to

lated to molecular interaction and published between 1996 and 2001.  A disease corpus14 provided by Jimeno et al. [34] could serve as a benchmark for other disease NER systems. 4.2.2 Relation Extraction Corpora 

support the curation of PPI databases, the BioNLP task concerns to support the development of more detailed and structured databases,



e.g., pathway databases [80], and the Gene Ontology Annotation databases [81]. 

4.2

Text-Mining Corpora

4.2.1 Named Entity Identification Corpora 









The GENIA corpus9 [82] contains 2000 abstracts taken from the MEDLINE database and annotated with various levels of linguistic and semantic information. Biological named entities were annotated according to the taxonomy defined in GENIA ontology. Currently, there are 47 biological named entity categories. GENETAG10 [83] is a corpus of 20,000 sentences taken from MEDLINE abstracts annotated with gene/protein names. The dataset of the JNLPBA Bio-NER task11 is annotated with five types of named entities: protein, DNA, RNA, cell line and cell type. The training and test sets of BioCreAtIvE I/II gene mention and normalization tasks12 provide an evaluation standard for the two problems. The Yapex Corpus13 is annotated with protein names mentioned in MEDLINE abstracts re-









14 8

http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/ 9 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ 10 ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/GENETAG.tar.gz 11 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/ERtask/report.html 12 http://sourceforge.net/projects/biocreative/files/ 13 http://www.sics.se/humle/projects/prothalt/#data - 10 -

15 16 17 18 19 20

The GENIA event corpus [79] is based on the GENIA corpus and is annotated with events mentioned in biomedical abstracts. Binarized BioInfer 15 [84] is a corpus annotated with the binary relations between proteins in abstracts. AIMed16 [54] is a corpus constructed by using the query word ―human‖ to obtain abstracts from MEDLINE. In total, 1955 sentences were extracted and annotated with gene/protein names and PPIs. EDGAR 17 [57] contains annotation for the interaction of drugs, genes, and cells. The FetchProt18 corpus is comprised of 190 full text articles of which 140 describe experimental evidence for tyrosine kinase activity in at least one protein. Its annotation includes specific experiments and results, the proteins involved in the experiments and related information. The BioText project 19 provides two corpora for relation extraction: (1) PPI data [55] annotates the interaction types between proteins in full texts; and (2) a corpus containing abstracts randomly selected from MEDLINE 2001 for evaluation of mining disease-treatment relations [85]. The IEPA corpus20 [86] contains 303 PubMed ftp://ftp.ebi.ac.uk/pub/software/textmining/corpora/diseases http://mars.cs.utu.fi/BioInfer/ ftp://ftp.cs.utexas.edu/pub/mooney/bio-data/ ftp://ftp.ncbi.nlm.nih.gov/pub/tanabe/EDGAR_GS.txt http://fetchprot.sics.se/#corpus http://biotext.berkeley.edu/data.html http://class.ee.iastate.edu/berleant/s/IEPA.htm

New challenges for biological text-mining in the next decade

abstracts with annotations for PPIs for each sentence.  The Craven group’s IE data sets21 [56] were compiled from MEDLINE abstracts. There are three datasets, which are labeled, respectively, with instances of the following binary relations: (1) subcellular-localization gathered from the Yeast Proteome Database (YPD); (2) disease-association gathered from the Online Mendelian Inheritance in Man database (OMIM); and (3) PPIs from the MIPS Comprehensive Yeast Genome Database (CYGD).  The BioCreAtIvE-PPI dataset and DIPPPI corpus22 were derived from the dataset of BioCreAtIvE I task 1A and the Database of Interaction Proteins (DIP) respectively. The BioCreAtIvE-PPI corpus contains 1,000 sentences annotated with PPI information. The PPIs annotated in the DIPPPI corpus are restricted to proteins from yeast. The goal is to find evidence of relations in the text of a paper. Whenever possible, full texts are included in the corpus as well as abstracts.  The training and test dataset for the GIE challenge23 [87] contain annotations for gene interactions. Each dataset are decomposed into two subsets. The first subset does not include co/cross-references or ellipsis, but the second subset contains both features. 4.2.3 Part-of-speech, syntactic and semantic

- 11 -

1100 PubMed abstracts on the inhibition of cytochrome P450 enzymes. It is annotated with



21 22 23 24 25 26

PASBio 24 [88] and BioProp 25 [89] contain predicate-argument structures (PAS) for event extraction in molecular biology. The PennBioIE [90] CYP corpus 26 contains http://www.biostat.wisc.edu/~craven/ie/ http://www2.informatik.hu-berlin.de/~hakenber/corpora/ http://genome.jouy.inra.fr/texte/LLLchallenge/#task1 http://research.nii.ac.jp/~collier/projects/PASBio/ http://bws.iis.sinica.edu.tw/BioProp/ http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=L

sentence

boundary,

and

part-of-speech (POS) information. In addition, 324 of the abstracts are syntactically annotated. Another PennBioIE corpus, the PennBioIE Oncology corpus 27 contains similar annotations but in addition to its abstract is related to cancer concentrating on molecular genetics. 

The GENIA corpus contains annotations for parts-of-speech (POS) [91] and a treebank [92].



The Brown-GENIA Treebank28 [93] contains the syntactic structures of 21 abstracts (215 sentences) taken from the GENIA corpus. There is no overlap with the GENIA treebank (beta version, 500 abstracts).



MedPost [94] is a corpus29 containing 5,700 sentences selected randomly from various thematic subsets of MEDLINE and annotated with POS information.



The PDG Bio-splitter corpus 30 contains a small collection of text datasets compiled from PubMed abstracts to develop sentence splitting tools.



The BioText project provide a corpus annotated with the definitions of abbreviations [36]

annotations 

paragraph,

taken from 1,000 randomly selected abstracts by querying MEDLINE with the term ―yeast‖. 4.2.4 Full text corpora DC2008T20 27 http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=L DC2008T21 28 http://bllip.cs.brown.edu/resources.shtml#corpora 29 ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/medpost.tar.gz 30

http://www.pdg.cnb.uam.es/martink/LINKS/biosplitter_corpus.htm

12



J. Comp. Sci. & Tech.

BioMed Central’s open access full-text cor-

enable biologists to exploit knowledge more effec-

pus31 has released 55003 full text articles to

tively.

date, including structured XML version, cov-







5

Although text-mining technologies are now

ered by open access license agreements.

quite mature, there are still some important unre-

The PPI corpus of BioCreAtIvE II and II.5

solved problems in the field. Fortunately, biomed-

[95].

ical text mining is an extremely active research 32

The FlySlip

corpus [96] is the first corpus of

area, and the outlook for continued progress is

biomedical full-text articles to be annotated

encouraging. We can foresee that the texts of ar-

with anaphora information.

ticles will be systematically mined by computer

The molecular interaction maps corpus33 [97]

programs, allowing the interrelation of journal

contains passages from full-text articles that

texts and the vast repository of knowledge to be

describe interactions summarized in a molecu-

stored semi-automatically in databases. It is ex-

lar interaction map [98].

pected that text mining tools will be used by every

Conclusion We have considered important research issues

biologist in the future. Acknowledgements

related to the exploitation of text mining in the

This research was supported in part by the

biomedical field, and drawn the following conclu-

National Science Council under grant NSC

sions.

97-2218-E-155-001, NSC 96-2752-E-001-001 and

1.

the thematic program of Academia Sinica under

The availability of full texts is clearly very

important because abstracts usually lack sufficient relevant information. Techniques for mining in-

grant AS95ASIA02. References

formation from full biomedical texts need to be improved substantially. 2.

[1]

bio-entity recognition task at JNLPBA,"

Text mining has the potential to be used in

Proceedings of the International Workshop

different applications and to fuse knowledge in the

on

literature and biological databases. However, to

[2]

and

Processing

its

in

Applications

L. Hirschman, et al., "Overview of BioCreAtIvE:

various data sources. If highly complex texts and

critical

assessment

of

information extraction for biology," BMC

bio-inference sentences can be processed effi-

Bioinformatics, vol. 6, 2005.

ciently and accurately, information fusion would [3] http://www.biomedcentral.com/info/about/datamining/ http://www.wiki.cl.cam.ac.uk/rowiki/NaturalLanguage/FlySlip 33 http://www.it.usyd.edu.au/~tara/mim_corpus/ - 12 -

Language

(JNLPBA-04), pp. 70-75, 2004.

are needed, such as methods for acronym and

31

Natural

Biomedicine

realize text mining’s full potential, new methods

co-reference resolution, and the integration of

K. Jin-Dong, et al., "Introduction to the

M. Krallinger, et al., "Evaluation of text-mining systems for biology: overview

32

of the Second BioCreative community

New challenges for biological text-mining in the next decade

- 13 -

challenge," Genome Biology, vol. 9, p. S1, 2008. [4]

[13]

for Text Mining: Aims and Objectives,"

mining," presented at the Proceedings of

UKKDD'5, 2007.

for

Computational

Linguistics

[15]

automatic text mining," FEBS Letters, vol.

database revolution," Nature, vol. 448, pp.

582, pp. 1170-1170, 2008. [16]

Morgan,

et

"Overview

(http://people.ischool.berkeley.edu/~hearst

Genome Biology, vol. 9, p. S3, 2008.

Available:

[17]

II

al.,

BioCreative

gene

of

normalization,"

G. Gonzalez, et al., "Mining gene-disease

http://people.ischool.berkeley.edu/~hearst/t

relationships from biomedical literature:

ext-mining.html

weighting protein-protein interactions and

H.-J. Dai, et al., "BIOSMILE web search:

connectivity measures," Proceedings of the

a

Pacific

web

application

for

annotating

Symposium

on

Biocomputing,

2007. [18]

R. T.-H. Tsai, et al., "HypertenGene:

D. Rebholz-Schuhmann, et al., "Text

Extracting key hypertension genes from

processing through Web services: calling

biomedical literature with position and

Whatizit," Bioinformatics, vol. 24, p. 296,

automatically-generated template features,"

2008.

in 8th InCoB - Seventh International

J. M. Fernández, et al., "iHOP web

Conference on Bioinformatics 2009, 2009. [19]

W21-W26, 2007. Elsevier

Article

2.0

Contest

html). "The

Grand

text

mining," Briefings in Bioinformatics, vol. 6, pp. 57-71, 2005. [20]

Elsevier

A. M. Cohen and W. R. Hersh, "A survey of current work in biomedical

(http://article20.elsevier.com/contest/home.

[12]

A.

M. Hearst. (2003, What is text mining

services," Nucl. Acids Res., vol. 35, pp.

[11]

M. Seringhaus and M. Gerstein, "Manually

U. Hahn, et al., "Text mining: powering the

Acids Res., vol. 1;36, pp. W390-8, 2008.

[10]

Prospect

(http://www.projectprospect.org/)."

biomedical entities and relations.," Nucl.

[9]

Project

structured digital abstracts: A scaffold for

/text-mining.html).

[8]

"RSC

Maryland, 1999.

130-130, 2007.

[7]

[14]

on

Computational Linguistics, College Park,

[6]

S. Ananiadou, et al., "The National Centre

M. A. Hearst, "Untangling text data

the 37th annual meeting of the Association

[5]

ive-ii5/biocreative-ii5/)."

Challenge

M. Krallinger, et al., "Overview of the protein-protein

interaction

annotation

(http://www.elseviergrandchallenge.com/).

extraction task of BioCreative II," Genome

"

Biology, vol. 9, p. S4, 2008.

"BioCreAtIvE

II.5

(http://www.biocreative.org/events/biocreat

[21]

N. Chinchor, "MUC-7 named entity task definition," in Proceedings of the 7th

14

[22]

[23]

[24]

J. Comp. Sci. & Tech.

Message Understanding Conference, 1997.

of the ACL-02 workshop on Natural

L. Smith, et al., "Overview of BioCreative

language processing in the biomedical

II gene mention recognition," Genome

domain

Biology, vol. 9, p. S2, 2008.

Pennsylvania, 2002.

U. Leser and J. Hakenberg, "What makes a

[26]

[28]

J. Finkel, et al., "Exploiting context for

to

Bioinformatics, vol. 6, p. 357, 2005.

International Joint Workshop on Natural

R. A. A. Erhardt, et al., "Status of

Language Processing in Biomedicine and

text-mining

its Applications, pp. 88-91, 2004.

techniques

applied

to [31]

the

web,"

Proceedings

of

the

R. T.-H. Tsai, et al., "NERBio: using

vol. 11, pp. 315-325, 2006.

selected

H. Liu, et al., "A study of abbreviations in

normalization, and global patterns to

MEDLINE abstracts," in AMIA Annual

improve

Symposium, 2002, p. 464.

recognition," BMC Bioinformatics, vol. 7,

L. Tanabe and W. J. Wilbur, "Tagging gene

p. S11, 2006. [32]

word

conjunctions,

biomedical

named

term

entity

L. Si, et al., "Boosting performance of

presented at the Proceedings of the

bio-entity recognition by combining results

ACL-02 workshop on Natural language

from multiple systems," presented at the

processing in the biomedical domain -

Proceedings

Volume 3, Phildadelphia, Pennsylvania,

workshop on Bioinformatics, Chicago,

2002.

Illinois, 2005.

L. Tanabe and W. J. Wilbur, "Tagging gene

[33]

of

the

5th

international

R. Altman, et al., "Text mining for biology

and protein names in biomedical text,"

- the way forward: opinions from leading

Bioinformatics, vol. 18, pp. 1124-1132,

scientists," Genome Biology, vol. 9, p. S7,

2002.

2008.

S. Zhao, "Named Entity Recognition in

[34]

A. Jimeno, et al., "Assessment of disease

Biomedical Texts using an HMM Model,"

named entity recognition on a corpus of

presented at the Proceedings of the

annotated sentences," BMC Bioinformatics,

International Joint Workshop on Natural

vol. 9, p. S3, 2008. [35]

H. Yu, et al., "Mapping abbreviations to

its Applications, 2004.

full forms in biomedical articles," Journal

J. i. Kazama, et al., "Tuning support vector

of the American Medical Informatics

machines for biomedical named entity

Association, vol. 9, pp. 262-272, 2002.

recognition," presented at the Proceedings - 14 -

Phildadelphia,

the biomedical literature," Briefings in

Language Processing in Biomedicine and

[29]

3,

biomedical entity recognition: from syntax

and protein names in full text articles,"

[27]

Volume

gene name? Named entity recognition in

biomedical text," Drug Discovery Today,

[25]

[30]

-

[36]

A. S. Schwartz and M. A. Hearst, "A

New challenges for biological text-mining in the next decade

simple

[37]

[38]

[39]

[40]

algorithm

for

- 15 -

identifying

abbreviation definitions in biomedical

hereditary

text," in Pac Symp Biocomput., 2003, pp.

Biocomput., Maui, Hawaii, 2007, pp.

451-462.

316-327.

R. Podowski, et al., "Suregene, a scalable

Pac

Symp

J. W. Cooper and A. Kershenbaum,

of gene and protein names," Journal of

using

Bioinformatics and Computational Biology,

statistical

vol. 3, pp. 743-770, 2005.

BMC Bioinformatics, vol. 6, p. 143, 2005.

L. Hirschman, et al., "Overview of

[45]

a

combination and

of

graphical

linguistic,

information,"

P. K. Shah, et al., "Information extraction

BioCreAtIvE task 1B: normalized gene

from full text scientific articles: Where are

lists," BMC Bioinformatics, vol. 6, p. S11,

the keywords?," BMC Bioinformatics, vol.

2005.

4, p. 20, 2003.

W. Cohen and E. Minkov, "A graph-search

[46]

H. Shatkay, et al., "Integrating image data

framework for associating gene identifiers

into

with documents," BMC Bioinformatics, vol.

Bioinformatics, vol. 22, pp. e446-453, July

7, p. 440, 2006.

15, 2006 2006.

F.

Leitner,

"Comparative

community

[47]

Z.

biomedical

Kou,

et

text

al.,

categorization,"

"A

STACKED

assessments for applied biomedical text

GRAPHICAL

mining: BioCreative II challenge and

ASSOCIATING INFORMATION FROM

metaservices," in Intelligent Systems for

TEXT AND IMAGES IN FIGURES," in

Molecular Biology (ISMB) and European

Pac Symp Biocomput., Maui, Hawaii,

Conference on Computational Biology

2007.

K.

Fundel,

et

al.,

"Exact

[48]

MODEL

FOR

J. Saric, et al., "Extraction of regulatory

versus

gene/protein networks from Medline,"

approximate string matching for protein

Bioinformatics, vol. 22, pp. 645-650,

name identication," in Proceedings of the

March 15, 2006 2006.

Challenge

Evaluation

[49]

T. Ono, et al., "Automated extraction of

Workshop 2004, 2004.

information on protein-protein interactions

J. Hakenberg, et al., "Me and my friends:

from

gene

Bioinformatics, vol. 17, pp. 155-161, Feb

mention

normalization

with

background knowledge," in Proceedings of Second BioCreAtIvE Challenge Evaluation

[43]

in

"Discovery of protein-protein interactions

BioCreative

[42]

[44]

diseases,"

system for automated term disambiguation

(ECCB), Highlights Track, 2009. [41]

implicit associations between genes and

the

biological

literature,"

2001. [50]

S. Kim, et al., "Kernel approaches for

Workshop, 2007, pp. 23-25.

genic

interaction

extraction,"

K. Seki and M. Javed, "Discovering

Bioinformatics, vol. 24, p. 118, 2008.

16

[51]

J. Comp. Sci. & Tech.

R. Bunescu and R. Mooney, "Subsequence

from the biomedical literature," in Pac

kernels

Symp Biocomput., 2000, pp. 515-524.

for

relation

extraction,"

ADVANCES IN NEURAL INFORMATION

[52]

[58]

T.

Barnickel,

et

al.,

Extraction

"Large

Scale

Proceedings of the Pacific Symposium on

from

Biocomputing, pp. 4-15, 2006. [59]

Biomedical

R. T.-H. Tsai, et al., "HypertenGene: Extracting key hypertension genes from biomedical literature with position and

A. Ramani, et al., "Consolidating the set of

automatically-generated template features,"

known human protein-protein interactions

to appear in BMC Bioinformatics, 2009. [60]

Y. Miyao, et al., "Evaluating Contributions

the human interactome," Genome Biology,

of

vol. 6, p. R40, 2005.

Protein-Protein

R.

Bunescu,

et on

extractors

for

interactions,"

al.,

"Comparative

learning

information

proteins

and

[61]

their

Artificial Intelligence

classification:

protein-protein

application

interactions,"

Technology

and

Interaction

Parsers

to

Extraction,"

L. Wong, "PIES, a protein interaction

Symposium on Biocomputing, vol. 6, pp. 520-531, 2001. [62]

J. Castaño, et al., "Anaphora resolution in

to

biomedical literature," in International

in

Symposium on Reference Resolution, 2002.

Proceedings of the conference on Human Language

Language

extraction system," Proceedings of Pacific

in

B. Rosario and M. A. Hearst, "Multi-way relation

Natural

Bioinformatics, Dec 9 2008.

Medicine, vol. 33, pp. 139-155, 2005.

[63]

J. Pustejovsky, et al., "Medstract: creating large-scale

Empirical

information

servers

for

Methods in Natural Language Processing,

biomedical libraries," in Proceedings of the

2005, pp. 732-739.

ACL-02 workshop on Natural language

M. Craven and J. Kumlien, "Constructing

processing in the biomedical domain, 2002,

Biological Knowledge Bases by Extracting

pp. 85-92.

Information

from

Text

Sources,"

in

[64]

N. Nguyen, et al., "Challenges in Pronoun

Proceedings of the 7th International

Resolution System for Biomedical Text,"

Conference on Intelligent Systems for

Proceedings of the Sixth International

Molecular Biology, 1999, pp. 77-86.

Language

T.

(LREC'08), 2008.

C.

Rindflesch,

et

al., "EDGAR:

extraction of drugs, genes and relations - 16 -

of

Texts," 2009.

experiments

[57]

"Extraction

domain dictionaries and machine learning,"

in preparation for large-scale mapping of

[56]

al.,

2006.

Relation

[55]

et

gene-disease relations from Medline using

Semantic Role Labeling for Automated

[54]

Chun,

PROCESSING SYSTEMS, vol. 18, p. 171,

Application of Neural Network Based

[53]

H.-W.

[65]

Resources

and

Evaluation

R. T.-H. Tsai, et al., "PubMed-EX: A web

New challenges for biological text-mining in the next decade

- 17 -

browser extension to enhance PubMed

genomes to life and the environment,"

search

Nucleic Acids Research, vol. 36, p. D480,

with

text

mining

features,"

Bioinformatics, vol. [Epub ahead of print], 2009. [66]

[67]

[68]

[69]

[74]

C.

Blaschke,

Text Mining for Biology and Medicine,

pp. 1-10, January 1, 2009 2009.

2005, pp. 213-245.

K. Cheung, et al., "Semantic Web approach

[75]

A. Yeh, et al., "Background and overview

to database integration in the life sciences,"

for KDD Cup 2002 task 1: Information

Semantic Web: Revolutionizing knowledge

extraction from biomedical articles," in

discovery in the life sciences, pp. 11-30,

ACM SIGKDD Explorations Newsletter,

2007.

2002, pp. 87-89.

R.

Dowell,

et

al.,

"The

distributed

[76]

W. Hersh and E. Voorhees, "TREC

annotation system," BMC Bioinformatics,

genomics

vol. 2, p. 7, 2001.

Information Retrieval, vol. 12, pp. 1-15,

T. O'Reilly. (2005, What is Web 2.0:

2009. [77]

special

issue

overview,"

J. Hakenberg, et al., "LLL’05 challenge:

the Next Generation of Software. Available:

Genic interaction extraction-identification

http://www.oreillynet.com/pub/a/oreilly/ti

of language patterns based on alignment

m/news/2005/09/30/what-is-web-20.html

and finite state automata," in Proceedings

B. Mons, et al., "Calling on a million

of

minds

Language in Logic (LLL05), 2005, pp.

for

community

annotation

in

the

ICML05

workshop:

Learning

89-97. [78]

J.-D. Kim, et al., "Overview of BioNLP'09

C. Baral, et al., "CBioC: beyond a

Shared Task on Event Extraction," in

prototype for collaborative annotation of

Proceedings of the BioNLP 2009 Workshop

molecular interactions from the literature,"

Companion Volume for Shared Task, 2009,

presented

pp. 1-9.

at

Computational

the

Proceedings

Systems

of

Bioinformatics

[79]

J.-D. Kim, et al., "Corpus annotation for

Conference, San Diego, California, 2007.

mining biomedical events from literature,"

K. Oda, et al., "New challenges for text

BMC Bioinformatics, vol. 9:10, 2008.

mining: manually

mapping curated

between

text

pathways,"

and

[80]

BMC

M. Kanehisa, et al., "KEGG for linking

G. Bader, et al., "Pathguide: a pathway resource list," Nucleic Acids Research, vol.

Bioinformatics, vol. 9, p. S5, 2008. [73]

and

bioinformatics," Brief Bioinform, vol. 10,

R89, 2008.

[72]

Hirschman

"Evaluation of Text Mining in Biology," in

WikiProteins," Genome Biology, vol. 9, p.

[71]

L.

Z. Zhang, et al., "Bringing Web 2.0 to

Design Patterns and Business Models for

[70]

2008.

34, pp. D504-506, 2006. [81]

E. Camon, et al., "The Gene Ontology

18

J. Comp. Sci. & Tech.

annotation

(GOA)

knowledge

in

database:

sharing

predicate-argument structures for event

Gene

extraction in molecular biology," BMC

Ontology," Nucleic Acids Research, vol. 32,

Bioinformatics, vol. 5, p. 155, Oct 19 2004.

Uniprot

with

pp. D262-266, 2004. [82]

[83]

[89]

J. D. Kim, et al., "GENIA corpus--a

Method for Annotating a Biomedical

semantically

for

Proposition

bio-textmining," Bioinformatics, vol. 19,

Proceedings

pp. 180-182, 2003.

Frontiers

L. Tanabe, et al., "GENETAG: a tagged

Corpora, Sydney, Australia, 2006.

annotated

corpus

corpus for gene/protein named entity

[84]

[90]

J. Heimonen, et al., "Complex-to-Pairwise

Workshop

on

Linguistically

Annotated

S. Kulick, et al., "Integrated annotation for

[91]

Y. Tateisi and J. Tsujii, "Part-of-Speech

Mapping of Biological Relationships using

Annotation

a

Abstracts," in Proceedings of the 4th

Semantic

Network

International

Representation," Symposium

of

Biology

Research

on

International Conference on Language

Semantic Mining in Biomedicine (SMBM

Resource and Evaluation (LREC2004). ,

2008), 2008, pp. 45-52.

2004, pp. 1267-1270.

B. Rosario and M. A. Hearst, "Classifying

[92]

Y. Tateisi, et al., "Syntax Annotation for

semantic relations in bioscience texts,"

the GENIA corpus," Proc. IJCNLP 2005,

presented at the Proceedings of the 42nd

Companion volume, pp. 222–227, 2005.

Meeting

D.

on

Association

Linguistics,

for

[93]

Barcelona,

M. Lease and E. Charniak, "Parsing biomedical literature," presented at the Proceedings of Second International Joint

Berleant,

et

al.

(2003,

Corpus

properties

of

protein

interaction

descriptions

in

MEDLINE.

Available:

Conference

on

Natural

Language

Processing, 2005. [94]

L.

Smith,

et

al.,

"MedPost:

a

http://class.ee.iastate.edu/berleant/home/m

part-of-speech tagger for bioMedical text,"

e/cv/papers/corpuspropertiesstart.htm

Bioinformatics, vol. 20, pp. 2320-2321,

C.

Nedellec,

logic-genic

- 18 -

the

HLT/NAACL-2004, 2004, pp. 61-68.

Spain, 2004.

[88]

ACL

at

p. S3, 2005.

Computational

[87]

in

of

presented

biomedical information extraction," in

Annual

[86]

Bank,"

recognition," BMC Bioinformatics, vol. 6,

Third

[85]

W.-C. Chou, et al., "A Semi-Automatic

"Learning

language

interaction

in

extraction

September 22, 2004 2004. [95]

M. Krallinger, et al., "The BioCreative II.5

challenge," in Proceedings of the ICML05

challenge overview," in Proceedings of the

workshop: Learning Language in Logic

BioCreative II.5 Workshop 2009 on Digital

(LLL05), 2005, pp. 97-99.

Annotations, Madrid, Spain, 2009.

T.

Wattarujeekrit,

et

al.,

"PASBio:

[96]

C. Gasperin, et al., "Annotation of

New challenges for biological text-mining in the next decade

- 19 -

anaphoric relations in biomedical full-text

Yen-Ching Chang received

articles using a domain-relevant scheme,"

her B.S. degree in biochemi-

presented at the Proceedings of the Anaphor

cal science and technology

Resolution Colloquium, Lagos (Algarve),

from National Taiwan Uni-

Portugal, 2007.

versity and her M.S. degree

Discourse

[97]

Anaphora

and

T. McIntosh and J. Curran, "Challenges for automatically

extracting

interactions from full-text articles," BMC

[98]

in biochemistry and molecu-

molecular lar biology from National Taiwan University col-

Bioinformatics, vol. 10, p. 311, 2009.

lege of medicine. Since 2008, she has been and

K. W. Kohn, "Molecular Interaction Map

research assistant at the Academia Sinica. Her re-

of the Mammalian Cell Cycle Control and

search interests include: proteomics and text min-

DNA Repair Systems," Mol. Biol. Cell, vol. 10, pp. 2703-2734, August 1, 1999 1999. Hong-Jie Dai

ing.

received

Richard Tzong-Han Tsai

re-

his B.S. degree in Computer

ceived the B.S. degree in Com-

Science

Information

puter Science and Information

Engineering from Tung Hai

Engineering from National Tai-

University and his M.S. degree in Computer

wan University, Taipei, Taiwan, in 1997, the M.S

Science and Information Engineering from Na-

degree in Computer Science and Information En-

tional Central University in Taiwan in 2003 and

gineering from National Taiwan University in

2005, respectively. Since 2005, he has been an as-

1999, and, respectively, the Ph.D. degree in Com-

sistant researcher at the Academia Sinica. His re-

puter Science and Information Engineering from

search interests include: bioinformatics, machine

National Taiwan University in 2006. He was a

learning, text mining, natural language processing

postdoctoral fellow at Academia Sinica from 2006

and Software Engineering.

to 2007. He is now an assistant professor of De-

and

partment of Computer Science and Engineering, Yuan Ze University, Zhongli, Taiwan. His research areas

are

natural

language

processing,

cross-language information retrieval, biomedical

20

J. Comp. Sci. & Tech.

literature mining, and information services on mo-

Research Fellow award of the National Science

bile devices.

Council, K. T. Li breakthrough award, IEEE Fellow, Academia Sinica Investigator Award, and TeWen-Lian Hsu is a Distin-

co Technology award. From 2001 to 2002, he was

guished Research Fellow in

the President of the Artificial Intelligence Society

the Institute of Information

in Taiwan.

Science, Academia Sinica. He received a B.S. from the Department of Mathematics, National Taiwan University in 1973 and a Ph.D. in operations research from Cornell University in 1980, respectively. He then joined Northwestern University, and was promoted to tenured associate professor in 1986. In 1989, he joined the Institute of Information Science as a research fellow. Earlier in his career, Dr. Hsu focused on theoretical graph algorithms and frequently published papers in top-notch journals, such as JACM, SIAM Journal on Computing. After returning to Taiwan, he started research on automatic conversion of Pinyin to characters. In 1993, he invented a Chinese natural input method which has since attracted two million users and revolutionized the phonetic input for Chinese in Taiwan. Later, he moved into question answering, and bioinformatics. He is currently the Director of the International Graduate Program in Bioinformatics in Academia Sinica. Dr. Hsu received many awards including the Distinguished - 20 -

education and training requirements in the next decade ...