Proceedings of EACL '99

The GENIA project: corpus-based knowledge acquisition and information extraction from genome research papers

Nigel Collier, Hyun Seok Park, Norihiro Ogata, Yuka Tateishi, Chikashi Nobata, Tomoko Ohta, Tateshi Sekimizu, Hisao Imai, Katsutoshi Ibushi, Jun-ichi Tsujii

{nigel,hsp20,ogata,yucca,nova,okap,sekimizu,hisao,k-ibushi,tsujii}@is.s.u-tokyo.ac.jp

Department of Information Science, Graduate School of Science, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113, Japan

Abstract

We present an outline of the genome information acquisition (GENIA) project for automatically extracting biochemical information from journal papers and abstracts. GENIA will be available over the Internet and is designed to aid in information extraction, retrieval and visualisation and to help reduce information overload on researchers. The vast repository of papers available online in databases such as MEDLINE is a natural environment in which to develop language engineering methods and tools and is an opportunity to show how language engineering can play a key role on the Internet.

1 Introduction

In the context of the global research effort to map the human genome, the Genome Informatics Extraction project, GENIA (GENIA, 1999), aims to support such research by automatically extracting information from biochemical papers and their abstracts written by domain specialists, such as those available from MEDLINE (MEDLINE, 1999). The vast repository of research papers resulting from genome research is a natural environment in which to develop language engineering tools and methods. This project aims to help reduce the problems caused by information overload on researchers who want to access the information held inside collections such as MEDLINE. The key elements of the project are centered around the tasks of information extraction and retrieval. These are outlined below, and then the interface which integrates them is described.

1.1 Terminology identification and classification

Through discussions with domain experts, we have identified several classes of useful entities such as the names of proteins and genes. The reliable identification and acquisition of such class members is one of our key goals, so that terminology databases can be extended automatically. We should not, however, underestimate the difficulty of this task, as the naming conventions in this field are very loose. In our initial experiments we used the ENGCG shallow parser (Voutilainen, 1996) to identify noun phrases and classify them as proteins (Sekimizu et al., 1998) according to their co-occurrence with a set of verbs. Because of the difficulties caused by inconsistent naming of terms, we have decided to use multiple sources of evidence for classifying terminology, and we are now exploring two models for named entity recognition.

The first is based on a statistical model of word clustering (Baker and McCallum, 1998), trained on pre-classified word lists from Swissprot and other databases. We supplemented this with short word lists that identify the class from a term's final noun when that noun is in head-final position. In our first experiments on a judgement set of 80 expert-tagged MEDLINE abstracts, the model yielded the following F-scores for pre-identified phrases: 69.35 for 1372 source entities, 53.00 for 3280 proteins, 66.67 for 56 RNA and 45.20 for 566 DNA. We expect this to improve with the addition of better training word lists.

The second approach is based on decision trees (Quinlan, 1993), supplemented with word lists for classes derived from Swissprot and other databases. In these tests the phrases for terms were not pre-identified. The model was trained on a corpus of 60 expert-tagged MEDLINE abstracts and tested on a corpus of 20 articles, yielding F-scores of 55.38 for 356 source entities and 66.58 for 808 protein entities. The number of RNA and DNA entities was too small to train with.

As part of the overall project we are creating an expert-tagged corpus of MEDLINE abstracts and full papers for training and testing our tools. The markup scheme for this corpus is being developed in cooperation with groups of biologists and is based on a conceptual domain model implemented in SGML. The corpus itself will be cross-validated with an independent group of biologists.
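The head-noun heuristic and the scoring used above can be illustrated with a minimal sketch. The word lists below are hypothetical (the lists actually used in the project are not given in this paper), and the F-score is assumed to be the balanced F1 measure, reported here on a 0-1 scale rather than the 0-100 scale used in the text.

```python
# Hypothetical head-noun lists for illustration only; the lists actually
# used in the project are not given in this paper.
HEAD_NOUN_CLASSES = {
    "protein": {"protein", "kinase", "receptor", "factor"},
    "DNA": {"gene", "promoter", "enhancer"},
    "RNA": {"mrna", "transcript"},
    "source": {"cell", "cells", "lymphocyte", "lymphocytes"},
}


def classify_by_head_noun(term):
    """Return the class suggested by the final word of a term, or None."""
    tokens = term.lower().split()
    if not tokens:
        return None
    head = tokens[-1]  # the final noun is taken as the head of the phrase
    for label, nouns in HEAD_NOUN_CLASSES.items():
        if head in nouns:
            return label
    return None


def f_score(true_positives, false_positives, false_negatives):
    """Balanced F-score (assumed to be F1) on a 0-1 scale."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


print(classify_by_head_noun("NF-kappa B transcription factor"))
print(classify_by_head_noun("IL-2 gene"))
```

A term whose final word is not in any list returns None, which is where other sources of evidence, such as the clustering model or the decision tree, would have to take over.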

1.2 Information extraction

We are using information extraction methods to automatically extract named entity properties, events and other domain-specific concepts from MEDLINE abstracts and full texts. One part of this work is the construction and maintenance of an ontology for the domain, which is carried out by a system we are now developing called the Ontology Extraction-Maintenance System (OEMS). OEMS extracts three types of information about the domain ontology from the abstracts, called typing information (Ogata, 1997): taxonomy (a subtype structure), mereology (a part-whole structure) and synonymy (an identity structure). Eventually we hope to be able to identify and extract domain-specific facts such as protein-protein binding information from full texts and to aid biochemists in the formation of the cell signalling diagrams which are necessary for their work.
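The three kinds of typing information can be pictured as three relations over terms. The sketch below is not OEMS, whose internals are not described in this paper; the class name, method names and example terms are all assumptions chosen for illustration.

```python
from collections import defaultdict


class TypingStore:
    """Toy container for taxonomy, mereology and synonymy relations."""

    def __init__(self):
        self.supertypes = defaultdict(set)  # taxonomy: subtype -> supertypes
        self.wholes = defaultdict(set)      # mereology: part -> wholes
        self.synonyms = defaultdict(set)    # synonymy: term -> equivalent terms

    def add_subtype(self, subtype, supertype):
        self.supertypes[subtype].add(supertype)

    def add_part(self, part, whole):
        self.wholes[part].add(whole)

    def add_synonym(self, term_a, term_b):
        # Synonymy is an identity structure, so it is stored symmetrically.
        self.synonyms[term_a].add(term_b)
        self.synonyms[term_b].add(term_a)

    def ancestors(self, term):
        """All supertypes reachable from a term through the taxonomy."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for parent in self.supertypes.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


store = TypingStore()
store.add_subtype("tyrosine kinase", "kinase")
store.add_subtype("kinase", "enzyme")
store.add_subtype("enzyme", "protein")
store.add_part("SH2 domain", "Src")
store.add_synonym("NF-kappa B", "NF-kB")
print(sorted(store.ancestors("tyrosine kinase")))
```

Keeping the three relations separate means that a subtype query never follows part-whole links by accident, which matters because only the taxonomy is expected to be transitive in this sketch.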

1.3 Thesaurus building

A further goal of our work is to construct a thesaurus automatically from MEDLINE abstracts and domain dictionaries of medical terms, for the purpose of query expansion in information retrieval from databases such as MEDLINE (see, e.g., Jing and Croft, 1994). We are currently working with the Med test set (30 queries and 1033 documents) on SMART (see, e.g., Salton, 1989; Buckley et al., 1993). Eventually we plan to build a specialised thesaurus for the genome domain, but this currently depends on the creation of a suitable test set.
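As a rough illustration of thesaurus-based query expansion, the sketch below builds a simple co-occurrence thesaurus from a handful of toy documents and adds the most strongly associated terms to a query. It is a simplification under stated assumptions: the function names, the whitespace tokenisation and the document-level co-occurrence counts are ours, and it does not reproduce the association thesaurus of Jing and Croft or anything in SMART.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence_thesaurus(documents):
    """Count how often each pair of terms appears in the same document."""
    assoc = defaultdict(Counter)
    for doc in documents:
        terms = sorted(set(doc.lower().split()))
        for a, b in combinations(terms, 2):
            assoc[a][b] += 1
            assoc[b][a] += 1
    return assoc


def expand_query(query, assoc, top_k=2):
    """Append up to top_k associated terms for each query term."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        # Sort by descending count, then alphabetically for determinism.
        related = sorted(assoc.get(term, {}).items(),
                         key=lambda item: (-item[1], item[0]))
        for candidate, _count in related[:top_k]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Toy documents standing in for MEDLINE abstracts.
docs = [
    "protein kinase binding",
    "protein kinase activation",
    "gene promoter binding",
]
thesaurus = build_cooccurrence_thesaurus(docs)
print(expand_query("kinase", thesaurus))
```

A query term that never occurs in the collection is left unexpanded, so the expanded query always contains the original terms first.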

1.4 Interface

A key aspect of this project is providing easy interaction between domain experts and the information extraction programs. Within a single environment, our interface provides a link to the information extraction programs as well as clickable links to aid in querying for related information from publicly available databases on the WWW. For example, a user can highlight proteins in the texts using the named entity extraction program and then search for the molecule structure diagram.

2 Conclusion

This paper has provided a synopsis of the GENIA project. The project will run for a further two years and aims to provide an online demonstration of how language engineering can be useful in the genome domain.

References

L.D. Baker and A.K. McCallum. 1998. Distributional clustering of words for text classification. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia.

C. Buckley, J. Allan, and G. Salton. 1993. Automatic routing and ad-hoc retrieval using SMART: TREC-2. In D. K. Harman, editor, The Second Text REtrieval Conference (TREC-2), pages 45-55. NIST.

GENIA. 1999. Information on the GENIA project can be found at: http://www.is.s.u-tokyo.ac.jp/~nigel/GENIA.html.

Y. Jing and W. Croft. 1994. An association thesaurus for information retrieval. In Proceedings of RIAO'94, pages 146-160.

MEDLINE. 1999. The PubMed database can be found at: http://www.ncbi.nlm.nih.gov/PubMed/.

Norihiro Ogata. 1997. Dynamic constructive thesaurus. In Language Study and Thesaurus: Proceedings of the National Language Research Institute Fifth International Symposium: Session I, pages 182-189. The National Language Research Institute, Tokyo.

J.R. Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Inc., San Mateo, California.

G. Salton. 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Publishing Company, Inc., Reading, Massachusetts.

T. Sekimizu, H. Park, and J. Tsujii. 1998. Identifying the interaction between genes and gene products based on frequently seen verbs in MEDLINE abstracts. In Genome Informatics. Universal Academy Press, Inc.

A. Voutilainen. 1996. Designing a (finite-state) parsing grammar. In E. Roche and Y. Schabes, editors, Finite-State Language Processing. A Bradford Book, The MIT Press.
