The 14th Annual Bio-Ontologies Meeting

Nigam Shah, Stanford University
Susanna-Assunta Sansone, University of Oxford
Susie Stephens, Johnson & Johnson Pharmaceutical Research & Development
Larisa Soldatova, Aberystwyth University

The Bio-Ontologies meeting provides a forum for discussion of the latest and most cutting-edge research in ontologies and, more generally, the organization, presentation and dissemination of knowledge in biology. The informal nature of the SIG has provided an environment where work has been presented up to a year before its formal publication. It has existed as a SIG at ISMB for over a decade, making it one of the longest-running SIGs. This year, the meeting runs for two days: on day one (July 15th) the keynote speaker is Andrew Su, and on day two (July 16th) the keynote speaker is Sorana Popa.

July 15-16, 2011, Vienna, Austria. Co-located with ISMB/ECCB 2011.

Day 1, July 15th

8:30-8:40    Introduction and welcome
8:40-9:05    Tcheremenskaia et al: OpenTox Predictive Toxicology Framework: toxicological ontology and semantic media wiki-based OpenToxipedia
9:05-9:30    LePendu et al: Annotation Analysis for Testing Drug Safety Signals
9:30-9:55    Tsatsaronis et al: A Maximum-Entropy Approach for Accurate Document Annotation in the Biomedical Domain
10:00-11:00  Coffee (10:15-10:45) + Posters
11:00-11:25  Pang et al: The Coriell Cell Line Ontology: Rapidly Developing Large Ontologies
11:25-11:50  Eales et al: An exercise in kidney factomics: From article titles to RDF knowledge base
11:50-12:30  Keynote: Andrew Su - Cultivating and mining the Gene Wiki for crowd-sourced gene annotation
12:30-1:30   Lunch and Posters
1:35-1:55    Shimoyama et al: Using Multiple Ontologies to Annotate and Integrate Phenotype Records from Multiple Sources
1:55-2:15    Good et al: Linking genes to diseases with a SNPedia-Gene Wiki mashup
2:15-2:35    Travillian et al: The Vertebrate Bridging Ontology (VBO)
2:35-2:55    Ciccarese et al: DOMEO: a web-based tool for semantic annotation of online documents
2:55-3:15    Two Flash updates, 10 min each:
             • Sansone et al, BioSharing: standards, policies and communication in bioscience
             • Whetzel et al, Collaborative Development of Ontologies using WebProtégé and BioPortal
3:15-4:15    Coffee break (3:30-4:00) + Posters
4:15-5:00    Interactive session around ontology tools
5:00-6:00    Six Flash updates, 10 min each:
             • Tripathi et al, Automated Assessment of High Throughput Hypotheses on Gene Regulatory Mechanisms Involved in the Gastrin Response
             • Ramirez et al, New search method to mine biological data
             • Kibbe et al, Coupling disease and genes using Disease Ontology, NCBI GeneRIFs, and the NCBO Annotator service
             • Zhukova et al, KiSAO: Kinetic Simulation Algorithm Ontology
             • Duek et al, CALOHA: A new human anatomical ontology
             • Ison et al, The EDAM ontology for bioinformatics tools and data

Day 2, July 16th

8:30-8:40    Introduction and general announcements
8:40-9:05    Horridge et al: The State of Bio-Medical Ontologies
9:05-9:30    Jupp et al: Exploring Gene Ontology Annotations with OWL
9:30-9:55    Schulz et al: Records and situations. Integrating contextual aspects in clinical ontologies
10:00-11:00  Coffee (10:15-10:45) + Posters
11:00-11:25  Beisswanger et al: "Of Mice and Men" Revisited: Basic Quality Checks for Reference Alignments Applied to the Human-Mouse Anatomy Alignment
11:25-11:50  Livingston et al: An Ontology of Annotation Content Structure and Provenance
11:50-12:30  Keynote: Sorana Popa - Why does Drug R&D need good vocabularies and Semantic Integration?
12:30-1:30   Lunch and poster viewing
1:35-2:00    Grewe: Relating Processes and Events for Granularity-neutral Modeling
2:00-2:25    Batchelor et al: Processes and properties
2:25-2:50    Goldfain et al: Vital Sign Ontology
2:50-3:15    Bada et al: An Ontological Representation of Biomedical Data Sources and Records
3:15-4:15    Coffee break (3:30-4:00) + Posters
4:15-5:00    Four Flash updates, 10 min each:
             • Hastings et al, What's new and what's changing in ChEBI in 2011
             • Jacobsen et al, EMO - The Enzyme Mechanism Ontology
             • Oellrich et al, Quantitative comparison of two mapping methods between Human and Mammalian Phenotype Ontology
             • Yao et al, Using Machine Learning on a Translational Biomedical Ontology for Alzheimer's Disease
5:00-6:00    Invited talk by Martin Krallinger and Andrew Chatr-aryamontri: Detecting associations between scientific articles and ontology terms - the Molecular Interaction Ontology and BioCreative text mining challenges experience

Flash Updates and Posters

1. Tripathi et al: Automated Assessment of High Throughput Hypotheses on Gene Regulatory Mechanisms Involved in the Gastrin Response
2. Ramirez et al: New search method to mine biological data
3. Zhukova et al: KiSAO: Kinetic Simulation Algorithm Ontology
4. Duek et al: CALOHA: A new human anatomical ontology
5. Ison et al: The EDAM ontology for bioinformatics tools and data
6. Sansone et al: BioSharing: standards, policies and communication in bioscience
7. Jacobsen et al: EMO - The Enzyme Mechanism Ontology
8. Kibbe et al: Coupling disease and genes using Disease Ontology, NCBI GeneRIFs, and the NCBO Annotator service
9. Hastings et al: What's new and what's changing in ChEBI in 2011
10. Oellrich et al: Quantitative comparison of two mapping methods between Human and Mammalian Phenotype Ontology
11. Yao et al: Using Machine Learning on a Translational Biomedical Ontology for Alzheimer's Disease
12. Patricia L. Whetzel et al: Collaborative Development of Ontologies using WebProtégé and BioPortal
13. Jessica D. Tenenbaum and the Biositemaps Consortium: Biositemaps: A Framework for Biomedical Resource Discovery
14. Patricia L. Whetzel et al: NCBO Resource Index: Ontology-based Search and Mining of Biomedical Resources

Keynote Speakers

Andrew Su: Andrew Su is an Associate Professor in the Department of Molecular and Experimental Medicine at the Scripps Research Institute. Prior to joining Scripps in July 2011, he was the Associate Director for Bioinformatics at GNF, a pharmaceutical research institute. His group has built several well-used tools for biomedical research. Most notably, his group has led the development of the Gene Wiki and BioGPS, two projects that leverage the principle of community intelligence. In addition to building biomedical resources, his lab also directly pursues biomedical discovery in the fields of mouse genetics, cancer biology, and transcriptional regulation. More information can be found at http://sulab.org. His keynote talk is: Cultivating and mining the Gene Wiki for crowd-sourced gene annotation.

Sorana Popa: Sorana Popa serves as the Vocabulary Integration Leader in the Knowledge Engineering Programme led by Ian Dix at AstraZeneca R&D. Sorana leads the vocabulary integration work, from both a strategic and an operational perspective, mostly in discovery. She leads a team of developers and domain experts; recently her team has started working on incorporating vocabularies in the clinical areas. Sorana Popa was born in Bucharest, Romania, in 1968. She finished her primary and secondary schooling in Sweden, and in 1988 she returned to Bucharest to study at the Carol Davila University of Medicine and Pharmacy, from which she graduated in 1994. She worked as an MD in the Gothenburg area until May 1997, when she joined the former Astra Hässle, now AstraZeneca R&D, based in Mölndal, outside Gothenburg. Sorana has served in various roles across AstraZeneca, mainly as an information analyst and team leader, providing scientific information to different disease areas and drug discovery projects. Her keynote talk is: Why does Drug R&D need good vocabularies and Semantic Integration?

Special Sessions

Interactive session on Friday, July 15th at 4:15 pm, where groups presenting flash updates and papers on the latest tools for using ontologies will give one-on-one demonstrations.

Invited talk by Martin Krallinger and Andrew Chatr-aryamontri: Detecting associations between scientific articles and ontology terms - the Molecular Interaction Ontology and BioCreative text mining challenges experience, on Saturday, July 16th at 5:00 pm.

Acknowledgements

We thank Steven Leard and all at ISCB for their excellent technical assistance. We also wish to thank the program committee for their excellent input and reviews. The program committee, in alphabetical order, is:

Mike Bada, Colin Batchelor, Judith Blake, Olivier Bodenreider, Mathias Brochhausen, Alison Callahan, Kei Cheung, Paolo Ciccarese, John Copen, Adrien Coulet, Lindsay Cowell, Sudeshna Das, Duncan Davidson, Karen Dowell, Michel Dumontier, John Gennari, Graciela Gonzalez, Benjamin Good, Yongqun He, William Hogan, Clement Jonquet, Cliff Joslyn, Weech Lee, Paea LePendu, Phillip Lord, John Madden, James Malone, Scott Marshall, Robin McEntire, Onard Mejino, Genevieve Melton-Meaux, Parsa Mirhaji, Chris Mungall, David Newman, Chime Ogbuji, Helen Parkinson, Alex Passant, Philippe Rocca-Serra, Matthias Samwald, Susanna-Assunta Sansone, Neil Sarkar, Nigam Shah, Stefan Schulz, Larisa Soldatova, Holger Stenzhorn, Susie Stephens, Robert Stevens, Andrew Su, Jessica Turner, Trish Whetzel, Mark Wilkinson, Li Zhou.

OpenTox Predictive Toxicology Framework: toxicological ontology and semantic media wiki-based OpenToxipedia

Olga Tcheremenskaia* (A), Romualdo Benigni (A), Ivelina Nikolova (B), Nina Jeliazkova (B), Sylvia E. Escher (C), Helvi Grimm (C), Thomas Baier (C), Vladimir Poroikov (D), Alexey Lagunin (D), Micha Rautenberg (E) and Barry Hardy* (F)

(A) Istituto Superiore di Sanità, Environment and Health Department, Viale Regina Elena 299, Rome 00161, Italy; (B) Ideaconsult Ltd, A. Kanchev 4, Sofia 1000, Bulgaria; (C) Fraunhofer Institute for Toxicology & Experimental Medicine, Nikolai-Fuchs-Str. 1, 30625 Hannover, Germany; (D) Institute of Biomedical Chemistry of the Russian Academy of Sciences, Pogodinskaya street 10, 119121 Moscow, Russia; (E) In silico Toxicology, Altkircher Str. 4, CH-4052 Basel, Switzerland; (F) Douglas Connect, Baermeggenweg 14, CH-4314 Zeiningen, Switzerland

ABSTRACT
The OpenTox Framework, developed by the partners in the EC FP7 OpenTox project, aims at providing unified access to toxicity data, predictive models and validation procedures (B. Hardy, 2010). Interoperability of resources is achieved using a common information model, based on an OpenTox OWL-DL ontology and related ontologies, describing predictive algorithms, models and toxicity data. As toxicological data may come from different, heterogeneous sources, a deployed ontology unifying the terminology and the resources is critical for the rational and reliable organization of the data and its automatic processing. Up to now, the following related ontologies have been developed for OpenTox: the Toxicological ontology, listing the toxicological endpoints; the Organ system ontology, addressing targets/examinations and organs observed in in vivo studies; the ToxML ontology, a semi-automatic conversion of the ToxML schema; ToxLink, the ToxCast assays ontology; the OpenTox ontology, a representation of the OpenTox framework components: chemical compounds, datasets, algorithms, models and validation web services; and the Algorithms ontology, covering types of algorithms. Besides being defined in an ontology, OpenTox components are made available through standardized REST web services, where every compound, dataset or predictive method has a unique resolvable address (URI), used to retrieve its Resource Description Framework (RDF) representation or to initiate the associated calculations and generate new RDF-based resources. The services support the integration of toxicity and chemical data from various sources, the generation and validation of computer models for toxic effects, the seamless integration of new algorithms and scientifically sound validation routines, and provide a flexible framework which allows building an arbitrary number of applications tailored to solving different problems by end users (e.g. toxicologists).

1 INTRODUCTION

OpenTox (OT) was funded by FP7 to develop a framework for predictive toxicology modelling and application development. Based on the framework of web services, two initial OT web applications have been made available: ToxPredict1, which predicts the activity of a chemical structure submitted by the user with respect to a given toxicity endpoint, and ToxCreate2, which creates a predictive toxicology model from a user-submitted dataset. Ontology definition is important for OT, as it allows information to be integrated in a more efficient and reliable manner, thus reducing the cost, maintenance and risk of application development and deployment. At the moment, our toxicological ontology structure aims to cover five "critical" toxicity study types: carcinogenicity, in vitro and in vivo mutagenicity from micronucleus assays, repeated dose toxicity and aquatic toxicity studies. Even though several ontologies for the biomedical field are publicly available, a systematic ontology for toxicological effects and predictive toxicology is currently not covered by the OBO Foundry3 or the BioPortal ontology repository4. Whenever possible, we try to integrate relevant information from neighboring ontologies, such as the Foundational Model of Anatomy (FMA), the Ontology for Biomedical Investigations (OBI), the NCI Thesaurus, and SNOMED Clinical Terms, together with the ToxML (Toxicology XML standard) schema5. The Organs Ontology developed here is very closely linked to the INHAND initiative (International Harmonization of Nomenclature and Diagnostic Criteria for Lesions in Rats and Mice). INHAND aims to develop for the first time an internationally accepted

1 ToxPredict http://www.toxpredict.org/
2 ToxCreate http://www.toxcreate.org/
3 The OBO Foundry http://www.obofoundry.org/
4 NCBO BioPortal http://bioportal.bioontology.org/
5 Leadscope ToxML Schema www.leadscope.com/toxml.php

* To whom correspondence should be addressed. Email: Barry.Hardy(at)douglasconnect.com


standardized vocabulary for neoplastic and non-neoplastic lesions, as well as the definition of diagnostic features for organ systems observed in in vivo studies. Recently, the description of the respiratory system has been published (R. Renne, 2009). OpenToxipedia6 is a new, related community resource of toxicology terminology organized by means of Semantic MediaWiki. OpenToxipedia allows creating, adding, editing and maintaining terms used in both experimental and in silico toxicology. The particular importance of OpenToxipedia lies in its description of all the terms used in OT applications such as ToxPredict and ToxCreate.

2 METHODS

The construction of a formal ontology follows relatively established principles in knowledge representation. We have taken into consideration those available for biomedical ontology development, particularly the OBO Foundry principles. An open, public approach to ontology development supports current and future collaborations with different projects. We use the DL species of the Web Ontology Language (OWL DL), supported by the Protégé OWL editor. An overview of the OT ontology is given in the public area of the OT website7, together with instructions on how to enter the OT Collaborative Protégé Server and contribute to existing OWL projects. Some of the ontologies are manually created from scratch; others partially reuse existing ones and extend them with task-related concepts and relations. The ToxML ontology is semi-automatically generated from the existing ToxML schema by parsing it to OWL and applying specific rules, which convey the semantics and remove redundant information in the new format. OpenToxipedia has been developed using Semantic MediaWiki (SMW). It was created manually by experts in the fields of in silico and experimental toxicology on the basis of known regulatory documents, glossaries, dictionaries and some primary publications. All registered members are welcome to add new entries, suggest definitions and edit the existing resource at www.opentoxipedia.org. OpenToxipedia is curated by OT toxicology experts. SMW was chosen for the OpenToxipedia representation for the following main reasons: it enables automatic processing of the wiki knowledge base, and it enables data transfer between RDF and SMW through SPARQL. SMW will facilitate the automatic data exchange between OpenToxipedia and the ontologies, and their use by the OpenTox web services dealing with RDF data. SMW is a collaborative system that supports versioning, RDF export, tools to lock

6 OpenToxipedia www.opentoxipedia.org
7 OpenTox ontology development page www.opentox.org/dev/ontology

pages by a curator (fixing a validated vocabulary), and the possibility to add annotations without changing the ontology/RDF information.
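The ToxML-to-OWL step mentioned above can be pictured with a small sketch. The actual conversion rules are not published in this paper, so the snippet below only illustrates the general pattern of a semi-automatic schema-to-OWL translation; the namespace and the mapping details are hypothetical.

```python
# Illustrative sketch only: the real OpenTox conversion rules are not
# published here. Shown is the general pattern of mapping an XML Schema
# to OWL, with a hypothetical namespace and invented rule details.
import xml.etree.ElementTree as ET
from rdflib import Graph, Namespace, Literal, RDF, RDFS
from rdflib.namespace import OWL

XSD_NS = "{http://www.w3.org/2001/XMLSchema}"
TOXML = Namespace("http://example.org/toxml#")  # hypothetical namespace

def schema_to_owl(xsd_path: str) -> Graph:
    """Turn every named complexType into an OWL class; nested element
    references become object properties rather than IS-A links, since
    XML nesting does not imply subsumption."""
    g = Graph()
    g.bind("toxml", TOXML)
    tree = ET.parse(xsd_path)
    for ctype in tree.iter(XSD_NS + "complexType"):
        name = ctype.get("name")
        if not name:
            continue
        cls = TOXML[name]
        g.add((cls, RDF.type, OWL.Class))
        g.add((cls, RDFS.label, Literal(name)))
        for elem in ctype.iter(XSD_NS + "element"):
            ref = elem.get("type") or elem.get("name")
            if ref:
                prop = TOXML["has" + ref]
                g.add((prop, RDF.type, OWL.ObjectProperty))
                g.add((prop, RDFS.domain, cls))
                g.add((prop, RDFS.range, TOXML[ref]))
    return g
```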

3 RESULTS

Up to now, six ontologies have been made available through the OT Collaborative Protégé Server:
• Toxicological ontology;
• Organ system ontology;
• ToxML ontology;
• OpenTox ontology, representing components of the OpenTox web services framework;
• Algorithm types ontology;
• ToxLink (ToxCast assays ontology).
The OT Toxicological ontology at the moment contains five toxicity study types: carcinogenicity, in vitro bacterial mutagenesis, in vivo micronucleus, repeated dose toxicity (e.g., chronic, sub-chronic or sub-acute study types) and aquatic toxicity (see Figure 1). The purpose of this ontology is to enable the attributes of toxicological dataset entries to be associated with ontology concepts. The main OWL classes are "ToxicityStudyType", "TestSystem" (with subclasses such as strains, species, sex, route of exposure, etc.) and "TestResult" (with subclasses such as toxicity measure, test call, mode of action, target sites, etc.). The aquatic toxicity ontology was based on the requirements of European Union directive 92/69/EEC (O.J. L383 A), i.e., acute toxicity for fish (method C.1.), acute toxicity for Daphnia (C.2.), and the algal growth inhibition test (C.3.).
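The class skeleton just described can be made concrete with a minimal sketch. The snippet declares the main classes and the five study-type subclasses in rdflib under a hypothetical namespace; the authoritative OWL files are the ones maintained on the OT Collaborative Protégé Server.

```python
# Minimal sketch of the toxicological ontology skeleton described above,
# using rdflib and a hypothetical namespace; not the authoritative files.
from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import OWL

OT = Namespace("http://example.org/opentox-tox#")  # hypothetical
g = Graph()
g.bind("ot", OT)

# Top-level classes named in the text.
for name in ("ToxicityStudyType", "TestSystem", "TestResult"):
    g.add((OT[name], RDF.type, OWL.Class))

# The five study types enumerated above, as subclasses.
for study in ("Carcinogenicity", "InVitroBacterialMutagenesis",
              "InVivoMicronucleus", "RepeatedDoseToxicity",
              "AquaticToxicity"):
    g.add((OT[study], RDF.type, OWL.Class))
    g.add((OT[study], RDFS.subClassOf, OT.ToxicityStudyType))

# Example subclasses mentioned for the other two branches.
g.add((OT.RouteOfExposure, RDFS.subClassOf, OT.TestSystem))
g.add((OT.ToxicityMeasure, RDFS.subClassOf, OT.TestResult))

print(g.serialize(format="turtle"))
```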

Fig. 1. OT toxicological ontology structure.

The "target sites" toxicological class is to be linked to the Organ system ontology, developed by the Fraunhofer Institute for Toxicology & Experimental Medicine. The Organs Ontology is one of the most challenging ontology classes, addressing targets, examinations and organs observed in in vivo studies such as repeated dose toxicity and carcinogenicity. The ontology includes the detailed description of organs, starting from organ systems down to histological components. It was decided to use a hierarchical structure starting

with the organ system (e.g. digestive system), instead of orienting the ontology on the examinations performed in guideline studies such as histopathology, necropsy, and clinical observations. The principal structure of the organs ontology is as follows:

Class Organs system - Subclass Organs system
  |- Class Target organs - Subclass Target organs 1 to N
  |- Class Histopathology - Subclasses if needed

At the moment the Organs Ontology includes 12 organ systems: digestive system, respiratory system, circulatory system, endocrine system, male genital system, female genital system, hematopoietic system, integumentary system, nervous system and special sense organs, urinary system, musculoskeletal system, and immune system and lymphatic organs. Synonyms are included to account for differences in terminology. The ontology focuses on the organs observed in rodents, which are frequently used for toxicity testing. Species specificity will be introduced when combining the organ ontology with the toxicological endpoint ontology. Currently, the Toxicological Effects Ontology comprises neoplastic and non-neoplastic effects observed in repeated dose and cancer studies. Endpoint specificity of the effects will be included when combining the organ/effect root ontology with the toxicological endpoint ontology. The effects ontology consists of three main parts: classes of effects, linked to pathological effects, which are further linked to detailed diagnostic features as agreed in the INHAND initiative. Its functionality has been initially developed for the respiratory tract. The structure of the combined organ and effects ontology is depicted in Figure 2.

Fig. 2. Overview of the structure of the combined organ (in orange) and effect (in green) ontologies.

The ToxML ontology is a semi-automatic conversion of the ToxML schema to OWL-DL. The most recent ToxML release has a comprehensive, well-structured scheme for many toxicity studies (carcinogenicity, in vitro mutagenicity, in vivo micronucleus, repeated dose toxicity), which fits the OpenTox purposes well. This was verified by manually mapping various existing database entries to the ToxML schema. The resulting ontology will be applied as a medium to reference and annotate the contents of databases coming from various sources and toxicity studies. Our purpose is not only to develop a cross-database matching schema, but also to benefit

from the powerful reasoning mechanism that OWL offers to make inferences on existing facts in the databases. In order to use ToxML as a scheme for accommodating our data, we need to overcome issues raised by the nature of the XML description: many fields contain free text instead of named concepts; standardized vocabularies for many classes do not exist (e.g. target sites, mode of action, route of exposure), so some classes and properties are named by more than one label and others have ambiguous labels; and the XML nested structure does not follow the natural IS-A relation used for subclassing in OWL. For this reason, ad hoc rules for conversion are implemented. The resulting ontology has a flat structure representing numerous relations other than the IS-A relation, since IS-A does not apply to the concepts in use. Ambiguous labels are unified and a step towards label standardization is achieved; where possible, object-type properties are introduced instead of datatype ones, so that the referenced values remain named resources instead of string values. The OpenTox ontology8 provides a common information model for the most common components found in any application providing predictive toxicology functionality, namely chemical compounds, datasets of chemical compounds, data processing algorithms, machine learning algorithms, predictive models and validation routines. The OpenTox framework exposes REST web services corresponding to each of these common components. A generic OWL representation is defined for every component (e.g. every OT dataset is a subclass of ot:Dataset, every algorithm is a subclass of ot:Algorithm and every model is a subclass of ot:Model). This allows a unified representation across diverse data and algorithms, and a uniform interface to data processing services, which take generic ot:Dataset resources as input and generate generic ot:Dataset resources as output. Specific types of algorithms are described in the algorithm types ontology, and further details of descriptor calculation algorithms are specified via the Blue Obelisk ontology (Guha et al., 2006) of cheminformatics algorithms (e.g. algorithm references, descriptor categories) and extensions specifically developed to cover algorithms written by OpenTox developers. Assigning specific information about the datasets, properties and types of algorithms and models is done via linking to the relevant ontologies, for example by subclassing (rdf:type), owl:sameAs links, or the Blue Obelisk ontology bo:instanceOf predicate. The simultaneous use of OT datasets and compound properties as resources of generic ot:Dataset and ot:Feature type in the OT ontology, together with linking to specific toxicology ontologies, provides a flexible mechanism for annotation.
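Because every OpenTox resource is a resolvable URI whose RDF representation can be fetched over HTTP, a client can work with any dataset generically. A minimal sketch, assuming a hypothetical dataset URI (concrete service addresses are deployment-specific):

```python
# Sketch of generic OpenTox-style resource retrieval: every component is
# a resolvable URI whose RDF representation can be requested over HTTP.
# The dataset URI below is hypothetical.
import requests
from rdflib import Graph

dataset_uri = "http://example.org/opentox/dataset/42"  # hypothetical

resp = requests.get(dataset_uri,
                    headers={"Accept": "application/rdf+xml"})
resp.raise_for_status()

g = Graph()
g.parse(data=resp.text, format="xml")

# Every dataset is typed as (a subclass of) ot:Dataset, so a client can
# inspect its triples without dataset-specific knowledge.
for s, p, o in g:
    print(s, p, o)
```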

This mechanism allows users of OT web services to upload datasets of chemical compounds and arbitrary named properties of the compounds. The datasets are converted into a uniform ot:Dataset representation, and chemical compound properties can be manually annotated with the proper terms from toxicology ontologies. The annotation and assignment of owl:sameAs links is currently done manually, via the OT REST web service interface, which modifies the relevant resource representation by adding or modifying triples. In principle, more sophisticated techniques could be applied and the corresponding RDF representation updated via the same REST interface. This approach is currently used to enter and represent data in OT services and applications. A description of one of the OT API implementations, with examples of the RDF representation of various resources, is provided in (N. Jeliazkova, V. Jeliazkov, 2011). The sixth ontology project initiated is the ToxLink ontology, representing the ToxCast assays from the US EPA. This development is a collaborative effort of OpenTox with ToxCast9 to provide an ontological description of in vitro toxicological assays. At present, OpenToxipedia contains 862 toxicological terms with descriptions and literature references, classified into 26 categories (see Figure 3).

8 OpenTox Ontology http://www.opentox.org/api/1.1/opentox.owl

Fig. 3. OpenToxipedia categories for predictive toxicology.

The terms can be browsed either by category or in alphabetical order. Specialists in different toxicology fields are invited to take part in the creation and curation of OpenToxipedia. It can be used as a compendium of freely available predictive toxicology resources, supporting the application and development of the standards for representation of toxicology data and the vocabulary and ontology development needed by OpenTox use cases and web services. The following rules for term management in OpenToxipedia have been developed: (i) add terms - any registered user (curators receive a message and decide which additions are approved and become publicly available); (ii) edit descriptions of terms - curators; (iii) add remarks - any registered user (curators receive an alert message).

4 DISCUSSION AND CONCLUSIONS

The need to speed up the toxicological assessment of chemicals, using fewer animals and less expensive tools, has strongly stimulated the development of predictive toxicology and of structure-based approaches. A wide spectrum of predictive approaches applied to toxicity exists today, including read-across, regulatory categories, and (Quantitative) Structure-Activity Relationship ((Q)SAR) modelling. All these predictive approaches share the need for highly structured information as a starting point: the definition of an ontology and of a controlled vocabulary is a crucial requirement in order to standardize and organize the chemical and toxicological data on which the predictive toxicology methods build. In addition, the availability of an ontology specific to predictive toxicology is crucial for the interoperability of OpenTox services with other platforms and software in developing and deploying user applications. The ontology will be submitted to the BioPortal website for dissemination and feedback.

ACKNOWLEDGEMENTS

OpenTox - An Open Source Predictive Toxicology Framework, www.opentox.org, is funded under the EU Seventh Framework Programme: HEALTH-2007-1.3-3 Promotion, development, validation, acceptance and implementation of QSARs (Quantitative Structure-Activity Relationships) for toxicology, Project Reference Number Health-F5-2008-200787 (2008-2011). Project partners: Douglas Connect, In Silico Toxicology, Ideaconsult, Istituto Superiore di Sanità, Technical University of Munich, Albert Ludwigs University Freiburg, National Technical University of Athens, David Gallagher, Institute of Biomedical Chemistry of the Russian Academy of Medical Sciences, Seascape Learning and The Fraunhofer Institute for Toxicology & Experimental Medicine.

9 ToxCast www.epa.gov/ncct/toxcast/

REFERENCES
B. Hardy, N. Douglas et al. (2010), Collaborative Development of Predictive Toxicology Applications, Journal of Cheminformatics, 2:7; doi:10.1186/1758-2946-2-7.
R. Renne, A. Brix et al. (2009), Proliferative and Nonproliferative Lesions of the Rat and Mouse Respiratory Tract, Toxicologic Pathology, 37: 5-73.
R. Guha, M.T. Howard et al. (2006), The Blue Obelisk - interoperability in chemical informatics, J Chem Inf Model, 46(3): 991-998.
N. Jeliazkova, V. Jeliazkov (2011), AMBIT RESTful web services: an implementation of the OpenTox application programming interface, Journal of Cheminformatics, 3:18; doi:10.1186/1758-2946-3-18.


Annotation Analysis for Testing Drug Safety Signals

Paea LePendu*1, Stephen Racunas2, Srinivasan Iyer1, Yi Liu1, Cédrick Fairon3, and Nigam H. Shah1

1 Stanford University, Stanford CA; 2 Grass-Roots Science; 3 Université catholique de Louvain, Belgium

ABSTRACT With the availability of tools for automated coding of unstructured text using natural language processing, the existence of over 250 biomedical ontologies, and the increasing access to large volumes of electronic medical data, it is possible to apply data-mining techniques to the large amounts of unstructured data available in medicine and health care. For example, by computationally encoding the free-text narrative—comprising the majority of the clinical electronic medical data—it may be possible to test drug safety signals in an active manner. We describe the application of NCBO Annotation tools to process clinical text and the mining of the resulting annotations to compute the risk of having a myocardial infarction on taking Vioxx (rofecoxib) for Rheumatoid arthritis. Our preliminary results show that it is possible to apply annotation analysis methods for testing hypotheses about drug safety using electronic medical records.

1 INTRODUCTION

Changes in biomedical science, public policy, information technology, and electronic health record (EHR) adoption have converged recently to enable a transformation in the delivery, efficiency, and effectiveness of health care. In a recent report (PCAST 2010), the President's Council of Advisors on Science and Technology outlined a data-centric approach, propelled by Federal incentives, aimed at galvanizing EHR adoption rates and catalyzing health information exchange. While analyzing structured EHR data has proven useful in many different contexts, the true richness and complexity of health records (roughly 80 percent of their content) lies within the clinical notes: free-text reports written by doctors and nurses in their daily practice. Advances in natural language processing now provide the technology to process these textual notes rapidly and accurately (Friedman, Johnson et al. 1995; Savova, Masanz et al. 2010), allowing us to computationally encode and analyze the free-text narrative. Equipped with the vast computing infrastructure and sophisticated mining and machine learning tools available to us today, we are poised to cross the "threshold of sufficient data" and to make significant leaps in medicine (Halevy, Norvig et al. 2009).

* To whom correspondence should be addressed ([email protected]).

The U.S. Food and Drug Administration (FDA) Amendments Act of 2007 mandated that the FDA develop a system for using health care data to identify risks of marketed drugs and other medical products (Stang, Ryan et al. 2010), which resulted in the recent formation of the Observational Medical Outcomes Partnership. Meanwhile, adverse drug events currently result in significant costs: an estimated 200,000 inpatient and 2 million ambulatory ADRs could be prevented, resulting in savings of over $4.5 billion per year (Hillestad, Bigelow et al. 2005). It is estimated that roughly 30% of hospital stays have an adverse drug event (Classen, Resar et al. 2011), and current one-drug-at-a-time methods for surveillance are woefully inadequate because no one monitors the "real life" situation of patients getting over 3 concomitant drugs (Classen, Resar et al. 2011). The current paradigm of drug safety surveillance is based on spontaneous reporting systems (SRS), which are databases containing voluntarily submitted reports of suspected adverse drug events encountered during clinical practice. In the USA, the primary database for such reports is the Adverse Event Reporting System (AERS) database at the FDA. The largest such SRS is the World Health Organization's Programme for International Drug Monitoring1. The reports in these databases are typically mined for drug-event associations via statistical methods based on disproportionality measures, which quantify the magnitude of difference between observed and expected rates of particular drug-event pairs (Bate and Evans 2009). Given the large amounts of data available in resources such as AERS (Weiss-Smith, Deshpande et al. 2011), researchers are starting to develop methods for detecting potential multi-drug adverse events (Tatonetti, Fernald et al. 2009), for detecting multi-item adverse events (Harpaz, Chase et al. 2010), and for discovering drug groups that share a common set of adverse events (Harpaz, Perez et al. 2011). Increasingly there are efforts to use other data sources, such as EHRs, for the purpose of detecting potential adverse events (Harpaz, Haerian et al. 2010) (Wang, Hripcsak et al. 2009), for countering the biases inherent in SRS (Schneeweiss and Avorn 2005), and for discovering multi-drug adverse events (Coloma, Schuemie et al. 2011). Researchers have also

1 http://www.who-umc.org/


attempted to use billing and claims data for active drug safety surveillance (Dore, Trivedi et al. 2009) (Nadkarni 2010), as well as to reason over published literature and discover drug-drug interactions based on properties of drug metabolism (Tari, Anwar et al. 2010). Given these advances in detecting (i.e., discovering or inferring) drug safety signals from the AERS, it becomes crucial to develop methods for testing (i.e., searching for or applying) these signals throughout the EHR, so as to realize their benefits for new patients before an adverse event occurs. We hypothesize that ontology-based approaches, analogous to enrichment analysis (LePendu, Shah et al. 2011), can help to fill this gap. To validate our hypothesis, we computed the risk of having a myocardial infarction on taking Vioxx for Rheumatoid arthritis (Graham, Campen et al. 2005), using the annotations created on the textual notes for over 1 million patients in the Stanford Clinical Data Warehouse (STRIDE). The main challenge in computing this risk is that the EHR mainly comprises unstructured, free-text narrative. To extract disease and drug annotations from the EHR text, we developed an Annotation Workflow based on the National Center for Biomedical Ontology (NCBO) Annotator Web Service (Shah, Bhatia et al. 2009). On analyzing this extracted data, we were able to recapitulate the Vioxx risk. Without the annotations extracted from the unstructured text, the Vioxx risk signal is lost in the background noise, showing that, after pre-processing with tools such as the NCBO Annotator Workflow, the EHR might be a viable source for testing drug safety signals.

2 THE NCBO ANNOTATOR WORKFLOW

We created a standalone Annotator Workflow based upon the existing NCBO Annotator Web Service. The Annotator Workflow is highly optimized for both space and time when performing large-scale annotation runs. Whereas the NCBO Annotator Web Service would have taken over 6 months and 800 GB of free disk space to process the roughly 9.5 million patient notes in the Stanford Clinical Data Warehouse (STRIDE), the Annotator Workflow takes only 7 hours and 4.5 GB of disk space. Moreover, we extended the Annotator Workflow to incorporate negation detection: the ability to discern whether a term is negated within the context of the narrative. Negation detection is based on the trigger terms used in the NegEx algorithm (Chapman, Bridewell et al. 2001). The annotation process utilizes the NCBO BioPortal ontology library (over 250 ontologies, including SNOMED-CT, RxNorm, NDF-RT, and MedDRA) to identify biomedical concepts in text, using a dictionary of terms generated from the ontologies. For this study, we specifically configured the workflow to use the 16 ontologies most relevant to


clinical domains. Ontologies provide a normalization of terms found within text. Moreover, ontologies define relationships among these terms, e.g., parent-child relationships, which can be used to generalize and aggregate information automatically. Such reasoning capabilities can play a crucial role in ADR detection: for example, extrapolating the known relationship between myopathy and rhabdomyolysis could have automatically inferred the adverse relationship between myopathy and cerivastatin and prevented 2 years of unmitigated risk (Bate and Evans 2009).
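The workflow's negation handling can be pictured with a toy sketch: dictionary lookup over the note text, plus a NegEx-style check for trigger terms preceding a mention. The dictionary, trigger list and concept codes below are illustrative stand-ins, not the actual NCBO or NegEx resources.

```python
# Toy sketch of dictionary-based concept recognition with NegEx-style
# negation triggers; the dictionary, trigger list and codes are small
# illustrative stand-ins for the real ontology-derived resources.
import re

TERM_DICT = {"myocardial infarction": "SNOMEDCT:22298006",  # example code
             "rheumatoid arthritis": "SNOMEDCT:69896004",   # example code
             "rofecoxib": "RxNorm:example"}                 # placeholder
NEG_TRIGGERS = ("no ", "denies ", "without ", "negative for ")

def annotate(note: str):
    """Return (concept_id, term, negated) for each dictionary hit."""
    hits = []
    lowered = note.lower()
    for term, concept_id in TERM_DICT.items():
        for m in re.finditer(re.escape(term), lowered):
            # NegEx-style heuristic: a trigger shortly before the mention
            window = lowered[max(0, m.start() - 40):m.start()]
            negated = any(t in window for t in NEG_TRIGGERS)
            hits.append((concept_id, term, negated))
    return hits

print(annotate("Patient denies myocardial infarction; history of "
               "rheumatoid arthritis, taking rofecoxib."))
```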

3 TESTING FOR THE VIOXX RISK SIGNAL

Graham et al. showed that patients with Rheumatoid arthritis (RA) who took Vioxx (rofecoxib) had a significantly elevated risk (Relative Odds Ratio = 1.34) of myocardial infarction (MI), which resulted in the drug being taken off the market in 2004 (Graham, Campen et al. 2005). To reproduce this risk, we needed to identify patients in the EHR who have the given condition (RA), who are taking the drug, and who suffer the adverse event (see Figure 1). Furthermore, we need to look at records before 2005, since Vioxx was discontinued subsequently. STRIDE2 is a repository of 17 years' worth of patient data at Stanford. It contains data from 1.6 million patients, 15 million encounters, 25 million coded ICD9 diagnoses, and a combination of pathology, radiology, and transcription reports totaling 9.5 million unstructured clinical notes. To identify patients with RA and MI, we scanned through the structured data of 25 million coded ICD9 diagnoses for codes beginning with the ICD9 codes for RA and MI ("714" and "410", respectively). We also scanned through the normalized annotations of the unstructured data to look for non-negated mentions of MI and RA. We denote the first occurrence or mention of the condition as t0(RA) and t0(MI). We did not have access to the structured medication data; therefore, we relied upon annotations derived from the textual notes to identify patients possibly taking Vioxx. We scanned through the normalized annotations of the unstructured data to look for non-negated mentions of Vioxx or rofecoxib (Vioxx is a trade name for rofecoxib). We denote the first occurrence or mention of the drug as t0(Vioxx). From the observed patient counts, we constructed the contingency table shown in Table 1 and calculated the reporting odds ratio (ROR) and the proportional reporting ratio (PRR) as described in (Bate and Evans 2009). We conducted the test with the expected temporal constraints taken into consideration, as depicted in Figure 1. We obtained a ROR of 2.058 with a confidence interval (CI) of [1.804, 2.349], and a

2 https://clinicalinformatics.stanford.edu/projects/cdw.html


PRR of 1.828 with a CI of [1.645, 2.032]. The uncorrected X2 statistic was significant, with a p-value < 10^-7.
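Both disproportionality measures follow the standard definitions in (Bate and Evans 2009). A minimal sketch that reproduces the reported numbers from the Table 1 counts below (a=339, b=1221, c=1488, d=11031):

```python
# Reporting odds ratio (ROR) and proportional reporting ratio (PRR) with
# 95% CIs, using the standard disproportionality formulas; the counts
# are those of Table 1.
from math import exp, log, sqrt

a, b, c, d = 339, 1221, 1488, 11031  # MI/no-MI counts, Vioxx/no-Vioxx

ror = (a * d) / (b * c)                      # (a/b) / (c/d)
se_ror = sqrt(1/a + 1/b + 1/c + 1/d)
ror_ci = (exp(log(ror) - 1.96 * se_ror), exp(log(ror) + 1.96 * se_ror))

prr = (a / (a + b)) / (c / (c + d))
se_prr = sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
prr_ci = (exp(log(prr) - 1.96 * se_prr), exp(log(prr) + 1.96 * se_prr))

print(f"ROR = {ror:.3f}, 95% CI [{ror_ci[0]:.3f}, {ror_ci[1]:.3f}]")
print(f"PRR = {prr:.3f}, 95% CI [{prr_ci[0]:.3f}, {prr_ci[1]:.3f}]")
# -> ROR = 2.058, CI [1.804, 2.349]; PRR = 1.828, CI [1.645, 2.032]
```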

Figure 1. The Vioxx risk pattern (P5) occurs within a background of RA patients (P1 & P2) that either never have an initial MI incident (t0 denotes first occurrence) or never take Vioxx, and among significant noise, such as records in which MI occurs prior to the diagnosis of RA (P3), or prior to Vioxx being prescribed (P4).

Table 1. Contingency table for Vioxx and myocardial infarction within the STRIDE dataset with temporal constraints, using both ICD9 coded data for RA and MI and the NCBO-tagged 9.5 million unstructured clinical notes.

Patients with RA before 2005    MI        No MI       Total
Vioxx                           a=339     b=1221      (a+b)=1560
No Vioxx                        c=1488    d=11031     (c+d)=12519
Total                           1827      12252       14079

In comparison, without using the unstructured data, the results are more ambiguous (see Table 2). The corresponding risks without the unstructured data were: ROR=1.524 with CI=[0.872, 2.666]; PRR=1.508 with CI=[0.8768, 2.594]; and X2=0.06816. The confidence intervals are too large and the significance is too low to be meaningful. Note in particular that the ratios a/b and c/d are reduced by nearly a factor of 10 in Table 2 compared to Table 1.

Table 2. Contingency table for Vioxx and myocardial infarction within the STRIDE dataset with temporal constraints, using only ICD9 coded data for RA and MI.

Patients with RA before 2005    MI       No MI      Total
Vioxx                           a=16     b=487      (a+b)=503
No Vioxx                        c=61     d=2831     (c+d)=2892
Total                           77       3318       3395

4 DISCUSSION

Our results support our hypothesis that the unstructured data in the EHR provide a viable source for testing drug safety

signals, using annotations created from the textual notes. We used the well-known relationship between Vioxx and MI as an example to demonstrate that such testing is possible. Clearly, our results hinge upon the efficacy of the annotation mechanism. We have conducted a comparative evaluation of two concept recognizers used in the biomedical domain, Mgrep and MetaMap, and found that Mgrep has clear advantages in large-scale, service-oriented applications, specifically with respect to flexibility, speed and scalability (Shah, Bhatia et al. 2009). The precision of concept recognition varies depending on the text in each resource and the type of entity being recognized: from 87% for recognizing disease terms in descriptions of clinical trials to 23% for PubMed abstracts, with an average of 68% across four different sources of text. We are currently conducting similar studies for text in the clinical domain. For the current work, we assume a similar level of performance. Finally, to further improve the performance of annotation, we have extended the Annotation Workflow system to optionally incorporate the Unitex (Paumier 2003) concept recognition tool, which provides more powerful features than the NCBO Annotator currently supports, e.g., morpheme-based matching. Other than taking slightly longer to complete an annotation run (~10% longer), deploying the workflow with the Unitex concept recognizer added no additional complexity. This continuing effort is part of our long-term goal of incrementally improving the annotation tools from NCBO. Along with these validation efforts, we are focusing on ease of use and have packaged the workflow on a USB stick. For example, setting up an instance of the workflow to annotate 12 million radiology reports at the University of California at San Francisco took about 45 minutes of customization and explanation time to successfully deploy the tool on UCSF's infrastructure.

5 CONCLUSION

We have significantly scaled and subsequently applied the National Center for Biomedical Ontology (NCBO) Annotator tool to computationally annotate the free-text narrative of over 9.5 million reports from the Stanford Clinical Data Warehouse. We analyzed the EHR annotations and recapitulated the latent Vioxx risk signal. We found that the risk is far more perceptible when the unstructured data in the EHR is used than when coded data is used alone. We recapitulated the Vioxx risk by means of the reporting odds ratio, which is closely related to enrichment analysis (LePendu, Shah et al. 2011), demonstrating the potential of ontology-based annotation analysis methods.


We believe that our results establish the feasibility of using annotations created from clinical notes as a source for testing as well as possibly detecting drug safety signals.

ACKNOWLEDGEMENTS This work was funded in large part by the NIH grant U54 HG004028 for the National Center for Biomedical Ontology. We are grateful to Tanya Podchiyska and Todd Ferris from STRIDE for their assistance with accessing and obtaining the patient records. We are also grateful to Mark Musen for his feedback and support. This work was conducted with the appropriate IRB approval from Stanford University.

REFERENCES
Bate, A. and S. J. W. Evans (2009). "Quantitative signal detection using spontaneous ADR reporting." Pharmacoepidemiol Drug Saf 18(6): 427-436.
Chapman, W. W., W. Bridewell, et al. (2001). "A simple algorithm for identifying negated findings and diseases in discharge summaries." Journal of Biomedical Informatics 34(5): 301-310.
Classen, D. C., R. Resar, et al. (2011). "'Global Trigger Tool' Shows That Adverse Events In Hospitals May Be Ten Times Greater Than Previously Measured." Health Affairs 30(4): 581.
Coloma, P. M., M. J. Schuemie, et al. (2011). "Combining electronic healthcare databases in Europe to allow for large-scale drug safety monitoring: the EU-ADR Project." Pharmacoepidemiol Drug Saf 20(1): 1-11.
Dore, D. D., A. N. Trivedi, et al. (2009). "Association between extent of thiazolidinedione exposure and risk of acute myocardial infarction." Pharmacotherapy 29(7): 775-783.
Friedman, C., S. Johnson, et al. (1995). "Architectural requirements for a multipurpose natural language processor in the clinical environment." Proceedings of the Annual Symposium on Computer Application in Medical Care: 347.
Graham, D., D. Campen, et al. (2005). "Risk of acute myocardial infarction and sudden cardiac death in patients treated with cyclo-oxygenase 2 selective and non-selective non-steroidal anti-inflammatory drugs: nested case-control study." The Lancet 365(9458): 475-481.
Halevy, A., P. Norvig, et al. (2009). "The unreasonable effectiveness of data." Intelligent Systems, IEEE 24(2): 8-12.
Harpaz, R., H. S. Chase, et al. (2010). "Mining multi-item drug adverse effect associations in spontaneous reporting systems." BMC Bioinformatics 11 Suppl 9: S7.
Harpaz, R., K. Haerian, et al. (2010). "Mining electronic health records for adverse drug effects using regression based


methods." Proceedings of the 1st ACM International Health Informatics Symposium: 100-107. Harpaz, R., H. Perez, et al. (2011). "Biclustering of adverse drug events in the FDA's spontaneous reporting system." Clin Pharmacol Ther 89(2): 243-250. Hillestad, R., J. Bigelow, et al. (2005). "Can electronic medical record systems transform health care? Potential health benefits, savings, and costs." Health Aff (Millwood) 24(5): 1103-1117. LePendu, P., N. Shah, et al. (2011). "Enabling Enrichment Analysis Using the Human Disease Ontology." Journal of Biomedical Informatics (to appear). Nadkarni, P. M. (2010). "Drug safety surveillance using deidentified EMR and claims data: issues and challenges." J Am Med Inform Assoc 17(6): 671-674. Paumier, S. (2003). De la reconnaissance de formes linguistiques à l'analyse syntaxique., Université de Marne-la-Vallée. Doctorat. PCAST (2010). "Realizing the Full Potential of Health Information Technology to Improve Healthcare for Americans: The Path Forward." 1-108. Savova, G. K., J. J. Masanz, et al. (2010). "Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications." Journal of the American Medical Informatics Association 17(5): 507-513. Schneeweiss, S. and J. Avorn (2005). "A review of uses of health care utilization databases for epidemiologic research on therapeutics." J Clin Epidemiol 58(4): 323-337. Shah, N. H., N. Bhatia, et al. (2009). "Comparison of concept recognizers for building the Open Biomedical Annotator." BMC Bioinformatics 10 Suppl 9: S14. Stang, P. E., P. B. Ryan, et al. (2010). "Advancing the science for active surveillance: rationale and design for the Observational Medical Outcomes Partnership." Ann Intern Med 153(9): 600606. Tari, L., S. Anwar, et al. (2010). "Discovering drug–drug interactions: a text-mining and reasoning approach based on properties of drug metabolism." Bioinformatics 26(18): i547. Tatonetti, N., G. Fernald, et al. (2009). "A novel signal detection algorithm to identify hidden drug-drug interactions in the FDA Adverse Event Reporting System." AMIA TBI 18(6): 427-436. Wang, X., G. Hripcsak, et al. (2009). "Active computerized pharmacovigilance using natural language processing, statistics, and electronic health records: a feasibility study." AMIA 16(3): 328-337. Weiss-Smith, S., G. Deshpande, et al. (2011). "The FDA drug safety surveillance program: adverse event reporting trends." Arch Intern Med 171(6): 591-593.

A Maximum-Entropy Approach for Accurate Document Annotation in the Biomedical Domain

George Tsatsaronis*, Natalia Macari, Sunna Torge, Heiko Dietze, Michael Schroeder

Biotechnology Center (BIOTEC), Technische Universität Dresden, 01062 Dresden, Germany

ABSTRACT

Motivation: The increasing amount of scientific literature on the Internet and the absence of efficient tools for classifying and searching documents are the two most important factors that influence the speed of search and the quality of the results. Previous studies have shown that the usage of ontologies makes it possible to process document and query information at the semantic level, which greatly improves the search for relevant information and takes one step further towards the Semantic Web. A fundamental step in these approaches is the annotation of documents with ontology concepts, which can also be seen as a classification task. In this paper we address this issue for the biomedical domain and present a new automated and robust method, based on a Maximum Entropy approach, for annotating biomedical literature documents with MeSH concepts, which achieves a very high F-measure. The experimental evaluation shows that the suggested approach is robust to the ambiguity of terms and can provide very good performance even when a very small number of training documents is used.

1 INTRODUCTION

With the rapid expansion of the Internet as a source of scientific and educational literature, the search for relevant information has become a difficult and time-consuming process. The current state of the Internet can be characterized by weakly structured data and, practically, the absence of relationships between data. Current search engines, such as Google and Yahoo, provide keyword-based search, which takes into account mainly the surface string similarity between query and document terms, and often a simple synonym expansion, omitting other types of information about terms, such as polysemy and homonymy. In order to address this problem and improve search results, the usage of ontologies has been suggested to allow for document annotation with ontology concepts. The usage of ontologies provides content-based access to the data, which makes it possible to process information at the semantic level and significantly improve the search for relevant documents, as has been shown by recent studies in the case of search in the life sciences literature (Doms, 2008; Doms and Schroeder, 2005).

Fig. 1. Left: number of PubMed articles (blue line) indexed over the period 1965-2010 and logarithmic trend (red line). Right: number of PubMed articles (blue line), plotted with the number of MeSH-annotated documents (red line).

Some representative examples of such search engines for the biomedical domain are: (a) GoPubMed1, which uses the Gene Ontology (GO) and the Medical Subject Headings (MeSH) as background knowledge for indexing the biomedical literature stored in the PubMed database, and various text mining techniques and algorithms (stemming, tokenization, synonym detection) for the identification of relevant ontology entities in PubMed abstracts; (b) semedico2, which provides access to semantic metadata about MEDLINE abstracts using the JULIE Lab text mining engine3 and MeSH as a knowledge base; and (c) novoseek, which uses externally available data and contextual term information to identify key biomedical terms in biomedical literature documents. However, in all cases the challenges that arise are several and difficult to resolve; more precisely: (i) the number of scientific documents to be annotated and indexed is very large, as the number of PubMed documents grows very fast; (ii) the presence of ambiguous concepts makes the classification (annotation) process a challenging task; and (iii) the classifier model used needs to be trained and tuned specifically for this domain in order to achieve the best possible results, and in tandem needs to be fast and robust to address challenges (i) and (ii), respectively.

* To whom correspondence should be addressed: [email protected].

1 http://www.gopubmed.com/web/gopubmed/
2 http://www.semedico.org
3 http://www.julielab.de


Fig. 2. Pie chart showing the ambiguous MeSH terms, examining 4,078 terms and consulting three dictionaries/thesauri.

As a proof of concept for (i), we present in Figure 1 the growth of PubMed documents over the period 1965-2010. The figure shows clearly that the number of PubMed documents has doubled within the past 20 years (left), as also discussed by Biglu (2007). The exponential trend (red line) also shows that this tendency continues. In parallel, we can observe that the number of documents annotated with MeSH concepts (red line) attempts to keep up with the document growth (right). For this purpose, the Medical Text Indexer system is used, which makes the annotation process semi-automatic and improves the efficiency of indexing PubMed articles. This underlines the fundamental need for fast and accurate automated annotation methods with MeSH concepts, so that the growth of PubMed documents can be followed with respective concept annotations. As a proof of concept for (ii), we have randomly selected a set of 4,078 MeSH terms, namely the terms under the roots: diseases, anatomy, and psychology. We also base our analysis and our experimental evaluation on these terms. For all of them we have measured the number of different meanings that these terms may carry, consulting three very popular thesauri/lexica: the WordNet thesaurus for the English language, the Wikipedia encyclopedia (English version), which is currently the largest electronic encyclopedia available, and the UMLS thesaurus, which is focused on our examined domain. The measurements, shown in the pie chart of Figure 2, reveal that 23.3% of the examined terms are ambiguous, i.e., they have more than one meaning. Another interesting finding is the coverage of the non-domain-specific lexica, i.e., WordNet and Wikipedia, which is 78% combined. In fact, only 22% of the examined terms have entries only in the domain-specific UMLS thesaurus. To stress the implications of such ambiguous terms for the annotation process, we have furthermore analyzed the number of different documents in which these 4,078 terms appear literally in GoPubMed, as well as in another popular, general-purpose search engine, namely Yahoo. The aim of this analysis is to show how the number of documents in which these terms appear literally varies, depending on their number of entries in the used lexica. In Figure 3 we present four plots showing the results of this analysis.


Fig. 3. Scatter plots of the number of documents in which the terms appear literally in GoPubMed (horizontal axis) and Yahoo (vertical axis). Red lines show medians.

The top left plot shows, for all the terms, the number of documents in which each examined term appears literally in the GoPubMed (horizontal axis) and Yahoo (vertical axis) indexed documents. The plot shows that the difference in the number of retrieved documents between GoPubMed and Yahoo is several orders of magnitude. A typical term appears literally in almost 5,000 GoPubMed documents and in 1 million Yahoo documents. The remaining three plots highlight, respectively, the terms for which there is no entry in the majority of the used lexica (yellow), the terms for which there is exactly one entry in the majority of the used lexica (red), and the terms which are ambiguous according to the majority of the used lexica. It is evident from the plots that the placement of the terms shifts from left to right and, in parallel, from bottom to top as the number of entries increases. This shows that the ambiguous terms may appear in a very large number of documents (contexts), larger than for the rest of the terms; thus, any context-based model for document annotation will have to handle a lot of noise for those terms, highlighting the need for a very robust annotator.

2 APPROACH

The approach that we follow for automated annotation of biomedical literature documents with MeSH concepts creates a context model for each and every concept of the used ontology, which characterizes the term and consists of the lexical tokens taken from related PubMed articles' abstracts. The approach uses the notion of Maximum Entropy, whose principle is to measure the uncertainty of each class (also known as entropy), expressed by the information that we do not have about the classes occupied by


the data. The Maximum Entropy (MaxEnt) approach has been applied successfully in the past to several natural language processing and computational linguistics tasks, such as word sense disambiguation (Doms, 2008), part-of-speech tagging, prepositional phrase attachment, and named entity recognition (Ratnaparkhi, 1998), but also to gene annotation (Raychauduri et al., 2002) and to mining patient medication status (Pakhomov et al., 2002); in this work we therefore adopt it in order to investigate its performance on document annotation in the biomedical domain. The MaxEnt method is insensitive to noisy data and capable of processing incomplete data, such as sparse data or data with missing attributes. In addition, MaxEnt models can be trained on massive data sets (Mann et al., 2009), and implementations are publicly available through open source projects such as OpenNLP (http://opennlp.sourceforge.net/index.html).

In Figure 4 we show in detail how we apply MaxEnt to the annotation of documents with MeSH concepts. The algorithm is separated into two parts: training and testing. For each MeSH term we measure the values of pre-selected features by examining PubMed documents. The features are of four types: (1) lexical tokens from the titles of PubMed documents, (2) lexical tokens from the abstracts of PubMed documents, (3) the name of the journal in which the respective document was published, and (4) the year of publication. The algorithm constructs a context model for each term, trained on a pre-selected set of positive and negative examples. In the training part, the feature weights are fitted by maximizing the likelihood using iteratively reweighted least squares (IRLS). Each model is trained on exactly two classes per term: positive, denoted 1, and negative, denoted 0. Once the feature weights for each class are known for each term mj in M (βj1 and βj0 respectively), the testing procedure can be applied, which decides for each term mj separately whether it should annotate the instance ti (positive class) or not (negative class). For this decision, a classification threshold with parameter δ is used. A minimal illustrative sketch of this per-term scheme follows.
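The sketch below is not the authors' implementation (which used OpenNLP and IRLS): binary logistic regression serves as the two-class maximum-entropy model, scikit-learn's default solver stands in for IRLS, and the feature extraction and all names are our own illustrative assumptions.

```python
# Minimal sketch of the per-term MaxEnt annotator described above.
# Binary logistic regression is the two-class maximum-entropy model;
# the authors used OpenNLP with IRLS, here scikit-learn stands in.
# Feature extraction and all names are illustrative assumptions.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(doc):
    """The four feature types: title tokens, abstract tokens, journal, year."""
    f = {"journal=" + doc["journal"]: 1.0, "year=" + str(doc["year"]): 1.0}
    for tok in doc["title"].lower().split():
        f["title:" + tok] = 1.0
    for tok in doc["abstract"].lower().split():
        f["abstract:" + tok] = 1.0
    return f

def train_term_model(pos_docs, neg_docs):
    """Train one binary model for a single MeSH term (class 1 = annotate)."""
    vec = DictVectorizer()
    X = vec.fit_transform([features(d) for d in pos_docs + neg_docs])
    y = [1] * len(pos_docs) + [0] * len(neg_docs)
    return vec, LogisticRegression(max_iter=1000).fit(X, y)

def annotate(vec, model, doc, delta=0.1):
    """Annotate the document with the term if P(positive) exceeds delta."""
    return model.predict_proba(vec.transform([features(doc)]))[0, 1] > delta
```

One such model is trained per term, so annotating a document amounts to running it through every term's classifier and keeping the terms whose positive-class probability clears the δ threshold.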

3 RESULTS

For our experimental setup we used 4,078 MeSH terms under the MeSH roots diseases, anatomy, and psychology. This selection is not random: psychology is considered to contain terms that are difficult to annotate, because many of them are general; diseases is considered to contain easy terms; and anatomy is of unknown difficulty. The selection thus spans all levels of annotation difficulty. All of the experiments reported below were conducted using 10-fold cross-validation, and in all cases we measure average precision, recall and F-Measure over the classification results. In all cases, only the title and the abstract of each document were used for the lexical features (i.e., two of the four features used by MaxEnt), as explained in the previous section. The δ parameter was set to the value found optimal on the validation set (10% of the training data was always kept as a validation set); this value was 0.1.

Table 1. Results of annotation for two methods, Exact Matching and MaxEnt. Results on ambiguous terms are also shown separately.

Method            All Terms            Ambiguous Terms
                  P     R     F        P     R     F
Exact Matching    52.3  22.1  31.1     45.4  37.0  40.9
MaxEnt            89.8  91.8  90.8     99.3  86.8  92.6

Fig. 4. The MaxEnt algorithm for annotating documents with MeSH ontology terms.

Table 1 shows the results for our method (MaxEnt) as well as for a simple baseline annotation technique, Exact Matching. Exact Matching searches for the exact or stemmed appearance of each term in the abstract or title of a document; if it is found, the document is annotated with that term. The table shows that MaxEnt achieves an F-Measure of 90.8% over all the terms of our experiment, almost three times the F-Measure of Exact Matching (31.1%). The most interesting observations arise from the separate study of the ambiguous terms, here meaning the terms with more than one entry in UMLS; these are also included in the all-terms results of Table 1. Naturally, Exact Matching drops in precision on those terms, by almost 7 percentage points (p.p.), while its recall increases by almost 15 p.p. MaxEnt performs much better than Exact Matching on those terms, with the interesting finding that its behavior is inverted: MaxEnt gains precision and loses recall on the ambiguous terms. We plan to investigate and interpret this behavior in detail in future work. Regarding performance on the individual MeSH branches, the MaxEnt F-Measure was 93.52% for anatomy, 92.21% for diseases and 91.35% for psychology.

Fig. 5. The changes in F-Measure as training examples increase, the distribution of the performance of MaxEnt on the ambiguous terms only, and feature analysis.

Figure 5 (top left) shows the F-Measure of MaxEnt for an increasing number of training documents. As shown, MaxEnt performs well even with a few hundred training documents per term. The top right plot shows the distribution of F-Measure values over the ambiguous terms; in the majority of cases the F-Measure is very high, above 90%. The two bottom plots show the F-Measures obtained when using each feature type individually. As shown, title and year are the most important features, while journal is very important when a large number of training documents is used. We also present the F-Measures obtained when several combinations of features are explored (bottom right). The results again show that year is very important (blue and black lines), since omitting it (green and red lines) drops performance significantly. Overall, the results show that MaxEnt can successfully annotate documents with MeSH terms, and with very few training documents needed. They also show that MaxEnt produces robust models whose precision and F-Measure are not affected by the ambiguity of the terms.

4 CONCLUSIONS AND FUTURE WORK

In this work we introduced a novel approach for annotating documents of the biomedical literature with concepts from the MeSH ontology. The approach is based on the use of Maximum Entropy (MaxEnt) classifiers to perform the annotation: for each term, a MaxEnt model is trained, which can then be applied to any document in order to decide whether it should be annotated with the respective term or not. We performed a thorough experimental evaluation of the proposed MaxEnt approach on a selected set of 4,078 MeSH terms that were used to annotate PubMed documents. We showed that the used feature types (title, abstract, year, and journal) are sufficient for producing highly accurate annotations. The results showed that the proposed approach was able to annotate PubMed documents with an average precision of 89.8%, average recall of 91.8%, and average F-Measure of 90.8%. Regarding the tuning of the used parameters, we found that a δ value of 0.1 produces the best results, and that even few training documents are sufficient to achieve very good performance. As future work, we plan to investigate the connection between the ambiguity of terms and the semantic search procedure and ranking of documents.

REFERENCES


Biglu, M.H. (2007) The editorial policy of languages is being changed in Medline. Acimed, 16(3).
Doms, A. and Schroeder, M. (2005) GoPubMed: Exploring PubMed with the Gene Ontology. Nucleic Acids Research, 33.
Doms, A. (2008) GoPubMed: Ontology-based literature search for the life sciences. PhD Thesis, Technical University of Dresden.
Mann, G., et al. (2009) Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models. Advances in Neural Information Processing Systems, 22.
Pakhomov, S.V., et al. (2002) Maximum entropy modeling for mining patient medication status from free text. Proc. of AMIA Symposium, pp. 587-591.
Ratnaparkhi, A. (1998) Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD Thesis, University of Pennsylvania.
Raychauduri, S., et al. (2002) Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature. Genome Res., 12(1).

The Coriell Cell Line Ontology: Rapidly Developing Large Ontologies

Chao Pang, Tomasz Adamusiak, Helen Parkinson and James Malone*
European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SD
* To whom correspondence should be addressed.

ABSTRACT
Motivation: Many online catalogues of biomedical products and artifacts exist that are loosely structured but of great value to the community. These include cell lines, enzymes, antibodies, reagents, and laboratory equipment. Improving the representation of these products has several benefits, such as the reporting of products used in experimental protocols and the integration of experimental data into BioSample databases. Formalization of these resources is often time-consuming, labor-intensive and expensive. We describe an approach to structuring these catalogues using semi-automated techniques to rapidly develop OWL ontologies. We demonstrate the approach on the Coriell Cell Line catalogue, producing an ontology of ~28,000 classes which imports classes from other community ontologies such as the Disease Ontology, the Cell Type Ontology and FMA.
Availability: http://bioportal.bioontology.org/ontologies/1589

1 INTRODUCTION

The biomedical community has embraced the use of ontologies as a means of describing scientific data, such as experimental protocols (OBI) (The OBI Consortium, 2010) and experimental variables (EFO) (Malone et al., 2010). Manual development of these ontologies is a costly and time-consuming activity. There is clearly value in producing robust, expertly curated ontologies such as the Gene Ontology (GO); however, development in this form is clearly not repeatable across every area of biomedicine. Programmatic approaches can be powerful when transforming and enhancing resources with pre-existing structure into an ontological form (Antezana et al., 2009). Loosely structured data sources contain implicit knowledge: within the data or within the presentation layer, e.g. within the categories of a drop-down list on a website. Similarly, implicit knowledge may be contained within the column headers of spreadsheets or within database table and field names. It is possible to exploit this implicit knowledge and enable a rapid transformation into explicit ontology classes. Here we present our approach to the rapid development of the Coriell cell line ontology, based on a collection of semi-structured cell line descriptions from the Coriell cell line catalogue, which contains ~27,000 mammalian cell lines and metadata about them. We demonstrate that by using a standardized modeling pattern and text mining approaches, a large ontology (~28,000 classes) can be rapidly produced which logically describes each cell line and its biological properties. The scope of this work was representation of the catalogue in OWL and development of a robust design pattern for cell lines; however, we expect the approach to scale and to be adaptable to other similar resources.

2 METHODS

The principal methodology underlying this work is ontology normalization (Rector, 2003): we manage multiple inheritance using class descriptions in OWL and infer structure using description logic reasoners such as HermiT. By providing axioms on classes, the need to assert potentially conflicting or fragile subsumption hierarchies is removed. This approach ensures that the biological knowledge used to create the hierarchy is explicit, and it renders implicit knowledge explicit in the ontology. The first step was to develop a standardized model for cell lines. In collaboration with the Cell Line Ontology (Sarntivijai et al., 2011) and the Cell Type Ontology (Meehan et al., 2011) we created a model (Figure 1) which aligns these ontologies and which was used during development.
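The following is a minimal owlready2 sketch of the normalization pattern under stated assumptions: only class axioms are asserted, and a DL reasoner infers the hierarchy. owlready2's sync_reasoner() runs the bundled HermiT (a local Java runtime is required); the IRI and every class and property name below are illustrative, not taken from the actual ontology.

```python
# Illustrative normalization pattern: assert axioms, let HermiT infer
# the subsumption hierarchy. All names and the IRI are illustrative;
# owlready2's sync_reasoner() requires a Java runtime for HermiT.
from owlready2 import Thing, ObjectProperty, get_ontology, sync_reasoner

onto = get_ontology("http://example.org/coriell-sketch.owl")

with onto:
    class CellLine(Thing): pass
    class Disease(Thing): pass
    class Cancer(Disease): pass
    class is_model_for(ObjectProperty): pass

    # Defined class: necessary-and-sufficient conditions, no asserted children
    class CancerCellLine(CellLine):
        equivalent_to = [CellLine & is_model_for.some(Cancer)]

    # A cell line asserted only as a direct child of CellLine ...
    class GM00001(CellLine): pass           # hypothetical cell line identifier
    GM00001.is_a.append(is_model_for.some(Cancer))

sync_reasoner()                             # HermiT classifies the ontology
print(CancerCellLine in GM00001.is_a)       # True: subsumption was inferred
```

Because no subsumption between the cell line and the defined class is ever asserted, rearranging the model only requires editing axioms and re-running the reasoner, which is what makes the hierarchy robust to change.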

Fig. 1. The cell line model used to represent Coriell cell lines.

Our primary queries of interest are contained in this model, which determined the data we extracted from the catalogue, specifically: cell line name, cell type, disease, organism part, organism and gender. The model was evaluated against primary competency questions derived from use cases related to the development of a BioSample Database (www.ebi.ac.uk/biosamples/) at the EBI. These include queries by common cell types, by disease and by tissue. We use the relation is_model_for to reflect the use of cell lines as models for particular diseases. Given the large size of the Coriell catalogue, we developed a scalable semi-automatic approach to creating the ontology. Information on each cell line was contained within 104 separate and redundant text files describing different aspects of the Coriell products, derived from an SQL dump of a relational database. Five key files were selected which contained semi-structured descriptions covering the entities described in Figure 1 and which corresponded to our use cases. These files were merged, redundant information was removed, and a single 'cell line' spreadsheet was produced using bespoke Perl scripts.

2.1 Lexical concept recognition

The cell line spreadsheet was used as input for lexical concept recognition, with the aim of generating a list of classes from reference ontologies that matched the textual descriptions in the catalogue. The Perl Onto-Mapper (www.ebi.ac.uk/efo/tools) was employed, as it has previously been used successfully in building similar application ontologies (Malone et al., 2009). The approach allows fuzzy matching to identify classes from class labels and their synonyms. Given the nomenclature of areas such as disease and anatomy, where synonymy is common, a fuzzy matching approach provided flexibility in mapping. A metric was assigned to each match, and those with less than 100% confidence were manually inspected. The reference ontologies (Table 1) were selected based on the catalogue content and the model. Anatomy was challenging: although the Coriell cell lines are primarily mammalian, no single mammalian anatomy ontology exists which would provide the necessary coverage. Although some efforts are ongoing to develop a homology-based anatomy ontology (Travillian et al., 2010), we used a pre-existing resource, the Minimal Anatomy Terminology (Bard et al., 2008). This species-neutral ontology provides mappings to multiple anatomical ontologies and is subsumed by the Experimental Factor Ontology, with which we plan to merge the Coriell Cell Line Ontology in future. Some human-specific classes were also imported from FMA. Note, however, that the majority of the terms, representing cell lines, were generated de novo rather than simply imported; there was no ontological resource for the Coriell catalogue prior to this work. The disease information within the Coriell descriptions consisted of references to OMIM (McKusick, 2007). Since OMIM is not a disease ontology, we exploited the links to OMIM provided within the Human Disease Ontology (DO) and imported DO classes. Where no cross references were found, a manual inspection using BioPortal (Noy et al., 2009) was required to extract the disease.
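For illustration, the toy sketch below mimics the idea of the fuzzy matching step (Onto-Mapper itself is a Perl tool, so this is a re-creation, not its code); the example identifier is shown for illustration only, and anything below 100% similarity is flagged for manual inspection as described above.

```python
# Toy fuzzy lexical matching in the spirit of Onto-Mapper: match a
# catalogue string against class labels and synonyms, flagging anything
# under 100% similarity for manual inspection. Illustrative only.
from difflib import SequenceMatcher

def best_match(text, classes):
    """classes: iterable of (identifier, [label, synonym, ...]) pairs."""
    best_id, best_score = None, 0.0
    for ident, names in classes:
        for name in names:
            score = SequenceMatcher(None, text.lower(), name.lower()).ratio()
            if score > best_score:
                best_id, best_score = ident, score
    return best_id, best_score

ident, score = best_match("smooth muscle",
                          [("CL:0000192", ["smooth muscle cell"])])
if score < 1.0:
    print(f"manual inspection needed: {ident} matched at {score:.0%}")
```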


Table 1. Reference ontologies used in the Coriell cell line ontology

Domain      Reference Ontology                  Term Number
Organism    NCBI Taxonomy, OBI                  93
Anatomy     Experimental Factor Ontology, FMA   61
Cell Type   Cell Type Ontology                  11
Disease     Human Disease, NCI Thesaurus        337
Gender      PATO                                3

2.2 Ontology engineering using the OWL-API

The lexical mapping resulted in a set of files containing mappings between a label and the corresponding URI in the reference ontology, one file per domain. These mappings were used to construct the ontology programmatically (Figure 2).

Fig. 2. Methodology for programmatic ontology creation

The process was implemented as follows (an illustrative sketch of steps (1)-(3) appears after this list):
(1) Cell line descriptions contained in the single merged spreadsheet are taken as input.
(2) Files containing mappings from class label to reference ontology class Internationalized Resource Identifier (IRI) are matched.
(3) Class IRIs are used to import the corresponding ontology classes from the reference ontologies, along with parent classes and any axiomatic and annotation information within the class signature.
(4) The EFO upper level (a slim version of BFO) is re-used here and determines where imported classes are placed; e.g. disease classes are imported under the disease parent, itself a child of disposition.
(5) The Coriell cell line ontology is output in OWL.
(6) The ontology was manually reviewed for correctness, checked for consistency using HermiT 1.3.1, and test defined classes were added.
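The sketch below is a rough Python analogue of steps (1)-(3) using owlready2; the authors used the Java OWL-API, and the file names, column names and mapping format here are assumptions made for illustration.

```python
# Illustrative analogue of steps (1)-(3); the real implementation used
# the Java OWL-API. File names, columns and format are assumptions.
import csv
from owlready2 import get_ontology

# Reference ontology to import from (the Human Disease Ontology)
doid = get_ontology("http://purl.obolibrary.org/obo/doid.owl").load()

# Step (2): label -> reference-ontology IRI mappings, one file per domain
with open("disease_mappings.csv") as f:
    disease_iri = {r["label"]: r["iri"] for r in csv.DictReader(f)}

# Step (1): cell line descriptions from the single merged spreadsheet
with open("cell_lines.csv") as f:
    for row in csv.DictReader(f):
        iri = disease_iri.get(row["disease"])
        if iri:
            # Step (3): fetch the corresponding class from the reference
            # ontology; a MIREOT-style import of its annotations and
            # parents would follow here
            cls = doid.search_one(iri=iri)
            print(row["cell_line"], "is_model_for", cls)
```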

3 RESULTS

The Coriell cell line ontology contains 27,002 cell line classes, covering 11 cell types, 61 anatomical terms and 93 organisms. 657 OMIM numbers were attached to cell lines, and 393 OMIM numbers were mapped to 337 Disease Ontology classes. 7,688 cell lines were confirmed to model disease, and a small number modeled multiple diseases, for example ND00139, which models Parkinson's disease and Lewy body disease. Following the creation of the ontology and the validation of all lexical matches and of the ontology itself by a domain expert, some refinements to the imported structure were required, as follows:

3.1 Organism taxonomy

Organism classes imported from the NCBI taxonomy have long chains of parent classes; e.g. Homo sapiens has 28 classes in its subclass hierarchy. We retrospectively removed some of these nodes, applying the following design principle: 1. remove intermediate classes when the child class does not have more than 2 siblings; 2. when the deletion leads to >3 child classes, the parent class is retained. This strategy removed a large number of classes which were not required by our query use cases. (A toy sketch of one reading of these rules follows.)
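The two rules admit more than one interpretation; the sketch below shows one possible reading over a simple parent-to-children map, and is illustrative only, not the code used to build the ontology.

```python
# One possible reading of the pruning rules, over a parent -> children
# dict. Illustrative only: the real pruning ran over imported NCBI
# taxonomy classes.
def prune(children, node):
    """Return the pruned child list for *node*."""
    kept = []
    for child in children.get(node, []):
        sub = prune(children, child)
        # Rule 1: an intermediate class whose children have at most two
        # siblings each (i.e. <= 3 children) is a splice candidate;
        # rule 2: retain it if splicing leaves this node > 3 children.
        if sub and len(sub) <= 3 and len(kept) + len(sub) <= 3:
            kept.extend(sub)      # intermediate removed, grandchildren kept
        else:
            kept.append(child)
    return kept

taxonomy = {"cellular organisms": ["Eukaryota"],
            "Eukaryota": ["Opisthokonta"],
            "Opisthokonta": ["Homo sapiens"]}
print(prune(taxonomy, "cellular organisms"))   # ['Homo sapiens']
```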

3.2 Adding defined classes to infer structure

Use of the normalisation methodology results in an asserted flat cell line hierarchy, i.e. the only asserted parent of each cell line is the cell line class. For browsing purposes, however, it is often useful to provide an organizational hierarchy, and as such we created one under cell line using defined classes in OWL, i.e. classes with necessary and sufficient restrictions describing their members.

Fig. 3. Inference of human cancer cell line hierarchy in Protégé

For example, human cancer cell line (Figure 3) shows inferred subclasses and has the following necessary and sufficient restriction, in Manchester OWL syntax:

'cell line'
  and (is_model_for some cancer)
  and (derives_from some ('cell type'
       and (part_of some ('organism part'
            and (part_of some 'Homo sapiens')))))

The nesting reflects an important distinction between separate statements: in effect, we are saying, for a specific organism, of which a specific organism part is part, and from which a specific cell type was taken. For the example in Figure 3, the defined class restricts membership to those classes where cancer is the modeled disease and which are derived from humans (more specifically, cell types that are part of an organism part which is part of a human). We have also used disjoints in some areas of the ontology; for example, by making Homo sapiens disjoint from its siblings under organism, we are able to query for things which are not Homo sapiens, because they have been explicitly defined as such.

3.3 Cell type

22 unique cell type terms were mapped to the Cell Type Ontology; 11 terms matched with 100% similarity. Partial mappings were refined manually: e.g. smooth muscle is not a cell type and was modified to smooth muscle cell, and myeloma is not a cell type but a cancer of plasma cells and was changed to plasma cell. The remaining 11 unmapped terms were not cell type terms and were removed.

3.4 Anatomy

There were 81 unique terms describing anatomy, of which 45 mapped exactly to pre-existing terms in the MAT. Some unmapped terms describe classes other than anatomy, such as fibroma and leiomyoma (diseases), and were removed. Buttock-thigh and Thorax/abdomen could each be separated into two single terms, but it is not clear which part the terms were describing, so these were also removed. A further 9 unmapped terms did not appear to fit into anatomy, such as Keloid breast organoid, and were removed. Among the remaining terms unmapped in the concept recognition step, 12 were mapped to FMA, 9 to EFO, 2 to SNOMED CT, 2 to the NCI Thesaurus, and 1 remained unmapped. Mixing of terms from the disease and anatomy domains was found to be common in many parts of the Coriell catalogue; manual effort was spent assessing the outputs of lexical matching to correct these.

3.5 Disease

We imported 337 Disease Ontology terms into the Coriell cell line ontology. DO is not well axiomatised beyond the use of subclass relationships. EFO, however, provides more information about class relationships (e.g. disease to anatomical parts). For disease we therefore added axioms from EFO to allow the construction of defined classes such as 'liver disease cell lines'. Imported classes were axiomatised using additional logical restrictions, e.g. an axiom linking a disease to an anatomical part. This does not affect the DO child and parent classes, and the canonical structure and IRIs from DO are preserved.

3.6 Rapid generation and regeneration

The ontology was developed over 3 months by one person working full time. The majority of this time was spent developing the code to produce the ontology; a repeat exercise would take a great deal less time. We made several changes to the ontology as we progressed and refined the model slightly; the programmatic method used meant that regenerating the new OWL ontology took minutes. Rapid programmatic addition of content is also possible. By comparison with the manual development of a similar ontology, e.g. the Cell Line Ontology, we estimate that ~12 months of development time was saved.

4 DISCUSSION

One of the central claims of this work is that the ontology was rapidly developed using the methods described. Over the 3 months in which this work was conducted, we estimate that 2 months comprised investigation of the catalogue content and Perl scripting to merge and format the initial input files; a further month's programming resulted in an ontology of ~28,000 classes. Generalizable components of the methodology include the design of reusable design patterns, the re-use of ontology development code, and the exploitation of the MIREOT process for term imports. There is a trade-off between hand-crafted curation by individual experts and the rapid development of a very large resource. Our approach is of most benefit when semi-structured data exist and existing Foundry-type ontologies are available, e.g. for cell types. As a one-off SQL dump was used for development, updates will need to be managed in future, and a dynamic method for accessing new data is desirable. One of the criteria for inclusion in the OBO Foundry effort (Smith et al., 2007) is that every class is given a textual definition. The effort required to manually produce good textual definitions for an ontology the size of the Coriell cell line ontology is significant. Given the axiomatisation of the ontology, however, efforts such as producing natural language from OWL statements may offer an effective and rapid method of producing textual definitions (Stevens et al., 2011). If such an approach can be applied, we will seek inclusion of the artifact in the OBO Foundry in the future. We are also currently working with the Cell Line Ontology to ensure our respective models are synchronized and to merge the Coriell cell line ontology with the CLO, which is currently derived from the American Tissue Culture Collection (ATCC). Other work includes mapping to all resources which contain cell line references and addition of these to the ontology, re-running of imports to detect changes in source ontologies, term requests to e.g. the Cell Type Ontology to classify cells by anatomical part, and the manual addition of information where possible; e.g. much text containing phenotypic descriptions was unstructured and could be mined and added. A complete evaluation of additional metadata versus that of the CLO is also desirable, in order to prioritise where to add curation effort and which additional data could be added to the core we have built. This work has allowed us to refine the cell line model within EFO to be consistent with the CLO, and this will be revised in future releases of EFO. Future work also includes the release of the Coriell ontology to Bio2RDF for linked open data access. Finally, our programmatic approach is fully compatible with manual curation and ontology development, and a combined approach is likely to produce rich, well-structured ontologies for community use.

ACKNOWLEDGEMENTS


We thank the Functional Genomics Production Team; the Coriell Institute for Medical Research, Alan Ruttenberg and Science Commons for providing the Coriell SQL dump; Lynn Schriml and colleagues from the Disease Ontology for OMIM mappings; and Sirarat Sarntivijai, Oliver He, Alexander Diehl and Terry Meehan for discussions on the cell line model. Funding: The European Molecular Biology Laboratory, and the EC (HEALTH theme no. 200754, Gen2Phen).

REFERENCES
Antezana, E. et al. (2009) The Cell Cycle Ontology: an application ontology for the representation and integrated analysis of the cell cycle process. Genome Biology, 10(5):R58.
Bard, JBL. et al. (2008) Minimal anatomy terminology (MAT): a species-independent terminology for anatomical mapping and retrieval. Proc. of the ISMB 2008 SIG meeting on Bio-ontologies, Toronto.
McKusick, VA. (2007) Mendelian inheritance in man and its online version, OMIM. Am. J. Hum. Genet., 80(4):588-604.
Malone, J. et al. (2009) Developing an ontology from the application up. Proceedings of OWLED 2009.
Malone, J. et al. (2010) Modeling sample variables with an experimental factor ontology. Bioinformatics, 26(8):1112-1118.
Meehan, TF. et al. (2011) Logical development of the cell ontology. BMC Bioinformatics, 12:6.
Noy, NF. et al. (2009) BioPortal: ontologies and integrated data resources at the click of a mouse. Nuc. Acids Res., 37(Web Server issue):W170-3.
OMIM, Online Mendelian Inheritance in Man (2011) McKusick-Nathans, National Library of Medicine (Bethesda, MD), date accessed: January 12, 2011. URL: http://www.ncbi.nlm.nih.gov/omim/
Rector, AL. (2003) Modularisation of domain ontologies implemented in description logics and related formalisms including OWL. Proc. of 2nd Int. Conf. on Knowledge Capture 2003.
Sarntivijai, S. et al. (2011) Cell Line Ontology: Redesigning Cell Line Knowledgebase to Aid Integrative Translational Informatics. ICBO 2011, Buffalo.
Smith, B. et al. (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology, 25:1251-1255.
Stevens, R. et al. (2011) Automating generation of textual class definitions from OWL to English. J. Biomedical Semantics, 2(Suppl 2):S5.
The Gene Ontology Consortium (2000) Gene ontology: tool for the unification of biology. Nat. Genet., 25(1):25-9.
The OBI Consortium (2010) Modeling experimental processes with OBI. J. Biomedical Semantics, 1(Suppl 1):S7.
Travillian, RS. et al. (2011) Anatomy ontologies and potential users: Bridging the gap. J. Biomedical Semantics, in press.

An exercise in kidney factomics: From article titles to RDF knowledge base

James M. Eales1, George Demetriou1 and Robert Stevens1*
1 School of Computer Science, University of Manchester, UK
* To whom correspondence should be addressed.

ABSTRACT
Motivation: Many existing resources integrate data between databases, either semantically through the use of RDF and triplestores (e.g. Bio2RDF) or with web links and ID mapping services (e.g. PICR, eUtils). Results declared in the literature, however, are only rarely interlinked with existing databases and even more rarely interlinked with each other. We describe a method to take factual statements reported in the literature and turn them into semantic networks of RDF triples. Our method is based on finding titles of papers that contain positive, direct statements about the outcome of a biomedical investigation. We then use dependency parsing and an ontological perspective to create and combine graphs of knowledge about a domain. Our aim in this work is to collect knowledge from the literature for inclusion in the Kidney and Urinary Pathway Knowledge Base (KUPKB), which will be used in the e-LICO project (http://www.e-lico.eu/) to illustrate the utility of data-mining methods for biomarker discovery and pathway modelling.

1 INTRODUCTION

A common approach to creating a knowledge base is to transform an existing database or data resource into triples and then to model the relationships between these triples using either existing or newly developed ontological resources, as in the prototype knowledge base for the life sciences (http://www.w3.org/TR/hcls-kb/) (Jupp et al. 2010, Croset et al. 2010). Once the ontological structure behind a knowledge base has been defined, we can augment it with information from other sources, such as the literature or further databases. This augmentation takes a considerable amount of effort, to identify statements in the literature and to form them into a representation compatible with the knowledge base (Croset et al. 2010, Coulet et al. 2010a). The Kidney and Urinary Pathway Knowledge Base (KUPKB, http://www.e-lico.eu/kupkb/) has been created from existing databases and ontologies, and it has its own ontology (KUPO) for modeling relationships between data (Jupp et al. 2010). Currently the KUPKB is populated with experimental results manually extracted from the published literature. We extend this by providing triples for inclusion in the KUPKB that are automatically extracted from the literature and are known to describe an experimental result. We want to identify a focused set of reliable statements about what is known in the KUP domain. Traditionally, most text mining systems use article abstracts or full text; instead, we use article titles. We analyse titles because they are short and to the point: titles can summarise the findings of a whole study in a single sentence. Titles are also the first thing a user sees when searching PubMed, and they are therefore important for advertising an article to potential readers. If a study identifies a new piece of definable knowledge, then the authors will usually want to present this clearly in the title; if a study finds a slightly less than clear result, then the language used to describe it is often softened, and we can detect this using text mining methods. Our process for extracting these triples involves the computationally expensive task of dependency parsing (Coulet et al. 2010a, Klein and Manning 2003), so it is important to limit the number and length of the sentences to be analysed. Titles can use complex language but are also quite short; this makes them a useful alternative to abstracts or full-text documents. Previous work in this area has also used dependency parsing (Coulet et al. 2010a, Coulet et al. 2010b) and has proven useful in the field of pharmacogenomics when looking for relationships between pre-defined sets of entities. Our approach focuses on identifying the facts presented in arbitrary biological articles, representing them as RDF triples, and later matching these to entities in existing ontologies. Further work incorporating the use of semantic patterns for identifying entity relationships (Gaizauskas et al. 2003, Humphreys et al. 2000) has proven useful for capturing relationships describing protein structure, metabolic pathways and the function of enzymes. All of these have used a semantic framework to make the results of the analysis more widely usable, but also to make it easier to incorporate newly identified relationships with existing knowledge and then to form queries over the combined relationships; it is this flexibility that we seek in this work. Our approach is to collect a set of titles, classify them into factual and non-factual groups, and then extract sets of triples from the factual titles.



We define a factual title here as "a positive, direct statement about the outcome of a biomedical investigation". An example of a factual title is: "Bluetongue virus RNA binding protein NS2 is a modulator of viral replication and assembly." (PubMed ID: 17241458)

We can see that this title does not contain "soft" or "hedged" language and instead clearly states a result of an investigation. Such statements do not contain all the contextual information necessary to fully comprehend the implications of their findings, but this is not our aim; instead we hope to capture what is reported by the authors and then present it to other readers, who can investigate further. An example of a non-factual title is: "A role for NANOG in G1 to S transition in human embryonic stem cells through direct binding of CDK6 and CDC25A" (PubMed ID: 19139263)

This title contains many specifics (e.g. the NANOG protein, and the G1 and S cell cycle phases) and does allude to a role for NANOG in cell cycle transition, but the role is not explicitly defined; the title merely suggests that a role exists. Such statements are important and could be used, but the lack of an explicit role would have to be recorded; this will be future work. Our work has revealed other kinds of titles, such as those that report "hedged" or possible results, those that describe tools or methods, and those that simply say what the article is about. In this work we concentrate on positive, direct descriptions of an investigation, to create as focused a corpus as possible of statements on a given topic as a basis for triplification.

2 METHODS

The resources referred to in this paper are available as part of myExperiment pack 181 (http://www.myexperiment.org/packs/181.html). Titles often contain multiple sentences, and these can have distinct linguistic purposes. As we want to be able to distinguish between factual and non-factual titles (and a single title can contain both factual and non-factual parts), we split all titles into their component sentences using the OpenNLP sentence detector and a set of heuristics that improve its performance. A training data set of 1,938 title sentences (derived from 1,875 titles) was annotated with a simple label of 'good' or 'bad', pertaining to whether they are factual (good) or not (bad). The training data titles were randomly collected from a set of 82 biologically-themed journals present in PubMed Central; these were not specific to the kidney and urinary pathway domain, but were drawn from biological articles in general.

Titles were collected through the eUtils interface to PubMed. A keyword search for 'kidney' or 'renal' in the title/abstract field returned 86,217 results (21/10/10); the title and PubMed ID of each citation were retrieved and stored for analysis. These titles were then split into a set of 91,626 sentences using the same method as for the training data. For each sentence we derive a set of attributes to describe the title. These attributes fall into 5 groups: simple, word, phrasal, sentence and biological attributes. We use information on tokens, biological named entities, POS tags, chunks, the parse tree and the list of dependencies to profile each sentence. A full list of these attributes and the training data set can be found in our myExperiment pack. Of significant note is our use of the Whatizit (Rebholz-Schuhmann et al. 2008) named entity recognition service, which provides database IDs for proteins, genes, diseases and chemicals; we use the number of matches for each entity type as attributes, and we use the IDs to create URI references. We build an SVM classifier model using the full set of profiles from the training data, using the SMO implementation of an SVM classifier from Weka within RapidMiner (http://rapid-i.com/content/view/202/206/lang,en/). A sketch of the title retrieval step follows.

2.1 Triplification

Our triplification process uses the dependency parse of the sentence, provided by the Stanford parser (de Marneffe et al. 2006, Klein and Manning 2003), to identify subjects, objects and predicates through the application of heuristic rules. The dependencies are retrieved from the classification process and reused in the triplification process. The rules are applied in the following order:
(1) Concatenate all compound noun (nn) and adjectival modifier (amod) dependencies on shared governor tokens.


(2) Identify all nominal subject (nsubj) and nominal subject passive (nsubjpass) dependencies (the subjects).




(3) Identify all direct object (dobj) dependencies (the objects).
(4) Attempt to join subject and object dependencies via a common governor token. The shared governor token becomes the predicate of a new triple, with the dependent tokens of the subject and object dependencies becoming the subject and object of the new triple, respectively.
(5) If the object of the new triple has a prepositional modifier (prep), then attempt to create a second new triple (Figure 1), using the dependent token of the prep as its object. The object of the first triple and the subject of the second are set to a new anonymous entity with a unique label.

(6) For each new triple, look for conjunct (conj) dependencies with a token shared between the triple's object and the dependency's dependent token. Create new triples with shared subject and predicate tokens, but with the object set to the dependent token of the conjunct dependency (Figure 2).
(7) For each sentence, look for abbreviation (abbrev) dependencies and create new triples with a "has_label" predicate.
(8) Create a separate ontological form of each extracted triple by nominalising the predicate. This results in two statements, the first linking the subject to the predicate (via a "participates_in" relationship) and the second linking the predicate to the object (via a "has_participant" relationship); the predicate becomes an instance of the class "biological process". This form of the triples allows a more ontological view, by nominalising the predicate verb as a biological process. It is not used in the graph visualisation or triple evaluation, but can be used for ontology mapping.

Figure 1. Triplification output for an example sentence including a prepositional modifier: "Mycophenolic acid inhibits the phosphorylation of NF-kappaB and JNKs and causes a decrease in IL-8 release in H2O2-treated human renal proximal tubular cells." Only the first preposition-derived triples are shown.

Figure 2. Triplification output for an example sentence containing a conjunct dependency: "Nitric oxide diminishes apoptosis and p53 gene expression after renal ischemia and reperfusion injury." Only the conjunction-derived triples are shown.

A toy sketch of rules (2)-(4) follows.
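The sketch below implements rules (2)-(4) over dependency tuples; the input format and the example fragment (drawn from the Figure 2 sentence) are our own simplification of the Stanford parser's typed-dependency output.

```python
# Toy implementation of rules (2)-(4). Dependencies are simplified to
# (relation, governor, dependent) tuples; real input would come from
# the Stanford parser's typed-dependency output.
def triples_from_dependencies(deps):
    subjects, objects = {}, {}
    for rel, gov, dep in deps:
        if rel in ("nsubj", "nsubjpass"):      # rule (2): the subjects
            subjects[gov] = dep
        elif rel == "dobj":                    # rule (3): the objects
            objects[gov] = dep
    # Rule (4): join subject and object on a shared governor token,
    # which becomes the predicate of the new triple
    return [(subjects[g], g, objects[g]) for g in subjects if g in objects]

deps = [("nsubj", "diminishes", "oxide"),      # simplified parse fragment
        ("dobj", "diminishes", "apoptosis")]
print(triples_from_dependencies(deps))
# [('oxide', 'diminishes', 'apoptosis')]
```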

3 RESULTS

3.1 Training data cross-validation

A 10-fold, stratified cross-validation of the training data (Table 1) produced a weighted average F1 of 90.7%; the F1 for the factual (good) class of titles was 77.4%. 409 sentences were labeled 'good' (21.1%) and 1,529 'bad' (78.9%).

Table 1. Classifier cross-validation output on training data

Class             Precision  Recall  F-Measure (F1)  N
Good              80.64      74.33   77.36           409
Bad               93.27      95.23   94.24           1,529
Weighted average  90.60      90.82   90.68



The training data were annotated by RS and GD independently; their annotations were found to disagree on 38 (2%) sentences, giving an inter-annotator agreement (using Cohen’s kappa coefficient) of 0.936.

3.2 KUP title classification

We classified each of the KUP title sentences using a model built with the full set of training data; this gave us 5,735 (6.3%) sentences classified as factual and the remaining 85,891 (93.7%) classified as non-factual. The proportion of 'good' titles varies considerably between the training (21.1%) and KUP title (6.3%) collections. In a preliminary manual analysis of the first 300 sentences classified as 'good', we found that 209 (70%) were true 'good' titles; this compares favourably with the 'good' classification accuracy on the training data of 74% (304 correct out of 409 sentences).

3.3 KUP title triplification

Using the list of dependencies for each sentence we apply the rules defined in section 2 to create a set of triples. This process created a set of 7,113 triples, containing 9,080 unique nodes, 6,989 edges of 1,255 unique edge types (triples are available in tab-delimited and RDF/XML format on myExperiment). These can be formed into a graph by connecting triples with shared subject and object entities. The largest connected component of this graph (see myExperiment for Cytoscape graph) contains 2,676 nodes and has 2,765 edges of 603 distinct types (see myExperiment for visualisation). The central region of this graph has several highly connected entities, the most highly connected being “rats”. Other highly connected entities are “kidney”, “renal function”, “angiotensin II” and “renal injury”. In a manual analysis of a sample of 150 triples extracted from the KUP titles, we found that 96 (64%) were correct. It should be emphasised that titles erroneously classified as “good” were found to commonly produce incorrect triples, thus compounding errors made before triplification. Furthermore the Stanford parser has not been trained on biomedical text, this can lead to parser errors and therefore dependency and triplification errors.
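The graph assembly step can be sketched as follows with networkx; the toy triples below are invented stand-ins for the ~7,000 extracted ones.

```python
# Sketch of graph assembly: triples sharing subject/object strings merge
# into one graph, from which the largest connected component is taken.
# The toy triples below are invented stand-ins for the extracted set.
import networkx as nx

triples = [("nitric oxide", "diminishes", "apoptosis"),
           ("angiotensin II", "raises", "blood pressure"),
           ("apoptosis", "involves", "p53 gene expression")]

g = nx.Graph()
for s, p, o in triples:
    g.add_edge(s, o, predicate=p)      # nodes merge on identical labels

largest = max(nx.connected_components(g), key=len)
print(sorted(largest))                 # entities of the biggest component
```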

4 CONCLUSIONS

We have described a twin approach to putting facts from the literature into RDF triples. Our main goal was to create a corpus of fact-orientated statements about a particular domain. We did this by training a classifier to recognise titles that form positive, direct statements about the outcome of an investigation. We then turned these into co-ordinated sets of triples using a dependency parser, expanding key verb relationships into new triples containing anonymous entities to which other entities can be linked. Our results so far are satisfactory in that we do create a focused corpus of titles of the right kind. It may be possible to optimise our features and our generation of the initial set of titles to improve performance; for example, we could include the names of disease, gene or protein entities found in the text. We have deliberately favoured precision over recall in an attempt to avoid too much "noise" in our resulting triples. This is obviously at the expense of recall, but it was a price worth paying to avoid a larger "tidying up" task. We are also deliberately doing "factomics", in which we retrieve and encode fact-like statements. Scientific papers are rich in the context needed to fully interpret such facts (Mons 2009); this approach does not attempt any kind of full extraction of the scientific knowledge necessary for their interpretation. Instead, we have taken the approach of exposing the "headlines" of what has been said, providing links back to the original paper for when a scientist finds a "fact of interest". On inspection of a sample of triples, we found that 64% were correct. At each stage of our process, however, there will be unwanted titles and poor triplification, and noise will accumulate. It seems that improvements in the title classification process should pay the greatest dividends, by providing a tighter and more focused set of genuinely factual titles to the triplification process. To interlink our triples with the KUPKB, we intend to rewrite our text-based subject, predicate and object values using URIs derived from several sources. Using our existing named entity recognition results from Whatizit (see Methods), we will replace matching subject and object values with URIs from UniProt (in the case of proteins) and ChEBI (for chemical entities). We will also use the NCBO BioPortal Annotator (http://bioportal.bioontology.org/annotator) to replace further subject and object labels with URIs from the Mouse anatomy ontology (http://purl.bioontology.org/ontology/MA), and we will replace any matching predicate values found in the Molecular Interactions ontology (http://purl.bioontology.org/ontology/MI). We will use these predicate mappings to nominalise each verb by expanding a single RDF statement into two. Finally, each set of triples will be given an RDF context of the corresponding PubMed ID. All of these mapped ontologies are currently part of the KUPKB, easing the integration of literature-derived knowledge into the knowledge base.


Literature of various sorts forms a vital repository of a domain’s knowledge. This knowledge needs to be exposed in integrated, computationally accessible forms. As well as integrating within the literature, we need to integrate with knowledge from resources such as databases. RDF forms an attractive means for doing this, especially when combined with the common vocabularies that are being developed by the community. Text mining offers a tempting means to expose this literature-based knowledge, yet can suffer from the need to create corpora of focused collections of desirable kinds of statements. We have presented one technique for creating a focused corpus of one kind of statement and turning this into triples for a domain knowledge base.

ACKNOWLEDGEMENTS
This work was funded by the e-LICO project, EU/FP7/ICT-2007.4.4.

REFERENCES
Coulet,A., Shah,N.H., Garten,Y., Musen,M.A. and Altman,R.B. (2010a) Using text to build semantic networks for pharmacogenomics. J. Biomed. Informatics, 43(6):1009-1019.
Coulet,A., Shah,N.H., Hunter,L., Baral,C. and Altman,R.B. (2010b) Extraction of Genotype-Phenotype-Drug Relationships from Text: From Entity Recognition to Bioinformatics Application. Proceedings of the Pacific Symposium on Biocomputing 2010.
Croset,S., Grabmueller,C., Li,C., Kavaliauskas,S. and Rebholz-Schuhmann,D. (2010) The CALBC RDF Triple Store: Retrieval over Large Literature Content. Proceedings of SWAT4LS 2010.
Gaizauskas,R., Demetriou,G., Artymiuk,P.J. and Willett,P. (2003) Protein structures and information extraction from biological texts: the PASTA system. Bioinformatics, 19(1):135-143.
Humphreys,K., Demetriou,G. and Gaizauskas,R. (2000) Automatically Augmenting Terminological Lexicons from Untagged Text. Proceedings of the Workshop on Natural Language Processing for Biology, held at the Pacific Symposium on Biocomputing (PSB2000), Hawaii, USA, 505-516.
Jupp,S., Klein,J., Schanstra,J. and Stevens,R. (2010) Developing a Kidney and Urinary Pathway Knowledge Base. Proceedings of the Bio-ontologies SIG, ISMB 2010.
Klein,D. and Manning,C. (2003) Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics, Sapporo, Japan, 423-430.
de Marneffe,M.C., MacCartney,B. and Manning,C.D. (2006) Generating Typed Dependency Parses from Phrase Structure Parses. Proceedings of LREC 2006.
Mons,B. and Velterop,J. (2009) Nano-Publication in the e-science era. Proceedings of SWASD 2009.
Rebholz-Schuhmann,D., Arregui,M., Gaudan,S., Kirsch,H. and Jimeno,A. (2008) Text processing through Web services: calling Whatizit. Bioinformatics, 24(2):296-298.

Using Multiple Ontologies to Annotate and Integrate Phenotype Records from Multiple Sources

Mary Shimoyama*, Rajni Nigam, Melinda Dwinell
Medical College of Wisconsin, Milwaukee, Wisconsin
* To whom correspondence should be addressed.

ABSTRACT
Motivation: The completion of finished and draft sequences for model organisms such as rat has been followed by multiple SNP and knockout projects, as well as the complete genome sequencing of a variety of strains exhibiting a vast array of phenotypes. While there have been several larger-scale phenotyping projects, in general the data have not been integrated, and the majority of phenotype measurement data remains scattered, with only a small proportion available in the published literature. Because laboratories use various strains, methods and experimental protocols, phenotype data have been difficult to integrate. Described here is a multiple-ontology approach to standardizing and integrating data from multiple laboratories using various protocols.

1 INTRODUCTION

The potential value of integrating phenotype data from multiple sources (different laboratories, varying techniques to measure similar phenotypes, multiple strains) is enormous. The power to identify novel genes associated with human disease is greatly increased by including phenome data, since environment, experimental conditions and background genome can have a significant impact. The inclusion of environmental and experimental context increases the success of generating phenome-genome relationships for understanding the role of genes in disease.1 However, most phenotype data is gathered or generated without thought to integrating the results with those of other studies, even within the same laboratory, creating a barrier to integrating and comparing results reported in publications. Experimental conditions, strain (genetic background), age and time course (multiple measurements made across time or under different experimental conditions) all contribute to the difficulty of comparing phenotype data from multiple sources. For example, the comparison of blood pressure measured in different laboratories or programs can be affected by the way in which blood pressure is measured (e.g. direct measurement via a catheter in an artery, telemetry, or a blood pressure cuff), the conditions under which the animals have been housed (e.g. low salt/high salt diet, chemicals in water), surgical manipulations (e.g. removal of a kidney), gender and age. Two approaches are currently used to house rat or mouse phenotype data. One approach is to ensure that all data for a phenotype measurement are measured using a standard operating procedure with baseline conditions.2,3 The procedures, together with information on the assay method used, sample information and other details, are contained within Standard Operating Procedure documents and are not part of the phenotype data records. Users have access to all the data in the database for cross-strain comparisons for these limited sets of assays and experimental conditions. A second approach is to accept data from multiple projects but to keep them as separate datasets or projects and to give the user access to a single project's data at a time.4,5 This allows access to data from multiple projects, including those with varying experimental conditions and assays, but it does not allow the data to be truly integrated, because of the lack of standardized formats and labeling, nor does it allow the user to compare data from multiple projects easily.

Fig 1. Components for standardizing and integrating phenotype data.

Presented here are four ontologies that provide the backbone of a standardized format for integrating phenotype measurement data from multiple sources. This system is designed to accommodate mammalian model organisms such as rat and mouse.

2 METHODS

Multiple ontologies were developed to address standardization of the four major elements of phenotype measurement records: 1) who was measured, 2) what was measured, 3) how it was measured, and 4) under what conditions it was measured (Fig 1). All of the ontologies are available through the National Center for Biomedical Ontology BioPortal (http://bioportal.bioontology.org/) and the Rat Genome Database ftp site (http://rgd.mcw.edu/pub/ontology/). An illustrative example of a record annotated with all four ontologies follows.
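The sketch below shows how one phenotype measurement record might carry annotations from all four ontologies; the identifiers are placeholders rather than real ontology IDs, and the field layout is our own illustration (the strain name is the one shown in Fig 6).

```python
# One phenotype measurement record annotated with the four ontologies.
# All IDs are placeholders and the layout is illustrative only.
record = {
    "strain":      {"ontology": "RS",  "id": "RS:0000001",     # who was measured
                    "label": "BN/NHsdMcwi"},
    "measurement": {"ontology": "CMO", "id": "CMO:0000001",    # what was measured
                    "label": "mean arterial blood pressure"},
    "method":      {"ontology": "MMO", "id": "MMO:0000001",    # how it was measured
                    "label": "telemetry"},
    "conditions": [{"ontology": "XCO", "id": "XCO:0000001",    # under what conditions
                    "label": "high salt diet"}],
    "value": 121.0,
    "units": "mmHg",
}
```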

2.1 Rat Strain Ontology

The Rat Strain Ontology (RS) (Fig 2) was created to standardize nomenclature and to organize strains according to type of strain (inbred, outbred, mutant, congenic, and consomic) and according to breeding history. It presents a hierarchy for parental strains, substrains and those with portions of the parental genetic background, allowing users to retrieve and compare annotations and phenotype records for groups of related strains; it also provides them with the ability to distinguish between substrains which may exhibit subtle genetic differences.

Fig 2 Rat Strain Ontology

2.2 Clinical Measurement Ontology

Fig 3 Clinical Measurement Ontology

The Clinical Measurement Ontology (CMO) (Fig 3) provides the standardized vocabulary necessary to indicate the type of measurement made to assess a particular trait. Each term in CMO describes a distinct type of measurement used to assess one or more traits. The terms are arranged in a hierarchy of classes, organized at the higher levels according to the body system in which the measurement is made. The ontology is designed to address phenotype measurements commonly made in both clinical and research settings, for humans and model organisms alike. This affords the opportunity to integrate data from the medical record with data gathered as part of clinical trials or research, and it also facilitates comparisons across organisms.


2.3 Measurement Method Ontology

An important component of a phenotype measurement record is identification of the method used to make the measurement since results can vary based on method. The Measurement Method Ontology (MMO) was created to standardize method classification using the mechanism of the method as the underlying principle for organization (Fig 4). It is organized around two major branches “ex vivo method” and “in vivo method”. This ontology was developed in parallel with the Clinical Measurement Ontology as phenotype measurement areas were addressed. Methods were identified from publications, experimental protocols, laboratory manuals and vendors’ catalogues.

Fig 4 Measurement Method Ontology

2.4 Experimental Condition Ontology

Changing experimental conditions to identify the effect on particular clinical measurements is a common part of research design, so it is important to capture this information in a standardized way that allows data to be compared across studies. The Experimental Condition Ontology (XCO) (Fig 5) was created to provide structure and standardization for the variety of experimental conditions typically imposed in model organism and clinical research projects. These include factors such as diet, oxygen and carbon dioxide levels, drugs and chemicals, activity and body position, as well as surgical interventions. In cases such as chemicals and drugs, the organization and terminology used were borrowed from existing sources such as ChEBI,6 with the inclusion of appropriate identifiers for reference and integration.

Fig 5 Experimental Condition Ontology

3 SUMMARY

The four ontologies created here are currently used in two projects involving rat and human data. PhenoMiner is a project to integrate phenotype measurement data for the laboratory rat from multiple sources, and it is housed at the Rat Genome Database (http://rgd.mcw.edu/phenotypes/). Over 14,000 records have been mapped to the four ontologies and integrated into a single database from two large-scale phenotype sources, PhysGen (http://pga.mcw.edu/) and the National BioResource Project for the Rat in Japan (http://www.anim.med.kyoto-u.ac.jp/nbr/), and from the published literature. The innovative query and data display tools leverage the power of the ontologies so that researchers can create and filter queries and manage data returns and displays easily. Figure 6 illustrates the strength of the ontology-driven data integration: although the data represented are present in the PhysGen resource, users there can only access data one protocol at a time and would have to download the data and devise their own system for examining sex differences across experimental conditions, whereas a single PhenoMiner query allows users to identify these differences at a glance.

Fig 6 Sex differences for BN/NHsdMcwi across multiple conditions

The Clinical Measurement Ontology, Measurement Method Ontology and Experimental Condition Ontology are also being used by the Cardiovascular Ontologies and Vocabularies in Epidemiological Research (COVER) project, which integrates demographic and phenotype measurement data from three large-scale family blood pressure studies. Using the ontologies to map data elements from each of the studies to a common format, records for 8,778 subjects spanning over 100 phenotype measurement types have been integrated to date (http://cover.wustl.edu/Cover/). The ontologies have proven successful in standardizing phenotype measurement data regardless of technology platform. Creating structures to integrate phenotype measurement data from multiple sources is an important task, as investigators draw on the strength of genomic and sequence variation resources to identify underlying genotype factors related to phenotypes and diseases. In order to make these connections, researchers need to easily access and analyze phenotype measurement data related to individuals and various model strains, along with information on the experimental conditions and methodologies that may affect the measurement values. Employing multiple ontologies to standardize data formats facilitates the integration of these vital datasets and provides the structure on which innovative data mining, analysis and presentation tools can be built. These types of resources can provide researchers with a more accurate picture of phenotype variation among populations, as well as of the impact that measurement methods may have on measurement results. The influence of experimental and environmental conditions on phenotypes and disease will also be easier to elucidate when researchers have access to large numbers of measurements from a wide variety of studies. This is an important step in helping investigators link genotypes to phenotypes.

ACKNOWLEDGEMENTS
The authors would like to acknowledge the efforts of the Rat Genome Database curators and bioinformatics staff.

REFERENCES
Butte AJ, Kohane IS. (2006) Creation and implication of a phenome-genome network. Nat Biotechnol., 24:55-62.
Mashimo T, Voigt B, Kuramoto T, Serikawa T. (2005) Rat phenome project: the untapped potential of existing rat strains. J Appl Physiol, 98(1):371-9.
Mallon AM, Blake A, Hancock JM. (2008) EuroPhenome and EMPReSS: online mouse phenotyping resource. Nucleic Acids Res., 36(Database issue):D715-8.

4

Kwitek AE, Jacob HJ, Baker JE, Dwinell MR, Forster HV, et al. (2006) BN phenome: detailed characterization of the cardiovascular, renal, and pulmonary systems of the sequenced rat. Physiol Genomics. 25(2):303-13. Bogue MA, Grubb SC, Maddatu TP, Bult CJ. (2007) Mouse Phenome Database (MPD). Nucleic Acids Res. 35(Database issue):D643-9. Degtyarenko K, deMatos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcantara R, Darsow M, Guedj M, Ashburner M, (2008) ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res. 36(Database issue):D344-D350.

Linking genes to diseases with a SNPedia-Gene Wiki mashup Benjamin M. Good1+, Salvatore Loguercio2+, Andrew I. Su1* 1 2

Genomics Institute of the Novartis Research Foundation, 10675 John Jay Hopkins Drive, San Diego, CA, 92121. Technische Universität Dresden, Biotechnology Center, Tatzberg 47/49, 01307 Dresden

ABSTRACT A variety of topic-focused wikis are used in the biomedical sciences to enable the mass-collaborative synthesis and distribution of diverse bodies of knowledge. To address complex problems such as defining the relationships between genes and disease, it is important to bring the knowledge from many different domains together. Here we show how advances in wiki technology can be used to automatically assemble ʻmeta-wikisʼ that present integrated views over the data collaboratively created in multiple source wikis. In particular, we introduce a meta-wiki formed from the Gene Wiki and SNPedia whose purpose is to identify connections between genes and diseases. (Supplementary data is available at http://goo.gl/3VYhj).

1

INTRODUCTION

One of the key goals of current biomedical research is the elucidation of the relationships that hold between genes, environment and disease. Tackling this complex challenge requires the coordination of information emerging from a variety of scientific and medical communities. Increasingly, wiki technology is being used to enable such communities to collaboratively synthesize and distribute their knowledge. These ‘bio-wikis’ are emerging in many different areas (Callaway, 2010; Waldrop, 2008). We have wikis about genes (Hoffmann, 2008; Huss, et al., 2010; Huss, et al., 2008), proteins (Mons, et al., 2008), protein structures (Stehr, et al., 2010; Weekes, et al., 2010), SNPs (http://www.snpedia.com), pathways (Florez, et al., 2009; Pico, et al., 2008), specific organisms (Florez et al., 2009) and many other biological entities. These bio-wikis have become important concept-centric knowledge resources but no single wiki contains all of the knowledge needed to answer most biological questions. The task of integrating the knowledge across different wikis remains the job of the end user. Recently, three key factors have emerged that make it possible to dynamically produce ‘meta-wikis’ that provide end users with consolidated views of information spanning multiple underlying wikis. The first factor is the widespread adoption of the MediaWiki software for implementations within the bio-wiki communiTo whom correspondence should be addressed. +These authors contributed equally to this work. *

ty. MediaWiki installations now provide a powerful web API (Application Programming Interface) for direct, highlevel access to the data contained in their databases. The API uses RESTful calls (Fielding, 2000) to permit automated processes to make queries and post changes. Since many of the bio-wikis now have, by default, the same API, this implies that the same software can be used to query and edit the content of many different wikis without alteration. The second factor is the increasing adoption of standardized systems for describing and recognizing biological concepts across multiple sites. Such systems provide identifiers for genes (e.g. NCBI Gene ids) and other biological concepts (e.g. Gene Ontology terms). These shared names can be used to identify when two different wikis contain knowledge pertinent to the same things and hence provide the key starting point for integration. The third factor is the Semantic extension to the Media Wiki system (Krotzsch, et al., 2007). By installing this extension, wiki administrators make it possible to add semantic links between articles in the wiki - for example, GeneX hasSNP snpY. These semantic links can be used in queries to the system that are much like queries to a database. The combination of a consistent API across many different wikis, a growing collection of unifying ontologies and the Semantic extension enables the rapid formation of wikimashups or ‘meta-wikis’. Such meta-wikis offer the potential to produce integrated views of the knowledge dispersed across many different sources. Here we show how an automatically generated meta-wiki composed of elements drawn from SNPedia and the Gene Wiki exposes substantially more evidence of links between genes and diseases than either resource contains independently.

2 2.1

METHODS SNPedia

SNPedia provides textual information about links between variations in human genes and human phenotypes (http://www.snpedia.com). It uses standard identifiers from trusted authorities - primarily dbSNP (Sayers, et al., 2011) to enable extensive linking to other public bioinfomatics databases and to personal genomics companies like

1

B. Good et al.

23andMe (http://www.23andme.com). It is not a comprehensive listing of SNPs, rather it focuses on SNPs that have some evidence of association with a human phenotype.

a. Using the SNPedia query API, identify all SNPs with a wikilink directed to or from an article in the SNPedia category ‘medical condition’.

2.2

b. Map SNPedia medical conditions to Disease Ontology terms using the NCBO Annotator.

Gene Wiki

The Gene Wiki is an attempt to generate a collaboratively written, continuously updated review article for every human gene (Huss, et al., 2008). It provides textual descriptions of gene function in normal conditions as well as descriptions of the role the gene may play in disease. Currently, it includes more than 10,300 Wikipedia articles about human genes.

2.3

(5) Add semantic links to the mashup between: − Gene Wiki genes and SNPs (using NCBI Gene Ids as the naming standard); − genes and diseases discovered in step 3; − SNPs and diseases discovered in step 4.

SNPedia + Gene Wiki

Bringing information from both the Gene Wiki and SNPedia together into one consistent framework allows us to better address the following important question. “Based on what we know now, what genes are linked to which diseases?” It is important to note that there is no official database established as yet for structuring and curating such information. The closest example is Online Mendelian Inheritance in Man (OMIM) but there is no way to answer this question given the tools that OMIM provides aside from searching, one by one, through thousands of textual entries. Other groups have attempted to build such resources through textmining e.g. GeneRIFs (Osborne, et al., 2009) but none has yet emerged as a standard reference. In the protocol illustrated in Figure 1, we describe how to automatically construct a semantic wiki instance suitable for exploring the relationship between genes and disease both by browsing and through structured queries. The resultant meta-wiki contains semantic relations linking genes to diseases, genes to SNPs, and SNPs to diseases. The steps to build this meta-wiki are as follows: (1) Install the MediaWiki software with the Semantic extension as the meta-wiki platform; (2) Utilize the MediaWiki API at the source wikis to pull the articles of the Gene Wiki from Wikipedia and SNP articles related to human genes from SNPedia. Insert them into the mashup using the WriteAPI at the meta-wiki; (3) Identify Disease Ontology terms in the text of Gene Wiki articles using the NCBO annotator (Jonquet, et al., 2009) ; (4) Identify SNP-Disease relationships in SNPedia:

2

Fig. 1. Meta-wiki assembly process. (1a, 1b) Article content is obtained from the source wikis using GET calls to their MediaWiki APIs, and written to the target wiki (2a, 2b) via POST calls to its MediaWiki API. In parallel, the Annotator is used to identify Disease Ontology terms in the text of the gene wiki articles and to map medical conditions in SNPedia to Disease Ontology terms. The content at the target wiki is then enhanced with the Disease Ontology associations generated using the Annotator (3a, 3b).

3

RESULTS

Overall, the SNPedia/Gene Wiki meta-wiki captures 4,426 distinct gene-disease relationships. As illustrated in Figure 2, SNPedia accounts for 1,037 (via gene-SNP-disease connections), the Gene Wiki provided 3,525 (via direct genedisease associations) and only 136 (3%) of the gene-disease pairs appear independently in both sources. The 136 gene-

Linking genes to diseases with a SNPedia-Gene Wiki mashup

disease pairs in the overlap contained 47 distinct diseases and 125 distinct genes linked to 271 SNPs. For example, the gene CYSLTR1 is linked to asthma in the text of the Gene Wiki: “The cysteinyl leukotrienes [...] are important mediators of human bronchial asthma” and in the text of the SNPedia article on SNP Rs320995 (which occurs in CYSLTR1): “subjects without T-allele in SNP rs320995 had 3.1 times higher risk of asthma”.

Fig. 2. Overlap of gene-disease associations derived from SNPedia and from the Gene Wiki.

As Figure 2 clearly illustrates, both the Gene Wiki and SNPedia contain substantial amounts of knowledge pertinent to the challenge of finding associations between genes and diseases. The low level of overlap between the genedisease associations found in these resources indicates the potential value of their combination.

3.1

RDF and Semantic Media Wiki

One of the key advantages of the Semantic Media Wiki framework is its ability to generate structured exports of the knowledge it contains that adhere to the Resource Description Framework (RDF) standard. This makes it possible to take advantage of the growing collection of tools built on this standard to conduct analysis of the data. For example, the gene-disease pairs in the overlap mentioned above can be identified with the following SPARQL query (SPARQL is the standard query language for RDF).

While the aggregation of data from multiple sources in a queryable, structured form is useful for computational scientists, few ‘end-user’ biologists can be expected to enter SPARQL queries or even queries in the Semantic Media Wiki syntax. For the majority of users, the value of a metawiki such as this is in the direct improvements to the individual articles that they will discover while browsing. Hence we made two specific additions to the visible areas of the meta-wiki articles. First, we added a ‘known variants’ table to all the gene articles. This table presents SNPs related to the gene described in the article and phenotypes related to those SNPs drawn from the data gathered from SNPedia. Figure 3 shows the known variants table for the ACCN1 gene. The table materializes a connection between the gene and Multiple Sclerosis (supported by (Bernardinelli, et al., 2007)) that was missing from the orig-

inal ACCN1 article. Fig. 3. Example of the ‘known variants’ tables added to the Gene Wiki articles from data collected from SNPedia. Here showing a SNP on the ACCN1 gene linked to Multiple Sclerosis.

In addition to the enhancements to the gene articles, we added a ‘related genes and SNPs’ table to the disease articles (brought in from Wikipedia as part of the Gene Wiki import). This table presents genes and SNPs that are linked to the disease either in the text of a Gene Wiki article or through genetic associations found in SNPedia. Figure 4 shows how the article on Bipolar Disorder has been expanded with a section detailing related genes as well as related

PREFIX wiki: SELECT ?gene ?disease ?do_term ?snp WHERE { ?gene wiki:Property-Is-associated-with ?disease . ?gene wiki:Property-HasSNP ?snp .

SNPs on these genes.

?snp wiki:Property-Is-associated-with ?disease . ?disease wiki:Property-Same-as ?do_term . FILTER regex(?do_term, "^DOID", "i") . }

3.2

Enhancements to the user experience

Fig. 4. Example of the ‘related genes and SNPs’ boxes added to the disease articles from data collected from both SNPedia and the Gene Wiki. Here showing some of the genes and SNPs linked to Bipolar Disorder.

3

B. Good et al.

4

DISCUSSION

The low amount of overlap between the gene-disease relationships found in the gene wiki and the gene-SNP-disease relationships from SNPedia is likely the result of differences in both the protocol used to mine them and the content itself. It is possible that we would obtain a higher amount of overlap if we used the same procedure to find SNP-disease associations as we did to find direct gene-disease associations and this would be a useful experiment to conduct in future work. However, based on our inspections of the data the more important driver of the low overlap appears to be the basic differences in the core content of SNPedia and in the Gene Wiki. There are many reasons why a particular gene might be associated with a disease in a gene wiki article that do not implicate a particular SNP. For example, genes may be involved in pathways known to be important to disease pathogenesis or to the body’s immune response while there may not be any known SNPs associating that gene with that disease. One of the weaknesses of the approach used to build this meta-wiki is that it represents a one-way sync. If editors make changes to the articles in the meta-wiki, there is currently no automated mechanism for migrating those changes back to the articles in the original wikis. While one option is to let these meta-wikis evolve independently of their parents, a better approach might be to establish mechanisms through which edits made to a meta-wiki article could flow back into the articles used to create them. Such a mechanism would effectively extend the reach of the source wikis - both in terms of exposing their contents and of acquiring more editors. There are tools emerging that will make this possible. For example, the Distributed Semantic Media Wiki system is an extension that enables the creation of a network of Semantic Media Wiki servers that share common semantic wiki pages (Skaf-Molli, et al., 2010). With such a system in place, we might imagine that meta-wikis like the one discussed here could serve not only as new integrated resources for consuming information but also new points for users to contribute information back to the community collection.

5

CONCLUSION

We have demonstrated how a high-level linking of genes and diseases can be accomplished through the meta-wiki approach, but we have not touched on the deeper, more difficult question of how these genes are linked to these diseases. To address this complex challenge, the work of thousands of specialists needs to be assembled into integrated wholes that can be understood and used to drive action. The topic-focused wikis emerging in different areas of biology represent one step of this process of collaborative knowledge synthesis. Looking forward, meta-wikis such as the one presented here offer the potential to go one step 4

further - to help unearth and present the latent relationships that exist between different concepts and different communities.

ACKNOWLEDGEMENTS Thanks to Mike Cariaso for suggesting how to extract SNPdisease relationships from the hyperlinks in SNPedia. This work was supported by NIGMS (GM083924).

REFERENCES Bernardinelli, L., et al. (2007) Association between the ACCN1 gene and multiple sclerosis in Central East Sardinia, PLoS One, 2, e480. Callaway, E. (2010) No rest for the bio-wikis, Nature, 468, 359-360. Fielding, R. (2000) Architectural Styles and the Design of Network-based Software Architectures. Doctoral dissertation, University of California, Irvine. Florez, L.A., et al. (2009) A community-curated consensual annotation that is continuously updated: the Bacillus subtilis centred wiki SubtiWiki, Database (Oxford), 2009, bap012. Hoffmann, R. (2008) A wiki for the life sciences where authorship matters, Nat Genet, 40, 1047-1051. Huss, J.W., 3rd, et al. (2010) The Gene Wiki: community intelligence applied to human gene annotation, Nucleic Acids Res, 38, D633-639. Huss, J.W., 3rd, et al. (2008) A gene wiki for community annotation of gene function, PLoS Biol, 6, e175. Krotzsch, M., et al. (2007) Semantic Wikipedia, Journal of Web Semantics, 5, 251-261. Mons, B., et al. (2008) Calling on a million minds for community annotation in WikiProteins, Genome Biol, 9, R89. Osborne, J.D., et al. (2009) Annotating the human genome with Disease Ontology, BMC Genomics, 10 Suppl 1, S6. Pico, A.R., et al. (2008) WikiPathways: pathway editing for the people, PLoS Biol, 6, e184. Sayers, E.W., et al. (2011) Database resources of the National Center for Biotechnology Information, Nucleic Acids Res, 39, D38-51. Skaf-Molli, H., Canals, G. and Molli, P. (2010) DSMW: Distributed Semantic MediaWiki., Proceedings of ESWC 2010, 2, 426-430. Stehr, H., et al. (2010) PDBWiki: added value through community annotation of the Protein Data Bank, Database (Oxford), 2010, baq009. Waldrop, M. (2008) Big data: Wikiomics, Nature, 455, 2225. Weekes, D., et al. (2010) TOPSAN: a collaborative annotation environment for structural genomics, BMC Bioinformatics, 11, 426.

 

The Vertebrate Bridging Ontology (VBO) Ravensara Travillian1, *, James Malone1, Chao Pang2, John Hancock3, Peter W.H. Holland4,   Paul Schofield5, and Helen Parkinson1   EMBL­EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK; 2 Genomics Coordination Center,  Groningen Bioinformatics Center, University of Groningen & Dept. of Genetics, University Medical Center Groningen, P.O.  Box 30001, 9700 RB Groningen, The Netherlands; 3 MRC Harwell, Harwell, Oxfordshire, OX11 0RD, UK; 4 Dept. of Zoology,  University of Oxford, South Parks Road, Oxford, OX1 3PS, UK; 5 Dept. of Physiology, Development and Neuroscience, Uni­ versity of Cambridge, Downing Street, Cambridge CB2 3EG, UK  1

ABSTRACT Abstract: The recent proliferation of ontologies for organizing and modeling anatomical, phenotypic, and genetic information is a welcome development, with a great deal of potential for transforming the way scientists access and use knowledge. Realization of this potential calls for effective ways of integrating and computing on various information sources. In this paper, we introduce the Vertebrate Bridging Ontology (VBO), which permits the transfer of information about homologous anatomical structures between species— a first step towards the integration of species-specific anatomical ontologies. We present the ontology, design patterns, and methodology, and discuss how it can be applied to use-cases to meet the information needs of the scientific user community. Available at: http://sourceforge.net/projects/vbo/files/

1

INTRODUCTION

The problem of integrating diverse single-species anatomy ontologies is well-documented (Travillian 2011). Comparison of conserved and divergent patterns of gene expression and mutant phenotypes between species has become a powerful approach for investigating gene function and its evolution, particularly as more and more data accumulates from a wide range of species. In order to facilitate a computational approach to cross-species comparisons it is necessary to formalize the description of anatomy in each species, but this then leaves us with the problem of crossing between evolutionarily homologous structures in separate species. Two existing approaches have been attempted: lexical matching and the generation of a “universal” vertebrate anatomy ontology. The former is, for reasons discussed in (Travillian 2011) and below, always going to be intrinsically flawed. The latter has met with some success with the development of the CARO upper-level anatomy ontology, and the Uberon multi-species metazoan anatomy ontology (Haendel 2007, Haendel 2009). However, neither take full account of the evidence-based inferred evolutionary rela*

tionships between anatomical structures in different taxa. In this paper, we introduce the Vertebrate Bridging Ontology (VBO), an evidence-based approach which permits the transfer of information about homologous anatomical structures across species—a first step towards the integration of species-specific anatomical ontologies.

2

DEVELOPMENT AND IMPLEMENTATION OF VBO

The VBO is developed in the Web Ontology Language (OWL) using Protégé 4, in order to provide a common representation compatible with that of the single-species ontologies it is intended to integrate. The OBO (Open Biomedical Ontologies) recommendation of unique namespaces and identifiers has been adhered to in its development. Use-cases collected at a VBO community workshop in June 2010 include key questions the evolutionary-biology and biomedical research communities might wish to address: (1) Compare expression of (a) a named gene or (b) gene family or (c) combination of genes between species in an anatomical framework. The queries from this use-case will take forms such as: Which anatomical structures are involved in the expression of this {gene | gene family | combination of genes}? Are these structures the same or different? And in which species do they occur? (2) Compare a particular anatomical structure between species. The queries from this use-case will take forms such as: For a particular structure, are the same genes or different genes are expressed in particular species for that structure? What are the differential expression patterns among homologous tissues in different species? (3) Compare gene expression similarity/difference in particular tissues between species to test a hypothesis of homology. The queries from this use-case will take forms such as: Is Tissue A in Species 1 likely to be homologous to Tissue B in Species 2?

To whom correspondence should be addressed.

1

RS Travillian et al.

Data for these use-cases comes from user annotations of model organisms within ongoing human disease mechanism studies, comparative gene expression studies for functional genomics and evolutionary biology, and phenotype/genotype association studies in adult and developing organisms.

Fig. 1. The MRCA approach (left) specifies homologies from the MRCA to its descendants, and homologies among the descendants are inferred. The homology chains approach (right) specifies homologies among the descendants, and requires one explicit connection to the MRCA for that characteristic in order to infer all the other homologies from the descendants to the MRCA.

2.1

The two approaches are similar in efficiency, but in principle we favored the MRCA approach as it is more similar to the way biologists reason over evolutionary relationships. In practice, we ended up using a hybrid approach, because the data often were available for one approach but not the other.

Approach

The VBO is based only on anatomical homology—that is, evolutionary relatedness of structures by uninterrupted descent from a common ancestor. The other types of structural similarity in classical comparative anatomy—analogy (similarity of function), and homoplasy (similarity of appearance independent of common descent)—are in VBO's scope. Homology is a relation between anatomical structures in different species. It is: • symmetric: A homologous-to B  B homologous-to A. For example, the fact that the mammalian ear ossicle incus is homologous-to the reptilian jaw bone quadrate necessarily implies that the reptilian quadrate is homologous-to the mammalian incus. • reflexive: A homologous-to A. Any structure (e.g., feline parathyroid gland) is necessarily evolutionarily related to itself through uninterrupted descent from a common ancestor. • transitive: A homologous-to B AND B homologous-to C  A homologous-to C. For example, the Asiatic black bear gall bladder is homologous-to the sun bear gall bladder, and the sun bear gall bladder is homologous-to the sloth bear gall bladder. Since connection by descent from the common ancestral structure is uninterrupted in each case, then the Asiatic black bear gall bladder is also homologous-to the sloth bear gall bladder (as well as all other mammalian ones) by transitivity. Thus the homologous nodes for a particular structure (nodes = n) form a maximally-connected graph (vertices on the order of 2n) for the relation homologous-to. The combinatorial complexity of the possible axioms linking anatomical entities of even a few species requires a programmatic approach to populating the classes and relationships within the VBO framework. There are two ways to leverage evolutionary anatomical relationships to programmatically populate VBO: a most recent common ancestor (MRCA, "topdown") approach and a homology chain ("bottom-up) approach" (Osumi-Sutherland 2010), illustrated in Fig. 1.

2

2.2

Entities

There are two types of entity in VBO: anatomical structures and taxa. An anatomical structure consists of the following data structure (Fig. 2), where the surrounding circles represent annotation properties that link the structure to the homologous structure in other ontologies and taxonomies:

Fig. 2. The data structure of a anatomical entity in the VBO (center), with annotation properties (surrounding).

The corresponding structure(s) in the Experimental Factor Ontology (EFO) (Malone 2010) is/are linked via the EFO ID, the corresponding structure(s) in the Foundational Model of Anatomy (FMA) (Rosse 2003) are linked via the FMA ID, the corresponding structure(s) in the Teleost Anatomy Ontology (TAO) (Dahdul 2010) are linked via the TAO ID, and so forth. The annotation property "Other" represents additional IDs that can be added as the VBO is aligned with additional species anatomy ontologies. For VBO 1.0, we selected the adult skeletal system for demonstration and proof-of-principle, as it is a relatively straightforward example to model: it tends to be bilaterally symmetrical and highly conserved, with relatively little sexual dimorphism. However, data for other systems became available during the course of the project, so VBO also contains structures outside the adult skeletal system. Taxon entities can be at any level of phylogenetic ranking, because anatomical structures can be characteristic of any level of ranking. For example, jaws are characteristic of the

The Vertebrate Bridging Ontology (VBO)

infraphylum Gnathostomata, while hair, sweat (eccrine) glands, and mammary glands are characteristics of the class Mammalia, and hypertrophied manus digits supporting wings are characteristic of the order Chiroptera. While the scope of the VBO is vertebrate structures, many structures that are characteristic of vertebrates actually originate further back in evolutionary history, so a rigorous modeling of the VBO requires the ability to model structures as differentia at the appropriate taxon ranking. The current VBO phylogeny is consistent with the NCBI taxonomy for vertebrates. A compound class represents a structure in a species, whose parent class is the anatomical structure with no species marker, and whose species is also represented as a class in VBO. Compound classes also have annotation properties representing the source of the assertion that:

Table 1. Representative identically-named vertebrate and invertebrate non-homologous structures in PubMed. Structure

Invertebrate taxa Refers to

acetabulum parasitic worms sucker (trematodes), (feeding) leeches

Vertebrate taxa Refers to tetrapods (4-limbed vertebrates)

femur trochanter

insects insects

leg segment tetrapods leg segment tetrapods

coxa

insects

leg segment tetrapods

tibia

insects

leg segment tetrapods

concave pelvic surface meeting femur at hip joint long bone: leg part of thigh bone hip (joint or anatomical region) long bone: leg

The open-world assumption means that any lexicalmatching tool used to populate VBO or any other homol∀Structure-in-Taxon  {∃(x ∈ Taxon) : x has-structure Structure} (1) (1) ogy-based ontology will create a high number of false positives based on lexical matches such as these, since—under that assumption—there could, in future, be insect structures The following relationships operate on compound entities. that are homologous to their vertebrate homonyms. This Relationships. These relationships in the VBO describe possibility, permitted under the open-world assumption, homology relationships among compound entities. actually violates a biological constraint on homology. To 1. Homologous-to. The relationship homologous-to deprevent those false positives, to provide metaknowledge for scribes a 1:1 and onto structural similarity based on evolufuture data mining tools, to mitigate human error in creating tionary relationship between a structure in one species and a axioms containing NOT and a vast number of disjoints in structure in a second species. Protégé, and to make reasoning more tractable, we have While not definitively ruling out a genetic event that ocexplicitly encoded the not-homologous-to relationship, curred after the species' separation from the MRCA, a 1:1 along with any necessary invertebrate species, in the VBO and onto mapping tends to be indicative of evolutionary in order to definitively rule out that possibility. Although it conservation. When the mapping by term name/structure is is not an ideal solution, it is a workable compromise, given not itself 1:1 and onto with a homologous structure (which the state of the art and the scope of the problem. We do not can indicate an evolutionary event), there may be a 1:1 and represent a phylogeny of invertebrates, nor do we make any onto mapping from a structure in one species to some part statements about the relationships among not-homologousof the homologous structure in the second species. to relationships, as those are clearly out of scope, so not2. Not-homologous-to. The need to explicitly encode a homologous-to forms a simply-connected graph, and not a negative relation in VBO is a consequence of the combinamaximally-connected one. tion of open-world reasoning and the history of comparative Entities and relationships as described above provide the anatomy. The not-homologous-to relationship can be onecontent of VBO. VBO was initially populated by a combito-many. nation of manual and automated approaches. Annotations The naming of structures in one species, based on analogy from the Gene Expression Atlas (Kapushesky 2010, Parkin("wing" in insect, pterosaur, bird, and bat) or homoplasy son 2009), ERA-PRO (Birschwilks 2011), Europhenome (panda's "thumb") to a non-related structure in a different (Morgan 2010, Mallon 2008), and Phenoscape (Dahdul species, muddies the waters tremendously for determining 2010) databases provided anatomical structures and species homology based on lexical matching. Haendel et al (acfor the ontology. Additionally, Uberon and FMA provided cessed 10 April 2011) have remarked upon the case of the structures for VBO. These structures and species were frontal bone in the zebrafish being homologous to the premanually added to the OWL file in Protégé. For VBO 1.0, frontal bone, and not the frontal bone, in humans. 
The probinclusion of a taxon or structure class in one of the above lem is magnified tremendously by the use of important verdatabases or ontologies was considered sufficient evidence tebrate skeletal terms to refer to segments in insects, and of existence to include it in the ontology. The use of these that is in turn magnified by the importance of those insects, sources also uncovered some major discrepancies between such as Drosophila, in the comparative medical research how major ontologies, such as FMA and Uberon, represent community. Table 1 presents an illustration of the problem anatomical classes versus the way the terms corresponding for some representative skeletal structures.

3

RS Travillian et al.

to those classes are used in real-world contexts (Travillian 2011). Those considerations influenced how we developed composition of compound entities, for example, and will continue to inform future versions of VBO. Some preliminary data-mining of PubMed abstracts was carried out to populate VBO. Python scripts which searched PubMed iteratively through a list of structures from FMA and Uberon were used to collect abstracts of articles that contained musculoskeletal terms with references to nonhuman vertebrate species. Reference to a structure in a species in an abstract was considered evidence of a compound entity (Equation [1]), and the compound entity was evaluated for homology to that structure in humans or another species. This evaluation was carried out on the basis of available evidence—reference material, journal articles, and so forth. The provenance of the evidence was recorded as well. This direct connection to evidence for homology statements is a unique strength of VBO. When sufficient evidence established the homology between the compound entities, the triple relationship was recorded as a "pairwise mapping" in a spreadsheet. A set of Java tools was developed to transform the spreadsheet's pairwise mappings into classes and relationships in Protégé, and to create the relationships among the nodes of the maximally-connected graph. These generated relationships are marked evidentially as inferred from homology. A beta version of VBO has been successfully integrated into the EFO to support cross-species comparisons of orthologous genes in homologous tissues through the Gene Expression Atlas interface.

FUTURE WORK We plan to continue integrating VBO into the Gene Expresson Atlas via EFO, and improving the functionality and the interface. We will add more sophisticated analysis of evidence that can work with the Phenoscape taxonomy of evidence model for easier integration and sharing of data. More complex systems which present more complicated modeling challenges, and incorporating developmental structures as well as adult structures are also areas into which we plan to extend VBO.

ACKNOWLEDGEMENTS We thank the members of the VBO Scientific Advisory Board, Jonathan Bard, Claudio Stern, Martin Ringwald, and Monte Westerfield, who guided and supported this project. In addition, we thank Hilmar Lapp, and the participants in our community workshops, who provided valuable feedback and suggestions.

4

Funding: Biotechnology and Biological Sciences Research Council (grant #BB/G022755/1), and European Molecular Biological Laboratory core funding.

REFERENCES Birschwilks M, Gruenberger M, Adelmann C, Tapio S, Gerber G, Schofield PN, Grosche B. The European radiobiological archives: online access to data from radiobiological experiments. Radiat Res. 2011 Apr;175(4):526-31. Dahdul WM, Lundberg JG, Midford PE, Balhoff JP, Lapp H, Vision TJ, Haendel MA, Westerfield M, Mabee PM. The teleost anatomy ontology: anatomical representation for the genomics age. Syst Biol. 2010 Jul;59(4):369-83. Haendel M, Gkoutos G, Lewis S, Mungall C. Uberon: towards a comprehensive multi-species anatomy ontology. Available from (2009), Nature Precedings. Haendel MA, Neuhaus F, Osumi-Sutherland D, et al. CARO - The Common Anatomy Reference Ontology. In: Anatomy Ontologies for Bioinformatics, Principles and Practice Albert Burger, Duncan Davidson and Richard Baldock (Eds.), 2007. Kapushesky M, Emam I, Holloway E, Kurnosov P, Zorin A, Malone J, Rustici G, Williams E, Parkinson H, Brazma A. Gene Expression Atlas at the European bioinformatics institute. Nucleic Acids Res. 2010 Jan;38(Database issue):D690-8. Mallon AM, Blake A, Hancock JM. EuroPhenome and EMPReSS: online mouse phenotyping resource. Nucleic Acids Res. 2008 Jan;36(Database issue):D715-8. Malone J, Holloway E, Adamusiak T, Kapushesky M, Zheng J, Kolesnikov N, Zhukova A, Brazma A, Parkinson H. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics. 2010 Apr 15;26(8):1112-8. Morgan H, Beck T, Blake A, Gates H, Adams N, Debouzy G, Leblanc S, Lengger C, Maier H, Melvin D, Meziane H, Richardson D, Wells S, White J, Wood J; EUMODIC Consortium, de Angelis MH, Brown SD, Hancock JM, Mallon AM. EuroPhenome: a repository for high-throughput mouse phenotyping data. Nucleic Acids Res. 2010 Jan;38(Database issue):D577-85. Osumi-Sutherland D. personal communication, 2010. Parkinson H, Kapushesky M, Kolesnikov N, Rustici G, Shojatalab M, Abeygunawardena N, Berube H, Dylag M, Emam I, Farne A, Holloway E, Lukk M, Malone J, Mani R, Pilicheva E, Rayner TF, Rezwan F, Sharma A, Williams E, Bradley XZ, Adamusiak T, Brandizi M, Burdett T, Coulson R, Krestyaninova M, Kurnosov P, Maguire E, Neogi SG, Rocca-Serra P, Sansone SA, Sklyar N, Zhao M, Sarkans U, Brazma A. ArrayExpress update--from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res. 2009 Jan;37(Database issue):D868-72. Rosse C, Mejino JL. A reference ontology for biomedical informatics: the Foundational Model of Anatomy. J Biomed Inform. 2003 Dec;36(6):478-500. Travillian RS, Adamusiak T, Burdett T, Gruenberger M, Hancock J, Mallon AM, Malone J, Schofield P, Parkinson H. Anatomy Ontologies and Potential Users: Bridging the Gap. Journal of Biomedical Semantics, forthcoming.

DOMEO: a web-based tool for semantic annotation of online documents Paolo Ciccarese*, Marco Ocana**, and Tim Clark*†‡ *Harvard Medical School and Massachusetts General Hospital, Boston MA; **Balboa Systems, Newton MA †University of Manchester, School of Computer Science, Manchester UK.

ABSTRACT

DOMEO (Document Metadata Exchange Organizer), is an extensible web application enabling users to visually and efficiently create and share ontology-based annotation metadata on HTML or XML document targets, using the Annotation Ontology (AO) RDF model. The tool supports manual, fully automated, and semi-automated annotation with complete provenance records, as well as personal or community annotation with access authorization and control. DOMEO is the user-facing component of the SWAN Annotation Framework. DOMEO creates AO RDF, linking text strings within the document to term URIs in scientific – particularly biomedical – ontologies, as stand-off annotation, supporting full positional metadata on web documents without requiring update control of the target. AO RDF is orthogonal to any domain ontology by design, and therefore widely applicable to ontology driven annotation and curation tasks across many biomedical and scientific domains. AO metadata is an example of so-called “stand-off metadata”, being managed separately from the annotation target.

1

INTRODUCTION

Last year, for Bio Ontologies 2010, we presented Annotation Ontology (AO), an OWL ontology providing a model for creating ‘stand-off’ annotation anchored to online resources such as documents, images and databases and their fragments [1-3]. AO provides a robust set of methods for linking online resources, for example text in scientific publications, to ontological elements, with full representation of the annotation provenance. Through AO, existing domain ontologies and vocabularies – in OWL[4] or SKOS[5] - can be utilized, out of the box, for creating extremely rich stores of metadata on web resources. In the bio-medical field, subjects for ontological structuring include biological processes, molecular functions, anatomical and cellular structures, tissue and cell types, chemical compounds, and biological entities such as genes and proteins. However, it is important to keep in mind that AO is not limited to the bio-medical domain and can be easily used in other scientific and non-scientific contexts. In fact, AO is already used by other projects focusing on biodiversity [5] and social tagging [6, 7].

‡ To whom correspondence should be addressed.

AO, by linking new scientific content to computationally defined terms and entity descriptors, can help establish semantic interoperability across the diverse masses of specialist science embodied in digital media -- from journals, to wikis and blogs [8, 9], to the growing world of web-based research “collaboratories”[10]. Annotation – either marking up contributions with comments, or more importantly, with relevant concepts and entities from biomedical ontologies – provides a technological boost to “strategic reading” for members of such communities [11, 12] and can selectively breach established specialist focus boundaries and semantic barriers [13]. In biomedicine, semantic interoperability facilitates crossspecies comparisons, pathway analysis, disease modeling, and the generation of new hypotheses through data integration and machine reasoning. While AO provides the model for encoding and sharing annotation in the convenient RDF (Resource Description Framework) format, it is still necessary to develop software applications allowing the users, in our case bio-medical scientists, to manually or semi-automatically create, share/publish, search and utilize annotation, and to manage algorithmically created annotation. As we strongly believe developing actual software is required to test the exchange model format against real use cases, we developed AO in parallel with the SWAN Annotation Framework, a web application suite with a rich set of features including (i) semantically annotating - manually or semi-automatically - online HTML and XML documents; (ii) sharing the annotation in RDF; (iii) searching the annotation while leveraging semantic inference. DOMEO is the user interface component of the SWAN Annotation Framework. We present DOMEO in the following sections.

2

THE DOMEO ANNOTATION TOOL

The DOMEO annotation tool is a web component developed using the Google Web Toolkit and JavaScript. It allows users to create - manually or semi-automatically - unstructured, semi-structured and semantic annotation that can be kept private, shared within selected groups, or made public and therefore available to the entire web. The tool is currently in alpha release with approximately 50 alpha users. It was developed upon an initial set of require-

1

Ciccarese et al.

ments accumulated in developing curation-intensive biomed- sections of it by simply selecting the desired portion of text, ical knowledge bases and scientific online communities. attaching a topic or, in other words, an instance of one of the Requirements were approximately equally distributed across: several available annotation types. § The ALZSWAN knowledge base – a customization of the Semantic Web Applications in Neuromedicine (SWAN) platform for Alzheimer Disease – developed in collaboration with the Alzheimer Research Forum (http://www.alzforum.org). It organizes 184 hypotheses and 2,214 specific scientific claims, with relevant evidence, referring to 266 gene-protein groups and 2,567 bibliographic resources (http://hypothesis.alzforum.org). § StemBook [13] (http://www.stembook.org) – a web portal for the Stem Cell community collecting several original review articles. Some of the articles are currently annotated with Gene Ontology (GO) terms. § The Science Commons Antibodies Resource – an OWL model for formally representing antibodies as referred to in the scientific literature. Developed in collaboration with Science Commons and the Alzheimer Research Forum, this knowledge base required intensive curation of existing relational databases as well as of the documents provided by the antibody vendors [14] § PDOnline [15] (http://pdonlineresearch.org) – a web portal and forum for the Parkinson Disease researcher community, collecting several relevant resources including extensive online discussions by scientists. The resources can be annotated according to PDGuide controlled vocabulary where terms are organized as taxonomy. § Pain Research Forum (http://painresearchforum.org) - a web portal and forum for the Pain Research Community.

Figure 1: DOMEO is a user interface component that allows loading and annotating any HTML document.

The simplest annotation item that can be created is the semantic tag or in AO terms the ‘Qualifier’. The tool allows attaching ontology or vocabulary terms –this can be any term identified by a URI - to a document or document fragment. The process is enabled by a user interface that performs the search operation by connecting to an external web service. The currently deployed alpha version of the tool connects to the NCBO (National Center for Biomedical Ontology) BioPortal REST web service for ontology-driven entity identification [16, 17]. The text-hit results are presented in a linear list. Alternative ontology search and exploration tools with expanded features and improved algorithms are under develAfter the first alpha release of DOMEO, in October 2010, we opment by several of our collaborators, along with web serinitiated an intense social process across several categories of vice interfaces to DOMEO for a variety of text mining algopotential users, to assure a constant flow of use-cases as well rithms. as continuous and valuable community feedback. To guaran- It is important to note that the tool allows users potentially to tee the desired level of coverage and flexibility of the appli- connect any search service and therefore to customize the list cation, we connected with a variety of collaborating partners of available vocabularies. By simply changing the set of voincluding pharmaceutical companies, a major scientific pub- cabularies used for performing the annotation, it is possible lisher (Elsevier), a philanthropy (the Spinal Muscular Atro- to tackle domains other than biomedicine. Once the annotaphy Foundation) and several academic groups with different tion – in this case a qualifier – is created, the annotated span capacities and goals. Many inputs also came from the W3C of text of the document is visually detectable. It is also possiHealth Care and Life Sciences (HCLS) Interest Group ble to click on the span of text to inspect the annotation items (http://www.w3.org/blog/hcls) and, in particular, from the associated to it through a popup (Figure 2). HCLS Scientific Discourse sub-task. Valuable use cases have also been provided by several academic groups specializing in text-mining.

3

MANUALLY CREATING ANNOTATION

DOMEO was designed to blend in with the scientists’ everyday workflow. The DOMEO user loads a specified URL into the application and then will see DOMEO-specific menus in a bar just above the normally-displayed document (Figure 1). The user can then manually annotate the whole document, or

2

Figure 2: Users click on the annotated text to inspect the associated annotation items and semantic entities. DOMEO is extensible. Besides the qualifier, it can allow several other types of annotation through development of

Ciccarese et al.

additional software components. Already developed plugins include features for modeling scientific discourse according to the model provided by the SWAN ontology [18] and features for modeling antibody usage. The latter consists of annotating text with one of the antibody entries of antibodyregistry.org and, optionally, with the methods and species involved in the particular study reported in the document content. New annotation types can be added to the tool by developing additional plug-ins to define user interface components, semantic aspects of the new annotation topics, and connectors to external services when needed.

4

SEMI-AUTOMATIC ANNOTATION

In many cases, the efficiency of mass-scale manual annotation can be significantly augmented by annotation algorithms. DOMEO allows implementing the RECS (Run, Encode, Curate, Share) process. Using this process, it is possible to select and run external text mining services, encode the results in the AO format, display the results in the context of the annotated document (Figure 3) to enable the curation process. Curation is a crucial aspect of scientific publication and therefore an important aspect for both our annotation ontology and our annotation tool. We enable curation for annotation generated by both humans and text mining services. In the case of automatic generated annotation, the tool allows curators to judge each annotation item (or set of annotation items) according to a configurable set of judgment categories. By default the set of categories is: wrong, right, too broad, unclear – and where unclear means the curator is unable to judge the result. Every time a curator judges and responds to a result, s/he can also provide motivation that can be used later on for further evaluation.

5

PROVENANCE, ACCESS CONTROL AND RDF SHARING

In working with online scientific communities, we are particularly aware of the importance of provenance tracking for establishing trust and properly documenting evolution of the science. AO offers a rich set of properties for modeling provenance based on the Provenance Authoring and Versioning (PAV) ontology originally developed for the SWAN project [18]. Our annotation tool tracks all the provenance aspects transparently while the user performs the annotation process. For every piece of annotation and annotation curation, the tool records the originating user, date, and the specific version of any software or web service involved. The annotation and curation items, together with all the provenance data, can be then serialized in RDF format according to the AO model. Serialization includes RDF representing aspects of the domain ontologies used in any annotation, as well as the AO RDF itself.

Figure 4: Annotation Sets access control DOMEO also implements another feature of AO: the Annotation Set, a mechanism for grouping annotation items. The notion of an Annotation Set was included in AO to assist in annotation organization. Sets can be used, for instance, to collect items of the same type – i.e. proteins or genes –, to show/hide multiple items, and to define the corresponding access policy. Using the annotation tool it is also possible to define, for each set, which users will be able to access the annotation items (Figure 4): only the creator (personal annoFigure 3: The text-mining results are displayed on the tation), selected groups, or everybody (public annotation). document and the curation popup lets the user review and respond to automatically generated annotation items 6 CONCLUSIONS As several users may produce annotation on the same document, several users or curators may therefore curate the same results. The annotation tool enables both concurrent and collaborative annotation and curation processes.

Fifty alpha testers currently use DOMEO. The number is planned to double shortly when the beta release candidate becomes available. With the beta release, many additional features now in development will be brought into production. One important feature for the beta will allow integration with the Apache UIMA framework so that textminers using that

3

Ciccarese et al.

architecture will be able to display and curate the results their text mining with our tool. With this tool, and the collaborations currently in place, we expect to be able to publish large quantities of high quality annotations on scientific documents in RDF AO format. The published annotation will include the content of the AlzSWAN knowledge base (http://hypothesis.alzforum.org) with the discourse elements – claims, hypotheses, and questions – linked to the correspondent text in original papers. We also note that annotation produced with our tool can be displayed on the corresponding PDF documents in the Utopia application [19, 20] as Utopia can now consume AO RDF. We are currently working to with the Utopia group to enable the opposite workflow: producing annotation on a PDF of a scientific paper, and displaying it on the HTML version.

ACKNOWLEDGEMENTS Development of The SWAN Annotation Framework, including DOMEO, has been funded by a grant from the National Institute on Drug Abuse, National Institutes of Health, as part of the Neuroscience Information Framework; by a grant from EMD Serono, Inc. as part of the MS Discovery Forum project; by a grant from Elsevier; and by a grant from Eli Lilly and Company. We are most grateful for the financial support of these organizations. We thank Professor Maryann Martone and Anita Bandrowski of the University of California at San Diego; Anita deWaard, Bradley Allen and Antony Scerri of Elsevier B.V., Adam West and Ernest Dow of Eli Lilly and Company, and Carole Goble of the University of Manchester, for their continuing support and for many fruitful discussions and much joint work. We also thank Steve Pettifer of University of Manchester for the work toward integration of our tool with the Utopia PDF annotator.

REFERENCES 1. Ciccarese P, Ocana M, Das S, Clark T: AO: An open annotation ontology for science on the Web. In: Bio Ontologies 2010: July 9-13, 2010 2010; Boston MA, USA. 2. Ciccarese P, Ocana M, Garcia-Castro LJ, Das S, Clark T: An Open Annotation Ontology for Science on Web 3.0 BMC Bioinformatics in press. 3. The Annotation Ontology on Google Code [http://code.google.com/p/annotation-ontology/] 4. McGuinness D, van Harmelen F: OWL Web Ontology Language. W3C Recommendation 2004. 5. Miles A, Bechhofer S: SKOS Simple Knowledge Organization System Reference. W3C Recommendation 2009. 6. Tags4Labs [http://www.biotea.ws/node/3] 7. Garcia-Castro A, Labarga A, Garcia L, Giraldo O, Montaña C, Bateman JA: Semantic Web and Social Web heading towards Living Documents in the Life

4

Sciences. Web Semantics: Science, Services and Agents on the World Wide Web 2010, 8(2-3):155-162. 8. Waldrop M: Big data: Wikiomics. Nature 2008, 455(7209):22-25. 9. Waldrop MM: Science 2.0. Scientific American 2008, 298(5):68-73. 10. Bos N, Zimmerman A, Olson J, Yew J, Yerkie J, Dahl E, Olson G: From shared databases to communities of practice: A taxonomy of collaboratories. Journal of Computer-Mediated Communication 2007, 12(2):article 16. 11. Renear AH, Palmer CL: Strategic Reading, Ontologies, and the Future of Scientific Publishing. Science 2009, 325(5942):828 - 832. 12. Shotton D, Portwin K, Klyne G, Miles A: Adventures in semantic publishing: exemplar semantic enhancements of a research article. PLoS Comput Biol 2009, 5(4):e1000361. 13. Das S, Goetz M, Girard L, Clark T: Scientific Publications on Web 3.0. In: 13th International Conference on Electronic Publishing (ELPUB 2009): 1012 June 2009; Milan, Italy. 2009. 14. Science Commons Semantic Resources Project: Antibody Resource [http://neurocommons.org/page/Semantic_resources_proj ect/Antibodies] 15. Das S, Rogan M, Kawadler H, Corlosquet S, Brin S, Clark T: PD Online: a case study in scientific collaboration on the Web. In: Workshop on the Future of the Web for Collaborative Science, 19th International World Wide Web Conference: April 26-30, 2010 2010; Raleigh, NC, USA. 16. Jonquet C, Musen MA, Shah N: A system for ontologybased annotation of biomedical data. In: International Workshop on Data Integration in the Life Sciences 2008, DILS'08: 2008; Evry, France. 17. Jonquet C, Musen MA, Shah NH: Help will be provided for this task: Ontology-Based Annotator Web Service. In. Stanford, CA: Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine; 2008: 16. 18. Ciccarese P, Wu E, Wong G, Ocana M, Kinoshita J, Ruttenberg A, Clark T: The SWAN biomedical discourse ontology. J Biomed Inform 2008, 41(5):739751. 19. Attwood TK, Kell DB, McDermott P, Marsh J, Pettifer SR, Thorne D: Calling International Rescue: knowledge lost in literature and data landslide! Biochemical Journal 2009, 424(3):317-333. 20. Attwood TK, Kell DB, McDermott P, Marsh J, Pettifer SR, Thorne D: Utopia documents: linking scholarly literature with research data. Bioinformatics 2010, 26(18):i568-i574.

The State of Bio-Medical Ontologies
Matthew Horridge, Bijan Parsia and Ulrike Sattler
School of Computer Science, The University of Manchester

ABSTRACT This paper presents a logic-based analysis of the bio-medical ontologies that are contained in the NCBO BioPortal repository. In total, 218 OBO and OWL ontologies were analyzed using entailment checking and justificatory-structure-based analysis. It was found that approximately half of all BioPortal ontologies fit into the tractable OWL2EL profile of OWL, with the other half being built in a variety of expressive fragments that range from ALC through to the full expressivity of SROIQ, which underpins OWL 2. Moreover, BioPortal contains a large number of logically rich ontologies that have large numbers of non-trivial entailments and non-trivial reasons for these entailments.

1 INTRODUCTION

In recent years the number of publicly available ontologies in the bio-medical arena has grown significantly. Many of these ontologies have been made available via the NCBO BioPortal ontology repository [1]. At the time of writing, BioPortal provides access to the imports closures of over 250 bio-medical ontologies in various formats, and is of interest to various consumers of ontologies, from ontologists to tool builders. In particular, the BioPortal corpus provides ontologies that: (1) vary greatly in size; (2) vary greatly in expressivity; (3) are real world ontologies, as opposed to reasoner test-bed ontologies; (4) were designed and built by users (domain experts) for application purposes. This paper presents a logic-based analysis of these ontologies that provides an insight into the logical richness of real world, state-of-the-art ontology construction. In contrast to many of the other published analyses, which are purely syntactic, a logic-based analysis, in particular the use of justifications, makes it possible to "go under the hood" and detect the rich interplay of axioms in ontologies that cannot be determined by simple expressivity metrics alone. The presented results should be of interest to many bio-ontology consumers, from ontology developers through to ontology tool implementers.

2 PRELIMINARIES

The work presented in this paper focuses on understanding the BioPortal corpus from a logic-based perspective,* in particular on (non-trivial) entailments, or inferences, and the reasons for these entailments. For this reason, the subset of ontologies written in OWL and OBO formats was used in the experiments that follow. This section provides a brief overview of OWL, OBO and some of the terminology used later in the paper.

OWL The latest version of the Web Ontology Language, OWL 2, became a W3C recommendation in October 2009. An OWL 2 ontology may be regarded as a set of axioms, which make statements about the domain of interest. OWL 2 is a highly expressive ontology language, and features a rich set of axiom and class constructors. In particular, it allows complex class expressions to be built, so, for example, it is possible to describe the class of cells that have at least one nucleus. These complex class expressions can then be used in axioms, which state the relationships between them.

OBO In the bio-ontology arena, there is another widely used language, called OBO. This language has good tool support in the form of OBO-Edit, and has a popular flat file format which is easy to read and edit in a regular text editor. Despite the fact that OBO is often described as a simple ontology language that is easy for biologists and domain experts to understand, it is in fact a highly expressive language. Indeed, there is a close relationship between OBO and OWL 2, and it is possible to faithfully translate the logical aspects of an OBO ontology into an OWL 2 ontology. Several mappings between OBO and OWL have been proposed, but the mapping that was used for the experiments described in this paper is the one described and documented by Mungall et al.1

Entailments and Reasoning One of the key features of OWL (and essentially OBO) is that it is a logic-based ontology language, which means that OWL (OBO) ontologies are amenable to automated reasoning. This means that it is possible to use "off the shelf" reasoners to perform various reasoning tasks such as consistency checking, checking for unsatisfiable classes, computing whether one class is a subclass of another class, and whether an individual is an instance of a class. These tasks can be seen as entailment checking tasks. In OWL, an entailment may be regarded as a statement, or more correctly an axiom, that follows from an ontology or a subset of an ontology, which is itself a set of axioms. The process of reasoning is used to compute whether or not an entailment holds in an ontology. Entailments may be asserted directly into an ontology or may be inferred from other axioms in the ontology. For example, the AminoAcid ontology does not directly state that Methionine is a subclass of LargeAliphaticAminoAcid, but, due to other axioms in the ontology, it entails this axiom.

Justifications In the OWL world, justifications are a popular form of explanation for entailments in ontologies. For any given entailment in an ontology, there will be one or more justifications that explain why the entailment holds. To be more precise, a justification is a minimal subset of an ontology that is sufficient for an entailment to hold. A justification is minimal in the sense that the entailment in question does not hold in any proper subset of the justification. An example of a real justification for an entailment in a BioPortal ontology is presented later in this paper, but as a simple abstract example consider the ontology O = {A SubClassOf B, B SubClassOf D, B SubClassOf C}, which entails A SubClassOf C. A justification for this entailment is the minimal set of axioms {A SubClassOf B, B SubClassOf C}.

Non-Trivial Entailments In the presentation of results that follows, the notion of a non-trivial entailment is used. In the context of this paper, a non-trivial entailment is an entailment that has at least one justification that is not itself. In other words, either a non-trivial entailment is not directly asserted into an ontology, or there is a further reason (justification) as to why the entailment holds besides it being directly asserted.

* Address correspondence to [email protected]
1 ftp://ftp.geneontology.org/pub/go/www/obo-syntax.html
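To make this concrete, one justification can be computed with the OWL API by simple contraction: start from all logical axioms and discard every axiom that is not needed for the entailment to hold. The following is a minimal sketch under stated assumptions (HermiT as the reasoner, OWL API on the classpath); it is not the optimised algorithm used for the analysis in this paper.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.Set;
import org.semanticweb.HermiT.ReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLAxiom;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

public class JustificationSketch {
    // Computes one justification by "contraction": try to remove each axiom
    // and keep it out only if the entailment still holds without it. The
    // surviving set is minimal, i.e. a justification as defined above.
    public static Set<OWLAxiom> oneJustification(OWLOntology ontology,
                                                 OWLAxiom entailment)
            throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        Set<OWLAxiom> candidate = new HashSet<OWLAxiom>(ontology.getLogicalAxioms());
        for (OWLAxiom axiom : new ArrayList<OWLAxiom>(candidate)) {
            candidate.remove(axiom);
            OWLOntology test = manager.createOntology(candidate);
            OWLReasoner reasoner = new ReasonerFactory().createReasoner(test);
            boolean stillEntailed = reasoner.isEntailed(entailment);
            reasoner.dispose();
            manager.removeOntology(test);
            if (!stillEntailed) {
                candidate.add(axiom); // the axiom is needed: put it back
            }
        }
        return candidate;
    }
}

This naive loop creates one reasoner per axiom, so it only illustrates the definition; practical justification finders first shrink the axiom set with a fast expansion phase before contracting, and enumerate all justifications via a hitting-set style search.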

3 MATERIALS AND METHOD

Apparatus All experiments were performed on a 3.06 GHz Intel Core 2 Duo MacBook Pro, with a maximum of 4 GB allocated to the Java virtual machine. Three reasoners were used: JFaCT, which is a Java version of the FaCT++ reasoner, HermiT, and Pellet.

Corpus The BioPortal ontology repository was accessed on 12th March 2011 using the BioPortal RESTful Service API. In total, 261 ontology documents (and their imports closures) were listed as being available. Of these, there were 125 OWL ontology documents and 101 OBO ontology documents, giving a total of 226 "OWL compatible" ontology documents that could theoretically be parsed into OWL ontologies. Each listed OWL compatible ontology document was downloaded and parsed by the OWL API. Any imports statements were dealt with recursively by downloading the document at the imports statement URL and parsing it into the imports closure of the original BioPortal "root" ontology. If an imported ontology could not be accessed (for whatever reason) the import was silently ignored. Out of the 226 OWL compatible ontology documents that were listed by the BioPortal API, 7 could not be downloaded due to HTTP 500 errors, and one ontology could not be parsed due to syntax errors. This left a total of 218 OWL and OBO ontology documents that could be downloaded and parsed into OWL ontologies. After parsing, four of the ontologies were found to violate the OWL 2 global restrictions. In all cases, the violation was caused by the use of transitive (non-simple) properties in cardinality restrictions. These ontologies were discarded and were not processed any further in the entailment checking experiments.

Procedure Each ontology was checked for consistency. Next, each consistent ontology was classified and realized in order to measure reasoner performance and extract entailments for inspection. It should be noted that, for practical reasons, a timeout of 30 minutes of CPU time per ontology was imposed for the tasks of consistency checking, classification and realization. Entailed direct subsumptions between named classes (i.e. axioms of the form A SubClassOf B), both asserted and inferred, were extracted, along with direct class assertions between named individuals and named classes (i.e. axioms of the form b Type A). These kinds of entailments are of interest because they are the kinds of entailments that are exposed through the user interfaces of tools such as Protégé, and are therefore the kinds of entailments that users of these tools are interested in and typically seek justifications for. Next, the set of entailments for each ontology was filtered in order to split them into trivial and non-trivial entailments. Finally, justifications for each non-trivial entailment were computed.
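As a rough illustration of the extraction step, and not the authors' actual code, the entailed direct subsumptions between named classes described above can be collected with the OWL API along the following lines (the reasoner is assumed to have been created and classified beforehand):

import java.util.HashSet;
import java.util.Set;
import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.model.OWLDataFactory;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLSubClassOfAxiom;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

public class DirectSubsumptions {
    // Collects entailed direct subsumptions between named classes, i.e.
    // axioms of the form A SubClassOf B, as in the Procedure step.
    public static Set<OWLSubClassOfAxiom> extract(OWLOntology ontology,
                                                  OWLReasoner reasoner,
                                                  OWLDataFactory factory) {
        Set<OWLSubClassOfAxiom> result = new HashSet<OWLSubClassOfAxiom>();
        for (OWLClass sub : ontology.getClassesInSignature()) {
            // "true" asks the reasoner for direct superclasses only
            for (OWLClass sup : reasoner.getSuperClasses(sub, true).getFlattened()) {
                if (!sup.isOWLThing()) {
                    result.add(factory.getOWLSubClassOfAxiom(sub, sup));
                }
            }
        }
        return result;
    }
}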

4 RESULTS

Space limitations prevent the direct inclusion of results in this paper, but detailed tables, with classification times, number of non-trivial entailments, and number of justifications per entailment, are available at: http://owl.cs.manchester.ac.uk/bio-ontologies.

Size and Expressivity The average number of logical axioms per ontology was 20,532, with a standard deviation of 115,163 and a maximum of 1,484,923.

Table 1 Class Constructor Usage

                                    Occurrences Per Ontology
Class Constructor        # Onts     Mean    StDev     Max
ObjectSomeValuesFrom        133     7841    38731   351672
ObjectAllValuesFrom          46      367     1275     7757
ObjectMinCardinality         32       14       59      340
ObjectMaxCardinality         16        3        5       24
ObjectExactCardinality       32       17       45      257
ObjectHasValue                8        5        3        9
ObjectIntersectionOf         61     1628     7202    54038
ObjectUnionOf                69       60      243     2024
ObjectComplementOf           23       10       20      100
ObjectOneOf                  18        6       10       48


Table 2 Axiom Type Usage

                                          Occurrences Per Ontology
Axiom Type                    # Onts      Mean     StDev      Max
AsymmetricObjectProperty          4          2         1         2
ClassAssertion                   48      11470     43549    232642
DataPropertyAssertion            28      35906    165907    896647
DataPropertyDomain               51         16        23       118
DataPropertyRange                53         26        64       449
DifferentIndividuals             14          6        10        42
DisjointClasses                  84        576      2395     20238
DisjointObjectProperties          3          7         8        19
EquivalentClasses                71        509      1846     10757
EquivalentObjectProperties        3          3         2         5
FunctionalDataProperty           42         21        54       338
FunctionalObjectProperty         52         15        46       337
InvFunctionalObjectProperty      26         17        64       337
InverseObjectProperties          61         28        65       475
IrreflexiveObjectProperty         5          2         1         3
ObjectPropertyAssertion          24      13092     53724    268578
ObjectPropertyDomain             69         33        46       259
ObjectPropertyRange              66         35        47       268
ReflexiveObjectProperty           9          4         3         9
SameIndividual                    1          1         0         1
SubClassOf                      200      12529     56568    513246
SubDataPropertyOf                13         46       122       466
SubObjectPropertyOf              73         47       123       958
SubPropertyChainOf                6          3         1         4
SymmetricObjectProperty          25          4         4        18

Class constructor usage and axiom type usage are shown in Table 1 and Table 2, where "# Onts" is the number of ontologies that the constructor appears in, and "Occurrences Per Ontology" is the usage of the constructor in the ontologies that use it. A large proportion, 123, of the ontologies correspond to OWL2EL ontologies. The remainder range from the moderately expressive AL2 family of languages through to SROIQ, which represents the full expressivity of OWL 2.

Reasoner Performance There were three ontologies for which consistency checking could not be completed within 30 minutes of CPU time. These were: GALEN, the Foundational Model of Anatomy, and the NCBI Organismal Classification. A complete listing of times per ontology per reasoner may be found at the link at the start of this section.

Inconsistent Ontologies Out of the ontologies that could be processed by one or more of the three reasoners, 5 were found to be inconsistent.

Unsatisfiable Classes Out of the remaining consistent ontologies, 1 ontology contained 9 unsatisfiable classes in its signature.

Non-Trivial Entailments A total of 72 ontologies had non-trivial entailments (direct subclass axioms and direct class assertions). Table 3 shows a summary of total entailments and non-trivial entailments. Note that the percentage of non-trivial entailments (Column 4) is the mean/max percentage across the corpus rather than the percentage of Column 3 to Column 2.

Table 3 Average Number of Entailments Per Ontology

                       Total      Non-Trivial    Non-Trivial %
Mean                    5509          1549            30.2
StdDev                 16030          6187            26.6
Min                        7             1             0.03
Max                    89468         49537           100
25th Percentile          175.75         35.00           9.66
75th Percentile         1838.50        277.00          44.64
90th Percentile        10069.40       2392.90          71.82

Justification Metrics For ontologies with non-trivial entailments, Table 4 shows the average number and size of justifications per entailment per ontology.

Table 4 Average Number and Size of Justifications Per Ontology

                     Number     Size
Mean                   2.83     3.01
SD                     3.44     3.86
Min                    1        1
Max                  837       37
25th Percentile        1.52     1.14
75th Percentile        2.40     2.52
90th Percentile        5.00     6.61

2 AL is a Description Logic that is regarded as being the base language for more expressive description logics (including the one that underpins OWL 2). It supports atomic negation, concept intersection, universal quantification, and a limited form of existential quantification. ALC is obtained from AL by adding complex concept negation.
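The OWL2EL profile membership reported above can be checked mechanically; as a hedged illustration (not the exact analysis pipeline used here), the OWL API ships a profile checker, with the file path below being illustrative:

import java.io.File;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.profiles.OWL2ELProfile;
import org.semanticweb.owlapi.profiles.OWLProfileReport;

public class ProfileCheck {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        // Path is illustrative; any BioPortal ontology document would do.
        OWLOntology ontology = manager.loadOntologyFromOntologyDocument(
                new File("ontology.owl"));
        OWLProfileReport report = new OWL2ELProfile().checkOntology(ontology);
        // isInProfile() is true when the ontology fits the tractable EL profile
        System.out.println("In OWL 2 EL: " + report.isInProfile());
    }
}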

5 DISCUSSION

Expressivity The BioPortal corpus contains ontologies that vary greatly in expressivity and size. Interestingly, 133 ontologies, well over half of the 218 OWL and OBO ontologies in the repository, use OWL SomeValuesFrom restrictions, with an average of 7841 restrictions per ontology. Other class expression types are used by slightly fewer ontologies, but the usage of AllValuesFrom (46 onts), UnionOf (69 onts), IntersectionOf (61 onts) and ComplementOf (23 onts) is still notable. Roughly speaking, BioPortal ontologies can be split into two halves. One half contains OWL2EL ontologies and the other half highly expressive ontologies, some of which use the full expressivity of OWL 2. With regard to the OWL2EL ontologies, it is not clear whether it was a deliberate attempt, or design goal, to remain in a lightweight, tractable fragment of OWL, whether tooling was used to impose a limit on expressivity, or whether it was accidental that these ontologies fall into this profile. In any case, the sizeable number of ontologies that fall into this profile surely vindicates the design of the profile and its inclusion in the OWL 2 specification. The remainder of the ontologies fall into various expressive fragments of OWL, and use features that were introduced in OWL 2. For example, a handful of ontologies use asymmetric and (ir)reflexive properties, property chain axioms, and qualified cardinality restrictions.

Inconsistent Ontologies On closer inspection of the five inconsistent ontologies, it became apparent that not one of the ontologies was inconsistent for trivial reasons (such as a literal not being in the range of a property). Each ontology was natively OWL, and had multiple justifications for the inconsistency, with several axioms per justification. One particular example had around 366 justifications, each roughly 10 axioms in size. While it is odd that inconsistent ontologies were uploaded to the BioPortal, the fact that there are inconsistent ontologies, with non-trivial reasons, indicates the use of fairly expressive class and axiom constructors. It is unclear what (OWL) tool chain was used to construct these inconsistent ontologies, and what reasoning, if any, was used during the development of the ontologies.

Unsatisfiable Classes One BioPortal ontology contained 9 unsatisfiable classes in its signature, with each unsatisfiable class having several overlapping justifications for its unsatisfiability. It may seem strange that an ontology with unsatisfiable classes, which could be considered to be bugs, was uploaded to the repository. However, in this case the ontology is a native OBO ontology. It is therefore doubtful that full sound and complete OWL reasoning was used to detect these unsatisfiable classes.

Non-Trivial Entailments and Justifications One of the most surprising aspects of many of the BioPortal ontologies is that a large number of them (72/218) contain large numbers of non-trivial entailments. Recall that non-trivial entailments are entailments for which there is at least one justification that is not the entailment itself. On average, there were roughly 1500 non-trivial entailments per ontology, out of an average of 5500 entailments per ontology. One of the most striking examples is the Coriell Cell Line Ontology, which contains over 45,000 non-trivial entailments, and an average of 4 justifications per entailment, peaking at 65 justifications for one particular entailment.

In terms of number of justifications, the UBERON ontology has around 4000 entailments with an average of 25 justifications per entailment (SD=60) and one entailment with a staggering 804 justifications. Moreover, this ontology falls into the lightweight OWL2EL profile. It is therefore a case in point that low expressivity does not necessarily indicate a logically impoverished ontology and a low degree of interplay between axioms in an ontology. Another noteworthy example that is logically rich is the International Classification of Nursing Practice ontology (of SHIF expressivity), which has slightly over 2000 entailments, with an average of 8 justifications per entailment (SD=33.61) and one entailment that has 837 justifications. An example justification from this ontology is shown below in Figure 1. It should be noted that justifications of this ilk, in terms of type, style and length of axioms, are common for each entailment in the ontology, and indeed for many of the entailments in the other 72 ontologies. In terms of size, across all ontologies, most justifications were around 3 axioms in size (SD=3.86) and therefore not trivial single-axiom justifications. The largest justification over all ontologies was a massive 37 axioms in size. All of this is even more significant given that the entailments that the justifications are for are direct subclass axioms and class assertions, which implies that the justifications are not simply "long chains" of named subclass axioms.

6 SUMMARY AND CONCLUSIONS

This paper has presented an analysis of the OWL and OBO bio-medical ontologies that are contained in the NCBO BioPortal. Half of the ontologies use the tractable OWL2EL language, and the other half vary greatly in expressivity, up to the full expressivity of OWL 2. A logic-based analysis, which involved checking consistency, computing entailments and then computing justifications for these entailments, revealed that a significant proportion of the ontologies have many entailments with large numbers of sizeable justifications. In essence, the justificatory structure of non-trivial entailments indicates a panoply of logical richness that is present throughout many of the bio-medical ontologies in BioPortal.

REFERENCES [1] Daniel L. Rubin et al. (2008) A Web Portal to BioMedical Ontologies. AAAI Spring Symposium Series.

Figure 1 An Example Justification from the International Classification of Nursing Practice Ontology

Exploring Gene Ontology Annotations with OWL
Simon Jupp1*, Robert Stevens1 and Robert Hoehndorf2
1 School of Computer Science, University of Manchester, UK; 2 Department of Genetics, University of Cambridge, Cambridge, UK

ABSTRACT Motivation: Ontologies such as the Gene Ontology (GO) and their use in annotations make cross-species comparisons of genes possible, along with a wide range of other activities. Tools such as AmiGO allow exploration of genes based on their GO annotations. This human-driven exploration and querying of GO is obviously useful, but by taking advantage of the ontological representation we can use these annotations to create a rich polyhierarchy of gene products for enhanced querying. This also opens up possibilities for exploring GO annotations (GOA) for redundancies and defects in annotations. To do this we have created a set of OWL classes for mouse genes and their GOA. Each gene is represented as a class, with appropriate relationships to the GO aspects with which it has been annotated. We then use defined classes to query these gene product classes and to build a complex hierarchy. This standard use of OWL affords a rich interaction with GO annotations to give a fine partitioning of the gene products in the ontology.

1 INTRODUCTION

The creation of the Gene Ontology (GO) (Harris 2004) has had a major impact on the description and communication of the major functionalities of gene products for many species. GO has some 24,000 terms for annotating gene products and is used in around 40 species databases and in cross-species databases such as UniProt and InterPro (Camon 2004). It is widely used for querying such databases, making cross-species comparisons, or in data analyses, such as over-expression analysis in microarray data (Baehrecke 2004). The GO is mainly used as a controlled vocabulary to ensure genes are consistently annotated using standard terminology across many data resources; this alone offers many benefits for data integration and analysis. GO is, however, much more than a vocabulary; it also provides additional information about how these GO terms are related to each other. These relationships have a strict semantics in their representation that brings added value to the GO. For example, the hierarchical relationships allow all kinds of a particular term to be retrieved, as well as those with an annotation of the term itself. These and other relationships provide support for navigation, as well as making explicit the relationship between the entities being described. The AmiGO browser (Carbon 2009) (see also DynGO (Liu 2005), QuickGO (Binns 2009)) provides such an interface and exploits the hierarchical structure of the gene ontology to support query expansion. For example, when searching AmiGO for receptor activity genes, the results returned also include genes involved in GPCR activity, as GPCR activity is a subclass of receptor activity. This hierarchical structure is also useful for data mining tasks (Pavlidis 2004). For example, enrichment analysis is a common technique used in the analysis of high-throughput gene expression data; sets of interesting genes can be grouped or clustered based on common GO annotations (see http://www.geneontology.org/GO.tools.shtml for more GO tools). Whilst highly useful, many of these tools fail to exploit the full potential of the GO's representation for reasoning and querying over gene annotations. Most of the tools that were investigated do not facilitate rich querying that takes into account the semantics of the GO relationships. For example, it was difficult to ask for all gene products that are located in a membrane or part of a membrane, that are receptor genes involved in a metabolic process. To answer such a query correctly some form of reasoning over the ontology is required. The ability to perform such rich queries would enable more precise and flexible exploration of the GO annotations. The Web Ontology Language (OWL)1 and the Open Biomedical Ontology (OBO)2 format have a strict semantics that makes it possible to use automated reasoners to help build and use knowledge captured in an ontology. In order to explore the potential of reasoning over the GO annotations we need to describe the relationships between the genes and their annotations within a framework that can also exploit the semantics encoded in the GO. Our approach uses OWL, for which a mapping from OBO exists, to represent the GO annotations alongside the GO itself, in order to exploit the GO and its annotations for querying and exploration.

* To whom correspondence should be addressed.
1 http://www.w3.org/TR/owl-ref/
2 http://obofoundry.org/


As an ontology of attributes of gene products, GO itself does not explicitly contain gene products; GO annotations are attached to gene products in databases or flat files (see http://www.geneontology.org/GO.annotation.shtml). Using the compositional approach to ontology building we can create an ontology from these annotations that explicitly relates gene products to GO and then add defined classes to impose a hierarchy on the gene products. For example, we can create a defined class (in Manchester OWL syntax) such as:

Class: NuclearMembraneReceptorGeneProduct
  EquivalentTo: GeneProduct
    that has_molecular_function some ReceptorActivity
    and located_in some NuclearMembrane

This defined class will recognize any class of gene product that has both of these attributes, or children of these attributes, and subsume it within the hierarchy of gene products. In this standard use of OWL and automated reasoning, we can add more such defined classes to build an arbitrarily complex polyhierarchy for querying and navigation of entities annotated with the GO. Figure 1 shows such an inferred polyhierarchy centered on annotations for the GRM1[MGI:1351338] gene product.

2 METHOD

Step one: An initial set of GO annotations for mouse genes was downloaded from the Mouse Genome Informatics (MGI) site3. In order to reduce the size of the dataset to ease development we only selected annotations that had evidence codes of EXP, IDA, TAS, RCA or IC (see http://www.geneontology.org/GO.evidence.shtml for definitions). We also filtered these genes to exclude the RIKEN cDNA genes.

Step two: In order to express these annotations ontologically we created a primitive OWL class for each of the genes. We then describe each gene according to its annotations using existential OWL restrictions. From this a simple pattern emerges where each gene class is restricted by the corresponding GO term from the annotation:

Class: GeneProduct
  SubClassOf: participates_in some GO:biological_process
    and located_in some GO:cellular_component
    and has_molecular_function some GO:molecular_function
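As a hedged illustration of this step (the authors used OPPL plus a Java program, described below; this snippet is a simplified stand-in with invented IRIs), one annotation row can be turned into such an axiom with the OWL API:

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

public class GeneClassGenerator {
    static final String BASE = "http://example.org/mouse-goa#"; // hypothetical namespace

    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLDataFactory factory = manager.getOWLDataFactory();
        OWLOntology ontology = manager.createOntology(IRI.create(BASE));

        // One annotation: gene Grm1 located_in GO:0005634 (nucleus)
        OWLClass gene = factory.getOWLClass(IRI.create(BASE + "Grm1"));
        OWLClass goTerm = factory.getOWLClass(
                IRI.create("http://purl.obolibrary.org/obo/GO_0005634"));
        OWLObjectProperty locatedIn =
                factory.getOWLObjectProperty(IRI.create(BASE + "located_in"));

        // Grm1 SubClassOf located_in some GO:0005634
        manager.addAxiom(ontology, factory.getOWLSubClassOfAxiom(
                gene, factory.getOWLObjectSomeValuesFrom(locatedIn, goTerm)));
    }
}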

3 http://www.informatics.jax.org - accessed Nov 20, 2011.

Rather than generating the axioms by hand, we used the Ontology Pre-Processor Language (OPPL) to specify and instantiate the pattern (Iannone 2009). OPPL allows us to express patterns for each of the three branches of GO. A Java program is then used to parse the GO annotations file downloaded from MGI, instantiate the OPPL patterns and generate the OWL ontology.

Step three: We created a GO association ontology by importing this file of generated primitive classes together with the three aspects of GO (in their OWL form) into a master ontology file.

Step four: The generated GO association ontology was then manually edited using Protégé 4.1 (beta, build 220) to add defined classes. We initially created defined classes to represent subsets of the top level GO terms by defining OWL classes for genes found in a particular cellular compartment. For example, we created the class of mitochondrial gene products as follows:

Class: MitochondrialGeneProduct
  EquivalentTo: GeneProduct
    that located_in some (GO:'mitochondria'
      or (part_of some GO:'mitochondria'))

We repeated this basic pattern for the top level cellular compartments, and then continued for the biological process and molecular function classes. From these base level class descriptions we then began to create more complex class descriptions composed of classes previously created. For example, we created a class for the mitochondrial receptor gene products with the following class definition:

Class: MitochondrialReceptorGeneProduct
  EquivalentTo: GeneProduct
    and MitochondrialGeneProduct
    and has_molecular_function some GO:'receptor activity'

This pattern was repeated until we began to create classes that were composed of terms from all three branches of the gene ontology. For example, to find the mitochondrial gene products that are receptor gene products and participate in cell killing we generated the following OWL defined class:

Class: CellKillingMitochondrialReceptorGeneProduct
  EquivalentTo: GeneProduct
    and MitochondrialReceptorGeneProduct
    and participates_in some GO:'cell killing'

An arbitrary number of defined classes can be created in this pattern, each of which will subsume and be subsumed by other classes fitting the definition in the growing ontology. At the leaves of this polyhierarchy we have the primitive classes representing the gene products themselves.

3 RESULTS

We extracted all mouse genes from the MGI database and applied our filtering, producing a total of 29,559 gene-annotation pairs (see step one). The conversion to OWL classes gave 10,104 primitive gene product classes (see step two). After importing GO, the final ontology of primitive gene product classes and the GO contained 39,332 primitive classes (see step three). We created a further 120 defined classes describing various gene categories (see step four). As an exemplar, we concentrated on genes with receptor activity, located in some membrane and with processes involved in cell growth, metabolism and signal transduction. In order to classify the ontology we used several DL reasoners. Classification was performed on a 2.2 GHz i7 MacBook Pro requiring around 3 GB of memory. Table 1 shows the performance times for each reasoner.

Table 1. Reasoner classification times

Reasoner   Version   Average Timing (Seconds)
FaCT++     1.52      ~400
Pellet     2.1.2     ~300
HermiT     1.3.3     ~500

To illustrate the capabilities of the generated ontology we show a query to get the genes that are located in the nuclear membrane of the cell, that participate in some metabolic process and have the function of some receptor activity. Figure 1 shows a screenshot from Protégé of a defined class named MetabolicNuclearMembraneReceptorGeneProduct. This class is composed of the intersection of three other defined classes named NuclearMembraneGeneProduct, ReceptorActivityGeneProduct, and MetabolicProcessGeneProduct. These classes are defined in OWL as follows:

Class: ReceptorActivityGeneProduct
  EquivalentTo: GeneProduct
    that has_molecular_function some GO:'receptor activity'

Class: MetabolicProcessGeneProduct
  EquivalentTo: GeneProduct
    that participates_in some GO:'metabolic process'

Class: NuclearMembraneGeneProduct
  EquivalentTo: GeneProduct
    that (located_in some (GO:'nucleus' or (part_of some GO:'nucleus')))
    and (located_in some (GO:'membrane' or (part_of some GO:'membrane')))

Class: MetabolicNuclearMembraneReceptorGeneProduct
  EquivalentTo: MetabolicProcessGeneProduct
    and ReceptorActivityGeneProduct
    and NuclearMembraneGeneProduct

After reasoning over the ontology we inferred that only the Grm1 gene is a subclass of our MetabolicNuclearMembraneReceptorGeneProduct class. Although this is a relatively simple query, answering it requires some reasoning, which is made possible by this approach of using OWL. Our attempts to replicate such a query in the popular online tools for querying GOA using a simple conjunction of these terms yielded no results, showing a clear advantage of the OWL approach over existing tools.

Figure 1. The classification of the GRM1 gene according to generated defined classes for gene products
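As a hedged sketch (reasoner choice, file name and IRIs are illustrative, not the exact setup reported above), the same subsumption query can be posed programmatically with the OWL API:

import java.io.File;
import org.semanticweb.HermiT.ReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

public class GeneQuery {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        // Hypothetical local copy of the generated GO association ontology
        OWLOntology ontology = manager.loadOntologyFromOntologyDocument(
                new File("mouse_goa.owl"));
        OWLDataFactory factory = manager.getOWLDataFactory();
        OWLClass query = factory.getOWLClass(IRI.create(
                "http://example.org/mouse-goa#MetabolicNuclearMembraneReceptorGeneProduct"));
        OWLReasoner reasoner = new ReasonerFactory().createReasoner(ontology);
        // All (not just direct) subclasses of the defined query class;
        // with the ontology above this should yield the Grm1 gene class.
        for (OWLClass sub : reasoner.getSubClasses(query, false).getFlattened()) {
            if (!sub.isOWLNothing()) {
                System.out.println(sub.getIRI());
            }
        }
        reasoner.dispose();
    }
}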

4 DISCUSSION

Although the queries demonstrated here are relatively simple, they serve to illustrate the potential of a pure OWL approach to querying gene ontology annotations. Using similar patterns we can begin to imagine more complex class descriptions that utilise additional expressivity in OWL, such as the use of complement classes to query for genes that 'has_molecular_function some not (ReceptorActivity) and participates_in some SignalTransduction', which would find those genes that have a function other than receptor activity and are involved in signal transduction. (Note that the semantics mean that such genes can have a receptor activity, but must have an activity other than receptor activity. GO annotations are not closed, so we cannot say 'not (has_molecular_function some ReceptorActivity)' and expect to recognize any genes.) If all the GO molecular function classes were to be replicated as defined classes, we would replicate the GO molecular function ontology; the same would happen for each aspect of GO. As we combine the different aspects of GO in more complex defined classes, we will generate a more complex hierarchy of gene products. The announcement of the GO cross products extension to the GO4 will provide logical definitions for the GO classes. These definitions will enable richer OWL queries over the GO annotations and the potential to infer more annotations on existing GOA genes (Fernández-Breis 2010). The next stage of development in our work will be to incorporate more defined classes and different ontologies, such as the phenotype annotations for mouse genes and descriptions of the cells in which they are known to function. This will enable queries such as: which genes are known to participate in processes that are involved in a particular phenotype? Our current exploratory implementation performs well in practice, but the number of defined classes is currently small. Adding more expressive constructs to the ontology will afford further opportunities; adding disjointness axioms to GO may help us uncover mis-annotations, and we have yet to fully exploit property characteristics such as transitivity and functionality. We can also explore ways of flexibly incorporating annotations with differing degrees of confidence through use of the GO evidence codes, and of programmatically generating the defined classes that form the polyhierarchy of genes. Finally, we need to present the ontology via tools such as the OWLBrowser5.

In this work we have made a straightforward use of OWL and automated reasoning to deliver a flexible way to query all aspects of GO annotations. The polyhierarchy formed also provides similarly rich navigation in a gene product orientated setting. Finally, we provide a flexible framework for exploring and manipulating GO and other valuable annotations developed by the community.
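For instance, the complement query above could be constructed along the following lines (a sketch with assumed entity IRIs, not part of the published implementation):

import org.semanticweb.owlapi.model.*;

public class ComplementQuery {
    public static OWLClassExpression build(OWLDataFactory factory) {
        String base = "http://example.org/mouse-goa#"; // hypothetical namespace
        OWLObjectProperty hasMolecularFunction =
                factory.getOWLObjectProperty(IRI.create(base + "has_molecular_function"));
        OWLObjectProperty participatesIn =
                factory.getOWLObjectProperty(IRI.create(base + "participates_in"));
        OWLClass receptorActivity =
                factory.getOWLClass(IRI.create(base + "ReceptorActivity"));
        OWLClass signalTransduction =
                factory.getOWLClass(IRI.create(base + "SignalTransduction"));
        // has_molecular_function some (not ReceptorActivity)
        //   and participates_in some SignalTransduction
        return factory.getOWLObjectIntersectionOf(
                factory.getOWLObjectSomeValuesFrom(hasMolecularFunction,
                        factory.getOWLObjectComplementOf(receptorActivity)),
                factory.getOWLObjectSomeValuesFrom(participatesIn, signalTransduction));
    }
}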

AVAILABILITY
The ontologies and associated files are available to download from http://owl.cs.manchester.ac.uk/mouse_goa/index.html. We recommend Protégé 4.1 beta for viewing the generated ontology.

4 http://wiki.geneontology.org/index.php/Category:Cross_Products
5 http://code.google.com/p/ontology-browser/

ACKNOWLEDGEMENTS This work was funded by the e-LICO project (EU/FP7/ICT-2007.4.4).

REFERENCES
Harris MA et al. (2004). The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. Jan 1;32(Database issue):D258–D261.
Evelyn Camon and Rolf Apweiler et al. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucl. Acids Res. (2004) 32(suppl 1):D262–D266. doi:10.1093/nar/gkh021
Eric H Baehrecke, Niem Dang, Ketan Babaria and Ben Shneiderman. Visualization and analysis of microarray and gene ontology data with treemaps. BMC Bioinformatics 2004, 5:84. doi:10.1186/1471-2105-5-84
Seth Carbon, Amelia Ireland, Christopher J. Mungall, ShengQiang Shu, Brad Marshall, Suzanna Lewis, the AmiGO Hub, and the Web Presence Working Group. AmiGO: online access to ontology and annotation data. Bioinformatics. 2009 January 15; 25(2):288–289.
Liu H, Hu ZZ, Wu CH. DynGO: a tool for visualizing and mining of Gene Ontology and its associations. BMC Bioinformatics. 2005 Aug 9;6:201.
Binns D, Dimmer E, Huntley R, Barrell D, O'Donovan C, Apweiler R. QuickGO: a web-based tool for Gene Ontology searching. Bioinformatics. 2009 Nov 15;25(22):3045-6.
Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E. Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex. Neurochem Res. 2004 Jun;29(6):1213-22.
Luigi Iannone, Alan L. Rector, Robert Stevens: Embedding Knowledge Patterns into OWL. ESWC 2009: 218-232.
Jesualdo Tomás Fernández-Breis, Luigi Iannone, Ignazio Palmisano, Alan L. Rector, Robert Stevens. Enriching the Gene Ontology via the Dissection of Labels Using the Ontology Pre-processor Language. In Proceedings of EKAW 2010, pp. 59-73.

Records and situations. Integrating contextual aspects in clinical ontologies
Stefan Schulz1,2* and Daniel Karlsson3
1 Institute for Medical Informatics, Statistics and Documentation, Medical University, Graz, Austria; 2 Institute of Medical Biometry and Medical Informatics, University Medical Center, Freiburg, Germany; 3 Department of Biomedical Engineering, Medical Informatics, Linköping University, Sweden

ABSTRACT In order to achieve interoperability between the different flavors of information model / ontology combinations used to represent medical record entries, we propose a comprehensive framework based on expressive description logics. Focusing on the context of clinical findings, we demonstrate how the variability of clinical discourse can be logically represented. We emphasize the need for a clear categorial distinction between information entities and clinical objects, based on principles of Applied Ontology. An example OWL file can be downloaded from http://purl.org/steschu/BO2011.

1 INTRODUCTION

SNOMED CT1 [IHTSDO 2011] claims to cover the entirety of the electronic health record with roughly 300,000 concepts. Although named and promoted as a terminology, SNOMED CT's content development process is, inherently, also a process of ontology engineering, as its development is based on a logic-based framework, which enforces precise definitions (using Description Logics [Baader 2007]). The dependability of entailments computed from these definitions is crucial for any use case that requires more than just the provision of controlled reference terms. A considerable number of SNOMED CT concepts do not simply denote domain entities but rather represent complex clinical assertions [Schulz 2010]. Expressions like Family history unknown, Injury of head without lack of consciousness, or Planned cholecystectomy are not clinical terms but propositions about complex situations. Thus they facilitate single-code representations for commonplace utterances which place one or more domain terms in (i) a physical or social context (the clinical situation to which the utterance refers) as well as (ii) an epistemic context (referring to what is known about this situation) [Bodenreider 2004]. SNOMED CT, which has inherited many such expressions from one of its sources, CTV32, has reserved a dedicated branch for them, named Situation with explicit context.

Computer representations of health record content have motivated the development of information models for messages and documents in the frameworks of, e.g., HL7 Version 33 and openEHR4 archetypes, in order to express information about entities involved in the diagnostic and treatment process. Such information, by and large, extends simple instantiation of concepts from a terminology or ontology, usually including a spatio-temporal specification of the patient and the time of the assertion. Further, this information specifies its sources and includes statements about plans, hypotheses, beliefs, and certainties. For instance, Planned cholecystectomy5 denotes a plan [Schulz 2011a], but not an operation, which may or may not ensue. A diagnostic statement Pneumonia made by a general practitioner may be speculative and does not imply the existence of a real instance of pneumonia, as little as a patient's mention of pneumonia in childhood can be taken at face value. Nevertheless such information needs to be documented.

It has been postulated that a clear boundary exists between ontologies of information and ontologies of reality6. Whereas the latter represent the context-independent properties of the types of entities health professionals refer to, the former describe the composition of information entities as in the electronic patient record. In current information models and ontologies the distinction between the ontology of clinical entities and the ontology of observations of those clinical entities is blurred. Users of both types of systems tend to be unaware of the very nature of the things they represent. The resulting overlaps give rise to conflicting representations, which require sophisticated mitigation strategies7. A mixed representation of the invariant properties of entities as they are (ontology), the implicit setting to which they are related, and the way they are seen / known / recorded is prevalent in most biomedical terminology systems. Unless these issues are dealt with, the deployment of informatics applications like decision-support systems will be hampered [Rector 2001].

* To whom correspondence should be addressed.
1 Systematized Nomenclature of Medicine Clinical Terms.
2 Clinical Terms Version 3.
3 Health Level 7
4 http://openehr.org
5 The principle of term formation (especially the role of adjectival modifiers) is misleading: "Open cholecystectomy" refers to a surgical procedure, but "planned cholecystectomy" refers to a plan. "Early pregnancy" refers to a pregnancy; "prevented pregnancy", of course, doesn't.
6 http://openehr.org/releases/1.0.2/architecture/overview.pdf, figure 4
7 http://www.hl7.org/v3ballot/html/infrastructure/terminfo/terminfo.html


We are neither very optimistic that the postulated boundary between ontologies and information models will be accounted for in future representational artifacts, nor that a final consensus can be reached on where this line is to be drawn. It is realistic to expect that the very same complex information (e.g. a clinician's hypothesis of a stenosis of the left carotid artery) is represented to different proportions in clinical ontologies and clinical information models, therefore hampering semantic interoperability [Garde 2007]. We therefore propose a different strategy. Instead of defending a "canonic" division between ontologies and information models we recommend a common ontological framework which helps us to reach interoperability between different representational flavors8.

8 http://openehr.org/wiki/display/term/Information+Model++Terminology+Equivalence

2 METHODS

2.1 Representational language

We use the Web Ontology Language OWL-DL [OWL2 2009], based on description logics (SNOMED CT uses an inexpressive variant known as EL), in which classes are arranged in taxonomic hierarchies. This means that all members of a class Gallbladder (i.e. all individual gallbladders) are also members of the parent class Digestive organ, expressed by Gallbladder subclassOf DigestiveOrgan. The meaning of OWL classes can be further described by the properties all their members have in common. In the following example, we employ 'and', together with the existential quantifier ('some'): the expression InflammatoryDisease and hasLocation some Gallbladder extends to all instances that both instantiate InflammatoryDisease and are further related through the relation hasLocation to some instance of Gallbladder. This example actually gives us both the necessary and the sufficient conditions needed in order to fully define a class, e.g.: Cholecystitis equivalentTo InflammatoryDisease and hasLocation some Gallbladder. SNOMED CT is so far limited to simple constructors, as summarized in Table 1.

Table 1. SNOMED CT's logical constructors, corresponding to the description logics EL

Constructor    Meaning                              Example
and            Intersection between E and F         Acid and OrganicMolecule
some           Existential restriction of the       partOf some Liver
               relation r by G
subclassOf     B subsumes A                         Liver subclassOf Organ
equivalentTo   C and D are equivalent               OrganicAcid equivalentTo
                                                    Acid and OrganicMolecule

Note that it is not possible to express (i) value constraints (e.g. hasLaterality can only have the values Right and Left), and (ii) negations, such as Injury without infection. Such restrictions allow the definition of simple terms, but they impede any more complex terms or statements to be compositionally represented. Table 2 provides additional constructors required for representing more complex assertions.

Table 2. Additional description logics (DL) constructors

DL Constructor        Meaning                             Example
not                   Negation of A                       Base and not Acid
only                  Value restriction of the relation   Hand subclassOf hasLaterality
                      r by the filler G                   only (Left or Right)
or                    Union of A with B                   Left or Right
max/min/exactly INT   Cardinality restriction             Object and bearerOf exactly 1 Color
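As a hedged illustration of how such definitions look outside the tables (IRIs invented for the example), the Cholecystitis definition from above can be written with the OWL API:

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

public class CholecystitisDefinition {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLDataFactory factory = manager.getOWLDataFactory();
        String base = "http://example.org/clinical#"; // hypothetical namespace
        OWLOntology ontology = manager.createOntology(IRI.create(base));

        OWLClass cholecystitis = factory.getOWLClass(IRI.create(base + "Cholecystitis"));
        OWLClass inflammatoryDisease =
                factory.getOWLClass(IRI.create(base + "InflammatoryDisease"));
        OWLClass gallbladder = factory.getOWLClass(IRI.create(base + "Gallbladder"));
        OWLObjectProperty hasLocation =
                factory.getOWLObjectProperty(IRI.create(base + "hasLocation"));

        // Cholecystitis equivalentTo InflammatoryDisease
        //   and hasLocation some Gallbladder
        manager.addAxiom(ontology, factory.getOWLEquivalentClassesAxiom(
                cholecystitis,
                factory.getOWLObjectIntersectionOf(
                        inflammatoryDisease,
                        factory.getOWLObjectSomeValuesFrom(hasLocation, gallbladder))));
    }
}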

2.2 Ontological foundations

We subscribe to the tenets of realist ontologies [Klein 2010], which (though not uncontroversial) have gained ground in the fields of biology and medicine, and which we defend primarily for practical reasons. One guiding principle is the use of well-defined categorial divisions such as provided by upper-level ontologies. Another principle is to consistently interpret terms and codes as denoting classes of individual objects, grouped together according to the properties they have in common. Our upper-level distinction discriminates (among others) between the following categories:

1. Living organism: normally the subject of care, i.e. the patient, a human (or an animal in veterinary medicine).
2. Clinical condition: (mostly abnormal) processes, states, dispositions, qualities and material entities, which are reportable in the context of the medical record. They are mainly related to (parts of) the subject of care, but also to specimens, derived materials, and to other persons. The most generic relation we use is hasLocus, which encompasses parthood, location, and inherence.
3. Clinical situation: the sum of all processes that make up a treatment episode, as suggested by [Rector 2008].
4. Information artifact: an entity that is generically dependent on some artifact and stands in a relation of aboutness to some entity [Ruttenberg 2010]. Electronic patient records and their components are typical instances of information artifacts. We further single out record entries, the atomic parts of the electronic patient record, as not further divisible pieces of structured clinical discourse.

In the following we will demonstrate how typical representations of clinical statements, for which different information model / ontology combinations have been proposed, can be expressed by a common framework.

3 RESULTS

3.1 Representation of finding contexts

We will concentrate on a generic representation of an atomic clinical finding, as illustrated by the following template:

Attribute          Value
Finding Context    (undefined)
Disorder           *Disorder D*
Location           *BodyPart B*
Laterality         *Laterality L*

We propose the following DL formalization:

RecordEntryAboutDisorder_D equivalentTo
  RecordEntry and (isAbout only (Situation and
    (includes some (LivingHuman
      and (bearerOf SubjectOfRecordRole)
      and (locusOf some (*Disorder_D* and
        (hasLocus some (*BodyPart_B*
          and bearerOf some *Laterality_L*))))))))          (1)

The formalized pattern exposes several entities which are not explicit in the attribute-value schema, such as a clinical situation, a record entry, a human and the role he/she plays, as well as the relations between them. Note that this pattern states the existence of a record entry, but not of a situation it refers to. As argued above, this is an important aspect, as medical records may express beliefs or hypotheses which do not necessarily correspond to the reality of the patient. In order to refer to OWL classes for which the existence of members cannot be asserted we use a modeling pattern recently proposed by several authors [Hastings 2011, Schulz 2011b], using the universal quantifier "only", thus opposing the practice of the Information Artifact Ontology [Ruttenberg 2010]. The pattern can therefore be refined in terms of epistemic contexts such as "known present" or "known absent".

Let us instantiate this pattern with a record entry about a stenosis of the left carotid. In the first example postcoordination is done at the information model level in an attribute-value structure:

Attribute          Value
Finding Context    (undefined)
Disorder           Stenosis
Location           Carotid artery
Laterality         Left

RecordEntry and (isAbout only (Situation and
  (includes some (LivingHuman
    and (bearerOf SubjectOfRecordRole)
    and (locusOf some (Stenosis and
      (hasLocus some (CarotidArtery
        and bearerOf some LeftLaterality))))))))            (2)

Alternatively, the same scenario is described with precoordination at the ontology level:

Attribute          Value
Finding Context    (undefined)
Disorder           Stenosis of the left carotid artery

RecordEntry and (isAbout only (Situation and
  (includes some (LivingHuman
    and (bearerOf SubjectOfRecordRole)
    and (locusOf some StenosisOfLeftCarotidArtery)))))      (3)

Given the definition

StenosisOfLeftCarotidArtery equivalentTo
  Stenosis and (hasLocus some (CarotidArtery
    and bearerOf some LeftLaterality))                      (4)

a description logics reasoner can state the equivalence of expressions (2) and (3). If the reported disorder is known to be present, the template is refined as follows:

Attribute          Value
Finding Context    Known present
(…)                (…)

In OWL this is encoded in the following equivalence statement:

ConfirmedRecordEntryAboutDisorder equivalentTo
  RecordEntryAboutDisorder and isAbout some Situation       (5)

If the mentioned disorder is known to be absent, the template is modified as follows:

Attribute          Value
Finding Context    Negated
(…)                (…)

We propose the following OWL encoding for this:

RecordEntryAboutAbsenceOfDisorder_D equivalentTo
  RecordEntry and (isAbout some (Situation and
    (includes some (LivingHuman
      and (bearerOf SubjectOfRecordRole)
      and not (locusOf some (*Disorder_D* and
        (hasLocus some (*BodyPart_B*
          and bearerOf some *Laterality_L*))))))))

3.2 Representation of other contexts and typical clinical statements

We briefly sketch how to account for other contexts. The Subject Relationship Context, according to SNOMED CT, is asserted if the referred situation does not apply to the patient but to a family member. Here we substitute SubjectOfRecordRole by other roles, e.g. ParentRole. A Temporal Context can be specified, at the instance level, by additional references to timestamps. The default temporal context is the situation the record entry is about. For abstract DL representations, amenable to DL queries, we can introduce qualitative modifiers, such as substituting Situation by PastSituation. If we want to include a reference to a future Situation we must avoid the 'known present' context, as this is, by definition, disjoint from a future context.

More detail would be required for an analysis of Procedure contexts. In [Beale 2010] an extensive value list for procedure modifiers is given, including heterogeneous values such as {action status unknown, stopped before completion, rejected, under consideration}, to name just a few out of 46 items. Practically all of them do not modify procedures but procedure plans. Just as with findings, an ontologically precise representation would require clearly distinguishing between record entries and real procedures. Again, value restrictions are used to avoid false existential statements such as we find in the current version of SNOMED CT. The procedure context is epistemic in relation to the situation the record entry is about, but not to the record entry itself. The record entry, being an information entity, is the result of some observation or evaluation procedure. Still, there are several open questions, e.g. on how to represent partially completed or aborted procedures.

Other types of record entries, such as lab results or statements about signs and symptoms, may also be represented using this schema. Lab results are about some quality inherent in the patient, and they are the result of some observation procedure. However, the relation between the resulting value and the quality the result is about is mostly not straightforward, due to the inherent uncertainty of the observation procedures. Still, in most contexts, it is safe to infer an inherent quality from an observation result. Record entries making statements about relations between signs and symptoms and disorders, e.g. statements about causality, are still another area for consideration. We find it is still an open question whether e.g. causality is inherent in the situation or in the human assessment of that situation.
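As a hedged sketch (invented IRIs, and with 'some' added for the role restriction), the role substitution for a family-history record entry can be assembled programmatically:

import org.semanticweb.owlapi.model.*;

public class FamilyHistoryPattern {
    // The finding-context pattern with ParentRole substituted for
    // SubjectOfRecordRole, i.e. a family-history record entry.
    public static OWLClassExpression build(OWLDataFactory factory, String base,
                                           OWLClassExpression disorder) {
        OWLObjectProperty isAbout = factory.getOWLObjectProperty(IRI.create(base + "isAbout"));
        OWLObjectProperty includes = factory.getOWLObjectProperty(IRI.create(base + "includes"));
        OWLObjectProperty bearerOf = factory.getOWLObjectProperty(IRI.create(base + "bearerOf"));
        OWLObjectProperty locusOf = factory.getOWLObjectProperty(IRI.create(base + "locusOf"));
        OWLClass situation = factory.getOWLClass(IRI.create(base + "Situation"));
        OWLClass livingHuman = factory.getOWLClass(IRI.create(base + "LivingHuman"));
        OWLClass parentRole = factory.getOWLClass(IRI.create(base + "ParentRole"));

        // isAbout only (Situation and includes some
        //   (LivingHuman and bearerOf some ParentRole and locusOf some Disorder))
        return factory.getOWLObjectAllValuesFrom(isAbout,
            factory.getOWLObjectIntersectionOf(situation,
                factory.getOWLObjectSomeValuesFrom(includes,
                    factory.getOWLObjectIntersectionOf(livingHuman,
                        factory.getOWLObjectSomeValuesFrom(bearerOf, parentRole),
                        factory.getOWLObjectSomeValuesFrom(locusOf, disorder)))));
    }
}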

3.3 Evaluation based on competency questions

For a preliminary evaluation using competency questions expressed as DL queries we refer to the example OWL file at http://purl.org/steschu/BO2011. One query retrieves all records for which a disease of a certain type was referred to but not confirmed. Another query shows how a confirmed assertion of a 'situation without stenosis' rules out that the situation contains a stenosis of the left carotid artery. Equally important is an assessment of the computational properties of the approach. Rich description logics with the constructors in Table 2 are known for their computational complexity and lack of scalability. Benchmark simulations are required to ascertain to what extent an acceptable performance can be reached when scaling up towards clinically interesting dimensions.
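A sketch of how the first competency question could be issued against the example file (class and property IRIs are assumed to follow the patterns above and are not taken from the actual file):

import org.semanticweb.HermiT.ReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

public class CompetencyQuery {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology ontology = manager.loadOntologyFromOntologyDocument(
                IRI.create("http://purl.org/steschu/BO2011"));
        OWLDataFactory factory = manager.getOWLDataFactory();
        String base = "http://example.org/clinical#"; // hypothetical namespace

        OWLClass recordEntryAboutDisorder =
                factory.getOWLClass(IRI.create(base + "RecordEntryAboutDisorder"));
        OWLClass situation = factory.getOWLClass(IRI.create(base + "Situation"));
        OWLObjectProperty isAbout =
                factory.getOWLObjectProperty(IRI.create(base + "isAbout"));

        // Record entries about the disorder that are NOT confirmed, i.e. not
        // asserted to be about some actually existing situation.
        OWLClassExpression unconfirmed = factory.getOWLObjectIntersectionOf(
                recordEntryAboutDisorder,
                factory.getOWLObjectComplementOf(
                        factory.getOWLObjectSomeValuesFrom(isAbout, situation)));

        OWLReasoner reasoner = new ReasonerFactory().createReasoner(ontology);
        for (OWLClass c : reasoner.getSubClasses(unconfirmed, false).getFlattened()) {
            System.out.println(c.getIRI());
        }
        reasoner.dispose();
    }
}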

4 CONCLUSION

Formally representing statements which refer to units of unclear reference is a common problem in both scientific and clinical discourse. As has been proposed for the representation of, e.g., chemical entities of unclear existence, we here present how to express the reference to a certain disease in a patient record in hypothetical, affirmative and negative contexts. We demonstrate how semantic equivalence between more ontology-oriented and more information-model-oriented encodings can be proven. Our approach constitutes a moderate first step towards the ambitious goal of interoperable representations of health records using a common logical framework grounded in expressive ontologies. This expressiveness is a major challenge and a potential obstacle to implementation due to the known complexity of rich description logics with negation and value restrictions. Additionally, new ground needs to be broken by leveraging the use of description logics as a query language for clinical queries.

Acknowledgements. This work was supported by the EC project "DebugIT" (FP7-217139).

REFERENCES
IHTSDO (Intern. Health Terminology Standards Development Organisation). Systematized Nomenclature of Medicine - Clinical Terms. http://www.ihtsdo.org/snomed-ct
Baader F, Calvanese D, McGuinness DL, Nardi D, Patel-Schneider PF, editors. The Description Logic Handbook. Theory, Implementation, and Applications (2nd Edition). Cambridge: Cambridge University Press, 2007.
Bodenreider O, Smith B, Burgun A (2004). The Ontology-Epistemology Divide: A Case Study in Medical Terminology. Int. Conf. on Formal Ontology and Information Systems (FOIS 2004). Amsterdam: IOS Press, 185-195.
Rector A, Johnson P, Tu S, Wroe C, Rogers J. Interface of inference models with concept and medical record models. In: S Quaglini, P Barahona and S Andreassen (eds) Proc Artificial Intelligence in Medicine Europe. 2001: 314-323.
Garde S, Knaup P, Hovenga E, Heard S. Towards semantic interoperability for electronic health records. Methods of Information in Medicine 2007; 46(3): 332-343.
Hastings J, Batchelor C, Neuhaus F, Steinbeck C. What's in an 'is about' link? Chemical diagrams and the Information Artifact Ontology. International Conference on Biomedical Ontology, 2011, accepted for publication.
Klein GO, Smith B. Concept Systems and Ontologies: Recommendations for Basic Terminology. Transactions of the Japanese Society for Artificial Intelligence. 2010;25(3):433-441.
OWL2 Web Ontology Language. W3C. (2009) http://www.w3.org/TR/owl2-overview/
Rector AL, Brandt S. Why Do It the Hard Way? The Case for an Expressive Description Logic for SNOMED. Journal of the American Medical Informatics Association 2008; 15: 744–751.
Ruttenberg A, Courtot M, The IAO Community: The Information Artifact Ontology (2010) http://code.google.com/p/information-artifact-ontology/
Schulz S, Schober D, Daniel C, Jaulent MC. Bridging the semantics gap between terminologies, ontologies, and information models. Studies in Health Technology and Informatics 2010;160(Pt 2):1000-1004.
Schulz S, Cornet R, Spackman K. Consolidating SNOMED CT's ontological commitment. Applied Ontology 6 (2011a) 1-11. DOI 10.3233/AO-2011-0084
Schulz S, Brochhausen M, Hoehndorf R. Higgs bosons, mars missions, and unicorn delusions: How to deal with terms of dubious reference in scientific ontologies. International Conference on Biomedical Ontology (2011b), accepted for publication.

“Of Mice and Men” Revisited: Basic Quality Checks for Reference Alignments Applied to the Human-Mouse Anatomy Alignment Elena Beisswanger* and Udo Hahn Jena University Language and Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany

ABSTRACT Identifying relationships between hitherto unrelated entities in different ontologies is the key task of ontology alignment. An alignment is either manually created by domain experts or automatically by an alignment system. In recent years, several alignment systems have been made available, each using its own set of methods for relation detection. To evaluate and compare these systems, typically a manually created alignment is used as so-called reference alignment. Based on our experience with several of these reference alignments we derived requirements and translated them into simple quality checks to ensure the alignments’ reliability and also their reusability. In this paper, these quality checks are applied to a standard reference alignment in the biomedical domain, the OAEI Anatomy Track reference alignment.

1

INTRODUCTION

In knowledge-intensive domains such as the life sciences, there is an ever-increasing need for concept systems and ontologies to organize and classify the large amounts of clinical and lab data and to describe it with value-adding meta data. For this purpose, numerous ontologies on different levels of coverage, expressivity and formal rigor have evolved that, from a content point of view, complement each other and partially even overlap sometimes. To facilitate the interoperability between information systems using different ontologies and to detect overlaps between them, ontology alignment has become a crucial task. Since the manual alignment of ontologies is quite laborexpensive and time-consuming, alignment tools have been developed that can automatically detect correspondences between entities1 in different ontologies as, for example, equivalentClass and subClassOf relations between ontology classes. Many different approaches to and techniques for ontology alignment have been proposed up until now, and dedicated scientific workshops have been organized to accelerate the progress in this field. In 2005, the Ontology Alignment Evaluation Initiative (OAEI) initiated a series of annual evaluation events to monitor and compare the quality of different alignment systems. A somewhat broader view

on the evaluation of semantic technologies is promoted by the Semantic Evaluation At Large Scale (SEALS) project2 that started in 2009. An open source platform is under development to facilitate the remote evaluation of ontology alignment systems and other semantic technologies3 in terms of both, large-scale evaluation campaigns but also ad hoc evaluations of single systems. Amongst others, the platform provides a test data repository, a tools repository, and a results repository for the evaluation and comparison of systems. The most valuable content of the SEALS platform’s test data repository and also the core of the OAEI campaigns are manually created or at least manually curated reference alignments which constitute the ground truth against which alignment systems are to be evaluated. Clearly, the quality of these reference alignments is of paramount importance for the validity and reliability of the evaluation results. For the evaluation of our own ontology alignment system, we were also looking for trustable test data (ontologies and reference alignments). Some data sets we inspected have been used for several years in the OAEI campaigns, or have already been integrated in the SEALS test data repository. Others have just recently been published and have not been used in any public challenge up until now. Notwithstanding the enormous efforts that have gone into the development of such resources, our inspection of many different data sets revealed a number of content-specific shortcomings and technical deficiencies. Hence, we decided to formulate a list of basic quality checks, summarizing our observations. We propose to apply these checks to any given alignment as a kind of minimal reliability test before it is used as a reference standard in any evaluation. In the remainder of this paper, we will first introduce the basic requirements we have defined and then we will apply them to one of the standard data sets used in the yearly OAEI campaigns, the anatomy reference alignment. Finally, we will discuss how the application of the checks to this data set leads to an improved version of both, the reference alignment itself and the input ontologies.

*

2

1

3

To whom correspondence should be addressed. Following Euzenat and Shvaiko [2007], Section 2.2.1, we consider entities to comprise mainly ontology classes, instances, and relations.

http://www.seals-project.eu/ On the SEALS website, a pre-production version is scheduled for August 2011.

1

E. Beisswanger and U. Hahn .

2

BASIC QUALITY CHECKS FOR REFERENCE ALIGNMENTS

An alignment consists of a set of correspondences between entities from two different ontologies. In this paper, we focus on correspondences between ontology classes only. In this case, a correspondence consists of a pair of classes (one class from the first, the other from the second input ontology) and the relation that, according to the creator of the alignment, holds between these classes. Most alignments that have been proposed so far are only concerned with equivalentClass and subClassOf relations. The usefulness of a manually created or curated alignment as reference data for the evaluation of ontology alignment systems depends on various parameters. The following quality checks address fundamental reliability and reusability aspects: Check 1: Is the alignment provided together with the input ontologies on which it is based and are the input ontologies provided in the correct release versions? Check 2: Are the classes to which correspondences in the alignment refer still available in the provided versions of the input ontologies? Check 3: If classes are referred to by URI-label pairs in the alignment, do the URI-label pairs still persist in the available versions of the input ontologies? Check 4: Is the alignment made available in a machinereadable format? Check 5: Are ontology classes in the alignment referred to in terms of unique identifiers (e.g., URIs)? Check 6: Are the relations holding between classes specified explicitly for all correspondences in the alignment? Check 7: If there are cases in which a class from ontology O1 is linked to several (target) classes in ontology O2 by equivalentClass relations, are the target classes in O2 linked by equivalentClass relations as well? Check 8: Are pairs of classes with identical labels linked by an equivalentClass relationship in the alignment? Check 9: How many non-trivial correspondences (ones that cannot be detected via the identity of class labels after applying a simple term normalization step) occur in the alignment? The first six quality checks focus on the (re)usability of an alignment as reference for the evaluation of alignment systems. Checks 1 and 2 test whether the correspondences contained in the alignment can be found at all by the alignment systems based on the available release versions of the input ontologies (imagine cases where, e.g., classes are

2

deleted from an ontology, and, consequently correspondences in the reference alignment referring to these classes cannot be reproduced anymore). Check 3, which tests for label changes, is targeted at the tacit evolution of the meaning of a class. In particular for light-weight ontologies lacking thorough formal class definitions, verbal labels virtually carry the entire meaning of a class and, hence, a new label might indicate a subtle change of the meaning of an ontology class requiring further scrutiny. Of course, if check 1 is positive, checks 2 and 3 can be skipped. Check 4 is concerned with the accessibility of an alignment, while check 5 aims at finding out whether the references to classes are unique (imagine the case where local names or labels would be given as class references, then those references might be ambiguous). Check 6 is meant to assure that the relationships asserted between the classes by the alignment creator are made explicit (according to our experience some alignments are published without a clear distinction between different types of semantic relations). Since in an alignment a class from one ontology should be mapped to at most one class in the other ontology by an equivalentClass relation (or, if it links to several classes, these should be marked as being equivalent themselves), check 7 may provide valuable hints for implicit class equivalences in the input ontologies, but also for redundant or even mistaken correspondences in an alignment. Check 8 picks on the observation that when two ontologies are aligned, especially when they show a strong conceptual overlap, label identity between classes is a strong hint for class equivalence. Checking for label identity may help in detecting missing correspondences in an existing alignment. Check 9 incorporates evidence we found that it often makes sense to evaluate an alignment also against the non-trivial subset of a reference alignment to see how much better the ontology alignment system does than a simple exact string matcher. Certainly, a large proportion of trivial correspondences in an alignment decreases its value as reference alignment, although trivial correspondences do play a certain role as anchors for advanced alignment strategies.

3

ANATOMY USE CASE

To illustrate the potential of the proposed quality checks we now apply them to one of the standard reference alignments in the biomedical domain, viz. the one used in the anatomy track of the OAEI campaign since 2007.

3.1

OAEI Anatomy Track Reference Alignment

The reference alignment used in recent years in the OAEI anatomy track links classes from the anatomy branch of the NCI Thesaurus4 (describing human anatomy) to the mouse adult gross anatomy ontology (MA) based on the Anatomical Dictionary for the Adult Mouse [Hayamizu et al., 2005]. 4

http://ncit.nci.nih.gov/ncitbrowser/

“Of Mice and Men” Revisited: Basic Quality Checks for Reference Alignments Applied to the Human-Mouse Anatomy Alignment

This alignment was created in a combined manual and automatic effort (the automatic alignment exploited lexical and structural techniques) followed by an extensive manual curation step [Bodenreider et al., 2005]. The version of the alignment used in the OAEI 2010 anatomy track comprises 1,520 correspondences linking pairs of classes. The vast majority denotes equivalentClass relations (few subClassOf relations were added by the anatomy track organizers after the original alignment had been published).

3.2

Applying the Quality Checks

We found the following results when we applied the nine basic quality checks described in Section 2 to the anatomy reference alignment. Check 1. In the OAEI anatomy track, the reference alignment is used together with a version of the NCI Thesaurus anatomy branch as from 2006-02-13, and a version of the MA as from 2007-01-18 (both in OWL format), while the alignment itself was created based on the NCI Thesaurus release version 04.09a (from 2004-09-10) and the MA version as from 2004-11-22 [Bodenreider et al., 2005]. Obviously, different release versions of the input ontologies have been mixed for the creation of the reference alignment and for running the anatomy track. Check 2. All classes involved in the alignment are still contained in the new versions of the input ontologies used in the anatomy track. Hence, class consistency is preserved. Check 3. Although in the version of the reference alignment used in the OAEI anatomy track classes involved in correspondences are specified by URIs only (and no class labels), we received from the curator of the alignment the original mapping table on which the alignment was based. This mapping table lists both, URIs as well as the labels of class pairs. We tested whether the URI-class label combinations are still valid in the new versions of the input ontologies and found 85 NCI classes and 34 MA classes for which the labels had changed. A manual inspection revealed that in most cases labels had been made more precise in the new ontology versions (e.g., the label of class NCI_C12443 was changed from Cortex to Cerebral Cortex), were replaced by synonyms (e.g., the label of class NCI_C33178 was changed from Nostril to External Nare), or small spelling or syntax modifications were inserted (e.g., the label of class MA_0000475 was changed from aortic arch to arch of aorta), while the meaning of the classes remained stable and the mappings were still valid. However, the check also pointed us to six mistakes in the alignment that seem to have been caused by shifts in the mapping table. For example, the class NCI_C49334 brain white matter was mapped to MA_0000810 brain grey matter and NCI_C49333 brain gray matter to MA_0000820 brain white matter.

Check 4. The reference alignment is distributed in the Alignment API format [Euzenat, 2004] and thus can easily be accessed and used via the JAVA-based Alignment API.5 Check 5. Classes are referred to by class URIs. Check 6. For each correspondence, the relation holding between the two classes involved is explicitly specified. Check 7. Looking at the equivalentClass relations expressed in the anatomy alignment, we found 17 NCI classes being linked to more than one MA class (three MA classes in one case and two in all other cases) and 22 MA classes being linked to more than one NCI class (namely two). We checked for equivalentClass relations between the respective target classes in the ontologies, but found none. Thus we manually inspected all cases of multiple mapping targets. We found that in 20 cases, the target classes in fact seem to be equivalent classes6 that are just not yet marked appropriately in the given versions of the respective ontologies. Cross-checking with the most recent versions of the input ontologies revealed that from this set 12 target class pairs from the NCI meanwhile have been merged. For another three cases we proposed a merger to the NCI team (for example, for the classes NCI_C33708 suprarenal artery and NCI_C52844 adrenal artery) that already have been accepted and will be considered for the next version release. Furthermore, we found 18 cases in which the target classes were linked by relations other than equivalentClass in the respective ontologies. In 12 cases the target classes were linked by partOf relations, in four cases by subClassOf relations, and in two cases they were treated as sibling classes. We inspected these relations and judged the majority of them as being correct. This allowed us to draw the conclusion that for the classes concerned only the mapping to one target class is correct, while the others should be removed from the alignment. In a few cases, we considered the relation that we found between target classes as being incorrect. Check 8. After lowercasing all labels and removing underscores we found 14 class pairs between the NCI thesaurus anatomy branch and the MA ontology with identical labels that were not linked by an equivalentClass relationship in the reference alignment. A manual inspection revealed that in two cases the respective classes, in fact, referred to slightly differently defined concepts. For example, the classes MA_0000323 and NCI_C12378 share the label gastrointestinal system. However, the MA class fits the usual understanding of gastrointestinal system comprising the stomach, intestine and the structures from mouth to anus, while the NCI class does not, but includes, in addition, accessory organs of digestion, such as the pancreas and the liver. (The NCI anatomy branch comes with another class, 5

http://alignapi.gforge.inria.fr/ This assertion reflects our own assessment. We are in contact with the developers of the anatomy alignment and the input ontologies to approve this finding. 6

3

E. Beisswanger and U. Hahn .

NCI_C22510 gastrointestinal tract, which corresponds to the class MA_0000323). However, in the remaining 12 cases the equivalentClass relationships between classes seem to be effectively missing in the alignment. An example is the class pair (NCI_C33460, MA_0002730) sharing the label renal papilla. Check 9. We found that 937 out of 1,520 correspondences (62%) in the anatomy alignment are trivial ones.

4

RESULTS AND DISCUSSION

The result from check 2 guarantees that, at least from a formal point of view, all correspondences in the anatomy reference alignment can be found by an automatic alignment system. This check was compulsory, since (given the result of check 1) more recent versions of the input ontologies had been used in the anatomy track than the original alignment is based on. Obviously, it could have been the case that in newer versions classes had been removed or made obsolete. The results of checks 4, 5 and 6 reflect the fact that the anatomy alignment serves as reference data set in a public evaluation campaign. Other than some more recent alignments, that we have already reviewed as well, it is published in a community-accepted standard format and classes and relations are referred to in a well-defined way. Check 9 revealed that only one third of the correspondences in the alignment are non-trivial, i.e., they cannot be detected by simple string matching tools. Since the alignment is quite large with respect to the number of correspondences, this makes it still a valuable evaluation data set. However, the large percentage of trivial correspondences must be considered when interpreting the results that alignment systems achieve on this data set, or when comparing these results to those achieved by the same systems on different data sets.7 By far the most interesting results we achieved analyzing the outcomes of checks 3, 7 and 8. In total, these checks helped us detect 30 erroneous correspondences that need to be removed from the reference alignment (this accounts for 2% of the complete alignment and 5% of the non-trivial subset) and 14 new ones that we propose to add to the alignment. The list of invalid and newly proposed correspondences has already been communicated to the anatomy alignment curators. In agreement with the organizers of the OAEI anatomy track the confirmed changes will be considered in the 2011 OAEI campaign. An issue that we did not focus on in this paper is checking for the logical consistency of an alignment. With regard to this issue, we refer the reader to related work by Meilicke et al. [2009], who propose a Web-based tool that supports the human alignment curator in detecting and solving conflicts in an alignment by capitalizing on logical reasoning. 7

The OAEI anatomy track organizers are aware of this fact and compute, in addition to standard recall and precision, a value they call “recall+”. It refers to the non-trivial correspondences that a system was able to detect.

4

5

CONCLUSIONS

We presented nine basic quality requirements and associated checks intended to assist developers and curators of ontology alignments to create and maintain both, reliable and easy to (re)use references for the evaluation of alignment systems. As we could show – using the example of the anatomy reference alignment – very basic checks can already help in detecting both, incorrect correspondences that should be removed from an alignment and missing correspondences that should be added. We also observed that the tests can reveal shortcomings in the input ontologies themselves, such as missing or invalid relations between classes. The set of basic checks presented in this paper should be seen as a first, rather simple, yet effective step in a multistage procedure of extensively checking the quality of an alignment before it is used as a reference in an evaluation setting. Our work is thus targeted at the sanity of comparison standards, an issue of prime importance for any conclusion we can draw from the outcome of any evaluation campaign. We plan to complement the basic checks by more advanced logical consistency checks and more elaborate considerations on alignment quality, as proposed, e.g., by Joslyn et al. [2009], checking for the structural preservation of semantic hierarchy alignments.

ACKNOWLEDGEMENTS We would like to thank Terry Hayamizu (curator of the anatomy alignment) and Christian Meilicke (OAEI anatomy track organizing committee) for their active collaboration and Stefan Schulz (Graz University of Medicine, Austria) for assisting us in anatomy questions. This work was funded under BMBF grant 0315581D as part of the JenAge project.

REFERENCES Bodenreider, O., Hayamizu, T. F., Ringwald, M., de Coronado, S., and Zhang, S. (2005). Of Mice and Men: Aligning mouse and human anatomies. Proceedings of the 2005 AMIA Annual Symposium, pp. 61–65. Euzenat, J. (2004). An API for ontology alignment. Proceedings of the 3rd International Semantic Web Conference, pp. 698-712. Euzenat, J. and Shvaiko, P. (2007). Ontology matching. SpringerVerlag. Hayamizu, T. F., Mangan, M., Corradi, J., Kadin, J., and Ringwald, M. (2005) The Adult Mouse Anatomical Dictionary: A tool for annotating and integrating data. Genome Biology, 6(3): 1-8. Joslyn, C., Paulson, P., and White, A.M. (2009). Measuring the structural preservation of semantic hierarchy alignment. Proceedings of the 4th International Workshop on Ontology Matching at the 8th International Semantic Web Conference. Meilicke, C., Stuckenschmidt, H., and Svab-Zamazal, O. (2009). A reasoning-based support tool for ontology mapping evaluation. Proceedings of the 6th European Semantic Web Conference (Demo-Paper).

An Ontology of Annotation Content Structure and Provenance Kevin M. Livingston, Michael Bada, Lawrence E. Hunter, Karin M. Verspoor Center for Computational Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA

ABSTRACT Motivation: Representing and understanding complex biological systems requires knowledge representations that can relate multiple concepts to each other though sets of assertions. Annotation efforts that seek to curate this information require the ability to annotate with more than single ontology concepts or identifiers. We propose an extension to the Information Artifact Ontology (IAO) for representing annotations, including single term annotations as well as annotations containing multiple statements. This extension enables tracking the provenance of annotations in terms of other annotations, as well as the provenance of individual parts of statements in multi-statement annotations.

1

INTRODUCTION

Representing and understanding complex biological systems requires constructing knowledge representations that go beyond the selection of a single term in a single ontology to describe them. For example, fully representing an event requires capturing information about the type of event, its participants, locations, and other contextual information, into one knowledge structure. All this information together is relevant to any understanding or reasoning done by human or machine with respect to that particular event. Annotations are one of the primary ways biological information is being curated and distributed. There are two prominent kinds of biomedical annotations: (1) associating ontology terms with genes, gene products, or other entities, such as Gene Ontology (GO) annotation, and (2) associating ontology terms or database identifiers with (typically) textual references in documents, such as the output of the NCBO Annotator (Jonquet et al. 2009). The primary focus of most annotation work to date has been annotating with single ontology terms or identifiers. These annotations have proven useful, for example, in computing term enrichment or for indexing for search. However, single term annotations do not provide a complete understanding of the biological content they are annotating. The Entrez Gene record for Human TP53 (7157) lists 10 different phenotypes, 79 process or function annotations, and 17 component annotations. It is highly unlikely that all processes and components are associated with each of the phenotypes listed; rather there are various subsets of all of these annotations that are associated with each other in different contexts. Likewise, having a set of ontology terms associat-

ed with a document provides far more information than just the textual strings, but it is a long way from a complete understanding of the document content. For example, consider a document annotated with multiple proteins and the ontology term for calcium transport. Viewing those annotations as an unstructured set provides no information as to which protein (if any, or all) may perform that function. While ontologies strive to be complete, it is likely that specific applications will require concepts that are not explicitly expressed in an ontology and thus require dynamic construction. Furthermore, as information needs increase and annotation efforts expand to cover more complex concepts, knowledge structures containing relations among many parts will need to be represented. Annotators also need the ability to reference existing annotations, or their content, as the provenance for more complex annotations. We present an extension of the Information Artifact Ontology (IAO1) for representing annotation content, i.e., concepts or sets of assertions denoted by annotations. This proposal focuses on annotation content and not on metadata such as author or creation date, but it is consistent with existing models for representing this information. We provide a mechanism for representing both the provenance of annotations in terms of other annotations as well as the provenance of their semantic content in terms of other semantic content. This ontology consistently applies to use cases in both entity-oriented annotation (e.g., GO annotation) and document-oriented annotation (e.g., NCBO Annotator).

2

BACKGROUND

There are many formats for recording biomedical annotations. Most associate single ontology terms with genes, gene products, or other entities. Among those that afford more complex representations is the Gene Association Format (GAF 2.02). GAF provides the ability to add “annotation extensions” to a GO annotation that can, for example, further constrain the annotation to occur in a particular location, or to part of another component. The existence of such extensions demonstrates the community’s need for recording more structured annotations. However, they are specific to GO annotations and GAF-formatted data. Our model is generally applicable to all annotations and affords access to 1 2

http://code.google.com/p/information-artifact-ontology/ http://www.geneontology.org/GO.format.gaf-2_0.shtml

1

K. M. Livingston et al.

Semantic Web tools such as reasoners and visualizations not available to idiosyncratic formats. There are several proposed RDF-based models for representing annotations over web resources, including but not limited to text. Most of these models, and most of the well studied use cases for annotation (e.g., biological literature curation (Clark et al. 2011), or digital humanities applications (Hunter et al. 2010)) require only the association of an individual concept or database identifier with a given information source such as a text segment (e.g., identifying genes). In contrast, the natural language processing community has developed solutions for representing complex syntax and semantics for documents, such as, full parse trees in the Penn Treebank (Marcus et al. 1993), but these representations are mostly idiosyncratic and not interoperable. Our own work revolves around two primary use cases although has benefit to the community in general. The first use case is oriented towards publishing semantic content produced by text mining systems in a format that would integrate with existing semantic web tools, such as the Annotation Ontology (Ciccarese et al. 2010) and viewer. Our model is consistent with these existing efforts, while capable of capturing annotations with more structure than single terms. The second use case is enabling natural language understanding systems that can reason over RDF-based annotations. Specifically Direct Memory Access Parsing (Riesbeck and Martin 1986) systems that use patterns of lexical and semantic elements to recognize and interpret language, e.g., OpenDMAP (Hunter et al. 2008) and REDMAP (Livingston 2009). Our model would allow annotations produced by other systems to be leveraged in producing more complex semantic structures and precisely record their provenance.

3

ANNOTATION REPRESENTATION

We propose an extension to the Information Artifact Ontology (IAO) to represent the content of annotations, including annotations of single ontology terms or identifiers as well as annotations containing sets of assertions. Annotations can be composed and the provenance of that composition can be fully recorded, both for the annotations themselves and the individual elements of statements in the annotations. We divide annotations into two primary classes: ResourceAnnotation, which is an annotation that associates a single RDF resource with a target, and StatementSetAnnotation, which is an annotation that associates a set of assertions with a target. This proposal focuses on the structure of the content of annotations and is neutral with respect to annotation schema or annotation content. Details about how to record metadata, such as author and creation date, are therefore elided from this discussion. A base annotation class that is consistent with any of the existing RDF-based annotation methods (e.g., the Annotation Ontology) is assumed and can be treat-

2

Fig. 1. Example of kiao:ResourceAnnotation denoting a single rdfs:Resource, a protein. Relevant ontology terms: gray, classes: boxes, instances: ovals, properties: no border. Standard rdf/rdfs namesapces omitted.

ed as a parent class for both ResourceAnnotation and StatementSetAnnotation. We reuse or extend existing community-curated ontologies where possible. The Information Artifact Ontology (IAO) is a good starting point for annotations. Our in-house knowledge base of biology (KaBOB) is the aggregator of our work; KaBOB extensions of an ontology are named by prefixing the ontology’s namespace with the letter ‘k’. The namespace kiao: is therefore used for our extension of the IAO. Both kiao:StatementSetAnnotation and kiao:ResourceAnnotation are rdfs:subClassOf iao:data item. The ex: namespace is used for examples.

3.1

ResourceAnnotation

A resource annotation is an annotation that associates a single rdfs:Resource with a location and is of rdf:type kiao:ResourceAnnotation. The relation kiao:denotesResource is used to associate the resource with the concept being annotated. This property is rdfs:subPropertyOf iao:denotes, which relates an information content entity (in this case an annotation) to something that it is specifically intended to refer to (in this case a rdfs:Resource). Figure 1 depicts an example annotation of an interferon gamma protein (pro:000000017 from the Protein Ontology (Natale et al. 2007)). For the purposes of alignment with existing annotation models, kiao:denotesResouce could be made rdfs:subPropertyOf a corresponding relation, e.g., ao:hasTopic. A single rdf:Statement could be used as the denoted resource of a ResourceAnnotation (since rdf:Statement is rdfs:subClassOf rdfs:Resource) or in a single-statement StatementSetAnnotation.

3.2

StatementSetAnnotation

While ResourceAnnotation instances denote a single RDF Resource, StatementSetAnnotation instances represent sets of RDF statements. These statements can correspond to any set of RDF triples desired by the annotator. We make no restriction as to which triples are allowed or what they represent; this proposal only recommends how to represent them and assign provenance to the content of the triples and their constituent members.

An Ontology of Annotation Content Structure and Provenance

itself with any annotation used in its construction, e.g., between the two previous annotations, see top arc in Figure 3. This property can be used both when there is a direct relation between annotations, such as one directly using elements of another; or when there is an indirect relationship, such as one annotation being used as the justification for another’s existence even though no parts were shared (e.g., the annotation of a specific gene being used to justify the annotation of the concept “gene” from the Sequence Ontology, so:0000704).

Fig. 2. Example of kiao:StatementSetAnnotation mentioning two statements: (T1 subClassOf go:0006412) and (T1 resultsInFormationOf pro:000000017). See Fig 1 for figure key.

The triples comprising the content of a StatementSetAnnotation are represented as instances of rdf:Statement. A StatementSetAnnotation is associated with one or more of these reified statements using the property kiao:mentionsStatement; a rdfs:subPropertyOf of iao:mentions. The IAO defines mentions as meaning that the subject of the statement ro:has_part that iao:denotes the object. The example in Figure 2 represents an annotation for a translation (go:0006412 from the Gene Ontology) of an interferon gamma protein (pro:000000017). The use of reified statements protects users of the annotation from committing to or believing the propositions represented in the annotation unless they want to. For example, one annotation could contain the triple (Earth hasShape Flat). In its reified form, a reader of this annotation is not committed to believing the Earth is flat. What has been represented is effectively “this particular annotation says, ‘the Earth is flat.’” Should a user of a StatementSetAnnotation choose to reason about the contents of an annotation, the statements that it encodes can be recomposed from their reified forms. Again, this proposal is only about the structure and provenance of the semantic content of annotations, not confidence, trust, or other epistemological or modal information, which could be modeled independently.

3.3

3.3.2 Element Level The second layer of annotation content provenance is more detailed and allows for tracking the provenance of the individual statement elements of an annotation. This is done by reifying statement elements and relating them to the elements of other annotations. There are two sets of properties used to model statement elements, one set for indicating what statement a particular element is part of, and one set to model what it is based on. The relation kiao:mentionsStatementElement relates a StatementSetAnnotation to statement elements; the object of this relation is a reified StatementElement. A StatementElement is then associated with the statement in the set that it is part of via one of three relations: kiao:isSubjectOf, kiao:isPredicateOf, or kiao:isObjectOf. These relations correspond to the three positions in a reified Statement that a StatementElement could be representing. A StatementElement is then related to the content that it is based on using one of four properties: kiao:basedOnResourceOf, kiao:basedOnSubjectOf, kiao:basedOnObjectOf, or kiao:basedOnPredicateOf. The relation kiao:basedOnResourceOf is used to record the provenance of a StatementElement as the denoted resource of a ResourceAnnotation. Fig. 3 depicts the element-level provenance, documenting that statement element ex:E1 is the object of statement ex:S2 and is based

Provenance

The ontology extension we propose provides two levels of abstraction for provenance tracking of annotations content. 3.3.1 Annotation Level The first and simplest is annotation-level provenance. If one annotation builds on another annotation in any way, it can document this relationship using the kiao:basedOnAnnotation property to associate

Fig. 3. A partial representation highlighting provenance, depicting A2 (from Fig. 2) based on A1 (from Fig. 1). Also depicting a single kiao:StatementElement that is the object of Statement S2, and based on the denoted Resource of A1.

3

K. M. Livingston et al.

and enables tracking of the provenance of content between annotations at two levels of granularity. Coarse-grained provenance is modeled by linking annotations to the annotation they were based on using a single relation. A small set of relations can further be used to provide far more detailed provenance about each element of semantic structures. Our model is compatible with existing RDF-based annotation proposals. Because of the layered nature of our model, annotations represented in it could be understood by existing tools largely without change. We will submit this model to the IAO for consideration for inclusion. We believe adoption of this model within the Bio-ontologies community will enable standardization and interoperability of annotations both within that community and further open up these annotations for use in the broader Semantic Web.

ACKNOWLEDGEMENTS Fig. 4. Example of kiao:StatementSetAnnotation mentioning two statements: (R1 subClassOf go:0006412) and (R1 regulates T1). A partial example of element-level provenance is shown below the dashed line documenting that T1 is based on part of Statement S1 (from Fig. 2). See Fig. 1 for figure key.

on the resource denoted by annotation ex:A1. To record the provenance of a StatementElement as an element of another statement, normally in another StatementSetAnnotation, one of the last three “based on” properties can be used to relate it to another reified rdf:Statement, e.g., kiao:basedOnSubjectOf. For example, consider a third annotation representing the regulation of the translation from the second annotation, it and its provenance could be represented as depicted in Figure 4, using “regulation of translation” from the Gene Ontology (go:0006417). The part of Figure 4 below the dashed line shows statement element ex:E3 as the object of statement ex:S4 and is based on the subject of statement ex:S1 (i.e., the translation event ex:T1 from Figure 2). Just as it is not required that parts of an annotation be directly used in another annotation to make a kiao:basedOnAnnotation statement, it is not required that elements be identical to make kiao:basedOnSubjectOf, etc. statements. For example, if a specific protein is used in the representation of a complex event (e.g., a protein transport event), an annotator can create a generic protein class annotation and document the specific protein element in the statement set as its source.

4

CONCLUSION

We have presented an ontology extension to the IAO for representing annotation content and the provenance of that semantic content. The model represents annotation content, i.e., concepts or sets of assertions denoted by annotations,

4

We appreciate the support of the other members of the CCP. This work was supported by NIH grants R01LM009254, R01GM083649, and R01LM008111 to LH, R01LM010120 to KV, and 3T15 LM009451-03S1 to KL.

REFERENCES Ciccarese, P., Ocana, M., Das, S. and Clark, T.. 2010. AO: An Open Annotation Ontology for Science on the Web. Paper presented at the Proceedings of the Bio-ontologies SIG at ISMB 2010, Boston, MA. Clark, T., Ciccarese, P., Attwood, T., de Waard, A. and Pettifer, S. 2011. A Round-Trip to the Annotation Store: Open, Transferable Semantic Annotation of Biomedical Publications. Paper presented at the Beyond the PDF Workshop, University of California San Diego. Hunter, J, Cole, T., Sanderson, R., and Van de Sompel, H. 2010. The Open Annotation Collaboration: A Data Model to Support Sharing and Interoperability of Scholarly Annotations. Paper presented at the Digital Humanities 2010. Hunter, L., Lu, Z., Firby, J., Baumgartner Jr., W. A., Johnson, H. L., Ogren, P. V., and Cohen, K. B. 2008. OpenDMAP: An open-source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-specific gene expression. BMC Bioinformatics 9. Jonquet C., Shah N.H., Youn C.H., Callendar C., Storey M., Musen M. 2009 NCBO Annotator: Semantic Annotation of Biomedical Data, In 8th International Semantic Web Conference, Poster and Demonstration Session, ISWC'09. Livingston, K. M. 2009. Language understanding by reference resolution in episodic memory: Northwestern University. Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19.313-30. Natale D, Arighi C, Barker W, et al. 2007 Framework for a Protein Ontology. BMC Bioinformatics 2007;8:S1 Riesbeck, C. K., and Martin, C. E. 1986. Direct memory access parsing. In Experience, memory, and reasoning, ed. J L. Kolodner and C. K. Riesbeck. Hillsdale N J: L. Erlbaum.

Relating Processes and Events for Granularity-neutral Modeling Niels Grewe* University of Rostock, Rostock, Germany

ABSTRACT This paper investigates an approach of classifying temporally extended entities (occurrents) which distinguishes events and processes by means of their inner structure and their relation to change. By assuming processes to be homogeneous up to a certain intrinsic granularity, I develop a suggestion on how to model events and processes in a way that is to a great extent neutral to granularity issues.

1

INTRODUCTION

Biomedical reality is full of changes and processes, things that unfold in time: The citric acid cycle, an infection of the sinuses, the beating of a heart, an appendectomy, the growing of a tree: These examples of changes are affecting things on different levels of granularity and show a great deal of variation. Despite efforts in various top-level ontologies, no uniform treatment of such „unfolding“, temporal entities has emerged in biomedical ontologies. Rather, specialized accounts have sprung up to treat the needs of particular disciplines (e.g. in the fields of systems biology (cf. LeNovere 2007) or epidemiology (cf. Kawazoe et al. 2008). This is unfortunate from a data integration perspective, especially if entities on multiple layers of granularity are involved. This is evident, for example, in the infectious disease ontology (IDO), which includes on the one hand classes describing biomolecular interactions (e.g. Immune Response, imported from the Gene Ontology) and on the other hand classes concerning populations of organisms (e.g. Infectious Disease Outbreak) but does not, make any attempt to establish any relations between those different levels of granularity. In this paper I take a small detour to motivate a suggestion on how an integration-friendly ontology of temporally unfolding entities should look like. This detour draws upon ideas of Galton and Mizoguchi (2009) about the traditional ontological distinction between occurrents and continuants, a simplified model that does not account for “changes” of occurrents, which are important in their own right. I then further distinguish mutable and immutable entities and use the categories of events and processes thus characterised to highlight how integration between different levels of granularity could be achieved. The paper will conclude with

*

a brief sketch on how this approach could be applied to IDO.

2

CONTINUANTS AND OCCURENTS

2.1

The Standard Account

In most attempts at ontologically adequate modeling one distinction is ubiquitous: The distinction between entities that are present as a whole at every moment of their existence (called “continuants” or “endurants“), and those which are only partially present at each moment (called “occurrents” or “perdurants”, cf. Simons 1987). Hence, toplevel ontologies such as DOLCE (cf. Masolo et al. 2003) or BFO (cf. Grenon and Smith, 2004) adopt this distinction as the primary means to partition the entities in the world. At first sight, this distinction aligns nicely with everyday experience and scientific method: Continuants, both the familiar, independent variety (e.g. a DNA molecule or a scalpel) and the more outlandish dependent continuants (e.g. the weight of the DNA molecule or the colour of the scalpel) do not have tardy parts: If they are present at all, they are wholly present. Occurrents, on the other hand are fleeting, we never experience more than a single “slice” of them at once. At a given moment, we will, for example, observe exactly one phase of a mitosis, but never, say, prophase and metaphase of one and the same mitosis together. This neat picture has another important consequence: Only continuants are entities that endure in time and hence only continuants can be the subjects of change because in order to speak of change, we need to ascribe different properties to them at different instants. For example, a scalpel used during an appendectomy can be the same scalpel at the beginning and the end of the procedure, even though it might have undergone some changes in the meantime (e.g. it was located on the table at 𝑡1 and in the hands of the surgeon at 𝑡2 , or it may have been quite sharp at 𝑡1 and rather blunt at 𝑡3 ). The same cannot be said for occurrents. Since the parts of the appendectomy present at 𝑡1 and 𝑡2 are clearly distinct, claiming that the appendectomy has changed would be as absurd as claiming that the scalpel „changed“, just because the blade exhibits characteristics different from those of the handle. Consequentially, the division between continuants

To whom correspondence should be addressed.

1

N. Grewe

and occurrents lines up with the distinction of entities which allow for change and those which do not.1

2.2

Mutable Occurents

Still, this standard “continuants vs. occurrents”-picture has been repeatedly challenged. Often those challenges have been put forward by ontologists who strive for extreme parsimony and try to achieve an adequate description of reality with just a minimal set of categories. Some suggest that continuants should be perceived as four-dimensional space-time worms, which then also encompass what we call “occurrents” (cf. Quine 1960). Others, in a loosely Whiteheadian tradition (cf. Whitehead 1929), advocate that continuants should be absorbed to occurrents (Seibt 1997). Against those, at least in part reductive, proposals, the intuitive appeal of the assumption that continuants and occurrents are distinct categories that are divided because of substantial ontological differences has to be stressed. I side with the authors of the BFO (cf. Grenon and Smith 2004) in assuming that both a diachronic, occurrent-centered and a synchronic, continuant-centered description are necessary and adequate representations of biomedical reality. Still, even if one accepts the general distinction, much can be found at fault within the confines of that framework. Most importantly, Galton (2006) and Galton and Mizoguchi (2009) have raised concerns about the nature of different classes of occurrents. They argue that specific occurrents can actually undergo changes. Our alignment of continuants as mutable entities and occurrents as immutable entities would then be incorrect. The primary justification for this is that there are many propositions that ascribe changes to occurrents. Some of these appear quite natural in everyday language. For example, someone might be tempted to say that “the snowing got more intense during the night.” Here different intensity qualities are assigned to a perduring episode of snowing at two different points of time. Likewise, many properties of physical systems, which can be described by differential equations, can often readily be explained by appealing to changes of an occurrent. Key examples would be the concepts of damping or acceleration. Damping describes the change of amplitude in an oscillation process and acceleration the change of velocity in a motion process. The fact that one usually ascribes the corresponding property to a participant of the process (for example, “acceleration of a body” or, to take a simpler example, “velocity of a body”) is only a superficial counterargument. Such descriptions rather serve to illustrate a peculiar feature 1

It deserves mention that the distinction between occurrents and continuants should not be thought of as a mere linguistic distinction between the rôles played by verb and noun phrases in a scentence. The question of whether one is dealing with a continuant or an occurrent seems to be one oflanguage independent reality, if it is at all reasonable to speak of such a thing.

2

of scientific parlance: The (in this case: physical) models are usually reduced to their bare minimum in order to capture only those factors that are essential for the system under study. Hence the phrase “velocity of a body” arises as a convenient shorthand for “velocity of a body participating in motion process p relative to an inertial system s“. But this shorthand is only unambiguous for simple models. One can easily see this when asked to determine the velocity of the earth: What is meant by “velocity of the earth”? The velocity that characterises the motion around the earth's axis or to the velocity that characterises the motion of the earth around the sun? In this case, one cannot do away with the reference to the motion process and “short-circuit” to ascribing the velocity to the object alone. Hence, if the velocity changes, we might be compelled to ascribe the change (at least partially) to the motion process. 2 This observation is clearly at odds with the intuition that occurrents do not undergo change. One is thus obligated to either provide some rationale for overriding the intuition or to provide a analysis that avoids the conclusion that occurrents do change.

2.3

Events and Processes

The solution that Galton and Mizoguchi provide for this conundrum rests on the distinction between events and processes.3 This distinction is, in turn, modeled in analogy to the distinction of objects and matter on the continuant side. Events are said to be analogous to objects in as far as both “are discrete individuals which may be referred to using count nouns.” (Galton and Mizoguchi, 2009, 74) This is certainly true: Just as the surgeon can count the scalpels and hemostats on the operating table, he can count the appendectomies and colonoscopies he has performed. Each of them will be a single, clearly delineated individual. Closely related to this feature of “discreteness” is that of „definite extension“: Just as each object takes up a fixed amount of space, each event occupies a fixed time interval. 4 Additionally both categories share certain constraints on their internal structure: Both are “non-dissective”, or heteromerous. This means that no part of the original whole is still of the same kind as the whole. The blade of a scalpel is a blade, not a scalpel, and the initial incision during the appendectomy is an incision, not an appendectomy. 2

One is not, however, compelled to claim that the change is effected by the process. I take the conservative but very plausible stance that there are no free-floating changes of processes: Every change to a process is due to an underlying change in the participating continuants. 3 Caveat lector: The terminological confusion about the terms “event” and “process” is Babylonian in extent. I will use them here to denote two sibling-classes under the parent Occurrent. Their specific differences will become clear in what follows. 4 The extension of either events or objects might, however, exhibit some degree of fuzzyness.

Relating Processes and Events for Granularity-neutral Modelling

On the other side of the spectrum, an equivalent analogy is drawn between matter and processes, which are characterised by the inverse set of characteristics: Matter is not discrete in the sense that it constitutes a cleanly delineated individual, it is rather “the „stuff‟ from which those individuals are made” (Galton and Mizoguchi 2009, 74). Hence, the hemostat will be made of steel, and steel alone does not yet carry any clear criterion of individuation since we can never say that there is complete “steel”, only complete chunks of steel. Table 1. Analogous features of objects, events, matter, and processes according to Galton and Mizoguchi (2009). Spatial/Temporal Discreteness

Definite extension

Dissectivity

Object

+

+

-

Event

+

+

-

Matter

-

-

+

Process

-

-

+

The same is said to be true for processes with regard to events: The incision event that is the first part of the appendectomy is made up from a cutting process. The cutting process, as such, does not have a definite criterion of what makes a “complete” cutting. The incision event in the appendectomy, on the other hand, has one: It is complete once the intended endpoint (e.g. McBurney's point) has been reached by the scalpel. Again, this also means that there is no definite extension for either matter or processes. But the most important analogy between matter and processes is their dissectivity: Save for granularity issues that will have to be dealt with later on, a certain kind of matter can be arbitrarily divided into smaller portions and still be of the same kind. The same is arguably true for processes: If cutting is going on from 𝑡1 until 𝑡𝑛 , cutting is also going on from 𝑡1 ≥ 𝑡1 until 𝑡𝑚 ≤ 𝑡𝑛 . This notion of dissectivity (also called homogenity or homoeomericity) is one key ingredient to solving the change conundrum: In as far as occurrents are non-dissective (i.e. events), they cannot change because the selfsame event is only completely present over the whole time interval it occupies and there is no point in claiming that a change took place from 𝑡1 until 𝑡2 because there would not be the same entity present at both points in time (cf. Galton and Mizoguchi 2009, 78). As far as occurrents are dissective (i.e. processes), it seems to be possible to speak of change. The reasons are as follows: If one assumes that a process p is going on between 𝑡1 and 𝑡𝑛 , the same process is also going on at every 𝑡𝑚 such that 𝑡1 ≤ 𝑡𝑚 ≤ 𝑡𝑛 due to its dissectivity. Thus one can identify the process p at multiple points in time and ascribe

different qualities to it at those timepoints. The transition between them amounts to something that is at least analogous to the change of a continuant. It needs to be stressed that this is only possible because a process is dissective: There is at least one (non-contingent) aspect about it that stays the same while the process is going on. This position is thus on the one hand different from views that merely use dissectivity as a classification criterion for distinguishing occurrents that effect changes from those that merely describe the continued existence of a state, which is how DOLCE introduces the categories of “stative” and “eventive” occurrents. (cf. Masolo et al. 2003, 17) On the other hand there is a marked difference to views such as that of Rowland Stout, who makes similar claims to continuant-like characteristics of processes but does not require them to be dissective. He instead appeals to an allegedly primitive human capacity of “tracking” things (i.e. objects or processes) through time to account for the reidentification of a process through time. (cf. Stout 2003, 148) Since there is no explanation on offer for this capacity, it has a certain air of obscurum per obscurius, which makes it less useful for the purpose at hand. Galton and Mizoguchi conclude that a process seems to be “more ike an object than an event, calling into question the neat division into continuants and occurrents.” (Galton and Mizoguchi 2009, 79) Their complete solution is more sophisticated and quite revisionary in that it describes objects as “interfaces” between processes (Galton and Mizoguchi 2009, 92). Since scientific ontologies should be founded on well-understood and uncontroversial principles I will not discuss it here but instead restrict the discussion to the granularity issues that are crucial in this context.

3 PROCESS/EVENT RELATIONS AND INTRINSIC GRANULARITY 3.1

Temporal Windows

Assuming matter and processes to be homogeneous is a fitting abstraction if one tries to expound the conceptual similarities between the two categories. It is not, however, an adequate principle for concrete modeling. Matter may be dissective on the macro- or mesoscale but on the microscale, there are limits to dissectivity. When one start dividing a given portion of water, one will at first obtain different portions of the same kind: But that is no longer true once one has divided the portion that only consists of two water molecules. Any further division will produce entities that are no longer of the same kind, for example a hydroxide anion and a proton. Hence, the H2 O molecule is the natural grain of the divisible water-stuff (cf. Jansen and Schulz, 2010). Such intrinsic levels of granularity play an important role in the definition of processes as well. For example, an episode of walking can only be divided into further episodes


The same is already true for very basic physical processes, like the emission of a sound at a certain frequency f, which can only be subdivided into intervals that last at least 1/f seconds. According to Galton and Mizoguchi, these intrinsic granularities can be used to define “temporal windows” for processes (Galton and Mizoguchi 2009, 83–85). These windows are time slots which are just long enough so that the characteristics of the process kind in question can be realised. Hence, for a walking process, the temporal window will have the duration of a single step, and for a process of a bacterial infection spreading, the temporal window will accommodate individual cell divisions. When a process is going on, the temporal window moves along with the present temporal extension of the process and might even shrink or grow as needed, for example if the person walking slows their pace or if the rate of cell division in a bacterial colony increases. Since for each temporal window the qualities of the process can be determined, the succession of different qualities amounts to the changes a process undergoes.

But the temporal window of a process is different from its temporal parts: The temporal parts of a walking process are the individual movements of the left and the right leg, and those of the spreading process of a bacterial infection are the various phases of the single cell divisions. Neither of these constitutes the complete processes they are temporal parts of. Temporal windows are more like “temporary parts”5 of the process: At every temporal window, the same process is wholly present, but each temporal window is only present during a small duration of the time that the process is going on. In the same way, most (or probably all) of my epithelial cells are part of my body only for a short period of time, but my body is still present as a whole during each of those periods.

Unfortunately, the “temporary part of” relation seems rather arcane. With continuants, it clearly corresponds to temporally indexed spatial parthood (“x is (spatial) part of y at t”), but this is clearly not an option for processes, because what is needed is a relation between two occurrents. One potential candidate is a constitution relation6 between occurrents, especially since the two entities (i.e. the process p and the occurrent o occupying a temporal window w of p) coincide during the interval that w occupies while still differing in important respects. One could then spell out the relation as “o (temporally) constitutes p”, saying that a process p is constituted by an occurrent o during the duration of o. I will use this as a rough approximation of what is needed here.

5 I am borrowing this term from Stout (2003, 153), who uses it in a similar manner.
6 Cf. e.g. Baker 2002 for a discussion of the (material) constitution relation as applied to continuants, where it is far less exotic than in the application to occurrents.
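To make this concrete, the category structure and the constitution relation might be sketched in OWL along the following lines. This is a minimal Manchester-syntax sketch under illustrative names of my own; temporally_constitutes in particular is a stand-in for the relation just described and is not drawn from any published ontology.

Prefix: : <http://example.org/process-sketch#>
Prefix: rdfs: <http://www.w3.org/2000/01/rdf-schema#>
Ontology: <http://example.org/process-sketch>

Class: Occurrent
Class: Process
    SubClassOf: Occurrent
Class: Event
    SubClassOf: Occurrent
DisjointClasses: Process, Event

ObjectProperty: temporally_constitutes
    Annotations: rdfs:comment "o temporally_constitutes p: the occurrent o fills a temporal window of the process p and coincides with p during the interval that this window occupies."
    Domain: Occurrent
    Range: Process

The disjointness axiom reflects the distinction drawn above between mutable processes and immutable events; whether it should be asserted in a production ontology is itself a modelling decision.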

3.2 Interrelation between Processes and Events

3.2.1 Events as Process-Chunks
With these clarifications we can turn our attention to the interrelations between processes and events. Fortunately, the model just developed for dealing with change ascriptions to processes also proves to be very useful for clarifying those interrelations. To do this, it is useful to consider the analogy from the continuant categories of matter and object again. Considered carefully, matter is a rather abstract category. In the reality we experience there is no such thing as “raw” matter: We never see steel or water by themselves but instead chunks of steel and portions of water. Water and steel are still the stuff that the chunks and portions are made of, but we always need to assign a definite extension to them. With processes, things are rather similar. We never come across entities which are just walking or cutting. What we experience and talk about are rather concrete episodes of walking or cutting. In this case, by adding temporal extension as a delimiting factor, we create an event from the process. For example, “the episode of walking from 5:00pm to 5:30pm” would describe an event that delimits, and hence is made of or constituted by, a walking process. Events are thus “chunks” of processes (Galton and Mizoguchi 2009, 82), which is unsurprising given the fact that the distinction between processes and events was motivated by looking at the way matter and objects relate.

3.2.2 Processes as Event-Masses
But there is an additional type of relation between processes and events. This can be seen when one tries to answer the following question: What kind of thing fits into temporal windows? It is clear that we are dealing with an occurrent here, but are we dealing with a mutable or immutable entity? Since temporal windows are aligned to the intrinsic granularity of the process they are temporal windows of, they seem to require a specific time interval to be specified. But if that is the case, the entity contained in each window is already fixed by the boundaries of the window, which would forbid it from changing. It thus needs to be an event. The temporal windows of a process contain events that are atomic with regard to the process: A walking process is (temporally) constituted by a series of step events. But since events are only complete when they are already gone by, this introduces a neat little puzzle: Suppose I am walking across the street and in the middle of a step somebody pushes me so that I do not get hit by an approaching bus: Would I be right to say “I was walking across the street when I was pushed”? From the vantage of the present discussion, and quite counter-intuitively, it seems that this is not the case: If the temporal windows of a walking process can only be filled by (complete) step events, I was only walking until I set out to make the last step before being pushed.


One might try to remedy this by allowing the last temporal window of a process to be filled by the initial segment of the usual process-grain event as well. But there are other cases where this does not seem plausible. For example, the process of flashing a light twice a second would have temporal windows with a duration of 1s. In each window, the process would be constituted by an event of two flashes, which decomposes into two temporal parts with one flash each. If, after some time, I switch to flashing the light only once per second, I would be compelled to include the first flash of the new sequence as belonging to the previous process. The semantics of such interruptions and process replacements seem to be rather subtle, and I am inclined to assume that the question is a material one that needs to be answered on a case-by-case basis.

3.2.3 Integrative Modelling
Still, the assumption that the temporal windows of a process are filled by events is a very useful one when combined with the assumption that some events are made up from processes.7 One can then have processes, which constitute events, which constitute processes (because they form temporal windows). And since this structure can be nested, it opens an avenue for integrating different levels of granularity. Let us take the growth of a cancerous ulcer as an example. The intrinsic granularity of this growth process is the single cell division, and at each point in time the growth process will be temporally constituted by at least one and potentially a large number of cell division events. A cell division in turn is an event with multiple parts (in this case known as phases). If we look at, for example, the anaphase, we can see that it, again, has two parts: the separation of the sister chromosomes and their movement to the respective centrosomes. The event of the chromosomes moving to the centrosomes is in turn constituted by a movement process, which can be analysed further into the events it is made up from. This way, multiple levels of granularity can be combined into a consistent picture without introducing any tight coupling: If we are not interested in the specifics of how chromosomes move to the centrosomes, we can just leave it at describing their movement as an event that is atomic with regard to the process on the higher level. This kind of modeling can be regarded as granularity-neutral. It allows for integration with upper and lower levels where needed, but it does not force modellers to adopt those levels if they have no use for them.
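The nesting in the cancerous ulcer example might be sketched as follows in Manchester syntax. The class names are invented, and temporally_constituted_by (an inverse of the constitution relation of section 3.1) and has_temporal_part are illustrative relations, not fragments of an existing ontology.

Prefix: : <http://example.org/granularity-sketch#>
Prefix: rdfs: <http://www.w3.org/2000/01/rdf-schema#>
Ontology: <http://example.org/granularity-sketch>

ObjectProperty: temporally_constituted_by
ObjectProperty: has_temporal_part

Class: UlcerGrowthProcess
    Annotations: rdfs:comment "Grain: the single cell division event."
    SubClassOf: temporally_constituted_by some CellDivisionEvent
Class: CellDivisionEvent
    SubClassOf: has_temporal_part some AnaphaseEvent
Class: AnaphaseEvent
    SubClassOf: has_temporal_part some ChromosomeMovementEvent
Class: ChromosomeMovementEvent
    Annotations: rdfs:comment "Atomic from the perspective of the levels above; analysable further via the movement process below."
    SubClassOf: temporally_constituted_by some ChromosomeMovementProcess
Class: ChromosomeMovementProcess

A modeller who is not interested in the lower level can simply stop at ChromosomeMovementEvent and treat it as atomic.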

7 Since events can be specified by supplying completely arbitrary fiat boundaries, not all events can be constituted by processes (cf. Stout 2003, 154).

4 AN APPLICATION TO IDO

Similar reasoning could be applied to the Infectious Disease Ontology (cf. Cowell and Smith 2010) in order to link the different levels of granularity represented in the ontology. Doing this is not without issues and might require some arbitrary decisions because IDO, as a BFO-based ontology, applies a quite different classification scheme for occurrents and does, for example, not distinguish between events and processes in any way. As an example, I will pick the IDO8 classes Infectious Disease Course (C) and Infectious Disease Epidemic (E). These are defined as follows:

C =def A disease course that is the realization of an infectious disease.9

E =def A process of infectious disease realizations and for which there is a statistically significant increase in the infectious disease incidence of a population.10

C is consequently a class of occurrent entities that concern a single organism, while E encompasses occurrents that involve populations of such organisms, but IDO only implicitly notes that occurrents of type C have something to do with occurrents of type E. Since it is impossible to judge from the definitions alone, I will assume that both are meant to refer to events. The question that needs to be answered is thus whether one can identify a kind of process from which an epidemic is “made up”. Kawazoe et al. (2008) include a process type called “Spreading” in their ontology, which could be a potential candidate. Intuitively, this seems to be a process in the relevant sense because if a disease d is spreading in a population p during some period of time, it is sensible to claim that it is spreading during every proper subinterval. It is also ontologically sensible to claim that a disease epidemic is constituted by a spreading of the disease: There is actually something going on that effects the increased disease incidence, namely its spreading, i.e. there is an actual change between the initial state of the population (characterised by low disease incidence) and the final state of the population (characterised by increased disease incidence), and this change cannot be reduced to the succession of states. It is also easy to identify the contents required to fill the temporal window of a spreading process. They have to be individual occurrences of the disease. In my mind, however, it is not sufficient to claim that it consists of events of type C because an occurrence of an instance of C does not, by itself, explain how the disease course came to be.

8 All references to IDO refer to r344 of the ontology, which can be obtained via the project's SVN repository at http://infectious-disease-ontology.googlecode.com/.
9 A disease is thus taken to be the disposition to undergo a certain (pathological) process.
10 Italics mine. I will ignore the part of the definition that is associated with disease incidence and the related statistical problems. Also note that this definition does not employ the term “process” in the sense of this paper.


I thus propose that the temporal constituent of a spreading process is the event series composed of an instance of a Process Of Establishing An Infection (IDO_0000603) followed by an instance of C. The intrinsic granularity of a flu spreading process would thus, colloquially speaking, be the sequence of catching the flu and having it. The linkage between the level of the population and the individual organism could then be given by the following preliminary natural-language definitions:

E′ =def An event made up from a process of infectious disease spreading [given that it has the correct statistical properties].

S =def A process temporally constituted by an event type T such that (i) all instances of T constituting the process are spatially and temporally contiguous (they may also overlap), and (ii) all instances of T have one temporal part of type IDO_0000603 followed by a temporal part of type C.

These definitions are still far from accurate (for example, one would also need to require that there is some continuity between the participants in the different “grains” of the process; also, some kind of constraint would need to be established for processes of type S so that the required statistical features obtain). Still, the definition E′ is much more explicit about the relationship between the levels of granularity than the IDO definition.
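For illustration only, the skeleton of these definitions might be approximated in Manchester syntax as follows. The rendering is deliberately incomplete: the contiguity condition (i), the temporal ordering of the two parts, and the statistical constraint are not expressible in plain OWL and are omitted, and apart from the allusion to IDO_0000603 all names are invented for this sketch.

Prefix: : <http://example.org/ido-sketch#>
Prefix: rdfs: <http://www.w3.org/2000/01/rdf-schema#>
Ontology: <http://example.org/ido-sketch>

ObjectProperty: temporally_constituted_by
ObjectProperty: has_temporal_part

Class: Event
Class: Process
Class: InfectiousDiseaseCourse
Class: ProcessOfEstablishingAnInfection
    Annotations: rdfs:comment "Stand-in for IDO_0000603."
Class: SpreadingGrainEvent
    Annotations: rdfs:comment "The grain of S: an event with a temporal part of type IDO_0000603 and one of type C; the ordering of the parts is not captured."
    EquivalentTo: Event
        and (has_temporal_part some ProcessOfEstablishingAnInfection)
        and (has_temporal_part some InfectiousDiseaseCourse)
Class: InfectiousDiseaseSpreading
    Annotations: rdfs:comment "Approximation of S, without contiguity condition (i)."
    EquivalentTo: Process
        and (temporally_constituted_by some SpreadingGrainEvent)
        and (temporally_constituted_by only SpreadingGrainEvent)
Class: InfectiousDiseaseEpidemic
    Annotations: rdfs:comment "Approximation of E', without the statistical condition."
    EquivalentTo: Event
        and (temporally_constituted_by some InfectiousDiseaseSpreading)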

5 CONCLUSION

The problem of whether occurrents can be the subjects of change is a particularly difficult one, since one is always at risk of mistaking linguistic artifacts or mere colloquialisms for proper ontological facts. It is thus useful to show that this kind of scheme has additional merit apart from a proper treatment of mutability. As it turns out, the consideration of this problem does indeed lead to a new perspective on granularity issues that pertain to the ontology of events. Differentiating between processes and events and encapsulating them in one another can reduce the need for a fixed base-granularity in event descriptions, which seems to be useful for projects that need to integrate information from various levels of reality. Designing a concrete ontology that implements the suggestions made in this paper would, however, require quite some work to achieve a working formal definition of process and event mereology, as well as of the constitution relation that seems to be required to properly talk about processes and the atomic events they are made up from.

ACKNOWLEDGEMENTS
This work is supported by the German Science Foundation (DFG) as part of the research project “Good Ontology Design” (GoodOD). Many thanks go to Ludger Jansen and Johannes Röhl (Rostock) for challenging and fruitful discussions on the topic of this paper and to four anonymous reviewers for their learned and insightful comments.

REFERENCES
Baker, L.R. (2002) On Making Things Up: Constitution and Its Critics. Philosophical Topics: Identity and Individuation, 30, 31–52.
Cowell, L.G. and Smith, B. (2010) Infectious Disease Ontology. Infectious Disease Informatics. Ed. by Sintchenko, V. New York, Dordrecht, Heidelberg, London, 373–395.
Galton, A. (2006) On What Goes On: The Ontology of Processes and Events. Formal Ontology in Information Systems: Proceedings of the Fourth International Conference (FOIS 2006). Ed. by Bennett, B. and Fellbaum, Chr. Amsterdam, 4–11.
Galton, A. and Mizoguchi, R. (2009) The water falls but the waterfall does not fall: New perspectives on objects, processes and events. Applied Ontology, 4, 71–107. DOI: 10.3233/AO-2009-0067.
Grenon, P. and Smith, B. (2004) SNAP and SPAN: Towards dynamic spatial ontology. Spatial Cognition and Computation, 4.1, 69–104.
Jansen, L. and Schulz, St. (2010) Grains, Components and Mixtures in Biomedical Ontologies. OBML 2010 Workshop Proceedings. Ed. by Herre, H. et al. Leipzig, 43–46.
Kawazoe, A. et al. (2008) Structuring an event ontology for disease outbreak detection. BMC Bioinformatics, 9, Suppl 3, S8. DOI: 10.1186/1471-2105-9-S3-S8.
Le Novère, N., Courtot, M. and Laibe, C. (2007) Adding Semantics in Kinetics Models of Biochemical Pathways. Proceedings of the 2nd International Symposium on Experimental Standard Conditions of Enzyme Characterizations. Beilstein Institut, Frankfurt a. M., 137–153.
Masolo, C. et al. (2003) WonderWeb Deliverable D17. The WonderWeb Library of Foundational Ontologies and the DOLCE ontology.
Quine, W.V.O. (1960) Word and Object. Cambridge, MA.
Seibt, J. (1997) Existence in Time: From Substance to Process. Perspectives on Time. Ed. by Faye, J., Scheffler, U. and Urchs, M. Dordrecht, 143–182.
Simons, P. (1987) Parts. A Study in Ontology. Oxford.
Stout, R. (2003) The Life of a Process. Process Pragmatism. Essays on a Quiet Philosophical Revolution. Ed. by Debrock, G. Amsterdam and New York: Rodopi, 145–157.
Whitehead, A.N. (1979) Process and Reality: An Essay in Cosmology. New York.

Processes and properties

Colin Batchelor,1* Janna Hastings2 and Christoph Steinbeck2

1 Royal Society of Chemistry, Thomas Graham House, Cambridge, UK CB4 0WF.
2 European Bioinformatics Institute, Hinxton, Cambridge, UK CB10 1SD.

ABSTRACT Many of the entities most commonly studied and investigated by biologists are processes, that is, they involve change over time, such as cell development and blood circulation. Most bio-ontologies such as the Gene Ontology distinguish processes from the entities that participate in them (cells, blood). As bio-ontologies of processes become more sophisticated, however, we need to accurately describe their properties, such as rates and speeds. Unlike the properties of material entities, there is as yet little consensus within the bio-ontology community on how such properties should be represented. Upper-level ontologies such as BFO and DOLCE are divided on whether to allow properties of processes as foundational entities. We discuss the properties of processes in formal ontology and specifically whether they can be said to have qualities, that is to say categorical properties that are separate from those of their participants. We will concentrate on heart rate.

1 INTRODUCTION

Jones drove along the road. How did he drive? He drove without a licence, without insurance, without lights after dark, under the influence of alcohol and above the speed limit. Naughty Jones. But how many of those properties were truly properties of his driving process, and how many of them were “really” properties of something else, such as Jones himself? One test is whether the property goes away when the process finishes. In this simple case we can see that nearly everything is uncontroversially a property of one of the participants, be it Jones or his vehicle, except for perhaps the speed. But what happens when Jones sees the inevitable flashing blue light in his rear-view mirror? In the following paper we look at Jones's heart rate, a difficult case for ontology. The rest of the paper is structured as follows. In section 2 we briefly review existing ontological frameworks for processes. In section 3 we consider ontological dependence, a vital part of the formal-ontological approach to bio-ontologies, and in section 4 we look at how an is_a hierarchy of process properties compares to the typically discrete hierarchies of objects and the typically continuous hierarchies of qualities. We present our conclusions in section 5.

2 PROCESSES

We follow Davidson (1967) and latterly BFO: Smith (2005), DOLCE: Masolo et al. (2003) and GFO: Herre et al. (2007) in identifying processes as first-class citizens in our ontological framework. It goes without saying that we can write genus–differentia definitions for processes. The question we address is what sort of differentia is most suitable for representing the sorts of processes that biomedical scientists are interested in. Taking BFO, DOLCE and GFO together, BFO stands out as not explicitly providing for properties of processes beyond their boundaries and durations. GFO admits process roles, but only DOLCE explicitly identifies temporal qualities as being properties of processes, as applied by Devaraju and Kauppinen (2010) in their account of a blizzard. But all of these are quite vague. The shared framework is that processes/perdurants have proper temporal parts, are never wholly present at a given time, and are clearly distinct from continuants/endurants. Of the three, BFO deals most explicitly with ontological dependence as an organizing principle and has the fewest classes in it. Outwith bio-ontologies, Seibt (2004) takes a monocategorial approach to ontology where all things are processes, but we do not know of any practical implementations of her work. Aitken and Curtis (2002) have an approach taking three perspectives, those of ordering, participants and conditions. Conditions here are preconditions for a process taking place, and postconditions are consequences of a process. There are no obvious slots here for process properties as such. Galton (2008) defines processes as things that are ongoing; once a process is complete and becomes history, he calls it an event. His approach admits process attributes such as speed and throughput.

* To whom correspondence should be addressed.


Johansson (2006) provides a useful structure for thinking about processes. He identifies four kinds of is_a relation, those of genus-subsumption, determinable-subsumption, specialization and specification. Most familiar to us in the bio-ontology community is genus-subsumption, which proceeds by way of class intersection. Less familiar is determinable-subsumption, which applies to certain sorts of quality. Here red, let us say, is the gerrymandered portion of colour that consists of scarlets, vermilions, crimsons and so forth. Another way of expressing this contrast is that genus-subsumption hierarchies are discrete rather than continuous; there is no way of smoothly moving between a benzene molecule and a water molecule, just as there is no way of moving smoothly between a portion of cytoplasm and an ion channel. In contrast, with a determinable-subsumption hierarchy, one can move smoothly across a colour volume, say, without encountering discontinuities. This is exactly like DOLCE's quality–quale distinction. A quality is a determinable and a quale is a determinate. Specialization and specification are directly relevant to our discussion of processes. Specifications of processes involve definitions in terms of the participants, so Jones's driving of a milk float, a tractor, a dodgem, a golf cart, a combine harvester, are all specifications of driving a vehicle. These can be written as class intersections in the normal way. Johansson, however, leaves specialization somewhat vague, giving as examples careful painting, careless painting, fast painting and slow painting, and pointing out that carefulness, carelessness, fastness and slowness in the context of painting derive from painting and aren't universals that are somehow transferable to other activities. We will investigate heart rate in the following sections in the light of this framework.

3 HEART RATE AND ONTOLOGICAL DEPENDENCE

At least on first glance, heart rate appears to be intrinsically bound up with a heart beating process. Furthermore, it allows comparison between different heart beating processes: we would like to have some sort of way of getting at the idea that person A's heart rate is faster than person B's heart rate. This is similar to person A's height, at least in some frameworks a determinable quality of person A, being greater than person B's.

(1) Process P can be more or less X than process P′.

Previous work on heart rate includes Lord and Stevens (2010), who argue for using mathematical modelling as the ontological model for this sort of scenario, which has much to recommend it from a computer programming point of view; but reasoning with combined mathematical and ontological models is not yet well developed, and thus exposing the implications of the mathematical part of the model for standard reasoning tasks such as data integration will be difficult.

Temal et al. (2009) model heart rate under both BFO and DOLCE and conclude that DOLCE is better since it allows heart rate to be a property of the heart beating process. In the Gene Ontology (GOC 2000), 'regulation of heart rate' is a class (GO:0002027) with the definition 'Any process that modulates the frequency or rate of heart contraction.' and the synonyms 'regulation of heart contraction rate', 'cardiac chronotropy', and 'regulation of rate of heart contraction'; however, heart rate is referred to only in the class name as the target of the regulation, and is not explicitly included in the ontology as a separate entity. Lastly, Nunes et al. (2007), using an approach based on DOLCE, explicitly describe the different steps in a single heartbeat but don't consider the heart rate itself.

There is an important way in which heart rate differs from speed. While, since Newton, we have been happy to take the limit of a speed as the length over which we evaluate it goes to zero, there is a minimal length of time over which we can think about a heart rate—a single cycle. The key point is that heart rate is only ever an average, in contrast to Jones driving along a residential street at 31 mph. This means that if the heart rate is a property of anything, it is a property of some proper part, specifically a proper temporal part, of the heart beating process, and different temporal parts have different heart rates. We need to modify proposition (1) to say:

(2) Different parts p, p′ of a process P can have different values X, X′ at times t, t′.

This is similar to how hair colour might be different for different spatial parts of your hair, but more complex because hair colour, like speed, is wholly defined for each (arbitrarily small) part. A more fruitful analogy is with length in geographic science. While most people are comfortable with length being a property (however defined), some lengths, famously that of the coastline of Scotland, vary with scale, getting longer the smaller the ruler you use until we reach the atomic scale and edges become difficult to identify. Similarly, Jones's heart rate at time t differs according to the length of time T, i.e. the precise processual part p, over which we consider it. For length, BFO prefers to talk about projections, so that each object has a projection onto a particular portion of space, and that space has a length. But the same thing applies to, for example, the circumference of the projection of Scotland into space, and it is not clear that using this strategy gets around the underlying problem, since the projection process preserves the ambiguity.
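To make the averaging point concrete, one might write, for a proper temporal part p of a heart beating process (this formulation is illustrative rather than drawn from any of the frameworks discussed):

$$\mathrm{rate}(p) \;=\; \frac{N(p)}{\mathrm{dur}(p)}$$

where N(p) is the number of complete cardiac cycles occurring within p and dur(p) is its duration, the quotient being defined only when dur(p) spans at least one cycle. Since nothing forces rate(p) and rate(p′) to agree for distinct parts p and p′, proposition (2) follows immediately.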


What does this mean for dependence relations? We contrast DOLCE, which allows chains of (presumably) transitive dependence relations, so that qualities can have qualities and those qualities too can have qualities, with BFO, where any property y must first depend directly on some independent continuant x, and second, all y must depend on instances x of some universal X, as opposed to some “defined class”. BFO's approach is stronger in that it is more useful in constraining ontology developers, but is weak because it lacks a clear definition of ontological dependence. Lowe (2010) discusses ontological dependence at length, and it is clear that the sort of dependence intended by BFO in the sense of “independent continuant” as opposed to “dependent continuant” is not unlike Lowe's substance dependence. Lowe defines it in terms of identity dependence:

x depends for its identity upon y iff there is a function f such that it is part of the essence of x that x is f(y)

and then says that x is a substance iff x is a particular and there is no particular y such that y is not identical with x and x depends for its identity upon y. The DOLCE approach would be to say something like:

(3) heart rate depends_on heartbeat depends_on heart

but if we follow BFO and disallow dependence chains then we get:

(4) heart rate depends_on regularly beating heart

and regularly beating heart is not a universal. If a given heart stops beating, or starts to beat irregularly, then it has not changed into a different kind of thing; it has merely changed its behavior. Hence there are two universals here: the heart and the beating process. Is there a way out that allows us to define heart rate in terms of formal relations in the sense of Smith and Grenon (2004)? The first thing to observe is that there are hidden dependence relations in the BFO framework. For example, every disposition depends on some underlying quality. Every process depends on some disposition. Likewise, just as the life of a frog substance-depends on the frog, so conversely the frog depends on its own life. But these dependence relations are weaker than Lowe's substance dependence, and are examples of Lowe's essential existential dependence, the case where it is part of the essence of x that x exists only if y exists. So our dependence relations at the class level, according to the pattern all x related_to some y, for heart rate are as follows:

(5) heart rate substance_depends_on heart
(6) heart rate existentially_depends_on heartbeat
(7) heartbeat substance_depends_on heart

Without all three parts of the triad (5)–(7) we cannot have a heart rate.
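Under the pattern all x related_to some y, the triad can be written directly as class-level axioms. The following Manchester-syntax fragment is a sketch with invented identifiers; neither BFO nor any released ontology defines these properties under these names.

Prefix: : <http://example.org/heart-rate-sketch#>
Prefix: rdfs: <http://www.w3.org/2000/01/rdf-schema#>
Ontology: <http://example.org/heart-rate-sketch>

ObjectProperty: substance_depends_on
ObjectProperty: existentially_depends_on

Class: Heart
Class: Heartbeat
    Annotations: rdfs:comment "(7): every heartbeat substance-depends on some heart."
    SubClassOf: substance_depends_on some Heart
Class: HeartRate
    Annotations: rdfs:comment "(5) and (6): every heart rate substance-depends on some heart and existentially depends on some heartbeat."
    SubClassOf: substance_depends_on some Heart,
        existentially_depends_on some Heartbeat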

4 CLASSIFYING PROCESS QUALITIES

BFO and DOLCE concur in dividing most of the furniture of the universe into things wholly present at a given time, which BFO calls “continuants” and DOLCE calls “endurants”, and things not wholly present at a given time, which BFO calls “occurrents” and DOLCE calls “processes”. They do, however, give us different answers to whether a quality is a continuant/endurant or an occurrent/perdurant. BFO says a continuant; DOLCE says neither. Since our own work, see for example Batchelor et al. (2010), relies on dispositional properties, on which DOLCE is silent, we will attempt to build on the BFO framework. One argument against process qualities per se, voiced by a participant at an OBO Foundry meeting, is that a quality, a dependent continuant, is wholly present at one point in time and can change over time (for example, your height changes as you grow up), whereas a process quality depends on the process as a whole, and thus would have to extend over time (the time of the process) and could neither be wholly present at a given time nor change over time. Hence it would have to be an occurrent rather than a continuant. We have already dealt with that last point and identify a process quality as being an occurrent in the general case, see proposition (2), but there are further difficulties with that counterargument. The possibly naïve straw man that it counters is that instead of defining a 90 bpm heartbeat as

(8) heartbeat that has_process_quality 90 bpm

the BFO approach would instead say

(9) heartbeat at 90 bpm is_a heartbeat

In other words, the BFO approach is to use the subsumption hierarchy for heart beating processes to model attributes of processes. How do we decide between the two? (9) is simpler, but it fails the basic test of proposition (1). There is nothing in the structure of (9) that enables us to say that heartbeat at 90 bpm is quantitatively different from heartbeat at 100 bpm. There is also nothing to contrast these classes with other heartbeat subclasses such as feeble heartbeat or pounding heartbeat, which cannot be straightforwardly quantitatively compared.


Proposition (8) achieves this since it includes '90 bpm' as a first-class entity in its own right. A further argument for process qualities comes from considering Johansson's (2006) kinds of is_a relation. We can see straight away that if we have heart rates as a kind of process quality, then they fit into a clean determinable-subsumption hierarchy, and proposition (8) becomes a sort of specification that we can handle with an OWL reasoner in the usual way. Conversely, we see that while

(10) heartbeat is_a cyclical physiological process

is an example of genus subsumption, any tree that contains both (9) and (10) will mix subsumption relations, since (9) is really a determinable-subsumption relation where the processes stand in relation to each other as determinable and determinates. Therefore we reject (9) and choose (8). There are process qualities, and they are occurrents, because they are not wholly present at any one point of time, in contrast to BFO qualities. We can now go further and suggest that this sheds light on Johansson's mysterious specialization relation between properties. Just as in a specification hierarchy each specification is a genus–differentia definition of a process where the differentia fits into a genus-subsumption hierarchy, so is a specialization hierarchy one where each definition is a genus–differentia definition where the differentia fits into a determinable-subsumption hierarchy.
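As a sketch of how (8) might be put to work with a reasoner, '90 bpm' can be modelled as a determinate class beneath the determinable heart rate; the identifiers below are invented for illustration.

Prefix: : <http://example.org/process-quality-sketch#>
Prefix: rdfs: <http://www.w3.org/2000/01/rdf-schema#>
Ontology: <http://example.org/process-quality-sketch>

ObjectProperty: has_process_quality

Class: HeartRate
    Annotations: rdfs:comment "Determinable process quality."
Class: Rate90Bpm
    Annotations: rdfs:comment "Determinate falling under the determinable heart rate."
    SubClassOf: HeartRate
Class: Heartbeat
Class: HeartbeatAt90Bpm
    Annotations: rdfs:comment "Proposition (8): a heartbeat bearing a 90 bpm process quality."
    EquivalentTo: Heartbeat and (has_process_quality some Rate90Bpm)

Because Rate90Bpm and a hypothetical Rate100Bpm would be siblings in the determinable-subsumption hierarchy, the quantitative contrast that (9) fails to express is at least located in entities of the right kind.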

5 CONCLUSIONS

We have argued that upper-level ontologies should accept process qualities in order to handle things like heart rate, on the basis of its ontological dependence relations and from a consideration of Johansson's four kinds of is_a relation. Process qualities themselves are occurrents. They existentially-depend on the processes that they are process qualities of and substance-depend on the participants of those processes. They fit into a determinable-subsumption hierarchy. Not all things that might be considered to be process qualities, however, should be. In our view, speed and acceleration, for example, being wholly present at a single point in time, are conventional continuant qualities. In future work we will consider the rates of chemical reactions and whether they are qualities or process qualities.

ACKNOWLEDGEMENTS JH thanks Werner Ceusters and Barry Smith for useful discussions. This work was partially supported by the BBSRC, grant agreement number BB/G022747/1 within the "Bioinformatics and biological resources" fund.

REFERENCES
Aitken, S. and Curtis, J. (2002) A Process Ontology, in Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW 02), ed. Gomez-Perez, A., Springer Verlag, pp. 108–113.
Batchelor, C., Hastings, J. and Steinbeck, C. (2010) Ontological dependence, dispositions and institutional reality in chemistry, Frontiers in Artificial Intelligence and Applications, 209, 271–284.
Davidson, D. (1967) The Logical Form of Action Sentences, in The Logic of Action and Decision, ed. Rescher, N., Pittsburgh University Press, Pittsburgh, PA, 81–95.
Devaraju, G. and Kauppinen, T. (2010) GeoProcesses and Properties Observed by Sensors: Can We Relate Them?, in Proceedings of GeoChange 2010 – GIScience for Environmental Change, Campos do Jordao, Sao Paulo, Brazil.
Galton, A. (2008) J. Logic Computation, 18(3), 323–340.
Herre, H., Heller, B., Burek, P., Hoehndorf, R., Loebe, F. and Michalek, H., General Formal Ontology (GFO): A Foundational Ontology Integrating Objects and Processes. Part 1: Basic Principles. Research Group Ontologies in Medicine (Onto-Med), University of Leipzig.
Johansson, I. (2006) Four Kinds of “Is_A” Relations: genus-subsumption, determinable-subsumption, specification and specialization, in WSPI 2006: Contributions to the Third International Workshop on Philosophy and Informatics, ed. I. Johansson and B. Klein, Saarbruecken, May 3–4.
Lord, P. and Stevens, R. (2010) Adding a Little Reality to Building Ontologies for Biology, PLoS ONE, 5(9), e12258.
Lowe, E.J. (2010) Ontological Dependence, The Stanford Encyclopedia of Philosophy (Spring 2010 Edition), ed. E. Zalta, http://plato.stanford.edu/archives/spr2010/entries/dependence-ontological/
Masolo, C., Borgo, S., Gangemi, A., Guarino, N., Oltramari, A. and Schneider, L. (2003) WonderWeb Deliverable D17, http://www.loa-cnr.it/Papers/DOLCE2.1-FOL.pdf
Nunes, B.G., Guizzardi, G. and Filho, J.G.P. (2007) An Electrocardiogram (ECG) Domain Ontology, in Proceedings of the Second Brazilian Workshop on Ontologies and Metamodels for Software and Data Engineering (WOMSDE'07), 22nd Brazilian Symposium on Databases (SBBD)/21st Brazilian Symposium on Software Engineering (SBES), Joao Pessoa, Brazil.
Seibt, J. (2004) Free Process Theory: Towards a Typology of Occurrings, Axiomathes, 14, 23–55.
Smith, B. (2005) Against Fantology, in Experience and Analysis, ed. Reicher, M.E. and Marek, J.C., öbv&hpt, Vienna.
Smith, B. and Grenon, P. (2004) The Cornucopia of Formal Ontological Relations, Dialectica, 58, 279–296.


Vital Sign Ontology

Albert Goldfain*, Barry Smith**, Sivaram Arabandi†, Mathias Brochhausen‡, William R. Hogan•

*Blue Highway LLC, Syracuse NY; **University at Buffalo, Buffalo NY; †The Evolvers Group LP, Flower Mound TX; ‡IFOMIS, Saarbrücken, Germany; •University of Arkansas for Medical Sciences, Little Rock AR

ABSTRACT We introduce the Vital Sign Ontology (VSO), an extension of the Ontology for General Medical Science (OGMS) that covers the consensus human vital signs: blood pressure, body temperature, respiratory rate, and pulse rate. VSO provides a controlled structured vocabulary for describing vital sign measurement data, the processes of measuring vital signs, and the anatomical entities participating in such measurements. VSO is implemented in OWL-DL and follows OBO Foundry guidelines and best practices. If properly developed and extended, we believe the VSO will find applications for the EMR, clinical informatics, and medical device communities.

1 INTRODUCTION

The Vital Sign Ontology (VSO)1 is a realist ontology covering the four bodily qualities that have by consensus been identified as human ‘vital signs’: blood pressure, body temperature, pulse rate, and respiratory rate. These qualities are measured at least once in almost every healthcare encounter, and they are continuously monitored in intensive care situations. The vital signs are universally accepted as clinically significant not only because they are signs of life (they collectively help to differentiate a living from a dead human organism), but because they are reliable indicators of a patient's current and future health state. Our goal in developing VSO is to provide a scientifically rigorous, consistent, computable, and extensible controlled vocabulary to facilitate data exchange and annotation in applications where a reference to vital signs is required. The terms in VSO are defined using both concise natural language definitions and OWL-DL. Although the vital signs are typically measured together and reported together, there is little ontological support for the class ‘vital sign’ being a universal. There are, in fact, very many signs that an organism is alive and very many indicators of its future health state, so without a principle of exclusion, we should expect the class of vital signs to be much larger than the consensus four. As such, the notion of ‘vital sign’ demands some attention from the ontology community.

* To whom correspondence should be addressed.
1 http://www.buffalo.edu/~ag33/vso.owl

We take universals to be the counterparts in reality of (some of) the general terms used in the formulation of scientific theories [1]. Unlike ‘vital sign’, the classes ‘blood pressure’, ‘body temperature’, ‘pulse rate’, and ‘respiration rate’ are universals. These would be classified as qualities and seem to have all of the following features in common:

(1) Measurability: Vital signs have been measured (with increasing accuracy) for a large part of medical history.
(2) Necessity: With rare exceptions (e.g., pulseless artificial hearts), the vital signs must be present in a living human organism.
(3) Punctuality: A rapid and significant change in vital signs typically signifies a rapid deterioration or improvement of health state. This is in contrast to lagging indicators such as fluid intake level, which would lead to thirst before dehydration has an impact on the core four vital signs.
(4) Regulated: The mechanisms of human homeostasis induce a response whenever any of these qualities departs from an acceptable range.
(5) Causal relevance in many pathological processes: At least one of the vital signs will be affected in any significant short-term departure from health. The subclass of disease courses in which vital signs change is very large.
(6) Well understood: Vital signs are interpreted within an entrenched theoretical framework of anatomy, physiology, and pathophysiology. Within these fields, there are well established ranges of normal values for vital sign measurements and well established ranges of abnormal values for particular conditions.
(7) Signs of cardiovascular functioning: The consensus four vital signs are qualities of anatomical components of the circulatory and respiratory systems.

Using these features as principles of exclusion, we may get nearer to the essence of the four traditional vital signs used to assess basic body functioning.


In certain clinical contexts, these may be expanded to include other signs such as ‘oxygen saturation’ [2], end-tidal CO2 [3], and pain [4]. VSO describes the properly functioning physiology associated with the vital signs. Departures from proper functioning fall outside the purview of VSO but can be described by combining terms from VSO and a disease ontology. A brief description of how the VSO can be used in these circumstances is given in the discussion section.

2 VSO AND OBO

VSO contains terms for the consensus vital signs, as well as vital sign measurement processes, vital sign measurement data (the outputs of such processes), and the various anatomical entities that the vital signs are qualities of. The Ontology for General Medical Science (OGMS) contains a set of high-level clinical terms, including: ‘disease’, ‘disorder’, ‘syndrome’, ‘sign’, ‘symptom’, ‘disease course’, and ‘diagnosis’. OGMS is built around a general theory of disease described in [5]. VSO interoperates with OGMS and fills a niche in the suite of Open Biomedical Ontologies (OBO). VSO depends on the following ontologies:

• Basic Formal Ontology (BFO): continuants, occurrents
• Ontology for General Medical Science (OGMS): sign
• Ontology for Biomedical Investigations (OBI): measurement process, measurement datum
• Phenotypic Qualities Ontology (PATO): rate, temperature, pressure
• Foundational Model of Anatomy (FMA): anatomical entities
• Gene Ontology (GO): biological processes, regulation of a biological quality

There is currently no OBO coverage for cardiopulmonary physiology or medical device types. As such, VSO includes a thin representation of the cardiac cycle phases and vital sign measurement devices. The scope of VSO is currently limited to human vital signs, although it can potentially cover many similar organisms. VSO imports the theoretical commitments of OGMS, including a particular theory of signs and symptoms as they relate to disease. Although the issue of signs and symptoms is by no means settled in the OGMS community, OGMS is committed at least to an objective/subjective distinction between signs and symptoms. Signs are objectively observed, measured, and quantified. Symptoms are experienced, in a first-person, subjective, private sort of way. As such, entities such as pain, which many consider to be a fifth vital sign, would actually be considered the first vital symptom according to current distinctions in OGMS. In OGMS, the term ‘sign’ refers to a disjunctive or defined class encompassing several OGMS universals. The presence of disorders, diseases, disease courses, and pathological processes can all signify something (and thus could be signs). As such, ontology developers seeking to use OGMS are advised to subclass entities on the basis of these universals. There is something epistemic about signs; a sign is not a sign unless it signifies something to someone, and it cannot signify something to someone unless some framework for its interpretation is established. Since the universals represented in an ontology should be understood to exist independently of any epistemic phenomena, ‘sign’ cannot be a universal. However, these issues are left to OGMS, as they do not have a significant impact on the vital signs in VSO.

3 ORGANIZATION

The relational organization of VSO is described using the OBO Relation Ontology (RO) relations and proposed extensions. The basic relational structure of the term ‘systolic left ventricular pressure’ is illustrated in Figure 1.

3.1 Blood Pressure

Blood pressure is defined as the pressure exerted by circulating blood on the walls of blood vessels. In VSO, ‘Blood pressure’ is asserted as a pato:pressure (which is a bfo:quality) as well as a member of the defined class ‘vital sign’. Blood pressure subtypes are first differentiated by the cardiac cycle phase during which the quality exists: systole (the period of contraction) or diastole (the period of relaxation). These, along with other temporal intervals in the cardiac cycle, are represented in VSO because they are of central importance to describing vital signs and, as previously stated, there is currently no OBO coverage of cardiac cycle terms. Systolic and diastolic blood pressure are further differentiated by the blood vessel wall (FMA anatomical entity) that the pressure is exerted towards. We use the RO relation ‘towards’ as a relation binding a quality to a material entity. The anatomical location helps to differentiate between central and peripheral blood pressure. For this purpose, anatomical terms are imported from the FMA using the MIREOT mechanism [6]. The two levels of differentiation yield necessary and sufficient conditions for blood pressure subtypes:

Figure 1. VSO relational structure around the term ‘systolic left ventricular pressure’.

‘systolic blood pressure’ = ‘blood pressure’ AND exists_during SOME systole

‘systolic left ventricular pressure’ = ‘systolic blood pressure’ AND towards ONLY ‘wall of left ventricle’

The ordering of differentia in the hierarchy impacts the way in which blood pressure can be referred to by VSO users. It is possible to refer to ‘blood pressure’ without specifying the anatomical location or cardiac cycle phase, and it is possible to refer to ‘systolic blood pressure’ without specifying anatomical location (as is typical in clinical settings), but to refer to ‘left ventricular pressure’ without specifying cardiac cycle phase, a new class must be built via cross products:

‘left ventricular pressure’ = ‘blood pressure’ AND towards ONLY ‘wall of left ventricle’

The fact that ‘left ventricular pressure’ is not in the asserted (single-inheritance) hierarchy of VSO, therefore, does not deter a user from referring to it. This is important because there are very many ways to differentiate vital signs (see section 4 below), and these are bound to vary from application to application.
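For concreteness, the definitions above can be written as a loadable Manchester-syntax fragment. The CamelCase identifiers are stand-ins for the VSO, PATO, and FMA terms; the IRIs and exact axioms of the released VSO may differ.

Prefix: : <http://example.org/vso-sketch#>
Prefix: rdfs: <http://www.w3.org/2000/01/rdf-schema#>
Ontology: <http://example.org/vso-sketch>

ObjectProperty: exists_during
ObjectProperty: towards

Class: BloodPressure
Class: Systole
Class: WallOfLeftVentricle

Class: SystolicBloodPressure
    EquivalentTo: BloodPressure and (exists_during some Systole)
Class: SystolicLeftVentricularPressure
    EquivalentTo: SystolicBloodPressure and (towards only WallOfLeftVentricle)
Class: LeftVentricularPressure
    Annotations: rdfs:comment "Cross product; not in the asserted single-inheritance hierarchy."
    EquivalentTo: BloodPressure and (towards only WallOfLeftVentricle)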

3.2 Body Temperature

Body temperature is defined as the temperature of a part of the human body. Body temperature subtypes are differentiated by the anatomical part where the measurement occurs. The body temperature is fairly uniform regardless of measurement site, but it is important to represent the site because normal values for temperature vary by site in small but clinically significant ways, and because different patients will require temperature monitoring in different ways. For example, rectal thermometry is typically used for very young patients. Many of the anatomical sites where temperature is measured are holes, lumens, and cavities. There are general ontological issues with measuring qualities of immaterial parts (Is the depth of a (dental) cavity a quality of the cavity or a quality of the tooth in which it is a cavity?). These can be resolved in VSO by noting that the air temperature in any one of these ontological holes may actually be what gets measured. This measurement is used to infer the temperature of the enclosing part, which is in turn used to infer the body temperature. All vital sign measurements lie on a spectrum of how direct or inferred the measurement is. Blood pressure, pulse rate, and respiration rate are all qualities whose measured values oscillate and that are interpreted by the ways in which their values oscillate. Body temperature also oscillates, albeit over a longer window. In fact, this can be used as a fertility sign in women. This illustrates that a reasoner should be able to infer that any of the vital signs may signify different things in different contexts.



3.3 Pulse Rate and Respiratory Rate

Pulse rate is defined as the rate at which an artery pulses (i.e., participates in expansion–contraction cycles) as blood passes through it. Pulse rate subtypes are differentiated by the particular artery (from the FMA) that is undergoing a pulsation process. Pulse rate is often conflated with heart rate, although they are not identical (ontologically or clinically); it is one of a network of closely related signs observed or inferred by monitoring the pumping of the heart: cardiac output, stroke volume, and ejection fraction, among others. Respiratory rate is defined as the rate at which an organism breathes and is not differentiated any further in the current version of VSO. Rates are qualities in both PATO and BFO. Specifically, rates are temporal derivative qualities,2 depending simultaneously on their bearers and time. The measurement processes for pulse rate and respiration rate will, in turn, involve measurements of an anatomical site (e.g., by palpation) and of time (e.g., by stopwatch).

2 See http://code.google.com/p/bfo/wiki/Bfo2DeterminableDeterminate

4 VSO INFERRED HIERARCHY

In clinical settings, vital signs are contextualized relative to not only anatomical location (left-ventricular blood pressure), but also: kinetic state (resting pulse rate, ambulatory blood pressure), postural state (sitting blood pressure), time of day (night-time respiration rate), stage within a process (blood pressure during REM sleep), stage within a life course (premenopausal body temperature), measurement method (invasive blood pressure measurement), relative to a therapy or treatment (postoperative blood pressure), and relative to a disease state (sepsis-induced hypotension). Creating a code and a class (in an ontology) to accommodate each of these would quickly result in a resource that is difficult to maintain and extend. Instead, we can create any of these compositional classes by importing terms from relevant OBO ontologies and description logic restrictions. For example, blood pressure is known to rise during the REM sleep stage, and thus, we may want to refer to entities such as ‘REM systolic blood pressure’:

‘systolic blood pressure during REM sleep’ = ‘blood pressure’ AND exists_during ONLY systole AND occurs_during SOME REM sleep

Notice that the DL easily allows us to demarcate the temporal interval relative to both the REM stage of sleep and the oscillating periods of systole within REM.
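Written out as a loadable fragment in the same illustrative style as the earlier blood pressure sketch (identifiers invented; VSO's actual terms may differ), the cross product reads:

Prefix: : <http://example.org/vso-sketch#>
Prefix: rdfs: <http://www.w3.org/2000/01/rdf-schema#>
Ontology: <http://example.org/vso-rem-sketch>

ObjectProperty: exists_during
ObjectProperty: occurs_during

Class: BloodPressure
Class: Systole
Class: REMSleep

Class: SystolicBloodPressureDuringREMSleep
    Annotations: rdfs:comment "Compositional class built on demand rather than asserted in the hierarchy."
    EquivalentTo: BloodPressure
        and (exists_during only Systole)
        and (occurs_during some REMSleep)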

5 DISCUSSION

When VSO is paired with disease ontologies (such as the Infectious Disease Ontology), it becomes possible to specify that a certain vital sign profile is a consequence of a particular pathological process in a disease course (e.g., ‘sepsis-induced hypotension’ or ‘recurrent fever in P. falciparum malaria’). This technique is being applied in the semantic alarm framework for multiparameter monitoring devices [7]. We are confident that VSO can also find application in settings where multiple devices from multiple manufacturers need to exchange vital sign data. Such devices typically report data in manufacturer-specific formats. Since VSO is designed to be application- and manufacturer-agnostic, it could be used as a contract for the meaning of shared vital signs data. Also, in such settings, if different devices are measuring the same vital sign (in a different way), data provenance and the measurement process associated with the redundant devices may become important. VSO can help in this regard because it contains terms for the various measurement processes involved.

6 CONCLUSION

VSO fills a gap in OBO ontology coverage of clinical signs. Preliminary development of the VSO has uncovered a host of interesting ontological issues that we believe are worthy of attention from the bio-ontology community. Further development of VSO will require coordination of existing resources and the principled creation of new ones. We believe this will establish VSO as a valuable resource for systems that produce and consume vital sign data.

REFERENCES


[1] Smith, B. and Ceusters, W. (2010) Ontological Realism as a Methodology for Coordinated Evolution of Scientific Ontologies. Applied Ontology, 5(3–4): 139–188.
[2] Neff, T.A. (1988) Routine Oximetry: A fifth vital sign? Chest, 94(2): 227.
[3] Vardi, A., Levin, I., Paret, G. and Barzilay, Z. (2000) The sixth vital sign: end-tidal CO2 in pediatric trauma patients during transport. Harefuah, 139(3–4): 85–87, 168.
[4] McCaffery, M. and Pasero, C.L. (1997) Pain ratings: the fifth vital sign. American Journal of Nursing, 97(2): 15–16.
[5] Scheuermann, R.H., Ceusters, W. and Smith, B. (2009) Toward an Ontological Treatment of Disease and Diagnosis. Proc. of AMIA 2009 Summit on Translational Bioinformatics, 116–120.
[6] Courtot, M., Gibson, F., Lister, A.L., Malone, J., Schober, D., Brinkman, R.R. and Ruttenberg, A. (2011) MIREOT: The minimum information to reference an external ontology term. Applied Ontology, 6(1): 23–33.
[7] Goldfain, A., Chowdhury, A., Xu, M., DelloStritto, J. and Bona, J. (2011) Semantic Alarms in Medical Device Networks. Proceedings of Third Joint HCMDSS/MDPnP Workshop.

An Ontological Representation of Biomedical Data Sources and Records

Michael Bada, Kevin Livingston, and Lawrence Hunter
University of Colorado Anschutz Medical Campus, Aurora, CO, USA

ABSTRACT
Large RDF-triple stores have been the basis of prominent recent attempts to integrate the vast quantities of data in semantically divergent databases. However, these repositories often conflate data-source records, which are information content entities, and the biomedical concepts and assertions denoted by them. We propose an ontological model for the representation of data sources and their records as an extension of the Information Artifact Ontology. Using this model, we have consistently represented the contents of 17 prominent biomedical databases as a 5.6-billion RDF-triple knowledge base, enabling querying and inference over this large store of integrated data.

1 INTRODUCTION

The rising importance of high-throughput analysis methods in biomedical research relies upon effectively making use of the ever-growing amount of data and knowledge stored in a profusion of distributed, heterogeneous biomedical databases. Large RDF-triple stores have been the basis of prominent recent attempts to integrate the vast stores of data in these many semantically divergent databases (e.g., Belleau et al., 2008; Ruttenberg et al., 2009). The motivation behind such integration efforts is to take advantage of these existing data and knowledge and apply them to current biomedical investigations: By querying the knowledge base and synthesizing relevant information with novel data and hypotheses of interest, we hope to accelerate scientific discovery. However, effective synthesis of the data relies upon a synthesis or mutual mapping of the disparate knowledge models of the data sources. Ideally, we would like to base this integration on a representation of biomedical reality (or at least our conceptualization of biomedical reality) grounded in high-quality community ontologies, but the representation in most large biomedical stores, which are predominantly founded on relational-database technology, is not consistent with this paradigm. Because a rigorous representation of these stored data as ontologically grounded biomedical concepts will likely be difficult in many cases, we are proposing an OWL-based model for the representation of these database records as an intermediate solution for the integration of these data in RDF stores.

We are using this representation in our construction of KaBOB (Knowledge Base of Biology), a large RDF store in which we are currently representing the contents of 17 prominent biomedical databases, including DIP, Entrez Gene, GAD, GOA, HGNC, HomoloGene, InterPro, KEGG, MGD, OMIM, PharmGKB, Reactome, TRANSFAC, and UniProt, and 14 ontologies whose terms are referenced in these databases. Our representation builds on the Information Artifact Ontology (IAO) (http://code.google.com/p/information-artifact-ontology/), which focuses on the representation of information content entities, where such an entity is defined as one that “is generically dependent on some artifact and stands in relation of aboutness to some entity”. We have used this model to represent all of the records of the files of these biomedical data sources, resulting in a knowledge base of 5.6 billion RDF triples. This is an intermediate solution to the persistent problem of integration of disparate databases: While these data have not yet been rigorously represented in terms of biomedical concepts, the data records have been consistently represented as information content entities in one resource, thus permitting their effective querying and inference. Explicitly representing data sources' information content entities such as records, fields, field values, and schemas will also enable the explicit representation of the axiomatizations for the conversion from data records to statements of biomedical concepts in the same knowledge base—a preferable strategy to burying such conversion in code. Additionally, this representation will allow us to make fine-grained statements of provenance of the assertions of biomedical concepts. Furthermore, as our representation is not specific to KaBOB, it could serve as a model for RDF-based distribution of databases, allowing interoperability among distributed data sources.

2 RESULTS AND DISCUSSION

We can categorize the components of our model into three layers of representation of data sources: (1) basic classes general to data sources, (2) instances of these basic classes to represent a specific data source, and (3) instances that represent the records of this data source.


Fig. 1. An OWLPropViz (http://www.wachsmann.tk/owlpropviz/) rendering of the basic classes and relations of our ontological modeling of data sources and their contents. The links among these classes are shown, as are the links to existing IAO classes.

2.1 Basic Representation of Data Sources

Starting with the basic classes general to data sources, we use the existing IAO:data set, which is defined in the IAO as a "data item that is an aggregate of other data items of the same type that have something in common", i.e., a collection of like data. Though the large majority of the data currently stored in KaBOB were accumulated from biomedical databases, we have also allowed for the representation of data sets from sources other than databases, e.g., experimental data sets. We have created KIAO:data source as a subclass of IAO:information content entity and KIAO:database as a subclass of KIAO:data source. (KaBOB has branches that hold extensions of the ontologies that we have imported. Our notational convention for an extension of an ontology is to prefix the ontology's official prefix with a "K", with the semantics that it is the KaBOB extension of the given ontology; thus, KIAO concepts are extensions of the IAO.) A data set is declared to be an integral part of a data source, and a data source is composed of one or more data sets. As databases are often distributed in the form of one or more text files of like records, we have been regarding a data set as one of these files that is part of a database. However, this conceptualization is not specified within our ontology, and other users are free to regard, e.g., the Web pages of a database as its records. (We do not regard this as inconsistent, as data sources can be output as different data sets with different schemas; the user should make clear which data sets are being represented with our model.) We have additionally created KIAO:schema and KIAO:field as subclasses of IAO:information content entity and KIAO:record and KIAO:field value as subclasses of the more specific IAO:data item. A field is an integral part of a schema, and a field value is an integral part of a record, which is itself an integral part of a data set. A record has a schema as its template, and a field value has a field as its template; additionally, a data set is linked to the (same) template of its member data. A simplified view of the general classes of our model, i.e., those not specific to any particular data source, is shown in Figure 1.
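
For concreteness, the class layer can be written down as a handful of subclass axioms. The following minimal sketch uses Python and rdflib; the KIAO namespace, the CamelCase class names, and the use of IAO_0000030 ("information content entity") and IAO_0000027 ("data item") as superclasses are illustrative stand-ins, not the actual KaBOB identifiers:

    from rdflib import Graph, Namespace, RDFS

    # Placeholder namespaces; the real KaBOB/KIAO URIs are not given in the paper
    IAO = Namespace("http://purl.obolibrary.org/obo/IAO_")
    KIAO = Namespace("http://example.org/kiao/")

    g = Graph()
    # KIAO:data source and KIAO:database extend IAO:information content entity
    g.add((KIAO.DataSource, RDFS.subClassOf, IAO["0000030"]))
    g.add((KIAO.Database, RDFS.subClassOf, KIAO.DataSource))
    # Schemas and fields are also information content entities; records and
    # field values are the more specific IAO:data item
    g.add((KIAO.Schema, RDFS.subClassOf, IAO["0000030"]))
    g.add((KIAO.Field, RDFS.subClassOf, IAO["0000030"]))
    g.add((KIAO.Record, RDFS.subClassOf, IAO["0000027"]))
    g.add((KIAO.FieldValue, RDFS.subClassOf, IAO["0000027"]))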

2.2 Representation of a Specific Data Source

An instance of KIAO:data source (or the more specific KIAO:database) is created to represent the data source itself, and an instance of IAO:data set is created for each data set of the data source and made an integral part of the data-source instance. Additionally, for each data set, an instance of KIAO:schema is created and asserted to be the template of the data set. Lastly, for each created schema instance, an instance of KIAO:field is created for each field of the schema and made an integral part of the schema. Figure 2 displays an example set of these types of instances for our storage of the DIP database. All of these data-source-specific instances are dynamically created (and named) during RDF generation. Our ontology is fully capable of handling the evolution of data sources: If the schema of a given data set is changed, a new instance of the schema is simply created, along with the instances of the fields of the new schema. If the data sets of a data source change (or a new set is made available), an instance for each new data set can be created, along with instances for its schema and fields. (Modeling of incremental change rather than creation of new instances may be desirable but poses significant representational challenges.) Additionally, using our model, if a researcher wishes to work with multiple versions of a given data source (e.g., to analyze some aspect of multiple versions of a given database), an instance for each version of the data source can be created. If different versions of a data source consist of different data sets (e.g., different file organizations) and/or different schemas and fields, the explicit representation of all of these elements and their linkages will make the respective structures of the disparate data-source versions unambiguous. Furthermore, it may be the case that only a subset of a data source needs to be represented; in such a case, only instances of the data sets, schemas, and fields of interest are created.
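
A sketch of this instance layer for a DIP-like source, with placeholder namespaces and with isIntegralPartOf/hasTemplate as stand-ins for the paper's integral-part and template relations:

    from rdflib import Graph, Namespace, RDF

    KIAO = Namespace("http://example.org/kiao/")   # placeholder KaBOB-extension namespace
    KB = Namespace("http://example.org/kabob/")    # placeholder instance namespace

    g = Graph()
    # One instance each for the database, a data set (file), its schema, and a field
    g.add((KB.DIP, RDF.type, KIAO.Database))
    g.add((KB.DipInteractionFile, RDF.type, KIAO.DataSet))
    g.add((KB.DipInteractionFile, KIAO.isIntegralPartOf, KB.DIP))
    g.add((KB.DipInteractionFileSchema, RDF.type, KIAO.Schema))
    g.add((KB.DipInteractionFile, KIAO.hasTemplate, KB.DipInteractionFileSchema))
    g.add((KB.DipInteractionFileField1, RDF.type, KIAO.Field))
    g.add((KB.DipInteractionFileField1, KIAO.isIntegralPartOf, KB.DipInteractionFileSchema))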

Fig. 2. An OWLPropViz rendering of instances representing the DIP database, a DIP data set (a file of protein-protein interactions), the schema for this data set, and two basic fields of this schema (representing the two interactors of an interaction).

2.3 Basic Representation of the Data of a Specific Data Source

This lowest-level part of our model focuses on the representation of the data of a given data source through the instantiation of KIAO:record and KIAO:field value: An instance of the former is created for each entry (i.e., record) in a given data set, and an instance of the latter for each field value of each of these records. Each record is made an integral part of the previously created data-set instance and has the previously created schema instance declared its template. Analogously, each field value of a given record is made an integral part of the record and has the previously created field instance declared its template. Each field value is its own instance, even if its actual value is the same as another; representing the actual values is not discussed here due to space limitations. Figure 3 displays an example set of instances representing one record of the data set represented in Figure 2 and two of the field values of this record. This layer, which represents the actual data of the data source, accounts for the large majority of the triples generated based on this model. As seen in Figure 3, three triples are generated for each record instance and four triples for each field-value instance (including one denoting its value, not shown in the figure). This is a more verbose representation compared to one in which assertions mirroring the structure of the data source are created; however, whereas other models typically conflate representation of data-source content with assertions of biomedical concepts, our model accurately represents this content, which can then be used to track the provenance of biomedical-concept assertions.
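
The per-record triple counts described above can be made concrete with a small sketch (same placeholder namespaces and relation names as before; the interactor value is invented):

    from rdflib import Graph, Literal, Namespace, RDF

    KIAO = Namespace("http://example.org/kiao/")
    KB = Namespace("http://example.org/kabob/")

    g = Graph()
    # Three triples per record instance
    rec = KB.DipInteractionFileRecord1
    g.add((rec, RDF.type, KIAO.Record))
    g.add((rec, KIAO.isIntegralPartOf, KB.DipInteractionFile))
    g.add((rec, KIAO.hasTemplate, KB.DipInteractionFileSchema))
    # Four triples per field-value instance; the last carries the actual value
    fv = KB.DipInteractionFileRecord1_FV1
    g.add((fv, RDF.type, KIAO.FieldValue))
    g.add((fv, KIAO.isIntegralPartOf, rec))
    g.add((fv, KIAO.hasTemplate, KB.DipInteractionFileField1))
    g.add((fv, KIAO.hasValue, Literal("DIP-328N")))   # hypothetical interactor value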

2.4 Representation of Record Substructure

Some data sets of data sources have complex fields, i.e., fields with substructure that cannot be represented in terms of straightforward field values. This field substructure can manifest itself in several ways but usually involves certain values of fields being semantically linked to other values. One type of substructure can be seen in a pair of fields within a data set, each having multiple values, where the field values of one field correspond to those of the other field, e.g.:

MI:0045|MI:0114
PMID:8494892|PMID:8043575

These are the sets of field values for two fields of a record of the DIP interaction file represented in Figure 2 (specifically, the fields identifying the method used to detect a given interaction and the PubMed ID of the article reporting this interaction). In this example, the first value of the interaction-detection-method field is tied to the first value of the publication field, and analogously for the second values of these two fields. We have conceptualized such record substructure as subrecords; e.g., using this pair of complex fields, one subrecord contains the first values of these two fields (MI:0045 and PMID:8494892) as its values, and another subrecord contains the second values of these two fields (MI:0114 and PMID:8043575) as its values. Without such an explicit representation of this substructure, the linkage implied in the complex structure of this file would be lost.

Fig. 3. An OWLPropViz rendering of instances representing one record (DipInteractionFileRecord1) of the data set represented in Fig. 2 (DipInteractionFile) and two field values (DipInteractionFileRecord1_FV1 and DipInteractionFileRecord1_FV2) of this record.

Fig. 4. An OWLPropViz rendering of instances representing the record substructure of the example presented in Section 2.4. The first values of each of the two fields (DipInteractionFileRecord1Subrecord1_FV1 and DipInteractionFileRecord1Subrecord1_FV2) are integral parts of a subrecord (DipInteractionFileRecord1Subrecord1), which is an integral part of the full record (DipInteractionFileRecord1). Additionally, these two field values have two respective subfields as their templates (DipInteractionFileSubfield1 and DipInteractionFileSubfield2), and these subfields are integral parts of a subschema (DipInteractionFileSubschema1), which is an integral part of the schema of the full record.

To represent this substructure, the model shown in Figure 1 was made slightly more complex. A subrecord is made an instance of KIAO:record (as for ordinary records). While a record can only be part of a data set in the basic representation, in this richer representation we have asserted that a record is an integral part of a record or a data set (i.e., the union of these classes) since a subrecord is part of a record. Thus, we can declare that the subrecord is a part of its record, which is therefore transitively part of the data set. Just as a record has a schema as its template, a subrecord has a subschema as its template, so a subschema for the subrecord is made an instance of KIAO:schema (as for ordinary schemas). As the subrecord is a part of its record, the subschema is analogously a part of the schema of the record. Additionally, a record can contain multiple subrecord types without conflict. Figure 4 displays instances of a subschema, subrecord, and its two values based on this example.
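
As a toy illustration of the subrecord idea (a hypothetical helper, not KaBOB code), pairing the i-th values of two pipe-delimited complex fields yields the subrecords described above:

    def subrecords(detection_methods: str, publications: str):
        """Pair the i-th values of two pipe-delimited complex fields into subrecords."""
        return list(zip(detection_methods.split("|"), publications.split("|")))

    # subrecords("MI:0045|MI:0114", "PMID:8494892|PMID:8043575")
    # -> [('MI:0045', 'PMID:8494892'), ('MI:0114', 'PMID:8043575')]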

3 RELATED WORK

Biomedical RDF-triple stores often conflate database records, i.e., information content entities, and the concepts and assertions denoted by them. Such a distinction is clearly made in the Neurocommons knowledge base (Ruttenberg et al., 2009), but an explicit representation of the database records is not attempted; standardized URIs are instead used to refer to records. Our model has not been designed to be used to make assertions of biomedical concepts, as in the referent tracking of Ceusters et al. (Ceusters and Smith, 2005). There have been efforts to programmatically transform the contents of databases to parallel ontological constructs (e.g., Astrova and Stantic, 2004); however, parallel representation is likely often incorrect given the substantial differences between modeling of data schemas and ontological engineering (Spyns et al., 2002). D2R MAP is an XML-based language designed to map database data to RDF (Bizer, 2003), whereas our proposal is a formal ontological model and is not limited to relational data. Relational.OWL is an OWL-based effort to model database schemas to serve as a rigorous exchange format (de Laborda and Conrad, 2005), whereas our model is not designed to capture all details of data-source schemas but to represent the data of these sources; furthermore, Relational.OWL is strictly modeled on the relational paradigm, whereas we have based our model on a more general concept of a data set, which may or may not originate from a database, extended from the IAO.

4 CONCLUSIONS

We have presented an ontological model extended from the Information Artifact Ontology for the representation of data sources and their content. Using this model, we have consistently represented the records of 17 prominent biomedical databases as 5.6 billion RDF triples (which we have loaded in 3.1 days into a bigdata store); we regard the representation of these information content entities as an intermediate representation toward one in terms of biomedical concepts. In addition to affording querying and inference over these wide-ranging data, this representation will allow us to declaratively model the axiomatizations for conversion to biomedical concepts and tracking of provenance at a fine-grained level.

ACKNOWLEDGEMENTS We gratefully acknowledge the support of this work by NIH grants R01LM009254, R01GM083649, and R01LM008111.

REFERENCES
Astrova, I. and Stantic, B. (2004) Reverse Engineering of Relational Databases to Ontologies: An Approach Based on an Analysis of HTML Forms. 1st Eur. Sem. Web Symp., 327-341.
Belleau, F. et al. (2008) Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. J. Biomed. Inform., 41(5), 706-716.
Bizer, C. (2003) D2R MAP - A Database to RDF Mapping Language. Proc. WWW 2003.
Ceusters, W. and Smith, B. (2005) Tracking Referents in Electronic Health Records. In: Engelbrecht, R. et al. (eds.) Proc. 2005 Medical Informatics Europe, IOS Press, Amsterdam, 71-76.
de Laborda, C.P. and Conrad, S. (2005) Relational.OWL: a data and schema representation format based on OWL. Proc. 2nd Asia-Pac. Conf. Conceptual Modelling (APCCM), 89-96.
Ruttenberg, A., Rees, J.A., Samwald, M., and Marshall, M.S. (2009) Life sciences on the Semantic Web: the Neurocommons and beyond. Briefings in Bioinform., 10(2), 193-204.
Spyns, P., Meersman, R., and Jarrar, M. (2002) Data modelling versus Ontology engineering. SIGMOD Record, 31(4), 12-17.

BioSharing: standards, policies and communication in bioscience Susanna-Assunta Sansone1,2,*, Dawn Field2,3, Annapaola Santarsiero4, Eamonn Maguire1, Philippe Rocca-Serra1, Chris Taylor1,6, Lee Harland5 and the BioSharing communities. 1. University of Oxford, UK; 2. MIBBI consortium; 3. NERC-NEBC, UK; 4.The Mario Negri Institute, Italy; 5. Pfizer Ltd, UK; 6. EBI, UK; * [email protected]; [email protected]

1 STANDARDS FOR REPRODUCIBLE RESEARCH

Research communities, funding agencies, and journals participate in the development of reporting standards for the bioscience domain (Field, Sansone, et al., 2009) to ensure that shared experiments are reported with enough information to be comprehended and (in principle) reproduced, compared or integrated. Similar trends exist in both the regulatory arena (e.g. US FDA, "CDER Data Standards Plan Version 1.0") and commercial science (e.g. Barnes, 2009). The proliferation of standards is a positive sign of stakeholders' engagement; however:
• how much do we know about these standards?
• which ones are mature and stable enough to use or recommend?
• which domain(s) do they cover?
• which ones are related to others?
• which tools and databases implement which standard(s)?

2 BIOSHARING CATALOGUE

BioSharing (biosharing.org) works at the global level to build stable linkages, in particular between journals and funders implementing data-sharing policies, and well-constituted standardization efforts in the biosciences domain. In doing so, it works to expedite the communication and the production of an integrated standards-based framework for the capture and sharing of high-throughput genomics and functional genomic bioscience data in particular. This presentation will introduce the BioSharing catalogue, which, in partnership with key players (e.g. www.biomedcentral.com/bmcresnotes/series/datasharing), aims to:
1. centralize community-developed bioscience standards, classified into three types:
a. reporting requirements, minimal information checklists to report the same core set of information,
b. terminological artifacts, such as controlled vocabularies and ontologies to describe the information, and
c. exchange formats, to communicate the information;
2. link to policies, other portals (e.g. www.mibbi.org, bioportal.bioontology.org, www.obofoundry.org), open access resources (e.g. precedings.nature.com) and lists of tools and databases implementing the standards (e.g. www.neuinfo.org);
3. develop and maintain a set of criteria for assessing the quality and formal rigor of the standards, but also the interoperability and relations among them;
4. foster interoperability, addressing overlaps and duplication of efforts that hamper their wider uptake and interfere with the creation of standards-compliant systems.
Built using Drupal, the catalogue is a work in progress and will be developed iteratively in collaboration with many communities (biosharing.org/?q=communities) to improve:
- Content: adding new entries, improving their classifications (domains), tracking the status and progress of each standard;
- Relations: adding relations between and within standards (e.g. ontologies that import others) and links to policies, where possible;
- Views: creating dynamic functionalities to explore and visualize the standards and their relations;
- Implementations: linking to standards-compliant systems and research data, where possible;
- Export: ultimately serving an RDF representation of the catalogue's content.

REFERENCES
Field D., Sansone SA, et al. (2009) Megascience. 'Omics data sharing. Science, 326(5950), 234-6.
Barnes M. et al. (2009) Lowering industry firewalls: precompetitive informatics initiatives in drug discovery. Nat Rev Drug Discov, 8(9), 701-8.


Automated Assessment of High Throughput Hypotheses on Gene Regulatory Mechanisms Involved in the Gastrin Response
Sushil Tripathi1, Aravind Venkatesan1, Mikel Egaña Aranguren2, Zahra Zavareh1, Konika Chawla1, Vladimir Mironov1, Liv Thommesen1, Torunn Bruland1, Martin Kuiper1, and Astrid Lægreid1
1Norwegian University of Science and Technology, NTNU, Trondheim, Norway; 2Universidad Politécnica de Madrid, Spain

1 INTRODUCTION
Systems Biology is an integrated approach to build and simulate biological models from a variety of data sources in order to generate and validate new research hypotheses. In its most basic form, a hypothesis may propose a biological relationship between two biological components, such as a protein and a gene. Such binary-relation hypotheses may be the subject of experimental validation and form building blocks for assembly into larger models. Today's high-throughput experiments are capable of producing vast amounts of data that allow the assertion of large numbers of binary relationships. We have devised a general approach to semi-automatically convert experimental data into individual research hypotheses. The formalised hypotheses in our example specify possible interactions between two components that are part of a general regulatory transcriptional network.

2 BIOLOGICAL SYSTEM
Our biological system is an in vitro cell culture treated with gastrin, a stomach peptide hormone with pleiotropic effects. Gene expression time profiles were recorded with transcriptome microarrays at 11 time points covering a 14h gastrin response. We found ~3000 genes with significantly changed mRNA levels (Bruland et al., 2011) and named these 'target genes' (TG).

3 RESULTS

We built a partially automated pipeline that performs reasoning over all possible relationship hypotheses derived from our microarray results. These hypotheses concern binary interactions between putative protein regulators (with mRNA levels serving as their proxy) and target genes:

Protein X - (up/downregulates) - TG Y

Hypotheses are initially formulated by considering every protein encoded by an expressed gene in our model cell line (~10,000) as a candidate regulator of every other expressed gene, resulting in 100 million potential binary interactions. An elaborate reasoning process is then used to 'upgrade' or 'downgrade' specific hypotheses, based on the assignment of scores. Cumulative scores provide a means to sort hypotheses by supporting evidence and priority for subsequent experimental validation. Scores were derived from a large number of sources: biological background information, such as Gene Ontology and other available sources of annotation of the function of the putative regulators; gene expression dynamics (genes that do not respond to a stimulus are unlikely target candidates); and timeliness of expression (new protein synthesis needs to be underway before a transcription factor (TF) whose activity is regulated by its gene expression can affect the regulation of its target genes). An automated query pipeline searched for supporting information both from our own Datamart with experimental findings and public annotations of transcriptional regulation, and through federated querying against distributed SPARQL endpoints. Several examples of high-scoring TF-TG gene pairs illustrate the power of our approach.

REFERENCES
Bruland T, Flatberg A, Andersen E, Misund K, Fjeldbo CS, Thommesen L, Lægreid A. Exploring signal-induced cellular regulatory subnetworks by use of partial least square regression (PLSR) multidimensional analysis of gene expression time series data. Manuscript.


New search method to mine biological data Fidel Ramírez, Glenn Lawyer and Mario Albrecht Department of Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Campus E1.4, 66123 Saarbrücken, Germany

Biological databases contain an ever-growing catalog of annotations characterizing the function of genes and proteins. By searching this knowledge, life scientists can uncover hidden functional relationships between genes and proteins. For instance, disease-causing candidate genes can be discovered by identifying genes taking part in the same molecular pathway, sharing protein interactions, and having expression patterns similar to known disease genes. Currently, the lack of integrated data repositories hampers straightforward information searches and forces researchers to query each database individually. Moreover, available search tools tend to focus on only one source, usually the Gene Ontology [1-5]. Construction of such repositories is considered a major challenge in bioinformatics due to the distributed nature of the biological databases, the differing data structures, the diversity of names to identify genes and proteins, and the heterogeneous schemes used to represent the data. Using a comprehensive data warehouse model, we integrated, unified and consolidated many well-known databases containing biological annotations for human genes and proteins. These include molecular functions, disease and drug associations, sequence family classifications, protein domain architectures, metabolic and signaling pathways, ortholog species information, protein interactions and protein complexes, and tissue expression. These sources comprise both ontological and categorical data. Based on this data warehouse, we devised a new method for quantifying the similarity of the associated function annotations in order to search the integrated data for functional relationships. This method uses a Boolean representation of the database annotations to record the absence or presence of annotations for each gene or protein. We evaluated the performance of this method by searching for known functional relationships using annotations based only on the Gene Ontology or on our large integrative data warehouse. Other methods for measuring the similarity of Gene Ontology annotations were adapted to use the integrated annotation information and were compared with our method.

Our results show that the search performance is improved substantially for almost all methods when multiple annotation sources are included instead of solely the Gene Ontology. When searching for known functional relations, our method was the most accurate, consistently ranking functionally relevant proteins among the top two out of over 18,000 human gene products. As another test of our method, we performed a systematic search for disease-associated genes. In 23 out of 54 cases, the correct disease gene association was found at top ranks. Researchers can access our data warehouse and tools online at http://biomyn.de.
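
Since annotations are recorded as Boolean presence/absence vectors, annotation similarity reduces to vector comparison. A toy sketch in Python/NumPy of one such measure (Jaccard is chosen here for illustration; the abstract does not specify the exact formula used):

    import numpy as np

    def jaccard(a: np.ndarray, b: np.ndarray) -> float:
        """Similarity of two Boolean annotation vectors (True = annotation present)."""
        union = np.logical_or(a, b).sum()
        if union == 0:
            return 0.0
        return float(np.logical_and(a, b).sum() / union)

    # Each position corresponds to one annotation term from the integrated warehouse
    gene_a = np.array([1, 0, 1, 1, 0], dtype=bool)
    gene_b = np.array([1, 1, 1, 0, 0], dtype=bool)
    print(jaccard(gene_a, gene_b))  # 2 shared of 4 distinct annotations -> 0.5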

REFERENCES
[1] Julie Chabalier, Jean Mosser, and Anita Burgun. A transversal approach to predict gene product networks from ontology-based similarity. BMC Bioinformatics, 8:235, 2007.
[2] P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics, 19(10):1275-1283, Jul 2003.
[3] Catia Pesquita, Daniel Faria, André O Falcão, Phillip Lord, and Francisco M Couto. Semantic similarity in biomedical ontologies. PLoS Comput Biol, 5(7):e1000443, Jul 2009.
[4] Andreas Schlicker, Francisco S Domingues, Jörg Rahnenführer, and Thomas Lengauer. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics, 7:302, 2006.
[5] José L Sevilla, Víctor Segura, Adam Podhorski, Elizabeth Guruceaga, José M Mato, Luis A Martínez-Cruz, Fernando J Corrales, and Angel Rubio. Correlation between gene expression and GO semantic similarity. IEEE/ACM Trans Comput Biol Bioinform, 2(4):330-338, 2005.


Coupling disease and genes using Disease Ontology, NCBI GeneRIFs, and the NCBO Annotator service Warren A. Kibbe1, Tian Xia1,2, Simon Lin1, and Lynn Schriml3 1The Robert H. Lurie Comprehensive Cancer Center, Northwestern University, Chicago, IL, USA, 2Huazhong University of Science and Technology, China, 3Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, USA

ABSTRACT
Motivation: We have previously released MMTx/NegEx-driven annotations coupling human disease to genes, but this process has been difficult to maintain and update. We have implemented a new analysis pipeline using the NCBO Annotator web service that is better abstracted and easier to run. We believe the basic pattern should be highly reusable.

1 INTRODUCTION

The Human Disease Ontology (DO) is a community-driven, open-source ontology focused on representing human disease. DO provides a semantically computable structure of inherited, environmental and infectious human disease that is based on the definition of disease in the Ontology for General Medical Science (OGMS). As defined in OGMS, a disease is a disposition (i) to undergo pathological processes that (ii) exists in an organism because of one or more disorders in that organism (1). Disease Ontology includes references to UMLS, MeSH, SNOMED, OMIM, ICD-9 and ICD-10. Previously, we have released Disease Ontology-to-gene mappings using the UMLS MetaMap Transfer tool (MMTx), taking advantage of the fact that 98% of the terms in Disease Ontology include UMLS concept unique identifiers (CUIs) (Osborne et al 2007, 2009). This process has been very hard to maintain, in part because it requires processing the entire UMLS data set and setting up and configuring MMTx locally, as well as formatting the GeneRIF statements so that they are processable by MMTx. We have been very interested in the NCBO Annotator service (http://bioportal.bioontology.org/annotator) since it was first made publicly available (Jonquet, Shah, Musen, 2009; Good, 2010) as a potential replacement for this process, and several examinations of the use of the NCBO Annotator service compared with MMTx have been published recently (Shah et al 2009).

2 METHODS

We have built a small Java application to take the GeneRIF statements, submit them to the NCBO Annotator web service (described at http://www.bioontology.org/wiki/index.php/Annotator_User_Guide) and assemble the results. The results of the mappings are available at http://zl3021.chinaw3.com/ and can be searched by DOID, keyword or GeneID, as shown in Figure 1. Figure 2 shows the results of a search for 'epithelial ovarian cancer'.
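
The authors' implementation is a small Java application; a rough Python equivalent of one submission might look as follows, where the endpoint URL and parameter names are assumptions rather than verbatim details of the Annotator interface:

    import requests

    # Endpoint and parameter names below are assumptions based on the service
    # as described, not verbatim from the Annotator documentation
    ANNOTATOR_URL = "http://rest.bioontology.org/obs/annotator"

    def annotate(generif_text: str, api_key: str) -> str:
        params = {
            "textToAnnotate": generif_text,  # assumed parameter name
            "apikey": api_key,               # assumed parameter name
        }
        response = requests.post(ANNOTATOR_URL, data=params)
        response.raise_for_status()
        return response.text  # XML annotations, to be assembled into mapping files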

Figure 1. Search screen for finding Gene mappings using GeneID, DOID, or text.

Previously we have shown that nearly 15% of all GeneRIFs contained disease terms (Osborne, 2007). However, the number of GeneRIF statements has nearly tripled, and the most current analysis shows just over 10% of the GeneRIF statements include a disease reference. Some of this reduction may be due to the removal of more than 2000 terms from Disease Ontology that were symptoms rather than diseases. We have not gone back to the previous collections of GeneRIFs to see if the discrepancy is in fact due to an increased specificity of the Disease Ontology (removal of symptoms).

Figure 2. Results of a search for ‘epithelial ovarian cancer’

3 DISCUSSION
The availability of the NCBO Annotator web service has allowed us to dramatically simplify our mapping of GeneRIFs to Disease Ontology terms and to cut the mapping time from weeks (including the time required for downloading, installing, configuring and reloading UMLS) to three days for scanning 393,562 GeneRIF statements for disease terms and building the resultant mapping files.

ACKNOWLEDGEMENTS This project has been funded under the American Recovery and Reinvestment Act of 2009 through NIH NCRR Award 1R01RR025342. We would also like to recognize the NCBO Bioportal (http://bioportal.bioontology.org/) for making the Human Disease Ontology available widely throughout the biomedical community and providing the NCBO Annotator web service.

REFERENCES
OGMS definition of disease: http://ontology.buffalo.edu/medo/Disease_and_Diagnosis.pdf
Osborne JD, Flatow J, Holko M, Lin SM, Kibbe WA, Zhu LJ, Danila MI, Feng G, Chisholm RL. (2009) Annotating the human genome with Disease Ontology. BMC Genomics 10 S1:S6. PMID 1959488.
Osborne JD, Lin SM, Zhu LJ, Kibbe WA. (2007) Mining biomedical data using MetaMap Transfer (MMTx) and the Unified Medical Language System (UMLS). Methods Mol Biol. 408:153-69. PMID 18314582.
Jonquet C, Shah NH, Musen MA. (2009) The Open Biomedical Annotator. AMIA Summit on Translational Bioinformatics, 56-60, San Francisco, CA, USA. PMID 21347171.
Shah NH, Bhatia N, Jonquet C, Rubin D, Chiang AP, Musen MA. (2009) Comparison of concept recognizers for building the Open Biomedical Annotator. BMC Bioinformatics. 10 Suppl 9:S14. PMID 19761568.
Good B. http://i9606.blogspot.com/2010/12/ncbo-annotator-versus-metamap-on-go.html


KiSAO: Kinetic Simulation Algorithm Ontology
Anna Zhukova*, Nick Juty, Camille Laibe and Nicolas Le Novère
EMBL-EBI, Hinxton CB10 1SD, United Kingdom

1 INTRODUCTION

Many algorithms have been designed to run numerical simulations of dynamical models. Depending on the characteristics of a model, and the questions asked about its behavior, not all algorithms can be used for instantiating a simulation. To enable the execution of a simulation task, as required by the MIASE guidelines [1], it is important to identify both the algorithm used and its setup. Since the details of all algorithms are not publicly available, and many are implemented only in a limited number of simulation tools, it is also useful to identify others with similar characteristics that would be able to provide comparable results using the same simulation setup. The Kinetic Simulation Algorithm Ontology (KiSAO) is developed to address this problem of describing and structuring existing algorithms used to simulate Systems Biology models, their characteristics and parameters. It enables unambiguous references to simulation algorithms from a simulation experiment description, such as SED-ML [2].

2 KISAO STRUCTURE AND CONTENT

We recently entirely restructured and extended KiSAO. Under its original structure, KiSAO was a single subsumption hierarchy, encoded in OBO format [3]. Subclassing was based on types of algorithms, derivation (an algorithm derived from another pre-existing one), and characteristics. KiSAO is now implemented in OWL [4]. Simulation algorithms (e.g. 'Bortz-Kalos-Liebowitz method') and algorithm characteristics (e.g. 'adaptive timesteps') compose the two main branches of the KiSAO OWL class hierarchy. Algorithms are linked to the characteristics they possess using the relation 'kisao:hasProperty'. For instance, ('Bortz-Kalos-Liebowitz method' 'hasProperty' ('adaptive timesteps' and 'discrete variables' and 'stochastic rules')). OWL offers the possibility to use negations to express that an algorithm does not possess a property; for instance, not ('Bortz-Kalos-Liebowitz method' 'hasProperty' 'spatial description'). The simulation algorithm branch itself is hierarchically structured using the relation 'rdfs:subClassOf'. In addition to its identifier and name, each algorithm is annotated with a definition, a reference to the publication describing the algorithm, and synonym names. The algorithm characteristic branch stores characteristics describing the model as represented in a simulation run, such as the type of variables (discrete or continuous), and information on the treatment of spatial descriptions. It also stores numerical characteristics including the system's behavior (deterministic or stochastic) as well as the progression mechanism (fixed or adaptive time steps). In addition to the algorithm and characteristics hierarchies, the new structure of KiSAO describes the parameters that algorithms need to know to run properly, such as 'absolute-' and 'relative tolerance'. The algorithm parameters are stored as OWL data properties, which enables representing information regarding parameter types using data property range restrictions (to built-in data types, such as 'xsd:int' and 'xsd:double' [5]). Data property domain restrictions associate parameters with the algorithms with which they are usable in a simulation. Examples are ('tau-leaping epsilon' (domain: 'tau-leaping method', range: 'xsd:double')) and ('LSODA max stiff order' (domain: 'Livermore solver for ordinary differential equations with automatic method switching', range: 'xsd:int')).
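
As an illustration, such a parameter can be encoded as an OWL data property with domain and range restrictions; a minimal sketch in Python/rdflib, with an invented namespace and invented URI fragments in place of the real KiSAO identifiers:

    from rdflib import Graph, Literal, Namespace, RDF, RDFS, OWL, XSD

    KISAO = Namespace("http://www.biomodels.net/kisao/KISAO#")  # illustrative namespace

    g = Graph()
    prop = KISAO.tau_leaping_epsilon                      # hypothetical fragment
    g.add((prop, RDF.type, OWL.DatatypeProperty))
    g.add((prop, RDFS.label, Literal("tau-leaping epsilon")))
    g.add((prop, RDFS.domain, KISAO.tau_leaping_method))  # algorithm the parameter is usable with
    g.add((prop, RDFS.range, XSD.double))                 # built-in datatype of the parameter
    print(g.serialize(format="turtle"))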

3 CONCLUSIONS AND PERSPECTIVES

KiSAO makes it possible to describe and relate algorithms used in numerical simulations in an extensible, flexible yet powerful manner. Its use in conjunction with SED-ML will allow simulation software to automatically choose the best available algorithm to perform a simulation run. The availability of algorithm parameters, together with their types, may permit the automatic generation of user interfaces to configure simulators.

REFERENCES
1. Waltemath, D. et al. (2011) Minimum Information About a Simulation Experiment (MIASE). PLoS Comput Biol, in the press.
2. Köhn, D., Le Novère, N. (2008) SED-ML - An XML Format for the Implementation of the MIASE Guidelines. Proceedings of the 6th conference on Computational Methods in Systems Biology, Heiner M and Uhrmacher AM eds, Lect Notes Bioinfo, 5307: 176-190.
3. Day-Richter, J. (2006) The OBO Flat File Format Specification, version 1.2. Tech. rep., The Gene Ontology Project.
4. W3C OWL Working Group. OWL 2 Web Ontology Language Document Overview (http://www.w3.org/TR/owl2-overview/).
5. Biron, P.V., Malhotra, A. XML Schema Part 2: Datatypes Second Edition. W3C Recommendation 28 October 2004 (http://www.w3.org/TR/xmlschema-2/).


CALOHA: A new human anatomical ontology as a support for complex queries and tissue expression display in neXtProt
Paula D. Duek, Anne Gleizes, Catherine Zwahlen, Anaïs Mottaz, Amos Bairoch and Lydie Lane*
CALIPHO Group, SIB – Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, 1211 Geneva 4

neXtProt (http://www.nextprot.org) is a new bioinformatics resource aiming to be a comprehensive human-centric discovery platform, offering its users a seamless integration of and navigation through protein-related data. neXtProt integrates all of the high-quality human sequences and annotations from UniProtKB/Swiss-Prot (UniProt consortium, 2010) and also contains a vast amount of information obtained by mining many external data resources with very stringent quality criteria. It provides a Gold/Silver quality tag for every piece of integrated data. Regarding human protein expression, neXtProt integrates data obtained on healthy tissues from two resources: microarray and EST data from BGee (http://bgee.unil.ch/), and immunohistochemistry data from the Human Protein Atlas (HPA) (http://www.proteinatlas.org/). These resources capture expression data using different experimental methodologies, at different levels of granularity, in partially overlapping anatomical structures ranging from cell types to organs, and at different developmental stages. In addition, the different resources use synonyms to describe the same object. As a discovery platform, neXtProt intends: (1) to describe all imported expression data with the original granularity (e.g., intestinal epithelium and not intestine), (2) to compare datasets with different granularity levels and (3) to support complex queries about protein expression. A prerequisite to accomplishing those objectives was to use a suitable ontology of human anatomy that describes organs, tissues and cell types. This ontology should contain all terms provided by the different expression resources; it should be complete enough to support users' queries, but simple enough to be used also as an expression viewer. This led us to develop CALOHA, the "CALipho Ontology for Human Anatomy". The current version of CALOHA contains 688 terms, including terms describing currently available data and additional terms that permit the connection of these anatomical entities. New terms will be added according to newly imported data and to the identification of neXtProt users' query requirements. Each term has cross-references to eVoc, BRENDA and MeSH, and is associated with a wealth of synonyms collected from different resources. The definitions are imported from MeSH, NCI Thesaurus, Wikipedia or from the literature.

The ontology is structured on the basis of two relationships, is_a and part_of, and is implemented in OBO format. It is organized in different interconnected categories: anatomical systems (alimentary, circulatory, dermal, endocrine, exocrine, etc.), tissues (epithelium, mucosa, connective, lymphoid, etc.), cell types, fluids and secretions, and gestational structures (embryo, fetus, extraembryonic tissues and fluids). In this way, the ontology can be browsed fluently from a system down to its constituting organs, tissues and cell types. This ontology has been implemented in neXtProt as a support for capturing experimental information from HPA and BGee. It is used to reconcile data obtained at different granularity levels: in neXtProt expression tables, information captured at each level is integrated in upper levels, facilitating visualization and allowing comparison between experiments done in different sub-fractions of a same entity provided by different resources. To be able to describe experimental results with accuracy, we complement this ontology with a controlled vocabulary for human developmental stages, from embryo to adulthood, maintained by F. Bastian and collaborators (Bastian et al. 2008) and downloadable at http://bgee.unil.ch/. Along with our stringent selection of data to be integrated, and our system of quality grading (Gold/Silver quality tag), the full integration of expression data from various sources across the complete human anatomy allows neXtProt users to obtain high-quality sets of proteins expressed in a given location. These sets can easily be obtained by searching for a particular term under the topic 'expression' and selecting the desired Gold/Silver stringency. For example, a set of proteins known to be expressed in the retina with a high confidence level can be obtained using the following URL: http://www.nextprot.org/db/search#{f:expression,t:retina} In conclusion, CALOHA, a new human anatomical ontology, has been successfully used to reconcile expression data from heterogeneous resources. It has been implemented in neXtProt as a support for complex queries and tissue expression display. We want to keep it up to date, if possible in collaboration with groups that are interested in using and/or developing such an ontology. It can be downloaded at ftp.nextprot.org.
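
The reconciliation across granularity levels amounts to propagating annotations up the is_a/part_of hierarchy; a toy sketch of that idea in Python (not neXtProt code, with invented example data):

    from collections import defaultdict

    def integrate_upwards(annotations, parents):
        """annotations: term -> set of proteins observed there;
        parents: term -> coarser terms reachable via is_a/part_of."""
        integrated = defaultdict(set)
        for term, proteins in annotations.items():
            stack, seen = [term], set()
            while stack:
                t = stack.pop()
                if t in seen:
                    continue
                seen.add(t)
                integrated[t] |= proteins
                stack.extend(parents.get(t, ()))
        return integrated

    # Data captured on "intestinal epithelium" then also appears under "intestine"
    result = integrate_upwards({"intestinal epithelium": {"P12345"}},
                               {"intestinal epithelium": ["intestine"]})
    print(result["intestine"])  # {'P12345'}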

The EDAM ontology for bioinformatics tools and data Jon Ison*, Matus Kalas**, Steve Pettifer*** and Peter Rice* * EMBL European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK ** Computational Biology Unit, Uni Computing, 5008 Bergen, Norway *** School of Computer Science, The University of Manchester, Manchester, M13 9PL, UK

1 MOTIVATION

Researchers demand simple and powerful means to organise, find, compare, use and connect an increasingly large and complex set of tool and data resources. These tasks depend on consistent, machine-understandable resource descriptions. There is an urgent need for an ontology that unifies semantically common bioinformatics concepts and provides a controlled vocabulary for the annotator.

2 EDAM ONTOLOGY

EDAM (Figure 1) includes 5 sub-ontologies (Table 1) within the scope of bioinformatics resource (tool and data) description. There are 5 types of EDAM-specific relationships (Table 2) which relate concepts from different branches. EDAM provides:
• A hierarchical set of coarse-level topics for categorising any bioinformatics resource
• A comprehensive set of terms for concepts describing common types of data and operations
• Comprehensive catalogues of common data formats and types of data identifiers
EDAM provides a starting point for nomenclature and is ready for use in pilot annotations.

Table 1. EDAM sub-ontologies
Branch | Description
topic | A general field of bioinformatics study, analysis or technique, e.g. "Sequence analysis", "Phylogenetics"
data | A type of data commonly used in bioinformatics, e.g. "Sequence alignment", "Sequence record"
format | A commonly used data format, e.g. "FASTA", "SAM"
identifier | A label that identifies (typically uniquely) biological or computational entities, e.g. "Ensembl ID", "EC number"
operation | A specific, singular function performed by a tool. What is done, but typically not how or in what context, e.g. "Sequence alignment", "Sequence database search"
EDAM includes 5 sub-ontologies which collectively define the scope.

Fig. 1. Sub-ontologies are in boxes, relations are shown as arrows: Operation has_input and has_output Data; Operation and Data are in_topic Topic; Format is_format_of Data; Identifier is_identifier_of Data.

EDAM terms reflect well-established concepts and correspond to categories of things. The current version includes over 2000 concepts with names (terms) and definitions, 1000 EDAM-specific relations and 3000 is_a (subclass) relations.

Table 2. EDAM relations
Relation | Description
is_a | A (child) concept is a specialisation of its parent, e.g. "Pairwise sequence alignment is_a Sequence alignment"
in_topic | A concept ('data' or 'operation') is within scope of a 'topic', e.g. "Sequence alignment" in_topic "Sequence analysis"
has_input | An 'operation' consumes a certain type of 'data', e.g. "Sequence alignment has_input Sequence"
has_output | An 'operation' produces a certain type of 'data', e.g. "Sequence alignment has_output Sequence"
is_format_of | A data 'format' is a format of a certain type of 'data', e.g. "FASTA is_format_of Sequence record"
is_identifier_of | A data 'identifier' is an identifier of a certain type of 'data', e.g. "EMBL accession is_identifier_of Sequence record"
EDAM includes 5 custom relations (in addition to is_a) which relate concepts from one branch (in quotes, e.g. 'data') to another.

3 DOWNLOADS
http://edamontology.sourceforge.net/
https://sourceforge.net/projects/edamontology/files/
http://bioportal.bioontology.org/ontologies/44600


What’s new and what’s changing in ChEBI in 2011 Janna Hastings*, Paula de Matos, Adriano Dekker, Marcus Ennis, Kenneth Haug, Zara Josephs, Gareth Owen, Steve Turner and Christoph Steinbeck European Bioinformatics Institute, Hinxton, UK CB10 1SD

1 INTRODUCTION

ChEBI -- Chemical Entities of Biological Interest -- is an ontology of chemical entities such as molecules and ions, and their roles in biological contexts (de Matos, 2010). As of April 2011, it contains in total around 25,000 classes. Here, we report on recent developments and changes in the ontology, and give a brief view on ongoing work that will lead to changes in the future.

2 ONTOLOGY RESTRUCTURING

Role-structure disentanglement. Prior to 2009, the 'is a' relationship in ChEBI was overloaded, linking molecular entities with chemical classes and specifying the 'roles' that chemical entities can enact in various contexts. To address this, the relationship 'has role' was introduced and used to link molecular entities to roles; for example, the molecular entity acetylsalicylic acid (CHEBI:15365) 'has role' non-narcotic analgesic (CHEBI:35481). The initial disentanglement was performed programmatically, and subsequent manual curation was required to clean up, since errors occurred in cases where, say, a chemical entity lacked a structure and was only classified with a role parent. Current curation efforts are underway to fully define classes which are specified with both structural and role-based features, such as the entity tricyclic antidepressant (CHEBI:36809), which is defined as 'is a' organic tricyclic compound and 'has role' antidepressant.

In order to adequately deal with user-requested mixtures and polymers within the ontology, ChEBI has expanded its 'chemical substance' hierarchy, differentiating between pure and mixed substances. A pure substance is a macroscopic homogeneous collection of molecular entities, while a mixture contains a non-homogeneous collection -- at least two different types of molecular entity. In particular, this allows us to adequately model racemic mixtures, which are crucial in the representation of drugs, since many active substances found in drugs are formulated as racemic mixtures.

Mapping to upper level ontologies. In order to comply with our goal of increasing interoperability with other ontologies in the biomedical domain, ChEBI has undertaken to provide a mapping to the upper level ontology BFO. Mapping multiple ontologies beneath a common upper level allows easier linking between ontologies, since it reduces ambiguities in interpretations through clear ontological commitment. The ChEBI-BFO mapping is provided as a bridge OWL file, downloadable alongside the ChEBI ontology OBO and OWL exports, available at ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/.

3 AUTOMATED SUBMISSIONS

ChEBI growth is entirely community and user request driven. To better meet our users' needs, ChEBI provides a community curation platform in the form of a web-based submission tool. The submission tool is accessible at https://www.ebi.ac.uk/chebi/submissions. The key advantage for users of the submission tool is that they are able to directly deposit new chemicals or roles into the ChEBI production database and retrieve an identifier which they can start to use immediately. The identifier is maintained, although it will only become publicly available via the ChEBI public interface at the next monthly release.

4 CURATION EFFORTS

A large-scale ongoing effort is focused on annotating compounds relevant for immunology. Also, ChEBI is currently refactoring the representation of natural products, which is currently inconsistent in the ontology. Natural products will be given the role 'secondary metabolite'.

5 RELATIONSHIP EVALUATION

ChEBI is moving towards including full logical definitions for structure-based classes in the ontology where possible, and is continuing the alignment with BFO and with the Relation Ontology (RO). These efforts include a thorough evaluation of the relationships currently used in the ontology. Some relationships will be added, such as 'disjoint from', while others will be deprecated if they prove resistant to being assigned a logical definition.

REFERENCES
de Matos, P., Alcántara, R., Dekker, A., Ennis, M., Hastings, J., Haug, K., Spiteri, I., Turner, S., and Steinbeck, C. (2010) Chemical Entities of Biological Interest: an update. Nucleic Acids Res, 38, D249-D254.


EMO – The Enzyme Mechanism Ontology Julius O. B. Jacobsen, Nicholas Furnham and Gemma L. Holliday EMBL-EBI, The Wellcome Trust Genome Campus, Hinxton, Cambs, CB10 1SD, UK

Enzymes, as biological catalysts, are important and extremely intricate systems, without which life as we know it could not exist. Enzyme complexity is a function of the protein sequence, 3D structure and the mechanism of the reaction they catalyze. Whilst enzymes range in size from tens of amino acid residues to thousands, only a few residues are catalytically vital (the catalytic residues). These are found in a cleft, often deeply buried in the protein, called the active site. Information relating to these residues, identified using the enzyme's atomic structure, is held in the Catalytic Site Atlas (CSA) [Porter, 2004], while chemical reaction and mechanistic details are held in a sister database, MACiE [Holliday, 2007]. Both of these databases utilize a controlled vocabulary, with MACiE possessing a more detailed vocabulary as it covers enzymes in much greater depth, including thorough descriptions of the chemical reaction steps performed. Likewise, the Swiss-Prot section of the UniProt KnowledgeBase (UniProtKB/Swiss-Prot) [UniProt, 2011] also captures enzyme-related data at a broader protein sequence level, including information on catalytic residues. Annotations are made as both free text and using an independently developed controlled vocabulary. Whilst the CSA and MACiE resources have been developed somewhat in tandem and thus share a common data model, it is not currently simple to link these to enzyme annotations in other resources, primarily UniProtKB, due to differences in the definitions of enzyme properties and the vocabularies used in their description. Though descriptions and definitions of some of the information held in all three databases are made in existing ontologies such as GO and the ChEBI ontology, marrying these and applying them uniformly to all three databases proved far from trivial. In this paper we present the Enzyme Mechanism Ontology, EMO, which builds upon the controlled vocabulary developed for MACiE and the CSA and will be submitted to the OBO Foundry. This vocabulary was created to describe the active components of the enzyme's reactions (cofactors, amino acid residues and cognate ligands) and their roles in the reaction. EMO builds upon this by formalizing key concepts, and the relationships between them, necessary to define enzymes and their functions. This describes not only the general features of an enzyme, including the EC number (catalytic activity), 3D structure and cellular locations, but also allows for the detailed annotation of the mechanism. This mechanistic detail can be either at a gross level (overall reaction only) or at the more detailed granularity of the steps and components required to effect the overall chemical transformation. EMO allows many different resources to be drawn together for a more complete description of an enzyme and its function/mechanism, even where data are only partially annotated in some resources. Communication between databases can be facilitated through the use of such a universal resource that maps disparate terms to a common data model. To this end, EMO is being applied to the Enzyme Portal, a project in development within the EBI, which aims to provide a unified portal to all EBI enzyme-related resources. The ontology will also allow us to ask more general enzyme-related questions, which are currently not trivial to address or require queries to be run across many different databases. Questions such as identifying which enzymes can be found in specific cellular compartments, or the exact nature and combination of cofactors, will then be addressable in a coherent manner. It will also be possible to identify disease and drug associations relating to specific enzymes, linking this information back to more specific mechanistic details. Furthermore, it should enable automated classification and detection of misclassification of enzymes, based on their mechanism. This ontology, a collaboration between the UniProt Consortium, CSA and MACiE, has been created in an effort to standardize our vocabulary; it has not only permitted us to unify the various levels of detail held about similar information in each of our databases, but its implementation will also permit the many other users of enzyme data to cross-reference and share data with each other.

REFERENCES
Holliday, G.L., Almonacid, D.E., Bartlett, G.J., O'Boyle, N.M., Torrance, J.W., Murray-Rust, P., Mitchell, J.B.O. and Thornton, J.M. (2007) MACiE (Mechanism, Annotation and Classification in Enzymes): novel tools for searching catalytic mechanisms. Nucl. Acids Res., 35, D515-D520.
Porter, C.T., Bartlett, G.J. and Thornton, J.M. (2004) The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucl. Acids Res., 32, D129-D133.
The UniProt Consortium (2011) Ongoing and future developments at the Universal Protein Resource. Nucl. Acids Res., 39, D214-D219.


Quantitative comparison of mapping methods between Human and Mammalian Phenotype Ontology Anika Oellrich1 , George Gkoutos2 , Robert Hoehndorf2 , Dietrich Rebholz-Schuhmann1 1 European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SD 2 Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH

Phenotypic outcomes of mouse mutagenesis experiments are widely described using the Mammalian Phenotype Ontology (MP) [Smith et al., 2005]. Phenotypes characteristic of human diseases are represented using the Human Phenotype Ontology (HP) [Robinson et al., 2008]. To facilitate large-scale analysis of human diseases and their association to mutagenesis experiments based on phenotypes, mappings between MP and HP are required. Generating those mappings manually is not only time-consuming but also costly given the size of the ontologies; hence, automated methods are sought. Here, we compare two methods which have been applied to generate a mapping between HP and MP. The first method relies on lexical matching using the names and synonyms of each of the concepts in either ontology and is available as the Lexical OWL Ontology Matcher (LOOM) [Ghazvinian et al., 2009]. The second method relies on the formal definitions [Mungall et al., 2010] developed for each of the ontologies and is available from [PhenomeBlast]. In this study, we investigated not only the overlap of both mappings but also their potential for disease gene candidate prediction for human diseases. The data sets used for the prediction and evaluation were extracted from both the Online Mendelian Inheritance in Man (OMIM) and Mouse Genome Informatics (MGI) databases. MGI provides genes with an MP phenotype description, and OMIM provides the disease phenotypes represented in HP. To facilitate a prediction, either the gene information has to be "translated" to HP or the disease information to MP. We investigated both directions and show the rank distributions for gene-disease associations deemed to be correct according to MGI and OMIM. As shown in Table 1, neither method generates a mapping for all concepts contained in each of the ontologies. The implications are that formal definitions might not exist, especially for complex phenotypes, and that lexical matching is limited by the creation and availability of concept names and synonyms, including ambiguous word usage. Our results show that both mappings share information but also lead to different associations of concepts (see Figure 1). The low number of exact matches indicates a gap between both methods and might give some insights about potential errors inside the mappings. The gene prediction results illustrated in Figure 2 suggest that the method incorporating the formal definitions for each ontology yields better results than the method using lexical matching on concept names and synonyms.
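
For context, LOOM-style lexical matching can be approximated by normalising names and synonyms and intersecting them; the following toy sketch in Python (our approximation, not the LOOM implementation) illustrates the idea:

    import re

    def normalise(label: str) -> str:
        # Lowercase and strip non-alphanumerics, a simple normalisation for lexical matching
        return re.sub(r"[^a-z0-9]", "", label.lower())

    def lexical_mapping(hp_labels, mp_labels):
        """hp_labels / mp_labels: concept ID -> list of names and synonyms."""
        index = {}
        for hp_id, labels in hp_labels.items():
            for label in labels:
                index.setdefault(normalise(label), set()).add(hp_id)
        pairs = set()
        for mp_id, labels in mp_labels.items():
            for label in labels:
                for hp_id in index.get(normalise(label), ()):
                    pairs.add((hp_id, mp_id))
        return pairs

    # Toy example with invented concept IDs: the shared normalised label "fever" matches
    print(lexical_mapping({"HP:X": ["Fever", "Pyrexia"]}, {"MP:Y": ["hyperthermia", "Fever"]}))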


Table 1. Numbers of concepts contained in each ontology, together with the results of the mapping methods. The first bracket gives the percentage of the total number of concepts in each ontology; the second gives the average number of concepts one particular concept is mapped to.

 | HP (% of total) (average # mapped) | MP (% of total) (average # mapped)
# concepts | 10104 (100%) ( - ) | 8507 (100%) ( - )
# with formal definition | 4860 (48.10%) ( - ) | 5389 (63.35%) ( - )
# mapped with LOOM | 2740 (27.12%) (7.17) | 1046 (12.30%) (6.97)
# mapped with formal definitions method | 8184 (80.10%) (5.48) | 4446 (52.26%) (6.64)

Fig. 1: Agreement of the mappings generated by each method, separated into five different types of overlap. In general, more mapped concepts are available for HP than for MP. The low proportion of exact matches suggests a deviation between the methods that may indicate errors within the mappings; the "nothing matches" category can be used to identify potential errors in the mapping of either method.

Fig. 2: Distribution of the ranks of the gene prediction results, including only the ranks of gene-disease associations deemed correct according to MGI and OMIM. Both plots show that the gene ranks are better when the mappings generated from the formal definitions are applied, suggesting that those mappings are better suited for connecting phenotypic information across species.
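The evaluation behind Figure 2 can be sketched as follows, under assumed inputs: for each disease, candidate genes are ordered by a phenotype-similarity score computed over the translated annotations, and the rank of each association deemed correct by MGI and OMIM is recorded. The identifiers and score values below are hypothetical.

    def ranks_of_known_genes(scores, known):
        # scores: dict disease -> {gene: similarity score}
        # known:  dict disease -> set of genes deemed correct (MGI/OMIM)
        ranks = []
        for disease, gene_scores in scores.items():
            ordered = sorted(gene_scores, key=gene_scores.get, reverse=True)
            for rank, gene in enumerate(ordered, start=1):
                if gene in known.get(disease, set()):
                    ranks.append(rank)
        return ranks

    # Hypothetical data: the known gene for OMIM:143100 ranks 2nd of 3.
    scores = {"OMIM:143100": {"Htt": 0.8, "Fgfr2": 0.9, "Pax6": 0.1}}
    known = {"OMIM:143100": {"Htt"}}
    print(ranks_of_known_genes(scores, known))  # [2]

A mapping that preserves phenotype meaning across species should push the known genes toward rank 1, which is the comparison the two plots make.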

REFERENCES
Amir Ghazvinian et al. Creating mappings for ontologies in biomedicine: simple methods work. AMIA Annu Symp Proc, 2009:198-202, Nov 2009.
Christopher J Mungall et al. Integrating phenotype ontologies across multiple species. Genome Biol, 11(1):R2, Jan 2010. doi: 10.1186/gb-2010-11-1-r2.
PhenomeBlast. URL http://code.google.com/p/phenomeblast/.
Peter N Robinson et al. The human phenotype ontology: a tool for annotating and analyzing human hereditary disease. Am J Hum Genet, 83(5):610-5, Nov 2008.
Cynthia L Smith et al. The mammalian phenotype ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biol, 6(1):R7, Jan 2005.


Biositemaps: A Framework for Biomedical Resource Discovery
Jessica D. Tenenbaum1, Patricia L. Whetzel2, Csongor Nyulas2, Charles D. Borromeo3, Harpreet Singh3, Nancy B. Whelan3, Brian Athey4, Michael J. Becich3, Mark A. Musen2 and the Biositemaps Consortium
1 Duke University, Durham, NC; 2 Stanford University, Stanford, CA; 3 University of Pittsburgh, Pittsburgh, PA; 4 University of Michigan, Ann Arbor, MI

The Biositemaps1 project has developed technologies to address locating, querying, composing, and mining biomedical resources. The project initially focused on resources developed through the National Center for Biomedical Computing (NCBC) program and later expanded to include resources developed by the Clinical and Translational Science Awards (CTSA) Informatics Inventory and Resources Workgroup (IRWG). The project implemented a web-accessible inventory of biomedical research resources, which can be queried through the Resource Discovery System (RDS)2, thus providing a national, searchable, and interactive inventory of resources for clinical and translational research. Project partners include the University of Pittsburgh, Duke University, University of Texas Health Science Center, Oregon Health & Science University, University of California Davis, Emory University, Stanford University, and University of Michigan, from the CTSA and NCBC programs.
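Since biositemap files are published as RDF documents on the Web, consuming one might look roughly like the following Python sketch using rdflib. The file location and the use of dc:title as the resource-name property are assumptions for illustration only; the actual Biositemaps Information Model defines its own resource descriptors.

    from rdflib import Graph, URIRef

    # Hypothetical biositemap location; real files are published per institution.
    g = Graph()
    g.parse("http://example.org/biositemap.rdf", format="xml")

    # Assumed property: many RDF resource descriptions carry a dc:title;
    # the Biositemaps Information Model may use its own predicates instead.
    DC_TITLE = URIRef("http://purl.org/dc/elements/1.1/title")
    for resource, title in g.subject_objects(DC_TITLE):
        print(resource, "->", title)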

Specific aims for the project include:
1. Creating an inventory of software, material, funding, service, training, and people resources
2. Developing an informatics infrastructure
3. Publishing the resource inventory to enable a federated, web-accessible catalog of resources that enable clinical and translational research

Notable accomplishments are:
1. A standards-based informatics infrastructure that includes development of a biomedical resource inventory
2. Development of the Biomedical Resource Ontology (BRO)3, 4
3. Harmonization of BRO and the Neuroscience Information Framework Standard ontology (NIFSTD)
4. Mapping of the Biositemaps Information Model to the eagle-i Information Model

Resource inventories totaling 1,479 resources have been collected thus far from over 100 institutions/organizations across 6 NIH-funded research programs (CTSA, CVRG, NCBC, NHLBI, NITRC, SysBioCenters). The RDS website has had over 1,200 unique visitors from 47 countries. Future goals include:
• Accelerated development of Biositemaps across NCBCs, CTSAs, and other applicable NIH-funded organizations
• Promotion and outreach of this information framework, using web and print-based media, web-based seminars and presentations, and participation in national meetings
• Continued harmonization of BRO and NIFSTD for common resource descriptors
• Further alignment of the Biositemaps Information Model with the eagle-i Information Model

Funding: CTSA 3UL1RR024153-03S1, 5UL1RR024128-03S1; NCBC 3U54DA021519-04S1, 3U54HG004028-04S1

REFERENCES
1 http://biositemaps.ncbcs.org/
2 http://biositemaps.ncbcs.org/rds/
3 http://purl.bioontology.org/ontology/BRO
4 Tenenbaum JD, Whetzel PL, Anderson K, Borromeo CD, Dinov ID, Gabriel D, Kirschner B, Mirel B, Morris T, Noy N, Nyulas C, Rubenson D, Saxman PR, Singh H, Whelan N, Wright Z, Athey BD, Becich MJ, Ginsburg GS, Musen MA, Smith KA, Tarantal AF, Rubin DL, Lyster P. (2011) The Biomedical Resource Ontology (BRO) to enable resource discovery in clinical and translational research. J. Biomed. Inform. 44:137-45. PubMed PMID: 20955817; PubMed Central PMCID: PMC3050430.

Collaborative Development of Ontologies using WebProtégé and BioPortal
Patricia L. Whetzel, Natalya F. Noy, Paul Alexander, Tania Tudorache, Csongor Nyulas, Mark A. Musen
Stanford University, Stanford, CA

In many scientific disciplines, and in biomedicine in particular, researchers rely on ontologies to annotate and integrate their data. These ontologies are living, constantly evolving artifacts, and ontology authors must rely on their user community to ensure that the coverage of the ontologies is sufficient for annotation and the other tasks for which users deploy them. Development of ontologies therefore requires an integrated platform that not only allows ontology developers to edit the ontology collaboratively, but also allows for the collection of comments from subject matter experts. The integration of WebProtégé and BioPortal provides such an environment. WebProtégé is a lightweight Web-based ontology browser and editor that provides a collaborative editing environment. Its features include the ability to simultaneously browse and edit the ontology, to discuss entities in the ontology (e.g., a class, property, or individual), and to track all changes to the ontology. Once a cycle of edits is complete and a new version of the ontology is generated, many ontology developers release the ontology by placing it on the Web, both so that others can access it and to collect comments from the community. BioPortal provides such a mechanism to publish ontologies and collect community feedback, in addition to a range of other functionality. Using the BioPortal Web services, the ontology and its metadata can be published directly to BioPortal. Subject matter experts can log in to BioPortal to provide feedback to ontology authors, to request new terms, and to use provisional terms in their applications. The ontology authors can use the same infrastructure to explore this feedback in their ontology-editing environment, to update the ontology, to record their decisions on the users' requests, and to publish both the updated ontology and the information on how they acted on the requested changes. These changes are stored within the Notes ontology. Because the notes are stored in a common representation, the ontology editors can see each note in the context of editing the ontology and make the needed changes. Notes can be "archived" once the editing task is complete, which hides the note from view in BioPortal. User feedback and change requests can also be accessed programmatically via the Notes API.
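As a rough illustration of programmatic access to community feedback, the Python sketch below fetches notes over HTTP and prints them. The endpoint path, API-key parameter, and JSON response shape are assumptions for illustration only; the documented NCBO Notes API should be consulted for the actual interface.

    import json
    import urllib.request

    # Hypothetical endpoint and API key; consult the NCBO documentation for
    # the actual Notes API paths and authentication parameters.
    url = "http://rest.bioontology.org/bioportal/virtual/notes/1104?apikey=YOUR_KEY"
    with urllib.request.urlopen(url) as response:
        notes = json.load(response)

    # Assumed response structure: a list of note objects with author and body.
    for note in notes:
        print(note.get("author"), ":", note.get("body"))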

Requirements for the implementation of an integrated editing platform for ontology development and publishing were collected from a review of existing tools and from the workflows of large collaborative ontology development projects. The main drivers and users of this collaborative platform include the World Health Organization in the development of ICD-11, the Biositemaps Consortium in the development of the Biomedical Resource Ontology and its harmonization with NIFSTD, and the Radiological Society of North America in the development of RadLex, along with application use cases presented by the developers of the Ontology Development and Information Extraction (ODIE) toolkit and Phenoscape.

REFERENCES
Noy NF, Shah NH, Whetzel PL, Dai B, Dorf M, Griffith N, Jonquet C, Rubin DL, Storey MA, Chute CG, Musen MA. (2009) BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Res. 37, W170-3.
Tudorache T, Falconer S, Nyulas C, Storey MA, Ustün TB, Musen MA. (2010) Supporting the Collaborative Authoring of ICD-11 with WebProtégé. AMIA Annu Symp Proc. 2010, 802-6.


NCBO Resource Index: Ontology-based Search and Mining of Biomedical Resources
Patricia L. Whetzel1, Clement Jonquet2, Paea LePendu1, Sean M. Falconer1, Adrien Coulet3, Natalya F. Noy1, Mark A. Musen1, Nigam H. Shah1
1 Stanford University, Stanford, CA; 2 Université Montpellier, France; 3 University of Nancy, France

Biomedical researchers generate an enormous amount of data across a wide array of domains, such as genomic information, protein-interaction data, clinical trial results, pharmacological data, and pathology reports. Adding to this data deluge is the multitude of databases that store the information and the different data formats used across these databases. Technology to integrate these data can enhance the pace of scientific discovery by providing researchers with a unified view of this diverse information. To address this problem, we developed the NCBO Resource Index, which provides a unified view by linking the data through ontology-based annotations. The key enabling technologies of the Resource Index are the NCBO Annotator, a text annotation tool that uses MGREP for concept recognition, and BioPortal, a library of over 200 biomedical ontologies. By annotating, or "tagging", the textual metadata of database records with corresponding ontology terms, and by annotating the records with additional ontology terms obtained by traversing the ontology hierarchy, a set of direct and expanded annotations is generated. These ontology annotations form a network of links that makes it easier to find relevant data sets. The Resource Index can be queried via our user interface (http://bioportal.bioontology.org/resources) or via direct programmatic access through the Resource Index Web services (http://www.bioontology.org/wiki/index.php/NCBO_REST_services).
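The combination of direct and expanded annotations can be pictured with a toy Python sketch: recognized concepts annotate a record directly, and walking each concept's is-a parents yields the expanded annotations. The miniature hierarchy, the record text, and the substring-based "concept recognition" are hypothetical stand-ins; the real system uses MGREP and the BioPortal ontologies.

    # Toy is-a hierarchy: child -> parent (None at the root).
    parents = {
        "melanoma": "skin cancer",
        "skin cancer": "cancer",
        "cancer": None,
    }

    def expand(concept):
        # Climb the hierarchy to collect ancestor annotations.
        out = []
        node = parents.get(concept)
        while node is not None:
            out.append(node)
            node = parents.get(node)
        return out

    record = "Trial of adjuvant therapy in melanoma patients"
    direct = [c for c in parents if c in record]        # naive concept recognition
    expanded = {c: expand(c) for c in direct}
    print(direct)    # ['melanoma']
    print(expanded)  # {'melanoma': ['skin cancer', 'cancer']}

It is this expansion step that lets a query for "cancer" retrieve a record that only mentions "melanoma".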

