Generality and Reuse in a Common Type System for Clinical Natural Language Processing

Stephen T. Wu1, Vinod C. Kaggal1, Guergana K. Savova2, Hongfang Liu1, Dmitriy Dligach3, Jiaping Zheng2, Wendy W. Chapman4, Christopher G. Chute1

1 Mayo Clinic, Rochester, MN USA
2 Children's Hospital Boston and Harvard Medical School, Boston, MA USA
3 University of Colorado at Boulder, Boulder, CO USA
4 University of California, San Diego, San Diego, CA USA

ABSTRACT

The aim of Area 4 of the Strategic Healthcare IT Advanced Research Project (SHARP 4) is to facilitate secondary use of data stored in Electronic Medical Records (EMR) through high-throughput phenotyping. Clinical Natural Language Processing (NLP) plays an important role in transforming the information in clinical text into a standard representation that is comparable and interoperable. To meet the NLP requirements of different secondary use cases of EMR data, to accommodate different NLP approaches, and to enable interoperability between structured and unstructured data generated in different clinical settings, we define a common type system for clinical NLP that integrates a comprehensive model of clinical semantics with language processing types for SHARP 4. The type system has been implemented in UIMA (Unstructured Information Management Architecture), which allows for flexible passing of input and output data types among NLP components, and is available at the SHARP 4 website.

Categories and Subject Descriptors

J.3 [Computing Applications]: Life and Medical Sciences – Medical information systems; H.4.m [Information Systems Applications]

General Terms

Measurement, Documentation, Design, Reliability, Standardization.

Keywords

Natural language processing, medical semantics, interoperability, clinical information standards, Clinical Element Models, UIMA

1. INTRODUCTION

Extensive information about clinical practice has been stored in Electronic Medical Records (EMR). The aim of Area 4 of the Strategic Healthcare IT Advanced Research Project (SHARP 4, or SHARPn) is thus to reuse this electronically stored data to improve both practice and research. More specifically, efforts are underway to analyze records on a large scale, across many patients – an effort known as high-throughput phenotyping. In order to accomplish this, information across patients, areas of practice, and institutions must be comparable and interoperable. SHARP 4 has adopted Intermountain Healthcare's Clinical Element Models (CEMs) as the standardized format for information aggregation and comparison.


This representation is both concrete and specific, yet allows for some of the ambiguity that is inherent in clinicians' explanations of a clinical situation. However, a significant amount of information in the EMR is not available in any form that could be easily mapped to CEMs. Healthcare professionals naturally prefer to record a significant proportion of their information as human language rather than in more structured formats like CEMs. Therefore, Natural Language Processing (NLP) techniques are necessary to tap into this extensive source of clinical information.

One of the challenges in Clinical NLP is that there are many use cases. In practice, a clinician may only be interested in large-scale data about a small number of things, e.g., exposure to medications, peripheral arterial disease, or radiological findings. Some users may be interested in very precise information, while others will sacrifice precision to make sure every instance of a concept is found. Furthermore, the end target is a moving one – patient or document classification, cohort discovery, knowledge discovery, and question answering are all possible ways to use the information.

Another challenge in Clinical NLP is the diversity of methodological approaches that can be (and are) taken to discover this information. In practice, this often leads to disjointed systems, duplication of effort, and/or faulty independence assumptions. Expert, rule-based systems have historically been the most common despite their connection to a single use case, but significant research and clinical resource development has made machine learning techniques a viable option as well. Additionally, differences may arise from the environment in which medical records are generated, linguistic characteristics of the texts themselves, or the needs of use cases.

In this milieu we define a common type system that implements a comprehensive model of clinical semantics based on CEMs, may be used for arbitrary clinical use cases, and is compatible with a diversity of NLP approaches. This type system is designed in UIMA (Unstructured Information Management Architecture [1]), which allows for flexible passing of input and output data types between components of a clinical NLP system.

From an NLP perspective, this common type system embeds a deep semantic representation analogous to those that have been used in the computational semantics and dialogue systems communities [2, 3]. It distinguishes between semantic content that refers to real-world phenomena and the textual surface form used to communicate that semantics. However, we might expect the impact of a mature, deep semantic representation for Clinical NLP to be much greater, since this is an enabling technology for many downstream tasks like patient classification and high-throughput phenotyping.

Designing the type system to account for these deep semantics as output gives room for technological innovations around the CEM structure.

It should be noted that this type of semantic structure differs greatly from the ontology representations that are present in existing type systems, including the existing cTAKES type system. Ontology-mapped text defines the meaning only shallowly. Adding relations gives additional structure, but is often ad hoc and therefore not as usable as a well-structured deep semantic framework like CEMs. In addition to providing a well-developed semantic data model, the proposed type system provides the necessary data types to bridge from text to its semantics. In doing so, it allows for downstream access both to the more raw, textual data types and to the deeper semantic representation.

A preliminary implementation and documentation of the type system is available at the SHARPn website (www.sharpn.org); additionally, Mayo Clinic's popular open-source NLP tool, cTAKES (clinical Text Analysis and Knowledge Extraction System [4]), will adopt the type system in its future releases. We begin with background on UIMA and CEMs in Section 2, describe the type system in Section 3, and present some statistics and implications for further NLP research and development in Section 4.

2. BACKGROUND

2.1 UIMA and Type Systems

UIMA was originally designed by IBM to process documents such as text, speech, or video [1]. Here, we concern ourselves with clinical text as our modality and domain of input. Each clinical document that is processed within UIMA is automatically marked up (annotated) by means of components called analysis engines, which are often arranged in a pipeline. Analysis engines may be interchanged if they solve the same problems and annotate the data in the same way. However, the structure of the markup must be defined in order for analysis engines to be interoperable. A type system defines the structure for possible markup, providing the necessary data types for downstream components to make use of partially processed text, and giving upstream components a target representation for markup data. The data are then passed between these components in an efficient framework, the Common Analysis Structure (CAS), which includes the original document, the results of analysis, and indices over these results. To facilitate outputs from and inputs to UIMA, the CAS can also be efficiently serialized and de-serialized. Due to this architecture, UIMA enables interoperability between analysis engines and encourages the development of "best-of-breed" components. The current cTAKES type system (from which this work inherits) is shown in Figure 1 and described in Section 2.2.

Any UIMA system will have a type system [5-7], and there are also projects providing common type systems for diverse NLP use cases. For example, a common type system was defined for the JULIE Lab UIMA Component Repository (JCoRe), which consists of NLP components developed by the JULIE lab and those available in the public domain. Another common type system was defined by the U-Compare project, an evaluation platform for many existing NLP tools in the biomedical domain. Both of these type systems include types for storing syntactic, semantic, and document metadata annotations.
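As a concrete illustration of how a shared type system and the CAS make analysis engines interchangeable, the following minimal Java sketch reads Sentence annotations that an upstream component has placed in the CAS. It is a sketch only: the uimaFIT utilities are used for brevity, and the Sentence class is assumed to be generated by JCasGen from the shared type system descriptor (the class name and package arrangement are assumptions, not part of the published type system artifact).

    import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.fit.util.JCasUtil;
    import org.apache.uima.jcas.JCas;

    // Assumes a JCasGen-generated Sentence class (a subtype of
    // uima.tcas.Annotation) from the shared type system, in the same package.
    public class SentenceLengthAnnotator extends JCasAnnotator_ImplBase {

      @Override
      public void process(JCas jCas) throws AnalysisEngineProcessException {
        // Iterate over Sentence annotations that an upstream analysis engine
        // has already added to the CAS indices.
        for (Sentence sentence : JCasUtil.select(jCas, Sentence.class)) {
          // Any sentence detector that emits the shared Sentence type is
          // interchangeable with any other, as far as this component is concerned.
          int length = sentence.getEnd() - sentence.getBegin();
          System.out.println("Sentence (" + length + " chars): " + sentence.getCoveredText());
        }
      }
    }

Because the component depends only on the shared types, swapping the upstream sentence detector for a different implementation requires no change to this code.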

Figure 1: cTAKES v1.1 Analysis Engines and Data Types

The common type system reported here for SHARP 4 is the first attempt to provide a common type system for diverse NLP use cases involving clinical texts. Unlike other common type systems that target NLP tasks themselves, it takes the semantic interoperability between structured and unstructured data into consideration. Therefore, the most significant contributions of the proposed type system are the extensive semantic model based on CEMs and the separation between textual semantic types and referential (referring to the real world) semantic types. These features enable the development of diverse technologies that serve different clinical use cases.

The design attempts to follow best practices for UIMA type systems, as recommended by UIMA's original developers. These have to do with ease and completeness of representation, as well as computational cost. For example, we do not put locally used (component-specific) types in the CAS, as there is no garbage collection in UIMA and extra types only bloat the type system. Also, indices for types are reliably calculated, so that defining a subtype is an efficient way to subset data. Governance is also important, and we envision versioning and releases of the type system that are self-contained. It may then serve as a target for interoperability with other NLP systems.

2.2 Existing cTAKES Type System

The common type system introduced here is an extension of the cTAKES v1.1 type system, so we first present the existing type system. cTAKES was developed to operate as a single pipeline, and thus the type system often arises from the needs or capabilities of individual analysis engines. Therefore, Figure 1 shows cTAKES v1.1 analysis engines (bulleted entries in gray) above the related data types for those engines. To illustrate the interplay between the components and the type system, we have shown where the primary purpose of an analysis engine is to update the attributes of previously determined types, by designating set: type::attribute.

Figure 3: Proposed preprocessing and adjunct types

The Analysis Engine groupings (in bold) are not officially part of the type system, but they will help facilitate the discussion of the modifications and additions in the common type system. The top half of the diagram is comparatively well-developed, since there are associated fundamental general-domain NLP techniques. The bottom half is more idiosyncratic to clinical data, dealing with semantic meaning at both shallow and deep levels. For example, at the use case level, the recent Drug NER module requires an extensive set of additional annotations to find attributes associated with a drug mention. These concepts are not shared with other use cases. At a technology level, the LookupWindowAnnotation type is only used by the Dictionary Lookup component to find spans with possible NamedEntities.

For clarity in this depiction of the cTAKES type system, we have left off a few small types or subtypes and the inheritance structure. The majority of types inherit directly or indirectly from UIMA's Annotation type, which includes begin and end offsets as attributes. Lemmas, OntologyConcepts, and UmlsConcepts inherit directly from UIMA's TOP type, because these are normalized types, typically ones that refer to some external resource. OntologyConcept stores shallow semantics – a reference to an unambiguous external ontology; UmlsConcept is specifically for the oft-used Unified Medical Language System. In cTAKES v1.1, some components are considered "core" technologies that every project or use case would need; components at the left side of the diagram (Text Span, Morphology, and Named Entity Recognition types) are core types associated with the core technologies. Those at the right side (Syntax, Context, and Use Case types) are considered to be extensions that may or may not be relevant for a particular use case; they are thus included by reference to separate type systems, each associated with an analysis engine.

While the cTAKES type system is functional and practical, there was no focused design process that produced it. As a model of the data that can be discovered in clinical text, therefore, it is somewhat idiosyncratic, particularly in the semantics of the text. We now turn to the major organizing factor for semantics in the type system.

2.3 Clinical Element Models

Detailed models of clinical data are necessary to make secondary use of the semantic meaning in EMRs. Models such as the Clinical Element Models (CEMs) used in SHARPn provide normalized objects that are more amenable to computational algorithms. For SHARPn, six general templates of CEMs have been identified (though many more specific CEMs may be defined): Diseases and Disorders, Procedures, Signs and Symptoms, Medications, Anatomical Sites, and Labs. An example of a "cough" CEM might be sketched as follows.
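Because the concrete CEM serialization is specified by Intermountain Healthcare rather than in this paper, the outline below is only an illustrative sketch of the pieces such an instance would contain; the particular field values shown are assumptions based on the structural description that follows.

    Type:        cwe (coded with extensions)
    Key:         Assertion
    Value:       CoughType (a data property constraining the coded value for "cough")
    Qualifier:   periodicity (does not change the meaning of the value)
    Modifier:    negationInd (may reverse the asserted CoughType)
    Attribution: observed (who observed the finding, and in what context)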

This partial example would be built according to the Signs and Symptoms template. The basic structure of a CEM consists of a Type, a Key, and a Value Choice; Qualifiers, Modifiers, and Attributions give further detail. The Type is a coded value that represents the constraints to which all instances of a given model will conform (e.g., cwe – coded with extensions, or pq – physical quantity). The Key is a coded value for the real-world concept that an instance is attempting to describe (e.g., since we are modeling text, Assertion is a common key). Finally, the Value Choice is a choice between a "data property" or "items," where the former is a derivative of the HL7 version 3 data type "ANY," and the latter is a sequence of one or more clinical elements (e.g., "CoughType" is a data property, constraining data values).

Figure 4: Additional spanned types for linguistic processing

A qualifier captures information that does not change the meaning of the value choice (e.g., the "periodicity" of a cough). A modifier adds information that changes the meaning of the value choice (e.g., "negationInd" may reverse the asserted CoughType). An attribution defines an action and the contextual information for the action (e.g., "observed" gives the context of the Cough).

3. TYPE SYSTEM DESCRIPTION

The type system described here is an extensive update of the cTAKES v1.1 type system, with modifications, restructuring, and additions. The types from Text Span, Morphology, Syntax, Named Entity Recognition, and Context (Figure 1) are propagated into the common type system unless otherwise mentioned. Use case-specific types are dropped, as they would be local to custom components at the end of a pipeline. We now highlight the differences in three broad categories: Preprocessing types, Spanned Linguistic types, and Semantic Reuse types.

3.1 Preprocessing and Adjunct Types

Figure 2 shows three groups of types that serve as preprocessing for, or support of, syntactic and semantic analysis.

3.1.1 Structured Data Types

Aside from DocumentID, cTAKES v1.1 largely ignored information stored in structured data. For documents coming directly from institutions (rather than as standalone free text), structured data may be a significant source of information in the EMR. In the context of Natural Language Processing, structured data may prove useful because it can give a concrete sense of the context in which the note was generated. For instance, a coreference resolver may seek to determine to whom the pronoun she refers; if the note is about a male patient whose mother is present, Demographic data such as the patient's gender may be useful. The source of a document may also be important; typical grammatical parsing models may perform acceptably on well-formed discharge summaries, but suffer when encountering more abbreviated grammatical styles. It should be noted that the Metadata type contains an "other" attribute that allows for an arbitrary number of additional document-level structured data attributes.

3.1.2 Utility Types

This preliminary type system replaces a type from cTAKES v1.1 called Property (which is not pictured in Figure 1); the Pair type is used here to allow a brute-force implementation of a hash table, since UIMA does not provide a map or hash structure. A probability distribution is then a kind of hash that stores keys in "attribute" and probabilities in "value."
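A minimal sketch of this idea follows, assuming a JCasGen-generated Pair class with String-valued "attribute" and "value" features as described above; the class name, accessor names, and String encoding of probabilities are assumptions for the sake of illustration.

    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.cas.FSArray;

    // Assumes a JCasGen-generated Pair class (a TOP subtype) with String-valued
    // "attribute" and "value" features.
    public final class DistributionUtil {

      private DistributionUtil() {}

      // Emulates a map over the CAS: one Pair per key, collected in an FSArray,
      // since UIMA itself provides no map or hash structure.
      public static FSArray asDistribution(JCas jCas, String[] keys, double[] probabilities) {
        FSArray pairs = new FSArray(jCas, keys.length);
        for (int i = 0; i < keys.length; i++) {
          Pair pair = new Pair(jCas);
          pair.setAttribute(keys[i]);                         // the key
          pair.setValue(Double.toString(probabilities[i]));   // the probability
          pair.addToIndexes();
          pairs.set(i, pair);
        }
        return pairs;
      }
    }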

3.1.3 Text Span Types

In addition to the existing Segment and Sentence types from the previous type system, Paragraph and List have been added as types that subdivide a text. Components in cTAKES do not currently discover or make use of Paragraph or List structures. Paragraph boundaries may be important, e.g., to resolve clinical relations or coreference; they may also contribute towards modeling cognitive state in measures of semantic distance. Lists are arrays of other Annotations, so that bulleted paragraphs, sentences, chunks, or even tokens may be considered as lists. Syntax and discourse structure in lists may differ from other types of text.

3.2 Spanned Linguistic Types

The Morphology and Syntax types of cTAKES v1.1 are still present in the common type system, with a few updates. We also add support for relations, as shown in Figure 4.

3.2.1 Syntactic Types

Because syntactic processing has been well studied in NLP, the common type system adopts established standards for its syntactic types. Deep parses can be represented in the Treebank format [8]. TerminalTreebankNode annotations are included to tie the tree down to real tokens and to represent gaps. TopTreebankNode contains a list of terminal nodes, so that traversal of the tree may proceed from the bottom up. Of course, the tree may also be traversed from a top node by following the "children" attribute.

The dependency parse representation, similar to cTAKES v1.1, is based on the CoNLL-X format of dependency nodes [9]. Each word in a sentence is stored as "form", while its syntactic head is a reference to another node. When producing labeled trees, "deprel" is the relation labeled on each word-to-word edge.

A similar syntactic formalism is the Stanford dependencies [10]. These are in fact triples of two tokens plus the relationship between them.

Thus, they are represented as binary relation types that relate heads to dependents. Semantic roles are also represented as relations; it is envisaged that these will relate a verb to its arguments or modifiers, as defined in the PropBank guidelines [11].
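The short sketch below walks a CoNLL-X style dependency parse under this type system, assuming a JCasGen-generated ConllDependencyNode class with the "form", "deprel", and "head" features described above (the class and accessor names are assumptions).

    import org.apache.uima.fit.util.JCasUtil;
    import org.apache.uima.jcas.JCas;

    // Assumes a JCasGen-generated ConllDependencyNode class with "form",
    // "deprel", and "head" features.
    public final class DependencyPrinter {

      private DependencyPrinter() {}

      // Prints each word together with the labeled edge to its syntactic head.
      public static void printEdges(JCas jCas) {
        for (ConllDependencyNode node : JCasUtil.select(jCas, ConllDependencyNode.class)) {
          ConllDependencyNode head = node.getHead();   // reference to another node
          String headForm = (head == null) ? "ROOT" : head.getForm();
          System.out.println(node.getForm() + " --" + node.getDeprel() + "--> " + headForm);
        }
      }
    }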

3.2.2 Relation Types

Relations themselves are not tied to specific spans of text. The Relation type itself is not intended for practical use; its existence simply ensures that all subtypes have a "category" attribute (the label or semantic type of the relation) and the ability to represent negative ("polarity") or uncertain ("uncertainty") relationships. Relation subtypes that connect textual types (BinaryTextRelation and CollectionRelation) use RelationArguments. The main attribute of a RelationArgument is its argument, which as a UIMA Annotation is enforced to span text. Depending on the type of relation, there may be a particular "role" to play in the relationship; the head and dependent roles in a Stanford dependency are an example. RelationArgument also includes a "participatesIn" attribute that stores all of the relationships that the argument is a part of. UIMA can quickly iterate over the index of Relation subtypes and find the arguments, but this attribute makes the reverse process easier. This may be necessary to consolidate all of the arguments and modifiers of a verb (SemanticRole relations), or when aggregating everything that has been said about some UMLS concept.

One non-syntactic binary relation subtype that has been included is CoreferenceRelation. By subtyping from Relation, this should be conformant to, or mappable from, the MUC-7 standards (see www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html). These coreference relations would then have IDENT as their "category," since they relate two text spans that refer to the same entity.

UMLSRelations are also included in the type system. There are over 50 relations defined in the UMLS that may be stored in the "category" attribute. It should be noted that this type only deals with relationships that are tied to text. The same relationships from the UMLS are also present in the referential semantic parts of the type system (Section 3.3.2).
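As an example of the relation machinery, the following sketch builds a BinaryTextRelation between two spans, wrapping each span in a RelationArgument. The generated class names mirror the attributes described above, but the arg1/arg2 feature names and the integer encoding of polarity are assumptions.

    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.tcas.Annotation;

    // Assumes JCasGen-generated RelationArgument and BinaryTextRelation classes;
    // the arg1/arg2 feature names and the integer polarity encoding are assumptions.
    public final class RelationBuilder {

      private RelationBuilder() {}

      // Relates two spanned annotations (e.g., the head and dependent of a
      // Stanford dependency) with a labeled binary text relation.
      public static BinaryTextRelation relate(JCas jCas, Annotation head,
          Annotation dependent, String category) {
        RelationArgument headArg = new RelationArgument(jCas);
        headArg.setArgument(head);         // the spanned text being related
        headArg.setRole("head");
        headArg.addToIndexes();

        RelationArgument depArg = new RelationArgument(jCas);
        depArg.setArgument(dependent);
        depArg.setRole("dependent");
        depArg.addToIndexes();

        BinaryTextRelation relation = new BinaryTextRelation(jCas);
        relation.setCategory(category);    // the relation label, e.g. "nsubj"
        relation.setPolarity(1);           // assumed: 1 = asserted, -1 = negated
        relation.setArg1(headArg);
        relation.setArg2(depArg);
        relation.addToIndexes();
        return relation;
      }
    }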

3.3 Semantic Reuse Types

Figure 6 represents the main contribution of this type system. For ease of understanding, the type system is broken down into several different sections, with a clean separation between textual semantic types (bottom left, inheriting from Annotation) in Section 3.3.1 and referential semantic types (the rest of Figure 6) in Section 3.3.2.

3.3.1 Textual Semantic Types

Textual semantic types are similar to Named Entities and Events as defined by the Automatic Content Extraction and MUC-7 tasks, and by emerging ISO standards. The IdentifiedAnnotation type is a reorganization of cTAKES' original IdentifiedAnnotation and NamedEntity types. We assume that any IdentifiedAnnotation is a span of text that must be discovered, and may have a specific "typeID" (e.g., enumerating drugs, disorders, etc.) or, more flexibly, some "category" (e.g., a TUI from UMLS). As a first pass at referring to the outside world for entities, IdentifiedAnnotations may be mapped to OntologyConcepts. We allow multiple hypotheses about how the text should be mapped to an ontology; these are stored in an array.


In other cases (e.g., for Time, which is based on ISO TimeML's TIMEX3 [12]), this array may remain unused. The array allows (but does not require) users to utilize techniques that separate word sense disambiguation (WSD) from the initial recognition.

It should be clear from the subtypes of IdentifiedAnnotation that a deeper semantic mapping is possible for text spans. For example, "whiplash" has the SNOMED code 39848009, and this may be used more than once in a text to refer to different real-world entities: "Patient experienced some whiplash; the other passengers also complained of whiplash." Thus, subtypes of IdentifiedAnnotation each have an attribute that refers to a deeper semantic representation, to be described below.
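The following sketch shows how a mention might carry multiple ontology-mapping hypotheses before word sense disambiguation, using an FSArray of UmlsConcept objects. The generated class names, the "cui" feature, and the name of the hypothesis array are assumptions.

    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.cas.FSArray;

    // Assumes JCasGen-generated IdentifiedAnnotation and UmlsConcept classes;
    // the "cui" feature and the name of the hypothesis array are assumptions.
    public final class MentionFactory {

      private MentionFactory() {}

      // Creates an identified span carrying several competing UMLS mappings;
      // a later WSD step may narrow these down to a single concept.
      public static IdentifiedAnnotation createMention(JCas jCas, int begin, int end,
          String[] candidateCuis) {
        IdentifiedAnnotation mention = new IdentifiedAnnotation(jCas, begin, end);

        FSArray hypotheses = new FSArray(jCas, candidateCuis.length);
        for (int i = 0; i < candidateCuis.length; i++) {
          UmlsConcept concept = new UmlsConcept(jCas);
          concept.setCui(candidateCuis[i]);     // one candidate UMLS concept
          concept.addToIndexes();
          hypotheses.set(i, concept);
        }
        mention.setOntologyConceptArr(hypotheses);
        mention.addToIndexes();
        return mention;
      }
    }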

3.3.2 Referential Semantic Types

The remaining types in Figure 6 represent semantics that refer to something in the real world, similar to (but greatly extending) the referential OntologyConcept type from cTAKES v1.1. This begins with the definition of an Element, which is a general-purpose unit of information in the context of clinical care. Since it is expected that IdentifiedAnnotations will refer to Elements, Elements contain some of the deeper semantic information such as "polarity" (negation) and "uncertainty" (possible, conditional, etc.). An Element currently assumes a single, disambiguated word sense, set as its "ontologyConcept." Because clinical text is almost invariably about people's health, a "subject" attribute identifies the person; "generic" indicates that the concept is used abstractly and not as something tangible in the real world. Also, because multiple spans in the text can refer to the same thing, Elements keep a "mention" array that allows quick access to mapped text. Again, some of these attributes may remain unused for subtypes of Element, such as the "polarity" attribute for Time and Date.
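A sketch of populating a referential Element from disambiguated mention-level information follows, illustrating the polarity, subject, generic, and mention-array attributes just described; the class names, accessor names, and attribute encodings are assumptions.

    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.cas.FSArray;

    // Assumes JCasGen-generated Element, OntologyConcept, and IdentifiedAnnotation
    // classes; accessor names and attribute encodings are assumptions.
    public final class ElementFactory {

      private ElementFactory() {}

      // Builds a real-world Element that one or more textual mentions refer to.
      public static Element createElement(JCas jCas, OntologyConcept sense,
          IdentifiedAnnotation[] mentions, boolean negated) {
        Element element = new Element(jCas);
        element.setOntologyConcept(sense);       // a single, disambiguated word sense
        element.setPolarity(negated ? -1 : 1);   // assumed encoding of negation
        element.setSubject("patient");           // whose health the concept concerns
        element.setGeneric(false);               // refers to something tangible

        FSArray mentionArray = new FSArray(jCas, mentions.length);
        for (int i = 0; i < mentions.length; i++) {
          mentionArray.set(i, mentions[i]);      // quick access back to the text
        }
        element.setMentions(mentionArray);
        element.addToIndexes();
        return element;
      }
    }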

Clinical Element Types

Although the Element type is general enough to handle real-world entities of many kinds, we take special effort to develop structure for things in the clinical domain. These types, in the bottom right corner of Figure 6, directly use the structure in CEMs as defined by the SHARP task. There is a division between Entities and Events, where the only distinction is whether the Element is able to have a temporal aspect. Thus, since a physical entity like a Medication has temporal importance for a patient, it is classified as an Event. EventProperties are based on ISO TimeML and included for downstream compatibility with temporal relation applications.

SHARP 4 has identified six general NLP-relevant CEMs derived from the fine-grained specific representations: Anatomical Sites, Diseases and Disorders, Labs, Medications, Procedures, and Signs and Symptoms. Each of these clinical elements has different attributes; for example, a "strength" attribute as used in Medications would be irrelevant for Procedures. In order to encourage interoperability between modules and maintain the CEMs as a truly normalized semantic representation, we have opted for stronger typing of these attributes. To define a "bodySide," for instance, we enforce the usage of BodySide rather than just a String indicating "left" or "right."
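To illustrate the stronger typing of clinical attributes, the sketch below attaches a BodySide value to an AnatomicalSite element rather than storing a bare string on the element itself; the class names, accessor names, and the String-valued BodySide content are assumptions.

    import org.apache.uima.jcas.JCas;

    // Assumes JCasGen-generated AnatomicalSite, BodySide, and OntologyConcept
    // classes; accessor names and the String-valued BodySide are assumptions.
    public final class AnatomicalSiteFactory {

      private AnatomicalSiteFactory() {}

      // Attaches a typed BodySide attribute rather than a raw "left"/"right" String
      // on the element, so downstream consumers see a normalized representation.
      public static AnatomicalSite createSite(JCas jCas, OntologyConcept siteConcept,
          String normalizedSide) {
        BodySide bodySide = new BodySide(jCas);
        bodySide.setValue(normalizedSide);       // e.g. a normalized code for "left"
        bodySide.addToIndexes();

        AnatomicalSite site = new AnatomicalSite(jCas);
        site.setOntologyConcept(siteConcept);
        site.setBodySide(bodySide);              // strongly typed, not free text
        site.addToIndexes();
        return site;
      }
    }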

Clinical Attribute Types

Despite the fact that Clinical Elements are strongly typed (e.g., qualifiers might be PQ – physical quantities in ISO/FDIS 21090), Attribute types in the current model simply contain strings or numbers with units, as a practical matter. Users would then internally restrict the values that attributes can take in their own components.

Figure 6: Common types for semantic reuse

Semantic Relation Types

ElementRelations and AttributeRelations extend the Relation type rather than the BinaryTextRelation type; thus, they also operate without a spanned location. Therefore, the types of relationships they are designed to express are those between real-world objects. For example, a coreference relation would be inappropriate here, because it links multiple textual mentions as referring to the same entity.

A relation of this kind might instead be a higher-level relationship between two such coreference-resolved entities. ElementRelations are used for things like TemporalRelations that link Time, Date, or Event annotations.

Figure 8: Alternate pipelined architectures for extracting CEMs from clinical text

Again, by inheriting from Relation, these TemporalRelations implement the essential parts of the ISO TimeML TLINK relationship. Other subtypes of ElementRelation include specific UMLS relations; these differ from those presented in Section 3.2.2 in that they hold between referential semantic objects rather than spans of text. Under this definition, AttributeRelations are slightly more restricted than ElementRelations in that they require one of the arguments to be an Attribute.
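A brief sketch of a TemporalRelation linking two Elements (rather than text spans), in the spirit of the TimeML TLINK just described; the class name is taken from the text, but the argument feature names and category values are assumptions.

    import org.apache.uima.jcas.JCas;

    // Assumes a JCasGen-generated TemporalRelation class (a subtype of
    // ElementRelation); the argument feature names are assumptions.
    public final class TemporalLinker {

      private TemporalLinker() {}

      // Relates two referential Elements (e.g., an Event and a Time) without
      // reference to any span of text, in the spirit of a TimeML TLINK.
      public static TemporalRelation link(JCas jCas, Element source, Element target,
          String category) {
        TemporalRelation tlink = new TemporalRelation(jCas);
        tlink.setCategory(category);   // e.g. "BEFORE" or "OVERLAP"
        tlink.setArg1(source);
        tlink.setArg2(target);
        tlink.addToIndexes();
        return tlink;
      }
    }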

4. RESULTS AND DISCUSSION

As this type system has yet to be deployed in the context of a full NLP system, we here report statistics concerning the type system itself and discuss implications for NLP system design. The defined common type system contains a total of 96 types and 171 attributes. This is expanded from 60 types in cTAKES v1.1; 38 of the types are modified, 22 are deleted (from Use Cases), and 58 are newly defined, as detailed in previous sections. The average number of attributes per type has also increased, from 1.63 in cTAKES v1.1 to 1.78. The detailed semantic CEMs are a significant factor in this. The impact of CEMs can also be seen in the conceptual groupings below.

Table 1: Distribution of types in the Common Type System

  Type Subdivision   # of Types   Pct
  Structured          4            4.17%
  Syntax             26           27.08%
  RefSem             31           32.29%
  TextSem            13           13.54%
  TextSpan            5            5.21%
  Util                3            3.13%
  Relation           14           14.58%
  Total              96          100.00%

4.1 Use case generality

Part of the goal with this common type system is to reduce the amount of effort required to make use of NLP results on a different use case. The CEMs as used in SHARP 4 comprehensively define highly frequent, highly relevant clinical elements. As such, they aim to be a sufficient semantic representation for the majority of use cases.

Many methodologies developed using the common type system should then be amenable to different tasks. Of course, expert-based and ad hoc methodologies are also supported in the type system. The difference between these lies in the actual analysis engines themselves, rather than in the type system.

4.2 Technology interoperability

One of the implications of the common type system is that many different technological architectures and methodologies may be used to accomplish the end goal of normalization to CEMs. This improves on the original cTAKES type system, which clearly grows out of specific UIMA analysis engines. Two example architectures are shown in Figure 7, replacing the bottom half of Figure 1. One is organized around what can be discovered from text, mapping to CEM structures at the end; the other is centered on the target CEMs, updating attributes piece by piece.
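Because all components read and write the same CAS types, either architecture can be assembled declaratively. The following uimaFIT sketch wires a text-centric ordering of annotators into a single pipeline; the three annotator classes (MentionAnnotator, RelationAnnotator, CemNormalizationAnnotator) are hypothetical placeholders for components that emit the shared mention, relation, and Element types.

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import org.apache.uima.fit.factory.AnalysisEngineFactory;
    import org.apache.uima.fit.factory.JCasFactory;
    import org.apache.uima.fit.pipeline.SimplePipeline;
    import org.apache.uima.jcas.JCas;

    // The three annotator classes named below are placeholders for components
    // that read and write the shared types (mentions, relations, Elements).
    public final class TextCentricPipeline {

      public static void main(String[] args) throws Exception {
        AnalysisEngineDescription mentions =
            AnalysisEngineFactory.createEngineDescription(MentionAnnotator.class);
        AnalysisEngineDescription relations =
            AnalysisEngineFactory.createEngineDescription(RelationAnnotator.class);
        AnalysisEngineDescription normalization =
            AnalysisEngineFactory.createEngineDescription(CemNormalizationAnnotator.class);

        JCas jCas = JCasFactory.createJCas();
        jCas.setDocumentText("The patient denies cough or shortness of breath.");

        // Swapping in a different, type-compatible component changes the
        // methodology without changing the pipeline's contract.
        SimplePipeline.runPipeline(jCas, mentions, relations, normalization);
      }
    }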

4.2.1 Text-centric pipeline

A text-centric pipeline (top) begins by identifying text spans of interest. Here, rule-based or machine learning methods may be used to identify EntityMentions (e.g., anatomical sites, or linguistic mentions like anaphora), EventMentions (e.g., diseases and disorders, labs, medications, procedures, signs and symptoms), Modifiers (e.g., negation indicators, uncertainty indicators), and TimeMentions (one could also use DateAnnotation or TimeAnnotation from the older cTAKES). Finding all of these mentions at once simplifies the organization of the processing. More importantly, this allows for methodologies that discover or induce multiple tokens at the same time. For example, sequence-based Hidden Markov Models or Conditional Random Fields may be used to jointly model when NEs begin and when they transition to modifiers, or vice versa. Or, topic models may be extended to include an additional semantic type variable that covers all the different types of mentions. MedLEE-style knowledge-based systems [13] could similarly make use of a semantic lexicon to extract these different types of mentions; other rule-based systems could have a single repository for all text-based rules.


Once all the interesting text spans are identified, UMLS relations or coreference (identity) relations may be discovered between them. This includes entity-entity and entity-attribute UMLSRelations, and CoreferenceRelations. Discovering all the relations at once organizes the system and consolidates the rules or techniques used. Additionally, upstream information is available that may help in processing; for example, knowing that one kind of Modifier is in the vicinity of an NE could influence whether it participates in a relationship.


The final step in the first pipeline is semantic normalization, which includes word sense disambiguation (WSD) and then filling out all the parameters of an Element. With all the concept and relation identification already done, this may be accomplished as a simple mapping procedure.


Despite the benefits in code organization and the possibilities for methodology development, there are tradeoffs to this text-centric organization. A component cannot make use of the output of downstream components. For example, determining whether to look for modifiers cannot take into account disambiguated word senses or linked coreferring EntityMentions.

4.2.2 Element-centric pipeline

The element-centric pipeline starts by trying to find the Entities and Events. To get there, we may include the pipelined tasks of mapping mentions to an ontology code, word sense disambiguation, and coreference resolution. This organization allows for a different set of techniques than in Section 4.2.1. For example, coreference chains could be resolved at the same time as NE recognition [14] or concurrently with word sense disambiguation. Since cognitive models of language processing show that humans do not perform these linguistic tasks separately, there is promise in joint inference.

A similar synergy is possible for attributes in the second task of the element-centric pipeline, which is to assign Attributes and Relations to the Element. Recognizing textual Modifiers may be done more effectively when considering what AttributeRelations they might be involved in, or both may be done together. It should also be noted that even without combining Modifier recognition with relation recognition, this architecture has the ability to do attribute discovery for specific, established Elements. Such discovered Elements may be highly significant features in the discovery of Attributes and Relations. This strength of the pipeline is also a weakness, however, in that errors will cascade if the initial step is inaccurate.

Of course, there are a multitude of other architectural options besides these two. It is hoped that the common type system is able to sufficiently represent the quantities of interest, such that alternative methodologies can be interoperable at intermediate stages of processing; regardless, any full pipeline will be interoperable through the same deep semantic structure.

5. CONCLUSION

We have presented a comprehensive type system for Clinical NLP that defines a standard for deep computational semantic processing and builds upon the type system of cTAKES v1.1. The semantic representation presented here is a conglomeration of semantic standards like CEMs and ISO TimeML. Additionally, elements of previous type systems and language processing standards for preprocessing, morphology, and syntax are included in the type system. We have argued that development under this new type system shows promise for tackling a wide variety of use cases due to its comprehensive semantic model, and we have illustrated some possible methodological architectures that are compatible with the type system. It is hoped that this emerging, practical type system will be used by the community at large, as it provides a nurturing context for a diversity of activities in Clinical NLP.

6. ACKNOWLEDGMENTS

We thank Marshall Schor for the discussion on type system principles. Thanks to Peter Szolovits, Lee Christensen, Scott Halgrim, Cheryl Clark, Jon Aberdeen, Arya Tafvizi, Ken Burford, and Jay Doughty. Also, special thanks to Burn Lewis and Guy Divita for informative presentations. Finally, thanks to Andrew Sheppard for aid in data recovery.

This work was funded in part by the SHARPn (Strategic Health IT Advanced Research Projects) Area 4: Secondary Use of EHR Data Cooperative Agreement from the HHS Office of the National Coordinator, Washington, DC. DHHS 90TR000201.

7. REFERENCES

[1] Ferrucci, D. and Lally, A. UIMA: an architectural approach to unstructured information processing in the corporate research environment. Nat. Lang. Eng., 10, 3-4 (Sept 2004), 327-348.
[2] Klabbers, E., Odijk, J., De Pijper, J. and Theune, M. GoalGetter: Football results, from teletext to speech. IPO Annual Progress Report, 31 (1996), 66-75.
[3] Stent, A., Dowding, J., Gawron, J. M., Bratt, E. O. and Moore, R. The CommandTalk spoken dialogue system. In Proc. 37th Annual Meeting of the Association for Computational Linguistics (College Park, MD, 1999), 183-190.
[4] Savova, G. K., Masanz, J. J., Ogren, P. V., Zheng, J., Sohn, S., Kipper-Schuler, K. C. and Chute, C. G. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc, 17, 5 (Sep-Oct 2010), 507-513.
[5] Verspoor, K., Baumgartner Jr., W., Roeder, C. and Hunter, L. Abstracting the types away from a UIMA type system. In C. Chiarcos, E. de Castilho and M. Stede (eds.), From Form to Meaning: Processing Texts Automatically. Narr, Tubingen, 2009.
[6] Hahn, U., Buyko, E., Landefeld, R., Mühlhausen, M., Poprat, M., Tomanek, K. and Wermter, J. An overview of JCoRe, the JULIE lab UIMA component repository. In Proceedings of LREC (Marrakech, Morocco, 2008), 1-7.
[7] Kano, Y., Baumgartner, W. A., Jr., McCrohon, L., Ananiadou, S., Cohen, K. B., Hunter, L. and Tsujii, J. U-Compare: share and compare text mining tools with UIMA. Bioinformatics, 25, 15 (Aug 2009), 1997-1998.
[8] Marcus, M. P., Marcinkiewicz, M. A. and Santorini, B. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 2 (June 1993), 313-330.
[9] Buchholz, S. and Marsi, E. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning (New York City, New York, 2006), 149-164.
[10] de Marneffe, M.-C. and Manning, C. D. The Stanford typed dependencies representation. In Coling 2008: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation (Manchester, United Kingdom, 2008), 1-8.
[11] Kingsbury, P. and Palmer, M. PropBank: the next level of Treebank. In Proc. Treebanks and Lexical Theories (2003).
[12] Pustejovsky, J., Hanks, P., Sauri, R., See, A., Gaizauskas, R., Setzer, A., Radev, D., Sundheim, B., Day, D. and Ferro, L. The TimeBank corpus. In Proceedings of Corpus Linguistics 2003 (2003), 647-656.
[13] Friedman, C., Kra, P. and Rzhetsky, A. Two biomedical sublanguages: a description based on the theories of Zellig Harris. J. of Biomedical Informatics, 35, 4 (August 2002), 222-235.
[14] Haghighi, A. and Klein, D. An entity-level approach to information extraction. In Proceedings of the ACL 2010 Conference Short Papers (Uppsala, Sweden, 2010), 291-295.
