Using Hedges to Enhance a Disease Outbreak Report Text Mining System

Mike Conway, Nigel Collier
National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
{mike|collier}@nii.ac.jp

Son Doan
Vanderbilt University Medical Center
2525 West End Ave., Suite 800, Nashville, TN 37235, USA
[email protected]

1 Introduction

Identifying serious infectious disease outbreaks in their early stages is an important task, both for national governments and for international organizations such as the World Health Organization. Text mining and information extraction systems can provide a low-cost, timely early warning system in these circumstances by automatically identifying the first signs of an outbreak in online news text. One interesting characteristic of disease outbreak reports, which to the best of our knowledge has not been studied before, is their use of speculative language (hedging) to describe uncertain situations. This paper describes two uses of hedging to enhance the BioCaster disease outbreak report text mining system.

Following a brief description of the BioCaster system and corpus (Section 2), Section 3 discusses previous uses of hedging in NLP and the methods used to identify hedges in the current work. Section 4 describes some initial classification experiments using hedge features. Section 5 describes a “speculative” method of tagging disease outbreak reports with a metric designed to help users of the BioCaster system identify articles of interest.

2 BioCaster System & Corpus

The BioCaster system scans online news reports for stories concerning infectious disease outbreaks (e.g. H5N1, Ebola) and makes its results available to registered users as email alerts (Collier et al., 2008). In addition to this email service, data that has passed through a topic classifier but is otherwise uninterpreted is used to populate a Google Map application called the Global Health Monitor.[1]

The BioCaster corpus consists of 1000 news articles downloaded from the WWW and then manually categorized and annotated with named entities by two PhD students. Articles were collected from various news sources (e.g. BBC, New York Times and ProMED-Mail[2]). Each document is classified as either relevant (350) or reject (650).[3] The corpus is designed to include difficult borderline cases where a more advanced understanding of the context is required. For example, an article may be about polio without being centrally concerned with specific outbreaks of that disease; it could instead report a vaccination campaign or a research breakthrough.

[1] www.biocaster.org
[2] ProMED-Mail is a human-curated service for monitoring disease outbreak reports (www.promedmail.org).
[3] For copyright reasons, the BioCaster corpus is not publicly available.

Proceedings of the Workshop on BioNLP, pages 142–143, Boulder, Colorado, June 2009. © 2009 Association for Computational Linguistics

3 Hedges

According to Hyland (1998), in an extensive study of speculative language in science writing, hedges “are the means by which writers can present a proposition as an opinion rather than a fact.” More recently, Kilicoglu and Bergler (2008) have presented a method for automatically identifying hedges in the biomedical domain. In the current work, we used a science-oriented hedge lexicon derived from Mercer et al. (2004). The lexicon consisted of 72 verbs (including appear, appears, appeared, appearing, indicate, indicates, indicated, indicating, and so on) and 32 non-verbs (including about, quite, potentially, likely, and so on). Preliminary work showed that the frequency of hedge words differs between the two categories of the BioCaster corpus (relevant and reject) at a highly significant level according to the χ² test (P < 0.01). Table 1 shows the 16 most discriminating hedge words in the BioCaster corpus, identified using the χ² feature selection method.

Rank  Hedge        Rank  Hedge
1     reported     9     suggests
2     suspected    10    estimated
3     probable     11    appeared
4     suspect      12    appearing
5     usually      13    mostly
6     see          14    assumes
7     reports      15    predicted
8     sought       16    suggested

Table 1: Statistically Significant Hedge Features

                 Naive Bayes      SVM
Feature set      Acc     F        Acc     F
9000 χ²          94.8    0.93     92.2    0.89
Unigram          88.4    0.85     90.9    0.87
Unigram+hedge    88.0    0.85     91.7    0.89

Table 2: Classification Results
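The χ² comparison described in this section can be illustrated with a minimal sketch. It assumes that each hedge word is scored from a 2×2 table of document counts (documents that do or do not contain the word, in each category); the paper does not say whether document presence or token frequency was used, so this is one plausible reading rather than the authors' procedure. The function names and the toy documents are hypothetical.

```python
# Critical value of the chi-square distribution for P = 0.01 with 1 degree of freedom.
CRITICAL_P01_DF1 = 6.635


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        return 0.0  # a degenerate table carries no evidence of association
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def hedge_term_score(term, relevant_docs, reject_docs):
    """Score one term; each document is a set of lowercased tokens (assumption)."""
    a = sum(term in doc for doc in relevant_docs)  # relevant, contains term
    b = len(relevant_docs) - a                     # relevant, lacks term
    c = sum(term in doc for doc in reject_docs)    # reject, contains term
    d = len(reject_docs) - c                       # reject, lacks term
    return chi_square_2x2(a, b, c, d)


def rank_hedges(lexicon, relevant_docs, reject_docs):
    """Return (term, score) pairs sorted from most to least discriminating."""
    scores = {t: hedge_term_score(t, relevant_docs, reject_docs) for t in lexicon}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy data, for illustration only (not drawn from the BioCaster corpus).
relevant = [{"suspected", "usually"}, {"suspected", "usually"}, {"suspected"}, {"cases"}]
reject = [{"usually"}, {"usually"}, {"vaccine"}, {"research"}]
ranking = rank_hedges(["usually", "suspected"], relevant, reject)
```

With only eight toy documents no score exceeds `CRITICAL_P01_DF1`; the example is meant to show the ranking step, not to reproduce the significance result reported for the real corpus.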

4 Classification Experiment

The current BioCaster system uses n-gram based text classification to identify disease outbreak reports and to reject other online news. We used hedging features to augment this classifier, and evaluated the results using a subset of the BioCaster corpus. One binary hedging feature was used. The feature was “true” if and only if at least one of the 105 hedge lexemes identified by Mercer et al. (2004) occurred in the input document within 5 words of a disease named entity. Results are shown in Table 2. Adding this single binary hedge feature to the unigram feature set increases SVM accuracy by 0.8 percentage points (from 90.9% to 91.7%), although Naive Bayes accuracy falls slightly (from 88.4% to 88.0%). In neither case does performance reach the level achieved by the χ² 9000 n-gram feature set described in Conway et al. (2008).
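The binary feature described above can be sketched as follows. This is a minimal illustration under several assumptions not stated in the paper: the document is already tokenized, disease named entities are given as token spans, matching is case-insensitive, and "within 5 words" is measured from the nearest edge of the entity. The function name and the tiny lexicon are hypothetical.

```python
def hedge_near_disease(tokens, disease_spans, hedge_lexicon, window=5):
    """Return True if any hedge word lies within `window` tokens of a disease entity.

    tokens        -- list of word tokens for one document
    disease_spans -- list of (start, end) token indices, end exclusive (assumed format)
    hedge_lexicon -- set of lowercased hedge words
    """
    hedge_positions = [i for i, tok in enumerate(tokens) if tok.lower() in hedge_lexicon]
    for start, end in disease_spans:
        for i in hedge_positions:
            if i < start:
                distance = start - i          # hedge precedes the entity
            elif i >= end:
                distance = i - (end - 1)      # hedge follows the entity
            else:
                distance = 0                  # hedge overlaps the entity span
            if distance <= window:
                return True
    return False


# Invented example sentence; "cholera" is token 6.
sentence = "Officials reported three suspected cases of cholera in the region".split()
feature = hedge_near_disease(sentence, [(6, 7)], {"reported", "suspected"})
```

The resulting boolean would simply be appended to the unigram feature vector for the document before training the classifier.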

5 Towards a “Speculative” Metric

Users of the BioCaster system would benefit from an indicator of how “speculative” each news article is, as breaking news regarding disease outbreaks is characterized by uncertainty, which is encoded using hedging. We use the Mercer list of 105 hedging words as described above, in conjunction with statistics derived from a 10,000 document section of the Reuters corpus, to provide a “speculative” metric.[4] We calculated the total frequency of all 105 hedge words in each of the 10,000 Reuters documents (that is, the total number of hedge words per document), normalized these frequencies to take account of document length, and then ranked the documents. The bottom third of documents had hedge percentages in the range 0% - 0.2544% (LOW). The middle third had hedge percentages in the range 0.2545% - 1.0574% (MEDIUM). The range for the top third was 1.0575% - 100% (HIGH). Documents input to the BioCaster system automatically have their proportion of hedge words calculated and are assigned a value according to their position on the scale (LOW, MEDIUM or HIGH). Table 3 shows that a majority (64.2%) of the documents in the accept segment of the BioCaster corpus are tagged as highly speculative using this method, compared with 48.3% of the documents in the reject segment.

          Accept (%)   Reject (%)
High      64.2         48.3
Medium    29.5         36.7
Low       6.3          15.0

Table 3: Proportion of Articles in Each Category
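The banding scheme in this section can be sketched as below. The two thresholds are the ones quoted in the text; everything else is an assumption made for illustration: the hedge percentage is taken to be hedge tokens divided by all tokens, matching is case-insensitive, and the helper that derives cut points from a reference corpus uses a simple sorted-rank split. All function names are hypothetical.

```python
LOW_MAX = 0.2544     # upper bound of the LOW band, in percent (from the text)
MEDIUM_MAX = 1.0574  # upper bound of the MEDIUM band, in percent (from the text)


def hedge_percentage(tokens, hedge_lexicon):
    """Percentage of tokens that are hedge words (assumed length normalization)."""
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok.lower() in hedge_lexicon)
    return 100.0 * hits / len(tokens)


def speculative_level(pct, low_max=LOW_MAX, medium_max=MEDIUM_MAX):
    """Map a hedge percentage onto the LOW / MEDIUM / HIGH scale."""
    if pct <= low_max:
        return "LOW"
    if pct <= medium_max:
        return "MEDIUM"
    return "HIGH"


def tertile_cutoffs(percentages):
    """Derive (low_max, medium_max) from a reference corpus by splitting it into thirds."""
    ordered = sorted(percentages)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three documents to form tertiles")
    return ordered[n // 3 - 1], ordered[2 * n // 3 - 1]


# Invented document: 1 hedge word in 200 tokens gives 0.5%, which falls in MEDIUM.
doc = ["may"] + ["word"] * 199
level = speculative_level(hedge_percentage(doc, {"may"}))
```

Deriving the cut points from a general news corpus, as the section describes, means the bands reflect how hedged a document is relative to ordinary news rather than relative to other outbreak reports.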

References

N. Collier, S. Doan, A. Kawazoe, R. Matsuda-Goodwin, M. Conway, Y. Tateno, Q-H. Ngo, D. Dien, A. Kawtrakul, K. Takeuchi, M. Shigematsu, and K. Taniguichi. 2008. BioCaster: Detecting Public Health Rumors with a Web-based Text Mining System. Bioinformatics, 24(24):2940–2941.

M. Conway, S. Doan, A. Kawazoe, and N. Collier. 2008. Classifying Disease Outbreak Reports Using N-grams and Semantic Features. Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008), Turku, Finland, pages 29–36.

K. Hyland. 1998. Hedging in Scientific Research Articles. John Benjamins, Amsterdam.

H. Kilicoglu and S. Bergler. 2008. Recognizing Speculative Language in Biomedical Research Articles: a Linguistically Motivated Perspective. BMC Bioinformatics, 9(Suppl 11):S10.

R. Mercer, C. DiMarco, and F. Kroon. 2004. The Frequency of Hedging Cues in Citation Contexts in Scientific Writing. In Proceedings of the Canadian Conference on AI, pages 75–88.

[4] Reuters Corpus, Volume 1, English language, 1996-08-20 to 1997-08-19 (Release date 2000-11-03, Format version 1, correction level 0).
