Finding and Integrating Food Hazard Information from Haystack
Disaster Management Workshop, September 29, 2016
Sung-Hyon Myaeng Information Retrieval & Natural Language Processing Lab School of Computing
Motivation | “Detecting Global Food Safety Threats”
• Food safety threat
  Foods are damaged in production, processing, and supply chains.
  Governments rely heavily on human labor to detect and analyze the events that cause global/domestic food safety threats.
• Solution
  Automate this process by analyzing various text sources (e.g., news, blogs, official reports) with Natural Language Processing (NLP) techniques, to detect potential threats quickly.
Copyright © 2016 Sung-Hyon Myaeng
Overview of the Task
[Previous workshop] Extract & organize hazard information
[Part 1] Find food hazard documents
[Part 2] Decide what events to focus on & integrate them by summarizing
Pipeline: Web documents (news, SNS, …) → Hazard docs → Organized event information → Integrated & summarized view of events
Presenters: KyoungRok JANG, Hwon IHM
Take-away Messages from the Last Talk
• Disaster information in news and on SNS has different characteristics.
  News: detailed information about a disaster (e.g., time, place, actions taken) and opinion
  SNS: mainly sentiment/opinions about the disaster event, or just propagation of news
• News coverage of a disaster is ephemeral.
  After a certain time span following the event, few new articles are written.
  Sometimes articles appear long after the occurrence: reflections on or summaries of a past disaster tend to contain richer information.
• Only a small proportion of the data actually contains disaster event information.
  We need a source-specific strategy: e.g., use SNS for opinions or sentiment about a disaster, but not for details about the disaster or for early warning.
“Finding and Integrating Food Hazard Information from Haystack”
Pipeline: Web documents (news, SNS, …) → Hazard docs → Organized event information → Integrated & summarized view of events

Finding Food Hazard Information from Heterogeneous Text
Introduction | Not All Texts Are Created Equal
Documents (texts) containing food hazard information can be strikingly different in words, syntax, style, length, etc.
Factors: target audience, purpose of writing, authoring platform (e.g., Twitter)
Some observations from real data:
• News & official reports: long, formal terms, grammatical
• Blogs: long, looser wording, mostly grammatical
• Twitter: short, slang/abbreviations, bad grammar
These surface-level differences make it hard to automate the information-finding process.
© 2016 IR&NLP Lab. All rights reserved.
Problem | Keyword Search Couldn’t Save Us!
Keyword search is a simple, precise, and the most widely used method of finding documents that contain desired information.
It basically works under an exact-matching policy.
But since human language is ever-changing and full of variation, keyword matching may miss texts that are semantically relevant but do not exactly match the provided queries:
  food ≠ rice, bread, …
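The exact-match failure mode is easy to see in a few lines of Python. This is a toy illustration; the documents and query are invented:

```python
# Toy illustration of the exact-matching policy: a keyword query
# misses a semantically relevant document that uses different words.
docs = [
    "Contaminated rice recalled after inspection",   # relevant, but no "food"
    "New food safety regulations announced",
    "Stock market closes higher today",              # irrelevant
]

def keyword_search(query, documents):
    """Return documents that literally contain the query term."""
    return [d for d in documents if query.lower() in d.lower()]

hits = keyword_search("food", docs)
# The rice-recall document is semantically about food hazards, yet exact
# matching returns only the one document containing the string "food".
print(hits)
```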
How to Tackle the Problem
As long as the basic unit of text representation is keywords, there is no fundamental solution for the discrepancy between how humans read text (meaning) and how machines read text (keywords).
We need a way to computationally model the meaning of text, to enable more flexible, meaning-based information retrieval.
We tackle the problem with deep neural networks’ representation and learning capabilities, moving away from feature engineering (a manual process).
A Primer on Text Representation Using Deep Learning
Each word is represented as a vector (a.k.a. a “word embedding”), usually 200-300 dimensional:
  rice = (0.12, 0.31, 0.74, …)
The exact vector values are learned from a huge text corpus using a deep learning algorithm.
Learning objective: make two word vectors similar if they share similar contexts.
Intuition behind it: “distributional semantics”
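Similarity between such vectors is typically measured with cosine similarity. A minimal pure-Python sketch; the three-dimensional vectors below are made up for illustration (real embeddings are 200-300 dimensional and learned):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d "embeddings" (invented values).
rice  = (0.12, 0.31, 0.74)
bread = (0.10, 0.35, 0.70)   # food-like contexts -> similar direction
stock = (0.90, -0.2, 0.05)   # finance-like contexts -> different direction

print(cosine_similarity(rice, bread))  # close to 1.0
print(cosine_similarity(rice, stock))  # much lower
```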
Properties of Word Embeddings
• A word’s “meaning” is distributed across the vector elements, so the degree of similarity between words can be measured.
• Similar word vectors are close to each other in the vector space (= semantic space).
Image: http://redcatlabs.com/2015-01-15_Presentation-PyDataSG/
Recent Triumph
“King” – “Man” + “Woman” = “Queen”
Semantic computation done using vector algebra.
Image: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
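The analogy can be reproduced mechanically once vectors are in hand. A hedged sketch with hand-crafted 2-d toy vectors (one axis loosely "royalty", one "gender"; real embeddings learn such directions from data, and these values are invented):

```python
import math

# Invented 2-d toy vectors for illustration only.
vectors = {
    "king":   (0.9, 0.9),
    "queen":  (0.9, 0.1),
    "man":    (0.1, 0.9),
    "woman":  (0.1, 0.1),
    "prince": (0.8, 0.85),
    "apple":  (0.05, 0.5),
}

def nearest(target, vocab, exclude=()):
    """Word whose vector has the highest cosine similarity to `target`."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vocab[w], target))

# king - man + woman, element-wise
k, m, w = vectors["king"], vectors["man"], vectors["woman"]
target = tuple(a - b + c for a, b, c in zip(k, m, w))
print(nearest(target, vectors, exclude={"king", "man", "woman"}))  # queen
```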
How Can It Be Applied to Our Task?
1. Create a computational model that can classify whether unseen text is about a food hazard or not, and use it to predict the class of new text. Retrieving food hazard documents thus becomes a classification problem.
2. Feed the model with more data as it is collected, so that it evolves with language use.
Model learning like this is usually not successful, because there is little supervision over data quality. Deliberate measures should be taken to prevent the model’s semantic drift.
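One of the simplest ways to turn embeddings into such a classifier is a nearest-centroid scheme: average the word vectors of each training sentence, average those into a per-class centroid, and assign new text to the closest centroid. This is a sketch under invented 2-d vectors, not the model used in the experiment below:

```python
import math

# Invented 2-d "embeddings"; a real system would load learned vectors.
emb = {
    "recall": (0.9, 0.1), "contaminated": (0.95, 0.2), "outbreak": (0.85, 0.15),
    "recipe": (0.2, 0.9), "restaurant": (0.25, 0.85), "tasty": (0.1, 0.95),
}

def sentence_vector(tokens):
    """Average the embeddings of known tokens (assumes at least one is known)."""
    known = [emb[t] for t in tokens if t in emb]
    return tuple(sum(d) / len(known) for d in zip(*known))

def centroid(sentences):
    vecs = [sentence_vector(s.split()) for s in sentences]
    return tuple(sum(d) / len(vecs) for d in zip(*vecs))

centroids = {
    "FOOD_HAZARD": centroid(["contaminated food recall", "norovirus outbreak reported"]),
    "FOOD": centroid(["tasty new recipe", "restaurant review"]),
}

def classify(sentence):
    """Assign the class whose centroid is closest in Euclidean distance."""
    v = sentence_vector(sentence.split())
    return min(centroids, key=lambda c: math.dist(centroids[c], v))

print(classify("product recall after outbreak"))  # FOOD_HAZARD
```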
Experiment: Building a Food Hazard Sentence Classifier
1. Collect sentences of three classes (FOOD_HAZARD, FOOD, OTHER)
Only document titles are collected, as titles are focused and representative of the whole document.
• FOOD_HAZARD: Ministry of Food & Drug Safety (14,955 sentences)
• FOOD: food-focused press (255,676 sentences)
• OTHER: general press (portal) (7,388,189 sentences)
Experiment: Building a Food Hazard Sentence Classifier
Example sentences
• FOOD_HAZARD: “미국, 이집트산 냉동 딸기 스무디 관련 A형 간염 피해지역 확대”
  (“US: areas affected by hepatitis A linked to frozen strawberry smoothies from Egypt are expanding”)
• FOOD: “투썸 플레이스 노하우 집약 포스코 사거리점 전략적 오픈”
  (“Twosome Place strategically opens a new branch concentrating its know-how”)
• OTHER: “폭스바겐 2014년 골프 GTI 공개 오토쇼 이목 집중”
  (“Revealed: 2014 Volkswagen Golf GTI draws attention at the auto show”)
Experiment: Building a Food Hazard Sentence Classifier
2. Randomly sample 15,000 sentences from each category (sampling with replacement) to mitigate class imbalance: 15,000 FOOD_HAZARD / 15,000 FOOD / 15,000 OTHER
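Sampling with replacement (needed here because FOOD_HAZARD has only 14,955 titles, fewer than the 15,000 target) is one call in Python's standard library. The title strings below are placeholders:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
food_hazard_titles = [f"hazard title {i}" for i in range(14_955)]  # placeholder data

# 15,000 draws with replacement from 14,955 items; some titles will repeat.
sample = random.choices(food_hazard_titles, k=15_000)
print(len(sample))                      # 15000
print(len(set(sample)) < len(sample))   # True: duplicates are expected
```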
Experiment: Building a Food Hazard Sentence Classifier
3. Learn the class sentence representations from 30,000 sentences (10,000 from each class), then test the classifier on the remaining 15,000 sentences (5,000 from each class)
• TRAIN: 10,000 per class (FOOD_HAZARD, FOOD, OTHER)
• TEST: 5,000 per class
Experimental Result | Accuracy & Confusion Matrix
• Classification accuracy: 77.2% (11,580 correct predictions out of 15,000)
• The model correctly classified 4,840 out of 5,000 unseen FOOD_HAZARD sentences
• Confusion matrix: the model mostly confuses FOOD_HAZARD with FOOD
  Reason: the food-focused press in fact covers all kinds of information (food, hazard, other)
• Little confusion between FOOD_HAZARD and OTHER (general press)
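Accuracy and a confusion matrix are straightforward to compute from paired gold/predicted labels. A small sketch with made-up labels standing in for the 15,000 test titles:

```python
from collections import Counter

# Made-up gold and predicted labels for illustration only.
gold = ["FOOD_HAZARD", "FOOD_HAZARD", "FOOD", "FOOD", "OTHER", "OTHER"]
pred = ["FOOD_HAZARD", "FOOD",        "FOOD", "FOOD", "OTHER", "FOOD"]

# Accuracy: fraction of predictions matching the gold label.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# confusion[(gold_class, predicted_class)] -> count of such errors/hits.
confusion = Counter(zip(gold, pred))

print(f"accuracy = {accuracy:.3f}")
print(confusion[("FOOD_HAZARD", "FOOD")])  # FOOD_HAZARD titles mistaken for FOOD
```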
Conclusion
• The model showed promising results for classifying food hazard-related text without manually labeled data (class labels were derived from each title's source).
• With enough data and a more advanced deep (text) learning model, we could automate the “information finding” process for disaster management, reducing much of the required human labor.

Limitations
• No validation of the model's generalization power: the model may be doing well only because the dataset is too homogeneous (news-like). The FOOD dataset, with its mixed topics, seems to have degraded the model's performance.
• Next step: apply the model to datasets other than news (e.g., texts from Twitter or blogs). This may need bootstrapping with some initial tagging, perhaps through human computation.
Pipeline: Web documents (news, SNS, …) → Hazard docs → Organized event information → Integrated & summarized view of events

Multi-document Summarization for Disaster Management
Introduction | Disaster Event Templates
E.g., “Detecting Global Food Safety Threats” (based on classification hierarchies)

Category: Food Production Company
  Fields: Name, Address, Phone Number
Category: Food Information
  Fields: Food Type, Food Name, Date of Manufacturing, Expiration Date, Units, Place of Origin
Category: Food Safety Threat Information
  Fields: Type of Threat, Detailed Type of Threat, Name of Threat, Date of Occurrence, Actions Taken, Date of Actions, Department in Charge
Introduction | How to Fill in Disaster Event Templates
• Lexicons
  High precision: brand-name food products, manufacturer names, names of cities, etc.
• Regular expressions
  Low recall or precision, depending on the strictness of the patterns: URLs, names of dishes, dates/times, phone numbers, etc.
• Machine-learning-based classifiers
  Require annotated data: types of threats, the food hazard class, people's level of attention, or sentiment polarity (e.g., the food hazard classifier from the previous section)
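As an illustration of the regular-expression route, a sketch that pulls dates and Korean-style phone numbers out of free text; the example text is invented, and the patterns are deliberately strict, trading recall for precision:

```python
import re

text = ("Recall announced on 2016-02-03; consumers may call 02-123-4567 "
        "or visit the ministry office before 2016-03-01.")

# Strict ISO-style date: high precision, but misses "Feb 3, 2016" (low recall).
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Korean-style landline: 2-3 digit area code, then 3-4 digits, then 4 digits.
PHONE_RE = re.compile(r"\b0\d{1,2}-\d{3,4}-\d{4}\b")

print(DATE_RE.findall(text))   # ['2016-02-03', '2016-03-01']
print(PHONE_RE.findall(text))  # ['02-123-4567']
```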
Another way is to monitor the dynamics of the incoming data:
• Identify the time periods during which the data would reveal specific events
• The question is then how to relate each peak to a disaster situation or event
The filled template can be used for detecting significant events:

Document ID | Published datetime  | Media       | Food Hazard | Threat type | Food name         | Virus name         | Original text
10001       | 2016-02-03 00:01:31 | Josun       | yes         | chemical    | tofu, kimchi      | norovirus          | ∙∙∙
10002       | 2016-02-03 00:03:23 | Joongang    | yes         | biological  | miso soup, kimbab | e. coli, norovirus | ∙∙∙
10003       | 2016-02-03 00:04:00 | Kyounghyang | no          | -           | bulgogi           | -                  | ∙∙∙
10004       | ∙∙∙                 | ∙∙∙         | ∙∙∙         | ∙∙∙         | ∙∙∙               | ∙∙∙                | ∙∙∙
Monitor the incoming data to keep an eye on the frequency distribution of a particular threat, based on the disaster event template. Summarization will help associate the peaks with a particular threat.
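The monitoring itself can be as simple as bucketing the template rows by day and flagging days whose counts exceed a threshold. A hedged sketch over invented rows (a real system would read the filled-template table and tune the threshold):

```python
from collections import Counter

# Invented template rows: (published date, threat name).
rows = [
    ("2016-02-03", "norovirus"), ("2016-02-03", "norovirus"),
    ("2016-02-03", "norovirus"), ("2016-02-04", "norovirus"),
    ("2016-02-04", "e. coli"),   ("2016-02-05", "norovirus"),
]

def daily_counts(rows, threat):
    """Number of documents per day mentioning the given threat."""
    return Counter(day for day, t in rows if t == threat)

def peaks(counts, threshold):
    """Days whose document count reaches the threshold."""
    return sorted(day for day, n in counts.items() if n >= threshold)

counts = daily_counts(rows, "norovirus")
print(peaks(counts, threshold=3))  # ['2016-02-03']
```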
Problem | How to Monitor Activities or Events?
So many spikes... but which peaks are worth investigating?
• Korean news articles in 2016: 2,577,809
• Articles that contain the word "virus": 15,525
• Articles marked with the threat "norovirus": 309
[Figure: frequency of documents that contain the threat "norovirus"]
Problem | We Need a Way to Summarize Documents
Reduce multiple text documents to a summary. Possible solutions for summarization:
• Keywords: hard to see the context
• Document titles: not all documents have proper titles, and news headlines are designed to catch eyes, often with overly dramatic and/or compressed information
• Snippets: suited for single-document summarization
Proposed Solution | Multi-document Summary in a Full Sentence
What drove the bursts in the document stream? Our aim is to find the sentence that best represents the topic shared by the majority of the documents.
[Figure: a single summary sentence extracted from Doc1-Doc4]
Text as a Graph
• Vertices = cognitive units (words, sentences, etc.)
• Edges = relations between cognitive units (similarity, distance)
[Image: words connected by Levenshtein distance]
“TextRank”
Google's PageRank adapted to text: model the cohesion of text using inter-sentential similarity.
Underlying idea: repetition in text “knits the discourse together” (Hobbs, 1974)
1. Vertices = sentences (or specific parts of speech)
2. Edges = inter-sentential similarity (e.g., cosine similarity/distance, BM25, etc.)
3. Run a graph ranking algorithm
4. Keep the top N ranked sentences (the sentences most "recommended" by other sentences)
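The four steps above can be sketched in pure Python. This uses TextRank's word-overlap similarity and an iterative PageRank update; the four example sentences are invented, and a real run would use cosine similarity over noun vectors as described later:

```python
import math
from itertools import combinations

def similarity(s1, s2):
    """Word overlap between two sentences, normalized by their lengths."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    overlap = len(w1 & w2)
    if overlap == 0:
        return 0.0
    return overlap / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

def textrank(sentences, damping=0.85, iterations=50):
    """Rank sentences by running weighted PageRank on the similarity graph."""
    n = len(sentences)
    weights = {}
    for i, j in combinations(range(n), 2):
        w = similarity(sentences[i], sentences[j])
        if w > 0:
            weights[i, j] = weights[j, i] = w
    scores = [1.0] * n
    for _ in range(iterations):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if (j, i) in weights:
                    out = sum(weights[j, k] for k in range(n) if (j, k) in weights)
                    rank += weights[j, i] / out * scores[j]
            new.append((1 - damping) + damping * rank)
        scores = new
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

docs = [
    "norovirus outbreak reported at the evacuation shelter",
    "officials confirm norovirus outbreak at shelter",
    "the weather was sunny today",
    "shelter outbreak of norovirus worries officials",
]
ranking = textrank(docs)
print(docs[ranking[0]])  # the sentence most "recommended" by the others
```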
Experiment: Summarize Each Peak
[Figure: frequency of documents containing the threat "norovirus" in 2016, with peaks on Jan 3, Feb 8, Apr 18, May 24, Jun 16, and Aug 23]
Experiment: Summarize Each Peak
1. Conduct PoS tagging for each sentence
2. Keep only the nouns
3. Calculate the similarity between each pair of noun vectors using cosine similarity
4. Remove duplicate sentences
5. Run TextRank
6. Keep the top-ranked sentence
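Step 4 (duplicate removal) matters because TextRank rewards repetition, so spam or syndicated copies would otherwise dominate the ranking. A minimal normalization-based dedup sketch; the sentences are invented:

```python
def dedup(sentences):
    """Drop sentences whose normalized (lowercased, token-sorted) form repeats."""
    seen, unique = set(), []
    for s in sentences:
        key = " ".join(sorted(s.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

sents = [
    "Norovirus outbreak reported",
    "norovirus outbreak reported",   # case-only duplicate
    "Reported norovirus outbreak",   # word-order duplicate
    "New diagnosis kit developed",
]
print(dedup(sents))  # only the first and last survive
```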
Preliminary Result
Jan 3: “겨울철 노로바이러스에 대한 경각심이 높아지면서 증상과 예방에 대한 관심도 모아지고 있습니다”
  Translation: As awareness of winter norovirus outbreaks rises, people seek information on its symptoms and preventive measures.
  Top keywords: enteritis, symptom, winter, norovirus, vomit
Apr 18: “한때 20만명이 대피소에서 생활을 했던 일본 구마모토현 지진 피해 지역에서 전염성이 있는 노로바이러스 감염자가 발생, 방역에 비상이 걸렸다”
  Translation: Disease-control authorities are on emergency alert as contagious norovirus infections were found in the earthquake-hit area of Kumamoto Prefecture, Japan, where at one point over 200,000 people lived in shelters.
  Top keywords: earthquake, Japan, norovirus, shelter, outbreak, Kumamoto
Preliminary Result
May 24: “식중독을 유발하는 노로바이러스 감염 여부를 현장에서 바로 확인할 수 있는 종이형태의 검출·진단 키트가 국내 연구팀에 의해 개발됐다”
  Translation: A paper-based kit for on-the-spot detection and diagnosis of norovirus infection, a cause of food poisoning, was developed by a Korean research team.
  Top keywords: norovirus, diagnosis, research, kit, develop, spot, paper
Aug 23: “국내에서 15년만에 처음으로 콜레라 환자가 발생해 방역당국에 비상이 걸렸습니다”
  Translation: The first domestic cholera patient in 15 years was confirmed, putting the health authorities on alert.
  Top keywords: cholera, patient, infection, outbreak, domestic
Conclusion
The graph-based model showed promising results at summarizing groups of documents.

Limitations
• Difficult to capture more than one topic, because the top N sentences look quite similar
• Vulnerable to spam unless duplicate sentences are removed
• Tends to output sentences composed of very general nouns when too many sentences are grouped (100,000+)