Finding and Integrating Food Hazard Information from Haystack
Disaster Management Workshop, September 29, 2016
Sung-Hyon Myaeng Information Retrieval & Natural Language Processing Lab School of Computing
Motivation | “Detecting Global Food Safety Threats”
• Food safety threat
  Foods are damaged in production, processing, and supply chains.
  Governments rely heavily on human labor to detect and analyze the events that cause global/domestic food safety threats.
• Solution
  Automate this process by analyzing various text sources (e.g., news, blogs, official reports) with Natural Language Processing (NLP) techniques, to detect potential threats quickly.
Copyright © 2016 Sung-Hyon Myaeng
Overview of the Task
[Previous workshop] Extract & organize hazard information
[Part 1] Find food hazard documents
[Part 2] Decide what events to focus on & integrate them by summarizing
Pipeline: Web documents (news, SNS, …) → Hazard docs → Organized event information → Integrated & summarized view of events
Presenters: KyoungRok JANG, Hwon IHM
Take-away Messages from the Last Talk
• Disaster information in news and on SNS has different characteristics.
  News: detailed information about a disaster (e.g., time, place, actions taken) and opinion
  SNS: mainly sentiment/opinions about the disaster event, or just propagation of news
• News coverage of a disaster is ephemeral.
  After a certain time span following the event, few new articles are written.
  Sometimes articles appear long after the occurrence: reflections on or summaries of a past disaster tend to contain richer information.
• Only a small proportion of the data actually contains disaster event information.
  We need a source-specific strategy: e.g., use SNS for opinions or sentiment about a disaster, but not for details about the disaster or for early warning.
“Finding and Integrating Food Hazard Information from Haystack”
Pipeline: Web documents (news, SNS, …) → Hazard docs → Organized event information → Integrated & summarized view of events

Finding Food Hazard Information from Heterogeneous Text
Introduction | Not All Texts Are Created Equal
Documents (texts) containing food hazard information can be strikingly different in words, syntax, style, length, etc.
Factors: target audience, purpose of writing, authoring platform (e.g., Twitter)
Some observations from real data:
• News & official reports: long, formal terms, grammatical
• Blogs: long, looser wording, mostly grammatical
• Twitter: short, slang/abbreviations, bad grammar
These surface-level differences make it hard to automate the information-finding process.
© 2016 IR&NLP Lab. All rights reserved.
Problem | Keyword Search Couldn’t Save Us!
Keyword search is a simple, precise, and the most widely used method of finding documents that contain desired information.
It basically works under an exact-matching policy.
But since human language is ever-changing and full of variation, keyword matching may miss texts that are semantically relevant but do not exactly match the provided queries:
  food ≠ rice, bread, …
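The exact-match failure mode is easy to see in a few lines of Python. This is a toy illustration; the documents and query are invented:

```python
# Toy illustration of the exact-matching policy: a keyword query
# misses a semantically relevant document that uses different words.
docs = [
    "Contaminated rice recalled after inspection",   # relevant, but no "food"
    "New food safety regulations announced",
    "Stock market closes higher today",              # irrelevant
]

def keyword_search(query, documents):
    """Return documents that literally contain the query term."""
    return [d for d in documents if query.lower() in d.lower()]

hits = keyword_search("food", docs)
# The rice-recall document is semantically about food hazards, yet exact
# matching returns only the one document containing the string "food".
print(hits)
```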
How to Tackle the Problem
As long as the basic unit of text representation is keywords, there is no fundamental solution for the discrepancy between how humans read text (meaning) and how machines read text (keywords).
We need a way to computationally model the meaning of text, to enable more flexible, meaning-based information retrieval.
We tackle the problem with deep neural networks’ representation and learning capabilities, moving away from feature engineering (a manual process).
A Primer on Text Representation Using Deep Learning
Each word is represented as a vector (a.k.a. a “word embedding”), usually 200-300 dimensional:
  rice = (0.12, 0.31, 0.74, …)
The exact vector values are learned from a huge text corpus using a deep learning algorithm.
Learning objective: make two word vectors similar if they share similar contexts.
Intuition behind it: “distributional semantics”
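Similarity between such vectors is typically measured with cosine similarity. A minimal pure-Python sketch; the three-dimensional vectors below are made up for illustration (real embeddings are 200-300 dimensional and learned):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d "embeddings" (invented values).
rice  = (0.12, 0.31, 0.74)
bread = (0.10, 0.35, 0.70)   # food-like contexts -> similar direction
stock = (0.90, -0.2, 0.05)   # finance-like contexts -> different direction

print(cosine_similarity(rice, bread))  # close to 1.0
print(cosine_similarity(rice, stock))  # much lower
```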
Properties of Word Embeddings
• A word’s “meaning” is distributed across the vector elements, so the degree of similarity between words can be measured.
• Similar word vectors are close to each other in the vector space (= semantic space).
Image: http://redcatlabs.com/2015-01-15_Presentation-PyDataSG/
Recent Triumph
“King” – “Man” + “Woman” = “Queen”
Semantic computation done using vector algebra.
Image: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
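The analogy can be reproduced mechanically once vectors are in hand. A hedged sketch with hand-crafted 2-d toy vectors (one axis loosely "royalty", one "gender"; real embeddings learn such directions from data, and these values are invented):

```python
import math

# Invented 2-d toy vectors for illustration only.
vectors = {
    "king":   (0.9, 0.9),
    "queen":  (0.9, 0.1),
    "man":    (0.1, 0.9),
    "woman":  (0.1, 0.1),
    "prince": (0.8, 0.85),
    "apple":  (0.05, 0.5),
}

def nearest(target, vocab, exclude=()):
    """Word whose vector has the highest cosine similarity to `target`."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vocab[w], target))

# king - man + woman, element-wise
k, m, w = vectors["king"], vectors["man"], vectors["woman"]
target = tuple(a - b + c for a, b, c in zip(k, m, w))
print(nearest(target, vectors, exclude={"king", "man", "woman"}))  # queen
```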
How Can It Be Applied to Our Task?
1. Create a computational model that can classify whether unseen text is about a food hazard or not, and use it to predict the class of new text. Retrieving food hazard documents thus becomes a classification problem.
2. Feed the model with more data as it is collected, so that it evolves with language use.
Model learning like this is usually not successful, because there is little supervision over data quality. Deliberate measures should be taken to prevent the model’s semantic drift.
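One of the simplest ways to turn embeddings into such a classifier is a nearest-centroid scheme: average the word vectors of each training sentence, average those into a per-class centroid, and assign new text to the closest centroid. This is a sketch under invented 2-d vectors, not the model used in the experiment below:

```python
import math

# Invented 2-d "embeddings"; a real system would load learned vectors.
emb = {
    "recall": (0.9, 0.1), "contaminated": (0.95, 0.2), "outbreak": (0.85, 0.15),
    "recipe": (0.2, 0.9), "restaurant": (0.25, 0.85), "tasty": (0.1, 0.95),
}

def sentence_vector(tokens):
    """Average the embeddings of known tokens (assumes at least one is known)."""
    known = [emb[t] for t in tokens if t in emb]
    return tuple(sum(d) / len(known) for d in zip(*known))

def centroid(sentences):
    vecs = [sentence_vector(s.split()) for s in sentences]
    return tuple(sum(d) / len(vecs) for d in zip(*vecs))

centroids = {
    "FOOD_HAZARD": centroid(["contaminated food recall", "norovirus outbreak reported"]),
    "FOOD": centroid(["tasty new recipe", "restaurant review"]),
}

def classify(sentence):
    """Assign the class whose centroid is closest in Euclidean distance."""
    v = sentence_vector(sentence.split())
    return min(centroids, key=lambda c: math.dist(centroids[c], v))

print(classify("product recall after outbreak"))  # FOOD_HAZARD
```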
Experiment: Building a Food Hazard Sentence Classifier
1. Collect sentences of three classes (FOOD_HAZARD, FOOD, OTHER)
Only document titles are collected, as titles are focused and representative of the whole document.
• FOOD_HAZARD: Ministry of Food & Drug Safety (14,955 sentences)
• FOOD: food-focused press (255,676 sentences)
• OTHER: general press (portal) (7,388,189 sentences)
Experiment: Building a Food Hazard Sentence Classifier
Example sentences
• FOOD_HAZARD: “미국, 이집트산 냉동 딸기 스무디 관련 A형 간염 피해지역 확대”
  (“US: areas affected by hepatitis A linked to frozen strawberry smoothies from Egypt are expanding”)
• FOOD: “투썸 플레이스 노하우 집약 포스코 사거리점 전략적 오픈”
  (“Twosome Place strategically opens a new branch concentrating its know-how”)
• OTHER: “폭스바겐 2014년 골프 GTI 공개 오토쇼 이목 집중”
  (“Revealed: 2014 Volkswagen Golf GTI draws attention at the auto show”)
Experiment: Building a Food Hazard Sentence Classifier
2. Randomly sample 15,000 sentences from each category (sampling with replacement) to mitigate class imbalance: 15,000 FOOD_HAZARD / 15,000 FOOD / 15,000 OTHER
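Sampling with replacement (needed here because FOOD_HAZARD has only 14,955 titles, fewer than the 15,000 target) is one call in Python's standard library. The title strings below are placeholders:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
food_hazard_titles = [f"hazard title {i}" for i in range(14_955)]  # placeholder data

# 15,000 draws with replacement from 14,955 items; some titles will repeat.
sample = random.choices(food_hazard_titles, k=15_000)
print(len(sample))                      # 15000
print(len(set(sample)) < len(sample))   # True: duplicates are expected
```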
Experiment: Building a Food Hazard Sentence Classifier
3. Learn the class sentence representations from 30,000 sentences (10,000 from each class), then test the classifier on the remaining 15,000 sentences (5,000 from each class)
• TRAIN: 10,000 per class (FOOD_HAZARD, FOOD, OTHER)
• TEST: 5,000 per class
Experimental Result | Accuracy & Confusion Matrix
• Classification accuracy: 77.2% (11,580 correct predictions out of 15,000)
• The model correctly classified 4,840 out of 5,000 unseen FOOD_HAZARD sentences
• Confusion matrix: the model mostly confuses FOOD_HAZARD with FOOD
  Reason: the food-focused press in fact covers all kinds of information (food, hazard, other)
• Little confusion between FOOD_HAZARD and OTHER (general press)
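Accuracy and a confusion matrix are straightforward to compute from paired gold/predicted labels. A small sketch with made-up labels standing in for the 15,000 test titles:

```python
from collections import Counter

# Made-up gold and predicted labels for illustration only.
gold = ["FOOD_HAZARD", "FOOD_HAZARD", "FOOD", "FOOD", "OTHER", "OTHER"]
pred = ["FOOD_HAZARD", "FOOD",        "FOOD", "FOOD", "OTHER", "FOOD"]

# Accuracy: fraction of predictions matching the gold label.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# confusion[(gold_class, predicted_class)] -> count of such errors/hits.
confusion = Counter(zip(gold, pred))

print(f"accuracy = {accuracy:.3f}")
print(confusion[("FOOD_HAZARD", "FOOD")])  # FOOD_HAZARD titles mistaken for FOOD
```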
Conclusion
• The model showed promising results for classifying food hazard-related text without manually labeled data (class labels were derived from each title's source).
• With enough data and a more advanced deep (text) learning model, we could automate the “information finding” process for disaster management, reducing much of the required human labor.

Limitations
• No validation of the model's generalization power: the model may be doing well only because the dataset is too homogeneous (news-like). The FOOD dataset, with its mixed topics, seems to have degraded the model's performance.
• Next step: apply the model to datasets other than news (e.g., texts from Twitter or blogs). This may need bootstrapping with some initial tagging, perhaps through human computation.
Pipeline: Web documents (news, SNS, …) → Hazard docs → Organized event information → Integrated & summarized view of events

Multi-document Summarization for Disaster Management
Introduction | Disaster Event Templates
E.g., “Detecting Global Food Safety Threats” (based on classification hierarchies)

Category: Food Production Company
  Fields: Name, Address, Phone Number
Category: Food Information
  Fields: Food Type, Food Name, Date of Manufacturing, Expiration Date, Units, Place of Origin
Category: Food Safety Threat Information
  Fields: Type of Threat, Detailed Type of Threat, Name of Threat, Date of Occurrence, Actions Taken, Date of Actions, Department in Charge
Introduction | How to Fill in Disaster Event Templates
• Lexicons
  High precision: brand-name food products, manufacturer names, names of cities, etc.
• Regular expressions
  Low recall or precision, depending on the strictness of the patterns: URLs, names of dishes, dates/times, phone numbers, etc.
• Machine-learning-based classifiers
  Require annotated data: types of threats, the food hazard class, people's level of attention, or sentiment polarity (e.g., the food hazard classifier from the previous section)
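As an illustration of the regular-expression route, a sketch that pulls dates and Korean-style phone numbers out of free text; the example text is invented, and the patterns are deliberately strict, trading recall for precision:

```python
import re

text = ("Recall announced on 2016-02-03; consumers may call 02-123-4567 "
        "or visit the ministry office before 2016-03-01.")

# Strict ISO-style date: high precision, but misses "Feb 3, 2016" (low recall).
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Korean-style landline: 2-3 digit area code, then 3-4 digits, then 4 digits.
PHONE_RE = re.compile(r"\b0\d{1,2}-\d{3,4}-\d{4}\b")

print(DATE_RE.findall(text))   # ['2016-02-03', '2016-03-01']
print(PHONE_RE.findall(text))  # ['02-123-4567']
```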
Another way is to monitor the dynamics of the incoming data:
• Identify the time periods during which the data would reveal specific events
• The question is then how to relate each peak to a disaster situation or event
The filled template can be used for detecting significant events:

Document ID | Published datetime  | Media       | Food Hazard | Threat type | Food name         | Virus name         | Original text
10001       | 2016-02-03 00:01:31 | Josun       | yes         | chemical    | tofu, kimchi      | norovirus          | ∙∙∙
10002       | 2016-02-03 00:03:23 | Joongang    | yes         | biological  | miso soup, kimbab | e. coli, norovirus | ∙∙∙
10003       | 2016-02-03 00:04:00 | Kyounghyang | no          | -           | bulgogi           | -                  | ∙∙∙
10004       | ∙∙∙                 | ∙∙∙         | ∙∙∙         | ∙∙∙         | ∙∙∙               | ∙∙∙                | ∙∙∙
Monitor the incoming data to keep an eye on the frequency distribution of a particular threat, based on the disaster event template. Summarization will help associate the peaks with a particular threat.
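The monitoring itself can be as simple as bucketing the template rows by day and flagging days whose counts exceed a threshold. A hedged sketch over invented rows (a real system would read the filled-template table and tune the threshold):

```python
from collections import Counter

# Invented template rows: (published date, threat name).
rows = [
    ("2016-02-03", "norovirus"), ("2016-02-03", "norovirus"),
    ("2016-02-03", "norovirus"), ("2016-02-04", "norovirus"),
    ("2016-02-04", "e. coli"),   ("2016-02-05", "norovirus"),
]

def daily_counts(rows, threat):
    """Number of documents per day mentioning the given threat."""
    return Counter(day for day, t in rows if t == threat)

def peaks(counts, threshold):
    """Days whose document count reaches the threshold."""
    return sorted(day for day, n in counts.items() if n >= threshold)

counts = daily_counts(rows, "norovirus")
print(peaks(counts, threshold=3))  # ['2016-02-03']
```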
Problem | How to Monitor Activities or Events?
So many spikes... but which peaks are worth investigating?
• Korean news articles in 2016: 2,577,809
• Articles that contain the word "virus": 15,525
• Articles marked with the threat "norovirus": 309
[Figure: frequency of documents that contain the threat "norovirus"]
Problem | We Need a Way to Summarize Documents
Reduce multiple text documents to a summary. Possible solutions for summarization:
• Keywords: hard to see the context
• Document titles: not all documents have proper titles, and news headlines are designed to catch eyes, often with overly dramatic and/or compressed information
• Snippets: suited for single-document summarization
Proposed Solution | Multi-document Summary in a Full Sentence
What drove the bursts in the document stream? Our aim is to find the sentence that best represents the topic shared by the majority of the documents.
[Figure: a single summary sentence extracted from Doc1-Doc4]
Text as a Graph
• Vertices = cognitive units (words, sentences, etc.)
• Edges = relations between cognitive units (similarity, distance)
[Image: words connected by Levenshtein distance]
“TextRank”
Google's PageRank adapted to text: model the cohesion of text using inter-sentential similarity.
Underlying idea: repetition in text “knits the discourse together” (Hobbs, 1974)
1. Vertices = sentences (or specific parts of speech)
2. Edges = inter-sentential similarity (e.g., cosine similarity/distance, BM25, etc.)
3. Run a graph ranking algorithm
4. Keep the top N ranked sentences (the sentences most "recommended" by other sentences)
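The four steps above can be sketched in pure Python. This uses TextRank's word-overlap similarity and an iterative PageRank update; the four example sentences are invented, and a real run would use cosine similarity over noun vectors as described later:

```python
import math
from itertools import combinations

def similarity(s1, s2):
    """Word overlap between two sentences, normalized by their lengths."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    overlap = len(w1 & w2)
    if overlap == 0:
        return 0.0
    return overlap / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

def textrank(sentences, damping=0.85, iterations=50):
    """Rank sentences by running weighted PageRank on the similarity graph."""
    n = len(sentences)
    weights = {}
    for i, j in combinations(range(n), 2):
        w = similarity(sentences[i], sentences[j])
        if w > 0:
            weights[i, j] = weights[j, i] = w
    scores = [1.0] * n
    for _ in range(iterations):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if (j, i) in weights:
                    out = sum(weights[j, k] for k in range(n) if (j, k) in weights)
                    rank += weights[j, i] / out * scores[j]
            new.append((1 - damping) + damping * rank)
        scores = new
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

docs = [
    "norovirus outbreak reported at the evacuation shelter",
    "officials confirm norovirus outbreak at shelter",
    "the weather was sunny today",
    "shelter outbreak of norovirus worries officials",
]
ranking = textrank(docs)
print(docs[ranking[0]])  # the sentence most "recommended" by the others
```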
Experiment: Summarize Each Peak
[Figure: frequency of documents containing the threat "norovirus" in 2016, with peaks on Jan 3, Feb 8, Apr 18, May 24, Jun 16, and Aug 23]
Experiment: Summarize Each Peak
1. Conduct PoS tagging for each sentence
2. Keep only the nouns
3. Calculate the similarity between each pair of noun vectors using cosine similarity
4. Remove duplicate sentences
5. Run TextRank
6. Keep the top-ranked sentence
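Step 4 (duplicate removal) matters because TextRank rewards repetition, so spam or syndicated copies would otherwise dominate the ranking. A minimal normalization-based dedup sketch; the sentences are invented:

```python
def dedup(sentences):
    """Drop sentences whose normalized (lowercased, token-sorted) form repeats."""
    seen, unique = set(), []
    for s in sentences:
        key = " ".join(sorted(s.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

sents = [
    "Norovirus outbreak reported",
    "norovirus outbreak reported",   # case-only duplicate
    "Reported norovirus outbreak",   # word-order duplicate
    "New diagnosis kit developed",
]
print(dedup(sents))  # only the first and last survive
```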
Preliminary Result
Jan 3: “겨울철 노로바이러스에 대한 경각심이 높아지면서 증상과 예방에 대한 관심도 모아지고 있습니다”
  Translation: As awareness of winter norovirus outbreaks rises, people seek information on its symptoms and preventive measures.
  Top keywords: enteritis, symptom, winter, norovirus, vomit
Apr 18: “한때 20만명이 대피소에서 생활을 했던 일본 구마모토현 지진 피해 지역에서 전염성이 있는 노로바이러스 감염자가 발생, 방역에 비상이 걸렸다”
  Translation: Disease-control authorities are on emergency alert as contagious norovirus infections were found in the earthquake-hit area of Kumamoto Prefecture, Japan, where at one point over 200,000 people lived in shelters.
  Top keywords: earthquake, Japan, norovirus, shelter, outbreak, Kumamoto
Preliminary Result
May 24: “식중독을 유발하는 노로바이러스 감염 여부를 현장에서 바로 확인할 수 있는 종이형태의 검출·진단 키트가 국내 연구팀에 의해 개발됐다”
  Translation: A paper-based kit for on-the-spot detection and diagnosis of norovirus infection, a cause of food poisoning, was developed by a Korean research team.
  Top keywords: norovirus, diagnosis, research, kit, develop, spot, paper
Aug 23: “국내에서 15년만에 처음으로 콜레라 환자가 발생해 방역당국에 비상이 걸렸습니다”
  Translation: The first domestic cholera patient in 15 years was confirmed, putting the health authorities on alert.
  Top keywords: cholera, patient, infection, outbreak, domestic
Conclusion
The graph-based model showed promising results at summarizing groups of documents.

Limitations
• Difficult to capture more than one topic, because the top N sentences look quite similar
• Vulnerable to spam unless duplicate sentences are removed
• Tends to output sentences composed of very general nouns when too many sentences are grouped (100,000+)