8/14/2009
UNIVERSIDAD POLITÉCNICA DE VALENCIA
On Clustering and Evaluation of Narrow Domain Short-Text Corpora
PhD Thesis Dissertation
David Eduardo Pinto Avendaño
Advisors: Dr. Paolo Rosso and Dr. Héctor Jiménez-Salazar
Outline
- Overview
- Hypothesis
- Challenges
- Clustering of narrow domain short-text corpora
- Assessment of text corpora
- Validity measures
- Conclusions and research directions
Overview
In this thesis, we deal with the treatment of narrow domain short-text collections in three areas: evaluation, clustering, and validation of corpora.
- Main focus, short-text clustering (short vs. long documents): highly relevant given the current and likely future way people use "small-language" (e.g. blogs, snippets, news, and text messages such as email or chat).
- Complemented with the domain broadness of corpora (narrow vs. wide domain): in the categorization task, narrow domain corpora such as scientific papers, technical reports, and patents are very difficult to deal with.
Our aims are:
- To study possible strategies for tackling two problems: (a) the low frequencies of vocabulary terms in short texts, and (b) the high vocabulary overlap associated with narrow domains.
- To provide a general framework for the evaluation of classifier-independent corpus features.
Hypothesis
- Whether a corpus is composed of narrow domain short texts is identifiable.
- Clustering results obtained with narrow domain short-text corpora can be improved without using external knowledge resources, which are not always available in narrow domains.
Challenges
1. To propose a framework for assessing a set of corpus features that help to understand the nature of the documents from the viewpoint of shortness and broadness.
   - The proposed measures will allow us to evaluate the relative hardness of corpora to be clustered and to study additional corpus features, such as the particular writing style of scientific researchers.
   - To be able to distinguish corpora that are composed of narrow domain short texts from those that are not.
Challenges
2. By determining the degree of broadness and shortness of corpora, we may analyse the following issues:
   - To test clustering methods in order to determine the complexity of classifying narrow domain short-text collections.
   - To investigate the possible components that could improve the accuracy obtained in the clustering task.
3. To validate clustering results in the following two ways:
   - By applying internal clustering validity measures in order to "validate" the quality of the clusters obtained by a given clustering method.
   - By employing similar measures in order to assess the quality of gold standards.
Outline
- Overview
- Hypothesis
- Challenges
- Clustering of narrow domain short-text corpora
  - Introduction
  - Improving the classical approach
  - The case study of word sense induction
- Assessment of text corpora
- Validity measures
- Conclusions and research directions
Introduction: clustering definition
- Cluster analysis refers to the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure.
- Example: Categories = {green, red, yellow, blue} or Categories = {sphere, cube, pyramid}
Introduction: object representation
Introduction: Document representation
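Introduction: Clustering process
1. Term-document matrix: rows are documents D1..Dn, columns are terms T1..TV, and w_ij is the entry for term Tj in document Di.
$$
W = \begin{pmatrix}
w_{11} & w_{12} & \cdots & w_{1V} \\
w_{21} & w_{22} & \cdots & w_{2V} \\
\vdots & \vdots & \ddots & \vdots \\
w_{n1} & w_{n2} & \cdots & w_{nV}
\end{pmatrix}
$$
2. Document-document matrix: rows and columns are documents D1..Dn, and θ_ij is the entry for the pair (Di, Dj).
$$
\Theta = \begin{pmatrix}
\theta_{11} & \theta_{12} & \cdots & \theta_{1n} \\
\theta_{21} & \theta_{22} & \cdots & \theta_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\theta_{n1} & \theta_{n2} & \cdots & \theta_{nn}
\end{pmatrix}
$$
3. Clustering method: K-Means, K-Star, MajorClust, SLC, CLC, etc.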
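The three stages of this process can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis implementation: raw term frequencies stand in for the weights w_ij, cosine similarity stands in for θ_ij, and a threshold-based single-link grouping stands in for the clustering method. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, W), where W[i][j] is the frequency of term j in document i."""
    vocab = sorted({t for d in docs for t in d.split()})
    index = {t: j for j, t in enumerate(vocab)}
    W = [[0] * len(vocab) for _ in docs]
    for i, d in enumerate(docs):
        for t, c in Counter(d.split()).items():
            W[i][index[t]] = c
    return vocab, W

def cosine(u, v):
    """Cosine similarity of two weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(W):
    """Document-by-document matrix of pairwise cosine similarities."""
    return [[cosine(u, v) for v in W] for u in W]

def single_link_clusters(theta, threshold):
    """Give the same label to documents linked by a chain of similarities >= threshold."""
    n = len(theta)
    label = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if theta[i][j] >= threshold:
                old, new = label[j], label[i]
                label = [new if l == old else l for l in label]
    return label

docs = [
    "quark decay detector",
    "quark detector energy",
    "parsing grammar syntax",
    "grammar parsing tree",
]
vocab, W = term_document_matrix(docs)
theta = similarity_matrix(W)
labels = single_link_clusters(theta, threshold=0.5)
```

With very short documents most entries of W are 0 or 1, which is exactly the low-frequency problem the thesis addresses.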
Introduction: Motivation
- Supervised approach: [N. Ide and J. Véronis, 1990], [Hynek et al., 2000], [S. Zelikovitz and H. Hirsh, 2000], [Hotho et al., 2003a], [Hotho et al., 2003b], [S. Zelikovitz and F. Marquez, 2005], [Montejo-Ráez et al., 2005], [Q. Pu and G.-W. Yang, 2006], [Buscaldi et al., 2006], [Montejo-Ráez et al., 2006], [Peng et al., 2007], ...
- Unsupervised approach: [Makagonov et al., 2004], [Alexandrov et al., 2005]
Introduction: Motivation
Is there a hierarchical relationship in the difficulty of clustering different kinds of text data? Two factors are involved:
- the low frequencies of vocabulary terms in short texts, and
- the high vocabulary overlap associated with narrow domains.
Introduction: Motivation
Practical applications:
- Clustering of search engine results
- Word sense discrimination/induction
- Homonymy discrimination
- Summarization
- Clustering of blogs (opinion analysis)
- Assessment of the quality of corpora
More knowledge = Better clustering
- Term expansion with an external resource (WordNet, HowNet, etc.)
- Self-term expansion
Self-Term Expansion
[Figure: the term-document matrix (documents D1..Dn by terms T1..TV, entries w_ij) and the clustering method (K-Means, K-Star, MajorClust, SLC, CLC, etc.)]
Self-Term Expansion Technique + Term Selection Technique = Self-Term Expansion Methodology
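The slides do not spell out how the two steps work. As a rough sketch only, assume that each term is expanded with the terms that co-occur with it in at least `min_cooc` documents of the same corpus (so no external resource is needed), and that term selection then keeps the terms with a minimum document frequency. Both choices, the thresholds, and every function name below are illustrative assumptions rather than the method of the thesis.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence_lists(docs, min_cooc=2):
    """Map each term to the terms sharing at least min_cooc documents with it."""
    pair_counts = Counter()
    for d in docs:
        for a, b in combinations(sorted(set(d)), 2):
            pair_counts[(a, b)] += 1
    related = defaultdict(set)
    for (a, b), c in pair_counts.items():
        if c >= min_cooc:
            related[a].add(b)
            related[b].add(a)
    return related

def self_expand(docs, related):
    """Append to each document the related terms of the terms it already contains."""
    expanded = []
    for d in docs:
        extra = sorted({r for t in d for r in related.get(t, ())} - set(d))
        expanded.append(list(d) + extra)
    return expanded

def select_by_document_frequency(docs, min_df=2):
    """Assumed term selection: keep terms occurring in at least min_df documents."""
    df = Counter(t for d in docs for t in set(d))
    return [[t for t in d if df[t] >= min_df] for d in docs]

docs = [["a", "b"], ["a", "b", "c"], ["a", "d"]]
related = cooccurrence_lists(docs, min_cooc=2)
expanded = self_expand(docs, related)
selected = select_by_document_frequency(expanded, min_df=2)
```

In this toy run the third document gains the term "b" because "a" and "b" co-occur twice elsewhere in the corpus; selection then discards the rare terms "c" and "d".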
Clustering narrow domain short-text corpora (CICLing-2002 / K-Star)
[Figure: panels DF, TP, TS]
Clustering narrow domain short-text corpora (hep-ex / K-Star)
[Figure: panels DF, TP, TS]
Clustering narrow domain short-text corpora (CICLing-2002 / DK-Means)
Clustering narrow domain short-text corpora (hep-ex / DK-Means)
Word Sense Discrimination/Induction
Clustering narrow domain short-text corpora (WSI-SemEval)
[Figure: panels DF, TP, TS]
More knowledge = Better clustering
- A priori assessment of text corpora
- Self-term expansion methodology
- Clustering narrow domain short-text corpora
Assessment of text corpora
- Shortness: document length
- Stylometry: the linguistic style of a writer
- Class imbalance: the document distribution across the corpus
- Structure: structural properties of the category distribution
- Domain broadness: narrow vs. wide domain
Assessment of Text Corpora
Corpus feature        To be used as
Document shortness    Internal measure
Stylometry            Internal measure
Class imbalance       External measure
Structure             External measure
Domain broadness      Both
(internal vs. external: unsupervised vs. supervised)
Shortness-based evaluation measures
Given a corpus of n documents D = {d1, d2, ..., dn}, with V(di) denoting the vocabulary of di, the following three measures are considered to be related to the shortness of D.
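The formulas for the three measures did not survive on this slide. The sketch below assumes the definitions suggested by the measure names used later in the deck (DL, VL, VDR): DL as the mean document length in tokens, VL as the mean vocabulary size per document, and VDR as log(VL) / log(DL). The last assumption roughly reproduces the VDR values reported for the WikiArabic corpus (about 0.908 and 0.880 against the reported 0.907 and 0.880), but it remains an assumption, and the function name and the toy documents are hypothetical.

```python
import math

def shortness_measures(docs):
    """docs: list of token lists. Returns (DL, VL, VDR) under the assumed definitions.

    Requires a mean document length greater than 1, otherwise log(DL) is zero.
    """
    n = len(docs)
    dl = sum(len(d) for d in docs) / n        # mean document length in tokens
    vl = sum(len(set(d)) for d in docs) / n   # mean per-document vocabulary size
    vdr = math.log(vl) / math.log(dl)         # vocabulary vs. document length ratio
    return dl, vl, vdr

docs = [["a", "b", "a", "c"], ["d", "d", "e", "f", "g", "g"]]
dl, vl, vdr = shortness_measures(docs)
```

Under these definitions VDR approaches 1 when almost every token in a document is a distinct term, which is typical of short texts.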
Stylometry-based evaluation measure
[Figure: panels 20 Newsgroups (Test) and CICLing-2002]
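Given a corpus D with vocabulary V(D), the probability of term t_i is estimated from its term frequency in D, tf(t_i, D), as follows:
$$
P(t_i) = \frac{tf(t_i, D)}{\sum_{t_j \in V(D)} tf(t_j, D)}
$$
whereas the expected Zipfian distribution (s = 1) of terms is obtained as:
$$
Q(t_i) = \frac{1 / i^{s}}{\sum_{r=1}^{|V(D)|} 1 / r^{s}}
$$
where i is the frequency rank of t_i.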
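One way to turn the observed and the expected distributions into a single score is to measure how far the observed term distribution departs from the Zipfian one. The sketch below assumes that terms are ranked by decreasing frequency and uses the Kullback-Leibler divergence of the observed distribution from the Zipfian one; the slide does not show the final formula, so the choice of divergence, the natural logarithm, and the function name are all assumptions.

```python
import math
from collections import Counter

def zipf_divergence(tokens, s=1.0):
    """KL divergence between the observed term distribution and a Zipfian one."""
    tf = Counter(tokens)
    total = sum(tf.values())
    freqs = sorted(tf.values(), reverse=True)          # rank 1 = most frequent term
    norm = sum(1.0 / r ** s for r in range(1, len(freqs) + 1))
    score = 0.0
    for rank, f in enumerate(freqs, start=1):
        p = f / total                                  # observed probability
        q = (1.0 / rank ** s) / norm                   # expected Zipfian probability
        score += p * math.log(p / q)
    return score

zipf_like = ["a"] * 6 + ["b"] * 3 + ["c"] * 2   # frequencies proportional to 1, 1/2, 1/3
uniform = ["a", "b", "c"]
```

A corpus whose term frequencies follow Zipf's law exactly scores 0, and the score grows as the distribution flattens or sharpens.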
Domain broadness evaluation measures
- Based on statistical language modeling
- Given a corpus D with gold standard
[Figure: language models LM1 and LM2]
Domain broadness evaluation measures
- Based on vocabulary dimensionality
- Given a corpus D = {d1, d2, ..., dn} with gold standard
[Figure: narrow domain vs. wide domain, each shown against the complete vocabulary]
Class imbalance evaluation measure
Given a corpus D = {d1, d2, ..., dn} with gold standard
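The formula itself is missing from the slide. As an illustrative assumption, class imbalance can be measured as the standard deviation of the observed class proportions around the proportion expected under a perfectly balanced gold standard (1/K for K categories). The function name and the example labels are hypothetical.

```python
import math
from collections import Counter

def class_imbalance(labels):
    """Standard deviation of class proportions around the balanced value 1/K."""
    n = len(labels)
    sizes = Counter(labels)          # category -> number of documents
    k = len(sizes)
    expected = 1.0 / k               # proportion of each class if perfectly balanced
    return math.sqrt(sum((c / n - expected) ** 2 for c in sizes.values()) / k)

balanced = class_imbalance(["x", "y", "x", "y"])
skewed = class_imbalance(["x", "x", "x", "y"])
```

A perfectly balanced gold standard scores 0; larger values indicate that a few categories hold most of the documents.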
Structure-based evaluation measures
Assessment of text corpora
[Figures not recoverable; one panel reports τ(SLMB(C)) = 0.82]
Kendall tau per measure:
Measure   Kendall tau
SLMB       0.82
ULMB       0.56
SVB        0.67
UVB        0.56
SEM        0.86
DL         0.96
VL         0.78
CI         1.00
Ro         0.64
mRH-J      0.09
mRH-C     -0.05
VDR        0.05
Dunn      -0.09
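Kendall's tau compares two rankings of the same items by counting concordant and discordant pairs. A minimal sketch of the tau-a variant (no tie correction) follows; the slides do not say which variant was used, and the function name and the example scores are made up for illustration.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a between two equally long score lists (tied pairs count as neither)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

same_order = kendall_tau([1, 2, 3, 4], [10, 20, 30, 40])
reverse_order = kendall_tau([1, 2, 3, 4], [4, 3, 2, 1])
one_swap = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```

Values near 1 mean the two rankings order the items almost identically, values near 0 mean the orderings are unrelated, and negative values mean they tend to disagree.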
WikiArabic Clustering Corpus
- INEX 2006 Arabic Wikipedia corpus
- 3,638 categories (each tagging one or more documents)
- 11,637 XML files (Arabic and Buckwalter)
- 1,725 categories tag only one document
- 8,690 documents are single-categorized
- The 30 most frequent categories were selected (R30)
- 1,089 documents belong to R30
WikiArabic Clustering Corpus
Assessment measure   WikiBuckCorpusR30.col   WikiBuckCorpusR30.col.TOK
DL                   211.92                  241.17
VL                   129.26                  125.02
VDR                  0.907                   0.880
SEM                  0.112                   0.045
CI                   0.016                   0.016
SLMB                 2612.71                 1442.05
ULMB                 856.91                  296.96
SVB                  10.46                   8.86
UVB                  11.02                   9.38
Dunn-C               0.977                   0.974
EDM-C                1.51                    1.44
TotalTerms           230,785                 262,641
CorpusVocSize        47,155                  38,833
WikiArabic Clustering Corpus
[Figures: assessment measures SEM, VL, and CI for WikiBuckCorpusR30.col and WikiBuckCorpusR30.col.TOK]
WaCOS
(http://nlp.dsic.upv.es:8080/watermarker)
WaCOS: modes of execution
WaCOS: Graphical view of results
Domain broadness: Vocabulary-based
Class imbalance: category cardinalities
Stylometric: Zipfian distribution
Structure: Dunn index family
Validity measures
[Figure: Category 1, Category 2, Category 3, and Category 4]
The relative hardness
- An analysis of the relative hardness of Reuters-21578 subsets [Debole & Sebastiani, 2005]
- Subjective and Objective measures applied to the clustering of narrow domain scientific abstracts (in Spanish) [Ingaramo et al., 2007]
- On the relative hardness of clustering corpora [Pinto & Rosso, 2007]
- Evaluation of Internal Validity Measures in Short-Text Corpora [Ingaramo et al., 2008]
Correlation of validity measures (CICLing-2002)
Correlation of validity measures (WSI-SemEval)
Relative Hardness (R8 Reuters)
[Figure: panels (a) Training and (b) Test]
Conclusions
The clustering of narrow domain short-text corpora is one of the most difficult tasks of unsupervised data analysis, because of:
- the high overlap of vocabularies among the texts
- the low term frequencies of short texts
We have addressed the above problems by studying three research directions:
1. The study of methods and techniques for improving the clustering of narrow domain short-text corpora.
2. The determination of classifier-independent corpus features and the assessment of each of them.
3. The application of the proposed methods and techniques in different areas of natural language processing.
Conclusions
We have confirmed the hypotheses formulated at the beginning of this PhD thesis:
- Whether a corpus is composed of narrow domain short texts is identifiable.
- Clustering results obtained with narrow domain short-text corpora can be improved without using external knowledge resources, which are not always available in narrow domains.
Contributions
- The study and introduction of evaluation measures to analyse the following features of a corpus: shortness, domain broadness, class imbalance, stylometry, and structure.
- The development of the Watermarking Corpora On-line System, named WaCOS, for the assessment of corpus features.
- A new unsupervised methodology (which does not use any external knowledge resource) for dealing with narrow domain short-text corpora. This methodology suggests first applying self-term expansion and then term selection.
Research directions
- To observe the possible relationship between the clustering of narrow domain short-text corpora and summarization, in both directions.
- To test the performance of the proposed approach on multi-categorized narrow domain short-text corpora (fuzzy clustering).
- To transfer the technology: different enterprises are interested in the classification of short texts.
UNIVERSIDAD POLITÉCNICA DE VALENCIA
On Clustering and Evaluation of Narrow Domain Short-Text Corpora
PhD Thesis Dissertation
David Pinto
[email protected] http://nlp.dsic.upv.es:8080/watermarker/