8/14/2009

UNIVERSIDAD POLITÉCNICA DE VALENCIA

On Clustering and Evaluation of Narrow Domain Short-Text Corpora
PhD Thesis Dissertation
David Eduardo Pinto Avendaño
Advisors: Dr. Paolo Rosso, Dr. Héctor Jiménez-Salazar

Outline
- Overview
- Hypothesis
- Challenges
- Clustering of narrow domain short-text corpora
- Assessment of text corpora
- Validity measures
- Conclusions and research directions


Overview
In this thesis, we deal with the treatment of narrow domain short-text collections in three areas: evaluation, clustering, and validation of corpora.

- The main focus is short-text clustering (short vs. long documents). This is highly relevant given the current and future way people use “small language” (e.g. blogs, snippets, news, and text-message generation such as email or chat).
- This is complemented with the domain broadness of corpora (narrow vs. wide domain). In the categorization task, it is very difficult to deal with narrow domain corpora such as scientific papers, technical reports, patents, etc.

Overview
- To study possible strategies to tackle the following two problems: (a) the low frequencies of vocabulary terms in short texts, and (b) the high vocabulary overlap associated with narrow domains.
- To provide a general framework for the evaluation of classifier-independent corpus features.


Hypothesis
- Whether a corpus is composed of narrow domain short texts can be identified.
- Clustering results obtained with narrow domain short-text corpora can be improved without using external knowledge resources, which are not always available in narrow domains.


Challenges
1. To propose a framework for the assessment of a set of corpus features that would be useful to understand the nature of the documents from the viewpoint of shortness and broadness.
   - The proposed measures will allow us to evaluate the relative hardness of corpora to be clustered and to study additional corpus features such as the particular writing style of scientific researchers.
   - To be able to distinguish corpora that are composed of narrow domain short texts from those that are not.


Challenges
2. By determining the degree of broadness and shortness of corpora we may analyse the following issues:
   - To test clustering methods in order to determine the complexity of classifying narrow domain short-text collections.
   - To investigate the possible components that could improve the accuracy obtained in the clustering task.
3. To validate clustering results in the following two ways:
   - By applying internal clustering validity measures in order to “validate” the quality of the clusters obtained by a given clustering method.
   - By employing similar measures in order to assess the quality of gold standards.

Outline
- Overview
- Hypothesis
- Challenges
- Clustering of narrow domain short-text corpora
  - Introduction
  - Improving the classical approach
  - The case study of word sense induction
- Assessment of text corpora
- Validity measures
- Conclusions and research directions


Introduction: clustering definition
- Clustering analysis refers to the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often proximity, according to some defined distance measure.

Example: Categories = {green, red, yellow, blue} or Categories = {sphere, cube, pyramid}


Introduction: object representation

Introduction: Document representation


Introduction: Clustering process

Document-by-term matrix:

        T1    T2    …    TV
  D1    w11   w12   …    w1v
  D2    w21   w22   …    w2v
  :     :     :          :
  Dn    wn1   wn2   …    wnv

Document-by-document matrix:

        D1    D2    …    Dn
  D1    θ11   θ12   …    θ1n
  D2    θ21   θ22   …    θ2n
  :     :     :          :
  Dn    θn1   θn2   …    θnn

Clustering method: K-Means, K-Star, MajorClust, SLC, CLC, etc.
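The two matrices on this slide can be illustrated with a short Python sketch. It is only an illustration under assumptions, not the thesis implementation: raw term counts stand in for the weights w_ij, cosine similarity stands in for the θ_ij entries, and the function names and sample documents are invented for the example.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] is the count of term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[t] for t in vocabulary])
    return vocabulary, matrix


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(matrix):
    """Pairwise document similarities, one row and one column per document."""
    return [[cosine(u, v) for v in matrix] for u in matrix]


# Invented toy corpus: two related physics snippets and one unrelated snippet.
docs = ["higgs boson decay", "boson decay channel", "word sense induction"]
vocab, W = term_document_matrix(docs)
S = similarity_matrix(W)
```

A clustering method such as those named on the slide would then take `S` (or `W`) as input. With texts this short, most entries of `W` are zero, which is the low term frequency problem the thesis sets out to address.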

Introduction: Motivation
- Supervised approach: [N. Ide and J. Véronis, 1990], [Hynek et al., 2000], [S. Zelikovitz and H. Hirsh, 2000], [Hotho et al., 2003a], [Hotho et al., 2003b], [S. Zelikovitz and F. Marquez, 2005], [Montejo-Ráez et al., 2005], [Q. Pu and G.-W. Yang, 2006], [Buscaldi et al., 2006], [Montejo-Ráez et al., 2006], [Peng et al., 2007], ...
- Unsupervised approach: [Makagonov et al., 2004], [Alexandrov et al., 2005]


Introduction: Motivation
Is there a hierarchical relationship in the difficulty of clustering different kinds of text data? Two problems stand out:
- the low frequencies of vocabulary terms in short texts, and
- the high vocabulary overlap associated with narrow domains.

Introduction: Motivation
- Practical applications
  - Clustering of search engine results
  - Word sense discrimination/induction
  - Homonymy discrimination
  - Summarization
  - Clustering of blogs (opinion analysis)
  - Assessment of quality of corpora


More knowledge = Better clustering

Term expansion: external resource (WordNet, HowNet, etc.)

Self-term expansion


Self-Term Expansion

[Diagram: the document-by-term matrix (documents D1 ... Dn, terms T1 ... TV, weights w11 ... wnv), self-term expansion, and the clustering method (K-Means, K-Star, MajorClust, SLC, CLC, etc.)]

Self-Term Expansion

Self-Term Expansion Technique + Term Selection Technique = Self-Term Expansion Methodology
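The slide gives only the name of each step, so the following Python sketch is one possible reading rather than the method of the thesis. It assumes that related terms are found from document-level co-occurrence in the corpus itself (scored here with pointwise mutual information), that expansion appends those related terms to each document, and that term selection keeps the terms with the highest document frequency. The threshold, the function names and the toy corpus are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_pmi(docs):
    """PMI of every term pair that co-occurs in at least one document."""
    n = len(docs)
    df, pair_df = Counter(), Counter()
    for tokens in docs:
        terms = sorted(set(tokens))
        df.update(terms)
        pair_df.update(combinations(terms, 2))
    return {(a, b): math.log2(c * n / (df[a] * df[b]))
            for (a, b), c in pair_df.items()}


def expansion_lists(pmi, threshold):
    """Map each term to the terms whose PMI with it reaches the threshold."""
    related = {}
    for (a, b), score in pmi.items():
        if score >= threshold:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)
    return related


def self_expand(docs, related):
    """Append to each document the related terms of every term it contains."""
    expanded = []
    for tokens in docs:
        out = list(tokens)
        for t in tokens:
            out.extend(sorted(related.get(t, ())))
        expanded.append(out)
    return expanded


def select_by_df(docs, keep):
    """Keep only the `keep` terms with highest document frequency (ties: alphabetical)."""
    df = Counter(t for tokens in docs for t in set(tokens))
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = {t for t, _ in ranked[:keep]}
    return [[t for t in tokens if t in kept] for tokens in docs]


# Invented toy corpus of already tokenized short texts.
corpus = [["quark", "gluon"], ["quark", "gluon", "jet"],
          ["jet", "detector"], ["detector", "trigger"]]
related = expansion_lists(cooccurrence_pmi(corpus), threshold=0.5)
expanded = self_expand(corpus, related)
selected = select_by_df(expanded, keep=3)
```

No external resource such as WordNet is consulted: the only knowledge used comes from the corpus being clustered, which is the point of the "self" in self-term expansion.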


Clustering narrow domain short-text corpora (CICLing-2002 / K-Star)
[Figure: panels labelled DF, TP and TS]

Clustering narrow domain short-text corpora (hep-ex / K-Star)
[Figure: panels labelled DF, TP and TS]


Clustering narrow domain short-text corpora (CICLing-2002 / DK-Means)

Clustering narrow domain short-text corpora (hep-ex / DK-Means)


Word Sense Discrimination/Induction


Clustering narrow domain short-text corpora (WSI-SemEval)
[Figure: panels labelled DF, TP and TS]

More knowledge = Better clustering
- A priori assessment of text corpora
- Self-term expansion methodology
- Clustering narrow domain short-text corpora


Assessment of text corpora
- Shortness: document length
- Stylometry: the linguistic style of a writer
- Class imbalance: the document distribution across the corpus
- Structure: structural properties of the categories distribution
- Domain broadness: narrow vs. wide domain


Assessment of Text Corpora

  Corpus feature        To be used as
  Document shortness    Internal measure
  Stylometry            Internal measure
  Class imbalance       External measure
  Structure             External measure
  Domain broadness      Both

unsupervised vs supervised

Shortness-based evaluation measures
Given a corpus made up of n documents D = {d1, d2, ..., dn}, with V(di) the vocabulary of di, the following three measures are considered to be related to the shortness of D.
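The three formulas are not reproduced in this copy of the slides. The sketch below is a guess based on the measure names DL, VL and VDR that appear in the later tables: DL is taken to be the mean document length in tokens, VL the mean size of the per-document vocabulary V(di), and VDR the ratio log(VL) / log(DL). The ratio form is an assumption; it is at least consistent with the WikiArabic table in this deck, where DL = 211.92 and VL = 129.26 are reported with VDR = 0.907. The function name is invented.

```python
import math


def shortness_measures(docs):
    """Return (DL, VL, VDR) for a corpus given as a list of token lists.

    Assumed definitions (not confirmed by the slide text):
      DL  = mean number of tokens per document
      VL  = mean number of distinct terms per document
      VDR = log(VL) / log(DL)
    """
    n = len(docs)
    dl = sum(len(tokens) for tokens in docs) / n
    vl = sum(len(set(tokens)) for tokens in docs) / n
    if dl <= 1 or vl <= 0:
        raise ValueError("VDR is undefined for documents this short")
    return dl, vl, math.log(vl) / math.log(dl)


# Invented toy corpus: two documents of four tokens each.
toy = [["a", "b", "a", "c"], ["d", "e", "d", "d"]]
dl, vl, vdr = shortness_measures(toy)

# Cross-check of the assumed ratio against the values reported for WikiBuckCorpusR30.col.
vdr_reported_inputs = math.log(129.26) / math.log(211.92)
```

Under these assumptions, a VDR close to 1 means that almost every token in a document is a new term, which is what one would expect of very short texts.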


Stylometry-based evaluation measure
[Figure: panels for 20 Newsgroups (Test) and CICLing-2002]

Stylometry-based evaluation measure
Given a corpus D with vocabulary V(D), the probability of term ti is calculated from its term frequency in D, tf(ti, D), as follows:

Whereas the expected Zipfian distribution (s = 1) of terms is obtained as:
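The two formulas did not survive in this copy, so the following Python sketch fills them in with the standard forms, which should be read as assumptions: the term probability is tf(ti, D) divided by the total number of tokens in D, and the Zipfian expectation for the term of rank r is (1 / r^s) normalised over all ranks 1..|V(D)|. The final comparison of the two distributions with a symmetric Kullback-Leibler divergence is likewise an assumption; the slide text does not say how they are compared. All function names are invented.

```python
import math
from collections import Counter


def term_probabilities(tokens):
    """Observed term probabilities, ordered from most to least frequent."""
    tf = Counter(tokens)
    total = sum(tf.values())
    ranked = sorted(tf.items(), key=lambda kv: (-kv[1], kv[0]))
    return [count / total for _, count in ranked]


def zipf_expected(vocab_size, s=1.0):
    """Expected probability of the term at each rank under Zipf's law with exponent s."""
    weights = [1.0 / (rank ** s) for rank in range(1, vocab_size + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]


def symmetric_kl(p, q):
    """KL(p||q) + KL(q||p) for two strictly positive distributions of equal length."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


# Invented example: counts 6, 3, 2 follow Zipf's law exactly for three terms.
zipf_like = "a a a a a a b b b c c".split()
# Invented example: a flat distribution, far from Zipfian.
flat = "a a b b c c".split()

p_zipf = term_probabilities(zipf_like)
p_flat = term_probabilities(flat)
q = zipf_expected(3)
d_zipf = symmetric_kl(p_zipf, q)
d_flat = symmetric_kl(p_flat, q)
```

In this reading, a corpus whose term frequencies follow the expected curve scores close to zero, and the score grows as the observed distribution departs from it.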


Domain broadness evaluation measures
- Based on statistical language modeling

Given a corpus D with gold standard
[Diagram labels: LM1, LM2]


Domain broadness evaluation measures
- Based on vocabulary dimensionality

[Figure: narrow domain vs. wide domain, each shown against the complete vocabulary]

Given a corpus D = {d1, d2, ..., dn} with gold standard


Class imbalance evaluation measure
Given a corpus D = {d1, d2, ..., dn} with gold standard
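The formula for this measure is missing from this copy. One plausible formalisation, offered here purely as an assumption, is the standard deviation of the observed class proportions around the proportion 1/K that each of the K gold-standard classes would have in a perfectly balanced corpus. The function name and the class sizes below are invented for illustration.

```python
import math


def class_imbalance(class_sizes):
    """Deviation of class proportions from the balanced proportion 1/K.

    Assumed form (not taken from the slide):
      sqrt( (1/K) * sum_k (|C_k| / n - 1/K)^2 )
    """
    k = len(class_sizes)
    n = sum(class_sizes)
    expected = 1.0 / k
    return math.sqrt(sum((size / n - expected) ** 2 for size in class_sizes) / k)


# Invented gold standards: one perfectly balanced, one dominated by a single class.
balanced = class_imbalance([10, 10, 10])
skewed = class_imbalance([8, 1, 1])
```

Under this assumption the measure is 0 for a balanced corpus and grows as documents concentrate in a few categories. It needs the gold standard, which matches its listing as an external measure.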


Structure-based evaluation measures

Assessment of text corpora


T(SLMB(C))=0.82


Assessment of text corpora

  Measure    Kendall Tau
  SLMB        0.82
  ULMB        0.56
  SVB         0.67
  UVB         0.56
  SEM         0.86
  DL          0.96
  VL          0.78
  CI          1.00
  Ro          0.64

  mRH-J       0.09
  mRH-C      -0.05
  VDR         0.05
  Dunn       -0.09
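The Kendall tau values reported on this slide are rank correlation coefficients. The slide does not say which two rankings are compared for each measure, so the sketch below only shows how the coefficient itself is computed, using the tau-a variant (no tie correction) as an assumption. The function name and the example rankings are invented.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists.

    Counts pairs ordered the same way in both lists (concordant) against pairs
    ordered in opposite ways (discordant); tied pairs count as neither.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs


# Invented rankings of four corpora, e.g. by a measure and by a reference order.
same = kendall_tau([1, 2, 3, 4], [1, 2, 3, 4])
reverse = kendall_tau([1, 2, 3, 4], [4, 3, 2, 1])
one_swap = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```

A value of 1.00 means the two rankings agree on every pair, and values near 0 mean the two rankings are essentially unrelated.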

Assessment of text corpora


WikiArabic Clustering Corpus
- INEX 2006 Arabic Wikipedia corpus
- 3,638 categories (tagging one or more documents)
- 11,637 XML files (Arabic and Buckwalter)
- 1,725 categories tag only one document
- 8,690 documents are single-categorized
- The 30 most frequent categories were selected
- 1,089 documents belong to R30

WikiArabic Clustering Corpus

  Assessment measure    WikiBuckCorpusR30.col    WikiBuckCorpusR30.col.TOK
  DL                    211.92                   241.17
  VL                    129.26                   125.02
  VDR                   0.907                    0.880
  SEM                   0.112                    0.045
  CI                    0.016                    0.016
  SLMB                  2612.71                  1442.05
  ULMB                  856.91                   296.96
  SVB                   10.46                    8.86
  UVB                   11.02                    9.38
  Dunn-C                0.977                    0.974
  EDM-C                 1.51                     1.44
  TotalTerms            230,785                  262,641
  CorpusVocSize         47,155                   38,833


WikiArabic Clustering Corpus
[Figures: the SEM, VL and CI assessment measures for WikiBuckCorpusR30.col and WikiBuckCorpusR30.col.TOK]

WaCOS (http://nlp.dsic.upv.es:8080/watermarker)


WaCOS: modes of execution

WaCOS: Graphical view of results


Domain broadness: Vocabulary-based

Class imbalance: category cardinalities


Stylometric: Zipfian distribution

Structure: Dunn index family


Validity measures
[Figure: documents grouped into Category 1, Category 2, Category 3 and Category 4]


The relative hardness
- An analysis of the relative hardness of Reuters-21578 subsets [Debole & Sebastiani, 2005]
- Subjective and Objective measures applied to the clustering of narrow domain scientific abstracts (in Spanish) [Ingaramo et al., 2007]
- On the relative hardness of clustering corpora [Pinto & Rosso, 2007]
- Evaluation of Internal Validity Measures in Short-Text Corpora [Ingaramo et al., 2008]

Correlation of validity measures (CICLing-2002)


Correlation of validity measures (WSI-SemEval)

Relative Hardness (R8 Reuters)

a) Training

b) Test


Conclusions
The clustering of narrow domain short-text corpora is one of the most difficult tasks of unsupervised data analysis.
- the high overlapping of vocabularies among the texts
- the low term frequency of short texts

We have addressed the above problems by studying three research directions:
1. The study of methods and techniques for improving clustering of narrow domain short-text corpora.
2. The determination of classifier-independent corpus features and the assessment of each of them.
3. The applications of the proposed methods and techniques in different areas of natural language processing.


Conclusions
We have confirmed the hypothesis formulated at the beginning of this PhD thesis:
- Whether a corpus is composed of narrow domain short texts can be identified.
- Clustering results obtained with narrow domain short-text corpora can be improved without using external knowledge resources, which are not always available in narrow domains.

Contributions
- The study and introduction of evaluation measures to analyse the following features of a corpus: shortness, domain broadness, class imbalance, stylometry and structure.
- The development of the Watermarking Corpora On-line System, named WaCOS, for the assessment of corpus features.
- A new unsupervised methodology (which does not use any external knowledge resource) for dealing with narrow domain short-text corpora. This methodology suggests first applying self-term expansion and then term selection.


Research directions
- To observe the possible relationship that the clustering of narrow domain short-text corpora may have with summarization, and vice versa.
- To test the performance of the proposed approach on multi-categorized narrow domain short-text corpora (fuzzy clustering).
- To transfer the technology: different enterprises are interested in the classification of short texts.

UNIVERSIDAD POLITÉCNICA DE VALENCIA
On Clustering and Evaluation of Narrow Domain Short-Text Corpora
PhD Thesis Dissertation
David Pinto
[email protected]
http://nlp.dsic.upv.es:8080/watermarker/
