8/14/2009

UNIVERSIDAD POLITÉCNICA DE VALENCIA

On Clustering and Evaluation of Narrow Domain Short-Text Corpora
PhD Thesis Dissertation
David Eduardo Pinto Avendaño
Advisors: Dr. Paolo Rosso, Dr. Héctor Jiménez-Salazar

Outline

- Overview
- Hypothesis
- Challenges
- Clustering of narrow domain short-text corpora
- Assessment of text corpora
- Validity measures
- Conclusions and research directions


Overview

In this thesis, we deal with the treatment of narrow domain short-text collections in three areas: evaluation, clustering, and validation of corpora.

- Mainly focused on short-text clustering (short vs. long documents). This is quite relevant, given the current and future way people use "small language" (e.g. blogs, snippets, news, and text-message generation such as email or chat).
- Complemented with the domain broadness of corpora (narrow vs. wide domain). In the categorization task, it is very difficult to deal with narrow domain corpora such as scientific papers, technical reports, patents, etc.

Overview

- To study possible strategies to tackle the following two problems: a) the low frequencies of vocabulary terms in short texts, and b) the high vocabulary overlap associated with narrow domains.
- To provide a general framework for the evaluation of classifier-independent corpus features.


Hypothesis

- It is possible to identify whether a corpus is composed of narrow domain short texts.
- Clustering results obtained with narrow domain short-text corpora can be improved without using external knowledge resources, which are not always available in narrow domains.


Challenges

1. To propose a framework for assessing a set of corpus features that help to understand the nature of the documents from the viewpoint of shortness and broadness.
   - The proposed measures will allow us to evaluate the relative hardness of corpora to be clustered and to study additional corpus features, such as the particular writing style of scientific researchers.
   - To be able to distinguish corpora that are composed of narrow domain short texts from those that are not.


Challenges

2. By determining the degree of broadness and shortness of corpora, we may analyse the following issues:
   - To test clustering methods in order to determine the complexity of classifying narrow domain short-text collections.
   - To investigate the possible components that could improve the accuracy obtained in the clustering task.
3. To validate clustering results in the following two ways:
   - By applying internal clustering validity measures in order to "validate" the quality of the clusters obtained by a given clustering method.
   - By employing similar measures in order to assess the quality of gold standards.

Outline

- Overview
- Hypothesis
- Challenges
- Clustering of narrow domain short-text corpora
  - Introduction
  - Improving the classical approach
  - The case study of word sense induction
- Assessment of text corpora
- Validity measures
- Conclusions and research directions


Introduction: clustering definition

Clustering analysis refers to the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often proximity, according to some defined distance measure.

Categories = {green, red, yellow, blue} or Categories = {sphere, cube, pyramid}


Introduction: object representation

Introduction: Document representation


Introduction: Clustering process

Term-document matrix (documents D1..Dn by terms T1..TV):

      T1   T2   …  TV
D1    w11  w12  …  w1v
D2    w21  w22  …  w2v
:     :    :       :
Dn    wn1  wn2  …  wnv

Document-document matrix:

      D1   D2   …  Dn
D1    θ11  θ12  …  θ1n
D2    θ21  θ22  …  θ2n
:     :    :       :
Dn    θn1  θn2  …  θnn

Clustering method: K-Means, K-Star, MajorClust, SLC, CLC, etc.
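The process above can be sketched in a few lines. This is only an illustration under stated assumptions: the slide does not say how the weights w or the values θ are computed, so tf-idf weighting and cosine similarity are assumed here, and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] is the weight of term j in document i.
    ASSUMPTION: tf-idf weighting, which the slide does not specify."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        rows.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, rows

def cosine(u, v):
    """Cosine similarity of two weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_matrix(rows):
    """Document-by-document matrix; ASSUMPTION: entries are cosine similarities."""
    return [[cosine(u, v) for v in rows] for u in rows]

# Hypothetical toy corpus.
docs = [
    "short text clustering",
    "clustering of short abstracts",
    "particle physics experiment",
]
vocab, W = tfidf_matrix(docs)
S = similarity_matrix(W)
```

A clustering method such as those listed on the slide would then take S (or W) as its input.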

Introduction: Motivation

Supervised approach:
[N. Ide and J. Véronis, 1990], [Hynek et al., 2000], [S. Zelikovitz and H. Hirsh, 2000], [Hotho et al., 2003a], [Hotho et al., 2003b], [S. Zelikovitz and F. Marquez, 2005], [Montejo-Ráez et al., 2005], [Q. Pu and G.-W. Yang, 2006], [Buscaldi et al., 2006], [Montejo-Ráez et al., 2006], [Peng et al., 2007], ...

Unsupervised approach:
[Makagonov et al., 2004], [Alexandrov et al., 2005]


Introduction: Motivation

Is there a hierarchical relationship in the difficulty of clustering different kinds of text data? Two factors are considered:
- the low frequencies of vocabulary terms in short texts, and
- the high vocabulary overlap associated with narrow domains.

Introduction: Motivation

Practical applications:
- Clustering of search engine results
- Word sense discrimination/induction
- Homonymy discrimination
- Summarization
- Clustering of blogs (opinion analysis)
- Assessment of the quality of corpora


More knowledge = Better clustering

Term expansion: external resource (WordNet, HowNet, etc.)

Self-term expansion


Self-Term Expansion

[Diagram: term-document matrix (documents D1..Dn by terms T1..TV, weights w11..wnv), self-term expansion, and clustering method (K-Means, K-Star, MajorClust, SLC, CLC, etc.)]

Self-Term Expansion Technique + Term Selection Technique = Self-Term Expansion Methodology
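The two-step methodology (expansion first, then term selection) might be sketched as follows. This is a rough illustration, not the thesis implementation: the slides do not state how related terms are found, so a plain document co-occurrence threshold is assumed, and term selection is done by document frequency (reading the DF label on the result slides that way is also an assumption). All function names and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_lists(docs, min_cooc=2):
    """Map each term to the terms it co-occurs with in at least `min_cooc` documents.
    ASSUMPTION: raw co-occurrence counts; the original may use another association measure."""
    pair_counts = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            pair_counts[(a, b)] += 1
    related = {}
    for (a, b), count in pair_counts.items():
        if count >= min_cooc:
            related.setdefault(a, set()).add(b)
            related.setdefault(b, set()).add(a)
    return related

def self_expand(docs, related):
    """Append to each document the related terms of its own terms.
    Only the corpus itself is used; no external resource is consulted."""
    return [doc + [r for t in doc for r in sorted(related.get(t, ()))] for doc in docs]

def select_by_df(docs, keep):
    """Keep only the `keep` terms with the highest document frequency."""
    df = Counter(t for doc in docs for t in set(doc))
    kept = {t for t, _ in df.most_common(keep)}
    return [[t for t in doc if t in kept] for doc in docs]

# Hypothetical toy corpus of tokenized short texts.
docs = [
    ["quark", "collider", "detector"],
    ["quark", "collider"],
    ["detector", "calibration"],
]
related = cooccurrence_lists(docs)
expanded = self_expand(docs, related)
reduced = select_by_df(expanded, keep=3)
```

Expansion raises the frequencies of terms that tend to appear together, which is the intended remedy for the low term frequencies of short texts; selection then trims the enlarged vocabulary.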


Clustering narrow domain short-text corpora (CICLing-2002 / K-Star)

[Plots: DF, TP, TS]

Clustering narrow domain short-text corpora (hep-ex / K-Star)

[Plots: DF, TP, TS]


Clustering narrow domain short-text corpora (CICLing-2002 / DK-Means)

Clustering narrow domain short-text corpora (hep-ex / DK-Means)


Word Sense Discrimination/Induction


Clustering narrow domain short-text corpora (WSI-SemEval)

[Plots: DF, TP, TS]

More knowledge = Better clustering

- A priori assessment of text corpora
- Self-term expansion methodology
- Clustering narrow domain short-text corpora


Assessment of text corpora

- Shortness: document length
- Stylometry: refers to the linguistic style of a writer
- Class imbalance: the document distribution across the corpus
- Structure: structural properties of the distribution of categories
- Domain broadness: narrow vs. wide domain


Assessment of Text Corpora

Corpus feature        To be used as
Document shortness    Internal measure
Stylometry            Internal measure
Class imbalance       External measure
Structure             External measure
Domain broadness      Both

unsupervised vs supervised

Shortness-based evaluation measures

Given a corpus made up of n documents D={d1,d2,...,dn}, with V(di) the vocabulary of di, the following three measures are considered to be related to the shortness of D.
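The formulas for the three measures did not survive in this transcript. As a hedged sketch, the code below assumes they are the DL, VL and VDR values reported later for the WikiArabic corpus, read here as average document length, average per-document vocabulary size, and the ratio log(VL)/log(DL). These definitions are assumptions; the ratio is at least consistent with the tabulated figures, since log(129.26)/log(211.92) is about 0.908 against a reported 0.907.

```python
import math

def document_length(docs):
    """DL (assumed definition): mean number of tokens per document."""
    return sum(len(d) for d in docs) / len(docs)

def vocabulary_length(docs):
    """VL (assumed definition): mean size of the per-document vocabulary V(di)."""
    return sum(len(set(d)) for d in docs) / len(docs)

def vocabulary_document_ratio(docs):
    """VDR (assumed definition): log(VL) / log(DL).
    Undefined when DL equals 1, because log(1) is 0."""
    return math.log(vocabulary_length(docs)) / math.log(document_length(docs))

# Hypothetical toy corpus of tokenized documents.
docs = [
    ["a", "b", "a", "c"],
    ["b", "c", "d", "e", "e", "e"],
]
dl = document_length(docs)
vl = vocabulary_length(docs)
vdr = vocabulary_document_ratio(docs)
```

Under this reading, short documents repeat few terms, so VL approaches DL and VDR approaches 1.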


Stylometry-based evaluation measure

[Plots: 20 Newsgroups (Test) and CICLing-2002]

Stylometry-based evaluation measure Given a corpus D with vocabulary V(D), the probability of term ti is calculated by its term frequency in D, tf(ti,D), as follows:

Whereas the expected Zipfian distribution (s=1) of terms is obtained as:
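Neither formula survived in this transcript. A minimal sketch of the two distributions described above follows, taking P(ti) as the relative term frequency in D and the expected Zipfian probability of the term at rank r as (1/r^s) divided by the sum of 1/k^s over all ranks k. How the two are compared is not shown on the slide; the Kullback-Leibler divergence used below is an assumption, as are the function names.

```python
import math
from collections import Counter

def term_probabilities(tokens):
    """Observed probabilities tf(ti, D) / total term count, ordered by decreasing frequency."""
    tf = Counter(tokens)
    total = sum(tf.values())
    return [count / total for _, count in tf.most_common()]

def zipf_probabilities(vocab_size, s=1.0):
    """Expected Zipfian probability for ranks 1..vocab_size with exponent s."""
    norm = sum(1.0 / r ** s for r in range(1, vocab_size + 1))
    return [(1.0 / r ** s) / norm for r in range(1, vocab_size + 1)]

def kl_divergence(p, q):
    """ASSUMPTION: the observed and expected distributions are compared with KL divergence."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical corpus whose term counts (6, 3, 2) follow Zipf's law exactly for s=1.
zipfian_tokens = "a a a a a a b b b c c".split()
p = term_probabilities(zipfian_tokens)
q = zipf_probabilities(len(p), s=1.0)
divergence_zipfian = kl_divergence(p, q)

# A uniform sample departs from the Zipfian expectation.
uniform_tokens = "a a b b c c".split()
divergence_uniform = kl_divergence(
    term_probabilities(uniform_tokens), zipf_probabilities(3, s=1.0)
)
```

A value near zero indicates that the corpus term distribution is close to the Zipfian expectation; larger values indicate a stronger departure.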


Domain broadness evaluation measures

Based on statistical language modeling

Given a corpus D with gold standard

[Diagram: language models LM1 and LM2]


Domain broadness evaluation measures

Based on vocabulary dimensionality

[Diagram: narrow domain vs. wide domain, each shown against the complete vocabulary]

Given a corpus D={d1,d2,...,dn} with gold standard


Class imbalance evaluation measure

Given a corpus D={d1,d2,...,dn} with gold standard
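The formula for this measure is missing from the transcript. One plausible reading, given here purely as an assumption, is the root mean squared deviation of each gold-standard class share from the share expected under perfectly balanced classes (1/K for K classes). The function name and the label lists are hypothetical.

```python
import math
from collections import Counter

def class_imbalance(labels):
    """ASSUMED definition: sqrt of the mean squared deviation of each class share
    |Ci|/n from the balanced share 1/K. Returns 0.0 for perfectly balanced classes."""
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    return math.sqrt(sum((c / n - 1.0 / k) ** 2 for c in counts.values()) / k)

# Hypothetical gold-standard labels, one per document.
balanced = ["a", "a", "b", "b"]
skewed = ["a", "a", "a", "b"]
ci_balanced = class_imbalance(balanced)
ci_skewed = class_imbalance(skewed)
```

Because it needs the gold-standard categories, a measure of this kind is external (supervised).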


Structure-based evaluation measures

Assessment of text corpora

T(SLMB(C))=0.82

Assessment of text corpora

Kendall Tau (SLMB)  = 0.82
Kendall Tau (ULMB)  = 0.56
Kendall Tau (SVB)   = 0.67
Kendall Tau (UVB)   = 0.56
Kendall Tau (SEM)   = 0.86
Kendall Tau (DL)    = 0.96
Kendall Tau (VL)    = 0.78
Kendall Tau (CI)    = 1.00
Kendall Tau (Ro)    = 0.64

Kendall Tau (mRH-J) = 0.09
Kendall Tau (mRH-C) = -0.05
Kendall Tau (VDR)   = 0.05
Kendall Tau (Dunn)  = -0.09
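For readers unfamiliar with the statistic, Kendall tau measures the agreement between two rankings as (concordant pairs - discordant pairs) / total pairs, ranging from -1 (reversed order) to 1 (identical order). The sketch below is the tie-free tau-a variant; which reference ranking each measure was compared against is not shown on the slide, so the rankings used here are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau-a for two equally long score lists (ties count as neither
    concordant nor discordant)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs

# Hypothetical example: four corpora ranked by a measure and by a reference ordering.
reference = [1, 2, 3, 4]
by_measure = [1, 2, 4, 3]  # one swapped pair
tau = kendall_tau(reference, by_measure)
```

With one discordant pair out of six, the example yields (5 - 1) / 6, roughly 0.67.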


WikiArabic Clustering Corpus

- INEX 2006 Arabic Wikipedia corpus
- 3,638 categories (tagging one or more documents)
- 11,637 XML files (Arabic and Buckwalter)
- 1,725 categories tag only one document
- 8,690 documents are single-categorized
- The 30 most frequent categories were selected
- 1,089 documents belong to R30

WikiArabic Clustering Corpus

Assessment measure   WikiBuckCorpusR30.col   WikiBuckCorpusR30.col.TOK
DL                   211.92                  241.17
VL                   129.26                  125.02
VDR                  0.907                   0.880
SEM                  0.112                   0.045
CI                   0.016                   0.016
SLMB                 2612.71                 1442.05
ULMB                 856.91                  296.96
SVB                  10.46                   8.86
UVB                  11.02                   9.38
Dunn-C               0.977                   0.974
EDM-C                1.51                    1.44
TotalTerms           230,785                 262,641
CorpusVocSize        47,155                  38,833


WikiArabic Clustering Corpus

[Figures: SEM, VL and CI for WikiBuckCorpusR30.col and WikiBuckCorpusR30.col.TOK]

WaCOS (http://nlp.dsic.upv.es:8080/watermarker)


WaCOS: modes of execution

WaCOS: Graphical view of results


Domain broadness: Vocabulary-based

Class imbalance: category cardinalities


Stylometric: Zipfian distribution

Structure: Dunn index family


Validity measures

[Figure: documents grouped into Category 1, Category 2, Category 3 and Category 4]


The relative hardness

- An analysis of the relative hardness of Reuters-21578 subsets [Debole & Sebastiani, 2005]
- Subjective and Objective measures applied to the clustering of narrow domain scientific abstracts (in Spanish) [Ingaramo et al., 2007]
- On the relative hardness of clustering corpora [Pinto & Rosso, 2007]
- Evaluation of Internal Validity Measures in Short-Text Corpora [Ingaramo et al., 2008]

Correlation of validity measures (CICLing-2002)


Correlation of validity measures (WSI-SemEval)

Relative Hardness (R8 Reuters)

a) Training

b) Test


Conclusions

The clustering of narrow domain short-text corpora is one of the most difficult tasks of unsupervised data analysis.
- the high overlapping of vocabularies among the texts
- the low term frequency of short texts

We have addressed the above problems by studying three research directions:
1. The study of methods and techniques for improving clustering of narrow domain short-text corpora.
2. The determination of classifier-independent corpus features and the assessment of each of them.
3. The applications of the proposed methods and techniques in different areas of natural language processing.


Conclusions

We have confirmed the hypothesis formulated at the beginning of this PhD thesis:

- It is possible to identify whether a corpus is composed of narrow domain short texts.
- Clustering results obtained with narrow domain short-text corpora can be improved without using external knowledge resources, which are not always available in narrow domains.

Contributions

- The study and introduction of evaluation measures to analyse the following features of a corpus: shortness, domain broadness, class imbalance, stylometry and structure.
- The development of the Watermarking Corpora On-line System, named WaCOS, for the assessment of corpus features.
- A new unsupervised methodology (which does not use any external knowledge resource) for dealing with narrow domain short-text corpora. This methodology suggests first applying self-term expansion and then term selection.


Research directions

- To observe the possible relationship that the clustering of narrow domain short-text corpora may have with summarization, and vice versa.
- To test the performance of the proposed approach in multi-categorized narrow domain short-text corpora (fuzzy clustering).
- To transfer the technology: different enterprises are interested in the classification of short texts.

UNIVERSIDAD POLITÉCNICA DE VALENCIA
On Clustering and Evaluation of Narrow Domain Short-Text Corpora
PhD Thesis Dissertation
David Pinto
http://nlp.dsic.upv.es:8080/watermarker/
