Web Mining
Nguyen Tri Thanh
Email: [email protected]
Department of Information Systems, Faculty of Information Technology
College of Technology
May 11, 2011


Web Content Mining: Evaluating Clustering

 Approaches to Evaluating Clustering
 Similarity-Based Criterion Functions
 Probabilistic Criterion Functions
 MDL-Based Model and Feature Evaluation
 Classes to Clusters Evaluation
 Precision, Recall and F-measure
 Entropy


Approaches to Evaluating Clustering

 Criterion functions evaluate clustering models objectively, i.e. using only the document content
 Similarity-based functions: intracluster similarity and sum of squared errors
 Probabilistic functions: log-likelihood and category utility
 MDL-based evaluation


Approaches to Evaluating Clustering (cont’d)

 Document labels (if available) may also be used for evaluation of clustering models
 If the labeling is correct and accurately reflects the document content, we can evaluate the quality of the clustering
 If the clustering accurately reflects the document content, we can evaluate the quality of the labeling
 Criterion function based on labeled data: classes to clusters evaluation


Similarity-Based Criterion Functions (distance)

 Basic idea: the cluster center c_i (centroid, or mean in the case of numeric data) best represents cluster D_i if it minimizes the sum of the squared lengths of the “error” vectors x - c_i for all x ∈ D_i:

J_e = \sum_{i=1}^{k} \sum_{x \in D_i} \lVert x - c_i \rVert^2, \qquad c_i = \frac{1}{|D_i|} \sum_{x \in D_i} x

 Alternative formulation based on the pairwise distance between cluster members:

J_e = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{x_j, x_l \in D_i} \lVert x_j - x_l \rVert^2

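A minimal sketch of the J_e criterion in Python (numpy assumed; the list-of-arrays layout for clusters is an illustrative choice, not part of the original slides):

```python
import numpy as np

def sum_squared_errors(clusters):
    """J_e: total squared distance of each point to its cluster centroid.

    `clusters` is a list of 2-D arrays, one per cluster, with one
    document (or data point) vector per row.
    """
    J = 0.0
    for X in clusters:
        c = X.mean(axis=0)                      # centroid c_i of cluster D_i
        J += np.sum(np.linalg.norm(X - c, axis=1) ** 2)
    return J

# Toy usage: two clusters of 2-D points; a lower J_e means tighter clusters.
clusters = [np.array([[0.0, 0.0], [2.0, 0.0]]),
            np.array([[5.0, 5.0], [7.0, 5.0], [6.0, 6.0]])]
print(sum_squared_errors(clusters))
```
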

Similarity-Based Criterion Functions (cosine similarity)

 For document clustering, cosine similarity to the centroid is used:

J_s = \sum_{i=1}^{k} \sum_{d_j \in D_i} sim(c_i, d_j), \qquad sim(c_i, d_j) = \frac{c_i \cdot d_j}{\lVert c_i \rVert \, \lVert d_j \rVert}, \qquad c_i = \frac{1}{|D_i|} \sum_{d_j \in D_i} d_j

 Equivalent form based on pairwise similarity:

J_s = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{d_j, d_l \in D_i} sim(d_j, d_l)

 Another formulation based on average pairwise intracluster similarity (used to control the merging of clusters in hierarchical agglomerative clustering):

J_s = \frac{1}{2} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{d_j, d_l \in D_i} sim(d_j, d_l) = \frac{1}{2} \sum_{i=1}^{k} |D_i| \, sim(D_i)

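A matching sketch for the centroid-similarity criterion J_s, using the same illustrative list-of-arrays layout; rows are assumed to be non-zero document vectors (e.g. TF-IDF):

```python
import numpy as np

def centroid_similarity(clusters):
    """J_s: sum over clusters of the cosine similarity of each document
    to its cluster centroid."""
    J = 0.0
    for X in clusters:
        c = X.mean(axis=0)                      # centroid c_i
        # cosine similarity of every row of X with the centroid c
        sims = (X @ c) / (np.linalg.norm(X, axis=1) * np.linalg.norm(c))
        J += sims.sum()
    return J
```

Note that, unlike J_e, this criterion is maximized: a higher J_s means more coherent clusters.
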

Similarity-Based Criterion Functions (example)

[Figure: sum of centroid similarity evaluation of four clusterings]


Finding the Number of Clusters

[Figure]


Probabilistic Criterion Functions

 A document is a random event
 Probability of a document: P(d) = \sum_A P(d \mid A) P(A)
 Probability of a sample (assuming that documents are independent events): P(d_1, d_2, \ldots, d_n) = \prod_{i=1}^{n} \sum_A P(d_i \mid A) P(A)
 Log-likelihood (log of the probability of the sample): L = \sum_{i=1}^{n} \log \sum_A P(d_i \mid A) P(A)

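A minimal sketch of the log-likelihood computation, assuming the per-component conditionals P(d_i | A) and priors P(A) have already been estimated (the numbers below are purely illustrative):

```python
import numpy as np

def log_likelihood(cond, priors):
    """L = sum_i log sum_A P(d_i | A) P(A).

    `cond[i, a]` holds P(d_i | A=a); `priors[a]` holds P(A=a).
    """
    return float(np.sum(np.log(cond @ priors)))

cond = np.array([[0.20, 0.01],    # P(d_1 | A) for two mixture components
                 [0.02, 0.15]])   # P(d_2 | A)
priors = np.array([0.5, 0.5])     # P(A)
print(log_likelihood(cond, priors))
```
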

Probabilistic Criterion Functions (cont’d)

 Category Utility (based on the probability of attributes)
 Takes into account the probabilities of attributes having particular values within clusters and across clusters (Gluck & Corter)
 CU measures both the probability that:
 two objects in the same category have attribute values in common
 objects from different categories have different attribute values

CU = \sum_i \sum_j \sum_k P(a_j = v_{ij} \mid C_k) \, P(C_k \mid a_j = v_{ij}) \, P(a_j = v_{ij})


Category Utility (basic idea)

CU = \sum_i \sum_j \sum_k P(a_j = v_{ij} \mid C_k) \, P(C_k \mid a_j = v_{ij}) \, P(a_j = v_{ij})

 P(a_j = v_{ij} \mid C_k) is the probability that an object has value v_{ij} for its attribute a_j given that it belongs to category C_k. The higher this probability, the more likely two objects in a category share the same attribute values.
 P(C_k \mid a_j = v_{ij}) is the probability that an object belongs to category C_k given that it has value v_{ij} for its attribute a_j. The greater this probability, the less likely objects from different categories have attribute values in common.
 P(a_j = v_{ij}) is a weight coefficient ensuring that frequent attribute values have a stronger influence on the evaluation.


Category Utility Function

CU = \sum_i \sum_j \sum_k P(a_j = v_{ij} \mid C_k) \, P(C_k \mid a_j = v_{ij}) \, P(a_j = v_{ij})

Applying Bayes’ rule

P(C_k \mid a_j = v_{ij}) = \frac{P(a_j = v_{ij} \mid C_k) \, P(C_k)}{P(a_j = v_{ij})}

gives

CU = \sum_k P(C_k) \sum_i \sum_j P(a_j = v_{ij} \mid C_k)^2

 \sum_i \sum_j P(a_j = v_{ij} \mid C_k)^2 is the expected number of attribute values of a member of C_k correctly guessed using a probability matching strategy


Category Utility Function (cont’d)

Assume we don’t know the categories (clusters) in our sample. Then

\sum_i \sum_j P(a_j = v_{ij})^2

is the expected number of correctly guessed attribute values without knowing the categories (clusters) in the sample.

The final expression of the function is

CU(C_1, C_2, \ldots, C_n) = \frac{1}{n} \sum_k P(C_k) \sum_i \sum_j \left( P(a_j = v_{ij} \mid C_k)^2 - P(a_j = v_{ij})^2 \right)

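A sketch of this final formula for nominal attributes; the data layout (a list of equal-length attribute-value tuples, plus a list of member-index lists for the clusters) is an assumption made for illustration:

```python
from collections import Counter

def category_utility(data, clusters):
    """CU = (1/n) sum_k P(C_k) sum_{i,j} [P(a_j=v|C_k)^2 - P(a_j=v)^2],
    where n is the number of clusters."""
    N = len(data)
    n_attrs = len(data[0])
    # Overall value counts per attribute, for P(a_j = v)
    overall = [Counter(row[j] for row in data) for j in range(n_attrs)]
    base = sum((cnt / N) ** 2
               for j in range(n_attrs) for cnt in overall[j].values())
    cu = 0.0
    for members in clusters:
        within = [Counter(data[i][j] for i in members) for j in range(n_attrs)]
        inside = sum((cnt / len(members)) ** 2
                     for j in range(n_attrs) for cnt in within[j].values())
        cu += (len(members) / N) * (inside - base)   # P(C_k) * (...)
    return cu / len(clusters)
```
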

Category Utility (example)

A = {1,3,4,6,8,10,12,16,17,18,19}
B = {2,5,7,9,11,13,14,15,20}


Category Utility Function for continuous attributes

The category utility function can be extended to continuous attributes by assuming a normal distribution and replacing the probabilities with integrals over the squared density:

CU(C_1, C_2, \ldots, C_n) = \frac{1}{n} \sum_{k=1}^{n} P(C_k) \sum_i \left[ \int f(v_{ik})^2 \, dv_{ik} - \int f(v_i)^2 \, dv_i \right]

where f(\cdot) is the probability density function of the normal distribution.

After solving the integrals, we obtain

CU(C_1, C_2, \ldots, C_n) = \frac{1}{n} \sum_{k=1}^{n} P(C_k) \, \frac{1}{2\sqrt{\pi}} \sum_i \left( \frac{1}{\sigma_{ik}} - \frac{1}{\sigma_i} \right)


Data Example

[Table: 20-document sample described by six binary attributes (history, science, research, offers, students, hall)]


Finding Regularities in Data

 Describe the cluster as a pattern of attribute values that repeats in the data
 Include attribute values that are the same for all members of the cluster and omit attributes that have different values (generalization by dropping conditions)
 Hypothesis 1 (using attributes science and research):

H1 = { R1: IF (science = 0) AND (research = 1) THEN Class = A
       R2: IF (science = 1) AND (research = 0) THEN Class = A
       R3: IF (science = 1) AND (research = 1) THEN Class = A
       R4: IF (science = 0) AND (research = 0) THEN Class = B }

A = {1,3,4,6,8,10,12,16,17,18,19}
B = {2,5,7,9,11,13,14,15,20}


Finding Regularities in Data (cont’d)

 Hypothesis 2 (using attribute offers):

H2 = { R1: IF (offers = 0) THEN Class = A
       R2: IF (offers = 1) THEN Class = B }

A = {1,3,4,6,8,10,12,16,17,18,19}
B = {2,5,7,9,11,13,14,15,20}

 Which one is better, H1 or H2?


Occam’s Razor

 “Entities are not to be multiplied beyond necessity” (William of Occam, 14th century)
 Among several alternatives, the simplest one is usually the best choice
 H2 looks simpler (shorter) than H1, so H2 may be better than H1
 How do we measure simplicity?


Measuring Complexity

 Dropping more conditions produces simpler (shorter) hypotheses; the simplest one is the empty rule
 The latter, however, is a single cluster including the whole dataset (overgeneralization)
 The most complex (longest) hypothesis has 20 clusters and is equivalent to the original dataset (overfitting)
 How do we find the right balance?
 The answer to both questions is Minimum Description Length (MDL)


Minimum Description Length (MDL)

 Given a data set D (e.g. our document collection) and a set of hypotheses H_1, H_2, \ldots, H_n, each of which describes D
 Find the most likely hypothesis:

H = \arg\max_i P(H_i \mid D)

 Direct estimation of P(H_i \mid D) is difficult, so we apply Bayes’ rule:

P(H_i \mid D) = \frac{P(H_i) \, P(D \mid H_i)}{P(D)}

 Taking -\log_2 of both sides:

-\log_2 P(H_i \mid D) = -\log_2 P(H_i) - \log_2 P(D \mid H_i) + \log_2 P(D)


Minimum Description Length (MDL) (cont’d)

 Consider hypotheses and data as messages and apply Shannon’s information theory, which defines the information in a message as the negative logarithm of its probability
 Then estimate the number of bits L needed to encode the messages:

L(H_i \mid D) = L(H_i) + L(D \mid H_i) - L(D)


Minimum Description Length Principle

 L(H_i) and L(D) are the minimum numbers of bits needed to encode the hypothesis and the data
 L(D \mid H_i) is the number of bits needed to encode D if we know H_i
 If we think of H_i as a pattern that repeats in D, we don’t have to encode all its occurrences; rather, we encode only the pattern itself and the differences that identify each individual instance in D
 Thus, the more regularity in the data, the shorter the description length L(D \mid H_i)


Minimum Description Length Principle (cont’d)

 We need a good balance between L(H_i) and L(D \mid H_i), because if H_i describes the data exactly then L(D \mid H_i) = 0 but L(H_i) will be large
 We can exclude L(D) because it does not depend on the choice of hypothesis
 Minimum Description Length (MDL) principle:

H = \arg\min_i \left[ L(H_i) + L(D \mid H_i) \right]


MDL-based Model Evaluation (Basics)

 Choose a description language (e.g. rules)
 Use the same encoding scheme for both the hypotheses and the data given the hypotheses
 Assume that hypotheses and data are uniformly distributed
 Then the probability of the occurrence of an item out of n alternatives is 1/n
 And the minimum code length of the message informing us that a particular item has occurred is -\log_2(1/n) = \log_2 n
 How do we encode the occurrence of k items among n items? (There are C(n, k) possible choices, so \log_2 C(n, k) bits suffice.)


Recall Data Example

[Table: the 20-document sample with six binary attributes (history, science, research, offers, students, hall)]


Recall Hypotheses

 Hypothesis 1 (using attributes science and research):

H1 = { R1: IF (science = 0) AND (research = 1) THEN Class = A
       R2: IF (science = 1) AND (research = 0) THEN Class = A
       R3: IF (science = 1) AND (research = 1) THEN Class = A
       R4: IF (science = 0) AND (research = 0) THEN Class = B }

 Hypothesis 2 (using attribute offers):

H2 = { R1: IF (offers = 0) THEN Class = A
       R2: IF (offers = 1) THEN Class = B }

A = {1,3,4,6,8,10,12,16,17,18,19}
B = {2,5,7,9,11,13,14,15,20}


MDL-based Model Evaluation (Example)

 Description language: 12 attribute-value pairs (6 attributes, each with two possible values)
 Rule R1 covers documents 8, 16, 18 and 19:

No | History | Science | Research | Offers | Students | Hall
 8 |    0    |    0    |    1     |   0    |    0     |  0
16 |    0    |    0    |    1     |   0    |    1     |  1
18 |    0    |    0    |    1     |   1    |    1     |  1
19 |    0    |    0    |    1     |   1    |    1     |  1

 There are 9 different attribute-value pairs that occur in these documents: {history=0}, {science=0}, {research=1}, {offers=0}, {offers=1}, {students=0}, {students=1}, {hall=0}, {hall=1}
 Specifying rule R1 is equivalent to choosing 9 out of the 12 attribute-value pairs, which can be done in C(12, 9) ways


MDL-based Model Evaluation (Example) (cont’d)

 Since we use only 1 among the C(12, 9) possibilities, \log_2 C(12, 9) bits are needed to encode the left-hand side of R1
 In addition, we need one bit (a choice of one out of two cluster labels, \log_2 2) to encode the choice of the class
 Thus the code length of R1 is

L(R1) = \log_2 C(12, 9) + 1 = \log_2 220 + 1 = 8.78136


MDL-based Model Evaluation (L(H1))

 Similarly, we compute the code lengths of R2, R3 and R4:

L(R2) = \log_2 C(12, 7) + 1 = \log_2 792 + 1 = 10.6294
L(R3) = \log_2 C(12, 7) + 1 = \log_2 792 + 1 = 10.6294
L(R4) = \log_2 C(12, 10) + 1 = \log_2 66 + 1 = 7.04439

 Using the additivity of information, we obtain the code length of H1 by adding the code lengths of its constituent rules:

L(H1) = 37.0845


MDL-based Model Evaluation (L(D|R1))

 Consider the message-exchange setting, where the hypothesis R1 has already been communicated
 This means that the recipient of the message already knows the subset of 9 attribute-value pairs selected by rule R1
 Then, to communicate each document covered by R1, we need to choose the 6 pairs occurring in that document out of the 9 pairs
 This choice takes \log_2 C(9, 6) bits to encode


MDL-based Model Evaluation (L(D|R1)) (cont’d)

 As R1 covers 4 documents (8, 16, 18, 19), the code length needed for all of them is

L({8,16,18,19} | R1) = 4 × \log_2 C(9, 6) = 4 × \log_2 84 = 25.5693


MDL-based Model Evaluation (MDL(H1))

 Similarly, we compute the code lengths of the subsets of documents covered by the other rules:

L({4,6,10} | R2) = 3 × \log_2 C(7, 6) = 3 × \log_2 7 = 8.4220
L({1,3,12,17} | R3) = 4 × \log_2 C(7, 6) = 4 × \log_2 7 = 11.2294
L({2,5,7,9,11,13,14,15,20} | R4) = 9 × \log_2 C(10, 6) = 9 × \log_2 210 = 69.4282

 The code length needed to communicate all documents given hypothesis H1 is the sum of these code lengths: L(D | H1) = 114.649
 Adding this to the code length of the hypothesis, we obtain

MDL(H1) = L(H1) + L(D | H1) = 37.0845 + 114.649 = 151.733

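The arithmetic on the last few slides can be checked with a few lines of Python (math.comb gives the binomial coefficients); this is a sketch of this specific example, not a general MDL implementation:

```python
from math import comb, log2

def rule_length(pairs_covered, pairs_total=12):
    # log2 C(total, covered) bits for the left-hand side + 1 bit for the class
    return log2(comb(pairs_total, pairs_covered)) + 1

def data_length(n_docs, pairs_covered, pairs_per_doc=6):
    # each covered document: choose its 6 pairs out of the rule's pairs
    return n_docs * log2(comb(pairs_covered, pairs_per_doc))

L_H1 = sum(rule_length(p) for p in (9, 7, 7, 10))        # R1..R4 -> 37.0845
L_D_H1 = (data_length(4, 9) + data_length(3, 7)
          + data_length(4, 7) + data_length(9, 10))      # -> 114.649
print(L_H1 + L_D_H1)                                     # MDL(H1) = 151.733
```
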

MDL-based Model Evaluation (H1 or H2?)

 Similarly, we compute MDL(H2):

MDL(H2) = L(H2) + L(D | H2) = 9.16992 + 177.035 = 186.205

 MDL(H1) < MDL(H2) ⇒ H1 is better than H2
 Also (intuitively), L(H1) > L(H2) and L(D | H1) < L(D | H2)
 How about the most general and the most specific hypotheses?


Information (Data) Compression

 The most general hypothesis (the empty rule {}) does not restrict the choice of attribute-value pairs, so it selects 12 out of 12 pairs:

L({}) = \log_2 C(12, 12) + 1 = 1
L(D | {}) = L(D) = 20 × \log_2 C(12, 6) = 20 × \log_2 924 = 197.035

 The most specific hypothesis S has 20 rules, one for each document:

L(S) = 20 × (\log_2 C(12, 6) + 1) = 20 × (\log_2 924 + 1) = 217.035

 Both {} and S represent extreme cases that are undesirable in learning: overgeneralization and overspecialization
 Good hypotheses should provide a smaller MDL than {} and S, or, stated otherwise (the principle of information compression):

L(H) + L(D | H) < L(D), i.e. H = \arg\max_i \left( L(D) - L(H_i) - L(D | H_i) \right)


MDL-based Feature Evaluation

 An attribute splits the set of documents into subsets, each including the documents that share the same value of that attribute
 Consider this split as a clustering and evaluate its MDL score
 Rank the attributes by their MDL scores and select the attributes that provide the lowest scores
 MDL(H2) was actually the MDL score of attribute offers from our 6-attribute document sample


MDL-based Feature Evaluation (Example)

[Figure]


Classes to Clusters Evaluation

 Assume that the classification of the documents in a sample is known, i.e. each document has a class label
 Cluster the sample without using the class labels
 Assign to each cluster the class label of the majority of documents in it
 Compute the error as the proportion of documents whose class and cluster labels differ
 Or compute the accuracy as the proportion of documents whose class and cluster labels agree

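A minimal sketch of this procedure, assuming parallel lists of class labels and cluster ids (an illustrative layout):

```python
from collections import Counter

def classes_to_clusters_accuracy(classes, clusters):
    """Label each cluster with its majority class, then score agreement."""
    correct = 0
    for c in set(clusters):
        members = [cls for cls, cl in zip(classes, clusters) if cl == c]
        correct += Counter(members).most_common(1)[0][1]  # majority count
    return correct / len(classes)

classes  = ['A', 'A', 'B', 'B', 'A']
clusters = [ 0,   0,   1,   1,   1 ]
print(classes_to_clusters_accuracy(classes, clusters))    # 4/5 = 0.8
```

The error is simply 1 minus this accuracy.
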

Classes to Clusters Evaluation (Example)

[Figure]


Confusion matrix (contingency table)

TP (True Positive), FN (False Negative), FP (False Positive), TN (True Negative)

Error = (FP + FN) / (TP + FP + TN + FN)
Accuracy = (TP + TN) / (TP + FP + TN + FN)


Precision and Recall

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)


F-Measure

Generalized confusion matrix for m classes and k clusters, where n_{ij} is the number of documents of class i in cluster j.

Combining precision and recall:

P(i, j) = \frac{n_{ij}}{\sum_{i=1}^{m} n_{ij}}, \qquad R(i, j) = \frac{n_{ij}}{\sum_{j=1}^{k} n_{ij}}

F(i, j) = \frac{2 \, P(i, j) \, R(i, j)}{P(i, j) + R(i, j)}

Evaluating the whole clustering:

F = \sum_{i=1}^{m} \frac{n_i}{n} \max_{j=1,\ldots,k} F(i, j)

where n_i = \sum_{j=1}^{k} n_{ij} and n = \sum_{i=1}^{m} \sum_{j=1}^{k} n_{ij} is the total number of documents.

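A sketch computing the overall F from an m × k contingency matrix, with n[i][j] the number of class-i documents in cluster j (numpy assumed; the toy matrix is illustrative):

```python
import numpy as np

def clustering_f_measure(n):
    """F = sum_i (n_i / n) * max_j F(i, j) over a class-by-cluster matrix."""
    n = np.asarray(n, dtype=float)
    total = n.sum()
    P = n / n.sum(axis=0, keepdims=True)        # precision P(i, j)
    R = n / n.sum(axis=1, keepdims=True)        # recall R(i, j)
    with np.errstate(divide='ignore', invalid='ignore'):
        F = np.where(P + R > 0, 2 * P * R / (P + R), 0.0)
    class_sizes = n.sum(axis=1)                 # n_i
    return float(np.sum(class_sizes / total * F.max(axis=1)))

# Toy contingency matrix: 2 classes x 2 clusters
print(clustering_f_measure([[8, 3], [3, 6]]))
```
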

F-Measure (Example)

[Figure]


Entropy

 Consider the class label as a random event and evaluate its probability distribution in each cluster
 The probability of class i in cluster j is estimated by the proportion of occurrences of class label i in cluster j:

p_{ij} = \frac{n_{ij}}{\sum_{i=1}^{m} n_{ij}}

 The entropy is a measure of “impurity” and accounts for the average information in an arbitrary message about the class label:

H_j = -\sum_{i=1}^{m} p_{ij} \log p_{ij}


Entropy (cont’d)

 To evaluate the whole clustering, we sum the entropies of the individual clusters, weighted by the proportion of documents in each:

H = \sum_{j=1}^{k} \frac{n_j}{n} H_j

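A sketch under the same class-by-cluster contingency layout as the F-measure example; base-2 logarithms are used, so a 50-50 two-class cluster scores 1:

```python
import numpy as np

def clustering_entropy(n):
    """H = sum_j (n_j / n) * H_j over a class-by-cluster matrix `n`."""
    n = np.asarray(n, dtype=float)
    total = n.sum()
    cluster_sizes = n.sum(axis=0)               # n_j
    p = n / cluster_sizes                       # p_ij within each cluster
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = np.where(p > 0, -p * np.log2(p), 0.0)
    H_j = terms.sum(axis=0)                     # per-cluster entropy
    return float(np.sum(cluster_sizes / total * H_j))
```
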

Entropy (Examples)

 A “pure” cluster, where all documents have a single class label, has entropy 0
 The highest entropy is achieved when all class labels have the same probability
 For example, for a two-class problem, the 50-50 situation has the highest entropy: -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1


Entropy (Examples) (cont’d)

 Compare the entropies of the previously discussed clusterings for attributes offers and students:

H(offers) = \frac{11}{20} \left( -\frac{8}{11} \log_2 \frac{8}{11} - \frac{3}{11} \log_2 \frac{3}{11} \right) + \frac{9}{20} \left( -\frac{3}{9} \log_2 \frac{3}{9} - \frac{6}{9} \log_2 \frac{6}{9} \right) = 0.878176

H(students) = \frac{15}{20} \left( -\frac{10}{15} \log_2 \frac{10}{15} - \frac{5}{15} \log_2 \frac{5}{15} \right) + \frac{5}{20} \left( -\frac{1}{5} \log_2 \frac{1}{5} - \frac{4}{5} \log_2 \frac{4}{5} \right) = 0.869204

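Plugging the contingency counts of these two clusterings into the sketch above reproduces both numbers:

```python
# offers:   cluster offers=0 -> 8 A, 3 B;  cluster offers=1 -> 3 A, 6 B
print(clustering_entropy([[8, 3], [3, 6]]))     # 0.878176
# students: cluster students=0 -> 10 A, 5 B; cluster students=1 -> 1 A, 4 B
print(clustering_entropy([[10, 1], [5, 4]]))    # 0.869204
```
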

Summary: Evaluating Clustering

 Similarity-Based Criterion Functions
 Probabilistic Criterion Functions
 MDL-Based Model and Feature Evaluation
 Classes to Clusters Evaluation
 Entropy
