Introduction to Clustering Methods


Outline Introduction to Classification Introduction to Clustering Density Based Clustering Method Clustering Methods for Large Datasets Conclusions

Introduction to Clustering Methods Dr. Bidyut Kr. Patra Assistant Professor, National Institute of Technology Rourkela Rourkela, Orissa

October 15, 2012


Outline
Introduction to Classification
Introduction to Clustering
    Similarity Measures
    Category of clustering methods
Density Based Clustering Method
Clustering Methods for Large Datasets
    Major Approaches to Clustering Large Datasets
    Data Summarization
        BIRCH
    Hybrid Clustering Method
Conclusions


Classification

Definition
The task of classification is to assign an object to one of a set of predefined categories.

Working Principle
- A set of objects with their categories is provided (the training set).
- A suitable model is found that captures the relationship between the objects and their categories.
- The model assigns a class label (category) to an unknown object.


Examples of Classification

- Predicting tumor cells as benign or malignant.
- Classifying e-mails as spam or genuine.
- Classifying credit card transactions as legitimate or fraudulent.
- Categorizing news stories as finance, weather, entertainment, sports, etc.


Drawback

- A proper training set is often not available.
- Correctly labeled data is difficult to obtain.


Similarity Measures Category of clustering methods

Introduction to Clustering

Cluster Analysis
Cluster analysis is to discover the natural grouping(s) of a set of patterns, points, or objects [a].

Operational definition of clustering: Cluster analysis is to find group(s) of patterns, called cluster(s), in a dataset in such a way that patterns in a cluster are more similar to each other than to patterns in distinct clusters.

[a] A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.


Set Theoretic Definition

Definition
Let $D = \{x_1, x_2, \ldots, x_n\}$ be a set of patterns called a dataset, where each $x_i$ is a pattern of $N$ dimensions. A clustering $\pi$ of $D$ can be defined as $\pi = \{C_1, C_2, \ldots, C_k\}$ such that

$\bigcup_{i=1}^{k} C_i = D$
$C_i \neq \emptyset$, $i = 1..k$
$C_i \cap C_j = \emptyset$, $i \neq j$, $i, j = 1..k$
$sim(x_1, x_2) > sim(x_1, y_1)$, $x_1, x_2 \in C_i$ and $y_1 \in C_j$, $i \neq j$


Application Domains:
1. Biology: Clustering has been applied to genomic data to group functionally similar genes.
2. Information Retrieval: Search results obtained by search engines such as Google and Yahoo can be clustered so that related documents appear together in a cluster.
3. Market Research: Entities (people, markets, organizations) can be clustered based on common features or characteristics.
4. Geological mapping, bioinformatics, climate, web mining.


Similarity
Similarity between a pair of patterns $x$ and $y$ in $D$ is a mapping $sim(x, y) : D \times D \to [0, 1]$. The closer the value of $sim(\cdot)$ is to 1, the higher the similarity; it is 1 when $x = y$.

Simple Matching Coefficient (SMC): Let $x$ and $y$ be two $N$-dimensional binary vectors, i.e., $x, y \in \{0, 1\}^N$. Then

$SMC(x, y) = \frac{\sum_{i=1}^{N} t_i}{N}$, where $t_i = 1$ if $x_i = y_i$ and $t_i = 0$ otherwise,

and $x_i$ and $y_i$ are the $i$-th feature values of $x$ and $y$, respectively.
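The SMC definition can be illustrated with a few lines of Python. This is only a minimal sketch: the function name `smc` and the sample vectors are illustrative assumptions, not part of the slides.

```python
def smc(x, y):
    """Simple Matching Coefficient of two equal-length binary vectors."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    # t_i = 1 when the i-th features agree, 0 otherwise
    matches = sum(1 for xi, yi in zip(x, y) if xi == yi)
    return matches / len(x)


# Hypothetical example vectors: they agree in positions 1 and 3 only.
a = [1, 0, 1, 1]
b = [1, 1, 1, 0]
smc_ab = smc(a, b)  # 2 matching positions out of 4
```

For identical vectors the function returns 1, in line with the requirement that the similarity is 1 when x = y.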


Similarity (contd.)

Cosine Similarity: Let $x$ and $y$ be two document vectors. The similarity between $x$ and $y$ is expressed as the cosine of the angle between them:

$cosine(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$

where $\cdot$ is the dot product and $\|\cdot\|$ is the $L_2$-norm. Let $x = (3, 0, 2, 0, 0, 1)$ and $y = (2, 1, 3, 0, 1, 0)$ be two document vectors.
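The slide gives the two document vectors but not the resulting value. The sketch below works it out; the function name `cosine_similarity` is an illustrative assumption. For these vectors the dot product is 12 and the norms are sqrt(14) and sqrt(15), so the similarity is 12 / sqrt(210), roughly 0.828.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


x = (3, 0, 2, 0, 0, 1)
y = (2, 1, 3, 0, 1, 0)
similarity = cosine_similarity(x, y)  # 12 / (sqrt(14) * sqrt(15))
```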


Dissimilarity
Many clustering methods use dissimilarity measures instead of similarity measures to find clusters. The Euclidean distance between a pair of $N$-dimensional points (patterns) is

$d(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$

The generalization of the Euclidean distance is known as the Minkowski distance:

$L_p(x, y) = \left( \sum_{i=1}^{N} |x_i - y_i|^p \right)^{1/p}$
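The relation between the two distances can be checked with a short sketch. The function name `minkowski` and the sample points are illustrative assumptions; setting p = 2 gives the Euclidean distance and p = 1 gives the city-block distance.

```python
def minkowski(x, y, p):
    """Minkowski (L_p) distance between two equal-length numeric vectors, p >= 1."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical points forming a 3-4-5 right triangle with the origin.
origin = (0, 0)
point = (3, 4)
euclidean = minkowski(origin, point, 2)   # sqrt(9 + 16)
city_block = minkowski(origin, point, 1)  # 3 + 4
```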


Metric Space

Definition
$M = (D, d)$ is said to be a metric space if $d$ is a metric on $D$, i.e., $d : D \times D \to \mathbb{R}_{\geq 0}$ satisfies the following conditions for any three patterns $x, y, z \in D$:

Non-negativity: $d(x, y) \geq 0$
Reflexivity: $d(x, y) = 0$ if and only if $x = y$
Symmetry: $d(x, y) = d(y, x)$
Triangle inequality: $d(x, y) + d(y, z) \geq d(x, z)$


Clustering Methods

Partitional Clustering: Partitional clustering creates a single clustering of a given dataset. Let $C_1$ and $C_2$ be two clusters in a clustering. Then
(i) $C_1 \nsubseteq C_2$ or $C_1 \nsupseteq C_2$,
(ii) $C_1 \cap C_2 = \emptyset$.

Fuzzy and rough clustering approaches violate constraint (ii), i.e., they allow $C_1 \cap C_2 \neq \emptyset$.

Hierarchical Clustering: A hierarchical clustering method creates a sequence of partitional clusterings of a dataset.


Figure: Partitional Clustering (panels: DATASET, Clustering)


Figure: Hierarchical Clustering (panel: DATASET)


Partitional Clustering Methods

Basic Sequential Algorithmic Scheme (BSAS): BSAS needs two user-specified parameters: a distance threshold ($\tau$) and the number of clusters ($k$).

Initialize $i = 1$, $C_i = \{x_1\}$. For each $x \in D \setminus \{x_1\}$:
1. Find the nearest existing cluster $C_{min}$ such that $d(x, C_{min}) = \min_{j=1..i} d(x, C_j)$.
2. If $d(x, C_{min}) > \tau$ and $i < k$, then $i = i + 1$ and $C_i = \{x\}$; else $C_{min} = C_{min} \cup \{x\}$.
3. Repeat Step 1 and Step 2 until all patterns are assigned to clusters.
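The BSAS steps above can be sketched in Python. The slides do not say how the distance d(x, C) between a pattern and a cluster is measured; this sketch assumes the Euclidean distance to the cluster mean, which is one common choice. The names `bsas` and `centroid` and the sample data are illustrative.

```python
import math


def centroid(cluster):
    """Mean point of a non-empty list of equal-length tuples."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))


def bsas(data, tau, k):
    """Single-scan BSAS; d(x, C) is ASSUMED to be the distance to the mean of C."""
    clusters = [[data[0]]]                      # C_1 = {x_1}
    for x in data[1:]:
        dists = [math.dist(x, centroid(c)) for c in clusters]
        nearest = min(range(len(clusters)), key=dists.__getitem__)
        if dists[nearest] > tau and len(clusters) < k:
            clusters.append([x])                # too far from every cluster: open a new one
        else:
            clusters[nearest].append(x)         # otherwise join the nearest cluster
    return clusters


# Hypothetical 2-D data with two well separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (0, 0.5)]
result = bsas(points, tau=3, k=2)
```

Because each pattern is examined once and compared with at most k clusters, the sketch reflects the single-scan character of the scheme. The result depends on the order in which the patterns arrive.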


Advantages of BSAS:
1. It is a single dataset scan method.
2. Time complexity is $O(kn)$.

Disadvantages of BSAS:
1. The number of clusters has to be provided.
2. Patterns may not be assigned to their overall closest clusters.


Leaders Clustering Method

INPUT: $D$, $\tau$
1. $L \leftarrow \{x_1\}$
2. For each pattern $x \in D \setminus \{x_1\}$: if there is a leader $l \in L$ such that $\|l - x\| \leq \tau$, then $x$ is assigned to the cluster represented by $l$. If there is no such leader, then $x$ becomes a leader and is added to $L$.
3. Output the set of leaders $L$.

Advantages:
- Single scan method.
- Time complexity $O(mn)$, where $m = |L|$.
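A minimal Python sketch of the leaders method follows. The slides only require that some leader lies within τ of the pattern; this sketch assumes the first such leader in insertion order is used. The name `leaders` and the sample data are illustrative.

```python
import math


def leaders(data, tau):
    """Single-scan leaders clustering.

    Returns the list of leaders and a dict mapping each leader's index
    to the patterns it represents (the leader itself included).
    ASSUMPTION: a pattern follows the first leader found within tau.
    """
    leader_list = [data[0]]
    followers = {0: [data[0]]}
    for x in data[1:]:
        for idx, leader in enumerate(leader_list):
            if math.dist(leader, x) <= tau:
                followers[idx].append(x)
                break
        else:
            # No leader is close enough, so x becomes a new leader.
            leader_list.append(x)
            followers[len(leader_list) - 1] = [x]
    return leader_list, followers


# Hypothetical 2-D data.
points = [(0, 0), (1, 0), (5, 5), (5, 6), (0.5, 0)]
found_leaders, groups = leaders(points, tau=2)
```

Each pattern is compared with at most |L| leaders, which matches the O(mn) bound stated above.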


Figure: Leaders find semi-spherical clusters.


The k-means Clustering Method

The k-means method minimizes the Sum of Squared Error (SSE), defined as follows. Let $C_1, \ldots, C_k$ be $k$ clusters of the dataset. Then

$SSE = \sum_{j=1}^{k} \sum_{i=1}^{|C_j|} \|x_i - \bar{x}_j\|_2^2$, where $x_i \in C_j$ and $\bar{x}_j = \frac{1}{|C_j|} \sum_{i=1}^{|C_j|} x_i$.


k-means Clustering

Select $k$ points as initial centroids.
repeat
  1. Form $k$ clusters by assigning each point to its closest centroid.
  2. Recompute the centroid of each cluster.
until the centroids do not change.


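Example
$D = \{2, 4, 10, 12, 3, 20, 30, 11, 25\}$ and $k = 2$, with initial centroids 2 and 4, using $L_2$ distance. The point 3 is equidistant from both initial centroids and is placed in $C_1$.

Iteration 1: $C_1 = \{2, 3\}$, $C_2 = \{4, 10, 12, 20, 30, 11, 25\}$; $m_{C_1} = 2.5$, $m_{C_2} = 16$
Iteration 2: $C_1 = \{2, 3, 4\}$, $C_2 = \{10, 12, 20, 30, 11, 25\}$; $m_{C_1} = 3.0$, $m_{C_2} = 18$
Iteration 3: $C_1 = \{2, 3, 4, 10\}$, $C_2 = \{12, 20, 30, 11, 25\}$; $m_{C_1} = 4.75$, $m_{C_2} = 19.6$
Iteration 4: $C_1 = \{2, 3, 4, 10, 11, 12\}$, $C_2 = \{20, 30, 25\}$; $m_{C_1} = 7$, $m_{C_2} = 25$
Iteration 5: $C_1 = \{2, 3, 4, 10, 11, 12\}$, $C_2 = \{20, 30, 25\}$; the centroids no longer change, so the method stops.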
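The trace above can be reproduced with a small one-dimensional k-means sketch. The function name `kmeans_1d` is illustrative, and the tie-breaking rule (a point equidistant from two centroids goes to the lower-numbered cluster) is an assumption chosen so that the point 3 lands in the first cluster, as in the example.

```python
def kmeans_1d(data, initial_centroids):
    """Lloyd-style k-means on scalar data.

    Returns the final clusters, the final centroids, and the clustering
    produced in every iteration. Ties go to the lower-numbered centroid.
    """
    centroids = list(initial_centroids)
    history = []
    while True:
        clusters = [[] for _ in centroids]
        for x in data:
            # min() returns the first index among equally close centroids
            nearest = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        history.append(clusters)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            return clusters, centroids, history
        centroids = updated


D = [2, 4, 10, 12, 3, 20, 30, 11, 25]
final_clusters, final_centroids, trace = kmeans_1d(D, [2, 4])
```

The `trace` list holds one clustering per iteration, so it can be compared line by line with the iterations listed in the example.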


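Time and Space Complexity
Time = $O(I \cdot k \cdot n)$, Space = $O(k + n)$, where $I$ is the number of iterations, $k$ is the number of clusters, and $n$ is the number of patterns.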


Drawbacks
1. It cannot detect outliers.
2. It can find only convex-shaped clusters.
3. It is applicable only to numeric datasets.
4. With different initial points, it produces different clustering results.


Figure: Result produced by k-means clustering method (panels: DATASET, CLUSTERING by k-means)


Bisecting k-means Algorithm
1. Initialize the list of clusters $L = \{C_1\}$, where $C_1 = D$.
2. repeat
   1. Remove a cluster from the list of clusters.
   2. Bisect the selected cluster using the k-means clustering method a number of times.
   3. Select the two clusters from the bisection with the lowest total SSE.
   4. Add these two clusters to the list of clusters $L$.
3. until the number of clusters in the list $L$ is $k$.
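The loop above can be sketched for scalar data as follows. Several details are assumptions rather than part of the slides: the cluster removed in each round is the one with the largest SSE, each bisection starts from two randomly chosen distinct values, and the names `sse`, `two_means`, `bisecting_kmeans`, `trials`, and `seed` are illustrative.

```python
import random


def sse(cluster):
    """Sum of squared errors of a non-empty list of numbers about its mean."""
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster)


def two_means(cluster, rng):
    """One run of 2-means on scalars, started from two distinct random values."""
    c = rng.sample(sorted(set(cluster)), 2)
    while True:
        first = [x for x in cluster if abs(x - c[0]) <= abs(x - c[1])]
        second = [x for x in cluster if abs(x - c[0]) > abs(x - c[1])]
        updated = [sum(first) / len(first), sum(second) / len(second)]
        if updated == c:
            return first, second
        c = updated


def bisecting_kmeans(data, k, trials=5, seed=0):
    rng = random.Random(seed)
    clusters = [list(data)]
    while len(clusters) < k:
        # ASSUMPTION: split the cluster with the largest SSE.
        target = max(clusters, key=sse)
        if len(set(target)) < 2:
            raise ValueError("no cluster can be split further")
        clusters.remove(target)
        # Keep the bisection with the lowest total SSE over several trials.
        best = min((two_means(target, rng) for _ in range(trials)),
                   key=lambda pair: sse(pair[0]) + sse(pair[1]))
        clusters.extend(best)
    return clusters


D = [2, 4, 10, 12, 3, 20, 30, 11, 25]
three_clusters = bisecting_kmeans(D, k=3)
```

The exact clusters depend on the random starts, so only structural properties (number of clusters, coverage of the dataset) are guaranteed by this sketch.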


Hierarchical Clustering
Hierarchical clustering methods create a sequence of clusterings $\pi_1, \pi_2, \pi_3, \ldots, \pi_i, \ldots, \pi_p$ of a given dataset $D$. They can reveal the inherent nested (hierarchical) structure of clusters in the data. A hierarchical clustering is obtained in one of two ways:
1. Divisive (top-down) approach: Start with one cluster containing all points. At each step, split a cluster, until each cluster contains a single point.
2. Agglomerative (bottom-up) approach: Start with the points as individual clusters. At each step, merge the closest pair of clusters, until only one cluster remains.


$\pi_5 = \{\{p_1, p_5, p_2, p_3, p_4\}\}$
$\pi_4 = \{\{p_1, p_5\}, \{p_2, p_3, p_4\}\}$
$\pi_3 = \{\{p_1\}, \{p_2, p_3, p_4\}, \{p_5\}\}$
$\pi_2 = \{\{p_1\}, \{p_2, p_3\}, \{p_4\}, \{p_5\}\}$
$\pi_1 = \{\{p_1\}, \{p_2\}, \{p_3\}, \{p_4\}, \{p_5\}\}$

Figure: Dendrogram produced by a hierarchical clustering method (leaves: $p_1, p_2, p_3, p_4, p_5$)


Agglomerative (Bottom-up) approach

Compute the distance matrix for the dataset.
Let each data point be a cluster.
repeat
  1. Merge the two closest clusters.
  2. Update the distance matrix.
until only a single cluster remains.


Popular Agglomerative Methods
Single-link: $Dis(C_1, C_2) = \min\{\|x_i - x_j\| \mid x_i \in C_1, x_j \in C_2\}$
Complete-link: $Dis(C_1, C_2) = \max\{\|x_i - x_j\| \mid x_i \in C_1, x_j \in C_2\}$
Average-link: $Dis(C_1, C_2) = \frac{1}{|C_1| \times |C_2|} \sum_{x_i \in C_1} \sum_{x_j \in C_2} \|x_i - x_j\|$

Figure: Distance between a pair of clusters in three hierarchical clustering methods: (a) single-link, (b) complete-link, (c) average-link.
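The three inter-cluster distances, and the bottom-up merging they drive, can be sketched as follows. For brevity this sketch recomputes the linkage from the raw points in every round instead of maintaining a distance matrix, so it is a naive illustration and not an efficient implementation. All function names and the sample points are illustrative assumptions.

```python
import math
from itertools import combinations


def single_link(a, b):
    return min(math.dist(x, y) for x in a for y in b)


def complete_link(a, b):
    return max(math.dist(x, y) for x in a for y in b)


def average_link(a, b):
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def agglomerative(points, linkage):
    """Return the sequence of clusterings, from singletons down to one cluster."""
    clusters = [[p] for p in points]
    sequence = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Pick the closest pair of clusters under the chosen linkage.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        sequence.append([list(c) for c in clusters])
    return sequence


# Hypothetical 1-D points written as 1-tuples.
pts = [(0,), (1,), (5,), (6,), (20,)]
levels = agglomerative(pts, single_link)
```

Passing `complete_link` or `average_link` instead of `single_link` changes only the way the closest pair is chosen.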


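Complexity

Space complexity: $O(n^2)$, since the full distance matrix is stored.
Time complexity: $O(n^2)$ at best, since every entry of the distance matrix has to be computed; a straightforward implementation that searches the matrix for the closest pair in every merge step takes $O(n^3)$.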


Single-link Clustering Method

Figure: Dendrogram produced by the single-link method


Distance Update
Let $C = C_x \cup C_y$ be a new cluster formed by merging clusters $C_x$ and $C_y$, and let $C_o$ be another cluster. For the single-link method,

$d(C_o, (C_x, C_y)) = \min\{d(C_o, C_x), d(C_o, C_y)\}$

Lance and Williams (1967) generalize the distance update:

$d(C_o, (C_x, C_y)) = \alpha_i \times d(C_o, C_x) + \alpha_j \times d(C_o, C_y) + \beta \times d(C_x, C_y) + \gamma \times |d(C_o, C_x) - d(C_o, C_y)|$

where $d(\cdot, \cdot)$ is a distance function and the values of $\alpha_i, \alpha_j, \beta, \gamma \in \mathbb{R}$ depend on the method used.


Table: Lance-Williams parameters for different hierarchical methods

Method          alpha_i                 alpha_j                 beta                              gamma
Single-link     1/2                     1/2                     0                                 -1/2
Complete-link   1/2                     1/2                     0                                 1/2
Average-link    m_Cx / (m_Cx + m_Cy)    m_Cy / (m_Cx + m_Cy)    0                                 0
Centroid        m_Cx / (m_Cx + m_Cy)    m_Cy / (m_Cx + m_Cy)    -m_Cx m_Cy / (m_Cx + m_Cy)^2      0

Here m_Cx and m_Cy are the numbers of patterns in clusters $C_x$ and $C_y$.

Complete-link:

$d(C_o, (C_x, C_y)) = \frac{1}{2} \times d(C_o, C_x) + \frac{1}{2} \times d(C_o, C_y) + 0 \times d(C_x, C_y) + \frac{1}{2} \times |d(C_o, C_x) - d(C_o, C_y)| = \max\{d(C_o, C_x), d(C_o, C_y)\}$
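The complete-link identity above, and the matching one for single-link, can be checked numerically with a short sketch. The function name `lance_williams` and the sample distances are illustrative assumptions.

```python
def lance_williams(d_ox, d_oy, d_xy, alpha_i, alpha_j, beta, gamma):
    """Distance from cluster C_o to the merged cluster C_x ∪ C_y."""
    return (alpha_i * d_ox + alpha_j * d_oy
            + beta * d_xy + gamma * abs(d_ox - d_oy))


# Hypothetical distances: d(C_o, C_x) = 3, d(C_o, C_y) = 7, d(C_x, C_y) = 5.
d_ox, d_oy, d_xy = 3.0, 7.0, 5.0

# Single-link parameters reproduce the minimum of the two distances.
single = lance_williams(d_ox, d_oy, d_xy, 0.5, 0.5, 0.0, -0.5)

# Complete-link parameters reproduce the maximum of the two distances.
complete = lance_williams(d_ox, d_oy, d_xy, 0.5, 0.5, 0.0, 0.5)
```

The identities hold because (a + b)/2 - |a - b|/2 = min(a, b) and (a + b)/2 + |a - b|/2 = max(a, b).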


Density Based Clustering Method
A density based clustering method views clusters as dense regions in the feature space that are separated by relatively less dense regions. The density based approach optimizes a local criterion, which is based on the density distribution of the dataset. DBSCAN (Density Based Spatial Clustering of Applications with Noise) [Ester et al. (1996)] is a very popular density based partitional clustering method.


DBSCAN (M. Ester et al., 1996)
DBSCAN is a density based clustering method that can find arbitrarily shaped clusters. It classifies each point into one of three categories:
1. Core Point: If the hyper-sphere of radius $\epsilon$ centered at point $x$ contains at least Min data points, then $x$ is called a core point.
2. Border Point: The point $x$ is called a border point if the hyper-sphere contains fewer than Min points but there is a nearby core point $CP$ ($\|x - CP\| < \epsilon$).
3. Noisy Point: The hyper-sphere contains fewer than Min points and there is no nearby core point.

DBSCAN starts with a core point and expands the cluster recursively by merging nearby core points.
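The three point categories can be illustrated with the sketch below, which covers only the labelling step and not the cluster expansion. Two conventions are assumptions of this sketch: a point counts as a member of its own neighborhood, and a neighbor is any point at distance at most ε (the boundary included). The name `classify_points` and the sample data are illustrative.

```python
import math


def classify_points(data, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'.

    ASSUMPTIONS: the eps-neighborhood includes the point itself and uses
    distance <= eps; a core point has at least min_pts neighbors.
    """
    def neighbors(i):
        return [j for j in range(len(data))
                if math.dist(data[i], data[j]) <= eps]

    core = {i for i in range(len(data)) if len(neighbors(i)) >= min_pts}
    labels = []
    for i in range(len(data)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors(i)):
            labels.append("border")   # sparse itself, but next to a core point
        else:
            labels.append("noise")
    return labels


# Hypothetical 2-D data: a small dense square, one fringe point, one outlier.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify_points(pts, eps=1.0, min_pts=3)
```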


Drawbacks

Figure: Dataset (both axes range from 0 to 100)
Figure: Clusters obtained by DBSCAN Method (legend: Cluster1, Cluster2, Cluster3, Noise)


DBSCAN fails to detect clusters of arbitrary shapes with highly variable density.


Neighborhood Based Clustering (NBC)
S. Zhou, Y. Zhao, J. Guan, and J. Z. Huang. A Neighborhood-Based Clustering Algorithm. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2005).

- NBC discovers clusters based on the neighborhood characteristics of data.
- NBC is effective in discovering clusters of arbitrary shape and different densities.
- NBC needs fewer input parameters than the DBSCAN clustering method.


NBC
The core concept of NBC is the Neighborhood Density Factor (NDF):

$NDF(x) = \frac{\text{Number of reverse K nearest neighbors of } x}{\text{Number of K nearest neighbors of } x}$

K nearest neighbors of $x$: $KNN(x) = \{q \in D \mid d(x, q) \leq d(x, q')\}$, where $q'$ is the $K$-th nearest neighbor of $x$, such that
1. $|KNN(x)| > K$ if $q'$ is not unique, or $|KNN(x)| = K$ otherwise.
2. $|KNN(x) - N(q')| < K - 1$.

Reverse K nearest neighbors of $x$ (R-KNN): the set of objects whose KNN contains $x$, i.e., $R\text{-}KNN(x) = \{p \in D \mid x \in KNN(p)\}$.


NBC classifies a point $x$ into one of three categories:
1. Core Point: if $NDF(x) > 1$
2. Even Point: if $NDF(x) = 1$
3. Noisy Point: if $NDF(x) < 1$
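The NDF and the three categories can be computed with the sketch below. It assumes that a point is not its own neighbor and that every point tied with the K-th nearest neighbor is included in the KNN set, so the set can hold more than K points. The names `knn`, `ndf`, and `categorize` and the sample data are illustrative.

```python
import math


def knn(data, i, k):
    """Indices of the K nearest neighbors of point i, ties at the K-th distance included."""
    if not 1 <= k <= len(data) - 1:
        raise ValueError("k must be between 1 and len(data) - 1")
    dists = sorted((math.dist(data[i], data[j]), j)
                   for j in range(len(data)) if j != i)
    kth_distance = dists[k - 1][0]
    return {j for d, j in dists if d <= kth_distance}


def ndf(data, k):
    """Neighborhood Density Factor |R-KNN(x)| / |KNN(x)| for every point."""
    neighborhoods = [knn(data, i, k) for i in range(len(data))]
    reverse_counts = [sum(1 for nb in neighborhoods if i in nb)
                      for i in range(len(data))]
    return [r / len(nb) for r, nb in zip(reverse_counts, neighborhoods)]


def categorize(value):
    if value > 1:
        return "core"
    if value == 1:
        return "even"
    return "noise"


# Hypothetical 1-D data written as 1-tuples; the last point is an outlier.
pts = [(0,), (1,), (2,), (10,)]
factors = ndf(pts, k=1)
categories = [categorize(v) for v in factors]
```

In this small example no point has the outlier among its nearest neighbors, so its NDF is 0 and it is labelled as noise.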


Definition (Neighborhood-based directly density reachable)
Let $x, y$ be two points in $D$ and $NN_K(x)$ be the KNN of $x$. The point $y$ is directly reachable from $x$ if $x$ is a core point (CP) or an even point (EP) and $y \in NN_K(x)$.

Definition (Neighborhood-based density reachable)
Let $x, y$ be two points in $D$ and $NN_K(x)$ be the KNN of $x$. The point $y$ is reachable from $x$ if there is a chain of patterns $p_1 = x, p_2, \ldots, p_n = y$ such that $p_{i+1}$ is directly reachable from $p_i$ for $i = 1, \ldots, n - 1$.


Definition (Neighborhood-based density connected)
Let $x, y$ be two points in $D$ and $NN_K(x)$ be the KNN of $x$. The points $x, y$ are neighborhood connected if any of the following conditions holds:
- $y$ is reachable from $x$, or vice versa.
- $x$ and $y$ are both reachable from some other pattern in the dataset.

Definition (Neighborhood-based Cluster)
Let $D$ be a dataset. $C \subseteq D$ is a cluster such that
- if $p, q \in C$, then $p$ and $q$ are neighborhood connected;
- if $p \in C$ and $q \in D$ are neighborhood connected, then $q$ also belongs to cluster $C$.


Neighborhood based Clustering (NBC) method
1. Calculate the NDF for all patterns of the dataset.
2. Let $x$ be a pattern in the dataset.
   1. If $x$ is a cluster point (CP/EP), then explore its neighbors recursively and mark the points as “seen”. This forms a cluster. (At any stage, if a neighbor is not a CP/EP, it is included in the cluster but is not explored further.)
   2. If $x$ is not a cluster point, then mark $x$ as “seen” and temporarily mark it as a Noisy Point.
3. Repeat Step 2 until all points are marked as “seen”.
4. If a temporarily marked Noisy Point does not belong to any cluster, then the point is declared a final Noisy Point.


Figure: Clusters obtained by NBC Method (legend: Cluster1, Cluster2, Cluster3, Cluster4, Cluster5, Noise; both axes range from 0 to 100)

Data Summarization Hybrid Clustering Method

Major approaches to clustering large datasets
Many clustering approaches have been proposed to tackle large datasets. These approaches are mainly classified into the following groups¹:
1. Sampling Based Approach
2. Hybrid Clustering
3. Data Summarization
4. Nearest Neighbor Search

¹ A. K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.

Data Summarization

These approaches create a summary of a large dataset. The summary is then used to scale up an expensive clustering method.

BIRCH (Zhang et al., 1996)
BIRCH is designed for clustering large datasets. It can work with limited main memory. It is an incremental method.

Two-Phase Method
1. Create a summary of the given dataset in the form of a CF tree.
2. Apply conventional hierarchical clustering to the summary.

CF tree
CF-tree is a height-balanced tree with branching factor B.
Each internal node of the CF tree has at most B entries of the form $[CF_i, child_i]$, $i = 1, \ldots, B$, where $CF_i$ is the clustering feature (CF) of the sub-cluster represented by the i-th child.
A leaf node has at most $L_{max}$ entries of the form $[CF_i]$, $i = 1, \ldots, L_{max}$.
The diameter of each sub-cluster in a leaf node must be less than a threshold T.
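
The leaf constraints can be illustrated with a small sketch. It assumes each leaf entry is stored as a triplet (k, LS, ss) of count, linear sum and square sum, and that the diameter is the root-mean-square pairwise distance, which can be computed from that triplet alone. The helper names and the choice of the closest entry by centroid distance are assumptions for illustration. Node splitting and the internal levels of the tree are left out.

```python
from math import sqrt


def cf_of(point):
    """Clustering feature (k, LS, ss) of a single point."""
    return (1, tuple(point), sum(x * x for x in point))


def cf_add(a, b):
    """Clustering feature of the union of two disjoint sub-clusters."""
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])


def diameter(cf):
    """Root-mean-square pairwise distance, computed from the CF only."""
    n, ls, ss = cf
    if n < 2:
        return 0.0
    ls2 = sum(x * x for x in ls)
    return sqrt(max(2 * n * ss - 2 * ls2, 0.0) / (n * (n - 1)))


def insert_into_leaf(leaf, point, threshold, l_max):
    """Try to place a point in a leaf (a list of CF entries).

    Returns True if the point was absorbed or added as a new entry,
    and False if the leaf is full and would have to be split.
    """
    new = cf_of(point)
    if leaf:
        def centroid_gap(i):
            n, ls, _ = leaf[i]
            return sum((s / n - x) ** 2 for s, x in zip(ls, point))

        i = min(range(len(leaf)), key=centroid_gap)  # closest entry
        merged = cf_add(leaf[i], new)
        if diameter(merged) < threshold:             # stays below T
            leaf[i] = merged
            return True
    if len(leaf) < l_max:                            # room for a new entry
        leaf.append(new)
        return True
    return False


leaf = []
for p in [(0, 0), (2, 0), (10, 0), (20, 0)]:
    print(p, insert_into_leaf(leaf, p, threshold=3, l_max=2))
```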

Figure: CF-tree

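Clustering Features (CF)
A CF is a triplet which contains summarized information of a sub-cluster.
Let $C_1 = \{\vec{X}_1, \vec{X}_2, \ldots, \vec{X}_k\}$ be a sub-cluster.
The CF for $C_1$ is $CF = (k, \vec{LS}, ss)$, where $\vec{LS}$ is the linear sum of the patterns in $C_1$, i.e., $\vec{LS} = \sum_{i=1}^{k} \vec{X}_i$, and $ss$ is the square sum of the data points, i.e., $ss = \sum_{i=1}^{k} \vec{X}_i^{\,2}$.
CF values follow the additive property, i.e., $CF_3 = CF_1 + CF_2$, where $CF_3$ is the CF of the sub-cluster obtained by merging two disjoint sub-clusters with features $CF_1$ and $CF_2$.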
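
A minimal sketch of the CF triplet and its additive property is given below. The class name `CF` and its helper methods are assumptions made for illustration. The centroid and radius helpers are not on the slide; they are included only to show that such statistics can be recovered from the triplet without revisiting the data.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class CF:
    n: int      # number of patterns k in the sub-cluster
    ls: tuple   # linear sum LS, one component per dimension
    ss: float   # square sum: sum of squared norms of the patterns

    @classmethod
    def from_points(cls, points):
        dim = len(points[0])
        ls = tuple(sum(p[d] for p in points) for d in range(dim))
        ss = sum(x * x for p in points for x in p)
        return cls(len(points), ls, ss)

    def __add__(self, other):
        # Additive property: component-wise sum of the two triplets.
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(a / self.n for a in self.ls)

    def radius(self):
        # Root-mean-square distance of the patterns from the centroid.
        c2 = sum(c * c for c in self.centroid())
        return sqrt(max(self.ss / self.n - c2, 0.0))


a = [(1.0, 2.0), (3.0, 4.0)]
b = [(5.0, 6.0)]
merged = CF.from_points(a) + CF.from_points(b)
print(merged == CF.from_points(a + b))  # same triplet either way
print(merged.centroid(), merged.radius())
```

Because merging only adds three quantities, two sub-clusters can be combined in time proportional to the dimension, independent of how many patterns they summarize.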

Hybrid Method
A partitional clustering method is combined with a hierarchical clustering method.
The first hybrid method was proposed by Murty and Krishna in 1980. It combines k-means clustering with the single-link method.
Many clustering methods (Lin and Chen (2005), Liu et al. (2009), Chaoji et al. (2009)) have been developed along this line.

1. Divide the dataset into a number of sub-clusters by applying the k-means clustering method.
2. Compute the similarity/dissimilarity between each pair of sub-clusters.
3. Apply an agglomerative method to these sub-clusters to obtain the final clustering.
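
The three steps can be sketched as below. This is one possible instantiation, not a specific published method: the dissimilarity between two sub-clusters is assumed to be the Euclidean distance between their k-means centers, the agglomerative step is single-link, and the k-means initialization (evenly spaced points of the input order) is chosen only to keep the example deterministic. All function and parameter names are hypothetical.

```python
from math import dist


def kmeans(points, k, iters=50):
    """Plain k-means; returns the centers and the points assigned to each."""
    centers = [points[i * len(points) // k] for i in range(k)]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            groups[nearest].append(p)
        new_centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[c]
            for c, g in enumerate(groups)
        ]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, groups


def single_link(centers, n_clusters):
    """Merge the closest pair of groups of centers until n_clusters remain."""
    clusters = [[i] for i in range(len(centers))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(centers[i], centers[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    return clusters


def hybrid_cluster(points, n_sub, n_clusters):
    centers, groups = kmeans(points, n_sub)      # step 1: sub-clusters
    merged = single_link(centers, n_clusters)    # steps 2 and 3
    return [[p for i in group for p in groups[i]] for group in merged]


points = [(0, 0), (1, 0), (2, 0), (3, 0), (20, 0), (21, 0), (22, 0), (23, 0)]
for cluster in hybrid_cluster(points, n_sub=4, n_clusters=2):
    print(cluster)
```

The expensive pairwise step runs over n_sub centers instead of all the points, which is where the saving of the hybrid scheme comes from.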

Vijaya et al. (2006) used a hybrid clustering method in protein sequence classification:
1. Leaders clustering is applied to the whole dataset to obtain a set of leaders.
2. Single-link/complete-link is applied to the leaders to obtain k clusters.
3. The median of each cluster is selected as the representative of the cluster.
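
Steps 1 and 3 can be sketched as follows. The slide gives no details, so this assumes the usual single-scan leaders procedure with a distance threshold `tau` (a hypothetical parameter name), and it interprets the "median" of a cluster as the member with the smallest total distance to the other members. Euclidean distance on numeric tuples stands in for whatever sequence dissimilarity the original work uses. Step 2 is omitted; any single-link or complete-link routine can be run on the returned leaders.

```python
from math import dist


def leaders(points, tau):
    """Single scan: a point follows the first leader within tau,
    otherwise it becomes a new leader."""
    leader_list, followers = [], []
    for p in points:
        for idx, leader in enumerate(leader_list):
            if dist(p, leader) <= tau:
                followers[idx].append(p)
                break
        else:  # no leader was close enough
            leader_list.append(p)
            followers.append([p])
    return leader_list, followers


def median_pattern(cluster):
    """Member with the smallest total distance to all members (assumed meaning of 'median')."""
    return min(cluster, key=lambda c: sum(dist(c, q) for q in cluster))


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
found, groups = leaders(points, tau=2.5)
print(found)
print([median_pattern(g) for g in groups])
```

The result of the leaders scan depends on the order in which the points are presented, so a different ordering of the same data can give a different set of leaders.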

Bibliography

1. S. Theodoridis and K. Koutroumbas. Pattern Recognition, third ed. Academic Press, Inc., Orlando, 2006.
2. P. A. Vijaya, M. N. Murty, and D. K. Subramanian. Efficient bottom-up hybrid hierarchical clustering techniques for protein sequence classification. Pattern Recognition, 39(12):2344–2355, 2006.

Questions THANK YOU
