Clustering in Data Streams -- Yao Shen Ning (Martin) Xu

Content • Introduction • K-Median Problem • Small(er)-Space Algorithm • STREAM Algorithm • A Framework for Clustering Evolving Data Streams • References

Introduction • Data Stream − An ordered sequence of points that can be read only once or a small number of times

• Sources of Data Streams − Routing data − Telephone records − Web documents − Clickstreams − …

Introduction (cont’d) • Clustering − Partition data set into subsets such that members of the same cluster are similar and those of distinct clusters are dissimilar

• Motivation − Data too large to fit in memory, typically stored on secondary storage devices − Only linear scans are feasible; random access is not allowed − Possible to make only one or a very small number of passes

K-Median Problem • A common formulation of clustering • Given − A set of N points − A distance function − A number k of clusters

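• Choose k medians from the N points • Assign each point to its closest median • Goal: minimize the sum of distances from each point to its assigned median (summing squared distances instead gives the k-means variant of the objective)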
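The assignment step and the objective can be sketched in a few lines. This is only an illustration, assuming Euclidean distance and points stored as tuples; the function names and the `power` parameter are hypothetical (`power=1` sums plain distances, `power=2` sums squared distances).

```python
import math

def assign_to_medians(points, medians):
    """Return, for each point, the index of its closest median."""
    return [min(range(len(medians)), key=lambda j: math.dist(p, medians[j]))
            for p in points]

def clustering_cost(points, medians, power=1):
    """Sum over all points of (distance to the closest median) ** power."""
    return sum(min(math.dist(p, m) for m in medians) ** power for p in points)

# Toy data: two obvious groups, with one median chosen from each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
medians = [(0.0, 0.0), (10.0, 0.0)]
labels = assign_to_medians(points, medians)
cost = clustering_cost(points, medians)
```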

Small(er)-Space Algorithm • Divide-and-conquer strategy • Small-Space − (1) Divide S into l disjoint pieces. − (2) Find k centers in each piece; assign each point within a piece to its closest center in that piece. − (3) Weight each center by the number of points assigned to it. − (4) Cluster the weighted centers obtained in (3) to find k centers for the entire stream.
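Steps (1) to (4) can be sketched as follows. This is a toy illustration, not the implementation from the cited papers: `weighted_k_median` is a hypothetical stand-in that solves each subproblem by brute force, where a real system would plug in an approximation algorithm, and Euclidean distance is assumed.

```python
import math
from itertools import combinations

def weighted_k_median(points, weights, k):
    """Brute-force weighted k-median over subsets of the input (toy sizes only)."""
    k = min(k, len(points))
    def cost(centers):
        return sum(w * min(math.dist(p, c) for c in centers)
                   for p, w in zip(points, weights))
    return list(min(combinations(points, k), key=cost))

def small_space(stream, k, num_pieces):
    size = math.ceil(len(stream) / num_pieces)
    pieces = [stream[i:i + size] for i in range(0, len(stream), size)]  # step (1)
    centers, weights = [], []
    for piece in pieces:
        local = weighted_k_median(piece, [1] * len(piece), k)           # step (2)
        counts = [0] * len(local)
        for p in piece:
            nearest = min(range(len(local)), key=lambda j: math.dist(p, local[j]))
            counts[nearest] += 1                                        # step (3)
        centers.extend(local)
        weights.extend(counts)
    return weighted_k_median(centers, weights, k)                       # step (4)

stream = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
result = small_space(stream, k=2, num_pieces=2)
```

Only the weighted centers of each piece are carried forward, so the final clustering step sees at most l·k items instead of the whole input.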

Small(er)-Space Algorithm (cont’d) • Smaller-Space − (1) Divide S into l disjoint pieces. − (2) Find k centers in each piece; assign each point within a piece to its closest center in that piece. − (3) Weight each center by the number of points assigned to it. − (4) Recursively apply the Smaller-Space algorithm to the weighted centers obtained in (3).

Small(er)-Space Algorithm (cont’d) • Application in data stream model − Input m (a multiple of 2k) points at a time − Reduce the first m points to 2k medians − Maintain at most m level-i medians − On seeing m level-i medians, generate 2k level-(i+1) medians − After all data points of interest have been seen, cluster all intermediate medians into k final medians
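The level-by-level bookkeeping can be sketched as below. The class name, the brute-force reducer `reduce_to_medians`, and the requirement that m be strictly larger than 2k (so that every reduction actually shrinks a level) are assumptions made for this illustration.

```python
import math
from itertools import combinations

def reduce_to_medians(items, k):
    """Toy weighted k-median by brute force; items are (point, weight) pairs."""
    k = min(k, len(items))
    def cost(centers):
        return sum(w * min(math.dist(p, c) for c in centers) for p, w in items)
    best = min(combinations([p for p, _ in items], k), key=cost)
    weights = [0] * k
    for p, w in items:
        weights[min(range(k), key=lambda j: math.dist(p, best[j]))] += w
    return list(zip(best, weights))

class StreamingMedians:
    def __init__(self, k, m):
        if m % (2 * k) != 0 or m <= 2 * k:
            raise ValueError("m is assumed to be a multiple of 2k larger than 2k")
        self.k, self.m = k, m
        self.levels = [[]]  # levels[i] holds the weighted level-i medians

    def add(self, point):
        self.levels[0].append((point, 1))
        i = 0
        # Whenever a level holds m items, reduce it to 2k medians one level up.
        while len(self.levels[i]) >= self.m:
            promoted = reduce_to_medians(self.levels[i], 2 * self.k)
            self.levels[i] = []
            if i + 1 == len(self.levels):
                self.levels.append([])
            self.levels[i + 1].extend(promoted)
            i += 1

    def final_medians(self):
        """Cluster all intermediate medians into k final medians."""
        pending = [item for level in self.levels for item in level]
        return [p for p, _ in reduce_to_medians(pending, self.k)]

sm = StreamingMedians(k=1, m=4)
for v in [0.0, 1.0, 2.0, 3.0, 100.0, 101.0, 102.0, 103.0]:
    sm.add((v,))
final = sm.final_medians()
```

Each level never stores more than m items, so memory grows with the number of levels rather than with the length of the stream.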

STREAM Algorithm • [Figure: each chunk X1, X2, …, Xn of the stream is passed to LOCALSEARCH, which outputs k centers; a final LOCALSEARCH reduces the collected centers to k.]

STREAM Algorithm (cont’d) • Perform LOCALSEARCH to find k centers, based on a distance sum obtained by binary search within a range • LOCALSEARCH subroutine − Based on the CG algorithm (Charikar & Guha) − Compute a set of centers and the sum of distances from each point to its center − Randomly select a non-center point; if making it a new center reduces the distance sum, perform the reassignment and discard centers with no members
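The random-candidate step can be illustrated with a simplified gain computation. This is a sketch, not the CG algorithm itself: it assumes a fixed per-center cost `z` (a facility-location style penalty, without which opening a center would never hurt), Euclidean distance, and hypothetical function names. A center is dropped when moving its remaining members to the new center costs less than `z`, which leaves it with no members.

```python
import math
import random

def total_cost(points, assign, z):
    """z per open center plus the sum of point-to-center distances."""
    return z * len(set(assign)) + sum(
        math.dist(p, points[c]) for p, c in zip(points, assign))

def try_open(points, assign, x, z):
    """Make point x a center if that lowers the total cost; return the assignment."""
    if x in set(assign):
        return assign
    d_cur = [math.dist(p, points[c]) for p, c in zip(points, assign)]
    d_new = [math.dist(p, points[x]) for p in points]
    switch = [dn < dc for dn, dc in zip(d_new, d_cur)]  # points that prefer x
    gain = -z + sum(dc - dn for dc, dn, s in zip(d_cur, d_new, switch) if s)
    closed = set()
    for c in set(assign):
        # Extra distance paid if the remaining members of c also move to x.
        extra = sum(d_new[i] - d_cur[i]
                    for i, a in enumerate(assign) if a == c and not switch[i])
        if z - extra > 0:
            gain += z - extra
            closed.add(c)
    if gain <= 0:
        return assign
    return [x if switch[i] or a in closed else a for i, a in enumerate(assign)]

def local_search(points, z, trials=100, seed=0):
    rng = random.Random(seed)
    assign = [0] * len(points)  # start with the first point as the only center
    for _ in range(trials):
        assign = try_open(points, assign, rng.randrange(len(points)), z)
    return assign

points = [(0.0,), (1.0,), (2.0,), (50.0,), (51.0,)]
step1 = try_open(points, [0] * 5, 3, 10.0)  # opening the point at 50 pays off
step2 = try_open(points, step1, 1, 10.0)    # opening the point at 1 closes center 0
```

In this setting the number of centers is controlled indirectly through `z`: a larger `z` yields fewer centers, which is one way a search over a range of values could be used to arrive at exactly k.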

Framework - intro • Traditional clustering algorithms are not directly applicable to data streams because − Stream data may have infinite length − Stream data may evolve over time

• Problems with existing stream clustering algorithms

− They compute clusters over the entire data stream − But the data stream may evolve with time − So the nature of the clusters may vary

Framework intro (cont’d) • Separate the clustering process into − An online micro-clustering component − An offline macro-clustering component

• Online component

− Requires an efficient process for storing summary statistics

• Offline component

− Uses the summary statistics to provide clusters as per user requirements − Very efficient, since it uses only the summary statistics and not the raw stream

• Users can explore the nature of the evolution of the clusters

The CluStream Framework • Micro-clusters:

− Maintain statistical information about data locality − Their additive property makes them a natural choice for data streams

• Pyramidal time frame:

− Micro-clusters are stored at snapshots in time which follow a pyramidal pattern. − It is an effective trade-off between the storage requirements and the ability to recall summary statistics from different time horizons

The Framework (cont’d) • Assume that the data stream consists of a set of multi-dimensional records − $X_1, \dots, X_k, \dots$ arriving at time stamps $T_1, \dots, T_k, \dots$ − $X_i = (x_{i1}, \dots, x_{id})$, a d-dimensional point

• Definition:

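− A micro-cluster is a (2d+3)-tuple for a set of n points, each of which has d dimensions − (CF2X, CF1X, CF2t, CF1t, n) − CF2X: a vector of d values, where each value is the sum of squares of the data values in one dimension, i.e.,

• $(x_{11}^2 + \dots + x_{n1}^2),\ (x_{12}^2 + \dots + x_{n2}^2),\ \dots,\ (x_{1d}^2 + \dots + x_{nd}^2)$

− CF1X: a vector of d values, where each value is the sum of the data values in one dimension, i.e.,

• $(x_{11} + \dots + x_{n1}),\ (x_{12} + \dots + x_{n2}),\ \dots,\ (x_{1d} + \dots + x_{nd})$

− CF2t, CF1t: the sum of squares and the sum of the time stamps of the n points − n: the number of points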
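A micro-cluster summary and its additive property can be sketched as a small class. The class and method names are illustrative, and CF2t and CF1t are taken to be the sum of squares and the sum of the time stamps. The centroid and the RMS deviation are derived purely from the stored sums, which is why no raw points need to be kept.

```python
import math
from dataclasses import dataclass

@dataclass
class MicroCluster:
    cf2x: list   # per-dimension sum of squared data values
    cf1x: list   # per-dimension sum of data values
    cf2t: float  # sum of squared time stamps
    cf1t: float  # sum of time stamps
    n: int       # number of points

    @classmethod
    def from_point(cls, x, t):
        return cls([v * v for v in x], list(x), t * t, t, 1)

    def absorb(self, x, t):
        """Add one point by updating the running sums in place."""
        for j, v in enumerate(x):
            self.cf2x[j] += v * v
            self.cf1x[j] += v
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def merge(self, other):
        """Additive property: the summary of a union is the component-wise sum."""
        return MicroCluster(
            [a + b for a, b in zip(self.cf2x, other.cf2x)],
            [a + b for a, b in zip(self.cf1x, other.cf1x)],
            self.cf2t + other.cf2t, self.cf1t + other.cf1t, self.n + other.n)

    def centroid(self):
        return [s / self.n for s in self.cf1x]

    def rms_deviation(self):
        """Root-mean-square distance of the points from the centroid."""
        var = sum(s2 / self.n - (s1 / self.n) ** 2
                  for s2, s1 in zip(self.cf2x, self.cf1x))
        return math.sqrt(max(var, 0.0))

a = MicroCluster.from_point([1.0, 2.0], t=1.0)
a.absorb([3.0, 4.0], t=2.0)
b = MicroCluster.from_point([5.0, 6.0], t=3.0)
merged = a.merge(b)
```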

The Framework (cont’d) • Pyramidal time frame:

− Snapshots are classified into different orders − Orders can vary from 1 to $\log_\alpha(T)$, where T is the clock time elapsed since the beginning of the stream − Snapshots of the i-th order occur at time intervals of $\alpha^i$, where $\alpha$ is an integer, $\alpha \ge 1$ − At any given moment in time, only the last $\alpha^l + 1$ snapshots of order i are stored, where l is another integer − So the maximum number of snapshots stored at any moment is $(\alpha^l + 1) \cdot \log_\alpha(T)$
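The retention rule can be made concrete with a short sketch that lists which snapshot times are still stored at clock time T. It assumes that an order-i snapshot is taken whenever the clock is divisible by alpha^i, that alpha is at least 2, and that order 0 (every tick) is included; the function name is hypothetical.

```python
def stored_snapshots(T, alpha=2, l=2):
    """Map each order i to the clock times of the snapshots still kept at time T."""
    keep = alpha ** l + 1                # only the last alpha^l + 1 per order
    kept = {}
    order = 0
    while alpha ** order <= T:
        step = alpha ** order
        latest = (T // step) * step      # most recent multiple of alpha^order
        kept[order] = [latest - j * step
                       for j in range(keep) if latest - j * step > 0]
        order += 1
    return kept

snaps = stored_snapshots(55, alpha=2, l=2)
distinct = sorted({t for times in snaps.values() for t in times}, reverse=True)
```

Recent history is covered at fine granularity (every tick for the lowest order) and older history at coarser and coarser granularity. The same clock time can appear under several orders, so the number of distinct snapshots that must be stored is smaller than the sum over orders.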

The Framework (cont’d) • Pyramidal time frame (continued)

Online Micro-cluster Maintenance • This process is not dependent on user input • Maintains statistics at a sufficiently high level of temporal and spatial granularity

− A total of q micro-clusters are maintained at any moment − Let the micro-clusters be M1…Mq − Each micro-cluster is given a unique id when created − When two micro-clusters are merged, a list of ids is created − Value of q is determined by the amount of memory available

Online Micro-cluster (cont’d) • Online updating − When a new data point Xk arrives − Either absorb it into an existing micro-cluster or put it into a new micro-cluster of its own

• To absorb into an existing micro-cluster − Find the distance of the data point from each micro-cluster centroid − Let the closest micro-cluster be Mp − Absorb Xk into Mp if it falls within the maximum boundary of Mp; otherwise create a new micro-cluster
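The absorb-or-create decision can be sketched as follows. Several details are assumptions made for illustration only: the boundary is `factor` times the RMS deviation of the closest micro-cluster, a singleton micro-cluster uses a fixed `singleton_radius`, time stamps are left out, and when the budget q is exceeded the two closest micro-clusters are merged and keep the list of both ids.

```python
import math
from itertools import combinations

def centroid(mc):
    return [s / mc["n"] for s in mc["cf1x"]]

def rms_radius(mc):
    var = sum(s2 / mc["n"] - (s1 / mc["n"]) ** 2
              for s1, s2 in zip(mc["cf1x"], mc["cf2x"]))
    return math.sqrt(max(var, 0.0))

def merge_closest_pair(clusters):
    a, b = min(combinations(clusters, 2),
               key=lambda ab: math.dist(centroid(ab[0]), centroid(ab[1])))
    a["ids"] += b["ids"]  # the merged micro-cluster keeps a list of ids
    a["n"] += b["n"]
    a["cf1x"] = [s + t for s, t in zip(a["cf1x"], b["cf1x"])]
    a["cf2x"] = [s + t for s, t in zip(a["cf2x"], b["cf2x"])]
    clusters[:] = [c for c in clusters if c is not b]

def update(clusters, x, next_id, q, factor=2.0, singleton_radius=1.0):
    """Absorb x into the closest micro-cluster or start a new one; return the next free id."""
    if clusters:
        mp = min(clusters, key=lambda mc: math.dist(x, centroid(mc)))
        radius = rms_radius(mp) if mp["n"] > 1 else singleton_radius
        if math.dist(x, centroid(mp)) <= factor * radius:
            mp["n"] += 1
            mp["cf1x"] = [s + v for s, v in zip(mp["cf1x"], x)]
            mp["cf2x"] = [s + v * v for s, v in zip(mp["cf2x"], x)]
            return next_id
    clusters.append({"ids": [next_id], "n": 1,
                     "cf1x": list(x), "cf2x": [v * v for v in x]})
    if len(clusters) > q:
        merge_closest_pair(clusters)  # stay within the budget of q micro-clusters
    return next_id + 1

clusters, next_id = [], 0
for point in [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (20.0, 20.0)]:
    next_id = update(clusters, point, next_id, q=2)
```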

Conclusion • CluStream clusters large evolving data streams • More efficient than recent techniques because

− It views the stream as a process that changes over time − rather than clustering the whole stream at once

• Can characterize clusters

− over different time horizons in a changing environment

• Provides flexibility to an analyst in a real-time and changing environment

References
[1] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering Data Streams,” in Proc. FOCS 2000, Nov. 2000.
[2] L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani, “Streaming-Data Algorithms For High-Quality Clustering,” in Proc. ICDE 2002, Feb. 2002.
[3] C. C. Aggarwal, J. Han, J. Wang, and P. Yu, “A Framework for Clustering Evolving Data Streams,” in Proc. VLDB 2003.

Thank you! Q&A

email streams [40], aggregating sensor data [39], analyzing .... The correlation is sufficiently good that not only ..... For z ≤ 1, the best results follow from analysis.