
Advanced Topics on Data Stream Mining Albert Bifet, Joao Gama, Ricard Gavalda, Georg Krempl, Mykola Pechenizkiy, Bernhard Pfahringer, Myra Spiliopoulou, Indre Zliobaite ECML PKDD 2012, Bristol, Sept. 24

(1)

Tutorial Structure
Part I: Mining one stream
• Introduction to streams and drifting data
• Adaptive predictive modelling
• Clustering streaming data
• Pattern Mining on streams
• Tools for mining data streams
• Discussion / Q & A

Part II: Mining multiple streams
• Mining distributed streams
• Mining relational streams
• Feedback issues in streams under drift
• Discussion / Q & A

Tutorial on Advanced Topics in Stream Mining – ECML PKDD 2012, Bristol, Sept. 24

(2)

Tutorial Presenters: Part I
Mykola Pechenizkiy is an Assistant Professor at the Department of Computer Science, Eindhoven University of Technology, the Netherlands. He has broad research interests in data mining and its application to various (adaptive) information systems serving industry, commerce, medicine and education. He has organized several workshops and conferences in these areas.

http://www.win.tue.nl/~mpechen/

Indre Zliobaite is a lecturer in computational intelligence at Bournemouth University, UK and a research task leader within the INFER.eu project. Her research interests and competences concentrate around online predictive modeling, context awareness and adaptation over time, and predictive analytics applications.

http://zliobaite.googlepages.com

(3)

Tutorial Presenters: Part I
Bernhard Pfahringer is an Associate Professor with the Computer Science Department of the University of Waikato. His main research interests are in Machine Learning and Data Mining, especially in efficient algorithms, stream mining, randomization, and applications.

http://www.cs.waikato.ac.nz/~bernhard/

Albert Bifet is a Research Fellow at Yahoo! Research Barcelona. He is the author of a book on Adaptive Stream Mining and Pattern Learning and Mining from Evolving Data Streams. He is one of the core developers of the MOA software environment for implementing algorithms and running experiments for online learning from evolving data streams.

http://www.cs.waikato.ac.nz/~abifet/

(4)

Tutorial Presenters: Part I
Ricard Gavaldà is a Professor at the Department of Software, U. Politècnica de Catalunya – BarcelonaTech. He has published over 70 papers and supervised 7 Ph.D. students. His current research interests are algorithmics of machine learning and data mining, with emphasis on streaming and adaptive methods. He is also working on the use of data mining in autonomic and green computing.

http://www.lsi.upc.edu/~gavalda

(5)

Tutorial Presenters: Part II
Myra Spiliopoulou is professor of Information Systems in the Faculty of Computer Science, Otto-von-Guericke-University Magdeburg, and Chair of the Knowledge Management & Discovery (KMD) lab. Her main research interest is mining in evolving systems. PC Co-Chair of ECML PKDD 2006 and NLDB 2008, Tutorials Co-Chair at ICDM 2010, Workshops Co-Chair at ICDM 2011, PC Co-Chair of GfKl 2012 and Demo Track Co-Chair at ECML PKDD 2012.

http://omen.cs.uni-magdeburg.de/itikmd

Georg Krempl is a postdoc researcher in the Knowledge Management & Discovery (KMD) lab at the Otto-von-Guericke-University Magdeburg, Germany. He holds a doctorate from the University of Graz, Austria. His main research interest is learning on evolving, drifting data. He has given several courses on data mining, statistics and optimization for students from different degrees at Univ. Graz and, since 2011, at Univ. Magdeburg.

http://omen.cs.uni-magdeburg.de/itikmd

(6)

Tutorial Presenters: Part II
Joao Gama is a researcher at LIAAD, University of Porto, working in the Machine Learning group. His main research interest is in Learning from Data Streams. He has published more than 80 articles. He served as Co-chair of ECML 2005, DS09, ADMA09 and a series of Workshops on KDDS and Knowledge Discovery from Sensor Data with ACM SIGKDD. He is the author of a recent book on Knowledge Discovery from Data Streams.

http://www.liaad.up.pt/~jgama/

(7)

On mining single data streams …
In many applications data arrives in real time and needs to be mined in real time. In addition, streaming data may evolve (drift) over time. This setting presents extra challenges to traditional data mining. In Part 1 of this tutorial we discuss the challenges and main solutions in mining evolving streaming data. The outline is as follows:
1. Introduction to data streams and drifting data
2. Adaptive predictive models
3. Clustering streaming data
4. Pattern mining on streams
5. Tools for mining data streams

(8)

Why mine multiple streams simultaneously?
In many applications, we have many streams that are correlated in a known or unknown way.
• Multiple streams of sensor signals
• Multiple health parameters recorded for a patient – sensors or other sources
• Multiple streams of news – on same and different topics
• Multiple streams of user activities – in a social network or an ecommerce platform

Mining multiple streams implies:
➜ aligning them
➜ combining them – not least for error correction
➜ learning a model from them
➜ adapting to drift

(9)

On mining multiple streams ...
In Part 2 of this tutorial we discuss:
1. Mining distributed streams
   • Communication among and querying over distributed streams
   • Learning on sensor networks

2. Mining multiple interdependent streams
   • Learning a model on the streams vs
   • Learning a model on entities that are enriched with stream data

3. Feedback issues in classification on evolving streams
   • How to formulate stream classification on multiple streams
   • How this formulation helps to identify and solve feedback issues
   • How drift mining can solve the issue of delayed label information, and how this is related to transfer learning / domain adaptation
   • Which active learning strategies have been explored on evolving streams

(10)

Part I: Mining one Stream Introduction to data streams and drift

(c) Albert Bifet, Ricard Gavalda, Mykola Pechenizkiy, Bernhard Pfahringer, Indre Zliobaite

(11)

Mining streaming data

Production plant given sensor readings predict the quality of the output 24/7 plant operation

(12)

More examples

Sensor data Web data (logs,content)

Activity data

(13)

Evolving streaming data
• Changing data over time
  – Data arrives online, never-ending
  – Data distribution is changing over time
  – Limited access to historical data (in “big” data streams – no access)

• Why do changes happen?
  – Changing environment (e.g. economic situation)
  – Changing internal/individual characteristics (e.g. interests)
  – Complexity of the environment (e.g. self-driving car)
  – Adversary activities (e.g. frauds, spam categorization)

(14)

Mining evolving streaming data
• Requirements
  – Data mining systems need to have adaptation mechanisms
    · update or retrain themselves to match recent data
    · otherwise accuracy will degrade over time
  – Data mining models need to
    · fit into limited memory and processing time, and
    · be ready to predict/produce an output at any time

(15)

Types of changes in data
• Evolving/drifting data = data distribution changes over time
• Feature drift [data evolution]
  – distribution of input data X changes, p(X)
• Real concept drift
  – relation between input X and target y changes, p(y|X)
• Changing prior distribution
  – e.g. of the target, p(y)
• Arrival of new information
  – new concepts/classes appear
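
The drift types listed on this slide can be related through the standard factorization of the joint distribution. The time index t below is notation assumed here for illustration; it does not appear on the slide.

```latex
p_t(X, y) = p_t(y \mid X)\, p_t(X),
\qquad
p_t(y) = \int p_t(y \mid X)\, p_t(X)\, dX
```

Under this reading, feature drift is a change in the factor p_t(X), real concept drift is a change in the factor p_t(y | X), and the prior p_t(y) can move when either factor does.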

(16)

Feature drift [data evolution] Original data

Feature drift

(17)

Real concept drift Drifted data

Original data

But..!

(18)

Real concept drift Drifted data

Original data

Decision boundary changes

(19)

Feature drift (again) Feature drift only Original data

Feature drift with concept drift (20)

Changing priors Original data

Drifted data

(21)

Arrival of new information Original data

New data

(22)

Configuration of changes over time
[Figure: data mean plotted against time for five patterns of change: sudden/abrupt, incremental, gradual, reoccurring concepts, outlier]

(23)

Implications to predictive modeling
• Predictive systems need to adapt, otherwise accuracy will degrade over time
• Not all changes in data require adaptation
• Feedback (e.g. timely arrival of the true label) is critical for
  – detecting a need for adaptation (e.g. increasing error)
  – adapting the predictive model (updating the parameters or completely replacing the model)
• Different changes may require different adaptation strategies

(24)

Implications to clustering
• Large volumes of arriving data make the traditional clustering algorithms inefficient
  – e.g. consider calculating pairwise distances
• The quality of the clusters becomes poor when the data evolves over time
• The main solution approach is to divide the clustering process into
  – Online – for periodically storing summary statistics, and
  – Offline – cluster analysis

(25)

Implications to pattern mining
• Very limited computational and memory resources
  – Typically incoming data can only be scanned once
• Stream data may have a much richer pattern structure due to their temporal nature
  – Extra challenges in filtering and presentation of the results

(26)

Change detection
• Adaptive models may need change detection
• What can be monitored for changes?
  – Statistical tests on incoming data distribution (univariate or multivariate)
  – Tests on streaming performance measure (e.g. error)
  – Monitoring model evolution
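
As an illustration of a test on a streaming performance measure, here is a minimal sketch in the spirit of the error-rate drift detector of Gama et al. (2004), listed in References I. The class name, the 30-example warm-up and the 2 and 3 standard deviation thresholds are assumptions made for this sketch, not values given on the slides.

```python
import math


class ErrorRateDetector:
    """Monitors the running error rate p and its standard deviation s."""

    def __init__(self, min_instances=30):
        self.min_instances = min_instances  # assumed warm-up length
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def add(self, error):
        """error is 1 if the model was wrong on this example, else 0."""
        self.n += 1
        self.errors += error
        if self.n < self.min_instances:
            return "stable"
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        # Remember the best (lowest) level of p + s observed so far.
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + 3 * self.s_min:
            return "drift"
        if p + s > self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"


# Synthetic error stream: 10% errors for 200 examples, then 50% errors.
errors = [1 if i % 10 == 9 else 0 for i in range(200)] + [i % 2 for i in range(200)]
detector = ErrorRateDetector()
first_drift = next((i for i, e in enumerate(errors) if detector.add(e) == "drift"), None)
```

In a full system the detector would be reset, and the model retrained or replaced, once "drift" is signalled; that step is left out here.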

(27)

Mining streaming data
[Figure: processing loop around the MODEL]
1. receive current data
2. model output
3. receive feedback (optional)
4. update model
5. receive new data
...

(28)
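
The loop on this slide can be sketched as a predict-then-update pass over the stream. The toy model, the function name `run_stream` and the convention that a missing label is `None` are assumptions made for this illustration.

```python
from collections import Counter


class MajorityClassModel:
    """Toy incremental model: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        if not self.counts:
            return None
        return self.counts.most_common(1)[0][0]

    def update(self, x, y):
        self.counts[y] += 1


def run_stream(stream, model):
    """Process (x, y) pairs one at a time; y is None when no feedback arrives."""
    mistakes = 0
    labelled = 0
    for x, y in stream:              # steps 1 and 5: receive current / new data
        y_hat = model.predict(x)     # step 2: model output
        if y is not None:            # step 3: feedback is optional
            labelled += 1
            mistakes += int(y_hat != y)
            model.update(x, y)       # step 4: update model
    return mistakes, labelled


stream = [((0,), "a"), ((1,), "a"), ((2,), None), ((3,), "b"), ((4,), "a")]
mistakes, labelled = run_stream(stream, MajorityClassModel())
```

Because each example is used for testing before it is used for training, the running mistake count doubles as a streaming performance measure.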

Model adaptation: two main approaches
• INCREMENTAL update
• periodic REPLACEMENT of the model
[Figure: two timelines showing the old model and the new model under each approach]

(29)

Current situation
• A lot of techniques for mining evolving streaming data have been developed,
  – in the 'basic' settings the problem is solved,
  – but systematization of settings and terminology is lacking
• Currently more advanced and more application oriented settings are being addressed
  – Lack of feedback (delay, absence, lack of reliability)
  – Security and reliability of adaptation
  – Automating KDD process instead of exclusive model adaptation
  – Assessing costs and utility of adaptation

(30)

Part I: Mining one Stream Adaptive predictive models

(c) Albert Bifet, Ricard Gavalda, Mykola Pechenizkiy, Bernhard Pfahringer, Indre Zliobaite

(31)

Part I Section 2: Handling Concept Drift (CD) in Predictive Modeling: Classification, Regression, Ranking and Prediction
Mykola Pechenizkiy – Assistant Professor at the Department of Computer Science, Eindhoven University of Technology, the Netherlands
www.win.tue.nl/~mpechen/

(32)

Predictive modeling
• Predict label(s), binary vs. categorical:
  – Classification: spam, antibiotic resistance
• Predict a score, ordinal vs. real number:
  – Regression and ranking: retrieval, recommendation
• Past vs. future:
  – Recognition (assign a value reflecting what has happened already but we cannot observe this directly) vs. prediction (assign a value reflecting what is likely to happen in future)
• Kinds of streaming evolving data:
  – Single relation/table data (user categorization vs. antibiotic resistance), time series/sequential data (popularity prediction vs. sensor signal reconstruction), graphs and sparse matrix data/recommendations (missing data vs. null)

(33)

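Major types of approaches
[Figure: 2×2 grid; columns: triggering (change detection) vs. evolving (adapting every step); rows: single learner (classifier, ranker, or regressor) vs. ensemble]
• Single learner, triggering: Detectors (variable windows)
• Single learner, evolving: Forgetting (fixed windows, instance weighting)
• Ensemble, triggering: Contextual (dynamic integration, meta learning)
• Ensemble, evolving: Dynamic ensemble (adaptive fusion rules)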

(34)

Fixed training window time

(35)

Variable training window

(36)

Dynamic ensemble Classifier 1

Classifier 2

vote

Classifier 3

Classifier 4

(37)

[Figure: dynamic ensemble weighting. Once the TRUE label arrives, voters 1 and 3 are punished and voters 2 and 4 are rewarded.]

(38)
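
The punish/reward idea can be sketched as a simplified weighted-majority update. This is not the exact scheme of any specific method named in the tutorial; the penalty factor `beta` and both function names are assumptions made for this illustration.

```python
def weighted_vote(predictions, weights):
    """Return the label with the largest total weight."""
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)


def update_weights(predictions, weights, truth, beta=0.5):
    """Punish wrong voters by the factor beta, then renormalize.

    Renormalizing raises the relative weight of the voters that were right,
    which plays the role of the reward.
    """
    scaled = [w * (beta if p != truth else 1.0) for p, w in zip(predictions, weights)]
    total = sum(scaled)
    return [w / total for w in scaled]


weights = [0.25, 0.25, 0.25, 0.25]
votes = [False, True, False, True]   # voters 1 and 3 are wrong, as in the figure
weights = update_weights(votes, weights, truth=True)
decision_after = weighted_vote(votes, weights)
```

After one update the two correct voters hold two thirds of the total weight, so the same split vote is now resolved in their favour.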

Taxonomy of methods (Gama et al 2012)
[Figure: taxonomy diagram; the recoverable labels are listed below]
• MEMORY
  – Data Memory: Crisp, Smooth
  – Memory Management: Window (Fixed, Variable), Selection, Weighting windows
  – Handling Reoccurrence: No, Yes
• CONTROL
  – Detection Methods: Threshold, Two windows
  – Examples named: CUSUM, SPC, Kifer, ADWIN

(39)

Taxonomy of methods (contd.) (Gama et al 2012)
[Figure: taxonomy diagram; the recoverable labels are listed below]
• LEARNING
  – Learning Mode
  – Model Adaptation: Blind (adaptation does not depend on loss function), Informed (with triggers), Evolving
  – Adaptation Methods: Model specific, Model independent; Global, Local
  – Model Management: Single model, Model selection, Ensemble, Model weighting
  – Other labels and examples named: SVM, CVFDT, Maloof; catastrophic replacement, incremental, recursive, partial replace

(40)

Classification
• Drivable and non-drivable regions in the image stream under changing lighting conditions, surface materials, measurement quality
• Antibiotic resistance in hospitals, where over time pathogens may develop resistance and “share” it with other pathogens
• Spam and regular e-mails under distribution shift, adversary actions and subjective notions of what spam is for a particular person

(41)

Antibiotic Resistance Prediction predict the sensitivity of a pathogen to an antibiotic based on data about the antibiotic, the isolated pathogen, and the demographic and clinical features of the patient.

(Tsymbal et al., 2008 . Information Fusion 9(1), pp. 56-68) (42)

How Antibiotic Resistance Happens

(43)

Challenges and Solution
• Different ways how resistance may happen
• Different ways how resistance may spread
• Physical/organizational connections
• Local drift
• Solution:
  – Contextual approach at the instance level
  – Dynamic integration of classifiers

(44)

Dynamic integration of classifiers
For the current test instance:
- find its nearest neighbours (NNs),
- estimate the expected error of the classifiers in the neighborhood,
- apply weighted voting of (selected) classifiers.
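
The three steps above can be sketched as follows. This is a minimal illustration, not the implementation of Tsymbal et al.; the function name, the use of Euclidean distance, the weight 1 − local error and the toy data are all assumptions.

```python
import math


def dynamic_integration(x, classifiers, validation, k=3):
    """Weight each classifier by its accuracy on the k validation
    instances nearest to x, then take a weighted vote."""
    # Step 1: find the nearest neighbours of the test instance.
    neighbours = sorted(validation, key=lambda item: math.dist(x, item[0]))[:k]
    votes = {}
    for clf in classifiers:
        # Step 2: estimate the classifier's error in the neighbourhood.
        local_error = sum(clf(xi) != yi for xi, yi in neighbours) / len(neighbours)
        # Step 3: weighted voting, with weight = 1 - local error.
        label = clf(x)
        votes[label] = votes.get(label, 0.0) + (1.0 - local_error)
    return max(votes, key=votes.get)


# Toy 1-D data: one classifier is locally accurate, two are not.
validation = [((1,), False), ((3,), True), ((4,), True),
              ((11,), False), ((13,), True), ((14,), True)]
classifiers = [lambda x: x[0] > 2, lambda x: x[0] > 12, lambda x: x[0] > 12]
result = dynamic_integration((3.5,), classifiers, validation)
```

An unweighted majority vote would return False for this instance; the local error estimates let the single locally competent classifier win.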

(45)

Regression
• Regression trees from data streams with drift detection (Ikonomovska et al. DS 2009)
• Hoeffding-Based Regression Trees With Options (Ikonomovska et al. ICML 2011)
• Dynamic integration of regression models (Rooney et al., MCS 2004)

(46)

Hoeffding-Based Regression Trees
• Hoeffding trees for on-line classification
• CVFDT algorithm
• On-line Regression/Model Trees with Options (ORTO)

(47)

Ranking
• Computer networks: generate lists of nodes ranked according to their susceptibility to attack
  – Real-time ranking with concept drift using expert advice (Becker & Arias, KDD’07)
• Information retrieval, recommender systems and adaptive information filtering: personalized ranked list of docs/items relevant to the query/user model
  – Adaptive news access (Billsus & Pazzani, UMUAI 2000): maintain an ensemble of the short-term and long-term user interests

(48)

Recommender Systems
Lessons learnt from the Netflix competition:
• Temporal dynamics is important
• Classical CD approaches may not work

“We Know What You Ought To Be Watching This Summer” (Koren, SIGKDD 2009)

[Figure annotations: “Something happened in early 2004”; “Are movies getting better with time?”]

(49)

Multiple sources of temporal dynamics
• Both items and users are changing over time
• Item-side effects:
  – Product perception and popularity are constantly changing
  – Seasonal patterns influence items’ popularity
• User-side effects:
  – Customers ever redefine their taste
  – Transient, short-term bias; anchoring
  – Drifting rating scale
  – Change of rater within household

(50)

Time series prediction
• Food sales prediction
• Mass flow estimation
• Electricity load prediction
• Popularity/demand prediction

(51)

Stock Balancing Problem Empty shelves

vs.

Perishable goods becoming obsolete (52)

Challenges in food sales prediction

(53)

Reoccurring and sudden drift in food sales

Reoccurring season

(54)

Solution
• Adaptive learning approach
  – Added external data (weather, holidays)
  – Contextual (meta learning) approach
• Approach
  – Form contextual features (structural, shape, relational)
  – Identify product categories
  – Train a switch mechanism, which assigns the predictor, based on what context is observed
• Challenges
  – Short history, rapid and frequent changes
  – Discretization, formulating the labels

Žliobaitė et al, 2012. Beating the baseline prediction in food sales: How intelligent an intelligent predictor is?

(55)

Online Mass Flow Prediction data collected from a typical experimentation with CFB boiler

asymmetric nature of the outliers short consumption periods within feeding stages

Pechenizkiy et al. 2009 (56)

Solution – use domain knowledge
• Adaptive learning approach
  – Detect a change and cut the training window
• Challenges
  – Specific types of outliers and noise
  – True change labels are unavailable
• Tree component prediction system
  – Signal model (2nd order regression)
  – Outlier elimination, based on moving average and learnable thresholds
  – Change detection and training window selection
  – Backtracking

Pechenizkiy et al, 2009. SIGKDD Explorations 11(2), p. 109-116

(57)

Part I: Mining one Stream Clustering Streaming data

(c) Albert Bifet, Ricard Gavalda, Mykola Pechenizkiy, Bernhard Pfahringer, Indre Zliobaite

(58)

What is clustering
• informally: grouping similar objects together
  – euclidean distance
  – density-based approaches
  – [kernel-based approaches]
  – [NMF-based approaches]
• maximise intra-cluster similarity and inter-cluster dissimilarity simultaneously
• more formally: minimize some cost function over a partitioning of the data

(59)

(Static) Evaluation

(60)

Streaming Evaluation
• Clusters may:
  – appear, move, fade away, merge
• Errors due to:
  – missed points
  – misplaced points
  – general noise
• only one (?) measure:
  – Cluster Mapping Measure (CMM)
    · normalized sum of penalties for these errors
    · external (i.e. relies on ground truth)

(61)

BIRCH
• Balanced Iterative Reducing and Clustering using Hierarchies
• Aggregate: Clustering Features CF = (N,LS,SS)
  – N: number of points
  – LS: sum of these N points
  – SS: sum of squares of these N points
  – ADDITIVE: easy update, easy merge
  – Compute: average intra/inter-cluster distances
• CF-Tree: height-balanced,
  – parameters B: branching factor, T: max leaf radius

(62)
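
A minimal sketch of why the clustering feature is additive and what can be computed from it. The class and method names are assumptions for this illustration, and SS is kept here as a single scalar (the sum of squared norms); some implementations store it per dimension instead.

```python
import math
from dataclasses import dataclass


@dataclass
class CF:
    n: int       # N: number of points
    ls: list     # LS: per-dimension sum of the points
    ss: float    # SS: sum of squared norms of the points

    @classmethod
    def from_point(cls, point):
        return cls(1, list(point), sum(v * v for v in point))

    def merge(self, other):
        """Additivity: the CF of a union is the component-wise sum."""
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        """Root mean squared distance of the points to the centroid."""
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


cf = CF.from_point((0.0, 0.0)).merge(CF.from_point((2.0, 0.0)))
```

Adding a point and merging two clusters are the same operation, so the summary can be maintained in one pass without storing the points themselves.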

BIRCH: processing
• two main phases:
  – online: grow CF-Tree
  – offline: cluster leaf nodes using any clusterer, e.g. k-means
  – [optional: refinements of both phases]
• This two-level, online/offline, scheme is very common in stream clustering

(63)

Clu-Stream
• extends BIRCH with explicit time-stamps
• CF (aka micro cluster) = (N,LS,SS,LT,ST)
  – N,LS,SS: same as for BIRCH
  – LT: sum of the time stamps
  – ST: sum of squares of the time stamps
• uses pyramidal time frame

(64)

Clu-Stream: processing
• Online: new point is either
  – absorbed into existing micro-cluster
  – or starts a new micro-cluster; to make room, either
    · delete oldest micro-cluster, or
    · merge two old micro-clusters
• Offline:
  – treat micro-clusters as (weighted) points in k-means

(65)

Density-based methods
• Recall (non-stream) DBSCAN:
  – defines density-reachability
  – clusters points together that are density-reachable
• Can form non-spherical clusters:

(66)

DenStream
• keep
  – core-micro-clusters
  – potential core-micro-clusters (pmc)
  – outlier-micro-clusters (omc)
• online
  – merge into a pmc, or omc (might become pmc)
  – create new omc
• offline: DBSCAN over micro-clusters

(67)

More algorithms
• ClusTree:
  – hierarchical, anytime, exponential decay
• StreamKM++:
  – maintain coreset tree
• SOMKE:
  – kernel density estimation over sequences of SOMs (self-organizing maps)
• ...

(68)

Part I: Mining one Stream Frequent Pattern Mining on Streams

(c) Albert Bifet, Ricard Gavalda, Mykola Pechenizkiy, Bernhard Pfahringer, Indre Zliobaite

(69)

Pattern mining: definitions
Patterns: sets with a “subpattern” relation ⊆
• {cheese,milk} ⊆ {milk,peanuts,cheese,butter}
• (search → buy) ⊆ (home → search → cart → buy → exit)
• [Figure: a small pattern over nodes labelled A, B, C shown as a subpattern of a larger one]

Applications: market basket analysis, intrusion detection, churn prediction, feature selection, XML query analysis, query and clickstream analysis, anomaly detection, …

(70)

Pattern mining in streams: definitions
The support of a pattern T in a stream S at time t is the probability that a pattern T’ drawn from S’s distribution at time t is such that T ⊆ T’.

Typical task: Given access to S, at all times t, produce the set of patterns T with support at least a given threshold at time t.

A pattern is closed if no superpattern has the same support. No information is lost if we focus only on closed patterns.

(71)
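
To make the definitions concrete for itemsets, the sketch below estimates support as a relative frequency over a window of recent transactions (an assumption standing in for the stream's distribution at time t) and then keeps the frequent closed itemsets. The function names are hypothetical, and the exhaustive enumeration is exponential in the transaction size, so this is for illustration only.

```python
from itertools import combinations


def supports(window):
    """Relative support of every itemset contained in some transaction."""
    counts = {}
    for transaction in window:
        items = sorted(transaction)
        for r in range(1, len(items) + 1):
            for subset in combinations(items, r):
                key = frozenset(subset)
                counts[key] = counts.get(key, 0) + 1
    return {pattern: c / len(window) for pattern, c in counts.items()}


def frequent_closed(window, min_support):
    """Frequent itemsets that have no proper superset with the same support."""
    sup = supports(window)
    frequent = {p: s for p, s in sup.items() if s >= min_support}
    return {p: s for p, s in frequent.items()
            if not any(p < q and sup[q] == s for q in sup)}


window = [{"milk", "cheese"}, {"milk", "cheese", "butter"},
          {"milk"}, {"milk", "cheese"}]
closed = frequent_closed(window, min_support=0.5)
```

Here {cheese} is frequent but not closed, because {milk, cheese} has the same support; its support can be recovered from the closed superset, which is the sense in which no information is lost.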

Pattern mining in streams: idea
Key data structure: Lattice of patterns, with counts
[Figure: itemset lattice; the legend distinguishes nodes with count ≤ 7 from nodes with count > 7]
Size 4: {A,B,C,D},2
Size 3: {A,B,C},4   {A,B,D},3   {A,C,D},3   {B,C,D},8
Size 2: {A,B},15   {A,C},12   {A,D},5   {B,C},10   {B,D},12   {C,D},12
Size 1: {A},20   {B},18   {C},18   {D},25

(72)

Pattern mining in streams: idea
The vast majority of stream pattern mining algorithms (implicitly or explicitly) build and update the pattern lattice. General scheme:

let L be initial, empty lattice;
forever do {
    collect a batch of items of size B;
    build a summary S of the batch;
    merge S into L;
}
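
The general scheme can be sketched for itemsets as below. The summary used here is simply a count of every itemset in the batch, which is an assumption made for illustration: real algorithms keep more compact summaries (for example closed patterns only) and also forget old data. The function names are hypothetical, and a finite iterable stands in for the endless stream.

```python
from collections import Counter
from itertools import combinations


def summarize(batch):
    """Summary S of a batch: count of every itemset in its transactions."""
    summary = Counter()
    for transaction in batch:
        items = sorted(transaction)
        for r in range(1, len(items) + 1):
            for subset in combinations(items, r):
                summary[frozenset(subset)] += 1
    return summary


def mine(stream, batch_size):
    """Yield a snapshot of the lattice counts after each full batch."""
    lattice = Counter()                       # L: initial, empty lattice
    batch = []
    for transaction in stream:
        batch.append(transaction)             # collect a batch of size B
        if len(batch) == batch_size:
            lattice.update(summarize(batch))  # build S and merge S into L
            batch = []
            yield dict(lattice)


stream = [{"A", "B"}, {"A"}, {"A", "B"}, {"B"}]
snapshots = list(mine(stream, batch_size=2))
```

`Counter.update` adds counts rather than replacing them, which is what makes the merge step a one-liner in this simplified setting.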

(73)

Itemset mining
• MOMENT (Chi+ 04) (Sliding window, frequent closed, exact)
• CLOSTREAM (Yen+ 09) (Sliding window, all closed, exact)
• MFI (Li+ 09) (Transaction-sensitive window, frequent closed, exact)
• IncMine (Cheng+ 08) (Sliding window, frequent closed, approximate; faster for moderate approximate ratios)

(74)

Sequence, tree, graph mining
• MILE (Chen+ 05), SMDS (Marascu-Masseglia 06), SSBE (Koper-Nguyen 11): Frequent subsequence (aka sequential pattern) mining
• Bifet+ 08: Frequent closed unlabeled subtree mining
• Bifet+ 11: Frequent closed labeled subtree mining
• Bifet+ 11: Frequent closed subgraph mining

(75)

Part I: Mining one Stream Tools for data streams

(c) Albert Bifet, Ricard Gavalda, Mykola Pechenizkiy, Bernhard Pfahringer, Indre Zliobaite

(76)

Tools
• Data Stream Mining
  – MOA
  – VFML
  – RapidMiner
• Large Scale Machine Learning
  – Vowpal Wabbit
  – Mahout

(77)

VFML

(78)

RapidMiner

(79)

Vowpal Wabbit

(80)

Mahout

(81)

MOA Software

(82)

MOA Classification and Clustering

(83)

MOA Command Line
• EvaluatePeriodicHeldOutTest

java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask "EvaluatePeriodicHeldOutTest -l DecisionStump -s generators.WaveformGenerator -n 100000 -i 100000000 -f 1000000" > dsresult.csv

This command creates a comma separated values file:
• training the DecisionStump classifier on the WaveformGenerator data,
• using the first 100 thousand examples for testing,
• training on a total of 100 million examples,
• and testing every one million examples

(84)

Part I: Mining one Stream Conclusion

(c) Albert Bifet, Ricard Gavalda, Mykola Pechenizkiy, Bernhard Pfahringer, Indre Zliobaite

(85)

Summary
• Streaming data presents extra challenges in data mining and predictive modeling
  – Speed, memory, processing power limitations
  – Models may lose accuracy due to drift
• A lot of techniques for mining evolving streaming data have been developed
  – To save computational resources, increase speed
  – To maintain good accuracy
• The main modeling principles
  – Variations of “window” approaches to take into account the most recent data
  – Turning models into incrementally updatable ones
  – Combinations of models built for different situations and concepts

(86)

Outlook
• A lot of solutions for the “basic” settings are available
• Desiderata for the data stream mining field
  – Systematization of settings and terminology is needed
  – There is a lack of real-world benchmarks and tasks
• Future research directions
  – Advancing on handling recurring/seasonal changes
  – Advancing on change propagation from one object to the other in distributed stream settings (Tutorial Part II)
  – Progress on handling feedback (Tutorial Part II)
  – Adaptation management taking into account costs and utility of adaptation

(87)

References I
• Gama, Zliobaite, Bifet, Pechenizkiy (2012). A survey on concept drift adaptation. Under review. Meanwhile see/cite PAKDD'11 tutorial on Handling Concept Drift.
• Gama, Medas, Castillo, Rodrigues (2004). Learning with drift detection. SBIA'04.
• Kolter and Maloof (2007). Dynamic weighted majority. JMLR.
• Tsymbal, Pechenizkiy, Cunningham, Puuronen (2007). Dynamic integration of classifiers for handling concept drift. Information Fusion.
• Widmer and Kubat (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning.
• Hulten, Spencer, Domingos (2001). Mining time-changing data streams. KDD'01.
• Zliobaite (2010). Change with delayed labelling: when is it detectable? IEEE ICDMW'10.
• Klinkenberg and Renz (1998). Adaptive information filtering: Learning in the presence of concept drifts. ICML/AAAI-98 workshop notes.
• Katakis, Tsoumakas, Vlahavas (2010). Tracking recurring contexts using ensemble classifiers: an application to email filtering. KAIS.
• Zliobaite, Bifet, Pfahringer, Holmes (2011b). Active learning from evolving streaming data. ECMLPKDD'11.
• Zliobaite and Gabrys (2011a). Adaptive pre-processing for streaming data. Under review.
• Adae and Berthold (2011). Unifying Change - Towards a Framework for Detecting the Unexpected. HaCDAIS'11.
• Zliobaite (2009). Learning under Concept Drift: an Overview. Technical report. https://sites.google.com/site/zliobaite/Zliobaite_CDoverview.pdf

(88)

References II
• Elena Ikonomovska, Joao Gama, Raquel Sebastiao, and Dejan Gjorgjevik. Regression trees from data streams with drift detection. In DS ’09, pages 121–135.
• Elena Ikonomovska, João Gama, Bernard Zenko, Saso Dzeroski: Speeding-Up Hoeffding-Based Regression Trees With Options. ICML 2011: 537-544
• Elena Ikonomovska, João Gama, Saso Dzeroski: Learning model trees from evolving data streams. Data Min. Knowl. Discov. 23(1): 128-168 (2011)
• Hila Becker and Marta Arias. 2007. Real-time ranking with concept drift using expert advice. KDD '07. DOI=10.1145/1281192.1281205 http://doi.acm.org/10.1145/1281192.1281205
• Niall Rooney, David W. Patterson, Sarab S. Anand, Alexey Tsymbal: Dynamic Integration of Regression Models. Multiple Classifier Systems 2004: 164-173
• Sentiment knowledge discovery in Twitter streaming data. Albert Bifet and Eibe Frank. In: DS’10.
• Collaborative Filtering with Temporal Dynamics. Yehuda Koren, KDD 2009, ACM, 2009
• MOA: Massive online analysis, a framework for stream classification and clustering. Albert Bifet, Geoff Holmes, Bernhard Pfahringer, Philipp Kranen, Hardy Kremer, Timm Jansen, Thomas Seidl, in Pechenizkiy, M. & Žliobaitė, I. (eds), HaCDAIS Workshop ECML-PKDD 2010, 2010
• Dynamic Integration of Classifiers for Handling Concept Drift, Tsymbal, A., Pechenizkiy, M., Cunningham, P. & Puuronen, S. Information Fusion, Special Issue on Applications of Ensemble Methods, 9(1), pp. 56-68, 2008.
• Online Mass Flow Prediction in CFB Boilers with Explicit Detection of Sudden Concept Drift. Pechenizkiy, M., Bakker, J., Žliobaitė, I., Ivannikov, A., Karkkainen, T. SIGKDD Explorations 11(2), p. 109-116, 2009.
• Indre Zliobaite, Jorn Bakker, Mykola Pechenizkiy: Beating the baseline prediction in food sales: How intelligent an intelligent predictor is? Expert Syst. Appl. 39(1): 806-815 (2012)
• Daniel Billsus and Michael J. Pazzani. 2000. User Modeling for Adaptive News Access. User Modeling and User-Adapted Interaction 10, 2-3 (February 2000), 147-180. DOI=10.1023/A:1026501525781 http://dx.doi.org/10.1023/A:1026501525781

(89)

References III
• Hardy Kremer, Philipp Kranen, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff Holmes, Bernhard Pfahringer: An effective evaluation measure for clustering on evolving data streams. KDD 2011: 868-876
• G.W. Milligan. A monte carlo study of thirty internal criterion measures for cluster analysis. Psychometrika, 46(2):187–199, 1981.
• Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD Conference 1996: 103-114
• Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu: A Framework for Clustering Evolving Data Streams. VLDB 2003: 81-92
• Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996: 226-231
• Feng Cao, Martin Ester, Weining Qian, Aoying Zhou: Density-Based Clustering over an Evolving Data Stream with Noise. SDM 2006
• Philipp Kranen, Ira Assent, Corinna Baldauf, Thomas Seidl: The ClusTree: indexing micro-clusters for anytime stream mining. Knowl. Inf. Syst. 29(2): 249-272 (2011)
• Marcel R. Ackermann, Marcus Märtens, Christoph Raupach, Kamil Swierkot, Christiane Lammersen, Christian Sohler: StreamKM++: A clustering algorithm for data streams. ACM Journal of Experimental Algorithmics 17(1): (2012)
• Cao Y.: SOMKE: Kernel Density Estimation Over Data Streams by Sequences of Self-Organizing Maps. IEEE Transactions on Neural Networks and Learning Systems, 23(8), 1254-1268.

(90)

References IV
• [Bifet+ 08] A. Bifet, R. Gavaldà: Mining adaptively frequent closed unlabeled rooted trees in data streams. KDD’08.
• [Bifet+ 11a] A. Bifet, R. Gavaldà: Mining frequent closed trees in evolving data streams. Intelligent Data Analysis 15, 2011.
• [Bifet+ 11b] A. Bifet, G. Holmes, B. Pfahringer, R. Gavaldà: Mining Frequent Closed Graphs on Evolving Data Streams. KDD’11.
• [Chen+ 05] G. Chen, X. Wu, X. Zhu: Sequential pattern mining in multiple streams. ICDM 2005.
• [Cheng+ 08] J. Cheng, Y. Ke, W. Ng: Maintaining Frequent Closed Itemsets over a Sliding Window. J. Intelligent Information Systems 2008.
• [Chi+ 04] Y. Chi, H. Wang, P. Yu, R. Muntz: Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window. ICDM’04.
• [Koper-Nguyen 11] A. Koper, S. Nguyen: Sequential Pattern Mining from Stream Data. 7th Advanced Data Mining Applications (ADMA 2011).
• [Li+ 09] H. Li, S. Lee: Mining frequent itemsets over data streams using efficient window sliding techniques. Expert Systems with Applications, 2009.
• [Marascu-Masseglia 06] A. Marascu, F. Masseglia: Mining sequential patterns from data streams: a centroid approach. J. Intell. Inf. Syst (2006).
• [Yen+ 09] Ch. Wu: An Efficient Algorithm for Maintaining Frequent Closed Itemsets over Data Stream. IEA/AIE 2009.

(91)

Acknowledgements Albert Bifet's work at Waikato University was supported by a two year Build-IT Postdoc scholarship ( http://buildit.csi.ac.nz/ )

Part of I. Žliobaitė’s research leading to these results has received funding from the EC within the Marie Curie Industry and Academia Partnerships and Pathways (IAPP) programme under grant agreement no. 251617.

(92)
