A Data Intensive Multi-chunk Ensemble Technique to Classify Stream Data Using Map-Reduce Framework

Tahseen Al-Khateeb, Mohammad Salim Ahmed, Mohammad Masud and Latifur Khan
Department of Computer Science, The University of Texas at Dallas
[email protected], [email protected], [email protected], [email protected]

Abstract

We propose a data intensive and distributed multi-chunk ensemble classifier based data mining technique to classify data streams. In our approach, we combine the r most recent consecutive data chunks with the data chunks in the current ensemble and generate a new ensemble using this data for training. By introducing this multi-chunk ensemble technique in a Map-Reduce framework and considering the concept-drift of the data, we significantly reduce the running time and classification error compared to different ensemble approaches. We have empirically demonstrated its effectiveness over other state-of-the-art stream classification techniques on synthetic data and real world botnet traffic.

1 Introduction

It is not easy to measure the total volume of data generated electronically, but an IDC (International Data Corporation) estimate put the size of the digital universe at 0.18 zettabytes in 2006 [1]. One of the main contributors to this huge amount of data is streaming data. The high speed generation and huge volume of streaming data make data stream classification a major challenge for the data mining community. More specifically, there are three key problems related to stream data classification. First, it is impractical to store and use all the historical data for training, since it would require infinite storage and running time due to the sheer volume of such data. Second, there may be concept-drift in the data. Third, the classification process needs to be fast to cope with the high speed generation of data.

The solutions to the first two problems are related. If there is a concept-drift in the data, we need to refine our hypothesis to accommodate the new concept. Thus, most of the old data must be discarded from the training set. Therefore, one of the main issues in mining concept-drifting data streams is to choose the appropriate training instances to learn the evolving concept. One approach is to select and store the training data that are most consistent with the current concept [2]. Some other approaches update the existing classification model when new data appear, such as the Very Fast Decision Tree (VFDT) [3] approach. Another approach is to use an ensemble of classifiers and update the ensemble every time new data appears [4, 5].

As shown in [4, 5], the ensemble classifier is often more robust at handling unexpected changes and concept drifts. In this paper, we propose DIME, a Data Intensive Multi-chunk Ensemble classification algorithm, which is a distributed and parallel ensemble method that improves classification accuracy significantly.

Since we claim our system to be data intensive, it is imperative that we justify this claim. The most important characteristic of such a system is that even with huge amounts of data, it should not be bogged down and its performance should not deteriorate. The need for cloud computing arises in order to cope with such a situation. It allows huge amounts of data to be processed in a short period of time and prevents any deterioration of performance. This fast processing of data also means that employing a cloud computing framework on top of our proposed algorithm solves the third problem mentioned previously. It should be emphasized that, unless the process is distributed, the amount of data and the classification algorithms that can be used to handle data streams become limited due to time and storage constraints. With this goal in mind, we formulate our proposed multi-chunk ensemble classification algorithm in a Map-Reduce framework. This allows efficient processing and application of our algorithm, in both its training and test phases, to huge amounts of data that may be generated at high speed.

We assume that the data stream is divided into equal sized chunks. Each chunk, when labeled, is used to train classifiers. In our approach, there are two parameters that control the multi-chunk ensemble: r and K. Parameter r determines the number of chunks (r = 1 means a single-chunk ensemble), and parameter K controls the ensemble size. Our ensemble consists of K classifiers. This ensemble is updated whenever a new data chunk is labeled. We take the r most recently labeled consecutive data chunks and merge them with the chunks of data present in the ensemble. This combined data is used to train a new ensemble, which consists of the top K contributing classifiers. Thus, the total number of classifiers in the ensemble is always kept constant. It should be noted that when a new data point appears in the stream, it may not be labeled immediately. We defer the ensemble updating process until the data points in the latest data chunk have been labeled, but we keep classifying new unlabeled data using the current ensemble.

For example, consider the online credit-card fraud detection problem. When a new credit-card transaction takes place, its class ({fraud, authentic}) is predicted using the current ensemble. Suppose a fraudulent transaction has been misclassified as "authentic". When the customer receives the bank statement, he will identify this error and report it to the authority. In this way, the actual labels of the data points are obtained, and the ensemble is updated accordingly.

We have several contributions. First, we propose a generalized multi-chunk ensemble technique that significantly reduces the expected classification error over the existing single-chunk ensemble methods. Second, we formulate this algorithm on the Map-Reduce framework so that it can run in a parallel, distributed fashion. Finally, we apply our technique on synthetically generated data as well as on real botnet traffic, and achieve better detection accuracies than traditional as well as state-of-the-art stream data classification techniques. We strongly believe that the proposed ensemble technique provides a powerful tool for data stream classification.

The rest of the paper is organized as follows: section 2 discusses related work, section 3 provides some background knowledge, section 4 discusses the classification algorithm and its effectiveness, section 5 discusses data collection, experimental setup, evaluation techniques, and results, and finally, section 6 concludes with directions for future work.

2 Related work

There has been widespread research on how to use ensemble approaches for classification. In a non-streaming environment, ensemble classifiers like Boosting [6] are popular alternatives to single model classifiers, but these are not directly applicable to stream mining. However, several ensemble techniques for data stream mining have been proposed [4, 5, 7, 8]. These ensemble approaches have the advantage that they can be built more efficiently than a continuously updated single model, and they achieve higher accuracy than their single model counterparts [9].

Among these approaches, our ensemble approach is related to that of Wang et al. [4]. Wang et al. [4] keep an ensemble of the K best classifiers. Each time a new data chunk appears, a classifier is trained from that chunk. If this classifier shows better accuracy than any of the K classifiers in the ensemble, then the new classifier replaces the old one. When classifying an instance, weighted voting among the classifiers in the ensemble is taken, where the weight of a classifier is inversely proportional to its error. However, there are several differences between our approach and the approach of Wang et al. First, we train each classifier from r consecutive data chunks, rather than from a single chunk. And second, rather than replacing only one old classifier in the ensemble with a new classifier, in our approach more than one old classifier may be replaced.

There has also been much work in stream data classification. It can be divided into two main approaches: single model classification and ensemble classification. Single model classification techniques incrementally update their model with new data to cope with the evolution of the data stream [3, 10, 11, 12]. These techniques usually require complex operations to modify the internal structure of the model. Besides, in these algorithms, only the most recent data is used to update the model. Thus, contributions of historical data are forgotten at a constant rate even if some of the historical data are consistent with the current concept. So, the refined model may not appropriately reflect the current concept, and its prediction accuracy may not meet expectations.

Parallel and distributed classification is another area that has seen considerable research effort. Khoussainov et al. [13] have designed and implemented a grid-based Weka toolkit. It allows a series of processing tasks under a grid environment, such as training, testing and cross validation. But this style of distribution is coarse-grained; therefore, it cannot improve the distributed performance of the core mining process. Wu et al. [14] proposed an ensemble of C4.5 classifiers based on Map-Reduce called MReC4.5, and made the MReC4.5 classifier "Construct Once, Use Anywhere" by providing a series of serialization operations at the model level. As a result, the classifiers built on a cluster of computers or in a cloud computing platform can be used in other environments. Our multi-chunk ensemble approach is built on a cloud computing framework and is utilized to classify data streams within the same environment. In [15] Zhao et al. proposed PKMeans, a parallel K-means clustering algorithm based on Map-Reduce. PKMeans makes the clustering method applicable to large scale data; the proposed algorithm scales well and efficiently processes large datasets. However, besides the fact that the model is implemented to handle offline data, it is not designed to cope with the concept-drift present within stream data.

Another work closely related to our ensemble approach is the work of Masud et al. [16]. In [16], the authors propose a multi-partition, multi-chunk ensemble classifier called MPC. Considering K as the ensemble size, MPC keeps the best K ∗ v classifiers. In MPC, a batch of v classifiers is trained on v overlapping partitions of r consecutive data chunks. When classifying an instance, simple voting among the classifiers in the ensemble is taken. Therefore MPC is a generalization over previous ensemble approaches that train a single classifier from a single data chunk. There are several differences between our approach and MPC. First, we do not consider overlapping partitions over the r consecutive data chunks to train the ensemble classifiers. Instead, we consider the whole r consecutive data chunks to obtain the best K classifiers.

Second, when classifying an instance, weighted voting among the classifiers in the ensemble is taken. Third, in the approaches of Wang et al. and Masud et al., decision tree or Ripper is used as the base learner, whereas in our approach we use Q-Nearest Neighbor (Q-NN), since it is more appropriate to the parallel and distributed nature of cloud computing.

3 Background

As mentioned previously, we have formulated our classification model on the Hadoop Map-Reduce framework. In this section, we provide a brief overview of this framework. We also describe a baseline approach, the Multiple partitions of multiple chunks (MPC) Ensemble method, as it closely resembles our algorithm. In section 4.1, we provide the algorithms and a description of how our proposed system works.

3.1 Hadoop Map-Reduce framework: Map-Reduce is a distributed programming paradigm introduced by Dean et al. [17] to be used in a cloud computing environment. The model processes large data sets in parallel, distributed across many nodes in a shared-nothing fashion. The main focus is to simplify the processing of large datasets using inexpensive cluster computers. Another objective is to keep the framework easy for users to use, while achieving both load balancing and fault tolerance.

Map-Reduce has two primary functions: the Map function and the Reduce function. In addition to these two functions there is another, optional function, the Combine. A node in the cloud computing environment performing one of these operations will henceforth be referred to as a Mapper, Reducer or Combiner, respectively. These user-defined functions are designed as follows. The Map function takes a key-value pair as input. The user specifies what to do with these key-value pairs and produces a set of intermediate output key-value pairs:

(3.1)    Map(key1, val1) → List(key2, val2)

When the Map operation produces its outputs, they are already available in memory. For efficiency reasons, it sometimes makes sense to take advantage of this fact by utilizing the Combine operation to perform a reduce-type function:

(3.2)    Combine(key2, List(val2)) → (key2, List(val2))

The Combine function prunes duplicate or undesired values from the input. If a Combiner is used, then the Map key-value pairs are not immediately written to the output. Instead they are collected in lists, one list for each key. After a certain number of key-value pairs have been buffered, the buffer is flushed by passing all the values of each key to the Combiner. The Combiner then outputs the key-value pairs as if they were produced by the original Mapper. After the set of all Map and Combine tasks has been processed in parallel by each node in the cluster, without sharing data with other nodes, the output is stored in the cluster as a collection of files. These files are then transferred to another group of tasks called Reduce:

(3.3)    Reduce(key2, List(val2)) → List(val2)

The Reduce function accepts an intermediate key and a set of values for that key. Again the user decides what to do with these keys and values, and produces a possibly smaller set of values. The original Map-Reduce software is a proprietary system of Google and is therefore not available for public use. However, Hadoop [18] is an open source software implementation of the Map-Reduce paradigm, which is utilized in our experiments.

[Figure 1: Illustration of how data chunks are used to build an ensemble with DIME on the Hadoop framework.]
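To make the shapes of (3.1)-(3.3) concrete, the following is a minimal, framework-free Python sketch of the three functions. The word-count logic is purely illustrative and is not part of DIME, whose actual Map, Combine and Reduce steps are given in Algorithms 2-4.

```python
# Minimal sketch of the function shapes in equations (3.1)-(3.3), written as
# plain Python callables rather than an actual Hadoop job.
from collections import defaultdict
from typing import Iterable, List, Tuple

def map_fn(key1: str, val1: str) -> List[Tuple[str, int]]:
    # Map(key1, val1) -> List(key2, val2): emit one (word, 1) pair per token.
    return [(word, 1) for word in val1.split()]

def combine_fn(key2: str, vals2: Iterable[int]) -> Tuple[str, List[int]]:
    # Combine(key2, List(val2)) -> (key2, List(val2)): locally shrink the
    # value list before it is shipped to a Reducer.
    return key2, [sum(vals2)]

def reduce_fn(key2: str, vals2: Iterable[int]) -> List[int]:
    # Reduce(key2, List(val2)) -> List(val2): final aggregation per key.
    return [sum(vals2)]

if __name__ == "__main__":
    records = [("doc1", "stream data stream"), ("doc2", "data chunk")]
    grouped = defaultdict(list)
    for k1, v1 in records:                     # map phase
        for k2, v2 in map_fn(k1, v1):
            grouped[k2].append(v2)
    combined = {k: combine_fn(k, vs)[1] for k, vs in grouped.items()}  # combine
    print({k: reduce_fn(k, vs) for k, vs in combined.items()})         # reduce
```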

3.2 Multiple partitions of multiple chunks (MPC) Ensemble Method: As mentioned in the related work, our work is closely related to the MPC method [16]. Here, we provide a brief description of the method, shown in Algorithm 1. We start by computing the error of each classifier Ai ∈ A on the most recent data chunk Dp in lines 1-3 of the algorithm. We define D = {Dp−r+1, ..., Dp}, i.e., the most recently labeled r data chunks including Dp. Then, in line 5, we randomly divide D into v equal parts {d1, ..., dv}, such that all the parts have roughly the same class distributions. This partitioning is the reason why this algorithm is called MPC (Multi-partition multi-chunk ensemble). In lines 6-9, we train a new batch of v classifiers, where each classifier Apj is trained with the dataset D − {dj}. We compute the expected error of each classifier Apj on its corresponding test data dj. Finally, in line 10, we select the best K ∗ v classifiers from the K ∗ v + v classifiers Ap ∪ A. Note that any subset of the pth batch of v classifiers may take a place in the new ensemble.

Input: {Dp−r+1, ..., Dp}: most recently labeled r data chunks; A: current ensemble of the best K ∗ v classifiers
Output: Updated ensemble A
1: for each classifier Ai ∈ A do
2:   Test Ai on Dp and compute its expected error
3: end for
4: Let D = Dp−r+1 ∪ ... ∪ Dp
5: Divide D into v equal disjoint partitions {d1, d2, ..., dv}
6: for j = 1 to v do
7:   Apj ← Train a classifier with training data D − dj
8:   Test Apj on its test data dj and compute its expected error
9: end for
10: A ← best K ∗ v classifiers from Ap ∪ A based on expected error
Algorithm 1: MPC Algorithm
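As an illustration of Algorithm 1, the following hedged Python sketch trains the batch of v classifiers and keeps the best K ∗ v models. The use of scikit-learn decision trees, random partitioning, and in-memory numpy chunks are assumptions of this sketch, not the authors' implementation.

```python
# Hedged sketch of Algorithm 1 (MPC ensemble update).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def error(clf, X, y):
    # expected error of a classifier on a labeled chunk
    return float(np.mean(clf.predict(X) != y))

def mpc_update(ensemble, recent_chunks, K, v):
    """ensemble: list of (classifier, expected_error); recent_chunks: the r
    most recently labeled chunks as (X, y) pairs. Returns the best K*v
    classifiers out of the old ensemble plus a new batch of v classifiers."""
    Xp, yp = recent_chunks[-1]                      # most recent chunk D_p
    # lines 1-3: re-evaluate the current ensemble on D_p
    ensemble = [(clf, error(clf, Xp, yp)) for clf, _ in ensemble]

    # lines 4-5: D = union of the r chunks, split into v disjoint partitions
    # (random shuffling only approximates equal class distributions)
    X = np.vstack([c[0] for c in recent_chunks])
    y = np.concatenate([c[1] for c in recent_chunks])
    parts = np.array_split(np.random.permutation(len(y)), v)

    # lines 6-9: train v classifiers, each leaving one partition out as test data
    new_batch = []
    for j in range(v):
        test = parts[j]
        train = np.concatenate([parts[k] for k in range(v) if k != j])
        clf = DecisionTreeClassifier().fit(X[train], y[train])
        new_batch.append((clf, error(clf, X[test], y[test])))

    # line 10: keep the K*v classifiers with the lowest expected error
    return sorted(ensemble + new_batch, key=lambda t: t[1])[: K * v]
```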

4 A Data Intensive Multi-chunk Ensemble Framework (DIME)

4.1 DIME Classification Model: Let D = {d1, d2, ..., dr} be the r most recent contiguous data chunks that have been labeled. Each chunk of data is assigned an id and contains multiple lines, each of which is a labeled training vector. Therefore, the training file contains K + r chunks: the r most recent consecutive chunks merged with the K chunks already present in the ensemble. In the ensemble, each classifier is, in effect, a chunk of data, as we are using the Q Nearest Neighbor (Q-NN) approach. So, in the context of the ensemble, chunk and classifier mean the same thing, and we use them interchangeably from here on. The merging process that generates the training file is carried out before loading the whole training file into HDFS (Hadoop Distributed File System); a brief sketch of this merging step follows the stage list below.

Dp+1 is the data chunk containing the test data. It is used to generate the test file, which is then uploaded into HDFS. After the uploading of the training and test files into HDFS, our proposed method proceeds in the following four stages. The overall classification model is illustrated in figure 1.

1. Stage 1: First, the Hadoop framework partitions the training file into b blocks, which are queued to be distributed to t Map modules. By partitioning the training file into blocks, we allow t Mappers to operate on a single training file in parallel. The functionality of the Map algorithm is shown in Algorithm 2.

2. Stage 2: In this stage, once a training block is dispatched to a node, it is stored on that node. This block is then processed line-by-line by the Map instance running on that node. Each line Vtr corresponds to a single training instance or vector. On the other hand, the test file is loaded as an array Vtst[.] into each Map instance as a whole from HDFS, and this loading of the test file is done only once per Mapper. The Mapper then calculates the distance between each Vtr and all instances in Vtst[.]. Using these calculated distances, a vector is formed. This vector also contains the predicted label and the original test label. When the processing of all the blocks assigned to a Map instance is completed, the generated vectors constitute the final output of the Map instance. This output is then used as the input for the third stage.

3. Stage 3: In the third stage, the task of the Combine module is to pass on only Q candidate Nearest Neighbors (NN). As can be seen from figure 1, we have t Combine modules, each of which produces Q candidate Nearest Neighbors. Therefore, we will have t ∗ Q candidate Nearest Neighbors. All of them are grouped together as the input for the final stage. The algorithm for the Combine module is given in Algorithm 3.

4. Stage 4: This is the final stage, in which the Reduce module generates the final Q Nearest Neighbors (Q-NN) for each test instance from the t ∗ Q candidate Nearest Neighbors it receives from the previous stage. After the Q-NNs are generated, each test instance is classified based on the results of the weighted majority voting of those neighbors. Finally, a new ensemble (set of chunk ids) is formed by choosing the best K chunks that contribute most towards class label prediction of all the test instances. This ensemble, along with the classification results, is the output of the Reduce module. The algorithm for the Reduce module is given in Algorithm 4.
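The merging step mentioned before Stage 1 can be sketched as follows, assuming each chunk lives in a simple comma-separated text file and that the standard `hdfs dfs -put` command is available on the node preparing the job; both the file format and the upload call are illustrative assumptions, not the authors' exact tooling.

```python
# Hedged sketch: merge the K ensemble chunks and the r most recent labeled
# chunks into one chunk-id-tagged training file, then copy it into HDFS.
import subprocess

def build_training_file(ensemble_chunk_paths, recent_chunk_paths, out_path):
    """Each input file holds one chunk as lines of 'label,f1,f2,...'. The
    merged file prefixes every line with a chunk id so the Reducer can later
    credit each chunk for the predictions its instances contribute to."""
    with open(out_path, "w") as out:
        for chunk_id, path in enumerate(ensemble_chunk_paths + recent_chunk_paths):
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if line:
                        out.write(f"{chunk_id}\t{line}\n")

def upload_to_hdfs(local_path, hdfs_path):
    # assumes a configured Hadoop client on the PATH
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_path], check=True)
```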

4.2 Algorithms: A detailed description of the previously mentioned Map, Combine and Reduce functions is provided in this section, followed by the ensemble updating algorithm.

Description of the Map module (Algorithm 2): A single training instance called Value is passed to this module as input and converted into a vector representation Vtr. The module then reads all the test instances and puts them in an array Vtst[.]. In lines 5-7, the distances of the n test instances in Vtst to the training instance Vtr are calculated and saved in an array L[.]. Each entry in the array L[.] is an object or tuple of the form <Vtr, class label, distance>. Here, Vtr contains the chunk id and label information of the training instance, and class label is the label of the corresponding test instance. In line 8, L[.] is represented as a string, which is then produced as the output of the Map module. Since each call of the Map module handles only a single training instance, it needs to be called multiple times to process the set of training instances assigned to a particular Mapper. Consequently, the number of times the Map module is called across the whole distributed system is equal to the number of training instances in the training set.

Input: The chunk id (Key), and the labeled training instance (Value)
Output: <Key', Value'>, where Key' is null and Value' is a string comprising the distances between the training instance (Value) and all the test instances
1: Vtr ← Vector representation of the training instance in Value
2: Vtst[.] ← Vector representation of all test instances read from HDFS
3: n ← Number of test instances in Vtst[.]
4: L[.] ← Empty array of objects, each of which contains a distance and a class label
5: for i = 1 to n do
6:   L[i] ← Vtr, class label of Vtst[i] and distance between Vtr and Vtst[i]
7: end for
8: Value' ← String representation of L[.]
9: Output <Key', Value'>
Algorithm 2: Map (Key, Value)

Description of the Combine module (Algorithm 3): As mentioned previously, each Map module is called multiple times, based on the number of training instances assigned to that module, so multiple outputs are generated for each Map module. All these outputs are put together into a list and then passed as input to the Combine module. In our classification method, the task of the Combine module is to determine the Q candidate Nearest Neighbors for each test instance and pass them to the Reducer. This reduces the computation load of the Reduce module. We call them candidate nearest neighbors because they are chosen with respect to the partial training instance set received by the preceding Map module. The Q candidate nearest neighbors are determined in line 6 by sorting the columns of the list L[.][.] (where each column corresponds to one particular test instance) in ascending order of distance. The elements of L[.][.] are objects or tuples as previously mentioned in the Map module. The top Q rows of L[.][.] are then chosen (as they contain the candidate nearest neighbors), converted to a string representation just like in the Map module, and produced as output.

Input: Key is null, and List(V) is a list of all local Map outputs
Output: <Key', Value'>, where Key' is null and Value' is a string comprising the distances between the training instances and all the test instances
1: m ← Number of rows in V
2: n ← Number of test instances in the test set (i.e. the number of columns in V)
3: Q ← Number of candidate nearest neighbors to be considered
4: L[.][.] ← Construct a list from V having m rows and n columns; each element contains the chunk id of the training instance, the test instance class label and the distance of the test instance to the training instance
5: for i = 1 to n do
6:   Sort the m training instances in L[.][i] in ascending order of their distance to test instance i
7: end for
8: for j = 1 to Q do
9:   Value' ← String representation of the top jth row of L
10:  Output <Key', Value'>
11: end for
Algorithm 3: Combine (Key, List(V))

Description of the Reduce module (Algorithm 4): As mentioned previously, Map-Reduce is a distributed framework and therefore the individual results of the nodes in the cloud computing cluster need to be merged in order to get the final result. Reduce is the module that performs this operation. The input to this algorithm, as can be seen from Algorithm 4, is a list of Combiner outputs; therefore, there are t outputs in V. Each of them contains Q candidate Nearest Neighbor training instances for each test instance. In line 4, from the t ∗ Q candidate Nearest Neighbors, the Q true Nearest Neighbors are chosen. In lines 6-11, the weights of the training instances are calculated, and each contribution to prediction is also saved, to be used later to find which chunk of training instances contributed most. It should be mentioned that the weight of a training instance is directly proportional to the recency of that instance (i.e. the recency of the chunk of which the training instance is a member) and inversely proportional to the distance of the training instance to the test instance. Therefore, if a chunk is more recent or its training instance is closer to the test instance, the weight gets a high value. The recency is measured by considering the time stamp of the oldest chunk as the starting time and consecutive chunks as more recent. This operation is done for all the n test instances. Then, in line 14, the highest ranking prediction is assigned to the test instance. The previously saved contributions of training instances are merged to find the contribution of each chunk in predicting the labels of the test instances. These contributions are then used to find the top K contributing chunks to be used as the latest ensemble. This is done in line 16. Finally, the program concludes by returning the test instance labels and the latest ensemble.

Input: Key is null, and V is a list of all Combine outputs merged in one group, since the Key is null
Output: <Key', Value'>, where Key' is null and Value' is a string comprising the ensemble (chunk ids) and the predicted labels for each test instance
1: Q ← Number of nearest neighbors to be considered
2: n ← Number of test instances present in the test set
3: L[.][.] ← Construct a list from V having m rows and n columns, with each element having the same information as in the Combine module
4: L[.][.] ← Extract the Q nearest neighbors from L[.][.] and construct a list having Q rows and n columns
5: Votes[.][.] ← An array of Q rows and n columns; each element contains a chunk id, a weight and the class label vote
6: for j = 1 to Q do
7:   for k = 1 to n do
8:     Calculate weight based on chunk recency value and distance in L[j][k]
9:     Votes[j][k] ← Vote using weight and class label in L[j][k]
10:  end for
11: end for
12: Labels[.] ← Empty list for assigning predicted class labels
13: for i = 1 to n do
14:   Labels[i] ← Predicted class label with the highest rank for the ith test instance
15: end for
16: Ensemble[.] ← Get the highest K contributing chunks using the EnsembleUpdate(Votes, Labels) function //EnsembleUpdate is explained in Algorithm 5
17: Value' ← String representation of Ensemble[.] and Labels[.]
18: Output <Key', Value'>
Algorithm 4: Reduce (Key, V)

Description of the Ensemble Update function (Algorithm 5): This function receives a list Votes[.][.], which contains the weights and votes (i.e. class labels) of the Q Nearest Neighbor training instances for each test instance. Labels[.], on the other hand, contains the predicted class labels of all the test instances after the weighted voting of labels. Now, we want to find out which chunks should be chosen for the future ensemble. This is done by considering the contribution of each chunk during the whole prediction process. As can be seen in line 4 of the algorithm, it is checked whether the vote in Votes[i][j] won in predicting the class label of test instance j. If so, we conclude that the training instance that cast the vote contributed to predicting the class label, and this contribution can be credited to the chunk of which that training instance is a member. We accumulate the contribution of each chunk in predicting the class labels of all the test instances in lines 2-10. In line 11, these contributions are sorted to find the top K contributing chunks. Finally, the chunk ids of these chunks are output as the new ensemble.

Input: Votes is an array of Q rows and n columns; each element contains a chunk id, a weight and the class label vote. Labels is a list of n elements; each element represents the predicted class label of the corresponding test instance.
Output: The ensemble classifiers (K chunk ids)
1: C[.] ← An array having length u, the number of training chunks, used to save chunk contributions
2: for i = 1 to Q do
3:   for j = 1 to n do
4:     if (Label in Votes[i][j] is the same as Labels[j]) then
5:       id ← Chunk id in Votes[i][j]
6:       w ← Weight in Votes[i][j]
7:       C[id] ← C[id] + w
8:     end if
9:   end for
10: end for
11: Sort C[.] in descending order of contribution
12: Ensemble[.] ← Get the K best contributing chunk ids
13: Output Ensemble[.]
Algorithm 5: EnsembleUpdate (Votes, Labels)
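To summarize the data flow of Algorithms 2-5, the following single-process Python sketch mirrors the Map, Combine, Reduce and EnsembleUpdate steps without Hadoop, without the string serialization, and without the exact weighting formula. The Euclidean distance and the recency/distance weight used below are illustrative assumptions consistent with the description above, not the authors' precise choices.

```python
# Hedged, single-process sketch of Algorithms 2-5.
import math
from collections import defaultdict

def map_fn(chunk_id, train_vec, train_label, test_set):
    # Algorithm 2 (simplified): one tuple per test instance, carrying the
    # training instance's chunk id and label plus the distance. The paper's
    # tuples also carry the test instance's true label for error bookkeeping,
    # which is omitted here for brevity.
    return [(i, chunk_id, train_label, math.dist(train_vec, x))
            for i, x in enumerate(test_set)]

def combine_fn(map_outputs, Q):
    # Algorithm 3: per test instance, keep only the Q locally nearest candidates.
    per_test = defaultdict(list)
    for rec in map_outputs:
        per_test[rec[0]].append(rec)
    return [t for recs in per_test.values()
              for t in sorted(recs, key=lambda r: r[3])[:Q]]

def reduce_fn(combine_outputs, Q, K, num_chunks):
    # Algorithm 4: global Q-NN per test instance, weighted voting, then
    # EnsembleUpdate (Algorithm 5) over the winning votes.
    per_test = defaultdict(list)
    for rec in combine_outputs:
        per_test[rec[0]].append(rec)
    labels, contrib = {}, [0.0] * num_chunks
    for i, recs in per_test.items():
        qnn = sorted(recs, key=lambda r: r[3])[:Q]
        votes = []
        for _, chunk_id, lbl, dist in qnn:
            # weight grows with recency and shrinks with distance (assumed
            # form; higher chunk ids are assumed to be more recent)
            w = (chunk_id + 1) / (dist + 1e-9)
            votes.append((chunk_id, lbl, w))
        tally = defaultdict(float)
        for _, lbl, w in votes:
            tally[lbl] += w
        labels[i] = max(tally, key=tally.get)
        for chunk_id, lbl, w in votes:          # Algorithm 5: credit the chunks
            if lbl == labels[i]:                # whose votes won the prediction
                contrib[chunk_id] += w
    ensemble = sorted(range(num_chunks), key=lambda c: contrib[c], reverse=True)[:K]
    return ensemble, labels
```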

4.3 Upper bounds of r: In our model, r can only be increased up to a certain value, as will be explained in section 5.2. Beyond that, increasing r actually hurts the performance of our algorithm. The upper bound of r depends on ρd, the magnitude of drift. Although it may not be possible to know the actual value of ρd from the data, we may determine the optimal value of r experimentally. In our experiments, we found that for smaller chunk sizes, higher values of r work better, and vice versa. However, the best performance-cost trade-off is found for r = 5, which we use in our experiments.

[Figure 2: Error vs chunk size on (a) synthetic data and (b) botnet data.]

5 Experiments

We evaluate our proposed method on both synthetic data and real botnet traffic generated in a controlled environment, and compare our results with several baseline methods.

5.1 Data sets and experimental setup:

Synthetic dataset: We have used the same synthetic data as used in [16], generated with drifting concepts [4]. The concept-drift in the data is achieved with a moving hyperplane. The equation of a hyperplane is ∑_{i=1}^{d} a_i x_i = a_0. If ∑_{i=1}^{d} a_i x_i ≤ a_0, then an example is negative; otherwise it is positive. Each example is a randomly generated d-dimensional vector {x_1, ..., x_d}, where x_i ∈ [0, 1]. The weights {a_1, ..., a_d} are also randomly initialized with real numbers in the range [0, 1]. The value of a_0 is adjusted so that roughly the same number of positive and negative examples are generated; this can be done by choosing a_0 = (1/2) ∑_{i=1}^{d} a_i. We also introduce noise randomly by switching the labels of p% of the examples, where p = 5 is used in our experiments. There are several parameters that simulate concept-drift; we use the same parameter settings as in [4] to generate the synthetic data. We generate a total of 250,000 records and build four different datasets having chunk sizes of 250, 500, 750, and 1000, respectively. The class distribution of these datasets is 50% positive and 50% negative.
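The following hedged Python sketch reproduces the hyperplane generator just described (uniform points in [0, 1]^d, labels from ∑ a_i x_i ≤ a_0 with a_0 = ½ ∑ a_i, and p% label noise). The drift step applied between chunks is an assumed placeholder, since the paper reuses the drift parameters of [4].

```python
# Hedged sketch of the drifting-hyperplane synthetic data generator.
import numpy as np

def make_chunk(a, n, d, p_noise=0.05, rng=None):
    rng = rng or np.random.default_rng()
    X = rng.random((n, d))                       # x_i uniform in [0, 1]
    a0 = 0.5 * a.sum()                           # balances the two classes
    y = (X @ a > a0).astype(int)                 # 1 = positive, 0 = negative
    flip = rng.random(n) < p_noise               # switch labels of p% examples
    y[flip] = 1 - y[flip]
    return X, y

def stream(n_chunks, chunk_size, d=10, drift=0.01, seed=0):
    rng = np.random.default_rng(seed)
    a = rng.random(d)                            # initial weights in [0, 1]
    for _ in range(n_chunks):
        yield make_chunk(a, chunk_size, d, rng=rng)
        # assumed drift step: nudge the weights between chunks
        a = np.clip(a + drift * rng.standard_normal(d), 0.0, 1.0)
```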

Real (botnet) dataset: A botnet is a network of compromised hosts, or bots, under the control of a human attacker known as the botmaster [19]. The botmaster can issue commands to the bots to perform malicious actions, such as launching DDoS attacks, spamming, spying and so on. Thus, botnets have emerged as an enormous threat to the Internet community. Peer-to-Peer (P2P) is the newly emerging botnet technology. These botnets are distributed and small, so they are hard to detect and destroy. Examples of P2P bots are Nugache [20], Sinit [21], and Trojan.Peacomm [22]. Botnet traffic can be considered a data stream having two properties: infinite length and concept-drift. So, we apply our stream classification technique to detect P2P botnet traffic. We generate real P2P botnet traffic in a controlled environment, where we run a P2P bot named Nugache [20]. The details of the feature extraction process are discussed in [23]. There are 81 continuous attributes in total. The whole dataset consists of 30,000 records, representing one week's worth of network traffic. We generate four different datasets having chunk sizes of 30 minutes, 60 minutes, 90 minutes, and 120 minutes, respectively. The class distribution of these datasets is 25% positive (botnet traffic) and 75% negative (benign traffic).

Hadoop Distributed System Setup: The distributed system on which we performed our experiments consists of a cluster of ten nodes. Each node has the same hardware configuration: Intel Pentium IV 2.8GHz processor, 4GB main memory and 640GB hard disk space. The software environment also uses the same configuration: the operating system is Ubuntu 9.10, the distributed computing platform is Hadoop-0.20.1, the Java development platform is JDK 1.6, and the network link is a 100MB LAN.

Baseline methods: For classification, we use the "Weka" machine learning open source package, available at http://www.cs.waikato.ac.nz/ml/weka/. We apply three different classifiers: J48 decision tree, Ripper, and Bayes Net. In order to compare with other techniques, we implement the following:

DIME: This is our proposed classification method.

MPC: This is the classification method described in section 3.2 [16].

BestK: This is a single-partition, single-chunk (SPC) ensemble approach, where an ensemble of the best K classifiers is used. Here K is the ensemble size. This ensemble is created by storing all the classifiers seen so far, and selecting the best K of them based on expected error. An instance is tested using simple voting.

Last: In this case, we only keep the last trained classifier, trained on a single data chunk. It can be considered an SPC approach with K = 1.

Wang: This is an SPC method implemented by Wang et al. [4].

All: This is also an SPC approach. In this case, we create an ensemble of all the classifiers seen so far, and the new data chunk is tested with this ensemble by simple voting among the classifiers.

[Figure 3: Running time vs chunk size on (a) synthetic data and (b) botnet data.]

5.2 Performance study: In this section, we compare the results of all six techniques: DIME, MPC, Wang, BestK, All and Last. As soon as a new data chunk appears, we test each of these ensembles/classifiers on the new data, and update its accuracy, false positive rate, and false negative rate. In all the results shown here, we use the parameter value v = 5 in MPC, and r = 5 in DIME and MPC, unless mentioned otherwise.

Figure 2(a) shows the error rates of each method for four different chunk sizes on synthetic data (using decision tree in all baselines and Q-NN in DIME). It is evident that DIME has the lowest error compared to all other methods whenever the chunk size is less than 750. However, when the chunk size becomes more than 750, DIME still outperforms all other approaches except MPC. Here we can trade off the difference in error rate against the classification time between DIME and MPC. It can be seen from Figure 3(a) that when the chunk size is 1000, the classification time for DIME is around 70 seconds less than that of MPC.
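The evaluation protocol described above (test each model on every newly labeled chunk, update the accuracy, false positive and false negative counts, and then let the model update itself) can be sketched as the following hedged Python loop; the predict and update callables stand in for DIME or any baseline and are assumptions of this sketch, not the authors' API.

```python
# Hedged sketch of the test-then-train evaluation loop over labeled chunks.
import numpy as np

def prequential_eval(chunk_stream, predict, update):
    tp = fp = tn = fn = 0
    for X, y in chunk_stream:
        y_hat = predict(X)                       # test on the new chunk first ...
        tp += int(np.sum((y_hat == 1) & (y == 1)))
        fp += int(np.sum((y_hat == 1) & (y == 0)))
        tn += int(np.sum((y_hat == 0) & (y == 0)))
        fn += int(np.sum((y_hat == 0) & (y == 1)))
        update(X, y)                             # ... then update the model with it
    total = tp + fp + tn + fn
    return {"error": (fp + fn) / total,
            "false_positive_rate": fp / max(fp + tn, 1),
            "false_negative_rate": fn / max(fn + tp, 1)}
```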

Since we are dealing with stream data, the system should not wait long to accumulate large chunks, as such a scheme would make the stream classification process slower. Therefore, whatever classification process is employed, a low error rate at small chunk sizes is desirable. In this respect, DIME performs better than the other methods.

Figure 2(b) shows the error rates on the real botnet data over four different chunk durations. Again, Q-NN is used in DIME and decision tree in all other methods. Here, DIME has the lowest error compared to all other methods except MPC whenever the chunk size is between 60 and 90 minutes. In this case, we are dealing with real botnet stream data and it makes sense to compromise between error rate and running time. As can be seen from figure 2(b), the difference in error rate between DIME and MPC is less than 0.3% for the previously mentioned time interval. But with respect to the classification time (detection time), shown in figure 3(b), DIME always outperforms MPC and all other methods.

Tables 1 and 2 report the error of the decision tree and Ripper learning algorithms, respectively, on synthetic data, for different chunk sizes. In both tables, we see that DIME has the lowest error compared to all other methods, except MPC when the chunk size is equal to or greater than 750. Again, when the chunk size is larger than 750 we can trade off the difference in error rate against the classification time between DIME and MPC, as discussed before.

Figure 3(a) shows the total running times of the different methods on synthetic data. Notice that the running time of DIME is less than that of MPC, BestK, and All over all chunk sizes. However, when the chunk size is more than 500, the running time for Last and Wang is less than that of DIME. From the user's perspective, the choice between prediction accuracy and running time may be considered a trade-off; this is apparent from tables 1 and 2. The optimal choice would be a chunk size of 500, as it provides a low error rate as well as a fast running time.

Table 1: Error (%) of the decision tree approach on synthetic data

Chunk size   MPC    Wang   BestK   All    Last   DIME
250          16.2   26.1   19.5    29.2   26.8   9.29
500          10.2   12.4   11.3    11.3   14.7   8.15
750          10.3   11.3   11.2    15.8   13.8   10.96
1000         10.3   11.9   11.4    12.6   14.1   11.9

Table 2: Error (%) of the Ripper approach on synthetic data

Chunk size   MPC    Wang   BestK   All    Last   DIME
250          16.8   25.9   20.9    30.4   26.3   9.29
500          10.5   12.5   11.5    11.6   14.1   8.15
750          10.5   11.5   11.5    15.7   13.3   10.96
1000         10.2   11.9   11.8    12.6   13.6   11.9

The running times of DIME on the botnet data are shown in figure 3(b), and they include both training and testing time. It is evident that DIME outperforms all other methods on the botnet data. The reason is the high dimensionality of the botnet data (i.e. 81 attributes), as mentioned previously in section 5.1. This is another advantage achieved by our algorithm through utilizing the parallel and distributed computation architecture.

Figure 4 shows the sensitivity of r on error and running time on synthetic data for DIME. Figure 4(a) shows the errors for different values of r for a fixed value of K = 10. The highest reduction in error occurs when r is increased from 3 to 5. Note that we observe no significant reduction in error for higher values of r, which follows from our analysis of parameter r on concept-drifting data in section 4.3. However, the running time keeps increasing, as shown in figure 4(b). The best trade-off between running time and error occurs when r = 5 and the chunk size is less than 750.

[Figure 4: Sensitivity of parameter r on (a) error and (b) running time.]

6 Conclusion

We have introduced the Data Intensive Multi-chunk Ensemble (DIME) method on the Map-Reduce framework for classifying concept-drifting data streams. Our ensemble approach keeps the best K classifiers, where these classifiers are trained on the combined training data of the K data chunks from the previous ensemble and the r most recent consecutive data chunks. It is a generalization over previous ensemble approaches that train a single classifier from a single data chunk. By introducing this DIME method, we have reduced error significantly over the single-partition, single-chunk approach. We have tested our approach on both synthetic data and real botnet data, and obtained better classification accuracies compared to other approaches. DIME also outperforms those approaches with respect to running time.

Since DIME is designed for stream data classification, running time has been a primary concern in our algorithm design process. In the future, we would like to apply our technique to the classification and model evolution of other real streaming data. We would also like to study how different base classification learners, such as decision trees, perform over the Map-Reduce framework, and the stability of these approaches over the parameters of the cloud computing environment.

Acknowledgment: This research was funded in part by NASA grant NNX08AC35A, and Tahseen Al-Khateeb is partly funded by Al-Hussein Bin Talal University, Ma'an, Jordan.

References
[1] T. White, Hadoop: The Definitive Guide, 1st Edition. O'Reilly Media, Inc., June 2009.
[2] W. Fan, "Systematic data selection to mine concept-drifting data streams," in Proc. ACM SIGKDD, Seattle, WA, USA, 2004, pp. 128-137.
[3] P. Domingos and G. Hulten, "Mining high-speed data streams," in Proc. ACM SIGKDD, Boston, MA, USA, 2000, pp. 71-80.
[4] H. Wang, W. Fan, P. S. Yu, and J. Han, "Mining concept-drifting data streams using ensemble classifiers," in Proc. ACM SIGKDD, Washington, DC, USA, 2003, pp. 226-235. [Online]. Available: http://portal.acm.org/citation.cfm?id=956750.956778
[5] M. Scholz and R. Klinkenberg, "An ensemble classifier for drifting concepts," in Proc. Second International Workshop on Knowledge Discovery in Data Streams (IWKDDS), Porto, Portugal, 2005, pp. 53-64.
[6] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in Proc. International Conference on Machine Learning (ICML), Bari, Italy, 1996, pp. 148-156.
[7] J. Z. Kolter and M. A. Maloof, "Using additive expert ensembles to cope with concept drift," in Proc. International Conference on Machine Learning (ICML), Bonn, Germany, 2005, pp. 449-456.
[8] J. Gao, W. Fan, and J. Han, "On appropriate assumptions to mine data streams," in Proc. IEEE International Conference on Data Mining (ICDM), Omaha, NE, USA, 2007, pp. 143-152.
[9] K. Tumer and J. Ghosh, "Error correlation and error reduction in ensemble classifiers," Connection Science, vol. 8, no. 3-4, pp. 385-403, 1996.
[10] J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh, "BOAT—optimistic decision tree construction," in Proc. ACM SIGMOD, Philadelphia, PA, USA, 1999, pp. 169-180.
[11] G. Hulten, L. Spencer, and P. Domingos, "Mining time-changing data streams," in Proc. ACM SIGKDD, San Francisco, CA, USA, 2001, pp. 97-106.
[12] P. E. Utgoff, "Incremental induction of decision trees," Machine Learning, vol. 4, pp. 161-186, 1989.
[13] R. Khoussainov, X. Zuo, and N. Kushmerick, "Grid-enabled Weka: A toolkit for machine learning on the grid," ERCIM News, vol. 59, October 2004.

[14] G. Wu, H. Li, X. Hu, Y. Bi, J. Zhang, and X. Wu, "MReC4.5: C4.5 ensemble classification with MapReduce," in Proc. ChinaGrid Annual Conference, 2009, pp. 249-255.
[15] W. Zhao, H. Ma, and Q. He, "Parallel k-means clustering based on MapReduce," in CloudCom '09: Proceedings of the 1st International Conference on Cloud Computing. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 674-679.
[16] M. M. Masud, J. Gao, L. Khan, J. Han, and B. Thuraisingham, A Multi-partition Multi-chunk Ensemble Technique to Classify Concept-Drifting Data Streams, ser. Advances in Knowledge Discovery and Data Mining. Springer, April 2009, vol. 5476/2009.
[17] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008.
[18] Apache, "Hadoop," http://lucene.apache.org/hadoop/, 2006.
[19] P. Barford and V. Yegneswaran, An Inside Look at Botnets, ser. Advances in Information Security. Springer, 2006.
[20] R. Lemos, "Bot software looks to improve peerage," http://www.securityfocus.com/news/11390, 2006.
[21] L. T. I. Group, "Sinit P2P trojan analysis," http://www.lurhq.com/sinit.html, 2004.
[22] J. B. Grizzard, V. Sharma, C. Nunnery, B. B. Kang, and D. Dagon, "Peer-to-peer botnets: Overview and case study," in Proc. 1st Workshop on Hot Topics in Understanding Botnets, 2007, p. 1.
[23] M. M. Masud, J. Gao, L. Khan, J. Han, and B. Thuraisingham, "Mining concept-drifting data stream to detect peer to peer botnet traffic," Univ. of Texas at Dallas Tech. Report # UTDCS-05-08 (http://www.utdallas.edu/~mmm058000/reports/UTDCS05-08.pdf), 2008.
