
A Modular Multi-Location Anonymized Traffic Monitoring Tool for a WiFi Network

Justin Hummel, Andrew McDonald, Vatsal Shah, Riju Singh, Bradford Boyle, Tingshan Huang, Nagarajan Kandasamy, Harish Sethu and Steven Weber

Department of Electrical and Computer Engineering, Drexel University, Philadelphia, Pennsylvania 19104

Abstract: Detection of anomalous patterns in network traffic is now considered a surer approach to early detection and mitigation of malware propagation than signature-based approaches. However, traffic anomaly detection to discover an intrusion or the presence of malware is best accomplished by a combined analysis of traffic data collected at multiple locations within the network. Existing open-source tools are not well suited to the study of new anomaly detection strategies: they are primarily signature-based, do not facilitate integration of traffic data from multiple locations for real-time analysis, or are insufficiently modular for incorporating and testing newer approaches to anomaly detection such as compressive sampling. In this paper, we describe a new modular open-source tool, called DataMap, for the collection and real-time analysis of sampled, anonymized and filtered traffic data from multiple WiFi locations on a network. We also present a simple example of data collected using a deployment of DataMap at multiple locations on the Drexel campus and show how it can be employed in anomaly detection.

I. INTRODUCTION

A typical piece of new malware today uses a variety of obfuscation techniques, including entry-point obfuscation, polymorphism and metamorphism, to avoid detection, especially the signature-based detection typical of antivirus products currently on the market [1]. The increasing sophistication of viruses, worms and other malware has made their early detection immediately after launch significantly harder, with some reported success rates as low as 5%. Many security professionals argue that real-time monitoring for anomalous behavior in the traffic data over a network, as opposed to signature-based detection, offers a surer approach to protecting networks [2]–[4].

Real-time monitoring for anomalous behavior, however, presents its own challenges. One challenge is minimizing false positives and false negatives when detecting patterns that deviate from the normal after the data is gathered and stored [5], [6]. The other challenge, less frequently considered in academic work, is gathering and storing the data in real time so that detection algorithms can be applied to it immediately. This challenge stems primarily from the fact that traffic data collected at a single location is usually insufficient to infer an anomaly; one requires traffic data from multiple locations to conclude that anomalous behavior is ongoing.

As an example, consider the spread of a worm that exploits a vulnerability in, say, Microsoft’s RPC implementation, which uses port 135. For both improved efficiency and a higher probability of finding similar vulnerabilities, an infected machine first infects other machines within the local network before scanning for vulnerabilities outside the local network. On a typical campus or enterprise network, this manifests itself as a single incoming packet on port 135 and many outgoing packets on port 135 from several other machines within the network. The behavior at no one location is anomalous in isolation, but real-time examination of traffic at multiple locations is more likely to reveal a pattern typical of the early steps in the emergence of a worm.

While data from multiple locations is helpful to anomaly detection, transferring the extremely large volume of traffic recorded at multiple locations to a central server for analysis can be infeasible. Even with current state-of-the-art compression strategies, the data volume can be prohibitively large. Two approaches can render this task more feasible. One is to transfer only sampled and transformed data rather than all of the data, and to use new techniques such as compressive sampling that have been shown to be successful in some attempts at anomaly detection [7], [8]. The other is to transfer only aggregated data, such as histograms constructed over slices of time [9], rather than individual packet data.

In this paper, we describe an open-source tool, called DataMap, developed as part of a project to build tools that collect traffic data from multiple WiFi locations and facilitate real-time analysis for anomalies and intrusions. The project envisions and accommodates the potential use of innovative new techniques such as compressive sampling. It is also intended to provide a means for empirical estimation of fundamental trade-offs, such as that between the sampling rate of traffic data and the accuracy of the inferences on anomalies that can be drawn from it.
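The port-135 example in this section can be made concrete with a minimal Python sketch. It is an illustration only and not part of DataMap: the report format, the location names and the host labels are assumptions introduced here. Each location reports the outbound packets it saw during one time slice, and the function counts the distinct sources contacting a given port, both per location and network-wide.

```python
def distinct_sources_on_port(reports, port):
    """Count distinct sources sending to `port`, per location and network-wide.

    `reports` maps a location name to the (source, destination_port) pairs
    observed there during one time slice. This report format is hypothetical.
    """
    per_location = {}
    network_wide = set()
    for location, packets in reports.items():
        sources = {src for src, dst_port in packets if dst_port == port}
        per_location[location] = len(sources)
        network_wide |= sources
    return per_location, len(network_wide)


if __name__ == "__main__":
    # Hypothetical slice: no single location sees more than two such sources.
    reports = {
        "library": [("host-a", 135), ("host-b", 80)],
        "dorm": [("host-c", 135), ("host-d", 135)],
        "lab": [("host-e", 135), ("host-e", 443)],
    }
    per_location, total = distinct_sources_on_port(reports, 135)
    print(per_location, total)
```

In this made-up slice no location looks unusual on its own, while the network-wide count is the quantity a central server could compare against its usual level.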

II. AN OVERVIEW OF THE DATAMAP TOOL

The DataMap tool, a pre-release version of which is available on GitHub [10], efficiently collects traffic data at multiple WiFi locations across a network and aggregates it into a single database hosted on a central server. The DataMap tool builds upon existing open-source software and, in contrast to other currently available tools, combines four properties: it is open-source, offers concurrent multi-location traffic monitoring, enables integrated analysis of the multi-location aggregated data, and is modular to allow easy modification for academic research in anomaly detection. The DataMap tool is intended to facilitate experiments on sampling rates, compressive sampling, and both time-domain and space-domain data aggregation strategies. DataMap can be used on unencrypted WiFi networks without administrative authority or on encrypted networks with administrative authority.

[Figure 1 block labels: WiFi Base Station; Collection node (Sampler, Anonymizer, Aggregator, DB writer); DataMap Central Server (MySQL Storage); Analysis and Anomaly Detection tools.]

Fig. 1. A DataMap traffic monitoring system showing two collection nodes (also called traffic sensors) and the central server.

Figure 1 shows a high-level block diagram of the DataMap infrastructure, composed of a series of collection nodes (only two are shown in the figure) and a central server. The collection nodes gather, sample, anonymize, aggregate/compress and extract the relevant traffic data to transmit to the central server. The central server hosts the database, which serves as the data repository for real-time analysis or as an informational aid. To protect privacy, the DataMap tool does not store application-layer data at any node. All IP addresses in the collected data are anonymized using Crypto-PAn (Cryptography-based Prefix-preserving Anonymization [11]) before being transferred to the central server for analysis. While the DataMap tool helps detect malware threats, it does not itself take action to neutralize them, leaving any such action to human administrators or to other tools working in conjunction with DataMap.

The DataMap tool uses Vermont (Versatile Monitoring Toolkit [12]), an open-source modular framework for capturing and processing network data, at each collection node. Vermont uses the pcap library for packet capture and runs on Linux systems. It allows different sampling algorithms and filters for selecting packets for collection. The modular nature of Vermont supports the DataMap project's goal of building the tool from replaceable and independent components.

The DataMap tool avoids the overhead of clock synchronization between the multiple collection nodes. Our experience and other research suggest that patterns of data across multiple locations are best analyzed based on the statistical distribution (e.g., a histogram) of features of interest [9]. The pattern of changes in correlations over time between these distributions at different locations is not sensitive to time skews between the collection nodes (even if the correlations themselves fluctuate from one time slice to the next). This is especially true when the slices of time over which these distributions are analyzed are large or when the traffic volume is high.

Neither the DataMap tool nor any similar tool intended to harvest and analyze data for anomaly detection will work effectively against contagion worms, which do not cause new traffic patterns but instead spread by riding on the normal traffic generated by servers and clients talking to each other [13].
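The claim that slice-based histograms tolerate small clock skews can be checked with a rough simulation. The sketch below is an illustration under stated assumptions, not DataMap code: it draws hypothetical packet timestamps uniformly over 30 minutes and measures what fraction of packets would land in a different time slice if one node's clock were offset. The 36-second slice and the 20-millisecond skew are the values reported with the example in Section V; the function names are introduced here.

```python
import random


def slice_index(timestamp, slice_seconds):
    """Index of the time slice that a timestamp falls into."""
    return int(timestamp // slice_seconds)


def fraction_reassigned(timestamps, skew_seconds, slice_seconds):
    """Fraction of packets whose slice changes when the clock is skewed."""
    if not timestamps:
        return 0.0
    moved = sum(
        1
        for t in timestamps
        if slice_index(t, slice_seconds) != slice_index(t + skew_seconds, slice_seconds)
    )
    return moved / len(timestamps)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    # Hypothetical traffic: 100,000 packets spread uniformly over 30 minutes.
    timestamps = [rng.uniform(0.0, 1800.0) for _ in range(100_000)]
    print(fraction_reassigned(timestamps, 0.020, 36.0))
```

For uniformly spread traffic the expected fraction is roughly the ratio of skew to slice length, here 0.02/36 or about 0.06%, so each per-slice histogram changes very little. Bursty traffic concentrated near slice boundaries would be affected more.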

III. RELATED WORK

There are a number of open-source tools available for collecting data to monitor network traffic and detect intrusions. Most of these primarily target the detection of known types of intrusions (as opposed to anomalous patterns of behavior) using signature-based approaches, the most widely used being Snort [14]. Snort is a lightweight sniffer, packet logger and intrusion detection system that can generate alerts when it observes specific types of probes or attacks that indicate a potential intrusion attempt. Detection can be based on a rule set included in the Snort download and updated daily, or on custom rules written by the user. An installation of Snort runs on a single machine, and it takes a complementary set of tools to gather traffic data from multiple locations and detect in real time a gradually spreading worm such as that described in Section I. The DataMap tool, on the other hand, facilitates the sampling, aggregation and transfer of data from multiple locations (sensors) to a central server for real-time analysis. Also, being signature-based, Snort cannot detect anomalies whose signature is not yet known. The DataMap tool, however, can be employed in conjunction with approaches that use any combination of signature-based algorithms (with rule sets) and anomaly-based algorithms (without predetermined rule sets).

Other tools such as Suricata [15] are similar to Snort in their rule-based approach to intrusion detection and prevention but are intended for better performance on multi-core CPUs [16]. Like Snort and in contrast to DataMap, Suricata runs on a single machine without multiple data collection nodes at different locations (often also called traffic sensors).

The open-source tools that come closest to the functionality of the DataMap tool include Security Onion [17] and OpenWIPS-ng [18]. While these tools are also signature-based (sometimes built on tools such as Snort), they do allow data collection from multiple locations to facilitate an integrated analysis of the full network-level context. Security Onion is an Ubuntu-based system for network monitoring and intrusion detection and can be installed on multiple machines, each of which can serve as a traffic sensor. OpenWIPS-ng, still under development and currently lacking some of the features of DataMap (e.g., channel hopping), similarly allows traffic sensors at multiple locations to transmit data to a central server for analysis. While these very useful tools are close to the DataMap tool in some functionality, they are all largely built on a foundation of signature-based detection and are not ideal for an academic study of approaches to anomaly detection when signatures are not known. The DataMap tool, in contrast, is modular and allows for building functionality that includes the use of different sampling algorithms and aggregation strategies through an XML specification file. Built on replaceable modules, the DataMap tool is intended to facilitate the study of newer techniques for anomaly detection, such as those based on compressive sampling, histogram construction, or information-theoretic metrics.

While we have reviewed only the dominant open-source tools in this section, tools that require paid licenses are even less adaptable to modification for an experimental study or academic research on new approaches to anomaly detection.

[Figure 2: six panels, each plotting Correlation (vertical axis) against Time slice (horizontal axis, up to 50).]

(a) Correlation between histograms of source IP addresses at locations 1 and 2.
(b) Correlation between histograms of source IP addresses at locations 1 and 3.
(c) Correlation between histograms of source IP addresses at locations 2 and 3.
(d) Correlation between histograms of destination IP addresses at locations 1 and 2.
(e) Correlation between histograms of destination IP addresses at locations 1 and 3.
(f) Correlation between histograms of destination IP addresses at locations 2 and 3.

Fig. 2. An example of data collected at three locations on the Drexel University campus. The plots show correlations between the histograms computed on the traffic at these locations for source and destination IP addresses. The figures show a sustained change in the pattern of correlations between the traffic at location 3 and those at the other locations beginning sometime between time slices 15 and 20.

IV. DATAMAP COMPONENTS

The DataMap tool consists of the collection nodes and a central server. Each collection node is intended to be placed at a different WiFi location, monitoring traffic on a different base station or WiFi access point. A collection node consists of four primary components: the sampler module, the aggregator module, the anonymizer module and the DbWriter module.

The sampler module uses Vermont [12] to capture traffic with the desired sampling algorithm and filtering strategy. On a collection node, the DataMap tool uses airmon-ng [19] to place the wireless network interface into monitor mode so that it can collect traffic that is not addressed to the collection node itself. It also uses airodump-ng [20] to identify and choose the channels on which to listen.

The aggregator module summarizes the raw data gathered by the node for each pre-programmed slice of time. The aggregation can be keyed on any of several packet header fields (such as TCP port numbers or IP addresses), configured using an XML file.

The anonymizer module uses Crypto-PAn (Cryptography-based Prefix-preserving Anonymization [11]) to anonymize individual IP addresses in a way that protects individual privacy while preserving subnet and other topological information about the network. The DataMap tool discards both MAC addresses and all application-layer data in the aggregator module. Data passes from the aggregator module to the anonymizer through shared memory, and anonymization occurs before any of the collected data is transmitted across the network. Even so, the DataMap project recognizes that topological information, in conjunction with patterns of behavior extracted from the data using sophisticated data mining, can often yield enough information to reveal individual actions.

Finally, the DbWriter module sends the aggregated, filtered and anonymized data to a MySQL database on the central server. Data from each location is entered into a separate table identified by the unique identifier of the corresponding collection node and its location.

These components work as part of a node daemon on each collection node which, upon start-up, sends a HELLO message to the central server with its id and location. The daemon then waits for instructions from a server daemon running on the central server. The server daemon keeps track of all the nodes and periodically pings them with a HEARTBEAT message to make sure they are connected and to retrieve their latest state. A web interface, provided with the DataMap tool, can be used to keep track of the state of all the collection nodes.
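The prefix-preserving property on which the anonymizer module relies can be illustrated with a toy Python sketch. This is not Crypto-PAn itself, which specifies its own cipher-based construction [11]. Here HMAC-SHA256 stands in as the keyed pseudorandom function, and the function names and the key are assumptions made for illustration. Each output bit is the input bit XORed with a keyed function of the preceding input bits, so two addresses that share a k-bit prefix map to anonymized addresses that share exactly a k-bit prefix.

```python
import hashlib
import hmac
import ipaddress


def anonymize_ipv4(address, key):
    """Toy prefix-preserving anonymization of one IPv4 address (not Crypto-PAn)."""
    bits = format(int(ipaddress.IPv4Address(address)), "032b")
    out = []
    for i in range(32):
        # The flip bit for position i depends only on the key and on the
        # first i bits of the original address.
        digest = hmac.new(key, bits[:i].encode("ascii"), hashlib.sha256).digest()
        flip = digest[0] & 1
        out.append(str(int(bits[i]) ^ flip))
    return str(ipaddress.IPv4Address(int("".join(out), 2)))


def common_prefix_len(a, b):
    """Number of leading bits shared by two IPv4 addresses."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


if __name__ == "__main__":
    key = b"example-key"  # hypothetical key; a deployment would keep this secret
    a, b = "10.1.2.3", "10.1.2.200"
    print(common_prefix_len(a, b))
    print(common_prefix_len(anonymize_ipv4(a, key), anonymize_ipv4(b, key)))
```

Both printed prefix lengths agree, which is why subnet structure survives anonymization. It is also why, as noted above, topological information can still leak.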

V. A SIMPLE EXAMPLE OF DATAMAP USAGE

We present a simple example of data collection across multiple nodes and show how it can be helpful in traffic monitoring and, potentially, in detecting an Internet worm in its early stages. A typical worm (except for contagion worms [13]) begins its life by scanning for vulnerabilities on open ports at as many different IP addresses as possible in as short a time as possible, using any of a number of sophisticated approaches. Therefore, the distribution of the destination IP addresses of IP packets emerging from a worm-infected node is likely to be very different from that of a normal uninfected node. Detection of a wider range of anomalies in this or other features is made possible by constructing histograms of these features as described in [9].

However, for the reasons given in Section I, it is best not to rely entirely on a signature-based assessment of normal vs. anomalous histograms. A more effective approach also uses a real-time assessment of deviations between the histograms computed at different nodes. Past research has shown that an analysis of correlations between traffic features through techniques such as Principal Component Analysis can achieve effective anomaly detection [21]. The DataMap tool allows precisely this kind of analysis. These correlations will likely fluctuate from one time slice to the next, but the pattern of fluctuations itself can serve as an indication of what is “normal”. A sustained shift in the pattern of fluctuations in the correlation between the histograms computed at two different but equally busy nodes marks one of these nodes as a candidate for further examination by a system administrator. Such a sustained shift can be observed by dividing time into slices and computing the correlation between histograms at each slice.

Figure 2 presents data collected using the DataMap tool over a period of 30 minutes (50 time slices of 36 seconds each) from three nodes at different locations within the Drexel University campus network. The first row of plots, labeled (a)–(c), uses the histograms of the source IP addresses, and the second row, labeled (d)–(f), uses the histograms of the destination IP addresses. Each plot shows the correlation between histograms constructed from packets entering or leaving at two different locations. Each data point represents the correlation computed over one 36-second slice.
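The per-slice computation behind plots of this kind can be sketched in a few lines of Python. The paper does not state which correlation measure was used, so the Pearson coefficient below is an assumption, as are the (timestamp, feature) packet format and the function names. Each node's packets are binned into 36-second slices, a histogram of the chosen feature is built per slice, and the two nodes' histograms are correlated slice by slice.

```python
from collections import Counter
from math import sqrt

SLICE_SECONDS = 36  # slice length used for Fig. 2


def slice_histograms(packets, slice_seconds=SLICE_SECONDS):
    """Map slice index -> Counter of feature values for (timestamp, feature) pairs."""
    hists = {}
    for timestamp, feature in packets:
        index = int(timestamp // slice_seconds)
        hists.setdefault(index, Counter())[feature] += 1
    return hists


def pearson(h1, h2):
    """Pearson correlation of two histograms over the union of their bins."""
    keys = sorted(set(h1) | set(h2))
    x = [h1.get(k, 0) for k in keys]
    y = [h2.get(k, 0) for k in keys]
    n = len(keys)
    if n < 2:
        return 0.0
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for a flat histogram
    return cov / sqrt(vx * vy)


def correlation_series(packets_a, packets_b, slice_seconds=SLICE_SECONDS):
    """Per-slice correlation for the slices in which both nodes saw traffic."""
    ha = slice_histograms(packets_a, slice_seconds)
    hb = slice_histograms(packets_b, slice_seconds)
    return {s: pearson(ha[s], hb[s]) for s in sorted(set(ha) & set(hb))}
```

Calling `correlation_series` on the packets of two nodes yields one value per time slice, which is the kind of series plotted in each panel of Figure 2.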
Since we use histograms and consider only statistical distributions over 36-second time slices, synchronization errors of the order of the delay between these nodes (less than 20 milliseconds) do not change the results in any substantive way. The figure shows a sustained shift in the pattern of correlations between the traffic at location 3 and the traffic at the other two locations, beginning sometime between time slice 15 and time slice 20. This may indicate something anomalous, or it may be a normal event in which location 3 deviates for legitimate reasons. A deeper analysis using correlations between histograms of other features (such as port numbers) can help determine whether or not this event deserves to be flagged with an alert. The DataMap tool is not itself an anomaly detection engine; it is a tool intended to facilitate such determinations and studies of network-wide anomaly detection approaches.
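The paper does not prescribe a rule for deciding when a shift counts as sustained. As one hypothetical way to make the idea concrete, the Python sketch below takes a per-slice correlation series, uses the first few slices as a baseline, and reports the first slice that starts a run of consecutive values all deviating from the baseline mean by more than a threshold. The baseline length, run length and threshold are illustrative assumptions, not values from the DataMap project.

```python
from statistics import mean


def first_sustained_shift(series, baseline_len=10, run_len=5, threshold=0.3):
    """Return the 0-based index of the first slice that begins `run_len`
    consecutive values differing from the baseline mean by more than
    `threshold`, or None if no such run exists."""
    if len(series) < baseline_len + run_len:
        return None
    baseline = mean(series[:baseline_len])
    for start in range(baseline_len, len(series) - run_len + 1):
        window = series[start:start + run_len]
        if all(abs(value - baseline) > threshold for value in window):
            return start
    return None


if __name__ == "__main__":
    # Hypothetical series: stable correlation, then a lasting drop.
    shifted = [0.8] * 15 + [0.2] * 10
    # Hypothetical series: a single outlier that should not be flagged.
    spike = [0.8] * 12 + [0.1] + [0.8] * 12
    print(first_sustained_shift(shifted), first_sustained_shift(spike))
```

Requiring a run of deviating slices rather than a single one reflects the distinction drawn above between normal slice-to-slice fluctuation and a sustained change. A flagged slice would still need the deeper per-feature analysis described in the text.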

VI. CONCLUDING REMARKS

In this paper, we have briefly described a pre-release version of the DataMap tool for multi-location traffic monitoring and analysis. It is a modular tool intended for easy testing of a variety of new approaches to anomaly detection such as those based on compressive sampling, histogram correlations and information-theoretic metrics. The tool is not intended to merely provide an alternative to existing intrusion detection or traffic monitoring tools but instead to serve as a framework for the development and testing of new algorithmic approaches to traffic sampling, data aggregation, compression, and analysis for effective anomaly detection.

ACKNOWLEDGMENT

This work was partially funded by the National Science Foundation Award #1228847.

REFERENCES

[1] C.-H. Wu and J. D. Irwin, Introduction to Computer Networks and Cybersecurity. CRC Press, 2013.
[2] P. Li, M. Salour, and X. Su, “A survey of Internet worm detection and containment,” IEEE Communications Surveys and Tutorials, vol. 10, pp. 20–35, 2008.
[3] N. Perlroth, “Outmaneuvered at their own game, antivirus makers struggle to adapt,” The New York Times, December 31, 2012.
[4] D. Moore, C. Shannon, G. M. Voelker, and S. Savage, “Internet quarantine: Requirements for containing self-propagating code,” in Proc. INFOCOM. IEEE, 2003, pp. 1901–1910.
[5] A. Lakhina, M. Crovella, and C. Diot, “Mining anomalies using traffic feature distributions,” in Proc. ACM SIGCOMM. ACM, 2005.
[6] I. C. Paschalidis and G. Smaragdakis, “Spatio-temporal network anomaly detection by assessing deviations of empirical measures,” IEEE/ACM Transactions on Networking, vol. 17, pp. 685–697, 2009.
[7] E. J. Candès and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Processing Magazine, pp. 21–30, March 2008.
[8] J. Mai, C.-N. Chuah, A. Sridharan, T. Ye, and H. Zang, “Is sampled data sufficient for anomaly detection?” in Proc. Internet Measurement Conference. ACM, 2006.
[9] A. Kind, M. P. Stoecklin, and X. Dimitropoulos, “Histogram-based traffic anomaly detection,” IEEE/ACM Transactions on Network Service Management, vol. 6, pp. 110–121, June 2009.
[10] DataMap. Accessed: August 8, 2013. [Online]. Available: https://github.com/DataMap13/DataMap/
[11] J. Fan, J. Xu, M. H. Ammar, and S. B. Moon, “Prefix-preserving IP address anonymization: measurement-based security evaluation and a new cryptography-based scheme,” Computer Networks, vol. 46, no. 2, pp. 253–272, 2004.
[12] Vermont (VERsatile MONitoring Toolkit). Accessed: August 8, 2013. [Online]. Available: https://github.com/constcast/vermont/wiki
[13] S. Staniford, V. Paxson, and N. Weaver, “How to 0wn the Internet in your spare time,” in Proc. 11th USENIX Security Symposium. The USENIX Association, 2002.
[14] Snort. Accessed: August 8, 2013. [Online]. Available: http://www.snort.org/
[15] Suricata. Accessed: August 8, 2013. [Online]. Available: http://suricata-ids.org/
[16] E. Albin and N. C. Rowe, “A realistic experimental comparison of the Suricata and Snort intrusion-detection systems,” in Proc. Int’l Conf. on Advanced Information Networking and Applications Workshops. IEEE, 2012.
[17] Security Onion. Accessed: August 8, 2013. [Online]. Available: https://code.google.com/p/security-onion/
[18] OpenWIPS-ng. Accessed: August 8, 2013. [Online]. Available: http://www.openwips-ng.org/
[19] Airmon-ng. Accessed: August 8, 2013. [Online]. Available: http://www.aircrack-ng.org/doku.php?id=airmon-ng/
[20] Airodump-ng. Accessed: August 8, 2013. [Online]. Available: http://www.aircrack-ng.org/doku.php?id=airodump-ng/
[21] D. Brauckhoff, K. Salamatian, and M. May, “Applying PCA for traffic anomaly detection: Problems and solutions,” in Proc. IEEE INFOCOM. IEEE, 2009.
