Exploring Long-Term System Symptoms for Anomaly Detection

Shun-Te Liu 1,2, Yi-Ming Chen 2, Shiou-Jing Lin 1, Hui-Ching Huang 1

1 Information & Communication Security Lab TL, Chunghwa Telecom Co., Ltd.
12, Lane 551, Min-Tsu Road Sec. 5, Yang-Mei, Taoyuan, Taiwan (R.O.C.)
{rogerliu, sjlin, hushpuppy}@cht.com.tw

2 Department of Information Management, National Central University
No. 300, Jhongda Rd., Jhongli, 320, Taoyuan, Taiwan (R.O.C.)
{964403004, cym}@cc.ncu.com.tw

Abstract. System symptoms have long been leveraged for anomaly detection. Previous research achieved good results with them; however, the reported results were mostly obtained at a relatively small scale and within a short time period. To understand systems' long-term profiles, we collected four common types of symptom data, including CPU usage, memory loading, disk I/O, and network I/O, from more than 100 online internal systems comprising 300 servers over 9 months. We randomly selected 50 of these servers and analyzed their data in order to understand each symptom's long-term features. Based on our findings, we give suggestions for using each kind of symptom data in anomaly detection for further research.

Keywords: Anomaly detection, system symptoms, system profile.

1 Introduction

System anomaly detection is an important problem that has been studied in diverse application domains such as online failure detection, intrusion detection, and malware detection. As detection mechanisms are often deployed on top of the protected systems, they are independent of the systems' services or processes. When information coming directly from the services is lacking, leveraging "system symptoms" to detect system anomalies can be taken into consideration [1]. System symptoms are the side effects that a system emits continuously while running [1]; CPU loading and network I/O are examples. As they are easily and efficiently collected from the system, leveraging them for anomaly detection brings advantages such as low overhead on the protected systems and compatibility with legacy systems. Conversely, as system symptoms reflect system behavior only indirectly, methods using them are more prone to false positives and false negatives than methods using data coming directly from the systems, such as debug events [2].

Much research leverages different strategies, such as function approximation, classifiers, system models, and time series analysis, to improve the detection rate and false positive rate [3]. In most studies, the authors achieved good results with system symptoms, reporting high detection rates with low false positive rates. However, a closer examination of the presented results reveals that most experiments were performed on small-scale data obtained over a short time period, such as only a few days or weeks, especially in the false positive analysis. Meanwhile, the data size needed to determine a server's profile is also unclear. As a result, it is ambiguous whether such symptom data is representative of real-world information systems, which could cause more false positives in practice. In this paper, we design a system called iServer to collect four types of symptom data, CPU usage, memory usage, network I/O, and disk I/O, from more than 100 online information systems that include 300 servers over 9 months. We examine 50 servers' data in depth in order to understand the long-term symptom features of systems. In addition, previous studies indicate that symptom data has notable features not only in the time domain but also in the frequency domain. Therefore, the analysis procedure consists of a stationarity test, a probability distribution test, and frequency spectrum analysis. Our analysis finds that 1) CPU load, disk I/O, and network I/O are stationary for most servers (over 90%); 2) the probability distributions of CPU usage and network traffic are close to an exponential distribution; 3) for network I/O, the extremely high values are often meaningful and occur periodically or pseudo-periodically even though they are a small part of the whole data; and 4) the disturbance of memory usage is mostly low-frequency and non-stationary, and memory usage also has a higher chance of being affected by long-term (trend) effects than the other three. The contributions of this paper are threefold:

- We implement a system called iServer to collect system symptom data from diverse platforms.
- The long-term features of system symptoms are explored.
- For each symptom, we provide a practical system anomaly detection strategy.

The remainder of this paper is organized as follows. Section 2 describes previous research on system anomaly detection. Section 3 presents the analysis procedure. Section 4 presents the experimental results of the analysis. Section 5 concludes and describes future work.

2 Related Work

Anomalies can be classified into point anomalies, contextual anomalies, and collective anomalies [3]. As the name suggests, a point anomaly is an individual data instance that can be considered anomalous on its own; a login failure is an example. If a data instance is anomalous only in a specific context, such as logging in to a server at night, it is a case of a contextual anomaly. Collective anomalies are collections of related data instances that are anomalous with respect to the entire data set, such as higher CPU loading than before.

Table 1. The summary of the related works

Study | Symptom data | Anomalies | Experimental data
[4] | CPU | DoS | Simulation
[5] | CPU | System failure | Simulation
[6] | CPU, Bluetooth scanning | Virus infection, intrusion | MIT Reality dataset (one month)
[7] | Call stack | Memory vulnerability | Real-world network service applications
[8] | Call stack | Intrusion | Simulation
[9] | Free physical memory and response time | Resource outage | Real-world Apache web servers
[10] | Disk drive data | Disk drive failure | Quantum SMART dataset
[11] | Disk I/O | Performance bugs | Real servers' workload
[12] | Network flow | DoS and probe | 1999 DARPA dataset and 3 days of real-world WiFi data
[13] | Network packet bytes, flow counts | DoS, scan, service down | Simulated anomaly data and a 3-hour trace of the UNC campus network
[14] | Network flow | DDoS | Abilene/Internet2 network dataset

Several studies leverage different system symptom data for anomaly detection. The following subsections describe the related work summarized in Table 1 in detail.

2.1 CPU anomaly detection

Some studies concern using CPU symptoms for anomaly detection. In [4], Ming et al. investigated the impact of CPU-based DoS attacks on SIP infrastructure and identified four attack scenarios that can exploit vulnerabilities in the SIP authentication protocols. Their experimental results show that the SIP implementation is highly vulnerable to DoS attacks. Keith [5] used CPU performance counters, which provide low-overhead and transparent instrumentation capable of supporting black-box distributed system problem diagnosis. The instrumentation framework extracts CPU performance counter data from a three-tier executing instance of an e-commerce system and obtains good true-positive and false-positive rates. Fudong et al. [6] proposed a behavior-based profiling technique that utilizes CPU usage, calling activity, and Bluetooth network scanning to profile mobile devices. By developing a comprehensive multilevel approach, this technique is able to address the weaknesses of current systems.

2.2 Memory anomaly detection

Some studies have used memory symptoms for anomaly detection. Jun et al. [7] presented a technique to automatically identify both known and unknown memory corruption vulnerabilities, such as buffer overflows and format string vulnerabilities. The technique relies on a crash event as the trigger to initiate an automatic diagnosis algorithm and generates a signature of the attack using specific data/address values embedded in the malicious input message; the signature can then be used to block future attacks. Henry et al. [8] utilized call stack information to detect anomalous events by extracting return addresses from the call stack and generating an abstract execution path between two program execution points. Experiments show that their method can detect attacks that cannot be detected by other approaches while keeping a good false positive rate. Hoffmann et al. [9] proposed a best practice guide for building empirical models to predict the response time and the amount of free physical memory of an Apache web server, as well as the call availability of an industrial telecommunication system.

2.3 Disk I/O anomaly detection

Disk I/O is often used for detecting and predicting disk drive failures. In [10], Hamerly et al. proposed two Bayesian failure prediction methods. The first is a mixture model of naïve Bayes submodels trained with the expectation-maximization algorithm; the second is a standard naïve Bayes classifier trained on the same input dataset. The experimental results show that both models outperformed the existing thresholding methods in predictive accuracy. Murray et al. [15] developed an algorithm based on the multiple-instance learning framework and the naïve Bayes classifier. The approach is designed for the low false-alarm case. They used SMART data to evaluate their improvement, and their experimental results indicate that false alarms can be reduced significantly. In [11], Shen et al. proposed a model-driven anomaly characterization approach for discovering operating system performance bugs, especially for disk I/O-intensive online servers.

2.4 Network I/O anomaly detection

Some studies utilize network I/O symptoms for anomaly detection. In [12], Lu et al. collected fifteen features as system variables to generate wavelet coefficients and used an Autoregressive with eXogenous input (ARX) model to calculate residuals that represent the deviation of the current data from normal behavior; the residuals are then fed into an outlier detection algorithm to make decisions. Callegari et al. [13] combined sketches and wavelet analysis to achieve the aggregation of different traffic flows and the detection of discontinuities in the data. Their detection method is based on computing the distance between wavelet coefficients, which are obtained by transforming the elements in a sliding window; thus, they can detect network anomalies, such as denial of service attacks, by comparing changes in the coefficients. Zhang et al. [14] developed an outlier detection method called Multi-Resolution Anomaly Detection (MRAD) for detecting network anomalies in long-range dependent (LRD) time series. They also provided a graphical tool that displays the probability of the observed time series to help recognize outliers.

Fig. 1. The web pages of iServer

Table 2. The platform distribution of the candidates and the selected servers

Platform | Candidate servers | Selected servers
Windows | 198 | 39
Linux | 35 | 5
AIX | 18 | 2
Solaris | 17 | 2
HP-UX | 32 | 2
Total | 300 | 50

3 Symptom Data Analysis

3.1 Data Collection

In 2008, we developed and deployed a system called iServer in Chunghwa Telecom. This system consists of software agents, six application servers, and a web server. The agent can be installed on diverse platforms, including AIX, Linux, Windows, HP-UX, and Sun Solaris. It collects data on hardware, software, configuration, and resource usage for asset, service, and security management. As of August 20, 2011, over 300 online information systems consisting of more than 2300 servers were managed by iServer. Fig. 1 shows the web pages of iServer. The symptom data of each server, including CPU, memory, disk I/O, and network I/O, is collected every 5 minutes. We collected this data from July 1, 2010 to May 30, 2011; the total data size is over 180 GB without compression. As the data records of a server may be lost when the server is shut down or rebooting, we attempted to minimize the loss in the time series by selecting the 300 servers with the most data records as our analysis candidates. To conduct the analysis, we randomly selected 50 servers from these candidates. The platform distribution of the candidates and the selected servers is shown in Table 2.
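Although the iServer agent itself is proprietary, the sampling it performs can be approximated with a short script. The sketch below is ours, assuming the psutil library; only the 5-minute interval and the four symptom types come from the paper, while everything else (function names, KB/s conversion, printing instead of uploading) is illustrative.

```python
import time
import psutil  # assumed library; the actual iServer agent is proprietary

INTERVAL = 300  # one sample every 5 minutes, as in iServer

def sample():
    """Take one snapshot of the four symptom types."""
    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    return {
        "ts": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "mem_percent": psutil.virtual_memory().percent,
        "disk_bytes": disk.read_bytes + disk.write_bytes,  # cumulative counters
        "net_bytes": net.bytes_recv + net.bytes_sent,
    }

if __name__ == "__main__":
    prev = sample()
    while True:
        time.sleep(INTERVAL)
        cur = sample()
        record = {
            "ts": cur["ts"],
            "cpu_percent": cur["cpu_percent"],
            "mem_percent": cur["mem_percent"],
            # convert cumulative byte counters into KB/s over the interval
            "disk_kb_s": (cur["disk_bytes"] - prev["disk_bytes"]) / INTERVAL / 1024,
            "net_kb_s": (cur["net_bytes"] - prev["net_bytes"]) / INTERVAL / 1024,
        }
        print(record)  # a real agent would ship this to the application servers
        prev = cur
```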

3.2 Analysis approach

To understand the long-term features of the symptom data in the time and frequency domains, it is necessary to know 1) whether the data is stationary or not, 2) the frequency spectrum of the data, 3) the type of data distribution, and 4) the meaningful statistical values such as the mean, median, and standard deviation. We use the analysis procedure shown in Fig. 2. First, we leverage the Augmented Dickey-Fuller (ADF) test [16] to determine whether the time series data is stationary. The ADF test is an econometric test for whether a time series has an autoregressive unit root; if the test rejects the null hypothesis that the series is non-stationary, the data is considered stationary. If the data is stationary, we attempt to fit the data distribution using three tests (normal, chi-square, and bi-normal tests) and calculate the mean, median, and standard deviation to reveal the statistical features of the data. Otherwise, if the data is non-stationary, we use the moving average approach to remove the trend effects and feed the residual values to the stationarity test again. If the data is still non-stationary, we do not compute the statistical values because they would be meaningless. Finally, the data is analyzed by the Fast Fourier Transform (FFT) for spectrum analysis [17], which shows the data features in the frequency domain. To understand the frequency features of each kind of symptom data, we simply divide the frequency range from 1/288 to 1 into three bands: high, medium, and low frequency, where 288 samples correspond to 1 day (288 × 5 minutes = 24 hours). By calculating the proportion of spectral power in each band, we can easily obtain the frequency features of the symptoms.
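The procedure of Fig. 2 can be sketched as follows, assuming the statsmodels package for the ADF test and numpy/pandas for detrending and the FFT. The distribution-fitting step is omitted for brevity, and since the paper does not give the edges of the three bands inside 1/288..1, the one-day and one-hour cut-offs below are our assumption.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

DAY = 288  # samples per day at one sample every 5 minutes

def is_stationary(x, alpha=0.05):
    # ADF null hypothesis: the series has a unit root (is non-stationary).
    # Rejecting it (small p-value) means we treat the series as stationary.
    return adfuller(x, autolag="AIC")[1] < alpha

def band_fractions(x):
    # Share of spectral power in low/medium/high bands via the FFT.
    # The paper splits 1/288..1 into three bands without giving the edges;
    # the one-day and one-hour cut-offs here are our assumption.
    x = np.asarray(x, float) - np.mean(x)
    power = np.abs(np.fft.rfft(x)) ** 2
    freq = np.fft.rfftfreq(len(x))  # cycles per 5-minute sample
    bands = {
        "low": (freq > 0) & (freq <= 1 / DAY),          # periods >= 1 day
        "medium": (freq > 1 / DAY) & (freq <= 1 / 12),  # 1 hour .. 1 day
        "high": freq > 1 / 12,                          # periods < 1 hour
    }
    total = sum(power[m].sum() for m in bands.values())
    return {name: power[m].sum() / total for name, m in bands.items()}

def analyze(series):
    x = pd.Series(series, dtype=float).dropna()
    profile = {"spectrum": band_fractions(x)}
    if not is_stationary(x):
        # Remove the trend with a centered one-day moving average and
        # re-test the residuals (the "second ADF" pass of Table 3).
        x = (x - x.rolling(DAY, center=True).mean()).dropna()
        if not is_stationary(x):
            return profile  # still non-stationary: statistics are meaningless
    profile.update(mean=x.mean(), median=x.median(),
                   std=x.std(), min=x.min(), max=x.max())
    return profile
```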

4 Analysis Results

The stationarity and distribution test results are shown in Table 3; Table 4 shows the averages of the mean, standard deviation, minimum, and maximum values of each kind of symptom data; Table 5 shows the percentage of the major class relative to the entire data set; and Table 6 shows the frequency spectrum of the symptom data. The detailed analysis and discussion are as follows.

Fig. 2. The symptom data analysis procedure

Table 3. Number of servers (out of 50) passing the stationarity and distribution tests (MEM = memory, DISK = disk I/O, N_I = network inbound, N_O = network outbound)

Test | CPU (No.) | MEM (No.) | DISK (No.) | N_I (No.) | N_O (No.)
First ADF | 45 | 2 | 46 | 47 | 47
Second ADF | 1 | 10 | 2 | 0 | 0
Chi-square test | 16 | 2 | 12 | 0 | 9
Normal test | 3 | 1 | 8 | 2 | 2
Bi-normal test | 0 | 1 | 2 | 0 | 0

Table 4. Statistical data of each kind of symptom data

Statistic | CPU (%) | DISK (KB/s) | N_I (KB/s) | N_O (KB/s)
Mean | 2.92 | 565 | 573 | 603
Standard deviation | 3.12 | 1690 | 1508 | 1682
Min | 0.18 | 0 | 1.4 | 1.3
Max | 29.52 | 30371 | 29713 | 29646

Table 5. The percentage of the bins and classes

 | CPU (%) | DISK (%) | N_I (%) | N_O (%)
Major bin | 86 | 97 | 94 | 94
Other bins | 14 | 3 | 6 | 6
Major class | 90 | 98 | 98 | 98
Other classes | 10 | 2 | 2 | 2

Table 6. Frequency spectrum of each kind of symptom data

Band | CPU (%) | MEM (%) | DISK (%) | N_I (%) | N_O (%)
Low | 61 | 88 | 51 | 52 | 52
Medium | 25 | 8 | 30 | 29 | 29
High | 14 | 4 | 19 | 19 | 19

4.1 CPU

Table 3 shows that CPU usage is stationary on most servers (45 out of 50). However, on more than half of the servers it cannot fit the chi-square, normal, or bi-normal tests. After removing the records with value 0, the CPU usage is still stationary and fits the chi-square test. Meanwhile, we find that most of the time the CPU usage equals 0 on these servers, so the mean CPU usage is very low, about 2.92% on average. Table 5 shows the same thing: the major bin (from 0 to 10) of CPU usage accounts for 86% of the data on these servers on average.

Fig. 3 shows the CPU usage and the corresponding coefficients of a database server. CPU usage bursts only when queries from front-end applications occur; nevertheless, the maximal usage is only 15%. One reason is that the loading of the server is not very heavy. Another reason is that nowadays most CPUs have more than one core, so the CPU loading can be shared among the cores, which decreases the total CPU usage. As a result, CPU usage may be useful for detecting anomalies with high values, such as overloading, but is weak at detecting anomalies with low values, such as outages. Considering the frequency spectrum of CPU usage, as it is spread quite evenly across the frequency bands, we suggest that the coefficients of different scales be considered as a whole when profiling a server by CPU usage in the frequency domain.

4.2 Memory

Table 3 indicates that memory usage is a weak symptom for profiling a server because it is non-stationary, even after applying the moving average approach. It is therefore meaningless to analyze the statistical values of memory usage, and a method such as [9], which leverages free physical memory to detect anomalies, may not work for the candidate servers. However, Table 6 points out that on most servers the memory usage is dominated by low frequencies, which means it does not change frequently over time. Therefore, when profiling a server by memory usage, the frequency feature may be a notable one.

Fig. 3. The CPU usage (a) and the corresponding coefficients (b) of a database server

Fig. 4. The physical memory usage (a) and the corresponding coefficients (b) of a database server over time. The variance of the coefficients is more stable than the original memory usage values.

Fig. 4(a) shows the memory usage of a web server. It shows that the memory usage does not change frequently over time. Fig. 4(b) shows the coefficients of the memory usage data transformed by the continuous wavelet transform (CWT); the change in the coefficients is not as dramatic as that in the raw values. This is what we mean by "low frequency". As a result, to profile a server with memory usage, we suggest that the frequency feature be considered.
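For readers who want to reproduce coefficient plots like Fig. 4(b), a minimal sketch follows, assuming the PyWavelets package (the authors worked with Matlab's wavelet toolbox [18]); the Morlet wavelet and the scale range are our choices, not the paper's.

```python
import numpy as np
import pywt  # PyWavelets, assumed here; the paper used Matlab's toolbox [18]

def cwt_coefficients(memory_usage, max_scale=64):
    """Continuous wavelet transform of a memory-usage series.

    A mostly low-frequency series yields coefficients whose variance is
    much more stable than the raw values, as observed in Fig. 4(b).
    """
    scales = np.arange(1, max_scale)
    coefs, freqs = pywt.cwt(np.asarray(memory_usage, float), scales, "morl")
    return coefs, freqs  # coefs has shape (len(scales), len(memory_usage))
```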

4.3 Disk I/O

The disk I/O data of the servers is mostly stationary, so a statistic-based model with this symptom data may be workable for profiling a server for anomaly detection. However, it faces the same problem as CPU usage: not all the servers' data can fit the three probability distribution tests. This problem may be caused by the fact that we only analyze the disk I/O of the system drive; we suggest that the disk I/O of all drives be aggregated when profiling a server. Table 4 indicates that the value of disk I/O ranges from 0 to 30371 KB per second and that the mean plus three times the standard deviation is still much smaller than the maximum value. This points out that disk I/O changes dramatically over time.

Therefore, we suggest that the original data be log-transformed to diminish the range of the data.
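To illustrate why the log transform helps: from Table 4, a naive mean-plus-three-sigma threshold on raw disk I/O, roughly 565 + 3 × 1690 ≈ 5635 KB/s, sits far below the 30371 KB/s peaks, so routine backup bursts would be flagged. A minimal sketch of the suggested transform follows; the function name is ours.

```python
import numpy as np

def log_compress(disk_io_kb_s):
    # log1p handles the zero readings that raw disk I/O contains (Table 4)
    # and compresses the 0..30371 KB/s range into a much narrower scale.
    return np.log1p(np.asarray(disk_io_kb_s, float))

# With Table 4's figures, mean + 3*std on the raw data is about 5635 KB/s,
# far below the 30371 KB/s backup peaks; after the transform the same
# threshold rule flags far fewer routine bursts.
```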

Fig. 5. The disk I/O (a) and the corresponding coefficients (b) of a web server

Meanwhile, Table 5 shows that the data concentrates in the major bin (97%); therefore, density-based clustering approaches may be useful for profiling a server with disk I/O. Fig. 5 illustrates the disk I/O and the corresponding coefficients of a web server over time. It shows that extremely high disk I/O values occur seldom but periodically. This is caused by a backup service rather than by the server's major service. It is also notable that the coefficients are influenced by the extremely high values, which may cause false positives if a signal-based anomaly detection approach is used with this symptom data.

4.4 Network I/O

Network I/O is also stationary on most servers (47 out of 50). When visualized with a histogram tool in Matlab [18], most of the data falls into a few bins. In addition, the extreme high- and low-value bins are much smaller than the medium-value bins, but they show a stronger periodic or pseudo-periodic pattern. These features may be caused by complementary services, such as remote backups, and by seasonal effects (no one uses the service at night), rather than by the major services. After removing the data with extremely high and low values, the remaining data still accounts for over 98% of the original data and fits the chi-square test.

Fig. 6. The network inbound (a) and the corresponding coefficients (b) of a web server

These findings suggest that the data should be separated into several classes based on value, and each class can then be profiled individually for anomaly detection and prediction. Fig. 6 shows the network inbound data and the corresponding coefficients of a web server. It is clear that the extremely high values of network inbound traffic occur periodically. Most of the time the server has lower network inbound traffic, which is its major feature. Thus, when profiling the server with network I/O, the data should be pre-classified, and each class modeled individually to obtain the final profile for anomaly detection.
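The pre-classification step suggested above could be realized, for example, with quantile cut-offs. The paper does not specify how the classes are separated, so the thresholds, names, and per-class statistics below are assumptions.

```python
import numpy as np

def classify_and_profile(net_io_kb_s, low_q=0.01, high_q=0.99):
    """Split network I/O into low / major / high classes and profile each.

    Quantile cut-offs are an assumption; the paper only suggests separating
    the extreme values (a small but periodic share of the data) from the
    major-service traffic before modeling.
    """
    x = np.asarray(net_io_kb_s, float)
    lo, hi = np.quantile(x, [low_q, high_q])
    classes = {
        "low": x[x < lo],                    # e.g. idle periods at night
        "major": x[(x >= lo) & (x <= hi)],   # the major service, ~98% of data
        "high": x[x > hi],                   # e.g. periodic remote backups
    }
    return {name: {"mean": c.mean(), "std": c.std(), "share": len(c) / len(x)}
            for name, c in classes.items() if len(c)}
```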

5 Conclusion and Future Work

In this paper, we implemented a system called iServer to collect four types of symptom data from five common operating systems over 9 months. The collected data was used to explore the long-term system symptom features for anomaly detection. We designed an analysis procedure to examine the value, time, and frequency features of the servers. Based on the analysis results, we give suggestions for using each kind of symptom data in system anomaly detection.

Regarding future research directions, first, the analysis of memory usage points out that the frequency features may give us an opportunity to profile a server. Second, the values of disk I/O and network I/O change dramatically, but their extremely high values occur periodically or pseudo-periodically; a classification mechanism may be useful for separating the data into several classes, where each class is modeled individually and then aggregated to obtain a better system profile for anomaly detection.

Acknowledgment. The authors would like to thank the reviewers for their helpful comments. This research is supported by the Information & Communication Security Lab, Telecommunication Laboratories, Chunghwa Telecom Co., Ltd. and by the National Science Council of Taiwan, R.O.C., under Grant No. NSC 99-2221-E-008-094.

References

1. Salfner, F., Lenk, M., Malek, M.: A survey of online failure prediction methods. ACM Comput. Surv. 42(3), 1-42 (2010). doi:10.1145/1670679.1670680
2. Beizer, B.: Software Testing Techniques. Dreamtech Press (2002)
3. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: A survey. ACM Comput. Surv. 41(3), 1-58 (2009). doi:10.1145/1541880.1541882
4. Ming, L., Tao, P., Leckie, C.: CPU-based DoS attacks against SIP servers. In: IEEE Network Operations and Management Symposium (NOMS 2008), pp. 41-48 (2008)
5. Bare, K.: CPU Performance Counter-Based Problem Diagnosis for Software Systems. Carnegie Mellon University, School of Computer Science, Pittsburgh, PA (2009)
6. Fudong, L.: Behaviour Profiling on Mobile Devices. In: Nathan, C., Maria, P., Paul, D. (eds.), pp. 77-82 (2010)
7. Xu, J., Ning, P., Kil, C., Zhai, Y., Bookholt, C.: Automatic diagnosis and response to memory corruption vulnerabilities. In: Proceedings of the 12th ACM Conference on Computer and Communications Security, Alexandria, VA, USA (2005)
8. Feng, H.H., Kolesnikov, O.M., Fogla, P., Lee, W., Weibo, G.: Anomaly detection using call stack information. In: Proceedings of the 2003 IEEE Symposium on Security and Privacy, pp. 62-75 (2003)
9. Hoffmann, G.A., Trivedi, K.S., Malek, M.: A Best Practice Guide to Resource Forecasting for Computing Systems. IEEE Transactions on Reliability 56(4), 615-628 (2007)
10. Hamerly, G., Elkan, C.: Bayesian approaches to failure prediction for disk drives. In: Proceedings of the Eighteenth International Conference on Machine Learning (2001)
11. Shen, K., Zhong, M., Li, C.: I/O system performance debugging using model-driven anomaly characterization. In: Proceedings of the 4th USENIX Conference on File and Storage Technologies (FAST), San Francisco, CA (2005)
12. Lu, W., Ghorbani, A.A.: Network anomaly detection based on wavelet analysis. EURASIP J. Adv. Signal Process. 2009, 1-16 (2009). doi:10.1155/2009/837601
13. Callegari, C., Giordano, S., Pagano, M., Pepe, T.: On the use of sketches and wavelet analysis for network anomaly detection. pp. 331-335. ACM (2010)
14. Zhang, L., Zhu, Z., Marron, J.: MultiResolution Anomaly Detection Method for Long Range Dependent Time Series. arXiv preprint arXiv:0809.1281 (2008)
15. Murray, J.F., Hughes, G.F., Kreutz-Delgado, K.: Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application. J. Mach. Learn. Res. 6, 783-816 (2005)
16. Greene, W.H.: Econometric Analysis, 5th edn. Prentice Hall, Upper Saddle River, NJ (2003)
17. Ingle, V.K., Proakis, J.G.: Digital Signal Processing Using MATLAB. Cengage Learning (2011)
18. Misiti, M., Misiti, Y., Oppenheim, G., Poggi, J.M.: Wavelet Toolbox User's Guide. The MathWorks (2005)
