A MULTI-MODE INTERNET PROTOCOL INTRUSION DETECTION SYSTEM
A THESIS SUBMITTED TO THE COUNCIL OF THE FACULTY OF SCIENCE AND SCIENCE EDUCATION, SCHOOL OF SCIENCE, UNIVERSITY OF SULAIMANI, IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE
BY DEEMAN YOUSIF MAHMOOD B.SC. COMPUTER SCIENCE (2008), UNIVERSITY OF KIRKUK
SUPERVISED BY DR. MOHAMMED ABDULLAH HUSSEIN ASSISTANT PROFESSOR
June (2014 A.D)
Pushpar (2714 K)
بسم الله الرحمن الرحيم
"وَمَا أُوتِيتُمْ مِنَ الْعِلْمِ إِلَّا قَلِيلًا"
("And you have not been given of knowledge except a little.")
صدق الله العظيم
(الإسراء: 85)
Supervisor Certification I certify that the preparation of this thesis entitled, "A Multi-mode Internet Protocol Intrusion Detection System" accomplished by (Deeman Yousif Mahmood) was prepared under my supervision at the School of Science, Faculty of Science and Science Education at the University of Sulaimani, as partial fulfillment of the requirements for the degree of Master of Science in Computer Science.
Signature:
Name: Asst. Prof. Dr. Mohammed Abdullah Hussein
University of Sulaimani, Electrical Engineering Department
Date: 25 / 03 / 2014
In view of the available recommendation, I forward this thesis for debate by the examining committee.
Signature:
Name: Dr. Kamaran HamaAli Faraj
University of Sulaimani, Head of Computer Science Department
Date: 25 / 03 / 2014
Linguistic Evaluation Certification
I hereby certify that this thesis entitled, "A Multi-mode Internet Protocol Intrusion Detection System", prepared by Deeman Yousif Mahmood, has been read and checked; after all grammatical and spelling mistakes were indicated, the thesis was returned to the candidate to make the adequate corrections. After the second reading, I found that the candidate had corrected the indicated mistakes. I therefore certify that this thesis is free from mistakes.
Signature:
Name: Jutiar Omer Salih
Position: English Department, School of Languages, University of Sulaimani
Date: 14 / 04 / 2014
Examining Committee Certification
We certify that we have read this thesis entitled "A Multi-mode Internet Protocol Intrusion Detection System", prepared by (Deeman Yousif Mahmood), and that, as the Examining Committee, we examined the student on its content and on what is related to it; in our opinion it meets the basic requirements for the degree of Master of Science in Computer Science.
Signature:
Name: Dr. Subhi R. M. Zebari
Title: Assistant Professor
Date: 20 / 7 / 2014
(Chairman)

Signature:
Name: Dr. Suzan Abdulla Mahmood
Title: Assistant Professor
Date: 17 / 7 / 2014
(Member)

Signature:
Name: Dr. Kamaran HamaAli Faraj
Title: Lecturer
Date: 21 / 7 / 2014
(Member)

Signature:
Name: Dr. Mohammed A. Hussein
Title: Assistant Professor
Date: 17 / 7 / 2014
(Supervisor, Member)
Approved by the Dean of the Faculty of Science.
Signature:
Name: Dr. Bakhtiar Qader Aziz
Title: Professor
Date: 7 / 8 / 2014
(The Dean)
Dedication
This thesis is dedicated to my parents, for their endless love, support, and encouragement, and for being a source of motivation and strength during moments of despair and discouragement.
Acknowledgments
Behind every successful work there is a great deal of devotion, hard work, effort, and sacrifice. Thanks to Allah for giving me this opportunity, and the strength and patience to finally complete my dissertation after all the challenges and difficulties.

This work would not have made it to this stage without the guidance of Dr. Mohammed Abdullah Hussein; I would like to thank him for introducing me to this interesting problem of network security. His knowledge, support, and guidance contributed greatly to the success of this work.

I would also like to express my gratitude to all the teaching staff at the University of Sulaimani, School of Science, Computer Science Department, who taught me during my Master courses; I really appreciate your efforts, encouragement, and valuable instruction.

Profound thanks to Prof. Dr. Hussein H. Khanaqa, former president of Kirkuk University, for his encouragement during my work in the rector's office at the presidency of Kirkuk University (2009-2012), and for his valuable advice during my study, drawn from more than 35 years of experience in directing and supervising.

I also have to thank all my friends for their support, encouragement, and assistance in more ways than I can list. Finally, I take this opportunity to thank my family for their moral support throughout my life; in particular, my parents, who stood behind me and inspired me during my entire studies. Their support and guidance gave me the power to struggle and survive during hard times.
Abstract

Intrusion Detection Systems (IDSs) are gaining more and more scope in the field of secure networks, and new ideas and concepts regarding intrusion detection keep surfacing. Various services offered on the Internet suffer from being unavailable to authorized users because of Denial-of-Service (DoS) attacks. These attacks are the main concern of this thesis, which implements a semi-supervised hybrid IDS that uses machine learning techniques to judge whether network traffic is normal or abnormal (an attack). To show the applicability of the proposed intrusion detection approach, the Knowledge Discovery and Data mining (KDD) Cup 99 dataset is used; it is considered a standard dataset for evaluating security detection mechanisms and has served well to demonstrate that machine learning can be useful in intrusion detection. Two machine learning algorithms are applied to the basic security model to construct a semi-supervised hybrid technique for detecting intrusions: K-means clustering (for unsupervised learning) and the Decision Tree algorithm (for supervised learning). These algorithms, together with information gain attribute ranking, are used to filter and classify network packets. Although K-means has been used previously for detecting intrusions, the addition of feature ranking yields better results than using K-means alone. With K-means, packets are classified as either normal or DoS packets; the DoS cluster feeds the Decision Tree (DT), whose addition makes attack-type classification possible. Through the DT, a hybrid system has been established. The result is an IDS that is effective in detecting network intrusions, as shown by the obtained high detection rates and low error rates (DR = 98.2143%, Error Rate = 1.7857% for K-means; DR = 99.9136%, Error Rate = 0.0864% for the C4.5 Decision Tree).
CONTENTS

Abstract ........................................................ i
Contents ........................................................ ii
List of Tables .................................................. v
List of Figures ................................................. vi
List of Abbreviations ........................................... vii

Chapter One: Introduction
  1.1 Overview .................................................. 1
  1.2 Literature Survey ......................................... 3
  1.3 Aim of the Thesis ......................................... 6
  1.4 Thesis Outlines ........................................... 6

Chapter Two: Intrusion Detection and Data Mining
  2.1 Introduction .............................................. 7
  2.2 Definitions and Terminology ............................... 8
  2.3 Intrusion Detection System (IDS) .......................... 11
  2.4 Types of Intrusion Detection System ....................... 12
    2.4.1 Host-Based IDS ........................................ 13
    2.4.2 Network-Based IDS ..................................... 13
  2.5 Intrusion Detection System Components and Requirements .... 14
  2.6 Intrusion Detection Techniques ............................ 16
    2.6.1 Anomaly Intrusion Detection ........................... 17
    2.6.2 Misuse Intrusion Detection ............................ 18
  2.7 Learning Procedures ....................................... 19
  2.8 Common Attacks and Vulnerabilities in NIDS ................ 20
  2.9 Technical Discussion ...................................... 21
    2.9.1 Internet Protocol – IP ................................ 22
    2.9.2 Transmission Control Protocol – TCP ................... 22
  2.10 IP Spoofing .............................................. 24
    2.10.1 Denial of Service Attack ............................. 25
  2.11 Data Mining and Intrusion Detection System ............... 27
  2.12 Feature Selection (FS) ................................... 28
    2.12.1 General Methods for Feature Selection ................ 30
    2.12.2 Information Gain (IG) Feature Selection .............. 31
  2.13 Clustering Algorithms .................................... 32
    2.13.1 Classification of Clustering Algorithms .............. 33
    2.13.2 K-means Algorithm .................................... 34
  2.14 Decision Tree ............................................ 35
    2.14.1 C4.5 Decision Tree Algorithm ......................... 36
  2.15 Dataset Collection ....................................... 38
    2.15.1 Attacks in KDD Cup 99 Dataset ........................ 39
    2.15.2 Features of KDD Cup 99 Dataset ....................... 39

Chapter Three: Proposed System Methodology
  3.1 Introduction .............................................. 42
  3.2 Dataset Pre-Processing .................................... 42
    3.2.1 Dataset Transformation ................................ 42
    3.2.2 Dataset Normalization ................................. 43
  3.3 Proposed Detection Model .................................. 44
  3.4 Information Gain Feature Selection ........................ 46
  3.5 K-means Clustering for the Proposed System ................ 47
    3.5.1 Distance Calculation .................................. 49
  3.6 Decision Trees as a Model for Intrusion Detection ......... 51

Chapter Four: Implemented Results and Discussions
  4.1 Introduction .............................................. 55
  4.2 Training and Testing the Dataset .......................... 55
  4.3 Experiment 1: Results of Pre-processing ................... 55
    4.3.1 Transformation and Normalization ...................... 55
    4.3.2 Features Ranking and Subset Selection ................. 59
  4.4 Experiment 2: K-means Clustering (First Layer) ............ 61
  4.5 Experiment 3: C4.5 Decision Tree (Second Layer) ........... 66
  4.6 The Graphical User Interface (GUI) ........................ 67

Chapter Five: Conclusions and Future Works
  5.1 Conclusions ............................................... 71
  5.2 Future Works .............................................. 73

References ...................................................... 74
Appendices
List of Tables

Table No.  Table Title                                                                                  Page No.
2.1        Confusion Matrix                                                                             10
2.2        Comparison of Intrusion Detection Techniques                                                 16
2.3        Basic Features of TCP Connection                                                             40
2.4        Content Features of the TCP Connection                                                       41
2.5        Time Based Features of the TCP Connection                                                    41
3.1        Transformation Table for Different Values of Protocols, Flag and Services                    43
4.1        Sample Records of KDD Cup 99                                                                 56
4.2        Transformed Nominal Data and Normalized Numeric Data Samples of KDD Cup 99 Dataset           57
4.3        Proportions of the Normal and DoS Classes in the Data Subset                                 58
4.4        Attribute Ranking by Information Gain                                                        59
4.5        Attribute Ranking Using GainR for C4.5 DT                                                    60
4.6        Attributes Centroid Using Euclidean Distance Metric for 20 Features with Highest Ranking     62
4.7        Attributes Centroid Using Manhattan Distance Metric for 20 Features with Highest Ranking     63
4.8        Evaluation and Results of K-means with Distance Functions Using the Full Dataset             64
4.9        Evaluation and Results of K-means with Distance Functions Using the Highest 10 Features Ranked by IG   64
4.10       Evaluation and Results of K-means with Distance Functions Using the Highest 20 Features Ranked by IG   65
4.11       Evaluation and Results of C4.5 Algorithm                                                     66
List of Figures

Figure No.  Figure Title                                                       Page No.
2.1         OSI Model                                                          21
2.2         IP Packet Header                                                   22
2.3         TCP Packet Header                                                  23
2.4         Types of Clustering Methods                                        34
2.5         Example of Decision Tree for IDS Classification                    38
3.1         Records of the KDD Cup 99 Dataset                                  43
3.2         Records of the KDD Cup 99 Dataset After Transformation             44
3.3         Proposed Detection Model Structure                                 45
3.4         First Layer of Proposed Detection Model                            47
3.5         K-means Clustering Flowchart                                       48
3.6         Euclidean Distance between Two Points                              49
3.7         Manhattan Distance between Two Points                              50
3.8         Decision Tree Structure for DoS Attack Classification              54
4.1         Comparative Chart of Distance Functions Values Using K-means       65
4.2         Main GUI of the Detection Model                                    68
4.3         Capturing and Classification of Network Traffics by the System     68
4.4         Extracting Normal and Attack Packets from Captured Packets         69
4.5         Log File of Captured Packets                                       70
List of Abbreviations

Abbreviation   Description
Acc            Accuracy
ACK            Acknowledge
ATM            Automated Teller Machine
CFS            Correlation-based Feature Selection
DDoS           Distributed Denial of Service attack
DNS            Domain Name Server
DoS            Denial of Service attack
DR             Detection Rate
DT             Decision Tree
ES             Expert System
FCBF           Fast Correlation-Based Feature selection
FN             False Negative
FNR            False Negative Rate
FP             False Positive
FPR            False Positive Rate
FS             Feature Selection
FSA            Feature Selection Algorithm
FTP            File Transfer Protocol
GainR          Gain Ratio
GUI            Graphical User Interface
HIDS           Host-based Intrusion Detection System
HTTP           Hyper Text Transfer Protocol
ICMP           Internet Control Message Protocol
IDE            Integrated Development Environment
IDS            Intrusion Detection System
IG             Information Gain
IP             Internet Protocol
JDK            Java Development Kit
KDD            Knowledge Discovery in Database
MAE            Mean Absolute Error
MITM           Man In The Middle
ML             Machine Learning
MSE            Mean Square Error
NIDES          Next-generation Intrusion Detection Expert System
NIDS           Network-based Intrusion Detection System
OSI            Open Systems Interconnection
PCA            Principal Component Analysis
PoD            Ping of Death
PPV            Positive Predictive Value
R2L            Remote to Local
RMSE           Root Mean Squared Error
SOM            Self-Organizing Maps
SQL            Structured Query Language
SVM            Support Vector Machines
Sr. No.        Serial Number
SYN            Synchronize
TCP            Transmission Control Protocol
TN             True Negative
TNR            True Negative Rate
TP             True Positive
TPR            True Positive Rate
U2R            User to Root
Chapter One Introduction
1.1 Overview

The world has seen rapid advances in science and technology in the last two decades, enabling a wide spectrum of human needs to be met effectively. These needs vary from simple day-to-day activities such as online shopping, online ticket booking, online banking, and e-libraries [1]. These technologies have made life easier for average people but harder for security experts and network administrators, because alongside them has grown a threatening parallel technology: that of compromising security, with effects detrimental to the use of technology. This includes attacks on information, such as stealing private information, hacking, and outages of services [2]. The media and other network security literature report the possible existence of underground anonymous attack networks which can effectively attack any given target at any time [3].

An intrusion into a computer system does not need to be executed manually by a person; it may be executed automatically by engineered software. A well-known example of this is the Slammer worm (also known as Sapphire), which performed a global Denial of Service (DoS) attack in 2003. The worm exploited a vulnerability in Microsoft's SQL Server, which allowed it to disable database servers and overload networks. Slammer was the fastest computer worm in history and affected approximately 75,000 computer systems around the world within 10 minutes. Not only did the Slammer worm restrict general Internet traffic, it caused network outages and unforeseen consequences such as canceled airline flights, interference with elections, and ATM failures [4].
There are several mechanisms that can be adopted to increase the security of computer systems. A commonly used three-level protection is [5]:

- Attack prevention: firewalls, user names and passwords, and user rights.
- Attack avoidance: encryption.
- Attack detection: intrusion detection systems.

Despite adopting mechanisms such as cryptography and protocols to control the communication between computers (and users), it is impossible to prevent all intrusions. Firewalls serve to block and filter certain types of data or services from users on a host computer or a network of computers, aiming to stop some potential misuse by enforcing restrictions. However, firewalls are unable to handle any form of misuse occurring within the network or on a host computer. Furthermore, intrusions can occur in traffic that appears normal [6].

IDSs do not replace the other security mechanisms but complement them by attempting to detect when malicious behavior occurs. The purpose of an IDS, in general terms, is to detect network traffic in which the behavior of a user conflicts with the intended use of the computer or computer network, e.g., committing fraud, hacking into the system to steal information, or conducting an attack to prevent the system from functioning properly or even to break it down. Before the 1990s, intrusion detection was performed by system administrators manually analyzing logs of user behavior and system messages, with poor chances of detecting intrusions in progress [7]. Due to the increased use of computers, the magnitude of data in contemporary computer networks still renders this a significant challenge. While the range of attacks that can be performed on targets is as broad as the spectrum of constructive technology itself, this thesis deals with a particular class of attacks known as Denial of Service (DoS) attacks, which mostly use IP spoofing.

DoS attacks are a class of attacks which aim at exhausting target resources, thereby denying service to valid users [3].
1.2 Literature Survey

As networks have dramatically expanded, security has become a major issue. Internet attacks are increasing, and various attack methods have emerged; researchers and companies have analyzed these methods, and below is a survey of some related research.

In 1980, the concept of intrusion detection began with Anderson's seminal paper [8]; he introduced a threat classification model that develops a security monitoring surveillance system based on detecting anomalies in user behavior.

In 1995, Anderson et al. [9] designed the Next-generation Intrusion Detection Expert System (NIDES) to operate in real time and detect intrusions as they occur. NIDES is a comprehensive system that uses innovative statistical algorithms for anomaly detection, as well as an expert system that encodes known intrusion scenarios.

Again in 1995, Kummer [10] classified intrusions based on the "signatures" (patterns) they leave in the audit trail of the system. The classification is intended for use in intrusion detection systems based on pattern matching.

In 2002, Andrew et al. [11] used the KDD Cup 1999 dataset for training and testing their model. Data were classified into two classes: Normal (+1) and Attack (-1). They used the SVM-light freeware package. For data reduction, they applied SVMs to identify the most significant features for detecting attack patterns; the procedure is to delete one feature at a time and train SVMs with the same dataset. By this process, 13 of the 41 features of the KDD Cup 1999 dataset were identified as most significant: 1, 2, 3, 5, 6, 9, 23, 24, 29, 32, 33, 34, and 36. Training was done using the RBF (Radial Basis Function) kernel option. In their
experiment, the authors obtained 98.9% accuracy for the true negative case and 99.7% accuracy for the true positive case.

In 2005, Mitrokotsa and Douligeris [12] proposed an approach that detects DoS attacks using Emergent Self-Organizing Maps. The approach is based on classifying "normal" traffic against "abnormal" traffic in the sense of DoS attacks. It permits the automatic classification of events contained in logs and the visualization of network traffic. Extensive simulations showed the effectiveness of this approach compared to previously proposed approaches regarding false alarms and detection rates.

In 2008, Rajesh and Shina [13] proposed an analysis of the best feature selection method for a network intrusion detection model. In their paper they analyzed three measures, namely the Chi-square, Information Gain, and Gini Index methods for feature selection; these are filter-based approaches, and all three were tested on an open-source tool (version 3.4). The results proved that Information Gain, when used for feature selection, produces accurate results by accurately detecting the least prominent attack in the dataset.

In 2009, Bian et al. [14] used the K-means algorithm to cluster and analyze the data of the KDD Cup 99 dataset. The simulation results on the KDD Cup 99 dataset showed that K-means is an effective algorithm for partitioning large datasets and can detect unknown intrusions with a detection rate of 96%.

In 2010, Affendey et al. [15] compared the efficiency of machine learning methods in intrusion detection systems, including the Classification Tree and Support Vector Machines (SVMs); the same dataset was evaluated with the two data mining approaches, and the Classification Decision Tree algorithm detected attacks at a much greater rate than the SVMs. The correlation between the
samples was measured using min-max normalization. The results show that the C4.5 Classification Decision Tree algorithm gives fewer false alarms than SVM.

Again in 2010, Bharti et al. [16] used the fuzzy K-means clustering algorithm and random forest tree classification techniques for assigning a cluster to a particular class. From the experimental results it is observed that, for two-class datasets, the combination of clustering with a random forest tree gives better results than clustering alone.

In 2012, Bhaskar and Kumar [17] presented an approach for identifying network anomalies by visualizing network flow data stored in weblogs. Various clustering techniques can be used to identify different anomalies in the network. They presented a new approach based on simple K-means for analyzing the network flow of data using different attributes, such as IP address, protocol, and port number, to detect anomalies. Using visualization, they can identify which sites are most frequently accessed by users. In their approach they provide an overview of a given dataset by studying key network parameters, and they use preprocessing techniques to eliminate unwanted attributes from the weblog data.

Since it is challenging for IDSs to maintain high accuracy, and an IDS that uses attack signatures cannot discover new attacks, signature-based IDSs are becoming incapable of protecting computer systems; a detection approach able to detect new attacks is therefore necessary for building a reliable and efficient IDS. For this purpose, the K-means clustering algorithm, an unsupervised data mining approach, is deployed in the first layer of the proposed IDS model; it is self-administering and can learn new patterns within the dataset without any outside interference (i.e., from an administrator). The C4.5 DT, a very accurate and easy-to-use classifier, is deployed in the second layer for classifying DoS attack types.
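To make the two-layer idea concrete, the following is a minimal, hypothetical sketch, not the thesis's actual implementation (which uses information-gain feature ranking and the KDD Cup 99 features): scikit-learn's KMeans stands in for the unsupervised first layer and DecisionTreeClassifier for the supervised second layer, with synthetic data in place of real traffic records.

```python
# Hypothetical sketch of a two-layer hybrid detector: K-means (layer 1)
# separates traffic into a "normal" and an "attack" cluster, then a
# decision tree (layer 2) classifies the flagged records by attack type.
# All data below are synthetic stand-ins for KDD Cup 99 feature vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.5, size=(200, 4))   # pretend normal traffic
attack = rng.normal(3.0, 0.5, size=(200, 4))   # pretend DoS traffic
attack_type = rng.integers(0, 3, size=200)     # 3 made-up DoS subtypes
X = np.vstack([normal, attack])

# Layer 1: unsupervised K-means with k = 2 (normal vs. attack clusters).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Heuristic for this synthetic data (not a general rule): treat the
# cluster whose centroid lies farther from the origin as the attack one.
dists = np.linalg.norm(km.cluster_centers_, axis=1)
attack_cluster = int(np.argmax(dists))
flagged = X[km.labels_ == attack_cluster]

# Layer 2: a decision tree trained on labelled attack records assigns
# an attack type to every record the first layer flagged.
tree = DecisionTreeClassifier(random_state=0).fit(attack, attack_type)
types = tree.predict(flagged)
print(len(flagged), "records flagged as attacks")
```

The key design point this sketch illustrates is that only the records the first layer flags ever reach the second layer, so the tree specializes in distinguishing attack subtypes rather than separating attacks from normal traffic.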
1.3 Aim of the Thesis

The aim of this thesis is to design an efficient IDS to detect DoS attacks in a NIDS. The thesis provides a survey of the state of the art in hybrid approaches applied to IDSs and ends with the implementation of a system that combines the unsupervised K-means and supervised Decision Tree algorithms. Additionally, it shows that each class of attacks can be treated separately, as the thesis focuses only on DoS attacks. In fact, at least one algorithm can be assigned to detect one class of attacks instead of using a single algorithm to detect all classes of attacks.
1.4 Thesis Outlines

The rest of the thesis is organized as follows:

Chapter Two (Intrusion Detection and Data Mining): deals with the concept of intrusion detection systems. It also covers the different types of IDSs, explains what a network-based IDS is, and presents the types of machine learning, the algorithms used, the different types of attack, and the concept of IP spoofing.

Chapter Three (Proposed System Methodology): covers the overall design of the IDS regarding the pre-processing, the algorithms, and the overall proposed detection model structure.

Chapter Four (Implemented Results and Discussions): presents the results of the functionality and efficiency tests of the implemented IDS model.

Chapter Five (Conclusions and Future Works): covers concluding remarks on the IDS and the whole work of this thesis, and suggests some possibilities for future work.
Chapter Two Intrusion Detection and Data Mining
2.1 Introduction

Computer networks have expanded significantly in use and number, and this expansion makes them a target for different attacks [18]. In today's era of Information Technology, the sharing of resources and information in interconnected networks is obviously essential; but to secure this information from unauthorized use and manipulation, it is necessary to impose some restrictions. Some of the tools developed for these purposes are firewalls, anti-virus programs, and intrusion detection programs [19]. The use of an intrusion detection system is becoming common due to the increase in attack complexity and the evolution of computer systems.

Generally, an intrusion detection system works in a pre-defined manner regardless of the implementation mechanism selected. These are some common steps followed by an intrusion detection system [20]:

1. Data is captured, often in the form of IP packets.
2. The data are decoded and transformed into a uniform format through the process of feature extraction.
3. The data are then analyzed in a manner specific to the individual IDS, and classified as threatening or not.
4. Alerts are generated if a threatening pattern is encountered.

Computer and data security is a complex topic. The goals of computer security are [21]:
1. Data Confidentiality: protection of data so that it is not disclosed in an unauthorized fashion. 2. Data Integrity: protection against unauthorized modification of data. 3. Data Availability: protection from unauthorized attempts to withhold information or computer resources. This chapter starts with an introduction to the concept of intrusion detection system and the components of intrusion detection system. Algorithms and techniques of IDS that are used in this thesis are discussed.
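The generic four-step detection loop described in the introduction above (capture, decode and extract features, analyze and classify, alert) can be sketched as a minimal processing loop. Everything here is a hypothetical placeholder: the packet records, the feature extraction, and the toy threshold rule stand in for a real capture library and a real classifier.

```python
# Minimal sketch of the generic four-step IDS loop: capture ->
# decode/extract features -> analyze/classify -> alert.
# The packet source and the detection rule are hypothetical placeholders.

def capture_packets():
    # Stand-in for a live capture (e.g., from a packet-capture library).
    return [{"src": "10.0.0.5", "dst": "10.0.0.9", "syn": 1, "bytes": 40},
            {"src": "10.0.0.7", "dst": "10.0.0.9", "syn": 0, "bytes": 1500}]

def extract_features(pkt):
    # Decode the raw record into a uniform feature vector.
    return (pkt["syn"], pkt["bytes"])

def classify(features):
    # Placeholder analysis rule: tiny SYN-only packets look flood-like.
    syn, nbytes = features
    return "threat" if syn == 1 and nbytes < 100 else "normal"

alerts = []
for pkt in capture_packets():
    if classify(extract_features(pkt)) == "threat":
        alerts.append(pkt["src"])   # step 4: generate an alert

print(alerts)   # prints ['10.0.0.5']
```

A real system replaces each placeholder independently, which is exactly why the four stages are usually described as separate steps.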
2.2 Definitions and Terminology

An Intrusion Detection System (IDS) employs techniques for modeling and recognizing intrusive behavior in a computer system. When referring to the performance and measurement factors of IDSs, the following terms are often used:

Alarm: a signal suggesting that a system has been or is being attacked.
True positive (TP): correctly classifying an intrusion as an intrusion. The true positive rate is synonymous with detection rate, sensitivity, and recall, terms often used in the literature.
False positive (FP): incorrectly classifying normal data as an intrusion, also known as a false alarm.
True negative (TN): correctly classifying normal data as normal. The true negative rate is also referred to as specificity.
False negative (FN): incorrectly classifying an intrusion as normal.
In particular, the following measures will be used to assess the IDS's performance. The performance metrics are calculated as follows:
True Positive Rate (TPR) = TP / (TP + FN) = #Correct Intrusions / #Intrusions          (Eq. 2.1)

False Positive Rate (FPR) = FP / (FP + TN) = #Normal as Intrusions / #Normal           (Eq. 2.2)

True Negative Rate (TNR) = TN / (TN + FP) = #Correct Normal / #Normal                  (Eq. 2.3)

False Negative Rate (FNR) = FN / (FN + TP) = #Intrusion as Normal / #Intrusions        (Eq. 2.4)
True Positive Rate is also referred to as Sensitivity or Recall, and precision is also referred to as Positive Predictive Value (PPV). True Negative Rate is also called Specificity.
Additional performance metrics are commonly used, referred to as accuracy, error rate, precision, and F-measure:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = #Correct Classifications / #All Instances (Eq. 2.5)

Error rate = 1 − Accuracy                                                              (Eq. 2.6)

Precision = TP / (TP + FP) = #Correct Intrusions / #Instances Classified as Intrusion  (Eq. 2.7)

F-measure = 2 × (Precision × Recall) / (Precision + Recall)                            (Eq. 2.8)
Accuracy is the most basic measure of the performance of a learning method. This measure determines the percentage of correctly classified instances and the overall classification rate, while F-measure is a measure of a test's accuracy. It considers both the precision and the recall of the test. The F-measure can be
interpreted as a weighted average of the precision and recall, where the F-measure reaches its best value at 1 and its worst score at 0.

These metrics are derived from a basic data structure known as the confusion matrix [22,23], which contains information about the actual and predicted classifications made by a classification system. A sample confusion matrix for a two-class case is shown in Table 2.1.

Table 2.1: Confusion Matrix

                              Predicted Class
    Activity                  Attack        Normal
    Actual Class  Attack      TP            FN
                  Normal      FP            TN
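The metrics of Eqs. 2.1-2.8 follow directly from the four confusion-matrix counts. A short sketch with made-up example counts (the values 90/10/5/95 are illustrative, not results from this thesis):

```python
# Computing the performance metrics of Eqs. 2.1-2.8 from the four
# confusion-matrix counts. The counts here are made-up example values.
TP, FN, FP, TN = 90, 10, 5, 95

tpr = TP / (TP + FN)                          # Eq. 2.1: detection rate / recall
fpr = FP / (FP + TN)                          # Eq. 2.2: false alarm rate
tnr = TN / (TN + FP)                          # Eq. 2.3: specificity
fnr = FN / (FN + TP)                          # Eq. 2.4
accuracy = (TP + TN) / (TP + TN + FP + FN)    # Eq. 2.5
error_rate = 1 - accuracy                     # Eq. 2.6
precision = TP / (TP + FP)                    # Eq. 2.7
f_measure = 2 * precision * tpr / (precision + tpr)   # Eq. 2.8

print(f"TPR={tpr:.3f} FPR={fpr:.3f} Acc={accuracy:.3f} F={f_measure:.3f}")
# TPR=0.900 FPR=0.050 Acc=0.925 F=0.923
```

Note that TPR and FNR sum to 1, as do TNR and FPR, which is why reporting only the detection rate and the false alarm rate is usually sufficient.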
Another evaluation method is to calculate the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) values; smaller values indicate a classifier of higher quality. MAE is the average absolute difference between the classifier's predicted output and the actual output, while RMSE is the square root of the Mean Square Error (MSE), which is the average of the squared differences between the classifier's predicted output and the actual output.
MAE = (1/N) Σ_{i=1..N} |Desired_i − Actual_i|                (Eq. 2.9)

MSE = (1/N) Σ_{i=1..N} (Desired_i − Actual_i)²               (Eq. 2.10)

RMSE = √( (1/N) Σ_{i=1..N} (Desired_i − Actual_i)² )         (Eq. 2.11)
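Eqs. 2.9-2.11 can be computed in a few lines; the desired/actual vectors below are a small made-up example, not data from this thesis:

```python
# MAE, MSE, and RMSE (Eqs. 2.9-2.11) between desired and predicted
# outputs, shown on a small made-up example vector.
import math

desired = [1, 0, 1, 1, 0]
actual  = [1, 0, 0, 1, 1]   # two of the five predictions are wrong
n = len(desired)

mae  = sum(abs(d - a) for d, a in zip(desired, actual)) / n    # Eq. 2.9
mse  = sum((d - a) ** 2 for d, a in zip(desired, actual)) / n  # Eq. 2.10
rmse = math.sqrt(mse)                                          # Eq. 2.11

print(mae, rmse)   # 0.4 0.632...
```

For 0/1 class labels, as here, MAE equals the error rate; RMSE additionally penalizes large deviations, which only matters for non-binary outputs.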
2.3 Intrusion Detection System (IDS)

An intrusion can be defined as any set of actions that attempt to compromise the integrity, confidentiality, or availability of resources; intrusion detection is therefore required as an additional wall for protecting systems [24]. An intrusion detection system (IDS) is a security layer used to discover ongoing intrusive attacks and anomalous activities in information systems, and it usually operates in a dynamically changing environment. There are two types of intrusion detection systems, host-based and network-based, and they usually differ in the detection techniques they use, ranging from misuse detection and anomaly detection to supervised and unsupervised learning [24,25].

IDSs perform the following operations in order to identify an intrusion [26]:

- Manual log examination.
- Automated log examination.
- Host-based intrusion detection software.
- Network-based intrusion detection software.
- Audit of system structure and faults.
- Audit-trace management of the operating system and recognition of user behavior against the security policy of an organization.
- Statistical analysis of abnormal activities.
- Monitoring and analyzing user and system activities.
- Recognition of activity models for identifying known attacks and generating an alarm as an indication of attack.
- Measuring the confidentiality and integrity of the system and data files.

Manual log examination can be effective, but it can also be time-consuming and prone to error; human beings are simply not good at manually reviewing computer logs. A better form of log examination is to create programs or scripts that
can search through computer logs looking for potential anomalies. Intrusion detection systems were once touted as the solution to the entire security problem: no longer would we need to protect our files and systems, we could simply identify when someone was doing something wrong and stop them [26]. In fact, some intrusion detection systems were marketed with the ability to stop attacks before they succeeded. Strictly speaking, an IDS does not prevent an intrusion from occurring; it detects the intrusion and reports it to the system operator. No intrusion detection system is foolproof, so IDSs cannot replace a good security program or good security practice, and they will not detect legitimate users who may have incorrect access to information. The implementation of intrusion detection mechanisms should not be considered until the majority of high-risk areas have been addressed. Intrusion detection is broadly considered to be a classification problem [26]: the main issue in a standard classification problem lies in minimizing the probability of error when making the classification decision, so the key point is how to choose an effective classification method that yields an accurate intrusion detection system with a high detection rate while keeping a low false alarm rate [27,28].
2.4 Types of Intrusion Detection Systems
There are several types of intrusion detection systems, and the choice of which one to use depends on the overall risks to the organization and the resources available [22]. One classification of IDSs is established by the resource they monitor: according to their location, IDSs are divided into two primary categories, host-based (HIDS) and network-based (NIDS). As the name suggests, a HIDS is located on the host computer. It analyzes audit-trail data, such as user logs and system calls (calls to functions provided by the operating-system kernel), on the host where it is installed and looks for indications of attacks on that host [29].
A NIDS, on the other hand, resides on a separate system that watches network traffic, looking for indications of attacks that traverse that portion of the network, and intercepts packets passing through the network in order to analyze them and detect possible intrusion attempts. The current trend in intrusion detection is to combine both host-based and network-based information to develop hybrid systems [26,30].
2.4.1 Host-Based IDS
A host-based IDS operates on data collected from a single computer system (host). These data can come from the innermost part of the host's operating system (audit data) or from system log data, and the host-based IDS uses them to detect traces of an attack. A HIDS is usually deployed on the host system and uses the host's computational infrastructure, which leads to performance degradation. Because it is deployed on individual hosts, configuration is difficult, as different hosts may have different behaviors and usage [31]. A HIDS has access to detailed information on system events, but it may be disabled or rendered useless by an attacker who successfully gains administrative privileges on the protected machine. An intrusion that installs a rootkit (a piece of software that installs itself as part of the operating-system kernel) is able to hide traces of anomalous system activities [32]. Once the rootkit is installed, it enables the attacker to cover the traces of malicious activities by cleaning system logs and hiding information about malicious processes at the kernel level.
2.4.2 Network-Based IDS A network-based IDS acquires and examines network traffic packets for signs of intrusion. A network-based IDS comprises a set of dedicated sensors or hosts which scan network traffic data to detect attacks or intrusive behaviors and protects the hosts connected to the network [31].
The major advantages of a network-based IDS include its ability to scan large networks in a transparent way without affecting the normal operation of the network. It also has the ability to scan the traffic passively without being visible, which makes it invisible to attackers and makes the network more secure [34]. A NIDS analyzes packets crossing an entire network segment and therefore has the advantage of being able to protect a higher number of hosts at the same time. However, it can suffer from performance problems due to the large amount of traffic it needs to analyze in real time, and it can be subjected to attacks that exploit ambiguities in network protocols and exhaust the memory and computational resources of the IDS [33]. The major disadvantages of a network-based IDS are its inability to handle encrypted data, its incapacity to report whether an attack was successful or not, and its inability to handle fragmented packets (which can make the IDS unstable); furthermore, it can report only the initiation of an attack [34]. A network-based IDS also cannot easily monitor encrypted communications and is inherently unable to monitor intrusive activities that do not produce externally observable evidence.
2.5 Intrusion Detection System Components and Requirements
IDS components can be summarized from two perspectives [35]:
1. From an algorithmic perspective: features, which capture intrusion evidence from audit data, and models, which infer attacks from that evidence.
2. From a system-architecture perspective: audit data processor, knowledge base, decision engine, alarm generation, and responses.
The requirements for developing an IDS can be listed at two levels of abstraction [36]:
1. High-Level Requirements:
- Develop an application capable of sniffing the traffic to and from the host machine.
- Develop an application that is capable of analyzing the network traffic and detecting numerous pre-defined intrusion attacks and mappings.
- Develop an application that warns the owner of the host machine about the likely occurrence of an intrusion attack.
- The application should block traffic to and from a machine identified as potentially malicious, usually as defined by the owner of the host machine.
2. Low-Level Requirements:
- Develop an application capable of displaying the incoming and outgoing traffic of the host machine, in the form of packets, to the owner of the host.
- An application is required that detects the occurrence of Denial of Service (DoS) attacks such as Smurf and SYN flood.
- Develop an application that detects attempts to map the network of the host, using techniques such as Efficient Mapping and Cerebral Mapping.
- An application is required that detects actions attempting to gain unauthorized access to the services provided by the host machine using techniques such as port scanning.
- An application that maintains a "Log Record" of intrusion attacks identified on the host in the present session and displays it upon request.
- Activation or de-activation of each of the attack detection methods should be possible.
- Provide a selection procedure through which the user of the host can frame rules that explicitly specify the set of IP addresses to be blocked or allowed. These rules shall determine the flow of traffic at the host.
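The rule-framing requirement above can be sketched as a first-match rule table. The rule set, addresses, and helper names here are illustrative assumptions, not part of the thesis system:

```python
import ipaddress

# Hypothetical sketch of the rule facility described above: the host owner
# frames explicit allow/block rules over IP ranges; the first matching rule
# decides whether traffic from that source may flow.
RULES = [
    ("block", ipaddress.ip_network("203.0.113.0/24")),  # example "bad" range
    ("allow", ipaddress.ip_network("0.0.0.0/0")),       # default: allow the rest
]

def permitted(src_ip: str) -> bool:
    addr = ipaddress.ip_address(src_ip)
    for action, net in RULES:
        if addr in net:
            return action == "allow"
    return False  # no rule matched: deny by default

print(permitted("203.0.113.7"))   # False (blocked range)
print(permitted("198.51.100.9"))  # True  (default allow)
```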
2.6 Intrusion Detection Techniques
The techniques for intrusion detection can be divided into two categories:
- Anomaly Intrusion Detection
- Misuse Intrusion Detection
These techniques are categorized based upon different approaches such as statistics, data mining, and neural networks. Table 2.2 shows a comparison between different intrusion detection techniques [26].
Table 2.2: Comparison of Intrusion Detection Techniques

No.  Approach                  Detection Technique   Detection of    Detection of
                                                     Known Attack    Unknown Attack
1    Misuse Based Detection    Genetic Algorithm     Yes             No
2                              Expert System         Yes             No
3                              State Transition      Yes             No
4    Anomaly Based Detection   Data Mining           Yes             Yes
5                              Rule Based            Yes             Yes
6                              Decision Tree         Yes             Yes
7                              Statistical           Yes             Yes
8                              Signature             Yes             Yes
9                              Neural Network        Yes             Yes
Intrusion detection methods may also include the detection using supervised and unsupervised learning. Supervised learning methods for intrusion detection can only detect known intrusions, while unsupervised learning methods can detect intrusions that have not been learned previously. Examples of unsupervised learning for intrusion detection include K-means-based approaches and self-organizing feature maps.
2.6.1 Anomaly Intrusion Detection
This method works from the premise that "anomalies are not normal" [37,38]. Anomaly detection tries to determine whether a deviation from the established normal usage patterns can be flagged as an intrusion; it assumes that all intrusive activities are anomalous. Many anomaly detection techniques work on the principle of detecting deviations from normal behavior: a normal activity profile for a system is established, and all system states that vary from the established profile are classified as intrusions [38]. Anomaly detection techniques include statistical, neural network, immune system, file checking, and data mining methods [26]. Below is a brief description of each:
- Statistical based methods: monitor the user/network behavior by measuring certain variable statistics over time.
- Distance based methods: try to overcome the limitations of the statistical approach when the data are difficult to estimate in multidimensional distributions.
- Rule based methods: use a set of "if-then" implication rules to characterize computer attacks.
- State transition based methods: identify an intrusion by using a finite state machine that is deduced from the network. IDS states correspond to different states of the network, and an event makes a transition in
this finite state machine. An activity is identified as an intrusion if the state transitions in the finite state machine of the network reach an abnormal state.
- Profile based methods: similar to rule based methods. Here, profiles of normal behavior are built for different types of network traffic, users, and devices; deviations from these profiles indicate intrusion.
- Model based methods: based on the differences between normal and abnormal behavior, modeling them without creating several profiles of each. In model based methods, researchers attempt to model the normal or abnormal behaviors, and deviation from this model indicates intrusion.
- Signature based methods: match signatures held in a database against data collected from activities in order to identify intrusions.
- Neural network based methods: a neural network model can be trained to distinguish between normal and attack patterns, and it can also identify the type of the attack.
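As a minimal illustration of the statistical methods listed above, the sketch below profiles a single traffic variable by its mean and standard deviation and flags values that deviate by more than a chosen number of deviations. The sample values and the 3-sigma threshold are illustrative assumptions:

```python
import statistics

# Build a "normal" profile of a traffic variable (packets per second) from
# illustrative baseline observations, then flag values far from the profile.
normal_pkts_per_sec = [100, 104, 98, 102, 97, 101, 99, 103]
mu = statistics.mean(normal_pkts_per_sec)
sigma = statistics.stdev(normal_pkts_per_sec)

def is_anomalous(value, threshold=3.0):
    # Flag any observation more than `threshold` standard deviations away.
    return abs(value - mu) / sigma > threshold

print(is_anomalous(101))   # False: within the normal profile
print(is_anomalous(5000))  # True: flagged as a deviation
```

A real detector would maintain many such variables and update the profile over time.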
2.6.2 Misuse Intrusion Detection
Misuse detection is the most common approach used in commercial IDSs. Misuse intrusion detection uses the patterns of known attacks or weak spots of the system to match and identify attacks [26]. Attacks are represented in the form of patterns or attack signatures, so that even variations of the same attack can be detected. Misuse detection mainly uses an expert system to identify intrusions based on an available knowledge base; this approach detects all the known attacks and tries to recognize known bad behavior [38]. Misuse detection techniques include genetic algorithms, expert systems, pattern matching, state transition analysis, and keystroke monitoring [26]. Below is a brief description of each:
- Genetic Algorithm based Detection (GAD): many researchers have used GAD in IDSs to detect malicious intrusions. The genetic algorithm provides
the necessary population breeding, randomizing, and statistics gathering functions.
- Expert System based Detection: an expert system is software, or combined software and hardware, capable of competently executing a specific task usually performed by a human expert. Expert systems are highly specialized computer systems capable of simulating human specialist knowledge and reasoning by using a knowledge base; they are characterized by a set of facts and heuristic rules. Heuristic rules are rules of thumb accumulated by a human expert through intensive problem solving in the domain of a particular task.
- State Transition based Detection: in this approach the IDS identifies an intrusion by using a finite state machine deduced from the network. IDS states correspond to different states of the network, and an event generates a transition in this finite state machine. An activity is identified as an intrusion if the state transition in the finite state machine reaches an abnormal state. The main problem in this technique is to find known signatures that include all possible variations of the pertinent attack, yet do not match non-intrusive activity.

2.7 Learning Procedures
Machine learning algorithms can be organized into a taxonomy based on the desired outcome of the algorithm or the type of input available during training [39,40]:
Supervised learning algorithms are trained on labeled examples. The supervised learning algorithm attempts to generalize a function or mapping from inputs to outputs which can then be used to speculatively generate an output for previously unseen inputs.
Unsupervised learning algorithms operate on unlabeled examples. Here the objective is to discover structure in the data (e.g. through a cluster analysis) for inputs where the desired output is unknown.
Semi-supervised learning combines both labeled and unlabeled examples to generate an appropriate function or classifier.
Reinforcement learning is concerned with how intelligent agents ought to act in an environment to maximize some notion of reward. The agent executes actions which cause the observable state of the environment to change. Through a sequence of actions, the agent attempts to gather knowledge about how the environment responds to its actions, and attempts to synthesize a sequence of actions that maximize a cumulative reward.
The learning procedure of this thesis falls in the semi-supervised learning category.
2.8 Common Attacks and Vulnerabilities in NIDS
Current NIDSs require a substantial amount of human interference and administration for effective operation. It is therefore important for network administrators to understand the architecture of a NIDS, the well-known attacks, and the mechanisms used to detect them, in order to contain the damage. In this section, some well-known attack types, exploits, and vulnerabilities (in the end-host operating systems) are discussed. The attack categories are [41]:
1. Confidentiality: in such attacks, the attacker gains access to confidential and otherwise inaccessible data.
2. Integrity: in such attacks, the attacker can modify the system state and alter the data without proper authorization from the owner.
3. Availability: In such kinds of attacks, the system is either shut down by the attacker or made unavailable to general users. Denial of Service attacks fall into this category. 4. Control: In such attacks the attacker gains full control of the system and can alter the access privileges of the system thereby potentially triggering all of the above three attacks.
2.9 Technical Discussion
To completely understand how these attacks take place, one must examine the structure of the TCP/IP protocol suite in terms of the OSI model (Figure 2.1). A basic understanding of these headers and network exchanges is crucial to the process.
Layer            Data Unit        Function
7. Application   Data             Network process to application
6. Presentation  Data             Data representation, encryption and decryption, conversion of machine-dependent data to machine-independent data
5. Session       Data             Inter-host communication, managing sessions between applications
4. Transport     Segments         End-to-end connections, reliability and flow control
3. Network       Packet/Datagram  Path determination and logical addressing
2. Data link     Frame            Physical addressing
1. Physical      Bit              Media, signal and binary transmission

(Layers 7-4 are the host layers; layers 3-1 are the media layers.)

Figure 2.1: OSI Model
2.9.1 Internet Protocol – IP
The Internet Protocol (IP) is a network protocol operating at layer 3 (network) of the OSI model and is used to route packets on a network. It is connectionless, meaning no information regarding transaction state is kept [42], and there is no method in place to ensure that a packet is properly delivered to the destination. Examining the IP header (Figure 2.2), the first 12 bytes (the top three rows of the header) contain various information about the packet, while the next 8 bytes (the next two rows) contain the source and destination IP addresses. Using one of several tools (HPing, NMap, Packet Excalibur, Scapy, etc.) [43], an attacker can easily modify these addresses, specifically the "source address" field. It is important to note that each datagram is sent independently of all others due to the stateless nature of IP.
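The header layout described above can be illustrated by unpacking the first 20 bytes of an IPv4 header with Python's struct module, following the standard RFC 791 field order; the sample packet bytes below are fabricated for illustration:

```python
import struct

def parse_ipv4_header(data: bytes):
    # Standard 20-byte IPv4 header layout (without options), big-endian.
    ver_ihl, tos, total_len, ident, flags_frag, ttl, proto, checksum, src, dst = \
        struct.unpack("!BBHHHBBH4s4s", data[:20])
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,    # IHL is counted in 32-bit words
        "ttl": ttl,
        "protocol": proto,                     # 6 = TCP, 17 = UDP, 1 = ICMP
        "src": ".".join(str(b) for b in src),  # source address (bytes 13-16)
        "dst": ".".join(str(b) for b in dst),  # destination address (bytes 17-20)
    }

# Fabricated header: version 4, IHL 5, TTL 64, protocol TCP, zeroed checksum.
sample = bytes([0x45, 0, 0, 40, 0, 1, 0, 0, 64, 6, 0, 0,
                192, 168, 1, 10, 10, 0, 0, 5])
hdr = parse_ipv4_header(sample)
print(hdr["src"], "->", hdr["dst"])  # 192.168.1.10 -> 10.0.0.5
```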
Figure 2.2: IP Packet Header

2.9.2 Transmission Control Protocol – TCP
IP can be thought of as a routing wrapper for layer 4 (transport) of the OSI model, which contains the Transmission Control Protocol (TCP). Unlike IP, TCP uses a connection-oriented design. This means that the participants in a TCP session must
first build a connection via the 3-way handshake (SYN, SYN/ACK, ACK), then update one another on progress via sequences and acknowledgements [42]. This "conversation" ensures data reliability, since the sender receives an OK from the recipient after each packet exchange [44]. A TCP header is very different from an IP header (Figure 2.3). The concern here is with the first 12 bytes of the TCP packet, which contain port and sequencing information. Much like an IP datagram, TCP packets can be manipulated using software. The source and destination ports normally depend on the network application in use (for example, HTTP via port 80). What is important for an understanding of spoofing are the sequence and acknowledgement numbers: the data contained in these fields ensure packet delivery by determining whether or not a packet needs to be resent [42].
Figure 2.3: TCP Packet Header

The sequence number is the number of the first byte in the current packet that is relevant to the data stream. The acknowledgement number, in turn, contains the value of the next expected sequence number in the stream. This relationship confirms, on both ends, that the proper packets were received. This is quite different from IP, since transaction state is closely monitored [42].
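The sequence/acknowledgement relationship can be sketched with hypothetical numbers: the receiver acknowledges the next byte it expects, i.e. the sender's sequence number plus the payload length:

```python
def next_expected_ack(seq: int, payload_len: int) -> int:
    # TCP sequence numbers are 32-bit and wrap around modulo 2**32.
    return (seq + payload_len) % 2**32

seq = 1_000_000          # sequence number of the first byte in this segment
payload = 1460           # a typical Ethernet-sized TCP payload (illustrative)
print(next_expected_ack(seq, payload))  # 1001460
```

If the acknowledgement received does not match this value, the sender knows a segment must be resent.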
2.10 IP Spoofing
The basic protocol for sending data over the Internet and many other computer networks is the Internet Protocol (IP) [44]. The header of each IP packet contains, among other things, the numerical source and destination addresses of the packet. The source address is normally the address that the packet was sent from. By forging the header so that it contains a different address, an attacker can make it appear that the packet was sent by a different machine. The machine that receives spoofed packets will send responses back to the forged source address, which means that this technique is mainly used when the attacker does not care about the response or has some way of guessing it [45]. IP spoofing, or Internet protocol address spoofing, is the method of creating an IP packet using a fake IP address that impersonates a legal and legitimate one; it is a method of attacking a network in order to gain unauthorized access [46]. The attack is based on the fact that Internet communication between distant computers is routinely handled by routers, which find the best route by examining the destination address but generally ignore the origination address; the origination address is only used by the destination machine when it responds back to the source [47]. In a spoofing attack, the intruder sends messages to a computer indicating that the message has come from a trusted system. To be successful, the intruder must first determine the IP address of a trusted system and then modify the packet headers so that the packets appear to be coming from that trusted system [47]. The purposes of such attacks include obscuring the true source of the attack, implicating another site as the attack origin, pretending to be a trusted host, hijacking or intercepting network traffic, or causing replies to target another system.
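What "forging the header" means at the byte level can be sketched as follows: the 4-byte source-address field of an IPv4 header is simply filled with an arbitrary address before the checksum is computed. The addresses are examples, and the sketch builds header bytes only; nothing is sent:

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    # Standard Internet checksum: one's-complement sum of 16-bit words.
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    total = (total & 0xFFFF) + (total >> 16)
    total = (total & 0xFFFF) + (total >> 16)  # fold any remaining carry
    return ~total & 0xFFFF

def forged_ipv4_header(fake_src: str, dst: str) -> bytes:
    src_b = bytes(int(o) for o in fake_src.split("."))
    dst_b = bytes(int(o) for o in dst.split("."))
    # Minimal header: version 4, IHL 5, TTL 64, protocol TCP, checksum 0.
    hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, 64, 6, 0, src_b, dst_b)
    csum = ipv4_checksum(hdr)
    return hdr[:10] + struct.pack("!H", csum) + hdr[12:]

hdr = forged_ipv4_header("10.9.9.9", "192.0.2.1")
print(hdr[12:16])  # the source field now carries the arbitrary address 10.9.9.9
```

Routers forward such a packet using only the destination field, which is exactly the property the attack exploits.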
IP spoofing is most frequently used in denial-of-service attacks which will be addressed in the next section of this chapter.
2.10.1 Denial of Service Attack
IP spoofing is almost always used in what is currently one of the most difficult attacks to defend against: denial of service (DoS) attacks. Since crackers are concerned only with consuming bandwidth and resources, they need not worry about properly completing handshakes and transactions; rather, they wish to flood the victim with as many packets as possible in a short amount of time [48]. In order to prolong the effectiveness of the attack, they spoof source IP addresses to make tracing and stopping the DoS as difficult as possible. When multiple compromised hosts participate in the attack, all sending spoofed traffic, it is very challenging to quickly block it [49]. A denial-of-service (DoS) attack or distributed denial-of-service (DDoS) attack is an attempt to make a computer resource unavailable to its intended users. Although the means to carry out, motives for, and targets of a DoS attack may vary, it generally consists of the efforts of a person or persons to prevent an Internet site or service from functioning efficiently, temporarily or indefinitely [50]. Perpetrators of DoS attacks typically target sites or services hosted on high-profile web servers such as banks, credit card payment gateways, and even DNS root servers [51]. One common method of attack involves saturating the target (victim) machine with external communications requests, such that it cannot respond to legitimate traffic, or responds so slowly as to be rendered effectively unavailable. In general terms, DoS attacks are implemented either by forcing the targeted computer(s) to reset or consume their resources so that they can no longer provide their intended service, or by obstructing the communication media between the intended users and the victim so
that they can no longer communicate adequately [52]. The main types of DoS attack are listed below:
Smurf Attack: a Smurf attack exploits the target by sending repeated ping requests to the broadcast address of the target network. The ping request packet often uses a forged (return) IP address, namely that of the target site that is to receive the denial of service attack. The result is a flood of ping replies directed back at the innocent, spoofed host. If the number of hosts replying to the ping request is large enough, the network will no longer be able to carry real traffic [52,53].
SYN Floods (Neptune): when establishing a session between a TCP client and server, a handshaking message exchange occurs between them. A session setup packet contains a SYN field that identifies the sequence in the message exchange. An attacker may send a flood of connection requests and not respond to the replies. This leaves the request packets in the buffer, so that legitimate connection requests cannot be accommodated [44].
Ping of Death (PoD): a Ping of Death is caused by an attacker overwhelming the victim network with Internet Control Message Protocol (ICMP) echo request packets. This is a fairly easy attack to perform without extensive network knowledge, as many ping utilities support this operation. A flood of ping traffic can consume significant bandwidth on low- to mid-speed networks, bringing a network down to a crawl. A ping of death is also known as "long ICMP" [53].
Teardrop Attack: a teardrop attack works by sending IP fragment packets that are difficult to reassemble. A fragment packet identifies an offset that is used by the receiving system to reassemble the entire packet. In the teardrop attack, the attacker puts a confusing offset value in the subsequent fragments, and if the receiving system does not know how to handle such a situation, it may crash [53].
Back: this type of DoS attack works against the Apache web server; an attacker submits requests with URLs containing many forward slashes. As the server tries to process these requests, it slows down and becomes unable to process other requests [54].
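A simple detection heuristic for the SYN flood described above is to count, per source address, SYN segments that were never followed by a completing ACK, and to flag sources exceeding a threshold. The events and the threshold below are illustrative:

```python
from collections import Counter

def syn_flood_suspects(events, threshold=3):
    """events: (source address, flag) pairs; flag is "SYN" or "ACK"."""
    half_open = Counter()
    for src, flag in events:
        if flag == "SYN":
            half_open[src] += 1          # new half-open connection
        elif flag == "ACK" and half_open[src] > 0:
            half_open[src] -= 1          # handshake completed
    return {src for src, n in half_open.items() if n > threshold}

events = [("10.0.0.1", "SYN"), ("10.0.0.1", "ACK")] + \
         [("203.0.113.5", "SYN")] * 50   # many SYNs, no completing ACKs
print(syn_flood_suspects(events))  # {'203.0.113.5'}
```

Note that source-address spoofing defeats this per-source view, which is why aggregate counters are also used in practice.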
This thesis focuses on the detection of the DoS attack class and its types; system training and testing are done on normal packets and DoS packets to construct a model for DoS detection.

2.11 Data Mining and Intrusion Detection System
The term data mining is frequently used to designate the process of extracting useful information from large databases. The term knowledge discovery in databases (KDD) is used to denote the process of extracting useful knowledge from large datasets. Data mining, by contrast, refers to one particular step in this process, the one which ensures that the extracted patterns actually correspond to useful knowledge [55]. Data mining refers to a set of procedures that excavate previously unknown but potentially valuable information from large stores of past data. Data mining techniques basically correspond to pattern discovery algorithms, but
most of them are drawn from related fields like machine learning or pattern recognition [56]. In this thesis two machine learning techniques have been used: the unsupervised K-means algorithm and the supervised C4.5 decision tree.
2.12 Feature Selection (FS)
Feature selection is an important topic in data mining, especially for high-dimensional datasets [57]. Multiple dimensions are hard to reason about and impossible to visualize, and, owing to the exponential growth of the number of possible values with each dimension, complete enumeration of all subspaces becomes intractable with increasing dimensionality; this problem is known as the curse of dimensionality [58]. Feature selection (also known as subset selection) is a process of selecting a group of useful features from the original feature space [59]. This process is commonly used in machine learning, wherein a subset of the features available in the data is selected for application of a learning algorithm. The best subset contains the least number of dimensions that contribute most to accuracy; the remaining, unimportant dimensions are discarded. Feature selection is an important stage of preprocessing and is one way of avoiding the curse of dimensionality, which refers to how certain learning algorithms may perform poorly on multi-dimensional data. Usually, features are specified or chosen before data are collected. Features can be discrete, continuous, or nominal. Generally, features are characterized as [60]:
1. Relevant: features which have an influence on the output and whose role cannot be assumed by the rest.
2. Irrelevant: features that do not have any influence on the output, and whose values are generated at random.
3. Redundant: a redundancy exists whenever a feature can take the role of another (the simplest way to model redundancy).
Feature selection is an essential data processing step prior to applying a learning algorithm [61]. Not all features are useful in constructing the system model; some features may be redundant or irrelevant and thus do not contribute to the learning process. The main aim of the feature selection process is to determine a minimal feature subset from the problem domain while retaining a suitably high accuracy in representing the original features. There are two approaches in feature selection (FS), known as forward selection and backward selection. Forward selection starts with no variables and adds them one by one, at each step adding the one that decreases the error the most, until any further addition does not significantly decrease the error. Backward selection starts with all the variables and removes them one by one, at each step removing the one whose removal decreases the error the most (or increases it only slightly), until any further removal increases the error significantly. To reduce overfitting, the error referred to above is the error on a validation set, as distinct from the error on the training set [60]. The main idea of the FS process is to choose a subset of input variables by eliminating features with little or no predictive information. The advantages of FS can be listed as follows:
- It reduces the dimensionality of the feature space, limiting storage requirements and increasing algorithm speed.
- It removes redundant, irrelevant, or noisy data; the immediate effect for data analysis tasks is speeding up the running time of the learning algorithms.
- It improves data quality.
- It increases the accuracy of the resulting model.
- Feature set reduction saves resources in the next round of data collection or during utilization.
- Performance improvement yields a gain in predictive accuracy.
- Data understanding provides knowledge about the processes that generated the data, or simply a better way to visualize the data.
Feature selection is also useful as part of the data analysis process, as it shows which features are important for prediction and how these features are related. The removal of irrelevant and redundant information often improves the performance of the machine learning algorithm.
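The forward selection procedure described above can be sketched as a greedy loop. The scoring function here is synthetic (it rewards two hypothetical "informative" features) purely to show the control flow; in practice the score would be the validation accuracy of the chosen classifier:

```python
def forward_selection(features, score, min_gain=1e-9):
    """Greedily add the feature that improves the score most; stop when
    no remaining feature gives a meaningful improvement."""
    selected = []
    best = score(selected)
    while True:
        gains = [(score(selected + [f]) - best, f)
                 for f in features if f not in selected]
        if not gains:
            break
        gain, f = max(gains)
        if gain <= min_gain:          # no addition helps any more: stop
            break
        selected.append(f)
        best += gain
    return selected

def toy_score(subset):
    # Synthetic stand-in for validation accuracy (names are illustrative).
    informative = {"src_bytes": 0.30, "count": 0.25}
    return 0.5 + sum(informative.get(f, 0.0) for f in subset)

feats = ["duration", "src_bytes", "count", "urgent"]
print(forward_selection(feats, toy_score))  # ['src_bytes', 'count']
```

Backward selection is the mirror image: start with all features and greedily remove the least useful one.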
2.12.1 General Methods for Feature Selection
The relationship between a feature selection algorithm (FSA) and the inducer chosen to evaluate the usefulness of the feature selection process can be classified into two types: wrapper and filter methods. The wrapper approach uses the classification method itself to measure the importance of a feature set; hence the features selected depend on the classifier model used. Wrapper methods generally result in better performance than filter methods because the feature selection process is optimized for the classification algorithm to be used. However, wrapper methods are too expensive for large-dimensional databases in terms of computational complexity and time, since each feature set considered must be evaluated with the classifier algorithm used. The filter approach precedes the actual classification process; it is independent of the learning algorithm, computationally simple, fast, and scalable. Using the filter method, feature selection is done only once, and the result can then be provided as input to different classifiers. Various feature ranking and feature selection techniques have been proposed, such as Correlation-based Feature Selection (CFS), Principal Component Analysis (PCA), Gain Ratio (GainR) attribute evaluation, Chi-square feature evaluation, Fast Correlation-Based Filter
(FCBF), Information Gain (IG), Euclidean distance, the I-test, and the Markov blanket filter. Some of these filter methods do not perform feature selection but only feature ranking; hence they are combined with a search method when one needs to find the appropriate number of attributes. Such filters are often used with forward selection (which considers only additions to the feature subset), backward elimination, bi-directional search, best-first search, and genetic search.
2.12.2 Information Gain (IG) Feature Selection
Information Gain (IG) is an entropy-based feature evaluation method, widely used in the field of machine learning. When Information Gain is used in feature selection, it is defined as the amount of information provided by the feature items for the IDS [62]. Information gain measures how much a term contributes to the classification of information, in order to assess the importance of items for the classification. With Information Gain, the features are filtered to create the most prominent feature subset before the start of the learning process. It takes the number and size of branches into account when choosing an attribute, as it corrects the information gain by taking the intrinsic information of a split into account [22]. The procedure of information gain is as follows. Let S be a set of training samples with their corresponding labels. Suppose there are m classes, the training set contains s_i samples of class i, and S is the total number of samples in the training set. The expected information needed to classify a given sample is calculated as in Eq. 2.12:

I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{S} \log_2 \frac{s_i}{S}        (Eq. 2.12)
A feature F with values {f1, f2, … , fv} can divide the training set into v subsets { S1, S2, …, Sv } where Sj is the subset which has the value fj for feature F.
Furthermore, let S_j contain s_{ij} samples of class i. The entropy of the feature F is calculated as in Eq. 2.13:

E(F) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{S} \, I(s_{1j}, \ldots, s_{mj})        (Eq. 2.13)

The information gain for feature F can then be calculated as in Eq. 2.14:

IG(F) = I(s_1, s_2, \ldots, s_m) - E(F)        (Eq. 2.14)
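The calculation in Eqs. 2.12-2.14 can be sketched in Python as follows; this is an illustrative sketch, and the toy feature values and labels at the end are hypothetical, not taken from the dataset.

```python
from collections import Counter
from math import log2

def class_info(labels):
    """I(s1,...,sm): expected information to classify a sample (Eq. 2.12)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(F) = I(S) - E(F), following Eqs. 2.13 and 2.14."""
    total = len(labels)
    # Partition the labels into subsets S_j, one per distinct value f_j of feature F
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    # E(F): weighted sum of the information of each subset (Eq. 2.13)
    entropy_f = sum((len(subset) / total) * class_info(subset)
                    for subset in subsets.values())
    return class_info(labels) - entropy_f

# Toy example: a binary feature against normal/DoS labels
labels  = ["normal", "normal", "dos", "dos"]
feature = [0, 0, 1, 1]
print(information_gain(feature, labels))  # 1.0: the feature separates the classes perfectly
```

A feature whose values are spread independently of the class labels would instead score an information gain of zero.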
2.13 Clustering Algorithms
Clustering, or cluster analysis, groups data objects based on the information found in the data that describes the objects and their relationships. The goal is to make objects within a group similar (or related) to one another and different from (or unrelated to) objects in other groups. The quality of a clustering is determined by the distinctiveness of these groups, as well as by the homogeneity within a single group [63]. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity [64]. Clustering is the classification of similar objects into different groups, or more precisely, the partitioning of data into subsets (clusters), so that the data in each subset (ideally) share some common trait of proximity according to some defined distance measure [65]. By clustering, one can spot dense and sparse regions and, consequently, discover the overall distribution of samples and interesting relationships among the data attributes. Clustering algorithms are used extensively not only to organize and categorize data, but are also useful for data compression and model construction. By finding similarities in data, one can represent similar data with fewer symbols [66,67].
Also, by finding groups of data, a model of the problem can be built based on those groupings. Another reason for clustering is its descriptive nature, which can be used to discover relevant knowledge in huge datasets [67]. Clustering is a challenging field of research, as it can be used as a stand-alone tool to gain insight into the distribution of data, to observe the characteristic features of each cluster, and to focus on a particular set of clusters for further analysis. The advantage of applying data mining technology to Intrusion Detection Systems lies in its ability to mine the succinct and precise characteristics of intrusions automatically from large quantities of information. It can overcome the difficulty of extracting and coding rules in traditional Intrusion Detection Systems [56].
2.13.1 Classification of Clustering Algorithms
There are essentially two types of clustering methods (Figure 2.4): hierarchical clustering and partitioning clustering. In hierarchical clustering, once groups are found and objects are assigned to the groups, this assignment cannot be changed. In partitioning clustering, the assignment of objects to groups may change while the algorithm runs. Partitioning clustering is further categorized into hard clustering and soft clustering. Hard clustering is based on classical set theory, i.e. a data point either belongs to a particular cluster or it does not; K-means clustering is a type of hard clustering. Soft clustering is based on fuzzy set theory, i.e. a data point may partially belong to a cluster [56]. Clustering algorithms can also be classified according to whether the number of clusters to be formed is known in advance (a priori) or not. When the number of clusters is known in advance, the algorithm tries to partition the data into the given number of clusters; since K-means and fuzzy c-means need prior knowledge of the number of clusters, they belong to this type. When the
number of clusters is not known in advance, the algorithm starts by finding the first large cluster, then the second, and so on; the Mountain and Subtractive clustering algorithms are examples of this type [56].
[Figure 2.4 shows a tree: Data Clustering splits into Hierarchical Clustering and Partitional Clustering; Partitional Clustering splits into Hard Clustering (K-means) and Soft Clustering (Fuzzy C-means).]
Figure 2.4: Types of Clustering Methods
The K-means clustering algorithm has been used in this thesis; it clusters the combined normal and Denial of Service (DoS) dataset into two clusters, a normal cluster and a DoS attack cluster.
2.13.2 K-means Algorithm
K-means is one of the simplest unsupervised clustering algorithms and is used to solve the well-known clustering problem in many fields. K-means is an iterative algorithm in which the number of clusters must be determined before execution. The K-means algorithm partitions n data points into k clusters, where the number of clusters K is pre-decided by the user [68]. At the beginning, K centroids are initialized according to some rule (usually chosen at random from the data points); they represent the centers of weight of the corresponding clusters. For each data point in the set, the closest centroid is computed, so that clusters of points are created. The assignment of a data point to a cluster depends on the distance between the cluster centroid and the data point [69].
In the next step, all data points assigned to a given cluster are used to recalculate its centroid. The procedure is repeated until a certain termination condition is met. The general steps of the K-means algorithm are as follows:
1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
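The four steps above can be sketched as follows; this is a minimal illustration under a pluggable distance function, not the implementation used in the experiments, and the toy data points are hypothetical.

```python
import random

def kmeans(points, k, dist, max_iter=100):
    """Plain K-means following the four steps above; `dist` is a distance function."""
    centroids = random.sample(points, k)          # Step 1: initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign each point to the group with the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 4: stop when the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Two well-separated toy groups, K = 2 (mirroring the normal/DoS split used here)
data = [(0.0, 0.0), (0.1, 0.0), (0.9, 1.0), (1.0, 1.0)]
centroids, clusters = kmeans(data, 2, euclidean)
```

Passing a different `dist` function (e.g. Manhattan distance) changes only the assignment step, which is how the two metrics are compared later in this thesis.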
2.14 Decision Tree
A Decision Tree is a predictive modeling technique from the fields of machine learning and statistics that builds a simple tree-like structure to model the underlying pattern of the data [70]. Decision Trees are one example of a classification algorithm. Classification is a data mining technique that assigns objects to one of several predefined categories. Classification algorithms recognize distinctive patterns in a dataset and classify activity based on this information [63]. A Decision Tree is a collection of if-then conditional rules for assigning class labels to instances of a dataset. Decision Trees consist of nodes that specify a particular attribute of the data, branches that represent a test on each attribute value, and leaves that correspond to the terminal decision [71]. Decision Trees are a well-known machine learning technique, and they are composed of three basic elements [72]:
- A decision node, specifying a test attribute.
- An edge or branch, corresponding to one of the possible values of the attribute.
- A leaf, usually named an answer node, which contains the class to which the object belongs.
In Decision Trees, two major phases must be ensured:
- Building the tree, based on a given training set.
- Classification, in order to classify a new instance.
At the start, the root of the tree is determined, and then the attribute specified at that node is tested. The test result allows moving down the tree according to the given instance's attribute value. This process is repeated until a leaf is encountered; the instance is then assigned the class associated with that leaf [73]. In summary, Decision Trees provide a simple set of rules that can categorize new data. Creating Decision Trees requires a pre-classified dataset so that the algorithms can learn the patterns in the data. The training dataset is made up of features, which are quantifiable characteristics of the data. Once the Decision Tree has been built from these features, the resulting rules can be used to identify and classify new data of interest by incorporating the logic into existing defenses, such as IDSs, firewalls, custom-built detection scripts, or classification software [74].
2.14.1 C4.5 Decision Tree Algorithm
The C4.5 Decision Tree algorithm has been used in this thesis. C4.5 is an algorithm used to generate a Decision Tree, developed by Ross Quinlan [73]; it is an extension of Quinlan's earlier ID3 algorithm [75]. The Decision Trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier [76].
The pseudo code for building C4.5 Decision Trees is written below [23]:
1. Check for a base case.
2. For each attribute, find the normalized information gain ratio.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recurse on the sublists obtained by splitting on a_best, and add the obtained nodes as children of the a_best node.
Decision Tree algorithms build the tree top-down, from the root to the leaves. To guide this process, an attribute selection measure is used that takes into account the discriminative power of each attribute over the classes, in order to choose the "best" one as the root of the (sub) Decision Tree [77]. In other words, the best attribute should be used as the root node for splitting the tree. An objective criterion for judging the efficiency of a split is needed, and the information gain measure is used to select the test attribute at each node in the tree [23]. The attribute with the highest information gain (or greatest entropy reduction) is chosen as the test attribute for the current node [78]. This attribute minimizes the information needed to classify the samples in the resulting partitions. C4.5 uses an extension of information gain known as the gain ratio for ranking attributes, which applies normalization to the information gain [79]. The gain ratio (GainR) is larger when the data is evenly spread and small when all data belong to one branch of the attribute. The GainR for a set S split on feature F is:

GainR(S, F) = \frac{IG(S, F)}{E(F)}        (Eq. 2.15)
where the information gain IG(S, F) and the entropy E(F) are calculated using Eqs. 2.14 and 2.13, respectively. From an intrusion detection perspective, classification algorithms can characterize network data as normal or attack using information such as source/destination ports, IP addresses, and the number of bytes sent during a connection. Classification algorithms create a Decision Tree like the one presented in Figure 2.5 by identifying patterns in an existing dataset and using that information to build the tree. The algorithms take pre-classified data as input, learn the patterns in the data, and create simple rules to differentiate between the various types of data in the pre-classified dataset.
Figure 2.5: Example of Decision Tree for IDS Classification
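The attribute-selection step of the pseudocode can be sketched as follows, using the gain ratio as defined by Eq. 2.15 (IG divided by the entropy term E(F)). Returning the raw IG when E(F) is zero is a choice made here to avoid division by zero, and the toy records are hypothetical.

```python
from collections import Counter
from math import log2

def info(labels):
    """I(s1,...,sm): expected information to classify a sample (Eq. 2.12)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """GainR(S, F) = IG(S, F) / E(F), following Eq. 2.15."""
    total = len(labels)
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    entropy_f = sum((len(s) / total) * info(s) for s in subsets.values())
    ig = info(labels) - entropy_f
    # When E(F) is 0 every subset is pure; fall back to IG to avoid dividing by zero
    return ig / entropy_f if entropy_f else ig

def best_attribute(rows, labels):
    """a_best: the attribute index with the highest gain ratio (pseudocode steps 2-3)."""
    n_features = len(rows[0])
    return max(range(n_features),
               key=lambda j: gain_ratio([r[j] for r in rows], labels))

# Toy records: feature 0 (protocol) separates the classes, feature 1 does not
rows = [("tcp", "a"), ("tcp", "b"), ("udp", "a"), ("udp", "b")]
labels = ["dos", "dos", "normal", "normal"]
print(best_attribute(rows, labels))  # 0
```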
2.15 Dataset Collection
To verify the effectiveness and feasibility of the proposed IDS, the KDD Cup 99 dataset has been used [80]. This dataset is considered a standard dataset and the most widely used dataset for the evaluation of intrusion detection methods [22,29]. A connection is a sequence of TCP packets to and from some IP addresses, starting and ending at some well-defined times. The dataset covers seven weeks of network traffic, processed into about five million connection records, plus two weeks of test data containing around two million connection records. The KDD Cup 99 training dataset consists of approximately 4,900,000 single connection vectors, each of which is a vector of feature values extracted from that network connection and contains 41 features [Appendix A, Table A1].
2.15.1 Attacks in KDD Cup 99 Dataset
The simulated attacks in the KDD Cup 99 dataset fall into one of the following four categories [81]:
- Denial of Service (DoS): the attacker tries to prevent legitimate users from using a service.
- Remote to Local (R2L): the attacker does not have an account on the victim machine and tries to gain access.
- User to Root (U2R): the attacker has local access to the victim machine and tries to gain super-user privileges.
- Probe: the attacker tries to gain information about the target host.
2.15.2 Features of KDD Cup 99 Dataset In KDD Cup 99, the original TCP dump files were pre-processed for utilization in the Intrusion Detection System benchmark of the International Knowledge Discovery and Data Mining Tools Competition [81]. Packet information in the TCP dump file is summarized into connections. Specifically, a connection is a sequence of TCP packets starting and ending at some well-defined times, between which data flows from a source IP address to a target IP address under some well-defined protocol, with 41 features for each connection. The features are grouped into three categories:
Basic Features: Basic features can be derived from TCP/IP connection packet headers without inspecting the payload. Basic features are listed in Table 2.3.
Content Features: Domain knowledge is used to assess the payload of the original TCP packets. This includes features such as the number of failed login attempts as shown in Table 2.4.
Traffic Features: this category includes features that are computed with respect to a window interval and is divided into two groups:
- "Same Host" features: examine only the connections in the past 2 seconds that have the same destination host as the current connection, and calculate statistics related to protocol behaviour, service, etc.
- "Same Service" features: examine only the connections in the past 2 seconds that have the same service as the current connection.
The two aforementioned types of "Traffic" features are called time-based and are listed in Table 2.5.

Table 2.3: Basic Features of TCP Connection
No. | Feature | Description
1 | Duration | Length of the connection (No. of seconds)
2 | Protocol_type | Type of connection protocol (tcp, udp)
3 | Service | Network service on the destination (telnet, ftp)
4 | Flag | Status flag of the connection
5 | Src_bytes | No. of data bytes sent from source to destination
6 | Dst_bytes | No. of data bytes sent from destination to source
7 | Land | 1 if connection is from/to the same host/port; 0 otherwise
8 | Wrong_fragment | No. of wrong fragments
9 | Urgent | No. of urgent packets
The feature protocol_type has 3 different values: icmp, tcp, and udp. Likewise, the feature service has 70 different values, and the feature flag has 11 different values. The descriptions of the different flag values are listed in [Appendix A, Table A2]. These 3 features and their different values play a significant role in the proposed method.
Table 2.4: Content Features of the TCP Connection
No. | Feature | Description
10 | Hot | No. of "hot" indicators
11 | Num_failed_logins | No. of failed logins
12 | Logged_in | 1 if successfully logged in; 0 otherwise
13 | Num_compromised | No. of "compromised" conditions
14 | Root_shell | 1 if root shell is obtained; 0 otherwise
15 | Su_attempted | 1 if "su root" command attempted; 0 otherwise
16 | Num_root | No. of "root" accesses
17 | Num_file_creations | No. of file creation operations
18 | Num_shells | No. of shell prompts
19 | Num_access_files | No. of operations on access control files
20 | Num_outbound_cmds | No. of outbound commands in an ftp session
21 | Is_host_login | 1 if the login belongs to the "hot" list; 0 otherwise
22 | Is_guest_login | 1 if the login is a "guest" login; 0 otherwise
Table 2.5: Time-Based Features of TCP Connection
No. | Feature | Description
23 | Count | No. of connections to the same host as the current connection in the past two seconds
24 | Srv_count | No. of connections to the same service as the current connection in the past two seconds
25 | Serror_rate | % of connections that have "SYN" errors
26 | Srv_serror_rate | % of connections that have "SYN" errors
27 | Rerror_rate | % of connections that have "REJ" errors
28 | Srv_rerror_rate | % of connections that have "REJ" errors
29 | Same_srv_rate | % of connections to the same service
30 | Diff_srv_rate | % of connections to different services
31 | Srv_diff_host_rate | % of connections to different hosts
Chapter Three Proposed System Methodology
3.1 Introduction
This chapter describes the architecture and workflow of the proposed IDS. It explains the pre-processing of the dataset used for the experiments, including feature transformation and normalization, and optimal feature selection using information gain. The proposed hybrid model is then described, first through a block diagram of its basic architecture and then through the details of each part.
3.2 Dataset Pre-Processing
The first part of the analysis engine component of the hybrid IDS model is dataset pre-processing. Pre-processing of the dataset is of great importance, as it increases the efficiency of the intrusion detection mechanism during training, testing, and clustering of network activity into normal and abnormal. Pre-processing of the original KDD Cup 99 dataset is necessary to make it suitable for the IDS structure. Dataset pre-processing is achieved by applying:
- Dataset transformation for nominal features
- Dataset normalization for numeric features
3.2.1 Dataset Transformation
The training dataset of KDD Cup 99 consists of approximately 4,900,000 single connection instances. Each connection instance contains 42 features, including the target class (attack or normal). These labelled connection instances have to be transformed from nominal features to numeric values to be a suitable input for clustering by the
K-means algorithm. For this transformation, Table 3.1 is used. In this step, some useless data is filtered and modified; in particular, text items need to be converted into numeric values. There are several nominal values, such as HTTP, TCP and SF, so it is necessary to transform these nominal values into numeric values in advance. For example, the protocol type "tcp" is mapped to 1, "udp" is mapped to 2, and "icmp" is mapped to 3. The keys in Table 3.1 are followed to transform the nominal values of the dataset features into numeric values.

Table 3.1: Transformation Table for Different Values of Protocol, Flag and Service
Protocol Type: TCP = 1, UDP = 2, ICMP = 3
Flag: OTH = 1, REJ = 2, RSTO = 3, RSTOS0 = 4, RSTR = 5, S0 = 6, S1 = 7, S2 = 8, S3 = 9, SF = 10, SH = 11
Service: all 70 services = 1 to 70
An example of original KDD Cup 99 dataset record is shown in Figure 3.1. 0 tcp ftp_data SF 491 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 1 0 0 150 25 0.17 0.03 0.17 0 0 0 0.05 0 normal 0 udp other SF 146 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 1 0 0 0 0 0.08 0.15 0 255 1 0 0.6 0.88 0 0 0 0 0 normal
Figure 3.1: Records of the KDD Cup 99 Dataset
After transformation, the original KDD Cup 99 dataset will become as shown in Figure 3.2. 0,1,30,10,491,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,0,0,0,0,1,0,0,150,25,0.17,0.03,0.17,0,0,0, 0.05,0,0 0,2,40,10,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,1,0,0,0,0,0.08,0.15,0,255,1,0,0.6,0.88,0,0, 0,0,0,0
Figure 3.2: Records of the KDD Cup 99 Dataset After Transformation
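The transformation keyed by Table 3.1 can be sketched as follows. The SERVICE mapping is truncated to three of the 70 services for illustration, and the exact numeric codes assigned to individual services are an assumption; only the protocol and flag codes come from Table 3.1.

```python
# Mapping tables following Table 3.1
PROTOCOL = {"tcp": 1, "udp": 2, "icmp": 3}
FLAG = {"OTH": 1, "REJ": 2, "RSTO": 3, "RSTOS0": 4, "RSTR": 5,
        "S0": 6, "S1": 7, "S2": 8, "S3": 9, "SF": 10, "SH": 11}
# Truncated, hypothetical service codes: the full table assigns 1-70 to all services
SERVICE = {name: code for code, name in
           enumerate(["ftp_data", "other", "http"], start=1)}

def transform(record):
    """Replace the nominal fields (protocol, service, flag) of one
    space-separated KDD record with their numeric codes, as in Figure 3.2."""
    fields = record.split()
    fields[1] = str(PROTOCOL[fields[1]])
    fields[2] = str(SERVICE[fields[2]])
    fields[3] = str(FLAG[fields[3]])
    return ",".join(fields)

print(transform("0 tcp http SF 219 0"))  # -> "0,1,3,10,219,0"
```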
3.2.2 Dataset Normalization
Dataset normalization is essential to enhance the performance of the intrusion detection system when datasets are very large. The first step is to normalize the continuous attributes, so that the attribute values fall within a specified range of 0 to 1. Here, the Min-Max method of normalization has been used, with the following equation [82]:

x_i = \frac{v_i - \min(v_i)}{\max(v_i) - \min(v_i)}        (Eq. 3.1)

where x_i is the normalized value, v_i is the actual value of the attribute, and the maximum and minimum are taken over all values of the attribute. By convention, x_i is set to zero if the maximum is equal to the minimum.
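Eq. 3.1, including the convention for a constant attribute, can be sketched as:

```python
def min_max(values):
    """Min-max normalization (Eq. 3.1): scale each value into [0, 1].
    If max == min, the normalized values are set to zero, as noted above."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([0, 146, 292]))  # -> [0.0, 0.5, 1.0]
```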
3.3 Proposed Detection Model
This thesis aims at building and simulating an intelligent IDS that can detect known and unknown network intrusions automatically. Under a machine learning framework, the IDS is trained with an unsupervised learning algorithm, namely the K-means algorithm.
With K-means, two clusters are obtained: normal and DoS attacks. No action is taken for the normal cluster. For DoS attacks, the cluster obtained using the Manhattan distance is passed to the second-layer classifier, the C4.5 Decision Tree. At this stage the tree has already been constructed and trained, and it can generate rules to classify DoS attacks into the Smurf, Neptune, Pod, Back and Teardrop types. Figure 3.3 shows the structure of the proposed system.
[Figure 3.3 shows the workflow: the KDD Cup dataset (normal and DoS records) passes through Information Gain (IG) feature selection and is split into a training set (60%) and a testing set (40%); the K-means clustering algorithm with K = 2 is run with both the Euclidean and the Manhattan distance metrics, producing normal and DoS clusters; the DoS cluster feeds the Decision Tree (C4.5) classification stage, followed by results comparison and performance evaluation.]
Figure 3.3: Proposed Detection Model Structure
3.4 Information Gain Feature Selection
The dataset used as input for the proposed IDS consists of a huge number of normal and DoS attack records, and each record has numerous attributes associated with it, which means that it needs a lot of processing. A classification process that considers all of these attributes requires a lot of processing time, increases the error rate, and decreases the efficiency of the classification process. The proposed system overcomes this problem by using the Information Gain feature selection process. The Information Gain (IG) algorithm is described in Algorithm 3.1.

Algorithm 3.1: Information Gain
Input: number of samples in training set S; number of classes m.
Output: a value representing the information gain for feature F.
Step 1: [Divide Training Set] Divide the training set into v subsets {S1, S2, ..., Sv}, where Sj is the subset which has the value fj for feature F.
Step 2: [Compute Information Needed to Classify S]
I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{S} \log_2 \frac{s_i}{S}
Step 3: [Compute the Entropy of Feature F]
E(F) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{S} \, I(s_{1j}, \ldots, s_{mj})
Step 4: [Compute Information Gain for Feature F]
IG(F) = I(s_1, s_2, \ldots, s_m) - E(F)
3.5 K-means Clustering for the Proposed System
The general structure of the first layer of the proposed IDS is presented in Figure 3.4.

[Figure 3.4 shows the first layer: a subset of the KDD Cup 99 dataset undergoes transformation and normalization, then IG feature selection; the data is split into a training set (60%) and a testing set (40%) and clustered by the K-means algorithm with K = 2 into a normal cluster and a DoS cluster.]
Figure 3.4: First Layer of Proposed Detection Model

K-means clustering includes procedures and steps to determine the centroid of each cluster, as shown in Figure 3.5. The K-means training phase determines the centroids of both the normal and the attack cluster. The centroids are used in the distance calculation for any incoming packet, to classify it as either normal or attack based on the minimum distance to a cluster centroid. Two distance metrics have been used, the Euclidean and the Manhattan, and the results and performance of K-means clustering with both metrics have been evaluated. The Manhattan distance metric showed much higher detection rates with reasonable true positive rates than the Euclidean distance on the subset of the KDD Cup 99 dataset.
[Figure 3.5 flowchart: start with the number of clusters K; select K points randomly from the data as initial centroids; calculate the distance of each object to the centroids; group objects by minimum distance; recalculate the centroids; if objects still move between groups, repeat, otherwise store the centroids and stop.]
Figure 3.5: K-means Clustering Flowchart
3.5.1 Distance Calculation
The assignment of data points to clusters depends on the distance between the cluster centroid and the data point. A distance function is required to compute the distance between two objects. Distance functions also affect the size and membership of a cluster, as different distance functions use different approaches to find the distance between data objects, which is the most important step in the creation of clusters; distance functions should therefore be chosen wisely, according to the dataset. The K-means algorithm generally uses the Euclidean distance. Two distance metrics are used with K-means in this thesis: the Euclidean distance and the Manhattan distance.
● Euclidean Distance Metric: In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula [83]. By using this formula for distance, Euclidean space becomes a metric space, as shown in Figure 3.6.
Figure 3.6: Euclidean Distance between Two Points
The Euclidean distance between points x and y is the length of the line segment connecting them. The formula for this distance between a point X (X_1, X_2, etc.) and a point Y (Y_1, Y_2, etc.) is:

d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}        (Eq. 3.2)
where x = (x_1, ..., x_m) and y = (y_1, ..., y_m) are two input vectors with m quantitative features.
● Taxicab Geometry (Manhattan): Taxicab geometry is a form of geometry in which the usual distance function of Euclidean geometry is replaced by a new metric, in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. The taxicab metric is also known as rectilinear distance, L1 distance or l1 norm, Manhattan distance, or Manhattan length, with corresponding variations in the name of the geometry [84]. The Manhattan distance function computes the distance that would be traveled to get from one data point to the other if a grid-like path is followed, as shown in Figure 3.7.

Figure 3.7: Manhattan Distance between Two Points

The formula for this distance between a point X = (X_1, X_2, ..., X_n) and a point Y = (Y_1, Y_2, ..., Y_n) is:

d(x, y) = \sum_{i=1}^{n} |x_i - y_i|        (Eq. 3.5)

where n is the number of variables, and X_i and Y_i are the values of the i-th variable at points X and Y respectively.
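The two metrics (Eqs. 3.2 and 3.5) can be sketched as:

```python
def euclidean(x, y):
    """Eq. 3.2: straight-line distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    """Eq. 3.5: sum of absolute coordinate differences (grid-like path)."""
    return sum(abs(a - b) for a, b in zip(x, y))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

The two metrics agree on which point is "closer" only up to their different geometry, which is why swapping them in the K-means assignment step can change the resulting clusters.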
3.6 Decision Trees as a Model for Intrusion Detection
Intrusion detection can be considered a classification problem, where each connection or user is identified either as one of the attack types or as normal, based on some existing data. Decision Trees can solve this classification problem, as they learn a model from the dataset and can classify new data items into one of the classes specified in the dataset. Decision Trees can be used for misuse intrusion detection, as they learn a model from the training data and can predict whether future data is one of the attack types or normal based on the learned model. A Decision Tree constructs easily interpretable models, which is useful for a security officer to inspect and edit. In this thesis, different sets of if-then rules based on the GainR attribute ranking have been used to construct Decision Trees, and the rule set with the highest detection rate for known and unknown attacks is adopted as the second layer of the proposed IDS.
Rule 1: Root node = flag
If flag = SF and protocol_type = tcp and dst_host_same_srv_rate < 0.94 Then Classification = unknown
If flag = SF and protocol_type = tcp and dst_host_same_srv_rate >= 0.94 Then Classification = back
If flag = SF and protocol_type = udp Then Classification = teardrop
If flag = SF and protocol_type = icmp and src_bytes < 1256 Then Classification = smurf
If flag = SF and protocol_type = icmp and src_bytes >= 1256 Then Classification = pod
If flag = RSTO or SH or OTH or RSTOS0 or S1 or S0 or REJ Then Classification = back
Rule 2: Root node = protocol_type
If protocol_type = tcp and serror_rate <= 0.02 and dst_host_diff_srv_rate <= 0.01 Then Classification = back
If protocol_type = tcp and serror_rate <= 0.02 and dst_host_diff_srv_rate > 0.01 Then Classification = unknown
If protocol_type = tcp and serror_rate > 0.02 Then Classification = neptune
If protocol_type = udp Then Classification = teardrop
If protocol_type = icmp and src_bytes <= 1235 Then Classification = smurf
If protocol_type = icmp and src_bytes > 1235 Then Classification = pod
Rule 3: Root node = srv_serror_rate
If srv_serror_rate <= 0 and wrong_fragment <= 0 and dst_bytes <= 186 and src_bytes <= 39 Then Classification = unknown
If srv_serror_rate <= 0 and wrong_fragment <= 0 and dst_bytes <= 186 and src_bytes > 39 Then Classification = smurf
If srv_serror_rate <= 0 and wrong_fragment <= 0 and dst_bytes > 186 Then Classification = back
If srv_serror_rate <= 0 and wrong_fragment > 0 and protocol_type = tcp or udp Then Classification = teardrop
If srv_serror_rate <= 0 and wrong_fragment > 0 and protocol_type = icmp Then Classification = pod
If srv_serror_rate > 0 Then Classification = neptune
Rule 3, with the feature srv_serror_rate as the root node, showed much higher detection rates than Rule 1 and Rule 2 on the DoS cluster from the first layer. Rule 3 is therefore used to construct the classification Decision Tree for the second layer of the proposed IDS model (Figure 3.8).
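Rule 3 can be expressed directly as a nested if/else chain. This sketch assumes the connection is represented as a dict of the KDD feature values used by the rule; the thresholds are the ones given in the rule itself.

```python
def classify_dos(conn):
    """Second-layer classification following Rule 3 (srv_serror_rate as root).
    `conn` is a dict holding the KDD feature values the rule tests."""
    if conn["srv_serror_rate"] > 0:
        return "neptune"
    if conn["wrong_fragment"] > 0:
        return "pod" if conn["protocol_type"] == "icmp" else "teardrop"
    if conn["dst_bytes"] > 186:
        return "back"
    return "smurf" if conn["src_bytes"] > 39 else "unknown"

print(classify_dos({"srv_serror_rate": 0, "wrong_fragment": 0,
                    "dst_bytes": 0, "src_bytes": 1032,
                    "protocol_type": "icmp"}))  # -> "smurf"
```

Because the tests are ordered from the root down, each connection reaches exactly one leaf, mirroring the tree in Figure 3.8.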
[Figure 3.8 shows the Decision Tree: srv_serror_rate is the root; srv_serror_rate > 0 leads to Neptune; otherwise wrong_fragment is tested, with wrong_fragment > 0 leading to Teardrop (protocol_type = tcp or udp) or Pod (protocol_type = icmp); with wrong_fragment <= 0, dst_bytes > 186 leads to Back, while otherwise src_bytes <= 39 leads to Unknown and src_bytes > 39 to Smurf.]
Figure 3.8: Decision Tree Structure for DoS Attack Classification
Chapter Four Implemented Results and Discussions
4.1 Introduction
This chapter presents the results of a set of tests conducted on the proposed system described in the previous chapter. The test results were computed and obtained using a computer running Windows 8 Pro (64-bit), with an Intel® Core™ i7-3537U CPU @ 2.00 GHz and 8 GB of RAM.

4.2 Training and Testing the Dataset
The KDD (Knowledge Discovery and Data mining cup) dataset is divided into two parts: a training dataset and a testing dataset. The training dataset is used to tune the cluster centroids of the K-means clustering for intrusion detection (i.e., to generate normal and attack signatures) and to construct the Decision Tree rules. The testing dataset is used to evaluate the performance of the proposed hybrid system. A dataset of 100,000 records, extracted from the whole KDD Cup 99 dataset and including both normal and DoS attack records, has been used to train and test the system. The training set is 60% of the extracted dataset, and the remainder is used for testing and validating the system.
4.3 Experiment 1: Results of Pre-processing
The results of this stage cover all the preprocessing phases.
4.3.1 Transformation and Normalization
A sample of the KDD Cup 99 dataset is presented in Table 4.1, and the transformation and normalization results for this sample are presented in Table 4.2.
Table 4.1: Sample Records of KDD Cup 99 Dataset
[Table: ten raw connection records with their 41 feature values, e.g., protocol_type tcp, service http, flag SF, and src_bytes/dst_bytes values such as 219/1337, 235/1337, and 212/4087]
Table 4.2: Transformed Nominal Data and Normalized Numeric Data Samples of KDD Cup 99 Dataset
[Table: the Table 4.1 sample records after transforming the nominal features (protocol_type, service, flag) to numeric codes and normalizing the numeric features to the range [0, 1]]
The proportions of the Normal and DoS attack classes in the subset of 100,000 records of the KDD Cup 99 dataset used to train and test the proposed IDS are presented in Table 4.3.

Table 4.3: Proportions of the Normal and DoS Classes in the Data Subset

Class  | Full Subset (100000 Records) | Training (60%) | Test (40%)
Normal | 19600 (19.6%)                | 11760          | 7840
DoS    | 80400 (80.4%)                | 48240          | 32160
Total  | 100000 (100%)                | 60000          | 40000
The proportions of each class are calculated as follows:
1- Normal Class:
Proportion of Normal records in the Full Subset = 19600 / 100000 * 100% = 19.6%
No. of Normal records in the Training set (60%) = 19600 * 60% = 11760
No. of Normal records in the Test set (40%) = 19600 * 40% = 7840
2- DoS Class:
Proportion of DoS records in the Full Subset = 80400 / 100000 * 100% = 80.4%
No. of DoS records in the Training set (60%) = 80400 * 60% = 48240
No. of DoS records in the Test set (40%) = 80400 * 40% = 32160
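The counts above amount to a per-class (stratified) 60/40 split. A minimal sketch of the arithmetic; the helper name `stratified_split` is ours, not part of the implemented system:

```python
def stratified_split(class_counts, train_fraction=0.60):
    """Split per-class record counts into training/test counts,
    preserving each class's proportion in both partitions."""
    split = {}
    for cls, count in class_counts.items():
        train = int(count * train_fraction)
        split[cls] = {"train": train, "test": count - train}
    return split

# Record counts of the 100,000-record KDD Cup 99 subset (Table 4.3)
counts = {"Normal": 19600, "DoS": 80400}
print(stratified_split(counts))
# {'Normal': {'train': 11760, 'test': 7840}, 'DoS': {'train': 48240, 'test': 32160}}
```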
4.3.2 Features Ranking and Subset Selection
Information Gain is used for feature ranking, and subset selection of the ranked features is then performed based on the highest ranking. The results of attribute ranking using IG over a sample of the KDD dataset consisting of normal and DoS attack records are shown in Table 4.4.

Table 4.4: Attribute Ranking by Information Gain

Attribute Rank | Sr. No. | Attribute name
0.6832  | 5  | src_bytes
0.67616 | 6  | dst_bytes
0.67143 | 37 | dst_host_srv_diff_host_rate
0.50454 | 23 | Count
0.49037 | 3  | Service
0.46333 | 12 | logged_in
0.27043 | 33 | dst_host_srv_count
0.25604 | 32 | dst_host_count
0.23577 | 34 | dst_host_same_srv_rate
0.23241 | 35 | dst_host_diff_srv_rate
0.22553 | 36 | dst_host_same_src_port_rate
0.19651 | 25 | serror_rate
0.19651 | 30 | diff_srv_rate
0.19651 | 29 | same_srv_rate
0.18667 | 4  | Flag
0.18086 | 38 | dst_host_serror_rate
0.18086 | 39 | dst_host_srv_serror_rate
0.18086 | 26 | srv_serror_rate
0.13657 | 24 | srv_count
0.10854 | 2  | protocol_type
0.05489 | 31 | srv_diff_host_rate
0.02458 | 8  | wrong_fragment
0.01921 | 41 | dst_host_srv_rerror_rate
0.01921 | 40 | dst_host_rerror_rate
0.01877 | 10 | Hot
0.01678 | 13 | num_compromised
0.00583 | 27 | rerror_rate
0.00583 | 28 | srv_rerror_rate
0       | 7  | Land
0       | 9  | Urgent
0       | 1  | Duration
0       | 19 | num_access_files
0       | 18 | num_shells
0       | 22 | is_guest_login
0       | 20 | num_outbound_cmds
0       | 21 | is_host_login
0       | 14 | root_shell
0       | 11 | num_failed_logins
0       | 17 | num_file_creations
0       | 15 | su_attempted
0       | 16 | num_root
Gain Ratio attribute ranking is used as a preprocessing step to construct the C4.5 Decision Tree, in order to determine which features to use when constructing the tree based on the amount of information carried by each feature. The attribute with the highest GainR is selected as the splitting attribute. The results of attribute ranking by Gain Ratio are presented in Table 4.5:

Table 4.5: Attribute Ranking Using GainR for C4.5 DT

Attribute Rank | Sr. No. | Attribute name
1     | 2  | protocol_type
1     | 26 | srv_serror_rate
1     | 12 | logged_in
1     | 5  | src_bytes
1     | 6  | dst_bytes
1     | 37 | dst_host_srv_diff_host_rate
1     | 40 | dst_host_rerror_rate
1     | 41 | dst_host_srv_rerror_rate
1     | 38 | dst_host_serror_rate
0.987 | 8  | wrong_fragment
0.969 | 10 | Hot
0.955 | 4  | Flag
0.93  | 25 | serror_rate
0.93  | 30 | diff_srv_rate
0.93  | 29 | same_srv_rate
0.884 | 13 | num_compromised
0.841 | 35 | dst_host_diff_srv_rate
0.815 | 23 | Count
0.789 | 3  | Service
0.718 | 31 | srv_diff_host_rate
0.679 | 33 | dst_host_srv_count
0.611 | 1  | Duration
0.611 | 24 | srv_count
0.577 | 28 | srv_rerror_rate
0.577 | 27 | rerror_rate
0.486 | 34 | dst_host_same_srv_rate
0.454 | 36 | dst_host_same_src_port_rate
0.29  | 32 | dst_host_count
0     | 7  | Land
0     | 9  | Urgent
0     | 19 | num_access_files
0     | 18 | num_shells
0     | 22 | is_guest_login
0     | 20 | num_outbound_cmds
0     | 21 | is_host_login
0     | 14 | root_shell
0     | 11 | num_failed_logins
0     | 17 | num_file_creations
0     | 15 | su_attempted
0     | 16 | num_root
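Both rankings come from entropy-based criteria: Information Gain measures the reduction in class entropy after splitting on a feature, and Gain Ratio divides that gain by the split information. A toy illustration of how the two are computed (the records and feature values are illustrative, not the thesis data or code):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """Information Gain of `feature`: class entropy minus the weighted
    entropy of the subsets produced by splitting on that feature."""
    n = len(labels)
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[feature], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in split.values())
    return entropy(labels) - remainder

def gain_ratio(rows, labels, feature):
    """Gain Ratio = Information Gain / split information (C4.5's criterion)."""
    split_info = entropy([row[feature] for row in rows])  # entropy of the split itself
    return info_gain(rows, labels, feature) / split_info if split_info else 0.0

# Toy records with two features, labelled Normal / DoS
rows = [{"protocol_type": "icmp", "wrong_fragment": 0},
        {"protocol_type": "icmp", "wrong_fragment": 1},
        {"protocol_type": "tcp",  "wrong_fragment": 0},
        {"protocol_type": "tcp",  "wrong_fragment": 0}]
labels = ["DoS", "DoS", "Normal", "Normal"]
print(info_gain(rows, labels, "protocol_type"))   # 1.0: perfectly separates classes
print(gain_ratio(rows, labels, "protocol_type"))  # 1.0
```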
4.4 Experiment 2: K-means Clustering (First Layer)
In this stage the K-means clustering algorithm is implemented using two distance metrics, Euclidean and Manhattan, to evaluate which metric yields the lowest error rate. The K-means algorithm was applied to the subset of the KDD Cup 99 dataset in three ways: the dataset with the full feature set, the dataset with the 10 highest features ranked by IG, and the dataset with the 20 highest features ranked by IG.
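A minimal sketch of K-means with a pluggable distance function for the assignment step, illustrating why Manhattan distance is cheaper per comparison. This is an illustration under simplifying assumptions (naive initialization, toy 2-D points), not the evaluated implementation:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # No squaring or square root, so cheaper to compute than Euclidean
    return sum(abs(x - y) for x, y in zip(a, b))

def kmeans(points, k, distance, iterations=20):
    """Basic K-means with a pluggable distance function. Naive
    initialization (first k points); real implementations use random
    or k-means++ seeding."""
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest centroid under `distance`
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [  # update step: mean of each cluster's members
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two well-separated groups standing in for "normal" and "attack" traffic
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (1.0, 1.0), (1.1, 1.0), (1.2, 1.0)]
for metric in (euclidean, manhattan):
    centroids, clusters = kmeans(points, k=2, distance=metric)
    print(metric.__name__, sorted(len(c) for c in clusters))  # both print [3, 3]
```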
Experiments show that the Manhattan distance function is more accurate in terms of detection and false alarm rates and outperforms the Euclidean distance function. Furthermore, the Manhattan distance function requires less computation than the Euclidean distance function, which in turn improves the computational time complexity of K-means. The results obtained using the dataset with the 20 highest features ranked by IG were more accurate, with fewer false alarms, than the other ways of using the dataset. Centroids for K-means clustering using the Euclidean and Manhattan metrics are presented in Tables 4.6 and 4.7 respectively.

Table 4.6: Attribute Centroids Using the Euclidean Distance Metric for the 20 Features with Highest Ranking

Feature Name                | Normal    | DoS
src_bytes                   | 6518.4059 | 0
dst_bytes                   | 1692.6453 | 0
dst_host_srv_diff_host_rate | 0.0185    | 0
Count                       | 214.5045  | 120.8646
Service                     | 80        | 67
logged_in                   | 0         | 0
dst_host_srv_count          | 178.2433  | 10.0378
dst_host_count              | 162.4904  | 204
dst_host_same_srv_rate      | 0.8297    | 0.0781
dst_host_diff_srv_rate      | 0.0062    | 0.083
dst_host_same_src_port_rate | 0.4122    | 0.0097
serror_rate                 | 0.0005    | 1
diff_srv_rate               | 0.0012    | 0.0654
same_srv_rate               | 0.9995    | 0.0841
Flag                        | 13        | 9
dst_host_serror_rate        | 0         | 1
dst_host_srv_serror_rate    | 0         | 1
srv_serror_rate             | 0         | 1
srv_count                   | 214.585   | 10.4016
protocol_type               | 1         | 1
Table 4.7: Attribute Centroids Using the Manhattan Distance Metric for the 20 Features with Highest Ranking

Feature Name                | Normal | DoS
src_bytes                   | 1032   | 0
dst_bytes                   | 0      | 0
dst_host_srv_diff_host_rate | 0      | 0
Count                       | 25     | 120
Service                     | 80     | 67
logged_in                   | 0      | 0
dst_host_srv_count          | 255    | 10
dst_host_count              | 186    | 255
dst_host_same_srv_rate      | 1      | 0.05
dst_host_diff_srv_rate      | 0      | 0.07
dst_host_same_src_port_rate | 0.14   | 0
serror_rate                 | 0      | 1
diff_srv_rate               | 0      | 0.07
same_srv_rate               | 1      | 0.09
Flag                        | 13     | 9
dst_host_serror_rate        | 0      | 1
dst_host_srv_serror_rate    | 0      | 1
srv_serror_rate             | 0      | 1
srv_count                   | 25     | 10
protocol_type               | 1      | 1
The evaluation and results of the K-means clustering tests using the two distance functions, Euclidean and Manhattan, with the different sets of data (full dataset, 10 features, and 20 features) are presented in Tables 4.8, 4.9, and 4.10 respectively.
Table 4.8: Evaluation and Results of K-means with Distance Functions Using the Full Dataset

Parameter                   | K-means with Euclidean Function | K-means with Manhattan Function
Accuracy                    | 77.6786 % | 89.404 %
Error Rate                  | 22.3214 % | 10.596 %
Average True Positive Rate  | 77.7 %    | 89.4 %
Average False Positive Rate | 24.9 %    | 9.9 %
Average Precision           | 84.3 %    | 90.5 %
Average Recall              | 77.7 %    | 89.4 %
Average F-Measure           | 76.2 %    | 89.4 %
Mean absolute error         | 0.2232    | 0.1455
Root mean squared error     | 0.4725    | 0.2769
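The parameters reported in these evaluation tables can all be derived from the binary confusion matrix. A small sketch with illustrative counts (the numbers below are made up for demonstration, not taken from the experiments):

```python
def evaluate(tp, fp, tn, fn):
    """Standard detection metrics from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,      # fraction of misclassified records
        "tpr": tp / (tp + fn),           # recall / detection rate
        "fpr": fp / (fp + tn),           # false alarm rate
        "precision": tp / (tp + fp),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }

# Illustrative counts for a run over 1000 test records
m = evaluate(tp=780, fp=20, tn=180, fn=20)
print(m["accuracy"], m["tpr"], m["fpr"])  # 0.96 0.975 0.1
```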
Table 4.9: Evaluation and Results of K-means with Distance Functions Using the Highest 10 Features Ranked by IG

Parameter                   | K-means with Euclidean Function | K-means with Manhattan Function
Accuracy                    | 86 %   | 93.8492 %
Error Rate                  | 14 %   | 6.1508 %
Average True Positive Rate  | 86 %   | 93.8 %
Average False Positive Rate | 15.2 % | 5.9 %
Average Precision           | 89 %   | 94 %
Average Recall              | 86 %   | 93.8 %
Average F-Measure           | 85.6 % | 93.9 %
Mean absolute error         | 0.14   | 0.0966
Root mean squared error     | 0.3742 | 0.2116
Table 4.10: Evaluation and Results of K-means with Distance Functions Using the Highest 20 Features Ranked by IG

Parameter                   | K-means with Euclidean Function | K-means with Manhattan Function
Accuracy                    | 90.0875 % | 98.2143 %
Error Rate                  | 9.9125 %  | 1.7857 %
Average True Positive Rate  | 90.1 %    | 98.2 %
Average False Positive Rate | 9.8 %     | 1.9 %
Average Precision           | 90.8 %    | 98.2 %
Average Recall              | 90.8 %    | 98.2 %
Average F-Measure           | 90.8 %    | 98.2 %
Mean absolute error         | 0.1247    | 0.0366
Root mean squared error     | 0.243     | 0.1241
Figure 4.1 shows a graphical comparison of the two distance functions used with K-means over the different sets of data. The figure clearly indicates that, in terms of accuracy and error rate, the Manhattan function with the data subset of the 20 highest ranked features outperforms the Euclidean function and the other datasets.
Figure 4.1: Comparative Chart of Distance Functions Values Using K-means
4.5 Experiment 3: C4.5 Decision Tree (Second Layer)
This stage implements the C4.5 algorithm to generate a set of rules, based on the given training set containing DoS attack types, that can correctly classify those attack types. The C4.5 algorithm is very accurate in terms of classification, reducing and eliminating false alarms, since it is a supervised algorithm, and its output is clear (easy for users to understand). Three sets of rules generated by C4.5 have been tested and evaluated to determine the rule set with the best tree structure for the second layer of the proposed IDS. Detection rates and performance of the C4.5 Decision Tree using the three different rule sets are presented in Table 4.11. Rule 3 shows very high performance in terms of accuracy, with a low error rate and false positive rate.
Table 4.11: Evaluation and Results of the C4.5 Algorithm

Parameter                   | Rule 1    | Rule 2    | Rule 3
Accuracy                    | 93.5644 % | 98.5149 % | 99.9136 %
Error Rate                  | 6.4356 %  | 1.4851 %  | 0.0864 %
Average True Positive Rate  | 93.6 %    | 98.5 %    | 99.9 %
Average False Positive Rate | 6.7 %     | 1.5 %     | 0 %
Average Precision           | 94.1 %    | 98.5 %    | 99.9 %
Average Recall              | 93.6 %    | 98.5 %    | 99.9 %
Average F-Measure           | 93.5 %    | 98.5 %    | 99.9 %
Mean absolute error         | 0.0669    | 0.0163    | 0.0003
Root mean squared error     | 0.2415    | 0.1218    | 0.0186
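Once built, a C4.5 tree amounts to nested if/else rules over feature thresholds. The sketch below hand-codes one plausible rule set for the DoS types named in Figure 3.8; the thresholds, branch order, and leaf conditions are illustrative (drawn from well-known signatures of these attacks), not the rules actually generated by C4.5 in this work:

```python
def classify(record):
    """DoS-type classification expressed as nested if/else rules, the form
    a trained decision tree takes once converted to rules. Thresholds and
    branch order are illustrative, not the thesis's generated tree."""
    if record["protocol_type"] == "icmp":
        # ICMP floods: Smurf uses large echo replies, Ping-of-Death small fragments
        return "Smurf" if record["src_bytes"] > 39 else "Pod"
    if record["wrong_fragment"] > 0:
        return "Teardrop"  # overlapping-fragment attack
    if record["serror_rate"] >= 0.5 and record["dst_bytes"] == 0:
        return "Neptune"   # SYN flood: half-open connections, no data returned
    if record["src_bytes"] > 2000:
        return "Back"      # HTTP Back attack: abnormally long requests
    return "Normal"

sample = {"protocol_type": "icmp", "src_bytes": 1032,
          "wrong_fragment": 0, "serror_rate": 0.0, "dst_bytes": 0}
print(classify(sample))  # Smurf
```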
4.6 The Graphical User Interface (GUI)
The core of the designed system uses unsupervised learning for detecting network intrusions. Detection techniques based on unsupervised learning algorithms were implemented and tested, showing a very high detection rate with a reasonable true positive rate. The designed GUI enables user-friendly handling of the system core. The main system window is shown in Figure 4.2. The Java platform with JDK 7 was used to implement the model under the NetBeans 7.4 IDE, a powerful integrated development environment for developing applications on Java platforms. The main GUI of the detection model consists of a menu bar, several subwindows, and a number of buttons. The IDS menu contains commands for capturing and analyzing packets, stopping capture, and extracting normal and attack packets. The Log menu is related to operations on the log file, such as activating, opening, and clearing the log. The Exit menu contains an Exit command and an Exit with Clear Log File command. The Start Capture and Analyze button is linked directly to the IDS analysis algorithms. Normal packets are the outcome of the K-means normal cluster, while the classified DoS attack packets are the outcome of the K-means DoS cluster and are passed to the C4.5 algorithm for further classification. The analysis of packets appears in the packet analysis window, as shown in Figure 4.3.
Figure 4.2: Main GUI of the Detection Model
Figure 4.3: Capturing and Classification of Network Traffic by the System
The End Capture button stops capturing network traffic. The Normal and Attack Packets buttons extract the captured and classified packets with their types and present the normal packets in the normal packets window and the attack packets in the attack packets window, as shown in Figure 4.4.
Also, the main window contains an Exit command to terminate the execution of the application.
Figure 4.4: Extracting Normal and Attack Packets from Captured Packets
One of the most important functions of this system is the ability to activate a log file that records all actions captured by the IDS; all captured packets, with their classification type, time, and date, are stored in the log file, as shown in Figure 4.5. The user can open the log file via the Open Log button or clear it using the Clear Log button. The Exit & Clear Log button gives the user the option to terminate the application and clear the log file content.
Chapter Five Conclusions and Future Works
5.1 Conclusions
This work has produced an IDS that can efficiently detect DoS attacks by outside and inside intruders in a network-based system. The IDS monitors all network traffic behaviors and tests them to determine whether they are normal or DoS attacks. The use of machine learning algorithms (both supervised and unsupervised learning), including the C4.5 decision tree algorithm, enabled us to achieve the proposed goals.
Analyzing the results obtained and presented in Chapter 4 leads to the following conclusions:
1. To avoid poor feature selection, which negatively affects the overall performance of the presented IDS model, the Information Gain algorithm is used in this research to find suitable subsets of relevant features with optimal sensitivity and the highest discriminatory power for the selected attack category within the subset of the KDD dataset.
2. The K-means algorithm was chosen to evaluate the performance of an unsupervised learning method for anomaly detection. The results of evaluating K-means with feature selection confirm that a high detection rate can be achieved while maintaining a low false alarm rate (DR = 98.214%, Error Rate = 1.7857%), compared to the results obtained in [14], which uses K-means alone for analyzing and partitioning the KDD Cup 99 dataset (DR ≈ 96%, Error Rate ≈ 4%).
3. K-means clusters by similarity measured with a distance function; thus the metric used to calculate distances affects the overall performance and the processing time. The obtained results show that the Manhattan function achieves a higher detection rate with fewer false alarms than the Euclidean function (Manhattan distance: DR = 98.2143%, Error Rate = 1.7857%; Euclidean distance: DR = 90.0875%, Error Rate = 9.9125%).
4. Classification of DoS attack types has been made possible by applying the C4.5 Decision Tree (as a supervised learning algorithm) to the outcome of the K-means clustering algorithm. Once the tree structure is built, C4.5 can classify traffic according to it with a detection rate (DR) approaching 100% (DR = 99.9136%, Error Rate = 0.0864%), owing to its supervised nature, which makes it very accurate in detecting and classifying known patterns that it has learned.
5. There is no need to be concerned about new types of attacks, and the performance of the system is not reduced when the IDS encounters unknown attacks, because an unsupervised learning algorithm is used as the detection model in the first layer. With the adoption of the unsupervised layer, there is no need to carry out daily updates or insert new attack types into the IDS database, since the system clusters data based on similarity to a cluster centroid.
5.2 Future Works
There are several suggestions for how this research can be extended. They can be listed as follows:
1- The presented IDS model classifies network packets into two classes: normal and DoS. This model can therefore be extended in the future to classify network activities based on further intrusion categories.
2- The accuracy of the model may be further improved by using an extended version of the K-means algorithm called the K-modes algorithm. The K-modes algorithm extends the K-means paradigm to cluster large categorical datasets by using a simple matching dissimilarity measure for categorical objects, and modes instead of means for cluster centers.
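The two K-modes ingredients named above can be sketched briefly (a toy illustration, assuming purely categorical feature tuples; not a production implementation):

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Simple matching dissimilarity: the number of attribute positions
    in which the two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def mode_of(cluster):
    """K-modes cluster 'center' for categorical data: the most frequent
    value (mode) in each attribute position, instead of the mean."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

# Toy categorical records: (protocol_type, service, flag)
records = [("tcp", "http", "SF"), ("tcp", "http", "S0"), ("icmp", "ecr_i", "SF")]
print(matching_dissimilarity(records[0], records[1]))  # 1: only the flag differs
print(mode_of(records))  # ('tcp', 'http', 'SF')
```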
3- A graphical user interface (GUI) has been designed as part of implementing the system. Although the GUI of the implemented system gets its input data from the KDD dataset, it could be amended to obtain real packet data by adding and using some Java packages and classes.
References:
[1] Ashok K. Sahu, and Gulam Rasul, (2011), "Use of IT and Its Impact on Service Quality in an Academic Library", Library Philosophy and Practice (e-journal), Libraries at University of Nebraska-Lincoln, ISSN 1522-0222.
[2] Dhruv A. Patel, and Prof. Hasmukh Patel, (2014), "Detection and Mitigation of DDOS Attack against Web Server", International Journal of Engineering Development and Research (IJEDR), Volume 2, Issue 2, ISSN: 2321-9939.
[3] R. Vijayasarathy, (Feb. 2012), "A Systems Approach to Network Modelling for DDoS Attack Detection using Naïve Bayes Classifier", Master thesis, Department of Computer Science and Engineering, Indian Institute of Technology, Madras.
[4] D. Moore, V. Paxson, S. Savage, C. Shannon, S. Staniford, and N. Weaver, (Aug. 2003), "Inside the Slammer Worm", IEEE Security and Privacy, vol. 1, no. 4, pp. 33-39.
[5] C. Kruegel, F. Valeur, and G. Vigna, (2004), Intrusion Detection and Correlation: Challenges and Solutions, Springer Science + Business Media, Inc., Boston, ISBN: 978-0-387-23398-7, pp. 11-12.
[6] D. Gollmann, (2006), Computer Security, 2nd edition, John Wiley & Sons, New York, NY, ISBN 0470862939, pp. 251-252.
[7] G. Vigna, and R.A. Kemmerer, (April 2002), "Intrusion Detection: A Brief History and Overview", Security and Privacy, supplement to IEEE Computer, pp. 27-30.
[8] James P. Anderson, (April 1980), "Computer Security Threat Monitoring and Surveillance", Technical report, James P. Anderson Co., Fort Washington, Pennsylvania.
[9] Anderson D., Frivold T., and Valdes A., (May 1995), "Next Generation Intrusion Detection Expert System", Computer Science Laboratory.
[10] Kumer, S., (1995), "Classification and detection of computer Intrusion", Ph.D. Thesis, Department of Computer Science, Purdue University.
[11] Andrew Sung, Guadalupe Janoski, and Srinivas Mukkamala, (2002), "Intrusion detection using neural networks and support vector machines", Proceedings of the IJCNN International Joint Conference on Neural Networks, Honolulu, HI, ISSN: 1098-7576, DOI: 10.1109/IJCNN.2002.1007774, pp. 1702-1707.
[12] Aikaterini Mitrokotsa, and Christos Douligeris, (Dec. 2005), "Detecting Denial of Service Attacks Using Emergent Self-Organizing Maps", Proceedings of the Fifth IEEE International Symposium on Signal Processing and Information Technology.
[13] R. Rajesh, and Shina Sheen, (2008), "Network Intrusion Detection using Feature Selection and Decision tree classifier", TENCON IEEE Region 10 Conference, Hyderabad, DOI: 10.1109/TENCON.2008.4766847.
[14] Bian Ling, Meng Jianliang, and Shang Haikun, (2009), "The Application on Intrusion Detection Based on K-means Cluster Algorithm", IFITA International Forum on Information Technology and Applications, IEEE.
[15] Affendey, Ektefa, Memar, and Sidi, (March 2010), "Intrusion Detection Using Data Mining Techniques", IEEE International Conference on Information Retrieval and Knowledge Management (CAMP), Shah Alam, Selangor, pp. 200-203, DOI: 10.1109/INFRKM.2010.5466919.
[16] Bharti K., Jain S., and Shukla S., (2010), "Fuzzy K-mean Clustering via Random Forest for Intrusion Detection System", International Journal on Computer Science and Engineering, Vol. 02(6), pp. 2197-2200.
[17] A. Bhaskar, and B. K. Kumar, (June 2012), "Identifying Network Anomalies Using Clustering Technique in Weblog Data", International Journal of Computers & Technology, Volume 2, No. 3.
[18] Reyadh Sh. Naoum, and Wafa' S. Al-Sharafat, (April 2009), "Adaptive Framework for Network Intrusion Detection by Using Genetic-Based Machine Learning Algorithm", IJCSNS International Journal of Computer Science and Network Security, Vol. 9, No. 4.
[19] Munish Sharma, and Tajinder Kaur, (2014), "A Study on Network Intrusion Detection Based on Proactive Mechanism", International Journal of Emerging Research in Management & Technology, Volume 3, Issue 1, ISSN: 2278-9359.
[20] D.P. Gaikwad, Kunal Thakare, Sonali Jagtap, and Vaishali Budhawant, (Nov. 2012), "Anomaly Based Intrusion Detection System Using Artificial Neural Network and Fuzzy Clustering", International Journal of Engineering Research & Technology (IJERT), Vol. 1, Issue 9, ISSN: 2278-0181.
[21] Matt Bishop, (2002), Computer Security: Art and Science (1st ed.), Addison-Wesley Professional, ISBN: 0-201-44099-7, pp. 10-11.
[22] Deeman Y. Mahmood, and Dr. Mohammed A. Hussein, (Dec. 2013), "Intrusion Detection System Based on K-Star Classifier and Feature Set Reduction", International Organization of Scientific Research Journal of Computer Engineering (IOSR-JCE), Volume 15, Issue 5, DOI: 10.9790/0661-155107112, pp. 107-112.
[23] Deeman Yousif Mahmood, and Dr. Mohammed Abdulla Hussein, (2014), "Analyzing NB, DT and NBTree Intrusion Detection Algorithms", Journal of Zankoy Sulaimani – Part A (JZS-A), Volume 16, No. 1.
[24] Ajith Abraham, Crina Grosan, and Yuehui Chen, (2005), "Cyber Security and the Evolution in Intrusion Detection Systems", Journal of Engineering and Technology, Volume 1, Issue 1, pp. 74-81.
[25] R. Bace, (1999), "An introduction to intrusion Detection Assessment for system and network security management", Infidel Inc., prepared for ICSA Inc.
[26] Ganesh Prasad, Sandip Sonawane, and Shailendra Pardeshi, (2012), "A survey on intrusion detection techniques", World Journal of Science and Technology, 2(3), ISSN: 2231-2587, pp. 127-133.
[27] Neethu B, (March 2013), "Adaptive Intrusion Detection Using Machine Learning", IJCSNS International Journal of Computer Science and Network Security, Volume 13, No. 3, pp. 118-124.
[28] Abraham A., Panda M., and Patra M. R., (2010), "Discriminative Multinomial Naïve Bayes for Network Intrusion Detection", IEEE Sixth International Conference on Information Assurance and Security (IAS), DOI: 10.1109/ISIAS.2010.5604193, pp. 5-10.
[29] B. Pearlmutter, C. Warrender, and S. Forrest, (1999), "Detecting intrusions using system calls: alternative data models", IEEE Symposium on Security and Privacy, pp. 133-145.
[30] Dr. Sameer Shrivastava, (April 2012), "Case Study on JAVA based IDS", International Journal of Scientific & Engineering Research, Volume 3, No. 4.
[31] Henok Alene, (Oct. 2011), "Graph Based Clustering for Anomaly Detection in IP Networks", Master thesis, Department of Information and Computer Science, School of Science, Aalto University.
[32] C. Verbowski, H. J. Wang, J. R. Lorch, P. M. Chen, S. T. King, and Y. Wang, (2006), "SubVirt: Implementing malware with virtual machines", IEEE Symposium on Security and Privacy, pp. 314-327.
[33] C. Kreibich, M. Handley, and V. Paxson, (2001), "Network intrusion detection: Evasion, traffic normalization, and end-to-end protocol semantics", USENIX Security Symposium, pp. 115-131.
[34] Rebecca Bace, and Peter Mell, (2001), "NIST Special Publication on Intrusion Detection Systems", Infidel, Inc., Scotts Valley, CA, National Institute of Standards and Technology.
[35] Leonard J. LaPadula, and Therese R. Metcalf, (2000), "Intrusion Detection System Requirements", Center for Integrated Intelligence Systems, Bedford, Massachusetts, MITRE Corporation.
[36] A. Appa Rao, B. Chakravarthy, K. Marx, P. Kiran, and P. Srinivas, (2006), "A Java Based Network Intrusion Detection System (IDS)", Proceedings of the 2006 IJME-INTERTECH Conference.
[37] A. Movaghar, and F. Sabahiand, (2008), "Intrusion Detection: A Survey", IEEE Third International Conference on Systems and Networks Communications, DOI: 10.1109/ICSNC.2008.44.
[38] Anil A. Ahlawat, and Brijpal Singh, (2013), "Intrusion detection of Network Attacks Using Artificial Neural Networks & Fuzzy Logic", International Journal of Engineering & Management Technology, ISSN: 2320-7043, Volume 1, Issue 1, pp. 53-66.
[39] Elham Hormozi, Hadi Hormozi, and Hamed Rahimi Nohooji, (2012), "The Classification of the Applicable Machine Learning Methods in Robot Manipulators", International Journal of Machine Learning and Computing, Volume 2, No. 5, pp. 560-563.
[40] Taiwo Oladipupo Ayodele, (2010), New Advances in Machine Learning, Chapter three: Types of Machine Learning Algorithms, pp. 20-23, ISBN: 978-953-307-034-6, InTech, University of Portsmouth, United Kingdom.
[41] Salahedin Ali Namroush, and Shauki Abdusalam Fatshul, (2006), "Security issues, attack trends related to the confidentiality, integrity, and availability of information assets on an organization's computer system", Proceedings of the Postgraduate Annual Research Seminar, Center of Advanced Software Engineering (CASE), University Technology Malaysia.
[42] Yash Batra, (2013), "IP Spoofing", International Indexed & Refereed Research Journal, ISSN: 0974-2832, Volume V, Issue 59, pp. 44-46.
[43] D.K. Bhattacharyya, J.K. Kalit, Monowar H. Bhuyan, N. Hoque, and R.C. Baishya, (2014), "Network attacks: Taxonomy, tools and systems", Elsevier Journal of Network and Computer Applications, pp. 307-324.
[44] Steven J. Templeton, and Karl E. Levitt, (2003), "Detecting spoofed packets", IEEE DARPA Information Survivability Conference and Exposition Proceedings, Vol. 1, DOI: 10.1109/DISCEX.2003.1194882, pp. 164-175.
[45] Sharmin Rashid, and Subhra Prosun Paul, (2013), "Proposed Methods of IP Spoofing Detection & Prevention", International Journal of Science and Research (IJSR), India, Online ISSN: 2319-7064, Volume 2, Issue 8, pp. 438-444.
[46] Mridu Sahu, and Rainey C. Lal, (2012), "Controlling IP Spoofing through Packet Filtering", International Journal of Computer Technology & Applications (IJCTA), Volume 3 (1), pp. 155-159.
[47] Victor Velasco, (2000), "Introduction to IP Spoofing", SANS Institute InfoSec Reading Room, URL: http://www.sans.org/reading-room/whitepapers/threats/introduction-ip-spoofing-959
[48] Srinivas Aluvala, (2011), "Inter-domain Packet Filters To Control IP-Forging", Research Journal of Computer Systems Engineering - An International Journal, ISSN: 2230-8563, Volume 2, Issue 2, pp. 67-72.
[49] M. Rajasekhar, and S. Kishor Kumar, (2012), "Group Signature Protocol for Security Treatment for Blocking Misbehaving Users", International Journal of Research Sciences and Advanced Engineering, ISSN: 2319-6106, Volume 2 (5), pp. 30-34.
[50] G. Sindhuri, K. Sachin, K. Sravani, and Y. Madhavi Latha, (2012), "A novel approach for the detection of SYN Flood Attack", International Journal of Computer Trends and Technology, ISSN: 2231-2803, Volume 3, Issue 2, pp. 286-289.
[51] Gurjinder Kaur, V. K. Jain, and Yogesh Chaba, (2011), "Distributed Denial of Service Attacks in Mobile Adhoc Networks", World Academy of Science, Engineering and Technology (WASET), Volume 5, No. 1, pp. 591-593.
[52] A. Vasavi, B V Ramana Murthy, and Vuppu Padmakar, (2014), "Significances and Issues of Network Security", International Journal of Advanced Research in Computer and Communication Engineering, ISSN: 2319-5940, Volume 3, Issue 6.
[53] Jon Erickson, (2008), Hacking: The Art of Exploitation (2nd ed.), San Francisco, ISBN: 1-59327-144-1, pp. 256-258.
[54] DARPA Intrusion Detection Evaluation, Lincoln Laboratory, Massachusetts Institute of Technology (MIT), article about the Back DoS attack, URL: http://www.ll.mit.edu/mission/communications/cyber/CSTcorpora/ideval/docs/attackDB.html#back
[55] Daniel Barbar, and Sushil Jajodia, Application of Data Mining in Computer Security, George Mason University, Kluwer Academic Publishers, Boston, Dordrecht, London, pp. 34-35.
[56] A. M. Chandrashekhar, and K. Raghuveer, (2012), "Performance evaluation of data clustering techniques using KDD Cup-99 Intrusion detection data set", International Journal of Information & Network Security (IJINS), ISSN: 2089-3299, Volume 1, No. 4, pp. 294-305.
[57] Khoshgoftaar M., Napolitano A., Van J., and Wald R., (2009), "Feature Selection with High-Dimensional Imbalanced Data", IEEE International Conference on Data Mining Workshops, ICDMW '09, DOI: 10.1109/ICDMW.2009.35, pp. 507-514.
[58] Article about Clustering high-dimensional data, URL: http://en.wikipedia.org/wiki/Clustering_high-dimensional_data.
[59] GU Chun-hua, LIN Jia-jun, and ZHANG Xue-qin, (2006), "Intrusion Detection System Based on Feature Selection and Support Vector Machine", First International Conference on Communications and Networking in China (ChinaCom), IEEE, DOI: 10.1109/CHINACOM.2006.344739.
[60] L. Ladha, and T. Deepa, (2011), "Feature Selection Methods and algorithms", International Journal on Computer Science and Engineering (IJCSE), ISSN: 0975-3397, Volume 3, No. 5.
[61] Lloyd A. Smith, and Mark A. Hall, (1999), "Feature Selection for Machine Learning: Comparing a Correlation-based Filter Approach to the Wrapper", American Association for Artificial Intelligence.
[62] Cheng-Hong Yang, Cheng-San Yang, Jung-Chike Li, and Li-Yeh Chuang, (2008), "Information Gain with Chaotic Genetic Algorithm for Gene Selection and Classification Problem", IEEE International Conference on Systems, Man and Cybernetics, DOI: 10.1109/ICSMC.2008.4811433.
[63] Kumar V., Steinbach M., and Tan P.N., (2006), Introduction to Data Mining, Addison-Wesley.
[64] Jain A. K., (2010), "Data clustering: 50 years beyond K-means", Pattern Recognition Letters, Elsevier B.V., Volume 31, Issue 8, pp. 651-666.
[65] P.G. Student, (2014), "Evaluation of Similarities Measure in Document Clustering", International Journal of Science and Research (IJSR), ISSN: 2319-7064, Volume 3, Issue 1, pp. 39-41.
[66] Jayna Shah, Neha Soni, and Rimi Gupta, (2012), "Analytical Comparison of Some Traditional Partitioning based and Incremental Partitioning based Clustering Methods", International Journal of Computer Applications, ISSN: 0975-8887, Volume 59, No. 10, pp. 8-12.
[67] Dheeraj Panwar, Himadri Chauhan, and Vipin Kumar, (2013), "K-Means Clustering Approach to Analyze NSL-KDD Intrusion Detection Dataset", International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume 3, Issue 4.
[68] D.K. Ghosh, and P. Indira Priya, (2012), "K-means Clustering Algorithm Characteristics Differences based on Distance Measurement", International Journal of Computer Applications, ISSN: 0975-8887, Volume 59, No. 14.
[69] Ahmed Hassan, A. M. Riad, Ibrahim Elhenawy, and Nancy Awadallah, (2013), "Visualize Network Anomaly Detection by Using K-means Clustering Algorithm", International Journal of Computer Networks & Communications (IJCNC), Vol. 5, No. 5.
[70] Neha Jain, and Shikha Sharma, (2012), "The Role of Decision Tree Technique for Automating Intrusion Detection System", International Journal of Computational Engineering Research, Volume 2, Issue 4.
[71] Upendra, and Yogendra Kumar, (2012), "An Efficient Intrusion Detection Based on Decision Tree Classifier Using Feature Reduction", International Journal of Scientific and Research Publications, ISSN: 2250-3153, Volume 2, Issue 1.
[72] Article about Decision Trees, URL: http://en.wikipedia.org/wiki/Decision_tree.
[73] J. Ross Quinlan, (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, Inc., San Mateo, CA.
[74] Jeff Markey, (2011), "Using Decision Tree Analysis for Intrusion Detection: A How-To Guide", SANS Institute InfoSec Reading Room, URL: http://www.sans.org/reading-room/whitepapers/detection/decision-treeanalysis-intrusion-detection-how-to-guide-33678
[75] ID3 (Iterative Dichotomiser 3), an algorithm invented by J. Ross Quinlan, URL: http://en.wikipedia.org/wiki/ID3_algorithm.
[76] J. L. Rana, Manasi Gyanchandani, and R. N. Yadav, (2010), "Intrusion Detection using C4.5: Performance Enhancement by Classifier Combination", ACEEE Int. J. on Signal & Image Processing, Volume 01, No. 03.
[77] Eibe Frank, Ian H. Witten, and Mark A. Hall, (2011), Data Mining: Practical Machine Learning Tools and Techniques, Elsevier Inc.
[78] Gaffney John E., and Ulvila J.W., (2001), "Evaluation of intrusion detectors: a decision theory approach", IEEE Symposium on Security and Privacy (S&P) Proceedings.
[79] Asha Gowda Karegowda, A. S. Manjunath, and M.A. Jayaram, (2010), "Comparative Study of Attribute Selection Using Gain Ratio and Correlation Based Feature Selection", International Journal of Information Technology and Knowledge Management, Volume 2, No. 2, pp. 271-277.
[80] Knowledge Discovery in Databases, KDD Cup 99 benchmark dataset, URL: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[81] Ali A. Ghorbani, Ebrahim Bagheri, Mahbod Tavallaee, and Wei Lu, (2009), "A Detailed Analysis of the KDD CUP 99 Dataset", Proceedings of the IEEE Symposium on Computational Intelligence in Security.
and defense
application. [82] Svein J. Knapskog, Sylvain Gombault, and Wei Wang, (2009),"Attribute Normalization in Network Intrusion Detection", IEEE 10th International Symposium on Pervasive Systems, Algorithms, and Networks (ISPAN), DOI: 10.1109/I-SPAN.2009.49, pp. 448-453. [83] Article about: Euclidean Distance, URL: http://en.wikipedia.org/wiki/Euclidean_distance. [84] Article about: Manhattan Distance, URL: http://en.wikipedia.org/wiki/Manhattan_distance.
Appendices
Appendix A
Table A1: Description of the 41 Features of a TCP Connection

No. | Feature | Description | Type
1 | Duration | Length of the connection (number of seconds) | Continuous
2 | Protocol_type | Type of the connection protocol (tcp, udp) | Discrete
3 | Service | Network service on the destination (telnet, ftp) | Discrete
4 | Flag | Status flag of the connection | Discrete
5 | Src_bytes | Number of data bytes sent from source to destination | Continuous
6 | Dst_bytes | Number of data bytes sent from destination to source | Continuous
7 | Land | 1 if connection is from/to the same host/port; 0 otherwise | Discrete
8 | Wrong_fragment | Number of wrong fragments | Continuous
9 | Urgent | Number of urgent packets | Continuous
10 | Hot | Number of "hot" indicators | Continuous
11 | Num_failed_logins | Number of failed logins | Continuous
12 | Logged_in | 1 if successfully logged in; 0 otherwise | Discrete
13 | Num_compromised | Number of "compromised" conditions | Continuous
14 | Root_shell | 1 if root shell is obtained; 0 otherwise | Discrete
15 | Su_attempted | 1 if "su root" command attempted; 0 otherwise | Discrete
16 | Num_root | Number of "root" accesses | Continuous
17 | Num_file_creations | Number of file creation operations | Continuous
18 | Num_shells | Number of shell prompts | Continuous
19 | Num_access_files | Number of operations on access control files | Continuous
20 | Num_outbound_cmds | Number of outbound commands in an ftp session | Continuous
21 | Is_hot_login | 1 if the login belongs to the "hot" list; 0 otherwise | Discrete
22 | Is_guest_login | 1 if the login is a "guest" login; 0 otherwise | Discrete
23 | Count | Number of connections to the same host as the current connection in the past two seconds | Continuous
24 | Srv_count | Number of connections to the same service as the current connection in the past two seconds | Continuous
25 | Serror_rate | % of connections (same host) that have "SYN" errors | Continuous
26 | Srv_serror_rate | % of connections (same service) that have "SYN" errors | Continuous
27 | Rerror_rate | % of connections (same host) that have "REJ" errors | Continuous
28 | Srv_rerror_rate | % of connections (same service) that have "REJ" errors | Continuous
29 | Same_srv_rate | % of connections to the same service | Continuous
30 | Diff_srv_rate | % of connections to different services | Continuous
31 | Srv_diff_host_rate | % of connections to different hosts | Continuous
32 | Dst_host_count | Count of connections having the same destination host | Continuous
33 | Dst_host_srv_count | Count of connections having the same destination host and using the same service | Continuous
34 | Dst_host_same_srv_rate | % of connections having the same destination host and using the same service | Continuous
35 | Dst_host_diff_srv_rate | % of different services on the current host | Continuous
36 | Dst_host_same_src_port_rate | % of connections to the current host having the same source port | Continuous
37 | Dst_host_srv_diff_host_rate | % of connections to the same service coming from different hosts | Continuous
38 | Dst_host_serror_rate | % of connections to the current host that have an S0 error | Continuous
39 | Dst_host_srv_serror_rate | % of connections to the current host and specified service that have an S0 error | Continuous
40 | Dst_host_rerror_rate | % of connections to the current host that have an RST error | Continuous
41 | Dst_host_srv_rerror_rate | % of connections to the current host and specified service that have an RST error | Continuous
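Table A1's layout maps directly onto the comma-separated records of the KDD Cup 99 files. The following sketch is a hypothetical helper, not code from the thesis; it assumes the class label is appended as a 42nd field, as in the published dataset files. The three symbolic features (Protocol_type, Service, Flag) are kept as strings, and every other field is converted to a number:

```python
# Feature names in the order of Table A1 (assumed to match the CSV layout).
KDD_FEATURES = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins",
    "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files",
    "num_outbound_cmds", "is_hot_login", "is_guest_login", "count",
    "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
    "srv_rerror_rate", "same_srv_rate", "diff_srv_rate", "srv_diff_host_rate",
    "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate",
    "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate",
    "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate",
]

# Features 2-4 are symbolic; all others are numeric (binary flags parse as 0.0/1.0).
SYMBOLIC = {"protocol_type", "service", "flag"}

def parse_record(line):
    """Split one KDD CSV record into a {feature: value} dict plus its label."""
    fields = line.strip().rstrip(".").split(",")  # drop the trailing '.' of the label
    values = fields[:41]
    label = fields[41] if len(fields) > 41 else None
    row = {name: (v if name in SYMBOLIC else float(v))
           for name, v in zip(KDD_FEATURES, values)}
    return row, label

# A normal-traffic record in the published file format (hypothetical values).
example = "0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8," \
          "0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00," \
          "0.00,0.00,0.00,0.00,normal."
row, label = parse_record(example)
```

Converting the discrete 0/1 features to floats is harmless for the clustering stage; only the three symbolic features need a separate encoding step before distance computations.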
Table A2: Description of Flag Values

Flag | Description
RSTOS0 | Originator sent a SYN followed by a RST; a SYN-ACK from the responder was never seen
RSTR | Connection established, responder aborted
RSTO | Connection established, originator aborted (sent a RST)
OTH | No SYN seen, just midstream traffic (a "partial connection" that was not later closed)
REJ | Connection attempt rejected
S0 | Connection attempt seen, no reply
S1 | Connection established, not terminated
S2 | Connection established and close attempt by originator seen (but no reply from responder)
S3 | Connection established and close attempt by responder seen (but no reply from originator)
SF | Normal establishment and termination
SH | Originator sent a SYN followed by a FIN; a SYN-ACK from the responder was never seen (hence the connection was "half" open)
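Because the flag attribute listed in Table A2 is symbolic, distance-based methods such as K-means need it encoded numerically first. The helper below is illustrative, not taken from the thesis; it shows plain integer (label) encoding alongside a one-hot alternative:

```python
# The eleven flag values of Table A2, in a fixed (assumed) order.
FLAGS = ["OTH", "REJ", "RSTO", "RSTOS0", "RSTR",
         "S0", "S1", "S2", "S3", "SF", "SH"]

# Label encoding: each flag gets a stable integer code.
FLAG_CODES = {flag: i for i, flag in enumerate(FLAGS)}

def encode_flag(flag):
    """Return the integer code for a connection-status flag."""
    return FLAG_CODES[flag]  # raises KeyError for unknown flags

def one_hot_flag(flag):
    """Return an 11-element 0/1 vector for the same flag."""
    return [1 if f == flag else 0 for f in FLAGS]
```

Label encoding implies an ordering between flags that does not really exist; one-hot encoding avoids that at the cost of eleven extra columns, which is why the choice matters for distance-based clustering.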
Appendix B
List of Publications: the majority of the work in this thesis has been published or accepted for publication:

1- Deeman Y. Mahmood and Dr. Mohammed A. Hussein, (Dec. 2013), "Intrusion Detection System Based on K-Star Classifier and Feature Set Reduction", International Organization of Scientific Research / Journal of Computer Engineering (IOSR/JCE), ISSN: 2278-8727, Volume 15, Issue 5. The paper has been indexed in:
- ANED (American National Engineering Database), with ANED-DDL (Digital Data Link) number: 11.0661/iosr-jce-R0155107112.
- CrossRef, with DOI (Digital Object Identifier) number: 10.9790/0661155107112.

2- Deeman Yousif Mahmood and Dr. Mohammed Abdulla Hussein, (2014), "Analyzing NB, DT and NBTree Intrusion Detection Algorithms", Journal of Zankoy Sulaimani – Part A (JZS-A), Volume 16, No. 1.
A Multi-mode Internet Protocol Intrusion Detection System

A thesis submitted to the Council of the Faculty of Science and Science Education, School of Science, University of Sulaimani, as partial fulfillment of the requirements for the degree of Master of Science in Computer

By Deeman Yousif Mahmood
B.Sc. in Computer (2008), University of Kirkuk

Supervised by
Dr. Mohammed Abdullah Hussein
Assistant Professor

Pushpar 2714 (Kurdish calendar)
June 2014 (A.D.)
Abstract

Intrusion Detection Systems (IDS) have recently gained great importance and necessity in the field of network security, especially after the increase in attackers' capabilities and the advancement of technology and techniques. Most of the services offered over the Internet, of all kinds, suffer from the problem of not being easily available to the users authorized to use them, because of denial-of-service attacks, which are the main subject of this research.

The KDD Cup 99 dataset was used as the main source of network packets; this dataset has been used extensively with machine learning techniques. It is considered a benchmark dataset that works well in this field, and it has been shown that machine learning techniques can be useful for intrusion detection.

In this work, two machine learning algorithms were applied to the proposed protection model: the K-means clustering algorithm for the unsupervised learning phase, and the Decision Tree algorithm for the supervised learning phase. These algorithms were used together with the Information Gain algorithm to rank and order the features of the packets. Although K-means clustering had been used before for intrusion detection, adding feature selection and ranking made it possible to obtain better results. Using the K-means algorithm made it possible to classify network packets into normal packets and denial-of-service packets, and by adding the Decision Tree algorithm it became possible to classify the denial-of-service attacks.

The result of this work is an effective system (application) for detecting intrusions and attacks on the Internet, with a high detection rate and a low false-alarm rate (detection rate = 98.2153 % for the K-means stage and 99.3133 % for the decision-tree stage).
A Multi-mode Internet Protocol Intrusion Detection System

A thesis submitted to the Council of the Faculty of Science and Science Education, School of Science, University of Sulaimani, as partial fulfillment of the requirements for the degree of Master of Science in Computers

By Deeman Yousif Mahmood
B.Sc. in Computer Science (2008), University of Kirkuk

Supervised by
Dr. Mohammed Abdullah Hussein
Assistant Professor

Sha'ban 1435 (A.H.)
June 2014 (A.D.)
Abstract

Intrusion Detection Systems (IDS) have recently gained increasing importance and necessity in the field of network security, especially with the growing skill of attackers and the development of technology and techniques. The enormous growth of data traffic on the Internet has made it difficult for any system to detect all the existing types of intrusion, which are in continuous development. Most of the services offered on the Internet, of all their kinds, suffer from the problem of being unavailable to their authorized and licensed users because of denial-of-service attacks, which are the main subject of this research.

To demonstrate the capability of the proposed system, the KDD Cup 99 dataset was used as the main source of network packets; this dataset has been used extensively with machine learning techniques over the past two decades to evaluate intrusion detection systems. It is considered a benchmark dataset that works well in this field and has proven that machine learning techniques can be useful for intrusion detection.

In this work, two machine learning algorithms were applied to the proposed protection model: the K-means clustering algorithm for the unsupervised learning phase and the Decision Tree algorithm for the supervised learning phase. These algorithms were used together with the Information Gain algorithm to rank and order the packet features. Although the K-means algorithm had been used before for intrusion detection, adding feature selection and ranking enabled us to obtain better results with shorter processing time.

Using the K-means algorithm enabled us to classify network packets into normal packets or "denial-of-service" packets, and by adding the Decision Tree algorithm it became possible to classify the "denial-of-service" attacks.

The result of this work is an effective system (application) for detecting intrusions and attacks on the Internet, with a high detection rate for violations and a low false-alarm rate according to the results obtained (detection rate = 98.2153 % for the K-means stage and 99.3133 % for the decision-tree stage).
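The two-stage design described in the abstract (unsupervised K-means separating normal traffic from denial-of-service traffic, followed by a supervised classifier over the attack packets) can be illustrated on a single feature. This is a toy sketch under stated assumptions: one-dimensional "connection count" values, k = 2, and min/max initialization; the thesis itself works on the full 41-feature KDD records.

```python
def two_means(points, iters=20):
    """Lloyd's K-means with k = 2 on 1-D points, initialized at min/max."""
    c_lo, c_hi = float(min(points)), float(max(points))
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        lo = [p for p in points if abs(p - c_lo) <= abs(p - c_hi)]
        hi = [p for p in points if abs(p - c_lo) > abs(p - c_hi)]
        # Update step: recompute each centroid as its cluster mean.
        c_lo = sum(lo) / len(lo) if lo else c_lo
        c_hi = sum(hi) / len(hi) if hi else c_hi
    return c_lo, c_hi

# Toy per-host connection counts: small values look normal, large ones look
# like a flood. K-means separates the two groups without any labels.
counts = [2, 3, 1, 2, 250, 240, 260, 255]
c_lo, c_hi = two_means(counts)
suspected_dos = [p for p in counts if abs(p - c_hi) < abs(p - c_lo)]
```

In the thesis pipeline, the records landing in the attack-side cluster would then be handed to the decision-tree stage, which labels the specific denial-of-service attack type.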