Phishing Website Detection Using Random Forest with Particle Swarm Optimization

A Thesis
Submitted to the Council of the College of Science at the University of Sulaimani in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer

By
Salwa Omer Mohammed
B.Sc. Computer Science (2008), University of Kirkuk

Supervised by
Dr. Tarik A. Rashid, Professor

April 2017

Nawroz 2717

Supervisor Certification

I certify that the thesis titled “Phishing Website Detection Using Random Forest with Particle Swarm Optimization”, accomplished by (Salwa Omer Mohammed), was prepared under my supervision in the College of Science at the University of Sulaimani, as partial fulfillment of the requirements for the degree of Master of Science in Computer.

Signature:
Supervisor: Dr. Tarik A. Rashid
Scientific Title: Professor
Date: / / 2017

In view of the available recommendation, I forward this thesis for debate by the examining committee.

Signature:
Name: Dr. Aree Ali Mohammed
Scientific Title: Professor
Head of the Department
Date: / / 2017

Linguistic Evaluation Certification

I hereby certify that this thesis titled “Phishing Website Detection Using Random Forest with Particle Swarm Optimization”, prepared by (Salwa Omer Mohammed), has been read and checked. After indicating all the grammatical and spelling mistakes, the thesis was given back to the candidate to make the adequate corrections. After the second reading, I found that the candidate had corrected the indicated mistakes. Therefore, I certify that this thesis is free from mistakes.

Signature:
Name: Dr. Sarah K. Othman
Position: English Department, College of Languages, University of Sulaimani
Date: / / 2017

Examining Committee Certification

We certify that we have read this thesis entitled “Phishing Website Detection Using Random Forest with Particle Swarm Optimization”, prepared by (Salwa Omer Mohammed), and as Examining Committee we examined the student in its content and in what is connected with it, and in our opinion it meets the basic requirements for the degree of Master of Science in Computer.

Signature:
Name: Dr. Suzan Abdulla Mahmood
Title: Assistant Professor
Date: / / 2017
(Chairman)

Signature:
Name: Dr. Nzar A. Ali
Title: Assistant Professor
Date: / / 2017
(Member)

Signature:
Name: Dr. Abdulbasit Kamil Faeq
Title: Lecturer
Date: / / 2017
(Member)

Signature:
Name: Dr. Tarik A. Rashid
Title: Professor
Date: / / 2017
(Member and Supervisor)

Approved by the Dean of the College of Science.
Signature:
Name: Dr. Bakhtiar Q. Aziz
Title: Professor
Date: / / 2017

Dedication

This thesis is dedicated to:
My father and my mother, may God have mercy on them;
My husband and my children;
My sisters and brothers.

Acknowledgement

First of all, my great thanks to Allah for giving me the ability and faith to fulfill this work. Foremost, I would like to express my sincere gratitude to my supervisor, Prof. Dr. Tarik A. Rashid, for his valuable guidance, encouragement and expert advice, and for showing me the right way of carrying out this thesis. I would like to extend my special thanks to the Head of the Computer Department, Dr. Aree Ali Mohammed, for his knowledge and support during the course stage. Also, I would like to thank the lecturers of the Computer Department, Faculty of Science and Science Education, School of Science, for their support and for the opportunities and facilities provided for carrying out this work.

Finally, special thanks to my husband and children, who were with me in every step throughout my entire studies, for their endless moral support during my life. Special thanks also to my mother-in-law for her support.

Salwa O. Mohammed

Abstract

Noticeably, different phishing website environments include different types of information which could be a threat to all website users, such as incitement to hack sites and encouragement to spread such notions through learning to steal from networks, Wi-Fi, websites, internet forums, Facebook, email accounts, etc. The proposed work deals with protecting users from such sites by designing a method that takes full advantage of machine learning and intelligent systems' capabilities to recognize the informative contents. The ultimate goal of this research is to understand the system behavior and determine the best solution to secure vulnerable users, state and society. Random Forest (RF), Support Vector Machine (SVM) and decision tree (J48) methods are used as classifiers; a Developed Stop Word (DSW) filter that handles a single regular expression is combined with RF and termed RFDSW, while Term Frequency-Inverse Document Frequency (TF-IDF) is used for feature extraction. Stemming, stop word removal and tokenizing are used for data preprocessing, and Particle Swarm Optimization (PSO) is used, instead of traditional methods, to find the best parameter values for the RF classifier. In this proposed approach, a novel dataset is automatically collected, and the accuracy of the RF classifier is used as the fitness function of each particle. Experimental results confirm that the new RFPSO approach, Random Forest with Particle Swarm Optimization, works better in terms of accuracy than default RF, SVM and J48. The results also demonstrate that RFPSO is effectively capable of detecting phishing websites, producing a high detection rate and a low error rate (detection rate = 98.2136% and error rate = 1.7147% for RFPSO).

Contents

Abstract ........................................................... i
Contents ........................................................... ii
List of Tables ..................................................... vi
List of Figures .................................................... vii
List of Abbreviations .............................................. ix

Chapter One: Introduction
1.1 Overview ....................................................... 1
1.2 Problem Statements ............................................. 2
1.3 Literature Review .............................................. 3
1.4 The Aim of the Thesis .......................................... 6
1.5 Thesis Outlines ................................................ 6

Chapter Two: Phishing Website Detection and Data Mining
2.1 Overview ....................................................... 8
2.2 Data Mining .................................................... 9
2.3 Web Mining ..................................................... 11
2.4 Components of Web Mining ....................................... 12
2.4.1 Information Retrieval (Resource Discovery) ................... 12
2.4.2 Information Extraction and Preprocessing ..................... 13
2.4.3 Generalization ............................................... 13
2.4.4 Analysis ..................................................... 13
2.5 Category of Web Mining ......................................... 14
2.5.1 Web Content Mining ........................................... 14
2.5.2 Web Structure Mining ......................................... 16
2.5.2.1 Hyperlinks ................................................. 16
2.5.2.2 Document Structure ......................................... 16
2.5.3 Web Usage Mining ............................................. 16
2.5.3.1 Web Server Data ............................................ 17
2.5.3.2 Application Server Data .................................... 17
2.5.3.3 Application Level Data ..................................... 17
2.6 Text Mining Approaches ......................................... 17
2.7 Typical Model of the General Phishing Website Detection Process  18
2.7.1 Data Collection .............................................. 19
2.7.2 Pre-processing ............................................... 20
2.7.2.1 Tokenizing ................................................. 20
2.7.2.2 Stemming ................................................... 21
2.7.2.3 Stop Word Removing ......................................... 22
2.7.3 Feature Extraction ........................................... 23
2.7.4 Classification and Optimization Stage ........................ 24
2.7.4.1 Random Forest .............................................. 24
2.7.4.1.1 Bagging .................................................. 25
2.7.4.1.2 Bootstrap ................................................ 25
2.7.4.2 Particle Swarm Optimization ................................ 28
2.7.4.3 Support Vector Machine ..................................... 29
2.7.4.4 Decision Tree .............................................. 32
2.7.5 Evaluation and Performance Measurements ...................... 33

Chapter Three: Proposed System Methodology
3.1 Overview ....................................................... 36
3.2 System Structure ............................................... 36
3.3 The Proposed System for Phishing Websites ...................... 38
3.4 Data Collection ................................................ 41
3.4.1 Searching .................................................... 43
3.4.2 Get Text from Web Component .................................. 43
3.4.3 Cleaning ..................................................... 44
3.4.4 Labeling ..................................................... 44
3.4.5 Arabic Encoding .............................................. 44
3.4.6 Saving ....................................................... 47
3.5 Developing a Stop Words Filter to Improve Stop Word Handling ... 47
3.6 Pre-processing ................................................. 49
3.6.1 Stop Words Removing .......................................... 51
3.6.2 Tokenizing ................................................... 51
3.6.3 Stemming ..................................................... 52
3.7 Feature Extraction Stage ....................................... 52
3.8 Classification ................................................. 53
3.8.1 Random Forest ................................................ 54
3.8.2 Support Vector Machine ....................................... 57
3.8.3 Decision Tree ................................................ 58
3.9 Implementation of Practical Work ............................... 59
3.10 The Developed RFPSO Approach .................................. 60
3.10.1 Particle Swarm Optimization ................................. 60
3.10.2 Algorithm: RFPSO ............................................ 61
3.10.3 Applying RFPSO .............................................. 63

Chapter Four: Results and Discussion
4.1 Overview ....................................................... 68
4.2 Training and Testing Dataset ................................... 68
4.3 Experiment 1: Results of Pre-processing ........................ 69
4.3.1 Transformation and Normalization ............................. 69
4.4 Experiment 2: Classification and Optimization .................. 73
4.4.1 Model 1: RF with the Developed Filter ........................ 73
4.4.2 Model 2: RFPSO ............................................... 74
4.5 Experiment 3: Classification Using Default RF, SVM, and J48 .... 77
4.6 Bar Chart of Experimental Results .............................. 80
4.7 Evaluation of Experimental Results ............................. 82
4.8 The Graphical User Interface (GUI) ............................. 83
4.8.1 Data Collection (Data Extraction) Graphical User Interface ... 83
4.8.2 The Main Detection System Graphical User Interface ........... 84

Chapter Five: Conclusions and Future Recommendations
5.1 Conclusions .................................................... 87
5.2 Future Recommendations ......................................... 89

Appendix ........................................................... 90
References ......................................................... 97
Publication ........................................................ 106

List of Tables

Table No.  Title                                                              Page No.
2.1   Examples of Word Stems Derived from Derivation Forms of the Word
      “Write” in Arabic ................................................ 22
2.2   Confusion Matrix Components ...................................... 35
3.1   Number of Documents for Each Category ............................ 41
4.1   RF Classifier Parameters ......................................... 74
4.2   PSO Optimizer Parameters ......................................... 75
4.3   Optimal RF Parameter Values Obtained by PSO ...................... 75
4.4   Confusion Matrix for Both the RFDSW and RFPSO Classifiers ........ 75
4.5   Evaluation and Results Based on Accuracy for Experiment 2 ........ 76
4.6   Evaluation and Results Based on Precision, Recall and F-Measures
      for Experiment 2 ................................................. 77
4.7   SVM Classifier Parameters ........................................ 78
4.8   J48 Classifier Parameters ........................................ 78
4.9   Confusion Matrix for Each Classifier: SVM, RF and J48 ............ 79
4.10  Evaluation and Results Based on Accuracy for Experiment 3 ........ 79
4.11  Evaluation and Results Based on Precision, Recall and F-Measures . 80
4.12  Accuracy and Performance Metrics for Each Experiment ............. 82

List of Figures

Figure No.  Title                                                             Page No.
2.1   Knowledge Discovery Process ...................................... 10
2.2   Web Mining Subtasks .............................................. 12
2.3   Web Mining Taxonomy .............................................. 14
2.4   Typical Model of the General Phishing Website Detection Process .. 19
2.5   Flow Chart of Random Forest ...................................... 26
2.6   A Hyperplane Separates Patterns of Two Different Classes ......... 31
3.1   The Structure of the Proposed System ............................. 37
3.2   Architecture of the Proposed PSO-Based Determination Approach
      for RF and Calculation of the Accuracy of Results ................ 40
3.3   General Structure of Data Collection ............................. 42
3.4   Getting Texts from Websites ...................................... 43
3.5   Text Cleaning .................................................... 45
3.6   The Process of Labeling with (Yes, No) According to Their Content
      of Phishing and Non-Phishing Terms ............................... 46
3.7   The Structure of the Developed Stop Words Filter ................. 49
3.8   Random Forest as an Ensemble of Several Decision Trees ........... 56
3.9   Flowchart of the Random Forest Method ............................ 57
3.10  Flowchart of the PSO Algorithm ................................... 64
3.11  The Workflow Used in the Cross-Validation Part ................... 66
4.1   Original Dataset ................................................. 70
4.2   Data after Conversion to Numeric Weights ......................... 71
4.3   The Result of Implementing the Stop Word Handler Filter .......... 72
4.4   Classification Accuracy of Experiment 2 .......................... 81
4.5   Classification Accuracy of Experiment 3 .......................... 81
4.6   Comparative Chart Using Different Classifiers .................... 83
4.7   Data Collection Graphical User Interface ......................... 84
4.8   Main Detection System Graphical User Interface ................... 85
4.9   Parameters of PSO and Performance Target ......................... 86

List of Abbreviations

Abbreviation   Meaning
AC       Acceleration Coefficient
ASCII    American Standard Code for Information Interchange
BSWF     Basic Stop Words File
CCI      Correctly Classified Instances
CF       Confidence Factor
Dim      Dimension of Particles
DOM      Document Object Model
DSWF     Developed Stop Words File
DT       Decision Tree
FN       False Negative
FNR      False Negative Rate
FP       False Positive
FPR      False Positive Rate
HTML     Hyper Text Markup Language
ICI      Incorrectly Classified Instances
IDF      Inverse Document Frequency
IE       Information Extraction
IR       Information Retrieval
IW       Inertia Weight
KDD      Knowledge Discovery in Databases
KNN      K-Nearest Neighbor
MAE      Mean Absolute Error
NBC      Naïve Bayes Classifier
NLP      Natural Language Processing
NOP      Number of Particles
OOB      Out Of Bag
PCCI     Percentage of Correctly Classified Instances
PICI     Percentage of Incorrectly Classified Instances
PPV      Positive Predictive Value
PSO      Particle Swarm Optimization
QP       Quadratic Programming
RAE      Relative Absolute Error
RF       Random Forest
RFDSW    Random Forest with Developed Stop Words
RFPSO    Random Forest with Particle Swarm Optimization
RMSE     Root Mean Square Error
RRSE     Root Relative Squared Error
SMO      Sequential Minimal Optimization
SVM      Support Vector Machine
SWF      Stop Words File
TF       Term Frequency
TN       True Negative
TNR      True Negative Rate
TP       True Positive
TPR      True Positive Rate
UTF-8    Unicode Transformation Format
XML      Extensible Markup Language

Chapter One
Introduction

1.1 Overview

Apparently, the internet is the biggest data collection in the world; it is important because it provides detailed data in different areas, which can help people achieve their goals in gaining information and knowledge. This thesis takes a step towards security and provides the user with a good perception. Websites are an imperative component of the internet and have lately become a very important source of information. Recently, the distribution of phishing websites through the internet has increased significantly. Thus, substantial numbers of web documents from different phishing websites have become available to users. In view of that, users might spontaneously collect both valuable and phishing information through those phishing sites, for example by learning to gain unauthorized access to computer systems or networks (hacking), penetration, stealing personal information, etc. Therefore, refining the information retrieval process has become essential [1].

Text classification is the process of assigning a text document to a predefined category, or a set of categories, depending on its content. It can be used in several applications such as web pages, email filtering, automatic article indexing with clustering, and natural language processing [2]. Most new web pages contain keywords that help retrieve related documents quickly and accurately [3]; nonetheless, there exist numerous old documents that do not have keywords [4]. Thus, automatic detection of phishing websites through classification can be used to categorize websites based on their content. English is regarded as one of the dominant and leading languages on the World Wide Web, together with some other European and Asian languages. Accordingly, most document classification systems are designed for categorizing documents written in one of these languages [5]. Moreover, it can be noticed that there are very limited works on automatic classification in the Arabic language. Classifying Arabic texts is utterly different from classifying English texts, because Arabic is a highly inflectional and vastly derivational language in which morphological exploration and analysis is a challenging and interesting task. Yet, few attempts have been made to develop an automatic classification system for documents written in other languages, including Arabic [6]. This work is conducted to design an automatic and intelligent website classification system; to achieve this, data is collected from several Arabic websites, blogs and forums. The main contributions of the research are: recommending a system that applies the latest data mining techniques for security purposes to deal with Arabic phishing websites, protecting users from harmful information, better handling of multimedia information when massive websites need to be processed, and finally, generating interesting patterns and producing a significant improvement in terms of accuracy.

1.2 Problem Statements

It is clear that the different web environments of phishing websites can include different types of information, such as texts, that could be a threat to all web users. Moreover, the spreading of phishing websites through the internet has increased expressively. Consequently, considerable numbers of web documents from diverse phishing websites have become available to users. As a result, users might extemporaneously collect both valuable and phishing information through those phishing sites.


1.3 Literature Review

From the literature, there have been quite a lot of research works that address issues related to extracting texts and selecting significant information by handling text documents. In this section, a review of past experiments using various feature extraction methods to improve text classification accuracy is presented. Remarkably, data mining is regarded as one of the most important artificial intelligence technologies; it has been developed for exploring and analyzing significant patterns and rules in large amounts of data. In [7], automatic web page classification using machine learning methods was described; portal site services, including the search engine function on the World Wide Web (WWW), have lately grown in importance. The authors proposed techniques to generate attributes using co-occurrence analysis and to classify web pages automatically based on machine learning; they applied these techniques to web pages on Yahoo! JAPAN and constructed decision trees which determine the appropriate category for each web page. In [8], a text mining technique which extracted Arabic text from web documents was considered, using the K-Nearest Neighbor (KNN) classifier and the Naïve Bayes classifier (NBC); a special corpus consisting of 1562 documents belonging to 6 categories was established, and a feature set of keywords and term weights was extracted to increase the performance of the technique. Ramos observed the outcomes of using TF-IDF to determine all those words within documents that are most advantageous for use in a query. He stated that words with high TF-IDF values suggest a strong correlation with the document in which they are contained, which means that the document can be of interest to web users if the query contains that word. He presented an evidently simple algorithm which effectively identifies relevant words that can improve query retrieval.


Additionally, web content mining is considered an important part of data mining and is generally used for detecting significant information from text, audio, video and pictures in website pages or documents. It is essentially based on the research fields of information retrieval, information extraction and information visualization [9]. It is also called web text mining, since text content is the most practically and broadly used. In [10], a feature vector representation was presented in which the web page textual contents are characterized via a weighted vector; four classifiers were used in the experiments, namely SVM, NBC, RF and KNN. In [11], the Naive Bayesian (NB) method, K-Nearest Neighbor and a Neural Network method were compared and investigated on different Arabic data sets. In [12], a method for searching for optimal parameters based on particle swarm optimization was proposed to improve the learning and generalization ability of the support vector machine; the authors constructed a speech recognition system based on a support vector machine using the optimal parameters. Mathioudakis and Koudas established the Queue Burst procedure, which can handle real-time data and discover bursts of particular keywords on Twitter [13]; the idea of identifying a relevant or trending word using a module has been inspired by their work. TF-IDF is regarded as one of the most widely used approaches to determine important words in corpora. In [14, 15], three methods were used for classifying Arabic websites, namely SVM via SMO, NBC, and J48. The authors intended to determine the accuracy of all classifiers and relied on stop word removal to find the best classifier; SMO produced the best accuracy and timing. The research work in [16] proposed a hybrid approach which refined the output of multiclass classification based on the use of a random forest classifier for automatically labeling images with several words.


The scheme of stemming is among the various techniques that can be used to improve the capability of detecting phishing website pages. Basically, it helps to extract the root of a word, which in return increases the classification accuracy rate [17]. In [18], an improved random forest algorithm for classifying text data was proposed; this algorithm is particularly designed for analyzing very high dimensional data with multiple classes, whose well-known representative is the text corpus. In [19], the aim was to find a subset of a random forest with accuracy comparable to the original RF but with a much smaller size; the authors showed that the problem of selecting the optimal subset of a random forest follows dynamic programming. In [20], a method of linguistic root extraction was presented; prefixes and suffixes were removed based on the word's length, and morphological patterns were used right after reasoning to eliminate infixes. Their root extraction method has been enhanced. In [21, 22], data mining approaches for extracting valuable insights from web content, structure and usage are explained. In [23], a novel feature selection method based on particle swarm optimization was presented to improve the performance of text categorization. In [24], the Khoja stemmer was used for preprocessing and TF-IDF was used to extract features; finally, the classification task for detecting phishing websites was conducted via random forest, support vector machine and decision tree (J48) classifiers.

From the literature, it is realized that little or no research work has been carried out to tackle the problem of predicting phishing websites using nature-inspired techniques like Particle Swarm Optimization for optimizing Random Forest parameter values. Since classification has not been applied in this application before, the authors found it very significant to design an appropriate system to resolve the problem of phishing website detection. Supervised classification methods with suitable pre-processing and optimization techniques are applied for the purpose of providing the best results. The contribution of this thesis is that the Particle Swarm Optimization method is used for improving Random Forests; one of the important points that makes this thesis novel is the use of PSO adaptation with Random Forest to classify phishing websites, which has not been applied before.

1.4 The Aim of the Thesis

The main objectives of this research are to:
1. Propose data preparation and pre-processing techniques.
2. Suggest a data model for security.
3. Propose innovative and efficient techniques for data mining and analysis by designing a system that allows the user to understand the system behavior.
4. Finally, optimize and apply the latest innovative data mining techniques to better handle multimedia information, such as texts that could be a threat to all web users, and design a methodology that takes full advantage of machine learning capabilities.

1.5 Thesis Outlines

The rest of the thesis is organized as follows:

• Chapter Two (Phishing Website Detection and Data Mining): presents a description of phishing website detection systems, background knowledge, the necessary concepts of its methods, popular approaches to categorizing web documents or texts, web content mining, the text mining steps of a typical phishing website detection system, and the methods used in each step.

• Chapter Three (Proposed System Methodology): is about the proposed technique (the Phishing Website Detection System). It presents the system structure: data collection, preprocessing, feature extraction, classification, optimization and performance evaluation.

• Chapter Four (Results and Discussions): presents the results of the applied classification techniques, including the proposed system, in the experimental cases together with their discussions; each experimental case is presented with its training and testing datasets.

• Chapter Five (Conclusions and Future Recommendations): presents the final results from the implemented system, which are analyzed and concluded; in addition, potential future works are recommended for improving and covering the scope and the limitations of the proposed system.


Chapter Two
Phishing Website Detection and Data Mining

2.1 Overview

This chapter presents the background knowledge needed to understand phishing website detection, data mining, web mining and analysis. The methods, models and techniques that are used for predicting and classifying phishing websites are explained and discussed. Website classification is the process of classifying documents into predefined categories based on their content; the most common techniques used for this purpose include Random Forest, Support Vector Machine and Decision Tree.

The World Wide Web (WWW) serves as a huge, widely distributed, global information service center for news, advertisements, consumer information, financial management, education, government, e-commerce, and many other information services. The web also contains a rich and dynamic collection of hyperlink information and of web page access and usage information, which are rich sources for data mining. However, the web also poses great challenges for effective resource and knowledge discovery. Enormous amounts of phishing text documents are involved in websites, blogs and forums. These data are important to extract and analyze in order to develop a detection approach and to improve powerful means for the analysis and interpretation of knowledge discovery that can lead to decision making.

2.2 Data Mining

Data mining is defined as the process of discovering useful patterns or knowledge from data repositories such as databases. The discovered patterns should be valid, potentially useful, and understandable. Data mining is good at finding models from data; it has been applied in many fields and has obtained good economic results. It is also the process that has been developed and used to discover and analyze interesting, unexpected and valuable information and structures in huge amounts of data in order to define significant patterns and rules [35, 37]. For enormous databases, data mining can be considered an evolving approach to data analysis, and it has become a useful tool for detecting and extracting web documents. Literally, it involves extracting knowledge based on patterns of data in huge databases. It is also the process of extracting useful patterns or trends, often previously unknown, from large amounts of data using various techniques such as those from pattern recognition and machine learning. There have been several developments in data mining technologies, which have been used for a wide variety of applications, from marketing and finance to medicine and biotechnology to multimedia and entertainment [34]. Recently, there has been much interest in exploring the use of data mining for countering phishing and violent applications; similarly, data mining can be used to detect unusual patterns, terrorist activities and fraudulent behavior [36]. Many other expressions are used to describe data mining, such as knowledge mining from databases, knowledge extraction and data analysis. Currently, it is commonly accepted that data mining is a vital step in the process of Knowledge Discovery in Databases (KDD).


Knowledge Discovery in Databases is an interactive discovery process involving exploratory analysis and modeling of large data repositories. KDD is the organized process of identifying valid, useful and understandable patterns in huge and complex data sets [38]. The process generalizes to non-database sources of data, although it emphasizes databases as a primary source. KDD has several steps that should be followed to gain useful knowledge; Figure 2.1 shows the Knowledge Discovery in Databases process [39].

Figure 2.1: Knowledge Discovery Process [39] (the pipeline runs from the data source through preprocessing, transformation and data mining, producing pattern models that are interpreted and evaluated to yield knowledge).


Data mining generally includes four classes of tasks:
1. Classification, which arranges the data into predefined groups.
2. Clustering, which is like classification except that the groups are not predefined; the algorithm tries to group similar items together.
3. Regression, which attempts to find a function modeling the data with the least error.
4. Association rule learning, which searches for relationships between variables.

Accordingly, data mining functionalities include data characterization, data discrimination, association analysis, classification, clustering, and data evolution analysis. Data characterization is a summarization of the general characteristics or features of a target class of data [39]. Data discrimination is the comparison of the general features of target class objects with the general features of objects from one or a set of contrasting classes. Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently together in a set of data. Classification is the process of finding a set of models or functions that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. Clustering analyzes data objects without referring to a known class model. Data evolution analysis describes and models regularities [38].

2.3 Web Mining

Web mining is used to capture relevant information, create new knowledge out of relevant data, personalize information, learn about consumers or individual users, and several other purposes. The term web mining mainly refers to using data mining techniques to automatically discover and extract information from World Wide Web (WWW) documents and services. The potential of extracting valuable knowledge from the web has been quite evident, and web mining can be viewed as the collection of technologies to fulfill this potential. Web mining research has attracted many academics and engineers from database management, information retrieval and artificial intelligence research, especially from data mining and knowledge discovery. The type of collected data differs and also has extreme variation both in its content (e.g., text, image, audio, symbolic) and in the meta information that might be available. This makes the techniques used for a task in web mining vary widely [35].

2.4 Components of Web Mining

Web mining can be viewed as consisting of four tasks, as shown in Figure 2.2.

Figure 2.2: Web Mining Subtasks [35] (web data flows through Information Retrieval (resource discovery), Information Extraction and Preprocessing, Generalization (pattern recognition and machine learning), and Analysis (validation and interpretation) to yield knowledge and information).

2.4.1 Information Retrieval or Resource Discovery

Resource discovery, or Information Retrieval (IR), deals with the automatic retrieval of all relevant documents, while at the same time ensuring that as few non-relevant documents as possible are fetched. The IR process mainly includes document representation, indexing, and searching for documents [35].


2.4.2 Information Extraction and Preprocessing

Once the documents have been retrieved, the challenge is to automatically extract knowledge and other required information without human interaction. Information Extraction (IE) is the task of identifying the specific fragments of a single document that constitute its core semantic content [35].

2.4.3 Generalization

In this phase, pattern recognition and machine learning techniques are usually applied to the extracted information. Most of the machine learning systems deployed on the web learn more about the user's interest than about the web itself. A major obstacle when learning about the web is the labeling problem: data is abundant on the web, but it is unlabeled. Many data mining techniques require inputs labeled as positive (yes) or negative (no) examples with respect to their concepts and applications [35].

2.4.4 Analysis

Analysis is a data-driven problem which presumes that there is sufficient data available so that useful information can potentially be extracted and analyzed. People play an important role in the information or knowledge discovery process on websites, since websites are an interactive medium. This is especially important for the validation and interpretation of the mined patterns, which take place in this phase. Once the patterns have been discovered, analysts need appropriate tools to understand, visualize and interpret them. Based on the aforesaid four phases (Figure 2.2), web mining can be viewed as the use of data mining techniques to automatically retrieve, extract and evaluate information for knowledge discovery from web documents and services, where evaluation includes both generalization and analysis [54].


2.5 Category of Web Mining

This section divides web mining into several categories depending on the type of data [9, 35]. Web mining can be broadly divided into three distinct categories, according to the kinds of data to be mined; Figure 2.3 shows the taxonomy of web mining.
1. Web content mining.
2. Web structure mining.
3. Web usage mining.

Figure 2.3: Web Mining Taxonomy [9] (web content mining covers text, image, audio, video and structured data; web structure mining covers hyperlinks, both inter-document and intra-document, and document structure; web usage mining covers web server data, application server data and application level data).

2.5.1 Web Content Mining

Web content mining is the process of extracting useful information from the contents of web documents. Content data is the collection of facts a website is designed to contain. It may consist of texts, images, audio, video, or structured records such as lists and tables. The majority of the data available on websites is unstructured data (unstructured text), and the application of text mining to web content has been the most widely researched.


Issues addressed in text mining include topic discovery and tracking, extracting association patterns, clustering of web documents and classification of web pages. Research activities on this topic have drawn heavily on techniques developed in other disciplines such as IR and Natural Language Processing (NLP) [35]. Earlier studies of web mining (such as web document classification methods) focused on structured data. However, in reality a considerable portion of the existing information is stored in text databases consisting of large collections of documents from various sources, such as books, articles, research papers, e-mail messages and web pages. With such an enormous number of documents, it is tiresome, yet essential, to be able to automatically organize the documents into classes to assist document retrieval and subsequent analysis [11, 13]. Web documents consist of texts, images, videos and audio. Text data in web documents is the most widely present, since all web documents contain texts in their pages [14, 24]. The web content mining applications aim to [37]:
1. Identify the topics represented by web documents.
2. Categorize web documents.
3. Find web pages across different servers that are similar.
4. Address applications related to relevance.
5. Queries: enhance standard query relevance with user, role, and/or task based relevance.
6. Recommendations: list the top "n" relevant documents in a collection or portion of a collection.
7. Filters: show or hide documents based on relevance score.


2.5.2 Web Structure Mining

The structure of a typical web graph consists of web pages as nodes and hyperlinks as edges connecting related pages. Web structure mining is the process of discovering structure information from the web. It can be further divided into two kinds [38].

2.5.2.1 Hyperlinks

A hyperlink is a structural unit that connects a location in a web page to a different location, either within the same web page or on a different web page. A hyperlink that connects to a different part of the same page is called an intra-document hyperlink, and a hyperlink that connects two different pages is called an inter-document hyperlink [38].

2.5.2.2 Document Structure

In addition, the content within a web page can be organized in a tree-structured format, based on the various Hyper Text Markup Language (HTML) and Extensible Markup Language (XML) tags within the page. Mining efforts here have focused on automatically extracting Document Object Model (DOM) structures out of documents [38].

2.5.3 Web Usage Mining

Web usage mining is the application of data mining techniques to discover interesting usage patterns from web usage data, in order to understand and better serve the needs of web-based applications. Usage data captures the identity or the origin of web users along with their browsing behavior at a website. Web usage mining itself can be classified further depending on the kind of usage data considered:

2.5.3.1 Web Server Data

User logs are collected by the web server and typically include the IP address, page reference and access time.

2.5.3.2 Application Server Data

Commercial application servers, such as WebLogic and StoryServer, have significant features that enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs.

2.5.3.3 Application Level Data

New kinds of events can be defined in an application, and logging can be turned on for them, generating histories of these events [38].

2.6 Text Mining Approaches

Text mining is the automatic discovery of new, previously unknown information through the automatic analysis of various textual resources. With the growth of text documents, text mining is becoming increasingly important and popular. Text classification, or text categorization, aims at extracting useful information from large data. There are many approaches to text mining, which can be classified from different perspectives based on the inputs taken into the text mining system and the data mining tasks to be performed [35]. In general, the major approaches, based on the kinds of data taken as input, are:
1. The keyword-based approach, where the input is a set of keywords or terms in the documents.
2. The tagging approach, where the input is a set of tags.
3. The information-extraction approach, which takes as input semantic information, such as events, facts, or entities uncovered by information extraction.


A simple keyword-based approach may only discover relationships at a relatively shallow level, such as the rediscovery of compound nouns (e.g., "database" and "systems") or co-occurring patterns with less significance (e.g., "phishing", "terrorist" and "explosion"); it may not bring much deep understanding of the text [26]. The tagging approach may rely on tags obtained by manual tagging (which is costly and unfeasible for large collections of documents) or by some automated categorization algorithm (which may process a relatively small set of tags and requires definition of the categories beforehand). The information-extraction approach is more advanced and may lead to the discovery of some deep knowledge, but it requires semantic analysis of text by natural language understanding and machine learning methods.

2.7 Typical Model of the General Phishing Website Detection Process

General phishing website detection is based on a five-step process: data collection, preprocessing, feature extraction, applying the classification and optimization algorithms, and evaluation. Distinct techniques are applied at each stage; in the sections below, the techniques of each stage are clarified briefly. An organization that needs to set up phishing website detection has to go through the set of levels shown in Figure 2.4.

Figure 2.4: Typical model of the general phishing website detection process (Data Collection → Preprocessing → Feature Extraction → Applying Classification and Optimization Algorithms → Evaluation).

2.7.1 Data Collection

In this digital era, almost every domain is overloaded with voluminous data, and this voluminous data needs to be processed to obtain interesting patterns. The proposed work deals with the phishing website detection domain. Data collection is a vital process which is considered data preparation. The web is a common and interactive medium with an extreme amount of data freely available for users to access; it is a collection of documents, text files, audio, video and other multimedia data. Various types of data have to be prepared in such a way that different users can competently access them. A data collection is also a document or a set of documents from which we want to extract text or discover patterns. In text data collection, all input documents must be text-encoded to avoid any distortion of characters during the text reading process. In this research, all documents are encoded using the Unicode Transformation Format (UTF-8). The encoding is variable-length and uses 8-bit code units; it is safer to use within most programming and document languages that interpret certain American Standard Code for Information Interchange (ASCII) characters in a special way [40].


2.7.2 Pre-processing

The aim of pre-processing techniques is cleaning and normalizing the raw text contained in websites. Text preprocessing is actually an attempt to improve text classification by removing worthless information. It may include removal of numbers, punctuation (such as hyphens), and stop words. The Arabic documents are prepared by tokenizing and removing digits, punctuation and non-Arabic words as a first step. After that, stop words are removed. Stop words are common words that provide only a little meaning and serve only a syntactic function without indicating any important subject or matter of the document, like لذلك (so), و (and), بالنسبة (for). The rest of the words are referred to as features or keywords of the documents [41, 32]. Classifying phishing websites (text documents) requires accomplishing some preprocessing steps for the documents, including stemming the words. This process is quite a major issue in terms of reducing the number of related words in a document. Several techniques have been established to perform such preprocessing tasks, as illustrated below.

2.7.2.1 Tokenizing

In lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics, where it is a form of text segmentation, and in computer science, where it forms part of lexical analysis. Often a tokenizer relies on simple heuristics, for example:
1. Punctuation and whitespace may or may not be included in the resulting list of tokens.
2. All contiguous strings of alphabetic characters are part of one token; likewise with numbers.
3. Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.
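To make these heuristics concrete, the following minimal Python sketch (an illustration only, not the implementation used in this thesis) splits on whitespace and keeps contiguous runs of word characters:

import re

def tokenize(text):
    """Apply heuristics 1-3 above: whitespace separates tokens,
    contiguous alphanumeric runs stay together, punctuation is dropped."""
    tokens = []
    for chunk in text.split():
        # \w matches Unicode letters in Python 3, so this works for
        # Arabic text as well as English.
        tokens.extend(re.findall(r"\w+", chunk))
    return tokens

print(tokenize("Phishing sites steal data, passwords, and accounts!"))
# ['Phishing', 'sites', 'steal', 'data', 'passwords', 'and', 'accounts']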

2.7.2.2 Stemming

The number of features is usually very big and grows with the length of the documents, so some filtering techniques are applied to reduce the number of features. One of these techniques is root extraction. The root extraction, or stemming, stage aims to decrease the number of features and remove redundant features. Arabic words like لعب (play) can be inflected with morphological prefixes and suffixes to produce يلعب, يلعبون, يلعبان; they all share the same stem لعب. Often (but not always) it is beneficial to map all inflected forms onto the stem [6]. Roots can be extracted by two approaches: letter weighting and stemming algorithms. Letter weighting groups the letters into ranks and weights, so each letter has a product weight, and the three letters with the smallest product value give the root of the word. Stemming algorithms can be divided into three classes: the root-based stemmer, the light stemmer, and the statistical stemmer [32]. These algorithms include morphological analysis, removal of the prefixes, suffixes and infixes of words, and string similarity measures [42]. This is a complex process, since there can be many exceptional cases, as shown in Table 2.1.


Table 2.1: Examples of Word Stems Derived from Derivation Forms of the Word “Write” in Arabic

Root               Vocalism   Derived Stem     Gloss
ك-ت-ب (K-T-B)      i          كاتب (kaatib)    Writer
                   a,a        كتب (katab)      To write
                   a,a        مكتب (maktab)    Office
                   aa         كتاب (kitaab)    Book

The most commonly used stemmer is the Khoja Arabic stemmer [42]. There are many others, such as the Arabic light stemmer; they are not as accurate as the Khoja stemmer and do make some mistakes, and many are not designed for special domains such as biological terms.
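As a hedged illustration of light stemming only (the Khoja stemmer itself is a far more elaborate tool), the Python sketch below strips a few common Arabic affixes; the affix lists are simplified examples chosen for this demonstration, not a complete or authoritative set:

# Minimal light-stemmer sketch; affix lists are illustrative, not exhaustive.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و", "ب", "ل", "ي"]
SUFFIXES = ["ون", "ات", "ان", "ين", "ها", "ه", "ة", "ي"]

def light_stem(word, min_len=3):
    # Strip the longest matching prefix, keeping at least min_len letters.
    for pre in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(pre) and len(word) - len(pre) >= min_len:
            word = word[len(pre):]
            break
    # Strip the longest matching suffix under the same length constraint.
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_len:
            word = word[:-len(suf)]
            break
    return word

print(light_stem("يلعبون"))  # -> لعب, the shared stem of the examples above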

2.7.2.3 Stop Word Removing

The most frequent words often do not carry much meaning. We can also create our own stop word list for a particular application domain, such as the approach organized in this research. Stop word removal is an essential preprocessing step. The Arabic language belongs to the Semitic family of languages and is the language of the Holy Quran. Removing Arabic stop words such as على (on), لكن (but), في (at), etc. [16] matters because stop words are common words that provide only a little meaning and serve only a syntactic function without indicating any important subject or matter. Thus, the removal of these stop words changes the document length, reduces the memory used by the process and increases the efficiency [25].
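A minimal sketch of the idea follows; the tiny stop list here is a toy example for illustration, not the developed stop word file of this thesis (that filter is described in Chapter Three):

# Illustrative Arabic stop word removal with a toy stop list.
STOP_WORDS = {"على", "لكن", "في", "و", "لذلك", "بالنسبة"}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["لكن", "الموقع", "في", "الشبكة"]
print(remove_stop_words(tokens))  # ['الموقع', 'الشبكة']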


2.7.3 Feature Extraction

After selecting the most significant terms in the super vector, each document is represented as a weighted vector of the terms found in this vector [24, 32]. Term weighting is one of the important and basic steps in text classification based on the statistical analysis approach. In this research work, the experiments use two weighting schemes to reflect the relative importance of each term in a document (and a category) and to reduce the dimensionality of the feature space. In the first scheme, the relative Term Frequency (TF) of a term t in a text document d is given by the following formula:

TF(t, d) = [number of occurrences of t in d] / [total number of terms in d]   (2.1)

In the second scheme, TF-IDF (Term Frequency-Inverse Document Frequency) is used. The TF of a word t is determined by the frequency of the word t in the document d. The Document Frequency (DF) of a word t is the number of documents in the dataset in which the word t occurs at least once. The IDF of the word t is generally calculated as follows:

IDF(t) = log(N / n)   (2.2)

where N is the total number of documents in the dataset and n is the number of documents that contain the concerned word. The weight of word t in document d using TF-IDF is:

W(t, d) = TF(t, d) * log(N / n)   (2.3)

Thus, a term that has a high TF-IDF value must be simultaneously important in this document and must appear few times in the other documents. It is often the case that such a term correlates with an important and unique characteristic of a document [12].
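The following Python sketch is a direct transcription of equations 2.1 to 2.3 for illustration (a production system would normally use an optimized library implementation instead):

import math
from collections import Counter

def tf(term, doc):
    # Equation 2.1: occurrences of term in doc / total terms in doc.
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    # Equation 2.2: log(N / n), with N the corpus size and n the
    # number of documents containing the term.
    n = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n) if n else 0.0

def tf_idf(term, doc, corpus):
    # Equation 2.3: TF(t, d) * log(N / n).
    return tf(term, doc) * idf(term, corpus)

corpus = [["phishing", "site", "password"],
          ["news", "site"],
          ["phishing", "email"]]
print(tf_idf("phishing", corpus[0], corpus))  # 1/3 * log(3/2) ≈ 0.135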


2.7.4 Classification and Optimization Stage

Data classification is the process of assigning a class (label) to a data instance, based on the values of a set of predictive attributes (features). In this study, classification is used; therefore, the system needs to be trained to produce the knowledge source. The knowledge source contains all the important concepts (with their semantic relations and concept frequencies) that are part of each class. To specify the class of a new document, the document needs to be preprocessed; then features are extracted and fed to the classifiers along with the knowledge source [23, 27]. Different methodologies have been presented for the classification of documents. The most popular and commonly used algorithms for text document classification are the RF, SVM and DT classifiers, together with optimization algorithms such as PSO. There is little work on RF classification using PSO for tuning parameter values, so this research adds this combination [32]. The mentioned algorithms are explained below.

2.7.4.1 Random Forest

RF is a learning ensemble consisting of a bagging of un-pruned decision tree learners with a randomized selection of features at each split. Bagging works by taking a bootstrap sample from the training set. The RF method often shows superior performance in comparison with traditional techniques. Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [29, 30, 31]. The generalization error for forests converges to a limit as the number of trees in the forest becomes large [56].


The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates which compare favorably with AdaBoost [56]. The RF classifier [32, 43] has internal estimates that monitor error, strength and correlation, and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression. Figure 2.5 shows decision tree learners with a randomized selection of features at each split. To diversify the ensemble of decision trees, RF adopts two methods:

2.7.4.1.1 Bagging

At each node split, only a subset of features is drawn randomly to assess the goodness of each feature/attribute (typically on the order of √F or log2 F, where F is the total number of features). Trees are allowed to grow without pruning. Typically 100 to 500 trees are used to form the ensemble, and random forest is now considered among the best performing classifiers.

2.7.4.1.2 Bootstrap Sample

A bootstrap sample is used for the construction of each tree. Each training set of n instances is drawn at random with replacement from the original training set of n instances, resulting in approximately 63.2% unique samples, with the rest repeated. With this sampling (called bootstrap replication), on average 36.8% of the training instances are not used for building each tree; these out-of-bag instances come in handy for computing an internal estimate of the strength and correlation of the forest.
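A tiny simulation (illustrative only, not from the thesis) confirms the figures quoted above: with n draws with replacement, each instance is left out with probability (1 - 1/n)^n ≈ e^(-1) ≈ 0.368, so about 63.2% of the instances are unique in each replicate:

import random

n, trials, unique_frac = 1000, 200, 0.0
for _ in range(trials):
    sample = [random.randrange(n) for _ in range(n)]  # one bootstrap replicate
    unique_frac += len(set(sample)) / n
print(unique_frac / trials)  # ≈ 0.632 unique; the remainder is out-of-bag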


Figure 2.5: The Flow Chart of Random Forests [44].

In RF, for the kth tree [44, 56] a random vector Θk is generated, independent of the past random vectors Θ1, ..., Θk-1 but with the same distribution, and a tree is grown using the training set and Θk, resulting in a classifier h(x, Θk), where x is an input vector. In random split selection, Θ consists of a number of independent random integers between 1 and K. The nature and dimensionality of Θ depends on its use in tree construction. After a great number of trees are produced, they vote for the most common class. These procedures are called random forests [56]. The random forest error rate depends on two things:
1. Correlation: the correlation between any two trees in the forest. The error increases as the correlation increases.
2. Strength: the strength of each tree in the forest, measured by its error rate; a tree with low error is a strong tree. The forest error rate decreases as the decision trees' strength increases.

The random forest classifier has the following advantages and disadvantages.

Advantages of RF:
a) It produces a highly accurate classifier and learning is fast.
b) It runs efficiently on large databases.
c) It can handle thousands of input variables without variable deletion.
d) It offers an experimental method for detecting variable interactions.
e) It is relatively robust to outliers and noise.
f) It is easy to build and fast to predict.
g) It gives useful internal estimates of error, strength, correlation and variable importance.
h) It is simple and easily parallelized.
i) It is resistant to over-training and over-fitting of data.
j) It generalizes through bagging (random samples) and randomized tree learning (random features).
k) It is inherently multi-class.
l) Training is simple.

Disadvantages:
a) Inconsistency.
b) Difficulty of adaptation.
c) It is only appropriate for large datasets.
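As a hedged sketch of how such a forest is typically trained in practice (scikit-learn and synthetic data are assumed here purely for illustration; the thesis experiments use the tools and dataset described in Chapter Three):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the TF-IDF document vectors.
X, y = make_classification(n_samples=500, n_features=40, random_state=0)

# oob_score=True uses the ~36.8% out-of-bag instances as a built-in
# error estimate, as described above.
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            bootstrap=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)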

2.7.4.2 Particle Swarm Optimization

PSO is an optimization algorithm for nonlinear functions that was introduced by Kennedy and Eberhart in 1995. The algorithm is related to artificial life, fish schooling and bird flocking in general, and also to genetic algorithms and evolutionary programming. It uses only simple mathematical operations and can be implemented easily; it is also computationally inexpensive in terms of memory and speed [55, 45, 21]. The algorithm uses competition and cooperation between particles instead of genetic operators. Each particle (a flying bird in its flock) represents a potential solution to the problem. Individual particles adjust their flying according to their own experience and the flying experience of the other individuals in the swarm. The adjustments concern velocity and position, which are calculated according to the following equations [46, 45]:

v_i(t) = v_i(t-1) + c_1 r_1 (p_i - x_i(t-1)) + c_2 r_2 (p_g - x_i(t-1))    (2.4)

x_i(t) = x_i(t-1) + v_i(t)    (2.5)

where v_i(t) is the new velocity of particle i, v_i(t-1) is the current velocity of particle i, c_1 and c_2 are two positive constants, r_1 and r_2 are two random numbers in [0, 1], and p_i and p_g are the best positions (solutions) found so far by the particle itself and by its social environment (global or neighbors), respectively. x_i(t) and x_i(t-1) are the particle's new and previous flying positions, respectively. The second part of equation 2.4 is called the "mental" (cognitive) part, which represents the private thinking of the particle, whereas the third part is called the "social" part, since it represents the collaboration among all particles. The performance of each particle is evaluated according to a predefined fitness function related to the problem to be solved [46, 55]. The first part of equation 2.4 is modified by adding an inertia weight w to balance local and global search, as shown in equation 2.6:

v_i(t) = w \, v_i(t-1) + c_1 r_1 (p_i - x_i(t-1)) + c_2 r_2 (p_g - x_i(t-1))    (2.6)

The structure and working procedure of the PSO algorithm are explained in the Appendix. In this work the PSO algorithm is used in the RFPSO system to find the best parameter values for the RF, so as to obtain the best possible result for the system.
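As an illustration of equations 2.4-2.6, the following is a minimal, self-contained sketch of the PSO update loop. It is not the thesis implementation: the fitness function is a stand-in (in the RFPSO system of Chapter Three the fitness is the RF classification accuracy), and a conservative inertia weight is used here for stability (the values actually used in this work, w = 1.2 and c1 = c2 = 0.5, are given in section 3.10.2):

```java
import java.util.Random;

// Minimal sketch of the PSO update rules in equations 2.4-2.6.
public class PsoSketch {
    public static void main(String[] args) {
        int numParticles = 20, dim = 4, maxIter = 100;
        double w = 0.7, c1 = 0.5, c2 = 0.5;
        Random rnd = new Random();

        double[][] x = new double[numParticles][dim];   // positions
        double[][] v = new double[numParticles][dim];   // velocities
        double[][] pBest = new double[numParticles][dim];
        double[] pBestFit = new double[numParticles];
        double[] gBest = new double[dim];
        double gBestFit = Double.NEGATIVE_INFINITY;

        // initialize particles at random positions in [0, 100)
        for (int i = 0; i < numParticles; i++) {
            for (int d = 0; d < dim; d++) x[i][d] = rnd.nextDouble() * 100;
            pBest[i] = x[i].clone();
            pBestFit[i] = fitness(x[i]);
            if (pBestFit[i] > gBestFit) { gBestFit = pBestFit[i]; gBest = x[i].clone(); }
        }

        for (int t = 0; t < maxIter; t++) {
            for (int i = 0; i < numParticles; i++) {
                for (int d = 0; d < dim; d++) {
                    double r1 = rnd.nextDouble(), r2 = rnd.nextDouble();
                    // equation 2.6: inertia + cognitive ("mental") + social parts
                    v[i][d] = w * v[i][d]
                            + c1 * r1 * (pBest[i][d] - x[i][d])
                            + c2 * r2 * (gBest[d] - x[i][d]);
                    x[i][d] += v[i][d];                  // equation 2.5
                }
                double f = fitness(x[i]);
                if (f > pBestFit[i]) { pBestFit[i] = f; pBest[i] = x[i].clone(); }
                if (f > gBestFit)    { gBestFit = f;    gBest    = x[i].clone(); }
            }
        }
        System.out.println("best fitness found: " + gBestFit);
    }

    // stand-in fitness: higher is better (negated squared distance from a target)
    static double fitness(double[] p) {
        double s = 0;
        for (double c : p) s -= (c - 50) * (c - 50);
        return s;
    }
}
```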

2.7.4.3 Support Vector Machine

SVM is a learning machine introduced by Vapnik [30]. Support vector machines can be seen as another modern, mathematically founded model. An SVM takes a vector as input and predicts a class label for the presented input. SVM techniques can be used for both classification and regression problems [31, 32, 47]. The average error between the inputs and their target vector is reduced because SVM classification is based on the Structural Risk Minimization principle from computational learning theory. A hyperplane, which may be a line in a two-dimensional space, a plane in a three-dimensional space or a hyperplane in higher dimensions, is used in the feature space to separate the classes by finding the maximum-margin hyperplane. The margin is the distance to the nearest data point [19, 52, 48] (as shown in Figure 2.6). The optimal hyperplane is the one with the maximum margin, which also makes the classifier generalize well. The dataset for the SVM technique is divided into two parts: a training and a testing dataset. The training section uses a subset of the dataset of the form (x_i, y_i), where i = 1, 2, ..., n (the number of training samples), x_i is a vector that represents the features and y_i is the corresponding label or class of x_i. The remaining subset of the data is used for testing [49]. The main stages of SVMs can be summarized in three steps: maximizing the hyperplane margin, mapping the input space to a linearly separable feature space, and applying the 'kernel trick' to the results of the first two steps [24, 25, 52]. To achieve a higher generalization ability, the decision boundary should be as far away from the data of all classes as possible [19, 29, 32]; in other words, the margin m should be maximized [48]. The main problem is to discover a hyperplane that maximizes the distance between the support vectors (SVs) and the separating hyperplane. Training a support vector machine requires the solution of a very large quadratic programming (QP) optimization problem; finding the non-zero solutions \alpha_i of the Lagrangian dual problem can be done efficiently. Having found the support vector weights \alpha_i, and given a labeled training set \langle x_i, y_i \rangle, the decision function in input space is [49]:

f(x) = \mathrm{sgn}\left(\sum_i \alpha_i y_i \langle x_i, x \rangle + b\right)    (2.7)

where b is the bias of the hyperplane and \langle a, b \rangle is the inner product of a and b. The nonlinearity problem can be solved by mapping each input sample x to its representation \Phi(x) in a feature space in which the algorithm can be applied to find the maximum-margin hyperplane. The third and final step is probably the most important and problematic one: augmenting the margin and evaluating the decision function both require the computation of the dot product \langle \Phi(x), \Phi(x') \rangle in a high-dimensional space. These expensive calculations are reduced significantly by using a kernel function K [19], such that:

\langle \Phi(x), \Phi(x') \rangle = K(x, x')    (2.8)

The patterns located and detected by the maximal margin classifier do not have to coincide with the input x; applying the decision function of equation (2.7) directly on \Phi(x) is also possible. Substituting equation (2.8) for the inner product, the decision function in feature space becomes [19]:

f(x) = \mathrm{sgn}\left(\sum_i \alpha_i y_i K(x_i, x) + b\right)    (2.9)

Figure 2.6: A Hyperplane Separates Patterns of Two Different Classes [53].

In Figure 2.6, the vectors (data points) that are closest to the hyperplane are called the support vectors. The other points do not influence the position of this decision boundary [53]. Non-linear classification can also be performed using kernel functions: the inputs are mapped onto higher-dimensional feature spaces in which the classification is performed. The most widely used kernel functions are [50, 32]:

- Linear: K(x_i, x_j) = x_i^T x_j
- Sigmoid: K(x_i, x_j) = \tanh(\gamma \, x_i^T x_j + r)
- Polynomial: K(x_i, x_j) = (\gamma \, x_i^T x_j + r)^d, \gamma > 0
- Radial Basis Function: K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2), \gamma > 0

where \gamma, r and d are kernel parameters. The polynomial kernel function is used in this thesis.
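The following is a minimal sketch of the kernel decision function in equation (2.9) with the polynomial kernel, assuming the support vectors, labels, weights alpha_i and bias b have already been obtained (e.g. by SMO); all values in the example are illustrative:

```java
// Minimal sketch of equation 2.9 with the polynomial kernel used in this thesis.
public class SvmDecisionSketch {
    // polynomial kernel K(a, b) = (gamma * a.b + r)^d
    static double polyKernel(double[] a, double[] b, double gamma, double r, int d) {
        double dot = 0;
        for (int i = 0; i < a.length; i++) dot += a[i] * b[i];
        return Math.pow(gamma * dot + r, d);
    }

    // equation 2.9: f(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) + b )
    static int classify(double[][] sv, int[] y, double[] alpha, double bias,
                        double[] x, double gamma, double r, int d) {
        double sum = bias;
        for (int i = 0; i < sv.length; i++)
            sum += alpha[i] * y[i] * polyKernel(sv[i], x, gamma, r, d);
        return sum >= 0 ? +1 : -1;
    }

    public static void main(String[] args) {
        // two illustrative support vectors with labels +1 and -1
        double[][] sv = { {1, 1}, {-1, -1} };
        int[] y = { +1, -1 };
        double[] alpha = { 0.5, 0.5 };
        System.out.println(classify(sv, y, alpha, 0.0,
                new double[]{2, 1}, 1.0, 1.0, 2));  // prints 1
    }
}
```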

2.7.4.4 Decision Tree

A decision tree (DT) is a classifier expressed as a recursive partition of the instance space. J48 is a tree classifier based on decision trees. It uses a divide-and-conquer algorithm to grow an initial tree. The basic idea is to divide the data into ranges based on the attribute values found in the training sample. It uses information gain to minimize the total entropy of the subsets and, by default, the gain ratio, which divides the information gain by the information provided by the test outcomes [18]. A decision tree consists of nodes that form a rooted tree: a directed tree with a node called the "root" that has no incoming edges. All other nodes have exactly one incoming edge; a node with outgoing edges is called an internal (or test) node, and all other nodes are called leaves (also known as terminal or decision nodes). At each internal node, the instance space is split into two or more subspaces according to a certain discrete function. In the simplest and most frequent case, each test considers a single attribute, so that the instance space is partitioned according to the attribute's value. In the case of numeric attributes, the condition refers to a range. Each leaf is assigned to the one class representing the most appropriate target value. Alternatively, the leaf may hold a probability vector indicating the probability of the target attribute having a certain value. Instances are classified by guiding them from the root of the tree down to a leaf, according to the outcomes of the tests along the path [51, 52]. The working structure of this classifier is explained in Appendix A.

2.7.5 Evaluation and Performance Measurements

There are many methods for estimating the performance of machine learning techniques, such as using the training set, a supplied test set, a percentage split and k-fold cross-validation. In this study, 10-fold cross-validation is performed to estimate and evaluate the training model and test the data. Cross-validation involves training and prediction procedures in which the class instances are randomly distributed into 10 folds, where nine out of 10 are used as the training set and the remaining one as the test set. N-fold cross-validation has been widely accepted as a reliable method for calculating generalization accuracy. Cross-validation is a model evaluation method; its basic idea is to remove some of the data before training and use it to test the performance of the model. K-fold is one of the cross-validation techniques: the data is divided into k subsets, one of which is used as the test set, and this is repeated k times so that every subset is tested [33]. The objective of this process is to choose different partitions of training set and validation set and then average the results, so that the result is not biased by any single partition. K is the number of partitions; it can be any number, but usually 10 is chosen as the value of k. The F-measure, recall and precision results are generated by the RF, SVM, DT and optimized RF classifiers on the same datasets in each category, using 10-fold cross-validation. The evaluation step refers to the performance and measurement factors of phishing website detection, for which the following terms are often used:

True Positive Rate (TPR) = \frac{TP}{TP + FN}

False Negative Rate (FNR) = \frac{FN}{TP + FN}

True Negative Rate (TNR) = \frac{TN}{TN + FP}

False Positive Rate (FPR) = \frac{FP}{TN + FP}

The True Positive Rate is also referred to as Recall, precision is also referred to as the Positive Predictive Value (PPV), and the True Negative Rate is also called Specificity. Commonly used additional performance metrics are accuracy, precision and the F-measure [32, 18]:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

Precision = \frac{TP}{TP + FP}

Recall = \frac{TP}{TP + FN}

F\text{-}measure = \frac{2 \times (Precision \times Recall)}{Precision + Recall}

Accuracy is the most basic measure of the performance of a learning method: it determines the percentage of correctly classified instances and the overall classification rate. The F-measure is a measure of a test's accuracy that considers both the precision and the recall of the test; it can be interpreted as a weighted average of the precision and recall, reaching its best value at 1 and its worst score at 0. These metrics are derived from a basic data structure known as the confusion matrix [32, 28], which contains information about the actual and predicted classifications produced by a classification system. A sample confusion matrix for a two-class case can be represented as shown in Table 2.2.

Table 2.2: Confusion Matrix Components [32]

                Classified as Positive   Classified as Negative
Actual Positive        TP                       FN
Actual Negative        FP                       TN
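The following minimal sketch (not part of the thesis software) computes the metrics above from the four confusion-matrix counts of Table 2.2; the sample counts are illustrative:

```java
// Computes accuracy, precision, recall and F-measure from TP, FN, FP, TN.
public class MetricsSketch {
    public static void main(String[] args) {
        double tp = 630, fn = 11, fp = 11, tn = 631;   // illustrative counts

        double accuracy  = (tp + tn) / (tp + tn + fp + fn);
        double precision = tp / (tp + fp);
        double recall    = tp / (tp + fn);             // also the TPR
        double fMeasure  = 2 * precision * recall / (precision + recall);

        System.out.printf("accuracy=%.4f precision=%.3f recall=%.3f F=%.3f%n",
                accuracy, precision, recall, fMeasure);
    }
}
```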


Chapter Three
Proposed System Methodology

3.1 Overview

In the previous chapter, the theory of phishing website detection was discussed in detail, and the various techniques and methods used in the literature to enhance the detection of phishing websites were described. In this chapter, the structure of the proposed technique for enhancing the phishing website detection system is explained in detail. The architectural design of the system is described in the opening section of this chapter, and the other stages of the implementation of the system are then described in detail.

3.2 System Structure

The organization of this chapter is presented in five main parts, as shown in Figure 3.1; these five parts are described in detail in the following subsections. The first part is about data extraction from different websites, data collection and input dataset preparation. The second part is about pre-processing. The third part is related to the feature extraction approaches. The fourth part is related to the tools and techniques used to build the proposed model. The final part is about performance evaluation. Figure 3.1 illustrates, in general, the structure of the proposed system.

[Figure 3.1 here: a flowchart of the proposed system in five parts: data collection and preparation (searching, getting text from web components, cleaning, labeling, producing the dataset of text documents), pre-processing (loading the stop word list, loading the filtered data, tokenizing, stemming, stop word removal, saving), feature extraction, classification under 10-fold cross-validation (training and testing SVM, J48, RF and PSO-RF, each with its error performance), and evaluation.]

Figure 3.1: The Structure of the Proposed System.


3.3 The Proposed System for Phishing Websites Detection

Initially, the data is collected from a large quantity of various documents from several websites, blogs and forums. The pre-processing step for text is called tokenization or text normalization. The dataset is collected automatically; further details are given in section 3.4. Pre-processing can play a substantial role in text mining techniques. Both training and testing documents must pass through this step, which eliminates all frequent terms and irrelevant data that might diminish the document classification accuracy. In addition, feature extraction is used to determine which words in a corpus of documents might be most constructive to use in representing a document. Afterwards, three classifiers are applied, namely the Random Forest (RF), Support Vector Machine (SVM) and decision tree (J48) classifiers. The largest problem encountered in setting up a classifier model is how to select the parameter values; inappropriate parameter settings lead to poor classification results. For that reason, a Particle Swarm Optimization (PSO) based approach, termed RFPSO, is developed for determining the parameter values of the RF, which has the highest accuracy. The developed approach is compared with the default RF, and the experimental results demonstrate that its classification accuracy rates surpass many other approaches. Eventually, the samples are divided into two subsets, called the training and testing sets, using K-fold cross-validation. The theories of the above-mentioned classifiers were discussed in the previous chapter; details of the system stages are described in the following sections. Random Forest with the Developed Stop Words filter (RFDSW) and RFPSO are used. The optimization is used in the RF classification framework for selecting the best and most effective RF parameter values for detecting phishing websites. Figure 3.2 explains the proposed system for tuning the RF and calculating the accuracy of the results; further details are given in section 3.11. The classification accuracy and results are described later in Chapter Four.


[Figure 3.2 here: a flowchart of the PSO-based parameter determination loop: the dataset of text documents is loaded and split into training and testing sets; pre-processing is applied (stop word list, filtered data, tokenizing, stemming, stop word removal); the particle swarm optimization is implemented to determine the best parameter values; features are extracted; under 10-fold cross-validation the RF classifier model is built on the training set and its accuracy calculated on the testing set; the loop repeats until the ending condition (an accuracy of 98.2) is met, yielding the optimized parameter values.]

Figure 3.2: The Architecture of the Proposed PSO-Based Parameter Determination Approach for RF and the Calculation of the Accuracy of Results.


3.4 Data Collection

As the first step, a huge quantity of various documents was collected from websites. Splitting phishing from non-phishing texts is considered one of the most difficult issues encountered in this work. Automatic retrieval of documents from the unstructured data on the web is difficult, and there is very limited work on automatic classification for phishing website detection in Arabic. Classifying Arabic phishing text is different from classifying English text, because Arabic is a highly inflectional and derivational language, which makes morphological analysis a very difficult task. The dataset is automatically extracted, and more than 1280 records (documents) from 160 different websites are prepared. Table 3.1 shows the collected dataset for the two classes of documents. The collection of the data was based on documents of different lengths, and the data collection contains over 1674 words (attributes). Each document was converted from HTML to TXT format. Figure 3.3 demonstrates the general structure of data collection. In this research, the automatic collection of the dataset for detecting phishing websites passes through several stages, as described below.

Table 3.1: Number of Documents for Each Category

Category Name      Number of Documents
Phishing website   641
Not Phishing       642
Total              1283


[Figure 3.3 here: a flowchart of data collection: a URL is loaded and its data extracted; tags, word boundaries (white space, punctuation marks) and digits are removed; sentences are saved with a "Yes" label if the URL content contains a phishing keyword and a "No" label otherwise, and sent to a phishing or not-phishing file accordingly; the two classes are then combined with their labels, encoded in UTF-8 and written to an ARFF file.]

Figure 3.3: General Structure of Data Collection.


3.4.1 Searching: The first step creates an application for searching texts inside websites, blogs and forums. The application can read and open general websites.

3.4.2 Get Text from Web Component: After searching for and opening the websites, the entire text of each website, including all numbers, symbols and letters, is taken regardless of the written language. In this process, images and videos are excluded (see Figure 3.4).

Figure 3.4: Get Texts from Websites


3.4.3 Cleaning: In this stage, the text is cleaned of any numbers, symbols and non-Arabic letters; only Arabic letters are kept. All sentences are rearranged and prepared for the labeling stage (see Figure 3.5).

3.4.4 Labeling: After the text is cleaned in the previous stage, which includes sentence arrangement, the sentences are labeled according to their content of phishing and non-phishing words. If a sentence contains a phishing word, it is labeled "yes"; otherwise it is labeled "no". Figure 3.6 demonstrates this process.

3.4.5 Arabic Encoding: This is another challenge that this research work faced while working with texts: the ability to recognize the characters programmatically, via a computer program. Text encoding is used to avert potential character errors during the text reading process [32]. Encoding tends to be problematic; the most common and effective way to solve this difficulty is to use Unicode (UTF-8). For the above steps, the open-source Java IDE NetBeans 8.1, a powerful integrated development environment for developing applications on Java platforms, is used for text extraction.
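The following is a minimal sketch of the cleaning and labeling steps just described, assuming a hypothetical list of phishing keywords; the Unicode block \u0600-\u06FF is used to stand for the Arabic letters:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Sketch of cleaning (keep only Arabic letters and spaces) and labeling
// (yes/no by keyword match), with the result encoded as a UTF-8 line.
public class CleanAndLabelSketch {
    public static void main(String[] args) {
        List<String> phishingKeywords = List.of("اختراق"); // hypothetical example

        String raw = "Win $100 now! اضغط هنا لمعرفة طرق الاختراق 123";
        // cleaning: drop digits, symbols and non-Arabic letters, keep spaces
        String cleaned = raw.replaceAll("[^\\u0600-\\u06FF\\s]", "")
                            .replaceAll("\\s+", " ").trim();

        // labeling: "yes" if any phishing keyword occurs, otherwise "no"
        String label = phishingKeywords.stream().anyMatch(cleaned::contains) ? "yes" : "no";

        byte[] utf8Line = (cleaned + "," + label + "\n").getBytes(StandardCharsets.UTF_8);
        System.out.println(cleaned + " -> " + label + " (" + utf8Line.length + " UTF-8 bytes)");
    }
}
```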


[Figure 3.5 here: a sample of the data before and after cleaning.]

Figure 3.5: Texts Cleaning.


Figure 3.6: A Sample Demonstrating the Labeling Process with (Yes, No) According to the Content of Phishing and Not Phishing Terms.


3.4.6 Saving: In this stage, the texts or sentences are saved in a notepad in the application created in this study. There are two types of saving processes. In the first, the texts for both labels (Yes and No) are saved separately for each website in a separate notepad. In the second, the texts are combined in one notepad, and the combined file is saved in (.arff) format so as to be compatible with the Weka data mining toolkit. The saved file is then used and prepared for the pre-processing and classification phases. The program created the initial corpora of sentences from different websites. Web document or text classification depends on a document's contents, and an enormous number of keywords can be found in phishing websites. The extracted dataset is partitioned into sentences that contain phishing and non-phishing terms (yes and no labels).

3.5 Developed Stop Words Filter to Improve and Handle the Stop Words

After creating the initial corpora, the dataset still needs to go through a transformation and normalization process. This approach aims at improving the Stop Words File (SWF). There are two types of SWF:

a) The Basic Stop Words File (BSWF), which contains about 216 words. These are mostly particle words (prepositions, adverbs, conjunctions and interjections) such as "البته, ایھا, في, على, تحت, امس, حیثما, حول" and so on.

b) The Extended or Developed Stop Words File (DSWF), which contains the basic stop words that can be affected by normalization. The researcher therefore added a filter (stop word handler) to implement and develop the SWF; it was necessary to normalize certain Arabic letters, mapping the hamza forms (إ, أ, آ) to the letter ا (alef) only, replacing ي (yaa) by ى, and ة (taa marbuta) by ه. For example, in this research (الإختراق, (Hack)) is replaced by (الاختراق). Diacritics for letters such as "alef" are used to signify or distinguish sounds that are not fully specified by the Arabic letters. These characters can be used interchangeably and change the meaning of a word. Since they are mostly used in the context of verbal exchanges or recitation, they hold very little value in computational-linguistic text analysis. This step is applied in both the training and testing procedures, and it has a real impact on the overall performance and improves the accuracy. Figure 3.7 shows the structure of the developed stop words filter.
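The following minimal sketch (an assumption of how such a filter can be written, not the thesis implementation) shows the character normalization described above:

```java
// Sketch of the character normalization performed by the developed stop word
// filter: hamza forms are mapped to bare alef, taa marbuta to haa, yaa to
// alef maksura (as described above), and tanween marks are removed.
public class ArabicNormalizeSketch {
    static String normalize(String s) {
        return s.replaceAll("[\u0623\u0625\u0622]", "\u0627") // أ إ آ -> ا
                .replace('\u0629', '\u0647')                   // ة -> ه
                .replace('\u064A', '\u0649')                   // ي -> ى
                .replaceAll("[\u064B-\u064D]", "");            // tanween marks
    }

    public static void main(String[] args) {
        System.out.println(normalize("الإختراق")); // prints الاختراق
    }
}
```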


[Figure 3.7 here: a flowchart of the developed stop words filter: the dataset and the stop word list are loaded; if a sentence contains words with al-hamza, the hamza is removed and the words returned; if it contains words with tanween, the tanween is removed and the words returned; words that appear in the stop word list are then removed, and the remaining words are passed to pre-processing.]

Figure 3.7: The Structure of the Developed Stop Words Filter.

3.6 Pre-processing

The process of preparing data for the core document (text) mining task is called pre-processing. The documents are transformed to the vector model, in which each document is represented as a vector in a space.


Each vector is represented by the weights of the terms in a document with respect to the dimensions of the space; the number of dimensions is equal to the number of terms or keywords used. These processes convert the documents from the original data source into a format appropriate for applying various types of feature extraction methods. The pre-processing phase includes all the routines, processes and methods required to prepare data for a text mining system, which is the core of knowledge discovery operations. Text mining pre-processing operations are centered on the identification and extraction of representative features for natural language documents. The goal of the pre-processing phase is to identify features in a way that is most computationally efficient and practical for pattern discovery. First of all, the strings (sentences) have to be converted to word vectors; this converts the string attributes into a set of attributes representing word occurrence, depending on the tokenizer, and the set of words (attributes) is determined by the first batch of data filtered. The documents may contain unnecessary data that can influence the accuracy of the classifier, so the data pre-processing stage aims at cleaning the texts by removing unnecessary information. An essential objective of this research is to examine the detection of phishing websites as well as the effect of data pre-processing on a classifier. The Arabic documents are prepared by removing digits, punctuation and non-Arabic words as a first step; after that, the stop words are removed. The remaining words are referred to as the features or keywords of the documents. Because of the richness of the Arabic language, some writing forms need to be normalized, including "ة" to "ه", "ي" to "ى" and "إ, أ, آ" to "ا". As performed in section 3.5, where the stop words filter was developed to improve stop word removal, the pre-processing step can play a substantial role in text mining techniques. Both training and testing documents must pass through this step, which eliminates all frequent terms and irrelevant data. Several techniques have been developed in this regard, such as stemming (root extraction), tokenizing and removing stop words, and it is important to define what a word is, which is highly significant for such languages. The basic text pre-processes are:

3.6.1 Stop Words Removal

Stop words are words that do not carry any meaning in the content of the document, such as (so (لذلك), and (و), for (بالنسبة)). Clearly, eliminating the stop words shrinks the dimensionality of the term space, as they do not contribute any meaning to the documents; accordingly, removing them does not negatively affect the classification accuracy. The most common such words in text documents are articles, prepositions, pronouns, etc. For examples of the stop words, see Table 1 in the Appendix.

3.6.2 Tokenizing

Tokenizing is applied to the strings. Tokenization is the basic and most substantial stage in natural language processing (NLP); it transforms a text stream into tokens by segmenting the texts, decomposing sentences and texts into words delimited by line breaks and white space. There are several types of tokenizing algorithms: the word tokenizer, the alphabetic tokenizer and the N-gram tokenizer. In this research the word tokenizer is used, with the set of delimiter characters " \r\n\t.,;:'"()?!" (where \r, \n and \t stand for carriage-return, line-feed and tab).

3.6.3 Stemming

Stemming is another pre-processing approach, used to identify the root of a word; it reduces the different syntactic forms of a word, such as its noun, adjective, verb and adverb forms. A modest stemming procedure encompasses removing the prefixes and suffixes of a word. Ultimately, it reduces the dimensionality of the word vectors, which helps enhance the performance of the system. In this work, the Khoja stemmer is used to find the requested keywords; for examples of word roots, see Table 2 in the Appendix. Words in Arabic can be divided into two modules, nouns and verbs, that are used in root extraction (stemming); the derivation patterns are basic and can be further affixed by (external) prefixes and suffixes. For the basic set of prefixes with examples, see Table 3 in the Appendix; for the basic set of suffixes, the type of word (noun or verb) they affix to, and examples, see Table 4 in the Appendix.

3.7 Feature Extraction Stage

This stage makes the data ready to be used in the next stage, which is the classification stage. In this thesis, feature extraction using TF-IDF is the most significant stage in developing the phishing website detection system. A huge amount of data can be used for detecting phishing websites, but not all of it may be useful; the corpus of data should therefore be analyzed, filtered and reduced to a small set of information which can yield a better accuracy level.
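The following is a minimal sketch of the TF-IDF weighting used in this stage, assuming the standard formulation w(t, d) = tf(t, d) × log(N / df(t)); the exact variants used in this thesis are those of equations (2.2) and (2.3) in Chapter Two, and the documents here are illustrative:

```java
import java.util.*;

// Computes tf-idf weights for the terms of one document against a tiny corpus.
public class TfIdfSketch {
    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("bank", "account", "verify", "password"),
                List.of("news", "sports", "weather"),
                List.of("verify", "password", "login"));

        // document frequency of each term
        Map<String, Integer> df = new HashMap<>();
        for (List<String> d : docs)
            for (String t : new HashSet<>(d)) df.merge(t, 1, Integer::sum);

        int n = docs.size();
        List<String> doc = docs.get(0);
        for (String t : new HashSet<>(doc)) {
            long tf = doc.stream().filter(t::equals).count();
            double w = tf * Math.log((double) n / df.get(t));  // high when rare in corpus
            System.out.printf("%-9s tf=%d df=%d tfidf=%.3f%n", t, tf, df.get(t), w);
        }
    }
}
```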


The final set of information is called a feature space. The performance and accuracy of the system depend on the quality of the feature space. Different detection techniques need different amounts and formats of information. Much research has been done on obtaining effective features from text documents, each approach analyzing a specific area of feature extraction. TF-IDF (Term Frequency-Inverse Document Frequency) gives better feature extraction accuracy. It is used to determine which words in a corpus of documents might be more constructive to use in representing a document. In essence, TF-IDF works by determining the relative frequency of a word in a specific document compared to the inverse proportion of that word over the entire document corpus. The keywords, the measures in this case, are assigned a weight that expresses their importance for a document. This involves assigning a high weight to a term that occurs frequently in the document but rarely in the whole document collection; keywords that appear in nearly all documents are thus given a low weight. In this study, term frequency-inverse document frequency (TF-IDF) is employed; the relationship used to calculate TF-IDF for specific terms in documents is given by equations (2.2) and (2.3) in the previous chapter.

3.8 Classification

The next important stage after the feature extraction stage in the phishing website detection system is classification. The features are used as inputs to the classification algorithm. Several methods have been proposed for the classification of web documents; generally, these methods differ from each other in the features fed to the algorithm and in the algorithms used to classify text documents.


The most commonly used algorithms for classifying text documents are the RF, SVM and DT classifiers. In this study, random forest is proposed as the main classification framework. A random forest is a combined classifier consisting of many tree-structured classifiers; it is used in the learning and testing phases for detecting phishing websites according to the data collected from websites, forums and blogs.

3.8.1 Random Forest

Random forest is a classifier combination method. A random forest consists of N decision trees, each trained by a completely random approach. Each node of a tree is a weak classifier, and the structure of the random forest combines all the weak classifiers into a strong classifier. The output of the random forest classifier is the vote of the trees. T_n are the samples selected randomly from the training sample; each is a subset of all the training samples. After the N trees are trained, the final decision combines all the outputs of T_{n1}, T_{n2}, ..., T_{nN} by considering the votes of all N outputs. Each node contains a simple comparison of the intensities of a pair of points. The final output is obtained by equations 3.1 and 3.2: each tree acts as a classification function on its own, and the final output is taken as the average of the individual tree outputs:

\hat{c}(p) = \arg\max_{c} P(c \mid p)    (3.1)

P(c \mid p) = \frac{1}{N} \sum_{n=1}^{N} P_n(c \mid p)    (3.2)

where N is the total number of trees and n specifies one tree; p is the text to be classified; c is the class label; and P_n(c | p) is the probability, estimated by the nth tree, that the text p belongs to class c. In general, each tree is grown as follows:

a) Random record selection: each tree is trained on roughly two-thirds of the total training data (approximately 63.2%). Cases are drawn at random with replacement from the original data; this sample is the training set for growing the tree.

b) Random variable selection: some number m of predictor variables is selected at random out of all the predictor variables, and the best split on these m variables is used to split the node. By default, m is the square root of the total number of predictors for classification. The value of m is held constant while the forest is grown. In an ordinary tree, by contrast, each split is created after examining every variable and picking the best split over all variables.

c) For each tree, the remaining one-third of the cases (approximately 36.8%) are left out and not used in the construction of the tree; they are used to calculate the misclassification rate, the out-of-bag (OOB) error rate. The errors from all trees are aggregated to determine the overall OOB error rate for the classification.

d) Each tree gives a classification; the tree "votes" for that class. The forest chooses the classification having the most votes over all the trees in the forest.

Figure 3.8 shows how a random forest combines the outputs of all decision trees into one classifier [22], and Figure 3.9 shows the flowchart of the random forest method.
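A minimal sketch of the forest vote in equations 3.1 and 3.2, with hypothetical per-tree probabilities:

```java
// Each tree returns a class-probability vector for a text p; the forest
// averages them (equation 3.2) and the class with the highest averaged
// probability wins (equation 3.1).
public class ForestVoteSketch {
    public static void main(String[] args) {
        // hypothetical per-tree probabilities P_n(c|p) for classes {Yes, No}
        double[][] treeProbs = { {0.9, 0.1}, {0.6, 0.4}, {0.2, 0.8} };

        int numClasses = treeProbs[0].length;
        double[] avg = new double[numClasses];
        for (double[] probs : treeProbs)                 // equation 3.2
            for (int c = 0; c < numClasses; c++)
                avg[c] += probs[c] / treeProbs.length;

        int best = avg[0] >= avg[1] ? 0 : 1;             // equation 3.1 (arg max)
        System.out.println("forest says: " + (best == 0 ? "Yes" : "No"));
    }
}
```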


Figure 3.8: Random Forest as an Ensemble of Several Decision Trees.

P(c \mid v) = \frac{1}{T} \sum_{t=1}^{T} P_t(c \mid v)    (3.3)

where P(c | v) is the final classification of the forest for a sample v, P_t(c | v) is the classification at each tree, T is the number of trees built and c is the category; at each node, the split test compares a feature response F_n(v) with a threshold t_n.


[Figure 3.9 here: a flowchart of the random forest method: for each tree, a training set is selected to build a model; splits are built until the stop condition holds at each node; the prediction error is then calculated.]

Figure 3.9: Flowchart of the Random Forest Method.

3.8.2 Support Vector Machine

The support vector machine is used in this thesis for classifying and predicting phishing websites. SVM is another type of classifier through which good accuracy can be obtained on large datasets. SVM is a large-margin technique for classifying data that are linearly separable, and overfitting in SVM can be avoided by minimizing the generalization error. SVM is widely used to tackle text classification, simply because feature-space dimensionality is commonly high.


In this research work, Sequential Minimal Optimization (SMO) is used with a polynomial kernel. A procedure is designed in which the training data is accepted and each item is theoretically plotted on a 2-D plane. Literally, an input document is received by the SVM, placed and plotted as a point, and the SVM then draws lines to define in which class the test sample input fits. It is worth mentioning that for the classification of multiple classes, the SVM must run multiple times, each time dividing the outstanding items to define the neighboring class.

3.8.3 Decision Tree

The decision tree (J48) is another method that can be used for classification and prediction. In this work, it is used to build the classification, prediction and description of phishing website detection. The J48 method depends on the Shannon entropy to define the root node and the rest of the child nodes; therefore, all records in the input dataset participate in the calculations that define the nodes, from the root node down to the leaf nodes. For building the J48 structure, it is important to find the information carried by the population P. For a probability distribution P = {p_1, p_2, ..., p_n} of a sample S, the information carried by this distribution, also called the entropy of P, can be computed by equation (3.4), which is known as the Shannon entropy [51]:

Entropy(P) = -\sum_{i=1}^{n} p_i \log_2(p_i)    (3.4)

This step alone is not enough to define the root node, but it is essential. The entropy of P is used to find the information gain for each feature T in the input dataset. Equation (3.5) is used to find the information gain for feature T with reference to the entropy value of P:

Gain(P, T) = Entropy(P) - \sum_{j} \frac{|P_j|}{|P|} \, Entropy(P_j)    (3.5)

where the P_j are the subsets into which T partitions P. Depending on these two values, the root node can be identified, and the same process is repeated to identify the child node(s) of the root node and of subsequent parents.

3.9 Implementation of Practical Work

Two interfaces were designed for the practical work in this thesis: one for data collection, to extract the dataset from websites, and a second to classify with the default RF, with the RF with developed stop words, and with the RF optimized with PSO. Both programs are built in the Java programming language using the open-source NetBeans 8.1 IDE, a powerful integrated development environment. In these experiments the Weka toolkit packages are used. The learning process of the classifiers is influenced by many parameters; therefore, hundreds of tests were performed dynamically to find the most suitable parameters for each of the classifiers, and the Weka GUI was then used to record the details of the best tests. Weka has its own data format, so the input features were prepared in that format: the data file has a header consisting of the name of the whole dataset (e.g. phishing), the attribute names of the input features with their types (e.g. document string), the class attribute with its type (e.g. class {yes, no}), and then the data itself, separated by commas.
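A minimal sketch of equations (3.4) and (3.5) on a hypothetical node:

```java
// Shannon entropy of a class distribution and the information gain of a
// binary feature split; all counts are illustrative.
public class EntropyGainSketch {
    static double entropy(double... p) {            // equation 3.4
        double h = 0;
        for (double pi : p)
            if (pi > 0) h -= pi * (Math.log(pi) / Math.log(2));
        return h;
    }

    public static void main(String[] args) {
        // hypothetical node: 8 "yes" and 6 "no" instances
        double parent = entropy(8.0 / 14, 6.0 / 14);

        // a feature T splits them into subsets of size 7 and 7
        double left  = entropy(6.0 / 7, 1.0 / 7);
        double right = entropy(2.0 / 7, 5.0 / 7);
        double gain  = parent - (7.0 / 14) * left - (7.0 / 14) * right;  // equation 3.5

        System.out.printf("entropy=%.3f gain(T)=%.3f%n", parent, gain);
    }
}
```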


3.10 The Developed RFPSO Approach

A new hybrid algorithm, RFPSO, based on the PSO and RF algorithms, is proposed to find the best parameter values for classifying phishing websites. In this section the PSO and RF algorithms are described and then the proposed technique is discussed. The system is trained with a classifier learning algorithm, namely the Random Forest algorithm. This research proposes a novel approach to improving and tuning the Random Forest (RF) algorithm for classifying websites from their text and for enhancing the detection of phishing websites. The PSO optimization technique obtains the optimal values of four parameters: Max Depth, Num Iterations, Num Features and Num Folds. Particles are initialized according to the size of the swarm, each with four elements; the value of each element is generated by a random number generator between the lower and upper bound of the corresponding parameter, i.e. rand × ((Max − Min) + 1) + Min. The obtained parameter values are then used to tune the RF classifier; in this research and case study they are used instead of the default RF parameter values. A detailed description is given in the following subsections.

3.10.1 Particle Swarm Optimization

PSO performs searches using a population (swarm) of individuals (particles) that are updated at each iteration. Each solution of the optimization problem is a particle in the search space. Each particle has a fitness value, determined by the optimization function, which serves as the evaluation standard of the particle. Each particle also has a velocity, which determines the direction and distance of its flight. All particles are initialized as a swarm of stochastic particles (stochastic solutions), and the optimum solution is then obtained through iterations. Each particle updates its velocity and position by following two extrema in each iteration: one is the personal best, the optimal solution the particle itself has found so far; the other is the global best, the optimal solution all particles have found so far. Suppose the solution space of the optimization problem has D dimensions and the number of particles is N. Let the position of the ith particle be X_i = (x_{i1}, x_{i2}, ..., x_{iD}), and let the personal best and velocity of this particle be P_i = (p_{i1}, p_{i2}, ..., p_{iD}) and V_i = (v_{i1}, v_{i2}, ..., v_{iD}), respectively. The global best of the particle swarm is P_g = (p_{g1}, p_{g2}, ..., p_{gD}). Particles fly according to the following two formulas, updating their own velocity and position:

v_{id}(t+1) = w \, v_{id}(t) + c_1 \, rand_1 [p_{id}(t) - x_{id}(t)] + c_2 \, rand_2 [p_{gd}(t) - x_{id}(t)]    (3.6)

x_{id}(t+1) = x_{id}(t) + v_{id}(t+1)    (3.7)

In the equations above, t denotes the current iteration number of the particle; v_{id} and x_{id} are the dth velocity component and position component of the ith particle. c_1 and c_2 are acceleration factors; they represent the weights of the stochastic acceleration terms pulling a particle toward P_i and P_g. rand_1 and rand_2 are two random numbers in [0, 1]. w is the inertia weight, which determines the impact of the historical velocity on the current velocity.

3.10.2 Algorithm: RFPSO

Input: the phishing website detection (web text document) dataset; the set of words after filtering.


Output: optimal Max Depth, Num Features, Num Iterations and Num Folds.

Begin:
1. Initialization
   a) Set the values of c1, c2, w and the number of particles.
   b) Establish a PSO population of four dimensions and initialize the position and velocity of each particle.
   c) Initialize the personal best of each particle with its current position, calculate the fitness value of each particle, and set the personal best whose fitness value is the best of all particles as the global best.
2. Repeat until the solution reaches the performance target or the maximum number of iterations:
   a) Update the position and velocity of each particle according to equations (3.6) and (3.7).
   b) Calculate the fitness value of each particle; if the fitness value of the new position is better than the fitness value of the personal best, the new position replaces the personal best.
   c) If the best particle of the population is better than the global best, the best particle replaces the global best.
3. Return the optimal Max Depth, Num Features, Num Iterations and Num Folds.
4. End.

The parameter values are as follows: the number of particles is 20, c1 = c2 = 0.5 and w = 1.2. The termination condition is that the fitness value reaches 98.2 or the maximum number of iterations is reached.
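The following is a minimal sketch of the fitness evaluation at the heart of this algorithm, assuming the Weka API (weka.classifiers.trees.RandomForest and weka.classifiers.Evaluation); the file name, parameter values and seed are illustrative, and Num Folds is handled here through the cross-validation itself:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// One particle's dimensions are written into the RF parameters, and the
// 10-fold cross-validation accuracy is returned as the particle's fitness.
public class RfpsoFitnessSketch {
    static Instances data;

    static double fitness(int maxDepth, int numFeatures, int numIterations) throws Exception {
        RandomForest rf = new RandomForest();
        rf.setMaxDepth(maxDepth);
        rf.setNumFeatures(numFeatures);
        rf.setNumIterations(numIterations);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(rf, data, 10, new Random(1)); // 10-fold CV
        return eval.pctCorrect();                             // accuracy in percent
    }

    public static void main(String[] args) throws Exception {
        data = DataSource.read("phishing.arff");              // illustrative file name
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("fitness = " + fitness(25, 10, 104));
    }
}
```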


3.10.3 Applying RFPSO

Figure 3.10 shows the flowchart of the PSO algorithm. First, the population of particles is initialized, each particle having a random position within the D-dimensional space and a random velocity in each dimension. Second, each particle's fitness for the RF is evaluated; in this study, each particle's fitness is the classification accuracy. If the fitness is better than the particle's best fitness, the position vector is saved as the particle's personal best. If the particle's fitness is better than the global best fitness, the position vector is saved as the global best. Finally, the particle's velocity and position are updated, and the process repeats until the termination condition is satisfied.


[Figure 3.10 here: a flowchart of the PSO algorithm: the dataset is read, the RF is calculated, and a random initial population is created within the allowable ranges; for up to 100 iterations, the fitness of each of the 20 particles is evaluated, pBest and gBest are updated when a better fitness is found, and the velocities and positions are updated; the loop terminates with the optimized parameters when the target is reached or the iterations end.]

Figure 3.10: The Flowchart of the PSO Algorithm.


Estimating the performance of the parameters Max Depth, Num Features, Num Iterations and Num Folds amounts to estimating the accuracy of the classifier. We use k-fold cross-validation to estimate the accuracy of each parameter setting and to decide the best parameter values for the optimization problem; the k-fold cross-validation accuracy of each particle is its fitness value. The process of k-fold cross-validation is as follows: first, the training data is separated into k folds; sequentially, one fold is taken as the validation set while the remaining k-1 folds are used for training; the average accuracy in predicting the validation sets is the cross-validation accuracy. Each particle has four dimensions: Max Depth, Num Features, Num Iterations and Num Folds. The results achieved are compared with different well-known classification techniques. Cross-validation is a widely used experimental testing procedure. The idea is to break a dataset up into k disjoint subsets of approximately equal size and then perform k experiments: in each experiment the kth subset is removed, the system is trained on the remaining data, and the trained system is tested on the held-out subset. At the end of this k-fold cross-validation, every example has been used in a test set exactly once. This procedure has the advantage that all the test sets are independent; however, the training sets are clearly not independent, since they substantially overlap. The limiting case is to set k = n, where n is the size of the entire dataset (leave-one-out). N-fold cross-validation has been widely accepted as a reliable method for calculating generalization accuracy. The ratio of the number of data instances in the training set to the number in the validation set was maintained as closely as possible to 9:1. This procedure is illustrated in the flowchart of Figure 3.11.
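A minimal sketch of the k-fold partition just described (indices only; illustrative, not the Weka implementation used in this work):

```java
import java.util.*;

// Shuffles the data indices and splits them into k disjoint folds; each fold
// serves once as the validation set while the remaining k-1 folds train.
public class KFoldSketch {
    public static void main(String[] args) {
        int n = 1283, k = 10;                       // dataset size used in this work
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx, new Random(1));

        for (int fold = 0; fold < k; fold++) {
            int from = fold * n / k, to = (fold + 1) * n / k;
            List<Integer> validation = idx.subList(from, to);
            List<Integer> training = new ArrayList<>(idx.subList(0, from));
            training.addAll(idx.subList(to, n));
            System.out.printf("fold %d: train=%d validate=%d%n",
                    fold + 1, training.size(), validation.size());
        }
    }
}
```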


[Figure 3.11 here: the cross-validation workflow: the dataset (N = 1283) is split into a train set T = 1154 (90% of N) and a test set S = 129 (10% of N) for 10-fold cross-validation, repeated for 10 different seeds. Model selection and hyperparameter optimization steps: (1) randomly split the train set T into train T* and test T' sets; (2) fit the model to T* for different hyperparameter values; (3) compute the confusion matrix for T'; (4) repeat steps 1-3 for 10-fold cross-validation; (5) based on the best accuracy, select the hyperparameters and fit the model on the T dataset. The confusion matrix is then computed for S, and the accuracy (for RF) is recorded over the 10-fold cross-validation.]

Figure 3.11: The Workflow Used in the Cross-Validation Part.


In summary, this chapter has presented in detail the proposed techniques for automated data collection, pre-processing, feature extraction, classification and the PSO-combined Random Forest, together with the practical work and the designed software. In the next chapter, the results and the evaluations of the systems are described in detail.


Chapter Four
Results and Discussions

4.1 Overview

One of the important phases in this research work is testing the system and recording the experimental results. The results were chosen to cover various cases so as to provide the best possible solutions; they give a comparable measurement which can be used for selecting an accurate algorithm. This chapter shows the steps taken to realize the methodology of this work and presents the implementation process of the model, together with the implementation details and the results obtained throughout the experimental and evaluation processes.

4.2 Training and Testing Dataset

Based on the searching and gathering of information from websites, the collected dataset contains a set of features, each with its own value. A dataset of 1283 sentences is used for training and testing both RFDSW and RFPSO. The documents were transformed to the vector model, in which each document is represented as a vector in a space. Each vector is represented by the weights of the terms in a document with respect to the dimensions of the space; the number of dimensions is equal to the number of terms or keywords used. In this research work, the sentences in the documents were used as samples for training the classifiers; there must be a sufficient number of positive samples for each classification label.


The dataset is divided into two parts, namely training and testing datasets. The training dataset is used for learning the system and directing the system's decisions in the other phases. The testing dataset is used to evaluate the performance of the proposed system. The system is trained and tested using the K-fold cross-validation method. Cross-validation is a model evaluation method: the dataset is divided into K subsets and the holdout method is repeated K times; each time, one of the K subsets is used as the test set and the other K-1 subsets are put together to form a training set, and the average error across all K trials is then computed. In this research, the dataset is used in the form of ten-fold cross-validation.

4.3 Experiment 1: Results of Pre-processing

The results of this stage comprise all the pre-processing phases:

4.3.1 Transformation and Normalization

A sample of the original dataset is presented in Figure 4.1, and the transformation and normalization results for this sample (the data after conversion to numeric weights) are presented in Figure 4.2. The transformation and normalization processes transform the strings (sentences) into numeric digits and assign each word a specific weight. The class StringToWordVector (which extends Filter and implements UnsupervisedFilter and OptionHandler) converts string attributes into a set of attributes representing word occurrence information from the text contained in the strings, depending on the tokenizer; the set of words (attributes) is determined by the first batch of training data filtered. The result of implementing the developed stop word handler filter is presented in Figure 4.3. This step is applied in both the training and testing procedures; it also has a real impact on the overall performance and improves the accuracy.


Figure 4.1: The Original Dataset.


Figure 4.2: The Data after Conversion to Numeric Weights.


Figure 4.3: The Result of Implementing the Stop Word Handler Filter.


4.4 Experiment 2: RFDSW and RFPSO

Various experimental tests on different models were carried out; the main experimental tests conducted and recorded in this thesis are RFDSW and the optimized version, RFPSO. The classification task was then carried out using variants of RF without the developed filter, SVM and J48. The results were evaluated based on the following criteria: accuracy, error rate, precision, recall and F-measure. A detailed description is outlined below.

4.4.1 Model 1: RFDSW

In this section, RF is used for classifying phishing websites. There are two different models of the classifier, to be compared so that the one that best fits the proposal can be chosen. The first model implements the stop word handler filter, which handles single regular expressions such as tanween removal and al-hamza replacement. Weka packages are used in this research, and while Weka offers several options to specify stop words, a single regular expression is not part of the default stop words handler implementations; the researcher therefore added an implementation of the stop words handler to solve the problem. It is also used for standardizing and normalizing the words in the stop word list, and this step subsequently enhances the accuracy of the default random forest. Figure 4.3 presents the result of implementing the stop word handler filter. Table 4.1 shows the RF parameters used: the input features, the output (class labels), the random seed, Max Depth, Num Features, Num Folds and Num Iteration. A dataset with 1283 instances is used; after converting the strings to word vectors and pre-processing, only 711 input features (words) remained. The output class labels are "Yes" and "No", meaning that 2 output class labels are used. The remaining parameters are: seed (the random number seed for the cross-validation), Max Depth (the maximum depth of the tree; 0 for unlimited), Num Features (the number of randomly chosen attributes; if 0, int(log_2(number of predictors) + 1) is used), Num Folds (which determines the amount of data used for reducing the error of each trained decision tree) and Num Iteration (the number of iterations).

Table 4.1: RF Classifier Parameters

Input Features   Output Class   Num Seed   Max Depth   Num Features   Num Folds   Num Iteration
711              2              1          0           0              3           100

4.4.2 Model 2: RFPSO

In this model, the PSO-based approach for determining the parameter values of the RF classifier, termed RFPSO, is developed, and its results are compared with the RFDSW model. PSO is used for optimizing the parameter values so as to obtain the minimum error for the model. Table 4.2 shows the parameters used for PSO: the number of particles (NoP), the inertia weight (IW), which determines the impact of the historical velocity on the current velocity, the acceleration coefficient (AC), the maximum number of iterations (Max_Itr) and the dimension of the particles (Dim). Each particle has four dimensions, corresponding to the RF parameters (Max Depth, Num Features, Num Folds and Num Iterations), which are used to obtain better parameter values by decreasing the error rate and increasing the accuracy rate in the training and testing phases (see Table 4.3).


Table 4.2: PSO Optimizer Parameters

NoP   Dim                                                  IW    AC    Max_Itr for PSO   Performance target
20    Max Depth, Num Iterations, Num Features, Num Folds   1.2   0.5   100               98.2

Table 4.3: PSO-Obtained Optimal Parameter Values for the RF Classifier

Max Depth   Num Features   Num Folds   Num Iteration
25          -1027          17          104

It is clear from the confusion matrices of both models (RFDSW and RFPSO) that there is a large number of correctly classified instances along with a small number of incorrectly classified instances (see Table 4.4).

Table 4.4: Confusion Matrix for Both the RFDSW and RFPSO Classifiers

Classifier   Actual   Predicted Yes   Predicted No
RFDSW        Yes      630             11
             No       11              631
RFPSO        Yes      627             14
             No       14              628

The accuracy performance for the experiments in this research was measured using the following criteria: 1) Correctly Classified Instances (CCI), 2) Percentage of Correctly Classified Instances (PCCI), 3) Incorrectly Classified Instances (ICI) and 4) Percentage of Incorrectly Classified Instances (PICI). The following subsections demonstrate the performance details of each model in terms of accuracy. Table 4.5 shows the evaluation and results based on accuracy for Experiment 2, comparing RFDSW and RFPSO. Each model produced good accuracy results. The average detection rate of RFPSO is higher, while its average false positive rate is lower; by applying RFPSO, the accuracy increased from 97.81% to 98.28%.

Table 4.5: Evaluation and Result Based on Accuracy for Experiment 2

Classifier   CCI    PCCI %    ICI   PICI %
RFDSW        1255   97.8176   28    2.1824
RFPSO        1261   98.2853   22    1.7147

After the training phase, the testing set was used to test the effectiveness of the system, and several quantitative measures (F-measure, precision and recall) were used to evaluate it. Table 4.6 shows the weighted averages over the categories in Experiment 2 for both models. The precision, recall and F-measure reach their lowest values when the RFDSW classifier is used and their highest values when the RFPSO classifier is used. It is clear that RFPSO has the best accuracy.

Table 4.6: Evaluation and Result (Based on Precision, Recall and F-measure) for Experiment 2

Classifier   Precision   Recall   F-measure
RFDSW        0.978       0.978    0.978
RFPSO        0.983       0.983    0.983

4.5 Experiment 3: Classification Using Default RF, SVM and J48

In this experiment, three classification techniques are used, namely the default RF without the developed filter, SVM and J48. The same datasets mentioned for the earlier experiments 1 and 2 are used with these techniques. The aim of using three models is to compare them and choose the most optimized and accurate one. The first model is the default RF, with the same parameters used in Experiment 2 (see Table 4.1). The second model is the SVM, a linear model, with its parameters (see Table 4.7). The third model is J48, with its parameters (see Table 4.8). The SVM parameters are the input features, the output classes (Yes, No), Epsilon (the epsilon for the rounding error), the type of kernel function, the tolerance parameter, Num Folds (the number of folds for the cross-validation used to generate training data for the calibration models; -1 means use the training data) and Random Seed (the random number seed for the cross-validation).

Table 4.7: SVM Classifier Parameters

Input feature space   Output   Kernel type   Epsilon   Tolerance   Num Folds   Random Seed
711                   2        Polynomial    1.0E-12   0.001       -1          1

The last classifier technique in this experiment is the decision tree classifier. The J48 classifier parameters are the number of attributes, the output (class), the confidence factor (CF, used for pruning), minNO (the minimum number of instances per leaf), Num Folds (which determines the amount of data used for reduced-error pruning: one fold is used for pruning and the rest for growing the tree) and the seed (used for randomizing the data when reduced-error pruning is used).

Table 4.8: J48 Classifier Parameters

Input Feature Space   Output   CF     minNO   Num Folds   Seed
711                   2        0.25   2       3           1

The confusion matrices for the above-mentioned classification techniques in this experiment are presented in Table 4.9.

Table 4.9: Confusion Matrix for Each Classifier (RF, SVM and J48)

Classifier   Actual   Predicted Yes   Predicted No
RF           Yes      627             14
             No       19              623
SVM          Yes      619             22
             No       24              618
J48          Yes      601             40
             No       66              576

The evaluation results for the three used techniques in this experiment are given in Table 4.10. The results indicated high accuracy rates and low misclassification rates.

Table 4.10: Evaluation and Result Based on Accuracy for Experiment 3

Classifier   CCI    PCCI %    ICI   PICI %
RF           1250   97.4279   33    2.5721
SVM          1237   96.4147   46    3.5853
J48          1177   91.7381   106   8.2619

From the results, it can be seen that there are fewer misclassified cases in the RF model compared to SVM and J48, as the number of correctly classified instances is larger and the number of incorrectly classified


instances is lower in the RF model. The RF model gives the highest accuracy rate (97.4279%). On the basis of accuracy, the RF classifier is considered the best for the classification of phishing websites. Table 4.11 shows that precision, recall and F-measure reach their lowest values (0.918, 0.917 and 0.917) with the J48 classifier and their highest values (0.974, 0.974 and 0.974) with the RF classifier. It is clear that RF has the best accuracy, while J48 is the worst classifier on all measures.

Table 4.11: Evaluation and Result Based on Precision, Recall and F-measure for Experiment 3

Classifier   Precision   Recall   F-measure
RF           0.974       0.974    0.974
SVM          0.964       0.964    0.964
J48          0.918       0.917    0.917

4.6 Bar Chart of Experimental Results

This section presents figures of the experimental results. Figures 4.4 and 4.5 illustrate the differences in accuracy among the various classifier models used in this thesis.


[Bar chart: evaluation based on accuracy, comparing RFDSW and RFPSO on CCI, PCCI %, ICI and PICI %.]

Figure 4.4: Classification Accuracy of Experiment 2

[Bar chart: evaluation based on accuracy, comparing RF, SVM and J48 on CCI, PCCI %, ICI and PICI %.]

Figure 4.5: Classification Accuracy of Experiment 3


4.7 Evaluation of Experimental Results

Table 4.12 summarizes the results of all experiments and contains the classification rate for each case. It is clearly seen that the detection rate increases as the number of errors decreases, and almost all of the models give satisfying outcomes, with RFDSW and RFPSO performing best. RFPSO obtained the highest accuracy rate of all the models (98.28%). Figure 4.6 shows a comparative chart of the different classifiers.

Table 4.12: The Accuracy and Performance Metrics for Each Experiment

Parameter            RFPSO    RFDSW    RF       SVM       J48
Accuracy %           98.285   97.817   97.427   96.4147   91.7381
Average Precision    0.983    0.978    0.974    0.964     0.918
Average Recall       0.983    0.978    0.974    0.964     0.917
Average F-measure    0.983    0.978    0.974    0.964     0.917


[Bar chart comparing the accuracy, precision, recall and F-measure of RFPSO, RFDSW, RF, SVM and J48.]

Figure 4.6: Comparative Chart of Using Different Classifiers

4.8 The Graphical User Interface (GUI)

Two graphical user interfaces are used in this research work: one for data collection and one for the main detection system. A detailed description of each is given below.

4.8.1 Data Collection (Data Extraction) Graphical User Interface

The system is designed for collecting and extracting the data set of text documents from several phishing and non-phishing websites, forums and blogs. The main window is shown in Figure 4.7. It consists of several buttons and labels. The data collection user interface contains buttons for converting the documents into two different text files: one for saving phishing websites and the other for saving non-phishing websites.


Figure 4.7: Data Collection Graphical User Interface

4.8.2 The Main Detection System Graphical User Interface

The designed system uses supervised learning for detecting phishing websites. Based on supervised learning algorithms, the detection techniques were implemented and tested. The designed GUI enables user-friendly handling of the core system and can be applied to any Arabic data set. The main detection system window is shown in Figure 4.8. The Java platform with JDK 7 was used for implementing the model under the NetBeans 8.1 IDE.


Figure 4.8: Main Detection System Graphical User Interface


The main GUI of the detection model consists of several sub-windows and a number of buttons, text fields, text areas, labels, tabbed panes and tables:

1. Load Data: allows loading data sets through a file chooser.
2. Load Stop Word: allows loading the list of stop words.
3. Text Filter: applies the pre-processing steps and handles the stop word list.
4. Default RF: classification using the default Random Forest.
5. Tuning RF: RF classification using the optimal parameter values obtained by PSO.
6. Training PSO: the parameters are inserted directly via the GUI through the text fields (see Figure 4.9).

Figure 4.9: Parameters of PSO and Performance Target


Chapter Five
Conclusions and Future Recommendations

5.1 Conclusions

The internet is used massively all over the world, and it is worth mentioning that there are abundant phishing websites. This study attempts to identify such websites: the system was designed to process input documents and automatically identify phishing websites by employing data mining techniques. The main steps of the proposed system are data collection and preparation, pre-processing, and classification. Data was collected automatically from various Arabic websites; the data set consists of 1283 documents on various subjects that fall into two categories, phishing and non-phishing. The pre-processing stage was divided into three sub-stages, namely tokenization, stop word removal and stemming. The most difficult task encountered in this thesis and resolved by the author was developing the stop word list. The classification task was performed to identify phishing and non-phishing websites. Several algorithms have been implemented to solve the problem of text categorization; this study compares three different classification techniques, and the results show differences among the classifiers in accuracy, error rate, precision, recall and F-measure. The main conclusions of this thesis are outlined below. The results show that Random Forest with particle swarm optimization (RFPSO) achieved the highest accuracy and the lowest error rate, followed by


the developed stop word list applied to Random Forest (RFDSW), while the lowest accuracy was obtained by J48. This thesis attempts to build a system for detecting phishing websites through intelligent techniques to replace traditional methods. Several points about building and implementing the proposed system can be concluded from the results obtained with the used techniques and the collected dataset. These points are based on a series of classification experiments and are summarized as follows:

1. The attribute value types in the collected dataset were strings; each string was converted to a word vector, and the vector text was then converted to numeric values to obtain the term weights. Numeric is the more appropriate type, so attributes are treated as numeric throughout this thesis.

2. The stop word list developed in this thesis positively increased the accuracy rate. It is implemented as a stop words handler filter that handles single regular expressions, such as Tanween movement removal and Hamza replacement. Weka, which is used in this research, offers several options for specifying stop words, but a single regular expression is not part of its default stop words handler implementations; therefore, the researcher added an implementation of the stop words handler to solve this problem. It is also used for standardizing and normalizing the words in the stop word list, and this step enhances the accuracy of the default random forest.

3. The detection system is enhanced using nature-inspired algorithms, namely particle swarm optimization combined with Random Forest, to obtain the optimized parameter values for Random Forest.


4. The highest classification accuracy rate is achieved by RFPSO (98.2853%), while RF achieves 97.4279%; RFPSO is therefore more robust than RF.

5.2 Future Recommendations

This work can be extended in the following directions:

1. It can be expanded to multiclass applications and may include other classification techniques, such as deep learning neural networks and the Naïve Bayes classifier.
2. Other nature-inspired algorithms, such as Grey Wolf, Cuckoo Search and Artificial Bee Colony, can be used instead of PSO.
3. The system can be adapted to other languages, such as Kurdish or English, but with different stemmers and stop word lists.
4. The system can be modified to deal with other categories of malicious web content, such as terrorism-related websites.
5. A multi-label classification method might be considered for the classification task; such a method assigns multiple target labels to each instance.
6. Other international data sets from different cultures might be considered.


Appendix

Appendix A1: The pseudo code of Random Forests in the procedure is as follows: A training set S := (x1,y1) ,…, (xn, yn), features F, and number of trees in forest B. 1. function Random Forest(S , F) 2. H ← Ø 3. for i ∈ 1,… , B do 4. S (i) ← A bootstrap sample from S 5. hi ← Randomized Tree Learn(S (i) , F) 6. H H ⋃ {hi} 7. end for 8. return H 9. end function 10.

function Randomized Tree Learn(S , F)

11. At each node: 12.

f ← very small subset of F

13.

Split on best feature in f

14. return The learned tree 15.

end function
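As a hedged illustration of this procedure (a minimal Python sketch built on scikit-learn's decision trees, not the thesis's implementation; X and y are assumed to be NumPy arrays):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest(X, y, n_trees=100, n_sub_features=10, seed=1):
    """Grow n_trees randomized trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample of S
        # max_features restricts each split to a small random subset f of F.
        tree = DecisionTreeClassifier(max_features=n_sub_features)
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def forest_predict(forest, X):
    """Majority vote over the trees (binary 0/1 labels assumed)."""
    votes = np.array([t.predict(X) for t in forest])
    return np.round(votes.mean(axis=0)).astype(int)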

A2: The pseudocode of Particle Swarm Optimization with Random Forests

For each particle:
1. Initialize the particle (the Random Forest parameter values) according to the swarm size; each particle holds four elements:
   a) Max Depth (a random number between 1 and 10)
   b) Number of Folds (a random number between 2 and 10)
   c) Number of Iterations (a random number between 100 and 1000)
   d) Number of Features (a random number between 100 and 600)
2. Initialize the PSO parameters:
   a) Max Iterations (default value 100)
   b) Swarm Size (default value 20)
   c) Correction Factor (default value 0.5)
   d) Inertia (default value 1.2)
   e) Performance Target (no default value)
End

Do:
3. For each particle:
       Calculate the fitness value (the rate of correctly classified instances in Random Forests)
       If the fitness value is better than the best fitness value (pBest) in history:
4.         set the current value as the new pBest
5. End
   Choose the particle with the best fitness value of all the particles as the gBest
   For each particle:
       Calculate the particle velocity according to equation (Eq 2.4)
       Update the particle position according to equation (Eq 2.6)
   End
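Assuming Eq 2.4 and Eq 2.6 are the standard inertia-weight PSO velocity and position updates, the following is a minimal Python sketch of this loop. It is not the thesis's Java/Weka implementation: scikit-learn's RandomForestClassifier stands in for Weka's Random Forest, so Max Depth, Number of Iterations and Number of Features are mapped onto max_depth, n_estimators and max_features as an illustrative assumption, and the Number of Folds element has no direct counterpart and is omitted; all function and variable names are hypothetical.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Search bounds per particle dimension: max_depth, n_estimators, max_features.
LOW  = np.array([1,   100, 100])
HIGH = np.array([10, 1000, 600])

def fitness(pos, X, y):
    depth, n_trees, n_feat = pos.astype(int)
    rf = RandomForestClassifier(max_depth=depth, n_estimators=n_trees,
                                max_features=min(n_feat, X.shape[1]))
    return cross_val_score(rf, X, y, cv=3).mean()  # classification rate

def pso_rf(X, y, swarm_size=20, iters=100, inertia=1.2, c=0.5, seed=1):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(LOW, HIGH, size=(swarm_size, 3))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p, X, y) for p in pos])
    gbest = pbest[pbest_fit.argmax()]
    for _ in range(iters):
        r1, r2 = rng.random((2, swarm_size, 3))
        # Standard inertia-weight velocity and position updates.
        vel = inertia * vel + c * r1 * (pbest - pos) + c * r2 * (gbest - pos)
        pos = np.clip(pos + vel, LOW, HIGH)
        for i, p in enumerate(pos):
            f = fitness(p, X, y)
            if f > pbest_fit[i]:                 # update pBest
                pbest[i], pbest_fit[i] = p, f
        gbest = pbest[pbest_fit.argmax()]        # update gBest
    return gbest, pbest_fit.max()

A call such as pso_rf(X, y) would return the best (max_depth, n_estimators, max_features) triple found and its cross-validated classification rate.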

A3: Pseudocode for Simplified SMO

Input:
    C: regularization parameter
    tol: numerical tolerance
    max_passes: maximum number of times to iterate over the α's without changing them
    (x(1), y(1)), ..., (x(m), y(m)): training data

Output:
    α ∈ R^m: Lagrange multipliers for the solution
    b ∈ R: threshold for the solution

1.  Initialize αi = 0 for all i, and b = 0.
2.  Initialize passes = 0.
3.  while (passes < max_passes)
4.      num_changed_alphas = 0
5.      for i = 1, ..., m
6.          Calculate Ei = f(x(i)) − y(i)
7.          if ((y(i)·Ei < −tol && αi < C) || (y(i)·Ei > tol && αi > 0))
8.              Select j ≠ i randomly.
9.              Calculate Ej = f(x(j)) − y(j)
10.             Save the old α's: αi(old) = αi, αj(old) = αj
11.             Compute L and H
12.             if (L == H), continue to the next i
13.             Compute η by (14)
14.             if (η >= 0), continue to the next i
15.             Compute and clip the new value for αj
16.             if (|αj − αj(old)| < 10^−5), continue to the next i
17.             Determine the value for αi
18.             Compute b
19.             num_changed_alphas := num_changed_alphas + 1
20.         end if
21.     end for
22.     if (num_changed_alphas == 0)
23.         passes := passes + 1
24.     else
25.         passes := 0
26. end while

A4: Decision Tree Pseudocode

Input: training data D
Output: decision tree T

DTBUILD(D)
1.  Begin
2.      T = Ø
3.      T = create the root node and label it with the splitting attribute
4.      T = add an arc to the root node for each split predicate and label
5.      For each arc do
6.          D = the database created by applying the splitting predicate to D
7.          If the stopping point is reached for this path, then
8.              T' = create a leaf node and label it with the appropriate class
9.          Else
10.             T' = DTBUILD(D)
11.         T = add T' to the arc
12.     End


Table 1: Part of the Stop Words in the Arabic Language

Stop word (Arabic)   Reading in English   Meaning in English
من                   Min                  From
عن                   Ean                  About
في                   Fee                  At
الى                  Illa                 To
لكن                  Lakin                But
حتى                  Hatta                Even
ليس                  Lasa                 Not
السيما               Lasyma               Especially
بما                  Bima                 Including
ماذا                 Matha                What
هناك                 Hunaka               There
اللتان               Alltan               Who
على                  Ealaa                On
نحن                  Nahn                 We

Table 2: Derivation forms of the word “write” in the Arabic language

Root form in Arabic   Reading in English   Meaning in English
يكتب                  Yaktub               Writes
اكتب                  Aktub                I write
كتاب                  Kitab                Book
مكتوب                 Maktub               Written
كاتب                  Katib                Writer
كتابة                 Kitaba               Writing

Table 3: The Basic Prefixes

Prefix   Types of words prefixed
أ        Noun, verb
ب        Noun
ت        Noun
س        Verb
ف        Noun, verb
ك        Noun
ل        Noun, verb
و        Noun, verb
ال       Noun

Table 4: The Basic Suffixes

Suffix   Types of words suffixed   Examples
ا        Noun, verb                صدفا, صاحبا
ت        Verb                      صدقت
ة        Noun                      ذاهبة
ك        Noun, verb                كتابك
ن        Verb                      صدقن
ه        Noun, verb                اخترقه, فيه
و        Noun, verb                محترفوا
ي        Noun, verb                عني, اكتبي
ات       Noun                      سيدات
ان       Noun, verb                مدرسان, يكتبان
تم       Verb                      ذهبتم
كم       Noun, verb                منكم, كتابكم
كن       Noun, verb                كتابكن, دخلكن
نا       Noun, verb                كتابنا, ضربنا
ني       Verb                      اعطاني
ها       Noun, verb                منها, كتابها
هم       Noun, verb                بينهم, نصرهم
هن       Noun, verb                بيوتهم, بايعهن
وا       Verb                      صدقوا
ون       Noun, verb                يكذبون, يكتبون
ين       Noun                      مدرسين
تما      Verb                      ذهبتما
كما      Noun, verb                كتابكما, اخرجكما
هما      Noun, verb                منزلهما, اخرجهما


List of Publications

1. Salwa O. Mohamad, Tarik A. Rashid, “Random Forest Particle Swarm Optimisation Integration for Automatic Web Text Detection”, submitted to Knowledge and Information Systems, An International Journal, Springer, ISI indexed, JCR IF = 1.702.

2. Tarik A. Rashid, Salwa O. Mohamad, “Enhancement of Detecting Wicked Website Through Intelligent Methods”. In: Mueller P., Thampi S., Alam Bhuiyan M., Ko R., Doss R., Alcaraz Calero J. (eds), Security in Computing and Communications, SSCC 2016, Communications in Computer and Information Science, vol. 625, pp. 358-368, Springer, Singapore. Indexed by ISI, SCImago IF = 0.175, Scopus SJR IF = 0.149, 2016.

3. Tarik A. Rashid, Salwa O. Mohamad, “Enhancement of Detecting Wicked Website Through Intelligent Methods”, The Fourth International Symposium on Security in Computing and Communications (SSCC’16), Jaipur (Rajasthan), India, 2016.

4. The collected data set is published as an international data set on GitHub: https://github.com/Tarik4Rashid4/Wicked-Websites-Data-Set

Abstract (in Arabic)

Clearly, there exist different environments of malicious websites containing various kinds of information that can pose a threat to all internet users, such as incitement to hacking and encouraging users to spread bad concepts by teaching them to violate personal rights and to break into Wi-Fi networks, personal websites, internet forums, email accounts and Facebook. The proposed work addresses protection from hacking by designing a method that makes full use of machine learning and the capabilities of intelligent systems, with the aim of educating the components of society. The main goal of this research is to understand the behavior of the system and find the best solutions for securing unprotected users and the community. The methods used in classifying wicked websites are Random Forest, Support Vector Machine and the decision tree, with Particle Swarm Optimization used to improve the classification. A stemmer, a stop word list, TF-IDF and a tokenizer were used in this research for data processing and feature extraction. PSO was used to find the best parameter values for the RF classifier instead of traditional methods; RF showed promising results in terms of accuracy and precision, outperforming SVM and J48. The developed approach tunes the parameters of the Random Forest classifier via PSO, which is considered the principal method for improving Random Forest classifiers; in this study the optimization algorithm for finding the best parameter values is named RFPSO. In the proposed method, the data set, which was collected automatically by the researcher specifically for this research, is loaded; the sentences are then converted into words and the data is passed through a filter. The experimental results successfully confirm that the new RFPSO approach works better than the default Random Forest for detecting wicked websites, achieving a higher detection rate and a lower error rate (detection rate 97.4% and error rate 2.5% for RF, versus detection rate 98.28% and error rate 1.71% for RFPSO).


Abstract (in Kurdish)

Clearly, there are different kinds of harmful websites, containing various pieces of information that threaten all web users, such as material that provokes website hacking and encourages spreading ideas under the name of teaching infiltration into the internet, Wi-Fi, websites, internet forums, Facebook, email accounts, and so on. The proposed work deals with protecting websites from hacking by building an approach that benefits from machine learning and intelligent systems capable of understanding the content of the information. The final aim of this research is to gain a complete understanding of the system's behavior and to determine the best solution so that users are protected from the mentioned risks, and likewise to protect the state and society. Random Forest (RF), Support Vector Machines (SVM) and the decision tree (J48) were used for classification, while TF-IDF, stemming, stop word removal and tokenizing were used for data pre-processing and feature extraction. Particle Swarm Optimization (PSO) was used to find the best parameter values for the RF classifier in order to improve it, instead of the ordinary method. The results show that RF is more accurate and less error-prone than SVM and J48. RFPSO is considered the best approach for determining the required parameter values. In the researcher's proposed approach, the newly collected data set of raw data (text documents) is loaded automatically; the sentences are then converted into word vectors and the words are filtered. The classification accuracy of Random Forest was used as the fitness function for evaluating PSO. The experimental results successfully confirm that the new approach works better than using Random Forest alone, achieving the best accuracy rate.

