SENTIMENT ANALYSIS OF BIG DATA USING NAÏVE BAYES AND NEURAL NETWORK

A thesis submitted to the Council of the Faculty of Science and Science Education, School of Science, at the University of Sulaimani in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

By

Mzhda Hiwa Hama B.Sc. Computer Science (2010), University of Sulaimani

Supervised by

Dr. Sozan Abdulla Mahmood Assistant Professor

Gulan 2715

May 2015

Supervisor’s Certification

I certify that the thesis entitled "Sentiment Analysis of Big Data Using Naïve Bayes and Neural Network", prepared by "Mzhda Hiwa Hama", was prepared under my supervision in the School of Science, Faculty of Science and Science Education at the University of Sulaimani, as partial fulfillment of the requirements for the degree of Master of Science in Computer Science.

Signature:
Name: Dr. Sozan A. Mahmood
Title: Assistant Professor
Date: / / 2015

In view of the available recommendation, I forward this thesis for debate by the examining committee.

Signature:
Name: Dr. Aree Ali Mohammed
Title: Assistant Professor
Date: / / 2015

Linguistic Evaluation Certification

I hereby certify that this thesis entitled "Sentiment Analysis of Big Data Using Naïve Bayes and Neural Network", prepared by "Mzhda Hiwa Hama", has been read and checked. After all the grammatical and spelling mistakes were indicated, the thesis was returned to the candidate to make the adequate corrections. After the second reading, I found that the candidate had corrected the indicated mistakes. Therefore, I certify that this thesis is free from such mistakes.

Signature:
Name: Zana Mahmood Hassan
Position: English Department, School of Languages, University of Sulaimani
Date: / / 2015

Examining Committee Certification

We certify that we have read this thesis entitled "Sentiment Analysis of Big Data Using Naïve Bayes and Neural Network", prepared by "Mzhda Hiwa Hama", and, as the Examining Committee, examined the student in its content and in what is connected with it. In our opinion, it meets the basic requirements for the degree of Master of Science in Computer Science.

Signature:
Name: Dr. Hussein Keitan Al-Khafaji
Title: Professor
Date: / / 2015
(Chairman)

Signature:
Name: Dr. Nzar A. Ali
Title: Lecturer
Date: / / 2015
(Member)

Signature:
Name: Dr. Aysar A. Abdulrahman
Title: Lecturer
Date: / / 2015
(Member)

Signature:
Name: Dr. Sozan A. Mahmood
Title: Assistant Professor
Date: / / 2015
(Supervisor-Member)

Approved by the Dean of the Faculty of Science and Science Education.
Signature:
Name: Dr. Bakhtiar Qader Aziz
Title: Professor
Date: / / 2015

Acknowledgments

First of all, my great thanks to God, who helped me and gave me the ability to finish this work.
I would like to express my appreciation to my supervisor, Dr. Sozan, for her guidance and supervision during this work.
Special thanks to my husband, Mr. Salam, for his love, patience, encouragement, scientific notes, and the support he has shown during my study, which carried me through to the completion of this work. Without his help this work could not have been completed.
Special thanks to Dr. Alaa, Mr. Hogir, Mr. Omid and Mr. Tawfiq for their assistance.
Furthermore, I would like to thank my family for their endless love and support.
Finally, I would like to thank the staff of the Computer Science Department.

Abstract

In recent years, the volume of data has increased dramatically. Structured, semi-structured and unstructured data come from different sources such as blogs, e-mails, social networks, wikis and tweets, and the stream of data arrives at a huge velocity. Big data analytics refers to the capability of extracting useful information from these types of data, which cannot be managed by current methodologies, relational database management systems or data mining software tools. The most important software for big data analytics is Apache Hadoop, which is based on the MapReduce programming paradigm and a distributed file system called the Hadoop Distributed File System. It allows writing programs using Apache Hive, which quickly processes large amounts of data in parallel on a large cluster of computing nodes. Apache Hive code, written in HiveQL, is compiled and translated into MapReduce jobs. These jobs divide the input dataset into independent subsets which are processed by map tasks in parallel, followed by reduce tasks that produce the results. In this work, sentiment analysis, one of the use cases of big data analytics, has been investigated. Sentiment analysis is the task of determining the polarity of opinions, feelings and attitudes expressed in text sources. A sentiment analysis tool should predict whether the underlying opinion in a given text is positive or negative. In this work, 70% of a 1,500,000-tweet dataset provided by the University of Michigan has been used as the training set, and the rest has been used as the test set. One bag of words has been built from the training set and used by the Multinomial Naïve Bayes algorithm and a feedforward neural network. Moreover, an existing bag of words containing 6,000 positive and negative words has been used in this work. All of them have been applied to the test set. Naïve Bayes has the highest F-score of all, at 0.76; the neural network has the second highest accuracy, with an F-score of 0.74; while the existing bag of words has the lowest accuracy, with an F-score of 0.68. Apache Hive has been used to implement the algorithms inside Hortonworks Hadoop, a single-cluster, portable Hadoop environment. For preprocessing, the Python NLTK libraries have been used, which enhanced the accuracy of the algorithms by 2 percentage points. A Java program has been used to connect to Twitter using the OAuth API, and 50,000 tweets from five categories (iPhone 6, Obama, Kurd, Marriage and ISIS) have been downloaded and stored in the MongoDB NoSQL database in JSON format. After importing the data into Hortonworks Hadoop and applying the Naïve Bayes algorithm to them, all the analyzed and refined results have been visualized using Microsoft Excel Professional Power View.


CONTENTS

Abstract .......... i
Contents .......... iii
List of Figures .......... viii
List of Tables .......... xiii
List of Abbreviations .......... xv

Chapter One: Introduction to Big Data Analytics
1.1  Era of Big Data .......... 1
1.2  Three V's of Big Data .......... 2
1.2.1  Volume .......... 3
1.2.2  Variety .......... 3
1.2.3  Velocity .......... 4
1.3  Big Data Analytics .......... 5
1.4  The Importance of Big Data and Its Analytics .......... 8
1.5  Visualization .......... 9
1.6  Big Data Use Cases .......... 11
1.6.1  Sentiment Analysis .......... 11
1.6.2  Predictive Analysis .......... 13
1.6.3  Fraud Detection .......... 13
1.7  Big Data Challenges .......... 13
1.8  Literature Survey .......... 14
1.9  Thesis Objectives .......... 18
1.10  Thesis Layout .......... 19

Chapter Two: Big Data Technologies, Tools, Paradigm and Algorithms
2.1  Introduction .......... 20
2.2  Hadoop .......... 20
2.2.1  Hadoop Characteristics .......... 21
2.2.2  Hadoop Components .......... 22
2.3  Hive .......... 28
2.3.1  Hive Characteristics .......... 28
2.3.2  HCatalog .......... 29
2.4  Hortonworks .......... 29
2.5  NoSQL Databases .......... 30
2.5.1  Characteristics of NoSQL .......... 33
2.5.2  Types of NoSQL Databases .......... 36
2.5.2.1  Key Value Stores .......... 36
2.5.2.2  Document Databases (or Stores) .......... 37
2.5.2.3  Column Family Stores .......... 39
2.5.2.4  Graph Databases .......... 40
2.6  MongoDB .......... 41
2.6.1  Key MongoDB Features .......... 45
2.7  Eclipse .......... 47
2.8  Naïve Bayes .......... 47
2.8.1  Multinomial Naïve Bayes .......... 48
2.8.2  Bag of Words Representation .......... 50
2.9  Feed Forward Neural Network .......... 52
2.10  Python .......... 52
2.10.1  NLTK .......... 53

Chapter Three: Requirement Analysis and Algorithms Implementation for Proposed System
3.1  Introduction .......... 54
3.2  Requirement Analysis .......... 54
3.3  Preprocessing .......... 57
3.3.1  Stop Words .......... 59
3.3.2  Stemming .......... 60
3.3.3  Lemmatization .......... 60
3.3.4  Removing Repeating Characters .......... 61
3.3.5  Removing Non-Alphabetic Characters .......... 61
3.4  Text Classification .......... 62
3.4.1  Naïve Bayes Algorithm .......... 62
3.4.1.1  Multinomial Naïve Bayes Algorithm Formulization .......... 62
3.4.1.2  Implementation of Multinomial Naïve Bayes in Hadoop .......... 63
3.4.1.3  Steps of Creation of Bag of Words and Implementation of Naïve Bayes .......... 65
3.4.2  Feedforward Neural Network .......... 75
3.4.2.1  Feed Forward Neural Network Algorithm .......... 76
3.4.2.2  Implementation of Feed Forward Neural Network in Hadoop .......... 77
3.4.2.3  Steps of Creation of Bag of Words and Implementation of Feed Forward Neural Network Algorithm .......... 78
3.4.3  Neural Network Algorithm for the Ready Dictionary .......... 83
3.5  Overall Proposed System .......... 85
3.6  Steps of Analytic .......... 86

Chapter Four: Algorithms Results, Sentiment Analytics Results and Visualization
4.1  Introduction .......... 97
4.2  Challenges and their Solution .......... 97
4.3  Accuracy Comparison of Algorithms .......... 99
4.4  Experiment Sentiment Analysis Result .......... 103
4.5  Access the Refined Sentiment Data and Visualization using Microsoft Excel .......... 105
4.6  Sentiment Analysis Statistics from USA .......... 108
4.7  Sentiment Analysis Statistics from Iraq .......... 110

Chapter Five: Conclusions and Suggestions for Future Work
5.1  Conclusions .......... 112
5.2  Suggestions for Future Work .......... 114

References .......... 115

List of Figures

Figure No.  Figure Title .......... Page No.
1.1  Three V's of Big Data .......... 2
1.2  Historical Periods of Positive and Negative Moods .......... 7
1.3  Election Map Result Prediction for 2012 in US .......... 7
1.4  Google Flu Trends in Japan .......... 8
1.5  Word Clouds .......... 10
1.6  BiblioMetrics .......... 11
2.1  HDFS Architecture .......... 23
2.2  Illustration of MapReduce Algorithm .......... 26
2.3  Hortonworks Data Platform .......... 30
2.4  Structured, Unstructured and Semi-Structured Data .......... 32
2.5  CAP Theorem .......... 35
2.6  Horizontal Scaling and Vertical Scaling .......... 36
2.7  Key Value Stores .......... 37
2.8  Relational Data Model and Document Data Model .......... 39
2.9  Column Family Data Store .......... 40
2.10  Graph Database .......... 41
2.11  MongoDB Programming Languages .......... 42
2.12  Document in MongoDB .......... 43
2.13  A Collection of MongoDB Documents .......... 43
2.14  Insertion into Collection .......... 44
2.15  MongoDB Database Model .......... 44
2.16  The Bag of Words Representation .......... 51
3.1  Activity Diagram for Preprocessing in Python .......... 58
3.2  List of Stop Words in English .......... 59
3.3  Activity Diagram to Make Naïve Bayes Bag of Words .......... 64
3.4  Naïve Bayes Algorithm Implementation on Hadoop Using Hive .......... 65
3.5  Uploading Training Set to Hadoop .......... 66
3.6  Training Set Table Made by HCatalog .......... 66
3.7  Tokenizing for Training Set .......... 67
3.8  Splitwords for Training Set .......... 68
3.9  Calculate Total Number of Tweets in Training Set .......... 68
3.10  Calculate Total Number of Negative Tweets in Training Set .......... 69
3.11  Total Count of Positive Tweets .......... 69
3.12  Probability Table .......... 70
3.13  Counts of Each Word in Positive Sentiments .......... 70
3.14  Count of All Words in Positive Sentiment .......... 71
3.15  Total Number of Each Word in Negative Sentiments .......... 71
3.16  Total Number of All Words in Negative Sentiments .......... 72
3.17  Occurrence of Each Word in Positive and Negative Sentiments .......... 72
3.18  BOW Cross-Join Necessary Attributes from Other Table .......... 73
3.19  Part of Query Results: Joining Test Set Words with Final Tables .......... 74
3.20  Part of Query Results: Sentiment Results for Test Set .......... 74
3.21  Neural Network Architecture .......... 75
3.22  Activity Diagram to Compute Positive and Negative Weights .......... 77
3.23  Implementation of Neural Network on Hadoop Using Hive .......... 78
3.24  Part of Positive and Negative Weights .......... 79
3.25  Count of Each Word in Each Tweet .......... 80
3.26  Input Layer with the Weights of the Words .......... 81
3.27  Summation of Positive and Negative of Hidden Layer .......... 81
3.28  Activation Function of Hidden Positive and Hidden Negative .......... 82
3.29  Output Layer .......... 82
3.30  Algorithm of Ready Dictionary Using Neural Network .......... 84
3.31  Overview of System Implementation .......... 85
3.32  Downloading Tweets from Twitter and Storing in MongoDB .......... 87
3.33  monjaDB .......... 88
3.34  Mapping Time Zone .......... 90
3.35  Uploading the Preprocessed Tweets into HDFS .......... 91
3.36  Create Kurdish Table Using HCatalog .......... 92
3.37  Time Zone Map .......... 93
3.38  Tokenized Kurdish Table .......... 94
3.39  Separated Words with Their Associated IDs .......... 94
3.40  Joining the Separated Words with Bag of Words .......... 95
3.41  Kurdishresult Table .......... 96
3.42  KurdishResultJoin Table Including Country Field .......... 96
4.1  Chart of Analyzed Results for the Proposed System .......... 104
4.2  Map .......... 107
4.3  Visualization of Sentiment Data about Kurd around the World .......... 107
4.4  Visualization of the Tweets that Come from USA .......... 109
4.5  Results Chart of Sentiments of USA .......... 109
4.6  Visualization of the Tweets that Come from Iraq .......... 111
4.7  Results Chart of Sentiments of Iraq .......... 111

List of Tables

Table No.  Table Title .......... Page No.
3.1  Simple Snapshot of the Original Dataset .......... 56
3.2  Snapshot of the Prepared Dictionary .......... 57
3.3  Snapshot of Output of Preprocessing .......... 61
3.4  Time Zone Map .......... 89
4.1  Snapshot of Preprocessing Using SQL Server 2012 .......... 98
4.2  Snapshot of Results of Three Different Algorithms .......... 99
4.3  Confusion Matrix .......... 101
4.4  Confusion Matrix for Naïve Bayes .......... 101
4.5  Confusion Matrix for Neural Network Algorithm Results .......... 102
4.6  Confusion Matrix for Existing BOW Algorithm Results .......... 102
4.7  Accuracy Comparison among Naïve Bayes, Neural Network and Existing BOW .......... 102
4.8  Sentiment Analysis Results .......... 104
4.9  Imported Data from Hadoop .......... 105
4.10  Countries and Their Associated Sentiment about Kurd .......... 106
4.11  The Statistics from the United States .......... 108
4.12  The Statistics from Iraq .......... 110

List of Abbreviations

API .......... Application Programming Interface
BOW .......... Bag of Words
BSON .......... Binary JSON (JavaScript Object Notation)
CAP .......... Consistency, Availability, Partition Tolerance
CDT .......... C/C++ Development Tooling
CPU .......... Central Processing Unit
CSV .......... Comma Separated Values
DDL .......... Data Definition Language
DFS .......... Distributed File System
GB .......... Gigabytes
GFS .......... Google File System
HDFS .......... Hadoop Distributed File System
HiveQL .......... Hive Query Language
IBM .......... International Business Machines
IDC .......... International Data Corporation
IR .......... Information Retrieval
ISIS .......... Islamic State in Iraq and Syria
JDT .......... Java Development Tools
JSON .......... JavaScript Object Notation
KV .......... Key-Value
LBS .......... Location-Based Services
MSE .......... Mean Squared Error
MAP .......... Maximum A Posteriori Probability
NLTK .......... Natural Language Toolkit
NoSQL .......... Not Only SQL
ODBC .......... Open Database Connectivity
ORC .......... Optimized Row Columnar
PC .......... Personal Computer
PDT .......... PHP Development Tools
RCFile .......... Record Columnar File
RDBMS .......... Relational Database Management Systems
RFID .......... Radio Frequency Identification
RSS .......... Rich Site Summary
SerDe .......... Serializer/Deserializer
SQL .......... Structured Query Language
TB .......... Terabytes
US .......... United States
URL .......... Uniform Resource Locator
VoIP .......... Voice over Internet Protocol
XML .......... Extensible Markup Language
ZB .......... Zettabyte

DML .......... Data Manipulation Language

Chapter One

Introduction to Big Data Analytics


1.1 Era of Big Data

We are living in an era of information explosion in which the amount of data is getting increasingly larger because of virtual worlds, wikis, blogs, emails, online games, VoIP telephony, digital photos, tweets, traffic systems, bridges, airplanes and engines, satellites, RFID and weather sensors. The IBM Big Data Flood infographic showed that 2.7 ZB of data existed in the digital universe in 2012. According to the same study, hundreds of thousands of TB of data are updated daily through Facebook alone, and many other activities on social networks lead to an estimate of 35 ZB of data generated annually by 2020. Just to have an idea of the amount of data being generated, one ZB equals 10^21 bytes, meaning 10^12 GB [1]. This explosion of data is referred to as "Big Data". A good definition of "Big Data" may be the one from a paper by the McKinsey Global Institute in May 2011, stating that Big Data "refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze" [2]. Traditional database management systems, such as relational databases, were proven good for structured data, but for semi-structured and unstructured data they are not appropriate tools. In reality, however, data come from different sources in various formats, and the vast majority of these data are unstructured or semi-structured in nature. Moreover, database systems are also pushed to their limits of storage capacity and scale, so they cannot effectively store unstructured and semi-structured data. As a result, organizations are struggling to extract useful information from the unpredictable explosion of data captured from inside and outside their organizations [3]. To tackle the challenges of Big Data, we must choose an alternative way to process data: we need new technologies. IDC defines Big Data technologies as a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data by enabling high velocity capture, discovery and/or analysis [4]. Google was the pioneer of many big data technologies, including the MapReduce computation framework, GFS and distributed locking services. Amazon's distributed key-value store (Dynamo) created a new milestone in the big data storage space as well. Moreover, during the last few years open source tools and technologies including Hadoop, HBase, Hive, MongoDB, Cassandra, Storm and many other projects have been added to the big data space [3].

1.2 Three V's of Big Data

Most definitions of big data focus on the size of data in storage. Certainly, size matters, but there are other important attributes of big data, namely data variety and data velocity. The three V's of big data (volume, variety, and velocity) constitute a comprehensive definition, and they bust the myth that big data is only about data volume. In addition, each of the three V's has its own ramifications for analytics.

Figure (1.1) Three V's of Big Data [5]


1.2.1 Volume

It is obvious that data volume is about the size of the data, and it is the primary attribute of big data. Bearing that in mind, most people define big data in TB, sometimes PB. Big data can also be quantified by counting records, transactions, tables, or files. Some organizations find it more useful to quantify big data in terms of time. For example, due to the seven-year statute of limitations in the U.S., many firms prefer to keep seven years of data available for risk, compliance, and legal analysis. The scope of big data affects its quantification, too. For example, in many organizations, the data collected for general data warehousing differs from the data collected specifically for analytics. Different forms of analytics may have different data sets. Some analytic practices lead a business analyst or similar user to create ad hoc analytic data sets per analytic project. Furthermore, each of these quantifications of big data grows continuously. All this makes big data for analytics a moving target that is tough to quantify [5].

1.2.2 Variety

One of the things that makes big data really big is that it comes from a greater variety of sources than ever before. Many of the newer ones are Web sources, including logs, clickstreams, and social media. Certainly, user organizations have been collecting Web data for years; however, for most organizations, it has been a kind of hoarding. Variety refers to different types of data. With the increased use of smart devices, sensors, and social collaboration technologies, data has become large and complex, as it includes not only traditional relational data but also semi-structured and unstructured data from different sources such as web pages, search indexes, e-mails, documents, sensor data, social media forums and web log files (including click-stream data). Over 80% of the data generated in the world is unstructured, which makes it difficult for traditional tools to handle and analyze. Organizations should choose an analytical tool consisting of both traditional and non-traditional methods of data analysis, as traditional analytical tools are limited to structured data analysis. An organization's success depends on its ability to analyze both relational and non-relational data [6].

1.2.3 Velocity

Big data can be described by its velocity or speed. You may prefer to think of it as the frequency of data generation or data delivery. For example, think of the stream of data coming from any kind of device or sensor, say robotic manufacturing machines, thermometers sensing temperature, microphones listening for movement in a secure area, or video cameras scanning for a specific face in a crowd. The collection of big data in real time is not new; many firms have been collecting clickstream data from Web sites for years, using streaming data to make purchase recommendations to Web visitors. With sensor and Web data flying at you relentlessly in real time, data volumes become big really quickly. Even more challenging, the analytics that go with streaming data have to make sense of the data and possibly take actions, all in real time. Velocity refers to the speed of data, and it can be viewed in two folds. First, it describes the rate at which new data flow in and existing data are updated, known as the 'acquisition rate challenge'. Second, it corresponds to the acceptable time to analyze the data and act on it while it is flowing in, known as the 'timeliness challenge'. These are essentially two different issues which do not necessarily have to occur at the same time, but often they do. The first challenge is to receive, perhaps filter, manage and store the fast and continuously arriving data; the task is to update a persistent state in some database, and to do that very quickly and very often. The second challenge regards the timeliness of information extraction and analysis, that is, identifying complex patterns in a stream of data and reacting to the incoming data. This is often called "stream analysis". The important point is to react to inflowing data and events in (near) real time, which allows organizations to be more agile than the competition [7].
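To make the 'timeliness challenge' concrete, the sketch below shows one minimal way a stream consumer could watch the inflow rate and react in near real time. It is only an illustration of the idea, not part of the thesis implementation; the window size, threshold and event timestamps are arbitrary choices.

```python
import time
from collections import deque

class RateMonitor:
    """Counts events in a sliding time window and flags bursts."""

    def __init__(self, window_seconds=60, threshold=1000):
        self.window = window_seconds    # size of the sliding window in seconds
        self.threshold = threshold      # events per window that trigger an alert
        self.events = deque()           # timestamps of recent events

    def observe(self, timestamp=None):
        now = timestamp if timestamp is not None else time.time()
        self.events.append(now)
        # drop events that have fallen out of the window
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()
        return len(self.events) > self.threshold

# usage: feed each incoming tweet or sensor reading to observe();
# a True return value means the inflow rate exceeded the threshold
monitor = RateMonitor(window_seconds=10, threshold=5)
for t in [0, 1, 2, 2.5, 3, 3.2, 3.4]:
    if monitor.observe(t):
        print(f"burst detected at t={t}")
```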

1.3 Big Data Analytics

The process of collecting, organizing and analyzing such large sets of data to discover patterns and other useful information is called "big data analytics". Big data analytics refers to tools and methodologies which aim at transforming massive quantities of raw data into "data about the data" for analytical purposes. They typically rely on powerful algorithms that can detect patterns, trends, and correlations over various time horizons in the data, but also on advanced visualization techniques such as "sense-making tools". Once trained (which involves having training data), algorithms can help make predictions that can be used to detect anomalies in the form of large deviations from the expected trends or relations in the data [8]. Organizations can use the information derived from analysis of data to identify trends and measure the performance of the business. The information derived from the data can also be used to predict future trends and to enable risk analysis and fraud detection. For example, a retail store can quickly identify items commonly purchased together and place them in a nearby location in the store to maximize sales. An online retailer can make suggestions depending on which item the customer is considering purchasing. Credit card companies can analyze transactions in real time to find suspicious transactions and automatically take measures to protect customers. Consumer location can be tracked by analyzing trends, and any aberration in purchasing behavior can be highlighted as possible fraud so that appropriate action can be taken. For example, if a consumer uses a credit card in New York and the same credit card is used to make a purchase in San Francisco at the same moment, the real-time fraud detection system can analyze the geographical distance and the time duration and send an alert. Banks and financial organizations also use stock price data and perform statistical calculations such as Monte Carlo simulations to predict risk and expected stock prices for investment. Data analytics is a very powerful tool that has enabled organizations to gather information regarding consumer preferences as well as their demographics. For example, most grocery stores have discount cards, which enable customers to receive discounts on some commonly used items. In return for the shopping discounts offered to the customer, the retailer gathers valuable data on shopping behavior, highlighting shopping trends and buying preferences, which ultimately results in better marketing for the retailer. By leveraging data and the subsequent business analytics, Target was able to increase its sales from $44 billion in 2002 to $67 billion in 2010. The total cumulative inflation for the same time period was around 22.3%, but sales increased by over 65% [9]. Analyzing the expression of emotions in 20th century books is a relatively simple example of collecting and analyzing data. Google digitized books from the 20th century and made the data available to the public. On one hand, WordNet has scored a large set of 1-grams with a mood score and made the data set available to the public. On the other, the mathematics required to analyze the case is relatively straightforward. Analysts have found evidence for distinct historical periods of positive and negative moods, underlain by a general decrease in the use of emotion-related words through time. Finally, they have shown that, in books, American English has become decidedly more "emotional" than British English in the last half-century, as part of a more general increase in the stylistic divergence between the two variants of the English language. The analysis also showed the difference between the z-scores of Joy and Sadness for the years 1900 to 2000 (raw data and smoothed trend). Values above zero indicate generally 'happy' periods, and values below zero indicate generally 'sad' periods. Values are smoothed using Friedman's 'super smoother' through the R function supsmu() [10].

Figure (1.2) Historical periods of positive and negative moods [10]

Nate Silver (born January 13, 1978) is an American statistician who, by 2012, had accurately predicted presidential election result maps by doing fairly basic research and employing fairly simple methods to find what really has predictive power in a political campaign. As shown in figure (1.3), blue denotes states/districts won by Obama/Biden, and red denotes those won by McCain/Palin. Numbers indicate the electoral votes allotted to the winner of each state [11].

Figure (1.3) Election Map Result Prediction for 2012 in US [11]


1.4 The Importance of Big Data and Its Analytics

Big Data is a big challenge, not only for software engineers but also for governments and economists across every sector. Studies from the McKinsey Global Institute show, for example, that if US health care could use big data creatively and effectively to drive efficiency and quality, the potential value from data in the sector could reach more than $300 billion every year [2]. As an example, Google Flu Trends uses aggregated Google search data to estimate flu activity. They have found a close relationship between how many people search for flu-related topics and how many people actually have flu symptoms. Of course, not every person who searches for "flu" is actually sick, but a pattern emerges when all the flu-related search queries are added together. They compared the query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening. By counting how often Google sees these search queries, they can estimate how flu is circulating in different countries and regions around the world. Estimates were made using a model that proved accurate when compared to historical official flu activity data, as shown in figure (1.4) [12].

Figure (1.4) Google Flu Trends in Japan [12]



As other examples, recommender systems have become extremely common in recent years and are applied in a variety of domains. The most popular ones are probably movies, music, news, books, research articles, search queries, social tags, and products in general. However, there are also recommender systems for experts, jokes, restaurants, financial services, life insurance, persons (online dating), and Twitter followers [13]. Today almost every enterprise realizes the importance of big data and its analytics and tries to take advantage of them to increase productivity, speed up marketing, discover competitors, enhance security, detect fraud and even improve health care.

1.5 Visualization

An important feature of Big Data analytics is the role of visualization, which can provide new perspectives on findings that would otherwise be difficult to grasp [8]. Visualization creates encodings of data into visual channels that people can view and understand. This process externalizes the data and enables people to think about and manipulate the data at a higher level. Visualization exploits the human visual system to provide an intuitive, immediate and language-independent way to view and show the data. It is an essential tool for understanding information. The human visual system is by far the richest, most immediate, highest-bandwidth pipeline into the human mind. The amount of brain capacity devoted to processing visual input far exceeds that devoted to the other human senses. Some scientific estimates suggest that the human visual system is capable of processing about 9 megabits of information per second, which corresponds to close to 1 million letters of text per second [14]. There are a lot of beautiful interactive visualizations. For example, "word clouds", shown in figure (1.5), are sets of words that have appeared in a certain body of text, such as blogs, news articles or speeches. They are a simple and common example of the use of visualization, or "information design", to unearth trends and convey information contained in a dataset [8].

Figure (1.5) Word Clouds [8]
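As an illustration of how such a word cloud can be produced, the sketch below uses the third-party Python wordcloud package together with matplotlib. The sample text and sizing parameters are placeholders; any body of text (blogs, news articles, speeches or tweets) could be substituted.

```python
# pip install wordcloud matplotlib
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# 'text' stands in for any body of text: blog posts, news articles, tweets...
text = ("big data hadoop hive mapreduce twitter sentiment analysis " * 10
        + "naive bayes neural network python nltk visualization")

# the WordCloud object sizes each word according to its frequency in the text
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```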

The academic paper citation metric was repurposed, when ranking the relevance of academic papers, to determine when a new field of science emerged. This set of scientific fields shows the major shifts in science over the last decade. Each significance clustering for the citation networks in the years 2001, 2003, 2005, and 2007 occupies a column in the diagram and is horizontally connected to the preceding and succeeding significance clusterings by stream fields. Each block in a column represents a field, and the height of the block reflects the citation flow through the field. The fields are ordered from bottom to top by their size, with mutually non-significant fields placed together and separated by half of the standard spacing. A darker color indicates the significant subset of each cluster. All journals that are clustered in the field of neuroscience in 2007 are colored to highlight the fusion and formation of neuroscience. This uncovers the emergence of the new science called "Neuroscience" [15].



Figure (1.6) BiblioMetrics [15]

1.6 Big Data Use Cases

We can use big data techniques to solve various complex problems. Some of the big data use cases are given below:

1.6.1 Sentiment Analysis

This is one of the most widely discussed use cases. Sentiment analysis is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes and emotions towards entities such as products, services, organizations, individuals, issues, events, governments, politics, business, movies and topics. In simple words, it is used to track the mood of the public. A sentiment analysis tool should predict whether the underlying opinion in a given text is positive, negative, or neutral. In recent years, the exponential increase in Internet usage and the exchange of public opinion have been the driving forces behind sentiment analysis. The web is a huge repository of structured and unstructured data. Analyzing these data to extract latent public opinion and sentiment is a challenging task. Generally speaking, sentiment analysis aims at determining the attitude of a speaker or a writer with respect to the overall document [16]. With the rapid growth of social networks, sentiment analysis is becoming much more attractive to researchers. It is becoming so popular that many organizations are investing huge amounts of money to use some sort of sentiment analysis to measure public emotion about their company or products. Increasing interest in this research area is due to the many useful applications associated with it: calculating public opinion polls of presidential elections in the blogosphere, measuring customer satisfaction from product reviews, and gathering customer feedback from websites (Amazon, BestBuy, etc.) and social networks (Twitter, Facebook, etc.). One of the most widely used forms of sentiment analysis is Twitter sentiment analysis. Twitter is also considered a customer relationship platform, where customers are able to easily post reviews about products and services. Providers can also interact with their customers by replying directly to these posts. Many companies have started collecting Twitter data in order to measure customer satisfaction with their products [17].

An enterprise may analyze sentiments about [18]:

- A product: For example, does the target segment understand and appreciate messaging around a product launch? What products do visitors tend to buy together? And what are they most likely to buy in the future?
- A service: For example, a hotel or restaurant can look into its locations with particularly strong or poor service.
- Competitors: In what areas do people see a company as better than (or weaker than) its competitors?
- Reputation: What does the public really think about our company? Is our reputation positive or negative?
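The following is a minimal, self-contained sketch of the polarity prediction described in this section, using a bag-of-words representation and NLTK's built-in NaiveBayesClassifier on a handful of made-up tweets. It only illustrates the idea; the thesis itself implements Multinomial Naïve Bayes as HiveQL queries on Hadoop, as described in later chapters.

```python
from nltk.classify import NaiveBayesClassifier

# each training example is a (bag-of-words feature dict, label) pair;
# the labelled tweets here are invented for illustration only
train = [
    ("i love this phone it is amazing", "positive"),
    ("what a great movie really enjoyed it", "positive"),
    ("this is the worst service ever", "negative"),
    ("i hate waiting so disappointing", "negative"),
]

def features(text):
    # simple bag of words: every token becomes a boolean feature
    return {word: True for word in text.lower().split()}

classifier = NaiveBayesClassifier.train([(features(t), label) for t, label in train])

print(classifier.classify(features("i really love this great phone")))  # expected: positive
print(classifier.classify(features("worst phone i hate it")))           # expected: negative
```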


1.6.2 Predictive Analysis

Another common use case is predictive analysis, which includes correlations, back-testing strategies and probability calculations using Monte Carlo simulations. Capital market firms are among the biggest users of this type of analytics. Moreover, predictive analysis is also used for strategy development and risk management [3].
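As a small illustration of the Monte Carlo style of probability calculation mentioned above, the sketch below simulates many possible one-year price paths for a stock under a geometric Brownian motion assumption and reads off an expected price and a loss probability. The starting price, drift and volatility are invented for the example and carry no financial meaning.

```python
import numpy as np

# illustrative parameters: starting price, annual drift, annual volatility,
# number of trading days, and number of simulated paths
rng = np.random.default_rng(seed=42)
s0, mu, sigma, days, paths = 100.0, 0.07, 0.20, 252, 10_000

dt = 1.0 / days
# draw daily log-returns for every path, then compound them into price paths
shocks = rng.normal((mu - 0.5 * sigma**2) * dt, sigma * np.sqrt(dt), size=(paths, days))
prices = s0 * np.exp(shocks.cumsum(axis=1))
final = prices[:, -1]

print("expected price after one year:", final.mean())
print("probability of losing more than 10%:", (final < 0.9 * s0).mean())
```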

1.6.3 Fraud Detection

Big data analysis techniques are successfully used to detect fraud by correlating point-of-sale data (available to a credit card issuer) with web behavior analysis (either on the bank's site or externally) and cross-examining it with data from other financial institutions or service providers. Finally, big data provides us with tools and technologies to analyze large volumes of complex data to discover patterns and clues; however, we have to decide what problem we want to solve [3].
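A minimal sketch of the distance-versus-time check described earlier (the New York / San Francisco example) could look like the following. The coordinates, timestamps and speed threshold are illustrative assumptions, and a production system would of course combine many more signals.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def looks_fraudulent(tx1, tx2, max_speed_kmh=900):
    """Flag two card transactions whose implied travel speed is implausible."""
    distance = haversine_km(tx1["lat"], tx1["lon"], tx2["lat"], tx2["lon"])
    hours = abs(tx2["time"] - tx1["time"]) / 3600.0
    if hours == 0:
        return distance > 0   # simultaneous use in two different places
    return distance / hours > max_speed_kmh

# a New York purchase followed 30 minutes later by a San Francisco purchase
ny = {"lat": 40.71, "lon": -74.01, "time": 0}
sf = {"lat": 37.77, "lon": -122.42, "time": 30 * 60}
print(looks_fraudulent(ny, sf))   # True: roughly 4,100 km in half an hour is implausible
```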

1.7 Big Data Challenges

1. Understanding Big Data is very important. In order to determine the best strategy for a company, it is essential that the data you are counting on are properly analyzed. The time span of this analysis also matters, because some analyses need to be performed very frequently in order to quickly detect any change in the business environment.

2. Another aspect is represented by the new technologies that are developed every day. Considering that Big Data is new to organizations nowadays, it is necessary for these organizations to learn how to use the newly developed technologies as soon as they are on the market. This is an important aspect that is going to bring competitive advantages to businesses.


3. The need for "IT specialists" is also a challenge for Big Data. According to McKinsey's study on Big Data, there is a need for up to 190,000 more workers with analytical expertise and 1.5 million more data-literate managers in the United States alone. These statistics are proof that, in order for a company to take the Big Data initiative, it has to either hire experts or train its existing employees in the new field.

4. Privacy and security are also important challenges for Big Data. Because Big Data consists of a large amount of complex data, it is very difficult for a company to sort these data by privacy level and apply the corresponding security [19].

1.8 Literature Survey

1. Md. Zahidul Islam (2014) [3]. The aim of this study is to show how to use big data within the cloud. In the proposed system, an application has been built for collecting, organizing, analyzing and visualizing data from the retail industry, gathered from indoor navigation systems and social networks such as Twitter and Facebook. The work is done in a way that users can see shoppers' movement around shopping malls. By visualizing this information, the user can identify dominant traffic paths or low-traffic or bottleneck regions, and can also assist shoppers if they need any help. Twitter data has been analyzed to see how negative or positive people are about certain shopping malls or products; by discovering shoppers' feelings, marketing campaigns can be designed efficiently. The results showed that a Hadoop cluster with Hive or Pig is the best choice for building a data warehouse and analyzing data afterwards (batch processing), while for analyzing data in real time a Hadoop cluster with Storm is the best choice.



2. Chetan Sharma (2014) [20]. In this study, a specific system has been designed which is capable of analyzing and predicting outputs for large-scale databases. One neural network has been generated to predict results as close as possible to those that would have been generated by IBM Watson for the IBM challenge, and another neural network capable of accurately predicting whether to trust a person or not based on their biological and physical behavior. Matlab has been used to generate the neural networks. The back-propagation algorithm and many other training algorithms have been used to find the best algorithm with the lowest possible MSE and the best performance on the training dataset, and the result of each algorithm has been visualized. In order to obtain better results, the author combined two training algorithms (gradient descent and scaled conjugate gradient descent), which provided better MSE and performance than the other algorithms.

3. Lina L. Dhande and Girish K. Patnaik (2014) [16]. In this work, sentiment analysis has been selected as the application. Movie review datasets have been classified into negative or positive sentiment polarity using data mining techniques, namely the Naïve Bayes classifier and a neural network classifier. 2000 reviews have been tested: the Naïve Bayes classifier produced 1247 correctly classified samples and 753 incorrect ones, while the neural network produced 999 correct samples and 1001 incorrect ones. After the experiment, sentiment analysis using the Naïve Bayes classifier obtained 62.35% accuracy on the training data, and using the neural network classifier the accuracy obtained was 49.95% on the training dataset. The accuracy of sentiment analysis is increased by combining the Naïve Bayes and neural network classifiers.



4. Pablo Gamallo and Marcos Garcia (2014) [21]. This paper describes a strategy based on a Naïve Bayes classifier for detecting the polarity of English tweets. The experiments have shown that the best performance is achieved by using a binary classifier trained to detect just two categories, positive and negative. Tweets have been preprocessed by removing URLs, references to usernames and hashtags, reducing replicated characters, and identifying emoticons and interjections and replacing them with polarity or sentiment expressions. The F-score achieved in the experiments is 63%.

5. Das and P. M. Kumar (2013) [22]. This paper explains that most data saved by organizations are unstructured, and the authors have worked on how to analyze such data. The unstructured data are taken from public tweets on Twitter and then stored in the NoSQL database HBase using a Hadoop cluster. The user interface (front end) is built using JavaScript, which is connected to a Java framework, and the Java framework is connected to HBase for fetching and analyzing data. Big data and the three V's of big data are also explained.

6. Thibaud Chardonnens (2013) [23]. This work provides solutions for high-velocity streams; Twitter and Bitly are used as datasets. Both the Twitter and Bitly streams are encoded in JSON format and are linked together. The data contain information about a tweet: date of creation, text of the tweet, location from where the tweet has been posted, eventual hashtags, URLs or user mentions, etc. Along with the information about tweets, the data also contain information about the user who posted the tweet. Because a tweet is limited to 140 characters, posted tweets often contain shortened URLs; Bitly is used as a link shortener, which is why Twitter and Bitly are used together. For the processing part of this thesis, Storm is mainly used as the stream processing system. Use cases have been run in order to analyze the scalability of Storm: 34 GB of data (from both Bitly and Twitter) was loaded into Kafka. The topology has been run three times with the same data and configuration: once with one worker node, once with two worker nodes and once with three worker nodes. The goal was to analyze the performance based on the number of worker nodes. Using three nodes showed better performance than the other configurations.

7. Narayanan et al. (2013) [24]. This paper analyzes sentiment using a publicly available dataset of movie reviews from the Internet Movie Database. The dataset contains 25,000 movie reviews for the training set and 25,000 for the test set. Movie reviews, which contain a wide range of human emotion, are a good case for sentiment analysis. A Naïve Bayes classifier is used to detect the polarity (negative or positive) of a movie review. To improve the accuracy of Naïve Bayes, an N-gram model is used; by using bigrams and trigrams the authors were able to capture information about adverbs and adjectives, and many other methods are used to improve accuracy. This paper shows that a high degree of accuracy can be obtained using the Naïve Bayes model with some enhancements, such as negation handling, feature selection and mutual information, instead of using support vector machines, and that it takes orders of magnitude less time to train when compared to support vector machines. The results show that by applying the original Naïve Bayes algorithm with Laplacian smoothing on the test set, 73.77% accuracy has been achieved.



8. Niles Mouthaan (2012) [25]. The goal of this study is to show how big data analytics can affect an organization's value creation. Two cases have been studied. The first is X FACTOR Heartbeat, a web application showing live statistics about the participants of a music show. Tweets about particular participants have been collected and analyzed, and the sentiment of the tweets appeared to correspond with that of the virtual audience. The main goal of the web application was to involve more people with X FACTOR and its participants; by using big data analytics this goal was achieved, because during the second live show 200,000 pages were viewed and 41,000 unique visitors visited the page. This case clearly indicated that big data analytics is not only about volume. The second case is bol.com, an Internet-based retailer. When a visitor visits bol.com, a large amount of data is collected. Every night these data are analyzed on a Hadoop cluster containing 5 server nodes to identify popular search terms. This case study showed that big data analytics acts as a value creation driver, as it improves the efficiency of the transactions between bol.com and its customers by improving the search activity within the shop, as well as by showing part of the transactions as a whole. It also confirms that big data analytics supports the creation of new or improved services. In 2011 more than 3.4 million visitors visited bol.com. The number of users indicates its popularity, and the increase in sales its value for the organization.

1.9 Thesis Objectives

The objectives of this work can be illustrated in the following points:

1. Building an analytic system to determine and discover the feelings of individuals about different topics through text analysis and text classification.

2. Studying methodologies, tools and techniques in Big Data, such as Apache Hadoop and its characteristics and components like the MapReduce paradigm, Apache Hive and HDFS, in order to implement and use two important machine learning algorithms: the Naïve Bayes algorithm and the feedforward neural network.

3. Applying preprocessing steps to the dataset and comparing the accuracy of the results before and after preprocessing.

4. Applying visualization techniques to the resulting data from Hadoop to compare the results that have been obtained.

5. Comparing the accuracy of Naïve Bayes and the neural network.

1.10 Thesis Layout

The remaining parts of the work are organized as follows:

1. Chapter two presents all tools, technologies and algorithms, covering Hadoop, MapReduce, HDFS, Hortonworks, Hive (HiveQL), HCatalog, NoSQL databases, MongoDB, Python, NLTK, Eclipse, the Naïve Bayes algorithm, the feedforward neural network, and bag of words.

2. Chapter three contains the activity diagrams of the Naïve Bayes, neural network and bag of words algorithms, and the preprocessing steps, explained step by step. Moreover, the implementation of all algorithms inside Hadoop is explained, followed by the overall proposed system.

3. Chapter four explains the results of the two algorithms, illustrates the results for each downloaded Twitter category, and explains the visualization step by step.

4. Chapter five presents the conclusions, suggestions for future work, and the references.


Chapter Two

Big Data Technologies, Tools, Paradigm and Algorithms


2.1 Introduction

As mentioned in chapter one, big data analytics relies on relatively new technologies that have a great impact on almost all areas of our life. In this chapter, all the tools, techniques, technologies, architectures, concepts and paradigms that have been used in this work are explained briefly and the important points are highlighted. Almost all of them are new, modern and state-of-the-art. Hadoop as the heart of the Big Data and cloud-oriented approach, MapReduce as the parallel programming paradigm, MongoDB as the NoSQL database, NLTK as the Python library for natural language processing, Apache Hive as the query language, HCatalog, bag of words, the feedforward neural network and the Naïve Bayes algorithm are explained in the coming sections.

2.2 Hadoop

Hadoop is a top-level Apache project, an open source software framework written in the Java programming language, which allows the distributed processing of massive data sets across sets of servers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures [26]. Hadoop was created by Doug Cutting with the goal of building a distributed computing framework and programming model that provides for easier development of distributed applications. The philosophy is to provide scale-out scalability over large clusters of rather cheap commodity hardware. Its creation was motivated by, and is largely based on, papers published by Google describing some of their internal systems, namely GFS and Google MapReduce [7].


Hadoop is designed to be scalable and can run on small as well as very large installations. Several programming frameworks, including Pig Latin and Hive, allow users to write applications in high-level languages (loosely based on SQL syntax) which compile into MapReduce jobs that are then executed on a Hadoop cluster. Hadoop committers today work at several different organizations such as Hortonworks, Microsoft, Facebook, Cloudera, LinkedIn, Yahoo, eBay and many others around the world [27].

2.2.1 Hadoop Characteristics

1. Scalable: automatic scale up/down. Hadoop heavily relies on a DFS, and hence it comes with the capability of easily adding or removing nodes in the cluster without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.

2. Cost effective: Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per TB of storage, which in turn makes it affordable to model all the data.

3. Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any single system can provide [28].

4. Fault tolerant: this is the ability of a system to stay functional, without interruption and without losing data, even if any of the system components fail. One of the main goals of Hadoop is to be fault tolerant. Since a Hadoop cluster can use thousands of nodes running on commodity hardware, it becomes highly susceptible to failures. Hadoop achieves fault tolerance by data redundancy/replication. It also provides the ability to monitor running tasks and automatically restart a task if it fails.


5. Built-in redundancy: Hadoop essentially duplicates data in blocks across data nodes. For every block, a back-up block of the same data is assured to exist somewhere across the data nodes. The master node keeps track of these nodes and the data mapping. If any node fails, the node where the back-up data block resides takes over, making the infrastructure fail-safe. A conventional RDBMS has the same concerns and uses terms like persistence, backup and recovery; these concerns scale upwards with Big Data.

6. Computational tasks on data residence: computation is moved to the data, so any computational queries are performed where the data resides. This avoids the overhead required to bring the data to the computational environment. Queries are computed in parallel and locally, and the partial results are combined to complete the result set [29].

2.2.2 Hadoop Components

Hadoop consists of two main components: MapReduce, which deals with the computational operations to be applied to the data, and the Hadoop Distributed File System (HDFS), which deals with reliable storage of the data.

a. HDFS

HDFS is a Java-based file system that provides scalable and reliable data storage, designed to span large clusters of commodity servers. HDFS was designed to be a scalable, fault-tolerant, distributed storage system that works closely with MapReduce [30]. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS was originally built as infrastructure for the Apache Nutch web search engine project and is part of the Apache Hadoop Core project [31].

HDFS Architecture:

Figure (2.1) HDFS Architecture [30]
HDFS has a master/slave architecture, as shown in figure (2.1), comprised of a NameNode, which manages the cluster metadata, and DataNodes, which store the data. The master node is called the "NameNode" and the slave nodes are called "DataNodes". HDFS divides the data into fixed-size blocks (chunks) and spreads them across all DataNodes in the cluster. Each data block is typically replicated three times, with two replicas placed within the same rack and one outside. The NameNode keeps track of which DataNodes hold replicas of which block and actively monitors the number of replicas of each block. When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates
another replica of the block. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode [32].

b. MapReduce
MapReduce is the heart of Hadoop: a programming model for parallel processing of tasks on a distributed computing system, and an associated implementation for processing and generating large data sets. This programming model allows a single computation task to be split across multiple nodes or computers for distributed processing. Since a single task can be broken down into multiple subparts, each handled by a separate node, the number of nodes determines the processing power of the system. Because MapReduce is a programming model rather than a particular implementation, MapReduce programs can be written in any programming language [9]. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. When there is a large amount of data, the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code. As a reaction to this complexity, Google designed MapReduce as a new abstraction that allows expressing the simple computations that programmers are trying to perform while hiding
the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library [33].

Programmers find the system easy to use because MapReduce lets them focus on the application logic while the framework automatically handles the messy details such as failures, application deployment, task duplication, and aggregation of results. The MapReduce paradigm has become a popular way of expressing distributed data processing problems that need to deal with large amounts of data. MapReduce is used by a number of organizations worldwide for diverse tasks such as application log processing, user behavior analysis, processing scientific data, and web crawling and indexing. A MapReduce program consists of two user-specified functions, the Map function and the Reduce function, which are explained below:

a. Map Phase
In the Map phase, each mapper reads the raw input record by record, converts it into a key/value pair (k, v), and feeds it to the map function, which then performs a computation on the pair. The Map function operates on each of the pairs in the input and produces intermediate output in the form of new key/value pairs, depending upon how the user has defined the Map function. The output of the map function is then passed to the reduce function as input [34].

b. Reduce Phase
The reduce function applies an aggregate function to its input, merging all intermediate values associated with an intermediate key (e.g., counting or summing values), and stores its output to disk. The output of reduce is also in the form of key-value
pairs. At the end of reduce, the output is sorted according to the keys, and the function for comparing the keys is usually supplied by the user. During the execution of a MapReduce job, the input is first divided into a set of input splits. The system then applies the map function to each of the splits in parallel: it spawns one task for each input split, and the output of the task is stored on disk for transfer to the reduce tasks. The system starts the reduce tasks once all the map tasks have completed successfully. Task or node failures are dealt with by re-launching the tasks. The data given as input to the tasks, and the output they generate, are stored in a distributed file system (HDFS, for instance) to make sure that the output of a task survives failures [29]. One of the simplest MapReduce tasks is counting the words in a set of documents; an example is shown in figure (2.2):

Figure (2.2) Illustration of MapReduce Algorithm [9]

In the given example the number of occurrences of each word in an input text file is calculated. There is a two-node cluster, which means that two mappers are available to share the mapping task. The framework takes the input file, separates the records, and sends them to different nodes or mapping instances so that each record is processed in parallel. The mapping function then splits each record into words and assigns the digit "1" to each word to form a key-value pair for further computation. The intermediate output, in the form of (word, 1) pairs, is sorted and grouped before being passed to the individual reduce nodes, which calculate the frequencies. The resultant output of the sort operation is fed to the reduce function, which sums up the values from the different nodes and generates the final output containing the frequencies.

The above diagram illustrates the pseudo-code below [33]:

Map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

Reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
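To make the data flow concrete, the following short Python sketch simulates the same word-count logic outside Hadoop: a map step that emits (word, 1) pairs, a shuffle step that groups the pairs by key, and a reduce step that sums the counts. It is only an illustration of the paradigm, not Hadoop code.

from collections import defaultdict

def map_phase(documents):
    # Emit an intermediate (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Group intermediate values by key, as the MapReduce runtime would.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the list of counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["hadoop stores big data", "hive queries big data"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'hadoop': 1, 'stores': 1, 'big': 2, 'data': 2, 'hive': 1, 'queries': 1}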

2.3 Hive
Apache Hive is open source data warehouse software built on top of the Hadoop core which provides data summarization, querying, and analysis of datasets [35]. While initially developed by Facebook, Apache Hive is now used and developed by other companies as well. It facilitates querying and managing large datasets residing in distributed storage. Previously it was a subproject of Apache Hadoop, but it has now graduated to become a top-level project of its own. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL. Queries are compiled into MapReduce jobs, which means that queries are automatically translated into MapReduce and then executed on Hadoop [36]. Hive is widely used by organizations such as eBay, Facebook, LinkedIn, and Yahoo to manage and process large volumes of data [37].
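To illustrate how a HiveQL statement is submitted and transparently compiled into MapReduce jobs, the sketch below uses the PyHive client to run a simple aggregate query. The host, port, table and column names are assumptions made for this example only; in this work the queries were issued through the Sandbox query editor rather than a Python client.

from pyhive import hive  # assumes a reachable HiveServer2 instance

connection = hive.connect(host="sandbox.hortonworks.com", port=10000)  # hypothetical host
cursor = connection.cursor()

# Hive compiles this SELECT ... GROUP BY over a hypothetical 'tweets' table
# into one or more MapReduce jobs behind the scenes.
cursor.execute("SELECT sentiment, COUNT(*) FROM tweets GROUP BY sentiment")

for sentiment, tweet_count in cursor.fetchall():
    print(sentiment, tweet_count)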

2.3.1 Hive Characteristics
1. HiveQL supports Data Definition Language (DDL) statements (CREATE, DROP) and Data Manipulation Language (DML) statements (INSERT, INSERT INTO), but UPDATE and DELETE queries are not supported, since Hive is based on a write-once, read-many-times batch processing concept.
2. It supports SQL-like select, project, join, aggregate, union, and sub-query operations.
3. It facilitates easy data extraction, transformation and loading; Hive stores its data in files on HDFS.
4. Hive queries can run as MapReduce without the user having to write custom MapReduce jobs, because HiveQL is implicitly converted into MapReduce jobs; HiveQL also supports complex data structures such as arrays.
5. It provides a mechanism to impose structure on a variety of data formats, which helps to conceptualize data in a structured way.
6. SerDe: an interface that lets Hive know how to process a data record. Data is serialized so that Hive can store it on HDFS or another supported storage format, and deserialized into Java objects so that Hive can manipulate it while processing queries. Normally, the deserializer is used while processing a query such as SELECT, and the serializer is used for queries of type INSERT [29].

2.3.2 HCatalog
HCatalog is a component built on top of Hive: a table and storage management layer for Hadoop that enables users with different data processing tools, such as Pig and MapReduce, to more easily read and write data on the grid. HCatalog's table abstraction presents users with a relational view of data in HDFS and ensures that users need not worry about where or in what format their data is stored. HCatalog supports reading and writing files in any format for which a SerDe can be written. By default, it supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats [38].

2.4 Hortonworks
The Hortonworks Data Platform (HDP) is an open source data management platform for Apache Hadoop. HDP allows users to capture, process, store, manage and analyze data in any format and at any scale. Built and packaged by the core architects, builders and
operators of Hadoop, HDP is making Hadoop more robust and easier to install, manage and use [39]. Hortonworks develops Hortonworks Sandbox, which is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials. It is a self-contained virtual machine with Hadoop pre-configured for development purposes [40]. The Sandbox includes the HDP in an easy to use form. You can add your own datasets, and connect it with your existing tools and applications [39].

Figure (2.3) Hortonworks Data Platform [41]

2.5 NoSQL Databases
The evolution of NoSQL databases began in the late 1990s. Within a few years NoSQL became a serious competitor to the RDBMS, because the RDBMS has its own set of problems when applied to massive amounts of data; the problems are related to efficient processing, effective parallelization, scalability, and cost. Dedicated NoSQL conferences were organized in 2009 and 2010. The NoSQL name was first
used by Carlo Strozzi in 1998 as the name of the database he was developing, and NoSQL adoption today is driven by the scalability requirements of Big Data [42]. NoSQL is not a substitute for traditional relational database management systems; each kind of database suits different needs, which is why each solution must be evaluated for each application. NoSQL databases are very good for applications that deal with large amounts of data and that have a few main entities associated with many secondary entities [43]. NoSQL is an eclectic and increasingly familiar group of non-relational data management systems. It is used to store huge amounts of data, such as the data in Facebook and Twitter, which keeps growing day by day. NoSQL database management systems are useful when working with data whose nature does not require a relational model, when fast information retrieval is needed, and when portability matters. NoSQL databases achieve high performance in a linear, horizontally scalable way. Non-relational databases do not organize their data in related tables (i.e., data is stored in a non-normalized way) and generally do not use SQL for data manipulation. Most NoSQL databases are open source; therefore, everyone can look into the code freely, update it according to their needs and compile it [44]. The main reason for using NoSQL is the era of big data: the amount of data is growing rapidly, and the nature of data is changing as well. Developers keep encountering new data types, most of which are unstructured or semi-structured, and this type of data cannot fit into traditional databases, so many organizations are turning to NoSQL for help.

Figure (2.4) Structured, Unstructured and Semi-Structured Data [45]

As shown in figure (2.4), the amount of unstructured and semi-structured data is growing rapidly. Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a predefined manner. In contrast to unstructured data, structured data is data that can be easily organized. Despite its simplicity, most experts in today's data industry estimate that structured data accounts for only 20% of the available data. It is clean, analytical and usually stored in databases. Semi-structured data is data that is neither raw nor typed in a conventional database system. It is structured data, but it is not organized in a relational model, like a table or an object-based graph. A lot of data found on the Web, such as BibTeX files, can be described as semi-structured [46]. As mentioned before, organizations that collect large amounts of unstructured data are increasingly turning to NoSQL; Google, Amazon, Facebook, Twitter, and LinkedIn all had challenges in dealing with huge quantities of data with a conventional RDBMS. NoSQL systems can support multiple activities, including exploratory and predictive analytics, and are designed to scale to thousands or millions of users doing updates as well as reads [47].

2.5.1 Characteristics of NoSQL
1. Partition and Replication
NoSQL databases allow large-scale data processing (parallel processing over distributed systems). Partitioning means that the data is spread across and managed by different machines; in combination with this, the concept of data replication is used [44]. Replication provides redundancy and increases data availability. With multiple copies of the data on different database servers, replication protects a database from the loss of a single server and allows recovery from hardware failures and service interruptions. With additional copies of the data, one copy can be dedicated to disaster recovery, reporting, or backup. In some cases replication can be used to increase read capacity: clients can send read and write operations to different servers, and copies can be maintained in different data centers to increase the locality and availability of data for distributed applications. This allows NoSQL databases to support a large number of read/write operations [48].

2. Dynamic Schemas
Relational databases require schemas to be defined before you can add data. For example, you might want to store data about your customers such as phone numbers, first and last name, address, city and state; a SQL database needs to know what you are storing in advance. This fits poorly with agile development approaches, because each time you complete new features, the schema of your database often needs to change. So, if you decide, a few iterations into development, that you would like to store customers' favorite items in addition to their addresses and phone numbers, you will need to add that column to the
database, and then migrate the entire database to the new schema. If the database is large, this will be a very slow process which involves significant downtime. If you are frequently changing the data that your application stores – because you are iterating rapidly – this downtime may also be frequent. There is also no way, by using a relational database, to effectively address the data that is completely unstructured or unknown in advance. NoSQL databases are built to allow the insertion of data without a predefined schema. This makes it easy to make significant application changes in real-time, without worrying about service interruptions – which means development is faster, code integration is more reliable, and less database administrator time is needed [49].

3. CAP Theorem
   1. Strong Consistency: all clients see the same version of the data, even on updates to the dataset.
   2. High Availability: all clients can always find at least one copy of the requested data, even if some of the machines in a cluster are down.
   3. Partition Tolerance: the total system keeps its characteristics even when deployed on different servers, transparently to the client.
The CAP theorem postulates that only two of these three aspects of scaling out can be achieved fully at the same time [50].

Figure (2.5) CAP Theorem [51]

4. Horizontal Scaling instead of Vertical Scaling
A key feature of NoSQL systems is their reliance on horizontal scaling, which is based on partitioning data stores across several machines to cope with massive amounts of data. A specific architecture for horizontal scaling is the shared-nothing architecture, in which each machine is independent and self-sufficient and none of the machines share memory or disk storage [40]. NoSQL databases provide an easier, linear, and cost-effective approach to database scaling: as the number of concurrent users grows and more data needs to be stored, you simply add additional low-cost commodity servers to the cluster. There is no need to modify the application, since the application always sees a single (distributed) database.

Horizontal scaling and sharding are in sharp contrast to the vertical scaling approach. Traditional databases use vertical scaling (scale up) instead of horizontal scaling (scale out); vertical scaling relies on enhancing the hardware characteristics (e.g., CPU/memory) of a single machine in order to provide scalability. In order to support more concurrent users and store more data, an RDBMS requires a bigger and more expensive server with more CPUs, memory, and disk storage. At some point, the capacity of even the biggest server is outstripped and the relational database cannot scale further, as shown in figure (2.6) [32][52].

Figure (2.6) Horizontal Scaling and Vertical Scaling [45]

2.5.2 Types of NoSQL Databases
2.5.2.1 Key Value Stores
Key value stores are similar to maps or dictionaries, where data is addressed by a unique key. Values are simple, opaque items, and keys are the only way to retrieve stored data. Values are isolated and independent from each other, so relationships must be handled in application logic. Due to this very simple data structure, key value stores are completely schema free. New values of any kind can be added at
runtime without conflicting with any other stored data and without influencing system availability. Grouping key value pairs into collections is the only offered possibility for adding some kind of structure to the data model [53]. The simplicity of key-value stores makes them ideally suited to the lightning-fast, highly scalable retrieval of values needed for application tasks like managing user profiles or sessions, or retrieving product names. That is why Amazon makes extensive use of its own key-value system, Dynamo, in its shopping cart. Dynamo is a highly available key-value storage system that some of Amazon's core services use as a highly available and scalable distributed data store. Examples of key-value stores: Dynamo (Amazon); Voldemort (LinkedIn); Redis; BerkeleyDB; Riak.

Figure (2.7) Key Value Stores [54]
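As a small illustration of this access pattern, the sketch below uses the redis-py client against Redis, one of the key-value stores listed above; it is an assumption made for the example and not part of the system built in this thesis. Values are written and read purely by key, and any internal structure of the value is left to the application.

import redis  # assumes a local Redis server on the default port

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The key is the only handle to the value; no schema or relationship is enforced.
r.set("user:1001:profile", '{"name": "Alice", "followers": 42}')

print(r.get("user:1001:profile"))  # the stored string
print(r.get("user:9999:profile"))  # None: nothing stored under this key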

2.5.2.2 Document Databases (or Stores)
Document store databases are NoSQL databases that use documents as their records. As their name implies, they are designed to manage and store documents. This type of database stores unstructured (text) or semi-structured (XML) documents
which are usually hierarchical in nature. Document store databases are schema free and not fixed in nature. Every document contains a special key, "ID", which is unique within a collection of documents and therefore identifies a document explicitly. These documents are encoded in a standard data exchange format such as XML, JSON or BSON. Unlike the simple key-value stores described above, the value column in document databases contains semi-structured data, specifically attribute name/value pairs. A single column can house hundreds of such attributes, and the number and type of recorded attributes can vary from row to row. Also, unlike simple key-value stores, both keys and values are fully searchable in document databases. Unlike an RDBMS, where unused fields are kept with null values, a document database has no empty fields, and the system allows information to be added at any time [47]. Examples of document databases are CouchDB (JSON), MongoDB (BSON), Couchbase, and Amazon SimpleDB. In this thesis MongoDB is used as the NoSQL database to store tweets from Twitter.

Figure (2.8) Relational data model and Document data model [47]

2.5.2.3 Column Family Stores
The databases in this category are based on Google's BigTable model, and this is the reason they are also referred to as BigTable clones. They are built for storing and processing very large amounts of data. They are also a form of key-value pairs, but they organize their storage in a semi-schematized and hierarchical pattern. Google's BigTable is: "...a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an un-interpreted array of bytes." Column families are groups of related data (columns) that are often accessed together. The number of column families is virtually unlimited, so it is fairly common to use the column name itself as a piece of data populated at runtime. A row key is mapped to column families, which are mapped to column keys, and this structure is more a two-level map than a table, as shown in figure (2.9).

Column family stores are less schema-flexible than document databases; they are best used for semi-structured data, not for data whose structure changes from row to row. To say it differently, wide-column stores also do well at grouping entities that have high-level characteristics in common but different context-specific attributes [55]. Examples: Cassandra, which was developed at Facebook and is now managed as an Apache project, Hypertable and HBase.

Figure (2.9) Column Family Data Store [56]

2.5.2.4 Graph Databases
Graph databases originate from graph theory, where you have vertices (singular: vertex) and edges connecting the vertices. In databases, vertices are entities such as persons, and edges are relations between entities. These relations are similar to the relations in an RDBMS. The edges can have properties, such as the metric of a communication link, and can also be used to describe the relation between vertices (edges may be directed or undirected). The majority of graph databases
are schema free (unstructured). For storing the context of edges and vertices, directed adjacency lists are used most often. With an adjacency list, each vertex describes which other vertices it has edges to. Nowadays, graph databases can be used, for example, in location-based services to find common friends on social networks or to establish the shortest paths through the daily traffic (with standard algorithms such as Dijkstra's), or, in a more general manner, for the efficient querying of data in a network [44]. Graph database examples: Neo4j; InfoGrid; Sones GraphDB; AllegroGraph; FlockDB.

Figure (2.10) Graph Database [57]

2.6 MongoDB
As mentioned before, MongoDB is used as the NoSQL database in this thesis to store tweets from Twitter. MongoDB is a high-performance, scalable, open source and schema-free document-oriented NoSQL database. It automatically replicates data for scale and high availability, and it supports a rich, ad-hoc
query language of its own. It offers atomic modification for faster data processing using Map/Reduce operations. This database also provides horizontal scalability and has no single point of failure [58]. MongoDB has client support for most programming languages, as shown in figure (2.11):

Figure (2.11) MongoDB Programming Languages [59]
As shown in figure (2.12), a simple example outlines Mongo's data storage construct. MongoDB, as mentioned previously, stores data in a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents. In a relational database, there should always be some way to uniquely identify a given record; otherwise, it becomes impossible to refer to a specific row. To that end, you are supposed to include a field that holds a unique value (called a primary key) or a collection of fields that can uniquely identify the given row (called a
compound primary key). MongoDB requires each document to have a unique identifier for much the same reason. In MongoDB, this identifier is called _id. Unless you specify a value for this field, MongoDB will generate a unique value for you [60].

Figure (2.12) Document in MongoDB [60]

MongoDB stores all documents in collections. A collection is a group of related documents that have a set of shared common indexes. Collections are analogous to a table in relational databases, as shown in figure (2.13):

Figure (2.13) A collection of MongoDB documents [60]

Figure (2.14) shows how to insert documents into a collection.

Figure (2.14) Insertion into Collection [60]
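For illustration only, the following Python/pymongo sketch shows how a document is inserted into a collection and how MongoDB assigns an _id automatically when none is supplied. In this work the tweets were actually inserted by a Java program through the Twitter API, so the client library, database name and field names below are assumptions for the example.

from pymongo import MongoClient  # assumes a MongoDB server on localhost:27017

client = MongoClient("mongodb://localhost:27017/")
tweets = client["thesis"]["tweets"]  # hypothetical database and collection names

# Documents in the same collection may have different fields (dynamic schema).
result = tweets.insert_one({"text": "hadoop makes big data analytics easier", "sentiment": 1})
print(result.inserted_id)  # MongoDB generated the _id automatically

tweets.insert_one({"text": "another tweet", "sentiment": 0, "lang": "en"})  # extra field, no schema change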

The overall structure of MongoDB consists of documents, collections and databases, as shown in figure (2.15).

Figure (2.15) MongoDB Database Model [61]

Choosing the right NoSQL database for an application can be complicated, because not all NoSQL solutions are the same: every solution is optimized for a particular type of workload and use case, and each therefore has its own pros and cons [58]. David Mytton tested most of the NoSQL database systems for a project and explains why he chose MongoDB in his article "Choosing a non-relational database; why we migrated from MySQL to MongoDB". Some of the reasons that led Mytton to choose MongoDB include its ease of installation, easy replication, automated sharding and good documentation [62]. Some of the key features of MongoDB are explained below:

2.6.1 Key MongoDB Features

1. Flexibility
MongoDB stores data in JSON documents (which serialize to BSON). JSON provides a rich data model that seamlessly maps to native programming language types, and the dynamic schema makes it easier to evolve your data model than with a system that enforces schemas, such as an RDBMS, because documents can be updated or changed independently of any other documents. JSON is a very popular data exchange format that provides an easy way of storing and interchanging data, and it is expressive and easy for humans to read and write. MongoDB does not actually use JSON to store the data; rather, it uses an open data format developed by the MongoDB team called BSON. For the most part, using BSON instead of JSON does not change how you work with your data. BSON makes MongoDB even faster by making it much easier for a
computer to process and search documents. BSON also adds a couple of features that are not available in standard JSON, including the ability to add types for handling binary data [59].
2. Power
MongoDB provides a lot of the features of a traditional RDBMS, such as secondary indexes, dynamic queries, sorting, rich updates, upserts (update if a document exists, insert if it does not), and easy aggregation. This gives you the breadth of functionality that you are used to from an RDBMS, with the flexibility and scaling capability that the non-relational model allows.

3. Autosharding
Auto-sharding in MongoDB makes it easy to automatically balance the data distribution over the shards, add more machines and handle failover. Sharding means partitioning collections of data and splitting them across machines called "shards": each portion of a partitioned data store residing on a given machine is a "shard", and the partitioning process is called "sharding". MongoDB offers automated sharding to balance data and handle failover with no user interference, and it uses replica sets so that failover does not lead to data loss [43].

4. Speed/Scaling
By keeping related data together in documents, queries can be much faster than in a relational database, where related data is separated into multiple tables that need to be joined later. MongoDB also makes it easy to scale out your database: autosharding allows you to scale the cluster linearly by adding more machines. It is possible to increase the capacity
without any downtime, which is very important on the web, where the load can increase suddenly and bringing down the website for extended maintenance can cost a business large amounts of revenue.

5. Ease of Use
MongoDB works hard to be very easy to install, configure, maintain, and use. To this end, MongoDB provides few configuration options, and instead tries to automatically do the "right thing" whenever possible. This means that MongoDB works right out of the box, and you can dive right into developing your application instead of spending a lot of time fine-tuning obscure database configurations [63].

2.7 Eclipse
Eclipse is an integrated development environment. It contains a base workspace and an extensible plug-in system for customizing the environment. Written mostly in Java, Eclipse can be used to develop Java applications and, by means of various plug-ins, applications in other programming languages: C, C++, JavaScript, Perl, PHP, Prolog, Python and R. It can also be used to develop packages for the software Mathematica. Development environments include the Eclipse JDT for Java and Scala, Eclipse CDT for C/C++ and Eclipse PDT for PHP, among others [64].

2.8 Naïve Bayes
The Naïve Bayes classifier is a simple model for classification and one of the most important algorithms in sentiment analysis and text classification. It is a probabilistic classifier based on applying Bayes' theorem with strong independence assumptions. It is the simplest form of a Bayesian
network, in which all attributes are independent given the value of the class variable; this is called conditional independence. It assumes that each feature is conditionally independent of the other features given the class. A Naïve Bayes classifier applies to a certain class of problems, namely those that can be phrased as associating an object with a discrete category [16]. The Naïve Bayes model involves a simplifying conditional independence assumption: given a class (positive or negative), the words are conditionally independent of each other. This assumption does not affect the accuracy in text classification much, but it makes really fast classification algorithms applicable to the problem. Among numerical approaches, Naïve Bayes has several advantages: it is simple, fast, highly accurate and relatively effective. It also relies on a very simple representation of the document, which is called the "Bag of Words" [23].

2.8.1 Multinomial Naïve Bayes
Multinomial Naïve Bayes is a specialized version of Naïve Bayes designed for text documents. Whereas simple Naïve Bayes models a document only as the presence or absence of particular words, multinomial Naïve Bayes explicitly models the word counts and adjusts the underlying calculations to deal with them. In this algorithm, Bayes' rule is applied to documents and classes; for each document d and class c:

P(c|d) = \frac{P(d|c)\,P(c)}{P(d)}  ..............................(2.1)

The best class, the maximum a posteriori (MAP) class to assign to the document, is:

C_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(c|d)  ..............................(2.2)

By Bayes' rule:

C_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; \frac{P(d|c)\,P(c)}{P(d)}  ..............................(2.3)

Since P(d) is identical and constant for every class, it is reasonable to eliminate it:

C_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(d|c)\,P(c)  ..............................(2.4)

Now let the document be represented by its whole set of features x_1, x_2, x_3, \dots, x_n. In that case:

C_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(x_1, x_2, x_3, \dots, x_n | c)\,P(c)  ..............................(2.5)

To make the computation less complex, the following simplifying assumptions are made:
1. Bag of words assumption: assume that the position of a word does not matter.
2. Conditional independence: assume that the feature probabilities P(x_i | c_j) are independent given the class c.

Therefore:

P(x_1, x_2, x_3, \dots, x_n | c) = P(x_1|c) \cdot P(x_2|c) \cdot P(x_3|c) \cdots P(x_n|c)  ..............................(2.6)

Substituting equation (2.6) into equation (2.5) gives:

C_{MAP} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j) \prod_{i=1}^{n} P(x_i | c_j)  ..............................(2.7)

But there is a problem with the maximum likelihood estimate: if one of the words in the test set does not exist in the BOW, then P(x_n|c) = 0 and therefore C_{MAP} will equal zero. To solve this problem, classic Laplace (add-1) smoothing for Naïve Bayes is used, giving the following equation:

P(w_i|c_j) = \frac{\operatorname{Count}(w_i,c) + 1}{\sum_{i=1}^{k}\left(\operatorname{Count}(w_i,c) + V\right)}  ..............................(2.8)

where k is the index of the last word in the bag of words. Finally:

P(w_i|c_j) = \frac{\operatorname{Count}(w_i,c) + 1}{\left(\sum_{i=1}^{k}\operatorname{Count}(w_i,c)\right) + V}  ..............................(2.9)

where V is the size of the bag of words [65].
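A minimal Python sketch of equation (2.9), assuming the bag of words is held as a plain dictionary of counts; in this work the same computation is expressed as HiveQL queries over the BOW tables.

def smoothed_prob(word, class_counts, vocab_size):
    # P(word | class) with add-1 (Laplace) smoothing, as in equation (2.9).
    # class_counts maps each word to its count within one class; vocab_size is V.
    total_words_in_class = sum(class_counts.values())
    return (class_counts.get(word, 0) + 1) / (total_words_in_class + vocab_size)

positive_counts = {"love": 3, "great": 2}  # tiny, made-up counts
print(smoothed_prob("love", positive_counts, vocab_size=1000))    # (3 + 1) / (5 + 1000)
print(smoothed_prob("unseen", positive_counts, vocab_size=1000))  # (0 + 1) / (5 + 1000), never zero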

2.8.2 Bag of Words Representation
The intuition behind the bag of words is very simple: instead of using all of the words in a specific document, only a subset of words and their counts are used to represent the document. A function is then built which takes the document and returns a positive or negative class. Whether a subset of the words or all of the words in the document is used, the bag of words representation loses all information about the order of the words in the document; all it keeps about the document is the set of words that occurred and their counts. Figure (2.16) shows an example of a document represented as just such a vector of counts. The classification function takes that representation and assigns a class, positive or negative. It should be mentioned that in some kinds of text classification the entire set of words is used.

Figure (2.16) The Bag of Words Representation [66]
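As a small illustration of this representation, the sketch below builds a word-count vector from a short text with Python's collections.Counter; the order of the words is discarded and only the counts remain.

from collections import Counter

def bag_of_words(text):
    # Represent a document as unordered word counts.
    return Counter(text.lower().split())

print(bag_of_words("great phone great battery but poor camera"))
# Counter({'great': 2, 'phone': 1, 'battery': 1, 'but': 1, 'poor': 1, 'camera': 1})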

2.9 Feed Forward Neural Network
A feed-forward neural network allows signals to travel only one way, from input to output, and the connections between the units do not form a cycle. This is different from recurrent neural networks. In this network the information moves only in one direction, forward, from the input nodes, through the hidden nodes (if any), to the output nodes; there are no cycles or loops in the network [67], and the output of any layer does not affect that same layer. Feed-forward networks tend to be straightforward networks that associate inputs with outputs. This type of organization is also referred to as bottom-up or top-down. Feed-forward neural networks are now well established as tools for many information-processing tasks; regression, function approximation, time series prediction, nonlinear dimension reduction, clustering and classification are examples of their diverse range of applications [68].

2.10 Python
Python is a powerful programming language that allows you to work more quickly and integrate your systems more effectively. Python is easy to learn; it has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python's elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms [69]. Although its primary strength lies in the ease with which it allows a programmer to rapidly prototype a project, its powerful and mature set of standard libraries makes it a great fit for large-scale production-level systems as well. Python has a very shallow learning curve and excellent online learning resources [70].

Python is a general-purpose language widely used in the fields of numeric programming, artificial intelligence, image processing, biology and software engineering. It is known to be easy and able to express complex ideas with clarity, ideas that easily result in errors when the same problem is solved in other languages [69]. As mentioned earlier, Python has a set of powerful libraries which can be used for many purposes.

2.10.1 NLTK
NLTK is the Natural Language Toolkit, a comprehensive Python library for natural language processing and text analytics. Originally designed for teaching, it has been adopted in industry for research and development due to its usefulness and breadth of coverage [71].

Chapter Three Requirement Analysis and Algorithms Implementation for Proposed System

3.1 Introduction
In this chapter the proposed system is presented: Twitter sentiment data has been investigated, refined, analyzed, and visualized as one of the use cases of big data analytics. The applied algorithms that are discussed in this chapter have been implemented in the Hortonworks Sandbox, a personal and portable Hadoop environment installed on a VMware virtual machine. Apache Hive has been used for implementing the algorithms and processing the data inside HDFS. Hive provides HiveQL, the query language in which all of the algorithms are expressed; HiveQL is automatically converted into MapReduce and then executed on Hadoop.

3.2 Requirement Analysis
As mentioned before, sentiment data is unstructured data that represents the opinions, feelings, and attitudes of people, contained in sources such as social media posts, blogs, online product reviews, and customer support interactions. In this study Twitter has been used because it is the most widely used source for sentiment analysis. Twitter is a relationship platform where people can easily post their feelings and emotions through tweets and express reviews about products, services, politics, government, organizations, and business. The work is not limited to a specific topic or category; instead, the system has been designed and implemented in a dynamic way, and tweets about different categories have been collected. The implementation of the different parts of this project has been done in a convenient and efficient way.

The dataset that has been used for the purpose of this work is based on data from the following two sources:
1. University of Michigan Sentiment Analysis competition on Kaggle [72]
2. Twitter Sentiment Corpus by Niek Sanders [73]

While the system can be used for any of the sentiment analytics purposes mentioned above, as an example, public sentiment from all over the world has been examined for the terms "Kurd, Kurdistan, Shengal, Senjar, Kobani, Peshmarga, North of Iraq". This helps to show how public tweets talk about our nation at a particular moment in time, especially after the attacks of ISIS and the possibility of the independence of Kurdistan, and it also helps to track how those opinions change over time. The Naïve Bayes algorithm discussed in the subsequent sections has also been applied to other categories such as iPhone 6, ISIS, Obama and Marriage. The Twitter Sentiment Analysis Dataset contains 1,578,627 classified tweets, sorted alphabetically by tweet content. Each row is marked "1" for positive sentiment and "0" for negative sentiment. Table (3.1) shows a small part of the original dataset. In the first part of the work, after preprocessing the entire dataset, about 30% of the corpus (473,589 tweets) has been used for testing the algorithms, while the remaining 70% (1,105,038 tweets) has been dedicated to the training set used to train the neural network and Naïve Bayes algorithms to classify sentiments. In both algorithms a Bag of Words has been built from the training set, and the MapReduce paradigm and Hadoop have been employed.

Table (3.1) Simple Snapshot of the Original Dataset

Later, the Neural Network algorithm has been applied directly to the test set using an existing Bag of Words, which is a prepared dictionary of positive and negative words. Table (3.2) illustrates a snapshot of the prepared dictionary that has been used by the neural network. In the second part of this work, public tweets have been downloaded directly using a Java program inside Eclipse and the Twitter API, and stored in MongoDB, which is a type of NoSQL database. The downloaded sentiment data has then been refined, analyzed and visualized using Hadoop and Excel Professional 2013, as discussed later in this chapter.

Table (3.2) Snapshot of the prepared Dictionary

3.3 Preprocessing
The activity diagram for the preprocessing stages is shown in figure (3.1). This diagram does not contain the stages that have been done with SQL Server and Notepad++.

Figure (3.1) Activity Diagram for Preprocessing in Python
Preprocessing is the starting point for processing and analyzing the data. In this stage, word replacement has been done using NLTK in Python. Then, using SQL Server 2012, the sentiment records inside the dataset have been reordered randomly. Removing stop words, repeated characters and non-alphabetic characters
are also part of preprocessing. Stemming and lemmatization are a kind of linguistic compression, and word replacement can be thought of as error correction or text normalization. By compressing the vocabulary without losing meaning, memory can be saved in cases such as frequency analysis and text indexing, as well as processing time. It should be mentioned that NLTK is a Python library for natural language processing and text analytics. There are many special techniques for pre-processing text documents to make them suitable for mining, and most of them come from the field of information retrieval (IR). Tokenization is a method of breaking a piece of text up into many pieces, and it is an essential first step for the later stages.

3.3.1 Stop Words
Many of the most frequently used words in English are worthless in IR and text mining. As shown in figure (3.2), these words are called stop words, e.g. "the", "from", "and", "to". Typically about 400 to 500 such words exist in the English language. Stop words need to be removed to reduce the indexing or file size, because they account for 20-30% of total word counts.

Figure (3.2) List of Stop Words in English

3.3.2 Stemming
The technique used to find the root/stem of a word is called "stemming"; for example, consign, consigned and consigning all have consign as their stem. Stemming is a technique for removing suffixes from a word, ending up with the stem. For example, the stem of "cooking" is "cook", and a good stemming algorithm knows that the "ing" suffix can be removed. Stemming improves the effectiveness of IR and text mining by matching similar words; combining words with the same root can reduce the index size by as much as 40-50%. Stemming is most commonly used by search engines for indexing words: instead of storing all forms of a word, a search engine can store only the stems, greatly reducing the size of the index while increasing retrieval accuracy.

3.3.3 Lemmatization
Lemmatization is very similar to stemming, but is more akin to synonym replacement. A lemma is a root word, as opposed to the root stem. So, unlike stemming, you are always left with a valid word which means the same thing; however, the word you end up with can sometimes be completely different. Stemming and lemmatization can be combined to compress words more than either process can by itself. For example, stemmer.stem('believes') returns 'believe', while lemmatizer.lemmatize('believes') returns 'belief'.
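A small NLTK sketch of these two operations; the PorterStemmer and WordNetLemmatizer classes are used here as an assumption for illustration, since the text does not name the exact NLTK classes used, and the WordNet corpus must be downloaded once before the lemmatizer can run.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time download of the WordNet corpus

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["cooking", "believes", "consigned"]:
    # The stemmer strips suffixes; the lemmatizer maps the word to a dictionary form.
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))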

3.3.4 Removing Repeating Characters
Generally, people are not strictly grammatical. They will write things like "I looooooove it" in order to emphasize the word "love", but computers do not know that "looooooove" is a variation of "love" unless they are told. This step removes those repeated characters in order to end up with a "proper" English word.

3.3.5 Removing Non-alphabetic Characters
Sometimes people use numbers or other non-alphabetic characters to express their emotions. In order to focus on words, all non-alphabetic characters have been removed. A short sketch combining these two cleanup steps is given after Table (3.3), which is a snapshot of the dataset showing the changes that have occurred during the preprocessing steps.

Table (3.3) Snapshot of Output of Preprocessing
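A minimal sketch of the cleanup described in sections 3.3.4 and 3.3.5, using regular expressions; the exact rules applied in the thesis pipeline may differ (for instance, in how many repetitions are collapsed), so this is only an illustration.

import re

def remove_repeated_characters(word):
    # Collapse runs of three or more identical characters to a single character.
    return re.sub(r"(\w)\1{2,}", r"\1", word)

def keep_alphabetic(text):
    # Replace every character that is not a letter or a space with a space.
    return re.sub(r"[^a-zA-Z ]", " ", text)

tweet = "I looooooove it!!! 10/10 :)"
cleaned = keep_alphabetic(tweet)
cleaned = " ".join(remove_repeated_characters(w) for w in cleaned.split())
print(cleaned)  # I love it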

3.4 Text Classification
Naïve Bayes and a neural network have been used for text classification in order to determine the positive or negative polarity of sentiments. There are many other machine learning and data mining algorithms, such as K-Nearest Neighbor and Support Vector Machines, but in this work Naïve Bayes and the neural network have been selected. In both algorithms a Bag of Words has been built from the training set; later, the Neural Network algorithm has also been applied directly to the test set using an existing Bag of Words, which is a prepared dictionary of positive and negative words.

3.4.1 Naïve Bayes Algorithm
3.4.1.1 Multinomial Naïve Bayes Algorithm Formulation

Algorithm (3.1): Multinomial Naïve Bayes
 Problem : Find the sentiment polarity for given Tweets
 Inputs  : Tweets
 Outputs : Tweets with sentiment polarities
Begin
1. From the training corpus, extract the Vocabulary /* list of words */
   Calculate the P(c_j) term for each class c_j in C:
      docs_j = all documents with class c_j
      P(c_j) = \frac{|docs_j|}{|\text{total number of documents}|}
2. Calculate the P(w_k|c_j) terms:
   Build the BOW: w_k, count in C_Positive, count in C_Negative
      P(w_k|c_j) = \frac{\operatorname{Count}(w_k,c) + 1}{\sum_{i=1}^{k}\left(\operatorname{Count}(w_i,c) + V\right)}
3. For each tweet compute
      \Gamma(\text{positive}) = P(c_{pos}) \prod_{k=1}^{K} P(w_k|c_{pos})
      \Gamma(\text{negative}) = P(c_{neg}) \prod_{k=1}^{K} P(w_k|c_{neg})
   If \Gamma(\text{positive}) \ge \Gamma(\text{negative}) then the sentence is positive, else negative
End
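To make Algorithm (3.1) concrete, here is a small in-memory Python sketch of the same computation. In the thesis these steps are performed as HiveQL queries over the BOW tables, so the counts below are made up, and the log-space sum (equivalent to the product in the algorithm) is only a standard numerical safeguard.

import math

# Hypothetical bag of words: word -> (count in positive tweets, count in negative tweets)
bow = {"love": (120, 10), "hate": (5, 90), "phone": (40, 35)}

n_pos_tweets, n_neg_tweets = 700000, 400000      # made-up document counts
pos_words = sum(p for p, _ in bow.values())       # total words in positive tweets
neg_words = sum(n for _, n in bow.values())       # total words in negative tweets
vocab = len(bow)                                  # V, the size of the bag of words

def classify(tweet):
    total = n_pos_tweets + n_neg_tweets
    score_pos = math.log(n_pos_tweets / total)    # log P(c_pos)
    score_neg = math.log(n_neg_tweets / total)    # log P(c_neg)
    for word in tweet.split():
        p, n = bow.get(word, (0, 0))
        score_pos += math.log((p + 1) / (pos_words + vocab))   # add-1 smoothing
        score_neg += math.log((n + 1) / (neg_words + vocab))
    return 1 if score_pos >= score_neg else 0

print(classify("love this phone"))  # 1 (positive) with these toy counts
print(classify("hate this phone"))  # 0 (negative)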

3.4.1.2 Implementation of Multinomial Naïve Bayes in Hadoop
The activity diagram for making the bag of words is illustrated in figure (3.3). Two different bags of words have been made in order to analyze the sentiment data: the first has been used for both the Naïve Bayes and Neural Network algorithms, and the second only for the Neural Network algorithm.

Figure (3.3) Activity Diagram to Make Naïve Bayes Bag of Words

After creating the bag of words, the Naïve Bayes algorithm needs to be implemented on Hadoop using Apache Hive, as shown in figure (3.4):

Figure (3.4) Naïve Bayes Algorithm Implementation on Hadoop Using Hive

3.4.1.3 Steps of Creating the Bag of Words and Implementing Naïve Bayes
First, the training set file needs to be uploaded to the Sandbox so that the algorithms can be applied to it. In the Sandbox, the File Browser has been used to upload the training set to Hadoop HDFS, as illustrated in figure (3.5). Secondly, using the Sandbox HCatalog, a structured table has been created from the training set file and the data has been imported to prepare it for the next steps performed by Hadoop using
Apache Hive. Figure (3.6) shows part of the table created for importing the training set into HDFS in Hadoop.

Figure (3.5) Uploading Training Set To Hadoop

Figure (3.6) Training Set Table Made by HCatalog

Each statement in the training set has been tokenized into a table called "ThesisWordByWord", as shown in figure (3.7). The id, its sentiment and the array of words are shown in the figure below:

Figure (3.7) Tokenizing for Training Set

"ThesisSplitWords" is the name of the result of the Hive query that contains the id, its sentiment polarity and each word separately. In this step the tokenized table has been expanded into a new table that holds one word per row, for the purpose of calculating the quantities required by the equations of the algorithm, as shown in figure (3.8).
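The HiveQL used for these tables is not reproduced in the text, so the following Python snippet only sketches the kind of queries involved; the table and column names (training_set, id, sentiment, text) are assumptions. split() tokenizes each tweet into an array of words, and LATERAL VIEW explode() turns that array into one word per row.

# Hypothetical HiveQL for the tokenize and split-words steps; in this work the
# queries were run from the Hive query editor in the Sandbox.
TOKENIZE = """
CREATE TABLE thesiswordbyword AS
SELECT id, sentiment, split(lower(text), ' ') AS words
FROM training_set
"""

SPLIT_WORDS = """
CREATE TABLE thesissplitwords AS
SELECT id, sentiment, word
FROM thesiswordbyword
LATERAL VIEW explode(words) w AS word
"""

for query in (TOKENIZE, SPLIT_WORDS):
    print(query)  # or submit through a Hive client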

Figure (3.8) Splitwords for Training Set
For the purpose of applying the algorithm in the next step, some calculations need to be performed using HiveQL queries, including calculating the total number of tweets and the total numbers of negative and positive tweets. Part of these results is shown in figures (3.9), (3.10), and (3.11).

Figure (3.9) Calculate Total Number of Tweets in Training Set

Figure (3.10) Calculate Total Number of Negative Tweets in Training Set

Figure (3.11) Total Count of Positive Tweets

Using the results from the previous step, the total numbers of documents, negative documents, and positive documents have all been put in one table. Additionally, the probabilities of positive tweets and negative tweets have been held in the Probability table, as shown in figure (3.12):

Figure (3.12) Probability Table

In the next step, four tables have been implemented. The first one holds the occurrence count of each word in positive sentiments; a snapshot of this table is illustrated in figure (3.13). Additionally, the total number of all words in positive sentiments has been computed, as shown in figure (3.14).

Figure (3.13) Count of Each Word in Positive Sentiments

Figure (3.14) Count of All Words in Positive Sentiments
The third table contains information about the count of each word in negative sentiments; a snapshot of this table is illustrated in figure (3.15). Moreover, the total number of all words in negative sentiments has been computed, as shown in figure (3.16).

Figure (3.15) Total Number of Each Word in Negative Sentiments

Figure (3.16) Total Number of All Words in Negative Sentiments
The final step in building the Bag of Words used by both the Naïve Bayes and neural network algorithms is to calculate the occurrences of each word in positive and negative sentiments. Part of this information is shown in figure (3.17).

Figure (3.17) Part of the BOW

Additionally, to make the analysis simpler, a new table has been created, as shown in figure (3.18): "ThesisFinalTableChanged" contains the BOW with five extra attributes needed for applying the algorithm to tweets to identify their polarity. To apply the Naïve Bayes algorithm to the test set, the steps mentioned in the previous section for building the bag of words (uploading the test set, making a table using HCatalog, tokenizing and splitting the words) need to be performed for this part as well. After executing these steps and applying the Naïve Bayes algorithm, the final result has been achieved. Joining the separated-words table with the "ThesisFinalTableChanged" table from the previous section is shown in figure (3.19). The last table, shown in figure (3.20), holds the results of the sentiment data for the test set after applying the Naïve Bayes algorithm; it contains only the id of each statement or tweet and its associated polarity result.

Figure (3.18) BOW Cross-Join Necessary Attributes from Other Tables

Figure (3.19) Part of Query Results: Joining Test Set Words with Final Table

Figure (3.20) Part of Query Results: Sentiment Results for Test Set

3.4.2 Feed Forward Neural Network
The neural network has emerged as an important tool for classification, and the recent vast research activity in neural classification has established that neural networks are a promising alternative to various conventional classification methods. In the proposed system, as shown in figure (3.21), the neural network text classifier contains three layers: the input layer contains n nodes, where n is the size of the bag of words; the hidden layer has two nodes, for the positive and negative polarities; and the output layer shows the 0 or 1 result.

Figure (3.21) Neural Network Architecture

3.4.2.1 Feed Forward Neural Network Algorithm

Algorithm (3.2): Feed Forward Neural Network
 Problem : Find the sentiment polarity for given Tweets
 Inputs  : Tweets
 Outputs : Tweets with sentiment polarities
Begin
1. Calculate WeightP_k and WeightN_k for each word w_k in the training set:
      WeightP_k = \frac{\text{number of occurrences of } w_k \text{ in positive documents}}{\text{number of occurrences of } w_k \text{ in all documents}}
      WeightN_k = \frac{\text{number of occurrences of } w_k \text{ in negative documents}}{\text{number of occurrences of } w_k \text{ in all documents}}
2. Calculate number_k, the count of each word w_k of the test document, for the input layer.
3. Calculate \beta_{positive} = \sum_{i=1}^{k} (number_i \cdot WeightP_i)
4. Calculate \beta_{negative} = \sum_{i=1}^{k} (number_i \cdot WeightN_i)
5. Calculate the activation functions
      X_p = \frac{1}{1 + \exp(-\beta_{positive})}
      X_n = \frac{1}{1 + \exp(-\beta_{negative})}
6. The same calculation is computed for the output layer. If the value of the output layer is equal to or larger than 0.5, the document is classified as positive; otherwise it is classified as negative.
End
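A compact Python sketch of Algorithm (3.2) with made-up word weights; in the thesis these steps are carried out as HiveQL joins and aggregations over the BOW tables. The final decision here compares the two sigmoid activations, which is one way of reading step 6.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical per-word weights learned from the training set:
# weight_p = occurrences in positive docs / occurrences in all docs, weight_n likewise.
weights = {"love": (0.92, 0.08), "hate": (0.06, 0.94), "phone": (0.55, 0.45)}

def classify(tweet):
    counts = {}
    for word in tweet.split():                   # input layer: word counts
        counts[word] = counts.get(word, 0) + 1
    beta_pos = sum(n * weights[w][0] for w, n in counts.items() if w in weights)
    beta_neg = sum(n * weights[w][1] for w, n in counts.items() if w in weights)
    x_p, x_n = sigmoid(beta_pos), sigmoid(beta_neg)   # hidden layer activations
    return 1 if x_p >= x_n else 0                     # output decision

print(classify("love love this phone"))  # 1 (positive) with these toy weights
print(classify("hate this phone"))       # 0 (negative)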

3.4.2.2 Implementation of Feed Forward Neural Network in Hadoop
The activity diagram for computing the positive and negative weights of each word, using the prepared BOW and the training set, is shown in figure (3.22). Figure (3.23) clarifies the steps and operations used to implement the Feed Forward Neural Network in Hadoop.

Figure (3.22) Activity Diagram to Compute Positive and Negative Weights

Figure (3.23) Implementation of Neural Network on Hadoop using Hive

3.4.2.3 Steps of Computing Weights and Implementing the Feed Forward Neural Network Algorithm
In this section, the initial steps (uploading the preprocessed training set to HDFS, creating a table using HCatalog, tokenizing the statements and separating the words) are the same as those explained for the Naïve Bayes implementation. Moreover,

the BOW has already been built in the previous section, so only the most important tables are illustrated here. Figure (3.24) shows part of the word weights computed from the training set and the BOW prepared in the previous section.

Figure (3.24) Part of Positive and Negative Weights
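The weight computation itself is a per-word ratio. The sketch below, written in Python for readability, produces the same quantities that the Hive queries derive from the training set; the input layout (a list of (tokens, label) pairs) is an assumption for illustration.

```python
from collections import Counter

def compute_word_weights(training_tweets):
    """Compute WeightP / WeightN for every word in a labeled training set.

    training_tweets: iterable of (tokens, label) pairs, label in {"positive", "negative"}.
    Returns two dicts giving, for each word, the share of its occurrences that
    fall in positive (resp. negative) tweets.
    """
    pos_counts, neg_counts = Counter(), Counter()
    for tokens, label in training_tweets:
        target = pos_counts if label == "positive" else neg_counts
        target.update(tokens)

    weight_pos, weight_neg = {}, {}
    for w in set(pos_counts) | set(neg_counts):
        total = pos_counts[w] + neg_counts[w]
        weight_pos[w] = pos_counts[w] / total
        weight_neg[w] = neg_counts[w] / total
    return weight_pos, weight_neg
```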

After uploading the test set, building a table using Hadoop HCatalog, tokenizing and splitting the words, the number of occurrences of each word inside every tweet has to be counted. Figure (3.25) illustrates part of this counting, which provides the values fed into the hidden layer.


Figure (3.25) Count of each Word in each Tweet

In the next step of the implementation, the input layer of the neural network has been prepared: a left outer join combines the BOW, with its positive and negative rates, with the “TnnCountofwords” table, giving the input layer together with the word names and the weights of the neural algorithm, as shown in figure (3.26). Then sum(number*pos) and sum(number*neg) are computed from the input layer as the inputs of the hidden layer. The id refers to the unique sentiment (tweet), and the other two attributes represent the summation of the word counts multiplied by the positive and the negative rates, respectively. This table is derived from the “TnnInputLayerweights” table by the summation and group-by operations, and the results are shown in figure (3.27).
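The following small pandas sketch mirrors this join-and-aggregate step on toy data. The frames and most column names are illustrative stand-ins for the Hive tables (only sumhiddeninputpos and sumhiddeninputneg follow the names used in the text), and the join is shown from the count table's side for clarity.

```python
import pandas as pd

# Hypothetical mirrors of the Hive tables described above
bow = pd.DataFrame({"word": ["good", "bad"], "pos": [0.9, 0.1], "neg": [0.1, 0.9]})
counts = pd.DataFrame({"id": [1, 1, 2], "word": ["good", "bad", "bad"], "number": [2, 1, 3]})

# Outer join of the BOW weights onto the per-tweet word counts (input layer)
input_layer = counts.merge(bow, on="word", how="left").fillna(0)

# Hidden-layer inputs: per-tweet sums of number*pos and number*neg
hidden = (input_layer
          .assign(numpos=lambda d: d["number"] * d["pos"],
                  numneg=lambda d: d["number"] * d["neg"])
          .groupby("id")[["numpos", "numneg"]].sum()
          .rename(columns={"numpos": "sumhiddeninputpos",
                           "numneg": "sumhiddeninputneg"}))
print(hidden)
```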


Figure (3.26) Input Layer with the Weights of the Words

Figure (3.27) Summation of Positive and Negative Inputs of the Hidden Layer

Then, the activation function has been computed for the “sumhiddeninputpos” and “sumhiddeninputneg” values of both hidden nodes, using the equation given in the activity diagram; the results are shown in figure (3.28), and figure (3.29) shows the results of the output layer of the neural network implementation.


Figure (3.28) Activation Function of Hidden Positive and Hidden Negative

Figure (3.29) Output Layer


3.4.3 Neural Network Algorithm for the Ready Dictionary

The last text-classification algorithm is based on the same neural network algorithm, but uses a premade dictionary containing more than 6,000 positive and negative words. This means the weights are simply 1 and 0: a positive word has a positive weight of 1 and a negative weight of 0, and vice versa for a negative word. The algorithm therefore follows the same steps as the second algorithm, except that no weights need to be computed. To save space, and because of this similarity, the figures for this algorithm have been omitted here.

Algorithm (3.3): Existing BOW
 Problem : Find the sentiment polarity for given Tweets
 Inputs : Tweets
 Outputs : Tweets with sentiment polarities
Begin
1. Normalize the dictionary using the preprocessing algorithm.
2. For each word $W_k$ in the bag of words, calculate $number_k$, the count of the word in the given document, for the input layer.
3. According to the dictionary, $WeightP_k$ and $WeightN_k$ are either 1 or 0, as classified in the dictionary.
4. Calculate $\beta_{positive} = \sum_{i=1}^{k} number_i \cdot WeightP_i$
5. Calculate $\beta_{negative} = \sum_{i=1}^{k} number_i \cdot WeightN_i$
6. Calculate the activation functions
   $X_p = \dfrac{1}{1+\exp(-\beta_{positive})}$ ,  $X_n = \dfrac{1}{1+\exp(-\beta_{negative})}$
7. The same calculation is computed for the output layer. If the value of the output layer is equal to or larger than 0.5, the document is classified as positive; otherwise it is classified as negative.
End
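Since the ready dictionary only labels words, its weights can be generated directly and then fed to the same feed-forward pass sketched earlier; the short function below is a hypothetical illustration of that mapping, not the exact Hive implementation.

```python
def dictionary_weights(positive_words, negative_words):
    """Build the 1/0 weights of Algorithm (3.3) from a ready-made dictionary."""
    weight_pos = {w: 1.0 for w in positive_words}
    weight_pos.update({w: 0.0 for w in negative_words})
    weight_neg = {w: 0.0 for w in positive_words}
    weight_neg.update({w: 1.0 for w in negative_words})
    return weight_pos, weight_neg   # usable with the classify_tweet() sketch above
```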

Figure (3.30) shows the same algorithm applied to the available dictionary; Python was used for preprocessing the ready dictionary. The dictionary file has to be uploaded and a table created using HCatalog; “ReadyDicProcess” is the name of the uploaded table. In this case, the neural network weight of a positive-polarity word is 1 for the first hidden node and 0 for the second node, and vice versa for negative-polarity words.

Figure (3.30) Algorithm of Ready Dictionary using Neural Network


3.5 Overall Proposed System

Figure (3.31) Overview of System Implementation


Finally, the activity diagram of the proposed system is illustrated in figure (3.31). The diagram shows, step by step, the tasks that need to be performed to accomplish the goals: downloading the tweets of interest directly from the Twitter API using a Java program (tweets about any kind of category can be downloaded; here, over 50,000 tweets about Kurds, IPhone 6, ISIS, Obama and Marriage have been collected, about 10,000 per category), storing them in the MongoDB NoSQL database, preprocessing them using the Python NLTK library, applying the Naïve Bayes algorithm discussed in the preceding sections to each category inside the Hortonworks Hadoop environment using Apache Hive, and finally analyzing and visualizing the results with Excel Professional 2013 Power View.
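The exact preprocessing steps are not listed at this point, so the sketch below shows a typical NLTK-based cleaning pass of the kind described (lower-casing, stripping URLs and mentions, tokenizing, and removing stop words). The regular expression and the choice of filters are assumptions for illustration, and the NLTK "punkt" and "stopwords" resources must be downloaded beforehand.

```python
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))   # requires nltk.download("stopwords")

def preprocess(tweet_text):
    """Hypothetical cleaning pass for one raw tweet."""
    text = tweet_text.lower()
    text = re.sub(r"https?://\S+|@\w+|#", " ", text)   # strip URLs, mentions, hash signs
    tokens = word_tokenize(text)                        # requires nltk.download("punkt")
    return [t for t in tokens if t.isalpha() and t not in STOPWORDS]
```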

3.6 Steps of Analytics

Twitter processes half a billion tweets per day, and the maximum size of a tweet is 140 characters. For experimentation purposes, Twitter provides these tweets to the public: initially through the REST API, and later through OAuth-authenticated access. The Twitter Streaming API makes this possible and provides real-time streaming tweets at about 1% of Twitter's load. On average, Twitter therefore receives about 5,787 tweets per second (500,000,000 tweets / (24 hours * 60 minutes * 60 seconds)), and the Streaming API sends about 58 tweets per second. Figure (3.32) shows part of the Java program, running in Eclipse, that directly downloads the tweets from the Twitter Streaming API and stores them in the MongoDB NoSQL database.


Figure (3.32) Downloading Tweets from Twitter and Store in MongoDB

MongoDB receives the tweets that contain {“Kurd”, “Kurdistan”, “North of Iraq”, “Peshmerga”, “Kobani”, “Shangal”, “Sinjar”} and stores them in BSON format. Additionally, monjaDB, a MongoDB GUI client tool for rapid application development that runs inside the Eclipse environment, has been used to provide a straightforward way of viewing and updating MongoDB documents. Figure (3.33) illustrates part of the downloaded tweets in monjaDB in JSON format.
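As a small illustration of this storage step, the following Python/pymongo sketch inserts one already-parsed tweet into a collection. The connection string, the database and collection names, and the example document fields are illustrative stand-ins rather than the exact layout used by the Java downloader.

```python
from pymongo import MongoClient

# Hypothetical connection and collection names
client = MongoClient("mongodb://localhost:27017/")
collection = client["twitterdb"]["Kurd"]

tweet = {"id_str": "0",
         "text": "example tweet about Kurdistan",
         "created_at": "Mon Dec 01 00:00:00 +0000 2014",
         "user": {"time_zone": "Baghdad"}}

collection.insert_one(tweet)   # MongoDB stores the document internally as BSON
```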


Figure (3.33) monjaDB

More than 50,000 tweets have been downloaded, stored inside MongoDB and then uploaded to Hortonworks Hadoop. The three algorithms explained above have been tested on the data set and their accuracies calculated. Based on the obtained results, the Naïve Bayes algorithm has been selected for analyzing the downloaded tweets, because it has the highest accuracy among the discussed algorithms; the results are discussed in chapter four. The downloaded tweets are stored inside MongoDB in JSON format.


As the first step, all the downloaded tweets have been exported from MongoDB to CSV files to prepare them for uploading to HDFS. This is done with mongoexport.exe (mongoexport --db twitterdb --collection Obama --csv --fieldFile fields.txt --out d:/Obama.csv). It should be mentioned that JSON files can be used directly inside Hadoop, but for simplicity the JSON data has been converted into CSV format. The downloaded tweets carry a time-zone property, which is essential for analysis and visualization, but not the related country. Therefore, the tweets table, which contains the time zone, has to be joined with a mapping table that holds both the time zone and the country, such as the one shown below; through this join the country field is obtained easily.

Table (3.4) Time Zone Map


The activity diagram for the entire process and all the required steps is shown in more detail in figure (3.34), excluding the tokenization and word-splitting steps, which are similar to those implemented in the previous sections.

Figure (3.34) Mapping Time Zone


After the files have been successfully downloaded from Twitter and saved in MongoDB, they are exported to CSV files and the text preprocessing steps are applied to them. Later, the preprocessed files are uploaded to the HDFS of Hadoop for analytic purposes. As illustrated in figure (3.35), the preprocessed tweets have been uploaded into the Hortonworks Sandbox using the File Browser.

Figure (3.35) Uploading the Preprocessed Tweets into HDFS


After the tweets have been successfully uploaded, the Kurdish table has been created using HCatalog inside the Sandbox, as shown in figure (3.36). This table includes all the information needed for sentiment analysis; the most important fields are the created time and date, the sentiment text, and the time zone of the user. Using the time zone, the related country can easily be found. As mentioned before, it is necessary to join this table with the table that contains the time zones and countries, as shown in figure (3.37).

Figure (3.36) Create Kurdish Table using HCatalog


Figure (3.37) Time Zone Map

After determining which country each tweet comes from, the “KurdishTokenize” table, shown in figure (3.38), contains the tokenized tweets. For the analytics process, some techniques and Hive commands are needed to tokenize all the statements. In this step, each statement is converted into an array of tokens, to be prepared for the next step of splitting the words, as shown in figure (3.39). This task is essential because, after all, the rates of the individual words ultimately decide the positive or negative polarity of a tweet; therefore, for each tweet, all the available tokens need to be separated. The “KuridshSeperateWords” table contains the id and the associated tokens: each tweet is represented by an id, and all the words inside a specific tweet take the same id in this new table.
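Conceptually, this tokenize-then-split step turns one row per tweet into one row per (id, word) pair. The pandas sketch below mirrors that reshaping with toy data; the frame and column names are illustrative, not the Hive schema.

```python
import pandas as pd

# Hypothetical mirror of the tokenize-then-split step
tweets = pd.DataFrame({"id": [1, 2],
                       "text": ["peshmerga advance in kobani", "news about sinjar"]})

tokenized = tweets.assign(tokens=tweets["text"].str.split())          # array of tokens per tweet
separated = (tokenized.explode("tokens")                              # one row per (id, word)
                      .rename(columns={"tokens": "word"})[["id", "word"]])
print(separated)
```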


Figure (3.38) Tokenized Kurdish Table

Figure (3.39) Separated Words with their associated ids


Figure (3.40) shows that for each token, which belongs to a specific id, the required rates have been computed by joining the separated words with the bag of words obtained earlier. All the fields required for applying the Naïve Bayes algorithm are included in this table, and the steps are the same as in the algorithm of section (3.4.1). The resulting table, shown in figure (3.41), includes the id, the sentiment and the time_zone. To visualize this table, it must be joined with the TimeZone table to obtain the country attribute as well. Finally, as shown in figure (3.42), the “Kurdishresult” table has been joined with the TimeZone table so that each tweet carries both its sentiment polarity, negative or positive, and the name of the country it comes from.

Figure (3.40) Joining the separated words with Bag of Words


Figure (3.41) Kurdishresult Table

Figure (3.42) KurdishResultJoin Table Including Country Field


Chapter Four

Algorithms Results, Sentiment Analytics Results and Visualization

4.1 Introduction

In this chapter, the challenges and the performance accuracy are discussed, and the visualization of the proposed system is shown. The system has been designed and implemented in a dynamic way so that it can perform sentiment analysis for different categories. While the system can be used for any of the sentiment-analytics purposes mentioned before, it is also capable of running its operations much faster than usual implementations of the algorithms, because the MapReduce paradigm and Hadoop have been employed to implement all of them.

4.2 Challenges and their Solutions

1. Using a mathematical trick to implement the parts of the formulas that have no direct implementation. HiveQL has no function to calculate $\prod_{k=1}^{K} P(W_k|C_j)$, which is part of the Naïve Bayes steps. This challenge has been tackled using a mathematical trick: EXP(SUM(LN(Field1*Field2))) gives the same result, because a product of positive factors equals the exponential of the sum of their logarithms (a small sketch of this identity follows the list below).

2. Reordering the dataset using SQL Server 2012. As mentioned before, the dataset is sorted alphabetically by sentiment statement, so it is not reasonable to take the first 70% of the dataset as the training set and the remaining 30% as the test set, because many similar and identical words would then be grouped into either the training set or the test set. On the other hand, usual tools like Microsoft Excel cannot load more than one million tweets. Therefore, to tackle these challenges, the whole dataset has been imported into SQL Server 2012 and stored in a table. Then, using DDL, a new decimal field has been added to the table and a random value assigned to each record. After that, SQL commands reordered all the records by this new field; 70% of the records were selected for the training set and the rest for the test set, and each file was exported to CSV to be delivered to the other preprocessing stages. Table (4.1) shows the resulting table after the mentioned operations.

Table (4.1) Snapshot of Preprocessing Using SQL Server 2012

3. The next challenge appeared when the preprocessing of all the tweets had to be carried out. Since the number of tweets exceeds one million, Python, like Microsoft Excel, could not finish preprocessing the entire dataset in one pass and failed to complete. To tackle this challenge, Notepad++ has been used: the training set was divided into two parts of 700,000 tweets, each part was preprocessed separately, and Notepad++ then merged both parts again.
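Referring back to challenge 1, the short Python sketch below verifies the identity that the HiveQL expression relies on; the probability values are made up purely for the demonstration.

```python
import math
from functools import reduce

# A product of positive factors equals exp of the sum of their logs,
# which is what EXP(SUM(LN(...))) computes in HiveQL.
probs = [0.9, 0.2, 0.7, 0.5]                      # illustrative P(Wk|Cj) values

direct_product = reduce(lambda a, b: a * b, probs)
via_log_sum = math.exp(sum(math.log(p) for p in probs))

assert abs(direct_product - via_log_sum) < 1e-12
print(direct_product, via_log_sum)
```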

4.3 Accuracy Comparison of Algorithms

In this study, three different algorithms have been applied to the test set, and part of the results is shown in table (4.2), together with part of the actual polarities of the tweets and the predicted polarities. Performance metrics are used for the analysis of classifier accuracy, and the proposed system is evaluated using the accuracy parameter. The overall classification accuracy on the test set of 473,589 tweets is 76.80% after preprocessing with the NLTK library for Python, while without the preprocessing step the accuracy is about 2 percent lower. The preprocessed files have only 12.5% of the initial size, which helped decrease the processing complexity and increase the classification accuracy.

Table (4.2) Snapshot of the Results of the Three Different Algorithms


In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results, and r is the number of correct positive results divided by the number of positive results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. For each algorithm, Precision, Recall and F1 have been computed and used for the accuracy comparison of the three algorithms. To calculate the Precision, Recall and F-Score, the following equations are used:

$Precision = \dfrac{\text{True Positive}}{\text{Number of Predicted Positives}} = \dfrac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$   .....(4.1)

$Recall = \dfrac{\text{True Positive}}{\text{Number of Actual Positives}} = \dfrac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$   .....(4.2)

$F\text{-}Score = \dfrac{2 \cdot Precision \cdot Recall}{Precision + Recall}$   .....(4.3)

A confusion matrix is a visualization method typically used in supervised learning. Each column of the matrix represents the instances of an actual (expected) class, while each row represents the instances of a predicted class. One benefit of a confusion matrix is that it is easy to see whether the system is confusing two classes. When a data set is unbalanced (when the numbers of samples in the different classes vary greatly), the error rate of a classifier is not representative of its true performance. The diagonal of the confusion matrix indicates the correct classifications, and the remaining values indicate the incorrect classifications. Table (4.3) shows the confusion matrix, which displays the number of correct and incorrect predictions made by the model compared with the actual classifications in the test data; since there are two classes, the matrix is 2-by-2.

Table (4.3) Confusion matrix

                        Actual Class 1      Actual Class 0
Predicted Class 1       True Positive       False Positive
Predicted Class 0       False Negative      True Negative

Table (4.4) shows the confusion matrix for the Naïve Bayes algorithm, table (4.5) shows the confusion matrix for the neural network, and table (4.6) shows the confusion matrix for the existing BOW algorithm:

Table (4.4) Confusion matrix for the Naïve Bayes algorithm

                        Actual Tweet 1      Actual Tweet 0
Predicted Tweet 1       175380              50308
Predicted Tweet 0       61131               186770
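As a quick check, equations (4.1)-(4.3) can be evaluated directly on the counts in Table (4.4); the short Python sketch below reproduces the Naïve Bayes row of Table (4.7).

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score per equations (4.1)-(4.3)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Naïve Bayes counts from Table (4.4): TP=175380, FP=50308, FN=61131
print(prf(175380, 50308, 61131))   # approx (0.777, 0.742, 0.76), matching Table (4.7)
```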


Table (4.5) Confusion matrix for the Neural Network algorithm

                        Actual Tweet 1      Actual Tweet 0
Predicted Tweet 1       164899              45862
Predicted Tweet 0       71612               191216

Table (4.6) Confusion matrix for the Existing BOW algorithm

                        Actual Tweet 1      Actual Tweet 0
Predicted Tweet 1       213901              175708
Predicted Tweet 0       22610               61370

Finally, the accuracy comparison of all algorithms is illustrated in table (4.7). In the experiment, the sentiment analysis using the Naive Bayes classifier obtained 76% on the test set, the neural network classifier obtained 74%, and the existing bag of words (ready dictionary) obtained 68%.

Table (4.7) Accuracy comparison among Naïve Bayes, Neural Network and Existing BOW

Algorithm                    Precision        Recall           F Score
Naïve Bayes Algorithm        0.777090497      0.741529992      0.76
Neural Network Algorithm     0.782398072      0.697214929      0.74
Existing Bag of Words        0.549014525      0.904401909      0.68


4.4 Experiment Sentiment Analysis Results

As mentioned before, sentiment analysis can be used in many different areas: organizations use it to understand how the public feels about something at a particular moment in time, and also to track how those opinions change over time. Key words from different areas have been selected for downloading tweets, which were then analyzed using the discussed tools, technologies and techniques. Table (4.8) shows the information about the categories selected for this work. Globally, over 50,000 tweets have been downloaded, refined, processed and visualized; for each category, about 10,000 tweets have been downloaded, preprocessed, refined, analyzed and visualized. All selected tweets are in English and come from the period 01/12/2014 to 15/01/2015. The results for all the mentioned categories are shown in table (4.8): more than half of the tweets express a positive feeling about Kurds, while 70% of the downloaded tweets hold a positive feeling about ISIS. As discussed in the next sections, these ratios change from one country to another; for example, the tweets coming from Iraq about ISIS carry an extraordinarily negative feeling, whereas the tweets coming from the USA hold a 70% positive feeling about ISIS.


Table (4.8) Sentiment Analysis Results

Row  Language  Category     Key Words                                                                                         Total tweets  Positive %  Negative %
1    En        Competitors  {"IPhone 6"}                                                                                      10341         19%         81%
2    En        Reputation   {"Obama"}                                                                                         10628         72%         28%
3    En        Reputation   {"ISIS"}                                                                                          10242         70%         30%
4    En        Social       {"Marriage"}                                                                                      10389         69%         31%
5    En        Social       {"Kurd", "Kurdistan", "North of Iraq", "Kobani", "Kurdish", "Peshmerga", "Shangal", "Sinjar"}     10051         53%         47%

Figure (4.1) Chart of Analyzed Results for the Proposed System


4.5 Access the Refined Sentiment Data and Visualization using Microsoft Excel

To analyze and visualize the results of the sentiment data for each category, Microsoft Excel Professional Plus 2013 has been used. The ODBC driver needs to be installed to connect directly to Hortonworks Hadoop and import data from it. The tweets about “Kurdish” are visualized, discussed and displayed in this chapter: table (4.9) shows the data imported from Hadoop into Excel Power View, and from that table the country and its sentiment polarity have been selected to analyze and visualize the data for Kurds, as shown in table (4.10).

Table (4.9) Imported Data from Hadoop


Table (4.10) Countries and Their associated Sentiment About Kurd

The map view displays a global view of the data as part of the visualization tasks. As illustrated in figure (4.2), all countries that had tweets about Kurds are indicated by a blue circle on the worldwide map; as an example, the USA, which has 1,696 tweets about Kurds, has been zoomed in. Figure (4.3) illustrates the sentiment about Kurds around the world, with the map displaying the sentiment data by color: red for positive and blue for negative. As an example, when the map controls are used to zoom in on the USA, more than half of the tweets show a positive sentiment score, as indicated by the color red, while the tweets coming from Russia hold a more negative feeling about Kurds and their situation.


Figure (4.2) Map

Figure (4.3) Visualization of Sentiment Data about Kurd around the World


4.6 Sentiment Analysis Statistics from the USA

The total number of tweets about Kurds from the US is 1,696: about 850 tweets hold a negative feeling about Kurds, while 846 are positive. Table (4.11), figure (4.4) and figure (4.5) contain interesting statistics. The feelings about Obama and about ISIS from the point of view of people in the United States are the same: both hold 69% positive and 31% negative polarity. The feelings about “marriage” in Iraq and in the USA are almost equal, since in the USA it holds 66% positive polarity and in Iraq it possesses 69%.

Table (4.11) The Statistics from the United States

Row  Language  Category     Key Words                                                                                         Total tweets  Positive %  Negative %
1    En        Competitors  {"IPhone 6"}                                                                                      3564          19%         81%
2    En        Reputation   {"Obama"}                                                                                         4,289         69%         31%
3    En        Reputation   {"ISIS"}                                                                                          2,569         69%         31%
4    En        Social       {"Marriage"}                                                                                      3,068         66%         34%
5    En        Social       {"Kurd", "Kurdistan", "North of Iraq", "Kobani", "Kurdish", "Peshmerga", "Shangal", "Sinjar"}     1696          49%         51%


Figure (4.4) Visualization of the Tweets that Comes from USA

Figure (4.5) Results Chart of Sentiments of USA


4.7 Sentiment Analysis Statistics from Iraq

Finally, the tweets that come from Iraq have been investigated and the results visualized. As shown in figure (4.6), the tweets from Iraq with a positive polarity about Kurds outnumber the negative ones: in total, 734 of the downloaded tweets come from Iraq, of which about 168 hold a positive and 153 a negative polarity about Kurds. Table (4.12) and its related diagram show the statistics of all tweets about the different categories of interest. The most downloaded tweets are about Kurds, with 321 tweets, and the least downloaded are about “Obama”, with only 27 tweets. The tweets that contain the word “ISIS” possess 83% negative polarity, the most negative feeling among all the categories, while the tweets that contain “Obama” have the most positive polarity. Figure (4.7) shows the chart of the sentiment results.

Table (4.12) The Statistics from Iraq

Row  Language  Category     Key Words                                                                                         Total tweets  Positive %  Negative %
1    En        Competitors  {"IPhone 6"}                                                                                      28            25%         75%
2    En        Reputation   {"Obama"}                                                                                         27            85%         15%
3    En        Reputation   {"ISIS"}                                                                                          291           17%         83%
4    En        Social       {"Marriage"}                                                                                      67            70%         30%
5    En        Social       {"Kurd", "Kurdistan", "North of Iraq", "Kobani", "Kurdish", "Peshmerga", "Shangal", "Sinjar"}     321           52%         48%


Figure (4.6) Visualization of the Tweets that Comes from Iraq

Figure (4.7) Results Chart of Sentiments of Iraq


Chapter Five

Conclusions and Suggestions for Future Work


5.1 Conclusions

The conclusions that can be drawn from this work are specified below:

1) Naïve Bayes and a Feed Forward Neural Network have been applied to an available data set containing more than 1,500,000 tweets, and for both algorithms a specific Bag of Words has been obtained. Then, by using this Bag of Words and applying both algorithms to the test set, interesting results have been achieved. Moreover, a ready-to-use Bag of Words has been downloaded and used in the same way with the Neural Network algorithm; the results obtained with it were not as good as with the Bag of Words built from the data set. In other words, the Bag of Words built in this work performs considerably better than the ready-to-use one: applying the same algorithm to the test set, the proposed system's Bag of Words showed a 14% better accuracy performance. Naïve Bayes has the highest F-Score of all, with 0.76; the Neural Network has the second highest, with an F-Score of 0.74; and the existing Bag of Words has the lowest, with an F-Score of 0.68.

2) Multinomial Naïve Bayes also shows better accuracy than the others: almost 3% better than the Neural Network with the same Bag of Words, and more than 17% better than the ready-to-use Bag of Words.

3) Preprocessing improved the accuracy by about 2%. Moreover, it decreased the total size of the files by about 25%; this reduction makes applying the algorithms much more efficient and convenient.

4) MongoDB is used as the NoSQL database in this thesis to store the semi-structured tweets coming from Twitter in JSON format. It is a high-performance, scalable, open-source and schema-free document-oriented NoSQL database


which supports automatic data replication for scale and offers high availability; this database also provides horizontal scalability and has no single point of failure.

5) In the proposed system, sentiment analysis has been investigated for five different key words, namely “Obama”, “ISIS”, “IPhone 6”, “Marriage” and “Kurd”, by applying the Naïve Bayes algorithm to more than 50,000 downloaded tweets, and interesting statistics have been obtained. The first statistic is the public feeling about each key word: for IPhone 6, 19% of the tweets were positive and 81% negative, while for Obama it was 72% positive and 28% negative; tweets about ISIS convey 70% positive and 30% negative feeling; regarding Marriage, 69% of the tweets are positive and the rest negative; finally, tweets about Kurds around the world are 53% positive and the rest negative. The next statistic contains the sentiment results for all categories from the United States. The total number of tweets about IPhone 6 was 3,564, of which about 19% were positive and 81% negative. At the same time, 4,289 tweets were about Obama, of which 69% convey a positive and 31% a negative feeling. The total number of tweets about ISIS was 2,569, with positive and negative feelings identical to Obama's results. Moreover, 3,068 of the downloaded tweets are about Marriage, 66% of them positive and 34% negative. Finally, the total number of tweets about Kurds downloaded from the USA was 1,696, containing 49% positive and 51% negative feelings.


The last statistic contains the sentiment results from Iraq. In total, 734 of the downloaded tweets come from Iraq. Almost 25% of the tweets about IPhone 6 are positive, while 75% are negative; about 85% of the tweets about Obama were positive and 15% negative; 17% of the tweets convey a positive feeling about ISIS and the rest are negative; at the same time, positive tweets about Marriage were 70% and the rest negative. The total number of tweets about Kurds coming from Iraq was 321, of which about 52% were positive and 48% negative.

6) Visualization of the different results has been made using Excel Professional Power View 2013. Using the visualization, the total number of tweets and the total numbers of positive and negative tweets can easily be determined for each country by looking at the worldwide map.

5.2 Suggestions for Future Work

Many suggestions can be put forward, either to improve this work or for related work, in the following areas:
1. Using Amazon or Google cloud-based, real multi-cluster Hadoop environments for big data analytics instead of a single-cluster environment like Hortonworks Hadoop.
2. Using other machine learning algorithms, such as SVM, for sentiment analysis.
3. Combining the Naïve Bayes and FFNN algorithms to increase the accuracy.
4. Using the proposed algorithms, tools and techniques for spam detection and other text classification purposes.
5. Real-time analytics and data processing using Apache Storm.



