The Role of Missing Data Imputation Methods on the Accuracy of Data Mining Results

A Thesis Submitted to the Council of the College of Administration and Economics, University of Sulaimani, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Statistics

By

Zhyan Mohammed Omer

Supervised by
Assistant Professor

Dr. Nzar Abdulqader Ali

2016 AD

2716 Kurdish

1437 H

‫بسم اهلل الرمحن الرحيم‬ ‫﴿‬

‫قَالُوا سُبْحَنَكَ ال عِلْمَ لَنَا إِال‬ ‫مَا عَلَّمْتَنَا إِنَّكَ أَنْتَ الْعَلِيمُ‬ ‫الْحَكِيمُ‬ ‫صدق اهلل العظيم‬ ‫سورة البقرة‬ ‫اآليت‪32 :‬‬

‫﴾‬

Dedication

This thesis is dedicated to:
My dear father, God bless him
My dear mother
My dear sisters
My dear brother
With love ........

Zhyan M. Omer

Linguistic Evaluation Certification

This is to certify that I, Rangi Shorsh Rauf, have proofread this thesis entitled "The Role of Missing Data Imputation Methods on the Accuracy of Data Mining Results" by Zhyan Mohammed Omer. After the mistakes were marked and corrected, the thesis was returned to the researcher to apply the corrections in this final copy.

Signature:
Proofreader: Rangi Shorsh Rauf
Title: Assistant Lecturer
Date:

/ 10 / 2016

Department of English, School of Languages, Faculty of Humanities, University of Sulaimani.

Supervisor's Certification

I certify that the preparation of the thesis entitled "The Role of Missing Data Imputation Methods on the Accuracy of Data Mining Results", accomplished by Zhyan Mohammed Omer, was carried out under my supervision at the University of Sulaimani, College of Administration and Economics, Statistics Department, as partial fulfillment of the requirements for the degree of Master of Science in Statistics.

Signature:
Supervisor: Dr. Nzar Abdulqader Ali
Title: Assistant Professor
Date:

/ 10 / 2016

Chairman's Certification

In view of the available recommendations, I forward this thesis for debate by the examining committee.

Signature:
Name: Dr. Samira M. Salih
Title: Assistant Professor
Higher Education Committee
Date:

/ 10 / 2016

Examination Committee Certification

We, the examining committee, certify that we have read this thesis entitled "The Role of Missing Data Imputation Methods on the Accuracy of Data Mining Results", have examined the student (Zhyan Mohammed Omer) in its contents, and that in our opinion it is adequate as a thesis for the degree of Master of Science in Statistics.

Signature:
Name: Dr. Nawzad M. Ahmad
Title: Assistant Professor
Committee Head
Date:     / 10 / 2016

Signature:
Name: Dr. Mohammed M. Faqe
Title: Assistant Professor
Member
Date:     / 10 / 2016

Signature:
Name: Dr. Soran A. Bkr Mohammed
Title: Assistant Professor
Member
Date:     / 10 / 2016

Signature:
Name: Dr. Nzar Abdulqader Ali
Title: Assistant Professor
Member / Supervisor
Date:     / 10 / 2016

Signature:
Name: Dr. Kawa M. Jamal Rashid
Title: Assistant Professor
Dean of the College of Administration and Economics
Date:

/ 10 / 2016

Acknowledgements

First of all, I would like to thank Allah for helping me to accomplish my study. I want to express my warmest thanks and gratitude to my supervisor, Assist. Prof. Dr. Nzar Abdulqader Ali, for his invaluable advice, guidance, support and suggestions throughout my study. I would like to extend my thanks to the Dean of the College of Administration and Economics, Assist. Prof. Dr. Kawa Mohammed Jamal Rashid, the Head of the Statistics Department, Assist. Prof. Dr. Mohammed Mahmmud Faqe, and the Administrator of the Higher Education Unit, Assist. Prof. Dr. Samira Mohammed Salih. Special thanks go to all the teachers who taught and helped me during the course of my study. I also want to thank all the staff of the library, the Higher Education Unit and the Statistics Department, and all the people who helped me to complete this thesis. Finally, I want to express my heartfelt gratitude to my parents for their passionate love, care and help, without whom I could not have overcome the difficulties, especially my father, who always supported the decisions I made in my life and my interest in science.

Abstract

Raw data in the real world is dirty. Every large data repository contains various types of anomalous values that influence the results of analysis. Since in data mining good models usually need good data, input data must be provided in the amount, structure and format that suit each data mining (DM) task. Unfortunately, real-world databases are strongly affected by negative factors such as the presence of noise, missing values (MVs), inconsistent and superfluous data, and huge sizes in both dimensions (examples and features); these problems cause analyses to perform poorly. Thus, low-quality data will lead to low-quality data mining performance. In large data repositories, data preprocessing is very important for dealing with such problems. Data preprocessing comprises several tasks, such as data cleaning, data integration, data transformation, data reduction and data discretization. Missing data is a common drawback in many real-world data sets. Missing data refers to unobserved values in a data set, which can be of different types and may be missing for different reasons, including unit nonresponse, item nonresponse, dropout, human error, equipment failure, and latent classes. The presence of such imperfections typically requires a preprocessing stage in which the data is arranged and cleaned, with the specific goal of making it useful for and sufficiently clear to the knowledge extraction process.

Analysts distinguish three classes of missing data: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR).


There are various procedures for dealing with missing values. The most straightforward solution is the reduction of the data set and the elimination of all samples with missing values. Another solution is the tolerance method. Finally, the missing values problem can be handled by various imputation methods; thus, imputation has recently become one of the most popular solutions for dealing with missing data. In this thesis, we propose an algorithm that improves the MIGEC algorithm in the way it performs imputation of missing values. We apply grey relational analysis (GRA) to attribute values instead of instance values; the missing data are initially imputed by mean imputation, and each value estimated by our proposed algorithm (PA) is then used as a complete value when imputing the next missing value.

We compare our proposed algorithm with several other algorithms, namely Mean Mode Substitution (MMS), Hot Deck Imputation (HDI), K-Nearest Neighbour Imputation with Mutual Information (KNNMI), Fuzzy C-Mean based on Optimal Completion Strategy (FCMOCS), Clustering-based Random Imputation (CRI), Clustering-based Multiple Imputation (CMI), Non-Parametric Iterative Imputation (NIIA) and Multiple Imputation using Gray System Theory and Entropy based on Clustering (MIGEC), under different missingness mechanisms. Experimental results demonstrate that the proposed algorithm has lower RMSE values than the other algorithms under all missingness mechanisms.

List of Contents

Acknowledgements ..... I
Abstract ..... II
List of Contents ..... IV
List of Figures ..... VII
List of Tables ..... X
List of Abbreviations ..... XII

Chapter 1: Introduction, Literature Review and The Aim of The Study
1.1 Introduction ..... 1
1.2 Literature Review ..... 6
1.3 The Aim of The Study ..... 13
1.4 The Layout of The Thesis ..... 13

Chapter 2: Theoretical Part
2.1 Introduction ..... 14
2.2 Missing Value ..... 20
2.2.1 Missing Data Mechanisms ..... 21
2.2.2 Methods for Handling Incomplete Data ..... 23
2.3 Simulation ..... 27
2.4 Clustering ..... 28
2.4.1 Fuzzy C-Mean Clustering ..... 29
2.5 Grey System Theory ..... 31
2.5.1 Grey Relational Analysis ..... 31
2.6 Classification ..... 35
2.6.1 Decision Tree ..... 37
2.6.2 Entropy and Information Gain ..... 38
2.7 Hybrid Imputation ..... 40
2.8 Proposed Algorithm ..... 40
2.8.1 Steps of Algorithm ..... 43
2.9 The Framework of Proposed Algorithm ..... 48

Chapter 3: Practical Part
3.1 Introduction ..... 50
3.2 Dataset ..... 50
3.2.1 Wine Data Set ..... 50
3.2.2 Simulated Data ..... 52
3.3 Generating Missingness ..... 53
3.4 Performance Measure ..... 56
3.5 Optimality Measure ..... 57
3.5.1 Number of Iterations ..... 57
3.5.2 Number of Clusters ..... 59
3.6 Comparative Experiments ..... 62
3.7 Results of Proposed Algorithm for Simulated Data ..... 67
3.7.1 Number of Iterations ..... 67
3.7.2 Number of Clusters ..... 69
3.8 Comparison between Wine and Simulated Dataset ..... 72
3.9 Comparing Algorithms by Number of Iterations ..... 76

Chapter 4: Conclusions and Future Work
4.1 Conclusions ..... 77
4.2 Future Work ..... 78

References ..... 80
Appendixes ..... 88
الخلاصة (Abstract in Arabic)
پوخته (Abstract in Kurdish)

List of Figures

Figure No.  Figure Name  Page No.
1.1  KDD process ..... 3
1.2  DM methods ..... 5
2.1  Forms of Data Preprocessing ..... 19
2.2  Data set with MVs denoted with a symbol '?' ..... 21
2.3  Difference between Hard and Soft Clustering ..... 29
2.4  General Approach for Building a Classification Model ..... 36
2.5  Decision Tree Example ..... 38
2.6  Entropy in the case of two possibilities with probabilities p and (1-p) ..... 40
2.7  The Flowchart of the Proposed Algorithm ..... 42
3.1  Sample of Wine dataset ..... 51
3.2  Sample of simulated dataset ..... 53
3.3  Sample of Wine dataset missed by MCAR ..... 54
3.4  Sample of Wine dataset missed by MAR ..... 55
3.5  Sample of Wine dataset missed by NMAR ..... 56
3.6  Checking Optimality by number of Iterations (for MCAR) ..... 57
3.7  Checking Optimality by number of Iterations (for MAR) ..... 58
3.8  Checking Optimality by number of Iterations (for NMAR) ..... 59
3.9  Checking Optimality by number of clusters (for MCAR) ..... 60
3.10 Checking Optimality by number of clusters (for MAR) ..... 61
3.11 Checking Optimality by number of clusters (for NMAR) ..... 62
3.12 Comparison between proposed algorithm and other methods for imputation (MCAR) ..... 65
3.13 Comparison between proposed algorithm and other methods for imputation (MAR) ..... 66
3.14 Comparison between proposed algorithm and other methods for imputation (NMAR) ..... 66
3.15 Checking Optimality by number of Iterations (for MCAR), simulated data ..... 67
3.16 Checking Optimality by number of Iterations (for MAR), simulated data ..... 68
3.17 Checking Optimality by number of Iterations (for NMAR), simulated data ..... 69
3.18 Checking Optimality by number of clusters (for MCAR), simulated data ..... 70
3.19 Checking Optimality by number of clusters (for MAR), simulated data ..... 70
3.20 Checking Optimality by number of clusters (for NMAR), simulated data ..... 71
3.21 Result of proposed algorithm for simulated data under different mechanisms with varying missing rates ..... 72
3.22 Comparison between wine and simulated dataset for various iterations (MCAR) ..... 73
3.23 Comparison between wine and simulated dataset for various iterations (MAR) ..... 73
3.24 Comparison between wine and simulated dataset for various iterations (NMAR) ..... 73
3.25 Comparison between wine and simulated dataset for various clusters (MCAR) ..... 74
3.26 Comparison between wine and simulated dataset for various clusters (MAR) ..... 74
3.27 Comparison between wine and simulated dataset for various clusters (NMAR) ..... 74

List of Tables

Table No.  Table Name  Page No.
2.1  Listwise Deletion ..... 24
2.2  Pairwise Deletion ..... 24
3.1  Information about Wine dataset ..... 51
3.2  Mean and Standard deviation of Wine data ..... 53
3.3  Checking Optimality by number of Iterations (for MCAR) ..... 57
3.4  Checking Optimality by number of Iterations (for MAR) ..... 58
3.5  Checking Optimality by number of Iterations (for NMAR) ..... 59
3.6  Checking Optimality by number of clusters (for MCAR) ..... 60
3.7  Checking Optimality by number of clusters (for MAR) ..... 60
3.8  Checking Optimality by number of clusters (for NMAR) ..... 61
3.9  Comparison between proposed algorithm and other methods for imputation (MCAR) ..... 63
3.10 Difference between proposed algorithm and other algorithms for MCAR ..... 63
3.11 Comparison between proposed algorithm and other methods for imputation (MAR) ..... 63
3.12 Difference between proposed algorithm and other algorithms for MAR ..... 64
3.13 Comparison between proposed algorithm and other methods for imputation (NMAR) ..... 64
3.14 Difference between proposed algorithm and other algorithms for NMAR ..... 64
3.15 Checking Optimality by number of Iterations (for MCAR), simulated data ..... 67
3.16 Checking Optimality by number of Iterations (for MAR), simulated data ..... 68
3.17 Checking Optimality by number of Iterations (for NMAR), simulated data ..... 68
3.18 Checking Optimality by number of clusters (for MCAR), simulated data ..... 69
3.19 Checking Optimality by number of clusters (for MAR), simulated data ..... 70
3.20 Checking Optimality by number of clusters (for NMAR), simulated data ..... 71
3.21 Result of proposed algorithm for simulated data under different mechanisms with varying missing rates ..... 72
3.22 Difference between RMSE of Simulated and Wine dataset ..... 75
3.23 Comparing number of iterations between proposed algorithm and existing algorithms ..... 76

List of Abbreviations

CDI: Cold Deck Imputation
CMI: Clustering-based Multiple Imputation
CRI: Clustering-based Random Imputation
DM: Data Mining
EM: Expectation Maximization
FCM: Fuzzy C-Mean Clustering
FCMOCS: Fuzzy C-Mean based on Optimal Completion Strategy
GRA: Grey Relational Analysis
GST: Grey System Theory
HDI: Hot Deck Imputation
KDD: Knowledge Discovery in Databases
KNNMI: K-Nearest Neighbour Imputation with Mutual Information
MAR: Missing at Random
MCAR: Missing Completely at Random
MIGEC: Multiple Imputation algorithm using Gray System Theory and Entropy based on Clustering
MMS: Mean Mode Substitution
MV: Missing Value
NIIA: Non-Parametric Iterative Imputation
NMAR: Not Missing at Random
NRMSE: Normalized Root Mean Square Error
PA: Proposed Algorithm
RMSE: Root Mean Square Error
UCI: University of California, Irvine

Chapter One Introduction, Literature Review and The Aim of The Study

1.1 Introduction

Data mining (DM), also popularly known as Knowledge Discovery in Databases (KDD), is "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" [69].

Vast amounts of data are around us in our world, and raw data is largely intractable for human or manual applications, so the analysis of such data is now a necessity. The World Wide Web (WWW), business-related services, society, and applications and networks for science or engineering, among others, have been continuously generating data at an exponential rate since the development of powerful storage and connection tools. This immense data growth does not easily allow useful information or organized knowledge to be understood or extracted automatically. This fact has led to the emergence of Data Mining (DM), which is currently a well-known discipline increasingly present in the Information Age. DM is about solving problems by analyzing data present in real databases. Nowadays, it is regarded as the science and technology of exploring data to discover previously unknown patterns. Many people treat DM as a synonym of the Knowledge Discovery in Databases (KDD) process, while others view DM as the main step of KDD [8] [11] [16].

The KDD process is divided into six different stages, according to the agreement of several important researchers [18]:

1. Problem Specification: designating and arranging the application domain, the relevant prior knowledge obtained from experts and the final objectives pursued by the end-user.
2. Problem Understanding: comprehension of both the selected data and the associated expert knowledge, in order to achieve a high degree of reliability.
3. Data Preprocessing: this stage includes operations for data cleaning (such as handling the removal of noise and inconsistent data), data integration (where multiple data sources may be combined into one), data transformation (where data is transformed and consolidated into forms appropriate for specific DM tasks or aggregation operations), data reduction (including the selection and extraction of features and examples) and the imputation of missing data.
4. Data Mining: the essential process in which methods are used to extract valid data patterns. This step includes the choice of the most suitable DM task (such as classification, regression, clustering or association), the choice of the DM algorithm itself, belonging to one of the previous families, and finally the employment and adaptation of the selected algorithm to the problem, by tuning essential parameters and validation procedures.
5. Evaluation: estimating and interpreting the mined patterns based on interestingness measures.
6. Result Exploitation: the last stage may involve using the knowledge directly, incorporating the knowledge into another system for further processing, or simply reporting the discovered knowledge through visualization tools.


Figure (1.1) summarizes the KDD process and shows the six stages mentioned previously. It is worth mentioning that all the stages are interconnected, showing that the KDD process is actually a self-organized scheme in which each stage conditions the remaining stages, and the reverse path is also allowed.

Figure (1.1): KDD process [18]


The objective of data mining is both prediction and description: to predict unknown or future values of the attributes of interest using other attributes in the database, while describing the data in a manner understandable and interpretable by human beings. Predicting the sales of a new product based on advertising expenditure, or predicting wind velocities as a function of temperature, humidity, air pressure, etc., are examples of tasks with a predictive goal in data mining. Describing the different terrain groupings that emerge in a sampling of satellite imagery is an example of a descriptive goal for a data mining task. The relative importance of description and prediction can vary between different applications. These two goals can be fulfilled by any of a number of data mining tasks (techniques), including classification, regression, clustering, summarization, dependency modeling, and deviation detection [11] [20].

A large number of DM techniques are well known and used in many applications. Figure (1.2) shows a division of the main DM methods according to the two ways of obtaining knowledge: prediction and description. Within the prediction family of methods, two main groups can be distinguished: statistical methods and symbolic methods [6]. Statistical methods are usually characterized by representing knowledge through mathematical models with computations. In contrast, symbolic methods prefer to represent knowledge by means of symbols and connectives, yielding models that are more interpretable for humans.

Figure (1.2): DM Methods [18]

The next important step concerns the data to be used. Input data must be provided in the amount, structure and format that suit each DM task. It is unrealistic to expect that data will be perfect after they have been extracted. Since good models usually need good data, a thorough cleansing of the data is an important step in improving the quality of data mining methods. Not only the correctness but also the consistency of values is important. Missing data can be a particularly pernicious problem: especially when the number of missing values is large, not all attributes (instances) with missing values can be deleted from the sample. Moreover, the fact that a value is missing may be significant in itself. A widely applied approach is to calculate a substitute value for missing data,


for example, the median or mean of a variable. Furthermore, several data mining approaches (especially many clustering approaches, but also some learning methods) can be modified to ignore missing values [5] [22] .

There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise, correct inconsistencies in the data and fill in missing values. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse. Data transformations, such as normalization, may be applied; for example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements. Data reduction can reduce the data size by aggregating, eliminating redundant features, or clustering, for instance. These techniques are not mutually exclusive; they may work together. For example, data cleaning can involve transformations to correct wrong data, such as transforming all entries for a date field to a common format. Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and/or the time required for the actual mining [10].
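As a small illustrative sketch (not drawn from this thesis; the column names and values are invented), these steps can be chained in Python with pandas as follows:

```python
import pandas as pd

# Toy data with a missing value, a duplicate row and inconsistent date strings.
df = pd.DataFrame({"date":   ["2016-01-05", "2016/01/05", "2016-02-10", "2016-02-10"],
                   "sales":  [120.0, None, 95.0, 95.0],
                   "region": ["north", "north", "south", "south"]})

# Data cleaning: bring all dates to a common format and fill the missing sales value.
df["date"] = df["date"].apply(pd.to_datetime)
df["sales"] = df["sales"].fillna(df["sales"].mean())

# Data reduction: remove exact duplicate records.
df = df.drop_duplicates()

# Data transformation: min-max normalize sales into [0, 1].
df["sales_norm"] = (df["sales"] - df["sales"].min()) / (df["sales"].max() - df["sales"].min())
print(df)
```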

1.2 Literature Review

M. D. Zio, U. Guarnera, O. Luzi (2007) [52] proposed a technique based on finite mixtures of multivariate Gaussian distributions for handling missing data. The main reason is that it allows control of the trade-off between parsimony and flexibility. An experimental comparison with the widely used nearest neighbor donor imputation is illustrated.

C. Y. J. Peng, J. Zhu (2008) [32] sought to fill this void by comparing two approaches for handling missing data in categorical covariates in logistic regression: the expectation-maximization (EM) method of weights and multiple imputation (MI). Missing data on covariates are simulated under two conditions, missing completely at random and missing at random, with different missing rates. A logistic regression model was fit to each sample using either the EM or the MI approach. The performance of the two approaches is compared on four criteria: bias, efficiency, coverage, and rejection rate. Results generally favored MI over EM.

S. Zhang, J. Zhang, X. Zhu, Y. Qin, C. Zhang (2008) [67] proposed an efficient nonparametric missing value imputation method based on clustering, called CMI (Clustering-based Missing value Imputation), for dealing with missing values in target attributes. In their approach, the missing values of an instance A are imputed with plausible values generated, via a kernel-based method, from the data in the instances that contain no missing values and are most similar to A. Specifically, the dataset (including the instances with missing values) is first divided into clusters; the missing values of an instance A are then patched up with plausible values generated from A's cluster. Extensive experiments show the effectiveness of the proposed method in the missing value imputation task.

P. J. G. Laencina, J. L. S. Gómez, A. R. F. Vidal, M. Verleysen (2009) [59] proposed a novel KNN imputation procedure using a feature-weighted distance metric based on mutual information (MI). This method provides a missing data estimation aimed at solving the classification task, i.e., it provides an imputed dataset directed toward improving classification performance. The MI-based distance metric is also used to implement an effective KNN classifier.

A. D. Nuovo (2011) [24] introduced the most common solutions to this problem offered by the most popular statistical software (such as case deletion) and a completion technique based on the most famous fuzzy clustering algorithm, Fuzzy C-Means (FCM), and then compared these methodologies in order to highlight the peculiar characteristics of each solution. The comparison was made in a psychological research environment, using a database of in-patients with a diagnosis of mental retardation. The results demonstrate that completion techniques, and in particular the one based on FCM, lead to effective data imputation, avoiding the deletion of elements with missing data, which diminishes the power of the research.

N. Ankaiah, V. Ravi (2011) [56] proposed a novel two-stage soft computing approach for data imputation, involving local learning and global approximation in tandem. In stage 1, the K-means algorithm is used to replace the missing values with cluster centers. Stage 2 refines the resultant approximate values using a multilayer perceptron (MLP), trained on the complete data with the attribute having missing values as the target variable, one attribute at a time. In all datasets, some randomly removed values are treated as missing. The actual and predicted values are compared using the Mean Absolute Percentage Error (MAPE). They observe that the MAPE value is reduced from stage 1 to stage 2, indicating that the hybrid approach yields better imputation than stage 1 alone.

J. Tian, B. Yu, D. Yu, Sh. Ma (2012) [46] focused on a fuzzy clustering approach for missing value imputation with noisy data immunity. The PCFKMI (Pre-Clustering based Fuzzy K-Means Imputation) method aggregates data instances into more accurate clusters for further appropriate estimation via information entropy, after resampling, pre-clustering and an outlier test. Their experimental results demonstrate that PCFKMI obtains higher precision on both quantitative and nominal attribute missing value completion than other classic methods, under all missingness mechanisms, at varying missing rates and with abnormal values.

S. Gajawada, D. Toshniwal (2012) [62] proposed a missing value imputation method based on K-Means and nearest neighbours that uses imputed objects for further imputations. The method was applied to clinical datasets from the UCI Machine Learning Repository. Their results show that the proposed method performed better than the simple method (which does not use imputed values for further imputations), but not for all datasets, as errors in earlier imputations may propagate to later ones.

M. M. Rahman, D. N. Davis (2013) [55] explored the use of machine learning techniques as missing value imputation methods for incomplete cardiovascular data. Mean/mode imputation, fuzzy unordered rule induction algorithm imputation, decision tree imputation and other machine learning algorithms are used for missing value imputation, and the final datasets are classified using decision trees, fuzzy unordered rule induction, KNN and K-Means clustering. The experiments show that the final classifier performance is improved when the fuzzy unordered rule induction algorithm is used to predict missing attribute values for K-Means clustering, and that in most cases the machine learning techniques perform better than mean imputation.

Z. Chi, F. H. Cai, J. Kai, Y. Ting (2013) [73], in order to improve the accuracy of filling in missing data, proposed a nearest neighbour filling algorithm based on cluster analysis. After clustering the data, the algorithm assigns weights according to the categories and improves the calculation formula and the filling value calculation based on the MGNN (Mahalanobis-Gray and Nearest Neighbor) algorithm. Their experimental results show that the filling accuracy of the method is higher than that of the traditional KNN algorithm and the MGNN algorithm.

F. C. S. Liu (2014) [36] applied three multiple imputation (MI) methods (Amelia II, MICE, and mi) to reconstruct the distribution of vote choice in a sample; vote choice is one of the most important dependent variables in political science studies. The paper shows how the multiple imputation procedure facilitates the work of reconstructing the distribution of a targeted variable and, in particular, how MI can be applied to point estimation in descriptive statistics.

G. Sang, K. Shi, Zh. Liu, L. Gao (2014) [39] proposed a new weighted KNN data filling algorithm based on grey correlation analysis (GBWKNN) by studying nearest neighbour based filling of missing data; it combines grey system theory with the advantages of the K nearest neighbour algorithm. Experimental results on six UCI data sets showed that their filling algorithm is superior to the KNN algorithm and to the filling algorithm proposed by Huang and Lee [27].

H. Li, C. Zhao, F. Shao, G. Zheng Li, X. Wang (2014) [40] proposed a novel hybrid imputation method, called Recursive Mutual Imputation (RMI). Specifically, RMI exploits global correlation information and local structure in the data, captured by two popular methods, Bayesian Principal Component Analysis (BPCA) and Local Least Squares (LLS), respectively. The mutual strategy is implemented by sharing the estimated data sequences at each recursive step, and the imputation sequence is based on the number of missing entries in the target gene. Furthermore, a weight-based integration method is utilized in the final assembling step. Their proposed hybrid imputation approach incorporates both global and local information of microarray genes, achieving lower NRMSE values than any single approach alone.

J. Tian, B. Yu, D. Yu, Sh. Ma (2014) [47] proposed a hybrid missing data completion method named Multiple Imputation using Gray-system-theory and Entropy based on Clustering (MIGEC). Firstly, the non-missing data instances are separated into several clusters. Then, the imputed value is obtained after multiple calculations by utilizing the information entropy of the proximal category for each incomplete instance, in terms of a similarity metric based on Gray System Theory (GST).

Minakshi, R. Vohra, Gimpy (2014) [51] used three techniques to impute missing values: listwise deletion, mean/mode imputation, and KNN (k nearest neighbour). The C4.5/J48 classification algorithm is applied to the imputed datasets, and the performance of the imputation methods is analyzed on the basis of classification accuracy in order to decide which imputation method best handles missing values. Their experimental results show that the accuracy of KNN is greater than that of the other two techniques.

X. Y. Zhou, J. S. Lim (2014) [71] studied a new method, the NB-EM (Naïve Bayesian-Expectation Maximization) algorithm, for handling missing values. Its performance is compared with the traditional EM (Expectation Maximization) method and non-substitution approaches for datasets containing randomly missing values. Compared with the traditional EM algorithm, the NB-EM algorithm has a higher accuracy rate, which suggests that it can handle missing values better in practice.

O. B. Shukur, M. H. Lee (2015) [57] proposed a hybrid artificial neural network (ANN) and autoregressive (AR) method for imputing missing values. ANN is a nonlinear method capable of imputing missing values in wind speed data with nonlinear characteristics, while the AR model is used to determine the structure of the ANN input layer. Listwise deletion is used before AR modeling to handle the missing values. A case study is carried out using daily Iraqi and Malaysian wind speed data. The proposed imputation method is compared with linear, nearest neighbour, and state space methods; the comparison shows that AR-ANN outperforms the classical methods. In conclusion, missing values in wind speed data with nonlinear characteristics can be imputed more accurately using AR-ANN, leading to more accurate time series modeling and analysis.

1.3 The Aim of The Study

The aim of this study is to investigate the impact of data preprocessing on the accuracy of DM models; imputing missing data is one of the steps of data cleaning in data preprocessing. In this thesis, an algorithm for dealing with missing values is proposed, based on the existing MIGEC algorithm proposed by J. Tian, B. Yu, D. Yu and Sh. Ma.

1.4 The Layout of The Thesis

The rest of this thesis is organized as follows:
Chapter Two first discusses data preprocessing in data mining, then missing values and how to deal with them; at the end of the chapter, the proposed algorithm used to impute missing values is explained in detail.
Chapter Three presents the experimental results for the proposed algorithm, makes comparisons between different algorithms, and discusses the results in detail.
Chapter Four contains the main conclusions along with recommendations for future work.

Chapter Two Theoretical Part

2.1 Introduction

Data pre-processing is an important and critical step in the data mining process, and it has a huge impact on the success of a data mining project. Data pre-processing is the step of the Knowledge Discovery in Databases (KDD) process that reduces the complexity of the data and offers better conditions for subsequent analysis. Through it, the nature of the data is better understood and the data analysis is performed more accurately and efficiently [10]. Data preprocessing is composed of several steps, depicted in Figure (2.1):

1. Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data [2].
a. Missing Values: missing data refers to unobserved values in a data set, which can be of different types and may be missing for different reasons. These reasons can include unit nonresponse, item nonresponse, dropout, human error, equipment failure, and latent classes [21].
b. Noisy Data: noise is a random error or variance in a measured variable. Noise identification, known as smoothing in data transformation, has as its main objective the detection of random errors or variances in a measured variable. Popular techniques for noise smoothing include binning, regression, and clustering; most of these techniques include a data discrimination phase [42].
c. Outliers: an outlier is an observation point that is distant from other observations. Outliers may be detected by clustering, where similar values are organized into groups, or "clusters"; intuitively, values that fall outside of the set of clusters may be considered outliers [19].

2. Data Integration
It comprises the merging of data from multiple data sources. This process must be carefully performed in order to avoid redundancies and inconsistencies in the resulting data set. Typical operations accomplished within data integration are the identification and unification of variables and domains, the analysis of attribute correlation, the detection of duplicated tuples and the detection of conflicts in data values from different sources [18].

3. Data Transformation
In this preprocessing step, the data is converted or consolidated so that the result of the mining process can be applied or made more efficient. Data transformation can involve the following steps [12]:
a. Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.
b. Aggregation, where summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated to compute monthly and annual totals. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.
c. Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country. Similarly, values of numerical attributes, like age, may be mapped to higher-level concepts, like youth, middle-aged, and senior.
d. Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1 to 1, or 0 to 1. The most common normalization techniques are [18]:

1. Min-Max Normalization: performs a linear transformation on the original data. Suppose that min_A and max_A are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_min_A, new_max_A] by computing:

v' = ((v - min_A) / (max_A - min_A)) (new_max_A - new_min_A) + new_min_A ………………… (2.1)

Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range of A.

2. Z-score Normalization (or zero-mean normalization): in some cases, min-max normalization is not useful or cannot be applied. When the minimum or maximum values of attribute A are not known, min-max normalization is infeasible. Even when the minimum and maximum values are available, the presence of outliers can bias min-max normalization by grouping the values and limiting the digital precision available to represent the values. If Ā is the mean of the values of attribute A and S_A is their standard deviation, an original value v of A is normalized to v' using:

v' = (v - Ā) / S_A ………………… (2.2)

By applying this transformation the attribute values now have a mean equal to 0 and a standard deviation of 1. If the mean and standard deviation associated with the probability distribution are not available, it is usual to use the sample mean and standard deviation instead:

Ā = (1/n) Σᵢ vᵢ ………………… (2.3)

and

S_A = sqrt( (1/n) Σᵢ (vᵢ - Ā)² ) ………………… (2.4)

3. Decimal Scaling Normalization: a simple way to reduce the absolute values of a numerical attribute is to normalize its values by shifting the decimal point, using division by a power of ten such that the maximum absolute value is always lower than 1 after the transformation. This transformation is commonly known as decimal scaling and is expressed as:

v' = v / 10^j ………………… (2.5)

where j is the smallest integer such that new_max_A < 1.

e. Attribute construction (feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.
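As a small illustrative sketch of the three normalization formulas above (Eqs. 2.1, 2.2 and 2.5), the following Python/NumPy functions show one possible implementation; the function names are invented for illustration and this is not the thesis's own code:

```python
import numpy as np

def min_max_normalize(v, new_min=0.0, new_max=1.0):
    """Eq. (2.1): linearly map the values of an attribute into [new_min, new_max]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score_normalize(v):
    """Eqs. (2.2)-(2.4): center on the sample mean and scale by the sample std."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def decimal_scaling_normalize(v):
    """Eq. (2.5): divide by 10**j so the maximum absolute value drops below 1."""
    v = np.asarray(v, dtype=float)
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / (10 ** j)

ages = [18, 25, 33, 47, 61]
print(min_max_normalize(ages))          # values in [0, 1]
print(z_score_normalize(ages))          # mean 0, std 1
print(decimal_scaling_normalize(ages))  # |values| < 1
```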

4. Data Reduction
Data reduction techniques are used in order to obtain a new representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results [12]. Strategies for data reduction include the following [10]:
a) Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
b) Attribute subset selection, where irrelevant, weakly relevant or redundant attributes or dimensions may be detected and removed.
c) Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
d) Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.

5. Discretization and Concept Hierarchy Generation
Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into a finite number of intervals. Interval labels can then be used to replace actual data values. Data discretization is a form of reduction that is very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction [10] [18].


Figure (2.1): Forms of Data Preprocessing [10]


2.2 Missing Value

Many existing industrial and research data sets contain missing values (MVs) in their attributes, as shown in Figure (2.2). Intuitively, an MV is simply a value for an attribute that was not introduced or was lost in the recording process. There are various reasons for their existence, such as manual data entry procedures, equipment errors and incorrect measurements. The presence of such imperfections usually requires a preprocessing stage in which the data is prepared and cleaned, in order to be useful to and sufficiently clear for the knowledge extraction process. The simplest way of dealing with MVs is to discard the instances (attributes) that contain them. However, this method is practical only when the data contains a relatively small number of examples with MVs and when analysis of the complete examples will not lead to serious bias during the inference. MVs make performing data analysis difficult, and their presence can also pose serious problems for researchers. In fact, inappropriate handling of MVs in the analysis may introduce bias, can result in misleading conclusions being drawn from a research study, and can also limit the generalizability of the research findings [5] [15] [41]. Three types of problems are usually associated with MVs in DM [43]:
a. Loss of efficiency.
b. Complications in handling and analyzing the data.
c. Bias resulting from differences between missing and complete data.

Figure (2.2): Data set with MVs denoted with the symbol '?' [18]

2.2.1 Missing Data Mechanisms

The mechanism of missingness describes the relationship between the probability of a value being missing and the other variables in the data set. Let Y represent the complete data, which can be partitioned as (Yobs, Ymis), where Yobs is the observed part of Y and Ymis is the missing part of Y, and let R be an indicator random variable (or matrix) indicating whether Y is observed or missing: R = 1 denotes a value that is observed and R = 0 a value that is missing. The statistical model for missing data is P(R | Y, φ), where φ is the parameter of the missing data process. The mechanism of missingness is determined by the dependency of R on the variables in the data set [21]. The following are the different mechanisms of missingness [15] [34]:

i. Missing Completely At Random (MCAR)

The first mechanism of missingness is a special case of MAR known as missing completely at random (MCAR). In this case, the mechanism of missingness is given by:

P(R | Y, φ) = P(R | φ) ………………… (2.6)

That is, the probability of missingness is not dependent on any observed or unobserved values in Y. It is what one colloquially thinks of as "random". One example of MCAR might be a computer malfunction that arbitrarily deletes some of the data values.

ii. Missing At Random (MAR)

The second mechanism of missingness is missing at random (MAR). This mechanism is given by:

P(R | Y, φ) = P(R | Yobs, φ) ………………… (2.7)

That is, the probability of missingness is only dependent on observed values in Y and not on any unobserved values in Y. A simple example of MAR is a survey where subjects over a certain age refuse to answer a particular survey question and age is an observed covariate.

iii. Not Missing At Random (NMAR)

The third mechanism of missingness is referred to as not missing at random (NMAR), also called missing not at random (MNAR). This mechanism is given by:

P(R | Y, φ) = P(R | Yobs, Ymis, φ) ………………… (2.8)


This mechanism occurs when the conditions of MAR are violated so that the probability of missingness is dependent on Ymis or some unobserved covariate. One instance of MNAR might be subjects who have an income above a certain value refusing to report an income in the survey. Here the missingness is dependent on the unobserved response, income.
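For illustration only, the three mechanisms can be simulated on a numeric data matrix as in the following Python sketch; the function names, rates and the choice of driver column are assumptions for this example, not the thesis's actual missingness-generation procedure (which is described in Section 3.3):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mcar(X, rate=0.1):
    """MCAR: every cell is missing with the same probability, independent of Y."""
    X = X.astype(float).copy()
    X[rng.random(X.shape) < rate] = np.nan
    return X

def make_mar(X, target_col=1, driver_col=0, rate=0.2):
    """MAR: missingness in target_col depends only on the observed driver_col."""
    X = X.astype(float).copy()
    high = X[:, driver_col] > np.median(X[:, driver_col])   # observed covariate
    mask = high & (rng.random(X.shape[0]) < rate * 2)       # higher rate for 'high' rows
    X[mask, target_col] = np.nan
    return X

def make_nmar(X, target_col=1, rate=0.2):
    """NMAR: missingness in target_col depends on the (unobserved) value itself."""
    X = X.astype(float).copy()
    large = X[:, target_col] > np.quantile(X[:, target_col], 1 - rate)
    X[large, target_col] = np.nan
    return X

X = rng.normal(size=(10, 3))
print(make_mcar(X), make_mar(X), make_nmar(X), sep="\n\n")
```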

2.2.2 Methods for Handling Incomplete Data (Missing Data)

Methods to deal with missing values are not new. In 1976, Rubin [34] developed a framework of inference from incomplete data that is still in use today, and since then many researchers have entered this area and proposed a great number of methods. There are several different methods and strategies available for handling missing data; current approaches to processing missing data can be approximately divided into three categories: tolerance, ignoring and imputation-based procedures [47].

a. Tolerance
This straightforward method aims to maintain the source entries in their incomplete fashion. It may be a practical and computationally low-cost solution, but it requires the subsequent techniques to work robustly even if the data quality stays low [31].

b. Ignoring
Ignoring missing data often refers to "case deletion". It is the most frequently applied procedure nowadays, yet it suffers from a loss of information in the incomplete cases, a risk of bias if the missing data are not MCAR, and a tendency to diminish data quality. Its strength lies in its ease of application. Deleting elements with missing values is done in two manners [4] [66]:

(i) List-wise/Case-wise Deletion (complete-case analysis): omits the entire instances (cases) or attributes (records) containing missing values. The main drawback of this method is that it may lead to a large loss of observations, which may result in high inaccuracy, in particular if the original dataset is itself too small or the number of instances that contain missing values is too large. Table (2.1) illustrates listwise deletion.

Table (2.1): Listwise Deletion

(ii) Pairwise Deletion (available-case analysis): incomplete instances are removed on an analysis-by-analysis basis. Unlike listwise deletion, which removes instances that have missing values on any of the variables under analysis, pairwise deletion only removes the specific missing values from the analysis (not the entire instance), so that any given instance may contribute to some analyses but not to others. Table (2.2) illustrates pairwise deletion; a small illustrative sketch of both deletion strategies follows.

Table (2.2): Pairwise Deletion
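The difference between the two deletion strategies can be sketched in Python with pandas as follows; the toy data are invented for illustration and this is not part of the thesis:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [1.0, np.nan, 3.0, 4.0],
                   "x2": [2.0, 5.0, np.nan, 8.0],
                   "x3": [7.0, 9.0, 11.0, 13.0]})

# Listwise (complete-case) deletion: drop every row containing any missing value.
complete_cases = df.dropna()

# Pairwise (available-case) deletion: each statistic uses all rows observed
# for the variables involved, so different analyses use different subsets.
pairwise_corr = df.corr()    # correlations computed pair by pair, skipping NaNs
pairwise_means = df.mean()   # each column mean uses its own observed rows

print(complete_cases, pairwise_corr, pairwise_means, sep="\n\n")
```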

c. Imputation
Imputation is the process of replacing missing data with substituted values. Because missing data can create problems for analyzing data, imputation is seen as a way to avoid the pitfalls involved with listwise deletion of cases that have missing values. That is to say, when one or more values are missing for a case, most statistical packages default to discarding any case that has a missing value, which may introduce bias or affect the representativeness of the results. Imputation preserves all cases by replacing missing data with an estimated value based on other available information. Once all missing values have been imputed, the data set can be analyzed using standard techniques for complete data. There are different methods to impute missing values, as shown below [1] (a small illustrative sketch of the first two follows this list).

i. Mean/Mode Substitution (MMS): replaces the missing values by the mean (the arithmetic average) or mode (the value with the highest frequency of occurrence) of all the observations, or of a subgroup, at the same variable. It consists of replacing the unknown value for a given attribute by the mean (quantitative attribute) or mode (qualitative attribute) of all known values of that attribute. Replacing all missing records with a single value distorts the input data distribution [44] [59].

ii. Hot-deck/Cold-deck Imputation: given an incomplete pattern, Hot-Deck Imputation (HDI) replaces the missing data with the values from the input vector that is closest in terms of the attributes that are known in both patterns. This method attempts to preserve the distribution by substituting different observed values for each missing item. Another possibility is the Cold-Deck Imputation (CDI) method, which is similar to hot deck except that the data source must be other than the current dataset; for example, in a survey context, the external source can be a previous realization of the same survey [25] [50].

iii. Regression Imputation: this method uses multiple linear regression to obtain estimates of the missing values. It is applied by estimating a regression equation for each variable, using the others as predictors. This solves the problems concerning variance and covariance raised by the previous method but leads to polarization of all the variables if they are not linked in a linear fashion. Possible errors are due to the insertion of highly correlated predictors to estimate the variables. Many forms of regression models can be used for regression imputation, such as linear regression, logistic regression and semi-parametric regression [26] [32].

iv. Expectation Maximization Estimation (EME): this approach can handle parameter estimation in the presence of missing data, based on the Expectation Maximization (EM) algorithm proposed by Dempster, Laird and Rubin. The EM algorithm consists of two steps that are repeated until convergence: the expectation (E-step) and the maximization (M-step). These methods are generally superior to case deletion methods, because they utilize all the observed data. However, they suffer from a strict assumption of a model distribution for the variables, such as a multivariate normal model, which has a high sensitivity to outliers [63] [52].

v. Machine-learning-based Imputation: this family acquires the features of the unknown data of interest by behaviour evolution after the sample data are processed. The essence is to automatically learn from the samples for complicated pattern recognition and to intelligently predict the missing values. Most of the methods described above come from statistics; recently, some machine learning techniques have been introduced to estimate missing values. The methods mainly include decision-tree-based imputation, association-rules-based imputation and clustering-based imputation [7] [65] [66].

vi. Multiple Imputation: multiple imputation was first proposed by Rubin (1976) [34] and is now an increasingly popular way to handle missing data. It produces m complete datasets, and then each of the datasets is analyzed by a complete-data method. Finally, the results derived from these m datasets are combined. Multiple imputation reflects the uncertainty of the missing values; all the analyses are combined to reflect both the inter-imputation variability and the intra-imputation variability [58].
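As a rough sketch of the two simplest imputation families above (items i and ii), mean/mode substitution and a nearest-donor hot deck can be written in Python as follows; the functions and the toy data are invented for illustration and are not the thesis's implementation:

```python
import numpy as np
import pandas as pd

def mean_mode_impute(df):
    """MMS sketch: numeric columns get the column mean, other columns the column mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

def hot_deck_impute(df):
    """HDI sketch: fill each incomplete row from its nearest complete 'donor' row,
    with distance computed over the numeric attributes observed in that row."""
    out = df.copy()
    num = out.select_dtypes(include="number")
    donors = out[num.notna().all(axis=1)]
    for idx in out.index[num.isna().any(axis=1)]:
        obs = num.columns[num.loc[idx].notna()]
        d = ((donors[obs] - num.loc[idx, obs]) ** 2).sum(axis=1)
        donor = donors.loc[d.idxmin()]
        out.loc[idx] = out.loc[idx].fillna(donor)
    return out

df = pd.DataFrame({"alcohol": [13.2, np.nan, 12.8, 14.1],
                   "ash":     [2.4, 2.1, np.nan, 2.6],
                   "class":   ["A", "B", "B", np.nan]})
print(mean_mode_impute(df))
print(hot_deck_impute(df))
```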

2.3 Simulation

Simulation refers to the process of using computer-generated random samples to create a dataset. Simulation uses methods based on random samples to imitate a process of interest, drawing random samples from a particular discrete or continuous probability distribution (such as the Bernoulli, binomial, geometric, uniform, normal, chi-squared, exponential or beta distributions) or simulating data from a model (a linear model, nonlinear model, time series model, and so on). Simulation methods are relatively straightforward once the assumptions of a model and the parameters to be used for data generation are specified. Researchers who use simulation methods can have tight experimental control over these assumptions and their data, and can test how a model performs under a known set of parameters (whereas with real-world data the parameters are unknown). Simulation methods are flexible and can be applied to a number of problems to obtain quantitative answers to questions that may not be answerable through other approaches. Nowadays simulation is a tool mainly used for different types of production system analysis, but recent advances in technology have allowed simulation to expand its usefulness beyond a purely design function and into operational use [17] [48].
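For illustration, drawing simulated samples from some of the distributions and from a simple model, as described above, might look like the following Python/NumPy sketch; the specific distributions, parameters and model are arbitrary choices for this example, not those used in Chapter Three:

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw simulated samples from a few of the distributions mentioned above.
bernoulli = rng.binomial(n=1, p=0.3, size=1000)
normal    = rng.normal(loc=0.0, scale=1.0, size=1000)
uniform   = rng.uniform(low=0.0, high=1.0, size=1000)
chi2      = rng.chisquare(df=4, size=1000)

# Or simulate data from a simple linear model y = 2x + 1 + noise.
x = rng.uniform(0, 10, size=1000)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=1000)

print(normal.mean(), normal.std(), y[:5])
```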

2.4 Clustering

Clustering analysis plays an important role in the data mining field; it is a method of grouping objects or patterns into several groups. It attempts to organize unlabeled input objects into clusters or "natural groups" such that data points within a cluster are more similar to each other than to those belonging to different clusters, i.e., to maximize the intra-cluster similarity while minimizing the inter-cluster similarity. In the field of clustering analysis, a number of methods have been put forward and many successful applications have been reported. Clustering algorithms can be loosely divided into the following categories: hierarchical, partition-based, density-based, grid-based and model-based clustering algorithms. Among them, partition-based algorithms, which partition objects using membership matrices, are the most widely studied. Traditional partition-based clustering methods are usually deterministic methods that assign each object to a specific group, i.e., their membership functions take on a value of 0 or 1, so one knows exactly which group an observed object belongs to. This characteristic brings about a common drawback of these clustering methods: we cannot know the probability of an observed object being part of different groups, which reduces the effectiveness of hard clustering methods in many real situations. For this purpose, fuzzy clustering methods, which incorporate fuzzy set theory, have emerged [37] [72]. One possible classification of clustering methods is according to whether the subsets are fuzzy (soft) or crisp (hard) [14]; the difference between them is illustrated in Figure (2.3).

Hard clustering methods are based on classical set theory and require that an object either does or does not belong to a cluster; hard clustering means partitioning the data into a specified number of mutually exclusive subsets. Soft clustering methods, however, allow objects to belong to several clusters simultaneously, with different degrees of membership. In many situations, fuzzy clustering is more natural than hard clustering.

2.4.1 Fuzzy C-Mean Clustering

Among the fuzzy clustering methods, the fuzzy c-means (FCM) algorithm is the most well known because it has the advantage of robustness to ambiguity and maintains much more information than any hard clustering method. The algorithm is an extension of the classical, crisp k-means clustering method [64] to the fuzzy set domain. Soft clustering allows data elements to belong to more than one cluster and aims to minimize an objective function; minimizing the objective function means increasing the similarity among all the components within an object and reducing the similarity between components of one object and others [9] [60]. FCM is widely studied and applied in pattern recognition, image segmentation and image clustering [29] [53], data mining, wireless sensor networks [35] and so on.

Figure (2.3): Difference between Hard and Soft Clustering


The FCM can be summarized in 4 steps:

Step 1. Randomly initialize the membership matrix U^(0) = [u_ij], with u_ij in [0, 1], which satisfies:

\sum_{j=1}^{G} u_{ij} = 1 ,  i = 1, ..., M   ……………………... (2.9)

Step 2. From the rth iteration (r > 0), calculate the centroids c_j^(r):

c_j^{(r)} = \sum_{i=1}^{M} (u_{ij}^{(r)})^{s} x_i \Big/ \sum_{i=1}^{M} (u_{ij}^{(r)})^{s}   ……………………………………………. (2.10)

Step 3. Update the membership matrix U^(r+1):

u_{ij}^{(r+1)} = \left[ \sum_{k=1}^{G} \left( d(x_i, c_j^{(r)}) / d(x_i, c_k^{(r)}) \right)^{2/(s-1)} \right]^{-1}   ......…………………..…... (2.11)

Step 4. If max_{i,j} | u_{ij}^{(r+1)} - u_{ij}^{(r)} | < ε is satisfied, no more iterations are needed; the iterative procedure immediately ends with the formed clusters, otherwise it returns to Step 2.

Where:
• r is the ordinal number of the iterations.
• x_i is the ith complete data instance.
• d(·, ·) is the distance metric between two instances.
• u_ij^(r) is the degree of membership (or probability) that the ith instance is subordinate to the jth cluster under the "fuzzifier" s.
• G is the total number of clusters.
• M represents the number of data instances.
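As a hedged illustration of these update rules (not the thesis implementation), the following R sketch applies equations (2.9)-(2.11) directly; the fuzzifier s = 2 and the use of the built-in iris data are assumptions made only for demonstration:

# Minimal sketch of the FCM updates in equations (2.9)-(2.11).
fcm <- function(X, G, s = 2, max_iter = 100, eps = 1e-5) {
  M <- nrow(X)
  U <- matrix(runif(M * G), M, G)
  U <- U / rowSums(U)                       # rows sum to 1, eq. (2.9)
  for (r in seq_len(max_iter)) {
    W <- U^s
    C <- t(W) %*% X / colSums(W)            # centroids, eq. (2.10)
    D <- as.matrix(dist(rbind(X, C)))[1:M, (M + 1):(M + G)]
    D[D == 0] <- 1e-10                      # avoid division by zero
    P <- D^(2 / (s - 1))
    U_new <- 1 / (P * rowSums(1 / P))       # membership update, eq. (2.11)
    if (max(abs(U_new - U)) < eps) { U <- U_new; break }  # stopping rule, Step 4
    U <- U_new
  }
  list(membership = U, centroids = C)
}
res <- fcm(as.matrix(iris[, 1:4]), G = 3)   # iris is only a stand-in dataset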


2.5 Grey System Theory (GST)
Grey System Theory (GST), where "grey" means poor, incomplete, uncertain, etc., was first proposed by Prof. Deng (1982) [45]. A grey system is a system of which part of the information is known and part is unknown. Systems with completely unknown information are black systems, while systems with complete information available are called white systems. The term "grey" lies between "black" and "white" and indicates that the information is partially available. Up to now, GST has developed a set of theories and techniques including grey mathematics, grey generating, grey relational analysis, grey modeling, grey clustering, grey forecasting, grey decision making, grey programming and grey control, and it has been applied successfully in many engineering and managerial fields such as industry, ecology, meteorology, geography, earthquakes, hydrology, medicine and the military. The major advantage of grey theory is that it can handle both incomplete information and unclear problems very precisely. It serves as an analysis tool especially in cases where there is insufficient data [38] [49] [70].

2.5.1 Grey Relational Analysis (GRA)
Grey Relational Analysis (GRA) is an important method of Grey System Theory (GST). It is based on geometrical mathematics and complies with the principles of normality, symmetry, entirety, and proximity. GRA is suitable for solving complicated interrelationships between multiple factors and variables, and has been successfully applied to cluster analysis, robot path planning, project selection, prediction analysis, performance evaluation, factor effect evaluation and multiple-criteria decision making [61].


Grey Relational Analysis includes the Grey Relational Coefficient (GRC) and the Grey Relational Grade (GRG); a detailed explanation of the GRA method is presented in the following steps [39]:

Step 1. The first step is data pre-processing. Data pre-processing is usually required when the range or unit of one data sequence differs from the others or the sequence scatter range is too large. Data pre-processing is a method of transferring the original data sequence into a comparable sequence. Therefore, the data must first be normalized, scaled and polarized into a comparable sequence before proceeding to the other steps. To map the original data into a particular interval (a, b), the min-max normalization equation is used:

x' = a + \frac{(x - min_A)(b - a)}{max_A - min_A}   …..…. (2.12)

Hereinto, max_A and min_A are the maximum and minimum values, respectively, under attribute A. When the data are normalized into the specific interval [0, 1], the min-max normalization equation becomes:

x' = \frac{x - min_A}{max_A - min_A}   ………………………………..…….… (2.13)

Step 2. In a dataset X = {x_1, x_2, ..., x_n}, let x_i(A) and x_j(A) denote the values of instances x_i and x_j on attribute A, where m is the number of attributes in each instance, so that the grey relational coefficient of the two instances on attribute A is:


γ(x_i(A), x_j(A)) = \frac{\min_j \min_A |x_i(A) - x_j(A)| + ρ \cdot \max_j \max_A |x_i(A) - x_j(A)|}{|x_i(A) - x_j(A)| + ρ \cdot \max_j \max_A |x_i(A) - x_j(A)|}   …... (2.14)

Hereinto:

• One important parameter is ρ ∈ [0, 1], which is used to control the level of differences with respect to the relational coefficient. When ρ = 0, the comparison environment does not occur any more; on the contrary, ρ = 1 shows that the comparison environment remains unchanged. A proper value of ρ can favorably manage the impact of the maximum value in the matrix. Nevertheless, no method for selecting the optimum value has been established so far; instead, researchers usually choose to set it empirically as 0.5 or learn an optimized value from experimental results [28].

• γ(x_i(A), x_j(A)) ∈ [0, 1] represents the level of similarity of instances x_i and x_j on attribute A. When γ(x_i(A), x_j(A)) = 1, it shows that x_i and x_j have the same value on attribute A; on the contrary, when x_i and x_j have different values on attribute A, the value of γ(x_i(A), x_j(A)) tends to 0.

Step 3. In the dataset X = {x_1, x_2, ..., x_n}, where m is the number of attributes in each instance, the calculation formula for the grey similarity level between instances x_i and x_j is determined to be:

γ(x_i, x_j) = \frac{1}{m} \sum_{A=1}^{m} γ(x_i(A), x_j(A))   ………..…………... (2.15)


The larger the grey similarity between two instances determined in equation (2.15), the more similar the two instances. If γ(x_i, x_j) = 0, it shows that x_i and x_j are totally irrelevant; γ(x_i, x_j) < γ(x_i, x_k) shows that the level of similarity between x_i and x_j is smaller than that between x_i and x_k; and γ(x_i, x_j) = 1 shows that instances x_i and x_j are the same. In order to suit the above GRA equations to missing values and the other techniques of the proposed algorithm, since in this study GRA is applied to an incomplete dataset, the GRC and GRG equations can be expressed by the following steps:

Step 1. Map the original data into the interval [0, 1]. Then the Grey Relational Coefficient (GRC) is formulated by equation (2.16):

GRC(x_{kp}, c_{ip}) = \frac{Δ_{min} + ρ \cdot Δ_{max}}{|x_{kp} - c_{ip}| + ρ \cdot Δ_{max}}   …(2.16)

Where:
• x_k is the kth incomplete instance,
• p is the pth attribute with non-missing values,
• c_i denotes the centroid of the ith cluster,
• Δ_min and Δ_max are the minimum and maximum of the absolute differences |x_{kp} - c_{ip}| taken over the observed attributes.
• In other words, the calculation only happens when the pth attributive value of x_k exists.

Step 2. Integrating each parameter's GRC between an incomplete instance and the reference, the GRG is calculated in (2.17).


GRG(x_k, c_i) = \frac{1}{n_k} \sum_{p} GRC(x_{kp}, c_{ip})   …………………….…. (2.17)

where the sum runs over the observed attributes of x_k and n_k is their number.

 In terms of the maximal value of GRG, each incomplete attribute is individually incorporated into the closest cluster [54] .
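A small, hedged R sketch of equations (2.16)-(2.17) follows; it assumes ρ = 0.5 (the common empirical choice noted above), takes Δ_min and Δ_max over the observed attributes of the single comparison, and treats the GRG as the average GRC over the observed attributes. The toy vectors are for illustration only.

# Minimal sketch of eqs. (2.16)-(2.17): grey relational coefficient and grade
# between an incomplete instance x_k and a cluster centroid c_i.
grg <- function(x_k, c_i, rho = 0.5) {
  obs   <- !is.na(x_k)                      # only observed attributes are used
  delta <- abs(x_k[obs] - c_i[obs])         # absolute differences
  d_min <- min(delta); d_max <- max(delta)
  grc   <- (d_min + rho * d_max) / (delta + rho * d_max)   # eq. (2.16)
  mean(grc)                                                # eq. (2.17), assumed average form
}
x_k <- c(0.2, NA, 0.7, 0.5)      # normalized incomplete instance (toy values)
c_1 <- c(0.3, 0.4, 0.6, 0.5)     # one cluster centroid (toy values)
grg(x_k, c_1)                    # the instance joins the cluster with maximal GRG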

2.6 Classification
Classification consists of predicting a certain outcome based on a given input. In order to predict the outcome, the algorithm processes a training set containing a set of attributes and the respective outcome, usually called the goal or prediction attribute. The algorithm tries to discover relationships between the attributes that make it possible to predict the outcome. Next, the algorithm is given a data set not seen before, called the prediction set (test set), which contains the same set of attributes except for the prediction attribute, which is not yet known. The algorithm analyses the input and produces a prediction, and the prediction accuracy defines how "good" the algorithm is [23]. A classification technique (or classifier) is a systematic approach to building classification models from an input data set. Examples include decision tree classifiers, rule-based classifiers, neural networks, support vector machines, and Naïve Bayes classifiers. Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and the class label of the input data. The model generated by a learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before. Therefore, a key objective of the learning algorithm is to build models with good


generalization capability, i.e., models that accurately predict the class labels of previously unseen records [13]. Figure (2.4) shows a general approach for solving classification problems. First, a training set consisting of records whose class labels are known must be provided. The training set is used to build a classification model, which is subsequently applied to the test set, which consists of records with unknown class labels.

Figure (2.4): General Approach for Building a Classification Model [13] .


2.6.1 Decision Tree
One of the most intuitive tools for data classification is the decision tree. It hierarchically partitions the input space until it reaches a subspace associated with a class label. Decision trees are appreciated for being easy to interpret and easy to use. They are enthusiastically used in a range of business, scientific, and health care applications because they provide an intuitive means of solving complex decision-making tasks. For example, in business, decision trees are used for everything from codifying how employees should deal with customer needs to making high-value investments. In medicine, decision trees are used for diagnosing illnesses and making treatment decisions for individuals or for communities. A decision tree is a rooted, directed tree, like a flowchart [3]. The tree has three types of nodes [13]:
 A root node that has no incoming edges and zero or more outgoing edges.
 Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
 Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
Each internal node corresponds to a partitioning decision, and each leaf node is mapped to a class label prediction, as shown in Figure (2.5).


Figure (2.5): Decision Tree Example [13].
Choosing the root of the tree, i.e., the top node, is usually done by measuring the entropy (or, equivalently, the information gain) for each attribute.
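As a hedged illustration only (not the thesis implementation), a decision tree classifier can be fitted in R with the rpart package; note that rpart's default split criterion is an impurity measure analogous to the entropy/information-gain idea discussed below, and the built-in iris data is used purely as a stand-in:

# Minimal sketch: fitting and using a decision tree classifier in R.
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)                                   # shows the chosen splits
predict(fit, head(iris), type = "class")     # class predictions for new records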

2.6.2 Entropy and Information Gain
The term "entropy" refers to Shannon's entropy, introduced by Claude E. Shannon in 1948 [30]. Entropy is a method of measuring the randomness or uncertainty in a given set of data. It quantifies the expected value of the information contained in a message, usually in units such as bits. Equivalently, the Shannon entropy is a measure of the average information content one is missing when one does not know the value of the random variable. The entropy is 0 if the outcome is certain; on the other hand, it is maximal if there is no knowledge of the system (any outcome is equally possible), which is intuitively the most uncertain situation, as shown in Figure (2.6).


The entropy H of a discrete random variable Y is formulated as:

H(Y) = E[I(Y)] = E[-\log_2 p(Y)]   …………….……………… (2.18)

Where E is the expected value function and I(Y) is the information content or self-information of Y; I(Y) is itself a random variable. If p denotes the probability mass function (p.m.f.) of Y, then the entropy can be written as:

H(Y) = - \sum_{y} p(y) \log_2 p(y)   ………….……… (2.19)

Where:
• p(y) is the probability that a random selection would have state y.

If the system is pre-partitioned into subsets according to another variable (or splitting rule) S, then the information entropy of the overall system is the weighted sum of the entropies of each partition. This is equivalent to the conditional entropy H(Y|S):

H(Y|S) = \sum_{s \in S} \frac{|Y_s|}{|Y|} H(Y_s)   ………….….. (2.20)

Here the information gain of the random variable Y after splitting X into S partitions is:

IG(Y, S) = H(Y) - \sum_{s \in S} \frac{|Y_s|}{|Y|} H(Y_s)   ……(2.21)
         = H(Y) - H(Y|S)   ………………….… (2.22)

Then the information gain is the amount of information gained by knowing the value of the attribute, which is the entropy of the set before the split minus the entropy of the set after the split. The largest information gain is equivalent to the smallest entropy.


Information gain = (entropy of the set before the split) – (entropy of the set after the split)

Figure (2.6): Entropy in the case of two possibilities with probabilities p and (1 − p) [30]
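A hedged R sketch of equations (2.19)-(2.22) is given below; the toy class labels and the split attribute are illustrative only:

# Minimal sketch: entropy and information gain for a vector of class labels y
# split by a categorical attribute s.
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))                         # eq. (2.19)
}
info_gain <- function(y, s) {
  h_before <- entropy(y)                    # entropy before the split
  h_after  <- sum(sapply(split(y, s), function(part)
    length(part) / length(y) * entropy(part)))   # weighted sum, eq. (2.20)
  h_before - h_after                        # eqs. (2.21)-(2.22)
}
y <- c("c1", "c1", "c2", "c2", "c1")   # class labels (toy)
s <- c(1, 1, 0, 0, 0)                  # binary attribute used for the split (toy)
info_gain(y, s)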

2.7 Hybrid Imputation
A hybrid imputation approach is derived by combining more than one technique (method) used for imputing missing values; using a hybrid method may achieve higher imputation performance than a single-type approach alone [40] [47] [56] [57].

2.8 Proposed Algorithm
Our proposed algorithm depends on improving the existing algorithm (MIGEC) proposed by [47] after adding the following steps:
1- Converting the incomplete dataset to a binary dataset.


2- GRA based on attributes instead of instances.
3- Attribute merging instead of instance merging.
4- After the first missing element of an attribute is imputed by mean imputation, subsequent imputations use the newly imputed values (imputed by the PA) instead of the mean to estimate the remaining missing values of that attribute.

The global procedure of the proposed algorithm is schematized in Figure (2.7), and each of the key components is detailed in the following subsections. For implementing our new development, the PA, the items of the raw data set are divided into two disjoint subsets, namely the complete dataset and the incomplete dataset. This separation is expected to minimize the negative impact of the information loss due to missing values. On one hand, the objects of the complete set are grouped into a number of clusters via FCM. On the other hand, the items in the incomplete set are reordered according to their missing severity from high to low, and the GST-based distance metric is applied; that is, in terms of the maximal value of GRG, each incomplete attribute is individually incorporated into the closest cluster. After that, the incomplete dataset is converted to a binary dataset in which each observed element is changed to one and each missing element to zero (observed = 1 & missed = 0). Next, each missing attributive value is imputed by the proposed imputation, which uses classification and mean imputation: first, each instance of the binary dataset is classified with respect to the target (here, the cluster labels); then we return to the incomplete dataset and impute it using mean imputation as the first imputation. Therefore, each time we get new imputed values, these new values are used instead of the mean to estimate the next missing values. This process continues until all missing values of the incomplete dataset are imputed.


[Figure (2.7) flowchart: raw dataset preparation → split into a complete dataset and an incomplete dataset → FCM clustering of the complete dataset into clusters → GSA based on attribute and attribute merging (improved steps) → convert the incomplete dataset to a binary dataset → rank instances by missing amount in decreasing order → imputation via mean & classification → loop until all missing values are imputed → combine the complete dataset with the imputed dataset → full dataset with no missing values.]

Figure (2.7): The Flowchart of the Proposed Algorithm


2.8.1 Steps of Algorithm
Let X denote an incomplete dataset with attributes A = {A_1, ..., A_m} and M instances. The incomplete dataset contains two parts, X = {X_obs, X_mis}, where X_obs is the set of observed values and X_mis is the set of missing values.

A binary matrix R is produced from the incomplete dataset X by converting each observed value to one and each missing value to zero:

r_{ij} = 1 if x_{ij} is observed, and r_{ij} = 0 if x_{ij} is missing.

In this case R becomes a matrix of missing-data indicators, and this R matrix has the same number of rows and columns as the data matrix X. For example, a data matrix A containing NA entries maps to an indicator matrix R with a one for every observed entry and a zero for every NA entry.
• NA = Missing Value (Not Available)
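A minimal R sketch of this conversion (with a toy matrix, not the thesis data) is:

# Building the missing-data indicator matrix R from a data matrix A with NAs.
A <- matrix(c(93, NA, 85, 98, 88, NA), nrow = 3)
R <- ifelse(is.na(A), 0, 1)   # observed = 1, missing = 0
R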


After one attribute has been assigned to the most proximate cluster via GST in our proposed improvement, one instance is finally inserted into the binary matrix and called the class (target): it associates the cluster label with the data matrix of the cluster. Then the imputation technique starts as follows:

Step 1. Calculate the expected information (entropy) after partitioning each instance due to the class:

Info(D) = - \sum_{i=1}^{G} p_i \log_2(p_i)   …………………………. (2.23)

Where:
• p_i is the probability of event (cluster) c_i occurring,
• M is the number of instances,
• G is the number of clusters.

The information needed after splitting D due to instance j (i.e., by its binary values) is:

Info_j(D) = \sum_{v \in \{0,1\}} \frac{|D_v|}{|D|} Info(D_v)   …… (2.24)

We illustrate how to calculate the entropy by this example:


The incomplete dataset (NA = missing value) and the class of each attribute:

Instance   A1   A2   A3   A4   A5
1          93   NA   87   89   97
2          98   NA   99   NA   94
3          NA   88   NA   95   NA
4          85   NA   NA   NA   85
5          82   78   NA   85   80
6          95   92   98   92   NA
7          NA   NA   90   90   90
8          NA   NA   NA   NA   95
9          NA   90   NA   98   85
10         NA   82   NA   88   80
11         NA   NA   NA   95   88
12         96   NA   NA   90   95
13         NA   81   92   87   93
14         71   84   89   83   NA
Class      c1   c2   c1   c2   c1

The corresponding binary dataset (observed = 1, missed = 0):

Instance   A1   A2   A3   A4   A5
1          1    0    1    1    1
2          1    0    1    0    1
3          0    1    0    1    0
4          1    0    0    0    1
5          1    1    0    1    1
6          1    1    1    1    0
7          0    0    1    1    1
8          0    0    0    0    1
9          0    1    0    1    1
10         0    1    0    1    1
11         0    0    0    1    1
12         1    0    0    1    1
13         0    1    1    1    1
14         1    1    1    1    0
Class      c1   c2   c1   c2   c1

Here the class row gives the cluster of each attribute, computed by FCM and assigned to each attribute by the maximum value of GRG.

For each instance, the attributes are partitioned by the instance's binary value and the entropy is calculated as follows:

Instance 1 (binary values 1, 0, 1, 1, 1):
• value 0 → {A2}: 0 c1, 1 c2 → Info = 0
• value 1 → {A1, A3, A4, A5}: 3 c1, 1 c2 → Info = -(3/4)\log_2(3/4) - (1/4)\log_2(1/4) = 0.8113
Info_1(D) = (1/5)(0) + (4/5)(0.8113) = 0.6490

Instance 2 (1, 0, 1, 0, 1):
• value 0 → {A2, A4}: 0 c1, 2 c2 → Info = 0
• value 1 → {A1, A3, A5}: 3 c1, 0 c2 → Info = 0
Info_2(D) = 0

Instance 3 (0, 1, 0, 1, 0):
• value 0 → {A1, A3, A5}: 3 c1, 0 c2 → Info = 0
• value 1 → {A2, A4}: 0 c1, 2 c2 → Info = 0
Info_3(D) = 0

Instance 4 (1, 0, 0, 0, 1):
• value 0 → {A2, A3, A4}: 1 c1, 2 c2 → Info = -(1/3)\log_2(1/3) - (2/3)\log_2(2/3) = 0.918295834
• value 1 → {A1, A5}: 2 c1, 0 c2 → Info = 0
Info_4(D) = (3/5)(0.918295834) + (2/5)(0) = 0.5510

In this way the calculation continues for instances 5 to 14.

Step 2. Compute the coefficient of difference for the f-th instance:

d_f = 1 - Info_f(D)   ………..……….. (2.25)

d_f represents the inherent contrast intensity of the f-th instance; a greater value of d_f signifies a greater significance of that instance.

Step 3. Elicit the coefficient of weight for the f-th copy:

w_f = \frac{d_f}{\sum_{f=1}^{M} d_f}   ………………………………………..… (2.26)

Step 4. The mean mode substitution (MMS) is employed to initialize the missing values in the first imputation. This simple technique performs well only when the data are normally distributed. Yet, it is believed that it can produce excellent performance provided that the missing values are initialized by MMS before the imputation, even without any prior knowledge about the pattern of the distribution [68]. Then, the jth attributive missing value of x_k is estimated as the weighted combination:

\hat{x}_{kj} = \sum_{f} w_f \, x_{fj}   ……………………………………… (2.27)

• After the first missing element of an attribute is imputed by mean imputation, subsequent imputations use the result of the new imputation (imputation by the PA algorithm) instead of the mean to calculate the imputation of the remaining missing values of that attribute.
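The following R sketch illustrates Steps 1-2 for a single instance of the binary matrix, using the values of the worked example above; the forms of equations (2.24)-(2.25) follow the reconstruction given here and should be read as an assumption rather than a verbatim transcription of the thesis formulas:

# Expected information of one instance (eq. 2.24) and its difference coefficient.
instance_info <- function(bin_row, cls) {
  parts <- split(cls, bin_row)                        # partition attributes by 0/1
  sum(sapply(parts, function(g) {
    p <- table(g) / length(g)
    length(g) / length(cls) * (-sum(p * log2(p)))     # weighted entropy
  }))
}
bin_row <- c(1, 0, 1, 1, 1)                  # binary values of instance 1
cls     <- c("c1", "c2", "c1", "c2", "c1")   # cluster of each attribute
info_1  <- instance_info(bin_row, cls)       # = 0.649, as in the worked example
d_1     <- 1 - info_1                        # coefficient of difference (assumed eq. 2.25)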

2.9 The Framework of the Proposed Algorithm
Input: X, the m × n dimensional dataset with missing values.
Output: X*, the m × n dimensional complete dataset with imputed values.
Step 1: Artificially generate missingness in the original dataset by using (MCAR, MAR, NMAR) with four different levels of missing rate.
Step 2: Divide the dataset X into two disjoint datasets, X_obs and X_mis, where X_obs contains the complete instances and X_mis the incomplete ones.
Step 3: For each variable in X_obs, apply fuzzy c-means clustering: FCM(X_obs) → {c_1, c_2, ..., c_G}, where G is the number of clusters.
Step 4: For each element x_k in X_mis, allocate x_k to the closest cluster c_q according to Grey System Theory (GST).
Step 5: Impute all missing values of X_mis by utilizing classification and mean imputation, and then integrate X_obs with the imputed X_mis.
Step 6: Repeat Steps (1 to 5) iteratively in case missing values still exist.


Chapter Three: Practical Part

3.1 Introduction
In this chapter, the experimental results of the proposed algorithm for both the Wine and the simulated datasets are displayed and discussed, and a comparison between the proposed algorithm and other previous techniques for dealing with missing data is described. In this thesis, the R programming language (version 3.2.3) was used to implement all the methods needed to conduct the proposed algorithm.

3.2 Dataset
The first dataset used to implement our algorithm is the Wine dataset, obtained from the UCI (University of California, Irvine) Machine Learning Repository. The purpose of selecting this dataset is to compare the efficiency of our algorithm with the previous (MIGEC) algorithm implemented by (J. Tian, B. Yu, D. Yu, Sh. Ma) [47].

3.2.1 Wine Data Set
The Wine data set is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. Information about the Wine dataset is shown in Table (3.1).


Table (3.1): Information about Wine dataset

Data Set Characteristics:   Multivariate
Number of Instances:        178
Area:                       Physical
Attribute Characteristics:  Integer, Real
Number of Attributes:       13
Date Donated:               1991-07-01
Associated Tasks:           Classification
Missing Values?             No
Number of Web Hits:         531927

The attributes are: 1) Alcohol, 2) Malic acid, 3) Ash, 4) Alcalinity of ash, 5) Magnesium, 6) Total phenols, 7) Flavanoids, 8) Nonflavanoid phenols, 9) Proanthocyanins, 10) Color intensity, 11) Hue, 12) OD280/OD315 of diluted wines, 13) Proline.

A sample of the Wine dataset is represented in Figure (3.1).

Figure (3.1): Sample of Wine dataset


3.2.2 Simulated data
Nowadays, data grows rapidly (millions, billions, trillions of records) around the world, as discussed in Chapter One; in other words, data mining already works with massive quantities of data. For this reason we simulated data to assess the performance of the proposed algorithm with a large amount of data. We used a simple random sample of size 1000 and considered simulations under a normal distribution; this dataset contains 13 attributes and 1000 instances, all of them numeric. The computer limitations (Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz) for implementing our algorithm did not allow us to increase the amount of simulated data. We used the rnorm() command from the R programming language to generate 1000 samples under the normal distribution, based on the mean and standard deviation of the Wine dataset as displayed in Table (3.2):
X = rnorm(n, mean, sd)
where:
 X represents an attribute of the simulated data; since we have 13 attributes, X = (Attribute-1, Attribute-2, ..., Attribute-13).
 The rnorm() function in R is a convenient way to simulate values from the normal distribution, characterized by a given mean and standard deviation.
 n is the sample size of the simulated data.
 mean is the mean of the simulated data based on the Wine dataset.
 sd is the standard deviation of the simulated data based on the Wine dataset.
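A minimal sketch of this generation step, shown for the first three attributes only (the means and standard deviations are those listed in Table (3.2); the seed is an arbitrary choice):

# Simulating normally distributed attributes from the Wine means and sds.
set.seed(2016)
n     <- 1000
means <- c(13.0, 2.34, 2.37)     # A1-A3 means from Table (3.2)
sds   <- c(0.81, 1.12, 0.27)     # A1-A3 standard deviations from Table (3.2)
sim <- as.data.frame(mapply(function(m, s) rnorm(n, mean = m, sd = s), means, sds))
names(sim) <- paste0("Attribute-", seq_along(means))
head(sim)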


Table (3.2): Mean and Standard deviation of Wine data

Attribute   A1     A2     A3     A4      A5      A6     A7     A8     A9     A10    A11    A12    A13
Mean        13.0   2.34   2.37   19.50   99.74   2.30   2.03   0.36   1.59   5.06   0.96   2.61   746.89
Sd          0.81   1.12   0.27   3.34    14.28   0.63   1.00   0.12   0.57   2.32   0.23   0.71   314.91

A sample of the simulated dataset is displayed in Figure (3.2).

Figure (3.2): Sample of simulated dataset

3.3 Generating missingness
To intrinsically examine the effectiveness and validity of the research and ensure its systematic nature, we artificially generated missing data at four distinct missing ratios (5%, 10%, 15%, 20%) under three different modalities, namely MCAR, MAR and NMAR, in the complete datasets.


For MCAR, in order to simulate missing values on the attributes, the original datasets are run through a random generator and every value in the dataset is given the same probability α of being missing, where α is the specified missing rate. The "Nonparametric Missing Value Imputation using Random Forest" (missForest) package of the R programming language was used to generate MCAR. A sample of generated missing values on the Wine dataset under MCAR is displayed in Figure (3.3), where each NA represents a missing value.
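A hedged sketch of this step, assuming the prodNA() helper of the missForest package and using the built-in iris data as a stand-in for the Wine data:

# Deleting 10% of the values completely at random (MCAR).
library(missForest)
wine_mcar <- prodNA(iris[, 1:4], noNA = 0.10)   # each value missing with probability 0.10
head(wine_mcar)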

Figure (3.3): Sample of Wine dataset missed by MCAR

Simulating MAR was more challenging, and it worked as follows. Suppose there is a complete dataset with two attributes (x1, x2), where x1 is the attribute into which missing values are introduced and x2 is the attribute that affects the missingness of x1. Given a pair (x1, x2) and a missing rate α, the instances are first split into two equal-sized subsets according to their values at x2: the median of x2 is found, and all instances are assigned to one of the two subsets according to whether their value at x2 is lower (or higher) than the median. After the splitting of the instances, one subset is randomly selected and its values at x1 are set to be missing with probability 4α. The probability 4α results in a missing rate of 2α on the whole variable x1, which is equivalent to a missing rate of α on the two variables (x1, x2). For multiple attributes, the pairing of attributes was based on high correlations among the attributes; different pairs of attributes were used to generate the missingness, and each attribute was paired with the one it is most highly correlated with. A sample of generated missing values on the Wine dataset under MAR is displayed in Figure (3.4).

Figure (3.4): Sample of Wine dataset missed by MAR

The process of generating missing values by NMAR was similar to MAR. The only difference was that there was no need to split the variables into pairs; NMAR produces missingness on every variable directly. For a given variable x and a specified missing rate α, the median of x is first calculated, and then the values that are lower (or higher) than the median are randomly set to missing with probability 2α. A sample of generated missing values on the Wine dataset under NMAR is displayed in Figure (3.5).

Figure (3.5): Sample of Wine dataset missed by NMAR
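The MAR and NMAR mechanisms described above can be sketched in R as follows; x1, x2 and alpha are generic names, and the functions are an illustrative reading of the procedure rather than the thesis code:

# MAR: missingness on x1 driven by the median split of a correlated attribute x2.
gen_mar <- function(x1, x2, alpha) {
  in_low <- x2 <= median(x2)                     # split instances by the median of x2
  side   <- sample(c(TRUE, FALSE), 1)            # randomly pick one of the two subsets
  sel    <- which(in_low == side)
  drop   <- sel[runif(length(sel)) < 4 * alpha]  # missing with probability 4*alpha
  x1[drop] <- NA
  x1
}
# NMAR: missingness on x driven by the values of x itself.
gen_nmar <- function(x, alpha) {
  low  <- which(x <= median(x))                  # values below the median of x
  drop <- low[runif(length(low)) < 2 * alpha]    # missing with probability 2*alpha
  x[drop] <- NA
  x
}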

3.4 Performance Measure
To evaluate the precision of the various data imputation algorithms, the Root Mean Square Error (RMSE) (or the Normalized Root Mean Square Error (NRMSE)) is used in this study:

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}   …………………………….…… (3.1)

NRMSE = RMSE / σ   …………………………………........... (3.2)

Where y_i is the original value, \hat{y}_i is the predicted plausible value, n is the total number of estimations and σ is the standard deviation. A larger value of RMSE suggests a lower accuracy of the algorithm.
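A minimal R sketch of equations (3.1)-(3.2), with toy vectors rather than thesis results:

# RMSE and NRMSE between original and imputed values.
rmse  <- function(y, y_hat) sqrt(mean((y - y_hat)^2))
nrmse <- function(y, y_hat) rmse(y, y_hat) / sd(y)
y     <- c(13.2, 12.8, 13.5, 14.0)
y_hat <- c(13.0, 12.9, 13.3, 13.8)
c(RMSE = rmse(y, y_hat), NRMSE = nrmse(y, y_hat))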


3.5 Optimality Measure
Before the comparative demonstrations, in order to capture the result of the data imputation accurately, it is necessary to select the optimum values for the number of iterations (number of imputations) and the number of clusters; in other words, to determine which number of clusters or iterations gives the minimum RMSE.

3.5.1 Number of Iterations
First we tested the number of iterations for each missingness mechanism (MCAR, MAR, NMAR) with a 10% missing rate and five clusters as the initial value (since 5 lies between 2 and 10), for the Wine dataset.

For MCAR:
Table (3.3): Checking optimality by number of iterations (for MCAR): MCAR with 10% missing rate, (5, 10, 15, 20, 25, 30, 35, 40, 45, 50) iterations and cluster = 5

Iterations   5        10       15       20       25       30       35       40       45       50
RMSE         0.1418   0.1563   0.1575   0.1650   0.1620   0.1616   0.1637   0.1600   0.1590   0.1591

As seen in Table (3.3), the RMSE declines to its lowest value (0.1418) when the number of iterations is 5, so this is the optimum number of iterations for MCAR, as also shown in Figure (3.6).

Figure (3.6): Checking optimality by number of Iterations (for MCAR)


For MAR:
Table (3.4): Checking optimality by number of iterations (for MAR): MAR with 10% missing rate, (5, 10, 15, 20, 25, 30, 35, 40, 45, 50) iterations and cluster = 5

Iterations   5        10       15       20       25       30       35       40       45       50
RMSE         0.1388   0.1503   0.1578   0.1618   0.1625   0.1621   0.1648   0.1674   0.1693   0.1660

Table (3.4) illustrates that the best number of iterations for MAR is also 5, which gives the minimum RMSE (0.1388), as shown in Figure (3.7).

Figure (3.7): Checking optimality by number of Iterations (for MAR)

For NMAR: Finally, for NMAR, as shown in Table (3.5) and Figure (3.8), iteration 10 gives the lowest RMSE, which is (0.1347), while the worst iteration, which yields the maximum RMSE, is iteration 15.


Table (3.5): Checking optimality by number of iterations (for NMAR): NMAR with 10% missing rate, (5, 10, 15, 20, 25, 30, 35, 40, 45, 50) iterations and cluster = 5

Iterations   5        10       15       20       25       30       35       40       45       50
RMSE         0.1376   0.1347   0.1406   0.1387   0.1374   0.1400   0.1394   0.1385   0.1387   0.1405


Figure (3.8): Checking optimality by number of Iterations (for NMAR)

3.5.2 Number of Clusters
The second step in checking optimality is the number of clusters, obtained via FCM applied to the complete dataset. The number of clusters directly affects the accuracy of our imputation algorithm, since the clusters are used in the classification step and classification is a main part of imputing incomplete data in the proposed algorithm. For each of the three missing mechanisms (MCAR, MAR, NMAR) we checked for the optimal number of clusters from 2 to 10 using a 10% missing rate. MCAR, MAR and NMAR are discussed respectively below.

For MCAR:


Table (3.6): Checking optimality by number of clusters (for MCAR)

MCAR, number of clusters   2        3        4        5        6        7        8        9        10
RMSE                       0.1553   0.1423   0.1410   0.1418   0.1482   0.1330   0.1490   0.1456   0.1515

According to Table (3.6), it appears that when the whole data is agglomerated into 7 groups, the RMSE declines to its minimum, which is (0.1330); in contrast, the worst value is obtained when only 2 clusters are used. The above description is summarized in Figure (3.9).

Figure (3.9): Checking optimality by number of clusters (for MCAR)

For MAR:
Table (3.7): Checking optimality by number of clusters (for MAR)

MAR, number of clusters   2        3        4        5        6        7        8        9        10
RMSE                      0.1321   0.1464   0.1514   0.1388   0.1463   0.1463   0.1849   0.1588   0.1448


In Table (3.7) and Figure (3.10), MAR performs best when the data is partitioned into 2 groups; its RMSE is (0.1321).

Figure (3.10): Checking optimality by number of clusters (for MAR)

For NMAR:
Table (3.8): Checking optimality by number of clusters (for NMAR)

NMAR, number of clusters   2         3         4         5         6         7         8         9         10
RMSE                       0.14485   0.14039   0.13419   0.13469   0.14319   0.14915   0.12380   0.13518   0.14078

From Table (3.8) and Figure (3.11), the RMSE results for NMAR lie between [0.12 – 0.15]; the minimum RMSE (0.12) is obtained when the number of clusters is 8.


Figure (3.11): Checking optimality by number of clusters (for NMAR)

3.6 Comparative experiments
In order to make the comparisons as extensive as possible, we carefully selected eight other approaches, which are MMS (Mean Mode Substitution), HDI (Hot Deck Imputation), KNNMI (K Nearest Neighbour Imputation with Mutual Information) [59], FCMOCS (Fuzzy C-Means based on Optimal Completion Strategy) [24], CRI (Clustering-based Random Imputation) [33], CMI (Clustering-based Multiple Imputation) [67], NIIA (Non-Parametric Iterative Imputation Algorithm) [68] and MIGEC (Multiple Imputation algorithm using Gray System Theory and Entropy based on Clustering) [47]. After selecting the optimum number of clusters and iterations for all three types of missingness, we compared the proposed algorithm with these methods at varying missing rates using the RMSE (the average of each RMSE), as displayed in Tables (3.9), (3.11) and (3.13); the differences between our PA and each of the other algorithms are displayed in Tables (3.10), (3.12) and (3.14).


Table (3.9): Comparison between proposed algorithm and other methods for imputation (MCAR)

Missing Rate   MMS     HDI     KNNMI   FCMOCS   NIIA    CMI     CRI     MIGEC   P.A.
5%             0.201   0.197   0.191   0.188    0.179   0.176   0.172   0.174   0.1785
10%            0.203   0.202   0.195   0.189    0.182   0.180   0.179   0.179   0.1330
15%            0.205   0.203   0.196   0.192    0.186   0.181   0.181   0.182   0.1520
20%            0.213   0.205   0.198   0.195    0.188   0.184   0.188   0.189   0.1672

Table (3.10): Difference between proposed algorithm and other algorithms for MCAR

Missing Rate   MMS       HDI       KNNMI     FCMOCS    NIIA      CMI       CRI       MIGEC
5%             0.0225    0.0185    0.0125    0.0095    0.0005    -0.0025   -0.0065   -0.0045
10%            0.07      0.069     0.062     0.056     0.049     0.047     0.046     0.046
15%            0.053     0.051     0.044     0.04      0.034     0.029     0.029     0.03
20%            0.0458    0.0378    0.0308    0.0278    0.0208    0.0168    0.0208    0.0218
Average        0.04783   0.04408   0.03733   0.03333   0.02608   0.02258   0.02233   0.02333

Table (3.11): Comparison between proposed algorithm and other methods for imputation (MAR)

Missing Rate   MMS     HDI     KNNMI   FCMOCS   NIIA    CMI     CRI     MIGEC   P.A.
5%             0.192   0.188   0.186   0.185    0.172   0.171   0.171   0.169   0.1366
10%            0.194   0.196   0.194   0.189    0.176   0.177   0.173   0.172   0.1321
15%            0.204   0.206   0.202   0.192    0.178   0.182   0.185   0.184   0.1493
20%            0.210   0.208   0.204   0.198    0.185   0.188   0.183   0.187   0.1649


Table (3.12): Difference between proposed algorithm and other algorithms for MAR

Missing Rate   MMS       HDI       KNNMI     FCMOCS    NIIA      CMI       CRI       MIGEC
5%             0.0554    0.0514    0.0494    0.0484    0.0354    0.0344    0.0344    0.0324
10%            0.0619    0.0639    0.0619    0.0569    0.0439    0.0449    0.0409    0.0399
15%            0.0547    0.0567    0.0527    0.0427    0.0287    0.0327    0.0357    0.0347
20%            0.0451    0.0431    0.0391    0.0331    0.0201    0.0231    0.0181    0.0221
Average        0.05428   0.05378   0.05078   0.04528   0.03203   0.03378   0.03228   0.03228

Table (3.13): Comparison between proposed algorithm and other methods for imputation (NMAR)

Missing Rate   MMS     HDI     KNNMI   FCMOCS   NIIA    CMI     CRI     MIGEC   P.A.
5%             0.171   0.169   0.168   0.165    0.160   0.159   0.158   0.155   0.1183
10%            0.176   0.172   0.172   0.169    0.163   0.166   0.163   0.157   0.1238
15%            0.183   0.178   0.174   0.171    0.164   0.168   0.167   0.164   0.1601
20%            0.192   0.189   0.180   0.175    0.167   0.169   0.168   0.168   0.1629

Table (3.14): Difference between proposed algorithm and other algorithms for NMAR

Missing Rate   MMS       HDI       KNNMI     FCMOCS    NIIA      CMI       CRI       MIGEC
5%             0.0527    0.0507    0.0497    0.0467    0.0417    0.0407    0.0397    0.0367
10%            0.0522    0.0482    0.0482    0.0452    0.0392    0.0422    0.0392    0.0332
15%            0.0229    0.0179    0.0139    0.0109    0.0039    0.0079    0.0069    0.0039
20%            0.0291    0.0261    0.0171    0.0121    0.0041    0.0061    0.0051    0.0051
Average        0.03923   0.03573   0.03223   0.02873   0.02223   0.02423   0.02273   0.01973


Discussion of the Results: Tables (3.9), (3.11) and (3.13) illustrate some results that we would like to discuss as follows:
i- The outcome results demonstrate that the proposed algorithm performs better than the other eight approaches under all missingness mechanisms at varying missing rates.
ii- Different missing rates have different impacts on imputation accuracy. Generally speaking, the RMSE increases with increasing missing proportion for approximately all methods. This is understandable because as more missingness is introduced into the datasets, more information is lost. However, the nature of the data, outliers and noise also sometimes affect the accuracy of imputation.
iii- In general, the worst RMSE values achieved by the methods are for the MCAR mechanism, followed by MAR and NMAR, respectively.

For further illustration, the results of Tables (3.9), (3.11) and (3.13) are displayed in Figures (3.12), (3.13) and (3.14).

Figure (3.12): Comparison between proposed algorithm (PA) and other methods for imputation (MCAR)



Figure (3.13): Comparison between proposed algorithm (PA) and other methods for imputation (MAR)


Figure (3.14): Comparison between proposed algorithm (PA) and other methods for imputation (NMAR)


3.7 Results of the Proposed Algorithm for Simulated Data
As we did at the beginning for the Wine dataset, we first checked which numbers of clusters and iterations give the optimum value; the same scenario is also required for the simulated dataset.

3.7.1 Number of Iterations
For MCAR:
Table (3.15): Checking optimality by number of iterations (for MCAR): MCAR with 10% missing rate, (5, 10, 15, 20, 25, 30, 35, 40, 45, 50) iterations and cluster = 5

Iterations   5        10       15       20       25       30       35       40       45       50
RMSE         0.1604   0.1601   0.1585   0.1583   0.1575   0.1590   0.1592   0.1595   0.1592   0.1593

From Table (3.15) and Figure (3.15), we can notice that the lowest RMSE, with a value of (0.1575), is achieved at 25 iterations. Also, the RMSE for all iterations lies between [0.15 – 0.16], so the RMSE differs only slightly between the numbers of iterations.

Figure (3.15): Checking optimality by number of Iterations (for MCAR)


For MAR:
Table (3.16): Checking optimality by number of iterations (for MAR): MAR with 10% missing rate, (5, 10, 15, 20, 25, 30, 35, 40, 45, 50) iterations and cluster = 5

Iterations   5        10       15       20       25       30       35       40       45       50
RMSE         0.1569   0.1583   0.1558   0.1572   0.1582   0.1581   0.1576   0.1573   0.1574   0.1574

In Table (3.16), iteration 15 has the lowest RMSE. However, Figure (3.16) clearly shows that the difference between the iterations (from 5 to 50) is very small, since the difference between the maximum and minimum RMSE is just (0.0025).

Figure (3.16): Checking optimality for number of Iterations (for MAR)

For NMAR:
Table (3.17): Checking optimality by number of iterations (for NMAR): NMAR with 10% missing rate, (5, 10, 15, 20, 25, 30, 35, 40, 45, 50) iterations and cluster = 5

Iterations   5        10       15       20       25       30       35       40       45       50
RMSE         0.1908   0.1954   0.1955   0.1946   0.1951   0.1930   0.1938   0.1928   0.1921   0.1922


For NMAR, iteration 5 has the best RMSE, while the worst RMSE is that of iteration 15; however, the difference between them is just (0.004779), as seen in Table (3.17) and Figure (3.17).

Figure (3.17): Checking optimality by number of Iterations (for NMAR)

3.7.2 Number of Clusters
Clustering can actually help improve the accuracy of the prediction by narrowing the potential space of the target value. Since there is a relationship between the RMSE and the number of clusters in this technique, we should discuss which number of groups gives the best result (namely, the minimal value of RMSE).

For MCAR:
Table (3.18): Checking optimality by number of clusters (for MCAR)

MCAR, number of clusters   2        3        4        5        6        7        8        9        10
RMSE                       0.1577   0.1589   0.1599   0.1575   0.1580   0.1590   0.1601   0.1583   0.1562


Figure (3.18): Checking optimality by number of clusters (for MCAR)

As seen in Table (3.18) and Figure (3.18), it can be clearly observed that when the whole data is grouped into 10 clusters, the RMSE of the PA declines to its minimum, and the difference between the maximum and minimum RMSE is just (0.0039).

For MAR:
Table (3.19): Checking optimality by number of clusters (for MAR)

MAR, number of clusters   2        3        4        5        6        7        8        9        10
RMSE                      0.1566   0.1541   0.1558   0.1587   0.1588   0.1588   0.1571   0.1586   0.1536


Figure (3.19): Checking optimality by number of clusters (for MAR)


Table (3.19) and Figure (3.19) show that the same number of clusters, 10, is desirable for MAR.

For NMAR:
Table (3.20): Checking optimality by number of clusters (for NMAR)

NMAR, number of clusters   2        3        4        5        6        7        8        9        10
RMSE                       0.2003   0.1934   0.1937   0.1908   0.2002   0.1808   0.2017   0.1958   0.1974


Figure (3.20): Checking optimality by number of clusters (for NMAR)

From Table (3.20) and Figure (3.20), we see that seven (7) clusters are optimal for NMAR. Finally, the experimental results demonstrate that the proposed algorithm requires a different optimal number of clusters for each mechanism. After selecting the optimum numbers of iterations and clusters, the final results for the simulated dataset are shown in Table (3.21) and Figure (3.21).


Table (3.21): Result of proposed algorithm for simulated data under different mechanisms with varying missing rates

Missing Mechanism   5%         10%        15%           20%
MCAR                0.160187   0.156174   0.160077504   0.160309
MAR                 0.159357   0.153592   0.160294      0.159913
NMAR                0.187617   0.1808     0.2006182     0.20736


Figure (3.21): Result of proposed algorithm for Simulated data under different mechanism with varying missing rates

3.8 Comparison between the Wine and simulated datasets
The results show that our proposed algorithm remains stable as the size of the data increases, which indicates the success of our algorithm when implemented on huge amounts of data in data mining applications, as shown by Figures (3.22), (3.23), (3.24), (3.25), (3.26) and (3.27). We combined the iteration and cluster figures of the Wine dataset and the simulated dataset to show that the results for the simulated dataset are more stable compared to the Wine dataset.


Figure (3.22): comparison between wine and simulated dataset for various iterations (MCAR)


Figure (3.23): comparison between wine and simulated dataset for various iterations (MAR)


Figure (3.24): comparison between wine and simulated dataset for various iterations (NMAR)


Figure (3.25): comparison between wine and simulated dataset for various clusters (MCAR)


Figure (3.26): comparison between wine and simulated dataset for various clusters (MAR)


Figure (3.27): comparison between wine and simulated dataset for various clusters (NMAR)


Figures (3.22), (3.23), (3.24), (3.25), (3.26) and (3.27) show that, in general, the results of the PA for the simulated dataset are more stable than for the Wine dataset under each missingness mechanism. From Table (3.21) and Figure (3.21), we can also observe that, as the size of the data grows, the proposed algorithm starts to become stable under the different missing rates, meaning that the RMSE differs only slightly between the various proportions of missingness; these results show the success of our proposed algorithm. We computed the RMSE interval of the proposed algorithm for the Wine dataset from Tables (3.9), (3.11) and (3.13) and for the simulated dataset from Table (3.21), to confirm that the results of our proposed algorithm are more stable for the simulated dataset than for the Wine dataset. The interval of the Wine data is [0.13 – 0.18] for MCAR, [0.13 – 0.16] for MAR and [0.12 – 0.16] for NMAR, while the interval of the simulated data is [0.15 – 0.16] for MCAR, [0.15 – 0.16] for MAR and [0.18 – 0.21] for NMAR. The width of each interval for the simulated and Wine datasets is displayed in Table (3.22).

Table (3.22): Difference between RMSE of Simulated and Wine dataset

Missing Mechanism   Interval (Wine dataset)   Interval (Simulated dataset)   Difference (Wine dataset)   Difference (Simulated dataset)
MCAR                0.13 - 0.18               0.15 - 0.16                    0.05                        0.01
MAR                 0.13 - 0.16               0.15 - 0.16                    0.03                        0.01
NMAR                0.12 - 0.16               0.18 - 0.21                    0.04                        0.03


3.9 Comparing Algorithms by Number of Iterations
We compare the optimal number of iterations of our proposed algorithm with those of the existing algorithms, as shown in Table (3.23).

Table (3.23): Comparing number of iterations between proposed algorithm and existing algorithms

Algorithm                                 No. of Iterations
NIIA                                      27
CMI                                       22
CRI                                       25
MIGEC                                     18
Proposed algorithm (Wine dataset)         5
Proposed algorithm (Simulated dataset)    15

From Table (3.23), using the MAR mechanism, when (NIIA, CMI, CRI, MIGEC) were applied to the RCSF (Remote Controlling for Spacecraft Flying) dataset, the optimal numbers of iterations for each of them were (27, 22, 25, 18) respectively. But when the proposed algorithm is applied to the Wine and simulated datasets, we see that the best RMSE is obtained with 5 iterations for the Wine dataset and 15 iterations for the simulated dataset. Thus, among these algorithms, our proposed algorithm requires the smallest number of iterations, which tends to reduce the running time of our proposed algorithm for large datasets.


Chapter Four: Conclusions and Future Work

4.1 Conclusions
The problem of incomplete data is one which researchers must handle. Many researchers fail to consider missing values of varying natures in their analyses, treating them as a singular type or not considering the impact of the missing values at all. In this thesis, an extension algorithm based on MIGEC for dealing with incomplete data has been proposed. The experimental results show:
1- Experimental results on the Wine dataset from the University of California Irvine (UCI) repository illustrate the superiority of the proposed algorithm over other imputation methods in the accuracy of imputing missing data under the three different missing types MCAR, MAR and NMAR.
2- The RMSE shows that our proposed algorithm achieves better results (namely, a smaller value of RMSE) than the MIGEC algorithm, with an average absolute difference beyond (0.025108).
3- When calculating GRA on attributes instead of instances, we work with more homogeneous values in comparison with calculating GRA based on instances, and as a result each attribute is assigned to the proper cluster.
4- In general, an increasing proportion of missing instances deteriorates the accuracy of the interpolation in terms of RMSE. This indicates that incomplete values negatively impact the completion; in other words, more available information could promote the precision of the final predictions.
5- The proposed algorithm can handle missing values and performs well with either small or huge amounts of raw data; we can conclude that the proposed algorithm remains stable as the size of the dataset increases, which means our proposed algorithm is suitable for large data repositories.
6- The proposed algorithm reaches its results with fewer imputation iterations in comparison with other algorithms, which means less run time is needed in the case of huge amounts of data in data repositories.
7- The drawback of our proposed algorithm compared with MIGEC appears in cases where there is a large amount of heterogeneity inside the attributes, since GRA in our proposed algorithm depends on attribute values instead of instance values. This conclusion appeared when we ran the algorithm on different simulated data.

4.2 Future Work
1- Working with data mining techniques needs a powerful computer to implement the work quickly and without restrictions. In this thesis, because of computer limitations, we could not increase the size of the simulated dataset, since it would take days to obtain the results of the proposed algorithm on a dataset of vast size.
2- Hybridizing the proposed algorithm with other data mining or statistical techniques such as (Neural Networks, Nearest Neighbor, ...).
3- Extending the proposed algorithm to work with categorical attributes.
4- Noise has a great impact on the effectiveness of imputation techniques, while real-world data often contain much noise; therefore, another preprocessing algorithm could be implemented to clean the data before implementing the PA.
5- Implementing different data mining algorithms, such as association rule mining, on the PA and comparing the results with other existing algorithms.


References: Books: [1] A.Gelman, J.Hill, (2006) “Data analysis using regression and multilevel/hierarchical models”, Cambridge University Press. [2] A. Ghosh , C.Jain, (2005) “Evolutionary Computation in Data Mining”, Springer Verlag Berlin Heidelberg, Printed in Germany. [3]

C. C. Aggarwal , (2015) “Data Classification Algorithms and Applications ” , IBM T. J. Watson Research Center , Yorktown Heights, New York, USA , International Standard Book Number-13: 978-1-46658675-8 .

[4] C. K. Enders, (2010) "Applied Missing Data Analysis", The Guilford Press, New York, London, ISBN 978-1-60623-639-0. [5] D. Pyle, (1999) "Data Preparation for Data Mining", Morgan Kaufmann, San Francisco. [6] E. Alpaydin, (2010) "Introduction to Machine Learning", Second Edition, MIT Press, Cambridge. [7] G.J. McLachlan, K.A. Do, C. Ambroise, (2004) "Analyzing microarray gene expression data", Wiley, New York.


[10] J. Han and M. Kamber, (2011) “Data Mining: Concepts and Techniques”, Morgan Kaufmann,San Francisco . [11] J. Harrell, E. Frank, (2001) “Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis”, Springer-Verlag, New York. [12] L. Symeonidis , A. Mitkas (2005) “Agent Intelligence through Data Mining”, Springer Science+Business Media, Inc. . [13] P.Tan, M.Steinbach, V. Kuma , (2006) “Introduction to Data Mining”, Addison-Wesley. [14] R. Babuška , (2001) “Computational Intelligence In Modeling And Control” , Delft Center for Systems and Control , Delft University of Technology Delft, the Netherlands. [15] R.J.A. Little, D.B. Rubin, (1987) “Statistical Analysis with Missing Data” , 1st edn. Wiley Series in Probability and Statistics, New York. [16] R. Nisbet , J. Elder, G. Miner , (2009) “Handbook of Statistical Analysis and Data Mining Applications” . Academic Press, Boston. [17] R. Wicklin , (2013) “Simulating Data with SAS ”, SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513-2414, USA. [18] S. García, F. Herrera, J. Luengo , (2015) “Data Preprocessing in Data Mining” , Springer International Publishing Switzerland ,London , New York.

Thesis or Dissertation: [19] A.A.Saeed , (2008) “Association Rule Mining for Analyzing AltunMall Market Basket in Sulaimani” , M.Sc. Thesis, University of Sulaimani .


[20] B. M.Bidgoli ,(2004) “ Data Mining For A Web-Based Educational System” , Ph.D. Thesis , Michigan State University . [21] J.A. Boyko, (2013) “Handling Data with Three Types of Missing Values” , Ph.D. Thesis , University of Connecticut . [22] M.A . Janecek , (2009) “ Efficient feature reduction and classification methods”, Ph.D. Thesis ,University of Wien. [23] R. M. Kilany , (2013) “Efficient Classification and Prediction Algorithms for Biomedical Information”, Ph.D. Thesis ,University of Connecticut.

Papers: [24] A.D.Nuovo , (2011) “Missing data analysis with fuzzy C-means: a study of its application in a psychological scenario”, Expert Syst Appl 38(6):6793–6797. [25] A.Farhangfar, L.Kurgan, W.Pedrycz, (2004) “Experimental analysis of methods for imputation of missing values in databases” In: Intelligent computing: theory and applications II ,Orlando, Florida, 12 April 2004. Proceedings of SPIE, vol 5421. SPIE Press, Bellingham, pp 172–182. [26] A.R. Donders, G.J. van der Heijden, T. Stijnen, K.G. Moons , (2006) “Review: a gentle introduction to imputation of missing values”, J Clin Epidemiol 59(10):1087–1091. [27] C. C. Huang and H. M. Lee , (2004) “A grey-based nearest neighbor approach for missing attribute value prediction”, Applied Intelligence, vol. 20, no. 3, pp. 239-252. [28] C.C. Huang, H.M. Lee, (2006) “An instance-based learning approach based on grey relational structure”. Appl Intell 25(3):243– 251,Springer.


[29] C. C. Huang, S. Kulkarni , B. C. Kuo, (2011) “A New Weighted Fuzzy C-Means Clustering Algorithm for Remotely Sensed Image Classification”, IEEE Selected Topics in Signal Processing, vol. 5, no. 3. [30] C. E. Shannon ,(1948) “A mathematical theory of communication” , Bell Syst Tech J 27(3):379–423. [31] C. Enders, S .Dietz, M .Montague, J. Dixon ,(2005) “Modern alternatives for dealing with missing data in special education research” .Adv Learn Behav Disabil 19:101–129. [32] C.Y. J. Peng , J. Zhu , (2008) “Comparison of Two Approaches for Handling Missing Covariates in Logistic Regression” , Educational and Psychological Measurement ,Volume 68 Number 1 , 58-77, 2008 Sage Publications, 10.1177/0013164407305582 . [33] C. Zhang, Y .Qin, X .Zhu, J .Zhang, S .Zhang , (2006) “Clusteringbased missing value imputation for data preprocessing”. In: Proceedings of IEEE international conference on industrial informatics, Singapore, 16–18 Aug 2006, pp 1081–1086 . [34] D.B. Rubin, (1976) “Inference and missing data”, Biometrika,Vol. 63, NO.3,581-592. [35] D. C. Hoang, R. Kumar and S. K. Panda, (2012) “Optimal data aggregation tree in wireless sensor networks based on intelligent water drops algorithm”, IET Wireless Sensor Systems, vol. 2, no. 3 . [36] F. C. S. Liu , (2014)“Using Multiple Imputation for Vote Choice Data: A Comparison across Multiple Imputation Tools” , Institute of Political Science, National Sun Yat-Sen University, Kaohsiung, Taiwan, Open Journal of Political Science, 4, 39-46 . [37] G. Karypis, E. H. Han, and V. Kumar, (1999) “Chameleon: hierarchical clustering using dynamic modeling” IEEE Computer society , vol. 32, no. 8, pp. 68–75.


[38] G. Nagpal, M.Uddin, A.Kaur , (2014) “Grey Relational Effort Analysis Technique Using Regression Methods for Software Estimation” ,The International Arab Journal of Information Technology, Vol. 11, No. 5 . [39] G. Sang , K. Shi, Zh. Liu , L. Gao , (2014) “ Missing Data Imputation Based on Grey System Theory ” , International Journal of Hybrid Information Technology , Vol.7, No.2 ,pp.347-356 . [40] H. Li, C. Zhao, F. Shao, G. Zheng Li, X. Wang ,(2014)“A hybrid imputation approach for microarray missing value estimation ” ,From IEEE International Conference on Bioinformatics and Biomedicine BIBM. [41] H. Wang , S. Wang, (2010) “Mining incomplete survey data through classification”. Knowl.Inf. Syst. 24(2), 221–233 . [42] J.A.Sáez, J.Luengo, F.Herrera, (2013) “Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification” Pattern Recogn. 46(1), 355–364. [43] J.Barnard, X.Meng, (1999) “Applications of multiple imputation in medical studies: from aids to nhanes”, Stat. Methods Med. Res. 8(1), 17–36 . [44] J.Kaiser , (2014 ) “Dealing with Missing Values in Data”, JOURNAL OF SYSTEMS INTEGRATION . [45] J.L. Deng , (1982) “Control problems of grey systems” , Systems and Control Letters, vol.5, pp.288-294 . [46] J.Tian , B. Yu , D. Yu , Sh. Ma, (2012) “A Fuzzy Clustering Approach for Missing Value Imputation with Non-Parameter Outlier Test” State Key Laboratory of Software Development Environment , Beihang University.


[47] J. Tian, B. Yu, D. Yu, Sh. Ma, (2014) "A hybrid multiple imputation algorithm using Gray System Theory and entropy based on clustering", Appl Intell 40:376–388, DOI 10.1007/s10489-013-0469-x, Springer Science+Business Media New York.
[48] K. A. Hallgren, (2013) "Conducting Simulation Studies in the R Programming Environment", Tutorials in Quantitative Methods for Psychology, Vol. 9(2), pp. 43–60.
[49] K. L. Wen, (2004) "The grey system analysis and its application in gas breakdown and var compensator finding (invited paper)", Int. Journal of Computational Cognition, 2(1), 21–44.
[50] L. Altmayer, (2011) "Hot-Deck Imputation: A Simple DATA Step Approach", U.S. Bureau of the Census, Washington, DC, 1999.
[51] Minakshi, R. Vohra, Gimpy, (2014) "Missing Value Imputation in Multi Attribute Data Set", (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5(4), 5315–5321.
[52] M. D. Zio, U. Guarnera, O. Luzi, (2007) "Imputation through finite Gaussian mixture models", Computational Statistics & Data Analysis 51, 5305–5316.
[53] M. Gong, Y. Liang, J. Shi, W. Ma, J. Ma, (2013) "Fuzzy C-Means Clustering With Local Information and Kernel Metric for Image Segmentation", IEEE Transactions on Image Processing, vol. 22, no. 2.
[54] M. L. Zhang, Z. H. Zhou, (2009) "Multi-instance clustering with applications to multi-instance prediction", Appl Intell 31(1):47–68.
[55] M. M. Rahman, D. N. Davis, (2013) "Machine Learning Based Missing Value Imputation Method for Clinical Dataset", Springer.


[56] N. Ankaiah, V. Ravi, (2011) "A Novel Soft Computing Hybrid for Data Imputation", Proceedings of the 7th International Conference on Data Mining (DMIN).
[57] O. B. Shukur, M. H. Lee, (2015) "Imputation of Missing Values in Daily Wind Speed Data Using Hybrid AR-ANN Method", Modern Applied Science, Vol. 9, No. 11, Canadian Center of Science and Education, ISSN 1913-1844, E-ISSN 1913-1852.
[58] O. Harel, X. H. Zhou, (2007) "Multiple imputation: Review of theory, implementation and software", Stat Med 26(16):3057–3077.
[59] P. J. G. Laencina, J. L. S. Gómez, A. R. F. Vidal, M. Verleysen, (2009) "K nearest neighbours with mutual information for simultaneous classification and missing data imputation", Neurocomputing 72, 1483–1493.
[60] R. Hathaway, J. C. Bezdek, (2001) "Fuzzy C-means clustering of incomplete data", IEEE Trans Syst Man Cybern, Part B, Cybern.
[61] R. Sallehuddin, S. M. H. Shamsuddin, S. Z. M. Hashim, (2008) "Grey Relational Analysis And Its Application On Multivariate Time Series", University Technology Malaysia, 81300.
[62] S. Gajawada, D. Toshniwal, (2012) "Missing Value Imputation Method Based on Clustering and Nearest Neighbours", International Journal of Future Computer and Communication, Vol. 1, No. 2.
[63] S. González, Rueda, A. Arcos, (2008) "An improved estimator to analyse missing data", Stat Pap 49(4):791–796.
[64] S. H. Al-Harbi, V. J. Rayward-Smith, (2006) "Adapting k-means for supervised clustering", Appl Intell 24(3):219–226.
[65] S. Parveen, P. Green, (2004) "Speech enhancement with missing data techniques using recurrent neural networks". In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP '04), vol 1, pp 733–738.
[66] S. Zhang, (2011) "Shell-neighbor method and its application in missing data imputation", Appl Intell 35(1):123–133.
[67] S. Zhang, J. Zhang, X. Zhu, Y. Qin, C. Zhang, (2008) "Missing value imputation based on data clustering". In: Transactions on Computational Science I. Lecture Notes in Computer Science, vol 4750, pp 128–138.
[68] S. Zhang, Z. Jin, X. Zhu, (2011) "Missing data imputation by utilizing information within incomplete instances", J Syst Softw 84(3):452–459.
[69] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, (1996) "From data mining to knowledge discovery", American Association for Artificial Intelligence, San Francisco, Vol. 17, No. 3.
[70] W. Ziliang, L. Sifeng, (2004) "Extension of Grey Superiority Analysis", in Liu, S. F. et al., Grey System Theory and Its Application, Science Press, Beijing, China, 616–621, Vol. 1.
[71] X. Y. Zhou, J. S. Lim, (2014) "Replace Missing Values with EM algorithm based on GMM and Naïve Bayesian", International Journal of Software Engineering and Its Applications, Vol. 8, No. 5, pp. 177–188.
[72] Y. Lu, T. Ma, C. Yin, X. Xie, W. Tian, S. M. Zhong, (2013) "Implementation of the Fuzzy C-Means Clustering Algorithm in Meteorological Data", International Journal of Database Theory and Application, Vol. 6, No. 6.
[73] Z. Chi, F. H. Cai, J. Kai, Y. Ting, (2013) "The nearest neighbor algorithm of filling missing data based on cluster analysis", Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering.


Appendix A
Wine dataset from the UCI (University of California, Irvine) Machine Learning Repository.

Attributes, in the order given for each record below:
Alcohol, Malic, Ash, Alcalinity, Magnesium, Phenols, Flavanoids, Non flavanoids, Proanthocyanins, Color, Hue, Dilution, Proline

Each record of the dataset is listed as 13 consecutive values in this order.

14.23

1.71

2.43

15.6

127

2.8

3.06

0.28

2.29

5.64

1.04

3.92

1065

13.2

1.78

2.14

11.2

100

2.65

2.76

0.26

1.28

4.38

1.05

3.4

1050

13.16

2.36

2.67

18.6

101

2.8

3.24

0.3

2.81

5.68

1.03

3.17

1185

14.37

1.95

2.5

16.8

113

3.85

3.49

0.24

2.18

7.8

0.86

3.45

1480

13.24

2.59

2.87

21

118

2.8

2.69

0.39

1.82

4.32

1.04

2.93

735

14.2

1.76

2.45

15.2

112

3.27

3.39

0.34

1.97

6.75

1.05

2.85

1450

14.39

1.87

2.45

14.6

96

2.5

2.52

0.3

1.98

5.25

1.02

3.58

1290

14.06

2.15

2.61

17.6

121

2.6

2.51

0.31

1.25

5.05

1.06

3.58

1295

14.83

1.64

2.17

14

97

2.8

2.98

0.29

1.98

5.2

1.08

2.85

1045

13.86

1.35

2.27

16

98

2.98

3.15

0.22

1.85

7.22

1.01

3.55

1045

14.1

2.16

2.3

18

105

2.95

3.32

0.22

2.38

5.75

1.25

3.17

1510

14.12

1.48

2.32

16.8

95

2.2

2.43

0.26

1.57

5

1.17

2.82

1280

13.75

1.73

2.41

16

89

2.6

2.76

0.29

1.81

5.6

1.15

2.9

1320

14.75

1.73

2.39

11.4

91

3.1

3.69

0.43

2.81

5.4

1.25

2.73

1150

14.38

1.87

2.38

12

102

3.3

3.64

0.29

2.96

7.5

1.2

3

1547

13.63

1.81

2.7

17.2

112

2.85

2.91

0.3

1.46

7.3

1.28

2.88

1310

14.3

1.92

2.72

20

120

2.8

3.14

0.33

1.97

6.2

1.07

2.65

1280

13.83

1.57

2.62

20

115

2.95

3.4

0.4

1.72

6.6

1.13

2.57

1130

14.19

1.59

2.48

16.5

108

3.3

3.93

0.32

1.86

8.7

1.23

2.82

1680

13.64

3.1

2.56

15.2

116

2.7

3.03

0.17

1.66

5.1

0.96

3.36

845

14.06

1.63

2.28

16

126

3

3.17

0.24

2.1

5.65

1.09

3.71

780

12.93

3.8

2.65

18.6

102

2.41

2.41

0.25

1.98

4.5

1.03

3.52

770

13.71

1.86

2.36

16.6

101

2.61

2.88

0.27

1.69

3.8

1.11

4

1035

12.85

1.6

2.52

17.8

95

2.48

2.37

0.26

1.46

3.93

1.09

3.63

1015

13.5

1.81

2.61

20

96

2.53

2.61

0.28

1.66

3.52

1.12

3.82

845

13.05

2.05

3.22

25

124

2.63

2.68

0.47

1.92

3.58

1.13

3.2

830

13.39

1.77

2.62

16.1

93

2.85

2.94

0.34

1.45

4.8

0.92

3.22

1195

13.3

1.72

2.14

17

94

2.4

2.19

0.27

1.35

3.95

1.02

2.77

1285

13.87

1.9

2.8

19.4

107

2.95

2.97

0.37

1.76

4.5

1.25

3.4

915

14.02

1.68

2.21

16

96

2.65

2.33

0.26

1.98

4.7

1.04

3.59

1035

13.73

1.5

2.7

22.5

101

3

3.25

0.29

2.38

5.7

1.19

2.71

1285


13.58

1.66

2.36

19.1

106

2.86

3.19

0.22

1.95

6.9

1.09

2.88

1515

13.68

1.83

2.36

17.2

104

2.42

2.69

0.42

1.97

3.84

1.23

2.87

990

13.76

1.53

2.7

19.5

132

2.95

2.74

0.5

1.35

5.4

1.25

3

1235

13.51

1.8

2.65

19

110

2.35

2.53

0.29

1.54

4.2

1.1

2.87

1095

13.48

1.81

2.41

20.5

100

2.7

2.98

0.26

1.86

5.1

1.04

3.47

920

13.28

1.64

2.84

15.5

110

2.6

2.68

0.34

1.36

4.6

1.09

2.78

880

13.05

1.65

2.55

18

98

2.45

2.43

0.29

1.44

4.25

1.12

2.51

1105

13.07

1.5

2.1

15.5

98

2.4

2.64

0.28

1.37

3.7

1.18

2.69

1020

14.22

3.99

2.51

13.2

128

3

3.04

0.2

2.08

5.1

0.89

3.53

760

13.56

1.71

2.31

16.2

117

3.15

3.29

0.34

2.34

6.13

0.95

3.38

795

13.41

3.84

2.12

18.8

90

2.45

2.68

0.27

1.48

4.28

0.91

3

1035

13.88

1.89

2.59

15

101

3.25

3.56

0.17

1.7

5.43

0.88

3.56

1095

13.24

3.98

2.29

17.5

103

2.64

2.63

0.32

1.66

4.36

0.82

3

680

13.05

1.77

2.1

17

107

3

3

0.28

2.03

5.04

0.88

3.35

885

14.21

4.04

2.44

18.9

111

2.85

2.65

0.3

1.25

5.24

0.87

3.33

1080

14.38

3.59

2.28

16

102

3.25

3.17

0.27

2.19

4.9

1.04

3.44

1065

13.9

1.68

2.12

16

101

3.1

3.39

0.21

2.14

6.1

0.91

3.33

985

14.1

2.02

2.4

18.8

103

2.75

2.92

0.32

2.38

6.2

1.07

2.75

1060

13.94

1.73

2.27

17.4

108

2.88

3.54

0.32

2.08

8.9

1.12

3.1

1260

13.05

1.73

2.04

12.4

92

2.72

3.27

0.17

2.91

7.2

1.12

2.91

1150

13.83

1.65

2.6

17.2

94

2.45

2.99

0.22

2.29

5.6

1.24

3.37

1265

13.82

1.75

2.42

14

111

3.88

3.74

0.32

1.87

7.05

1.01

3.26

1190

13.77

1.9

2.68

17.1

115

3

2.79

0.39

1.68

6.3

1.13

2.93

1375

13.74

1.67

2.25

16.4

118

2.6

2.9

0.21

1.62

5.85

0.92

3.2

1060

13.56

1.73

2.46

20.5

116

2.96

2.78

0.2

2.45

6.25

0.98

3.03

1120

14.22

1.7

2.3

16.3

118

3.2

3

0.26

2.03

6.38

0.94

3.31

970

13.29

1.97

2.68

16.8

102

3

3.23

0.31

1.66

6

1.07

2.84

1270

13.72

1.43

2.5

16.7

108

3.4

3.67

0.19

2.04

6.8

0.89

2.87

1285

12.37

0.94

1.36

10.6

88

1.98

0.57

0.28

0.42

1.95

1.05

1.82

520

12.33

1.1

2.28

16

101

2.05

1.09

0.63

0.41

3.27

1.25

1.67

680

12.64

1.36

2.02

16.8

100

2.02

1.41

0.53

0.62

5.75

0.98

1.59

450

13.67

1.25

1.92

18

94

2.1

1.79

0.32

0.73

3.8

1.23

2.46

630

12.37

1.13

2.16

19

87

3.5

3.1

0.19

1.87

4.45

1.22

2.87

420

12.17

1.45

2.53

19

104

1.89

1.75

0.45

1.03

2.95

1.45

2.23

355

12.37

1.21

2.56

18.1

98

2.42

2.65

0.37

2.08

4.6

1.19

2.3

678

13.11

1.01

1.7

15

78

2.98

3.18

0.26

2.28

5.3

1.12

3.18

502

12.37

1.17

1.92

19.6

78

2.11

2

0.27

1.04

4.68

1.12

3.48

510

13.34

0.94

2.36

17

110

2.53

1.3

0.55

0.42

3.17

1.02

1.93

750

12.21

1.19

1.75

16.8

151

1.85

1.28

0.14

2.5

2.85

1.28

3.07

718

12.29

1.61

2.21

20.4

103

1.1

1.02

0.37

1.46

3.05

0.906

1.82

870


13.86

1.51

2.67

25

86

2.95

2.86

0.21

1.87

3.38

1.36

3.16

410

13.49

1.66

2.24

24

87

1.88

1.84

0.27

1.03

3.74

0.98

2.78

472

12.99

1.67

2.6

30

139

3.3

2.89

0.21

1.96

3.35

1.31

3.5

985

11.96

1.09

2.3

21

101

3.38

2.14

0.13

1.65

3.21

0.99

3.13

886

11.66

1.88

1.92

16

97

1.61

1.57

0.34

1.15

3.8

1.23

2.14

428

13.03

0.9

1.71

16

86

1.95

2.03

0.24

1.46

4.6

1.19

2.48

392

11.84

2.89

2.23

18

112

1.72

1.32

0.43

0.95

2.65

0.96

2.52

500

12.33

0.99

1.95

14.8

136

1.9

1.85

0.35

2.76

3.4

1.06

2.31

750

12.7

3.87

2.4

23

101

2.83

2.55

0.43

1.95

2.57

1.19

3.13

463

12

0.92

2

19

86

2.42

2.26

0.3

1.43

2.5

1.38

3.12

278

12.72

1.81

2.2

18.8

86

2.2

2.53

0.26

1.77

3.9

1.16

3.14

714

12.08

1.13

2.51

24

78

2

1.58

0.4

1.4

2.2

1.31

2.72

630

13.05

3.86

2.32

22.5

85

1.65

1.59

0.61

1.62

4.8

0.84

2.01

515

11.84

0.89

2.58

18

94

2.2

2.21

0.22

2.35

3.05

0.79

3.08

520

12.67

0.98

2.24

18

99

2.2

1.94

0.3

1.46

2.62

1.23

3.16

450

12.16

1.61

2.31

22.8

90

1.78

1.69

0.43

1.56

2.45

1.33

2.26

495

11.65

1.67

2.62

26

88

1.92

1.61

0.4

1.34

2.6

1.36

3.21

562

11.64

2.06

2.46

21.6

84

1.95

1.69

0.48

1.35

2.8

1

2.75

680

12.08

1.33

2.3

23.6

70

2.2

1.59

0.42

1.38

1.74

1.07

3.21

625

12.08

1.83

2.32

18.5

81

1.6

1.5

0.52

1.64

2.4

1.08

2.27

480

12

1.51

2.42

22

86

1.45

1.25

0.5

1.63

3.6

1.05

2.65

450

12.69

1.53

2.26

20.7

80

1.38

1.46

0.58

1.62

3.05

0.96

2.06

495

12.29

2.83

2.22

18

88

2.45

2.25

0.25

1.99

2.15

1.15

3.3

290

11.62

1.99

2.28

18

98

3.02

2.26

0.17

1.35

3.25

1.16

2.96

345

12.47

1.52

2.2

19

162

2.5

2.27

0.32

3.28

2.6

1.16

2.63

937

11.81

2.12

2.74

21.5

134

1.6

0.99

0.14

1.56

2.5

0.95

2.26

625

12.29

1.41

1.98

16

85

2.55

2.5

0.29

1.77

2.9

1.23

2.74

428

12.37

1.07

2.1

18.5

88

3.52

3.75

0.24

1.95

4.5

1.04

2.77

660

12.29

3.17

2.21

18

88

2.85

2.99

0.45

2.81

2.3

1.42

2.83

406

12.08

2.08

1.7

17.5

97

2.23

2.17

0.26

1.4

3.3

1.27

2.96

710

12.6

1.34

1.9

18.5

88

1.45

1.36

0.29

1.35

2.45

1.04

2.77

562

12.34

2.45

2.46

21

98

2.56

2.11

0.34

1.31

2.8

0.8

3.38

438

11.82

1.72

1.88

19.5

86

2.5

1.64

0.37

1.42

2.06

0.94

2.44

415

12.51

1.73

1.98

20.5

85

2.2

1.92

0.32

1.48

2.94

1.04

3.57

672

12.42

2.55

2.27

22

90

1.68

1.84

0.66

1.42

2.7

0.86

3.3

315

12.25

1.73

2.12

19

80

1.65

2.03

0.37

1.63

3.4

1

3.17

510

12.72

1.75

2.28

22.5

84

1.38

1.76

0.48

1.63

3.3

0.88

2.42

488

12.22

1.29

1.94

19

92

2.36

2.04

0.39

2.08

2.7

0.86

3.02

312

11.61

1.35

2.7

20

94

2.74

2.92

0.29

2.49

2.65

0.96

3.26

680

11.46

3.74

1.82

19.5

107

3.18

2.58

0.24

3.58

2.9

0.75

2.81

562


12.52

2.43

2.17

21

88

2.55

2.27

0.26

1.22

2

0.9

2.78

325

11.76

2.68

2.92

20

103

1.75

2.03

0.6

1.05

3.8

1.23

2.5

607

11.41

0.74

2.5

21

88

2.48

2.01

0.42

1.44

3.08

1.1

2.31

434

12.08

1.39

2.5

22.5

84

2.56

2.29

0.43

1.04

2.9

0.93

3.19

385

11.03

1.51

2.2

21.5

85

2.46

2.17

0.52

2.01

1.9

1.71

2.87

407

11.82

1.47

1.99

20.8

86

1.98

1.6

0.3

1.53

1.95

0.95

3.33

495

12.42

1.61

2.19

22.5

108

2

2.09

0.34

1.61

2.06

1.06

2.96

345

12.77

3.43

1.98

16

80

1.63

1.25

0.43

0.83

3.4

0.7

2.12

372

12

3.43

2

19

87

2

1.64

0.37

1.87

1.28

0.93

3.05

564

11.45

2.4

2.42

20

96

2.9

2.79

0.32

1.83

3.25

0.8

3.39

625

11.56

2.05

3.23

28.5

119

3.18

5.08

0.47

1.87

6

0.93

3.69

465

12.42

4.43

2.73

26.5

102

2.2

2.13

0.43

1.71

2.08

0.92

3.12

365

13.05

5.8

2.13

21.5

86

2.62

2.65

0.3

2.01

2.6

0.73

3.1

380

11.87

4.31

2.39

21

82

2.86

3.03

0.21

2.91

2.8

0.75

3.64

380

12.07

2.16

2.17

21

85

2.6

2.65

0.37

1.35

2.76

0.86

3.28

378

12.43

1.53

2.29

21.5

86

2.74

3.15

0.39

1.77

3.94

0.69

2.84

352

11.79

2.13

2.78

28.5

92

2.13

2.24

0.58

1.76

3

0.97

2.44

466

12.37

1.63

2.3

24.5

88

2.22

2.45

0.4

1.9

2.12

0.89

2.78

342

12.04

4.3

2.38

22

80

2.1

1.75

0.42

1.35

2.6

0.79

2.57

580

12.86

1.35

2.32

18

122

1.51

1.25

0.21

0.94

4.1

0.76

1.29

630

12.88

2.99

2.4

20

104

1.3

1.22

0.24

0.83

5.4

0.74

1.42

530

12.81

2.31

2.4

24

98

1.15

1.09

0.27

0.83

5.7

0.66

1.36

560

12.7

3.55

2.36

21.5

106

1.7

1.2

0.17

0.84

5

0.78

1.29

600

12.51

1.24

2.25

17.5

85

2

0.58

0.6

1.25

5.45

0.75

1.51

650

12.6

2.46

2.2

18.5

94

1.62

0.66

0.63

0.94

7.1

0.73

1.58

695

12.25

4.72

2.54

21

89

1.38

0.47

0.53

0.8

3.85

0.75

1.27

720

12.53

5.51

2.64

25

96

1.79

0.6

0.63

1.1

5

0.82

1.69

515

13.49

3.59

2.19

19.5

88

1.62

0.48

0.58

0.88

5.7

0.81

1.82

580

12.84

2.96

2.61

24

101

2.32

0.6

0.53

0.81

4.92

0.89

2.15

590

12.93

2.81

2.7

21

96

1.54

0.5

0.53

0.75

4.6

0.77

2.31

600

13.36

2.56

2.35

20

89

1.4

0.5

0.37

0.64

5.6

0.7

2.47

780

13.52

3.17

2.72

23.5

97

1.55

0.52

0.5

0.55

4.35

0.89

2.06

520

13.62

4.95

2.35

20

92

2

0.8

0.47

1.02

4.4

0.91

2.05

550

12.25

3.88

2.2

18.5

112

1.38

0.78

0.29

1.14

8.21

0.65

2

855

13.16

3.57

2.15

21

102

1.5

0.55

0.43

1.3

4

0.6

1.68

830

13.88

5.04

2.23

20

80

0.98

0.34

0.4

0.68

4.9

0.58

1.33

415

12.87

4.61

2.48

21.5

86

1.7

0.65

0.47

0.86

7.65

0.54

1.86

625

13.32

3.24

2.38

21.5

92

1.93

0.76

0.45

1.25

8.42

0.55

1.62

650

13.08

3.9

2.36

21.5

113

1.41

1.39

0.34

1.14

9.4

0.57

1.33

550

13.5

3.12

2.62

24

123

1.4

1.57

0.22

1.25

8.6

0.59

1.3

500


12.79

2.67

2.48

22

112

1.48

1.36

0.24

1.26

10.8

0.48

1.47

480

13.11

1.9

2.75

25.5

116

2.2

1.28

0.26

1.56

7.1

0.61

1.33

425

13.23

3.3

2.28

18.5

98

1.8

0.83

0.61

1.87

10.52

0.56

1.51

675

12.58

1.29

2.1

20

103

1.48

0.58

0.53

1.4

7.6

0.58

1.55

640

13.17

5.19

2.32

22

93

1.74

0.63

0.61

1.55

7.9

0.6

1.48

725

13.84

4.12

2.38

19.5

89

1.8

0.83

0.48

1.56

9.01

0.57

1.64

480

12.45

3.03

2.64

27

97

1.9

0.58

0.63

1.14

7.5

0.67

1.73

880

14.34

1.68

2.7

25

98

2.8

1.31

0.53

2.7

13

0.57

1.96

660

13.48

1.67

2.64

22.5

89

2.6

1.1

0.52

2.29

11.75

0.57

1.78

620

12.36

3.83

2.38

21

88

2.3

0.92

0.5

1.04

7.65

0.56

1.58

520

13.69

3.26

2.54

20

107

1.83

0.56

0.5

0.8

5.88

0.96

1.82

680

12.85

3.27

2.58

22

106

1.65

0.6

0.6

0.96

5.58

0.87

2.11

570

12.96

3.45

2.35

18.5

106

1.39

0.7

0.4

0.94

5.28

0.68

1.75

675

13.78

2.76

2.3

22

90

1.35

0.68

0.41

1.03

9.58

0.7

1.68

615

13.73

4.36

2.26

22.5

88

1.28

0.47

0.52

1.15

6.62

0.78

1.75

520

13.45

3.7

2.6

23

111

1.7

0.92

0.43

1.46

10.68

0.85

1.56

695

12.82

3.37

2.3

19.5

88

1.48

0.66

0.4

0.97

10.26

0.72

1.75

685

13.58

2.58

2.69

24.5

105

1.55

0.84

0.39

1.54

8.66

0.74

1.8

750

13.4

4.6

2.86

25

112

1.98

0.96

0.27

1.11

8.5

0.67

1.92

630

12.2

3.03

2.32

19

96

1.25

0.49

0.4

0.73

5.5

0.66

1.83

510

12.77

2.39

2.28

19.5

86

1.39

0.51

0.48

0.64

9.899999

0.57

1.63

470

14.16

2.51

2.48

20

91

1.68

0.7

0.44

1.24

9.7

0.62

1.71

660

13.71

5.65

2.45

20.5

95

1.68

0.61

0.52

1.06

7.7

0.64

1.74

740

13.4

3.91

2.48

23

102

1.8

0.75

0.43

1.41

7.3

0.7

1.56

750

13.27

4.28

2.26

20

120

1.59

0.69

0.43

1.35

10.2

0.59

1.56

835

13.17

2.59

2.37

20

120

1.65

0.68

0.53

1.46

9.3

0.6

1.62

840

14.13

4.1

2.74

24.5

96

2.05

0.76

0.56

1.35

9.2

0.61

1.6

560


Appendix B
Results for the Wine dataset

Table B1: Checking iteration optimality for MCAR (10% missing rate; iterations 5, 10, 15, 20, 25, 30, 35, 40, 45, 50; 5 clusters)

Iterations

5

10

15

20

25

30

0.119995

0.119995

0.119995

0.119995

0.119995

0.119995

0.157127

0.157127

0.157127

0.157127

0.157127

0.144542

0.144542

0.144542

0.144542

0.163347

0.163347

0.163347

0.123995

0.123995

0.141801

35

40

45

50

0.119995

0.119995

0.119995

0.119995

0.157127

0.157127

0.157127

0.157127

0.157127

0.144542

0.144542

0.144542

0.144542

0.144542

0.144542

0.163347

0.163347

0.163347

0.163347

0.163347

0.163347

0.163347

0.123995

0.123995

0.123995

0.123995

0.123995

0.123995

0.123995

0.123995

0.145956

0.145956

0.145956

0.145956

0.145956

0.145956

0.145956

0.145956

0.145956

0.219304

0.219304

0.219304

0.219304

0.219304

0.219304

0.219304

0.219304

0.219304

0.135228

0.135228

0.135228

0.135228

0.135228

0.135228

0.135228

0.135228

0.135228

0.198424

0.198424

0.198424

0.198424

0.198424

0.198424

0.198424

0.198424

0.198424

0.154646

0.154646

0.154646

0.154646

0.154646

0.154646

0.154646

0.154646

0.154646

0.156257

0.144885

0.144885

0.144885

0.144885

0.144885

0.144885

0.144885

0.144885

0.200714

0.200714

0.200714

0.200714

0.200714

0.200714

0.200714

0.200714

0.177991

0.177991

0.177991

0.177991

0.177991

0.177991

0.177991

0.177991

0.132489

0.132489

0.132489

0.132489

0.132489

0.132489

0.132489

0.132489

0.143639

0.143639

0.143639

0.143639

0.143639

0.143639

0.143639

0.143639

0.157485

0.195644

0.195644

0.195644

0.195644

0.195644

0.195644

0.195644

0.208403

0.208403

0.208403

0.208403

0.208403

0.208403

0.208403

0.161633

0.161633

0.161633

0.161633

0.161633

0.161633

0.161633

0.192958

0.192958

0.192958

0.192958

0.192958

0.192958

0.192958

0.178453

0.178453

0.178453

0.178453

0.178453

0.178453

0.178453

0.164969

0.139598

0.139598

0.139598

0.139598

0.139598

0.139598

0.175441

0.175441

0.175441

0.175441

0.175441

0.175441

0.194173

0.194173

0.194173

0.194173

0.194173

0.194173

0.111892

0.111892

0.111892

0.111892

0.111892

0.111892

0.12863

0.12863

0.12863

0.12863

0.12863

0.12863

0.161964

0.188968

0.188968

0.188968

0.188968

0.188968

0.12392

0.12392

0.12392

0.12392

0.12392

0.181677

0.181677

0.181677

0.181677

0.181677

0.128676

0.128676

0.128676

0.128676

0.128676

0.17699

0.17699

0.17699

0.17699

0.17699

0.161645

0.152143

0.152143

0.152143

0.152143

0.173484

0.173484

0.173484

0.173484

0.2229

0.2229

0.2229

0.2229

0.1616

0.1616

0.1616

0.1616

0.170026

0.170026

0.170026

0.170026

0.1637

0.129903

0.129903

0.129903

0.096851

0.096851

0.096851

0.12573

0.12573

0.12573

0.151848

0.151848

0.151848

0.168106

0.168106

0.168106

0.160048

0.178388

0.178388

0.158312

0.158312

0.149087

0.149087

0.147104

0.147104

0.119003

0.119003

0.158974

0.136323 0.158392 0.19003 0.171362 0.142905 0.159057

Table B2: Checking iteration optimality for MAR (10% missing rate; iterations 5, 10, 15, 20, 25, 30, 35, 40, 45, 50; 5 clusters)

Iterations

5

10

15

20

25

30

35

0.116424

0.116424

0.116424

0.116424

0.116424

0.116424

0.116424

0.160154

0.160154

0.160154

0.160154

0.160154

0.160154

0.119233

0.119233

0.119233

0.119233

0.119233

0.165774

0.165774

0.165774

0.165774

0.132559

0.132559

0.132559

0.132559

0.138829

0.158503

0.158503

0.184031

40

45

50

0.116424

0.116424

0.116424

0.160154

0.160154

0.160154

0.160154

0.119233

0.119233

0.119233

0.119233

0.119233

0.165774

0.165774

0.165774

0.165774

0.165774

0.165774

0.132559

0.132559

0.132559

0.132559

0.132559

0.132559

0.158503

0.158503

0.158503

0.158503

0.158503

0.158503

0.158503

0.184031

0.184031

0.184031

0.184031

0.184031

0.184031

0.184031

0.184031

0.189875

0.189875

0.189875

0.189875

0.189875

0.189875

0.189875

0.189875

0.189875

0.11239

0.11239

0.11239

0.11239

0.11239

0.11239

0.11239

0.11239

0.11239

0.164468

0.164468

0.164468

0.164468

0.164468

0.164468

0.164468

0.164468

0.164468

0.150341

0.137727

0.137727

0.137727

0.137727

0.137727

0.137727

0.137727

0.137727

0.1798

0.1798

0.1798

0.1798

0.1798

0.1798

0.1798

0.1798

0.155587

0.155587

0.155587

0.155587

0.155587

0.155587

0.155587

0.155587

0.154735

0.154735

0.154735

0.154735

0.154735

0.154735

0.154735

0.154735

0.235491

0.235491

0.235491

0.235491

0.235491

0.235491

0.235491

0.235491

0.157783

0.154625

0.154625

0.154625

0.154625

0.154625

0.154625

0.154625

0.152466

0.152466

0.152466

0.152466

0.152466

0.152466

0.152466

0.139568

0.139568

0.139568

0.139568

0.139568

0.139568

0.139568

0.242568

0.242568

0.242568

0.242568

0.242568

0.242568

0.242568

0.179524

0.179524

0.179524

0.179524

0.179524

0.179524

0.179524

0.161775

0.142842

0.142842

0.142842

0.142842

0.142842

0.142842

0.166338

0.166338

0.166338

0.166338

0.166338

0.166338

0.203657

0.203657

0.203657

0.203657

0.203657

0.203657

0.132613

0.132613

0.132613

0.132613

0.132613

0.132613

0.182556

0.182556

0.182556

0.182556

0.182556

0.182556

0.16254

0.138687

0.138687

0.138687

0.138687

0.138687

0.133001

0.133001

0.133001

0.133001

0.133001

0.233413

0.233413

0.233413

0.233413

0.233413

0.144331

0.144331

0.144331

0.144331

0.144331

0.149607

0.149607

0.149607

0.149607

0.149607

0.162085

0.205097

0.205097

0.205097

0.205097

0.209826

0.209826

0.209826

0.209826

0.209837

0.209837

0.209837

0.209837

0.150513

0.150513

0.150513

0.150513

0.130369

0.130369

0.130369

0.130369

0.164805

0.167136

0.167136

0.167136

0.22898

0.22898

0.22898

0.121962

0.121962

0.121962

0.221249

0.221249

0.221249

0.188631

0.188631

0.188631

0.167404

0.191256

0.191256

0.140439

0.140439

0.164967

0.164967

0.19752

0.19752

0.229722

0.229722

0.169334

0.137825 0.126992 0.118114 0.159787 0.136048 0.165976

Table B3: Checking iteration optimality for NMAR (10% missing rate; iterations 5, 10, 15, 20, 25, 30, 35, 40, 45, 50; 5 clusters)

Iterations

5

10

15

20

25

30

35

0.101827

0.101827

0.101827

0.101827

0.101827

0.101827

0.101827

0.148259

0.148259

0.148259

0.148259

0.148259

0.148259

0.134618

0.134618

0.134618

0.134618

0.134618

0.134618

40

45

50

0.101827

0.101827

0.101827

0.148259

0.148259

0.148259

0.148259

0.134618

0.134618

0.134618

0.134618


0.171184

0.171184

0.171184

0.171184

0.171184

0.171184

0.171184

0.171184

0.171184

0.171184

0.132262

0.132262

0.132262

0.132262

0.132262

0.132262

0.132262

0.132262

0.132262

0.132262

0.13763

0.154081

0.154081

0.154081

0.154081

0.154081

0.154081

0.154081

0.154081

0.154081

0.132

0.132

0.132

0.132

0.132

0.132

0.132

0.132

0.132

0.101074

0.101074

0.101074

0.101074

0.101074

0.101074

0.101074

0.101074

0.101074

0.12951

0.12951

0.12951

0.12951

0.12951

0.12951

0.12951

0.12951

0.12951

0.142087

0.142087

0.142087

0.142087

0.142087

0.142087

0.142087

0.142087

0.142087

0.13469

0.171737

0.171737

0.171737

0.171737

0.171737

0.171737

0.171737

0.171737

0.144791

0.144791

0.144791

0.144791

0.144791

0.144791

0.144791

0.144791

0.151151

0.151151

0.151151

0.151151

0.151151

0.151151

0.151151

0.151151

0.153515

0.153515

0.153515

0.153515

0.153515

0.153515

0.153515

0.153515

0.140745

0.140745

0.140745

0.140745

0.140745

0.140745

0.140745

0.140745

0.140589

0.125031

0.125031

0.125031

0.125031

0.125031

0.125031

0.125031

0.144125

0.144125

0.144125

0.144125

0.144125

0.144125

0.144125

0.134358

0.134358

0.134358

0.134358

0.134358

0.134358

0.134358

0.154087

0.154087

0.154087

0.154087

0.154087

0.154087

0.154087

0.107476

0.107476

0.107476

0.107476

0.107476

0.107476

0.107476

0.138696

0.160749

0.160749

0.160749

0.160749

0.160749

0.160749

0.123673

0.123673

0.123673

0.123673

0.123673

0.123673

0.111506

0.111506

0.111506

0.111506

0.111506

0.111506

0.132467

0.132467

0.132467

0.132467

0.132467

0.132467

0.132824

0.132824

0.132824

0.132824

0.132824

0.132824

0.137405

0.201953

0.201953

0.201953

0.201953

0.201953

0.130917

0.130917

0.130917

0.130917

0.130917

0.113388

0.113388

0.113388

0.113388

0.113388

0.13169

0.13169

0.13169

0.13169

0.13169

0.185789

0.185789

0.185789

0.185789

0.185789

0.139962

0.158892

0.158892

0.158892

0.158892

0.139257

0.139257

0.139257

0.139257

0.125552

0.125552

0.125552

0.125552

0.135817

0.135817

0.135817

0.135817

0.119563

0.119563

0.119563

0.119563

0.13937

0.128803

0.128803

0.128803

0.121294

0.121294

0.121294

0.147059

0.147059

0.147059

0.143882

0.143882

0.143882

0.122895

0.122895

0.122895

0.138547

0.152443

0.152443

0.129408

0.129408

0.157881

0.157881

0.131696

0.131696

0.126149

0.126149

0.138655

0.149964 0.130314

0.166372 0.165953 0.174888 0.140539

Table B4: Checking cluster optimality for MCAR (clusters from 2 to 10; 10% missing rate; 5 iterations)

cluster=2

cluster=3

cluster=4

cluster=5

cluster=6

cluster=7

0.119995

0.119995

0.119995

0.119995

0.119995

0.1644784

0.1310551

0.143052

0.157127

0.1707968

0.1475838

0.147584

0.144542

0.1516658

0.144586

0.151666

0.169648

0.1681193

0.1553168

0.1422678

cluster=8

cluster=9

cluster=10

0.119995

0.119995

0.119995

0.119995

0.163305

0.136033

0.1532939

0.1861692

0.1838858

0.165234

0.1492543

0.1891945

0.1435476

0.1430013

0.163347

0.140378

0.1258645

0.1554044

0.1304298

0.1618359

0.142791

0.123995

0.152117

0.1338744

0.1270823

0.1481036

0.1487056

0.141017

0.141801

0.148206

0.1330042

0.148994

0.145649

0.15148472

Table B5: Checking cluster optimality for MAR (clusters from 2 to 10; 10% missing rate; 5 iterations)

cluster=2

cluster=3

cluster=4

cluster=5

cluster=6

cluster=7

cluster=8

cluster=9

cluster=10

0.1164238

0.1164238

0.1164238

0.1164238

0.1164238

0.1164238

0.1164238

0.1164238

0.1164238

0.1111938

0.1042875

0.1988778

0.1601544

0.1305181

0.1305181

0.1810234

0.1538771

0.14902

0.1455251

0.1363458

0.130387

0.1192332

0.1284773

0.1284773

0.1928594

0.1284773

0.1284773

0.139088

0.1871515

0.1708623

0.1657735

0.2131112

0.2131112

0.2131112

0.2070394

0.1419509

0.1484469

0.1877002

0.1402389

0.1325589

0.1427671

0.1427671

0.220838

0.1882515

0.1882515

0.13213552

0.14638176

0.15135796

0.13882876

0.1462595

0.1462595

0.1848512

0.1588138

0.1448247

Table B6: Checking cluster optimality for NMAR (clusters from 2 to 10; 10% missing rate; 10 iterations)

cluster=2

cluster=3

cluster=4

cluster=5

cluster=6

cluster=7

cluster=8

cluster=9

cluster=10

0.101827

0.101827

0.101827

0.101827

0.101827

0.101827

0.1018274

0.1018274

0.1018274

0.147712

0.141803

0.146086

0.148259

0.161689

0.181101

0.102285

0.1111108

0.132016

0.141073

0.148043

0.156468

0.134618

0.14306

0.143543

0.1204954

0.154565

0.154565

0.122868

0.139485

0.128994

0.171184

0.120328

0.168822

0.125493

0.1423266

0.1711844

0.185993

0.168247

0.161213

0.132262

0.140954

0.163517

0.1197519

0.1322619

0.1384011

0.143375

0.119339

0.119682

0.154081

0.147615

0.169057

0.1433045

0.1014166

0.1423992

0.167334

0.12867

0.137231

0.132

0.175561

0.122293

0.153223

0.1761987

0.1247971


0.15731

0.166899

0.129085

0.101074

0.13569

0.129942

0.1317386

0.1504504

0.1225512

0.134908

0.126762

0.12951

0.12951

0.18249

0.160278

0.1195941

0.1284742

0.1649264

0.146148

0.162895

0.131819

0.142087

0.122715

0.151186

0.1203773

0.1531819

0.155201

0.144855

0.140397

0.134192

0.13469

0.143193

0.149157

0.12380902

0.13518135

0.14078688

Final results for MCAR, MAR and NMAR with varying missing rates (5%, 10%, 15%, and 20%)

MCAR with 5 iterations & 7 clusters
5%          10%         15%         20%
0.1357158   0.119995    0.1320235   0.1478433
0.2369731   0.136033    0.1777792   0.176007
0.1420297   0.1492543   0.1614744   0.1787792
0.2457712   0.1258645   0.1246992   0.1868581
0.131962    0.1338744   0.1638367   0.1466173
0.1784904   0.1330042   0.1519626   0.167221

MAR with 5 iterations & 2 clusters
5%           10%         15%         20%
0.1256567    0.1164238   0.1257124   0.1302127
0.09937878   0.1111938   0.1695857   0.1710534
0.130162     0.1455251   0.1485814   0.182729
0.1371659    0.139088    0.1849823   0.1511046
0.1904594    0.1484469   0.1174341   0.1895456
0.13656456   0.1321355   0.1492592   0.1649291

NMAR with 10 iterations & 8 clusters
5%           10%         15%         20%
0.09692448   0.1018274   0.1372783   0.1423096
0.1426843    0.102285    0.1578463   0.1695661
0.1099717    0.1204954   0.1592683   0.1479767
0.1132756    0.125493    0.1565235   0.1706958
0.1651139    0.1197519   0.194072    0.1577921
0.1358363    0.1433045   0.1848806   0.1911542
0.1128783    0.153223    0.1445272   0.1691071
0.07659872   0.1317386   0.1702514   0.166472
0.1334439    0.1195941   0.1544002   0.161353
0.09654459   0.1203773   0.1415541   0.152692
0.11832718   0.123809    0.1600602   0.1629119
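The last line of each block above matches the column averages of the runs listed before it. A minimal R sketch of how such a summary row can be appended to a run-by-rate matrix follows; the matrix res holds only the first two MCAR runs and is used purely as an illustration, not as part of the thesis programs.

# Illustration: appending a column-average row to a run-by-rate result matrix
res <- rbind(c(0.1357158, 0.1199950, 0.1320235, 0.1478433),
             c(0.2369731, 0.1360330, 0.1777792, 0.1760070))
colnames(res) <- c("5%", "10%", "15%", "20%")
rbind(res, Average = colMeans(res))   # last row = mean of each missing-rate column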


Appendix C
Results for the simulated dataset

Table C1: Checking iteration optimality for MCAR (10% missing rate; iterations 5, 10, 15, 20, 25, 30, 35, 40, 45, 50; 5 clusters)

Iterations

5

10

15

20

25

30

35

40

45

50

0.153359

0.153359

0.153359

0.153359

0.153359

0.153359

0.153359

0.153359

0.153359

0.153359

0.158336

0.158336

0.158336

0.158336

0.158336

0.158336

0.158336

0.158336

0.158336

0.158336

0.16414

0.16414

0.16414

0.16414

0.16414

0.16414

0.16414

0.16414

0.16414

0.16414

0.16309

0.16309

0.16309

0.16309

0.16309

0.16309

0.16309

0.16309

0.16309

0.16309

0.163074

0.163074

0.163074

0.163074

0.163074

0.163074

0.163074

0.163074

0.163074

0.163074

0.1604

0.158522

0.158522

0.158522

0.158522

0.158522

0.158522

0.158522

0.158522

0.158522

0.169981

0.169981

0.169981

0.169981

0.169981

0.169981

0.169981

0.169981

0.169981

0.148495

0.148495

0.148495

0.148495

0.148495

0.148495

0.148495

0.148495

0.148495

0.168434

0.168434

0.168434

0.168434

0.168434

0.168434

0.168434

0.168434

0.168434

0.153843

0.153843

0.153843

0.153843

0.153843

0.153843

0.153843

0.153843

0.153843

0.160127

0.149075

0.149075

0.149075

0.149075

0.149075

0.149075

0.149075

0.149075

0.159246

0.159246

0.159246

0.159246

0.159246

0.159246

0.159246

0.159246

0.15548

0.15548

0.15548

0.15548

0.15548

0.15548

0.15548

0.15548

0.16857

0.16857

0.16857

0.16857

0.16857

0.16857

0.16857

0.16857

0.143935

0.143935

0.143935

0.143935

0.143935

0.143935

0.143935

0.143935

0.158505

0.155882

0.155882

0.155882

0.155882

0.155882

0.155882

0.155882

0.154877

0.154877

0.154877

0.154877

0.154877

0.154877

0.154877

0.156831

0.156831

0.156831

0.156831

0.156831

0.156831

0.156831

0.165743

0.165743

0.165743

0.165743

0.165743

0.165743

0.165743

0.15544

0.15544

0.15544

0.15544

0.15544

0.15544

0.15544

0.158318

0.155984

0.155984

0.155984

0.155984

0.155984

0.155984

0.161967

0.161967

0.161967

0.161967

0.161967

0.161967

0.1529

0.1529

0.1529

0.1529

0.1529

0.1529

0.153367

0.153367

0.153367

0.153367

0.153367

0.153367

0.147581

0.147581

0.147581

0.147581

0.147581

0.147581

0.157526

0.167094

0.167094

0.167094

0.167094

0.167094

0.169896

0.169896

0.169896

0.169896

0.169896

0.142285

0.142285

0.142285

0.142285

0.142285

0.183654

0.183654

0.183654

0.183654

0.183654

0.167614

0.167614

0.167614

0.167614

0.167614

0.158956

0.155271

0.155271

0.155271

0.155271

0.1659

0.1659

0.1659

0.1659

0.154727

0.154727

0.154727

0.154727

0.156046

0.156046

0.156046

0.156046

0.170652

0.170652

0.170652

0.170652

0.15918

0.159226

0.159226

0.159226

0.16181

0.16181

0.16181

0.15641

0.15641

0.15641

0.171235

0.171235

0.171235

0.161378

0.161378

0.161378

0.159534

0.142579

0.142579

0.169344

0.169344

0.154192

0.154192

0.157791

0.157791

0.15821

0.15821

0.159188

0.162551 0.171719 0.147943 0.159561 0.160148 0.159308

Table C2: Checking iteration optimality for MAR (10% missing rate; iterations 5, 10, 15, 20, 25, 30, 35, 40, 45, 50; 5 clusters)

Iterations

5

10

15

20

25

30

35

40

45

50

0.153728

0.153728

0.153728

0.153728

0.153728

0.153728

0.153728

0.153728

0.153728

0.153728

0.1648

0.1648

0.1648

0.1648

0.1648

0.1648

0.1648

0.1648

0.1648

0.1648

0.153778

0.153778

0.153778

0.153778

0.153778

0.153778

0.153778

0.153778

0.153778

0.153778

0.155587

0.155587

0.155587

0.155587

0.155587

0.155587

0.155587

0.155587

0.155587

0.155587

0.156406

0.156406

0.156406

0.156406

0.156406

0.156406

0.156406

0.156406

0.156406

0.156406

0.15686

0.173553

0.173553

0.173553

0.173553

0.173553

0.173553

0.173553

0.173553

0.173553

0.152668

0.152668

0.152668

0.152668

0.152668

0.152668

0.152668

0.152668

0.152668

0.171421

0.171421

0.171421

0.171421

0.171421

0.171421

0.171421

0.171421

0.171421

0.146845

0.146845

0.146845

0.146845

0.146845

0.146845

0.146845

0.146845

0.146845

0.153936

0.153936

0.153936

0.153936

0.153936

0.153936

0.153936

0.153936

0.153936

0.158272

0.139333

0.139333

0.139333

0.139333

0.139333

0.139333

0.139333

0.139333

0.156994

0.156994

0.156994

0.156994

0.156994

0.156994

0.156994

0.156994

0.155585

0.155585

0.155585

0.155585

0.155585

0.155585

0.155585

0.155585

0.150068

0.150068

0.150068

0.150068

0.150068

0.150068

0.150068

0.150068

0.152859

0.152859

0.152859

0.152859

0.152859

0.152859

0.152859

0.152859

0.155837

0.160279

0.160279

0.160279

0.160279

0.160279

0.160279

0.160279

0.159149

0.159149

0.159149

0.159149

0.159149

0.159149

0.159149

0.161601

0.161601

0.161601

0.161601

0.161601

0.161601

0.161601

0.158001

0.158001

0.158001

0.158001

0.158001

0.158001

0.158001

0.167223

0.167223

0.167223

0.167223

0.167223

0.167223

0.167223

0.157191

0.159067

0.159067

0.159067

0.159067

0.159067

0.159067

0.170436

0.170436

0.170436

0.170436

0.170436

0.170436

0.161758

0.161758

0.161758

0.161758

0.161758

0.161758

0.155481

0.155481

0.155481

0.155481

0.155481

0.155481

0.163212

0.163212

0.163212

0.163212

0.163212

0.163212

0.158151

0.174201

0.174201

0.174201

0.174201

0.174201

0.161014

0.161014

0.161014

0.161014

0.161014

0.162374

0.162374

0.162374

0.162374

0.162374

0.14687

0.14687

0.14687

0.14687

0.14687

0.144909

0.144909

0.144909

0.144909

0.144909

0.158105

0.160961

0.160961

0.160961

0.160961

0.15469

0.15469

0.15469

0.15469

0.155764

0.155764

0.155764

0.155764

0.149137

0.149137

0.149137

0.149137

0.151771

0.151771

0.151771

0.151771

0.157585

0.156602

0.156602

0.156602

0.161862

0.161862

0.161862

0.160368

0.160368

0.160368

0.140072

0.140072

0.140072

0.158005

0.158005

0.158005

0.157309

0.162652

0.162652

0.156742

0.156742

0.155924

0.155924

0.145725

0.145725

0.167808

0.167808

0.15736

0.152795 0.150149 0.151995 0.173102 0.160254 0.15739

Table C3: Checking iteration optimality for NMAR (10% missing rate; iterations 5, 10, 15, 20, 25, 30, 35, 40, 45, 50; 5 clusters)

Iterations

5

10

15

20

25

30

35

40

45

50

0.191215

0.191215

0.191215

0.191215

0.191215

0.191215

0.191215

0.191215

0.191215

0.191215

0.180921

0.180921

0.180921

0.180921

0.180921

0.180921

0.180921

0.180921

0.180921

0.180921

0.189186

0.189186

0.189186

0.189186

0.189186

0.189186

0.189186

0.189186

0.189186

0.189186

0.193569

0.193569

0.193569

0.193569

0.193569

0.193569

0.193569

0.193569

0.193569

0.193569


0.198927

0.198927

0.198927

0.198927

0.198927

0.198927

0.198927

0.198927

0.198927

0.198927

0.190763

0.187168

0.187168

0.187168

0.187168

0.187168

0.187168

0.187168

0.187168

0.187168

0.214106

0.214106

0.214106

0.214106

0.214106

0.214106

0.214106

0.214106

0.214106

0.209192

0.209192

0.209192

0.209192

0.209192

0.209192

0.209192

0.209192

0.209192

0.190581

0.190581

0.190581

0.190581

0.190581

0.190581

0.190581

0.190581

0.190581

0.199373

0.199373

0.199373

0.199373

0.199373

0.199373

0.199373

0.199373

0.199373

0.195424

0.205131

0.205131

0.205131

0.205131

0.205131

0.205131

0.205131

0.205131

0.204471

0.204471

0.204471

0.204471

0.204471

0.204471

0.204471

0.204471

0.179282

0.179282

0.179282

0.179282

0.179282

0.179282

0.179282

0.179282

0.186874

0.186874

0.186874

0.186874

0.186874

0.186874

0.186874

0.186874

0.203141

0.203141

0.203141

0.203141

0.203141

0.203141

0.203141

0.203141

0.195542

0.176813

0.176813

0.176813

0.176813

0.176813

0.176813

0.176813

0.198609

0.198609

0.198609

0.198609

0.198609

0.198609

0.198609

0.19868

0.19868

0.19868

0.19868

0.19868

0.19868

0.19868

0.181641

0.181641

0.181641

0.181641

0.181641

0.181641

0.181641

0.20381

0.20381

0.20381

0.20381

0.20381

0.20381

0.20381

0.194634

0.217707

0.217707

0.217707

0.217707

0.217707

0.217707

0.184723

0.184723

0.184723

0.184723

0.184723

0.184723

0.19732

0.19732

0.19732

0.19732

0.19732

0.19732

0.203079

0.203079

0.203079

0.203079

0.203079

0.203079

0.183122

0.183122

0.183122

0.183122

0.183122

0.183122

0.195146

0.184925

0.184925

0.184925

0.184925

0.184925

0.183616

0.183616

0.183616

0.183616

0.183616

0.165382

0.165382

0.165382

0.165382

0.165382

0.192947

0.192947

0.192947

0.192947

0.192947

0.183129

0.183129

0.183129

0.183129

0.183129

0.192955

0.209584

0.209584

0.209584

0.209584

0.193737

0.193737

0.193737

0.193737

0.185378

0.185378

0.185378

0.185378

0.209035

0.209035

0.209035

0.209035

0.195736

0.195736

0.195736

0.195736

0.193775

0.188837

0.188837

0.188837

0.185255

0.185255

0.185255

0.192626

0.192626

0.192626

0.180581

0.180581

0.180581

0.183989

0.183989

0.183989

0.192835

0.16191

0.16191

0.184659

0.184659

0.214747

0.214747

0.173244

0.173244

0.195047

0.195047

0.192067

0.185084 0.177579 0.179497

0.206431 0.215945 0.192151

Table C4: Checking cluster optimality for MCAR (clusters from 2 to 10; 10% missing rate; 25 iterations)

cluster=2

cluster=3

cluster=4

cluster=5

cluster=6

cluster=7

cluster=8

cluster=9

cluster=10

0.1533592

0.1533592

0.1533592

0.1533592

0.1533592

0.1533592

0.1533592

0.1533592

0.1533592

0.1489389

0.1483384

0.1569593

0.1583362

0.1719848

0.1693923

0.1722979

0.1672909

0.1603954

0.1623354

0.1537003

0.1791276

0.1641396

0.1684081

0.1541449

0.1541449

0.1684081

0.1482683

0.1375055

0.1643216

0.1670794

0.1630901

0.1549887

0.1384483

0.1375055

0.1471827

0.1630901

0.1520735

0.1411166

0.1689852

0.1630736

0.1605996

0.1684685

0.1497198

0.1853841

0.1436523

0.1562331

0.1617841

0.1678442

0.158522

0.1752373

0.1659302

0.1701602

0.1454038

0.1583527

0.1719522

0.1558143

0.1700855

0.1699811

0.1472907

0.1558143

0.1478893

0.1479569

0.1506295

0.1502757

0.1650343

0.1603936

0.148495

0.1329248

0.1630998

0.1651562

0.1564991

0.1486734

0.1505391

0.1696355

0.1634406

0.168434

0.1581702

0.1540173

0.1559501

0.1661535

0.15807

0.1624842

0.1542061

0.1542061

0.1538426

0.1695223

0.1604386

0.1798738

0.153607

0.1675288

0.1618832

0.1502475

0.1542317

0.1490754

0.1608844

0.1474603

0.1559411

0.1845573

0.1562098

0.1584813

0.1573709

0.1560721

0.1592457

0.1468942

0.1598619

0.1675985

0.1696406

0.1556324

0.1650894

0.1514869

0.1693832

0.1554798

0.1591576

0.1581082

0.1631849

0.138378

0.1806822

0.1515375

0.1648625

0.1601565

0.1685699

0.1496362

0.176223

0.1618728

0.1621714

0.1422305

0.1580196

0.1591094

0.1637442

0.143935

0.1661519

0.152468

0.1514247

0.1681505

0.1739812

0.160114

0.149538

0.1638434

0.1558824

0.1614639

0.1710976

0.1558824

0.1417505

0.1417505

0.1527515

0.1801297

0.1453186

0.1548768

0.159631

0.1621434

0.1647212

0.144054

0.1485046

0.155914

0.168735

0.1566251

0.1568308

0.156257

0.1655719

0.1522495

0.1610732

0.1598972

0.1608124

0.1605367

0.1545108

0.1657425

0.170586

0.1518525

0.1721695

0.1646111

0.1590815

0.1717702

0.1498006

0.1698343

0.1554398

0.1598

0.1556466

0.1594241

0.1424231

0.1635823

0.1548753

0.1404491

0.1557742

0.1559838

0.1451302

0.167485

0.1639821

0.1687283

0.1533676

0.1662146

0.1542759

0.1579264

0.1619668

0.1627313

0.1610508

0.1770395

0.1579174

0.1524603

0.1775547

0.1702114

0.1504134

0.1528995

0.1571493

0.1637851

0.1573379

0.1657478

0.1503753

0.1465657

0.1588564

0.1524613

0.1533672

0.1496224

0.1543442

0.1579138

0.1523508

0.1601699

0.1559756

0.1887714

0.1450477

0.1475811

0.1532

0.1445442

0.1566019

0.1438552

0.1543964

0.1577302

0.1588677

0.1598729

0.157526

0.1580312

0.1589902

0.160136

0.1582662

0.1561737


Table C5: Checking cluster optimality for MAR (clusters from 2 to 10; 10% missing rate; 15 iterations)

cluster=2

cluster=3

cluster=4

cluster=5

cluster=6

cluster=7

cluster=8

cluster=9

cluster=10

0.1537281

0.1537281

0.1537281

0.1537281

0.1537281

0.1537281

0.1537281

0.1537281

0.1537281

0.1559714

0.1653103

0.168922

0.1647996

0.1605876

0.1762846

0.152632

0.1609916

0.1564276

0.1430237

0.1578057

0.1558575

0.1537778

0.1578932

0.1578932

0.1473044

0.1583922

0.1711735

0.1797234

0.1570308

0.1532421

0.1555874

0.1678468

0.1502265

0.1666473

0.1577527

0.1479904

0.1543626

0.1580641

0.1547588

0.1564055

0.1607431

0.1593679

0.1712027

0.1580641

0.1505579

0.1738606

0.1473499

0.1543532

0.1735531

0.1625813

0.1707245

0.1683802

0.1661751

0.1518151

0.1479519

0.1678214

0.1587382

0.1526677

0.1364277

0.1653977

0.151088

0.1364277

0.1490839

0.1630103

0.1616283

0.1492837

0.1714212

0.1519786

0.1692667

0.163754

0.1533212

0.155685

0.1601952

0.1593245

0.1547209

0.1468447

0.1523563

0.1385901

0.151686

0.1623853

0.1432987

0.1507822

0.1555362

0.1441457

0.1539358

0.1651569

0.1536312

0.1606629

0.1561751

0.150177

0.1625122

0.153142

0.1645387

0.1393328

0.165179

0.1635327

0.1577447

0.161684

0.1492152

0.1528108

0.1480366

0.1450526

0.1569943

0.1543731

0.1532529

0.1490633

0.1649248

0.1464023

0.16153

0.1636468

0.1635872

0.155585

0.1545211

0.1550309

0.1731423

0.1674821

0.1645329

0.1568732

0.1530184

0.1390426

0.1500678

0.1598051

0.1531167

0.1673272

0.1586938

0.1649376

0.1626306

0.1479142

0.1516159

0.1528591

0.1780196

0.1619305

0.1472205

0.1396384

0.1488475

0.1585977

0.1566238

0.1541058

0.1558373

0.1587465

0.1587983

0.1587722

0.1570557

0.1535915

Table C6: Checking cluster optimality for NMAR (clusters from 2 to 10; 10% missing rate; 5 iterations)

cluster=2

cluster=3

cluster=4

cluster=5

cluster=6

cluster=7

cluster=8

cluster=9

cluster=10

0.1912149

0.1912149

0.1912149

0.1912149

0.1912149

0.1912149

0.1912149

0.1912149

0.1912149

0.2029784

0.1686545

0.193161

0.1809205

0.2081727

0.1852416

0.1934861

0.1934835

0.1839155

0.2022097

0.1928589

0.2077185

0.189186

0.2158239

0.1892655

0.1967761

0.1882468

0.2062008

0.2054832

0.2026381

0.2054832

0.1935687

0.211011

0.1774081

0.2196228

0.2026549

0.2091377

0.1996884

0.2115519

0.1706702

0.1989272

0.1748489

0.1608717

0.2075485

0.2034826

0.1965995

0.2003149

0.1933837

0.1936496

0.1907635

0.2002143

0.1808004

0.2017297

0.1958165

0.1974137


Final results for MCAR, MAR and NMAR with varying missing rates (5%, 10%, 15%, and 20%)

MCAR with 25 iterations & 10 clusters

5%

10%

15%

20%

0.1670757

0.1533592

0.1635079

0.1610349

0.1496876

0.1603954

0.153637

0.1481382

0.1666809

0.1482683

0.1635736

0.1714814

0.1843632

0.1630901

0.1641762

0.1589371

0.1652808

0.1436523

0.1438135

0.1620806

0.1520065

0.1583527

0.1588536

0.1684821

0.1484322

0.1506295

0.1571042

0.1518329

0.1652677

0.1486734

0.1500573

0.165141

0.1313637

0.15807

0.1609874

0.1615885

0.1565603

0.1675288

0.1611219

0.1646294

0.1638889

0.1562098

0.1676683

0.1700945

0.1529984

0.1556324

0.1670556

0.1545456

0.1472798

0.1806822

0.1715883

0.1510724

0.1607192

0.1422305

0.1653691

0.156485

0.1520928

0.1739812

0.1670011

0.1624461

0.1506007

0.1417505

0.1610007

0.1609171

0.1679272

0.1485046

0.1493622

0.1580064

0.1740199

0.1598972

0.1673336

0.1642655

0.1594476

0.1590815

0.14001

0.1580765

0.1711124

0.1635823

0.1631971

0.1586979

0.1532117

0.1533676

0.1600419

0.1613813

0.1672335

0.1524603

0.1554851

0.1604047

0.164847

0.1503753

0.1562593

0.1668781

0.1618757

0.1601699

0.161643

0.1455881

0.1707137

0.1543964

0.1720897

0.1655179

0.160187484

0.156173656

0.160077504

0.160308928

MAR with 15 iterations & 10 clusters

5%

10%

15%

20%

0.1456426

0.1537281

0.1539078

0.1614865

0.1537746

0.1564276

0.168065

0.1558089

0.1559374

0.1711735

0.1539176

0.1609754

0.1598557

0.1479904

0.1605395

0.1492341

0.1780625

0.1505579

0.1580683

0.1615557

0.182565

0.1518151

0.1514933

0.1723007

0.1304944

0.1490839

0.161859

0.160757

0.1494331

0.155685

0.1621599

0.1651331

0.1528486

0.1432987

0.1667488

0.1606182


0.1634133

0.150177

0.1601363

0.1637187

0.1530894

0.1492152

0.1684983

0.1516515

0.1590996

0.1464023

0.151475

0.1657967

0.1573837

0.1645329

0.1755507

0.1588642

0.1657624

0.1649376

0.1547743

0.1549464

0.1829929

0.1488475

0.1572161

0.1558534

0.159357

0.1535915

0.160294

0.1599134

NMAR with 5 iterations & 7 clusters

5%

10%

15%

20%

0.180415

0.1912149

0.1966909

0.2101974

0.2242219

0.1852416

0.2089012

0.2048447

0.1837408

0.1892655

0.2039496

0.2112158

0.1781351

0.1774081

0.2018332

0.2061187

0.1715732

0.1608717

0.1917163

0.204424

0.1876172

0.1808004

0.2006182

0.2073601


Appendix D
Programming (wine dataset)
This program was written in the R programming language (version 3.2.3, 2015).
1. Kill all variables
2. Set Seed = 1200
3. FOR (z = 1,2,3,4,5):   # Number of Iterations
4. Kill all variables except the z variable
5. Create Y1 as a matrix of the raw data (Wine dataset)
6. Remove any heterogeneous attribute from Y1
7. Set Y = Y1
8. Set C = 5 ← Initialize number of clusters (c >= 2)
9. Set V ← Number of variables from the matrix Y   # The data set variables
10. Set iter ← Initialize number of fuzzy iterations (maximum is 100 iterations)
11. Set s ← Initialize value of the fuzzy parameter   # 2
12. Set B ← Initialize value of the grey relational analysis parameter   # 1
13. GRG = Initialize an empty matrix for GRG
14. Set CV = NULL   # Maximum value of GRG
15. Pr = Initialize an empty matrix for entropy values
16. Set E = NULL   # Set entropy vector of instances to NULL
17. Set H = NULL   # Set entropy vector to NULL
18. Set L = NULL   # Set average vector of instances to NULL
19. Set ENTR = NULL   # Set final vector of entropy to NULL
20. Set INF = 0   # Set initial value of information vector to zero
21. Set WF = NULL   # Set average weight vector to NULL
22. ES = Initialize an empty matrix for estimating missing values
23. Select case
24. Case MCAR: X = MCAR(Y, α)   # Y: original data set, α: probability of missing rate
    Case MAR:  X = MAR(Y, α)    # Y: original data set, α: probability of missing rate
    Case NMAR: X = NMAR(Y, α)   # Y: original data set, α: probability of missing rate

Call library mice from the R package
25. Separate the Xc matrix by calling the cc function   # Xc are the cases (instances) without missing data (i.e. Xc is the complete part of the data set)
26. Separate the Xic matrix by calling the ic function   # Xic are the cases (instances) with missing data (i.e. Xic is the incomplete part of the data set)
27. Call library e1071 from the R package to calculate fuzzy c-means


28. FCM = cmeans(Xc, iter, m = s)   # apply FCM to the complete dataset
29. cp = unname(FCM$centers)   # retrieve the cluster centroids from FCM
30. Rank the incomplete instances by their amount of missingness in descending order:
31. Xic = Xic[order(rowSums(is.na(Xic)), decreasing = TRUE), ]
32. Apply grey system theory (GRA) to the incomplete dataset (Xic)
33. Normalize the incomplete dataset for GRA by calling the norm function:
34. nor <- as.data.frame(lapply(Xic, norm))
35. For (i in 1:C) {

      dalta = abs(sweep(nor, 2, cp[i,], '-'))
      names(dalta) <- paste("dalta", 1:ncol(Xic), sep = ".")
      minmin = min(dalta, na.rm = TRUE)
      maxmax = max(dalta, na.rm = TRUE)
      GRC = (minmin + (B*maxmax)) / (dalta + (B*maxmax))   # Grey relational coefficient
      names(GRC) <- paste("GRC", 1:ncol(Xic), sep = ".")
      for (j in 1:ncol(Xic)) {
        GRG[i,j] = c(sum(GRC[,j], na.rm = TRUE) / (length(na.omit(Xic[,j]))))   # grey relational grade
      }
    }
36. For (j in 1:v) { CV[j] = which.max(GRG[,j]) }   # (a worked toy example of this GRC/GRG computation follows the program listing)
37. Change the incomplete dataset into a binary dataset in which {observed = 1 & missing = 0}:
      Xic[] <- as.numeric(!is.na(Xic))
      Xic = t(Xic)
      Xicc = cbind(Xic, CV)   # add the class vector to the incomplete data set
      xclas <- as.data.frame(Xicc)
      xclas = t(xclas)
38. Calculate entropy:

      r = nrow(xclas) - 1
      for (p in 1:r) {
        t = table(xclas[p,], CV)
      }
      m0 <- matrix(0, 2, c)
      cn = as.numeric(colnames(t))
      cl = length(cn)
      for (c in 1:cl) {
        m0[1, cn[c]] = t[1, c]
        m0[2, cn[c]] = t[2, c]
      }
      t = m0


      r = nrow(xclas) - 1
      for (p in 1:r) {
        L = rowSums(t) / sum(t)
        for (ii in 1:2) {
          s = 0
          for (jj in 1:c) { s = s + t[ii, jj] }
          for (jj in 1:2) { Pr[ii, jj] = t[ii, jj] / s }
        }
        for (i in 1:2) {
          E[i] = 0
          for (j in 1:c) {
            H = -(Pr[i, j] * log2(Pr[i, j]))
            H = ifelse(is.na(H), 0, H)
            E[i] = E[i] + H
          }
          ENTR[p] = 0
          ENTR[p] = ENTR[p] + E[i] * L[i]
          INF[p] = 1 - ENTR[p]
        }
      }
      SINF = 0
      for (p in 1:r) { SINF = SINF + INF[p] }
      for (p in 1:r) { WF[p] = INF[p] / SINF }
      Win = cbind(Xin, WF)
      CM = colMeans(Win, na.rm = T)
      Fin = rbind(Win, CM)
      #------------------------------------------------------------------
      IFin = Fin
      n1 = nrow(Fin) - 1
      m1 = ncol(Fin) - 1
      for (m in 1:m1) {
        for (n in 1:n1) {
          EE = 0
          if (is.na(Fin[n, m])) {
            for (nn in 1:n1) {
              if (nn != n) {
                if (is.na(Fin[nn, m])) { EE = EE + (Fin[n1+1, m] * Fin[nn, m1+1]) }
                else { EE = EE + (Fin[nn, m] * Fin[nn, m1+1]) }
                ES[n, m] = EE
                Fin[n, m] = EE
              }


            }
          }
          else { ES[n, m] = Fin[n, m] }
        }
      }
      for (m in 1:m1) {
        for (n in 1:n1) {
          if (is.na(Fin[n, m])) IFin[n, m] = ES[n, m]
        }
      }
      for (m in 1:m1) {
        for (n in 1:n1) {
          if (is.na(Fin[n, m])) IFin[n, m] = ES[n, m]
        }
      }
      WNA = Win
      WIM = Fin[-(n+1), -(m+1)]
      CDTU = rbind(WIM, Xc)
      CDT = CDTU[order(as.numeric(row.names(CDTU))), ]
39. Calculate the RMSE based on the imputed dataset, the missing dataset and the original dataset:
      RMSE = function(imp, mis, true, norm = TRUE) {
        imp <- as.matrix(imp)
        mis <- as.matrix(mis)
        true <- as.matrix(true)
        missIndex <- which(is.na(mis))
        errvec <- imp[missIndex] - true[missIndex]
        rmse <- sqrt(mean(errvec^2))
        if (norm) {
          rmse <- rmse / sd(true[missIndex])
        }
        return(rmse)
      }
      RMSE = RMSE(CDT, x, y, norm = TRUE)
      Print the final result:
40. print(RMSE)
}
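To make steps 32 to 36 easier to follow, the fragment below reworks the grey relational computation on a tiny made-up matrix. It is only an illustrative sketch, not part of the thesis program: the object toy stands for a few normalized incomplete instances and cp1 for one cluster centroid, both with invented values, and B = 1 as in the program.

# Illustrative toy check of the GRC/GRG computation (invented values)
B <- 1
toy <- data.frame(v1 = c(0.2, NA, 0.9),   # normalized incomplete instances
                  v2 = c(0.5, 0.7, NA))
cp1 <- c(v1 = 0.4, v2 = 0.6)              # one hypothetical cluster centroid

dalta  <- abs(sweep(toy, 2, cp1, '-'))    # absolute deviations from the centroid
minmin <- min(dalta, na.rm = TRUE)
maxmax <- max(dalta, na.rm = TRUE)
GRC <- (minmin + B * maxmax) / (dalta + B * maxmax)   # grey relational coefficients

# Grey relational grade of this centroid for every variable,
# averaged over the observed (non-missing) entries only
GRG1 <- sapply(seq_len(ncol(toy)), function(j)
  sum(GRC[, j], na.rm = TRUE) / length(na.omit(toy[, j])))
GRG1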


Appendix E
Programming (Simulated dataset)
This program was written in the R programming language (version 3.2.3, 2015).
1. Kill all variables
2. Set Seed = 1200
3. Set n = 1000   # Number of generated instances
4. for (i = 1 to number of generated variables)
5. Set variable = rnorm(n, mean, standard deviation)
6. Y1 is the data frame of the generated variables
7. FOR (z = 1,2,3,4,5):   # Number of Iterations
8. Kill all variables except the z and y variables
9. Remove any heterogeneous attribute from Y1
10. Set Y = Y1
11. Set C = 5 ← Initialize number of clusters (c >= 2)
12. Set V ← Number of variables from the matrix Y   # The data set variables
13. Set iter ← Initialize number of fuzzy iterations (maximum is 100 iterations)
14. Set s ← Initialize value of the fuzzy parameter   # 2
15. Set B ← Initialize value of the grey relational analysis parameter   # 1
16. GRG = Initialize an empty matrix for GRG
17. Set CV = NULL   # Maximum value of GRG
18. Pr = Initialize an empty matrix for entropy values
19. Set E = NULL   # Set entropy vector of instances to NULL
20. Set H = NULL   # Set entropy vector to NULL
21. Set L = NULL   # Set average vector of instances to NULL
22. Set ENTR = NULL   # Set final vector of entropy to NULL
23. Set INF = 0   # Set initial value of information vector to zero
24. Set WF = NULL   # Set average weight vector to NULL
25. ES = Initialize an empty matrix for estimating missing values
26. Select case
27. Case MCAR: X = MCAR(x, α)   # x: original data set, α: probability of missing rate
    Case MAR:  X = MAR(x, α)    # x: original data set, α: probability of missing rate
    Case NMAR: X = NMAR(x, α)   # x: original data set, α: probability of missing rate
    Call library mice from the R package
28. Separate the Xc matrix by calling the cc function   # Xc are the cases (instances) without missing data (i.e. Xc is the complete part of the data set)

29. Separate the Xic matrix by calling the ic function   # Xic are the cases (instances) with missing data (i.e. Xic is the incomplete part of the data set)
30. Call library e1071 from the R package to calculate fuzzy c-means
31. FCM = cmeans(Xc, iter, m = s)   # apply FCM to the complete dataset
32. cp = unname(FCM$centers)   # retrieve the cluster centroids from FCM
33. Rank the incomplete instances by their amount of missingness in descending order:

34. Xic = Xic[order(rowSums(is.na(Xic)), decreasing = TRUE), ]
35. Apply grey system theory (GRA) to the incomplete dataset (Xic)
36. Normalize the incomplete dataset for GRA by calling the norm function:
37. nor <- as.data.frame(lapply(Xic, norm))
38. For (i in 1:C) {

      dalta = abs(sweep(nor, 2, cp[i,], '-'))
      names(dalta) <- paste("dalta", 1:ncol(Xic), sep = ".")
      minmin = min(dalta, na.rm = TRUE)
      maxmax = max(dalta, na.rm = TRUE)
      GRC = (minmin + (B*maxmax)) / (dalta + (B*maxmax))   # Grey relational coefficient
      names(GRC) <- paste("GRC", 1:ncol(Xic), sep = ".")
      for (j in 1:ncol(Xic)) {
        GRG[i,j] = c(sum(GRC[,j], na.rm = TRUE) / (length(na.omit(Xic[,j]))))   # grey relational grade
      }
    }
39. For (j in 1:v) { CV[j] = which.max(GRG[,j]) }
40. Change the incomplete dataset into a binary dataset in which {observed = 1 & missing = 0}:
      Xic[] <- as.numeric(!is.na(Xic))
      Xic = t(Xic)
      Xicc = cbind(Xic, CV)   # add the class vector to the incomplete data set
      xclas <- as.data.frame(Xicc)
      xclas = t(xclas)
41. Calculate entropy:

      r = nrow(xclas) - 1
      for (p in 1:r) {
        t = table(xclas[p,], CV)
      }
      m0 <- matrix(0, 2, c)
      cn = as.numeric(colnames(t))
      cl = length(cn)
      for (c in 1:cl) {
        m0[1, cn[c]] = t[1, c]
        m0[2, cn[c]] = t[2, c]
      }
      t = m0
      r = nrow(xclas) - 1
      for (p in 1:r) {
        L = rowSums(t) / sum(t)
        for (ii in 1:2) {
          s = 0
          for (jj in 1:c) { s = s + t[ii, jj] }
          for (jj in 1:2) { Pr[ii, jj] = t[ii, jj] / s }
        }
        for (i in 1:2) {
          E[i] = 0
          for (j in 1:c) {
            H = -(Pr[i, j] * log2(Pr[i, j]))
            H = ifelse(is.na(H), 0, H)
            E[i] = E[i] + H
          }
          ENTR[p] = 0
          ENTR[p] = ENTR[p] + E[i] * L[i]
          INF[p] = 1 - ENTR[p]
        }
      }
      SINF = 0
      for (p in 1:r) { SINF = SINF + INF[p] }
      for (p in 1:r) { WF[p] = INF[p] / SINF }
      Win = cbind(Xin, WF)
      CM = colMeans(Win, na.rm = T)
      Fin = rbind(Win, CM)
      #------------------------------------------------------------------
      IFin = Fin
      n1 = nrow(Fin) - 1
      m1 = ncol(Fin) - 1
      for (m in 1:m1) {
        for (n in 1:n1) {
          EE = 0
          if (is.na(Fin[n, m])) {
            for (nn in 1:n1) {
              if (nn != n) {
                if (is.na(Fin[nn, m])) { EE = EE + (Fin[n1+1, m] * Fin[nn, m1+1]) }
                else { EE = EE + (Fin[nn, m] * Fin[nn, m1+1]) }
                ES[n, m] = EE
                Fin[n, m] = EE
              }
            }


            } else { ES[n, m] = Fin[n, m] }
        }
    }
    for (m in 1:m1) {
        for (n in 1:n1) {
            if (is.na(Fin[n, m])) IFin[n, m] = ES[n, m]
        }
    }
    WNA  = Win
    WIM  = Fin[-(n + 1), -(m + 1)]
    CDTU = rbind(WIM, Xc)
    CDT  = CDTU[order(as.numeric(row.names(CDTU))), ]
42. Calculate the RMSE from the imputed data set, the missing data set and the original data set
    RMSE = function(imp, mis, true, norm = TRUE) {
        imp  <- as.matrix(imp)
        mis  <- as.matrix(mis)
        true <- as.matrix(true)
        missIndex <- which(is.na(mis))
        errvec <- imp[missIndex] - true[missIndex]
        rmse   <- sqrt(mean(errvec^2))
        if (norm) {
            rmse <- rmse / sd(true[missIndex])
        }
        return(rmse)
    }
    RMSE = RMSE(CDT, x, y, norm = TRUE)
43. Print the final result: print(RMSE)
}
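As a reading aid only, the grey relational computation of step 38 can be traced on a small artificial matrix. The sketch below is not part of the proposed algorithm; the objects toy and cp1 are hypothetical stand-ins for the normalized incomplete data and one cluster centroid, and B = 1 as in step 15.

# Illustrative sketch only (toy data, not the thesis data sets)
toy <- matrix(c(0.2, 0.5,  NA, 1.0,     # two normalized variables,
                0.0, 0.4, 0.7, 0.9),    # one missing entry in the first
              ncol = 2)
cp1 <- c(0.3, 0.6)                      # hypothetical centroid of one cluster
B   <- 1
dalta  <- abs(sweep(toy, 2, cp1, '-'))              # absolute differences from the centroid
minmin <- min(dalta, na.rm = TRUE)
maxmax <- max(dalta, na.rm = TRUE)
GRC <- (minmin + B * maxmax) / (dalta + B * maxmax) # grey relational coefficients
GRG <- sapply(seq_len(ncol(toy)), function(j)
    sum(GRC[, j], na.rm = TRUE) / length(na.omit(toy[, j])))   # one grade per variable
GRG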


Appendix F
Functions

 MCAR
# Missing Completely At Random (MCAR)
mcar = function(x, alpha) {
    n <- nrow(x)
    p <- ncol(x)
    NAloc <- rep(FALSE, n * p)
    NAloc[sample(n * p, floor(n * p * alpha))] <- TRUE
    x[matrix(NAloc, nrow = n, ncol = p)] <- NA
    return(x)
}

 MAR
# Missing At Random (MAR)
mar = function(x, alpha) {
    alpha = alpha * 2
    c = cor(x)
    diag(c) <- 0
    m = NULL
    for (i in 1:ncol(x)) {
        m[i] = max(abs(c[, i]))
    }
    v  <- row.names(c)[apply(c, 2, which.max)]
    L  <- mget(v, inherits = TRUE)
    df <- data.frame(matrix(unlist(L), nrow = nrow(x), ncol = ncol(x)))
    for (j in 1:ncol(x)) {
        x[(df[, j] <= median(df[, j])) & (runif(nrow(x)) < alpha), j] <- NA
    }
    x
}


 NMAR
# Not Missing At Random (NMAR)
nmar = function(x, alpha) {
    alpha = alpha * 2
    for (j in 1:ncol(x)) {
        x[(x[, j] <= median(x[, j])) & (runif(nrow(x)) < alpha), j] <- NA
    }
    x
}
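A minimal usage sketch of the three mechanism functions above, assuming they have been sourced into the global environment. The data set (the numeric columns of the built-in iris data) and the rate of 0.10 are arbitrary choices for illustration, not the data or rates used in the thesis; note that mar() resolves the most correlated variables by name through mget(), so the data frame is attached first.

# Illustrative only: inject roughly 10% missing values under each mechanism
set.seed(1)
x <- iris[, 1:4]              # any fully observed numeric data frame
Xmcar <- mcar(x, 0.10)        # missing completely at random
Xnmar <- nmar(x, 0.10)        # not missing at random (depends on a variable's own values)
attach(x)                     # make the columns visible by name for mget() inside mar()
Xmar  <- mar(x, 0.10)         # missing at random (depends on the most correlated variable)
detach(x)
colSums(is.na(Xmcar))         # injected missing values per variable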

 CC
# Extracting the complete cases from a data set, also known as 'listwise deletion'
# or 'complete-case analysis'
cc  = function(x, drop) return(x[cci(x)])
# where the complete-case indicator is
cci = function(x) return(!is.na(x))

 IC
# Extracting the incomplete cases from a data set
ic  = function(x, drop) return(x[ici(x)])
# where the incomplete-case indicator is
ici = function(x) return(is.na(x))
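A short usage sketch of cc()/ic() as defined above; they act element-wise, so the example uses a vector. For a data frame, the row-wise split into Xc and Xic used by the algorithm corresponds to complete.cases() from base R. The objects v and df are hypothetical.

# Illustrative only
v <- c(2.5, NA, 4.1, NA, 7.0)
cc(v)                                # 2.5 4.1 7.0  (observed elements)
ic(v)                                # NA NA        (missing elements)

# Row-wise separation of a data frame into complete and incomplete instances
df  <- data.frame(a = c(1, NA, 3), b = c(4, 5, NA))
Xc  <- df[complete.cases(df), ]      # complete instances
Xic <- df[!complete.cases(df), ]     # incomplete instances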

 NORM
# Min-max normalization function
norm <- function(x) {
    (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
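A small usage sketch mirroring step 37 of the algorithm, applied column-wise to a hypothetical toy data frame; NA entries are ignored when the range is computed and remain NA after scaling.

# Illustrative only
toyX <- data.frame(a = c(2, 6, NA, 10), b = c(1, NA, 3, 5))
nor  <- as.data.frame(lapply(toyX, norm))
nor          # a: 0.0 0.5 NA 1.0    b: 0.0 NA 0.5 1.0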


 FCM
# Fuzzy c-means clustering (cmeans from the e1071 package)
cmeans = function(x, centers, iter.max = 100, verbose = FALSE,
                  dist = "euclidean", method = "cmeans", m = 2,
                  rate.par = NULL, weights = 1, control = list()) {
    x <- as.matrix(x)
    xrows <- nrow(x)
    xcols <- ncol(x)
    if (missing(centers))
        stop("Argument 'centers' must be a number or a matrix.")
    dist <- pmatch(dist, c("euclidean", "manhattan"))
    if (is.na(dist))
        stop("invalid distance")
    if (dist == -1)
        stop("ambiguous distance")
    method <- pmatch(method, c("cmeans", "ufcl"))
    if (is.na(method))
        stop("invalid clustering method")
    if (method == -1)
        stop("ambiguous clustering method")
    if (length(centers) == 1) {
        ncenters <- centers
        centers <- x[sample(1:xrows, ncenters), , drop = FALSE]
        if (any(duplicated(centers))) {
            cn <- unique(x)
            mm <- nrow(cn)
            if (mm < ncenters)
                stop("More cluster centers than distinct data points.")
            centers <- cn[sample(1:mm, ncenters), , drop = FALSE]
        }
    }
    else {
        centers <- as.matrix(centers)
        if (any(duplicated(centers)))
            stop("Initial centers are not distinct.")
        cn <- NULL
        ncenters <- nrow(centers)
        if (xrows < ncenters)
            stop("More cluster centers than data points.")
    }
    if (xcols != ncol(centers))
        stop("Must have same number of columns in 'x' and 'centers'.")
    if (iter.max < 1)
        stop("Argument 'iter.max' must be positive.")


    if (method == 2) {
        if (missing(rate.par)) {
            rate.par <- 0.3
        }
    }
    reltol <- control$reltol
    if (is.null(reltol))
        reltol <- sqrt(.Machine$double.eps)
    if (reltol <= 0)
        stop("Control parameter 'reltol' must be positive.")
    if (any(weights < 0))
        stop("Argument 'weights' has negative elements.")
    if (!any(weights > 0))
        stop("Argument 'weights' has no positive elements.")
    weights <- rep(weights, length = xrows)
    weights <- weights / sum(weights)
    perm <- sample(xrows)
    x <- x[perm, ]
    weights <- weights[perm]
    initcenters <- centers
    pos <- as.factor(1:ncenters)
    rownames(centers) <- pos
    if (method == 1) {
        retval <- .C("cmeans", as.double(x), as.integer(xrows),
                     as.integer(xcols), centers = as.double(centers),
                     as.integer(ncenters), as.double(weights), as.double(m),
                     as.integer(dist - 1), as.integer(iter.max),
                     as.double(reltol), as.integer(verbose),
                     u = double(xrows * ncenters), ermin = double(1),
                     iter = integer(1), PACKAGE = "e1071")
    }
    else if (method == 2) {
        retval <- .C("ufcl", x = as.double(x), as.integer(xrows),
                     as.integer(xcols), centers = as.double(centers),
                     as.integer(ncenters), as.double(weights), as.double(m),
                     as.integer(dist - 1), as.integer(iter.max),
                     as.double(reltol), as.integer(verbose),
                     as.double(rate.par), u = double(xrows * ncenters),
                     ermin = double(1), iter = integer(1), PACKAGE = "e1071")
    }
    centers <- matrix(retval$centers, ncol = xcols,
                      dimnames = list(1:ncenters, colnames(initcenters)))
    u <- matrix(retval$u, ncol = ncenters,
                dimnames = list(rownames(x), 1:ncenters))
    u <- u[order(perm), ]
    iter <- retval$iter - 1
    withinerror <- retval$ermin
    cluster <- apply(u, 1, which.max)
    clustersize <- as.integer(table(cluster))


    retval <- list(centers = centers, size = clustersize, cluster = cluster,
                   membership = u, iter = iter, withinerror = withinerror,
                   call = match.call())
    class(retval) <- c("fclust")
    return(retval)
}
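The listing above reproduces the internal source of cmeans, which relies on compiled C routines shipped with the e1071 package; in practice the function is therefore called after loading the package, as in steps 30-32 of the algorithm. A brief usage sketch on an arbitrary complete numeric matrix (not the thesis data):

# Illustrative only
library(e1071)
set.seed(1)
Xc  <- scale(iris[, 1:4])                               # any complete numeric data set
FCM <- cmeans(Xc, centers = 5, iter.max = 100, m = 2)   # C = 5 clusters, fuzziness m = 2
cp  <- unname(FCM$centers)                              # cluster centroids, as in step 32
round(head(FCM$membership), 3)                          # fuzzy membership degrees per instance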



RMSE
# Root mean square error (normalized root mean square error)
RMSE = function(imp, mis, true, norm = TRUE) {
    imp  <- as.matrix(imp)
    mis  <- as.matrix(mis)
    true <- as.matrix(true)
    missIndex <- which(is.na(mis))
    errvec <- imp[missIndex] - true[missIndex]
    rmse   <- sqrt(mean(errvec^2))
    if (norm) {
        rmse <- rmse / sd(true[missIndex])
    }
    return(rmse)
}
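A short usage sketch of the RMSE function, assuming the mcar() function above is available; the objects orig, miss and imp are hypothetical placeholders for the original data, the data with injected missingness, and an imputed copy.

# Illustrative only: normalized RMSE computed over the cells that were missing
set.seed(1)
orig <- as.data.frame(matrix(rnorm(40), ncol = 4))   # fully observed reference data
miss <- mcar(orig, 0.2)                              # same data with ~20% of the values removed
imp  <- miss
imp[is.na(imp)] <- mean(as.matrix(orig))             # a deliberately naive mean imputation
RMSE(imp, miss, orig, norm = TRUE)                   # NRMSE over the missing positions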


Abstract (in Arabic)

Most raw data in the real world contain many errors. Large data warehouses hold various kinds of outliers which, in turn, affect the results of data analysis, and since good models usually require good data, the input data must be suitable in quantity, structure and form, ideally matching every step used for data mining. Unfortunately, real-world databases are strongly affected by negative factors such as noise, missing values, incomplete and irrelevant data, as well as huge sizes in both dimensions, instances and attributes. These problems weaken the results of data analysis; low-quality data therefore lead to low data mining performance. In large data warehouses, data preprocessing is very important for dealing with the problems mentioned above. Data preprocessing comprises many tasks, such as data cleaning, data integration, data transformation, data reduction and data discretization.

Missing data are a common defect in many real-world data sets. Missing data refer to unobserved values in a data set; they can be of different types and may be lost for different reasons, among them unit non-response, item non-response, dropout, human error, equipment failure and hidden (latent) strata. The presence of such defects usually requires a preprocessing stage in which the data are prepared and cleaned so that they become useful and sufficiently clear for the knowledge extraction process.

Statisticians distinguish three classes of missing data: data missing completely at random (MCAR), data missing at random (MAR) and data not missing at random (NMAR).

There are many strategies for dealing with missing values. The simplest solution is to reduce the data set by discarding every sample that contains missing values. Another solution is called the tolerance method. Finally, the problem of missing values can be handled through imputation methods for the various missing values, and imputation has thus become one of the most common solutions for dealing with missing data.

In this thesis, an algorithm is proposed that improves the (MIGEC) algorithm by using an imputation approach to deal with missing values. GRA is applied to the attribute values instead of the instance values, and the missing data are first imputed with the mean and then estimated by the proposed algorithm, with each estimate used as a complete value when imputing the next missing value.

The proposed algorithm was compared with several other algorithms, namely (MMS, HDI, KNNMI, FCMOCS, CRI, CMI, NIIA and MIGEC), under different missing data mechanisms. The experimental results show that the proposed algorithm has lower root mean square error (RMSE) values than the other algorithms under all missing data mechanisms.

Abstract (in Kurdish)

With the continuous progress of information technology, the data recorded in the databases of companies and government institutions are constantly growing. Accurate information mining from data warehouses has therefore become an important field within data mining research. The presence of noisy data or missing values in these data warehouses creates the problem of inaccurate information-mining results, and for this reason preprocessing methods have attracted ever more attention from researchers in this field.

Missing values are one of the main problems of data warehouses and appear continuously; the causes of missingness go back to data not being recorded by the machines (machine failure), human error during recording, or the withholding of sensitive information. The presence of these shortcomings always calls for data preprocessing methods, so that the prepared data become useful for the information-mining process.

Statisticians classify missing data into three types: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR).

Various methods and strategies are used to estimate the values of missing data. The simplest is deleting every sample that contains missing data through case (instance) deletion of the records holding missing values; besides this, other methods such as the tolerance method are also used. These methods, however, reduce the amount of data in the data warehouse on the one hand and lose the information attached to the records with missing values on the other. Estimating, or re-deriving, the values of missing data (data imputation) is one of the ways of handling missing values.

This thesis builds on one of the missing data imputation algorithms, proposed by (Tian, Yu, Ma) and named (MIGEC). The attribute (column) values of the data are processed in the GRA part of the algorithm instead of the instance (row) values on the one hand; on the other, the missing data are first computed using mean imputation and afterwards estimated by the proposed algorithm, in such a way that the estimated value of the first missing datum is used when estimating the value of the next missing datum.

After comparing the results of the proposed algorithm with several other algorithms, such as (MMS, HDI, KNNMI, FCMOCS, CRI, CMI, NIIA and MIGEC), and using different missing mechanisms, the experimental results show that the proposed algorithm has the lowest RMSE value compared with the other algorithms under all missing data mechanisms.

‫دور طرق احتشاب البيانات املفكودة عمى دقة نتائج تهكيب‬ ‫البيانات‬ ‫رسالة مكدمة‬ ‫اىل جممص كمية األدارة و األقتصاد – جامعة الشميمانية‬ ‫و هي جزء مو متطمبات نين درجة ماجشتري عموم يف االحصاء‬ ‫مو قبن‬

‫ذيان حممد عمر‬

‫بإشراف‬ ‫االستاذ املشاعد‬

‫د‪.‬نزار عبدالكادر عمي‬

‫‪ 1437‬ه‬

‫‪ 2716‬ك‬

‫‪ 2016‬م‬

‫طرنطى ريَطاكانى ئةذماركردنى داتاى وون بوو لةسةر دروستى‬ ‫ئةجنامةكانى هةلَةجنيَنانى داتا‬

‫نامةيةكي ماجستيَرة‬ ‫ثيَشكةش كراوة بة ئةجنومةني كؤليَجي كارطيَرِي و ئابوررى‪ -‬زانكؤي سميَماني وةك بةشيَك‬ ‫لة ثيَداويستيةكاني وةدةستويَناني ثمةى ماجستيَر لة زاسيت ئامار‬

‫لةاليةن‬

‫ذيان حممد عمر‬ ‫بةسةرثةرشيت‬ ‫ثرؤفيسؤري ياريدةدةر‬

‫د‪.‬نزار عبدالكادر عمي‬

‫‪ 1437‬ه‬

‫‪ 2716‬ك‬

‫‪ 2016‬م‬
