The Role of Missing Data Imputation Methods on the Accuracy of Data Mining Results

A Thesis
Submitted to the Council of the College of Administration and Economics, University of Sulaimani, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Statistics

By
Zhyan Mohammed Omer

Supervised by
Assistant Professor Dr. Nzar Abdulqader Ali

2016 AD    2716 Kurdish    1437 H
In the name of Allah, the Most Gracious, the Most Merciful

"They said: Glory be to You! We have no knowledge except what You have taught us. Indeed, You are the All-Knowing, the All-Wise."

(Quran, Surat Al-Baqarah: 32)
Dedication

This thesis is dedicated to:
My dear father, God bless him
My dear mother
My dear sisters
My dear brother
With love ........

Zhyan M. Omer
Linguistic Evaluation Certification

This is to certify that I, Rangi Shorsh Rauf, have proofread this thesis entitled "The Role of Missing Data Imputation Methods on the Accuracy of Data Mining Results" by Zhyan Mohammed Omer. After marking and correcting the mistakes, the thesis was handed back to the researcher to make the corrections in this final copy.

Signature:
Proofreader: Rangi Shorsh Rauf
Title: Assistant Lecturer
Date: / 10 / 2016

Department of English, School of Languages, Faculty of Humanities, University of Sulaimani.
Supervisor's Certification

I certify that this thesis, entitled "The Role of Missing Data Imputation Methods on the Accuracy of Data Mining Results" and accomplished by (Zhyan Mohammed Omer), was prepared under my supervision at the University of Sulaimani, College of Administration and Economics, Statistics Department, as partial fulfillment of the requirements for the degree of Master of Science in Statistics.

Signature:
Supervisor: Dr. Nzar Abdulqader Ali
Title: Assistant Professor
Date: / 10 / 2016
Chairman's Certification

In view of the available recommendations, I forward this thesis for debate by the examining committee.

Signature:
Name: Dr. Samira M. Salih
Title: Assistant Professor
Higher Education Committee
Date: / 10 / 2016
Examination Committee Certification

We, the examining committee, certify that we have read this thesis, entitled "The Role of Missing Data Imputation Methods on the Accuracy of Data Mining Results", and have examined the student (Zhyan Mohammed Omer) in its contents, and that in our opinion it is adequate as a thesis for the degree of Master of Science in Statistics.

Signature:
Name: Dr. Nawzad M. Ahmad
Title: Assistant Professor
Committee Head
Date: / 10 / 2016

Signature:
Name: Dr. Mohammed M. Faqe
Title: Assistant Professor
Member
Date: / 10 / 2016

Signature:
Name: Dr. Soran A. Bkr Mohammed
Title: Assistant Professor
Member
Date: / 10 / 2016

Signature:
Name: Dr. Nzar Abdulqader Ali
Title: Assistant Professor
Member / Supervisor
Date: / 10 / 2016
Signature:
Name: Dr. Kawa M. Jamal Rashid
Title: Assistant Professor
Dean of the College of Administration and Economics
Date: / 10 / 2016
Acknowledgements

First of all, I would like to thank Allah for helping me to accomplish my study.

I want to express my warmest thanks and gratitude to my supervisor, Assist. Prof. Dr. Nzar Abdulqader Ali, for his invaluable advice, guidance, support and suggestions throughout my study.

I would like to extend my thanks to the Dean of the College of Administration and Economics, Assist. Prof. Dr. Kawa Mohammed Jamal Rashid; the Head of the Statistics Department, Assist. Prof. Dr. Mohammed Mahmmud Faqe; and the Administrator of the Higher Education Unit, Assist. Prof. Dr. Samira Mohammed Salih.

Special thanks to all the teachers who taught and helped me during the course of my study. I also want to thank all the staff of the library, the Higher Education Unit and the Statistics Department, and all the people who helped me to complete this thesis.

I want to express my heartfelt gratitude to my parents for their passionate love, care and help, without whom I could not have overcome the difficulties; especially my father, who always supported the decisions I made in my life and my interest in science.
Abstract

Raw data in the real world are dirty. Every large data repository contains various types of anomalous values that influence the results of analysis. Since in data mining good models usually need good data, input data must be provided in the amount, structure and format that suit each data mining (DM) task. Unfortunately, real-world databases are strongly affected by negative factors such as the presence of noise, missing values (MVs), inconsistent and superfluous data, and huge sizes in both dimensions (examples and features); these problems cause the analysis to perform poorly. Thus, low-quality data will lead to low-quality data mining performance. In large data repositories, data preprocessing is therefore very important for dealing with such problems. Data preprocessing comprises several tasks, such as data cleaning, data integration, data transformation, data reduction and data discretization.

Missing data are a common drawback in many real-world data sets. Missing data refers to unobserved values in a data set, which can be of different types and may be missing for different reasons, including unit nonresponse, item nonresponse, dropout, human error, equipment failure, and latent classes. The presence of such imperfections typically requires a preprocessing stage in which the data are arranged and cleaned, with the specific goal of being valuable to, and adequately clear for, the knowledge extraction process.

Analysts distinguish three classes of missing data: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR).

There are various procedures for dealing with missing values. The most straightforward solution is the reduction of the data set by eliminating all samples with missing values. Another solution is the tolerance method. Finally, the missing-values problem can be handled by various imputation methods; imputation has recently become one of the most popular solutions for dealing with missing data. In this thesis, we propose an algorithm that improves the MIGEC algorithm's way of imputing missing values. We apply grey relational analysis (GRA) to attribute values instead of instance values; the missing data are initially imputed by mean imputation and then estimated by our proposed algorithm (PA), with each imputed value used as a complete value when imputing the next missing value.

We compare our proposed algorithm with several other algorithms, namely Mean Mode Substitution (MMS), Hot Deck Imputation (HDI), K-Nearest Neighbour Imputation with Mutual Information (KNNMI), Fuzzy C-Mean based on Optimal Completion Strategy (FCMOCS), Clustering-based Random Imputation (CRI), Clustering-based Multiple Imputation (CMI), Non-Parametric Iterative Imputation Algorithm (NIIA) and Multiple Imputation algorithm using Gray System Theory and Entropy based on Clustering (MIGEC), under different missing mechanisms. Experimental results demonstrate that the proposed algorithm has lower RMSE values than the other algorithms under all missingness mechanisms.
List of Contents

Acknowledgements .......... I
Abstract .......... II
List of Contents .......... IV
List of Figures .......... VII
List of Tables .......... X
List of Abbreviations .......... XII

Chapter 1: Introduction, Literature Review and The Aim of The Study
1.1 Introduction .......... 1
1.2 Literature Review .......... 6
1.3 The Aim of The Study .......... 13
1.4 The Layout of The Thesis .......... 13

Chapter 2: Theoretical Part
2.1 Introduction .......... 14
2.2 Missing Value .......... 20
2.2.1 Missing Data Mechanisms .......... 21
2.2.2 Methods for Handling Incomplete Data .......... 23
2.3 Simulation .......... 27
2.4 Clustering .......... 28
2.4.1 Fuzzy C-Mean Clustering .......... 29
2.5 Grey System Theory .......... 31
2.5.1 Grey Relational Analysis .......... 31
2.6 Classification .......... 35
2.6.1 Decision Tree .......... 37
2.6.2 Entropy and Information Gain .......... 38
2.7 Hybrid Imputation .......... 40
2.8 Proposed Algorithm .......... 40
2.8.1 Steps of Algorithm .......... 43
2.9 The Framework of Proposed Algorithm .......... 48

Chapter 3: Practical Part
3.1 Introduction .......... 50
3.2 Dataset .......... 50
3.2.1 Wine Data Set .......... 50
3.2.2 Simulated Data .......... 52
3.3 Generating Missingness .......... 53
3.4 Performance Measure .......... 56
3.5 Optimality Measure .......... 57
3.5.1 Number of Iterations .......... 57
3.5.2 Number of Clusters .......... 59
3.6 Comparative Experiments .......... 62
3.7 Results of Proposed Algorithm for Simulated Data .......... 67
3.7.1 Number of Iterations .......... 67
3.7.2 Number of Clusters .......... 69
3.8 Comparison between Wine and Simulated Dataset .......... 72
3.9 Comparing Algorithms by Number of Iterations .......... 76

Chapter 4: Conclusions and Future Work
4.1 Conclusions .......... 77
4.2 Future Work .......... 78

References .......... 80
Appendixes .......... 88
Abstract in Arabic and Kurdish
List of Figures

Figure No. | Figure Name | Page No.
1.1 | KDD process | 3
1.2 | DM methods | 5
2.1 | Forms of Data Preprocessing | 19
2.2 | Data set with MVs denoted with the symbol '?' | 21
2.3 | Difference between Hard and Soft Clustering | 29
2.4 | General Approach for Building a Classification Model | 36
2.5 | Decision Tree Example | 38
2.6 | Entropy in the case of two possibilities with probabilities p and (1-p) | 40
2.7 | The Flowchart of the Proposed Algorithm | 42
3.1 | Sample of Wine dataset | 51
3.2 | Sample of simulated dataset | 53
3.3 | Sample of Wine dataset missed by MCAR | 54
3.4 | Sample of Wine dataset missed by MAR | 55
3.5 | Sample of Wine dataset missed by NMAR | 56
3.6 | Checking Optimality by number of Iterations (for MCAR) | 57
3.7 | Checking Optimality by number of Iterations (for MAR) | 58
3.8 | Checking Optimality by number of Iterations (for NMAR) | 59
3.9 | Checking Optimality by number of clusters (for MCAR) | 60
3.10 | Checking Optimality by number of clusters (for MAR) | 61
3.11 | Checking Optimality by number of clusters (for NMAR) | 62
3.12 | Comparison between proposed algorithm and other methods for imputation (MCAR) | 65
3.13 | Comparison between proposed algorithm and other methods for imputation (MAR) | 66
3.14 | Comparison between proposed algorithm and other methods for imputation (NMAR) | 66
3.15 | Checking Optimality by number of Iterations (for MCAR) *simulated data | 67
3.16 | Checking Optimality by number of Iterations (for MAR) *simulated data | 68
3.17 | Checking Optimality by number of Iterations (for NMAR) *simulated data | 69
3.18 | Checking Optimality by number of clusters (for MCAR) *simulated data | 70
3.19 | Checking Optimality by number of clusters (for MAR) *simulated data | 70
3.20 | Checking Optimality by number of clusters (for NMAR) *simulated data | 71
3.21 | Result of proposed algorithm for Simulated data under different mechanisms with varying missing rates | 72
3.22 | Comparison between wine and simulated dataset for various iterations (MCAR) | 73
3.23 | Comparison between wine and simulated dataset for various iterations (MAR) | 73
3.24 | Comparison between wine and simulated dataset for various iterations (NMAR) | 73
3.25 | Comparison between wine and simulated dataset for various clusters (MCAR) | 74
3.26 | Comparison between wine and simulated dataset for various clusters (MAR) | 74
3.27 | Comparison between wine and simulated dataset for various clusters (NMAR) | 74
List of Tables

Table No. | Table Name | Page No.
2.1 | Listwise Deletion | 24
2.2 | Pairwise Deletion | 24
3.1 | Information about Wine dataset | 51
3.2 | Mean and Standard deviation of Wine data | 53
3.3 | Checking Optimality by number of Iterations (for MCAR) | 57
3.4 | Checking Optimality by number of Iterations (for MAR) | 58
3.5 | Checking Optimality by number of Iterations (for NMAR) | 59
3.6 | Checking Optimality by number of clusters (for MCAR) | 60
3.7 | Checking Optimality by number of clusters (for MAR) | 60
3.8 | Checking Optimality by number of clusters (for NMAR) | 61
3.9 | Comparison between Proposed algorithm and other methods for imputation (MCAR) | 63
3.10 | Difference between proposed algorithm and other algorithms for MCAR | 63
3.11 | Comparison between Proposed algorithm and other methods for imputation (MAR) | 63
3.12 | Difference between proposed algorithm and other algorithms for MAR | 64
3.13 | Comparison between Proposed algorithm and other methods for imputation (NMAR) | 64
3.14 | Difference between proposed algorithm and other algorithms for NMAR | 64
3.15 | Checking Optimality by number of Iterations (for MCAR) *simulated data | 67
3.16 | Checking Optimality by number of Iterations (for MAR) *simulated data | 68
3.17 | Checking Optimality by number of Iterations (for NMAR) *simulated data | 68
3.18 | Checking Optimality by number of clusters (for MCAR) *simulated data | 69
3.19 | Checking Optimality by number of clusters (for MAR) *simulated data | 70
3.20 | Checking Optimality by number of clusters (for NMAR) *simulated data | 71
3.21 | Result of proposed algorithm for Simulated data under different mechanisms with varying missing rates | 72
3.22 | Difference between RMSE of Simulated and Wine dataset | 75
3.23 | Comparing number of iterations between proposed algorithm and existing algorithms | 76
List of Abbreviations

CDI: Cold Deck Imputation
CMI: Clustering-based Multiple Imputation
CRI: Clustering-based Random Imputation
DM: Data Mining
EM: Expectation Maximization
FCM: Fuzzy C-Mean Clustering
FCMOCS: Fuzzy C-Mean based on Optimal Completion Strategy
GRA: Grey Relational Analysis
GSA: Grey System Theory
HDI: Hot Deck Imputation
KDD: Knowledge Discovery in Databases
KNNMI: K-Nearest Neighbour Imputation with Mutual Information
MAR: Missing at Random
MCAR: Missing Completely at Random
MIGEC: Multiple Imputation algorithm using Gray System Theory and Entropy based on Clustering
MMS: Mean Mode Substitution
MV: Missing Value
NIIA: Non-Parametric Iterative Imputation Algorithm
NMAR: Not Missing at Random
NRMSE: Normalized Root Mean Square Error
PA: Proposed Algorithm
RMSE: Root Mean Square Error
UCI: University of California, Irvine
Chapter One
Introduction, Literature Review and The Aim of The Study

1.1 Introduction

Data mining (DM), also popularly known as Knowledge Discovery in Databases (KDD), is "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" [69].

Vast amounts of data are around us in our world, and raw data are mostly intractable for human or manual applications, so the analysis of such data is now a necessity. The World Wide Web (WWW), business-related services, society, and applications and networks for science or engineering, among others, have been continuously generating data at exponential growth since the development of powerful storage and connection tools. This immense data growth does not easily allow useful information or organized knowledge to be understood or extracted automatically. This fact has led to the emergence of Data Mining (DM), which is currently a well-known discipline increasingly present in the world of the Information Age. DM is about solving problems by analyzing data present in real databases. Nowadays, it is regarded as the science and technology of exploring data to discover previously unknown patterns. Many people treat DM as a synonym of the Knowledge Discovery in Databases (KDD) process, while others view DM as the main step of KDD [8][11][16].

The KDD process is divided into six different stages according to the agreement of several important researchers [18].
1. Problem Specification: Designating and arranging the application domain, the relevant prior knowledge obtained by experts, and the final objectives pursued by the end-user.
2. Problem Understanding: Comprehension of both the selected data and the associated expert knowledge, in order to achieve a high degree of reliability.
3. Data Preprocessing: This stage includes operations for data cleaning (such as removing noise and inconsistent data), data integration (where multiple data sources may be combined into one), data transformation (where data are transformed and consolidated into forms appropriate for specific DM tasks or aggregation operations), data reduction (including feature and example selection and extraction) and imputation of missing data.
4. Data Mining: The essential process in which methods are used to extract valid data patterns. This step includes the choice of the most suitable DM task (such as classification, regression, clustering or association), the choice of the DM algorithm itself belonging to one of the previous families, and, finally, the employment and accommodation of the selected algorithm to the problem by tuning essential parameters and validation procedures.
5. Evaluation: Estimating and interpreting the mined patterns based on interestingness measures.
6. Result Exploitation: The last stage may involve using the knowledge directly, incorporating the knowledge into another system for further processing, or simply reporting the discovered knowledge through visualization tools.
Figure (1.1) summarizes the KDD process and shows the six stages mentioned previously. It is worth mentioning that all the stages are interconnected, showing that the KDD process is actually a self-organized scheme in which each stage conditions the remaining stages, and the reverse path is also allowed.

Figure (1.1): KDD process [18]
The objective of data mining is both prediction and description: to predict unknown or future values of the attributes of interest using other attributes in the databases, while describing the data in a manner understandable and interpretable by human beings. Predicting the sales of a new product based on advertising expenditure, or predicting wind velocities as a function of temperature, humidity, air pressure, etc., are examples of tasks with a predictive goal in data mining. Describing the different terrain groupings that emerge in a sampling of satellite imagery is an example of a descriptive goal for a data mining task. The relative importance of description and prediction can vary between different applications. These two goals can be fulfilled by any of a number of data mining tasks (techniques), including classification, regression, clustering, summarization, dependency modeling, and deviation detection [11][20].

A large number of DM techniques are well known and used in many applications. Figure (1.2) shows a division of the main DM methods according to two ways of obtaining knowledge: prediction and description. Within the prediction family of methods, two main groups can be distinguished: statistical methods and symbolic methods [6]. Statistical methods are usually characterized by the representation of knowledge through mathematical models with computations. In contrast, symbolic methods prefer to represent the knowledge by means of symbols and connectives, yielding models that are more interpretable for humans.
Figure (1.2): DM Methods [18]

The next important step is the data to be used. Input data must be provided in the amount, structure and format that suit each DM task perfectly. It is unrealistic to expect that data will be perfect after they have been extracted. Since good models usually need good data, a thorough cleansing of the data is an important step to improve the quality of data mining methods. Not only the correctness but also the consistency of values is important. Missing data can be a particularly pernicious problem: especially when the number of missing values is large, not all attributes (instances) with missing values can be deleted from the sample. Moreover, the fact that a value is missing may be significant in itself. A widely applied approach is to calculate a substitute value for missing data, for example the median or mean of a variable. Furthermore, several data mining approaches (especially many clustering approaches, but also some learning methods) can be modified to ignore missing values [5][22].

There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise, correct inconsistencies in the data and fill missing values. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse. Data transformations, such as normalization, may also be applied; for example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements. Data reduction can reduce the data size by aggregating, eliminating redundant features, or clustering, for instance. These techniques are not mutually exclusive; they may work together. For example, data cleaning can involve transformations to correct wrong data, such as transforming all entries for a date field to a common format. Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and/or the time required for the actual mining [10].
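The mean-substitution and normalization operations mentioned above can be made concrete with a short sketch. The following Python fragment is illustrative only: the toy column of values is hypothetical, and real preprocessing would operate on whole datasets rather than a single attribute.

```python
# Toy attribute (column) with one missing entry, represented by None.
values = [12.0, None, 10.0, 14.0]

# Mean imputation: replace each missing entry with the mean
# of the observed values in the same attribute.
observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)
imputed = [mean if v is None else v for v in values]

# Min-max normalization: rescale the completed column to [0, 1],
# which can help distance-based mining algorithms.
lo, hi = min(imputed), max(imputed)
normalized = [(v - lo) / (hi - lo) for v in imputed]

print(imputed)     # [12.0, 12.0, 10.0, 14.0]
print(normalized)  # [0.5, 0.5, 0.0, 1.0]
```

Mean imputation of this kind is the simplest of the substitution strategies discussed in this thesis; it preserves the attribute mean but shrinks its variance, which is one motivation for the more refined imputation methods compared later.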
1.2 Literature Review : M. D. Zio , U. Guarnera , O. Luzi , (2007)
[52]
, they proposed a
technique based on finite mixture of multivariate Gaussian distributions for handling missing data. The main reason is that it allows to control the trade-off between parsimony and flexibility. An experimental comparison with the widely used imputation nearest neighbor donor is illustrated .
7
Chapter One : Introduction, Literature Review and The Aim of The Study
C.Y. J. Peng , J. Zhu , (2008)
[32]
,This study seeks to fill the
void(null) by comparing two approaches for handling missing data in categorical covariates in logistic regression: the expectation-maximization (EM) method of weights and multiple imputation (MI). Missing data on covariates are simulated under two conditions: missing completely at random and missing at random with different missing rates. A logistic regression model was fit to each sample using either the EM or MI approach. The performance of these two approaches is compared on four criteria: bias, efficiency, coverage, and rejection rate. Results generally favored MI over EM. S. Zhang, J. Zhang, X .Zhu, Y .Qin, C. Zhang, (2008)
[67]
, They
proposed an efficient nonparametric missing value imputation method based on clustering, called CMI (Clustering-based Missing value Imputation), for dealing with missing values in target attributes. In their approach, They impute the missing values of an instance A with plausible values that are generated from the data in the instances which do not contain missing values and are most similar to the instance A using a kernel-based method. Specifically, They first divide the dataset (including the instances with missing values) into clusters. Next , missing values of an instance A are patched up with the plausible values generated from A’s cluster. Extensive experiments show the effectiveness of the proposed method in missing value imputation task. P.J.G.Laencina , J.L.S.Gómez, A.R.F.Vidal, M.Verleysen, (2009) [59] , they propose a novel KNN imputation procedure using a feature-weighted distance metric based on mutual information (MI). This method provides a missing data estimation aimed at solving the classification task i.e., it provides an imputed dataset which is directed toward improving the
Chapter One : Introduction, Literature Review and The Aim of The Study
8
classification performance. The MI-based distance metric is also used to implement an effective KNN classifier . A.D.Nuovo, (2011)
[24]
, This paper introduced the most common
solutions to this problem offered by the most popular statistical software and a technique (Case Deletion) based on the most famous fuzzy clustering algorithm: Fuzzy C-Means (FCM). Then compared these methodologies in order to highlight the peculiar characteristics of each solution. The comparison was made in a psychological research environment, using a database of in-patients who have a diagnosis of mental retardation. The results demonstrate that completion techniques, and in particular the one based on FCM, lead to effective data imputation, avoiding the deletion of elements with missing data, which diminishes the power of the research. N. Ankaiah ,V. Ravi, ( 2011) [56] , They propose a novel 2-stage soft computing approach for data imputation, involving local learning and global approximation in tandem. In stage 1, K-means algorithm is used to replace the missing values with cluster centers. Stage 2 refines the resultant approximate values using multilayer perceptron (MLP). MLP is trained on the complete data with the attribute having missing values as the target variable one at a time. In all datasets, some values, which are randomly removed, are treated as missing values. The actual and the predicted values obtained are compared by using Mean Absolute Percentage Error (MAPE). They observe that, the MAPE value is reduced from stage 1 to stage 2, indicating the hybrid approach resulted in better imputation compared to stage 1 alone. J.Tian , B. Yu , D. Yu , Sh. Ma , (2012)
[46]
,They focus on an
algorithm of a fuzzy clustering approach for missing value imputation with noisy data immunity. The PCFKMI (Pre-Clustering based Fuzzy K-Means
Chapter One : Introduction, Literature Review and The Aim of The Study
9
Imputation) method aggregates data instances to more accurate clusters for further appropriate estimation via information entropy after resampling preclustering and outlier test . their experimental results demonstrate that the PCFKMI proposed obtains higher precision both on quantitative and on nominal attributive missing value completion than other classic methods under all missingness mechanisms at varying missing rates with abnormal values. S. Gajawada , D. Toshniwal , (2012)
[62]
, They proposed a missing
value imputation method based on K-Means and nearest neighbors. Their method uses the imputed objects for further imputations. They propose a method that has been applied on clinical datasets from UCI Machine Learning Repository, their results shows their proposed method performed better than simple method (without using imputed values for further imputations) but it is not the case for all the datasets as error in earlier imputation may propagate to further imputations. M. M. Rahman, D. N. Davis , ( 2013 ) [55] , They explored the use of a machine learning technique as a missing value imputation method for incomplete cardiovascular data. Mean/mode imputation, fuzzy unordered rule induction algorithm imputation, decision tree imputation and other machine learning algorithms are used as missing value imputation and the final datasets are classified using decision tree, fuzzy unordered rule induction, KNN and K-Mean clustering. The experiment shows that final classifier performance is improved when the fuzzy unordered rule induction algorithm is used to predict missing attribute values for K-Mean clustering and most of the cases machine learning technique found to be performed better than the mean imputation.
Chapter One : Introduction, Literature Review and The Aim of The Study
10
Z. Chi , F. H. cai , J. Kai , Y. Ting , (2013) [73] ,In order to improve the accuracy of filling in the missing data, a filling missing data algorithm of the nearest neighbor based on the cluster analysis is proposed by this paper. After clustering data analysis , the algorithm assigns weights according to the categories and improves calculation formula and filling value calculation based on the MGNN (Mahalanobis-Gray and Nearest Neighbor algorithm) algorithm. Their experimental results show that the filling accuracy of the method is higher than traditional KNN algorithm and MGNN algorithm . F. C. S. Liu , (2014)
[36]
, This paper applies three types of multiple
imputation (MI) method such as [Amelia II, MICE, and mi] to reconstruct the distribution of vote choice in the sample. Vote choice is one of the most important dependent variables in political science studies.This paper shows how the multiple imputation procedure in general facilitates the work of reconstructing the distribution of a targeted variable. Particularly, it shows how MI can be applied to point-estimation in descriptive statistics. G. Sang , K. Shi , Zh. Liu , L. Gao, (2014) [39] , They proposed a new weighted KNN data filling algorithm based on grey correlation analysis (GBWKNN) by researching the nearest neighbor of missing data filling method. It is combined with grey system theory and the advantage of the K nearest neighbor algorithm. The experimental results on six UCI data sets showed the filling algorithm in theis paper is superior to KNN algorithm and the filling algorithm proposed by Huang and Lee [27] H. Li , C. Zhao, F. Shao , G. Zheng Li , X. Wang , (2014) [40] , In this paper ,they propose a novel hybrid imputation method, called Recursive Mutual Imputation (RMI). Specifically, RMI exploits global correlation information and local structure in the data, captured by two popular
Chapter One : Introduction, Literature Review and The Aim of The Study
11
methods, Bayesian Principal Component Analysis (BPCA) and Local Least Squares (LLS), respectively. Mutual strategy is implemented by sharing the estimated data sequences at each recursive process. Meanwhile, they consider the imputation sequence based on the number of missing entries in the target gene. Furthermore, a weight based integrated method is utilized in the final assembling step. It is noted that their proposed hybrid imputation approach incorporates both global and local information of microarray genes, which achieves lower NRMSE values against to any single approach only. J. Tian , B. Yu , D. Yu , Sh. Ma , (2014)
[47]
,In this paper they
proposed a hybrid missing data completion method named Multiple Imputation using Gray-system-theory and Entropy based on Clustering (MIGEC). Firstly, the non-missing data instances are separated into several clusters. Then, the imputed value is obtained after multiple calculations by utilizing the information entropy of the proximal category for each incomplete instance in terms of the similarity metric based on Gray System Theory (GST). Minakshi
, R. Vohra, Gimpy , (2014)
[51]
, In this paper three
techniques are used to impute missing values named as Litwise deletion , mean/mode imputation, KNN (k nearest neighbor) . C4.5/J48 classification algorithm is applied to these imputed datasets. In this work analyzes the performance of imputation methods using C4.5 classifier on the basis of accuracy for handling missing data or value. After that decision which imputation method is the best method to handle missing value. Their experimental results show that accuracy of KNN is greater than other two techniques.
X.Y. Zhou, J. S. Lim (2014) [71]: They studied a new method, the NB-EM (Naïve Bayesian-Expectation Maximization) algorithm, for handling missing values. The performance of this method is compared with the traditional EM (Expectation Maximization) method and non-substitution approaches for dealing with datasets containing randomly missing values. They determined the most effective method: compared with the traditional EM algorithm, the NB-EM algorithm has a higher accuracy rate, which suggests that the NB-EM algorithm can obtain a better effect on missing values in practice. O. B. Shukur, M. H. Lee (2015) [57]: In this study, a hybrid artificial neural network (ANN) and autoregressive (AR) method is proposed for imputing missing values. ANN is a nonlinear method that is capable of imputing the missing values in wind speed data with nonlinear characteristics. The AR model is used for determining the structure of the input layer of the ANN. Listwise deletion is used before AR modeling to handle the missing values. A case study is carried out using daily Iraqi and Malaysian wind speed data. The proposed imputation method is compared with linear, nearest neighbor, and state space methods. The comparison has shown that AR-ANN outperformed the classical methods. In conclusion, the missing values in wind speed data with nonlinear characteristics can be imputed more accurately using AR-ANN. Therefore, imputing the missing values using AR-ANN leads to more accurate time series modeling and analysis.
1.3 The Aim of The Study
The aim of this study is to investigate the impact of data preprocessing on the accuracy of DM models; imputing missing data is one of the steps of data cleaning in data preprocessing. In this thesis an algorithm is proposed for dealing with missing values, based on the existing MIGEC algorithm proposed by J. Tian, B. Yu, D. Yu, and Sh. Ma.
1.4 The Layout of The Thesis
The rest of this thesis is organized as follows: Chapter Two first discusses data preprocessing in data mining, then missing values and how to deal with them; at the end of the chapter, the proposed algorithm used to impute missing values is explained in detail. Chapter Three presents the experimental results for the proposed algorithm; comparisons between different algorithms are made and the results are discussed in detail. Chapter Four contains the main conclusions together with recommendations for future work.
Chapter Two Theoretical Part
2.1 Introduction
Data pre-processing is an important and critical step in the data mining process and it has a huge impact on the success of a data mining project. Data pre-processing is the step of the Knowledge Discovery in Databases (KDD) process that reduces the complexity of the data and offers better conditions for subsequent analysis. Through it, the nature of the data is better understood and the data analysis is performed more accurately and efficiently [10]. Data preprocessing is composed of several steps, depicted in Figure (2.1):
1. Data Cleaning
Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data [2].
a. Missing Values: Missing data refers to unobserved values in a data set, which can be of various sorts and may be missing for various reasons. These reasons include unit nonresponse, item nonresponse, dropout, human error, hardware failure, and latent classes [21].
b. Noisy Data: Noise is a random error or variance in a measured variable. Noise identification, known as smoothing in data transformation, has as its main objective the detection of random errors or variances in a measured variable. Popular techniques for noise
smoothing include binning, regression, and clustering. Most of these techniques include a data discrimination phase [42].
c. Outliers: An outlier is an observation point that is distant from other observations. Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered outliers [19].
2. Data Integration
It comprises the merging of data from multiple data sources. This process must be carefully performed in order to avoid redundancies and inconsistencies in the resulting data set. Typical operations accomplished within data integration are the identification and unification of variables and domains, the analysis of attribute correlation, the detection of duplicate tuples and the detection of conflicts in data values from different sources [18].
3. Data Transformation
In this preprocessing step, the data is converted or consolidated so that the mining process result could be applied or may be more efficient. Data transformation can involve the following steps [12]:
a. Smoothing, which works to remove noise from the data. Such techniques include binning, regression, and clustering.
b. Aggregation, where summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated so as to compute monthly and annual totals. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.
c. Generalization of the data, where low-level or “primitive” (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be
generalized to higher-level concepts, like city or country. Similarly, values for numerical attributes, like age, may be mapped to higher-level concepts, like youth, middle-aged, and senior.
d. Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1 to 1, or 0 to 1. The most common normalization techniques are [18]:
1. Min-Max Normalization: Performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute A. Min-max normalization maps a value v of A to v' in the range [new_minA, new_maxA] by computing:

v' = ((v − minA) / (maxA − minA)) (new_maxA − new_minA) + new_minA ………………… (2.1)
Min-max normalization preserves the relationships among the original data values. It will encounter an “out-of-bounds” error if a future input case for normalization falls outside of the original data range for A.
2. Z-score Normalization (or zero-mean normalization): In some cases, min-max normalization is not useful or cannot be applied. When the minimum or maximum values of attribute A are not known, min-max normalization is infeasible. Even when the minimum and maximum values are available, the presence of outliers can bias min-max normalization by grouping the values and limiting the digital precision available to represent them. If Ā is the mean of the values of attribute A and SA is the standard deviation, an original value v of A is normalized to v' using:

v' = (v − Ā) / SA ………………………………………....… (2.2)
By applying this transformation the attribute values now present a mean equal to 0 and a standard deviation of 1. If the mean and standard deviation associated with the probability distribution are not available, it is usual to use instead the sample mean and standard deviation:

Ā = (1/n) Σᵢ vᵢ .. .………………………………......……( 2.3 )

and

SA = √( (1/(n−1)) Σᵢ (vᵢ − Ā)² ) ………………….….…… ( 2.4 )
3. Decimal Scaling Normalization: A simple way to reduce the absolute values of a numerical attribute is to normalize its values by shifting the decimal point, dividing by a power of ten such that the maximum absolute value is always lower than 1 after the transformation. This transformation is commonly known as decimal scaling and is expressed as:

v' = v / 10ʲ ……………………….…………..……….. (2.5)

where j is the smallest integer such that new_maxA < 1.
e. Attribute construction (feature construction), where new attributes are constructed and added from the given set of attributes to help the mining process.
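As a worked illustration, the three normalization techniques above can be sketched in Python (a minimal sketch; the sample values and function names are not from the thesis):

```python
# Illustrative sketch of min-max (2.1), z-score (2.2)-(2.4) and decimal
# scaling (2.5) normalization; the ages list is a made-up example.

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: linear map into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """Z-score normalization using the sample mean and standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of ten that brings the maximum
    absolute value below 1."""
    j = 0
    while max(abs(v) for v in values) / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]

ages = [18, 25, 40, 62, 75]
print(min_max(ages))          # values now lie in [0, 1]
print(decimal_scaling(ages))  # maximum absolute value below 1
```

After min-max scaling the smallest age maps to 0 and the largest to 1; after z-score scaling the values have mean 0 and standard deviation 1.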
4. Data Reduction
Data reduction techniques are used in order to obtain a new
representation of the data set that is much smaller in volume, but yet produces the same (or almost the same) analytical results [12] . Strategies for data reduction include the following [10] : a) Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube. b) Attribute subset selection, where irrelevant, weakly relevant or redundant attributes or dimensions may be detected and removed. c) Dimensionality reduction, where encoding mechanisms are used to reduce the data set size. d) Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.
5. Discretization and Concept Hierarchy Generation
Data discretization techniques can be used to reduce the number of
values for a given continuous attribute by dividing the range of the attribute into a finite number of intervals. Interval labels can then be used to replace actual data values. Data discretization is a form of data reduction that is very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction [10][18].
Figure (2.1): Forms of Data Preprocessing [10]
2.2 Missing Value
Many existing industrial and research data sets contain missing values (MVs) in their attribute values, as shown in Figure (2.2). Intuitively, an MV is just a value for an attribute that was not introduced or was lost in the recording process. There are various reasons for their existence, such as manual data entry procedures, equipment errors and incorrect measurements. The presence of such imperfections usually requires a preprocessing stage in which the data is prepared and cleaned, in order to be useful to and sufficiently clear for the knowledge extraction process. The simplest way of dealing with MVs is to discard the instances (attributes) that contain them. However, this method is practical only when the data contains a relatively small number of examples with MVs and when analysis of the complete examples will not lead to serious bias during the inference. MVs make performing data analysis difficult. The presence of MVs can also pose serious problems for researchers. In fact, inappropriate handling of the MVs in the analysis may introduce bias, can result in misleading conclusions being drawn from a research study, and can also limit the generalizability of the research findings
[5] [15] [41]. Three types of problems are usually associated with MVs in DM [43]:
a. Loss of efficiency.
b. Complications in handling and analyzing the data.
c. Bias resulting from differences between missing and complete data.
Figure (2.2): Data set with MVs denoted with the symbol ‘?’ [18]
2.2.1- Missing Data Mechanisms
The mechanism of missingness describes the relationship between the probability of a value being missing and the other variables in the data set. Let Y represent the complete data, partitioned as (Yobs, Ymis), where Yobs is the observed part of Y and Ymis is the missing part of Y, and let R be an indicator random variable (or matrix) indicating whether Y is observed or missing: R = 1 denotes a value which is observed and R = 0 denotes a value which is missing. The statistical model for missing data is P(R | Y, φ), where φ is the parameter of the missing data process. The
mechanism of missingness is determined by the dependency of R on the variables in the data set [21] . The following are different mechanisms of missingness [15] [34] . i- Missing Completely At Random (MCAR)
The first mechanism of missingness is a special case of MAR known as missing completely at random (MCAR). In this case, the mechanism of missingness is given by:

P(R | Y, φ) = P(R | φ) …………………………... (2.6)
That is, the probability of missingness is not dependent on any observed or unobserved values in Y. It is what one colloquially thinks of as “random“. One example of MCAR might be a computer malfunction that arbitrarily deletes some of the data values. ii- Missing At Random (MAR)
The second mechanism of missingness is missing at random (MAR). This mechanism of missingness is given by:

P(R | Y, φ) = P(R | Yobs, φ) ………………… (2.7)
That is, the probability of missingness is only dependent on observed values in Y and not on any unobserved values in Y. A simple example of MAR is a survey where subjects over a certain age refuse to answer a particular survey question and age is an observed covariate.
iii- Missing Not At Random (MNAR)
The third mechanism of missingness is referred to as missing not at random (MNAR). This mechanism of missingness is given by:

P(R | Y, φ) = P(R | Yobs, Ymis, φ) ………....…. (2.8)
This mechanism occurs when the conditions of MAR are violated so that the probability of missingness is dependent on Ymis or some unobserved covariate. One instance of MNAR might be subjects who have an income above a certain value refusing to report an income in the survey. Here the missingness is dependent on the unobserved response, income.
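The three mechanisms can be illustrated with a small simulation (a hypothetical sketch; the synthetic data set, probabilities and thresholds below are illustrative assumptions, not taken from the thesis):

```python
# Hypothetical sketch: generating MCAR, MAR and MNAR missingness on
# synthetic (age, income) data. Each rule decides whether the income
# value of a row goes missing (R = 0), cf. equations (2.6)-(2.8).
import random

random.seed(1)
data = [(random.randint(20, 70), random.gauss(50_000, 15_000))
        for _ in range(1000)]

def mcar(row):
    # P(R|Y) = P(R): missingness ignores every value in the row
    return random.random() < 0.2

def mar(row):
    # P(R|Y) = P(R|Yobs): depends only on the observed age
    age, _ = row
    return age > 60 and random.random() < 0.5

def mnar(row):
    # P(R|Y) depends on the unobserved income value itself
    _, income = row
    return income > 70_000 and random.random() < 0.5

for name, rule in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, sum(rule(row) for row in data), "values missing")
```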
2.2.2- Methods for Handling Incomplete Data (Missing Data)
Methods to deal with missing values are not something new. In 1976, Rubin [34] developed a framework of inference from incomplete data that is still in use today. Since then many researchers have entered this area and proposed a great number of methods. There are several different methods and strategies available to handle missing data. Current approaches to processing missing data can be roughly divided into three categories: tolerance, ignoring and imputation-based procedures [47].
a. Tolerance
This straightforward method aims to maintain the source entries in their incomplete fashion. It may be a practical and computationally low-cost solution, but it requires the techniques to work robustly even if the data quality stays low [31].
b. Ignoring
Missing data ignorance often refers to “case deletion”. It is the most frequently applied procedure nowadays. This method suffers from a loss of information in the incomplete cases and a risk of bias if the missing data are not MCAR, and it is prone to diminish the data quality. Its strength lies in the ease of application. Deleting the elements with missing values is done in two manners [4] [66]:
(i) Listwise/Casewise Deletion (complete-case analysis):
Omits the entire instances (cases) or attributes (records) containing missing values. The main drawback of this method is that its application may lead to a large loss of observations, which may result in high inaccuracy, in particular if the original dataset is itself too small or the number of instances that contain missing values is too large. Table (2.1) represents the listwise deletion technique.
Table (2.1): Listwise Deletion
(ii) Pairwise Deletion (available-case analysis):
Incomplete instances are removed on an analysis-by-analysis basis. Unlike listwise deletion, which removes instances that have missing values on any of the variables under analysis, pairwise deletion only removes the specific missing values from the analysis (not the entire instance), such that any given instance may contribute to some analyses but not to others. Table (2.2) represents the pairwise deletion technique.
Table (2.2): Pairwise Deletion
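The two deletion strategies can be sketched on a toy data set (an illustrative sketch; here `None` marks a missing value instead of the ‘?’ symbol used in Figure (2.2)):

```python
# Sketch of listwise vs. pairwise deletion on four toy instances.
rows = [
    {"age": 25, "income": 40000, "score": 7},
    {"age": None, "income": 52000, "score": 5},
    {"age": 31, "income": None, "score": 9},
    {"age": 47, "income": 61000, "score": None},
]

# Listwise deletion: drop every instance containing any missing value.
listwise = [r for r in rows if all(v is not None for v in r.values())]

# Pairwise deletion: each analysis uses whatever cases are complete
# for the variables it needs, so different analyses see different cases.
def available(rows, attrs):
    return [r for r in rows if all(r[a] is not None for a in attrs)]

print(len(listwise))                            # 1 complete case left
print(len(available(rows, ["age"])))            # 3 cases usable for age
print(len(available(rows, ["age", "income"])))  # 2 cases for age-income
```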
c. Imputation
Imputation is the process of replacing missing data with substituted
values. Because missing data can create problems for analyzing data, imputation is seen as a way to avoid pitfalls involved with list wise deletion of cases that have missing values. That is to say, when one or more values are missing for a case, most statistical packages default to discard any case that has a missing value, which may introduce bias or affect the representation of the results. Imputation preserves all cases by replacing missing data with an estimated value based on other available information. Once all missing values have been imputed, the data set can then be analyzed using standard techniques for complete data .There are different methods to impute missing values as shown below [1] .
i. Mean/Mode Substitution (MMS):
It replaces the missing values by the mean (the arithmetic average value) or mode (the value with the highest frequency of occurrence) of all the observations, or of a subgroup, at the same variable. It consists of replacing the unknown value for a given attribute by the mean (quantitative attribute) or mode (qualitative attribute) of all known values of that attribute. Replacing all missing records with a single value distorts the input data distribution [44] [59].
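A minimal sketch of MMS, assuming `None` marks a missing value:

```python
# Mean substitution for a quantitative attribute, mode substitution for
# a qualitative one; None marks the missing entries.
from statistics import mean, mode

def impute_mean(values):
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]

def impute_mode(values):
    m = mode(v for v in values if v is not None)
    return [m if v is None else v for v in values]

print(impute_mean([4.0, None, 6.0, 10.0]))        # gap filled with the mean
print(impute_mode(["red", "blue", None, "red"]))  # gap filled with "red"
```

Note that every gap receives the same value, which is exactly the distribution-distorting behavior the paragraph above describes.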
ii. Hot-deck/Cold-deck Imputation:
Given an incomplete pattern, Hot-Deck Imputation (HDI) replaces the missing data with the values from the input vector that is closest in terms of the attributes that are known in both patterns. This method attempts to preserve the distribution by substituting different observed values for each missing item. Another possibility is the Cold-Deck Imputation (CDI) method, which is similar to hot deck but the data source must be other than
the current dataset. For example, in a survey context, the external source can be a previous realization of the same survey [25] [50].
iii. Regression Imputation:
This method uses multiple linear regression to obtain estimates of the missing values. It is applied by estimating a regression equation for each variable, using the others as predictors. This solves the problems concerning variance and covariance raised by the previous method, but leads to polarization of all the variables if they are not linked in a linear fashion. Possible errors are due to the insertion of highly correlated predictors to estimate the variables. Many forms of regression models can be used for regression imputation, such as linear regression, logistic regression and semiparametric regression [26] [32].
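A minimal sketch of regression imputation with a single predictor (the thesis describes the general multivariate case; this one-predictor least-squares version is a simplifying assumption for illustration, and the data are made up):

```python
# Regress the incomplete attribute y on a complete predictor x using
# ordinary least squares, then replace each missing y (None) by its
# fitted value a + b*x.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b          # intercept a, slope b

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, None, 6.1, None, 10.0]   # two missing values

pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
a, b = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
y_imputed = [a + b * xi if yi is None else yi for xi, yi in zip(x, y)]
print(y_imputed)                   # the two gaps now hold fitted values
```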
iv. Expectation Maximization Estimation (EME):
This algorithm can handle parameter estimation in the presence of missing data. Based on the Expectation Maximization (EM) algorithm proposed by Dempster, Laird and Rubin, it consists of two steps that are repeated until convergence: the expectation (E-step) and the maximization (M-step). These methods are generally superior to case deletion methods, because they utilize all the observed data. However, they suffer from a strict assumption of a model distribution for the variables, such as a multivariate normal model, which has a high sensitivity to outliers [63] [52].
v. Machine-learning-based Imputation:
It acquires the features of interested unknown data by behavior
evolution after the sample data are processed. The essence is to automatically learn from samples for complicated pattern recognition and to intelligently predict the missing values. Most methods described above come from statistics. Recently, some machine learning techniques have been introduced to
estimate missing values. For example, the methods mainly include decision-tree-based imputation, association-rule-based imputation and clustering-based imputation [7] [65] [66].
vi. Multiple Imputation:
Multiple imputation was first proposed by Rubin (1976) [34] and is now an increasingly popular way to handle missing data. It produces m complete datasets, and then each of the datasets is analyzed by a complete-data method. At last, the results derived from these m datasets are combined. Multiple imputation reflects the uncertainty of the missing values. All the analyses are combined to reflect both the inter-imputation variability and the intra-imputation variability [58].
2.3 Simulation
Simulation refers to the process of using computer-generated random samples to create datasets. Simulation uses methods based on random samples from a set to simulate a process of interest; it uses random samples from a particular probability distribution, discrete or continuous (Bernoulli, binomial, geometric, uniform, normal, chi-squared, exponential, beta, …), or simulates data from a model (linear model, nonlinear model, time series model, …). Simulation methods are relatively straightforward once the assumptions of a model and the parameters to be used for data generation are specified. Researchers who use simulation methods can have tight experimental control over these assumptions and their data, and can test how a model performs under a known set of parameters (whereas with real-world data, the parameters are unknown). Simulation methods are flexible and can be applied to a number of problems to obtain quantitative answers to questions that may not be possible to derive through other approaches. Nowadays simulation is a tool mainly used for different types of production system analysis, but recent advances in technology
have allowed simulation to expand its usefulness beyond a purely design function and into operational use [17] [48] .
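A minimal sketch of this idea in Python (the distributions and parameter values below are illustrative assumptions):

```python
# Draw computer-generated random samples from distributions with known
# parameters, so sample statistics can be checked against true values.
import random

random.seed(42)
n = 10_000
normal_sample = [random.gauss(100, 15) for _ in range(n)]   # N(100, 15^2)
uniform_sample = [random.uniform(0, 1) for _ in range(n)]   # U(0, 1)
bernoulli_sample = [1 if random.random() < 0.3 else 0
                    for _ in range(n)]                      # Bernoulli(0.3)

# With the parameters under experimental control, the sample mean can
# be compared against the known true value.
print(sum(normal_sample) / n)      # close to the true mean 100
print(sum(bernoulli_sample) / n)   # close to the true p = 0.3
```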
2.4 Clustering
Clustering analysis plays an important role in the data mining field; it is a method of grouping objects or patterns into several clusters. It attempts to organize unlabeled input objects into clusters or “natural groups” such that data points within a cluster are more similar to each other than those belonging to different clusters, i.e., to maximize the intra-cluster similarity while minimizing the inter-cluster similarity. In the field of clustering analysis, a number of methods have been put forward and many successful applications have been reported. Clustering algorithms can be loosely categorized into the following categories: hierarchical, partition-based, density-based, grid-based and model-based clustering algorithms. Among them, partition-based algorithms, which partition objects with some membership matrices, are the most widely studied. Traditional partition-based clustering methods are usually deterministic clustering methods which obtain the specific group that objects belong to, i.e., the membership functions of these methods take on a value of 0 or 1. One can accurately know which group an observed object pertains to. This characteristic brings about these clustering methods' common drawback: we cannot know the probability of the observed object being a part of different groups, which reduces the effectiveness of hard clustering methods in many real situations. For this purpose, fuzzy clustering methods which incorporate fuzzy set theory have emerged [37] [72]. One possible classification of clustering methods is according to whether the subsets are fuzzy (soft) or crisp (hard) [14]; the difference between them is illustrated in Figure (2.3).
Hard clustering methods are based on classical set theory, and require that an object either does or does not belong to a cluster. Hard clustering means partitioning the data into a specified number of mutually exclusive subsets. Soft clustering methods, however, allow the objects to belong to several clusters simultaneously, with different degrees of membership. In many situations, fuzzy clustering is more natural than hard clustering.
2.4.1 Fuzzy C-Means Clustering
Among the fuzzy clustering methods, the fuzzy c-means (FCM) algorithm is the most well-known, because it has the advantage of robustness to ambiguity and maintains much more information than any hard clustering method. The algorithm is an extension of the classical crisp k-means clustering method [64] into the fuzzy set domain. Soft clustering lets data elements belong to more than one cluster, and aims to minimize the objective function; minimizing the objective function means increasing similarity among all the components within an object and reducing similarity between components of one object and others [9] [60]. FCM is widely studied and applied in pattern recognition, image segmentation and image clustering [29] [53], data mining, wireless sensor networks [35] and so on.
Figure (2.3): Difference between Hard and Soft Clustering
The FCM can be summarized in 4 steps:

Step 1. Randomly initialize the membership matrix U = [u_ij], u_ij ∈ [0, 1], with initial value U(0), which satisfies:

Σ_{j=1}^{G} u_ij = 1 , i = 1, 2, …, M ……………………... (2.9)

Step 2. From the rth iteration (r > 0), calculate the centroids c_j^(r):

c_j^(r) = ( Σ_{i=1}^{M} (u_ij^(r))^s x_i ) / ( Σ_{i=1}^{M} (u_ij^(r))^s ) ……………………………………. (2.10)

Step 3. Update the membership matrix U^(r+1):

u_ij^(r+1) = 1 / ( Σ_{k=1}^{G} [ d(x_i, c_j^(r)) / d(x_i, c_k^(r)) ]^(2/(s−1)) ) ......…………………..…... (2.11)

Step 4. If ‖U^(r+1) − U^(r)‖ < ε is satisfied, no more iteration is needed and the iterative procedure immediately ends with the formed clusters; otherwise, it returns to Step 2.

Where:
• r is the ordinal number of the iterations.
• x_i is the ith complete data instance.
• d(·, ·) is the distance metric between two instances.
• u_ij^(r) is the degree of membership (or probability) that the ith instance is subordinate to the jth cluster under the “fuzzifier” s.
• G is the total number of clusters.
• M represents the number of data instances.
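The four steps can be sketched for one-dimensional data as follows (an illustrative sketch; the data values, fuzzifier s = 2, cluster count G = 2 and absolute-difference distance are assumptions, not taken from the thesis):

```python
# Minimal FCM sketch following Steps 1-4 and equations (2.9)-(2.11).
import random

def fcm(xs, G=2, s=2.0, eps=1e-6, max_iter=200):
    random.seed(0)
    M = len(xs)
    # Step 1: random membership matrix whose rows sum to 1 (eq. 2.9).
    U = []
    for _ in range(M):
        row = [random.random() for _ in range(G)]
        t = sum(row)
        U.append([u / t for u in row])
    for _ in range(max_iter):
        # Step 2: centroids as membership-weighted means (eq. 2.10).
        c = []
        for j in range(G):
            num = sum((U[i][j] ** s) * xs[i] for i in range(M))
            den = sum(U[i][j] ** s for i in range(M))
            c.append(num / den)
        # Step 3: update memberships from the distances (eq. 2.11).
        new_U = []
        for i in range(M):
            d = [abs(xs[i] - cj) + 1e-12 for cj in c]
            new_U.append([1.0 / sum((d[j] / d[k]) ** (2 / (s - 1))
                                    for k in range(G)) for j in range(G)])
        # Step 4: stop once the membership matrix has converged.
        diff = max(abs(new_U[i][j] - U[i][j])
                   for i in range(M) for j in range(G))
        U = new_U
        if diff < eps:
            break
    return U, c

U, centroids = fcm([1.0, 1.2, 0.8, 9.0, 9.5, 8.7])
print(centroids)   # one centroid near 1, the other near 9
```

The update in Step 3 guarantees that each row of U sums to 1, so every instance distributes its full membership across the G clusters.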
2.5 Grey System Theory (GST)
A grey system, where “grey” means poor, incomplete, uncertain, etc., is a system in which part of the information is known and part is unknown. Grey System Theory (GST) was first proposed by Prof. Deng (1982) [45]. Systems with completely unknown information are black systems; systems with complete information available are called white systems. The term “grey” lies between “black” and “white” and indicates that the information is partially available. Up to now, GST has developed a set of theories and techniques including grey mathematics, grey generating, grey relational analysis, grey modeling, grey clustering, grey forecasting, grey decision making, grey programming and grey control, and has been applied successfully in many engineering and managerial fields such as industry, ecology, meteorology, geography, earthquakes, hydrology, medicine and the military. The major advantage of grey theory is that it can handle both incomplete information and unclear problems very precisely. It serves as an analysis tool especially in cases where there is insufficient data [38] [49] [70].
2.5.1 Grey Relational Analysis (GRA)
Grey Relational Analysis (GRA) is an important method of Grey System Theory (GST). It is based on geometrical mathematics and complies with the principles of normality, symmetry, entirety, and proximity. GRA is suitable for solving complicated interrelationships between multiple factors and variables and has been successfully applied to cluster analysis, robot path planning, project selection, prediction analysis, performance evaluation, factor effect evaluation and multiple criteria decision making [61].
Gray Relational Analysis includes the Gray Relational Coefficient (GRC) and the Gray Relational Grade (GRG). A detailed explanation of the GRA method is presented in the following steps [39]:

Step 1. The first step is data pre-processing. Data pre-processing is usually required when the range or unit in one data sequence is different from others or the sequence scatter range is too large. Data pre-processing is a method of transferring the original data sequence to a comparable sequence. Therefore, data must be normalized, scaled and polarized first into a comparable sequence before proceeding to the other steps. To map the original data into a particular interval (a, b), the min-max normalization equation is used:

x'_i(A) = ((x_i(A) − min_A) / (max_A − min_A)) (b − a) + a …..…. (2.12)

Hereinto, max_A and min_A are the maximum and minimum values respectively under attribute A. When normalizing data into the specific interval [0, 1], where a = 0 and b = 1, the min-max normalization equation becomes:

x'_i(A) = (x_i(A) − min_A) / (max_A − min_A) ………………………………..…….… (2.13)
Step 2. In a dataset X = {x_1, x_2, …, x_n}, each instance is x_i = {x_i(1), x_i(2), …, x_i(m)}, where m is the number of attributes in each instance. The grey relational coefficient of two instances x_i and x_j on attribute A is:

GRC(x_i(A), x_j(A)) = (Δ_min + ρ Δ_max) / (|x_i(A) − x_j(A)| + ρ Δ_max) …... (2.14)

where Δ_min and Δ_max are the minimum and maximum, over all instance pairs and attributes, of the absolute differences |x_i(A) − x_j(A)|. Hereinto:
• One important parameter is ρ ∈ [0, 1], which is used to control the level of differences with respect to the relational coefficient. When ρ = 0, the comparison environment does not occur any more; on the contrary, ρ = 1 shows that the comparison environment remains unchanged. A proper value of ρ can favorably manage the impact of the maximum value in the matrix. Nevertheless, no method for selecting the optimum value has been established so far; instead, researchers usually choose to empirically set it to 0.5 or learn the optimized value from experimental results [28].
• GRC(x_i(A), x_j(A)) ∈ [0, 1] represents the level of similarity of instances x_i and x_j on attribute A. When GRC(x_i(A), x_j(A)) = 1, x_i and x_j have the same attribute value on attribute A; on the contrary, when x_i and x_j have very different values on attribute A, GRC(x_i(A), x_j(A)) tends to 0.

Step 3. In a dataset where x_i = {x_i(1), x_i(2), …, x_i(m)} and m is the number of attributes in each instance, the grey similarity between instances x_i and x_j is determined to be:

GRG(x_i, x_j) = (1/m) Σ_{A=1}^{m} GRC(x_i(A), x_j(A)) ………..…………... (2.15)

The larger the grey similarity between two instances determined in equation (2.15), the more similar the two instances. If GRG(x_i, x_j) > GRG(x_i, x_k), the level of similarity between x_i and x_k is smaller than that between x_i and x_j. GRG(x_i, x_j) = 0 shows that instances x_i and x_j are totally irrelevant; GRG(x_i, x_j) = 1 shows that instances x_i and x_j are the same. In order to suit the above GRA equations to missing values and to the other techniques of the proposed algorithm, since in this study GRA is applied to an incomplete dataset, the GRC and GRG equations can be expressed by the following steps:

Step 1. Map the original data into the interval [0, 1]. Then the Gray
Relational Coefficient (GRC) is formulated by equation (2.16):
GRC(I_k(p), v_i(p)) = (Δ_min + ρ Δ_max) / (|I_k(p) − v_i(p)| + ρ Δ_max) …(2.16)

where I_k is the kth incomplete instance, I_k(p) is its pth attribute with a non-missing value, and v_i denotes the centroid of the ith cluster. In other words, the calculation only happens when the pth attributive value of I_k exists.

Step 2. Integrating each parameter's GRC between an incomplete instance and the reference, the GRG is calculated in (2.17):

GRG(I_k, v_i) = (1/m_k) Σ_p GRC(I_k(p), v_i(p)) …………………….…. (2.17)

where m_k is the number of non-missing attributes of I_k. In terms of the maximal value of GRG, each incomplete instance is individually incorporated into the closest cluster [54].
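Equations (2.16) and (2.17) can be sketched as follows (an illustrative sketch; the centroids, the incomplete instance and ρ = 0.5 are assumed values, and `None` marks a missing attribute):

```python
# GRC (2.16) is computed only on the non-missing attributes of the
# incomplete instance; GRG (2.17) averages those coefficients, and the
# instance joins the cluster whose centroid gives the maximal GRG.

def grg(instance, centroids, rho=0.5):
    # absolute differences on the attributes the instance actually has
    diffs = [abs(instance[p] - c[p])
             for c in centroids
             for p in range(len(instance)) if instance[p] is not None]
    d_min, d_max = min(diffs), max(diffs)
    grades = []
    for c in centroids:
        grc = [(d_min + rho * d_max) /
               (abs(instance[p] - c[p]) + rho * d_max)
               for p in range(len(instance)) if instance[p] is not None]
        grades.append(sum(grc) / len(grc))   # equation (2.17)
    return grades

centroids = [[0.2, 0.3, 0.1], [0.8, 0.9, 0.7]]
incomplete = [0.75, None, 0.65]              # one missing attribute
grades = grg(incomplete, centroids)
print(grades.index(max(grades)))             # index of the closest cluster
```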
2.6 Classification
Classification consists of predicting a certain outcome based on a given input. In order to predict the outcome, the algorithm processes a training set containing a set of attributes and the respective outcome, usually called the goal or prediction attribute. The algorithm tries to discover relationships between the attributes that make it possible to predict the outcome. Next the algorithm is given a data set not seen before, called the prediction set (test set), which contains the same set of attributes, except for the prediction attribute, which is not yet known. The algorithm analyses the input and produces a prediction. The prediction accuracy defines how “good” the algorithm is [23]. A classification technique (or classifier) is a systematic approach to building classification models from an input data set. Examples include decision tree classifiers, rule-based classifiers, neural networks, support vector machines, and Naïve Bayes classifiers. Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and class label of the input data. The model generated by a learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before. Therefore, a key objective of the learning algorithm is to build models with good
generalization capability; i.e., models that accurately predict the class labels of previously unseen records [13]. Figure (2.4) shows a general approach to solving classification problems. First, a training set consisting of records whose class labels are known must be provided. The training set is used to build a classification model, which is subsequently applied to the test set, which consists of records with unknown class labels.
Figure (2.4): General Approach for Building a Classification Model [13] .
2.6.1 Decision Tree

One of the most intuitive tools for data classification is the decision tree. It hierarchically partitions the input space until it reaches a subspace associated with a class label. Decision trees are appreciated for being easy to interpret and easy to use, and they are enthusiastically used in a range of business, scientific, and health-care applications because they provide an intuitive means of solving complex decision-making tasks. For example, in business, decision trees are used for everything from codifying how employees should deal with customer needs to making high-value investments. In medicine, decision trees are used for diagnosing illnesses and making treatment decisions for individuals or for communities. A decision tree is a rooted, directed, flowchart-like tree [3]. The tree has three types of nodes [13]:
• A root node, which has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
• Leaf (terminal) nodes, each of which has exactly one incoming edge and no outgoing edges.
Each internal node corresponds to a partitioning decision, and each leaf node is mapped to a class-label prediction, as shown in Figure (2.5).
Figure (2.5): Decision Tree Example [13].

Choosing the root (top node) of the tree is usually done by measuring, for each attribute, the entropy or the information gain derived from it.
2.6.2 Entropy and Information Gain

The term "entropy" refers to Shannon's entropy, introduced by Claude E. Shannon in 1948 [30]. Entropy is a method of measuring the randomness or uncertainty in a given set of data; it quantifies the expected value of the information contained in a message, usually in units such as bits. Equivalently, the Shannon entropy is a measure of the average information content one is missing when one does not know the value of the random variable. The entropy is 0 if the outcome is certain; on the other hand, it is maximal if there is no knowledge of the system (any outcome is equally possible), which is intuitively the most uncertain situation, as shown in Figure (2.6).
The entropy H of a discrete random variable Y is formulated as:

H(Y) = E[I(Y)] = E[−log₂ p(Y)] ………(2.18)

where E is the expected-value function and I(Y) is the information content or self-information of Y; I(Y) is itself a random variable. If p denotes the probability mass function (p.m.f.) of Y, then the entropy can be written as:

H(Y) = −Σ_y p(y) · log₂ p(y) ………(2.19)

where p(y) is the probability that a random selection would have state y.

If the system is pre-partitioned into subsets S_1, …, S_n according to another variable (or splitting rule) S, then the information entropy of the overall system is the weighted sum of the entropies of the partitions. This is equivalent to the conditional entropy:

H(Y | S) = Σ_j (|S_j| / |S|) · H(Y | S_j) ………(2.20)

The information gain of the random variable Y after splitting it into the partitions S_1, …, S_n is then:

IG(Y, S) = −Σ_y p(y) · log₂ p(y) − Σ_j (|S_j| / |S|) · H(Y | S_j) ………(2.21)

IG(Y, S) = H(Y) − H(Y | S) ………(2.22)

Thus the information gain is the amount of information gained by knowing the value of the attribute: the entropy of the set before the split minus the entropy of the set after the split. The largest information gain is equivalent to the smallest post-split entropy.

Information gain = (entropy of the set before the split) − (entropy of the set after the split)

Figure (2.6): Entropy in the case of two possibilities with probabilities p and (1 − p). [30]
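Equations (2.18)–(2.22) can be condensed into a short sketch (Python; `entropy` and `info_gain` are our own helper names, and logarithms are base 2 so results are in bits):

```python
import math

def entropy(labels):
    """Shannon entropy H(Y) = -sum p(y) * log2 p(y)  (eq. 2.19)."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(labels, partitions):
    """Information gain = H before split - weighted H after split (eqs. 2.21-2.22).
    `partitions` is a list of label subsets produced by the splitting rule S."""
    n = len(labels)
    h_after = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(labels) - h_after
```

For instance, a perfectly balanced two-class set has entropy 1 bit, and a split that separates the classes completely gains the full bit.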
2.7 Hybrid Imputation

A hybrid imputation approach, as the name suggests, combines more than one technique (method) for imputing missing values; using a hybrid method may achieve higher imputation performance than any single-type approach alone [40] [47] [56] [57].
2.8 Proposed Algorithm

Our proposed algorithm (PA) improves the existing MIGEC algorithm proposed by [47] by adding the following steps:
1- Converting the incomplete dataset to a binary dataset.
2- GRA based on attributes instead of instances.
3- Attribute merging instead of instance merging.
4- After the missing elements of an attribute are first imputed by mean imputation, subsequent rounds use the result of the new imputation (imputation by PA) instead of the mean to calculate the imputation of the remaining missing values of that attribute.

The global procedure of the proposed algorithm is schematized in Figure (2.7), and each of the key components is detailed in the following subsections. For implementing our new development (PA), the items of the raw dataset are divided into two disjoint subsets, namely the complete dataset and the incomplete dataset. This separation is expected to minimize the negative impact of the information loss due to missing values. On one hand, the objects of the complete set constitute a number of clusters via FCM. On the other hand, the items in the incomplete set are reordered from high to low missing severity and the GST-based distance metric is applied; that is, in terms of the maximal value of the GRG, each incomplete attribute is individually incorporated into the closest cluster. After that, the incomplete dataset is converted to a binary dataset in which each observed element is set to one and each missing element to zero (observed = 1, missed = 0). Next, each missing attributive value is imputed by the proposed imputation, which uses classification and mean imputation: first, each instance of the binary dataset is classified with respect to the target instance (here, the clusters); then we return to the incomplete dataset and impute it using mean imputation as the first imputation. Each time new imputed values are obtained, they are used instead of the mean to estimate the next missing values. This process continues until all missing values of the incomplete dataset are imputed.
Figure (2.7): The Flowchart of the Proposed Algorithm. (Raw dataset preparation splits the data into a complete and an incomplete dataset. The complete dataset is clustered by FCM; the incomplete dataset is converted to a binary dataset and its instances are ranked by missing amount in decreasing order. GST-based grey relational analysis on attributes and attribute merging assign each incomplete attribute to a cluster. Imputation then proceeds via mean and classification, looping until all missing values are imputed; finally the complete dataset is combined with the imputed dataset into a full dataset with non-missing values. The steps improved over the original MIGEC algorithm are marked.)
2.8.1 Steps of Algorithm

Let X denote an incomplete dataset with n attributes {A_1, …, A_n} and m instances. X contains two parts, X = {X_obs, X_mis}, where X_obs is the set of observed values and X_mis is the set of missing values.

A binary matrix R is produced from the incomplete dataset X by converting each observed value (x_ij ∈ X_obs) to one and each missing value (x_ij ∈ X_mis) to zero:

r_ij = 1 if x_ij is observed, and r_ij = 0 if x_ij is missing.

In this case R becomes a matrix of missing-data indicators, and it has the same number of rows and columns as the data matrix X. For example, any incomplete matrix A yields an indicator matrix R in which every observed entry of A is replaced by 1 and every NA by 0.

• NA = Missing Value (Not Available)
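The conversion of an incomplete matrix into its indicator matrix can be sketched as follows (Python, with None standing in for NA; the function name is ours):

```python
NA = None  # missing-value marker

def indicator_matrix(A):
    """Build the missing-data indicator matrix R from an incomplete
    matrix A: r_ij = 1 if a_ij is observed, 0 if it is missing.
    R has the same number of rows and columns as A."""
    return [[0 if v is NA else 1 for v in row] for row in A]
```

For example, `indicator_matrix([[5, NA], [NA, 7]])` yields `[[1, 0], [0, 1]]`.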
Each time one attribute has been assigned to the most proximate cluster via GST in our proposed improvement, one instance is inserted into the binary matrix and called the class (target): it associates each attribute with the cluster of the data matrix it belongs to. Then the imputation technique starts as follows:

Step 1. Calculate the expected information (entropy) after partitioning each instance due to the class:

Info(D) = −Σ_{i=1}^{m} p_i · log₂(p_i) ………(2.23)

where:
• p_i is the probability of the event (cluster c_i) occurring,
• n is the number of instances,
• m is the number of clusters.

The information needed after splitting D due to instance j into V partitions is:

Info_j(D) = Σ_{v=1}^{V} (|D_v| / |D|) · Info(D_v) ………(2.24)

We illustrate how to calculate the entropy by the following example:
Instance   A1   A2   A3   A4   A5
1          93   NA   87   89   97
2          98   NA   99   NA   94
3          NA   88   NA   95   NA
4          85   NA   NA   NA   85
5          82   78   NA   85   80
6          95   92   98   92   NA
7          NA   NA   90   90   90
8          NA   NA   NA   NA   95
9          NA   90   NA   98   85
10         NA   82   NA   88   80
11         NA   NA   NA   95   88
12         96   NA   NA   90   95
13         NA   81   92   87   93
14         71   84   89   83   NA
Class      c1   c2   c1   c2   c1

The corresponding binary dataset (observed = 1, missed = 0):

Instance   A1   A2   A3   A4   A5
1          1    0    1    1    1
2          1    0    1    0    1
3          0    1    0    1    0
4          1    0    0    0    1
5          1    1    0    1    1
6          1    1    1    1    0
7          0    0    1    1    1
8          0    0    0    0    1
9          0    1    0    1    1
10         0    1    0    1    1
11         0    0    0    1    1
12         1    0    0    1    1
13         0    1    1    1    1
14         1    1    1    1    0
Class      c1   c2   c1   c2   c1

where the Class row holds the clusters, computed by FCM and assigned to each attribute by the maximal value of the GRG.

For each instance, the entropy is calculated as follows:
Instance 1 (binary row 1, 0, 1, 1, 1):
value 0 → 0 c1, 1 c2;  value 1 → 3 c1, 1 c2
Info_1(D) = (1/5)·Info(0/1, 1/1) + (4/5)·Info(3/4, 1/4) = (1/5)(0) + (4/5)(0.8113) = 0.6490

Instance 2 (1, 0, 1, 0, 1):
value 0 → 0 c1, 2 c2;  value 1 → 3 c1, 0 c2
Both partitions are pure, so Info_2(D) = (2/5)(0) + (3/5)(0) = 0.

Instance 3 (0, 1, 0, 1, 0):
value 0 → 3 c1, 0 c2;  value 1 → 0 c1, 2 c2
Info_3(D) = (3/5)(0) + (2/5)(0) = 0.

Instance 4 (1, 0, 0, 0, 1):
value 0 → 1 c1, 2 c2;  value 1 → 2 c1, 0 c2
Info_4(D) = (3/5)·Info(1/3, 2/3) + (2/5)·Info(2/2, 0/2) = (3/5)(0.918295834) + (2/5)(0) = 0.5510
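The per-instance entropy of this example can be reproduced with a small sketch (Python; `split_info` is our own name for the quantity in eq. (2.24)):

```python
import math

def split_info(bits, classes):
    """Expected information (eq. 2.24) after splitting the attributes of one
    binary-dataset instance by its 0/1 values; `classes` holds each
    attribute's cluster label (the Class row of the table)."""
    groups = {0: [], 1: []}
    for b, c in zip(bits, classes):
        groups[b].append(c)
    n = len(bits)
    total = 0.0
    for part in groups.values():
        if not part:
            continue
        m = len(part)
        probs = [part.count(c) / m for c in set(part)]
        total += m / n * -sum(p * math.log2(p) for p in probs)
    return total

classes = ['c1', 'c2', 'c1', 'c2', 'c1']     # Class row of the example
inst1 = [1, 0, 1, 1, 1]                      # instance 1 across A1..A5
print(round(split_info(inst1, classes), 4))  # prints 0.649
```

Calling it on instance 2, `[1, 0, 1, 0, 1]`, returns 0, since both partitions are pure.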
In this way the calculation continues for instances 5 to 14.

Step 2. Compute the coefficient of difference for the f-th instance:

G_f = 1 − E_f ………(2.25)

where E_f is the entropy computed in Step 1. G_f represents the inherent contrast intensity of the f-th parameter; a greater value of G_f signifies a greater significance of that parameter.

Step 3. Elicit the coefficient of weight for the f-th copy:

w_f = G_f / Σ_f G_f ………(2.26)

Step 4. The mean mode substitution (MMS) is employed to initialize the missing values in the first imputation. This simple technique performs well only when the data are normally distributed; yet it is believed that the imputation can produce excellent performance provided that the missing values are initialized by MMS beforehand, even without any prior knowledge about the pattern of the distribution [68]. Then, estimate the j-th attributive missing value of x_k:

x̂_k(j) = Σ_f w_f · x_f(j) ………(2.27)

After the missing elements of an attribute have been imputed by mean imputation the first time, subsequent rounds use the result of the new imputation (imputation by the PA algorithm) instead of the mean to calculate the imputation of the remaining missing values of that attribute.
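Steps 2–4 can be condensed into a short sketch (Python; the names are ours, and the reference values x_f(j) are assumed to come from the instances of the assigned cluster):

```python
def attribute_weights(entropies):
    """Eqs. (2.25)-(2.26): divergence g_f = 1 - E_f, weight w_f = g_f / sum(g)."""
    g = [1.0 - e for e in entropies]
    s = sum(g)
    return [gf / s for gf in g]

def weighted_estimate(values, weights):
    """Eq. (2.27): estimate a missing value as the weighted combination of the
    reference instances' values at the same attribute."""
    return sum(w * v for w, v in zip(weights, values))
```

For example, entropies [0.0, 0.5] give weights [2/3, 1/3], so reference values [90, 60] yield the estimate 80.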
2.9 The Framework of the Proposed Algorithm

Input: X, the m × n dimensional dataset with missing values.
Output: X′, the m × n dimensional complete dataset with imputed values.

Step 1: Artificially generate missingness in the original dataset by using (MCAR, MAR, NMAR) with four different levels of missing rate.
Step 2: Divide the dataset X into two disjoint datasets, X = X_obs ∪ X_mis, where X_obs is complete and X_mis is incomplete.
Step 3: Apply fuzzy c-means clustering to the complete part: FCM(X_obs) → {c_1, …, c_G}, where G is the number of clusters.
Step 4: For each element x_k in X_mis, allocate x_k to the closest cluster c_q according to Grey System Theory (GST).
Step 5: Impute all missing values of X_mis by utilizing classification and mean imputation, and then integrate X_obs with the imputed X_mis.
Step 6: Repeat steps (1 to 5) iteratively in case missing values still exist.
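Step 3's fuzzy c-means can be sketched in plain Python (the thesis's implementation was in R; the fuzzifier m = 2, the iteration count, and the random initialization are our assumptions):

```python
import random

def fcm(data, n_clusters, m=2.0, iters=50, seed=0):
    """Minimal fuzzy c-means: returns cluster centroids and the fuzzy
    membership matrix U (one row per instance, one column per cluster)."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    # random initial memberships, each row normalised to sum to 1
    U = []
    for _ in range(n):
        row = [rng.random() for _ in range(n_clusters)]
        s = sum(row)
        U.append([u / s for u in row])
    centroids = [[0.0] * d for _ in range(n_clusters)]
    for _ in range(iters):
        # centroids are the membership^m weighted means of the data
        for j in range(n_clusters):
            w = [U[i][j] ** m for i in range(n)]
            sw = sum(w)
            centroids[j] = [sum(w[i] * data[i][k] for i in range(n)) / sw
                            for k in range(d)]
        # memberships are updated from the distances to the centroids
        for i in range(n):
            dist = [max(1e-12, sum((data[i][k] - centroids[j][k]) ** 2
                                   for k in range(d)) ** 0.5)
                    for j in range(n_clusters)]
            for j in range(n_clusters):
                U[i][j] = 1.0 / sum((dist[j] / dist[l]) ** (2.0 / (m - 1.0))
                                    for l in range(n_clusters))
    return centroids, U
```

On two well-separated one-dimensional groups the centroids converge near the group means, and each membership row always sums to one.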
Chapter Three Practical Part
3.1 Introduction

In this chapter, the experimental results of the proposed algorithm for both the Wine and the simulated dataset are displayed and discussed, and the proposed algorithm is compared with previous techniques for dealing with missing data. In this thesis, the R programming language (version 3.2.3) was used to implement all the methods that make up the proposed algorithm.
3.2 Dataset

The first dataset on which we implement our algorithm is the Wine dataset, obtained from the UCI (University of California, Irvine) Machine Learning Repository. The purpose of selecting this dataset is to compare the efficiency of our algorithm with the previous MIGEC algorithm implemented by (J. Tian, B. Yu, D. Yu, Sh. Ma) [47].
3.2.1 Wine Data Set

The Wine dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. Information about the Wine dataset is shown in Table (3.1).
Table (3.1): Information about the Wine dataset

Data Set Characteristics:    Multivariate
Number of Instances:         178
Area:                        Physical
Attribute Characteristics:   Integer, Real
Number of Attributes:        13
Date Donated:                1991-07-01
Associated Tasks:            Classification
Missing Values?              No
Number of Web Hits:          531927
The attributes are: 1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Non flavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline

A sample of the Wine dataset is represented in Figure (3.1).
Figure (3.1): Sample of Wine dataset
3.2.2 Simulated Data

Nowadays data grows rapidly (to millions, billions, trillions of records) as discussed in Chapter One; in other words, data mining already works with massive quantities of data. For this reason we simulated data to assess the performance of the proposed algorithm on a larger amount of data. We used a simple random sample of size 1000 drawn from a normal distribution; this dataset contains 13 attributes and 1000 instances, all numeric. The limitations of the available computer (Intel(R) Core(TM) i5 CPU M 430 @ 2.27 GHz) did not allow us to increase the amount of simulated data. We used the rnorm command of the R programming language to generate 1000 samples under a normal distribution based on the mean and standard deviation of each Wine attribute, as displayed in Table (3.2):

X = rnorm(n, mean, sd)

where X represents one attribute of the simulated data (since we have 13 attributes, X = Attribute-1, …, Attribute-13); rnorm() is R's function for simulating values from a normal distribution with a given mean and standard deviation; n is the sample size; and mean and sd are the mean and standard deviation of the corresponding Wine attribute.
Table (3.2): Mean and standard deviation of the Wine data

Attribute   A1     A2     A3     A4      A5      A6     A7     A8     A9     A10    A11    A12    A13
Mean        13.0   2.34   2.37   19.50   99.74   2.30   2.03   0.36   1.59   5.06   0.96   2.61   746.89
Sd          0.81   1.12   0.27   3.34    14.28   0.63   1.00   0.12   0.57   2.32   0.23   0.71   314.91
A sample of the simulated dataset is displayed in Figure (3.2).
Figure (3.2): Sample of simulated dataset
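The rnorm simulation above can be mirrored in Python as a hedged sketch (the thesis used R; the function name, seed, and list layout are our assumptions, while the means and standard deviations come from Table (3.2)):

```python
import random

# Means and standard deviations of the Wine attributes (Table 3.2)
MEANS = [13.0, 2.34, 2.37, 19.50, 99.74, 2.30, 2.03,
         0.36, 1.59, 5.06, 0.96, 2.61, 746.89]
SDS = [0.81, 1.12, 0.27, 3.34, 14.28, 0.63, 1.00,
       0.12, 0.57, 2.32, 0.23, 0.71, 314.91]

def simulate_dataset(n=1000, seed=1):
    """Python analogue of the thesis's rnorm(n, mean, sd) calls: one normal
    sample per attribute, giving an n x 13 numeric dataset."""
    rng = random.Random(seed)
    return [[rng.gauss(mu, sd) for mu, sd in zip(MEANS, SDS)] for _ in range(n)]
```

Each simulated column's sample mean then lands close to the corresponding Wine attribute mean.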
3.3 Generating Missingness

To intrinsically examine the effectiveness and validity of the research and ensure its systematic nature, we artificially generated missing data in the complete datasets at four distinct missing rates (5%, 10%, 15% and 20%) under three different modalities, namely MCAR, MAR and NMAR.
For MCAR, in order to simulate missing values on the attributes, the original datasets are run through a random generator that lets every datum in the dataset have the same probability α of being missing, where α is the specified missing rate. The "Nonparametric Missing Value Imputation using Random Forest" package of the R programming language was used to generate MCAR. A sample of the missing values generated on the Wine dataset under MCAR is displayed in Figure (3.3), where each NA represents a missing value.
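The MCAR mechanism just described can be sketched directly (Python; the function name and the use of None for NA are our assumptions):

```python
import random

def make_mcar(data, alpha, seed=0):
    """MCAR: every cell is set to missing (None) independently with the
    same probability alpha, the specified missing rate."""
    rng = random.Random(seed)
    return [[None if rng.random() < alpha else v for v in row] for row in data]
```

Over a large dataset the realised missing fraction concentrates around alpha.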
Figure (3.3): Sample of Wine dataset missed by MCAR

Simulating MAR was more challenging, and it worked as follows. Consider a complete dataset with two attributes (X_s, X_d), where X_s is the attribute into which missing values are introduced and X_d is the attribute that affects the missingness of X_s. Given a pair (X_s, X_d) and a missing rate α, first find the median of X_d and assign the instances to two equal-sized subsets according to whether their values at X_d are lower or higher than that median. After the splitting, randomly select one subset of instances and let their values at X_s be missing with probability 4α. The probability of 4α results in a missing rate of 2α on the whole variable X_s, which is equivalent to a missing rate of α over the two variables (X_s, X_d).

For multiple attributes, the pairing was based on high correlations among the attributes: different pairs of attributes were used to generate the missingness, each attribute being paired with the one it is most highly correlated with. A sample of the missing values generated on the Wine dataset under MAR is displayed in Figure (3.4).
Figure (3.4): Sample of Wine dataset missed by MAR

The process of generating missing values by NMAR was similar to MAR. The only difference is that there is no need to split the variables into pairs: NMAR produces missingness on every variable directly. For a given variable X and a specified missing rate α, first calculate the median of X and then randomly let the values that are lower (or higher) than that median be missing with probability 2α. A sample of the missing values generated on the Wine dataset under NMAR is displayed in Figure (3.5).
Figure (3.5): Sample of Wine dataset missed by NMAR
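The MAR and NMAR procedures can likewise be sketched (Python; function names are ours, and the correlation-based pairing of attributes is omitted for brevity):

```python
import random

def median(xs):
    s = sorted(xs)
    n = len(s)
    return (s[n // 2 - 1] + s[n // 2]) / 2 if n % 2 == 0 else s[n // 2]

def make_mar(col_missing, col_driver, alpha, seed=0):
    """MAR on a pair of columns: split instances by the median of the driver
    column, randomly pick one half, and delete the target column's values
    there with probability 4*alpha (net rate ~2*alpha on that column)."""
    rng = random.Random(seed)
    med = median(col_driver)
    lower = rng.random() < 0.5          # randomly choose which subset
    out = []
    for v, d in zip(col_missing, col_driver):
        in_subset = (d <= med) if lower else (d > med)
        out.append(None if in_subset and rng.random() < 4 * alpha else v)
    return out

def make_nmar(col, alpha, seed=0):
    """NMAR: values on one side of the column's own median are deleted
    with probability 2*alpha (net rate ~alpha on the column)."""
    rng = random.Random(seed)
    med = median(col)
    lower = rng.random() < 0.5
    out = []
    for v in col:
        in_side = (v <= med) if lower else (v > med)
        out.append(None if in_side and rng.random() < 2 * alpha else v)
    return out
```

Unlike MCAR, both mechanisms concentrate the deletions on one side of a median, which is what makes them non-random.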
3.4 Performance Measure

To evaluate the precision of the various data imputation algorithms, the Root Mean Square Error (RMSE), or the Normalized Root Mean Square Error (NRMSE), is used in this study:

RMSE = √( (1/m) · Σ_{i=1}^{m} (e_i − ê_i)² ) ………(3.1)

NRMSE = RMSE / σ ………(3.2)

where e_i is the original value, ê_i is the predicted plausible value, m is the total number of estimations, and σ is the standard deviation. A larger value of RMSE indicates a less accurate algorithm.
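Equations (3.1) and (3.2) translate directly (Python; the function names are ours):

```python
import math

def rmse(original, imputed):
    """Eq. (3.1): root mean square error over the m imputed positions."""
    m = len(original)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(original, imputed)) / m)

def nrmse(original, imputed, sd):
    """Eq. (3.2): RMSE normalised by the standard deviation of the data."""
    return rmse(original, imputed) / sd
```

A perfect imputation gives RMSE = 0; larger errors grow the measure monotonically.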
3.5 Optimality Measure

Before the comparative demonstrations, in order to capture the result of the data imputation accurately, it is requisite to select the optimal values for the number of iterations (number of imputations) and the number of clusters; in other words, to find which number of clusters or iterations gives the minimum RMSE.
3.5.1 Number of Iterations

First we tested the number of iterations for each missingness mechanism (MCAR, MAR, NMAR) on the Wine dataset, with a 10% missing rate and five clusters as the initial value (five lies between 2 and 10).

For MCAR:

Table (3.3): Checking optimality by number of iterations (for MCAR), 10% missing rate, cluster = 5

Iterations   5        10       15       20       25       30       35       40       45       50
RMSE         0.1418   0.1563   0.1575   0.1650   0.1620   0.1616   0.1637   0.1600   0.1590   0.1591

As seen in Table (3.3), the RMSE declines to its least value (0.1418) when the number of iterations is 5, so that is the optimal iteration count for MCAR; this is also shown in Figure (3.6).

Figure (3.6): Checking optimality by number of iterations (for MCAR)
For MAR:

Table (3.4): Checking optimality by number of iterations (for MAR), 10% missing rate, cluster = 5

Iterations   5        10       15       20       25       30       35       40       45       50
RMSE         0.1388   0.1503   0.1578   0.1618   0.1625   0.1621   0.1648   0.1674   0.1693   0.1660

Table (3.4) illustrates that the best number of iterations for MAR is also 5, which gives the minimum RMSE (0.1388), as also shown in Figure (3.7).

Figure (3.7): Checking optimality by number of iterations (for MAR)
For NMAR:

Finally, for NMAR, as appears in Table (3.5) and Figure (3.8), iteration 10 gives the lowest RMSE (0.1347), and the worst iteration count, which yields the maximum RMSE, is 15.

Table (3.5): Checking optimality by number of iterations (for NMAR), 10% missing rate, cluster = 5

Iterations   5        10       15       20       25       30       35       40       45       50
RMSE         0.1376   0.1347   0.1406   0.1387   0.1374   0.1400   0.1394   0.1385   0.1387   0.1405

Figure (3.8): Checking optimality by number of iterations (for NMAR)
3.5.2 Number of Clusters

The second step in checking optimality is the number of clusters, obtained via FCM applied to the complete dataset. The number of clusters directly affects the accuracy of the proposed imputation, since the clusters are used in the classification step, and classification is a main part of imputing the incomplete data in the proposed algorithm. For each of the three missing mechanisms (MCAR, MAR, NMAR), we checked for the optimal number of clusters from 2 to 10 using a 10% missing rate. Each mechanism is discussed in turn.

For MCAR:
Table (3.6): Checking optimality by number of clusters (for MCAR)

Clusters   2        3        4        5        6        7        8        9        10
RMSE       0.1553   0.1423   0.1410   0.1418   0.1482   0.1330   0.1490   0.1456   0.1515

According to Table (3.6), when the whole data is agglomerated into 7 groups the RMSE declines to its minimum (0.1330); in contrast, the worst value is obtained with 2 clusters. The above description is summarized in Figure (3.9).

Figure (3.9): Checking optimality by number of clusters (for MCAR)
For MAR:

Table (3.7): Checking optimality by number of clusters (for MAR)

Clusters   2        3        4        5        6        7        8        9        10
RMSE       0.1321   0.1464   0.1514   0.1388   0.1463   0.1463   0.1849   0.1588   0.1448
As shown in Table (3.7) and Figure (3.10), MAR performs best when the data is partitioned into 2 groups, with an RMSE of 0.1321.

Figure (3.10): Checking optimality by number of clusters (for MAR)

For NMAR:

Table (3.8): Checking optimality by number of clusters (for NMAR)

Clusters   2         3         4         5         6         7         8         9         10
RMSE       0.14485   0.14039   0.13419   0.13469   0.14319   0.14915   0.12380   0.13518   0.14078

From Table (3.8) and Figure (3.11), the RMSE results for NMAR lie between 0.12 and 0.15; the minimum RMSE (0.1238) is obtained when the number of clusters is 8.

Figure (3.11): Checking optimality by number of clusters (for NMAR)
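The search over these tables reduces to picking the argmin of the RMSE (a small sketch in Python; the helper name is ours, while the dictionary values are the NMAR cluster results of Table (3.8)):

```python
def optimal_setting(rmse_by_setting):
    """Return the setting (cluster count or iteration number) with the
    minimal RMSE, as done in Tables (3.3)-(3.8)."""
    return min(rmse_by_setting, key=rmse_by_setting.get)

# NMAR cluster search from Table (3.8)
nmar_clusters = {2: 0.14485, 3: 0.14039, 4: 0.13419, 5: 0.13469, 6: 0.14319,
                 7: 0.14915, 8: 0.12380, 9: 0.13518, 10: 0.14078}
```

Applied to the NMAR values, the helper returns 8 clusters, matching the discussion above.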
3.6 Comparative Experiments

To make the comparison as extensive as possible, we thoughtfully selected eight other approaches: MMS (Mean Mode Substitution), HDI (Hot Deck Imputation), KNNMI (K-Nearest-Neighbour Imputation with Mutual Information) [33], FCMOCS (Fuzzy C-Means based on Optimal Completion Strategy) [24], CRI (Clustering-based Random Imputation) [59], CMI (Clustering-based Multiple Imputation) [67], NIIA (Non-parametric Iterative Imputation Algorithm) [68], and MIGEC (Multiple Imputation algorithm using Grey System Theory and Entropy based on Clustering) [47].

After selecting the optimal numbers of clusters and iterations for all three types of missingness, we compared the proposed algorithm with these methods at varying missing rates using the RMSE (the average of the individual RMSE values), as displayed in Tables (3.9), (3.11) and (3.13); the differences between our PA and each of the other algorithms are displayed in Tables (3.10), (3.12) and (3.14).
Table (3.9): Comparison between the proposed algorithm and other imputation methods (MCAR)

Missing Rate   MMS     HDI     KNNMI   FCMOCS   NIIA    CMI     CRI     MIGEC   P.A.
5%             0.201   0.197   0.191   0.188    0.179   0.176   0.172   0.174   0.1785
10%            0.203   0.202   0.195   0.189    0.182   0.180   0.179   0.179   0.1330
15%            0.205   0.203   0.196   0.192    0.186   0.181   0.181   0.182   0.1520
20%            0.213   0.205   0.198   0.195    0.188   0.184   0.188   0.189   0.1672
Table (3.10): Difference between the proposed algorithm and the other algorithms (MCAR)

Missing Rate   MMS       HDI       KNNMI     FCMOCS    NIIA      CMI       CRI       MIGEC
5%             0.0225    0.0185    0.0125    0.0095    0.0005    -0.0025   -0.0065   -0.0045
10%            0.07      0.069     0.062     0.056     0.049     0.047     0.046     0.046
15%            0.053     0.051     0.044     0.04      0.034     0.029     0.029     0.03
20%            0.0458    0.0378    0.0308    0.0278    0.0208    0.0168    0.0208    0.0218
Average        0.04783   0.04408   0.03733   0.03333   0.02608   0.02258   0.02233   0.02333
Table (3.11): Comparison between the proposed algorithm and other imputation methods (MAR)

Missing Rate   MMS     HDI     KNNMI   FCMOCS   NIIA    CMI     CRI     MIGEC   P.A.
5%             0.192   0.188   0.186   0.185    0.172   0.171   0.171   0.169   0.1366
10%            0.194   0.196   0.194   0.189    0.176   0.177   0.173   0.172   0.1321
15%            0.204   0.206   0.202   0.192    0.178   0.182   0.185   0.184   0.1493
20%            0.210   0.208   0.204   0.198    0.185   0.188   0.183   0.187   0.1649
Table (3.12): Difference between the proposed algorithm and the other algorithms (MAR)

Missing Rate   MMS       HDI       KNNMI     FCMOCS    NIIA      CMI       CRI       MIGEC
5%             0.0554    0.0514    0.0494    0.0484    0.0354    0.0344    0.0344    0.0324
10%            0.0619    0.0639    0.0619    0.0569    0.0439    0.0449    0.0409    0.0399
15%            0.0547    0.0567    0.0527    0.0427    0.0287    0.0327    0.0357    0.0347
20%            0.0451    0.0431    0.0391    0.0331    0.0201    0.0231    0.0181    0.0221
Average        0.05428   0.05378   0.05078   0.04528   0.03203   0.03378   0.03228   0.03228
Table (3.13): Comparison between the proposed algorithm and other imputation methods (NMAR)

Missing Rate   MMS     HDI     KNNMI   FCMOCS   NIIA    CMI     CRI     MIGEC   P.A.
5%             0.171   0.169   0.168   0.165    0.160   0.159   0.158   0.155   0.1183
10%            0.176   0.172   0.172   0.169    0.163   0.166   0.163   0.157   0.1238
15%            0.183   0.178   0.174   0.171    0.164   0.168   0.167   0.164   0.1601
20%            0.192   0.189   0.180   0.175    0.167   0.169   0.168   0.168   0.1629
Table (3.14): Difference between the proposed algorithm and the other algorithms (NMAR)

Missing Rate   MMS       HDI       KNNMI     FCMOCS    NIIA      CMI       CRI       MIGEC
5%             0.0527    0.0507    0.0497    0.0467    0.0417    0.0407    0.0397    0.0367
10%            0.0522    0.0482    0.0482    0.0452    0.0392    0.0422    0.0392    0.0332
15%            0.0229    0.0179    0.0139    0.0109    0.0039    0.0079    0.0069    0.0039
20%            0.0291    0.0261    0.0171    0.0121    0.0041    0.0061    0.0051    0.0051
Average        0.03923   0.03573   0.03223   0.02873   0.02223   0.02423   0.02273   0.01973
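The difference tables are straightforward to recompute from the comparison tables (a sketch in Python; the function name is ours, and only the MIGEC and P.A. columns of the NMAR table are shown for brevity):

```python
def difference_rows(table, baseline="P.A."):
    """Recompute a 'difference' table (e.g. Table 3.14) from a comparison
    table: each method's RMSE minus the proposed algorithm's RMSE,
    plus the per-method average over the missing rates."""
    base = table[baseline]
    diffs = {}
    for method, rmses in table.items():
        if method == baseline:
            continue
        d = [round(r - b, 4) for r, b in zip(rmses, base)]
        diffs[method] = d + [round(sum(d) / len(d), 5)]
    return diffs

# NMAR comparison values (rows = 5%, 10%, 15%, 20%) from Table (3.13)
nmar = {"MIGEC": [0.155, 0.157, 0.164, 0.168],
        "P.A.":  [0.1183, 0.1238, 0.1601, 0.1629]}
```

Running it on these two columns reproduces the MIGEC row of Table (3.14), including the average of about 0.01973.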
Discussion of the Results:

Tables (3.9), (3.11) and (3.13) illustrate several results that we would like to discuss:
i- The outcome results demonstrate that the proposed algorithm performs better than the other eight approaches under all missingness mechanisms at varying missing rates.
ii- Different missing rates have different impacts on imputation accuracy. Generally speaking, the RMSE of all the methods increases approximately with increasing missing proportion. This is understandable because the more missingness is introduced into the datasets, the more information is lost; however, the nature of the data, outliers and noise also affect the accuracy of the imputation.
iii- In general, the worst RMSE values are achieved under the MCAR mechanism, followed by MAR and NMAR, respectively.
For further illustration, Tables (3.9), (3.11) and (3.13) are displayed in Figures (3.12), (3.13) and (3.14).

Figure (3.12): Comparison between the proposed algorithm (PA) and other imputation methods (MCAR)
Figure (3.13): Comparison between the proposed algorithm (PA) and other imputation methods (MAR)
Figure (3.14): Comparison between the proposed algorithm (PA) and other imputation methods (NMAR)
3.7 Results of the Proposed Algorithm for Simulated Data

Just as we first determined the optimal numbers of clusters and iterations for the Wine dataset, the same scenario is required for the simulated dataset.
3.7.1 Number of Iterations

For MCAR:

Table (3.15): Checking optimality by number of iterations (for MCAR), 10% missing rate, cluster = 5

Iterations   5        10       15       20       25       30       35       40       45       50
RMSE         0.1604   0.1601   0.1585   0.1583   0.1575   0.1590   0.1592   0.1595   0.1592   0.1593

From Table (3.15) and Figure (3.15), we can notice that the lowest RMSE (0.1575) is achieved at 25 iterations. All values lie between 0.157 and 0.161, so the RMSE differs only slightly across iteration counts.

Figure (3.15): Checking optimality by number of iterations (for MCAR)
For MAR:

Table (3.16): Checking optimality by number of iterations (for MAR), 10% missing rate, cluster = 5

Iterations   5        10       15       20       25       30       35       40       45       50
RMSE         0.1569   0.1583   0.1558   0.1572   0.1582   0.1581   0.1576   0.1573   0.1574   0.1574

In Table (3.16), iteration 15 has the lowest RMSE. However, Figure (3.16) clearly shows that the difference between the iteration counts (from 5 to 50) is very small, since the difference between the maximum and minimum RMSE is just 0.0025.

Figure (3.16): Checking optimality by number of iterations (for MAR)
For NMAR:

Table (3.17): Checking optimality by number of iterations (for NMAR), 10% missing rate, cluster = 5

Iterations   5        10       15       20       25       30       35       40       45       50
RMSE         0.1908   0.1954   0.1955   0.1946   0.1951   0.1930   0.1938   0.1928   0.1921   0.1922

For NMAR, iteration 5 has the best RMSE, while the worst is at iteration 15; the difference between them is just 0.004779, as seen in Table (3.17) and Figure (3.17).

Figure (3.17): Checking optimality by number of iterations (for NMAR)
3.7.2 Number of Clusters

Clustering can help improve the accuracy of the prediction by narrowing the potential space of the target value. Since the RMSE depends on the number of clusters in this technique, we should determine which grouping gives the best result (namely, the minimal RMSE).

For MCAR:

Table (3.18): Checking optimality by number of clusters (for MCAR)

Clusters   2        3        4        5        6        7        8        9        10
RMSE       0.1577   0.1589   0.1599   0.1575   0.1580   0.1590   0.1601   0.1583   0.1562

Figure (3.18): Checking optimality by number of clusters (for MCAR)
As seen in Table (3.18) and Figure (3.18), when the whole data is combined into 10 groups the RMSE of the PA declines to its minimum; the difference between the maximum and minimum RMSE is just 0.0039.

For MAR:

Table (3.19): Checking optimality by number of clusters (for MAR)
Clusters   2        3        4        5        6        7        8        9        10
RMSE       0.1566   0.1541   0.1558   0.1587   0.1588   0.1588   0.1571   0.1586   0.1536

Figure (3.19): Checking optimality by number of clusters (for MAR)

Table (3.19) and Figure (3.19) show that the same 10 clusters are desirable for MAR.

For NMAR:

Table (3.20): Checking optimality by number of clusters (for NMAR)
Clusters   2        3        4        5        6        7        8        9        10
RMSE       0.2003   0.1934   0.1937   0.1908   0.2002   0.1808   0.2017   0.1958   0.1974

Figure (3.20): Checking optimality by number of clusters (for NMAR)
From Table (3.20) and Figure (3.20), we see that seven (7) clusters is optimal for NMAR. The experimental results thus demonstrate that the proposed algorithm requires a different optimal number of clusters for each missingness mechanism. After selecting the optimal numbers of iterations and clusters, the final results for the simulated dataset are shown in Table (3.21) and Figure (3.21).
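The cluster-count scan behind Tables (3.18)–(3.20) can be sketched as follows. This is a minimal illustration on toy data, not the thesis's FCM/GRA implementation: it clusters rows with a hand-rolled k-means on the fully observed attributes, imputes each missing value with its cluster mean, and reports the cluster count that minimises RMSE. All data, names, and parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy complete data: the third attribute correlates with the first.
true = rng.normal(size=(200, 3))
true[:, 2] = 0.8 * true[:, 0] + 0.2 * rng.normal(size=200)

# MCAR mask on the last attribute (10% missing).
mask = rng.random(200) < 0.10
obs = true.copy()
obs[mask, 2] = np.nan

def impute_rmse(k):
    """Cluster rows on the observed attributes with a tiny Lloyd's
    k-means, impute each missing value with its cluster mean, and
    return RMSE against the ground truth."""
    X = obs[:, :2]
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(20):
        lab = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[lab == j].mean(0) if (lab == j).any()
                            else centers[j] for j in range(k)])
    filled = obs[:, 2].copy()
    for j in range(k):
        pool = obs[(lab == j) & ~mask, 2]        # observed members of cluster j
        filled[(lab == j) & mask] = pool.mean() if len(pool) else np.nanmean(obs[:, 2])
    return float(np.sqrt(np.mean((filled[mask] - true[mask, 2]) ** 2)))

# Scan the same candidate range as Tables (3.18)-(3.20): 2 to 10 clusters.
scores = {k: impute_rmse(k) for k in range(2, 11)}
best = min(scores, key=scores.get)
print(best, round(scores[best], 4))
```

The thesis's conclusion that the optimal cluster count differs per mechanism corresponds to running such a scan once per missingness pattern.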
Table (3.21): Result of proposed algorithm for simulated data under different mechanisms with varying missing rates

Missing Mechanism   5%         10%        15%           20%
MCAR                0.160187   0.156174   0.160077504   0.160309
MAR                 0.159357   0.153592   0.160294      0.159913
NMAR                0.187617   0.1808     0.2006182     0.20736
Figure (3.21): Result of proposed algorithm for simulated data under different mechanisms with varying missing rates
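The three missingness mechanisms compared in Table (3.21) can be generated on toy data as follows. This is a sketch under assumptions: the thresholding scheme used for MAR and NMAR is one common choice, not necessarily the thesis's exact generator.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))   # column 0 always observed, column 1 may go missing
rate = 0.10

# MCAR: missingness independent of all data values.
m_mcar = rng.random(1000) < rate

# MAR: missingness depends only on the observed column 0
# (here: rows with the largest column-0 values lose column 1).
thresh = np.quantile(X[:, 0], 1 - rate)
m_mar = X[:, 0] > thresh

# NMAR: missingness depends on the unobserved value itself
# (the largest column-1 values are exactly the ones that go missing).
thresh = np.quantile(X[:, 1], 1 - rate)
m_nmar = X[:, 1] > thresh

for name, m in [("MCAR", m_mcar), ("MAR", m_mar), ("NMAR", m_nmar)]:
    print(name, round(m.mean(), 2))
```

NMAR generated this way is the hardest case to impute, which is consistent with the noticeably larger RMSE values the thesis reports for NMAR in Table (3.21).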
3.8 Comparison between Wine and Simulated Dataset

The results show that our proposed algorithm remains stable as the size of the data increases, which indicates that it can be applied successfully to the huge amounts of data encountered in data mining, as shown in Figures (3.22), (3.23), (3.24), (3.25), (3.26) and (3.27). We overlaid the iteration and cluster figures of the Wine dataset and the simulated dataset to show that the results for the simulated dataset are more stable than those for the Wine dataset.
Figure (3.22): Comparison between Wine and simulated dataset for various iterations (MCAR)
Figure (3.23): Comparison between Wine and simulated dataset for various iterations (MAR)
Figure (3.24): Comparison between Wine and simulated dataset for various iterations (NMAR)
Figure (3.25): Comparison between Wine and simulated dataset for various clusters (MCAR)
Figure (3.26): Comparison between Wine and simulated dataset for various clusters (MAR)
Figure (3.27): Comparison between Wine and simulated dataset for various clusters (NMAR)
Figures (3.22), (3.23), (3.24), (3.25), (3.26) and (3.27) show that, in general, the results of PA for the simulated dataset are more stable than those for the Wine dataset under each missingness mechanism. From Table (3.21) and Figure (3.21) we can also observe that, as the size of the data grows, the proposed algorithm becomes stable under the different missing rates; that is, the RMSE differs only slightly between the various proportions of missingness. These results demonstrate the success of our proposed algorithm. We computed the RMSE interval of the proposed algorithm for the Wine dataset from Tables (3.9), (3.11) and (3.13), and for the simulated dataset from Table (3.21), to confirm that the proposed algorithm's results are more stable for the simulated dataset than for the Wine dataset. The interval for the Wine data is [0.13 – 0.18] for MCAR, [0.13 – 0.16] for MAR and [0.12 – 0.16] for NMAR, while the interval for the simulated data is [0.15 – 0.16] for MCAR, [0.15 – 0.16] for MAR and [0.18 – 0.21] for NMAR. The width of each interval for the simulated and Wine datasets is displayed in Table (3.22).

Table (3.22): Difference between RMSE of simulated and Wine dataset
Missing Mechanism   Interval (Wine)   Interval (Simulated)   Width (Wine)   Width (Simulated)
MCAR                0.13 – 0.18       0.15 – 0.16            0.05           0.01
MAR                 0.13 – 0.16       0.15 – 0.16            0.03           0.01
NMAR                0.12 – 0.16       0.18 – 0.21            0.04           0.03
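The same stability check can be computed directly from the unrounded simulated-data values of Table (3.21). Note that the thesis's Table (3.22) rounds the interval endpoints to two decimals first, so its widths are slightly coarser than the exact spreads below; the ordering of the mechanisms is the same either way.

```python
# Per-missing-rate RMSE values for the simulated dataset, read from Table (3.21).
rmse = {
    "MCAR": [0.160187, 0.156174, 0.160077504, 0.160309],
    "MAR":  [0.159357, 0.153592, 0.160294, 0.159913],
    "NMAR": [0.187617, 0.1808, 0.2006182, 0.20736],
}

# Spread (max - min) of RMSE across the 5%-20% missing rates.
width = {k: round(max(v) - min(v), 3) for k, v in rmse.items()}
print(width)
```

The small MCAR and MAR spreads confirm the claim that RMSE differs only slightly between missing proportions, while NMAR is visibly less stable.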
3.9 Comparing Algorithms by Number of Iterations

We compare the optimal number of iterations of our proposed algorithm with those of existing algorithms, as shown in Table (3.23).

Table (3.23): Comparing number of iterations between the proposed algorithm and existing algorithms

Algorithm                                  No. of Iterations
NIIA                                       27
CMI                                        22
CRI                                        25
MIGEC                                      18
Proposed algorithm (Wine dataset)          5
Proposed algorithm (simulated dataset)     15
From Table (3.23): under the MAR mechanism, when (NIIA, CMI, CRI, MIGEC) are applied to the RCSF (Remote Controlling for Spacecraft Flying) dataset, their optimal numbers of iterations are (27, 22, 25, 18) respectively. When the proposed algorithm is applied to the Wine and simulated datasets, the best RMSE is obtained after 5 iterations for the Wine dataset and 15 iterations for the simulated dataset. Among these algorithms, therefore, our proposed algorithm requires the minimum number of iterations, which leads to a shorter running time on large datasets.
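The iteration counts being compared above come from iterative imputation schemes that repeat until the imputed values stop changing. A generic round-robin regression imputer (a stand-in sketch, not the proposed algorithm or MIGEC) makes the idea concrete: each sweep re-estimates every column's missing cells from the other columns, and the sweep counter plays the role of the iteration numbers in Table (3.23).

```python
import numpy as np

rng = np.random.default_rng(2)
true = rng.normal(size=(300, 4))
true[:, 3] = true[:, :3].sum(1) + 0.1 * rng.normal(size=300)  # linearly related column
X = true.copy()
miss = rng.random(X.shape) < 0.1      # 10% MCAR missingness
X[miss] = np.nan

filled = np.where(miss, np.nanmean(X, axis=0), X)   # crude mean-imputed start
for it in range(1, 51):
    prev = filled.copy()
    for j in range(4):                # round-robin: regress column j on the others
        others = np.delete(filled, j, axis=1)
        A = np.c_[np.ones(len(X)), others]
        beta, *_ = np.linalg.lstsq(A, filled[:, j], rcond=None)
        pred = A @ beta
        filled[miss[:, j], j] = pred[miss[:, j]]     # refresh only the missing cells
    if np.max(np.abs(filled - prev)) < 1e-3:         # stop when imputations stabilise
        break
print("stopped after", it, "sweeps")
```

An algorithm that stabilises after fewer sweeps, as the proposed algorithm does in Table (3.23), directly translates into less run time on large datasets.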
Chapter Four
Conclusions and Future Work
4.1 Conclusions

The problem of incomplete data is one that researchers must handle. Many researchers fail to consider missing values of varying natures in their analyses, treating them as a single type or not considering the impact of the missing values at all. In this thesis an extension algorithm based on MIGEC for dealing with incomplete data has been proposed. The experimental results show:

1- Experiments on the Wine dataset from the University of California, Irvine (UCI) repository illustrate the superiority of the proposed algorithm over other imputation methods in the accuracy of imputing missing data under the three missingness types MCAR, MAR and NMAR.

2- The RMSE shows that our proposed algorithm achieves better results (namely, a smaller RMSE) than the MIGEC algorithm, with an average absolute difference of more than (0.025108).

3- When GRA is calculated on attributes instead of instances, we work with more homogeneous values than when GRA is calculated on instances, and as a result each attribute is assigned to the proper cluster.

4- In general, increasing the proportion of missing instances deteriorates the accuracy of the imputation as measured by RMSE. This indicates that incomplete values negatively affect the completion; in other words, more available information promotes the precision of the final predictions.

5- The proposed algorithm handles missing values and performs well with either small or huge amounts of raw data; we can conclude that it remains stable as the size of the dataset increases, which means it is suitable for large data repositories.

6- The proposed algorithm reaches its results with fewer imputation iterations than the other algorithms, which means less run time is needed for the huge amounts of data in data repositories.

7- The drawback of our proposed algorithm relative to MIGEC appears when there is a large amount of heterogeneity inside the attributes, since GRA in our proposed algorithm depends on attribute values instead of instance values. This conclusion emerged when we ran the algorithm on different simulated data.
4.2 Future Work

1- Working with data mining techniques requires a powerful computer to run the experiments quickly. In this thesis, because of computer limitations, we could not increase the size of the simulated dataset any further, since the proposed algorithm would need days to produce results on a vastly larger dataset.

2- Hybridize the proposed algorithm with other data mining or statistical techniques such as Neural Networks or Nearest Neighbour.

3- Extend the proposed algorithm to work with categorical attributes.

4- Noise has a great impact on the effectiveness of imputation techniques, and real-world data often contain much noise; therefore, another preprocessing algorithm could be implemented to clean the data before implementing PA.

5- Implement different data mining algorithms, such as association rule mining, on top of PA and compare the results with other existing algorithms.
Appendixes
Appendix A

Wine dataset from the UCI (University of California, Irvine) Machine Learning Repository.

Alcohol  Malic  Ash  Alcalinity  Magnesium  Phenols  Flavanoids  Nonflavanoids  Proanthocyanins  Color  Hue  Dilution  Proline
14.23
1.71
2.43
15.6
127
2.8
3.06
0.28
2.29
5.64
1.04
3.92
1065
13.2
1.78
2.14
11.2
100
2.65
2.76
0.26
1.28
4.38
1.05
3.4
1050
13.16
2.36
2.67
18.6
101
2.8
3.24
0.3
2.81
5.68
1.03
3.17
1185
14.37
1.95
2.5
16.8
113
3.85
3.49
0.24
2.18
7.8
0.86
3.45
1480
13.24
2.59
2.87
21
118
2.8
2.69
0.39
1.82
4.32
1.04
2.93
735
14.2
1.76
2.45
15.2
112
3.27
3.39
0.34
1.97
6.75
1.05
2.85
1450
14.39
1.87
2.45
14.6
96
2.5
2.52
0.3
1.98
5.25
1.02
3.58
1290
14.06
2.15
2.61
17.6
121
2.6
2.51
0.31
1.25
5.05
1.06
3.58
1295
14.83
1.64
2.17
14
97
2.8
2.98
0.29
1.98
5.2
1.08
2.85
1045
13.86
1.35
2.27
16
98
2.98
3.15
0.22
1.85
7.22
1.01
3.55
1045
14.1
2.16
2.3
18
105
2.95
3.32
0.22
2.38
5.75
1.25
3.17
1510
14.12
1.48
2.32
16.8
95
2.2
2.43
0.26
1.57
5
1.17
2.82
1280
13.75
1.73
2.41
16
89
2.6
2.76
0.29
1.81
5.6
1.15
2.9
1320
14.75
1.73
2.39
11.4
91
3.1
3.69
0.43
2.81
5.4
1.25
2.73
1150
14.38
1.87
2.38
12
102
3.3
3.64
0.29
2.96
7.5
1.2
3
1547
13.63
1.81
2.7
17.2
112
2.85
2.91
0.3
1.46
7.3
1.28
2.88
1310
14.3
1.92
2.72
20
120
2.8
3.14
0.33
1.97
6.2
1.07
2.65
1280
13.83
1.57
2.62
20
115
2.95
3.4
0.4
1.72
6.6
1.13
2.57
1130
14.19
1.59
2.48
16.5
108
3.3
3.93
0.32
1.86
8.7
1.23
2.82
1680
13.64
3.1
2.56
15.2
116
2.7
3.03
0.17
1.66
5.1
0.96
3.36
845
14.06
1.63
2.28
16
126
3
3.17
0.24
2.1
5.65
1.09
3.71
780
12.93
3.8
2.65
18.6
102
2.41
2.41
0.25
1.98
4.5
1.03
3.52
770
13.71
1.86
2.36
16.6
101
2.61
2.88
0.27
1.69
3.8
1.11
4
1035
12.85
1.6
2.52
17.8
95
2.48
2.37
0.26
1.46
3.93
1.09
3.63
1015
13.5
1.81
2.61
20
96
2.53
2.61
0.28
1.66
3.52
1.12
3.82
845
13.05
2.05
3.22
25
124
2.63
2.68
0.47
1.92
3.58
1.13
3.2
830
13.39
1.77
2.62
16.1
93
2.85
2.94
0.34
1.45
4.8
0.92
3.22
1195
13.3
1.72
2.14
17
94
2.4
2.19
0.27
1.35
3.95
1.02
2.77
1285
13.87
1.9
2.8
19.4
107
2.95
2.97
0.37
1.76
4.5
1.25
3.4
915
14.02
1.68
2.21
16
96
2.65
2.33
0.26
1.98
4.7
1.04
3.59
1035
13.73
1.5
2.7
22.5
101
3
3.25
0.29
2.38
5.7
1.19
2.71
1285
13.58
1.66
2.36
19.1
106
2.86
3.19
0.22
1.95
6.9
1.09
2.88
1515
13.68
1.83
2.36
17.2
104
2.42
2.69
0.42
1.97
3.84
1.23
2.87
990
13.76
1.53
2.7
19.5
132
2.95
2.74
0.5
1.35
5.4
1.25
3
1235
13.51
1.8
2.65
19
110
2.35
2.53
0.29
1.54
4.2
1.1
2.87
1095
13.48
1.81
2.41
20.5
100
2.7
2.98
0.26
1.86
5.1
1.04
3.47
920
13.28
1.64
2.84
15.5
110
2.6
2.68
0.34
1.36
4.6
1.09
2.78
880
13.05
1.65
2.55
18
98
2.45
2.43
0.29
1.44
4.25
1.12
2.51
1105
13.07
1.5
2.1
15.5
98
2.4
2.64
0.28
1.37
3.7
1.18
2.69
1020
14.22
3.99
2.51
13.2
128
3
3.04
0.2
2.08
5.1
0.89
3.53
760
13.56
1.71
2.31
16.2
117
3.15
3.29
0.34
2.34
6.13
0.95
3.38
795
13.41
3.84
2.12
18.8
90
2.45
2.68
0.27
1.48
4.28
0.91
3
1035
13.88
1.89
2.59
15
101
3.25
3.56
0.17
1.7
5.43
0.88
3.56
1095
13.24
3.98
2.29
17.5
103
2.64
2.63
0.32
1.66
4.36
0.82
3
680
13.05
1.77
2.1
17
107
3
3
0.28
2.03
5.04
0.88
3.35
885
14.21
4.04
2.44
18.9
111
2.85
2.65
0.3
1.25
5.24
0.87
3.33
1080
14.38
3.59
2.28
16
102
3.25
3.17
0.27
2.19
4.9
1.04
3.44
1065
13.9
1.68
2.12
16
101
3.1
3.39
0.21
2.14
6.1
0.91
3.33
985
14.1
2.02
2.4
18.8
103
2.75
2.92
0.32
2.38
6.2
1.07
2.75
1060
13.94
1.73
2.27
17.4
108
2.88
3.54
0.32
2.08
8.9
1.12
3.1
1260
13.05
1.73
2.04
12.4
92
2.72
3.27
0.17
2.91
7.2
1.12
2.91
1150
13.83
1.65
2.6
17.2
94
2.45
2.99
0.22
2.29
5.6
1.24
3.37
1265
13.82
1.75
2.42
14
111
3.88
3.74
0.32
1.87
7.05
1.01
3.26
1190
13.77
1.9
2.68
17.1
115
3
2.79
0.39
1.68
6.3
1.13
2.93
1375
13.74
1.67
2.25
16.4
118
2.6
2.9
0.21
1.62
5.85
0.92
3.2
1060
13.56
1.73
2.46
20.5
116
2.96
2.78
0.2
2.45
6.25
0.98
3.03
1120
14.22
1.7
2.3
16.3
118
3.2
3
0.26
2.03
6.38
0.94
3.31
970
13.29
1.97
2.68
16.8
102
3
3.23
0.31
1.66
6
1.07
2.84
1270
13.72
1.43
2.5
16.7
108
3.4
3.67
0.19
2.04
6.8
0.89
2.87
1285
12.37
0.94
1.36
10.6
88
1.98
0.57
0.28
0.42
1.95
1.05
1.82
520
12.33
1.1
2.28
16
101
2.05
1.09
0.63
0.41
3.27
1.25
1.67
680
12.64
1.36
2.02
16.8
100
2.02
1.41
0.53
0.62
5.75
0.98
1.59
450
13.67
1.25
1.92
18
94
2.1
1.79
0.32
0.73
3.8
1.23
2.46
630
12.37
1.13
2.16
19
87
3.5
3.1
0.19
1.87
4.45
1.22
2.87
420
12.17
1.45
2.53
19
104
1.89
1.75
0.45
1.03
2.95
1.45
2.23
355
12.37
1.21
2.56
18.1
98
2.42
2.65
0.37
2.08
4.6
1.19
2.3
678
13.11
1.01
1.7
15
78
2.98
3.18
0.26
2.28
5.3
1.12
3.18
502
12.37
1.17
1.92
19.6
78
2.11
2
0.27
1.04
4.68
1.12
3.48
510
13.34
0.94
2.36
17
110
2.53
1.3
0.55
0.42
3.17
1.02
1.93
750
12.21
1.19
1.75
16.8
151
1.85
1.28
0.14
2.5
2.85
1.28
3.07
718
12.29
1.61
2.21
20.4
103
1.1
1.02
0.37
1.46
3.05
0.906
1.82
870
13.86
1.51
2.67
25
86
2.95
2.86
0.21
1.87
3.38
1.36
3.16
410
13.49
1.66
2.24
24
87
1.88
1.84
0.27
1.03
3.74
0.98
2.78
472
12.99
1.67
2.6
30
139
3.3
2.89
0.21
1.96
3.35
1.31
3.5
985
11.96
1.09
2.3
21
101
3.38
2.14
0.13
1.65
3.21
0.99
3.13
886
11.66
1.88
1.92
16
97
1.61
1.57
0.34
1.15
3.8
1.23
2.14
428
13.03
0.9
1.71
16
86
1.95
2.03
0.24
1.46
4.6
1.19
2.48
392
11.84
2.89
2.23
18
112
1.72
1.32
0.43
0.95
2.65
0.96
2.52
500
12.33
0.99
1.95
14.8
136
1.9
1.85
0.35
2.76
3.4
1.06
2.31
750
12.7
3.87
2.4
23
101
2.83
2.55
0.43
1.95
2.57
1.19
3.13
463
12
0.92
2
19
86
2.42
2.26
0.3
1.43
2.5
1.38
3.12
278
12.72
1.81
2.2
18.8
86
2.2
2.53
0.26
1.77
3.9
1.16
3.14
714
12.08
1.13
2.51
24
78
2
1.58
0.4
1.4
2.2
1.31
2.72
630
13.05
3.86
2.32
22.5
85
1.65
1.59
0.61
1.62
4.8
0.84
2.01
515
11.84
0.89
2.58
18
94
2.2
2.21
0.22
2.35
3.05
0.79
3.08
520
12.67
0.98
2.24
18
99
2.2
1.94
0.3
1.46
2.62
1.23
3.16
450
12.16
1.61
2.31
22.8
90
1.78
1.69
0.43
1.56
2.45
1.33
2.26
495
11.65
1.67
2.62
26
88
1.92
1.61
0.4
1.34
2.6
1.36
3.21
562
11.64
2.06
2.46
21.6
84
1.95
1.69
0.48
1.35
2.8
1
2.75
680
12.08
1.33
2.3
23.6
70
2.2
1.59
0.42
1.38
1.74
1.07
3.21
625
12.08
1.83
2.32
18.5
81
1.6
1.5
0.52
1.64
2.4
1.08
2.27
480
12
1.51
2.42
22
86
1.45
1.25
0.5
1.63
3.6
1.05
2.65
450
12.69
1.53
2.26
20.7
80
1.38
1.46
0.58
1.62
3.05
0.96
2.06
495
12.29
2.83
2.22
18
88
2.45
2.25
0.25
1.99
2.15
1.15
3.3
290
11.62
1.99
2.28
18
98
3.02
2.26
0.17
1.35
3.25
1.16
2.96
345
12.47
1.52
2.2
19
162
2.5
2.27
0.32
3.28
2.6
1.16
2.63
937
11.81
2.12
2.74
21.5
134
1.6
0.99
0.14
1.56
2.5
0.95
2.26
625
12.29
1.41
1.98
16
85
2.55
2.5
0.29
1.77
2.9
1.23
2.74
428
12.37
1.07
2.1
18.5
88
3.52
3.75
0.24
1.95
4.5
1.04
2.77
660
12.29
3.17
2.21
18
88
2.85
2.99
0.45
2.81
2.3
1.42
2.83
406
12.08
2.08
1.7
17.5
97
2.23
2.17
0.26
1.4
3.3
1.27
2.96
710
12.6
1.34
1.9
18.5
88
1.45
1.36
0.29
1.35
2.45
1.04
2.77
562
12.34
2.45
2.46
21
98
2.56
2.11
0.34
1.31
2.8
0.8
3.38
438
11.82
1.72
1.88
19.5
86
2.5
1.64
0.37
1.42
2.06
0.94
2.44
415
12.51
1.73
1.98
20.5
85
2.2
1.92
0.32
1.48
2.94
1.04
3.57
672
12.42
2.55
2.27
22
90
1.68
1.84
0.66
1.42
2.7
0.86
3.3
315
12.25
1.73
2.12
19
80
1.65
2.03
0.37
1.63
3.4
1
3.17
510
12.72
1.75
2.28
22.5
84
1.38
1.76
0.48
1.63
3.3
0.88
2.42
488
12.22
1.29
1.94
19
92
2.36
2.04
0.39
2.08
2.7
0.86
3.02
312
11.61
1.35
2.7
20
94
2.74
2.92
0.29
2.49
2.65
0.96
3.26
680
11.46
3.74
1.82
19.5
107
3.18
2.58
0.24
3.58
2.9
0.75
2.81
562
12.52
2.43
2.17
21
88
2.55
2.27
0.26
1.22
2
0.9
2.78
325
11.76
2.68
2.92
20
103
1.75
2.03
0.6
1.05
3.8
1.23
2.5
607
11.41
0.74
2.5
21
88
2.48
2.01
0.42
1.44
3.08
1.1
2.31
434
12.08
1.39
2.5
22.5
84
2.56
2.29
0.43
1.04
2.9
0.93
3.19
385
11.03
1.51
2.2
21.5
85
2.46
2.17
0.52
2.01
1.9
1.71
2.87
407
11.82
1.47
1.99
20.8
86
1.98
1.6
0.3
1.53
1.95
0.95
3.33
495
12.42
1.61
2.19
22.5
108
2
2.09
0.34
1.61
2.06
1.06
2.96
345
12.77
3.43
1.98
16
80
1.63
1.25
0.43
0.83
3.4
0.7
2.12
372
12
3.43
2
19
87
2
1.64
0.37
1.87
1.28
0.93
3.05
564
11.45
2.4
2.42
20
96
2.9
2.79
0.32
1.83
3.25
0.8
3.39
625
11.56
2.05
3.23
28.5
119
3.18
5.08
0.47
1.87
6
0.93
3.69
465
12.42
4.43
2.73
26.5
102
2.2
2.13
0.43
1.71
2.08
0.92
3.12
365
13.05
5.8
2.13
21.5
86
2.62
2.65
0.3
2.01
2.6
0.73
3.1
380
11.87
4.31
2.39
21
82
2.86
3.03
0.21
2.91
2.8
0.75
3.64
380
12.07
2.16
2.17
21
85
2.6
2.65
0.37
1.35
2.76
0.86
3.28
378
12.43
1.53
2.29
21.5
86
2.74
3.15
0.39
1.77
3.94
0.69
2.84
352
11.79
2.13
2.78
28.5
92
2.13
2.24
0.58
1.76
3
0.97
2.44
466
12.37
1.63
2.3
24.5
88
2.22
2.45
0.4
1.9
2.12
0.89
2.78
342
12.04
4.3
2.38
22
80
2.1
1.75
0.42
1.35
2.6
0.79
2.57
580
12.86
1.35
2.32
18
122
1.51
1.25
0.21
0.94
4.1
0.76
1.29
630
12.88
2.99
2.4
20
104
1.3
1.22
0.24
0.83
5.4
0.74
1.42
530
12.81
2.31
2.4
24
98
1.15
1.09
0.27
0.83
5.7
0.66
1.36
560
12.7
3.55
2.36
21.5
106
1.7
1.2
0.17
0.84
5
0.78
1.29
600
12.51
1.24
2.25
17.5
85
2
0.58
0.6
1.25
5.45
0.75
1.51
650
12.6
2.46
2.2
18.5
94
1.62
0.66
0.63
0.94
7.1
0.73
1.58
695
12.25
4.72
2.54
21
89
1.38
0.47
0.53
0.8
3.85
0.75
1.27
720
12.53
5.51
2.64
25
96
1.79
0.6
0.63
1.1
5
0.82
1.69
515
13.49
3.59
2.19
19.5
88
1.62
0.48
0.58
0.88
5.7
0.81
1.82
580
12.84
2.96
2.61
24
101
2.32
0.6
0.53
0.81
4.92
0.89
2.15
590
12.93
2.81
2.7
21
96
1.54
0.5
0.53
0.75
4.6
0.77
2.31
600
13.36
2.56
2.35
20
89
1.4
0.5
0.37
0.64
5.6
0.7
2.47
780
13.52
3.17
2.72
23.5
97
1.55
0.52
0.5
0.55
4.35
0.89
2.06
520
13.62
4.95
2.35
20
92
2
0.8
0.47
1.02
4.4
0.91
2.05
550
12.25
3.88
2.2
18.5
112
1.38
0.78
0.29
1.14
8.21
0.65
2
855
13.16
3.57
2.15
21
102
1.5
0.55
0.43
1.3
4
0.6
1.68
830
13.88
5.04
2.23
20
80
0.98
0.34
0.4
0.68
4.9
0.58
1.33
415
12.87
4.61
2.48
21.5
86
1.7
0.65
0.47
0.86
7.65
0.54
1.86
625
13.32
3.24
2.38
21.5
92
1.93
0.76
0.45
1.25
8.42
0.55
1.62
650
13.08
3.9
2.36
21.5
113
1.41
1.39
0.34
1.14
9.4
0.57
1.33
550
13.5
3.12
2.62
24
123
1.4
1.57
0.22
1.25
8.6
0.59
1.3
500
12.79
2.67
2.48
22
112
1.48
1.36
0.24
1.26
10.8
0.48
1.47
480
13.11
1.9
2.75
25.5
116
2.2
1.28
0.26
1.56
7.1
0.61
1.33
425
13.23
3.3
2.28
18.5
98
1.8
0.83
0.61
1.87
10.52
0.56
1.51
675
12.58
1.29
2.1
20
103
1.48
0.58
0.53
1.4
7.6
0.58
1.55
640
13.17
5.19
2.32
22
93
1.74
0.63
0.61
1.55
7.9
0.6
1.48
725
13.84
4.12
2.38
19.5
89
1.8
0.83
0.48
1.56
9.01
0.57
1.64
480
12.45
3.03
2.64
27
97
1.9
0.58
0.63
1.14
7.5
0.67
1.73
880
14.34
1.68
2.7
25
98
2.8
1.31
0.53
2.7
13
0.57
1.96
660
13.48
1.67
2.64
22.5
89
2.6
1.1
0.52
2.29
11.75
0.57
1.78
620
12.36
3.83
2.38
21
88
2.3
0.92
0.5
1.04
7.65
0.56
1.58
520
13.69
3.26
2.54
20
107
1.83
0.56
0.5
0.8
5.88
0.96
1.82
680
12.85
3.27
2.58
22
106
1.65
0.6
0.6
0.96
5.58
0.87
2.11
570
12.96
3.45
2.35
18.5
106
1.39
0.7
0.4
0.94
5.28
0.68
1.75
675
13.78
2.76
2.3
22
90
1.35
0.68
0.41
1.03
9.58
0.7
1.68
615
13.73
4.36
2.26
22.5
88
1.28
0.47
0.52
1.15
6.62
0.78
1.75
520
13.45
3.7
2.6
23
111
1.7
0.92
0.43
1.46
10.68
0.85
1.56
695
12.82
3.37
2.3
19.5
88
1.48
0.66
0.4
0.97
10.26
0.72
1.75
685
13.58
2.58
2.69
24.5
105
1.55
0.84
0.39
1.54
8.66
0.74
1.8
750
13.4
4.6
2.86
25
112
1.98
0.96
0.27
1.11
8.5
0.67
1.92
630
12.2
3.03
2.32
19
96
1.25
0.49
0.4
0.73
5.5
0.66
1.83
510
12.77
2.39
2.28
19.5
86
1.39
0.51
0.48
0.64
9.899999
0.57
1.63
470
14.16
2.51
2.48
20
91
1.68
0.7
0.44
1.24
9.7
0.62
1.71
660
13.71
5.65
2.45
20.5
95
1.68
0.61
0.52
1.06
7.7
0.64
1.74
740
13.4
3.91
2.48
23
102
1.8
0.75
0.43
1.41
7.3
0.7
1.56
750
13.27
4.28
2.26
20
120
1.59
0.69
0.43
1.35
10.2
0.59
1.56
835
13.17
2.59
2.37
20
120
1.65
0.68
0.53
1.46
9.3
0.6
1.62
840
14.13
4.1
2.74
24.5
96
2.05
0.76
0.56
1.35
9.2
0.61
1.6
560
Appendix B

Results for Wine dataset

Table B1: Checking iteration optimality for MCAR
MCAR with 10% missing rate, (5, 10, 15, 20, 25, 30, 35, 40, 45, 50) iterations and 5 clusters

Iterations
5
10
15
20
25
30
0.119995
0.119995
0.119995
0.119995
0.119995
0.119995
0.157127
0.157127
0.157127
0.157127
0.157127
0.144542
0.144542
0.144542
0.144542
0.163347
0.163347
0.163347
0.123995
0.123995
0.141801
35
40
45
50
0.119995
0.119995
0.119995
0.119995
0.157127
0.157127
0.157127
0.157127
0.157127
0.144542
0.144542
0.144542
0.144542
0.144542
0.144542
0.163347
0.163347
0.163347
0.163347
0.163347
0.163347
0.163347
0.123995
0.123995
0.123995
0.123995
0.123995
0.123995
0.123995
0.123995
0.145956
0.145956
0.145956
0.145956
0.145956
0.145956
0.145956
0.145956
0.145956
0.219304
0.219304
0.219304
0.219304
0.219304
0.219304
0.219304
0.219304
0.219304
0.135228
0.135228
0.135228
0.135228
0.135228
0.135228
0.135228
0.135228
0.135228
0.198424
0.198424
0.198424
0.198424
0.198424
0.198424
0.198424
0.198424
0.198424
0.154646
0.154646
0.154646
0.154646
0.154646
0.154646
0.154646
0.154646
0.154646
0.156257
0.144885
0.144885
0.144885
0.144885
0.144885
0.144885
0.144885
0.144885
0.200714
0.200714
0.200714
0.200714
0.200714
0.200714
0.200714
0.200714
0.177991
0.177991
0.177991
0.177991
0.177991
0.177991
0.177991
0.177991
0.132489
0.132489
0.132489
0.132489
0.132489
0.132489
0.132489
0.132489
0.143639
0.143639
0.143639
0.143639
0.143639
0.143639
0.143639
0.143639
0.157485
0.195644
0.195644
0.195644
0.195644
0.195644
0.195644
0.195644
0.208403
0.208403
0.208403
0.208403
0.208403
0.208403
0.208403
0.161633
0.161633
0.161633
0.161633
0.161633
0.161633
0.161633
0.192958
0.192958
0.192958
0.192958
0.192958
0.192958
0.192958
0.178453
0.178453
0.178453
0.178453
0.178453
0.178453
0.178453
0.164969
0.139598
0.139598
0.139598
0.139598
0.139598
0.139598
0.175441
0.175441
0.175441
0.175441
0.175441
0.175441
0.194173
0.194173
0.194173
0.194173
0.194173
0.194173
0.111892
0.111892
0.111892
0.111892
0.111892
0.111892
0.12863
0.12863
0.12863
0.12863
0.12863
0.12863
0.161964
0.188968
0.188968
0.188968
0.188968
0.188968
0.12392
0.12392
0.12392
0.12392
0.12392
0.181677
0.181677
0.181677
0.181677
0.181677
0.128676
0.128676
0.128676
0.128676
0.128676
0.17699
0.17699
0.17699
0.17699
0.17699
0.161645
0.152143
0.152143
0.152143
0.152143
0.173484
0.173484
0.173484
0.173484
0.2229
0.2229
0.2229
0.2229
Appendixes
94 0.1616
0.1616
0.1616
0.1616
0.170026
0.170026
0.170026
0.170026
0.1637
0.129903
0.129903
0.129903
0.096851
0.096851
0.096851
0.12573
0.12573
0.12573
0.151848
0.151848
0.151848
0.168106
0.168106
0.168106
0.160048
0.178388
0.178388
0.158312
0.158312
0.149087
0.149087
0.147104
0.147104
0.119003
0.119003
0.158974
0.136323 0.158392 0.19003 0.171362 0.142905 0.159057
Table B2: Checking iteration optimality for MAR with a 10% missing rate, (5, 10, 15, 20, 25, 30, 35, 40, 45, 50) iterations, and 5 clusters.
Table B3: Checking iteration optimality for NMAR with a 10% missing rate, (5, 10, 15, 20, 25, 30, 35, 40, 45, 50) iterations, and 5 clusters.
Table B4: Checking cluster optimality for MCAR, with clusters from 2 to 10, a 10% missing rate, and 5 iterations.
Table B5: Checking cluster optimality for MAR, with clusters from 2 to 10, a 10% missing rate, and 5 iterations.
Table B6: Checking cluster optimality for NMAR, with clusters from 2 to 10, a 10% missing rate, and 10 iterations.
Final results for MCAR, MAR, and NMAR with varying missing rates (5%, 10%, 15%, and 20%)

MCAR with 5 iterations and 7 clusters
5%          10%         15%         20%
0.1357158   0.119995    0.1320235   0.1478433
0.2369731   0.136033    0.1777792   0.176007
0.1420297   0.1492543   0.1614744   0.1787792
0.2457712   0.1258645   0.1246992   0.1868581
0.131962    0.1338744   0.1638367   0.1466173
0.1784904   0.1330042   0.1519626   0.167221
(The last row is the column average.)

MAR with 5 iterations and 2 clusters
5%          10%         15%         20%
0.1256567   0.1164238   0.1257124   0.1302127
0.09937878  0.1111938   0.1695857   0.1710534
0.130162    0.1455251   0.1485814   0.182729
0.1371659   0.139088    0.1849823   0.1511046
0.1904594   0.1484469   0.1174341   0.1895456
0.13656456  0.1321355   0.1492592   0.1649291
(The last row is the column average.)

NMAR with 10 iterations and 8 clusters
5%          10%         15%         20%
0.09692448  0.1018274   0.1372783   0.1423096
0.1426843   0.102285    0.1578463   0.1695661
0.1099717   0.1204954   0.1592683   0.1479767
0.1132756   0.125493    0.1565235   0.1706958
0.1651139   0.1197519   0.194072    0.1577921
0.1358363   0.1433045   0.1848806   0.1911542
0.1128783   0.153223    0.1445272   0.1691071
0.07659872  0.1317386   0.1702514   0.166472
0.1334439   0.1195941   0.1544002   0.161353
0.09654459  0.1203773   0.1415541   0.152692
0.11832718  0.123809    0.1600602   0.1629119
(The last row is the column average.)
Appendix C
Results for the simulated dataset
Table C1: Checking iteration optimality for MCAR with a 10% missing rate, (5, 10, 15, 20, 25, 30, 35, 40, 45, 50) iterations, and 5 clusters.
Table C2: Checking iteration optimality for MAR with a 10% missing rate, (5, 10, 15, 20, 25, 30, 35, 40, 45, 50) iterations, and 5 clusters.
Table C3: Checking iteration optimality for NMAR with a 10% missing rate, (5, 10, 15, 20, 25, 30, 35, 40, 45, 50) iterations, and 5 clusters.
Table C4: Checking cluster optimality for MCAR, with clusters from 2 to 10, a 10% missing rate, and 25 iterations.
Table C5: Checking cluster optimality for MAR, with clusters from 2 to 10, a 10% missing rate, and 15 iterations.
Table C6: Checking cluster optimality for NMAR, with clusters from 2 to 10, a 10% missing rate, and 5 iterations.
Final results for MCAR, MAR, and NMAR with varying missing rates (5%, 10%, 15%, and 20%)

MCAR with 25 iterations and 10 clusters
5%           10%          15%          20%
0.1670757    0.1533592    0.1635079    0.1610349
0.1496876    0.1603954    0.153637     0.1481382
0.1666809    0.1482683    0.1635736    0.1714814
0.1843632    0.1630901    0.1641762    0.1589371
0.1652808    0.1436523    0.1438135    0.1620806
0.1520065    0.1583527    0.1588536    0.1684821
0.1484322    0.1506295    0.1571042    0.1518329
0.1652677    0.1486734    0.1500573    0.165141
0.1313637    0.15807      0.1609874    0.1615885
0.1565603    0.1675288    0.1611219    0.1646294
0.1638889    0.1562098    0.1676683    0.1700945
0.1529984    0.1556324    0.1670556    0.1545456
0.1472798    0.1806822    0.1715883    0.1510724
0.1607192    0.1422305    0.1653691    0.156485
0.1520928    0.1739812    0.1670011    0.1624461
0.1506007    0.1417505    0.1610007    0.1609171
0.1679272    0.1485046    0.1493622    0.1580064
0.1740199    0.1598972    0.1673336    0.1642655
0.1594476    0.1590815    0.14001      0.1580765
0.1711124    0.1635823    0.1631971    0.1586979
0.1532117    0.1533676    0.1600419    0.1613813
0.1672335    0.1524603    0.1554851    0.1604047
0.164847     0.1503753    0.1562593    0.1668781
0.1618757    0.1601699    0.161643     0.1455881
0.1707137    0.1543964    0.1720897    0.1655179
0.160187484  0.156173656  0.160077504  0.160308928
(The last row is the column average.)
MAR with 15 iterations and 10 clusters
5%          10%         15%         20%
0.1456426   0.1537281   0.1539078   0.1614865
0.1537746   0.1564276   0.168065    0.1558089
0.1559374   0.1711735   0.1539176   0.1609754
0.1598557   0.1479904   0.1605395   0.1492341
0.1780625   0.1505579   0.1580683   0.1615557
0.182565    0.1518151   0.1514933   0.1723007
0.1304944   0.1490839   0.161859    0.160757
0.1494331   0.155685    0.1621599   0.1651331
0.1528486   0.1432987   0.1667488   0.1606182
0.1634133   0.150177    0.1601363   0.1637187
0.1530894   0.1492152   0.1684983   0.1516515
0.1590996   0.1464023   0.151475    0.1657967
0.1573837   0.1645329   0.1755507   0.1588642
0.1657624   0.1649376   0.1547743   0.1549464
0.1829929   0.1488475   0.1572161   0.1558534
0.159357    0.1535915   0.160294    0.1599134
(The last row is the column average.)
NMAR with 5 iterations and 7 clusters
5%          10%         15%         20%
0.180415    0.1912149   0.1966909   0.2101974
0.2242219   0.1852416   0.2089012   0.2048447
0.1837408   0.1892655   0.2039496   0.2112158
0.1781351   0.1774081   0.2018332   0.2061187
0.1715732   0.1608717   0.1917163   0.204424
0.1876172   0.1808004   0.2006182   0.2073601
(The last row is the column average.)
Appendix D
Programming (wine dataset)
This program was written in the R programming language (version 3.2.3, 2015).
1. Kill all variables
2. Set Seed = 1200
3. FOR (z = 1, 2, 3, 4, 5):  # number of iterations (runs)
4. Kill all variables except the z variable
5. Create Y1 as a matrix of the raw data (wine data set)
6. Remove any heterogeneous attribute from Y1
7. Set Y = Y1
8. Set C = 5  ← initialize the number of clusters (c >= 2)
9. Set V ← the number of variables in the matrix Y  # the data set variables
10. Set iter ← initialize the number of fuzzy iterations (maximum iteration is 100)
11. Set s ← initialize the value of the fuzziness parameter  # 2
12. Set B ← initialize the value of the grey relational analysis parameter  # 1
13. GRG = initialize an empty matrix for the GRG
14. Set CV = NULL  # maximum value of the GRG
15. Pr = initialize an empty matrix for the entropy values
16. Set E = NULL  # set the entropy vector of instances to NULL
17. Set H = NULL  # set the entropy vector to NULL
18. Set L = NULL  # set the average vector of instances to NULL
19. Set ENTR = NULL  # set the final entropy vector to NULL
20. Set INF = 0  # set the initial value of the information vector to zero
21. Set WF = NULL  # set the average weight vector to NULL
22. ES = initialize an empty matrix for the estimated missing values
23. Select case
24. Case MCAR: X = MCAR(Y, α)  # Y: original data set, α: probability of missing rate
    Case MAR: X = MAR(Y, α)  # Y: original data set, α: probability of missing rate
    Case NMAR: X = NMAR(Y, α)  # Y: original data set, α: probability of missing rate
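Step 24 amputates the complete data under one of three missingness mechanisms; MCAR(Y, α) is the author's own R routine. As an illustration only, the MCAR case (every cell deleted independently with probability α, regardless of any value) can be sketched in Python; the function name, seed, and stand-in matrix below are assumptions, not part of the thesis code:

```python
import random

def inject_mcar(rows, alpha, seed=1200):
    """Delete each cell independently with probability alpha (MCAR:
    missingness depends on no observed or unobserved value)."""
    rng = random.Random(seed)
    return [[None if rng.random() < alpha else v for v in row] for row in rows]

data = [[float(i + j) for j in range(13)] for i in range(200)]  # stand-in matrix
masked = inject_mcar(data, alpha=0.10)
missing_rate = sum(v is None for row in masked for v in row) / (200 * 13)
print(missing_rate)  # empirically close to alpha = 0.10
```

Under MAR the deletion probability would instead depend on an observed attribute, and under NMAR on the (unobserved) value itself.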
Call the mice library from the R packages
25. Separate the Xc matrix by calling the cc function  # Xc holds the cases (instances) without missing data (i.e., Xc is the complete data set)
26. Separate the Xic matrix by calling the ic function  # Xic holds the cases (instances) with missing data (i.e., Xic is the incomplete data set)
27. Call the e1071 library from the R packages to calculate fuzzy c-means
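Steps 25-26 rely on mice's cc() (complete cases) and ic() (incomplete cases) helpers. The same split can be sketched in Python, with None standing in for R's NA (an illustrative re-implementation, not the thesis code):

```python
def split_cases(rows):
    """Partition instances into complete cases (no missing cell) and
    incomplete cases (at least one missing cell), like cc()/ic()."""
    complete = [r for r in rows if all(v is not None for v in r)]
    incomplete = [r for r in rows if any(v is None for v in r)]
    return complete, incomplete

Xc, Xic = split_cases([[1.0, 2.0], [1.0, None], [3.0, 4.0]])
print(Xc, Xic)  # [[1.0, 2.0], [3.0, 4.0]] [[1.0, None]]
```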
28. FCM = cmeans(Xc, iter, m=s)  # apply FCM to the complete dataset
29. cp = unname(FCM$centers)  # retrieve the cluster centroids from FCM
30. Rank the incomplete instances by their amount of missingness, in descending order
31. Xic = Xic[order(rowSums(is.na(Xic)), decreasing=TRUE), ]
32. Apply grey system theory to the incomplete dataset (Xic) {GRA}
33. Normalize the incomplete dataset for GRA by calling the norm function
34. nor <- as.data.frame(lapply(Xic, norm))
35. For (i in 1:C){
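Steps 30-31 process the most damaged instances first by sorting on the per-row count of missing cells. A Python equivalent of order(rowSums(is.na(Xic)), decreasing=TRUE), for illustration only, with None marking a missing cell:

```python
def rank_by_missingness(rows):
    """Sort instances so rows with the most missing cells come first;
    Python's sort is stable, so ties keep their original order."""
    return sorted(rows, key=lambda r: sum(v is None for v in r), reverse=True)

ranked = rank_by_missingness([[1.0, None, 3.0], [None, None, 3.0], [1.0, 2.0, None]])
print(ranked[0])  # the row with two missing cells comes first
```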
    dalta = abs(sweep(nor, 2, cp[i,], '-'))
    names(dalta) <- paste("dalta", 1:ncol(Xic), sep=".")
    minmin = min(dalta, na.rm=TRUE)
    maxmax = max(dalta, na.rm=TRUE)
    GRC = (minmin + (B*maxmax)) / (dalta + (B*maxmax))  # grey relational coefficient
    names(GRC) <- paste("GRC", 1:ncol(Xic), sep=".")
    for (j in 1:ncol(Xic)){
      GRG[i,j] = c(sum(GRC[,j], na.rm=TRUE) / (length(na.omit(Xic[,j]))))  # grey relational grade
    }
}
36. For (j in 1:v){ CV[j] = (which.max(GRG[,j])) }
37. Change the incomplete dataset to a binary dataset in which {observed = 1 & missing = 0}
Xic[] <- as.numeric(!is.na(Xic))
Xic = t(Xic)
Xicc = cbind(Xic, CV)  # add the class vector to the incomplete data set
xclas <- as.data.frame(Xicc)
xclas = t(xclas)
38. Calculate the entropy
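The loop above computes, for each cluster centre, the grey relational coefficient GRC = (Δmin + B·Δmax) / (Δ + B·Δmax) over the absolute deviations Δ, then averages the observed coefficients per attribute into the grey relational grade (GRG). A compact Python sketch of the same computation for a single centre, with None marking missing cells (an illustration, not the thesis code):

```python
def grey_relational_grade(instances, center, B=1.0):
    """Grey relational grade of normalized instances against one
    cluster centre: GRC = (dmin + B*dmax) / (delta + B*dmax),
    averaged per attribute over the observed cells only."""
    deltas = [[None if v is None else abs(v - c) for v, c in zip(row, center)]
              for row in instances]
    observed = [d for row in deltas for d in row if d is not None]
    dmin, dmax = min(observed), max(observed)
    grc = [[None if d is None else (dmin + B * dmax) / (d + B * dmax) for d in row]
           for row in deltas]
    grg = []
    for j in range(len(center)):
        col = [row[j] for row in grc if row[j] is not None]
        grg.append(sum(col) / len(col))
    return grg

grg_example = grey_relational_grade([[0.5, None], [1.0, 0.25]], [0.5, 0.5])
print(grg_example)  # a deviation of zero yields a coefficient of 1
```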
r = nrow(xclas) - 1
for (p in 1:r){
  t = table(xclas[p,], CV)
}
m0 <- matrix(0, 2, c)
cn = as.numeric(colnames(t))
cl = length(cn)
for (c in 1:cl){
  m0[1, cn[c]] = t[1, c]
  m0[2, cn[c]] = t[2, c]
}
t = m0
r = nrow(xclas) - 1
for (p in 1:r){
  L = rowSums(t) / sum(t)
  for (ii in 1:2){
    s = 0
    for (jj in 1:c){
      s = s + t[ii, jj]
    }
    for (jj in 1:2){
      Pr[ii, jj] = t[ii, jj] / s
    }
  }
  for (i in 1:2){
    E[i] = 0
    for (j in 1:c){
      H = -(Pr[i, j] * log2(Pr[i, j]))
      H = ifelse(is.na(H), 0, H)
      E[i] = E[i] + H
    }
    ENTR[p] = 0
    ENTR[p] = ENTR[p] + E[i] * L[i]
    INF[p] = 1 - ENTR[p]
  }
}
SINF = 0
for (p in 1:r){
  SINF = SINF + INF[p]
}
for (p in 1:r){
  WF[p] = INF[p] / SINF
}
Win = cbind(Xin, WF)
CM = colMeans(Win, na.rm = T)
Fin = rbind(Win, CM)
#-------------------------------------------------------------------
IFin = Fin
n1 = nrow(Fin) - 1
m1 = ncol(Fin) - 1
for (m in 1:m1){
  for (n in 1:n1){
    EE = 0
    if (is.na(Fin[n, m])){
      for (nn in 1:n1){
        if (nn != n){
          if (is.na(Fin[nn, m])){ EE = EE + (Fin[n1+1, m] * Fin[nn, m1+1]) }
          else { EE = EE + (Fin[nn, m] * Fin[nn, m1+1]) }
          ES[n, m] = EE
          Fin[n, m] = EE
        }
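The weighting step above can be read as: for each attribute, build the contingency table of its missingness indicator against the cluster labels, take the weighted entropy of the table rows, convert it to an information value 1 − entropy, and normalize the information values into the final weights WF. A hedged Python sketch of that reading (the table layout and helper name below are assumptions about the pseudocode's intent, not the thesis code):

```python
from math import log2

def entropy_weights(tables):
    """One 2-row contingency table per attribute (rows: cell observed /
    cell missing; columns: cluster labels). Weight = normalized
    information, where information = 1 - weighted row entropy."""
    infos = []
    for t in tables:
        total = sum(sum(row) for row in t)
        ent = 0.0
        for row in t:
            rs = sum(row)
            if rs == 0:
                continue
            h = -sum((c / rs) * log2(c / rs) for c in row if c > 0)
            ent += (rs / total) * h       # L = rowSums(t)/sum(t) in the listing
        infos.append(1.0 - ent)           # INF = 1 - ENTR
    s = sum(infos)
    return [v / s for v in infos]

weights = entropy_weights([[[2, 2], [2, 2]],    # missingness tells us nothing -> info 0
                           [[4, 0], [0, 4]]])   # missingness determines the cluster -> info 1
print(weights)  # → [0.0, 1.0]
```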
      }
    }
    else { ES[n, m] = Fin[n, m] }
  }
}
for (m in 1:m1){
  for (n in 1:n1){
    if (is.na(Fin[n, m])) IFin[n, m] = ES[n, m]
  }
}
WNA = Win
WIM = Fin[-(n+1), -(m+1)]
CDTU = rbind(WIM, Xc)
CDT = CDTU[order(as.numeric(row.names(CDTU))), ]
39. Calculate the RMSE from the imputed dataset, the missing dataset, and the original dataset
RMSE = function (imp, mis, true, norm = TRUE) {
  imp <- as.matrix(imp)
  mis <- as.matrix(mis)
  true <- as.matrix(true)
  missIndex <- which(is.na(mis))
  errvec <- imp[missIndex] - true[missIndex]
  rmse <- sqrt(mean(errvec^2))
  if (norm) {
    rmse <- rmse / sd(true[missIndex])
  }
  return(rmse)
}
RMSE = RMSE(CDT, x, y, norm = TRUE)
Print the final result
40. print(RMSE)
}
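The R function above computes the RMSE only over the cells that were actually missing, and by default normalizes it by the standard deviation of the true values at those cells, so results are comparable across attributes of different scales. An equivalent Python sketch, with None marking missing cells (illustrative only; note that R's sd() is the sample standard deviation, n − 1 in the denominator):

```python
from math import sqrt

def nrmse(imputed, masked, true, norm=True):
    """RMSE over the cells missing in `masked`, optionally divided by
    the sample standard deviation of the true values at those cells."""
    idx = [(i, j) for i, row in enumerate(masked)
                  for j, v in enumerate(row) if v is None]
    errs = [imputed[i][j] - true[i][j] for i, j in idx]
    rmse = sqrt(sum(e * e for e in errs) / len(errs))
    if norm:
        vals = [true[i][j] for i, j in idx]
        mean = sum(vals) / len(vals)
        sd = sqrt(sum((v - mean) ** 2 for v in vals) / (len(vals) - 1))
        rmse /= sd
    return rmse

true = [[1.0, 2.0], [3.0, 4.0]]
masked = [[1.0, None], [None, 4.0]]
imputed = [[1.0, 2.5], [2.0, 4.0]]
score = nrmse(imputed, masked, true)
print(score)  # sqrt(0.625) / sqrt(0.5) = sqrt(1.25), about 1.118
```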
Appendix E
Programming (simulated dataset)
This program was written in the R programming language (version 3.2.3, 2015).
1. Kill all variables
2. Set Seed = 1200
3. Set n = 1000  # number of generated instances
4. for (i = 1 to the number of generated variables)
5. Set variable = rnorm(n, mean, standard deviation)
6. Y1 is the data frame of the generated variables
7. FOR (z = 1, 2, 3, 4, 5):  # number of iterations (runs)
8. Kill all variables except the z and y variables
9. Remove any heterogeneous attribute from Y1
10. Set Y = Y1
11. Set C = 5  ← initialize the number of clusters (c >= 2)
12. Set V ← the number of variables in the matrix Y  # the data set variables
13. Set iter ← initialize the number of fuzzy iterations (maximum iteration is 100)
14. Set s ← initialize the value of the fuzziness parameter  # 2
15. Set B ← initialize the value of the grey relational analysis parameter  # 1
16. GRG = initialize an empty matrix for the GRG
17. Set CV = NULL  # maximum value of the GRG
18. Pr = initialize an empty matrix for the entropy values
19. Set E = NULL  # set the entropy vector of instances to NULL
20. Set H = NULL  # set the entropy vector to NULL
21. Set L = NULL  # set the average vector of instances to NULL
22. Set ENTR = NULL  # set the final entropy vector to NULL
23. Set INF = 0  # set the initial value of the information vector to zero
24. Set WF = NULL  # set the average weight vector to NULL
25. ES = initialize an empty matrix for the estimated missing values
26. Select case
27. Case MCAR: X = MCAR(x, α)  # x: original data set, α: probability of missing rate
    Case MAR: X = MAR(x, α)  # x: original data set, α: probability of missing rate
    Case NMAR: X = NMAR(x, α)  # x: original data set, α: probability of missing rate
Call the mice library from the R packages
28. Separate the Xc matrix by calling the cc function  # Xc holds the cases (instances) without missing data (i.e., Xc is the complete data set)
Appendixes 29. Separate Xic matrix by calling ic function
112 # Xic are cases (instances)
with missing data (ie. Xic is incomplete data set) 30. 31. 32. 33.
Call library e1071 from R package to calculate fuzzy c-mean FCM=cmeans(Xc,iter,m=s) # apply FCM to complete dataset cp=unname(FCM$centers) # retrieve centroid of clusters from FCM rank incomplete instances by missing amount in descending order
34. Xic=Xic[ order(rowSums(is.na(Xic)), decreasing=TRUE), ] 35. Applay Grey system theory to incomplete dataset (Xic) {GRA} 36. normalize incomplete dataset for GRA by calling norm function 37. nor<- as.data.frame(lapply(Xic, norm)) 38. For (i in 1:C){
dalta=abs(sweep(nor, 2, cp[i,],'-')) names(dalta) <- paste("dalta", 1:ncol(Xic),sep=".") minmin=min(dalta ,na.rm=TRUE) maxmax=max(dalta,na.rm=TRUE) GRC= (minmin+(B*maxmax))/(dalta+(B*maxmax)) # Grey relational coefficient names(GRC) <- paste("GRC", 1:ncol(Xic),sep=".") for (j in 1:ncol(Xic)){ GRG[i,j] =c(sum(GRC[,j],na.rm=TRUE)/(length(na.omit(Xic[,j])))) # grey relational grade } } 39. For (j in 1:v){ CV[j]=(which.max (GRG[,j])) 40. Change incomplete dataset to binary dataset in which {observed=1 & missed=0} Xic[] <- as.numeric(!is.na(Xic)) Xic=t(Xic) Xicc=cbind(Xic,CV) # add class vector to incomplete data set xclas<-as.data.frame(Xicc) xclas=t(xclas) 41. Calculate Entropy
r = nrow(xclas) - 1
for (p in 1:r){
  t = table(xclas[p,], CV)
}
m0 <- matrix(0, 2, C)
cn = as.numeric(colnames(t))
cl = length(cn)
for (c in 1:cl){
  m0[1, cn[c]] = t[1, c]
  m0[2, cn[c]] = t[2, c]
}
t = m0
r = nrow(xclas) - 1
for (p in 1:r){
  L = rowSums(t)/sum(t)
  for (ii in 1:2){
    s = 0
    for (jj in 1:c){ s = s + t[ii,jj] }
    for (jj in 1:2){ Pr[ii,jj] = t[ii,jj]/s }
  }
  for (i in 1:2){
    E[i] = 0
    for (j in 1:c){
      H = -(Pr[i,j]*log2(Pr[i,j]))
      H = ifelse(is.na(H), 0, H)
      E[i] = E[i] + H
    }
    ENTR[p] = 0
    ENTR[p] = ENTR[p] + E[i]*L[i]
    INF[p] = 1 - ENTR[p]
  }
}
SINF = 0
for (p in 1:r){ SINF = SINF + INF[p] }
for (p in 1:r){ WF[p] = INF[p]/SINF }
Win = cbind(Xin, WF)
CM = colMeans(Win, na.rm = T)
Fin = rbind(Win, CM)
#-------------------------------------------------------------------
IFin = Fin
n1 = nrow(Fin) - 1
m1 = ncol(Fin) - 1
for (m in 1:m1){
  for (n in 1:n1){
    EE = 0
    if (is.na(Fin[n,m])){
      for (nn in 1:n1){
        if (nn != n){
          if (is.na(Fin[nn,m])){ EE = EE + (Fin[n1+1,m]*Fin[nn,m1+1]) }
          else { EE = EE + (Fin[nn,m]*Fin[nn,m1+1]) }
          ES[n,m] = EE
          Fin[n,m] = EE
        }
      }
    }
    else { ES[n,m] = Fin[n,m] }
  }
}
for (m in 1:m1){
  for (n in 1:n1){
    if (is.na(Fin[n,m])) IFin[n,m] = ES[n,m]
  }
}
WNA = Win
WIM = Fin[-(n+1), -(m+1)]
CDTU = rbind(WIM, Xc)
CDT = CDTU[order(as.numeric(row.names(CDTU))), ]
42. Calculating RMSE based on the imputed dataset, the missing dataset and the original dataset
RMSE = function (imp, mis, true, norm = TRUE) {
  imp <- as.matrix(imp)
  mis <- as.matrix(mis)
  true <- as.matrix(true)
  missIndex <- which(is.na(mis))
  errvec <- imp[missIndex] - true[missIndex]
  rmse <- sqrt(mean(errvec^2))
  if (norm) {
    rmse <- rmse/sd(true[missIndex])
  }
  return(rmse)
}
RMSE = RMSE(CDT, x, y, norm = TRUE)
43. Print the final result
print(RMSE)
}
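The grey relational coefficient and grade computed in step 38 above can be traced by hand on a single instance. The following is a minimal sketch with assumed values (one centroid, one normalized incomplete instance, and distinguishing coefficient B = 1 as in the listing):

```r
# Grey relational coefficient/grade for one instance against one
# cluster centroid (toy numbers, assumed for illustration; B = 1).
B <- 1
x0 <- c(0.2, 0.5, 0.9)            # cluster centroid (reference sequence)
xi <- c(0.3, 0.4, 0.7)            # normalized incomplete instance
dalta <- abs(xi - x0)             # absolute differences: 0.1 0.1 0.2
minmin <- min(dalta)
maxmax <- max(dalta)
GRC <- (minmin + B * maxmax) / (dalta + B * maxmax)  # 1.00 1.00 0.75
GRG <- mean(GRC)                  # grey relational grade, about 0.917
```

The closer the instance lies to the centroid in every attribute, the closer each coefficient (and hence the grade) is to 1; step 39 then assigns each attribute the cluster with the largest grade.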
Appendix F
Functions
MCAR
# Missing Completely At Random (MCAR)
mcar = function (x, alpha) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample(n * p, floor(n * p * alpha))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}
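A usage sketch of the mcar() function above (the data frame and the rate are assumed for illustration; the function is repeated so the sketch runs on its own). Because exactly floor(n·p·α) cells are sampled without replacement, the realized missing proportion matches α here:

```r
# Demonstration of mcar(): remove 20% of all cells completely at random.
mcar = function (x, alpha) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample(n * p, floor(n * p * alpha))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}
set.seed(1)
d <- data.frame(a = rnorm(100), b = rnorm(100))   # assumed toy data
dm <- mcar(d, 0.2)
mean(is.na(dm))   # 0.2: exactly 40 of the 200 cells are NA
```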
MAR
# Missing At Random (MAR)
mar = function(x, alpha){
  alpha = alpha * 2
  c = cor(x)
  diag(c) <- 0
  m = NULL
  for (i in 1:ncol(x)){
    m[i] = max(abs(c[,i]))
  }
  v <- row.names(c)[apply(c, 2, which.max)]
  L = mget(v, inherits = TRUE)
  df <- data.frame(matrix(unlist(L), nrow = nrow(x), ncol = ncol(x)))
  for (j in 1:ncol(x)){
    x[(df[,j] <= median(df[,j])) & (runif(nrow(x)) < alpha), j] <- NA
  }
  x
}
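The mar() function above drives each column's missingness from its most correlated companion column. The same MAR idea can be shown in a simpler self-contained sketch, with assumed toy data, where missingness in y depends only on the observed x and never on y itself:

```r
# MAR sketch: y is removed with probability 0.5 whenever the observed
# companion variable x is at or below its median.
set.seed(7)
n <- 200
x <- rnorm(n)                     # fully observed companion variable
y <- rnorm(n)                     # variable that will receive NAs
y[(x <= median(x)) & (runif(n) < 0.5)] <- NA
all(x[is.na(y)] <= median(x))     # TRUE: missingness tracks x only
```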
NMAR
# Not Missing At Random (NMAR)
nmar = function(x, alpha){
  alpha = alpha * 2
  for (j in 1:ncol(x)){
    x[(x[,j] <= median(x[,j])) & (runif(nrow(x)) < alpha), j] <- NA
  }
  x
}
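In contrast to MCAR and MAR, nmar() above lets a value's own magnitude decide whether it goes missing. A usage sketch on assumed toy data (the function is repeated so the sketch runs on its own):

```r
# Demonstration of nmar(): only values at or below the column median
# are eligible to become missing.
nmar = function(x, alpha){
  alpha = alpha * 2
  for (j in 1:ncol(x)){
    x[(x[,j] <= median(x[,j])) & (runif(nrow(x)) < alpha), j] <- NA
  }
  x
}
set.seed(42)
d <- data.frame(v = 1:100)        # assumed toy data; median is 50.5
dm <- nmar(d, 0.25)
all(which(is.na(dm$v)) <= 50)     # TRUE: only values 1..50 can be removed
```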
CC
# Extracting the complete cases from a data set, also known as
# 'listwise deletion' or 'complete case analysis'
cci = function(x) return(complete.cases(x)) # complete case indicator
cc = function(x) return(x[cci(x), , drop = FALSE])
IC
# Extracts the incomplete cases from a data set
ici = function(x) return(!complete.cases(x)) # incomplete case indicator
ic = function(x) return(x[ici(x), , drop = FALSE])
NORM
# Min-Max normalization function
norm <- function(x) {
  (x - min(x, na.rm = TRUE))/(max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
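A quick check of the min-max norm() function above on an assumed vector with a missing entry; na.rm = TRUE lets the NA pass through while the observed values are scaled to [0, 1]:

```r
# Min-max normalization with a missing entry (toy vector, assumed).
norm <- function(x) {
  (x - min(x, na.rm = TRUE))/(max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
v <- norm(c(2, 4, NA, 6))
v   # 0.0 0.5 NA 1.0
```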
FCM
# Fuzzy C-means clustering (cmeans from the e1071 package)
cmeans = function (x, centers, iter.max = 100, verbose = FALSE,
                   dist = "euclidean", method = "cmeans", m = 2,
                   rate.par = NULL, weights = 1, control = list())
{
  x <- as.matrix(x)
  xrows <- nrow(x)
  xcols <- ncol(x)
  if (missing(centers))
    stop("Argument 'centers' must be a number or a matrix.")
  dist <- pmatch(dist, c("euclidean", "manhattan"))
  if (is.na(dist))
    stop("invalid distance")
  if (dist == -1)
    stop("ambiguous distance")
  method <- pmatch(method, c("cmeans", "ufcl"))
  if (is.na(method))
    stop("invalid clustering method")
  if (method == -1)
    stop("ambiguous clustering method")
  if (length(centers) == 1) {
    ncenters <- centers
    centers <- x[sample(1:xrows, ncenters), , drop = FALSE]
    if (any(duplicated(centers))) {
      cn <- unique(x)
      mm <- nrow(cn)
      if (mm < ncenters)
        stop("More cluster centers than distinct data points.")
      centers <- cn[sample(1:mm, ncenters), , drop = FALSE]
    }
  }
  else {
    centers <- as.matrix(centers)
    if (any(duplicated(centers)))
      stop("Initial centers are not distinct.")
    cn <- NULL
    ncenters <- nrow(centers)
    if (xrows < ncenters)
      stop("More cluster centers than data points.")
  }
  if (xcols != ncol(centers))
    stop("Must have same number of columns in 'x' and 'centers'.")
  if (iter.max < 1)
    stop("Argument 'iter.max' must be positive.")
  if (method == 2) {
    if (missing(rate.par)) {
      rate.par <- 0.3
    }
  }
  reltol <- control$reltol
  if (is.null(reltol))
    reltol <- sqrt(.Machine$double.eps)
  if (reltol <= 0)
    stop("Control parameter 'reltol' must be positive.")
  if (any(weights < 0))
    stop("Argument 'weights' has negative elements.")
  if (!any(weights > 0))
    stop("Argument 'weights' has no positive elements.")
  weights <- rep(weights, length = xrows)
  weights <- weights/sum(weights)
  perm <- sample(xrows)
  x <- x[perm, ]
  weights <- weights[perm]
  initcenters <- centers
  pos <- as.factor(1:ncenters)
  rownames(centers) <- pos
  if (method == 1) {
    retval <- .C("cmeans", as.double(x), as.integer(xrows),
                 as.integer(xcols), centers = as.double(centers),
                 as.integer(ncenters), as.double(weights), as.double(m),
                 as.integer(dist - 1), as.integer(iter.max),
                 as.double(reltol), as.integer(verbose),
                 u = double(xrows * ncenters), ermin = double(1),
                 iter = integer(1), PACKAGE = "e1071")
  }
  else if (method == 2) {
    retval <- .C("ufcl", x = as.double(x), as.integer(xrows),
                 as.integer(xcols), centers = as.double(centers),
                 as.integer(ncenters), as.double(weights), as.double(m),
                 as.integer(dist - 1), as.integer(iter.max),
                 as.double(reltol), as.integer(verbose),
                 as.double(rate.par), u = double(xrows * ncenters),
                 ermin = double(1), iter = integer(1), PACKAGE = "e1071")
  }
  centers <- matrix(retval$centers, ncol = xcols,
                    dimnames = list(1:ncenters, colnames(initcenters)))
  u <- matrix(retval$u, ncol = ncenters,
              dimnames = list(rownames(x), 1:ncenters))
  u <- u[order(perm), ]
  iter <- retval$iter - 1
  withinerror <- retval$ermin
  cluster <- apply(u, 1, which.max)
  clustersize <- as.integer(table(cluster))
  retval <- list(centers = centers, size = clustersize, cluster = cluster,
                 membership = u, iter = iter, withinerror = withinerror,
                 call = match.call())
  class(retval) <- c("fclust")
  return(retval)
}
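The membership update at the heart of cmeans() runs in compiled C (the .C("cmeans", ...) call above), but the formula itself, u_ij = 1 / Σ_k (d_ij/d_ik)^(2/(m−1)), can be sketched directly in R for fixed centers. The toy 1-D points and centers below are assumed; this illustrates the formula only, not e1071's routine:

```r
# Fuzzy memberships for fixed centers with fuzzifier m = 2
# (assumed toy data; a sketch of the membership formula only).
m <- 2
x <- c(0, 1, 9, 10)               # four 1-D points
centers <- c(0.5, 9.5)            # two fixed centers
d <- abs(outer(x, centers, "-"))  # point-to-center distances (4 x 2)
w <- d^(2 / (m - 1))              # d^2 when m = 2
u <- 1 / (w * rowSums(1 / w))     # u[i, j] = 1 / sum_k (d_ij/d_ik)^2
rowSums(u)                        # each row sums to 1
```

Points near a center receive a membership close to 1 for that cluster; points midway between centers split their membership, which is exactly the softness that distinguishes fuzzy from hard c-means.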
RMSE
# Root mean square error (normalized root mean square error)
RMSE = function (imp, mis, true, norm = TRUE) {
  imp <- as.matrix(imp)
  mis <- as.matrix(mis)
  true <- as.matrix(true)
  missIndex <- which(is.na(mis))
  errvec <- imp[missIndex] - true[missIndex]
  rmse <- sqrt(mean(errvec^2))
  if (norm) {
    rmse <- rmse/sd(true[missIndex])
  }
  return(rmse)
}
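A worked example of the (normalized) RMSE function above, on an assumed four-value series with two missing entries; note that the error is computed only at the positions that were missing:

```r
# RMSE evaluated only at the originally missing positions
# (toy vectors, assumed for illustration).
RMSE = function (imp, mis, true, norm = TRUE) {
  imp <- as.matrix(imp)
  mis <- as.matrix(mis)
  true <- as.matrix(true)
  missIndex <- which(is.na(mis))
  errvec <- imp[missIndex] - true[missIndex]
  rmse <- sqrt(mean(errvec^2))
  if (norm) rmse <- rmse/sd(true[missIndex])
  return(rmse)
}
true <- c(1, 2, 3, 4)
mis  <- c(1, NA, 3, NA)       # positions 2 and 4 were removed
imp  <- c(1, 2.5, 3, 3.5)     # imputed values at those positions
RMSE(imp, mis, true, norm = FALSE)  # 0.5
RMSE(imp, mis, true, norm = TRUE)   # 0.5 / sd(c(2, 4)), about 0.354
```

Normalizing by the standard deviation of the true missing values makes the score comparable across datasets with different scales, which is why the thesis reports the normalized variant.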
Abstract (Arabic)
Most raw data in the real world contain many errors. Large data warehouses hold various kinds of outliers that affect the results of data analysis, and since good models generally require good data, the input data must be suitable, in quantity, structure and form, for every step of data mining. Unfortunately, real-world databases are strongly affected by negative factors such as noise, missing values, incomplete and irrelevant data, and large sizes in both dimensions, instances and attributes. These problems weaken the results of data analysis, and low-quality data thus lead to low data mining performance. In large data warehouses, data preprocessing is very important for dealing with the problems mentioned above; it comprises several tasks, including data cleaning, data integration, data transformation, data reduction and data discretization.
Missing data is a common defect in many real-world datasets. Missing data refers to unobserved values in a dataset; these can be of different types and may be lost for different reasons, among them unit non-response, item non-response, attrition, human error, equipment failure and hidden layers. The presence of such defects usually requires a preprocessing stage in which the data are prepared and cleaned so that they are sufficiently useful and clear for the knowledge-extraction process.
Statisticians distinguish three classes of missing data: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR).
There are many strategies for dealing with missing values. The simplest solution is to reduce the dataset and discard all samples that contain missing values. Another solution is the tolerance method. Finally, the missing-value problem can be handled by imputation methods, which estimate the various missing values; imputation has thereby become one of the most common solutions for dealing with missing data.
In this thesis, an algorithm based on improving the MIGEC algorithm is proposed, using imputation to deal with missing values. GRA is applied to the attribute values instead of the instance values, and the missing data are first imputed by the mean and then estimated by the proposed algorithm, which uses each imputed value as a complete value for imputing the next missing value.
The proposed algorithm was compared with several other algorithms (MMS, HDI, KNNMI, FCMOCS, CRI, CMI, NIIA and MIGEC) under different missing-data mechanisms. The experimental results show that the proposed algorithm has lower root mean square error (RMSE) values than the other algorithms under all missing-data mechanisms.
Abstract (Kurdish)
With the continuous progress of information technology, the data recorded in the databases of companies and government institutions keep growing. Mining correct information from data warehouses has become an important field of data mining research. The presence of noisy data or missing values in these data warehouses leads to inaccurate data mining results, and for this reason preprocessing methods have increasingly attracted the attention of researchers in this field.
Missing values are one of the main problems of data warehouses and appear frequently; data may be missing because the equipment failed to record them (machine failure), because of human error during recording, or because sensitive information was withheld. The presence of such defects usually requires data preprocessing methods, so that the prepared data become useful for the data mining process.
Statisticians classify missing data into three types: missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR).
Various methods and strategies are used to estimate the values of missing data. The simplest is deleting every instance that contains missing data, by row-wise (case/instance) deletion of the records that contain missing information. Besides this, other methods are also used, such as the tolerance method, but these approaches reduce the amount of data in the data warehouse on the one hand and lose the information attached to the records with missing values on the other. Estimating, or re-deriving, the values of missing data (data imputation) is one of the ways of handling missing values.
This thesis builds on one of the missing-data imputation algorithms, proposed by (Tian, Yu, Ma) and known as MIGEC. The GRA part of the algorithm is applied to the attribute values instead of the instance values, and the missing data are first imputed by mean imputation; the proposed algorithm then estimates the missing values in such a way that each imputed value is used as a complete value for imputing the next missing value.
After comparing the results of the proposed algorithm with several other algorithms (MMS, HDI, KNNMI, FCMOCS, CRI, CMI, NIIA and MIGEC) under different missing-data mechanisms, the experimental results show that the proposed algorithm has the lowest RMSE values compared with the other algorithms under all missing-data mechanisms.
Title Page (Arabic)
The Role of Missing Data Imputation Methods on the Accuracy of Data Mining Results
A Thesis Submitted to the Council of the College of Administration and Economics, University of Sulaimani, as Partial Fulfillment of the Requirements for the Degree of Master of Science in Statistics
By
Zhyan Mohammed Omer
Supervised by Assistant Professor
Dr. Nzar Abdulqader Ali
1437 H
2716 Kurdish
2016 AD
Title Page (Kurdish)
The Role of Missing Data Imputation Methods on the Accuracy of Data Mining Results
A Thesis Submitted to the Council of the College of Administration and Economics, University of Sulaimani, as Partial Fulfillment of the Requirements for Obtaining the Degree of Master of Science in Statistics
By
Zhyan Mohammed Omer
Supervised by Assistant Professor
Dr. Nzar Abdulqader Ali
1437 H
2716 Kurdish
2016 AD