Multiple Imputation Method for Missing Data in (R.C.B.D) Factorial Experiments
A thesis submitted to the Council of the College of Administration and Economics, University of Sulaimani, in partial fulfillment of the requirements for the degree of Master of Science in Statistics
By:
Hawkar Qasim Pirdawd
Supervised by
Dr. Shawnm Abdul Qadir
Dr. Wasfi Tahir Salih
2707 Kurdish / 2007 AD
(Arabic title page: the thesis title, submission statement, author, supervisors, and date, June 2007 AD / 2707 Kurdish, in Arabic.)
(Surah Maryam, verses 93-95)
In the name of Allah, Most Gracious, Most Merciful. "Not one of the beings in the heavens and the earth but must come to (Allah) Most Gracious as a servant. He does take an account of them (all), and hath numbered them (all) exactly. And every one of them will come to Him singly on the Day of Judgment."
المستخلص (Abstract in Arabic)

(The Arabic-language abstract of the thesis; its content corresponds to the English abstract.)
پوختە (Abstract in Kurdish)

(The Kurdish-language abstract of the thesis; its content corresponds to the English abstract.)
Dedication

I dedicate this thesis to:
The prophet and teacher of all mankind, Muhammad, peace be upon him. The two teachers of character and morals who have always been my supporters in life: my parents, about whom Allah said: "My Lord, bestow Your mercy upon them, for they brought me up when I was young." My mate and partner in life, who never showed a hint of weariness and who bore everything related to the household during my study, my dear wife Triska. My liver and my hope for the future, my daughter "Ro'ya". My dear supervisors (Dr. Shawnem A. Muhayaddin and Dr. Wasfi T. Salih), and all those who taught me even a single letter.
Hawkar
Acknowledgements
Allah says: "O my Lord! So order me that I may be grateful for Thy favours, which Thou hast bestowed on me." The prophet Muhammad, peace be upon him, said: "He who does not thank people does not thank Allah." To show my gratitude and loyalty, I would like to express my deep gratitude to my supervisors (Dr. Shawnim Abdul-Qadir) and (Dr. Wasfy Tahir), who gave me their time during the writing of this thesis. I would also like to thank the teachers who did their best to help and support me. Furthermore, my thanks go to the members of the discussion committee. I would like to thank (Parwin Muhammad Hamakhan), the head of the Department of Statistics, and (Narmin Mahruf Gafur), the head of the Department of Higher Education in the College of Administration and Economics, University of Sulaimani. I would like to show my gratitude to my dear brother (Dler Hussein Qadir) and my dear sister (Ruqia Abdulqadir), who helped and encouraged me. I would also like to thank all my teachers, friends, and colleagues (Mr. Slam Nawxosh, Dlawar Othman Majeed, Hanaw Ahmed, Dler Mustafa, Awat Abdulla Saeed, and Shanaz Ali Rahim). I would like to express my deep gratitude to Mr. (Aumed Sabir) for providing me with the data related to my study. I also thank Mrs. (Munira Muhammad), who translated most of my study and whom I will never forget, and Mr. (Salah Othman) for his advice and encouragement. Finally, I thank the officials of the College of Sharia and Islamic Studies, especially Miss (Bafrin Muheadin). Last but not least, I would like to thank all those who helped me to complete my study.
Hawkar
Abstract Missing or incomplete data is a very serious problem in many fields of research, such as in active media technology, opinion polls, market research surveys, mail enquiries, medical studies, and other scientific experiments. Missing data frequently complicates scientific investigations. The development of statistical methods to address missing data has been an active area of research in recent decades. Determining the appropriate analytic approach in the presence of incomplete observations is a major question for data analysts. Multiple imputation (MI) appears to be one of the most attractive methods for general purpose handling of missing data in univariate and multivariate analysis.
The aim of the thesis is to estimate missing data by MI in (R.C.B.D) factorial experiments (a dataset with values deleted artificially at random from a complete dataset), and then to compare the results of the original complete data with those of the data that were missing and estimated with MI. In this thesis the MI application was carried out on the missing data with the special software MICE (Multivariate Imputation by Chained Equations). The results achieved by MI were compared with the original data results; the difference between the two cases for the parameter estimate ( ˆ ) was (0.02), for its estimated variance var( ˆ ) it was (0.002), and the standard errors for the two cases were equal, namely (0.33). According to these conclusions, we recommend further work on MI, followed by the use of information-theory tools to judge the selection of the best model.
Contents

Title                                                                   Page
List of Figures                                                         V
List of Tables                                                          VI
Abbreviations                                                           VII

Chapter One: Introduction and Literature review
1-1      Introduction                                                   2
1-2      Literature review                                              4

Chapter Two: Missing Data and Multiple Imputation
2-1      Introduction                                                   9
2-2      Missing data                                                   9
2-3      Treatment of missing data                                      12
2-4      Types of missing data                                          12
2-4-1    Missing Completely at Random (MCAR)                            12
2-4-2    Missing at Random (MAR)                                        13
2-4-3    Missing Not at Random (MNAR or nonignorable)                   14
2-5      Imputation                                                     20
2-5-1    Types of imputation                                            20
2-5-1-1  Imputation using outside information                           20
2-5-1-2  Mean imputation                                                21
2-5-1-3  Hot deck imputation                                            21
2-5-1-4  Model based imputation                                         22
2-5-1-5  Multiple imputation                                            23
2-5-2    Desirable features for multiple imputation                     27
2-6      Bayesian Inference                                             28
2-7      Multiple imputation paradigm                                   31

Chapter Three: Practical aspect
3-1      Introduction                                                   39
3-2      Description of the data                                        39
3-3      Missing and classifying data                                   41
3-4      Data analysis                                                  42
3-5      Imputation Specifications                                      43

Chapter Four: Conclusions and Recommendations
4-1      Conclusions                                                    49
4-2      Recommendations                                                50
References                                                              51-55

Appendixes
Appendix (1) Software for imputation                                    57-62
Appendix (2) The (m) results for Multiple Imputation by WINMICE         63-65
Arabic abstract                                                         -----
Kurdish abstract                                                        -----
List of Figures

Figure   Title                                                                  Page
2-1      Loss of information due to case deletion                              11
2-2      Graphical representations of (a) missing completely at random (MCAR),
         (b) missing at random (MAR), and (c) missing not at random (MNAR) in
         a univariate missing-data pattern. X represents variables that are
         completely observed, Y represents a variable that is partly missing,
         Z represents the component of the causes of missingness unrelated to
         X and Y, and R represents the missingness.                            11
2-3      Missing data example                                                  11
2-4      Effect of mean imputation on the shape of a distribution              11
2-5      Schematic representation of the steps in multiple imputation          12
2-6      Multiple imputation                                                   12
2-7      Example of a conditional predictive distribution                      22
2-8      The multiple imputation principle                                     21
List of Tables

Table No.   Title                                                 Page
3-1         The initial complete data                             40
3-2         Missing values in the initial complete data           41
3-3         Data descriptives for incomplete data                 21
3-4         Missing data pattern                                  42
3-5         Pooling multiple imputation analyses                  45
3-6         Find the missing value for (m) results                46
3-7         The complete dataset estimate                         47
Abbreviations

Abbreviation   Details
MI             Multiple imputation
MICE           Multivariate imputation by chained equations
R.C.B.D        Randomized complete block design
MCAR           Missing completely at random
MAR            Missing at random
MNAR           Missing not at random
CART           Classification and regression tree
RF             Random forests
mis            Missing data
obs            Observed data
com            Complete data
SAS            Statistical analysis system
MCMC           Markov chain Monte Carlo
inc            Income information
exp            Expenditure information
W              Within-imputation variance
B              Between-imputation variance
T              Total variance
CAT            Categorical data
PAN            Panel data
EM             Expectation maximization
SOLAS          Statistical Solutions
rtf            Rich text format
PROC           Procedure
DA             Data augmentation
pmm            Predictive mean matching
MVA            Missing value analysis
Appendices
Appendix (1) Software for imputation [10][16][21][29]

1- Imputation methods
Simple imputation methods (one variable at a time) can readily be programmed using commands such as regression analyses in standard packages. These are most commonly available as part of contributed packages or add-ons.
2- Post imputation procedures
After a multiple imputation procedure has been carried out, the user has a new data file or files that give several copies of the complete data. Often this will be a file in which the imputed data sets are stacked one above the other, indexed by the number of the imputation. Post-imputation procedures involve analysing each of the imputed data sets separately and averaging the results. The differences between results obtained on the different data sets can be used to adjust the standard errors from statistical procedures. This process has been automated in some packages, so that one command will produce the averaged analyses and the results with adjusted standard errors. This process works well for procedures such as regression analyses. For simple exploratory analyses it may be sufficient to work with a single data set, unless there is a substantial proportion of missing data for one or more of the variables being analysed. One way to check this is to run the exploratory procedures on different imputations to get an informal estimate of the variation between the imputations.
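To make the averaging step concrete, the following is a minimal sketch in R. The stacked data frame built here is simulated purely for illustration (in practice it would come from the imputation software), and the column names (.imp, x, y) are hypothetical.

# Build a toy stacked file: m completed copies of the data, indexed by ".imp"
set.seed(7)
m <- 3
one_copy <- function(i) data.frame(.imp = i, x = rnorm(50), y = rnorm(50))
stacked <- do.call(rbind, lapply(1:m, one_copy))

# Run the same analysis on each imputed copy
fits <- lapply(split(stacked, stacked$.imp),
               function(d) coef(lm(y ~ x, data = d)))

# Average the coefficient estimates over the m analyses; a full analysis
# would also combine the standard errors as described above
pooled_coef <- Reduce(`+`, fits) / m
pooled_coef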
3- What different packages can do
Summaries of some of the procedures available for handling missing data in the packages featured on this site:

SAS: missing value patterns (MI); repeated measures analysis (MIXED, not GLM (*)); multiple imputation (MI, IMPUTE (IVEWARE)); post-imputation (MIANALYZE).
SPSS: missing value patterns (MVA); single imputation (MVA with EM algorithm).
Stata: missing value patterns (nmissing (dm67), mvmissing (dm91)); single imputation (impute, uvis (ice)); multiple imputation (ice (ice)); post-imputation (micombine, mifit and others (st0042)).
S-Plus / R: missing value patterns (md.pattern (mice), prelim.norm (norm)); repeated measures analysis (pan); single imputation (em.norm); multiple imputation (da.norm, norm (norm), mice (mice)); post-imputation (glm.mids, pool (mice), mi.inference (norm)).

(*) PROC GLM in SAS does listwise deletion and so does not allow for missing values. This is also true of SPSS repeated measures analyses. Items in (brackets) indicate that the item is a set of contributed procedures. In particular, the following research groups have provided routines and their web sites are helpful.
3-1 SOLAS version 3.0 Statistical Solutions (North American office) http://www.statsol.ie/solas/solas.htm,
[email protected]
SOLAS is designed specifically for the analysis of datasets with missing observations. It offers a variety of multiple imputation techniques in a single, easy-to-use package with a well-designed graphical user interface. SOLAS supports predictive mean models (using the closest observed value to the predicted value) and propensity score models for missing continuous variables, and discriminant models for missing binary and categorical variables. Once the multiple datasets are created, the package
allows the calculation of descriptive statistics, t-tests and ANOVA, frequency tables, and linear regression. The system automatically provides summary measures by combining the results from the multiple analyses. These reports can be saved as rich text format (rtf). It has extensive capabilities to read and write database formats (1-2-3, dBase, FoxPro, Paradox), spreadsheets (Excel), and statistical packages (Gauss, Minitab, SAS, S-Plus, SPSS, Stata, Statistica, and Systat). Imputed datasets can be exported to other statistical packages, though this is a somewhat cumbersome process, since the combination of results from the multiple analyses then needs to be done manually. The new script language facility is particularly useful in documenting the steps of a multiple imputation run, and for conducting simulations. It can be set up to automatically record the settings from a menu-based multiple imputation session, and store this configuration in a file for later revision and reuse. SOLAS includes the ability to view missing data patterns, review the quantity and positioning of missing values, and classify them into categories of monotone or non monotone missingness. Because it does not consolidate observations with the same pattern of missing data, however, this feature is of limited utility in large datasets. A nice feature of SOLAS is the fine-grained and intuitive control of the details of the imputation model. As an example, incorporating auxiliary information (variables in the imputation model but not in the regression model of interest) is straightforward.
3-2 SAS 8.2 (beta) SAS Institute Inc. SAS Campus Drive http://www.sas.com,
[email protected]
The SAS System is described as an integrated suite of software for enterprise-wide information delivery, which includes a major module for statistical analysis, as implemented in SAS/STAT. In release 8.1 two new experimental procedures (PROC MI and PROC MIANALYZE) were made available, which implemented multiple
imputation. The interface for PROC MI changed substantially in release 8.2. SAS anticipates putting PROC MI and PROC MIANALYZE into production for release 9. The imputation step is carried out by PROC MI, which allows use of either monotone (predictive mean matching, denoted by REGRESSION, or propensity, denoted by PROPENSITY) or nonmonotone (using MCMC) missingness methods. The MCMC methods can also be used in a hybrid model where the dataset is divided into monotone and nonmonotone parts, and a regression method is used for the monotone component. Extensive control and graphical diagnostics of the MCMC methods are provided. SAS supports transformation and back-transformation of variables; this may make the assumption of multivariate normality, which is needed for the REGRESSION and MCMC methods, more plausible. A disadvantage of the imputation methods provided by PROC MI is that the analyst has little control over the imputation model itself. In addition, for the regression and MCMC methods, SAS does not impute an observed value that is closest to the predicted value (i.e., there is no support for predictive mean matching using observed values). Instead, it uses an assumption of multivariate normality to generate a plausible value for the imputation. SAS allows the analyst to specify a minimum and maximum value for imputed values on a variable-by-variable basis, as well as the ability to round imputed values. In addition, a SAS data step could be used to generate observed values. In practice, however, both these approaches are somewhat cumbersome.
3-3 Missing Data Library for S-Plus Insightful (formerly Math Soft) http://www.insightful.com,
[email protected]
S-Plus 6.0 features a new missing data library, which extends S-Plus to support model-based missing data models, by use of the EM algorithm (Dempster, Laird, and Rubin 1977) and data augmentation (DA) algorithms (Tanner and Wong 1987). DA algorithms can be used to generate multiple imputations. The missing data library provides support for multivariate normal data (impGauss), categorical data (impLoglin), and conditional Gaussian models (impCgm) for imputations involving both discrete and continuous variables. The package provides a concise summary of missing data distributions and patterns (further described in the discussion of the examples), including both text-based and graphical displays.
3-4 MICE
http://www.multiple-imputation.com
Multiple Imputation by Chained Equations (MICE) is a library distributed for S-Plus (described above) and R, a system for statistical computation and graphics whose language and interface are very similar to S-Plus. MICE provides a variety of imputation models, including forms of predictive mean matching and regression methods, logistic and polytomous regression, and discriminant analysis. Nonmonotone missingness is handled by using chained equations (MCMC) to loop through all missing values. Extensive graphical summaries of the MCMC process are provided. In addition, MICE allows users to program their own imputation functions, which is useful for undertaking sensitivity analyses of different (possibly nonignorable) missingness models. The system allows transformation of variables, and fine-grained control over the choice of predictors in the imputation model. The imputation step is carried out using the mice() function. For continuous missing variables, MICE supports imputation using the norm function (similar to SAS' REGRESSION option) and the pmm function (similar to SOLAS' predictive mean matching). Completed datasets can be extracted using the complete() function, or analyses can be run for each imputation using the lm.mids() or glm.mids() function. Finally, results can be combined using the pool() function.
Although computationally attractive, the chained equation approach implemented in MICE requires assumptions about the existence of the multivariate posterior distribution used for sampling; however, it is not always certain that such a distribution exists (van Buuren et al. 1999). Like SOLAS, MICE uses a fixed seed for random number generation, which must be overridden during the imputation phase to avoid always generating the same imputed values. It would be preferable to have this seed vary by default, but allow the option to fix the seed to allow replication of results. S-Plus was described previously. R is free software for Unix, Windows, and Macintosh that is distributed under a GNU-style copyleft. More information can be found at the R project web site: www.r-project.org. The MICE library is freely available and may be downloaded from the www.multiple-imputation.com web site.
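The following is a minimal sketch of this workflow in the current R version of the mice package, using the nhanes example data set that ships with the package; note that recent versions of mice fit the analysis model through with() rather than the lm.mids() call mentioned above.

library(mice)

data(nhanes)                              # small example data set with missing values
md.pattern(nhanes)                        # inspect the missing-data pattern

imp <- mice(nhanes, m = 5,                # m = 5 imputed data sets
            method = "pmm",               # predictive mean matching
            seed = 123)                   # override the fixed seed

completed1 <- complete(imp, 1)            # extract the first completed data set

fit <- with(imp, lm(bmi ~ age + hyp))     # analysis model fitted to each imputed set
pooled <- pool(fit)                       # combine the m results
summary(pooled)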
4- Practical issues with real data
The programs that implement these methods are fairly new and several of them are still under development. They have all been written in response to users' needs and have many helpful practical features. In the next section we will mention some of the good features that exist in some implementations, and also some caveats. Things can easily go seriously wrong and you should always check that the data are reasonable. Unbounded imputation from a normal distribution can sometimes give extremely high values that affect the mean. A minimum is to look at some histograms of the observed and imputed values. Details that ensure that the imputed values make sense need to be considered when an imputation scheme is designed. For example, if one imputes missing values for the two questions "Do you smoke cigarettes?" and "How many cigarettes do you smoke?", we need to be careful that numbers of cigarettes are not imputed for nonsmokers.
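As a small illustration of the check suggested above, the imputed values can be compared with the observed ones; the sketch below assumes the mice package and its nhanes example data, and the variable name bmi is taken from that data set.

library(mice)
library(lattice)

imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

# Observed versus imputed values of bmi, by imputation number
stripplot(imp, bmi ~ .imp, pch = 20)

# Histograms of observed and imputed bmi values side by side
obs_bmi     <- imp$data$bmi[!is.na(imp$data$bmi)]
imputed_bmi <- unlist(imp$imp$bmi)
par(mfrow = c(1, 2))
hist(obs_bmi, main = "Observed bmi", xlab = "bmi")
hist(imputed_bmi, main = "Imputed bmi", xlab = "bmi")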
Chapter One Introduction and Literature review
1-1 Introduction
[2][3][7][8][21][26]
Missing data is a recurring problem in large national public-use datasets. In recent years multiple imputation (MI) has been introduced as a useful, consistent, and straightforward solution. With MI, the responsibility of correctly dealing with the missing items, in order to avoid biased estimates and overestimation of the precision, is laid in the hands of the data collectors. One of the main attractions of MI is the fact that it results in completed data, and therefore it allows standard statistical complete-data techniques to be applied afterwards. The idea of MI is to draw several times from the predictive distribution of the missing values. From these draws several completed datasets are made by replacing the missing values with the imputed values. By applying standard statistical techniques to the completed datasets and combining the results afterwards by pooling, both the uncertainty due to missing data and the variability in the complete data itself are taken into account. These types of variability are called within-imputation variance and between-imputation variance. Imputation methods are based on the assumption that the data are missing at random (MAR); that is, the response mechanism does not depend on unobserved information or on the variable(s) with missing values itself. Our imputation model consists of a set of predictor variables, the so-called donors, and a statistical model which characterizes the relationship between the imputed variable and its donors. One can think of a multiple regression model (based on a multivariate normal distribution), a logistic regression model, or a multinomial logit regression model. The algorithm, which is based on Gibbs sampling, imputes each incomplete column of the dataset in an iterative fashion, variable by variable. Donors may have missing values themselves, which are imputed based on a particular imputation model for each donor. The algorithm has been investigated by Brand (1999). Van Buuren & Oudshoorn (1999) describe MICE, an implementation of this algorithm in the statistical package S-PLUS. MICE stands for Multivariate Imputation by Chained Equations. This implementation is programmed in such a way that it is very easy to add imputation models not yet included.
The aim of this thesis is to estimate missing data by MI in (R.C.B.D) factorial experiments (a dataset with values artificially deleted at random from a complete dataset), and then to compare the results of the original complete data with those of the data that were missing and estimated with MI. In this thesis the MI application was used on missing data: some values from the groups of the complete (R.C.B.D) factorial-experiment data were deleted at random, giving an incomplete dataset, and the MI method was then used to estimate the missing data, as sketched below.
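As a rough sketch of this procedure (with simulated data; the design sizes, deletion rate, and column names block, factorA, factorB, and y are illustrative only, not those of the thesis data), one might write:

library(mice)
set.seed(2007)

complete_data <- expand.grid(block   = factor(1:4),
                             factorA = factor(1:3),
                             factorB = factor(1:2))
complete_data$y <- rnorm(nrow(complete_data), mean = 50, sd = 5)

# Delete about 10% of the responses completely at random (MCAR)
incomplete_data <- complete_data
drop <- sample(nrow(incomplete_data), size = round(0.1 * nrow(incomplete_data)))
incomplete_data$y[drop] <- NA

# Impute the missing responses m times and pool a simple analysis
imp    <- mice(incomplete_data, m = 5, method = "pmm", printFlag = FALSE)
fit    <- with(imp, lm(y ~ block + factorA * factorB))
pooled <- summary(pool(fit))
pooled

# Compare with the estimates from the original complete data
fit_complete <- lm(y ~ block + factorA * factorB, data = complete_data)
summary(fit_complete)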
This thesis consists of four chapters. The first chapter gives the introduction, the aim of this study, and a literature review of the subject. The second chapter deals with the theoretical aspect (the MI method for missing data in (R.C.B.D) factorial experiments), its importance, and the conditions required to apply it. The third chapter deals with the practical part, in which the method is applied to the data with the MICE program specialized for this application, i.e. the MI method for missing data in (R.C.B.D) factorial experiments.
The fourth chapter deals with the significant conclusions reached during the application, together with the recommendations, where they exist.
1-2 Literature review
Missing data is a commonly occurring complication in many scientific investigations. Determining the appropriate analytic approach in the presence of incomplete observations is a major question for data analysts. The development of statistical methods to address missing data has been an active area of research in recent decades. There are three types of concerns that typically arise with missing data: loss of efficiency, complication in data handling and analysis, and bias due to differences between the observed and unobserved data. One approach to incomplete data problems that addresses these concerns is multiple imputation, which was proposed by Rubin (1977) and described in detail by Rubin (1987).[13] Educational researchers have become increasingly aware of the problems and biases which can be caused by missing data.[25]
Several approaches to imputing multivariate data have been developed over the last decade. Li (1988) and Rubin and Schafer (1990) presented techniques for generating multivariate multiple imputations. Both are Bayesian simulation algorithms that draw imputations from the posterior predictive distribution of the missing data given the observed data. The Rubin-Schafer method assumes that the data have a multivariate normal distribution and are missing at random.[11][23] The implemented algorithm differs slightly from Schafer's approach in that the conditional models can be specified directly, without the need to choose an encompassing multivariate model for the entire data set. It is assumed that a multivariate distribution exists, and that draws from it can be generated by iteratively sampling from the conditional distributions. In this way, the multivariate problem is split into a series of univariate problems. Similar ideas have been applied by Kennickell (1991).[7]
In the medical research area, Rubin and Schenker (1991) provided an overview of multiple imputation in health-care databases.[6] Heitjan and Landis (1994) developed a multiple imputation approach to assess secular trends in blood pressure in the presence of incomplete multivariate data.[6][15] Roth (1994) reported that imputation-based procedures have a number of advantages. First, imputation strategies save a great deal of information compared with listwise deletion, since an individual is not deleted from the analysis as a result of missing a small amount of data, and the data that are "paired" with the missing data are also saved. Finally, the imputed data preserve deviation from the mean and the shape of the distribution.[6] Rubin (1996) described multiple imputation as a three-step process. First, sets of plausible values for missing observations are created that reflect uncertainty about the non-response model. Each of these sets of plausible values can be used to "fill in" the missing values and create a "completed" dataset. Second, each of these datasets can be analyzed using complete-data methods. Finally, the results are combined, which allows the uncertainty regarding the imputation to be taken into account.[13][20] Lavori, Dawson and Shera (1996) developed a multiple imputation strategy for clinical trials with truncation of patient data. In order to solve these technical difficulties it was decided to use a multiple imputation approach known as the approximate Bayesian bootstrap, which has been applied successfully to clinical trials in medical research.[6]
In the applied research area of survey sampling, Montalto and Sung (1996) examined the use of multiple imputation in the 1992 Survey of Consumer Finances. They concluded that inference on this survey from single-imputation techniques ignores the extra variability due to missing values and risks misrepresenting the precision of estimates and the significance of relationships.[12] Schafer (1997) applied the underlying principle to other multivariate distributions, and derived imputation algorithms for multivariate numerical, categorical and mixed data.[7] In the area of political survey research, King (1998) showed that multiple imputation contributed to an efficient way of using about 50% more information in the data than is currently used by political scientists relying on traditional methods of analyzing incomplete multivariate data.[6] King and Liu (1998) performed a comparison of available-case and multiple imputation analysis and showed a marked improvement in the interpretation and results of political polling research using multiple imputation as opposed to available-case analysis.[6] Following the seminal publications of Rubin about thirty years ago, statisticians have become increasingly aware of the inadequacy of "complete case" analysis of datasets with missing observations. In medicine, for example, observations may be missing in a sporadic way for different covariates, and a complete-case analysis may omit as many as half of the available cases. Hot deck imputation was implemented in Stata in 1999 by Mander and Clayton.[12] David J. Fogarty (2000) applied multiple imputation techniques in a commercial data warehousing environment in an effort to improve data mining activities in consumer financial services firms.[6]
Fogarty (2000) analysed the importance of using proper techniques for reject inference in developing consumer credit scoring. An overview and comparison of the standard missing data approaches to reject inference are provided. MI is also discussed as an approach to reject inference which can potentially reduce some of the biases that can occur when using some of the traditional missing data techniques. A quantitative analysis is then provided to confirm the hypothesis that model-based multiple imputation is an enhancement over traditional missing data approaches to reject inference.[6] Schafer and Graham (2002) point out that time should not be the only factor to consider, since more advanced procedures like single and multiple imputation methods produce significantly higher levels of accuracy than simple techniques such as listwise deletion. The percentage of missing data for each of the variables is also important to consider.[15] Rassler (2003) discusses the importance of imputation for public-use databases in the context of advances in data fusion. When a dataset will be used by a party other than the imputer, consideration must be taken to choose a procedure that does not require complexity in conducting the final statistical analysis. Multiple imputation can be difficult in these situations, while single imputation applications can be easier to understand and prevent improper use and interpretation at the final stages of the analysis.[6] Yan He (2006) discussed how non-parametric bootstraps were implemented to impute missing values before cases were dropped down the tree (CART/RF), and the classification results were compared both to complete-data/full-data analysis and to the classification results using surrogate variables/proximity. Significant improvements in the ability to predict were found for both CART and RF models.[27]
Chapter two Missing Data and Multiple Imputation
2-1 Introduction
Missing data are a problem for all statistical analyses. Missing data mechanisms are crucial since the properties of missing data methods depend very strongly on the nature of these mechanisms. The crucial role of the mechanism in the analysis of data with missing values was largely ignored until the concept was formalized in the theory of Rubin (1987), through the simple device of treating the missing data indicators as random variables and assigning them a distribution. A useful feature of MI is that there is no requirement that the model for creating the imputations be the same as the analysis model applied to the filled-in data. This idea featured prominently in Rubin's initial proposal of MI, which was in the context of imputation of public use data sets where the imputer could not necessarily be expected to have the same analysis in mind as the analyst. A number of thorny theoretical issues arise when the imputation and analysis models differ and may not be mutually "congenial", but the added flexibility opens up some attractive missing-data solutions for the practitioner. The resulting methods are not in all cases theoretically "pristine", but they are likely to have excellent real-life properties.
2-2 Missing data [2][3][4][7][8][12][25]
Missing data is one of the most pervasive problems in data analysis. The problem occurs when rats die, equipment malfunctions, respondents become recalcitrant, or somebody goofs. Its seriousness depends on the pattern of missing data, on how much is missing, and on why it is missing. Missing data in public health research is a major problem. Mean or median imputation is frequently used because it is easy to implement. Although multiple imputation has good statistical properties, it is not yet used extensively. For two real studies and a real study-based simulation, all imputation methods showed similar results for both real studies, but somewhat different results were obtained when only complete cases were used. The simulation showed large differences among various multiple imputation methods with a different number of variables for creating the matching metric for multiple imputation. Multiple imputation using only a few covariates in the matching model produced more biased coefficient estimates than using all available covariates in the matching model. The simulation also showed better standard deviation estimates for multiple imputation than for single mean imputation.
Often empirical researchers are confronted with missing values in their data sets. As the phenomenon is often not seen as a possible threat to the validity of the research, the most common approach to this problem is simply to deny it. However, a closer look at the data often reveals 5% to 20% missing values in a few variables, reducing the complete data for any multivariate analysis considerably; see Figure (2-1). Moreover, often these blind spots are not dropped randomly all over the responses. We find special socio-economic groups or minorities disproportionately struck by missing values. Even worse, the missingness may depend on the variable of interest itself, as it is common for the highest incomes to be unknown. The same happens when, e.g., populations with the worst health conditions or at high risk refuse to be sampled. Finally, the quality of response deteriorates with long and boring questionnaires, as is common practice in media research. In all these cases, missing data can be a threat to the research and the remaining data are anything but representative of the population of interest.
Figure (2-1): Loss of information due to case deletion
Missing data are common in practice and usually complicate data analyses for scientific investigations. A rather general method for handling missing values in a data set is to impute, i.e., fill in one or more plausible values for each missing datum so that one or more completed data sets are created. Often it is easier to first impute for the missing values and then use a standard complete-data method of analysis than to develop special statistical techniques that allow the analysis of incomplete data directly. Imputing a single value for each missing datum and then analyzing the completed data using standard techniques designed for complete data will usually result in standard error estimates that are too small, confidence intervals that undercover, and p-values that are too significant; this is true even if the modeling for imputation is carried out carefully. The usual single imputation strategies such as mean imputation, hot deck, or regression imputation typically result in confidence
intervals and p-values that ignore the uncertainty due to the missing data, because the imputed data were treated as if they were fixed known values.
2-3 Treatment of Missing Data[2][27] There are three approaches in dealing with missing data: 1. Impute the missing data: that is, filling in the missing values. 2. Model the probability of missingness: this is a good option if imputation is infeasible; in certain cases it can account for much of the bias that would otherwise occur. 3. Ignore the missing data: a poor choice, but by far the most common one. This section gives a brief description of alternative approaches to handling the problem of missing data.
2-4 Types of missing data[3][4][7][8][15][18][23] The most appropriate way to handle missing or incomplete data will depend upon how data points became missing. Little and Rubin (1987) define three unique types of missing data mechanisms.
2-4-1 Missing Completely at Random (MCAR) [3][4][7][8][15][23]
MCAR data exist when missing values are randomly distributed across all observations. In this case, observations with complete data are indistinguishable from those with incomplete data. That is, whether the data point on Y is missing is not at all related to the value of Y or to the values of any Xs in that dataset. E.g., if you are asking people their weight in a survey, some people might fail to respond for no good reason; i.e., their nonresponse is in no way related to what their actual weight is, and is also not related to anything else we might be measuring. The term has a precise meaning (Rubin, 1977; Little & Rubin, 1987): thinking of the data set as a large matrix, the missing values are randomly
distributed throughout the matrix. This rarely happens in family studies because it is well established that men, individuals in minority groups, people with high incomes, those with little education, and people who are depressed or anxious are less likely than their counterparts to answer every item in a questionnaire. MCAR is an unreasonable assumption for many family studies. One exception is when data are missing by design. Giving a 100-item interview to children between ages 5 and 9 years would create a serious fatigue problem. A researcher might randomly select 20 items for each child and just ask those items. Because 80% of the values for each child will be missing, using listwise or casewise deletion, it is likely there would be no usable observations. These data, however, would meet the assumption of MCAR because the random process would insure that whether a child answers any one item is unrelated to the child’s score on any of the 100 items. Modern approaches to missing values allow researchers to estimate population parameters that are unbiased compared to the results that would have been obtained if each child answered all 100 items and did not get fatigued in the process. The only limitation is that uncertainty is introduced by the imputation process, and this uncertainty reduces statistical power compared to having complete data.
2-4-2 Missing at Random (MAR) [3][4][7][8][15][23]
MAR data exist when the observations with incomplete data differ from those with complete data, but the pattern of data missingness on Y can be predicted from other variables in the dataset (Xs) and beyond that bears no relationship to Y itself; i.e., whatever nonrandom processes existed in generating the missing data on Y can be explained by the rest of the variables in the dataset. MAR assumes that the actual variables where data are missing are not the cause of the incomplete data; instead, the cause of the missing data is due to some other factors that we also measured. E.g., one sex may be less likely to disclose its weight.
MAR is much more common than MCAR. MAR data are assumed by most methods of dealing with missing data. The assumption is often but not always tenable. Importantly, the more relevant and related predictors we can include in statistical models, the more likely it is that the MAR assumption will be met. Sometimes, if the data that we already have are not sufficient to make our data MAR, we can try to introduce external data as well, e.g., estimating income based on Census block data associated with the address of the respondent. If we can assume that data are MAR, the best methods to deal with the missing data issue are multiple imputation and raw maximum likelihood methods. Together, MAR and MCAR are called ignorable missing data patterns, although that is not quite correct, as sophisticated methods are still typically necessary to deal with them. The missing data for a variable are MAR if the likelihood of missing data on the variable is not related to the participant's score on the variable, after controlling for other variables in the study. These other variables provide the mechanism for explaining missing values. In a study of maternal depression, 10% or more of the mothers may refuse to answer questions about their level of depression. Suppose a study includes poverty status coded as 1 = in poverty, 0 = not in poverty. A mother's score on depression is MAR if her missing values on depression do not depend on her level of depression, after controlling for poverty status. If the likelihood of refusing to answer the question is related to poverty status but is unrelated to depression within each level of poverty status, then the missing values are MAR. The issue of MAR is not whether poverty status can predict maternal depression, but whether poverty status is a mechanism to explain whether a mother will or will not report her depression level (pattern of missingness).
2-4-3 Missing Not at Random (MNAR or nonignorable) [3][7][8][15][23]
Data may be missing in ways that are neither MAR nor MCAR, but nevertheless are systematic. In a panel study of college students where the outcome variable is academic performance, there is likely to be attrition because the students who drop out of college and are lost to the study are more likely to have low scores on academic performance. The pattern of data missingness is non-random and it is not predictable from other variables in the dataset. MNAR data arise when the data missingness pattern is explainable only by the very variable(s) on which the data are missing. E.g., heavy (or light) people may be less likely to disclose their weight. MNAR data are also sometimes described as having selection bias. MNAR data are difficult to deal with, but sometimes that is unavoidable; if the data are MNAR, we need to model the missing-data mechanism. Two approaches used for that are selection models and pattern mixture models; however, we will not deal with them here.
Adopting a generic notation, let us denote the complete data as Ycom and partition it as Ycom = (Yobs, Ymis), where Yobs and Ymis are the observed and missing parts, respectively. The missing data are said to be MAR if the distribution of missingness does not depend on Ymis,

P(K | Ycom) = P(K | Yobs).        (1)

An important special case of MAR, called missing completely at random (MCAR), occurs when the distribution does not depend on Yobs either,

P(K | Ycom) = P(K).               (2)

When Equation (1) is violated and the distribution depends on Ymis, the missing data are said to be missing not at random (MNAR). MAR is also called ignorable nonresponse, and MNAR is called nonignorable. For intuition, it helps to relate these definitions to the patterns in Figure (2-2).
Figure (2-2). Graphical representations of (a) missing completely at random (MCAR), (b) missing at random (MAR), and (c) missing not at random (MNAR) in a univariate missing-data pattern. X represents variables that are completely observed, Y represents a variable that is partly missing, Z represents the component of the causes of missingness unrelated to X and Y, and K represents the missingness.
With the arbitrary pattern of Figure (2-2), MCAR still requires independence between missingness and $Y_1, \ldots, Y_p$. However, MAR is now more difficult to grasp. MAR means that a participant's probabilities of response may be related only to his or her own set of observed items, a set that may change from one participant to another. One could argue that this assumption is odd or unnatural, and in many cases we are inclined to agree. However, the apparent awkwardness of MAR does not imply that it is far from true. Indeed, in many situations, MAR is quite plausible, and the analytic simplifications that result from making this assumption are highly beneficial.
We will discuss these definitions more technically now. Let $u$ be any population of interest, finite or not, and let $u_i = (u_{i1}, u_{i2}, \ldots, u_{ir})$ denote the value of a random vector $U = (U_1, U_2, \ldots, U_r)$ for each unit $i$, $i \in u$. Without loss of generality let $f_U(u_i; \theta)$ be the probability of drawing a certain unit $i \in u$ with observations $u_i = (u_{i1}, u_{i2}, \ldots, u_{ir})$, depending on the parameter $\theta$, which may be regarded as a scalar or a vector. In the case of continuous random variables $U$, $f_U$ may be taken as the density function instead of the probability function. To be more general, $f_U$ may also describe a finite mixture of densities. Finally, let a random sample of $n$ independently observed units from $U$ be given with probability or, more generally, with density function $\prod_{i=1}^{n} f_U(u_i; \theta)$, $\theta \in \Theta$.
Now denote the observed part of the random vector $U$ by $U_{obs}$, and the missing part by $U_{mis}$, so that $U = (U_{obs}, U_{mis})$. The joint distribution of $U_{obs}$ and $U_{mis}$ is given by $f_U(u_i; \theta) = f_{U_{obs},U_{mis}}(u_{obs,i}, u_{mis,i}; \theta)$ for each unit $i \in u$. Furthermore, let $R$ be an indicator variable being zero or one depending on whether the corresponding element of $U$ is missing or observed; i.e.,

$$ R_{ij} = \begin{cases} 1, & \text{if variable } U_j \text{ is observed on unit } i, \\ 0, & \text{otherwise,} \end{cases} \qquad (3) $$

for all units $i \in u$ and variables $U_j$, $j = 1, 2, \ldots, r$. Generally a probability model for $R$ with $f_R(r; \psi)$ is assumed, which depends on some unknown nuisance parameter $\psi$. Hence, the joint distribution of the response indicator $R$ and the interesting variables $U$ is given by

$$ f_{U,R}(u, r; \theta, \psi) = f_U(u; \theta)\, f_{R|U}(r \,|\, u; \psi), \qquad \theta \in \Theta,\ \psi \in \Psi. \qquad (4) $$

The density or probability function describing the observed data of any unit $i \in u$ and, thus, their likelihood may actually be written

$$ L(\theta, \psi; u_{obs}, r) = f_{U_{obs},R}(u_{obs}, r; \theta, \psi) = \int f_{U_{obs},U_{mis}}(u_{obs}, u_{mis}; \theta)\, f_{R|U_{obs},U_{mis}}(r \,|\, u_{obs}, u_{mis}; \psi)\, du_{mis}, \qquad (5) $$

with $\theta \in \Theta$ and $\psi \in \Psi$. For simplicity we want the integral to be understood as the sum for discrete distributions. To ease reading, we usually refer to $f_U$ as the density function of $U$ hereinafter.
Figure (2-3): Missing data example
Consider, e.g., an i.i.d. sample with two random variables $U_1$ and $U_2$, observed for $n$ ($U_1$) or $r < n$ ($U_2$) units, respectively. Then the observed-data likelihood, according to the data presented in Figure (2-3), is

$$ L(\theta, \psi; u_{obs}, r) = \prod_{i=1}^{r} f_{U_1,U_2}(u_{i1}, u_{i2}; \theta)\, f_{R|U_1,U_2}(r_i \,|\, u_{i1}, u_{i2}; \psi) \;\prod_{i=r+1}^{n} \int f_{U_1,U_2}(u_{i1}, u_{i2}; \theta)\, f_{R|U_1,U_2}(r_i \,|\, u_{i1}, u_{i2}; \psi)\, du_{i2}. $$

Notice that the integration of the second term is not easily done without further assumptions. Now the assumptions concerning the missing-data mechanisms can be explained in more detail.
• First of all it is assumed that $\theta$ and $\psi$ are "distinct"; i.e., knowing $\theta$ will provide no information about $\psi$ and vice versa. Then the joint parameter space is the product $\Theta \times \Psi$ of the parameter space of $\theta$ and the parameter space of $\psi$. Thus, the conditional distribution of $R$ given a value $U = u$, i.e., $R|U = u$ or, for short, $R|u$, does not depend on $\theta$ and can therefore be written as $f_{R|U}(r \,|\, u; \psi)$.
• Under the MCAR mechanism the response indicator $R$ and the interesting variables $U$ are assumed to be independent, with $f_{R|U}(r \,|\, u; \psi) = f_R(r; \psi)$ for all $U$.
• Under the MAR mechanism the conditional distribution of $R|U = u$ does not depend on the missing data $U_{mis}$ and is given by $f_{R|U}(r \,|\, u; \psi) = f_{R|U_{obs}}(r \,|\, u_{obs}; \psi)$ for all $U_{mis}$.
Thus we have seen that, if the parameters $\theta$ and $\psi$ are distinct and the missing-data mechanism is at least MAR, then the conditional distribution of $R|u$ is given by $f_{R|U_{obs}}(r \,|\, u_{obs}; \psi)$. The conditional distribution of $R|u$ is independent of $U_{mis}$ and $\theta$; the missingness mechanism is said to be ignorable. The likelihood (5) of the observed data of any unit $i \in u$ under MAR can now be factorized into

$$ L(\theta, \psi; u_{obs}, r) = f_{R|U_{obs}}(r \,|\, u_{obs}; \psi) \int f_{U_{obs},U_{mis}}(u_{obs}, u_{mis}; \theta)\, du_{mis} = f_{R|U_{obs}}(r \,|\, u_{obs}; \psi)\, f_{U_{obs}}(u_{obs}; \theta). \qquad (6) $$

According to Little and Rubin (2002), and as illustrated by (6), under ignorable missingness it is not necessary to consider a model for $R$ if likelihood-based inference about $\theta$ is intended. For the above example, as shown in Figure (2-3), if $\theta$ and $\psi$ are distinct and $f_{R|U_1,U_2}(r \,|\, u_1, u_2; \psi)$ does not depend on the missing data, i.e., the MAR assumption holds, then the observed-data likelihood reduces to

$$ L(\theta; u_{obs}) = \prod_{i=1}^{r} f_{U_1,U_2}(u_{i1}, u_{i2}; \theta) \prod_{i=r+1}^{n} f_{U_1}(u_{i1}; \theta), \qquad (7) $$

and maximizing $L(\theta; u_{obs})$ with respect to $\theta$ gives the correct ML estimate of $\theta$. Thus, given $n$ observations independently drawn from the underlying population, the likelihood ignoring the missing-data mechanism is

$$ L(\theta; u_{obs}) = \prod_{i=1}^{n} f_{U_{obs}}(u_{obs,i}; \theta). \qquad (8) $$

Notice that $u_{obs,i}$ describes the observed values of unit $i$ for $i = 1, 2, \ldots, n$. Concerning the example above, $u_{obs,i} = (u_{i1}, u_{i2})$ for units $i = 1, 2, \ldots, r$ and $u_{obs,i} = (u_{i1})$ for units $i = r+1, r+2, \ldots, n$. Hence, we have seen that all relevant statistical information about the parameters incorporated in $\theta$ should be contained in the observed-data likelihood $L(\theta; u_{obs})$ if the complete-data model, i.e., the data-generating process assuming no missingness, and the ignorability assumption are correct.
2-5 Imputation[2][4][6][7][8][13][25] When a survey has missing values it is often practical to fill the gaps with an estimate of what the values could be. The process of filling in the missing values is called IMPUTATION. Once the data has been imputed the analysts can just use it as though there was nothing missing.
2-5-1 Types of imputation There are many different systems of imputation that may often be used in combinations with one another. The categories below give general definitions but some of them can have several different variants.
2-5-1-1 Imputation using outside information
This is only possible when the answer to a question can be determined exactly from other sources. For example, if a respondent says that they get a certain benefit but they do not know how much it is, then the survey firm can look it up and complete the data. This is only possible for a few variables and it can be very expensive in time and resources.
2-5-1-2 Mean imputation
For numerical data the missing values are replaced by the mean of all responders to that question or that wave of the survey. This will get the correct average value, but it is not a good procedure otherwise. It can distort the shape of distributions and distort the relationships between variables. Figure (2-4) below shows what imputing 70 values out of 500 in a survey of incomes could do to the shape of the distribution. The mean is the same but everything else is wrong.
Figure (2-4): Effect of mean imputation on the shape of a distribution
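A small simulation, sketched below in R with made-up income figures, shows the same effect numerically: the mean is roughly preserved while the spread is understated.

set.seed(1)
income <- rlnorm(500, meanlog = 10, sdlog = 0.6)        # 500 simulated incomes
observed <- income
observed[sample(500, 70)] <- NA                         # 70 values go missing

imputed <- observed
imputed[is.na(imputed)] <- mean(observed, na.rm = TRUE) # mean imputation

c(mean_true = mean(income), mean_imputed = mean(imputed))   # means are close
c(sd_true = sd(income), sd_imputed = sd(imputed))           # sd is understated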
2-5-1-3 Hot deck imputation
In hot deck imputation the missing values are filled in by selecting values from other records within the survey data. It gets its name from the way it was originally carried out when survey data were on cards and the cards were sorted in order to find similar records to use for the imputation. The process involves finding other records in the data set that are similar, in other parts of their responses, to the record with the missing value or values. Often there will be more than one record that could be used for hot deck imputation, and the records that could potentially be used for filling a cell are known as donor records. Hot deck imputation often involves taking, not the best match, but a random choice from a series of good matches, and replacing the missing value or values with those from one of the records in the donor set.
Hot deck imputation is very heavily used with census data. It has the advantage that it can be carried out as the data are being collected using everything that is in the data set so far. Hot deck imputation procedures are usually programmed up in a programming language and generally done by a survey firm often around the time the data are being collected.
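A toy sketch in R of a random hot deck, with simulated data and illustrative variable names (region as the matching variable, income as the item being imputed):

set.seed(3)
d <- data.frame(region = rep(c("north", "south"), each = 25),
                income = c(rnorm(25, 30000, 5000), rnorm(25, 40000, 5000)))
d$income[sample(50, 8)] <- NA                 # 8 missing incomes

for (i in which(is.na(d$income))) {
  # donor set: observed incomes from records in the same region
  donors <- d$income[d$region == d$region[i] & !is.na(d$income)]
  d$income[i] <- sample(donors, 1)            # random draw from the donor set
}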
2-5-1-4 Model based imputation
Model based imputation involves fitting a statistical model and replacing the missing value with a value related to the value that the statistical model would have predicted. In the simplest case we might have one variable, which we will call y, that is missing for some observations in the data set. We would then use the observations in the data set for which y was measured to develop a regression model to predict y from other variables. These other variables have to be available for the cases with missing values also. We then calculate the predicted value of y for the missing observations. One method of imputing would simply be to replace the missing data with the predicted values. This has the same disadvantage shown for mean imputation above: it tends to give values that cluster around the fitted prediction equation. A better procedure is to replace the missing value with a random draw from the distribution predicted for y. The procedure just described assumes that the regression model is correct and completely accurate. In reality we only have an estimate of the regression model, not an exact representation of it. So a further step is to add some additional noise to the imputed value to allow for the fact that the regression model is fitted with error. When this is done, and all these sources of variability have been incorporated into the procedure, the imputations are said to be "proper imputations", in that they incorporate all the variability that affects the imputed value. This would be the simplest type of model based imputation; more complicated methods can be used that look not only at one variable at a time but model a whole set of variables jointly. A short sketch of the regression-with-noise idea is given below.
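A minimal sketch of this regression-with-noise idea on simulated data follows; note that a fully "proper" imputation would also reflect the uncertainty in the estimated regression coefficients themselves, which is omitted here for brevity.

set.seed(42)
n <- 200
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n, sd = 1)
y[sample(n, 40)] <- NA                       # 40 values of y are missing

fit <- lm(y ~ x)                             # regression on the observed cases
pred <- predict(fit, newdata = data.frame(x = x[is.na(y)]))

# Add residual noise so imputed values do not cluster on the fitted line
sigma_hat <- summary(fit)$sigma
y_imputed <- y
y_imputed[is.na(y)] <- pred + rnorm(sum(is.na(y)), sd = sigma_hat)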
2-5-1-5 Multiple imputation [2][4][6][7][8][13][25]
Multiple imputation is a statistical technique for analyzing incomplete data sets, that is, data sets for which some entries are missing. The main idea is that each missing value is replaced by several (m) values, thus producing (m) imputed data sets. The differences between these data sets reflect the uncertainty of the missing values. Each imputed data set is analyzed by standard complete-data procedures, which ignore the distinction between real and imputed values. The (m) resulting analyses are then combined into one final analysis. Figure (2-5) illustrates the flow of operations in multiple imputation.
Figure (2-5): Schematic representation of the steps in multiple imputation. The process starts with an incomplete data set (on the left side), which is imputed m times (m=3 here) thus creating m completed data sets. Each complete data set is analyzed by using standard complete-data software, thus resulting in m analysis results. Finally, these m results are pooled into one final result that adequately reflects the amount of uncertainty in the estimates.
IMPUTATION: Impute (= fill in) the missing entries of the incomplete data set, not once, but m times (m=3 in the figure). Imputed values are drawn from a distribution (that can be different for each missing entry). This step results in m completed data sets.
ANALYSIS: Analyze each of the m completed data sets. This step results in m analyses.
POOLING: Integrate the m analysis results into a final result. Simple rules exist for combining the m analyses.
In recent years multiple imputation has been introduced as a useful, consistent and straightforward solution. With multiple imputation, the responsibility of dealing correctly with the missing items, in order to avoid biased estimates and overestimation of the precision, is laid in the hands of the data collectors. One of the main attractions of multiple imputation is the fact that it results in completed data and therefore allows standard complete-data statistical techniques to be used afterward. Although multiple imputation has good statistical properties, it is not yet used extensively. For two real studies and a real study-based simulation, all imputation methods showed similar results for both real studies, but somewhat different results were obtained when only complete cases were used. The simulation showed large differences among various multiple imputation methods with a different number of variables for creating the matching metric for multiple imputation. Multiple imputation using only a few covariates in the matching model produced more biased coefficient estimates than using all available covariates in the matching model. The simulation also showed better standard deviation estimates for multiple imputation than for single mean imputation. Multiple imputation (MI) is an approach that retains the advantages of imputation while allowing the data analyst to make valid assessments of uncertainty. Multiple imputation reflects the uncertainty in the imputation of the missing values, resulting in theoretically wider confidence intervals, and thus p-values suggesting less significance, than single imputation would give. MI is a Monte Carlo technique that replaces the missing values by m > 1 simulated versions, generated according to a probability distribution indicating how likely the true values are given the observed data; see Figure (2-6).
Figure (2-6): Multiple imputation
Typically m is small, e.g., m = 5, although with increasing computational power m can be 10 or 20; in general this depends on the amount of missingness and on the distribution of the parameter to be estimated, especially if the analyst's model and the imputer's model differ. Each of the imputed (and thus completed) data sets is first analyzed by standard methods; the results are then combined to produce estimates and confidence intervals that reflect the missing-data uncertainty. The basic idea of multiple imputation, first proposed by Rubin (1977) and elaborated in his (1987) book, is quite simple:
1. Impute the missing values using an appropriate model that incorporates random variation.
2. Do this M times (usually 3-5 times), producing M "complete" data sets.
3. Perform the desired analysis on each data set using standard complete-data methods.
4. Average the values of the parameter estimates across the M samples to produce a single point estimate.
5. Calculate the standard errors by (a) averaging the squared standard errors of the M estimates, (b) calculating the variance of the M parameter estimates across samples, and (c) combining the two quantities using a simple formula.
The primary advantage of multiple imputation is that it leads to valid statistical inferences in the presence of non-response. A second advantage is that only familiar complete-data software is needed to analyze the data. Despite these virtues, the application of multiple imputation is not without problems. Though simple and sound procedures exist for pooling the m analyses, generating proper multiple imputations is not a trivial task. In practical applications, a major difficulty is the derivation of an appropriate predictive distribution from which imputations are to be drawn. Closed-form analytic solutions are often unavailable, and some form of iterative algorithm is needed. The algorithm given in the next section requires only the specification of a conditional distribution for the missing data in each incomplete variable. Once data have been imputed, one can simply carry on and analyze the imputed data as if they were the real data. If there is a substantial amount of missing data, the results of this analysis, although on average they will give good estimates, will be over-optimistic in that they assume that the missing data really were measured by the imputed values. We know this is not the case, because if we imputed the same value twice, using the methods described under hot deck or model based imputation, we would not always get the same imputed values. In hot deck imputation we would usually get a different choice from the potential donor records. In model based imputation we would select a different value from the distribution of the predicted values.
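A compact Python sketch of the five steps listed above, using a deliberately crude imputation model (random draws around the observed mean of a made-up variable) purely so that the combining logic is visible; any proper imputation model could be substituted in steps 1-2:

```python
import numpy as np

rng = np.random.default_rng(42)
y = np.array([0.49, 0.42, np.nan, 0.08, 0.62, np.nan, 0.30, 0.13])  # toy data
obs = ~np.isnan(y)
M = 5

point_estimates, squared_ses = [], []
for _ in range(M):
    y_imp = y.copy()
    # Steps 1-2: impute with random variation, M times.
    y_imp[~obs] = rng.normal(y[obs].mean(), y[obs].std(ddof=1), (~obs).sum())
    # Step 3: standard complete-data analysis (here: the mean and its squared SE).
    point_estimates.append(y_imp.mean())
    squared_ses.append(y_imp.var(ddof=1) / y_imp.size)

# Step 4: average the point estimates across the M samples.
theta_mi = np.mean(point_estimates)
# Step 5: combine within- and between-imputation variability.
W = np.mean(squared_ses)
B = np.var(point_estimates, ddof=1)
T = W + (1 + 1 / M) * B
print(round(theta_mi, 3), round(np.sqrt(T), 3))
```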
In order to incorporate this variation in the analyses, one needs to run the imputation more than once. This is very straightforward: one simply carries out the imputation more than once and, in the first instance, looks to see whether the imputed results are the same for each of the analyses. They will never be exactly the same and, in order to incorporate this variation into our estimates of the errors, there are some relatively straightforward formulae that can be used. To use the formulae the results need to be expressed in terms of statistics that follow a normal or a t-distribution, but this covers a wide range such as means, proportions and all types of regression. Ideally, proper imputation should be used for each of the multiple imputations. It is not usually necessary to carry out the multiple imputation many times; 5 to 10 imputations have been suggested, but rather more (e.g. 20 to 50) may be needed if good estimates of the standard errors are required. The formula for combining imputations works well in practice, since imputation is usually a much smaller source of error than other aspects of a survey.
2-5-2 Desirable features of multiple imputation[2][4][8][13]
Multiple imputation has several desirable features:
1. Introducing appropriate random error into the imputation process makes it possible to obtain approximately unbiased estimates of all parameters. No deterministic imputation method can do this in general settings.
2. Repeated imputation allows one to obtain good estimates of the standard errors. Single imputation methods do not allow for the additional error introduced by imputation (without specialized software of very limited generality).
3. MI can be used with any kind of data and any kind of analysis without specialized software.
2-6 Bayesian Inference[4][7][8][15][16][24]
The Bayesian paradigm is based on specifying a probability model for the observed data U with joint density $f_{U|\theta}(u|\theta)$, given a vector of unknown parameters $\theta$; this density is identical to the likelihood function $L(\theta; u)$ understood as a function of $\theta$. Then we assume that $\theta$ is random and has a prior distribution with density or probability function $f_\theta$. Inference about $\theta$ is then summarized in the function $f_{\theta|U}$, which is called the posterior distribution of $\theta$ given the data. The posterior distribution is derived from the joint distribution $f_{U,\theta} = f_{U|\theta} f_\theta$ according to Bayes' formula

$$f_{\theta|U}(\theta|u) = \frac{f_{U|\theta}(u|\theta)\, f_\theta(\theta)}{\int_\Theta f_{U|\theta}(u|\theta)\, f_\theta(\theta)\, d\theta}, \qquad (9)$$

where $\Theta$ denotes the parameter space of $\theta$. Notice that from a Bayesian perspective the joint distribution $f_{U|\theta}(u|\theta)$ equates the likelihood $L(\theta; u)$ when the data are observed and only $\theta$ is still variable. For brevity the integral is used again, although $\theta$ may also be discrete; in such cases the integral should be understood as a sum. From (9) it is easily seen that $f_{\theta|U}(\theta|u)$ is proportional to the likelihood multiplied by the prior, i.e.,

$$f_{\theta|U}(\theta|u) \propto L(\theta; u)\, f_\theta(\theta),$$

and thus involves a contribution from the observed data through $L(\theta; u)$ and a contribution from the prior information quantified through $f_\theta(\theta)$. The quantity

$$c(u) = \int_\Theta f_{U|\theta}(u|\theta)\, f_\theta(\theta)\, d\theta$$

is usually treated as the normalizing constant of $f_{\theta|U}(\theta|u)$, ensuring that it is a density or probability function, i.e., that it integrates or sums to one. Notice that $c(u)$ is a constant when the data U = u are observed. Before the data U are obtained, their distribution $f_U(u)$ is called the marginal density of U or the prior predictive distribution, since it does not condition on previous observations. To predict a future observation value $\hat{u}$ when the data U = u have been observed, we condition on these previous observations u. The distribution $f_{\hat{U}|U}(\hat{u}|u)$ of $\hat{U}|U$ is called the posterior predictive distribution, with

$$f_{\hat{U}|U}(\hat{u}|u) = \int_\Theta f_{\hat{U}|\theta,U}(\hat{u}|\theta,u)\, f_{\theta|U}(\theta|u)\, d\theta \qquad (10)$$

if $\theta$ is continuous; otherwise the sum is taken instead of the integral. Notice that usually $\hat{U}$ and U are assumed to be conditionally independent given $\theta$; thus $f_{\hat{U}|\theta,U}(\hat{u}|\theta,u) = f_{\hat{U}|\theta}(\hat{u}|\theta)$ holds. Hence the posterior predictive distribution is conditioned on the values U = u already observed and predicts an observable value $\hat{U} = \hat{u}$. A classical and extensive introduction to Bayesian inference is given by Box and Tiao (1992); for deeper insights into Bayesian inference and computation the interested reader is referred thereto. Frequentist methods, however, do not tell us what our belief in a theory should be, given the data we have actually observed. This question is usually answered by the posterior distribution $f_{\theta|U}$. To work it out we must first establish $f_\theta$; i.e., we have to formulate some "prior probability" for the theory in mind. In contrast to classical Bayesian inference we do not focus further on what we can learn about our theory given the data. Our objective is the derivation of a suitable imputation procedure that has good properties under the frequentist's randomization perspective.
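As a small self-contained illustration (not taken from the thesis data), for a normal sample with known variance $\sigma^2$ and a normal prior on the mean, both the posterior (9) and the posterior predictive distribution (10) are available in closed form:

$$u_1,\ldots,u_n \mid \theta \sim N(\theta,\sigma^2),\qquad \theta \sim N(\mu_0,\tau_0^2),$$
$$\theta \mid u \sim N\!\left(\frac{\tau_0^{-2}\mu_0 + n\sigma^{-2}\bar u}{\tau_0^{-2}+n\sigma^{-2}},\ \frac{1}{\tau_0^{-2}+n\sigma^{-2}}\right),\qquad \hat U \mid u \sim N\!\big(E(\theta\mid u),\ \sigma^2 + \operatorname{var}(\theta\mid u)\big).$$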
In the Bayesian framework all inference is based on a posterior density function for the unknown parameters, conditioning on the quantities observed. Returning to our notation, the unknown parameters are $(\theta, \psi)$ and the observed quantities are $(U_{obs}, R)$. According to Bayes' theorem the posterior distribution of $(\theta, \psi)$ given $(U_{obs} = u_{obs}, R = r)$, i.e., the observed-data posterior distribution $f_{\theta,\psi|U_{obs},R} = f_{\theta,\psi,U_{obs},R}\,/\,f_{U_{obs},R}$, may be written as

$$f_{\theta,\psi|U_{obs},R}(\theta,\psi|u_{obs},r) = \frac{L(\theta,\psi;\, u_{obs},r)\, f_{\theta,\psi}(\theta,\psi)}{c(u_{obs},r)}, \qquad (11)$$

with normalizing constant

$$c(u_{obs},r) = \int\!\!\int L(\theta,\psi;\, u_{obs},r)\, f_{\theta,\psi}(\theta,\psi)\, d\theta\, d\psi .$$

Note that $L(\theta,\psi;\, u_{obs},r)$ denotes the likelihood (6) of the observed data considering the missingness mechanism, and $f_{\theta,\psi}(\theta,\psi)$ is the joint prior distribution of the parameters. Under the assumption of MAR and the distinctness of $(\theta, \psi)$, which means prior independence of $\theta$ and $\psi$ in Bayesian inference, i.e., $f_{\theta,\psi}(\theta,\psi) = f_\theta(\theta)\, f_\psi(\psi)$, according to (6) the observed-data posterior (11) reduces to

$$f_{\theta,\psi|U_{obs},R}(\theta,\psi|u_{obs},r) \propto f_{U_{obs}|\theta}(u_{obs}|\theta)\, f_{R|U_{obs},\psi}(r|u_{obs},\psi)\, f_\theta(\theta)\, f_\psi(\psi). \qquad (12)$$

From the Bayesian point of view the MAR assumption requires the independence of R and $\theta$, i.e., $f_{R|U_{obs},\theta,\psi}(r|u_{obs},\theta,\psi) = f_{R|U_{obs},\psi}(r|u_{obs},\psi)$, as well as the independence of $U_{obs}$ and $\psi$, i.e., $f_{U_{obs}|\theta,\psi}(u_{obs}|\theta,\psi) = f_{U_{obs}|\theta}(u_{obs}|\theta)$, finally leading to (12).
Hence the marginal posterior distribution of $\theta$ is obtained by integrating (12) over the nuisance parameter $\psi$,

$$f_{\theta|U_{obs},R}(\theta|u_{obs},r) = \int f_{\theta,\psi|U_{obs},R}(\theta,\psi|u_{obs},r)\, d\psi \;\propto\; f_{U_{obs}|\theta}(u_{obs}|\theta)\, f_\theta(\theta). \qquad (13)$$

Thus, under ignorability all information about $\theta$ is included in the posterior that ignores the missing-data mechanism,

$$f_{\theta|U_{obs}}(\theta|u_{obs}) \propto L(\theta;\, u_{obs})\, f_\theta(\theta). \qquad (14)$$
2-7 Multiple imputation paradigm[6][17][14]
The theoretical motivation for multiple imputation is Bayesian, although the resulting multiple imputation inference is usually also valid from a frequentist's viewpoint. Basically, MI requires independent random draws from the posterior predictive distribution $f_{U_{mis}|U_{obs}}$ of the missing data given the observed data, analogous to (10). Since $f_{U_{mis}|U_{obs}}$ itself is often difficult to draw from directly, a two-step procedure for each of the m draws is useful:
(a) First, we make random draws of the parameters $\theta$ according to their observed-data posterior distribution $f_{\theta|U_{obs}}$ given by (14);
(b) Then, we perform random draws of $U_{mis}$ according to their conditional predictive distribution $f_{U_{mis}|U_{obs},\theta}$.
Because

$$f_{U_{mis}|U_{obs}}(u_{mis}|u_{obs}) = \int f_{U_{mis}|U_{obs},\theta}(u_{mis}|u_{obs},\theta)\, f_{\theta|U_{obs}}(\theta|u_{obs})\, d\theta \qquad (15)$$

holds, analogous to (10), with (a) and (b) we achieve imputations of $U_{mis}$ from their posterior predictive distribution $f_{U_{mis}|U_{obs}}$. For many models the conditional predictive distribution $f_{U_{mis}|U_{obs},\theta}$ is rather straightforward to derive due to the data model used; see as an example Figure (2-7). It often may easily be formulated for each unit with missing data. On the contrary, the corresponding observed-data posteriors $f_{\theta|U_{obs}}$ usually are difficult to derive, especially when the data have a multivariate structure and different missing-data patterns. The observed-data posteriors are often not standard distributions from which random numbers can easily be generated. However, simpler methods have been developed to enable multiple imputation on the grounds of Markov chain Monte Carlo (MCMC) techniques; they are extensively discussed by Schafer (1997). In MCMC the desired distributions $f_{U_{mis}|U_{obs}}$ and $f_{\theta|U_{obs}}$ are achieved as stationary distributions of Markov chains which are based on the easier-to-compute complete-data distributions. To proceed further, let $\theta$ denote a scalar quantity of interest that is to be estimated, such as a mean, a variance, or a correlation coefficient.
Figure (2-7): Example of a conditional predictive distribution
Notice that this quantity can be completely different from the parameters of the data model used before to create the imputations. In the remainder of this section, the quantity $\theta$ to be estimated from the multiply imputed data set has to be distinguished from the parameter $\theta$ used in the model for imputation. Consider, for example, a data set with an income variable (inc) and an expenditure variable (exp), and assume that only the income information is missing for some units. If the imputation of the missing income is based on the expenditure information, e.g., by applying a simple linear regression $inc_i = \beta_0 + \beta_1\, exp_i + \varepsilon_i$ for $i = 1, 2, \ldots, n$, then $\theta^{(imputation)} = (\beta_0, \beta_1)$. If, on the other hand, the analyst's model explains expenditure by income, such that $exp_i = \alpha_0 + \alpha_1\, inc_i + \varepsilon_i$ for $i = 1, 2, \ldots, n$ holds, then $\theta^{(analysis)} = (\alpha_0, \alpha_1)$.
Although $\theta^{(analysis)}$ could be an explicit function of $\theta^{(imputation)}$, as is the case in the example above, one of the strengths of the multiple imputation approach is that this need not be the case. In fact, $\theta^{(analysis)}$ could even be the parameter of the imputation model; then the imputation and analysis models are the same and are said to be congenial, a term coined by Meng (1995). However, multiple imputation is designed for situations where the analyst and the imputer are different, and thus the analyst's model could be quite different from the imputer's model. As long as the two models are not overly incompatible or the fraction of missing information is not high, inferences based on the multiply imputed data should still be approximately valid. Even more, if the analyst's model is a sub-model of the imputer's model, i.e., the imputer uses a larger set of covariates than the analyst and the covariates are good predictors of the missing values, then MI inference can beat the best inference possible using only the variables in the analyst's model. This property is called superefficiency by Rubin (1996). On the other hand, if the imputer ignores some important correlates of variables with missing data, but these variables are used in the analyst's model, then the results will be biased. In the example above this would correspond to an imputation being done under the hypothesis of zero correlation between income and expenditure, which is surely not the case; thus, the results would be biased. Moreover, the imputer's model also allows the use of in-house variables such as additional information from the interviewers (area of living, neighborhood, house size, number of car garages, etc.) which are typically not available to the analyst but may show some correlation with the missing variables. Thus, an inclusive strategy is nearly always better than a restrictive one.
As described before, $U = (U_{obs}, U_{mis})$ denotes the random variables concerning the data with observed and missing parts, and $\hat\theta = \hat\theta(U)$ denotes the statistic that would be used to estimate $\theta$ if the data were complete. Furthermore, let $\widehat{\operatorname{var}}(\hat\theta) = \widehat{\operatorname{var}}(\hat\theta(U))$ be the variance estimate of $\hat\theta(U)$ based on the complete data set. The MI principle assumes that $\hat\theta$ and $\widehat{\operatorname{var}}(\hat\theta)$ can be regarded as an approximate complete-data posterior mean and variance for $\theta$, with

$$\hat\theta(U) \approx E(\theta \mid U) \qquad \text{and} \qquad \widehat{\operatorname{var}}(\hat\theta) \approx \operatorname{var}(\theta \mid U),$$
based on a suitable complete-data model and prior. Moreover, we should assume that with complete data, tests and interval estimates based on the normal approximation

$$\frac{\hat\theta - \theta}{\sqrt{\widehat{\operatorname{var}}(\hat\theta)}} \sim N(0, 1) \qquad (16)$$

should work well. Hence, we assume that the complete-data inference can be based on $\hat\theta \sim N(\theta, \widehat{\operatorname{var}}(\hat\theta))$ and that $\widehat{\operatorname{var}}(\hat\theta)$ has variability of lower order than that of $\hat\theta$. Notice that the usual maximum-likelihood estimates and their asymptotic variances derived from the inverted Fisher information matrix typically satisfy these assumptions. Sometimes it is necessary to transform the estimate $\hat\theta$ to a scale for which the normal approximation can be applied. For example, for the correlation coefficient estimate $\hat\rho$ we can use the so-called z-transformation $z(\hat\rho) = \tfrac{1}{2}\ln\big((1+\hat\rho)/(1-\hat\rho)\big)$, which makes $z(\hat\rho)$ approximately normally distributed with mean $z(\rho)$ and constant variance $1/(n-3)$.
Suppose now that the data are missing and we make $m > 1$ independent simulated imputations $(U_{obs}, U^{(1)}_{mis}), (U_{obs}, U^{(2)}_{mis}), \ldots, (U_{obs}, U^{(m)}_{mis})$, enabling us to calculate the imputed-data estimate $\hat\theta^{(t)} = \hat\theta(U_{obs}, U^{(t)}_{mis})$ along with its estimated variance $\widehat{\operatorname{var}}(\hat\theta^{(t)}) = \widehat{\operatorname{var}}(\hat\theta(U_{obs}, U^{(t)}_{mis}))$, $t = 1, 2, \ldots, m$. Figure (2-8) illustrates the multiple imputation principle. From these (m) imputed data sets the multiple imputation estimates are computed. The MI point estimate for $\theta$ is simply the average

$$\hat\theta_{MI} = \frac{1}{m}\sum_{t=1}^{m} \hat\theta^{(t)}. \qquad (17)$$
Figure (2-8): The multiple imputation principle
To obtain a standard error $\sqrt{\widehat{\operatorname{var}}(\hat\theta_{MI})}$ for the MI estimate $\hat\theta_{MI}$, we first calculate the "between-imputation" variance

$$B = \frac{1}{m-1}\sum_{t=1}^{m}\big(\hat\theta^{(t)} - \hat\theta_{MI}\big)^{2}, \qquad (18)$$

and then the "within-imputation" variance

$$W = \frac{1}{m}\sum_{t=1}^{m}\widehat{\operatorname{var}}(\hat\theta^{(t)}). \qquad (19)$$

Finally, the estimated total variance is defined by

$$T = \widehat{\operatorname{var}}(\hat\theta_{MI}) = W + \Big(1 + \frac{1}{m}\Big)B = W + \frac{m+1}{m}\,B. \qquad (20)$$
Notice that the term $((m + 1)/m)B$ enlarges the total variance estimate T compared to the usual analysis of variance with $T = B + W$; $(m+1)/m$ is an adjustment for finite m. An estimate of the fraction of missing information about $\theta$ due to nonresponse is given by

$$\hat\gamma = \frac{(1 + 1/m)\,B}{T}. \qquad (21)$$

For large sample sizes, tests and two-sided $(1-\alpha)100\%$ interval estimates can be based on the Student's t-distribution,

$$\frac{\hat\theta_{MI} - \theta}{\sqrt{T}} \sim t_{\nu}, \qquad (22)$$

with the degrees of freedom

$$\nu = (m-1)\Big(1 + \frac{W}{(1 + 1/m)\,B}\Big)^{2}, \qquad (23)$$

which are based on a Satterthwaite approximation. For small data sets an improved expression for the degrees of freedom is given by Barnard and Rubin (1999). They relax the assumption of a normal reference distribution in (16) for the complete-data interval estimates and tests to allow a t-distribution, and they derive the corresponding degrees of freedom for the MI inference to replace the formula (23) given here. Moreover, additional methods are available for combining vector estimates and covariance matrices, p-values, and likelihood-ratio statistics. From equation (22) we see that the multiple imputation interval estimate is expected to produce a larger interval than an estimate based only on the observed cases or on one single imputation. The multiple imputation interval estimates are widened to account for the missing-data uncertainty and the simulation error.
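The combining rules (17)-(23) fit into one small function; the sketch below uses Python with numpy and scipy, and the per-imputation inputs are hypothetical numbers chosen to mirror the worked example of Chapter three:

```python
import numpy as np
from scipy import stats

def pool_mi(estimates, variances, alpha=0.05):
    """Pool m completed-data analyses with Rubin's rules, equations (17)-(23)."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    m = est.size
    theta_mi = est.mean()                                 # (17) pooled estimate
    B = est.var(ddof=1)                                   # (18) between-imputation
    W = var.mean()                                        # (19) within-imputation
    T = W + (1 + 1 / m) * B                               # (20) total variance
    gamma = (1 + 1 / m) * B / T                           # (21) missing information
    nu = (m - 1) * (1 + W / ((1 + 1 / m) * B)) ** 2       # (23) degrees of freedom
    half = stats.t.ppf(1 - alpha / 2, nu) * np.sqrt(T)    # (22) t-based interval
    return theta_mi, (theta_mi - half, theta_mi + half), gamma, nu

print(pool_mi([0.430, 0.448, 0.431, 0.450, 0.450],
              [0.112, 0.105, 0.105, 0.107, 0.128]))
```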
Chapter three
Practical Part
3-1 Introduction
This chapter presents a practical example of a factorial experiment in (R.C.B.D). The complete data set contains (75) values, from which some values were deleted at random to produce an incomplete data set. In total (15) values were deleted at random, and multiple imputation (MI) was then applied to estimate the missing values.
3-2 Description of the data[1]
Table (3-1) presents a study of the effect of (5) growth regulators (Factor A) on the vegetative growth of (5) groups of (Cucurbita pepo L.) (Factor B), using a factorial (R.C.B.D) design, as shown in the table below:
Table (3-1) The initial complete data (r1, r2, r3 = replications)

Factor A  Factor B    r1     r2     r3     ȳi      Si²
a1        b1         0.49   0.33   0.83   0.550   0.065
a1        b2         0.42   0.60   0.74   0.587   0.026
a1        b3         0.44   0.21   0.28   0.310   0.014
a1        b4         0.08   0.16   0.06   0.100   0.003
a1        b5         0.00   0.00   0.00   0.000   0.000
a2        b1         0.42   0.77   0.99   0.727   0.083
a2        b2         0.62   1.07   0.84   0.843   0.051
a2        b3         0.30   0.38   0.66   0.447   0.036
a2        b4         0.13   0.16   0.40   0.230   0.022
a2        b5         0.08   0.03   0.24   0.117   0.012
a3        b1         0.46   0.92   1.07   0.817   0.101
a3        b2         0.80   0.54   1.00   0.780   0.053
a3        b3         0.28   0.53   0.48   0.430   0.018
a3        b4         0.48   0.27   0.84   0.530   0.083
a3        b5         1.67   0.24   0.38   0.763   0.621
a4        b1         0.10   0.04   0.17   0.103   0.004
a4        b2         0.26   0.17   0.61   0.347   0.054
a4        b3         0.20   0.00   0.54   0.247   0.075
a4        b4         0.19   0.00   0.13   0.107   0.009
a4        b5         0.11   0.02   0.26   0.130   0.015
a5        b1         0.80   0.62   0.51   0.643   0.021
a5        b2         0.30   0.41   0.89   0.533   0.098
a5        b3         0.51   0.73   0.78   0.673   0.021
a5        b4         0.24   0.27   0.41   0.307   0.008
a5        b5         0.09   0.31   0.38   0.260   0.023
The estimates of the mean, the total variance and the standard error for the complete data set are found as follows:

$$\hat\mu = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{75}\,(0.49 + 0.42 + \ldots + 0.41 + 0.38) = 0.42$$

$$\widehat{\operatorname{var}}(\hat\mu) = \frac{1}{N-1}\sum_{i=1}^{N}\big(y_i - \hat\mu\big)^{2} = \frac{1}{75-1}\,\big[(0.49-0.42)^2 + (0.42-0.42)^2 + \ldots + (0.41-0.42)^2 + (0.38-0.42)^2\big] = 0.11$$
So the corresponding standard deviation is: Standard Deviation $= \sqrt{0.11} \approx 0.33$.
3-3 Missing and classifying data
Fifteen (15) values were deleted at random from the complete data set, giving an incomplete data set. The data were then classified into groups of (5) values each: for example, the values of the combinations (a1bi), i = 1, 2, 3, 4, 5, within a replication are combined into one class, the values of (a2bi) into another class, and so on. This is shown in the table below; a short sketch of the random deletion is given after the table.
Table (3-2) Missing values in the initial complete data (missing cells shown as —)

Factor A  Factor B    r1    Class    r2    Class    r3    Class   ȳi     Si²
a1        b1         0.49     1      —       2     0.83     3     0.66   0.06
a1        b2         0.42     1     0.60     2     0.74     3     0.59   0.03
a1        b3          —       1     0.21     2     0.28     3     0.25   0.00
a1        b4         0.08     1     0.16     2      —       3     0.12   0.00
a1        b5         0.00     1     0.00     2     0.00     3     0.00   0.00
a2        b1         0.42     4     0.77     5      —       6     0.60   0.06
a2        b2         0.62     4     1.07     5     0.84     6     0.84   0.05
a2        b3         0.30     4      —       5     0.66     6     0.48   0.06
a2        b4         0.13     4     0.16     5     0.40     6     0.23   0.02
a2        b5          —       4     0.03     5     0.24     6     0.14   0.02
a3        b1         0.46     7     0.92     8     1.07     9     0.82   0.10
a3        b2         0.80     7     0.54     8     1.00     9     0.78   0.05
a3        b3         0.28     7      —       8     0.48     9     0.38   0.02
a3        b4          —       7     0.27     8     0.84     9     0.56   0.16
a3        b5         1.67     7     0.24     8      —       9     0.96   1.02
a4        b1          —      10     0.04    11      —      12     0.04    ---
a4        b2         0.26    10     0.17    11     0.61    12     0.35   0.05
a4        b3         0.20    10     0.00    11     0.54    12     0.25   0.07
a4        b4         0.19    10      —      11     0.13    12     0.16   0.00
a4        b5         0.11    10     0.02    11     0.26    12     0.13   0.01
a5        b1         0.80    13     0.62    14      —      15     0.71   0.02
a5        b2          —      13     0.41    14     0.89    15     0.65   0.12
a5        b3         0.51    13     0.73    14     0.78    15     0.67   0.02
a5        b4         0.24    13     0.27    14     0.41    15     0.31   0.01
a5        b5         0.09    13      —      14     0.38    15     0.24   0.04
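The random deletion described above can be sketched as follows; the seed, the flat array layout and the stand-in values are illustrative only, not the actual draw used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(2007)
y = np.linspace(0.0, 1.67, 75)                 # stands in for the 75 observations
drop = rng.choice(75, size=15, replace=False)  # 15 cell positions chosen at random
y_incomplete = y.copy()
y_incomplete[drop] = np.nan
print(int(np.isnan(y_incomplete).sum()), "values set to missing out of", y.size)
```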
3-4 Data analysis
In this stage the data are analyzed by the (MI) method on the missing data, in order to find the estimated mean, the total variance and the standard error for the incomplete data set, as follows.

Table (3-3) Data descriptives for the incomplete data
Variable   Mean   STDEV   Valid N
Data       0.44   0.34    60

The first column of the data descriptives output shows the name of each variable in the data set. For each variable the mean, standard deviation and valid N are given: the mean of the Data variable is 0.44, its standard deviation is 0.34 and its valid N is 60 observations.

Table (3-4) Missing data pattern
   N    Data   Missing
  60      1       0
  15      0       1
         15
From Table (3-4), each row in the missing-data-pattern output represents a response pattern. A valid response is indicated by a "1", a missing response by a "0". The pattern in the upper row consists of only valid responses; there are no missing values in this pattern, indicated by the 0 in the rightmost column. The number in the left column, 60, is the number of times this response pattern occurs: out of 75 cases, 60 have no missing values at all, and only these 60 cases would be available for the estimation of a multivariate model if listwise deletion of missing values were applied. The Data variable has the highest number of missing values, 15. A requirement for the estimation of a multilevel model is an adequate number of cases within each class: the minimal number of cases in a class must be at least one greater than the number of class-specific parameters to be estimated. Therefore, it is a good idea to look at the frequencies of the class variable.
3-5 Imputation specifications
The first output seen after running the imputation through the menu is an overview of the imputation specifications that were applied. For each variable, the imputation method is shown:

Variable ID:     no missing values
Variable class:  no missing values
Variable Data:   Method = MULTILEVEL

For variables that have multilevel regression as their imputation method, a list of parameter estimates is provided for each imputation. Parameter estimates and their standard errors are given every 1500 iterations.

Multilevel estimates, impnr 1, dependent = DATA, class indicator = class:
  B[intercept]   = 0.4796
  Sigma Squared  = 0.1261
  D[intercept]   = 0.0074

Here B[intercept], Sigma Squared and D[intercept] are the parameter estimates of the posterior distribution used to generate the (m) imputations of the missing values.
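Under the usual reading of such output (an interpretation, since the software manual's definitions are not reproduced in this thesis), these three quantities would correspond to a random-intercept model for the Data variable with class as the grouping factor,

$$y_{ij} = \beta_0 + u_j + e_{ij},\qquad u_j \sim N(0, D),\qquad e_{ij} \sim N(0, \sigma^2),$$

with $\hat\beta_0 \approx 0.4796$ (B[intercept]), $\hat\sigma^2 \approx 0.1261$ (Sigma Squared) and $\hat D \approx 0.0074$ (D[intercept]), where $j$ indexes the classes and $i$ the observations within a class.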
For m = 1:

$$\hat\mu^{(1)} = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{75}\,(0.490 + \ldots + 0.384 + \ldots + 0.380) = 0.430$$

For m = 2:

$$\hat\mu^{(2)} = \frac{1}{75}\,(0.490 + \ldots + 0.949 + \ldots + 0.380) = 0.448$$

and similarly for m = 3, 4, 5 (see Appendix (2) for the complete estimation of $\hat\mu^{(t)}$). By equation (17) the MI point estimate for $\mu$ is simply the average:

$$\hat\mu_{MI} = \frac{1}{5}\,(0.430 + 0.448 + \ldots + 0.450) = 0.442$$

To obtain a standard error for the MI estimate, we first calculate the "between-imputation" variance by equation (18):

$$B = \widehat{\operatorname{Var}}(\hat\mu)_{between} = \frac{1}{5-1}\,\big[(0.430-0.442)^2 + (0.448-0.442)^2 + \ldots + (0.450-0.442)^2\big] = 0.0001$$

and by equation (19) the "within-imputation" variance is:

$$W = \widehat{\operatorname{Var}}(\hat\mu)_{within} = \frac{1}{5}\,(0.112 + 0.105 + \ldots + 0.128) = 0.111$$
Then the estimated total variance is:

$$T = \widehat{\operatorname{Var}}(\hat\mu_{MI}) = W + \Big(1 + \frac{1}{m}\Big)B = 0.111 + \Big(1 + \frac{1}{5}\Big)(0.0001) = 0.112$$

so that the Standard Deviation (MI) $= \sqrt{0.112} \approx 0.334$.
The estimated fraction of missing information about $\mu$ due to nonresponse is:

$$\hat\gamma = \frac{(1 + 1/m)\,B}{T} = \frac{(1 + 1/5)(0.0001)}{0.112} \approx 0.0012$$
These results are summarized in the following table:

Table (3-5) Pooling the multiple imputation analyses
  m     Estimate   Variance    MI estimate   B (between)   W (within)   T (total)   Fraction missing
 m=1     0.430      0.112
 m=2     0.448      0.105
 m=3     0.431      0.105         0.442         0.0001        0.111        0.112          0.0012
 m=4     0.450      0.107
 m=5     0.450      0.128
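The pooled quantities in Table (3-5) can be reproduced with a few lines of numpy (the five per-imputation means and variances are copied from the table):

```python
import numpy as np

est = np.array([0.430, 0.448, 0.431, 0.450, 0.450])   # the five estimates of mu
var = np.array([0.112, 0.105, 0.105, 0.107, 0.128])   # their per-imputation variances
m = est.size

mu_mi = est.mean()                  # approx. 0.442
B = est.var(ddof=1)                 # approx. 0.0001, between-imputation variance
W = var.mean()                      # approx. 0.111, within-imputation variance
T = W + (1 + 1 / m) * B             # approx. 0.112, total variance
gamma = (1 + 1 / m) * B / T         # fraction of missing information
print(round(mu_mi, 3), round(np.sqrt(T), 3), round(gamma, 4))
```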
Based on Table (3-6), we obtain Table (3-7), the complete data set as estimated by MI.
Table (3-7) The complete data set estimated by MI (imputed values marked *)

Factor A  Factor B    r1       r2       r3       ȳi     Si²
a1        b1         0.49    0.310*   0.83      0.54   0.07
a1        b2         0.42    0.60     0.74      0.59   0.03
a1        b3         0.412*  0.21     0.28      0.30   0.01
a1        b4         0.08    0.16     0.376*    0.21   0.02
a1        b5         0.00    0.00     0.00      0.00   0.00
a2        b1         0.42    0.77     0.461*    0.55   0.04
a2        b2         0.62    1.07     0.84      0.84   0.05
a2        b3         0.30    0.706*   0.66      0.56   0.05
a2        b4         0.13    0.16     0.40      0.23   0.02
a2        b5         0.333*  0.03     0.24      0.20   0.02
a3        b1         0.46    0.92     1.07      0.82   0.10
a3        b2         0.80    0.54     1.00      0.78   0.05
a3        b3         0.28    0.289*   0.48      0.35   0.01
a3        b4         0.548*  0.27     0.84      0.55   0.08
a3        b5         1.67    0.24     0.719*    0.88   0.53
a4        b1         0.401*  0.04     0.517*    0.32   0.06
a4        b2         0.26    0.17     0.61      0.35   0.05
a4        b3         0.20    0.00     0.54      0.25   0.07
a4        b4         0.19    0.072*   0.13      0.13   0.00
a4        b5         0.11    0.02     0.26      0.13   0.01
a5        b1         0.80    0.62     0.298*    0.57   0.06
a5        b2         0.466*  0.41     0.89      0.59   0.07
a5        b3         0.51    0.73     0.78      0.67   0.02
a5        b4         0.24    0.27     0.41      0.31   0.01
a5        b5         0.09    0.567*   0.38      0.35   0.06
Chapter Four Conclusions and recommendations
4-1 Conclusions
After reviewing the ordinary and modern methods of estimating the parameters (multiple imputation for missing data in factorial (R.C.B.D) experiments) and checking the variation of the results, the following conclusions were drawn:
1- A multilevel regression model with the MI method was used to estimate the parameters for the incomplete data set, and the results were compared with those of the initial complete data set. It could be noticed that:
a- For $\hat\mu$, the difference between the two estimates was (0.02).
b- The difference between the two variances $\widehat{\operatorname{var}}(\hat\mu)$ was (0.002); this means that estimation by the MI method for missing data in factorial (R.C.B.D) experiments is a good method.
2- The standard deviations are identical (to two decimal places) for the two cases (before and after deletion), where the results are equal to (0.33). This essentially zero difference between the two cases indicates that MI is a good method for estimating missing data.
3- The multilevel estimates of the parameters were B[intercept] = 0.4796, Sigma Squared = 0.1261 and D[intercept] = 0.0074; these are the parameter estimates of the joint posterior distribution used for the imputation.
4- The MICE algorithm is a conceptually simple, flexible and practical way to generate multivariate MI. For each incomplete variable the user can choose a set of predictors that will be used for imputation, which is useful for imputing large data sets. Passive imputation is a built-in feature that keeps transformed data in sync with their original values.
4-2 Recommendations
1- Other practical enhancements are recommended, including the use of constrained imputations, allowance for survey weights, the possibility of specifying interaction terms, and the development of additional elementary imputation methods, e.g. for Poisson regression. We expect that most of these features can be built into the existing software without too much trouble.
2- We recommend using MI in other fields of statistics such as time series, quality control, etc.
3- For further work on MI, we recommend using information-theory tools to judge the selection of the best model.
References
[1] Al-Shwani, A. (2002) A study of methods of analysing factorial experiments with a fixed model (in Arabic). M.Sc. thesis in Statistics, College of Administration and Economics.
[2] Allison, Paul D. (2007) Multiple Imputation for Missing Data: A Cautionary Tale. Pennsylvania: Sociology Department, University of Pennsylvania. http://www.ssc.upenn.edu/~allison/MultInt99.pdf (1/6/2007).
[3] Allison, P. D. (2002) Missing Data. Sage University Papers on Quantitative Applications in the Social Sciences (Series no. 07-136). Thousand Oaks, CA: Sage Publications, Inc.
[4] Congdon, P. (2001) Bayesian Statistical Modelling (First Edition). New York:
[5] Euredit (2000) "The development and evaluation of new methods for editing and imputation". York, UK: University Press. Internet page: www.cs.york.ac.uk/euredit, University of York, United Kingdom.
[6] Fogarty, David J. (2000) Multiple Imputation as a Missing Data Approach to Reject Inference on Consumer Credit Scoring. Phoenix, Arizona: University of Phoenix Press. http://interstat.statjournals.net/YEAR/2006/articles/0609001.pdf
[7] Kennickell, A. B. (1998) Multiple Imputation in the Survey of Consumer Finances. http://www.federalreserve.gov/pubs/oss/oss2/papers/impute98.pdf (26/8/2006).
[8] Little, R. J. A., and Rubin, D. B. (2002) Statistical Analysis with Missing Data (2nd edition). Hoboken, New Jersey: John Wiley and Sons, Inc.
[9] National Center for Statistics and Analysis Research and Development (2002) Transitioning to Multiple Imputation – A New Method to Impute Missing Blood Alcohol Concentration (BAC) Values in FARS.
[10] Oudshoorn, K., Van Buuren, S., and Van Rijckevorsel, J. (1999) Flexible multiple imputation by chained equations of the AVO-95 Survey. Technical Report PG/VGZ/99.045, TNO Prevention and Health, Public Health, POB 2215, 2301 CE Leiden. Available at http://www.multiple-imputation.com (2/9/2006).
[11] Poirier, C. (1999) "A Functional Evaluation of Edit and Imputation Tools". Rome, Italy: Proceedings of the Workshop on Data Editing, UN-ECE.
[12] Royston, P. (2005) Multiple imputation of missing values. Stata Technical Journal, 5(4):527-536.
[13] Rubin, D. B. (1987) Multiple Imputation for Non-response in Surveys. New York: John Wiley & Sons, Inc.
[14] Schafer, J. L. (1999b) NORM: Multiple imputation of incomplete multivariate data under a normal model (version 2). Software for Windows 95/98/NT, available at http://www.stat.psu.edu/~jls/misoftwa.html.
[15] Schafer, J. L. and Graham, J. W. (2002) Missing Data: Our View of the State of the Art. Pennsylvania State, USA: American Psychological Association, Inc. http://www.nyu.edu/classes/shrout/G89-2247/Schafer&Graham2002.pdf (1/9/2006).
[16] Schafer, J. L. (1997) Analysis of Incomplete Multivariate Data. London: Chapman and Hall.
[17] Statistical Solutions (2001) Solas v. 3.0 Software for missing data analysis. 8 South Bank, Crosse's Green, Cork, Ireland, http://www.statsol.ie.
[18] Tabachnik, B. G., and Fidell, L. S. (2006) Using Multivariate Statistics (5th Edition). Boston: Allyn & Bacon.
[19] UN/ECE. "List of Documents". Internet page of the 1997 Workshop on Statistical Data Editing, www.unece.org/stats/documents/1997.10.sde.htm, Czech Republic (Prague), 1997.
[20] UN/ECE. "List of Documents". Internet page of the 1999 Workshop on Statistical Data Editing, www.unece.org/stats/documents/1999.06.sde.htm, Italy (Rome), 1999.
[21] UN/ECE. "List of Documents". Internet page of the 2000 Workshop on Statistical Data Editing, www.unece.org/stats/documents/2000.10.sde.htm, United Kingdom (Cardiff), 2000.
[22] Van Buuren, S. and Oudshoorn, K. (1999) Flexible multiple imputation by MICE. Technical Report PG/VGZ/99.054, TNO Prevention and Health, Public Health, POB 2215, 2301 CE Leiden. Available from http://www.multiple-imputation.com. http://web.inter.nl.net/users/S.van.Buuren/mi/docs/rapport99054.pdf (2/9/2006).
[23] Van Buuren, S. [et al.] (2005) Fully Conditional Specification in Multivariate Imputation. Leiden: TNO Quality of Life. http://web.inter.nl.net/users/S.van.Buuren/publications/FCS%20(revised%20Jan%202005).pdf
[24] Van Buuren, S., Oudshoorn, C. G. M. (2000) Multivariate Imputation by Chained Equations. Leiden: TNO Preventie en Gezondheid. http://web.inter.nl.net/users/S.van.Buuren/mi/docs/Manual.pdf
[25] Wayman, J. C. (2003) Multiple Imputation for Missing Data: What Is It and How Can I Use It? Chicago, USA: American Educational Research Association.
[26] Winkler, B. (1999) Draft Glossary of Terms Used in Data Editing. Rome, Italy: Proceedings of the Workshop on Data Editing, UN-ECE.
[27] Yan He (2006) Missing Data Imputation for Tree-Based Models. A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Statistics, University of California, Los Angeles. http://theses.stat.ucla.edu/47/phd_thesis_YanHe.pdf (31/8/2006).
Certification of the supervisor
We certify that this thesis “Multiple Imputation method for Missing data in (R.C.B.D) Factorial Experiments” was prepared under our supervision at the Department of Statistics in the College of Administration and Economics at the University of Sulaimani; and we do hereby recommend it to be accepted as a partial fulfillment of the requirements for the degree of Master of Science in Statistics.
Signature:
Signature:
Supervisor: Dr. Shawnem A. Muhayaddin
Supervisor: Dr. Wasfi T. Saalih
Assistant professor
Assistant professor
Date:
Date:
/ /2007
/ /2007
Recommendation of the chairman of scientific and higher education committee. In view of the available recommendations I forward this thesis for debate by the examining committee.
Signature: Name: Muhammad R. Sa'eed Date:
/ /2007
Certification of the Language Supervisor
I certify that this thesis entitled “Multiple Imputation method for Missing data in (R.C.B.D) Factorial Experiments” was revised by me. It is now free from spelling and grammatical mistakes and is ready for discussion.
Signature: Name: Aree M. Abdulrahman Date:
/ /2007
Members of Examining Committee
We, members of the Examining Committee, certify that we have read this thesis "Multiple Imputation method for Missing data in (R.C.B.D) Factorial Experiments" and, as an examining committee, we examined the student "Hawkar Qasim Pirdawd" in its contents, and that in our opinion it merits granting the degree of Master of Science in Statistics with an average of (Very Good).
Signature: Name: Dr. Samir M. Khidir
Signature: Name: Dr. Abdulrahim Kh. Rahi
Assistant professor
Assistant professor
Committee head Date: / /2007
Member Date: / /2007
Signature:
Signature:
Name: Dr. Beston M. Abdulkarim Instructor Member Date: / /2007
Name: Dr. Shawnm A. Muhayaddin Assistant professor Member & Supervisor Date: / /2007
Signature: Name: Dr. Wasfi T. Saaleh, Assistant professor, Member & Supervisor, Date: / /2007
Approval by the College Council
The council of the College of Administration and Economics at Sulaimani University approved the decision arrived at by the Examining Committee.
Signature: Name: Dr. Arass H. Mahmood Assistant professor
To obtain the completed data estimation, in the final step the (m) results for each missing value are averaged. This is illustrated in the excerpt below (the full listing is given in Appendix (2)); only the rows for two imputed cells (ID 3 = a1b3r1 and ID 38 = a3b3r2) are shown, since the neighbouring observed cells (IDs 1, 2, 36 and 37) keep their values 0.490, 0.420, 0.920 and 0.540 in every imputation.

Table (3-6): Finding the missing values from the (m) results (excerpt)
 ID   Factor   Replication   Class   DATA    m
  3    a1 b3       r1           1    0.384   1
 38    a3 b3       r2           8    0.169   1
  3    a1 b3       r1           1    0.949   2
 38    a3 b3       r2           8    0.563   2
  3    a1 b3       r1           1    0.140   3
 38    a3 b3       r2           8    0.406   3
  3    a1 b3       r1           1    0.350   4
 38    a3 b3       r2           8    0.241   4
  3    a1 b3       r1           1    0.237   5
 38    a3 b3       r2           8    0.067   5

For example, the MI estimate for the missing cell (a1 b3 r1) is the average of its five imputed values:

$$\hat\theta_{(a_1 b_3 r_1)} = \frac{1}{5}\sum_{t=1}^{5}\hat\theta^{(t)} = \frac{1}{5}\,(0.384 + 0.949 + 0.140 + 0.350 + 0.237) = 0.412$$

Averaging in the same way for every cell of the listing gives the following estimates (cells marked * were missing and are therefore imputed; the unmarked cells are observed values carried through unchanged):

 Cell      Estimate      Cell      Estimate      Cell      Estimate
 a1b1r1     0.490        a2b5r2     0.030        a4b4r2     0.072*
 a1b3r1     0.412*       a2b1r3     0.461*       a4b5r2     0.020
 a1b1r2     0.310*       a3b4r1     0.548*       a4b1r3     0.517*
 a1b2r2     0.600        a3b5r1     1.670        a4b2r3     0.610
 a1b4r3     0.376*       a3b1r2     0.920        a5b2r1     0.466*
 a2b4r1     0.130        a3b2r2     0.540        a5b3r1     0.510
 a2b5r1     0.333*       a3b3r2     0.289*       a5b4r2     0.270
 a2b1r2     0.770        a3b4r2     0.270        a5b5r2     0.567*
 a2b2r2     1.070        a3b5r3     0.719*       a5b1r3     0.298*
 a2b3r2     0.706*       a4b1r1     0.401*       a5b5r3     0.380
 a2b4r2     0.160        a4b2r1     0.260