The Canadian Journal of Statistics Vol. 40, No. 3, 2012, Pages 604–618 La revue canadienne de statistique


Imputation for statistical inference with coarse data

Jae Kwang KIM1* and Minki HONG2

1 Department of Statistics, Iowa State University, Ames, IA 50011, USA
2 Korean Labor Institute, Seoul 150-740, Korea

Key words and phrases: EM algorithm; fractional imputation; grouped data; measurement error models.
MSC 2010: Primary 62D05; secondary 62G09.

Abstract: Coarse data is a general type of incomplete data that includes grouped data, censored data, and missing data. The likelihood-based estimation approach with coarse data is challenging because the likelihood function is in integral form. The Monte Carlo EM algorithm of Wei & Tanner [Wei & Tanner (1990). Journal of the American Statistical Association, 85, 699–704] is adapted to compute the maximum likelihood estimator in the presence of coarse data. Stochastic coarse data is also covered and the computation can be implemented using the parametric fractional imputation method proposed by Kim [Kim (2011). Biometrika, 98, 119–132]. Results from a limited simulation study are presented. The proposed method is also applied to the Korean Longitudinal Study of Aging (KLoSA). The Canadian Journal of Statistics 40: 604–618; 2012 © 2012 Statistical Society of Canada

Résumé: Les données brutes sont un type général de données incomplètes qui incluent les données groupées, les données censurées et les données manquantes. Il est difficile d'utiliser l'estimation par le maximum de vraisemblance avec les données brutes parce que la fonction de vraisemblance est sous forme d'intégrale. L'algorithme EM par Monte-Carlo de Wei et Tanner (1990) est adapté pour calculer l'estimateur du maximum de vraisemblance en présence de données brutes. Cette méthode permet aussi l'utilisation des données brutes stochastiques et l'évaluation peut être implantée en utilisant l'imputation fractionnaire paramétrique proposée par Kim (2011). Nous présentons les résultats d'une petite étude de simulation. La méthode proposée est aussi appliquée à des données longitudinales coréennes sur le vieillissement (KLoSA). La revue canadienne de statistique 40: 604–618; 2012 © 2012 Société statistique du Canada

1. INTRODUCTION
Statistical methods for handling incomplete data have been an area of active interest in recent decades (Little & Rubin, 2002). Incomplete data has been studied mostly as a missing data problem, where each data value is either perfectly known or completely unknown. Imputation, which can be roughly described as a technique for filling in missing values with plausible values, is a popular way of handling missing data. Once all missing values have been imputed, the dataset can be analyzed using standard techniques for complete data. Haziza (2009) provided a comprehensive review of the statistical issues in imputation, especially in the context of survey sampling.

In many practical situations, however, data are neither entirely missing nor completely observed. Instead, only partial information about the true, unobservable data is available. This kind of incomplete data is referred to as coarse data. Coarse data is frequently encountered in

* Author to whom correspondence may be addressed. E-mail: [email protected]

many areas of statistics. A simple form of coarse data is grouped data, where continuous data values are observed or reported in grouped form. Data rounded to the nearest integer are a special case of grouped data. Fisher (1922) considered the estimation problem for grouped data, and Heitjan (1989) provided a review of the literature on the analysis of grouped data.

An observation in grouped data can be viewed as $\tilde{y}_i = T(y_i)$, where $T(y_i)$ is a many-to-one transformation of $y_i$, and $y_i$ is the conceptually true value of item $y$ for unit $i$. That is, instead of observing $y_i$, we observe $\tilde{y}_i$, a grouped version of $y_i$, which contains only partial information on $y_i$. For example, many surveys offer bracket categories to respondents who initially refused or were unable to provide an exact value for their assets or income; brackets represent partial responses to asset questions. In the Korean Longitudinal Study of Aging (KLoSA), which motivated our research, the study variable $y_i$ is the amount of financial assets held by household $i$. Those who do not report $y_i$, whether for confidentiality reasons or for lack of exact knowledge, are asked to answer one of the group categories. The categorized value of $y_i$ is denoted by $\tilde{y}_i$. The value of $\tilde{y}_i$ is not needed if $y_i$ is observed; however, if $y_i$ is missing, $\tilde{y}_i$ gives a rough idea about the true value of $y_i$. Thus, we can express $\tilde{y}_i = T(y_i)$, with $T(\cdot)$ a known function. The parameter of interest is $\theta$ in $f(y; \theta)$, the marginal density of $y$. The function $T(\cdot)$ characterizes the nature of the grouping. If the parameter of interest is $\theta$ in the conditional density $f(y \mid x; \theta)$, the information about $\theta$ is contained in the conditional density of $\tilde{y}$, which takes the form

\[
f(\tilde{y} \mid x; \theta) = \int_{R(\tilde{y})} f(y \mid x; \theta) \, dy, \qquad (1)
\]

where $R(\tilde{y}) = \{ y : T(y) = \tilde{y} \}$ is the set of original values $y$ satisfying $T(y) = \tilde{y}$. Direct computation of the maximum likelihood (ML) estimator based on the conditional density in (1) is challenging because the density is expressed as an integral. Numerical integration can be used to find the ML estimator, but the computation can be cumbersome or practically impossible for vector-valued $y$. Lindley (1950) and Tallis (1967), inspired by the work of Fisher (1922), discussed approximate ML methods for grouped data. Heitjan & Rubin (1991) provided a rigorous treatment of the coarse data mechanism and discussed the ignorability condition. Heitjan & Rubin (1990) used multiple imputation from a Bayesian point of view, but they did not cover ML estimation directly. Heeringa (1995) proposed Bayesian imputation in the context of grouped data. Taylor & Yu (2002) discussed the problem in the regression context.

ML estimation for coarse data maximizes the likelihood based on the marginal density of the coarse data, which can be computationally challenging because it involves integrating the original likelihood over the missing part. Instead of computing the marginal density directly, we use a version of the EM algorithm for coarse data. To compute the conditional expectation in the E-step, the Monte Carlo approximation of Wei & Tanner (1990), originally proposed to handle missing data, can be adapted to handle coarse data. The Monte Carlo EM can be viewed as an imputation method: once imputed data are created, we can estimate the parameters using standard techniques for complete data. To our knowledge, applying the Monte Carlo EM algorithm to coarse data has not been fully addressed in the literature. In this paper, we provide a unified treatment of coarse data based on the frequentist approach to imputation.
As noted by Wang & Robins (1998), frequentist imputation is generally more efficient than Bayesian imputation in that the imputed estimator has a smaller variance. The proposed method can be used for a more general class of coarse data analysis in the sense that it can be extended to stochastic coarse data, where the transformation from $y_i$ to $\tilde{y}_i = T(y_i)$ is not

deterministic. That is, the transformation function $T(y)$ is itself a random variable. Measurement error models can be treated as a special case of the stochastic coarse data problem. Furthermore, we also propose an alternative computation method using parametric fractional imputation, as recently proposed by Kim (2011). Parametric fractional imputation has some computational advantages over the classical Monte Carlo EM method.

The paper is organized as follows. Section 2 discusses the basic setup. In Section 3, the proposed method is presented. In Section 4, results from a simulation study are presented. The proposed method is applied to the Korean Longitudinal Study of Aging (KLoSA) in Section 5.

2. BASIC SETUP
Let $(x_i, y_i)$, $i = 1, 2, \ldots, n$, be a random sample of size $n$ from a distribution with density $f(y \mid x; \theta) h(x)$, where $f(y \mid x; \theta)$ is the conditional density of $y$ given $x$ with parameter $\theta$, and $h(x)$ is the marginal density of $x$, which need not have a specific parametric form. We allow $x_i$ to be vector-valued. If $(x_i, y_i)$, $i = 1, 2, \ldots, n$, were completely observed throughout the sample, the ML estimator of $\theta$ could be obtained by solving the score equation

\[
S_n(\theta) \equiv \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \ln f(y_i \mid x_i; \theta) = 0.
\]

Now, in the presence of coarse data, define

\[
\delta_i =
\begin{cases}
1 & \text{if } y_i \text{ is observed} \\
0 & \text{otherwise,}
\end{cases}
\]

and assume that we observe $\tilde{y}_i$ instead of $y_i$ if $\delta_i = 0$. For example, for censored data with $\delta_i = 0$ if and only if $y_i > c$ for some $c$, we have $\tilde{y}_i = c$. In the standard missing data problem, $\tilde{y}_i$ is empty. The final observation is

\[
\left\{ \left( x_i, \delta_i, \delta_i y_i + (1 - \delta_i) \tilde{y}_i \right) ;\ i = 1, 2, \ldots, n \right\}. \qquad (2)
\]
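As a concrete illustration of the observed-data structure in (2), the following toy sketch (our own code, with a floor-type coarsening $T$ chosen purely as an assumed example) stores each unit as the triple $(x_i, \delta_i, \delta_i y_i + (1 - \delta_i)\tilde{y}_i)$:

```python
import math

# Toy illustration of the final observation (2): each unit i is stored as
# (x_i, delta_i, value), where value = y_i when delta_i = 1 and the
# coarsened report y~_i = T(y_i) when delta_i = 0.
def coarsen_floor(y):
    """Deterministic coarsening T(y): report only the bracket [floor(y), floor(y)+1)."""
    return math.floor(y)

def observe(x, y, delta):
    return (x, delta, y if delta == 1 else coarsen_floor(y))

units = [observe(1.2, 15.3, 1),   # fully observed: value is y itself
         observe(0.7, 15.3, 0)]   # coarsened: only the bracket [15, 16) is known

print(units)  # [(1.2, 1, 15.3), (0.7, 0, 15)]
```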

We assume that $\delta_i = 1$ for some portion of the sample; if $\delta_i = 0$ for all $i$, then some of the parameters may not be identifiable. Assume that the coarsening mechanism is parametrically modelled with density

\[
g\left( \tilde{y}_i \mid x_i, y_i, \delta_i; \phi \right) \qquad (3)
\]

for some parameter $\phi$. Note that $g(\tilde{y}_i \mid x_i, y_i, \delta_i; \phi)$ is a degenerate density at $\tilde{y}_i = y_i$ if $\delta_i = 1$. Assume that the response mechanism is $\delta_i \mid (x_i, y_i) \sim \mathrm{Bernoulli}(p_i)$, where $p_i = p(x_i, y_i; \psi)$ is the response probability, known up to $\psi$. We also assume that

\[
\Pr(\delta = 0 \mid x, y, \tilde{y}) = \Pr(\delta = 0 \mid x, y), \qquad (4)
\]

which holds when the sigma algebra $\sigma(\tilde{y})$ generated by $\tilde{y}$ is a subset of the sigma algebra $\sigma(y)$ generated by $y$.

The observed likelihood function of $(\theta, \phi, \psi)$ is the marginal likelihood of the final observation in (2) and can be written as

\[
L_{\mathrm{obs}}(\theta, \phi, \psi) = \prod_{\delta_i = 1} f(y_i \mid x_i; \theta)\, p(x_i, y_i; \psi)
\times \prod_{\delta_i = 0} \int f(y_i \mid x_i; \theta)\, g_0(\tilde{y}_i \mid x_i, y_i; \phi)\, \{1 - p(x_i, y_i; \psi)\} \, dy_i,
\]

where $g_0(\tilde{y}_i \mid x_i, y_i; \phi) = g(\tilde{y}_i \mid x_i, y_i, \delta_i; \phi)$ for $\delta_i = 0$. To compute the ML estimator $(\hat{\theta}_{\mathrm{MLE}}, \hat{\phi}_{\mathrm{MLE}}, \hat{\psi}_{\mathrm{MLE}})$ that maximizes the observed likelihood, we solve the observed score equation

\[
S_{\mathrm{obs}}(\theta, \phi, \psi) \equiv \frac{\partial}{\partial (\theta', \phi', \psi')} \ln L_{\mathrm{obs}}(\theta, \phi, \psi) = 0. \qquad (5)
\]

Finding the solution to Equation (5) is not easy because of its integral form. The following lemma, discussed in Dempster, Laird, & Rubin (1977) and Louis (1982), provides an alternative expression for the observed score function.

Lemma 1. Under the condition that the order of integration and differentiation is exchangeable, we have

\[
S_{\mathrm{obs}}(\theta, \phi, \psi) = \left( \bar{S}_1(\theta, \phi, \psi), \bar{S}_2(\theta, \phi, \psi), \bar{S}_3(\theta, \phi, \psi) \right), \qquad (6)
\]

where

\[
\bar{S}_1(\theta, \phi, \psi) = \sum_{\delta_i = 1} S_1(\theta; x_i, y_i) + \sum_{\delta_i = 0} E\left\{ S_1(\theta; x_i, y_i) \mid x_i, \tilde{y}_i, \delta_i = 0 \right\},
\]
\[
\bar{S}_2(\theta, \phi, \psi) = \sum_{\delta_i = 0} E\left\{ S_2(\phi; x_i, y_i, \tilde{y}_i) \mid x_i, \tilde{y}_i, \delta_i = 0 \right\},
\]
\[
\bar{S}_3(\theta, \phi, \psi) = \sum_{\delta_i = 1} S_3(\psi; x_i, y_i, \delta_i) + \sum_{\delta_i = 0} E\left\{ S_3(\psi; x_i, y_i, \delta_i) \mid x_i, \tilde{y}_i, \delta_i = 0 \right\},
\]

and $S_1(\theta; x_i, y_i) = \partial \ln f(y_i \mid x_i; \theta)/\partial \theta$, $S_2(\phi; x_i, y_i, \tilde{y}_i) = \partial \ln g_0(\tilde{y}_i \mid x_i, y_i; \phi)/\partial \phi$, and $S_3(\psi; x_i, y_i, \delta_i) = \partial \{\delta_i \ln p_i + (1 - \delta_i) \ln(1 - p_i)\}/\partial \psi$ are the score functions for $\theta$, $\phi$, and $\psi$, respectively.

By Lemma 1, we have only to compute the conditional expectation of the original score function to obtain the observed score function. Often the conditional expectation does not have a closed form and we rely on a Monte Carlo method to approximate the expectation. Note that, by Bayes' theorem and by (4), the conditional expectation in (6) involves integration with respect to the density

\[
f\left( y_i \mid x_i, \tilde{y}_i, \delta_i = 0 \right) = \frac{ f(y_i \mid x_i)\, g_0(\tilde{y}_i \mid x_i, y_i)\, \{1 - p(x_i, y_i)\} }{ \int f(y_i \mid x_i)\, g_0(\tilde{y}_i \mid x_i, y_i)\, \{1 - p(x_i, y_i)\} \, dy_i }. \qquad (7)
\]

In the pure missing data problem, $\tilde{y}_i$ is empty when $\delta_i = 0$ and the conditional density in (7) reduces to

\[
f(y_i \mid x_i, \delta_i = 0) = \frac{ f(y_i \mid x_i)\, \{1 - p(x_i, y_i)\} }{ \int f(y_i \mid x_i)\, \{1 - p(x_i, y_i)\} \, dy_i }. \qquad (8)
\]

For deterministic coarse data, the coarsening function $T(y)$ is fixed and

\[
g_0(\tilde{y}_i \mid x_i, y_i) =
\begin{cases}
1 & \text{if } T(y_i) = \tilde{y}_i \\
0 & \text{otherwise,}
\end{cases}
\]

so that the conditional density in (7) reduces to

\[
f\left( y_i \mid x_i, \tilde{y}_i, \delta_i = 0 \right) = \frac{ I\left\{ y_i \in R(\tilde{y}_i) \right\} f(y_i \mid x_i; \theta)\, \{1 - p(x_i, y_i)\} }{ \int_{R(\tilde{y}_i)} f(y_i \mid x_i; \theta)\, \{1 - p(x_i, y_i)\} \, dy_i }. \qquad (9)
\]

If the response probability depends on $y_i$ only through $\tilde{y}_i = T(y_i)$, that is,

\[
p(x_i, y_i) = p(x_i, \tilde{y}_i) \qquad (10)
\]

for all $y_i$ such that $\tilde{y}_i = T(y_i)$ and all $i = 1, \ldots, n$, then the response mechanism is called "missing at random" in the sense of Rubin (1976) or, more generally, "coarsening at random" (CAR) as described in Heitjan & Rubin (1991). In the pure missing data problem, $\tilde{y}$ is empty and condition (10) means that $p(x_i, y_i)$ is a function of $x_i$ only. Under CAR, (9) further reduces to

\[
f\left( y_i \mid x_i, \tilde{y}_i, \delta_i = 0 \right) = \frac{ I\left\{ y_i \in R(\tilde{y}_i) \right\} f(y_i \mid x_i; \theta) }{ \int_{R(\tilde{y}_i)} f(y_i \mid x_i; \theta) \, dy_i }. \qquad (11)
\]
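Under CAR, drawing from (11) amounts to sampling from $f(y \mid x; \theta)$ restricted to $R(\tilde{y})$, which is exactly what rejection sampling delivers. A minimal numerical check of this equivalence, with our own toy choices $f = N(0, 1)$ and bracket $R(\tilde{y}) = [0.5, 1)$:

```python
import math
import random

random.seed(5)

# Draw from N(0, 1) and keep only draws falling in R(y~) = [a, b): this is
# rejection sampling from the truncated density in (11).
a, b = 0.5, 1.0
draws = []
while len(draws) < 4000:
    y = random.gauss(0, 1)
    if a <= y < b:               # keep only draws with T(y) = y~
        draws.append(y)
mc_mean = sum(draws) / len(draws)

phi = lambda v: math.exp(-0.5 * v * v) / math.sqrt(2 * math.pi)
Phi = lambda v: 0.5 * (1 + math.erf(v / math.sqrt(2)))
exact = (phi(a) - phi(b)) / (Phi(b) - Phi(a))   # truncated N(0,1) mean on [a, b)
print(mc_mean, exact)            # both approximately 0.73
```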

In the case of stochastic coarse data, we often need a validation sub-sample among the units with $\delta_i = 0$ to obtain a consistent estimator of $\phi$ in $g_0(\tilde{y}_i \mid x_i, y_i; \phi)$; see Appendix B.

Generating Monte Carlo samples from (7) is called imputation. Thus, imputation can be viewed as a tool for Monte Carlo approximation of the conditional expectation given the observed data. If a proper imputation method is used, the resulting ML estimator from the imputed score function converges to the true ML estimator from the observed score function, although a satisfactory approximation may require a large Monte Carlo sample size.

3. PROPOSED METHOD
We now consider a Monte Carlo method for computing the conditional expectation in (6). There are two problems in generating imputed values from (7). First, we do not know the parameters in (7). Second, even if we knew the parameters, it is not obvious how to generate samples from the conditional distribution. If the coarsening mechanism is deterministic and CAR holds, we have only to generate the Monte Carlo samples from (11). In this case, one can adapt the Monte Carlo EM (MCEM) algorithm of Wei & Tanner (1990), which originally considered only missing data problems and generated Monte Carlo samples from (8), to coarse data analysis. The adapted MCEM method for generating samples from (11) can be described as follows:

[Step 1] Start from the cases where $y_i$ is observed and compute $\hat{\theta}_0$, the initial parameter estimate.

[Step 2] For each $i$ with $\delta_i = 0$, generate $M$ independent imputed values from

\[
y_i^{*(m)} \overset{\mathrm{i.i.d.}}{\sim} f\left( y_i \mid x_i; \hat{\theta} \right), \qquad m = 1, 2, \ldots, M, \qquad (12)
\]

where $\hat{\theta}$ is the current estimate of $\theta$. If

\[
T\left( y_i^{*(m)} \right) = \tilde{y}_i \qquad (13)
\]

holds, then accept $y_i^{*(m)}$ as the $m$th imputed value for $y_i$; otherwise, generate $y_i^{*(m)}$ again from (12) until (13) holds.

[Step 3] Use the imputed data to update the parameter estimate by solving the imputed score equation

\[
\sum_{\delta_i = 1} S_1(\theta; x_i, y_i) + \sum_{\delta_i = 0} \frac{1}{M} \sum_{m=1}^{M} S_1\left( \theta; x_i, y_i^{*(m)} \right) = 0, \qquad (14)
\]

where $S_1(\theta; x_i, y_i)$ is defined after (6).

[Step 4] Go to Step 2. Continue until convergence is reached.

[Step 2] implements the E-step of the EM algorithm and [Step 3] corresponds to the M-step. In the E-step, a rejection sampling method is used. Under CAR and a deterministic coarsening mechanism, (13) occurs with positive probability for each unit $i$ with $\delta_i = 0$ in the sample. In the pure missing data problem with MAR, the rejection step in (13) is not needed. The Monte Carlo sample size $M$ needs to be sufficiently large for a good approximation.

Example 1.

Consider the normal-theory simple linear regression model

\[
y_i \mid x_i \sim N\left( x_i \beta, \sigma^2 \right), \qquad (15)
\]

where $x_i = (x_{i1}, \ldots, x_{iK})$ is a $1 \times K$ vector of regressors and $\beta = (\beta_1, \beta_2, \ldots, \beta_K)'$ is a $K \times 1$ vector of coefficients. Let $\theta = (\beta, \sigma^2)$ be the vector of parameters to be estimated. If $(x_i, y_i)$ are completely observed, then the maximum likelihood estimate (MLE) of $\theta$ can be computed from

\[
\sum_{i=1}^{n} x_i' (y_i - x_i \beta) = 0
\]

and

\[
\sum_{i=1}^{n} \left\{ (y_i - x_i \beta)^2 - \sigma^2 \right\} = 0.
\]

Now, suppose that for some unit $i$ with $\delta_i = 0$ we observe $\tilde{y}_i$ instead of $y_i$. Then, under CAR, we can generate $M$ imputed values of $y_i$ as

\[
y_i^{*(1)}, \ldots, y_i^{*(M)} \mid x_i \overset{\mathrm{i.i.d.}}{\sim} N\left( x_i \hat{\beta}^{(t)}, \hat{\sigma}^{2(t)} \right), \qquad (16)
\]

subject to $T(y_i^{*(m)}) = \tilde{y}_i$ for $m = 1, 2, \ldots, M$, where $(\hat{\beta}^{(t)}, \hat{\sigma}^{2(t)})$ is the $t$th (current) parameter estimate. The parameters are then updated by solving

\[
\sum_{\delta_i = 1} x_i' (y_i - x_i \beta) + \sum_{\delta_i = 0} \frac{1}{M} \sum_{m=1}^{M} x_i' \left( y_i^{*(m)} - x_i \beta \right) = 0 \qquad (17)
\]

and

\[
\sum_{\delta_i = 1} \left\{ (y_i - x_i \beta)^2 - \sigma^2 \right\} + \sum_{\delta_i = 0} \frac{1}{M} \sum_{m=1}^{M} \left\{ \left( y_i^{*(m)} - x_i \beta \right)^2 - \sigma^2 \right\} = 0. \qquad (18)
\]
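The MCEM iteration (12)–(18) for a model of this form can be sketched as follows. This is our own self-contained simulation: the sample size, Monte Carlo size, number of iterations, and response mechanism (taken as completely at random for simplicity, rather than a logistic model) are illustrative choices, and the coarsening is the floor function.

```python
import math
import random

random.seed(1)

# Simulate y_i | x_i ~ N(b1 + b2*x_i, sigma^2); nonrespondents report floor(y_i).
b1, b2, sigma, n = 0.5, 1.0, 2.0, 400
xs = [random.gauss(20, 2) for _ in range(n)]
ys = [b1 + b2 * x + random.gauss(0, sigma) for x in xs]
deltas = [random.random() < 0.7 for _ in range(n)]   # response indicators (MCAR here)
ytilde = [math.floor(y) for y in ys]                 # grouped report T(y_i) = floor(y_i)

def wls(data):
    """Weighted least squares for (w, x, y) triples; returns (b1, b2, sigma)."""
    sw = sum(w for w, _, _ in data)
    mx = sum(w * x for w, x, _ in data) / sw
    my = sum(w * y for w, _, y in data) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x, _ in data)
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in data)
    slope = sxy / sxx
    inter = my - slope * mx
    s2 = sum(w * (y - inter - slope * x) ** 2 for w, x, y in data) / sw
    return inter, slope, math.sqrt(s2)

# [Step 1] initial estimate from the fully observed cases
est = wls([(1.0, x, y) for x, y, d in zip(xs, ys, deltas) if d])

M = 30
for _ in range(10):                                  # EM iterations
    e1, e2, es = est
    data = [(1.0, x, y) for x, y, d in zip(xs, ys, deltas) if d]
    for x, g, d in zip(xs, ytilde, deltas):
        if d:
            continue
        accepted = 0
        while accepted < M:                          # [Step 2] rejection sampling
            ystar = random.gauss(e1 + e2 * x, es)
            if math.floor(ystar) == g:               # acceptance condition (13)
                data.append((1.0 / M, x, ystar))     # each draw carries weight 1/M
                accepted += 1
    est = wls(data)                                  # [Step 3] solve (17)-(18)

print(est)   # estimates should land near the true values (0.5, 1.0, 2.0)
```

The weight $1/M$ on each imputed value makes `wls` solve exactly the equally weighted imputed score equations (17)–(18), so respondents and nonrespondents each contribute total weight one.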

For fixed $M$, the estimator $\hat{\theta}^{(t)}$ does not necessarily converge in the sense that $\| \hat{\theta}^{(t+1)} - \hat{\theta}^{(t)} \| < \epsilon$ for sufficiently large $t$ and any $\epsilon > 0$. Thus, $M$ may need to be increased as the number of iterations grows (Booth & Hobert, 1999). For $M \to \infty$, the above iterative method can produce the maximum likelihood estimates of the parameters.

The adapted MCEM algorithm in [Step 1]–[Step 4] is applicable when the coarsening mechanism is deterministic and CAR holds. For a more general coarsening mechanism, we need to generate samples from

\[
f\left( y_i \mid x_i, \tilde{y}_i, \delta_i = 0 \right) = \frac{ f(y_i \mid x_i; \hat{\theta})\, g_0(\tilde{y}_i \mid x_i, y_i; \hat{\phi})\, \{1 - p(x_i, y_i; \hat{\psi})\} }{ \int f(y_i \mid x_i; \hat{\theta})\, g_0(\tilde{y}_i \mid x_i, y_i; \hat{\phi})\, \{1 - p(x_i, y_i; \hat{\psi})\} \, dy_i }, \qquad (19)
\]

evaluated at the current estimate $(\hat{\theta}, \hat{\phi}, \hat{\psi})$. In this case, we can use the Metropolis–Hastings algorithm with $f(y_i \mid x_i; \hat{\theta})$ as the proposal distribution for generating the imputed values. Thus, [Step 2] can be replaced by the following Markov chain Monte Carlo (MCMC) method using the Metropolis–Hastings algorithm:

[Step 2*] To generate samples from (19) for the current parameter estimate $(\hat{\theta}, \hat{\phi}, \hat{\psi})$, use the following iterative steps:

(1) Generate $y_i^{*(0)}$ from $f(y_i \mid x_i; \hat{\theta})$.

(2) Given $y_i^{*(t)}$, the following algorithm generates $y_i^{*(t+1)}$:

[a] Sample a candidate value $y_i^{*}$ from $f(y_i \mid x_i; \hat{\theta})$.

[b] Compute

\[
R\left( y_i^{*(t)}, y_i^{*} \right) = \frac{ g_0(\tilde{y}_i \mid y_i^{*}, x_i; \hat{\phi})\, \{1 - p(x_i, y_i^{*}; \hat{\psi})\} }{ g_0(\tilde{y}_i \mid y_i^{*(t)}, x_i; \hat{\phi})\, \{1 - p(x_i, y_i^{*(t)}; \hat{\psi})\} }.
\]

[c] If $R(y_i^{*(t)}, y_i^{*}) \geq 1$, set $y_i^{*(t+1)} = y_i^{*}$. If $R(y_i^{*(t)}, y_i^{*}) < 1$, set

\[
y_i^{*(t+1)} =
\begin{cases}
y_i^{*} & \text{with probability } R(y_i^{*(t)}, y_i^{*}) \\
y_i^{*(t)} & \text{otherwise.}
\end{cases}
\]

[d] Increment $t$ and return to [a].

(3) For sufficiently large $t$, take the last $M$ values of $y_i^{*(t)}$ as the imputed values.
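[Step 2*] can be sketched as below. All distributional choices (the proposal $f(y \mid x; \hat\theta)$, a stochastic "around $\tilde{y}$" coarsening density $g_0$, and the response probability) are our own illustrative assumptions, not values from the paper; because the proposal is $f(y \mid x; \hat\theta)$ itself, the acceptance ratio involves only $g_0$ and $1 - p$:

```python
import math
import random

def norm_pdf(v, mu, sd):
    return math.exp(-0.5 * ((v - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Hypothetical components evaluated at current estimates (illustrative only):
def propose(rng, x):                  # proposal f(y | x; theta_hat) = N(0.5 + x, 2^2)
    return rng.gauss(0.5 + x, 2.0)

def g0(ytilde, y):                    # stochastic report "around y": y~ | y ~ N(y, 1)
    return norm_pdf(ytilde, y, 1.0)

def p_resp(x, y):                     # response probability p(x, y; psi_hat)
    return 1.0 / (1.0 + math.exp(-(2.0 - 0.1 * y)))

def mh_impute(x, ytilde, burn=500, M=500, seed=7):
    """Metropolis-Hastings draws from (19) using f(y | x; theta_hat) as proposal."""
    rng = random.Random(seed)

    def weight(y):                    # target divided by proposal, up to a constant
        return g0(ytilde, y) * (1.0 - p_resp(x, y))

    y = propose(rng, x)               # step (1): initial value
    draws = []
    for _ in range(burn + M):
        cand = propose(rng, x)        # step [a]: candidate value
        r = weight(cand) / weight(y)  # step [b]: acceptance ratio
        if r >= 1 or rng.random() < r:
            y = cand                  # step [c]: accept or keep current value
        draws.append(y)
    return draws[-M:]                 # step (3): keep the last M values

imps = mh_impute(x=20.0, ytilde=21.0)
print(sum(imps) / len(imps))          # pulled from E(y|x) = 20.5 toward the report 21
```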

Givens & Hoeting (2005) provide a comprehensive description of the Metropolis–Hastings algorithm. In the Metropolis–Hastings algorithm, convergence to the stationary distribution is hard to check, and running an MCMC within each EM iteration is computationally demanding. To reduce the computational burden associated with the MCMC method, the parametric fractional imputation (PFI) method proposed by Kim (2011) can be used. Specifically, instead of generating samples via [Step 2*], we simply generate

\[
y_i^{*(1)}, \ldots, y_i^{*(M)} \sim f\left( y_i \mid x_i; \hat{\theta} \right) \qquad (20)
\]

and assign the fractional weights

\[
w_i^{*(m)} = \frac{ g_0(\tilde{y}_i \mid y_i^{*(m)}, x_i; \hat{\phi})\, \{1 - p(x_i, y_i^{*(m)}; \hat{\psi})\} }{ \sum_{k=1}^{M} g_0(\tilde{y}_i \mid y_i^{*(k)}, x_i; \hat{\phi})\, \{1 - p(x_i, y_i^{*(k)}; \hat{\psi})\} } \qquad (21)
\]

to $y_i^{*(m)}$, for $m = 1, 2, \ldots, M$. In the pure missing data problem where $\tilde{y}$ is empty, the target distribution is given by (8), so the fractional weights in (21) reduce to

\[
w_i^{*(m)} = \frac{ 1 - p(x_i, y_i^{*(m)}; \hat{\psi}) }{ \sum_{k=1}^{M} \{ 1 - p(x_i, y_i^{*(k)}; \hat{\psi}) \} }.
\]

The fractional weights are used to compute the imputed score equation: the imputed score equation is a weighted mean of the score function applied to the fractionally imputed data. With fractionally imputed data, the parameter estimates are updated by solving

\[
\sum_{\delta_i = 1} S_1(\theta; x_i, y_i) + \sum_{\delta_i = 0} \sum_{m=1}^{M} w_i^{*(m)} S_1\left( \theta; x_i, y_i^{*(m)} \right) = 0, \qquad (22)
\]

\[
\sum_{\delta_i = 0} \sum_{m=1}^{M} w_i^{*(m)} S_2\left( \phi; x_i, y_i^{*(m)}, \tilde{y}_i \right) = 0,
\]

and

\[
\sum_{\delta_i = 1} S_3(\psi; x_i, y_i, \delta_i) + \sum_{\delta_i = 0} \sum_{m=1}^{M} w_i^{*(m)} S_3\left( \psi; x_i, y_i^{*(m)}, \delta_i \right) = 0,
\]

where $S_1(\theta; x_i, y_i)$, $S_2(\phi; x_i, y_i, \tilde{y}_i)$, and $S_3(\psi; x_i, y_i, \delta_i)$ are defined after (6). Unlike the MCEM method, the EM sequence obtained from the PFI method converges for sufficiently large $t$ without increasing the imputation size $M$; see Theorem 2 of Kim (2011).

Example 2. We now consider a modification of Example 1 where $y_i$ is censored if $y_i > c$ for known $c$. In this case,

\[
f\left( y_i \mid x_i, \tilde{y}_i, \delta_i = 0 \right) = \frac{ I\{ y_i > c \}\, f(y_i \mid x_i; \theta) }{ \int_{y_i > c} f(y_i \mid x_i; \theta) \, dy_i }.
\]

Thus, if $M$ imputed values are generated from (20), the fractional weights are simply computed by

\[
w_i^{*(m)} = \frac{ I\left\{ y_i^{*(m)} > c \right\} }{ \sum_{k=1}^{M} I\left\{ y_i^{*(k)} > c \right\} }.
\]
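The censored-data weights of Example 2 can be checked numerically. In this sketch (our own choice of model $f = N(2, 1)$, censoring point $c = 3$, and Monte Carlo size), the fractionally weighted mean of draws from (20) approximates the truncated-normal conditional expectation:

```python
import random

random.seed(3)

# Example 2 sketch: y is censored at known c, so draws from the current model
# f(y | x; theta_hat) get weight (21), which reduces to the indicator form
# w* = I(y* > c) / sum_k I(y*_k > c).
c, M = 3.0, 5000
draws = [random.gauss(2.0, 1.0) for _ in range(M)]     # f(y|x; theta_hat) = N(2, 1)
flags = [1.0 if y > c else 0.0 for y in draws]
total = sum(flags)
weights = [f / total for f in flags]                   # fractional weights, sum to 1

# Fractionally weighted mean approximates E(y | x, y > c, delta = 0);
# the exact truncated-normal value is 2 + pdf(1)/(1 - cdf(1)) = 3.52 approx.
e_y = sum(w * y for w, y in zip(weights, draws))
print(e_y)
```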

4. SIMULATION STUDY
In this section, we present results from a limited simulation study to illustrate the performance of the proposed method for ML estimation in the presence of grouped data. We generate the simulated data from the linear regression model $y_i \mid x_i \sim N(\beta_1 + \beta_2 x_i, \sigma^2)$, with $\delta_i \mid (x_i, y_i) \sim \mathrm{Bernoulli}(p_i)$, where $x_i \sim N(20, 4)$ and $\log\{ p_i/(1 - p_i) \} = 2.0 - 0.1 x_i$. We set $(\beta_1, \beta_2, \sigma) = (0.5, 1.0, 2.0)$. Variable $y_i$ is observed if $\delta_i = 1$ and is not observed if $\delta_i = 0$; the response rate for $y$ is about 70%. For comparison, we also generated another set of missing data with a 30% response rate by using $\log\{ p_i/(1 - p_i) \} = 0.6 - 0.1 x_i$. For $\delta_i = 0$, we generate interval data for $y_i$ using three methods: (1) Scenario 1: rounding down to the largest integer smaller than $y_i$; (2) Scenario 2: grouping into 4 categories; and (3)


Table 1: Monte Carlo means and standard errors of the parameter estimates based on 2,000 simulations.

                                  Complete         Scenario 1        Scenario 2       Scenario 3
 n    Response rate   Parameter   data             (12 categories)   (4 categories)   (pure missing)
 200       70%        β1          0.496 (1.428)    0.509 (1.422)     0.498 (1.562)    0.492 (1.845)
                      β2          1.000 (0.072)    1.000 (0.070)     1.000 (0.078)    1.001 (0.092)
                      σ           1.998 (0.101)    1.988 (0.101)     1.983 (0.110)    1.982 (0.126)
 400       70%        β1          0.500 (0.999)    0.497 (1.005)     0.495 (1.096)    0.504 (1.243)
                      β2          1.000 (0.050)    1.000 (0.050)     1.000 (0.055)    1.000 (0.062)
                      σ           1.997 (0.070)    1.993 (0.071)     1.998 (0.079)    1.990 (0.085)
 200       30%        β1          0.496 (1.428)    0.495 (1.423)     0.509 (1.610)    0.505 (2.692)
                      β2          1.000 (0.072)    1.001 (0.075)     1.000 (0.080)    1.000 (0.134)
                      σ           1.998 (0.101)    1.990 (0.103)     1.989 (0.124)    1.962 (0.181)

Standard errors are in parentheses.

Scenario 3: pure missing without grouping. In Scenario 1, for example, if $y_i = 15.3$, then $\tilde{y}_i = 15$, which implies $y_i \in [15, 16)$. Given the simulation parameters, Scenario 1 generates 12 categories and Scenario 2 creates the four categories $(-\infty, 16)$, $[16, 20)$, $[20, 24)$, and $[24, \infty)$. In Scenario 3, we assume that there is no additional interval information for the missing data; hence all imputed values generated from the initial parameter estimates, which are computed from the perfectly observed cases, are accepted, and there is no iteration procedure (Step 4 of the suggested algorithm in Section 3) in Scenario 3.

We adopt the Monte Carlo approach suggested in Section 3 to generate imputed values $y_i^{*(1)}, \ldots, y_i^{*(M)}$ for the missing observations and then maximize the approximate complete-data log likelihood. In the implementation, $M$ was taken to be 20 for iterations 1–10 and 1,000 after the 10th iteration. Using the Monte Carlo imputed data, we estimate the three parameters $\beta_1$, $\beta_2$, and $\sigma$.

Table 1 reports the Monte Carlo means and standard errors of the parameter estimates based on 2,000 simulations, for samples of size $n = 200$ and $n = 400$. For comparison, we add the simulation results with complete data in the second column of the table. The estimates from the proposed method are nearly unbiased regardless of the coarsening mechanism. For $n = 200$ with a 70% response rate, the standard errors of $\hat{\beta}_2$ are 0.0738 with complete data, 0.0750 with 12 categories, 0.0756 with 4 categories, and 0.0910 with pure missing data; the standard errors of $\hat{\beta}_1$ and $\hat{\sigma}$ show similar patterns. Not surprisingly, this indicates that finer categories yield smaller standard errors because of the extra information they contain. The pattern still holds when $n = 400$, where increasing the sample size reduces the standard errors for all scenarios, and the same pattern holds for the 30% response rate.
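The three coarsening mechanisms above can be sketched directly; in this small illustration (our own code, using the paper's cutpoints and regression model), each scenario maps a simulated $y_i$ to the interval it reports:

```python
import math
import random

random.seed(11)

# One simulated observation from y | x ~ N(0.5 + x, 2^2), x ~ N(20, 2^2).
x = random.gauss(20, 2)
y = 0.5 + 1.0 * x + random.gauss(0, 2.0)

def scenario1(y):        # rounding down: y~ identifies [floor(y), floor(y) + 1)
    return (math.floor(y), math.floor(y) + 1)

def scenario2(y):        # four categories: (-inf, 16), [16, 20), [20, 24), [24, inf)
    cuts = [16, 20, 24]
    lows = [-math.inf] + cuts
    highs = cuts + [math.inf]
    for a, b in zip(lows, highs):
        if a <= y < b:
            return (a, b)

def scenario3(y):        # pure missing: no interval information at all
    return (-math.inf, math.inf)

for name, s in [("Scenario 1", scenario1), ("Scenario 2", scenario2),
                ("Scenario 3", scenario3)]:
    a, b = s(y)
    assert a <= y < b    # the reported bracket always contains the true y
    print(name, (a, b))
```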
The standard errors are larger for the 30% response rate than for the 70% response rate.

5. APPLICATION TO A REAL SURVEY EXAMPLE
This section presents an illustrative empirical application of the proposed method. We use data from the Korean Longitudinal Study of Aging (KLoSA) to estimate the regression of log-transformed financial asset holdings ($y$) on retirement, gender, age, household size, and years of education. For the purposes of illustration, we ignored the sampling design and treated the original sample as if it were obtained by simple random sampling; more careful modelling would be needed to make the actual sampling design ignorable. In the 2006 wave of the KLoSA data, there were 4,232 respondents to the items on financial assets. Respondents were asked to provide exact point values for their holdings of various assets.

Respondents who did not know or otherwise refused to report the exact point value for some component of the wealth items were asked whether the value fell within a sequence of brackets. For financial assets, exact point measurements were obtained from 3,595 respondents and informative interval measurements were obtained from 637 respondents. The 11 brackets for financial assets in KLoSA are (1) below 100, (2) around 100, (3) from 100 to 500, (4) around 500, (5) from 500 to 1,000, (6) around 1,000, (7) from 1,000 to 5,000, (8) around 5,000, (9) from 5,000 to 10,000, (10) around 10,000, and (11) above 10,000, where the unit is 10,000 Korean won. Note that the brackets contain both deterministic intervals such as "between 100 and 500" and stochastic ones such as "around 500." We adopted the CAR assumption, which seems reasonable because those who are reluctant to answer the original question may answer the bracket question, which does not require them to divulge their exact net worth.

Since the intervals include some stochastic intervals, we estimate the regression of log financial assets using the fractional weighting method suggested in Section 3. Based on (22), $\hat{\theta}^{(t+1)}$ can be obtained from

\[
\hat{\beta}^{(t+1)} = \left( \frac{1}{n} \sum_{i=1}^{n} x_i' x_i \right)^{-1} \frac{1}{n} \sum_{i=1}^{n} \sum_{m=1}^{M} w_i^{*(m)} x_i' y_i^{*(m)},
\]
\[
\hat{\sigma}^{2(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \sum_{m=1}^{M} w_i^{*(m)} \left( y_i^{*(m)} - x_i \hat{\beta}^{(t+1)} \right)^2,
\]

where the weights are given by (21). For the stochastic intervals containing wording such as "around $k$," we assume that $f(y_i \mid \tilde{y}_i, x_i)$ given $\tilde{y}_i = k$ is a normal density with mean $k$ and variance $(k/10)^2$, chosen after consulting a subject-matter specialist for this survey. Instead of this approximation, one could use (7) to compute the conditional distribution $f(y_i \mid \tilde{y}_i, x_i)$, using $\tilde{y}_i \mid (x_i, y_i) \sim N(y_i, (y_i/10)^2)$; in that case the conditional distribution in (7) does not have a known parametric form and a Monte Carlo approximation might be used. For the deterministic intervals, the weights are $1/M$.

Table 2 presents the results from the three estimation procedures considered. The variable male is a gender dummy, age and education are measured in years, size is the total number of household members, and retire is a dummy variable equal to one if the subject is retired and zero otherwise. Column (1) of Table 2 shows point estimates and standard errors from the suggested MCEM algorithm with the fractional weights. The standard errors were

Table 2: Estimation results for KLoSA data.

Intercept

MCEM algorithm

Dropping grouped

Imputation with

using PFI (1)

data (2)

midpoints (3)

5.0765 (0.1779)

4.9414 (0.1899)

5.0885 (0.1776)

Male

0.0721 (0.0413)

0.0794 (0.0437)

0.0766 (0.0412)

Age

0.0013 (0.0024)

0.0030 (0.0025)

0.0017 (0.0024)

Size

−0.0131 (0.0157)

−0.0078 (0.0167)

−0.0101 (0.0156)

Retire

0.1263 (0.0526)

0.1170 (0.0561)

0.1421 (0.0525)

Education

0.0738 (0.0051)

0.0722 (0.0054)

0.0729 (0.0050)

Standard errors are in parentheses. DOI: 10.1002/cjs

obtained using the method described in Appendix A. The estimation results show that, holding other factors fixed, households with retired individuals are predicted to hold 12.6% more financial assets than households without retired individuals, and the coefficient on retirement is statistically significant. Column (2) of the table shows the estimation results using only the 3,595 respondents with exact point measurements. The standard errors here are larger than in column (1) because information is forgone by using only those 3,595 respondents rather than all respondents. Another frequently used procedure is to treat the midpoint of a class interval as a proxy for the grouped value (e.g., Karmel & Polasek, 1970). The last column (3) of the table gives the estimation results of imputation based on the midpoint of each interval; for the open-ended interval "above 10,000," an arbitrary value of 22,000 was used after consulting the subject-matter specialist. Some results of the midpoint imputation are similar to those based on the proposed method, while others differ. For example, the coefficient estimates on retirement are 0.126 from the proposed method and 0.142 from the midpoint imputation. In general, as shown by Hsiao (1983) and Haitovsky (1973), midpoint estimates are biased and inconsistent unless the missing data are uniformly distributed within the intervals; the amount of bias depends on both the distribution of the missing data and the rate of missingness.

ACKNOWLEDGEMENTS
The work of the first author was supported in part by a professional services contract from the Korean Labor Institute and by a Cooperative Agreement between the US Department of Agriculture Natural Resources Conservation Service and Iowa State University. We thank the three anonymous referees and the Associate Editor for very constructive comments, which greatly improved this paper.

BIBLIOGRAPHY
Booth, J. G. & Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B, 61, 265–285.
Carroll, R. J., Ruppert, D., Stefanski, L. A., & Crainiceanu, C. (2006). Measurement Error in Nonlinear Models: A Modern Perspective, 2nd ed., Chapman & Hall/CRC, Boca Raton, FL.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39, 1–38.
Durrant, G. B. & Skinner, C. (2006). Using missing data methods to correct for measurement error in a distribution function. Survey Methodology, 32, 25–36.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222, 309–368.
Fuller, W. A. (1987). Measurement Error Models, John Wiley & Sons, Inc., New York.
Fuller, W. A. & Kim, J. K. (2005). Hot deck imputation for the response model. Survey Methodology, 31, 139–149.
Givens, G. H. & Hoeting, J. A. (2005). Computational Statistics, John Wiley & Sons, Inc., New York.
Haitovsky, Y. (1973). Regression Estimation from Grouped Observations, Hafner Press, New York.
Haziza, D. (2009). Imputation and inference in the presence of missing data. In Handbook of Statistics, Sample Surveys: Design, Methods and Applications, Pfeffermann, D. & Rao, C. R., editors. Elsevier B.V., Amsterdam, The Netherlands.
Heeringa, S. G. (1995). Application of Generalized Iterative Bayesian Simulation Methods to Estimation and Inference for Coarsened Household Income and Asset Data. In Proceedings of the Section on Survey Research Methods, American Statistical Association, Washington, DC, pp. 42–51.

DOI: 10.1002/cjs

2012

STATISTICAL INFERENCE WITH COARSE DATA

615

Heitjan, D. F. (1989). Inference from grouped continuous data: A review. Statistical Science, 4, 164–183.
Heitjan, D. F. & Rubin, D. B. (1990). Inference from coarse data via multiple imputation with application to age heaping. Journal of the American Statistical Association, 85, 304–314.
Heitjan, D. F. & Rubin, D. B. (1991). Ignorability and coarse data. The Annals of Statistics, 19, 2244–2253.
Hsiao, C. (1983). Regression Analysis with a Categorized Explanatory Variable. In Econometrics, Time Series, and Multivariate Statistics, Karlin, S., Amemiya, T., & Goodman, L., editors. Academic Press, New York.
Karmel, P. H. & Polasek, M. (1970). Applied Statistics for Economists, 3rd ed., Pitman, London.
Kim, J. K. (2011). Parametric fractional imputation for missing data analysis. Biometrika, 98, 119–132.
Lindley, D. V. (1950). Grouping corrections and maximum likelihood equations. Proceedings of the Cambridge Philosophical Society, 46, 106–110.
Little, R. J. A. & Rubin, D. B. (2002). Statistical Analysis with Missing Data, 2nd ed., John Wiley & Sons, Inc., New York.
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B, 44, 226–233.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
Tallis, G. M. (1967). Approximate maximum likelihood from grouped data. Technometrics, 9, 599–606.
Taylor, J. M. G. & Yu, M. (2002). Bias and efficiency loss due to categorizing an explanatory variable. Journal of Multivariate Analysis, 83, 248–263.
Wang, C.-Y. & Pepe, M. S. (2002). Expected estimating equations to accommodate covariate measurement error. Journal of the Royal Statistical Society: Series B, 62, 509–524.
Wang, N. & Robins, J. M. (1998). Large-sample theory for parametric multiple imputation procedures. Biometrika, 85, 935–948.
Wei, G. C. G. & Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithm. Journal of the American Statistical Association, 85, 699–704.

APPENDIX A. Variance Estimation

Variance estimation can be implemented using the formula of Louis (1982). For simplicity, we assume CAR. In this case, we can use the PFI method to obtain the approximate value of the MLE of $(\theta, \phi)$ by solving
$$
\sum_{\delta_i = 1} S_1(\theta; x_i, y_i) + \sum_{\delta_i = 0} \sum_{m=1}^{M} w_i^{*(m)} S_1\bigl(\theta; x_i, y_i^{*(m)}\bigr) = 0
$$
and
$$
\sum_{\delta_i = 0} \sum_{m=1}^{M} w_i^{*(m)} S_2\bigl(\phi; x_i, y_i^{*(m)}, \tilde{y}_i\bigr) = 0,
$$
where $S_1(\theta; x_i, y_i)$ and $S_2(\phi; x_i, y_i, \tilde{y}_i)$ are defined after (6) and the $w_i^{*(m)}$ are the fractional weights assigned to $y_i^{*(m)}$. If the $y_i^{*(m)}$ are generated from $f(y_i \mid x_i, \hat{\theta}_0)$, then the fractional weights are computed by
$$
w_i^{*(m)} = \frac{f\bigl(y_i^{*(m)} \mid x_i, \hat{\theta}\bigr)\, g_0\bigl(\tilde{y}_i \mid x_i, y_i^{*(m)}; \hat{\phi}\bigr) \big/ f\bigl(y_i^{*(m)} \mid x_i, \hat{\theta}_0\bigr)}{\sum_{m=1}^{M} f\bigl(y_i^{*(m)} \mid x_i, \hat{\theta}\bigr)\, g_0\bigl(\tilde{y}_i \mid x_i, y_i^{*(m)}; \hat{\phi}\bigr) \big/ f\bigl(y_i^{*(m)} \mid x_i, \hat{\theta}_0\bigr)}.
$$
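As an illustration of this importance-weighting step, the following Python sketch computes normalized fractional weights under a working normal model. The densities $f$ and $g_0$, the parameter values, and the function names below are illustrative assumptions, not the models fitted in the paper.

```python
import numpy as np

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y (vectorized)."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def fractional_weights(y_imp, x, y_tilde, theta_hat, theta0, phi_hat):
    """Normalized importance weights w_i^{*(m)} for M imputed values y_imp.

    Proposal density: f(y | x, theta0); the target is proportional to
    f(y | x, theta_hat) * g0(y_tilde | x, y; phi_hat).
    Here theta = (intercept, slope, sd) of a working linear-normal model.
    """
    f_target = normal_pdf(y_imp, theta_hat[0] + theta_hat[1] * x, theta_hat[2])
    f_prop = normal_pdf(y_imp, theta0[0] + theta0[1] * x, theta0[2])
    # g0: assumed coarsening density of y_tilde given (x, y); a normal
    # measurement model y_tilde ~ N(y, phi_hat^2) is used as an example.
    g0 = normal_pdf(y_tilde, y_imp, phi_hat)
    w = f_target * g0 / f_prop          # importance ratio per imputed value
    return w / w.sum()                  # normalize over the M draws

rng = np.random.default_rng(0)
theta0 = (0.0, 1.0, 1.0)                               # initial parameter
y_imp = rng.normal(0.0 + 1.0 * 2.0, 1.0, size=100)     # M = 100 proposal draws
w = fractional_weights(y_imp, x=2.0, y_tilde=2.5,
                       theta_hat=(0.1, 0.9, 1.0), theta0=theta0, phi_hat=0.5)
print(round(w.sum(), 6))  # weights sum to 1 by construction
```

Because the same draws are reused with updated weights, only the ratio to the fixed proposal density needs to be recomputed at each EM step.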

For the variance estimation of $(\hat{\theta}, \hat{\phi})$, define the observed information matrix
$$
I_{\mathrm{obs}}(\theta, \phi) = -\frac{\partial}{\partial(\theta, \phi)} S_{\mathrm{obs}}(\theta, \phi). \tag{A.1}
$$
The variance–covariance matrix of $\hat{\theta}$ is the inverse of the observed information (A.1) evaluated at $\hat{\theta}$. Using the argument of Louis (1982), we can write
$$
I_{\mathrm{obs}}(\theta, \phi) = \sum_{i=1}^{n} I_{\mathrm{obs},i}(\theta, \phi), \tag{A.2}
$$
where
$$
I_{\mathrm{obs},i}(\theta, \phi) = E\bigl\{ Q(\theta, \phi; x_i, y_i, \tilde{y}_i) \mid x_i, \tilde{y}_i \bigr\} - V\bigl\{ S(\theta, \phi; x_i, y_i, \tilde{y}_i) \mid x_i, \tilde{y}_i \bigr\}, \tag{A.3}
$$
with $Q(\theta, \phi; x_i, y_i, \tilde{y}_i) = -\partial S(\theta, \phi; x_i, y_i, \tilde{y}_i)/\partial(\theta, \phi)$ and $S(\theta, \phi; x_i, y_i, \tilde{y}_i) = \bigl( S_1(\theta; x_i, y_i), S_2(\phi; x_i, y_i, \tilde{y}_i) \bigr)$. The conditional expectation in (A.3) is computed with the conditional distribution (7).

Writing $Q_1(\theta; x_i, y_i) = -\partial S_1(\theta; x_i, y_i)/\partial\theta$ and $Q_2(\phi; x_i, y_i, \tilde{y}_i) = -\partial S_2(\phi; x_i, y_i, \tilde{y}_i)/\partial\phi$, the first component on the right-hand side of (A.3) can be estimated by $\hat{Q}_i(\theta, \phi) = \mathrm{block\ diag}\bigl( \hat{Q}_{1i}(\theta), \hat{Q}_{2i}(\phi) \bigr)$, where
$$
\hat{Q}_{1i}(\theta) =
\begin{cases}
Q_1(\theta; x_i, y_i) & \text{if } \delta_i = 1 \\
\sum_{m=1}^{M} w_i^{*(m)} Q_1\bigl(\theta; x_i, y_i^{*(m)}\bigr) & \text{if } \delta_i = 0,
\end{cases}
$$
and
$$
\hat{Q}_{2i}(\phi) =
\begin{cases}
0 & \text{if } \delta_i = 1 \\
\sum_{m=1}^{M} w_i^{*(m)} Q_2\bigl(\phi; x_i, y_i^{*(m)}, \tilde{y}_i\bigr) & \text{if } \delta_i = 0,
\end{cases}
$$
where $y_i^{*(m)}$ is the $m$th imputed value of $y_i$ and $w_i^{*(m)}$ is the corresponding fractional weight. The second term of (A.3) can be estimated by
$$
B_i(\theta, \phi) = \sum_{m=1}^{M} w_i^{*(m)}
\begin{pmatrix}
S_{1i}^{*(m)}(\theta) - \sum_{m=1}^{M} w_i^{*(m)} S_{1i}^{*(m)}(\theta) \\[4pt]
S_{2i}^{*(m)}(\phi) - \sum_{m=1}^{M} w_i^{*(m)} S_{2i}^{*(m)}(\phi)
\end{pmatrix}^{\otimes 2},
$$
where $S_{1i}^{*(m)}(\theta) = S_1\bigl(\theta; x_i, y_i^{*(m)}\bigr)$ and $S_{2i}^{*(m)}(\phi) = S_2\bigl(\phi; x_i, y_i^{*(m)}, \tilde{y}_i\bigr)$. If $\delta_i = 1$, then $B_i(\theta, \phi) = 0$. Thus, we have
$$
\hat{I}_{\mathrm{obs},i}(\theta, \phi) = \hat{Q}_i(\theta, \phi) - B_i(\theta, \phi). \tag{A.4}
$$
For sufficiently large $M$, the estimated variance–covariance matrix of $(\hat{\theta}, \hat{\phi})$ can be obtained as the inverse of $\hat{I}_{\mathrm{obs}}(\hat{\theta}, \hat{\phi}) = \sum_{i=1}^{n} \hat{I}_{\mathrm{obs},i}(\hat{\theta}, \hat{\phi})$.
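The assembly of (A.2)–(A.4) can be sketched numerically as follows: for each unit, the weighted mean of the complete-data information is reduced by the weighted between-imputation variance of the scores. The arrays and the function name `louis_information` below are illustrative; no particular model is assumed.

```python
import numpy as np

def louis_information(Q_m, S_m, w):
    """Observed-information contribution I_hat = Q_hat - B_i for one unit,
    following (A.3)-(A.4).

    Q_m : (M, p, p) complete-data information at each imputed value
    S_m : (M, p)    complete-data score at each imputed value
    w   : (M,)      fractional weights summing to one
    """
    Q_hat = np.einsum("m,mij->ij", w, Q_m)   # weighted mean of Q over draws
    S_bar = w @ S_m                          # weighted mean score
    D = S_m - S_bar                          # centered scores
    B = np.einsum("m,mi,mj->ij", w, D, D)    # weighted outer products (otimes 2)
    return Q_hat - B

# Toy inputs for a p = 2 parameter with M = 50 imputed values.
M, p = 50, 2
rng = np.random.default_rng(1)
S_m = rng.normal(size=(M, p))
Q_m = np.tile(np.eye(p), (M, 1, 1))   # constant unit information, for illustration
w = np.full(M, 1.0 / M)
I_i = louis_information(Q_m, S_m, w)
# Summing I_i over units and inverting the total gives the variance estimate.
```

Since $B_i$ is a weighted covariance and hence positive semi-definite, each unit's contribution never exceeds the weighted complete-data information.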


APPENDIX B. Measurement Error Models

We briefly discuss the measurement error model problem as a special case of the stochastic coarsening problem. Wang & Pepe (2002) also addressed the measurement error problem using the conditional expectation of the score function. Fuller (1987) and Carroll et al. (2006) provided comprehensive overviews of the estimation methods for measurement error models. Here we show how the idea of the PFI approach for the stochastic coarsening problem can be applied to the measurement error problem. Durrant & Skinner (2006) also considered an imputation approach for measurement error models.

Let $\theta_1$ be the parameter of interest in the conditional distribution $f(y \mid x)$ and $\theta_2$ the parameter in the marginal distribution $h(x)$ of $x$. Suppose that we observe $\tilde{x}_i$, a coarsened version of $x_i$, instead of $x_i$ for $\delta_i = 0$. That is, we observe $(x_i, y_i)$ for $\delta_i = 1$ and $(\tilde{x}_i, y_i)$ for $\delta_i = 0$. For simplicity, assume that the response mechanism for $\delta_i$ is CAR in the sense that $p(\delta = 1 \mid x, \tilde{x}, y) = p(\delta = 1 \mid \tilde{x}, y)$.

Assume that the coarsening mechanism is parametrically modelled by
$$
x_i \mid \tilde{x}_i \sim g(x_i \mid \tilde{x}_i; \phi) \tag{B.1}
$$
for some parameter $\phi$. Model (B.1) is often called the Berkson error model in the literature. If this model is not feasible, then one can postulate a classical error model $g_c(\tilde{x}_i \mid x_i; \phi)$ and derive the Berkson error model by
$$
g(x_i \mid \tilde{x}_i) = \frac{g_c(\tilde{x}_i \mid x_i; \phi)\, h(x_i; \theta_2)}{\int g_c(\tilde{x}_i \mid x_i; \phi)\, h(x_i; \theta_2)\, dx_i}.
$$
To estimate $\phi$, we further assume that a validation sub-sample is selected from the units with $\delta_i = 0$ and that we observe $x_i$ in addition to $\tilde{x}_i$ in the validation sample. A consistent estimator of $\phi$ can be obtained by solving
$$
\sum_{i \in V} S(\phi; x_i, \tilde{x}_i) = 0,
$$
where $S(\phi; x_i, \tilde{x}_i)$ is the score function for $\phi$ and $V$ denotes the validation sample.

To compute the conditional expectation in (6), the conditional expectation should be taken with respect to $x_i$ given $\tilde{x}_i$ and $y_i$. Thus, the target distribution for generating imputed values is
$$
x_i^* \mid (\tilde{x}_i, y_i) \sim \frac{f\bigl(y_i \mid x_i; \hat{\theta}_1\bigr)\, g\bigl(x_i \mid \tilde{x}_i; \hat{\phi}\bigr)}{\int f\bigl(y_i \mid x_i; \hat{\theta}_1\bigr)\, g\bigl(x_i \mid \tilde{x}_i; \hat{\phi}\bigr)\, dx_i} \tag{B.2}
$$
for some $(\hat{\theta}_1, \hat{\phi})$. As noted in Section 3, one can use the MCMC method to generate samples from (B.2). Instead of generating Monte Carlo samples from (B.2), the PFI method can be implemented by first generating
$$
x_i^{*(m)} \mid \tilde{x}_i \sim g\bigl(x_i \mid \tilde{x}_i; \hat{\phi}\bigr) \tag{B.3}
$$
and assigning the fractional weights computed by
$$
w_i^{*(m)} = \frac{f\bigl(y_i \mid x_i^{*(m)}; \hat{\theta}_1\bigr)}{\sum_{m=1}^{M} f\bigl(y_i \mid x_i^{*(m)}; \hat{\theta}_1\bigr)}. \tag{B.4}
$$
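To make steps (B.3)–(B.4) and the subsequent EM update concrete, here is a small Python sketch under an assumed normal Berkson model $x \mid \tilde{x} \sim N(\tilde{x}, \tau^2)$ and a linear working model for $f(y \mid x)$. The simulated data, parameter values, and variable names are hypothetical, and for simplicity every covariate is treated as coarsened (no $\delta_i = 1$ units). The imputed values are drawn once; each iteration recomputes the weights (B.4) and then solves the weighted score equation for $\theta_1$, which is weighted least squares in this linear case.

```python
import numpy as np

rng = np.random.default_rng(2)
n, M, tau = 200, 30, 0.5

# Simulate data that actually follow the Berkson model: x | x_tilde ~ N(x_tilde, tau^2).
x_tilde = rng.normal(0.0, 1.0, n)                  # observed, coarsened covariate
x = x_tilde + rng.normal(0.0, tau, n)              # true covariate (never observed)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)        # outcome, true theta_1 = (1, 2, 1)

# Step (B.3): imputed covariates drawn once from g(x | x_tilde; phi_hat).
x_imp = x_tilde[:, None] + rng.normal(0.0, tau, (n, M))

b0, b1, s = 0.0, 1.0, 1.0                          # initial theta_1 = (b0, b1, s)
for _ in range(50):
    # Step (B.4): fractional weights proportional to f(y | x*, theta_1).
    resid = y[:, None] - (b0 + b1 * x_imp)
    w = np.exp(-0.5 * (resid / s) ** 2)
    w /= w.sum(axis=1, keepdims=True)
    # M-step: weighted least squares over all n * M imputed points.
    xw = (w * x_imp).sum(axis=1)                   # per-unit weighted mean of x*
    x2w = (w * x_imp ** 2).sum(axis=1)
    b1 = ((xw * y).mean() - xw.mean() * y.mean()) / (x2w.mean() - xw.mean() ** 2)
    b0 = y.mean() - b1 * xw.mean()
    resid = y[:, None] - (b0 + b1 * x_imp)
    s = np.sqrt((w * resid ** 2).sum(axis=1).mean())

print(round(b1, 2))  # slope estimate recovered from the coarsened covariate
```

Because the draws `x_imp` stay fixed, each EM iteration only refreshes the weights and the weighted-score solution, which is what guarantees monotone convergence without increasing $M$.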


With the fractionally imputed data, the parameter estimate for $\theta_1$ is updated by solving
$$
\sum_{\delta_i = 1} S_1(\theta_1; x_i, y_i) + \sum_{\delta_i = 0} \sum_{m=1}^{M} w_i^{*(m)} S_1\bigl(\theta_1; x_i^{*(m)}, y_i\bigr) = 0,
$$
where $S_1(\theta_1; x_i, y_i)$ is the score function for $\theta_1$. The parameter estimate for $\phi$ is not updated in the EM iteration; only the fractional weights in (B.4) are updated. Because the imputed values in (B.3) do not change over the EM iterations, the convergence of the EM sequence is guaranteed without increasing $M$.

Received 26 July 2011
Accepted 20 April 2012
