Bootstrap confidence intervals in DirectLiNGAM
Kittitat Thamvitayakul, Shohei Shimizu, Tsuyoshi Ueno, Takashi Washio and Tatsuya Tashiro
The Institute of Scientific and Industrial Research (ISIR), Osaka University, Osaka, Japan
Email: [email protected]

Abstract—We have been considering the problem of finding significant connection strengths of variables in a linear non-Gaussian causal model called LiNGAM. In our previous work, bootstrap confidence intervals of connection strengths were computed simultaneously in order to test their statistical significance. However, the distribution of the estimated elements of the adjacency matrix obtained by the bootstrap method was not close enough to the real distribution, even when the number of bootstrap replications was increased. Moreover, such a naive approach raised the multiple comparison problem, in which many directed edges were likely to be falsely found significant. In this study, we propose a new approach to correct the distribution obtained by the bootstrap method. We also apply a representative multiple comparison technique, the Bonferroni correction, and evaluate its performance. The results of this study show that the new distribution is more stable and closer to the real distribution. Moreover, the number of falsely found significant edges is smaller than with the previous approach.

Keywords—Structural equation models, Bayesian networks, non-Gaussianity, causal discovery, Bayesian information criteria, adaptive lasso, bootstrap method.

I. INTRODUCTION

In the study of causal analysis, confidence intervals have been used to determine whether an edge in a directed acyclic graph is statistically significant. In order to construct such confidence intervals, the bootstrap, one of the resampling methods, can be used. In previous work on LiNGAM [1], bootstrap confidence intervals were applied to determine the significance of the connection strengths in the adjacency matrix estimated by LiNGAM. However, we found two main problems in this work. The first is that it is possible to have more than one correct causal ordering as the answer of DirectLiNGAM. Therefore, even if the number of bootstrap replications is increased, the distribution of the estimated elements obtained by the bootstrap method is still not likely to be close to the real distribution. The second problem is the multiple comparison problem, that is, confidence intervals with an appropriate range could not be constructed for all elements. This problem leads to many directed edges being falsely found significant. In the present study, we propose solutions to these two problems.

For the first problem, we propose a new approach to correct the distribution of the estimated elements so that it converges to the real distribution. For the second problem, we use a representative multiple comparison procedure, the Bonferroni correction [2], to construct more appropriate confidence intervals. The rest of the paper is organized as follows. In Section II, we review LiNGAM, DirectLiNGAM, the adaptive lasso penalized likelihood function and the Bonferroni correction. In Section III, we present our new approach to construct confidence intervals in DirectLiNGAM. The simulation results are presented in Section IV, and applications to real-world data are shown in Section V. In Section VI, we conclude this study.

II. BACKGROUND

In this section, we first briefly review the linear non-Gaussian structural equation model, LiNGAM in short [3], and a direct estimation method of LiNGAM called DirectLiNGAM [1] in Section II-A and Section II-B, respectively. We then review the concept of the adaptive lasso penalized likelihood function in Section II-C. In Section II-D, we describe the Bonferroni correction, the multiple comparison procedure that we use in this study.

A. LiNGAM

LiNGAM [3] is a model used for exploratory causal analysis. It is assumed that the data are generated from a process represented graphically by a directed acyclic graph (DAG). Note that a DAG can be further represented by an adjacency matrix B = {bij}, where bij is the connection strength directed from a variable xj to xi. If bij is nonzero, there is a directed edge from variable xj to xi. We denote by k(i) a causal order of xi in the DAG so that no later variable determines, or has a directed path to, any earlier variable. Without loss of generality, each observed variable xi is assumed to have zero mean. Then we have

xi = Σ_{k(j)<k(i)} bij xj + ei,    (1)
where ei are external influences, which are continuous variables having non-Gaussian distributions with zero means and nonzero variances, and are mutually independent. The independence assumption among the ei means that there is no latent confounding variable. The model in (1) is rewritten in matrix form as

x = Bx + e,    (2)
where the connection strength matrix, or adjacency matrix, B collects the bij, and the vectors x and e are d-dimensional vectors collecting the xi and ei, respectively. The adjacency matrix B can be permuted to a lower triangular matrix with all zeros on the diagonal by simultaneous equal row and column permutations according to the causal orderings. The purpose of this model is to estimate the adjacency matrix B based only on the data x. Note that since the external influences ei have non-Gaussian distributions and are mutually independent, the model in (2) is known to be identifiable.

B. DirectLiNGAM

DirectLiNGAM [1] is a direct estimation method of LiNGAM. It is used to estimate the causal orders of all variables. An exogenous variable is defined as a variable with no parent, and its corresponding row in B consists of all zeros. It can be placed at the top of such a causal ordering, which makes B a lower triangular matrix with zeros on the diagonal. In the process of DirectLiNGAM, an exogenous variable is first identified based on its independence of the residuals. How an exogenous variable can be found is shown in Lemma 1:

Lemma 1: Assume that the input data x strictly follows the LiNGAM in (2). This means that we assume that all the model assumptions are met and the sample size is infinite. Denote by ri(j) the residual when xi is regressed on xj: ri(j) = xi − (cov(xi, xj)/var(xj)) xj (i ≠ j). Then a variable xj is exogenous if and only if xj is independent of its residuals ri(j) for all i ≠ j.

In addition, the independence between a variable xj and its residual ri(j) mentioned in Lemma 1 can be proved by using the Darmois-Skitovitch theorem [4].

Theorem 1 (Darmois-Skitovitch theorem): Define two random variables x1 and x2 as linear combinations of independent random variables ei (i = 1, . . . , d):

x1 = Σ_{i=1}^{d} αi ei,    x2 = Σ_{i=1}^{d} βi ei.
Then, if x1 and x2 are independent, all variables ek for which αk βk ≠ 0 are Gaussian.

In other words, this theorem indicates that if there exists a non-Gaussian ek for which αk βk ≠ 0, then x1 and x2 are dependent. When xj is not an exogenous variable, that is, when xj has at least one parent variable, xj contains a non-Gaussian ek for which αk βk ≠ 0. Likewise, ri(j), the residual of xi regressed on xj, must also contain such a non-Gaussian ek. This indicates that xj and ri(j) are dependent. Meanwhile, when xj is an exogenous variable, xj and its residual ri(j) are independent. In practice, an exogenous variable can be identified by finding the variable that is most independent of its residuals.

After the first exogenous variable is found by Lemma 1, the effect of that exogenous variable on the other variables is removed by using least squares regression. In order to identify the next exogenous variable, the (d−1)-dimensional residual vector r(j) can be considered as a new input to LiNGAM by applying Lemma 2:

Lemma 2: Assume that the input data x strictly follows the LiNGAM in (2) and a variable xj is exogenous. Denote by r(j) a (d−1)-dimensional vector that collects the residuals ri(j) when all xi of x (i ≠ j) are regressed on xj. Then a LiNGAM holds for the residual vector r(j): r(j) = B(j) r(j) + e(j), where B(j) is a matrix that can be permuted to be strictly lower triangular by a simultaneous row and column permutation, and the elements of e(j) are non-Gaussian and mutually independent.

According to Lemma 2, the residuals can be used as a new input to the LiNGAM model. Then, by applying Lemma 1, an exogenous residual is identified as the next exogenous variable, and so on. The process of identifying an exogenous variable and removing its effect is iterated until the causal orderings of all variables are found. DirectLiNGAM estimates the causal orderings of the variables and their connection strengths by using only the dataset. We define the following:

matrix X: the matrix with dimension (d × n) containing the original dataset with dimension d and sample size n.
vector O: the vector of size d containing the d variables ordered following the estimated causal ordering.
matrix B: the adjacency matrix with dimension (d × d) whose elements contain the connection strengths between the d variables.

The steps showing how DirectLiNGAM estimates O and B are given in Algorithm 1.

C. Adaptive Lasso Penalized Likelihood Function

In the work on regularization parameter selection via generalized information criteria [5], the nonconcave penalized likelihood approach [6] was applied to obtain variable selection as well as shrinkage estimators, and the use of the generalized information criterion (GIC) [7] under a general nonconcave penalized likelihood was proposed. However, instead of the general nonconcave penalized likelihood, we use the adaptive lasso [8] as the objective function in this study. Therefore, we first introduce the adaptive lasso in Section II-C1 and then the Bayesian information criterion [9] in Section II-C2.

Algorithm 1 Steps in DirectLiNGAM
1: Calculate the residuals of each variable and identify an exogenous variable by using Lemma 1.
2: Put the exogenous variable found in step 1 into the causal ordering O.
3: Regress the exogenous variable found in step 1 out and use the resulting residuals as a new input to DirectLiNGAM using Lemma 2.
4: Repeat steps 1 to 3 in order to remove the effect of each exogenous variable from the other variables until the orderings of all variables are contained in O.
5: Construct B as a strictly lower triangular matrix by following the causal orderings in O and then estimate each connection strength in B by least squares regression.
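For concreteness, the following Python sketch illustrates the ordering-and-removal loop of Algorithm 1. It is our own illustration, not the implementation of [1]: the kernel-based independence measure used in DirectLiNGAM is replaced by a crude nonlinear-correlation score, the function names are ours, and the data are assumed to be centered.

```python
import numpy as np

def tanh_corr_score(x, r):
    # Crude stand-in for the kernel-based independence measure of [1]:
    # correlation between nonlinear transforms is typically nonzero under dependence.
    return abs(np.corrcoef(np.tanh(x), r)[0, 1]) + abs(np.corrcoef(x, np.tanh(r))[0, 1])

def estimate_causal_order(X, score=tanh_corr_score):
    """Sketch of steps 1-4 of Algorithm 1 on an (n, d) data matrix."""
    R = X - X.mean(axis=0)                     # work on a centered copy
    remaining = list(range(X.shape[1]))
    order = []
    while remaining:
        totals = []
        for j in remaining:                    # total dependence of x_j on its residuals
            t = 0.0
            for i in remaining:
                if i == j:
                    continue
                b = (R[:, i] @ R[:, j]) / (R[:, j] @ R[:, j])
                t += score(R[:, j], R[:, i] - b * R[:, j])
            totals.append(t)
        j_star = remaining[int(np.argmin(totals))]   # most exogenous candidate (Lemma 1)
        order.append(j_star)
        for i in remaining:                    # remove its effect from the rest (Lemma 2)
            if i != j_star:
                b = (R[:, i] @ R[:, j_star]) / (R[:, j_star] @ R[:, j_star])
                R[:, i] = R[:, i] - b * R[:, j_star]
        remaining.remove(j_star)
    return order

def estimate_B(X, order):
    """Step 5: least squares regression of each variable on its predecessors."""
    Xc = X - X.mean(axis=0)
    d = X.shape[1]
    B = np.zeros((d, d))
    for pos, i in enumerate(order):
        parents = order[:pos]
        if parents:
            coef, *_ = np.linalg.lstsq(Xc[:, parents], Xc[:, i], rcond=None)
            B[i, parents] = coef
    return B
```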

1) Adaptive Lasso: The adaptive lasso [8] is a modification of the lasso [10]. Originally, the lasso is a regularization technique for simultaneous estimation and variable selection. Lasso variable selection has been shown to be consistent under certain conditions. Suppose that y = (y1, . . . , yn)^T is the response vector and xj = (x1j, . . . , xnj)^T, j = 1, . . . , d, are linearly independent predictors with the parameter vector β = (β1, . . . , βd)^T. Let β̂ be the coefficient estimator produced by the lasso, which is defined as

β̂ = argmin_β { ‖y − Σ_{j=1}^{d} xj βj‖² + λ Σ_{j=1}^{d} |βj| },    (3)

where λ is a nonnegative regularization parameter. The second term in (3) is a penalty function called the ℓ1 penalty. The lasso continuously shrinks the coefficients toward 0 as λ increases, and some coefficients are shrunk to exactly 0 if λ is sufficiently large. The difference between the lasso and the adaptive lasso is that the adaptive lasso assigns different weights to different coefficients in order to address the multiple local minima issue. Let β̂*(m) be the coefficient estimator produced by the adaptive lasso, where the index m runs from 0 to its upper limit. This coefficient estimator is given by

β̂*(m) = argmin_β { ‖y − Σ_{j=1}^{d} xj βj‖² + λm Σ_{j=1}^{d} |βj| / |β̂j|^γ },    (4)

where both λm and γ are tuning parameters and β̂j, which appears in the weights, is a consistent estimate of βj obtained by ordinary least squares (OLS) regression. The second term in (4) is called the adaptively weighted ℓ1 penalty. Note that, in this study, we fix the value of the tuning parameter γ to 1. How to solve (4) is shown in Algorithm 2.

Algorithm 2 Steps in the Adaptive Lasso
1: Calculate the weights β̂j, j = 1, . . . , d, by ordinary least squares (OLS) regression of y on x.
2: Define xj** = xj / |β̂j|, j = 1, . . . , d.
3: Solve the lasso problem for all λm by minimizing
   β̂**(m) = argmin_β { ‖y − Σ_{j=1}^{d} xj** βj‖² + λm Σ_{j=1}^{d} |βj| },
   where β̂**(m) = (β̂1**(m), . . . , β̂d**(m)).
4: Calculate β̂j*(m) = β̂j**(m) / |β̂j|, j = 1, . . . , d, as the output of the algorithm.
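As an illustration, the following Python sketch solves the weighted problem (4) with γ = 1 via the standard reparameterization of the adaptive lasso [8] (rescaling each predictor by |β̂j| so that a plain ℓ1 penalty matches the weighted penalty) and selects λm with the BIC-type criterion of Section II-C2. The function names, the use of scikit-learn's Lasso, and the candidate grid of λm are our own choices; this is a sketch of the procedure, not the original implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def adaptive_lasso_path(y, X, lambdas):
    """Adaptive lasso (gamma = 1) via reparameterization: one solution per lambda_m."""
    # OLS estimates defining the adaptive weights 1/|b_ols_j|
    b_ols = LinearRegression(fit_intercept=False).fit(X, y).coef_
    # Rescale each column so a plain L1 penalty equals the weighted penalty in (4)
    X_star = X * np.abs(b_ols)
    betas = []
    for lam in lambdas:
        # Note: scikit-learn's Lasso minimizes (1/(2n))*RSS + alpha*||b||_1,
        # so alpha plays the role of lambda_m only up to that scaling.
        b_star = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X_star, y).coef_
        betas.append(b_star * np.abs(b_ols))   # map back to the original parameterization
    return np.array(betas)

def bic_select(y, X, betas):
    """BIC-type selector as in (5)/(7): minimize (RSS + log(n) * df_m) / n."""
    n = len(y)
    bics = [(np.sum((y - X @ b) ** 2) + np.log(n) * np.count_nonzero(b)) / n for b in betas]
    return betas[int(np.argmin(bics))]
```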

2) Bayesian Information Criterion: The Bayesian information criterion (BIC) [9] is an information criterion for models defined in terms of their posterior probability. This criterion is used to evaluate the goodness of an estimated model; in other words, it is used for selecting the regularization parameter. After obtaining β̂**(m) for all λm in step 3 of Algorithm 2, we use the BIC-type regularization parameter selector [5] to evaluate the goodness of each model and select the regularization parameter. The β̂**(m) which minimizes the BIC is selected and then used to calculate β̂*(m) in step 4 of Algorithm 2, which is the output of the adaptive lasso. The BIC-type selector is given by

BICm = (1/n) { ‖y − Σ_{j=1}^{d} xj** β̂j**(m)‖² + log(n) dfm },    (5)

where y = (y1, . . . , yn)^T is the response vector, xj** = (x1j**, . . . , xnj**)^T, j = 1, . . . , d, are the linearly independent predictors calculated in step 2 of Algorithm 2, and dfm is the size of the model (the number of nonzero parameters). When applying this to LiNGAM, we assume that the relation between the response vector y and its predictors follows the data generating process of LiNGAM in (1), where the parent variables xj with k(j) < k(i) are the predictors of the variable xi. Therefore, we can consider the predictors and the response vector y as the variables xj and xi, respectively. We assume that the set of parent variables xj with k(j) < k(i) is known in the first place. In the same way as (4), the adaptive lasso penalizes the connection strengths bij through the adaptively weighted ℓ1 penalty by minimizing the objective function

‖xi − Σ_{k(j)<k(i)} bij xj‖² + λm Σ_{k(j)<k(i)} |bij| / |b̂ij|^γ,    (6)

where λm and γ are tuning parameters and b̂ij is a consistent estimate of bij obtained by ordinary least squares (OLS) regression. The adaptive lasso has the attractive property that it asymptotically selects the right set of variables xj with k(j) < k(i) for which bij is not zero. In the same manner as (5), among the b̂ij**(m) obtained for all λm in step 3 of Algorithm 2, the one that minimizes the BIC in (7) is selected and used to calculate the output of the adaptive lasso in the next step. The BIC-type selector in the case of LiNGAM is given by

BICm = (1/n) { ‖xi − Σ_{k(j)<k(i)} b̂ij**(m) xj**‖² + log(n) dfm },    (7)
where n is the sample size of the dataset and dfm is the number of nonzero parameters.

D. Bonferroni Correction

In this subsection, we describe the multiple comparison procedure that we use, the Bonferroni correction [2]. Before introducing the Bonferroni correction, we first introduce the multiple comparison problem and the familywise error rate. The multiple comparison problem occurs when one considers a group of hypotheses simultaneously. It increases the familywise error rate, which is the probability of rejecting even one of the true null hypotheses (a Type I error) in the given family of hypotheses. The familywise error rate is given by

FWER = Pr(number of Type I errors ≥ 1).    (8)

Many statistical techniques have been developed to address this problem, and the Bonferroni correction is one of them. The Bonferroni correction allows many confidence intervals to be constructed while the overall significance level is still maintained [2]. This technique works by adjusting the significance level α according to the number of hypotheses under consideration. If we have p hypotheses denoted by θ1, θ2, . . . , θp in the group, the confidence interval of each hypothesis θj after applying the Bonferroni correction is given by

Pr(Lj ≤ θj ≤ Uj) ≥ 1 − α/p,    (9)

where Lj and Uj are the confidence lower bound and upper bound of each hypothesis θj respectively.
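As a small illustration (ours, not from [2]), Bonferroni-adjusted bootstrap percentile intervals for p parameters can be computed as follows; the function name and interface are hypothetical.

```python
import numpy as np

def bonferroni_percentile_intervals(boot_estimates, alpha=0.05):
    """boot_estimates: (N_B, p) array of bootstrap replications of p estimates.
    Each interval has level 1 - alpha/p, so the family keeps overall level alpha."""
    p = boot_estimates.shape[1]
    adj = alpha / p
    lower = np.percentile(boot_estimates, 100 * (adj / 2), axis=0)
    upper = np.percentile(boot_estimates, 100 * (1 - adj / 2), axis=0)
    return lower, upper

# A hypothesis theta_j = 0 is rejected when its adjusted interval excludes zero:
# reject = (lower > 0) | (upper < 0)
```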

III. NEW APPROACH TO CONSTRUCT A CONFIDENCE INTERVAL FOR DIRECTLINGAM

In this study, we propose a new approach to construct confidence intervals for DirectLiNGAM [1]. This new approach consists of two main components used to solve the problems mentioned earlier. We first explain the first component, the pruning technique, in Section III-A. The second component, the multiple comparison procedure, is given in Section III-B.

A. Pruning Technique

In DirectLiNGAM, it is possible to have more than one causal ordering as the answer, where all of them are correct orderings. For example, when the original matrix is

         x1    x2    x3
  x1      0     0     0
  x2    b21     0     0
  x3    b31     0     0

(original matrix; the corresponding original DAG has the edges x1 → x2 and x1 → x3).

The estimated ordering of this adjacency matrix can be either (x1 → x2 → x3) or (x1 → x3 → x2). Note that both of them are correct orderings; the only difference is an unnecessary element between x2 and x3 whose estimated connection strength is very small. Therefore, we apply the pruning technique to eliminate this unnecessary element in order to make both orderings correspond to the same DAG. For instance, if there are M patterns of orderings denoted by O1, O2, . . . , OM, then the distribution of an element b̂ij is equal to

Pr(b̂ij) = Σ_{m=1}^{M} Pr(b̂ij | Om) Pr(Om),    (10)

where Pr(Om) denotes the probability of the ordering Om among all M patterns, and Pr(b̂ij | Om) denotes the distribution of b̂ij given the ordering pattern Om. When we apply the bootstrap method with NB replications to DirectLiNGAM, we define Braw, a matrix whose size is (d × d × NB), as a collection of NB coefficient matrices of size (d × d). We use the adaptive lasso with the BIC-type selector described in Section II-C to prune the unnecessary elements from Braw. The result after pruning is a collection of size (d × d × NB) containing NB bootstrapped DAGs over the d variables. However, in the case of the bootstrap method, it is possible that the pruned Braw contains several patterns of DAG. If there are L different patterns of DAG denoted by G1, G2, . . . , GL, then the distribution of an element b̂ij is given by

Pr(b̂ij) = Σ_{ℓ=1}^{L} Pr(b̂ij | Gℓ) Pr(Gℓ),    (11)

where Pr(Gℓ) denotes the probability of the graph pattern Gℓ among all L patterns, and Pr(b̂ij | Gℓ) denotes the distribution of b̂ij given the graph pattern Gℓ. After removing the unnecessary elements, we need to modify the values of bij corresponding to the new pruned patterns. We create another matrix with the same size as Braw, called Bpruned, used for collecting all new values of bij. Then we estimate new values of b̂ij according to each DAG pattern one by one and store them into Bpruned. We propose an algorithm that follows the concept shown in (11) in Algorithm 3. Before presenting the algorithm, we define the variables used in it:

matrix X: the matrix with dimension (d × n) containing the original dataset with dimension d and sample size n.
matrix G: the matrix with dimension (d × d × L) collecting the L different patterns of DAG over d variables discovered after pruning.
vector F: the vector of size L whose elements contain the number of occurrences of each DAG pattern among the NB bootstrap replications. Note that each element of this vector can be considered as Pr(Gℓ) × NB, where ℓ = 1, 2, . . . , L.

Algorithm 3 How to modify the values of b̂ij
1: procedure MODIFY(X, G, F)
2:   for all DAG patterns in G, denoted by Gℓ, do
3:     repeat
4:       y ← data of size n collected from X for the parent variables found in the pattern Gℓ
5:       x ← data of size n collected from X for the variable(s) corresponding to y
6:       B ← B̂ estimated given Gℓ using y and x (more details in Algorithm 4)
7:       store the new B for pattern Gℓ into Bpruned
8:     until the frequency of Gℓ stored in F is reached
9:   end for
10: end procedure

With Algorithm 3, we obtain Bpruned, which contains all the modified values of b̂ij estimated by linear regression based on each pattern of DAG. In addition, the details of how we estimate the value of Pr(b̂ij | Gℓ) are given in Algorithm 4.

Algorithm 4 How to estimate new b̂ij given each Gℓ
1: procedure ESTIMATE(X, Gℓ)
2:   for all rows in Gℓ do
3:     y ← data of size n collected from X for the parent variables found in this row
4:     x ← data of size n collected from X for the variable(s) corresponding to y
5:     yrand ← data of size n randomly drawn with replacement from y
6:     xrand ← data of size n randomly drawn with replacement from x
7:     b ← new value of b̂ij in this row estimated by a linear regression on yrand and xrand
8:     store b → b̂ij in this row given Gℓ
9:   end for
10: end procedure
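A rough Python sketch of this re-estimation step (Algorithms 3 and 4) is given below. It is our own illustration: it assumes each DAG pattern is stored as a d × d parent-indicator matrix and that bootstrap rows are drawn jointly, so that each child variable keeps its paired parent values; the function name and data layout are hypothetical.

```python
import numpy as np

def reestimate_pruned(X, patterns, freqs, rng=None):
    """Sketch of Algorithms 3 and 4.

    X        : (n, d) data matrix
    patterns : list of L arrays of shape (d, d); patterns[l][i, j] = 1
               means x_j is a parent of x_i in DAG pattern G_l (assumed format)
    freqs    : list of L counts, how often each pattern occurred among
               the N_B bootstrap replications
    Returns an array of shape (sum(freqs), d, d) playing the role of B_pruned.
    """
    if rng is None:
        rng = np.random.default_rng()
    n, d = X.shape
    B_pruned = []
    for G, count in zip(patterns, freqs):
        for _ in range(count):                    # Algorithm 3: "repeat ... until"
            B = np.zeros((d, d))
            idx = rng.integers(0, n, size=n)      # joint resampling with replacement
            Xb = X[idx]
            for i in range(d):                    # Algorithm 4: one row at a time
                parents = np.flatnonzero(G[i])
                if parents.size == 0:
                    continue
                coef, *_ = np.linalg.lstsq(Xb[:, parents], Xb[:, i], rcond=None)
                B[i, parents] = coef
            B_pruned.append(B)
    return np.array(B_pruned)
```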

In the previous work on DirectLiNGAM, it was mentioned that the computational complexity of DirectLiNGAM is O(nd³M² + d⁴M³), where n is the sample size of the data, d is the number of variables and M is the maximal rank found by the low-rank decomposition used in the kernel-based independence measure. In this article, we execute DirectLiNGAM for NB bootstrap iterations, so the computational complexity of our new approach is the complexity of DirectLiNGAM multiplied by NB. In other words, the computational complexity of the new approach is O(nd³M²NB + d⁴M³NB).

B. Multiple Comparison

In order to solve the multiple comparison problem, we apply the Bonferroni correction mentioned earlier in Section II-D. In the case of LiNGAM, if we have d variables, then the size of the adjacency matrix estimated by DirectLiNGAM is (d × d). Following the idea of the Bonferroni correction, we construct d² − d confidence intervals for the elements bij of this matrix, excluding the elements on the diagonal. We perform NB bootstrap replications before we construct the confidence intervals. In order to construct confidence intervals for d² − d elements with overall significance level α, one can construct each confidence interval with coefficient 1 − α/(d² − d). Therefore, the confidence interval of each bij is given by

Pr(Lij ≤ bij ≤ Uij) ≥ 1 − α/(d² − d),    (12)

where bij is each element in the estimated adjacency matrix of size (d × d), and Lij and Uij are, respectively, the confidence lower bound and upper bound of the element bij. We estimate the values of Lij and Uij by using the bootstrap percentile method, that is, Lij is the element at the 100 · (1/2) · α/(d² − d) th percentile and Uij is the element at the 100 · (1 − (1/2) · α/(d² − d)) th percentile in the ordered NB replications of each bij.
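Putting the pieces together, a sketch (ours, with hypothetical function names) of constructing the Bonferroni-corrected percentile intervals from a bootstrapped adjacency array such as Bpruned, and of flagging significant edges, might look as follows.

```python
import numpy as np

def significant_edges(B_boot, alpha=0.05):
    """B_boot: array of shape (N_B, d, d) of bootstrapped adjacency matrices
    (either B_raw or B_pruned).  Returns Bonferroni-corrected percentile
    bounds L, U and a boolean matrix of significant directed edges."""
    n_boot, d, _ = B_boot.shape
    level = alpha / (d * d - d)              # per-interval level, as in (12)
    L = np.percentile(B_boot, 100 * (level / 2), axis=0)
    U = np.percentile(B_boot, 100 * (1 - level / 2), axis=0)
    significant = (L > 0) | (U < 0)          # interval excludes zero
    np.fill_diagonal(significant, False)     # diagonal elements are not tested
    return L, U, significant
```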

IV. SIMULATIONS

We performed two simulations in order to show how our new approach solves each problem found in the previous study. We first show the simulation of the distributions in Section IV-A. Then we show the simulation of the precision, recall and familywise error rate (FWER) in Section IV-B.

A. Simulation of Distribution

In the first simulation, we compared the distributions of b̂ij in Braw and Bpruned, both calculated by the bootstrap method. The point of this simulation is to check which method's distribution is closer to the real distribution. The measures we used to evaluate the closeness of two distributions are the moments: mean (1st moment), variance (2nd moment), skewness (3rd moment), and kurtosis (4th moment). In the process of generating the simulated data, we set the dimension d = 3 and the sample size n = 1000. Then we constructed a d × d adjacency matrix with some nonzero elements inserted following the original matrix in Section III-A. As mentioned earlier, this matrix gives two ordering patterns, which lead an unnecessary element to occur. We replaced each nonzero element in the matrix by a value randomly chosen from the interval [−1.5, −0.5] ∪ [0.5, 1.5]. We also fixed the distribution of the external influence variables ei to one of 18 non-Gaussian distributions [11] and generated simulation data with sample size n. Then the values of the observed variables were generated according to the LiNGAM model in (1). The simulation was operated as follows.

1) We set the number of bootstrap replications NB = 1000, then generated NB datasets for the real distribution and a single dataset for the bootstrap method. This NB indicates the number of points inside each distribution.
2) For the real distribution, we estimated the matrix B by using DirectLiNGAM for each of the NB datasets. Meanwhile, for the bootstrap method, we iterated bootstrap sampling from the single dataset for NB replications and estimated the matrix B by using DirectLiNGAM in each bootstrap replication.
   - For the real distribution: we defined the resulting matrix of size (d × d × NB) as [Breal]raw.
   - For the bootstrap method: we defined the resulting matrix of size (d × d × NB) as [Bboot]raw.
3) We applied the adaptive lasso with the BIC-type selector to both [Breal]raw and [Bboot]raw in order to prune unnecessary elements.
4) We discovered the sets of DAG patterns in both resulting matrices, denoted by Greal and Gboot, respectively. We then estimated new values of b̂ij given each DAG pattern in Greal and Gboot using Algorithm 3 and Algorithm 4.

Then we stored these new values of b̂ij, in each case, into two other matrices of size (d × d × NB) as follows:
   - For the real distribution: we named the new matrix [Breal]pruned.
   - For the bootstrap method: we named the new matrix [Bboot]pruned.
5) We repeated steps 1 to 4 for 500 times.
6) After finishing the 500 replications in step 5, we repeated steps 1 to 5 all over again, changing the value of NB to 2000, 3000, 4000 and 5000.
7) For each value of NB, we calculated the difference between two pairs of distributions, namely
   - [Breal]raw versus [Bboot]raw
   - [Breal]pruned versus [Bboot]pruned
by using the moments. Since there are d² − d elements of interest in the matrix, we calculated the difference by summing up the differences of all d² − d elements for each moment as follows:

(a) Mean difference (Δmean): E[ Σ_{i,j, i≠j} | E(b̂ij)real − E(b̂ij)boot | ]
(b) Variance difference (Δvar): E[ Σ_{i,j, i≠j} | V(b̂ij)real − V(b̂ij)boot | ]
(c) Skewness difference (Δskewness): E[ Σ_{i,j, i≠j} | S(b̂ij)real − S(b̂ij)boot | ]
(d) Kurtosis difference (Δkurtosis): E[ Σ_{i,j, i≠j} | K(b̂ij)real − K(b̂ij)boot | ]

where E(·), V(·), S(·) and K(·) denote the mean, variance, skewness and kurtosis, respectively. The simulation results are shown in Figure 1. The results are given by four graphs representing Δmean, Δvar, Δskewness and Δkurtosis, respectively. We can see that when the number of bootstrap iterations (NB) increases, the values of Δskewness and Δkurtosis between the real and bootstrap distributions of Braw fluctuate more. The unnecessary elements in the adjacency matrix cause this difference between the bootstrap distribution and the real distribution. Meanwhile, those differences for Bpruned become more stable as NB increases.
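For reference, the quantities (a)-(d) can be computed from the two (d × d × NB) arrays along the lines of the following sketch (ours), using SciPy's skewness and kurtosis estimators; the function name and data layout are assumptions.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def moment_differences(B_real, B_boot):
    """B_real, B_boot: arrays of shape (N_B, d, d).  Returns the summed
    absolute differences of mean, variance, skewness and kurtosis over
    the off-diagonal elements, i.e. one replication of (a)-(d)."""
    d = B_real.shape[1]
    off = ~np.eye(d, dtype=bool)                 # the d^2 - d off-diagonal entries
    stats = {
        "mean": lambda a: a.mean(axis=0),
        "var": lambda a: a.var(axis=0),
        "skewness": lambda a: skew(a, axis=0),
        "kurtosis": lambda a: kurtosis(a, axis=0),
    }
    return {name: np.abs(f(B_real) - f(B_boot))[off].sum() for name, f in stats.items()}
```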


Figure 1. The graphs representing the differences of mean, variance, skewness and kurtosis between the real distribution and the bootstrap method for both Braw and Bpruned when the number of bootstrap iterations (NB) is 1000, 2000, 3000, 4000 and 5000.

B. Simulation of Precision, Recall and FWER

The purpose of this simulation is to compare the performance of a naive approach, in which no multiple comparison correction is applied, with the proposed approach applying the Bonferroni correction. We performed 1000 iterations in order to evaluate the results. In each iteration, we generated datasets with dimension d and sample size n, then constructed the confidence intervals of the estimated B by using the naive approach and the Bonferroni correction. The process of generating the simulated data in this simulation is the same as that of the first simulation. We set the dimension d = 5 and the sample size n = 1000. The null hypothesis here is that each element bij in the original B is zero. Therefore, the FWER here is the probability that at least one of these null hypotheses is erroneously rejected. The simulation was operated as follows.
1) We generated a simulation dataset of size (d × n) with fixed distributions of the external influence variables ei as mentioned previously.
2) We iterated bootstrap sampling from the generated dataset for NB = 5000 replications. In each bootstrap replication, we estimated the matrix B of dimension (d × d) by using DirectLiNGAM. Therefore, the resulting matrix after performing NB bootstrap replications has size (d × d × NB). We define this resulting matrix as Braw.

3) We applied the adaptive lasso with the BIC-type selector to Braw in order to prune the unnecessary elements that appear in Braw.
4) We discovered the patterns of DAG in the resulting matrix from the previous step and used these patterns to create another matrix with the same size as Braw. We define this new matrix as Bpruned. The details of how we modify the values of bij are given in Algorithms 3 and 4.
5) We repeated steps 1 to 4 for 1000 times.
6) We constructed the confidence interval of each element in Braw and Bpruned by using both the naive and the Bonferroni method with overall significance level α = 0.05.
7) Let the precision be the ratio of the number of correctly estimated nonzeros to the total number of estimated nonzeros, and the recall be the ratio of the number of correctly estimated nonzeros to the total number of real nonzeros. We calculated the precision, recall and FWER of both methods (a computational sketch is given below).

The simulation results are shown in Table I. We found that the FWERs of both Braw and Bpruned with the Bonferroni correction are less than the overall significance level α, while those of both Braw and Bpruned with the naive approach exceed α. This indicates that the Bonferroni correction can control the FWER below α. Furthermore, the precisions of Bpruned with both the naive and the Bonferroni method are slightly higher than those of Braw. This is because the unnecessary elements are removed.
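For reference, a sketch (ours, with a hypothetical function name) of computing the precision, recall and FWER from the per-iteration significance decisions and the true adjacency matrix is given below.

```python
import numpy as np

def precision_recall_fwer(significant_list, B_true):
    """significant_list: list of boolean (d, d) matrices, one per simulation
    iteration, marking edges whose interval excludes zero.
    B_true: the true (d, d) adjacency matrix used to generate the data."""
    d = B_true.shape[0]
    off = ~np.eye(d, dtype=bool)
    true_nonzero = (B_true != 0) & off
    tp = fp = fn = 0
    any_false_rejection = 0
    for sig in significant_list:
        sig = sig & off
        tp += np.sum(sig & true_nonzero)
        fp += np.sum(sig & ~true_nonzero)
        fn += np.sum(~sig & true_nonzero)
        any_false_rejection += int(np.any(sig & ~true_nonzero))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fwer = any_false_rejection / len(significant_list)
    return precision, recall, fwer
```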


Figure 2. The graphs representing the precision, recall and FWER when the sample size (n) is 200, 500 and 1000 (upper row), and when the dimension (d) is 3, 5 and 7 (lower row).

Table I
PRECISION, RECALL AND FAMILYWISE ERROR RATE OF THE NAIVE AND BONFERRONI METHODS APPLIED TO BOTH Braw AND Bpruned WITH 1000 ITERATIONS WHERE NB = 5000

                          Naive     Bonferroni
  Braw      Precision     0.9605    0.9983
            Recall        1.0       1.0
            FWER          0.1350    0.0060
  Bpruned   Precision     0.9765    0.9986
            Recall        1.0       1.0
            FWER          0.0900    0.0050

In addition, we also tried adjusting the number of dimensions (d) and the sample size (n) of the simulated data in order to see how they affect the precision, recall and FWER. The results are shown in Figure 2. The graphs on the upper row show the changes in precision, recall and FWER when the sample size (n) is 200, 500 and 1000, while those on the lower row show the changes when the dimension (d) is 3, 5 and 7, respectively. The number of dimensions, or variables, in this simulation might be too small when considering scalability. This is because, when applying the Bonferroni correction, if the number of variables is large, then a much larger number of bootstrap iterations is required to determine appropriate lower and upper bounds of a confidence interval. In other words, if we have d variables, we need to simultaneously test d² − d hypotheses whose lower and upper bounds are estimated as the data at ranks NB · (α/2) · 1/(d² − d) and NB · (1 − (α/2) · 1/(d² − d)) among the ordered bootstrap replications, as illustrated by the short calculation below.
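The following small calculation (ours) shows how quickly the lower rank shrinks: it counts how many of the NB ordered replications fall below the Bonferroni-adjusted lower percentile for the settings discussed here.

```python
def lower_rank(n_boot, d, alpha=0.05):
    # number of ordered bootstrap replications below the adjusted lower percentile
    return n_boot * (alpha / 2) / (d * d - d)

print(lower_rank(5000, 7))    # ~2.98 for the 7-variable simulation with N_B = 5000
print(lower_rank(10000, 10))  # ~2.78 with 10 variables and N_B = 10000
```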

For this reason, we have to perform up to 5000 bootstrap iterations for the 7-variable data in order to obtain an appropriate confidence interval, and for 10 or more variables we would need 10000 or more bootstrap replications to obtain appropriate confidence intervals as well. Therefore, we consider this scalability issue as important future work.

V. APPLICATIONS

We also applied our new approach to the causal protein-signaling networks of [12]. That article proposed a network model representing the causal relations of 11 proteins in individual cells under 9 perturbation conditions. The network model representing the causal relations of the 11 protein variables is shown in Figure 3. We chose the dataset of proteins in the perturbation condition where anti-CD3/CD28, a general perturbation, was used as the reagent. This dataset consists of 11 protein variables with 853 samples. We performed 25000 bootstrap replications and then identified the significant relations in the estimated adjacency matrix by using 4 approaches: Braw with the naive approach, Braw with the Bonferroni approach, Bpruned with the naive approach, and Bpruned with the Bonferroni approach, where the overall significance level α is 0.05. The application results are shown in Figure 4. A blue element indicates a significant directed edge of a causal relation estimated by DirectLiNGAM which follows the network model in Figure 3, while a red element indicates a significant directed edge which does not.

Figure 4. Application to the causal protein-signaling network: (a) adjacency matrices estimated by DirectLiNGAM (Braw with naive, Braw with Bonferroni, Bpruned with naive, Bpruned with Bonferroni); (b) precision and recall of all approaches.

Figure 3. Causal protein-signaling network model [12].

We found that the precisions of all approaches except Braw with the naive approach are the same. This is because each nonzero element in the estimated adjacency matrix is quite small, so the confidence intervals with the Bonferroni approach, whose range is definitely wider than with the naive approach, include zero. Therefore, some nonzero elements in the adjacency matrix are considered not significant. However, the falsely found edges denoted by the red elements can be reduced by the Bonferroni correction.

VI. CONCLUSION

In this article, we have proposed a new approach to construct bootstrap confidence intervals in DirectLiNGAM. We have shown that the bootstrap distribution of the estimated values calculated by the new approach is more stable and closer to the real distribution. Moreover, with the use of the Bonferroni correction, the new approach can control the familywise error rate (FWER) to be less than the overall significance level α; that is, the new approach can reduce the number of falsely found significant edges.

ACKNOWLEDGMENT

Kittitat Thamvitayakul would like to convey thanks to MEXT for the financial support. Shohei Shimizu and Takashi Washio were partially supported by Grant-in-Aid for Young Scientists (B) #24700275 and by Grant-in-Aid for Scientific Research (B) #22300054, respectively. We also thank Satoshi Hara and Somchai Nuanprasert for useful discussions.

REFERENCES

[1] S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P. O. Hoyer, and K. Bollen, "DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model," Journal of Machine Learning Research, vol. 12, pp. 1225-1248, April 2011.
[2] Y. Hochberg and A. Tamhane, Multiple Comparison Procedures. Wiley, 1987.

[3] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen, "A linear non-Gaussian acyclic model for causal discovery," Journal of Machine Learning Research, vol. 7, pp. 2003-2030, 2006.
[4] G. Darmois, "Analyse générale des liaisons stochastiques," Review of the International Statistical Institute, vol. 21, pp. 2-8, 1953.
[5] Y. Zhang, R. Li, and C.-L. Tsai, "Regularization parameter selections via generalized information criterion," Journal of the American Statistical Association, vol. 105, pp. 312-323, 2010.
[6] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, vol. 96, pp. 1348-1360, 2001.
[7] R. Nishii, "Asymptotic properties of criteria for selection of variables in multiple regression," The Annals of Statistics, vol. 12, pp. 758-765, 1984.
[8] H. Zou, "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, vol. 101, pp. 1418-1429, 2006.
[9] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.
[10] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society: Series B, vol. 58, no. 1, pp. 267-288, 1996.
[11] F. R. Bach and M. I. Jordan, "Kernel independent component analysis," Journal of Machine Learning Research, vol. 3, pp. 1-48, 2002.
[12] K. Sachs, O. Perez, D. Pe'er, D. A. Lauffenburger, and G. P. Nolan, "Causal protein-signaling networks derived from multiparameter single-cell data," Science, vol. 308, no. 5721, pp. 523-529, April 2005.

Dec 29, 2015 - of statistics, and focus on ideas rather than formulas. This also ...... ences Monograph 38, Philadelphia, PA: Society for Industrial and Applied ... at http://iase-web.org/icots/9/proceedings/pdfs/ICOTS9_8A3_TINTLE.pdf. [371].