# A Fused Latent and Graphical Model for Multivariate Binary Data

## Abstract

We consider modeling, inference, and computation for analyzing multivariate binary data. We propose a new model that consists of a low dimensional latent variable component and a sparse graphical component. Our study is motivated by analysis of item response data in cognitive assessment and has applications to many disciplines where item response data are collected. Standard approaches to item response data in cognitive assessment adopt the multidimensional item response theory (IRT) models. However, human cognition is typically a complicated process and thus may not be adequately described by just a few factors. Consequently, a low-dimensional latent factor model, such as the multidimensional IRT models, is often insufficient to capture the structure of the data. The proposed model adds a sparse graphical component that captures the remaining ad hoc dependence. It reduces to a multidimensional IRT model when the graphical component becomes degenerate. Model selection and parameter estimation are carried out simultaneously through construction of a pseudo-likelihood function and properly chosen penalty terms. The convexity of the pseudo-likelihood function allows us to develop an efficient algorithm, while the penalty terms generate a low-dimensional latent component and a sparse graphical structure. Desirable theoretical properties are established under suitable regularity conditions. The method is applied to the revised Eysenck’s personality questionnaire, revealing its usefulness in item analysis. Simulation results are reported that show the new method works well in practical situations.


KEY WORDS: latent variable model, graphical model, IRT model, Ising model, convex optimization, model selection, personality assessment

## 1 Introduction

Latent variable models are prevalent in many studies. We consider the context of cognitive assessment, which has applications in many disciplines including education, psychology/psychiatry, political science, marketing, etc. For instance, in educational measurement, students’ solutions to test problems are observed to measure their skill levels; in psychiatric assessment, patients’ responses to diagnostic questions are observed to assess the presence or absence of mental health disorders; in political science, politicians’ voting behavior reflects their political views; in marketing analysis, consumers’ purchase history reflects their preferences. A common feature of these studies is that the observed human behaviors are driven by latent attributes that cannot be directly observed. Latent variable models can be employed in these contexts to describe the relationship between the observed behavior, which is often in the form of responses to items, and the underlying attributes.

Various linear and nonlinear latent variable models have been studied extensively in the literature [33]. In this paper, we focus on one of the widely used nonlinear models for categorical responses, namely, the item response theory (IRT) model. Popular IRT models include the Rasch model [45], the two-parameter logistic model, and the three-parameter logistic model [5], all of which are single-factor models. A natural extension is the multidimensional two-parameter logistic (M2PL) model [41], which assumes a multidimensional latent vector. Originating in psychological measurement [45], IRT models have been widely used in other fields for modeling multivariate binary data, such as political voting [2], marketing [14], and health sciences [29].

In this paper, we use the multidimensional two-parameter logistic model as the starting point. In particular, each observation is a random vector X with binary components, . Associated with each observation is an unobserved continuous latent vector . The conditional distribution of each response given the latent vector follows a logistic model

which is known as the *item response function*. Furthermore, the responses are assumed to be conditionally independent given , that is,

A prior distribution on is also imposed.

In recent years, computer-based instruments have become prevalent in educational and psychiatric studies, where a large number of responses with complex dependence structure are observed. A low-dimensional latent vector is often insufficient to capture all the dependence structure of the responses. Many contextual factors, such as item wording and question order, may exert additional influence on the item response [36]. Moreover, problem solving and task completion are likely to be complicated cognitive processes. It is conceivable that they cannot be adequately described by only a few latent attributes. Thus, model lack of fit is often observed in practical analysis [48]. From a technical standpoint, a low-dimensional latent variable model is simply not rich enough to capture all the dependence structure of the responses [52]. Ideally, we would include all the factors that influence the cognitive process, but that would result in a high-dimensional latent vector. A factor model with too many latent variables can be difficult to estimate and may lack interpretability. Thus, in practice, the dimension of the latent vector is often kept low in spite of the lack of fit.

We propose a new model that maintains a low-dimensional latent structure and captures the remaining dependence. We achieve this by including an additional graphical component to describe the dependence that is not explained by the low-dimensional latent vector. We call it the *fused latent and graphical* (FLaG) model. The new model captures two interpretable sources of dependence, i.e., the common dependence through the latent vector and the ad hoc dependence through a sparse graphical structure.

Figure 1 provides a graphical illustration of the multidimensional IRT model and the FLaG model. The left panel shows a graphical representation of the marginal distribution of the responses, where there is an edge between each pair of responses. Under the conditional independence assumption , there exists a latent vector . If we include in the graph, then there is no edge among ’s. Our concern is that the middle graph may be oversimplified and there may not exist a low-dimensional latent vector that achieves such a simple structure. The FLaG model (the right panel) is a natural extension. There remain edges among ’s even if is included, suggesting that does not fully explain the dependence among . However, the remaining dependence is substantially reduced compared with the left panel.

From the inference viewpoint, it is important to separate the common dependence due to the latent vector from the ad hoc dependence due to the graphical structure. To do so, we make inference based on the following assumptions. The variation of responses is mostly characterized by the latent vector. A low-dimensional latent vector model is largely correct, and the majority of the dependence among the responses is induced by the common latent vector. There is just a small remainder due to the graphical structure. In particular, the conditional graph does not have too many edges. Technical statements of these assumptions will be given in the sequel. During the estimation, we assume that neither the dimension of the latent vector nor the graphical structure is known. We estimate the latent structure and the conditional graph simultaneously by including penalty terms to regularize the dimension of the latent vector and the number of edges of the conditional graph. Thus, the resulting model contains a low-dimensional latent vector and a sparse conditional graph.

To model the graphical component, we adopt an undirected graph that characterizes the conditional independence/dependence structure among the responses [44]. In particular, we consider the Ising model, which originated in physics [31]. It has been employed to model multivariate binary data in political voting [3] and climate data [4]. Estimation of the Ising model via regularization has been studied by [30], [46], and [27].

The proposed modeling framework is related to the analysis of decomposing a matrix into low-rank and sparse components [9] and the statistical inference of a multivariate Gaussian model whose precision matrix admits the form of a low-rank matrix plus a sparse matrix [11]. However, the inference and optimization of the current model are different from the linear case. We construct a pseudo-likelihood function, based on which a regularized estimator is proposed for simultaneous model selection and parameter estimation. The optimization for the regularized estimator is convex, for which we develop an efficient algorithm through the alternating direction method of multipliers [6].

The rest of this paper is organized as follows. In Section 2, we first provide a brief review of the multidimensional item response theory model and the Ising model. It is then followed by the introduction of the FLaG model. Section 3 introduces a pseudo-likelihood function and presents the regularized pseudo-likelihood estimator. An efficient algorithm is developed and related computational issues are also discussed in Section 4. Section 5 includes simulation studies and a real data analysis.

## 2 Fused latent and graphical model

### 2.1 Two basic models

To begin with, we present two commonly used models as the basic building blocks: the multidimensional two-parameter logistic model and the Ising model. We consider that independent and identically distributed random vectors are observed. We use X to denote the th random observation and x its realization. Furthermore, we use X as a generic random vector equal in distribution to each X. Throughout this paper, we consider binary observations, that is, each takes values in . For more general types of categorical variables, the analysis can be extended if it can be fit into an exponential family.

Latent variable models assume that there exists an unobserved random vector associated with X, such that the conditional distribution of X given takes a simpler form that is easy to parameterize and estimate. For instance, the conditional variance X is substantially reduced compared to X, in which case the random vector X is very close to (or essentially lives on) a low-dimensional manifold generated by . Another popular approach is to assume that X is conditionally independent given , that is,

This is also known as the *local independence* assumption that is widely used in cognitive assessment [16]. In this case, the dependence among ’s is fully accounted for by the common latent vector and the variation of given is essentially considered as independent random noise.

Latent variable models largely fall into two classes based on the type of : discrete and continuous. In this paper, we consider the latter, a -dimensional continuous random vector. The multidimensional item response theory model is a popular class of nonlinear latent variable models. The conditional distribution of each given admits the form of a generalized linear model. In the case of binary data, the most popular is the multidimensional two-parameter logistic (M2PL) model

where is the loading vector of the latent vector and controls the marginal probability of . The above probability as a function of is also known as the *item response function*. Furthermore, the responses are assumed to be independent conditional on the latent vector , that is,

In addition, a prior distribution is imposed and the marginal distribution X is

In latent variable modeling, it is important to keep , the dimension of the latent vector, strictly less than , that of the observed data. In fact, in most cases, is much smaller than . As mentioned previously, a low-dimensional latent variable model is often insufficient to capture all the dependence among X. We take the multidimensional item response model as the basic latent variable model and further add a graphical component to it.
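To make the item response function concrete, here is a minimal sketch of the standard M2PL form under assumed notation: a loading vector `a` and an intercept `d` are illustrative names, since the paper's own symbols are not reproduced in this excerpt.

```python
import math

def irf_m2pl(theta, a, d):
    """M2PL item response function (standard form; notation assumed):
    P(X_j = 1 | theta) = sigmoid(a' theta + d)."""
    eta = sum(ak * tk for ak, tk in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-eta))

# A hypothetical item loading equally on two latent dimensions.
p = irf_m2pl([0.5, -0.2], [1.0, 1.0], 0.0)
```

The probability is monotone in each latent coordinate with a positive loading, which is what makes the loadings interpretable as discrimination parameters.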

We consider the Ising model as the graphical component; it is an undirected graphical model, also known as a Markov random field. The specification of an undirected graphical model consists of a set of vertices and a set of edges . The graph is undirected in the sense that if and only if . We associate a Bernoulli random variable to each vertex . The graph encodes the conditional dependence structure among . In particular, vertices and do not have an edge, , if and are conditionally independent given all others, . The Ising model parameterizes an undirected graph via the exponential family admitting the following probability mass function

where is a by symmetric matrix, i.e., , and is the normalizing constant

The matrix maps to a graphical structure. There is an edge between vertices and , , if and only if . According to the probability mass function , it is easy to check that and are conditionally independent given all other ’s, or , if .
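As a sanity check on this parameterization, a small Ising model can be evaluated exactly by enumerating all 2^J configurations. The sketch below assumes the common form P(x) ∝ exp(½ xᵀ S x) for x ∈ {0,1}^J; the paper's exact constants may differ, but the edge rule (an edge between two vertices iff the corresponding off-diagonal entry is nonzero) is the same. The matrix `S` is a toy example.

```python
import itertools
import math

def ising_pmf(S):
    """Exact pmf of an Ising model on J binary variables (J small).

    Assumes P(x) proportional to exp(0.5 * x' S x) with x in {0,1}^J;
    the graph has an edge j-k iff S[j][k] != 0."""
    J = len(S)
    configs = list(itertools.product([0, 1], repeat=J))
    weights = []
    for x in configs:
        e = 0.5 * sum(S[j][k] * x[j] * x[k] for j in range(J) for k in range(J))
        weights.append(math.exp(e))
    Z = sum(weights)  # normalizing constant: sum over all 2^J configurations
    return {x: w / Z for x, w in zip(configs, weights)}

# Toy example: variables 0 and 1 share an edge; variable 2 is isolated,
# so X_2 should be independent of (X_0, X_1).
S = [[0.2, 1.0, 0.0],
     [1.0, -0.3, 0.0],
     [0.0, 0.0, 0.5]]
pmf = ising_pmf(S)
```

Because `S` is block diagonal, the pmf factorizes over the two blocks, illustrating the conditional independence encoded by missing edges.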

### 2.2 Fused latent and graphical model

We propose a fused latent and graphical (FLaG) model that combines the IRT model and the Ising model. To do so, we present another representation of the IRT model. We write the item response function as

With the local independence assumption, the joint conditional distribution is

where and b

With this representation, the probability mass function of the Ising model in can be similarly written as

We combine these two models and write

We remove the term bx, because it is absorbed into the diagonal terms of . Notice that and thus . The squared terms in become linear . For technical convenience, we further impose a prior distribution on such that the joint distribution of X given the parameters is

where is the usual Euclidean norm on . Define the normalizing constant

The complete data likelihood function of a single observation is

The normalizing constant is not easy to compute and thus evaluation of the above likelihood is not straightforward. We will address this issue momentarily.

Both the IRT and the Ising models are special cases of . By setting , recovers the Ising model with parameter matrix ; by setting for , is equivalent to an IRT model. Conditional on , X follows the Ising model, in particular,

where for and . The graphical structure , in particular, , captures the remaining dependence that is not explained by the latent vector. For each , if we further condition on the rest of the random variables X, the conditional distribution admits the form of a logistic model

Thus, the conditional distribution can be written in a closed form, though the joint likelihood is often difficult to evaluate.
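The closed-form conditional just described can be sketched directly. The notation below (rows `A[j]` of the loading matrix, entries `S[j][k]` of the symmetric matrix) is assumed for illustration, and the paper's exact parameterization (e.g. halving constants) may differ.

```python
import math

def cond_prob_xj(j, x, theta, A, S):
    """P(X_j = 1 | theta, X_{-j} = x_{-j}) under the fused model (sketch).

    Assumed logistic form:
    logit P = A[j]' theta + S[j][j] + sum_{k != j} S[j][k] * x[k]."""
    linear = sum(A[j][d] * theta[d] for d in range(len(theta)))
    linear += S[j][j]  # diagonal term: controls the marginal of X_j
    linear += sum(S[j][k] * x[k] for k in range(len(x)) if k != j)
    return 1.0 / (1.0 + math.exp(-linear))
```

With all off-diagonal entries zero this reduces to the M2PL item response function, matching the statement that the IRT model is a special case.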

Lastly, we consider the marginal joint distribution of X with the latent vector integrated out, more precisely,

As the latent vector is not directly observed, our subsequent analysis of the estimation is mostly based on the above marginal likelihood. Notice that the loading matrix enters the likelihood function x in the form of . Therefore, is not identifiable by itself. We reparameterize the likelihood function and define . With a slight abuse of notation, we write

This is mostly because the latent vector is not directly observed and its loading matrix can only be identified up to a non-degenerate transformation. Note also that there is an identifiability issue between and , as the two matrices enter the marginal likelihood function in the form of . In particular, characterizes the dependence among X that is due to the latent structure and characterizes that of the graphical structure. In the analysis, assumptions will be imposed on the parameter space so that and are separable from each other based on the data.

## 3 On the maximum regularized pseudo-likelihood estimator

### 3.1 Estimation

In this section, we address issues related to estimation of the latent graphical model described in the previous section including evaluation of the likelihood function, dimension estimation/reduction of the latent vector, estimation of the conditional graph, parameter identifiability, and oracle property of the proposed estimator. To begin with, we assume that all parameters including the dimension of the latent vector and the conditional graph are unknown.

The first issue concerning the estimation is that evaluation of the marginal likelihood function involves the normalizing constant whose computational complexity grows exponentially fast in the dimension . In fact, its computation is practically infeasible even for a moderately large . We take a slightly different approach by considering the conditional likelihood of given X, which has a closed form as discussed previously. Let and . We have

This closed form is crucial for our inference. Let

denote the conditional likelihood for given X. Our estimation is based on a pseudo-likelihood function formed by multiplying all the conditional likelihoods together. The pseudo-likelihood based on independent observations is

where x is the th observation.
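A sketch of the resulting log pseudo-likelihood, under the assumption (consistent with the discussion of identifiability below) that the loading matrix enters only through L = AAᵀ, so that each conditional is a logistic model with coefficient matrix S + L; the constants are illustrative, not the paper's exact parameterization.

```python
import math

def log_pseudo_lik(X, S, L):
    """Log pseudo-likelihood: sum over observations i and items j of
    log P(x_ij | x_{i,-j}) under an assumed logistic form with matrix
    M = S + L: logit P = M[j][j] + sum_{k != j} M[j][k] * x[k]."""
    J = len(S)
    M = [[S[j][k] + L[j][k] for k in range(J)] for j in range(J)]
    total = 0.0
    for x in X:
        for j in range(J):
            eta = M[j][j] + sum(M[j][k] * x[k] for k in range(J) if k != j)
            p1 = 1.0 / (1.0 + math.exp(-eta))
            total += math.log(p1 if x[j] == 1 else 1.0 - p1)
    return total
```

Each term is a logistic log-likelihood, so the function is concave in (S, L), which is the convexity used by the algorithm in Section 4.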

In the above pseudo-likelihood, and are unknown parameters. Besides, the dimension of the latent vector and the conditional graphical structure implied by are also unknown. We will estimate the set of edges . As for the dimension of , to ensure identifiability, we assume that the loading matrix is of full column rank; otherwise, we can always reduce the dimension and make full column rank. Thus, also has rank . Notice that is a positive semidefinite matrix. The rank of is the same as the number of its non-zero eigenvalues. To estimate the conditional graph and the dimension of the latent vector, we impose regularization on and .

As mentioned previously, the parameters and enter the likelihood function in the form of . In principle, one cannot identify from based on the data only. We will impose additional assumptions to ensure their identifiability (or uniqueness of the estimator) based on the following rationale. We believe that the multidimensional IRT model (with the local independence assumption) is largely correct. The latent vector accounts for most of the dependence/variation of the multivariate response vector . In the context of cognitive assessment, this is interpreted to mean that a person’s responses to items are mostly driven by a few latent attributes. The remaining dependence is rather low. Thus, a crucial assumption in our estimation is that the graphical structure explains a small portion of the dependence in . To quantify this assumption, we assume that the matrix is sparse. In addition, the dimension of the latent vector stays low. These assumptions will be made precise in later discussions, where theoretical properties of our estimator are established.

Based on the above discussion, we propose an estimator by optimizing a regularized pseudo-likelihood

where is defined by and the minimization is subject to the constraints that is positive semidefinite and is symmetric. Throughout this paper, we use to denote that is positive semidefinite.

We provide some explanations of the two penalty terms O and . In the first term, O is a matrix such that it is identical to except that its diagonal entries are all zero, that is, O where for and . Thus,

which penalizes the number of nonzero ’s, which is also the number of edges in the conditional Ising model. Notice that we do not penalize the diagonal elements of because controls the marginal distribution of . As mentioned previously, the constant term in the IRT model is absorbed into the diagonal term . By increasing the regularization parameter , the number of nonzero off-diagonal elements decreases and thus the number of edges in the conditional graph also decreases. The penalty was originally proposed in [56] for linear models and later adopted in the context of graphical models [42].

The second penalty term is . Notice that is a positive semidefinite matrix and admits the following eigendecomposition

where is an orthogonal matrix, and . The nuclear norm can be alternatively written as

Therefore, penalizes the number of nonzero eigenvalues of , which is the same as the rank of . This regularization was first proposed in [19], and its statistical properties are studied in [1]. The estimators and depend on the regularization parameters and , whose choice will be described in the sequel. To simplify notation, we omit the indices and in the notation and .
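For a positive semidefinite matrix, the nuclear norm equals the sum of its eigenvalues (its trace), which is what makes it a convex surrogate for the rank. A small numerical check with toy matrices:

```python
import numpy as np

def nuclear_norm_psd(L):
    """Nuclear norm of a PSD matrix = sum of its (nonnegative) eigenvalues."""
    eig = np.linalg.eigvalsh(L)
    # clip guards against tiny negative values from floating-point error
    return float(np.sum(np.clip(eig, 0.0, None)))

# A rank-2 PSD matrix L = A A' built from a 4 x 2 loading matrix (toy numbers).
A = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, 1.0],
              [1.0, 0.5]])
L = A @ A.T
```

Here `L` has exactly two nonzero eigenvalues, mirroring the statement that the rank of L equals the dimension of the latent vector.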

The regularized estimators and naturally yield estimators of the dimension of and the conditional graph . In particular, an estimator of the dimension of is

and an estimator of the conditional graph is

In what follows, we state the theoretical properties of this regularized pseudo-likelihood estimator.

### 3.2 Theoretical properties of the estimator

In this subsection, we present the properties of the regularized estimator defined as in and the estimators and defined as in and . Throughout the discussion, let and denote the true model parameters.

To state the assumptions, we first need the following technical developments. The pseudo-likelihood (and the likelihood) function depends on and through . Define

If we reparameterize , its information associated with the pseudo-likelihood is given by

which is a by matrix and is the true parameter matrix. For a differentiable manifold , we let denote its tangent space at . We refer to [53] for the definition of a manifold and its tangent space. The first condition, which ensures local identifiability, is as follows.

The matrix is positive definite restricted to the set

That is, for each vector , ^* and the equality holds if and only if .

In what follows, we describe a few submanifolds of and their tangent spaces. Let be the set of symmetric matrices admitting the same sparsity as that of , that is,

On considering that is a submanifold of , its tangent space at is itself, that is,

Define the set of matrices

where . The set is differentiable in a neighborhood of . Therefore, it is a submanifold of within the neighborhood of and its tangent space at is well defined. To describe the tangent space of at , we consider its eigendecomposition

where is a matrix satisfying , is the identity matrix, and is a diagonal matrix consisting of the (positive) eigenvalues of . Then, the tangent space of at is

We make the following assumptions on and .

The positive eigenvalues of are distinct.

The intersection between and is trivial, that is, , where is the zero-matrix.

Lastly, we present an irrepresentable condition that is key to the consistency of the regularized estimator. Define a linear operator F,

where ^* is the matrix in . With a slight abuse of notation, we let ^* denote matrix-vector multiplication where and are vectorized with their elements being arranged in the same order as the order of the derivatives in ^*. The map is the projection operator of matrix onto the manifold with respect to the inner product for matrices,

That is, is the matrix in minimizing the distance to induced by the matrix inner product “”. We define a linear operator F,

For a linear subspace , denote its orthogonal complement in . For a matrix , we apply the function to each of its element, that is

Furthermore, for each constant , define a norm for a matrix couple of appropriate dimensions such that

where and are the maximum and spectral norms, respectively. Here, the spectral norm is defined as the largest eigenvalue of for a positive semidefinite matrix . The last condition is stated as follows.

There exists a positive constant such that

The following lemma guarantees that F in is well defined.

With these conditions, we present the theoretical properties of our estimator.

We provide a discussion of the technical conditions. Condition A1 ensures local identifiability of the parameter . Given that the likelihood function is log-concave, the parameter can be estimated consistently by the pseudo-likelihood. Condition A3 corresponds to the *transversality condition* in [11]. Lastly, Condition A4 is similar to the irrepresentable condition [60], which plays an important role in the consistency of sparse model selection based on -norm regularization.

### 3.3 On the choice of tuning parameters

Theorem ? provides a theoretical guideline for choosing the regularization parameters and . Nonetheless, it leaves considerable freedom. In what follows, we provide a more specific choice of and that will be used in the simulation study and the real data analysis.

We choose and to minimize the Bayesian information criterion (BIC) [50], which is known to yield consistent variable selection. The BIC is defined as

where is the current model, is the maximized likelihood for a given model , and is the number of free parameters in . In this study, we replace the likelihood function with the pseudo-likelihood function. To avoid ambiguity, we change the notation and use and to denote the estimator in corresponding to regularization parameters and . Let

be the submodel selected by the tuning parameters . It contains all models in which the positive semidefinite matrix has rank no larger than that of and the symmetric matrix has the same support as . We select the tuning parameters and such that the corresponding model minimizes the Bayesian information criterion based on the pseudo-likelihood

where the number of parameters in is

for . The two terms are the numbers of free parameters in and respectively. Specifically, the number of free parameters in is counted as follows. Let be the eigendecomposition of , where is a diagonal matrix and columns of are unit-length eigenvectors of . has parameters and has parameters due to constraint . Combining them together, has parameters.
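The parameter count just described can be written as a short helper; `J`, `r`, and `n_edges` stand for the number of items, the rank of the low-rank component, and the number of selected edges.

```python
def num_free_params(J, r, n_edges):
    """Free-parameter count for the BIC, following the counting in the text:

    - L = Q Lambda Q' of rank r: r eigenvalues in Lambda, plus
      J*r - r*(r+1)/2 parameters for the orthonormal columns of Q,
      for a total of J*r - r*(r-1)/2;
    - S with fixed support: J diagonal entries plus one per edge
      (each edge contributes a single free parameter by symmetry).
    """
    params_L = J * r - r * (r - 1) // 2
    params_S = J + n_edges
    return params_L + params_S
```

For instance, with J = 30 items, a single factor, and 15 edges (setting 1 of the simulation study), the count is 30 + 45 = 75.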

Maximizing the pseudo-likelihood in is no longer a convex optimization problem. However, our experience shows that this nonconvex optimization can be solved stably using a generic numerical solver, with starting point . The tuning parameters are finally selected by

In addition, the corresponding maximal pseudo-likelihood estimates of and are used as the final estimate of and :

## 4 Computation

In this section, we describe the computation of the regularized estimator in , which is not straightforward for two reasons. First, the coordinate-wise descent algorithms [23], which are widely used in convex optimization problems with norm regularization, do not apply well to this problem. These algorithms optimize the objective function with respect to one parameter at a time. For our case, updating with respect to is not in a closed form. Moreover, the optimization is constrained on a space where the matrix is positive semidefinite. As a consequence, it becomes a semidefinite programming problem, for which a standard approach is the interior point methods [7]. The computational cost for each iteration and the memory requirements of an interior point method are prohibitively high for this problem, especially when is large.

We propose a method that avoids these problems by taking advantage of the special structure of the and nuclear norms by means of the alternating direction method of multipliers [6]. The key idea is to decompose the optimization of into subproblems that can be solved efficiently.

Consider two closed convex functions

where the domains and of functions and are closed convex subsets of , and is nonempty. Both and are possibly nondifferentiable. The alternating direction method of multipliers is an iterative algorithm that solves the following generic optimization problem:

or equivalently

To describe the algorithm, we first define proximal operators : as

and :

where is the usual Euclidean norm on and is a scale parameter that is a fixed positive constant. The algorithm starts with some initial values , , . At the th iteration, is updated according to the following steps until convergence

$x^{t+1} = \operatorname{prox}_{f}(z^{t} - u^{t})$;

$z^{t+1} = \operatorname{prox}_{g}(x^{t+1} + u^{t})$;

$u^{t+1} = u^{t} + x^{t+1} - z^{t+1}$.
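The three-step iteration can be illustrated on a scalar toy problem whose solution is known in closed form. This is a generic scaled-form ADMM sketch in the style of [6], not the paper's actual subproblems; all names and the toy objective are illustrative.

```python
def prox_quadratic(v, rho, a):
    # prox of f(x) = 0.5*(x - a)^2: argmin_x f(x) + (rho/2)*(x - v)^2
    return (a + rho * v) / (1.0 + rho)

def soft_threshold(v, t):
    # prox of t*|x| at unit scale
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def admm_scalar_lasso(a, lam, rho=1.0, iters=200):
    """Minimize 0.5*(x - a)^2 + lam*|x| by scaled-form ADMM.

    The closed-form minimizer is soft_threshold(a, lam), so convergence
    of the three updates can be checked directly."""
    x = z = u = 0.0
    for _ in range(iters):
        x = prox_quadratic(z - u, rho, a)     # step 1: prox of f
        z = soft_threshold(x + u, lam / rho)  # step 2: prox of g
        u = u + x - z                         # step 3: dual update
    return z
```

The same pattern scales to the paper's problem: step 1 splits over (L, S, and the logistic part), step 2 is a quadratic program with a closed form, and step 3 is arithmetic.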

The algorithm is fast when the proximal operators and can be efficiently evaluated. The convergence properties of the algorithm are summarized in the following result in [6]. Let be the minimized value in .

We would like to point out that the assumption on the Lagrangian is mild [7]. In particular, if strong duality holds for the problem and and are the corresponding primal and dual optimal points, then forms a saddle point for the Lagrangian.

We now adapt this algorithm to the optimization of the regularized pseudo-likelihood. In particular, we reparameterize and let (viewed as a vector). Let be defined as in . We define

O ,

,

and , .

Obviously, the optimization can be written as

In addition, it is easy to verify that is a closed convex set and is a closed convex function. Furthermore, is closed and it is also convex since the symmetric and positive semidefinite constraints are convex constraints. is convex, since the pseudo-likelihood function is the sum of several log-likelihood functions of the logistic models that are all concave [7]. Because the and nuclear norms are convex functions, is a convex function. Thanks to the continuity, is closed. In summary, is a closed convex function on its domain .

We now present each of the three steps of the ADMM algorithm and show that the proximal operators and are easy to evaluate. Let

Step 1. We solve . Due to the special structure of , , and can be updated separately. More precisely,

where is the matrix Frobenius norm, defined as for a matrix . We now discuss the optimization problems -. First, and can be computed in closed forms. More precisely, when and are both symmetric matrices (which is guaranteed when , , , , , and are chosen to be symmetric),

where is its eigendecomposition and is a diagonal matrix with its th diagonal element being . The operation is called eigenvalue thresholding. In addition, is updated as

and its off-diagonal entries are

Furthermore, solving is equivalent to solving -dimensional unconstrained convex optimization problems. To see this, we denote

as the th column of a matrix . According to equation , the conditional likelihood x defined by can be written as a function of that only depends on and we denote it as x. As a result, evaluating can be decomposed into solving

for . It can be solved efficiently using a standard solver, such as the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method [25], where could be as large as a few hundred.

Step 2. We solve . Denote . Then evaluating becomes:

This is a quadratic programming problem subject to linear constraints and thus can be solved in a closed form. Specifically,

Step 3 is simple arithmetic. The advantage of the proposed algorithm is its low computational and memory cost at each iteration. In particular, the nondifferentiable and nuclear norms and the positive semidefinite constraint, which cause difficulty for a generic solver, are efficiently handled by closed-form updates. In addition, the -dimensional function is decomposed into a sum of functions that can be optimized in parallel.

## 5 Simulation Study and Real Data Analysis

In this section, we first conduct simulation studies to investigate the performance of the proposed methods. Then we illustrate the method by analyzing a real data set of personality assessment.

### 5.1 Simulation

We consider items and sample sizes and under the following three settings.

latent variable. For the -matrix, all off-diagonal elements are zero except for for . There are in total 15 edges in the graph. This graph is equivalent to grouping the variables in pairs, {1,2}, {3,4}, ..., and {29, 30}. There is an edge between each pair.

latent variable. For , , , and are nonzero. There are 30 edges in the conditional graph. This is equivalent to grouping the variables in triples, {1,2,3}, {4,5,6}, ..., {28,29,30}. There are three edges within each triple.

, and the conditional graph is the same as that of setting 1.

The conditional graphs are visualized in Figure 2, where the upper and the lower panels represent the graphs in settings 1 and 2, respectively.

For each model setting and each sample size, we generate 50 independent data sets. The tuning parameters are chosen based on the Bayesian information criterion as described in Section 3.3.

**Data generation.** To generate a sample from the latent graphical model, we first generate from its marginal distribution

The above summation is computationally feasible because of the sparse graphical structure as in Figure 2. The latent vector is sampled from the above marginal distribution by the accept/reject algorithm. Given , the conditional distribution of x factors into independent pairs and triples.
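Because the conditional graph splits into small blocks, the responses given the latent vector can be sampled exactly, block by block, by enumerating each block's configurations. A sketch under an assumed parameterization P(x | θ) ∝ exp(xᵀAθ + ½ xᵀSx); the argument names are illustrative.

```python
import itertools
import math
import random

def sample_block(theta, A_block, S_block, rng):
    """Sample one block (pair or triple) of responses given theta.

    Given theta, the conditional law of X factors over the blocks of the
    conditional graph, so each block of size m can be sampled exactly by
    enumerating its 2^m configurations (m = 2 or 3 here)."""
    m = len(S_block)
    configs = list(itertools.product([0, 1], repeat=m))
    w = []
    for x in configs:
        e = sum(x[j] * sum(A_block[j][d] * theta[d] for d in range(len(theta)))
                for j in range(m))
        e += 0.5 * sum(S_block[j][k] * x[j] * x[k]
                       for j in range(m) for k in range(m))
        w.append(math.exp(e))
    # inverse-CDF draw over the enumerated configurations
    r = rng.random() * sum(w)
    acc = 0.0
    for x, wi in zip(configs, w):
        acc += wi
        if r <= acc:
            return x
    return configs[-1]
```

Sampling a full response vector then amounts to drawing θ once and calling this routine for each pair or triple.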

**Evaluation criteria.** To assess the performance of the dimension reduction and the estimation of the graph, we consider the criterion . For a particular data set, if and only if there exists a pair of , such that and graph induced by is the same as that by , where and are the true parameters.

Furthermore, we evaluate the BIC-based tuning parameter selection via criteria , , and . Let be the final estimates of the selected model defined as in . Criterion evaluates the estimation of the rank of ,

In addition, evaluates the positive selection rate of the network structure of , defined as

Furthermore, evaluates the false discovery rate,

If the tuning parameter is reasonably selected, we expect that , is close to 1, and is close to 0.
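The selection criteria can be sketched as follows; the definitions of the positive selection rate and false discovery rate are inferred from the surrounding text, and the edge-extraction convention is illustrative.

```python
def edge_set(S, tol=1e-8):
    """Edges implied by a symmetric matrix: pairs (j, k), j < k,
    whose off-diagonal entry is (numerically) nonzero."""
    J = len(S)
    return {(j, k) for j in range(J) for k in range(j + 1, J)
            if abs(S[j][k]) > tol}

def psr_fdr(S_hat, S_true):
    """Positive selection rate and false discovery rate of the estimated
    conditional graph against the true one."""
    E_hat, E_true = edge_set(S_hat), edge_set(S_true)
    psr = len(E_hat & E_true) / len(E_true) if E_true else 1.0
    fdr = len(E_hat - E_true) / len(E_hat) if E_hat else 0.0
    return psr, fdr
```

A well-tuned estimator should drive the positive selection rate toward 1 and the false discovery rate toward 0 as the sample size grows.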

In Figure 3, the averages of over 50 independent data sets versus the sample sizes are presented under all settings. Based on Figure 3, we observe that, as the sample size becomes larger, the probability that the path of the regularized estimator captures the true model increases, and it is close to 1 when the sample size is over 1000. The graphical structure is difficult to capture when the sample size is small.

The results of model selection based on BIC are presented in Table 1, where the mean of and the means and standard errors of , and over 50 replications are presented. According to these results, the BIC tends to choose a model that is close to the true one. In particular, according to , the number of latent factors (i.e. the rank of ) can be recovered with high probability with a reasonable sample size. Specifically, the numbers of factors are recovered without error for all situations except when for Model 3. For this case, BIC selects a single-factor model, which is mainly due to the small sample size. In addition, the edges in the conditional graph are recovered with high probability according to . Based on