# Learning Attribute Patterns in High-Dimensional Structured Latent Attribute Models

Yuqi Gu and Gongjun Xu
Department of Statistics
University of Michigan
###### Abstract

Structured latent attribute models (SLAMs) are a special family of discrete latent variable models widely used in social and biological sciences. This paper considers the problem of learning significant attribute patterns from a SLAM with potentially high-dimensional configurations of the latent attributes. We address the theoretical identifiability issue, propose a penalized likelihood method for the selection of the attribute patterns, and further establish the selection consistency in such an overfitted SLAM with a diverging number of latent patterns. The good performance of the proposed methodology is illustrated by simulation studies and two real datasets in educational assessment.

This research is partially supported by National Science Foundation grants SES1659328 and DMS-1712717, and Institute of Education Sciences grant R305D160010.

## 1 Introduction

Structured Latent Attribute Models (SLAMs) are widely used statistical and machine learning tools in modern social and biological sciences. SLAMs offer a framework to achieve fine-grained inference on individuals’ latent attributes based on their observed multivariate responses, and also to obtain the latent subgroups of a population based on the inferred attribute patterns. In practice, each latent attribute is often assumed to be discrete and to have a particular scientific interpretation, such as mastery or deficiency of some targeted skill in educational assessment (Junker and Sijtsma, 2001; de la Torre, 2011), presence or absence of some underlying mental disorder in psychiatric diagnosis (Templin and Henson, 2006; de la Torre et al., 2018), or the existence or nonexistence of some disease pathogen in subjects’ biological samples (Wu et al., 2017). In these scenarios, the framework of SLAMs enables one to simultaneously achieve the machine learning task of clustering and the scientific purpose of diagnostic inference.

Different from the exploratory nature of traditional latent variable models, SLAMs often have some additional scientific information for model fitting. In particular, the observed variables are assumed to have certain structured dependence on the unobserved latent attributes, where the dependence is introduced through a binary design matrix to respect the scientific context. The rich structure and nice interpretability of SLAMs make them popular in many scientific disciplines, such as cognitive diagnosis in educational assessment (Junker and Sijtsma, 2001; von Davier, 2008; Henson et al., 2009; Rupp et al., 2010; de la Torre, 2011), psychological and psychiatric measurement for diagnosis of mental disorders (Templin and Henson, 2006; de la Torre et al., 2018), and epidemiological and medical studies for scientifically constrained clustering (Wu et al., 2017, 2018).

One challenge in modern applications of SLAMs is that the number of discrete latent attributes could be large, leading to a high-dimensional space for all the possible configurations of the attributes, i.e., high-dimensional latent attribute patterns. In many applications, the number of potential attribute patterns is much larger than the sample size. For scientific interpretability and practical use, it is often assumed that not all the possible attribute patterns exist in the population. Examples with a large number of potential latent attribute patterns and moderate sample sizes can be found in educational assessment (Lee et al., 2011; Choi et al., 2015; Yamaguchi and Okada, 2018) and the medical diagnosis of disease etiology (Wu et al., 2017). For instance, Example 1 in Section 2 presents a dataset from the Trends in International Mathematics and Science Study (TIMSS), which has $K=13$ binary latent attributes (i.e., $2^{13}=8192$ possible latent attribute patterns) while only 757 students’ responses are observed. In cognitive diagnosis, it is of interest to select the significant attribute patterns among these $2^{13}$ candidates. In such high-dimensional scenarios, existing estimation methods often tend to over-select latent attribute patterns, and may not scale to datasets with a large number of latent attribute patterns. Moreover, theoretical questions remain open on whether and when the “sparse” latent attribute patterns are identifiable and can be consistently learned from the data.

Identifiability of SLAMs has long been an issue in the literature (e.g., von Davier, 2008; DeCarlo, 2011; Maris and Bechger, 2009; von Davier, 2014; Xu and Zhang, 2016). SLAMs can be viewed as a special class of restricted latent class models, and their identifiability has a close connection with the study of tensor decompositions, since the probability distribution of a SLAM can be viewed as a mixture of specially structured tensor products. In the literature, it is known that unrestricted latent class models are not identifiable (Gyllenberg et al., 1994). Nonetheless, Carreira-Perpiñán and Renals (2000) showed through extensive simulations that they are almost always identifiable, which the authors termed practical identifiability. Allman et al. (2009) further established generic identifiability of various latent variable models, including latent class models. Generic identifiability is weaker than strict identifiability, and it implies that the model parameters are almost surely identifiable with respect to the Lebesgue measure of the parameter space. The study of Allman et al. (2009) is based on an identifiability result for the three-way tensor decomposition in Kruskal (1977). Other analyses of tensor decomposition have also been developed to study the identifiability of various latent variable models (e.g., Drton et al., 2007; Hsu and Kakade, 2013; Anandkumar et al., 2014; Bhaskara et al., 2014; Anandkumar et al., 2015; Jaffe et al., 2018). However, the structural constraints imposed by the design matrix make these results not directly applicable to SLAMs.

With the aid of the structural constraints, strict identifiability of SLAMs has been obtained under certain conditions on the design matrix (Xu, 2017; Xu and Shang, 2018; Gu and Xu, 2018a, b). However, these works either make the strong assumption that all the possible combinations of the attributes exist in the population with positive probabilities (Xu, 2017; Xu and Shang, 2018), or assume these significant attribute patterns are known a priori (Gu and Xu, 2018a). These assumptions are difficult to meet in practice for SLAMs with high-dimensional attribute patterns, and the fundamental learnability issue of the sparse attribute patterns in SLAMs remains unaddressed.

In terms of estimation, learning sparse attribute patterns from a high-dimensional space of latent attribute patterns is related to learning the significant mixture components in a highly overfitted mixture model. Researchers have shown that the estimation of the mixing distributions in overfitted mixture models is technically challenging and usually leads to nonstandard convergence rates (e.g., Chen, 1995; Ho and Nguyen, 2016; Heinrich and Kahn, 2018). Estimating the number of components in the mixture model goes beyond only estimating the parameters of a mixture, by learning at least the order of the mixing distribution (Heinrich and Kahn, 2018). This problem was also studied in Rousseau and Mengersen (2011) from a Bayesian perspective; however, the Bayesian estimator in Rousseau and Mengersen (2011) may not guarantee the frequentist selection consistency, as will be shown in Section 3. In the setting of SLAMs with the structural constraints and a large number (larger than the sample size) of potential latent attribute patterns, it is not clear how to consistently select the significant attribute patterns.

Our contributions in this paper contain the following aspects. First, we characterize the identifiability requirement needed for a SLAM with an arbitrary subset of attribute patterns to be learnable, and establish mild identifiability conditions. Our new identifiability conditions significantly extend the results of previous works (Xu, 2017; Xu and Shang, 2018) to more general and practical settings. Second, we propose a statistically consistent method to perform attribute pattern selection. In particular, we establish a theoretical guarantee for selection consistency in the setting of high-dimensional latent attribute patterns, where both the sample size and the number of latent attribute patterns can go to infinity. Our analysis also shows that imposing the popular Dirichlet prior on the population proportions would fail to select the true model consistently when the convergence rate of the SLAM is slower than the usual root-$N$ rate, where $N$ denotes the sample size. As for computation, we develop two approximation algorithms to maximize the penalized likelihood for pattern selection. In addition, we propose a fast screening strategy for SLAMs as a preprocessing step that can scale to a huge number of potential latent attribute patterns with a moderate sample size, and establish its sure screening property.

The rest of the paper is organized as follows. Section 2 introduces the general setup of structured latent attribute models and motivates our study. Section 3 investigates the learnability requirement and proposes mild sufficient conditions for learnability. Section 4 proposes the estimation methodology and establishes theoretical guarantees for the proposed methods. Section 5 and Section 6 include simulations and real data analysis, respectively. The proofs of the theoretical results are deferred to the Appendix.

## 2 Model Setup and Motivation

### 2.1 Structured Latent Attribute Models and Examples

We first introduce the general setup of SLAMs. Consider a SLAM with $J$ items which depend on the $K$ latent attributes of interest. There are two types of subject-specific variables in the model, the observed responses $\boldsymbol R$ to the items and the latent attribute pattern $\boldsymbol\alpha$, both assumed to be binary vectors in this work. The $J$-dimensional vector $\boldsymbol R=(R_1,\ldots,R_J)^\top\in\{0,1\}^J$ denotes the observed binary responses to the set of $J$ items. The $K$-dimensional vector $\boldsymbol\alpha=(\alpha_1,\ldots,\alpha_K)^\top\in\{0,1\}^K$ denotes a profile of existence or non-existence of the $K$ attributes.

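A key structure that specifies how the items depend on the latent attributes is called the $Q$-matrix, which is a $J\times K$ matrix with binary entries. We denote $Q=(q_{j,k})$, where $q_{j,k}\in\{0,1\}$ reflects whether or not item $j$ requires (i.e., depends on) attribute $k$. We denote the $j$th row vector of $Q$ by $\boldsymbol q_j$; the $K$-dimensional binary vector $\boldsymbol q_j$ then reflects the full attribute requirements of item $j$. For an attribute pattern $\boldsymbol\alpha$, we say $\boldsymbol\alpha$ possesses all the required attributes of item $j$ if $\boldsymbol\alpha\succeq\boldsymbol q_j$, where $\boldsymbol\alpha\succeq\boldsymbol q_j$ denotes $\alpha_k\ge q_{j,k}$ for all $k\in\{1,\ldots,K\}$. Example 1 below gives an example of the $Q$-matrix.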
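The requirement relation $\boldsymbol\alpha\succeq\boldsymbol q_j$ can be checked mechanically. The following is a minimal Python sketch, assuming patterns and rows of the $Q$-matrix are stored as tuples of 0/1; the function name `possesses` and the toy 3-item, 2-attribute matrix are illustrative assumptions, not taken from the paper.

```python
from itertools import product

def possesses(alpha, q_row):
    """Return True if alpha >= q_row elementwise, i.e. alpha has every attribute the item requires."""
    return all(a >= q for a, q in zip(alpha, q_row))

# Hypothetical Q-matrix with J = 3 items and K = 2 attributes (illustrative only).
Q = [(1, 0), (0, 1), (1, 1)]

# For every pattern in {0,1}^2, list the (0-based) items whose requirements it meets.
capable_items = {
    alpha: [j for j, q_row in enumerate(Q) if possesses(alpha, q_row)]
    for alpha in product((0, 1), repeat=2)
}
```

Under this toy matrix, only the pattern possessing both attributes meets the requirements of all three items.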

###### Example 1

The Trends in International Mathematics and Science Study (TIMSS) is a large-scale cross-country educational assessment. TIMSS has evaluated the mathematics and science abilities of fourth and eighth graders every four years since 1995. Researchers have used SLAMs to analyze the TIMSS data (e.g., Lee et al., 2011; Choi et al., 2015; Yamaguchi and Okada, 2018). For example, a $Q$-matrix constructed by mathematics educators was specified for the TIMSS 2003 eighth grade mathematics assessment (Choi et al., 2015). Thirteen attributes ($K=13$) are identified, which fall into five big categories of skill domains measured by the eighth grade exam: Number, Algebra, Geometry, Measurement, and Data. Table 1 shows the first and last three rows of the $Q$-matrix.

The $Q$-matrix constrains the model parameters in a certain way to reflect the scientific assumptions. We next introduce the model parameters and how the $Q$-matrix imposes constraints on them in general. Conditional on a subject’s latent attribute pattern $\boldsymbol\alpha\in\{0,1\}^K$, his/her responses to the $J$ items are assumed to be independent Bernoulli random variables with parameters $\theta_{1,\boldsymbol\alpha},\ldots,\theta_{J,\boldsymbol\alpha}$. Specifically, $\theta_{j,\boldsymbol\alpha}=P(R_j=1\mid\boldsymbol\alpha)$ denotes the positive response probability, and is also called an item parameter of item $j$. We collect all the item parameters in the matrix $\Theta=(\theta_{j,\boldsymbol\alpha})$, which has size $J\times 2^K$ with rows indexed by the $J$ items and columns by the $2^K$ attribute patterns. For pattern $\boldsymbol\alpha$, we denote its corresponding column vector in $\Theta$ by $\Theta_{\cdot,\boldsymbol\alpha}$.

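One key assumption in SLAMs is that for a latent attribute pattern $\boldsymbol\alpha$ and item $j$, the parameter $\theta_{j,\boldsymbol\alpha}$ is only determined by whether $\boldsymbol\alpha$ possesses the attributes in the set $\mathcal K_j=\{k: q_{j,k}=1\}$, that is, those attributes related to item $j$ as specified in the $Q$-matrix. We will sometimes call the attributes in $\mathcal K_j$ the required attributes of item $j$. Under this assumption, all latent attribute patterns in the set

$$\mathcal C_j=\{\boldsymbol\alpha\in\{0,1\}^K:\ \boldsymbol\alpha\succeq\boldsymbol q_j\} \tag{1}$$

share the same value of $\theta_{j,\boldsymbol\alpha}$; namely,

$$\max_{\boldsymbol\alpha\in\mathcal C_j}\theta_{j,\boldsymbol\alpha}=\min_{\boldsymbol\alpha\in\mathcal C_j}\theta_{j,\boldsymbol\alpha}\quad\text{for any } j\in\{1,\ldots,J\}. \tag{2}$$

We will call the set $\mathcal C_j$ a constraint set. Thus, the $Q$-matrix puts constraints on $\Theta$ by forcing certain entries of it to be the same; specifically, for an item $j$, those attribute patterns that only differ in the attributes in $\{1,\ldots,K\}\setminus\mathcal K_j$ are constrained to the same level of $\theta_{j,\boldsymbol\alpha}$’s. Different SLAMs model the dependence of $\theta_{j,\boldsymbol\alpha}$ on the required attributes differently to reflect the underlying scientific assumptions; see Examples 2 and 3.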

In addition to (2), another key assumption in SLAMs is the monotonicity assumption that

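$$\theta_{j,\boldsymbol\alpha}>\theta_{j,\boldsymbol\alpha'}\quad\text{for any } \boldsymbol\alpha\in\mathcal C_j,\ \boldsymbol\alpha'\notin\mathcal C_j. \tag{3}$$

Constraint (3) is commonly used in our motivating applications of cognitive diagnosis in educational assessment, where (3) indicates that subjects mastering all required attributes of an item are more “capable” of giving a positive response to it (i.e., with higher $\theta_{j,\boldsymbol\alpha}$) than those who lack some required attributes. Nonetheless, our theoretical results on model learnability in Section 3 also apply if (3) is relaxed to

$$\theta_{j,\boldsymbol\alpha}\neq\theta_{j,\boldsymbol\alpha'}\quad\text{for any } \boldsymbol\alpha\in\mathcal C_j,\ \boldsymbol\alpha'\notin\mathcal C_j. \tag{4}$$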

This allows more flexibility in the model assumptions of SLAMs used in other applications.
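To make constraints (2) and (3) concrete, the sketch below enumerates the constraint set of a single item and checks both conditions for a given row of item parameters. It is only an illustration: the helper names `constraint_set` and `check_item`, the dictionary layout of the item parameters, and the numeric values are assumptions.

```python
from itertools import product

def constraint_set(q_row):
    """All patterns alpha in {0,1}^K with alpha >= q_row elementwise, as in (1)."""
    K = len(q_row)
    return [a for a in product((0, 1), repeat=K)
            if all(ak >= qk for ak, qk in zip(a, q_row))]

def check_item(theta_j, q_row):
    """Return (equality (2) holds, monotonicity (3) holds) for one item.

    theta_j maps each pattern (tuple of 0/1) to its positive response probability.
    """
    K = len(q_row)
    inside = constraint_set(q_row)
    outside = [a for a in product((0, 1), repeat=K) if a not in inside]
    top = [theta_j[a] for a in inside]
    equal_inside = max(top) == min(top)
    monotone = all(min(top) > theta_j[a] for a in outside)
    return equal_inside, monotone

# Illustrative item requiring only the first of K = 2 attributes.
q = (1, 0)
theta = {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.8, (1, 1): 0.8}
result = check_item(theta, q)
```

Replacing the strict inequality in `monotone` by an inequality check (`!=`) would give the relaxed condition (4).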

Next we introduce some popular SLAMs in educational and psychological applications. These models are also called Cognitive Diagnosis Models in the psychometrics literature. The first type of SLAM has exactly two item parameters associated with each item.

###### Example 2 (two-parameter SLAM)

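The two-parameter SLAM specifies exactly two levels of item parameters for each item $j$, which we denote by $\theta_j^+$ and $\theta_j^-$ with $\theta_j^+>\theta_j^-$. The popular Deterministic Input Noisy output “And” gate (DINA) model introduced in Junker and Sijtsma (2001) is a two-parameter SLAM. Under this model, the general form of $\theta_{j,\boldsymbol\alpha}$ can be rewritten as

$$\theta^{\text{two-para}}_{j,\boldsymbol\alpha}=\begin{cases}\theta_j^+, & \text{if } \boldsymbol\alpha\in\mathcal C_j,\\ \theta_j^-, & \text{if } \boldsymbol\alpha\notin\mathcal C_j.\end{cases}$$

In the application of the two-parameter SLAM in educational assessment, the item parameters $\theta_j^+$ and $\theta_j^-$ have the following interpretations. The quantity $1-\theta_j^+$ is called the slipping parameter, denoting the probability that a “capable” subject slips and fails to give the correct response, despite mastering all the required attributes of item $j$; and $\theta_j^-$ is called the guessing parameter, denoting the probability that a “non-capable” subject coincidentally gives the correct response by guessing, despite lacking some required attributes of item $j$. In this case, the unique item parameters in the matrix $\Theta$ reduce to $(\boldsymbol\theta^+,\boldsymbol\theta^-)$, where $\boldsymbol\theta^+=(\theta_1^+,\ldots,\theta_J^+)^\top$ and $\boldsymbol\theta^-=(\theta_1^-,\ldots,\theta_J^-)^\top$. Under the two-parameter SLAM, the constraint set of each item takes the form of (1) and satisfies (2) and (3).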
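As a small illustration of the two-parameter form, the sketch below returns the positive response probability of one item under a DINA-type model. The function name `dina_theta` and the slipping and guessing values are illustrative assumptions.

```python
def dina_theta(alpha, q_row, theta_plus, theta_minus):
    """Two-parameter item parameter: theta_plus if alpha >= q_row elementwise, else theta_minus."""
    capable = all(a >= q for a, q in zip(alpha, q_row))
    return theta_plus if capable else theta_minus

# Illustrative item requiring both of K = 2 attributes,
# with slipping probability 0.1 (theta_plus = 0.9) and guessing probability 0.2.
q = (1, 1)
p_master = dina_theta((1, 1), q, theta_plus=0.9, theta_minus=0.2)
p_partial = dina_theta((1, 0), q, theta_plus=0.9, theta_minus=0.2)
```

A subject lacking any one required attribute is treated exactly like a subject lacking all of them, which is what distinguishes this model from the multi-parameter models.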

Another family of SLAMs consists of the multi-parameter models, which allow each item to have multiple levels of item parameters.

###### Example 3 (multi-parameter SLAMs)

Multi-parameter SLAMs can be categorized into two general types, the main-effect models and the all-effect models. The main-effect models assume that the main effects of the required attributes play a role in distinguishing the positive response probabilities. The item parameters can be written as

$$\theta^{\text{main-eff}}_{j,\boldsymbol\alpha}=f\Big(\beta_{j,0}+\sum_{k\in\mathcal K_j}\beta_{j,k}\alpha_k\Big), \tag{5}$$

where $f(\cdot)$ is a link function. Different link functions lead to different models, including the popular reduced Reparameterized Unified Model (reduced-RUM; DiBello et al., 1995) with $f$ being the exponential function, the Linear Logistic Model (LLM; Maris, 1999) with $f$ being the sigmoid function, and the Additive Cognitive Diagnosis Model (ACDM; de la Torre, 2011) with $f$ being the identity function.

Another type of multi-parameter SLAMs are the all-effect models. The item parameter of an all-effect model can be written as

$$\theta^{\text{all-eff}}_{j,\boldsymbol\alpha}=f\Big(\sum_{S\subseteq\mathcal K_j}\beta_{j,S}\prod_{k\in S}\alpha_k\Big). \tag{6}$$

When $f$ is the identity function, (6) is the Generalized DINA (GDINA) model proposed by de la Torre (2011); and when $f$ is the sigmoid function, (6) gives the Log-linear Cognitive Diagnosis Models (LCDMs) proposed by Henson et al. (2009); see also the General Diagnostic Models (GDMs) proposed in von Davier (2008).
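The two parameterizations (5) and (6) can be sketched in a few lines of Python. This is an illustration only: the function names, the dictionary layout of the $\beta$ coefficients (keyed by attribute index in the main-effect case and by tuples of attribute indices in the all-effect case), and the coefficient values are assumptions.

```python
import math
from itertools import combinations

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def identity(x):
    return x

def main_effect_theta(alpha, q_row, beta0, beta, link):
    """Equation (5): link(beta0 + sum over required attributes k of beta[k] * alpha[k])."""
    required = [k for k, q in enumerate(q_row) if q == 1]
    return link(beta0 + sum(beta[k] * alpha[k] for k in required))

def all_effect_theta(alpha, q_row, beta, link):
    """Equation (6): link(sum over subsets S of required attributes of beta[S] * prod_{k in S} alpha[k])."""
    required = [k for k, q in enumerate(q_row) if q == 1]
    total = 0.0
    for size in range(len(required) + 1):
        for S in combinations(required, size):
            if all(alpha[k] == 1 for k in S):
                total += beta.get(S, 0.0)  # missing interaction terms are treated as zero
    return link(total)

q = (1, 1)
# ACDM-style item (identity link) with illustrative coefficients.
acdm = main_effect_theta((1, 0), q, beta0=0.1, beta={0: 0.3, 1: 0.4}, link=identity)
# GDINA-style item (identity link); the key () holds the intercept.
gdina_beta = {(): 0.1, (0,): 0.2, (1,): 0.2, (0, 1): 0.4}
gdina = all_effect_theta((1, 1), q, gdina_beta, link=identity)
```

Passing `link=sigmoid` instead of `identity` gives the LLM-type and LCDM-type variants described above; in that case the coefficients live on the logit scale rather than the probability scale.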

Under the multi-parameter SLAMs, the constraint set $\mathcal C_j$ of each item $j$ also takes the form of (1). Those attribute patterns in $\mathcal C_j$ still share the same value of item parameters by the definition; what is different from the two-parameter counterpart is that those not in $\mathcal C_j$ can have different levels of item parameters. We next give another example of multi-parameter SLAMs.

###### Example 4 (Deep Boltzmann Machines)

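The Restricted Boltzmann Machine (RBM) (Smolensky, 1986; Goodfellow et al., 2016) is a popular neural network model. An RBM is an undirected probabilistic graphical model, with one layer of latent (hidden) binary variables, one layer of observed (visible) binary variables, and a bipartite graph structure between the two layers. We denote variables in the observed layer by $\boldsymbol R$ and variables in the latent layer by $\boldsymbol\alpha$, with their lengths $J$ and $K$, respectively. Under an RBM, the probability mass function of $\boldsymbol R$ and $\boldsymbol\alpha$ is $P(\boldsymbol R,\boldsymbol\alpha)\propto\exp\big(\boldsymbol R^\top W^Q\boldsymbol\alpha+\boldsymbol f^\top\boldsymbol R+\boldsymbol b^\top\boldsymbol\alpha\big)$, where $\boldsymbol f$, $\boldsymbol b$, and $W^Q=(w_{j,k})$ are the parameters. The binary $Q$-matrix then specifies the sparsity structure in $W^Q$, by constraining $w_{j,k}\neq 0$ only if $q_{j,k}\neq 0$. The Deep Boltzmann Machine (DBM) is a generalization of the RBM that allows multiple latent layers. Consider a DBM with two latent layers $\boldsymbol\alpha^{(1)}$ and $\boldsymbol\alpha^{(2)}$ of length $K_1$ and $K_2$, respectively. The probability mass function of $(\boldsymbol R,\boldsymbol\alpha^{(1)},\boldsymbol\alpha^{(2)})$ in this DBM can be written as

$$P(\boldsymbol R,\boldsymbol\alpha^{(1)},\boldsymbol\alpha^{(2)})\propto\exp\Big(\boldsymbol R^\top W^Q\boldsymbol\alpha^{(1)}+(\boldsymbol\alpha^{(1)})^\top U\boldsymbol\alpha^{(2)}+\boldsymbol f^\top\boldsymbol R+\boldsymbol b_1^\top\boldsymbol\alpha^{(1)}+\boldsymbol b_2^\top\boldsymbol\alpha^{(2)}\Big), \tag{7}$$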

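where $\boldsymbol f$, $\boldsymbol b_i$ for $i=1,2$, and $W^Q=(w_{j,k})$, $U$ are model parameters; Figure 1 gives an example of a DBM with a $Q$-matrix. For each item $j$, the conditional distribution of the observed variable $R_j$ given the latent variables is

$$P(R_j=1\mid\boldsymbol\alpha^{(1)},\boldsymbol\alpha^{(2)},\cdots)=P(R_j=1\mid\boldsymbol\alpha^{(1)})=\frac{\exp\big(\sum_{k=1}^{K_1}w_{j,k}\alpha^{(1)}_k+f_j\big)}{1+\exp\big(\sum_{k=1}^{K_1}w_{j,k}\alpha^{(1)}_k+f_j\big)}, \tag{8}$$

where “$\cdots$” represents deeper latent layers that potentially exist in a DBM. Moreover, from (7) we have $P(\boldsymbol R\mid\boldsymbol\alpha^{(1)},\boldsymbol\alpha^{(2)})=\prod_{j=1}^J P(R_j\mid\boldsymbol\alpha^{(1)})$, so a DBM satisfies the local independence assumption that the $R_j$’s are conditionally independent given $\boldsymbol\alpha^{(1)}$. Therefore, a DBM can be viewed as a multi-parameter main-effect SLAM in (5) with a sigmoid link function. Viewing a DBM in this way, (8) gives the item parameter $\theta_{j,\boldsymbol\alpha^{(1)}}$, and the constraint set of each item $j$ also takes the form of (1), with $\boldsymbol\alpha^{(1)}$ in place of $\boldsymbol\alpha$.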

### 2.2 Motivation and Problem

One challenge in modern applications of SLAMs is that the number of potential latent attribute patterns $2^K$ increases exponentially with $K$ and could be much larger than the sample size $N$. It is often assumed that a relatively small portion of attribute patterns exist in the population. For instance, Example 1 has $2^{13}=8192$ different configurations of attribute patterns. Given the limited sample size $N=757$, it is desirable to learn the potentially small set of significant attribute patterns from data.

Another motivation for assuming that a small number of attribute patterns exist in the population results from the possible hierarchical structure among the targeted attributes. For instance, in educational assessment of a set of underlying latent skill attributes, some attributes often serve as prerequisites for some others (Leighton et al., 2004; Templin and Bradshaw, 2014). Specifically, the prerequisite relationship depicts the different levels of difficulty of the skill attributes, and also reveals the order in which these skills are learned in the population of students. For instance, if attribute $k$ is a prerequisite for attribute $\ell$, then no attribute pattern with $\alpha_k=0$ and $\alpha_\ell=1$ exists in the population, naturally resulting in a sparsity structure of the existence of attribute patterns. When the number of attributes is large and the underlying hierarchy structure is complex and unknown, it is desirable to learn the hierarchy of attributes directly from data. In such cases with attribute hierarchy, the number of patterns respecting the hierarchy could be far fewer than $2^K$.
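The sparsity induced by an attribute hierarchy can be illustrated by brute-force enumeration. In the sketch below, a hierarchy is encoded as a list of pairs `(k, l)` meaning that attribute `k` is a prerequisite for attribute `l`; this encoding, the function name, and the linear chain used as an example are assumptions made for illustration.

```python
from itertools import product

def patterns_respecting(K, prerequisites):
    """All patterns in {0,1}^K such that alpha[l] = 1 implies alpha[k] = 1 for each pair (k, l)."""
    return [a for a in product((0, 1), repeat=K)
            if all(a[k] >= a[l] for k, l in prerequisites)]

# Illustrative linear hierarchy on K = 4 attributes: 0 -> 1 -> 2 -> 3.
chain = [(0, 1), (1, 2), (2, 3)]
allowed = patterns_respecting(4, chain)
n_allowed = len(allowed)                        # patterns permitted by the chain
n_total = len(patterns_respecting(4, []))       # all 2^4 patterns
```

For a linear chain the number of permitted patterns grows linearly in $K$ rather than exponentially, which is an extreme instance of the sparsity described above. Full enumeration is only feasible for small $K$.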

The problem of interest is how to consistently estimate, given a moderate sample size, the small set of latent attribute patterns among all the possible patterns. As discussed in the introduction, in the high-dimensional case when the total number of attribute patterns is large or even larger than the sample size, the questions of when the true model with the significant attribute patterns is learnable from data, and how to perform consistent pattern selection, remain open in the literature.

This problem is equivalent to selecting the nonzero elements of the population proportion parameters $\boldsymbol p=(p_{\boldsymbol\alpha},\ \boldsymbol\alpha\in\{0,1\}^K)$, where we use $p_{\boldsymbol\alpha}$ to denote the proportion of the subjects with attribute pattern $\boldsymbol\alpha$ in the population. The vector $\boldsymbol p$ satisfies $p_{\boldsymbol\alpha}\in[0,1]$ for all $\boldsymbol\alpha\in\{0,1\}^K$ and $\sum_{\boldsymbol\alpha\in\{0,1\}^K}p_{\boldsymbol\alpha}=1$. In this work, we will treat the latent attribute patterns as random variables (random effects). For any subject, his/her attribute pattern is a random vector $\boldsymbol\alpha\in\{0,1\}^K$ that (marginally) follows a categorical distribution with population proportion parameters $\boldsymbol p$. One main reason for this random effect assumption is that, when the number of observed variables per subject (i.e., $J$) does not increase with the sample size $N$ asymptotically, the counterpart fixed effect model cannot consistently estimate the model parameters. As a consequence, the fixed effect approach cannot give consistent selection of significant attribute patterns. This scenario with relatively small $J$ but larger $N$ and $2^K$ is commonly seen in the motivating applications in educational and psychological assessments.

We would like to point out that we give the joint distribution of the attributes full flexibility by modeling it as a categorical distribution with free proportion parameters $p_{\boldsymbol\alpha}$’s. Modeling in this way allows those “sparse” significant attribute patterns to have arbitrary structures among the $2^K$ possibilities. On the contrary, any simpler parametric modeling of the distribution of $\boldsymbol\alpha$ with fewer parameters would fail to capture all the possibilities of the attributes’ dependency.

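Under the introduced notations, the probability mass function of a subject’s response vector $\boldsymbol R=(R_1,\ldots,R_J)^\top$ can be written as $P(\boldsymbol R=\boldsymbol r\mid\Theta,\boldsymbol p)=\sum_{\boldsymbol\alpha\in\{0,1\}^K}p_{\boldsymbol\alpha}\prod_{j=1}^J\theta_{j,\boldsymbol\alpha}^{r_j}(1-\theta_{j,\boldsymbol\alpha})^{1-r_j}$ for $\boldsymbol r=(r_1,\ldots,r_J)^\top\in\{0,1\}^J$. Alternatively, the responses can be viewed as a $J$-th order tensor and the probability mass function of $\boldsymbol R$ can be written as a probability tensor

$$P(\boldsymbol R\mid\Theta,\boldsymbol p)=\sum_{\boldsymbol\alpha\in\{0,1\}^K}p_{\boldsymbol\alpha}\begin{pmatrix}1-\theta_{1,\boldsymbol\alpha}\\ \theta_{1,\boldsymbol\alpha}\end{pmatrix}\circ\begin{pmatrix}1-\theta_{2,\boldsymbol\alpha}\\ \theta_{2,\boldsymbol\alpha}\end{pmatrix}\circ\cdots\circ\begin{pmatrix}1-\theta_{J,\boldsymbol\alpha}\\ \theta_{J,\boldsymbol\alpha}\end{pmatrix}, \tag{9}$$

where “$\circ$” denotes the tensor outer product and $\theta_{j,\boldsymbol\alpha}$’s are constrained by (2) and (3).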
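The mixture form of the response distribution can be evaluated directly for small examples. The sketch below computes the probability of one response vector as a proportion-weighted sum of products of Bernoulli probabilities; the function name `response_pmf`, the data layout (one dictionary of item parameters per item, keyed by pattern), and the toy values with a single attribute are assumptions for illustration.

```python
from itertools import product

def response_pmf(r, theta, p):
    """P(R = r) = sum over patterns alpha of p[alpha] * prod_j theta[j][alpha]^r_j * (1 - theta[j][alpha])^(1 - r_j)."""
    total = 0.0
    for alpha, weight in p.items():
        likelihood = 1.0
        for j, rj in enumerate(r):
            t = theta[j][alpha]
            likelihood *= t if rj == 1 else 1.0 - t
        total += weight * likelihood
    return total

# Toy model: J = 2 items, K = 1 attribute, so the patterns are (0,) and (1,).
theta = [
    {(0,): 0.2, (1,): 0.9},  # item 1
    {(0,): 0.3, (1,): 0.8},  # item 2
]
p = {(0,): 0.4, (1,): 0.6}

prob_11 = response_pmf((1, 1), theta, p)
total_mass = sum(response_pmf(r, theta, p) for r in product((0, 1), repeat=2))
```

Collecting `response_pmf(r, theta, p)` over all $\boldsymbol r\in\{0,1\}^J$ gives the entries of the $2\times\cdots\times 2$ probability tensor; patterns with zero proportion contribute nothing to the sum, which is why pattern selection amounts to finding the support of $\boldsymbol p$.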

In the following sections, we first investigate the learnability requirement of learning a SLAM with an arbitrary set of true attribute patterns, and provide identifiability conditions in Section 3. Then in Section 4, we propose a penalized likelihood method to select the attribute patterns, and we establish theoretical guarantees for the proposed method.

## 3 Learnability Requirement and Conditions

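To facilitate the discussion on identifiability of SLAMs, we introduce a new notation, the $\Gamma$-matrix. We first introduce the constraint matrix $\Gamma^{\text{all}}$ that is entirely determined by the $Q$-matrix. The rows of $\Gamma^{\text{all}}$ are indexed by the $J$ items, and the columns by the $2^K$ latent attribute patterns in $\{0,1\}^K$. The $(j,\boldsymbol\alpha)$th entry of $\Gamma^{\text{all}}$ is defined as

$$\Gamma^{\text{all}}_{j,\boldsymbol\alpha}=I(\boldsymbol\alpha\succeq\boldsymbol q_j)=I(\boldsymbol\alpha\in\mathcal C_j),\quad j\in\{1,\ldots,J\},\ \boldsymbol\alpha\in\{0,1\}^K, \tag{10}$$

which is a binary indicator of whether attribute pattern $\boldsymbol\alpha$ possesses all the required attributes of item $j$. We will also call $\Gamma^{\text{all}}$ the constraint matrix, since its entries indicate what latent attribute patterns are constrained to have the highest level of item parameters for each item. For example, consider the $2\times 2$ $Q$-matrix in the following (11). Then its corresponding $\Gamma$-matrix $\Gamma^{\text{all}}$ with a saturated set of attribute patterns takes the following form, with columns indexed by $\boldsymbol\alpha=(0,0),(0,1),(1,0),(1,1)$.

$$Q=\begin{pmatrix}0&1\\1&1\end{pmatrix}\ \Longrightarrow\ \Gamma^{\text{all}}=\begin{pmatrix}0&1&0&1\\0&0&0&1\end{pmatrix}. \tag{11}$$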

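More generally, we generalize the definition of the constraint matrix $\Gamma^{\text{all}}$ in (10) to an arbitrary subset $\mathcal A$ of the entire attribute pattern space $\{0,1\}^K$, and an arbitrary set of items $S\subseteq\{1,\ldots,J\}$. For $S$ and $\mathcal A$, we simply denote by $\Gamma^{(S,\mathcal A)}$ the submatrix of $\Gamma^{\text{all}}$ with row indices from $S$ and column indices from $\mathcal A$. When $S=\{1,\ldots,J\}$, we will sometimes just denote $\Gamma^{(S,\mathcal A)}$ by $\Gamma^{\mathcal A}$ for simplicity. Then $\Gamma^{\mathcal A}$ itself can be viewed as the constraint matrix for a SLAM with attribute pattern space $\mathcal A$, and directly characterizes how the items constrain the positive response probabilities of latent attribute patterns in $\mathcal A$.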
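The construction in (10) and the restriction to a subset of items and patterns can be sketched as follows. The function names, the lexicographic ordering of the columns, and the use of 0-based item indices are assumptions; the $Q$-matrix used is the $2\times 2$ matrix with rows $(0,1)$ and $(1,1)$, which also appears in (15).

```python
from itertools import product

def gamma_all(Q):
    """Constraint matrix from (10): entry (j, alpha) is 1 if alpha >= q_j elementwise, else 0."""
    K = len(Q[0])
    patterns = list(product((0, 1), repeat=K))  # lexicographic column order
    rows = [[int(all(a >= q for a, q in zip(alpha, q_row))) for alpha in patterns]
            for q_row in Q]
    return patterns, rows

def submatrix(patterns, rows, items, subset):
    """Rows restricted to `items` (0-based), columns restricted to the patterns in `subset`."""
    cols = [patterns.index(alpha) for alpha in subset]
    return [[rows[j][c] for c in cols] for j in items]

Q = [(0, 1), (1, 1)]
patterns, Gamma = gamma_all(Q)
Gamma_sub = submatrix(patterns, Gamma, items=[0, 1], subset=[(0, 0), (1, 0)])
```

Here the two columns of `Gamma_sub` coincide, which already hints that the patterns $(0,0)$ and $(1,0)$ cannot be told apart by these two items when only the entries of the constraint matrix matter.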

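Given the $Q$-matrix, we denote by $\mathcal A_0$ the set of true attribute patterns existing in the population, i.e., $\mathcal A_0=\{\boldsymbol\alpha\in\{0,1\}^K: p_{\boldsymbol\alpha}>0\}$. In knowledge space theory (Düntsch and Gediga, 1995), the set of patterns $\mathcal A_0$ corresponds to the knowledge structure of the population. We further denote by $\Theta^{\mathcal A_0}$ the item parameter matrix respecting the constraints imposed by $\Gamma^{\mathcal A_0}$; specifically, $\Theta^{\mathcal A_0}=(\theta_{j,\boldsymbol\alpha})$ has the same size as $\Gamma^{\mathcal A_0}$, with rows and columns indexed by the $J$ items and the attribute patterns in $\mathcal A_0$, respectively. For any positive integer $m$, we let $\Delta^{m-1}$ be the $(m-1)$-dimensional simplex, i.e., $\Delta^{m-1}=\{(x_1,\ldots,x_m): x_i\ge 0,\ \sum_{i=1}^m x_i=1\}$. We denote the true proportion parameters by $\boldsymbol p^{\mathcal A_0}=(p_{\boldsymbol\alpha},\ \boldsymbol\alpha\in\mathcal A_0)$; then $\boldsymbol p^{\mathcal A_0}\in\Delta^{|\mathcal A_0|-1}$ and $p_{\boldsymbol\alpha}>0$ for all $\boldsymbol\alpha\in\mathcal A_0$ by the definition of $\mathcal A_0$.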

The following toy example illustrates why we need to establish identifiability guarantee for pattern selection.

###### Example 5

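Consider the $2\times 2$ $Q$-matrix together with its corresponding $\Gamma$-matrix in Equation (11). Consider two attribute pattern sets, the true set $\mathcal A_0$ and an alternative set $\mathcal A_1$, each consisting of two patterns. Under the two-parameter SLAM, for any valid item parameters $\Theta^{\mathcal A_0}$ restricted by $\Gamma^{\mathcal A_0}$ and any proportion parameters $\boldsymbol p^{\mathcal A_0}$ such that $p_{\boldsymbol\alpha}>0$ for all $\boldsymbol\alpha\in\mathcal A_0$, we have $P(\boldsymbol R\mid\Theta^{\mathcal A_0},\boldsymbol p^{\mathcal A_0})=P(\boldsymbol R\mid\Theta^{\mathcal A_1},\boldsymbol p^{\mathcal A_1})$. This is because $\Gamma^{\mathcal A_0}=\Gamma^{\mathcal A_1}$ from (11) and hence $\Theta^{\mathcal A_0}=\Theta^{\mathcal A_1}$; and also because $\boldsymbol p^{\mathcal A_1}=\boldsymbol p^{\mathcal A_0}$ by our construction. This implies that even if one knows exactly that there are two latent attribute patterns in the population, one can never tell which two patterns those are based on the likelihood function. In this sense, $\mathcal A_0$ is not identifiable, due to the fact that $\mathcal A_0$ and $\mathcal A_1$ do not lead to distinguishable distributions of responses under the two-parameter SLAM.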
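The phenomenon in this example can be reproduced numerically. The sketch below uses the $Q$-matrix with rows $(0,1)$ and $(1,1)$ and, as an assumption made purely for illustration, the pattern sets $\{(0,0),(0,1)\}$ and $\{(1,0),(0,1)\}$, which differ only by swapping two patterns that no item distinguishes; the item parameter values and proportions are likewise illustrative.

```python
import math
from itertools import product

Q = [(0, 1), (1, 1)]
theta_plus = [0.9, 0.8]    # illustrative upper-level item parameters
theta_minus = [0.2, 0.1]   # illustrative lower-level item parameters

def theta(j, alpha):
    """Two-parameter SLAM: theta_plus[j] if alpha meets all requirements of item j, else theta_minus[j]."""
    capable = all(a >= q for a, q in zip(alpha, Q[j]))
    return theta_plus[j] if capable else theta_minus[j]

def pmf(r, proportions):
    """Marginal probability of response vector r under the given pattern proportions."""
    total = 0.0
    for alpha, weight in proportions.items():
        likelihood = 1.0
        for j, rj in enumerate(r):
            t = theta(j, alpha)
            likelihood *= t if rj == 1 else 1.0 - t
        total += weight * likelihood
    return total

# Hypothetical pattern sets that differ only in (0,0) versus (1,0).
p_A0 = {(0, 0): 0.3, (0, 1): 0.7}
p_A1 = {(1, 0): 0.3, (0, 1): 0.7}

same_distribution = all(
    math.isclose(pmf(r, p_A0), pmf(r, p_A1))
    for r in product((0, 1), repeat=2)
)
```

Because the two sets induce the same response distribution for every response vector, no amount of data can separate them under the two-parameter model with this $Q$-matrix.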

From the above example, to make sure the set of true attribute patterns $\mathcal A_0$ is learnable from the observed multivariate responses, we need the $\Gamma$-matrix to have certain structures. We state the formal definition of (strict) learnability of $\mathcal A_0$.

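###### Definition 1 (strict learnability of $\mathcal A_0$)

Given the $Q$-matrix, $\mathcal A_0$ is said to be (strictly) learnable, if for any constraint matrix $\Gamma^{\mathcal A}$ of size $J\times|\mathcal A|$ with $|\mathcal A|\le|\mathcal A_0|$, any valid item parameters $\Theta^{\mathcal A}$ respecting constraints given by $\Gamma^{\mathcal A}$, and any proportion parameters $\boldsymbol p^{\mathcal A}$ with $p_{\boldsymbol\alpha}>0$ for all $\boldsymbol\alpha\in\mathcal A$ and $\sum_{\boldsymbol\alpha\in\mathcal A}p_{\boldsymbol\alpha}=1$, the following equality

$$P(\boldsymbol R\mid\Theta^{\mathcal A_0},\boldsymbol p^{\mathcal A_0})=P(\boldsymbol R\mid\Theta^{\mathcal A},\boldsymbol p^{\mathcal A}) \tag{12}$$

implies $\mathcal A=\mathcal A_0$. Moreover, if (12) implies $(\Theta^{\mathcal A},\boldsymbol p^{\mathcal A})=(\Theta^{\mathcal A_0},\boldsymbol p^{\mathcal A_0})$, then we say the model parameters $(\Theta^{\mathcal A_0},\boldsymbol p^{\mathcal A_0})$ are (strictly) identifiable.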

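Next we further introduce some notations and definitions about the constraint matrix $\Gamma$ and then present the needed identifiability result. Consider an arbitrary subset of items $S\subseteq\{1,\ldots,J\}$. For $\boldsymbol\alpha,\boldsymbol\alpha'\in\mathcal A$, we denote $\boldsymbol\alpha\succeq_S\boldsymbol\alpha'$ under $\Gamma^{\mathcal A}$, if for each $j\in S$ there is $\Gamma_{j,\boldsymbol\alpha}\ge\Gamma_{j,\boldsymbol\alpha'}$. If viewing $\Gamma_{j,\boldsymbol\alpha}=1$ as $\boldsymbol\alpha$ being “capable” of item $j$, then $\boldsymbol\alpha\succeq_S\boldsymbol\alpha'$ would mean $\boldsymbol\alpha$ is at least as capable as $\boldsymbol\alpha'$ of items in set $S$. Then under $\Gamma^{\mathcal A}$, any subset of items $S$ defines a partial order “$\succeq_S$” on the set of latent attribute patterns $\mathcal A$. For two item sets $S_1$ and $S_2$, we say “$\succeq_{S_1}$” $=$ “$\succeq_{S_2}$” under $\Gamma^{\mathcal A}$, if for any $\boldsymbol\alpha,\boldsymbol\alpha'\in\mathcal A$, we have $\boldsymbol\alpha\succeq_{S_1}\boldsymbol\alpha'$ under $\Gamma^{\mathcal A}$ if and only if $\boldsymbol\alpha\succeq_{S_2}\boldsymbol\alpha'$ under $\Gamma^{\mathcal A}$. The next theorem gives conditions that ensure the constraint matrix $\Gamma^{\mathcal A_0}$ as well as the $\Gamma^{\mathcal A_0}$-constrained model parameters $(\Theta^{\mathcal A_0},\boldsymbol p^{\mathcal A_0})$ are jointly identifiable.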
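The order relation induced by an item set, and the comparison of two such relations, can be sketched as below. The representation of a constraint matrix as a dictionary mapping each pattern to its column (a tuple of 0/1 entries, one per item), the function names, and the toy example built from two stacked identity blocks are assumptions for illustration.

```python
def dominates(gamma, items, a, b):
    """a >=_S b under gamma: for every item j in S, gamma[a][j] >= gamma[b][j]."""
    return all(gamma[a][j] >= gamma[b][j] for j in items)

def same_order(gamma, S1, S2):
    """True if the relations induced by item sets S1 and S2 agree on every ordered pair of patterns."""
    patterns = list(gamma)
    return all(
        dominates(gamma, S1, a, b) == dominates(gamma, S2, a, b)
        for a in patterns for b in patterns
    )

# Toy constraint matrix for K = 2 and a Q-matrix made of two stacked 2x2 identity blocks:
# each pattern maps to its column, so items 0-1 and items 2-3 carry the same information.
gamma = {
    (0, 0): (0, 0, 0, 0),
    (1, 0): (1, 0, 1, 0),
    (0, 1): (0, 1, 0, 1),
    (1, 1): (1, 1, 1, 1),
}
orders_agree = same_order(gamma, S1=[0, 1], S2=[2, 3])
```

In this toy case the patterns $(1,0)$ and $(0,1)$ are incomparable under either item set, so the relation is a genuinely partial order.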

###### Theorem 1 (conditions for strict learnability)

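Consider a SLAM with an arbitrary set of true attribute patterns $\mathcal A_0\subseteq\{0,1\}^K$, and a corresponding constraint matrix $\Gamma^{\mathcal A_0}$. If this true $\Gamma^{\mathcal A_0}$ satisfies the following conditions, then $\mathcal A_0$ is identifiable.

1. There exist two disjoint item sets $S_1$ and $S_2$, such that the submatrix $\Gamma^{(S_i,\mathcal A_0)}$ of $\Gamma^{\mathcal A_0}$ with rows in $S_i$ has distinct column vectors for $i=1,2$ and “$\succeq_{S_1}$” $=$ “$\succeq_{S_2}$” under $\Gamma^{\mathcal A_0}$.

2. For any $\boldsymbol\alpha,\boldsymbol\alpha'\in\mathcal A_0$ where $\boldsymbol\alpha'\succeq_{S_i}\boldsymbol\alpha$ under $\Gamma^{\mathcal A_0}$ for $i=1$ or $2$, there exists some $j\in(S_1\cup S_2)^c$ such that $\Gamma_{j,\boldsymbol\alpha}\neq\Gamma_{j,\boldsymbol\alpha'}$.

3. Any column vector of $\Gamma^{\mathcal A_0}$ is different from any column vector of $\Gamma^{\mathcal A_0^c}$, where $\mathcal A_0^c=\{0,1\}^K\setminus\mathcal A_0$.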

Recall that each column in the $\Gamma$-matrix corresponds to a latent attribute pattern; Conditions 1 and 2 then help ensure that the $\Gamma$-matrix of the true patterns, $\Gamma^{\mathcal A_0}$, contains enough information to distinguish between these true patterns. Specifically, Condition 1 requires $\Gamma^{\mathcal A_0}$ to contain two vertically stacked submatrices corresponding to item sets $S_1$ and $S_2$, each having distinct columns, i.e., each being able to distinguish between the true patterns; and Condition 2 requires the remaining submatrix of $\Gamma^{\mathcal A_0}$ to distinguish those pairs of true patterns that have some order ($\succeq$) based on the first two item sets $S_1$ or $S_2$. Condition 3 is necessary for identifiability of $\mathcal A_0$ by ensuring that any true pattern would have a different column vector in $\Gamma^{\text{all}}$ from that of any false pattern. Condition 3 is satisfied for any $\mathcal A_0$ if the $Q$-matrix contains a $K\times K$ identity submatrix $I_K$, because such a $Q$-matrix will give a $\Gamma^{\text{all}}$ that has all the $2^K$ columns distinct.

We would like to point out that our identifiability conditions in Theorem 1 do not depend on the unknown parameters (e.g., $\Theta^{\mathcal A_0}$ and $\boldsymbol p^{\mathcal A_0}$), but only rely on the structure of the constraint matrix $\Gamma$. When the conditions in Theorem 1 are satisfied, $\mathcal A_0$ is identifiable and from Theorem 4.1 in Gu and Xu (2018a), the model parameters $(\Theta^{\mathcal A_0},\boldsymbol p^{\mathcal A_0})$ associated with $\mathcal A_0$ are also identifiable.

###### Corollary 1

Under the conditions in Theorem 1, the model parameters $(\Theta^{\mathcal A_0},\boldsymbol p^{\mathcal A_0})$ associated with $\mathcal A_0$ are identifiable.

Note that the result of Theorem 1 differs from the existing works Xu (2017), Xu and Shang (2018) and Gu and Xu (2018a) in that those works assume $\mathcal A_0$ is known a priori and study the identifiability of $(\Theta^{\mathcal A_0},\boldsymbol p^{\mathcal A_0})$, while in the current work $\mathcal A_0$ is unknown and we focus on the identifiability of $\mathcal A_0$ itself. This is crucially needed in order to guarantee that we can learn the set of true attribute patterns.

###### Remark 1

The identifiability result in Theorem 1 and Corollary 1 is related to the uniqueness of tensor decomposition. As shown in (9), the probability mass function of the multivariate responses of each subject can be viewed as a higher-order tensor with constraints on entries of the tensor, and unique decomposition of the tensor corresponds to identification of the constraint matrix $\Gamma^{\mathcal A_0}$ as well as the model parameters. The identifiability conditions in Theorem 1 are weaker than the general conditions for uniqueness of three-way tensor decomposition in Kruskal (1977), which is a celebrated result in the literature. Kruskal’s conditions require that the tensor can be decomposed as a Khatri-Rao product of three matrices, two having full rank and the other having Kruskal rank at least two (the Kruskal rank of a matrix is the largest number $T$ such that every set of $T$ columns of it is linearly independent). Consider an example with the constraint matrix $\Gamma^{\mathcal A_0}$ in the form of (13). Then the item sets $S_1$ and $S_2$ can be chosen so that Condition 1 in Theorem 1 is satisfied. Further, Condition 2 is also satisfied under $\Gamma^{\mathcal A_0}$. Therefore, Theorem 1 guarantees that the set $\mathcal A_0$ is identifiable, and further guarantees that the parameters $(\Theta^{\mathcal A_0},\boldsymbol p^{\mathcal A_0})$ are identifiable. On the contrary, results based on Kruskal’s conditions for unique three-way tensor decomposition cannot guarantee identifiability, because other than the two full-rank structures given by the items in $S_1$ and $S_2$, the remaining item 5 in $(S_1\cup S_2)^c$ corresponds to a structure with Kruskal rank only one.

 (13)

We next discuss two extensions of the developed identifiability theory. First, Theorem 1 guarantees the strict learnability of $\mathcal A_0$. Under a multi-parameter SLAM, these conditions can be relaxed if the aim is to obtain the so-called generic joint identifiability of $\mathcal A_0$, which means that $\mathcal A_0$ is learnable with the true model parameters ranging almost everywhere in the restricted parameter space except a set of Lebesgue measure zero. Specifically, we have the following definition.

###### Definition 2 (generic learnability of the true model)

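Denote the parameter space of $(\Theta^{\mathcal A_0},\boldsymbol p^{\mathcal A_0})$ constrained by $\Gamma^{\mathcal A_0}$ by $\Omega$. We say $\mathcal A_0$ is generically identifiable, if there exists a subset $\mathcal V$ of $\Omega$ that has Lebesgue measure zero, such that for any $(\Theta^{\mathcal A_0},\boldsymbol p^{\mathcal A_0})\in\Omega\setminus\mathcal V$, Equation (12) implies $\mathcal A=\mathcal A_0$. Moreover, if for any $(\Theta^{\mathcal A_0},\boldsymbol p^{\mathcal A_0})\in\Omega\setminus\mathcal V$, Equation (12) implies $(\Theta^{\mathcal A},\boldsymbol p^{\mathcal A})=(\Theta^{\mathcal A_0},\boldsymbol p^{\mathcal A_0})$, we say the model parameters $(\Theta^{\mathcal A_0},\boldsymbol p^{\mathcal A_0})$ are generically identifiable.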

The generic learnability result is presented in the next theorem.

###### Theorem 2 (conditions for generic learnability)

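Consider a multi-parameter SLAM with the set of true attribute patterns $\mathcal A_0$ and the constraint matrix $\Gamma^{\mathcal A_0}$. If this true $\Gamma^{\mathcal A_0}$ satisfies Condition 3 of Theorem 1 and also the following conditions, then $\mathcal A_0$ is generically identifiable.

1. There exist two disjoint item sets $S_1$ and $S_2$, such that altering some entries from 0 to 1 in $\Gamma^{(S_1\cup S_2,\mathcal A_0)}$ can yield a $\widetilde\Gamma^{(S_1\cup S_2,\mathcal A_0)}$ satisfying Condition 1 of Theorem 1. That is, $\widetilde\Gamma^{(S_i,\mathcal A_0)}$ has distinct columns for $i=1,2$ and “$\succeq_{S_1}$” $=$ “$\succeq_{S_2}$” under $\widetilde\Gamma^{(S_1\cup S_2,\mathcal A_0)}$.

2. For any $\boldsymbol\alpha,\boldsymbol\alpha'\in\mathcal A_0$ where $\boldsymbol\alpha'\succeq_{S_i}\boldsymbol\alpha$ under $\widetilde\Gamma^{(S_1\cup S_2,\mathcal A_0)}$ for $i=1$ or $2$, there exists some $j\in(S_1\cup S_2)^c$ such that $\Gamma_{j,\boldsymbol\alpha}\neq\Gamma_{j,\boldsymbol\alpha'}$.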

We also have the following corollary, where the identifiability requirements are directly characterized by the structure of the design -matrix.

###### Corollary 2

If the -matrix satisfies the following conditions, then for any true set of attribute patterns such that satisfies Condition , the set is generically identifiable.

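1. The matrix $Q$ contains two $K\times K$ sub-matrices $Q_1$, $Q_2$, such that for $i=1,2$,

$$
Q=\begin{pmatrix} Q_1 \\ Q_2 \\ Q' \end{pmatrix}_{J\times K};\qquad
Q_i=\begin{pmatrix}
1 & * & \cdots & * \\
* & 1 & \cdots & * \\
\vdots & \vdots & \ddots & \vdots \\
* & * & \cdots & 1
\end{pmatrix}_{K\times K},\quad i=1,2, \tag{14}
$$

where each ‘$*$’ can be either zero or one.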

2. With in the form of (14), there is for each .

###### Remark 2

When the conditions in Theorem 2 are satisfied, is generically identifiable and from Theorem 4.3 in Gu and Xu (2018a), the model parameters are also generically identifiable. Corollary 2 differs from Theorem 4.3 in Gu and Xu (2018a) in that, here we allow the true set of attribute patterns to be unknown and arbitrary, and study its identifiability, while Gu and Xu (2018a) assumes is pre-specified and studies the identifiability of the model parameters .

The above generic identifiability of ensures that nonidentifiability occurs only on a set of measure zero. The second extension of Theorem 1 concerns the case where nonidentifiability occurs on a set of positive measure. This happens when certain latent attribute patterns always have the same positive response probabilities to all the items, i.e., for some . We define and to be in the same equivalence class if . For instance, consider again the following -matrix under the two-parameter SLAM introduced in Example 2,

$$
Q=\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}, \tag{15}
$$

then attribute patterns and are equivalent under the two-parameter SLAM, as can be seen from the in (11). Therefore the two latent patterns and are not identifiable, no matter which values the true model parameters take.
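The equivalence in this example can be checked numerically. The sketch below assumes the conjunctive rule that an entry of the constraint matrix equals one exactly when the pattern possesses every attribute required by the item (elementwise, the pattern dominates the item's row of the design matrix). That rule, and the helper names `gamma_matrix` and `equivalence_classes`, are assumptions made for illustration.

```python
from itertools import product

import numpy as np


def gamma_matrix(Q):
    """Constraint matrix under the ASSUMED conjunctive rule: entry (j, a) is 1
    iff pattern a is elementwise >= the j-th row of Q."""
    Q = np.asarray(Q, dtype=int)
    K = Q.shape[1]
    patterns = [tuple(a) for a in product([0, 1], repeat=K)]
    G = np.array([[int(np.all(np.array(a) >= q)) for a in patterns] for q in Q])
    return patterns, G


def equivalence_classes(Q):
    """Group attribute patterns whose columns in the constraint matrix coincide."""
    patterns, G = gamma_matrix(Q)
    classes = {}
    for idx, a in enumerate(patterns):
        classes.setdefault(tuple(G[:, idx]), []).append(a)
    return list(classes.values())


Q15 = [[0, 1],
       [1, 1]]  # the design matrix in (15)
classes = equivalence_classes(Q15)
```

Under this assumption, the four patterns of two binary attributes fall into three classes, with (0, 0) and (1, 0) sharing a class because neither item can separate them.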

When neither strict nor generic identifiability holds, we study the -partial identifiability, a concept introduced in Gu and Xu (2018a). Specifically, when some attribute patterns have the same positive response probabilities across all items, we define the set of these attribute patterns as an equivalence class, and aim to identify the proportion of this equivalence class, instead of the separate proportions of these equivalent patterns in the population. For instance, in the above example in (15), because and are equivalent, there are three equivalence classes: , , and . We denote these three equivalence classes by (or , since ), and , since , and form a complete set of representatives of the equivalence classes. For any , we denote the induced set of equivalence classes by of latent patterns, where form a complete set of representatives of the equivalence classes. In this case, the pattern selection problem of interest is to learn which equivalence classes in are significant.

For the two-parameter SLAM introduced in Example 2, two attribute patterns are in the same equivalence class if and only if . This is because under the two-parameter SLAM, the -matrix determined by the -matrix with fully captures the model structure in the sense that . In this case, we can obtain a complete set of representatives of the equivalence classes directly from the -vectors, which are

$$
\mathcal{A}^{Q}=\Big\{\vee_{j\in S}\, q_j : S\subseteq\{1,\ldots,J\}\Big\}, \tag{16}
$$

where . For , we define the vector to be , the all-zero attribute pattern. The reasons for being a complete set of representatives are that, first, has distinct columns and contains all the unique column vectors in ; and second, for any other pattern not in , there is some pattern in such that the two patterns have identical column vectors in . It is not hard to see that if and only if the -matrix contains a submatrix .
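The construction in (16) can be enumerated directly for a small design matrix. The sketch below takes the join of the selected row vectors to be their elementwise maximum and maps the empty subset to the all-zero pattern. The function name `representatives` is hypothetical, and looping over all subsets of items is only feasible when the number of items is small.

```python
from itertools import combinations

import numpy as np


def representatives(Q):
    """Enumerate the elementwise maxima of all subsets of rows of Q, as in (16)."""
    Q = np.asarray(Q, dtype=int)
    J, K = Q.shape
    reps = set()
    for size in range(J + 1):
        for S in combinations(range(J), size):
            # Elementwise maximum (logical OR) of the selected rows;
            # the empty subset gives the all-zero attribute pattern.
            v = Q[list(S)].max(axis=0) if S else np.zeros(K, dtype=int)
            reps.add(tuple(int(x) for x in v))
    return sorted(reps)


reps_15 = representatives([[0, 1], [1, 1]])        # design matrix from (15)
reps_identity = representatives([[1, 0], [0, 1]])  # identity design matrix
```

For the matrix in (15) this yields three representatives, one per equivalence class, while the identity design matrix yields all four patterns of two binary attributes.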

For multi-parameter SLAMs introduced in Example 3, two attribute patterns are in the same equivalence class if This can be seen by considering , i.e., for some item . Then different from the two-parameter SLAMs, for such item , the and are not always the same by the modeling assumptions of multi-parameter SLAMs. Indeed, under a multi-parameter SLAM, for item , patterns in the set can have multiple levels of item parameters.

We have the following corollary of Theorem 1 on the identifiability, when certain attribute patterns are not distinguishable. We denote the set of significant equivalence classes by , which is a subset of the saturated set . Denote the set of representative patterns of the significant equivalence classes by .

###### Corollary 3

If the matrix satisfies Conditions , and , is identifiable.

###### Remark 3

Under the two-parameter SLAM with , the -matrix by definition would have distinct column vectors. Therefore any column vector of in Corollary 3 must be different from any column vector of . In this case, Condition is automatically satisfied, and to identify , one only needs to check whether satisfies Conditions and .

## 4 Penalized Likelihood Approach to Pattern Selection

### 4.1 Shrinkage Estimation

The developed identifiability conditions guarantee that the true set of patterns can be distinguished from any alternative set that has no more than patterns, since they would lead to different probability mass functions of the responses. As , we know that learning the significant attribute patterns is equivalent to selecting the nonzero elements of . In practice, if we directly overfit the data with all the possible attribute patterns, the corresponding maximum likelihood estimator (MLE) cannot correctly recover the sparsity structure of the proportion parameters . We therefore propose to impose regularization on the proportion parameters , and perform pattern selection by maximizing a penalized likelihood function.

In general, we denote by the set of candidate attribute patterns given to the shrinkage estimation method as input. If the saturated space of all the possible attribute patterns is considered, and it contains all the possible configurations of attributes. When we propose to use a preprocessing step that returns a proper subset of the saturated set as candidate attribute patterns, and then perform the shrinkage estimation (see Section 4.2 for the preprocessing procedure).

We first introduce the general data likelihood of a structured latent attribute model. Given a sample of size , we denote the th subject’s response by , . We further use to denote the data matrix The marginal likelihood can be written as

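$$
L(\Theta,p\mid R)=\prod_{i=1}^{N}\Big[\sum_{\alpha\in\mathcal{A}_{\mathrm{input}}}p_{\alpha}\prod_{j=1}^{J}\theta_{j,\alpha}^{R_{i,j}}(1-\theta_{j,\alpha})^{1-R_{i,j}}\Big], \tag{17}
$$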

where the constraints on imposed by are made implicit. We denote the corresponding log likelihood by .
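As an illustration of (17), the log likelihood can be evaluated stably with a log-sum-exp over the candidate patterns. The sketch below is a minimal example and not the estimation algorithm of the paper. The array layout (responses as an N-by-J binary array, item parameters as a J-by-M array with one column per candidate pattern, and proportions as a length-M vector) and the name `log_likelihood` are assumptions, and the item parameters are assumed to lie strictly between zero and one.

```python
import numpy as np


def log_likelihood(R, theta, p):
    """Log of the marginal likelihood in (17).

    R:     (N, J) binary responses.
    theta: (J, M) positive response probabilities, one column per pattern
           (assumed strictly inside (0, 1)).
    p:     (M,) pattern proportions on the simplex.
    """
    R = np.asarray(R, dtype=float)
    theta = np.asarray(theta, dtype=float)
    p = np.asarray(p, dtype=float)
    # log P(R_i | pattern) = sum_j [R_ij log theta_j + (1 - R_ij) log(1 - theta_j)]
    log_cond = R @ np.log(theta) + (1.0 - R) @ np.log1p(-theta)  # shape (N, M)
    with np.errstate(divide="ignore"):
        log_joint = log_cond + np.log(p)  # log(0) = -inf removes empty patterns
    # Log-sum-exp over patterns, row by row, to avoid underflow.
    m = log_joint.max(axis=1, keepdims=True)
    per_subject = m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))
    return float(per_subject.sum())


# One subject, two items, two candidate patterns (hypothetical numbers).
value = log_likelihood([[1, 0]], [[0.8, 0.2], [0.6, 0.3]], [0.5, 0.5])
```

In the hypothetical example, the subject's likelihood is 0.5 × 0.8 × 0.4 + 0.5 × 0.2 × 0.7 = 0.23, so `value` equals log 0.23.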

As the proportion parameters belong to a simplex, in order to encourage sparsity of , we propose to use a log-type penalty with a tuning parameter . Specifically, we use the following penalized likelihood as the objective function,

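$$
\ell^{\lambda}(\Theta,p)=\ell(\Theta,p)+\lambda\sum_{\alpha\in\mathcal{A}_{\mathrm{input}}}\log_{\rho_N}(p_{\alpha}),\qquad \lambda\in(-\infty,0), \tag{18}
$$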

where and is a small threshold parameter that is introduced to circumvent the singularity issue of the function at zero. Specifically, we take

$$
\rho_N\asymp N^{-d}, \tag{19}
$$

for some constant , where for two sequences and , we denote if and if and . Any attribute pattern whose estimated will be considered as 0, and hence not selected. The tuning parameter controls the sparsity level of the estimated proportion vector , and a smaller leads to a sparser solution (with more estimated falling below ). Given a , we denote the estimated set of patterns by .
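The effect of the penalty term in (18) can be illustrated numerically. The sketch below assumes the thresholded logarithm equals the ordinary logarithm above the threshold and is held constant at the logarithm of the threshold below it. This truncation rule, the helper names, and the values chosen for the sample size, the exponent in (19), and the tuning parameter are all assumptions for illustration.

```python
import numpy as np


def truncated_log(p, rho):
    # ASSUMPTION: log(p) when p > rho, and log(rho) otherwise.
    return np.log(np.maximum(np.asarray(p, dtype=float), rho))


def penalty(p, lam, rho):
    """Penalty term of (18): lam times the sum of thresholded logs, lam < 0."""
    return lam * float(np.sum(truncated_log(p, rho)))


def selected_patterns(p_hat, rho):
    """Indices of patterns whose estimated proportion exceeds the threshold."""
    return np.flatnonzero(np.asarray(p_hat, dtype=float) > rho)


N, d = 1000, 1.0       # hypothetical sample size and exponent in (19)
rho_N = N ** (-d)      # threshold of order N^{-d}
lam = -1.0             # hypothetical negative tuning parameter

p_sparse = [0.5, 0.5, 0.0, 0.0]
p_dense = [0.25, 0.25, 0.25, 0.25]

pen_sparse = penalty(p_sparse, lam, rho_N)
pen_dense = penalty(p_dense, lam, rho_N)
kept = selected_patterns(p_sparse, rho_N)
```

Because the tuning parameter is negative and the objective is maximized, the sparse proportion vector receives the larger penalty value, which is how the penalty rewards proportions that fall below the threshold.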

###### Remark 4

In the literature, Chen et al. (2001) and Chen et al. (2004) used a similar form of penalty as the summation term in our (18), but instead imposed to avoid sparse solutions of the proportion parameters. These works used that penalty in order to avoid singularity when performing a restricted likelihood ratio test. In contrast, our goal here is to encourage sparsity of so that significant attribute patterns can be selected.

The formulation of (18) can also be interpreted in a Bayesian way, where the penalty term regarding the proportions is the logarithm of the Dirichlet prior density with hyperparameter over the proportions. But note that when , the penalty term is not a proper prior density. Our later Proposition 1 reveals that, under a nonstandard convergence rate of the mixture model, the traditional Bayesian way of imposing a proper Dirichlet prior over proportions is not sufficient for selecting significant attribute patterns consistently. Instead, this classical procedure will lead to too many false patterns being selected. Therefore, our novelty of allowing in (18) to be negative with arbitrarily large magnitude is crucial to selection consistency.

Besides its connection to the Dirichlet prior density in the Bayesian literature, the log-type penalty in (18) also facilitates the computation based on modified EM and variational EM algorithms, as shown in our Algorithms 1 and 2. For these reasons, this work uses the log-type penalty. There are also alternative ways of imposing a penalty on the proportion parameters that would lead to selection consistency, such as the truncated penalty used in Shen et al. (2012) for high-dimensional feature selection.

We denote the MLE obtained from directly maximizing in (17) by and , and denote the “oracle” MLE of the parameters obtained by maximizing the likelihood constrained to the true set of attribute patterns by . We denote the rate of convergence of to by , that is,

$$
\big[\ell(\hat{\Theta},\hat{p})-\ell(\hat{\Theta}_{\mathcal{A}_0},\hat{p}_{\mathcal{A}_0})\big]/N=O_P(N^{-\delta}). \tag{20}
$$

When , (20) implies converges with the usual root-$N$ rate, and would imply a slower convergence rate. In the literature, Ho and Nguyen (2016) and Heinrich and Kahn (2018) have studied the technically involved problem of the convergence rate of the mixing distribution of certain mixture models, and showed that these models may not have the standard root-$N$ rate. As implied by these works, for complicated models like SLAMs, the convergence rate of the mixing distribution is likely to be slower than root-$N$, and so is the convergence rate of .