Learning Item-Attribute Relationship in Q-Matrix Based Diagnostic Classification Models


Jingchen Liu, Gongjun Xu, and Zhiliang Ying

Columbia University
Abstract

The recent surge of interest in cognitive assessment has led to the development of novel statistical models for diagnostic classification. Central to many such models is the well-known Q-matrix, which specifies the item-attribute relationship. This paper proposes a principled estimation procedure for the Q-matrix and related model parameters. Desirable theoretical properties are established through large-sample analysis. The proposed method also provides a platform under which important statistical issues, such as hypothesis testing and model selection, can be addressed.

Keywords: Cognitive assessment, consistency, DINA model, DINO model, latent traits, model selection, optimization, self-learning, statistical estimation.

1 Introduction

Diagnostic classification models (DCMs) are important statistical tools in cognitive diagnosis and have widespread applications in educational measurement, psychiatric evaluation, human resource development, and many other areas in science, medicine, and business. A key component in many such models is the so-called Q-matrix, first introduced by Tatsuoka (1983); see also Tatsuoka (2009) for detailed coverage. The Q-matrix specifies the item-attribute relationship, so that responses to items can reveal the attribute configuration of the respondent. In fact, Tatsuoka (1983, 2009) proposed the rule space method, which is simple and easy to use.

Flexible and sophisticated statistical models can be built around the Q-matrix. Two such models are the DINA model (Deterministic Input, Noisy Output “AND” gate; see Junker and Sijtsma, 2001) and the DINO model (Deterministic Input, Noisy Output “OR” gate; see Templin, 2006; Templin and Henson, 2006). Other important developments can be found in *Tatsuoka1985,DiBello,Junker, Hartz,TatsuokaC,AHM,Templin2006,Chiu. *Rupp contains a comprehensive summary of many classical and recent developments.

There is a growing literature on the statistical inference of Q-matrix based DCMs that addresses the issues of estimating item parameters when the Q-matrix is prespecified (Rupp, 2002; Henson and Templin, 2005; Roussos, Templin, and Henson, 2007; Stout, 2007). Having a correctly specified Q-matrix is crucial both for parameter estimation (such as the slipping and guessing probabilities and the attribute distribution) and for the identification of subjects’ underlying attributes. As a result, these approaches are sensitive to the choice of the Q-matrix (Rupp and Templin, 2008; de la Torre, 2008; de la Torre and Douglas, 2004). For instance, a misspecified Q-matrix may lead to substantial lack of fit and, consequently, erroneous attribute identification. Thus, it is desirable to be able to detect misspecification and to obtain a data-driven Q-matrix.

In contrast, there has not been much work on the estimation of the Q-matrix. To our knowledge, the only rigorous treatment of the subject is given by Liu et al. (2011), which defines an estimator of the Q-matrix under the DINA model assumption and provides regularity conditions under which desirable theoretical properties are established. The work of this paper may be viewed as a continuation of Liu et al. (2011) in the sense that it completes the estimation of the Q-matrix for the DINA model and extends the estimation procedure (as well as the consistency results) to the DINO model. The DINA and the DINO models impose rather different interactions among attributes. However, we show that there exists a duality between the two models. This feature is especially interesting for theoretical development, as it allows us to adapt the results and analysis techniques developed for the DINA model to the DINO model without much additional effort. This will be shown in our technical developments.

The main contribution of this paper is twofold. First, it provides a rigorous analysis of the Q-matrix for the DINA model when both the slipping and guessing parameters are unknown. This is a substantial extension of the results in Liu et al. (2011), which require complete knowledge of the guessing parameter. It gives a definitive answer to the estimability of the Q-matrix for the DINA model by presenting a set of sufficient conditions under which a consistent estimator exists. Second, we conduct a parallel analysis (to the analysis for the DINA model) for the DINO model. In particular, a consistent estimator of the Q-matrix for the DINO model and its properties are presented. Thanks to the duality structure, part of the intermediate results developed for the DINA model can be carried over to the analysis of the DINO model.

One may notice that our estimation procedure is in fact generic in the sense that it is applicable to a large class of DCMs besides the DINA and DINO models. In particular, the procedure is applicable to the NIDA (Noisy Inputs, Deterministic “And” Gate) model and the NIDO (Noisy Inputs, Deterministic “Or” Gate) model among others, though theoretical properties under such model specifications still need to be established. In addition to the estimation of the Q-matrix, we emphasize that the idea behind the derivations forms a principled inference framework. For instance, in the course of describing the estimation procedure, necessary conditions for a correctly specified Q-matrix are naturally derived. Such conditions can be used to form appropriate statistics for hypothesis testing and model diagnostics. In that connection, additional developments (e.g. the asymptotic distributions of those statistics) are needed, but they are not the focus of the current paper. Therefore, the proposed framework can potentially serve as a principled inference tool for the Q-matrix in diagnostic classification models.

This paper is organized as follows. Section 2 contains the main ingredient: presentation of the estimation procedures for both the DINA and DINO models and the statement of the consistency results. Section 3 includes further discussions of the theorems and various issues. The proofs of the main theorems in Section 2 and several important propositions are given in Section 4. The most technical proofs of two central propositions are given in the Appendix.

2 Main results

2.1 Notation and model specification

The specification of the diagnostic classification models considered in this paper consists of the following concepts.

Attribute: subject’s underlying mastery of certain skills or presence of certain mental health conditions. There are $k$ attributes and we use $\alpha = (\alpha^1, \ldots, \alpha^k)^\top$ to denote the vector of attributes, where $\alpha^j = 1$ or $0$, indicating presence or absence of the $j$-th attribute, $j = 1, \ldots, k$.

Responses to items: There are $m$ items and we use $R = (R^1, \ldots, R^m)^\top$ to denote the vector of responses to them. For simplicity, we assume that $R^i$ is a binary variable for each $i = 1, \ldots, m$.

Note that both $\alpha$ and $R$ are subject specific. Throughout this paper, we assume that the number of attributes $k$ is known and that the number of items $m$ is always observed.

Q-matrix: the link between the items and the attributes. In particular, $Q = (Q_{ij})_{m \times k}$ is an $m \times k$ matrix with binary entries. For each $i$ and $j$, $Q_{ij} = 1$ indicates that item $i$ requires attribute $j$ and $Q_{ij} = 0$ otherwise.

We define the capability indicator, $\xi^i(\alpha, Q)$, which indicates whether a subject possessing attribute profile $\alpha$ is capable of providing a positive response to item $i$ when the item-attribute relationship is specified by the matrix $Q$. Different capability indicators give rise to different DCMs. For instance,

$$\xi^i_{DINA}(\alpha, Q) = \mathbf{1}(\alpha^j \ge Q_{ij} \text{ for all } j = 1, \ldots, k) \qquad (1)$$

is associated with the DINA model, where $\mathbf{1}(\cdot)$ is the usual indicator function. The DINA model assumes a conjunctive relationship among attributes, that is, it is necessary to possess all the attributes indicated by the Q-matrix to be capable of providing a positive response to an item. In addition, having additional unnecessary attributes does not compensate for the lack of the necessary attributes. The DINA model is particularly popular in the context of educational testing.

As an alternative to the “and” relationship, one may impose an “or” relationship among the attributes, resulting in the DINO model. The corresponding capability indicator takes the following form

$$\xi^i_{DINO}(\alpha, Q) = \mathbf{1}(\alpha^j = Q_{ij} = 1 \text{ for at least one } j \in \{1, \ldots, k\}) \qquad (2)$$

That is, one needs to possess at least one of the required attributes to be capable of responding positively to that item.

The last ingredient of the model specification is related to the so-called slipping and guessing parameters. The names “slipping” and “guessing” arise from the educational applications. The slipping parameter is the probability that a subject (with attribute profile $\alpha$) responds negatively to an item if the capability indicator for that item is $\xi(\alpha, Q) = 1$; similarly, the guessing parameter is the probability that a subject responds positively if his/her capability indicator is $\xi(\alpha, Q) = 0$. We use $s$ to denote the slipping probability and $g$ to denote the guessing probability (with corresponding subscripts indicating different items). In the technical development, it is more convenient to work with the complement of the slipping parameter. Therefore, we define $c = 1 - s$ to be the correct-response probability, with $c_i$ and $g_i$ being the corresponding item-specific notation. Given a specific subject’s profile $\alpha$, the response to item $i$ under the DINA model follows a Bernoulli distribution

$$P(R^i = 1 \mid \alpha) = c_i^{\xi^i_{DINA}(\alpha, Q)} \, g_i^{1 - \xi^i_{DINA}(\alpha, Q)}. \qquad (3)$$

With the same definition of $c_i$ and $g_i$, the response under the DINO model follows

$$P(R^i = 1 \mid \alpha) = c_i^{\xi^i_{DINO}(\alpha, Q)} \, g_i^{1 - \xi^i_{DINO}(\alpha, Q)}. \qquad (4)$$

In addition, conditional on $\alpha$, $(R^1, \ldots, R^m)$ are jointly independent.

Lastly, we use subscripts to indicate different subjects. For instance, $R_r = (R_r^1, \ldots, R_r^m)^\top$ is the response vector of subject $r$. Similarly, $\alpha_r$ is the attribute vector of subject $r$. With $N$ subjects, we observe $R_1, \ldots, R_N$ but not $\alpha_1, \ldots, \alpha_N$. This completes our model specification.
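The model specification above can be made concrete with a small simulation. The following is a minimal sketch, not the authors' code: the function names, the 3-item, 2-attribute Q-matrix, and the parameter values are all assumptions chosen for illustration. It evaluates the DINA and DINO capability indicators and draws responses that are positive with the correct-response probability when the subject is capable and with the guessing probability otherwise.

```python
import random
from itertools import product

def xi_dina(alpha, q_row):
    """DINA capability: 1 only if every attribute required by the item is present."""
    return int(all(a >= q for a, q in zip(alpha, q_row)))

def xi_dino(alpha, q_row):
    """DINO capability: 1 if at least one required attribute is present."""
    return int(any(a == 1 and q == 1 for a, q in zip(alpha, q_row)))

def simulate(alpha, Q, c, g, xi, rng):
    """One response vector: item i is positive w.p. c[i] if capable, g[i] otherwise."""
    probs = [c[i] if xi(alpha, Q[i]) else g[i] for i in range(len(Q))]
    return [int(rng.random() < pr) for pr in probs]

# Hypothetical example: 3 items, 2 attributes (values chosen for illustration only).
Q = [[1, 0], [0, 1], [1, 1]]
c = [0.9, 0.9, 0.9]   # correct-response probabilities, c = 1 - s
g = [0.1, 0.1, 0.1]   # guessing probabilities
rng = random.Random(0)
for alpha in product([0, 1], repeat=2):
    print(alpha,
          [xi_dina(alpha, q) for q in Q],
          [xi_dino(alpha, q) for q in Q],
          simulate(alpha, Q, c, g, xi_dina, rng))
```

Setting every correct-response probability to one and every guessing probability to zero makes the simulated responses coincide with the capability indicators, which is the noise-free situation used later to motivate the estimator.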

2.2 Estimation of the Q-matrix

In this section, we develop a general approach to the estimation of the Q-matrix and item parameters. We first deal with the DINA model and then, via introducing a duality relation, the DINO model.

2.2.1 DINA model

We need to introduce additional notation and concepts. Throughout the discussion, we use $Q$ to denote the true matrix and $Q'$ to denote a generic $m \times k$ binary matrix.

Attribute distribution. We assume that the subjects are a random sample (of size $N$) from a designated population so that their attribute profiles, $\alpha_r$, $r = 1, \ldots, N$, are i.i.d. random variables, with the following distribution

$$P(\alpha_r = \alpha) = p_\alpha, \qquad (5)$$

where, for each $\alpha \in \{0,1\}^k$, $p_\alpha \in [0,1]$ and $\sum_\alpha p_\alpha = 1$. We use $p = (p_\alpha : \alpha \in \{0,1\}^k)$ to denote the distribution of the attribute profiles.

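The T-matrix. The T-matrix is a non-linear function of the Q-matrix and provides a linear relationship between the attribute distribution and the response distribution. In particular, let $T(Q)$ be a matrix of $2^k$ columns. Each column of $T(Q)$ corresponds to one attribute profile $\alpha \in \{0,1\}^k$. To facilitate the description, we use binary vectors of length $k$ to label the columns of $T(Q)$ instead of using ordinal numbers. For instance, the $\alpha$-th column of $T(Q)$ is the column that corresponds to attribute profile $\alpha$.

Let $I_i$ be a generic notation for a positive response to item $i$. Let “$\wedge$” stand for “and” combination. For instance, $I_{i_1} \wedge I_{i_2}$ denotes positive responses to both items $i_1$ and $i_2$. Each row of $T(Q)$ corresponds to one item or one “and” combination of items, for instance, $I_{i_1}$, $I_{i_1} \wedge I_{i_2}$, or $I_{i_1} \wedge I_{i_2} \wedge I_{i_3}$,… If $T(Q)$ contains all the single items and all “and” combinations, it has $2^m - 1$ rows. We will later say that such a $T(Q)$ is saturated.

We now proceed to the description of each row vector of $T(Q)$. We define $B_Q(I_i)$ to be a $2^k$ dimensional row vector. Using the same labeling system as that of the columns of $T(Q)$, the $\alpha$-th element of $B_Q(I_i)$ is defined as $\xi^i_{DINA}(\alpha, Q)$, that is, this element indicates if a subject with attribute profile $\alpha$ is capable of responding positively to item $i$. Thus, $B_Q(I_i)$ is the vector indicating the attribute profiles that are capable of responding positively to item $i$.

Using a similar notation, we define that

$$B_Q(I_{i_1} \wedge \cdots \wedge I_{i_l}) = \Upsilon_{h=1}^{l} B_Q(I_{i_h}), \qquad (6)$$

where the operator “$\Upsilon$” is element-by-element multiplication from $B_Q(I_{i_1})$ to $B_Q(I_{i_l})$. For instance,

$$W = \Upsilon_{h=1}^{l} V_h$$

means that $W^j = \prod_{h=1}^{l} V_h^j$, where $W = (W^1, \ldots, W^{2^k})$ and $V_h = (V_h^1, \ldots, V_h^{2^k})$. Therefore, $B_Q(I_{i_1} \wedge \cdots \wedge I_{i_l})$ is the vector indicating the attribute profiles that are capable of responding positively to items $i_1, \ldots, i_l$. The row in $T(Q)$ corresponding to $I_{i_1} \wedge \cdots \wedge I_{i_l}$ is $B_Q(I_{i_1} \wedge \cdots \wedge I_{i_l})$.

$\beta$-vector. We let $\beta$ be a column vector whose length is equal to the number of rows in $T(Q)$. Each component in $\beta$ corresponds to a row vector of $T(Q)$. The element in $\beta$ corresponding to $I_{i_1} \wedge \cdots \wedge I_{i_l}$ is $N_{I_{i_1} \wedge \cdots \wedge I_{i_l}} / N$, where $N_{I_{i_1} \wedge \cdots \wedge I_{i_l}}$ denotes the number of people with positive responses to items $i_1, \ldots, i_l$, that is

$$N_{I_{i_1} \wedge \cdots \wedge I_{i_l}} = \sum_{r=1}^{N} \mathbf{1}(R_r^{i_h} = 1 \text{ for all } h = 1, \ldots, l).$$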
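The construction just described can be sketched in a few lines. This is an illustrative sketch only: the function names `t_matrix` and `beta_vector`, the column ordering (that of `itertools.product`), and the example Q-matrix are assumptions, not part of the paper. Each row of the saturated matrix is the element-wise product of the single-item capability vectors, and each entry of the response vector is the fraction of subjects who answer all items in the corresponding subset positively.

```python
from itertools import combinations, product

def xi_dina(alpha, q_row):
    """DINA capability indicator for one item."""
    return int(all(a >= q for a, q in zip(alpha, q_row)))

def t_matrix(Q):
    """Saturated T-matrix: one row per non-empty item subset, one column per profile."""
    m, k = len(Q), len(Q[0])
    profiles = list(product([0, 1], repeat=k))
    subsets = [s for size in range(1, m + 1) for s in combinations(range(m), size)]
    rows = [[int(all(xi_dina(a, Q[i]) for i in s)) for a in profiles] for s in subsets]
    return subsets, profiles, rows

def beta_vector(subsets, responses):
    """Fraction of subjects answering every item of each subset positively."""
    n = len(responses)
    return [sum(all(r[i] == 1 for i in s) for r in responses) / n for s in subsets]

# Hypothetical 3-item, 2-attribute Q-matrix and a handful of made-up responses.
Q = [[1, 0], [0, 1], [1, 1]]
subsets, profiles, T = t_matrix(Q)
responses = [[0, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 1]]
for s, row, b in zip(subsets, T, beta_vector(subsets, responses)):
    print(s, row, b)
```

With $m$ items the loop produces $2^m - 1$ rows, so this direct enumeration is only practical for a small number of items.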

No slipping or guessing.

We first consider a simplified situation in which both the slipping and guessing probabilities are zero. Under this special situation, (3) implies that

$$R^i = \xi^i_{DINA}(\alpha, Q).$$

In other words, the probabilistic relationship becomes a certainty relationship. We further let $\hat p$ be the (unobserved) empirical distribution of the attribute profiles, that is,

$$\hat p_\alpha = \frac{1}{N} \sum_{r=1}^{N} \mathbf{1}(\alpha_r = \alpha).$$

Note that each row vector of $T(Q)$ indicates the attribute profiles that are capable of responding positively to the corresponding item(s). Then, for each set of $i_1, \ldots, i_l$, we may expect the following identity

$$\frac{N_{I_{i_1} \wedge \cdots \wedge I_{i_l}}}{N} = B_Q(I_{i_1} \wedge \cdots \wedge I_{i_l}) \, \hat p,$$

where $B_Q(I_{i_1} \wedge \cdots \wedge I_{i_l})$ is a row vector and $\hat p$ is a column vector. Therefore, thanks to the construction of $T(Q)$ and the vector $\beta$, in the absence of slipping and guessing, we may expect that the following set of linear equations holds

$$T(Q) \, \hat p = \beta.$$

Note that $\hat p$ is not observed. The above display implies that if the Q-matrix is correctly specified and the slipping and guessing probabilities are zero, then the linear equation $T(Q) p = \beta$ (with $p$ being the variable) has at least one solution. For each $m \times k$ binary matrix $Q'$, we define that

$$S(Q') = \inf_{p} |T(Q') p - \beta|,$$

where the minimization is subject to the constraints that $p_\alpha \in [0,1]$ and $\sum_\alpha p_\alpha = 1$. Based on the above results, we may expect that $S(Q) = 0$ and therefore $Q$ is one of the minimizers of $S(Q')$. In addition, the empirical distribution $\hat p$ is one of the minimizers of $|T(Q) p - \beta|$. Therefore, we have just derived a set of necessary conditions for a correctly specified Q-matrix. In our subsequent theoretical developments, we will show that under some circumstances these conditions are also sufficient.
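The necessary condition above can be checked numerically on a toy case. The sketch below is an assumption-laden illustration rather than the paper's procedure: it uses projected gradient descent (one simple way to minimize a quadratic over the probability simplex), and the names `t_matrix`, `project_simplex`, `s_loss`, the two Q-matrices, and the attribute distribution are all hypothetical. With noise-free data generated from the true matrix, the criterion is essentially zero at the true matrix and clearly positive at a misspecified one.

```python
from itertools import combinations, product
from math import sqrt

def t_matrix(Q):
    """Saturated DINA T-matrix (rows: non-empty item subsets, columns: profiles)."""
    m, k = len(Q), len(Q[0])
    profiles = list(product([0, 1], repeat=k))
    subsets = [s for size in range(1, m + 1) for s in combinations(range(m), size)]
    return [[int(all(a[j] >= Q[i][j] for i in s for j in range(k))) for a in profiles]
            for s in subsets]

def project_simplex(v):
    """Euclidean projection onto {p : p >= 0, sum(p) = 1}."""
    theta, running = 0.0, 0.0
    for j, u in enumerate(sorted(v, reverse=True), start=1):
        running += u
        t = (running - 1.0) / j
        if u - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def s_loss(T, beta, iters=3000):
    """Approximate inf_p |T p - beta| over the simplex by projected gradient descent."""
    n = len(T[0])
    # Step size 1/L, with L bounded by the squared Frobenius norm of T.
    step = 1.0 / max(sum(x * x for row in T for x in row), 1e-12)
    p = [1.0 / n] * n
    for _ in range(iters):
        resid = [sum(t * x for t, x in zip(row, p)) - b for row, b in zip(T, beta)]
        grad = [sum(row[j] * r for row, r in zip(T, resid)) for j in range(n)]
        p = project_simplex([x - step * gx for x, gx in zip(p, grad)])
    resid = [sum(t * x for t, x in zip(row, p)) - b for row, b in zip(T, beta)]
    return sqrt(sum(r * r for r in resid))

# Hypothetical setup: noise-free data, so beta equals T(Q_true) times p_hat exactly.
Q_true = [[1, 0], [0, 1], [1, 1]]
Q_wrong = [[1, 0], [0, 1], [0, 1]]
p_hat = [0.1, 0.2, 0.3, 0.4]   # profiles ordered (0,0), (0,1), (1,0), (1,1)
beta = [sum(t * x for t, x in zip(row, p_hat)) for row in t_matrix(Q_true)]
print(s_loss(t_matrix(Q_true), beta))    # close to zero
print(s_loss(t_matrix(Q_wrong), beta))   # bounded away from zero
```

Any quadratic programming solver could replace the projected gradient loop; it is used here only to keep the sketch dependency-free.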

Illustrative example.

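To aid the understanding of the T-matrix, we provide one simple example. Consider the following $3 \times 2$ Q-matrix,

$$Q = \begin{array}{c|cc} & \text{addition} & \text{multiplication} \\ \hline 2+3 & 1 & 0 \\ 5 \times 2 & 0 & 1 \\ (2+3) \times 2 & 1 & 1 \end{array} \qquad (7)$$

and the contingency table of attributes

$$\begin{array}{c|cc} & \text{multiplication} = 0 & \text{multiplication} = 1 \\ \hline \text{addition} = 0 & \hat p_{00} & \hat p_{01} \\ \text{addition} = 1 & \hat p_{10} & \hat p_{11} \end{array}$$

Note that if the Q-matrix is correctly specified and the slipping and guessing probabilities are all zero we should be able to obtain the following identities

$$N(\hat p_{10} + \hat p_{11}) = N_{2+3}, \quad N(\hat p_{01} + \hat p_{11}) = N_{5 \times 2}, \quad N \hat p_{11} = N_{(2+3) \times 2}. \qquad (8)$$

We then create the corresponding T-matrix and $\beta$-vector as follows

$$T(Q) = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad \beta = \begin{pmatrix} N_{2+3}/N \\ N_{5 \times 2}/N \\ N_{(2+3) \times 2}/N \end{pmatrix}. \qquad (9)$$

The first column of $T(Q)$ corresponds to the zero attribute profile; the second corresponds to $\alpha = (1, 0)$; the third corresponds to $\alpha = (0, 1)$; and the last corresponds to $\alpha = (1, 1)$. The first row of $T(Q)$ corresponds to item $2+3$, the second to $5 \times 2$, the third to $(2+3) \times 2$. In addition, we may further consider combinations such as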

The corresponding T-matrix and $\beta$-vector should be

(10)

Under the DINA model assumption and , we obtain that

Nonzero slipping and guessing probabilities.

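We next extend the necessary conditions just derived to nonzero but known slipping and guessing probabilities. To do so, we need to modify the T-matrix. Let $T_{c,g}(Q)$ be a matrix with the same dimension as that of $T(Q)$, with each row vector being defined slightly differently to incorporate the slipping and guessing probabilities. In particular, let

$$B_{c,g,Q}(I_i) = (c_i - g_i) B_Q(I_i) + g_i E,$$

where $E = (1, \ldots, 1)$ is the row vector of ones and $c_i$ is the positive responding probability of item $i$. In addition, we let

$$B_{c,g,Q}(I_{i_1} \wedge \cdots \wedge I_{i_l}) = \Upsilon_{h=1}^{l} B_{c,g,Q}(I_{i_h}). \qquad (11)$$

Clearly, each element of $B_{c,g,Q}(I_i)$ is the probability of observing a positive response to item $i$ for a certain attribute profile. Likewise, elements of $B_{c,g,Q}(I_{i_1} \wedge \cdots \wedge I_{i_l})$ indicate the probabilities of positive responses to items $i_1$,…, $i_l$. The row in $T_{c,g}(Q)$ corresponding to $I_{i_1} \wedge \cdots \wedge I_{i_l}$ is $B_{c,g,Q}(I_{i_1} \wedge \cdots \wedge I_{i_l})$. To facilitate our statement, we define that

$$T_c(Q) = T_{c,\mathbf{0}}(Q), \qquad (12)$$

where $\mathbf{0}$ is the zero vector. That is, $T_c(Q)$ is the matrix with guessing probabilities being zero.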

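Recall that $p$ is the attribute distribution. Thus,

$$P(R^i = 1) = \sum_\alpha p_\alpha P(R^i = 1 \mid \alpha) = B_{c,g,Q}(I_i) \, p.$$

Further, we obtain that

$$P(R^{i_1} = 1, \ldots, R^{i_l} = 1) = B_{c,g,Q}(I_{i_1} \wedge \cdots \wedge I_{i_l}) \, p.$$

In the presence of slipping and guessing, one cannot expect to solve the equation $T_{c,g}(Q) p = \beta$ exactly as in the case of no guessing and slipping. On the other hand, thanks to the law of large numbers, we obtain that $T_{c,g}(Q) p - \beta \to 0$ almost surely as $N \to \infty$. Then this equation can be solved asymptotically. Thus, for a generic $Q'$, we define the loss function

$$S_{c,g}(Q') = \inf_{p} |T_{c,g}(Q') p - \beta|, \qquad (13)$$

where the above optimization is subject to the constraints that $p_\alpha \in [0,1]$ and $\sum_\alpha p_\alpha = 1$, and $|\cdot|$ is the Euclidean norm. In view of the preceding argument, we expect that

$$S_{c,g}(Q) \to 0 \qquad (14)$$

almost surely as $N \to \infty$, that is, the true Q-matrix asymptotically minimizes the criterion function $S_{c,g}$. This leads us to propose the following estimator of $Q$

$$\hat Q(c, g) = \arg\inf_{Q'} S_{c,g}(Q'), \qquad (15)$$

where $(c, g)$ is included in $\hat Q$ to indicate that the resulting estimator requires knowledge of the correct-response and guessing probabilities.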
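To illustrate how an estimator of this form could be computed when the item parameters are known, the following sketch enumerates every candidate binary matrix for a tiny problem and keeps the one with the smallest criterion value. It is a toy illustration under stated assumptions, not the authors' implementation: the helper names, the projected gradient solver, the parameter values, and the use of the population-level response vector (the large-sample limit, so that no sampling noise is present) are all hypothetical choices.

```python
from itertools import combinations, product
from math import sqrt

def t_cg(Q, c, g):
    """Rows: non-empty item subsets; columns: attribute profiles.
    Each entry is the product, over items in the subset, of the probability
    of a positive response (c[i] if capable under DINA, g[i] otherwise)."""
    m, k = len(Q), len(Q[0])
    profiles = list(product([0, 1], repeat=k))
    prob = [[c[i] if all(a[j] >= Q[i][j] for j in range(k)) else g[i]
             for a in profiles] for i in range(m)]
    rows = []
    for size in range(1, m + 1):
        for s in combinations(range(m), size):
            row = [1.0] * len(profiles)
            for i in s:
                row = [x * y for x, y in zip(row, prob[i])]
            rows.append(row)
    return rows

def project_simplex(v):
    """Euclidean projection onto {p : p >= 0, sum(p) = 1}."""
    theta, running = 0.0, 0.0
    for j, u in enumerate(sorted(v, reverse=True), start=1):
        running += u
        t = (running - 1.0) / j
        if u - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def loss(T, beta, iters=1000):
    """Approximate inf_p |T p - beta| over the simplex (projected gradient)."""
    n = len(T[0])
    step = 1.0 / max(sum(x * x for row in T for x in row), 1e-12)
    p = [1.0 / n] * n
    for _ in range(iters):
        resid = [sum(t * x for t, x in zip(row, p)) - b for row, b in zip(T, beta)]
        grad = [sum(row[j] * r for row, r in zip(T, resid)) for j in range(n)]
        p = project_simplex([x - step * gx for x, gx in zip(p, grad)])
    resid = [sum(t * x for t, x in zip(row, p)) - b for row, b in zip(T, beta)]
    return sqrt(sum(r * r for r in resid))

def estimate_q(beta, m, k, c, g):
    """Exhaustive search over all m x k binary matrices."""
    best, best_loss = None, float("inf")
    for bits in product([0, 1], repeat=m * k):
        cand = [list(bits[i * k:(i + 1) * k]) for i in range(m)]
        val = loss(t_cg(cand, c, g), beta)
        if val < best_loss:
            best, best_loss = cand, val
    return best, best_loss

# Hypothetical example with known item parameters.
Q_true = [[1, 0], [0, 1], [1, 1]]
c, g = [0.9, 0.9, 0.9], [0.1, 0.1, 0.1]
p_true = [0.1, 0.2, 0.3, 0.4]
beta = [sum(t * x for t, x in zip(row, p_true)) for row in t_cg(Q_true, c, g)]
Q_hat, s_hat = estimate_q(beta, 3, 2, c, g)
print(Q_hat, s_hat)
```

The search may return the true matrix with its columns permuted, since relabeling the attributes leaves the criterion unchanged. The enumeration visits $2^{mk}$ candidates, so it is feasible only for very small problems.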

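Situations when $c$ and $g$ are unknown.

Suppose that for a given $Q$, we can construct an estimator $(\hat c(Q), \hat g(Q))$ of $(c, g)$. In addition, suppose that $(\hat c(Q), \hat g(Q))$ is consistent, that is, $(\hat c(Q), \hat g(Q)) \to (c, g)$ in probability as $N \to \infty$. Then, we define

$$\hat Q_{\hat c, \hat g} = \arg\inf_{Q'} S_{\hat c(Q'), \hat g(Q')}(Q'), \qquad (16)$$

that is, we plug the estimator of $(c, g)$ into the objective function in (15). We will present one specific choice of $(\hat c, \hat g)$ in Section 2.2.3.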

2.2.2 DINO model

We now proceed to the description of the estimation procedure of the DINO model. The DINO can be considered as the dual model of the DINA model. The estimation procedure is similar except that the “AND” relationship needs to be changed to an “OR” relationship. In subsequent technical development, we will provide the precise meaning of the duality. First, we present the construction of the estimator.

The -matrix. The matrix is similar to except that it admits an “OR” relationship among items. In particular, first define to be a vector of dimension and the -th element is defined as . Therefore, indicates the attribute profiles that are capable of providing positive responses to item . We use “” to denote the “OR” combinations among items and define

Thus, is a vector indicating the attribute profiles that are capable of responding positively to at least one of the item(s) ,…, . We let the row in corresponding to be . In presence of slipping and guessing, we define

and

We let the row in corresponding to “” be .

The -vector. The vector plays a similar role as the vector for the DINA model. Specifically, is a column vector whose length is equal to the number of rows of . Each element of corresponds to one row vector of . The element of corresponding to is defined as

With such a construction and a correctly specified , one may expect that

almost surely as . Therefore, we define objective function

(17)

where the minimization is subject to $p_\alpha \in [0,1]$ and $\sum_\alpha p_\alpha = 1$. Furthermore, an estimator of $Q$ can be obtained by

(18)

In cases when parameters or are unknown, we may plug in their estimates and define

(19)

2.2.3 Estimators for the slipping and guessing parameters

To complete our estimation procedure, we provide one generic estimator for . For the DINA model, we let

(20)

and for the DINO model, we let

(21)

We emphasize that $(\hat c, \hat g)$ may not be a consistent estimator of $(c, g)$. To illustrate this, we present one example discussed in Liu et al. (2011). Consider the case of $k$ items with $k$ attributes and a complete matrix $Q = I_k$, the identity matrix. The degrees of freedom of a $k$-way binary table is $2^k - 1$. On the other hand, the dimension of the parameters $(p, c, g)$ is $2^k - 1 + 2k$. Therefore, $p$, $c$, and $g$ cannot be consistently identified without additional information. This problem is typically tackled by introducing additional parametric assumptions, such as $p$ satisfying a certain functional form, or, in the Bayesian setting, (weakly) informative prior distributions *Gelman08. Given that the emphasis of this paper is the inference of the Q-matrix, we do not further investigate the identifiability of these parameters. Despite the consistency issues, if one adopts the estimators in (20) and (21) for the estimator of $Q$ as in (16) and (19), the consistency results remain valid even if $(\hat c, \hat g)$ is inconsistent. We will address this issue in more detail in the remarks after the statements of the main theorems.
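As a quick check of the counting argument, consider the smallest nontrivial case of two items and two attributes with the identity Q-matrix (a value chosen here only for illustration). Writing $p$ for the attribute distribution, $c$ for the correct-response probabilities, and $g$ for the guessing probabilities, the count reads:

```latex
\begin{align*}
\text{degrees of freedom of the $2$-way binary table} &= 2^2 - 1 = 3,\\
\text{free parameters in } (p, c, g) &= (2^2 - 1) + 2 + 2 = 7 > 3 .
\end{align*}
```

The observable response distribution therefore carries fewer free quantities than the model has parameters, which is why the item parameters cannot all be pinned down in this configuration.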

2.3 Theoretical properties

2.3.1 Notation

To facilitate the statements, we first introduce notation and some necessary conditions that will be referred to in later discussions.

  • Linear space spanned by vectors :

  • For a matrix , denotes the submatrix containing the first rows and all columns of .

  • Vector denotes a column vector with the -th element being 1 and the rest being 0. When there is no ambiguity, we omit the length index of .

  • Matrix denotes the identity matrix.

  • For a matrix , is the linear space generated by its column vectors. It is usually called the column space of .

  • For a matrix , denotes the set of its column vectors and denotes the set of its row vectors.

  • Vector denotes the zero vector, . When there is no ambiguity, we omit the index of length.

  • Define a dimensional vector

  • For dimensional vectors and , write if for all and if for all .

  • Matrix denotes the true matrix and denotes a generic binary matrix.

The following definitions will be used in subsequent discussions.

Definition 1

We say that is saturated if all combinations of the form , for , are included in . Similarly, we say that is saturated if all combinations of the form , for , are included in .

Definition 2

We write $Q \sim Q'$ if and only if $Q$ and $Q'$ have identical column vectors, which could be arranged in different orders; otherwise, we write $Q \nsim Q'$.

Remark 1

It is not hard to show that “$\sim$” is an equivalence relation. $Q \sim Q'$ if and only if they are identical after an appropriate permutation of the columns. Each column of $Q$ is interpreted as an attribute. Permuting the columns of $Q$ is equivalent to relabeling the attributes. For $Q \sim Q'$, we are not able to distinguish $Q$ from $Q'$ based on data.

Definition 3

A Q-matrix is said to be complete if $\{e_i : i = 1, \ldots, k\} \subset R_Q$ ($R_Q$ is the set of row vectors of $Q$); otherwise, we say that $Q$ is incomplete.

A Q-matrix is complete if and only if for each attribute there exists an item only requiring that attribute. Completeness implies that $m \ge k$. We will show that completeness is among the sufficient conditions to identify $Q$. In addition, it is pointed out by Chiu et al. (2009) (cf. the paper for a more detailed formulation and discussion) that the completeness of the Q-matrix is a necessary condition for a set of items to consistently identify attributes. Thus, it is always recommended to use a complete Q-matrix unless additional information is available.
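Both the equivalence relation of Definition 2 and the completeness condition of Definition 3 are easy to check mechanically. The following is a minimal sketch; the function names and the example matrices are hypothetical and serve only as illustration.

```python
def is_complete(Q):
    """True if, for every attribute, some item requires exactly that attribute."""
    k = len(Q[0])
    rows = {tuple(r) for r in Q}
    return all(tuple(int(j == a) for j in range(k)) in rows for a in range(k))

def equivalent(Q1, Q2):
    """True if Q1 and Q2 have the same columns up to reordering."""
    return sorted(zip(*Q1)) == sorted(zip(*Q2))

Q = [[1, 0], [0, 1], [1, 1]]
print(is_complete(Q))                              # every attribute has a single-attribute item
print(is_complete([[1, 0], [1, 1]]))               # no item isolates the second attribute
print(equivalent(Q, [[0, 1], [1, 0], [1, 1]]))     # columns swapped: attributes relabeled
```

Comparing sorted column lists treats the columns as a multiset, which matches the requirement that the two matrices agree after some permutation of columns.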

Listed below are assumptions which will be used in subsequent development.

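  • C1: Matrix $Q$ is complete.

  • C2: Both the T-matrix and its DINO counterpart are saturated.

  • C3: Random vectors $\alpha_1, \ldots, \alpha_N$ are i.i.d. with the following distribution

    $$P(\alpha_r = \alpha) = p_\alpha.$$

    We further let $p = (p_\alpha : \alpha \in \{0,1\}^k)$.

  • C4: The attribute population is diversified, that is, $p_\alpha > 0$ for all $\alpha \in \{0,1\}^k$.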

2.3.2 Consistency results

We first present the consistency results for the DINA model.

Theorem 1

Under the DINA model, suppose that conditions C1-4 hold, that is, $Q$ is complete, the T-matrix is saturated, the attribute profiles are i.i.d., and $p$ is diversified. Suppose also that $c$ and $g$ are known. Let $S_{c,g}$ be as defined in (13) and

Then,

In addition, with an appropriate arrangement of the column order of , let

Then, for any ,

Theorem 2

Under the DINA model, suppose that the conditions in Theorem 1 hold, except that the and are unknown. For any , and are estimators for and . When , is a consistent estimator of . Let be as defined in (16). Then

In addition, with an appropriate arrangement of the column order of , let

Then, for any ,

In what follows, we present the consistency results for the DINO model.

Theorem 3

Under the DINO model, suppose that conditions C1-4 hold, that is, $Q$ is complete, the DINO counterpart of the T-matrix is saturated, the attribute profiles are i.i.d., and $p$ is diversified. Suppose also that $c$ and $g$ are known. Let the objective function be defined as in (17) and

Then,

In addition, with an appropriate arrangement of the column order of , let

Then, for any ,

Theorem 4

Under the DINO model, suppose that the conditions in Theorem 3 hold, except that the and are unknown. For any , and are estimators for and . When , and are consistent estimators of and . Let be defined as in (19). Then

In addition, with an appropriate arrangement of the column order of , let

Then, for any ,

Remark 2

It is not hard to verify that “$\sim$” defines an equivalence relation on the space of $m \times k$ binary matrices. As previously mentioned, the data do not contain information about the specific meaning of the attributes. Therefore, we do not expect to distinguish $Q$ from $Q'$ if $Q \sim Q'$, and the identifiability in the theorems is the strongest type that one may expect. The corresponding quotient set is the finest resolution that is possibly identifiable based on the data. Under weaker conditions, such as in the absence of completeness of the Q-matrix or of complete diversity of the attribute distribution, the identifiability of the Q-matrix may be weaker, which corresponds to a coarser quotient set.

Remark 3

We would like to point out that, when the estimators in (20) and (21) are chosen, is always a consistent estimator of , even if is not a consistent estimator for . This is because the proof of Theorem 2 is based on the fact that in probability; when , is bounded below by some . Given that and that is chosen to minimize the objective function , decreases to zero regardless whether or not is consistent. In addition, the fact that is bounded below by some does not require any consistency property of . Therefore, the consistency of does not rely on the consistency of if it is of the particular forms as in (20) and (21). On the other hand, in order to have being consistent, it is necessary to require the consistency for . Therefore, in the statement of Theorem 2 we require the consistency of , though it is necessary to point out this subtlety. A similar argument applies to Theorem 4 as well.

3 Discussions and implementation

This paper focuses mostly on the estimation of the Q-matrix. In this section, we discuss several practical issues and a few other usages of the proposed tools.

Computational issues.

There are several aspects we would like to address. First, for a given $Q'$, the evaluation of $S_{c,g}(Q')$ only consists of optimizing a quadratic function subject to linear constraints. This can be done with well-established quadratic programming algorithms.

Second, the theory requires construction of a saturated T-matrix (or its DINO counterpart), which is $2^m - 1$ by $2^k$. Note that when $m$ is reasonably large, for instance, $m = 20$, a saturated T-matrix has over 1 million rows. One solution is to include only part of the combinations and gradually include more if the criterion function admits small values at multiple Q-matrices. Alternatively, we may split the items into multiple groups, as elaborated in the next paragraph.

The third computational issue is related to the minimization of $S_{c,g}(Q')$ with respect to $Q'$. This involves evaluating the function over all the $m \times k$ binary matrices, a set of cardinality $2^{m \times k}$. Simply searching through such a space is a substantial computational overhead. In practice, one may want to handle such a situation by splitting the Q-matrix in the following manner. We split the $m$ items into groups, each of which has a computationally manageable number of items. This is equivalent to dividing a large Q-matrix into multiple smaller sub-matrices. When necessary, we may allow different groups to have overlapping items. Then, we can estimate each sub-matrix separately and merge them into an estimate of the big Q-matrix. Given that the asymptotic results are applicable to each of the sub-matrices, the combined estimate is also consistent. This is similar to the splitting procedure in Chapter 8.6 of Tatsuoka (2009). We emphasize that splitting the parameter space is typically not valid for usual statistical inference. However, the Q-matrix admits a special structure with which the splitting is feasible and valid. This partially helps to relieve the computational burden of the proposed procedure. On the other hand, it is always desirable to have a generic efficient algorithm for a general large-scale Q-matrix. We leave this as a topic for future investigation.

Partially specified Q-matrix.

It is often reasonable to assume that some entries of the Q-matrix are known. For example, suppose we can separate the attributes into “hard” and “soft” ones. By “hard”, we mean those that are concrete and easily recognizable in a given problem and, by “soft”, we mean those that are subtle and not obvious. We can then assume that the columns which correspond to the “hard” attributes are known. Another instance is that there is a subset of items whose attribute requirements are known and the item-attribute relationships of the other items need to be learnt, such as the scenarios when new items need to be calibrated according to the existing ones. In this sense, even if an estimated Q-matrix may not be sufficient to replace the a priori Q-matrix provided by the “expert” (such as exam makers), it can serve as a validation as well as a source of calibration of the existing knowledge of the Q-matrix.

When such information is available and correct, the computation can be substantially reduced. This is because the optimization, for instance that in (16), can be performed subject to the existing knowledge of the Q-matrix. In particular, once a set of items is known to form a complete Q-matrix, that is, item $i$ is known to only require attribute $i$ for $i = 1, \ldots, k$, then one can calibrate one item at a time. More specifically, at each time, one can estimate the sub-matrix consisting of items $1$ to $k$ as well as one additional item, the computational cost of which is $O(2^k)$. Then the overall computational cost is reduced to $O(m 2^k)$, which is typically of a manageable order.
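The sizes discussed in this section can be tabulated directly. The sketch below is only a counting aid; the function names and the choice of 20 items and 3 attributes are assumptions made for illustration, and the one-item-at-a-time count assumes that each unknown row is searched over all binary vectors of length equal to the number of attributes.

```python
def saturated_rows(m):
    """Rows of a saturated T-matrix: one per non-empty subset of m items."""
    return 2 ** m - 1

def exhaustive_candidates(m, k):
    """Number of m x k binary matrices visited by a brute-force search."""
    return 2 ** (m * k)

def one_item_at_a_time(m, k):
    """Candidates examined when k items forming a complete Q-matrix are known
    and each of the remaining m - k rows is searched separately."""
    return (m - k) * 2 ** k

m, k = 20, 3   # hypothetical problem size
print(saturated_rows(m))
print(exhaustive_candidates(m, k))
print(one_item_at_a_time(m, k))
```

The gap between the last two counts is what makes calibration against a known complete sub-matrix attractive in practice.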

Validation of a Q-matrix.

The proposed framework is applicable not only to the estimation of the Q-matrix but also to the validation of an existing Q-matrix. Consider the DINA and DINO models. If the Q-matrix is correctly specified, then one may expect

in probability as . The above convergence requires no additional conditions (such as completeness or diversified attribute distribution). In fact, it suffices to have that the responses are conditionally independent given the attributes and are consistent estimators of . Then, one may expect that

If the convergence rate of the estimators is known, for instance, , then a necessary condition for a correctly specified -matrix is that . The asymptotic distribution of depends on the specific form of . Consequently, checking the closeness of to zero forms a procedure for validation of the existing knowledge of the -matrix.

4 Proofs of the theorems

4.1 Preliminary results: propositions and lemmas

Proposition 1

Under the setting of the DINA model, suppose that $Q$ is complete and matrix $T(Q)$ is saturated. Then, we are able to arrange the columns and rows of $Q$ and $T(Q)$ such that $T(Q)_{1:(2^k - 1)}$ has rank $2^k - 1$, that is, after removing one zero column this sub-matrix has full column rank.

Proof of Proposition 1. We let the first column of $T(Q)$ correspond to the zero attribute profile. Then, the first column is a zero vector, which is the column we mean to remove in the statement of the proposition. Provided that $Q$ is complete, without loss of generality we assume that the $i$-th row vector of $Q$ is $e_i^\top$ for $i = 1, \ldots, k$, that is, item $i$ only requires attribute $i$ for each $i = 1, \ldots, k$. The first $2^k - 1$ rows of $T(Q)$ are associated with $\{I_1, \ldots, I_k\}$. In particular, we let the first $k$ rows correspond to $I_1, \ldots, I_k$ and the second to the $(k+1)$-th columns of $T(Q)$ correspond to $\alpha$’s that only have one attribute. We further arrange the next $\binom{k}{2}$ rows of $T(Q)$ to correspond to combinations of two items, $I_i \wedge I_j$, $i \ne j$. The next $\binom{k}{2}$ columns of $T(Q)$ correspond to $\alpha$’s that only have two positive attributes. Similarly, we arrange for combinations of three, four, and up to $k$ items. Therefore, the first $2^k - 1$ rows of $T(Q)$ admit a block upper triangular form. In addition, we are able to further arrange the columns within each block such that the diagonal matrices are identities, so that $T(Q)_{1:(2^k - 1)}$ has form

$$T(Q)_{1:(2^k - 1)} = \begin{pmatrix} 0 & I_k & * & * & \cdots & * \\ 0 & 0 & I_{\binom{k}{2}} & * & \cdots & * \\ 0 & 0 & 0 & I_{\binom{k}{3}} & \cdots & * \\ \vdots & \vdots & \vdots & & \ddots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 \end{pmatrix} \qquad (22)$$

$T(Q)_{1:(2^k - 1)}$ obviously has full rank after removing the zero (first) column.
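The rank claim can be verified numerically for a small case. The following sketch is an illustration under assumptions, not part of the proof: it takes the identity matrix as a complete Q-matrix with three attributes, builds the saturated DINA matrix, and computes its rank by exact Gaussian elimination over the rationals. The helper names `t_matrix` and `rank` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

def t_matrix(Q):
    """Saturated DINA T-matrix (rows: non-empty item subsets, columns: profiles)."""
    m, k = len(Q), len(Q[0])
    profiles = list(product([0, 1], repeat=k))
    subsets = [s for size in range(1, m + 1) for s in combinations(range(m), size)]
    return [[int(all(a[j] >= Q[i][j] for i in s for j in range(k))) for a in profiles]
            for s in subsets]

def rank(M):
    """Matrix rank via exact Gaussian elimination with Fractions."""
    A = [[Fraction(x) for x in row] for row in M]
    r = 0
    for col in range(len(A[0])):
        pivot = next((i for i in range(r, len(A)) if A[i][col] != 0), None)
        if pivot is None:
            continue
        A[r], A[pivot] = A[pivot], A[r]
        for i in range(r + 1, len(A)):
            f = A[i][col] / A[r][col]
            if f != 0:
                A[i] = [a - f * b for a, b in zip(A[i], A[r])]
        r += 1
    return r

k = 3
Q = [[int(i == j) for j in range(k)] for i in range(k)]   # identity: a complete Q-matrix
T = t_matrix(Q)
print(len(T), len(T[0]), rank(T))
```

For this choice the matrix has $2^k - 1$ rows and $2^k$ columns, and its rank equals $2^k - 1$, in line with the proposition: only the column of the zero attribute profile is redundant.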

From now on, we assume that and the first rows of are arranged in the order as in (22).

Proposition 2

Under the DINA model, that is, the ability indicator follows (1), assume that is a complete matrix and is saturated. Without loss of generality, let . Assume that the first rows of form a complete matrix. Further, assume that . If and , then for all there exists at least one column vector of not in the column space , where is as defined in (12) being the -matrix with zero guessing probabilities.

Proposition 3

Under the DINA model, that is, the ability indicator follows (1), assume that is a complete matrix and is saturated. Without loss of generality, let . If and is incomplete, then for all there exists at least one nonzero column vector of not in the column space .

In the statement of Propositions 2 and 3, , , and can be any real numbers and are not restricted to be in . Propositions 2 and 3 are the central results of this paper, whose proofs are delayed to the Appendix. To state the next proposition, we define matrix

(23)

that is, we add one more row of ones to the original T-matrix.

Proposition 4

Under the DINA model, that is, the ability indicator follows (1), suppose that is a complete matrix, , is saturated, and . Then, for all , there exists one column vector of (depending on ) not in . In addition, is of full column rank.

Lemma 1

Consider two matrices and of the same dimension. If , then for any matrix of appropriate dimension for multiplication, we have

Conversely, if the -th column vector of does not belong to