# Joint Maximum Likelihood Estimation for High-dimensional Exploratory Item Response Analysis

## Abstract

Multidimensional item response theory is widely used in education and psychology for measuring multiple latent traits. However, exploratory analysis of large scale item response data with many items, respondents and latent traits is still a challenge. In this paper, we consider a high-dimensional setting that both the number of items and the number of respondents grow to infinity. A constrained joint maximum likelihood estimator is proposed for estimating both item and person parameters, which yields good theoretical properties and computational advantage. Specifically, we derive error bounds for parameter estimation and develop an efficient algorithm that can scale to very large data sets. The proposed method is applied to a large scale personality assessment data set from the Synthetic Aperture Personality Assessment (SAPA) project. Simulation studies are conducted to evaluate the proposed method.

KEY WORDS: Item response theory, high-dimensionality, diverging number of items, joint maximum likelihood estimator (JMLE), alternating minimization, projected gradient descent, matrix completion, SAPA project

## 1Introduction

Multidimensional item response theory (MIRT) models have been widely used in psychology and education for measuring multiple latent traits based on dichotomous or polytomous items. The concept of MIRT dates back to [25], [24] and [32], and is closely related to linear factor analysis [2] that models continuous multivariate data. We refer the readers to [31] for a review of MIRT models. An important application of MIRT models is exploratory item factor analysis, which serves to uncover a set of latent traits underlying a battery of psychological items. Such exploratory analysis facilitates the understanding of large scale response data with many items, respondents and possibly many latent traits.

A key step in MIRT based analysis is parameter estimation, which can be numerically challenging under the high-dimensional setting, i.e., when the number of items, the sample size, and the dimension of the latent space are large. The most popular method for estimating item parameters is the maximum marginal likelihood estimator (MMLE). In this approach, respondent specific parameters (i.e., their latent trait levels) are treated as random effects and the marginal likelihood function of item parameters is obtained by integrating out the random effects. Item parameter estimates are obtained by maximizing the marginal likelihood function. This approach typically involves evaluating a -dimensional integral, where is the number of latent traits. When is moderately large (say, ), evaluating the integral can be numerically challenging. Many approaches have been proposed to approximate the integral, including adaptive Gaussian quadrature methods [35], Monte Carlo integration [26], fully Bayesian estimation methods [3], and data augmented stochastic approximation algorithm [6]. However, even with these state-of-the-art algorithms, the computation is time-consuming.

An alternative approach to parameter estimation in MIRT models is the joint maximum likelihood estimator (JMLE; see, e.g. [16]). This approach treats both the person and item parameters as fixed effects (i.e., model parameters). Parameter estimates are obtained by maximizing the joint likelihood function of both person and item parameters. Traditionally, the JMLE is less preferred to the MMLE. This is partly because, under the usual asymptotic setting where the number of respondents grows to infinity and the number of items is fixed, the number of parameters in the joint likelihood function also diverges, resulting in inconsistent estimate of item parameters [27] even for simple IRT models.

In this paper, we propose a variant of joint maximum likelihood estimator, which adds constraints to the parameter space. This constrained joint maximum likelihood estimator (CJMLE) is computationally efficient and statistically solid under the high-dimensional setting. In terms of computation, an alternating minimization (AM) algorithm with projected gradient decent update is proposed, which can be parallelled efficiently. Specifically, we implement this parallel computing algorithm in R with core functions written in C++ through Open Multi-Processing [13] that can scale to very large data. For example, according to our simulation study, the algorithm can fit a data set with 25,000 respondents, 1000 items, and 10 latent traits in 12 minutes on a single Intel(R) machine (Core(TM) i7CPU@5650U@2.2 GHz) with eight threads (OpenMP enabled). Theoretically, we provide error bounds on parameter estimation, under the setting that both the numbers of items and respondents grow to infinity. Specifically, we show that the item parameters can be consistently estimated up to a scaling and a rotation (possibly oblique). Consequently, the latent structure of test items may be investigated by using appropriate rotational methods [5]. Both our computational algorithm and theoretical framework are new to the psychometrics literature.

Our theoretical framework assumes a diverging number of items, which is suitable when analyzing large scale data. To the best of our knowledge, such an asymptotic setting has not received enough attention, except in [21] and [11]. Our theoretical analysis applies to a general MIRT model that includes the multidimensional two-parameter logistic model [33] as a special case, while the analyses in [21] and [11] are limited to the unidimensional Rasch model [30] and cognitive diagnostic models [34], respectively. Our technical tools for studying the properties of the CJMLE include theoretical developments in matrix completion theory [9] and matrix perturbation theory [36].

We apply the proposed method to a data set from the Synthetic Aperture Personality Assessment (SAPA) project [12] that is collected to evaluate the structure of personality constructs in the temperament domain. Applying traditional exploratory item factor analysis methods to this data set is challenging because (1) the data set is of large scale, containing 23,681 respondents, 696 items, and possibly many latent traits, (2) the responses are not continuous and thus linear factor analysis is inappropriate, and (3) the data contain “massive missingness” by design that is missing completely at random. The proposed method turns out to be suitable for analyzing the data set. Specifically, the prediction power of the MIRT model and CJMLE is studied and the latent structure is investigated and interpreted.

The rest of the paper is organized as follows. In Section 2, we propose the CJMLE and an algorithm for its computation. Theoretical properties of CJMLE are developed in Section 3, followed by simulation studies and real data analysis in Sections Section 4 and Section 5. Finally, discussions are provided in Section 6. Proofs of our theoretical results are provided in Appendix.

## 2Method

### 2.1Model

To simplify the presentation, we focus on binary response data. We point out that our framework can be extended to ordinal and nominal responses without much difficulty. Let indicates respondents and indicates items. Each respondent is represented by a -dimensional latent vector and each item is represented by parameters . Let be the response from respondent to item , which is assumed to follow distribution

This model can be viewed as an extension of the multidimensional two-parameter logistic model [33]. That is, if is set to be 1 for all , then becomes

which takes the same form as a multidimensional two-parameter logistic model with latent traits. Similar to many latent factor models, Model is also rotational and scaling invariant, as will be discussed in Section 3.

### 2.2Constrained Joint Maximum Likelihood Estimator

The traditional way of estimating item parameters in an item response theory model is through the marginal maximum likelihood estimator. That is, the latent vector s are further assumed to be independent and identically distributed samples from some distribution . Let be the observed value of random variable . The marginal likelihood function is a function of , defined as

Marginal maximum likelihood estimates (MMLE) of , ..., are then obtained by maximizing . Appropriate constraints may be imposed on , ..., to ensure identifiability. In this approach, the latent trait s are treated as random effects, while the item parameters , ..., are treated as fixed effects. When is large, the marginal maximum likelihood estimator suffers from computational challenge brought by the -dimensional integration in . The computation becomes time-consuming when is moderately large (say when ) even with the state-of-art algorithms.

An alternative approach is the joint maximum likelihood estimator [16]. In this approach, both s and s are treated as fixed effects and their estimates are obtained by maximizing the joint likelihood function

Traditionally, the JMLE is less preferred to the MMLE. This is partly because, under the usual asymptotic setting where the number of respondents grows to infinity and the number of items is fixed, the number of parameters in also diverges, resulting in inconsistent estimate of item parameters even for the Rasch model that takes a simple form [1].

Under the high-dimensional setting, , , and can all be large, so that MMLE is computationally infeasible. Interestingly, JMLE becomes a good choice. Intuitively, the number of parameters in the joint likelihood is in the order , which can be substantially smaller than the number of observed responses (in the order of ) when is relatively small. Consequently, under reasonable regularities, JMLE yields satisfactory statistical properties, as will be discussed in Section 3. In fact, [21] shows that under the Rasch model, the estimation of both respondent parameters and item parameters is consistent, when and grow to infinity and converges to 0. Unfortunately, the results of [21] relies on the simple form of the Rasch model and thus can hardly be generalized to the general model setting as in .

Computationally, can be optimized efficiently. That is because maximizing is equivalent to maximizing its logarithm, which can be written as

Due to its factorization, is a biconvex function of and , meaning that is a convex function of when is given and vise versa. As a consequence, can be optimized by iteratively maximizing with respect to one given the other. This optimization program can be further speeded up by parallel computing, thanks to the special structure of . See Section 2.4 for further discussions.

Following the above discussion, we propose a constrained joint maximum likelihood estimator, defined as

where denotes the Euclidian norm of a vector and is a positive tuning parameter that imposes regularization on the magnitude of each and . When , becomes exactly a JMLE. The constraints in avoid overfitting when the number of model parameters is large.

This estimator has the property of rotational invariance. To see this, we rewrite the joint likelihood function in the matrix form. Let , , and . Then the joint log-likelihood function can be rewritten as

The rotational invariance property of the CJMLE is described by the following proposition.

The CJMLE is even invariant up to a scaling and an oblique rotation if the estimate satisfies for all or for all . We summarize it by the following proposition.

Like the tuning parameter in ridge regression, our tuning parameter also controls the bias-variance trade-off of estimation. Roughly speaking, for a large value of , the true parameters and are likely to satisfy the constraints. Thus, the estimation tends to have a small bias. In the meanwhile, the variance may be large. When is small, the bias tends to be large and variance tends to be small.

### 2.3Practical Consideration: Missing Responses

In practice, each respondent may only respond to a small proportion of items, possibly due to the data collection design. For example, in the SAPA data that are analyzed in Section 5, each respondent is only administered a small random subset of 696 items (containing on average 86 items) to avoid high time costs to respondents.

Our CJMLE can be modified to handle missing responses, while maintaining its computational advantage. In addition, as will be shown in Section 3, under a reasonable regularity condition on the data missingness mechanism, the CJMLE still has good statistical properties. Let denote the non-missing responses, where if item is administered to respondent . Then the joint likelihood function becomes

and our CJMLE becomes

If for all and , no response is missing and is the same as . From now on, we focus on the analysis of , which includes as a special case when for all and .

### 2.4Algorithm

We develop an alternating minimization algorithm for solving that is often used to fit low rank models [38]. This algorithm can be efficiently paralleled. In this algorithm, we assume the number of latent traits and tuning parameter are known. In practice, when having no knowledge about and , they are chosen by a T-fold (e.g., ) cross validation.

To handle the constraints in , a projected gradient descent update is used in each iteration. A projected gradient descent update is as follows. Let be a -dimensional vector. We define projection operator:

Here, y returns a feasible point (i.e., the constraint y ) that is most close to y. Consider optimization problem

where is a differentiable convex function. Denote the gradient of by . Then a projected gradient descent update at is defined as

where is a step size decided by line search. Due to the projection, . Furthermore, it can be shown that for sufficiently small , , when satisfies mild regularity conditions and . We refer the readers to [29] for further details about this projected gradient descent update.

The algorithm guarantees that the joint likelihood function increases in each iteration. The parallel computing in step 2 of the algorithm is implemented through OpenMP, which greatly speeds up the computation even on a single machine with multiple cores. The efficiency of this parallel algorithm is further amplified, when running on a computer cluster with many machines.

## 3Theory

The proposed CJMLE has good statistical properties under the high-dimensional setting, when and both grow to infinity. We assume the number of latent traits is fixed and known and the true parameters and satisfy

and for all , .

We first derive an error bound for , where and are the true parameters. It requires an assumption on the mechanism of data missingness – missing completely at random.

s in are independent and identically distributed Bernoulli random variables with

Note that under assumption A1 and A2, we observe on average responses, that is,

With assumptions A1 and A2, we have the following theorem.

We call the left hand side of the scaled Frobenius loss. We provide some remarks on Theorem ?. First, under the asymptotic regime that and both grow to infinity, the probability that is satisfied converges to 1. Second, the left hand side of is a reasonable scaling of the squared Frobenius norm of the error . This is because is a matrix and thus has terms. When both and grow to infinity, the unscaled may diverge with a positive probability. Third, when as assumed, the right hand size of converges to 0. In particular, if no response is missing, then and the right hand size of converges to 0 with a speed . Fourth, the constant plays a role in the right hand size of . For any satisfying A1, the larger being used in , the larger the upper bound in . Ideally, we’d like to use the smallest that satisfies A2. Finally, the error bound in Theorem ? can be further used to quantify the error of predicting respondents’ responses to items that have not been administered due to the missing data design.

We then study the recovery of , as it is of our interest to explore the factor structure of items via the loading matrix. However, due to the rotational and scaling invariance properties, it is not appropriate to measure the closeness between and by any matrix norm directly. Following the canonical approach in matrix perturbation theory [36], we consider to measure their closeness by the largest principal angle between the linear spaces and that are spanned by the column vectors of and , respectively. Specifically, for two linear spaces and , the largest principal angle is defined as

where is the angle between two vectors. In fact, and equality is satisfied when .

If the largest principal angle between and is 0, then , meaning that there exist a rank- matrix , such that . In other words, and are equivalent up to a scaling and an oblique rotation. Roughly speaking, if the largest principal angle is close to zero and has a simple loading structure, the simple structure of may be found by applying a rotational method to , such as the promax rotation [23].

In what follows, we provide a bound on . We make the following assumption.

There exists a positive constant , such as the th largest singular value of , denoted by , satisfying

when and grow to infinity.

Theorem ? has a straightforward corollary.

We point out that Assumption A3 is mild. The following proposition implies that assumption A3 is satisfied with high probability when s and s are independent samples from certain distributions.

## 4Simulation

### 4.1Study I

We evaluate the performance of CJMLE and verify the theoretical results by simulation studies. The number of items takes value 300, 400, ..., 1000 and for each given , sample size . The number of latent traits are considered and are assumed to be known. We sample s from a truncated normal distribution,

where the underlying normal distribution has mean and standard deviation and the truncation interval . This implies that . The loading parameter s are generated from a mixture distribution. Let be a random indicator, satisfying . If , is set to be 0 and if , is sampled from the truncated normal distribution . Under this setting, about of the ’s are 0, which mimics a simple loading structure where each item does not measure all the latent traits. Similarly, . Missing data are considered by setting , where is set to be , , and , corresponding to the situations of “massive missing”, “moderate missing”, and “no missing”. To provide some intuition, when and , only about 11% and 43% of the entries of are not missing when and , respectively. For each , , and , 50 independent replications are generated.

We verify the theoretical results. To guarantee the satisfaction of assumption A1, we set . Results are shown in Figures ? and ?. In the two panels of Figures ?, the -axis records the value of and the -axis represents the scaled Frobenius loss . The lines in Figure ? are obtained by connecting the median of the scaled Frobenius loss among 50 replications for different s, given and . For each setting of , , and , an interval is also plotted that shows the upper and lower quartiles of the scaled Frobenius loss. According to Figure ?, for each missing data situation, the scaled Frobenius loss tends to converge to 0 as and simultaneously grow. In addition, we observe that the less missing data, the smaller the scaled Frobenius loss. Results from Figure ? are consistent with Theorem ?.

In Figure ?, the -axis records the number of items and the -axis records the largest principal angle between the column spaces of and , . Similar as above, the median and upper and lower quartiles of the largest principal angle among 50 independent replications are presented. According to Figure ?, the largest principal angle between the two spaces and tend to converge to 0 when both and diverge, for given and . In addition, having less missing data yields a smaller principal angle. Results from Figure ? are consistent with Theorem ?.

### 4.2Study II

We then investigate the situation when is fixed and grows. Specifically, we consider , , and grows from 1,000 to 10,000. The generation of and is the same as in Study I and is still set to be . For ease of presentation, we only show the “no missing” case (). The patterns are similar in the presence of missing data. Results are shown in Figure ?. The left panel of Figure ? plots sample size (-axis) versus scaled Frobenius loss (-axis) and the right panel plots sample size (-axis) versus largest principal angle (-axis). According to Figure ?, when is fixed and grows, both the scaled Frobenius loss and the largest principal angles first decrease and then tend to stabilize around some positive values. This result implies that the CJMLE may not perform well when either or is small.

## 5Real Data Analysis

We apply the proposed method to analyzing selected personality data^{1}

We first explore the latent dimensionality of data. Specifically, we consider the dimension , 5, 10, 15, 20 and 25 and the constraint tuning parameter , 4, 6, 8, and 10. The combination of and that best predicts the missing responses are investigated by five-fold cross validation. Let be the indicator matrix of non-missing responses. We randomly split the non-missing responses into five non-overlapping sets that are of equal sizes, indicated by , satisfying . Moreover, we denote , indicating the data excluding set . For given and , we find the CJMLE for each (), denoted by , by solving . Then the fitted model with parameters is used to predict non-missing responses in set . The prediction accuracy is measured by the following log-likelihood based cross-validation (CV) error:

where are entries of the matrix . The combination of and that yields smaller CV error is preferred, as the corresponding model tends to better predict unobserved responses. Results are shown in Figures Figure 3 and Figure 4. Except when , for each value of , the log-likelihood based CV error first decreases and then increases and achieve the minima at . When , the CV error decreases as increases from 2 to 10. This may be due to a bias-variance trade-off, where bias decreases and variance increases as increases. When , the bias term may dominate the variance, due to stringent unidimensionality assumption. In addition, the CV errors when is much higher than the smallest CV errors under other choices of , as shown in Figure 4. It means that a unidimensional model is inadequate to capture the underlying structure of the 696 items. Figure 5 shows the smallest CV error when , 10, 15, 20, and 25. Specifically, the smallest CV error is achieved when and , meaning that in terms of prediction, 15 latent traits tend to best characterize this personality item pool in the temperament domain.

We then explore the factor loading matrix , where and are chosen based on the CV error above. The loading matrix after promax rotation has good interpretation. Specifically, we present 10 items with highest absolute value of loadings for each latent trait, as shown in Tables Table 3 and Table 5, where “()” and “()” denote positive and negative loadings. The latent traits are ordered based on the total sum of squares (TSS) of loadings. In fact, items heavily load on latent traits 2-12 seem to be about “compassion”, “extroversion”, “depression”, “conscientiousness”, “intellect”, “irritability”, “boldness”, “intellect”, “perfection”, “traditionalism”, and “honesty”, respectively. Except for latent traits 6 and 9 that seem to be both about “intellect”, all the others among traits 2-12 seem to be about different well-known personality facets. Items that heavily load on latent traits 1, and 13-15 seem to be heterogeneous and these latent traits are worth further investigation.

() Am not interested in other people’s problems. () Can’t be bothered with other’s needs. () Don’t understand people who get emotional. () Am not really interested in others. () Pay too little attention to details. () Don’t put my mind on the task at hand. () Am seldom bothered by the apparent suffering of strangers. () Am not easily amused. () Shirk my duties. () Rarely notice my emotional reactions. | () Feel others’ emotions. () Am sensitive to the needs of others. () Feel sympathy for those who are worse off than myself. () Sympathize with others’ feelings. () Have a soft heart. () Inquire about others’ well-being. () Sympathize with the homeless. () Am concerned about others. () Know how to comfort others. () Suffer from others’ sorrows. | () Keep in the background. () Am mostly quiet when with other people. () Prefer to be alone. () Dislike being the center of attention. () Want to be left alone. () Seek quiet. () Keep others at a distance. () Tend to keep in the background on social occasions. () Hate being the center of attention. () Am afraid to draw attention to myself. |

() Have once wished that I were dead. () Often feel lonely. () Am often down in the dumps. () Find life difficult. () Feel a sense of worthlessness or hopelessness. () Dislike myself. () Am happy with my life. () Seldom feel blue. () Rarely feel depressed. () Suffer from sleeplessness. | () Get to work at once. () Complete my duties as soon as possible. () Get chores done right away. () Find it difficult to get down to work. () Have difficulty starting tasks. () Keep things tidy. () Get started quickly on doing a job. () Start tasks right away. () Leave a mess in my room. () Need a push to get started. | () Think quickly. () Can handle complex problems. () Nearly always have a “ready answer” when talking to people. () Can handle a lot of information. () Am quick to understand things. () Catch on to things quickly. () Formulate ideas clearly. () Know immediately what to do. () Come up with good solutions. () Am hard to convince. |

() Rarely show my anger. () Get irritated easily. () Lose my temper. () Get angry easily. () Can be stirred up easily. () When my temper rises, I find it difficult to control. () Get easily agitated. () It takes a lot to make me feel angry at someone. () Rarely get irritated. () Seldom get mad. | () Would never go hang gliding or bungee jumping. () Like to do frightening things. () Love dangerous situations. () Seek adventure. () Am willing to try anything once. () Do crazy things. () Take risks that could cause trouble for me. () Take risks. () Never go down rapids in a canoe. () Act wild and crazy. | () Need a creative outlet. () Don’t pride myself on being original. () Do not have a good imagination. () Enjoy thinking about things. () Do not like art. () Have a vivid imagination. () See myself as an average person. () Consider myself an average person. () Am just an ordinary person. () Believe in the importance of art. |

() Want everything to add up perfectly. () Dislike imperfect work. () Want everything to be “just right.” () Demand quality. () Have an eye for detail. () Want every detail taken care of. () Avoid mistakes. () Being in debt is worrisome to me. () Am exacting in my work. () Dislike people who don’t know how to behave themselves. | () Believe in one true religion. () Don’t consider myself religious. () Believe that there is no absolute right and wrong. () Tend to vote for liberal political candidates. () Think marriage is old-fashioned and should be done away with. () Tend to vote for conservative political candidates. () Believe one has special duties to one’s family. () Like to stand during the national anthem. () Believe that we should be tough on crime. () Believe laws should be strictly enforced. | () Use flattery to get ahead. () Tell people what they want to hear so they do what I want. () Would never take things that aren’t mine. () Don’t pretend to be more than I am. () Tell a lot of lies. () Play a role in order to impress people. () Switch my loyalties when I feel like it. () Return extra change when a cashier makes a mistake. () Use others for my own ends. () Not regret taking advantage of someone impulsively. |

() Like to take my time. () Like a leisurely lifestyle. () Would never make a high-risk investment. () Let things proceed at their own pace. () Always know why I do things. () Always admit it when I make a mistake. () Have read the great literary classics. () Am more easy-going about right and wrong than most people. () Value cooperation over competition. () Don’t know much about history. | () Am interested in science. () Trust what people say. () Find political discussions interesting. () Don’t [worry about] political and social problems. () Would not enjoy being a famous celebrity. () Believe that we coddle criminals too much. () Enjoy intellectual games. () Believe that people are basically moral. () Like to solve complex problems. () Trust people to mainly tell the truth. | () Dislike loud music. () Like telling jokes and funny stories to my friends. () Seldom joke around. () Prefer to eat at expensive restaurants. () Laugh aloud. () Most things taste the same to me. () People spend too much time safeguarding their future with savings and insurance. () Amuse my friends. () Love my enemies. () Am not good at deceiving other people. |

## 6Discussion

In this paper, we propose a constrained maximum likelihood estimator for analyzing large scale item response data, which allows for the presence of high percentage of missing responses. It differs from the traditional JMLE, by adding constraint on the Euclidian norms of both the item and respondent parameters. An efficient parallel computing algorithm is proposed that is scalable even on a single machine to large data sets with tens of thousands of respondents, thousands of items, and more than ten latent traits, with good timings. The CJMLE also has good statistical properties. In particular, we provide error bounds on the parameter estimates and show that the linear space spanned by the column vectors of the factor loading matrix can be consistently recovered, under mild regularity conditions and the high-dimensional setting that both the numbers of items and respondents grow to infinity. This result implies that the true loading structure can be learned by applying proper rotational methods to the estimated loading matrix, when the true loading matrix has a simple structure (i.e., each latent trait only measures a subset of the items). These theoretical developments are consistent with results from the simulation studies. Our simulation results also imply that the high-dimensional setting that both the numbers of respondents and items grow to infinity is important. When either the sample size or the number of items is small, the CJMLE may not work well. The proposed model is applied to analyzing large scale data from the Synthetic Aperture Personality Assessment project. It is found that a model with 15 latent traits has the best prediction power based on results from cross validation and the majority of the traits seem to be homogeneous and correspond to well-known personality facets.

The proposed method may be extended along several directions. First, the current algorithm and theory focus on binary responses. It is worth extending them to multidimensional IRT models for ordinal response data, such as the generalized partial credit model [39] which is an extension of the multidimensional two-parameter logistic model to analyzing ordinal responses. Second, even after applying rotational methods, the obtained factor loading matrix may still be dense and difficult to interpret. To better pursue a simple loading structure, it may be helpful to further impose regularization of the factor loading parameters, as introduced in [37]. Third, it is also important to incorporate respondent specific covariates in the analysis, so that the relationship between baseline covariates and psychological latent traits can be investigated. Finally, the current theoretical framework assumes the number of latent traits as known. Theoretical results on estimating the number of latent traits based on the CJMLE are also of interest, which may provide guidance on choosing the number of latent traits.

## 7Appendix

### 7.1Proof of Theorem

The proof of Theorem ? is similar to that of [14]. Thus, we only state the main steps and omit the details. Let

where is an matrix whose entries are all zero. Then, we have the following lemma from [14].

Let and in the above lemma, we have

Define

Note that if , then

Thus, from , we further have

We use the following result, which is a slight modification of the last equation on p.210 of [14].

where , denotes the Kullback-Leibler divergence between the joint distribution of when the model parameters are and . In addition, we have the following inequality, which is a direct application of Lemma A.2 in [14] and the third equation on page 211 of [14]. For any such that ,

with under our settings. Combining , and , we can see that with probability ,

Note that for , . Thus, we can rearrange terms in the above inequality and simplify it as

For , we further have

Combine the above equation with and note that is assumed fixed, we complete the proof.

### 7.2Proof of Theorem

Denote and . Let be the singular values of and , ..., be the corresponding singular vectors. Similarly, we let be the singular values of and , ..., be the corresponding singular vectors. Due to the form of and , only their first singular values can be nonzero.

We first show that . Let be the singular decomposition of , where , , and be the left singular matrix. Then

According to assumption A3, for large enough and , . Thus, is of full column rank, implying that matrix is of full rank. Thus, or equivalently, It implies that . Similarly, we have , implying that . Thus,

We then focus on the event :

which holds with probability at least according to Theorem ?. Under this event and by Weyl’s perturbation theorem [36], we have

where denotes the spectral norm of a matrix and the second inequality is due to the relationship between matrix spectral norm and matrix Frobenius norm. Thus, when event happens and for sufficiently large and ,

which is because is of order when and and grow to infinity. Then following the same proof above, we have

Following the Modified Davis-Kahan-Wedin sine theorem (Theorem 20) in [28], we have

Because of the relationship between the matrix spectral norm and Frobenius norm and assumption A3, we have

Under the event , we have

### 7.3Proof of Propositions

To show that is also a minimizer of , one only needs to show that

;

;

.

(a) is true, because , where the last equality is because is a orthogonal matrix. (b) is true because multiplying by a orthogonal matrix does not change the Euclidian norm of a vector. (c) is true for the same reason.

We consider the situation that for all and the case when for all is handled similarly. Let be an invertible matrix with singular values , satisfying

Then

.

.

.

(a) is trivial. (b) is due to

where denotes the matrix spectral norm. The proof of (c) is similar to that of (b).

First note that is the th eigenvalue of . Note that and share the same nonzero eigenvalues. Therefore, we only needs to show that with probability tending to 1, the minimum eigenvalue of satisfies

for some positive constant .

Since

we bound the two minimum eigenvalues on the right hand side separately. By law of large number, converges in probability to the identity matrix. Therefore, by applying Weyl’s theorem again, with probability tending to 1, as grows to infinity, we have

Similarly, we can show that with probability tending to 1, as grows to infinity,

Combining the above two,

with probability tending to 1, as and grow to infinity. Thus, is satisfied by choosing .

### Footnotes

- The data set is available from http://dx.doi.org/10.7910/DVN/SD7SVE.

### References

*Conditional inference and models for measuring*.

Andersen, E. B. (1973). Mentalhygiejnisk Forlag, Copenhagen, Denmark.*An introduction to multivariate statistical analysis*.

Anderson, T. W. (2003). Wiley, Hoboken, NJ.**MCMC estimation and some model-fit analysis of multidimensional IRT models.**

Béguin, A. A. and Glas, C. A. (2001).*Psychometrika*, 66(4):541–561.**Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo.**

Bolt, D. M. and Lall, V. F. (2003).*Applied Psychological Measurement*, 27(6):395–414.**An overview of analytic rotation in exploratory factor analysis.**

Browne, M. W. (2001).*Multivariate Behavioral Research*, 36(1):111–150.**High-dimensional exploratory item factor analysis by a Metropolis–Hastings Robbins–Monro algorithm.**

Cai, L. (2010a).*Psychometrika*, 75(1):33–57.**Metropolis-Hastings Robbins-Monro algorithm for confirmatory item factor analysis.**

Cai, L. (2010b).*Journal of Educational and Behavioral Statistics*, 35(3):307–335.**Exact matrix completion via convex optimization.**

Candès, E. and Recht, B. (2012).*Communications of the ACM*, 55(6):111–119.**Matrix completion with noise.**

Candès, E. J. and Plan, Y. (2010).*Proceedings of the IEEE*, 98(6):925–936.**The power of convex relaxation: Near-optimal matrix completion.**

Candès, E. J. and Tao, T. (2010).*IEEE Transactions on Information Theory*, 56(5):2053–2080.**Joint maximum likelihood estimation for diagnostic classification models.**

Chiu, C.-Y., Köhn, H.-F., Zheng, Y., and Henson, R. (2016).*Psychometrika*, 81(4):1069–1092.**Selected personality data from the SAPA-Project: On the structure of phrased self-report items.**

Condon, D. and Revelle, W. (2015).*Journal of Open Psychology Data*, 3(1).**OpenMP: an industry standard API for shared-memory programming.**

Dagum, L. and Menon, R. (1998).*Computational Science & Engineering, IEEE*, 5(1):46–55.**1-bit matrix completion.**

Davenport, M. A., Plan, Y., van den Berg, E., and Wootters, M. (2014).*Information and Inference*, 3(3):189–223.**A Markov chain Monte Carlo approach to confirmatory item factor analysis.**

Edwards, M. C. (2010).*Psychometrika*, 75(3):474–497.*Item response theory for psychologists*.

Embretson, S. E. and Reise, S. P. (2000). Lawrence Erlbaum Associates Publishers, Mahwah, NJ.**A revised version of the psychoticism scale.**

Eysenck, S. B., Eysenck, H. J., and Barrett, P. (1985).*Personality and Individual Differences*, 6(1):21–29.**Inconsistent maximum likelihood estimators for the Rasch model.**

Ghosh, M. (1995).*Statistics & Probability Letters*, 23(2):165–170.**A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models.**

Goldberg, L. R. (1999).**The international personality item pool and the future of public-domain personality measures.**

Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., and Gough, H. G. (2006).*Journal of Research in Personality*, 40(1):84–96.**Maximum likelihood estimates in exponential response models.**

Haberman, S. J. (1977).*The Annals of Statistics*, 5(5):815–841.**Joint and conditional maximum likelihood estimation for the Rasch model for binary responses.**

Haberman, S. J. (2004).*ETS Research Report Series RR-04-20*.**Promax: A quick method for rotation to oblique simple structure.**

Hendrickson, A. E. and White, P. O. (1964).*British Journal of Mathematical and Statistical Psychology*, 17(1):65–70.*Statistical theories of mental test scores.*

Lord, F. M., Novick, M. R., and Birnbaum, A. (1968). Addison-Wesley, Oxford, England.*Nonlinear factor analysis (Psychometric Monographs, No.15)*.

McDonald, R. P. (1967). Psychometric Corporation, Richmond, VA.**Fitting full-information item factor models and an empirical investigation of bridge sampling.**

Meng, X.-L. and Schilling, S. (1996).*Journal of the American Statistical Association*, 91(435):1254–1267.**Consistent estimates based on partially consistent observations.**

Neyman, J. and Scott, E. L. (1948).*Econometrica*, 16(1):1–32.**Random perturbation of low rank matrices: Improving classical bounds.**

O’Rourke, S., Vu, V., and Wang, K. (2013).*arXiv preprint arXiv:1311.2657*.**Proximal algorithms.**

Parikh, N., Boyd, S., et al. (2014).*Foundations and Trends in Optimization*, 1(3):127–239.*Probabilistic models for some intelligence and achievement tests*.

Rasch, G. (1960). Nielsen and Lydiche, Copenhagen, Denmark.*Multidimensional item response theory*.

Reckase, M. (2009). Springer, New York, NY.*Development and application of a multivariate logistic latent trait model*.

Reckase, M. D. (1972). PhD thesis, Syracuse University, Syracuse NY.**Some latent trait theory in a multidimensional latent space.**

Reckase, M. D. and McKinley, R. L. (1983). In Weiss, D. J., editor,*Proceedings of the 1982 Item Response Theory and Computerized Adaptive Testing Conference*, pages 151–188. University of Minnesota, Department of Psychology, Minneapolis MN.*Diagnostic measurement: Theory, methods, and applications*.

Rupp, A. A., Templin, J., and Henson, R. A. (2010). Guilford Press, New York, NY.**High-dimensional maximum marginal likelihood item factor analysis by adaptive quadrature.**

Schilling, S. and Bock, R. D. (2005).*Psychometrika*, 70(3):533–555.*Matrix Perturbation Theory*.

Stewart, G. and Sun, J. (1990). Academic Press, Cambridge, MA.**Latent variable selection for multidimensional item response theory models via regularization.**

Sun, J., Chen, Y., Liu, J., Ying, Z., and Xin, T. (2016).*Psychometrika*, 81(4):921–939.**Generalized low rank models.**

Udell, M., Horn, C., Zadeh, R., Boyd, S., et al. (2016).*Foundations and Trends in Machine Learning*, 9(1):1–118.**A multidimensional partial credit model with associated item and test statistics: An application to mixed-format tests.**

Yao, L. and Schwarz, R. D. (2006).*Applied Psychological Measurement*, 30(6):469–492.