Grade Prediction with Course and Student Specific Models
Abstract
The accurate estimation of students’ grades in future courses is important as it can inform the selection of next term’s courses and create personalized degree pathways to facilitate successful and timely graduation. This paper presents future-course grade prediction methods based on sparse linear models and low-rank matrix factorizations that are specific to each course or student-course tuple. These methods identify the predictive subsets of prior courses on a course-by-course basis and better address problems associated with the not-missing-at-random nature of the student-course historical grade data. The methods were evaluated on a dataset obtained from the University of Minnesota. This evaluation showed that the course-specific models outperformed various competing schemes, with the best performing scheme achieving an RMSE across the different courses of 0.632 vs. 0.661 for the best competing method.
1 Introduction
Data mining and machine learning approaches are being increasingly used to analyze educational and learning-related datasets towards understanding how students learn and improving learning outcomes. This has led to the development of various approaches for modeling and predicting the success or failure of students in completing specific tasks in the context of intelligent tutoring systems [16, 19, 15, 18, 12, 9], building intelligent “early warning systems” that monitor the students’ performance during the term [1, 3], predicting how well the students will perform by analyzing their activities with the learning management system (e.g., Moodle) [8, 17, 11], and predicting students’ term and final GPA [14, 13, 2].
Our work focuses on developing methods that utilize historical student-course grade information to accurately estimate how well students will perform (as measured by their grade) on courses that they have not yet taken. Being able to accurately estimate students’ grades in future courses is important as it can be used by them (and/or their academic advisers) to identify the appropriate set of courses to take during the next term, and create personalized degree pathways that enable them to successfully and effectively acquire the required knowledge to complete their studies in a timely fashion.
Existing approaches for predicting a student’s grade in a future course [6, 7, 4] rely on neighborhood-based collaborative filtering methods [10]. Despite their relative simplicity, the estimations obtained by these methods are reasonably accurate, indicating that there is sufficient information in the historical student-course grade data to make the estimation problem feasible.
In this paper we improve upon these methods by developing various future-course grade prediction methods that utilize approaches based on sparse linear models and low-rank matrix factorizations. These methods rely entirely on the performance that the students achieved in previously taken courses. A unique aspect of many of our methods is that their associated models are either specific to each course or specific to each student-course tuple. This allows them to identify and utilize the relevant information from the prior courses that are associated with the grade for each course and better address problems associated with the not-missing-at-random nature of the student-course historical grade data. We experimentally evaluated the performance of our methods on a dataset obtained from the University of Minnesota that contained historical grades that span 12.5 years. Our results showed that the course-specific models outperformed various competing schemes and that the best performing scheme, which is based on course-specific regression, achieves an RMSE across the different courses of 0.632, whereas the best competing method achieves an RMSE of 0.661.
The remainder of the paper is organized as follows. Section 2 introduces the notation and definitions used. Section 3 describes the methods developed and Section 4 provides information about the experimental design. Section 5 presents an extensive experimental evaluation of the methods and compares them against existing approaches. Finally, Section 6 provides some concluding remarks.
2 Definitions and Notations
Throughout the paper, bold lowercase letters will denote column vectors (e.g., $\mathbf{y}$) and bold uppercase letters will denote matrices (e.g., $\mathbf{G}$). Individual elements will be denoted using subscripts (e.g., $y_i$ for a vector $\mathbf{y}$, and $g_{s,c}$ for a matrix $\mathbf{G}$). A single subscript on a matrix will denote its corresponding row. Sets will be represented by calligraphic letters.
The historical student-course grade information will be represented by a sparse matrix $\mathbf{G} \in \mathbb{R}^{n \times m}$, where $n$ and $m$ are the number of students and courses, respectively, and $g_{s,c}$ is the grade in the range of [0, 4] that student $s$ achieved in course $c$. If a student has not taken a course, the corresponding entry will be missing. The course and student whose grades need to be predicted will be called target course and target student, respectively.
3 Methods
In this section we describe various classes of methods that we developed for predicting the grade that a student will obtain on a course that he/she has not yet taken.
3.1 Course-Specific Regression (CSR)
Undergraduate degree programs are structured in such a way that courses taken by students provide the necessary knowledge and skills for them to do well in future courses. As a result, the performance that a student achieved in a subset of the earlier courses can be used to predict how well he/she will perform in future courses. Motivated by this, we developed a grade prediction method, called course-specific regression (CSR), that predicts the grade that a student will achieve in a specific course as a sparse linear combination of the grades that the student obtained in past courses.
In order to estimate the CSR model for course $c$, we extract from the overall student-course matrix $\mathbf{G}$ the set of rows corresponding to the students that have taken $c$. For each of these students (rows), we keep only the grades that correspond to courses taken prior to course $c$. Let $\mathbf{G}^c \in \mathbb{R}^{n_c \times m}$ be the matrix representing that extracted information, where $n_c$ is the number of students that took course $c$. In addition, let $\mathbf{y}_c \in \mathbb{R}^{n_c}$ be the grades that the students in $\mathbf{G}^c$ obtained in course $c$ (the $i$th entry of $\mathbf{y}_c$ is the grade corresponding to the student in the $i$th row of $\mathbf{G}^c$). Given this, the CSR model for $c$ is estimated as:
$$\underset{\mathbf{w}_c \ge \mathbf{0},\; b_c}{\text{minimize}} \;\; \|\mathbf{y}_c - b_c\mathbf{1} - \mathbf{G}^c\mathbf{w}_c\|_2^2 + \lambda_1\|\mathbf{w}_c\|_2^2 + \lambda_2\|\mathbf{w}_c\|_1 \tag{1}$$
where $b_c$ is a bias term, $\mathbf{1}$ is a vector of ones, and $\lambda_1$ and $\lambda_2$ are regularization parameters to control overfitting and promote sparsity. The model is non-negative because we assume that prior courses can only provide knowledge to future courses. The individual weights of $\mathbf{w}_c$ indicate how much each prior course contributes to the prediction and represent a measure of the importance of the prior course within the context of the estimated model. Using this model, the grade that a student $s$ will obtain in course $c$ is estimated as
$$\hat{g}_{s,c} = b_c + \mathbf{g}_s^T\mathbf{w}_c \tag{2}$$
where $\mathbf{g}_s$ is the vector of the student’s grades in the courses he/she has taken so far.
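The estimation and prediction steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: missing prior grades are encoded as 0, the solver is a plain projected gradient descent (the text does not name a solver), and the function names `csr_objective`, `fit_csr`, and `predict_csr`, the toy grades, and the regularization values are all hypothetical.

```python
def csr_objective(G, y, w, b, lam1, lam2):
    """Squared error plus L2 and L1 penalties on the prior-course weights."""
    sq_err = sum((yi - b - sum(g * wj for g, wj in zip(row, w))) ** 2
                 for row, yi in zip(G, y))
    return sq_err + lam1 * sum(wj * wj for wj in w) + lam2 * sum(abs(wj) for wj in w)


def fit_csr(G, y, lam1=0.1, lam2=0.1, iters=5000):
    """Fit non-negative weights w and a bias b by projected gradient descent.

    G: one row per student who took the target course, one column per prior
       course (0.0 where a prior course was not taken; an assumed encoding).
    y: grades of those students in the target course.
    """
    n, m = len(G), len(G[0])
    w, b = [0.0] * m, 0.0
    # Conservative step size: inverse of an upper bound on the Lipschitz
    # constant of the gradient (Frobenius norm of [G, 1] plus lam1).
    frob_sq = sum(g * g for row in G for g in row) + n
    step = 1.0 / (2.0 * (frob_sq + lam1))
    for _ in range(iters):
        resid = [yi - b - sum(g * wj for g, wj in zip(row, w))
                 for row, yi in zip(G, y)]
        grad_b = -2.0 * sum(resid)
        # For w >= 0 the L1 penalty is linear, so its gradient is the constant lam2.
        grad_w = [-2.0 * sum(G[i][j] * resid[i] for i in range(n))
                  + 2.0 * lam1 * w[j] + lam2 for j in range(m)]
        b -= step * grad_b
        # Projecting onto w >= 0 enforces the non-negativity constraint.
        w = [max(0.0, wj - step * gj) for wj, gj in zip(w, grad_w)]
    return w, b


def predict_csr(w, b, prior_grades):
    """Predicted grade: bias plus the weighted sum of the student's prior grades."""
    return b + sum(wj * g for wj, g in zip(w, prior_grades))


# Toy data: six students, three prior courses (hypothetical grades).
G_toy = [[4.0, 3.0, 0.0], [3.0, 3.3, 2.0], [2.0, 2.7, 3.0],
         [3.7, 4.0, 0.0], [1.0, 2.0, 2.3], [3.3, 3.0, 3.7]]
y_toy = [3.7, 3.0, 2.3, 4.0, 1.7, 3.3]
w_toy, b_toy = fit_csr(G_toy, y_toy)
print("weights:", [round(wj, 3) for wj in w_toy], "bias:", round(b_toy, 3))
print("prediction:", round(predict_csr(w_toy, b_toy, [3.0, 3.3, 0.0]), 2))
```

Because the weights are constrained to be non-negative, the L1 penalty reduces to a linear term, which is why a simple projection step suffices in this sketch; a dedicated elastic-net solver would likely be preferable on real data.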
We found that centering each student’s grades around his/her GPA leads to more accurate predictions (see Section 5.1). In this approach, prior to estimating the model using Equation 1, we first subtract from each grade the GPA of the corresponding student (the GPA is calculated based on the information in the extracted prior-grade matrix $\mathbf{G}^c$). This centers the data for each student and takes into consideration a notion of student bias, as it predicts the performance with respect to the current state of a student. Note that in the case of GPA-centered data, we remove the non-negativity constraint on the model weights $\mathbf{w}_c$. We will refer to this model as the CSRRC (Row Centered) model.
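The row-centering step can be illustrated as follows. This is a minimal sketch under assumptions that the text does not spell out: the GPA is taken as the unweighted mean of the student's observed prior grades, the linear model predicts the deviation from that GPA, and the GPA is added back to obtain the final grade. The helper names, course codes, weights, and bias are hypothetical.

```python
def center_by_gpa(grades):
    """Subtract the student's GPA (mean of the observed prior grades) from each grade."""
    gpa = sum(grades.values()) / len(grades)
    return gpa, {course: g - gpa for course, g in grades.items()}


def predict_row_centered(weights, bias, grades):
    """Apply a linear model to the GPA-centered grades and add the GPA back."""
    gpa, centered = center_by_gpa(grades)
    # Weights may be negative here, since the non-negativity constraint is dropped.
    offset = bias + sum(weights.get(course, 0.0) * g for course, g in centered.items())
    return gpa + offset


# Hypothetical student and model.
prior = {"CS101": 4.0, "MATH201": 3.0, "PHYS101": 2.0}
model_weights = {"CS101": 0.5, "MATH201": 0.25}
print(round(predict_row_centered(model_weights, -0.1, prior), 2))
```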
3.2 Student-Specific Regression (SSR)
Depending on the major, the structure of different undergraduate degree programs can be different. Some degree programs have limited flexibility as to the set of courses that a student has to take and at which point in their studies they can take them (i.e., specific semester). Other degree programs are considerably more flexible and are structured around a fairly small number of core courses and a large number of elective courses.
For the latter type of degree programs, a drawback of the CSR method is that it requires the same linear regression model to be applied to all students. However, given that the set of prior courses taken by students in such flexible degree programs can be quite different, a single linear model can fail to capture the various prior course combinations. In fact, there can be cases in which many of the most important courses that were identified by the CSR model were simply not taken by some students, even though these students have acquired the necessary knowledge and skills by taking a different set of courses. To address this limitation, we developed a different method, called student-specific regression (SSR), which estimates course-specific linear regression models that are also specific to each student.
The student-specific model is derived by creating a student-course-specific grade matrix $\mathbf{G}^{s,c}$ for each target student $s$ and each target course $c$ from the matrix $\mathbf{G}^c$ used in the CSR method. $\mathbf{G}^{s,c}$ is created in two steps. First, we eliminate from $\mathbf{G}^c$ any grades for courses that were not taken by the target student. Second, we eliminate from $\mathbf{G}^c$ the rows that correspond to students that have not taken a sufficient number of courses in common with the target student $s$. Specifically, if $\mathcal{C}_s$ and $\mathcal{C}_i$ are the sets of courses for students $s$ and $i$, respectively, we compute the overlap ratio (OR) between $\mathcal{C}_s$ and $\mathcal{C}_i$, and if OR is below a threshold $t$, then student $i$ is not included in $\mathbf{G}^{s,c}$. The value of $t$ is a parameter of the SSR method, and high values ensure that the students forming $\mathbf{G}^{s,c}$ have taken many courses in common with $s$ and have followed similar degree plans. Given $\mathbf{G}^{s,c}$, the SSR method proceeds to estimate the model using Equation 1 (with $\mathbf{G}^{s,c}$ replacing $\mathbf{G}^c$), and uses Equation 2 for prediction.
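The two filtering steps that produce the student-specific training data can be sketched as below. The exact overlap-ratio formula is not recoverable from the text, so this sketch assumes it is the fraction of the target student's prior courses that the other student has also taken; that definition, the function names, and the toy data are assumptions made for illustration only.

```python
def overlap_ratio(target_courses, other_courses):
    """Assumed definition: share of the target student's courses also taken by the other student."""
    return len(target_courses & other_courses) / len(target_courses)


def build_ssr_rows(train, target_courses, threshold):
    """Build the student-specific training rows.

    train: dict student -> dict course -> grade, holding the prior grades of
           students who already took the target course.
    target_courses: set of courses the target student has taken so far.
    Returns the retained column order and a dict student -> row of grades,
    with 0.0 where a retained student did not take a retained course.
    """
    columns = sorted(target_courses)          # step 1: keep only the target's courses
    rows = {}
    for student, grades in train.items():     # step 2: keep sufficiently similar students
        if overlap_ratio(target_courses, set(grades)) >= threshold:
            rows[student] = [grades.get(course, 0.0) for course in columns]
    return columns, rows


# Hypothetical training students and target student.
train_toy = {
    "a": {"c1": 4.0, "c2": 3.0, "c9": 2.0},
    "b": {"c1": 3.0},
    "d": {"c7": 2.0},
}
print(build_ssr_rows(train_toy, {"c1", "c2"}, threshold=0.5))
```

The resulting rows would then be passed to the same regression estimator that is used for the course-specific model.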
3.3 Methods based on Matrix Factorization
Low-rank matrix factorization (MF) approaches have been shown to be very effective for accurately estimating ratings in the context of recommender systems [10]. These approaches can be directly applied to the problem of predicting the grade that a student will achieve on a particular course by treating the student-course grade matrix as the user-item rating matrix.
The use of such MF-based approaches for grade prediction is premised on the assumption that there is a low-dimensional latent feature space that can jointly represent both students and courses. Given the nature of the domain, this latent space can correspond to the space of knowledge components. Each course vector represents the set of components associated with a course, and each student vector represents the student’s level of knowledge across these knowledge components.
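By applying the common approaches of MF-based rating prediction to the problem of grade prediction, the grade that student $s$ will obtain on course $c$ is estimated to be
$$\hat{g}_{s,c} = \mu + sb_s + cb_c + \mathbf{p}_s^T\mathbf{q}_c \tag{3}$$
where $\mu$ is a global bias term, $sb_s$ and $cb_c$ are the student and course bias terms, respectively, and $\mathbf{p}_s$ and $\mathbf{q}_c$ are the latent representations for student $s$ and course $c$, respectively. The parameters of the MF method ($\mu$, $\mathbf{sb}$, $\mathbf{cb}$, $\mathbf{P}$, and $\mathbf{Q}$) are estimated following a matrix completion approach that considers only the observed entries in $\mathbf{G}$ as
$$\underset{\mu,\,\mathbf{sb},\,\mathbf{cb},\,\mathbf{P},\,\mathbf{Q}}{\text{minimize}} \;\; \frac{1}{2}\sum_{(s,c)\,:\,g_{s,c}\ \text{observed}} \left(g_{s,c} - \mu - sb_s - cb_c - \mathbf{p}_s^T\mathbf{q}_c\right)^2 + \frac{\lambda}{2}\left(\|\mathbf{P}\|_F^2 + \|\mathbf{Q}\|_F^2 + \|\mathbf{sb}\|_2^2 + \|\mathbf{cb}\|_2^2\right) \tag{4}$$
where $\lambda$ is a regularization parameter and $l$ is the dimensionality of the latent space (i.e., $\mathbf{p}_s, \mathbf{q}_c \in \mathbb{R}^l$), which is a parameter to this method.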
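A biased matrix factorization of this kind is commonly fitted with stochastic gradient descent over the observed entries. The sketch below illustrates that idea; it is not the solver used in the experiments (which the text does not specify). The function names, learning rate, initialization, and toy grades are hypothetical, and the global bias is left unregularized as an assumption.

```python
import math
import random


def predict_mf(model, s, c):
    """Global bias + student bias + course bias + inner product of latent vectors."""
    mu, sb, cb, P, Q = model
    return mu + sb[s] + cb[c] + sum(p * q for p, q in zip(P[s], Q[c]))


def train_mf(observed, n_students, n_courses, rank=2, lam=0.05,
             lr=0.02, epochs=300, seed=0):
    """Fit the model by SGD on (student, course, grade) triples of observed entries only."""
    rng = random.Random(seed)
    mu = sum(g for _, _, g in observed) / len(observed)  # start at the mean grade
    sb = [0.0] * n_students
    cb = [0.0] * n_courses
    P = [[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(n_students)]
    Q = [[rng.gauss(0.0, 0.1) for _ in range(rank)] for _ in range(n_courses)]
    data = list(observed)
    for _ in range(epochs):
        rng.shuffle(data)
        for s, c, g in data:
            err = g - predict_mf((mu, sb, cb, P, Q), s, c)
            mu += lr * err
            sb[s] += lr * (err - lam * sb[s])
            cb[c] += lr * (err - lam * cb[c])
            for f in range(rank):
                p, q = P[s][f], Q[c][f]
                P[s][f] += lr * (err * q - lam * p)
                Q[c][f] += lr * (err * p - lam * q)
    return mu, sb, cb, P, Q


def mf_rmse(model, triples):
    """Root mean square error of the model on a list of observed triples."""
    return math.sqrt(sum((g - predict_mf(model, s, c)) ** 2
                         for s, c, g in triples) / len(triples))


# Hypothetical observed grades for 4 students and 3 courses.
observed_toy = [(0, 0, 4.0), (0, 1, 3.7), (1, 0, 3.0), (1, 2, 2.7), (2, 1, 2.0),
                (2, 2, 1.7), (3, 0, 3.3), (3, 1, 3.0), (3, 2, 2.7)]
model_toy = train_mf(observed_toy, n_students=4, n_courses=3)
print("training RMSE:", round(mf_rmse(model_toy, observed_toy), 3))
print("student 1, course 1:", round(predict_mf(model_toy, 1, 1), 2))
```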
The accurate recovery of the low-rank model (when such a model exists) from a set of partial observations depends on having a sufficient number of observed entries, and on these entries being randomly sampled from the entries of the target matrix [5]. However, in the context of student grade data, the set of courses that students take is not a random subset of the courses being offered, as they need to satisfy their degree program requirements. As a result, such an MF approach may lead to suboptimal prediction performance.
In order to address this problem, we developed a course-specific matrix factorization (CSMF) approach that estimates an MF model for each course by utilizing a course-specific subset of the data that is denser (in terms of the number of observed entries and the dimensions of the matrix). As a result, it contains a larger number of randomly sampled subsets of sufficient size.
Given a course $c$ and a set of students $\mathcal{S}_c$ for which we need to estimate their grade for $c$ (i.e., the students in $\mathcal{S}_c$ have not taken this course yet), the data that CSMF utilizes are the following: (i) the students and grades of the matrix $\mathbf{G}^c$ and vector $\mathbf{y}_c$ of the CSR method (Section 3.1), and (ii) the students in $\mathcal{S}_c$ and their grades. This data is used to form a matrix $\mathbf{X}^c \in \mathbb{R}^{(n_c + n_t) \times (m_c + 1)}$, where $n_c$ is the number of students in $\mathbf{G}^c$, $n_t$ is the number of students in $\mathcal{S}_c$, and $m_c$ is the number of distinct courses that have at least one grade in $\mathbf{G}^c$ or among the grades of the students in $\mathcal{S}_c$. The values stored in $\mathbf{X}^c$ are the grades that exist in $\mathbf{G}^c$ and in the records of the students in $\mathcal{S}_c$. The last column of $\mathbf{X}^c$ stores the grades $\mathbf{y}_c$ for the course $c$ that were obtained by the students in $\mathbf{G}^c$. Thus, $\mathbf{X}^c$ contains all the prior grades associated with the students who have already taken course $c$ and the students for which we need to have their grade on $c$ predicted. Matrix $\mathbf{X}^c$ is then used in place of matrix $\mathbf{G}$ in Equation 4 to estimate the parameters of the CSMF method, which are then used to predict the missing entries of the last column of $\mathbf{X}^c$, which are the grades that need to be predicted.
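The construction of the course-specific matrix can be sketched as follows. This is an illustrative data-preparation routine only; the representation of students as dictionaries, the use of `None` for missing entries, and the function names are assumptions and not part of the described method.

```python
def build_csmf_matrix(taken, targets):
    """Stack students who took the target course on top of the target students.

    taken:   list of (prior_grades, grade_in_target_course) pairs, where
             prior_grades is a dict course -> grade.
    targets: list of prior_grades dicts for students whose grade is to be predicted.
    Returns (courses, rows); each row has one entry per course plus a final
    entry for the target course, with None marking a missing grade.
    """
    courses = sorted({c for prior, _ in taken for c in prior}
                     | {c for prior in targets for c in prior})
    rows = [[prior.get(c) for c in courses] + [grade] for prior, grade in taken]
    rows += [[prior.get(c) for c in courses] + [None] for prior in targets]
    return courses, rows


def observed_entries(rows):
    """List the (row, column, grade) triples that a matrix-completion routine would train on."""
    return [(i, j, g) for i, row in enumerate(rows)
            for j, g in enumerate(row) if g is not None]


# Hypothetical data: two students who took the target course, one target student.
taken_toy = [({"A": 4.0, "B": 3.0}, 3.7), ({"B": 2.0}, 2.3)]
targets_toy = [{"A": 3.0, "C": 3.3}]
courses_toy, rows_toy = build_csmf_matrix(taken_toy, targets_toy)
print(courses_toy)
for row in rows_toy:
    print(row)
```

The missing entries in the last column of the rows that belong to the target students are the grades to be predicted.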
4 Experimental Design
4.1 Dataset
The student-course-grade dataset that we used in our experiments was obtained from the University of Minnesota, which has a very flexible degree program. It contains the students that have been part of the Computer Science and Engineering (CSE) and Electrical and Computer Engineering (ECE) programs from Fall of 2002 to Spring of 2014. Both of these degree programs are part of the College of Science & Engineering (CS&E), in which students have to take a common set of core science courses during the first 2–3 semesters. We removed from the dataset any courses that are not part of those offered by CS&E departments, as these correspond to various liberal arts and physical education courses, which are taken by few students and in general do not count towards degree requirements. Furthermore, we eliminated any courses that were taken as pass/fail. The initial grades were in the A–F scale, which was converted to the 4–0 scale using the standard letter-grade to GPA conversion. The resulting dataset consists of 2,949 students, 2,556 different courses, and 76,748 student-course grades.
We used this dataset to assess the performance of the different methods for the task of predicting the grades that the students will obtain in the last semester (i.e., the most recent semester for which we have data). For this reason, the dataset was further split into two parts, one containing the students that are still active, i.e., have taken courses in the last semester ($\mathcal{D}_{active}$), and one that contains the remaining students ($\mathcal{D}_{inactive}$). $\mathcal{D}_{active}$ contains 876 students and 19,089 grades, out of which 3,427 grades are for the 475 distinct classes taken in the last semester. $\mathcal{D}_{inactive}$ contains 2,073 students and 57,659 grades.
These datasets were used to derive various training and testing datasets for the different methods that we developed. Specifically, for the CSR method we extracted the course-specific training and testing datasets as follows. For each course $c$ that was offered in the last semester, we extracted course-specific training and testing sets ($\mathcal{D}^{train}_{c,k}$ and $\mathcal{D}^{test}_{c,k}$) by selecting from $\mathcal{D}_{inactive}$ and $\mathcal{D}_{active}$, respectively, the students that have taken $c$ and, prior to taking $c$, also took at least $k$ other courses. The reason that these datasets were parametrized with respect to $k$ is that we wanted to assess how the methods perform when different amounts of historical student performance information are available. In our experiments we used $k$ in the set $\{5, 7, 9\}$. That information will create the grade matrix $\mathbf{G}^c$, where $g_{i,j}$ is the grade of the $i$th student on the $j$th course from the training set $\mathcal{D}^{train}_{c,k}$. Table 1 shows various statistics about the various course-specific datasets for different values of $k$.
Table 1: Statistics of the course-specific datasets for different minimum numbers of prior courses.
Prior courses  5  7  9
Average number of students in training set  270  232  212
Average number of students in test set  22  21  20
Average number of prior courses  141  141  145
Average number of grades  3,872  3,663  3,663
Courses predicted  92  90  80
Grades predicted  2,088  1,959  1,666
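The selection of students for a course-specific dataset, parametrized by the minimum number of prior courses, can be sketched as below. The record layout (semester index, course, grade), the rule that a prior course must have been taken in a strictly earlier semester, and the use of the first attempt at the target course are all assumptions made for this illustration; the function name and data are hypothetical.

```python
def course_specific_set(histories, course, k):
    """Select students who took `course` after at least k other courses.

    histories: dict student -> list of (semester, course, grade) records,
               with semester given as a sortable integer.
    Returns dict student -> (prior_grades, grade_in_course).
    """
    selected = {}
    for student, records in histories.items():
        attempts = [(sem, g) for sem, crs, g in records if crs == course]
        if not attempts:
            continue  # the student never took the target course
        sem_c, grade_c = min(attempts)  # earliest attempt (assumption)
        prior = {crs: g for sem, crs, g in records
                 if sem < sem_c and crs != course}
        if len(prior) >= k:
            selected[student] = (prior, grade_c)
    return selected


# Hypothetical student histories.
histories_toy = {
    "s1": [(1, "A", 4.0), (1, "B", 3.0), (2, "C", 3.7)],
    "s2": [(1, "A", 2.0), (2, "C", 2.3)],
    "s3": [(1, "A", 3.0), (1, "B", 3.3)],
}
print(course_specific_set(histories_toy, "C", k=2))
```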
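For the CSMF method, the training dataset for course $c$ was obtained by combining the course-specific training and testing sets ($\mathcal{D}^{train}_{c,k}$ and $\mathcal{D}^{test}_{c,k}$) into a single matrix after removing the grades that the target students achieved in course $c$.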
For the MF method, the matrix $\mathbf{G}$ is constructed as the union of the course-specific training and testing sets ($\mathcal{D}^{train}_{c,k}$ and $\mathcal{D}^{test}_{c,k}$) for every course $c$ to be predicted, after removing the grades that the active students achieved in the courses we want to predict. We formulated the dataset in this way in order to provide the same information for training and testing to all our models.
In the SSR method, the grade matrix $\mathbf{G}^{s,c}$ is created by selecting from $\mathbf{G}^c$ the set of courses that were also taken by the target student $s$ and the set of students whose OR with $s$ is at least $t$. Figure 1 shows some statistics about these datasets as a function of the OR threshold $t$.
Finally, we did not consider the models that have fewer than 20 students in their corresponding dataset, as we consider them to have too few training instances for reliable estimation.
4.2 Competing Methods
In our experiments, we compared our methods with the following competing approaches.

BiasOnly. We only took into consideration the local and global biases to predict the students’ grades. These biases were estimated using Eqn. 4 when the number of latent dimensions is zero ($l = 0$).

Student-Based Collaborative Filtering (SBCF). This method implements the approach described in [4]. For a target course $c$, every student is represented by a vector formed with his/her grades in courses taken prior to $c$. The vector of a target student $s$ is compared against the vectors of the other students that have taken course $c$ using Pearson’s correlation coefficient. We select the students with positive similarity to perform grade prediction for $s$ in $c$ according to:
(5)
where the prediction depends on the number of students selected, a confidence lower limit for significance weighting, the average grade of a student prior to taking $c$, and the similarity of the target student $s$ with each selected student.
4.3 Parameters and Model Selection
For CSR, we let the regularization parameter $\lambda_1$ take values from 0 to 40 in increments of 2.5 and $\lambda_2$ from 0 to 50 in increments of 2.5. For SSR, we let $\lambda_1$ take values from 0 to 10 in increments of 1 and $\lambda_2$ from 0 to 14 in increments of 2. For MF and CSMF, we let the regularization parameter $\lambda$ take values from 0 to 6 in increments of 0.05. For SSR, the range of the tested values for the overlap ratio is 0.3 to 1, in increments of 0.04. For the MF and CSMF methods, we tested the number of latent dimensions with the values 2, 5, and 8.
As we could not use cross-validation for SSR, we did not apply it to any regression model, in order to keep our comparisons fair. The best models were selected based on their performance on the test set. For the MF-based approaches, we used the semester before the target semester to estimate and select the best parameters.
4.4 Evaluation Methodology & Performance Metrics
We evaluated the performance of the different approaches by using them to predict the grades for the last semester in our dataset using the data from the previous semester for training.
We assessed the performance using the root mean square error (RMSE) between the actual grades and the predicted ones. Since the courses whose grades are predicted have different numbers of students, we computed two RMSE-based metrics. The first is the overall RMSE, in which all the grades across the different courses are pooled together, and the second is the average RMSE, obtained by averaging the RMSE values of the individual courses. We will denote the first by RMSE and the second by AvgRMSE.
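The difference between the two metrics can be made concrete with a short example. The function names and the toy grades are hypothetical; the computation itself follows the two definitions given above.

```python
import math


def rmse(pairs):
    """Root mean square error over (actual, predicted) grade pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))


def pooled_and_average_rmse(by_course):
    """by_course: dict course -> list of (actual, predicted) pairs.

    Returns (RMSE over all grades pooled together,
             mean of the per-course RMSE values).
    """
    pooled = rmse([pair for pairs in by_course.values() for pair in pairs])
    average = sum(rmse(pairs) for pairs in by_course.values()) / len(by_course)
    return pooled, average


# Hypothetical predictions: a two-student course and a one-student course.
results_toy = {"A": [(4.0, 3.0), (3.0, 3.0)], "B": [(2.0, 2.0)]}
print(pooled_and_average_rmse(results_toy))
```

In the pooled metric every grade carries the same weight, so large courses dominate; in the averaged metric every course carries the same weight, so small courses have more influence.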
5 Experimental Results
5.1 Course-Specific Regression
Table 2 shows the performance achieved by the CSR and CSRRC models when trained using the three different datasets discussed in Section 4.1. These results show that among the two models, CSRRC, which operates on the GPA-centered grades, leads to considerably lower errors both in terms of RMSE and AvgRMSE. In terms of the sensitivity of their performance to the amount of historical information that was available when estimating these models (i.e., the minimum number of prior courses), we can see that for CSRRC, the RMSE performance of the models does not change significantly, though the AvgRMSE performance improves when going from five to nine prior courses. This indicates that training sets with a larger number of prior courses tend to help smaller courses.
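Table 2: Performance of the CSR and CSRRC models for different minimum numbers of prior courses.
Method  RMSE (5 / 7 / 9 prior courses)  AvgRMSE (5 / 7 / 9 prior courses)
CSR  0.751 / 0.761 / 0.779  0.757 / 0.785 / 0.762
CSRRC  0.634 / 0.632 / 0.632  0.585 / 0.579 / 0.543
The performance of the models trained on the different datasets was evaluated on the test set, which is the common subset among their respective test sets.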
5.2 Student-Specific Regression
As one of the parameters for this problem was the overlap ratio between the courses of the target student and other students, Figure 2 presents the behavior of the model’s RMSE (left) and AvgRMSE (right) as we vary the overlap ratio for and . When the overlap ratio is increased, the selected students have more courses in common with the target student, which results in better performance. In order to compare the performance of SSR against CSRRC, Figure 3 shows the RMSE of the best CSRRC and SSR models. The RMSE values were computed on the subset of the test set that was predicted by both models. If the overlap ratio is more than 0.8, then SSR is more accurate. However, the capability of this method to predict courses is very low, i.e., we can predict 50% fewer courses than the CSR model for when the overlap ratio is more than 0.8, because there are not as many students that had followed the same degree plan as the selected student.
5.3 Methods based on Matrix Factorization
The performance of the methods based on matrix factorization is shown in Table 3 for various numbers of latent factors. Besides the MF and CSMF schemes that were described in Section 3.3, this table also shows results for a method labeled “MFGB”, which is derived from the MF scheme by eliminating the global bias term ($\mu$) of Eqn. 4. These results show that CSMF leads to lower RMSE values when there are more than nine prior courses per student, which confirms that by building matrix factorization models on smaller but denser course-specific submatrices, we can derive low-rank models that lead to more accurate matrix completion. Even for the case with more than five prior courses, if we focus on denser models, the majority of courses are predicted better by CSMF* than by the best model, MFGB. In terms of the number of latent factors, we can see that in most cases, the best performance is achieved with a small number of latent factors. This should not be surprising, as the average number of grades per student is low, which does not support a large number of latent factors.
Table 3: Performance of the methods based on matrix factorization for various numbers of latent factors.
Prior courses  Latent factors  Metric  MF  MFGB  CSMF  CSMF*
5  2  RMSE  0.662  0.661  0.683  0.676
5  5  RMSE  0.666  0.667  0.682  0.682
5  8  RMSE  0.667  0.672  0.679  0.676
5  2  AvgRMSE  0.597  0.581  0.648  0.645
5  5  AvgRMSE  0.603  0.569  0.643  0.647
5  8  AvgRMSE  0.604  0.596  0.645  0.644
7  2  RMSE  0.667  0.671  0.684  0.679
7  5  RMSE  0.673  0.675  0.680  0.677
7  8  RMSE  0.676  0.681  0.681  0.676
7  2  AvgRMSE  0.590  0.598  0.641  0.643
7  5  AvgRMSE  0.603  0.607  0.638  0.640
7  8  AvgRMSE  0.604  0.610  0.637  0.640
9  2  RMSE  0.675  0.684  0.683  0.671
9  5  RMSE  0.677  0.687  0.676  0.672
9  8  RMSE  0.681  0.692  0.677  0.674
9  2  AvgRMSE  0.581  0.600  0.653  0.648
9  5  AvgRMSE  0.582  0.607  0.645  0.646
9  8  AvgRMSE  0.579  0.599  0.648  0.647
5.4 Comparison with other methods
Table 4 compares the performance of the baseline approaches described in Section 4.2 (BiasOnly and SBCF) with the best-performing course-specific regression method (CSRRC), and the best CSMF method (two latent factors). In addition, the results labeled “CSMF” correspond to those obtained by CSMF in which the best-performing number of latent factors for each course can be different and was selected based on their performance on the validation set (10% of the training data). CSRRC and CSMF lead to RMSE and AvgRMSE values that are substantially better than either BiasOnly or SBCF. In terms of the methods that we developed, we see that CSRRC consistently outperforms CSMF, suggesting that sparse linear regression methods are better than those based on matrix factorization for this setting. Finally, comparing the performance of CSMF against CSMF, we see that even though the former achieved better performance, the difference is not very large, which suggests that CSMF’s performance is more consistent across its different model parameters.
Table 4: Comparison of the baseline approaches with the course-specific methods.
Method  RMSE (5 / 7 / 9 prior courses)  AvgRMSE (5 / 7 / 9 prior courses)
BiasOnly  0.728  0.687
SBCF  0.677  0.675
CSRRC  0.634 / 0.632 / 0.632  0.585 / 0.579 / 0.543
CSMF  0.679 / 0.680 / 0.676  0.645 / 0.638 / 0.645
CSMF*  0.676 / 0.676 / 0.671  0.644 / 0.640 / 0.648
The performance of the models trained on the different datasets was evaluated on the test set, which is the common subset among their respective test sets.
6 Conclusions
In this paper, we presented two course-specific approaches based on linear regression and matrix factorization that perform better than existing approaches based on traditional methods. This suggests that focusing on a course-specific subset of the data can result in more accurate predictions. A student-course-specific approach was also developed, but its accuracy in grade prediction is limited by the diverse nature of degree plans. The course-specific regression was the one with the best results compared to any other method tested.
7 Acknowledgements
This work was supported in part by NSF (IIS0905220, OCI1048018, CNS1162405, IIS1247632, IIP1414153, IIS1447788) and the Digital Technology Center at the University of Minnesota. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute. http://www.msi.umn.edu
References
 [1] Starfish: Earlyalert. http://www.starfishsolutions.com/home/studentsuccesssolutions/, [Online; accessed 4 October 2015]
 [2] Al-Barrak, M.A., Al-Razgan, M.: Predicting students final GPA using decision trees: A case study. International Journal of Information and Education Technology 6(7), 528 (2016)
 [3] Arnold, K.E., Pistilli, M.D.: Course signals at purdue: using learning analytics to increase student success. In: Proceedings of the 2nd International Conference on Learning Analytics and Knowledge. pp. 267–270. ACM (2012)
 [4] Bydžovská, H.: Are collaborative filtering methods suitable for student performance prediction? In: Progress in Artificial Intelligence, pp. 425–430. Springer (2015)
 [5] Chen, Y., Bhojanapalli, S., Sanghavi, S., Ward, R.: Coherent matrix completion. arXiv preprint arXiv:1306.2979 (2013)
 [6] Denley, T.: Course recommendation system and method. http://www.google.com/patents/US20130011821, [Online; accessed 4 October 2015]
 [7] Denley, T.: Austin Peay State University: Degree Compass. EDUCAUSE Review Online. Available: http://www.educause.edu/ero/article/austinpeaystateuniversitydegreecompass (2012)
 [8] Elbadrawy, A., Studham, R.S., Karypis, G.: Collaborative multi-regression models for predicting students’ performance in course activities. In: Proceedings of the Fifth International Conference on Learning Analytics And Knowledge. pp. 103–107. ACM (2015)
 [9] Hwang, C.S., Su, Y.C.: Unified clustering locality preserving matrix factorization for student performance prediction. IAENG International Journal of Computer Science 42(3) (2015)
 [10] Kantor, P.B., Rokach, L., Ricci, F., Shapira, B.: Recommender systems handbook. Springer (2011)
 [11] Luo, J., Sorour, E., Goda, K., Mine, T.: Predicting student grade based on free-style comments using word2vec and ANN by considering prediction results obtained in consecutive lessons pp. 396–399 (June 2015)
 [12] McKay, T., Miller, K., Tritz, J.: What to do with actionable intelligence: E2Coach as an intervention engine. In: Proceedings of the 2nd International Conference on Learning Analytics and Knowledge. pp. 88–91. ACM (2012)
 [13] Ogunde, A., Ajibade, D.: A data mining system for predicting university students’ graduation grades using ID3 decision tree algorithm. Journal of Computer Science and Information Technology 2(1), 21–46 (2014)
 [14] Osmanbegović, E., Suljić, M.: Data mining approach for predicting student performance. Economic Review 10(1) (2012)
 [15] Pardos, Z.A., Heffernan, N.T.: Using hmms and bagged decision trees to leverage rich features of user and skill from an intelligent tutoring system dataset. Journal of Machine Learning Research W & CP (2010)
 [16] Romero, C., Ventura, S., Espejo, P.G., Hervás, C.: Data mining algorithms to classify students. In: Educational Data Mining 2008 (2008)
 [17] Sorour, S.E., Mine, T., Goda, K., Hirokawa, S.: A predictive model to evaluate student performance. Journal of Information Processing 23(2), 192–201 (2015)
 [18] Thai-Nghe, N., Drumond, L., Horváth, T., Schmidt-Thieme, L.: Using factorization machines for student modeling. In: UMAP Workshops (2012)
 [19] Töscher, A., Jahrer, M.: Collaborative filtering applied to educational data mining. KDD cup (2010)