# Using Latent Variable Models to Observe Academic Pathways

###### Abstract

Understanding large-scale patterns in student course enrollment is a problem of great interest to university administrators and educational researchers. Yet important decisions are often made without a good quantitative framework of the process underlying student choices. We propose a probabilistic approach to modelling course enrollment decisions, drawing inspiration from multilabel classification and mixture models. We use ten years of anonymized student transcripts from a large university to construct a Gaussian latent variable model that learns the joint distribution over course enrollments. The model supports a diverse set of inference queries and is robust to data sparsity. We demonstrate the efficacy of this approach in comparison to others, including deep learning architectures, and show its ability to infer the underlying student interests that guide enrollment decisions.


Nate Gruver, Stanford University, ngruver@cs.stanford.edu

Ali Malik, Stanford University, malikali@cs.stanford.edu

Brahm Capoor, Stanford University, brahm@cs.stanford.edu

Chris Piech, Stanford University, piech@cs.stanford.edu

Mitchell L. Stevens, Stanford University, stevens4@stanford.edu

Andreas Paepcke, Stanford University, paepcke@cs.stanford.edu


Education researchers increasingly recognize the need to understand the sequential accumulation of college coursework into academic pathways. In [?, ?], Bailey et al. call for change in how colleges organize course offerings to enable more efficient pathways. Rather than presenting a bewildering array of courses, cafeteria-style, they recommend “guided pathways” through academic offerings. Baker [?] builds on Bailey’s work, suggesting “meta-majors” for simplifying choice without curtailing options. Meta-majors combine coursework supporting multiple majors into larger, substantively coherent content domains. Baker proposes social-network analytic techniques to discover opportunities for building meta-majors. All of these authors argue that rather than limiting choice, such interventions can yield more tractable programs, faster degree completion, and lower costs for both students and schools.

Such reforms can be enabled by analysis of data corpora describing the academic sequences of prior student enrollments. For example, some courses may be de facto prerequisites for other courses, whether listed as "required" or not in formal catalogue entries. Similarly, “odd” delays in taking particular courses, or unexplained detours in course selection, can be symptoms of unintended scheduling conflicts.

In the service of such reforms, we offer a model of course enrollment capable of efficient inference over hundreds to thousands of classes. Our generative model captures the full joint distribution of course enrollments and can be used to sample potential pathways for any given student. The model’s complexity allows us to determine an underlying "typology" of students, from implicit course-taking patterns to differing levels of novelty in their academic pathways relative to the overall population of paths.

Predicting course enrollment decisions may be viewed as a problem of multi-label classification: the task of assigning a subset of labels to each data point in a collection. In the context of academic course enrollments, each data point is a student and the labels are the courses in which they enrolled. The problem of modeling all possible enrollment choices scales exponentially with the number of classes $n$ (there are $2^n$ possible enrollment subsets), which motivates a statistical approach. Probabilistic graphical models (PGMs) and deep neural networks are perhaps the most prominent methods for stochastic models of high-dimensional data. As our motivation in this work is not simply high accuracy but also interpretability and inference, we focus on PGMs, which fare better on those aspects and are amenable to scaling adequate for our empirical setting.
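As a concrete sketch of this framing, each student transcript can be encoded as a binary label vector over the course catalogue. The course names and transcripts below are illustrative, not drawn from the paper's dataset:

```python
import numpy as np

# Hypothetical toy catalogue and transcripts (names are illustrative).
catalogue = ["CS106A", "CS106B", "MATH51", "CS109", "CS221"]
transcripts = [
    {"CS106A", "MATH51"},
    {"CS106A", "CS106B", "CS109"},
    {"MATH51", "CS109", "CS221"},
]

index = {course: i for i, course in enumerate(catalogue)}
n = len(catalogue)

# Each student becomes a binary label vector x in {0, 1}^n.
X = np.zeros((len(transcripts), n), dtype=int)
for row, courses in enumerate(transcripts):
    for c in courses:
        X[row, index[c]] = 1

# The label space grows exponentially: 2^n possible enrollment subsets.
print(X)
print("possible enrollment patterns:", 2 ** n)  # 32 for n = 5
```

With a real catalogue of hundreds of courses, enumerating the $2^n$ subsets directly is hopeless, which is what motivates the statistical models that follow.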

Latent variable models are a subclass of PGMs in which some variables are never observed in training data and are thus “latent.” These models are more computationally demanding than fully observed models, but are able to capture complex structure in data without supervision.

Among the simplest and most commonly used latent variable models is the naive Bayes model, with a hidden variable $Z$ taking discrete values in $\{1, \dots, K\}$ and observations $X = (X_1, \dots, X_n)$. In the enrollment setting, $n$ is the number of courses and $X_i \in \{0, 1\}$ (0 denoting no enrollment and 1 an enrollment). The generative process of the model is described below:

$$Z \sim \text{Categorical}(\pi), \qquad X_i \mid Z = k \sim \text{Bernoulli}(p_{k,i}) \quad \text{for } i = 1, \dots, n$$
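A minimal sketch of this naive Bayes generative process, with illustrative values for the prior over hidden types and the per-course enrollment probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)

K, n = 3, 5                              # hidden "student types", number of courses
pi = np.array([0.5, 0.3, 0.2])           # P(Z = k); illustrative values
p = rng.uniform(0.1, 0.9, size=(K, n))   # p[k, i] = P(X_i = 1 | Z = k)

def sample_student():
    # Generative process: draw a hidden type, then each enrollment
    # independently given that type.
    z = rng.choice(K, p=pi)
    x = rng.binomial(1, p[z])
    return z, x

z, x = sample_student()
print(z, x)  # a type in {0, 1, 2} and a binary enrollment vector of length 5
```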

Given the value of the hidden variable, each individual probability of enrolling in a course is independent. This is a poor inductive bias because enrollment decisions often influence one another, and the number of courses taken by a student in a given year is dependent on the courses taken. It is easier to capture these two facets of the data if we can model enrollments jointly, without strong independence assumptions.

Any joint probability distribution over all $2^n$ discrete combinations of $X$ requires $O(2^n)$ parameters and is thus intractable. One possible solution is relaxation of the discrete problem to a real-valued vector space, replacing $X \in \{0, 1\}^n$ with $\tilde{X} \in \mathbb{R}^n$.

By training a model over $\tilde{X}$, we can take advantage of real-valued distributions with much smaller parameter spaces.
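To make the parameter-count contrast concrete, the following sketch compares a full discrete joint over $\{0, 1\}^n$ with a mixture of full-covariance Gaussians over $\mathbb{R}^n$; the values of $n$ and $K$ here are illustrative choices, not the paper's settings:

```python
n = 500   # number of courses (order of magnitude, for illustration)
K = 20    # mixture components; illustrative choice

# Full discrete joint over {0,1}^n: one free probability per outcome.
full_joint_params = 2 ** n - 1

# GMM over R^n with full covariances:
# (K - 1) mixture weights + K mean vectors + K symmetric covariance matrices.
gmm_params = (K - 1) + K * n + K * n * (n + 1) // 2

print(f"full joint: a {len(str(full_joint_params))}-digit number of parameters")
print(f"GMM:        {gmm_params:,} parameters")
```

Even a few million GMM parameters is a vanishing fraction of the discrete joint's requirement, which is the payoff of the real-valued relaxation.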

The Gaussian Mixture Model (GMM) is an archetypal latent variable model for real-valued data [?]. We can describe a GMM by the generative process below:

$$Z \sim \text{Categorical}(\pi), \qquad \tilde{X} \mid Z = k \sim \mathcal{N}(\mu_k, \Sigma_k)$$
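A minimal numpy sketch of sampling from this generative process; the diagonal covariances and all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

K, n = 2, 3
pi = np.array([0.6, 0.4])                          # mixture weights
mu = rng.normal(size=(K, n))                       # component means
# Simple diagonal covariances, for the sake of the sketch.
sigma = np.stack([np.eye(n) * 0.1, np.eye(n) * 0.2])

def sample_gmm(m):
    # Z ~ Categorical(pi), then X | Z = k ~ N(mu_k, Sigma_k).
    z = rng.choice(K, size=m, p=pi)
    x = np.array([rng.multivariate_normal(mu[k], sigma[k]) for k in z])
    return z, x

z, x = sample_gmm(4)
print(x.shape)  # (4, 3)
```

In practice one would fit the parameters by expectation-maximization rather than fixing them by hand; this sketch only shows the sampling direction.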

We can adapt the GMM to the setting of multi-label classification by providing an unbiased estimator of the probability of each binary sample:
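The estimator's exact form is not reproduced above. One plausible construction, sketched here under the assumption that relaxed GMM samples clipped into $[0, 1]$ are read as independent Bernoulli parameters, is a simple Monte Carlo average of the resulting likelihoods:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_binary_prob(x, sample_relaxed, M=10_000, eps=1e-6):
    """Monte Carlo estimate of P(X = x) under a relaxed real-valued model.

    Assumption (not from the paper): each sampled vector, clipped into
    [eps, 1 - eps], is treated as a vector of independent Bernoulli
    parameters. Averaging the resulting likelihoods over M samples gives
    an unbiased estimate of the implied marginal probability of x.
    """
    samples = sample_relaxed(M)                   # shape (M, n), real-valued
    q = np.clip(samples, eps, 1.0 - eps)
    likes = np.prod(np.where(x == 1, q, 1.0 - q), axis=1)
    return likes.mean()

# Toy relaxed model: a single Gaussian around [0.9, 0.1, 0.8].
mu = np.array([0.9, 0.1, 0.8])
sample = lambda M: rng.normal(mu, 0.05, size=(M, len(mu)))

# For x = [1, 0, 1] the estimate lands near 0.9 * 0.9 * 0.8.
print(estimate_binary_prob(np.array([1, 0, 1]), sample))
```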