A Model-based Projection Technique for Segmenting Customers
Srikanth Jagabathula
Stern School of Business, New York University, New York, NY 10012, sjagabat@stern.nyu.edu
Lakshminarayanan Subramanian, Ashwin Venkataraman
Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, {lakshmi,ashwin}@cs.nyu.edu
We consider the problem of segmenting a large population of customers into non-overlapping groups with similar preferences, using diverse preference observations such as purchases, ratings, clicks, etc. over subsets of items. We focus on the setting where the universe of items is large (ranging from thousands to millions) and unstructured (lacking well-defined attributes) and each customer provides observations for only a few items. These data characteristics limit the applicability of existing techniques in marketing and machine learning. To overcome these limitations, we propose a model-based projection technique, which transforms the diverse set of observations into a more comparable scale and deals with missing data by projecting the transformed data onto a low-dimensional space. We then cluster the projected data to obtain the customer segments. Theoretically, we derive precise necessary and sufficient conditions that guarantee asymptotic recovery of the true customer segments. Empirically, we demonstrate the speed and performance of our method in two real-world case studies: (a) an 84% improvement in the accuracy of new movie recommendations on the MovieLens data set and (b) a 6% improvement in the performance of the similar-item recommendations algorithm on an offline dataset at eBay. We show that our method outperforms standard latent-class and demographic-based techniques.
Key words: Segmentation, Estimation/statistical techniques, missing data, projection
Customer segmentation is a key marketing problem faced by a firm. It involves grouping customers into non-overlapping segments, with each segment comprising customers having similar needs and preferences. Accurately segmenting its customers allows a firm to effectively customize its product offerings, promotions, and recommendations to the particular preferences of its customers (Smith 1956). This is particularly the case when firms do not have a sufficient number of observations per customer to accurately personalize their decisions to individual customers. For example, Netflix and eBay.com help their customers navigate their large catalogs by offering new movie and similar item recommendations, respectively. However, new movies on Netflix lack ratings or viewing data (a problem popularly referred to as the 'cold-start' problem within the recommendation systems literature), whereas each customer might interact with only a small fraction of eBay's catalog; for example, in our sample data set, a customer on average interacted with 5 out of 2M items (see Section id1). As a result, individual-level personalization is not practical, and customizing the recommendations to each segment can greatly increase the accuracy of recommendations.
Because firms are able to collect large amounts of data about their customers, we focus on the setting where the firm has collected fine-grained observations such as direct purchases, ratings, and clicks over a subset of items, in addition to any demographic data such as age, gender, income, etc., for each customer. Our goal is to use these data to classify a large population of customers into corresponding segments in order to improve the accuracy of a given prediction task. The observations are collected over a universe of items that is unstructured, lacking well-defined feature information, and large, consisting of thousands to millions of items. The data may comprise information on diverse types of actions such as purchases, clicks, ratings, etc., which are represented on different scales. Moreover, the observations from each customer are highly incomplete and span only a small fraction of the entire item universe. Such data characteristics are common in practice. For instance, the eBay marketplace offers a diverse product catalog consisting of products ranging from a Fitbit tracker/iPhone (products with well-defined attributes) to obscure antiques and collectibles, which are highly unstructured. Of these, each customer may purchase/click/rate only a few items.
The literature in marketing and machine learning has studied the problem of segmentation, but the above data characteristics preclude the applicability of existing techniques. Specifically, most techniques within marketing focus on characterizing the market segments in terms of product and customer features by analyzing structured products and small samples of customer populations; consequently, they do not scale to directly classifying a large population of customers. The standard clustering techniques in machine learning (see Jain 2010 for a review), on the other hand, are designed for such direct classification and rely on a similarity measure to determine whether two customers should belong to the same segment. However, the diversity and incompleteness of observations make it challenging to construct a meaningful similarity measure. For example, it is difficult to assess the degree of similarity between a customer who purchased an item and another who has rated the same item, or between two customers who have purchased completely non-overlapping sets of products.
To overcome the above limitations, we propose a model-based projection technique that extends the extant clustering techniques in machine learning to handle categorical observations that come from diverse data sources and have (many) missing entries. The algorithm takes as inputs the observations from a large population of customers and a probabilistic model class describing how the observations are generated from an individual customer. The choice of the model class is determined by the prediction task at hand, as described below, and provides a systematic way to incorporate domain knowledge by leveraging the existing literature in marketing, which has proposed rich models describing individual customer behavior. It outputs a representation of each customer as a vector in a low-dimensional Euclidean space whose dimension is much smaller than the number of items in the universe. The vector representations are then clustered, using a standard technique such as the $k$-means algorithm, to obtain the corresponding segments. In particular, the algorithm proceeds in two sequential steps: transform and project. The transform step is designed to address the issue of diversity of the observed signals by transforming the categorical observations into a continuous scale that makes different observations (such as purchases and ratings) comparable. The project step then deals with the issue of missing data by projecting the transformed observations onto a low-dimensional space, to obtain a vector representation for each customer.
The key novelty of our algorithm is the transform step. This step uses a probabilistic model to convert a categorical observation into the corresponding (log-)likelihood of the observation under the model. For example, if the probability that an item is liked by a customer is given by $p$, then a "like" observation is transformed into $-\log p$ and a "dislike" observation is transformed into $-\log(1-p)$. We call our algorithm model-based because the transform step relies on a probabilistic model; Section id1 presents a case study in which we illustrate the choice of the model when the objective is to accurately predict if a new movie will be liked by a customer. We estimate the model parameters by pooling together the data from all customers and ignoring the possibility that different customers may have different model parameters. This results in a model that describes a 'pooled' customer, i.e., a virtual customer whose preferences reflect the aggregated preferences of the population. The likelihood transformations then measure how much a particular customer's preferences differ from those of the population. The discussion in Section id1 (see Lemmas 1, 2 and Theorems 1, 2, 3 and 4) shows that under reasonable assumptions, customers from different segments will have different (log-)likelihood values under the pooled model, allowing us to separate them out.
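As a rough sketch of the transform step for a binary like/dislike signal, the mapping from a categorical label to its negative log-likelihood under a pooled Bernoulli model can be written as follows. The function and variable names here are our own illustrative choices, not part of the paper's formal description:

```python
import math

def transform(label, p):
    """Map a categorical observation to its negative log-likelihood
    under the pooled model (sketch for a binary like/dislike signal).

    label: 1 for "like", 0 for "dislike"
    p:     pooled probability that the item is liked
    """
    return -math.log(p) if label == 1 else -math.log(1.0 - p)

# A "like" on an item the population rarely likes (p = 0.1) yields a
# large value: this customer deviates strongly from the pooled model.
rare_like = transform(1, 0.1)    # -log(0.1), about 2.303
common_like = transform(1, 0.9)  # -log(0.9), about 0.105
```

The key property is that surprising observations (unlikely under the pooled model) map to large values, which is what lets the later clustering step separate customers whose preferences deviate from the population's.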
Our algorithm is inspired by existing ideas for clustering in the theoretical computer science literature and systematically generalizes algorithms that are popular within the machine learning community. In particular, when the customer observations are continuous and there are no missing entries, then the transform step reduces to the trivial identity mapping and our algorithm reduces to the standard spectral projection technique (Vempala and Wang 2004, Achlioptas and McSherry 2005, Kannan et al. 2005) for clustering the observations from a mixture of multivariate Gaussians. When the observations are continuous but there are missing entries, our algorithm becomes a generalized matrix factorization technique—commonly used in collaborative filtering applications (Rennie and Srebro 2005, Mnih and Salakhutdinov 2007, Koren et al. 2009).
Our work makes the following key contributions:

Novel segmentation algorithm. Our algorithm is designed to operate on large customer populations and large collections of unstructured items. Moreover, it is (a) principled: it reduces to standard algorithms in machine learning in special cases; (b) fast: it is an order of magnitude faster than benchmark latent-class models because it requires fitting only one model (as opposed to a mixture model); and (c) flexible: it allows practitioners to systematically incorporate problem-dependent structure via the model specification, providing a way to take advantage of the literature in marketing that has proposed rich models describing individual customer behavior.

Analytical results. Under standard assumptions on customer preference heterogeneity, we derive necessary and sufficient conditions for exact recovery of the true segments. Specifically, we bound the asymptotic misclassification rate, i.e., the expected fraction of customers incorrectly classified, of a nearest-neighbor classifier that uses the customer representations obtained after the project step in our algorithm above. Given a total of $N$ observations such that each customer provides at least $d$ observations, we show that the misclassification rate vanishes as $d$ grows, at a rate governed by constants that depend on the underlying parameters of the latent class model. In other words, our algorithm correctly classifies all customers into their respective segments as the number of observations per customer grows. Our results are similar in spirit to the conditions derived in the existing literature for Gaussian mixture models (Kannan et al. 2005). However, existing proof techniques do not generalize to our setting. Our results are among the first to provide such guarantees for latent-class preference models.

Empirical results. We conducted three numerical studies to validate our methodology:

Faster and more accurate than EM. Using synthetic data, we show that our method obtains more accurate segments, while being up to an order of magnitude faster, than the standard EM-based latent class benchmark.

Cold-start problem in the MovieLens data set. We apply our method to the problem of recommending new movies to users, popularly referred to as the cold-start problem. On the publicly available MovieLens dataset, we show that segmenting users via our method and customizing recommendations to each segment improves the recommendation accuracy by 48%, 55%, and 84% for the drama, comedy, and action genres, respectively, over a baseline population-level method that treats all users as having the same preferences. In addition, it also outperforms the standard EM-based latent class benchmark on all three genres, while achieving an order-of-magnitude speedup in computation time.

Similar item recommendations on eBay.com. We describe a real-world implementation of our segmentation methodology for personalizing similar item recommendations on eBay. The study shows that (1) segmenting the user population based on just their viewing/click/purchase activity using our approach results in up to a 6% improvement in the recommendation quality when compared to treating the population as homogeneous and (2) our algorithm can scale to large datasets. The improvement of 6% is non-trivial because, before our method, eBay tried several natural segmentations (by similarity of demographics, frequency/recency of purchases, etc.), but all of them resulted in no improvement.

Our work has connections to literature in both marketing and machine learning. We start by positioning our work in the context of segmentation techniques in the marketing literature. Customer segmentation is a classical marketing problem, with origins dating back to the work of Smith (1956). Marketers classify various segmentation techniques into a priori versus post hoc and descriptive versus predictive methods, giving rise to a 2 × 2 classification matrix of these techniques (Wedel and Kamakura 2000). Our algorithm is closest to the post hoc predictive methods, which identify customer segments on the basis of the estimated relationship between a dependent variable and a set of predictors. These methods consist of clustering techniques and latent class models. The traditional method for predictive clustering is automatic interaction detection (AID), which splits the customer population into non-overlapping groups that differ maximally according to a dependent variable, such as purchase behavior, on the basis of a set of independent variables, such as socioeconomic and demographic characteristics (Assael 1970, Kass 1980, Maclachlan and Johansson 1981). However, these approaches typically require large sample sizes to achieve satisfactory results. Ogawa (1987) and Kamakura (1988) proposed hierarchical segmentation techniques tailored to conjoint analysis, which group customers such that the accuracy with which preferences/choices are predicted from product attributes or profiles is maximized. These methods estimate parameters at the individual level, and are therefore restricted by the number of observations available for each customer. Clusterwise regression methods overcome this limitation, as they cluster customers such that the regression fit is optimized within each cluster. The applicability of these methods to market segmentation was identified by DeSarbo et al. (1989) and Wedel and Kistemaker (1989), and extended by Wedel and Steenkamp (1989) to handle partial membership of customers in segments.
Latent class (or mixture) methods offer a statistical approach to the segmentation problem and fall into two types: mixture regression and mixture multidimensional scaling models. Mixture regression models (Wedel and DeSarbo 1994) simultaneously group subjects into unobserved segments and estimate a regression model within each segment; they were pioneered by Kamakura and Russell (1989), who proposed a clusterwise logit model to segment households based on brand preferences and price sensitivities. This was extended by Gupta and Chintagunta (1994), who incorporated demographic variables, and Kamakura et al. (1996), who incorporated differences in customer choice-making processes, resulting in models that produce identifiable and actionable segments. Mixture multidimensional scaling (MDS) models simultaneously estimate market segments as well as the preference structure of customers in each segment, for instance, a brand map depicting the positions of the different brands on a set of unobserved dimensions assumed to influence the perceptual or preference judgments of customers. See DeSarbo et al. (1994) for a review of these methods.
One issue with the latent class approach is its discrete model of heterogeneity; it has been argued (Allenby et al. 1998, Wedel et al. 1999) that individual-level response parameters are required for direct marketing approaches. As a result, continuous mixing distributions have been proposed that capture fine-grained customer heterogeneity, leading to hierarchical Bayesian estimation of models (Allenby and Ginter 1995, Rossi et al. 1996, Allenby and Rossi 1998). Computation of the posterior estimates in Bayesian models, however, is typically intractable, and Markov chain Monte Carlo techniques are employed, which entail a number of computational issues (Gelman and Rubin 1992).
The purpose of the above model-based approaches to segmenting customers is fundamentally different from that of our approach. These methods focus on characterizing the market segments in terms of product and customer features (such as prices, brands, demographics, etc.) by analyzing structured products (i.e., those having well-defined attributes) and small samples of customer populations; consequently, they do not scale to directly classifying a large population of customers. Our algorithm is explicitly designed to classify the entire population of customers into segments as accurately as possible, and it can be applied even when the data are less structured or unstructured (refer to the case study in Section id1). Another distinction is that we can provide necessary and sufficient conditions under which our algorithm guarantees asymptotic recovery of the true segments in a latent class model, unlike most prior work in the literature. In addition, our algorithm can still incorporate domain knowledge by leveraging the rich models of customer behavior specified in the existing marketing literature.
Our work also has methodological connections to literature in theoretical computer science and machine learning. Specifically, our model-based projection technique extends existing techniques for clustering real-valued observations with no missing entries (Vempala and Wang 2004, Achlioptas and McSherry 2005, Kannan et al. 2005) to handle diverse categorical observations with (many) missing entries. The project step in our segmentation algorithm has connections to matrix factorization techniques in collaborative filtering, and we point out the relevant details in our discussion of the algorithm in Section id1.
Our goal is to segment a population of customers comprising a fixed but unknown number of non-overlapping segments. To carry out the segmentation, we assume access to individual-level observations that capture differences among the segments. The observations may come from diverse sources: organically generated clicks or purchases during online customer visits; ratings provided on review websites (such as Yelp, TripAdvisor, etc.) or recommendation systems (such as Amazon, Netflix, etc.); purchase attitudes and preferences collected from a conjoint study; and demographics such as age, gender, income, education, etc. Such data are routinely collected by firms as customers interact through various touch points. Without loss of generality, we assume that all the observations are categorical; any continuous observations may be appropriately quantized. The data sources may be coarsely curated based on the specific application, but we do not assume access to fine-grained feature information.
To deal with observations from diverse sources, we consider a unified representation in which each observation is mapped to a categorical label for a particular "item" belonging to the universe of all items. We use the term "item" generically to mean different entities in different contexts. For example, when the observations are product purchases, the items are products and the labels are binary purchase/no-purchase signals. When the observations are choices from a collection of offer sets (such as those collected in a choice-based conjoint study), the items are offer sets and the labels are the IDs of the chosen products. Finally, when the observations are ratings for movies, the items are movies and the labels are star ratings. Therefore, our representation provides a compact and general way to capture diverse signals. We index a typical customer by $u$, an item by $i$, and a segment by $k$; we let $n$ denote the number of customers, $m$ the number of items, and $K$ the number of segments.
In practice, we observe labels for only a small subset of the items for each customer. Because the numbers of observations can differ widely across customers, we represent the observed labels using an edge-labeled bipartite graph $G$, defined between the customers and the items. An edge $(u, i)$ denotes that we have observed a label from customer $u$ for item $i$, with the edge label representing the observed label. We call this graph the customer-item preference graph. We let $\boldsymbol{x}_u$ denote the vector of observations for customer $u$ (we use bold lowercase letters such as $\boldsymbol{x}$ to represent vectors), with the entry $x_{ui}$ marked missing if the label for item $i$ from customer $u$ is unobserved. Let $I_u$ denote the set of items for which we have observations for customer $u$. It follows from our definitions that $I_u$ also denotes the set of neighbors of the customer node $u$ in the bipartite graph, and the degree $d_u$, the size of the set $I_u$, denotes the number of observations for customer $u$. Note that the observations for each customer are typically highly incomplete; therefore, $d_u \ll m$ and the bipartite graph is highly sparse.
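Because the preference graph is highly sparse, a natural in-memory representation stores, for each customer, only the observed (item, label) pairs. The sketch below (with hypothetical names and toy data) illustrates this representation along with the neighbor set $I_u$ and the degree $d_u$:

```python
# Sparse representation of the customer-item preference graph: each
# customer maps to a dict of {item: label}; items absent from the
# inner dict are missing observations.
observations = {
    "u1": {"item_a": 1, "item_b": 0},
    "u2": {"item_b": 1},
}

def neighbors(graph, customer):
    """The set of items with an observed label for this customer (I_u)."""
    return set(graph.get(customer, {}))

def degree(graph, customer):
    """The number of observations d_u for this customer."""
    return len(graph.get(customer, {}))
```

Storing only observed edges keeps memory proportional to the number of observations rather than to the full customer-by-item matrix, which matters when the item universe runs to millions.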
In order to carry out the segmentation, we assume that different segments are characterized by different preference parameters and a model relates the latent parameters to the observations. Specifically, the customer labels are generated according to a parametric model from the model class $\mathcal{F} = \{f_\theta(\cdot) \colon \theta \in \Theta\}$, where $\Theta$ is the parameter space that indexes the models; $\boldsymbol{x} = (x_1, \ldots, x_m)$ is the vector of item labels, and $f_\theta(\boldsymbol{x})$ is the probability of observing the item labels $\boldsymbol{x}$ from a customer with model parameter $\theta$. Here, $x_i \in \mathcal{X}_i$, where $\mathcal{X}_i$ is the domain of possible categorical labels for item $i$. When all the labels are binary, $\mathcal{X}_i = \{0, 1\}$ for every item $i$ and $\boldsymbol{x} \in \{0, 1\}^m$. The choice of the parametric family depends on the application context and the prediction problem at hand. For instance, if the task is to predict whether a particular segment of customers likes a movie or not, then $\mathcal{F}$ can be chosen to be the binary logit model class. Instead, if the task is to predict the movie rating (say, on a 5-star scale) for each segment, then $\mathcal{F}$ can be the ordered logit model class. Finally, if the task is to predict which item each segment will purchase, then $\mathcal{F}$ can be the multinomial logit (MNL) model class. Depending on the application, other models proposed within the marketing literature may be used. We provide a concrete illustration as part of a case study with the MovieLens dataset, described in Section id1.
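To make one of these model-class choices concrete, the sketch below computes a multinomial logit (MNL) choice probability for an offer set. The function name and the use of a utilities dictionary are our own illustrative assumptions:

```python
import math

def mnl_choice_prob(utilities, chosen):
    """Multinomial-logit probability of choosing `chosen` from an offer
    set with the given mean utilities: exp(v_chosen) / sum_j exp(v_j).

    utilities: {product_id: mean utility v_j} for the offer set
    chosen:    product_id of the chosen product
    """
    denom = sum(math.exp(v) for v in utilities.values())
    return math.exp(utilities[chosen]) / denom
```

Under the unified representation of the previous paragraph, the "item" here is the offer set and the label is the ID of the chosen product, so this function gives the label probability that the transform step would take the negative logarithm of.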
A population comprised of $K$ segments is described by $K$ distinct models $f_{\theta_1}, \ldots, f_{\theta_K} \in \mathcal{F}$ with corresponding parameters $\theta_1, \ldots, \theta_K$, so that the observations from customers in segment $k$ are generated according to model $f_{\theta_k}$. Specifically, we assume that customer $u$ in segment $k$ generates the label vector $\boldsymbol{x} \sim f_{\theta_k}$, and we observe the labels $x_{ui}$ for all the items $i \in I_u$, for some preference graph $G$. Further, for ease of notation, we drop the explicit dependence of models in $\mathcal{F}$ on the parameter $\theta$ in the remainder of the discussion. Let $\boldsymbol{x}_u$ denote the observed label vector from customer $u$ and define the domain $\mathcal{X}_{I_u} = \prod_{i \in I_u} \mathcal{X}_i$. Given any model $f \in \mathcal{F}$, we define $f(\boldsymbol{x}_u) = \sum_{\boldsymbol{z}} f(\boldsymbol{x}_u, \boldsymbol{z})$ for each $\boldsymbol{x}_u \in \mathcal{X}_{I_u}$, where $\boldsymbol{z}$ represents the vector of missing labels for customer $u$ and the summation is over all feasible missing label vectors given the observations $\boldsymbol{x}_u$. Observe that $f(\cdot)$ defines a distribution over $\mathcal{X}_{I_u}$. Finally, let $d_u$ denote the length of the vector $\boldsymbol{x}_u$; we have $d_u = |I_u|$.
We now describe our segmentation algorithm. For purposes of exposition, we describe three increasingly sophisticated variants of our algorithm. The first variant assumes that the number of observations is the same across all customers (that is, all customers have the same degree in the preference graph $G$) and that the model class $\mathcal{F}$ is completely specified. The second variant relaxes the equal-degree assumption, and the third variant allows for partial specification of the model class $\mathcal{F}$.
We first focus on the case in which we have the same number of observations for all the customers, so that $d_u = d$ for all customers $u$, resulting in a $d$-regular preference graph $G$. We describe the algorithm assuming the number of segments $K$ is specified, and we discuss below how $K$ may be estimated from the data. The precise description is presented in Algorithm 1. The algorithm takes as inputs observations in the form of the preference graph $G$ and the model family $\mathcal{F}$ and outputs a unidimensional representation of each customer. The algorithm proceeds as follows. It starts with the hypothesis that the population of customers is in fact homogeneous and looks for evidence of heterogeneity to refute the hypothesis. Under this hypothesis, it follows that the observations are i.i.d. samples generated according to a single model in $\mathcal{F}$. Therefore, the algorithm estimates the parameters of a 'pooled' model $f_{\text{pool}}$ by pooling together all the observations and using a standard technique such as the maximum likelihood estimation (MLE) method. As a concrete example, consider the task of predicting whether a segment of customers likes a movie or not, so that $\mathcal{F}$ is chosen to be the binary logit model class in which movie $i$ is liked with probability $p_i$, independent of the other movies. Then, the parameters of the pooled model can be estimated by solving the following MLE problem:
$$\max_{p_1, \ldots, p_m \in [0, 1]} \; \sum_{u=1}^{n} \sum_{i \in I_u} \Big[ \mathbb{1}[x_{ui} = 1] \log p_i + \mathbb{1}[x_{ui} = 0] \log (1 - p_i) \Big],$$
where $x_{ui} = 1$ denotes that customer $u$ liked movie $i$ and $x_{ui} = 0$ that the customer disliked it.
Because the objective function is separable across movies, the optimal solution can be shown to be $\hat{p}_i = \big(\sum_{u} \mathbb{1}[x_{ui} = 1]\big) \big/ \big(\sum_{u} \mathbb{1}[i \in I_u]\big)$, i.e., the fraction of customers who liked movie $i$ among those who provided a label for it.
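This closed-form pooled MLE can be sketched in a few lines for the independent binary-logit example (hypothetical names; labels are 1 for "like" and 0 for "dislike"):

```python
from collections import defaultdict

def pooled_mle(graph):
    """Closed-form pooled MLE for the independent binary-logit example:
    p_i = (# customers who liked item i) / (# customers who rated item i).

    graph: {customer: {item: 0/1 label}} sparse preference graph
    """
    likes, counts = defaultdict(int), defaultdict(int)
    for labels in graph.values():
        for item, label in labels.items():
            counts[item] += 1
            likes[item] += label
    return {item: likes[item] / counts[item] for item in counts}

graph = {"u1": {"m1": 1, "m2": 0}, "u2": {"m1": 1}, "u3": {"m1": 0, "m2": 0}}
p = pooled_mle(graph)   # p["m1"] = 2/3, p["m2"] = 0.0
```

Because the estimate is a simple per-item frequency, a single pass over the observations suffices, which is one reason fitting the pooled model is so much cheaper than fitting a mixture.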
Once the pooled model is estimated, the algorithm assesses whether the hypothesis holds by checking how well the pooled model explains the observed customer labels. Specifically, it quantifies the model fit by computing the (normalized) negative log-likelihood of observing $\boldsymbol{x}_u$ under the pooled model, i.e., $\mathrm{score}(u) = -\tfrac{1}{d} \log f_{\text{pool}}(\boldsymbol{x}_u)$. A large value of $\mathrm{score}(u)$ indicates that the observation $\boldsymbol{x}_u$ is not well explained by the pooled model, or that customer $u$'s preferences are "far away" from those of the population. We term $\mathrm{score}(u)$ the model-based projection score, or simply the projection score, for customer $u$ because it is obtained by "projecting" the categorical observations ($\boldsymbol{x}_u$) into real numbers by means of a model ($f_{\text{pool}}$). Note that the projection scores yield one-dimensional representations of the customers. The entire process is summarized in Algorithm 1.
The projection scores obtained by the algorithm are then clustered into $K$ groups using a standard distance-based clustering technique, such as the $k$-means algorithm, to recover the customer segments. We discuss how to estimate $K$ at the end of this section.
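Putting these pieces together, a minimal sketch of degree-normalized projection scores followed by one-dimensional $k$-means clustering could look like this. The names, the toy data, and the hand-rolled Lloyd's iteration are all illustrative; a production implementation would use a library clustering routine:

```python
import math

def projection_score(labels, p):
    """Degree-normalized negative log-likelihood of a customer's binary
    labels under an independent pooled model p = {item: P(like)}."""
    nll = sum(-math.log(p[i]) if x == 1 else -math.log(1.0 - p[i])
              for i, x in labels.items())
    return nll / len(labels)

def kmeans_1d(scores, centers, iters=50):
    """Plain Lloyd's algorithm on scalar projection scores."""
    centers = list(centers)
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for s in scores:
            j = min(range(len(centers)), key=lambda c: abs(s - centers[c]))
            clusters[j].append(s)
        # Recompute each center as the mean of its cluster; keep the old
        # center if a cluster becomes empty.
        centers = [sum(cl) / len(cl) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers
```

Each customer is assigned to the segment whose center is nearest to the customer's score; since the representation is one-dimensional, this clustering step is extremely cheap even for millions of customers.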
We make the following remarks. First, our model-based projection technique is inspired by the classical statistical technique of analysis of variance (ANOVA), which tests the hypothesis of whether a collection of samples is generated from the same underlying population. To do so, the test starts with the null hypothesis that there is no heterogeneity, fits a single model by pooling together all the data, and then computes the likelihood of the observations under the pooled model. If the likelihood is low (i.e., below a threshold), the test rejects the null hypothesis and concludes that the samples come from different populations. Our algorithm essentially separates customers based on the heterogeneity within these likelihood values.
Second, to understand why our algorithm should be able to separate the segments, consider the following simple case. Suppose a customer from segment $k$ likes any item with probability $q_k$ and dislikes it with probability $1 - q_k$, for some $0 < q_k < 1$. Different segments differ in the value of the parameter $q_k$. Suppose $\alpha_k$ denotes the size of segment $k$, where $\sum_{k=1}^{K} \alpha_k = 1$ and $\alpha_k > 0$ for all $k$. Now, when we pool together a large number of observations from these customers, we should essentially observe that the population as a whole likes an item with probability $\bar{q} = \sum_{k=1}^{K} \alpha_k q_k$; this corresponds to the pooled model. Under the pooled model, we obtain the projection score for customer $u$ as $\mathrm{score}(u) = -\tfrac{1}{d} \sum_{i \in I_u} \log f_{\text{pool}}(x_{ui})$, where each $x_{ui} \in \{\text{like}, \text{dislike}\}$. Now, assuming that $d$ is large and because the $x_{ui}$'s are randomly generated, the projection score should essentially concentrate around the expectation $\mathbb{E}[-\log f_{\text{pool}}(X_k)] = -q_k \log \bar{q} - (1 - q_k) \log(1 - \bar{q})$, where the random variable $X_k$ takes value "like" with probability $q_k$ and "dislike" with probability $1 - q_k$, when customer $u$ belongs to segment $k$. This value is the cross-entropy between the distributions $\mathrm{Bernoulli}(q_k)$ and $\mathrm{Bernoulli}(\bar{q})$. Therefore, if the cross-entropies for the different segments are distinct, our algorithm should be able to separate the segments. (Note that the cross-entropy, unlike the standard KL (Kullback-Leibler) divergence, is not a distance measure between distributions. Consequently, even when $q_k = \bar{q}$ for some segment $k$, the cross-entropy is not zero. Our algorithm relies on the cross-entropies being distinct to recover the underlying segments.) We formalize and generalize these arguments in Section id1.
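The concentration argument above can be checked numerically: simulate customers from two segments with different like-probabilities, score them under the pooled Bernoulli model, and compare the empirical scores with the corresponding cross-entropies. The parameter values below are our own, chosen purely for illustration:

```python
import math, random

def cross_entropy(q, q_bar):
    """H(Bernoulli(q), Bernoulli(q_bar)) = -q log q_bar - (1-q) log(1-q_bar)."""
    return -q * math.log(q_bar) - (1 - q) * math.log(1 - q_bar)

random.seed(0)
q1, q2, alpha = 0.9, 0.2, 0.5          # two segments of equal size
q_bar = alpha * q1 + (1 - alpha) * q2  # pooled like-probability (0.55)
d = 5000                               # observations per customer

def score(q):
    """Empirical projection score of one customer from a segment with
    like-probability q, under the pooled Bernoulli(q_bar) model."""
    s = 0.0
    for _ in range(d):
        like = random.random() < q
        s += -math.log(q_bar) if like else -math.log(1 - q_bar)
    return s / d

# With many observations, each score concentrates near its segment's
# cross-entropy, and the two cross-entropies differ, so the segments
# separate cleanly on a line.
s1, s2 = score(q1), score(q2)
```

Here the two limiting values are roughly 0.62 and 0.76 nats, so even a simple one-dimensional clustering of the scores recovers the two segments.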
Third, our algorithm fits only one model, the 'pooled' model, unlike a classical latent class approach that fits, typically using the Expectation-Maximization (EM) method, a mixture distribution $\sum_{k=1}^{K} \alpha_k f_{\theta_k}$, where all customers in segment $k$ are described by model $f_{\theta_k}$ and $\alpha_k$ represents the size (or proportion) of segment $k$. This affords our algorithm two advantages: (a) speed: it is up to an order of magnitude faster than the standard latent class benchmark (see Section id1), without the issues of initialization and convergence typical of EM methods; and (b) flexibility: it allows for fitting models from complex parametric families $\mathcal{F}$ that more closely explain the customer observations.
Estimating the number of segments. Although our algorithm description assumed that the number of customer segments $K$ is provided as input, the algorithm can actually provide data-driven guidance on how to pick $K$, which is often unknown in practice. While existing methods rely on cross-validation and information-theoretic measures such as AIC, BIC, etc. (see McLachlan and Peel 2004 for details), our algorithm can also rely on the structure of the projection scores to estimate $K$. As argued above, when there are $K$ underlying segments in the population, the projection scores will concentrate around $K$ distinct values, corresponding to the cross-entropies between the segment distributions and the pooled distribution. Therefore, when plotted on a line, these scores will separate into $K$ clusters, with the number of clusters corresponding to the number of segments. In practice, however, the scores may not cleanly separate out. Therefore, we use the following systematic procedure: estimate the density of the customer projection scores using a general-purpose technique such as kernel density estimation (KDE) and associate a segment with each peak (i.e., local maximum) in the density. The estimated density should capture the underlying clustering structure, with each peak in the density corresponding to a value around which many of the projection scores concentrate; as a result, the procedure should be able to recover the underlying number of segments. We use this technique to estimate the number of segments in our real-world case studies (see Sections id1 and id1).
We now focus on the case in which the number of observations may differ across customers. The key issue is that the log-likelihood values depend on the number of observations and, therefore, should be appropriately normalized in order to be meaningfully compared across customers. A natural way is to normalize the log-likelihood value of customer $u$ by the corresponding degree $d_u$, which results in Algorithm 1 applied to the unequal-degree setting. Such degree normalization is appropriate when the observations across items are independent, so that the pooled distribution has the product form $f_{\text{pool}}(\boldsymbol{x}) = \prod_{i} f_{\text{pool}, i}(x_i)$. In this case, the log-likelihood under the pooled model becomes $\log f_{\text{pool}}(\boldsymbol{x}_u) = \sum_{i \in I_u} \log f_{\text{pool}, i}(x_{ui})$, which scales linearly in the number of observations $d_u$.
The degree normalization, however, does not take into account any dependence structure in the item labels. For instance, consider the extreme case in which the observations across all items are perfectly correlated under the pooled model, such that customers either like all items or dislike all items with probability 0.5 each. In this case, the log-likelihood does not depend on the number of observations, but the degree normalization unfairly penalizes customers with few observations. To address this issue, we use entropy-based normalization:
$$\mathrm{score}(u) = \frac{-\log f_{\text{pool}}(\boldsymbol{x}_u)}{H(f_{\text{pool}, I_u})}, \qquad (1)$$
where $H(f_{\text{pool}, I_u})$ denotes the entropy of the distribution $f_{\text{pool}, I_u}$, the pooled distribution restricted to the items in $I_u$. When the observations across items are i.i.d., it can be seen that entropy-based normalization reduces to degree normalization, up to constant factors. The key benefit of the entropy normalization is that when the population is homogeneous (i.e., consists of a single segment), it can be shown that the projection scores of all customers concentrate around $1$ (see the discussion in Section id1). Consequently, deviations of the projection scores from $1$ indicate heterogeneity in the customer population and allow us to identify the different segments.
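For a pooled model with independent Bernoulli item labels, the entropy-normalized score of equation (1) can be sketched as follows (hypothetical names). A customer whose labels match the pooled model in expectation receives a score near 1, while surprising label patterns push the score above 1:

```python
import math

def bernoulli_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) item label."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def entropy_normalized_score(labels, p):
    """Negative log-likelihood of the customer's labels divided by the
    entropy of the pooled distribution restricted to the observed items
    (independent-Bernoulli pooled model, per equation (1))."""
    nll = sum(-math.log(p[i]) if x == 1 else -math.log(1 - p[i])
              for i, x in labels.items())
    ent = sum(bernoulli_entropy(p[i]) for i in labels)
    return nll / ent
```

For example, a "like" on an item with pooled like-probability 0.5 yields a score of exactly 1 (the observation carries exactly as much negative log-likelihood as the entropy predicts), whereas a "dislike" on an item the population likes with probability 0.9 yields a score well above 1.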
Entropy-based normalizations have been commonly used in the literature for normalizing mutual information (Strehl and Ghosh 2002), and our normalization is inspired by that work. In addition to accounting for dependence structures within the pooled distribution, it has the effect of weighting each observation by the strength of the evidence it provides. Specifically, because the log-likelihood value provides only incomplete evidence of how well the pooled distribution captures a customer's preferences when there are missing observations, we assess the confidence in the evidence by dividing the log-likelihood value by the corresponding entropy of the pooled distribution restricted to the customer's observed items. Higher values of entropy imply lower confidence. Therefore, when the entropy is high, the projection score is low, indicating that we do not have sufficient evidence that the customer's observations are not well explained by the pooled distribution. The algorithm with entropy-based normalization is summarized in Algorithm 2.
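To make the two normalizations concrete, the following sketch computes both scores under a pooled model that treats items as independent Bernoulli variables (a simplifying assumption; the pooled probabilities and data here are synthetic). With i.i.d. items, the entropy of the observed coordinates equals the degree times the binary entropy, so the two scores coincide up to that constant factor:

```python
import numpy as np

def binary_entropy(p):
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def projection_scores(labels, observed, p_pool):
    """Degree- and entropy-normalized (negative) pooled log-likelihood
    scores, assuming independent Bernoulli(p_pool[j]) items."""
    ll_items = labels * np.log(p_pool) + (1 - labels) * np.log(1 - p_pool)
    loglik = (ll_items * observed).sum(axis=1)   # log-likelihood over observed items
    degree = observed.sum(axis=1)                # number of observations per customer
    entropy = (binary_entropy(p_pool) * observed).sum(axis=1)
    return -loglik / degree, -loglik / entropy

rng = np.random.default_rng(1)
m, n = 5, 20
p_pool = np.full(n, 0.3)                         # i.i.d. items
labels = rng.binomial(1, p_pool, size=(m, n))
observed = rng.binomial(1, 0.5, size=(m, n))     # 1 = customer labeled the item
observed[:, 0] = 1                               # ensure every customer has >= 1 observation
deg_score, ent_score = projection_scores(labels, observed, p_pool)
# With i.i.d. items, entropy = degree * H_b(0.3), so the two scores
# differ only by the constant factor H_b(0.3).
```

In the perfectly correlated extreme discussed above, the entropy term would instead be constant in the degree, which is exactly why entropy normalization avoids penalizing customers with few observations.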
We note that the entropy may be difficult to compute because, in general, it requires summing over an exponentially large space. For such cases, either the entropy may be computed approximately using existing techniques in probabilistic graphical models (Hinton 2002, Salakhutdinov et al. 2007) or the degree normalization of Algorithm 1 may be used as an approximation.
The discussion so far assumed that the model family is fully specified, i.e., for a given segment model, it completely specifies how an observation vector is generated. In practice, such a complete specification may be especially difficult to provide when the item universe is large (for instance, millions of items in our eBay case study) and there are complex cross-effects among items, such as the correlation between the rating and purchase signals for the same item or complementarity effects among products from related categories (clothing, shoes, accessories, etc.). To address this issue, we extend our algorithm to the case when the model is only partially specified.
We assume that the universe of items is partitioned into C disjoint "categories," with each category containing a known subset of the items. A model describing the observations within each category is specified, but any interactions across categories are left unspecified. We let a separate model class be given for each category, so that each segment is characterized by one model per category. Further, each customer has a vector of observations for the items in each category; if a customer has no observations in a category, we treat the corresponding vector as missing.
Under this setup, we run our segmentation algorithm (Algorithm 1 or 2) separately for each category of items. This results in a C-dimensional representation for each customer, whose c-th coordinate is the representation computed by our algorithm for that customer and category c. When a customer has no observations in a category, we mark the corresponding coordinate as missing. We collect these vectors compactly into a matrix whose rows correspond to customers and whose columns correspond to categories.
When this matrix is complete, the algorithm stops and outputs it. When it is incomplete, we obtain low-dimensional representations for the customers using low-rank matrix decomposition (or factorization) techniques, similar to those adopted in collaborative filtering applications (Rennie and Srebro 2005, Mnih and Salakhutdinov 2007, Koren et al. 2009). These techniques assume that the matrix with missing entries can be factorized into a product of two low-rank matrices, one specifying the customer representations and the other the category representations, in a low-dimensional space. The low-rank structure naturally arises from assuming that only a small number of (latent) factors influence the cross-effects across the categories. With this assumption, we compute a low-dimensional representation for each customer and for each item category by solving the following optimization problem:
\[
\min_{U,\,V}\; \sum_{(i,c)\,:\,A_{ic}\ \mathrm{observed}} \left(A_{ic} - u_i^{\top} v_c\right)^2 \;+\; \lambda\left(\lVert U\rVert_F^2 + \lVert V\rVert_F^2\right) \qquad (2)
\]
where A denotes the (incomplete) matrix of category-level representations, the rows u_i of U are the customer representations, the rows v_c of V are the category representations, and λ ≥ 0 is a regularization parameter.
Note that the rank of the decomposition is chosen to be small relative to the number of customers and the number of categories. When the number of customers or categories is large, computing the low-rank decomposition may be difficult, but there has been substantial recent work on scalable techniques for such matrices (as in collaborative filtering applications like the Netflix contest); see Takács et al. (2009), Mazumder et al. (2010), and the references therein. The precise algorithm is summarized in Algorithm 3.
We conclude this section with a few remarks on how to obtain the segments from the customer representations computed by our algorithms. Let d denote the dimension of the representation of each customer. It follows from the descriptions above that d = 1 when the model class is fully specified and d equals the number of categories when the model class is only partially specified. When d is small, the k-means algorithm or spectral clustering (Von Luxburg 2007) with Euclidean distances may be used to cluster the customer representations. In Section id1, we show that these techniques successfully recover the underlying segments under standard assumptions. When d is large, the curse of dimensionality kicks in and traditional distance-based clustering algorithms may not be able to meaningfully differentiate between similar and dissimilar projection score vectors (see Aggarwal et al. 2001 and Steinbach et al. 2004). For this setting, we recommend the spectral projection technique, which projects the matrix of representations onto the subspace spanned by its top principal components and then clusters the projections in the lower-dimensional space. This technique was proposed in the theoretical computer science literature (Vempala and Wang 2004, Achlioptas and McSherry 2005, Kannan et al. 2005) to cluster observations generated from a mixture of multivariate Gaussians. The projections can be shown to preserve the separation between the mean vectors while ensuring tighter concentration of the original samples around their respective means, resulting in more accurate segmentation.
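A sketch of the spectral projection technique on synthetic Gaussian-mixture data: project the rows onto the top principal components via an SVD, then run k-means in the projected space (SciPy's kmeans2 stands in for any k-means implementation):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_project_cluster(X, k, seed=0):
    """Project rows of X onto the top-k principal components, then k-means."""
    Xc = X - X.mean(axis=0)
    # Top-k right singular vectors span the principal subspace.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[:k].T
    _, cluster_labels = kmeans2(proj, k, minit='++', seed=seed)
    return cluster_labels

rng = np.random.default_rng(0)
d, m = 50, 200                                   # dimension, customers per segment
mu1, mu2 = np.zeros(d), np.zeros(d)
mu2[:5] = 4.0                                    # separation along a few coordinates
X = np.vstack([rng.normal(mu1, 1.0, (m, d)),
               rng.normal(mu2, 1.0, (m, d))])
labels = spectral_project_cluster(X, k=2)
```

Clustering in the 2-dimensional projected space avoids the distance-concentration problem that k-means faces in the original 50-dimensional space.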
Our segmentation algorithm is analytically tractable, and in this section we derive theoretical conditions on how "separated" the underlying segments must be to guarantee asymptotic recovery by our algorithm. Our results are similar in spirit to existing theoretical results for clustering observations from mixture models, such as mixtures of multivariate Gaussians (Kannan et al. 2005).
For the purposes of the theoretical analysis, we focus on the following standard setting: there is a population of customers comprising K distinct segments, with a proportion q_k of the population belonging to segment k, for each k = 1, …, K. Each segment is described by a distribution over binary label vectors; this corresponds to the scenario in which each item receives a label from {0, 1} (see the notation in Section id1). We frequently refer to the label 1 as "like" and the label 0 as "dislike" in the remainder of the section. Each customer has a latent segment: a customer in segment k samples a label vector according to the segment-k distribution and assigns the corresponding label to each item. We focus on asymptotic recovery of the true segment labels as the number of items n → ∞.
The performance of our algorithm depends on the separation among the parameters describing the segment distributions, as well as on the number of data points available per customer. Therefore, we assume that the segment distributions are "well separated" (the precise technical conditions are described below) and that the number of data points per customer goes to infinity as n → ∞. The proofs of all statements are in the Appendix.
We first consider the case where each segment distribution belongs to a fully specified model family in which customer labels across items are independent. More precisely, we have the following model:
Definition 1 (Latent Class Independent Model (LcInd))
Each segment k is described by a distribution under which the item labels are independent and identically distributed. Denote by p_k the probability that a customer from segment k likes an item (i.e., assigns it the label 1); p_k is the same for all items. A customer in segment k samples a label vector according to this distribution and provides the corresponding label for each item.
We assume that the segment parameters are bounded away from 0 and 1, i.e., there exists a constant γ > 0 such that γ ≤ p_k ≤ 1 − γ for all segments k. Further, let H(p, q) = −p log q − (1 − p) log(1 − q) denote the cross-entropy between the Bernoulli(p) and Bernoulli(q) distributions, and let H_b(p) = H(p, p) denote the binary entropy function. Let s_i denote the (unidimensional) projection score computed by Algorithm 2 for customer i; note that it is a random variable under the above generative model.
Given the above, we derive necessary and sufficient conditions to guarantee (asymptotic) recovery of the true customer segments under the LCIND model.
We first prove an important lemma concerning the concentration of the customer projection scores computed by our algorithm.
Lemma 1 (Concentration of customer projection scores under LcInd model)
Given a customer population and a collection of items, suppose that the preference graph is d-regular, i.e., every customer provides labels for exactly d items. Define the pooled parameter p̄ = Σ_k q_k p_k. Then, given any ε > 0, the projection scores computed by Algorithm 2 satisfy, for every customer i in segment k, |s_i − H(p_k, p̄)/H_b(p̄)| ≤ ε with probability approaching 1 as d → ∞.
In other words, the projection scores of customers in segment k concentrate around the ratio H(p_k, p̄)/H_b(p̄), with high probability as the number of observations from each customer grows.
Lemma 1 reveals the conditions our algorithm requires to recover the true customer segments. To understand the result, first suppose that p̄ ≠ 1/2. Then the lemma states that the model-based projection scores of customers in segment k concentrate around H(p_k, p̄)/H_b(p̄), which is proportional to the cross-entropy H(p_k, p̄). Consequently, we require that H(p_k, p̄) ≠ H(p_ℓ, p̄) whenever p_k ≠ p_ℓ to ensure that the projection scores of customers in different segments concentrate around distinct values. The result also states that the projection scores of customers with similar preferences (i.e., belonging to the same segment) concentrate around the same quantity, whereas the scores of customers with dissimilar preferences (i.e., belonging to different segments) are distinct from each other. For this reason, although it is not a priori clear, our segmentation algorithm is consistent with the classical notion of distance- or similarity-based clustering, which attempts to maximize intra-cluster similarity and inter-cluster dissimilarity. When p̄ = 1/2, it follows that H(p, 1/2) = log 2 for any p, and therefore all the customer projection scores concentrate around the same value. In this scenario, our algorithm cannot separate the customers even when the parameters of different segments are distinct. Note that p̄ = Σ_k q_k p_k is the probability that a random customer from the population likes an item. Therefore, when p̄ = 1/2, the population is indifferent in its preferences for the items, resulting in all the customers being equidistant from the pooled customer.
The above discussion leads to the following theorem.
Theorem 1 (Necessary conditions for recovery of true segments under LcInd model)
Under the setup of Lemma 1, the following conditions are necessary for recovery of the true customer segments:
1. All segment parameters are distinct, i.e., p_k ≠ p_ℓ whenever k ≠ ℓ.
2. The pooled parameter p̄ = Σ_k q_k p_k satisfies p̄ ≠ 1/2.
It is easy to see that the first condition is necessary for any segmentation algorithm. We argue that the second condition, p̄ ≠ 1/2, is also necessary for the standard latent class segmentation technique. Specifically, consider two segments of equal proportions q_1 = q_2 = 1/2 with p_1 = 1 and p_2 = 0, so that p̄ = 1/2. Let us consider only a single item, i.e., n = 1. Then, under this parameter setting, all customers in segment 1 will give the label 1 and all customers in segment 2 will give the label 0. Recall that the latent class method estimates the model parameters by maximizing the log-likelihood of the observed labels, which in this case takes the form
\[
\log L(\hat{p}_1, \hat{p}_2) \;=\; \sum_{i\,:\,x_i = 1} \log\!\left(\frac{\hat{p}_1 + \hat{p}_2}{2}\right) \;+\; \sum_{i\,:\,x_i = 0} \log\!\left(\frac{(1-\hat{p}_1) + (1-\hat{p}_2)}{2}\right).
\]
It can then be seen that the degenerate solution p̂_1 = p̂_2 = 1/2 achieves the optimal value of the above log-likelihood function and is therefore a possible outcome of the latent class method, even though it merges the two segments. This shows that the condition p̄ ≠ 1/2 is also necessary for the standard latent class segmentation technique.
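This degeneracy can be checked numerically. Assuming two equal-proportion segments with p_1 = 1, p_2 = 0 and a single item, the mixture log-likelihood of the observed labels assigns the same value to the true parameters and to the homogeneous solution p̂_1 = p̂_2 = 1/2:

```python
import numpy as np

def mixture_loglik(p1, p2, labels, q=0.5):
    """Log-likelihood of single-item labels under a two-segment
    Bernoulli mixture with mixing weights (q, 1 - q)."""
    like1 = np.where(labels == 1, p1, 1 - p1)
    like2 = np.where(labels == 1, p2, 1 - p2)
    return np.log(q * like1 + (1 - q) * like2).sum()

labels = np.array([1] * 50 + [0] * 50)   # segment 1 likes, segment 2 dislikes
ll_true = mixture_loglik(1.0, 0.0, labels)
ll_degenerate = mixture_loglik(0.5, 0.5, labels)
# Both evaluate to 100 * log(1/2): the likelihood cannot distinguish
# the true segmentation from the single-segment solution.
```

Since the two log-likelihoods are identical, maximum-likelihood estimation has no basis for preferring the true two-segment solution.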
We note that our results readily extend, at the expense of additional notation, to the case when the preference graph is not regular.
Having established the necessary conditions, we now discuss the asymptotic misclassification rate, defined as the expected fraction of customers incorrectly classified, of our algorithm. In order to analyze the misclassification rate, we consider the following nearest-neighbor (NN) classifier, under which customer i is classified as
\[
\hat{g}(i) \;=\; \arg\min_{k} \, \left| s_i - \rho_k \right|,
\]
where ρ_k = H(p_k, p̄)/H_b(p̄) is the value around which the projection scores of segment-k customers concentrate. Note that since p̄ ≠ 1/2 and the segment parameters are distinct, ρ_k ≠ ρ_ℓ for all k ≠ ℓ.
Given the necessary conditions established above, and to ensure that we can uniquely identify the different segments, we assume that the segments are indexed such that ρ_1 < ρ_2 < ⋯ < ρ_K. Then, we can prove the following recoverability result:
Theorem 2 (Asymptotic recovery of true segments under LcInd model)
Under the setup of Lemma 1, suppose and . Further, denote . Given any , suppose that
Then, it follows that
Further, when and , we have:
Theorem 2 provides an upper bound on the misclassification rate of our segmentation algorithm in recovering the true customer segments. The first observation is that as the number of labels from each customer grows, the misclassification rate of the NN classifier goes to zero. The result also allows us to determine the number of samples needed per customer to guarantee a given error rate. In particular, the required number of samples depends on three quantities:

1. The minimum separation Δ = min_{k ≠ ℓ} |p_k − p_ℓ| between the segment parameters. This is intuitive: the "closer" the segments are to each other (i.e., the smaller the value of Δ), the more samples are required per customer to successfully identify the true segments.

2. The deviation of the pooled parameter p̄ from 1/2; recall that p̄ is the probability that a random customer from the population likes an item. If p̄ is close to 1/2, this deviation is close to zero, so a large number of samples per customer is required. As p̄ deviates from 1/2, the deviation increases, so fewer samples are sufficient. This also makes sense: when p̄ = 1/2, our algorithm cannot identify the underlying segments, and the farther p̄ is from 1/2, the easier it is to recover the true segments.

3. The bound γ on the segment parameters: as γ → 0, the number of samples required diverges. Note that γ (resp. 1 − γ) specifies a lower (resp. upper) bound on the segment parameters p_k. A small value of γ indicates that there exist segments with values of p_k close to either 0 or 1, and since the number of samples required to reliably estimate a like-probability grows as it approaches 0 or 1, the required number of samples must diverge as γ → 0.
Our result shows that as long as each customer provides sufficiently many labels, the misclassification rate goes to zero, i.e., we can accurately recover the true segments with high probability, as the number of items n → ∞. Although the number of labels required from each customer must go to infinity, it needs to grow only logarithmically in the number of items n. Further, this holds for any "large enough" population size.
Note that the NN classifier above assumes access to the "true" normalized cross-entropies ρ_k. In practice, we use "empirical" NN classifiers, which replace the ρ_k by the corresponding cluster centers of the projection scores. Lemma 1 guarantees the correctness of this approach under appropriate assumptions, because the projection scores of segment-k customers concentrate around ρ_k.
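The following simulation illustrates the concentration result and the NN classifier under the LCIND model. The parameter values are illustrative; we take the score to be the negative pooled log-likelihood normalized by the pooled binary entropy (our reading of the entropy normalization), and we use the true pooled parameter rather than an estimate for simplicity:

```python
import numpy as np

def binary_entropy(p):
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

rng = np.random.default_rng(0)
m, d = 400, 200                 # customers, labels per customer
p = np.array([0.2, 0.6])        # segment parameters p_k
q = np.array([0.5, 0.5])        # segment proportions q_k
seg = rng.choice(2, size=m, p=q)                       # latent segments
labels = rng.binomial(1, p[seg][:, None], size=(m, d)) # binary like/dislike labels

p_bar = q @ p                   # pooled like-probability (0.4 here, != 1/2)
# Entropy-normalized projection score: negative pooled log-likelihood
# per label, divided by the pooled binary entropy.
loglik = labels * np.log(p_bar) + (1 - labels) * np.log(1 - p_bar)
scores = -loglik.mean(axis=1) / binary_entropy(p_bar)

# NN classification against the limiting values rho_k = H(p_k, p_bar) / H_b(p_bar).
rho = -(p * np.log(p_bar) + (1 - p) * np.log(1 - p_bar)) / binary_entropy(p_bar)
pred = np.abs(scores[:, None] - rho[None, :]).argmin(axis=1)
accuracy = (pred == seg).mean()
```

With d = 200 labels per customer, the scores of the two segments concentrate around clearly separated values, so the NN rule recovers nearly all segment memberships.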
We can extend the results derived above to the case when the segment distributions belong to a partially specified model family, as discussed in Section id1. Specifically, suppose that the item set is partitioned into C disjoint categories. The preferences of customers vary across the different categories; specifically, we consider the following generative model:
Definition 2 (Latent Class Independent Category Model (LcIndCat))
Each segment k is described by a distribution under which labels for items within a single category are independent and identically distributed, but labels for items in different categories can have arbitrary correlations. Let p_{k,c} denote the probability that a customer from segment k likes an item in category c; this probability is the same for every item within the category. A customer in segment k samples a label vector according to this distribution and provides the corresponding label for each item.
The above model is general and can be used to account for correlated item preferences (as opposed to independent preferences considered in section id1). As a specific example, suppose that for each item, we have two customer observations available: whether the item was purchased or not, and a like/dislike rating (note that one of these can be missing). Clearly these two observations are correlated and we can capture this scenario in the LCINDCAT model as follows: there are two item “categories”—one representing the purchases and the other representing the ratings. In other words, we create two copies of each item and place one copy in each category. Then, we can specify a joint model over the item copies such that purchase decisions for different items are independent, like/dislike ratings for different items are also independent but the purchase decision and like/dislike rating for the same item are dependent on each other. Similar transformations can be performed if we have more observations per item or preferences are correlated for a group of items. Therefore, the above generative model is fairly broad and captures a wide variety of customer preference structures.
As for the LCIND model, we assume that the underlying segment parameters are bounded away from 0 and 1, i.e., there exists a constant γ > 0 such that γ ≤ p_{k,c} ≤ 1 − γ for all segments k and all item categories c. Let d_{i,c} be the number of observations for customer i in category c, and let s_i denote the projection score vector computed by Algorithm 3 for customer i; note that it is a C-dimensional random vector under the generative model above.
We first state an analogous concentration result for the customer projection score vectors computed by our algorithm.
Lemma 2 (Concentration of projection score vectors under LcIndCat model)
For a population of customers and items partitioned into C categories, suppose that the preference graph is such that each customer labels exactly d_c items in category c. Define, for each item category c, the pooled parameter p̄_c = Σ_k q_k p_{k,c}. Then, given any ε > 0, the projection score vector of every customer in segment k concentrates around a segment-specific vector ρ_k, with probability approaching 1 as the numbers of observations d_c grow.
In the lemma above, ρ_k is the C-dimensional vector whose c-th coordinate is H(p_{k,c}, p̄_c)/H_b(p̄_c) (recall the notation from Section id1), and distances are measured in the Euclidean norm. Lemma 2 implies the following necessary conditions:
Theorem 3 (Necessary conditions for recovery of true segments under LcIndCat model)
Under the setup of Lemma 2, the following conditions are necessary for recovery of the true customer segments:

1. p̄_c ≠ 1/2 for some category c.
2. Let C* = {c : p̄_c ≠ 1/2} and denote by ρ_k* the subvector of ρ_k (the vector of limiting projection scores of segment k) consisting of the components corresponding to the item categories in C*. Then ρ_k* ≠ ρ_ℓ* whenever k ≠ ℓ.
Similar to the LCIND case, p̄_c = 1/2 for all item categories c implies that the population is indifferent over the items in all the categories; we require the population to have well-defined preferences in at least one category in order to be able to separate the segments. Further, since H(p, 1/2) = log 2 for every p, we need p_{k,c} ≠ p_{ℓ,c} for at least one item category c with p̄_c ≠ 1/2 to ensure that the vectors ρ_k* and ρ_ℓ* are distinct, for any two segments k ≠ ℓ.
As for the case of the LCIND model, we consider another NN classifier to evaluate the asymptotic misclassification rate of our segmentation algorithm, under which customer i is classified as
\[
\hat{g}(i) \;=\; \arg\min_{k} \, \lVert s_i - \rho_k \rVert .
\]
Given the above necessary conditions, we can prove the following recoverability result:
Theorem 4 (Asymptotic recovery of true segments under LcIndCat model)
Suppose that the conditions in Theorem 3 are satisfied. Denote with and where represents elementwise product. Under the setup of Lemma 2 and given any , suppose that
Then, it follows that
Further, when and , for fixed we have:
We make a few remarks about Theorem 4. First, as the number of labels in each item category grows, the misclassification rate of the NN classifier goes to zero. Second, to achieve a misclassification rate of at most ε, the number of samples scales with the following quantities:

1. The minimum weighted norm of the difference between the parameter vectors of any two segments. This is similar to a standard "separation condition": the underlying segment vectors should be sufficiently distinct from each other, as measured by the norm. However, instead of the standard Euclidean norm, we require a weighted form of the norm, in which the weight of component c shrinks as p̄_c approaches 1/2. If p̄_c = 1/2, the corresponding weight is zero, so the separation in dimension c is weighed lower than in categories where p̄_c is sufficiently distinct from 1/2. This follows from the necessary conditions in Theorem 3 and is a consequence of the simplicity of our algorithm, which relies on measuring deviations of customers from the population preference.

2. The number of item categories C. This is expected: as the number of categories increases, more samples are required to achieve concentration in all C dimensions of the projection score vector.

3. The bound γ on the segment parameters. The dependence on γ is similar to that in the LCIND model case, but with an extra factor of γ in the denominator, indicating a stronger dependence on γ.
Finally, it follows that a logarithmic number of labels in each category is sufficient to guarantee recovery of the true segments with high probability as the total number of items grows, provided the population size is "large enough."
In this section, we use synthetic data to analyze the misclassification rate of our segmentation algorithm as a function of the number of labels available per customer. We compare our approach to the standard latent class (LC) method, which uses the EM algorithm to estimate posterior segment membership probabilities of the different customers (we discuss the method in more detail below). The results from the study demonstrate that our approach (1) outperforms the LC benchmark in recovering the true customer segments and is more robust to high levels of sparsity in the customer labels, and (2) is fast, with a substantial gain in computation time compared to the LC method.
Setup. We fixed the numbers of customers and items and considered the following standard latent class generative model. There are K customer segments, with q_k denoting the proportion of customers in segment k; we have q_k > 0 for all k and Σ_k q_k = 1. The customer-item preference graph follows the standard Erdős–Rényi (Gilbert) model: each edge between a customer and an item is added independently with a fixed probability, which controls the sparsity of the graph (the lower the edge probability, the sparser the graph). All customers in segment k generate binary labels as follows: given the parameter p_k, they provide the label 1 to an item with probability p_k and the label 0 with probability 1 − p_k.
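This generative setup can be sketched as follows, with illustrative (assumed) sizes and parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K = 300, 1000, 3                   # customers, items, segments (illustrative)
q = np.array([0.3, 0.3, 0.4])            # segment proportions q_k
p = np.array([0.2, 0.5, 0.8])            # segment like-probabilities p_k
edge_prob = 0.05                         # lower edge_prob -> sparser preference graph

seg = rng.choice(K, size=m, p=q)                        # latent segment of each customer
graph = rng.binomial(1, edge_prob, size=(m, n))         # Erdos-Renyi preference graph
labels = rng.binomial(1, p[seg][:, None], size=(m, n))  # binary labels from segment model
labels = np.where(graph == 1, labels, -1)               # -1 marks unobserved entries
avg_degree = graph.sum(axis=1).mean()                   # expected value: n * edge_prob
```

Varying edge_prob then traces out the misclassification rate as a function of the number of labels available per customer.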
We denote each ground-truth model type by the tuple: