Distributional Clustering: A distribution-preserving clustering method
Abstract
One key use of k-means clustering is to identify cluster prototypes which can serve as representative points for a dataset. However, a drawback of using k-means cluster centers as representative points is that such points distort the distribution of the underlying data. This can be highly disadvantageous in problems where the representative points are subsequently used to gain insights on the data distribution, as these points do not mimic the distribution of the data. To this end, we propose a new clustering method called “distributional clustering”, which ensures cluster centers capture the distribution of the underlying data. We first prove the asymptotic convergence of the proposed cluster centers to the data-generating distribution, then present an efficient algorithm for computing these cluster centers in practice. Finally, we demonstrate the effectiveness of distributional clustering on synthetic and real datasets.
Keywords: Cluster prototypes · Data reduction · K-means clustering · Representative points
1 Introduction
Clustering is widely used in a variety of statistical and machine learning applications. These applications can be roughly split into two objectives (Tan et al. 2016): (i) the identification of homogeneous groups within a dataset, and (ii) the summarization of a dataset into “representative points” or “cluster prototypes”. Objective (i) spans a broad range of real-world problems, including the discovery of gene groups which have similar biological functions, the identification of communities in social networks, and so on. Similarly, objective (ii) covers many important problems in the ever-present world of big data, including data summarization (the reduction of a dataset for expensive downstream computations) and data compression (the representation of a dataset by its prototypes). We focus on the latter objective in this work.
K-means clustering (Gan et al. 2007) is one of the most widely used clustering methods. Let $\mathcal{D} = \{\mathbf{x}_i\}_{i=1}^{N} \subseteq \mathbb{R}^p$ be a $p$-dimensional dataset of $N$ points, and suppose $K$ cluster prototypes are desired from this data. K-means clustering chooses these prototypes as the following optimal cluster centers:
$$\mathbf{C}_{km} = \underset{\mathbf{C} = \{\mathbf{c}_1, \ldots, \mathbf{c}_K\}}{\arg\min} \; \sum_{i=1}^{N} \left\| \mathbf{x}_i - Q(\mathbf{x}_i; \mathbf{C}) \right\|_2^2, \quad (1.1)$$
where $\mathbf{c}_1, \ldots, \mathbf{c}_K$ are the cluster centers to be optimized, and $Q(\mathbf{x}; \mathbf{C})$ returns the closest point in $\mathbf{C}$ to $\mathbf{x}$. In other words, these optimal centers minimize the sum of squared distances between each data point and its closest cluster center. The clustering criterion (1.1) is known as the within-cluster sum-of-squares criterion (Pollard 1981). This k-means approach for generating representative points has been widely used (with minor variations) in statistics (e.g., Dalenius (1950), the principal points in Flury (1990), the mse-rep-points in Fang & Wang (1994)), signal processing (Lloyd 1982), and stochastic programming (Heitsch & Römisch 2003).
Although k-means clustering is an intuitive and easy-to-implement clustering method, one key weakness is that the cluster prototypes in (1.1) distort the distribution of the underlying data. To demonstrate this, we perform k-means clustering on an $N = 100{,}000$-point dataset, simulated from the univariate standard normal distribution, to obtain $K = 100$ representative points. Figure 1 shows the density of the data-generating distribution (call this $f$), along with the kernel density estimate of the k-means cluster centers. We see that the latter density is much more heavy-tailed than the former, which suggests that the k-means cluster prototypes distort the true data distribution $f$. In fact, this distortion can be quantified theoretically: one can show (see Zador (1982), Su (2000) and Theorem 7.5 of Graf & Luschgy (2007)) that the density of k-means cluster centers converges to a density proportional to $f^{p/(p+2)}$, rather than to $f$, as the number of prototype points $K \to \infty$. In other words, even with a large number of points, the distribution of k-means centers will not converge to the underlying data distribution!
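This distortion is easy to reproduce. The sketch below (plain NumPy; the sample sizes are our own, smaller than those in the text, and the Lloyd implementation is a minimal stand-in) runs k-means on 1-d standard normal draws and measures the spread of the resulting centers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10000)   # 1-d standard normal data
K = 50

# Plain Lloyd's algorithm for 1-d k-means.
centers = rng.choice(x, size=K, replace=False)
for _ in range(60):
    # (a) assign each point to its nearest center
    labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
    # (b) update each center as the mean of its cluster
    for j in range(K):
        pts = x[labels == j]
        if pts.size > 0:
            centers[j] = pts.mean()

# The centers are visibly more spread out than the unit-variance data.
spread = float(np.std(centers))
```

Since the optimal 1-d quantizer density is proportional to $f^{1/3}$, which for the standard normal normalizes to a $N(0, 3)$ density, the centers' standard deviation settles well above the data's unit standard deviation.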
This distribution distortion can be detrimental in many applications of cluster prototyping. For example, consider the problem of data summarization, where a large dataset is reduced to a smaller representative dataset, to be used for time-consuming downstream computations. Such computations may involve fitting complex models to data, or conducting expensive experiments at each data point (e.g., in uncertainty quantification (Smith 2013)). Here, a distribution distortion can result in highly biased insights from downstream computations, in that the results using the summarized data may deviate greatly from results using the full data. In this sense, the representative points from k-means clustering may yield undesired and incorrect results when used for data summarization.
To address this, we propose a new clustering method called “distributional clustering”, whose cluster centers preserve the underlying data distribution. The key idea is to generalize the within-cluster sum-of-squares criterion in (1.1) to a new within-cluster sum-of-powers criterion, which minimizes the sum of distances raised to a positive power $q$ within each cluster. We first show that, in the limit as the power $q \to 0$, the optimal cluster centers from this new clustering criterion indeed converge to the true data-generating distribution as the number of clusters tends to infinity; this addresses the distortion problem of k-means. We then propose a modification of Lloyd’s algorithm (a widely used algorithm for solving the k-means criterion (1.1)) to compute these distributional clustering centers in practice. Finally, we demonstrate the effectiveness of distributional clustering in several numerical examples.
2 Distributional clustering
To address the distribution distortion of k-means cluster centers, we begin by modifying the within-cluster sum-of-squares criterion in (1.1), raising the Euclidean distance between each observation and its corresponding cluster center to the $q$-th power (where $q > 0$). This new clustering criterion, which we call the within-cluster sum-of-powers criterion, can be written as follows:
$$\mathbf{C}_q = \underset{\mathbf{C} = \{\mathbf{c}_1, \ldots, \mathbf{c}_K\}}{\arg\min} \; \sum_{i=1}^{N} \left\| \mathbf{x}_i - Q(\mathbf{x}_i; \mathbf{C}) \right\|_2^q. \quad (2.1)$$
Note that, for $q = 2$, the criterion in (2.1) reduces to the k-means clustering criterion in (1.1).
To aid the derivation of our method, we will assume for now that $N \to \infty$, i.e., an infinite amount of data is available on the population. Letting $f$ be the data-generating density function, the above clustering criterion can be generalized as:
$$\mathbf{C}_q^* = \underset{\mathbf{C}}{\arg\min} \; D_q(\mathbf{C}), \quad (2.2)$$
where:
$$D_q(\mathbf{C}) = \left( \int_{\mathbb{R}^p} \left\| \mathbf{x} - Q(\mathbf{x}; \mathbf{C}) \right\|_2^q \, f(\mathbf{x}) \, d\mathbf{x} \right)^{1/q}. \quad (2.3)$$
Here, ‘*’ is used to denote the setting of infinite data ($N \to \infty$). Note that the power $1/q$ is added in (2.3); this does not affect the optimization of the cluster centers (since the transformation is monotonic), but will aid in subsequent derivations.
The criterion in (2.2) has been studied in the signal processing literature, particularly for vector quantization. Graf & Luschgy (2007) showed that the cluster centers $\mathbf{C}_q^*$ in (2.2) converge asymptotically to a distribution with density proportional to $f^{p/(p+q)}$ as the number of cluster centers $K \to \infty$ (from here on, asymptotic statements refer to the number of cluster centers $K \to \infty$). This suggests that as $q \to 0$, the density of $\mathbf{C}_q^*$ becomes closer to the data-generating density $f$. This motivates the following distributional clustering criterion:
$$\mathbf{C}^* = \underset{\mathbf{C}}{\arg\min} \; D_0(\mathbf{C}), \quad (2.4)$$
where:
$$D_0(\mathbf{C}) = \lim_{q \to 0^+} D_q(\mathbf{C}). \quad (2.5)$$
The following theorem gives a closed-form expression for the limiting objective function $D_0(\mathbf{C})$ in (2.5).
Theorem 1 (Within-cluster sum-of-limiting-power).
For any set of cluster centers $\mathbf{C} = \{\mathbf{c}_1, \ldots, \mathbf{c}_K\}$, $K \in \mathbb{N}$, we have:
$$D_0(\mathbf{C}) = \exp\left\{ \int_{\mathbb{R}^p} \log \left\| \mathbf{x} - Q(\mathbf{x}; \mathbf{C}) \right\|_2 \, f(\mathbf{x}) \, d\mathbf{x} \right\}. \quad (2.6)$$
The proof is given in Appendix A.
The key advantage of Theorem 1 is that, as we show later in Section 4, it provides a closed-form expression for the distributional clustering criterion (2.4) in the finite-data setting of $N < \infty$. We will call the log term $\log \|\mathbf{x} - \mathbf{c}\|_2$ in (2.6) the log-potential of point $\mathbf{x}$ with respect to center $\mathbf{c}$; this is motivated by a similar log-potential distance used in the experimental design literature (Dette & Pepelyshev 2010).
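The limit in Theorem 1 can be checked numerically. The sketch below (our own toy setup: a Monte Carlo sample standing in for $f$, an arbitrary center $\mathbf{c}$, and the power written as $q$) compares the power-$q$ criterion with the $1/q$ root against the sample mean log-potential:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((5000, 2))   # draws standing in for the density f
c = np.array([0.3, -0.2])            # an arbitrary fixed center
d = np.linalg.norm(x - c, axis=1)    # distances ||x_i - c||

def log_Dq(q):
    # log of the power-q criterion after the 1/q root: (1/q) log E ||x-c||^q
    return float(np.log(np.mean(d ** q)) / q)

# Theorem 1 says this tends, as q -> 0+, to the mean log-potential E log||x-c||.
log_potential = float(np.mean(np.log(d)))
gap = abs(log_Dq(1e-4) - log_potential)
```

For $q = 10^{-4}$ the two quantities agree to several decimal places, while larger powers give strictly larger values (by Jensen's inequality, $\log \mathrm{E}\,d \geq \mathrm{E} \log d$).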
3 Distributional convergence
With the distributional clustering criterion (2.4) in hand, we now show that the optimal cluster centers $\mathbf{C}^*$ indeed fix the distribution distortion problem of k-means, i.e., the empirical distribution of these centers converges asymptotically to the underlying data distribution. Let $F$ be the data-generating distribution function (assumed to be continuous), with density function $f$. Let $F_{q,K}$ and $F_K^*$ denote the empirical distribution functions (e.d.f.s) of the optimal cluster centers in (2.2) and (2.4), respectively. Also, let $F_q$ denote the distribution function with density proportional to $f^{p/(p+q)}$. The following theorem shows that, under mild regularity conditions on $F$, the distribution of the proposed cluster centers, $F_K^*$, converges asymptotically to the desired distribution $F$.
Theorem 2 (Distributional convergence of distributional clustering).
Suppose the data distribution $F$ satisfies the mild assumptions:

(A1) (Moment): $\int_{\mathbb{R}^p} \|\mathbf{x}\|_2^{2+\delta} \, f(\mathbf{x}) \, d\mathbf{x} < \infty$ for some $\delta > 0$.

(A2) (Uniform convergence): $F_{q,K} \to F_q$ as $K \to \infty$, uniformly over $q \in (0, \bar{q}]$ for some $\bar{q} > 0$.
Then $F_K^*$, the e.d.f. of the distributional clustering centers, satisfies:
$$\lim_{K \to \infty} F_K^*(\mathbf{x}) = F(\mathbf{x}) \quad \text{for all } \mathbf{x} \in \mathbb{R}^p. \quad (3.1)$$
Assumption (A1) is a standard moment assumption on the data distribution $F$. Assumption (A2) concerns the uniform convergence of $F_{q,K}$ to $F_q$, and holds for densities which are not too heavy-tailed, for example, the exponential and power distributions (Fort & Pagès 2002). Note that pointwise convergence of $F_{q,K}$ to $F_q$ is already proven in Theorem 7.5 of Graf & Luschgy (2007). The proof of this theorem is given in Appendix B.
Theorem 2 shows that, by minimizing the new clustering criterion in (2.4), the resulting cluster centers capture the data distribution as the number of representative points $K \to \infty$. This confirms that the proposed distributional clustering method indeed addresses the key problem of distribution distortion in k-means clustering.
4 Cluster center optimization for distributional clustering
Next, we present an optimization algorithm for efficiently computing these new cluster centers in practice. This algorithm is described in three parts. First, we present a clustering algorithm for solving (2.4) in the practical setting of finite data (i.e., $N < \infty$). Second, we propose an adjustment which leads to better distribution representation when the number of representative points $K$ is small. Third, we combine both parts to obtain a general optimization algorithm for distributional clustering.
4.1 Optimizing the within-cluster sum-of-log-potential criterion
We begin by presenting an optimization algorithm for computing the cluster centers in (2.4) in the case of finite data (i.e., $N < \infty$). Recall that Theorem 2 shows the empirical distribution of $\mathbf{C}^*$ converges asymptotically to the underlying data distribution $F$. To obtain $\mathbf{C}^*$, we therefore need to solve the optimization problem in (2.4). Since $f$ is typically unknown in practice, we approximate the integral in (2.4) with a finite sample average over the available data. This yields the following optimization problem:
$$\hat{\mathbf{C}} = \underset{\mathbf{C}}{\arg\min} \; \frac{1}{N} \sum_{i=1}^{N} \log\left( \left\| \mathbf{x}_i - Q(\mathbf{x}_i; \mathbf{C}) \right\|_2 + \eta \right). \quad (4.1)$$
Here, a small nugget parameter $\eta > 0$ is added to (4.1) to avoid singularity of the objective function when $\mathbf{x}_i = Q(\mathbf{x}_i; \mathbf{C})$. Letting $\mathbf{C} = \{\mathbf{c}_j\}_{j=1}^{K}$, where $\mathbf{c}_j$ is the $j$-th cluster center to be optimized, (4.1) can be written as:
$$\min_{\mathbf{W}, \mathbf{C}} \; \sum_{i=1}^{N} \sum_{j=1}^{K} W_{ij} \log\left( \left\| \mathbf{x}_i - \mathbf{c}_j \right\|_2 + \eta \right), \quad \text{s.t.} \; \sum_{j=1}^{K} W_{ij} = 1, \; W_{ij} \in \{0, 1\}, \quad (4.2)$$
where $W_{ij}$ is a binary decision variable indicating the assignment of point $\mathbf{x}_i$ to cluster $j$, and $\mathbf{W}$ is the $N \times K$ matrix collecting the $W_{ij}$'s. Note that the constant $1/N$ is removed from (4.2) since it does not affect the optimization.
The optimization problem in (4.2) is now quite similar to the k-means criterion in (1.1). Thus, to optimize (4.2), we will use a novel extension of Lloyd’s algorithm (Lloyd 1982), which is widely used for optimizing the k-means clustering problem (1.1). The idea behind Lloyd’s algorithm is straightforward. Beginning with an initial set of cluster centers (which are randomly sampled from the data), the first step is to (a) assign each data point to its nearest cluster center in Euclidean norm; this optimizes the assignment decision variables given fixed centers. Next, for each cluster, (b) update its cluster center as the point which minimizes the sum of squared distances to all cluster points; this optimizes the cluster centers given fixed cluster assignments. For (1.1), the latter step amounts to simply updating each cluster center as the mean of its cluster points. Both optimization steps are then repeated iteratively until convergence.
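Steps (a) and (b) above can be sketched as follows (a minimal NumPy version; the function name and defaults are our own):

```python
import numpy as np

def lloyd_kmeans(x, K, iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate (a) nearest-center assignment
    and (b) cluster-mean updates until the centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=K, replace=False)].copy()
    for _ in range(iters):
        # (a) assign each point to its nearest cluster center
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # (b) update each center as the mean of its assigned points
        new = centers.copy()
        for j in range(K):
            if np.any(labels == j):
                new[j] = x[labels == j].mean(axis=0)
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

rng = np.random.default_rng(2)
x = rng.standard_normal((1000, 2))
centers, labels = lloyd_kmeans(x, K=10)
```

Each iteration never increases the within-cluster sum-of-squares, which is why the alternation converges to a local optimum of (1.1).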
Since the proposed criterion in (4.2) is similar to the k-means criterion in (1.1), we can extend Lloyd’s iterative algorithm to optimize (4.2). The key modification is in step (b) of the above description. First define the log-potential of the $j$-th cluster as:
$$LP_j(\mathbf{c}) = \sum_{i=1}^{N_j} \log\left( \left\| \mathbf{x}_i^{(j)} - \mathbf{c} \right\|_2 + \eta \right), \quad (4.3)$$
where, in the $j$-th cluster, $N_j$ is the number of points and $\mathbf{x}_i^{(j)}$ is its $i$-th point. Note that the log-potential is defined given fixed cluster assignments. In view of (4.2), the only modification required in step (b) is that we instead update cluster centers by minimizing the log-potential in (4.3).
Unfortunately, unlike k-means, the minimizer of the log-potential (4.3) has no closed-form expression. However, this minimization can be greatly simplified by the following intuition. Suppose we have a one-dimensional dataset, and the $j$-th cluster contains three points. Figure 2 plots the log-potential with a small nugget term $\eta$. From this, the local minima of the log-potential appear to occur at the three data points within the cluster. Theorem 3 shows that this is indeed the case for sufficiently small $\eta$.
Theorem 3 (Log-potential minima).
For any cluster of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, there exists an $\eta_0 > 0$ such that, for all $\eta < \eta_0$, the global minimum of the log-potential is found in $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$.
The proof is given in Appendix C.
The insight from this theorem is that, for each cluster, we may search for the optimal center among the points assigned to that cluster. In particular, using Theorem 3, the optimization problem in (4.2) can be written as:
$$\min_{\mathbf{W}, \mathbf{C}} \; \sum_{j=1}^{K} \left[ \log \eta + \sum_{\substack{\mathbf{x} \in S_j \\ \mathbf{x} \neq \mathbf{c}_j}} \log\left( \left\| \mathbf{x} - \mathbf{c}_j \right\|_2 + \eta \right) \right], \quad \text{s.t.} \; \mathbf{c}_j \in S_j, \quad (4.4)$$
where $S_j$ is the set of points assigned to the $j$-th cluster. The term $\log \eta$ in (4.4) results from the fact that $\mathbf{c}_j$ is itself a data point in the $j$-th cluster, so its distance to itself is zero. Since $K \log \eta$ is constant and $\eta$ is small, (4.4) can be further simplified to obtain the cluster centers in the finite-data setting ($N < \infty$) as:
$$\hat{\mathbf{C}} = \underset{\mathbf{W}, \mathbf{C}}{\arg\min} \; \sum_{j=1}^{K} \sum_{\substack{\mathbf{x} \in S_j \\ \mathbf{x} \neq \mathbf{c}_j}} \log \left\| \mathbf{x} - \mathbf{c}_j \right\|_2, \quad \text{s.t.} \; \mathbf{c}_j \in S_j. \quad (4.5)$$
Hence, given fixed assignment variables $\mathbf{W}$, each cluster center can be updated via the discrete optimization problem:
$$\mathbf{c}_j \leftarrow \underset{\mathbf{c} \in S_j}{\arg\min} \; \sum_{\substack{i=1 \\ \mathbf{x}_i^{(j)} \neq \mathbf{c}}}^{N_j} \log \left\| \mathbf{x}_i^{(j)} - \mathbf{c} \right\|_2, \quad (4.6)$$
where $N_j$ is the number of points and $\mathbf{x}_i^{(j)}$ is the $i$-th point assigned to the $j$-th cluster ($j = 1, \ldots, K$). This can be efficiently optimized via several heuristics, which we discuss next.
We now summarize the full optimization algorithm for (4.5) (called DC_asymp) in Algorithm 1. First, initial cluster centers are randomly sampled from the data. Next, repeat the following two steps until convergence: (a) assign each data point to its nearest cluster center, and (b) update each cluster center to minimize the log-potential criterion (4.6). Although step (b) solves a discrete optimization problem, its computation time can be greatly reduced using the following heuristic. The key idea is that we do not need to compute the objective over every data point in the cluster. Since the log-potential cluster center in (4.6) measures cluster centrality, one would expect it to be near the sample mean of all the points in the cluster. Thus, one way to reduce computation is to compute the objective for only the fraction $\alpha$ of cluster points closest to the mean (for a small choice of $\alpha$) and screen out the remaining points. A small choice of $\alpha$ seems to work well in our numerical examples.
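A minimal sketch of DC_asymp with the screening heuristic might look as follows (NumPy; the function name, the screening fraction alpha, and all defaults are our own choices, not the paper's):

```python
import numpy as np

def dc_asymp(x, K, iters=50, alpha=0.1, seed=0):
    """Sketch of DC_asymp: (a) nearest-center assignment; (b) update each
    center to the cluster point minimizing the log-potential (4.6),
    searching only the alpha-fraction of points nearest the cluster mean."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=K, replace=False)].copy()
    for _ in range(iters):
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = centers.copy()
        for j in range(K):
            pts = x[labels == j]
            if len(pts) == 0:
                continue
            # screening heuristic: keep candidates nearest the cluster mean
            m = pts.mean(axis=0)
            order = np.argsort(((pts - m) ** 2).sum(axis=1))
            cand = pts[order[: max(1, int(np.ceil(alpha * len(pts))))]]
            # evaluate the log-potential of each candidate
            # (the zero distance from a candidate to itself is excluded)
            best, best_val = None, np.inf
            for c in cand:
                d = np.linalg.norm(pts - c, axis=1)
                val = np.log(d[d > 0]).sum()
                if val < best_val:
                    best, best_val = c, val
            new[j] = best
        if np.allclose(new, centers):
            break
        centers = new
    return centers

rng = np.random.default_rng(3)
x = rng.standard_normal((2000, 2))
centers = dc_asymp(x, K=20)
```

By Theorem 3, restricting the search to cluster points loses nothing for small $\eta$, and the screening step reduces the per-cluster cost from quadratic to roughly $\alpha$ times quadratic in the cluster size.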
We also note an interesting connection between distributional clustering and k-medoids clustering (Kaufman & Rousseeuw 1987). The cluster centers in k-medoids optimize the same criterion as k-means, but are restricted to be within the data itself (which is similar to DC_asymp). However, distributional clustering enjoys the same advantage over k-medoids as it does over k-means: the proposed cluster centers capture the data distribution asymptotically, whereas this is not true for k-medoids clustering. Hence, the proposed method is expected to perform better in applications where cluster prototypes should be representative of the data distribution.
4.2 Optimizing the within-cluster sum-of-powers criterion
Note that Theorem 2 guarantees the proposed cluster centers converge in distribution to the data distribution as the number of centers $K \to \infty$. In the case of small $K$, however, an adjustment can be performed to provide improved distribution representation. To see why, Fig. 3 (left) shows the kernel density estimate of cluster centers on 1-dimensional standard normal data for different values of $K$, with the power $q$ fixed at 0. As $K$ decreases, the optimal cluster centers tend to move towards the high-density regions of the data, which makes the estimated density function less heavy-tailed. One reason for this is that, in minimizing the log-potential between the cluster center and cluster points, there is less incentive to reduce the distance from the cluster center to a far-off point than to many nearby points. Thus, for small $K$, cluster centers are pushed towards high-density regions, which induces a finite-sample distribution distortion (note that this distortion disappears as $K \to \infty$). We refer to this distortion as a “small-$K$ distortion”.
Recall that the within-cluster sum-of-log-potential criterion (4.5) is a limiting case of the within-cluster sum-of-powers criterion (2.1). To correct the small-$K$ distortion in the former, one strategy is to optimize the latter with an appropriately chosen power $q$. Figure 3 (right) shows the kernel density estimate of the optimal cluster centers on 1-dimensional standard normal data for different values of $q$, with $K$ fixed at 30. As $q$ increases, the density of the cluster centers becomes more heavy-tailed, which is the opposite of the effect of the small-$K$ distortion! This suggests that the small-$K$ distortion can be corrected by increasing the power $q$ in (2.1). Intuitively, one reason for this is that, as $q$ increases, the incentive is higher to reduce the distance of the cluster center from a far-away point rather than from many nearby points. This encourages cluster centers to move towards low-density regions, which leads to cluster centers being more spread out.
Adopting the above adjustment strategy, we next present an optimization algorithm, called DC_finite (see Algorithm 2), for the within-cluster sum-of-powers criterion (2.1) with fixed $q$ (a tuning procedure for $q$ is discussed in the next section). First, initial cluster centers are randomly sampled from the data. Next, the following two steps are iterated until convergence: (a) assign each data point to its nearest cluster center, and (b) update each cluster center to minimize the sum-of-powers within each cluster. In particular, step (b) performs the following update:
$$\mathbf{c}_j \leftarrow \underset{\mathbf{c} \in \mathbb{R}^p}{\arg\min} \; \sum_{i=1}^{N_j} \left\| \mathbf{x}_i^{(j)} - \mathbf{c} \right\|_2^q. \quad (4.7)$$
For $q \geq 1$, the optimization in (4.7) is convex, so a global optimum can be obtained via the truncated Newton method (Dembo & Steihaug 1983). In our implementation, this optimization is performed using the R package ‘nloptr’ (Ypma 2014). A similar approach was used by Mak & Joseph (2018) for the case of large $q$, but for a different goal of experimental design.
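Update (4.7) can be sketched with a plain gradient loop (the paper's implementation uses truncated Newton via nloptr; the simple descent below, with our own step size and iteration count, is only a stand-in):

```python
import numpy as np

def power_center(pts, q, lr=0.01, iters=2000):
    """Minimize sum_i ||x_i - c||^q over c, as in update (4.7).
    The objective is convex for q >= 1, so descent finds the global optimum."""
    c = pts[0].astype(float).copy()      # deliberately crude starting point
    for _ in range(iters):
        d = np.linalg.norm(pts - c, axis=1)
        d = np.maximum(d, 1e-12)         # guard the d^(q-2) factor at d = 0
        # gradient of ||x_i - c||^q w.r.t. c is q * d^(q-2) * (c - x_i)
        grad = (q * d ** (q - 2))[:, None] * (c - pts)
        c -= lr * grad.mean(axis=0)
    return c

rng = np.random.default_rng(4)
pts = rng.standard_normal((200, 2)) + np.array([1.0, -2.0])

# Sanity check: for q = 2 the unique minimizer is the cluster mean.
c2 = power_center(pts, q=2.0)
```

For $q = 2$ the iteration contracts linearly towards the cluster mean, which recovers the familiar k-means center update as a special case.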
4.3 Distributional clustering algorithm
Finally, we present a way to tune the power $q$ in (2.1) to best correct the small-$K$ distortion. We will make use of the following energy distance (Székely & Rizzo 2004). Given data $\mathcal{D} = \{\mathbf{x}_i\}_{i=1}^{N}$ and optimal cluster centers $\mathbf{C}_q = \{\mathbf{c}_j\}_{j=1}^{K}$, the energy distance between $\mathcal{D}$ and $\mathbf{C}_q$ is defined as:
$$ED(\mathcal{D}, \mathbf{C}_q) = \frac{2}{NK} \sum_{i=1}^{N} \sum_{j=1}^{K} \|\mathbf{x}_i - \mathbf{c}_j\|_2 - \frac{1}{N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} \|\mathbf{x}_i - \mathbf{x}_{i'}\|_2 - \frac{1}{K^2} \sum_{j=1}^{K} \sum_{j'=1}^{K} \|\mathbf{c}_j - \mathbf{c}_{j'}\|_2. \quad (4.8)$$
The energy distance was initially proposed for two-sample goodness-of-fit testing between two datasets. Here, we do not use this criterion for goodness-of-fit testing, but rather to tune a good choice of power $q$ which maximizes the goodness-of-fit between the data $\mathcal{D}$ and the cluster centers $\mathbf{C}_q$. More specifically, we wish to find the power $q^*$ which satisfies:
$$q^* = \underset{q \geq 0}{\arg\min} \; ED(\mathcal{D}, \mathbf{C}_q). \quad (4.9)$$
In implementation, $q^*$ is estimated as follows. First, beginning with the initial case of $q = 0$, we generate the optimal cluster centers using the algorithm DC_asymp in Section 4.1, and compute their energy distance to the data $\mathcal{D}$. Next, we iteratively increase the power $q$ by a small step size $\delta$, starting from $q = 0$; at each step, we generate the optimal cluster centers using the algorithm DC_finite in Section 4.2, and compute their energy distance to the data. We increment $q$ as long as the computed energy distance decreases, and terminate the procedure when it increases for a new power. Finally, we take the optimal power $q^*$ as the power prior to this increase in energy distance. From simulations (see Figure 4), the energy distance appears to be near-convex in $q$, which justifies this iterative tuning approach. Algorithm 3, which we call DC, outlines the full distributional clustering algorithm with power tuning.
Figure 4 visualizes this tuning procedure for two toy examples. The left plot shows the energy distance against the power $q$ for 10-dimensional standard normal data. In light of the discussion in Section 4.2, the relatively large tuned power here is not surprising, since the number of representative points $K$ is quite small. The right plot shows the tuned power $q^*$ as a function of $K$, where the data are generated from a 9-dimensional multivariate standard normal distribution. As $K$ increases, the tuned power needed to correct this distortion decreases to zero, which makes sense since the small-$K$ distortion disappears as $K \to \infty$. This also supports the result in Theorem 2, that the distributional clustering centers converge to the data distribution as $K \to \infty$.
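The full tuning loop can be sketched end-to-end in one dimension. Everything below is a simplified stand-in for Algorithms 1-3 (our own step size delta, cluster count, and iteration budgets; the DC_finite inner update uses a bounded scalar optimizer instead of truncated Newton):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def energy_distance(x, y):
    """Energy distance (4.8): 2 E||X-Y|| - E||X-X'|| - E||Y-Y'||."""
    return (2 * np.abs(x[:, None] - y[None, :]).mean()
            - np.abs(x[:, None] - x[None, :]).mean()
            - np.abs(y[:, None] - y[None, :]).mean())

def fit_centers(x, K, q, iters=30, seed=0):
    """1-d sketch: q = 0 runs the DC_asymp update (discrete log-potential
    search over the cluster's own points); q > 0 runs the DC_finite update."""
    rng = np.random.default_rng(seed)
    c = np.sort(rng.choice(x, K, replace=False))
    for _ in range(iters):
        lab = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
        for j in range(K):
            pts = x[lab == j]
            if pts.size <= 1:
                continue
            if q == 0.0:   # discrete log-potential search, as in (4.6)
                vals = [np.log(np.abs(pts[pts != p] - p)).sum() for p in pts]
                c[j] = pts[int(np.argmin(vals))]
            else:          # continuous power-q update, as in (4.7)
                c[j] = minimize_scalar(lambda t: np.sum(np.abs(pts - t) ** q),
                                       bounds=(pts.min(), pts.max()),
                                       method="bounded").x
    return c

rng = np.random.default_rng(6)
x = rng.standard_normal(2000)
K, delta = 30, 0.5

# Increase q from 0 in steps of delta while the energy distance decreases.
q, best_ed = 0.0, energy_distance(x, fit_centers(x, K, 0.0))
while q < 10.0:
    ed = energy_distance(x, fit_centers(x, K, q + delta))
    if ed >= best_ed:
        break
    q, best_ed = q + delta, ed
```

The loop stops at the first power whose energy distance is no better than the previous one, mirroring the near-convexity observed in Figure 4.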
5 Numerical examples
We now investigate the performance of distributional clustering in two numerical examples. To provide a fair comparison, we will use the energy distance as well as another metric – the multivariate Cramér statistic (Baringhaus & Franz 2004) – to compare different reduction methods. The Cramér statistic between the data $\mathcal{D} = \{\mathbf{x}_i\}_{i=1}^{N}$ and cluster centers $\mathbf{C} = \{\mathbf{c}_j\}_{j=1}^{K}$ is:
$$T = \frac{NK}{N+K} \left( \frac{1}{NK} \sum_{i=1}^{N} \sum_{j=1}^{K} \phi\left( \|\mathbf{x}_i - \mathbf{c}_j\|_2^2 \right) - \frac{1}{2N^2} \sum_{i=1}^{N} \sum_{i'=1}^{N} \phi\left( \|\mathbf{x}_i - \mathbf{x}_{i'}\|_2^2 \right) - \frac{1}{2K^2} \sum_{j=1}^{K} \sum_{j'=1}^{K} \phi\left( \|\mathbf{c}_j - \mathbf{c}_{j'}\|_2^2 \right) \right), \quad (5.1)$$
where $\phi$ is a kernel function. We have chosen a kernel which compares the distributions on both dispersion and location. The lower the Cramér statistic, the closer the respective distributions.
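A sketch of the Cramér statistic (5.1) in NumPy; the log kernel $\phi(z) = \log(1+z)$ used below is our own assumption (one standard choice that is sensitive to both location and dispersion differences), not necessarily the kernel used in the paper:

```python
import numpy as np

def cramer_statistic(x, y, phi=lambda z: np.log1p(z)):
    """Multivariate Cramér statistic (5.1) between samples x (n x p)
    and y (m x p); phi acts on squared Euclidean distances."""
    n, m = len(x), len(y)
    dxy = phi(((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)).mean()
    dxx = phi(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)).mean()
    dyy = phi(((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)).mean()
    return n * m / (n + m) * (dxy - 0.5 * dxx - 0.5 * dyy)

rng = np.random.default_rng(7)
x = rng.standard_normal((200, 3))
same = cramer_statistic(x, x)        # identical samples give statistic 0
far = cramer_statistic(x, x + 1.0)   # a shifted copy scores higher
```

As with the energy distance, the statistic vanishes when the two samples coincide and grows as their distributions separate.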
Example 1.
We compare the numerical performance of our distributional clustering algorithm with k-means clustering and random subsampling, on data simulated from the standard normal, exponential and gamma distributions, with dimensions varying from 2 to 8. The energy distance in (4.8) and the multivariate Cramér statistic in (5.1) are used as metrics for evaluating the reduction methods. Figure 5 shows the energy distance and Cramér statistic for each of the three reduction methods, over the three distribution choices. We see that the converged cluster centers from distributional clustering have both the lowest energy distance and the lowest Cramér statistic, for all distributions and dimensions, which suggests that the proposed clustering method better captures the distribution of the underlying data than existing methods. For the normal distribution, the energy distance is in some cases almost the same for distributional clustering and k-means, so we observe comparable performance of the two methods there. For the tuned power $q^*$, we also observe larger values for the normally-distributed data (with values increasing in dimension), but smaller values for the exponential and gamma-distributed data, including $q^* = 0$ in some cases. This suggests that, while the clustering procedure under the within-cluster sum-of-log-potential criterion is asymptotically consistent for distribution matching, it may also be useful for problems with a small number of clusters (depending on the distribution of the data). In other cases, the power adjustment of Section 4.2 is important to ensure good distribution matching.
Example 2.
High-resolution regional climate models (RCMs) are driven by comparatively coarse-resolution global climate models to depict high-resolution future climate states of the atmosphere with associated uncertainty. Several targeted random sampling techniques have been developed (for example, Rife et al. (2013)) to select a subset of days from the coarse-resolution dataset such that the distribution of climate variables on those days matches that of the entire population. Using such a representative sample provides a more economical and computationally feasible method of determining climate change uncertainty (Pinto et al. 2014). This motivates applying distributional clustering to reduce climate data.
We demonstrate the effectiveness of distributional clustering on climate data (https://rattle.togaware.com/weatherAUS.csv) containing daily measurements of wind speed, humidity, pressure and minimum temperature in Australia from October 2007 to June 2017. To show the effectiveness of distributional clustering not just on one sample, but for any sample in general, we randomly select 100 distinct initializing samples, and compare the performance of all reduction methods on each of the 100 samples. Figure 6 shows boxplots of the energy distance and Cramér statistic for the reduced samples from distributional clustering, k-means clustering, and random subsampling. We see that both the energy distances and Cramér statistics for distributional clustering are noticeably lower than those for the other two methods, which suggests that the proposed method again outperforms existing methods in capturing the distribution of the underlying data.
6 Conclusion
K-means clustering is a widely used approach for identifying cluster prototypes of an underlying dataset. One drawback, however, is that the k-means cluster centers (as representative points) distort the distribution of the data. To address this, we proposed a new distributional clustering method, which ensures the cluster centers indeed capture the data distribution. We proved the asymptotic convergence of the proposed cluster centers to the data-generating distribution, then presented an efficient algorithm for computing such centers in practice. Numerical examples show that the proposed cluster prototypes provide a better representation of the underlying data, compared to k-means clustering and random subsampling.
There are several interesting directions to pursue for future research. First, while the representative points from distributional clustering preserve the overall data distribution, they do not necessarily preserve marginal distributions over each variable – a property shown to be important for high-dimensional data reduction (Mak & Joseph 2017). An extension of the proposed method for capturing marginal distributions would be worthwhile. Second, it would be valuable to develop a more computationally efficient strategy for tuning the power $q$, which would allow our method to scale better for large datasets.
Acknowledgements
This research is supported by U.S. National Science Foundation grant DMS-1712642.
Appendix A: Proof of Theorem 1.
Theorem 1 (Within-cluster sum-of-limiting-power).
For any set of cluster centers $\mathbf{C} = \{\mathbf{c}_1, \ldots, \mathbf{c}_K\}$, $K \in \mathbb{N}$, we have:
$$D_0(\mathbf{C}) = \lim_{q \to 0^+} D_q(\mathbf{C}) = \exp\left\{ \int_{\mathbb{R}^p} \log \left\| \mathbf{x} - Q(\mathbf{x}; \mathbf{C}) \right\|_2 \, f(\mathbf{x}) \, d\mathbf{x} \right\}. \quad (A.1)$$
Proof.
(Theorem 1). Taking logarithms in (2.3) gives $\log D_q(\mathbf{C}) = \frac{1}{q} \log \int_{\mathbb{R}^p} \|\mathbf{x} - Q(\mathbf{x}; \mathbf{C})\|_2^q \, f(\mathbf{x}) \, d\mathbf{x}$. Since $\int f(\mathbf{x}) \, d\mathbf{x} = 1$, both numerator and denominator vanish as $q \to 0^+$, so by L'Hospital's rule, $\lim_{q \to 0^+} \log D_q(\mathbf{C}) = \lim_{q \to 0^+} \frac{\int \|\mathbf{x} - Q(\mathbf{x}; \mathbf{C})\|_2^q \, \log \|\mathbf{x} - Q(\mathbf{x}; \mathbf{C})\|_2 \, f(\mathbf{x}) \, d\mathbf{x}}{\int \|\mathbf{x} - Q(\mathbf{x}; \mathbf{C})\|_2^q \, f(\mathbf{x}) \, d\mathbf{x}} = \int \log \|\mathbf{x} - Q(\mathbf{x}; \mathbf{C})\|_2 \, f(\mathbf{x}) \, d\mathbf{x}$. Exponentiating yields (2.6), which completes the proof. ∎
Appendix B: Proof of Theorem 2.
Theorem 2 (Distributional convergence of distributional clustering).
Suppose $F$ satisfies the mild assumptions (A1) and (A2). Then $F_K^*$, the e.d.f. of the distributional clustering centers, satisfies:
$$\lim_{K \to \infty} F_K^*(\mathbf{x}) = F(\mathbf{x}) \quad \text{for all } \mathbf{x} \in \mathbb{R}^p. \quad (B.1)$$
Proof.
(Theorem 2).
First, we claim that for any $q \in (0, \bar{q}]$, (i) $\lim_{K \to \infty} F_{q,K}(\mathbf{x}) = F_q(\mathbf{x})$ for all $\mathbf{x}$. This follows directly from Theorem 7.5 in Graf & Luschgy (2007), under assumption (A1). This convergence is also uniform in $q$ under assumption (A2). Next, applying Scheffé’s lemma (Scheffé 1947), it is easy to see that (ii) $\lim_{q \to 0^+} F_q(\mathbf{x}) = F(\mathbf{x})$ for all $\mathbf{x}$, since the normalized density proportional to $f^{p/(p+q)}$ (the limiting density of $F_{q,K}$) converges to $f$ (the density of $F$) almost everywhere as $q \to 0^+$.
We now wish to use (i) and (ii) to prove (B.1). Choose any event $A \in \mathcal{B}(\mathbb{R}^p)$, the Borel $\sigma$-algebra on $\mathbb{R}^p$. Let $P_{q,K}(A)$, $P_q(A)$, and $P(A)$ denote the probability of $A$ under $F_{q,K}$, $F_q$, and $F$, respectively. If the following lemma holds:
Lemma 1 (Exchange of limits).
$$\lim_{q \to 0^+} \lim_{K \to \infty} P_{q,K}(A) = \lim_{K \to \infty} \lim_{q \to 0^+} P_{q,K}(A), \quad (B.2)$$
then equation (B.1) must hold, because the left-hand side equals $P(A)$ by applying (i) and (ii), while the right-hand side is the limiting probability of $A$ under the distributional clustering centers, which implies (B.1).
Proof.
(Lemma 1). By assumption (A2), the inner limit in $K$ converges uniformly over $q \in (0, \bar{q}]$, and by (ii) the limit in $q$ of the limiting function exists. The exchange of limits then follows from the Moore–Osgood theorem for double sequences (see, e.g., Habil 2006). ∎
Appendix C: Proof of Theorem 3.
Theorem 3 (Log-potential minima).
For any cluster of points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$, there exists an $\eta_0 > 0$ such that, for all $\eta < \eta_0$, the global minimum $\mathbf{c}^*$ of the log-potential $LP(\cdot)$ is found in $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$.
Proof.
(Theorem 3). Write the log-potential of a cluster with points $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ as $LP(\mathbf{c}) = \sum_{i=1}^{n} \log(\|\mathbf{x}_i - \mathbf{c}\|_2 + \eta)$. At a data point $\mathbf{c} = \mathbf{x}_l$, we have $LP(\mathbf{x}_l) = \log \eta + \sum_{i \neq l} \log(\|\mathbf{x}_i - \mathbf{x}_l\|_2 + \eta)$, which diverges to $-\infty$ as $\eta \to 0^+$, since the remaining distances are fixed and positive. At a non-data point $\mathbf{c}$, with $\mathbf{x}_l$ its nearest cluster point, $LP(\mathbf{c}) - LP(\mathbf{x}_l) = \log\left( \frac{\|\mathbf{x}_l - \mathbf{c}\|_2 + \eta}{\eta} \right) + \sum_{i \neq l} \log\left( \frac{\|\mathbf{x}_i - \mathbf{c}\|_2 + \eta}{\|\mathbf{x}_i - \mathbf{x}_l\|_2 + \eta} \right)$. The first term is positive and grows without bound as $\eta \to 0^+$, while the second term is bounded below independently of $\eta$ (by the triangle inequality, $\|\mathbf{x}_i - \mathbf{c}\|_2 \geq \|\mathbf{x}_i - \mathbf{x}_l\|_2 - \|\mathbf{x}_l - \mathbf{c}\|_2$). It follows that there exists an $\eta_0 > 0$ such that, for all $\eta < \eta_0$, every $\mathbf{c} \notin \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ satisfies $LP(\mathbf{c}) > \min_l LP(\mathbf{x}_l)$, i.e., the global minimum is attained at one of the data points. ∎
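Theorem 3 can also be illustrated numerically. The three cluster points and nugget value below are our own illustrative choices (the values used in Figure 2 were not recoverable); with a small nugget, the best center over a dense grid of candidates is one of the data points themselves:

```python
import numpy as np

pts = np.array([0.0, 0.5, 1.0])   # illustrative 1-d cluster points (assumed)
eta = 1e-6                        # a small nugget

def log_potential(c):
    return float(np.sum(np.log(np.abs(pts - c) + eta)))

grid = np.linspace(-0.5, 1.5, 2001)       # dense grid of candidate centers
candidates = np.concatenate([grid, pts])  # include the data points themselves
vals = np.array([log_potential(c) for c in candidates])
best = candidates[vals.argmin()]          # global minimizer over candidates
```

At a data point the nugget contributes a $\log \eta \approx -13.8$ term, which dominates any candidate a positive distance away, so the minimizer coincides with one of the cluster points.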
References
 Baringhaus & Franz (2004) Baringhaus, L. & Franz, C. (2004), ‘On a new multivariate twosample test’, Journal of Multivariate Analysis 88(1), 190–206.
 Dalenius (1950) Dalenius, T. (1950), ‘The problem of optimum stratification’, Scandinavian Actuarial Journal 1950(34), 203–213.
 Dembo & Steihaug (1983) Dembo, R. S. & Steihaug, T. (1983), ‘TruncatedNewton algorithms for largescale unconstrained optimization’, Mathematical Programming 26(2), 190–212.
 Dette & Pepelyshev (2010) Dette, H. & Pepelyshev, A. (2010), ‘Generalized Latin hypercube design for computer experiments’, Technometrics 52(4), 421–429.
 Fang & Wang (1994) Fang, K. & Wang, Y. (1994), NumberTheoretic Methods in Statistics, volume 51 of Monographs on Statistics and Applied Probability, Chapman and Hall, London.
 Flury (1990) Flury, B. A. (1990), ‘Principal points’, Biometrika 77(1), 33–41.
 Fort & Pagès (2002) Fort, J.C. & Pagès, G. (2002), ‘Asymptotics of optimal quantizers for some scalar distributions’, Journal of Computational and Applied Mathematics 146(2), 253–275.
 Gan et al. (2007) Gan, G., Ma, C. & Wu, J. (2007), Data Clustering: Theory, Algorithms, and Applications, Vol. 20, SIAM.
 Graf & Luschgy (2007) Graf, S. & Luschgy, H. (2007), Foundations of Quantization for Probability Distributions, Springer.
 Habil (2006) Habil, E. D. (2006), ‘Double sequences and double series’, IUG Journal of Natural Studies 14(1), No 1.
 Heitsch & Römisch (2003) Heitsch, H. & Römisch, W. (2003), ‘Scenario reduction algorithms in stochastic programming’, Computational Optimization and Applications 24(23), 187–206.
 Kaufman & Rousseeuw (1987) Kaufman, L. & Rousseeuw, P. J. (1987), Clustering by means of medoids, in ‘Statistical Data Analysis based on the L1 Norm and related methods’, (Edited by Y. Dodge), 405–416, Birkhäuser, Switzerland.
 Lloyd (1982) Lloyd, S. (1982), ‘Least squares quantization in PCM’, IEEE Transactions on Information Theory 28(2), 129–137.
 Mak & Joseph (2017) Mak, S. & Joseph, V. R. (2017), ‘Projected support points: a new method for highdimensional data reduction’, arXiv preprint arXiv:1708.06897 .
 Mak & Joseph (2018) Mak, S. & Joseph, V. R. (2018), ‘Minimax and minimax projection designs using clustering’, Journal of Computational and Graphical Statistics 27(1), 166–178.
 Pinto et al. (2014) Pinto, J. O., Monaghan, A. J., Delle Monache, L., Vanvyve, E. & Rife, D. L. (2014), ‘Regional assessment of sampling techniques for more efficient dynamical climate downscaling’, Journal of Climate 27(4), 1524–1538.
 Pollard (1981) Pollard, D. (1981), ‘Strong consistency of k-means clustering’, The Annals of Statistics 9(1), 135–140.
 Rife et al. (2013) Rife, D. L., Vanvyve, E., Pinto, J. O., Monaghan, A. J., Davis, C. A. & Poulos, G. S. (2013), ‘Selecting representative days for more efficient dynamical climate downscaling: Application to wind energy’, Journal of Applied Meteorology and Climatology 52(1), 47–63.
 Scheffé (1947) Scheffé, H. (1947), ‘A useful convergence theorem for probability distributions’, The Annals of Mathematical Statistics 18(3), 434–438.
 Smith (2013) Smith, R. C. (2013), Uncertainty Quantification: Theory, Implementation, and Applications, Vol. 12, SIAM.
 Su (2000) Su, Y. (2000), ‘Asymptotically optimal representative points of bivariate random vectors’, Statistica Sinica 10(2), 559–575.
 Székely & Rizzo (2004) Székely, G. J. & Rizzo, M. L. (2004), ‘Testing for equal distributions in high dimension’, InterStat 5(16.10), 1249–1272.
 Tan et al. (2016) Tan, P.N., Steinbach, M. & Kumar, V. (2016), Introduction to Data Mining, Pearson Education India.
 Ypma (2014) Ypma, J. (2014), Introduction to nloptr: an R interface to NLopt, Technical report.
 Zador (1982) Zador, P. (1982), ‘Asymptotic quantization error of continuous signals and the quantization dimension’, IEEE Transactions on Information Theory 28(2), 139–149.