Stochastic Backward Euler: An Implicit Gradient Descent Algorithm for means Clustering
Abstract
In this paper, we propose an implicit gradient descent algorithm for the classic means problem. The implicit gradient step or backward Euler is solved via stochastic fixedpoint iteration, in which we randomly sample a minibatch gradient in every iteration. It is the average of the fixedpoint trajectory that is carried over to the next gradient step. We draw connections between the proposed stochastic backward Euler and the recent entropy stochastic gradient descent (EntropySGD) for improving the training of deep neural networks. Numerical experiments on various synthetic and real datasets show that the proposed algorithm finds the global minimum (or its neighborhood) with high probability, when given the correct number of clusters. The method provides better clustering results compared to means algorithms in the sense that it decreased the objective function (the cluster) and is much more robust to initialization.
Keywords:
means backward Euler implicit gradient descent fixedpoint iteration minibatch gradient∎
1 Introduction
The means method appeared in vector quantization in signal processing, and had now become popular for clustering analysis in data mining. In the seminal paper lloyd (), Lloyd proposed a twostep alternating algorithm that quickly converges to a local minimum. Lloyd’s algorithm is also known as an instance of the more general ExpectationMaximization (EM) algorithm applied to Gaussian mixtures. In Bottou_95 (), Bottou and Bengio cast Lloyd’s algorithm as Newton’s method, which explains its fast convergence.
Aiming to speed up Lloyd’s algorithm, Elkan elkan () proposed to keep track of the distances between the computed centroids and data points, and then cleverly leverage the triangle inequality to eliminate unnecessary computations of the distances. Similar techniques can be found in yingyang (). It is worth noting that these algorithms do not improve the clustering quality of Lloyd’s algorithm, but only achieve acceleration. However, there are well known examples where poor initialization can lead to low quality local minima for Lloyd’s algorithm. Random initialization has been used to avoid these low quality fixed points. The article kmeans++ () introduced a smart initialization scheme such that the initial centroids are wellseparated, which gives more robust clustering than random initialization.
We are motivated by problems with very large data sets, where the cost of a single iteration of Lloyd’s algorithm can be expensive. Minibatch mnbatch (); stochastic_kmeans () was later introduced to adapt means for large scale data with high dimensions. The centroids are updated using a randomly selected minibatch rather than all of the data. Minibatch (stochastic) means has a flavor of stochastic gradient descent whose benefits are twofold. First, it dramatically reduces the periteration cost for updating the centroids and thus is able to handle big data efficiently. Second, similar to its successful application to deep learning deep_learning (), minibatch gradient introduces noise in minimization and may help to bypass some bad local minima. Furthermore, the aforementioned Elkan’s technique can be combined with minibatch means for further acceleration nested_mnbatch ().
In this paper, we propose a backward Euler based algorithm for means clustering. Fixedpoint iteration is performed to solve the implicit gradient step. As is done for stochastic minibatch means, we compute the gradient only using a minibatch of samples instead of the whole data, which enables us to handle massive data. Unlike the standard fixedpoint iteration, the resulting stochastic fixedpoint iteration outputs an average over its trajectory. Extensive experiments show that, with proper choice of step size for backward Euler, the proposed algorithm empirically locates the neighborhood of global minimum with high probability.
In other words, while Lloyd’s algorithm is effective with a full gradient oracle we achieve better performance with the weaker minibatch gradient oracle. We are motivated by recent work by two of the authors deep_relax () which applied a similar algorithm to accelerate the training of Deep Neural Networks.
2 Stochastic backward Euler
The celebrated proximal point algorithm (PPA) proximal_point () for minimizing some function is:
(1) 
PPA has the advantage of being monotonically decreasing, which is guaranteed for any step size . Indeed, by the definition of in (1), we have
When for any with being the Lipschitz constant of , the (subsequential) convergence to a stationary point is established in PPM (). If is differentiable at , it is easy to check that the following optimality condition to (1) holds
By rearranging the terms, we arrive at implicit gradient descent or the socalled backward Euler:
(2) 
Fixed point iteration is a viable option for solving (2) if has Lipschitz constant and .
Proposition 1
If , then we have

is strongly convex, and the proximal problem (1) has a unique solution .

The fixed point iteration
(3) generates a sequence converging to at least linearly.
See (Ber08, , Proposition 1.2.3).
Let us consider means clustering for a set of data points in with centroids . We assume each cluster contains the same number of points. Denoting , we seek to minimize
(4) 
Note that is nondifferentiable at if there exist and such that
This means that there is a data point which has two or more distinct nearest centroids and . The same situation may happen in the assignment step of Lloyd’s algorithm. In this case, we simply assign to one of the nearest centroids. With that said, is basically piecewise differentiable. By abuse of notation, we can define the ’gradient’ of at any point by
(5) 
where denotes the index set of the points that are assigned to the centroid . Similarly, we can compute the ’Hessian’ of as was done in Bottou_95 ():
where is an D vector of all ones. When the number of points is large and ’s are distinct from each other, the jumps at discontinuities of caused by the ambiguity in the assignment of data points to centroids are very small. Thus with (5), one can roughly consider to be Lipschitz continuous with the Lipschitz constant by ignoring these tiny jumps. In what follows, we analyze how the fixed point iteration (3) works on the piecewise differentiable with discontinuous .
Definition 1
is piecewise Lipschitz continuous on with Lipschitz constant if can be partitioned into subdomains and is Lipschitz continuous on each subdomain , i.e., for each we have
According to the definition, we can see that is approximately piecewise Lipschitz continuous. But in the extreme case where just one point is assigned to each of the first clusters and points to the last cluster, only has piecewise Lipschitz constant of . Our main result proves the convergence of fixed point iteration on means problem.
Theorem 2.1
Suppose is piecewise Lipschitz. If , then the fixed point iteration for minimizing given by
with initialization satisfies

and as .

is bounded. Moreover, if any limit point of a convergent subsequence of lies in the interior of some subdomain, then the whole sequence converges to which is a fixed point obeying
Proof
(a) We know that is piecewise quadratic. Suppose (note that could be on the boundary), then has a uniform expression restricted on which is a quadratic function, denoted by . We can extend the domain of from to the whole , and we denote the extended function still by . Since is quadratic, is Lipschitz continuous on . Then we have the following wellknown inequality
Using the above inequality and the definition of , we have
In the second equality above, we used the identity
with and . Since , is monotonically decreasing. Moreover, since is bounded from below by 0, converges and thus as .
(b) Since as , combining with the fact that , we have is bounded. Consider a convergent subsequence whose limit lies in the interior of some subdomain. Then for sufficiently large , will always remain in the same subdomain in which lies and thus . Since by (a), , we have as . Therefore,
which implies is a fixed point. Furthermore, by the piecewise Lipschitz condition,
Since , when is sufficiently large, is also in the same subdomain containing . By repeatedly applying the above inequality for , we conclude that converges to .
2.1 Algorithm description
Instead of using the full gradient in fixedpoint iteration, we adopt a randomly sampled minibatch gradient
at the th inner iteration. Here, denotes the index set of the points in the th minibatch associated with the centroid obeying . The fixedpoint iteration outputs a forward looking average over its trajectory. Intuitively averaging greatly stabilizes the noisy minibatch gradients and thus smooths the descent. We summarize the proposed algorithm in Algorithm 1. Another key ingredient of our algorithm is an aggressive initial step size , which helps pass bad local minimum at the early stage. Unlike in deterministic backward Euler, diminishing step size is needed to ensure convergence. But should decay slowly because large step size is good for a global search.
2.2 Related work
Chaudhari et al. esgd () recently proposed the entropy stochastic gradient descent (EntropySGD) algorithm to tackle the training of deep neural networks. Relaxation techniques arising in statistical physics were used to change the energy landscape of the original nonconvex objective function yet with the minimizers being preserved, which allows easier minimization to obtain a ’good’ minimizer with a better geometry. More precisely, they suggest to replace with a modified objective function called local entropy local_entropy () as follows
where is the heat kernel. The connection between EntropySGD and nonlinear partial differential equations (PDEs) was later established in deep_relax (). The local entropy function turns out to be the solution to the following viscous HamiltonJacobi (HJ) PDE at
(6) 
with the initial condition . In the limit , (6) reduces to the nonviscous HJ equation
whose solution is closely related to the proximal operator :
The gradient descent dynamics for is obtained by taking the limit of the following system of stochastic differential equation as the homogenization parameter :
(7) 
where is the standard Wiener process. Specifically, we have
with and being the solution of (2.2) for fixed x. This gives rise to the implementation of EntropySGD deep_relax (). We remark that stochastic backward Euler is equivalent to EntropySGD with the step size of the gradient flow being equal to .
3 Experimental results
We show by several experiments that the proposed stochastic backward Euler (SBE) gives superior clustering results compared with the stateoftheart algorithms for means. SBE scales well for large problems. In practice, only a small number of fixedpoint iterations are needed in the inner loop, and this seems not to depend on the size of the problem. Specifically, we chose the parameters imaxit
= 5 or and the averaging parameter in all experiments. Moreover, we always set .
3.1 2D synthetic Gaussian data
We generated 4000 synthetic data points in 2D plane by multivariate normal distributions with 1000 points in each cluster. The means and covariance matrices used for Gaussian distributions are as follows:
For the initial centroids given below, Lloyd’s algorithm (or EM) got stuck at a local minimum; see the left plot of Fig. 1.
Starting from where EM got stuck, we can see that SBE managed to jump over the trap of local minimum and arrived at the global minimum; see the right plot of Fig. 1.
3.2 Iris dataset
The Iris dataset, which contains 150 4D data samples from 3 clusters, was used for comparisons between SBE and the EM algorithm. 100 runs were realized with the initial centroids randomly selected from the data samples. For the parameters, we chose minibatch size , initial step size, imaxit
, omaxit
, and decay parameter . The histograms in Fig. 2 record the frequency of objective values given by the two algorithms. Clearly there was chance that EM got stuck at a local minimum whose value is about 0.48, whereas ESGD managed to locate the near global minimum region valued at around 0.264 every time.
3.3 Gaussian data with MNIST centroids
We selected 8 handwritten digit images of dimension from MNIST dataset shown in Fig. 3, and then generated 60,000 images from these 8 centroids by adding Gaussian noise. We compare SBE with both EM and minibatch EM (mbEM) mnbatch (); stochastic_kmeans () on 100 independent realizations with random initial guess. For each method, we recorded the minimum, maximum, mean and variance of the 100 objective values by the computed centroids.
We first compare SBE and EM with the true number of clusters . For SBE, minibatch size , maximum number of iterations for backward Euler omaxit
=150, maximum fixedpoint iterations imaxit
= 10 for SBE. We set the maximum number of iterations for EM to be 50, which was sufficient for its convergence. The results are listed in the first two rows of Table 1. We observed that the global minimum was around 15.68 and that SBE always found the global minimum up to a tiny error due to the noise from minibatch.
In the comparison between SBE and mbEM, we reduced minibatch size to , omaxit
, imaxit
and tested for . Table 1 shows that with the same minibatch size, SBE outperforms mbEM in all three cases, in terms of both mean and variance of the objective values.
Method  Batch size  Max iter  Min  Max  Mean  Variance  
8  EM  7500  50  15.6800  27.2828  20.0203  6.0030 
SBE  1000  (150,10)  15.6808  15.6808  15.6808  1.49  
6  mbEM  500  100  20.44  23.4721  21.8393  0.67 
SBE  500  (100,5)  20.2989  21.2047  20.4939  0.0439  
8  mbEM  500  100  15.9193  18.5820  16.4009  0.7646 
SBE  500  (100,5)  15.6816  15.6821  15.6820  1.18  
10  mbEM  500  100  15.9148  18.1848  16.1727  0.4332 
SBE  500  (100,5)  15.6823  15.6825  15.6824  1.5 
3.4 Raw MNIST data
In this example, We used the 60,000 images from the MNIST training set for clustering test, with 6000 samples for each digit (cluster) from 0 to 9. The comparison results are shown in Table 2. We conclude that SBE consistently performs better than EM and mbEM. The histograms of objective value by the three algorithms in the case are plotted in Fig. 4.
Method  Batch size  Max iter  Min  Max  Mean  Variance  

10  EM  6000  50  19.6069  19.8195  19.6725  0.0028 
SBE  1000  (150,10)  19.6087  19.7279  19.6201  5.7  
8  mbEM  500  100  20.4948  20.7126  20.5958  0.0018 
SBE  500  (100,5)  20.2723  20.4104  20.3090  0.0014  
10  mbEM  500  100  19.9029  20.2347  20.0146  0.0041 
SBE  500  (100,5)  19.6103  19.7293  19.6354  0.0011  
12  mbEM  500  100  19.3978  19.7147  19.5136  0.0042 
SBE  500  (100,5)  19.0492  19.1582  19.0972  6.2 
3.5 MNSIT features
We extracted the feature vectors of MNIST training data prior to the last layer of LeNet5 mnist_98 (). The feature vectors have dimension 64 and lie in a better manifold compared with the raw data. The results are shown in Table 3 and Fig. 5 and 6.
Method  Batch size  Max iter  Min  Max  Mean  Variance  

10  EM  6000  50  1.6238  3.0156  2.1406  0.0977 
SBE  1000  (150,10)  1.6238  1.6239  1.6239  2.7  
8  mb EM  500  100  2.3428  3.5972  2.7157  0.0666 
SBE  500  (100,5)  2.2833  2.4311  2.3274  0.0015  
10  mb EM  500  100  1.6504  2.6676  2.1391  0.0712 
SBE  500  (100,5)  1.6239  1.6242  1.6240  1.37  
12  mb EM  500  100  1.5815  2.6189  1.7853  0.0661 
SBE  500  (100,5)  1.5326  1.5891  1.5622  9.8 
4 Discussions
At the th iteration, SBE solves . Since is technically only piecewise Lipschitz continuous, the backward Euler may have multiple solutions, and we will obtain these solutions with certain probabilities. For example, in this 1D example, we get two solutions in the leftmost valley and in the second from the left. by SBE is the extrapolation of these solutions in expectation. The averaging step helps reduce variance. If is far to the right, then is dragged away from the leftmost valley to the second valley, i.e. passes the local minimum. During the above process, the objective value may increase, which explains the jump of objective value showed in the right plot of Fig. 1. In some cases, however, the jump simply does not appear. The reason can be that the second valley is wider and deeper, and thus is further to the right. The could be pulled to the second valley with the smaller objective value with some probability.
Acknowledgement
This work was partially supported by AFOSR grant FA95501510073 and ONR grant N000141612157. We would like to thank Dr. Bao Wang for helpful discussions.
References
 (1) D. Arthur and S. Vassilvitskii. ”means++: The advantages of careful seeding”, Proceedings of the eighteenth annual ACMSIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics (2007).
 (2) C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, ”Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses”, Physical review letters, 115(12), 128101 (2015).
 (3) D.P. Bertsekas: ”Nonlinear Programming” Second Edition, Athena Scientific, Belmont, Massachusetts, 2008.
 (4) L. Bottou and Y. Bengio, ”Convergence properties of the means algorithms”, Advances in neural information processing systems (1995).
 (5) P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. ”Entropysgd: Biasing gradient descent into wide valleys”, arXiv preprint arXiv:1611.01838 (2016).
 (6) P. Chaudhari, A. Oberman, S. Osher, S. Soatto, and G. Carlier, ”Deep Relaxation: partial differential equations for optimizing deep neural networks”, arXiv preprint arXiv:1704.04932 (2017).
 (7) Y. Ding, Y. Zhao, X. Shen, M. Musuvathi, and T. Mytkowicz, ”Yinyang means: A dropin replacement of the classic means with consistent speedup”, Proceedings of the 32nd International Conference on Machine Learning (2015).
 (8) C. Elkan, ”Using the triangle inequality to accelerate means”, Proceedings of the 20th International Conference on Machine Learning (2003).
 (9) A. Kaplan and R. Tichatschke, ”Proximal point method and nonconvex optimization”, Journal of Global Optimization, 13, 389406 (1998).
 (10) Y. LeCun, Y. Bengio, and G. Hinton, ”Deep learning”, Nature, 521(7553), 436444 (2015).
 (11) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, ”Gradientbased Learning Applied to Document Recognition”, Proceedings of the IEEE, 86(11), 22782324 (1998).
 (12) S. Lloyd, ”Least squares quantization in PCM”, IEEE transactions on information theory 28(2), 129137 (1982).
 (13) J. Newling and F. Fleuret, ”Nested MiniBatch means”, Advances in Neural Information Processing Systems (2016).
 (14) R. Rockafellar, ”Monotone operators and the proximal point algorithm”, SIAM J. Control and Optimization, 14, 877898 (1976).
 (15) D. Sculley, ”Webscale means clustering”, Proceedings of the 19th international conference on World wide web, ACM (2010).
 (16) C. Tang and C. Monteleoni. ”Convergence rate of stochastic means”, arXiv preprint arXiv:1610.04900 (2016).