# Stochastic Backward Euler: An Implicit Gradient Descent Algorithm for -means Clustering

## Abstract

In this paper, we propose an implicit gradient descent algorithm for the classic -means problem. The implicit gradient step or backward Euler is solved via stochastic fixed-point iteration, in which we randomly sample a mini-batch gradient in every iteration. It is the average of the fixed-point trajectory that is carried over to the next gradient step. We draw connections between the proposed stochastic backward Euler and the recent entropy stochastic gradient descent (Entropy-SGD) for improving the training of deep neural networks. Numerical experiments on various synthetic and real datasets show that the proposed algorithm finds the global minimum (or its neighborhood) with high probability, when given the correct number of clusters. The method provides better clustering results compared to -means algorithms in the sense that it decreased the objective function (the cluster) and is much more robust to initialization.

## 1Introduction

The -means method appeared in vector quantization in signal processing, and had now become popular for clustering analysis in data mining. In the seminal paper [12], Lloyd proposed a two-step alternating algorithm that quickly converges to a local minimum. Lloyd’s algorithm is also known as an instance of the more general Expectation-Maximization (EM) algorithm applied to Gaussian mixtures. In [4], Bottou and Bengio cast Lloyd’s algorithm as Newton’s method, which explains its fast convergence.

Aiming to speed up Lloyd’s algorithm, Elkan [8] proposed to keep track of the distances between the computed centroids and data points, and then cleverly leverage the triangle inequality to eliminate unnecessary computations of the distances. Similar techniques can be found in [7]. It is worth noting that these algorithms do not improve the clustering quality of Lloyd’s algorithm, but only achieve acceleration. However, there are well known examples where poor initialization can lead to low quality local minima for Lloyd’s algorithm. Random initialization has been used to avoid these low quality fixed points. The article [1] introduced a smart initialization scheme such that the initial centroids are well-separated, which gives more robust clustering than random initialization.

We are motivated by problems with very large data sets, where the cost of a single iteration of Lloyd’s algorithm can be expensive. Mini-batch [15] was later introduced to adapt -means for large scale data with high dimensions. The centroids are updated using a randomly selected mini-batch rather than all of the data. Mini-batch (stochastic) -means has a flavor of stochastic gradient descent whose benefits are twofold. First, it dramatically reduces the per-iteration cost for updating the centroids and thus is able to handle big data efficiently. Second, similar to its successful application to deep learning [10], mini-batch gradient introduces noise in minimization and may help to bypass some bad local minima. Furthermore, the aforementioned Elkan’s technique can be combined with mini-batch -means for further acceleration [13].

In this paper, we propose a backward Euler based algorithm for -means clustering. Fixed-point iteration is performed to solve the implicit gradient step. As is done for stochastic mini-batch -means, we compute the gradient only using a mini-batch of samples instead of the whole data, which enables us to handle massive data. Unlike the standard fixed-point iteration, the resulting stochastic fixed-point iteration outputs an average over its trajectory. Extensive experiments show that, with proper choice of step size for backward Euler, the proposed algorithm empirically locates the neighborhood of global minimum with high probability.

In other words, while Lloyd’s algorithm is effective with *a full gradient oracle* we achieve better performance with the weaker *mini-batch gradient oracle*. We are motivated by recent work by two of the authors [6] which applied a similar algorithm to accelerate the training of Deep Neural Networks.

## 2Stochastic backward Euler

The celebrated proximal point algorithm (PPA) [14] for minimizing some function is:

PPA has the advantage of being monotonically decreasing, which is guaranteed for any step size . Indeed, by the definition of in (Equation 1), we have

When for any with being the Lipschitz constant of , the (subsequential) convergence to a stationary point is established in [9]. If is differentiable at , it is easy to check that the following optimality condition to (Equation 1) holds

By rearranging the terms, we arrive at implicit gradient descent or the so-called backward Euler:

Fixed point iteration is a viable option for solving (Equation 2) if has Lipschitz constant and .

See [3].

Let us consider -means clustering for a set of data points in with centroids . We assume each cluster contains the same number of points. Denoting , we seek to minimize

Note that is non-differentiable at if there exist and such that

This means that there is a data point which has two or more distinct nearest centroids and . The same situation may happen in the assignment step of Lloyd’s algorithm. In this case, we simply assign to one of the nearest centroids. With that said, is basically piecewise differentiable. By abuse of notation, we can define the ’gradient’ of at any point by

where denotes the index set of the points that are assigned to the centroid . Similarly, we can compute the ’Hessian’ of as was done in [4]:

where is an -D vector of all ones. When the number of points is large and ’s are distinct from each other, the jumps at discontinuities of caused by the ambiguity in the assignment of data points to centroids are very small. Thus with (Equation 5), one can roughly consider to be Lipschitz continuous with the Lipschitz constant by ignoring these tiny jumps. In what follows, we analyze how the fixed point iteration ( ?) works on the piecewise differentiable with discontinuous .

According to the definition, we can see that is approximately piecewise -Lipschitz continuous. But in the extreme case where just one point is assigned to each of the first clusters and points to the last cluster, only has piecewise Lipschitz constant of . Our main result proves the convergence of fixed point iteration on -means problem.

(a) We know that is piecewise quadratic. Suppose (note that could be on the boundary), then has a uniform expression restricted on which is a quadratic function, denoted by . We can extend the domain of from to the whole , and we denote the extended function still by . Since is quadratic, is -Lipschitz continuous on . Then we have the following well-known inequality

Using the above inequality and the definition of , we have

In the second equality above, we used the identity

with and . Since , is monotonically decreasing. Moreover, since is bounded from below by 0, converges and thus as .

(b) Since as , combining with the fact that , we have is bounded. Consider a convergent subsequence whose limit lies in the interior of some sub-domain. Then for sufficiently large , will always remain in the same sub-domain in which lies and thus . Since by (a), , we have as . Therefore,

which implies is a fixed point. Furthermore, by the piecewise Lipschitz condition,

Since , when is sufficiently large, is also in the same sub-domain containing . By repeatedly applying the above inequality for , we conclude that converges to .

### 2.1Algorithm description

Instead of using the full gradient in fixed-point iteration, we adopt a randomly sampled mini-batch gradient

at the -th inner iteration. Here, denotes the index set of the points in the -th mini-batch associated with the centroid obeying . The fixed-point iteration outputs a forward looking average over its trajectory. Intuitively averaging greatly stabilizes the noisy mini-batch gradients and thus smooths the descent. We summarize the proposed algorithm in Algorithm ?. Another key ingredient of our algorithm is an aggressive initial step size , which helps pass bad local minimum at the early stage. Unlike in deterministic backward Euler, diminishing step size is needed to ensure convergence. But should decay slowly because large step size is good for a global search.

### 2.2Related work

Chaudhari et al. [5] recently proposed the entropy stochastic gradient descent (Entropy-SGD) algorithm to tackle the training of deep neural networks. Relaxation techniques arising in statistical physics were used to change the energy landscape of the original non-convex objective function yet with the minimizers being preserved, which allows easier minimization to obtain a ’good’ minimizer with a better geometry. More precisely, they suggest to replace with a modified objective function called local entropy [2] as follows

where is the heat kernel. The connection between Entropy-SGD and nonlinear partial differential equations (PDEs) was later established in [6]. The local entropy function turns out to be the solution to the following viscous Hamilton-Jacobi (HJ) PDE at

with the initial condition . In the limit , (Equation 6) reduces to the non-viscous HJ equation

whose solution is closely related to the proximal operator :

The gradient descent dynamics for is obtained by taking the limit of the following system of stochastic differential equation as the homogenization parameter :

where is the standard Wiener process. Specifically, we have

with and being the solution of (Equation 7) for fixed x. This gives rise to the implementation of Entropy-SGD [6]. We remark that stochastic backward Euler is equivalent to Entropy-SGD with the step size of the gradient flow being equal to .

## 3Experimental results

We show by several experiments that the proposed stochastic backward Euler (SBE) gives superior clustering results compared with the state-of-the-art algorithms for -means. SBE scales well for large problems. In practice, only a small number of fixed-point iterations are needed in the inner loop, and this seems not to depend on the size of the problem. Specifically, we chose the parameters `imaxit`

= 5 or and the averaging parameter in all experiments. Moreover, we always set .

### 3.12-D synthetic Gaussian data

We generated 4000 synthetic data points in 2-D plane by multivariate normal distributions with 1000 points in each cluster. The means and covariance matrices used for Gaussian distributions are as follows:

For the initial centroids given below, Lloyd’s algorithm (or EM) got stuck at a local minimum; see the left plot of Fig. ?.

Starting from where EM got stuck, we can see that SBE managed to jump over the trap of local minimum and arrived at the global minimum; see the right plot of Fig. ?.

### 3.2Iris dataset

The Iris dataset, which contains 150 4-D data samples from 3 clusters, was used for comparisons between SBE and the EM algorithm. 100 runs were realized with the initial centroids randomly selected from the data samples. For the parameters, we chose mini-batch size , initial step size, `imaxit`

, `omaxit`

, and decay parameter . The histograms in Fig. ? record the frequency of objective values given by the two algorithms. Clearly there was chance that EM got stuck at a local minimum whose value is about 0.48, whereas ESGD managed to locate the near global minimum region valued at around 0.264 *every time*.

### 3.3Gaussian data with MNIST centroids

We selected 8 hand-written digit images of dimension from MNIST dataset shown in Figure 1, and then generated 60,000 images from these 8 centroids by adding Gaussian noise. We compare SBE with both EM and mini-batch EM (mb-EM) [15] on 100 independent realizations with random initial guess. For each method, we recorded the minimum, maximum, mean and variance of the 100 objective values by the computed centroids.

We first compare SBE and EM with the true number of clusters . For SBE, mini-batch size , maximum number of iterations for backward Euler `omaxit`

=150, maximum fixed-point iterations `imaxit`

= 10 for SBE. We set the maximum number of iterations for EM to be 50, which was sufficient for its convergence. The results are listed in the first two rows of Table ?. We observed that the global minimum was around 15.68 and that SBE always found the global minimum up to a tiny error due to the noise from mini-batch.

In the comparison between SBE and mb-EM, we reduced mini-batch size to , `omaxit`

, `imaxit`

and tested for . Table ? shows that with the same mini-batch size, SBE outperforms mb-EM in all three cases, in terms of both mean and variance of the objective values.

Method | Batch size | Max iter | Min | Max | Mean | Variance | |
---|---|---|---|---|---|---|---|

8 |
EM | 7500 | 50 | 15.6800 | 27.2828 | 20.0203 | 6.0030 |

SBE | 1000 | (150,10) | 15.6808 | 15.6808 | 15.6808 | 1.49 | |

6 |
mb-EM | 500 | 100 | 20.44 | 23.4721 | 21.8393 | 0.67 |

SBE | 500 | (100,5) | 20.2989 | 21.2047 | 20.4939 | 0.0439 | |

8 |
mb-EM | 500 | 100 | 15.9193 | 18.5820 | 16.4009 | 0.7646 |

SBE | 500 | (100,5) | 15.6816 | 15.6821 | 15.6820 | 1.18 | |

10 |
mb-EM | 500 | 100 | 15.9148 | 18.1848 | 16.1727 | 0.4332 |

SBE | 500 | (100,5) | 15.6823 | 15.6825 | 15.6824 | 1.5 | |

### 3.4Raw MNIST data

In this example, We used the 60,000 images from the MNIST training set for clustering test, with 6000 samples for each digit (cluster) from 0 to 9. The comparison results are shown in Table ?. We conclude that SBE consistently performs better than EM and mb-EM. The histograms of objective value by the three algorithms in the case are plotted in Fig. ?.

Method | Batch size | Max iter | Min | Max | Mean | Variance | |
---|---|---|---|---|---|---|---|

10 |
EM | 6000 | 50 | 19.6069 | 19.8195 | 19.6725 | 0.0028 |

SBE | 1000 | (150,10) | 19.6087 | 19.7279 | 19.6201 | 5.7 | |

8 |
mb-EM | 500 | 100 | 20.4948 | 20.7126 | 20.5958 | 0.0018 |

SBE | 500 | (100,5) | 20.2723 | 20.4104 | 20.3090 | 0.0014 | |

10 |
mb-EM | 500 | 100 | 19.9029 | 20.2347 | 20.0146 | 0.0041 |

SBE | 500 | (100,5) | 19.6103 | 19.7293 | 19.6354 | 0.0011 | |

12 |
mb-EM | 500 | 100 | 19.3978 | 19.7147 | 19.5136 | 0.0042 |

SBE | 500 | (100,5) | 19.0492 | 19.1582 | 19.0972 | 6.2 | |

### 3.5MNSIT features

We extracted the feature vectors of MNIST training data prior to the last layer of LeNet-5 [11]. The feature vectors have dimension 64 and lie in a better manifold compared with the raw data. The results are shown in Table 1 and Fig. ? and Figure 2.

Method | Batch size | Max iter | Min | Max | Mean | Variance | |
---|---|---|---|---|---|---|---|

10 |
EM | 6000 | 50 | 1.6238 | 3.0156 | 2.1406 | 0.0977 |

SBE | 1000 | (150,10) | 1.6238 | 1.6239 | 1.6239 | 2.7 | |

8 |
mb EM | 500 | 100 | 2.3428 | 3.5972 | 2.7157 | 0.0666 |

SBE | 500 | (100,5) | 2.2833 | 2.4311 | 2.3274 | 0.0015 | |

10 |
mb EM | 500 | 100 | 1.6504 | 2.6676 | 2.1391 | 0.0712 |

SBE | 500 | (100,5) | 1.6239 | 1.6242 | 1.6240 | 1.37 | |

12 |
mb EM | 500 | 100 | 1.5815 | 2.6189 | 1.7853 | 0.0661 |

SBE | 500 | (100,5) | 1.5326 | 1.5891 | 1.5622 | 9.8 | |

## 4Discussions

At the -th iteration, SBE solves . Since is technically only piecewise Lipschitz continuous, the backward Euler may have multiple solutions, and we will obtain these solutions with certain probabilities. For example, in this 1D example, we get two solutions in the leftmost valley and in the second from the left. by SBE is the extrapolation of these solutions in expectation. The averaging step helps reduce variance. If is far to the right, then is dragged away from the leftmost valley to the second valley, i.e. passes the local minimum. During the above process, the objective value may increase, which explains the jump of objective value showed in the right plot of Fig. ?. In some cases, however, the jump simply does not appear. The reason can be that the second valley is wider and deeper, and thus is further to the right. The could be pulled to the second valley with the smaller objective value with some probability.

## Acknowledgement

This work was partially supported by AFOSR grant FA9550-15-1-0073 and ONR grant N00014-16-1-2157. We would like to thank Dr. Bao Wang for helpful discussions.

### References

- D. Arthur and S. Vassilvitskii. “-means++: The advantages of careful seeding”, Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics (2007).
- C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, “Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses”, Physical review letters, 115(12), 128-101 (2015).
- D.P. Bertsekas: “Nonlinear Programming” Second Edition, Athena Scientific, Belmont, Massachusetts, 2008.
- L. Bottou and Y. Bengio, “Convergence properties of the -means algorithms”, Advances in neural information processing systems (1995).
- P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina. “Entropy-sgd: Biasing gradient descent into wide valleys”, arXiv preprint arXiv:1611.01838 (2016).
- P. Chaudhari, A. Oberman, S. Osher, S. Soatto, and G. Carlier, “Deep Relaxation: partial differential equations for optimizing deep neural networks”, arXiv preprint arXiv:1704.04932 (2017).
- Y. Ding, Y. Zhao, X. Shen, M. Musuvathi, and T. Mytkowicz, “Yinyang -means: A drop-in replacement of the classic -means with consistent speedup”, Proceedings of the 32nd International Conference on Machine Learning (2015).
- C. Elkan, “Using the triangle inequality to accelerate -means”, Proceedings of the 20th International Conference on Machine Learning (2003).
- A. Kaplan and R. Tichatschke, “Proximal point method and nonconvex optimization”, Journal of Global Optimization, 13, 389-406 (1998).
- Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning”, Nature, 521(7553), 436-444 (2015).
- Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, “Gradient-based Learning Applied to Document Recognition”, Proceedings of the IEEE, 86(11), 2278-2324 (1998).
- S. Lloyd, “Least squares quantization in PCM”, IEEE transactions on information theory 28(2), 129-137 (1982).
- J. Newling and F. Fleuret, “Nested Mini-Batch -means”, Advances in Neural Information Processing Systems (2016).
- R. Rockafellar, “Monotone operators and the proximal point algorithm”, SIAM J. Control and Optimization, 14, 877-898 (1976).
- D. Sculley, “Web-scale -means clustering”, Proceedings of the 19th international conference on World wide web, ACM (2010).
- C. Tang and C. Monteleoni. “Convergence rate of stochastic -means”, arXiv preprint arXiv:1610.04900 (2016).