
NESTEROV’S SMOOTHING TECHNIQUE AND MINIMIZING DIFFERENCES OF CONVEX FUNCTIONS FOR HIERARCHICAL CLUSTERING

N. M. NAM¹, W. GEREMEW², S. REYNOLDS³, and T. TRAN⁴

¹Fariborz Maseeh Department of Mathematics and Statistics, Portland State University, Portland, OR 97207, USA (mau.nam.nguyen@pdx.edu). Research of this author was partly supported by the National Science Foundation under grant #1411817.
²School of General Studies, Stockton University, Galloway, NJ 08205, USA (wondi.geremew@stockton.edu).
³Fariborz Maseeh Department of Mathematics and Statistics, Portland State University, Portland, OR 97207, USA (ser6@pdx.edu).
⁴Fariborz Maseeh Department of Mathematics and Statistics, Portland State University, Portland, OR 97207, USA (tuyen2@pdx.edu).


Abstract. A bilevel hierarchical clustering model is commonly used in designing optimal multicast networks. In this paper, we consider two different formulations of the bilevel hierarchical clustering problem, which is a discrete optimization problem that can be shown to be NP-hard. Our approach is to reformulate the problem as a continuous optimization problem by relaxing the discreteness conditions. Nesterov's smoothing technique and a numerical algorithm for minimizing differences of convex functions, called the DCA, are then applied to cope with the nonsmoothness and nonconvexity of the problem. Numerical examples are provided to illustrate our method.
Key words. DC programming, Nesterov’s smoothing technique, hierarchical clustering, subgradient, Fenchel conjugate.
AMS subject classifications. 49J52, 49J53, 90C31

1 Introduction

Although convex optimization techniques and numerical algorithms have been the topics of extensive research for more than 50 years, solving large-scale optimization problems without the presence of convexity remains a challenge. This motivates the search for new optimization methods that are capable of handling broader classes of functions and sets where convexity is not assumed. One of the most successful approaches to going beyond convexity is to consider the class of functions representable as differences of two convex functions. Functions of this type are called DC functions, where DC stands for difference of convex. It was recognized early by P. Hartman [10] that the class of DC functions has many nice algebraic properties. For instance, this class is closed under many operations commonly used in optimization, such as taking linear combinations, maxima, or products of finitely many DC functions.

Given a linear space, a DC program is an optimization problem in which the objective function can be represented as $f = g - h$, where $g$ and $h$ are convex functions on that space. This extension of convex programming enables us to take advantage of the available tools from convex analysis and optimization. At the same time, DC programming is sufficiently broad to be used in solving many nonconvex optimization problems faced in recent applications. Another feature of DC programming is that it possesses a very nice duality theory; see [3] and the references therein. Although the importance of DC programming for applications had been recognized much earlier, the first algorithm for minimizing differences of convex functions, called the DCA, was introduced by Tao and An in [3, 9]. The DCA is a simple but effective optimization scheme used extensively in DC programming and its applications.

Cluster analysis, or clustering, is one of the most important problems in many fields such as machine learning, pattern recognition, image analysis, data compression, and computer graphics. Given a finite number of data points in a metric space, a centroid-based clustering problem seeks a finite number of cluster centers, with each data point assigned to the nearest cluster center, in such a way that a certain total distance, a measure of dissimilarity among the data points, is minimized. Since many kinds of data encountered in practical applications have nested structures, it is often necessary to use multilevel hierarchical clustering, which involves grouping a data set into a hierarchy of clusters. In this paper, we apply a mathematical optimization approach to the bilevel hierarchical clustering problem. In fact, using mathematical optimization in clustering is a very promising approach for overcoming many disadvantages of the k-means algorithm commonly used in clustering; see [1, 2, 5, 15] and the references therein. In particular, the DCA was successfully applied in [2] to a bilevel hierarchical clustering problem in which the distance measurement is given by the squared Euclidean distance. Although the DCA in [2] provides an effective way to solve the bilevel hierarchical clustering problem in high dimensions, it has not been used to solve the original model proposed in [5], which is based on the Euclidean distance measurement; according to the authors of [2], that model is not suitable for the resulting DCA. By applying Nesterov's smoothing technique together with the DCA, we are able to solve the original model proposed in [5] in high dimensions.

The paper is organized as follows. In Section 2, we present basic definitions and tools of optimization that are used throughout the paper. In Section 3, we study two models of the bilevel hierarchical clustering problem, along with two new algorithms based on Nesterov's smoothing technique and the DCA. Numerical examples and conclusions are presented in Section 4 and Section 5, respectively.

2 Basic Definitions and Tools of Optimization

In this section, we present the two main tools of optimization used to solve the bilevel hierarchical clustering problem: the DCA introduced by Pham Dinh Tao, and Nesterov's smoothing technique.

Throughout the paper we consider the DC programming problem

$$\text{minimize } f(x) := g(x) - h(x), \quad x \in \mathbb{R}^d, \qquad (2.1)$$

where $g\colon \mathbb{R}^d \to \mathbb{R}$ and $h\colon \mathbb{R}^d \to \mathbb{R}$ are convex functions. The function $f$ in (2.1) is called a DC function and $g - h$ is called a DC decomposition of $f$.

Given a convex function $g\colon \mathbb{R}^d \to \mathbb{R}$, the Fenchel conjugate of $g$ is defined by

$$g^*(y) := \sup\{\langle y, x\rangle - g(x) \mid x \in \mathbb{R}^d\}, \quad y \in \mathbb{R}^d.$$

Note that $g^*$ is also a convex function. In addition, $x \in \partial g^*(y)$ if and only if $y \in \partial g(x)$, where $\partial$ denotes the subdifferential operator in the sense of convex analysis; see, e.g., [19-21].

Let us present below the DCA introduced by Tao and An [3, 9] as applied to (2.1). Although the algorithm is used for nonconvex optimization problems, the convexity of the functions involved still plays a crucial role.

The DCA

INPUT: $x_1 \in \mathbb{R}^d$, $N \in \mathbb{N}$.
for $k = 1, \ldots, N$ do
  Find $y_k \in \partial h(x_k)$.
  Find $x_{k+1} \in \partial g^*(y_k)$.
end for
OUTPUT: $x_{N+1}$.
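
To illustrate the scheme computationally, below is a minimal Python sketch of the DCA iteration. The helper names subgrad_h and solve_g_subproblem are our own placeholders, not part of the paper: the former returns an element of $\partial h(x_k)$ and the latter an element of $\partial g^*(y_k)$, i.e., a minimizer of $g(x) - \langle y_k, x\rangle$.

import numpy as np

def dca(x0, subgrad_h, solve_g_subproblem, num_iter=100):
    # Generic DCA loop for minimizing f = g - h.
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iter):
        y = subgrad_h(x)               # y_k in the subdifferential of h at x_k
        x = solve_g_subproblem(y)      # x_{k+1} in the subdifferential of g* at y_k
    return x

# Toy usage with f(x) = |x| + 0.5 x^2, written as g(x) = |x| + 1.5 x^2 and h(x) = x^2.
def subgrad_h(x):
    return 2.0 * x

def solve_g_subproblem(y):
    # argmin_x |x| + 1.5 x^2 - y x, solved by soft-thresholding
    return np.sign(y) * np.maximum(np.abs(y) - 1.0, 0.0) / 3.0

print(dca(np.array([3.0]), subgrad_h, solve_g_subproblem))   # converges to 0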

Let us discuss below a convergence result of DC programming. A function $g\colon \mathbb{R}^d \to \mathbb{R}$ is called $\gamma$-convex ($\gamma \geq 0$) if the function $x \mapsto g(x) - \frac{\gamma}{2}\|x\|^2$, $x \in \mathbb{R}^d$, is convex. If there exists $\gamma > 0$ such that $g$ is $\gamma$-convex, then $g$ is called strongly convex. We say that an element $\bar{x} \in \mathbb{R}^d$ is a critical point of the function $f$ defined by (2.1) if $\partial g(\bar{x}) \cap \partial h(\bar{x}) \neq \emptyset$.

Obviously, in the case where both $g$ and $h$ are differentiable, $\bar{x}$ is a critical point of $f$ if and only if $\bar{x}$ satisfies the Fermat rule $\nabla f(\bar{x}) = 0$. The theorem below provides a convergence result for the DCA. It can be derived directly from [9, Theorem 3.7].

Theorem 2.1

Consider the function $f$ defined by (2.1) and the sequence $\{x_k\}$ generated by the DCA. The following properties are valid:
(i) If $g$ is $\gamma_1$-convex and $h$ is $\gamma_2$-convex, then
$$f(x_k) - f(x_{k+1}) \geq \frac{\gamma_1 + \gamma_2}{2}\,\|x_{k+1} - x_k\|^2 \quad \text{for all } k \in \mathbb{N}.$$
(ii) The sequence $\{f(x_k)\}$ is monotone decreasing.
(iii) If $f$ is bounded from below, $g$ is $\gamma_1$-convex and $h$ is $\gamma_2$-convex with $\gamma_1 + \gamma_2 > 0$, and $\{x_k\}$ is bounded, then every subsequential limit of the sequence $\{x_k\}$ is a critical point of $f$.

Now we present a direct consequence of Nesterov's smoothing technique given in [16]. In the proposition below, $d(x; \Omega)$ denotes the Euclidean distance and $P(x; \Omega)$ denotes the Euclidean projection from a point $x$ to a nonempty closed convex set $\Omega$ in $\mathbb{R}^d$.

Proposition 2.2

Given any $a \in \mathbb{R}^d$ and $\mu > 0$, a Nesterov smoothing approximation of $\varphi(x) := \|x - a\|$ defined on $\mathbb{R}^d$ has the representation

$$\varphi_\mu(x) := \frac{1}{2\mu}\|x - a\|^2 - \frac{\mu}{2}\Big[d\Big(\frac{x - a}{\mu}; \mathbb{B}\Big)\Big]^2. \qquad (2.2)$$

Moreover, $\nabla \varphi_\mu(x) = P\big(\frac{x - a}{\mu}; \mathbb{B}\big)$ and

$$\varphi_\mu(x) \leq \varphi(x) \leq \varphi_\mu(x) + \frac{\mu}{2} \quad \text{for all } x \in \mathbb{R}^d,$$

where $\mathbb{B}$ is the closed unit ball of $\mathbb{R}^d$.
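
As a quick computational check of (2.2), the following Python sketch (our illustration, not part of the original text) evaluates the smooth approximation and its gradient and verifies the bounds $\varphi_\mu \leq \varphi \leq \varphi_\mu + \mu/2$ numerically.

import numpy as np

def proj_unit_ball(u):
    # Euclidean projection of u onto the closed unit ball
    norm_u = np.linalg.norm(u)
    return u if norm_u <= 1.0 else u / norm_u

def smoothed_norm(x, a, mu):
    # Nesterov smoothing of phi(x) = ||x - a|| as in (2.2)
    z = (x - a) / mu
    dist_to_ball = np.linalg.norm(z - proj_unit_ball(z))
    return np.linalg.norm(x - a) ** 2 / (2.0 * mu) - (mu / 2.0) * dist_to_ball ** 2

def smoothed_norm_grad(x, a, mu):
    # gradient of the smoothed function: projection of (x - a)/mu onto the unit ball
    return proj_unit_ball((x - a) / mu)

rng = np.random.default_rng(0)
x, a, mu = rng.normal(size=5), rng.normal(size=5), 0.1
phi, phi_mu = np.linalg.norm(x - a), smoothed_norm(x, a, mu)
print(phi_mu <= phi <= phi_mu + mu / 2.0)     # should print True
print(smoothed_norm_grad(x, a, mu))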

3 The Bilevel Hierarchical Clustering Problem

Given a set of $n$ points (nodes) $a^1, \ldots, a^n$ in $\mathbb{R}^d$, our goal is to decompose this set into $k$ clusters. In each cluster, we would like to find a point among the nodes and assign it as the center of the cluster, with all points in the cluster connected to this center. Then we find a total center among the given points $a^1, \ldots, a^n$, and all cluster centers are connected to this total center. The goal is to minimize the total transportation cost of this tree, computed as the sum of the distances from the total center to each cluster center and from each cluster center to the nodes in its cluster. This is a discrete optimization problem which can be shown to be NP-hard. We will solve this problem based on continuous optimization techniques.

3.1 The Bilevel Hierarchical Clustering: Model I

The difficulty in solving this hierarchical clustering problem lies in the fact that the cluster centers and the total center have to be chosen among the given nodes. We first relax this condition by using $k$ artificial centers $x^1, \ldots, x^k$ that may lie anywhere in $\mathbb{R}^d$. Then we define the total center as the centroid of $x^1, \ldots, x^k$, given by

$$c := \frac{1}{k}\sum_{l=1}^{k} x^l.$$
The total cost of the tree is then given by

$$f(x^1, \ldots, x^k) := \sum_{i=1}^{n} \min_{l=1,\ldots,k} \|x^l - a^i\| + \sum_{l=1}^{k} \|x^l - c\|.$$

Note that here each node $a^i$ is assigned to its closest center. However, what we ultimately need are real centers, i.e., centers located at the nodes; this can be achieved approximately by minimizing the distance between each artificial center and its nearest node. To achieve this goal, define the function

$$\psi(x^1, \ldots, x^k) := \sum_{l=1}^{k} \min_{i=1,\ldots,n} \|x^l - a^i\|.$$

Observe that $\psi(x^1, \ldots, x^k) = 0$ if and only if for every $l \in \{1, \ldots, k\}$ there exists $i \in \{1, \ldots, n\}$ such that $x^l = a^i$, which means that each $x^l$ is a real node. Therefore, we consider the constrained minimization problem:

$$\text{minimize } f(x^1, \ldots, x^k) \quad \text{subject to } \psi(x^1, \ldots, x^k) = 0, \quad x^1, \ldots, x^k \in \mathbb{R}^d.$$

This problem can be converted to an unconstrained minimization problem:

$$\text{minimize } f(x^1, \ldots, x^k) + \lambda\, \psi(x^1, \ldots, x^k), \quad x^1, \ldots, x^k \in \mathbb{R}^d, \qquad (3.1)$$

where $\lambda > 0$ is a penalty parameter. As in the classical clustering problem, this new problem is nonsmooth and nonconvex, and we solve it by combining a smoothing technique with the DCA. Note that a particular case of this model in two dimensions was considered in [5], where the problem was solved using the derivative-free discrete gradient method established in [4]; that method, however, is not suitable for large-scale settings in high dimensions. The DCA was used in [2] to solve a similar model in which the squared Euclidean distance serves as the distance measurement. The authors of [2] also pointed out that formulation (3.1) is not suitable for the resulting DCA. Nevertheless, we will show in what follows that the DCA is applicable to this model when combined with Nesterov's smoothing technique.
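
To make the model concrete, here is a small Python sketch (our illustration, using the cost and penalty functions written above) that evaluates the Model I objective $f + \lambda\psi$ for a given set of artificial centers.

import numpy as np

def model1_objective(X, A, lam):
    # X: (k, d) artificial centers; A: (n, d) nodes; lam: penalty parameter
    D = np.linalg.norm(A[:, None, :] - X[None, :, :], axis=2)   # D[i, l] = ||x^l - a^i||
    c = X.mean(axis=0)                                           # total center: centroid of the centers
    cost = D.min(axis=1).sum() + np.linalg.norm(X - c, axis=1).sum()
    penalty = D.min(axis=0).sum()                                # distance of each center to its nearest node
    return cost + lam * penalty

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 2))                        # 50 nodes in the plane
X = A[rng.choice(50, size=3, replace=False)]        # 3 centers initialized at nodes
print(model1_objective(X, A, lam=1.0))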

Note that the functions $f$ and $\psi$ in (3.1) belong to the class of DC functions. Indeed, using the elementary identity $\min_{l=1,\ldots,k} \alpha_l = \sum_{l=1}^{k} \alpha_l - \max_{r=1,\ldots,k} \sum_{l \neq r} \alpha_l$, valid for any real numbers $\alpha_1, \ldots, \alpha_k$, we obtain the DC decompositions

$$f(x^1,\ldots,x^k) = \Big[\sum_{i=1}^{n}\sum_{l=1}^{k}\|x^l-a^i\| + \sum_{l=1}^{k}\|x^l-c\|\Big] - \sum_{i=1}^{n}\max_{r=1,\ldots,k}\sum_{l\neq r}\|x^l-a^i\|,$$
$$\psi(x^1,\ldots,x^k) = \sum_{l=1}^{k}\sum_{i=1}^{n}\|x^l-a^i\| - \sum_{l=1}^{k}\max_{j=1,\ldots,n}\sum_{i\neq j}\|x^l-a^i\|.$$

It follows that the objective function in (3.1) has the DC decomposition

$$f + \lambda\psi = \Big[\sum_{i=1}^{n}\sum_{l=1}^{k}\|x^l-a^i\| + \sum_{l=1}^{k}\|x^l-c\| + \lambda\sum_{l=1}^{k}\sum_{i=1}^{n}\|x^l-a^i\|\Big] - \Big[\sum_{i=1}^{n}\max_{r=1,\ldots,k}\sum_{l\neq r}\|x^l-a^i\| + \lambda\sum_{l=1}^{k}\max_{j=1,\ldots,n}\sum_{i\neq j}\|x^l-a^i\|\Big].$$

This DC decomposition is not suitable for applying the DCA because a subgradient of the Fenchel conjugate of its first (convex) component cannot be computed in closed form.

In the next step, we apply Nesterov's smoothing technique from Proposition 2.2 to approximate the objective function by a new DC function that is favorable for applying the DCA. To accomplish this goal, we simply replace each term of the form $\|x^l - a^i\|$ and $\|x^l - c\|$ in the first part of the decomposition (the positive part) by the smooth approximation (2.2), while keeping the second part (the negative part) unchanged. As a result, we obtain

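Concretely, under the DC decomposition written above, the smoothed objective can be evaluated as in the following Python sketch (our own reconstruction and naming): each norm in the positive part is replaced by the value of (2.2), while the max terms in the negative part are kept exact.

import numpy as np

def phi_mu(diff, mu):
    # smooth approximation (2.2) of ||diff||, applied along the last axis
    norms = np.linalg.norm(diff, axis=-1)
    dist_to_ball = np.maximum(norms / mu - 1.0, 0.0)     # distance of diff/mu to the unit ball
    return norms ** 2 / (2.0 * mu) - (mu / 2.0) * dist_to_ball ** 2

def smoothed_model1_objective(X, A, lam, mu):
    diffs = X[None, :, :] - A[:, None, :]                # shape (n, k, d)
    c = X.mean(axis=0)
    positive = (1.0 + lam) * phi_mu(diffs, mu).sum() + phi_mu(X - c, mu).sum()
    D = np.linalg.norm(diffs, axis=2)                    # D[i, l] = ||x^l - a^i||
    negative = (D.sum(axis=1) - D.min(axis=1)).sum() \
               + lam * (D.sum(axis=0) - D.min(axis=0)).sum()
    return positive - negative

rng = np.random.default_rng(5)
A = rng.normal(size=(50, 2))
X = A[:3].copy()
print(smoothed_model1_objective(X, A, lam=1.0, mu=0.5))
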
The original bilevel hierarchical clustering problem now can be solved using a DC program:

In this formulation, and are convex functions on defined by

with their respective components defined as

To facilitate the gradient and subgradient calculations for the DCA, we introduce a data matrix $A$ and a variable matrix $X$. The data matrix $A$ is formed by putting each $a^i$, $i = 1, \ldots, n$, in the $i$th row, i.e.,

$$A := \begin{bmatrix} a^1 \\ \vdots \\ a^n \end{bmatrix} \in \mathbb{R}^{n \times d}.$$

Similarly, if $x^1, \ldots, x^k$ are the cluster centers, then the variable matrix $X$ is formed by putting each $x^l$, $l = 1, \ldots, k$, in the $l$th row, i.e.,

$$X := \begin{bmatrix} x^1 \\ \vdots \\ x^k \end{bmatrix} \in \mathbb{R}^{k \times d}.$$

Then the variable matrix of the optimization problem belongs to $\mathbb{R}^{k \times d}$, the linear space of $k$ by $d$ real matrices equipped with the inner product $\langle U, V \rangle := \operatorname{trace}(U^\top V)$. The Frobenius norm on $\mathbb{R}^{k \times d}$ is defined by

$$\|X\|_F := \sqrt{\langle X, X \rangle} = \Big(\sum_{l=1}^{k} \sum_{j=1}^{d} |x_{lj}|^2\Big)^{1/2}.$$

Finally, we represent the average of the cluster centers by $c$, i.e., $c := \frac{1}{k}\sum_{l=1}^{k} x^l$.
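
In code, the matrix setup above might look as follows (a short illustration; the variable names are ours).

import numpy as np

rng = np.random.default_rng(2)
n, k, d = 100, 4, 3
A = rng.normal(size=(n, d))                          # data matrix: row i is the node a^i
X = A[rng.choice(n, size=k, replace=False)].copy()   # variable matrix: row l is the center x^l

def frobenius_inner(U, V):
    # inner product <U, V> = trace(U^T V) on the space of k x d matrices
    return np.trace(U.T @ V)

print(np.sqrt(frobenius_inner(X, X)), np.linalg.norm(X, "fro"))   # Frobenius norm, two ways
print(X.mean(axis=0))                                              # average c of the cluster centers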

Gradient and Subgradient Calculations for the DCA

Let us start by computing the gradient of

Using the Frobenius norm, the function can equivalently be written as

where is a matrix whose entries are all ones. Hence, one can see that is differentiable and its gradient is given by

Similarly, can equivalently be written as

where is a matrix whose entries are all ones. Hence, is differentiable and its gradient is given by

Since , its gradient can be computed by

Therefore,

Our goal now is to compute , which can be accomplished by the relation

The latter can equivalently be written as

Then with some algebraic manipulation we can show that

Next, we will demonstrate in more detail the techniques we used in finding a subgradient for the convex function . Recall that is defined by

We will start with the function given by

From its representation, one can see that this function is differentiable, and hence its subgradient coincides with its gradient, which can be computed via the partial derivatives with respect to each $x^l$, i.e.,

Thus, for , is the matrix whose row is .

Similarly, one can see that the function given by

is differentiable with its partial derivatives computed by

Hence, for , is the matrix whose row is .

Unlike the two functions just considered, the remaining convex functions in the decomposition are not differentiable, but both can be written as a finite sum of maxima of finitely many convex functions. Let us compute a subgradient of the first of them as an example. We have

where, for ,

Then, for each term in the outer sum, we find a subgradient of the corresponding maximum according to the subdifferential rule for the maximum of convex functions; summing these subgradients then yields, by the subdifferential sum rule, a subgradient of the whole function at the point in question. To accomplish this goal, we first choose an index from the index set at which the maximum is attained, i.e., such that

Using the familiar subdifferential formula of the Euclidean norm function, the row for of the matrix is determined as follows

The row of the matrix is .

The procedure for calculating a subgradient of the function given by

is very similar to what we just demonstrated for .
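
The construction just described can be summarized in a short Python sketch (our illustration, written for a function of the form $\sum_{i=1}^{n}\max_{r}\sum_{l\neq r}\|x^l-a^i\|$ appearing in the decomposition above): for each $i$ we pick an index attaining the maximum and then add up subgradients of the Euclidean norms, taking $0$ where a norm vanishes.

import numpy as np

def subgrad_sum_max(X, A):
    # one subgradient of F(X) = sum_i max_r sum_{l != r} ||x^l - a^i||
    k, d = X.shape
    n = A.shape[0]
    Y = np.zeros((k, d))
    D = np.linalg.norm(X[None, :, :] - A[:, None, :], axis=2)   # D[i, l] = ||x^l - a^i||
    row_sums = D.sum(axis=1)
    for i in range(n):
        r = int(np.argmax(row_sums[i] - D[i]))   # index attaining max_r sum_{l != r} D[i, l]
        for l in range(k):
            if l == r:
                continue
            diff = X[l] - A[i]
            norm = np.linalg.norm(diff)
            if norm > 0:
                Y[l] += diff / norm              # subgradient of ||x^l - a^i|| when x^l != a^i
    return Y

rng = np.random.default_rng(6)
A = rng.normal(size=(20, 2))
X = rng.normal(size=(3, 2))
print(subgrad_sum_max(X, A))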

At this point, we are ready to give a new DCA based algorithm for Model I.

Algorithm 1.

INPUT: $X_0$, $\lambda_0 > 0$, $\mu_0 > 0$.
while stopping criteria ($\lambda$, $\mu$) = false do
  for $p = 1, \ldots, N$ do
    Find $Y_p \in \partial h(X_{p-1})$.
    Find $X_p \in \partial g^*(Y_p)$.
  end for
  update $\lambda$ and $\mu$.
end while
OUTPUT: $X_N$.
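
Structurally, Algorithm 1 wraps the DCA in an outer loop that gradually updates the penalty and smoothing parameters. The following Python sketch shows one way this might be organized; the update factors, iteration counts, and the toy inner step are our own illustrative choices, not values prescribed by the paper.

import numpy as np

def algorithm1(X0, A, dca_step, lam0=1.0, mu0=1.0,
               lam_growth=2.0, mu_decay=0.5, inner_iters=50, outer_iters=10):
    # Outer loop: run the DCA on the smoothed, penalized objective, then update lambda and mu.
    X, lam, mu = X0.copy(), lam0, mu0
    for _ in range(outer_iters):
        for _ in range(inner_iters):
            X = dca_step(X, A, lam, mu)
        lam *= lam_growth    # strengthen the penalty: push centers toward real nodes
        mu *= mu_decay       # sharpen the smooth approximation of the norm terms
    return X

def toy_dca_step(X, A, lam, mu):
    # stand-in for one DCA step (NOT the paper's update): nudge each center toward its nearest node
    D = np.linalg.norm(X[:, None, :] - A[None, :, :], axis=2)
    return X + 0.1 * (A[D.argmin(axis=1)] - X)

rng = np.random.default_rng(3)
A = rng.normal(size=(60, 2))
X0 = A[rng.choice(60, size=3, replace=False)]
print(algorithm1(X0, A, toy_dca_step))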

3.2 The Bilevel Hierarchical Clustering: Model II

In this section, we introduce the second model for solving the bilevel hierarchical clustering problem. In this model, we use an additional variable $x^{k+1}$ to denote the total center. At first we allow the total center to be a free point in $\mathbb{R}^d$, just like the cluster centers $x^1, \ldots, x^k$. Then the total cost of the tree is given by

$$\sum_{i=1}^{n} \min_{l=1,\ldots,k} \|x^l - a^i\| + \sum_{l=1}^{k} \|x^l - x^{k+1}\|.$$

To force the centers to be chosen from the given nodes (or to make them as close to the nodes as possible), we set the constraint

$$\sum_{l=1}^{k+1} \min_{i=1,\ldots,n} \|x^l - a^i\| = 0.$$
Our goal is to solve the optimization problem

$$\text{minimize } \sum_{i=1}^{n} \min_{l=1,\ldots,k} \|x^l - a^i\| + \sum_{l=1}^{k} \|x^l - x^{k+1}\| \quad \text{subject to } \sum_{l=1}^{k+1} \min_{i=1,\ldots,n} \|x^l - a^i\| = 0.$$

Similar to the first model, this formulation can be converted to an unconstrained minimization problem involving a penalty parameter $\lambda > 0$:

$$\text{minimize } \sum_{i=1}^{n} \min_{l=1,\ldots,k} \|x^l - a^i\| + \sum_{l=1}^{k} \|x^l - x^{k+1}\| + \lambda \sum_{l=1}^{k+1} \min_{i=1,\ldots,n} \|x^l - a^i\|, \quad x^1, \ldots, x^{k+1} \in \mathbb{R}^d. \qquad (3.2)$$
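
As with Model I, the penalized Model II objective is straightforward to evaluate; the sketch below (our illustration, following the reconstruction of (3.2) above) treats the last row of the variable matrix as the total center.

import numpy as np

def model2_objective(X, A, lam):
    # X: (k+1, d); rows 0..k-1 are cluster centers, the last row is the total center
    centers, total = X[:-1], X[-1]
    D_centers = np.linalg.norm(A[:, None, :] - centers[None, :, :], axis=2)   # (n, k)
    cost = D_centers.min(axis=1).sum() + np.linalg.norm(centers - total, axis=1).sum()
    D_all = np.linalg.norm(X[:, None, :] - A[None, :, :], axis=2)             # (k+1, n)
    penalty = D_all.min(axis=1).sum()                                         # all k+1 centers attracted to nodes
    return cost + lam * penalty

rng = np.random.default_rng(4)
A = rng.normal(size=(40, 2))
X = A[rng.choice(40, size=4, replace=False)]   # 3 cluster centers plus 1 total center
print(model2_objective(X, A, lam=1.0))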

Next, we apply Nesterov's smoothing technique to obtain an approximation of the objective function given in (3.2); this approximation involves the two parameters $\lambda$ and $\mu$:

As we will show in what follows, it is convenient to apply the DCA to minimize this smoothed function. It can be represented as the difference of two convex functions defined on $\mathbb{R}^{(k+1)\times d}$ using a variable matrix $X$ whose $l$th row is $x^l$ for $l = 1, \ldots, k+1$:

In this formulation, and are convex functions defined on by

and

with their respective components given by

Gradient and Subgradient Calculations for the DCA

Let us start by computing the gradient of the first part of the DC decomposition, i.e.,

By applying similar techniques used in computing gradients/subgradients for Model I, can be written as

where is the matrix whose entries are all ones. Similarly, can also be written as

where is the identity matrix, and is the matrix whose entries are all zeros except its row and column, which both are filled by the vector .

It follows that

where . Our goal now is to compute , which can be accomplished by the relation

The latter can be equivalently written as

whose solutions can be explicitly computed by its row for :

In the representation

the convex functions are differentiable. The partial derivatives of are given by