
NESTEROV’S SMOOTHING TECHNIQUE AND MINIMIZING DIFFERENCES OF CONVEX FUNCTIONS FOR HIERARCHICAL CLUSTERING

N. M. NAM (Fariborz Maseeh Department of Mathematics and Statistics, Portland State University, Portland, OR 97207, USA; mau.nam.nguyen@pdx.edu), W. GEREMEW (School of General Studies, Stockton University, Galloway, NJ 08205, USA; wondi.geremew@stockton.edu), S. REYNOLDS (Fariborz Maseeh Department of Mathematics and Statistics, Portland State University, Portland, OR 97207, USA; ser6@pdx.edu), and T. TRAN (Fariborz Maseeh Department of Mathematics and Statistics, Portland State University, Portland, OR 97207, USA; tuyen2@pdx.edu). Research of N. M. Nam was partly supported by the National Science Foundation under grant #1411817.


Abstract. A bilevel hierarchical clustering model is commonly used in designing optimal multicast networks. In this paper, we consider two different formulations of the bilevel hierarchical clustering problem, a discrete optimization problem which can be shown to be NP-hard. Our approach is to reformulate the problem as a continuous optimization problem by relaxing the discreteness conditions. Then Nesterov's smoothing technique and a numerical algorithm for minimizing differences of convex functions, called the DCA, are applied to cope with the nonsmoothness and nonconvexity of the problem. Numerical examples are provided to illustrate our method.
Key words. DC programming, Nesterov’s smoothing technique, hierarchical clustering, subgradient, Fenchel conjugate.
AMS subject classifications. 49J52, 49J53, 90C31

## 1 Introduction

Although convex optimization techniques and numerical algorithms have been the topics of extensive research for more than 50 years, solving large-scale optimization problems without the presence of convexity remains a challenge. This is a motivation to search for new optimization methods that are capable of handling broader classes of functions and sets where convexity is not assumed. One of the most successful approaches to go beyond convexity is to consider the class of functions representable as differences of two convex functions. Functions of this type are called DC functions, where DC stands for difference of convex. It was recognized early by P. Hartman [10] that the class of DC functions has many nice algebraic properties. For instance, this class of functions is closed under many operations usually considered in optimization, such as taking linear combinations, maxima, or products of finitely many DC functions.

Given a linear space $X$, a DC program is an optimization problem in which the objective function can be represented as $f = g - h$, where $g$ and $h$ are convex functions on $X$. This extension of convex programming enables us to take advantage of the available tools from convex analysis and optimization. At the same time, DC programming is sufficiently broad to use in solving many nonconvex optimization problems faced in recent applications. Another feature of DC programming is that it possesses a very nice duality theory; see [3] and the references therein. Although DC programming had been known to be important for many applications much earlier, the first algorithm for minimizing differences of convex functions, called the DCA, was introduced by Tao and An in [3, 9]. The DCA is a simple but effective optimization scheme used extensively in DC programming and its applications.

Cluster analysis or clustering is one of the most important problems in many fields such as machine learning, pattern recognition, image analysis, data compression, and computer graphics. Given a finite number of data points in a metric space, a centroid-based clustering problem seeks a finite number of cluster centers with each data point assigned to the nearest cluster center in a way that a certain distance, a measure of dissimilarity among data points, is minimized. Since many kinds of data encountered in practical applications have nested structures, multilevel hierarchical clustering, which groups a data set into a hierarchy of clusters, is required. In this paper, we apply the mathematical optimization approach to the bilevel hierarchical clustering problem. In fact, using mathematical optimization in clustering is a very promising approach to overcome many disadvantages of the $k$-means algorithm commonly used in clustering; see [1, 2, 5, 15] and the references therein. In particular, the DCA was successfully applied in [2] to a bilevel hierarchical clustering problem in which the distance measurement is defined by the squared Euclidean distance. Although the DCA in [2] provides an effective way to solve the bilevel hierarchical clustering problem in high dimensions, it has not been used to solve the original model proposed in [5], which is defined by the Euclidean distance measurement and, according to the authors of [2], is not suitable for the resulting DCA. By applying Nesterov's smoothing technique and the DCA, we are able to solve the original model proposed in [5] in high dimensions.

The paper is organized as follows. In Section 2, we present basic definitions and tools of optimization that are used throughout the paper. In Section 3, we study two models of bilevel hierarchical clustering problems, along with two new algorithms based on Nesterov's smoothing technique and the DCA. Numerical examples and conclusions are presented in Section 4 and Section 5, respectively.

## 2 Basic Definitions and Tools of Optimization

In this section, we present two main tools of optimization used to solve the bilevel hierarchical clustering problem: the DCA introduced by Pham Dinh Tao and Nesterov's smoothing technique.

We consider throughout the paper the DC programming problem:

$$\text{minimize } f(x) := g(x) - h(x), \quad x \in \mathbb{R}^n, \tag{2.1}$$

where $g \colon \mathbb{R}^n \to \mathbb{R}$ and $h \colon \mathbb{R}^n \to \mathbb{R}$ are convex functions. The function $f$ in (2.1) is called a DC function, and $g - h$ is called a DC decomposition of $f$.

Given a convex function $g \colon \mathbb{R}^n \to \mathbb{R}$, the Fenchel conjugate of $g$ is defined by

$$g^*(y) := \sup\{\langle y, x \rangle - g(x) \mid x \in \mathbb{R}^n\}, \quad y \in \mathbb{R}^n.$$

Note that $g^*$ is also a convex function. In addition, $x \in \partial g^*(y)$ if and only if $y \in \partial g(x)$, where $\partial$ denotes the subdifferential operator in the sense of convex analysis; see, e.g., [19-21].

Let us present below the DCA introduced by Tao and An [3, 9] as applied to (2.1). Although the algorithm is used for nonconvex optimization problems, the convexity of the functions involved still plays a crucial role.

**The DCA.**

INPUT: $x_0 \in \mathbb{R}^n$, $N \in \mathbb{N}$.
for $k = 1, \ldots, N$ do
  Find $y_k \in \partial h(x_{k-1})$.
  Find $x_k \in \partial g^*(y_k)$.
end for
OUTPUT: $x_N$.
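As an illustration (not taken from the paper), the scheme above can be sketched in Python for a one-dimensional toy DC function $f(x) = x^2 - |x|$, where $g(x) = x^2$, $h(x) = |x|$, $\partial h(x) = \{\operatorname{sign}(x)\}$ for $x \neq 0$, and $\nabla g^*(y) = y/2$; all names here are ours:

```python
import numpy as np

def dca(x0, subgrad_h, grad_g_conj, n_iter=50):
    """Generic DCA: y_k in subgrad h(x_{k-1}), then x_k = grad g*(y_k)."""
    x = x0
    for _ in range(n_iter):
        y = subgrad_h(x)
        x = grad_g_conj(y)
    return x

# Toy DC function f(x) = x^2 - |x| with g(x) = x^2 and h(x) = |x|.
x_star = dca(x0=2.0,
             subgrad_h=lambda x: np.sign(x) if x != 0 else 1.0,
             grad_g_conj=lambda y: y / 2.0)
print(x_star)  # converges to the critical point x = 1/2
```

Starting from $x_0 = 2$, the iteration reaches the critical point $x = 1/2$ after one step and stays there, illustrating the fixed-point behavior of the DCA.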

Let us discuss below a convergence result of DC programming. A function $h \colon \mathbb{R}^n \to \mathbb{R}$ is called $\gamma$-convex ($\gamma \geq 0$) if the function defined by $k(x) := h(x) - \frac{\gamma}{2}\|x\|^2$, $x \in \mathbb{R}^n$, is convex. If there exists $\gamma > 0$ such that $h$ is $\gamma$-convex, then $h$ is called strongly convex. We say that an element $\bar{x} \in \mathbb{R}^n$ is a critical point of the function $f$ defined by (2.1) if

$$\partial g(\bar{x}) \cap \partial h(\bar{x}) \neq \emptyset.$$

Obviously, in the case where both $g$ and $h$ are differentiable, $\bar{x}$ is a critical point of $f$ if and only if $\bar{x}$ satisfies the Fermat rule $\nabla f(\bar{x}) = 0$. The theorem below provides a convergence result for the DCA. It can be derived directly from [9, Theorem 3.7].

###### Theorem 2.1

Consider the function $f$ defined by (2.1) and the sequence $\{x_k\}$ generated by the DCA. The following properties are valid:
(i) If $g$ is $\gamma_1$-convex and $h$ is $\gamma_2$-convex, then

$$f(x_k) - f(x_{k+1}) \geq \frac{\gamma_1 + \gamma_2}{2}\|x_{k+1} - x_k\|^2 \quad \text{for all } k \in \mathbb{N}.$$

(ii) The sequence $\{f(x_k)\}$ is monotone decreasing.
(iii) If $f$ is bounded from below, $g$ is $\gamma_1$-convex and $h$ is $\gamma_2$-convex with $\gamma_1 + \gamma_2 > 0$, and $\{x_k\}$ is bounded, then every subsequential limit of the sequence $\{x_k\}$ is a critical point of $f$.

Now we present a direct consequence of Nesterov's smoothing technique given in [16]. In the proposition below, $d(x; \Omega)$ denotes the Euclidean distance from $x$ to a nonempty closed convex set $\Omega$ in $\mathbb{R}^n$, and $P(x; \Omega)$ denotes the Euclidean projection of $x$ onto $\Omega$.

###### Proposition 2.2

Given any $a \in \mathbb{R}^n$ and $\mu > 0$, a Nesterov smoothing approximation of $\varphi(x) := \|x - a\|$ defined on $\mathbb{R}^n$ has the representation

$$\varphi_\mu(x) := \frac{1}{2\mu}\|x - a\|^2 - \frac{\mu}{2}\Big[d\Big(\frac{x - a}{\mu}; B\Big)\Big]^2. \tag{2.2}$$

Moreover, $\nabla\varphi_\mu(x) = P\big(\frac{x - a}{\mu}; B\big)$ and

$$\varphi_\mu(x) \leq \varphi(x) \leq \varphi_\mu(x) + \frac{\mu}{2},$$

where $B$ is the closed unit ball of $\mathbb{R}^n$.
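As a quick numerical sanity check of the proposition (our own illustrative code, using $d(v; B) = \max(0, \|v\| - 1)$ for the closed unit ball $B$), one can evaluate $\varphi_\mu$ directly and verify the bounds $\varphi_\mu \leq \varphi \leq \varphi_\mu + \mu/2$:

```python
import numpy as np

def phi_mu(x, a, mu):
    # Smoothed Euclidean norm per (2.2); d(v; B) = max(0, ||v|| - 1)
    v = (x - a) / mu
    dist = max(0.0, np.linalg.norm(v) - 1.0)
    return np.dot(x - a, x - a) / (2 * mu) - (mu / 2) * dist ** 2

rng = np.random.default_rng(0)
a, mu = rng.standard_normal(5), 0.1
for _ in range(100):
    x = rng.standard_normal(5)
    phi = np.linalg.norm(x - a)
    assert phi_mu(x, a, mu) <= phi <= phi_mu(x, a, mu) + mu / 2 + 1e-12
```

Note that whenever $\|x - a\| \geq \mu$, one gets $\varphi_\mu(x) = \|x - a\| - \mu/2$ exactly, so the upper bound $\varphi_\mu + \mu/2$ is tight outside a small neighborhood of $a$.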

## 3 The Bilevel Hierarchical Clustering Problem

Given a set of $m$ points (nodes) $a^1, a^2, \ldots, a^m$ in $\mathbb{R}^n$, our goal is to decompose this set into $k$ clusters. In each cluster, we would like to find a point among the nodes and assign it as the center for this cluster, with all points in the cluster connected to this center. Then we will find a total center among the given points $a^1, a^2, \ldots, a^m$, and all centers are connected to this total center. The goal is to minimize the total transportation cost of this tree, computed as the sum of the distances from the total center to each center and from each center to the nodes in its cluster. This is a discrete optimization problem which can be shown to be NP-hard. We will solve this problem based on continuous optimization techniques.

### 3.1 The Bilevel Hierarchical Clustering: Model I

The difficulty in solving this hierarchical clustering problem lies in the fact that the centers and the total center have to be among the existing nodes. We first relax this condition with the use of $k$ artificial centers $x^1, \ldots, x^k$ that could be anywhere in $\mathbb{R}^n$. Then we define the total center as the centroid of $x^1, \ldots, x^k$ given by

$$x^* := \frac{1}{k}(x^1 + x^2 + \cdots + x^k).$$

The total cost of the tree is given by

$$\varphi(x^1, \ldots, x^k) := \sum_{i=1}^m \min_{\ell=1,\ldots,k}\|x^\ell - a^i\| + \sum_{\ell=1}^k \|x^\ell - x^*\|.$$

Note that here each node $a^i$ is assigned to its closest center. However, what we expect are the real centers, which can be approximated by minimizing the difference between the artificial centers and the real centers. To achieve this goal, define the function

$$\phi(x^1, \ldots, x^k) := \sum_{\ell=1}^k \min_{i=1,\ldots,m}\|x^\ell - a^i\|, \quad x^1, \ldots, x^k \in \mathbb{R}^n.$$

Observe that $\phi(x^1, \ldots, x^k) = 0$ if and only if for every $\ell \in \{1, \ldots, k\}$ there exists $i \in \{1, \ldots, m\}$ such that $x^\ell = a^i$, which means that each $x^\ell$ is a real node. Therefore, we consider the constrained minimization problem:

$$\text{minimize } \varphi(x^1, \ldots, x^k) \quad \text{subject to } \phi(x^1, \ldots, x^k) = 0.$$

This problem can be converted to an unconstrained minimization problem:

$$\text{minimize } f_\lambda(x^1, \ldots, x^k) := \varphi(x^1, \ldots, x^k) + \lambda\,\phi(x^1, \ldots, x^k), \quad x^1, \ldots, x^k \in \mathbb{R}^n, \tag{3.1}$$

where $\lambda > 0$ is a penalty parameter. As with the clustering problem, this new problem is nonsmooth and nonconvex, and it can be solved by smoothing techniques and the DCA. Note that a particular case of this model in two dimensions was considered in [5], where the problem was solved using the derivative-free discrete gradient method established in [4]; this method, however, is not suitable for large-scale settings in high dimensions. The DCA was used in [2] to solve a similar model in which the squared Euclidean distance is used as the distance measurement. The authors of [2] also pointed out the difficulty of dealing with (3.1), which they considered not suitable for the resulting DCA. Nevertheless, we will show in what follows that the DCA is applicable to this model when combined with Nesterov's smoothing technique.
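To make the model concrete, the functions $\varphi$, $\phi$, and $f_\lambda$ can be evaluated directly. The sketch below is our own illustration (rows of the NumPy arrays are points), not part of the algorithm itself:

```python
import numpy as np

def total_cost(X, A):
    # phi-cost of the tree: each node connects to its nearest center,
    # and each center connects to the centroid x* of the centers
    x_star = X.mean(axis=0)
    d_nodes = np.linalg.norm(A[:, None, :] - X[None, :, :], axis=2).min(axis=1).sum()
    d_centers = np.linalg.norm(X - x_star, axis=1).sum()
    return d_nodes + d_centers

def penalty(X, A):
    # distance from each artificial center to its nearest real node
    return np.linalg.norm(X[:, None, :] - A[None, :, :], axis=2).min(axis=1).sum()

def f_lambda(X, A, lam):
    # penalized objective (3.1)
    return total_cost(X, A) + lam * penalty(X, A)
```

When every row of `X` coincides with a node, the penalty term vanishes and `f_lambda` reduces to the tree cost, matching the observation that $\phi = 0$ exactly at feasible (all-node) configurations.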

Note that the functions $\varphi$ and $\phi$ in (3.1) belong to the class of DC functions with the following DC decompositions:

$$\begin{aligned}
\varphi(x^1, \ldots, x^k) &= \sum_{i=1}^m\Big[\sum_{\ell=1}^k\|x^\ell - a^i\| - \max_{r=1,\ldots,k}\sum_{\ell=1,\,\ell\neq r}^k\|x^\ell - a^i\|\Big] + \sum_{\ell=1}^k\|x^\ell - x^*\|\\
&= \Big[\sum_{i=1}^m\sum_{\ell=1}^k\|x^\ell - a^i\| + \sum_{\ell=1}^k\|x^\ell - x^*\|\Big] - \sum_{i=1}^m\max_{r=1,\ldots,k}\sum_{\ell=1,\,\ell\neq r}^k\|x^\ell - a^i\|,\\
\phi(x^1, \ldots, x^k) &= \sum_{\ell=1}^k\sum_{i=1}^m\|x^\ell - a^i\| - \sum_{\ell=1}^k\max_{s=1,\ldots,m}\sum_{i=1,\,i\neq s}^m\|x^\ell - a^i\|.
\end{aligned}$$

It follows that the objective function $f_\lambda$ in (3.1) has the DC decomposition:

$$\begin{aligned}
f_\lambda(x^1, \ldots, x^k) &= \Big[(1+\lambda)\sum_{\ell=1}^k\sum_{i=1}^m\|x^\ell - a^i\| + \sum_{\ell=1}^k\|x^\ell - x^*\|\Big]\\
&\quad - \Big[\sum_{i=1}^m\max_{r=1,\ldots,k}\sum_{\ell=1,\,\ell\neq r}^k\|x^\ell - a^i\| + \lambda\sum_{\ell=1}^k\max_{s=1,\ldots,m}\sum_{i=1,\,i\neq s}^m\|x^\ell - a^i\|\Big].
\end{aligned}$$

This DC decomposition is not suitable for applying the DCA because there is no closed form for a subgradient of the Fenchel conjugate of its convex part.

In the next step, we apply Nesterov's smoothing technique from Proposition 2.2 to approximate the objective function $f_\lambda$ by a new DC function favorable for applying the DCA. To accomplish this goal, we simply replace each term of the form $\|x^\ell - a^i\|$ and $\|x^\ell - x^*\|$ from the first part of $f_\lambda$ (the positive part) by the smooth approximation (2.2), while keeping the second part (the negative part) the same. As a result, we obtain

$$\begin{aligned}
f_{\lambda\mu}(x^1, \ldots, x^k) :={}& \frac{1+\lambda}{2\mu}\sum_{i=1}^m\sum_{\ell=1}^k\|x^\ell - a^i\|^2 + \frac{1}{2\mu}\sum_{\ell=1}^k\|x^\ell - x^*\|^2\\
& - \frac{(1+\lambda)\mu}{2}\sum_{i=1}^m\sum_{\ell=1}^k\Big[d\Big(\frac{x^\ell - a^i}{\mu}; B\Big)\Big]^2 - \frac{\mu}{2}\sum_{\ell=1}^k\Big[d\Big(\frac{x^\ell - x^*}{\mu}; B\Big)\Big]^2\\
& - \sum_{i=1}^m\max_{r=1,\ldots,k}\sum_{\ell=1,\,\ell\neq r}^k\|x^\ell - a^i\| - \lambda\sum_{\ell=1}^k\max_{s=1,\ldots,m}\sum_{i=1,\,i\neq s}^m\|x^\ell - a^i\|.
\end{aligned}$$

The original bilevel hierarchical clustering problem now can be solved using the DC program:

$$\text{minimize } f_{\lambda\mu}(x^1, \ldots, x^k) = g_{\lambda\mu}(x^1, \ldots, x^k) - h_{\lambda\mu}(x^1, \ldots, x^k), \quad x^1, \ldots, x^k \in \mathbb{R}^n.$$

In this formulation, $g_{\lambda\mu}$ and $h_{\lambda\mu}$ are convex functions on $(\mathbb{R}^n)^k$ defined by

$$g_{\lambda\mu} := g^1_{\lambda\mu} + g^2_{\lambda\mu}, \qquad h_{\lambda\mu} := h^1_{\lambda\mu} + h^2_{\lambda\mu} + h^3_{\lambda\mu} + h^4_{\lambda\mu},$$

with their respective components defined as

$$\begin{aligned}
g^1_{\lambda\mu}(x^1, \ldots, x^k) &:= \frac{1+\lambda}{2\mu}\sum_{i=1}^m\sum_{\ell=1}^k\|x^\ell - a^i\|^2, & g^2_{\lambda\mu}(x^1, \ldots, x^k) &:= \frac{1}{2\mu}\sum_{\ell=1}^k\|x^\ell - x^*\|^2,\\
h^1_{\lambda\mu}(x^1, \ldots, x^k) &:= \frac{(1+\lambda)\mu}{2}\sum_{i=1}^m\sum_{\ell=1}^k\Big[d\Big(\frac{x^\ell - a^i}{\mu}; B\Big)\Big]^2, & h^2_{\lambda\mu}(x^1, \ldots, x^k) &:= \frac{\mu}{2}\sum_{\ell=1}^k\Big[d\Big(\frac{x^\ell - x^*}{\mu}; B\Big)\Big]^2,\\
h^3_{\lambda\mu}(x^1, \ldots, x^k) &:= \sum_{i=1}^m\max_{r=1,\ldots,k}\sum_{\ell=1,\,\ell\neq r}^k\|x^\ell - a^i\|, & h^4_{\lambda\mu}(x^1, \ldots, x^k) &:= \lambda\sum_{\ell=1}^k\max_{s=1,\ldots,m}\sum_{i=1,\,i\neq s}^m\|x^\ell - a^i\|.
\end{aligned}$$

To facilitate the gradient and subgradient calculations for the DCA, we introduce a data matrix $A$ and a variable matrix $X$. The data matrix $A \in \mathbb{R}^{m\times n}$ is formed by putting each $a^i$, $i = 1, \ldots, m$, in the $i$th row, i.e.,

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & \ldots & a_{1n}\\ a_{21} & a_{22} & a_{23} & \ldots & a_{2n}\\ \vdots & \vdots & \vdots & & \vdots\\ a_{m1} & a_{m2} & a_{m3} & \ldots & a_{mn}\end{pmatrix}.$$

Similarly, if $x^1, \ldots, x^k$ are the cluster centers, then the variable matrix $X \in \mathbb{R}^{k\times n}$ is formed by putting each $x^\ell$, $\ell = 1, \ldots, k$, in the $\ell$th row, i.e.,

$$X = \begin{pmatrix} x_{11} & x_{12} & x_{13} & \ldots & x_{1n}\\ x_{21} & x_{22} & x_{23} & \ldots & x_{2n}\\ \vdots & \vdots & \vdots & & \vdots\\ x_{k1} & x_{k2} & x_{k3} & \ldots & x_{kn}\end{pmatrix}.$$

Then the variable matrix of the optimization problem belongs to $\mathbb{R}^{k\times n}$, the linear space of $k$ by $n$ real matrices equipped with the inner product $\langle X, Y\rangle := \operatorname{trace}(X^T Y)$. The Frobenius norm on $\mathbb{R}^{k\times n}$ is defined by

$$\|X\|_F := \sqrt{\langle X, X\rangle} = \Big(\sum_{\ell=1}^k\sum_{j=1}^n x_{\ell j}^2\Big)^{1/2}.$$

Finally, we represent the average of the $k$ cluster centers by $x^*$, i.e., $x^* = \frac{1}{k}\sum_{\ell=1}^k x^\ell$.

Let us start by computing the gradient of

$$g_{\lambda\mu}(X) = g^1_{\lambda\mu}(X) + g^2_{\lambda\mu}(X).$$

Using the Frobenius norm, the function $g^1_{\lambda\mu}$ can equivalently be written as

$$\begin{aligned}
g^1_{\lambda\mu}(X) &= \frac{1+\lambda}{2\mu}\sum_{i=1}^m\sum_{\ell=1}^k\|x^\ell - a^i\|^2\\
&= \frac{1+\lambda}{2\mu}\sum_{i=1}^m\sum_{\ell=1}^k\big[\|x^\ell\|^2 - 2\langle x^\ell, a^i\rangle + \|a^i\|^2\big]\\
&= \frac{1+\lambda}{2\mu}\big[m\|X\|_F^2 - 2\langle X, E_{km}A\rangle + k\|A\|_F^2\big],
\end{aligned}$$

where $E_{km}$ is the $k\times m$ matrix whose entries are all ones. Hence, one can see that $g^1_{\lambda\mu}$ is differentiable and its gradient is given by

$$\nabla g^1_{\lambda\mu}(X) = \frac{1+\lambda}{\mu}\big[mX - E_{km}A\big].$$

Similarly, $g^2_{\lambda\mu}$ can equivalently be written as

$$\begin{aligned}
g^2_{\lambda\mu}(X) &= \frac{1}{2\mu}\sum_{\ell=1}^k\|x^\ell - x^*\|^2\\
&= \frac{1}{2\mu}\sum_{\ell=1}^k\big[\|x^\ell\|^2 - 2\langle x^\ell, x^*\rangle + \|x^*\|^2\big]\\
&= \frac{1}{2\mu}\Big[\|X\|_F^2 - \frac{2}{k}\langle X, E_{kk}X\rangle + \frac{1}{k}\langle X, E_{kk}X\rangle\Big]\\
&= \frac{1}{2\mu}\Big[\|X\|_F^2 - \frac{1}{k}\langle X, E_{kk}X\rangle\Big],
\end{aligned}$$

where $E_{kk}$ is the $k\times k$ matrix whose entries are all ones. Hence, $g^2_{\lambda\mu}$ is differentiable and its gradient is given by

$$\nabla g^2_{\lambda\mu}(X) = \frac{1}{\mu}\Big[X - \frac{1}{k}E_{kk}X\Big].$$

Since $g_{\lambda\mu} = g^1_{\lambda\mu} + g^2_{\lambda\mu}$, its gradient can be computed by

$$\begin{aligned}
\nabla g_{\lambda\mu}(X) &= \nabla g^1_{\lambda\mu}(X) + \nabla g^2_{\lambda\mu}(X)\\
&= \frac{1+\lambda}{\mu}\big[mX - E_{km}A\big] + \frac{1}{\mu}\Big[X - \frac{1}{k}E_{kk}X\Big]\\
&= \frac{1}{\mu}\Big[(1+\lambda)mX - (1+\lambda)E_{km}A + X - \frac{1}{k}E_{kk}X\Big]\\
&= \frac{1}{\mu}\Big[\big[((1+\lambda)m + 1)I_{kk} - \frac{1}{k}E_{kk}\big]X - (1+\lambda)E_{km}A\Big].
\end{aligned}$$

Therefore,

$$\nabla g_{\lambda\mu}(X) = \frac{1}{\mu}\big[\big(((1+\lambda)m + 1)I_{kk} - J\big)X - (1+\lambda)S\big], \quad\text{where } J = \frac{1}{k}E_{kk} \text{ and } S = E_{km}A.$$

Our goal now is to compute $\nabla g^*_{\lambda\mu}$, which can be accomplished by the relation

$$X = \nabla g^*_{\lambda\mu}(Y) \quad\text{if and only if}\quad Y = \nabla g_{\lambda\mu}(X).$$

The latter can equivalently be written as

$$\big(((1+\lambda)m + 1)I_{kk} - J\big)X = (1+\lambda)S + \mu Y.$$

Then with some algebraic manipulation we can show that

$$\nabla g^*_{\lambda\mu}(Y) = X = \Big[\frac{1}{1+(1+\lambda)m}I_{kk} + \frac{1}{[1+(1+\lambda)m](1+\lambda)m}J\Big]\big[(1+\lambda)S + \mu Y\big].$$
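The closed form for $\nabla g^*_{\lambda\mu}$ can be checked numerically: applying it to $\nabla g_{\lambda\mu}(X)$ should recover $X$. The following sketch is our own check with arbitrary test data:

```python
import numpy as np

k, n, m, lam, mu = 4, 3, 7, 2.0, 0.5
rng = np.random.default_rng(1)
A = rng.standard_normal((m, n))
I, J = np.eye(k), np.ones((k, k)) / k   # J = (1/k) E_kk
S = np.ones((k, m)) @ A                 # S = E_km A

def grad_g(X):
    # grad g(X) = (1/mu) [ ((1+lam)m + 1) I - J ] X - ((1+lam)/mu) S
    return (((1 + lam) * m + 1) * I - J) @ X / mu - (1 + lam) / mu * S

def grad_g_conj(Y):
    # closed-form inverse relation X = grad g*(Y)
    c = 1 + (1 + lam) * m
    M = I / c + J / (c * (c - 1))       # inverse of c*I - J, since J @ J = J
    return M @ ((1 + lam) * S + mu * Y)

X = rng.standard_normal((k, n))
X_back = grad_g_conj(grad_g(X))
print(np.allclose(X, X_back))  # True
```

The inverse exists in closed form because $J$ is a projection ($J^2 = J$), so $(cI - J)^{-1} = \frac{1}{c}I + \frac{1}{c(c-1)}J$.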

Next, we will demonstrate in more detail the techniques we used in finding a subgradient of the convex function $h_{\lambda\mu}$. Recall that $h_{\lambda\mu}$ is defined by

$$h_{\lambda\mu}(X) = \sum_{i=1}^4 h^i_{\lambda\mu}(X).$$

We begin with

$$h^1_{\lambda\mu}(X) = \frac{(1+\lambda)\mu}{2}\sum_{i=1}^m\sum_{\ell=1}^k\Big[d\Big(\frac{x^\ell - a^i}{\mu}; B\Big)\Big]^2.$$

From its representation, one can see that $h^1_{\lambda\mu}$ is differentiable, and hence its subgradient coincides with its gradient, which can be computed by the partial derivatives with respect to $x^\ell$, i.e.,

$$\frac{\partial h^1_{\lambda\mu}}{\partial x^\ell}(X) = (1+\lambda)\sum_{i=1}^m\Big[\frac{x^\ell - a^i}{\mu} - P\Big(\frac{x^\ell - a^i}{\mu}; B\Big)\Big].$$

Thus, $\nabla h^1_{\lambda\mu}(X)$ is the $k\times n$ matrix whose $\ell$th row, for $\ell = 1, \ldots, k$, is $\frac{\partial h^1_{\lambda\mu}}{\partial x^\ell}(X)$.

Similarly, one can see that the function $h^2_{\lambda\mu}$ given by

$$h^2_{\lambda\mu}(X) = \frac{\mu}{2}\sum_{\ell=1}^k\Big[d\Big(\frac{x^\ell - x^*}{\mu}; B\Big)\Big]^2$$

is differentiable with its partial derivatives computed by

$$\frac{\partial h^2_{\lambda\mu}}{\partial x^\ell}(X) = \Big[\frac{x^\ell - x^*}{\mu} - P\Big(\frac{x^\ell - x^*}{\mu}; B\Big)\Big] - \frac{1}{k}\sum_{j=1}^k\Big[\frac{x^j - x^*}{\mu} - P\Big(\frac{x^j - x^*}{\mu}; B\Big)\Big].$$

Hence, $\nabla h^2_{\lambda\mu}(X)$ is the $k\times n$ matrix whose $\ell$th row, for $\ell = 1, \ldots, k$, is $\frac{\partial h^2_{\lambda\mu}}{\partial x^\ell}(X)$.

Unlike $h^1_{\lambda\mu}$ and $h^2_{\lambda\mu}$, the convex functions $h^3_{\lambda\mu}$ and $h^4_{\lambda\mu}$ are not differentiable, but both can be written as a finite sum of maxima of finitely many convex functions. Let us compute a subgradient of $h^3_{\lambda\mu}$ as an example. We have

$$h^3_{\lambda\mu}(X) = \sum_{i=1}^m\max_{r=1,\ldots,k}\sum_{\ell=1,\,\ell\neq r}^k\|x^\ell - a^i\| = \sum_{i=1}^m\gamma_i(X),$$

where, for $i = 1, \ldots, m$,

$$\gamma_i(X) := \max\Big\{\gamma_{ir}(X) = \sum_{\ell=1,\,\ell\neq r}^k\|x^\ell - a^i\| \;\Big|\; r = 1, \ldots, k\Big\}.$$

Then, for each $i$, we find $W_i \in \partial\gamma_i(X)$ according to the subdifferential rule for the maximum of convex functions. Then define $W := \sum_{i=1}^m W_i$ to get a subgradient of the function $h^3_{\lambda\mu}$ at $X$ by the subdifferential sum rule. To accomplish this goal, we first choose an index $r^*$ from the index set $\{1, \ldots, k\}$ such that

$$\gamma_i(X) = \gamma_{ir^*}(X) = \sum_{\ell=1,\,\ell\neq r^*}^k\|x^\ell - a^i\|.$$

Using the familiar subdifferential formula of the Euclidean norm function, the $\ell$th row $w^\ell_i$, for $\ell \neq r^*$, of the matrix $W_i$ is determined as follows:

$$w^\ell_i := \begin{cases} \dfrac{x^\ell - a^i}{\|x^\ell - a^i\|} & \text{if } x^\ell \neq a^i,\\[1ex] u \in B & \text{if } x^\ell = a^i.\end{cases}$$

The $r^*$th row of the matrix $W_i$ is the zero vector.
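The construction above can be sketched in code (our own illustration; note that maximizing $\gamma_{ir}$ over the excluded index $r$ amounts to excluding the center with the smallest norm $\|x^r - a^i\|$):

```python
import numpy as np

def subgrad_h3(X, A):
    """One subgradient W of h3(X) = sum_i max_r sum_{l != r} ||x^l - a^i||."""
    W = np.zeros_like(X)
    for a in A:
        D = X - a                          # rows x^l - a^i
        norms = np.linalg.norm(D, axis=1)
        G = np.zeros_like(X)
        nz = norms > 1e-12
        G[nz] = D[nz] / norms[nz, None]    # subgradient of ||x^l - a^i||
        G[np.argmin(norms)] = 0.0          # the active index r* contributes a zero row
        W += G                             # subdifferential sum rule over i
    return W
```

For instance, with $k = 2$ one-dimensional centers $x^1 = 3$, $x^2 = 1$ and a single node $a^1 = 0$, the active index is $r^* = 2$, so the subgradient has rows $1$ and $0$.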

The procedure for calculating a subgradient of the function $h^4_{\lambda\mu}$ given by

$$h^4_{\lambda\mu}(x^1, \ldots, x^k) = \lambda\sum_{\ell=1}^k\max_{s=1,\ldots,m}\sum_{i=1,\,i\neq s}^m\|x^\ell - a^i\|$$

is very similar to what we just demonstrated for $h^3_{\lambda\mu}$.

At this point, we are ready to give a new DCA based algorithm for Model I.

**Algorithm 1.**

INPUT: $X_0$, $\lambda_0$, $\mu_0$, $N \in \mathbb{N}$.
while stopping criteria ($\lambda$, $\mu$) = false do
  for $k = 1, 2, 3, \ldots, N$ do
    Find $Y_k \in \partial h_{\lambda\mu}(X_{k-1})$.
    Find $X_k = \nabla g^*_{\lambda\mu}(Y_k)$.
  end for
  update $\lambda$ and $\mu$.
end while
OUTPUT: $X_N$.
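Under the formulas derived above, the inner DCA loop of Algorithm 1 can be sketched compactly in Python. This is our own illustration, with $\lambda$ and $\mu$ held fixed rather than updated, and all helper names ours; by Theorem 2.1, the smoothed objective should decrease along the iterations:

```python
import numpy as np

def proj_ball(V):
    # Row-wise Euclidean projection onto the closed unit ball B
    r = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.maximum(r, 1.0)

def row_unit(D):
    # Row-wise D / ||D||, leaving zero rows as zero (any u in B would do)
    r = np.linalg.norm(D, axis=1)
    U = np.zeros_like(D)
    nz = r > 1e-12
    U[nz] = D[nz] / r[nz, None]
    return U, r

def subgrad_h(X, A, lam, mu):
    # A subgradient of h = h1 + h2 + h3 + h4 at X (rows of X are centers)
    Y = np.zeros_like(X)
    x_star = X.mean(axis=0)
    for a in A:                        # h1: smoothed node terms
        V = (X - a) / mu
        Y += (1 + lam) * (V - proj_ball(V))
    U = (X - x_star) / mu              # h2: smoothed centroid terms
    R = U - proj_ball(U)
    Y += R - R.mean(axis=0)
    for a in A:                        # h3: the center nearest to a drops out
        G, r = row_unit(X - a)
        G[np.argmin(r)] = 0.0
        Y += G
    for l in range(X.shape[0]):        # h4: the node nearest to x^l drops out
        G, r = row_unit(X[l] - A)
        G[np.argmin(r)] = 0.0
        Y[l] += lam * G.sum(axis=0)
    return Y

def dca_model1(X0, A, lam, mu, n_iter=50):
    # Inner loop: X_k = grad g*(Y_k) with Y_k in subgrad h(X_{k-1})
    k, m = X0.shape[0], A.shape[0]
    c = 1 + (1 + lam) * m
    I, J = np.eye(k), np.ones((k, k)) / k
    S = np.ones((k, m)) @ A
    M = I / c + J / (c * (c - 1))      # inverse of c*I - J
    X = X0.copy()
    for _ in range(n_iter):
        X = M @ ((1 + lam) * S + mu * subgrad_h(X, A, lam, mu))
    return X

def f_smooth(X, A, lam, mu):
    # The smoothed penalized objective f_{lam, mu}
    def smo(v):
        r = np.linalg.norm(v)
        return r * r / (2 * mu) - (mu / 2) * max(0.0, r / mu - 1.0) ** 2
    x_star = X.mean(axis=0)
    val = sum((1 + lam) * smo(x - a) for a in A for x in X)
    val += sum(smo(x - x_star) for x in X)
    for a in A:                        # h3 term: max over r = total minus smallest
        r = np.linalg.norm(X - a, axis=1)
        val -= r.sum() - r.min()
    for x in X:                        # h4 term
        r = np.linalg.norm(x - A, axis=1)
        val -= lam * (r.sum() - r.min())
    return val

rng = np.random.default_rng(0)
A = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(5, 0.3, (10, 2))])
X0 = rng.standard_normal((2, 2))
X = dca_model1(X0, A, lam=1.0, mu=0.5)
```

The outer loop of Algorithm 1 (gradually increasing $\lambda$ and decreasing $\mu$) is omitted here; in practice the returned centers are then used as the starting point for the next $(\lambda, \mu)$ pair.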

### 3.2 The Bilevel Hierarchical Clustering: Model II

In this section, we introduce the second model to solve the bilevel hierarchical clustering problem. In this model, we use an additional variable $x^{k+1}$ to denote the total center. At first we allow the total center $x^{k+1}$ to be a free point in $\mathbb{R}^n$, the same as the $k$ cluster centers. Then the total cost of the tree is given by

$$\varphi(x^1, \ldots, x^{k+1}) := \sum_{i=1}^m\min_{\ell=1,\ldots,k}\|x^\ell - a^i\| + \sum_{\ell=1}^k\|x^\ell - x^{k+1}\|, \quad x^1, \ldots, x^{k+1} \in \mathbb{R}^n.$$

To force the centers to be chosen from the given nodes (or to make them as close to the nodes as possible), we impose the constraint

$$\phi(x^1, \ldots, x^{k+1}) := \sum_{\ell=1}^{k+1}\min_{i=1,\ldots,m}\|x^\ell - a^i\| = 0.$$

Our goal is to solve the optimization problem

$$\text{minimize } \varphi(x^1, \ldots, x^{k+1}) \quad\text{subject to } \phi(x^1, \ldots, x^{k+1}) = 0, \quad x^1, \ldots, x^{k+1} \in \mathbb{R}^n.$$

Similar to the first model, this problem formulation can be converted to an unconstrained minimization problem involving a penalty parameter $\lambda > 0$:

$$\text{minimize } f_\lambda(x^1, \ldots, x^{k+1}) := \varphi(x^1, \ldots, x^{k+1}) + \lambda\,\phi(x^1, \ldots, x^{k+1}), \quad x^1, \ldots, x^{k+1} \in \mathbb{R}^n. \tag{3.2}$$

Next, we apply Nesterov's smoothing technique to get an approximation of the objective function given in (3.2) which involves the two parameters $\lambda > 0$ and $\mu > 0$:

$$\begin{aligned}
f_{\lambda\mu}(X) :={}& \frac{1+\lambda}{2\mu}\sum_{i=1}^m\sum_{\ell=1}^{k+1}\|x^\ell - a^i\|^2 + \frac{1}{2\mu}\sum_{\ell=1}^k\|x^\ell - x^{k+1}\|^2 - \frac{1}{2\mu}\sum_{i=1}^m\|x^{k+1} - a^i\|^2\\
& - \frac{\lambda\mu}{2}\sum_{i=1}^m\Big[d\Big(\frac{x^{k+1} - a^i}{\mu}; B\Big)\Big]^2 - \frac{(1+\lambda)\mu}{2}\sum_{i=1}^m\sum_{\ell=1}^k\Big[d\Big(\frac{x^\ell - a^i}{\mu}; B\Big)\Big]^2\\
& - \frac{\mu}{2}\sum_{\ell=1}^k\Big[d\Big(\frac{x^\ell - x^{k+1}}{\mu}; B\Big)\Big]^2 - \sum_{i=1}^m\max_{r=1,\ldots,k}\sum_{\ell=1,\,\ell\neq r}^k\|x^\ell - a^i\| - \lambda\sum_{\ell=1}^{k+1}\max_{s=1,\ldots,m}\sum_{i=1,\,i\neq s}^m\|x^\ell - a^i\|.
\end{aligned}$$

As we will show in what follows, it is convenient to apply the DCA to minimize the function $f_{\lambda\mu}$. This function can be represented as the difference of two convex functions defined on $\mathbb{R}^{(k+1)\times n}$ using a variable matrix $X$ whose $\ell$th row is $x^\ell$ for $\ell = 1, \ldots, k+1$:

$$f_{\lambda\mu}(X) = g_{\lambda\mu}(X) - h_{\lambda\mu}(X), \quad X \in \mathbb{R}^{(k+1)\times n}.$$

In this formulation, $g_{\lambda\mu}$ and $h_{\lambda\mu}$ are convex functions defined on $\mathbb{R}^{(k+1)\times n}$ by

$$g_{\lambda\mu}(X) := g^1_{\lambda\mu}(X) + g^2_{\lambda\mu}(X)$$

and

$$h_{\lambda\mu}(X) := h^1_{\lambda\mu}(X) + h^2_{\lambda\mu}(X) + h^3_{\lambda\mu}(X) + h^4_{\lambda\mu}(X) + h^5_{\lambda\mu}(X) + h^6_{\lambda\mu}(X),$$

with their respective components given by

$$\begin{aligned}
g^1_{\lambda\mu}(X) &:= \frac{1+\lambda}{2\mu}\sum_{i=1}^m\sum_{\ell=1}^{k+1}\|x^\ell - a^i\|^2, & g^2_{\lambda\mu}(X) &:= \frac{1}{2\mu}\sum_{\ell=1}^k\|x^\ell - x^{k+1}\|^2,\\
h^1_{\lambda\mu}(X) &:= \frac{1}{2\mu}\sum_{i=1}^m\|x^{k+1} - a^i\|^2, & h^2_{\lambda\mu}(X) &:= \frac{\lambda\mu}{2}\sum_{i=1}^m\Big[d\Big(\frac{x^{k+1} - a^i}{\mu}; B\Big)\Big]^2,\\
h^3_{\lambda\mu}(X) &:= \frac{(1+\lambda)\mu}{2}\sum_{i=1}^m\sum_{\ell=1}^k\Big[d\Big(\frac{x^\ell - a^i}{\mu}; B\Big)\Big]^2, & h^4_{\lambda\mu}(X) &:= \frac{\mu}{2}\sum_{\ell=1}^k\Big[d\Big(\frac{x^\ell - x^{k+1}}{\mu}; B\Big)\Big]^2,\\
h^5_{\lambda\mu}(X) &:= \sum_{i=1}^m\max_{r=1,\ldots,k}\sum_{\ell=1,\,\ell\neq r}^k\|x^\ell - a^i\|, & h^6_{\lambda\mu}(X) &:= \lambda\sum_{\ell=1}^{k+1}\max_{s=1,\ldots,m}\sum_{i=1,\,i\neq s}^m\|x^\ell - a^i\|.
\end{aligned}$$

Let us start by computing the gradient of the first part of the DC decomposition, i.e.,

$$g_{\lambda\mu}(X) = g^1_{\lambda\mu}(X) + g^2_{\lambda\mu}(X).$$

By applying techniques similar to those used in computing gradients/subgradients for Model I, the gradient of $g^1_{\lambda\mu}$ can be written as

$$\nabla g^1_{\lambda\mu}(X) = \frac{1+\lambda}{\mu}\big[mX - EA\big],$$

where $E = E_{(k+1)m}$ is the $(k+1)\times m$ matrix whose entries are all ones. Similarly, the gradient of $g^2_{\lambda\mu}$ can also be written as

$$\nabla g^2_{\lambda\mu}(X) = \frac{1}{\mu}\big[I + T\big]X,$$

where $I$ is the $(k+1)\times(k+1)$ identity matrix, and $T$ is the $(k+1)\times(k+1)$ matrix whose entries are all zeros except its last row and last column, which are both filled by the vector $(-1, \ldots, -1, k-1)$.

It follows that

$$\begin{aligned}
\nabla g_{\lambda\mu}(X) &= \nabla g^1_{\lambda\mu}(X) + \nabla g^2_{\lambda\mu}(X)\\
&= \frac{1+\lambda}{\mu}\big[mX - EA\big] + \frac{1}{\mu}\big[I + T\big]X\\
&= \frac{1}{\mu}\big[c_1 I + T\big]X - \frac{1+\lambda}{\mu}EA,
\end{aligned}$$

where $c_1 := (1+\lambda)m + 1$. Our goal now is to compute $\nabla g^*_{\lambda\mu}$, which can be accomplished by the relation

$$X = \nabla g^*_{\lambda\mu}(Y) \quad\text{if and only if}\quad Y = \nabla g_{\lambda\mu}(X).$$

The latter can be equivalently written as

$$\big[c_1 I + T\big]X = (1+\lambda)EA + \mu Y,$$

whose solution can be computed explicitly row by row:

$$x^\ell = \frac{[(1+\lambda)EA + \mu Y]_\ell + x^{k+1}}{c_1} \quad\text{for } \ell = 1, \ldots, k,$$
$$x^{k+1} = \frac{c_1[(1+\lambda)EA + \mu Y]_{k+1} + \sum_{\ell=1}^k[(1+\lambda)EA + \mu Y]_\ell}{(c_1 + k)(c_1 - 1)},$$

where $[\,\cdot\,]_\ell$ denotes the $\ell$th row of a matrix.
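These row formulas can be verified against a direct linear solve. The sketch below is our own check, with a random right-hand side standing in for $(1+\lambda)EA + \mu Y$ and with $c_1 I + T$ built explicitly:

```python
import numpy as np

k, lam, m, n = 3, 1.5, 6, 2
c1 = (1 + lam) * m + 1
# T: zeros except the last row and column, both filled by (-1, ..., -1, k-1)
T = np.zeros((k + 1, k + 1))
T[-1, :k] = -1.0
T[:k, -1] = -1.0
T[-1, -1] = k - 1
rng = np.random.default_rng(2)
RHS = rng.standard_normal((k + 1, n))   # stands in for (1+lam)*E*A + mu*Y

# Direct solve of [c1*I + T] X = RHS
X_solve = np.linalg.solve(c1 * np.eye(k + 1) + T, RHS)

# Row-by-row closed form
x_last = (c1 * RHS[-1] + RHS[:k].sum(axis=0)) / ((c1 + k) * (c1 - 1))
X_rows = np.vstack([(RHS[:k] + x_last) / c1, x_last])
print(np.allclose(X_solve, X_rows))  # True
```

Substituting the row formulas back into the system confirms the denominator $(c_1 + k)(c_1 - 1) = c_1(c_1 + k - 1) - k$.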

In the representation

$$h_{\lambda\mu}(X) = h^1_{\lambda\mu}(X) + h^2_{\lambda\mu}(X) + h^3_{\lambda\mu}(X) + h^4_{\lambda\mu}(X) + h^5_{\lambda\mu}(X) + h^6_{\lambda\mu}(X),$$

the convex functions $h^1_{\lambda\mu}, \ldots, h^4_{\lambda\mu}$ are differentiable. The partial derivatives of $h^1_{\lambda\mu}$ are given by

 ∂h1λμ∂xℓ(X