Portfolio diversification based on ratios of risk measures

# Portfolio diversification based on ratios of risk measures

M. Barkhagen School of Mathematics, The University of Edinburgh, UK. B. Fleming Aberdeen Standard Investments, Edinburgh, UK. S. García School of Mathematics, The University of Edinburgh, UK. J. Gondzio School of Mathematics, The University of Edinburgh, UK. J. Kalcsics School of Mathematics, The University of Edinburgh, UK. J. Kroeske Aberdeen Standard Investments, Edinburgh, UK. S. Sabanis A. Staal Aberdeen Standard Investments, Edinburgh, UK.
July 1, 2019
###### Abstract

A new framework for portfolio diversification is introduced which goes beyond the classical mean-variance theory and other known portfolio allocation strategies such as risk parity. It is based on a novel concept called portfolio dimensionality and ultimately relies on the minimization of ratios of convex functions. The latter arises naturally due to our requirements that diversification measures should be leverage invariant and related to the tail properties of the distribution of portfolio returns. This paper introduces this new framework and its relationship to standardized higher order moments of portfolio returns. Moreover, it addresses the main drawbacks of standard diversification methodologies which are based primarily on estimates of covariance matrices. Maximizing portfolio dimensionality leads to highly non-trivial optimization problems with objective functions which are typically non-convex with potentially multiple local optima. Two complementary global optimization algorithms are thus presented. For problems of moderate size, a deterministic Branch and Bound algorithm is developed, whereas for problems of larger size a stochastic global optimization algorithm based on Gradient Langevin Dynamics is given. We demonstrate through numerical experiments that the introduced diversification measures possess desired properties as introduced in the portfolio diversification literature.

## 1 Introduction

The global financial crisis of 2008 induced heavy losses for most asset portfolios held by institutional investors, prompting investors to question their portfolio construction methodologies and understanding of the level of diversification that can thus be achieved. This led to increased activity in both academia and the financial industry seeking to develop new portfolio construction techniques with the goal of obtaining a well diversified portfolio. Despite the fact that an overwhelming majority of investors seeks to hold a well diversified portfolio, there is no agreed upon definition or measure of diversification in the literature or among practitioners. A common understanding is that a diversified portfolio should provide risk dissemination and be protected against large drawdowns. Expressed differently, the risk of the portfolio should not be concentrated to only a few risk factors and the tail risk of the portfolio should be controlled. In the diversification literature, the focus was initially purely on volatility reduction, but this definition cannot lead to a useful measure as hedging reduces volatility but does not lead to a better risk dissemination. Most diversification measures and construction methods in the literature are based on the covariance matrix which can be contributed to the seminal paper on mean-variance optimization by Markowitz (1952). Other approaches, with a growing literature which are used by many asset managers and institutional investors, are the risk parity approach coined by Qian (2019) (see also e.g. Qian (2011), Roncalli (2014) and Roncalli and Weisang (2016)) and the most diversified portfolio of Choueifaty and Coignard (2008) (see also Choueifaty et al. (2013)). Another portfolio diversification measure based on the covariance matrix is introduced in Meucci (2009). There, the asset universe is orthogonalized via Principal Component Analysis of the covariance matrix leading to a new universe consisting of uncorrelated so called principal portfolios. Based on the squared weighted volatilities of the portfolio in the new universe, a portfolio diversification measure based on the dispersion of the squared weighted volatilities is defined. The interpretation of this measure put forth by Meucci is that it represents the effective number of uncorrelated bets that the portfolio is exposed to.

There are two main drawbacks with the portfolio diversification approaches that are based only on the covariance matrix. Firstly, the portfolio which is produced by the diversification method critically depends on the accuracy of the estimated covariance matrix. Given a large asset universe there are many parameters that need to be estimated from a limited history of observed returns leading to potentially large estimation errors. The larger the investment universe, the more parameters in the covariance matrix need to be estimated, which leaves more room for estimation errors. A portfolio diversification methodology based on the covariance matrix such as mean-variance optimization or risk parity would put large weight on those assets with a low volatility and a negative correlation to other assets. These are exactly those assets that have the largest estimation error. The portfolio optimizer of the mean-variance approach acts like an “error maximizer” leading to an amplification of the estimation errors (see Green and Hollifield, 1992). The amplification of the estimation errors leads to very unstable weights over time and has the undesirable effect that the optimal portfolios tend to be concentrated to only a few assets. Robust estimation of covariance matrices has typically been handled in the literature by shrinkage estimators (see Ledoit and Wolf, 2003, 2004). However, as illustrated by an empirical test in DeMiguel et al. (2009), even when shrinkage is applied, the naive portfolio outperforms the mean-variance optimal portfolio out-of-sample.

Secondly, basing the portfolio diversification methodology only on the covariance matrix, or equivalently on the portfolio volatility, ignores the higher order moments of the portfolio return. This often leads to portfolios that in the volatility based frameworks appear to be well diversified but that are heavily exposed to tail risks. These are exactly the risks that should be avoided in order not to realize large drawdowns for the portfolio during events such as the global financial crisis of 2008 or more recently in 2018. As an example, consider a portfolio consisting of a volatility selling strategy and the underlying equity index. The volatility of most volatility selling strategies is lower than that of the equity index, while the negative skewness and the kurtosis are more pronounced. A portfolio construction strategy based on volatility would thus put larger weight to the volatility selling strategy in order to decrease volatility risk at the expense of being more exposed to tail risk. Such a portfolio would have suffered heavy losses during the VIX spike on 5 February 2018. On that day the VIX index experienced its largest one-day jump in its 25 year history, rising 20 points from 17.31 from the previous day’s close to 37.32 at the end of the trading day. This example highlights why we require our diversification measure to have a direct link to the tail properties of the portfolio distribution. Secondly, we require the measure to be leverage invariant as being 100% exposed to S&P 500 is as diversified as being 50% exposed to S&P 500 and leaving the rest in cash. Thus, the diversification measure should not be based on portfolio volatility or Expected Shortfall alone. After all, the best strategy to reduce volatility or Expected Shortfall is to have more capital allocated in cash but that does not increase the diversification.

In this paper we introduce a portfolio diversification framework that addresses the shortcomings of the standard frameworks in the literature that are based on the covariance matrix. The requirement that the diversification measure is leverage invariant naturally leads to measures based on ratios of convex functions. The introduced framework is related to the higher order standardized moments and tail properties of the portfolio return distribution, and hence the large drawdowns associated with market events such as the global financial crisis of 2008 are explicitly accounted for. Optimizing ratios of convex functions is a global optimization problem with potentially several local optima that are not equal to the global optimum or optima. We therefore develop two methods, one deterministic and one stochastic, for global optimization of ratios of convex functions.

## 2 Diversification measures based on ratios of convex functions

In this section we present a framework that addresses the identified shortcomings of the covariance based portfolio diversification methodologies. Without loss of generality, it is assumed henceforth that the expected returns of the assets in the asset universe are all zero.

### 2.1 Portfolio kurtosis as a measure of non-Gaussianity

The main goal of the portfolio diversification framework is to manage the distribution of the portfolio returns. In an idealised world, one would define portfolio dimensionality as the number of equally sized independent return streams in the portfolio. Arguments based on the Central Limit Theorem (CLT) imply that adding independent exposures to the portfolio leads to a portfolio whose distribution is closer to the Gaussian distribution and thereby the tail risk is reduced. Obviously, financial markets do not obey the idealized assumptions of independent and identically distributed (i.i.d.) returns of the standard CLT (see Barbieri et al., 2010, for some examples of the CLT with relaxed assumptions). The idea behind our diversification measures is to base them on the degree of non-Gaussianity of the portfolio return distribution. A portfolio with a low degree of non-Gaussianity is a well diversified portfolio, and vice versa. Measuring the degree of non-Gaussianity is directly related to the tail properties of the portfolio and naturally leads to measures which are leverage invariant. Measuring and optimizing non-Gaussianity have been thoroughly studied in the Independent Component Analysis (ICA) literature, see e.g. Hyvärinen and Oja (2000). A common measure of non-Gaussianity in the ICA literature is kurtosis, and other frequently used measures are based on neg-entropy or Kullback-Leibler divergence. Inspired by the ICA literature, we initially link the notion of a well diversified portfolio to a portfolio with a low kurtosis which implies a low (symmetric) tail risk.

Another attractive aspect of using kurtosis is that the uncertainty regarding the variance of the portfolio is controlled by the kurtosis of the portfolio (see e.g. Van der Vaart, 1998). Thus, by lowering the kurtosis of the portfolio, the problem of error maximization due to estimation error for the covariance matrix is explicitly addressed. Hence, using portfolio kurtosis as a basis for portfolio diversification addresses the two major drawbacks of the covariance based portfolio diversification methodologies. Moreover, as kurtosis is the fourth moment of standardized portfolio returns, it does not depend on the leverage of the portfolio.

### 2.2 A novel approach to portfolio dimensionality: definition and examples

Basing a portfolio diversification measure on portfolio kurtosis addresses the two main drawbacks of the covariance based approaches. However, for assets with significant negative skewness, only focusing on kurtosis could lead to suboptimal portfolios in terms of tail risk. Another reason for taking skewness into account is the strong link between an asset’s return and skewness. An empirical result from the risk premia literature is that maximizing the Sharpe ratio of a portfolio seems to be strongly linked to maximizing the negative skewness, see Lempérière et al. (2017). There are several ways to incorporate skewness into a portfolio diversification framework. In Lassance and Vrins (2018), a portfolio risk measure based on exponential Rényi entropy is used in order to incorporate higher order moments into the portfolio decision framework. Through a truncated Gram-Charlier expansion of Rényi entropy they demonstrate that their portfolio risk measure can be directly expressed as a function of portfolio skewness and kurtosis. Another approach, see e.g. Jondeau and Rockinger (2006), relies on a higher order Taylor expansion of the investor utility function, which leads to an expression in terms of the non-standardized portfolio moments. Thus, this approach suffers from the drawback of optimizing an objective function which is not invariant to leverage.

With all the above in mind, we proceed with defining a diversification framework which is invariant to leverage. It is also flexible enough to allow different objective functions, such as excess kurtosis or the square of skewness, but within a robust setting for measuring, in an appropriate sense, the level of non-Gaussianity of resulting portfolios. Furthermore, in order to measure the behaviour of the resulting portfolio relative to the behaviour of a portfolio with a certain tail risk, which is known also as a reference portfolio and which is representative of the given asset universe, we link it with the tail risk of an equally weighted portfolio of i.i.d. (reference) assets. We are then able to define the notion of portfolio dimensionality relative to the tail risk of the reference asset.

Let denote the set of all random variables with mean zero, finite -th moment, for a suitable , and either skewed and leptokurtic or Gaussian distribution. Such random variables represent the assets in a given asset universe under consideration. We define a function, which measures the level of non-Gaussianity, using such that: (i) , for any and (leverage invariant), and (ii) the function

 fY,ν(n)=ν(n∑i=1Yi),where Yi∈X are i.i.d. and n∈N,

is strictly decreasing in and . Then, one can proceed with the definition of portfolio dimensionality relative to a reference random variable by

 DZ,ν(w)=ν(Z)ν(∑ni=1wiXi), (1)

where is the return of the -th asset of the portfolio and is the corresponding weight.

Let us concentrate now on the case where is either excess kurtosis or the square of skewness. By taking into account the findings of Fleming and Kroeske (2017), where the notion of the distribution of portfolio variance (DOPV) is used and the effective size of its support is related to the spectrum of Rényi entropies, and the leverage invariance property of , one calculates for any that

 DZ,ν(w)=ν(Z)ν(1k∑ki=1Zi)ν(∑ki=1Zi)ν(∑ni=1wiXi)=kfZ,ν(k)ν(∑ni=1wiXi) (2)

where is a sequence of i.i.d. random variables such that . Thus, it is clear that in the case where the denominator in (2) is equal to then . Naturally then, one can interpret the dimensionality value as the equivalent number of independent (reference) assets. Moreover, since is a monotonically decreasing function, if

 fZ,ν(k+1)<ν(n∑i=1wiXi)

with taking a non-integer value according to a monotonic interpolation of . As a result, one observes that the higher the values for , the closer we are to a tail risk similar to the one given by a standard Gaussian. To see this, consider a large enough and . Consequently, one obtains due to (2), that the number of independent assets is increased accordingly,

 ν(n∑i=1wiXi)≤fZ,ν(k)=ν(∑ki=1Zi√kE[Z2]),(due to leverage% invariance)

and thus due to the CLT and property (ii), one observes the desired result.

### 2.3 Desirable properties of the diversification measure: Toy example

Although there is no agreed upon definition of diversification in the literature, a number of desirable properties of a diversification methodology have been proposed. In Choueifaty et al. (2013), the notion of polico invariance is introduced. Extending an asset universe by adding a positive linear combination of assets already belonging to the universe should not affect the weights to the original assets when applying the diversification methodology. A special case of polico invariance, denoted duplication invariance, considers the duplication of one of the assets in the universe. This case naturally arises in applications when one of the assets is listed on multiple exchanges. Applying the diversification methodology should produce the same portfolio irrespective of any asset in the universe being duplicated or not. In Koumou (2017) further desirable properties of diversification measures are introduced. However, some of the properties presented in Koumou (2017) are not consistent with the requirements that we have on a diversification measure. In Section 1, we introduced the requirement that the portfolio diversification measure should be leverage invariant. This contrasts one of the desired properties presented in Koumou (2017). Furthermore, in Koumou (2017), the portfolio diversification measure is required to be concave or quasi-concave. As we have argued, a leverage invariant diversification measure naturally leads to a ratio of two convex functions which in general is neither concave nor quasi-concave.

In the following, a numerical example is used to demonstrate that important desirable properties are satisfied by the newly introduced portfolio diversification measure. The demonstration is based on a toy example with a universe consisting of three assets with the following covariance matrix

 C=⎡⎢⎣1 ρ 0ρ 1 00 0 1⎤⎥⎦. (3)

As the correlation between asset one and asset two approaches one, these two assets behave as one asset and hence this corresponds to the case when one of the assets in the universe is duplicated. For this case, the weight of asset three should approach as . When , this corresponds to the case when either asset one or asset two is a perfect hedge of the other. In this case, assuming that is positive definite, the volatility of a portfolio given by the weight vector tends to a small value as . In Choueifaty et al. (2013), it is demonstrated that risk parity suffers from what the authors denote the duplication invariance problem. It is well known in the literature that the global minimum variance portfolio tends to be highly concentrated to assets with low volatility, see e.g. Clarke et al. (2013). Thus, for an asset universe where the exposure to some assets to a large extent has been hedged away, the global minimum variance portfolio tends to be highly concentrated to the hedged assets. We denote this undesirable property of the global minimum variance portfolio the hedging invariance problem.

Consistency with the duplication invariance and hedging invariance properties for the introduced diversification framework is illustrated in Figure 1 for the case when the marginal distributions of the assets can be assumed to be approximately symmetric. In this case, the non-Gaussianity of the portfolio is adequately captured by portfolio kurtosis. The consistency with the desired properties is monitored through the weight of asset three for the cases when and . The weight of asset three obtained when minimizing portfolio kurtosis is compared to the corresponding weights obtained with risk parity and from maximizing the diversification ratio introduced in Choueifaty and Coignard (2008). Since the volatilities of the three assets are equal, the portfolio obtained from maximizing the diversification ratio coincides with the global minimum variance portfolio, see Choueifaty and Coignard (2008). For risk parity and the most diversified portfolio, the weight of asset three can be solved analytically and is given by

 wRP3=2√1+ρ−(1+ρ)3−ρ, (4)

for the risk parity portfolio, and

 wDR3=1+ρ3+ρ, (5)

for the maximized diversification ratio and the global minimum variance portfolios. Thus, when , the weight of the third asset approaches for the risk parity portfolio, whereas for the maximized diversification ratio. From Figure 1, one observes that the minimum kurtosis portfolio and the maximized diversification ratio satisfy the duplication invariance property, whereas risk parity does not.

When , the volatility of a portfolio with weight vector approaches the small value . All portfolio construction methodologies that are based on only the covariance matrix will approach this solution when . The question is at which rate. From Figure 1, one observes that approaches zero at a rate which is close to linear when varies between 0 and -1. Since this corresponds to the behaviour of the global minimum variance portfolio, which suffers from the hedging invariance problem, this rate is too large when is not close to -1. Figure 1 reveals that the weight of asset three for both the minimum kurtosis and risk parity portfolios approaches zero at a slower rate compared to the diversification ratio portfolio when is not close to -1. These portfolios are thus not too heavily concentrated to the partially hedged exposure represented by asset one and asset two in our example. We conclude that the minimum kurtosis and risk parity portfolios satisfy the hedging invariance property. Hence, only the minimum kurtosis portfolio satisfies the two desired properties when the asset distributions are symmetric.

We finally stress that in this paper we do not attempt to accurately estimate higher order moments or joint distributions of assets returns. The multivariate distribution of the asset returns is modelled with a Gaussian copula and marginal distributions which allow for differing skewness and kurtosis parameters for the individual assets. By modelling the dependence structure with a Gaussian copula, we avoid the notoriously difficult task of estimating a nonlinear dependence structure between the assets. The cost of using a model with less uncertainty in the estimated parameters is that we only take linear dependence between the assets returns into account in this paper. In order to obtain a robust implementation of the framework we take the approach of assigning representative tail risk parameters for different asset groups. Based on the assigned tail risk parameters, the diversification framework then lets us measure and optimize the portfolio dimensionality for a given asset universe.

## 3 Deterministic global optimization of ratios of convex functions

There are numerous applications in finance that involve the optimization of ratios, see, e.g., Stoyanov et al. (2007). In the previous two sections we argued that formulating an appropriately defined portfolio diversification measure naturally leads to functions that are ratios of convex functions. In this section we develop a deterministic algorithm for solving such problems to global optimality.

Let be a nonempty compact convex set and consider the maximization problem

 maxx∈Xh(x),% whereh(x)=f(x)g(x) (6)

and are positive and continuous functions. In Avriel et al. (1988) it is shown that when is concave and is convex, then is a strictly quasi-concave function. Many theoretical results, as well as algorithms of convex programming, apply to the problem of maximizing a strictly quasi-concave function over a convex set (see Dinkelbach, 1967; Schaible, 1974, 1976). In particular, each local maximum is again a global maximum. For the case when and are either both convex or both concave, is in general neither a quasi-concave nor quasi-convex function and the function may have multiple local optima that are different from the global optimum.

### 3.1 Formulation of the portfolio kurtosis minimization problem

Using the notation for higher order portfolio moments introduced in Appendix A, the portfolio kurtosis as a function of the portfolio weights can be expressed as

 (7)

where is the vector of relative portfolio weights, denotes the vector of asset returns, , and and denote the covariance and fourth co-moment matrices of the asset returns, respectively. We assume that is positive definite and hence that for all non-zero . Therefore, the ratio (7) is well defined. By application of Jensen’s inequality we also have that

 E((w⊤(r−μ))4)≥(E((w⊤(r−μ))2))2and thus% w⊤M4(w⊗w⊗w)>0. (8)

The convention for the majority of papers in the fractional programming literature is to formulate the fractional program as a maximization problem. Since , for and , we formulate the portfolio kurtosis optimization problem as the following maximization problem, which we denote by (P)

 (P) maxw∈Wh(w),whereh(w)=f(w)g(w)=(w⊤M2w)2w⊤M4(w⊗w⊗w) (9)

and denotes the feasible set for the weights. Since we assume no short selling and a fully invested portfolio, the feasible set is given by . Letting denote the optimal weights, the minimum kurtosis over the feasible set is then given by . Since is positive definite, the numerator in (9) is a convex function. In Athayde and Flores (2003) it is shown that the fourth moment of the portfolio return is a convex function and hence (9) is a ratio of two convex functions.

### 3.2 Branch and Bound algorithm for global minimization of portfolio kurtosis

Global optimization of ratios of convex functions is a very difficult optimization problem and has attracted attention in the optimization research community. In this section we present a Branch and Bound (BB) algorithm for global minimization of portfolio kurtosis. The basic idea of BB is to recursively subdivide the solution space geometrically into smaller and smaller subsets, until we can either compute the optimal solution over a subset or rule out that a subset contains the global optimum. A crucial component of the algorithm and key to its efficiency, is the derivation of tight upper and lower bounds on the objective function value, both globally and locally for each subset. Examples of papers in the literature which develop BB algorithms for the special case of ratios of convex quadratic functions are Gotoh and Konno (2001), Benson (2006a) and Yamamoto and Konno (2007). The first, and to the best of our knowledge only, paper which develops a BB algorithm for global optimization of a single ratio of general convex functions is Benson (2006b). The generalized problem of optimizing a sum of ratios of convex functions has also attracted considerable attention in the literature. In Shen et al. (2013) a BB algorithm for global optimization for the sum of ratios of convex functions over a convex set is developed, while Shen et al. (2009) develop a BB algorithm for the case of optimizing the sum of ratios of convex functions when the feasible set is non-convex. Comprehensive treatments of BB algorithms for global optimization can be found in Horst and Tuy (1996) and Floudas (2000).

We apply the BB algorithm developed by Benson (2006b) to the problem of portfolio kurtosis minimization and improve the convergence rate by constructing considerably tighter bounds. In the following we first give an overview of the BB algorithm before we describe the steps of the procedure in more detail. As input to the algorithm, one chooses an error tolerance which determines the maximum allowed relative distance between the output value of the algorithm and the global optimum. The output of the algorithm is a -globally optimal solution:

###### Definition 1 (ρ-globally optimal solution)

A solution for problem (P) is called -globally optimal, if , where and is an optimal solution for (P).

The basic idea of the BB algorithm is rather simple and consists of the following elements.

##### Branching process

Consists of choosing a subset that is to be subdivided, and then applying a partitioning method for splitting this subset into two smaller subsets.

##### Upper bounding process

Consists of solving a subproblem to obtain an upper bound for the maximum of over each subset created by the branching process. Moreover, the upper bound for each subset is used to update a global upper bound for the maximum of over .

##### Lower bounding process

Consists of calculating a lower bound for the maximum of over each subset created by the branching process. Moreover, the lower bound for each subset is used to update the global lower bound for the maximum of over .

##### Fathoming process

Deletes each subset in the partition which satisfies . The algorithm stops when all subsets have been fathomed, i.e., the partition is empty.

Unlike heuristic methods, BB algorithms terminate with the guarantee that the value of the best found solution is -globally optimal. BB algorithms are however often slow, and in many cases they require computational effort that grows exponentially with the problem size. This is due to the fact that the size of the partition will grow from iteration to iteration, unless we can fathom subsets. Fathoming subsets, however, depends on the quality of the lower and, especially, the upper bound for a subset. If the upper bound is loose, then a good feasible solution found early in the search may be detected as good only much later in the partitioning process. In other words, the main computational burden of the BB algorithm typically comes from proving global optimality of a feasible point found at an early stage. Thus, in order for the BB algorithm to be efficient, it is crucial to carefully model the functions used for producing upper bounds for each subset generated by the branching process, to be able to fathom them as quickly as possible. Compared to the BB algorithm in Benson (2006b), we develop two extensions which provide much tighter upper bounds and, thereby, considerably speed up the convergence of the algorithm. Next, we will give a more detailed description of the BB algorithm applied to the problem of minimizing portfolio kurtosis.

#### 3.2.1 Branching process

The branching process splits the feasible set into successively finer partitions. We denote by the initial partition and by the partition in iteration of the BB algorithm, where is a finite index set, , and , for . Note that, strictly speaking, once we start fathoming subsets, will no longer form a partition of . However, for the ease of exposition, we will still call a partition. At the beginning of step , the partition consists of subsets not yet deleted by the algorithm. To determine the subset of to be partitioned, we follow the classical best-first rule, which selects the subset with the largest upper bound. The rationale for this rule is to pick a subset which is likely to contain a good feasible solution, which will, hopefully, allow for a quick increase in the global lower bound and thereby speed up the fathoming process. See Locatelli and Shoen (2013) for other common rules.

First, we observe that our feasible set is identical to the standard -simplex. In order to refine a partition , we follow Benson (2006b) and split the chosen subset into two halves by simplicial bisection, which is a special case of radial subdivision introduced in Horst (1976):

###### Definition 2 (Radial subdivision)

Let be an -simplex with vertex set . Choose a point which is uniquely represented by

 m=n∑i=0λivi, λi≥0 (i=0,…,n), n∑i=0λi=1,

and for each such that form the simplex obtained from by replacing the vertex by , i.e., .

A simplicial bisection is obtained by choosing as the midpoint of a longest edge of the simplex , see Figure 2 for an example. Horst and Tuy (1996) prove that the set of subsets that can be constructed from an -simplex by an arbitrary radial subdivision forms a partition of into -simplices. Hence, our subsets are again -simplices. Let denote the midpoint of one of the longest edges of and , the corresponding endpoints of this edge. In the branching process, we replace by the two -simplices with vertex sets and using simplicial bisection to obtain a refined partition .

#### 3.2.2 Upper bounding process

Let be an -simplex of the partition with vertices . Initially, we follow Benson (2006b) and overestimate the objective function by the ratio of two affine functions: one that overestimates and one that underestimates . We will improve these bounding functions in Section 3.2.5 in order to obtain tighter upper bounds and thereby increase the speed of convergence. The function in the denominator is underestimated by a first order Taylor expansion around the barycenter of the simplex according to

 gS(w)=g(^v)+∇wg(^v)(w−^v)⊤. (10)

As is a convex function, , , and, hence, is an underestimator of . The gradient of the fourth central moment of the portfolio return is given by (see Appendix A)

 ∇wg(w)=4M4(w⊗w⊗w). (11)

In order to ensure that the approximation is positive, let

 z(w)=max{α,gS(w)},whereα=minw∈Wg(w). (12)

With being convex, the minimization problem on the right-hand side can be solved efficiently.

In order to construct a linear overestimator of the function in the numerator we need the following definition given in Horst and Tuy (1996):

###### Definition 3 (Concave envelope)

The concave envelope of a function taken over a nonempty subset  of its domain is the function that satisfies:

 (i) pM is a concave function defined over the convex hull % of M, (ii) pM(x)≥p(x), for all x∈M, and (iii) if q is a concave function defined over the convex % hull of M that satisfies q(x)≥p(x) for all x∈M, then q(x)≥pM(x) for all x in the convex % hull of M.

Horst (1976) shows that when is an -simplex and is a convex function on , then is the unique affine function that coincides with at the vertices of . Denoting by the concave envelope of over , we construct the following upper bound for the maximum of over

 UB(S)=maxw∈SfS(w)z(w). (13)

Since , , and , , is equal to the optimal value of the following problem:

 (P1(S))maxt,w∈S fS(w)t s.t. t≥α, (14) t−gS(w)≥0. (15)

As is compact and the objective function is continuous, (P1()) has an optimal solution. Moreover, as the ratio of two linear functions is quasi-concave, every local optimum over the closed convex set is also a global optimum. Thus, the fractional program can be solved to global optimality with any local solver. However, as P1() has to be solved many times during the BB algorithm, we follow Benson (2006b) and reformulate the problem as follows. Each can be written as

 w=n∑i=0λivi,whereλi≥0,i=0,1,…,n, and n∑i=0λi=1. (16)

(see Horst and Tuy, 1996). As is an affine function, we then get . Substituting and adding the conditions for gives the equivalent fractional program

 (P2(S))maxt,λ,w 1tn∑i=0λif(vi) s.t. w=n∑i=0λivi, (17) n∑i=0λi=1, (18) λi≥0,i=0,1,…,n. (19)

To linearize the objective function, we apply the Charnes-Cooper transformation (Charnes and Cooper, 1962), performing the following change of variables

 u=1t,bi=λit,y=wt, (20)

where and , which results in the equivalent problem

 (P3'(S))maxu,b,y n∑i=0bif(vi) s.t. u≤1/α, (21) u⋅gS(y/u)≤1, (22) y=n∑i=0bivi, (23) n∑i=0bi−u=0, (24) bi≥0, i=0,…,n, (25) u>0.

Since is an affine function, (P3’()) is a linear program, except for the domain constraint on . However, Avriel et al. (1988) showed that when a solution to (P2()) exists, then the strict inequality can be replaced by , and we obtain the linear program

 (P3(S))maxu,b,y n∑i=0bif(vi) s.t. u≥0. (26)

This formulation can now be solved very efficiently using any linear programming solver.

Finally, the upper bound for , , is now computed as . Moreover, for each iteration , the upper bounding process also computes an upper bound for the global optimal value of the original problem (P) based on the partition :

 UBk=maxS∈QkUB(S). (27)

By construction, the upper bound is monotonically decreasing in , i.e., , .

#### 3.2.3 Lower bounding process

Denoting by the best solution of the problems (P1()) encountered up to iteration , the lower bound for the global optimal value  in iteration is given by . The bounds are monotonically increasing in : , .

#### 3.2.4 Fathoming process

Based on the lower and upper bounds produced by the algorithm, the fathoming process deletes all subsets from that are guaranteed not to contain the global optimal solution. At the beginning of each iteration, i.e., all are removed for which . If this results in being empty, then

 h(wk−1)≥(1−ρ)UBk−1≥(1−ρ)maxw∈Wh(w)=(1−ρ)h(w∗), (28)

which means that is a -globally optimal solution to problem (P). Benson (2006b) shows that when the number of iterations for the BB algorithm is infinite, it generates two sequences of points whose accumulation points are the global optimal solution for (P), and

 limk→∞LBk=limk→∞UBk=h(w∗). (29)

This result implies that whenever , the BB algorithm is finite.

The complete BB algorithm is summarized below.

Remark. For the case with additional constraints, such as position limits, the feasible set is no longer given by the standard -simplex. The extension of the algorithm to a more general case is however straightforward (see Benson, 2006b, for details).

#### 3.2.5 Improving the upper bound

Preliminary computational tests showed that the BB algorithm spends the vast majority of the computing time calculating the upper bound over the -simplex . Moreover, while it often took only a few iterations to obtain a very good lower bound on the optimal value , the upper bound was improving only very slowly. In order to achieve faster convergence for the BB algorithm, we present in the following two extensions of the algorithm presented in Benson (2006b) that lead to a much faster reduction of the global upper bound . In the first, the lower bound of the function in the denominator is improved by adding affine functions to the approximation. In the second, the upper bound of the function in the numerator is enhanced by using a generalization of the concave envelope. This generalization requires the introduction of binary variables, which means that the improved upper bound comes with the cost of having to solve a more difficult combinatorial optimization problem.

To tighten the lower bound for the function , we extend the linearization technique in (12) by adding first order Taylor expansions of around additional points in . We then define

 ~z(w)=max(α,gS(w),gR1(w),…,gRp(w)), (30)

where , . The idea of the improvement is illustrated in Figure 3 for the case when is a -simplex and .

For the general case of an -simplex , the locations of the points are chosen so that they are evenly distributed in , see Section 3.3 for more details. The resulting problem is then given as

 (P4(S))maxu,b,y n∑i=0bif(vi) s.t. u≥0, u⋅gRj(y/u)≤1,j=0,…,p. (31)

Obviously, the accuracy of the approximation increases with , at the expense of adding more linear constraints to the optimization problem.

Next, we turn our attention to improving the accuracy of the approximation of the numerator of the objective function. We start by subdividing the -simplex by radial subdivision according to Definition 2. Let the set of -simplices created by the radial subdivision be given by and the corresponding set of all vertices by . The improved upper bound is then constructed by the combination of the concave envelopes over the -simplices in . The construction is more easily illustrated by the simplest possible example in one dimension given in Figure 4. For this example, the set of -simplices and corresponding vertices are after the radial subdivision given by and , respectively. The generalized concave envelope over is constructed from the concave envelopes over and .

When calculating the concave envelope, we have to introduce binary variables , , in order to keep track of which -simplex in is active. The function representing the generalized concave envelope over the -simplex can now be formulated as

 ∑i∈Iλif(vi) ∑j∈Jqj=1, (32) λi≤∑j: vi∈Sjqj, i∈I, (33) qj∈{0,1}, j∈J. (34)

Condition (33) ensures that only ’s belonging to vertices of the -simplex that is active, i.e. for which , can be non-zero. Using the improved approximation function, we obtain the optimization problem

 (P5(S)) maxt,λ,q,w 1t∑i∈Iλif(vi) s.t.

As before, we transform this problem into a mixed-integer linear program (MILP) via Charnes-Cooper transformation through the variable transformations in (20). The last set of constraints is transformed into

 bi−∑j: vi∈Sjqju≤0, i∈I, (35)

in the new variables. The product of variables is linearized by introducing the continuous variables , , and adding the following constraints for each :

 zj≤qjα,zj≤u,zj≥u−(1−qj)/α,andzj≥0. (36)

If , then the first and last constraint ensures that , while the third only states that has to be greater than a negative number. If , then the first constraint enforces , and the second and third ensure that . Summing up, we obtain the following equivalent MILP, which we denote by (P6(S))

 (P6(S)) maxu,b,z,q,y ∑i∈Ibif(vi) s.t. u≥0.

This problem can easily be enhanced by adding linear terms to the constraints in order to obtain a better approximation of the function in the denominator. The hope is that the improved upper bound in (P6()) will lead to a sufficiently fast decrease of the global upper bound in order to compensate for the increased computational time induced by solving an MILP instead of an LP for each instance of the upper bounding process.

### 3.3 Numerical implementation

In this section we will demonstrate the convergence properties of the BB algorithm when applied to the problem of minimizing the portfolio kurtosis for an increasing number of assets. As sample problem we assume that all assets have identical marginal distributions and that all correlations between different assets are assumed to be equal. This problem instance represents a non-convex problem with multiple local optima. When the subproblems for the BB algorithm are given by the MILP (P6()), the description in Section 3.2 needs to be extended in order to produce an efficient algorithm. Numerical experiments show that radial subdivision does not improve the upper bound of sufficiently to produce an efficient algorithm. In order to further improve the upper bound of , the -simplex is instead subdivided with barycentric subdivision. Roughly speaking, the barycentric subdivision of an -simplex is obtained by radial subdivision of all -faces of dimension in decreasing order of dimension. It is also possible with partial barycentric subdivision of by restricting the radial subdivision to all -faces of dimension , with , in decreasing order of dimension (see Ahmad and Welker, 2018, for a detailed description of barycentric subdivision). The partial barycentric subdivision of a 2-simplex with corresponds to radial subdivision as illustrated in Figure 5 (a). The full barycentric subdivision of the 2-simplex is illustrated in Figure 5 (b). Numerical experimentation reveals that full barycentric subdivision is required in order to produce a sufficiently improved upper bound of for the MILP formulation. Unfortunately, this means that binary variables need to be introduced when solving the subproblems with the MILP formulation. The formulation of the optimization problem (P6()) does not change when subdividing the -simplex with barycentric subdivision instead of with radial subdivision. The only thing that changes is the set of -simplices created by the subdivision and the corresponding set of vertices.

In the following we investigate the improvements obtained in terms of iteration count and runtime, when using the enhanced LP formulation and the MILP formulation for solving the subproblems of the algorithm. The BB algorithm was implemented in MATLAB. For all LP formulations of the subproblems we use the solver CPLEX in the implementation, whereas the subproblems arising from the MILP formulation are solved with the built-in solver intlinprog in MATLAB. For all comparisons we set the parameter to . When using the enhanced LP formulation (P4()), a choice has to be made regarding how many extra constraints are added to the problem. The points defining the added constraints are distributed evenly over the subsimplex for which the subproblem is solved. Letting denote the number of added constraints per asset, one has that . We choose to distribute the points evenly between the vertices and the barycenter of . Thus, for the points are defined by the vertices. For , the points are defined by the vertices and the points

 (37)

We also experimented with distributing points evenly between the vertices but that did not bring any noticeable improvement in terms of iteration count. Naturally, adding more constraints in order to obtain a tighter lower bound should decrease the iteration count for the BB algorithm, at the cost of increasing the runtime for each of the subproblems that are solved. This trade-off is now investigated. In the following, we denote the enhanced LP formulation (P4()) by LP2 and the LP formulation (P3()) as used in Benson (2006b) by LP1.

Figure 6 (a) displays the evolution of the global lower and upper bounds of the portfolio kurtosis for the iterations of the BB algorithm applied to the three asset problem. Note that these are the inverses of the global lower and upper bounds calculated by the BB algorithm, i.e., for . Simulated return data is used to calculate the moment matrices in the objective function (9). The sample moment matrices and are calculated from simulated asset returns with NIG-distributed margins and dependence structure given by the Gaussian copula with a homogeneous correlation matrix. The problem instance is defined by the homogeneous correlation and marginal kurtosis for all the assets. Appendix B contains a description of the simulation procedure. Figure 6 (b) shows the fraction of deleted simplices for the iterations of the algorithm for the three asset problem. Note that the fraction of deleted simplices decreases if the number of deleted simplices is less than the number of simplices that are added by the subdivision procedure. From the graphs it is visible that the enhanced LP formulation LP2 with converges faster to the global optimum in terms of number of iterations compared to the original LP formulation LP1. As can be seen in Table 1, the number of iterations decreases with the number of extra constraints added to the problem. However, as displayed in Table 2 the decrease in the number of iterations is not significant enough in order to compensate for the increased runtime associated with the larger number of constraints. Thus, the enhanced LP formulation with has a lower runtime than the formulations with . Compared to the original LP formulation LP1, the runtime of LP2 with is the same for the three asset problem.

Figure 7 (a) and (b) display the evolution of the global lower and upper bounds of the portfolio kurtosis and the fraction of deleted simplices for the iterations of the BB algorithm applied to the five asset problem. The solid and dotted lines represent the evolution of the bounds and the fraction of deleted simplices for LP2 with and LP1, respectively. The graphs reveal that there is a significant decrease in iteration count when using the enhanced LP formulation LP2 compared to LP1. As for the three asset case, the iteration count for LP2 decreases with the number of added constraints . However, as can be seen in Table 2 the decrease in iteration count does not compensate for the added computational cost and hence LP2 with has the lowest runtime. LP2 with also has a lower runtime than LP1 for the five asset case, 175 seconds compared to 193 seconds.

We will now investigate the performance of the MILP formulation against the LP formulation with the lowest runtime for the five asset problem. Figure 8 (a) and (b) show the evolution of the global lower and upper bounds of the portfolio kurtosis and the fraction of deleted simplices for the two cases. The solid and dotted lines represent the evolution of the bounds and the fraction of deleted simplices for the MILP formulation and LP2 with , respectively. From Figure 8, one observes that the MILP formulation improves the global lower bound of the kurtosis, corresponding to the global upper bound for the BB algorithm, much faster than the LP formulation up to around iteration count 5,000. After that, the global lower bound of the MILP formulation improves slower than for the LP formulation. The overall iteration count is lower for the MILP formulation compared to LP2. However, the improvement in iteration count does not compensate for the increased computational cost associated with solving a MILP instead of an LP as can be seen in Table 2. The runtime for the MILP formulation can however likely be reduced by using a state-of-the art solver instead of the built-in solver in MATLAB.

For the six asset problem with homogeneous correlation , the number of iterations is 620,000 for the best performing solution method for the subproblems, LP2 with . The corresponding runtime for the six asset problem is 49,680 seconds, illustrating the exponential growth in computational effort when the BB algorithm is applied to the portfolio kurtosis minimization problem. The BB algorithm can be enhanced by developing special purpose solvers for the subproblems. Furthermore, the algorithm can be parallelized in order to further reduce the runtime. Moreover, we could combine the MILP formulation and LP2, starting with the former to quickly raise the upper bound, and then switch to LP2 to save runtime. It is, however, unlikely that any of these will admit solving problems with significantly higher number of assets than six.

## 4 Stochastic global optimization

In Section 3.2 we developed a deterministic global optimization algorithm for minimizing the inverse of the introduced portfolio diversification measures. However, as is well known and illustrated by the numerical examples in Section 3.3, the BB algorithm suffers from the curse of dimensionality and converges too slowly for problems where the number of assets exceeds six. In this section we develop a stochastic optimization algorithm for global optimization of portfolio kurtosis. The BB algorithm has the desirable property that the objective function value at the obtained solution is guaranteed to be arbitrarily close to the global minimum. For the algorithm developed in this section it is not possible to determine if the solution is a global optimum. However, the algorithm is a special case of stochastic approximation with a rich and well developed theory for convergence analysis. Since the BB algorithm is limited to problems of moderate size, the algorithm developed in this section complements the BB algorithm in the sense that it allows for tackling problems of larger size.

### 4.1 Stochastic algorithms for global optimization

There is a huge literature on global optimization algorithms, so called metaheuristic methods, for which it is not possible to guarantee that the obtained solution is a global optimum. These methods iteratively search the feasible set for the global optimum and without prior knowledge there is always the possibility that the optimal point lies in an unexplored region when the algorithm stops. Important examples of metaheuristic methods are genetic algorithms (Holland, 1975), simulated annealing (Kirkpatrick et al., 1983) and tabu search (Glover, 1986). The interested reader may consult Gendreau and Potvin (2010) for an overview of metaheuristic methods. Even though, for metaheuristic methods, it is not possible to guarantee that a global optimal point has been found, algorithms that are based on stochastic approximation have a solid theoretical foundation and in many cases non-asymptotic convergence results are available with explicit constants, see Dalalyan (2017), Durmus and Moulines (2017) and Durmus and Moulines (2018). This can be contrasted to many other popular metaheuristic methods where the theory is often incomplete or even nonexistent, see Spall (2003). A strong aspect of stochastic approximation is the rich convergence theory that has been developed over many years. It has been used to show convergence of many stochastic algorithms such as neural network backpropagation and simulated annealing. For a rigorous example where stochastic approximation methods are applied to problems in finance see Laurelle and Pages (2012).

In stochastic approximation one is concerned with finding at least one root to , based on noisy measurements of . Root finding via stochastic approximation was introduced in Robbins and Monro (1951) and important generalizations were made in Kiefer and Wolfowitz (1952). Consider the unconstrained minimization problem

 minθL(θ), (38)

where is a smooth function, which has multiple local minima. For the special case when is given by , the stochastic approximation algorithm is given by the following stochastic gradient descent (SGD) algorithm

 θk+1=θk−akH(θk,Xk+1), (39)

where is a sequence of -valued i.i.d. data and is an unbiased estimate of the gradient, i.e. . In (39), can either be a decreasing positive sequence satisfying appropriate conditions or a fixed small positive value , for any .

In many estimation problems, a full set of data is collected and (or ) is chosen by conditioning on the data. This conditioning removes the randomness from the problem and the estimation problem becomes deterministic. In the machine learning literature this is commonly referred to as the batch gradient descent algorithm, which is given by , where is the collected data and

 ¯H(θ)=1NN∑k=1H(θ,Xk). (40)

Since has multiple local minima, applying SGD to (38) may yield convergence to a local minimum of . Under broad conditions, Kushner and Yin (1997) show that (39) converges to one of the local minima of with probability 1. However, the iterates will often be trapped at a local optimum and will miss the global one. Nevertheless, SGD, or one of its various extensions, is commonly used in machine learning for optimization of Deep Neural Networks, see Goodfellow et al. (2016). When has a unique minimum, Chau et al. (2016) provide convergence results for the case with dependent data, discontinuous , and fixed step size.

The idea behind simulated annealing is that by adding an additional noise term to the iterations one can avoid getting prematurely trapped in a local minimum of . In Gelfand and Mitter (1991), the following modified SGD algorithm is analyzed

 θk+1=θk−akH(θk,Xk+1)+bkϵk+1, (41)

where is a sequence of standard -dimensional independent Gaussian random variables, and and are decreasing sequences of positive numbers tending to zero. They show that under suitable assumptions, tends to the global minimizer as in probability. In the machine learning literature, the closely related Stochastic Gradient Langevin Dynamics (SGLD) algorithm has attracted significant interest in the research community in recent years. The SGLD algorithm for global optimization can be formulated as

 θk+1=θk−akH(θk,Xk+1)+√2ak/βϵk+1, (42)

where is a sequence of standard -dimensional independent Gaussian variables and is a temperature parameter. The batch version of this algorithm, Gradient Langevin Dynamics (GLD), is correspondingly given by

 θk+1=θk−ak¯H(θk)+√2ak/βϵk+1. (43)

Assuming that the gradient is Lipschitz continuous and under further assumptions, Raginsky et al. (2017) provide a non-asymptotic analysis of SGLD and GLD applied to non-convex problems for the case when the step size is a positive constant. The analysis provides non-asymptotic guarantees for SGLD and GLD to find an approximate minimizer. The rate of convergence is further improved for both SGLD and GLD in the recent papers by Xu et al. (2018) and in Chau2019 even in the presence of dependent data streams.

### 4.2 A Gradient Langevin Dynamics algorithm for minimization of kurtosis

Motivated by the enormous progress in the aforementioned optimization algorithms, we develop a GLD algorithm for global minimization of portfolio kurtosis. Since portfolio kurtosis is the ratio of two convex functions, the batch gradient is not simply given by the average as in (40). Given a sample of observed return data for a given asset universe, the sample covariance matrix and sample fourth co-moment matrix can be estimated. The batch version of the portfolio kurtosis is then given by

 ¯h(w)=¯f(w)¯g(w)=w⊤^M4(w⊗w⊗w)(w⊤^M2w)2, (44)

where and denote the sample covariance and fourth co-moment matrices, respectively. Given the complicated form of the approximate bias for sample kurtosis, see Bao (2013), global minimization of portfolio kurtosis is not easily adapted to the algorithms in Section 4.1 which utilize a stochastic unbiased estimate of the gradient. For this reason we only develop a GLD algorithm for the global minimization problem.

The algorithms in Section 4.1 are formulated for unconstrained optimization problems and hence need to be adapted to constrained minimization over the standard -simplex. The GLD algorithm for the constrained problem is given by the following projected iterations

 wk+1=ΠW(wk−λ∇w¯h(wk)+√2λ/βϵk+1), (45)

where denotes the Euclidean projection onto the feasible set, is the fixed step size and . Euclidean projection of a point onto the standard -simplex is a quadratic program which can be solved very efficiently. See Chen and Ye (2011) for a fast and simple algorithm for computing the projection onto the standard -simplex. The gradient of the batch version of portfolio kurtosis is given by

 ∇w¯h(w)=∇w¯f(w)¯g(w)−¯f(w)∇w¯g(w)(¯g(w))2, (46)

where the explicit form of the gradient is given in Appendix A and

 ∇w¯g(w)=4(w⊤^M2w)^M2w. (47)

Most convergence results for SGLD and GLD are only applicable for algorithms without projection. A natural way to avoid the projection step in each iteration would be to extend the objective function with a convex function outside of the feasible set. Naturally, the extended objective function needs to be continuous on the boundary of the feasible set and have a continuous gradient on the boundary. However, in Tawarmalani and Sahinidis (2002) it is shown that a sufficient condition for the existence of a convex extension of a function outside of a convex feasible set, is the convexity of the function. Even if the requirement of convexity of the function to be extended is relaxed such that convexity is only required close to the boundary of the feasible set, this does not hold for portfolio kurtosis. It can easily be shown that portfolio kurtosis in general is a non-convex function on the boundary of the feasible set. Hence, it is not possible to find a convex extension of portfolio kurtosis outside of the feasible set .

When the objective function is convex, Bubeck et al. (2018) provide convergence results for the projected SGLD and GLD algorithms. In the case of a non-convex objective function, no convergence results for projected SGLD and GLD currently exist in the literature, to the best of our knowledge. It should however be mentioned that the analysis of SGLD and GLD algorithms is currently a very active research area which is gaining significant recognition amongst the Optimization and ML research community.

In order for the iterations (45) to converge, the gradient needs to be Lipschitz continuous on the domain given by the feasible set . The Hessian of is given by

 ∇2w¯h(w)= 1(¯g(w))3((¯g(w))2∇2w¯f(w)−¯g(w)(∇w¯f(w)(∇w¯g(w))⊤+∇w¯g(w)(∇w¯f(w))⊤)+ ¯g(w)¯f(w)∇2w¯g(w)+2¯f(w)∇w¯g(w)(∇w¯g(w))