# An Unethical Optimization Principle

Nicholas Beale Heather Battey Department of Mathematics, Imperial College London, 180 Queen’s Gate, London SW7 2AZ, UK Anthony C. Davison Institute of Mathematics, Ecole Polytechnique Fédérale de Lausanne, Station 8, 1015 Lausanne, Switzerland Robert S. MacKay Mathematics Institute, Zeeman Building, University of Warwick, Coventry CV4 7AL, UK
###### Abstract

If an artificial intelligence aims to maximise risk-adjusted return, then under mild conditions it is disproportionately likely to pick an unethical strategy unless the objective function allows sufficiently for this risk. Even if the proportion $\eta$ of available unethical strategies is small, the probability $p_U$ of picking an unethical strategy can become large; indeed, unless returns are fat-tailed, $p_U$ tends to unity as the strategy space becomes large. We define an Unethical Odds Ratio Upsilon ($\Upsilon$) that allows us to calculate $p_U$ from $\eta$, and we derive a simple formula for the limit of $\Upsilon$ as the strategy space becomes large. We give an algorithm for estimating $\Upsilon$ and $p_U$ in finite cases and discuss how to deal with infinite strategy spaces. We show how this principle can be used to help detect unethical strategies and to estimate $\eta$. Finally we sketch some policy implications of this work.

Keywords: AI Ethics | Artificial Intelligence | Economics | Extreme Value Theory | Financial Regulation
Significance statement: This paper formulates the Unethical Optimization Principle for AI and analytically quantifies the risk amplification involved. Under mild assumptions we show that an AI is almost certain to adopt an unethical strategy when the returns are Gaussian or have a similar thin-tailed distribution, and that although the probability that such a strategy is adopted decreases as the returns become heavier-tailed, it is still appreciably higher than the incidence of unethical strategies in the strategy space as a whole. The implications for owners and regulators are that special care must be taken, but the Principle can also be used to help root out ethically problematic strategies.

Author contributions: NB had the initial idea, formulated the Principle, co-wrote the paper, derived equation  from the analysis by AD and equation . HB indicated that the extremal types theorem could be used to quantify the risk in wide generality. RM did the initial analysis, leading to formulating the problem in terms of the Odds Ratio. AD provided most of the analysis and co-wrote the paper. All authors contributed importantly to the review and editing of the paper, and did extensive background analysis.

The authors declare no conflict of interest.

To whom correspondence should be addressed. E-mail: nicholas.beale@sciteb.com

This manuscript was compiled on 12 November 2019.

### Derivation of limiting pU

The extremal types theorem (8, Theorem 1.4.2) implies that in wide generality, the maximum $M_n$ of a random sample $Z_1,\dots,Z_n$ with cumulative distribution function $F$ may be renormalized using sequences $\{a_n\}$, with $a_n>0$, and $\{b_n\}$ in order that $(M_n-b_n)/a_n$ converges as $n\to\infty$ to a limiting random variable $X$ having a generalized extreme-value distribution. A simple sufficient condition for this is that $F$ is twice continuously differentiable with density $f$ and that the reciprocal hazard function $r(x)=\{1-F(x)\}/f(x)$ is such that $r'(x)$ converges to a constant $\xi$ as $x$ approaches the upper support point of $F$. Then we can take $b_n=F^{-1}(1-1/n)$ and $a_n=r(b_n)$, and the distribution of $X$ is

$$G_\xi(x)=\exp\left\{-(1+\xi x)_+^{-1/\xi}\right\},\quad x\in\mathbb{R}, \tag{6}$$

where $a_+=\max(a,0)$; setting $\xi=0$ gives the Gumbel distribution $\exp\{-\exp(-x)\}$. The quantity $\xi$, sometimes called the tail index, typically satisfies , with smaller values corresponding to lighter tails. If $\xi<0$, then the limiting density has an upper support point at $-1/\xi$, whereas if $\xi\geq 0$ then the limiting density has no finite upper support point, so the limiting random variable has no upper bound.
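As a quick numerical illustration of this convergence, the following R sketch compares the exact distribution of a renormalised maximum with the Gumbel limit. It assumes unit exponential returns with normalising constants $b_n=\log n$ and $a_n=1$; the function names `pgev` and `max_cdf_exp` and the evaluation points are introduced here purely for illustration.

```r
# Generalized extreme-value distribution function G_xi(x) from equation (6);
# xi = 0 is handled as the Gumbel limit exp(-exp(-x)).
pgev <- function(x, xi = 0) {
  if (xi == 0) return(exp(-exp(-x)))
  exp(-pmax(1 + xi * x, 0)^(-1 / xi))
}

# Exact distribution function of the renormalised maximum (M_n - b_n) / a_n
# for n unit exponential variables, assuming b_n = log(n) and a_n = 1.
max_cdf_exp <- function(x, n) {
  pexp(log(n) + x)^n
}

x <- c(-1, 0, 1, 2)
err <- sapply(c(10, 1000, 1e5),
              function(n) max(abs(max_cdf_exp(x, n) - pgev(x))))
err  # maximum discrepancy at the chosen points, for n = 10, 1000, 1e5
```

For the exponential distribution the discrepancy shrinks roughly like $1/n$, which is much faster than the slow Gaussian convergence mentioned in the text.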

This implies that we can write $M_n\approx b_n+a_nX$ for sufficiently large $n$, where the quality of the approximation depends on $F$; it has long been known that the convergence is extremely slow for Gaussian variables (9). A result of Khintchine (8, Theorem 1.2.3) implies that if $m=\eta S$ and $n=(1-\eta)S$ for some fixed $\eta\in(0,1)$, then as $S\to\infty$,

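$$\frac{b_m-b_n}{a_n}\to\beta_\eta=\frac{\{\eta/(1-\eta)\}^{\xi}-1}{\xi},\qquad \frac{a_m}{a_n}\to\alpha_\eta=\left(\frac{\eta}{1-\eta}\right)^{\xi},$$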

with $\beta_\eta=\log\{\eta/(1-\eta)\}$ when $\xi=0$.
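These limits are easy to check numerically. The R sketch below assumes a Pareto distribution $F(x)=1-x^{-\nu}$, $x>1$, with normalising constants $b_n=n^{1/\nu}$ and $a_n=n^{1/\nu}/\nu$ and tail index $\xi=1/\nu$, and takes $m=\eta S$ red and $n=(1-\eta)S$ green strategies. The values of $\nu$, $\eta$ and $S$ and the function names are illustrative choices, not taken from the text.

```r
# Limits alpha_eta and beta_eta from the display above (xi = 0 gives the log form).
alpha_eta <- function(eta, xi) (eta / (1 - eta))^xi
beta_eta <- function(eta, xi) {
  if (xi == 0) return(log(eta / (1 - eta)))
  ((eta / (1 - eta))^xi - 1) / xi
}

# Assumed Pareto normalising constants: b_n = n^(1/nu), a_n = b_n / nu.
b <- function(n, nu) n^(1 / nu)
a <- function(n, nu) n^(1 / nu) / nu

nu <- 3; eta <- 0.1; S <- 1e6
m <- eta * S        # number of red strategies
n <- (1 - eta) * S  # number of green strategies

c(ratio_b = (b(m, nu) - b(n, nu)) / a(n, nu), limit = beta_eta(eta, 1 / nu))
c(ratio_a = a(m, nu) / a(n, nu), limit = alpha_eta(eta, 1 / nu))
```

In the Pareto case the ratios coincide with their limits for every $S$, which makes it a convenient check; for other distributions the agreement is only asymptotic.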

To apply these results, let $M_G$ denote the maximum of $n$ independent random variables $Z_1,\dots,Z_n$ with common distribution function $F$, which represent the returns of ethical, green, strategies, and suppose that $(M_G-b_n)/a_n$ converges in distribution to a random variable $X$ as $n\to\infty$. Let $M_R$ denote the maximum of $m$ independent random variables $\Delta+(1+\gamma)Z'_j$, $j=1,\dots,m$, representing the returns of unethical, red, strategies. We suppose that $Z'_1,\dots,Z'_m$ is a random sample from $F$ and that $\Delta$ and $\gamma$ quantify the increase in mean return and in volatility for unethical returns. We briefly discuss the case where the $Z_j$ and $Z'_j$ have different distributions below. Then

$$M_R\overset{D}{=}\Delta+(1+\gamma)\max(Z'_1,\dots,Z'_m),$$

where $\overset{D}{=}$ means ‘has the same distribution as’, and as $m\to\infty$, $\{\max(Z'_1,\dots,Z'_m)-b_m\}/a_m$ will converge in distribution to a random variable $Y$ with the same distribution as $X$.

If $S$ is large enough, then we can write $M_G\approx b_n+a_nX$ and $M_R\approx\Delta+(1+\gamma)(b_m+a_mY)$, and so the probability that the best return from an unethical strategy exceeds the best return from an ethical one satisfies

$$\Pr(M_R>M_G)\to\Pr\{\beta_\eta+A(\Delta,\gamma,\eta)+(1+\gamma)\alpha_\eta Y>X\},$$

as $S\to\infty$, where $A(\Delta,\gamma,\eta)$ depends on $\Delta$, $\gamma$, $\eta$ and the normalising sequences for $F$.
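For completeness, the rearrangement behind this limit can be written out as follows. This is a sketch that assumes the large-sample approximations $M_G\approx b_n+a_nX$ and $M_R\approx\Delta+(1+\gamma)(b_m+a_mY)$, with $X$ and $Y$ independent:

$$
\begin{aligned}
M_R>M_G &\iff \Delta+(1+\gamma)(b_m+a_mY)>b_n+a_nX\\
&\iff \frac{b_m-b_n}{a_n}+\frac{\Delta+\gamma b_m}{a_n}+(1+\gamma)\frac{a_m}{a_n}\,Y>X .
\end{aligned}
$$

The first ratio converges to $\beta_\eta$ and the ratio $a_m/a_n$ converges to $\alpha_\eta$, so $A(\Delta,\gamma,\eta)$ can be read as the limit of $(\Delta+\gamma b_m)/a_n$, when that limit exists.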

We now discuss the behaviour for large $S$ of

$$\frac{\Delta}{a_n}+\frac{\gamma b_m}{a_n}=\frac{\Delta}{a_n}+\gamma\,\frac{b_m}{a_m}\,\frac{a_m}{a_n}. \tag{7}$$
• If $\xi<0$, then $a_n\to 0$ and $b_m$ tends to the finite upper support point of $F$, so $A(\Delta,\gamma,\eta)=\infty$. In this case the distributions of $M_G$ and $M_R$ become more and more concentrated for large $S$, and any advantage for red leads to it beating green with probability one, in the limit, because red returns have a higher upper limit than green ones.

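• If $\xi=0$, then $a_m/a_n\to 1$ as $S\to\infty$, so $A(\Delta,\gamma,\eta)=\lim(\Delta/a_n+\gamma b_m/a_m)$, which is infinite if $\gamma>0$. The behaviour of $\Delta/a_n$ depends on the limit of $a_n$ as $n\to\infty$. For example, if $F$ is exponential, then $a_n$ converges to a constant, whereas if $F$ is Gaussian, then $a_n\to 0$. For exponential maxima, therefore, $A(\Delta,\gamma,\eta)$ is infinite if $\gamma>0$, but is finite if $\gamma=0$, for any $\Delta$. For Gaussian maxima, $a_n\to 0$ and $b_m\to\infty$, so $A(\Delta,\gamma,\eta)=\infty$ if either of $\Delta$ or $\gamma$ is positive, i.e., if there is any systematic advantage for red strategies.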

Other limits might appear when and depend on , but one would need to consider whether this is realistic; for example, this might apply if , i.e., red strategies are a vanishingly small fraction of all possible ones. This does not seem very realistic, since presumably any ethical strategy could be tweaked slightly to make it more profitable but unethical.

Here are the details for the special cases in the main text.

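• If $F$ is Gaussian, then we can take $b_n=\Phi^{-1}(1-1/n)$, where $\Phi$ is the standard Gaussian distribution function, and $a_n=1/b_n$, giving $\xi=0$, so $\alpha_\eta=1$ and $\beta_\eta=\log\{\eta/(1-\eta)\}$. The limiting variables $X$ and $Y$ are Gumbel, and red will beat green if either $\Delta$ or $\gamma$ is positive.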

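• If $F$ is log-Gaussian, then we can take $b_n=\exp\{\Phi^{-1}(1-1/n)\}$, where $\Phi$ is the standard Gaussian distribution function, and $a_n=b_n/\log b_n$, so $\xi=0$, $\alpha_\eta=1$ and $\beta_\eta=\log\{\eta/(1-\eta)\}$. The limiting variables $X$ and $Y$ are Gumbel. Here $\Delta/a_n\to 0$ and $b_m/a_m\to\infty$, so red always beats green, owing to its higher volatility.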

• If $F$ is exponential, then $\xi=0$, $b_n=\log n$ and $a_n=1$, so $X$ and $Y$ are Gumbel, $\alpha_\eta=1$, and

$$(\Delta+\gamma b_m)/a_n=\Delta+\gamma\log S+\gamma\log\eta$$

tends to infinity unless $\gamma=0$: red beats green in the limit owing to its higher volatility.

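• If $F$ is Pareto, with $F(x)=1-x^{-\nu}$ for $x>1$, then $b_n=n^{1/\nu}$ and $a_n=b_n/\nu$, so $\xi=1/\nu$, $\Delta/a_n\to 0$ and $b_m/a_m=\nu$. Here $X$ and $Y$ have Fréchet distributions, $\exp\{-(1+x/\nu)^{-\nu}\}$ for $x>-\nu$, and as $S\to\infty$, we obtain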

$$\Pr(M_R>M_G)\to\frac{\eta(1+\gamma)^{\nu}}{1-\eta+\eta(1+\gamma)^{\nu}}. \tag{8}$$

Hence $p_U>\eta$ for large $S$ if and only if $\gamma>0$. This calculation also applies to other distributions with Pareto-like tails, such as the Student $t$. Inserting (8) into (2) yields (4).
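To make the limit (8) concrete, the following R sketch evaluates it and converts it to an odds ratio. It assumes that the Unethical Odds Ratio is the odds $p_U/(1-p_U)$ divided by the odds $\eta/(1-\eta)$; the function names and the parameter values are illustrative only.

```r
# Limiting probability (8) that the best red return beats the best green one,
# for Pareto-type returns with tail exponent nu.
limit_pU <- function(eta, gamma, nu) {
  w <- eta * (1 + gamma)^nu
  w / (1 - eta + w)
}

# Odds ratio of p_U relative to eta (assumed definition, see the lead-in).
odds_ratio <- function(pU, eta) (pU / (1 - pU)) / (eta / (1 - eta))

eta <- 0.1; nu <- 3
for (gamma in c(0, 0.1, 0.5)) {
  p <- limit_pU(eta, gamma, nu)
  cat(sprintf("gamma = %.1f: limiting p_U = %.3f, odds ratio = %.3f\n",
              gamma, p, odds_ratio(p, eta)))
}
```

Under that definition the odds ratio computed from (8) equals $(1+\gamma)^{\nu}$ whatever the value of $\eta$, so it isolates the effect of the extra volatility.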

The discussion above presupposes that the red and green returns only differ by a location and/or scale shift. If the limiting variables have the same support but different tail indexes, then the variable with the higher $\xi$ asymptotically dominates the other: if $Y$ has a higher tail index than $X$, then red returns will beat green returns with probability one for large $S$.

### Estimation

To estimate the distributions for the ethical and unethical strategies, we suppose that the sampled strategies with the highest risk-adjusted returns have been divided into unethical and ethical strategies, with respective returns $r_j$ and $g_i$, and we denote by $u$ the largest sampled return that is not among these. In our asymptotic framework the generalized Pareto distribution (GPD) (10) provides a suitable probability model for $r_j-u$ and $g_i-u$, i.e., the ‘excess’ returns over $u$. The probability density functions for the red and green excesses are

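$$\frac{1}{\tau_R}\left(1+\xi\,\frac{r_j-u}{\tau_R}\right)_+^{-1/\xi-1},\qquad \frac{1}{\tau_G}\left(1+\xi\,\frac{g_i-u}{\tau_G}\right)_+^{-1/\xi-1},$$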

for $r_j>u$ and $g_i>u$. The shape parameter $\xi$ is the same as in (6), and $\tau_R$ and $\tau_G$ are scale parameters. The effect of changes in both $\Delta$ and $\gamma$ appears in the ratio $\tau_R/\tau_G$, which will be larger than unity if there is an advantage for red returns, whereas $\xi$ should be the same for red and green subsets. This last property is helpful: $\xi$ can be hard to estimate from small samples, but inference for it will be based on all of the largest returns. The adequacy of the GPD is readily checked using standard techniques (6, Ch. 4), and the parameters can be estimated, and models compared, using standard likelihood methods (11, Ch. 4).
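A minimal sketch of such a likelihood fit in base R is given below. It assumes simulated excesses, a common shape parameter and separate scales for red and green, and uses `optim` with the scales on the log scale. The function names `gpd_nll`, `fit_common_shape` and `rgpd`, the sample sizes and the parameter values are hypothetical; in practice a dedicated extreme-value package would normally be preferred.

```r
# Negative log-likelihood for generalized Pareto excesses y (y > 0)
# with scale tau > 0 and shape xi; returns Inf outside the support.
gpd_nll <- function(y, tau, xi) {
  if (tau <= 0) return(Inf)
  if (abs(xi) < 1e-8) return(length(y) * log(tau) + sum(y) / tau)
  z <- 1 + xi * y / tau
  if (any(z <= 0)) return(Inf)
  length(y) * log(tau) + (1 / xi + 1) * sum(log(z))
}

# Fit a common shape xi with separate scales tau_R and tau_G.
fit_common_shape <- function(red, green) {
  obj <- function(p) {
    gpd_nll(red, exp(p[1]), p[3]) + gpd_nll(green, exp(p[2]), p[3])
  }
  start <- c(log(mean(red)), log(mean(green)), 0.1)
  fit <- optim(start, obj)  # Nelder-Mead by default
  c(tau_R = exp(fit$par[1]), tau_G = exp(fit$par[2]), xi = fit$par[3],
    nll = fit$value)
}

# Simulate GPD excesses by inversion (illustrative values only, xi != 0).
rgpd <- function(n, tau, xi) tau * (runif(n)^(-xi) - 1) / xi
set.seed(1)
red <- rgpd(60, tau = 1.5, xi = 0.2)
green <- rgpd(140, tau = 1.0, xi = 0.2)
est <- fit_common_shape(red, green)
est
est["tau_R"] / est["tau_G"]  # a ratio above one suggests an advantage for red
```

Fitting the same model with the two scales constrained to be equal and comparing the minimised negative log-likelihoods would give the likelihood ratio comparison mentioned in the text.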

Having obtained estimates $\hat\xi$, $\hat\tau_R$ and $\hat\tau_G$, we estimate $p_U$ by Monte Carlo simulation as follows. We generate standard uniform variables and Poisson variables with mean , all mutually independent. We then compute $M^*_r$, for $r=1,\dots,R$, and estimate $p_U$ by

$$\hat p_U=R^{-1}\sum_{r=1}^{R}\exp\left[-r_g\left\{1-\hat F_G(M^*_r)\right\}\right],$$

where $\hat F_G$ denotes the fitted cumulative distribution function for the green exceedances over $u$, which is generalized Pareto with parameters $\hat\xi$ and $\hat\tau_G$. In the simulations described below we took , which reduces variation in $\hat p_U$ to the third decimal place.

We performed a small simulation experiment to check these ideas. For different settings with normal and returns, we simulated 10,000 samples, each with and . We constructed each sample by generating , and then made red returns , with the green returns being . We took the largest returns for each sample, ascertained whether they were red or green, and obtained , and . We then fitted the GPD to the entire sample of excesses, and to the red and green excesses separately, using a common value of ; this enabled us to compute the likelihood ratio statistic for testing whether , based on the largest returns; the proportion of times this is rejected is the statistical power for testing the hypothesis at a nominal 5% significance level. If the return distributions differ greatly, then this power should be high. We also computed the empirical value of , based on whether the largest return in each sample was red or green, which would not be useful in practice, as it would equal either 0 or 1, based on the single sample available. As estimates of we computed the empirical proportion and the estimate described above, both of which would be available in practice.

Table 1 summarises the results of this experiment. The rows with show that and are both close to the expected value of 10% when there is no difference between red and green returns, and the power is close to the anticipated value, 5%. Although increases when either of or is positive, it generally has a downward bias, and appears to provide a better estimate of . On the other hand computations not shown indicate that can be highly variable, though taking reduces its variance. The power increases when or is positive, as predicted by the asymptotic theory; the power shows that when and , for example, a difference between red and green returns can be detected in around 91% of samples. For the returns, and its estimates again increase, but more modestly, and more for increased volatility, , than for increased mean, . Again, this corresponds to the asymptotic theory.

### Computation of pU

Let $n=(1-\eta)S$ and $m=\eta S$. It is straightforward to check that

$$p_U=m\int F^{n}\{\Delta+(1+\gamma)x\}\,f(x)\,F^{m-1}(x)\,dx,$$

which can be estimated by Monte Carlo simulation as follows:

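• generate $U_1,\dots,U_R$ independently from the standard uniform distribution, then set $M^*_r=F^{-1}(U_r^{1/m})$ for $r=1,\dots,R$;

• compute an estimate

$$p^*_1=R^{-1}\sum_{r=1}^{R}F\{\Delta+(1+\gamma)M^*_r\}^{n}$$

of $p_U$;

• repeat the steps above, with $U_r$ replaced by $1-U_r$, to give an estimate $p^*_2$;

• return $(p^*_1+p^*_2)/2$ as an estimate of $p_U$.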

The first step uses inversion to generate maxima directly from $F^m$, the second step averages the exact probabilities $F\{\Delta+(1+\gamma)M^*_r\}^n$, and the third and fourth steps use antithetic sampling to reduce the variance of the estimate. With $R=10^5$ this gives probabilities accurate to three decimal places almost instantaneously. The R (12) code below embodies this.

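```r
# Antithetic Monte Carlo estimate of p_U.
# F is the distribution function of the returns and Finv its inverse;
# the Gaussian defaults are only an illustrative choice.
prob.sim <- function(S, eta, delta, gamma, R = 10^5, F = pnorm, Finv = qnorm)
{
  n <- (1 - eta) * S   # number of green (ethical) strategies
  m <- eta * S         # number of red (unethical) strategies
  u <- runif(R)
  x <- Finv(u^(1 / m))                      # maxima of m draws, by inversion
  m1 <- mean(F(delta + (1 + gamma) * x)^n)
  x <- Finv((1 - u)^(1 / m))                # antithetic counterpart
  m2 <- mean(F(delta + (1 + gamma) * x)^n)
  (m1 + m2) / 2
}
```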


High-precision arithmetic may help in computing $p_U$ more accurately for very large $S$, though its precise value is rarely crucial.
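As a cross-check on the Monte Carlo estimate, the integral for $p_U$ can also be evaluated by one-dimensional quadrature. The sketch below does this for standard Gaussian returns using `integrate`, with $n=(1-\eta)S$ and $m=\eta S$ as in the code above; the Gaussian choice, the function name `pU_quad` and the parameter values are assumptions made for illustration.

```r
# p_U = m * integral of F(delta + (1 + gamma) x)^n f(x) F(x)^(m - 1) dx,
# with the integrand assembled on the log scale to limit underflow.
pU_quad <- function(S, eta, delta, gamma) {
  n <- (1 - eta) * S
  m <- eta * S
  integrand <- function(x) {
    exp(log(m) +
        n * pnorm(delta + (1 + gamma) * x, log.p = TRUE) +
        dnorm(x, log = TRUE) +
        (m - 1) * pnorm(x, log.p = TRUE))
  }
  integrate(integrand, -Inf, Inf, rel.tol = 1e-8)$value
}

pU_quad(S = 100, eta = 0.1, delta = 0, gamma = 0)      # no advantage for red
pU_quad(S = 100, eta = 0.1, delta = 0.5, gamma = 0.1)  # modest advantage for red
```

When $\Delta=\gamma=0$ the integral reduces to $m/(m+n)=\eta$, which gives a simple sanity check on any implementation.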

### Infinite strategy spaces and correlated returns

As one example of the kind of approach discussed in the paper, consider the following:

Let $C(u,v)$ denote the copula that determines the dependence of random variables $U$ and $V$ having uniform marginal distributions. One standard measure of extremal dependence is (13)

$$\chi(u)=\Pr(U>u\mid V>u)=\frac{1-2u+C(u,u)}{1-u},\quad 0<u<1,$$

where the limit $\chi=\lim_{u\to 1}\chi(u)$ is of most interest in the present context. If $\chi>0$, then $U$ and $V$ are said to be asymptotically dependent, with $\chi=1$ corresponding to total dependence and $\chi=0$ to so-called asymptotic independence. The quantity $2-\chi$ can be roughly interpreted as the equivalent number of independent extremes at high levels of $u$, so $\chi=1$ yields one ‘equivalent independent’ variable, and $\chi=0$ yields two ‘equivalent independent’ variables. Rank-based estimators for $\chi(u)$ from independent data pairs are available for high values of $u$, e.g., $u=0.95$. As these are based on the ranks, the marginal distributions of the underlying variables are irrelevant.
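One simple rank-based estimator replaces $C(u,u)$ in the display above by its empirical counterpart. The following R sketch is one way to do this; the function name `chi_hat`, the rank scaling by $n+1$ and the simulated data are assumptions for illustration, not necessarily the estimator used by the authors.

```r
# Empirical chi(u): transform each margin to an approximately uniform scale via
# ranks, estimate C(u, u) by the proportion of pairs with both components <= u,
# and plug into chi(u) = {1 - 2u + C(u, u)} / (1 - u).
chi_hat <- function(x, y, u = 0.95) {
  n <- length(x)
  U <- rank(x) / (n + 1)
  V <- rank(y) / (n + 1)
  C_uu <- mean(U <= u & V <= u)
  (1 - 2 * u + C_uu) / (1 - u)
}

set.seed(1)
z <- rnorm(5000)
chi_hat(z, z)            # identical variables: total dependence
chi_hat(z, rnorm(5000))  # independent variables: near 1 - u for finite u
```

For independent variables $C(u,u)=u^2$, so $\chi(u)=1-u$ exactly; the estimate therefore sits near 0.05 at $u=0.95$ and only reaches zero in the limit.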

To apply these ideas, suppose that $A(s)$ can be treated as a stationary process, that there is a measure of distance on the strategy space, and evaluate $A(s)$ on an equi-spaced grid, at spacing $\delta$, say. Thus we can observe the joint properties of $A(s)$ at distances $\delta$, $2\delta$ and so forth, taking $U=A(s)$ and $V=A(s+k\delta)$ for each $s$ in the grid. If we take all such distinct pairs a distance $k\delta$ apart and estimate $\chi_k$ as described above, then we can assess the dependence of the extremes of the process at lag $k$, for example by plotting the estimate $\hat\chi_k$ against $k$. This extremogram (14) will equal unity for $k=0$, and should drop to zero as $k$ increases, and thus can be used to assess the approximate number of equivalent independent values in the strategy space.

To illustrate this, we took , created a function $A(s)$ by linear interpolation between independent Gaussian variables at , and evaluated $A(s)$ on a grid with random initial value and . Figure 3 shows these plots for four simulated functions. The sampling properties of $\hat\chi_k$ for large $k$ mimic those for the usual time series correlogram in the presence of strong dependence and are not good, but the sharp decline near the origin shows precisely the behaviour we expect; it appears that extreme values of $A(s)$ would be independent of those for or perhaps , as we would anticipate from its construction. Thus if we sampled $A(s)$ at sites no closer than two units apart, the corresponding values of $A(s)$ could be taken as independent at extreme levels.

Figure 3: Four examples of $\chi_k$ for the linear interpolation process described in the text. The red points show the estimates of $\chi(0.95)$ at different lags, and the tick marks show 95% confidence intervals for individual estimates. The sharp initial decline shows that local dependence of extrema of $A(s)$ becomes negligible when $k\delta>1$ or so, as would be expected from the construction of $A(s)$.
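The construction just described can be mimicked in a few lines of R. In this sketch the number of knots, the placement of the knots at the integers, the grid spacing `delta = 0.25`, the threshold `u = 0.95` and the helper `chi_hat` are all illustrative assumptions rather than the settings behind Figure 3.

```r
# Empirical chi(u) for paired observations, based on ranks.
chi_hat <- function(x, y, u = 0.95) {
  n <- length(x)
  C_uu <- mean(rank(x) / (n + 1) <= u & rank(y) / (n + 1) <= u)
  (1 - 2 * u + C_uu) / (1 - u)
}

set.seed(2)
knots <- 0:2000                       # integer sites carrying independent Gaussians
vals <- rnorm(length(knots))
delta <- 0.25                         # assumed grid spacing
s <- seq(runif(1, 0, delta), max(knots), by = delta)
A <- approx(knots, vals, xout = s)$y  # linear interpolation between the knots

# chi at lag k uses all pairs (A(s), A(s + k * delta)) on the grid.
lags <- 0:12
chi_k <- sapply(lags, function(k) {
  n <- length(A)
  chi_hat(A[1:(n - k)], A[(1 + k):n])
})
round(cbind(distance = lags * delta, chi = chi_k), 2)
```

The estimates start at one for lag zero and should fall towards the independence value $1-u$ once the distance exceeds about two knot spacings, beyond which the interpolated values share no knots.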

Although further refinement is certainly feasible, the discussion above strongly suggests that it should be possible to identify an approximate number of ‘independent’ extrema in an infinite strategy space, under assumptions similar to those above, perhaps using a development of the ideas in Leadbetter (15).
