A Statistical Model with Qualitative Input
Chulalongkorn University
Bangkok, Thailand
Abstract
A statistical estimation model with qualitative input provides a mechanism to fuse human intuition, in the form of qualitative information, into a quantitative statistical model. We investigate statistical properties and devise a numerical computation method for a model subclass with a uniform correlation structure. We show that, within this subclass, qualitative information can be as objective as quantitative information. We also show that the correlation between each pair of variables controls the accuracy of the statistical estimate. An application to portfolio selection is discussed. The correlation, although it compromises the accuracy of the statistical estimation, affects the performance of the portfolio only minimally.
Key words: Normal distribution, conditional expectation, constrained statistical model, portfolio optimization
1 Introduction
Qualitative information can be as objective as quantitative information. To be concrete, let us consider a standardized test whose scores are normally distributed with mean equal to 100 and standard deviation equal to 10. Suppose one hundred test takers are randomly sampled from the test taker population. Suppose that, without knowing the exact score of each sampled test taker, we are informed only of their ranking. We can then be fairly confident that the fifth highest score of the sample is close to 116.45, that is, the mean plus 1.645 standard deviations, and this confidence grows with the sample size. In this sense, knowing only the ranking, which is a kind of qualitative information, can be nearly as good as knowing the exact scores, which is quantitative information.
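The test score example can be explored with a short simulation. The sketch below is only illustrative: the function name `fifth_highest_scores`, the number of trials and the seed are assumptions made here, not part of the original study. It draws many samples of 100 scores from the normal distribution with mean 100 and standard deviation 10 and records the fifth highest score of each sample.

```python
import numpy as np

def fifth_highest_scores(n_takers=100, mean=100.0, sd=10.0, n_trials=20000, seed=0):
    """Simulate many samples of test scores and return the fifth highest of each."""
    rng = np.random.default_rng(seed)
    scores = rng.normal(mean, sd, size=(n_trials, n_takers))
    # Sort each sample in ascending order; the fifth highest sits at index -5.
    return np.sort(scores, axis=1)[:, -5]

fifth = fifth_highest_scores()
print(f"average fifth highest score: {fifth.mean():.2f}")
print(f"standard deviation across samples: {fifth.std():.2f}")
```

With only 100 test takers the fifth highest score still varies by a few points from sample to sample. Raising `n_takers` while keeping the same quantile should tighten the spread, which is the behavior formalized later in Theorem 1.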
In the above example, the statistical inference model consists of a prior probability distribution of the measurement of interest and side information in the form of qualitative information. The inference model estimates the measurement from its quantitative prior conditioned on the qualitative side information. The ability to factor qualitative side information into quantitative prior information opens ways to incorporate human intuition into a quantitative data model.
From this perspective, Chiarawongse et al. (2012) introduced a portfolio optimization model with qualitative input, which combines a qualitative view of an investor with a return estimation of financial assets. Their model is derived from the vision of Black and Litterman (1991, 1992) to fuse human views into a quantitative data model. However, the view in the Black-Litterman model is also of a quantitative nature, which is deemed too demanding for human investors. In a simulation study, Chiarawongse et al. (2012) reported that a qualitative input in the form of a stock ranking, when fused into the expected return estimation, can enhance the performance of a portfolio significantly, especially as the dimension grows. However, they did not explain why such an excellent performance can be observed. In this paper, we consider the same statistical inference model. Our objective is to explain how such a statistical model with qualitative input can perform well. In the process, we also discuss the model's properties and limitations. We will see that dependence among variables compromises the performance of the model in an interesting way.
To study the properties of the proposed inference model, one requires a computation method. In abstract terms, the inference model is an integration problem over a convex polytope in high dimension, which in general is intractable numerically (Khachiyan, 1989). The solution method usually relies on Markov chain Monte Carlo (MCMC) methods (Smith, 1984; Kaufman and Smith, 1998; Lovász and Vempala, 2006a, b; Kiatsupaibul et al., 2011), which, by allowing a confidence level lower than one, can solve the problem in polynomial time.
Even though the problem cannot be solved numerically in general, for some particular problem specifications a numerical method can be devised to carry out the required integration in the model. Specifically, we consider a prior distribution that is multivariate normal with a uniform correlation structure, together with qualitative information in the form of a complete ranking. This specific problem proves to be quite challenging for MCMC. However, Kiatsupaibul et al. (2017) propose a solution method based on the recursive integration technique (Hayter, 2006) that can be adapted to solve this specific problem in . In this paper, we propose an adapted solution method based on that of Kiatsupaibul et al. (2017) to solve the problem.
The adapted solution method not only enables us to explore the properties of the model but also lends itself to practical use. An application of statistical inference with qualitative input can be found in the portfolio selection problem, where statistical estimates of future expected returns are the main decision parameters (Markowitz, 1952; Best and Grauer, 1991; Chiarawongse et al., 2012). With the solution method, we extend the simulation study by Chiarawongse et al. (2012), investigating the effects of the correlation coefficient on the portfolio performance.
The organization is as follows. In Section 2, the statistical inference model with qualitative input under investigation is defined. The main result, which is the limiting property of the model, is stated for a subclass of the model where the prior distribution is the normal distribution with a uniform correlation structure and the qualitative input is in the form of a ranking. In Section 3, the finite dimensional properties of the model are explored through the proposed recursive integration technique. In this section, the recursive integration technique adapted from Kiatsupaibul et al. (2017) is also described. In Section 4, the properties of the model when applied to a portfolio selection problem are investigated. Specifically, the effects of the model's correlation coefficient on the portfolio selection are studied. Finally, in Section 5, a conclusion is provided.
2 Statistical models with qualitative input
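Let $X = (X_1, \dots, X_n)^\top$ be a normally distributed vector with mean vector $\mu$ and covariance matrix $\Sigma$, i.e., $X \sim N(\mu, \Sigma)$. Let $S$ be a polytope defined by a set of linear inequalities,
$$S = \{x \in \mathbb{R}^n : Ax \le b\},$$
where $A$ and $b$ are an $m \times n$ real matrix and an $m \times 1$ real vector, respectively. The inference problem concerning us is to compute the conditional expectation
$$E[X \mid X \in S]. \tag{1}$$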
The polytope $S$ represents a qualitative input, and this model can be interpreted as the problem of estimating a statistical quantity given that qualitative information is available (Chiarawongse et al., 2012; Kiatsupaibul et al., 2017).
In this study, we limit ourselves to a particular model of $S$ and $\Sigma$. We restrict our attention to $S$ formed by a complete ranking, i.e.,
$$S = \{x \in \mathbb{R}^n : x_1 \le x_2 \le \cdots \le x_n\}. \tag{2}$$
For $\Sigma$, we assume that the correlation matrix has a uniform structure.
Definition 1.
A normal random vector $X = (X_1, \dots, X_n)^\top$ possesses a uniform correlation structure if the correlation coefficient between $X_i$ and $X_j$, for all $i$ and $j$ with $i \ne j$, is equal to a constant $\rho$. In addition, if $X_i$, for all $i$, also has the standard normal (marginal) distribution, $N(0,1)$, we say that $X$ possesses a standard uniform correlation structure.
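Observe that, if $X$ possesses a uniform correlation structure, it can be represented by a one-factor model as follows. Assume that $Z_0, Z_1, \dots, Z_n$ are independent and identically distributed (iid) with distribution $N(0,1)$,
$$X_i = \mu_i + \sigma_i\left(\sqrt{\rho}\, Z_0 + \sqrt{1-\rho}\, Z_i\right), \quad i = 1, \dots, n, \tag{3}$$
where $0 \le \rho < 1$. In this representation, the random variable $Z_0$ is a common factor that generates dependence among the $X_i$'s. The random vector $X$ that possesses a standard uniform correlation structure can be represented by the one-factor model in (3) with $\mu_i = 0$ and $\sigma_i = 1$, for all $i$. In what follows, for a set of random variables $Y_1, \dots, Y_n$, we write their order statistics as
$$Y_{(1)} \le Y_{(2)} \le \cdots \le Y_{(n)}.$$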
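The one-factor representation can be checked empirically. The following is a minimal sketch, assuming the one-factor form $X_i = \mu_i + \sigma_i(\sqrt{\rho}\, Z_0 + \sqrt{1-\rho}\, Z_i)$ with $Z_0, Z_1, \dots, Z_n$ iid standard normal. The function name `simulate_one_factor` and the chosen parameter values are illustrative only.

```python
import numpy as np

def simulate_one_factor(mu, sigma, rho, n_draws, seed=0):
    """Draw vectors X with X_i = mu_i + sigma_i*(sqrt(rho)*Z0 + sqrt(1-rho)*Z_i)."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    z0 = rng.standard_normal((n_draws, 1))        # common factor, shared by all components
    z = rng.standard_normal((n_draws, mu.size))   # idiosyncratic terms
    return mu + sigma * (np.sqrt(rho) * z0 + np.sqrt(1.0 - rho) * z)

x = simulate_one_factor(mu=np.zeros(4), sigma=np.ones(4), rho=0.3, n_draws=200_000)
corr = np.corrcoef(x, rowvar=False)
off_diag = corr[~np.eye(4, dtype=bool)]           # all pairwise correlations
print(off_diag.round(3))
```

Every off-diagonal entry of the sample correlation matrix should be close to the chosen value of `rho`, which is what a uniform correlation structure requires.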
Theorem 1 below demonstrates that ranking information has the potential to enhance the accuracy of a statistical estimate of a random vector in high dimension. It also sets a limit on the estimation accuracy based on the dependence among variables.
Lemma 1.
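Let $Z_1, \dots, Z_n$ be iid $N(0,1)$. Let $Z_{(i)}$ denote the $i$th order statistic of $Z_1, \dots, Z_n$. Let $\Phi$ denote the distribution function of $N(0,1)$ and $\Phi^{-1}$ denote its inverse function. We have, with probability one, for $0 < p < 1$,
$$\lim_{n \to \infty} Z_{(\lceil np \rceil)} = \Phi^{-1}(p). \tag{4}$$
Furthermore, as $n \to \infty$,
$$E\left[Z_{(\lceil np \rceil)}\right] \to \Phi^{-1}(p) \tag{5}$$
and
$$\mathrm{Var}\left[Z_{(\lceil np \rceil)}\right] \to 0. \tag{6}$$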
Theorem 1.
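For a normal random vector $X$ with standard uniform correlation structure and the complete ranking qualitative input $S$ in (2), we have, for $0 < p < 1$, as $n \to \infty$,
$$E\left[X_{\lceil np \rceil} \mid X \in S\right] \to \sqrt{1-\rho}\,\Phi^{-1}(p), \tag{7}$$
$$\mathrm{Var}\left[X_{\lceil np \rceil} \mid X \in S\right] \to \rho. \tag{8}$$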
Proof.
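Since $X$ possesses the standard uniform correlation structure, it can be represented by the one-factor model (3) with $\mu_i = 0$ and $\sigma_i = 1$ for all $i$. That is equivalent to $X_i = \sqrt{\rho}\, Z_0 + \sqrt{1-\rho}\, Z_i$. Therefore, with $Z_{(i)}$ denoting the $i$th order statistic of $Z_1, \dots, Z_n$,
$$E\left[X_{\lceil np \rceil} \mid X \in S\right] = E\left[\sqrt{\rho}\, Z_0 + \sqrt{1-\rho}\, Z_{(\lceil np \rceil)}\right] = \sqrt{1-\rho}\, E\left[Z_{(\lceil np \rceil)}\right].$$
(5) implies (7). By a similar argument, with $Z_0$ and the $Z_i$'s being independent,
$$\mathrm{Var}\left[X_{\lceil np \rceil} \mid X \in S\right] = \mathrm{Var}\left[\sqrt{\rho}\, Z_0 + \sqrt{1-\rho}\, Z_{(\lceil np \rceil)}\right],$$
and, since $Z_0$ and the $Z_i$'s are independent, $\mathrm{Cov}\left[Z_0, Z_{(\lceil np \rceil)}\right] = 0$.
Therefore,
$$\mathrm{Var}\left[X_{\lceil np \rceil} \mid X \in S\right] = \rho + (1-\rho)\, \mathrm{Var}\left[Z_{(\lceil np \rceil)}\right].$$
By (6), the last term on the right hand side of the last equation goes to zero as $n$ goes to infinity. (8) then follows. ∎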
In this model, we estimate the random variable based on ranking information by the conditional expectation given the ranking. Therefore, the conditional variance given the ranking measures the estimation accuracy, and (8) specifies the limiting accuracy of the estimate. From (8), when $\rho = 0$, the limiting conditional variance given the ranking is zero, implying that, in the limit, the conditional expectation given the ranking is a perfect estimate with no error. Also from (8), the limiting accuracy deteriorates when $\rho$ increases, suggesting that the dependence among the variables causes the estimation error in this model. In conclusion, under the standard uniform correlation structure, at small $\rho$ and with $n$ large, the estimate from the conditional expectation given the ranking achieves high accuracy, even though the information that forms the estimate is only of a qualitative nature.
Theorem 1 only specifies the limiting behavior of the estimate of the inference model with ranking input. To study the finite dimensional behavior of this model, a computation method for the conditional expectation given a ranking (1) at finite but high $n$ is required. This computation would also enable this model to be deployed in real world applications. In the next section, we describe a computational method to perform this task.
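The limiting behavior can be illustrated by simulation. The sketch below rests on two assumptions made here for illustration: the standard one-factor form $X_i = \sqrt{\rho}\, Z_0 + \sqrt{1-\rho}\, Z_i$, and the observation that, for exchangeable components, the variable occupying rank $k$ given the complete ranking has the same distribution as the $k$th order statistic. Under these assumptions, plain simulation of order statistics suffices, and no constrained sampling is needed. The function name and the sample sizes are hypothetical.

```python
import numpy as np

def rank_position_moments(n, rho, p, n_reps=2000, seed=0):
    """Monte Carlo mean and variance of the ceil(n*p)-th order statistic of X."""
    rng = np.random.default_rng(seed)
    k = int(np.ceil(n * p)) - 1                     # zero-based index of the order statistic
    z0 = rng.standard_normal((n_reps, 1))           # common factor
    z = rng.standard_normal((n_reps, n))            # idiosyncratic terms
    x = np.sqrt(rho) * z0 + np.sqrt(1.0 - rho) * z
    x_k = np.partition(x, k, axis=1)[:, k]          # k-th smallest value in each replication
    return x_k.mean(), x_k.var()

for rho in (0.0, 0.5):
    mean, var = rank_position_moments(n=1000, rho=rho, p=0.75)
    print(f"rho={rho}: mean={mean:.3f}, variance={var:.4f}")
```

For large `n`, the simulated variance should sit close to `rho`: nearly zero in the independent case, and bounded away from zero when a common factor is present.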
3 A computational method
In this section, a numerical integration method for evaluating the conditional expectation (1) is introduced. With the computation method, we study the finite dimensional behavior of the inference model either with the standard or a nonstandard uniform correlation structure.
The conditional expectation (1) ostensibly requires an $n$-dimensional integration operation. Kiatsupaibul et al. (2017) provide a recursive integration method that reduces this $n$-dimensional integration to a series of two-dimensional integration operations. The following is the recursive integration formula by Kiatsupaibul et al. (2017), adapted to the one-factor model (3).
(9) 
where and are evaluated by the following two-dimensional recursive integration formulae. Let denote the probability density function (pdf) of the standard normal distribution $N(0,1)$. For each , let denote the pdf of the normal distribution .
(10) 
where and, for , define for each ,
(11) 
The recursive integration formula for is as follows. For and , define , and when ,
where is the index of the variable in (9) whose expectation is to be evaluated.
(12) 
where and, for , define for each ,
(13) 
The conditional second moment given the ranking can be computed with the same formula by replacing when . The variance and standard deviation given the ranking can then be deduced from the conditional expectation and the conditional second moment. Refer to Kiatsupaibul et al. (2017) for the implementation and some properties of the above recursive integration formulae.
With the inference method described above, we can study the finite dimensional behavior of the rank constrained inference model with the standard uniform correlation structure. Figure 1 shows the convergence speeds of the standard deviations of the estimates to the limit implied by Theorem 1.
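In low dimension, the conditional moments can be cross-checked against a brute-force baseline. The sketch below is not the recursive integration method. It is a naive rejection sampler, assuming the one-factor form $X_i = \mu_i + \sigma_i(\sqrt{\rho}\, Z_0 + \sqrt{1-\rho}\, Z_i)$ and an ascending complete ranking $x_1 \le \cdots \le x_n$. The function name is hypothetical. It shows how the conditional standard deviation follows from the conditional first and second moments.

```python
import numpy as np

def mc_rank_conditional_moments(mu, sigma, rho, n_draws=300_000, seed=0):
    """Estimate E[X | ranking] and SD[X | ranking] by rejection sampling."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    z0 = rng.standard_normal((n_draws, 1))
    z = rng.standard_normal((n_draws, mu.size))
    x = mu + sigma * (np.sqrt(rho) * z0 + np.sqrt(1.0 - rho) * z)
    # Keep only the draws that satisfy the assumed ranking x_1 <= ... <= x_n.
    keep = np.all(np.diff(x, axis=1) >= 0.0, axis=1)
    accepted = x[keep]
    first = accepted.mean(axis=0)                   # conditional expectation
    second = (accepted ** 2).mean(axis=0)           # conditional second moment
    sd = np.sqrt(second - first ** 2)               # conditional standard deviation
    return first, sd, accepted.shape[0]

cond_mean, cond_sd, n_accepted = mc_rank_conditional_moments(
    mu=np.zeros(3), sigma=np.ones(3), rho=0.0
)
print(cond_mean.round(3), cond_sd.round(3), n_accepted)
```

For exchangeable components the acceptance rate is $1/n!$, so this baseline is practical only for very small $n$. That is the reason a dedicated integration method is needed at high dimension.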
Consider the case when $X$ possesses a standard uniform correlation structure with $\mu_i = 0$ and $\sigma_i = 1$ for all $i$. We compute the conditional standard deviation of given the complete ranking information at dimensions where the correlation coefficients are controlled at . The computations are performed by the recursive integration technique described above. The conditional standard deviations of the 0.25, 0.50, 0.75 and 1.00 quantiles of the estimates are shown in Figure 1. From Figure 1, at each $\rho$ and quantile, we observe that the conditional standard deviation (SD) decreases as the dimension $n$ grows. The reduction in the conditional SD tapers off at high dimension $n$. Since the conditional SD measures the estimation accuracy, this result implies that the estimation accuracy increases at higher dimensions, but converges to the limit imposed by Theorem 1. We also observe that the conditional SD is smaller with a smaller correlation coefficient $\rho$. This result implies that the estimation accuracy is compromised by the dependence among the variables.
From Figure 1, at a fixed $\rho$, there is no obvious difference in the graphs among the 0.25, 0.50 and 0.75 quantiles. However, the graph for the 1.00 quantile is quite different from the others. It should be noted that the limiting conditional SDs for the 0.25, 0.50 and 0.75 quantiles are all governed by Theorem 1, which states that they all converge to a constant. However, the limiting conditional SD for the 1.00 quantile is beyond the scope of Theorem 1. When $\rho = 0$, the limit of the conditional SD for the 1.00 quantile is governed by the extreme value theorem. Therefore, it is possible that the decreasing pattern of the conditional SD in the case of the 1.00 quantile is different from those in the other cases. For the case of the 1.00 quantile with $\rho > 0$, there is no limit theorem to explain the limiting behavior of the conditional SD. Nevertheless, in Figure 1, we can still observe the decreasing pattern of the conditional SD for the 1.00 quantile with $\rho > 0$ in finite dimension.
Now consider a case when possesses a nonstandard uniform correlation structure, i.e., are not identically distributed. Let , for , but ’s be different from one another. We call a qualitative input reinforcing if . On the other hand, if , we call opposing. The degree of reinforcement depends on how deep is in . We would like to observe the effect of the degree of reinforcement on the accuracy of the estimate (1).
To do so, we set as follows. Let vector be a vector whose component is
Then whose components are equispaced. Now let be
(14) 
where is the Euclidean distance of . In other words, is the equispaced vector that is scaled to have length , emanating from the origin, which is the tip point of the cone . The length can be regarded as the degree of reinforcement. It should be noted that can be negative. A negative length expresses the degree of opposition of to the input . We call the reinforcement index. Figure 2 shows the conditional SD of the estimates versus when is 0.
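The construction of the mean vector from a reinforcement index can be sketched as follows. The exact components of the equispaced vector are not reproduced in this text, so the centered choice $v_i = i - (n+1)/2$ below is an assumption made purely for illustration, as are the function names and the ascending ranking cone $x_1 \le \cdots \le x_n$.

```python
import numpy as np

def reinforcement_mean(n, length):
    """Scale an equispaced vector to a signed length (the reinforcement index)."""
    if n < 2:
        raise ValueError("need at least two components")
    # Assumed equispaced, centered components; the source's exact choice may differ.
    v = np.arange(1, n + 1, dtype=float) - (n + 1) / 2.0
    return length * v / np.linalg.norm(v)

def in_ranking_cone(x):
    """Check the assumed ascending ranking x_1 <= x_2 <= ... <= x_n."""
    return bool(np.all(np.diff(x) >= 0.0))

mu_reinforcing = reinforcement_mean(5, 2.0)
mu_opposing = reinforcement_mean(5, -2.0)
print(mu_reinforcing.round(3), in_ranking_cone(mu_reinforcing))
print(mu_opposing.round(3), in_ranking_cone(mu_opposing))
```

A positive index places the mean vector inside the ranking cone, and a negative index reverses the order of its components, which corresponds to the opposing case.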
From the left panel of Figure 2, in low dimension (), the graph of the conditional SD is tilted upward from negative to positive . This increasing pattern implies that, in low dimension, we obtain a higher accuracy of the estimates when we have opposing inputs. However, in higher dimension (), as shown in the right panel of Figure 2, the graph is relatively flat. This pattern implies that, in higher dimension, there are no differences in the accuracy between the reinforcing inputs and the opposing inputs. Figure 3 shows the conditional SD of the estimates versus when is 0.5. From the left panel of Figure 3, when , we still observe the increasing pattern of the graph in low dimension () even though it is not as distinct as that when . From the right panel of Figure 3, when the graph of the conditional SD is relatively flat in higher dimension (), similar to the case when .
4 An application to a portfolio model
An application of statistical estimation with ranking input can be found in a mean-variance portfolio selection problem. In a mean-variance portfolio context, the random variables of interest are, from a Bayesian perspective, the future expected returns on the assets. That is, in a universe of assets, is the future expected return on asset for . Best and Grauer (1991) showed that the solution to the portfolio selection problem is very sensitive to the estimate of . From Theorem 1, we learned that ranking information can enhance the accuracy of the estimates of at high dimension. Consequently, the performance of a portfolio would also be enhanced when the accuracy of the estimates of is improved by the ranking input. In Theorem 1, we see that the degree of accuracy of the estimates with ranking input is controlled by the correlation coefficient $\rho$. In this section, we investigate the effects of $\rho$ on the performance of the portfolio formed by the estimates of from the methodology put forth.
Recall a mean-variance portfolio selection problem:
(15) 
To determine an optimal portfolio weight vector , this problem requires, as inputs, , the estimate of the future expected asset return vector, and , the estimate of the covariance matrix of future asset returns. In order to single out the effect of the estimation accuracy on the optimal portfolio performance, we assume that is known, and we estimate only . In this section, we evaluate the performances of the portfolio constructed based on the return estimates with ranking input at different levels of $\rho$. We then compare the performances of the portfolio with rank constrained return estimates against some benchmarks to observe the relative performances of the portfolios when ranking information is available. We separate the study into two cases based on the covariance matrix structure, namely the standard uniform correlation structure and the nonstandard uniform correlation structure. In the standard uniform correlation structure case, the performance of the portfolio based on rank constrained return estimates can be assessed through Theorem 1. In the nonstandard uniform correlation structure case, the portfolio performance is assessed through a simulation experiment.
Standard uniform correlation structure
We first consider the scenario where the future expected asset returns possess a standard uniform correlation structure. In Theorem 1, we have seen that the correlation coefficient $\rho$ reduces the accuracy of a rank constrained estimate of the future expected asset returns. However, one can show that the performance of the portfolio is not affected by $\rho$. Consider the following example.
Example 1.
Consider two portfolio selection problems with two different return estimates and . Let and be different by a constant, i.e.,
where and are constant vectors of values and , respectively. Let us assume that the other parameters, which are the covariance matrix and the risk aversion parameter, are the same for the two portfolio problems. One can easily see that the optimal solutions to the two portfolio problems are the same. That means the optimal solution will not be affected by a parallel shift of the return input.
In the discussion prior to Theorem 1, a future expected return vector with a standard uniform correlation structure can be written in the one-factor model defined in (3). The estimation error of the return vector can be decomposed into the estimation error of the common factor and that of the idiosyncratic term . From the proof of Theorem 1, the estimation error of the idiosyncratic term is eliminated by the knowledge of a perfect ranking at high dimension, while the estimation error of the common factor remains intact. However, the estimation error from the common factor is only a parallel shift in the return estimation. As shown in Example 1, the parallel shift does not influence the optimal solution. Therefore, an increase in the correlation coefficient, even though it reduces the estimation accuracy, does not compromise the portfolio performance.
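The invariance to a parallel shift can be verified numerically. The sketch below assumes a specific form of the mean-variance problem, namely maximizing $w^\top \mu - (\gamma/2)\, w^\top \Sigma w$ subject to the budget constraint $\mathbf{1}^\top w = 1$. This form is an assumption made here, since the invariance in Example 1 relies on the weights summing to a constant. The closed-form solution and the function name are illustrative.

```python
import numpy as np

def mean_variance_weights(mu, cov, gamma):
    """Solve max_w w'mu - (gamma/2) w'cov w subject to sum(w) = 1 (assumed form)."""
    ones = np.ones_like(mu)
    inv_mu = np.linalg.solve(cov, mu)       # cov^{-1} mu
    inv_one = np.linalg.solve(cov, ones)    # cov^{-1} 1
    # Lagrange multiplier chosen so that the weights sum to one.
    lam = (ones @ inv_mu - gamma) / (ones @ inv_one)
    return (inv_mu - lam * inv_one) / gamma

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 5))
cov = a @ a.T + np.eye(5)                   # a positive definite covariance matrix
mu = rng.normal(0.05, 0.02, size=5)

w_original = mean_variance_weights(mu, cov, gamma=4.0)
w_shifted = mean_variance_weights(mu + 0.03, cov, gamma=4.0)  # parallel shift of returns
print(np.allclose(w_original, w_shifted), w_original.sum())
```

Under the budget constraint, adding the same constant to every expected return only adds a constant to the objective, so the optimal weights do not move.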
Uniform correlation structure
The characteristics of rank constrained return estimates and, hence, the performance of a portfolio, with respect to a nonstandard uniform correlation structure, are beyond the scope of Theorem 1. To assess the effect of the correlation coefficient on the portfolio performance in this case, we resort to a simulation experiment equipped with the numerical computation method described in Section 3. We extend the simulation study of Chiarawongse et al. (2012) by controlling $\rho$, the correlation coefficient defined in (3). We carry out the mean-variance portfolio selection in (15), assuming that is known and estimating only . The objective of the simulation is to compare the optimal portfolio performance with respect to three types of estimates: the prior mean, the true mean and the conditional mean with ranking input.
In our simulation study, we assume that the future expected asset return vector has a uniform correlation structure. To simulate the vector , we require the following hyperparameters: prior mean vector , the standard deviation parameter for generating the prior means , the variance parameter for generating the covariance matrix , the scaling parameter for covariance matrix and the correlation coefficient . The sequence of steps in our simulation is as follows.

1. Simulate an vector of prior means whose are iid and each one has distribution.
2. Simulate covariance matrix of asset returns as follows. First, we simulate an vector of scaling factors whose are iid and each one has distribution. We then let where is an diagonal matrix whose diagonal entry is and is an correlation matrix whose off-diagonal entries all equal the correlation coefficient $\rho$.
3. Simulate where .
4. Extract the ranking information from .
5. Form the three portfolios based on the three estimations of and measure their performances.
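Steps 1 to 4 can be sketched in code. The distributions in the sketch are placeholders: the normal prior means, the uniform scaling factors, the default parameter values and the function names are all assumptions made for illustration, and the covariance scaling parameter mentioned above is omitted. Only the construction of the covariance matrix as a diagonal scaling of a uniform correlation matrix, and the extraction of the ranking from the simulated vector, follow the description directly.

```python
import numpy as np

def uniform_correlation_cov(scales, rho):
    """Build D R D, where R has ones on the diagonal and rho elsewhere."""
    n = scales.size
    corr = np.full((n, n), rho)
    np.fill_diagonal(corr, 1.0)
    d = np.diag(scales)
    return d @ corr @ d

def simulate_instance(n, rho, prior_sd=0.05, scale_low=0.1, scale_high=0.3, seed=0):
    rng = np.random.default_rng(seed)
    prior_mean = rng.normal(0.0, prior_sd, size=n)          # Step 1 (assumed normal)
    scales = rng.uniform(scale_low, scale_high, size=n)     # Step 2 (placeholder distribution)
    cov = uniform_correlation_cov(scales, rho)
    true_mean = rng.multivariate_normal(prior_mean, cov)    # Step 3 (assumed form)
    ranking = np.argsort(true_mean)                         # Step 4: asset indices, lowest first
    return prior_mean, scales, cov, true_mean, ranking

prior_mean, scales, cov, true_mean, ranking = simulate_instance(n=10, rho=0.4)
print(ranking)
```

The ranking returned here lists asset indices from the lowest to the highest simulated expected return, which is the only information passed on to the rank constrained estimate.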
The experiments are done in this setting where the correlation coefficient and the number of assets are controlled at

,

.
In Step 5, we execute three mean-variance portfolio selection models based on the three types of future expected return estimate substituted into in (15):

Prior: The prior mean .

Clairvoyance: The true mean , a realization of the future expected asset return vector generated by the simulation.

Rank constrained: The conditional mean with ranking information
(16)
For the rank constrained model (16), we use as the perfect ranking of the future expected asset returns extracted from the simulated . The conditional expectation (16) with ranking information is computed from the prior mean , the covariance matrix and the ranking information by the recursive integration technique described in Section 3.
The optimal weight vectors for the three portfolio selection models are computed according to (15) by adopting the true as the covariance matrix. Following Chiarawongse et al. (2012), the risk aversion parameter is set to 4. We denote the optimal weight vectors corresponding to the prior model, the clairvoyance model and the rank constrained model by , and , respectively.
We solve 100 instances of the portfolio selection problem (15) with simulated parameters obtained by the simulation environment described above and with different estimates of . In each instance, we evaluate the performance of each portfolio selection model based on the Certainty Equivalent Return (CEQ) defined as
(17) 
where the weight vector varies according to each model solution (, or ). Finally, we average the performances of the 100 instances of the three portfolio models at the different values of the correlation coefficient and compare the average performance across the number of assets, as shown in Figure 4. To facilitate the comparison across levels of the correlation coefficient, we also compute the percent differences between the clairvoyance and the rank constrained models, as shown in Figure 5.
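The performance measure can be sketched as follows. Formula (17) is not reproduced in this text, so the sketch assumes the common definition of the certainty equivalent return, $w^\top \mu - (\gamma/2)\, w^\top \Sigma w$, evaluated at the true mean and the true covariance matrix. The function name and the numerical values are illustrative only.

```python
import numpy as np

def certainty_equivalent_return(weights, true_mean, cov, gamma=4.0):
    """CEQ under the assumed definition: w'mu - (gamma/2) * w'cov w."""
    return weights @ true_mean - 0.5 * gamma * (weights @ cov @ weights)

true_mean = np.array([0.10, 0.20])
cov = 0.04 * np.eye(2)

ceq_equal = certainty_equivalent_return(np.array([0.5, 0.5]), true_mean, cov)
ceq_single = certainty_equivalent_return(np.array([0.0, 1.0]), true_mean, cov)
print(ceq_equal, ceq_single)

# Averaging over simulated instances mirrors the comparison across models.
average_ceq = np.mean([ceq_equal, ceq_single])
```

Because every model is scored with the same true mean and covariance matrix, differences in CEQ reflect only the quality of the return estimate used to form the weights.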
Figure 4 shows the performance evaluated for each model at different numbers of assets . At all levels of the correlation coefficient $\rho$, the clairvoyance model achieves the highest performance, trailed closely by the rank constrained model. The prior model leads to the worst performance in every setting. This confirms the benefit of incorporating ranking information in the return estimation, as previously found in Chiarawongse et al. (2012).
Figure 5 shows the percent differences between the clairvoyance model and the rank constrained model as functions of the dimension in Panel (a) and as functions of the correlation coefficient in Panel (b). According to Figure 1, the estimation discrepancy, represented by the standard deviation of the estimator, declines in higher dimension. Panel (a) of Figure 5 shows that the performance discrepancy between the rank constrained model and the clairvoyance model also declines in a larger portfolio, as expected. From Figure 1, since the standard deviation of the estimator with ranking information increases with a larger correlation coefficient $\rho$, one may expect the performance of the rank constrained model to deteriorate commensurately with larger $\rho$. However, as seen in the case of the standard uniform correlation structure, the return estimation error from $\rho$ is largely a result of a parallel shift, which does not compromise the performance of the portfolio. Panel (b) of Figure 5 shows no visible trend in the relative portfolio performance between the rank constrained model and the clairvoyance model as $\rho$ grows larger. This result confirms that ranking information, when fused into the return estimation, can eliminate the influence of $\rho$ on the portfolio performance.
5 Conclusion
Qualitative input in the form of a ranking can enhance the accuracy of statistical estimates of variables. Especially in high dimension, when the variables of interest are independent and standard normally distributed, the statistical estimates of the variables with ranking input can be perfect, as suggested by Theorem 1. In finite dimension, an efficient numerical algorithm to compute the estimates given ranking input is available, providing tools to make use of this type of statistical estimate in practice. This computational tool also allows an investigation into the convergence speed of the estimate and the impact of the degree of reinforcement of the input. Under the uniform correlation structure, the correlation coefficient representing the dependence among the variables compromises the accuracy of the estimates. However, in the portfolio selection problem, the estimation error caused by this kind of dependence does not compromise the quality of the optimal solution.
References
 Best and Grauer (1991) Best, M. J., Grauer, R. R., 1991. On the sensitivity of mean-variance-efficient portfolios to changes in asset means: Some analytical and computational results. The Review of Financial Studies 4 (2), 315–342.
 Black and Litterman (1991) Black, F., Litterman, R., June 1991. Asset allocation: Combining investors views with market equilibrium. Journal of Fixed Income 1 (1), 7–18.
 Black and Litterman (1992) Black, F., Litterman, R., September-October 1992. Global portfolio optimization. Financial Analysts Journal, 28–43.
 Chiarawongse et al. (2012) Chiarawongse, A., Kiatsupaibul, S., Tirapat, S., Van Roy, B., 2012. Portfolio selection with qualitative input. Journal of Banking and Finance 36, 489–496.
 Hayter (2006) Hayter, A. J., 2006. Recursive integration methodologies with statistical applications. Journal of Statistical Planning and Inference 136, 2284–2296.
 Kaufman and Smith (1998) Kaufman, D. E., Smith, R. L., 1998. Direction choice for accelerated convergence in hit-and-run sampling. Operations Research 46 (1), 84–95.
 Khachiyan (1989) Khachiyan, L. G., 1989. The problem of computing the volume of polytopes is NP-hard. Uspekhi Mat. Nauk 44 (3), 199–200.
 Kiatsupaibul et al. (2017) Kiatsupaibul, S., Hayter, A. J., Liu, W., 2017. Rank constrained distribution and moment computations. Computational Statistics and Data Analysis 105, 229–242.
 Kiatsupaibul et al. (2011) Kiatsupaibul, S., Smith, R. L., Zabinsky, Z. B., 2011. An analysis of a variation of hit-and-run for uniform sampling from general regions. ACM Transactions on Modeling and Computer Simulation 21 (3), Article number 16.
 Lovász and Vempala (2006a) Lovász, L., Vempala, S. S., 2006a. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In: Proceedings of the 47th IEEE Symposium on Foundations of Computer Science (FOCS ’06). pp. 57–68.
 Lovász and Vempala (2006b) Lovász, L., Vempala, S. S., 2006b. Hit-and-run from a corner. SIAM Journal on Computing 35 (4), 985–1005.
 Markowitz (1952) Markowitz, H. M., 1952. Portfolio selection. Journal of Finance 7, 77–91.
 Smith (1984) Smith, R. L., 1984. Efficient Monte Carlo procedures for generating points uniformly distributed over bounded convex regions. Operations Research 32 (6), 1296–1308.