Context-aware Dynamic Assets Selection for Online Portfolio Selection based on Contextual Bandit
Online portfolio selection is a sequential decision-making problem in financial engineering, aiming to construct a portfolio by optimizing the allocation of wealth across a set of assets to achieve the highest return. In this paper, we present a novel context-aware dynamic assets selection problem and propose two online portfolio selection methods, Exp4.ONSE and Exp4.EGE, based on contextual bandit algorithms to adapt to the dynamic assets selection scenario. We also provide regret upper bounds for both methods, which achieve sublinear regret. In addition, we analyze the computational complexity of the proposed algorithms and show that it can be reduced to polynomial time. Extensive experiments demonstrate that our algorithms outperform existing methods on representative real-world market datasets in terms of efficiency and effectiveness.
Online portfolio selection, which aims to construct a portfolio to optimize the allocation of wealth across a set of assets, is a fundamental research problem in financial engineering. There are two key theories used to tackle this problem, i.e., the Mean Variance Theory [Markowitz1952], mainly from the finance community, and Capital Growth Theory [Hakansson and Ziemba1995], which originated from information theory. However, both of them work under the strong assumption that portfolios are constructed within a given assets combination. Their assets combination is therefore fixed: once the initial combination of assets is determined, the subsequent combination no longer changes, and only the proportions of the assets in the combination are adjusted. Due to this limitation, the existing methods, which we refer to as fixed assets selection methods, can hardly be applied to real-world applications, because investors do not want to hold a fixed assets combination all the time; the assets combination needs to be selected dynamically.
In contrast, in this paper, we consider the dynamic assets selection problem, and reconstruct the problem setting as follows. Suppose the market has rounds and assets for selection. In round (), we can construct a portfolio by arbitrarily selecting assets () from the assets to form an assets combination, and then allocating their wealth proportion to achieve the highest total return. A simple idea to address this problem is to use the historical return of all assets as a weight, and then apply the multiplicative weight update method to all alternative combinations to choose the best one, as in the Full-feedback algorithm [Ito et al.2018]. However, the number of selectable combinations then reaches . Ito et al. [2018] prove that a special case of this problem with a cardinality constraint is NP-complete. Thus the overall computational complexity is prohibitive, even though efficient fixed assets selection methods such as the Online Newton Step (ONS) [Agarwal et al.2006] and Exponential Gradient (EG) [Helmbold et al.1998] algorithms reduce the computational complexity of the allocation part. In order to reduce the computational complexity, we address the dynamic assets selection problem by partial observation. In other words, we only calculate the wealth allocation and observe the return of the selected assets combination instead of considering all assets combinations in each round, which results in a multi-armed bandit problem. Therefore, combining a bandit algorithm for the dynamic assets selection problem with an efficient fixed assets selection method for the wealth allocation can greatly reduce the computational complexity.
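To give a sense of how quickly the search space grows, the following sketch counts the candidate assets combinations. It assumes that a combination is any non-empty subset of the assets with at most a given number of members; the names `count_combinations`, `num_assets`, and `max_size` are illustrative and do not come from the paper.

```python
from math import comb


def count_combinations(num_assets: int, max_size: int) -> int:
    """Count the non-empty asset subsets with at most max_size members.

    Assumption: every such subset is an admissible assets combination.
    """
    return sum(comb(num_assets, size) for size in range(1, max_size + 1))


# Without a size limit every non-empty subset is a candidate: 2**n - 1.
unconstrained = count_combinations(20, 20)

# Capping the subset size still leaves a count that grows quickly with n.
capped = count_combinations(500, 3)
```

An algorithm that keeps one weight per combination, as a full-feedback multiplicative weight update does, has to touch all of these entries in every round, which is what partial observation avoids.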
In addition, context-awareness is important in the dynamic assets selection problem. Most portfolio algorithms are context-free and ignore the contextual information of the investment scenario. In reality, however, investors might have more information than just the price relatives observed so far. For example, some scholars [Cover and Ordentlich1996, Helmbold et al.1998] mentioned that side information such as prevailing interest rates or consumer-confidence figures can indicate which assets are likely to outperform the other assets in the portfolio. For the dynamic assets selection problem, a context-free bandit algorithm chooses an assets combination only in terms of return and ignores such information, so it may not be able to adapt to the highly non-stationary financial market. Thus, we address the context-aware dynamic assets selection problem based on contextual bandit algorithms [Auer et al.2002, Li et al.2010], which consider both the contextual information and the return, aiming at finding the optimal assets combination in a dynamically unstable market.
In this paper, we propose the Exp4.ONSE and Exp4.EGE methods to target the context-aware dynamic assets selection problem. Specifically, for selecting an assets combination, we provide a new bandit algorithm on top of a conventional contextual bandit method named Exp4 [Auer et al.2002]. In addition, for allocating the proportion of wealth, we propose an Online Newton Step Estimator (ONSE) algorithm and a more efficient Exponential Gradient Estimator (EGE) algorithm based on an original return estimation method. Table 1 summarizes the regret upper bounds and computational complexity of the two methods. The characteristics and contributions of our methods are as follows:
|Methods||Regret Upper Bound||Computational Complexity|
is number of trading rounds, is number of assets, is number of available assets combinations, and is number of experts.
We first present a novel context-aware dynamic assets selection problem in online portfolio selection, in which an arbitrary assets combination can be dynamically chosen with the hint of context to find a near-optimal assets combination and its portfolio in a highly non-stationary environment. Solving this problem exhaustively has exponential computational complexity.
To address the context-aware dynamic assets selection problem, we propose two efficient methods, Exp4.ONSE and Exp4.EGE, which construct a portfolio by dynamically selecting an assets combination and employing two original methods, ONSE and EGE, to allocate wealth.
We rigorously prove that the proposed Exp4.ONSE algorithm achieves regret of , and Exp4.EGE algorithm achieves regret of , which demonstrate that both of our algorithms achieve sublinear regret in .
By only calculating the portfolio and observing the return of the selected assets combination, our algorithms can significantly reduce the computational complexity. Specifically, when the number of assets is large, our algorithms can run in time polynomial in , , and .
Extensive experiments demonstrate that our algorithms have a lower regret in a highly non-stationary situation on the synthetic dataset and have superior performance compared with the state-of-the-art methods on real-world datasets in terms of efficiency and effectiveness.
Portfolio Selection: Online portfolio selection, which sequentially selects a portfolio over a set of assets in order to achieve certain targets, is a natural and important task for asset portfolio management. One category of algorithms based on Capital Growth Theory [Hakansson and Ziemba1995], termed “Follow-the-Winner”, tries to increase the relative weight of more successful stocks based on their historical performance and asymptotically achieves the same growth rate as that of an optimal strategy. Cover [1991] proposed the Universal Portfolio (UP) strategy, which can achieve the minimax optimal regret for this problem, but with a prohibitive complexity of , and even the optimized version reaches [Kalai and Vempala2002]. Some scholars [Agarwal et al.2006, Hazan, Agarwal, and Kale2007, Hazan and Kale2015] proposed Follow-the-Leader approaches, which solve the optimization problem with L2-norm regularization via online convex optimization techniques [Shalev-Shwartz2012] and admit much faster implementations via standard optimization methods. Helmbold et al. [1998] proposed an extremely efficient approach named EG, with time complexity per round and regret of . However, the algorithms above are fixed assets selection methods, which are not suitable for our proposed dynamic assets selection problem.
Regarding the dynamic assets selection problem, Ito et al. [2018] proposed the Full-feedback and Bandit-feedback algorithms, which introduce bandit algorithms to address this problem. However, both algorithms have exponentially large computational complexity.
Bandit Algorithm: Bandit algorithms can be divided into stochastic and adversarial algorithms. A stochastic bandit, such as UCB or linUCB [Auer, Cesa-Bianchi, and Fischer2002, Li et al.2010], is completely determined by the distributions of rewards of the respective actions. Regarding the portfolio selection problem, some scholars proposed stochastic bandit-based portfolio methods [Shen et al.2015, Shen and Wang2016, Huo and Fu2017]. However, rewards in the real world are more complex and unstable, especially when it comes to competitive scenarios such as stock trading. It is hard to assume that the returns of stocks are truly randomly generated.
In addition, an adversarial version was introduced by Auer et al. [2002], which is a more pragmatic approach because it makes no statistical assumptions about the return while still keeping the objective of competing with the best action in hindsight. They proposed the Exp3 algorithm with expected cumulative regret of . They were also the first to consider the K-armed bandit problem with expert advice, and proposed the Exp4 algorithm, on top of which we build our algorithms. Based on Exp4, there are some other improved algorithms [Beygelzimer et al.2011, Syrgkanis, Krishnamurthy, and Schapire2016, Wei and Luo2018], but they do not directly apply to our setting.
Consider a self-financed and no margin/short financial market containing assets. In each trading round, the performance of the assets can be described by a vector of return, denoted by , where is the next round’s opening price of the asset divided by its opening price on the current round. Thus the value of an investment in asset increases by or falls by times its previous value from one round to the next. A portfolio is defined by a weight vector that satisfies the constraint that every weight is non-negative and the sum of all the weights equals one, i.e., . The element of specifies the proportion of wealth allocated to the asset . Given a portfolio and the return , investors using this portfolio increase (or decrease) their wealth from one round to the next by a factor of .
The assets combination is restricted to a set of available combinations . For an assets combination , is the set of portfolios whose supports are included in , i.e., , where . In particular, Ito et al. [2018] define a special form of with cardinality constraints, . Note that when , the problem becomes the fixed assets selection problem, while it reduces to the single asset selection problem when .
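The definitions above can be illustrated with a small numerical sketch. The function names and the two-asset price table are made up for illustration; only the definitions of the return vector (ratio of consecutive opening prices), the portfolio constraint (non-negative weights summing to one), and the wealth factor (inner product of weights and returns) are taken from the text.

```python
import numpy as np


def price_relatives(open_prices: np.ndarray) -> np.ndarray:
    """Row t holds the next round's opening price over the current one."""
    return open_prices[1:] / open_prices[:-1]


def is_valid_portfolio(weights: np.ndarray, tol: float = 1e-9) -> bool:
    """Check non-negativity and that the weights sum to one."""
    return bool(np.all(weights >= -tol) and abs(weights.sum() - 1.0) <= tol)


def wealth_factor(weights: np.ndarray, returns: np.ndarray) -> float:
    """Factor by which wealth changes over one round."""
    return float(np.dot(weights, returns))


# Hypothetical opening prices for two assets over two consecutive rounds.
prices = np.array([[10.0, 20.0],
                   [11.0, 19.0]])
r = price_relatives(prices)[0]   # asset 1 rises by 10%, asset 2 falls by 5%
x = np.array([0.5, 0.5])         # equal-weight portfolio
factor = wealth_factor(x, r)     # 0.5 * 1.1 + 0.5 * 0.95
```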
Let denote the weight of portfolio for a round . From , a portfolio strategy increases the initial wealth by a factor of , namely, the final cumulative wealth after a sequence of rounds is . Since the model assumes multi-period investment, we define the exponential growth rate according to the Capital Growth Theory [Hakansson and Ziemba1995] as , where .
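As a rough illustration of the multi-period quantities, the sketch below multiplies the per-round wealth factors to obtain the final cumulative wealth and sums their logarithms to obtain the growth. It assumes growth is measured by the logarithm of the wealth factor and applies no normalization by the number of rounds; the function and variable names are illustrative.

```python
import numpy as np


def per_round_factors(weights: np.ndarray, returns: np.ndarray) -> np.ndarray:
    """Inner product of the portfolio and the return vector in every round.

    Both arrays have shape (rounds, assets).
    """
    return np.einsum("tn,tn->t", weights, returns)


def cumulative_wealth(weights, returns, initial_wealth: float = 1.0) -> float:
    """Final wealth after applying every round's factor in sequence."""
    return initial_wealth * float(np.prod(per_round_factors(weights, returns)))


def log_growth(weights, returns) -> float:
    """Sum of per-round log factors (assumed, unnormalized growth measure)."""
    return float(np.sum(np.log(per_round_factors(weights, returns))))


# Two rounds, two assets, hypothetical numbers.
W = np.array([[0.5, 0.5],
              [1.0, 0.0]])
R = np.array([[1.1, 0.9],
              [1.2, 0.8]])
final_wealth = cumulative_wealth(W, R)   # factors 1.0 and 1.2
growth = log_growth(W, R)
```

Working with the log growth turns the product of factors into a sum, which is the form in which the regret against a fixed comparator strategy is usually analyzed.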
Let denote the optimal fixed strategy for rounds, i.e., . The performance of our portfolio selection is measured by , which we call regret. Then the regret of the algorithm can be expressed as:
We follow the same assumption in [Ito et al.2018] that is bounded in a closed interval , where and are constants satisfying , but we do not make any statistical assumption about the behavior of .
In general, some other assumptions are made in the above widely adopted models: (1) Transaction cost: there are no transaction costs or taxes. (2) Market liquidity: one can buy and sell any quantity of any asset at its closing price. (3) Impact cost: market behavior is not affected by any portfolio selection strategy.
|Number of assets|
|Number of trading rounds|
|Number of experts|
|Index of trading rounds|
|Available Assets combinations|
|Portfolio weight vector|
|The optimal assets combination|
|The optimal portfolio weight vector|
Table 2 lists the key notations in this paper.
In this section, we first introduce the strategy for context-aware dynamic assets selection problem. Then we describe how to calculate the proportion of wealth allocation. Finally, we summarize the two proposed algorithms.
Context-aware Dynamic Assets Selection
To solve the context-aware dynamic assets selection problem, we propose a new bandit algorithm based on the Exp4 algorithm, which considers both the historical return and the context of the assets when choosing an assets combination.
We start by standardizing the following keywords:
Expert: We consider a mapping function as an expert, is trained by some supervised learning methods.
Probability vector: The probability vector represents a probability distribution over the assets combinations, as recommended by an expert. In this paper, we assume there are experts, and , where for each .
Expert authority vector: The expert authority vector is a vector used to score experts. The higher the score, the closer the expert is to the best expert. is initialized to the uniform distribution .
Let be the reward vector. For each expert, the associated expert gives the probability vector according to , where is the context available in round . Note that since the construction of is not strict and unique, and does not affect the regret analysis, we will only provide one of the practices in our experiment.
The algorithm’s goal in this case is to combine the advice of the experts in such a way so that its return is close to that of the best expert. In each round , the experts make their recommendations , and the algorithm chooses an assets combination based on their comprehensive recommendations, outputs as the portfolio weight vector. Finally, the algorithm updates the expert authority vector based on the reward vector and proceeds to the next round.
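The round structure described above (collect advice, sample a combination, update the expert authority) follows the pattern of the textbook Exp4 algorithm of Auer et al. [2002]. The sketch below implements that textbook version, not the modified variant proposed in this paper. It assumes rewards scaled to [0, 1], an exploration parameter called `gamma`, and a fixed advice matrix for the demo; all class, method, and variable names are illustrative.

```python
import numpy as np


class Exp4:
    """Textbook Exp4: mix expert advice, sample an arm, reweight experts."""

    def __init__(self, num_experts: int, num_arms: int, gamma: float, seed: int = 0):
        self.log_w = np.zeros(num_experts)   # uniform initial authority
        self.num_arms = num_arms
        self.gamma = gamma
        self.rng = np.random.default_rng(seed)

    def authority(self) -> np.ndarray:
        """Normalized expert authority vector (computed stably in log space)."""
        w = np.exp(self.log_w - self.log_w.max())
        return w / w.sum()

    def arm_probabilities(self, advice: np.ndarray) -> np.ndarray:
        """advice has shape (experts, arms); each row is a distribution."""
        mix = self.authority() @ advice
        return (1.0 - self.gamma) * mix + self.gamma / self.num_arms

    def select(self, advice: np.ndarray):
        probs = self.arm_probabilities(advice)
        arm = int(self.rng.choice(self.num_arms, p=probs))
        return arm, probs

    def update(self, advice, arm: int, reward: float, probs) -> None:
        """Importance-weight the observed reward and credit each expert."""
        est = np.zeros(self.num_arms)
        est[arm] = reward / probs[arm]
        self.log_w += self.gamma * (advice @ est) / self.num_arms


# Demo: expert 0 always recommends the only rewarding arm, expert 1 does not.
advice = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
rewards = np.array([1.0, 0.0, 0.0])
learner = Exp4(num_experts=2, num_arms=3, gamma=0.1)
for _ in range(200):
    arm, probs = learner.select(advice)
    learner.update(advice, arm, rewards[arm], probs)
```

In the portfolio setting each arm would correspond to an assets combination and the reward to a suitably scaled return of the portfolio chosen for that combination; that mapping is left out here.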
For maintaining the proportion of wealth allocation, we construct an estimated return and propose an ONSE algorithm based on ONS [Agarwal et al.2006]. In the computation of , we first define the gradient (denoted as ) of the reward function:
Then we give the formulas of the following vector and matrix, which utilize the gradient and the Hessian:
Since an observer does not have to observe all the entries of , we do not always need to update for all . In order to deal with this problem, we construct unbiased estimators of and for each as follows:
where is the probability of choosing in round . Note that and can be calculated from the observed information alone. Using these unbiased estimators, we compute the portfolio vectors by ONS as follows:
where is the projection in the norm induced by as
In addition, we use the smoothened portfolio to modify the ONSE algorithm. Note that for . The ONSE algorithm in round is described in Algorithm 1.
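The unbiased estimators used above rely on importance weighting: the quantity observed for the chosen combination is divided by the probability of choosing it, and every other entry is set to zero. The sketch below checks the unbiasedness exactly by averaging over all possible choices. The array `values` stands in for any per-combination quantity (for example a gradient vector); the names and numbers are illustrative.

```python
import numpy as np


def importance_weighted(values: np.ndarray, chosen: int, probs) -> np.ndarray:
    """Estimate all rows of `values` when only row `chosen` is observed."""
    est = np.zeros_like(values, dtype=float)
    est[chosen] = values[chosen] / probs[chosen]
    return est


def expected_estimate(values: np.ndarray, probs) -> np.ndarray:
    """Exact expectation of the estimator over the random choice."""
    return sum(probs[j] * importance_weighted(values, j, probs)
               for j in range(len(probs)))


# Three hypothetical combinations, each with a 2-dimensional quantity.
values = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])
probs = np.array([0.2, 0.3, 0.5])
mean_est = expected_estimate(values, probs)   # equals `values`
```

The estimator is unbiased only when every combination has positive selection probability, and its variance grows as that probability shrinks, which is one reason bandit algorithms mix in some uniform exploration.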
Model 1: Exp4.ONSE
The Exp4.ONSE method updates the probability of choosing by the Exp4 algorithm and portfolios by ONSE respectively. Hence the convex quadratic programming problem is solved only once in a round. The entire algorithm is summarized in Algorithm 2.
Model 2: Exp4.EGE
In order to make the algorithm more efficient, we transform the EG algorithm, which runs in time per round, into a dynamic assets selection method.
Exp4.EGE updates the probability of choosing by the Exp4 algorithm and updates portfolios by EGE. In EGE algorithm, we construct unbiased estimators of for each by . Then we compute the portfolio vector using the unbiased estimator by exponential gradient:
where is the learning rate and . Note that for .
Therefore, we only need to replace the ONSE algorithm in the Exp4.ONSE algorithm with the EGE algorithm to get the Exp4.EGE algorithm. The EGE algorithm in round is described in Algorithm 3.
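A single EGE-style update can be sketched as the multiplicative EG step of Helmbold et al. [1998] applied to an importance-weighted gradient. This is a minimal sketch under stated assumptions, not the paper's Algorithm 3: it assumes the gradient of the log return is divided by the probability of having selected the combination, it omits any smoothing of the portfolio, and the names `ege_step`, `support`, `select_prob`, and `eta` are illustrative.

```python
import numpy as np


def ege_step(weights: np.ndarray, returns: np.ndarray, support,
             select_prob: float, eta: float) -> np.ndarray:
    """One exponential-gradient step restricted to the assets in `support`.

    weights: current portfolio, zero outside `support`, summing to one.
    returns: return vector of the round.
    select_prob: probability with which this combination was chosen.
    """
    x = weights[support]
    r = returns[support]
    grad = r / float(x @ r)            # gradient of log(x . r) w.r.t. x
    grad_hat = grad / select_prob      # importance-weighted estimate
    unnorm = x * np.exp(eta * grad_hat)
    new_weights = np.zeros_like(weights, dtype=float)
    new_weights[support] = unnorm / unnorm.sum()   # renormalize on the support
    return new_weights


# Hypothetical round: three assets, combination {0, 1} chosen with probability 0.5.
x_t = np.array([0.5, 0.5, 0.0])
r_t = np.array([1.2, 0.8, 1.0])
x_next = ege_step(x_t, r_t, support=[0, 1], select_prob=0.5, eta=0.05)
```

The update costs time linear in the size of the combination and needs no projection step, since the normalization keeps the weights on the simplex.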
Regret Upper Bound
In the following, we let , and the regret can be expressed as:
Our algorithm achieves the regret described below for arbitrary inputs, where constants are given by , , and .
We now introduce the first theorem, which is about the regret upper bounds of Exp4.ONSE algorithm.
For any , any , , and , Exp4.ONSE achieves the following regret bound:
Setting and , we obtain
And we have the second theorem about the upper regret bound of Exp4.EGE algorithm.
For any , any , algorithm Exp4.EGE achieves the following regret bound:
Setting and , we obtain
The proofs can be found in the supplementary material.
Ito et al. [2018] show that, unless the complexity class BPP includes NP, an algorithm that achieves regret for arbitrary and arbitrary , such as the Full-feedback algorithm, cannot run in polynomial time. Our algorithms, which are based on bandit algorithms, reduce the computational complexity and obtain an approximate solution.
Algorithm Exp4.ONSE runs in time. In each round , from the definition of and in Eq(4), the update of given by Eq(5) is needed only for . So ONSE can be computed in time per round [Agarwal et al.2006]. Furthermore, for , both updating and computing can be performed in time. This implies that sampling can be performed in time. Since in practice the total number of rounds can be smaller than the number of assets combinations, we also use a key-value pair structure to implement the algorithm, which further reduces the computational and space complexity and can reduce to . Thus the space complexity is .
Similarly, Algorithm Exp4.EGE runs in time per round, since EGE can be computed in time per round [Helmbold et al.1998].
In this section, we introduce the extensive experiments conducted on two synthetic datasets and four real-world datasets, which aim to answer the following questions:
Q1: How does the non-stationary environment affect the regret of our Exp4.ONSE and Exp4.EGE methods? (see Experiment 1)
Q2: How efficient are our algorithms compared with state-of-the-art approaches? (see Experiment 2)
Q3: How do our approaches outperform the state-of-the-art approaches on real-world financial markets? (see Experiment 3)
Data Collection: Similarly to [Ito et al.2018, Huo and Fu2017], we used a synthetic dataset to evaluate the regret. We also conducted our experiments on four real-world datasets from financial markets to evaluate the performance.
(1) Synthetic dataset is generated as follows: given parameters , , ,, and , we generate and randomly divide stages for the rounds. In each stage, we choose an asset from that .
(2) Fama and French (FF) datasets have been widely recognized as high-quality and standard evaluation protocols [Fama and French1992] and have an extensive coverage of assets classes and span a long period. FF25 and FF100 are two datasets with different total assets in FF datasets.
(3) The ETFs dataset covers assets with high liquidity and diversity, which have become popular among investors.
(4) SP500 dataset is the daily return data of the 500 firms listed in the S&P500 Index.
Table 3 summarizes these real-world datasets. They implicitly underline different perspectives in performance assessment. While FF25 and FF100 highlight the long-term performance, ETFs and SP500 reflect the volatile market environment after the recent financial crisis starting from 2007. The four datasets have diverse trading frequencies: monthly and daily. Thus, through empirical evaluations on those datasets, we can thoroughly understand the performance of each method.
|Dataset||Frequency||Time Period||Trading Rounds||Assets|
|FF25||Monthly||06/01/1963 - 11/31/2018||545||25|
|FF100||Monthly||07/01/1963 - 11/31/2018||544||100|
|ETFs||Daily||12/08/2011 - 11/10/2017||1,138||547|
|SP500||Daily||02/11/2013 - 02/07/2018||1,355||500|
Expert Advice Construction: We now describe the expert advice constructed for our experiments.
First, we collected raw assets’ features. Each asset was originally represented by a feature vector of components, which include: (1) return: the average return of last rounds and whether the return exceeds the average value; (2) trading volume: last volume and whether the volume exceeds the average value. The above dimensions constitute the raw features in our experiments.
Most of the existing contextual bandit methods use the expert advice construction method proposed by Li et al. [2010], which was designed for recommendation [Beygelzimer et al.2011, Wang, Wu, and Wang2017] and is not applicable to the portfolio problem. Therefore, we propose a novel approach to constructing expert advice. We constructed expert prediction models based on the factor model [Sharpe1963] to predict assets’ return, where each expert came from a combination of these dimensions. Then we trained these expert prediction models by multiple linear regression. In each round , for any expert , we could get the predicted return for all assets. For assets combinations , we sorted the combinations by the average predicted return of the assets they contain. After that, we took assets combinations by the top-k method, taking the reciprocal rank [Chapelle et al.2009] of each as its probability, where (). For the rest of the combinations, we treated their probabilities as a uniform distribution. Finally, the probability vector consisted of the reciprocal ranks of the top assets combinations and the probability of the other combinations. Similarly, to reduce computational complexity, we store in the form of key-value pairs.
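The construction of one expert's probability vector can be sketched as follows. Two details are assumptions made here to obtain a valid distribution, since they are not fixed by the description: the combinations outside the top k all receive the raw score 1/(k+1), and the raw scores are normalized to sum to one. The function name and the toy predictions are illustrative.

```python
from itertools import combinations

import numpy as np


def advice_from_predictions(predicted_returns: np.ndarray,
                            asset_combinations, top_k: int) -> np.ndarray:
    """Turn one expert's predicted returns into a distribution over combinations."""
    # Score each combination by the mean predicted return of its assets.
    scores = np.array([predicted_returns[list(c)].mean()
                       for c in asset_combinations])
    order = np.argsort(-scores, kind="stable")        # best combination first

    # Assumption: combinations outside the top k share one common raw score.
    raw = np.full(len(asset_combinations), 1.0 / (top_k + 1))
    for rank, idx in enumerate(order[:top_k], start=1):
        raw[idx] = 1.0 / rank                         # reciprocal rank
    return raw / raw.sum()                            # assumption: normalize


# Toy example: three assets, all pairs as candidate combinations.
predicted = np.array([0.9, 1.1, 1.0])
pairs = list(combinations(range(3), 2))               # (0,1), (0,2), (1,2)
advice_row = advice_from_predictions(predicted, pairs, top_k=2)
```

With many combinations, only the top-k entries differ from the common default value, so the row can be stored sparsely as key-value pairs for the top-k entries plus one default.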
Comparison methods: The methods empirically evaluated in our experiments can be categorized into three groups:
(2) The context-free dynamic assets selection methods, which are state-of-the-art methods. Full-feedback algorithm [Ito et al.2018] combined the multiplicative weight update method and the follow-the-approximate-leader (FTAL) method. Bandit-feedback algorithm [Ito et al.2018] combined the Exp3 algorithm and FTAL method.
(3) The context-aware dynamic assets selection methods. Since there is no previous work in this group, we compared the performance of our proposed methods Exp4.ONSE and Exp4.EGE.
Experiment Setup and Metrics: Regarding the parameter settings, in the expert advice generation section, we used the first rounds to train multiple linear regression, thus the start time of all methods is for the sake of fairness. And we set parameters according to Theorem 1 for Exp4.ONSE and Theorem 2 for Exp4.EGE. And for the parameters of other comparison methods, we used the parameter settings recommended in the relevant studies.
On the synthetic dataset, we considered regret and running time. In terms of regret, we generated two tasks: slightly non-stationary task with and highly non-stationary task with . For running time, we randomly selected - assets from running time task on synthetic dataset with .
On the real-world datasets, we randomly selected assets from datasets, and used cumulative wealth [Brandt2010] as the standard criterion to evaluate the performance. The results were averaged over executions.
We present our experimental results in three sections. For the synthetic dataset, we evaluate regret to answer question Q1 and running time for question Q2, while for real-world datasets, we assess cumulative wealth for question Q3.
Experiment 1: evaluation of regret on synthetic dataset
In order to answer question Q1, we analysed the regret of all comparison methods on the slightly non-stationary task and the highly non-stationary task. From the results shown in Figure 1, we can reach three conclusions. Firstly, comparing Exp4.ONSE and Exp4.EGE with the fixed assets selection method ONS, we find that ONS converges quickly within a stage, where it is better than our methods. But ONS responds slowly to stage changes, so its regret increases in steps. Therefore, though our approaches show no significant improvement in the slightly non-stationary environment, they are much better than ONS in the highly non-stationary case. Secondly, comparing our approaches with the context-free dynamic assets selection methods, our approaches achieve lower regret than Full-feedback and Bandit-feedback in both cases, especially in the highly non-stationary case. Thirdly, the results empirically support the sublinear regret of our methods derived in the theoretical analysis, and Exp4.ONSE achieves lower regret than Exp4.EGE.
Experiment 2: evaluation of running time on synthetic dataset and real-world datasets
In order to answer question Q2, we measured the running time of comparison methods on running time task of synthetic dataset. The PC we used has a four-core processor with frequency of 3.6GHz and memory of 8GB. The results of running time are plotted in Figure 2 (a) which shows that Exp4.ONSE and Exp4.EGE have greatly improved efficiency over Full-feedback by an average of . Though our methods have increased the running time by an average of than Bandit-feedback when , they are more efficient than Bandit-feedback by decreasing the running time with an average of when . In addition, the results empirically show that our theoretical computational complexity analysis is correct.
Moreover, we compared the running time with comparison methods on the real-world datasets (see Figure 2 (b)). It shows that Exp4.ONSE’s average running time is seconds, which improves the efficiency compared with the state-of-the-art methods by an average of , ranging from to . Exp4.EGE’s average running time is seconds, which improves the efficiency compared with the state-of-the-art methods by an average of , ranging from to . Such time efficiency supports the use of Exp4.ONSE and Exp4.EGE in large-scale real applications.
Experiment 3: evaluation of cumulative wealth on real-world datasets
As for the real-world datasets, we compared the performance with the competing approaches based on their cumulative wealth to answer question Q3. Table 4 reports the cumulative wealth achieved by various trading strategies on the four real-world datasets. The top two results in each dataset are highlighted in bold. Exp4.ONSE outperforms all the four comparison methods by an average of , ranging from to , including an average of compared with state-of-the-art methods. Exp4.EGE outperforms all the four comparison methods by an average of , ranging from to , including an average of compared with state-of-the-art methods. We note that the main reason Exp4.ONSE and Exp4.EGE achieved such superior results is that they exploit context to adapt to the volatile market environment.
Moreover, we are interested in examining how the cumulative wealth changes over trading periods. Figure 3 shows the trends of the cumulative wealth of the Exp4.ONSE and Exp4.EGE methods and the four comparison methods. From the results, we can see that the proposed methods consistently surpass the benchmarks and the state-of-the-art methods over the entire trading periods on most datasets, which again demonstrates the effectiveness of the proposed methods.
Summary: According to the results of the experiments, we can draw the following conclusions: In general, our Exp4.ONSE and Exp4.EGE methods (1) have a lower regret in a non-stationary environment; (2) consume less time than the state-of-the-art methods when the number of assets is large; (3) have generated the greatest increase in wealth on all representative real-world market datasets. In addition, we conducted a risk assessment of all methods; due to limited space, we do not report the results in this paper. The results of the risk assessment indicate that our methods have no significant gap with most of the comparison methods in terms of risk reduction, even though we do not explicitly consider risk in our problem setting.
In this paper, we propose two novel online portfolio selection methods named Exp4.ONSE and Exp4.EGE, which address the dynamic assets selection problem based on contextual bandit algorithms to find an outperforming assets portfolio in a highly non-stationary environment. Extensive experiments show that our methods can achieve satisfying performance. In the future, we will consider how to reduce the risk of our methods to get a better risk-adjusted return.
- [Agarwal et al.2006] Agarwal, A.; Hazan, E.; Kale, S.; and Schapire, R. E. 2006. Algorithms for portfolio management based on the newton method. In Proceedings of the 23rd international conference on Machine learning, 9–16. ACM.
- [Auer et al.2002] Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2002. The nonstochastic multiarmed bandit problem. SIAM journal on computing 32(1):48–77.
- [Auer, Cesa-Bianchi, and Fischer2002] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine learning 47(2-3):235–256.
- [Beygelzimer et al.2011] Beygelzimer, A.; Langford, J.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 19–26.
- [Brandt2010] Brandt, M. W. 2010. Portfolio choice problems. In Handbook of financial econometrics: Tools and techniques. Elsevier. 269–336.
- [Chapelle et al.2009] Chapelle, O.; Metzler, D.; Zhang, Y.; and Grinspan, P. 2009. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM conference on Information and knowledge management, 621–630. ACM.
- [Cover and Ordentlich1996] Cover, T. M., and Ordentlich, E. 1996. Universal portfolios with side information. IEEE Transactions on Information Theory 42(2):348–363.
- [Cover1991] Cover, T. M. 1991. Universal portfolios. Mathematical Finance 1(1):1–29.
- [Fama and French1992] Fama, E. F., and French, K. R. 1992. The cross-section of expected stock returns. the Journal of Finance 47(2):427–465.
- [Hakansson and Ziemba1995] Hakansson, N. H., and Ziemba, W. T. 1995. Capital growth theory. Handbooks in operations research and management science 9:65–86.
- [Hazan, Agarwal, and Kale2007] Hazan, E.; Agarwal, A.; and Kale, S. 2007. Logarithmic regret algorithms for online convex optimization. Machine Learning 69(2-3):169–192.
- [Hazan and Kale2015] Hazan, E., and Kale, S. 2015. An online portfolio selection algorithm with regret logarithmic in price variation. Mathematical Finance 25(2):288–310.
- [Helmbold et al.1998] Helmbold, D. P.; Schapire, R. E.; Singer, Y.; and Warmuth, M. K. 1998. On-line portfolio selection using multiplicative updates. Mathematical Finance 8(4):325–347.
- [Huo and Fu2017] Huo, X., and Fu, F. 2017. Risk-aware multi-armed bandit problem with application to portfolio selection. Royal Society open science 4(11):171377.
- [Ito et al.2018] Ito, S.; Hatano, D.; Sumita, H.; Yabe, A.; Fukunaga, T.; Kakimura, N.; and Kawarabayashi, K.-I. 2018. Regret bounds for online portfolio selection with a cardinality constraint. In Advances in Neural Information Processing Systems, 10588–10597.
- [Kalai and Vempala2002] Kalai, A., and Vempala, S. 2002. Efficient algorithms for universal portfolios. Journal of Machine Learning Research 3(Nov):423–440.
- [Li et al.2010] Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, 661–670. ACM.
- [Markowitz1952] Markowitz, H. 1952. Portfolio selection. The journal of finance 7(1):77–91.
- [Shalev-Shwartz2012] Shalev-Shwartz, S. 2012. Online learning and online convex optimization. Foundations and Trends® in Machine Learning 4(2):107–194.
- [Sharpe1963] Sharpe, W. F. 1963. A simplified model for portfolio analysis. Management science 9(2):277–293.
- [Shen and Wang2016] Shen, W., and Wang, J. 2016. Portfolio blending via thompson sampling. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 1983–1989.
- [Shen et al.2015] Shen, W.; Wang, J.; Jiang, Y.-G.; and Zha, H. 2015. Portfolio choices with orthogonal bandit learning. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
- [Syrgkanis, Krishnamurthy, and Schapire2016] Syrgkanis, V.; Krishnamurthy, A.; and Schapire, R. 2016. Efficient algorithms for adversarial contextual learning. In International Conference on Machine Learning, 2159–2168.
- [Wang, Wu, and Wang2017] Wang, H.; Wu, Q.; and Wang, H. 2017. Factorization bandits for interactive recommendation. In Thirty-First AAAI Conference on Artificial Intelligence, 2695–2702.
- [Wei and Luo2018] Wei, C., and Luo, H. 2018. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, 1263–1291.
Appendix A
Proof of Theorem 1
The regret can be expressed as Eq(7).
First we define the function as follows:
where is the estimator of as defined above.
The first term on the right-hand side of Eq(7) can be bounded as follows:
where the first inequality comes from Lemma 2 in [Agarwal et al.2006], the second and fourth inequalities are based on the proof of Theorem 3 in [Ito et al.2018], and the third inequality holds since from the definition of .
Denote . Since , the eigenvalues of include non-zero eigenvalues and , we have and . Thus, we can obtain
where is the number of experts, is the upper bound of the expected return of the best strategy, and the uniform expert which always assigns uniform weight to all actions is included in the family of experts. By the assumption , we can obtain . Combining Eq(11) with Eq(12), we obtain Theorem 1.
Proof of Theorem 2
The regret can be expressed as Eq(7).
For the first term on the right-hand side of Eq(7), we start with defining and letting , where . Then
where the inequality holds since from the definition of .
Next, we bound . Since and for and , we set where . Then we have
Now, using inequality for (see [Helmbold et al.1998]), we obtain
Combining with Eq(13) gives
since for all x.
Since , and adding all according to , we have