Context-aware Dynamic Assets Selection for Online Portfolio Selection Based on Contextual Bandits
Abstract
Online portfolio selection is a sequential decision-making problem in financial engineering, aiming to construct a portfolio by optimizing the allocation of wealth across a set of assets to achieve the highest return. In this paper, we present a novel context-aware dynamic assets selection problem and propose two innovative online portfolio selection methods, named Exp4.ONSE and Exp4.EGE respectively, based on contextual bandit algorithms to adapt to the dynamic assets selection scenario. We also provide regret upper bounds for both methods, which achieve sublinear regret. In addition, we analyze the computational complexity of the proposed algorithms, which implies that the computational complexity can be significantly reduced to polynomial time. Extensive experiments demonstrate that our algorithms have marked superiority across representative real-world market datasets in terms of efficiency and effectiveness.
Introduction
Online portfolio selection, which aims to construct a portfolio to optimize the allocation of wealth across a set of assets, is a fundamental research problem in financial engineering. There are two key theories used to tackle this problem: the Mean Variance Theory [Markowitz1952], mainly from the finance community, and the Capital Growth Theory [Hakansson and Ziemba1995], which originated from information theory. However, both of them work under the strong assumption that portfolios must be constructed within a given assets combination. Therefore, their assets combination is fixed: once the initial combination of assets is determined, the subsequent combination no longer changes, and only the proportions of the assets within it are adjusted. Due to these limitations, the existing methods, which we refer to as fixed assets selection methods, can hardly be applied to real-world applications, because investors do not want to hold a fixed assets combination all the time; that is, the assets combination needs to be selected dynamically.
In contrast, in this paper, we consider the dynamic assets selection problem, and reconstruct the problem setting as follows. Suppose the market has T rounds and N assets for selection. In round t (1 ≤ t ≤ T), we can construct a portfolio by arbitrarily selecting m assets (1 ≤ m ≤ N) from the N assets to form an assets combination, and then allocating the wealth proportions to achieve the highest total return. A simple idea to address this problem is to use the historical return of all assets as a weight, and then use the multiplicative weight update method over all alternative combinations to choose the best one, such as the Full-feedback algorithm [Ito et al.2018]. However, the number of selectable combinations reaches 2^N − 1. Ito et al. \shortciteito2018regret prove that a special case of this problem with a cardinality constraint is NP-complete. Thus the overall computational complexity is staggering, even though efficient fixed assets selection methods such as the Online Newton Step (ONS) [Agarwal et al.2006] and Exponential Gradient (EG) [Helmbold et al.1998] algorithms have been devoted to reducing the computational complexity of the allocation part. In order to reduce the computational complexity, we address the dynamic assets selection problem by partial observation. In other words, we only calculate the wealth allocation and observe the return of the selected assets combination instead of considering all assets combinations in each round, thus resulting in a multi-armed bandit problem. Therefore, an idea that combines a bandit algorithm for the dynamic assets selection problem with an efficient fixed assets selection method to process the wealth allocation can greatly reduce the computational complexity.
In addition, context-awareness is important in the dynamic assets selection problem. Most portfolio algorithms are context-free, ignoring the contextual information of the investment scenario. In reality, however, investors might have more information than just the price relatives observed so far. For example, some scholars [Cover and Ordentlich1996, Helmbold et al.1998] mention that side information such as prevailing interest rates or consumer-confidence figures can indicate which assets are likely to outperform the other assets in the portfolio. For the dynamic assets selection problem, by choosing assets combinations only in terms of return, a context-free bandit algorithm ignores such characteristics and may not be able to adapt to the highly non-stationary financial market. Thus, we address the context-aware dynamic assets selection problem based on contextual bandit algorithms [Auer et al.2002, Li et al.2010], which consider both the contextual information and the return, aiming at finding the optimal assets combination in the dynamically unstable market.
In this paper, we propose the Exp4.ONSE and Exp4.EGE methods to target the context-aware dynamic assets selection problem. Specifically, for selecting an assets combination, we provide a new bandit algorithm on top of a conventional contextual bandit method named Exp4 [Auer et al.2002]. In addition, for allocating the proportion of wealth, we propose an Online Newton Step Estimator (ONSE) algorithm and a more efficient Exponential Gradient Estimator (EGE) algorithm based on an original return estimation method. Table 1 summarizes the regret upper bounds and computational complexity of the two methods. The characteristics and contributions of our methods are as follows:
Methods  Regret Upper Bound  Computational Complexity

Exp4.ONSE  (see Theorem 1)  (see Computational Complexity)

Exp4.EGE  (see Theorem 2)  (see Computational Complexity)

T is the number of trading rounds, N is the number of assets, K is the number of available assets combinations, and M is the number of experts.

We first present a novel context-aware dynamic assets selection problem in online portfolio selection, in which an arbitrary assets combination can be dynamically chosen with the hint of context to find a near-optimal assets combination and its portfolio in a highly non-stationary environment. Solved naively, this problem incurs exponential computational complexity.

To address the contextaware dynamic assets selection problem, we propose two efficient Exp4.ONSE and Exp4.EGE methods which construct a portfolio by dynamically selecting an assets combination and employing two original ONSE and EGE methods to allocate wealth.

We rigorously prove regret upper bounds for the proposed Exp4.ONSE and Exp4.EGE algorithms, which demonstrate that both of our algorithms achieve sublinear regret in T.

By only calculating the portfolio and observing the return of the selected assets combination, our algorithms can significantly reduce the computational complexity. Specifically, when the number of assets is large, our algorithms run in time polynomial in T, N, and M.

Extensive experiments demonstrate that our algorithms achieve lower regret in highly non-stationary situations on the synthetic dataset and have superior performance compared with the state-of-the-art methods on real-world datasets in terms of efficiency and effectiveness.
Related Work
Portfolio Selection: Online portfolio selection, which sequentially selects a portfolio over a set of assets in order to achieve certain targets, is a natural and important task for asset portfolio management. One category of algorithms, based on the Capital Growth Theory [Hakansson and Ziemba1995] and termed "Follow-the-Winner", tries to increase the relative weight of more successful stocks based on their historical performance and asymptotically achieves the same growth rate as that of an optimal strategy. Cover \shortcitecover1991universal proposed the Universal Portfolio (UP) strategy, which achieves the minimax optimal regret for this problem, but with prohibitive computational complexity, even after the optimized implementation [Kalai and Vempala2002]. Some scholars [Agarwal et al.2006, Hazan, Agarwal, and Kale2007, Hazan and Kale2015] proposed Follow-the-Leader approaches, solving the optimization problem with L2-norm regularization via online convex optimization techniques [ShalevShwartz2012], which admit much faster implementations via standard optimization methods. Helmbold et al. \shortcitehelmbold1998line proposed an extremely efficient approach named EG, with O(N) time complexity per round and regret of O(√(T log N)). However, the algorithms above are fixed assets selection methods, which are not suitable for our proposed dynamic assets selection problem.
Regarding the dynamic assets selection problem, Ito et al. \shortciteito2018regret proposed the Full-feedback and Bandit-feedback algorithms, which introduce bandit techniques to address this problem. However, both of these algorithms have exponentially large computational complexity.
Bandit Algorithm: Bandit algorithms can be divided into stochastic and adversarial algorithms. A stochastic bandit, such as UCB or LinUCB [Auer, Cesa-Bianchi, and Fischer2002, Li et al.2010], is completely determined by the reward distributions of the respective actions. Regarding the portfolio selection problem, some scholars proposed stochastic bandit-based portfolio methods [Shen et al.2015, Shen and Wang2016, Huo and Fu2017]. However, rewards in the real world are more complex and unstable, especially in competitive scenarios such as stock trading; it is hard to assume that the returns of stocks are truly randomly generated.
In addition, an adversarial version was introduced by Auer et al. \shortciteauer2002nonstochastic, which is a more pragmatic approach because it makes no statistical assumptions about the returns while still keeping the objective of competing with the best action in hindsight. They proposed the Exp3 algorithm with expected cumulative regret of O(√(TK log K)). They were also the first to consider the K-armed bandit problem with expert advice, and proposed the Exp4 algorithm, on top of which we build our algorithms. Based on Exp4, there are other improved algorithms [Beygelzimer et al.2011, Syrgkanis, Krishnamurthy, and Schapire2016, Wei and Luo2018], but they do not directly apply to our setting.
Although there are existing studies on bandit-based portfolio selection [Ito et al.2018, Shen et al.2015, Shen and Wang2016, Huo and Fu2017], they do not consider the context of the financial market, and thus cannot adapt to the highly non-stationary financial market.
Problem Setup
Consider a self-financed financial market with no margin/short selling, containing N assets. In each trading round t, the performance of the assets can be described by a return vector x_t, whose i-th entry is the next round's opening price of asset i divided by its opening price in the current round. Thus the value of an investment in asset i increases or falls by a factor of the i-th entry of x_t from one round to the next. A portfolio is defined by a weight vector w_t satisfying the constraint that every weight is non-negative and all the weights sum to one. The i-th element of w_t specifies the proportion of wealth allocated to asset i. Given a portfolio w_t and the return vector x_t, investors using this portfolio increase (or decrease) their wealth from one round to the next by a factor of w_t · x_t.
The assets combination is restricted to a set of available combinations C. For an assets combination A ∈ C, the feasible region is the set of portfolios whose supports are included in A, i.e., portfolios that put zero weight on every asset outside A. In particular, Ito et al. \shortciteito2018regret define a special form of C with cardinality constraints, consisting of all combinations of exactly m assets. Note that when m = N, the problem becomes the fixed assets selection problem, while it even turns into the single asset selection problem when m = 1.
Let w_t denote the portfolio weight vector for round t. Starting from an initial wealth S_0, a portfolio strategy increases it by a factor of w_t · x_t in each round; namely, the final cumulative wealth after a sequence of T rounds is S_T = S_0 ∏_{t=1}^{T} (w_t · x_t). Since the model assumes multi-period investment, we define the exponential growth rate according to the Capital Growth Theory [Hakansson and Ziemba1995] as W_T = (1/T) log(S_T / S_0).
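As a small illustration of these definitions, the following sketch computes the cumulative wealth and the exponential growth rate for a given sequence of portfolios and return vectors (the function names are ours, not the paper's):

```python
import math

def cumulative_wealth(weights, returns, s0=1.0):
    """Final wealth after applying portfolio w_t to return vector x_t each round.

    weights, returns: lists of equal-length lists; each w_t sums to one and
    each x_t holds per-asset price relatives (next open / current open).
    """
    s = s0
    for w, x in zip(weights, returns):
        s *= sum(wi * xi for wi, xi in zip(w, x))  # wealth factor w_t . x_t
    return s

def exponential_growth_rate(weights, returns, s0=1.0):
    """Average log-growth (1/T) * ln(S_T / S_0), per the Capital Growth Theory."""
    t = len(returns)
    return math.log(cumulative_wealth(weights, returns, s0) / s0) / t
```

For example, a uniform two-asset portfolio facing returns (1.1, 0.9) keeps its wealth unchanged, so its growth rate is zero.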
Let (A*, w*) denote the optimal fixed strategy for T rounds, i.e., the assets combination and portfolio that maximize the exponential growth rate in hindsight. The performance of our portfolio selection is measured by the gap in exponential growth rate against this optimal strategy, which we call regret. Then the regret of the algorithm can be expressed as:
R_T = (1/T) ∑_{t=1}^{T} log(w* · x_t) − (1/T) ∑_{t=1}^{T} log(w_t · x_t).  (1)
We follow the same assumption as in [Ito et al.2018] that every entry of x_t is bounded in a closed interval whose endpoints are positive constants, but we do not make any statistical assumption about the behavior of x_t.
In general, some other assumptions are made in the above widely adopted models: (1) Transaction cost: there are no transaction costs/taxes. (2) Market liquidity: one can buy and sell any quantity of any asset at its closing price. (3) Impact cost: market behavior is not affected by any portfolio selection strategy.
Symbol  Definition

N  Number of assets
T  Number of trading rounds
M  Number of experts
t  Index of trading rounds
x_t  Return vector
C  Available assets combinations
A_t  Assets combination
w_t  Portfolio weight vector
A*  The optimal assets combination
w*  The optimal portfolio weight vector
Table 2 lists the key notations in this paper.
Methodology
In this section, we first introduce the strategy for the context-aware dynamic assets selection problem. Then we describe how to calculate the proportion of wealth allocation. Finally, we summarize the two proposed algorithms.
Context-aware Dynamic Assets Selection
Towards solving the context-aware dynamic assets selection problem, we propose a new bandit algorithm based on the Exp4 algorithm, which considers both the historical return and the context of the assets while making the choice of assets combinations.
We start by standardizing the following keywords:
Expert: We consider a mapping function from context to a probability distribution over assets combinations as an expert, which is trained by some supervised learning method.
Probability vector: A probability vector represents a probability distribution over the assets combinations recommended by an expert. In this paper, we assume there are M experts, and each expert's vector has non-negative entries that sum to one.
Expert authority vector: The expert authority vector is used to score experts; the higher the score, the closer the expert is to the best expert. It is initialized to the uniform distribution over the M experts.
For each expert, the associated mapping gives its probability vector according to the context available in round t. Note that since the construction of experts is not strict and unique, and does not affect the regret analysis, we only provide one of the possible practices in our experiments.
The algorithm's goal in this case is to combine the advice of the experts in such a way that its return is close to that of the best expert. In each round t, the experts make their recommendations, and the algorithm chooses an assets combination A_t based on their comprehensive recommendations and outputs w_t as the portfolio weight vector. Finally, the algorithm updates the expert authority vector based on the reward vector and proceeds to the next round.
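The round structure above can be sketched in Exp4 style as follows. This is an illustrative sketch, not the paper's exact algorithm: the shapes, the `reward_fn` interface, and the update constants are our assumptions.

```python
import math
import random

def exp4_round(q, advice, gamma, reward_fn, rng=random):
    """One round of an Exp4-style selection with expert advice (sketch).

    q        : authority weights over M experts (sums to one)
    advice   : M x K matrix; advice[j][k] = expert j's probability of combination k
    gamma    : exploration rate in (0, 1]
    reward_fn: maps the chosen combination index to a reward in [0, 1]
    """
    n_experts, n_combos = len(advice), len(advice[0])
    # Mix the experts' advice by authority, smoothed with uniform exploration.
    p = [(1 - gamma) * sum(q[j] * advice[j][k] for j in range(n_experts))
         + gamma / n_combos for k in range(n_combos)]
    k = rng.choices(range(n_combos), weights=p)[0]
    r = reward_fn(k)
    r_hat = r / p[k]  # importance-weighted reward estimate for the chosen arm
    # Credit each expert with the estimated reward of its own advice,
    # then update authorities multiplicatively and renormalize.
    expert_gain = [advice[j][k] * r_hat for j in range(n_experts)]
    q = [qj * math.exp(gamma * g / n_combos) for qj, g in zip(q, expert_gain)]
    z = sum(q)
    return [qj / z for qj in q], k
```

Running this repeatedly against a reward that favors one combination steadily shifts authority toward the expert recommending that combination.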
Wealth Allocation
For maintaining the proportion of wealth allocation, we construct an estimated return and propose an ONSE algorithm based on ONS [Agarwal et al.2006]. We first define the gradient of the reward function:
(2) 
Then we give the formulas of the following vector and matrix, which utilize the gradient and the Hessian:
(3) 
Since an observer does not have to observe all the entries of x_t, we do not always need to update these quantities for every assets combination. In order to deal with this problem, we construct unbiased estimators of them for each round as follows:
(4) 
where the estimators are normalized by the probability of choosing A_t in round t. Note that they can be calculated from the observed information alone. Using these unbiased estimators, we compute the portfolio vectors by ONS as follows:
(5) 
where the projection is taken in the norm induced by the matrix maintained by ONS.
In addition, we use a smoothed portfolio to modify the ONSE algorithm. The ONSE algorithm in round t is described in Algorithm 1.
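A single ONS-style update on the log-wealth objective can be sketched as follows. This is a minimal sketch under our own assumptions: the simplex projection here is a crude clipped renormalization, a stand-in for the paper's exact projection in the norm induced by the matrix, which requires solving a quadratic program.

```python
import numpy as np

def ons_step(w, x, A, beta=1.0):
    """One Online Newton Step update for the log-wealth objective (sketch).

    w : current portfolio (non-negative, sums to one)
    x : observed price relatives for the chosen assets
    A : running matrix  eps*I + sum of g g^T  over past gradients
    """
    g = x / np.dot(w, x)                       # gradient of log(w . x)
    A = A + np.outer(g, g)                     # rank-one "Hessian" accumulation
    w_new = w + np.linalg.solve(A, g) / beta   # Newton-style ascent step
    w_new = np.clip(w_new, 0.0, None)          # crude projection onto the simplex
    return w_new / w_new.sum(), A
```

Repeating the step when asset 0 consistently returns more than asset 1 shifts weight toward asset 0 while keeping the weights a valid portfolio.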
Model 1: Exp4.ONSE
The Exp4.ONSE method updates the probability of choosing each assets combination by the Exp4 algorithm and updates portfolios by ONSE, respectively. Hence the convex quadratic programming problem is solved only once per round. The entire algorithm is summarized in Algorithm 2.
Model 2: Exp4.EGE
In order to make the algorithm more efficient, we transform the EG algorithm, which has O(N) running time per round, into a dynamic assets selection method.
Exp4.EGE updates the probability of choosing each assets combination by the Exp4 algorithm and updates portfolios by EGE. In the EGE algorithm, we construct an unbiased estimator of the return for each round by importance weighting. Then we compute the portfolio vector using the unbiased estimator by exponential gradient:
(6) 
where η is the learning rate.
Therefore, we only need to replace the ONSE algorithm in the Exp4.ONSE algorithm with the EGE algorithm to get the Exp4.EGE algorithm. The EGE algorithm in round is described in Algorithm 3.
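The exponential gradient step underlying EGE can be sketched as follows, in the style of Helmbold et al. \shortcitehelmbold1998line (the default learning rate is an assumption for illustration):

```python
import math

def eg_step(w, x, eta=0.05):
    """One Exponential Gradient portfolio update (sketch).

    Multiplicatively reweights each asset i by exp(eta * x_i / (w . x)),
    then renormalizes; O(N) time per round. eta is the learning rate.
    """
    wx = sum(wi * xi for wi, xi in zip(w, x))
    w_new = [wi * math.exp(eta * xi / wx) for wi, xi in zip(w, x)]
    z = sum(w_new)
    return [wi / z for wi in w_new]
```

The update moves weight toward the asset with the larger price relative while keeping the weights on the simplex, which is what makes the per-round cost linear in N.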
Detailed Analysis
Regret Upper Bound
In the following, the regret can be expressed as:
(7)  
Our algorithms achieve the regret bounds described below for arbitrary inputs.
We now introduce the first theorem, which is about the regret upper bounds of Exp4.ONSE algorithm.
Theorem 1
For any T, any set of available combinations, and appropriate parameter choices, Exp4.ONSE achieves the following regret bound:
Setting the parameters appropriately, we obtain
And we have the second theorem about the upper regret bound of Exp4.EGE algorithm.
Theorem 2
For any T and any set of available combinations, algorithm Exp4.EGE achieves the following regret bound:
Setting the parameters appropriately, we obtain
The proofs can be found in the supplementary material.
Computational Complexity
The results of Ito et al. \shortciteito2018regret imply that, unless the complexity class BPP includes NP, the Full-feedback algorithm cannot be made to run in polynomial time while retaining its regret guarantee for arbitrary inputs. But our algorithms, based on bandit feedback, can reduce the computational complexity by settling for an approximate solution.
Algorithm Exp4.ONSE runs in polynomial time per round. In each round t, from the definition of the unbiased estimators in Eq(4), the update given by Eq(5) is needed only for the chosen combination, so ONSE can be computed in polynomial time per round [Agarwal et al.2006]. Furthermore, updating the expert authority vector and computing the sampling distribution can both be performed efficiently, which implies that sampling can also be performed efficiently. Since in practice the total number of rounds can be smaller than the number of assets combinations, in order to further reduce the computational and space complexity, we also apply a key-value pair structure to implement the algorithm, so that only combinations that have actually been visited are stored.
Similarly, Algorithm Exp4.EGE runs faster per round, since EGE can be computed in O(N) time per round [Helmbold et al.1998].
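The key-value pair structure mentioned above can be sketched as follows. The class and method names are ours; the point is only that unvisited combinations implicitly keep their initial weight, so storage grows with the number of visited combinations rather than with the full combination set.

```python
class SparseWeights:
    """Key-value storage for combination weights.

    Only combinations that have actually been updated are materialized,
    so space grows with min(T, K) rather than with all 2^N combinations.
    """

    def __init__(self, initial=1.0):
        self._initial = initial
        self._w = {}

    def get(self, combo):
        # combo is any hashable key, e.g. a frozenset of asset indices.
        return self._w.get(combo, self._initial)

    def update(self, combo, factor):
        # Multiplicative update materializes the entry on first touch.
        self._w[combo] = self.get(combo) * factor

    def __len__(self):
        return len(self._w)
```

Reads of never-updated combinations cost nothing in space, which is what bounds the memory by the number of rounds actually played.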
Experiments
In this section, we introduce the extensive experiments conducted on a synthetic dataset and four real-world datasets, which aim to answer the following questions:
Q1: How does the non-stationary environment affect the regret of our Exp4.ONSE and Exp4.EGE methods? (see Experiment 1)
Q2: How efficient are our algorithms compared with state-of-the-art approaches? (see Experiment 2)
Q3: How do our approaches outperform the state-of-the-art approaches on real-world financial markets? (see Experiment 3)
Experimental Settings
Data Collection: Similar to [Ito et al.2018, Huo and Fu2017], we used a synthetic dataset to evaluate the regret. We also conducted our experiments on four real-world datasets from financial markets to evaluate the performance.
(1) Synthetic dataset is generated as follows: given the generation parameters, we generate the return sequences and randomly divide the T rounds into stages; in each stage, we choose one asset whose return is boosted above the others.
(2) Fama and French (FF) datasets have been widely recognized as high-quality, standard evaluation protocols [Fama and French1992]; they have extensive coverage of asset classes and span a long period. FF25 and FF100 are two FF datasets with different numbers of total assets.
(3) ETFs dataset has high liquidity and diversity, and ETFs have become popular among investors.
(4) SP500 dataset is the daily return data of the 500 firms listed in the S&P 500 Index.
Table 3 summarizes these real-world datasets. They implicitly underline different perspectives in performance assessment. While FF25 and FF100 highlight long-term performance, ETFs and SP500 reflect the volatile market environment after the financial crisis starting from 2007. The four datasets have diverse trading frequencies: monthly and daily. Thus, through empirical evaluations on these datasets, we can thoroughly understand the performance of each method.
Dataset  Frequency  Time Period  T  N 

FF25  Monthly  06/01/1963  11/30/2018  545  25 
FF100  Monthly  07/01/1963  11/30/2018  544  100 
ETFs  Daily  12/08/2011  11/10/2017  1,138  547 
SP500  Daily  02/11/2013  02/07/2018  1,355  500 
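The synthetic data generation described in (1) can be sketched as follows. The interval, boost factor, and clamping scheme are our assumptions for illustration; the paper's exact generator parameters are not reproduced here.

```python
import random

def synthetic_returns(T, N, stages, lo=0.9, hi=1.1, boost=1.05, seed=0):
    """Piecewise-stationary synthetic returns (illustrative sketch).

    The T rounds are split into `stages` contiguous blocks; within each block
    a randomly chosen 'winner' asset gets its return multiplied by `boost`,
    clamped so all returns stay inside [lo, hi].
    """
    rng = random.Random(seed)
    cuts = sorted(rng.sample(range(1, T), stages - 1)) + [T]
    data, start = [], 0
    for end in cuts:
        winner = rng.randrange(N)  # the asset favored during this stage
        for _ in range(start, end):
            x = [rng.uniform(lo, hi) for _ in range(N)]
            x[winner] = min(x[winner] * boost, hi)  # keep returns bounded
            data.append(x)
        start = end
    return data
```

The stage boundaries are what make the task non-stationary: more stages mean more frequent changes of the best asset, which is the regime the highly non-stationary task targets.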
Expert Advice Construction: We now describe the expert advice constructed for our experiments.
First, we collected raw assets' features. Each asset was originally represented by a feature vector whose components include: (1) return: the average return over recent rounds and whether the return exceeds the average value; (2) trading volume: the last volume and whether the volume exceeds the average value. These dimensions constitute the raw features in our experiments.
Most of the existing contextual bandit methods use the expert advice construction proposed by Li et al. \shortciteli2010contextual, which has been applied in recommendation [Beygelzimer et al.2011, Wang, Wu, and Wang2017] but is not applicable to the portfolio problem. Therefore, we propose a novel approach to constructing expert advice. We constructed expert prediction models based on factor models [Sharpe1963] to predict assets' returns, where each expert came from some combination of the feature dimensions. We then trained these expert prediction models by multiple linear regression. In each round t, each expert produces predicted returns for all assets. We sorted the assets combinations by the average predicted return of the assets they contain. After that, we took the top-k assets combinations, taking the reciprocal of each combination's rank [Chapelle et al.2009] as its probability, and for the rest of the combinations we treated their probabilities as a uniform distribution. Finally, the probability vector consists of the reciprocal ranks of the top-k assets combinations and the uniform probabilities of the other combinations. Similarly, to reduce computational complexity, we store the probability vectors in the form of key-value pairs.
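One expert's probability vector, as described above, can be sketched as follows. The uniform `floor` score for non-top combinations and the final normalization are our assumptions; the paper's exact normalization may differ.

```python
def advice_vector(pred_returns, combos, k=3, floor=0.01):
    """Build one expert's probability vector over assets combinations (sketch).

    pred_returns : predicted return per asset
    combos       : list of assets combinations (tuples of asset indices)
    Combinations are ranked by the average predicted return of their assets;
    the top-k get scores proportional to the reciprocal of their rank, the
    rest get a uniform floor, and everything is normalized to sum to one.
    """
    avg = lambda c: sum(pred_returns[i] for i in c) / len(c)
    order = sorted(range(len(combos)), key=lambda j: avg(combos[j]), reverse=True)
    scores = [floor] * len(combos)        # uniform score for non-top combos
    for r, j in enumerate(order[:k]):
        scores[j] = 1.0 / (r + 1)         # reciprocal rank for the top-k
    z = sum(scores)
    return [s / z for s in scores]
```

With predicted returns favoring asset 0, the combination containing only asset 0 receives the largest probability, and all remaining combinations share the uniform floor.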
Comparison methods: The methods empirically evaluated in our experiments can be categorized into three groups:
(1) The fixed assets selection methods. ONS [Agarwal et al.2006] and EG [Helmbold et al.1998], derived from traditional financial theory, can serve as conventional baseline methods.
(2) The context-free dynamic assets selection methods, which are state-of-the-art methods. The Full-feedback algorithm [Ito et al.2018] combines the multiplicative weight update method and the follow-the-approximate-leader (FTAL) method. The Bandit-feedback algorithm [Ito et al.2018] combines the Exp3 algorithm and the FTAL method.
(3) The context-aware dynamic assets selection methods. Since there is no previous work in this group, we compared the performance of our proposed methods Exp4.ONSE and Exp4.EGE.
Experiment Setup and Metrics: Regarding the parameter settings, in the expert advice generation part we used the initial rounds to train the multiple linear regression models; thus, for the sake of fairness, all methods start from the same round after this training period. We set parameters according to Theorem 1 for Exp4.ONSE and Theorem 2 for Exp4.EGE. For the parameters of the other comparison methods, we used the settings recommended in the relevant studies.
On the synthetic dataset, we considered regret and running time. In terms of regret, we generated two tasks: a slightly non-stationary task with few stage changes and a highly non-stationary task with frequent stage changes. For running time, we randomly selected varying numbers of assets on the synthetic dataset.
On the real-world datasets, we randomly selected assets from the datasets, and used cumulative wealth [Brandt2010] as the standard criterion to evaluate performance. The results were averaged over repeated executions.
Experimental Results
We present our experimental results in three sections. For the synthetic dataset, we evaluate regret to answer question Q1 and running time for question Q2, while for realworld datasets, we assess cumulative wealth for question Q3.
Experiment 1: evaluation of regret on synthetic dataset
In order to answer question Q1, we analyzed the regret of all comparison methods on the slightly non-stationary task and the highly non-stationary task. From the results shown in Figure 1, we can reach three conclusions. First, comparing Exp4.ONSE and Exp4.EGE with the fixed assets selection method ONS, we find that ONS converges quickly within a stage, which is better than our methods; but ONS responds slowly to stage changes, so its regret exhibits a stepped increase. Therefore, though our approaches show no significant improvement in the slightly non-stationary environment, they are much better in the highly non-stationary case. Second, comparing our approaches with the context-free dynamic assets selection methods, ours achieve lower regret than Full-feedback and Bandit-feedback in both cases, especially in the highly non-stationary one. Third, the results empirically confirm the sublinear regret of our methods established in the theoretical analysis, and Exp4.ONSE achieves lower regret than Exp4.EGE.
Experiment 2: evaluation of running time on synthetic dataset and realworld datasets
In order to answer question Q2, we measured the running time of the comparison methods on the running-time task of the synthetic dataset. The PC we used has a four-core processor with a frequency of 3.6GHz and 8GB of memory. The results, plotted in Figure 2 (a), show that Exp4.ONSE and Exp4.EGE greatly improve efficiency over Full-feedback. Though our methods run longer on average than Bandit-feedback when the number of assets is small, they are more efficient than Bandit-feedback when the number of assets is large. In addition, the results empirically confirm our theoretical computational complexity analysis.
Moreover, we compared the running time with the comparison methods on the real-world datasets (see Figure 2 (b)). It shows that both Exp4.ONSE and Exp4.EGE improve efficiency compared with the state-of-the-art methods by a considerable margin on average. Such time efficiency supports Exp4.ONSE and Exp4.EGE in large-scale real applications.
Experiment 3: evaluation of cumulative wealth on realworld datasets
As for the real-world datasets, we compared the performance with the competing approaches based on their cumulative wealth to answer question Q3. Table 4 reports the cumulative wealth achieved by the various trading strategies on the four real-world datasets; the top two results in each dataset are highlighted in bold. Exp4.ONSE and Exp4.EGE outperform all four comparison methods on average, including the state-of-the-art methods. We note that the main reason Exp4.ONSE and Exp4.EGE achieve such superior results is their power to exploit context to adapt to a volatile market environment.
Methods  FF25  FF100  SP500  ETFs 

ONS  292.92  202.51  1.60  1.87 
EG  915.11  672.93  1.79  2.75 
Fullfeedback  768.76  1878.89  4.29  3.50 
Banditfeedback  208.28  367.10  1.75  1.58 
Exp4.ONSE  1179.07  2000.63  6.43  7.33 
Exp4.EGE  1888.98  2372.39  19.81  5.99 
Moreover, we are interested in examining how the cumulative wealth changes over trading periods. Figure 3 shows the trends of the cumulative wealth of the Exp4.ONSE and Exp4.EGE methods and the four comparison methods. From the results, we can see that the proposed methods consistently surpass the benchmarks and the state-of-the-art methods over the entire trading periods on most datasets, which again demonstrates their effectiveness.
Summary: According to the experimental results, we can draw the following conclusions. In general, our Exp4.ONSE and Exp4.EGE methods (1) have lower regret in non-stationary environments; (2) consume less time than the state-of-the-art methods when the number of assets is large; and (3) generate the greatest increase in wealth on all representative real-world market datasets. In addition, we conducted a risk assessment of all methods; due to limited space, we do not report the results in this paper. The risk assessment indicates that our methods show no significant gap with most of the comparison methods in terms of risk, even though we do not explicitly consider risk in our problem setting.
Conclusions
In this paper, we propose two novel online portfolio selection methods named Exp4.ONSE and Exp4.EGE, which address the dynamic assets selection problem based on contextual bandit algorithms to find the best-performing assets portfolio in a highly non-stationary environment. Extensive experiments show that our methods achieve satisfying performance. In the future, we will consider how to reduce the risk of our methods to obtain a better risk-adjusted return.
References
 [Agarwal et al.2006] Agarwal, A.; Hazan, E.; Kale, S.; and Schapire, R. E. 2006. Algorithms for portfolio management based on the newton method. In Proceedings of the 23rd international conference on Machine learning, 9–16. ACM.
 [Auer et al.2002] Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2002. The nonstochastic multiarmed bandit problem. SIAM journal on computing 32(1):48–77.
 [Auer, Cesa-Bianchi, and Fischer2002] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine learning 47(2-3):235–256.
 [Beygelzimer et al.2011] Beygelzimer, A.; Langford, J.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 19–26.
 [Brandt2010] Brandt, M. W. 2010. Portfolio choice problems. In Handbook of financial econometrics: Tools and techniques. Elsevier. 269–336.
 [Chapelle et al.2009] Chapelle, O.; Metlzer, D.; Zhang, Y.; and Grinspan, P. 2009. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM conference on Information and knowledge management, 621–630. ACM.
 [Cover and Ordentlich1996] Cover, T. M., and Ordentlich, E. 1996. Universal portfolios with side information. IEEE Transactions on Information Theory 42(2):348–363.
 [Cover1991] Cover, T. M. 1991. Universal portfolios. Mathematical Finance 1(1):1–29.
 [Fama and French1992] Fama, E. F., and French, K. R. 1992. The cross-section of expected stock returns. The Journal of Finance 47(2):427–465.
 [Hakansson and Ziemba1995] Hakansson, N. H., and Ziemba, W. T. 1995. Capital growth theory. Handbooks in operations research and management science 9:65–86.
 [Hazan, Agarwal, and Kale2007] Hazan, E.; Agarwal, A.; and Kale, S. 2007. Logarithmic regret algorithms for online convex optimization. Machine Learning 69(2-3):169–192.
 [Hazan and Kale2015] Hazan, E., and Kale, S. 2015. An online portfolio selection algorithm with regret logarithmic in price variation. Mathematical Finance 25(2):288–310.
 [Helmbold et al.1998] Helmbold, D. P.; Schapire, R. E.; Singer, Y.; and Warmuth, M. K. 1998. On-line portfolio selection using multiplicative updates. Mathematical Finance 8(4):325–347.
 [Huo and Fu2017] Huo, X., and Fu, F. 2017. Risk-aware multi-armed bandit problem with application to portfolio selection. Royal Society open science 4(11):171377.
 [Ito et al.2018] Ito, S.; Hatano, D.; Hanna, S.; Yabe, A.; Fukunaga, T.; Kakimura, N.; and Kawarabayashi, K.-i. 2018. Regret bounds for online portfolio selection with a cardinality constraint. In Advances in Neural Information Processing Systems, 10588–10597.
 [Kalai and Vempala2002] Kalai, A., and Vempala, S. 2002. Efficient algorithms for universal portfolios. Journal of Machine Learning Research 3(Nov):423–440.
 [Li et al.2010] Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, 661–670. ACM.
 [Markowitz1952] Markowitz, H. 1952. Portfolio selection. The journal of finance 7(1):77–91.
 [Shalev-Shwartz2012] Shalev-Shwartz, S. 2012. Online learning and online convex optimization. Foundations and Trends® in Machine Learning 4(2):107–194.
 [Sharpe1963] Sharpe, W. F. 1963. A simplified model for portfolio analysis. Management science 9(2):277–293.
 [Shen and Wang2016] Shen, W., and Wang, J. 2016. Portfolio blending via Thompson sampling. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 1983–1989.
 [Shen et al.2015] Shen, W.; Wang, J.; Jiang, Y.-G.; and Zha, H. 2015. Portfolio choices with orthogonal bandit learning. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
 [Syrgkanis, Krishnamurthy, and Schapire2016] Syrgkanis, V.; Krishnamurthy, A.; and Schapire, R. 2016. Efficient algorithms for adversarial contextual learning. In International Conference on Machine Learning, 2159–2168.
 [Wang, Wu, and Wang2017] Wang, H.; Wu, Q.; and Wang, H. 2017. Factorization bandits for interactive recommendation. In Thirty-First AAAI Conference on Artificial Intelligence, 2695–2702.
 [Wei and Luo2018] Wei, C., and Luo, H. 2018. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, 1263–1291.
Appendix A
Proof of Theorem 1
Proof A.1
The regret can be expressed as Eq(7).
First we define the function as follows:
(8) 
where is the estimators of as definition.
The first term on the right-hand side of Eq(7) can be bounded as follows:
(9)  
where the first inequality comes from Lemma 2 in [Agarwal et al.2006], the second and fourth inequalities are based on the proof of Theorem 3 in [Ito et al.2018], and the third inequality holds by the definition of the estimators.
Since the matrix is built from rank-one updates, only a limited number of its eigenvalues are nonzero, which yields bounds on the corresponding log-determinant terms. Thus, we can obtain
(10) 
Combining this with Eq(9), the first term on the right-hand side of Eq(7) can be bounded as follows:
(11) 
Since the assets combination is chosen by Exp4, the second term on the right-hand side of Eq(7) can be bounded as follows (see Theorem 7.1 in [Auer et al.2002]):
(12)  
where M is the number of experts, the relevant constant is the upper bound of the expected return of the best strategy, and the uniform expert, which always assigns uniform weight to all actions, is included in the family of experts. From the boundedness assumption, we can bound this term. Combining Eq(11) with Eq(12), we obtain Theorem 1.
Proof of Theorem 2
Proof A.2
The regret can be expressed as Eq(7).
For the first term on the right-hand side of Eq(7), we start by defining the required quantities as in the EGE update. Then
(13)  
where the inequality holds by the definition of the estimator.
Next, we bound . Since and for and , we set where . Then we have
(14)  
Now, using a standard inequality (see [Helmbold et al.1998]), we obtain
(15) 
Combining with Eq(13) gives
(16)  
since for all x.
Since , and adding all according to , we have
(17)  