Adaptive Portfolio by Solving Multi-armed Bandit via Thompson Sampling
Abstract
As the cornerstone of modern portfolio theory, Markowitz’s mean-variance optimization is considered a major model adopted in portfolio management. However, because its parameters are difficult to estimate, it does not perform well in all periods. In some cases, naive strategies such as Equally-weighted and Value-weighted portfolios can even achieve better performance. Under these circumstances, we can treat multiple classic strategies as strategic arms in a multi-armed bandit, which naturally establishes a connection with the portfolio selection problem and lets the bandit algorithm maximize rewards through the trade-off between exploration and exploitation. In this paper, we present a portfolio bandit strategy through Thompson sampling, which aims to make online portfolio choices by effectively exploiting the performances among multiple arms. By constructing multiple strategic arms, we can obtain an investment portfolio that adapts to different investment periods. Moreover, we devise a novel reward function based on users’ different investment risk preferences, which can adapt to various investment styles. Our experimental results demonstrate that the proposed portfolio strategy has marked superiority across representative real-world market datasets in terms of extensive evaluation criteria.
1 Introduction
The portfolio selection problem is a fundamental issue in the financial sector for many asset investments, including funds, stocks, bonds, and options. According to Gary Brinson, the father of global asset allocation, “Asset allocation is the main factor that affects all overall returns.” In the long run, more than 90% of a portfolio’s performance is attributable to its asset allocation [Brinson et al.1995]. Thus, asset allocation of a portfolio is the key determinant of performance, risk, and volatility over time.
Modern portfolio theory and analysis tend to build upon the seminal work of Markowitz [Markowitz1952]. Up to now, the mean-variance paradigm has remained the mainstream choice for academia and industry. However, the main problem of using a single strategy is that it cannot adapt to a changing environment. For instance, in the event of a stock market crash, the sold-all strategy, which holds only cash, will perform well. Meanwhile, during a bull market, the buy-and-hold strategy is likely to perform even better. To address this issue, a simple approach in the investment field is to periodically review the effectiveness of the current strategy and adjust it appropriately for the next phase. This is a typical exploration and exploitation problem. Therefore, we use reinforcement learning to determine the portfolio strategy best suited to each investment period.
Meanwhile, each investor’s pursuit of risk and benefit is different, which is called the user’s investment risk preference. Although a good strategy can ultimately help in achieving a good return, not every investor is willing to take on the risks involved in the process. For example, users with a low risk tolerance are unlikely to accept short-term losses, whereas users with a high risk tolerance tend to pursue high returns and usually do not care about retracement. For this reason, we take the users’ investment risk preferences into account and propose a novel reward function, which can adapt to various investment styles such as the high-risk-high-return and low-risk-low-return styles.
In this paper, we first turn the portfolio problem into a multi-armed bandit problem and construct a series of strategic arms based on the classic strategies. Subsequently, we apply the Thompson sampling method to select a strategic arm and update the Beta distributions of the strategic arms based on the user’s investment risk preference. The contributions of the present work are summarized as follows:

To adapt to different market conditions in different periods, we utilize the multi-armed bandit problem to adaptively select the most suitable strategy to form an online portfolio strategy.

We devise a novel reward function based on the users’ investment risk preferences to ensure that the method fits a variety of users’ needs. This helps to achieve different return-to-risk ratios.

Experimental results indicate that the proposed portfolio strategy has marked superiority across representative real-world market datasets in terms of a series of standard financial evaluation indicators, which include Sharpe ratio, cumulative wealth, volatility, and maximum drawdown.
2 Related Work
In this section, we briefly discuss two topics, namely the multi-armed bandit and the portfolio selection problem.
2.1 Multi-armed Bandit and Thompson Sampling
This section covers the theory of, and solutions to, the multi-armed bandit problem, including Thompson sampling.
Exploration vs exploitation dilemmas arise in many aspects of life. Investment strategies likewise attempt to balance existing portfolios and new portfolios to achieve higher returns. If we could foresee the future trend of all assets in the market, we could find the best investment strategy by brute-force simulation instead of using smarter approaches. The dilemma originates from incomplete information: we need to gather enough information to make the best overall decisions while keeping the risk under control. With exploitation, we take advantage of the best known option. With exploration, we take some risk to collect information about unknown options. Therefore, the best long-term strategy may involve short-term sacrifices.
The multi-armed bandit problem is a classic problem that exhibits the exploration vs exploitation dilemma. It is like facing multiple slot machines in a casino, each configured with an unknown probability of paying a reward on a single play. The aim is to maximize the cumulative reward. If we knew the optimal action with the best reward, the goal would be equivalent to minimizing the potential regret, or loss, incurred by not picking the optimal action.
The methods used to solve this problem are roughly divided into three categories: the greedy algorithm, the upper confidence bound (UCB) algorithm [Auer et al.2002], and Thompson sampling [Thompson1933].
Thompson sampling is a simple idea that nevertheless works very well for the multi-armed bandit problem [Chapelle and Li2011, Russo and Van Roy2014]. At each time step, we select an action according to the probability that it is optimal under the current Beta distributions. After observing the true reward, we update the Beta distribution of the selected action accordingly. This essentially amounts to Bayesian inference: the posterior is computed from the known prior and the likelihood of the observed data.
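As a minimal sketch of the sample-then-update loop described above, the following Python snippet simulates Thompson sampling on Bernoulli arms. The function name `thompson_sampling`, the simulated payout probabilities, and the Beta(1, 1) starting prior are illustrative assumptions rather than details taken from the cited works.

```python
import random


def thompson_sampling(true_probs, n_rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling on simulated Bernoulli arms."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    alpha = [1.0] * n_arms  # assumed Beta(1, 1) prior for every arm
    beta = [1.0] * n_arms
    pulls = [0] * n_arms

    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its current posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])

        # Observe a simulated Bernoulli reward and update the chosen arm only.
        reward = 1 if rng.random() < true_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1

    return pulls, alpha, beta


if __name__ == "__main__":
    pulls, alpha, beta = thompson_sampling([0.2, 0.5, 0.8], n_rounds=2000)
    print("pulls per arm:", pulls)
```

Because arms with poor posteriors are sampled less and less often, most pulls should concentrate on the arm with the highest payout probability as the number of rounds grows.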
With the rise of reinforcement learning, numerous works have studied how to apply the multi-armed bandit to various fields, such as recommender systems [Li et al.2010, Wu et al.2016] and e-commerce [Brodén et al.2017, Brodén et al.2018]. Some scholars have also tried to incorporate reinforcement learning into the field of portfolio optimization [Liang et al.2018, Jiang et al.2017, Sani et al.2012]. Other studies have used assets directly as arms in a multi-armed bandit. For instance, Shen et al. [Shen et al.2015] proposed to use the UCB algorithm to achieve online portfolio selection by constructing an orthogonal portfolio. Still other studies have used Thompson sampling to generate a portfolio. For example, Shen and Wang [Shen and Wang2016] presented an online portfolio algorithm that leverages Thompson sampling to mix two different strategies. Inspired by these studies, we combine the multi-armed bandit and Thompson sampling, using classic strategies as strategic arms to achieve an adaptive portfolio.
2.2 Portfolio Strategy
This section presents the current state of research on portfolios, including mean-variance models, trend forecasting, and the Universal portfolio. Existing studies are mainly based on classical financial theory and have been combined with machine learning to achieve better performance.
In 1952, Markowitz put forward the mean-variance model, the first of its kind in modern portfolio theory [Markowitz1952]. This model constrains the relevant conditions of the portfolio problem to pursue a balance of risk and return. Some scholars have attempted to improve the mean-variance model by adding regularization [Brodie et al.2009, Shen et al.2014]. Other studies have improved the performance of the mean-variance model by changing the sampling method. For instance, Shen and Wang [Shen and Wang2017] proposed a new portfolio strategy that resamples subsets of the original large universe of assets.
In addition, some scholars have pursued the maximum return-to-risk ratio of the portfolio through trend forecasting, such as by predicting stock price movements in the stock market. For example, Palmowski et al. [Palmowski et al.2018] studied a portfolio selection problem in a continuous-time Itô-Markov additive market in which the prices of financial assets are described by Markov additive processes. Meanwhile, Paolinelli and Arioli [Paolinelli and Arioli2019] proposed a model for stock dynamics based on a non-Gaussian path integral, which connects time horizons and trading strategies.
The third type of research is based on the Universal portfolio theory, a portfolio selection algorithm from the fields of machine learning and information theory. The algorithm learns adaptively from historical data and maximizes the log-optimal growth rate in the long run. Huang et al. [Huang et al.2015] designed a semi-universal portfolio strategy under transaction fees, which tries to avoid rebalancing when the transaction fee outweighs the benefit of trading.
All of the above methods construct an online investment portfolio based on a single financial theory. In contrast, the method of this paper adaptively adopts different investment strategies over multiple cycles to achieve the highest long-term return-to-risk ratio.
3 Methodology
In this section, we first introduce the notations and finance terms used in this paper. We then discuss several strategic arms based on classic portfolios, formulate portfolio blending as a multi-armed bandit problem, and show how to solve this problem using Thompson sampling. Lastly, we summarize the proposed algorithm.
3.1 Notations and Problem Definition
To start with, we give the problem an abstract definition. We consider a self-financing financial environment with a finite horizon and a finite number of assets. The trading periods consist of $t_k = k\Delta t$, $k = 0, 1, \ldots, m$, where $\Delta t$ represents one day, week or month, depending on the rebalancing cycle, and $m$ is the total number of cycles in which we participate in trading. We represent the return vector of the $n$ assets from time $t_{k-1}$ to time $t_k$ as $\mathbf{R}_k = (R_{k,1}, \ldots, R_{k,n})^\top$. The return of the $i$th asset is $R_{k,i} = S_{k,i} / S_{k-1,i} - 1$, where $S_{k,i}$ and $S_{k-1,i}$ represent the price of the $i$th asset at times $t_k$ and $t_{k-1}$. The transaction fee is also an important factor in the final benefit. For the sake of simplifying the model, however, it is not considered here. Still, we take into account how to reduce trading activity.
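The return definition can be illustrated with a short Python sketch. It assumes the simple (arithmetic) return, that is, the price ratio minus one; the helper name `simple_returns` and the nested-list price layout are hypothetical choices made for this example.

```python
def simple_returns(prices):
    """Compute per-period simple returns from a price history.

    prices[k][i] is the price of asset i at rebalancing time k, so the
    result has one row fewer than `prices`.
    """
    returns = []
    for previous, current in zip(prices, prices[1:]):
        # Assumed definition: R = S_now / S_before - 1 for every asset.
        returns.append([c / p - 1.0 for p, c in zip(previous, current)])
    return returns


if __name__ == "__main__":
    history = [[100.0, 50.0], [110.0, 45.0], [99.0, 54.0]]
    for row in simple_returns(history):
        print([round(r, 4) for r in row])
```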
The portfolio weight vector $\mathbf{w}_k = (w_{k,1}, \ldots, w_{k,n})^\top$ at time $t_k$ denotes the investment decision at time $t_k$, where $w_{k,i}$ represents the allocation weight of the $i$th asset in the entire portfolio. We assume that the weights sum to $1$ (except for a pure cash position), i.e., $\mathbf{w}_k^\top \mathbf{1} = 1$, where $\mathbf{1}$ is a column vector of ones. The sign of $w_{k,i}$ corresponds to the actual trading action: $w_{k,i} > 0$ indicates that we need to take a long position in the $i$th asset at market price, while $w_{k,i} < 0$ indicates that we need to take a short position in the $i$th asset. In practice, short selling requires a deposit and the payment of dividends on the borrowed assets, among other costs. However, for the sake of simplification, we do not consider these for the time being, and only consider the gains or losses caused by price changes.
3.2 Strategic Arms Based on Classic Portfolios
In our research, we do not directly use assets as arms in the multi-armed bandit. Instead, we use classic portfolio strategies in finance as strategic arms to reduce the number of arms, reduce transaction volume, and increase stability. We use the following strategies:
Buy and Hold (BH): This is an intuitive idea which involves doing nothing and continuing to hold the existing portfolio in this time window.
$$\mathbf{w}_k^{BH} = \mathbf{w}_{k-1} \qquad (1)$$
Sold All (SA): This involves selling all the assets so that the portfolio is an empty position, i.e., a pure cash position.
$$\mathbf{w}_k^{SA} = \mathbf{0} \qquad (2)$$
Equally-weighted portfolio (EW): Regardless of the asset, all assets are placed into equal-weight positions during each rebalancing period.
$$\mathbf{w}_k^{EW} = \frac{1}{n}\mathbf{1} \qquad (3)$$
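Value-weighted portfolio (VW): As a passive investment strategy, positions in each rebalancing period are allocated in proportion to the current capital of each asset.
$$w_{k,i}^{VW} = \frac{C_{k,i}}{\sum_{j=1}^{n} C_{k,j}} \qquad (4)$$
where $C_{k,i}$ denotes the current capital of the $i$th asset at time $t_k$.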
Mean-variance portfolio (MV): The mean-variance model is a strategy constructed in line with Markowitz’s theory. It captures the aforementioned risk-return trade-off.
(5) 
where is the expected return and is the variance of portfolio returns.
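The rule-based arms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the `capitals` input are hypothetical, Buy and Hold simply carries the previous weights forward without modelling price drift, and the mean-variance arm is left out because it additionally needs estimated moments and an optimizer.

```python
def buy_and_hold(previous_weights):
    """BH: keep the existing portfolio unchanged (price drift is ignored here)."""
    return list(previous_weights)


def sold_all(n_assets):
    """SA: hold no assets at all, i.e. a pure cash position."""
    return [0.0] * n_assets


def equally_weighted(n_assets):
    """EW: give every asset the same weight."""
    return [1.0 / n_assets] * n_assets


def value_weighted(capitals):
    """VW: weight each asset by its share of the total current capital."""
    total = sum(capitals)
    return [c / total for c in capitals]


if __name__ == "__main__":
    previous = [0.5, 0.3, 0.2]
    capitals = [200.0, 100.0, 100.0]  # hypothetical capital of each asset
    print("BH:", buy_and_hold(previous))
    print("SA:", sold_all(3))
    print("EW:", equally_weighted(3))
    print("VW:", value_weighted(capitals))
```

Every arm returns a weight vector of the same length, so a bandit can switch between arms at each rebalancing time without changing the rest of the pipeline.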
3.3 Portfolio Bandit via Thompson Sampling (PBTS)
Each strategy has its own suitable period and scenario, so each has a certain probability of yielding the most profit. Based on this idea, this paper regards the portfolio selection problem as a multi-armed bandit problem, with classic portfolio strategies as the strategic arms, in order to achieve higher long-term returns. The specific definition is as follows:
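The multi-armed bandit of the portfolio strategy is $\langle \mathcal{A}, \mathcal{R} \rangle$. $\mathcal{A}$ is a collection of strategic arms (classic portfolio strategies),
$$\mathcal{A} = \{a_1, a_2, \ldots, a_N\} \qquad (6)$$
where $N$ represents the total number of strategic arms. There are $N$ arms at each time $t_k$, and the selected arm determines which strategy is used to adjust the weights of the portfolio.
Assume $\mathcal{R}$ is the probability distribution function of the return, so that at each time $t_k$ the reward satisfies $r_k \sim \mathcal{R}$. The success probability of each strategic arm $a_i$ follows a Beta distribution $Beta(\alpha_i, \beta_i)$.
At time $t_k$, each arm randomly samples a value $\theta_i \sim Beta(\alpha_i, \beta_i)$ from its respective Beta distribution, and the selected arm is:
$$a_k^{*} = \arg\max_{a_i \in \mathcal{A}} \theta_i \qquad (7)$$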
To judge whether this choice is successful, we consider the users’ investment risk preferences and use the Sharpe ratio as a measure. Specifically, we give a criterion based on the top-k strategy, with threshold $\kappa$. The judgment formula is:
$$r_k = \mathbb{I}\left( SR_k(a_k^{*}) \in \mathrm{top}\text{-}\kappa \left\{ SR_k(a_i) \right\}_{i=1}^{N} \right) \qquad (8)$$
where $\mathbb{I}(\cdot)$ is an indicator function, $a_k^{*}$ is the arm selected at time $t_k$, and $SR_k(a_i)$ represents the Sharpe ratio of the user’s historical selection of arm $a_i$ at time $t_k$. By common international convention, the Sharpe ratio is calculated from the net growth rate over 36 months.
The choice of $\kappa$ can be made based on users’ investment risk preferences: the more the user prefers high risk and high return, the smaller $\kappa$ can be; the more the user prefers a relatively stable investment, the larger $\kappa$ can be.
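The following sketch shows one way such a top-k success criterion could be evaluated. The function name `topk_success`, the dictionary layout, and the example Sharpe ratios are hypothetical; ties are broken by the stable sort order, which is an implementation choice made here and not something the text specifies.

```python
def topk_success(sharpe_by_arm, chosen_arm, k):
    """Return 1 if the chosen arm's Sharpe ratio is among the k largest, else 0."""
    ranked = sorted(sharpe_by_arm, key=sharpe_by_arm.get, reverse=True)
    return 1 if chosen_arm in ranked[:k] else 0


if __name__ == "__main__":
    # Hypothetical Sharpe ratios of the five arms at one rebalancing time.
    sharpe = {"BH": 0.4, "SA": 0.0, "EW": 0.9, "VW": 0.7, "MV": -0.2}
    for k in (1, 2, 3):
        print(k, topk_success(sharpe, "VW", k))
```

With a small threshold only the very best arm counts as a success, so the posterior favors aggressive arms; a larger threshold also rewards arms that are merely good, which matches the risk-preference interpretation given above.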
Additionally, for each arm’s Beta distribution, we first use $Beta(1, 1)$ as the initial prior of each arm and update the prior using a sliding window of historical data. Since no investment strategy performance has been observed at the start, $Beta(1, 1)$, the standard uniform distribution, is a reasonable initialization for investors. At each rebalancing time, the investor runs the Bernoulli trial described above, observes the subsequent success or failure, and updates the posterior distribution accordingly.
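One possible reading of the sliding-window update is that only the most recent outcomes of an arm contribute to its posterior. The sketch below implements that reading; the class name `WindowedBetaArm`, the window length, and the choice to recompute the parameters from a uniform Beta(1, 1) prior plus windowed counts are all assumptions made for illustration.

```python
from collections import deque


class WindowedBetaArm:
    """Beta posterior of one arm built from its most recent Bernoulli outcomes."""

    def __init__(self, window):
        # A bounded deque silently discards the oldest outcome once it is full.
        self.outcomes = deque(maxlen=window)

    def update(self, success):
        self.outcomes.append(1 if success else 0)

    def params(self):
        # Assumed rule: uniform Beta(1, 1) prior plus windowed success/failure counts.
        wins = sum(self.outcomes)
        losses = len(self.outcomes) - wins
        return 1 + wins, 1 + losses


if __name__ == "__main__":
    arm = WindowedBetaArm(window=3)
    for outcome in (1, 1, 0, 0):
        arm.update(outcome)
        print(arm.params())
```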
Algorithm 1 summarizes the process of building the multi-armed bandit problem and solving it via Thompson sampling.
4 Experiments
4.1 Data
Table 1: Summary of the datasets.
| Dataset | Frequency | Time Period | m | n | Description |
| --- | --- | --- | --- | --- | --- |
| FF25 | Monthly | 06/01/1963 to 11/31/2018 | 545 | 25 | 25 portfolios of firms sorted by size and book-to-market |
| FF49 | Monthly | 07/01/1969 to 11/31/2018 | 472 | 49 | 49 industry portfolios representing the U.S. stock market |
| FF100 | Monthly | 07/01/1963 to 11/31/2018 | 544 | 100 | 100 portfolios of firms sorted by size and book-to-market |
| ETFs | Daily | 12/08/2011 to 11/10/2017 | 1,138 | 608 | Exchange-traded funds in U.S. stock market |
| SP500 | Daily | 02/11/2013 to 02/07/2018 | 1,355 | 476 | 500 firms listed in the S&P 500 Index |
In our experiment, we consider two types of datasets. The first is the FF datasets, which were built by Fama and French based on the US stock market and continue to be updated to date [Fama and French1992]. Overall, they have extensive coverage of asset classes and span a long period. In our experiments, the FF25, FF49, and FF100 datasets include monthly returns of 25, 49, and 100 assets over more than half a century. Among them, FF25 and FF100 are formed on size and book-to-market, while FF49 is an industry portfolio. The second type is higher-frequency stock market data, which includes the constituents of the SP500 and ETFs in the US stock market. We exclude assets with missing data during the past five years, which leaves 476 of the 500 constituent stocks and 608 of the 1,340 ETFs.
Table 1 summarizes the datasets, which represent different investment perspectives in the market. The FF datasets emphasize long-term gains, spanning more than half a century. They cover different periods of the US stock market as well as multiple financial crises, and thus reflect the long-term gains of a strategy. Meanwhile, the SP500 and ETFs datasets have a higher trading frequency and highlight the medium-term performance of a strategy outside extreme market conditions. In particular, we choose the time periods of these datasets to avoid the latest financial crisis, which began in 2007.
4.2 Evaluation Metrics
We use the standard criteria in finance [Brandt2010] to measure the performance of the portfolio strategy outside the training sample: (1) Sharpe Ratio; (2) Cumulative Wealth; (3) Maximum Drawdown; (4) Volatility.
Sharpe Ratio (SR) measures the return-to-risk ratio of a portfolio strategy by normalizing the return on the portfolio by its standard deviation. It is expressed as:
$$SR = \frac{\hat{\mu}}{\hat{\sigma}} \qquad (10)$$
where $\hat{\mu}$ and $\hat{\sigma}$ are the sample mean and the sample standard deviation of the out-of-sample portfolio returns.
SR is a comprehensive measure that combines both return and risk, giving the return earned per unit of risk taken by the portfolio.
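A minimal Python sketch of this ratio is given below. It assumes no risk-free rate is subtracted and uses the sample standard deviation; the function name `sharpe_ratio` and both of these choices are assumptions, since the text only states that the return is normalized by its standard deviation.

```python
import statistics


def sharpe_ratio(portfolio_returns):
    """Mean per-period portfolio return divided by its sample standard deviation."""
    mean_return = statistics.mean(portfolio_returns)
    risk = statistics.stdev(portfolio_returns)  # needs at least two observations
    return mean_return / risk


if __name__ == "__main__":
    print(round(sharpe_ratio([0.01, 0.03, 0.02]), 6))
```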
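Cumulative Wealth (CW) measures the weighted return that a portfolio strategy accumulates from the start of the investment to the last evaluated period. It is expressed as:
$$CW = \prod_{k=1}^{m} \left( 1 + \mathbf{w}_{k-1}^\top \mathbf{R}_k \right) \qquad (11)$$
where $\mathbf{w}_{k-1}$ is the weight vector held over the period ending at $t_k$ and $\mathbf{R}_k$ is the vector of asset returns over that period.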
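Cumulative wealth can be sketched as a running product of one plus the weighted portfolio return. The snippet assumes simple returns, an initial wealth of one, and that each weight row is the allocation held during the period of the matching return row; the function name `cumulative_wealth` is hypothetical.

```python
def cumulative_wealth(weights, returns, initial=1.0):
    """Return the wealth path obtained by compounding weighted portfolio returns.

    weights[k][i] is the weight held in asset i during period k, and
    returns[k][i] is the simple return of asset i over that same period.
    """
    wealth = initial
    path = []
    for w, r in zip(weights, returns):
        portfolio_return = sum(wi * ri for wi, ri in zip(w, r))
        wealth *= 1.0 + portfolio_return
        path.append(wealth)
    return path


if __name__ == "__main__":
    weights = [[0.5, 0.5], [1.0, 0.0]]
    returns = [[0.1, 0.3], [-0.5, 0.2]]
    print([round(x, 4) for x in cumulative_wealth(weights, returns)])
```

The last element of the returned path is the final cumulative wealth; the full path is kept because drawdown-style measures need it.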
Maximum Drawdown (MDD) is the largest reduction in cumulative wealth from its running maximum over time, expressed as:
$$MDD = \max_{1 \le k \le m} DD_k, \qquad DD_k = \frac{\max_{1 \le j \le k} CW_j - CW_k}{\max_{1 \le j \le k} CW_j} \qquad (12)$$
where the retracement $DD_k$ represents the loss from the maximum wealth value reached during operation up to time $t_k$, and $CW_k$ denotes the cumulative wealth up to time $t_k$. Since a sharp decline inevitably causes investors to panic and divest, the maximum retracement is usually the primary risk measure for the money management industry.
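As a sketch of how the maximum drawdown might be computed from a wealth path, the function below tracks the running peak and records the largest relative loss from it. Expressing the loss relative to the peak (so that it can be reported as a percentage) is an assumption, and the function name `max_drawdown` is hypothetical.

```python
def max_drawdown(wealth_path):
    """Largest relative drop of the wealth path below its running maximum."""
    peak = wealth_path[0]
    worst = 0.0
    for wealth in wealth_path:
        peak = max(peak, wealth)  # highest wealth seen so far
        worst = max(worst, (peak - wealth) / peak)
    return worst


if __name__ == "__main__":
    print(round(max_drawdown([1.0, 1.2, 0.9, 1.5, 1.2]), 4))
```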
Volatility (VO) is a quantitative risk metric for the investment industry. The calculation of portfolio volatility is related to the standard deviation in Equation (10). To compare portfolio strategies with different weight adjustment frequencies, we calculate the annualized volatility using the following formula:
$$VO = \sqrt{H}\,\hat{\sigma} \qquad (13)$$
where $\hat{\sigma}$ is the standard deviation of the portfolio returns and $H$ is the number of times the weights are adjusted each year. In our experiment, $H$ is set separately for the monthly datasets and for the daily datasets.
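Annualization by the square root of the rebalancing frequency can be sketched as follows. The function name `annualized_volatility` and the use of the sample standard deviation are assumptions; the number of periods per year is passed in explicitly because it depends on the dataset frequency.

```python
import math
import statistics


def annualized_volatility(portfolio_returns, periods_per_year):
    """Scale the per-period standard deviation to a yearly horizon."""
    per_period = statistics.stdev(portfolio_returns)
    return math.sqrt(periods_per_year) * per_period


if __name__ == "__main__":
    monthly_returns = [0.01, 0.03, 0.02, -0.01, 0.00, 0.02]
    print(round(annualized_volatility(monthly_returns, periods_per_year=12), 4))
```

Monthly rebalancing corresponds to 12 periods per year; for daily data, a value such as the number of trading days in a year is a common convention, but the exact figure is a modelling choice and is not taken from the text.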
4.3 Competing Portfolios
To comprehensively assess the proposed method, we consider seven competing portfolios according to our literature review:
Equally-weighted portfolio (EW): EW is one of the classic strategies. It has outperformed 14 sophisticated models across seven real-world datasets at monthly frequency [DeMiguel et al.2007]. Therefore, EW is the first benchmark algorithm for portfolio research.
Value-weighted portfolio (VW): VW is a passive strategy that imitates the market portfolio, so its volatility is the same as that of the market index. It is also an important benchmark strategy.
Mean-variance portfolio (MV): MV is one of our basic strategies, based on Markowitz’s theory, and performs well in different markets and time spans.
Orthogonal Bandit portfolio (OBP): OBP constructs multiple assets by building orthogonal portfolios. It also uses the upper confidence bound bandit framework to derive the optimal portfolio strategy, which represents a combination of passive and active investments according to a risk-adjusted reward function [Shen et al.2015].
Portfolio Blending via Thompson Sampling (TSEM, TSVM): This strategy applies Thompson sampling to portfolio selection by mixing EW and MV (TSEM) or VW and MV (TSVM) [Shen and Wang2016].
Portfolio Selection via Subset Resampling (SSR): The SSR method estimates the parameters by resampling subsets of the original assets and aggregates the portfolios constructed on these subsets to obtain the portfolio over all assets [Shen and Wang2017].
EW, VW, and MV are three of the strategic arms of PBTS and should therefore be compared with the hybrid model proposed in this paper. OBP, TSEM, TSVM, and SSR are the methods that inspired our model. They are well recognized as important portfolio strategies based on the exploration and exploitation problem. Therefore, to be more convincing, we also compare with these four models.
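Table 2: Portfolio performance evaluated by SR, CW, MDD, and VO.
| Dataset | Metrics | PBTS | EW | VW | MV | OBP | TSEM | TSVM | SSR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FF25 | SR | 22.60 | 20.02 | 19.84 | 19.30 | 15.92 | 19.82 | 19.93 | 19.08 |
| | CW | 589.41 | 291.93 | 398.67 | 766.58 | 241.60 | 588.41 | 520.81 | 772.46 |
| | MDD (%) | 43.83 | 54.10 | 55.91 | 57.98 | 59.41 | 57.07 | 56.60 | 58.49 |
| | VO (%) | 17.71 | 17.51 | 17.68 | 18.20 | 22.03 | 17.71 | 17.60 | 18.41 |
| FF49 | SR | 24.20 | 23.15 | 23.22 | 11.77 | 18.55 | 15.77 | 15.93 | 13.22 |
| | CW | 29.94 | 19.46 | 17.26 | 12.43 | 24.65 | 15.23 | 16.21 | 12.68 |
| | MDD (%) | 38.30 | 52.83 | 51.42 | 79.90 | 51.97 | 68.72 | 68.35 | 75.79 |
| | VO (%) | 14.39 | 15.10 | 15.05 | 29.76 | 18.87 | 22.19 | 21.96 | 26.44 |
| FF100 | SR | 21.76 | 20.71 | 21.43 | 19.21 | 15.81 | 20.85 | 20.62 | 20.21 |
| | CW | 57.27 | 28.12 | 53.28 | 18.04 | 43.14 | 29.42 | 22.69 | 28.78 |
| | MDD (%) | 30.76 | 58.73 | 53.72 | 50.26 | 54.80 | 51.80 | 53.29 | 52.38 |
| | VO (%) | 16.06 | 16.88 | 16.33 | 18.18 | 22.16 | 16.77 | 16.94 | 17.30 |
| ETFs | SR | 194.49 | 197.22 | 147.67 | 17.93 | 70.94 | 31.38 | 31.42 | 19.40 |
| | CW | 1.15 | 1.28 | 1.51 | 0.15 | 1.88 | 0.63 | 0.58 | 0.38 |
| | MDD (%) | 15.40 | 16.20 | 18.46 | 96.44 | 23.71 | 78.77 | 79.58 | 88.09 |
| | VO (%) | 9.83 | 9.69 | 12.94 | 106.56 | 26.95 | 60.89 | 60.81 | 98.54 |
| SP500 | SR | 126.88 | 124.49 | 127.56 | 41.27 | 52.54 | 66.71 | 66.56 | 39.77 |
| | CW | 1.65 | 1.52 | 1.53 | 1.27 | 1.30 | 1.49 | 1.47 | 1.32 |
| | MDD (%) | 14.97 | 16.41 | 14.97 | 36.81 | 41.09 | 20.82 | 21.24 | 46.49 |
| | VO (%) | 15.06 | 15.35 | 14.98 | 46.32 | 36.38 | 28.65 | 28.72 | 48.07 |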
4.4 Parameter Settings
We use the “rolling range” setting proposed by DeMiguel et al. [DeMiguel et al.2007]. For the model proposed in this paper, we use a sliding window of fixed length. For the parameters of PBTS, we use cross-validation to establish the optimal values. For the parameters of the other comparison algorithms, we use the settings recommended in the relevant studies.
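A rolling evaluation of this kind can be sketched as a generator of index pairs: each step estimates on the most recent window of periods and evaluates on the period that follows. The function name `rolling_windows` and the example sizes are hypothetical, and the window length is left as a parameter.

```python
def rolling_windows(n_periods, window):
    """Yield (training_indices, test_index) pairs for a rolling backtest."""
    for test_index in range(window, n_periods):
        # Estimate on the `window` periods immediately before the test period.
        yield range(test_index - window, test_index), test_index


if __name__ == "__main__":
    for train, test in rolling_windows(n_periods=6, window=3):
        print(list(train), "->", test)
```

With this scheme the number of out-of-sample periods equals the total number of periods minus the window length.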
4.5 Results and Analysis
Results. Table 2 summarizes the portfolio performance evaluated by the SR, CW, MDD, and VO for all the tested benchmarks. The values in bold represent the winners’ performance. The proposed PBTS method achieves better performance in most cases. On the one hand, for the SR, the results of PBTS are in the first echelon on all datasets, falling slightly below EW on the ETFs dataset and below VW on the SP500 dataset. This indicates that PBTS generally has a better return-to-risk ratio. For the absolute return indicator, we use Figure 1 to reflect the change in earnings over time. PBTS outperforms the other methods on most datasets; it is only below MV and SSR on the FF25 dataset and below EW, VW, and OBP on the ETFs dataset. However, the OBP and SSR methods fluctuate widely on the other datasets and are less robust than PBTS. On the other hand, PBTS performs better on the risk indicators. As summarized in Table 2, the MDD of PBTS is the smallest on every dataset (tied with VW on the SP500 dataset). The VO of PBTS is the lowest on the FF49 and FF100 datasets and is at most 0.20 percentage points above the lowest value on the FF25, ETFs, and SP500 datasets, where the lowest VO belongs to EW or VW, the classic strategies that usually have the lowest volatility.
Analysis. In summary, based on the three long-term FF datasets, we believe that PBTS performs outstandingly in the long term. This is consistent with the goal of PBTS of achieving excellent long-term returns. However, in the medium-term, higher-frequency setting of the ETFs and SP500 datasets, we find that PBTS is not robust enough and depends on the performance of the basic strategies.
Parameter effect analysis. We analyze the performance of PBTS under different values of the top-k threshold $\kappa$, as shown in Figure 2. On the FF25 dataset, as $\kappa$ becomes larger, the volatility decreases, but so does the cumulative wealth. This is consistent with our hypothesis that the larger $\kappa$ is, the lower the user’s risk preference and the lower the risk that can be borne, at the cost of a reduced return.
5 Conclusions and Future Work
In this paper, we cast the portfolio selection problem as a multi-armed bandit problem, in which we used the classic portfolio strategies as the strategic arms to form a dynamic portfolio strategy that adapts to different periods over multiple cycles. Moreover, we devised a reward function based on the user’s investment risk preference to judge each selection and chose the arm of each period via Thompson sampling. Our algorithm balances returns and risks appropriately and achieves higher returns while controlling risk.
In future work, we will consider the correlation between the strategic arms and the impact of the previous selection path on the next choice. In addition, practical aspects of financial scenarios, such as transaction fees, taxes, and dividends, should be considered in order to build a portfolio strategy that is more consistent with real scenarios.
References
[Auer et al.2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[Brandt2010] Michael W Brandt. Portfolio choice problems. In Handbook of Financial Econometrics: Tools and Techniques, pages 269–336. Elsevier, 2010.
[Brinson et al.1995] Gary P Brinson, L Randolph Hood, and Gilbert L Beebower. Determinants of portfolio performance. Financial Analysts Journal, 51(1):133–138, 1995.
[Brodén et al.2017] Björn Brodén, Mikael Hammar, Bengt J Nilsson, and Dimitris Paraschakis. Bandit algorithms for e-commerce recommender systems. In Proceedings of the Eleventh ACM Conference on Recommender Systems, pages 349–349. ACM, 2017.
[Brodén et al.2018] Björn Brodén, Mikael Hammar, Bengt J Nilsson, and Dimitris Paraschakis. Ensemble recommendations via Thompson sampling: an experimental study within e-commerce. In 23rd International Conference on Intelligent User Interfaces, pages 19–29. ACM, 2018.
[Brodie et al.2009] Joshua Brodie, Ingrid Daubechies, Christine De Mol, Domenico Giannone, and Ignace Loris. Sparse and stable Markowitz portfolios. Proceedings of the National Academy of Sciences, 106(30):12267–12272, 2009.
[Chapelle and Li2011] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
[DeMiguel et al.2007] Victor DeMiguel, Lorenzo Garlappi, and Raman Uppal. Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy? The Review of Financial Studies, 22(5):1915–1953, 2007.
[Fama and French1992] Eugene F Fama and Kenneth R French. The cross-section of expected stock returns. The Journal of Finance, 47(2):427–465, 1992.
[Huang et al.2015] Dingjiang Huang, Yan Zhu, Bin Li, Shuigeng Zhou, and Steven CH Hoi. Semi-universal portfolios with transaction costs. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[Jiang et al.2017] Zhengyao Jiang, Dixing Xu, and Jinjun Liang. A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059, 2017.
[Li et al.2010] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
[Liang et al.2018] Zhipeng Liang, Hao Chen, Junhao Zhu, Kangkang Jiang, and Yanran Li. Adversarial deep reinforcement learning in portfolio management. arXiv preprint arXiv:1808.09940, 2018.
[Markowitz1952] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.
[Palmowski et al.2018] Zbigniew Palmowski, Łukasz Stettner, and Anna Sulima. Optimal portfolio selection in an Itô-Markov additive market. arXiv preprint arXiv:1806.03496, 2018.
[Paolinelli and Arioli2019] Giovanni Paolinelli and Gianni Arioli. A model for stocks dynamics based on a non-Gaussian path integral. Physica A: Statistical Mechanics and its Applications, 517:499–514, 2019.
[Russo and Van Roy2014] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
[Sani et al.2012] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283, 2012.
[Shen and Wang2016] Weiwei Shen and Jun Wang. Portfolio blending via Thompson sampling. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1983–1989. AAAI Press, 2016.
[Shen and Wang2017] Weiwei Shen and Jun Wang. Portfolio selection via subset resampling. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[Shen et al.2014] Weiwei Shen, Jun Wang, and Shiqian Ma. Doubly regularized portfolio with risk minimization. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 1286–1292. AAAI Press, 2014.
[Shen et al.2015] Weiwei Shen, Jun Wang, Yu-Gang Jiang, and Hongyuan Zha. Portfolio choices with orthogonal bandit learning. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[Thompson1933] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
[Wu et al.2016] Qingyun Wu, Huazheng Wang, Quanquan Gu, and Hongning Wang. Contextual bandits in a collaborative environment. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 529–538. ACM, 2016.