Adaptive Portfolio by Solving Multi-armed Bandit via Thompson Sampling
As the cornerstone of modern portfolio theory, Markowitz’s mean-variance optimization is a major model in portfolio management. However, because its parameters are difficult to estimate, it does not perform well in every period. In some cases, naive strategies such as equally-weighted and value-weighted portfolios even perform better. Under these circumstances, we can treat multiple classic strategies as strategic arms in a multi-armed bandit, which naturally connects the bandit problem to portfolio selection and lets the bandit algorithm maximize rewards through the trade-off between exploration and exploitation. In this paper, we present a portfolio bandit strategy based on Thompson sampling, which makes online portfolio choices by effectively exploiting the performance of multiple arms. By constructing multiple strategic arms, we can also obtain an investment portfolio that adapts to different investment periods. Moreover, we devise a novel reward function based on users’ different investment risk preferences, which adapts to various investment styles. Our experimental results demonstrate that the proposed portfolio strategy has marked superiority across representative real-world market datasets on a broad set of evaluation criteria.
1 Introduction
The portfolio selection problem is a fundamental issue in the financial sector for many asset investments, including funds, stocks, bonds, and options. According to Gary Brinson, the father of global asset allocation, “Asset allocation is the main factor that affects all overall returns.” In the long run, more than 90% of a portfolio’s performance is attributable to its asset allocation [Brinson et al.1995]. Thus, asset allocation of a portfolio is the key determinant of performance, risk, and volatility over time.
Modern portfolio theory and analysis tend to build upon the seminal work of Markowitz [Markowitz1952]. Up to now, the mean-variance paradigm has remained the mainstream choice for academia and industry. However, the main problem of using a single strategy is that it cannot adapt to a changing environment. For instance, in the event of a stock market crash, the sold-all strategy, which holds only cash, performs well. During a bull market, by contrast, the buy-and-hold strategy is likely to perform even better. A simple approach to this issue in the investment field is to periodically review the effectiveness of the current strategy and adjust the strategy for the next phase accordingly. This is a typical exploration and exploitation problem. Therefore, we use reinforcement learning to determine the portfolio strategy that best adapts to different investment periods.
Meanwhile, each investor balances risk and return differently, which is called the user’s investment risk preference. Although a good strategy can ultimately help in achieving a good return, not every investor is willing to take on risk in the process. For example, users with a low risk tolerance are unlikely to accept short-term losses, whereas users with a high risk tolerance tend to pursue high returns and usually do not care about drawdowns. For this reason, we take the users’ investment risk preferences into account and propose a novel reward function, which adapts to various investment styles such as high-risk-high-return and low-risk-low-return.
In this paper, we first turn the portfolio problem into a multi-armed bandit problem and construct a series of strategic arms based on the classic strategies. Subsequently, we apply the Thompson sampling method to select a strategic arm and then update the Beta distributions of the strategic arms based on the user’s investment risk preference. The contributions of the present work are summarized as follows:
- To adapt to different market conditions in different periods, we utilize the multi-armed bandit problem to adaptively select the most suitable strategy to form an online portfolio strategy.
- We devise a novel reward function based on the users’ investment risk preferences so that the method fits a variety of users’ needs. This helps to achieve different return-to-risk ratios.
- Experimental results indicate that the proposed portfolio strategy has marked superiority across representative real-world market datasets in terms of a series of standard financial evaluation indicators, which include Sharpe ratios, cumulative wealth, volatility, and maximum drawdowns.
2 Related Work
In this section, we briefly discuss two topics, that is, multi-armed bandit and portfolio selection problem.
2.1 Multi-armed Bandit and Thompson Sampling
This section reviews the multi-armed bandit problem, its main solution methods, and Thompson sampling.
Exploration versus exploitation dilemmas arise in many aspects of life. Investment strategies face the same dilemma: they must balance existing portfolios against new ones to achieve higher returns. If we knew the future trend of every asset in the market, we could find the best investment strategy by brute-force simulation, with no need for smarter approaches. The dilemma originates from incomplete information: we need to gather enough information to make the best overall decisions while keeping the risk under control. With exploitation, we take advantage of the best known option. With exploration, we take some risk to collect information about unknown options. Therefore, the best long-term strategy may involve short-term sacrifices.
The multi-armed bandit problem is a classic problem that exhibits the exploration versus exploitation dilemma. It is like facing multiple slot machines in a casino, each configured with an unknown probability of paying a reward on a single play. The aim is to maximize the cumulative reward. If we knew the optimal action with the best reward, this goal would be equivalent to minimizing the potential regret, that is, the loss incurred by not picking the optimal action.
The methods that can be used to solve this problem fall roughly into three categories: the ε-greedy algorithm, the upper confidence bound (UCB) algorithm [Auer et al.2002], and Thompson sampling [Thompson1933].
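To make the contrast with Thompson sampling concrete, the first two families can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function names are hypothetical, `means` holds the empirical mean reward of each arm, `counts` the number of plays, and the UCB index is the standard UCB1 form (mean plus the square root of 2 ln t / n).

```python
import math
import random


def epsilon_greedy_select(means, epsilon, rng):
    """Explore a uniformly random arm with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(means))
    return max(range(len(means)), key=lambda i: means[i])


def ucb1_select(means, counts, t):
    """Pick the arm with the largest UCB1 index; unplayed arms are tried first."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(
        range(len(means)),
        key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
    )
```

With `epsilon = 0` the first rule is purely greedy, while UCB1 keeps revisiting rarely played arms because their confidence bonus stays large.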
Thompson sampling is based on a simple idea, yet it works very well for the multi-armed bandit problem [Chapelle and Li2011, Russo and Van Roy2014]. At each time step, an action is selected according to the posterior probability that it is optimal; with Bernoulli rewards, this posterior is a Beta distribution. After observing the true reward, the Beta distribution is updated accordingly. This essentially amounts to Bayesian inference: the posterior is computed from the known prior and the likelihood of the sampled data.
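The sample-then-update loop described above can be sketched for Bernoulli rewards as follows. This is a minimal illustration under stated assumptions: the names `thompson_step` and `run_demo` are hypothetical, each arm starts from a uniform Beta(1, 1) prior, and the reward probabilities in the demo are made up.

```python
import random


def thompson_step(alpha, beta, pull, rng):
    """One round of Beta-Bernoulli Thompson sampling; updates alpha/beta in place."""
    # Draw one plausible success probability per arm from its Beta posterior.
    samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    arm = max(range(len(samples)), key=lambda i: samples[i])
    reward = pull(arm)  # observed Bernoulli reward, 0 or 1
    # Conjugate update: successes are added to alpha, failures to beta.
    alpha[arm] += reward
    beta[arm] += 1 - reward
    return arm, reward


def run_demo(true_probs, rounds, seed=0):
    """Simulate the loop on arms with made-up success probabilities."""
    rng = random.Random(seed)
    alpha = [1.0] * len(true_probs)  # Beta(1, 1) is the uniform prior
    beta = [1.0] * len(true_probs)
    pulls = [0] * len(true_probs)
    for _ in range(rounds):
        arm, _ = thompson_step(
            alpha, beta, lambda i: int(rng.random() < true_probs[i]), rng
        )
        pulls[arm] += 1
    return alpha, beta, pulls
```

Calling `run_demo([0.2, 0.8], 500)` concentrates most pulls on the second arm, since its posterior quickly dominates the samples.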
With the rise of reinforcement learning, numerous works have studied how to apply the multi-armed bandit to various fields, such as recommender systems [Li et al.2010, Wu et al.2016] and e-commerce [Brodén et al.2017, Brodén et al.2018]. In addition, some scholars have tried to incorporate reinforcement learning into portfolio optimization [Liang et al.2018, Jiang et al.2017, Sani et al.2012]. Other studies have used assets directly as arms in a multi-armed bandit. For instance, Shen et al. [Shen et al.2015] proposed to use the UCB algorithm to achieve online portfolio selection by constructing orthogonal portfolios. Still other studies have used Thompson sampling alone to generate a portfolio. For example, Shen and Wang [Shen and Wang2016] presented an online portfolio algorithm that leverages Thompson sampling to mix two different strategies. Inspired by these studies, we combine the multi-armed bandit with Thompson sampling and use the classic strategies as strategic arms to achieve an adaptive portfolio.
2.2 Portfolio Strategy
This section presents the current state of research on portfolios, including mean-variance models, trend forecasting, and the Universal portfolio. Existing studies build mainly on classical financial theory and combine it with machine learning to achieve better performance.
In 1952, Markowitz put forward the mean-variance model, which was the first of its kind in modern portfolio theory [Markowitz1952]. This model constrains the portfolio problem so as to balance risk and return. Some scholars have attempted to improve the mean-variance model by adding regularization [Brodie et al.2009, Shen et al.2014]. Other studies have improved its performance by changing the sampling method. For instance, Shen and Wang [Shen and Wang2017] proposed a new portfolio strategy that resamples subsets of the original large universe of assets.
In addition, some scholars have pursued the maximum return-to-risk ratio of the portfolio through trend forecasting, such as by predicting stock price movements in the stock market. For example, Palmowski et al. [Palmowski et al.2018] studied a portfolio selection problem in a continuous-time Itô-Markov additive market in which the prices of financial assets are described by Markov additive processes. Meanwhile, Paolinelli and Arioli [Paolinelli and Arioli2019] proposed a model for stock dynamics based on a non-Gaussian path integral, which connects time horizons and trading strategies.
The third type of research is based on the Universal portfolio theory. This is a portfolio selection algorithm from the field of machine learning and information theory. The algorithm learns adaptively from historical data and maximizes the log-optimal growth rate in the long run. Huang et al. [Huang et al.2015] designed a semi-universal portfolio strategy under transaction fees, which tries to avoid rebalancing when the transaction fee outweighs the benefit of trading.
Each of the above methods constructs an online investment portfolio from a single financial theory. In contrast, the method of this paper adaptively adopts different investment strategies over multiple cycles to achieve the highest long-term return-to-risk ratio.
3 Methodology
In this section, we first introduce the notation and finance terms used in this paper. We then describe several strategic arms based on classic portfolios, formulate portfolio blending as a multi-armed bandit problem, and show how to solve this problem using Thompson sampling. Lastly, we summarize the proposed algorithm.
3.1 Notations and Problem Definition
To start with, we give the problem an abstract definition. We consider a self-financing financial environment with a finite horizon and a finite set of assets. The trading periods consist of $t_k$, $k = 1, \dots, m$, where $t_k - t_{k-1}$ represents one day, week, or month, depending on the rebalancing cycle, and $m$ is the total number of cycles in which we trade. We represent the return vector of the $n$ assets from time $t_{k-1}$ to time $t_k$ as $R_k = (R_{k,1}, \dots, R_{k,n})^\top$. The return of the $i$-th asset is $R_{k,i} = S_{k,i} / S_{k-1,i} - 1$, where $S_{k,i}$ and $S_{k-1,i}$ represent the price of the $i$-th asset at times $t_k$ and $t_{k-1}$. The transaction fee is also an important factor in the final profit. To keep the model simple, however, we do not include it; instead, we consider how to reduce the amount of trading.
$w_k = (w_{k,1}, \dots, w_{k,n})^\top$, the portfolio weight vector at time $t_k$, denotes the investment decision at time $t_k$, where $w_{k,i}$ represents the allocation weight of the $i$-th asset in the entire portfolio. We assume that the weights sum to $1$ (except for a pure cash position), i.e., $w_k^\top \mathbf{1} = 1$, where $\mathbf{1}$ is a column vector of ones. The sign of $w_{k,i}$ corresponds to the actual trading action: $w_{k,i} > 0$ indicates that we take a long position in the $i$-th asset at the market price, while $w_{k,i} < 0$ indicates that we take a short position in the $i$-th asset. In practice, short selling requires a deposit and the payment of dividends on the borrowed assets, among other costs. For the sake of simplicity, we do not consider these for the time being and only consider the gains or losses caused by price changes.
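The notation above can be made concrete with a small sketch. It assumes simple (arithmetic) returns, i.e. the price ratio minus one, which is an assumption rather than something the text confirms; the function names are hypothetical.

```python
def simple_returns(prev_prices, prices):
    """Per-asset simple return between two consecutive rebalancing times."""
    return [p / q - 1.0 for p, q in zip(prices, prev_prices)]


def portfolio_return(weights, returns):
    """One-period portfolio return: the weighted sum of the asset returns."""
    return sum(w * r for w, r in zip(weights, returns))
```

For example, prices moving from (100, 50) to (110, 45) give asset returns of about +10% and -10%, so an equally weighted portfolio of the two earns roughly zero over the period. A negative weight models a short position, whose contribution is positive when the asset's price falls.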
3.2 Strategic Arms Based on Classic Portfolios
In our research, we do not directly use assets as arms in multi-armed bandit. Instead, we use classic portfolio strategies in finance as strategic arms to reduce the number of arms, and also to reduce transaction volume as well as increase stability. We use the following strategies:
Buy and Hold (BH): This is an intuitive idea which involves doing nothing and continuing to hold the existing portfolio in this time window.
Sold All (SA): Involves selling all the assets so that the combination is an empty position or a pure cash position.
Equally-weighted portfolio (EW): Regardless of the asset, all assets are directly placed into equal weight positions during each rebalancing period.
Value-weighted portfolio (VW): As a passive investment strategy, positions in each rebalancing period are allocated as per the current capital of each asset.
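Mean-variance portfolio (MV): The mean-variance model is a strategy constructed in line with Markowitz’s theory. It captures the aforementioned risk-return trade-off between $\mu_p = w^\top \mu$ and $\sigma_p^2 = w^\top \Sigma w$,
where $\mu_p$ is the expected return and $\sigma_p^2$ is the variance of portfolio returns, computed from the expected asset returns $\mu$ and their covariance matrix $\Sigma$.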
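The four parameter-free arms listed above can be sketched as weight-producing functions. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, buy-and-hold is modeled by letting the previous weights drift with realized returns, and the mean-variance arm is omitted because it additionally needs estimates of expected returns and covariances plus a solver.

```python
def buy_and_hold(prev_weights, returns):
    """Trade nothing; the weights drift with each asset's realized return."""
    grown = [w * (1.0 + r) for w, r in zip(prev_weights, returns)]
    total = sum(grown)
    # A pure cash position (all zeros) stays a pure cash position.
    return [g / total for g in grown] if total > 0 else list(prev_weights)


def sold_all(n_assets):
    """Pure cash position: every asset weight is zero."""
    return [0.0] * n_assets


def equally_weighted(n_assets):
    """Every asset receives the same weight 1/n."""
    return [1.0 / n_assets] * n_assets


def value_weighted(market_caps):
    """Weights proportional to each asset's current market capitalization."""
    total = sum(market_caps)
    return [c / total for c in market_caps]
```

Each function returns a full weight vector, so a bandit layer only has to choose which function to call at each rebalancing time.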
3.3 Portfolio Bandit via Thompson Sampling (PBTS)
Each strategy has its own suitable period and scenario, and thus each has a certain probability of being the most profitable. Based on this idea, this paper regards the portfolio selection problem as a multi-armed bandit problem, with classic portfolio strategies as the strategic arms, in order to achieve higher long-term returns. The specific definition is as follows:
The multi-armed bandit of the portfolio strategy is $\langle \mathcal{A}, \mathcal{R} \rangle$. $\mathcal{A}$ is a collection of strategic arms (classic portfolio strategies),
$$\mathcal{A} = \{a_1, a_2, \dots, a_M\},$$
where $M$ represents the total number of strategic arms. There are $M$ arms available at each time $t_k$, and the selected arm determines which strategy is used to adjust the weights of the portfolio.
Assume $\mathcal{R}$ is the probability distribution function of the reward; at each time $t_k$, the reward $r_k$ of the selected arm is drawn from $\mathcal{R}$. The success probability of each strategic arm $a_i$ follows a Beta distribution $\mathrm{Beta}(\alpha_i, \beta_i)$.
At time $t_k$, each arm randomly samples a value $\theta_i \sim \mathrm{Beta}(\alpha_i, \beta_i)$ from its respective Beta distribution, and the selected arm is:
$$a_k^{*} = \arg\max_{a_i \in \mathcal{A}} \theta_i.$$
To judge whether this choice is successful, we take the users’ investment risk preferences into account and use the Sharpe ratio as the measure. We therefore give a criterion based on the top-$K$ strategy. The judgment formula is:
$$r_k = \mathbb{I}\left( SR_k(a_k^{*}) \in \text{top-}K \{ SR_k(a_1), \dots, SR_k(a_M) \} \right), \tag{8}$$
where $\mathbb{I}(\cdot)$ is an indicator function and $SR_k(a)$ represents the Sharpe ratio of the user’s historical selection of arm $a$ at time $t_k$. By common international convention, the Sharpe ratio is calculated from 36 months of net growth rates.
The value of $K$ can be chosen according to the users’ investment risk preferences. If the user prefers to pursue high risk and high return, $K$ can be smaller; if the user tends to pursue a relatively stable investment, $K$ can be larger.
Then the Beta distribution of arm $a_k^{*}$ is updated as:
$$\alpha_{a_k^{*}} \leftarrow \alpha_{a_k^{*}} + r_k, \qquad \beta_{a_k^{*}} \leftarrow \beta_{a_k^{*}} + 1 - r_k,$$
where $r_k$ is determined by Equation (8).
Additionally, for each arm’s Beta distribution, we first use $\mathrm{Beta}(1, 1)$ as the initial prior of each arm and update it using a sliding window of historical data. Since no strategy performance has been observed at the start, $\mathrm{Beta}(1, 1)$, the standard uniform distribution, is a reasonable initialization for investors. At each rebalancing time, the investor runs the Bernoulli trial described above, observes the subsequent success or failure, and updates the posterior distribution accordingly.
Algorithm 1 summarizes the process of formulating the multi-armed bandit problem and solving it via Thompson sampling.
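One rebalancing step of the procedure can be sketched as follows. This is a hedged reading of the description, not the authors' Algorithm 1: it assumes the reward is 1 when the selected arm's Sharpe ratio ranks among the top `top_k` arms and 0 otherwise, and it takes the per-arm Sharpe ratios as a precomputed input. The names `top_k_reward` and `pbts_select_and_update` are hypothetical.

```python
import random


def top_k_reward(sharpe_by_arm, chosen, top_k):
    """Return 1 if the chosen arm's Sharpe ratio is among the top_k arms, else 0."""
    ranked = sorted(
        range(len(sharpe_by_arm)), key=lambda i: sharpe_by_arm[i], reverse=True
    )
    return int(chosen in ranked[:top_k])


def pbts_select_and_update(alpha, beta, sharpe_by_arm, top_k, rng):
    """Sample every arm's Beta posterior, pick the best, update it in place."""
    samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    chosen = max(range(len(samples)), key=lambda i: samples[i])
    reward = top_k_reward(sharpe_by_arm, chosen, top_k)
    alpha[chosen] += reward      # success count of the chosen arm
    beta[chosen] += 1 - reward   # failure count of the chosen arm
    return chosen, reward
```

Under this reading, a smaller `top_k` rewards only the very best arms, while a larger `top_k` also rewards arms that are merely among the better ones.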
4 Experiments
4.1 Datasets
Table 1: Summary of the datasets.
| Dataset | Frequency | Time period | Periods | Assets | Description |
|---|---|---|---|---|---|
| FF25 | Monthly | 06/01/1963 - 11/31/2018 | 545 | 25 | 25 portfolios of firms sorted by size and book-to-market |
| FF49 | Monthly | 07/01/1969 - 11/31/2018 | 472 | 49 | 49 industry portfolios representing the U.S. stock market |
| FF100 | Monthly | 07/01/1963 - 11/31/2018 | 544 | 100 | 100 portfolios of firms sorted by size and book-to-market |
| ETFs | Daily | 12/08/2011 - 11/10/2017 | 1,138 | 608 | Exchange-traded funds in U.S. stock market |
| SP500 | Daily | 02/11/2013 - 02/07/2018 | 1,355 | 476 | 500 firms listed in the S&P 500 Index |
In our experiments, we consider two types of datasets. The first is the FF datasets, which were built by Fama and French from the US stock market and continue to be updated to date [Fama and French1992]. Overall, they cover an extensive range of asset classes and span a long period. In our experiments, the FF25, FF49, and FF100 datasets include monthly returns of 25, 49, and 100 assets over more than half a century. Among them, FF25 and FF100 are formed on size and book-to-market, while FF49 is an industry portfolio. The second is higher-frequency stock market data, which includes constituents of the SP500 and ETFs in the US stock market. We exclude assets with missing data during the past five years. Thus, we retain 476 of the 500 constituent stocks and 608 of 1,340 ETFs.
Table 1 is a summary of the datasets, which represent different investment perspectives in the market. The FF datasets emphasize long-term gains, spanning more than half a century. They cover different periods of the US stock market as well as multiple financial crises, and can therefore reflect the long-term gains of the strategy. Meanwhile, the SP500 and ETFs datasets reflect performance at a high trading frequency. Because they contain no extreme market conditions, they highlight the medium-term performance of the strategy. In particular, we chose the time span of these two datasets to avoid the financial crisis that began in 2007.
4.2 Evaluation Metrics
We use the standard criteria in finance [Brandt2010] to measure the performance of the portfolio strategy outside the training sample: (1) Sharpe Ratio; (2) Cumulative Wealth; (3) Maximum Drawdown; (4) Volatility.
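Sharpe Ratio (SR) measures the return-to-risk ratio of a portfolio strategy and normalizes the return on the portfolio using its standard deviation. It is expressed as:
$$SR = \frac{\hat{\mu}}{\hat{\sigma}}, \tag{10}$$
where $\hat{\mu}$ is the mean of the out-of-sample portfolio returns and $\hat{\sigma}$ is their standard deviation.
SR is a comprehensive measure that combines both return and risk into the evaluation, giving the return earned per unit of risk taken by the portfolio.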
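A minimal sketch of this ratio is shown below. The optional `risk_free` argument is an assumption added for illustration (it defaults to zero, which reduces the function to the plain mean divided by the standard deviation), and the sample standard deviation with n - 1 in the denominator is likewise an assumed convention.

```python
import statistics


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the sample standard deviation of returns."""
    excess = [r - risk_free for r in returns]
    sd = statistics.stdev(excess)
    if sd == 0.0:
        raise ValueError("Sharpe ratio is undefined for constant returns")
    return statistics.mean(excess) / sd
```

The result is expressed per rebalancing period; it is not annualized here.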
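Cumulative Wealth (CW) is the return of a portfolio strategy compounded from the first period in which it invests up to the last period evaluated, with each period’s return weighted by the portfolio weights. It is expressed as:
$$CW = \prod_{k=1}^{m} \left( 1 + w_k^\top R_k \right),$$
where $w_k$ is the portfolio weight vector and $R_k$ is the asset return vector in period $k$, and $m$ is the number of periods.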
Maximum Drawdown (MDD) is the maximum reduction in cumulative wealth from its running maximum over time, expressed as:
$$MDD = \max_{\tau \in \{1, \dots, m\}} \left[ \max_{k \in \{1, \dots, \tau\}} CW_k - CW_\tau \right],$$
where the bracketed retracement represents the loss from the maximum wealth value reached during operation up to the time $\tau$, and $CW_\tau$ denotes the cumulative wealth up to the time $\tau$. Since a sharp decline inevitably causes investors to panic and withdraw their funds, the maximum retracement is usually the primary risk measure in the money management industry.
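Both wealth-based metrics can be computed in a single pass over the per-period portfolio returns. The sketch below assumes simple returns that compound multiplicatively and measures the drawdown as an absolute amount of wealth, following the wording "maximum amount of wealth reduction"; dividing each drop by the running peak would give the relative variant instead. The function names are hypothetical.

```python
def cumulative_wealth(returns, initial=1.0):
    """Wealth path obtained by compounding per-period portfolio returns."""
    wealth, path = initial, []
    for r in returns:
        wealth *= 1.0 + r
        path.append(wealth)
    return path


def max_drawdown(wealth_path):
    """Largest absolute drop of the wealth path below its running peak."""
    peak, worst = wealth_path[0], 0.0
    for w in wealth_path:
        peak = max(peak, w)           # highest wealth seen so far
        worst = max(worst, peak - w)  # current retracement from that peak
    return worst
```

For returns of +10%, -50%, and +20%, the wealth path is roughly 1.1, 0.55, 0.66, so the maximum drawdown is about 0.55.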
Volatility (VO) is a quantitative risk metric for the investment industry. The calculation of portfolio volatility is related to the standard deviation in Equation (10). To compare portfolio strategies with different weight adjustment frequencies, we calculate the annualized volatility using the following formula:
$$VO = \sqrt{H} \, \hat{\sigma},$$
where $\hat{\sigma}$ is the standard deviation of the per-period portfolio returns and $H$ is the number of times the weights are adjusted each year. In our experiment, $H = 12$ for the monthly datasets, and $H$ equals the number of trading days per year for the daily datasets.
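The square-root-of-time scaling can be sketched as follows. The function name is hypothetical, and the sample standard deviation (n - 1 in the denominator) is an assumed convention.

```python
import math
import statistics


def annualized_volatility(returns, periods_per_year):
    """Per-period standard deviation scaled by sqrt(periods per year)."""
    return statistics.stdev(returns) * math.sqrt(periods_per_year)
```

For monthly returns the second argument would be 12; for daily data, a common convention (an assumption here, not a value stated in the text) is about 252 trading days.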
4.3 Competing Portfolios
To comprehensively assess the proposed method, we consider ten modern competing portfolios according to our literature review:
Equally-weighted portfolio (EW): EW is one of the classic strategies. It has been reported to outperform 14 sophisticated models across seven real-world datasets at monthly frequency [DeMiguel et al.2007]. Therefore, EW is the first benchmark algorithm for portfolio research.
Value-weighted portfolio (VW): VW is a passive strategy that mimics the market portfolio, so its volatility matches that of the market index. It is also an important benchmark strategy.
Mean-variance portfolio (MV): MV is one of our basic strategies based on Markowitz’s theory and has performed well in different markets and time spans.
Orthogonal Bandit portfolio (OBP): OBP builds multiple assets by constructing orthogonal portfolios. It then uses the upper confidence bound bandit framework to derive the optimal portfolio strategy, which combines passive and active investments according to a risk-adjusted reward function [Shen et al.2015].
Portfolio Blending via Thompson Sampling (TS-EM, TS-VM): This strategy applies Thompson sampling to portfolio selection, mixing EW and MV as TS-EM, and VW and MV as TS-VM [Shen and Wang2016].
Portfolio Selection via Subset Resampling (SSR): The SSR method estimates the parameters by resampling subsets of the original assets, and aggregates the portfolios constructed on the multiple subsets to obtain the portfolio of all assets [Shen and Wang2017].
EW, VW, and MV are three of the strategic arms of PBTS, so they should be compared with the hybrid model proposed in this paper. OBP, TS-EM, TS-VM, and SSR are the methods that inspired our model. They are well recognized as important portfolio strategies based on the exploration and exploitation problem. Therefore, to make the evaluation more convincing, we also compare with these four models.
4.4 Parameter Settings
We use the “rolling range” setting proposed by DeMiguel [DeMiguel et al.2007]. In regard to the model proposed in this paper, we set the sliding window as . For the parameter of the PBTS, we utilize cross validation to establish the optimal parameters. And for the parameters of other comparison algorithms, we use the parameter settings recommended in the relevant studies.
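The rolling evaluation scheme can be sketched as index bookkeeping: at each step the most recent `window` periods are used for estimation and the following period for out-of-sample evaluation. The function name and the tuple layout are hypothetical, and the window length is left as a parameter because its value is not given here.

```python
def rolling_windows(n_periods, window):
    """Yield (train_start, train_end, test_index) for a rolling backtest.

    Training uses periods train_start .. train_end - 1; the period at
    test_index is held out for out-of-sample evaluation.
    """
    for t in range(window, n_periods):
        yield t - window, t, t
```

With 5 periods and a window of 3, the scheme produces two evaluation steps, testing on periods 3 and 4.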
4.5 Results and Analysis
Results. Table 2 summarizes portfolio performance evaluated by SR, CW, MDD, and VO for all tested benchmarks. Values in bold mark the best-performing method in each comparison. The proposed PBTS method achieves better performance in most cases. For SR, PBTS ranks in the top tier on all datasets; it is slightly below EW on the ETFs dataset and below VW on the SP500 dataset. This indicates that PBTS generally has a better return-to-risk ratio. For the absolute return indicator, Figure 1 shows the change in cumulative wealth over time. PBTS outperforms the other methods on most datasets; it is below only MV and SSR on the FF25 dataset and below VW and OBP on the ETFs dataset. However, OBP and SSR fluctuate strongly on the other datasets and are less robust than PBTS. PBTS also performs better on the risk indicators. As summarized in Table 2, the MDD of PBTS is the smallest; the VO of PBTS is lower than that of EW on the ETFs dataset, where EW usually has the lowest VO among the classic strategies, and it is better than that of the other comparison methods on the remaining datasets.
Analysis. In summary, the results on the three long-term FF datasets indicate that PBTS performs outstandingly over the long term, which is consistent with its goal of achieving excellent long-term returns. However, in the medium-term, high-frequency setting of the ETFs and SP500 datasets, PBTS is not robust enough and depends on the performance of the basic strategies.
Parameter effect analysis. We analyze the performance of PBTS for different values of the top-$K$ parameter $K$, as shown in Figure 2. On the FF25 dataset, as $K$ becomes larger, the volatility decreases, but the cumulative wealth also decreases. This is consistent with our hypothesis that a larger $K$ corresponds to a lower risk preference: the user bears less risk, but the return is reduced.
5 Conclusions and Future Work
In this paper, we formulated the portfolio selection problem as a multi-armed bandit problem, in which we used the classic portfolio strategies as the strategic arms to form a dynamic portfolio strategy that adapts to different periods over multiple cycles. Moreover, we devised a reward function based on the user’s investment risk preference to judge each choice and selected the optimal arm in each period via Thompson sampling. Our algorithm balances return and risk well and achieves higher returns while controlling risk.
In future work, we will consider the correlation between the strategic arms and the impact of the previous selection path on the next choice. In addition, practical aspects of financial scenarios, such as transaction fees, taxes, and dividends, should be taken into account to build a portfolio strategy that is more consistent with real-world conditions.
- [Auer et al.2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
- [Brandt2010] Michael W Brandt. Portfolio choice problems. In Handbook of financial econometrics: Tools and techniques, pages 269–336. Elsevier, 2010.
- [Brinson et al.1995] Gary P Brinson, L Randolph Hood, and Gilbert L Beebower. Determinants of portfolio performance. Financial Analysts Journal, 51(1):133–138, 1995.
- [Brodén et al.2017] Björn Brodén, Mikael Hammar, Bengt J Nilsson, and Dimitris Paraschakis. Bandit algorithms for e-commerce recommender systems. In Proceedings of the Eleventh ACM Conference on Recommender Systems, pages 349–349. ACM, 2017.
- [Brodén et al.2018] Björn Brodén, Mikael Hammar, Bengt J Nilsson, and Dimitris Paraschakis. Ensemble recommendations via Thompson sampling: an experimental study within e-commerce. In 23rd International Conference on Intelligent User Interfaces, pages 19–29. ACM, 2018.
- [Brodie et al.2009] Joshua Brodie, Ingrid Daubechies, Christine De Mol, Domenico Giannone, and Ignace Loris. Sparse and stable Markowitz portfolios. Proceedings of the National Academy of Sciences, 106(30):12267–12272, 2009.
- [Chapelle and Li2011] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
- [DeMiguel et al.2007] Victor DeMiguel, Lorenzo Garlappi, and Raman Uppal. Optimal versus naive diversification: How inefficient is the 1/N portfolio strategy? The Review of Financial Studies, 22(5):1915–1953, 2007.
- [Fama and French1992] Eugene F Fama and Kenneth R French. The cross-section of expected stock returns. The Journal of Finance, 47(2):427–465, 1992.
- [Huang et al.2015] Dingjiang Huang, Yan Zhu, Bin Li, Shuigeng Zhou, and Steven CH Hoi. Semi-universal portfolios with transaction costs. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
- [Jiang et al.2017] Zhengyao Jiang, Dixing Xu, and Jinjun Liang. A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059, 2017.
- [Li et al.2010] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
- [Liang et al.2018] Zhipeng Liang, Hao Chen, Junhao Zhu, Kangkang Jiang, and Yanran Li. Adversarial deep reinforcement learning in portfolio management. arXiv preprint arXiv:1808.09940, 2018.
- [Markowitz1952] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.
- [Palmowski et al.2018] Zbigniew Palmowski, Łukasz Stettner, and Anna Sulima. Optimal portfolio selection in an Itô-Markov additive market. arXiv preprint arXiv:1806.03496, 2018.
- [Paolinelli and Arioli2019] Giovanni Paolinelli and Gianni Arioli. A model for stocks dynamics based on a non-Gaussian path integral. Physica A: Statistical Mechanics and its Applications, 517:499–514, 2019.
- [Russo and Van Roy2014] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
- [Sani et al.2012] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283, 2012.
- [Shen and Wang2016] Weiwei Shen and Jun Wang. Portfolio blending via Thompson sampling. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 1983–1989. AAAI Press, 2016.
- [Shen and Wang2017] Weiwei Shen and Jun Wang. Portfolio selection via subset resampling. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- [Shen et al.2014] Weiwei Shen, Jun Wang, and Shiqian Ma. Doubly regularized portfolio with risk minimization. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, pages 1286–1292. AAAI Press, 2014.
- [Shen et al.2015] Weiwei Shen, Jun Wang, Yu-Gang Jiang, and Hongyuan Zha. Portfolio choices with orthogonal bandit learning. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
- [Thompson1933] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
- [Wu et al.2016] Qingyun Wu, Huazheng Wang, Quanquan Gu, and Hongning Wang. Contextual bandits in a collaborative environment. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 529–538. ACM, 2016.