Universal Algorithm for Online Trading Based on the Method of Calibration
We present a universal method for algorithmic trading in the stock market which performs asymptotically at least as well as any stationary trading strategy that computes the investment at each step using a fixed function of the side information that belongs to a given RKHS (Reproducing Kernel Hilbert Space). Using a universal kernel, we extend this result to any continuous stationary strategy. In this learning process, a trader rationally chooses his gambles using predictions made by a randomized well-calibrated algorithm. Our strategy is based on Dawid’s notion of calibration with more general checking rules and on a modification of Kakade and Foster’s randomized rounding algorithm for computing well-calibrated forecasts. We combine the method of randomized calibration with Vovk’s method of defensive forecasting in RKHS. Unlike in statistical theory, no stochastic assumptions are made about the stock prices. Our empirical results on historical markets provide strong evidence that this type of technical trading can “beat the market” if transaction costs are ignored.
Keywords: algorithmic trading, asymptotic calibration, defensive forecasting, reproducing kernel Hilbert space, universal kernel, universal trading strategy, stationary trading strategy, side information
Predicting sequences is a key problem for machine learning, computational finance and statistics. These predictions can serve as a basis for developing efficient methods for playing financial games in the stock market.
The learning process proceeds as follows: observing a finite-state sequence given online, a forecaster assigns a subjective estimate to future states.
A minimal requirement for testing any prediction algorithm is that it should be calibrated (cf. Dawid 1982). Dawid gave an informal explanation of calibration for binary outcomes. Let a sequence $\omega_1, \omega_2, \dots, \omega_{n-1}$ of binary outcomes be observed by a forecaster whose task is to give a probability $p_n$ of a future event $\omega_n = 1$. In a typical example, $p_n$ is interpreted as a probability that it will rain. The forecaster is said to be well-calibrated if it rains as often as he leads us to expect. It should rain about $80\%$ of the days for which $p_n = 0.8$, and so on.
A more precise definition is as follows. Let $I(p)$ denote the characteristic function of a subinterval $I \subseteq [0,1]$, i.e., $I(p) = 1$ if $p \in I$ and $I(p) = 0$, otherwise. An infinite sequence of forecasts $p_1, p_2, \dots$ is calibrated for an infinite binary sequence of outcomes $\omega_1 \omega_2 \dots$ if for the characteristic function $I(p)$ of any subinterval of $[0,1]$ the calibration error tends to zero, i.e.,
$$\frac{1}{n} \sum_{i=1}^{n} I(p_i)(\omega_i - p_i) \to 0$$
as $n \to \infty$. The indicator function $I(p_i)$ determines some “checking rule” that selects indices $i$, where we compute the deviation between forecasts $p_i$ and outcomes $\omega_i$.
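The calibration error described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `calibration_error` and the half-open interval convention `[lo, hi)` are hypothetical choices, and the error is taken to be the sum of outcome minus forecast over the selected steps, divided by the total number of steps.

```python
def calibration_error(forecasts, outcomes, lo, hi):
    """Average of I(p_i) * (omega_i - p_i) over all n steps.

    I is the indicator of the subinterval [lo, hi) (an assumed convention);
    steps whose forecast falls outside the subinterval contribute zero.
    """
    n = len(forecasts)
    total = 0.0
    for p, omega in zip(forecasts, outcomes):
        if lo <= p < hi:  # the checking rule selects this step
            total += omega - p
    return total / n


# A forecaster announces 0.8 on ten days and it rains on eight of them.
forecasts = [0.8] * 10
calibrated_err = calibration_error(forecasts, [1] * 8 + [0] * 2, 0.75, 0.85)

# Same forecasts, but it rains every day: the forecaster underestimates rain.
biased_err = calibration_error(forecasts, [1] * 10, 0.75, 0.85)
```

In the first case the error is zero up to floating-point rounding; in the second it is about 0.2, the gap between the observed frequency and the announced probability.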
Foster and Vohra (1998) show that calibration is almost surely guaranteed with a randomizing forecasting rule, i.e., where the forecasts are chosen using internal randomization and the forecasts are hidden from the weather until the weather makes its decision whether to rain or not.
The origin of the calibration algorithms is the Blackwell (1956) approachability theorem but, as its drawback, the forecaster has to use linear programming to compute the forecasts. We modify and generalize a more computationally efficient method from Kakade and Foster (2004), where “an almost deterministic” randomized rounding universal forecasting algorithm is presented. For any sequence of outcomes $\omega_1, \omega_2, \dots$ and for any precision of rounding $\Delta > 0$, an observer can simply randomly round the deterministic forecast $p_i$ up to $\Delta$ to a random forecast $\tilde p_i$ in order to calibrate for this sequence with probability one:
$$\limsup_{n \to \infty} \left| \frac{1}{n} \sum_{i=1}^{n} I(\tilde p_i)(\omega_i - \tilde p_i) \right| \le \Delta,$$
where $I(p)$ is the characteristic function of any subinterval of $[0,1]$. This algorithm can be easily generalized such that the calibration error tends to zero as $n \to \infty$.
Kakade and Foster and others considered a finite outcome space and a probability distribution as the forecast. In this paper, the outcomes are real numbers from the unit interval $[0,1]$ and the forecast is a single real number (which can be an output of a random variable). This setting is closely related to Vovk's (2005a) defensive forecasting approach (see below).
In this case real-valued predictions could be interpreted as mean values of future outcomes under some probability distributions on $[0,1]$ that are unknown to us. We do not know the precise form of such distributions; we should predict only future means.
The well-known applications of the method of calibration belong to different fields of game theory and machine learning. Kakade and Foster proved that empirical frequencies of play in any normal-form game with finite strategy sets converge to a set of correlated equilibria if each player chooses his gamble as the best response to the well-calibrated forecasts of the gambles of other players. In a series of papers (Vovk et al. 2005, Vovk 2005a, Vovk 2006, Vovk 2006a, Vovk 2007), Vovk developed the method of calibration for the case of more general RKHS and Banach spaces. Vovk called his method defensive forecasting (DF). He also applied his method to recovering unknown functional dependencies presented by arbitrary functions from RKHS and Banach spaces. Chernov et al. (2010) show that well-calibrated forecasts can be used to compute predictions for the Vovk (1997) aggregating algorithm. In defensive forecasting, continuous loss (gain) functions are considered.
In this paper we present a new application of the method of calibration. We construct “a universal” strategy for algorithmic trading in the stock market which performs asymptotically at least as well as any not “too complex” trading strategy $D$. Technically, we are interested in the case where the trading strategy $D$ is assumed to belong to a large reproducing kernel Hilbert space (to be defined shortly) and the complexity of $D$ is measured by its norm. Using a universal kernel, we extend this result to any continuous stationary trading strategy. Our universal trading strategy is represented by a discontinuous function though it uses a randomization.
We first discuss some standard financial terminology. A trader in the stock market uses a strategy: going long, going short, or skipping the step. In finance, a long position in a security, such as a stock or a bond, or equivalently to be long in a security, means that the holder of the position owns the security and will profit if the price of the security goes up. Short selling (also known as shorting or going short) is the practice of selling securities or other financial instruments, with the intention of subsequently repurchasing them (“covering”) at a lower price.
In this paper, the problem of universal sequential investment in the stock market with side information is studied. We consider the method of trading that is called algorithmic trading or systematic quantitative trading in financial industrial applications, which means rule-based automatic trading strategies, usually implemented with computer-based trading systems.
The problem of algorithmic trading is considered in the machine learning framework, where algorithms adaptive to input data are designed and their performance is evaluated.
There are three common types of analysis for adaptive algorithms: average case analysis which requires a statistical model of input data; worst-case analysis which is non-informative because, for any trading algorithm, we can present a sequence of stock prices moving in the direction opposite to the trader’s decisions; competitive analysis which is popular in the prediction with expert advice framework.
A non-traditional objective (in computational finance) is to develop algorithmic trading strategies that are in some sense always guaranteed to perform well. In competitive analysis, the performance of an algorithm is measured relative to any trading algorithm from a broad class. We only ask that an algorithm performs well relative to the difficulty of classifying the input data. Given a particular performance measure, an adaptive algorithm is strongly competitive with a class of trading algorithms if it achieves the minimum possible regret over all input sequences. Unlike in statistical theory, no stochastic assumptions are made about the stock prices.
This line of research in finance was pioneered by Cover (see Cover and Gluss 1986, Cover 1991, Cover and Ordentlich 1996) who designed universal portfolio selection algorithms that can provably do well (in terms of their total return) with respect to some adaptive online or offline benchmark algorithms. Such algorithms are called universal.
We consider the simplest case: algorithmic trading with only one stock. Our results can be generalized to the case of several stocks and to dynamical portfolio hedging in the sense of the framework proposed by Cover and Ordentlich (1996).
We consider a game with two players: Stock Market and Trader. At the beginning of each round $i$ Trader is shown an object $x_i$ which contains side information. Past prices $S_1, \dots, S_{i-1}$ of the stock are also given to Trader (they can be considered as a part of the side information). Using this information, Trader announces a number $M_i$ of shares of the stock he wants to purchase at $S_{i-1}$ each. At the end of the round Stock Market announces the price $S_i$ of the stock, and Trader receives his gain or suffers loss $M_i(S_i - S_{i-1})$ for round $i$. The total gain or loss for the first $n$ rounds is equal to $\sum_{i=1}^{n} M_i(S_i - S_{i-1})$.
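A minimal sketch of the bookkeeping in this game, assuming that the gain of a round equals the number of shares held times the price change over that round. The names `total_gain`, `prices` and `shares` are hypothetical, and the price series is made up for illustration.

```python
def total_gain(prices, shares):
    """Total gain of Trader over n rounds.

    prices: [S_0, S_1, ..., S_n], the price before round 1 and after each round.
    shares: [M_1, ..., M_n], shares bought (positive) or sold short (negative)
            at the previous price in each round.
    Assumes the round gain is M_i * (S_i - S_{i-1}).
    """
    if len(prices) != len(shares) + 1:
        raise ValueError("need exactly one more price than share decisions")
    return sum(m * (prices[i + 1] - prices[i]) for i, m in enumerate(shares))


prices = [10.0, 11.0, 10.5, 12.0]             # illustrative data only
active_gain = total_gain(prices, [1, -1, 1])  # long, short, long
hold_gain = total_gain(prices, [1, 1, 1])     # buy-and-hold one share
```

On this toy series the trader who shorts the falling round gains 3.0, while holding one share throughout gains 2.0.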
We show that, using the well-calibrated forecasts, it is possible to construct a universal strategy for algorithmic trading in the stock market which performs asymptotically at least as well as any stationary trading strategy presented by a continuous function $D(x_i)$ of the object $x_i$. This universal trading strategy is of decision type: we buy or sell only one share of the stock at each round. The learning process is the most traditional one. At each step, Trader makes a randomized prediction $\tilde p_i$ of a future price $S_i$ of the stock and takes “the best response” to this prediction. He chooses a strategy of going long, $\tilde M_i = 1$, if $\tilde p_i > \tilde S_{i-1}$, or of going short, $\tilde M_i = -1$, otherwise, where $\tilde S_{i-1}$ is the randomized past price of the stock. Trader uses some randomized algorithm for computing the well-calibrated forecasts $\tilde p_i$.
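The best-response rule described above can be sketched as follows. This is a simplified illustration, not the strategy of the paper: the forecasts are placeholder numbers rather than well-calibrated ones, the randomization of forecasts and past prices is omitted, and the names `best_response` and `run_decisions` are hypothetical.

```python
def best_response(forecast, past_price):
    """Go long (+1) if the forecast exceeds the past price, otherwise go short (-1)."""
    return 1 if forecast > past_price else -1


def run_decisions(prices, forecasts):
    """Apply the decision rule round by round.

    prices:    [S_0, ..., S_n]
    forecasts: predictions of S_1, ..., S_n, each made before its round.
    Returns the positions and the resulting gain from one share per round.
    """
    positions = [best_response(p, prices[i]) for i, p in enumerate(forecasts)]
    gain = sum(m * (prices[i + 1] - prices[i]) for i, m in enumerate(positions))
    return positions, gain


prices = [0.50, 0.55, 0.52, 0.60]   # illustrative prices scaled to [0, 1]
forecasts = [0.56, 0.50, 0.58]      # placeholder forecasts, not calibrated
positions, gain = run_decisions(prices, forecasts)
```

With these placeholder forecasts the rule goes long, short, long, and collects the absolute price change in every round.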
Therefore, our universal strategy uses some internal randomization.
Trader M can buy or sell only one share of the stock. Therefore, in order to compare the performance of the traders we have to standardize the strategy of Trader D. We use the norm $\|D\|_\infty = \sup_{x} |D(x)|$ and a normalization factor $\|D\|_\infty^{-1}$, where $D$ is a continuous function. Our main results, Theorems 4 and 5 (Section 4) and Theorem 7 (Section 5), say that this trading strategy performs asymptotically at least as well as any stationary trading strategy presented by a continuous function $D$. With probability one, the gain of this trading strategy is asymptotically not less than the average gain of any stationary trading strategy from one share of the stock:
$$\liminf_{n \to \infty} \frac{1}{n} \left( \sum_{i=1}^{n} \tilde M_i (S_i - S_{i-1}) - \|D\|_\infty^{-1} \sum_{i=1}^{n} D(x_i)(S_i - S_{i-1}) \right) \ge 0, \tag{2}$$
where $x_i$ is the side information used by the stationary trading strategy $D$ at step $i$.
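Evidently, the requirement (2) for all continuous $D$ is equivalent to the requirement
$$\liminf_{n \to \infty} \frac{1}{n} \left( \sum_{i=1}^{n} \tilde M_i (S_i - S_{i-1}) - \sum_{i=1}^{n} D(x_i)(S_i - S_{i-1}) \right) \ge 0$$
for all continuous $D$ such that $\|D\|_\infty \le 1$.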
To achieve this goal we extend in Theorem 1 (Section 3) Kakade and Foster’s forecasting algorithm to the case of arbitrary real-valued outcomes and to a more general notion of calibration with changing parameterized checking rules. We combine it with the defensive forecasting method in RKHS of Vovk et al. (2005) (see Vovk 2005a). In Section 5, using a universal kernel, we generalize this result to any continuous stationary trading strategy. We show in Section 6 that the universality property fails if we consider discontinuous trading strategies. On the other hand, we show in Theorem 9 that a universal trading strategy exists for a class of randomized discontinuous trading strategies.
In Section 7 results of numerical experiments are presented. Our empirical results on historical markets provide strong evidence that this type of algorithmic trading can beat the market: in our experiments, our universal strategy is always better than the “buy-and-hold” strategy for each stock chosen arbitrarily in the stock market. This strategy also outperforms an algorithmic trading strategy using a standard prediction algorithm (ARMA).
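By a kernel function on a set $X$ we mean any function $K(x, x')$ which can be represented as a dot product $K(x, x') = (\Phi(x) \cdot \Phi(x'))$, where $\Phi$ is a mapping from $X$ to some Hilbert feature space.
The reproducing kernels are of special interest. A Hilbert space $\mathcal{F}$ of real-valued functions on a compact metric space $X$ is called an RKHS (Reproducing Kernel Hilbert Space) on $X$ if the evaluation functional $f \mapsto f(x)$ is continuous for each $x \in X$. Let $\|\cdot\|_{\mathcal{F}}$ be the norm in $\mathcal{F}$ and $c_{\mathcal{F}}(x) = \sup_{\|f\|_{\mathcal{F}} \le 1} |f(x)|$. The embedding constant of $\mathcal{F}$ is defined as $c_{\mathcal{F}} = \sup_{x} c_{\mathcal{F}}(x)$. We consider RKHS with $c_{\mathcal{F}} < \infty$.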
Let for . An example of RKHS is the Sobolev space , which consists of absolutely continuous functions with , where For this space, (see Vovk 2005a).
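Let $\mathcal{F}$ be an RKHS on $X$ with the dot product $(f \cdot g)$ for $f, g \in \mathcal{F}$. By the Riesz representation theorem, for each $x \in X$ there exists $k_x \in \mathcal{F}$ such that $f(x) = (k_x \cdot f)$.
The reproducing kernel is defined as $K(x, x') = (k_x \cdot k_{x'})$. The main properties of the kernel are: 1) $K(x, x') = K(x', x)$ for all $x, x' \in X$ (symmetry property); 2) $\sum_{i,j=1}^{k} \alpha_i \alpha_j K(x_i, x_j) \ge 0$ for all $k$, for all $x_1, \dots, x_k \in X$, and for all real numbers $\alpha_1, \dots, \alpha_k$, where $k = 1, 2, \dots$ (positive semidefinite property).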
Conversely, a kernel defines RKHS: any symmetric, positive semidefinite kernel function defines some canonical RKHS and a mapping such that . Also, . The mapping is also called “feature map” (see Cristianini and Shawe-Taylor 2000, Chapter 3).
A function $f$ is induced by a kernel $K$ if there exists an element $g$ of the feature space such that $f(x) = (g \cdot \Phi(x))$ for all $x$. This definition is independent of the map $\Phi$. For any continuous kernel $K$, every induced function is continuous (see Steinwart 2001); it is even Lipschitz continuous with respect to some semimetric induced by the feature map (Steinwart 2001, Lemma 3). In what follows we consider continuous kernels. Therefore, all functions from the canonical RKHS are continuous.
For the Sobolev space on $[0,1]$, the reproducing kernel is
$$K(t, t') = \frac{\cosh \min(t, t') \, \cosh \min(1 - t, 1 - t')}{\sinh 1}$$
(see Vovk 2005a).
Well-known examples of kernels on : Gaussian kernel , where is the Euclidean norm; , where and .
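As a numerical illustration of the symmetry and positive semidefinite properties, the following Python sketch builds the Gram matrix of a Gaussian kernel on a few random points. The width convention $\exp(-\|x - x'\|^2 / \sigma^2)$ is an assumption (other normalizations are common), and all function names are hypothetical.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / sigma^2) (assumed width convention)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / sigma ** 2)


def gram_matrix(points, kernel):
    """Matrix of all pairwise kernel values K(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]


def quadratic_form(gram, alpha):
    """sum_{i,j} alpha_i * alpha_j * K(x_i, x_j)."""
    n = len(alpha)
    return sum(alpha[i] * alpha[j] * gram[i][j] for i in range(n) for j in range(n))


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(6)]
gram = gram_matrix(points, gaussian_kernel)

# Evaluate the quadratic form for several random coefficient vectors.
forms = [quadratic_form(gram, [rng.uniform(-1, 1) for _ in points]) for _ in range(50)]
```

Every value in `forms` should be nonnegative up to rounding, which is a spot check rather than a proof of positive semidefiniteness.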
For other examples and details of kernel theory, see Schölkopf and Smola (2002).
A special kernel corresponds to the method of randomization defined below. A random variable $\tilde y$ is called a randomization of a real number $y$ if $E(\tilde y) = y$, where $E$ is the symbol of mathematical expectation with respect to the corresponding probability distribution.
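We use a specific method of randomization of real numbers from the unit interval proposed by Kakade and Foster (2004). Given a positive integer $K$, divide the interval $[0,1]$ into subintervals of length $\Delta = 1/K$ with rational endpoints $v_i = i\Delta$, where $i = 0, 1, \dots, K$. Let $V$ denote the set of these points. Any number $p \in [0,1]$ can be represented as a linear combination of two neighboring endpoints of $V$ defining the subinterval containing $p$:
$$p = \sum_{v \in V} w_v(p)\, v = w_{v_{i-1}}(p)\, v_{i-1} + w_{v_i}(p)\, v_i, \tag{3}$$
where $p \in [v_{i-1}, v_i]$, $i = \lfloor p/\Delta \rfloor + 1$, $w_{v_{i-1}}(p) = 1 - (p - v_{i-1})/\Delta$, and $w_{v_i}(p) = 1 - (v_i - p)/\Delta$. Define $w_v(p) = 0$ for all other $v \in V$. Define a random variable
$$\tilde p = \begin{cases} v_{i-1} & \text{with probability } w_{v_{i-1}}(p), \\ v_i & \text{with probability } w_{v_i}(p). \end{cases}$$
Let $\bar w(p) = (w_v(p) : v \in V)$ be a vector of probabilities of rounding.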
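For any $k$-dimensional vector $\bar x = (x_1, \dots, x_k) \in [0,1]^k$, we round each coordinate $x_s$, $s = 1, \dots, k$, to $v_{j_s - 1}$ with probability $w_{v_{j_s - 1}}(x_s)$ and to $v_{j_s}$ with probability $w_{v_{j_s}}(x_s)$, where $x_s \in [v_{j_s - 1}, v_{j_s}]$. Let $\tilde x$ be the corresponding random vector.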
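The randomized rounding described above can be sketched in Python as follows. This is a minimal illustration, assuming a uniform grid with $K$ subintervals of length $1/K$ and rounding probabilities proportional to the distance from the opposite endpoint, so that the expectation of the rounded value equals the original number. The function names are hypothetical.

```python
import random


def rounding_weights(p, K):
    """Return (v_lo, v_hi, w_lo, w_hi) for p in [0, 1] on a grid of K subintervals.

    v_lo, v_hi are the neighboring grid points around p and w_lo + w_hi = 1,
    chosen so that w_lo * v_lo + w_hi * v_hi = p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    i = min(int(p * K), K - 1)              # index of the subinterval containing p
    v_lo, v_hi = i / K, (i + 1) / K
    w_hi = min(1.0, max(0.0, (p - v_lo) * K))  # clamp against rounding noise
    return v_lo, v_hi, 1.0 - w_hi, w_hi


def randomized_round(p, K, rng):
    """Round p to one of its two neighboring grid points at random."""
    v_lo, v_hi, w_lo, _ = rounding_weights(p, K)
    return v_lo if rng.random() < w_lo else v_hi


def randomized_round_vector(x, K, rng):
    """Round every coordinate of a vector independently."""
    return [randomized_round(c, K, rng) for c in x]


rng = random.Random(1)
v_lo, v_hi, w_lo, w_hi = rounding_weights(0.37, 10)
samples = [randomized_round(0.37, 10, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
rounded_vec = randomized_round_vector([0.12, 0.5, 0.99], 10, rng)
```

For `p = 0.37` and `K = 10` the only possible outputs are 0.3 and 0.4, and the sample mean stays close to 0.37, which is the defining property of a randomization.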
Let and . For any , let be a vector of probability distribution in : . For , the dot product is the symmetric positive semidefinite kernel function.
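To illustrate why a dot product of rounding-probability vectors is a symmetric positive semidefinite kernel, the sketch below treats only the scalar case: each number in $[0,1]$ is mapped to its vector of rounding probabilities over the grid, and the kernel is the ordinary dot product of two such vectors. This is a simplification of the construction in the text (which also involves the information vector), and the names `weight_vector` and `rounding_kernel` are hypothetical.

```python
import random


def weight_vector(p, K):
    """Rounding probabilities of p over the grid 0, 1/K, ..., 1 (at most two nonzero)."""
    w = [0.0] * (K + 1)
    i = min(int(p * K), K - 1)
    w_hi = min(1.0, max(0.0, p * K - i))
    w[i] = 1.0 - w_hi
    w[i + 1] = w_hi
    return w


def rounding_kernel(p, q, K):
    """Dot product of the two weight vectors; a kernel because it is a feature-map dot product."""
    return sum(a * b for a, b in zip(weight_vector(p, K), weight_vector(q, K)))


K = 10
rng = random.Random(2)
points = [rng.random() for _ in range(8)]
gram = [[rounding_kernel(p, q, K) for q in points] for p in points]

# Quadratic forms equal the squared norm of a weighted sum of feature vectors.
forms = []
for _ in range(50):
    alpha = [rng.uniform(-1, 1) for _ in points]
    forms.append(sum(alpha[i] * alpha[j] * gram[i][j]
                     for i in range(8) for j in range(8)))
```

The kernel vanishes for numbers that share no neighboring grid point, which is what makes it local at the scale of the rounding precision.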
3 Well-calibrated forecasting with side information
A universal trading strategy, which will be defined in Section 4, is based on the well-calibrated forecasts of stock prices. In this section we present a randomized algorithm for computing well-calibrated forecasts using side information.
A standard way to present any forecasting process is the game-theoretic protocol. The basic online prediction protocol has two players, Reality and Predictor (see Fig. 1).
At the beginning of each step $i$, Predictor is given some data $x_i$ relevant to predicting the following outcome $y_i$. We call $x_i$ a signal or side information. Signals are taken from the object space.
The outcomes are taken from an outcome space and predictions are taken from a prediction space. In this paper an outcome $y_i$ is a real number from the unit interval $[0,1]$ and a forecast $p_i$ is a single number from this interval (which can be the output of a random variable). We could interpret the forecast $p_i$ as the mean value of a future outcome under some probability distribution on $[0,1]$ that is unknown to us.
Reality is called oblivious if an infinite sequence of outcomes and signals is defined before the game starts and Reality only reveals their next value at each step $i$. In this case the outcomes and signals do not depend on past predictions. In the case of non-oblivious Reality this sequence is not fixed in advance and any next value can be the output of some measurable function of previous moves of Predictor, i.e., of past predictions $p_1, \dots, p_{i-1}$.
In what follows we compare two types of forecasting algorithms: randomized algorithms, which we will construct, and stationary forecasting strategies, which are continuous functions from some RKHS using side information as input. We consider two types of predictors, C and D, playing according to the basic prediction protocol presented in Fig. 1.
This protocol is perfect-information for Predictor C. This means that Predictor C can use other players' moves so far. Past outcomes and predictions are also known to Reality in the perfect-information protocol.
Predictor D can use only the signal $x_i$ that is given at the beginning of any step $i$. Predictor D uses a stationary prediction strategy $D(x_i)$, where $D$ is a function whose input is the signal and output is the number of shares. We suppose that $x_i$ is a real number from the unit interval. The number $x_i$ can encode any information. For example, it can be past outcomes and signals and even the future outcome $y_i$.
Predictor C uses a randomized strategy which we will define below. We collect all information used for the internal randomization in a vector . This vector can contain any information known before the move of Predictor C at step : the signal , past outcomes and so on.
In general, we suppose that is a vector of dimension : . We call it an information vector and assume that some method for computing information vectors given past outcomes and signals is fixed.
We use the tests of calibration to measure the discrepancy between predictions and outcomes. These tests use the checking rules. We consider checking rules of more general type than that used in the literature on asymptotic calibration.
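For any subset $S \subseteq [0,1]^{k+1}$, define the checking rule that is an indicator function:
$$I_S(p, \bar x) = \begin{cases} 1 & \text{if } (p, \bar x) \in S, \\ 0 & \text{otherwise}, \end{cases}$$
where $\bar x$ is a $k$-dimensional vector.
In the online prediction protocol defined in Fig. 1, given $\Delta > 0$, a sequence of forecasts $p_1, p_2, \dots$ is called $\Delta$-calibrated for sequences of outcomes $y_1, y_2, \dots$ and information vectors $\bar x_1, \bar x_2, \dots$ if for any subset $S \subseteq [0,1]^{k+1}$ the following asymptotic inequality holds:
$$\limsup_{n \to \infty} \left| \frac{1}{n} \sum_{i=1}^{n} I_S(p_i, \bar x_i)(y_i - p_i) \right| \le \Delta.$$
The sequence of forecasts is called well-calibrated if, for any subset $S \subseteq [0,1]^{k+1}$,
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I_S(p_i, \bar x_i)(y_i - p_i) = 0. \tag{4}$$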
If Reality is non-oblivious and acts adversarially, then, as shown by Oakes (1985) and Dawid (1985), any deterministic forecasting algorithm will not always be calibrated. In case where , Reality can define its outcomes by the rule:
$$y_i = \begin{cases} 1 & \text{if } p_i < \tfrac{1}{2}, \\ 0 & \text{otherwise}. \end{cases}$$
Then any sequence of forecasts $p_1, p_2, \dots$ will not be calibrated for the sequence of such outcomes $y_1, y_2, \dots$. It is easy to verify that the condition (4) fails for $S = [0, \tfrac{1}{2})$ or for $S = [\tfrac{1}{2}, 1]$.
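The failure of deterministic forecasting against an adversarial Reality can be checked numerically. The sketch below assumes the classical rule in which Reality answers 1 whenever the forecast is below 1/2 and 0 otherwise, and it measures the calibration error separately on forecasts below 1/2 and on forecasts of at least 1/2. The two example forecasters and all function names are hypothetical.

```python
def adversarial_outcome(p):
    """Reality's reply: 1 if the forecast is below 1/2, otherwise 0 (assumed rule)."""
    return 1 if p < 0.5 else 0


def play(forecaster, n):
    """Run n rounds; the deterministic forecaster sees all past forecasts and outcomes."""
    forecasts, outcomes = [], []
    for _ in range(n):
        p = forecaster(forecasts, outcomes)
        forecasts.append(p)
        outcomes.append(adversarial_outcome(p))
    return forecasts, outcomes


def split_calibration_errors(forecasts, outcomes):
    """Calibration errors for the checking rules [0, 1/2) and [1/2, 1]."""
    n = len(forecasts)
    low = sum(y - p for p, y in zip(forecasts, outcomes) if p < 0.5) / n
    high = sum(y - p for p, y in zip(forecasts, outcomes) if p >= 0.5) / n
    return low, high


def running_mean(forecasts, outcomes):
    """Forecast the empirical frequency of past outcomes (1/2 before any data)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.5


def constant(forecasts, outcomes):
    """Always forecast 0.3."""
    return 0.3


mean_low, mean_high = split_calibration_errors(*play(running_mean, 1000))
const_low, const_high = split_calibration_errors(*play(constant, 1000))
```

Each selected step contributes at least 1/2 in absolute value to its own sum, so at least one of the two errors stays at or above 1/4 in absolute value no matter how long the game runs.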
Following the method of Foster and Vohra (1998), at each step $n$, using the past outcomes $y_1, \dots, y_{n-1}$, we will define a deterministic forecast $p_n$ and randomize it to a random variable $\tilde p_n$ using the method of randomization defined in Section 2. We also randomize the information vector $\bar x_n$ to a random vector $\tilde x_n$. We call this sequential randomization.
This sequential randomization generates for any a probability distribution on the set of all finite sequences of forecasts and information vectors. In the case of oblivious Reality this is simply the product distribution, which in turn generates the overall probability distribution on the set of all infinite trajectories . In the case of non-oblivious Reality, at any step , a probability distribution on exists such that the corresponding method of randomization of is defined as a conditional distribution on . The overall probability distribution on the set of all infinite trajectories generating these can be defined by the Ionescu–Tulcea theorem (see Shiryaev 1980).
The following theorem on calibration with side information is the main tool for the analysis presented in Sections 4 and 6. We will show that for any subset , with -probability 1, the equality (4) is valid, where and are replaced by their randomized variants and .
Theorem 1. In the prediction protocol defined in Fig. 1, let be a sequence of outcomes and be the corresponding sequence of signals given online. We assume that a sequence of the information vectors is also defined online.
Let also, be an RKHS on with a kernel and a finite embedding constant .
For any , an algorithm for computing forecasts and a sequential method of randomization can be constructed such that the following three items hold:
For any , , and , with probability at least ,
where are the corresponding randomizations of and are the corresponding randomizations of -dimensional information vectors ;
For any and ,
where are signals.
For any , with probability 1,
Proof. At first, in Proposition 2 (below), given $\Delta > 0$, we modify the randomized rounding algorithm of Kakade and Foster (2004) to construct a $\Delta$-calibrated forecasting algorithm, and combine it with Vovk's (2005a) defensive forecasting algorithm. After that, we revise it, letting $\Delta \to 0$, so that (5) holds.
Proof. We define a deterministic forecast and after that we randomize it.
The partition and probabilities of rounding were defined above by (3). In what follows we round some deterministic forecast to with probability and to with probability . We also round each coordinate , , of the information vector to with probability and to with probability , where .
Let , where and , , , and be a vector of probability distribution in . Define the corresponding kernel .
Let the deterministic forecasts be already defined (put ). We want to define a deterministic forecast .
The kernel can be represented as a dot product in some feature space: . Consider
The following lemma presents a general method for computing the deterministic forecasts.
for all .
Lemma 3 (Vovk et al. 2005). A sequence of forecasts can be computed such that for all .
Proof. By definition the function is continuous in . The needed forecast is computed as follows. If for all then define ; if for all then define . Otherwise, define to be a root of the equation (some root exists by the intermediate value theorem). Evidently, for all . Lemma is proved.
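The root-finding step of this proof can be sketched with bisection. The function whose root is sought is not recoverable from the text here, so the sketch takes an arbitrary continuous function `f` on $[0,1]$ as a hypothetical argument. It is also a variant rather than the procedure of the lemma: instead of testing the sign of `f` on the whole interval, it inspects only the endpoints, which still yields a forecast $p$ with $(y - p) f(p) \le 0$ for every outcome $y \in [0,1]$, up to the bisection tolerance.

```python
def defensive_forecast(f, tol=1e-12):
    """Return p in [0, 1] such that (y - p) * f(p) <= 0 for all y in [0, 1].

    f must be continuous on [0, 1]. Endpoint variant of the argument:
      * f(1) > 0  ->  p = 1, since y - 1 <= 0;
      * f(0) < 0  ->  p = 0, since y - 0 >= 0;
      * otherwise f(0) >= 0 >= f(1), so a root exists by the intermediate
        value theorem and is located by bisection (approximately).
    """
    if f(1.0) > 0:
        return 1.0
    if f(0.0) < 0:
        return 0.0
    lo, hi = 0.0, 1.0            # invariant: f(lo) >= 0 >= f(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) >= 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


p_root = defensive_forecast(lambda p: 0.3 - p)    # sign change inside the interval
p_high = defensive_forecast(lambda p: 1.0 + p)    # positive everywhere
p_low = defensive_forecast(lambda p: -1.0 - p)    # negative everywhere
```

Checking only the endpoints avoids having to certify the sign of `f` on the whole interval, at the cost of sometimes returning an endpoint when an interior root also exists.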
Now we continue the proof of the proposition.
Let forecasts be computed by the method of Lemma 3. Then for any ,
Since for all and
the subtracted sum of (10) is upper bounded by .
Since and for all , the subtracted sum of (11) is upper bounded by . As a result we obtain
for all . Let us denote By (12), for all .
Let . By definition for any ,
Insert the term in the sum (14), where is the characteristic function of an arbitrary set , sum over , and exchange the order of summation. Using the Cauchy–Schwarz inequality for vectors , and the Euclidean norm, we obtain
for all , where is the cardinality of the partition.
Let be a random variable taking values with probabilities (only two of them are nonzero). Recall that is a random variable taking values with probabilities .
Let and be its indicator function. For any , the mathematical expectation of a random variable is equal to
where . By Azuma–Hoeffding inequality (see (28) below), for any and , with -probability ,
By definition of the deterministic forecast
for all .
Now we turn to the proof of Theorem 1.
In what follows we use the upper bound in (18).
To prove the bound (5) choose a monotonic sequence of rational numbers such that as . We also define an increasing sequence of positive integer numbers For any , we use for randomization on steps the partition of on subintervals of length .
We start our sequences from and . Also, define the numbers such that the inequality
holds for all