Ensemble Committees for Stock Return Classification and Prediction
Abstract
This paper considers a portfolio trading strategy formulated by algorithms in the field of machine learning. The profitability of the strategy is measured by the algorithm’s capability to consistently and accurately identify stock indices with positive or negative returns, and to generate a preferred portfolio allocation on the basis of a learned model. Stocks are characterized by time series data sets consisting of technical variables that reflect market conditions in a previous time interval, which are utilized to produce binary classification decisions in subsequent intervals. The learned model is constructed as a committee of random forest classifiers, a nonlinear support vector machine classifier, a relevance vector machine classifier, and a constituent ensemble of nearest neighbors classifiers. This selection of algorithms is appealing for two reasons: first, there is strikingly little research in economic time-series forecasting that employs learners beyond neural networks and clustering algorithms, and this construction offers a viable alternative; second, this selection incorporates an array of techniques that have both theoretically optimal classification properties and high empirical success rates in areas outside of finance, in addition to offering a mixture of parametric and nonparametric models. The ensemble committee is augmented by a boosting meta-algorithm and feature selection is performed by a supervised Relief algorithm. The Global Industry Classification Standard (GICS) is used to explore the ensemble model’s efficacy within the context of various fields of investment including Energy, Materials, Financials, and Information Technology. Data from 2006 to 2012, inclusive, are considered, which are chosen for providing a range of market circumstances for evaluating the model. The model is observed to achieve an accuracy of approximately 70% when predicting stock price returns three months in advance.
1 Introduction
It is crucial, in this modern era of financial uncertainty, to explore and understand methodologies for effectively predicting future outcomes given a historical record. In particular, the motivating question behind this work is whether there exists an appropriate selection of technical explanatory variables that will produce high-accuracy stock price return predictions. Our chosen approach to this complex financial forecasting problem is to train an ensemble of classification models on a subset of labeled financial data, where each example is assigned to one of two classes according to whether there was a positive or negative shift in stock price from an initial time to a subsequent time.
This work offers two important contributions to the field of financial forecasting. The first is pursuant to the recommendation for future research offered in Huerta et al. [4]: the learning model is constructed such that feature selection is conducted a priori and is a fully automated process; this allows the model itself to “discover” which parameters it believes are important to effective prediction rather than being forced to accept human-designated explanatory variables. This gives the ensemble an attribute of adaptability in the sense that it may reformulate its relevant parameters according to the GICS sector or according to location in time. Second, the construction of the ensemble incorporates a probabilistic ranking component that approximates a level of confidence associated with each prediction. Within the financial literature, it is typically the case that portfolios are formulated according to a scoring function that is intended to capture the differences between desirable and undesirable stocks [10]. The capability of the ensemble is similar in concept, yet rather than providing an absolute hierarchy of stock preferences, a probabilistic measure of the desirability of the stock is returned. In uncertain games such as portfolio investment, there are clear benefits to possessing a probabilistic confidence criterion.
2 Description of the Learning Algorithms
This section provides a brief introduction to the classification algorithms incorporated into the ensemble. It is intended for readers with little familiarity with learning algorithms, and aims merely to introduce the constituent models and to identify how they form a cohesive whole. Readers with some expertise in machine learning may proceed to subsequent sections of the paper without loss of understanding. The constituent algorithms are presented in the order (1) random forest classifier, (2) nonlinear support vector machine classifier, (3) relevance vector machine classifier, and (4) ensemble of nearest neighbors classifiers. The meta-algorithm of “boosting” is also formalized here.
In the field of financial trading it is of some urgency to construct models that may be learned and deployed efficiently, yet must also be relatively robust to the inherently stochastic nature of stock returns. Using these crucial ideas, we can motivate an ensemble of this form by referring to the differences between parametric and nonparametric algorithms within machine learning. Parametric models are useful in the sense that they are fast to learn and deploy, but typically make strong assumptions about the distribution of the data. Nonparametric models avoid almost all prior suppositions about the data, but the complexity of nonparametric learning algorithms tends to be larger than that for parameterized learners. Because random forests and nearest neighbors classifiers are nonparametric learners, and support vector machines and relevance vector machines are parameterized, the ensemble model presented here benefits from the advantages of both styles, yet still maintains the ability to deploy the constituent models individually.
2.1 Random Forest
A random forest classifier is itself a learned ensemble of decision trees such that the constituent learners are “decorrelated” by growing each tree on a randomly chosen subset of all data vectors and all features. This ensemble results in a decision function of the form,
(1) $f(x) = \operatorname{majority\ vote}\,\{T_b(x)\}_{b=1}^{B}$
where $T_b(x)$ is the decision function of the $b$-th tree in the ensemble of $B$ trees. In growing each tree, the key idea is to minimize the impurity of the training data in the nodes resulting from the candidate splits. There are several measures of such impurity, including the misclassification rate and entropy; however, we elect here to use the Gini index, which is given by,
(2) $G = \sum_{k \in \mathcal{K}} p_k (1 - p_k)$
where $\mathcal{K}$ is the set of potential class labels, $p_k$ is the proportion of training points in the node that belong to class $k$, and in the binary setting we have that $\mathcal{K} = \{-1, +1\}$. It can be shown that minimizing the Gini index at each node split is equivalent to minimizing the expected error rate [9]. This particular impurity metric is chosen for the decision trees because it offers something of a compromise between entropy and absolute error rate in terms of its sensitivity to class probabilities.
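To make the split criterion concrete, the following is a minimal sketch of how a single node split could be scored with the Gini index. It is not the implementation used in this work: the function names `gini`, `split_impurity`, and `best_threshold` are hypothetical, and the size-weighted average over the two child nodes and the midpoint candidate thresholds are common conventions assumed here for illustration.

```python
from collections import Counter
from typing import Sequence


def gini(labels: Sequence[int]) -> float:
    """Gini index sum_k p_k * (1 - p_k) over the class proportions in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1.0 - c / n) for c in Counter(labels).values())


def split_impurity(values: Sequence[float], labels: Sequence[int],
                   threshold: float) -> float:
    """Size-weighted Gini index of the two children of a threshold split
    (the weighting by child size is an assumed, commonly used convention)."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


def best_threshold(values: Sequence[float], labels: Sequence[int]) -> float:
    """Pick the midpoint between consecutive distinct values with lowest impurity.
    Requires at least two distinct feature values."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))


if __name__ == "__main__":
    feature = [0.1, 0.2, 0.8, 0.9]   # one explanatory variable
    returns = [-1, -1, +1, +1]       # negative / positive return labels
    print(gini(returns), best_threshold(feature, returns))
```

A perfectly mixed binary node has a Gini index of 0.5 and a pure node has 0, so a tree grown this way prefers splits whose children are each dominated by one return class.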
2.2 Nonlinear Support Vector Machine
The support vector machine (hereafter SVM) has remained a popular choice in machine learning classification problems and in financial prediction in particular [4, 6, 7]. This is almost certainly because of the SVM’s elegant mathematical foundations and relative ease of implementation. The functional form of the SVM resembles,
(3) $f(x) = \operatorname{sign}\left( \sum_{i=1}^{N_s} \alpha_i y_i K(x_i, x) + b \right)$
The constituent parameters of this formulation are described as follows,
$\alpha_i$: These are parameters that determine the shape of the separating hyperplane. After applying the kernel function, the data are ideally linearly separable in the feature space, which suggests why a linear construction of the hyperplane is appropriate. Learning the optimal values of the $\alpha_i$ is achieved by solving a quadratic program, a detailed description of which may be found in any introductory text on machine learning.
$N_s$: This is the number of support vectors that are identified in the optimization process used to train the SVM. The $\alpha_i$ for non-support-vector data points are zero, which is why it is only necessary to consider the support vectors in evaluating the summation.
$y_i$: These are the class labels of the support vectors, which all exist in the set $\{-1, +1\}$.
$K(x_i, x)$: The kernel function, which is essentially a distance augmentation. Many choices exist for the actual functional form of the kernel, but in this project a radial basis function (RBF) is used. The radial basis function resembles,
(4) $K(x_i, x) = \exp\left( -\gamma \lVert x_i - x \rVert^2 \right)$
A matter that will be addressed later is the principled selection of the parameter $\gamma$ within the RBF, which constitutes a meta-parameter (that is, a value chosen a priori) in the model.
$b$: This is the intercept term for the hyperplane that separates the feature space for classification. This parameter is not calculated in the same way as the $\alpha_i$, but is constructed after the optimization process.
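As an illustration of how the pieces above combine at prediction time, the sketch below evaluates a kernel decision function for an already-trained SVM. It assumes the RBF kernel is parameterized as exp(-gamma * squared distance), which is one common convention and may differ from the parameterization used in this work; the names `rbf_kernel` and `svm_decide`, and the toy coefficients, are hypothetical. Training (the quadratic program) is not shown.

```python
import math
from typing import Sequence


def rbf_kernel(u: Sequence[float], v: Sequence[float], gamma: float) -> float:
    """RBF kernel exp(-gamma * ||u - v||^2); this parameterization is an assumption."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def svm_decide(x: Sequence[float],
               support_vectors: Sequence[Sequence[float]],
               alphas: Sequence[float],
               labels: Sequence[int],
               intercept: float,
               gamma: float) -> int:
    """Sign of the kernel expansion over the support vectors plus the intercept."""
    activation = sum(a * y * rbf_kernel(sv, x, gamma)
                     for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if activation + intercept >= 0 else -1


if __name__ == "__main__":
    # Toy, hand-picked values standing in for the output of SVM training.
    svs = [(0.0, 0.0), (1.0, 1.0)]
    print(svm_decide((0.9, 0.9), svs, [1.0, 1.0], [-1, +1], 0.0, gamma=1.0))
```

Only the support vectors enter the sum, which is why deployment cost scales with the number of support vectors rather than with the size of the training set.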
2.3 Relevance Vector Machine
The relevance vector machine (RVM) assumes a form similar to the SVM, but is capable of providing a probabilistic interpretation of its predictions. For target values $t \in \{0, 1\}$ (remapping the input training labels from $\{-1, +1\}$ is trivial), predictions are of the form,
(5) $y(x) = \sum_{i=1}^{N} w_i K(x, x_i) + w_0$
where $w_0$ is essentially analogous to the bias parameter in the SVM. In this case, the $w_i$ are constructed through an iterative optimization process where there is an initial assumption,
(6) $p(w \mid \alpha) = \prod_{i=0}^{N} \mathcal{N}\left(w_i \mid 0, \alpha_i^{-1}\right)$
and the $\alpha_i$ are precision parameters corresponding to the $w_i$. Rather than taking the sign of the linear function, as was the case in the SVM situation, a classification decision is constructed by considering a logistic sigmoid function as follows:
(7) $\sigma(y) = \frac{1}{1 + e^{-y}}$
such that,
(8) $p(t = 1 \mid x) = \sigma(y(x))$
In particular, if $p(t = 1 \mid x) \geq 0.80$ then a classification decision of $+1$ is returned, whereas if the converse is true, then a classification decision of $-1$ is returned (after remapping into the original target space) [11, 1]. The threshold of 0.80 was selected by experimentation as delivering strong predictive results. This selection can be justified intuitively by the observation that high confidence is preferable in stock prediction, and that one is more likely to invest correctly when the probability of class membership is 0.80 rather than 0.50. The RVM is then capable of providing an interpretation that a given stock input has a particular class membership probability above the 0.80 threshold, which the user can incorporate into further analyses.
Summarily speaking, the RVM is an advantageous algorithm to use due to its probabilistic properties, and the value of the sigmoid function represents a degree of confidence that a given feature vector belongs to the class of stocks with positive returns from an initial quarter to a subsequent quarter.
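The thresholded decision rule can be sketched as follows. This is only an illustration under stated assumptions: the function names `sigmoid` and `rvm_decide` are hypothetical, the input `activation` stands for the value of the RVM's linear function at a stock's feature vector, and treating a probability exactly equal to the threshold as a positive decision is a choice made here, not something specified by the text.

```python
import math
from typing import Tuple


def sigmoid(a: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large negative inputs."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)
    return z / (1.0 + z)


def rvm_decide(activation: float, threshold: float = 0.80) -> Tuple[int, float]:
    """Return (label, probability): +1 only when the positive-class probability
    reaches the threshold, otherwise -1. The probability is returned as well so
    it can serve as a confidence score for ranking stocks."""
    probability = sigmoid(activation)
    label = 1 if probability >= threshold else -1
    return label, probability


if __name__ == "__main__":
    for a in (0.0, 1.0, 2.0):
        print(a, rvm_decide(a))
```

With the 0.80 threshold, an activation of 1.0 (probability of roughly 0.73) is still labeled negative, which reflects the conservative stance described above; passing `threshold=0.5` recovers the usual symmetric rule.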
2.4 Nearest Neighbors Ensemble
A disadvantage of parametric kernel methods in the style of the SVM and the RVM is that, for poor choices of the kernel parameter, the learned model will fail to detect relevant patterns within the data and will perform poorly. In the context of financial learning, a badly chosen kernel parameter may result in the learner labeling every test case as a stock giving positive (or negative) returns, even though in actuality this can scarcely be true.
This issue can be addressed by implementing a classifier that estimates class densities across the feature space using a naive Euclidean distance metric. This classification algorithm is known as a Nearest Neighbors (NN) method. In particular, after applying Bayes’ Rule [1] we arrive at a class posterior probability distribution,
(9) $p(C_k \mid x) = \frac{K_k}{K}$
where $K$ is the number of closest (Euclidean) neighbors to consider and $K_k$ is the number of points of class $C_k$ that are within the set of $K$ closest neighbors to the vector $x$. Thus, a classification decision may be determined by maximizing the quantity $p(C_k \mid x)$ with respect to the class label.
Empirically speaking, individual NN classifiers are not strong classifiers, in the sense that a single unit is not capable of delivering high accuracy. Therefore, we train a committee of one hundred weak NN classifiers, using a subset of the training examples for each constituent learner. The parameter $K$ is selected through a 10-fold cross-validation process, where the $K$ minimizing the average error is used to train a model for the larger ensemble.
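A minimal sketch of the neighbor-count posterior and of a committee of NN classifiers follows. It is illustrative only: the names `knn_posterior` and `committee_predict` are hypothetical, and both the 50% subsample fraction and the use of an unweighted majority vote to combine committee members are assumptions made here, since the text specifies only that each member sees a subset of the training examples.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Sequence


def knn_posterior(x: Sequence[float], train_x: List[Sequence[float]],
                  train_y: List[int], k: int) -> Dict[int, float]:
    """Class posterior estimated as (neighbors of that class among the k nearest) / k."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(x, train_x[i]))
    counts = Counter(train_y[i] for i in order[:k])
    return {label: count / k for label, count in counts.items()}


def committee_predict(x: Sequence[float], train_x: List[Sequence[float]],
                      train_y: List[int], k: int, n_members: int = 100,
                      fraction: float = 0.5, seed: int = 0) -> int:
    """Majority vote over NN classifiers, each built on a random training subset.
    The subset fraction and the plain majority vote are assumptions."""
    rng = random.Random(seed)
    n = len(train_x)
    subset_size = min(n, max(k, int(fraction * n)))
    votes: Counter = Counter()
    for _ in range(n_members):
        idx = rng.sample(range(n), subset_size)
        posterior = knn_posterior(x, [train_x[i] for i in idx],
                                  [train_y[i] for i in idx], k)
        votes[max(posterior, key=posterior.get)] += 1
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    xs = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
    ys = [-1, -1, -1, +1, +1, +1]
    print(knn_posterior((0.05, 0.05), xs, ys, k=3))
    print(committee_predict((0.05, 0.05), xs, ys, k=1))
```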
2.5 Boosting Classifier Performance
Financial time-series forecasting suffers from something of a bad reputation, being based on noisy data for which it is often intractable to learn an effective model. Techniques exist in machine learning to relieve this problem, and within the scope of this work we adopted the method of boosting for improving a basis set of “weak learners” to generate a more capable predictor.
(10) 
(11)  
(12) 
(13) 
The essential idea behind the boosting algorithm is that if the first classifier performs strongly and accurately predicts the stock return outcome, then it is less important for subsequent classification algorithms to also make these same correct predictions, and therefore their impact on the prediction task becomes less meaningful and less necessary. A direction for future research in this area would be to develop methodologies for incorporating a weighted objective function that more finely balances the constituent models (consider, for instance, an augmentation of Adaptive Boosting).
In some sense of the word, this implementation of boosting is a “hacky” solution, relying chiefly on inspiration from a grossly simplified version of AdaBoost and on an empirical evaluation of simply what works well. Nonetheless, it was determined experimentally that this algorithm learns weighting coefficients of comparable accuracy to an exhaustive grid search of parameters on the testing data. Therefore, though the algorithm lacks a theoretical framework, its efficacy in practice justifies its presence.
3 Meta-Parameter Selection for Parametric Learners
In the ensemble we have elected to use the SVM and RVM parametric learners. We conduct an extensive series of cross-validation evaluations to select the best-performing value of the parameter in the kernel function. In particular, we partition the training data into five randomly chosen subsets. For every candidate value of the kernel parameter, five models are learned, each by excluding one of the partitions from the training. The excluded partition is then used to test the learned model and an error rate is recorded. The candidate value that performs best on average is chosen as the meta-parameter and is used to formulate the full model.
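The five-fold procedure described above can be sketched generically as follows. This is a simplified illustration rather than the code used in this work: `k_fold_indices` and `select_parameter` are hypothetical names, and `fit` is an assumed user-supplied callable that trains a model for a given candidate value and returns a prediction function.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Predictor = Callable[[Sequence[float]], int]


def k_fold_indices(n: int, n_folds: int = 5, seed: int = 0) -> List[List[int]]:
    """Shuffle the indices 0..n-1 and deal them into n_folds disjoint partitions."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def select_parameter(xs: List[Sequence[float]], ys: List[int],
                     candidates: Sequence[float],
                     fit: Callable[[list, list, float], Predictor],
                     n_folds: int = 5, seed: int = 0) -> Tuple[float, Dict[float, float]]:
    """Return the candidate with the lowest mean held-out error, plus all mean errors."""
    folds = k_fold_indices(len(xs), n_folds, seed)
    mean_error: Dict[float, float] = {}
    for value in candidates:
        fold_errors = []
        for held_out in folds:
            excluded = set(held_out)
            train = [i for i in range(len(xs)) if i not in excluded]
            predict = fit([xs[i] for i in train], [ys[i] for i in train], value)
            wrong = sum(1 for i in held_out if predict(xs[i]) != ys[i])
            fold_errors.append(wrong / len(held_out))
        mean_error[value] = sum(fold_errors) / len(fold_errors)
    best = min(candidates, key=lambda v: mean_error[v])
    return best, mean_error
```

Ties between candidates are broken in favor of the value listed first, which is a design choice of this sketch and not something stated in the text.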
Table 1: Explanatory Variables

Variable Index  Variable Name  Description [3] 
1  ACTQ  Current assets (total) 
2  CHEQ  Cash and shortterm investments 
3  DLCQ  Debt in current liabilities 
4  DLTTQ  Longterm debt (total) 
5  EPSPXQ  Earnings per share (basic and excluding extraordinary items) 
6  EPSX12  Earnings per share in 12 months (basic and excluding extraordinary items) 
7  ICAPTQ  Quarterly invested capital (total) 
8  LCTQ  Current liabilities (total) 
9  LTQ  Liabilities (total) 
10  NIQ  Net income (loss) 
11  OEPS12  Earnings per share from operations (12 months moving) 
12  OIADPQ  Operating income after depreciation (quarterly) 
13  REVTQ  Revenue (total and quarterly) 
14  SPCE12  S&P core earnings (12MM) 
15  SPCEQ  S&P core earnings 
16  WCAPQ  Working capital (balance sheet) 
17  XOPRQ  Operating expense (total and quarterly) 
18  CAPXY  Capital expenditures 
19  EPSFIY  Earnings per share (diluted and including extraordinary items) 
20  IVCHY  Increase in investments 
21  REVTY  Revenue (total and yearly) 
22  SPCEDY  S&P core earnings EPS diluted 
23  SPCEEPSPY  S&P core earnings EPS basic (preliminary) 
24  SPCEPY  S&P core earnings (preliminary) 
25  XOPRY  Operating expense (total and yearly) 
26  CSHTRQ  Common shares traded (quarterly) 
27  MKVALTQ  Market value (total) 
28  PRCCQ  Price close (quarter) 
29  PRCHQ  Price high (quarter) 
30  PRCLQ  Price low (quarter) 
The partial advantage of the RVM over its SVM counterpart is the lack of need to tune the so-called “slackness” parameter $C$, which represents, intuitively, the extent to which the SVM is permitted to violate the presumed linear separability of the data (often after it has been remapped to the feature space by the kernel). This parameter was investigated thoroughly by Huerta et al., who determined which assignment was most often preferred in a cross-validated process [4]. Following this previous work, we do not increase the complexity of the model training algorithm by fine-tuning the slackness parameter. Instead, $C$ is chosen a priori.
4 Data Description and Prefiltrations
The data used in this work were obtained from the Wharton Research Data Services (WRDS), which provided access to CRSP and Compustat databases. Data are considered from the years 2006 through 2012. This interval of financial history is of particular relevance because it reflects a period in time that was characterized by financial collapse and, to some extent, eventual recovery. This range of situations presents the learned model with a dynamic set of circumstances in which it may be applied.
We provide a table (see Table 1) of the explanatory variables incorporated into the analysis of this work, where each explanatory variable is assigned both an index for reference and a brief description from the Compustat database.
4.1 GICS Partitioning and Motivation
A major element of this work is to examine the effectiveness of the ensemble model on financial data from the various Global Industry Classification Standard (GICS) sectors. The GICS sectors are: Energy, Materials, Industrials, Consumer Discretionary, Consumer Staples, Health Care, Financials, Information Technology, Telecommunication Services, and Utilities. As discovered by Huerta, there exist certain industry sectors (in particular, the Telecommunications and Utilities sectors) that do not possess a “critical mass” of data vectors to effectively divide into training and testing categories. In particular, the numerical calculations required in formulating the RVM model tend to lack robustness for these low-density training sets, and this condition prevents the model from terminating successfully. Therefore, these sectors are not considered in terms of deploying the model.
Partitioning the financial data according to the GICS sector is motivated first by an interest in observing differences in learned explanatory variables between industries. In other words, it is believed that technical variables have explanatory powers that vary across industry sectors, and that by selectively choosing these factors for each sector, the predictive capabilities of the model will be generally improved. The partitioning is further motivated by an interest in improving the deployability of the model. The algorithmic complexity of these models increases considerably as additional vectors are added to the training set; thus, to improve the time to termination of the model, it is beneficial to decrease the number of points in the training set, which is conveniently and justifiably accomplished by training a model for every GICS sector.
4.2 Supervised Feature Selection with a Relief Algorithm
A key attribute of this project was the automation of feature selection within the varying industry sectors. This approach allows the ensemble learner to evaluate for itself the relevance of explanatory variables to prediction, and does not rely on human specification or on prior domain-expert review. In a sense, this automated feature estimation constitutes a “purer” learning methodology than does pre-specification.
A Relief algorithm is chosen for the purpose of variable selection because such algorithms have demonstrated an empirical capability to detect dependencies within the data in both regression and classification problems. In particular, we implemented the ReliefF algorithm to identify critical features across industries. ReliefF is advantageous for selecting informative explanatory variables because it is comparatively less sensitive to stochastic input data than other Relief algorithms [8]. The essential idea behind the ReliefF algorithm is to select examples in the training data at random, observe their nearest neighbors, and assign a weighting to each explanatory variable according to how well it distinguishes the example from nearby data points of a separate class [12]. In particular, the ReliefF algorithm estimates two conditional probabilities and arrives at a weighting $W_f$ for each feature $f$,
(14) $W_f = P(\text{different value of } f \mid \text{nearest instance of a different class}) - P(\text{different value of } f \mid \text{nearest instance of the same class})$
This construction is intuitively thought of as the probability that the feature $f$ takes a different value in a nearby point given that the classes are not identical, minus the probability that $f$ assumes a different value yet the classes are the same [12].
The algorithm for the ReliefF is written explicitly as follows,
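A simplified sketch of the procedure for the binary-class case is given here as an illustration; it is not a reproduction of the original listing. The function name `relieff`, the defaults `n_samples=50` and `k=3`, the range-scaled absolute difference used for numeric features, and the Manhattan-style distance built from it are all assumptions of this sketch.

```python
import random
from typing import List, Sequence


def relieff(xs: List[Sequence[float]], ys: List[int],
            n_samples: int = 50, k: int = 3, seed: int = 0) -> List[float]:
    """Binary-class ReliefF sketch: reward features that differ on the k nearest
    misses (other class) and penalize features that differ on the k nearest hits."""
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    # Range of each feature, used to scale differences into [0, 1].
    span = [(max(r[j] for r in xs) - min(r[j] for r in xs)) or 1.0 for j in range(d)]

    def diff(j: int, a: Sequence[float], b: Sequence[float]) -> float:
        return abs(a[j] - b[j]) / span[j]

    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(diff(j, a, b) for j in range(d))

    weights = [0.0] * d
    for _ in range(n_samples):
        i = rng.randrange(n)  # randomly selected training example
        others = sorted((m for m in range(n) if m != i),
                        key=lambda m: distance(xs[i], xs[m]))
        hits = [m for m in others if ys[m] == ys[i]][:k]
        misses = [m for m in others if ys[m] != ys[i]][:k]
        if not hits or not misses:
            continue  # cannot form the estimate without both kinds of neighbor
        for j in range(d):
            miss_term = sum(diff(j, xs[i], xs[m]) for m in misses) / len(misses)
            hit_term = sum(diff(j, xs[i], xs[m]) for m in hits) / len(hits)
            weights[j] += (miss_term - hit_term) / n_samples
    return weights


if __name__ == "__main__":
    # Feature 0 determines the class; feature 1 is unrelated to it.
    data = [(float(i % 2), ((i * 7) % 10) / 10.0) for i in range(40)]
    labels = [1 if row[0] == 1.0 else -1 for row in data]
    print(relieff(data, labels))
```

In the toy data, the class-determining feature receives a weight near one while the unrelated feature receives a much smaller weight, which matches the intuition that a relevant variable differs across classes but not within them.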
5 Results and Analysis of the Ensemble
Presented here are a series of empirical experiments developed to demonstrate concurrently both the strengths and weaknesses of a learning algorithm of this form. In particular, we demonstrate through empirical evidence that this learning algorithm has an ability to form effective predictions in cases where at least one of the constituent algorithms performs weakly in terms of stock price return prediction. However, some results indicate that substantial overfitting can occur in instances where too much data is presented to the model. Furthermore, the boosting procedure tends to learn only to imitate a constituent learner’s solution in cases where all members of the ensemble perform strongly.
As discussed in the previous section, identifying important, industry-specific predictors was a major consideration in this project. Figure 1 demonstrates the relevances of particular explanatory variables across all of the GICS sectors. For the purposes of visualization, the variable importances returned by the ReliefF algorithm have been normalized to the interval $[0, 1]$ by a linear scaling procedure. In particular, low values toward zero indicate weak explanatory power, while high values are characteristic of highly predictive variables for that sector. The visualization of these results indeed reveals that the importances of predictors do vary across industry sectors in terms of their efficacy in predicting stock price returns. See Table 1 for a summary of the explanatory variables used in this analysis and their corresponding indices.
Here the weights are linearly normalized to be in the interval $[0, 1]$, where higher values indicate features of greater relevance for a particular industry. The period under consideration was selected arbitrarily and for the purposes of visualization and example.
From the discussion of the constituent algorithms incorporated into the ensemble, it is apparent that cross-validation serves a vital role in formulating an effective model for stock return prediction. Both the support vector machine and the relevance vector machine are evaluated over a discrete set of candidate values of the kernel parameter, where these values are selected for consistency with Huerta et al. [4]. Further, these choices of parameter were evaluated in terms of efficacy by five-fold cross-validation, where an average of errors is used to determine the best choice. In the case of the SVM, the cross-validation procedure is consistent with Huerta in that it most commonly arrives at the same selection.
It was surprising to observe that the RVM does not follow this choice and instead most often chooses a kernel parameter value specific to each industry, though the functional form of the model is identical to the SVM. This phenomenon can be explained to an extent by the observation that, in practice, the relevance and support vectors selected by either algorithm for prediction are rarely the same, and more often the support vectors are near to decision boundaries, whereas relevance vectors are more antithetical in nature [11]. It therefore makes intuitive sense that a different distance augmentation would be preferable for the classification task. This result of industry-related impacts on parameter selection is summarized for the year 2009 in Table 2.
Table 2: Cross Validation across Industries for RVM

GICS Name  Hyperparameter Selection  Frequency of Selection 
Energy  4.0  50% 
Materials  0.5  80% 
Industrials  0.5  95% 
Consumer Discretionary  4.0  65% 
Consumer Staples  1.0  80% 
Health Care  2.0  80% 
Financials  0.5  95% 
Information Technology  1.0  75% 
In constructing the Nearest Neighbor predictor it was necessary to evaluate the number of neighbors $K$ to consider for a new input to the model. This value of $K$ will determine if a company’s stock is best described by the known stock performance nearest to it, or by some double, triple, or $K$-tuple averaging of nearest known stocks. The purpose, then, is to derive an estimate for $K$ given a training set. This is accomplished by performing a 10-fold cross-validation procedure on partitions of the training set for a range of candidate values of $K$ and then choosing the $K$ for which the cross-validated error was first minimized. For many industries, a single value of $K$ was observed to be the most effective parameter for neighbor searches. Figure 1(a) provides a visualization of these measurements and indicates the value of $K$ that first minimized the error.
It was similarly crucial for the model to have some measure of when to stop “growing” the random forest (by “growing” we mean adding further learned decision trees to the model). We selected as the stopping criterion a point on the error curve for which there is no substantial increase in accuracy by adding another tree. Because of the variable and dynamic nature of these error curves, it is necessary to evaluate such a criterion by inspection, so as to avoid the pitfalls of local minima and premature termination of the growing.
Clearly, however, it is undesirable to have to refer to a “true” error rate, where the error is considered as a function of the number of trees. Doing so would force the model to require a testing data set for validation purposes, which is an unprincipled approach. Fortunately, the random forest model offers a methodology for obtaining an estimate of the error rate that is evaluated internally at training. This error rate is referred to as the Out-of-Bag (OOB) error and relies on the fact that the trees in the forest are each grown entirely on a bootstrapped subsample of the training data. The essential idea is to evaluate, for each individual tree, all data points not used in its training (about one-third of the training data) to obtain a classification decision. For each data point, a naive voting system compares the results of the forest to the true class and creates an average proportion of the number of times the prediction is not consistent with the correct class [2].
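The “about one-third” figure can be motivated by a short calculation. The derivation below is a sketch that assumes each tree's bootstrap sample consists of $N$ draws with replacement from the $N$ training points, which is the usual convention but is not stated explicitly here.

```latex
% Probability that a fixed training point is never drawn in N draws with
% replacement, i.e. that it is out-of-bag for a given tree.
P(\text{point } i \text{ is out-of-bag})
  = \left(1 - \frac{1}{N}\right)^{N}
  \;\xrightarrow{\;N \to \infty\;}\; e^{-1} \approx 0.368
```

Under this assumption, each point is out-of-bag for a little over one-third of the trees, so a forest of moderate size supplies every training point with a committee of trees that never saw it.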
Figure 1(b) presents an example of the OOB error estimate, the true error curve created at testing, and a crossvalidated error curve using the entire data set. In this example, it was apparently the case that the error estimate was an underestimate of the model’s true error. In practice, the OOB error tends to convincingly level out around 120 grown trees, which consistently falls on a level portion of the true error curve. This implies that this technique, while perhaps growing a larger number of trees than necessary, will achieve a local minimum of the true error as a function of the number of trees.
5.1 Justifying the Boosting Procedure
As mentioned in the relevant section on boosting the ensemble’s results in a committee fashion, the approach was unprincipled from a theoretical perspective. That being said, it is possible to justify the boosting from an empirical perspective. Below is a table summarizing the individual results of each classification algorithm on varying industry sectors in the period from the first quarter of 2009 to the second quarter of 2009. Training was performed on 10% of the available data for each industry. The 10% of stocks that constitute the training set were selected at random from the total set of stocks. This implies that the size of the training set varies by industry, since some industries are larger than others.
Furthermore, the training and testing error rates specified in the table are reported post boosting. The column Train corresponds to the error achieved by the model on the training data set. Similarly, Test is the error rate reported by the model when deployed on the 90% of the data not used in training the model. The error rate presented here is formulated as:
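(15) $\text{Error} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i \neq y_i\right]$
where $\hat{y}_i$ is the predicted label and $y_i$ is the true label of the $i$-th of the $N$ stocks evaluated.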
Table 3: Error Rate for Individual Algorithms and Benchmarks
GICS Name  Forest  SVM  RVM  NN Ensemble  Train  Test 
Energy  34.67%  43.34%  80.66%  58.39%  45.16%  19.34% 
Materials  34.01%  35.03%  71.43%  60.88%  33.33%  28.57% 
Consumer Discretionary  47.66%  47.66%  67.45%  49.14%  48.07%  32.55% 
Consumer Staples  48.10%  44.30%  65.19%  51.27%  38.89%  34.81% 
Information Technology  37.34%  37.18%  64.12%  49.03%  47.06%  35.88% 
The careful reader will have noticed that three GICS sectors were excluded from Table 3. In particular, the Industrials, Health Care, and Financials industries were discovered to be difficult for the ensemble to model effectively via the boosting operation described in Section 2.5. In these cases, a hypothesis to explain the uncharacteristic model behavior is that none of the algorithms individually performs particularly weakly in these areas, typically scoring around a 30% error rate, which differs substantially from individual performances observed in other industry sectors. Furthermore, we observed that in the case of Health Care and Financials, shifting the decision threshold of the RVM from 0.8 to 0.5 resulted in a dramatic increase in accuracy for these sectors, suggesting that requiring a high probability of a positive stock return is incompatible with these stochastic industries. Results from the same time period as Table 3 are reproduced below in Table 4:
Table 4: Individual Algorithms and Benchmarks (GICS 35 & 40)
GICS Name  Forest  SVM  RVM (threshold)  NN Ensemble  Train  Test 
Industrials  33.05%  34.96%  31.78% (0.8)  33.05%  15.09%  31.78% 
Health Care  30.57%  36.09%  29.09% (0.5)  32.70%  7.69%  32.70% 
Financials  30.43%  36.23%  30.43% (0.5)  27.53%  12.50%  30.43% 
5.2 Results for Aggregated GICS Data
A major component of this work was the division of market areas into distinct sectors as specified by the GICS. This was motivated, in part, by the relevance of particular explanatory variables to particular fields of financial prediction. That being said, the model is capable of being deployed in the context of a “full market.” In other words, we can fail to specify divisions and accept all stocks as not belonging to any particular field.
This approach is not without its disadvantages, and the most pressing of these is algorithmic in nature. In particular, the standard model, which was trained on the individual sectors, will, on average, execute within 11.12 seconds (this includes all training, cross-validation, and testing procedures). By contrast, the model executed on the aggregated data is slower by nearly a factor of five, completing its runtime in 53.46 seconds on average. The implementation of these algorithms was within the MATLAB environment, suggesting that significant runtime improvements could be realized with C or Fortran implementations; but despite these potential advances, it is our suspicion that a significant gap in execution time between the aggregated and partitioned models is unavoidable.
It was found that the model trained on the stock data from the first and second quarters of 2009 produced an error rate of 39.34%, which, while failing to outperform the industry-specific classifications, would seem to indicate that an aggregated approach to stock price return prediction is not necessarily unreasonable. However, closer inspection of the individual learning algorithms reveals that the RVM, and, to a certain extent, the random forest, tremendously overfit the financial data and essentially learn to reproduce the most common classification result. In the case of the first and second quarters of 2009, this resulted in models that had not actually learned from the data, producing instead a nearly constant forecast of positive returns for all stock inputs. The apparent accuracy then comes as a result of guessing the most common class label.
Indeed, users of the ensemble model should be wary of a training error as low as the 0.91% reported in the table below, which effectively means that the model has learned nothing except to memorize the input data. A potential explanation for the committee’s decision to reproduce only the results of the Nearest Neighbors algorithm is that this constituent model achieves essentially zero error on the training set. In turn, the committee recognizes the apparent superiority of the Nearest Neighbors approach and adopts it without question.
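The mechanism described above can be illustrated with a small weighted vote. This is only a sketch under stated assumptions: the weighting rule is the standard AdaBoost-style weight 0.5 ln((1 - e) / e) applied to training error e, which may differ from the weighting actually used by the committee, and the training errors for the forest, SVM, and RVM are hypothetical values chosen for the example. Only the near-zero error of the Nearest Neighbors ensemble mirrors the text.

```python
import math


def vote_weight(train_error, eps=1e-3):
    """AdaBoost-style weight 0.5 * ln((1 - e) / e), with e clipped away from 0 and 1."""
    e = min(max(train_error, eps), 1.0 - eps)
    return 0.5 * math.log((1.0 - e) / e)


def committee_predict(votes, weights):
    """Weighted vote over labels in {-1, +1}; ties are resolved in favor of +1."""
    score = sum(weights[name] * votes[name] for name in votes)
    return 1 if score >= 0 else -1


# Hypothetical training errors; the NN ensemble memorizes the training set.
train_errors = {"forest": 0.30, "svm": 0.35, "rvm": 0.40, "nn": 0.0}
weights = {name: vote_weight(err) for name, err in train_errors.items()}

# The three other learners agree on -1, yet the NN ensemble says +1.
votes = {"forest": -1, "svm": -1, "rvm": -1, "nn": 1}
decision = committee_predict(votes, weights)
```

Under these assumed numbers the weight of the memorizing learner exceeds the combined weight of the other three, so the committee follows it even when it is outvoted three to one. Weighting by held-out rather than training error is one way to avoid this, although that is a suggestion and not something evaluated here.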
In light of these concerns over severe overfitting of the training data, we recommend against applying this learning methodology to the aggregated GICS sectors; the sectors should instead be partitioned so as to avoid learning essentially nothing from the data.
Individual Algorithms and Benchmarks (2009 Q1 - 2009 Q2)
GICS Name    Forest   SVM      RVM      NN Ensemble   Train Error   Test Error
Aggregated   34.60%   49.61%   32.07%   39.20%        0.91%         39.20%
5.3 Results for Time-Series Financial Data
It may be of some interest to attempt to forecast stock price returns in quarters far beyond the training period. Here we consider models trained on the first quarter of 2006, and on the first quarter of 2007, with the intent of accurately predicting the immediately subsequent quarter. The parameters learned from this operation are then used to predict the stock price return for all subsequent quarters until the third quarter of 2012. More formally, we train four models corresponding to the random forest, the SVM, the RVM, and a Nearest Neighbors ensemble. Each model learns its structure and parameters, performing the usual 5-fold cross-validation procedure when necessary to select hyperparameters. These learned models are then used to form predictions for the quarters following the training period. The parameters are never relearned; they are held constant, consistent with the idea that one cannot have significant prior knowledge of what will happen in the future.
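The evaluation protocol described above (fit once, then score every later quarter without refitting) can be sketched as follows. This is an illustrative outline only: a single 1-nearest-neighbor classifier stands in for the four committee members, cross-validation is omitted, and the quarter names, two-dimensional features, and +1/-1 labels are hypothetical data invented for the example.

```python
def fit_nearest_neighbor(features, labels):
    """'Training' a 1-NN model simply stores the training quarter."""
    return list(zip(features, labels))


def predict(model, x):
    """Label of the stored point closest to x (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(model, key=lambda pair: sq_dist(pair[0], x))
    return label


def error_rate(model, features, labels):
    """Fraction of examples the frozen model classifies incorrectly."""
    wrong = sum(predict(model, x) != y for x, y in zip(features, labels))
    return wrong / len(labels)


# Hypothetical data: quarter -> (feature vectors, following-quarter return labels).
quarters = {
    "2006Q1": ([(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)], [1, 1, -1, -1]),
    "2006Q2": ([(0.7, 0.3), (0.3, 0.7)], [1, -1]),
    "2006Q3": ([(0.6, 0.2), (0.1, 0.6)], [1, 1]),    # relationship drifts
    "2006Q4": ([(0.9, 0.2), (0.2, 0.9)], [-1, 1]),   # relationship reverses
}

# Fit once on the first quarter; the model is never refit afterwards.
train_x, train_y = quarters["2006Q1"]
frozen_model = fit_nearest_neighbor(train_x, train_y)

errors = {
    name: error_rate(frozen_model, xs, ys)
    for name, (xs, ys) in quarters.items()
    if name != "2006Q1"
}
# Quarters where the frozen model does worse than random guessing (50% error).
worse_than_chance = [name for name, err in errors.items() if err > 0.5]
```

In this toy setting the frozen model is accurate immediately after training and degrades as the feature-to-label relationship changes, which is the pattern the experiments in this section are designed to expose.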
The empirical evidence we present involves training multiple, industry-partitioned models on the years and quarters mentioned previously. For most industries, training a model on a given interval tended to produce good accuracy for the financial quarters immediately subsequent to the training period. This behavior is characteristically observed, for example, in Figures 2(a) & 2(b). Notice that in these instances the error rate can sometimes exceed the random guessing threshold in later quarters, yielding remarkably poor performance; this suggests that characteristics of the data learned to be relevant in one quarter become distinctly uninformative later on.
This phenomenon was not observed consistently throughout all industries and time intervals, however. Certain industries and time intervals displayed significant explanatory power over stock price returns remarkably far in advance. This is demonstrated in particular in the Energy and Information Technology sectors when the model is trained to predict the second quarter of 2007 from first-quarter data (see Figures 2(c) & 2(d)). Interestingly, in both cases the remarkable increase in predictive accuracy arrived in 2008, when the financial crisis caused significant distress in the stock market. The high accuracy of the ensemble model in both cases (an error rate of around 20%) suggests that a trader using this model to forecast stock price returns far in advance could have profited from the market crash.
As a point of comparison, we also trained the model to forecast stock returns in the second quarter of 2008 from data for the first quarter of the same year. These learned parameters were then used to predict stock price returns for all subsequent quarters in the same style as the previous experiments. The key results of this analysis are reproduced graphically in Figure 4. In particular, we reproduce results from the Consumer Discretionary, Health Care, and Information Technology sectors. These visualizations suggest that the ensemble algorithm is capable of predicting stock price returns effectively when trained on data from the midst of a financial crisis, maintaining an error rate of approximately 30% in most instances. In the interest of full disclosure, however, we note that the algorithm failed to learn anything effective for prediction in the Financials sector, where it consistently failed to improve on the random guessing error benchmark of 50%.
6 Conclusions and Recommendations for Further Research
We have presented an architecture for learning to predict stock price returns by posing a binary classification problem in which positive and negative return predictions are denoted by the two class labels. This architecture relies on an ensemble committee of random forests, relevance vector machines, support vector machines, and a nearest neighbor constituent ensemble. The ensemble has explanatory power over immediately subsequent quarters, with a demonstrated accuracy of approximately 70% on testing data from financial quarters between 2006 and 2012, inclusive. The model can occasionally overfit the data, leading to poor performance on the test set, but in practice this happens in a minority of applications of the committee.
Further research in the field of financial modeling should, of course, be encouraged and pursued. To this end, we recommend exploring the application of so-called “Deep Learning” methodologies to stock price prediction. This chiefly involves learning weight coefficients on a large directed and layered graph. Models of this form have, in the past, proven difficult to train and optimize, but breakthroughs in only the last several months have led to a renaissance in deep learning.
More directly related to the research presented here would be the incorporation of a more sophisticated weighting scheme in the boosting procedure. It may also be worthwhile to attempt representation learning using autoencoders, rather than relying on raw input data. This has the advantage of representing stocks not as collected data as in Table 1, but as a linear combination of “basis stocks,” which may capture interesting interrelationships that cannot be perceived by conventional methodologies. For further reading on the topic of autoencoding algorithms, refer to [5]. Furthermore, it is of course preferable to be able to obtain stock price return classifications on a daily basis. Access to daily time-series data on stock price returns would permit the prediction of relatively immediate returns, rather than requiring models that predict months or years in advance. Finally, incorporating such financial phenomena as earnings surprises, along with additional explanatory variables, would be worthwhile for further refining the efficacy of the model.
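To make the “basis stocks” idea concrete, one possible formalization is the sparse coding objective in the style of [5], stated here up to constant factors. The notation is assumed for illustration and does not appear elsewhere in this paper: x_i is the feature vector of stock i, b_j is the j-th basis stock, a_{ij} is the coefficient of stock i on basis j, β is a sparsity weight, and c bounds the norm of each basis vector.

```latex
x_i \approx \sum_{j=1}^{k} a_{ij}\, b_j,
\qquad
\min_{\{b_j\},\,\{a_{ij}\}} \;
\sum_{i=1}^{n} \Bigl\| x_i - \sum_{j=1}^{k} a_{ij}\, b_j \Bigr\|_2^2
+ \beta \sum_{i=1}^{n} \sum_{j=1}^{k} \lvert a_{ij} \rvert
\quad \text{subject to } \|b_j\|_2^2 \le c \ \text{ for all } j.
```

Under this formulation the coefficients a_{ij}, rather than the raw technical variables, would serve as the inputs to the committee.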
7 Acknowledgments
The author would like to thank Professor Meifang Chu for her valuable assistance in data collection and for general guidance in advancing this project. This work was supported, in part, by a Neukom Institute grant for computational research.
References
 [1] Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006. Print.
 [2] Breiman, Leo, and Adele Cutler. “Random Forests.” UCB Department of Statistics, Web. 22 Apr. 2013.
 [3] Compustat Database. Wharton Research Data Services. University of Pennsylvania, Web. 13 Apr. 2013. https://wrdsweb.wharton.upenn.edu/wrds/.
 [4] Huerta, Ramon, Elkan, Charles and Corbacho, Fernando, Nonlinear Support Vector Machines Can Systematically Identify Stocks with High and Low Future Returns (September 6, 2012).
 [5] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. NIPS, 2007.
 [6] K. Kim, Financial Time Series Forecasting Using Support Vector Machines, Neurocomputing, 55, 307-319 (2003).
 [7] M. V. Sewell, The Application of Intelligent Systems to Financial Time Series Analysis, Ph.D thesis, Department of Computer Science, University College London, University of London (2012)
 [8] Megchelenbrink, Wout, Relief-Based Feature Selection in Bioinformatics: Detecting Functional Specificity Residues from Multiple Sequence Alignments, Master thesis, Department of Information Science, Radboud University, Nijmegen (2010)
 [9] Murphy, Kevin. Machine Learning: A Probabilistic Perspective. Cambridge: MIT, 2012. Print.
 [10] N. Jegadeesh and S. Titman, Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency, Journal of Finance, 48, 65-91 (1993).
 [11] Tipping, Michael E. Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research, 211-244 (2001).
 [12] Wang, Yuhang, and F. Makedon. Application of ReliefF Feature Filtering Algorithm to Selecting Informative Genes for Cancer Classification using Microarray Data.