Robust multivariate and functional archetypal analysis with application to financial time series analysis 111The code and data for reproducing the examples are available at www3.uji.es/~epifanio/RESEARCH/rfada.rar. A preliminary version of this work was presented at the 8th International Conference on Mathematical and Statistical Methods for Actuarial Sciences and Finance (MAF 2018) (Moliner and Epifanio (2018)), where the application data were analyzed in a non-robust way.
Archetypal analysis approximates data by means of mixtures of actual extreme cases (archetypoids) or archetypes, which are a convex combination of cases in the data set. Archetypes lie on the boundary of the convex hull. This makes the analysis very sensitive to outliers. A robust methodology by means of M-estimators for classical multivariate and functional data is proposed. This unsupervised methodology allows complex data to be understood even by non-experts. The performance of the new procedure is assessed in a simulation study, where a comparison with a previous methodology for the multivariate case is also carried out, and our proposal obtains favorable results. Finally, robust bivariate functional archetypoid analysis is applied to a set of companies in the S&P 500 described by two time series of stock quotes. A new graphic representation is also proposed to visualize the results. The analysis shows how the information can be easily interpreted and how even non-experts can gain a qualitative understanding of the data.
keywords:Multivariate functional data, Archetype analysis, Stock, M-estimators, Multivariate time series
Many econometric data are big data in the form of time series that can be seen as functions (Tsay (2016)). In Functional Data Analysis (FDA) the observations are functional time series or multivariate functions. Ramsay and Silverman (2005) provide an excellent overview of FDA. FDA has been applied in many different fields (Ramsay and Silverman (2002)), and although it is a relatively new area of research in the business and economic sectors, applications are beginning to proliferate in these fields (Aguilera et al. (1999); Chen and Li (2017); Kowal et al. (2017)).
Data mining of functional time series (Fu (2011)) is as important as in the classical multivariate version. It is desirable to understand and describe the entire data set and to be able to extract information that is easily interpretable even by non-experts. We deal with an unsupervised statistical learning problem since only input features and not output are present. Data decomposition techniques to find the latent components are usually employed in classical multivariate statistics (see (Hastie et al., 2009, Chapter 14) for a complete review of unsupervised learning techniques). A data matrix is considered as a linear combination of several factors. The constraints on the factors and how they are combined give rise to different unsupervised techniques (Mørup and Hansen (2012); Thurau et al. (2012); Vinué et al. (2015)) with different goals. For instance, Principal Component Analysis (PCA) explains data variability adequately at the expense of the interpretability of the factors, which is not always straightforward since factors are linear combinations of the features. For their part, clustering techniques such as -means or -medoids return factors that are easily interpreted. Note that data are explained through several centroids, which are means of groups of data in the case of -means or medoids, which are concrete observations in the case of -medoids. Nevertheless, the binary assignment of data to the clusters diminishes their modeling flexibility as compared with PCA.
Archetype analysis (AA) lies somewhere in between these two methods, as the interpretability of its factors is as easy as with clustering techniques but its modeling flexibility is higher than for clustering methodologies. Vinué et al. (2015) provide a summary table showing the relationship between several multivariate unsupervised techniques, as do Mørup and Hansen (2012). Stone and Cutler (1996) also showed that AA may be more appropriate than PCA when the data do not have elliptical distributions.
AA was formulated by Cutler and Breiman (1994). In AA each observation in the data set is approximated by a mixture (convex combination) of pure or extremal types called archetypes. Archetypes themselves are restricted to be convex combinations of the individuals in the data set. However, AA may not be satisfactory in some fields since, being artificial constructions, nothing guarantees the existence of subjects in our sample with characteristics similar to those of the archetypes (Seiler and Wohlrabe (2013)). In order to solve this issue, the new concept of archetypoids was introduced by Vinué et al. (2015). In Archetypoid Analysis (ADA) each observation in the data set is approximated by a convex combination of a set of real extreme observations called archetypoids. AA and ADA were extended to dense functional data by Epifanio (2016) and to sparse functional data by Vinué and Epifanio (2017). In the functional context, functions from the data set, which can be multivariate functions, are explained by mixtures of archetypal functions.
This process not only allows us to relate the subjects of the sample to extreme patterns but also facilitates comprehension of the data set. Humans understand the data better when the individuals are shown through their extreme constituents (Davis and Love (2010)) or when features of an individual are shown as opposed to those of another (Thurau et al. (2012)). In other words, as regards human interpretability, the central points returned by clustering methods do not seem as good as extreme types, which are also more easily understandable than a linear combination of data.
AA and ADA have therefore aroused the interest of researchers working in different fields, such as astrophysics (Chan et al. (2003)), biology (D’Esposito et al. (2012)), climate (Steinschneider and Lall (2015); Su et al. (2017)), developmental psychology (Ragozini et al. (2017)), e-learning (Theodosiou et al. (2013)), genetics (Thøgersen et al. (2013)), industrial engineering (Epifanio et al. (2013, 2017)), machine learning (Mørup and Hansen (2012); Seth and Eugster (2016a, b); Ragozini and D’Esposito (2015)), multi-document summarization (Canhasi and Kononenko (2013, 2014)), nanotechnology (Fernandez and Barnard (2015)), neuroscience (Tsanousa et al. (2015); Hinrich et al. (2016)) and sports (Eugster (2012)). AA has also been applied in market research (Li et al. (2003); Porzio et al. (2008); Midgley and Venaik (2013)) in the multivariate context. Despite the fact that financial time series are commonly analyzed by unsupervised techniques ranging from PCA (Alexander (2001); Tsay (2010); Ingrassia and Costanzo (2005)) to clustering (Dose and Cincotti (2005); Basalto et al. (2007); Tseng and Li (2012); D’Urso et al. (2013); Dias et al. (2015)), including robust versions of these (Malioutov (2013); Verdonck et al. (2011); D’Urso et al. (2016)), to the best of our knowledge functional archetypal analysis has not been used in financial or business applications until now.
As archetypes are situated on the boundary of the convex hull of data (Cutler and Breiman (1994)), AA and ADA solutions are sensitive to outliers. The problem of robust AA in the multivariate case was addressed by Eugster and Leisch (2011). The idea is to find the archetypes of the large majority of the data set rather than of the totality. Eugster and Leisch (2011) considered a kind of M-estimators for multivariate real-valued data (-variate), where the domain of their loss function is not , but . Recently, Sinova et al. (2018) considered functional M-estimators for the first time, where the domain of their loss function is . We base our proposal to robustify archetypal solutions in the multivariate and functional case on this last kind of loss function, which is commonly used in robust analysis (Maronna et al. (2006)).
The main novelties of this work consist of: 1. Proposing a robust methodology for classical multivariate and functional AA and ADA; 2. Introducing a new visualization procedure that makes it easy to summarize the results and multivariate time series; 3. Applying functional archetypal analysis to financial time series for the first time, more specifically to multivariate financial time series.
Section 2 reviews AA and ADA for the multivariate and functional case and Section 3 introduces their respective robust versions. Our proposal is compared with the only previously existing methodology for robust multivariate AA in Section 4, where a simulation study with functional data is also carried out to validate our procedure. In Section 5, robust ADA is applied to a data set of 496 companies that are characterized by two financial time series. Furthermore, some visualization tools are also introduced in this section. Finally, conclusions and future work are discussed in Section 6. The code in R (R Development Core Team (2018)) and data for reproducing the results are available at www3.uji.es/~epifanio/RESEARCH/rfada.rar.
2 Archetypal analysis
2.1 AA and ADA in the classical multivariate case
Let be an matrix with cases and variables. In AA, three matrices are sought: a) the archetypes , which are the rows of a matrix ; b) an matrix that contains the mixture coefficients that approximate each case by a mixture of the archetypes (); and c) a matrix that contains the mixture coefficients that define each archetype ( = ). To determine these matrices, the following residual sum of squares (RSS) with the respective constraints is minimized ( denotes the Euclidean norm for vectors):
under the constraints
with for and
with for .
It is important to mention that archetypes do not necessarily match real cases. Specifically, this will only happen when one and only one is equal to one for each archetype, i.e. when each archetype is composed of only one case that presents the entire weight. Therefore, in ADA the previous constraint 2) is changed by the following one, and as a consequence the previous continuous optimization problem of AA is transformed into a mixed-integer optimization problem:
with and .
Cutler and Breiman (1994) demonstrated that archetypes are located on the boundary of the convex hull of the data if 1, although this does not necessarily happen for archetypoids (see Vinué et al. (2015)). However, if = 1, the archetype coincides with the mean and the archetypoid with the medoid (Kaufman and Rousseeuw (1990)).
Cutler and Breiman (1994) developed an alternating minimizing algorithm to estimate the matrices in the AA problem. It alternates between calculating the best for given archetypes and the best archetypes for a given . In each step a penalized version of the non-negative least squares algorithm by Lawson and Hanson (1974) is used to solve the convex least squares problems. That algorithm, with certain modifications (previous data standardization and use of spectral norm in equation 1 instead of Frobenius norm for matrices), was implemented by Eugster and Leisch (2009) in the R package archetypes. In our R implementation those modifications were canceled and the data are not standardized by default and the objective function to optimize coincides with equation 1.
As regards the estimation of the matrices in the ADA problem, Vinué et al. (2015) developed an algorithm based on the idea of the Partitioning Around Medoids (PAM) clustering algorithm (Kaufman and Rousseeuw (1990)). This algorithm consists of two stages: the BUILD phase and the SWAP phase. In the BUILD phase, an initial set of archetypoids is computed, while that set is improved during the SWAP phase by exchanging the chosen observations for unselected cases and by inspecting whether these replacements diminish the RSS. Vinué (2017) implemented that algorithm in the R package Anthropometry with three possible initial sets in the BUILD step. One of them is referred to as the set and consists of the nearest neighbors in Euclidean distance to the archetypes. The second candidates, the set, consist of the observations with the maximum value for each archetype , i.e. the observations with the largest relative share for the respective archetype. The third initial candidates, the so-called set, are the cases with the maximum value for each archetype , i.e. the cases that most influence the construction of the archetypes. Each of these three sets goes through the SWAP phase and three sets are obtained. From these three sets, the one with lowest RSS (often the same set is retrieved from the three initializations) is selected as the ADA solution.
Archetypes are not necessarily nested and neither are archetypoids. Therefore, changes in will yield different conclusions. This is why the selection criterion is particularly important. Thus, if the researcher has prior knowledge of the structure of the data, the value of can be selected based on that information. Otherwise, the elbow criterion, which has been widely used (Cutler and Breiman (1994); Eugster and Leisch (2009); Vinué et al. (2015)), could be considered. With the elbow criterion, the RSS is represented for different values and the value is chosen as the point where the elbow is found.
2.2 AA and ADA in the functional case
In FDA each datum is a function. In this context, the values of the variables in the classical multivariate context become function values with a continuous index , and the data set adopts the form with . It is assumed that these functions belong to a Hilbert space, they satisfy reasonable smoothness conditions and are square-integrable functions on that interval. In addition, in the definition of the inner product, the sums are replaced by integrals.
Again, the goal of functional archetype analysis (FAA) is to approximate the functional data sample by mixtures of archetype functions. The difference with the multivariate case is that now both archetypes and observations are functions. In FAA, two matrices and are also calculated to minimize the RSS. However, certain aspects should be highlighted. On the one hand, RSS are now calculated with a functional norm (the -norm, , is considered) instead of a vector norm. On the other hand, observational and archetype vectors and now correspond to observational and archetype functions and . In any case, the interpretation of matrices and is the same as in the standard multivariate case.
Functional archetypoid analysis (FADA) is also an adaptation of ADA, changing vectors for functions. In this regard, FADA aims to find functions of the sample (archetypoids) that approximate the functions of the sample through the mixtures of these functional archetypoids. Again, vector norms are replaced by functional norms. Interpretation of the matrices is the same as before.
In practice, the functions are recorded at discrete times. Standard AA and ADA could be applied to the function values of equally-spaced values from to to obtain FAA and FADA. However, this approach is not computationally efficient (Epifanio (2016)). Therefore, we represent data by means of basis functions. This reduces noise, i.e. functions are smoothed. Furthermore, data observations do not have to be equally spaced, the number of observed points can vary across records and they can be measured at different time points. This also makes it possible to perform a more efficient analysis, since the number of coefficients of the basis functions is usually smaller than the number of time points evaluated. The crux of the matter is to choose an appropriate basis together with the number of basis elements. Nevertheless, this issue appears repeatedly in all FDA problems. Functions of the sample should be expanded by basis functions that share common features (see Ramsay and Silverman (2005) for a detailed explanation about smoothing functional data). For densely observed functions, the case that concerns us, the basis coefficients are computed separately for each function, while information from all functions should be used to calculate the coefficients for each function (James (2010)) for sparsely observed functions.
Let us see how the RSS is formulated in terms the coefficients , the vector of length that approximates with the basis functions ( = 1, …, ) (see Epifanio (2016) for details):
where = and is the order symmetric matrix with the inner products of the pairs of basis functions = . If the basis is orthonormal, for instance the Fourier basis, is the order identity matrix and FAA and FADA can be computed using AA and ADA with the basis coefficients. If not, has to be computed previously one single time by numerical integration.
2.2.1 Multivariate functional archetypal analysis
It is common to analyze data with more than one dimension. In our context, this means working with samples in which we analyze more than one function for each individual, so each function describes a characteristic of the subject.
First of all, we need to define an inner product between multivariate functions. The simplest definition is to add up the inner products of the multivariate functions. Therefore, the squared norm of a -variate function is the sum of the squared norms of the components. Consequently, FAA and FADA for -variate functions is equivalent to independent FAA and FADA with shared parameters and . In practical terms, this means to work with a composite function formed by stringing the functions together.
Without loss of generality, let be a bivariate function. So, its squared norm is . Let and be the vectors of length of the coefficients for and for the basis functions . Therefore, to compute FAA and FADA, the RSS is reformulated as:
where and with the corresponding AA or ADA constraints for and . The union of and composes the observations. If the basis functions are orthonormal, FAA and FADA are reduced to computing standard AA and ADA for the coefficient matrix composed by joining the coefficient matrix for and components.
3 Robust archetypal analysis
The RSS is formulated as the sum of the squared (vectorial or functional) norm of the residuals, ( = 1, …, ). Here, denote vectors of length in the multivariate case or (univariate or multivariate) functions in the functional case. The least squares loss function does not provide robust solutions since it favors outliers; large residuals have large effects. M-estimators try to lower the large influence of outliers by changing the square loss function for a less rapidly increasing loss function. Eugster and Leisch (2011) defined a loss function from to . However, Sinova et al. (2018) established several conditions of the loss function for functional M-estimators, the first of which is that the loss function is defined from to and the loss is specified as . Sinova et al. (2018) also analyzed the common families of loss functions. We follow the ideas of Sinova et al. (2018) and the Tukey biweight or bisquare family of loss function (Beaton and Tukey (1974)) with tuning parameter is considered, since this loss function copes with extreme outliers well. Therefore, RSS in equations 1, 2 and 3 are replaced by , where denotes the Euclidean norm for vectors, the -norm for univariate functions and corresponding norm for -variate functions, and is given by
For the tuning parameter , we follow Cleveland (1979), as did Eugster and Leisch (2011), and = 6 with being the median of the residual norms unequal to zero. From the computational point of view, we only have to replace RSS with this new objective function in the previous algorithms. It only depends on the norm of the residuals, so in the functional case, it can be expressed in terms of the coefficients in the basis and , and no integration is needed.
Another possibility would be to use the Huber family of loss functions (Huber (1964)) that depends on a tuning parameter. For example, this loss function was used by Sun et al. (2017), but the tuning parameter was manually set and no suggestion was given about its selection. An important difference between Huber and the bisquare family is that residuals larger than contribute the same to the loss in this last family, which is not the case with the Huber family. For that reason, the bisquare family can better cope with extreme outliers.
4 Simulation study
4.1 Multivariate data
To compare the performance of our proposal, the same procedures and data set, known as ozone, examined by Eugster and Leisch (2011) are considered. The data, which are also used as a demo in the R library archetypes, consist of 330 observations of 9 standardized variables that are related to air pollution. We create a corrupted data set by adding a total of 5 outliers, as Eugster and Leisch (2011) did. Table 1 shows the percentiles of 3 archetypes (A1, A2 and A3) extracted in four situations: a) original AA with the original data set before adding the outliers (AAO), this solutions plays the gold standard reference role; b) original AA with the corrupted data set (AAC); c) robust AA by Eugster and Leisch (2011) with the corrupted data set (RAA-EL); d) our proposal of robust AA with the corrupted data set (RAA-ME).
At a glance it can be seen that the archetypes returned by our proposal are the most similar to the original ones. To corroborate it, the Frobenius norm of the difference between the gold standard reference and the different alternatives applied to the corrupted data set has been computed with the following results for each situation: AAC 206.3; RAA-EL 221.3; RAA-ME 64.5. Our proposal provides the solution with the smallest difference with respect to the gold standard reference, while the robust AA solution by Eugster and Leisch (2011) provides a greater difference than the non-robust AA solution. For both AAC and RAA-EL one of the archetypes (A2) is built entirely of a mixture of the added outliers, which is not the case with our proposal (0.07 is the only weight for the outliers). Therefore, a more robust solution is achieved with our proposal.
4.2 Functional data
To check the robustness of our proposal, we now consider a set of = 100 functions that are generated from the following model, which was analyzed previously by Febrero et al. (2008), Fraiman and Svarc (2013) and Arribas-Gil and Romo (2014) for functional outlier detection procedures. are generated from = , whereas the remaining functions are generated from this contamination model: , where and is a Gaussian process with zero mean and covariance function = . The functions are measured at 50 equispaced points between 0 and 1. A total of 100 simulations have been run with two contamination rates = 0.1 and 0.15. Original and robust ADA have been applied with = 2 archetypoids. With = 0.1, 10% of the times one outlier belongs to the solution for original ADA, while no outlier is included as an archetypoid for robust ADA. With = 0.15, 78% of the times one outlier belongs to the solution for original ADA, while this percentage was only 32% for robust ADA. Therefore, our proposal provides robust solutions.
When dealing with time series, the theoretical complexity of many of the statistical methods available for analysis leads to periodic summaries of the data series being commonly used in practice. Furthermore, many of the techniques, such as the classic Box-Jenkins theory (Box and Jenkins (1976)), involve verifying a set of quite restrictive hypotheses, such as stationarity, equally-spaced observations or belonging to a specific kind of well-known processes. What makes our proposal attractive is not only the lack of these restrictive hypotheses (it could also be applied to sparsely measured time series), but also the data speak for themselves and the results can be interpreted easily by non-specialists. It also allows them to be visualized, which is an important task (Peng (2008)). To illustrate these claims, robust bivariate FADA is applied to the company stock prices as detailed below.
We consider a data set from QuantQuote (2017), which is composed of a collection of daily resolution data with the typical open, high, low, close, volume (OHLCV) structure. This collection runs from 01/01/1998 to 07/31/2013 for 500 currently active symbols in the S&P 500. In addition, the aggregate S&P 500 index OHLCV daily tick data (Yahoo (2017)) have been added to the data set. The length of this time series has been selected so that its range coincides with that of the QuantQuote disaggregated series. We have also collected information from ALPS Portfolio Solutions Distributor, Inc. (2017), which classifies stocks on the S&P 500 index within ten major sectors, namely: Consumer Discretionary (CD), Consumer Staples (CS), Energy (E), Financials (F), Health Care (HC), Industrials (I), Materials (M), Real Estate (RE), Technology (T), and Utilities (U).
We are interested in extracting financially relevant information. With regard to investments, and more specifically to portfolio theory, it is widely accepted that the two key variables are risk and profitability (Markowitz (1952)). On the one hand, we have chosen as a measure of profitability in a time the aggregate returns over a period of length N, . So, for each price series, where is the value of the stock at the time . In our case, we have chosen = 250, approximately the number of days that stock markets remain open in a year. On the other hand, we have chosen as a measure of volatility the beta or coefficients, which are widely used in portfolio theory. This is defined for a particular stock in a time as where stands for the aggregated returns of in time over the last days and are the returns of the aggregated S&P 500 index in the same period. To be consistent, we perform the calculations with a time frame of 250 days.
Although FDA can handle missing data well, as explained previously, our sample presents a different problem: some stocks do not exist throughout the entire time series. This is not because we have missing data, but because the companies were founded or were first listed on the stock exchange after 01/01/1998, i.e. the domains of the functions are different. Here we propose an alternative to deal with this problem, trying to maximize the size of our sample and minimize the stocks that must be discarded by taking into account observations since 2000-01-01 and dropping the stocks with more than 20% missing values. In this way, only four companies from 500 are left out.
The following step is to transform discrete data to functional data. The Fourier basis has been considered because the series are quite stable (without strong local features) and it is familiar to economists. As regards the number of bases , as suggested by Ramsay and Silverman (2005), we computed an unbiased estimate of the residual variance using 4 to 22 bases and selected = 11, which is the number of bases that makes decreasing the residual variance substantially. In summary, for each stock , both variables, return and beta coefficient in a 250 day time frame, could be expressed as the functions: and where stands for the vector of coefficients on the basis functions for corresponding to the particular stock and, in the same way, stands for the vector of coefficients for corresponding to the particular stock , and are the Fourier basis functions. Therefore, after smoothing, the original data set of dimensions is reduced to . As both functions are measured in non-compatible units, each functional variable is standardized before analysis by standardizing the coefficients in the basis as explained by Epifanio (2016). This data set was analyzed in a non-robust way by Moliner and Epifanio (2018).
5.2 Robust bivariate FADA
Table 2 shows the companies obtained as archetypoids for different values. When increases, the number of sectors represented in the set of archetypoids increases too. Since archetypoids are not nested, archetypoids may not coincide at all when the value varies. However, we find a nested structure in the order in which sectors appear, and in fact, some companies remain when increases, such as NTAP or LNC.
|3||(E) NBL||(F) USB||(T) BRCM|
|4||(E) DNR||(F) LNC||(T) BRCM||(U) ED|
|5||(E) EOG||(F) LNC||(T) EBAY||(U) ED||(T) NTAP|
|6||(E) EOG||(F) LNC||(T) AKAM||(U) SO||(T) NTAP||(M) BMC|
|7||(E) EOG||(F) LNC||(T) AKAM||(U) ED||(T) NTAP||(M) BMC||(T) FLIR|
|8||(E) HP||(F) LNC||(T) AKAM||(U) PNW||(T) NTAP||(M) BMC||(T) FLIR||(H) ISRG|
In the interests of brevity and as an illustrative example we analyze the results of = 5, which is the value selected by the elbow criterion. We include a brief description of the data-driven selected companies based on the information in ALPS Portfolio Solutions Distributor, Inc. (2017): EOG Resources, Inc. (EOG) is one of the largest independent (non-integrated) crude oil and natural gas companies in the United States. EOG’s strategy is to generate the best rates of return by controlling operating and capital costs while maximizing oil and natural gas reserve recoveries; Lincoln National Corp.(LNC) is a holding company. Through subsidiary companies, it operates multiple insurance and investment management businesses; eBay(EBAY) is a powerful marketplace that hosts one of the largest on-line trading communities for the sale of goods and services; Consolidated Edison, Inc. (ED) provides a wide range of energy-related products and services to its customers through regulated utility subsidiaries and competitive energy and telecommunications businesses; NetApp (NTAP) is a company that provides a full range of hybrid cloud data services that simplify the management of applications and data for large enterprises.
Figure 1 shows the functions of each variable for the 5 archetypoids. It can be seen that ED is a company that, in comparison with the other archetypoids, presents low yet constant values for both variables. Looking at NTAP, it presents high returns at the beginning and end of the time series, while its volatility decreases over time. LNC presents the typical profile of a financial company, with moderate profitability and volatility during the first three quarters of the time series. When the crisis broke out in 2007, volatility shot up to unprecedented levels while profitability plummeted. Regarding volatilities, EBAY presents just the opposite profile, with greater beta values in the first years that stabilize over time. With regard to the returns, we see that this company has a moderate level of profitability compared to the other archetypoids, but with slightly higher oscillations during the first half of the time series. Finally, EOG is characterized by having bell-shaped functions, that is, with relatively low values at the extremes of the temporal domain and higher values at the center.
What makes this technique very interesting is that the companies are expressed as mixtures of those extreme profiles through the values. For a visual overview of the S&P 500 stocks, we propose the following procedure that provides a data-driven taxonomy of the S&P 500 stocks. The idea is to group together companies that share similar profiles. A simple method to do this is to establish a threshold , so that if the weight of an archetypoid for a given individual is greater than , we will say that this subject is in the cluster generated by this archetypoid. If we repeat this process for each archetypoid, we will generate 5 “pure” clusters of subjects, i.e., clusters of subjects that are represented mostly by a single archetypoid. To take this a little further, this process is repeated with combinations of two archetypoids. Thus, we will group in the same cluster the individuals whose sum of weights for two concrete archetypoids is greater than , thus generating additional clusters. In this case, it might happen that for a given subject, there is more than one combination of archetypoids whose sum exceeds . So as not to complicate the graphic representation, we will classify these subjects generically as mixtures, even though these mixtures will be composed of different sets of archetypoids.
Figure 2 condenses all the information extracted by means of a network that is built through an adjacency matrix that gathers the cluster information. The smaller is, the larger the number of companies that belong to any cluster, and vice versa. We have chosen a value of = 0.8, which is high but not so much, i.e. = 0.8 is a breakeven between having a considerable number of companies, but at the same time not too many to be able to read the graphic representation concisely. Archetypoids are highlighted with a gray square, the lines and color codes allow us to differentiate the structure of the clusters, and different sectors are represented through different geometrical shapes.
Starting with the pure clusters, we see that EBAY is alone in its own cluster and NTAP generates a small group with 2 other companies. This would suggest that although technology companies present extreme patterns, those patterns are not widely shared by the rest of the companies in the sample. ED is the archetypoid that generates the largest cluster on its own. Most of the companies in this cluster belong to the Utilities sector, although we also find some in the non-durable goods sector and some health companies. LNC generates a small cluster with three other companies. Two of them, XL Group LTD (XL) and Hartgord Financial Services Group INC. (HIG), belong to the same sector and have similar profiles. The other company is Host Hotels and Resorts Inc (HST), a real estate investment trust. For its part, EOG generates a small cluster in which 3 other companies are included. Two of them, Helmerich & Payne INC (HP) and Southwestern Energy CO (SWN) are from the Energy sector too, but Monster Beverage Corp (MNST) is from the Consumer Staples sector.
Regarding mixed clusters, we will point out some general characteristics, since detailing all the relationships shown in Figure 2 would be too extensive a task. A first aspect is that the combination of NTAP and LNC forms a cluster with only one company. In contrast, ED and LNC form the largest cluster for this level of . Focusing on this, the ED-LNC cluster has just a few companies from the Utilities sector, since these companies are mainly classified in the cluster generated by ED alone. Many of the companies classified in the ED-LNC cluster belong to the financial, real estate and consumer staples sectors.
Apart from those already mentioned, the following most important clusters according to their size are ED-EOG and ED-NTAP. The ED-EOG cluster stands out for its economic sense, since the two generating archetypoids belong to the energy extraction and energy distribution sectors. Thus, this cluster is generated by archetypoids that have a direct economic relationship, so it is expected that the companies in this cluster should also be strongly related. The result is consistent with expectations and we see how most of the companies are part of the Utilities, Energy or Industrial sectors. At a financial level, it may be more interesting to see which companies in these clearly bounded sectors do not belong to this cluster, which would make it possible to diversify investments and reduce the risk. However, this type of analysis goes beyond the objective of this paper.
The EBAY-ED cluster is certainly heterogeneous in terms of the sectors that it comprises. However, looking at the companies in the cluster, we see that the algorithm is capable of portraying economic relationships that are not so direct. Some examples of companies in this cluster are Amazon (AMZN), which also bases its business model on online sales, or UPS, which is a parcel company. We see that although the sectors to which they belong are different, the companies are related since they are all influenced by the development of online sales.
The remaining clusters are characterized by their smaller size. To highlight some of them, we can mention the one formed by LNC-EOG, which contains companies such as Allegheny Technologies Incorporated (ATI), one of the largest producers of specialty materials in the world, or the cluster formed by EBAY-NTAP, which is only composed of technological companies.
Now we analyze the composition of the sectors managed by market analysts to evaluate the performance of the results qualitatively. Figure 3 shows the normalized relative weight of archetypoids in each of the ten sectors. It can be seen that each archetypoid represents the component with the greatest weight of the sector to which it belongs. Thus, EOG represents more than half the weight of the Energy sector, LNC accounts for more than 43% of the financial sector, ED represents more than 80% of the weight of the Utilities sector and in the Technology sector, the sum of the EBAY and NTAP weights exceeds the sum of the weight of the other three archetypoids.
But it is not only interesting to analyze the weights of the most important components. The composition of the mixtures also gives clues about the similarities and dependencies of the different sectors among themselves. For example, if we compare the Consumer Discretionary sector with the Consumer Staples sector, we see that the weights keep a proportion that we would expect. For example, the companies that manufacture durable goods, which are included in the Consumer Discretionary sector, have a direct relationship with those that provide the investment to finance these purchases and the companies that provide these goods, represented by the LNC and EBAY archetypoids, respectively, and that is why these archetypoids have higher weights in this sector. On the other hand, companies that provide basic or non-durable goods, which belong to the Consumer Staples sector, have a minor relationship with the financial sector. It may be obvious, but it is worth emphasizing that, by definition, non-durable goods are those that are purchased without resorting to financing. This sector has instead a great similarity with the Utilities sector (distributors of electricity, water, gas, etc.). This makes economic sense, since basic goods and services distributed by companies in the Utilities sector have similar demand curves. In other words, during the expansive cycles of the economy, consumers decide to increase their investments in goods that require financing, such as a car, a washing machine or a computer. However, spending on electricity, water, gas or telecommunications of households will remain relatively constant, as will spending on basic products such as bread, milk, oil, soap or toothpaste.
Regarding the composition of the energy sector (extractors of oil, gas, etc.) the small weight of the technological archetypoids is striking. This may point to the scarce relationship between the technologies used in each sector. For example, the Energy sector carries out activities where heavy machinery and in general mechanical operations (drilling, mining, extraction etc.) are fundamental. This is completely the opposite to the dynamics that prevail in the technological universe, where the main elements are computer applications, digital technology or patents.
The Industrials and Materials sectors have similar profiles, which shows the great interrelation between them. Regarding the Real Estate sector, we see how the LNC archetypoid, belonging to the sector of financial companies, has the greatest weight outside its own sector. The relationship between these two sectors is also evident. Finally, we can say that the Utilities sector is in some way the purest, in the sense that the weight of the archetypoid of this sector is greater than the weight of the other four archetypoids all together.
This paper introduces robust archetypal analysis (AA and ADA) for multivariate data and functional data, both for univariate and multivariate functions. A simulation study has demonstrated the good performance of our proposal.
Furthermore, the application to the time series of stock quotes in the S&P 500 from 2000 to 2013 has illustrated the potential of these unsupervised techniques as an alternative to the commonly used clustering of time series, which are usually described by complex models. Understanding the results of these or other statistical learning models is not always an easy task. Additionally, if these results have to be explained to a public with little or no mathematical knowledge, things can be even worse. In this regard, ADA is particularly suitable since its results can be interpreted in a very simple way by any non-expert person. For instance, anyone with minimal knowledge of investments understands what we mean if we say that a certain stock behaves like a mixture of Consolidated Edison and Lincoln National stocks. Another advantage of the applied model is that the functional version of archetypoid analysis (FADA) allows us to condense vectors of observations of any length into a few coefficients, which provides an improvement in computational efficiency and makes this method highly recommended when working with long time series.
With regard to the financial conclusions, in the first place, it has been seen that when we increase the number of chosen archetypoids, the Technology sector appears repeatedly while other sectors do not appear. Therefore, there are companies within this sector that exhibit very different behaviors, such as EBAY and NTAP, FLIR and AKAM.
Furthermore, we have proposed a visual representation of the companies through the definition of clusters based on the mixtures obtained. Finally, we have analyzed the sectors according to the normalized relative weight of archetypoids that compose each sector and we have seen that some sectors present certain similarities. It is worth mentioning that sectors like Consumer Discretionary, Materials or Industrials offer better opportunities for diversifying risks, since their composition is more heterogeneous. On the other hand, sectors such as Utilities or Consumer Staples present more homogeneous structures, where the weight of the dominant archetypoids of the sector can exceed 80%.
As regards future work, from the mathematical point of view, an open problem is the extension of archetypal analysis to mixed data (functional and vector parts). An appropriate definition of the inner product is needed. Nevertheless, applications, even beyond econometrics, are the main direction of future work. Even so, the application of these models to the world of finance is still a relatively unexplored field. The application of models with functional data allows us to take into account variables collected with different frequencies, such as daily quotes, quarterly balances or annual results, which makes these models especially suitable for financial time series.
Taking this into account, a future development may be to extend the implementation of the bivariate model to a -variable model that makes it possible to work with a large amount of data from each company. From a financial point of view, it may be possible to develop investment strategies using the results shown here to improve performance and reduce the risk of investment decisions.
This work is supported by DPI2017-87333-R from the Spanish Ministry of Economy and Competitiveness (AEI/FEDER, EU) and UJI-B2017-13 from Universitat Jaume I.
- ALPS Portfolio Solutions Distributor, Inc. (2017) ALPS Portfolio Solutions Distributor, Inc., 09 2017. SectorSPDR web-site. Retrieved on 15/09/2017 from http://www.sectorspdr.com/sectorspdr/sector/.
- Aguilera et al. (1999) Aguilera, A. M., Ocaña, F. A., Valderrama, M. J., 1999. Stochastic modelling for evolution of stock prices by means of functional principal component analysis. Applied Stochastic Models in Business and Industry 15 (4), 227–234.
- Alexander (2001) Alexander, C., 2001. Market models: a guide to financial data analysis. Wiley, Chichester.
- Arribas-Gil and Romo (2014) Arribas-Gil, A., Romo, J., 2014. Shape outlier detection and visualization for functional data: the outliergram. Biostatistics 15 (4), 603–619.
- Basalto et al. (2007) Basalto, N., Bellotti, R., Carlo, F. D., Facchi, P., Pantaleo, E., Pascazio, S., 2007. Hausdorff clustering of financial time series. Physica A: Statistical Mechanics and its Applications 379 (2), 635 – 644.
- Beaton and Tukey (1974) Beaton, A. E., Tukey, J. W., 1974. The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics 16 (2), 147–185.
- Box and Jenkins (1976) Box, G. E., Jenkins, G. M., 1976. Time series analysis: forecasting and control, revised ed. Holden-Day.
- Canhasi and Kononenko (2013) Canhasi, E., Kononenko, I., 2013. Multi-document summarization via archetypal analysis of the content-graph joint model (doi: 10.1007/s10115-013-0689-8). Knowledge and Information Systems, 1–22.
- Canhasi and Kononenko (2014) Canhasi, E., Kononenko, I., 2014. Weighted archetypal analysis of the multi-element graph for query-focused multi-document summarization. Expert Systems with Applications 41 (2), 535 – 543.
- Chan et al. (2003) Chan, B., Mitchell, D., Cram, L., 2003. Archetypal analysis of galaxy spectra. Monthly Notices of the Royal Astronomical Society 338 (3), 790Ã¯Â¿Â½–795.
- Chen and Li (2017) Chen, Y., Li, B., 2017. An adaptive functional autoregressive forecast model to predict electricity price curves. Journal of Business & Economic Statistics 35 (3), 371–388.
- Cleveland (1979) Cleveland, W. S., 1979. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74 (368), 829–836.
- Cutler and Breiman (1994) Cutler, A., Breiman, L., 1994. Archetypal Analysis. Technometrics 36 (4), 338–347.
- Davis and Love (2010) Davis, T., Love, B., 2010. Memory for category information is idealized through contrast with competing options. Psychological Science 21 (2), 234–242.
- D’Esposito et al. (2012) D’Esposito, M. R., Palumbo, F., Ragozini, G., 2012. Interval Archetypes: A New Tool for Interval Data Analysis. Statistical Analysis and Data Mining 5 (4), 322–335.
- Dias et al. (2015) Dias, J. G., Vermunt, J. K., Ramos, S., 2015. Clustering financial time series: New insights from an extended hidden Markov model. European Journal of Operational Research 243 (3), 852 – 864.
- Dose and Cincotti (2005) Dose, C., Cincotti, S., 2005. Clustering of financial time series with application to index and enhanced index tracking portfolio. Physica A: Statistical Mechanics and its Applications 355 (1), 145 – 151.
- D’Urso et al. (2013) D’Urso, P., Cappelli, C., Lallo, D. D., Massari, R., 2013. Clustering of financial time series. Physica A: Statistical Mechanics and its Applications 392 (9), 2114 – 2129.
- D’Urso et al. (2016) D’Urso, P., Giovanni, L. D., Massari, R., 2016. GARCH-based robust clustering of time series. Fuzzy Sets and Systems 305 (Supplement C), 1 – 28.
- Epifanio (2016) Epifanio, I., 2016. Functional archetype and archetypoid analysis. Computational Statistics & Data Analysis 104, 24 – 34.
Epifanio et al. (2017)
Epifanio, I., Ibáñez, M. V., Simó, A., 2017. Archetypal shapes
based on landmarks and extension to handle missing data. Advances in Data
Analysis and Classification.
- Epifanio et al. (2013) Epifanio, I., Vinué, G., Alemany, S., 2013. Archetypal analysis: contributions for estimating boundary cases in multivariate accommodation problem. Computers & Industrial Engineering 64 (3), 757–765.
- Eugster and Leisch (2009) Eugster, M. J., Leisch, F., 2009. From Spider-Man to Hero - Archetypal Analysis in R. Journal of Statistical Software 30 (8), 1–23.
- Eugster (2012) Eugster, M. J. A., 2012. Performance profiles based on archetypal athletes. International Journal of Performance Analysis in Sport 12 (1), 166–187.
- Eugster and Leisch (2011) Eugster, M. J. A., Leisch, F., 2011. Weighted and robust archetypal analysis. Computational Statistics & Data Analysis 55 (3), 1215–1225.
- Febrero et al. (2008) Febrero, M., Galeano, P., González-Manteiga, W., 2008. Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels. Environmetrics 19 (4), 331–345.
- Fernandez and Barnard (2015) Fernandez, M., Barnard, A. S., 2015. Identification of nanoparticle prototypes and archetypes. ACS Nano 9 (12), 11980–11992.
- Fraiman and Svarc (2013) Fraiman, R., Svarc, M., 2013. Resistant estimates for high dimensional and functional data based on random projections. Computational Statistics & Data Analysis 58, 326 – 338.
- Fu (2011) Fu, T., 2011. A review on time series data mining. Engineering Applications of Artificial Intelligence 24 (1), 164 – 181.
- Hastie et al. (2009) Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning. Data mining, inference and prediction. 2nd ed., Springer-Verlag.
- Hinrich et al. (2016) Hinrich, J. L., Bardenfleth, S. E., Roge, R. E., Churchill, N. W., Madsen, K. H., Mørup, M., 2016. Archetypal analysis for modeling multisubject fMRI data. IEEE Journal on Selected Topics in Signal Processing 10 (7), 1160–1171.
- Huber (1964) Huber, P. J., 1964. Robust estimation of a location parameter. Annals of Mathematical Statistics 35 (1), 73–101.
- Ingrassia and Costanzo (2005) Ingrassia, S., Costanzo, G. D., 2005. Functional principal component analysis of financial time series. In: New Developments in Classification and Data Analysis: Proceedings of the Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society. Springer, Berlin, Heidelberg, pp. 351–358.
- James (2010) James, G., 2010. Oxford Handbook on Statistics and Functional Data Analysis. Oxford University Press, Ch. Sparseness and Functional Data Analysis, p. 298Ã¯Â¿Â½323.
- Kaufman and Rousseeuw (1990) Kaufman, L., Rousseeuw, P. J., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, New York.
- Kowal et al. (2017) Kowal, D. R., Matteson, D. S., Ruppert, D., 2017. Functional autoregression for sparsely sampled data. Journal of Business & Economic Statistics 0 (0), 1–13.
- Lawson and Hanson (1974) Lawson, C. L., Hanson, R. J., 1974. Solving Least Squares Problems. Prentice Hall.
- Li et al. (2003) Li, S., Wang, P., Louviere, J., Carson, R., 2003. Archetypal Analysis: A New Way To Segment Markets Based On Extreme Individuals. In: ANZMAC 2003 Conference Proceedings. pp. 1674–1679.
- Malioutov (2013) Malioutov, D., 2013. Beyond PCA for modeling financial time-series. In: 2013 IEEE Global Conference on Signal and Information Processing. pp. 1140–1140.
- Markowitz (1952) Markowitz, H., 1952. Portfolio selection. The Journal of Finance 7 (1), 77–91.
- Maronna et al. (2006) Maronna, R. A., Martin, D. R., Yohai, V. J., 2006. Robust Statistics: Theory and Methods. John Wiley and Sons, New York.
- Midgley and Venaik (2013) Midgley, D., Venaik, S., 2013. Marketing strategy in MNC subsidiaries: pure versus hybrid archetypes. In: P. McDougall-Covin and T. Kiyak, Proceedings of the 55th Annual Meeting of the Academy of International Business. pp. 215–216.
- Moliner and Epifanio (2018) Moliner, J., Epifanio, I., 2018. Bivariate functional archetypoid analysis: An application to financial time series. In: Mathematical and Statistical Methods for Actuarial Sciences and Finance. Springer, p. chapter 85.
- Mørup and Hansen (2012) Mørup, M., Hansen, L. K., 2012. Archetypal analysis for machine learning and data mining. Neurocomputing 80, 54–63.
- Peng (2008) Peng, R., 2008. A method for visualizing multivariate time series data. Journal of Statistical Software 25 (1), 1–17.
- Porzio et al. (2008) Porzio, G. C., Ragozini, G., Vistocco, D., 2008. On the use of archetypes as benchmarks. Applied Stochastic Models in Business and Industry 24, 419–437.
- QuantQuote (2017) QuantQuote, 09 2017. QuantQuote Free Historical Stock Data web-site. Retrieved on 15/09/2017 from https://quantquote.com/historical-stock-data.
R Development Core Team (2018)
R Development Core Team, 2018. R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna, Austria, ISBN
- Ragozini and D’Esposito (2015) Ragozini, G., D’Esposito, M. R., 2015. Archetypal networks. In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015. ACM, New York, NY, USA, pp. 807–814.
- Ragozini et al. (2017) Ragozini, G., Palumbo, F., D’Esposito, M. R., 2017. Archetypal analysis for data-driven prototype identification. Statistical Analysis and Data Mining: The ASA Data Science Journal 10 (1), 6–20.
- Ramsay and Silverman (2002) Ramsay, J. O., Silverman, B. W., 2002. Applied Functional Data Analysis. Springer.
- Ramsay and Silverman (2005) Ramsay, J. O., Silverman, B. W., 2005. Functional Data Analysis, 2nd Edition. Springer.
- Seiler and Wohlrabe (2013) Seiler, C., Wohlrabe, K., 2013. Archetypal scientists. Journal of Informetrics 7 (2), 345–356.
- Seth and Eugster (2016a) Seth, S., Eugster, M. J. A., 2016a. Archetypal analysis for nominal observations. IEEE Trans. Pattern Anal. Mach. Intell. 38 (5), 849–861.
- Seth and Eugster (2016b) Seth, S., Eugster, M. J. A., 2016b. Probabilistic archetypal analysis. Machine Learning 102 (1), 85–113.
- Sinova et al. (2018) Sinova, B., González-Rodríguez, G., Van Aelst, S., 2018. M-estimators of location for functional data. Bernoulli 24 (3), 2328–2357.
- Steinschneider and Lall (2015) Steinschneider, S., Lall, U., 2015. Daily precipitation and tropical moisture exports across the Eastern United States: An application of archetypal analysis to identify spatiotemporal structure. Journal of Climate 28 (21), 8585–8602.
- Stone and Cutler (1996) Stone, E., Cutler, A., 1996. Introduction to archetypal analysis of spatio-temporal dynamics. Physica D: Nonlinear Phenomena 96 (1), 110 – 131.
Su et al. (2017)
Su, Z., Hao, Z., Yuan, F., Chen, X., Cao, Q., 2017. Spatiotemporal variability
of extreme summer precipitation over the Yangtze river basin and the
associations with climate patterns. Water 9 (11).
- Sun et al. (2017) Sun, W., Yang, G., Wu, K., Li, W., Zhang, D., 2017. Pure endmember extraction using robust kernel archetypoid analysis for hyperspectral imagery. ISPRS Journal of Photogrammetry and Remote Sensing 131, 147 – 159.
- Theodosiou et al. (2013) Theodosiou, T., Kazanidis, I., Valsamidis, S., Kontogiannis, S., 2013. Courseware usage archetyping. In: Proceedings of the 17th Panhellenic Conference on Informatics. PCI ’13. ACM, New York, NY, USA, pp. 243–249.
- Thøgersen et al. (2013) Thøgersen, J. C., Mørup, M., Damkiær, S., Molin, S., Jelsbak, L., 2013. Archetypal analysis of diverse pseudomonas aeruginosa transcriptomes reveals adaptation in cystic fibrosis airways. BMC Bioinformatics 14, 279.
- Thurau et al. (2012) Thurau, C., Kersting, K., Wahabzada, M., Bauckhage, C., 2012. Descriptive matrix factorization for sustainability: Adopting the principle of opposites. Data Mining and Knowledge Discovery 24 (2), 325–354.
- Tsanousa et al. (2015) Tsanousa, A., Laskaris, N., Angelis, L., 2015. A novel single-trial methodology for studying brain response variability based on archetypal analysis. Expert Systems with Applications 42 (22), 8454 – 8462.
- Tsay (2010) Tsay, R. S., 2010. Analysis of Financial Time Series (3rd edition). Wiley, Hoboken, NJ.
- Tsay (2016) Tsay, R. S., 2016. Some methods for analyzing big dependent data. Journal of Business & Economic Statistics 34 (4), 673–688.
- Tseng and Li (2012) Tseng, J.-J., Li, S.-P., 2012. Quantifying volatility clustering in financial time series. International Review of Financial Analysis 23 (Supplement C), 11 – 19.
- Verdonck et al. (2011) Verdonck, T., Hubert, M., Rousseeuw, P., 2011. Robust covariance estimation for financial applications. In: Workshop on Actuarial and Financial Statistics.
- Vinué (2017) Vinué, G., 2017. Anthropometry: An R package for analysis of anthropometric data. Journal of Statistical Software 77 (6), 1–39.
- Vinué and Epifanio (2017) Vinué, G., Epifanio, I., 2017. Archetypoid analysis for sports analytics. Data Mining and Knowledge Discovery 31 (6), 1643–1677.
- Vinué et al. (2015) Vinué, G., Epifanio, I., Alemany, S., 2015. Archetypoids: A new approach to define representative archetypal data. Computational Statistics & Data Analysis 87, 102 – 115.
- Yahoo (2017) Yahoo, 09 2017. Yahoo Finance web-site. Retrieved on 15/09/2017 from https://es.finance.yahoo.com/lookup.