For the purpose of monitoring the behavior of complex infrastructures (e.g. aircraft, transport or energy networks), high-rate sensors are deployed to capture multivariate data, generally unlabeled, in quasi-continuous time in order to quickly detect the occurrence of anomalies that may jeopardize the smooth operation of the system of interest. The statistical analysis of such massive data of functional nature raises many challenging methodological questions. The primary goal of this paper is to extend the popular Isolation Forest (IF) approach to Anomaly Detection, originally dedicated to finite dimensional observations, to functional data. The major difficulty lies in the wide variety of topological structures that may equip a space of functions and the great variety of patterns that may characterize abnormal curves. We address the issue of (randomly) splitting the functional space in a flexible manner in order to progressively isolate any trajectory from the others, a key ingredient to the efficiency of the algorithm. Beyond a detailed description of the algorithm, computational complexity and stability issues are investigated at length. From the scoring function measuring the degree of abnormality of an observation provided by the proposed variant of the IF algorithm, a Functional Statistical Depth function is defined and discussed, as well as a multivariate functional extension. Numerical experiments provide strong empirical evidence of the accuracy of the proposed extension.
Functional Isolation Forest (2019)
Keywords: anomaly detection, functional data analysis, isolation forest, unsupervised learning
1 Introduction
The digital information boom, which goes hand in hand with recent technological advances in data collection and management (e.g. IoT, distributed platforms), offers new perspectives in many areas of human activity (e.g. transportation, energy, health, commerce, insurance), and confronts these domains with major scientific challenges for exploiting these observations. The ever growing availability of massive data, often collected in quasi-real time, has raised high expectations, in particular the need for increased automation and computational efficiency, with the goal of designing more and more 'intelligent' systems. In particular, modern high-rate sensors enabling the continuous observation of the behavior of complex systems pave the way for the design of efficient unsupervised machine-learning approaches to anomaly detection, which may find applications in various domains ranging from fraud surveillance to distributed fleet monitoring through predictive maintenance or health monitoring of complex systems. However, although many unsupervised learning procedures for anomaly detection (AD in abbreviated form) have been proposed, analyzed and applied in a variety of practical situations (see, e.g., (Chandola)), the case of functional data, though of crucial importance in practice (refer to (ramsey; ferraty) for an account of Functional Data Analysis), has received much less attention in the literature, and the vast majority of documented methods are model-based. The main barrier to the design of nonparametric anomaly detection techniques tailored to the functional framework lies in the huge diversity of patterns that may carry the information that is relevant to discriminate between abnormal and normal observations, see (rousseeuw).
It indeed seems far from straightforward to extend machine-learning methods for anomaly detection in the finite-dimensional case such as (ScottNowak06; SPSSW01; SHS05; VerVer06; ParkHuangDing10), unless preliminary filtering techniques are used. The filtering approach consists in projecting the functional data onto an adequate finite dimensional function subspace and then using the coefficients describing the projection to "feed" some AD algorithm for multivariate data (ramsey). The basis functions are either selected through Principal Component Analysis (they correspond in this case to elements of the Karhunen-Loève basis related to the process under study, supposedly of second order), or else are chosen among a dictionary of "time-frequency atoms" according to their capacity to represent the data efficiently. The representation chosen a priori, which can either artificially enhance certain accessory patterns or else make some crucial features disappear entirely, critically determines the performance of such an approach, the type of anomalies that can be recovered being essentially shaped by this choice.
The angle embraced in the present article is very different, the goal pursued being to extend the popular Isolation Forest methodology (LiuTZ08; LiuTZ12) to the functional setup. This ensemble learning algorithm builds a collection of isolation trees based on a recursive and randomized tree-structured partitioning procedure. An isolation tree is a binary tree, representing a nested collection of partitions of the finite dimensional feature space, grown iteratively in a top-down fashion, where the cuts are axis perpendicular and random (both the splitting direction and the splitting value are drawn uniformly). Incidentally, a variant referred to as Extended Isolation Forest (Hariri) has recently been proposed for the purpose of bias reduction: rather than randomly selecting a perpendicular split, a splitting direction is randomly chosen on the unit sphere. An anomaly score is assigned to any observation, depending on the length of the path necessary to isolate it from the rest of the data points, the rationale behind this approach being that anomalies should be easier to isolate in a random manner than normal (in the sense of 'non-abnormal') data. Beyond obvious advantages regarding computational cost, scalability (e.g. isolation trees can be built from subsamples) and interpretability, the great flexibility offered by Isolation Forest regarding the splitting procedure called recursively makes it appealing when it comes to isolating (multivariate) functions/curves, possibly exhibiting a wide variety of geometrical shapes. It is precisely the goal of this paper to introduce a new generic algorithm, Functional Isolation Forest (FIF), that generalizes (Extended) Isolation Forest to the infinite dimensional context. Avoiding dimensionality reduction steps, this extension is shown to preserve the assets of the original algorithm concerning computational cost and interpretability. Its efficiency is supported by strong empirical evidence through a variety of numerical results.
The paper is organized as follows. Section 2 recalls the principles underlying the Isolation Forest algorithm for AD in the multivariate case and introduces the framework we consider for AD based on functional data. In Section 3, the extension to the functional case is presented and its properties are discussed at length. In Section 4, we study the behavior of the new algorithm and compare its performance through experiments to alternative methods standing as natural competitors in the functional setup. In Section 5, the extension to multivariate functional data is considered, as well as the relation to the data depth function and an application to the supervised classification setting. Finally, several concluding remarks are collected in Section 6.
2 Background and Preliminaries
Here we briefly recall the Isolation Forest algorithm and its advantages (Section 2.1) and next introduce the framework for functional anomaly detection we consider throughout the paper (Section 2.2).
2.1 Isolation Forest
As a first go, we describe the Isolation Forest algorithm for AD in the multivariate context in a formalized manner for clarity's sake, as well as the Extended Isolation Forest version, see (LiuTZ08; LiuTZ12) and (Hariri) respectively. These two unsupervised algorithms can be viewed as Ensemble Learning methods insofar as they build a collection of binary trees and an anomaly scoring function based on the aggregation of the latter. Let $\mathcal{S}_n = \{x_1, \ldots, x_n\}$ be a training sample composed of $n$ independent realizations of a generic random variable, $X$, that takes its values in a finite dimensional Euclidean space, $\mathbb{R}^d$ say, $X = (X^{(1)}, \ldots, X^{(d)})$.
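An isolation tree (itree in abbreviated form) $\mathcal{T}$ of depth $J \geq 1$ is a proper binary tree that represents a nested sequence of partitions of the feature space $\mathbb{R}^d$. The root node corresponds to the whole space $\mathcal{C}_{0,0} = \mathbb{R}^d$, while any node of the tree, indexed by the pair $(j,k)$ where $j$ denotes the depth of the node with $0 \leq j < J$ and $k$, the node index with $0 \leq k \leq 2^j - 1$, is associated to a subset $\mathcal{C}_{j,k} \subset \mathbb{R}^d$. A non terminal node $(j,k)$ has two children, corresponding to disjoint subsets $\mathcal{C}_{j+1,2k}$ and $\mathcal{C}_{j+1,2k+1}$ such that $\mathcal{C}_{j,k} = \mathcal{C}_{j+1,2k} \cup \mathcal{C}_{j+1,2k+1}$. A node $(j,k)$ is said to be terminal if it has no children.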
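Each itree is obtained by recursively filtering a subsample of training data of size $\psi$ in a top-down fashion, by means of the following procedure. The dataset composed of the training observations present at a node $(j,k)$ is denoted by $\mathcal{S}_{j,k}$. At iteration $k + 2^j$ of the itree growing stage, a direction $m$ in $\{1, \ldots, d\}$, or equivalently a split variable $X^{(m)}$, is selected uniformly at random (and independently from the previous draws) as well as a split value $\kappa$ in the interval $[\min_{x \in \mathcal{S}_{j,k}} x^{(m)}, \max_{x \in \mathcal{S}_{j,k}} x^{(m)}]$ corresponding to the range of the projections of the points in $\mathcal{S}_{j,k}$ onto the $m$-th axis. The children subsets are then defined by $\mathcal{C}_{j+1,2k} = \mathcal{C}_{j,k} \cap \{x \in \mathbb{R}^d : x^{(m)} \leq \kappa\}$ and $\mathcal{C}_{j+1,2k+1} = \mathcal{C}_{j,k} \cap \{x \in \mathbb{R}^d : x^{(m)} > \kappa\}$, the children training datasets being defined as $\mathcal{S}_{j+1,2k} = \mathcal{S}_{j,k} \cap \mathcal{C}_{j+1,2k}$ and $\mathcal{S}_{j+1,2k+1} = \mathcal{S}_{j,k} \cap \mathcal{C}_{j+1,2k+1}$.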
An itree $\mathcal{T}$ is thus built by iterating this procedure until all training data points are isolated (or the depth limit $J$ set by the user is attained). A preliminary subsampling stage can be performed in order to avoid swamping and masking effects when the size of the dataset is too large. When it isolates every training data point, the itree contains exactly $\psi - 1$ internal nodes and $\psi$ terminal nodes. An itree constructed from a training subsample assigns to each training data point $x_i$ a path length $h_\tau(x_i)$, namely the depth at which it is isolated from the others, i.e. the number of edges $x_i$ traverses from the root node to the terminal node that contains $x_i$ alone. More generally, it can be used to define an anomaly score for any point $x \in \mathbb{R}^d$.
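The growing procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `grow_itree` and `path_length`, the dictionary-based node layout and the default depth limit are all hypothetical, and a node is simply turned into a leaf when the drawn axis shows no spread.

```python
import numpy as np

def grow_itree(points, rng, depth=0, max_depth=30):
    """Recursively grow an isolation tree on an (n, d) array of points."""
    n = points.shape[0]
    if n <= 1 or depth >= max_depth:
        return {"depth": depth, "size": n}      # terminal node
    m = rng.integers(points.shape[1])           # split variable, uniform over axes
    lo, hi = points[:, m].min(), points[:, m].max()
    if lo == hi:                                # simplification: no spread on this axis
        return {"depth": depth, "size": n}
    kappa = rng.uniform(lo, hi)                 # split value, uniform over observed range
    goes_left = points[:, m] <= kappa
    return {
        "axis": m,
        "kappa": kappa,
        "left": grow_itree(points[goes_left], rng, depth + 1, max_depth),
        "right": grow_itree(points[~goes_left], rng, depth + 1, max_depth),
    }

def path_length(tree, x):
    """Depth of the terminal node reached by x, i.e. number of edges from the root."""
    while "axis" in tree:
        tree = tree["left"] if x[tree["axis"]] <= tree["kappa"] else tree["right"]
    return tree["depth"]
```

On a Gaussian cloud with one far-away point added, the far-away point is expected to reach a terminal node after fewer cuts on average than a point near the center, which is the rationale of the method.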
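Anomaly Score prediction. As the terminal nodes of the itree $\mathcal{T}$ form a partition of the feature space, one may then define the piecewise constant function $h_\tau : \mathbb{R}^d \to \mathbb{N}$ by: $\forall x \in \mathbb{R}^d$,
$$h_\tau(x) = j \text{ if and only if } x \in \mathcal{C}_{j,k} \text{ and } (j,k) \text{ is a terminal node}.$$
This random path length is viewed as an indication of its degree of abnormality in a natural manner: ideally, the more abnormal the point $x$, the higher the probability that the quantity $h_\tau(x)$ is small. Hence, the algorithm above can be repeated $N \geq 1$ times in order to produce a collection of itrees $\mathcal{T}_1, \ldots, \mathcal{T}_N$, referred to as an iforest, that defines the scoring function
$$s_n(x) = 2^{-\frac{1}{N c(\psi)} \sum_{l=1}^{N} h_{\tau_l}(x)}, \qquad (1)$$
where $c(\psi)$ is the average path length of unsuccessful searches in a binary search tree, see (LiuTZ08) for further details.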
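As a hedged illustration of the scoring step, the sketch below aggregates path lengths into a score. It assumes the normalizing constant takes the form given in the original Isolation Forest paper, $c(\psi) = 2H(\psi - 1) - 2(\psi - 1)/\psi$ with $H(i) \approx \ln(i) + 0.5772$ (Euler's constant), and the names `c` and `anomaly_score` are ours.

```python
import math

EULER_GAMMA = 0.5772156649015329

def c(psi):
    """Average path length of an unsuccessful search in a binary search tree on psi points."""
    if psi <= 1:
        return 0.0
    if psi == 2:
        return 1.0
    harmonic = math.log(psi - 1) + EULER_GAMMA   # approximation of H(psi - 1)
    return 2.0 * harmonic - 2.0 * (psi - 1) / psi

def anomaly_score(path_lengths, psi):
    """Score in (0, 1]: 2 ** (-(mean path length over the trees) / c(psi))."""
    mean_depth = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_depth / c(psi))
```

With this normalization, a point whose mean path length equals $c(\psi)$ receives a score of 0.5, shorter paths push the score towards 1 and longer paths towards 0.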
Extended Isolation Forest. Observing that the geometry of the abnormal regions of the feature space is not necessarily well described by perpendicular splits (i.e. by unions of hypercubes of the cartesian product $\mathbb{R}^d$), a more flexible variant of the procedure recalled above has been proposed in (Hariri) for the purpose of bias reduction. Rather than selecting a direction in $\{1, \ldots, d\}$, one may choose a direction $u \in \mathbb{S}_{d-1}$, denoting by $\mathbb{S}_{d-1}$ the unit sphere of the Euclidean space $\mathbb{R}^d$. A node is then cut by choosing randomly and uniformly a threshold value in the range of the projections onto this direction of the training data points lying in the corresponding region. In the case where the distribution of $X$ has a density $f(x)$ w.r.t. a $\sigma$-finite measure $\lambda$ of reference, the goal of anomaly detection can be formulated as the recovery of sublevel sets $\{x \in \mathbb{R}^d : f(x) \leq q\}$, $q \geq 0$ (under mild assumptions, they are minimum volume sets or quantile regions, see (Polonik97; ScottNowak06), when measuring the volume by $\lambda$), which may not be accurately approximated by unions of hyperrectangles (in the Gaussian situation for instance, such regions are the complementary sets of ellipsoids, $\lambda$ being Lebesgue measure on $\mathbb{R}^d$).
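The oblique cut described in this paragraph can be sketched as follows. The helper names are hypothetical; the direction is obtained by normalizing a standard Gaussian vector, which yields a uniform draw on the unit sphere, and the threshold is drawn uniformly in the range of the projections, as in the description above.

```python
import numpy as np

def random_unit_direction(d, rng):
    """Uniform draw on the unit sphere of R^d (normalized Gaussian vector)."""
    u = rng.standard_normal(d)
    return u / np.linalg.norm(u)

def extended_split(points, rng):
    """Cut a node along a random direction; returns (direction, threshold, left mask)."""
    u = random_unit_direction(points.shape[1], rng)
    proj = points @ u                                # projections onto the direction
    kappa = rng.uniform(proj.min(), proj.max())      # threshold in the projection range
    return u, kappa, proj <= kappa
```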
2.2 Functional Data Analysis and Anomaly Detection
A functional random variable $X$ is a r.v. that takes its values in a space of functions, see, e.g., (ferraty). To be more specific, let $I \subset \mathbb{R}$ be a time interval and consider a r.v. taking its values in the Hilbert space $L_2(I)$ of real valued and square integrable (w.r.t. Lebesgue measure) functions $x : I \to \mathbb{R}$:
$$X : \Omega \to L_2(I), \quad \omega \mapsto X(\omega) = (X_t(\omega))_{t \in I}.$$
Without any loss of generality, we restrict ourselves to functions defined on $[0,1]$ throughout the paper. In practice, only a finite dimensional marginal $(X_{t_1}, \ldots, X_{t_p})$, $t_1 < \ldots < t_p$, $p \geq 1$ and $(t_1, \ldots, t_p) \in [0,1]^p$, can be observed. However, considering $(X_{t_1}, \ldots, X_{t_p})$ as a discretized curve rather than a simple random vector of dimension $p$ makes it possible to take into account the dependence structure between the measurements over time, especially when the time points $t_j$ are not equispaced. To come back to a function from discrete values, interpolation procedures or approximation schemes based on appropriate dictionaries can be used, combined with a preliminary smoothing step when the observations are noisy. From a statistical perspective, the analysis is based on a functional dataset composed of $n \geq 1$ independent realizations of finite-dimensional marginals of the stochastic process $X$, which may be very heterogeneous in the sense that these marginals may correspond to different time points and be of different dimensionality. One may refer to ramsey's book for a deep view on Functional Data Analysis (FDA in short). For simplicity, the functional data considered throughout the paper correspond to the observations of $n \geq 1$ independent realizations of $X$ at the same points.
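When curves are observed at different time points, one simple way to bring them onto a common grid is linear interpolation. The sketch below uses `numpy.interp` for this; linear interpolation is only one of the possible schemes mentioned above, and the function name `to_common_grid` and the toy curves are illustrative assumptions.

```python
import numpy as np

def to_common_grid(times, values, grid):
    """Linearly interpolate one curve observed at increasing `times` onto `grid`."""
    return np.interp(grid, times, values)

# Two curves observed at different, non-equispaced time points in [0, 1].
grid = np.linspace(0.0, 1.0, 11)
curve_a = to_common_grid(np.array([0.0, 0.3, 1.0]), np.array([0.0, 0.6, 2.0]), grid)
curve_b = to_common_grid(np.array([0.0, 0.5, 0.7, 1.0]), np.array([1.0, 1.0, 1.0, 1.0]), grid)
dataset = np.vstack([curve_a, curve_b])   # shape (n, p): n curves on the same p points
```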
In this particular context, functional anomaly detection aims at detecting the curves that significantly differ from the others among the dataset available. Given the richness of spaces of functions, the major difficulty lies in the huge diversity in the nature of the observed differences, which may not only depend on the locations of the curves. Following in the footsteps of rousseeuw, one may distinguish between three types of anomalies: shift (the observed curve has the same shape as the majority of the sample except that it is shifted away), amplitude or shape anomalies. All these three types of anomalies can be isolated/transient or persistent, depending on their duration with respect to that of the observations. One may easily admit that certain types of anomalies are harder to detect than others: for instance, an isolated anomaly in shape compared to an isolated anomaly in amplitude (i.e. change point). Although FDA has been the subject of much attention in recent years, very few generic and flexible methods tailored to functional anomaly detection are documented in the machine-learning literature to the best of our knowledge, except for specific types of anomalies (e.g. change-points).
In Statistics, although its applications are by no means restricted to AD, the concept of functional depth, which makes it possible to define a notion of centrality in the path space and a center-outward ordering of the curves of the functional dataset, see, e.g., (cuevas; claeskens; rousseeuw), has been used for this purpose. However, since the vast majority of functional depth functions introduced only describe the relative location properties of the sample curves, they generally fail to detect other types of anomalies. Another popular approach, usually referred to as filtering, consists in bringing the AD problem to the multivariate case by means of an adequate projection using Functional Principal Component Analysis (FPCA) (ramsey) or a preliminarily selected basis of the function space considered (e.g. Fourier, wavelets), and then applying an AD algorithm designed for the finite-dimensional setup to the resulting representation. Such methods have obvious drawbacks. In FPCA, estimation of the Karhunen-Loève basis can be very challenging and lead to loose approximations, jeopardizing the subsequent AD stage, while the a priori representation offered by the 'atoms' of a predefined basis or frame may fail to capture the patterns carrying the relevant information to distinguish abnormal curves from the others. Another approach is based on the notion of Minimum Volume sets (MV-sets in shortened version), originally introduced in (EinmahlMason92), which generalizes the concept of quantile to multivariate distributions and offers a nice nonparametric framework for anomaly detection in finite dimension, see ScottNowak06's work. Given the fact that no analogue of Lebesgue measure on an infinite-dimensional Banach space exists and since, considering a law of reference (e.g. the Wiener or a Poisson measure) on the function space of interest, the volume of a measurable subset can hardly be computed in general, it is far from straightforward to extend MV-set estimation to the functional setup.
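For concreteness, the filtering approach via FPCA discussed above can be sketched on discretized curves as a plain principal component analysis computed by singular value decomposition. This is a simplified stand-in under our own assumptions (common equispaced grid, no quadrature weights, no smoothing), and the name `fpca_scores` is hypothetical; the resulting low-dimensional scores would then be passed to any multivariate AD algorithm.

```python
import numpy as np

def fpca_scores(curves, n_components):
    """Project (n, p) discretized curves onto their first principal components."""
    centered = curves - curves.mean(axis=0)
    # Rows of vt are orthonormal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components
```

The quality of the subsequent detection then depends entirely on whether the retained components carry the patterns that distinguish abnormal curves, which is the drawback pointed out in the text.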
The angle embraced in this paper is quite different. The direct approach we promote here is free from any preliminary representation stage and can be straightforwardly applied to a functional dataset. Precisely, in the subsequent section, we propose to extend the IF algorithm to the functional data framework, in a very flexible way, so as to deal with a wide variety of anomaly shapes.
3 Functional Isolation Forest
We consider the problem of learning a score function $s : \mathcal{H} \to \mathbb{R}$ that reflects the degree of anomaly of elements in an infinite dimensional space $\mathcal{H}$ w.r.t. the distribution $P$ of $X$. By $\mathcal{H}$, we denote a functional Hilbert space equipped with a scalar product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ such that any $x \in \mathcal{H}$ is a real function defined on $[0,1]$. In the following, we describe in detail the proposed Functional Isolation Forest (FIF) algorithm and discuss its properties.
3.1 The FIF algorithm
A Functional Isolation Forest is a collection of Functional Isolation Trees (F-itrees) built from $\mathcal{S}_n = \{x_1, \ldots, x_n\}$, a training sample composed of $n$ independent realizations of a functional random variable, $X$, that takes its values in $\mathcal{H}$. Given a functional observation $x$, the score returned by FIF is a monotone transformation of the empirical mean of the path lengths $h_{\tau_l}(x)$ computed by the F-itrees $\mathcal{T}_l$, for $l = 1, \ldots, N$, as defined in Eq. 1 in the multivariate case. While the general construction principle depicted in Section 2.1 remains the same for an F-itree, dealing with functional values raises the issue of finding an adequate feature space to represent various properties of a function. A function may be considered as abnormal according to various criteria of location and shape, and the features should make it possible to measure such properties. Therefore four ingredients have been introduced to handle functional data in a general and flexible way: (i) a set of candidate Split variables and (ii) a scalar product both devoted to function representation, (iii) a probability distribution to sample from this set and select a single Split variable, (iv) a probability distribution to select a Split value. The entire construction procedure of an F-itree is described in Figure 1.
Function representation To define the set of candidate Split variables, a direct extension of the original IF algorithm (LiuTZ08) would be to randomly draw an argument value (e.g. time), and use functional evaluations at this point to split a node, but this boils down to relying only on instantaneous observations of functional data to capture anomalies, which in practice will usually be interpolated. Drawing a direction on a unit sphere as in (Hariri) is no longer possible due to the potentially excessive richness of $\mathcal{H}$. To circumvent these difficulties, we propose to project the observations on elements of a dictionary $\mathcal{D} \subset \mathcal{H}$ that is chosen to be rich enough to explore different properties of data and well suited to being sampled in a representative manner. More explicitly, given a function $d \in \mathcal{D}$, the projection of a function $x \in \mathcal{H}$ on $d$, $\langle x, d \rangle_{\mathcal{H}}$, defines a feature that partially describes $x$. When considering all the functions of the dictionary $\mathcal{D}$, one gets a set of candidate Split variables that provides a rich representation of the function $x$, depending on the nature of the dictionary. Dictionaries have been thoroughly studied in the signal processing community to achieve sparse coding of signals, see e.g. Mallat2. They also provide a way to incorporate a priori information about the nature of the data, a property very useful in an industrial context in which functional data often come from the observation of a well known device and thus can benefit from expert knowledge.
Sampling a Split variable Once a dictionary is chosen, a probability distribution $\nu$ on $\mathcal{D}$ is defined to draw a Split variable $d$. Note that the choice of the sampling distribution $\nu$ gives additional flexibility to orient the algorithm towards the search for specific properties of the functions.
Sampling a Split value Given a chosen Split variable $d$ and a current training dataset $\mathcal{S}_{j,k}$, a Split value is uniformly drawn in the real interval defined by the smallest and largest values of the projections on $d$ when considering the observations present in the node.
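The two sampling steps above can be sketched for a single node as follows. This is a minimal sketch and not the authors' implementation: the cosine atoms with uniformly drawn frequency and phase are a hypothetical choice of dictionary and sampling distribution, the scalar product is a trapezoidal approximation of the usual $L_2$ product on a common grid, and all function names are ours.

```python
import numpy as np

def l2_inner(f, g, t):
    """Trapezoidal approximation of the L2 scalar product of curves sampled on grid t."""
    y = f * g
    return np.sum((y[..., 1:] + y[..., :-1]) * np.diff(t) / 2.0, axis=-1)

def sample_cosine_atom(t, rng, max_freq=10.0):
    """Hypothetical sampling distribution: uniform frequency and phase."""
    freq = rng.uniform(0.0, max_freq)
    phase = rng.uniform(0.0, 2.0 * np.pi)
    return np.cos(2.0 * np.pi * freq * t + phase)

def fif_split(curves, t, rng):
    """One functional cut: returns (atom, split value, mask of curves sent left)."""
    d = sample_cosine_atom(t, rng)                   # Split variable drawn from the dictionary
    proj = l2_inner(curves, d, t)                    # one real-valued feature per curve
    kappa = rng.uniform(proj.min(), proj.max())      # Split value, uniform on the range
    return d, kappa, proj <= kappa
```

Applying `fif_split` recursively to the curves on each side of the cut, exactly as in the multivariate case, would yield one F-itree.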
Discussion on the dictionary The choice of a suitable dictionary plays a key role in the construction of the FIF anomaly score. The dictionary can consist of deterministic functions, incorporate stochastic elements, contain the observations from $\mathcal{S}_n$, or be a mixture of several of these options. In Computational Harmonic Analysis, a wide variety of bases or frames, such as wavelets, ridgelets, cosine packets, brushlets and so on, have been developed in the last decades in order to represent efficiently/parsimoniously functions, signals or images exhibiting specific forms of singularities (e.g. located at isolated points, along hyperplanes) and may provide massive dictionaries.
The following ones will be used throughout the article: Mexican hat wavelet dictionary (MHW), Brownian motion dictionary (B), Brownian bridge dictionary (BB), cosine dictionary (Cos), uniform indicator dictionary (UI), dyadic indicator dictionary (DI), and the self-data dictionary (Self) containing the dataset itself. See Sections B and C of the Supplementary Materials for detailed definitions of these dictionaries and further discussion on them, respectively.
Discussion on the scalar product Besides the dictionary, the scalar product defined on $\mathcal{H}$ brings some additional flexibility to measure different types of anomaly. While the $L_2$ scalar product allows for detection of location anomalies, the $L_2$ scalar product of derivatives (or slopes) would allow the detection of anomalies regarding shape. This last type of anomaly can be challenging; e.g. rousseeuw mention that shape anomalies are more difficult to detect, and mozha argue that one should consider both location and slope simultaneously for distinguishing complex curves. Beyond these two, a wide diversity of scalar products can be used, involving a variety of $L_2$-scalar products related to derivatives of certain orders, like in the definition of Banach spaces such as weighted Sobolev spaces, see Sobolev_book.
3.2 Ability of FIF to detect a variety of anomalies
As discussed in Section 2.2, most state-of-the-art methods focus on a certain type of anomaly and are unable to detect various deviations from the normal behavior. The flexibility of the FIF algorithm allows for choosing the scope of the detection by selecting both the scalar product and the dictionary. Nevertheless, by choosing an appropriate scalar product and dictionary, FIF is able to detect a great diversity of deviations from normal data. First, to account for both location and shape anomalies, we suggest the following scalar product that provides a compromise between the two
$$\langle f, g \rangle := \alpha \times \frac{\langle f, g \rangle_{L_2}}{\|f\| \, \|g\|} + (1 - \alpha) \times \frac{\langle f', g' \rangle_{L_2}}{\|f'\| \, \|g'\|}, \quad \alpha \in [0,1],$$
and illustrate its use right below. Thus, setting $\alpha = 1$ yields the classical $L_2$ scalar product, $\alpha = 0$ corresponds to the $L_2$ scalar product of derivatives, and $\alpha = 0.5$ is the Sobolev $W_{1,2}$ scalar product. To illustrate FIF's ability to detect a wide variety of anomalies at a time, we calculate the FIF anomaly scores with the Sobolev scalar product and the Gaussian wavelets dictionary for a sample consisting of curves defined as follows (inspired by (cuevas), see Fig. 2):
100 curves defined by with equispaced in ,
5 abnormal curves composed by one isolated anomaly with a jump in , one magnitude anomaly and three kind of shape anomalies , noised by on the interval and .
One can see that the five anomalies, although very different, are all detected by FIF with a significantly different score.
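The compromise between location and shape discussed in this section can be sketched numerically. The sketch assumes that the compromise takes the form of a convex combination, weighted by a parameter called `alpha` here, of a normalized $L_2$ product of the curves and a normalized $L_2$ product of their derivatives; derivatives are approximated by finite differences with `numpy.gradient`, and curves with zero norm or zero derivative are not handled.

```python
import numpy as np

def _l2(f, g, t):
    """Trapezoidal approximation of the L2 scalar product on grid t."""
    y = f * g
    return np.sum((y[1:] + y[:-1]) * np.diff(t) / 2.0)

def mixed_inner(f, g, t, alpha=0.5):
    """alpha * normalized <f, g> + (1 - alpha) * normalized <f', g'> (assumed form)."""
    df, dg = np.gradient(f, t), np.gradient(g, t)    # finite-difference derivatives
    location = _l2(f, g, t) / np.sqrt(_l2(f, f, t) * _l2(g, g, t))
    shape = _l2(df, dg, t) / np.sqrt(_l2(df, df, t) * _l2(dg, dg, t))
    return alpha * location + (1.0 - alpha) * shape
```

For instance, a curve and a vertically shifted copy of it have identical derivatives, so they are indistinguishable when `alpha` is 0 but differ markedly when `alpha` is 1, which illustrates why the weighting controls the type of anomaly that can be isolated.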
4 Numerical Results
In this section, we provide an empirical study of the proposed algorithm. First, in Section 4.1 we explore the stability and consistency of the score function w.r.t. the probability distribution of a r.v. and the sample size. Furthermore, we examine the influence of proposed dictionaries on the score function and bring performance comparisons with benchmark methods. Second, in Section 4.2, we benchmark the performance of FIF on several real labeled datasets by measuring its ability to recover an ”abnormal” class on the test set. In all experiments, the number of F-itrees is fixed to and the height limit is fixed to .
4.1 Impact of the Hyperparameters on Stability
Since functional data are more complex than multivariate data, and the dictionary constitutes an additional source of variance, the question of the stability of the FIF anomaly score estimates is of high interest. This issue is all the more important given the absence of theoretical developments, due to their challenging nature.
The empirical study is conducted on two simulated functional datasets presented in Fig. 3: Dataset (a) is the standard Brownian motion being a classical stochastic process widely used in the literature. Dataset (b) has been used by claeskens and has smooth paths. For each dataset, we choose/add four observations for which the FIF anomaly score is computed after training: a normal observation , two anomalies and , and a more extreme anomaly . We therefore expect the following ranking of the scores: , for both datasets.
Further, we provide an illustration of the empirical convergence of the score. All other parameters being fixed, we increase the number of observations when calculating the scores of the four selected observations; the empirical median and the boxplots of the scores computed over random draws of the dataset are shown in Fig. 4.
First, one observes score convergence and variance decrease in . Further, let us take a closer look at the score tendencies on the example of and . The score of first increases (for dataset (a)) and slightly decreases (for dataset (b)) with growing until reaches , which happens because this abnormal observation is isolated quite fast (and thus has short path length) but the in the denominator of the exponent of (1) increases in . For , the score of decreases in since overestimates the real path length of for subsamples in which it is absent; frequency of such subsamples grows in and equals, e.g., for . On the other hand, this phenomenon allows to unmask grouped anomalies as mentioned in (LiuTZ08). The behavior is reciprocal for the typical observation . Its FIF anomaly score starts by decreasing in since tends to belong to the deepest branches of the trees and is always selected while . For larger , the path length of is underestimated for subsamples where it is absent when growing the tree, which explains slight increase in the score before it stabilizes.
A second experiment illustrated in Fig. 5 is conducted to measure the impact of various dictionaries shortly cited in Section 3 and more thoroughly described in Section B of the Supplementary Materials; scalar product is used. One observes that the variance of the score seems to be mostly stable across dictionaries, for both datasets. Thus, random dictionaries like uniform indicator (UI) or Brownian motion (B) do not introduce additional variance into the FIF score. Since we know the expected ranking of the scores, we can observe that FIF relying on the Self, UI, and dyadic indicator (DI) dictionaries fails to make a strong difference between and . Since differs only slightly in the amplitude from the general pattern, these dictionaries seem insufficient to capture this fine dissimilarity: while Self and DI dictionaries simply do not contain enough elements, UI dictionary is too simple to capture this difference (it shares this feature with DI dictionary). For the scalar product on derivatives (see Fig. 18 in the Supplementary Materials), distinguishing anomalies for the Brownian motion becomes difficult since they differ mainly in location, while for a sine function the scores resemble those with the usual scalar product. Thus, even though capturing different types of anomalies is one of the general strengths of the FIF algorithm (as seen in Section 3.2), the dictionary may still have an impact on detection of functional anomalies in particular cases.
More experiments were run regarding the stability of the algorithm but, for the sake of space, we describe them in Section C of the Supplementary Materials.
4.2 Real Data Benchmarking
To explore the performance of the proposed FIF algorithm, we conduct a comparative study using classification datasets from the UCR repository (UCRArchive). We consider the largest class as normal and some of the others as anomalies (see Table 1 for details). When classes are balanced, i.e. for 9 datasets out of 13, we keep only part of the anomaly class to reduce its size, always taking the same observations (at the beginning of the table) for a fair comparison. Since the datasets are already split into train/test sets, we use the train part (without labels) to build the FIF and compute the score on the test set. We assess the performance of the algorithm by measuring the Area Under the Receiver Operating Characteristic curve (AUC) on the test set. Separate train and test sets are rarely used in the unsupervised setting, since labels are unavailable when fitting the model. Thus, when the models are fitted on unlabeled training data, good performance on the test set indicates good generalization power.
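The AUC used as the evaluation criterion can be computed directly from anomaly scores and test labels through its rank interpretation: it is the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal observation, ties counting for one half. The sketch below is a plain NumPy illustration of this definition (the function name `auc` is ours); it is quadratic in the sample size and only meant for small test sets.

```python
import numpy as np

def auc(scores, labels):
    """AUC from anomaly scores; labels are 1 (anomaly) or 0 (normal)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]               # all anomaly/normal score pairs
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))
```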
Competitors FIF is considered with two finite size dictionaries, dyadic indicator and self-data, and with the infinite size cosine dictionary (with and ); its parameters are set , and the height limit to . We contrast the FIF method with the three most used multivariate anomaly detection techniques and two functional depths, with default settings. The multivariate methods, namely isolation forest (IF) (LiuTZ08), local outlier factor (LOF) (Breunig), and one-class support vector machine (OCSVM) (SPSSW01), are employed after dimension reduction by Functional PCA keeping principal components with largest eigenvalues after a preliminary step of filtering using the Haar basis. The depths are the random projection halfspace depth (cuevas) and the functional Stahel-Donoho outlyingness (rousseeuw).
Analysis of the results Given the complexity of functional data, no method performs best across the board, as expected. Nevertheless, FIF performs well in most of the cases, giving best results for datasets and second best for datasets. It is worth mentioning that the dictionary plays an important role in identifying anomalies, while FIF seems to be rather robust w.r.t. other parameters. The "CinECGTorso" dataset contains anomalies differing in location shift, which are captured by the cosine dictionary. The dyadic indicator dictionary makes it possible to detect local anomalies in the "TwoLeadECG" and "Yoga" datasets. The self-data dictionary seems suited for the "SonyRobotAI2" and "StarLightCurves" datasets, whose challenge is to cope with many different types of anomalies.
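Table 1: Datasets considered in the performance comparison. p is the number of discretization points; the training and testing columns give the number of anomalies / total number of curves, with the share of anomalies in the training set in parentheses.
| Dataset | p | training: anomalies / total (% anomalies) | testing: anomalies / total | normal label | anomaly label |
|---|---|---|---|---|---|
| Chinatown | 24 | 4 / 14 (29%) | 95 / 345 | 2 | 1 |
| Coffee | 286 | 5 / 19 (26%) | 6 / 19 | 1 | 0 |
| ECGFiveDays | 136 | 2 / 16 (12%) | 53 / 481 | 1 | 2 |
| ECG200 | 96 | 31 / 100 (31%) | 36 / 100 | 1 | -1 |
| Handoutlines | 2709 | 362 / 1000 (36%) | 133 / 370 | 1 | 0 |
| SonyRobotAI1 | 70 | 6 / 20 (30%) | 343 / 601 | 2 | 1 |
| SonyRobotAI2 | 65 | 4 / 20 (20%) | 365 / 953 | 2 | 1 |
| StarLightCurves | 1024 | 100 / 673 (15%) | 3482 / 8236 | 3 | 1 and 2 |
| TwoLeadECG | 82 | 2 / 14 (14%) | 570 / 1139 | 1 | 2 |
| Yoga | 426 | 10 / 173 (6%) | 1393 / 3000 | 2 | 1 |
| EOGHorizontal | 1250 | 10 / 40 (25%) | 30 / 61 | 5 | 6 |
| CinECGTorso | 1639 | 4 / 16 (25%) | 345 / 688 | 3 | 4 |
| ECG5000 | 140 | 31 / 323 (10%) | 283 / 2910 | 1 | 3, 4 and 5 |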
5 Extensions of FIF
Extension to multivariate functions FIF can be easily extended to multivariate functional data, i.e. when the quantity of interest lies in $\mathbb{R}^d$ for each moment of time. For this, the coordinate-wise sum of the $d$ corresponding scalar products is used to project the data onto a chosen dictionary element: $\langle f, g \rangle_{\mathcal{H}^{\otimes d}} := \sum_{i=1}^{d} \langle f^{(i)}, g^{(i)} \rangle_{\mathcal{H}}$. The dictionary is then defined in $\mathcal{H}^{\otimes d}$, e.g., by componentwise application of one or several univariate dictionaries, see Section 3. In the Supplementary Materials we give an illustration of multivariate functional anomaly detection on the MNIST (Lecun) dataset, each digit being seen as a 2D-curve.
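The coordinate-wise sum of scalar products mentioned above is straightforward to sketch for discretized multivariate curves. The layout (one row per coordinate function), the trapezoidal $L_2$ product and the function names are assumptions made for illustration only.

```python
import numpy as np

def l2_inner(f, g, t):
    """Trapezoidal approximation of the L2 scalar product along the last axis."""
    y = f * g
    return np.sum((y[..., 1:] + y[..., :-1]) * np.diff(t) / 2.0, axis=-1)

def multivariate_inner(f, g, t):
    """Sum over coordinates of univariate scalar products; f and g have shape (d, p)."""
    return float(np.sum(l2_inner(f, g, t)))
```

Here `g` would play the role of a multivariate dictionary element built componentwise, and the resulting real number is the feature on which a node is split.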
Connection to data depth Regarding the FIF score as an anomaly ranking yields a connection to the notion of statistical depth function (see (Mosler2013) for an overview), which has been successfully applied in outlier detection (see, e.g., (rousseeuw)). Statistical data depth was introduced as a measure of centrality (or depth) of an arbitrary observation $\mathbf{x}$ with respect to the data at hand $\mathcal{S}$. A data depth measure based on the FIF score $s(\mathbf{x}; \mathcal{S})$ can be defined for (multivariate) functional data as: $D_{FIF}(\mathbf{x}; \mathcal{S}) = 1 - s(\mathbf{x}; \mathcal{S})$. Data depth proves to be a useful tool for a low-dimensional data representation called the depth-based map. Using this property, LI and mozha define a DD-plot classifier, which consists in applying a multivariate classifier to the depth-based map. A low-dimensional representation is of particular interest for functional data, and a DD-plot classifier can be defined using the FIF-based data depth. Let $\mathcal{S} = \mathcal{S}^1 \cup \dots \cup \mathcal{S}^q$ be a training set for supervised classification containing $q$ classes, each subset $\mathcal{S}^j$ standing for class $j$. The depth map is defined as follows: $\mathbf{x} \mapsto \phi(\mathbf{x}) = \left( D_{FIF}(\mathbf{x}; \mathcal{S}^1), \dots, D_{FIF}(\mathbf{x}; \mathcal{S}^q) \right) \in [0,1]^q$.
As an illustration, we apply the depth map to three digits of the MNIST dataset (with the same number of observations per digit for training and testing) after their transformation to bivariate functions using the skimage Python library (see Figure 6). One observes an appealing geometrical interpretation (note, e.g., the location of the observations that are abnormally distant from their corresponding classes) and a clear separation of the classes. To illustrate separability, we apply a linear multiclass (one-against-all) SVM in the depth space and measure its accuracy on the test data.
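A depth map of this kind can be sketched in a few lines. The sketch below rests on several assumptions: the depth is taken as one minus an anomaly score in [0, 1], `toy_score` is a simple stand-in and not the FIF score, and a maximum-depth rule replaces the linear SVM for brevity.

```python
import numpy as np

def depth_map(x, class_samples, score_fn):
    """Map a curve to the vector of its depths w.r.t. each class.

    score_fn(x, sample) is a hypothetical anomaly scorer with values in [0, 1];
    the depth is assumed to be one minus that score.
    """
    return np.array([1.0 - score_fn(x, s) for s in class_samples])

def toy_score(x, sample):
    """Stand-in scorer (NOT FIF): distance to the class mean, squashed into [0, 1)."""
    dist = np.sqrt(np.mean((x - sample.mean(axis=0)) ** 2))
    return dist / (1.0 + dist)

# Two hypothetical classes of noisy curves and one new curve to be mapped.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
class_a = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal((20, t.size))
class_b = np.cos(2 * np.pi * t) + 0.1 * rng.standard_normal((20, t.size))
new_curve = np.sin(2 * np.pi * t)

phi = depth_map(new_curve, [class_a, class_b], toy_score)
predicted_class = int(np.argmax(phi))  # maximum-depth rule, a simplification
```

In practice, the vectors `phi` computed for all training curves would form the depth space in which a multivariate classifier is trained.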
The Functional Isolation Forest algorithm, an extension of Isolation Forest to functional data, has been proposed. The combined choice of the dictionary itself, the probability distribution used to pick a Split variable and the scalar product used for the projection gives FIF great flexibility in detecting anomalies for a variety of tasks. FIF can be extended to multivariate functional data. When transformed into a data depth, FIF can be used for supervised classification via a low-dimensional representation, the depth space. The open-source implementation of the method, along with all reproducing scripts, can be accessed at https://github.com/Gstaerman/FIF.
A Illustrative figures
B Presentation of used dictionaries
In this part, we properly define every dictionary used in the paper.
Self-data dictionary (Self) consisting of the training dataset itself.
Brownian motion dictionary (B) is a combination of the space of continuous functions on [0,1] and the Wiener measure on this space.
Brownian bridge dictionary (BB) is a combination of the space of continuous functions on [0,1] and the Brownian bridge measure on this space.
Cosine dictionary (Cos) consisting of curves with the following forms:
with and .
Mexican hat wavelet dictionary (MHW) consists of the negative second derivatives of the normal density, shifted and scaled in an appropriate fashion:
with and .
Dyadic indicator dictionary (DI) consisting of a set of indicator functions on the elements of a binary partitioning, for a given partitioning depth (chosen according to the granularity to be captured or to discretisation considerations), having as elements:
Uniform indicator dictionary (UI) consists of indicator functions on intervals whose two endpoints are chosen uniformly on the domain, the lower endpoint being smaller than the upper one.
Dyadic indicator derivative (DId) consisting of a set of slope functions whose derivatives are indicator functions on the elements of a binary partitioning, for a given partitioning depth (chosen according to the granularity to be captured or to discretisation considerations), having as elements:
Uniform indicator derivative (UId) consists of slope functions whose derivatives are indicator functions on intervals whose two endpoints are chosen uniformly on the domain, the lower endpoint being smaller than the upper one.
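For concreteness, a few of these dictionaries can be generated on a discretisation grid as sketched below. The function names, the parametrisations (amplitude, frequency, shift, scale, partitioning depth) and the unnormalised Mexican hat are assumptions made for illustration; they are not the exact parameter ranges used in the paper.

```python
import numpy as np

def cosine_atom(t, amplitude, frequency):
    """A cosine curve with a given amplitude and frequency."""
    return amplitude * np.cos(2.0 * np.pi * frequency * t)

def mexican_hat_atom(t, shift, scale):
    """Mexican hat shape (up to a constant factor), shifted and scaled."""
    u = (t - shift) / scale
    return (1.0 - u ** 2) * np.exp(-0.5 * u ** 2)

def dyadic_indicator_atoms(t, depth):
    """Indicators of the dyadic intervals [a / 2^k, (a + 1) / 2^k), k = 1..depth."""
    atoms = []
    for k in range(1, depth + 1):
        for a in range(2 ** k):
            lo, hi = a / 2 ** k, (a + 1) / 2 ** k
            if hi < 1.0:
                mask = (t >= lo) & (t < hi)
            else:
                mask = t >= lo  # close the last interval so that t = 1 is covered
            atoms.append(mask.astype(float))
    return np.array(atoms)

def uniform_indicator_atom(t, rng):
    """Indicator of an interval whose endpoints are drawn uniformly on [0, 1]."""
    a, b = np.sort(rng.uniform(0.0, 1.0, size=2))
    return ((t >= a) & (t <= b)).astype(float)

t = np.linspace(0.0, 1.0, 129)
atoms = dyadic_indicator_atoms(t, depth=3)  # 2 + 4 + 8 = 14 atoms
ui = uniform_indicator_atom(t, np.random.default_rng(0))
```

The slope versions (DId, UId) could be obtained from the indicator atoms by cumulative integration along the grid, so that their derivatives are again indicator functions.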
C Further discussion on the choice of dictionary
To illustrate the discussion on dictionaries, especially the incorporation of stochastic elements and external information, we give an example of the use of the Brownian motion dictionary. We define the Brownian motion dictionary (B) as the Split variable space induced by the space of continuous functions on [0,1] equipped with the Wiener measure. Although seemingly universal, this dictionary explores almost the entire argument space evenly, and in practice can be unable to detect isolated anomalies. In Fig. 9 we plot the following synthetic dataset:
30 curves defined by on and on with equispaced in .
1 abnormal curve with the same shape but that is shifted at the beginning and whose continuation is deep in the preceding curves.
One can see that the anomaly is not detected; the curve flagged as anomalous (the one with the highest anomaly score) lies on the fringe of the dataset instead. A simplified way to incorporate prior knowledge is, e.g., to add to the sampling measure a Dirac mass at the indicator function of the interval of interest, with suitable weights; this assigns the highest anomaly score to the desired observation. In the sequel, the distribution used to draw Split variables is uniform unless explicitly stated otherwise.
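Sampling a Split variable from such a mixture can be sketched as follows. The names `brownian_path` and `sample_split_variable`, the interval (0, 0.1) and the weight 0.5 are purely hypothetical choices for illustration; the weights used in the paper are not reproduced here.

```python
import numpy as np

def brownian_path(t, rng):
    """Sample a standard Brownian motion on the grid t (with t[0] = 0)."""
    increments = rng.standard_normal(t.size - 1) * np.sqrt(np.diff(t))
    return np.concatenate([[0.0], np.cumsum(increments)])

def sample_split_variable(t, interval, dirac_weight, rng):
    """Draw from a mixture of a Dirac mass and the Wiener measure.

    With probability dirac_weight, return the indicator function of `interval`;
    otherwise return a Brownian path.
    """
    if rng.random() < dirac_weight:
        lo, hi = interval
        return ((t >= lo) & (t <= hi)).astype(float)
    return brownian_path(t, rng)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 101)
split_variable = sample_split_variable(t, interval=(0.0, 0.1), dirac_weight=0.5, rng=rng)
```

Raising `dirac_weight` makes the forest split more often along the interval where the anomaly is known to deviate, at the price of exploring the rest of the domain less.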
When prior knowledge is insufficient, e.g. when one only knows that local features of the functional data matter but not the precise interval, one would like to use a dictionary exploring different localities. To illustrate the possible advantage of this approach, we use the Mexican hat wavelet and Dyadic indicator dictionaries. Consider the “Chinatown” dataset (UCRArchive), which represents the pedestrian count in Chinatown-Swanston St North over a period of months within a given year. Functions recorded on working days represent normal observations, and functions recorded on weekends are taken as anomalies (Figure 10). One observes that while the Mexican hat wavelet dictionary correctly detects part of the anomalies, due to its smooth nature it is distracted by two normal curves with high deviation on the second half of the domain. Having sharp edges and being non-zero only in a small part of the domain, the dyadic indicator dictionary detects all four abnormal observations. Nevertheless, it is not adapted to a scalar product that involves derivatives. To adapt the dyadic indicator dictionary and the uniform indicator dictionary to a scalar product involving derivatives, we define their slope versions, DId and UId respectively, with the notation defined above (i.e. their derivatives become indicator functions). Clearly, this list can be extended with further task-specific dictionaries.
So far, we have considered dictionaries that are independent of the data. Nevertheless, one can use the observations, or a transform of them, as a dictionary: projections onto normal and abnormal observations should differ between normal curves and anomalies; this suggests the self-data dictionary (Self). This can be extended to the local self-data dictionary, which consists of the product of the self-data dictionary with the uniform indicator dictionary. As an example, we apply this to the “ECG5000” dataset plotted in Figure 11, where, unlike the cosine dictionary, it allows detecting all abnormal observations.
To conclude, we provide a last example (see Fig. 12) that highlights the impact of the choice of scalar product. To illustrate how the score changes with the weight given to location versus derivatives in the scalar product, we calculate the FIF anomaly scores with the location-only and the derivative-only scalar products for a sample of 100 curves defined as follows (inspired by cuevas, see Fig. 12):
90 curves defined by with equispaced in ,
10 abnormal curves defined by noised by on the interval .
One can see that even though the noisy curves are abnormal with respect to the majority of the data, they are considered normal when only location is taken into account. On the other hand, they are easily distinguished by their high anomaly scores when derivatives are examined.
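The effect of taking derivatives into account can be sketched numerically. The code below is an illustrative assumption, not the scalar product of the paper: it uses an unnormalised convex combination with a hypothetical weight `alpha`, finite-difference derivatives, and a deterministic high-frequency perturbation in place of random noise.

```python
import numpy as np

def l2_inner(f, g, t):
    """Trapezoidal approximation of the L2 scalar product on the grid t."""
    h = f * g
    return float(np.sum((h[1:] + h[:-1]) * np.diff(t)) / 2.0)

def weighted_inner(f, g, t, alpha):
    """alpha * <f, g> + (1 - alpha) * <f', g'>, derivatives by finite differences.

    The weight name `alpha` and the absence of normalisation are assumptions.
    """
    df, dg = np.gradient(f, t), np.gradient(g, t)
    return alpha * l2_inner(f, g, t) + (1.0 - alpha) * l2_inner(df, dg, t)

t = np.linspace(0.0, 1.0, 2001)
smooth = np.sin(2 * np.pi * t)
wiggly = smooth + 0.01 * np.sin(200 * np.pi * t)  # small, fast perturbation

# Location only (alpha = 1): the two curves are almost indistinguishable.
loc_smooth = weighted_inner(smooth, smooth, t, alpha=1.0)
loc_wiggly = weighted_inner(wiggly, wiggly, t, alpha=1.0)

# Derivatives only (alpha = 0): the perturbation dominates.
der_smooth = weighted_inner(smooth, smooth, t, alpha=0.0)
der_wiggly = weighted_inner(wiggly, wiggly, t, alpha=0.0)
```

A perturbation that is negligible in amplitude can thus carry as much energy as the signal itself once derivatives are compared, which is why the noisy curves only stand out in the derivative-based setting.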
C.1 Direction importance of finite-size dictionaries
Although feature importance has been tackled for supervised random trees (see e.g. Br01, Geurts06), it has not been addressed in the Isolation Forest literature (see LiuTZ08, LiuTZ12 and Hariri). Since Isolation Forest is a highly randomized procedure, there is no straightforward way to carry over the definitions of feature importance from the supervised setting. Nevertheless, interpretability of models is of interest in many anomaly detection applications, especially when dealing with functional data, where curves contain a lot of information. Thus, it is rewarding to obtain an a posteriori sparse representation of the dictionary, corresponding to the discriminating directions that matter most in the construction of the model. Furthermore, studying the dispersion of the projection coefficients along a direction (e.g. multi-modality) could bring some information on the distribution of normal data. To extend this notion to the Functional Isolation Forest algorithm, we propose two ways to evaluate the importance of the elements of the dictionary for discriminating anomalous curves. The general idea is to give importance to the elements of the dictionary that allow discriminating between the curves of the sample. The naive approach is to add “+1” to each element of the dictionary for which an instance of the node sample is isolated (except for cells with only two instances), so that good directions are those with a high score after the forest construction. A more adaptive approach is to use a weighted gain, since curves isolated at nodes closer to the root should be rewarded more. To do this, we give a reward that depends on the size of the node sample where a curve is isolated. Precisely, the reward is equal to the size of the node sample divided by the size of the (sub)sample used to build the tree. An example of the latter is given in Figure 13. The experiment is conducted on the real-world CinECGTorso dataset (more details in Section 4.2). We use FIF with the Dyadic indicator dictionary.
As we can see, the two most important elements of the dictionary are indicator functions that localize the peak where the anomalies really differ from the normal curves. This brings some interpretability to a “black box” procedure.
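Both importance measures can be sketched as a simple accumulation over the isolation events recorded while building the forest. The function name, the event format (dictionary element identifier, node sample size) and the toy events are hypothetical; excluding two-instance cells in the weighted variant as well is an assumption.

```python
from collections import defaultdict

def direction_importance(isolation_events, subsample_size, weighted=True):
    """Accumulate an importance score per dictionary element.

    isolation_events: iterable of (element_id, node_size) pairs, one per split
    that isolated a curve; node_size is the number of curves in the split node.
    """
    importance = defaultdict(float)
    for element_id, node_size in isolation_events:
        if node_size <= 2:
            continue  # cells with only two instances are not rewarded
        if weighted:
            importance[element_id] += node_size / subsample_size
        else:
            importance[element_id] += 1.0
    return dict(importance)

# Hypothetical events collected from trees built on subsamples of size 64.
events = [("d3", 64), ("d3", 32), ("d7", 4), ("d7", 2)]
weighted_scores = direction_importance(events, subsample_size=64)
naive_scores = direction_importance(events, subsample_size=64, weighted=False)
```

With the weighted reward, an element that isolates a curve at the root contributes a full unit, whereas an isolation deep in the tree contributes only a small fraction.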
D Study of the parameters of FIF
In this section, we present the results of a simulation study of the variance of the FIF algorithm. The experiments were conducted on the datasets (a) and (b) from Section 4 (see also Figure 3 of Section 4), for each of the four specified observations, using the following settings (except for the parameter being varied):
Dictionary: Gaussian wavelets (negative second derivative of the standard Gaussian density) with random variance selected in an uniform way in and a translation parameter selected randomly in . We fixed the size of the dictionary to 1000.
Scalar product: dot product.
Size of the dataset: .
Subsampling size: .
The number of trees: .
The height limit: fixed to .
The figures below show boxplots of the FIF anomaly score over repeated runs. An empirical study of the FIF anomaly score and its variance when increasing the number of F-itrees is depicted in Figure 14.
An empirical study of the FIF anomaly score and its variance when increasing the subsample size is depicted in Figure 15.
An empirical study of the FIF anomaly score and its variance when increasing the height limit of the F-itree is depicted in Figure 16.
Taking finite-size versions of the infinite Gaussian wavelets dictionary, an empirical study of the FIF anomaly score and its variance when increasing the size of the dictionary is depicted in Figure 17.
An empirical study of the FIF anomaly score for a variety of dictionaries with the scalar product of the derivatives is depicted in Figure 18.
Analysis of the results of Section D In a first experiment, we show the boxplots of the score estimated by FIF when increasing the number of F-itrees and observe that, as expected, the variance diminishes as the number of trees grows (see Figure 14). We also see in Figure 15 that with an increasing subsample size the FIF anomaly score increases for anomalies, since they are more often present in the subsample and are thus isolated faster (with a shorter path length) when calculating the score than when they are absent from the subsample; the effect is reversed for normal observations. A similar behavior is observed with an increasing height limit in Figure 16. The variance of the score tends to increase slightly with the subsample size and the height limit because of more observations/branching possibilities. If the dictionary is sufficiently rich, its size does not influence the FIF anomaly score, and the variance stabilizes relatively fast as the size of the dictionary grows (see Figure 17), which encourages the use of massive (and infinite-size) dictionaries.
E Complementary results on the performance comparison
E.1 Benchmark datasets
Here, we plot the thirteen benchmark training datasets used in the experiment. Anomalies are drawn in blue and normal data in red.
E.2 Functional depth
In this part, we present further functional depths, which are outperformed (on average) on the real-world datasets by the two depth functions presented in Section 4, and we display their AUC performance. SFD corresponds to the simplicial integrated depth, HFD to the halfspace integrated depth, RP-SD to the random projection method with simplicial depth, RP-RHD to the random projection method with random halfspace depth, fAO to the functional adjusted outlyingness, fDO to the functional directional outlyingness and fbd to the functional bagdistance. The reader is referred to cuevas; Fraiman2001; rousseeuw for the bibliography on the functional data depth notions employed.
E.3 Isolation Forest after dimension reduction by filtering methods on the benchmark datasets
Here, we show the results of the filtering approach using 106 bases from the PyWavelets Python library and the Fourier basis. We then apply (multivariate) Isolation Forest to the projection coefficients and display the AUC performance.
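The filtering step can be sketched for the Fourier basis as follows. This is a minimal illustration under assumptions: the function name `fourier_features`, the number of retained coefficients and the random toy curves are hypothetical, and the discrete Fourier transform stands in for the projection onto a Fourier basis.

```python
import numpy as np

def fourier_features(curves, n_coefficients):
    """Project each sampled curve on the leading Fourier modes.

    curves: array of shape (n, p). Returns an (n, 2 * n_coefficients) array with
    the real and imaginary parts of the first discrete Fourier coefficients.
    """
    coeffs = np.fft.rfft(curves, axis=1)[:, :n_coefficients] / curves.shape[1]
    return np.hstack([coeffs.real, coeffs.imag])

# Hypothetical data: 5 curves sampled at 64 points, reduced to 16 features each.
rng = np.random.default_rng(0)
curves = rng.standard_normal((5, 64))
features = fourier_features(curves, n_coefficients=8)
constant_features = fourier_features(np.ones((1, 64)), n_coefficients=8)
```

The resulting feature matrix could then be passed to any multivariate Isolation Forest implementation (for instance scikit-learn's `IsolationForest`) to obtain anomaly scores.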
E.4 IF with different preliminary filtering steps on the benchmark datasets
Here, we show the results of the FPCA approach using the Fourier basis and further bases from the PyWavelets Python library as a preliminary filtering stage. We then apply (multivariate) Isolation Forest to the projection coefficients and display the AUC performance.
F Multivariate Functional Isolation Forest and depth mapping
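FIF can be easily extended to multivariate functional data, i.e. when the quantity of interest lies in $\mathbb{R}^d$ at each moment of time.
For this, the coordinate-wise sum of the $d$ corresponding scalar products is used to project the data onto a chosen dictionary element: $\langle \mathbf{f}, \mathbf{g} \rangle_{\mathcal{H}^{\otimes d}} := \sum_{i=1}^{d} \langle f^{(i)}, g^{(i)} \rangle_{\mathcal{H}}$.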