PyODDS: An End-to-End Outlier Detection System
PyODDS is an end-to-end Python system for outlier detection with database support. It provides outlier detection algorithms that meet the needs of users in different fields, with or without a data science or machine learning background. PyODDS can execute machine learning algorithms in-database, without moving data out of the database server or over the network. It also provides access to a wide range of outlier detection algorithms, from statistical analysis to more recent deep learning based approaches. PyODDS is released under the MIT open-source license and is currently available at https://github.com/datamllab/pyodds, with official documentation at https://pyodds.github.io/.
Keywords: anomaly detection, end-to-end system, outlier detection, deep learning, machine learning, data mining, full stack system, data visualization
Outliers are objects whose patterns or behaviors are rare and significantly different from those of the majority. Outlier detection plays an important role in various applications, such as fraud detection, cyber security, medical diagnosis and industrial manufacturing. Detecting outliers in data has been studied in the statistics community as early as the 19th century. Over time, a variety of anomaly detection approaches have been developed specifically for certain application domains with various data characteristics.
To meet these diverse demands, a well-structured system with the following characteristics is needed. First, it must cover various outlier detection techniques for different scenarios. Second, it must analyze both static and time series data, since in many real-world applications instances are not independent and identically distributed. For example, instances often carry timestamps and are temporally correlated, so outliers show up as unexpected spikes, drops, trend changes and level shifts. Third, it needs a database backend for data storage and operations that enables in-database analysis without moving data out of the database server or over the network, which reduces the cost of querying and loading data from different remote servers. For time series analysis the throughput burden is heavier than usual, since time-window based approaches need to access and query the data more frequently.
While many software libraries are available for outlier detection, existing toolkits focus on static data: PyNomaly Constantinou (2018), ELKI Data Mining Achtert et al. (2010), RapidMiner Hofmann and Klinkenberg (2013), and PyOD Zhao et al. (2019) provide different outlier detection methods for static data in various programming languages, yet none of them handles time series data or caters specifically to backend servers. To fill this gap, we propose and implement PyODDS, a full stack, end-to-end system for outlier detection, which supports both static and time series data.
PyODDS has advantages from the following perspectives. First, it contains 13 algorithms, including statistical approaches and recent neural network frameworks. Second, PyODDS supports both static and time series data analysis, with flexible time-slice segmentation. Third, PyODDS supports operation and maintenance from a lightweight SQL-based database, which reduces the cost of querying and loading data from different remote servers. Fourth, PyODDS provides visualization tools for the original distribution of the raw data and for the predicted results, giving users a direct visual impression. Last, PyODDS offers a unified API with detailed documentation covering outlier detection approaches, database operations, and visualization functions.
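To make the idea of time-slice segmentation concrete, the following is a minimal sketch of cutting a series into fixed-length, possibly overlapping windows. The function name `sliding_windows` and its `window` and `step` parameters are illustrative assumptions and are not taken from the PyODDS API.

```python
from typing import List, Sequence


def sliding_windows(series: Sequence[float], window: int, step: int = 1) -> List[List[float]]:
    """Cut a series into fixed-length windows (hypothetical helper, not PyODDS code)."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    # The last valid start index is len(series) - window; shorter series yield no windows.
    return [
        list(series[start:start + window])
        for start in range(0, len(series) - window + 1, step)
    ]


readings = [1.0, 1.1, 0.9, 5.0, 1.0, 1.2]
windows = sliding_windows(readings, window=3, step=1)
```

With `step` smaller than `window` the slices overlap, so each new slice re-reads most of the previous one, which is one reason window-based analysis puts more load on the data store than a single pass over static data.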
2 The PyODDS library
Our system is written in Python and uses TDengine as the database support service. The sequence from querying data to evaluation is outlined in Figure 1. PyODDS follows the API design of scikit-learn: all implemented methods (shown in Table 1) are formulated as individual classes with the same interface: (1) a fitting function fits the selected model to the given training data; (2) a prediction function returns a binary class label for each instance in the test set; (3) a scoring function produces an outlier score for each instance to denote its outlierness.
Table 1: Outlier detection algorithms implemented in PyODDS.
| Method | Reference | Class | Category |
|---|---|---|---|
| CBLOF | He et al. (2003) | algo.cblof.CBLOF | Fixed-length, shallow |
| SOD | Kriegel et al. (2009) | algo.sod.SOD | Fixed-length, shallow |
| HBOS | Aggarwal (2015) | algo.hbos.HBOS | Fixed-length, shallow |
| IFOREST | Liu et al. (2008) | algo.iforest.IFOREST | Fixed-length, shallow |
| KNN | Ramaswamy et al. (2000) | algo.knn.KNN | Fixed-length, shallow |
| LOF | Breunig et al. (2000) | algo.lof.LOF | Fixed-length, shallow |
| OCSVM | Schölkopf et al. (2001) | algo.ocsvm.OCSVM | Fixed-length, shallow |
| PCA | Shyu et al. | algo.pca.PCA | Fixed-length, shallow |
| AUTOENCODER | Hawkins et al. (2002) | | |
| DAGMM | Zong et al. (2018) | algo.dagmm.DAGMM | Fixed-length, deep |
| LSTMENCDEC | Malhotra et al. (2016) | | Time series, deep |
| LSTMAD | Malhotra et al. (2015) | algo.lstm_ad.LSTMAD | Time series, deep |
| LUMINOL | | algo.luminol.LUMINOL | Time series, shallow |
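The shared three-function interface can be illustrated with a toy detector. The sketch below is not a PyODDS class: `ZScoreDetector` is hypothetical, the method names `fit`, `predict` and `decision_function` are borrowed from scikit-learn's outlier detectors on the assumption that PyODDS mirrors them, and the label convention (1 for outlier, 0 for inlier) is likewise an assumption.

```python
import statistics
from typing import List, Sequence


class ZScoreDetector:
    """Hypothetical one-dimensional detector with a scikit-learn-style interface."""

    def __init__(self, contamination: float = 0.1) -> None:
        self.contamination = contamination  # assumed fraction of outliers in training data

    def fit(self, X: Sequence[float]) -> "ZScoreDetector":
        self.mean_ = statistics.fmean(X)
        self.std_ = statistics.pstdev(X) or 1.0  # fall back to 1.0 for constant data
        # Place the decision threshold at the (1 - contamination) quantile of training scores.
        train_scores = sorted(self.decision_function(X))
        cut = int((1.0 - self.contamination) * (len(train_scores) - 1))
        self.threshold_ = train_scores[cut]
        return self

    def decision_function(self, X: Sequence[float]) -> List[float]:
        # Outlier score: absolute distance from the training mean in standard deviations.
        return [abs(x - self.mean_) / self.std_ for x in X]

    def predict(self, X: Sequence[float]) -> List[int]:
        # Binary label per instance: 1 = outlier, 0 = inlier (assumed convention).
        return [1 if s > self.threshold_ else 0 for s in self.decision_function(X)]


train = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.0, 10.1]
det = ZScoreDetector(contamination=0.1).fit(train)
labels = det.predict([10.0, 25.0])
scores = det.decision_function([10.0, 25.0])
```

Because every detector exposes the same three calls, swapping one algorithm for another in a pipeline only changes the line that constructs the object.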
PyODDS also includes database operation functions for client users: (1) a connection function lets the client connect to the server using the host address and user information; (2) a query function returns a pandas DataFrame containing the time series retrieved from a given time range. Other server-side database operations are supported by the backend database platform TDengine, which provides caching, stream computing, message queuing and other functionality.
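The connect-then-query pattern can be sketched with Python's built-in `sqlite3` module standing in for TDengine. Everything here is an assumption made for illustration: the in-memory database, the `readings` table, and the helper names `connect_demo` and `query_range` are not part of PyODDS, and the result is a list of tuples rather than a pandas DataFrame so that the sketch has no third-party dependencies.

```python
import sqlite3
from typing import List, Tuple


def connect_demo() -> sqlite3.Connection:
    """Open an in-memory database and load a toy time series (stand-in for a real server)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (ts INTEGER PRIMARY KEY, value REAL)")
    conn.executemany(
        "INSERT INTO readings (ts, value) VALUES (?, ?)",
        [(t, float(t % 5)) for t in range(100)],
    )
    conn.commit()
    return conn


def query_range(conn: sqlite3.Connection, start: int, end: int) -> List[Tuple[int, float]]:
    """Return rows with start <= ts < end, ordered by timestamp."""
    cur = conn.execute(
        "SELECT ts, value FROM readings WHERE ts >= ? AND ts < ? ORDER BY ts",
        (start, end),
    )
    return cur.fetchall()


conn = connect_demo()
rows = query_range(conn, 10, 15)
```

Filtering by time range inside the SQL statement means only the requested slice leaves the database, which is the point of keeping the analysis close to the data.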
Moreover, PyODDS includes a set of utility functions for model evaluation and visualization: (1) one visualizes the original distribution of static data and time series; (2) another visualizes the predicted outlier scores for the test cases; (3) an evaluation function reports the performance of the given algorithm, including accuracy, precision, recall, F1 score, ROC-AUC score and processing time.
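For readers unfamiliar with the reported metrics, here is a dependency-free sketch of how they can be computed from ground-truth labels, predicted labels and outlier scores. The function `evaluate` and its return format are hypothetical and do not reproduce the PyODDS evaluation utility; labels are assumed to be 1 for outliers and 0 for inliers. Processing time is left out; it would typically be measured by wrapping the fit and predict calls with `time.perf_counter()`.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int], scores: Sequence[float]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, F1 and ROC-AUC (hypothetical helper)."""
    pairs_tp = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs_tp if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs_tp if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs_tp if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs_tp if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)

    # ROC-AUC as the probability that a random outlier outscores a random inlier;
    # ties count as half a win.
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    n_pairs = len(pos) * len(neg)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    roc_auc = wins / n_pairs if n_pairs else float("nan")

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "roc_auc": roc_auc,
    }


y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
metrics = evaluate(y_true, y_pred, scores)
```

Note that ROC-AUC is computed from the continuous scores, while the other four metrics depend on the thresholded labels, so the two groups can disagree about which detector is better.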
A PyODDS API demo is shown in Listing 1. Lines 1-3 import the utility functions. Lines 4-5 create a connection object with a cursor, connecting to the dataset using the host address and user information. Lines 7-8 show how to query data from a given time range. Lines 10-12 declare an object for a specific algorithm and fit the model on the training data. Lines 14-16 produce the predicted results for the test data. Lines 18-20 visualize the prediction results. An example of using the visualization functions is shown in Figure 2. The two-dimensional artificial data used in the example is created by a data generation utility, which draws inliers from two Gaussian distributions and outliers as random noise from another distribution. The first and second panels (from left to right) show the distribution of static data: the first uses kernel density estimation to visualize the original distribution, and the second plots the instances as a scatter plot in which lighter shades denote higher outlierness scores. The third and fourth panels visualize the prediction results for time series: the third plots the original features as curves, with the timestamp on the x axis and the feature values on the y axis; the fourth shows the outlierness score on the y axis against the same timestamps on the x axis.
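Artificial data of the kind described for Figure 2 can be produced along the following lines. This is a sketch under stated assumptions, not the PyODDS generator: the function name `generate_demo_data`, the cluster centers, the standard deviation of 0.5 and the choice of a uniform distribution for the outliers are all illustrative.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def generate_demo_data(
    n_inliers: int = 200, n_outliers: int = 20, seed: int = 0
) -> Tuple[List[Point], List[int]]:
    """Draw inliers from two Gaussians and outliers from uniform noise (hypothetical helper)."""
    rng = random.Random(seed)  # local generator keeps the result reproducible
    centers = [(-2.0, -2.0), (2.0, 2.0)]  # assumed cluster centers
    points: List[Point] = []
    labels: List[int] = []
    for i in range(n_inliers):
        cx, cy = centers[i % 2]  # alternate between the two clusters
        points.append((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)))
        labels.append(0)
    for _ in range(n_outliers):
        # Uniform noise over a box that is wider than both clusters.
        points.append((rng.uniform(-6.0, 6.0), rng.uniform(-6.0, 6.0)))
        labels.append(1)
    return points, labels


points, labels = generate_demo_data()
```

Some uniformly drawn points will land inside a cluster by chance, so even a perfect detector cannot recover every label on data generated this way.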
3 Conclusion and Future Work
This paper introduces PyODDS, an open-source system for anomaly detection that uses state-of-the-art machine learning techniques. It contains 13 algorithms, including classical statistical approaches and recent neural network frameworks, for static and time series data. It also provides an end-to-end solution for individuals as well as enterprises, and supports operations and maintenance from the lightweight SQL-based database through to the back-end machine learning algorithms. As a full-stack system, PyODDS aims to lower the barrier to learning scientific algorithms and to reduce the skills required on both the database and the machine learning side. In the future, we plan to enhance the system by implementing models for heterogeneous data Li et al. (2019a); Huang et al. (2019); Li et al. (2019b), improving the interpretability and reliability of the algorithms Liu et al. (2019), and integrating more advanced outlier detection methods.
References
- Visual evaluation of outlier detection models. In International Conference on Database Systems for Advanced Applications, pp. 396–399.
- Outlier analysis. In Data Mining.
- LOF: identifying density-based local outliers. In ACM SIGMOD Record.
- PyNomaly: anomaly detection using local outlier probabilities (LoOP). Journal of Open Source Software.
- Outlier detection using replicator neural networks. In International Conference on Data Warehousing and Knowledge Discovery.
- Discovering cluster-based local outliers. Pattern Recognition Letters.
- RapidMiner: data mining use cases and business analytics applications. CRC Press.
- Graph recurrent networks with attributed random walks.
- Outlier detection in axis-parallel subspaces of high dimensional data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining.
- Deep structured cross-modal anomaly detection. IJCNN.
- Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining.
- Is a single vector enough? Exploring node polysemy for network embedding. arXiv preprint arXiv:1905.10668.
- LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148.
- Long short term memory networks for anomaly detection in time series. In Proceedings.
- Efficient algorithms for mining outliers from large data sets. In ACM SIGMOD Record.
- Estimating the support of a high-dimensional distribution. Neural Computation.
- A novel anomaly detection scheme based on principal component classifier.
- PyOD: a Python toolbox for scalable outlier detection. Journal of Machine Learning Research.
- Deep autoencoding Gaussian mixture model for unsupervised anomaly detection.