PyODDS: An End-to-End Outlier Detection System

\nameYuening Li \emailyueningl@tamu.edu
\nameDaochen Zha \emaildaochen.zha@tamu.edu
\addrDepartment of Computer Science and Engineering
\nameNa Zou \emailnzou1@tamu.edu
\addrDepartment of Industrial & Systems Engineering
\nameXia Hu \emailxiahu@tamu.edu
\addrDepartment of Computer Science and Engineering
Texas A&M University
College Station, TX 77840, USA
Abstract

PyODDS is an end-to-end Python system for outlier detection with database support. PyODDS provides outlier detection algorithms that meet the demands of users in different fields, with or without a data science or machine learning background. PyODDS enables executing machine learning algorithms in-database, without moving data out of the database server or over the network. It also provides access to a wide range of outlier detection algorithms, from statistical analysis to more recent deep learning based approaches. PyODDS is released under the MIT open-source license and is available at https://github.com/datamllab/pyodds, with official documentation at https://pyodds.github.io/.

Keywords:

anomaly detection, end-to-end system, outlier detection, deep learning, machine learning, data mining, full stack system, data visualization

1 Introduction

Outliers are objects whose patterns or behaviors are significantly rare and different from those of the majority. Outlier detection plays an important role in various applications, such as fraud detection, cyber security, medical diagnosis, and industrial manufacturing. Detecting outliers in data has long been studied in the statistics community. Over time, a variety of anomaly detection approaches have been developed for specific application domains with various data characteristics.

To meet these diverse demands, a well-structured system with the following characteristics is needed. First, it needs to cover various outlier detection techniques for different scenarios. Second, it should analyze both static and time series data, since in many real-world applications instances are not independently and identically distributed. For example, instances often have temporal correlations via time-stamps, such as unexpected spikes, drops, trend changes, and level shifts. Third, a database backend for data storage and operations is needed, enabling in-database analysis without moving data out of the database server or over the network, which reduces the cost of querying and loading data from different remote servers. When analyzing time series, the throughput burden is heavier than usual, since time-window based approaches need to access and query the data more frequently.

While many software libraries are available for outlier detection, existing toolkits focus on static data: PyNomaly Constantinou (2018), ELKI Data Mining Achtert et al. (2010), RapidMiner Hofmann and Klinkenberg (2013), and PyOD Zhao et al. (2019) provide different outlier detection methods for static data in various programming languages, yet they do not handle time series data and do not cater specifically to database backends. To fill this gap, we propose and implement PyODDS, a full-stack, end-to-end system for outlier detection that supports both static and time series data.

PyODDS has advantages from the following perspectives. First, it contains 13 algorithms, ranging from statistical approaches to recent neural network frameworks. Second, PyODDS supports both static and time series data analysis, with flexible time-slice segmentation. Third, PyODDS supports operation and maintenance through a light-weight SQL-based database, which reduces the cost of querying and loading data from different remote servers. Fourth, PyODDS provides visualization tools for the distribution of the raw data and the predicted results, giving users a direct and intuitive view. Finally, PyODDS includes a unified API with detailed documentation covering outlier detection approaches, database operations, and visualization functions.

Figure 1: Overview of PyODDS

2 The PyODDS library

Our system is written in Python and uses TDengine as the database support service. The sequence from querying data to evaluation is outlined in Figure 1. PyODDS follows the API design of scikit-learn: all implemented methods (shown in Table 1) are formulated as individual classes with the same interface: (1) fit trains the selected model on the given training data; (2) predict returns a binary class label for each instance in the test set; (3) decision_function produces an outlier score for each instance to quantify its outlierness.
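To make the shared interface concrete, the sketch below shows a toy detector following the same fit / predict / decision_function convention. It is not taken from the PyODDS source; the distance-to-median scoring rule and the contamination parameter are illustrative assumptions only.

```python
class MedianDetector:
    """Toy detector: flags points far from the training median.

    Illustrates the shared PyODDS-style interface, not an actual
    PyODDS algorithm.
    """

    def __init__(self, contamination=0.1):
        self.contamination = contamination  # assumed fraction of outliers
        self.median_ = None
        self.threshold_ = None

    def fit(self, X):
        # X: a list of 1-d numeric values, for simplicity
        xs = sorted(X)
        self.median_ = xs[len(xs) // 2]
        scores = sorted(abs(x - self.median_) for x in X)
        # pick a threshold so roughly `contamination` of the training
        # points score above it
        cut = int(len(scores) * (1 - self.contamination))
        self.threshold_ = scores[min(cut, len(scores) - 1)]
        return self

    def decision_function(self, X):
        # higher score = more outlying
        return [abs(x - self.median_) for x in X]

    def predict(self, X):
        # 1 = outlier, 0 = inlier (the labeling convention is assumed here)
        return [1 if s > self.threshold_ else 0
                for s in self.decision_function(X)]
```

Keeping all detectors behind one interface is what lets the rest of the pipeline (querying, evaluation, visualization) stay algorithm-agnostic.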

Methods       Reference                 Class API                      Category
CBLOF         He et al. (2003)          algo.cblof.CBLOF               Fixed-length, shallow
SOD           Kriegel et al. (2009)     algo.sod.SOD                   Fixed-length, shallow
HBOS          Aggarwal (2015)           algo.hbos.HBOS                 Fixed-length, shallow
IFOREST       Liu et al. (2008)         algo.iforest.IFOREST           Fixed-length, shallow
KNN           Ramaswamy et al. (2000)   algo.knn.KNN                   Fixed-length, shallow
LOF           Breunig et al. (2000)     algo.lof.LOF                   Fixed-length, shallow
OCSVM         Schölkopf et al. (2001)   algo.ocsvm.OCSVM               Fixed-length, shallow
PCA           Shyu et al.               algo.pca.PCA                   Fixed-length, shallow
AUTOENCODER   Hawkins et al. (2002)     algo.autoencoder.AUTOENCODER   Fixed-length, deep
DAGMM         Zong et al. (2018)        algo.dagmm.DAGMM               Fixed-length, deep
LSTMENCDEC    Malhotra et al. (2016)    algo.lstm_enc_dec_axl.LSTMED   Time series, deep
LSTMAD        Malhotra et al. (2015)    algo.lstm_ad.LSTMAD            Time series, deep
LUMINOL       —                         algo.luminol.LUMINOL           Time series, shallow
Table 1: Outlier detection models in PyODDS
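The algo.&lt;module&gt;.&lt;CLASS&gt; convention in Table 1 suggests how a name-based selection helper can dispatch to a detector class. The sketch below is a hypothetical simplification of such a registry; the stand-in classes and the error message are assumptions, not the PyODDS implementation.

```python
# Stand-in detector classes; the real ones live under algo.* (Table 1).
class KNN:
    def __init__(self, **params):
        self.params = params

class LOF:
    def __init__(self, **params):
        self.params = params

# Hypothetical name-to-class registry behind an algorithm_selection helper.
_REGISTRY = {"knn": KNN, "lof": LOF}

def algorithm_selection(name, **params):
    """Look up a detector class by name and instantiate it."""
    try:
        return _REGISTRY[name.lower()](**params)
    except KeyError:
        raise ValueError(
            f"unknown algorithm {name!r}; choose from {sorted(_REGISTRY)}"
        )
```

A registry like this keeps user code independent of the concrete class paths: adding a new detector only requires registering one more entry.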
 1 >>> from utils.import_algorithm import algorithm_selection
 2 >>> from utils.utilities import output_performance, connect_server, query_data
 3
 4 >>> # connect to the database
 5 >>> conn, cursor = connect_server(host, user, password)
 6
 7 >>> # query data from a specific time range
 8 >>> data = query_data(database_name, table_name, start_time, end_time)
 9
10 >>> # train the anomaly detection algorithm
11 >>> clf = algorithm_selection(algorithm_name)
12 >>> clf.fit(X_train)
13
14 >>> # get outlier results and scores
15 >>> prediction_result = clf.predict(X_test)
16 >>> outlierness_score = clf.decision_function(X_test)
17
18 >>> # evaluate and visualize the prediction_result
19 >>> output_performance(X_test, prediction_result, outlierness_score)
20 >>> visualize_distribution(X_test, prediction_result, outlierness_score)
Listing 1: Demo of the PyODDS API

PyODDS also includes database operation functions for client users: (1) connect_server allows the client to connect to the server with a host address and user information; (2) query_data returns a pandas DataFrame containing the time series retrieved from a given time range. Other server-side database operations are supported by the backend database platform TDengine, which provides caching, stream computing, message queuing, and other functionalities.
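The shape of these two helpers can be sketched as follows. The real system talks to TDengine; the sketch below substitutes Python's built-in sqlite3 purely so it is self-contained and runnable, and the simplified query_data signature (cursor, table, time range) and list-of-rows return value are assumptions, not the PyODDS API.

```python
import sqlite3

def connect_server(host, user, password):
    # TDengine would authenticate with host/user/password;
    # the sqlite3 stand-in ignores them and opens an in-memory DB.
    conn = sqlite3.connect(":memory:")
    return conn, conn.cursor()

def query_data(cursor, table_name, start_time, end_time):
    # Return rows whose timestamp falls in [start_time, end_time);
    # PyODDS itself returns a pandas DataFrame instead of raw rows.
    cursor.execute(
        f"SELECT ts, value FROM {table_name} "
        "WHERE ts >= ? AND ts < ? ORDER BY ts",
        (start_time, end_time),
    )
    return cursor.fetchall()
```

Pushing the time-range filter into the SQL query, rather than loading the whole table and slicing client-side, is what keeps the frequent window-based queries of time series analysis cheap.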

Moreover, PyODDS includes a set of utility functions for model evaluation and visualization: (1) visualize_distribution visualizes the original distribution of static data and time series; (2) a companion plotting function visualizes the predicted outlier scores on the test cases; (3) output_performance evaluates the performance of the given algorithm, reporting accuracy, precision, recall, F1 score, ROC-AUC score, and processing time.
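The metrics named above can be sketched in pure Python. This is an illustrative output_performance-style summary, not the PyODDS implementation; ROC-AUC and timing are omitted, and the dictionary return type is an assumption.

```python
def output_performance(y_true, y_pred):
    """Summarize binary outlier predictions (1 = outlier, 0 = inlier)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Because outliers are rare by definition, precision, recall, and F1 on the outlier class are usually more informative than raw accuracy.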

A PyODDS API demo is shown in Listing 1. Lines 1-3 import the utility functions. Lines 4-5 create a connection object and cursor connected to the database, given the host address and user information. Lines 7-8 show how to query data from a given time range. Lines 10-12 instantiate a specific algorithm and fit the model on the training data. Lines 14-16 produce the predicted labels and outlier scores for the test data. Lines 18-20 evaluate and visualize the prediction results. An example of using the visualization functions is shown in Figure 2. The two-dimensional artificial data used in the example is created by generate_data, which draws inliers from two Gaussian distributions and outliers as random noise from another distribution. The first and second panels (from left to right) show the distribution of the static data: the first uses kernel density estimation to visualize the original distribution, and the second plots the instances as a scatter plot in which lighter shades denote higher outlierness scores. The third and fourth panels visualize the prediction results for time series: the third plots the original features as curves, with timestamps on the x-axis and feature values on the y-axis, and the fourth shows the outlierness score on the y-axis against the same time axis.

Figure 2: Demonstration of using PyODDS in visualizing prediction result
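The kind of synthetic data described above can be sketched as follows. This stand-alone version only mirrors the idea of PyODDS's data generator; the cluster centers, spreads, noise range, and 0/1 labeling are illustrative assumptions.

```python
import random

def generate_data(n_inliers=200, n_outliers=20, seed=0):
    """Two Gaussian inlier clusters plus uniform-noise outliers, in 2-D."""
    rng = random.Random(seed)
    centers = [(-2.0, -2.0), (2.0, 2.0)]  # assumed cluster centers
    X, y = [], []
    for _ in range(n_inliers):
        cx, cy = rng.choice(centers)
        X.append((rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)))
        y.append(0)  # 0 = inlier
    for _ in range(n_outliers):
        X.append((rng.uniform(-6.0, 6.0), rng.uniform(-6.0, 6.0)))
        y.append(1)  # 1 = outlier
    return X, y
```

Data of this shape makes the visualizations in Figure 2 easy to read: the two dense Gaussian blobs dominate the kernel density estimate, while the scattered noise points receive the highest outlierness scores.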

3 Conclusion and Future Work

This paper introduces PyODDS, an open-source system for anomaly detection that leverages state-of-the-art machine learning techniques. It contains 13 algorithms, including classical statistical approaches and recent neural network frameworks, for both static and time series data. It provides an end-to-end solution for individuals as well as enterprises, supporting operations and maintenance from a light-weight SQL-based database through to back-end machine learning algorithms. As a full-stack system, PyODDS aims to lower the barrier to applying scientific algorithms and to reduce the skill requirements on both the database and machine learning sides. In the future, we plan to enhance the system by implementing models for heterogeneous data Li et al. (2019a); Huang et al. (2019); Li et al. (2019b), improving the interpretability and reliability of the algorithms Liu et al. (2019), and integrating more advanced outlier detection methods.

References

  • E. Achtert, H. Kriegel, L. Reichert, E. Schubert, R. Wojdanowski, and A. Zimek (2010) Visual evaluation of outlier detection models. In International Conference on Database Systems for Advanced Applications, pp. 396–399. Cited by: §1.
  • C. C. Aggarwal (2015) Outlier analysis. In Data mining, Cited by: Table 1.
  • M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander (2000) LOF: identifying density-based local outliers. In ACM sigmod record, Cited by: Table 1.
  • V. Constantinou (2018) PyNomaly: anomaly detection using local outlier probabilities (LoOP).. Journal of Open Source Software. Cited by: §1.
  • S. Hawkins, H. He, G. Williams, and R. Baxter (2002) Outlier detection using replicator neural networks. In International Conference on Data Warehousing and Knowledge Discovery, Cited by: Table 1.
  • Z. He, X. Xu, and S. Deng (2003) Discovering cluster-based local outliers. Pattern Recognition Letters. Cited by: Table 1.
  • M. Hofmann and R. Klinkenberg (2013) RapidMiner: data mining use cases and business analytics applications. CRC Press. Cited by: §1.
  • X. Huang, Q. Song, Y. Li, and X. Hu (2019) Graph recurrent networks with attributed random walks. Cited by: §3.
  • H. Kriegel, P. Kröger, E. Schubert, and A. Zimek (2009) Outlier detection in axis-parallel subspaces of high dimensional data. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, Cited by: Table 1.
  • Y. Li, X. Huang, J. Li, M. Du, and N. Zou (2019a) Cited by: §3.
  • Y. Li, N. Liu, J. Li, M. Du, and X. Hu (2019b) Deep structured cross-modal anomaly detection. IJCNN. Cited by: §3.
  • F. T. Liu, K. M. Ting, and Z. Zhou (2008) Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, Cited by: Table 1.
  • N. Liu, Q. Tan, Y. Li, H. Yang, J. Zhou, and X. Hu (2019) Is a single vector enough? exploring node polysemy for network embedding. arXiv preprint arXiv:1905.10668. Cited by: §3.
  • P. Malhotra, A. Ramakrishnan, G. Anand, L. Vig, P. Agarwal, and G. Shroff (2016) LSTM-based encoder-decoder for multi-sensor anomaly detection. arXiv preprint arXiv:1607.00148. Cited by: Table 1.
  • P. Malhotra, L. Vig, G. Shroff, and P. Agarwal (2015) Long short term memory networks for anomaly detection in time series. In Proceedings, Cited by: Table 1.
  • S. Ramaswamy, R. Rastogi, and K. Shim (2000) Efficient algorithms for mining outliers from large data sets. In ACM Sigmod Record, Cited by: Table 1.
  • B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (2001) Estimating the support of a high-dimensional distribution. Neural computation. Cited by: Table 1.
  • M. Shyu, S. Chen, K. Sarinnapakorn, and L. Chang A novel anomaly detection scheme based on principal component classifier. Cited by: Table 1.
  • Y. Zhao, Z. Nasrullah, and Z. Li (2019) PyOD: a python toolbox for scalable outlier detection. Journal of Machine Learning Research. Cited by: §1.
  • B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen (2018) Deep autoencoding gaussian mixture model for unsupervised anomaly detection. Cited by: Table 1.