SAFE: Scalable Automatic Feature Engineering Framework for Industrial Tasks
Machine learning techniques have been widely applied in Internet companies for various tasks, acting as an essential driving force, and feature engineering is generally recognized as a crucial step when constructing machine learning systems. Recently, growing effort has been devoted to the development of automatic feature engineering methods, so that engineers can be freed from substantial and tedious manual work. However, for industrial tasks, the efficiency and scalability of these methods are still far from satisfactory. In this paper, we propose a staged method named SAFE (Scalable Automatic Feature Engineering), which provides excellent efficiency and scalability, along with requisite interpretability and promising performance. Extensive experiments are conducted, and the results show that the proposed method provides prominent efficiency and competitive effectiveness compared with other methods. Moreover, the scalability of the proposed method allows it to be deployed in large-scale industrial tasks.
Nowadays, machine learning (ML) techniques have been widely explored and applied in almost all Internet companies, and serve as essential parts of diverse fields such as recommendation systems [4, 22, 9], fraud detection [2, 31, 33], advertising [15, 25, 34], and face recognition [1, 29]. With the help of these techniques, excellent performance and significant improvements have been obtained.
Generally speaking, building a machine learning system requires a professional and complex ML pipeline, which usually includes data preparation, feature engineering, model generation, and model evaluation. It is widely agreed that the performance of machine learning methods depends to a large extent on the quality of the features, so generating a good feature set is a crucial step toward high performance. Therefore, most machine learning engineers devote considerable effort to obtaining useful features when building a machine learning system.
However, feature engineering is often the part of the ML pipeline that depends most heavily on human intervention, since human intuition and experience are required; it is therefore tedious, task-specific, challenging, and time-consuming. Moreover, with the growing need for ML techniques in industrial tasks, it becomes impracticable to perform feature engineering manually for all of these tasks. This has motivated automatic feature engineering, an important topic in automatic machine learning (AutoML) [28, 10, 35, 8]. The development of automatic feature engineering can not only free machine learning engineers from this substantial and tedious process, but also enable machine learning techniques to be applied in more and more applications.
For a regular supervised learning task, the problem can be formulated as using training examples to find a function $F: \mathcal{X} \to \mathcal{Y}$, defined as returning the value $y$ that obtains the highest score: $F(x) = \arg\max_{y} S(x, y)$, where $\mathcal{X}$ is the input space, $\mathcal{Y}$ is the output space, and $S: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is a scoring function. The goal of automatic feature engineering is to learn a feature representation $\psi: \mathcal{X} \to \mathcal{Z}$ that constructs a new feature representation $\mathcal{Z}$ from the original features $\mathcal{X}$, with which the performance of subsequent machine learning tools can be improved as much as possible.
Several studies have been conducted on this topic. To name a few, some methods use reinforcement learning based strategies to perform automatic feature engineering [6, 14, 30]. These methods require many rounds of attempts, and a new feature set must be generated and evaluated in each round, which makes them infeasible for industrial tasks. Transfer learning or meta-learning based strategies have also been proposed for automatic feature engineering [12, 21]. However, a large number of experiments on various datasets are needed in advance to train these methods, and it is intractable to introduce new operators or to increase the number of parent features. Other methods follow the generation-selection procedure [11, 16, 13]. However, these methods typically generate all legal features in the feature generation stage and then select a subset of them, so their time and space complexity is extremely high, making them inapplicable to tasks with large data size or high feature dimension.
In industrial tasks, real business data is typically very large, which places extremely strict requirements on space and time complexity. At the same time, because business needs change rapidly, the algorithms must also be highly flexible and scalable. Beyond these, further requirements need to be addressed:
Strong applicability: A highly applicable tool is user-friendly and easy to use. The performance of an automatic feature engineering algorithm should not depend on a large number of hyper-parameters, or a single hyper-parameter configuration should work across different data sets.
Distributed computing: The numbers of samples and features in real-world business tasks are very large, which makes distributed computing necessary. Most parts of the automatic feature engineering algorithm should be computable in parallel.
Real-time inference: Real-time inference is involved in many real-world businesses. In such cases, once an instance is input, its features should be produced instantly so that the prediction can be performed immediately afterward.
In this paper, we approach the problem from the typical two-stage perspective and propose a method named SAFE (Scalable Automatic Feature Engineering) to perform efficient automatic feature engineering, which consists of a feature generation stage and a feature selection stage. The method is designed to ensure computational efficiency and scalability and to meet the requirements mentioned above. The major contributions of this paper are summarized as follows:
In the feature generation stage, unlike previous methods, which focus on which operators to use or on how to generate all legal features, we focus on mining the original feature pairs that are more likely to generate effective new features, which improves the efficiency of the process.
In the feature selection stage, we propose a feature selection pipeline that considers the power of a single feature, the redundancy of feature pairs, and the feature importance evaluated by a typical tree model. It is suitable for multiple different business data sets and various machine learning algorithms.
We have experimentally proved the advantages of our algorithm on a large set of data sets and multiple classifiers. Compared with the original feature space, the prediction accuracy is improved by on average.
The rest of this paper is organized as follows: Section II reviews the related work; Section III explains the problem setting; Section IV details the proposed method and provides some analyses; Section V states the detail of the data set, evaluation method and presents the experimental results to validate our method; Section VI concludes the paper.
II Related Work
As a nonnegligible issue for automatic machine learning [28, 10, 35, 8], automatic feature engineering has drawn extensive attention in recent years, and many methods have been proposed from different perspectives to solve this task [20, 6, 11, 12, 24, 5, 16, 13, 14, 30, 21]. In this section, we mainly discuss three typical strategies, which include the generation-selection strategy, reinforcement learning based strategy and transfer learning based strategy.
Given a supervised learning data set, a typical approach to automatic feature engineering is to follow the generation-selection procedure. The FICUS algorithm initializes by constructing a set of candidate features and iterates to improve it until the computation budget is exhausted. During each iteration, it performs beam search to construct new features and selects features, typically by using heuristic measures based on information gain in a decision tree. TFC also solves this task with an iterative framework. In each iteration, it generates all legal features based on the current feature pool and all available operators, selects the best features from all candidates by information gain, and keeps them as the new feature pool. With this framework, higher-order feature combinations can be obtained as the iterations progress. However, the exhaustive search in each iteration leads to a combinatorial explosion of the feature space, making this approach non-scalable. To avoid exhaustive search, learning based methods, such as the FCTree algorithm, have been proposed. FCTree trains a decision tree, performs feature generation by applying several sequential transformations to the original features, and selects features according to information gain at each node of the decision tree. Once a tree is built, the features chosen at internal decision nodes are used as the constructed features. AutoLearn is a regression-based algorithm that learns the representation by mining pairwise feature associations, identifying the linear or non-linear relationship between each pair, applying regression, and selecting those relationships that are stable and improve the prediction performance. These algorithms often encounter performance and scalability bottlenecks, since without careful design the time and resource costs of the feature generation and selection procedure can be prohibitive.
Reinforcement learning based strategies have also been explored. One method formalizes feature selection as a reinforcement learning problem and introduces an adaptation of Monte-Carlo tree search. Here, the problem of choosing a subset of the available features is cast as a single-player game whose states are all possible subsets of features and whose actions consist of choosing a feature and adding it to the subset. Another method explores a directed acyclic graph that represents the relationships between different transformed versions of the data, and learns through Q-learning an effective strategy to explore the available feature engineering choices under a given budget. A third method formalizes this task as an optimization problem over a Heterogeneous Transformation Graph (HTG). It proposes deep Q-learning on the HTG to support efficient learning of fine-grained and generalized FE policies that can transfer knowledge of engineering “good” features from a collection of data sets to other unseen data sets.
Transfer learning or meta-learning based strategies have also been proposed for automatic feature engineering. ExploreKit employs learning-to-rank techniques to evaluate the newly constructed features and select the most promising ones. It is extremely time-consuming: for instance, the reported results were obtained after running for several days on moderately sized data sets. In contrast, LFE can generate effective features within seconds on average. It is based on learning, from past feature engineering experience, the effectiveness of applying a transformation (e.g., arithmetic or aggregate operators) to numerical features. However, since its meta-features do not take the relationships between features into account, it works well only with unary transformations.
Beyond that, there are other methods that target different settings. For example, one method automatically constructs features from relational databases via deep feature synthesis. It focuses on the relationships between the various tables in the database to generate new features. A similar approach is adopted in other work. Moreover, many studies try to perform feature engineering while training the model, by introducing operations such as feature crossing or using techniques such as self-attentive neural networks. We emphasize that, unlike the methods that learn feature representations simultaneously with model training, we aim to learn a new representation for each sample based on the original features, which can then be fed to subsequent machine learning models, and we place no constraint on which model is used afterward. At the same time, interpretability is always required for industrial tasks. The generated features in our framework can be easily explained, which satisfies this interpretability requirement.
To apply automatic feature engineering techniques in real-world applications, especially for industrial tasks, the efficiency and scalability of the aforementioned methods are still far from satisfactory. Methods with excellent efficiency and scalability, along with requisite interpretability and promising performance are in high demand.
III Problem Statement
Consider a predictive modeling task, which consists of:
A dataset of input-output pairs. Let $x$ be a record of the input space with $M$ original features $X = \{X_1, \dots, X_M\}$, and let $y$ be the corresponding output label. Training data with $N$ records can be denoted as $D = \{(x_i, y_i)\}_{i=1}^{N}$. Similarly, validation data and test data can be denoted as $D_V$ and $D_T$.
A machine learning algorithm $L$ that accepts a training set and a validation set as input and produces a function $f$, which returns the predicted label $\hat{y} = f(x)$ given the input $x$.
A loss function $\ell$ which computes the loss of a learned function $f$ according to the ground-truth label $y$.
The goal of automatic feature engineering is to learn a feature generation function $\Psi$ that generates a new feature representation $Z = \Psi(X)$ based on the original features $X$, by using the set of operations $O$, so that the learning algorithm $L$ can find a function $f$ that minimizes the loss function $\ell$, i.e., approximates the true underlying input-output function as closely as possible. More formally, we want to obtain the feature generation function with which the loss of the learned predictive function is minimized:
$$\Psi^{*} = \arg\min_{\Psi} \ell\big(f(\Psi(x)),\, y\big)$$
in which the predictive function is obtained by $f = L(D^{\Psi}, D_V^{\Psi})$, and $D^{\Psi}$ and $D_V^{\Psi}$ are the generated training and validation datasets, i.e., $D^{\Psi} = \{(\Psi(x_i), y_i) : (x_i, y_i) \in D\}$ and $D_V^{\Psi} = \{(\Psi(x_i), y_i) : (x_i, y_i) \in D_V\}$.
The operators $O$, also known as $k$-ary operators, act on $k$ original features for feature generation, and can be divided into unary operators $O_1$, binary operators $O_2$, ternary operators $O_3$, etc. It should be noted that operators that do not satisfy the commutative property, such as “$\div$”, are treated as multiple different operators in our subsequent descriptions and experiments.
Unary operators are used for discretizing, normalizing, or mathematically transforming single features:
Discretization is the process of converting continuous features into discrete features. It plays an important role in feature processing: it is robust to anomalous data and can make the trained model more stable. Typical feature discretization methods include ChiMerge, equidistant binning, equal-frequency binning, clustering binning, etc.
Normalization refers to a process that makes features more normal or regular. Typical feature normalization methods include Min-Max normalization, Z-score, standardization of dispersion, etc.
Mathematical transformations acting on single features include log, sigmoid, square, square root, tanh, round, etc.
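The unary operators listed above can be illustrated with a minimal pure-Python sketch. The function names, the simple quantile rule used for equal-frequency binning, and the choice of `log1p` for the log transform are illustrative assumptions, not details taken from SAFE.

```python
import math
from bisect import bisect_right
from statistics import mean, pstdev

def equal_frequency_bins(values, n_bins):
    """Discretize: map each value to a bin index so bins hold roughly equal counts."""
    ordered = sorted(values)
    # Cut points at the 1/n_bins, 2/n_bins, ... positions (assumed simple rule).
    cuts = [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]
    return [bisect_right(cuts, v) for v in values]

def min_max(values):
    """Min-Max normalization to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def z_score(values):
    """Z-score normalization using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]

def log_transform(values):
    """log(1 + v); assumes every value is greater than -1."""
    return [math.log1p(v) for v in values]
```

Each function maps one feature column to one new column, which is what makes these operators unary.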
Binary operators combine two original features to generate a new feature:
Four basic arithmetic operations: $+$, $-$, $\times$, $\div$.
Logical operators act on two boolean features, such as conjunction ($\land$), disjunction ($\lor$), alternative denial ($\uparrow$), joint denial ($\downarrow$), material conditional ($\rightarrow$), converse implication ($\leftarrow$), biconditional ($\leftrightarrow$), exclusive or ($\oplus$), etc.
GroupByThenMax, GroupByThenMin, GroupByThenAvg, GroupByThenStdev and GroupByThenCount. These operators implement the SQL-based operations with the same name.
Ridge regression and kernel ridge regression can also be considered as binary operators.
Ternary operators combine three features to generate a new feature. A common ternary operator is the conditional operator, which is a basic conditional statement in many programming languages. For the conditional expression $a\ ?\ b : c$, if the value of $a$ is true, the value of $b$ is obtained; otherwise, the value of $c$ is obtained.
There are also many operators that can accept multiple original features as input, such as MAX, MIN, MEAN, etc. We divide them into different categories when they accept a different number of inputs.
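As a rough illustration of the binary, ternary, and multi-input operators described above, the sketch below works on plain Python lists, one list per feature column. All function names are hypothetical, and returning `None` on division by zero is an assumed convention.

```python
from collections import defaultdict

def divide(a, b):
    """Element-wise a / b; returns None where b is zero (assumed convention)."""
    return [x / y if y != 0 else None for x, y in zip(a, b)]

def group_by_then_max(keys, values):
    """For each record, the maximum of `values` among records sharing its key."""
    best = {}
    for k, v in zip(keys, values):
        best[k] = v if k not in best else max(best[k], v)
    return [best[k] for k in keys]

def group_by_then_avg(keys, values):
    """For each record, the mean of `values` among records sharing its key."""
    total, count = defaultdict(float), defaultdict(int)
    for k, v in zip(keys, values):
        total[k] += v
        count[k] += 1
    return [total[k] / count[k] for k in keys]

def conditional(cond, a, b):
    """Ternary operator: take a where cond is true, otherwise b."""
    return [x if c else y for c, x, y in zip(cond, a, b)]

def row_max(*features):
    """Multi-input MAX over any number of feature columns."""
    return [max(row) for row in zip(*features)]
```

Because division is not commutative, `divide(a, b)` and `divide(b, a)` produce different features, which is why such operators are counted as several distinct operators.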
It should be pointed out that there are also many operators that apply only in specific fields; we call them domain-specific operators, such as lag operators in time series analysis and genetic operators in biology.
Because of the variety of operators, a practical automatic feature engineering framework should not restrict the set of operators, and new operators should be easy to add.
Moreover, for the method to be feasible for large-scale industrial tasks, the whole automatic feature engineering framework should be efficient in both time and space, with requisite interpretability and promising performance.
IV Proposed Method
As shown in Fig. 1 and Algorithm 1, our automatic feature engineering algorithm is an iterative process in which each iteration comprises two phases: feature generation and feature selection. The number of iterations is limited by the computation time or space.
As discussed above, exhaustive search is impractical because the feature space is infinite. Even if the numbers of operators and iterations are limited, exhaustive search still results in a combinatorial explosion. To avoid this problem, we use a tree-based method, i.e., XGBoost, to mine the relationships among the current base features and thereby narrow down the search space for feature combinations, and then sort and filter the feature combinations by information gain ratio. We then apply the predefined operators to the filtered feature combinations with a high information gain ratio to obtain the new feature set. Combining the base features with the generated features yields the candidate feature set.
As the number of candidate features is still very large, we then apply efficient and effective feature ranking and selection methods. The basic idea is to find the informative features, remove the redundant ones, and assign each remaining feature a score so that further filtering can be performed if necessary. Concretely, we first use the information value to pick out the features that have a high impact on the label, which are regarded as the more informative features. Then, we use the Pearson correlation coefficient to remove the redundant features. Finally, we use XGBoost to score the remaining features by the average gain across all splits in which the feature is used. For the sake of scalability and efficiency, we keep only the features with the highest scores for the next iteration.
In the next two subsections, we will explain the feature generation and feature selection process in detail.
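The iterative process outlined above can be sketched as a small driver loop. This is only a skeleton under stated assumptions: `generate` and `select` are hypothetical caller-supplied callables standing in for the two phases, and the iteration and time budgets are placeholder values.

```python
import time

def safe_loop(base_features, generate, select, max_iters=3, time_budget_s=60.0):
    """Alternate generation and selection until the iteration or time budget runs out.

    generate(features) -> list of new features (hypothetical callable)
    select(candidates) -> list of kept features (hypothetical callable)
    """
    start = time.monotonic()
    features = list(base_features)
    for _ in range(max_iters):
        if time.monotonic() - start > time_budget_s:
            break
        generated = generate(features)
        # Candidate set = base features plus newly generated ones, without duplicates.
        candidates = features + [f for f in generated if f not in features]
        features = select(candidates)
    return features
```

For example, with `generate = lambda fs: [f"({fs[0]}+{fs[1]})"]` and `select = lambda cs: cs[-2:]`, one iteration starting from `["a", "b"]` returns `["b", "(a+b)"]`.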
IV-B Feature Generation
The goal of this phase is to generate the new feature set from the current feature set. Moreover, we want to reduce the number of newly generated features without omitting the effective ones. The training set, validation set, and test set at this point are denoted as $D$, $D_V$, and $D_T$.
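The search space of feature generation over the original features is:
$$S = \bigcup_{k} \left\{ (o, X_{i_1}, \dots, X_{i_k}) : o \in O_k,\ i_1, \dots, i_k \in \{1, \dots, M\} \text{ distinct} \right\}$$
where $M$ is the number of original features and $O_k$ represents the set of $k$-ary operators.
The number of elements in the search space is:
$$|S| = \sum_{k} |O_k| \cdot A_M^k$$
where $A_M^k = \frac{M!}{(M-k)!}$ represents the number of ways of obtaining an ordered subset of $k$ elements from a set of $M$ elements.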
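To give a feel for how fast this search space grows, the following sketch counts its elements under the assumption that the size is the sum over arities $k$ of (number of $k$-ary operators) times (number of ordered $k$-subsets of the $M$ original features). The function name and the dictionary input format are hypothetical.

```python
import math

def exhaustive_search_size(n_features, n_operators_by_arity):
    """Sum over arities k of |O_k| * A(M, k), with A(M, k) = M! / (M - k)!."""
    return sum(
        n_ops * math.perm(n_features, k)
        for k, n_ops in n_operators_by_arity.items()
    )
```

Under this assumption, 100 original features and four binary operators already give `exhaustive_search_size(100, {2: 4})`, i.e. 4 × 100 × 99 = 39600 candidate features in a single iteration.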
Mine feature combination relations
As mentioned earlier, this search space is so large that we have to narrow it down without ignoring the informative feature combinations. First, we train a tree model, i.e., XGBoost, on the current training and validation sets. Consider a regression tree in the XGBoost model, as shown in Fig. 2. We call the features tested at the internal nodes split features and the thresholds they are compared against split values; features that never act as a split feature are called non-split features. For each leaf node, we consider its parent node and collect the different split features on the path of the tree from the root node to that parent. The collection of such paths over all trees in the XGBoost model defines the feature combinations we consider. Our automatic feature engineering algorithm SAFE is based on two basic assumptions:
For unary operators, features generated from split features are more effective than those generated from non-split features.
For other operators, new features generated from split features on the same path are more effective than those generated from split features on different paths, while the latter are still more effective than features generated from non-split features.
We empirically verify the rationality of these two assumptions in Section V. Based on them, we look for features or feature combinations on these paths for feature generation, which greatly reduces the search space and guarantees efficiency while preserving the validity of feature generation. The search space of feature generation obtained in this way is:
where is the number of paths and represents the set of -ary operators.
The maximum number of elements in the search space is:
where means the -th feature on the path. It should be noted that some combinations of features on different paths may be the same, so the actual number will be much smaller than this value. Both the formulas and the experiments show that this search space is much smaller than the exhaustive one.
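The path-based narrowing of the search space can be sketched as follows. The nested-dictionary tree format is a hypothetical stand-in for a trained tree; with a real XGBoost model the same structure would first have to be read from a model dump (for instance via `Booster.trees_to_dataframe()`), which this sketch does not do.

```python
from itertools import combinations

def split_feature_paths(node, prefix=()):
    """Yield, for every leaf, the distinct split features on its root-to-leaf path."""
    if "leaf" in node:
        yield prefix
        return
    feat = node["feature"]
    path = prefix if feat in prefix else prefix + (feat,)
    yield from split_feature_paths(node["left"], path)
    yield from split_feature_paths(node["right"], path)

def candidate_pairs(trees):
    """Unordered feature pairs that co-occur on at least one path of any tree."""
    pairs = set()
    for tree in trees:
        for path in split_feature_paths(tree):
            pairs.update(tuple(sorted(p)) for p in combinations(path, 2))
    return pairs

# Toy tree: x1 at the root, x2 on the left branch, x3 on the right branch.
tree = {
    "feature": "x1", "threshold": 0.5,
    "left": {"feature": "x2", "threshold": 1.0,
             "left": {"leaf": 0.1}, "right": {"leaf": -0.2}},
    "right": {"feature": "x3", "threshold": 2.0,
              "left": {"leaf": 0.3}, "right": {"leaf": 0.0}},
}
pairs = candidate_pairs([tree])
```

In this toy tree, `x2` and `x3` never share a path, so only the pairs `(x1, x2)` and `(x1, x3)` are kept as candidates for binary operators.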
Sort feature combinations
To further narrow down the search space of feature generation, we use the information gain ratio to sort the features and feature combinations in the search space. Take a feature combination as an example: we already know the split values of each of its features. Each feature has a collection of split values, because a split feature may appear multiple times in a path. These split features and split values divide all records into parts. The information gain ratio is then computed from the difference between the original information entropy and the information entropy on these parts. The pseudo-code of the algorithm at this phase is shown in Algorithm 2.
Only the features or feature combinations with the highest information gain ratio are used to generate new features, with the number kept specified separately for each combination size (the number of features in the combination).
Since far fewer features are searched and generated than in exhaustive search, we can employ iterative feature generation strategies on large data sets.
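The sorting criterion can be illustrated with a short sketch that partitions records by the split values of a feature combination and computes a gain ratio. It assumes the standard C4.5-style definition (information gain divided by the split information of the partition) and a hypothetical input format of one dictionary per record; Algorithm 2 may differ in its details.

```python
import math
from bisect import bisect_right
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, split_values):
    """rows: list of {feature: value}; split_values: {feature: sorted thresholds}."""
    parts = defaultdict(list)
    for row, y in zip(rows, labels):
        # A record's part is identified by the interval it falls into for each feature.
        key = tuple(bisect_right(split_values[f], row[f]) for f in sorted(split_values))
        parts[key].append(y)
    n = len(labels)
    conditional = sum(len(p) / n * entropy(p) for p in parts.values())
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in parts.values())
    gain = entropy(labels) - conditional
    return gain / split_info if split_info > 0 else 0.0
```

A combination whose split values separate the classes perfectly scores 1.0, while one whose parts have the same class mix as the whole data set scores 0.0.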
IV-C Feature Selection
Candidate features, composed of the base features in this iteration and the generated features, might not be of equal importance. To select the more informative features in a computationally efficient way, we use a three-step feature selection process. First, features with low predictive power are removed according to the information value. Then, redundant features are removed according to the Pearson correlation coefficient. Finally, the remaining features are ranked by a tree-based method, i.e., XGBoost.
Remove uninformative features
Since some of the candidate features will inevitably have little or no impact on the target, we first remove the features with low predictive power. The pseudo-code of the algorithm at this phase is shown in Algorithm 3.
Information value (IV) is a useful concept for feature selection during model building, and it is widely used in industrial tasks. IV measures the degree to which a feature affects the target. With the feature divided into $b$ bins, the information value is computed as:
$$IV = \sum_{i=1}^{b} \left( \frac{n_{p,i}}{n_p} - \frac{n_{n,i}}{n_n} \right) \ln \frac{n_{p,i} / n_p}{n_{n,i} / n_n}$$
where $n_p$ and $n_n$ represent the numbers of all positive records and negative records, and $n_{p,i}$ and $n_{n,i}$ represent the numbers of positive records and negative records in the $i$-th bin.
| Information Value | Predictive Power |
| to | Useless for prediction |
| | Extremely strong predictor |
As shown in Table I, the rules of thumb guide us on how to remove features with low predictive power. Typically, variables with medium and strong predictive powers are selected for model development. Therefore, we take the threshold of feature selection as .
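The information value filter can be sketched as below, using the standard definition of IV over per-bin positive and negative counts. The smoothing constant for empty bins and the `>=` comparison against the threshold are assumptions of this sketch, and the threshold is left as a parameter rather than fixed.

```python
import math

def information_value(pos_counts, neg_counts, eps=0.5):
    """Sum over bins of (pos share - neg share) * ln(pos share / neg share).

    `eps` replaces empty-bin counts to avoid log(0); this smoothing is an
    assumption of the sketch, not something specified in the text.
    """
    pos = [c if c > 0 else eps for c in pos_counts]
    neg = [c if c > 0 else eps for c in neg_counts]
    total_pos, total_neg = sum(pos), sum(neg)
    iv = 0.0
    for p, n in zip(pos, neg):
        share_pos, share_neg = p / total_pos, n / total_neg
        iv += (share_pos - share_neg) * math.log(share_pos / share_neg)
    return iv

def drop_uninformative(iv_by_feature, threshold):
    """Keep the features whose IV reaches the threshold."""
    return [name for name, iv in iv_by_feature.items() if iv >= threshold]
```

A feature whose bins all have the same positive-to-negative mix gets an IV of 0, and the IV grows as the bins separate the two classes more sharply.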
Remove redundant features
The candidate features at this point have a certain predictive power, but some of them are redundant. For example, “speed” and “one-hour travel” are highly correlated, so we only need to keep one of them. The pseudo-code of the algorithm at this phase is shown in Algorithm 4.
A Pearson correlation is a number between $-1$ and $+1$ that indicates the extent to which two features are linearly related. An absolute value of 1 means that the two features are completely linearly related, and an absolute value of 0 means that there is no linear relationship between them. $\rho_{A,B}$, the Pearson correlation between features $A$ and $B$, is calculated by:
$$\rho_{A,B} = \frac{\sum_{i} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i} (a_i - \bar{a})^2}\ \sqrt{\sum_{i} (b_i - \bar{b})^2}}$$
where $\bar{a}$ and $\bar{b}$ are the averages of all elements of feature $A$ and feature $B$.
The larger the absolute value of the correlation coefficient, the stronger the correlation is. Usually, the relative strength of the variable is judged by the range of values in Table II. Therefore, we set the threshold of pearson correlation to . If the pearson correlation coefficient of the two features is greater than , the feature with the smaller IV of them will be removed.
| Pearson Correlation Coefficient | Correlation |
| to | Very weak or no correlation |
| to | Extremely strong correlation |
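The redundancy filter can be sketched as follows. This is a greedy variant chosen for brevity: features are visited in decreasing order of IV and kept only if their absolute Pearson correlation with every already kept feature stays within the threshold, so in each highly correlated pair the feature with the smaller IV is dropped. The function names, the use of the absolute value, and the threshold in the usage note are assumptions; Algorithm 4 may proceed differently.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant feature: treated as uncorrelated (assumed convention)
    return cov / math.sqrt(var_a * var_b)

def drop_redundant(columns, iv, threshold):
    """columns: {name: values}; iv: {name: information value}."""
    kept = []
    for name in sorted(columns, key=lambda f: iv[f], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

With columns `speed = [1, 2, 3, 4]` and `one_hour_travel = [2, 4, 6, 8]`, the two are perfectly correlated, so only the one with the larger IV survives for any threshold below 1.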
Rank feature importance
At this stage, we use a lightweight tree-based method, i.e., XGBoost, to rank the remaining candidate features by the average gain across all splits. Further filtering can be performed if a maximum number of finally selected features is required to make the later process more efficient.
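The ranking criterion, average gain across all splits in which a feature is used, can be made concrete with a small sketch. The input format, a flat list of `(feature, gain)` pairs with one entry per split, is a hypothetical simplification; in the XGBoost Python package the same quantity is reported by `Booster.get_score(importance_type="gain")`.

```python
from collections import defaultdict

def rank_by_average_gain(splits, top_k=None):
    """splits: iterable of (feature_name, gain) pairs, one per split in the model."""
    total, count = defaultdict(float), defaultdict(int)
    for feature, gain in splits:
        total[feature] += gain
        count[feature] += 1
    scores = {f: total[f] / count[f] for f in total}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked if top_k is None else ranked[:top_k]
```

Passing `top_k` corresponds to the optional cap on the number of finally selected features.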
IV-D Time Complexity Analysis
In this section, we analyze the time complexity of the algorithm. We first analyze the time complexity of some existing algorithms. Reinforcement learning based strategies are beyond our consideration since their execution time is too long. We mainly analyze generation-selection based strategies (TFC, FCTree, AutoLearn) and transfer learning or meta-learning based strategies (ExploreKit, LFE). Then we analyze the time complexity of SAFE. It should be noted that, for the sake of simplicity and generality, we only consider the first iteration of all iterative algorithms (TFC, ExploreKit, SAFE), and we only consider binary operators. Recall that we denote the number of records by $N$ and the number of features by $M$, and that $A_M^k$ represents the number of ways to obtain an ordered subset of $k$ elements from a set of $M$ elements.
TFC  generates all legal features and then selects the best ones using information gain. The time complexity of feature generation is , the time complexity of feature selection is , the time complexity of feature ranking is . For real business data with a large amount of data, is always much smaller than , so its time complexity is:
The time complexity of decision tree algorithm is , in which is the depth of the tree. FCTree  algorithm adds features at each level of decision tree, so its time complexity is . For real business data with large amount of data, is always much smaller than , so its time complexity is:
AutoLearn  identifies the linear or non-linear relationship between each pair and uses randomized lasso and mutual information for feature selection. The time complexity of feature generation is , the time complexity of feature selection is , the time complexity of feature ranking is . So the time complexity is:
ExploreKit is a meta-learning based strategy. In the feature generation phase, it needs an exhaustive combination of features, so its time complexity of feature generation is . In the feature ranking phase, it calculates the meta-features associated with the original data set and candidate features, such as entropy-based measures and statistical tests, to score each candidate feature, so its time complexity of feature ranking is . In addition, it needs to train a meta-learning model in advance. So the overall time complexity is:
LFE  is also a meta-learning based strategy. The difference is that it does not require exhaustive feature generation. At the feature selection stage, meta-features are only related to the original features. So the whole complexity can be calculated as:
The most important calculation at the phases of mining feature combination relations and feature importance ranking is to train an XGBoost. Their time complexity is and , respectively. Where and mean the total number of trees, and mean the maximum depth of the tree and is the maximum number of rows in each block  and is the number of features after removing redundant features. The number of features after feature generation and removing uninformative features can be denoted as and , respectively. Next, we analyze the time complexity of the other four phases:
Sorting feature combinations: As shown in Algorithm 2, there are binary feature combinations and the time complexity of this phase is .
Feature generation: As shown in section IV-B3, new features will be generated, so the time complexity of this phase is .
Remove uninformative features: As the Algorithm 3 shows, the time complexity of the third and fourth steps is and , respectively. So the overall time complexity of this phase is .
Remove redundant features: As the Algorithm 4 shows, pearson correlation is calculated once for each feature pair. Because the time complexity of Pearson correlation calculation is , the overall time complexity of this phase is .
The trees in XGBoost are usually not deep, so we can treat as a constant and ignore it. For real business data with large amount of data, , and . So the all time complexity is:
As shown in Eq. (13), we can easily control the number of features generated and the time complexity of the algorithm by controlling the total number of trees of XGBoost.
The time complexity and space complexity of our algorithm are low and can be adjusted flexibly according to actual needs. Below, we discuss whether the algorithm meets the requirements listed in Section I.
Our algorithm is user-friendly and does not require learning a cumbersome model, unlike reinforcement learning and transfer learning based methods. Besides, the hyperparameters that need to be set in advance only control the complexity of the algorithm, such as the number of iterations or the iteration time, the number of trees, and the depth of each tree. Therefore, setting these hyperparameters is not complicated.
XGBoost is designed to leverage parallel computing resources and is widely recognized for pushing the limits of computing power for boosted tree algorithms. At the same time, other parts of our algorithm can be easily parallelized, such as calculating the information value of each individual feature and the Pearson correlation of each feature pair.
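Because per-feature scores such as the information value are independent of one another, they map naturally onto a worker pool. The sketch below uses a thread pool from the standard library purely for illustration; the function name is hypothetical, and for CPU-bound scoring in CPython a process pool or a distributed engine would be the more realistic choice.

```python
from concurrent.futures import ThreadPoolExecutor

def score_features_in_parallel(columns, score_fn, max_workers=4):
    """Apply score_fn to each feature column concurrently; returns {name: score}."""
    names = list(columns)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(score_fn, (columns[n] for n in names)))
    return dict(zip(names, scores))
```

The same pattern applies to feature pairs: the iterable passed to the pool would then contain pairs of columns and `score_fn` would compute their correlation.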
In our algorithm, whether newly generated features can be used for real-time inference depends on the operators used for feature generation. Users can choose different operators according to the actual situation to meet the real-time requirements of the business.
For simplicity and versatility, we only select the four basic binary operators $+$, $-$, $\times$, and $\div$ when experimenting with each algorithm.
V-A Experiments on benchmark data sets
We first conduct experiments on 12 benchmark data sets with different numbers of samples and features. All of these data sets are available in the OpenML database.
Algorithms for comparison
We compare our model with the original features (ORIG), two state-of-the-art feature generation algorithms, i.e., FCTree and TFC, and two comparison algorithms of our own, i.e., Random (RAND) and SAFE-Important (IMP). We use Area Under Curve (AUC) as the evaluation metric.
The RAND algorithm randomly selects different feature combinations of all original features for feature generation. In contrast, the IMP algorithm randomly selects feature combinations only among the split features of XGBoost. RAND and IMP follow the same feature selection process as SAFE. For the convenience of comparison, the maximum number of output features of RAND, IMP, and SAFE is set to . Features generated by FCTree are also reduced to according to information gain. Moreover, TFC, RAND, IMP and SAFE only perform one iteration.
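The difference between the two baselines lies only in the pool that feature pairs are drawn from, as the following sketch shows. The function names, the restriction to unordered pairs, and the fixed seed are illustrative assumptions.

```python
import random
from itertools import combinations

def sample_pairs(features, n_pairs, seed=0):
    """Draw up to n_pairs distinct unordered feature pairs uniformly at random."""
    all_pairs = list(combinations(sorted(features), 2))
    rng = random.Random(seed)
    return rng.sample(all_pairs, min(n_pairs, len(all_pairs)))

def rand_pairs(all_features, n_pairs, seed=0):
    """RAND baseline: pairs drawn from all original features."""
    return sample_pairs(all_features, n_pairs, seed)

def imp_pairs(split_features, n_pairs, seed=0):
    """IMP baseline: pairs drawn only from features used as splits by the tree model."""
    return sample_pairs(split_features, n_pairs, seed)
```

SAFE differs from IMP in that it additionally requires the paired split features to lie on the same path and ranks the resulting combinations instead of sampling them.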
We evaluate the generated features (and also the original features) of each compared algorithm on 9 state-of-the-art classification algorithms (CLF): AdaBoost (AB), Decision Tree (DT), Extremely Randomized Trees (ET), k-Nearest Neighbors (kNN), Logistic Regression (LR), Multi-Layer Perceptron (MLP), Random Forest (RF), SVM with linear kernel (SVM), and XGBoost. All parameters of these algorithms are set to the default values in scikit-learn and XGBoost. We repeat each experiment multiple times and obtain the final results by averaging over the repetitions (100 repetitions for the first 9 data sets and 10 for the remaining data sets).
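To make the evaluation protocol concrete, the following sketch computes AUC from its rank interpretation (the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one) and averages it over repeated runs. The function names `auc` and `mean_auc` and the toy data are assumptions for this example; in practice a library routine would normally be used.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auc(runs):
    """Average the AUC over repeated experiments; each run is a (labels, scores) pair."""
    return sum(auc(labels, scores) for labels, scores in runs) / len(runs)


if __name__ == "__main__":
    run_a = ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
    run_b = ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7])
    print(auc(*run_a), auc(*run_b), mean_auc([run_a, run_b]))
```

The quadratic pair comparison keeps the sketch short; a sort-based computation would be preferable for large data sets.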
The reported performances are measured in terms of AUC, which are shown in Table III. The value in the table means . It can be seen from the experimental results that SAFE has a significant advantage over all other compared algorithms no matter what model is performed after the feature generation process. Compared with the original feature space, the features generated by our model can improve the overall prediction AUC by on average. Compared with FCTree and TFC, SAFE can improve the performance by and on average, respectively. What’s more, SAFE performs better than RAND and IMP, indicating that our algorithm does mine a combination of features that are more likely to generate better features.
We compare the importance of generated features with original features. We combine the original features with the top-ranked generated features (up to ) to form a new data set and use random forest to score feature importance. The experimental results are shown in Fig. 3. It is evident that the new features generated by SAFE (indicated in orange) are relatively more important than the original features (indicated in blue), which validates the effectiveness of the generated features.
Section IV-D has analyzed the time complexity of each method. Table V lists the actual executing time. It can be seen that SAFE has a great advantage and the execution time of it is on average () times the execution time of FCTree (TFC).
We further compare the stability of the generated features of each algorithm. The basic idea is that the generated features are more stable if the same features are generated each time when we repeat the automatic feature engineering procedure; and if the features generated each time are different, then the stability of the generated features is unsatisfactory.
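Suppose we repeat the experiment several times and the automatic feature engineering algorithm generates the same number of features in each run, so that the total number of generated features equals the number of runs multiplied by the number of features per run. Their distribution can be expressed as a set of (feature id, occurrence count) pairs, where the occurrence counts sum to this total. The distribution with the best feature stability is therefore the one in which every run produces the same features, so that the count of each feature equals the number of runs; the worst is the one in which no feature is ever repeated, so that every count is 1.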
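We use the Jensen-Shannon Divergence (JSD) to evaluate the stability of the feature distribution generated by different automatic feature engineering algorithms. JSD is a symmetrized variant of the Kullback-Leibler divergence (KLD). For two distributions $P$ and $Q$, it is defined as:
$$\mathrm{JSD}(P \,\|\, Q) = \frac{1}{2}\,\mathrm{KLD}(P \,\|\, M) + \frac{1}{2}\,\mathrm{KLD}(Q \,\|\, M),$$
where $M = \frac{1}{2}(P + Q)$ and KLD is calculated as:
$$\mathrm{KLD}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}.$$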
We take as and calculate the stability of the generated features of each algorithm, that is, the JSD between the actual feature distribution and the ideal distribution. The smaller the value, the better the stability. The experimental results are shown in Table VI. We did not include the TFC algorithm because its execution time is too long to repeat the procedure that many times. The experimental results show that SAFE has an advantage in the stability of its generated features.
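The stability measure can be sketched as follows. The code assumes that every run outputs the same number of distinct features, that occurrence counts are normalized to probabilities and sorted in descending order before being compared with the ideal distribution (uniform over as many features as one run produces), and that logarithms are taken to base 2. These alignment and normalization choices, along with the names `kld`, `jsd`, and `stability_jsd`, are assumptions of this example rather than details confirmed by the text.

```python
import math
from collections import Counter


def kld(p, q):
    """Kullback-Leibler divergence; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two distributions of equal length."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)


def stability_jsd(runs):
    """JSD between the observed feature distribution and the ideal one (lower is more stable)."""
    k = len(runs[0])  # features per run (assumed identical for all runs)
    counts = Counter(feature for run in runs for feature in run)
    total = sum(counts.values())
    # Assumption: align the two distributions by sorting observed frequencies.
    actual = sorted((c / total for c in counts.values()), reverse=True)
    size = max(len(actual), k)
    actual += [0.0] * (size - len(actual))
    ideal = [1.0 / k] * k + [0.0] * (size - k)
    return jsd(actual, ideal)


if __name__ == "__main__":
    stable = [["a", "b"], ["b", "a"], ["a", "b"]]
    unstable = [["a", "b"], ["c", "d"]]
    print(stability_jsd(stable), stability_jsd(unstable))
```

With base-2 logarithms the value lies between 0 and 1, where 0 means that every run produced exactly the same features.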
Performance at different iterations
We then validate whether the performance can be further improved as the iteration process goes on. We set the number of iteration rounds to 5, and the sampled results are shown in Fig. 4. As we can see, the performance may improve further as the rounds proceed and becomes stable after some rounds. This is reasonable: in the first few rounds, more useful feature combinations can be discovered, so the performance improves; after some rounds, no new useful feature combinations may be found, so the features are no longer updated and the performance remains unchanged.
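The saturation behaviour described above can be pictured with a small driver loop. In this sketch, `one_round` stands in for a full generation-and-selection round and is a hypothetical callback. The early stop when the feature set no longer changes is an assumption added for illustration; the experiments in the text simply use a fixed number of rounds.

```python
def run_rounds(initial_features, one_round, max_rounds=5):
    """Apply `one_round` repeatedly and record the feature set after each change."""
    current = frozenset(initial_features)
    history = [current]
    for _ in range(max_rounds):
        updated = frozenset(one_round(current))
        if updated == current:
            # No new useful feature was found, so later rounds would not change anything.
            break
        current = updated
        history.append(current)
    return history


if __name__ == "__main__":
    # Toy stand-in for a round: it can only ever discover the feature "a+b".
    def toy_round(features):
        return set(features) | {"a+b"}

    for i, feats in enumerate(run_rounds({"a", "b"}, toy_round)):
        print(i, sorted(feats))
```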
V-B Experiments on business data sets
Experiments on extra-large-scale business data sets are further conducted to verify the effectiveness and scalability of the proposed method on real industrial tasks. The data sets come from fraud detection tasks at Ant Financial, which aim at finding potentially fraudulent transactions (or malicious users) so that the system can stop these transactions (or catch these users) and avoid economic losses. Table VII presents the detailed information of these data sets. As we can see, the number of samples is extremely large (e.g., up to 8 million training samples for Data3), which severely hinders the use of many previous state-of-the-art methods. All parameters of the evaluated models are set to their default values, as before.
The results are shown in Table VIII. TFC and FCTree are not compared, since their execution time is too long on these extremely large-scale data sets. Thanks to the careful design of the feature generation procedure of SAFE, the overall time consumption is acceptable even for industrial tasks. More importantly, the proposed method SAFE consistently improves the performance, which validates its effectiveness on real industrial tasks and makes it a viable choice for automatic feature engineering on extremely large-scale industrial data sets. In fact, this framework has been deployed in our system, where it supports many different real-world tasks.
Automatic feature engineering has become an important topic in AutoML in recent years, and many different methods have been proposed to handle this task. However, the efficiency and scalability of these methods are still far from satisfactory, especially for industrial tasks, where automatic feature engineering is in high demand. In this paper, we propose a scalable and efficient method named SAFE for automatic feature engineering. Extensive experiments on both benchmark data sets and extra-large-scale business data sets are conducted, and a detailed analysis is provided, which shows that the proposed method provides prominent efficiency and competitive effectiveness compared with other methods.
References
- (2019) Efficient face recognition using regularized adaptive non-local sparse coding. IEEE Access 7, pp. 10653–10662.
- (2009) Anomaly detection: A survey. ACM Comput. Surv. 41 (3), pp. 15:1–15:58.
- (2016) XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pp. 785–794.
- (2010) The youtube video recommendation system. In Proceedings of the 2010 ACM Conference on Recommender Systems, RecSys 2010, Barcelona, Spain, September 26-30, 2010, pp. 293–296.
- (2010) Generalized and heuristic-free feature construction for improved accuracy. In Proceedings of the SIAM International Conference on Data Mining, SDM 2010, April 29 - May 1, 2010, Columbus, Ohio, USA, pp. 629–640.
- (2010) Feature selection as a one-player game. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pp. 359–366.
- (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edition. Springer Series in Statistics, Springer.
- (2019) AutoML: A survey of the state-of-the-art. CoRR abs/1908.00709.
- (2014) Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, ADKDD 2014, August 24, 2014, New York City, New York, USA, pp. 5:1–5:9.
- (2019) Automated machine learning. Springer.
- (2015) Deep feature synthesis: towards automating data science endeavors. In 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015, Campus des Cordeliers, Paris, France, October 19-21, 2015, pp. 1–10.
- (2016) ExploreKit: automatic feature generation and selection. In IEEE 16th International Conference on Data Mining, ICDM 2016, December 12-15, 2016, Barcelona, Spain, pp. 979–984.
- (2017) AutoLearn - automated feature generation and selection. In 2017 IEEE International Conference on Data Mining, ICDM 2017, New Orleans, LA, USA, November 18-21, 2017, pp. 217–226.
- (2018) Feature engineering for predictive modeling using reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018, pp. 3407–3414.
- (2006) Learning to advertise. In SIGIR 2006: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Seattle, Washington, USA, August 6-11, 2006, pp. 549–556.
- (2017) One button machine for automating feature engineering in relational databases. CoRR abs/1706.00327.
- (1991) Divergence measures based on the shannon entropy. IEEE Trans. Information Theory 37 (1), pp. 145–151.
- (2019) AutoCross: automatic feature crossing for tabular data in real-world applications. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, pp. 1936–1945.
- (2003) Information theory, inference, and learning algorithms. Cambridge University Press.
- (2002) Feature generation using general constructor functions. Machine Learning 49 (1), pp. 59–98.
- (2017) Learning feature engineering for classification. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pp. 2529–2535.
- (2007) Content-based recommendation systems. In The Adaptive Web, Methods and Strategies of Web Personalization, pp. 325–341.
- (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- (2009) Iterative feature construction for improving inductive learning algorithms. Expert Syst. Appl. 36 (2), pp. 3401–3406.
- (2005) Impedance coupling in content-targeted advertising. In SIGIR 2005: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, August 15-19, 2005, pp. 496–503.
- (2019) AutoInt: automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019, pp. 1161–1170.
- (2017) Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17, Halifax, NS, Canada, August 13 - 17, 2017, pp. 12:1–12:7.
- (2018) Taking human out of learning applications: A survey on automated machine learning. CoRR abs/1810.13306.
- (2017) Improved performance of face recognition using CNN with constrained triplet loss layer. In 2017 International Joint Conference on Neural Networks, IJCNN 2017, Anchorage, AK, USA, May 14-19, 2017, pp. 1948–1955.
- (2019) Automatic feature engineering by deep reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’19, Montreal, QC, Canada, May 13-17, 2019, pp. 2312–2314.
- (2018) Anomaly detection with partially observed anomalies. In Companion of the Web Conference 2018, WWW 2018, Lyon, France, April 23-27, 2018, pp. 639–646.
- (2019) Interpretable MTL from heterogeneous domains using boosted tree. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019, pp. 2053–2056.
- (2019) Distributed deep forest and its application to automatic detection of cash-out fraud. ACM TIST 10 (5), pp. 55:1–55:19.
- (2017) Optimized cost per click in taobao display advertising. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pp. 2191–2200.
- (2019) Survey on automated machine learning. CoRR abs/1904.12054.