A study of local optima for learning feature interactions using neural networks
Abstract
In many fields such as bioinformatics, high energy physics, and power distribution, it is desirable to learn nonlinear models where a small number of variables are selected and the interaction between them is explicitly modeled to predict the response. In principle, neural networks (NNs) could accomplish this task since they can model nonlinear feature interactions very well. However, NNs require large amounts of training data to generalize well. In this paper we study the data-starved regime where a NN is trained on a relatively small amount of training data. For that purpose we study feature selection for NNs, which is known to improve generalization for linear models. As an extreme case of data with feature selection and feature interactions we study XOR-like data with irrelevant variables. We experimentally observed that the cross-entropy loss function on XOR-like data has many non-equivalent local optima, and the number of local optima grows exponentially with the number of irrelevant variables. To deal with the local minima and for feature selection we propose a node pruning and feature selection algorithm that improves the capability of NNs to find better local minima even when there are irrelevant variables. Finally, we show that the performance of a NN on real datasets can be improved using pruning, obtaining compact networks on a small number of features, with good prediction and interpretability.
1 Introduction
Many fields of science such as bioinformatics, high energy physics, power distribution, etc., deal with tabular data with the rows representing the observations and the columns representing the features (measurements) for each observation. In some cases we are interested in predictive models to best predict another variable of interest (e.g. catastrophic power failures of the energy grid). In other cases we are interested in finding what features are involved in predicting the response (e.g. what genes are relevant in predicting a certain type of cancer) and the predictive power is secondary to the simplicity of explanation. Furthermore, in most of these cases a linear model is not sufficient since the variables have high degrees of interaction in obtaining the response.
Neural networks (NNs) have been used in most of these cases because they can model complex interactions between variables. However, they require large amounts of training data. We are interested in cases where the available data is limited and the NNs are prone to overfitting.
To get insight on how to train NNs to deal with such data, we will study the XOR data, which has feature interactions and many irrelevant variables. The feature interactions are hard to detect in this data because they are not visible in any marginal statistics.
We will see that the loss function has many local minima that are not equivalent and that irrelevant features make the optimization harder when data is limited. To address these issues we propose a node pruning and feature selection algorithm that can obtain a compact NN on a small number of features, thus helping deal with the case of limited data and irrelevant features.
1.1 Related Work
Local minima. Recent studies [2, 5] have shown that the local minima of some convolutional neural networks are equivalent in the sense that they have the same loss (energy) value and a path can be found between the local minima along which the energy stays the same. For this reason, we focus our attention on fully connected neural networks and find examples where the local minima have different loss values. Moreover, [11] proves that all differentiable local minima are global minima for one-hidden-layer NNs with piecewise linear activation and square loss. However, nothing is proved for non-differentiable local minima.
Network pruning. There has been a lot of recent work on neural network pruning, either for the purpose of improving speed and reducing complexity or for giving insight into what pruning techniques actually achieve. [7] and [6] propose "Deep Compression", a three-stage technique that significantly reduces the storage requirement for training deep neural networks without affecting their accuracy. [10] shows that, for structured pruning methods, directly training the small target subnetwork or pruned model with random initialization can achieve comparable or even better performance than retraining using the remaining parameters after pruning. They also obtain similar results for an unstructured pruning method [7] after fine-tuning the pruned subnetwork on small-scale datasets. [4] introduces the Lottery Ticket Hypothesis, which claims that a randomly initialized dense neural network contains a subnetwork that can be trained in isolation, with the corresponding original initialized parameters, to obtain the same test accuracy as the original network after training for the same number of iterations.
2 An Empirical Study of the Trainability of Data-Starved Neural Networks
To study feature selection methods for neural networks, we will look at a challenging case study, the XOR problem with irrelevant variables. The dimensional XOR is a binary classification problem that can be formulated as
(1) 
Observe that in this formulation the XOR data is dimensional but the degree of interaction is dimensional, with . We call this data the D XOR in dimensions. In this paper we will work with , as is very simple. We assume that is sampled uniformly from . The first features are the only ones used in generating the response, and we call them the true features.
The XOR problem is an example of data that can only be modeled by using higher order feature interactions, and for which lower order marginal models have no discrimination power. This makes it very difficult to detect what features are relevant for predicting the response .
The neural networks (NNs) that we will study are two-layer neural networks with ReLU activation for the hidden layer. These NNs can model the XOR data very well given sufficiently many hidden nodes. The networks will be trained using the Adam optimizer and the cross-entropy loss function.
To take the data variability out of the picture, for each we will construct a large dataset of values with a large enough number of observations and features and use subsets consisting of the first observations and features for our experiments. For each observation the is obtained deterministically using Eq. (1). The same way we construct a separate test set with observations.
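The data construction described above can be sketched as follows. Since Eq. (1) is not reproduced in this text, the labelling rule here is an assumption rather than the paper's exact formulation: features are drawn uniformly from [-1, 1] and the response is 1 when the product of the first `k` features is positive. The names `make_xor`, `subset`, `n`, `p` and `k` are hypothetical and chosen only for illustration.

```python
import random


def make_xor(n, p, k, seed=0):
    """Generate n observations with p features; only the first k drive y.

    Assumption (not taken from the paper): features are uniform on [-1, 1]
    and y = 1 when the product of the first k features is positive.
    """
    rng = random.Random(seed)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = []
    for row in X:
        prod = 1.0
        for v in row[:k]:
            prod *= v
        y.append(1 if prod > 0 else 0)
    return X, y


def subset(X, y, n_sub, p_sub):
    """Keep the first n_sub observations and the first p_sub features.

    The labels stay valid as long as p_sub >= k, because they depend only
    on the first k features.
    """
    return [row[:p_sub] for row in X[:n_sub]], y[:n_sub]


# One large fixed dataset, then a smaller experiment carved out of it.
X_big, y_big = make_xor(n=2000, p=50, k=3, seed=0)
X_small, y_small = subset(X_big, y_big, n_sub=500, p_sub=13)
```

Because every experiment reuses the same underlying draw, differences between runs come from the initialization and the chosen subset sizes, not from resampling the data.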
2.1 Deep local minima based on the true features
In this section we study the NNs only on the true features used in generating the response, thus the feature selection is assumed given by an oracle. We study the local optima of the loss function that the NNs can obtain by training from a random initialization, for different numbers of hidden nodes. We are also interested in the connection between the number of hidden nodes and the training and test AUC.
Since for each the dataset is assumed fixed, we will use the best test AUC obtained for each as a target that we would like to reach for the same even for data that has many irrelevant variables.
Dependence on the number of hidden nodes. In a first experiment, we train a NN with different numbers of hidden nodes on a dataset with observations and 10 random initializations. Then, for each number of hidden nodes, we select the result with the smallest loss out of the 10 initializations and compute its train and test AUC. Figure 1 shows the obtained values of the loss and of the train and test AUC vs. the number of hidden nodes . We see that the loss first decreases considerably and then stabilizes. The same happens with the train and test AUC. These experiments were used to select the number of hidden nodes that would obtain a maximal test AUC. At the same time, the train AUC is larger than 0.95. The selected number of hidden nodes for each is shown in Table 1.
Dataset size  

4  64  128  128 
2.2 Local minima in training neural networks
Loss values vs. sample size. Figure 2 shows the average loss values obtained from 10 random initializations vs. sample size for different numbers of hidden nodes . Shown are the loss values for data without irrelevant variables (dashed lines) and for data with at least 10 irrelevant variables (solid lines). One can see that when there are no irrelevant variables, the obtained loss stays relatively constant and only slightly decreases with sample size. However, when there are irrelevant variables, the loss gradually increases with sample size, a sign of overfitting for small sample sizes, which could be addressed by variable selection. Moreover, the loss has a region where it takes large values ( for k=4 and for k=5) for some network sizes , which is a sign that the optimization is difficult there. Looking at the test AUC, we see that it increases with the sample size, and for the data with irrelevant variables it never reaches the test AUC obtained when we train a model on the relevant variables only.
Local minima. To study the local minima of a NN on the XOR data, we trained a NN with 100 random initializations. The number of hidden nodes was taken from Table 1.
The local minima were sorted by loss value, and their loss, train AUC and test AUC are shown in Figure 3. We see that the loss values are clearly different and that these differences are reflected in different train and test AUCs. Since the dataset is the same for all initializations, the fact that the loss values are different indicates that there are many local minima with different values.
Hit time. To see how hard the local minima are to find, we compute the hit time, which we define as the average number of random initializations required to find a local minimum with a train AUC of at least 0.95. The hit time is displayed in Figure 4 for NNs with 20 hidden nodes and observations. Observe that the hit time quickly blows up as the number of irrelevant variables increases, with a super-exponential dependence. It is impractical to learn NN models on 3D, 4D or 5D XOR data when there are hundreds of irrelevant variables.
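As a rough illustration of the hit time defined above, the sketch below estimates it from the train AUCs of a batch of independent random initializations. The estimator is an assumption, not necessarily the procedure used for Figure 4: if each initialization succeeds independently with the same probability, the expected number of initializations until the first success is the reciprocal of that probability, which is estimated here by the observed success fraction. The function name `estimate_hit_time` and the AUC values are hypothetical.

```python
def estimate_hit_time(train_aucs, threshold=0.95):
    """Estimate the average number of initializations needed for one success.

    train_aucs: train AUC reached by each independent random initialization.
    A run counts as a success when its train AUC is at least `threshold`.
    Returns infinity when no run succeeded (the hit time is then unknown,
    only bounded below by the number of runs tried).
    """
    successes = sum(1 for a in train_aucs if a >= threshold)
    if successes == 0:
        return float("inf")
    return len(train_aucs) / successes


# Hypothetical train AUCs from 8 random initializations (illustrative only).
aucs = [0.99, 0.71, 0.62, 0.97, 0.55, 0.58, 0.96, 0.60]
hit_time = estimate_hit_time(aucs)  # 3 successes out of 8 runs
```

With very small success probabilities the estimate needs many runs to be reliable, which is itself a symptom of the difficulty discussed in the text.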
Dependence on . As we see from previous observations, the NNs can handle the XOR data if is small, but even if in the range NN can work, the test AUC still decreases as increases. We also observe that increasing the number of hidden nodes in a NN may not help much in improving the test AUC. To demonstrate this, for each number of hidden nodes we train a NN with 10 random initializations and keep the best test AUC and its associated train AUC among the 10 trials. We repeat this process 10 times and display the average test and train AUC vs. the number of hidden nodes in the accompanying figure. As the number of hidden nodes increases, the train AUC keeps improving until it reaches 1.0. The test AUC is a different story: it quickly reaches its best value when the number of hidden nodes is relatively small, and then shows no further improvement as the number increases. This tells us that increasing the number of hidden nodes introduces too many irrelevant hidden nodes in the NN and leads to overfitting.
From this empirical study we conclude:

If the training data is difficult (such as the XOR data), not all local minima are equivalent, since in Figure 3 there was a large difference between the largest and smallest loss values as well as the corresponding test AUCs.

For a fixed the optimization problem is harder for data-starved NNs, when the sample size is in a certain range but not large enough.

For a fixed training size , the number of shallow local minima quickly blows up as the number of irrelevant variables increases and finding the deep local minima becomes extremely hard.

If the number of irrelevant variables is not too large (e.g. as in Figure 2), an NN with a sufficiently large number of hidden nodes will find a deep optimum more often than one with a small number of hidden nodes, but it might overfit.
These conclusions are the basis for the proposed pruning methodology presented in the next section.
3 Node and Feature Selection for Neural Networks
The above study showed how important it is to remove the irrelevant variables when training neural networks on difficult data with a small number of observations.
We use neural networks with one hidden layer and ReLU activation for the hidden layer. If the hidden layer has neurons and the input , we can represent the weights of the hidden nodes as vectors , the biases as a vector and the weights of the output neuron as a vector . Denoting the ReLU activation as we can write the neural network as:
(2) 
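A minimal sketch of such a network for binary classification is given below. Because Eq. (2) is not reproduced in this text, the exact form is an assumption: the output is taken to be a weighted sum of the ReLU hidden activations plus an optional output bias, interpreted as a logit for the cross-entropy loss. The names `W`, `b`, `beta` and `beta0` are hypothetical stand-ins for the hidden weight vectors, hidden biases, output weights and output bias.

```python
import math


def relu(z):
    """ReLU activation."""
    return max(0.0, z)


def forward(x, W, b, beta, beta0=0.0):
    """Logit of a one-hidden-layer ReLU network for a single input x.

    W:     list of hidden-node weight vectors (one vector per hidden node)
    b:     list of hidden-node biases
    beta:  list of output weights (one per hidden node)
    beta0: output bias (its presence is an assumption of this sketch)
    """
    hidden = [relu(sum(wi * xi for wi, xi in zip(w, x)) + bj)
              for w, bj in zip(W, b)]
    return sum(bj * hj for bj, hj in zip(beta, hidden)) + beta0


def cross_entropy(logit, y):
    """Binary cross-entropy for a label y in {0, 1}, in a stable form.

    Equals log(1 + exp(logit)) - y * logit.
    """
    return max(logit, 0.0) - y * logit + math.log1p(math.exp(-abs(logit)))


# Tiny example with two hidden nodes and two input features.
W = [[1.0, -1.0], [-1.0, 1.0]]
b = [0.0, 0.0]
beta = [1.0, 1.0]
logit = forward([0.5, -0.25], W, b, beta)
loss = cross_entropy(logit, 1)
```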
3.1 Node Selection with Annealing for NN
To find better local optima, we propose to start with a NN with many hidden neurons and use a pruning method similar to the Feature Selection with Annealing [1] to select the well trained hidden nodes and remove the rest.
(3) 
However, the NN has some built-in redundancy that we need to take into consideration when comparing the hidden nodes with each other. Observe that if we multiply the weight vector and the bias of a hidden node by a positive constant and divide the corresponding output weight by the same constant, we obtain an equivalent NN that has exactly the same output, because the ReLU activation is positively homogeneous. We can remove this redundancy and normalize the hidden neurons by normalizing their weight vectors . The proposed method for pruning the nodes including this normalization step is presented in Algorithm 1.
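The rescaling redundancy can be checked numerically. The sketch below is illustrative only and does not reproduce Eq. (3) or Algorithm 1: it assumes that each hidden node is rescaled so that its weight vector has unit L2 norm, with the bias divided and the output weight multiplied by the same positive constant. The names `normalize_nodes`, `W`, `b` and `beta` are hypothetical.

```python
import math


def relu(z):
    return max(0.0, z)


def forward(x, W, b, beta):
    """Output of a one-hidden-layer ReLU network (no output bias here)."""
    hidden = [relu(sum(wi * xi for wi, xi in zip(w, x)) + bj)
              for w, bj in zip(W, b)]
    return sum(bj * hj for bj, hj in zip(beta, hidden))


def normalize_nodes(W, b, beta):
    """Rescale every hidden node to a unit-norm weight vector.

    For c = ||w_j|| > 0, replacing (w_j, b_j, beta_j) by
    (w_j / c, b_j / c, beta_j * c) leaves the network output unchanged,
    since relu(z / c) * c == relu(z) for any positive c.
    """
    W_n, b_n, beta_n = [], [], []
    for w, bj, out_w in zip(W, b, beta):
        c = math.sqrt(sum(wi * wi for wi in w))
        if c == 0.0:  # a node with all-zero weights is left untouched
            c = 1.0
        W_n.append([wi / c for wi in w])
        b_n.append(bj / c)
        beta_n.append(out_w * c)
    return W_n, b_n, beta_n


W = [[3.0, 4.0], [0.6, 0.8]]
b = [1.0, -0.2]
beta = [0.5, -2.0]
x = [1.0, 0.5]

before = forward(x, W, b, beta)
W_n, b_n, beta_n = normalize_nodes(W, b, beta)
after = forward(x, W_n, b_n, beta_n)
```

After this step the hidden nodes are on a common scale, so quantities attached to different nodes can be compared directly.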
The node annealing schedule follows the equation:
where , , is the starting number of nodes (we used ) and is the final number of nodes, e.g. . The annealing parameter was set to . An example is shown as the blue curve in Figure 6.
3.2 Feature Selection for NN
We can use the node selection procedure from Section 3.1 to train better NNs than by random initialization when there are irrelevant variables. However, the irrelevant variables will still have a negative influence on the obtained model, and an even better model can be obtained by removing the irrelevant features.
After normalizing the NN using Eq. (3), we can compute the group weight (relevance) of each feature using the norm of the corresponding variables in the weight vectors :
(4) 
Using this group criterion we can use Feature Selection with Annealing to select the relevant features for a NN. The procedure is described in Alg. 2.
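The group criterion can be illustrated with a single selection step. This sketch is an assumption-laden simplification: Eq. (4) and Alg. 2 are not reproduced here, the L2 norm is assumed for the group weight, and only one pruning step is shown rather than the gradual annealing schedule used during training. The names `feature_relevance` and `keep_top_features` are hypothetical.

```python
import math


def feature_relevance(W):
    """Group weight of each feature across all hidden nodes.

    W is a list of (already normalized) hidden-node weight vectors, each of
    length p. The relevance of feature i is taken to be the L2 norm of the
    i-th entries of all weight vectors (the choice of L2 is an assumption).
    """
    p = len(W[0])
    return [math.sqrt(sum(w[i] ** 2 for w in W)) for i in range(p)]


def keep_top_features(W, m):
    """Keep the m features with the largest group weight.

    Returns the sorted indices of the kept features and the weight vectors
    restricted to those features.
    """
    scores = feature_relevance(W)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:m])
    return kept, [[w[i] for i in kept] for w in W]


# Two hidden nodes, four features; features 1 and 3 carry almost no weight.
W = [[0.9, 0.01, -0.4, 0.0],
     [-0.7, 0.02, 0.6, 0.01]]
kept, W_kept = keep_top_features(W, m=2)
```

In an annealing scheme the number of kept features would shrink gradually between training epochs, so that the remaining weights can adapt after each removal.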
The variable annealing schedule follows the equation:
where and . An example is shown as the red curve in Figure 6.
4 Experiments
In this section we perform experiments on the XOR data and some real datasets. All networks were trained with the Adam optimizer [9] with the default learning rate 0.001 and weight decay 0.0001.
4.1 XOR Data
We ran FSA+NSA on the XOR data with , for epochs, where the node pruning happened after epochs and feature selection started after epochs. We started with nodes and pruned them to . The result is shown as the black curve in Figure 2. One can see that the FSA+NSA procedure does a very good job in selecting the features and training a small model on the selected features. In most cases it even outperforms (in terms of test AUC) the NN model trained on the true features.
|  | NN(best) | NN(equivalent) | FSA+NSA |
| --- | --- | --- | --- |
| Car Evaluation, , classes. |  |  |  |
| Number of weights (nodes) | 1600 (64) | 150 (6) | 120+32 = 152 |
| Test Accuracy | 100.0 ± 0.00 | 98.23 ± 0.06 | 100.0 ± 0.00 |
| Image Segmentation, , classes. |  |  |  |
| Number of weights (nodes) | 6656 (256) | 364 (14) | 266+98 = 364 |
| Test Accuracy | 96.87 ± 0.72 | 96.27 ± 0.58 | 98.40 ± 0.32 |
| Optical Recognition of Handwritten Digits, , classes. |  |  |  |
| Number of weights (nodes) | 37888 (512) | 1998 (27) | 1792+160 = 1952 |
| Test Accuracy | 98.80 ± 0.29 | 98.25 ± 0.19 | 99.01 ± 0.20 |
| Multiple Features, , classes. |  |  |  |
| Number of weights (nodes) | 14464 (64) | 904 (4) | 583+320 = 903 |
| Test Accuracy | 97.85 ± 0.80 | 95.45 ± 0.98 | 98.15 ± 0.82 |
| ISOLET, , classes. |  |  |  |
| Number of weights (nodes) | 41152 (64) | 5787 (9) | 4683+1118 = 5801 |
| Test Accuracy | 96.73 ± 0.50 | 94.31 ± 0.61 | 96.91 ± 0.54 |
4.2 Real Datasets
In this section, we perform an evaluation on a number of real multi-class datasets to compare the performance of a fully connected NN and the compact NN obtained by FSA+NSA. The real datasets were carefully selected from the UCI ML repository [3] to ensure that each dataset is not too large (fewer than 10000 data points) and that a standard fully connected neural network (with one hidden layer) can have a reasonable generalization power on this data. If a dataset is large, then the loss landscape is simple and the neural network can be trained easily, so there is no need for pruning to escape bad optima. If a dataset is such that a neural network can rarely be trained on it successfully, the loss might not have any good local optima, and again pruning might not make sense.
Our real dataset experiments are not aimed at comparing the performance with other classification techniques, but at testing the effectiveness of FSA+NSA in guiding neural networks to find better local optima. For each dataset, we first combine all the samples, including training, validation and testing data, into a single dataset, and then divide it into a training and a testing set with a ratio . The obtained training set is used in a 10-run averaged 5-fold cross-validation grid search to find the best hyperparameter settings of a one-hidden-layer fully connected neural network. After obtaining the best hyperparameter settings from the cross-validation, we use them to retrain the fully connected NN on the entire training set 10 different times, and each time we record the best test accuracy. This procedure is used both for the fully connected NN and for the NN with FSA+NSA at different sparsity levels, for which we record the best sparsity level and test accuracy. Finally, we also train a so-called "equivalent" fully connected neural network with roughly the same number of connections as the best sparse neural network obtained by FSA+NSA.
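The sizing of the "equivalent" network can be checked against the weight counts reported in Table 2. The sketch below uses only numbers from that table. The interpretation that a fully connected network's weight count is a fixed number of connections per hidden node (input plus output connections, without biases) is an inference from those numbers, not something stated in the text, and the variable names are hypothetical.

```python
# (dataset, NN(best) weights, NN(best) nodes,
#  NN(equivalent) weights, NN(equivalent) nodes, FSA+NSA weights)
table2 = [
    ("Car Evaluation", 1600, 64, 150, 6, 152),
    ("Image Segmentation", 6656, 256, 364, 14, 364),
    ("Optical Recognition of Handwritten Digits", 37888, 512, 1998, 27, 1952),
    ("Multiple Features", 14464, 64, 904, 4, 903),
    ("ISOLET", 41152, 64, 5787, 9, 5801),
]

per_node = []   # connections per hidden node in the fully connected nets
rel_gap = []    # relative size gap between NN(equivalent) and FSA+NSA

for name, best_w, best_h, eq_w, eq_h, sparse_w in table2:
    # Both fully connected networks should have the same weights-per-node.
    assert best_w % best_h == 0 and eq_w % eq_h == 0
    assert best_w // best_h == eq_w // eq_h
    per_node.append(best_w // best_h)
    # "Roughly the same number of connections" as the sparse network.
    rel_gap.append(abs(eq_w - sparse_w) / sparse_w)

for (name, *_), c, g in zip(table2, per_node, rel_gap):
    print(f"{name}: {c} weights per hidden node, size gap {100 * g:.2f}%")
```

Under this reading, the equivalent network is obtained by choosing a number of hidden nodes whose total connection count is close to that of the pruned network; in the table the two counts differ by less than 3% for every dataset.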
The number of hidden nodes was searched in , the L2 regularization coefficient was searched in , and the batch size was searched in . Other NN training techniques like Dropout [12] and Batch Normalization [8] were not used in our experiments because of the simple architecture of the NNs used. The sorted loss values of the models with 200 random initializations are shown in Figure 7. The comparison results are listed in Table 2.
We see from Table 2 that using FSA+NSA to guide the search for a local optimum leads to NNs with good generalization on all these datasets, easily outperforming a NN of an equivalent size (with a similar number of weights) and in most cases even the standard NN with the best generalization to unseen data. We see from Fig. 7 that the FSA+NSA can obtain lower loss values than the other networks in all cases but one.
The experiments show that the XOR data is indeed an extreme example where deep local optima are hard to find, but even these real datasets exhibit some non-equivalent local optima, and what we learned from the XOR data carries over to these datasets to help us train NNs with better generalization.
5 Conclusion
This paper presented an empirical study of the trainability of neural networks and the connection between the amount of training data and the loss landscape. We observed that when the training data is large (where "large" depends on the problem), the loss landscape is simple and the network is easy to train. When the training data is limited, the number of local optima can become very large, making the optimization problem very difficult. For these cases we introduce a method for training a neural network that avoids many local optima by starting with a large model with many hidden neurons and gradually removing neurons to obtain a compact network trained in a deep minimum. Moreover, the performance of the obtained pruned subnetwork is hard to achieve by retraining using random initialization, due to the existence of many shallow local optima around the deep minimum. Experiments also show that the pruning method is useful in improving generalization on the XOR data and on a number of real datasets.
Many mature fields of science, such as physics, materials science, and electrical engineering, have two branches, one theoretical and one experimental, and researchers usually specialize in only one of them. The experimental scientists are skilled in designing and conducting experiments, handling different tools and devices, and observing phenomena. These phenomena are later explained by their theoretical colleagues, who specialize in proving things theoretically or simulating them numerically. Sometimes the opposite happens, when a theoretical scientist predicts a certain phenomenon that is later verified by an experimentalist. Each branch requires a different set of skills, and there are very few scientists in those fields who are both theoretical and experimental.
We feel that Machine Learning has reached a degree of maturity where it could also benefit from such a division. Some studies could be purely experimental and leave the theoretical justification to other more theoretically skilled researchers. In this regard, our paper is a purely experimental study, observing some phenomena and providing some intuitive solutions. We leave the theoretical study of the phenomena as well as the proof of the theoretical grounding for our FSA+NSA algorithm as future work for somebody with the appropriate skill set.
References
[1] (2017) Feature selection with annealing for computer vision and big data learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2), pp. 272–286.
[2] (2018) Essentially no barriers in neural network energy landscape. arXiv preprint arXiv:1803.00885.
[3] (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
[4] (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. ICLR.
[5] (2018) Loss surfaces, mode connectivity, and fast ensembling of DNNs. In Advances in Neural Information Processing Systems, pp. 8803–8812.
[6] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
[7] (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143.
[8] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
[9] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[10] (2018) Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270.
[11] (2016) No bad local minima: data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361.
[12] (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.