# High-dimensional Neural Feature using Rectified Linear Unit and Random Matrix Instance

## Abstract

We design a ReLU-based multilayer neural network to generate a rich high-dimensional feature vector. The feature guarantees a monotonically decreasing training cost as the number of layers increases. We design the weight matrix in each layer to extend the feature vectors to a higher-dimensional space while providing a richer representation in the sense of training cost. Linear projection to the target in the higher-dimensional space leads to a lower training cost if a convex cost is minimized. An $\ell_2$-norm convex constraint is used in the minimization to improve the generalization error and avoid overfitting. The regularization hyperparameters of the network are derived analytically to guarantee a monotonic decrease of the training cost, and therefore eliminate the need for cross-validation to find the regularization hyperparameter in each layer.

Alireza M. Javid, Arun Venkitaraman, Mikael Skoglund, and Saikat Chatterjee

School of Electrical Engineering and Computer Science

KTH Royal Institute of Technology, Sweden

{almj, arunv, skoglund, sach}@kth.se

**Keywords:** Rectified linear unit, random matrix, convex cost function

## 1 Introduction

Extension of low-dimensional signals to a high-dimensional space is a traditional method for constructing useful feature vectors, specifically for classification problems. The intuition is that, by extending to a high dimension, the feature vectors of different classes become easily separable by a linear classifier. The support vector machine (SVM) and kernel regression (KR), via the ‘kernel trick’, are examples of creating high-dimensional features. A popular kernel is the radial basis function (RBF) kernel, or Gaussian kernel, which extends the feature vector to an infinite-dimensional space [1]. In this paper, we design a high-dimensional feature using a multilayer neural network architecture. The architecture uses the rectified-linear-unit (ReLU) activation, instances of random matrices, and a fixed structured matrix. We refer to this as the high-dimensional neural feature (HNF) throughout the paper.

Relevant literature review: Neural networks and deep learning architectures have received overwhelming attention over the last decade [23]. Appropriately trained neural networks have been shown to outperform traditional methods in different applications, for example in classification and regression tasks [16, 5]. With continually increasing computational power, the field of machine learning is being enriched by active research pushing classification performance to higher levels for several challenging datasets [22, 14, 12]. However, very little is known regarding how many neurons and layers are required in a network to achieve a given performance. In particular, the technical issue of guaranteeing performance improvement as the number of layers increases is not straightforward in traditional neural network architectures, e.g., the deep neural network (DNN) [19], convolutional neural network (CNN) [11], and recurrent neural network (RNN) [18]. We endeavor to address this technical issue by extending the feature vectors to a higher-dimensional space using instances of random matrices.

Random matrices have been widely used as a means of reducing the computational complexity of neural networks while achieving performance comparable to fully-learned networks [21, 6, 17, 4, 13]. In the case of the simple, yet effective, extreme learning machine (ELM), all layers of the network are assigned randomly chosen weights and learning takes place only at the final layer [8, 7, 9, 24]. It has also been shown recently that performance similar to fully-learned networks may be achieved by training a network with most of the weights assigned randomly and only a small fraction of them updated throughout the layers [15]. These approaches indicate that randomness has much potential for high performance at low computational complexity. There exist other works employing predefined weight matrices that do not need to be learned. The scattering convolution network [3] is a well-known example of these approaches, which employs a wavelet-based scattering transform to design the weight matrices.

Our contributions: Motivated by the prior use of random matrices, we design the HNF architecture using an appropriate combination of ReLU activations, random matrices, and fixed matrices. We theoretically show that the output of each layer provides a richer representation compared to the previous layers if a convex cost is minimized to estimate the target. We use an $\ell_2$-norm convex constraint to improve the generalization error and avoid overfitting to the training data. We analytically derive the regularization hyperparameter that ensures the decrease of the training cost in each layer. Therefore, there is no need for cross-validation to find the optimum regularization hyperparameters of the network. Finally, we show the classification performance of the proposed HNF against ELM and state-of-the-art results.

Notations: We define $\mathbf{g}(\cdot)$ as a non-linear function comprised of a stack of ReLU activation functions; that is, $\mathbf{g}$ applies the ReLU function $g(t) = \max(t, 0)$ scalar-wise to its argument. A vector $\mathbf{t}$ has non-negative part $\mathbf{t}_+$ and non-positive part $\mathbf{t}_-$ such that $\mathbf{t} = \mathbf{t}_+ + \mathbf{t}_-$, $\mathbf{g}(\mathbf{t}) = \mathbf{t}_+$, and $\mathbf{g}(-\mathbf{t}) = -\mathbf{t}_-$.
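This decomposition is straightforward to check numerically; a minimal NumPy sketch (the array values are illustrative, not from the paper):

```python
import numpy as np

t = np.array([1.5, -2.0, 0.0, 3.0])
t_plus = np.maximum(t, 0.0)    # non-negative part: g(t)
t_minus = np.minimum(t, 0.0)   # non-positive part

assert np.allclose(t_plus + t_minus, t)            # t = t_+ + t_-
assert np.allclose(np.maximum(-t, 0.0), -t_minus)  # g(-t) = -t_-
```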

## 2 Proposed single layer structure

Consider a dataset containing $J$ samples of pair-wise $P$-dimensional input data and $Q$-dimensional target vectors as $\mathcal{D} = \{(\mathbf{x}_j, \mathbf{t}_j)\}_{j=1}^{J}$. Let us construct two single layer neural networks and compare the effectiveness of their feature vectors. In one network, we construct the feature vector as $\mathbf{z} = \mathbf{g}(\mathbf{W}\mathbf{x}) \in \mathbb{R}^{n}$, and in the other network, we build the feature vector $\tilde{\mathbf{z}} = \mathbf{g}(\tilde{\mathbf{W}}\mathbf{x}) \in \mathbb{R}^{2n}$. We use the same input vector $\mathbf{x}$, predetermined weight matrix $\mathbf{W} \in \mathbb{R}^{n \times P}$, and ReLU activation function $\mathbf{g}$ for both networks. However, in the second network, the effective weight matrix is $\tilde{\mathbf{W}} = \mathbf{V}\mathbf{W}$, where $\mathbf{V} = [\mathbf{I}_n \;\; -\mathbf{I}_n]^{\top} \in \mathbb{R}^{2n \times n}$ is fully pre-determined. To predict the target, we use a linear projection of the feature vector. Let the predicted target for the first network be $\hat{\mathbf{t}} = \mathbf{O}\mathbf{z}$, and the predicted target for the second network $\tilde{\mathbf{t}} = \tilde{\mathbf{O}}\tilde{\mathbf{z}}$. Note that $\mathbf{O} \in \mathbb{R}^{Q \times n}$ and $\tilde{\mathbf{O}} \in \mathbb{R}^{Q \times 2n}$. By using $\ell_2$-norm based regularization, we find optimal solutions for the following convex optimization problems.

$$\mathbf{O}^{\star} = \underset{\mathbf{O}}{\arg\min} \; \mathbb{E}\left\{ \|\mathbf{t} - \mathbf{O}\mathbf{z}\|_2^2 \right\} \;\; \text{s.t.} \;\; \|\mathbf{O}\|_F^2 \leq \epsilon, \qquad (1)$$

$$\tilde{\mathbf{O}}^{\star} = \underset{\tilde{\mathbf{O}}}{\arg\min} \; \mathbb{E}\left\{ \|\mathbf{t} - \tilde{\mathbf{O}}\tilde{\mathbf{z}}\|_2^2 \right\} \;\; \text{s.t.} \;\; \|\tilde{\mathbf{O}}\|_F^2 \leq \epsilon, \qquad (2)$$

where $\|\cdot\|_F$ denotes the Frobenius norm and the expectation operation is done by sample averaging over all data points in the training dataset. The regularization parameter $\epsilon$ is the same for the two networks. By defining $\bar{\mathbf{O}} = [\mathbf{O}^{\star} \;\; \mathbf{0}_{Q \times n}]$, we have

$$\bar{\mathbf{O}}\tilde{\mathbf{z}} = [\mathbf{O}^{\star} \;\; \mathbf{0}_{Q \times n}] \begin{bmatrix} \mathbf{g}(\mathbf{W}\mathbf{x}) \\ \mathbf{g}(-\mathbf{W}\mathbf{x}) \end{bmatrix} = \mathbf{O}^{\star}\mathbf{g}(\mathbf{W}\mathbf{x}) = \mathbf{O}^{\star}\mathbf{z}. \qquad (3)$$

The above relation is due to the special structure of $\mathbf{V}$ and the use of the ReLU activation $\mathbf{g}$, which give $\tilde{\mathbf{z}} = \mathbf{g}(\mathbf{V}\mathbf{W}\mathbf{x}) = [\mathbf{g}(\mathbf{W}\mathbf{x})^{\top} \;\; \mathbf{g}(-\mathbf{W}\mathbf{x})^{\top}]^{\top}$. Note that the solution $\bar{\mathbf{O}}$ exists in the feasible set of the minimization (2), i.e., $\|\bar{\mathbf{O}}\|_F^2 = \|\mathbf{O}^{\star}\|_F^2 \leq \epsilon$. Therefore, we can show the optimal costs of the two networks have the following relation

$$\mathbb{E}\left\{ \|\mathbf{t} - \tilde{\mathbf{O}}^{\star}\tilde{\mathbf{z}}\|_2^2 \right\} \leq \mathbb{E}\left\{ \|\mathbf{t} - \mathbf{O}^{\star}\mathbf{z}\|_2^2 \right\}, \qquad (4)$$

where the equality happens when $\tilde{\mathbf{O}}^{\star} = [\mathbf{O}^{\star} \;\; \mathbf{0}_{Q \times n}]$, where $\mathbf{0}_{Q \times n}$ is a zero matrix of size $Q \times n$. Any other optimal solution $\tilde{\mathbf{O}}^{\star}$ will lead to a strict inequality due to the convexity of the cost. Therefore, we can conclude that the feature vector of the second network is richer than the feature vector of the first network in the sense of reduced training cost. The proposed structure provides an additional property for the feature vector which we state in the following proposition. The proposition and its proof will be used in the next section to construct a multilayer structure.
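The zero-padding argument above can be verified numerically. The following NumPy sketch (with illustrative, assumed dimensions) builds both feature vectors and checks that the padded projection reproduces the first network's prediction with the same Frobenius norm:

```python
import numpy as np

rng = np.random.default_rng(0)
P, n, Q, J = 8, 16, 3, 100            # input dim, hidden nodes, target dim, samples
relu = lambda a: np.maximum(a, 0.0)

X = rng.standard_normal((P, J))       # training inputs, one column per sample
W = rng.standard_normal((n, P))       # predetermined weight matrix
V = np.vstack([np.eye(n), -np.eye(n)])  # fixed structured matrix V = [I; -I]

Z = relu(W @ X)                       # first network's feature, shape (n, J)
Zt = relu(V @ W @ X)                  # second network's feature, shape (2n, J)

O = rng.standard_normal((Q, n))       # any linear projection for the first network
O_bar = np.hstack([O, np.zeros((Q, n))])  # zero-padded projection, cf. eq. (3)

assert np.allclose(O_bar @ Zt, O @ Z)               # identical predictions -> same cost
assert np.isclose(np.linalg.norm(O_bar), np.linalg.norm(O))  # same Frobenius norm
```

Since the padded matrix achieves the same cost inside the same feasible set, the constrained optimum over the richer feature can only be lower.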

###### Proposition 1.

For the feature vector $\tilde{\mathbf{z}} = \mathbf{g}(\mathbf{V}\mathbf{W}\mathbf{x})$, there exists an invertible mapping $\mathbf{f}$ such that $\mathbf{x} = \mathbf{f}(\tilde{\mathbf{z}})$ when the weight matrix $\mathbf{W}$ is full-column rank.

###### Proof.

We now state the lossless flow property (LFP), as used in [17, 10]. A non-linear function $\mathbf{g}$ holds the lossless flow property (LFP) if there exist two linear transformations $\mathbf{A}$ and $\mathbf{B}$ such that $\mathbf{A}\,\mathbf{g}(\mathbf{B}\mathbf{t}) = \mathbf{t}$ for all $\mathbf{t}$. It is shown in [17] that ReLU holds LFP. In other words, if $\mathbf{B} = [\mathbf{I}_n \;\; -\mathbf{I}_n]^{\top}$ and $\mathbf{A} = [\mathbf{I}_n \;\; -\mathbf{I}_n]$, then $\mathbf{A}\,\mathbf{g}(\mathbf{B}\mathbf{t}) = \mathbf{g}(\mathbf{t}) - \mathbf{g}(-\mathbf{t}) = \mathbf{t}$ holds for every $\mathbf{t}$ when $\mathbf{g}$ is ReLU. Letting $\mathbf{t} = \mathbf{W}\mathbf{x}$, we can easily find $\mathbf{x} = \mathbf{W}^{\dagger}[\mathbf{I}_n \;\; -\mathbf{I}_n]\tilde{\mathbf{z}}$, where $\mathbf{W}^{\dagger}$ denotes the pseudo-inverse, which recovers $\mathbf{x}$ when $\mathbf{W}$ is a full-column rank matrix. Therefore, the resulting inverse mapping is linear. ∎
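The proof's linear inverse mapping can also be checked numerically; a NumPy sketch under assumed dimensions ($n \geq P$, so a Gaussian $\mathbf{W}$ is full-column rank with probability one):

```python
import numpy as np

rng = np.random.default_rng(1)
P, n = 5, 12                           # n >= P so W can be full-column rank
relu = lambda a: np.maximum(a, 0.0)

W = rng.standard_normal((n, P))        # full-column rank with probability one
V = np.vstack([np.eye(n), -np.eye(n)])     # B in the LFP statement
A = np.hstack([np.eye(n), -np.eye(n)])     # left inverse: A @ relu(V @ t) == t

x = rng.standard_normal(P)
z = relu(V @ W @ x)                    # the feature z~ = g(VWx)

t = W @ x
assert np.allclose(A @ relu(V @ t), t)          # ReLU satisfies the LFP
x_rec = np.linalg.pinv(W) @ A @ z               # linear inverse mapping from the proof
assert np.allclose(x_rec, x)
```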

## 3 Rich feature vector design with deepness

In this section, we show that the proposed HNF provides a richer feature vector as the number of layers increases. Consider an $L$-layer feed-forward network according to our proposed structure on the weight matrices, producing the feature vector

$$\mathbf{z}_L = \mathbf{g}\left( \tilde{\mathbf{W}}_L\, \mathbf{g}\left( \tilde{\mathbf{W}}_{L-1} \cdots \mathbf{g}( \tilde{\mathbf{W}}_1 \mathbf{x}) \right) \right). \qquad (5)$$

Note that $n_l$ is the number of neurons in the $l$'th layer of the network, so that $\mathbf{z}_l \in \mathbb{R}^{2n_l}$. The input-output relation in each layer is characterized by

$$\mathbf{z}_l = \mathbf{g}(\tilde{\mathbf{W}}_l\, \mathbf{z}_{l-1}), \qquad (6)$$

$$\tilde{\mathbf{W}}_l = \mathbf{V}_l \mathbf{W}_l, \qquad (7)$$

where $\mathbf{z}_0 = \mathbf{x}$, $\mathbf{V}_l = [\mathbf{I}_{n_l} \;\; -\mathbf{I}_{n_l}]^{\top}$, and $\mathbf{W}_l \in \mathbb{R}^{n_l \times 2n_{l-1}}$ for $l = 2, \dots, L$, with $\mathbf{W}_1 \in \mathbb{R}^{n_1 \times P}$. Let the predicted target using the $l$'th layer feature vector be $\hat{\mathbf{t}}_l = \mathbf{O}_l \mathbf{z}_l$. We find optimal solutions for the following convex optimization problems

$$\mathbf{O}_{l-1}^{\star} = \underset{\mathbf{O}}{\arg\min} \; \mathbb{E}\left\{ \|\mathbf{t} - \mathbf{O}\mathbf{z}_{l-1}\|_2^2 \right\} \;\; \text{s.t.} \;\; \|\mathbf{O}\|_F^2 \leq \epsilon_{l-1}, \qquad (8)$$

$$\mathbf{O}_{l}^{\star} = \underset{\mathbf{O}}{\arg\min} \; \mathbb{E}\left\{ \|\mathbf{t} - \mathbf{O}\mathbf{z}_{l}\|_2^2 \right\} \;\; \text{s.t.} \;\; \|\mathbf{O}\|_F^2 \leq \epsilon_{l}. \qquad (9)$$

Let us define $\bar{\mathbf{O}}_l = \mathbf{O}_{l-1}^{\star}\mathbf{W}_l^{\dagger}[\mathbf{I}_{n_l} \;\; -\mathbf{I}_{n_l}]$. Assuming that the weight matrices are full-column rank, we can similarly derive $\bar{\mathbf{O}}_l \mathbf{z}_l = \mathbf{O}_{l-1}^{\star}\mathbf{z}_{l-1}$ by the lossless flow property. Note that we have the following relations

$$\|\bar{\mathbf{O}}_l\|_F^2 = \left\| \mathbf{O}_{l-1}^{\star}\mathbf{W}_l^{\dagger}[\mathbf{I}_{n_l} \;\; -\mathbf{I}_{n_l}] \right\|_F^2 \leq \|\mathbf{O}_{l-1}^{\star}\|_F^2 \left\| \mathbf{W}_l^{\dagger}[\mathbf{I}_{n_l} \;\; -\mathbf{I}_{n_l}] \right\|_2^2. \qquad (10)$$

If we choose $\epsilon_l \geq \epsilon_{l-1} \left\| \mathbf{W}_l^{\dagger}[\mathbf{I}_{n_l} \; -\mathbf{I}_{n_l}] \right\|_2^2$, by using (10), we can easily see that $\|\bar{\mathbf{O}}_l\|_F^2 \leq \epsilon_l$. Therefore, by including $\bar{\mathbf{O}}_l$ in the feasible set of the minimization (9), we can guarantee that the optimal cost of the $l$'th layer would be lower than that of layer $(l-1)$. In particular, by choosing $\epsilon_l = \epsilon_{l-1} \left\| \mathbf{W}_l^{\dagger}[\mathbf{I}_{n_l} \; -\mathbf{I}_{n_l}] \right\|_2^2$, we can see that the optimal costs follow the relation

$$\mathbb{E}\left\{ \|\mathbf{t} - \mathbf{O}_{l}^{\star}\mathbf{z}_{l}\|_2^2 \right\} \leq \mathbb{E}\left\{ \|\mathbf{t} - \mathbf{O}_{l-1}^{\star}\mathbf{z}_{l-1}\|_2^2 \right\}, \qquad (11)$$

where the equality happens when $\mathbf{O}_l^{\star} = \bar{\mathbf{O}}_l$. Any other optimal solution $\mathbf{O}_l^{\star}$ will lead to a strict inequality due to the convexity of the cost. Therefore, we can conclude that the feature vector of an $l$-layer network is richer than the feature vector of an $(l-1)$-layer network in the sense of reduced training cost. Note that if we choose the weight matrix $\mathbf{W}_l$ to have orthonormal columns, then

$$\left\| \mathbf{W}_l^{\dagger}[\mathbf{I}_{n_l} \;\; -\mathbf{I}_{n_l}] \right\|_2^2 = \left\| \mathbf{W}_l^{\top}[\mathbf{I}_{n_l} \;\; -\mathbf{I}_{n_l}] \right\|_2^2 \leq \left\| [\mathbf{I}_{n_l} \;\; -\mathbf{I}_{n_l}] \right\|_2^2 = 2, \qquad (12)$$

where we have used the fact that $\mathbf{W}_l^{\dagger} = \mathbf{W}_l^{\top}$ for a matrix with orthonormal columns. As we have $\left\| \mathbf{W}_l^{\dagger}[\mathbf{I}_{n_l} \; -\mathbf{I}_{n_l}] \right\|_2^2 \leq 2$, a sufficient condition to guarantee the cost relation (11) is to use the relation between regularization parameters as $\epsilon_l \geq 2\epsilon_{l-1}$. We can safely choose $\epsilon_l = 2\epsilon_{l-1}$. Note that the regularization parameter $\epsilon_1$ in the first layer can also be determined analytically. Consider $\mathbf{O}_{\mathrm{ls}}^{\star}$ to be the solution of the following least-squares optimization

$$\mathbf{O}_{\mathrm{ls}}^{\star} = \underset{\mathbf{O}}{\arg\min} \; \mathbb{E}\left\{ \|\mathbf{t} - \mathbf{O}\mathbf{x}\|_2^2 \right\}. \qquad (13)$$

Table 1: Classification accuracy (in %) of the proposed HNF (with $L$ layers), ELM, and state-of-the-art results.

| Dataset | Size of train data | Size of test data | Input dimension | Number of classes | Proposed HNF accuracy | $L$ | ELM accuracy | Proposed HNF (ELM feature) accuracy | $L$ | State-of-the-art accuracy [reference] |
|---|---|---|---|---|---|---|---|---|---|---|
| Letter | 13333 | 6667 | 16 | 26 | 93.3 | 5 | 88.3 | 94.6 | 3 | 95.8 [20] |
| Shuttle | 43500 | 14500 | 9 | 7 | 99.3 | 5 | 99.0 | 99.6 | 3 | 99.9 [8] |
| MNIST | 60000 | 10000 | 784 | 10 | 97.1 | 5 | 96.9 | 97.7 | 3 | 99.7 [22] |

Note that the above minimization has a closed-form solution. Similar to the argument in (11), by choosing $\epsilon_1 = 2\|\mathbf{O}_{\mathrm{ls}}^{\star}\|_F^2$, it can be easily seen that

$$\mathbb{E}\left\{ \|\mathbf{t} - \mathbf{O}_{1}^{\star}\mathbf{z}_{1}\|_2^2 \right\} \leq \mathbb{E}\left\{ \|\mathbf{t} - \mathbf{O}_{\mathrm{ls}}^{\star}\mathbf{x}\|_2^2 \right\}, \qquad (14)$$

where the equality happens only when $\mathbf{O}_1^{\star} = \mathbf{O}_{\mathrm{ls}}^{\star}\mathbf{W}_1^{\dagger}[\mathbf{I}_{n_1} \;\; -\mathbf{I}_{n_1}]$.
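The sufficient condition $\epsilon_l = 2\epsilon_{l-1}$ rests on the spectral-norm identity in (12); a quick NumPy check with an assumed orthonormal-column $\mathbf{W}$ of illustrative size:

```python
import numpy as np

rng = np.random.default_rng(2)
m, k = 10, 4                                       # W is m x k with m >= k
W, _ = np.linalg.qr(rng.standard_normal((m, k)))   # orthonormal columns: W.T @ W = I
A = np.hstack([np.eye(m), -np.eye(m)])             # the fixed block [I  -I]

M = np.linalg.pinv(W) @ A                          # W^dagger [I  -I]; here pinv(W) = W.T
spec = np.linalg.norm(M, 2)                        # spectral (largest singular) value
assert np.isclose(spec ** 2, 2.0)                  # ||W^dagger [I -I]||_2^2 = 2
```

The identity is exact here because $\mathbf{M}\mathbf{M}^{\top} = \mathbf{W}^{\top}(2\mathbf{I})\mathbf{W} = 2\mathbf{I}$ for orthonormal columns.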

Similar to Proposition 1, we can prove the following proposition regarding the invertibility of the feature vector at the $L$'th layer of the proposed structure.

###### Proposition 2.

For the feature vector $\mathbf{z}_L$ in (5), there exists an invertible mapping $\mathbf{f}$ such that $\mathbf{x} = \mathbf{f}(\mathbf{z}_L)$ when the weight matrices $\{\mathbf{W}_l\}_{l=1}^{L}$ are full-column rank.

###### Proof.

It can be easily proved by repeatedly using the lossless flow property (LFP) similar to Proposition 1. ∎

Note that the dimension of the feature vector $\mathbf{z}_l$ grows exponentially with $l$, since the full-column rank condition requires $n_l \geq 2n_{l-1}$. Therefore, the proposed HNF is not suitable for cases where the input dimension is too high. One way to circumvent this issue is to employ the kernel trick by using the feature vector $\mathbf{z}_L$. We will address this solution in future works.
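Proposition 2's layer-by-layer inversion can be sketched in NumPy as follows; the dimensions are illustrative and chosen so that each random $\mathbf{W}_l$ is full-column rank with probability one:

```python
import numpy as np

rng = np.random.default_rng(3)
relu = lambda a: np.maximum(a, 0.0)

P, dims = 4, [8, 16, 32]        # n_l doubles, so each W_l stays full-column rank
x = rng.standard_normal(P)

# forward pass: z_l = g(V_l W_l z_{l-1}), feature dimension 2*n_l per layer
z, Ws = x, []
for n_in, n_out in zip([P] + [2 * d for d in dims[:-1]], dims):
    W = rng.standard_normal((n_out, n_in))
    V = np.vstack([np.eye(n_out), -np.eye(n_out)])
    Ws.append(W)
    z = relu(V @ W @ z)

# invert layer by layer using the lossless flow property:
# [I -I] g(V t) = t, so z_{l-1} = W_l^dagger (z_l[:n] - z_l[n:])
y = z
for W in reversed(Ws):
    n = W.shape[0]
    y = np.linalg.pinv(W) @ (y[:n] - y[n:])
assert np.allclose(y, x)        # the input is recovered exactly
```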

### 3.1 Improving the feature vector of other methods

Note that the feature vector $\mathbf{z}_1$ in (6) can be any feature vector that is used for linear projection to the target in any other learning method. In Section 3, we assumed $\mathbf{z}_1$ to be the feature vector constructed from $\mathbf{x}$ using the matrix $\mathbf{V}_1\mathbf{W}_1$; therefore, the regularization parameter $\epsilon_1$ was derived to guarantee performance improvement compared to the least-squares method, as shown in (14). A potential extension is to build the proposed HNF using the feature vector from other methods that employ linear projection to estimate the target. For example, the extreme learning machine (ELM) uses a linear projection of a nonlinear feature vector to predict the target. In the following, we build the proposed HNF by employing the feature vector used in ELM to improve its performance.

Similar to equation (11), we can show that it is possible to improve the feature vector of ELM by using the proposed HNF. Consider $\mathbf{z}_1 = \mathbf{h}(\mathbf{W}_1\mathbf{x})$ to be the feature vector used in ELM for linear projection to the target. In the ELM framework, $\mathbf{W}_1$ is an instance of a normal distribution, not necessarily full-column rank, and $\mathbf{h}$ can be any activation function, not necessarily ReLU. The optimal mapping to the target in ELM is found by solving the following minimization problem.

$$\mathbf{O}_{1}^{\star} = \underset{\mathbf{O}}{\arg\min} \; \mathbb{E}\left\{ \|\mathbf{t} - \mathbf{O}\mathbf{z}_{1}\|_2^2 \right\}. \qquad (15)$$

Note that this minimization problem has a closed-form solution. We construct a richer feature vector in the second layer of the HNF as

$$\mathbf{z}_2 = \mathbf{g}(\mathbf{V}_2\mathbf{W}_2\,\mathbf{z}_1), \qquad (16)$$

where $\mathbf{W}_2 \in \mathbb{R}^{n_2 \times n_1}$ is full-column rank and $\mathbf{V}_2 = [\mathbf{I}_{n_2} \;\; -\mathbf{I}_{n_2}]^{\top}$. The optimal mapping to the target by using this feature vector can be found by solving

$$\mathbf{O}_{2}^{\star} = \underset{\mathbf{O}}{\arg\min} \; \mathbb{E}\left\{ \|\mathbf{t} - \mathbf{O}\mathbf{z}_{2}\|_2^2 \right\} \;\; \text{s.t.} \;\; \|\mathbf{O}\|_F^2 \leq \epsilon_2,$$

where $\epsilon_2$ is the regularization parameter. By choosing $\epsilon_2 = 2\|\mathbf{O}_1^{\star}\|_F^2$, we can see that the optimal costs follow the relation

$$\mathbb{E}\left\{ \|\mathbf{t} - \mathbf{O}_{2}^{\star}\mathbf{z}_{2}\|_2^2 \right\} \leq \mathbb{E}\left\{ \|\mathbf{t} - \mathbf{O}_{1}^{\star}\mathbf{z}_{1}\|_2^2 \right\}, \qquad (17)$$

where the equality happens only when $\mathbf{O}_2^{\star} = \mathbf{O}_1^{\star}\mathbf{W}_2^{\dagger}[\mathbf{I}_{n_2} \;\; -\mathbf{I}_{n_2}]$. Otherwise, the inequality is strict.

Similarly, we can continue to add more layers to improve the performance. Specifically, for the $l$'th layer of the HNF, we have $\mathbf{z}_l = \mathbf{g}(\mathbf{V}_l\mathbf{W}_l\,\mathbf{z}_{l-1})$, and we can show that equation (11) holds here as well when the matrices $\{\mathbf{W}_l\}_{l \geq 2}$ are full-column rank.

## 4 Experimental Results

In this section, we carry out experiments to validate the performance improvement and observe the effect of using the matrix $\mathbf{V}_l$ in the architecture of an HNF. We report our results for three popular datasets in the literature, as in Table 1. Note that we only choose datasets where the input dimension is not very large, due to the computational complexity of the growing feature dimension. The optimization method used for solving the minimization problem (9) is the alternating direction method of multipliers (ADMM) [2]. The step size of the ADMM algorithm is fixed, and the number of iterations of ADMM is set to 100 in all the simulations. The weight matrix $\mathbf{W}_l$ in every layer is an instance of a Gaussian matrix of appropriate size with independently drawn entries, which ensures that it is full-column rank with probability one. For simplicity, the number of nodes is chosen according to $n_l = 2n_{l-1}$ for $l \geq 2$ in all the experiments. The number of nodes in the first layer, $n_1$, is chosen for each dataset individually such that it satisfies $n_1 \geq P$ for every dataset with input dimension $P$, as reported in Table 1.
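The paper solves the norm-constrained problem (9) with ADMM. As a lightweight stand-in, the same constrained least-squares problem can be sketched with projected gradient descent; `constrained_ls` below is an illustrative implementation under assumed toy data, not the authors' code:

```python
import numpy as np

def constrained_ls(Z, T, eps, steps=2000):
    """min_O ||T - O Z||_F^2  s.t.  ||O||_F^2 <= eps, via projected gradient descent."""
    O = np.zeros((T.shape[0], Z.shape[0]))
    lr = 1.0 / (2.0 * np.linalg.norm(Z, 2) ** 2)   # 1/L for this quadratic cost
    for _ in range(steps):
        O = O - lr * 2.0 * (O @ Z - T) @ Z.T       # gradient step
        nrm = np.linalg.norm(O)                    # Frobenius norm
        if nrm ** 2 > eps:
            O = O * (np.sqrt(eps) / nrm)           # project onto the Frobenius ball
    return O

# toy usage: with a generous eps the solution approaches plain least-squares
rng = np.random.default_rng(0)
O_true = rng.standard_normal((2, 5))
Z = rng.standard_normal((5, 50))
T = O_true @ Z                                     # noiseless targets
O_hat = constrained_ls(Z, T, eps=10.0 * np.linalg.norm(O_true) ** 2)
assert np.allclose(O_hat, O_true, atol=1e-4)
```

For a tight constraint the projection becomes active and the solver returns the norm-limited minimizer instead; ADMM is preferred in the paper for its faster convergence on large problems.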

We carry out two sets of experiments. First, we implement the proposed HNF with a fixed number of layers and show performance improvement throughout the layers. In this setup, the only hyperparameter that needs to be chosen is the number of nodes in the first layer, $n_1$. Note that the regularization parameter $\epsilon_1$ is chosen such that it guarantees (14) and therefore eliminates the need for cross-validation in the first layer. Second, we build the proposed HNF by using the ELM feature vector in the first layer, as in (16), and show the performance improvement throughout the layers. In this setup, the only hyperparameter that needs to be chosen is the number of nodes in the first layer, $n_1$, which is exactly the number of nodes of the ELM. Note that the regularization parameter $\epsilon_2$ is chosen such that it guarantees (17), and therefore eliminates the need for cross-validation. Finally, we present the corresponding state-of-the-art results in Table 1.

The performance results of the proposed HNF with $L = 5$ layers are reported in Table 1. We report test classification accuracy as a measure to evaluate the performance. Note that the number of neurons $n_1$ in the first layer of HNF is chosen appropriately for each dataset such that it satisfies $n_1 \geq P$; for the MNIST dataset, for example, this requires $n_1 \geq 784$. The performance improvement in each layer of HNF is given in Figure 1, where train and test classification accuracy is shown versus the total number of nodes in the network. Note that the total number of nodes being zero corresponds to a direct mapping of the input to the target using least-squares, according to (13). It can be seen that the proposed HNF provides a substantial improvement in performance with a small number of layers.

The corresponding performance for the case of using the ELM feature vector in the first layer of HNF is reported in Table 1. It can be seen that HNF provides a tangible improvement in performance compared to ELM. Note that the number of neurons in the first layer is, in fact, the same as the number of neurons used in ELM. We choose $n_1$ to get the best performance for ELM in every dataset individually. The number of layers in the network is set to $L = 3$ to limit the computational complexity. The performance improvement in each layer of HNF in this case is given in Figure 2, where train and test classification accuracy is shown versus the total number of nodes in the network. Note that the initial point, corresponding to the ELM feature vector alone, is in fact equal to the ELM performance reported in Table 1, which is derived according to (15).

Finally, we compare the performance of the proposed HNF with the state-of-the-art performance for these three datasets. We can see that the proposed HNF provides competitive performance compared to state-of-the-art results in the literature.

## 5 Conclusion

We show that by using a combination of random matrices and ReLU activation functions, it is possible to guarantee a monotonically decreasing training cost as the number of layers increases. The proposed method can be used with any other convex cost function to estimate the target. Note that the same principle applies if, instead of random matrices, we use any other real orthonormal matrices; the discrete cosine transform (DCT), Haar transform, and Walsh-Hadamard transform are examples of this kind. The proposed HNF is a universal architecture in the sense that it can be applied to improve the performance of any other learning method which employs linear projection to predict the target.

### References

- (2006) Pattern recognition and machine learning (information science and statistics). Springer-Verlag, Berlin, Heidelberg. External Links: ISBN 0387310738 Cited by: §1.
- (2011-01) Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3 (1), pp. 1–122. External Links: ISSN 1935-8237, Link, Document Cited by: §4.
- (2013-08) Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell. 35 (8), pp. 1872–1886. External Links: ISSN 0162-8828, Link, Document Cited by: §1.
- (2017) Progressive learning for systematic design of large neural networks. arXiv preprint arXiv:1710.08177. Cited by: §1.
- (2017) A study and comparison of human and deep learning recognition performance under visual distortions. arXiv preprint arXiv:1705.02498. Cited by: §1.
- (2016-07) Deep neural networks with random gaussian weights: a universal classification strategy?. IEEE Trans. Signal Process. 64 (13), pp. 3444–3457. Cited by: §1.
- (2015) Trends in extreme learning machines: a review. Neural Networks 61 (Supplement C), pp. 32 – 48. Cited by: §1.
- (2012-04) Extreme learning machine for regression and multiclass classification. J. Trans. Sys. Man Cyber. Part B 42 (2), pp. 513–529. Cited by: §1, Table 1.
- (2017) Experimental study on extreme learning machine applications for speech enhancement. IEEE Access PP (99). Cited by: §1.
- (2018-08) Mutual information preserving analysis of a single layer feedforward network. 15th International Symposium on Wireless Communication Systems (ISWCS) (), pp. 1–5. External Links: Document, ISSN Cited by: §2.
- (2012) ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105. External Links: Link Cited by: §1.
- (2016-05) Generalizing pooling functions in convolutional neural networks: mixed, gated, and tree. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics 51, pp. 464–472. External Links: Link Cited by: §1.
- (2018-04) Distributed large neural network with centralized equivalence. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (), pp. 2976–2980. External Links: Document, ISSN 2379-190X Cited by: §1.
- (2016) All you need is a good init. Proceedings of International Conference on Learning Representations (ICLR). Cited by: §1.
- (2018) Intriguing Properties of Randomly Weighted Networks: Generalizing While Learning Next to Nothing. arXiv preprint arXiv:1802.00844. Cited by: §1.
- (2015) ImageNet large scale visual recognition challenge. Intl. J. Computer Vision 115 (3), pp. 211–252. Cited by: §1.
- (2019) SSFN: self size-estimating feed-forward network and low complexity design. arXiv preprint arXiv:1905.07111. Cited by: §1, §2.
- (2013) Training recurrent neural networks. Ph.D. Thesis, University of Toronto, Toronto, Ont., Canada, Canada. External Links: ISBN 978-0-499-22066-0 Cited by: §1.
- (2013) Deep neural networks for object detection. Advances in Neural Information Processing Systems (NIPS), pp. 2553–2561. External Links: Link Cited by: §1.
- (2016-04) Extreme learning machine for multilayer perceptron. IEEE Transactions on Neural Networks and Learning Systems 27 (4), pp. 809–821. External Links: Document, ISSN 2162-237X Cited by: Table 1.
- (2017) Mathematics of Deep Learning. arXiv preprint arXiv:1712.04741. Cited by: §1.
- (2013-06) Regularization of neural networks using dropconnect. Proceedings of the 30th International Conference on Machine Learning 28 (3), pp. 1058–1066. External Links: Link Cited by: §1, Table 1.
- (2011-01) Deep learning and its applications to signal and information processing [exploratory dsp]. IEEE Signal Process. Mag. 28 (1), pp. 145–154. Cited by: §1.
- (2015-07) Hierarchical extreme learning machine for unsupervised representation learning. (), pp. 1–8. External Links: Document, ISSN Cited by: §1.