Learning from the Kernel and the Range Space
In this article, a novel approach to learning a complex function which can be written as a system of linear equations is introduced. This learning is grounded upon the observation that solving the system of linear equations by a manipulation in the kernel and the range space boils down to an estimation based on the least squares error approximation. The learning approach is applied to learn a deep feedforward network with full weight connections. The numerical experiments on network learning of synthetic and benchmark data not only show the feasibility of the proposed learning approach but also provide insights into the mechanism of data representation.
1 Introduction
The learning problem in machine intelligence has traditionally been formulated as an optimization task in which an error metric is minimized. For a system of linear equations, because the sample size rarely matches the number of model parameters exactly, an approximation is often sought in the primal solution space or the dual solution space in the least error sense. Such an optimization, particularly one based on minimizing the least squares error, has been a popular choice due to its simplicity and tractability in analysis and implementation. The approach is predominant in engineering applications, as is evident from its pervasive adoption in statistical and network learning.
Owing to the computational effectiveness of the backpropagation algorithm running on the then-limited hardware (see e.g., [1, 2, 3, 4, 5]) and the theoretical establishment of the mapping capability (see e.g., [6, 7, 8, 9]), multilayer neural networks were a popular tool for research and applications in the 1980s. With the advancement of computing facilities in the 1990s-2000s, minimization of the error cost function progressed to more memory-intensive search algorithms utilizing first- and second-order methods of gradient descent (see e.g., [10, 11, 12]). Recently, driven by another leap in computing resources together with the availability of large quantities of data, multilayer neural networks have reemerged as deep learning networks. In view of the more demanding task of processing large quantities of data with a highly complex network function on limited computing resources, such as the operating memory and the level of data vectorization, backpropagation has remained a viable tool for the optimization search.
In this work, we explore the utilization of the Kernel And the Range space (abbreviated as KAR space) for network learning. This approach exploits the approximation property of the kernel and the range space of the system of linear equations for learning the network weights. The main advantage of this approach is that neither descent nor gradient computation is needed for network learning. Moreover, once the network has been initialized, the learning can be calculated in a single pass with no iterative search. The proposed approach can be applied to networks with an arbitrary number of layers. This proposal opens up a new way of solving network and functional learning problems without having to compute the gradient.
2 The Kernel and the Range Space
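Consider the system of linear equations given by

$$\mathbf{X}\boldsymbol{\theta} = \mathbf{y}, \tag{1}$$

where $\mathbf{X} \in \mathbb{R}^{m \times d}$ is the data matrix, $\mathbf{y} \in \mathbb{R}^{m \times 1}$ is the target vector, and $\boldsymbol{\theta} \in \mathbb{R}^{d \times 1}$ is the unknown parameter vector to be solved. The range or image of a matrix is the span of its column vectors. The range of the corresponding matrix transformation is called the column space of the matrix. The kernel or the null space of a linear map is the set of solutions to the homogeneous equation $\mathbf{X}\boldsymbol{\theta} = \mathbf{0}$. In other words, $\boldsymbol{\theta}$ is in the kernel of $\mathbf{X}$ if and only if $\boldsymbol{\theta}$ is orthogonal to each of the row vectors of $\mathbf{X}$.

For an under-determined system (1) where $m < d$, the number of equations is less than the number of unknowns. This gives rise to an infinite number of solutions. However, a least norm solution can be obtained by constraining $\boldsymbol{\theta}$ to the row space of $\mathbf{X}$, i.e., $\hat{\boldsymbol{\theta}} = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{y}$. Here, $\mathbf{X}\mathbf{X}^T$ constitutes the kernel and is also known as the Gram matrix.

For an over-determined system where $m > d$, the equations in (1) are generally unsolvable when a strict equality is desired. However, by pre-multiplying both sides of (1) by $\mathbf{X}^T$, the resulting equation

$$\mathbf{X}^T\mathbf{X}\boldsymbol{\theta} = \mathbf{X}^T\mathbf{y}$$

is called the normal equation, which can be rearranged to give the least squares error solution $\hat{\boldsymbol{\theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$.

Solving for $\boldsymbol{\theta}$ in the system of linear equations of the form (1) in the column space (range) of $\mathbf{X}$ or in the row space (kernel) of $\mathbf{X}$ is equivalent to solving the least squares error approximation problem. Moreover, the resultant solution $\hat{\boldsymbol{\theta}}$ is unique with a minimum-norm value in the sense that $\|\hat{\boldsymbol{\theta}}\|^2 \leq \|\boldsymbol{\theta}\|^2$ for all feasible $\boldsymbol{\theta}$.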
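The two closed-form solutions discussed above can be checked numerically. The following is a minimal NumPy sketch, assuming randomly generated Gaussian matrices of full rank (the sizes and variable names are illustrative and not taken from the experiments). It compares the row-space and column-space solutions with the Moore-Penrose pseudo-inverse and illustrates the minimum-norm property in the under-determined case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Under-determined case (m < d): solve in the row space of X.
X_u = rng.standard_normal((3, 5))
y_u = rng.standard_normal(3)
theta_row = X_u.T @ np.linalg.solve(X_u @ X_u.T, y_u)   # X^T (X X^T)^{-1} y

# Over-determined case (m > d): solve the normal equation.
X_o = rng.standard_normal((8, 3))
y_o = rng.standard_normal(8)
theta_col = np.linalg.solve(X_o.T @ X_o, X_o.T @ y_o)   # (X^T X)^{-1} X^T y

# Both agree with the pseudo-inverse solution when X has full rank.
same_row = np.allclose(theta_row, np.linalg.pinv(X_u) @ y_u)
same_col = np.allclose(theta_col, np.linalg.pinv(X_o) @ y_o)

# Minimum-norm property: adding a kernel (null-space) component keeps
# X theta = y satisfied but can only lengthen theta.
null_vec = np.linalg.svd(X_u)[2][-1]      # unit vector with X_u @ null_vec ~ 0
other = theta_row + 0.5 * null_vec
```

Here `other` is another feasible solution of the under-determined system, and its norm exceeds that of `theta_row` because `theta_row` lies entirely in the row space.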
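The proof is omitted here due to the space constraint. This result for systems with a single output (output containing a single column, $\mathbf{y} \in \mathbb{R}^{m \times 1}$) can be generalized to systems with multiple outputs (output with multiple columns, $\mathbf{Y} \in \mathbb{R}^{m \times q}$) as follows.

Solving for $\boldsymbol{\Theta} \in \mathbb{R}^{d \times q}$ in the system of linear equations of the form

$$\mathbf{X}\boldsymbol{\Theta} = \mathbf{Y}$$

in the column space (range) of $\mathbf{X} \in \mathbb{R}^{m \times d}$ or in the row space (kernel) of $\mathbf{X}$ is equivalent to minimizing the sum of squared errors given by

$$\mathrm{SSE} = \mathrm{trace}\big((\mathbf{X}\boldsymbol{\Theta} - \mathbf{Y})^T(\mathbf{X}\boldsymbol{\Theta} - \mathbf{Y})\big).$$

Moreover, the resultant solution $\hat{\boldsymbol{\Theta}}$ is unique with a minimum-norm value in the sense that $\|\hat{\boldsymbol{\Theta}}\|^2 \leq \|\boldsymbol{\Theta}\|^2$ for all feasible $\boldsymbol{\Theta}$.

Since the trace of $(\mathbf{X}\boldsymbol{\Theta} - \mathbf{Y})^T(\mathbf{X}\boldsymbol{\Theta} - \mathbf{Y})$ is equal to the sum of the squared lengths of the error vectors $\mathbf{X}\boldsymbol{\theta}_i - \mathbf{y}_i$, $i = 1, \cdots, q$, the unique solution $\hat{\boldsymbol{\Theta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ in the column space of $\mathbf{X}$, or $\hat{\boldsymbol{\Theta}} = \mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}\mathbf{Y}$ in the row space of $\mathbf{X}$, not only minimizes this sum but also minimizes each term in the sum. Moreover, since the column and the row spaces are independent, the sum of the individually minimized norms is also minimum.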
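The multiple-output statement can be illustrated in the same way. The sketch below, which again assumes a random full-rank Gaussian data matrix with illustrative sizes, checks that the trace form of the sum of squared errors equals the sum of the per-column squared error lengths, and that solving all outputs at once gives the same weights as solving each output column separately.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 4))      # 10 samples, 4 parameters per output
Y = rng.standard_normal((10, 3))      # 3 output columns

Theta = np.linalg.pinv(X) @ Y         # all outputs solved in one operation
E = X @ Theta - Y                     # error matrix, one column per output

sse_trace = np.trace(E.T @ E)
sse_columns = sum(np.sum(E[:, i] ** 2) for i in range(Y.shape[1]))

# Least squares solved independently for each output column.
Theta_cols = np.column_stack(
    [np.linalg.lstsq(X, Y[:, i], rcond=None)[0] for i in range(Y.shape[1])]
)
```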
The process of solving the algebraic equations under the kernel-and-range (KAR) spaces with implicit least squares error seeking shall be exploited to solve the network learning problem in the following section.
3 Network Learning
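Consider an $n$-layer network (of $h_1$-$h_2$-$\cdots$-$h_{n-1}$-$q$ structure where $q$ is the output dimension) given by

$$\mathbf{G} = f_n\big([\mathbf{1}, f_{n-1}([\mathbf{1}, \cdots f_2([\mathbf{1}, f_1(\mathbf{X}\mathbf{W}_1)]\mathbf{W}_2) \cdots]\mathbf{W}_{n-1})]\mathbf{W}_n\big), \tag{6}$$

where $\mathbf{X} \in \mathbb{R}^{m \times (d+1)}$, $\mathbf{W}_1 \in \mathbb{R}^{(d+1) \times h_1}$, $\mathbf{W}_2 \in \mathbb{R}^{(h_1+1) \times h_2}$, $\cdots$, $\mathbf{W}_{n-1} \in \mathbb{R}^{(h_{n-2}+1) \times h_{n-1}}$, $\mathbf{W}_n \in \mathbb{R}^{(h_{n-1}+1) \times q}$, $\mathbf{G} \in \mathbb{R}^{m \times q}$, $\mathbf{1} = [1, \cdots, 1]^T \in \mathbb{R}^{m \times 1}$, and $\mathbf{Y} \in \mathbb{R}^{m \times q}$ is the target. We shall partition the term $[\mathbf{1}, f_{n-1}(\cdot)]\mathbf{W}_n$ into $\mathbf{1}\mathbf{w}_n^T + f_{n-1}(\cdot)\boldsymbol{\mathcal{W}}_n$ where $\mathbf{w}_n^T \in \mathbb{R}^{1 \times q}$ is the bias row of $\mathbf{W}_n$ and $\boldsymbol{\mathcal{W}}_n \in \mathbb{R}^{h_{n-1} \times q}$ contains the remaining rows. Assuming that $f_n^{-1}$ exists, we can set $\mathbf{G} = \mathbf{Y}$ and apply $f_n^{-1}$ to both sides of (6). By separately considering the bias term and moving it to the left hand side, we can post-multiply both sides of the equation by $\boldsymbol{\mathcal{W}}_n^T$ to get

$$\big[f_n^{-1}(\mathbf{Y}) - \mathbf{1}\mathbf{w}_n^T\big]\boldsymbol{\mathcal{W}}_n^T = f_{n-1}([\mathbf{1}, \cdots]\mathbf{W}_{n-1})\,\boldsymbol{\mathcal{W}}_n\boldsymbol{\mathcal{W}}_n^T. \tag{7}$$

Next, by taking advantage of the column-row space manipulation, we arrive at

$$\big[f_n^{-1}(\mathbf{Y}) - \mathbf{1}\mathbf{w}_n^T\big]\boldsymbol{\mathcal{W}}_n^{\dagger} = f_{n-1}([\mathbf{1}, \cdots]\mathbf{W}_{n-1}), \tag{8}$$

where $\dagger$ denotes the pseudo-inverse operation. This pseudo-inverse can take the form of a left or a right inverse depending on the matrix rank condition.

This process of inversion, moving the bias term, and kernel-and-range space manipulation is continued until the first layer is reached, where its hidden weights can be written as:

$$\mathbf{W}_1 = \mathbf{X}^{\dagger} f_1^{-1}\Big(\big[f_2^{-1}\big(\cdots \big[f_n^{-1}(\mathbf{Y}) - \mathbf{1}\mathbf{w}_n^T\big]\boldsymbol{\mathcal{W}}_n^{\dagger} \cdots\big) - \mathbf{1}\mathbf{w}_2^T\big]\boldsymbol{\mathcal{W}}_2^{\dagger}\Big). \tag{9}$$

After $\mathbf{W}_1$ has been derived, it can be back-substituted to obtain $\mathbf{W}_2$ as

$$\mathbf{W}_2 = [\mathbf{1}, f_1(\mathbf{X}\mathbf{W}_1)]^{\dagger} f_2^{-1}\Big(\big[f_3^{-1}\big(\cdots \big[f_n^{-1}(\mathbf{Y}) - \mathbf{1}\mathbf{w}_n^T\big]\boldsymbol{\mathcal{W}}_n^{\dagger} \cdots\big) - \mathbf{1}\mathbf{w}_3^T\big]\boldsymbol{\mathcal{W}}_3^{\dagger}\Big). \tag{10}$$

The process is iterated until the weights of the $n$th layer are obtained:

$$\mathbf{W}_n = \big[\mathbf{1}, f_{n-1}([\mathbf{1}, \cdots f_1(\mathbf{X}\mathbf{W}_1) \cdots]\mathbf{W}_{n-1})\big]^{\dagger} f_n^{-1}(\mathbf{Y}). \tag{11}$$

Based on the above derivation, the full weights for each layer (i.e., $\mathbf{W}_k$, $k = 1, \cdots, n$) can be obtained in an analytic form when the weights without the bias component (i.e., $\boldsymbol{\mathcal{W}}_k$, $k = 2, \cdots, n$) are known. Here, we propose to use a random initialization of $\boldsymbol{\mathcal{W}}_k$, $k = 2, \cdots, n$, for solving $\mathbf{W}_k$, $k = 1, \cdots, n$, in network learning. We shall call this learning network KARnet for convenience.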
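The layer-by-layer solution described above can be sketched for the two-layer case as follows. This is only an illustrative sketch under several assumptions that are not part of the original description: the activation is `arcsinh` with inverse `sinh` (chosen because the pair is invertible on all reals, in place of the modified softplus used in the experiments), the initial output weights including the bias row are drawn from a standard Gaussian, and the function names `karnet_fit_2layer` and `karnet_predict` are hypothetical.

```python
import numpy as np

# Assumed activation pair: arcsinh / sinh, invertible on all reals.
f, f_inv = np.arcsinh, np.sinh

def with_bias(H):
    """Prepend a column of ones, i.e. H -> [1, H]."""
    return np.hstack([np.ones((H.shape[0], 1)), H])

def karnet_fit_2layer(X, Y, h, rng):
    """Two-layer sketch: solve W1 from random output weights, then re-solve W2."""
    Xb = with_bias(X)                                  # m x (d+1), bias column included
    q = Y.shape[1]
    W2_init = rng.standard_normal((h + 1, q))          # random initialization (assumption)
    w2_bias, W2_nobias = W2_init[:1], W2_init[1:]      # bias row, non-bias block
    # Target for the hidden-layer output: [f^{-1}(Y) - 1 w2^T] pinv(W2_nobias)
    T = (f_inv(Y) - w2_bias) @ np.linalg.pinv(W2_nobias)    # m x h
    W1 = np.linalg.pinv(Xb) @ f_inv(T)                 # first-layer weights, (d+1) x h
    A = with_bias(f(Xb @ W1))                          # [1, f1(X W1)], m x (h+1)
    W2 = np.linalg.pinv(A) @ f_inv(Y)                  # output weights, (h+1) x q
    return W1, W2

def karnet_predict(X, W1, W2):
    """Forward pass of the two-layer network."""
    return f(with_bias(f(with_bias(X) @ W1)) @ W2)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 3))   # synthetic inputs, for illustration only
Y = rng.uniform(0.0, 1.0, size=(50, 2))    # synthetic targets
W1, W2 = karnet_fit_2layer(X, Y, h=4, rng=rng)
G = karnet_predict(X, W1, W2)
```

No gradient is computed at any point; each weight matrix comes from one pseudo-inverse operation, and the same pattern would repeat for additional layers.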
4 Synthetic Data
In this section, we observe the behavior of the proposed network learning on three synthetic data sets with known properties. The first data set represents a regression problem, whereas the second and third data sets are well-known benchmarks for classification. For all three problems, and for the experiments that follow, the activation function chosen for each layer is a modified softplus function, which is used together with its inverse.
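Since the learning relies on applying the inverse of the activation, a softplus-type function is convenient because its inverse has a closed form. The sketch below assumes the standard softplus $\log(1 + e^x)$ rather than the modified variant used in the experiments, and the function names are illustrative.

```python
import numpy as np

def softplus(x):
    """Standard softplus log(1 + e^x), evaluated in a numerically stable way."""
    return np.logaddexp(0.0, x)

def softplus_inv(y):
    """Inverse of the standard softplus, log(e^y - 1), valid for y > 0.

    Rewritten as y + log(1 - e^{-y}) to avoid overflow for large y.
    """
    y = np.asarray(y, dtype=float)
    return y + np.log(-np.expm1(-y))

x = np.linspace(-5.0, 5.0, 11)
x_back = softplus_inv(softplus(x))   # recovers x up to rounding error
```

Note that the inverse of the standard softplus is only defined for positive arguments, so targets and intermediate values would need to lie in its range.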
4.1 Single Dimensional Regression Problem
The first set of synthetic data has been generated using based on
for training. To simulate noisy outputs, a 20% variation from the original values has been incorporated, and 10 trials of the noisy measurements are included for training as well. A two-layer network is adopted to learn the data. Fig. 1(a) shows the learning results for all eight training data points when the two-layer network uses six hidden nodes (i.e., a 6-1 structure). This is an over-determined system since there are more data samples than effective parameters. Fig. 1(b) shows the results when eight hidden nodes are used for the two-layer network (i.e., an 8-1 structure). This is an under-determined system since there are fewer data samples than effective parameters (the total includes the weight of the bias). Here, we note that for the two-layer network, the system size is determined by the dimension of the matrix nearest to the output and its rank.
Next, a five-layer network is used to learn the same set of 1D data. Fig. 2(a) shows the results for a network of 1-1-1-6-1 structure. This constitutes an over-determined case, where there are fewer effective parameters than data samples. Fig. 2(b) shows the results for a network of 1-1-1-8-1 structure, which is under-determined. Here, the system structure of interest is given by
In summary, in the under-determined case both networks fit all data points, including the noisy ones. In the over-determined case, however, the network does not fit every data point because it has too few parameters to model all of them. This example illustrates the fitting behavior of multilayer network learning.
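The distinction between the two regimes can be expressed as a simple count. The helper below is a hypothetical illustration, assuming that the effective number of parameters is the number of nodes in the last hidden layer plus one bias weight, per output, and that each sample contributes one equation per output.

```python
def fitting_regime(num_samples: int, last_hidden_nodes: int, num_outputs: int = 1) -> str:
    """Classify the output-layer system by comparing equations with parameters."""
    effective_params = (last_hidden_nodes + 1) * num_outputs   # hidden nodes plus bias
    equations = num_samples * num_outputs
    if equations > effective_params:
        return "over-determined"
    if equations < effective_params:
        return "under-determined"
    return "determined"

regime_6 = fitting_regime(num_samples=8, last_hidden_nodes=6)   # 6-1 and 1-1-1-6-1 structures
regime_8 = fitting_regime(num_samples=8, last_hidden_nodes=8)   # 8-1 and 1-1-1-8-1 structures
```

With eight samples, six hidden nodes give seven effective parameters (over-determined), whereas eight hidden nodes give nine (under-determined), in line with the cases described above.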
4.2 The XOR Problem
The next example is the well-known XOR problem, which consists of four data points, with one of the data points perturbed by a small value to facilitate numerical stability in learning. A two-layer KARnet with two hidden nodes is adopted to learn the data. For comparison, the feedforwardnet of the Matlab toolbox is adopted with a similar architecture (a two-layer structure with softplus activation) for learning the same set of data using the default training method trainlm. Fig. 3(a) and (b) show respectively the learned decision surfaces for KARnet and feedforwardnet. These results show the capability of KARnet to fit the nonlinear surface and the premature stopping of feedforwardnet for this data set.
Next, we compare the two networks using a five-layer architecture with each layer having two hidden nodes for both networks. Fig. 4(a) and (b) show respectively the decision surfaces for KARnet and feedforwardnet. These results show the fitting capability of KARnet despite the larger number of adjustable parameters and again, the premature stopping of feedforwardnet for this data set.
4.3 The Three-Spiral Problem
In this example, a total of 1500 randomly perturbed data points which form a 3-spiral distribution have been used as the training data. Each spiral arm consists of 500 data points (shown as red, green and blue circles in Fig. 5). A two-layer KARnet with 100 hidden nodes has been adopted for learning these data points, with the respective labels encoded using an indicator matrix. Since there are three classes, the number of output nodes is 3. The learned decision regions (shown in light red, green and blue tones in Fig. 5) demonstrate the mapping capability of KARnet for the three-category problem.
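The indicator-matrix encoding of class labels can be sketched as below. The function names are illustrative, and the use of the largest output to pick the class is a common convention assumed here rather than a detail stated in the text.

```python
import numpy as np

def indicator_matrix(labels, num_classes):
    """Encode integer class labels (0, ..., num_classes - 1) as one row per sample."""
    labels = np.asarray(labels)
    Y = np.zeros((labels.size, num_classes))
    Y[np.arange(labels.size), labels] = 1.0   # a single 1 in the column of the true class
    return Y

def decide(G):
    """Assign each sample to the class whose output node responds most strongly."""
    return np.argmax(G, axis=1)

labels = np.array([0, 2, 1, 2])
Y = indicator_matrix(labels, num_classes=3)   # shape (4, 3), three output nodes
```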
5 Experiments on Real-World Data
5.1 Nursery Data Set
The goal in this database [21, 22] was to rank applications for nursery schools based upon attributes such as the occupation of the parents and the child’s nursery, family structure and financial standing, as well as the social and health picture of the family. The eight input features for the 12960 instances are: ‘parents’ with attributes usual, pretentious, great_pret; ‘has_nurs’ with attributes proper, less_proper, improper, critical, very_crit; ‘form’ with attributes complete, completed, incomplete, foster; ‘children’ with attributes 1, 2, 3, more; ‘housing’ with attributes convenient, less_conv, critical; ‘finance’ with attributes convenient, inconv; ‘social’ with attributes non-prob, slightly_prob, problematic; and ‘health’ with attributes recommended, priority, not_recom. These input attributes are converted into discrete numbers and normalized to the range (0,1]. The output decisions include ‘not_recom’ with 4320 instances, ‘recommend’ with 2 instances, ‘very_recom’ with 328 instances, ‘priority’ with 4266 instances and ‘spec_prior’ with 4044 instances. Since the category ‘recommend’ does not have enough instances for partitioning in 10-fold cross-validation, it is merged into the ‘very_recom’ category. We thus have 4 decision categories for classification.
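One way to carry out the conversion described above is sketched below. The specific mapping (the k-th of K attribute values mapped to k/K) is an assumption chosen because it lands in (0,1]; the text does not state the exact mapping, and the helper names are hypothetical.

```python
def encode_ordinal(value, levels):
    """Map the k-th of K listed attribute values to k / K, which lies in (0, 1]."""
    return (levels.index(value) + 1) / len(levels)

def merge_rare_class(label):
    """Fold the two-instance 'recommend' category into 'very_recom'."""
    return "very_recom" if label == "recommend" else label

PARENTS = ["usual", "pretentious", "great_pret"]
FINANCE = ["convenient", "inconv"]

encoded_parents = [encode_ordinal(v, PARENTS) for v in PARENTS]
classes = sorted({merge_rare_class(c) for c in
                  ["not_recom", "recommend", "very_recom", "priority", "spec_prior"]})
```

After merging, four decision categories remain, matching the classification setting described above.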
For KARnet’s hidden parameter tuning, an inner 10-fold cross-validation loop using only the training set was adopted to determine the hidden node size among 1, 2, 3, 5, 10, 20, 30, 50, 80, 100, 200, 500. For 3-layer and 4-layer networks, the network structures of -- and --- are adopted respectively ( is the output dimension). The chosen hidden node size is then applied for 10 runs of test evaluation using the outer cross-validation loop. The results for 2-layer, 3-layer, and 4-layer networks are respectively 92.39% at , 92.64% at , and 92.73% at . These results are comparable with 98.89% for the feedforwardnet (, 2-layer) and 91.69% for the TERRM method .
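The inner cross-validation loop for choosing the hidden node size can be sketched as follows. This is a generic illustration: `score_fn` is a hypothetical callable that trains a model on the given training indices with `h` hidden nodes and returns its validation accuracy, and the plain random fold assignment shown here does not perform stratification.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Randomly partition the indices 0..n-1 into k roughly equal folds."""
    return np.array_split(rng.permutation(n), k)

def select_hidden_size(n_train, candidates, score_fn, k=10, seed=0):
    """Return the candidate hidden size with the best mean validation score."""
    rng = np.random.default_rng(seed)
    folds = kfold_indices(n_train, k, rng)
    mean_scores = []
    for h in candidates:
        scores = []
        for i in range(k):
            val_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(score_fn(train_idx, val_idx, h))
        mean_scores.append(np.mean(scores))
    return candidates[int(np.argmax(mean_scores))]

CANDIDATES = [1, 2, 3, 5, 10, 20, 30, 50, 80, 100, 200, 500]

# Dummy scoring function for demonstration only: it peaks at h = 20.
best_h = select_hidden_size(100, CANDIDATES, lambda tr, va, h: -abs(h - 20))
```

In an actual evaluation, this selection would be run on the training portion of each outer fold only, so that the test portion plays no role in choosing the hidden node size.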
5.2 Letter Recognition
The data set comes with 20,000 samples, each with 16 feature attributes. The goal is to recognize the 26 capital letters in the English alphabet based on a large number of black-and-white rectangular pixel displays. The character images consist of 20 different fonts where each letter within these 20 fonts was randomly distorted to produce a large pool of unique stimuli [24, 22]. Each stimulus was converted into 16 primitive numerical attributes such as the statistical moments and the edge counts. These attributes were then scaled to fit into a range of integer values from 0 to 15.
Similar to the above evaluation setting, 10 trials of 10-fold stratified cross-validation have been performed for classifying the 26 categories. The results for 2-layer, 3-layer, and 4-layer KARnets are respectively 88.99%, 94.32%, and 94.12%, all at . The feedforwardnet (2-layer) encounters “out of memory” for the current computing platform (Intel i7-6500U CPU at 2.59GHz with 8G of RAM).
5.3 Optical Recognition of Handwritten Digits
This data set was collected from a total of 43 people, of whom 30 contributed to the training set and a different 13 to the test set [25, 22]. The original 32×32 bitmaps were divided into non-overlapping blocks of 4×4, and the number of on pixels was counted within each block. This generated an input matrix of 8×8 where each element was an integer in the range [0, 16]. The dimensionality is thus reduced (from 32×32 to 64) and the resulting image is invariant to minor distortions. The numbers of samples collected for training and testing are 3823 and 1797, respectively. In our experiment, these two sets of data (training and test) are combined for running 10 trials of 10-fold cross-validation tests. Fig. 6 shows some samples of the image data taken from the training set and the testing set, respectively.
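The block-counting step that produces the 64-dimensional input can be written compactly with a reshape. The sketch below assumes a binary 32×32 NumPy array as input; the function name is illustrative.

```python
import numpy as np

def block_counts(bitmap, block=4):
    """Count the on pixels in each non-overlapping block x block tile."""
    bitmap = np.asarray(bitmap)
    n = bitmap.shape[0] // block
    # Axes after the reshape: (tile row, row within tile, tile column, column within tile).
    return bitmap.reshape(n, block, n, block).sum(axis=(1, 3))

bitmap = np.zeros((32, 32), dtype=int)
bitmap[5, 30] = 1                        # a single on pixel
features = block_counts(bitmap)          # 8 x 8 matrix of counts in [0, 16]
feature_vector = features.reshape(-1)    # 64-dimensional input
```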
The results for 2-layer, 3-layer, and 4-layer networks are respectively 97.25% at , 97.17% at , and 96.96% at . These results are comparable with the 96.81% for the TERRM method . The feedforwardnet (2-layer) encounters “out of memory” for the current computing platform.
5.4 Comparison with State-of-the-Art Methods
The results of KARnet are compared with several state-of-the-art methods namely, the Reduced Multivariate Polynomial method (RM, ), the Total Error Rate method adopting RM (TERRM, ), the Support Vector Machines adopting Polynomial (SVM-Poly, ) and Radial Basis Function (SVM-Rbf, ) kernels, and the feedforwardnet (2-layer) from the Matlab toolbox , all running under the similar 10 trials of 10-fold cross-validation protocol. Table 1 shows that the proposed KARnet has comparable prediction accuracy with state-of-the-art methods. While the SVMs had been tuned by adjusting the kernel parameters (such as the order in the polynomial kernel, and the Gaussian width in the radial basis kernel), the proposed network had been tuned by adjusting the number of hidden nodes () in each layer according to the structures -- and ---.
FFnet: Feedforwardnet from Matlab.
OM: Out of memory for the current computing platform.
6 Conclusion
In this article, the solution based on the manipulation of the kernel and the range space has been found to be equivalent to that obtained by the least squares error estimation. By exploiting this observation, a learning approach based on the kernel and the range space manipulation has been introduced. The approach solves the system of linear equations directly by exploiting the row and the column spaces, without the need for error formulation and gradient descent. The application of the learning approach to deep network learning validated its feasibility. The learning results on synthetic and real-world data provided not only numerical evidence but also insights into the fitting mechanism. This opens up vast possibilities along this research direction.
Acknowledgment
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (Grant number: NRF-2015R1D1A1A09061316).
- According to (9)-(11), a Moore-Penrose inverse operation is taken over the matrices (i.e., $\mathbf{X}$, $[\mathbf{1}, f_1(\mathbf{X}\mathbf{W}_1)]$, $\cdots$, $[\mathbf{1}, f_{n-1}(\cdot)]$) to relate the weight solutions. For the 2-layer network with $m$ samples, $h_1$ hidden nodes and $q$ outputs, the matrix which is nearest to the output is of size $m \times (h_1+1)$. Here, the minimum number of hidden nodes required for this matrix to be invertible is $h_1 = m-1$. This in turn gives rise to an output weight of size $m \times q$ for $\mathbf{W}_2$. Hence, the effective number of parameters (or adjustable parameters) needed for data representation hinges upon the number of output weights, which corresponds to the sample size and the output dimension (i.e., $m \times q$).
- H. J. Kelley, “Gradient theory of optimal flight paths,” ARS Journal, vol. 30, no. 10, pp. 947–954, 1960.
- P. J. Werbos, “Beyond regression: New tools for prediction and analysis in the behavioral sciences,” Ph.D. dissertation, Harvard University, 1974.
- P. J. Werbos, “Backpropagation: Past and future,” in Proceedings of the IEEE International Conference on Neural Networks (ICNN), 1988.
- P. J. Werbos, “Backpropagation through time: What it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
- S. O. Haykin, Neural Networks and Learning Machines. New York: Prentice Hall, 2009.
- K.-I. Funahashi, “On the approximate realization of continuous mappings by neural networks,” Neural Networks, vol. 2, no. 3, pp. 183–192, 1989.
- K. Hornik, M. Stinchcombe, and H. White, “Multi-layer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
- G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, pp. 303–314, 1989.
- R. Hecht-Nielsen, “Kolmogorov’s mapping neural network existence theorem,” in Proceedings of IEEE First International Conference on Neural Networks (ICNN), vol. III, 1987, pp. 11–14.
- R. Battiti, “First and second order methods for learning: Between steepest descent and Newton’s method,” Neural Computation, vol. 4, no. 2, pp. 141–166, 1992.
- P. P. van der Smagt, “Minimisation methods for training feedforward neural networks,” Neural Networks, vol. 7, no. 1, pp. 1–11, 1994.
- E. Barnard, “Optimization for training neural nets,” IEEE Trans. on Neural Networks, vol. 3, no. 2, pp. 232–240, 1992.
- I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
- W. R. Madych, “Solutions of underdetermined systems of linear equations,” in Spatial Statistics and Imaging, Lecture Notes – Monograph Series, vol. 20. Institute of Mathematical Statistics, 1991, pp. 227–238.
- S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge: Cambridge University Press, 2004.
- A. Albert, Regression and the Moore-Penrose Pseudoinverse. New York: Academic Press, Inc., 1972, vol. 94.
- A. Ben-Israel and T. N. E. Greville, Generalized Inverses: Theory and Applications, 2nd ed. New York: Springer-Verlag, 2003.
- S. L. Campbell and C. D. Meyer, Generalized Inverses of Linear Transformations, (SIAM edition of the work published by Dover Publications, Inc., 1991) ed. Philadelphia, USA: Society for Industrial and Applied Mathematics, 2009.
- G. Strang, Introduction to Linear Algebra, 5th ed. Wellesley: Wellesley-Cambridge Press, 2016.
- R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: John Wiley & Sons, Inc, 2001.
- M. Olave, V. Rajkovic, and M. Bohanec, “An application for admission in public school systems,” in Expert Systems in Public Administration, I. Th. M. Snellen, W. B. H. J. van de Donk, and J.-P. Baquiast, Eds. North Holland: Elsevier Science Publishers, 1989, pp. 145–160.
- M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
- K.-A. Toh and H.-L. Eng, “Between classification-error approximation and weighted least-squares learning,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 658–669, April 2008.
- P. W. Frey and D. J. Slate, “Letter recognition using Holland-style adaptive classifiers,” Machine Learning, vol. 6, no. 2, pp. 161–182, 1991.
- C. Kaynak, “Methods of combining multiple classifiers and their applications to handwritten digit recognition,” Master’s thesis, Institute of Graduate Studies in Science and Engineering, Bogazici University, 1995.
- K.-A. Toh, Q.-L. Tran, and D. Srinivasan, “Benchmarking a reduced multivariate polynomial pattern classifier,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 740–755, 2004.
- C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- The MathWorks, “MATLAB and Simulink,” 2017. [Online]. Available: http://www.mathworks.com/
- K.-A. Toh, “Deterministic neural classification,” Neural Computation, vol. 20, no. 6, pp. 1565–1595, June 2008.