Feedforward Supervised Learning
for Deep Self-Organizing Map Networks
In this study, we propose a novel deep neural network and its supervised learning method that uses a feedforward supervisory signal. The method is inspired by the human visual system and performs human-like association-based learning without any backward error propagation. The feedforward supervisory signal, which produces the correct result, precedes the target signal and associates its confirmed label with the classification result of the target signal. The method effectively uses a large amount of information from the feedforward signal, and forms a continuous and rich learning representation. It is validated using visual recognition tasks on the MNIST handwritten dataset.
Keywords: supervised learning, deep learning, self-organizing map
1 Introduction
A multilayered deep neural network is one of the most powerful methods for human-like recognition tasks, such as image [1] and speech recognition [2]. Some previous studies have demonstrated high performance for supervised learning in signal classification tasks [3, 4]. Gradient-based learning rules, in particular back-propagation (BP) learning [5], are generally used for supervised learning in feedforward networks. However, the amount of supervisory information in the last layer is not sufficient to supervise the entire deep neural network because the information is selected and reduced from layer to layer. This tendency is more serious in pattern discrimination tasks because the information is limited to the discrete values of the discriminant label output. Bengio et al. proposed a stacked auto-encoder that secures the amount of information in the error signals by reconstructing the input and using layer-wise learning [6]. However, layer-wise learning proceeds step by step, which makes incremental learning and online updating difficult. Some previous studies have used unsupervised learning that does not use prior information about the data structure, and reported self-organizing behavior and good discrimination results in a very deep neural network [1, 7]. However, unsupervised learning cannot control the classification of input data, which results in enlargement of the network and low learning efficiency.
In this study, we propose a novel learning method for deep neural networks that uses feedforward propagated supervisory signals. The method effectively uses a large amount of information from the feedforward propagated supervisory signal, which enables robust learning in a deep neural network. It associates the classification of new input with that of pre-trained input, and revises the internal representation of the entire neural network. We validate the proposed learning method using a numerical simulation of visual pattern discrimination tasks.
2 Network Model
The network model was inspired by the human visual system in the cortex. The network is composed of self-organizing map (SOM) modules [8]. Each SOM module consists of one hundred neurons, and receives a subset of the output from the corresponding location of the previous layer. The connection is similar to a receptive field (RF) in recent convolutional neural networks, but is not convolutional; that is, weights are not shared among modules in a layer.
Each neuron calculates an inner product between the weight and input as follows:
$$\mathbf{u}^{(l)}(t) = W^{(l)}\,\mathbf{y}^{(l-1)}(t) \tag{1}$$
where $\mathbf{u}^{(l)}(t)$ is the inner product of the $l$-th layer at time $t$, $W^{(l)}$ is the weight matrix, and $\mathbf{y}^{(l-1)}(t)$ is the output vector of the previous layer. The inner product is then processed using winners-share-all (WSA) regularization in each module. WSA is a variant of winner-takes-all (WTA), and involves not only the winning neuron, but also neighboring neurons. The neuron that has the most prominent inner product is selected as the winning neuron, and outputs 1.0. Neighboring neurons output a distance-decayed value determined using the Gaussian kernel. The output of the $l$-th layer is described as follows:
$$y_i^{(l)}(t) = \exp\!\left(-\frac{d_i^2}{2\sigma^2}\right) \tag{2}$$
where $y_i^{(l)}(t)$ is the $i$-th element of the output vector, and $d_i$ is the spatial distance from the winning neuron to the $i$-th neuron. The spatial decay factor $\sigma$ is determined empirically, and set to .
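The forward computation of a single module described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the 10×10 arrangement of the one hundred neurons, the Euclidean grid distance, the form of the Gaussian kernel, and the value of `sigma` are all assumptions made here, and the function name `wsa_output` is hypothetical.

```python
import math
import random

GRID = 10  # assumption: the 100 neurons of one SOM module lie on a 10x10 grid


def wsa_output(weights, x, sigma=1.0):
    """Winners-share-all output of one module (illustrative sketch).

    weights: one weight vector per neuron, x: input vector to the module.
    sigma is a placeholder value, not the one used in the paper.
    """
    # Inner product between each neuron's weight vector and the input.
    u = [sum(w_k * x_k for w_k, x_k in zip(w, x)) for w in weights]
    # The neuron with the largest inner product is the winner.
    winner = max(range(len(u)), key=u.__getitem__)
    wr, wc = divmod(winner, GRID)
    y = []
    for i in range(len(u)):
        r, c = divmod(i, GRID)
        d2 = (r - wr) ** 2 + (c - wc) ** 2  # squared grid distance to the winner
        y.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    return winner, y


# Usage with random weights and a random 6x6 (36-element) input patch.
random.seed(0)
weights = [[random.random() for _ in range(36)] for _ in range(GRID * GRID)]
x = [random.random() for _ in range(36)]
winner, y = wsa_output(weights, x)
print(y[winner])  # the winning neuron outputs 1.0
```

Because the distance from the winner to itself is zero, the winner's output is exactly 1.0, and every other neuron receives a value that decays with its grid distance from the winner.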
Fig. 1(a) shows a schematic of the network structure for the experiment. As shown in Table 1, the first layer consists of 49 SOMs with 4,900 neurons, and each neuron receives a 6×6-pixel patch of the 28×28-pixel input image, which results in a total of 176,400 connections in the layer.
|Layer|Maps|# of Neurons|RF|Stride|# of Connections|
|---|---|---|---|---|---|
|1|7×7|4,900|6×6 pixels|4×4 pixels|176,400|
|2|5×5|2,500|3×3 maps|1×1 map|2,250,000|
|3|5×5|2,500|5×5 maps|1×1 map|6,250,000|
|4|5×5|2,500|5×5 maps|1×1 map|6,250,000|
|5|1×1|100|5×5 maps|1×1 map|250,000|
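The neuron and connection counts in Table 1 can be checked with simple arithmetic. The sketch below assumes the table entries are read as side lengths (for example 7×7 maps and a 6×6-pixel RF in the first layer, 3×3 maps in the second) and that each map contributes its 100 neurons as inputs to the next layer; the helper name `layer_size` is hypothetical.

```python
NEURONS_PER_MAP = 100  # each SOM module has one hundred neurons

# (layer, maps per side, RF side length, inputs contributed by each RF unit)
layers = [
    (1, 7, 6, 1),                # RF measured in pixels: 6x6 pixels
    (2, 5, 3, NEURONS_PER_MAP),  # RF measured in maps: 3x3 maps of 100 neurons
    (3, 5, 5, NEURONS_PER_MAP),
    (4, 5, 5, NEURONS_PER_MAP),
    (5, 1, 5, NEURONS_PER_MAP),
]


def layer_size(maps_per_side, rf_side, inputs_per_unit):
    """Return (number of neurons, number of connections) for one layer."""
    neurons = maps_per_side ** 2 * NEURONS_PER_MAP
    fan_in = rf_side ** 2 * inputs_per_unit  # inputs seen by a single neuron
    return neurons, neurons * fan_in


for layer, maps_per_side, rf_side, unit in layers:
    neurons, connections = layer_size(maps_per_side, rf_side, unit)
    print(layer, neurons, connections)
```

Under this reading, the first layer gives 4,900 neurons × 36 inputs = 176,400 connections, and the remaining rows reproduce the values in the table.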
We use traditional unsupervised competitive learning for pre-training [8, 9]. It updates the weight of the most prominent neuron and its neighbors, and forms a two-dimensional spatial structure of the template sets for the input pattern. The update rule is described as follows:
$$\Delta\mathbf{w}_i = \alpha\, y_i\,(\mathbf{x} - \mathbf{w}_i) \tag{3}$$
where $\mathbf{w}_i$ is the weight vector, which is the $i$-th row of the weight matrix $W$, $\mathbf{x}$ is the input vector to the module, $\alpha$ is the learning coefficient, and $y_i$ is the WSA output of the $i$-th neuron. As in traditional SOMs, the learning coefficient $\alpha$ decreases linearly from 1.00 at the beginning to 0.00 at the end, and the standard deviation of the Gaussian kernel also decreases, from 3.5 to 0.0. The weight vector is normalized by its L2-norm at every update. The method generates a spatially continuous feature map, which is similar to the map generated using topographic independent component analysis [10].
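One pre-training update for a single module can be sketched as follows. The exact form of the update is an assumption here: the code uses the standard SOM-style step, which moves each weight vector toward the input in proportion to its WSA output, followed by the L2 normalization described above. The names `competitive_update` and `schedule` are hypothetical.

```python
import math


def competitive_update(weights, x, y, alpha):
    """Move each weight vector toward x, scaled by its WSA output y[i],
    then renormalize it to unit L2-norm (assumed SOM-style rule)."""
    for i, w in enumerate(weights):
        step = alpha * y[i]
        new_w = [w_k + step * (x_k - w_k) for w_k, x_k in zip(w, x)]
        norm = math.sqrt(sum(v * v for v in new_w))
        if norm > 0.0:
            new_w = [v / norm for v in new_w]
        weights[i] = new_w
    return weights


def schedule(step, total_steps, start, end):
    """Linear decay from start (first update) to end (last update)."""
    frac = step / max(total_steps - 1, 1)
    return start + (end - start) * frac


# Usage: learning coefficient 1.00 -> 0.00 and kernel width 3.5 -> 0.0
# over 2,500 iterations, as described for the first layer.
total = 2500
alpha_first = schedule(0, total, 1.0, 0.0)
alpha_last = schedule(total - 1, total, 1.0, 0.0)
sigma_first = schedule(0, total, 3.5, 0.0)
print(alpha_first, alpha_last, sigma_first)
```

Note that a kernel width of exactly 0.0 at the final step would need special handling in a Gaussian kernel (only the winner would be updated); how the original code treats that limit is not stated.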
Pre-training was performed in a layer-wise manner, which is similar to a biological critical period. Initially, only the first layer was learned, with 2,500 iterations. Next, 2,500 iterations were applied to the first and second layer. Finally, the first four layers were processed, applying 10,000, 7,500, 5,000 and 2,500 iterations sequentially. The last layer was not processed by the pre-training. Fig. 1(b) shows a typical example of the generated feature maps in the first layer.
4 Advance Propagation Learning
Advance propagation (AP) learning is a supervised learning method that realizes a feedforward supervisory signal using the sparse dynamics of the network. It is based on learning vector quantization (LVQ) [11], but requires an additional advance input as a supervisory signal to specify the learning ‘location’. Before the target input is processed, an advance input that produces the required classification label is propagated through the entire network. Then, the target input is processed under the after-effect of the advance input. The after-effect guides the propagation along the correct ‘path’, and specifies the learning ‘location’ in the network. The point is that advance propagation does not restrict the propagation ‘path’, but only suggests it. It merges the various paths taken by various types of inputs into the same internal learning representation. LVQ-like conditional learning following the target input specifies the learning ‘direction’, thereby revising the weight vector to produce the required label.
A learning trial is processed as follows. First, the target input is processed by each module in the network, and the output label of the network is checked. If it corresponds to the required label, then the weights of the activated units are updated by competitive learning as in Eq. 3 (Fig. 2(a)). Otherwise, the weights are updated in the direction opposite to Eq. 3 (Fig. 2(b)), and AP learning is then evoked. In AP learning, the advance input that produces the required label output is first processed by the same network, which results in the required label output at time $t-1$ (Fig. 2(c), left). Subsequently, the target input is processed again with the after-effect of the advance input (Fig. 2(c), right) as follows:
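$$\hat{\mathbf{x}}(t) = \mathbf{x}(t) + \beta\,\mathbf{x}(t-1) \tag{4}$$
where $\mathbf{x}(t)$ is the input to a module produced by the target input, $\mathbf{x}(t-1)$ is the input produced by the preceding advance input, and $\beta$ is the ratio of the after-effect of the advance input. The vector $\hat{\mathbf{x}}(t)$ represents the direction of the feature vector corrected by the after-effect of the advance input $\mathbf{x}(t-1)$. The important point is that the network behaves in a highly nonlinear manner because of WSA, so the output is not equal to that produced by a linear summation of the two inputs. The subsequent competitive learning uses the combined input $\hat{\mathbf{x}}(t)$ in the same manner as Eq. 3. Consequently, the full version of the equation with multi-layer decay and a Gaussian kernel for the WSA output is described as follows:
$$\Delta\mathbf{w}_i^{(l)} = \alpha\,\lambda^{L-l}\exp\!\left(-\frac{d_i^2}{2\sigma^2}\right)\left(\hat{\mathbf{x}}^{(l)}(t) - \mathbf{w}_i^{(l)}\right) \tag{5}$$
where $\lambda$ is the decay coefficient from layer to layer, $L$ is the total number of layers, $\alpha$ is the learning coefficient, $d_i$ is the spatial distance from the winning neuron to the $i$-th neuron, and $\sigma$ is the spatial decay factor of the Gaussian kernel. The weight vector is normalized by its L2-norm at every update, as in traditional competitive learning. We used in the experiments.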
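The control flow of one learning trial can be illustrated with a toy, single-module sketch. Everything specific in it is an assumption made for illustration: the module has only two neurons with fixed labels, only the winning neuron is updated (no Gaussian neighbourhood, no layer-wise decay), the after-effect is modelled as adding a fraction `beta` of the advance input to the target input, and the values of `alpha` and `beta` are placeholders rather than those used in the experiments. The function names are hypothetical.

```python
import math


def normalize(v):
    """Scale v to unit L2-norm (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v] if n > 0.0 else v


def winner(weights, x):
    """Index of the neuron with the largest inner product with x."""
    u = [sum(a * b for a, b in zip(w, x)) for w in weights]
    return max(range(len(u)), key=u.__getitem__)


def update(weights, i, x, alpha, sign):
    """Signed competitive step for neuron i, then L2 normalization."""
    w = weights[i]
    weights[i] = normalize([a + sign * alpha * (b - a) for a, b in zip(w, x)])


def ap_trial(weights, labels, target, required, advance, alpha=0.1, beta=0.5):
    """One learning trial on a toy module; returns which branch was taken."""
    i = winner(weights, target)
    if labels[i] == required:
        update(weights, i, target, alpha, +1.0)  # reinforce the correct winner
        return "correct"
    update(weights, i, target, alpha, -1.0)      # push the wrong winner away
    # Re-process the target under the after-effect of the advance input.
    combined = [t + beta * a for t, a in zip(target, advance)]
    j = winner(weights, combined)
    update(weights, j, combined, alpha, +1.0)    # learn at the guided location
    return "ap"


# Usage: the target is first classified as label 0, but label 1 is required.
weights = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
outcome = ap_trial(weights, labels, [0.8, 0.6], 1, advance=[0.0, 1.0])
print(outcome)
```

In this toy run the target initially activates the neuron labelled 0, so the trial takes the AP branch: the wrong winner is pushed away, and the neuron selected under the after-effect of the advance input is moved toward the combined input.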
To validate the proposed learning method, we performed a discrimination test on the MNIST handwritten image dataset (10 digits, 28×28 pixels, grayscale) [12]. AP learning was applied to the pre-trained network. The most matched output of the last layer in the pre-training result was selected for each label. Advance inputs as supervisory signals were dynamically determined, and updated from one trial to the next. Each input signal was initially tested using its label, and AP learning was applied if the label was incorrect.
One learning block consisted of 10,000 samples of the training dataset input for learning, and 10,000 samples of the validation dataset only to calculate the error rate. The calculation was performed on a workstation (Opteron 6366 1.8 GHz) using custom C code with OpenMP parallelization.
Fig. 3(a) represents the change in the error rate using AP learning. Initially, the error rate was determined using pre-training with unsupervised competitive learning, which resulted in %. AP learning improved the error rate to % after 20 iterations of the entire training set with the decay coefficient . If the decay coefficient equaled zero, which meant that there was no learning in the upper layer, then the learning stopped at quite an early stage, which resulted in a high error rate ( %). If the coefficient value was non-zero, then learning was processed over the entire network, which resulted in a low error rate. The results demonstrated that the proposed method effectively processed learning over the entire network at the time.
In this paper, we introduced a novel supervised learning method for a deep feedforward neural network, and validated the efficiency for a visual recognition task. The method focused on using the rich input information in the early layer as the supervisory signal in each layer. We demonstrated that the proposed method could operate supervised fine-tuning on the pre-trained multilayered network (Fig. 3(a)). The learning method formed an effective learning representation with continuous features, and also drastically reorganized the representation in the later layers (Fig. 3(b)).
The proposed learning method was applied to the entire network concurrently and not layer by layer. Only the correct/incorrect signal was broadcast throughout the network, and each local module used just the broadcast signal and locally propagated information, which is well suited to highly distributed parallel computing systems. Moreover, the method requires no back-propagated information, which drastically decreases memory usage. This is critical for processing extremely long sequences in deep recurrent networks, so the method could be well suited to applications with long sequences, such as natural language.
Requiring no back propagation makes the method more biologically plausible than classical learning methods. Error back propagation is sometimes criticized as biologically implausible, and there is no evidence of its existence under physiological conditions. The proposed method utilizes only the feedforward signal for association-based supervised learning, and the advance supervisory signal might correspond to association by the Hebbian rule within the time window of spike-timing-dependent plasticity (STDP) [13].
One of the interesting points of the proposed learning method is that it seamlessly incorporates both reinforcement learning [14] and competitive learning. Reinforcement learning emerges if there is no advance input, and traditional competitive learning emerges if there is no correct/incorrect signal. This suggests that these learning methods can share the same hardware implementation, and the learning mode can be selected by the sequence of input and correct/incorrect signals. Moreover, the timing of the correct/incorrect signal can control the associative layer, which might be useful for deeper or recurrent networks.
References
- [1] Le, Q.V., Ranzato, M.A., Monga, R., Devin, M., Chen, K., Corrado, G.S., Dean, J., Ng, A.Y.: Building high-level features using large scale unsupervised learning. Proc. 29th Int. Conf. on Machine Learning (2012).
- [2] Dahl, G.E., Yu, D., Deng, L., Acero, A.: Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Trans. on Audio, Speech, and Lang. Process. 20(1), 30–42 (2012).
- [3] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989).
- [4] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. in Neural Inf. Process. Syst. 25, 1106–1114 (2012).
- [5] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533–536 (1986).
- [6] Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. Adv. in Neural Inf. Process. Syst. 19, 153–160 (2007).
- [7] Fukushima, K.: Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36(4), 193–202 (1980).
- [8] Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59–69 (1982).
- [9] Rumelhart, D.E., Zipser, D.: Feature discovery by competitive learning. Cogn. Sci. 9, 75–112 (1985).
- [10] Hyvärinen, A., Hoyer, P.O.: A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Res. 41(18), 2413–2423 (2001).
- [11] Ahalt, S.C., Krishnamurthy, A.K., Chen, P., Melton, D.E.: Competitive learning algorithms for vector quantization. Neural Netw. 3(3), 277–290 (1990).
- [12] LeCun, Y., Cortes, C.: The MNIST database of handwritten digits.
- [13] Markram, H., Lübke, J., Frotscher, M., Sakmann, B.: Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science 275(5297), 213–215 (1997).
- [14] Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. The MIT Press (1998).