Biologically Inspired Feedforward Supervised Learning for Deep Self-Organizing Map Networks

Takashi Shinozaki
CiNet, National Institute of Information and Communications Technology, 1-4 Yamadaoka, Suita, Osaka 565-0871, Japan
Graduate School of Information Science and Technology, Osaka University, 1-5 Yamadaoka, Suita, Osaka 565-0871, Japan
tshino@nict.go.jp
Abstract

In this study, we propose a novel deep neural network and a supervised learning method for it that uses a feedforward supervisory signal. The method is inspired by the human visual system and performs human-like association-based learning without any backward error propagation. The feedforward supervisory signal, which produces the correct result, precedes the target signal and associates its confirmed label with the classification result of the target signal. The method effectively uses the large amount of information carried by the feedforward signal and forms a continuous, rich learning representation. It is validated using visual recognition tasks on the MNIST handwritten digit dataset.

Keywords:
supervised learning, deep learning, self-organizing map

1 Introduction

A multilayered deep neural network is one of the most powerful methods for human-like recognition tasks, such as image [1] and speech recognition [2]. Previous studies have demonstrated excellent performance with supervised learning in signal classification tasks [3, 4]. Gradient-based learning rules, in particular back-propagation (BP) learning [5], are generally used for supervised learning in feedforward networks. However, the amount of supervisory information available at the last layer is not sufficient to supervise the entire deep neural network, because the information is selected and reduced from layer to layer. This problem is more serious in pattern discrimination tasks, where the information is limited to the discrete values of the discriminant label output. Bengio et al. proposed the stacked auto-encoder, which secures a sufficient amount of error information by reconstructing the input and training layer-wise [6]. However, layer-wise training proceeds step by step, which makes incremental learning and online updating difficult. Other studies have used unsupervised learning, which does not exploit prior information about the data structure, and reported self-organizing behavior and good discrimination results in very deep neural networks [1, 7]. However, unsupervised learning cannot control how input data are classified, which enlarges the network and lowers learning efficiency.

In this study, we propose a novel learning method for deep neural networks that uses feedforward-propagated supervisory signals. The method effectively uses the large amount of information carried by the feedforward-propagated supervisory signal, which enables robust learning in a deep neural network. It associates the classification of a new input with that of a pre-trained input, and revises the internal representation of the entire network. We validate the proposed learning method using numerical simulations of visual pattern discrimination tasks.

2 Network Model

The network model was inspired by the human visual system in the cortex. The network is composed of self-organizing map (SOM) modules [8]. Each SOM module consists of one hundred neurons and receives a subset of the outputs from the corresponding location in the previous layer. The connectivity is similar to the receptive field (RF) of recent convolutional neural networks, but it is not convolutional: there is no weight sharing among the modules within a layer.

Each neuron calculates an inner product between its weight and input as follows:

$$\mathbf{u}^{(l)}(t) = W^{(l)}\,\mathbf{y}^{(l-1)}(t) \qquad (1)$$

where $\mathbf{u}^{(l)}(t)$ is the vector of inner products of the $l$-th layer at time $t$, $W^{(l)}$ is the weight matrix, and $\mathbf{y}^{(l-1)}(t)$ is the output vector of the previous layer. The inner products are then processed using winners-share-all (WSA) regularization in each module. WSA is a variant of winner-takes-all (WTA) that involves not only the winning neuron but also its neighbors. The neuron with the most prominent inner product is selected as the winning neuron and outputs 1.0; neighboring neurons output a distance-decayed value determined by a Gaussian kernel. The output of the $l$-th layer is described as follows:

$$y_i^{(l)}(t) = \exp\!\left(-\frac{d_i^{\,2}}{2\sigma_{\mathrm{WSA}}^{2}}\right) \qquad (2)$$

where $y_i^{(l)}(t)$ is the $i$-th element of the output vector, and $d_i$ is the spatial distance from the winning neuron to the $i$-th neuron. The spatial decay factor $\sigma_{\mathrm{WSA}}$ was determined empirically.
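As a concrete illustration, the following minimal sketch implements Eqs. (1)-(2) for a single module in Python with NumPy. This is an assumption-laden sketch, not the paper's implementation (the experiments used custom C code): the 10×10 neuron grid layout and the names `module_forward` and `sigma_wsa` are ours, and the value of the decay factor is left unspecified, as in the text.

```python
import numpy as np

def module_forward(W, y_prev, sigma_wsa, grid=(10, 10)):
    """Forward pass of one SOM module: inner products (Eq. 1)
    followed by winners-share-all regularization (Eq. 2)."""
    u = W @ y_prev                          # Eq. (1): inner products
    win = np.argmax(u)                      # most prominent neuron wins
    # Spatial distances on the module's assumed 10x10 neuron grid.
    rows, cols = np.unravel_index(np.arange(W.shape[0]), grid)
    d = np.hypot(rows - rows[win], cols - cols[win])
    # Eq. (2): the winner (d = 0) outputs 1.0; neighbors output a
    # Gaussian-decayed value.
    y = np.exp(-d ** 2 / (2.0 * sigma_wsa ** 2))
    return y, win
```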

Fig. 1(a) shows a schematic of the network structure used in the experiment. As shown in Table 1, the first layer consists of 49 SOMs with 4,900 neurons in total, and each neuron receives a 6×6-pixel patch of the 28×28-pixel input image, which results in a total of 176,400 connections in the layer.

Figure 1: (a) Schematic of the network structure. (b) Learning representation with continuous features in the first layer generated by unsupervised pre-training.
Layer  Maps  # of Neurons  RF          Stride      # of Connections
1      7×7   4,900         6×6 pixels  4×4 pixels  176,400
2      5×5   2,500         3×3 maps    1×1 map     2,250,000
3      5×5   2,500         5×5 maps    1×1 map     6,250,000
4      5×5   2,500         5×5 maps    1×1 map     6,250,000
5      1×1   100           5×5 maps    1×1 map     250,000

Table 1: Network parameters.
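The connection counts in Table 1 follow from the other columns. The short check below (a sketch; the layer tuples simply restate Table 1) multiplies each layer's neuron count by its receptive-field size in inputs per neuron.

```python
# Each SOM module has 100 neurons; RF size is in inputs per neuron.
layers = [  # (layer, modules, neurons, inputs per neuron)
    ("1", 7 * 7, 4900, 6 * 6),        # 6x6-pixel RF
    ("2", 5 * 5, 2500, 3 * 3 * 100),  # 3x3 maps of 100 neurons each
    ("3", 5 * 5, 2500, 5 * 5 * 100),
    ("4", 5 * 5, 2500, 5 * 5 * 100),
    ("5", 1 * 1, 100,  5 * 5 * 100),
]
for name, modules, neurons, rf in layers:
    assert neurons == modules * 100   # 100 neurons per module
    print(f"layer {name}: {neurons * rf:,} connections")
# -> 176,400 / 2,250,000 / 6,250,000 / 6,250,000 / 250,000
```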

3 Pre-training

We use traditional unsupervised competitive learning for pre-training [8, 9]. It updates the weights of the most prominent neuron and its neighbors, and forms a two-dimensional spatial structure of template sets for the input pattern. The update rule is described as follows:

$$\Delta\mathbf{w}_i = \alpha\, g(d_i)\,\bigl(\mathbf{x} - \mathbf{w}_i\bigr) \qquad (3)$$

where $\mathbf{w}_i$ is the weight vector, i.e., the $i$-th row of the weight matrix $W$, $\mathbf{x}$ is the input vector to the module, $\alpha$ is the learning coefficient, and $g(d_i)$ is the Gaussian neighborhood kernel. As in traditional SOMs, the learning coefficient linearly decreases from 1.00 at the beginning to 0.00 at the end of learning, and the standard deviation of the Gaussian kernel also decreases, from 3.5 to 0.0. The weight vector is normalized by its L2-norm after every update. The method generates a spatially continuous feature map, similar to the map generated by topographic independent component analysis [10].
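A minimal NumPy sketch of this update rule, under the same 10×10 grid assumption as before (the name `competitive_update` and the handling of the $\sigma \to 0$ endpoint are our assumptions):

```python
import numpy as np

def competitive_update(W, x, win, alpha, sigma, grid=(10, 10)):
    """Competitive learning step (Eq. 3): pull the winner and its
    Gaussian neighborhood toward the input, then L2-normalize rows."""
    rows, cols = np.unravel_index(np.arange(W.shape[0]), grid)
    d = np.hypot(rows - rows[win], cols - cols[win])
    if sigma > 0.0:
        g = np.exp(-d ** 2 / (2.0 * sigma ** 2))    # neighborhood kernel
    else:
        g = (d == 0).astype(float)                  # sigma -> 0: WTA only
    W += alpha * g[:, None] * (x[None, :] - W)      # Eq. (3)
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # L2-normalize weights
    return W
```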

Pre-training was performed in a layer-wise manner, similar to a biological critical period. Initially, only the first layer was trained, for 2,500 iterations. Next, 2,500 iterations were applied to the first and second layers. Finally, the first four layers were processed, applying 10,000, 7,500, 5,000, and 2,500 iterations sequentially; one reading of this schedule is sketched below. The last layer was not pre-trained. Fig. 1(b) shows a typical example of the feature maps generated in the first layer.
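The schedule admits more than one reading; the sketch below encodes one plausible interpretation, in which `train_layers(depth, n_iter)` is a hypothetical helper that jointly trains the first `depth` layers for `n_iter` iterations.

```python
# One reading of the layer-wise pre-training schedule (hypothetical
# helper train_layers; the fifth layer is never pre-trained).
schedule = [
    (1, 2500),                # first layer alone
    (2, 2500),                # first and second layers
    (4, 10000), (4, 7500),    # first four layers, with a
    (4, 5000), (4, 2500),     # decreasing number of iterations
]
for depth, n_iter in schedule:
    train_layers(depth, n_iter)
```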


Figure 2: (a) Example of a correct label output for a clear image, which establishes a correct propagation 'path'. (b) Example of an incorrect label output for a difficult image. (c) The difficult image produces the required label output when guided by the after-effect of the clearer image. The shaded region retains the after-effect along the correct 'path' established by advance propagation.

4 Advance Propagation Learning

Advance propagation (AP) learning is a supervised learning method that enables a feedforward supervisory signal by exploiting the sparse dynamics of the network. It is based on learning vector quantization (LVQ) [11], but requires an additional advance input as a supervisory signal to specify the learning 'location'. Before the target input is processed, an advance input that produces the required classification label is propagated through the entire network. The target input is then processed under the after-effect of the advance input. The after-effect guides the propagation along the correct 'path' and specifies the learning 'location' in the network. The key point is that advance propagation does not restrict the propagation 'path'; it merely suggests it. It merges the various paths taken by various types of input into the same internal learning representation. The LVQ-like conditional learning that follows the target input specifies the learning 'direction', revising the weight vectors to produce the required label.

A learning trial proceeds as follows. First, the target input is processed by each module in the network, and the output label of the network is checked. If it corresponds to the required label, the weights of the activated units are updated by competitive learning as in Eq. (3) (Fig. 2(a)). Otherwise, the weights are updated in the direction opposite to Eq. (3) (Fig. 2(b)), and AP learning is evoked. First, the advance input that produces the required label output is processed by the same network, which results in the required label output at time $t-1$ (Fig. 2(c), left). Subsequently, the target input is processed again with the after-effect of the advance input (Fig. 2(c), right) as follows:

$$\tilde{\mathbf{x}}(t) = \mathbf{x}(t) + \beta\,\mathbf{x}_{\mathrm{AP}}(t-1) \qquad (4)$$

where $\beta$ is the ratio of the after-effect of the advance input. The vector $\tilde{\mathbf{x}}(t)$ represents the direction of the feature vector corrected by the after-effect of the advance input $\mathbf{x}_{\mathrm{AP}}(t-1)$. The important point is that the network behaves highly nonlinearly because of WSA, so the output is not equal to that produced by a linear summation of the two inputs. The subsequent competitive learning uses the combined input $\tilde{\mathbf{x}}(t)$ in the same manner as Eq. (3). Consequently, the full version of the update rule with multi-layer decay and a Gaussian kernel for the WSA output is described as follows:

$$\Delta\mathbf{w}_i^{(l)} = \pm\,\alpha\,\gamma^{\,l-1}\, g(d_i)\,\bigl(\tilde{\mathbf{x}}^{(l)}(t) - \mathbf{w}_i^{(l)}\bigr) \qquad (5)$$

where $\gamma$ is the decay coefficient from layer to layer and $L$ is the total number of layers. The weight vector is normalized by its L2-norm after every update, as in traditional competitive learning. Several values of $\gamma$ were compared in the experiments (Sec. 5).
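Putting Eqs. (3)-(5) together, one AP learning trial might look like the sketch below. Everything about the `net` interface (`forward`, `update_all`, `update_layer`, and the recorded after-effect) is our assumption; the sketch only fixes the control flow described above.

```python
def ap_learning_trial(net, x, label, x_adv, alpha, beta, gamma):
    """One AP learning trial: test the target input first, and fall
    back to advance propagation only when the output label is wrong."""
    if net.forward(x) == label:
        net.update_all(x, alpha, sign=+1.0)   # correct: Eq. (3)
        return
    net.update_all(x, alpha, sign=-1.0)       # wrong: LVQ-like anti-learning
    net.forward(x_adv)                        # advance propagation (t - 1);
                                              # leaves the after-effect
    net.forward(x, after_effect=beta)         # target again, mixed by Eq. (4)
    for l in range(net.n_layers):             # Eq. (5): per-layer update,
        net.update_layer(l, alpha * gamma ** l)   # decayed layer by layer
```

Note that with `gamma = 0` only the first layer (`l = 0`) is updated, which matches the "no learning in the upper layers" condition examined in Sec. 5.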

5 Experiments

To validate the proposed learning method, we performed a discrimination test on the MNIST handwritten digit dataset (10 digits, 28×28 pixels, grayscale) [12]. AP learning was applied to the pre-trained network. For each label, the best-matching output unit of the last layer in the pre-trained network was selected. Advance inputs, used as supervisory signals, were determined dynamically and updated from one trial to the next. Each input signal was first tested against its label, and AP learning was applied if the output label was incorrect.

One learning block consisted of 10,000 samples of the training dataset used for learning, followed by 10,000 samples of the validation dataset used only to calculate the error rate (see the sketch below). The calculations were performed on a workstation (Opteron 6366, 1.8 GHz) using custom C code with OpenMP parallelization.
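In outline, one learning block could be driven as follows. This is a sketch reusing `ap_learning_trial` from Sec. 4; `advance_input_for`, the dataset iterables, and all other names are assumptions rather than the paper's C implementation.

```python
def run_block(net, train_set, valid_set, alpha, beta, gamma):
    """One learning block: 10,000 training samples with AP learning,
    then 10,000 validation samples scored only for the error rate."""
    for x, label in train_set:
        x_adv = advance_input_for(label)      # dynamically chosen advance
        ap_learning_trial(net, x, label,      # input for this label
                          x_adv, alpha, beta, gamma)
    errors = sum(net.forward(x) != label for x, label in valid_set)
    return errors / len(valid_set)
```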

Fig. 3(a) shows the change in the error rate under AP learning. The initial error rate was that obtained by pre-training with unsupervised competitive learning alone. AP learning improved the error rate after 20 iterations over the entire training set with a non-zero decay coefficient $\gamma$. When the decay coefficient was zero, meaning that there was no learning in the upper layers, learning stopped at quite an early stage and the error rate remained high. When the coefficient was non-zero, learning proceeded over the entire network and the error rate became low. These results demonstrate that the proposed method effectively trains the entire network at the same time.

The initially scattered learning representation in the last layer was rearranged into a sparser and more efficient form (Fig. 3(b), upper). Simultaneously, the optimal stimuli of the representative neurons were modified into more generalized images (Fig. 3(b), lower).

Figure 3: (a) Discrimination error rate for each layer decay parameter $\gamma$. (b) Upper: usage of neurons in the last layer; brighter color indicates heavier use of the corresponding neuron. Ten representative neurons, corresponding to the respective digit labels, remained after AP learning. Lower: optimal stimuli of a typical representative neuron (which codes the digit '6') before and after AP learning.

6 Conclusion

In this paper, we introduced a novel supervised learning method for deep feedforward neural networks and validated its efficiency on a visual recognition task. The method focuses on using the rich input information of the early layers as the supervisory signal in each layer. We demonstrated that the proposed method can perform supervised fine-tuning of a pre-trained multilayered network (Fig. 3(a)). The learning method formed an effective learning representation with continuous features, and also drastically reorganized the representation in the later layers (Fig. 3(b)).

The proposed learning method is applied to the entire network concurrently, not layer by layer. Only a correct/incorrect signal is broadcast throughout the network, and each local module uses just this broadcast signal and locally propagated information, which is well suited to highly distributed parallel computing systems. Moreover, the method requires no back-propagated information, which drastically reduces memory usage. This is critical for processing extremely long sequences in deep recurrent networks, and could make the method attractive for long-sequence applications such as natural language processing.

The absence of back propagation also makes the method more biologically plausible than classical learning methods. The biological plausibility of error back propagation has often been questioned, and there is no evidence of its existence under physiological conditions. The proposed method uses only feedforward signals for association-based supervised learning, and the advance supervisory signal might correspond to Hebbian association within the time window of spike-timing-dependent plasticity (STDP) [13].

One of the interesting points of the proposed learning method is that it seamlessly incorporates both reinforcement learning [14] and competitive learning: reinforcement learning emerges if there is no advance input, and traditional competitive learning emerges if there is no correct/incorrect signal. This suggests that these learning methods could share the same hardware implementation, with the learning mode selected by the sequence of inputs and correct/incorrect signals. Moreover, the timing of the correct/incorrect signal can control which layers form the association, which might be useful for deeper or recurrent networks.

References

  • [1] Le, Q.V., Ranzato, M.A., Monga, R., Devin, M., Chen, K., Corrado, G.S., Dean, J., Ng, A.Y.: Building high-level features using large scale unsupervised learning. Proc. 29th Int. Conf. on Machine Learning. (2012).
  • [2] Dahl, G.E., Yu, D., Deng, L., Acero, A.: Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Trans. on Audio, Speech, and Lang. Process. 20(1), 30–42 (2012).
  • [3] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989).
  • [4] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. in Neural Inf. Process. Syst. 25, 1106–1114 (2012).
  • [5] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
  • [6] Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. Adv. in Neural Inf. Process. Syst. 19, 153–160 (2007).
  • [7] Fukushima, K.: Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biol. Cybern. 36(4), 193–202 (1980).
  • [8] Kohonen, T.: Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59–69 (1982).
  • [9] Rumelhart, D.E., Zipser, D.: Feature discovery by competitive learning. Cogn. Sci. 9, 75–112 (1985).
  • [10] Hyvärinen, A., Hoyer, P.O.: A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Res. 41(18), 2413–2423 (2001).
  • [11] Ahalt, S.C., Krishnamurthy, A.K., Chen, P., Melton, D.E.: Competitive learning algorithms for vector quantization. Neural Netw. 3(3), 277–290 (1990).
  • [12] LeCun, Y., Cortes, C.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
  • [13] Markram, H., Lübke, J., Frotscher, M., Sakmann, B.: Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science 275(5297), 213–215 (1997).
  • [14] Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, The MIT Press (1998).